2024-11-26
I know a little about coding. Enough to sometimes brute force a small quality-of-life tweak into someone else’s code, or write a janky-as-heck shell script. But not enough to build anything substantial from the ground up.
So to fill this knowledge hole, I’ve been reading a few introductions to Python. And while learning in the abstract is a start, I didn’t have a project.
Until my research workflow fell over and I had to spend a couple of afternoons in vim writing a replacement: a simple wrapper around Pandoc to make archiving webpages easier. I called it Jones, after Indiana. Because he was into archeology. And research is a kind of archeology. Anyway.
I’m putting the full code here as the internet equivalent of sticking it on my fridge. That way I can smile at its crudeness whenever I put away the milk. It’s also on GitHub. Yeah, I’ve got one of those.
#!/usr/bin/env python3
# research archiving with pandoc
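# usage: jones <url>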
import re
import requests
import bs4
import subprocess
import sys
import os
from pathlib import Path
# fail if no url
if len(sys.argv) < 2:
    print("no url")
    sys.exit(1)
# path setup
archive_path = Path.home() / "ark"
if archive_path.is_dir():
    pass
else:
    archive_path.mkdir()
    print("archive path created")
print("Getting page info...")
# get url
url = sys.argv[1]
res = requests.get(url)
# get title
soup = bs4.BeautifulSoup(res.text, "html.parser")
page_title = soup.title.string
# format title (name.of.article.txt)
word_title = re.sub("\\W", "_", page_title)
short_title = word_title[:36]
dedup_title = re.sub("_{2,}", "_", short_title)
clip_title = dedup_title.rstrip("_")
actual_title = clip_title.lower() + ".txt"
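# e.g. a made-up title like "Why Archive? (Part 2)" comes out as "why_archive_part_2.txt"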
# build destination path
title_path = Path(actual_title)
destination = archive_path / title_path
# pandoc command
print("Running Pandoc...")
subprocess.run(["pandoc", "-f", "html", "-t", "plain", url, "-o", destination])
# append to archive list
catalog_path = archive_path / "ark.catalog"
with open(catalog_path, "a") as file:
    file.write(url + "\n")
Last time, I wanted to convert the list of links exported from Pocket into full-text copies. That turned out to be pretty easy--I just needed htmlq. Then, to get a plain list of links, I ran this in Git Bash on Windows:
cat export.html | htmlq --attribute href a | sed '/^http/!d' > pocket.txt
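If htmlq isn't around, the same extraction can be done with the BeautifulSoup library Jones already leans on. A rough sketch, assuming the Pocket export is the same export.html as above:
# pull every href out of the anchor tags in the Pocket export
import bs4
with open("export.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")
with open("pocket.txt", "w") as out:
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http"):
            out.write(a["href"] + "\n")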
And ran the list through Jones:
while IFS='' read -r line; do
    jones "$line"
done < pocket.txt
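The same loop works from Python, too, with the subprocess module Jones itself uses. A minimal sketch, assuming the script sits on the PATH as jones:
# run the jones script once per link in pocket.txt
import subprocess
with open("pocket.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            subprocess.run(["jones", url])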