Michael Buckley

Pythonesque

2024-11-26

I know a little about coding. Enough to sometimes brute force a small quality-of-life tweak into someone else’s code, or write a janky-as-heck shell script. But not enough to build anything substantial from the ground up.

So to fill this knowledge hole, I’ve been reading a few introductions to Python. And while learning in the abstract is a start, I didn’t have a project.

Until my research workflow fell over and I had to spend a couple afternoons in vim writing a replacement: a simple wrapper around Pandoc to make archiving webpages easier. I called it Jones. Because he was into archeology. And research is a kind of archeology. Anyway.

I’m putting the full code here as the internet equivalent of sticking it to my fridge. That way I can smile at its crudeness whenever I put away the milk. It’s also on GitHub. Yeah, I’ve got one of those.

#!/usr/bin/env python3
# research archiving with pandoc
import re
import requests
import bs4
import subprocess
import sys
import os

from pathlib import Path

# fail if no url
if len(sys.argv) < 2:
    sys.exit("usage: jones URL")

# path setup
archive_path = Path.home() / "ark"

if not archive_path.is_dir():
    archive_path.mkdir()
    print("archive path created")

print("Getting page info...")

# get url
url = sys.argv[1]
res = requests.get(url)
res.raise_for_status()

# get title
soup = bs4.BeautifulSoup(res.text, "html.parser")
page_title = soup.title.string if soup.title and soup.title.string else "untitled"

# format title (name_of_article.txt)
word_title = re.sub(r"\W", "_", page_title)
short_title = word_title[:36]
dedup_title = re.sub("_{2,}", "_", short_title)
clip_title = dedup_title.rstrip("_")
actual_title = clip_title.lower() + ".txt"

# build destination path
title_path = Path(actual_title)
destination = archive_path / title_path

# pandoc command
print("Running Pandoc...")
subprocess.run(["pandoc", "-f", "html", "-t", "plain", url, "-o", destination], check=True)

# append to archive list
catalog_path = archive_path / "ark.catalog"

with open(catalog_path, "a") as file:
    file.write(url + "\n")
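For future me: here’s the title mangling from the middle of the script, pulled into one function so I can see what it actually does to a sample title. The function name is mine, not part of the script.

```python
import re

def slugify(page_title):
    # mirror the script: non-word chars -> _, cap at 36 chars,
    # squash runs of _, trim trailing _, lowercase, add .txt
    word_title = re.sub(r"\W", "_", page_title)
    short_title = word_title[:36]
    dedup_title = re.sub("_{2,}", "_", short_title)
    clip_title = dedup_title.rstrip("_")
    return clip_title.lower() + ".txt"

print(slugify("Hello, World! An Intro to Python"))
# hello_world_an_intro_to_python.txt
```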

Bonus: Un-Pocket

Last time, I wanted to convert the list of links exported from Pocket into full-text copies. That turned out to be pretty easy: I just needed htmlq. Then to get an unformatted list of links, I ran this in git-bash on Windows:

cat export.html | htmlq --attribute href a | sed '/^http/!d' > pocket.txt

And ran the list through Jones:

while IFS='' read -r line; do
    jones "$line"
done < pocket.txt
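As an aside, the htmlq step could also be sketched with bs4, which Jones already depends on. The function name and the sample markup below are mine, just to show the shape of it; point it at the contents of export.html for the real thing.

```python
import bs4

def pocket_links(html):
    # pull every http(s) link out of a Pocket export's anchor tags
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")]

sample = ('<ul><li><a href="https://example.com/a">A</a></li>'
          '<li><a href="#top">skip</a></li></ul>')
print(pocket_links(sample))
# ['https://example.com/a']
```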