Here’s a follow-up to my earlier post on archiving. I spent a couple of days writing a quick Python app to fill my needs.
Here it is: web2pdf.
Once installed, the script simply expects a bookmarks.html file on the filesystem. It reads it, stores the contents in an SQLite DB, and starts saving PDF versions of each link therein.
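The ingest step can be sketched with nothing but the standard library. This is a minimal, hypothetical version, not the actual web2pdf code: it assumes a browser-exported bookmarks.html (plain anchor tags) and a made-up `bookmarks` table with `url`, `title`, and `status` columns.

```python
import sqlite3
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collects (url, title) pairs from <a href="..."> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href:  # text inside the current anchor
            self.links.append((self._href, data.strip()))
            self._href = None

def ingest(html_text, db_path=":memory:"):
    """Parse bookmarks.html text and load the links into an SQLite DB."""
    parser = BookmarkParser()
    parser.feed(html_text)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS bookmarks "
                 "(url TEXT PRIMARY KEY, title TEXT, "
                 "status TEXT DEFAULT 'pending')")
    # INSERT OR IGNORE makes re-running the ingest step idempotent
    conn.executemany("INSERT OR IGNORE INTO bookmarks (url, title) VALUES (?, ?)",
                     parser.links)
    conn.commit()
    return conn
```

The `url` primary key plus `INSERT OR IGNORE` is what lets you re-read the same bookmarks file without duplicating rows.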
You can kill the script and re-run it later and it will continue where it stopped. The output looks like this:
(pdf) bash-4.3 ~/code/web2pdf/web2pdf$ ./web2pdf.py
Found 2599 links in the bookmark file
Found 2599 rows in the bookmark db
..of which 81 links are already saved
..and 2506 are pending
Hit enter to start downloading pending PDFs
Downloading https://www.quantamagazine.org/20170207-bell-test-quantum-loophole/ | experiment-reaffirms-quantum-weirdness
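The resume behaviour behind those summary lines boils down to a couple of queries against the same hypothetical `bookmarks` table with a `status` column (this is a sketch, not the repo's actual code):

```python
import sqlite3

def summary(conn):
    """Counts matching the console output: total rows, already saved, pending."""
    def count(status):
        return conn.execute("SELECT COUNT(*) FROM bookmarks WHERE status = ?",
                            (status,)).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM bookmarks").fetchone()[0]
    return total, count("saved"), count("pending")

def pending_urls(conn):
    """URLs still waiting for a PDF; a re-run starts from these."""
    return [u for (u,) in conn.execute(
        "SELECT url FROM bookmarks WHERE status = 'pending' ORDER BY url")]
```

Because the status lives in the DB rather than in memory, killing the script loses nothing: the next run just asks for the pending rows again.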
There are more details in the GitHub link. I’m not happy with how slow it is, but I seem to be limited by the library I’m using and the use case itself: fetching a page is trivial, but the page has to be rendered before it can be exported.
As always there is more to do, but it works pretty well already. It tags failed bookmarks separately in the DB in case they need retrying later. I’ve tried to speed it up using Python 3’s native async/await, but the performance improvements are not noticeable so far. I’ll try multiprocessing instead and commit whichever one works better.
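The retry bookkeeping can be sketched as two small updates against the same hypothetical `bookmarks` table (again an illustration, not the repo's actual code): failures are tagged rather than deleted, so a later run can flip them back to pending.

```python
import sqlite3

def mark(conn, url, ok):
    """Tag a bookmark 'saved' on success, 'failed' otherwise (kept for retries)."""
    conn.execute("UPDATE bookmarks SET status = ? WHERE url = ?",
                 ("saved" if ok else "failed", url))
    conn.commit()

def retry_failed(conn):
    """Flip failed rows back to pending so the next run picks them up again."""
    conn.execute("UPDATE bookmarks SET status = 'pending' "
                 "WHERE status = 'failed'")
    conn.commit()
```

Keeping the failed rows around also means you get a free audit trail of which links couldn't be rendered.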