How I Archive Webpages Linked To From My Website

In some ways, my website is a little knowledge base for myself. Of course, it is also a place to share things with others. But I might well be the most frequent visitor of my own website.

I post links to articles and podcasts I found worth reading and listening to on my Links blog. As with the website, I created the blog as much to share with others as for myself, to have a record I can revisit.

But the internet is a transient place. Websites do vanish regularly. What if the article I want to reread or the podcast I want to listen to again is not available any more? Untenable! I decided to archive any page I link to from my website.

Note: In many cases, if a website is not available anymore, chances are high that you can find it on the Internet Archive. If you do not know it, check it out; it's awesome. But I did not want to rely on them. I just want to have a copy on my local machine.

To build such an archive, I reached for the power of command line programs and the composition of them. My plan was to:

  1. Search my static website directory for any kind of URL
  2. Download the websites of those URLs found

Sounds easy enough, but as usual there are more edge cases than one thinks of at the start. Probably the biggest one: what happens if the webpage behind a URL changes (or is taken offline entirely)? If I simply ran the above steps again and again, at some point I would overwrite an archived webpage with one that no longer represents the page I originally linked to. To avoid that, I decided to download each webpage only once. Here is a more detailed list of steps:

  1. Search my static website directory for any kind of URL using ripgrep and a regex
  2. Check the existing archive for URLs whose webpages have already been downloaded
  3. Remove the URLs from step 2 from the list of URLs found in step 1 (a small comm illustration follows this list)
  4. Download the webpages for the remaining URLs
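
To make the filtering in step 3 concrete, here is a small standalone illustration of how comm behaves; the URLs are made up for this example:

# Both input lists must be sorted. -1 drops URLs that only exist in the
# archive, -3 drops URLs present in both lists, so only https://c.example
# remains: the one URL that still needs to be downloaded.
comm -1 -3 \
    <(printf '%s\n' "https://a.example" "https://b.example") \
    <(printf '%s\n' "https://a.example" "https://b.example" "https://c.example")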

There are many tools to download websites. After a detour with wget, I settled on monolith, mainly because it looked simple to use and conveniently produces a single file containing all assets of a webpage:

"Unlike the conventional “Save page as”, monolith not only saves the target document, it embeds CSS, image, and JavaScript assets all at once, producing a single HTML5 document that is a joy to store and share.

If compared to saving websites with wget -mpk, this tool embeds all assets as data URLs and therefore lets browsers render the saved page exactly the way it was on the Internet, even when no network connection is available."
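
Just to give an idea of the basic usage (assuming monolith is installed, for example via cargo install monolith), saving a single page looks roughly like this; the URL and output file name are placeholders:

# Download one page and embed its CSS and images into a single HTML file
monolith "https://example.com/some-article" --output some-article.html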

In my current setup, I download every webpage into a single directory. The name of each file is the encoded URL (I use urlencode). The URL needs to be encoded because under Unix the / character cannot be part of a file name.
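
As an illustration, encoding and decoding with urlencode looks roughly like this (the exact output may differ between urlencode implementations):

urlencode "https://example.com/blog/post"
# -> https%3A%2F%2Fexample.com%2Fblog%2Fpost

urlencode -d "https%3A%2F%2Fexample.com%2Fblog%2Fpost"
# -> https://example.com/blog/post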

Here is the whole script I wrote:

#!/bin/bash
set -e
cache_dir="$HOME/.cache/webpage-archiver"
mkdir -p "$cache_dir"
mkdir -p archive

function info () {
    printf "\n%s %s\n" " -" "$*" >&2;
}

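# Turn URL-encoded file names from the archive back into plain URLs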
function urldecode () {
    while read -r data; do
        urlencode -d "$data"
    done
}

info "Searching site for ULRs"
rg \
    --only-matching \
    --no-line-number \
    --no-filename \
    --glob="*.html" \
    "https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)" \
    site/ |
    rg . | # remove empty lines
    sort |
    uniq > "$cache_dir/links.txt"

info "Get existing URLs from archive"
ls archive/ |
    urldecode |
    sort > "$cache_dir/links-archived.txt"

info "Filter URLs which are already in archive"
# comm - Compare two sorted files line by line.
# -1    Suppress lines unique to FILE1
# -3    Suppress lines that appear in both files
comm -1 -3 \
    "$cache_dir/links-archived.txt" \
    "$cache_dir/links.txt" > "$cache_dir/links-to-be-archived.txt"

info "Archive URLs"
total=$(cat "$cache_dir/links-to-be-archived.txt" | wc -l)
count=0
while read -r url; do
    count=$((count + 1))
    info "$count/$total: Archive $url"
    # monolith - CLI tool for saving web pages as a single HTML file
    # --isolate    Cut off document from the Internet
    #
    # Might be useful:
    # --no-audio
    # --no-video
    # --ignore-errors
    # --silent
    monolith \
        --no-fonts \
        --no-js \
        --no-frames \
        --isolate \
        --output "archive/$(urlencode "$url")" \
        "$url"
done < "$cache_dir/links-to-be-archived.txt"

After executing the script, I have my personal archive of webpages I have linked to from my blog. It is just a directory containing a single file for each webpage, which makes it easily searchable.
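
For example, a full-text search over the archive that maps matching files back to their original URLs could look roughly like this (the search term is just an example):

# List archived pages that mention a term and print their original URLs
rg --files-with-matches --ignore-case "design thinking" archive/ |
    while read -r file; do
        urlencode -d "$(basename "$file")"
    done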

During one of my initial runs, the script failed because a webpage was not available any more (it is possible to ignore failures with the --ignore-errors option for monolith). I was too late! The webpage was a post by Figma about Design Thinking. Previously, the page could be accessed via this URL. The Internet Archive does not have a copy of that page (at least under this URL), but fortunately, Figma only moved the page to this URL (for which the Internet Archive does have a copy). Maybe I should automate the submission of webpages to the Internet Archive?
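
If I ever get around to that, a rough sketch could be to request the Wayback Machine's Save Page Now endpoint for every newly found URL, for example with curl. This is only a sketch, not part of the script above, and the endpoint may throttle requests or change its behavior:

# Ask the Wayback Machine to capture each page (sketch, not part of the script above)
while read -r url; do
    curl --silent --output /dev/null "https://web.archive.org/save/$url"
    sleep 5 # be gentle with the endpoint
done < "$cache_dir/links-to-be-archived.txt"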

The above script was the easiest way I could come up with to automate webpage archiving. Maybe you know of an even easier way. If so, I would love to hear about it. Reach out to me!

Until then, happy archiving.

Published 2024-11-25 · Last updated 2024-11-28