Adventures in web archiving

From 2012 to 2018 I worked at Canadian Business magazine. The publishing business itself was crumbling around us, but I still loved the work and the magazine we were putting out. The dwindling budgets fostered a foxhole spirit among the remaining staff; we had to get creative to keep it all going.

The magazine all but folded in 2017, when Rogers Media ended its print edition. The site remained online to host a few leftover sponsored editorial packages and a trickle of CP wire stories. The brand was sold to St. Joseph’s Publishing in 2019.

A few months ago we learned that St. Joe’s was going to relaunch the magazine, in print and online. Good news! Then, a few days ago, I heard through the grapevine that the new site probably wasn’t going to preserve the existing content. Bad news!

I’d known this day would come sooner or later, and I get that the company needs a fresh start to build something new. But I also didn’t want something that I’d worked on for years to be obliterated outright. So, what to do?

Introduction to web scraping

The site is already fairly well-indexed in the Internet Archive’s Wayback Machine. But it’s hard to get a good sense from the outside how comprehensive the archive is, and how systematic the crawl was. So I decided to look at how to create my own.

My goal was (and is) to ensure that the Wayback Machine has a complete record of the site as it existed in the last days before its 2021 relaunch.

A few years ago I ran across Archive Team, a loose network of “rogue archivists” who do huge scrapes of endangered web content, mostly to preserve it through the Internet Archive (although they have an arm’s-length relationship to the IA itself). Archive Team maintains a list of software for general-purpose web scraping. But the Wayback Machine only ingests Web Archive (.warc) files, which meant straying a little farther into the tall grass.

I ended up installing grab-site, which will crawl a website and produce a series of WARC files containing the results. I started scraping the site on the evening of Monday, October 4, 2021, and it ran for about 72 hours. It traversed 516,858 URLs, and downloaded 26 gigabytes total. A little extra searching around led me to some helpful directions for uploading large web archives to the Internet Archive.
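For anyone following along, the basic workflow is short enough to sketch. The commands below are a reconstruction from grab-site’s README and the internetarchive uploader’s documentation rather than a transcript of exactly what I ran; the URL and the archive.org item identifier are placeholders.

```shell
# Both tools are command-line Python programs. (grab-site's README
# recommends installing it into its own virtualenv.)
pip install grab-site internetarchive

# Crawl the site. grab-site writes rolling .warc.gz files into a new
# directory named after the domain and the crawl's start time.
grab-site --no-offsite-links https://example.com/

# When the crawl finishes, push the WARCs to an archive.org item.
# "cb-site-archive-2021" is a hypothetical item identifier.
ia configure    # one-time prompt for archive.org credentials
ia upload cb-site-archive-2021 example.com-*/*.warc.gz \
    --metadata="mediatype:web"
```

If I remember right, grab-site also ships a dashboard (started with `gs-server`) for watching a crawl in progress, which is handy over a 72-hour run.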

Those WARC files are now freely available on the Internet Archive. The relaunched site went live on the morning of Thursday, October 7. I think it’s a very good looking magazine. Its sitemap lists 27 posts.

Now what?

I’m not totally sure what the next step is. Apparently only approved accounts can put web archives into the Wayback Machine. I’ll need to contact someone at the Internet Archive or on Archive Team to see what it takes.

An important caveat: I have no idea whether I did this right. I followed the documentation and I can see that I created a big pile of data, but given the short time frame and the fairly steep learning curve, it’s hard to know whether what I produced is actually fit for purpose. It is absolutely possible that I’ve produced something that cannot ever be inserted into the Wayback Machine; I just don’t know yet.

I opted to scrape first and ask questions later, and I think on balance that was still the right choice. But it means I’m still figuring out exactly what the necessary steps are. To be continued!
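In the meantime, one low-tech way to get at least a rough sense of the output is to tally the record types declared in the WARC headers with standard Unix tools. A sketch, assuming GNU grep and gzip, with `example.com-*` standing in for grab-site’s output directory:

```shell
# Tally WARC record types across the crawl output. Every WARC record
# begins with a header block that includes a "WARC-Type:" line, so
# counting those lines gives a quick census of what the crawl captured.
# (grep -a forces text mode, since the decompressed stream also
# contains binary payloads.)
zcat example.com-*/*.warc.gz | grep -a '^WARC-Type:' | sort | uniq -c
```

A healthy crawl should show large, roughly matching counts of request and response records; for real validation, purpose-built tools like warcio are the better bet.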

Stray thoughts and observations

  • The software ecosystem for web archiving feels very wild and woolly. The tooling, largely command-line Python programs, can be somewhat arcane and intimidating. Someone of my experience level can puzzle through it, but there’s no hand-holding here.
  • It’s difficult, as a newcomer, to inspect your work and get a sense of whether you’re doing it right. This is partly due to the industrial-grade nature of the tooling; it’s also due to the scale you have to work at for even a modest-sized site — potentially hundreds of thousands of URLs and gigabytes of data. It’s hard to do a dry run and make sense of the resulting output with any confidence.
  • The Internet Archive itself strikes me as curiously silent on citizen contributions to the Wayback Machine. You can save a single page at a time through the web interface but there’s otherwise very little official guidance on this type of workflow.
  • All of the above makes this whole enterprise pretty intimidating to get into, which is a shame. I think further democratizing this ecosystem and lowering the barriers to entry might be healthy overall. But obviously it’s hugely reliant on volunteers who are likely already stretched thin. That’s why I chose to donate to the Internet Archive this week. None of my minor gripes about the ergonomics of this process change the fact that the IA does absolutely essential work.