Adventures in web archiving

From 2012 to 2018 I worked at Canadian Business magazine. The publishing business itself was crumbling around us, but I still loved the work and the magazine we were putting out. The dwindling budgets fostered a foxhole spirit among the remaining staff; we had to get creative to keep it all going.

The magazine all but folded in 2017, when Rogers Media ended its print edition. CanadianBusiness.com remained online to host a few remaining sponsored editorial packages and a trickle of CP wire stories. The brand was sold to St. Joseph’s Publishing in 2019.

A few months ago we learned that St. Joe’s was going to relaunch the magazine, in print and online. Good news! Then, a few days ago, I heard through the grapevine that the new CanadianBusiness.com probably wasn’t going to preserve the existing site content. Bad news!

I’d known this day would come sooner or later, and I get that the company needs a fresh start to build something new. But I also didn’t want something that I’d worked on for years to be obliterated outright. So, what to do?

Introduction to web scraping

The site is already fairly well-indexed in the Internet Archive’s Wayback Machine. But it’s hard to get a good sense from the outside how comprehensive the archive is, and how systematic the crawl. So I decided to look at how to create my own.

My goal was (and is) to ensure that the Wayback Machine has a complete record of CanadianBusiness.com as it existed in the last days before its 2021 relaunch.

A few years ago I ran across Archive Team, a loose network of “rogue archivists” who do huge scrapes of endangered web content, mostly to preserve it through the Internet Archive (although they have an arm’s-length relationship to the IA itself). Archive Team maintains a list of software for general-purpose web scraping. But the Wayback Machine only ingests Web Archive (.warc) files, which meant straying a little farther into the tall grass.

I ended up installing grab-site, which crawls a website and produces a series of WARC files containing the results. I started scraping the site on the evening of Monday, October 4, 2021, and the crawl ran for about 72 hours. It traversed 516,858 URLs and downloaded 26 gigabytes in total. A little extra searching around led me to some helpful directions for uploading large web archives to the Internet Archive.
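For a rough sense of scale, the crawl figures above can be turned into averages. This is only back-of-the-envelope arithmetic: it assumes the run lasted exactly 72 hours and that "26 gigabytes" means decimal gigabytes, neither of which is precise, and the function name `crawl_averages` is made up for this sketch.

```python
def crawl_averages(urls: int, hours: float, gigabytes: float) -> tuple:
    """Return (URLs per second, average kilobytes per URL) for a crawl.

    Assumes decimal units: 1 GB = 1,000,000,000 bytes, 1 KB = 1,000 bytes.
    """
    seconds = hours * 3600
    urls_per_second = urls / seconds
    kilobytes_per_url = (gigabytes * 1_000_000_000) / urls / 1_000
    return urls_per_second, kilobytes_per_url


# Figures quoted in the post; the 72 hours is an approximation.
rate, size = crawl_averages(urls=516_858, hours=72, gigabytes=26)
print(f"about {rate:.1f} URLs per second, about {size:.0f} KB per URL")
```

In other words, the crawler was fetching roughly two URLs every second for three days, at an average of around 50 KB each.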

Those WARC files are now freely available on archive.org.

CanadianBusiness.com officially relaunched on the morning of Thursday, October 7. I think it’s a very good-looking magazine. Its sitemap lists 27 posts.

Now what?

I’m not totally sure what the next step is. Apparently only approved accounts can put web archives into the Wayback Machine. I’ll need to contact someone at the Internet Archive or on Archive Team to see what it takes.

An important caveat: I have no idea whether I did this right. I followed the documentation and I can see that I created a big pile of data, but given the short time frame and the fairly steep learning curve, it’s hard to know whether what I produced is actually fit for purpose. It is absolutely possible that I’ve produced something that cannot ever be inserted into the Wayback Machine; I just don’t know yet.
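One cheap sanity check that doesn't need anyone's approval is to confirm that each WARC file at least parses from start to finish, and to count what kinds of records it holds. Below is a minimal sketch in Python using only the standard library. It is not how grab-site or the Wayback Machine validate anything; it assumes plain WARC records (a `WARC/` version line, header lines, a blank line, then `Content-Length` bytes of payload), optionally gzip-compressed, and it ignores finer points of the format such as folded header lines. The function names and the file name in the usage example are made up for illustration.

```python
import gzip
from collections import Counter
from typing import BinaryIO, Iterator


def iter_warc_headers(stream: BinaryIO) -> Iterator[dict]:
    """Yield the header fields (lower-cased names) of each WARC record."""
    while True:
        line = stream.readline()
        if not line:
            return  # clean end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line[:40]!r}")

        # Read "Name: value" header lines up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Skip over the payload without holding it all in memory.
        remaining = int(headers["content-length"])
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                raise ValueError("file ended in the middle of a record")
            remaining -= len(chunk)

        yield headers


def summarize_warc(path: str) -> Counter:
    """Count records by WARC-Type in a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return Counter(
            headers.get("warc-type", "unknown")
            for headers in iter_warc_headers(stream)
        )


if __name__ == "__main__":
    # Hypothetical file name; substitute one of the files the crawl produced.
    print(summarize_warc("example-00000.warc.gz"))
```

A file that runs through this without raising an error, and reports a plausible mix of `request` and `response` records, is at least structurally sound. Whether it is acceptable to the Wayback Machine is a separate question that this check can't answer.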

I opted to scrape first and ask questions later, and I think on balance that was still the right choice. But it means I’m still figuring out exactly what the necessary steps are. To be continued!

Stray thoughts and observations