August 24, 2015

Writing WARCs

We started doing deduplication four years before we started using WARC. As the ARC format had no revisit concept, the only record of the deduplicated items from that era lies in the crawl logs.

When we put our collection online back in 2009 we built our own indexer that consumed these crawl logs so we could include these items. It worked very well at the time.

As our collection grew, we eventually hit scaling issues with the custom indexer. Ultimately, it was abandoned when we upgraded to Wayback 1.8 two years ago. We moved to a CDX index instead. The only casualties were the early deduplication/revisit records from our ARC days, which were no longer included.

So, for the last two years, I've had a task on my to-do list: create a program that consumes all of those crawl logs and spits out WARC files containing the revisit records that are missing from the ARC files.

For a while I toyed with the idea of incorporating this in a wider ARC to WARC migration effort. But it's become clear that such a migration just isn't worthwhile. At least not yet.

Recently I finally made some time to address this. I figured it shouldn't take much time to implement. Basically, the program needs to do two things:

  1. Ingest and parse a crawl.log file from Heritrix, making it possible to iterate over its lines and access specific fields. As the DeDuplicator already does this, it was a minor task to adapt the code.
  2. For each line that represented a deduplicated URI, write a revisit record in a WARC file.
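
To make step 1 concrete, here is a minimal sketch in Python (the DeDuplicator code mentioned above is Java; this is an illustration, not that code). It assumes the usual twelve whitespace-separated crawl.log columns, with the digest in the tenth column and the annotations in the twelfth, and it assumes that deduplicated URIs carry an annotation containing `duplicate:`. The names `CrawlLogLine`, `parse_line` and `is_duplicate` are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class CrawlLogLine(NamedTuple):
    """The crawl.log fields a revisit record would need (illustrative subset)."""
    timestamp: str
    status: int
    size: str
    uri: str
    digest: str
    annotations: str


def parse_line(line: str) -> Optional[CrawlLogLine]:
    """Parse one crawl.log line, or return None if it does not look valid.

    Assumed column order: timestamp, status, size, URI, discovery path,
    referrer, MIME type, thread, fetch time, digest, source tag, annotations.
    """
    fields = line.split(None, 11)  # at most 12 fields; the last keeps the rest
    if len(fields) < 12:
        return None
    try:
        status = int(fields[1])
    except ValueError:
        return None
    return CrawlLogLine(
        timestamp=fields[0],
        status=status,
        size=fields[2],
        uri=fields[3],
        digest=fields[9],
        annotations=fields[11].strip(),
    )


def is_duplicate(entry: CrawlLogLine) -> bool:
    """Assumption: deduplicated URIs are marked with a 'duplicate:' annotation."""
    return "duplicate:" in entry.annotations


def duplicates(lines: Iterable[str]) -> Iterator[CrawlLogLine]:
    """Yield only the parsed lines that represent deduplicated URIs."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and is_duplicate(entry):
            yield entry
```

Iterating over `duplicates(open("crawl.log"))` would then give exactly the lines that step 2 needs to turn into revisit records.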

Boom, done. Can knock that out in an hour, easy.

Or, I should be able to.

It turns out that writing WARCs is painfully tedious. The libraries require you to understand the structure of a WARC file very well and do nothing to guide you. There is also no useful documentation on this subject. Your best bet is to find code examples, but even those can be hard to find and may not directly address your needs.

I tried both the webarchive-commons library and JWAT. Both were tedious, but JWAT was less so. Both require you to understand exactly what header fields need to be written for each record type, to know that you need to write a warcinfo record first, and so on. At least JWAT made it fairly simple to configure the header fields.
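
To give a feel for what "knowing the header fields" means, here is a rough sketch in plain Python (deliberately not using JWAT or webarchive-commons) that serializes a single revisit record by hand. The function name `build_revisit_record` and its parameters are invented for this example. It assumes the identical-payload-digest profile from WARC 1.0, the `WARC-Refers-To-Target-URI` and `WARC-Refers-To-Date` fields for URI agnostic revisits, and an empty record block, since a crawl log does not contain the original response headers.

```python
import uuid
from typing import Optional

# Profile URI for "same payload digest" revisits in WARC 1.0.
REVISIT_PROFILE = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"


def build_revisit_record(
    target_uri: str,
    warc_date: str,
    payload_digest: str,
    refers_to_uri: Optional[str] = None,
    refers_to_date: Optional[str] = None,
) -> bytes:
    """Serialize one revisit record with an empty block (illustrative only)."""
    headers = [
        ("WARC-Type", "revisit"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", warc_date),  # e.g. 2015-08-24T10:00:00Z
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Payload-Digest", payload_digest),
        ("WARC-Profile", REVISIT_PROFILE),
    ]
    # Optional pointers to the original capture, if the caller knows them.
    if refers_to_uri:
        headers.append(("WARC-Refers-To-Target-URI", refers_to_uri))
    if refers_to_date:
        headers.append(("WARC-Refers-To-Date", refers_to_date))
    # Content-Length is mandatory even when the block is empty.
    headers.append(("Content-Length", "0"))

    lines = ["WARC/1.0"] + ["%s: %s" % (name, value) for name, value in headers]
    header_block = "\r\n".join(lines) + "\r\n\r\n"  # blank line ends the header
    record_end = "\r\n\r\n"  # two CRLFs follow the (empty) block
    return (header_block + record_end).encode("utf-8")
```

Even in this toy version, every field name, the profile URI and the exact line endings have to be right, and nothing in the code stops you from getting them wrong.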

Consulting both existing WARC files and the WARC spec, I was able to put all the pieces together in about half a day using JWAT.

And that's when I realized that JWAT does not track the number of bytes written out. That means I can't split the writing up to make "even sized" WARCs like Heritrix does (usually 1 GiB/file).
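
The missing piece is simple enough to sketch. The following Python illustration (not JWAT or webarchive-commons code; the class name `RollingWarcWriter` and the file naming are made up) compresses each record as its own gzip member, counts the bytes that actually land on disk, and starts a new file before the size limit would be exceeded. A real writer would also begin each new file with a warcinfo record, which is left out here.

```python
import gzip
import os


class RollingWarcWriter:
    """Write gzipped records, rolling to a new file at roughly max_bytes."""

    def __init__(self, directory: str, prefix: str, max_bytes: int = 1_000_000_000):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.paths = []          # every file created so far
        self.bytes_written = 0   # compressed bytes in the current file
        self._out = None

    def _roll(self) -> None:
        """Close the current file (if any) and open the next one."""
        if self._out is not None:
            self._out.close()
        name = "%s-%05d.warc.gz" % (self.prefix, len(self.paths))
        path = os.path.join(self.directory, name)
        self._out = open(path, "wb")
        self.paths.append(path)
        self.bytes_written = 0

    def write_record(self, record: bytes) -> None:
        """Append one serialized record as a separate gzip member."""
        member = gzip.compress(record)
        too_big = self.bytes_written + len(member) > self.max_bytes
        # Roll when the record would not fit, unless the file is still empty
        # (an oversized record then gets a file of its own).
        if self._out is None or (too_big and self.bytes_written > 0):
            self._roll()
        self._out.write(member)
        self.bytes_written += len(member)

    def close(self) -> None:
        if self._out is not None:
            self._out.close()
            self._out = None
```

The point is only that the writer has to know how many bytes it has produced; if the library does not expose that number, you cannot decide when to switch files.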

Darn, I need to rewrite this, after all, using webarchive-commons.


It seems to me that such an important format should have better library support. A better library would save a lot of time and effort.

By saving effort, we may also wind up saving the format itself. The danger that someone will create a widely used program that writes invalid WARCs is very real. If such invalid WARCs proliferate, that could greatly undermine the usefulness of the format.

It is important to make it easy to do the right thing. Right now, even someone, like myself, who is very familiar with WARC needs to double and triple check every step of a very simple WARC writing program. Never mind if you need to do something a little advanced.

Someone with less knowledge and, perhaps, less motivation to "get it right" could all too easily write something "good enough".

It is important to demystify the WARC format. Good library support would be an ideal start.

August 17, 2015

The WARC Format 1.1

The WARC Format 1.0 is an ISO standard describing the container format that web archives use to store their data. WARCs contain not only the actual file resources (HTML, images, JavaScript etc.) but also request and response headers, metadata about the resources and the overall collection, deduplication records and conversion records.

It's a pretty flexible format. It has served us quite well, but it is not perfect.

While it is an ISO standard, most of it was written by IIPC members. Indeed, it is heavily influenced by the ARC format developed by The Internet Archive. So, now that the WARC format is being revisited, it is only natural that the IIPC community, again, writes the first draft.

At the IIPC GA this year, in Stanford, there was a workshop where the pain points of the current specification were brought to light. There was a lot of energy in the room and people were excited. But, as everyone got back home a lot of that energy went away.

It is a lot easier to talk about change than to make it happen. To make things more difficult, few of us know much about the standards process. It all felt very inscrutable.

To help with the procedural aspect we came up with an approach that involves using the tools we are familiar with (software development). Consequently, we (and by "we" I mean Andy Jackson of the British Library) set up a GitHub project around the WARC specification.

Any problems with the existing specification could be raised there as "issues" (you'll find all the ones discussed in Stanford on there!). The existing spec could be included as markdown formatted text and any proposed changes could be submitted as "pull requests" acting on the text of the existing spec.

Currently there are two pull requests, each representing a proposed set of changes to address one specific shortcoming of the existing spec.

One of the pull requests comes from yours truly. It addresses the concerns of "URI agnostic revisit records". This was previously dealt with via an advisory on the subject adopted by the IIPC. The pull request allows us to promote what has been a de facto standard into the actual standard.

The other pull request centers on improving the resolution of timestamps in WARC headers.

Neither pull request has been merged, meaning that both are up for comment and may change or be rejected altogether. There are also many issues that still need to be addressed.

I would like to encourage all interested parties (IIPC members and non-members alike) to take advantage of the GitHub venue if the WARC format is important to you. You can do this by opening issues if you have a problem with the spec that hasn't been brought up. You can do this by commenting on existing issues and pull requests, suggesting solutions or pointing out pitfalls.

And you can do this by suggesting actual edits to the standard via pull requests (let us know if you need help with the technical bits).

Ultimately, the draft thus generated will be passed on to the relevant ISO group for review and approval. This will happen (as I understand it) next year.

So grab the opportunity while it presents itself and have your say on The WARC Format 1.1.