December 16, 2014

Deduplicating text based data

In my last two posts about deduplication, you may have noticed the following caveat:
It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated.
The reasons for ignoring text documents derive from analysis I did 8-9 years ago when first developing the DeDuplicator. The basic argument was, essentially, that HTML documents (which at the time made up the overwhelming majority of text documents) were highly dynamic (yielding little by way of deduplication), generally small and highly compressible.

The first and last assumptions are still true, but HTML files have certainly gotten bigger in the interim. Perhaps more importantly, other text files are becoming more common and may benefit from deduplication. For example, CSS and JavaScript files are unlikely to change very frequently and have gotten much larger. Also, commonly used JavaScript libraries are replicated across many websites.

So, I figured it was time to do a little analysis and see what numbers came up. I've used the same crawl logs as were discussed in the two last posts.

In my last post there was a table that showed that a total of 73.6 million URLs had been exempted from deduplication based on mime type. This is about two thirds of all URLs crawled. This is a bit misleading as it includes non-200 responses. When we limit ourselves to 200 responses, the actual number is 53.9 million. Thus text documents account for 62% of URLs.

The table also showed that these URLs accounted for about 2.3 TiB of the data collected, about 32% of all the data (that is about right as non-200 responses usually have no or negligible payload). Clearly the average uncompressed file size of text documents is much smaller than of non-text documents.

Of the 2.3 TiB, 25% could have been deduplicated (553 GiB). By URL, it was about 26% overall.

I didn't attempt to break it down to exact-url, digest and crawl time deduplication.

Looking at it further, about half of the duplicate data was from non-HTML documents. More interesting is that while 14 million of the 50 million HTML documents were deemed duplicates, 2 million of the 3.1 million non-HTML text documents were deemed duplicates. The probability of a non-HTML text document being a duplicate is very high (almost comparable to non-text documents) and its size is, on average, much larger than that of an HTML document.

This is pretty conclusive. Including non-HTML text documents in deduplication yields significant savings and is fairly cheap in terms of index size and number of additional lookups.

The savings with regard to HTML are more questionable. By more than doubling the number of lookups there is a potential saving of about 5% of the total compressed data size (assuming 60% compression, which is likely conservative). With a larger index, each lookup also becomes more expensive.

Unless resources are plentiful, I believe that skipping HTML documents when deduplicating is still a reasonable choice for large scale crawls. For focused crawls (especially those conducted on short intervals), I would however recommend including HTML content.

December 5, 2014

URI agnostic deduplication on content discovered at crawl time

In my last blog post I showed that URI agnostic duplicates accounted for about 5% of all duplicates by volume (bytes) and about 11% by URI count. But this is limited to looking up content digests that had been discovered in a previous crawl. What if we also deduplicated on content digests that are discovered during a crawl?

So I put together a little script and set it loose on the crawl logs for the domain crawl. As before I only considered documents whose content (mime) type does not start with "text/".

In all, this would have increased the number of duplicates found by 3.5%. It would also increase the number of bytes deemed duplicate by 3.5%.

In practical terms this means I could have avoided storing 121 GiB of data. This is about 9.2% of the data volume that was subjected to deduplication and deemed novel. Or 3.3% of the overall data volume deemed novel and stored.
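The kind of analysis described above can be sketched roughly as follows. This is not the script that was actually used. It assumes a whitespace-separated crawl.log layout (timestamp, status, size, URI, discovery path, referrer, mime type, thread, fetch time, digest, source tag, annotations) and that documents already deduplicated against the pre-crawl index carry an annotation containing "duplicate". The class name and both assumptions are illustrative; adjust them to match your own logs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Counts duplicates that a crawl-time digest lookup would have caught. */
class CrawlTimeDigestStats {
    long missedCount = 0; // documents stored although their digest was already seen in this crawl
    long missedBytes = 0; // payload bytes of those documents

    /**
     * Assumed field order: 0 timestamp, 1 status, 2 size, 3 URI, 4 discovery path,
     * 5 referrer, 6 mime type, 7 thread, 8 fetch time, 9 digest, 10 source tag,
     * 11 annotations.
     */
    static CrawlTimeDigestStats analyze(List<String> crawlLogLines) {
        CrawlTimeDigestStats stats = new CrawlTimeDigestStats();
        Set<String> seenDigests = new HashSet<String>();
        for (String line : crawlLogLines) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 12) continue;               // malformed or truncated line
            if (!f[1].equals("200")) continue;         // only successful fetches
            if (f[6].startsWith("text/")) continue;    // text was exempt from deduplication
            if (f[11].contains("duplicate")) continue; // assumption: already deduplicated via the index
            boolean firstSighting = seenDigests.add(f[9]);
            if (!firstSighting) {
                // Same payload was stored earlier in this very crawl
                stats.missedCount++;
                stats.missedBytes += Long.parseLong(f[2]);
            }
        }
        return stats;
    }
}
```

Only the set of digests seen so far needs to be held in memory, so a single pass over the log is enough.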

The following is a table showing the actual numbers. The difference between the total and 'subject to deduplication' is made up of URIs whose content type started with "text/".

                                            URIs     GiB
Subject to deduplication:             33,096,219   4,791
Deemed duplicates (total):            24,522,409   3,477
- Exact URL matches:                  18,505,138   2,941
- Canonical URL matches:               3,273,397     353
- Digest only matches:                 2,743,874     176
Missed digest at crawl time matches:     853,013     121

So there doesn't seem to be that much gain from tackling this class of duplicates. Heritrix does offer a tool for this (that I haven't tried). I think it'll come down to how difficult this is to implement and its effect on performance. If it's easy and doesn't hurt performance, reducing data volume by 3-4% can add up.

December 4, 2014

The results of URI-agnostic deduplication on a domain crawl

During the recently completed .is domain crawl, URI-agnostic deduplication was employed for the first time. Until now, .is domain crawls have only deduplicated in instances where the same URI served up identical content to that observed during the most recent prior crawl.

It should be noted that the URI-agnostic deduplication only relied on the data from the index created pre-crawl. It did not try to deduplicate on documents that were discovered for the first time during the crawl.

It should also be noted that only URLs whose content (mime) type did not begin with "text/" were deduplicated. Generally, deduplicating on HTML and other text based documents yields limited results due to them being dynamically generated and heavily compressible. We will be looking at this class in the future.

The deduplication index contained all known URIs for known hashes. This made it possible to record for each encountered duplicate if it was...
  • An exact URL match.
    Meaning that that exact URL had already been recorded as having this payload
  • A canonical URL match.
    Meaning that a URL whose canonical form is identical to the current URL's canonical form had already been recorded as having this payload. The canonicalization employed is the same one utilized by OpenWayback when determining URL equivalency. Exact URL matches do not count as canonical matches.
  • A digest only match
    Meaning that we have recorded an unrelated URL as having the same payload. Exact and canonical URL matches do not count towards this.
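The three-way classification above can be sketched as follows. This is an illustration only, not the DeDuplicator's code. The class and method names are hypothetical, and canonicalize() is a much simplified stand-in for OpenWayback's canonicalization (it lowercases, drops the scheme, a leading "www." and a trailing slash).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class DuplicateClassifier {
    enum Match { EXACT_URL, CANONICAL_URL, DIGEST_ONLY, NONE }

    // digest -> every URL previously recorded as having that payload
    private final Map<String, Set<String>> urlsByDigest = new HashMap<String, Set<String>>();

    void record(String digest, String url) {
        Set<String> urls = urlsByDigest.get(digest);
        if (urls == null) {
            urls = new HashSet<String>();
            urlsByDigest.put(digest, urls);
        }
        urls.add(url);
    }

    /** Hypothetical, simplified canonicalization; real rules are more involved. */
    static String canonicalize(String url) {
        String c = url.toLowerCase(Locale.ROOT)
                .replaceFirst("^https?://", "")
                .replaceFirst("^www\\.", "");
        return c.endsWith("/") ? c.substring(0, c.length() - 1) : c;
    }

    /** Checks the categories in order, so each duplicate lands in exactly one. */
    Match classify(String digest, String url) {
        Set<String> known = urlsByDigest.get(digest);
        if (known == null) return Match.NONE;            // payload never seen: novel
        if (known.contains(url)) return Match.EXACT_URL;
        String canonical = canonicalize(url);
        for (String k : known) {
            if (canonicalize(k).equals(canonical)) return Match.CANONICAL_URL;
        }
        return Match.DIGEST_ONLY;                        // same payload, unrelated URL
    }
}
```

Checking exact before canonical before digest-only is what keeps the categories mutually exclusive, as the definitions require.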
The results:
  • Exact URL: 75.46% of URLs, 84.58% of bytes deduplicated
  • Canonical URL: 13.35% of URLs, 10.15% of bytes deduplicated
  • Digest only: 11.19% of URLs, 5.07% of bytes deduplicated
As you can see, allowing URI-agnostic deduplication does not significantly affect the amount of data that can be deduplicated. It is also clear that it is primarily smaller files that are likely to be deduplicated in this manner.

In all, URI-agnostic deduplication saved 176 GiB in a crawl that ultimately required 2.1 TiB of storage space. So even if we assume that none of the 176 was compressible it is a saving of 7.6%. The actual value is probably closer to 5%.

If you count canonical URL matches as URI-agnostic (it is an arguable point) then the number rises notably to 19.7%.

October 17, 2014

Webarchive deduplication. Does it matter which record is marked as the original?

We've been doing deduplication in webarchiving for a long time now. But, due to limits in our tools and storage format (WARC), it has always been so-called URL based deduplication. I.e. we record that this capture of a particular URL is a duplicate (or revisit if you prefer). The content isn't stored and playback software simply moves 'backwards in time' until it finds the original record.

With recent clarifications to the WARC spec adopted by the IIPC and implemented in tools like Heritrix and OpenWayback we are no longer limited to this URL based deduplication.

With Heritrix 3.3.0 (still in development) now having robust handling for 'url agnostic' or 'digest based' deduplication, I set out to update my DeDuplicator software, an add-on for Heritrix. Implementing digest based deduplication was super easy.

But something nagged at me.

Consider the following scenario. You are crawling URL A. Its content digest indicates that it is a duplicate of URL A from some earlier crawl (let's call this A-1), but is also a duplicate of URL B. B may have been crawled at the same time (i.e. during the same harvesting round) as A-1, or at another time altogether, including earlier during the current round of harvesting.

Obviously, if we didn't have A-1, we would deduplicate on B and say that A is a duplicate of B. That's the whole point of digest based deduplication.

But, if you do have A-1, does it matter which one A is declared a duplicate of?

During large scale crawling you need lookups in your deduplication index to be as efficient as possible. Doing simple lookups on digests is more performant than doing a lookup on both digest and URL (or searching within a result set for the digest). Additionally, you can make the index smaller by only including one instance of each digest in the index.

But this bowing to performance means that it is blind chance whether A-1 or B will be designated as the 'original' for A.
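To make the trade-off concrete, here is a minimal sketch of a digest-only index that keeps a single entry per digest. The class and field names are hypothetical and this is not how the DeDuplicator stores its index. It only shows that which capture ends up as the 'original' depends on the order in which entries were added.

```java
import java.util.HashMap;
import java.util.Map;

class DigestOnlyIndex {
    /** The one capture retained for a digest. */
    static final class Original {
        final String url;
        final String captureTime;

        Original(String url, String captureTime) {
            this.url = url;
            this.captureTime = captureTime;
        }
    }

    private final Map<String, Original> byDigest = new HashMap<String, Original>();

    /** Keeps only the first capture seen for each digest; later ones are dropped. */
    void add(String digest, String url, String captureTime) {
        if (!byDigest.containsKey(digest)) {
            byDigest.put(digest, new Original(url, captureTime));
        }
    }

    /** A single key lookup: no URL comparison and no result set to search. */
    Original lookup(String digest) {
        return byDigest.get(digest);
    }
}
```

If B happens to be added before A-1, a lookup for A's digest returns B, and vice versa. Preferring A-1 would require either keeping every URL per digest or a second lookup keyed on URL.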

The engineer in me insists that this is irrelevant. After all, if we didn't have A-1, we'd want to use B. So does the existence of A-1 in our archive really matter?

I haven't been able to come up with any solid argument for preferring A-1. Logically, it shouldn't matter. But, somehow, it just feels off to me.

If anyone has a concrete technical or practical reason for preferring A-1 (when available) please share!

Thank you.

Edit: In response to Peter Webster's comment, let me just clarify that replay tools (e.g. OpenWayback) would still be aware of A-1, as it would be in their index, and they would show it as the precursor to A. The link between A and B would be considered incidental (as far as replay tools are concerned) and would not be directly evident to users.

Update: I've come up with at least one reason why it might matter. See my blog post answering my own question.

October 2, 2014

Heritrix, Java 8 and

I ran into this issue and I figured if I don't write it up, I'll be sure to have forgotten all the details when it occurs again.

The issue is that Heritrix (which is still built against Java 6) uses on startup to generate a self signed certificate for its HTTPS connection. However, in Java 8, Oracle changed this class to be

As Heritrix only generates the certificate once, I only ran into this issue when installing a new build of Heritrix, not when I upgraded Java to 8 on my crawl server. You can run Heritrix with Java 8 just fine as long as you launch it once with Java 7 (or, presumably, older).

It should be noted that Java warns against using anything from the sun package. It is not considered part of the Java API. But I believe that the only alternative is to have people manually set up the certificates.

This does mean two things:

1. You need a version of Java prior to 8 to work on Heritrix. It is possible for newer versions to be in compatibility mode with a prior version. But Keytool isn't part of Java proper. If you only have Java 8 installed, you will not have the necessary dependency available on the classpath. Your IDE will complain incessantly.

2. Building Heritrix on machines with only Java 8 is not possible.

I've also seen at least one unit test using Keytool (this may be only in a pending pull request, I haven't looked into it deeply).

This isn't an immediate problem as Java 7 is still supported and available from Oracle. However, if they discontinue Java 7 it will quickly become a problem (just try to get Java 6 to install from Oracle).

If you want to run Heritrix with Java 8 your options are:

1. First run it once with Java 7 or prior.
2. Use the -s option to specify a custom keystore location and passwords. You can build that keystore using external tools.
3. Manually create the adhoc.keystore file (in Heritrix's working directory) that Heritrix usually generates automatically. This can be done using Java 8 tools with the following command (assumes Java's bin directory is on the path):
  $ keytool -keystore adhoc.keystore -storepass password \
    -keypass password -alias adhoc -genkey -keyalg RSA \
    -dname "CN=Heritrix Ad-Hoc HTTPS Certificate" -validity 3650

Number 3 rather points at a possible solution to this. Just move this generation of an adhoc keystore to the shell script that launches Heritrix.

Edited to add #4: Copy an adhoc.keystore from a previous Heritrix install, if you have one lying about.

August 28, 2014

Rethinking Heritrix's crawl.log

I've been looking a lot at the Heritrix crawl.log recently, for a number of reasons. I can't help feeling that it is flawed and it's time to rethink it. 

Issue 1; It contains data that isn't directly comparable

Currently it has data for at least two protocols (DNS and HTTP, three if you count HTTPS) and that is assuming you aren't doing any FTP or other 'exotic' crawling. While these share elements (notably a URL), they are not apples to apples comparable.

Worse still, the crawl.log includes failed fetches and even decisions to not crawl (-9998 and -500X status codes).

It seems to me that we really need multiple logs. One per protocol, plus one for failures and one for URLs that the crawler chooses not to crawl. Each with fields that are appropriate to its nature.

This resolves, for example, the problem of the status code sometimes being Heritrix specific and sometimes protocol specific. In fact, we may replace integer status codes with short text labels for increased clarity in the non-protocol logs.

For the sake of backward compatibility, these could be added while retaining the existing crawl.log. Ultimately, the basic crawl.log could be either eliminated or changed into a simpler form meant primarily as a way of gauging activity in the crawler at crawl time.

Issue 2; It doesn't account for revisit/duplicate discovery

Annotations have been used to address this, but it deserves to be treated better. This can be done by adding three new fields:

  • Revisit Profile - Either - if not a revisit, or a suitable constant for server-not-modified and identical-payload-digest. These should be readable labels rather than obscure numbers, to make it easy for custom deduplication schemes to extend this as needed.
  • Original Capture Time - Mandatory if revisit profile is not -
  • Original URL - Either - to signify that the original URL is the same as the current one, or the original URL
These would only be in the logs for protocols. Possibly omitted in the DNS protocol log.
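As a rough illustration of the three proposed fields, the sketch below builds the values that would be written to a log line. Everything here is hypothetical: the class name, the space-separated output and the exact label strings are assumptions, not part of any Heritrix release.

```java
class RevisitFields {
    static final String NONE = "-";
    // Readable labels rather than numeric codes, so custom schemes can add their own
    static final String SERVER_NOT_MODIFIED = "server-not-modified";
    static final String IDENTICAL_PAYLOAD_DIGEST = "identical-payload-digest";

    final String profile;
    final String originalCaptureTime;
    final String originalUrl;

    private RevisitFields(String profile, String originalCaptureTime, String originalUrl) {
        this.profile = profile;
        this.originalCaptureTime = originalCaptureTime;
        this.originalUrl = originalUrl;
    }

    /** A capture that is not a revisit: all three fields are "-". */
    static RevisitFields notRevisit() {
        return new RevisitFields(NONE, NONE, NONE);
    }

    /** A revisit: capture time is mandatory, the URL collapses to "-" if unchanged. */
    static RevisitFields revisit(String profile, String originalCaptureTime,
                                 String originalUrl, String currentUrl) {
        if (originalCaptureTime == null || originalCaptureTime.equals(NONE)) {
            throw new IllegalArgumentException("A revisit needs an original capture time");
        }
        String url = originalUrl.equals(currentUrl) ? NONE : originalUrl;
        return new RevisitFields(profile, originalCaptureTime, url);
    }

    /** The three fields in the order proposed above, separated by single spaces. */
    String toLogFields() {
        return profile + " " + originalCaptureTime + " " + originalUrl;
    }
}
```

A custom deduplication scheme could pass its own profile label to revisit() without any change to the log format.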

Issue 3; Changes are extremely difficult because tools will break

To help with this, going forward, a header specification for the new logs should be written. Either to note a version or to specify fields in use (e.g. the first line of CDX files). Possibly both.

This will allow for somewhat more flexible log formats and we should provide an ability to configure exactly which fields are written in each log.

This does place a burden on tool writers, but at least it will be a solvable issue. Currently, tools need to sniff for the few minor changes that have been made in the last eleven years, such as the change in the date format of the first field.
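A header line that names the fields in use would let tools look fields up by name instead of by fixed position. The sketch below assumes a hypothetical header format, a first line starting with "#fields" followed by space-separated field names. No such format exists in Heritrix today; the marker and the field names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class LogHeader {
    static final String PREFIX = "#fields"; // hypothetical header marker

    // field name -> zero-based position within a log line
    private final Map<String, Integer> positions = new HashMap<String, Integer>();

    LogHeader(String headerLine) {
        String[] parts = headerLine.trim().split("\\s+");
        if (parts.length < 2 || !parts[0].equals(PREFIX)) {
            throw new IllegalArgumentException("Not a header line: " + headerLine);
        }
        for (int i = 1; i < parts.length; i++) {
            positions.put(parts[i], i - 1);
        }
    }

    /** Returns the named field from a log line, or null if this log does not carry it. */
    String get(String logLine, String fieldName) {
        Integer pos = positions.get(fieldName);
        if (pos == null) return null;
        String[] fields = logLine.trim().split("\\s+");
        return pos < fields.length ? fields[pos] : null;
    }
}
```

A tool written this way keeps working when fields are reordered or optional fields are switched off, and it can tell a field that is absent from the log apart from one that is merely empty.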

Issue 4; Annotations have been horribly abused

Annotations were added for the sake of flexibility in what the log could contain. They have, however, been abused quite thoroughly. I'm in favor of dropping them entirely (not just in the logs, but excising them completely at the code level) and in their place (for data that isn't accounted for in the log spec) use the JSON style data structure "extra information" that is present but generally unused.

Very common usages of the annotations field should be promoted to dedicated fields. Notably, revisits (as discussed above) and number of fetch attempts.

Some of these fields might be optional as per issue 3.

Closing thoughts 

In writing the above I've intentionally avoided more moderate/short term fixes that could be applied with less of an upset. I wanted to shake things up and hopefully get all of us to reevaluate our stance on this long serving tool.

Whether the solutions I outline are used or not, the issues remain and the above is not an exhaustive list, I'm sure. It's time, indeed past time, we did something about them.

August 15, 2014

Packaging Heritrix add-on projects

I've made several projects that add-on to Heritrix. Typically, these build a tarball (or zip file) that you can explode into Heritrix's root directory and all the necessary JAR files, job configurations and shell scripts wind up where they are supposed to be. This works well enough, but it does impose an extra step, a double install if you will.

So I decided to see if I could improve on this and have the add-on project actually bake itself into the Heritrix distribution. Turns out, this is easy!

Step one, update the project POM to have a dependency on the root Heritrix project distribution. Like so:


The key there is the classifier and type.

Next add to the plugin section of the POM instructions to unpack the above. Make sure this comes before the assembly plugin.

  <!-- Unzip Heritrix distribution -->


Now all you need to do is ensure that the assembly plugin puts the necessary files into the correct directories. This can be done by specifying the outputDirectory as follows:


and make sure that there is a fileSet to include the entire exploded Heritrix distribution. E.g.:


And done. The assembly will now unpack the Heritrix distribution, add your files and pack it back up, ready for install like any other Heritrix distro.

August 12, 2014

JWAT WARC reading speed doubled

This is a follow up to my last post about inefficiencies in WARC readers.

As I noted there, reading a WARC file straight (via a GZIP reader, so uncompressing it, but not parsing the content) takes about 10 seconds, whereas iterating over the same file, using the Java tools available (webarchive-commons, JWAT) takes about 40 seconds.

More specifically:

JWAT GzipReader: 10s
webarchive-commons WARCReader: 42s
JWAT WarcReader: 45s

There is a variability there of a few hundred milliseconds between runs so I've rounded to the nearest second. Note also, that the WARC readers were run immediately after the GzipReader and would have benefited from any OS caching.

In my earlier post I speculated that adding a read buffer to JWAT's ByteCountingPushBackInputStream would likely improve the efficiency considerably. I proceeded to test this hypothesis. New run time:

JWAT WarcReader for GZ: 21s

So, JWAT goes from being slightly slower than webarchive-commons, to being twice as fast.

The class still passes all its unit tests and as far as I can tell it is still functionally equivalent.

I've forked the JWAT project to my account on GitHub. You can see the modified file here:

There are no doubt greater gains to be had, but they'll require a deeper understanding of the code than I possess at this moment.

The downside to this change is, of course, a slight uptick in the amount of memory used, as a 10K buffer is assigned every time an instance of the ByteCountingPushBackInputStream is created (and that is done more frequently than just to wrap the core gzip read). Still, it seems a small price to pay for the speed increase.
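The idea of a read buffer inside a counting stream can be sketched as below. To be clear, this is not the modified JWAT class. It is a stripped-down, hypothetical stream that leaves out pushback entirely and only shows how single-byte read() calls can be served from an internal buffer while the count of consumed bytes stays exact.

```java
import java.io.IOException;
import java.io.InputStream;

/** Illustrative only; not JWAT's ByteCountingPushBackInputStream. */
class BufferedCountingInputStream extends InputStream {
    private final InputStream in;
    private final byte[] buffer;
    private int pos = 0;       // index of the next unread byte in buffer
    private int limit = 0;     // number of valid bytes currently in buffer
    private long consumed = 0; // bytes handed to the caller, not bytes pre-read

    BufferedCountingInputStream(InputStream in, int bufferSize) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        this.in = in;
        this.buffer = new byte[bufferSize]; // allocated once per instance
    }

    @Override
    public int read() throws IOException {
        if (pos >= limit) {
            // Buffer exhausted: refill with one block read from the wrapped stream
            int n;
            do {
                n = in.read(buffer, 0, buffer.length);
            } while (n == 0);
            if (n < 0) return -1; // end of stream
            limit = n;
            pos = 0;
        }
        consumed++;
        return buffer[pos++] & 0xFF;
    }

    long getConsumed() {
        return consumed;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The counter tracks bytes delivered to the caller rather than bytes pulled from the underlying stream, which matters when the count is used for record offsets.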

I have no doubt that improvements can also be made in webarchive-commons, but it is far less clear to me where those changes should be made.

August 6, 2014

Inefficiencies in WARC readers

There is a hard limit to how fast you can read a WARC file, dictated by the storage mechanism. You can't process it faster than you can read it off of an HDD or SSD.

Considering how glacially slow those devices are compared to everything else, you might expect that processing a WARC (to build a CDX, for example) would take about as long as it takes to read it from disk. After all, nothing overly special is being done. GZIP decompression is fairly fast and parsing the WARC header fields is simple enough.

This is, unfortunately, not the case.

I've suspected as much ever since I generated CDXs for our entire collection some months ago. I ran into this issue again last week as I was trying to extract data from WARC files to build a deduplication index. So I set out to do some testing.

Using a randomly selected 1 GB gzipped WARC file I tried running it through the WARCReader that is in the webarchive-commons library. It averaged about 40 seconds to process this one WARC file.

I then tried reading it using the GZIPInputStream provided by the Java Standard Library and found that it took about 10 seconds to read the file. This, by the way, is consistent with the amount of time needed to read a file of that size from disk.

So, why does it take the webarchive-commons tools four times as long?

I looked under the hood and saw that it was using its own input stream named GZIPMembersInputStream. However, when I tried to use that directly, I also got a run time of about 10 seconds.

Digging a bit further I noticed something interesting. Whereas I was reading the file like so:

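  // BUFFER is the block size in bytes (10 KB in these tests)
  GZIPMembersInputStream gin = new GZIPMembersInputStream(
      new FileInputStream(file), BUFFER);

  byte[] buf = new byte[BUFFER];
  // Read a block at a time until the end of the stream
  int bytesRead = gin.read(buf);
  while (bytesRead != -1) {
    bytesRead = gin.read(buf);
  }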

I.e. in 10KB blocks, the webarchive-commons tools were (at least mostly) reading it a byte at a time. This can be seen, for example, in LaxHttpParser, line 84. Calls to read() accounted for 20 seconds of the 40 second run time.

I changed my tests on the GZIPInputStream and GZIPMembersInputStream to read a byte at a time and, sure enough, it now took about 40 seconds to read the WARC files.

Clearly, there is an overhead to each read action from a compressed stream, unrelated to the size of the read.
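The effect is easy to reproduce with nothing but the standard library. The sketch below is not the test harness used for the numbers above. It gzips an in-memory array (the size is an arbitrary choice), then reads it back a byte at a time, once directly from a GZIPInputStream and once through a BufferedInputStream wrapped around it. Actual timings will vary by machine and JVM.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BufferedGzipRead {
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream out = new GZIPOutputStream(bos);
        out.write(raw);
        out.close(); // finishes the gzip trailer
        return bos.toByteArray();
    }

    /** Every read() call goes straight to the decompressor. */
    static InputStream openUnbuffered(byte[] gz) throws IOException {
        return new GZIPInputStream(new ByteArrayInputStream(gz));
    }

    /** read() calls are served from an 8 KB buffer filled by block reads. */
    static InputStream openBuffered(byte[] gz) throws IOException {
        return new BufferedInputStream(
                new GZIPInputStream(new ByteArrayInputStream(gz)), 8192);
    }

    /** Reads the stream a byte at a time, as a naive parser would. */
    static long countBytes(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) n++;
        in.close();
        return n;
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip(new byte[50 * 1024 * 1024]); // arbitrary test size
        long t0 = System.nanoTime();
        countBytes(openUnbuffered(gz));
        long t1 = System.nanoTime();
        countBytes(openBuffered(gz));
        long t2 = System.nanoTime();
        System.out.println("unbuffered ms: " + (t1 - t0) / 1000000);
        System.out.println("buffered ms:   " + (t2 - t1) / 1000000);
    }
}
```

Both variants return exactly the same bytes; only the number of calls that reach the decompressor differs.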

Unfortunately, the GZIPMembersInputStream and its parent classes are quite convoluted, so adding an output buffer is complicated. I barely know where to begin.

For the sake of completeness, I also tested the JWAT WARC reader. It has the same issue, although I suspect it may be easier to address there in either the ByteCountingPushBackInputStream or PushbackInputStream as they aren't nearly as convoluted as GZIPMembersInputStream in webarchive-commons.

Bottom line. Byte-at-a-time reading from a compressed stream is very inefficient. A buffer (of about 7-8 KB according to my tests) provides massive improvements. Implementing this in webarchive-commons is going to be tricky, given the legacy code there, but the performance gains would seem to justify the effort.