April 28, 2016

New 'semi-stable' build for Heritrix

Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, based on the current 3.3.0 master. This work is now underway in the Landsbokasafn Heritrix repo on GitHub, under the tag LBS-2016-02.

Heritrix 3.3.0-LBS-2016-02
I've merged in one pull request that is still open in the IA repository, #154 Fixes for apparent build errors. Most notably, this makes it possible to have Travis-CI build and test Heritrix.

You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. The main changes are:

  • Some fixes to how server-not-modified revisit records are written (PR #118)
  • Fix outlink hoppath in metadata records (PR #119)
  • Allow dots in filenames for known good extensions (PR #120)
  • Require Maven 3.3 (PR #126)
  • Allow realm to be set by server for basic auth (PR #124)
  • Better error handling in StatisticsTracker (PR #130)
  • Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014
  • Changes to how cookies are stored in Bdb (PR #133)
  • Handle multiple clauses for the same user agent in robots.txt (PR #139)
  • SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
  • 'Novel' URL and byte quotas (PR #138)
  • Only submit 'checked' checkboxes and radio buttons when submitting forms (PR #122)
  • Form login improvements (PR #142 and #143)
  • Improvements to the hosts report (PR #123)
  • Handle SNI errors better (PR #141)
  • Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
  • Fix to ExtractorHTML's handling of HTML comments (PR #149)
  • Build against Java 7 (PR #152)

In the above I've ignored all pull requests that apply primarily to the contrib package. There were quite a few, mostly (but not exclusively) relating to AMQP.

I've done some preliminary testing and everything looks good. So far, the only issue I've noted is one I was already aware of: noisy alerts relating to 401s.

I'll be testing this version further over the next few weeks and welcome any additional input.

April 18, 2016

A long week is over. Thank you all.

The 2016 IIPC General Assembly and Web Archiving Conference is over. Phew!

[Photo: Me, opening the Harvesting Tools session on Tuesday]

I always look forward to this event. It is by far the most stimulating and productive meeting/conference that I attend regularly, and I believe we managed to live up to that this time.

The meeting had a wonderful Twitter back-channel that you can still review using the hashtags #iipcGA16 and #iipcWAC16.

It has been over two years since we, at the National and University Library of Iceland, offered to host the 2016 GA, and the initial decision to do so was made over half a year before that. Even with a 2.5 year lead time, it barely felt like enough.

I'd like to take this opportunity to thank, again, all the people who helped make last week's event a success.

First off, there is the program committee, which was very small this year, comprising, in addition to myself, (in alphabetical order) Alex Thurman (Columbia University Libraries), Gina Jones (Library of Congress), Jason Webber (IIPC PCO/British Library), Nicholas Taylor (Stanford University Libraries) and Peter Stirling (Bibliothèque nationale de France). I literally couldn't have done this without you.

I'd also like to note the contribution of our incoming PCO, Olga Holownia, who put in a lot of work during the conference to help make sure everything was just right for each session.

Next, I'd like to thank my colleagues at the National Library who assisted me in organizing this event and helped out during the week by handling registration, running tours, etc. It was a team effort. Notable mentions to Áki Karlsson and Erla Bjarnadóttir, who spent much of the week making sure that all the little details were attended to.

[Photo: The Steering Committee on Friday, following the SC meeting]
A big thank you to all the speakers and session moderators.

And lastly, I'd like to thank the members of the Steering Committee for being willing to entrust the single most important event of the IIPC calendar to one of the IIPC's smallest members. Indeed, doing so without the slightest hesitation.

I've learned a lot from this past week and I hope to be able to distill that experience and write it up so that next year's GA/WAC can be even better. But that will have to wait for another day and another blog post.

For now, I'll just say thanks for coming and see you all again in Lisbon for #iipcGA17 and #iipcWAC17.

April 7, 2016

Still Looking For Stability In Heritrix Releases

I'd just like to briefly follow up on a blog post I wrote last September, Looking For Stability In Heritrix Releases.

The short version is that the response I got was, in my opinion, insufficient to proceed. I'm open to revisiting the idea if that changes, but for now it is on ice.

There is little doubt in my mind that having (somewhat) regular stable releases of Heritrix would be of notable benefit. Even better if they were published to Maven Central.

Instead, I'll continue to make my own forks from time to time and make sure they are stable for me. The last one was dubbed LBS-2015-01. It is now over a year old and a lot has changed. I expect I'll be making a new one in May/June. You can see what's changed in Heritrix in the meantime here.

I know a few organizations are also using my semi-stable releases. If you are one of them and would like to get some changes in before the next version (to be dubbed LBS-2016-02), you should try to get a PR into Heritrix before the end of April. Likewise, if you know of a serious/blocking bug in the current master of Heritrix, please bring it to my attention.

April 1, 2016

Duplicate DeDuplicators?

A question on the Heritrix mailing list prompted me to write a few words about deduplication in Heritrix and why there are multiple ways of doing it.

Heritrix's built-in service

Heritrix's built-in deduplication service comes as a pair of processors. One records each URL in a BDB datastore, or index (PersistStoreProcessor); the other looks up the current URL in this datastore and, if it finds it, compares the content digest (FetchHistoryProcessor).

The index used to be mingled in with other crawler state. That is often undesirable, as you may not wish to carry any of that state forward to subsequent crawls. Thus, you typically configure the above processors to use their own directory by wiring in a separate "BDB module" configured to write to an alternative location.
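As a rough sketch, that wiring looks something like the following in the crawler-beans configuration. Bean names and the directory value here are illustrative, not required; check the wiring against your Heritrix version and the wiki before using it:

```xml
<!-- Illustrative only: a dedicated BdbModule so the deduplication
     index is kept apart from the rest of the crawler state. -->
<bean id="historyBdb" class="org.archive.bdb.BdbModule">
  <!-- Relative to the job directory; any separate location works. -->
  <property name="dir" value="history" />
</bean>

<!-- The persist processors are pointed at the dedicated module
     instead of the crawl's default BdbModule. -->
<bean id="persistStoreProcessor"
      class="org.archive.modules.recrawl.PersistStoreProcessor">
  <property name="bdbModule" ref="historyBdb" />
</bean>
```

The store-side processor is shown here; the load-side processor that reads the history back is wired to the same module in the same way.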

There is no way to construct the index outside of a crawl, which can be problematic since a hard crash will often corrupt the BDB data. Of course, you can recover from a checkpoint, if you have one.

More recently, a new set of processors has been added: ContentDigestHistoryLoader and ContentDigestHistoryStorer. They work in much the same way, except they use an index keyed on the content digest rather than the URL. This enables URL-agnostic deduplication.
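A hedged sketch of how these might be wired in, again with illustrative bean names and paths (the digest-keyed store backing the two processors is the part doing the work):

```xml
<!-- Illustrative wiring for digest-keyed (URL-agnostic) deduplication. -->
<bean id="digestHistoryBdb" class="org.archive.bdb.BdbModule">
  <property name="dir" value="digest-history" />
</bean>

<!-- The shared, digest-keyed history store. -->
<bean id="contentDigestHistory"
      class="org.archive.modules.recrawl.BdbContentDigestHistory">
  <property name="bdbModule" ref="digestHistoryBdb" />
</bean>

<!-- Loader runs before the fetch is processed, storer after. -->
<bean id="contentDigestHistoryLoader"
      class="org.archive.modules.recrawl.ContentDigestHistoryLoader" />
<bean id="contentDigestHistoryStorer"
      class="org.archive.modules.recrawl.ContentDigestHistoryStorer" />
```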

This was a questionable feature when introduced, but after the changes made to implement a more robust way of recording non-URL duplicates, it became a useful one. Its utility will still vary based on the nature of your crawl.

As this index is updated at crawl time, it also makes it possible to deduplicate on material discovered during the same crawl. A very useful feature that I now use in most crawls.

You still can't build the index outside of a crawl.

For more information about the built-in features, consult the Duplication Reduction Processors page on the Heritrix wiki.

The DeDuplicator add-on

The DeDuplicator add-on pre-dates the built-in functionality in Heritrix by about a year (it was released in 2006). It essentially accomplishes the same thing, but with a few notable differences in tactics.

Most importantly, its index is always built outside of a crawl. Either from the crawl.log (possibly multiple log files) or from WARC files. This provides a considerable amount of flexibility as you can build an index covering multiple crawls. You can also gain the benefit of the deduplication as soon as you implement it. You don't have to run one crawl just to populate it.

The DeDuplicator uses Lucene to build its index. This allows for multiple searchable fields which in turn means that deduplication can do things like prefer exact URL matches but still do digest only matches when exact URL matches do not exist. This affords a choice of search strategies.
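To illustrate that search order, here is a toy sketch in plain Java. It uses simple maps rather than the add-on's actual Lucene-backed index, and the class and method names are made up for the example; only the strategy (exact URL match first, digest-only match as fallback) reflects the behavior described above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of a deduplication index: prefer an exact URL match,
 *  fall back to a digest-only (URL-agnostic) match. */
class DedupIndex {
    private final Map<String, String> digestByUrl = new HashMap<>(); // url -> digest
    private final Map<String, String> urlByDigest = new HashMap<>(); // digest -> first url seen

    void add(String url, String digest) {
        digestByUrl.put(url, digest);
        urlByDigest.putIfAbsent(digest, url);
    }

    /** Returns the URL of an earlier capture with the same content, if any. */
    Optional<String> findDuplicate(String url, String digest) {
        if (digest.equals(digestByUrl.get(url))) {
            return Optional.of(url);            // exact URL match, content unchanged
        }
        return Optional.ofNullable(urlByDigest.get(digest)); // digest-only match
    }
}
```

In the real add-on the same choice is expressed as Lucene queries over the url and digest fields of the index.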

The DeDuplicator also provides some additional statistics, can write more detailed deduplication data to the crawl.log, and comes with pre-configured job profiles.

The Heritrix 1 version of the DeDuplicator also supported deduplication based on 'server not modified', but that was dropped when migrating to H3, as no one seemed to be using it. The index still contains enough information to easily bring it back.

Bottom line

Both approaches ultimately accomplish the same thing. Especially after the changes made a couple of years ago to how these modules interact with the rest of Heritrix, there really isn't any notable difference in the output. All these processors, after determining that a document is a duplicate, set the same flags and cause the same information to be written to the WARC. (If you are writing ARCs, do not use any URL-agnostic features!)

Ultimately, it is just a question of which fits better into your workflow.