Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. It’s being used and supported by such institutions as the Library of Congress, the National and University Library of Iceland, and the National Library of Norway. It could be a helpful tool for universities getting into web archiving.
Since 2000 the Library of Congress has been working to collect and preserve websites. The project, originally called MINERVA, is now known as the Library of Congress Web Archives (LCWA). They have archived websites around themes such as the United States National Elections, the Iraq War, and the events of September 11. They also provide helpful technical information on how they harvested the websites and which metadata fields they used.
Chris Prom has a very helpful blog entry about his progress on evaluating web archiving service providers. His first three installments were reviews of open-source software: HTTrack, the free GNU Wget utility, and Heritrix. The fourth installment is a review of the Web Archiving Service (WAS) developed by the California Digital Library, a fee-based service for capturing and storing websites.
Check it out on his blog at: