Harvesting began on 3/9/2001.
The following websites were recently added to WebArchiv:
Karel Janeček - Názory aktuálně
Čtenářská gramotnost a projektové vyučování
Česká odborná společnost pro inkluzivní vzdělávání
Festival otrlého diváka
Muzeum Českého ráje v Turnově
Muzeum Bělá pod Bezdězem
Overview of the WebArchiv project
The archive of Czech online-born documents (widely known as “WebArchiv”) was originally launched in 2000 within an R&D project funded by the Ministry of Culture. Since then it has been run at the National Library and funded almost exclusively through various grants. Two other institutions are involved in the project: the Moravian Library, which is responsible for IT issues, and the Institute of Computer Science of Masaryk University, which acts as an external collaborator.
The main aim of WebArchiv
The number of documents published on the Internet is growing dramatically; many of them change frequently and others are lost altogether. If documents that have research value are not archived, a considerable part of the national cultural heritage will disappear forever. Responsibility for archiving online-born documents and registering them in the national bibliography is usually assumed by national libraries and/or other deposit libraries.
The main aim of the WebArchiv project is to implement a comprehensive solution for archiving the national web, i.e. online-born Bohemica. This includes tools and methods for collecting, archiving and preserving web resources, as well as for providing long-term access to them. Both large-scale automated harvesting of the entire national web and selective archiving are carried out, including thematic “event-based” collections. At present these methods are being tested and are the subject of further research. For all operations to run routinely, two conditions must be met: long-term funding has to be secured and the current legal issues have to be resolved (primarily the legal deposit legislation).
Collecting online-born documents
From a strictly technical point of view, collecting online documents is an automated process carried out by a set of software tools that harvest, index and save data in the archive according to preassigned parameters. At present, the open-source tool Heritrix is used for web crawling. In addition, a set of criteria is being defined for selecting online-born documents to be registered in the Czech National Bibliography. In this context, finding a suitable solution to the legal issues is considered necessary.
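The idea of harvesting "according to preassigned parameters" can be illustrated with a minimal sketch of a scope check, the kind of decision a crawler makes for each discovered URL. The parameters below (allowed schemes, the `.cz` top-level domain as a stand-in for the national web, and a path-depth limit) are assumptions chosen for illustration; they are not WebArchiv's actual configuration, and the function is not part of Heritrix.

```python
from urllib.parse import urlsplit

# Hypothetical harvesting parameters; not WebArchiv's real configuration.
ALLOWED_SCHEMES = {"http", "https"}
NATIONAL_TLD = ".cz"   # assumption: approximate the national web by its TLD
MAX_PATH_DEPTH = 10    # assumption: simple guard against crawler traps


def in_scope(url: str) -> bool:
    """Return True if the URL satisfies the preassigned harvesting parameters."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    host = (parts.hostname or "").lower()
    if not host.endswith(NATIONAL_TLD):
        return False
    # Count non-empty path segments to estimate how deep the URL is.
    depth = len([segment for segment in parts.path.split("/") if segment])
    return depth <= MAX_PATH_DEPTH
```

A real national-web scope would likely need more than a top-level-domain test, since Czech content is also published under other domains; that is one reason selective archiving complements automated harvesting.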
Archiving and preservation
Harvested files, including the relevant metadata, are saved in standardized archival formats supported by the IIPC consortium. The data is stored on a dedicated redundant disk array (RAID) and is expected to migrate to the National Library’s new data storage facility in the near future.
One server provides online access to a limited dataset whose content is covered by agreements with the original publishers. Full-text indexing is performed by the open-source system Nutch, and the index is accessed through the NutchWAX and WERA tools.
International standards are used for description and identification (MARC21, Dublin Core, ISSN and URN) and for archiving (ARC). Selected online-born documents are catalogued in the ALEPH library system, which supports the Z39.50 protocol (both client and server levels) and the OAI-PMH protocol (both repository and harvesting levels, with profiles for MARC21 and qualified Dublin Core); the records are thus also registered in the Czech National Bibliography.
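As a rough illustration of how a harvester talks to an OAI-PMH repository, the sketch below builds a `ListRecords` request URL. The verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH specification, and `oai_dc` is the unqualified Dublin Core prefix that every OAI-PMH repository must support. The base URL and the function name are hypothetical, and the prefixes that the ALEPH installation exposes for MARC21 or qualified Dublin Core are not stated here, so they would need to be checked against the repository itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Example with a placeholder endpoint, not a real WebArchiv address.
print(list_records_url("https://example.org/oai", from_date="2001-09-03"))
```

The response to such a request is an XML document that a harvesting client pages through with resumption tokens; that part is omitted from this sketch.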
In general, the current state of legislation in this field is unsatisfactory. The Legal Deposit Act does not cover online-born documents, and under the Copyright Act it is not possible to make archived data available to the public. Fortunately, it is possible to harvest and store online documents to protect them from disappearing forever.