WebArchiv content

Facts

WebArchiv contains 15.5 terabytes of data. Harvesting began on 3/9/2001.

Selection criteria for web resources

Introduction

One of the core tasks of the Czech National Library is to preserve documents published in the Czech Republic. For this purpose, the National Library catalogues, preserves and provides access to a preservation collection and registers it in the Czech National Bibliography. In addition to traditional documents, electronic online publications (web sites) are now published as well. Since 2000, preserving and providing access to this kind of “national production” has been the task of the WebArchiv project, a digital archive of the Czech web.

Because of the enormous number of online documents and the varied quality of Internet publications, selection criteria must be applied when building an archive of web resources, so that documents of current and future value which constitute the Czech national cultural heritage are preserved. The technical and library criteria used to select web resources for archiving and cataloguing in the Czech National Bibliography are based on previous experience from the WebArchiv project (taking into account the methodology used in similar projects in other countries, especially the National Library of Australia’s Pandora) and on the results of the European project Web Cultural Heritage (CULTURE 2000 program), conducted in 2005–2006 in cooperation with the National Library of Estonia, the University Library in Bratislava (Slovakia) and the National and University Library of Slovenia.

Experience shows that it is desirable to divide the criteria into two categories depending on the method of acquisition of the web resources, or rather on the legal conditions for providing access to the archived data: 1) comprehensive archiving of web resources and 2) a selective approach, i.e. harvesting of resources for which a contract has been signed with the publisher permitting online access to the archived copies of the documents published on the web.

A) Comprehensive archiving (harvesting)

The aim of this approach is to archive the largest possible number of domestic web resources with minimum intellectual effort by setting parameters in the harvesting software.

  • Domain – the only general criterion for comprehensive archiving of domestic web resources is the national domain (the national web space .cz). Resources located outside the national domain can also be archived provided they meet further selection criteria (see B – Selective harvesting).

Other aspects (such as format, access, protocol) are optional depending on technical facilities and other means.

  • Format – all formats that the harvester is able to download are harvested automatically. Based on available storage space and other factors, limits can be set for different file types, file type categories, file sizes, etc. This way, resources that are likely to be illegally distributed copyrighted material (CD/DVD images, large video files, etc.) can be identified and excluded by their size and file type.
  • Access – depends on the current legislation (copyright law, legal deposit) or negotiations with publishers (their willingness to provide access rights to their resources).
  • Protocol – depends on the discretion of WebArchiv staff and the features of the harvester. At present, only resources available over the http and ftp protocols are harvested. This means that streaming protocols (sound or video broadcasts) and the content of peer-to-peer networks are automatically excluded from harvesting.
  • File size – files bigger than 100 MB are generally not accepted/harvested (the actual size limit may vary depending on the file type).
  • Number of files – a maximum of 5000 harvested files per site is recommended for most resources.
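
The parameters listed above could be expressed as a simple scope filter. The following is only a minimal sketch in Python, not the configuration of the harvester actually used by WebArchiv: the function name `in_scope`, the constant names and the reading of 100 MB as 100 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Values taken from the criteria above; all names here are hypothetical.
NATIONAL_TLD = ".cz"
ALLOWED_SCHEMES = {"http", "ftp"}      # only these protocols are harvested
MAX_FILE_SIZE = 100 * 1024 * 1024      # 100 MB (assumed to mean binary megabytes)
MAX_FILES_PER_SITE = 5000              # recommended per-site maximum


def in_scope(url: str, size_bytes: int, files_harvested_for_site: int) -> bool:
    """Return True if a file passes the comprehensive-archiving limits."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # e.g. streaming or peer-to-peer protocols
    if not host.endswith(NATIONAL_TLD):
        return False  # outside the national domain
    if size_bytes > MAX_FILE_SIZE:
        return False  # file too large
    if files_harvested_for_site >= MAX_FILES_PER_SITE:
        return False  # per-site file budget already used up
    return True
```

In practice the size limit may vary by file type, as noted above, so a real configuration would probably keep one limit per file type category rather than a single constant.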

B) Selective approach

The selection guidelines for documents registered in the National Bibliography include territory, language, authorship and content:

  • Territory – all documents (resources) published in the Czech Republic
  • Language – all resources in Czech (regardless of the publication place)
  • Authorship – all resources by Czech authors (regardless of the publication place)
  • Topic/Content – all resources concerning the Czech Republic or Czechs (regardless of the publication place)

a) Selection criteria

  • 1. Domain – the national domain and other selected domains (such as .com, .org, etc.) provided they meet at least one of the criteria in point (2)
  • 2. National aspects – author’s nationality (the author of the resource content comes from the Czech Republic), publisher’s residency (the publisher is a resident of the Czech Republic), language (the resource is in Czech), nation/country as the topic (the resource contains significant information about the Czech Republic or Czechs)
  • 3. Content – resources of significant cultural and scientific value with original content and long-term research value
  • 4. Access – freely available/accessible resources; for password-protected resources, permission from the publisher or the copyright holder (depending on the legislation) is required
  • 5. Format – only resources in common formats which can be interpreted by common web browsers are selected
  • 6. Original form – resources originally published on the web (i.e. the web resources are born digital) are preferred; resources which are copies of documents published in traditional ways, or their supplements (digitized materials, electronic versions of hard-copy publications, etc.) are not selected

  • 7. Resource type – the following types of resources are preferred: online journals, monographs, materials from conferences, research reports, academic publications, government documents, and other types of resources with significant cultural or scientific value (such as weblogs or web sites covering specific topics). Resources which are not archived: computer games, intranet resources, personal weblogs, portals without original intellectual content, data sets/databases, radio and TV broadcasts, etc.
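
The interplay of criteria 1 and 2 above can be sketched as a small check. This is an illustrative sketch only: it assumes that the condition in criterion 1 applies to resources outside the national domain, and the names `NationalAspects`, `SELECTED_TLDS` and `passes_domain_criterion` are hypothetical. The set of "other selected domains" is not specified in full in the criteria, so `.com` and `.org` serve purely as examples.

```python
from dataclasses import dataclass

# Example values only; the criteria say "such as .com, .org, etc."
SELECTED_TLDS = {".com", ".org"}


@dataclass
class NationalAspects:
    """The four national aspects from criterion 2 (hypothetical field names)."""
    czech_author: bool = False          # author comes from the Czech Republic
    czech_publisher: bool = False       # publisher resides in the Czech Republic
    in_czech: bool = False              # resource is written in Czech
    about_czech_republic: bool = False  # significant information on the country or Czechs

    def any_met(self) -> bool:
        return any((self.czech_author, self.czech_publisher,
                    self.in_czech, self.about_czech_republic))


def passes_domain_criterion(host: str, aspects: NationalAspects) -> bool:
    """National domain always passes; other selected domains need a national aspect."""
    host = host.lower()
    if host.endswith(".cz"):
        return True
    on_selected_domain = any(host.endswith(tld) for tld in SELECTED_TLDS)
    return on_selected_domain and aspects.any_met()
```

For example, `passes_domain_criterion("example.org", NationalAspects(in_czech=True))` passes, while the same host with no national aspect does not. The remaining criteria (content, access, format, original form, resource type) call for human judgement and are not modelled here.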


b) Recommendations

The following aspects should also be considered:

  • Transmission protocol – resources in commonly used protocols are selected (http, ftp, etc.)
  • Copyright issues – the copyright holder of the resource should be identified before inclusion in the archive
  • Resource Integrity – only documents that constitute a whole, not their individual parts, will be harvested and archived, even if such parts meet the selection criteria
  • Frequency of harvesting – resources selected according to the selection criteria should be harvested at least four times a year where possible; important resources with frequent and significant updates, such as serials, news, etc., are ideally harvested daily
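
The harvesting frequency recommended above can be turned into a simple scheduling rule. The sketch below is an assumption-laden illustration, not WebArchiv's actual scheduler: the function name `next_harvest` and the choice of a 91-day interval (which fits four harvests into one year) are hypothetical.

```python
from datetime import date, timedelta

QUARTERLY = timedelta(days=91)  # four harvests fit within a 365-day year
DAILY = timedelta(days=1)       # for serials, news and similar resources


def next_harvest(last_harvest: date, frequently_updated: bool) -> date:
    """Return the date of the next planned harvest for a selected resource."""
    interval = DAILY if frequently_updated else QUARTERLY
    return last_harvest + interval
```

A resource last harvested on 1 January 2010 would thus be due again on 2 January 2010 if it is updated frequently, and on 2 April 2010 otherwise.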
WebArchiv
Contact: webarchiv@nkp.cz
Last update: 8/9/2010