Using the web infrastructure to preserve web pages |
| |
Authors: | Michael L Nelson, Frank McCown, Joan A Smith, Martin Klein |
| |
Affiliation: | (1) Old Dominion University, Norfolk, VA 23529, USA |
| |
Abstract: | To date, most of the focus regarding digital preservation has been on replicating copies of the resources to be preserved
from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject
to careful processes of refreshing (making additional copies to new media) and migrating (conversion to new formats and applications).
For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However,
due to the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach
is unsuitable for web-scale preservation. The result is that difficult decisions need to be made as to what is saved and what
is not saved. We provide an overview of our ongoing research projects that focus on using the “web infrastructure” to provide
preservation capabilities for web pages and examine the overlap these approaches have with the field of information retrieval.
The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation
capability for all web pages. These approaches are not intended to replace conventional archiving approaches, but rather they
focus on providing at least some form of archival capability for the mass of web pages that may prove to have value in the
future. We characterize the preservation approaches by the level of effort required by the web administrator: web sites are
reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar
pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP
email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21
DIDL representations of web pages (“web server enhanced preservation”). |
| |
Keywords: | Web infrastructure; Digital preservation; Web pages; OAI-PMH; Complex objects |
This article is indexed in SpringerLink and other databases.
|