Archiving the Internet
Author: John Class
Introduction
This project examines the application of the WARC format to both current websites and older ones. WARC stands for Web ARChive format. Derived from the earlier ARC (ARChive) format, WARC was designed with the needs of web pages in mind. The goal of this bibliography is to explore how different organizations have implemented WARC files in their digital preservation process. The format itself has some technical limitations that should be kept in mind during the planning process. One of the main issues affecting the usefulness of WARC is the difference between preserving currently working websites and resuscitating old or obsolete ones. Another issue explored here is the learning curve involved in using WARC files. The open-source nature of WARC files is useful but also places a burden on any organization that must learn how to use them. As a result, some organizations choose to work with a third-party vendor.
This list of resources is designed to help readers think through some of these issues during the planning stage. Some of the entries are more technical and explain the best way to work with the files. Other entries are case studies of organizations that have incorporated these files into their preservation planning process, and they show the results. Finally, some of the entries explore more conceptual questions concerning future needs and the changing nature of access.
Primary search terms will be:
- WARC
- Web Archiving
- Preservation planning
- Long term digital preservation
- Digital preservation standards
Annotations
Chudnov, D. (2011). Saving the web. Computers in Libraries, 31(10), 30-32.
This article explains how WARC files are used in an automated backup solution. Beginning with a discussion of common organizational needs for backup, the author explains how the Internet Archive has addressed these with Archive-It. The author then goes on to describe an automated solution called Heritrix. Backups created by both of these systems are WARC files. The article briefly describes how WARC captures map one-to-one to specific URLs, and how this requires some consolidation of the individual digital objects on each page. However, considering the potential benefit of being able to use a custom solution for web archiving, the author concludes by recommending the Heritrix web crawler and the WARC file format.
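To make the one-to-one relationship between a URL and its archived record concrete, the following is a minimal sketch that serializes each object on a page as its own WARC/1.0 "resource" record. It is not taken from Chudnov's article or from Heritrix; the function name `build_warc_record`, the example URLs, and the placeholder payloads are assumptions made for illustration, and only the header fields the standard marks as mandatory, plus the target URI and content type, are written.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record for a single URL (illustrative helper)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    # Version line, then named fields, then a blank line before the content block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record ends with two CRLF pairs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Hypothetical page: the HTML and its image are separate URLs, so each gets its own record.
page_objects = {
    "http://example.org/": (b"<html>placeholder</html>", "text/html"),
    "http://example.org/logo.png": (b"placeholder image bytes", "image/png"),
}

# A WARC file is the concatenation of such records.
archive = b"".join(
    build_warc_record(uri, body, ctype) for uri, (body, ctype) in page_objects.items()
)
```

Because every object is stored under its own URL, replaying the page later means gathering the related records back together, which is the consolidation step the article mentions.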
Duncan, S. (2015). Preserving born-digital catalogues raisonnés: Web archiving at the New York Art Resources Consortium (NYARC). Art Libraries Journal, 40(2), 50-55.
Hare, J., Dupplaw, D., Lewis, P., Hall, W., & Martinez, K. (2014). Exploiting multimedia in creating and analysing multimedia web archives. Future Internet, 6(2), 242–260. MDPI AG. Retrieved from http://dx.doi.org/10.3390/fi6020242
Heil, J. M., & Jin, Shan,C.R.M., C.I.P. (2017). Preserving seeds of knowledge: A web archiving case study. Information Management, 51(3), 20-24.
Kim, Y., & Ross, S. (2012). Digital forensics formats: Seeking a digital preservation storage container format for web archiving. The International Journal of Digital Curation, 7(2), 21-39. Retrieved from http://dx.doi.org/10.2218/ijdc.v7i2.227
This article addresses the variety of container formats available for digital preservation. Laying out a theoretical framework in which digital objects have five primary attributes, the piece goes on to assess how well each major type of file format performs. It helps to show some of the unique problems the WARC format was created to address, and it helps to crystallize some of the differences between WARC and two other file formats (TAR and AFF). The authors conclude that AFF is the best format of the three because of its “fidelity, integrity and authenticity” (p. 31). Understanding the various options for long-term digital preservation is important when choosing a particular file container. The purpose of this bibliography is not to assert the superiority of the WARC format but, rather, to provide further detail and context for its use. This article does a great job of providing an overview of the relative strengths and weaknesses of the main file container options in use.
Lin, J., Milligan, I., Wiebe, J., & Zhou, A. (2017). Warcbase: Scalable analytics infrastructure for exploring web archives. Journal on Computing and Cultural Heritage (JOCCH), 10(4), 1-30. Retrieved from http://dx.doi.org/10.1145/3097570
This article explores a software platform that ingests completed WARC files and uses them as a basis for further research. This shows another way WARC files can be of use: not just as the end points of an archival process but as the data points in a searchable dataset. Warcbase was developed specifically for use in the social sciences. However, this kind of platform provides a model for how other offline resources might find a future home. Some of the technical reasons behind the development of Warcbase had to do with the limitations on collection sizes found with other methods like “the Wayback machine”. Warcbase was designed with big data in mind and leverages technologies like Hadoop, Spark and HBase to help accommodate larger datasets. The authors explain that little research has been done into search and access behavior concerning archived datasets. Warcbase has “scalable analytics” built in that help scholars with their information searches. These tools are often command-line based and may come with a steeper learning curve than most users are used to with live pages. In spite of this limitation, Warcbase remains an interesting alternative way to reuse and access completed WARC files for scholarly purposes.
Pennock, M. (2013, March). Web-Archiving. DPC Technology Watch Report, 13(01). Retrieved from http://dx.doi.org/10.7207/twr13-01
Published by the Digital Preservation Coalition in 2013, Web-Archiving is a report written by Maureen Pennock. In it she provides a comprehensive overview of the major issues involved in the web archiving process. Sections are broken down by category, much like a textbook. This report serves as a primer and useful glossary for many of the concepts and services one is likely to encounter when attempting any kind of web archiving. This resource, as explained in the abstract, has been designed with the beginner in mind. There are eight major sections, divided into smaller subsections as needed. WARC files are mentioned several times. The first mention is in the context of the history of the organizations that have developed web archival standards. Later mentions refer to technical aspects of the WARC format, such as its ability to avoid storing duplicated data. This report also notes that WARC files conform to an official ISO standard, ISO 28500:2009, which can be quite helpful to those working in organizations concerned with following best practices. The sheer amount of information it gathers on web archiving makes this report a must-have for anyone concerned with the topic. It also provides multiple examples of the technical and legal standing of the WARC file format.
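The idea of avoiding duplicated data can be sketched as follows. This is an illustration of the general technique, not a procedure described in the report: each captured payload is hashed, and a capture whose digest has already been seen is planned as a "revisit" record (a record type defined in the WARC standard) that points back to the earlier copy instead of storing the bytes again. The function names, the example URLs, and the choice of SHA-1 with Base32 encoding for the digest label are assumptions for the example.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a labelled digest string for a payload (labelling scheme is an assumption)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def plan_records(captures):
    """Decide, for each (uri, payload) capture, whether to store it or reference an earlier copy.

    Returns a list of (uri, record_type, digest) tuples.
    """
    seen = set()
    plan = []
    for uri, payload in captures:
        digest = payload_digest(payload)
        # Identical content seen earlier: record a revisit rather than a second full copy.
        record_type = "revisit" if digest in seen else "response"
        seen.add(digest)
        plan.append((uri, record_type, digest))
    return plan


# Hypothetical crawl in which two URLs serve byte-identical content.
captures = [
    ("http://example.org/style.css", b"body { margin: 0; }"),
    ("http://example.org/old/style.css", b"body { margin: 0; }"),
    ("http://example.org/index.html", b"<html>placeholder</html>"),
]
for uri, record_type, digest in plan_records(captures):
    print(record_type, uri, digest)
```

For organizations that crawl the same sites repeatedly, this kind of check is what keeps unchanged pages from consuming storage on every pass.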
Thompson, D. (2008, December). Archiving web resources. DCC Digital Curation Manual. Day, M & Ross, S. (eds). Retrieved from http://www.dcc.ac.uk/resource/curation-manual/chapters/web-archiving
The Digital Curation Manual installment on archiving web resources is one of several feature-rich resources developed by the Digital Curation Centre (DCC). This particular manual was published in 2008 and provides a chapter-like breakdown of how to develop a digital preservation plan for web resources. This document can be accessed from the DCC website alongside other manuals that have been completed or are still in production. Some of the topics covered include the relationship of digital curation and web archiving, curating websites with an open access model, trusted repositories, data use and re-use, cost, technical obstacles, and recommendations for the future. While a bit dated, this document, in conjunction with other resources from the DCC website, helps to provide some very useful explanations of theory as well as recommendations for best practices. This manual addresses the need for, and implementation of, some kind of preservation plan. Any use of WARC files in a project would call for the background information presented here. This is a fantastic resource to use during the planning stage of a preservation plan.
Ury, C. (2009, January 27). Contribution from Warc Usage Task Force. In Warc Implementation Guidelines. Retrieved from https://web.archive.org/web/20170317191510/http:/netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf
This report was produced for the International Internet Preservation Consortium (IIPC). Unlike other reports that offer an overview of the field, this report is specific to the details of how to produce and use WARC files. Since the audience is assumed to be somewhat proficient and knowledgeable already, the report points out differences from earlier web archiving solutions and formats in order to show the advances of WARC. Written in 2009, this was one of the first attempts to create a useful handbook following the format's adoption as an ISO standard. It was written by a task force of experts on the WARC format and is intended to be a work in progress. Interestingly enough, it has also been captured in the Wayback Machine. Access through either the IIPC site or the Wayback Machine is smooth. The guidelines are organized into roughly three main categories: web harvesting, data packaging, and the operation of WARC files. For employees in the planning stages of their preservation plan, this document provides valuable tips on file manipulation, setting technical requirements, and memory management for WARC files.
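As a small illustration of why memory management matters when manipulating these files, the sketch below walks through a WARC stream one record at a time, reading only the headers and skipping over each content block instead of loading it. It is not drawn from the guidelines themselves. It assumes an uncompressed WARC stream (in practice files are often gzip-compressed) and does not handle folded header lines; the function name `iter_warc_headers` and the sample data are hypothetical.

```python
import io


def iter_warc_headers(stream):
    """Yield the header fields of each record in an uncompressed WARC stream.

    Content blocks are skipped with seek(), so memory use stays small
    regardless of how large the archived payloads are.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line.strip() == b"":
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Jump past the content block without reading it into memory.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)
        yield headers


# Hypothetical two-record archive held in memory for the example.
sample = (
    b"WARC/1.0\r\nWARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/a.css\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
)
uris = [h["WARC-Target-URI"] for h in iter_warc_headers(io.BytesIO(sample))]
```

The same loop works on a file opened in binary mode, which is how an inventory of captured URLs could be produced from a large archive without reading the payloads.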