Archiving the Internet

From SIS Wiki

Author: John Class

Introduction

This project examines the application of the WARC format to both current websites and older ones. WARC stands for Web ARChive format. Derived from the earlier ARC (ARChive) format, WARC was designed with the needs of web pages in mind. The goal of this bibliography is to explore how different organizations have implemented WARC files in their digital preservation process. The format itself has some technical limitations that should be kept in mind during the planning process. One of the main issues affecting the usefulness of WARC is the difference between preserving currently working websites and resuscitating old or obsolete ones. Another issue explored here is the learning curve involved in using WARC files. The open source nature of WARC is useful, but it also places a burden on any organization that has to learn how to use it. As a result, some organizations choose to work with a third-party vendor.

This list of resources is designed to help readers think through some of these issues during the planning stage. Some of the entries are more technical and explain the best way to work with the files. Other entries are case studies of organizations that have incorporated these files into their preservation planning process, and they show the results. Finally, some of the entries explore more conceptual questions concerning future needs and the changing nature of access.

Primary search terms will be:

  • WARC
  • Web Archiving
  • Preservation planning
  • Long term digital preservation
  • Digital preservation standards

Annotations

Bailey, J., Grotke, A., Hanna, K., Hartman, C., McCain, E., Moffat, C., & Tyler, N. (2017). Web Archiving in the United States: A 2016 Survey. National Digital Stewardship Alliance. Retrieved from http://dx.doi.org/10.17605/OSF.IO/R5PQK

This document was written for the National Digital Stewardship Alliance (NDSA) and presented as a report in 2017. It presents information concerning the archival needs and practices of dozens of participating organizations in the United States. One of the trends identified through this research was the increasing tendency of organizations to archive their own web materials. Since each organization has to build the capacity to use its own archival tools, there can be an accidental silo effect. This creates a need for individual organizations to find a better way to collaborate and share their archival resources, which calls for a stable and open source solution. WARC files are mentioned as a good candidate to solve this problem. Open source and versatile, WARC files are suitable for this task and facilitate sharing between organizations.


Chudnov, D. (2011). Saving the web. Computers in Libraries, 31(10), 30-32.

This article explains how WARC files are used in an automated backup solution. Beginning with a discussion of common organizational needs for backup, the author explains how the Internet Archive has addressed these with Archive-It. The author then goes on to explain an automated crawling solution called Heritrix. Backups created by both of these systems are WARC files. There is a brief description of how WARC records stand in a one-to-one relationship with a specific URL, and how this requires some consolidation of the individual digital objects that make up each page. However, considering the potential benefit of being able to use a custom solution for web archiving, the article concludes by recommending the Heritrix web crawler and the WARC file format.
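
To make the one-record-per-URL idea concrete, the following is a minimal sketch in Python, using only the standard library, of what a single uncompressed WARC "response" record looks like and how its header fields can be read back. It is an illustration under stated assumptions and is not taken from Chudnov's article: the helper names build_record and parse_headers, the example URL and the payload are all hypothetical, and real crawlers such as Heritrix write many such records, usually compressed, into a single file.

```python
import uuid
from datetime import datetime, timezone


def build_record(target_uri: str, http_payload: bytes) -> bytes:
    """Assemble one uncompressed WARC 'response' record (hypothetical helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                   # version line
        "WARC-Type: response",                        # kind of record
        f"WARC-Target-URI: {target_uri}",             # the URL this record captures
        f"WARC-Date: {now}",                          # capture time in UTC
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>", # unique record identifier
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",       # size of the content block
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Header block, blank line, content block, then two CRLFs close the record.
    return head + b"\r\n\r\n" + http_payload + b"\r\n\r\n"


def parse_headers(record: bytes) -> dict:
    """Read the named header fields of a single record back into a dict."""
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {"version": lines[0]}
    for line in lines[1:]:
        # Split on the first colon only, so URLs and URNs in values survive.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical capture of one page: HTTP headers plus body as returned by a server.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = build_record("http://example.org/", payload)
fields = parse_headers(record)
print(fields["WARC-Type"], fields["WARC-Target-URI"], fields["Content-Length"])
```

Because each record describes one capture of one URL, a page with images and stylesheets produces several records, which is the consolidation problem the annotation mentions.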


Duncan, S. (2015). Preserving born-digital catalogues raisonnés: Web archiving at the New York Art Resources Consortium (NYARC). Art Libraries Journal, 40(2), 50-55.

This article provides an in-depth look at how WARC files were used by a cultural heritage institution. In a web archiving project for the New York Art Resources Consortium (NYARC), researchers experimented with Archive-It and other web crawlers on various art-rich websites. Eventually, a project named “Making the Black Hole Gray” was started. Here the goal was to create the best WARC archives possible for art-based websites while staying under a limit of 2 terabytes. Since high-resolution images can be large, this space limitation prompted hard decisions about selection and, because quality affects size, about what resolution to capture. Further quality assurance revealed issues concerning the navigation of the site, or its “look and feel”. These issues would have to be solved by employing an outside vendor that could supply what the original solution had lacked.


Hare, J., Dupplaw, D., Lewis, P., Hall, W., & Martinez, K. (2014). Exploiting multimedia in creating and analysing multimedia web archives. Future Internet, 6(2), 242–260. MDPI AG. Retrieved from http://dx.doi.org/10.3390/fi6020242

This article, from the Web and Internet Science Research Group, addresses the aspects of web archiving related to multimedia content. It looks at several case studies on the needs of social media sites in particular. The approaches outlined here come from the Archiving Community MEMories (ARCOMEM) project. The ephemerality of social media is explained as a challenging obstacle to keeping a satisfactory record of these types of websites. Capturing enough volume and sufficiently current data for analysis is a problem. In discussing the export of the web backups, the authors point out where WARC files are situated in the ARCOMEM workflow. The WARC files are created as an export of the backup, but they do not have to remain unmodified at that stage. Access to these files is granted through the “Wayback Machine”, which serves as an online access point. Additionally, the ARCOMEM project allows for further analysis and metadata creation with the WARC files. This new, enriched version of the data is then exported again in a different file format (in this case RDF), which the researchers can use for post-crawl analysis. This use case helps to showcase how the WARC format is useful both as an access copy and as an intermediate format.


Heil, J. M., & Jin, S. (2017). Preserving seeds of knowledge: A web archiving case study. Information Management, 51(3), 20-24.

This article describes a one-year pilot study performed for the Queen's University Archives in Kingston, Ontario. Study participants decided that the project would be best accomplished using Archive-It and WARC files. One interesting consideration made during the planning stage was that Archive-It would be a better solution than other, more customizable services because of its low training needs. The planners decided it would be easier and cheaper to go with a more automated solution for their web archiving needs. Authors of other articles in this bibliography have come to the opposite conclusion. Understanding the needs of your organization and balancing them against cost and employee training time should always guide these decisions. Another important point, from the perspective of project planning, was that while the project itself took one year, the authors describe a planning process of three years. This points to the difficulty and importance of the planning process. One of the lessons learned while performing the archive project was the need to prioritize information. The study participants were limited to 500 GB of space, which was not nearly large enough to capture an archive of every webpage. Due to the way WARC files operate, each page must be captured in its entirety. Here, the solution was to change the frequency of future updates to accommodate the storage limitation. The lesson is that perfect, up-to-date capture may be impossible. Defining and adhering to project priorities is critical for a successful backup project.


Kim, Y., & Ross, S. (2012). Digital forensics formats: Seeking a digital preservation storage container format for web archiving. The International Journal of Digital Curation, 7(2), 21-39. Retrieved from http://dx.doi.org/10.2218/ijdc.v7i2.227

This article addresses the variety of container formats available for digital preservation. Starting from the theoretical idea that digital objects have five primary attributes, the piece goes on to address how well each major type of file format performs. It helps to show some of the unique problems the WARC file was created to address. Further, it helps to crystallize some of the differences between WARC and two other file formats (TAR and AFF). The authors conclude that AFF is the best format of the three because of its “fidelity, integrity and authenticity” (p. 31). Understanding the various options for long-term digital preservation is important when committing to a particular file container. The purpose of this bibliography is not to assert the superiority of the WARC format but, rather, to provide some further detail and context for its use. This article does a great job of providing an overview of the relative strengths and weaknesses of the main file container options in use.


Lin, J., Milligan, I., Wiebe, J., & Zhou, A. (2017). Warcbase: Scalable analytics infrastructure for exploring web archives. Journal on Computing and Cultural Heritage (JOCCH), 10(4), 1-30. Retrieved from http://dx.doi.org/10.1145/3097570

This article explores a software platform that ingests completed WARC files and uses them as a basis for further research. This shows another way WARC files can be of use: not just as the end points of an archival process but as the data points in a searchable dataset. Warcbase was developed specifically for use in the social sciences. However, this kind of platform provides a model for giving other offline resources a future home. Some of the technical reasons behind the development of Warcbase had to do with the limitations on collection size found with other methods such as the “Wayback Machine”. Warcbase was designed with big data in mind and leverages technologies like Hadoop, Spark and HBase to accommodate larger datasets. The authors explain that little research has been done into search and access behavior concerning archived datasets. Warcbase has “scalable analytics” built in that help scholars with their information searches. These tools are often command-line tools and may come with a steeper learning curve than most users are accustomed to with live pages. In spite of this limitation, Warcbase remains an interesting alternative way to reuse and access completed WARC files for scholarly purposes.


Pennock, M. (2013, March). Web-Archiving. DPC Technology Watch Report, 13(01). Retrieved from http://dx.doi.org/10.7207/twr13-01

Published by the Digital Preservation Coalition in 2013, Web-Archiving is a report written by Maureen Pennock. In it she provides a comprehensive overview of the major issues involved in the web archiving process. Sections are broken down by category, much like a textbook. The report serves as a primer and useful glossary for many of the concepts and services one is likely to encounter when attempting any kind of web archiving. As explained in the abstract, it has been designed with the beginner in mind. There are eight major sections, divided into smaller subsections as needed. WARC files are mentioned several times. The first mention is in the context of the history of the organizations that have developed web archival standards. Later mentions refer to technical aspects of the WARC format, such as its ability to avoid duplicated data. The report also notes that the WARC format conforms to an official ISO standard, ISO 28500:2009, which can be quite helpful to those working in organizations concerned with following best practices. The sheer amount of information on web archiving gathered here makes this a must-have for anyone concerned with the topic. The report also provides multiple examples of the technical and legal standing of the WARC file format.
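
The duplicate-avoidance idea mentioned above can be sketched briefly. The following Python example, which uses only the standard library, is a simplified illustration and not Pennock's description or any tool's actual implementation: it hashes each captured payload and marks a capture as a "revisit" (a short record that points back to an earlier capture) when an identical payload has already been stored. The function names payload_digest and plan_records and the sample captures are hypothetical, and the "sha1:" Base32 label is an assumed convention for the digest string.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a labelled SHA-1 digest of the payload, Base32-encoded."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_records(captures):
    """Decide, per capture, whether to store the payload or only a reference.

    captures: iterable of (uri, payload) pairs in crawl order.
    Returns a list of (uri, record_type, digest) tuples, where record_type is
    'response' for new content and 'revisit' for content already stored.
    """
    seen = set()
    plan = []
    for uri, payload in captures:
        digest = payload_digest(payload)
        if digest in seen:
            # Identical bytes were stored earlier: reference them instead.
            plan.append((uri, "revisit", digest))
        else:
            seen.add(digest)
            plan.append((uri, "response", digest))
    return plan


# Hypothetical crawl: the logo is fetched twice with identical bytes.
captures = [
    ("http://example.org/", b"<html>home</html>"),
    ("http://example.org/logo.png", b"\x89PNG-fake-image-bytes"),
    ("http://example.org/logo.png", b"\x89PNG-fake-image-bytes"),
]
plan = plan_records(captures)
for uri, record_type, digest in plan:
    print(record_type, uri)
```

In a repeated crawl of a largely unchanged site, most captures would fall into the second category, which is where the storage savings come from.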


Thompson, D. (2008, December). Archiving web resources. In M. Day & S. Ross (Eds.), DCC Digital Curation Manual. Retrieved from http://www.dcc.ac.uk/resource/curation-manual/chapters/web-archiving

The Digital Curation Manual installment on archiving web resources is one of several feature-rich resources developed by the Digital Curation Centre (DCC). This particular manual was published in 2008 and provides a chapter-like breakdown of the development of a digital preservation plan for web resources. The document can be accessed from the DCC website alongside other manuals that have been completed or are still in production. Some of the topics covered include the relationship between digital curation and web archiving, curating websites with an open access model, trusted repositories, data use and re-use, cost, technical obstacles, and recommendations for the future. While a bit dated, this document, in conjunction with other resources from the DCC website, helps to provide some very useful explanations of theory as well as recommendations for best practices. The manual addresses the need for, and the implementation of, a preservation plan. Any use of WARC files in a project would benefit from the background information presented here. This is a fantastic resource to use during the planning stage of a preservation project.


Ury, C. (2009, January 27). Contribution from Warc Usage Task Force. In Warc Implementation Guidelines. Retrieved from https://web.archive.org/web/20170317191510/http:/netpreserve.org/sites/default/files/resources/WARC_Guidelines_v1.pdf

This report was produced for the International Internet Preservation Consortium (IIPC). Unlike other reports that offer an overview of the field, this one is specific to the details of how to produce and use WARC files. Since the audience is assumed to be somewhat proficient and knowledgeable already, the report points out differences from earlier web archiving solutions and formats in order to show the advances of WARC. Written in 2009, it was one of the first attempts to create a useful handbook following the publication of the ISO standard. It was written by a task force of experts on the WARC format and is intended to be a work in progress. Interestingly enough, it has also been captured in the Wayback Machine, and access through either the IIPC site or the Wayback Machine works smoothly. The guideline is organized into roughly three main categories: web harvesting, data packaging and the operation of WARC files. For employees in the planning stages of a preservation plan, this document provides valuable tips on file manipulation, setting technical requirements and memory management for WARC files.