Tools for Archiving Websites
Archiving Websites: A Survey of the Tools and Techniques Used to Preserve the World Wide Web
Annotations by Cassandra Meyer
Definition of Project
Web archiving is the practice of preserving web pages so that they do not become inaccessible or obsolete. Though a relatively new practice, web archiving can be performed using many tools and techniques. This bibliography examines specific tools, techniques, and approaches used to perform web archiving efficiently. Subtopics covered in the examination include specific programs used, capturing API data, social media archiving, metadata standards for web archiving, and mobile web archiving. The following terms were most relevant for searching the databases to discover literature on the topic: web archiving, digital archives, born-digital preservation, and website preservation. The databases that produced the most relevant material include Taylor & Francis Online, Library and Information Science Abstracts (LISA), and ProQuest’s Library Science Database. The annotations in this bibliography represent some of the most widely used tools and techniques in web archiving from 2009 to 2018 and provide a broad survey of the approaches used to preserve web pages.
Annotations
Antracoli, A., Duckworth, S., Silva, J., & Yarmey, K. (2014). Capture all the URLs: First steps in web archiving. Pennsylvania Libraries: Research & Practice, 2(2), 155-170. https://doi.org/10.5195/palrap.2014.67
This article details the use of the subscription service Archive-It to curate and preserve websites. Archive-It, a subscription service of the Internet Archive, uses open source software to crawl the web and to identify, select, and store data for subscribers in the open WARC format. Its web crawler, Heritrix, can even crawl password-protected content on behalf of authorized users, ensuring that all desired content is collected. Archive-It also allows subscribers to curate their own data so that they may perform their own quality control. Although Archive-It requires an ongoing subscription fee, content archived before a subscription ends remains accessible afterward. Archive-It is a popular service for curating and preserving web pages and is a very useful tool to consider when researching web archiving tools and techniques.
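To illustrate the WARC format in which Archive-It stores crawled content, the following minimal Python sketch reads response records from a WARC file using the third-party warcio library; the file name is a placeholder, and the sketch is independent of Archive-It's own software.

```python
# A minimal sketch of reading captures from a WARC file, the open format in
# which services such as Archive-It store crawled content. Requires the
# third-party warcio library (pip install warcio); the file name is illustrative.
from warcio.archiveiterator import ArchiveIterator

with open("example-crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # 'response' records hold the content returned by the crawled server
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            print(url, len(payload), "bytes")
```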
Denev, D., Mazeika, A., Spaniol, M., & Weikum, G. (2011). The SHARC framework for data quality in Web archiving. The VLDB Journal – The International Journal on Very Large Data Bases, 20(2), 183-207.
The authors of this article present their framework, SHARC (Sharp Archiving of Web-site Captures), which increases data quality in web archiving. This approach considers two quality measures: blur and coherence. Blur reflects the expected number of page changes that occur while a site is being crawled, which results in an inconsistent capture; coherence reflects the extent to which pages are captured without change during the crawl. Using the SHARC approach, the authors devise strategies that decrease blur and increase coherence, including predicting the patterns of page changes and scheduling visits accordingly. This highly technical article presents the algorithms and underlying mathematics of web archiving within the SHARC framework. The SHARC approach is useful because it improves the quality of the resulting web archive.
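As a toy illustration of the blur measure (not the authors' actual algorithm), the following sketch estimates blur as the expected number of page changes that occur during a crawl, assuming each page changes independently at a known average rate.

```python
# Illustrative only: a toy estimate of "blur" as the expected number of page
# changes during a crawl, assuming each page changes at a known average rate.
# This is a simplification for illustration, not the SHARC authors' algorithm.
def expected_blur(change_rates_per_hour, crawl_duration_hours):
    """Sum of expected changes across all pages while the crawl is running."""
    return sum(rate * crawl_duration_hours for rate in change_rates_per_hour)

# Hypothetical site: three pages changing 0.1, 0.5, and 2.0 times per hour,
# crawled over a 4-hour window.
rates = [0.1, 0.5, 2.0]
print(expected_blur(rates, 4.0))  # 10.4 expected changes -> high blur
```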
Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195-2209.
Dougherty and Meyer describe current approaches to web archiving gleaned from interviews with six information professionals. While the focus of the interviews was understanding the data archiving needs of social scientists, the findings are applicable to web archiving as a whole. The authors argue that web archiving should be included in the realm of other archived media such as books, digital records, or images (p. 2200). Currently, web archiving is often viewed as an isolated domain, when in fact it contributes to the broader field of archives; considering web archives in the same sphere as traditional archives would give the discipline more attention and increase its practice and visibility (p. 2201). The authors also explain how collaboration is key to successful web archiving: working with others offers different perspectives and techniques, and because web archiving is a relatively new field, collaboration is an effective way to learn from others and innovate. The importance of collaboration and the recommended inclusion of web archiving within traditional archiving are two considerations to weigh before beginning a web archiving project.
Espley, S., Carpentier, F., Pop, R., & Medjkoune, L. (2014). Collect, preserve, access: Applying the governing principles of The National Archives UK Government Web Archive to social media content. Alexandria, 25(1), 31-50.
This article presents the web archiving efforts of The National Archives of the United Kingdom and the Internet Memory Foundation (IMF) to crawl and capture Twitter and YouTube API data. This approach to web archiving captures only raw tweet data from Twitter, the URLs linked in the tweets, metadata for the tweets, and the videos and video metadata from YouTube. The IMF archive displays a Twitter or YouTube homepage interface so that users can connect the rendered tweet or video data with the social media sites from which it came. The use of Twitter and YouTube interfaces is an innovative way to maintain the appearance of the websites in the archive while effectively presenting the XML and JSON content returned by the APIs. All of the data collected is then stored on the IMF infrastructure, which consists of HBase, a database built on Hadoop, and the Hadoop Distributed File System. These tools and approaches facilitate efficient API data capture, storage, and presentation, and allow The National Archives of the UK to make years of archived tweets and videos publicly available.
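The following hedged sketch illustrates API-based capture in general terms: it fetches raw JSON from a social media API and stores it alongside minimal capture metadata. The endpoint, account, and token are hypothetical placeholders, not part of The National Archives' or the IMF's actual pipeline.

```python
# A hedged sketch of API-based capture: fetch raw JSON from a social media API
# and store it with basic capture metadata. The endpoint, parameters, and token
# below are hypothetical placeholders.
import datetime
import json
import requests

API_URL = "https://api.example.com/2/users/some_account/tweets"  # hypothetical
headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}            # hypothetical

response = requests.get(API_URL, headers=headers, timeout=30)
response.raise_for_status()

capture = {
    "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "source_url": API_URL,
    "payload": response.json(),   # the raw tweet data, kept unmodified
}
with open("tweets-capture.json", "w", encoding="utf-8") as out:
    json.dump(capture, out, indent=2)
```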
Hswe, P., Kaczmarek, J., Houser, L., & Eke, J. (2009). The web archives workbench (WAW) tool suite: Taking an archival approach to the preservation of web content. Library Trends, 57(3), 442-460.
Hswe, Kaczmarek, Houser, and Eke describe the use of OCLC’s Web Archives Workbench in this article, detailing the usefulness of its “Arizona Model” of web content archiving. The “Arizona Model”, as the authors explain, is “…An aggregate-based approach to Web archiving designed to bridge the gap between human selection and automated capture” (p. 444). This efficient approach allows human involvement in the selection of content while using automated harvesting to drastically reduce the work that manually archiving web pages would require. The Web Archives Workbench itself comprises five tools, which identify, select, describe, and harvest content and ultimately prepare digital objects for ingestion into a repository: the discovery tool, the properties tool, the analysis tool, the harvest tool, and the system tools. Users need not employ every tool, which allows for an adaptable web archiving experience. The Web Archives Workbench is an important web archiving tool because its archival principles and adoption of the “Arizona Model” of content capture make it flexible and efficient.
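As a rough illustration of the aggregate-based workflow the “Arizona Model” describes (automated discovery, human selection, description, automated harvest), the following skeleton uses placeholder functions and hypothetical domains; it is not OCLC's actual tool suite.

```python
# A generic skeleton of an aggregate-based workflow: automated discovery of
# candidate sites, human selection, description, and hand-off to an automated
# harvester. Function bodies are simplified placeholders, not OCLC's tools.
def discover(seed_domains):
    """Automatically propose candidate entry points within the seed domains."""
    return [f"https://{domain}/" for domain in seed_domains]

def select(candidates, curator_approved):
    """Keep only the candidates a human curator has marked in scope."""
    return [url for url in candidates if url in curator_approved]

def describe(url, collection):
    """Attach minimal descriptive metadata before harvest."""
    return {"url": url, "collection": collection}

def harvest(described_sites):
    """Hand the selected, described sites to an automated crawler (stub)."""
    for site in described_sites:
        print("queueing for crawl:", site["url"], "->", site["collection"])

candidates = discover(["archives.example.gov", "library.example.edu"])
approved = {"https://archives.example.gov/"}          # the human selection step
harvest([describe(url, "state-agencies") for url in select(candidates, approved)])
```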
Littman, J., Chudnov, D., Kerchner, D., Peterson, C., Tan, Y., Trent, R., Vij, R., & Wrubel, L. (2018). API-based social media collecting as a form of web archiving. International Journal on Digital Libraries, 19(1), 21-38.
This article describes the use of the Social Feed Manager application to archive websites. Social Feed Manager is an open source program that captures data from Twitter’s API. The authors explain how this application can also be used in web archiving, demonstrating that social media collecting should be considered a form of web archiving, as both aim to harvest data from a website. Social Feed Manager harvests data from the social media site’s API, as opposed to harvesting data from the website itself, which is the approach of many web archiving tools. Collecting data from the API has several advantages, including more efficient data harvesting (p. 3). However, harvesting from the API rather than from the website itself does have limitations. Social Feed Manager harvests only content such as tweets and does not harvest the data that gives those tweets context, that is, the web page itself; it collects XML and JSON data rather than the HTML of web pages. Social Feed Manager is therefore best suited to archiving data that does not require the context of the web page from which it originated.
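To illustrate the idea of API collecting as a form of web archiving, the following sketch writes a JSON API response into the same WARC container format used for conventional web archives, using the third-party warcio library. The endpoint and payload are hypothetical, and this is not presented as Social Feed Manager's exact implementation.

```python
# A sketch of preserving an API response in the WARC container format used for
# conventional web archives, via the third-party warcio library. The endpoint
# and payload are hypothetical; this is not Social Feed Manager's own code.
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

api_url = "https://api.example.com/2/tweets/12345"        # hypothetical endpoint
json_payload = b'{"id": "12345", "text": "example tweet"}'  # hypothetical payload

with open("api-capture.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK", [("Content-Type", "application/json")], protocol="HTTP/1.1"
    )
    record = writer.create_warc_record(
        api_url, "response", payload=BytesIO(json_payload), http_headers=http_headers
    )
    writer.write_record(record)
```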
McCown, F., Yarbrough, M., & Enlow, K. (2015). Tools for discovering and archiving the mobile web. D-Lib Magazine, 21(3/4).
Websites viewed on mobile devices are often distinct mobile versions of the pages served to desktop browsers. Like desktop web content, mobile web content deserves archiving, as it often differs from what is displayed at the same URL on a PC. The authors of this article present MobileFinder, a web archiving tool specifically designed to distinguish mobile URLs from desktop URLs. MobileFinder distinguishes mobile sites from desktop sites by examining a particular tag in the HTML code: for the same URL, the tag contains slightly different values in the mobile and desktop HTML. Once the distinction is made, a web crawler can be directed to capture data from the mobile site. MobileFinder is an important tool for the subfield of mobile web archiving, as general-purpose web crawlers cannot on their own distinguish mobile from desktop URLs.
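The following sketch illustrates one way such a distinction could be made: request the same URL with desktop and mobile User-Agent strings and compare an HTML tag between the two responses. The choice of the viewport meta tag is an assumption for illustration only; the article's MobileFinder tool may inspect a different tag.

```python
# A hedged sketch of telling a mobile page apart from its desktop counterpart by
# requesting the same URL with two User-Agent strings and comparing an HTML tag.
# The viewport <meta> tag is an illustrative assumption, not necessarily the tag
# MobileFinder examines. Requires requests and beautifulsoup4.
import requests
from bs4 import BeautifulSoup

DESKTOP_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
MOBILE_UA = "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X)"

def viewport_value(url, user_agent):
    html = requests.get(url, headers={"User-Agent": user_agent}, timeout=30).text
    tag = BeautifulSoup(html, "html.parser").find("meta", attrs={"name": "viewport"})
    return tag.get("content") if tag else None

url = "https://www.example.com/"  # hypothetical URL
if viewport_value(url, DESKTOP_UA) != viewport_value(url, MOBILE_UA):
    print("URL serves distinct mobile content; queue the mobile version for crawling")
```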
Romaniuk, L. M. (2014). Metadata for a web archive: PREMIS and XMP as tools for the task. Library Philosophy & Practice, 1–20.
In this article, Romaniuk combines two metadata standards, XMP and PREMIS, and assesses their suitability for building web archives. XMP, or Extensible Metadata Platform, is a metadata schema, while PREMIS, or PREservation Metadata: Implementation Strategies, is a data dictionary for preserving digital objects. Romaniuk successfully records metadata using an XMP schema and the controlled vocabulary of PREMIS. The author recommends using the Adobe Bridge application with these standards, as this application “…Significantly improved cataloguing efficiency” (p. 9). While the use of XMP and PREMIS within the context of creating web archives has minor flaws, the two standards are overall well suited to recording the metadata required to preserve web page content, especially when used in conjunction with an application such as Adobe Bridge.
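As a minimal illustration, not the author's workflow, the following sketch builds an XMP-style RDF/XML packet that carries PREMIS-like properties using Python's standard library; the PREMIS namespace URI and property names are assumptions chosen for illustration only.

```python
# A minimal sketch of recording PREMIS-style preservation properties inside an
# XMP (RDF/XML) packet using only the standard library. The PREMIS namespace
# URI and property names are illustrative assumptions, not a complete or
# authoritative mapping of either standard.
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
PREMIS_NS = "info:lc/xmlns/premis-v2"  # assumed namespace for illustration

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)
ET.register_namespace("premis", PREMIS_NS)

rdf = ET.Element(f"{{{RDF_NS}}}RDF")
desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                     {f"{{{RDF_NS}}}about": "https://www.example.org/archived-page"})
ET.SubElement(desc, f"{{{DC_NS}}}title").text = "Example archived web page"
ET.SubElement(desc, f"{{{PREMIS_NS}}}eventType").text = "capture"
ET.SubElement(desc, f"{{{PREMIS_NS}}}eventDateTime").text = "2018-01-01T00:00:00Z"

print(ET.tostring(rdf, encoding="unicode"))
```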
Saad, M. B., & Gançarski, S. (2012). Archiving the web using page changes patterns: A case study. International Journal on Digital Libraries, 13(1), 33-49.
In this article, Saad and Gançarski discuss the importance of maintaining quality records in web archives by ensuring that they are as complete as possible. To achieve this, a web crawler must capture as many versions of a page as possible to create a complete and accurate record; because web pages change daily or even hourly, the number of crawls required can be enormous. The authors suggest that patterns in page changes should first be determined and then used to crawl web pages at optimum times. Unimportant page changes, such as advertisements, should not be taken into consideration; only important page changes should trigger a new capture. Ultimately, when a pattern of significant page changes is discerned and crawls are conducted regularly at those times, a more complete, higher-quality web archive results.
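A hedged sketch of this idea follows: it strips page regions assumed to hold advertisements, hashes the remaining text, and flags a page as significantly changed only when that hash changes. The CSS class names treated as unimportant are hypothetical placeholders, not the authors' method.

```python
# A hedged sketch of ignoring unimportant changes: strip regions assumed to hold
# advertisements, hash what remains, and recrawl only when that hash changes.
# The CSS classes treated as "unimportant" are hypothetical placeholders.
import hashlib
from bs4 import BeautifulSoup

UNIMPORTANT_CLASSES = {"ad", "advert", "sponsored"}  # hypothetical

def significant_fingerprint(html):
    """Hash only the text that remains after ad-like regions are removed."""
    soup = BeautifulSoup(html, "html.parser")
    selector = ", ".join("." + name for name in sorted(UNIMPORTANT_CLASSES))
    for element in soup.select(selector):
        if not element.decomposed:      # skip nodes already removed with an ancestor
            element.decompose()
    text = " ".join(soup.get_text().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def page_changed_significantly(previous_html, current_html):
    return significant_fingerprint(previous_html) != significant_fingerprint(current_html)
```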
Seneca, T. (2009). The web-at-risk at three: Overview of an NDIIPP web archiving initiative. Library Trends, 57(3), 427-441. https://doi.org/10.1353/lib.0.0045
In this article, Seneca describes the Web Archiving Service (WAS), which was created by the Web-at-Risk project, part of the National Digital Information Infrastructure and Preservation Program (NDIIPP). The project preserves government and political websites for researchers. WAS was developed with user needs in mind, making data capture efficient and the program intuitive to use. A curator first supplies the web crawler with a seed URL and then launches a capture; a crawl can take up to 36 hours, after which the curator can review the captured content. Finally, collections can be created from the captured web content. The Web Archiving Service is a user-friendly web archiving application created to crawl web pages, collect digital content, and build collections efficiently in one program.
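As a generic illustration of seed-based capture, not the WAS software itself, the following sketch starts from a hypothetical seed URL, follows links within the same site to a shallow depth, and keeps each fetched page in memory.

```python
# A generic sketch of seed-based capture in the spirit of curator-driven tools
# like WAS: start from a seed URL, follow links within the same site to a
# shallow depth, and save each fetched page. This is an illustration, not the
# Web Archiving Service's actual crawler. Requires requests and beautifulsoup4.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth=1):
    captured = {}                      # url -> HTML
    frontier = [(seed_url, 0)]
    seed_host = urlparse(seed_url).netloc
    while frontier:
        url, depth = frontier.pop()
        if url in captured or depth > max_depth:
            continue
        html = requests.get(url, timeout=30).text
        captured[url] = html
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == seed_host:   # stay within the seed site
                frontier.append((target, depth + 1))
    return captured

pages = crawl("https://www.example.gov/", max_depth=1)  # hypothetical seed
print(len(pages), "pages captured")
```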