Internet Archive Wayback Machine

From SIS Wiki
Revision as of 21:41, 11 December 2020 by Csocha

The Internet Archive's Wayback Machine: Digital Portal To The Past

Annotations written by Cameron Socha

Introduction

This annotated bibliography seeks to provide a comprehensive historical overview of the Internet Archive Wayback Machine’s nearly twenty-five-year history. Since it began archiving webpages and implemented webcrawlers to automate the process in the late nineties, the Internet Archive has worked to provide the most comprehensive and user-friendly digital archive on the net, known as the Wayback Machine. Made available to the public in 2001, the Wayback Machine has since served as a versatile and useful tool for students, professionals, or anyone with an information need that requires historical data from archived web content. Early articles published around the time the Wayback Machine was released often cover the basics of how the archive operates and offer some pointers on best practices. Most of these articles are written in a friendly and informal tone. As time passed, articles pertaining to the Wayback Machine became more granular and technical, but they are worth the effort to read, as professionals have done interesting and creative things with the archive over time.

Annotations

AlNoamany, Y., AlSum, A., Weigle, M. C., & Nelson, M. L. (2014). Who and what links to the Internet Archive. International Journal on Digital Libraries, 14(3), 101-115. doi:10.1007/s00799-014-0111-5

This paper features an analysis of the server logs of the Internet Archive’s Wayback Machine. The intention is to provide answers to a variety of questions pertaining to Web archives and their users, including but not limited to: “Why do users come to Web archives? Do they come because they cannot find the Web pages on the live Web, or do they come because they want a copy of a Web page at a specific time? Where do Web archive users come from? Who links to Web archives? How do sites link to Web archives?” (p. 101). AlNoamany et al. conclude that many users of the Wayback Machine are using the archives to look for sites that no longer exist, as 65% of the archived pages they requested no longer exist on the Web. The study uses mixed sampling strategies, and to distinguish human from robot requests as well as possible, the authors applied robot detection heuristics in their analysis of the server logs. Unsurprisingly, users were most frequently referred from Wikipedia, likely a result of the multitude of dead links inevitably associated with a database as large as Wikipedia. In recent years, much of the literature published on the Wayback Machine appears to be primarily concerned with link rot and recovering dead URLs, or with using the archive to retrieve information and historical data. The authors also point out that the usefulness of being able to browse a timeline of vivid and comprehensive snapshots from a webpage’s past cannot be overstated.

Arora, S. K., Li, Y., Youtie, J., & Shapira, P. (2016). Using the Wayback Machine to mine websites in the social sciences: A methodological resource. Journal of the Association for Information Science and Technology, 67(8), 1904-1915. doi:10.1002/asi.23503

This article illustrates the Wayback Machine’s wide variety of applications in different disciplines; while social scientists are the primary audience in this particular instance, it is representative of the Wayback Machine’s versatility and shows how useful it can be for students and professionals studying and working in many different fields. Arora, Li, Youtie, and Shapira searched Google Scholar in May 2014 and found significant growth (an average annual rate of around 16%) in the number of documents referring to the Wayback Machine from 2004 to 2013. These documents fall into a diverse range of disciplines: technology management, bibliometrics, business, information technology, archives, libraries, and legal areas are mentioned specifically. I feel this is evidence of the Wayback Machine’s versatility and usefulness as a resource for the collection and analysis of historical data. For archivists and information professionals in particular, this powerful tool is likely indispensable. The data set analyzed in this instance consisted of “U.S.-based small- and medium-sized enterprises (SMEs) in the green goods industry” (p. 1905). The article also refers to the SMEs in the green goods industry as ‘GGCs’: green goods companies. This data is used throughout a six-step process of accessing archived historical data: “(a) sampling, (b) organizing and defining the boundaries of the web crawl, (c) crawling, (d) website variable operationalization, (e) integration with other data sources, and (f) analysis” (p. 1905). This particular project also used software such as VantagePoint and Java in conjunction with the Wayback Machine to automate webcrawling, data collection, and data analysis.

Bates, M. E. (2003). Archiving the web. Online, 27(6), 64.

This 2003 article by Mary Ellen Bates mentions a full-text search engine developed specifically for the Internet Archive’s Wayback Machine by Anna Patterson called CobWebSearch. Previously hosted in beta, [1] the search engine allowed for keyword searches supporting Boolean operators (AND/OR/NOT) and allowed for some manual categorization and limiting of search results, which were ranked by relevance. Bates describes charts at the top of the page that show the number of pages retrieved by the search over time and the relative frequency of related concepts, calling to mind the Wayback Machine’s intuitive “Calendar” feature. The Calendar displays the activity of a particular webpage over time and allows the user to interact with the chart, accessing a given date simply by clicking on the desired time period. Users were also given the option to limit their CobWebSearch results by time period. As CobWebSearch was a highly ambitious project, it was not without its problems. For instance, the search engine was case-sensitive, but the only indication given to the user was a list of suggested capitalization variants on the search results page. Also, according to Bates’ article, CobWebSearch covered only a third of the archive’s content, which was itself comprehensive yet still incomplete.

Ben-David, A., & Huurdeman, H. (2014). Web archive search as research: Methodological and theoretical implications. Alexandria, 25(1), 93-111. doi:10.7227/ALX.0022

Anat Ben-David and Hugo Huurdeman provide a comprehensive overview of search interfaces in web archives, and throughout they refer to the Wayback Machine as the premier example of a web archive used for scholarly research. However, they point out the limitations of the Wayback Machine’s single-URL access point compared with search interfaces, which are often more powerful and intuitive in the context of research. The paper provides a brief historical account of the creation, growth, and significance of the Internet Archive and the Wayback Machine. It also briefly describes the process of entering a single URL into the Wayback Machine to browse surrogate captures of the webpage across time. A distinction is made between “horizontal searching” and “vertical searching,” as the former describes browsing a single URL through different points in time while the latter describes linking to other webpages. The article mentions full-text searching as an alternative to but not a replacement for the Wayback Machine. Although the Internet Archive did attempt to implement full-text searching by adding a full-text search engine some time ago, it is very limited, as it is only capable of searching the home pages of websites, which compose only a small portion of the total number of webpages contained in the Wayback Machine. [2] Ben-David and Huurdeman note in the article’s conclusion that the Wayback Machine is still by and large the most prevalent web archive interface, as many web archives have yet to fully implement search interfaces or make them available for scholarly use.
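The single-URL access point described in this annotation can be illustrated with a short Python sketch. It is not taken from Ben-David and Huurdeman's article; it assumes the commonly used Wayback Machine address pattern, in which a 14-digit timestamp and the original URL are appended to https://web.archive.org/web/, and the archive then serves the capture nearest to that timestamp. The function names capture_url and captures_across_time are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Assumed public address prefix for archived captures.
WAYBACK_PREFIX = "https://web.archive.org/web"


def capture_url(page_url, when):
    """Build the address of the capture of page_url closest to `when`."""
    # Wayback timestamps are written as YYYYMMDDhhmmss.
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{page_url}"


def captures_across_time(page_url, years):
    """Return one capture address per year for the same page."""
    return [capture_url(page_url, datetime(year, 1, 1)) for year in years]


if __name__ == "__main__":
    for address in captures_across_time("http://example.com/", [2001, 2010, 2020]):
        print(address)
```

Each generated address targets the same page at a different point in time, which is the kind of browsing the Wayback Machine supports without any search interface.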

Gordon-Murnane, L. (2018). Linkrot + content drift = reference rot. Online Searcher, 42(6), 10-17.

In this 2018 article Laura Gordon-Murnane details the role that the Internet Archive’s Wayback Machine plays in combatting link rot and content drift, as well as the combination of the two, referred to as “reference rot.” Gordon-Murnane elaborates upon the Wayback Machine’s “Save This Page” tool, [3] a Firefox add-on called “No More 404s,” as well as two Chrome extensions: “Save to the Wayback Machine” from March 2016 [4] and an extension from January 2017 that detects 404 (page not found) errors and asks the user if they would like to see the archived page courtesy of the Wayback Machine. [5] Gordon-Murnane suggests using the “Save This Page” extension to save any and all important links that appear in important documents, ensuring greater longevity and less chance of reference rot in the future. The ability to look back at older Web content as it appeared at a specific date and time “enables and facilitates research that cuts across all disciplines, from historians who want to look at changes to prior government administrations, economists monitoring changing company strategies and products, social scientists tracking changes in social attitudes and values over time, lawyers introducing screenshots from long gone websites to prove copyright and IP infringement violations, and journalists who need access to historical digital materials that verify and authenticate accurate news and expose ‘fake news.’” (p. 15)
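As a rough illustration of how a tool might check whether an archived copy exists for a dead link, the following Python sketch builds a query for the Internet Archive's public availability endpoint and reads the closest snapshot out of a response. This is an assumption-laden sketch, not the implementation of the browser extensions Gordon-Murnane describes. The function names are hypothetical, and SAMPLE_RESPONSE is a hand-written example of the response shape rather than real output; no network request is made.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability service.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

# Hand-written illustration of a response body, not real output.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})


def availability_query(url, timestamp=None):
    """Build the query address for a page, optionally near a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived address from a response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


if __name__ == "__main__":
    print(availability_query("example.com", "20060101"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

A tool that encounters a 404 error could send the query, then offer the returned address to the user when a snapshot is available.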

Graham, M. (2020). Fact checks and context for Wayback Machine pages. Internet Archive Blogs. [6]

Mark Graham recently posted this brief article on the Internet Archive Blogs concerning the implementation of contextual information notices from fact checking organizations on archived pages in the Wayback Machine. These notices are prominently displayed near the top of the page in a bright yellow banner. The proliferation of disinformation via the World Wide Web (social media in particular) has led to a resurgence in archiving and curating problematic information that might simply have been deleted as recently as ten years ago. Providing access to potentially controversial content with context prominently displayed is in alignment with several core principles of information science, such as the provision of open and unrestricted access to worthwhile sources of quality information for the public. In this ambitious undertaking, the Internet Archive has built upon the work of many renowned organizations, such as “FactCheck.org, Check Your Fact, Lead Stories, Politifact, Washington Post Fact-Checker, AP News Fact Check, USA Today Fact Check, Graphika, Stanford Internet Observatory, and Our.news.” (2020)

Kumar, B. T. S., Kumar, D. V., & Prithviraj, K. R. (2015). Wayback Machine: reincarnation to vanished online citations. Program: Electronic Library and Information Systems, 49(2), 205-223. doi:10.1108/PROG-07-2013-0039

This journal article takes a comprehensive, in-depth look at vanished citations (a common manifestation of dead links, as many researchers can tell you) and makes a case for the Wayback Machine as a tool to recover lost citations. As other peer-reviewed literature attests, the Wayback Machine is an excellent tool for recovering dead links discovered when searching for references in a given scholarly journal. Ironically, articles similar to this one (with subject matter pertaining to lost and dead links) will often contain citations that feature dead links. The study analyzes both the decay and recovery of online citations from scholarly journals published between 2008 and 2012. The authors attempted to discover the rate of loss of online citations and to identify the correlation between path depth and recovery. The study also set out to calculate the half-life (before and after recovery) of vanished online citations. Some notable findings were inconsistencies in the volume of online citations in scholarly journals published between 2008 and 2012, and a significant portion (30.98%) of inaccessible citations. The article points out how detrimental missing citations can be for their hosts as well as for students and professional researchers, since each one “challenges the reader’s traditional assumption of reference availability and access” and together they “tend to stymie the ability of a reader to further investigate interesting or significant aspects of an article.” (p. 220)

Kumar, D. V., & Kumar, B. T. S. (2017). Recovery of vanished URLs: Comparing the efficiency of Internet Archive and Google. Malaysian Journal of Library & Information Science, 22(2), 31-43. doi:10.22452/mjlis.vol22no2.3

This study conducted by D. Vinay Kumar and B. T. Sampath Kumar compares the Internet Archive and Google based upon their ability to retrieve dead links. The reasons why a URL might vanish are myriad, and almost anyone who has conducted scholarly research in the past can tell you that dead links are an all-too-common problem. Many of these very same individuals may be completely unaware of the Internet Archive’s Wayback Machine, a resource that doesn’t receive as much press as a search engine like Google but serves as a more efficient tool for recovering and accessing dead links than a search engine can. It should be noted that this is far from the first study of its kind. The article cites many similar studies conducted in the past, several of which concluded that a search engine was able to retrieve more URLs, but the majority of studies mentioned favored the Wayback Machine as the more efficient tool in retrieving dead links. Like the similar studies that preceded it, the team compared the efficiency in retrieving URLs associated with different HTTP errors (e.g. 301, 404), domains (e.g. .com, .org, .gov), and file types (e.g. PDF, DOC) and found that the Wayback Machine was more efficient in retrieving vanished URLs in almost every instance. Some of these instances drastically favor the Wayback Machine; for example, out of the sample of 226 missing URLs associated with HTTP error 404, the Wayback Machine retrieved 154 URLs (68.14%) as opposed to Google’s 41 URLs (18.14%). Another example can be found in the sample of 115 missing URLs associated with the “.org” domain, as the Wayback Machine retrieved 82 of these URLs (71.30%) as opposed to Google’s retrieval of 24 URLs (20.87%).
The study also concludes that, for URLs cited in each year from 2009 to 2013, the Wayback Machine recovered more vanished URLs than Google, with the most significant disparity occurring in 2012, when the Wayback Machine recovered 92 URLs (80.70%) compared with Google’s recovery of 28 URLs (24.56%) from a sample of 114 missing URLs. In summation, the study found that the Wayback Machine was able to retrieve 66.19% of vanished URLs while Google was only able to retrieve 30.70%; one may thus conclude from the results of this study that the Wayback Machine is a more efficient tool than Google in recovering dead links.
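The recovery percentages quoted above follow directly from the reported counts, and a few lines of Python make the arithmetic easy to check. This sketch is only a verification aid and not part of Kumar and Kumar's method; the function name recovery_rate is hypothetical, and the counts are the ones given in this annotation.

```python
def recovery_rate(recovered, sample_size):
    """Percentage of vanished URLs recovered, rounded to two decimals."""
    return round(100 * recovered / sample_size, 2)


# (label, recovered URLs, sample size), as reported in the annotation.
REPORTED = [
    ("Wayback Machine, HTTP 404", 154, 226),
    ("Wayback Machine, .org domain", 82, 115),
    ("Google, .org domain", 24, 115),
    ("Wayback Machine, cited in 2012", 92, 114),
    ("Google, cited in 2012", 28, 114),
]

if __name__ == "__main__":
    for label, recovered, sample_size in REPORTED:
        print(f"{label}: {recovery_rate(recovered, sample_size):.2f}%")
```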

Notess, G. R. (2002). The Wayback Machine: the web’s archive. Online, 26(3), 59–61.

This article gives a brief and accessible introduction to the Internet Archive’s Wayback Machine. The article, published in 2002, mentions that the archive is not text-searchable. While this is still the case today, the “Using The Wayback Machine” page (updated as recently as four months ago) on the Internet Archive website does state that they would like to implement a full-text search engine at some point. [7] Vinay Goel’s post from October 2016 on the Internet Archive Blogs announced the implementation of a keyword search in the Beta Wayback Machine. [8] The keyword search was eventually implemented in the Wayback Machine, but it is rather limited in scope and mainly helps one find relatively well-known websites that have been indexed. Another limitation of the Wayback Machine that Notess mentions is how much material is missing, an issue that has only been exacerbated by the explosive growth of the World Wide Web since the Internet Archive began archiving for the Wayback Machine in 1996.

Pearce, D., & Charlton, B. G. (2009). Plagiarism of online material may be proven using the Internet Archive Wayback Machine (archive.org). Medical Hypotheses, 73(6), 875-875. doi:10.1016/j.mehy.2009.07.049

This short editorial provides a theoretical example of the Internet Archive’s Wayback Machine being used to enforce intellectual property rights by using the archives to check a published journal article for plagiarism. Essentially, in this hypothetical example the archives are used to defend intellectual property rights by providing an archived record of a webpage for comparison with a given document to check whether any plagiarism has occurred. This method is limited by the Wayback Machine’s shortcomings when it comes to searching the archives, and finding documents or webpages to check for plagiarism using the methods described in the article could be more difficult than one might presume. The “Save Page Now” function is worth mentioning here: although this article was published years before that function was released, saving URLs is a good habit for scholarly literature, as it will make things easier for other students and professionals using the archive in the future. I find that creative and practical applications of the Wayback Machine such as this one attest to the versatility and ingenuity of this particular digital archive. The wealth of literature characterizing the Wayback Machine as an effective tool for recovering lost, incomplete, or historically relevant information is evidence of its value, and the wide variety of possible applications makes the Wayback Machine a unique and indispensable tool for students, professionals, and the general public.