Wayback Machine

From SIS Wiki

The Internet Archive’s Wayback Machine: Then and Now

Created by Lynne Lambdin

Definition of Project

The Internet Archive, a non-profit organization, was founded with the intention of capturing the contents of the internet to provide long-term records and access. After crawling the world wide web for several years, it presented its archived data to the public: in 2001, the Wayback Machine was formally introduced. The Wayback Machine acts as an interface that allows researchers, librarians, and the general public to view content that has long since disappeared from the live web. Since its launch, the Wayback Machine has proven useful in numerous ways, but it has also drawn criticism regarding its capabilities and its ease of access. Over time the Internet Archive has received numerous grants both to continue its venture of archiving the internet and to fund improvements on the original launch. These funds have in turn allowed important data, websites, and videos to remain available for years to come.

Annotations

Agata, T., Miyata, Y., Ishita, E., Ikeuchi, A., & Ueda, S. (2014). Life span of web pages: A survey of 10 million pages collected in 2001. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 463-464). doi:10.1109/JCDL.2014.6970226

Since the Wayback Machine was presented to the public in 2001, it has been the subject of many studies, which have provided further detail on its importance and capabilities. This article used the Wayback Machine's archive to identify the average life span of web pages and the stability of their URLs. The researchers selected 10 million web pages in 2001 and followed their existence for about twelve years; each page had to have been archived by the Wayback Machine so that its birth and death could be traced. After twelve years and numerous crawls, the researchers determined that ninety percent of the pages had a URL that expired within those twelve years, and that the average lifespan of an active URL was around 1,132.1 days (just over three years). The article provides detailed coverage of the study's findings, which support the importance of archiving the internet given how frequently it changes.

AlNoamany, Y., AlSum, A., Weigle, M. C., & Nelson, M. L. (2014). Who and What Links to the Internet Archive. International Journal on Digital Libraries, 14(3-4), 101-115. doi:10.1007/s00799-014-0111-5

The Wayback Machine is a free web application available to anyone with internet access; however, it is not being used to its full potential. Many human users seem unaware that such a tool exists, or have not yet had a need for it. This study examined the human and robotic users of the Wayback Machine and the frequency with which each group uses it. Results determined that 81.9% of human users were referred to the tool by another web page; top referrers included Wikipedia, archive.org, and Reddit.com. By contrast, only 15.2% of robot users reached the Wayback Machine through another website's referral: nearly 85% of robotic users arrived already aware that the tool was available. Based on the results and concluding remarks, the Wayback Machine is highly underutilized among human users, whether from a lack of awareness or a lack of need, although having other sites redirect users to the Wayback Machine is certainly a step in the right direction.

Arora, S. K., Youtie, J., & Shapira, P. (2016). Using the Wayback Machine to Mine Websites in the Social Sciences: A Methodological Resource. Journal of the Association for Information Science and Technology, 67(8), 1904-1915. doi:10.1002/asi.23503

The Internet Archive aims to crawl a great breadth of information and is not intended to discriminate by subject matter. Nevertheless, some social science researchers took interest in the unequal distribution of published information about the Wayback Machine across topics. The study used Google Scholar to count the scholarly articles published each year that referenced the Wayback Machine or the Internet Archive. The search returned nearly five thousand results discussing or referencing the web application, and a further breakdown showed a steady increase in the literature over the years, with a minor dip in 2004; between 2005 and 2013, Google Scholar results on the Wayback Machine increased by an average of 16%. Because the researchers had a special interest in the social sciences, they filtered out articles that did not cover that topic, and they concluded that literature dealing with both the Wayback Machine and the social sciences had decreased over the years. This sparked interest in other subject areas, so the researchers assigned a field of study to the articles that had been removed for not dealing with the social sciences. They determined that 37% of the returned articles could be sorted into a specific field of research; of those, 31% referenced information technology, 16% dealt with librarianship, and 11% focused on law. Overall, the study found discrepancies in the fields being represented: some areas receive greater attention than others, indicating an unequal balance in what gets crawled.

Bibb, D. D. (2002). Don't Cache Out: What To Do When the Server Isn't There. Internet Reference Services Quarterly, 7(4), 35-39. doi:10.1300/J136v07n04_04

In recent years, the government has increasingly communicated information digitally. This became an issue in 2001, when Judge Royce C. Lamberth ordered the Department of the Interior to shut down its web servers and digital communications during a complex case. Shutting down this single server inadvertently affected other agencies and their websites. While a great hit was taken, Google and the Wayback Machine together could remedy the lost web pages: where the Wayback Machine lacks search functionality, Google makes up for it. Google caches indexed dead web pages, which can be used to recover a page's URL; with that URL in hand, users can turn to the Wayback Machine and view time-stamped captures of the site. This article provides a good example of the Internet Archive returning data that was thought to be gone for good. With the help of such web applications, data does not necessarily cease to exist after outages or disasters.
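The URL-based lookup described above can also be done programmatically: the Wayback Machine exposes a public Availability API at archive.org/wayback/available, and archived captures live under timestamped paths on web.archive.org. The sketch below builds those two kinds of URL; the function names are illustrative, and this is a minimal sketch of the documented URL patterns rather than a complete client.

```python
from urllib.parse import urlencode

# Public endpoint of the Wayback Machine Availability API.
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url: str, timestamp: str = "") -> str:
    """Build a request URL asking for the capture closest to `timestamp`
    (YYYYMMDDhhmmss; partial values are allowed, empty means most recent)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def snapshot_url(url: str, timestamp: str) -> str:
    """Build the direct address of an archived capture on web.archive.org."""
    return f"https://web.archive.org/web/{timestamp}/{url}"
```

Fetching the address returned by `availability_query("example.com", "2001")` yields a small JSON document whose `archived_snapshots.closest` entry, when present, points at a capture URL of the `snapshot_url` form.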

Davis, P. R. (2017, January 6). Phil's Stock World: After Move to Canada, Wayback Machine Launches Trump Video Library, Complete with "Fact Checks" [Weblog post].

The Wayback Machine will soon face a rather expensive server move. After Donald J. Trump was elected president of the United States, the non-profit organization announced it would be moving operations to Canada, one of the greatest pushes behind the move being "a new administration promising radical change" (Davis, 2017). The organization also announced a Trump Archive in response to the election, said to hold over 520 hours of Trump's speeches, interviews, and other appearances from the 2016 election period, with fact checks provided to ensure the authenticity of the collection. The Internet Archive finds it necessary to keep multiple copies of its servers' data to ensure that history remains known and accurate. This article provides insight into the Internet Archive's desire to provide factual, uncensored information; the decisions it discusses truly support the non-profit's mission of serving uncensored, accurate information.

Eltgroth, D. R. (2009). Best Evidence and the Wayback Machine: Toward a Workable Authentication Standard for Archived Internet Evidence. Fordham Law Review, 78(1), 198-201.

The Internet Archive had been archiving web data well before the Wayback Machine's public launch in 2001, and the Wayback Machine has since preserved over 510 billion web captures. Through a method referred to as "crawling", the number of digital archives continues to expand. The Internet Archive established a partnership with Alexa Internet, Inc. to obtain its crawling software. Alexa is not capable of capturing every site with an active URL; instead, the crawler establishes priorities. The pages that get scanned are determined by users: their search history and requests help decide which web pages are crawled and how frequently. Once Alexa has captured the web data, it is deposited with the Internet Archive for storage and made accessible through the Wayback Machine, though it is important to note that crawled content is not immediately available upon donation. Refer to the article for an in-depth account of the exact process and best practices for web crawling.

Kumar, B. S., Kumar, D. V., & Prithviraj, K. R. (2015). Wayback Machine: Reincarnation to Vanished Online Citations. Program, 49(2), 205-223. doi:10.1108/PROG-07-2013-0039

The Wayback Machine is a powerful tool that can retrieve dated, archived websites, audio files, videos, books, and much more. However, it is not a perfect tool. The Internet Archive has made it its objective to preserve the contents of the world wide web, but there are limitations. The frequency with which a given website is time-stamped and captured can vary greatly, from daily or weekly to monthly or yearly. Some websites are difficult to capture completely because of rapid updates; news sites like CNN, for example, produce and update articles at a pace that makes it nearly impossible to capture every URL and its content. Even when the crawler does capture the updates, it can take nearly six months before they are available through the Wayback Machine. To better understand the Wayback Machine and its accessibility, certain studies have been completed. The study discussed here examined 15,211 journal articles with the objective of determining their availability on the web. Many cited URLs returned "Page Not Found", and of all the articles with dead URLs, only 48.33% of the broken links proved recoverable through the Wayback Machine, meaning the Internet Archive had failed to capture over half of the now non-existent web pages. Unfortunately, those pages and articles are permanently lost. This article provides data that helps explain the scope of the Wayback Machine: while capturing every aspect of and change to the web is certainly the goal, it is a lofty one with roadblocks.
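A recovery check like the study's can be approximated with the Wayback Machine's public Availability API: for each dead link, ask whether any capture exists, then compute the share recovered. The helpers below are an illustrative sketch that assumes the API's documented JSON shape (an `archived_snapshots` object with an optional `closest` capture); they parse a response and tally a recovery rate like the study's 48.33% figure, leaving the actual HTTP fetching to the caller.

```python
import json

def has_snapshot(api_response_text: str) -> bool:
    """True if an Availability API JSON response reports at least one capture."""
    data = json.loads(api_response_text)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return bool(closest.get("available"))

def recovery_rate(recovered_flags) -> float:
    """Percentage of dead links that proved recoverable via the archive."""
    flags = list(recovered_flags)
    if not flags:
        return 0.0
    return 100.0 * sum(bool(f) for f in flags) / len(flags)
```

For a list of dead URLs, one would fetch the availability response for each, map the responses through `has_snapshot`, and feed the resulting booleans to `recovery_rate`.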

Lueck, T. (2014). Internet Archive: Digital Library of Free Books, Movies, Music, and Wayback Machine/The Internet Archive Companion. American Journalism, 31(2), 299-301. doi:10.1080/08821127.2014.905381

The Internet Archive and its developers have worked hard to meet the demands of users and technology alike. Because the Wayback Machine presents scans of the internet so expansively, accessing the data can be complicated and inconvenient, and this article does an excellent job of discussing the attempts made to address user criticisms, beginning with the launch of the Wayback Machine mobile application. Its home interface provides the following categories for easy access and for refining searches: web, video, live music, audio, and text. While many new features were introduced, the most notable is direct access to executable files for vintage software, a welcome move given that users no longer need to download untrusted executables from unauthenticated websites that could infect their machines. Attempts to address the much-criticized search feature include a keyword search function that returns both domestic and international works, an effort to resolve one of the largest user complaints. In another improvement, search can now return results from much older data than was previously possible, although limitations remain: the results often depend on the type of data being accessed, and some contingencies apply. When the mobile application first launched, marketing focused heavily on the streaming music archives; the search improvements, greater collection availability, and mobile premiere were an obvious attempt to advance the service, and in large part an effort aimed at the younger generation of researchers. This article highlights the transformation of the Wayback Machine's interface to better suit the needs of today's users.

Murphy, J., Hashim, N. H., & O'Connor, P. (2007). Take Me Back: Validating the Wayback Machine. Journal of Computer-Mediated Communication, 13(1), 60-75. doi:10.1111/j.1083-6101.2007.00386.x

Given the vastness of the world wide web, it is important to verify the Wayback Machine's reliability before using it as a source of factual data. The study cited here examined the Wayback Machine's collection in three specific areas: website content, website age, and website updates, using three methods of validation: face, content, and predictive. Face validity is a less substantial test than the content and predictive approaches, so it was tested in a few ways, and as might be expected, the results were all over the spectrum. One face validity test recruited two hotels in Malaysia and asked them to compare their current websites against the Wayback Machine's captures; the outcomes proved accurate according to the hotel managers. While that test succeeded, the next one shows why face validity can be problematic. It compared the website age reported by the Wayback Machine against each domain name's age as provided by the domain name registrar, Mynic. The researchers enlisted a considerable number of hotels to ensure a decent sample size, and the results showed that 68 of 79 hotels had established their domain names before the launch of the Wayback Machine, a discovery that essentially rendered the test inconclusive. When two tests using the same method produce drastically different results, it sheds light on the validity of the test rather than on the application's authenticity. Going forward, researchers should ensure that the validation method chosen is within the means of the application being evaluated.

Thelwall, M., & Vaughan, L. (2004). A Fair History of the Web? Examining Country Balance in the Internet Archive. Library & Information Science Research, 26(2), 162-176. doi:10.1016/j.lisr.2003.12.009

Despite the Wayback Machine being developed in the United States, it is intended to crawl all possible sites, whether U.S.-based or international. However, the depth of its archival coverage outside the United States has been called into question. According to the statistical analysis performed, some countries are not being fairly represented or archived: the United States was over-represented, China was significantly under-represented, and smaller countries such as Singapore and Taiwan were fairly represented. Many explanations could account for the discrepancy, with two technical issues at the forefront. First, the language in which a site is built can contribute to unequal representation. Second, the site's accessibility to the crawler must be considered: some countries, such as India, have had their IP providers block the crawler to ensure their content is not archived. Overall, this article sheds light on a national coverage bias that, while perhaps not intentional, certainly exists.