Focused Crawling and Seed Selection: Approaches to Creating Topic and Event-Centric Web Archive Collections
Diana Thurm
Introduction
This bibliography covers the methods and processes used to find and select web content for inclusion in collections centered on a particular topic or event. Specifically, it examines the seed selection and focused web crawling methodologies used to create these kinds of web archive collections.
Every paper included here also provides general background on the topic and/or discusses related research. Because this is a relatively new and rapidly developing subject involving ever-changing technologies, the bibliography looks exclusively at articles published within the last five years. Citation tracing through the bibliographies of the papers retrieved and selected helped in discovering other relevant papers to consider for inclusion. These selection criteria were used to make the bibliography more accessible and to ensure that readers at various levels of knowledge and experience can glean something from the information provided. The hope is that this bibliography will help professionals involved in digital curation plan and choose the best methods of seed selection and focused crawling for creating their own topic- or event-focused web archive collections, while also being generally useful in helping others understand the processes involved in focused crawling.
Keywords: Web Archives; Selection; Appraisal; Focused Crawlers; Seeds; Collections
Useful Databases: ACM Digital Library; Library, Information Science & Technology Abstracts (LISTA)
Annotations
Farag, M. M. G., Lee, S., & Fox, E. A. (2018). Focused crawler for events. International Journal on Digital Libraries, 19(1), 3-19. https://doi.org/10.1007/s00799-016-0207-1
Rather than applying a traditional topical approach to creating web archive collections focused on particular events, Farag et al. propose using event modeling to enable improved relevancy judgments of target webpages and to allow for adaptation as events unfold over time. Unlike traditional focused crawlers, the event model takes into consideration not only topic but also date and location when compiling related web content in a crawl. Using diagrams and detailed, step-by-step outlines of the processes undertaken by both archivists and the web crawlers themselves, the authors explain how the event model differs from other focused crawling methods and what its advantages are for creating event-focused web archive collections. They also describe a series of experiments conducted to evaluate the performance of the event model-based focused crawler and to demonstrate its efficacy compared to a traditional topic-centric baseline focused crawler. As the authors acknowledge, this model may not be best suited for all types of events, such as those that lack geographic and temporal specificity, but it may be useful to consider for some event-focused web archive collections.
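To make the multi-dimensional relevance judgment behind the event model concrete, the following is a minimal Python sketch of a scoring function that blends topical, temporal, and spatial evidence for a candidate page. The weighting scheme, decay rate, and keyword-overlap representation are illustrative assumptions, not the scoring used by Farag et al.

# Illustrative sketch of an event-model-style relevance score (not the authors' code).
# Assumes the candidate page has already been fetched, dated, and geotagged.
from datetime import date

def event_relevance(page_text: str, page_date: date, page_locations: set[str],
                    event_terms: set[str], event_date: date,
                    event_locations: set[str]) -> float:
    """Combine topic, date, and location evidence into one score in [0, 1]."""
    # Topical overlap: fraction of event terms mentioned in the page.
    words = set(page_text.lower().split())
    topic = len(event_terms & words) / max(len(event_terms), 1)

    # Temporal proximity: decay with distance (in days) from the event date.
    days_apart = abs((page_date - event_date).days)
    temporal = 1.0 / (1.0 + days_apart / 7.0)  # roughly weekly decay, arbitrary choice

    # Spatial match: any shared place name between page and event description.
    spatial = 1.0 if page_locations & event_locations else 0.0

    # Weighted combination; the weights are illustrative, not from the paper.
    return 0.5 * topic + 0.3 * temporal + 0.2 * spatial

# Example: score a page against a hypothetical "Hurricane Harvey" event model.
score = event_relevance(
    "Flooding in Houston after Hurricane Harvey made landfall ...",
    date(2017, 8, 28), {"houston", "texas"},
    {"hurricane", "harvey", "flooding"}, date(2017, 8, 25), {"houston", "texas"},
)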
Gossen, G., Risse, T., & Demidova, E. (2018). Towards extracting event-centric collections from web archives. International Journal on Digital Libraries, 21(1), 31-45. https://doi.org/10.1007/s00799-018-0258-6
We typically think of web crawling as a way to generate web archive collections directly from the live web. However, in this study, Gossen et al. propose using focused crawlers to extract event-centric collections from existing large-scale web archive repositories. The authors provide a substantial explanation of the archive crawler’s architecture, elaborating on the algorithms and logic behind how the estimated temporal and topical relevance of target webpages is determined. Presenting the results of the experiments they conducted to evaluate this method of collecting relevant web pages, aided by several graphs, they suggest that this “re-crawling” method can identify relevant documents both within and beyond the archive itself while also mitigating many of the challenges that web archives present. This approach to creating event-centric collections from large-scale web archives has potential applications for both researchers and web archivists.
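As a rough illustration of what a focused “re-crawl” over an existing archive might look like, the sketch below walks a priority queue of archived captures, keeps those whose capture time falls inside the event window and whose text scores as on-topic, and enqueues their outlinks. The helpers fetch_capture, score_topic, and extract_outlinks are hypothetical stand-ins, and the threshold and priority scheme are assumptions rather than the authors' algorithm.

# Hedged sketch of a priority-driven re-crawl over archived captures.
import heapq

def recrawl_archive(seed_captures, event_start, event_end, score_topic,
                    fetch_capture, extract_outlinks, budget=1000):
    """Collect archived captures that are both on-topic and inside the event window."""
    frontier = [(-1.0, url) for url in seed_captures]  # max-heap via negated priority
    heapq.heapify(frontier)
    collected, seen = [], set(seed_captures)

    while frontier and len(collected) < budget:
        _, capture_url = heapq.heappop(frontier)
        text, capture_time = fetch_capture(capture_url)      # assumed helper
        topical = score_topic(text)                          # assumed helper, 0..1
        temporal = 1.0 if event_start <= capture_time <= event_end else 0.0
        relevance = topical * temporal
        if relevance > 0.5:                                  # arbitrary threshold
            collected.append((capture_url, relevance))
        for link in extract_outlinks(text):                  # assumed helper
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return collected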
Gossen, G., Demidova, E., & Risse, T. (2015). iCrawl: Improving the freshness of web collections by integrating social web and focused web crawling. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, June 21-25, 2015, Knoxville, Tennessee (pp. 75-84). https://doi.org/10.1145/2756406.2756925
In this paper, Gossen et al. present iCrawl, a web crawler architecture that integrates the crawling of social media and the Web. In addition to crawling both types of sources, the integrated nature of the crawler allows relevant links posted on social media to serve as additional seed URLs, thereby continually maintaining the freshness and relevance of the collection. In their experiments, the authors compare this integrated crawling method to several other crawler configurations—unfocused, focused, and Twitter-based. The report elaborates on the parameters used to evaluate the relevance and freshness of the results and includes graphs showing that the inclusion of input from social media improved the freshness of the content collected compared with the other crawling modalities. As the authors point out, both the Web and social media provide valuable information and context about topics and events, often in interdependent ways (e.g., web content linked on social media); thus, crawling them in tandem might produce more focused and comprehensive web archive collections.
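A minimal sketch of the seed-refresh idea follows, assuming a hypothetical get_recent_posts helper in place of a real social media API client: links found in recent on-topic posts are added to the running crawl's frontier.

# Sketch of folding links shared on social media into a running crawl's frontier.
# get_recent_posts is a hypothetical helper, not part of the authors' system.
import re

URL_PATTERN = re.compile(r"https?://\S+")

def refresh_seeds(frontier: set[str], topic_query: str, get_recent_posts) -> set[str]:
    """Add URLs mentioned in recent on-topic social media posts to the frontier."""
    new_seeds = set()
    for post in get_recent_posts(topic_query):        # assumed: returns post texts
        for url in URL_PATTERN.findall(post):
            cleaned = url.rstrip(").,")                # strip trailing punctuation
            if cleaned not in frontier:
                new_seeds.add(cleaned)
    frontier |= new_seeds
    return new_seeds                                   # newly discovered seed URLs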
Klein, M., Balakireva, L., & Van de Sompel, H. (2018). Focused crawl of web archives to build event collections. WebSci ’18: Proceedings of the 10th ACM Conference on Web Science, May 27-30, 2018, Amsterdam, Netherlands (pp. 333-342). Association for Computing Machinery. https://doi.org/10.1145/3201064.3201085
Inspired by a 2017 paper by Gossen et al., Klein et al. conduct some of the first experiments in using focused crawlers to build event-focused web archive collections from existing general web archives. Their approach uses the Memento protocol, and the paper explains, in both technical detail and more accessible language, the algorithmic methods for determining the relevance of resources, which take both topical and temporal relevance into consideration. The authors compare the collections generated through this method with collections generated by crawling the live web and via manual curation. They also detail the methodology of their experiment, providing accompanying graphs to convey the results, which show that while archive crawls did not outperform live web crawls for relatively recent events, they were much more effective than live web crawls for collecting web content related to events from several years earlier, likely because that archived content no longer exists on the live web. The methods outlined in this paper could be useful for the creation of some event-focused web archive collections. In addition, the paper includes an interesting exploration of how pulling seed URIs from older versions of a Wikipedia page, rather than from the current version, may positively influence the relevance of web crawling results.
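The idea of seeding from an older version of a page can be illustrated with the Internet Archive's Wayback Machine availability endpoint, as in the hedged sketch below. This is only an approximation of the idea, not the pipeline Klein et al. describe, and the link-extraction regex and filtering rules are assumptions.

# Sketch: pull candidate seed URIs from an archived capture of a Wikipedia page
# rather than from its current version.
import re
import requests

def seeds_from_archived_page(page_url: str, timestamp: str) -> list[str]:
    """Return external links from the capture of page_url closest to timestamp
    (timestamp format YYYYMMDD)."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": page_url, "timestamp": timestamp}, timeout=30)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    if not closest:
        return []
    memento_html = requests.get(closest["url"], timeout=30).text
    # The archive rewrites hrefs to /web/<timestamp>/<original-url>; recover the originals.
    originals = re.findall(
        r'href="https?://web\.archive\.org/web/[^/"]+/(https?://[^"]+)"', memento_html)
    # Drop Wikipedia-internal links so only outgoing references remain as seeds.
    return [u for u in originals if "wikipedia.org" not in u and "wikimedia" not in u]

# Example (hypothetical): seeds from how an article looked shortly after the event.
# seeds = seeds_from_archived_page(
#     "https://en.wikipedia.org/wiki/Hurricane_Harvey", "20170915")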
Nanni, F., Ponzetto, S. P., & Dietz, L. (2020). Toward comprehensive event collections. International Journal on Digital Libraries, 21(2), 215-229. https://doi.org/10.1007/s00799-018-0246-x
Like other papers mentioned here, this one offers an approach to creating event-focused collections from large-scale archives. The approach is unique, however, in that it aims to collect not only documents relevant to the event itself but also documents concerning related aspects such as premises and consequences. The work focuses on named events (e.g., Korean War, Charlie Hebdo Shooting), drawing on the natural language processing and information retrieval communities for its methodology. The authors test their method on four collections, outlining their procedures and results for collecting and ranking entities, collecting contextual passages, and retrieving relevant documents. Presenting another method for creating event collections from web archives, along with the ability to include documents that provide further context, this paper may be useful for curators looking to generate focused collections from large-scale archives.
Nwala, A. C., Weigle, M. C., & Nelson, M. L. (2019). Using micro-collections in social media to generate seeds for web archive collections. JCDL ’19: Proceedings of the 2019 ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), June 2-6, 2019, Champaign, IL (pp. 251-260). Association for Computing Machinery. https://doi.org/10.1109/JCDL.2019.00042
The primary methods curators use to generate seed URIs for the creation of web archives are scraping web search engine result pages (SERPs) and social media SERPs. In this paper, Nwala et al. explore the characteristics of seeds collected from different sources and propose what they call “micro-collections” as an additional source of seeds. Micro-collections are social media posts in which users compile web content on a particular topic, such as threads of news articles about a significant event. Using them enables the discovery of high-quality seeds that may not be easily accessible through a Web search, and the creation of topical or event-focused web archive collections more quickly, before resources are lost. This paper provides thorough background on prior research related to focused crawling and seed selection. The methodology of the experiment is outlined step by step, providing valuable insight, although the explanations are highly technical. The results show that there are pros and cons to generating seeds from social media micro-collections as well as from SERPs, so this paper could provide guidance on which method might be best suited for a particular collection, depending on the aims of collection development. As the authors state, “these findings may provide useful information to curators using social media to generate seeds” (p. 254).
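As an illustration only, the sketch below harvests seed URIs from a single thread-style micro-collection, keeping the links shared by the thread's author in posting order; get_post and get_thread_replies are hypothetical helpers standing in for a social media API, not the authors' extraction pipeline.

# Hedged sketch: collect seed URIs from a curator's thread of posts.
import re

URL_PATTERN = re.compile(r"https?://\S+")

def seeds_from_thread(root_post_id: str, get_post, get_thread_replies) -> list[str]:
    """Collect URIs shared by the thread's author, preserving posting order."""
    root = get_post(root_post_id)                         # assumed: returns a post dict
    posts = [root] + [p for p in get_thread_replies(root_post_id)
                      if p["author"] == root["author"]]   # keep the curator's own posts
    seeds, seen = [], set()
    for post in posts:
        for url in URL_PATTERN.findall(post["text"]):
            url = url.rstrip(").,")
            if url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds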
Nwala, A. C., Weigle, M. C., & Nelson, M. L. (2018). Bootstrapping web archive collections from social media. HT ’18: Proceedings of the 29th ACM Conference on Hypertext and Social Media, July 9–12, 2018, Baltimore, MD. Association for Computing Machinery. https://doi.org/10.1145/3209542.3209560
Nwala et al. explore the efficacy of automatically and semi-automatically generating seed URIs for topical web archive collections from social media sources by comparing the results to those produced by human-generated Archive-It collections. In addition to introducing an alternative to the time- and labor-intensive work of generating seeds manually, the thorough detailing of the seven-point metric they used to draw these comparisons could be useful in evaluating web archive collections and the methods of generating them. Based on their finding that collections generated in this way are of similar quality to those produced by Archive-It, which the paper explains in further detail, the authors propose extracting URIs from social media as a method of starting or adding to web archive collections on a particular topic.
Suebchua, T., Manaskasemsak, B., Rungsawang, A., & Yamana, H. (2018). Efficient topical focused crawling through neighborhood feature. New Generation Computing, 36(2), 95-118. https://doi.org/10.1007/s00354-017-0029-8
Suebchua et al. propose a “neighborhood” feature for focused web crawlers that, rather than solely considering the relevancy of the source page linking to the destination, also takes into account the relevancy of web pages located in the same or a similar directory path as the target web page. They present the algorithm behind this feature and report experimental results comparing it to other focused crawlers, clearly explaining, with the aid of helpful diagrams, how it differs from and improves upon focused crawlers that do not use it. The paper also provides ample background on related research and on foundational concepts such as what focused crawlers are, the different types, and how they function. While some portions are highly technical, this background and the accompanying explanations will help any reader understand the algorithms behind focused crawlers and analyze their efficiency.
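The intuition behind the neighborhood feature can be sketched as follows: when prioritizing an unvisited URL, blend the relevance of the pages linking to it with the average relevance of already-fetched pages that share its directory path. The weights and averaging scheme here are illustrative assumptions, not the authors' algorithm.

# Illustrative sketch of a "neighborhood"-aware crawl priority.
from urllib.parse import urlparse
from statistics import mean

def directory_of(url: str) -> str:
    parsed = urlparse(url)
    path = parsed.path.rsplit("/", 1)[0]           # drop the final path segment
    return f"{parsed.netloc}{path}"

def neighborhood_priority(target_url: str, parent_scores: list[float],
                          fetched_scores: dict[str, float]) -> float:
    """Blend the relevance of linking pages with the average relevance of
    already-crawled pages under the same directory as target_url."""
    parent = mean(parent_scores) if parent_scores else 0.0
    neighbors = [s for url, s in fetched_scores.items()
                 if directory_of(url) == directory_of(target_url)]
    neighborhood = mean(neighbors) if neighbors else parent
    return 0.6 * parent + 0.4 * neighborhood        # illustrative weights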
Summers, E., & Punzalan, R. (2017). Bots, seeds, and people: Web archives as infrastructure. CSCW '17: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, Portland, OR. Association for Computing Machinery. https://doi.org/10.1145/2998181.2998345
In this study, Summers and Punzalan explore “how archivists interact with web archiving systems and collaborate with automated agents when deciding what to collect from the web.” To do this, they conducted interviews with individuals involved in web content appraisal, and the report includes excerpts from these interviews. While web crawlers are not the only web archiving technology the study examines, the authors discuss in detail different crawl modalities and the ways in which these modalities influence appraisal and selection. The study also explores archivists’ experiences with other tools used in the selection phase of the web archive creation process, such as Archive-It. The authors’ hope is that by better understanding the collaboration between archivists and technologies, we can gain new insight into the selection and appraisal process.
Vieira, K., Barbosa, L., da Silva, A. S., Freire, J., & Moura, E. (2016). Finding seeds to bootstrap focused crawlers. World Wide Web: Internet and Web Information Systems, 19, 449-474. https://doi.org/10.1007/s11280-015-0331-7
In this paper, Vieira et al. propose a new method for the automatic collection of seed URIs, arguing that giving crawlers a large, varied set of seeds leads to higher harvest rates and better topic coverage. The authors explain how such large and varied seed sets can reach farther, covering more ground by bridging gaps between relevant links distributed throughout the Web. Their approach, called BFC (Bootstrapping Focused Crawlers), issues queries to search engines as a way of collecting seeds in an automated fashion. The paper provides extensive background on seed selection and focused crawlers and clearly explains the more technical concepts. Considering the time-consuming nature of manual seed selection, those conducting this kind of work may be interested in a framework for automatically selecting a large number of varied seed URIs on a particular topic or event.
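A minimal sketch of search-engine-based seed bootstrapping in this spirit: build several distinct queries from topic terms and pool the unique result URLs. The search_api parameter is a placeholder for whichever search client is available, and the query-construction scheme is an assumption rather than BFC's actual procedure.

# Hedged sketch of bootstrapping a varied seed set via search engine queries.
from itertools import combinations

def bootstrap_seeds(topic_terms: list[str], search_api, results_per_query=10) -> list[str]:
    """Build varied queries by pairing topic terms, then collect unique result URLs."""
    queries = [" ".join(pair) for pair in combinations(topic_terms, 2)]
    seeds, seen = [], set()
    for query in queries:
        for url in search_api(query, results_per_query):   # assumed: returns URLs
            if url not in seen:
                seen.add(url)
                seeds.append(url)
    return seeds

# Example (hypothetical): bootstrap_seeds(
#     ["hurricane harvey", "houston flooding", "disaster relief"], my_search_client)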