International Journal on Digital Libraries, Volume 17, Issue 3, pp 203–221

Detecting off-topic pages within TimeMaps in Web archives

  • Yasmin AlNoamany
  • Michele C. Weigle
  • Michael L. Nelson

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy are preconditions for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy. We do not delete off-topic pages (they remain part of the collection), but we flag them as off-topic so that they can be excluded from consideration by downstream services, such as collection summarization and thumbnail generation. We propose several methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, Web-based kernel function, and the change in size using the number of words and content length) to detect when a page has gone off-topic. The pages predicted to be off-topic are presented to the collection’s curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 and change in size using word count at threshold −0.85 performs the best, with accuracy = 0.987, \(F_{1}\) score = 0.906, and AUC = 0.968. We also evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting off-topic pages in these collections is 0.89.
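The best-performing combination named in the abstract can be illustrated with a minimal Python sketch. Only the two thresholds (0.10 for cosine similarity, −0.85 for change in word count) come from the abstract. Everything else is an assumption made for illustration: the function names are hypothetical, the vectors use raw term frequencies, the change in size is taken as the relative difference in word count against the first copy, and a page is flagged when either test fails.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of raw term-frequency vectors (assumed weighting)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page shares no terms with anything
    return dot / (norm_a * norm_b)


def word_count_change(tokens_first, tokens_later):
    """Relative change in word count versus the first copy (assumed definition)."""
    if not tokens_first:
        return 0.0
    return (len(tokens_later) - len(tokens_first)) / len(tokens_first)


def is_off_topic(first_text, later_text, cos_threshold=0.10, wc_threshold=-0.85):
    """Flag a later copy when either measure falls below its threshold.

    Combining the two tests with OR is an assumption of this sketch.
    """
    first, later = tokenize(first_text), tokenize(later_text)
    return (cosine_similarity(first, later) < cos_threshold
            or word_count_change(first, later) < wc_threshold)
```

Under these assumptions, `is_off_topic(first_copy_text, later_copy_text)` compares each later archived copy with the first copy in its TimeMap. A flagged copy would be reported to the curator rather than deleted, in line with the abstract.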

Keywords

Web archiving; Document filtering; Information retrieval; Document similarity; Archived collections; Web content mining; Internet Archive

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Yasmin AlNoamany (1)
  • Michele C. Weigle (1)
  • Michael L. Nelson (1)
  1. Department of Computer Science, Old Dominion University, Norfolk, USA
