International Journal on Digital Libraries, Volume 17, Issue 3, pp 203–221

Detecting off-topic pages within TimeMaps in Web archives

  • Yasmin AlNoamany
  • Michele C. Weigle
  • Michael L. Nelson
Article

Abstract

Web archives have become a significant repository of our recent history and cultural heritage. Archival integrity and accuracy are preconditions for future cultural research. Currently, there are no quantitative or content-based tools that allow archivists to judge the quality of Web archive captures. In this paper, we address the problem of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy. We do not delete off-topic pages (they remain part of the collection), but we flag them as off-topic so they can be excluded from consideration by downstream services, such as collection summarization and thumbnail generation. We propose several methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, Web-based kernel function, and the change in size using the number of words and content length) to detect when a page has gone off-topic. Pages predicted to be off-topic will be presented to the collection’s curator for possible elimination from the collection or cessation of crawling. We created a gold standard data set from three Archive-It collections to evaluate the proposed methods at different thresholds. We found that combining cosine similarity at threshold 0.10 and change in size using word count at threshold −0.85 performs the best, with accuracy = 0.987, F1 score = 0.906, and AUC = 0.968. We evaluated the performance of the proposed method on several Archive-It collections. The average precision of detecting off-topic pages in the collections is 0.89.
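The best-performing combination described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes raw term-frequency vectors over a simple alphanumeric tokenizer, defines the change in size as the relative difference in word count from the first capture, and flags a page when either measure falls below its threshold. Those three choices, and the function names `tokenize`, `cosine_similarity`, `word_count_change`, and `is_off_topic`, are assumptions made for illustration; only the two threshold values come from the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


def word_count_change(first_tokens, later_tokens):
    """Relative change in word count versus the first capture (assumed definition)."""
    if not first_tokens:
        return 0.0
    return (len(later_tokens) - len(first_tokens)) / len(first_tokens)


def is_off_topic(first_text, later_text, cos_threshold=0.10, wc_threshold=-0.85):
    """Flag a later capture if either measure falls below its threshold.

    Combining the two tests with a logical OR is an assumption of this sketch.
    """
    first, later = tokenize(first_text), tokenize(later_text)
    return (cosine_similarity(first, later) < cos_threshold
            or word_count_change(first, later) < wc_threshold)


if __name__ == "__main__":
    first_capture = "hurricane relief donations shelter volunteers hurricane recovery news update"
    print(is_off_topic(first_capture, "hurricane recovery news volunteers and shelter update"))
    print(is_off_topic(first_capture, "this domain is for sale"))
```

In this sketch, a capture that shares no terms with the first copy is caught by the cosine test, while a capture that keeps a few original words but has lost most of its text is caught by the word-count test.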

Keywords

Web archiving, Document filtering, Information retrieval, Document similarity, Archived collections, Web content mining, Internet Archive

Notes

Acknowledgments

This work was supported in part by the AMF and the IMLS LG-71-15-0077-15. We thank Kristine Hanna from the Internet Archive for help in obtaining the data set. We also thank the anonymous reviewers for their insights regarding future directions for this work.

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Yasmin AlNoamany (1)
  • Michele C. Weigle (1)
  • Michael L. Nelson (1)
  1. Department of Computer Science, Old Dominion University, Norfolk, USA
