Focused crawler for events

  • Mohamed M. G. Farag
  • Sunshin Lee
  • Edward A. Fox

Abstract

There is a need for an integrated event-focused crawling system that collects Web data about key events. When a disaster or other significant event occurs, many users try to locate the most up-to-date information about it. Yet event information is rarely collected and archived systematically. We propose intelligent event-focused crawling for automatic event tracking and archiving, with the ultimate goal of effective access. We developed an event model that captures key event information and incorporated it into a focused crawling algorithm. So that the focused crawler can use the event model to predict webpage relevance, we developed a function that measures the similarity between two event representations. We then conducted two series of experiments to evaluate our system on two recent events: the California shooting and the Brussels attack. The first series evaluated how effectively the proposed event model representation assesses the relevance of webpages. The event model-based representation outperformed the topic-only baseline in precision, recall, and F1-score, with an improvement of 20% in F1-score. The second series evaluated how effectively the event model-based focused crawler collects relevant webpages from the WWW. It outperformed the state-of-the-art best-first baseline focused crawler in harvest ratio, with an average improvement of 40%.
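The abstract reports precision, recall, F1-score, and harvest ratio without restating how they are computed. The following is a minimal sketch using the standard definitions of these metrics; the function names and the toy inputs are illustrative assumptions and are not taken from the authors' implementation.

```python
def precision_recall_f1(predicted, actual):
    """Score a relevance classifier.

    predicted: set of page ids the classifier labelled relevant.
    actual:    set of page ids that are truly relevant.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def harvest_ratio(relevance_flags):
    """Fraction of fetched pages that turned out to be relevant.

    relevance_flags: one boolean per page the crawler downloaded,
    in any order.
    """
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    # Hypothetical toy data, not results from the paper.
    p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 5})
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
    print(f"harvest ratio={harvest_ratio([True, False, True, True]):.2f}")
```

Harvest ratio is a crawl-level measure (how much of the download budget went to relevant pages), whereas precision, recall, and F1-score describe the relevance classifier on its own, which is why the two experiment series report different metrics.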

Keywords

Event archiving; Focused crawling; Web archiving; Event modeling; Digital libraries


Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Mohamed M. G. Farag (1)
  • Sunshin Lee (1)
  • Edward A. Fox (1)
  1. Virginia Tech, Blacksburg, USA
