Multimedia Tools and Applications, Volume 75, Issue 3, pp 1563–1587

Focussed crawling of environmental Web resources based on the combination of multimedia evidence

  • Theodora Tsikrika
  • Anastasia Moumtzidou
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris


Focussed crawlers enable the automatic discovery of Web resources about a given topic. They navigate the Web link structure and select the hyperlinks to follow by estimating their relevance to the topic, based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource by combining two kinds of evidence: textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, and visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed approach is applied to the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate that incorporating visual evidence in the link selection process of the focussed crawler is more effective than using textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
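The general idea of a best-first focussed crawler that scores each hyperlink from textual evidence in its local context and visual evidence from its parent page can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the weighted-sum fusion in `combine_scores`, the keyword-overlap `toy_text_score`, the lookup-based `toy_visual_score` (a stand-in for a heatmap image classifier), and the in-memory `TOY_WEB` graph that replaces real page fetching.

```python
import heapq

def combine_scores(text_score, visual_score, weight=0.5):
    """Assumed fusion rule: weighted sum of two relevance estimates in [0, 1]."""
    return weight * text_score + (1.0 - weight) * visual_score

def crawl(seeds, fetch, text_score, visual_score, budget=10, threshold=0.0):
    """Best-first crawl: always visit the highest-scoring unvisited link next.

    fetch(url)          -> list of (link_context_text, target_url) pairs
    text_score(context) -> relevance of the text around the link (local context)
    visual_score(url)   -> relevance of the images on the parent page (global context)
    """
    frontier = [(-1.0, url) for url in seeds]  # negate scores: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        parent_visual = visual_score(url)  # shared by every link on this page
        for context, target in fetch(url):
            if target in seen:
                continue
            score = combine_scores(text_score(context), parent_visual)
            if score >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited

# Hypothetical toy data standing in for real pages and classifiers.
TOY_WEB = {
    "seed": [("air quality forecast map", "a"), ("contact us", "b")],
    "a": [("ozone heatmap today", "c")],
    "b": [],
    "c": [],
}
HEATMAP_PAGES = {"a"}  # pages a (hypothetical) image classifier flags as containing heatmaps
KEYWORDS = {"air", "quality", "ozone", "forecast", "heatmap"}

def toy_fetch(url):
    return TOY_WEB.get(url, [])

def toy_text_score(context):
    words = context.lower().split()
    return sum(w in KEYWORDS for w in words) / len(words)

def toy_visual_score(url):
    return 1.0 if url in HEATMAP_PAGES else 0.0

order = crawl(["seed"], toy_fetch, toy_text_score, toy_visual_score)
print(order)
```

With `threshold=0.0`, even links with a zero score stay in the frontier and are visited last, which is a crude stand-in for an exploration strategy. Raising the threshold makes the crawler purely exploitative and would drop such links.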


Focussed crawling, Environmental data, Link context, Image classification, Heatmaps



This work was supported by the MULTISENSOR (contract no. FP7-610411) and HOMER (contract no. FP7-312388) projects, partially funded by the European Commission.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Theodora Tsikrika (1)
  • Anastasia Moumtzidou (1)
  • Stefanos Vrochidis (1)
  • Ioannis Kompatsiaris (1)
  1. Information Technologies Institute, CERTH, Thessaloniki, Greece
