Abstract
Focussed crawlers enable the automatic discovery of Web resources about a given topic by navigating the Web link structure and selecting the hyperlinks to follow, estimating their relevance to the topic from evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource by combining two kinds of evidence: textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, and visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed approach is applied to the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate that incorporating visual evidence in the link selection process is more effective than using textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
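The link selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the textual and visual evidence are fused, so the weighted sum below is an assumption, and `link_score`, `visual_weight`, `Frontier`, and the example URLs are hypothetical names introduced only for illustration. The textual score is assumed to come from a classifier applied to the text around the link, and the visual flag from an image classifier (for example, a heatmap detector) applied to the parent page.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    neg_score: float                  # negated so the min-heap pops the best link first
    url: str = field(compare=False)   # URLs do not take part in the ordering


def link_score(text_score: float, parent_has_relevant_image: bool,
               visual_weight: float = 0.3) -> float:
    """Hypothetical fusion rule: weighted sum of local textual evidence
    and a binary global visual signal from the parent page."""
    visual_score = 1.0 if parent_has_relevant_image else 0.0
    return (1.0 - visual_weight) * text_score + visual_weight * visual_score


class Frontier:
    """Best-first crawl frontier that ignores URLs it has already seen."""

    def __init__(self) -> None:
        self._heap: list[FrontierEntry] = []
        self._seen: set[str] = set()

    def push(self, url: str, score: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, FrontierEntry(-score, url))

    def pop(self) -> tuple[str, float]:
        entry = heapq.heappop(self._heap)
        return entry.url, -entry.neg_score

    def __len__(self) -> int:
        return len(self._heap)


# Usage: a link with weaker anchor text on a page that contains a relevant
# image can outrank a link with stronger text on a page without one.
frontier = Frontier()
frontier.push("http://example.org/forecast", link_score(0.6, True))
frontier.push("http://example.org/about", link_score(0.8, False))
best_url, best_score = frontier.pop()
```

With the assumed weight of 0.3, the visual signal is enough to reorder links whose textual scores are close, which is the kind of effect the combination of evidence is meant to capture; the actual weighting and classifiers used in the article may differ.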
Notes
Personalised Environmental Service Configuration and Delivery Orchestration (http://www.pescado-project.eu/).
Adaptive Hierarchical Density Histogram.
Both datasets are available at: http://mklab.iti.gr/project/heatmaps.
These URLs are different to the ones used for training the classifiers.
Acknowledgments
This work was supported by the MULTISENSOR (contract no. FP7-610411) and HOMER (contract no. FP7-312388) projects, partially funded by the European Commission.
Cite this article
Tsikrika, T., Moumtzidou, A., Vrochidis, S. et al. Focussed crawling of environmental Web resources based on the combination of multimedia evidence. Multimed Tools Appl 75, 1563–1587 (2016). https://doi.org/10.1007/s11042-015-2624-3
DOI: https://doi.org/10.1007/s11042-015-2624-3