Latent Semantic Analysis Evaluation of Conceptual Dependency Driven Focused Crawling
In this paper we study a focused crawler driven by deep semantic analysis based on Conceptual Dependency (CD) theory. We test in practice the use of CD scripts as a way of defining topics (queries) in a focused crawler, and their robustness in evaluating real text structures extracted from HTML documents. To benchmark its efficiency against classical approaches, we complement human evaluation with an evaluation of the result set based on its internal similarity, computed using Latent Semantic Analysis (LSA). The measurements lead us to conclude that CD theory is well suited for evaluating the similarity of HTML documents to a given query, as it achieves high precision as measured by human evaluation. At the same time, we observe the drawbacks of LSA when used in the same context.
Keywords: focused crawling, topic crawling, conceptual dependency, LSA
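The idea of scoring a result set by its internal similarity under LSA can be illustrated with a minimal sketch. The abstract does not specify the paper's LSA configuration, so everything below is an assumption made for illustration: raw term counts with whitespace tokenization, a rank-`k` truncated SVD, and the mean pairwise cosine similarity as the score. The function names `term_document_matrix` and `lsa_internal_similarity` are hypothetical.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw-count term-document matrix (terms x documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            m[index[w], j] += 1.0
    return m


def lsa_internal_similarity(docs, k=2):
    """Mean pairwise cosine similarity of documents in a rank-k LSA space.

    Assumption: this is one plausible reading of "internal similarity",
    not necessarily the measure used in the paper.
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents")
    a = term_document_matrix(docs)
    # Thin SVD: a = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    k = min(k, len(s))
    # Document coordinates in the latent space, one row per document.
    doc_vecs = (np.diag(s[:k]) @ vt[:k]).T
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_vecs / norms
    sims = unit @ unit.T
    # Average over the strict upper triangle, i.e. all distinct pairs.
    upper = np.triu_indices(len(docs), k=1)
    return float(sims[upper].mean())
```

A score near 1 means the crawled documents are close to each other in the latent space. It does not by itself show that they match the query, which is one reason to pair such a measure with human evaluation.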