Abstract
Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built on classifiers, which require large amounts of labeled training data that are very difficult to label manually. This paper presents a novel approach, clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, stores the cluster label, feature vector, and content unit. Four metrics are presented for the clustering process. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Third and fourth, the precision and recall of clustering evaluate the accuracy and completeness of the whole clustering process. A link-context extraction technique expands the feature vector of the anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both Harvest rate and Target recall.
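The bottom-up clustering loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define CV or the CFu-tree, so the sketch assumes a fixed cosine-similarity threshold as a stand-in for the CV merge test and a plain Python list in place of the linked list and CFu-tree. The names `vectorize`, `cosine`, `bottom_up_cluster`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector for one link context (anchor text plus nearby words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bottom_up_cluster(contexts, threshold=0.3):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters.

    ASSUMPTION: merging stops when the best similarity falls below `threshold`.
    This is a placeholder for the paper's comparison variation (CV) test,
    which the abstract names but does not define.
    """
    # Each cluster holds a summed feature vector and the indices of its members.
    clusters = [(vectorize(c), [i]) for i, c in enumerate(contexts)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # the closest pair is too dissimilar to merge
        i, j = pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [members for _, members in clusters]
```

Called on a handful of link contexts, the function returns groups of input indices; no labeled training data is involved, which is the property the abstract emphasizes. The pairwise search is quadratic per merge and is meant only to show the control flow.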
Author information
Lu Liu received her BS in computer science from Jilin University in 2012. She is currently a PhD visiting student in the Department of Computer Science at University of Illinois at Urbana-Champaign, USA. Her research interests include data mining, Web mining, and machine learning.
Tao Peng received his PhD and MSc in computer science from Jilin University, China in 2007 and 2004, respectively. He is a member of ACM and CCF (China Computer Federation). His research interests include Web mining, text mining, information retrieval, and machine learning. He is a past workshop co-chair of FCST2010. He is currently a postdoctoral researcher in the Department of Computer Science at University of Illinois at Urbana-Champaign, USA.
Cite this article
Liu, L., Peng, T. Clustering-based topical Web crawling using CFu-tree guided by link-context. Front. Comput. Sci. 8, 581–595 (2014). https://doi.org/10.1007/s11704-014-3050-9
DOI: https://doi.org/10.1007/s11704-014-3050-9