Clustering-based topical Web crawling using CFu-tree guided by link-context

Liu, Lu; Peng, Tao

doi:10.1007/s11704-014-3050-9

Research Article
Published: 26 May 2014

Volume 8, pages 581–595, (2014)

Frontiers of Computer Science

Lu Liu^1,2 &
Tao Peng^1,2

Abstract

Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built on classifiers, which need a large amount of labeled training data that is very difficult to label manually. This paper presents a novel approach, clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, stores the cluster label, feature vector, and content unit. Four metrics are presented for the clustering. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Third and fourth, the precision and recall of clustering evaluate the accuracy and the comprehensiveness of the whole clustering process. A link-context extraction technique is used to expand the feature vector of the anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both Harvest rate and Target recall.
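
The abstract names Harvest rate and Target recall without defining them. As a minimal sketch, assuming the definitions commonly used for focused crawlers (Harvest rate as the fraction of crawled pages judged relevant, Target recall as the fraction of a known target set that the crawl reached), the two measures could be computed as follows. The function names, the `is_relevant` judge, and the example URLs are hypothetical and are not taken from the paper.

```python
def harvest_rate(crawled, is_relevant):
    """Fraction of crawled pages that the relevance judge accepts.

    `crawled` is a sequence of page identifiers; `is_relevant` is a
    caller-supplied predicate (an assumption, not the paper's judge).
    """
    if not crawled:
        return 0.0
    hits = sum(1 for page in crawled if is_relevant(page))
    return hits / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known set of target pages reached by the crawl."""
    target_set = set(targets)
    if not target_set:
        return 0.0
    return len(set(crawled) & target_set) / len(target_set)


# Hypothetical crawl of four pages, two of which are on topic.
crawled_pages = ["p1", "p2", "p3", "p4"]
on_topic = {"p1", "p3"}
known_targets = {"p1", "p9"}

print(harvest_rate(crawled_pages, lambda p: p in on_topic))  # 0.5
print(target_recall(crawled_pages, known_targets))           # 0.5
```

Both functions return 0.0 for empty input rather than raising, which is a design choice of this sketch so that a crawl can be scored from its first step.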

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Changchun, 130012, China
Lu Liu & Tao Peng
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Lu Liu & Tao Peng

Corresponding author

Correspondence to Tao Peng.

Additional information

Lu Liu received her BS in computer science from Jilin University in 2012. She is currently a PhD visiting student in the Department of Computer Science at University of Illinois at Urbana-Champaign, USA. Her research interests include data mining, Web mining, and machine learning.

Tao Peng received his PhD and MSc in computer science from Jilin University, China in 2007 and 2004, respectively. He is a member of ACM and CCF (China Computer Federation). His research interests include Web mining, text mining, information retrieval, and machine learning. He is a past workshop co-chair of FCST2010. He is currently a postdoctoral researcher in the Department of Computer Science at University of Illinois at Urbana-Champaign, USA.

About this article

Cite this article

Liu, L., Peng, T. Clustering-based topical Web crawling using CFu-tree guided by link-context. Front. Comput. Sci. 8, 581–595 (2014). https://doi.org/10.1007/s11704-014-3050-9

Received: 04 February 2013
Accepted: 18 February 2014
Published: 26 May 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s11704-014-3050-9
