Clustering-based topical Web crawling using CFu-tree guided by link-context

  • Research Article
  • Frontiers of Computer Science

Abstract

Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built around classifiers, which require large amounts of labeled training data that are very difficult to label manually. This paper presents a novel approach, clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel hierarchical clustering method, called the bottom-up approach, is used, in which a new data structure, a linked list combined with a CFu-tree, is implemented to store the cluster label, feature vector, and content unit. Four metrics are presented for the clustering. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Finally, the precision and recall of clustering are presented to evaluate the accuracy and completeness of the whole clustering process. A link-context extraction technique is used to expand the feature vector of anchor text, which greatly improves the clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both harvest rate and target recall.
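The bottom-up clustering idea and the two crawl metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define CV or CIP, so a plain cosine-similarity threshold (`merge_threshold`, a hypothetical parameter) stands in for the CV merge test, a flat list of clusters stands in for the linked list and CFu-tree, and the harvest rate and target recall functions follow the definitions commonly used in the focused-crawling literature, which the paper is assumed to share. All function names are illustrative.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector for a content unit (anchor text plus link-context)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bottom_up_cluster(units, merge_threshold=0.3):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters.

    Assumption: the similarity threshold is a stand-in for the paper's
    comparison variation (CV) test, whose formula is not given in the abstract.
    """
    clusters = [{"vector": vectorize(u), "units": [u]} for u in units]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i]["vector"], clusters[j]["vector"])
                if s > best:
                    best, pair = s, (i, j)
        if best < merge_threshold:
            break  # the closest pair is too dissimilar to merge
        i, j = pair
        merged = {
            "vector": clusters[i]["vector"] + clusters[j]["vector"],
            "units": clusters[i]["units"] + clusters[j]["units"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in pair] + [merged]
    return clusters


def harvest_rate(crawled, relevant):
    """Fraction of crawled pages that are relevant to the topic."""
    if not crawled:
        return 0.0
    return sum(1 for url in crawled if url in relevant) / len(crawled)


def target_recall(crawled, targets):
    """Fraction of a known target set that the crawler has fetched."""
    if not targets:
        return 0.0
    return len(set(crawled) & set(targets)) / len(targets)


units = [
    "machine learning course",
    "learning machine tutorial",
    "cheap flights paris",
    "paris flights deals",
]
clusters = bottom_up_cluster(units)
crawled = ["a", "b", "c", "d"]
hr = harvest_rate(crawled, {"a", "c"})
tr = target_recall(crawled, {"a", "c", "e", "f"})
```

In this toy input the two topic groups share no terms, so merging stops once each group has collapsed into its own cluster. A real crawler would then score unvisited links by the cluster their link-context falls into.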

Author information

Corresponding author

Correspondence to Tao Peng.

Additional information

Lu Liu received her BS in computer science from Jilin University in 2012. She is currently a visiting PhD student in the Department of Computer Science at the University of Illinois at Urbana-Champaign, USA. Her research interests include data mining, Web mining, and machine learning.

Tao Peng received his PhD and MSc in computer science from Jilin University, China in 2007 and 2004, respectively. He is a member of ACM and CCF (China Computer Federation). His research interests include Web mining, text mining, information retrieval, and machine learning. He was a workshop co-chair of FCST2010. He is currently a postdoctoral researcher in the Department of Computer Science at the University of Illinois at Urbana-Champaign, USA.

About this article

Cite this article

Liu, L., Peng, T. Clustering-based topical Web crawling using CFu-tree guided by link-context. Front. Comput. Sci. 8, 581–595 (2014). https://doi.org/10.1007/s11704-014-3050-9

  • DOI: https://doi.org/10.1007/s11704-014-3050-9
