Conceptual Hierarchical Clustering of Documents using Wikipedia knowledge

  • Gerasimos Spanakis
  • Georgios Siolas
  • Andreas Stafylopatis
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 62)


In this paper, we propose a novel method for conceptual hierarchical clustering of documents using knowledge extracted from Wikipedia. A robust and compact document representation is built in real time using the Wikipedia API. The clustering process is hierarchical and creates cluster labels that are descriptive and important for the examined corpus. Experiments show that the proposed technique greatly improves over the baseline approach.


Noun Phrase, Document Representation, Candidate Concept, Consecutive Word, Encyclopedic Knowledge
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Gerasimos Spanakis (1)
  • Georgios Siolas (1)
  • Andreas Stafylopatis (1)
  1. Intelligent Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
