Clustering Documents Using a Wikipedia-Based Concept Representation

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5476)

Abstract

This paper shows how Wikipedia and the semantic knowledge it contains can be exploited for document clustering. We first create a concept-based document representation by mapping the terms and phrases within documents to their corresponding articles (or concepts) in Wikipedia. We then develop a similarity measure that evaluates the semantic relatedness between the concept sets of two documents. We test the concept-based representation and the similarity measure on two standard text document datasets. Empirical results show that, although further optimizations could be performed, our approach already improves upon related techniques.
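The two steps named in the abstract (mapping phrases to Wikipedia concepts, then comparing concept sets) can be illustrated with a minimal sketch. It is not the authors' implementation: the anchor dictionary `ANCHORS`, the inlink sets `INLINKS`, the constant `TOTAL_ARTICLES`, and all function names are hypothetical toy values introduced here. The pairwise score is a link-overlap relatedness in the style of Milne and Witten, and the set-level average of best matches is an assumed aggregation, since the abstract does not specify the measure.

```python
import math

# Hypothetical anchor dictionary: surface phrase -> Wikipedia article title.
ANCHORS = {
    "machine learning": "Machine learning",
    "clustering": "Cluster analysis",
    "neural network": "Artificial neural network",
    "jaguar": "Jaguar",
}

# Hypothetical inlink sets: article title -> ids of articles that link to it.
INLINKS = {
    "Machine learning": {1, 2, 3, 4, 5},
    "Cluster analysis": {3, 4, 5, 6},
    "Artificial neural network": {1, 2, 3, 7},
    "Jaguar": {8, 9},
}
TOTAL_ARTICLES = 1000  # assumed size of the toy Wikipedia


def to_concepts(text, max_len=3):
    """Map a document to a set of concepts by greedy longest phrase match."""
    words = text.lower().split()
    concepts, i = set(), 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ANCHORS:
                concepts.add(ANCHORS[phrase])
                i += n
                break
        else:
            i += 1  # no phrase starts at this word; move on
    return concepts


def relatedness(a, b):
    """Link-overlap relatedness of two concepts, clipped to [0, 1]."""
    if a == b:
        return 1.0
    links_a, links_b = INLINKS[a], INLINKS[b]
    common = len(links_a & links_b)
    if common == 0:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    distance = (math.log(larger) - math.log(common)) / (
        math.log(TOTAL_ARTICLES) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


def concept_set_similarity(c1, c2):
    """Assumed aggregation: symmetric average of best-match relatedness."""
    if not c1 or not c2:
        return 0.0

    def directed(src, dst):
        return sum(max(relatedness(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(c1, c2) + directed(c2, c1))


if __name__ == "__main__":
    d1 = to_concepts("an introduction to machine learning")
    d2 = to_concepts("clustering with a neural network")
    d3 = to_concepts("the jaguar is a large cat")
    print(concept_set_similarity(d1, d2), concept_set_similarity(d1, d3))
```

With these toy values, two documents that share no words can still score as similar when their concepts share incoming links, which is the intuition behind comparing concept sets rather than raw terms.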

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, A., Milne, D., Frank, E., Witten, I.H. (2009). Clustering Documents Using a Wikipedia-Based Concept Representation. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_62

  • DOI: https://doi.org/10.1007/978-3-642-01307-2_62

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01306-5

  • Online ISBN: 978-3-642-01307-2

  • eBook Packages: Computer Science, Computer Science (R0)
