A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 6941))

Abstract

This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We do not use any non-English linguistic tools or resources such as WordNet, part-of-speech taggers, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We consider the English, Hindi and Marathi news datasets for our experiments. The system is evaluated using the F-score, Purity and Normalized Mutual Information (NMI) measures, and the results obtained are encouraging.
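As a rough illustration of two of the evaluation measures named in the abstract, the following Python sketch computes Purity and NMI for a clustering against gold class labels. It is not the authors' evaluation code: the function names are hypothetical, and normalizing mutual information by the arithmetic mean of the two entropies is an assumption, since the abstract does not state which normalization the paper uses.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that fall in the majority class of their cluster."""
    n = len(labels_true)
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(counts.values()) for counts in per_cluster.values()) / n


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies.

    The arithmetic-mean normalization is an assumption; other
    normalizations (e.g. geometric mean) are also in common use.
    """
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    mi = sum(
        (n_ij / n) * math.log(n_ij * n / (class_sizes[t] * cluster_sizes[p]))
        for (t, p), n_ij in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both assignments are trivial (one class, one cluster): treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0


# Toy example with made-up labels: four documents, two topics.
gold = ["sports", "sports", "politics", "politics"]
good_clusters = [0, 0, 1, 1]   # clusters match topics exactly
mixed_clusters = [0, 1, 0, 1]  # each cluster mixes both topics
```

Here `purity(gold, good_clusters)` and `nmi(gold, good_clusters)` both reach their maximum, while the mixed clustering scores a purity of one half and an NMI of zero. Purity alone rewards many small clusters, which is one reason it is usually reported alongside NMI.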

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kumar, N.K., Santosh, G.S.K., Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents. In: Forner, P., Gonzalo, J., Kekäläinen, J., Lalmas, M., de Rijke, M. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2011. Lecture Notes in Computer Science, vol 6941. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23708-9_9

  • DOI: https://doi.org/10.1007/978-3-642-23708-9_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23707-2

  • Online ISBN: 978-3-642-23708-9

  • eBook Packages: Computer Science, Computer Science (R0)
