A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents

Kumar, N. Kiran; Santosh, G. S. K.; Varma, Vasudeva

doi:10.1007/978-3-642-23708-9_9

N. Kiran Kumar, G. S. K. Santosh & Vasudeva Varma

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6941)

Abstract

This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We did not use any non-English linguistic tools or resources such as WordNet, a Part-Of-Speech tagger, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We considered the English, Hindi and Marathi news datasets for our experiments. The system is evaluated using the F-score, Purity and Normalized Mutual Information measures, and the results obtained are encouraging.
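The abstract names Purity and Normalized Mutual Information (NMI) as evaluation measures but does not give the formulas used. As a minimal sketch, assuming the common textbook definitions (purity as the fraction of documents in their cluster's majority class, and NMI normalized by the arithmetic mean of the two entropies), the measures could be computed as follows. The function names and the toy cluster and class labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for c, y in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of cluster or class assignments."""
    n = len(assignments)
    return -sum((k / n) * log(k / n) for k in Counter(assignments).values())


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants (e.g. the
    geometric mean) exist and the paper may use a different one.
    """
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(labels)
    mi = sum(
        (k / n) * log(n * k / (cluster_sizes[c] * class_sizes[y]))
        for (c, y), k in joint.items()
    )
    denom = (entropy(clusters) + entropy(labels)) / 2
    # Convention chosen here: one cluster and one class count as a perfect match.
    return mi / denom if denom > 0 else 1.0


# Toy example: six documents, two predicted clusters, two gold classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
toy_purity = purity(toy_clusters, toy_labels)
toy_nmi = nmi(toy_clusters, toy_labels)
```

In the toy example, five of the six documents sit in their cluster's majority class, so the purity is 5/6. Purity alone rewards producing many small clusters, which is one reason it is usually reported alongside NMI.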


Author information

Authors and Affiliations

International Institute of Information Technology, Hyderabad, India
N. Kiran Kumar, G. S. K. Santosh & Vasudeva Varma

Editor information

Editors and Affiliations

Center for the Evaluation of Language and Communication Technologies (CELCT), Via alla Casata 56/c, 38123, Povo, Italy
Pamela Forner
National University of Distance Education, E.T.S.I. Informática de la UNED, c/Juan del Rosal 16, 28040, Madrid, Spain
Julio Gonzalo
School of Information Sciences, University of Tampere, Kanslerinrinne 1, 33014, Tampere, Finland
Jaana Kekäläinen
Yahoo! Research, Avinguda Diagonal 177, 8th Floor, 08018, Barcelona, Spain
Mounia Lalmas
Intelligent Systems Laboratory, University of Amsterdam, Science Park 107, 1098 XG, Amsterdam, The Netherlands
Maarten de Rijke

About this paper

Cite this paper

Kumar, N.K., Santosh, G.S.K., Varma, V. (2011). A Language-Independent Approach to Identify the Named Entities in Under-Resourced Languages and Clustering Multilingual Documents. In: Forner, P., Gonzalo, J., Kekäläinen, J., Lalmas, M., de Rijke, M. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2011. Lecture Notes in Computer Science, vol 6941. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23708-9_9

DOI: https://doi.org/10.1007/978-3-642-23708-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23707-2
Online ISBN: 978-3-642-23708-9
eBook Packages: Computer Science (R0)
