
Indian Language Information Retrieval

  • Prasenjit Majumder
  • Mandar Mitra
Chapter
Part of the Advances in Pattern Recognition book series (ACVPR)

Abstract

With the proliferation of the Internet in South Asia over the last decade, the availability of digital documents in Indian languages has increased considerably, and the need for effective information access methods for these languages is increasingly felt. Although Indian language information retrieval (ILIR) research is at a relatively nascent stage (especially with regard to large-scale quantitative evaluation), several research efforts in this area have been reported in the recent past. This chapter reviews the current state of the art in monolingual and cross-lingual information access in Indian languages and outlines a recent project that aims to create a comprehensive, end-to-end IR system for Indian languages, along with a standardized evaluation framework (in the spirit of TREC, CLEF, or NTCIR) that will provide a sound empirical basis for further work.

Keywords

Indian languages; Information retrieval (IR); Monolingual IR; Cross-lingual IR

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Prasenjit Majumder (1)
  • Mandar Mitra (1)
  1. CVPR Unit, Indian Statistical Institute, Kolkata, India
