Advertisement

A Case Study in Decompounding for Bengali Information Retrieval

  • Debasis Ganguly
  • Johannes Leveling
  • Gareth J. F. Jones
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8138)

Abstract

Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, or Finnish. No previous studies, however, exist on the effect of decomposition of compounds in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach of decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. Our experiments reported in this paper show that such a standard approach does not work particularly well for Bengali IR. Some unique characteristics of Bengali compounds are: i) only one compound constituent may be a valid word in contrast to the stricter requirement of both being so; and ii) the first character of the right constituent can be modified by the rules of Sandhi in contrast to simple concatenation. As a solution, we firstly propose a more relaxed decompounding where a compound word is decomposed into only one constituent if the other constituent is not a valid word, and secondly we perform selective decompounding by ensuring that constituents often co-occur with the compound word, which indicates how related the constituents and the compound are. We perform experiments on Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection proves more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) up to 2.72% and recall up to 1.8%, compared to not decompounding words.

Keywords

Machine Translation Mean Average Precision Compound Word Statistical Machine Translation European Language 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Alfonseca, E., Bilac, S., Pharies, S.: Decompounding query keywords from compounding languages. In: ACL/HLT 2008, HLT-Short 2008, pp. 253–256 (2008)Google Scholar
  2. 2.
    Braschler, M., Ripplinger, B.: Stemming and decompounding for German text retrieval. In: Sebastiani, F. (ed.) ECIR 2003. LNCS, vol. 2633, pp. 177–192. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  3. 3.
    Monz, C., de Rijke, M.: Shallow morphological analysis in monolingual information retrieval for Dutch, German, and Italian. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 262–277. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  4. 4.
    Koehn, P., Knight, K.: Empirical methods for compound splitting. In: EACL 2003, pp. 187–193. ACL, Stroudsburg (2003)Google Scholar
  5. 5.
    Chen, A., Gey, F.C.: Multilingual information retrieval using machine translation, relevance feedback and decompounding. Inf. Retr. 7(1-2), 149–182 (2004)CrossRefGoogle Scholar
  6. 6.
    Dash, N.S.: The morphodynamics of Bengali compounds – decomposing them for lexical processing. Language in India 6 (2006)Google Scholar
  7. 7.
    Dasgupta, S., Khan, M.: Morphological parsing of Bangla words using PC-KIMMO. In: ICCIT 2004 (2004)Google Scholar
  8. 8.
    Dasgupta, S., Ng, V.: High-performance, language-independent morphological segmentation. In: Sidner, C.L., Schultz, T., Stone, M., Zhai, C. (eds.) Proceedings of NAACL HLT 2007, April 22-27, pp. 155–163. ACL, Rochester (2007)Google Scholar
  9. 9.
    Roy, M.: Approaches to handle scarce resources for Bengali statistical machine translation. PhD thesis, School of Computing, Simon Fraser University (2010)Google Scholar
  10. 10.
    Deepa, S.R., Bali, K., Ramakrishnan, A.G., Talukdar, P.P.: Automatic generation of compound word lexicon for Hindi speech synthesis. In: LREC 2004 (2004)Google Scholar
  11. 11.
    McNamee, P.: N-gram tokenization for Indian language text retrieval. In: FIRE 2008, Kolkata, India (2008)Google Scholar
  12. 12.
    Leveling, J., Jones, G.J.F.: Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR. TALIP 9(3) (September 2010)Google Scholar
  13. 13.
    Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)Google Scholar
  14. 14.
    Hiemstra, D.: Using Language Models for Information Retrieval. PhD thesis, Center of Telematics and Information Technology, AE Enschede, The Netherlands (2000)Google Scholar
  15. 15.
    Ganguly, D., Leveling, J., Jones, G.J.F.: DCU@FIRE 2012: Rule-based stemmers for Bengali and Hindi. In: FIRE 2012, pp. 37–42. ISI, Kolkata (2012)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Debasis Ganguly
    • 1
  • Johannes Leveling
    • 1
  • Gareth J. F. Jones
    • 1
  1. 1.CNGL, School of ComputingDublin City UniversityDublin 9Ireland

Personalised recommendations