A Case Study in Decompounding for Bengali Information Retrieval
Decompounding has been found to improve information retrieval (IR) effectiveness for compounding languages such as Dutch, German, and Finnish. However, no previous studies exist on the effect of decompounding in IR for Indian languages. In this case study, we investigate the effect of decompounding for Bengali, a highly agglutinative Indian language. The standard approach to decompounding for IR, i.e. indexing compound parts (constituents) in addition to compound words, has proven beneficial for European languages. The experiments reported in this paper show that this standard approach does not work particularly well for Bengali IR. Bengali compounds have some unique characteristics: i) only one compound constituent may be a valid word, in contrast to the stricter requirement that both be valid; and ii) the first character of the right constituent can be modified by the rules of Sandhi, in contrast to simple concatenation. As a solution, we first propose a more relaxed decompounding in which a compound word is decomposed into only one constituent if the other constituent is not a valid word. Second, we perform selective decompounding by requiring that constituents co-occur often with the compound word, which indicates how closely related the constituents and the compound are. We perform experiments on the Bengali ad-hoc IR collections from FIRE 2008 to 2012. Our experiments show that both the relaxed decomposition and the co-occurrence-based constituent selection prove more effective than the standard frequency-based decomposition method, improving mean average precision (MAP) by up to 2.72% and recall by up to 1.8% compared to not decompounding words.
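The two ideas in the abstract, relaxed splitting and co-occurrence-based constituent selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the document-level co-occurrence ratio, the 0.5 threshold, the minimum constituent length, and the toy Latin-script data are all assumptions made for illustration, and Sandhi modifications at the constituent boundary are not modeled.

```python
from typing import Dict, List, Set


def split_candidates(word: str, lexicon: Set[str],
                     min_len: int = 2, relaxed: bool = True) -> List[List[str]]:
    """For each split point, collect the constituents that are valid words.

    Strict mode keeps a split only if both halves are in the lexicon.
    Relaxed mode (hypothetical reading of the abstract) also keeps a split
    where exactly one half is a valid word, returning just that half.
    """
    candidates = []
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        valid = [part for part in (left, right) if part in lexicon]
        if len(valid) == 2 or (relaxed and len(valid) == 1):
            candidates.append(valid)
    return candidates


def cooccurrence(compound: str, part: str,
                 postings: Dict[str, Set[int]]) -> float:
    """Assumed measure: share of the compound's documents that contain `part`."""
    compound_docs = postings.get(compound, set())
    if not compound_docs:
        return 0.0
    return len(compound_docs & postings.get(part, set())) / len(compound_docs)


def decompound(word: str, lexicon: Set[str], postings: Dict[str, Set[int]],
               threshold: float = 0.5, relaxed: bool = True) -> List[str]:
    """Return the index terms for `word`: the word plus selected constituents."""
    selected: List[str] = []
    for parts in split_candidates(word, lexicon, relaxed=relaxed):
        for part in parts:
            if part not in selected and cooccurrence(word, part, postings) >= threshold:
                selected.append(part)
    return [word] + selected


# Toy data (hypothetical): a lexicon and term -> document-id postings.
lexicon = {"rail", "way", "road"}
postings = {
    "railway": {1, 2, 3}, "rail": {1, 2, 5}, "way": {3, 7},
    "roadside": {4, 6}, "road": {4, 6, 8},
}

print(decompound("railway", lexicon, postings))
print(decompound("roadside", lexicon, postings))
print(decompound("roadside", lexicon, postings, relaxed=False))
```

In this toy setup, "way" is a valid word but is dropped because it rarely co-occurs with "railway", while "road" is indexed for "roadside" only in relaxed mode, since "side" is missing from the lexicon.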
Keywords: Machine Translation, Mean Average Precision, Compound Word, Statistical Machine Translation, European Language