Exploring the SAWA corpus: collection and deployment of a parallel corpus English–Swahili
Abstract
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English–Swahili parallel corpus. We describe the data collection phase and concentrate on the difficulty of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses: the projection of part-of-speech annotation from English onto Swahili, and the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries. We particularly focus on the difficulties of translating English into Swahili, a morphologically more complex Bantu language.
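As a rough illustration of what projecting part-of-speech annotation across a word alignment can involve, the following minimal sketch copies English tags onto aligned Swahili tokens by majority vote. It is not the procedure used in this work: the function name `project_pos_tags`, the tag labels, the `UNK` fallback and the toy sentence pair are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def project_pos_tags(english_tags, alignment, swahili_length, default_tag="UNK"):
    """Project English POS tags onto Swahili tokens via word alignment links.

    english_tags   -- list of POS tags, one per English token
    alignment      -- iterable of (english_index, swahili_index) pairs
    swahili_length -- number of tokens in the Swahili sentence
    default_tag    -- hypothetical fallback tag for unaligned Swahili tokens
    """
    # Collect, for every Swahili position, the tags of all aligned English tokens.
    votes = defaultdict(Counter)
    for en_idx, sw_idx in alignment:
        votes[sw_idx][english_tags[en_idx]] += 1

    projected = []
    for sw_idx in range(swahili_length):
        if votes[sw_idx]:
            # Majority vote among the tags projected onto this token.
            projected.append(votes[sw_idx].most_common(1)[0][0])
        else:
            # No alignment link reaches this token.
            projected.append(default_tag)
    return projected


# Toy example (assumed data): English "I love you" aligned to the single
# Swahili token "ninakupenda". All three English tokens link to index 0.
english_tags = ["PRON", "VERB", "PRON"]
alignment = [(0, 0), (1, 0), (2, 0)]
print(project_pos_tags(english_tags, alignment, swahili_length=1))
```

The toy pair also hints at the difficulty noted above: one morphologically complex Swahili token can absorb several English words, so a simple majority vote assigns it a pronoun tag even though a verb tag would arguably be more appropriate.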
Keywords
Parallel corpus, Swahili, English, Machine translation, Projection of annotation, African language technology
Acknowledgments
We are very grateful for the insightful and useful comments from the reviewers, which helped shape the final version of this paper. We are also greatly indebted to Dr. James Omboga Zaja for contributing some of his translated data, to Mahmoud Shokrollahi-Far for his advice on the Quran and to Anne Kimani, Chris Wangai Njoka and Naomi Maajabu for their tireless annotation efforts.