
Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili

  • Guy De Pauw
  • Peter Waiganjo Wagacha
  • Gilles-Maurice de Schryver
Original Paper

Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word English–Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. First, we describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili. Second, we describe the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries. We particularly focus on the difficulties of translating English into Swahili, a morphologically more complex Bantu language.
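
As a rough illustration of what projecting part-of-speech annotation across a word-aligned sentence pair can look like, the sketch below copies English tags onto Swahili tokens through alignment links, using a majority vote when several English words link to the same Swahili token. It is a minimal sketch and not the procedure used in the paper: the function name `project_tags`, the tag labels, the `UNK` fallback, and the toy sentence pair with its alignment are all assumptions made for this example and are not taken from the sawa corpus.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, default="UNK"):
    """Project source-side tags onto target tokens via alignment links.

    source_tags: one tag per source token.
    alignment:   iterable of (source_index, target_index) pairs.
    target_len:  number of target tokens.
    default:     tag given to target tokens with no alignment link
                 (hypothetical fallback, chosen for this sketch).
    """
    # One vote counter per target token.
    votes = [Counter() for _ in range(target_len)]
    for src, tgt in alignment:
        votes[tgt][source_tags[src]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Majority vote; on a tie the tag counted first wins.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default)
    return projected


# Toy sentence pair (illustrative only, not drawn from the corpus).
english = ["the", "child", "reads", "a", "book"]
english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
swahili = ["mtoto", "anasoma", "kitabu"]

# Hypothetical alignment: the English articles have no Swahili counterpart.
alignment = [(1, 0), (2, 1), (4, 2)]

swahili_tags = project_tags(english_tags, alignment, len(swahili))
for token, tag in zip(swahili, swahili_tags):
    print(token, tag)
```

In this toy pair every Swahili token has exactly one link, so the vote is trivial. A single Swahili verb form often corresponds to several English words, which is where a simple vote like this one becomes unreliable.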

Keywords

Parallel corpus; Swahili; English; Machine translation; Projection of annotation; African language technology

Notes

Acknowledgments

We are very grateful for the insightful and useful comments from the reviewers, which helped shape the final version of this paper. We are also greatly indebted to Dr. James Omboga Zaja for contributing some of his translated data, to Mahmoud Shokrollahi-Far for his advice on the Quran and to Anne Kimani, Chris Wangai Njoka and Naomi Maajabu for their tireless annotation efforts.

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Guy De Pauw (1, 2)
  • Peter Waiganjo Wagacha (2)
  • Gilles-Maurice de Schryver (3, 4)
  1. CLiPS, Department of Linguistics, University of Antwerp, Antwerp, Belgium
  2. School of Computing and Informatics, University of Nairobi, Nairobi, Kenya
  3. Department of African Languages and Cultures, Ghent University, Ghent, Belgium
  4. Xhosa Department, University of the Western Cape, Cape Town, South Africa
