Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili

  • Original Paper
  • Language Resources and Evaluation

Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word English–Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
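
To make the idea of annotation projection concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that word alignment is given as (English index, Swahili index) pairs and that English tags follow a coarse Universal Dependencies-style tagset. The function names, the toy sentence pair, and the content-word priority heuristic are all hypothetical. The example shows the many-to-one alignments that a morphologically rich target language tends to produce: the single Swahili verb form "nitasoma" corresponds to three English tokens.

```python
# Hypothetical priority list: prefer content-word tags when several
# English tags land on the same Swahili token (an assumption, not the
# paper's method).
CONTENT_FIRST = ("VERB", "NOUN", "ADJ", "ADV")


def project_pos_tags(source_tags, alignment, target_length):
    """Collect, for each target token, the tags of all aligned source tokens."""
    projected = [[] for _ in range(target_length)]
    for src_idx, tgt_idx in sorted(alignment):
        projected[tgt_idx].append(source_tags[src_idx])
    return projected


def resolve(candidates, priority=CONTENT_FIRST, unknown="UNK"):
    """Pick one tag per target token; unaligned tokens get `unknown`."""
    for tag in priority:
        if tag in candidates:
            return tag
    return candidates[0] if candidates else unknown


# Toy sentence pair: "I will read the book" / "nitasoma kitabu".
english_tags = ["PRON", "AUX", "VERB", "DET", "NOUN"]
swahili = ["nitasoma", "kitabu"]
# "I", "will" and "read" all align to "nitasoma"; "the" has no counterpart.
alignment = [(0, 0), (1, 0), (2, 0), (4, 1)]

candidates = project_pos_tags(english_tags, alignment, len(swahili))
swahili_tags = [resolve(c) for c in candidates]
print(list(zip(swahili, swahili_tags)))
```

Keeping the full candidate list per target token, rather than a single tag, leaves room for other resolution strategies, such as voting over the whole corpus.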

Notes

  1. On an Intel Xeon 2.4 GHz system with 8 GB RAM, training took about 36 hours. The classification phase is much faster, taking only a few seconds per paragraph.

Acknowledgments

We are very grateful for the insightful and useful comments from the reviewers, which helped shape the final version of this paper. We are also greatly indebted to Dr. James Omboga Zaja for contributing some of his translated data, to Mahmoud Shokrollahi-Far for his advice on the Quran and to Anne Kimani, Chris Wangai Njoka and Naomi Maajabu for their tireless annotation efforts.

Author information

Corresponding author

Correspondence to Guy De Pauw.

Additional information

The research presented in this paper was made possible through the support of the VLIR-IUC-UON program and was partly funded by the sawa BOF UA-2007 project. The first author is funded as a Postdoctoral Fellow of the Research Foundation—Flanders (FWO).

About this article

Cite this article

De Pauw, G., Wagacha, P.W. & de Schryver, GM. Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili. Lang Resources & Evaluation 45, 331–344 (2011). https://doi.org/10.1007/s10579-011-9159-7

  • DOI: https://doi.org/10.1007/s10579-011-9159-7