Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili

De Pauw, Guy; Wagacha, Peter Waiganjo; de Schryver, Gilles-Maurice

doi:10.1007/s10579-011-9159-7

Original Paper
Published: 19 July 2011

Volume 45, pages 331–344 (2011)

Language Resources and Evaluation

Guy De Pauw^1,2,
Peter Waiganjo Wagacha^2 &
Gilles-Maurice de Schryver^3,4

Abstract

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the sawa corpus, a two-million-word parallel corpus English—Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence and word-aligned and morphosyntactic information was added to both the English and Swahili portion of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English—Swahili translation dictionaries. We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
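
To make the idea of annotation projection concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes that word alignments are available as `i-j` index pairs (English token index, Swahili token index), that English tokens already carry part-of-speech tags, and that each Swahili token receives the most frequent tag among the English tokens aligned to it. The function names, the tag labels, and the `UNK` placeholder for unaligned tokens are hypothetical choices made for this illustration.

```python
from collections import Counter


def parse_alignment(line):
    """Parse a string of 'i-j' pairs (source index, target index)."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def project_tags(src_tags, tgt_tokens, alignment, unknown="UNK"):
    """Project source-side tags onto target tokens through word alignments.

    Each target token collects the tags of all source tokens aligned to it
    and keeps the most frequent one. Unaligned target tokens get `unknown`.
    """
    votes = [Counter() for _ in tgt_tokens]
    for s, t in alignment:
        # Ignore alignment points that fall outside either sentence.
        if 0 <= s < len(src_tags) and 0 <= t < len(tgt_tokens):
            votes[t][src_tags[s]] += 1
    return [v.most_common(1)[0][0] if v else unknown for v in votes]


# Illustrative sentence pair (tags and alignment are made up for the example).
english_tags = ["DET", "NOUN", "AUX", "VERB", "DET", "NOUN"]  # the child is reading a book
swahili_tokens = ["mtoto", "anasoma", "kitabu"]
alignment = parse_alignment("1-0 3-1 5-2")

print(project_tags(english_tags, swahili_tokens, alignment))
# ['NOUN', 'VERB', 'NOUN']
```

In this toy pair, the English determiners and auxiliary have no separate Swahili counterpart, which hints at why projecting from English onto a morphologically richer language is not a one-to-one mapping.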

Notes

On an Intel Xeon 2.4 GHz system with 8 GB of RAM, training took about 36 hours. The classification phase fares better, taking only a few seconds per paragraph.

Acknowledgments

We are very grateful for the insightful and useful comments from the reviewers, which helped shape the final version of this paper. We are also greatly indebted to Dr. James Omboga Zaja for contributing some of his translated data, to Mahmoud Shokrollahi-Far for his advice on the Quran and to Anne Kimani, Chris Wangai Njoka and Naomi Maajabu for their tireless annotation efforts.

Author information

Authors and Affiliations

CLiPS, Department of Linguistics, University of Antwerp, Antwerp, Belgium
Guy De Pauw
School of Computing and Informatics, University of Nairobi, Nairobi, Kenya
Guy De Pauw & Peter Waiganjo Wagacha
Department of African Languages and Cultures, Ghent University, Ghent, Belgium
Gilles-Maurice de Schryver
Xhosa Department, University of the Western Cape, Cape Town, South Africa
Gilles-Maurice de Schryver

Corresponding author

Correspondence to Guy De Pauw.

Additional information

The research presented in this paper was made possible through the support of the VLIR-IUC-UON program and was partly funded by the sawa BOF UA-2007 project. The first author is funded as a Postdoctoral Fellow of the Research Foundation—Flanders (FWO).

About this article

Cite this article

De Pauw, G., Wagacha, P.W. & de Schryver, GM. Exploring the sawa corpus: collection and deployment of a parallel corpus English—Swahili. Lang Resources & Evaluation 45, 331–344 (2011). https://doi.org/10.1007/s10579-011-9159-7

Published: 19 July 2011
Issue Date: September 2011
DOI: https://doi.org/10.1007/s10579-011-9159-7
