Language Resources and Evaluation, Volume 40, Issue 2, pp 127–152

Efficient corpus development for lexicography: building the New Corpus for Ireland

  • Adam Kilgarriff
  • Michael Rundell
  • Elaine Uí Dhonnchadha
Original Paper

Abstract

In a 12-month project we developed a new, register-diverse, 55-million-word bilingual corpus, the New Corpus for Ireland (NCI), to support the creation of a new English-to-Irish dictionary. This paper describes the strategies we employed and our solutions to the problems we encountered. We believe the project offers a good model of corpus creation for lexicography, which others may find useful as a blueprint. The corpus has two parts: one Irish, the other Hiberno-English (English as spoken in Ireland). We describe its design, collection and encoding.

Keywords

Corpus linguistics; Lexicography; Computational linguistics; Natural language processing; Dictionaries; Irish Gaelic; Hiberno-English; Language technology

Notes

Acknowledgements

In addition to the authors, the main corpus-development team comprised Steve Finch, Eamon Keegan, Eoghan Mac Aogáin, Mark McLauchlan, Lisa Nic Shea, Jo O’Donoghue, Paul Atkins, Pavel Rychly and Dan Xu, all of whom deserve our heartfelt gratitude. We would also like to thank Seosamh Ó Murchú, Foras na Gaeilge’s Project Manager for the NEID, for his supportive role; Josef van Genabith of Dublin City University, for arranging the student internships; Dónall Ó Riagáin for helpful advice at the corpus design stage; John Kirk of the Queen’s University, Belfast, for permission to use NICTS; and Anne O’Keefe and Fiona Farr of the University of Limerick, for permission to use the Limerick Corpus of Irish English.

Copyright information

© Springer Science+Business Media 2006

Authors and Affiliations

  • Adam Kilgarriff (1)
  • Michael Rundell (1)
  • Elaine Uí Dhonnchadha (2)

  1. Lexicography MasterClass Ltd, Brighton, UK
  2. Trinity College, Dublin, Ireland
