Abstract
In a 12-month project we developed a new, register-diverse, 55-million-word bilingual corpus, the New Corpus for Ireland (NCI), to support the creation of a new English-to-Irish dictionary. This paper describes the strategies we employed and our solutions to the problems we encountered. We believe we have a good model for corpus creation for lexicography, and others may find it useful as a blueprint. The corpus has two parts, one Irish, the other Hiberno-English (English as spoken in Ireland). We describe its design, collection and encoding.
Notes
The project is under the direction of Foras na Gaeilge, the government-funded body responsible for the promotion of the Irish language throughout the island of Ireland, whose statutory functions include the development of new dictionaries (http://www.forasnagaeilge.ie). Full details of the NEID project can be found at http://www.focloir.ie. The main contractor for setting up the project, including corpus preparation, is Lexicography MasterClass Ltd (http://www.lexmasterclass.com/).
Figures from the 2002 Census.
Irish is taught throughout the school system, and about 30,000 students are educated in Irish-medium schools, ‘Gaelscoileanna’.
See http://natcorp.ox.ac.uk
See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
While this is clearly also true of English worldwide, it is a lesser consideration for English produced in Ireland, where English is the mother tongue of an overwhelming majority of the population.
See http://www.ul.ie/~lcie/
Since the work was done, the shingling algorithm (Broder, Glassman, Manasse, & Zweig, 1997) has become widely known as the leading tool for de-duplication.
The Constraint Grammar tool vislcg is downloadable at http://www.sourceforge.net
For alternative work on Irish grammar checking see: http://borel.slu.edu/gramadoir/
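As an illustration of the shingling idea mentioned in the de-duplication note above, the following is a minimal Python sketch. It is not the procedure used for the NCI; the note says only that shingling became widely known after the work was done. The sketch assumes whitespace tokenisation, a shingle width of 4 tokens, and exact set comparison. Broder et al. (1997) additionally reduce each shingle set to a compact sketch so that the method scales to the Web, a step omitted here. The function names `shingles` and `resemblance` are illustrative only.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token sequences (shingles) in text.

    Tokenisation by lowercasing and whitespace splitting is an assumption
    made for this sketch.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A text shorter than w tokens yields one shingle: the whole text.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the two shingle sets, a value between 0 and 1."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        # Two empty texts are treated as identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the corpus has two parts one irish the other english"
    doc2 = "the corpus has two parts one irish the other hiberno-english"
    print(resemblance(doc1, doc2))
```

Under this scheme, two documents whose resemblance exceeds a chosen threshold would be treated as near-duplicates and one of them dropped; the threshold value is a project-specific choice that the source does not state.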
References
An Roinn Oideachais. (1986). Foclóir Póca English-Irish/Irish-English Dictionary. Baile Átha Cliath: An Gúm.
Atkins, B. T. S. (2002). Then and now: Competence and performance in 35 years of lexicography. In Braasch & Povlsen (Eds.) Proceedings of the Tenth Euralex Congress (pp. 1–28). Denmark: University of Copenhagen.
Atkins, B. T. S., Clear, J. H., & Ostler, N. (1992). Corpus design criteria. Journal of Literary and Linguistic Computing, 1–16.
Beesley, K. & Karttunen, L. (2003). Finite state morphology. California: CSLI Publications.
Broder, A., Glassman, S., Manasse, M. & Zweig, G. (1997). Syntactic clustering on the Web. In Proceedings of the 6th International World-Wide Web Conference.
Census of Ireland. (2002). Volume 11 Irish language. Tables 7A and 31A. http://www.cso.ie/.
Clough, P., Gaizauskas, R., Piao, S. & Wilks, Y. (2002). MeTeR, Measuring Text Reuse. Proc. 40th Anniversary Meeting for the Association for Computational Linguistics (ACL-02) (pp. 152–159). 7–12 July, University of Pennsylvania, Philadelphia, USA.
Christian Brothers. (1980). New Irish grammar. Dublin: Fallons.
de Bhaldraithe, T. (1959). English–Irish dictionary. Baile Átha Cliath: An Gúm.
Grefenstette, G., & Nioche, J. (2000). Estimation of English and non-English Language Use on the WWW. Proc. RIAO (Recherche d’Informations Assistée par Ordinateur), Paris.
Janes, A. (2004). Bilingual comparable corpora for bilingual lexicography. MSc Dissertation, University of Brighton.
Johnson, S. (1747). The plan of an English dictionary.
Jones, R. & Ghani, R. (2000). Automatically building a corpus for a minority language from the web. 38th Meeting of the ACL, Proceedings of the Student Research Workshop (pp. 29–36). Hong Kong.
Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (Eds.) (1995). Constraint grammar: A language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.
Karttunen, L. & Beesley, K. (1992). Two-level rule compiler. Technical report, Xerox PARC.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. Proceedings of the Eleventh Euralex Congress (pp. 105–116). France: UBS Lorient.
Kilgarriff, A., & Grefenstette, G. (2003). Web as Corpus: Introduction to the special issue. Computational Linguistics, 29(3), 333–347.
Schulze, B. & Christ, O. (1994). The IMS Corpus Workbench. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Tapanainen, P. (1996). The Constraint Grammar Parser CG-2. Publication No. 27, University of Helsinki.
Trench, R. C. (1857). On some deficiencies in our English dictionaries. London: The Philological Society. (reprinted at http://www.oed.com/archive/paper-deficiencies/).
Uí Dhonnchadha, E. (2002). An analyser and generator for Irish inflectional morphology using finite state transducers. Unpublished MSc Thesis: Dublin, DCU.
Uí Dhonnchadha, E., Nic Pháidín, C., & Van Genabith, J. (2003). Design, implementation and evaluation of an inflectional morphology finite-state transducer for Irish. In MT Journal - Special issue on finite state language resources and language processing. Kluwer.
Uí Dhonnchadha, E., & Van Genabith, J. (2005). Scaling an Irish FST morphology engine for use on unrestricted text. In Proceedings of FSMNLP 2005, Helsinki, September 2005.
Acknowledgements
In addition to the authors, the main corpus-development team comprised Steve Finch, Eamon Keegan, Eoghan Mac Aogáin, Mark McLauchlan, Lisa Nic Shea, Jo O’Donoghue, Paul Atkins, Pavel Rychly and Dan Xu, all of whom deserve our heartfelt gratitude. We would also like to thank Seosamh Ó Murchú, Foras na Gaeilge’s Project Manager for the NEID, for his supportive role; Josef van Genabith of Dublin City University, for arranging the student internships; Dónall Ó Riagáin for helpful advice at the corpus design stage; John Kirk of the Queen’s University, Belfast, for permission to use NICTS; and Anne O’Keefe and Fiona Farr of the University of Limerick, for permission to use the Limerick Corpus of Irish English.
Cite this article
Kilgarriff, A., Rundell, M. & Uí Dhonnchadha, E. Efficient corpus development for lexicography: building the New Corpus for Ireland. Lang Resources & Evaluation 40, 127–152 (2006). https://doi.org/10.1007/s10579-006-9011-7
DOI: https://doi.org/10.1007/s10579-006-9011-7