Language Resources and Evaluation, Volume 51, Issue 3, pp 643–662

Building and evaluating web corpora representing national varieties of English

  • Paul Cook
  • Laurel J. Brinton
Original Paper


Corpora are essential resources for language studies, as well as for training statistical natural language processing systems. Although very large English corpora have been built, only relatively small corpora are available for many varieties of English. National top-level domains (e.g., .au, .ca) could be exploited to automatically build web corpora, but it is unclear whether such corpora would reflect the corresponding national varieties of English; i.e., would a web corpus built from the .ca domain correspond to Canadian English? In this article we build web corpora from national top-level domains corresponding to countries in which English is widely spoken. We then carry out statistical analyses of these corpora in terms of keywords, measures of corpus comparison based on the Chi-square test and spelling variants, and the frequencies of words known to be marked in particular varieties of English. We find evidence that the web corpora indeed reflect the corresponding national varieties of English. We then demonstrate, through a case study on the analysis of Canadianisms, that these corpora could be valuable lexicographical resources.
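To make the idea of a chi-square-based corpus comparison concrete, here is a minimal sketch. It is not the authors' implementation: the function name `chi_square_distance`, the `top_n` cutoff, and the toy token lists are all assumptions for illustration. It sums chi-square contributions over the most frequent words of the two corpora combined, so a lower score suggests more similar corpora.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of both corpora.

    Hypothetical helper for illustration; the cutoff of 500 is an assumption.
    """
    freq_a = Counter(tokens_a)
    freq_b = Counter(tokens_b)
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")

    joint = freq_a + freq_b          # combined word counts
    total = size_a + size_b
    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread evenly by corpus size.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


# Toy token lists (invented for this sketch, not data from the article).
ca_tokens = "the colour of the centre".split()
us_tokens = "the color of the center".split()
print(chi_square_distance(ca_tokens, us_tokens))
```

On real data the token lists would come from tokenized web corpora, and the scores would be compared across pairs of corpora rather than read in isolation.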


Web corpora; Corpus evaluation; Corpus similarity; Varieties of English; Canadian English



This research was started while the first author was a McKenzie Postdoctoral Fellow at The University of Melbourne. This research was financially supported by the University of Melbourne, the University of New Brunswick, and the Natural Sciences and Engineering Research Council of Canada.



Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. Faculty of Computer Science, University of New Brunswick, Fredericton, Canada
  2. Department of English, University of British Columbia, Vancouver, Canada
