The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC versus the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
Language Resources and Evaluation
Volume 43, Issue 3, pp. 209–226
- Springer Netherlands
- Annotated corpora
- Corpus construction
- General-purpose linguistic resources
- Web as corpus