Abstract
Using a case study approach, this chapter provides a worked demonstration of how a detailed design frame for a national corpus may be built. Taking each ‘mode’ of data in turn (i.e. spoken, written and e-language), the chapter explores some key generic questions that help to inform design frame development. By reporting the specific targets presented and justifying them, the chapter offers useful guidance for those considering creating their own corpus design frame. While the points raised relate specifically to our experiences of creating design frames for CorCenCC, the guidance provided can facilitate and inform the design of corpora in other languages.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Knight, D., Morris, S., Arman, L., Needs, J., Rees, M. (2021). Design Frames. In: Building a National Corpus. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-81858-6_2
DOI: https://doi.org/10.1007/978-3-030-81858-6_2
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-030-81857-9
Online ISBN: 978-3-030-81858-6
eBook Packages: Social Sciences (R0)