Design Frames

Building a National Corpus

Abstract

Using a case study approach, this chapter demonstrates how a detailed design frame for a national corpus may be built. Taking each ‘mode’ of data in turn (i.e. spoken, written and e-language), the chapter explores some key generic questions that help to inform design frame development. By reporting the specific targets presented and justifying them, the chapter offers useful guidance for those considering creating their own corpus design frame. While the points raised relate to our experiences of creating design frames for CorCenCC specifically, the guidance provided can inform the design of corpora in other languages.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Knight, D., Morris, S., Arman, L., Needs, J., Rees, M. (2021). Design Frames. In: Building a National Corpus. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-030-81858-6_2

  • DOI: https://doi.org/10.1007/978-3-030-81858-6_2

  • Publisher Name: Palgrave Macmillan, Cham

  • Print ISBN: 978-3-030-81857-9

  • Online ISBN: 978-3-030-81858-6

  • eBook Packages: Social Sciences, Social Sciences (R0)
