Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora

Tracing the Life Cycle of Ideas in the Humanities and Social Sciences

Abstract

When corpora are analysed by automatic and statistical means, one should remember that the raw material is language, whose specific nature ought to be considered at every stage of research. Since language cannot be investigated per se, corpora can only reveal the characteristics of limited instances of linguistic behaviour. Even exhaustive corpora supply only a finite set of texts, which should be assessed in the light of extra-linguistic factors that affect linguistic traits in different ways: the sender’s and recipient’s region of origin, social and educational background and gender; the channel of communication; the topic under discussion; the formality of the situation; and the period in history when the texts were produced. Such factors help define the linguistic properties of every text (or text fragment) in the corpus, and their overall balance should be considered during the preliminary stages of corpus design and compilation.

Once the texts to be included in the corpus have been selected, the linguistic data need to be prepared for automatic processing. This stage, too, is far from intuitive or automatic: from the identification of tokens to the extraction of lemmas, researchers should take qualitative aspects into account. Neither corpus compilation nor pre-processing can be considered a neutral operation with respect to the results of automatic analysis; both should be documented explicitly so that results can be assessed and the same corpus can be exploited further.
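The point that token identification is not a neutral step can be illustrated with a minimal sketch in Python. It is not taken from the chapter: the function names, the sample sentence and the three tokenisation rules are assumptions chosen purely for illustration, and real corpus pipelines involve many more decisions (multiword units, clitics, abbreviations, lemmatisation).

```python
import re

# Hypothetical sample mixing English contractions and Italian elided forms.
SAMPLE = "Don't forget: l'analisi dell'autore isn't neutral."

# A "letter" here is any Unicode word character that is not a digit or underscore.
LETTERS = r"[^\W\d_]+"


def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def tokenize_keep_apostrophes(text):
    """Lower-case the text and keep apostrophe-joined forms as single tokens."""
    pattern = LETTERS + r"(?:['’]" + LETTERS + r")*"
    return re.findall(pattern, text.lower())


def tokenize_split_apostrophes(text):
    """Lower-case the text and treat every apostrophe as a token boundary."""
    return re.findall(LETTERS, text.lower())


if __name__ == "__main__":
    for tokenizer in (
        tokenize_whitespace,
        tokenize_keep_apostrophes,
        tokenize_split_apostrophes,
    ):
        tokens = tokenizer(SAMPLE)
        print(f"{tokenizer.__name__}: {len(tokens)} tokens -> {tokens}")
```

The same sentence yields different token lists, and therefore different frequency counts, depending on the rule applied. None of the three rules is correct in itself, which is why the choice should be reported together with the results.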

Notes

  1. Consistent with the studies illustrated in this book, most of the examples provided in this chapter are in English and Italian.

Author information

Corresponding author

Correspondence to Stefano Ondelli.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Ondelli, S. (2018). Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora. In: Tuzzi, A. (ed.) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_7
