Abstract
When corpora are analysed by automatic and statistical means, one should remember that the raw material being treated is language, whose specific nature ought to be considered at all stages of research. Since language cannot be investigated per se, corpora can only reveal the characteristics of limited instances of linguistic behaviour. Even exhaustive corpora supply only a finite set of texts, which should be assessed in the light of a number of extra-linguistic factors that affect linguistic traits: the sender’s and recipient’s region of origin, social and educational background, and gender; the channel of communication; the topic under discussion; the formality of the situation; and the historical period in which the texts were produced. Such factors help define the linguistic properties of every text (or text fragment) in the corpus, and their overall balance should be considered during the preliminary stages of corpus design and compilation.
Once the texts to be included in the corpus have been selected, the linguistic data need to be prepared for automatic processing. This stage, too, is far from intuitive or automatic: from the identification of tokens to the extraction of lemmas, researchers should take qualitative aspects into account. Neither corpus compilation nor pre-processing can be considered a neutral operation with respect to the results of automatic analysis, and both should be made explicit to enable the assessment of results and further exploitation of the same corpus.
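The point that token identification is not a neutral operation can be illustrated with a minimal sketch. It is not the procedure adopted in the chapter: the function names, the regular expressions and the sample sentence are assumptions introduced purely for illustration. The sketch applies three simple tokenisation policies to one Italian sentence containing elided forms.

```python
import re

def tokenize_whitespace(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def tokenize_words(text, split_apostrophes=False):
    """Extract lower-cased word tokens with a regular expression.

    If split_apostrophes is True, elided forms are split after the
    apostrophe ("dell'italiano" -> "dell'", "italiano"); otherwise the
    whole form is kept as a single token.
    """
    if split_apostrophes:
        # A run of word characters followed by an apostrophe, or a plain word.
        pattern = r"\w+['’]|\w+"
    else:
        # A word optionally continued by apostrophe + word ("un'operazione").
        pattern = r"\w+(?:['’]\w+)*"
    return re.findall(pattern, text.lower())

# Hypothetical sample sentence, chosen because it contains three elisions.
sample = "L'analisi dell'italiano non è un'operazione neutra."

whitespace_tokens = tokenize_whitespace(sample)
joined_tokens = tokenize_words(sample)
split_tokens = tokenize_words(sample, split_apostrophes=True)

print(len(whitespace_tokens), whitespace_tokens)
print(len(joined_tokens), joined_tokens)
print(len(split_tokens), split_tokens)
```

Under these assumptions the same sentence yields six tokens when elided forms are kept whole and nine when they are split, while whitespace splitting leaves the final full stop attached to the last word. Any frequency count or lexical measure computed afterwards depends on which policy was chosen, which is why such decisions should be documented along with the corpus.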
Notes
1. Consistently with the studies illustrated in this book, the examples provided in this chapter are mostly in English and Italian.
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Ondelli, S. (2018). Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora. In: Tuzzi, A. (ed.) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_7
DOI: https://doi.org/10.1007/978-3-319-97064-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97063-9
Online ISBN: 978-3-319-97064-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)