Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora

Ondelli, Stefano

doi:10.1007/978-3-319-97064-6_7

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

When analysing corpora with automatic and statistical means, one should remember that the raw material being treated is language, and its specific nature ought to be considered at all stages of research. Since language cannot be investigated per se, corpora can only reveal the characteristics of limited instances of linguistic behaviour. Even exhaustive corpora supply only a finite set of texts, which should be assessed in the light of a number of extra-linguistic factors that affect linguistic traits from different viewpoints: the sender’s and recipient’s region of origin, social and educational background and gender; the channel of communication; the topic under discussion and the formality of the situation; and the period in history when the texts were produced. Such factors come into play in defining the linguistic properties of every single text (or text fragment) in the corpus, and their overall balance should be considered during the preliminary stages of corpus design and compilation.
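As an illustration of what checking the "overall balance" of extra-linguistic factors might look like in practice, the following is a minimal sketch and not a procedure taken from the chapter. The `TextRecord` fields, the `token_share` helper and the sample values are all hypothetical; a real corpus would record whichever factors its design calls for.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata for one text in a corpus."""
    text_id: str
    channel: str     # e.g. "written" or "spoken"
    formality: str   # e.g. "formal" or "informal"
    decade: int      # period in which the text was produced
    n_tokens: int    # size of the text in tokens


def token_share(records, attribute):
    """Return the share of corpus tokens for each value of one metadata attribute."""
    totals = Counter()
    for record in records:
        totals[getattr(record, attribute)] += record.n_tokens
    grand_total = sum(totals.values())
    return {value: count / grand_total for value, count in totals.items()}


# Invented sample data, for illustration only.
records = [
    TextRecord("t1", "written", "formal", 1990, 6000),
    TextRecord("t2", "written", "informal", 2010, 2000),
    TextRecord("t3", "spoken", "informal", 2010, 2000),
]

shares = token_share(records, "channel")
for channel, share in shares.items():
    print(f"{channel}: {share:.0%} of tokens")
```

In this invented sample, written texts account for most of the tokens, which is the kind of imbalance a corpus designer would want to know about before interpreting any results.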

Once the texts to be included in the corpus have been selected, the linguistic data need to be prepared for automatic processing. This stage, too, is far from intuitive or automatic: from the very identification of tokens to the extraction of lemmas, researchers should take qualitative aspects into account. Neither corpus compilation nor pre-processing can be considered a neutral operation with respect to the results of automatic analysis, and the choices made at both stages should be stated explicitly so that the results can be assessed and the same corpus can be exploited further.
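To make the point about token identification concrete, here is a small sketch, offered as an assumption-laden illustration and not as the chapter's own method. The two tokenizers (`tokenize_whitespace`, `tokenize_split_apostrophes`) and the sample strings are hypothetical; they only show that different, equally defensible rules yield different token counts for the same text.

```python
import re


def tokenize_whitespace(text):
    """Treat every whitespace-delimited string as one token."""
    return text.split()


def tokenize_split_apostrophes(text):
    """Split after an apostrophe, keeping it attached to the preceding element.

    This suits Italian elision (dell'italiano -> dell' + italiano)
    but gives a questionable result for English contractions.
    """
    return re.findall(r"\w+['’]?|[^\w\s]", text)


italian = "l'assetto dell'italiano"
english = "don't stop"

for sample in (italian, english):
    print(sample)
    print("  whitespace :", tokenize_whitespace(sample))
    print("  apostrophes:", tokenize_split_apostrophes(sample))
```

The Italian sample counts as two tokens under the first rule and four under the second, while the second rule splits the English contraction into `don'` and `t`. Neither outcome is wrong in itself, but each changes the frequencies any later statistical analysis will see, which is why such decisions need to be documented.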

Notes

1. In keeping with the studies illustrated in this book, the examples provided in this chapter are mostly in English and Italian.

Author information

Authors and Affiliations

University of Trieste, Trieste, Italy
Stefano Ondelli

Corresponding author

Correspondence to Stefano Ondelli.

Editor information

Editors and Affiliations

Department of Philosophy, Sociology, Education and Applied Psychology, University of Padova, Padova, Italy
Arjuna Tuzzi

About this chapter

Cite this chapter

Ondelli, S. (2018). Treat Texts as Data but Remember They Are Made of Words: Compiling and Pre-processing Corpora. In: Tuzzi, A. (eds) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-97064-6_7

DOI: https://doi.org/10.1007/978-3-319-97064-6_7
Published: 31 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97063-9
Online ISBN: 978-3-319-97064-6
eBook Packages: Mathematics and Statistics (R0)
