Abstract
This chapter provides both a theoretical discussion of what multivariate exploratory approaches entail and step-by-step instructions to implement each of them with R. Four methods are presented: correspondence analysis, multiple correspondence analysis, principal component analysis, and exploratory factor analysis. These methods are designed to explore and summarize large and complex data tables by means of summary statistics. They help generate hypotheses by providing informative clusters using the variable values that characterize each observation.
Notes
- 1.
I am using scare quotes because, as Kilgarriff (2005) puts it, “language is never, ever, ever, random”.
- 2.
See Baayen (2008, Sect. 5.1.1).
- 3.
A second kind of PCA is based on loadings (Baayen 2008, Sect. 5.1.1). Loadings are correlations between the original variables and the unit-scaled principal components. The two kinds of PCA are similar: both are meant to normalize the coordinates of the data points. The variant exemplified in this chapter is more flexible because it allows for the introduction of supplementary variables.
- 4.
- 5.
- 6.
Details on how the data were extracted can be found in this blog post: https://corpling.hypotheses.org/284 (accessed 9 June 2019).
- 7.
- 8.
The code for the extraction was partly contributed by Mathilde Léger, a third-year student at Paris 8 University, as part of her end-of-term project.
- 9.
http://www.natcorp.ox.ac.uk/docs/catRef.xml (accessed 9 June 2019).
- 10.
Several packages and functions implement PCA in R, e.g. princomp() and prcomp() from the stats package, ggbiplot() from the ggbiplot package (which is itself based on ggplot2), dudi.pca() from the ade4 package, and PCA() from the FactoMineR package. Note that princomp() and prcomp() perform PCA based on loadings.
- 11.
For this kind of analysis, the first two components should represent a cumulative percentage of variance that is far above 50%. The more dimensions there are in the input data table, the harder it will be to reach this percentage.
- 12.
The contribution is a measure of how much an individual contributes to the construction of a component.
- 13.
The squared cosine (cos²) is a measure of how well an individual is projected onto a component.
- 14.
The number of factors worth keeping is chosen on the basis of a metric known as SS loadings, as explained below.
- 15.
The FactoMineR package includes several extensions of factor analysis. Multiple factor analysis (MFA) is used to explore datasets where variables are structured into groups. Like PCA, it can handle continuous and/or categorical variables simultaneously (Pagès 2014). MFA further breaks down into hierarchical multiple factor analysis (Lê and Pagès 2003) and dual multiple factor analysis (Lê and Pagès 2010). Although commonly used in sensorimetrics, these methods are rare in linguistics.
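The quantities defined in notes 11 to 13 (cumulative percentage of variance, contribution, and squared cosine) can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the chapter's own code: it uses prcomp() rather than PCA() from FactoMineR, the built-in USArrests table stands in for the chapter's data, and the object names (pca, cos2, contrib) are hypothetical. Individuals are assumed to carry equal weights.

```r
# Sketch only: USArrests is a built-in R dataset, not the chapter's data.
# Variables are centred and scaled to unit variance before the PCA.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Note 11: percentage of variance per component and cumulative percentage.
eig <- pca$sdev^2
pct_var <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct_var)

# Coordinates of the individuals (rows) on the components (columns).
coords <- pca$x

# Note 13: squared cosine = squared coordinate of an individual on a
# component divided by its squared distance to the origin.
# Across all components, each row sums to 1.
cos2 <- coords^2 / rowSums(coords^2)

# Note 12: contribution (in %) = share of a component's variance that is
# due to each individual (equal weights assumed). Each column sums to 100.
contrib <- 100 * sweep(coords^2, 2, colSums(coords^2), "/")

print(round(cum_pct, 1))
print(round(head(cos2[, 1:2]), 3))
print(round(head(contrib[, 1:2]), 2))
```

An individual with a low cos² on the first two components is poorly represented on the corresponding plane, so its position there should be interpreted with caution even if its contribution is high.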
References
Baayen, R.H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Benzécri, J.-P. (1984). Analyse des correspondances: Exposé Élémentaire (Vol. 1). Pratique de l’Analyse des Données. Paris: Dunod.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Biber, D. (2001). Dimensions of variation among eighteenth-century registers. In H.-J. Diller & M. Görlach (Eds.), Towards a history of English as a history of genres (pp. 89–110). Heidelberg: C. Winter.
Biber, D., & Conrad, S. (2001). Variation in English: Multi-dimensional studies. London: Longman.
Biber, D., & Gray, B. (2016). Grammatical complexity in academic English: Linguistic change in writing. Cambridge: Cambridge University Press.
Biber, D., & Hared, M. (1992). Dimensions of register variation in Somali. Language Variation and Change, 4(1), 41–75.
de Leeuw, J., & Mair, P. (2009). Simple and canonical correspondence analysis using the R package anacor. Journal of Statistical Software, 31(5), 1–18.
Desagulier, G. (2015). A lesson from associative learning: Asymmetry and productivity in multiple-slot constructions. Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2015-0012.
Desagulier, G. (2017). Corpus linguistics and statistics with R: Introduction to quantitative methods in linguistics. Quantitative methods in the humanities and social sciences. New York: Springer.
Francis, W. N., & Kučera, H. (1964). A standard corpus of present-day edited American English, for use with digital computers (Brown). Providence: Brown University.
Glynn, D. (2014). The many uses of run. In D. Glynn & J. A. Robinson (Eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy. Human cognitive processing (Vol. 43, pp. 117–144). Amsterdam: John Benjamins.
Gréa, P. (2017). Inside in French. Cognitive Linguistics, 28(1), 77–130.
Greenacre, M. J. (2007). Correspondence analysis in practice (Vol. 2). Interdisciplinary statistics series. Boca Raton: Chapman & Hall/CRC.
Gries, S. T. (2006). Corpus-based methods and cognitive semantics: The many senses of to run. In S. T. Gries & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin: Mouton de Gruyter.
Grieve, J., et al. (2010). Variation among blogs: A multi-dimensional analysis. In A. Mehler, S. Sharoff, & M. Santini (Eds.), Genres on the web. Text, speech and language technology (Vol. 42, pp. 303–322). New York: Springer.
Habert, B. (1985). L’analyse des formes «spécifiques» [bilan critique et propositions d’utilisation]. Mots, 11(1), 127–154.
Hirschfeld, H. O. (1935). A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4), 520–524.
Hirschmüller, H. (1989). The use of complex prepositions in Indian English in comparison with British and American English. In G. Graustein & W. Thiele (Eds.), Englische Textlinguistik und Varietätenforschung. Linguistische Arbeitsberichte (Vol. 69, pp. 52–58). Leipzig: Karl Marx Universität.
Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
Husson, F., Lê, S., & Pagès, J. (2010). Exploratory multivariate analysis by example using R. London: CRC Press.
Kassambara, A., & Mundt, F. (2017). Factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.5.
Kay, P. (2013). The limits of (construction) grammar. In T. Hoffmann & G. Trousdale (Eds.), The Oxford handbook of construction grammar. Oxford: Oxford University Press.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
Kim, Y.-J., & Biber, D. (1994). A corpus-based analysis of register variation in Korean. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 157–181). New York: Oxford University Press.
Labbé, C., & Labbé, D. (1994). Que mesure la spécificité du vocabulaire? Lexicometrica, 3, 2001.
Lacheret-Dujour, A., et al. (2019). The distribution of prosodic features in the Rhapsodie corpus. In A. Lacheret-Dujour & S. Kahane (Eds.), Rhapsodie: A prosodic and syntactic treebank for spoken French, Chap. 17. Studies in corpus linguistics (Vol. 89, pp. 315–338). Amsterdam: John Benjamins.
Lê, S., & Pagès, J. (2003). Hierarchical multiple factor analysis: Application to the comparison of sensory profiles. Food Quality and Preference, 14(5–6), 397–403.
Lê, S., & Pagès, J. (2010). DMFA: Dual multiple factor analysis. Communications in Statistics—Theory and Methods, 39(3), 483–492.
Leech, G., & Fallon, R. (1992). Computer corpora–what do they tell us about culture. ICAME Journal, 16, 29–50.
Leech, G., Johansson, S., & Hofland, K. (1978). The LOB corpus, original version (1970–1978). Lancaster/Oslo/Bergen.
Leech, G., et al. (1986). The LOB corpus, POS-tagged version (1981–1986). Lancaster/Oslo/Bergen.
Leitner, G. (1991). The Kolhapur Corpus of Indian English: Intra-varietal description and/or intervarietal comparison. In S. Johansson & A.-B. Stenström (Eds.), English computer corpora. Topics in English linguistics (pp. 215–232). Berlin: Mouton de Gruyter.
Nenadic, O., & Greenacre, M. J. (2007). Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3), 1–13.
Pagès, J. (2014). Multiple factor analysis by example using R. Boca Raton: Chapman & Hall/CRC.
Rayson, P., Leech, G. N., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1), 133–152.
Salem, A. (1987). Pratique des Segments Répétés: Essai de Statistique Textuelle. Paris: Klincksieck.
Schmid, H. J. (2003). Do men and women really live in different cultures? Evidence from the BNC. In A. Wilson, P. Rayson, & T. McEnery (Eds.), Corpus linguistics by the Lune. Łódź studies in language (pp. 185–221). Frankfurt: Peter Lang.
Shastri, S. V., Patilkulkarni, C. T., & Shastri, G. S. (1986). The Kolhapur Corpus. India: Kolhapur.
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Desagulier, G. (2020). Multivariate Exploratory Approaches. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_19
DOI: https://doi.org/10.1007/978-3-030-46216-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)