Multivariate Exploratory Approaches

Desagulier, Guillaume

doi:10.1007/978-3-030-46216-1_19

Abstract

This chapter provides both a theoretical discussion of what multivariate exploratory approaches entail and step-by-step instructions to implement each of them with R. Four methods are presented: correspondence analysis, multiple correspondence analysis, principal component analysis, and exploratory factor analysis. These methods are designed to explore and summarize large and complex data tables by means of summary statistics. They help generate hypotheses by providing informative clusters using the variable values that characterize each observation.

Notes

1. I am using scare quotes because, as Kilgarriff (2005) puts it, “language is never, ever, ever, random”.
2. See Baayen (2008, Sect. 5.1.1).
3. A second kind of PCA is based on loadings (Baayen 2008, Sect. 5.1.1). Loadings are correlations between the original variables and the unit-scaled principal components. The two kinds of PCA are similar: both are meant to normalize the coordinates of the data points. The variant exemplified in this chapter is more flexible because it allows for the introduction of supplementary variables.
4. Factor analysis of mixed data (FAMD) accommodates data sets containing both continuous and nominal data (Pagès 2014, Chap. 3). In this respect, it should be considered an interesting alternative to standard EFA. For reasons of space, however, this chapter focuses on ‘plain’ EFA.
5. In addition to FactoMineR, several packages contain a dedicated CA function, e.g. ca (Nenadic and Greenacre 2007) and anacor (de Leeuw and Mair 2009).
6. Details on how the data were extracted can be found in this blog post: https://corpling.hypotheses.org/284 (accessed 9 June 2019).
7. This possibility is not offered in FactoMineR. It is offered in the factoextra (Kassambara and Mundt 2017) and ca (Nenadic and Greenacre 2007) packages.
8. The code for the extraction was partly contributed by Mathilde Léger, a third-year student at Paris 8 University, as part of her end-of-term project.
9. http://www.natcorp.ox.ac.uk/docs/catRef.xml (accessed 9 June 2019).
10. Several packages and functions implement PCA in R: e.g. princomp() and prcomp() from the stats package, ggbiplot() from the ggbiplot package (which is itself based on ggplot2), dudi.pca() from the ade4 package, and PCA() from the FactoMineR package. Mind you, princomp() and prcomp() perform PCA based on loadings.
11. For this kind of analysis, the first two components should represent a cumulative percentage of variance that is far above 50%. The more dimensions there are in the input data table, the harder it will be to reach this percentage.
12. The contribution is a measure of how much an individual contributes to the construction of a component.
13. The squared cosine (cos²) is a measure of how well an individual is projected onto a component.
14. Deciding how many factors are worth keeping involves a choice based on a metric known as SS loadings, as explained below.
15. The FactoMineR package includes several extensions of factor analysis. Multiple factor analysis (MFA) is used to explore datasets where variables are structured into groups. Like PCA, it can handle continuous and/or categorical variables simultaneously (Pagès 2014). MFA further breaks down into hierarchical multiple factor analysis (Lê and Pagès 2003) and dual multiple factor analysis (Lê and Pagès 2010). Although commonly used in sensorimetrics, these methods are rare in linguistics.
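
The quantities mentioned in notes 11 to 14 (cumulative percentage of variance, contribution, squared cosine, and SS loadings) can be illustrated with a little arithmetic. The sketch below is plain Python on made-up numbers, not the chapter's R code and not FactoMineR output. The function names, the toy coordinates, and the toy loadings are all hypothetical, and every individual is assumed to carry the same weight (1/n).

```python
def variance_percentages(eigenvalues):
    """Return (percentage of variance per component, cumulative percentages)."""
    total = sum(eigenvalues)
    pct = [100.0 * ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative


def contributions(coords, eigenvalues):
    """Contribution (in %) of each individual to each component.

    Assumes uniform weights: contrib = 100 * coord^2 / (n * eigenvalue).
    """
    n = len(coords)
    return [[100.0 * c ** 2 / (n * ev) for c, ev in zip(row, eigenvalues)]
            for row in coords]


def squared_cosines(coords):
    """cos2 of each individual on each component.

    cos2 = coord^2 / squared distance of the individual to the origin.
    """
    result = []
    for row in coords:
        dist2 = sum(c ** 2 for c in row)
        result.append([c ** 2 / dist2 if dist2 > 0 else 0.0 for c in row])
    return result


def ss_loadings(loadings):
    """Sum of squared loadings per factor (one column per factor)."""
    n_factors = len(loadings[0])
    return [sum(row[k] ** 2 for row in loadings) for k in range(n_factors)]


# Hypothetical coordinates of 4 individuals on 2 components
# (columns are centred and mutually orthogonal).
coords = [[2.0, 1.0], [-2.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]

# With uniform weights, each eigenvalue is the mean squared coordinate.
eigenvalues = [sum(row[k] ** 2 for row in coords) / len(coords)
               for k in range(2)]

pct, cum = variance_percentages(eigenvalues)
contrib = contributions(coords, eigenvalues)
cos2 = squared_cosines(coords)

# Hypothetical loadings of 3 variables on 2 factors.
toy_loadings = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]]
ss = ss_loadings(toy_loadings)

print("percentage of variance:", pct)
print("cumulative percentage:", cum)
print("contributions (%):", contrib)
print("cos2:", cos2)
print("SS loadings:", ss)
```

In this toy case the contributions to a given component sum to 100 across individuals, and the squared cosines of a given individual sum to 1 across components. A common convention, assumed here rather than taken from this page, is to retain a factor when its SS loadings exceed 1.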

References

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Benzécri, J.-P. (1984). Analyse des correspondances: Exposé élémentaire (Vol. 1). Pratique de l’Analyse des Données. Paris: Dunod.
Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Biber, D. (2001). Dimensions of variation among eighteenth-century registers. In H.-J. Diller & M. Görlach (Eds.), Towards a history of English as a history of genres (pp. 89–110). Heidelberg: C. Winter.
Biber, D., & Conrad, S. (2001). Variation in English: Multi-dimensional studies. London: Longman.
Biber, D., & Gray, B. (2016). Grammatical complexity in academic English: Linguistic change in writing. Cambridge: Cambridge University Press.
Biber, D., & Hared, M. (1992). Dimensions of register variation in Somali. Language Variation and Change, 4(1), 41–75.
de Leeuw, J., & Mair, P. (2009). Simple and canonical correspondence analysis using the R package anacor. Journal of Statistical Software, 31(5), 1–18.
Desagulier, G. (2015). A lesson from associative learning: Asymmetry and productivity in multiple-slot constructions. Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2015-0012.
Desagulier, G. (2017). Corpus linguistics and statistics with R: Introduction to quantitative methods in linguistics. Quantitative methods in the humanities and social sciences. New York: Springer.
Francis, W. N., & Kučera, H. (1964). A standard corpus of present-day edited American English, for use with digital computers (Brown). Providence: Brown University.
Glynn, D. (2014). The many uses of run. In D. Glynn & J. A. Robinson (Eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy. Human Cognitive Processing (Vol. 43, pp. 117–144). Amsterdam: John Benjamins.
Gréa, P. (2017). Inside in French. Cognitive Linguistics, 28(1), 77–130.
Greenacre, M. J. (2007). Correspondence analysis in practice (Vol. 2). Interdisciplinary statistics series. Boca Raton: Chapman & Hall/CRC.
Gries, S. T. (2006). Corpus-based methods and cognitive semantics: The many senses of to run. In S. T. Gries & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin: Mouton de Gruyter.
Grieve, J., et al. (2010). Variation among blogs: A multi-dimensional analysis. In A. Mehler, S. Sharoff, & M. Santini (Eds.), Genres on the web. Text, speech and language technology (Vol. 42, pp. 303–322). New York: Springer.
Habert, B. (1985). L’analyse des formes «spécifiques» [bilan critique et propositions d’utilisation]. Mots, 11(1), 127–154.
Hirschfeld, H. O. (1935). A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4), 520–524. Cambridge University Press.
Hirschmüller, H. (1989). The use of complex prepositions in Indian English in comparison with British and American English. In G. Graustein & W. Thiele (Eds.), Englische Textlinguistik und Varietätenforschung. Linguistische Arbeitsberichte (Vol. 69, pp. 52–58). Leipzig: Karl Marx Universität.
Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
Husson, F., Lê, S., & Pagès, J. (2010). Exploratory multivariate analysis by example using R. London: CRC Press.
Kassambara, A., & Mundt, F. (2017). Factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.5.
Kay, P. (2013). The limits of (construction) grammar. In T. Hoffmann & G. Trousdale (Eds.), The Oxford handbook of construction grammar. Oxford: Oxford University Press.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
Kim, Y.-J., & Biber, D. (1994). A corpus-based analysis of register variation in Korean. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 157–181). New York: Oxford University Press.
Labbé, C., & Labbé, D. (1994). Que mesure la spécificité du vocabulaire? Lexicometrica, 3, 2001.
Lacheret-Dujour, A., et al. (2019). The distribution of prosodic features in the Rhapsodie corpus. In A. Lacheret-Dujour & S. Kahane (Eds.), Rhapsodie: A prosodic and syntactic treebank for spoken French, Chap. 17. Studies in corpus linguistics (Vol. 89, pp. 315–338). Amsterdam: John Benjamins.
Lê, S., & Pagès, J. (2003). Hierarchical multiple factor analysis: Application to the comparison of sensory profiles. Food Quality and Preference, 14(5–6), 397–403.
Lê, S., & Pagès, J. (2010). DMFA: Dual multiple factor analysis. Communications in Statistics—Theory and Methods, 39(3), 483–492.
Leech, G., & Fallon, R. (1992). Computer corpora–what do they tell us about culture. ICAME Journal, 16, 29–50.
Leech, G., Johansson, S., & Hofland, K. (1978). The LOB corpus, original version (1970–1978). Lancaster/Oslo/Bergen.
Leech, G., et al. (1986). The LOB corpus, POS-tagged version (1981–1986). Lancaster/Oslo/Bergen.
Leitner, G. (1991). The Kolhapur Corpus of Indian English: Intra-varietal description and/or intervarietal comparison. In S. Johansson & A.-B. Stenström (Eds.), English computer corpora. Topics in English linguistics (pp. 215–232). Berlin: Mouton de Gruyter.
Nenadic, O., & Greenacre, M. J. (2007). Correspondence analysis in R, with two and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3), 1–13.
Pagès, J. (2014). Multiple factor analysis by example using R. Boca Raton: Chapman & Hall/CRC.
Rayson, P., Leech, G. N., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1), 133–152.
Salem, A. (1987). Pratique des Segments Répétés: Essai de Statistique Textuelle. Paris: Klincksieck.
Schmid, H. J. (2003). Do men and women really live in different cultures? Evidence from the BNC. In A. Wilson, P. Rayson, & T. McEnery (Eds.), Corpus linguistics by the Lune. Łódź studies in language (pp. 185–221). Frankfurt: Peter Lang.
Shastri, S. V., Patilkulkarni, C. T., & Shastri, G. S. (1986). The Kolhapur Corpus. India: Kolhapur.

Author information

Authors and Affiliations

MoDyCo – Université Paris 8, Saint-Denis, France
Guillaume Desagulier
CNRS, Paris, France
Guillaume Desagulier
Université Paris Nanterre, Nanterre, France
Guillaume Desagulier
Institut Universitaire de France, Paris, France
Guillaume Desagulier

Corresponding author

Correspondence to Guillaume Desagulier.

Editor information

Editors and Affiliations

FNRS Centre for English Corpus Linguistics, Language and Communication Institute, UCLouvain, Louvain-la-Neuve, Belgium
Magali Paquot
Department of Linguistics, University of California, Santa Barbara, CA, USA
Stefan Th. Gries

1 Electronic Supplementary Materials

19_Desagulier (ZIP 711 kb)

Cite this chapter

Desagulier, G. (2020). Multivariate Exploratory Approaches. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_19

DOI: https://doi.org/10.1007/978-3-030-46216-1_19
Published: 05 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46215-4
Online ISBN: 978-3-030-46216-1
eBook Packages: Religion and Philosophy, Philosophy and Religion (R0)