Multivariate Exploratory Approaches

A Practical Handbook of Corpus Linguistics

Abstract

This chapter provides both a theoretical discussion of what multivariate exploratory approaches entail and step-by-step instructions to implement each of them with R. Four methods are presented: correspondence analysis, multiple correspondence analysis, principal component analysis, and exploratory factor analysis. These methods are designed to explore and summarize large and complex data tables by means of summary statistics. They help generate hypotheses by providing informative clusters using the variable values that characterize each observation.

Notes

  1. I am using scare quotes because, as Kilgarriff (2005) puts it, “language is never, ever, ever, random”.

  2. See Baayen (2008, Sect. 5.1.1).

  3. A second kind of PCA is based on loadings (Baayen 2008, Sect. 5.1.1). Loadings are correlations between the original variables and the unit-scaled principal components. The two kinds of PCA are similar: both are meant to normalize the coordinates of the data points. The variant exemplified in this chapter is more flexible because it allows for the introduction of supplementary variables.

  4. Factor analysis of mixed data (FAMD) accommodates data sets containing both continuous and nominal data (Pagès 2014, Chap. 3). In this respect, it should be considered an interesting alternative to standard EFA. For reasons of space, however, this chapter focuses on ‘plain’ EFA.

  5. In addition to FactoMineR, several packages provide a dedicated CA function, e.g. ca (Nenadic and Greenacre 2007) and anacor (de Leeuw and Mair 2009).

  6. Details on how the data were extracted can be found in this blog post: https://corpling.hypotheses.org/284 (accessed 9 June 2019).

  7. This possibility is not offered in FactoMineR. It is offered in the factoextra (Kassambara and Mundt 2017) and ca (Nenadic and Greenacre 2007) packages.

  8. The code for the extraction was partly contributed by Mathilde Léger, a third-year student at Paris 8 University, as part of her end-of-term project.

  9. http://www.natcorp.ox.ac.uk/docs/catRef.xml (accessed 9 June 2019).

  10. Several packages and functions implement PCA in R, e.g. princomp() and prcomp() from the stats package, ggbiplot() from the ggbiplot package (which is itself based on ggplot2), dudi.pca() from the ade4 package, and PCA() from the FactoMineR package. Note that princomp() and prcomp() perform PCA based on loadings.

  11. For this kind of analysis, the first two components should represent a cumulative percentage of variance that is far above 50%. The more dimensions there are in the input data table, the harder it will be to reach this percentage.

  12. The contribution is a measure of how much an individual contributes to the construction of a component.

  13. The squared cosine (cos²) is a measure of how well an individual is projected onto a component.

  14. Deciding how many factors are worth keeping involves a choice based on a metric known as SS loadings, as explained below.

  15. The FactoMineR package includes several extensions of factor analysis. Multiple factor analysis (MFA) is used to explore datasets where variables are structured into groups. Like PCA, it can handle continuous and/or categorical variables simultaneously (Pagès 2014). MFA further breaks down into hierarchical multiple factor analysis (Lê and Pagès 2003) and dual multiple factor analysis (Lê and Pagès 2010). Although commonly used in sensorimetrics, these methods are rare in linguistics.
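
The two measures described in notes 12 and 13 can be written out using the definitions commonly given in the PCA literature (e.g. Husson et al. 2010). The notation is an assumption introduced here for illustration and is not taken from the chapter: F_{ik} is the coordinate of individual i on component k, \lambda_k is the eigenvalue (variance) of component k, n is the number of individuals, all assumed to carry the same weight 1/n, and K is the total number of components.

```latex
% Contribution of individual i to component k (equal weights 1/n assumed).
% Because \lambda_k = \frac{1}{n}\sum_{i=1}^{n} F_{ik}^{2}, the contributions
% to a given component sum to 1 over all individuals.
\mathrm{ctr}_k(i) = \frac{F_{ik}^{2}}{n\,\lambda_k},
\qquad \sum_{i=1}^{n} \mathrm{ctr}_k(i) = 1

% Squared cosine of individual i on component k: the share of the
% individual's squared distance to the origin that component k captures.
\cos^{2}_k(i) = \frac{F_{ik}^{2}}{\sum_{l=1}^{K} F_{il}^{2}},
\qquad \sum_{k=1}^{K} \cos^{2}_k(i) = 1
```

Under these assumptions, a contribution well above 1/n marks an individual that weighs heavily in the construction of a component, and a squared cosine close to 1 indicates that the individual is well represented on it. Software often reports contributions as percentages, i.e. multiplied by 100.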

References

  • Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
  • Benzécri, J.-P. (1984). Analyse des correspondances: Exposé élémentaire (Vol. 1). Pratique de l’Analyse des Données. Paris: Dunod.
  • Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press.
  • Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
  • Biber, D. (2001). Dimensions of variation among eighteenth-century registers. In H.-J. Diller & M. Görlach (Eds.), Towards a history of English as a history of genres (pp. 89–110). Heidelberg: C. Winter.
  • Biber, D., & Conrad, S. (2001). Variation in English: Multi-dimensional studies. London: Longman.
  • Biber, D., & Gray, B. (2016). Grammatical complexity in academic English: Linguistic change in writing. Cambridge: Cambridge University Press.
  • Biber, D., & Hared, M. (1992). Dimensions of register variation in Somali. Language Variation and Change, 4(1), 41–75.
  • de Leeuw, J., & Mair, P. (2009). Simple and canonical correspondence analysis using the R package anacor. Journal of Statistical Software, 31(5), 1–18.
  • Desagulier, G. (2015). A lesson from associative learning: Asymmetry and productivity in multiple-slot constructions. In Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2015-0012.
  • Desagulier, G. (2017). Corpus linguistics and statistics with R: Introduction to quantitative methods in linguistics. Quantitative methods in the humanities and social sciences. New York: Springer.
  • Francis, W. N., & Kučera, H. (1964). A standard corpus of present-day edited American English, for use with digital computers (Brown). Providence: Brown University.
  • Glynn, D. (2014). The many uses of run. In D. Glynn & J. A. Robinson (Eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy (Vol. 43, pp. 117–144). John Benjamins: Human Cognitive Processing.
  • Gréa, P. (2017). Inside in French. Cognitive Linguistics, 28(1), 77–130.
  • Greenacre, M. J. (2007). Correspondence analysis in practice (Vol. 2). Interdisciplinary statistics series. Boca Raton: Chapman & Hall/CRC.
  • Gries, S. T. (2006). Corpus-based methods and cognitive semantics: The many senses of to run. In S. T. Gries & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin: Mouton de Gruyter.
  • Grieve, J., et al. (2010). Variation among blogs: A multi-dimensional analysis. In A. Mehler, S. Sharoff, & M. Santini (Eds.), Genres on the web. Text, speech and language technology (Vol. 42, pp. 303–322). New York: Springer.
  • Habert, B. (1985). L’analyse des formes «spécifiques» [bilan critique et propositions d’utilisation]. Mots, 11(1), 127–154.
  • Hirschfeld, H. O. (1935). A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4), 520–524. Cambridge University Press.
  • Hirschmüller, H. (1989). The use of complex prepositions in Indian English in comparison with British and American English. In G. Graustein & W. Thiele (Eds.), Englische Textlinguistik und Varietätenforschung. Linguistische Arbeitsberichte (Vol. 69, pp. 52–58). Leipzig: Karl Marx Universität.
  • Hofland, K., & Johansson, S. (1982). Word frequencies in British and American English. Norwegian Computing Centre for the Humanities.
  • Husson, F., Lê, S., & Pagès, J. (2010). Exploratory multivariate analysis by example using R. London: CRC Press.
  • Kassambara, A., & Mundt, F. (2017). Factoextra: Extract and visualize the results of multivariate data analyses. R package version 1.0.5.
  • Kay, P. (2013). The limits of (construction) grammar. In T. Hoffmann & G. Trousdale (Eds.), The Oxford handbook of construction grammar. Oxford: Oxford University Press.
  • Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276.
  • Kim, Y.-J., & Biber, D. (1994). A corpus-based analysis of register variation in Korean. In D. Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 157–181). New York: Oxford University Press.
  • Labbé, C., & Labbé, D. (1994). Que mesure la spécificité du vocabulaire? Lexicometrica, 3, 2001.
  • Lacheret-Dujour, A., et al. (2019). The distribution of prosodic features in the Rhapsodie corpus. In A. Lacheret-Dujour & S. Kahane (Eds.), Rhapsodie: A prosodic and syntactic treebank for spoken French, Chap. 17. Studies in corpus linguistics (Vol. 89, pp. 315–338). Amsterdam: John Benjamins.
  • Lê, S., & Pagès, J. (2003). Hierarchical multiple factor analysis: Application to the comparison of sensory profiles. Food Quality and Preference, 14(5–6), 397–403.
  • Lê, S., & Pagès, J. (2010). DMFA: Dual multiple factor analysis. Communications in Statistics - Theory and Methods, 39(3), 483–492.
  • Leech, G., & Fallon, R. (1992). Computer corpora–what do they tell us about culture. ICAME Journal, 16, 29–50.
  • Leech, G., Johansson, S., & Hofland, K. (1978). The LOB corpus, original version (1970–1978). Lancaster/Oslo/Bergen.
  • Leech, G., et al. (1986). The LOB corpus, POS-tagged version (1981–1986). Lancaster/Oslo/Bergen.
  • Leitner, G. (1991). The Kolhapur Corpus of Indian English: Intra-varietal description and/or intervarietal comparison. In S. Johansson & A.-B. Stenström (Eds.), English computer corpora. Topics in English linguistics (pp. 215–232). Berlin: Mouton de Gruyter.
  • Nenadic, O., & Greenacre, M. J. (2007). Correspondence analysis in R, with two and three-dimensional graphics: The ca package. Journal of Statistical Software, 20(3), 1–13.
  • Pagès, J. (2014). Multiple factor analysis by example using R. Boca Raton: Chapman & Hall/CRC.
  • Rayson, P., Leech, G. N., & Hodges, M. (1997). Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1), 133–152.
  • Salem, A. (1987). Pratique des Segments Répétés: Essai de Statistique Textuelle. Paris: Klincksieck.
  • Schmid, H. J. (2003). Do men and women really live in different cultures? Evidence from the BNC. In A. Wilson, P. Rayson, & T. McEnery (Eds.), Corpus linguistics by the Lune. Lódź studies in language (pp. 185–221). Frankfurt: Peter Lang.
  • Shastri, S. V., Patilkulkarni, C. T., & Shastri, G. S. (1986). The Kolhapur Corpus. India: Kolhapur.

Author information

Correspondence to Guillaume Desagulier.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Desagulier, G. (2020). Multivariate Exploratory Approaches. In: Paquot, M., Gries, S.T. (eds) A Practical Handbook of Corpus Linguistics. Springer, Cham. https://doi.org/10.1007/978-3-030-46216-1_19
