Taming big data: Applying the experimental method to naturalistic data sets

  • Eyal Sagi


Psychological researchers have traditionally focused on lab-based experiments to test their theories and hypotheses. Although the lab provides excellent facilities for controlled testing, some questions are best explored by collecting information that is difficult to obtain in the lab. The vast amounts of data now available to researchers can be a valuable resource in this respect. By incorporating this new realm of data and bringing traditional laboratory methods to bear on it, we can expand the reach of the lab into the wilderness of human society. This study demonstrates how the troves of linguistic data generated by humans can be used to test theories about cognition and representation. It also suggests how similar interpretations can be made of other research in cognition. The first case tests a long-standing prediction of Gentner’s natural partition hypothesis: that the meaning of verbs is more subject to change due to the textual context in which they appear than is the meaning of nouns. Within a diachronic corpus, verbs and other relational words indeed showed more evidence of semantic change than did concrete nouns. In the second case, corpus statistics were employed to empirically support the existence of phonesthemes, nonmorphemic units of sound that are associated with aspects of meaning. A third study further supported this measure by demonstrating that it corresponds with performance in a lab experiment. Neither of these questions can be adequately explored without the use of big data in the form of linguistic corpora.


Corpus statistics, Big data, Semantic change, Representation, Phonesthemes



  1. Aarts, A. A., Anderson, J. E., Anderson, C. J., Attridge, P. R., Attwood, A., Axt, J., . . . Barnett-Cowan, M. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.
  2. Asmuth, J., & Gentner, D. (2005). Context sensitivity of relational nouns. In B. G. Bara, L. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the Twenty-Seventh Annual Conference of the Cognitive Science Society (pp. 163–168). Mahwah, NJ: Erlbaum.
  3. Asmuth, J., & Gentner, D. (2017). Relational categories are more mutable than entity categories. Quarterly Journal of Experimental Psychology, 70, 2007–2025.
  4. Bergen, B. K. (2004). The psychological reality of phonaesthemes. Language, 80, 290–311.
  5. Blust, R. A. (2003). The phonestheme ŋ in Austronesian languages. Oceanic Linguistics, 42, 187–212.
  6. BNC Consortium. (2007). British National Corpus, version 3 (BNC XML ed.). Oxford, UK: Oxford University Computing Services. Retrieved from
  7. Boussidan, A., Sagi, E., & Ploux, S. (2009). Phonaesthemic and etymological effects on the distribution of senses in statistical models of semantics. In Proceedings of the CogSci Workshop on Distributional Semantics beyond Concrete Concepts (DiSCo 2009) (pp. 35–40). Austin, TX: Cognitive Science Society.
  8. de Saussure, F. (2011). Nature of the linguistic sign. In Course in general linguistics (pp. 65–70). New York, NY: Columbia University Press. (Original French work published in 1916)
  9. Dehghani, M., Johnson, K., Hoover, J., Sagi, E., Garten, J., Parmar, N. J., . . . Graham, J. (2016). Purity homophily in social networks. Journal of Experimental Psychology: General, 145, 366–375.
  10. Ellegård, A. (1953). The auxiliary do: The establishment and regulation of its use in English (Vol. 2). Stockholm, Sweden: Almqvist & Wiksell.
  11. Farmer, T. A., Christiansen, M. H., & Monaghan, P. (2006). Phonological typicality influences on-line sentence comprehension. Proceedings of the National Academy of Sciences, 103, 12203–12208.
  12. Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In Studies in linguistic analysis (pp. 1–31). Oxford, UK: Blackwell.
  13. Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1, 939–944.
  14. Gentner, D. (1982). Why nouns are learned before verbs: Linguistic relativity versus natural partitioning (Technical Report No. 257). Urbana, IL: Center for the Study of Reading.
  15. Gentner, D. (2006). Why verbs are hard to learn. In K. Hirsh-Pasek & R. Michnick Golinkoff (Eds.), Action meets word: How children learn verbs (pp. 544–564). Oxford, UK: Oxford University Press.
  16. Gentner, D., & France, I. M. (1988). The verb mutability effect: Studies of the combinatorial semantics of nouns and verbs. In S. L. Small, G. W. Cottrell, & M. K. Tanenhaus (Eds.), Lexical ambiguity resolution: Perspectives from psycholinguistics, neuropsychology, and artificial intelligence (pp. 343–382). San Mateo, CA: Kaufmann.
  17. Gillette, J., Gleitman, H., Gleitman, L., & Lederer, A. (1999). Human simulations of vocabulary learning. Cognition, 73, 135–176.
  18. Graesser, A. C., Wiemer-Hastings, P., Wiemer-Hastings, K., Harter, D., Tutoring Research Group, & Person, N. (2000). Using latent semantic analysis to evaluate the contributions of students in AutoTutor. Interactive Learning Environments, 8, 129–147.
  19. Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114, 211–244.
  20. Günther, F., Dudschig, C., & Kaup, B. (2016). Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. Quarterly Journal of Experimental Psychology, 69, 626–653.
  21. Haidt, J., & Joseph, C. (2004). Intuitive ethics: How innately prepared intuitions generate culturally variable virtues. Daedalus, 133, 55–66.
  22. Hockett, C. F., & Hockett, C. D. (1960). The origin of speech. Scientific American, 203, 88–97.
  23. Hutchins, S. S. (1998). The psychological reality, variability, and compositionality of English phonesthemes (Ph.D. dissertation). Emory University, Atlanta, GA.
  24. Hutchinson, S., & Louwerse, M. M. (2014). Language statistics explain the spatial–numerical association of response codes. Psychonomic Bulletin & Review, 21, 470–478.
  25. Iliev, R., Dehghani, M., & Sagi, E. (2015). Automated text analysis in psychology: Methods, applications, and future developments. Language and Cognition, 7, 265–290.
  26. Infomap [Computer Software]. (2007). Stanford, CA. Retrieved from
  27. Jakobson, R., & Waugh, L. R. (1979). The sound shape of language. Bloomington, IN: Indiana University Press.
  28. Lakoff, G. (2009). The political mind: A cognitive scientist’s guide to your brain and its politics. New York, NY: Penguin.
  29. Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
  30. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
  31. Lebert, M. (2011). The EBook is 40 (1971–2011). Project Gutenberg.
  32. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013a). Efficient estimation of word representations in vector space. ArXiv preprint. ArXiv:1301.3781
  33. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space. Paper presented at the International Conference on Learning Representations Workshop, Scottsdale, AZ.
  34. Mohler, M., & Mihalcea, R. (2009). Text-to-text semantic similarity for automatic short answer grading. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (pp. 567–575). Stroudsburg, PA: Association for Computational Linguistics.
  35. Monaghan, P., Chater, N., & Christiansen, M. H. (2005). The differential role of phonological and distributional cues in grammatical categorisation. Cognition, 96, 143–182.
  36. Nuckolls, J. B. (1999). The case for sound symbolism. Annual Review of Anthropology, 28, 225–252.
  37. Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Stroudsburg, PA: Association for Computational Linguistics.
  38. Ramachandran, V. S., & Hubbard, E. M. (2001). Synaesthesia—A window into perception, thought and language. Journal of Consciousness Studies, 8, 3–34.
  39. Saffran, J. R. (2003). Statistical language learning: Mechanisms and constraints. Current Directions in Psychological Science, 12, 110–114.
  40. Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
  41. Sagi, E. (2018). Developing a new method for psychological investigation using text as data. Sage Research Methods Cases.
  42. Sagi, E., & Dehghani, M. (2014). Measuring moral rhetoric in text. Social Science Computer Review, 32, 132–144.
  43. Sagi, E., Diermeier, D., & Kaufmann, S. (2013). Identifying issue frames in text. PLoS ONE, 8, e69185.
  44. Sagi, E., Kaufmann, S., & Clark, B. (2012). Tracing semantic change with latent semantic analysis. In K. Allan & J. A. Robinson (Eds.), Current methods in historical semantics (pp. 161–183). Berlin, Germany: Walter de Gruyter.
  45. Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of latent semantic analysis (pp. 427–448). Mahwah, NJ: Erlbaum.
  46. Takayama, Y., Flournoy, R., Kaufmann, S., & Peters, S. (1998). Information mapping: Concept-based information retrieval based on word associations. Stanford, CA: CSLI Publications.
  47. Tam, Y.-C., Lane, I., & Schultz, T. (2007). Bilingual LSA-based adaptation for statistical machine translation. Machine Translation, 21, 187–207.
  48. Traugott, E. C., & Dasher, R. B. (2001). Regularity in semantic change. Cambridge, UK: Cambridge University Press.
  49. Wallis, J. (1699). Grammar of the English language. Oxford, UK: Lichfield.
  50. Watkins, C. (2000). The American heritage dictionary of Indo-European roots (2nd ed.). Boston, MA: Houghton Mifflin Harcourt.
  51. Wilson, M. (1988). MRC Psycholinguistic Database: Machine-usable dictionary, version 2.00. Behavior Research Methods, Instruments, & Computers, 20, 6–10.
  52. Yeh, J.-Y., Ke, H.-R., Yang, W.-P., & Meng, I.-H. (2005). Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41, 75–95.

Copyright information

© The Psychonomic Society, Inc. 2019

Authors and Affiliations

  1. University of St. Francis, Joliet, USA
