
Psychological Text Analysis in the Digital Humanities

Data Analytics in Digital Humanities

Part of the book series: Multimedia Systems and Applications ((MMSA))

Abstract

In the digital humanities, it has been particularly difficult to establish the psychological properties of a person or group of people in an objective, reliable manner. Traditionally, the attempt to understand an author’s psychological makeup has been primarily (if not exclusively) accomplished through subjective interpretation, qualitative analysis, and speculation. In the world of empirical psychological research, however, the past two decades have witnessed an explosion of computerized language analysis techniques that objectively measure psychological features of the individual. Indeed, by using modern text analysis methods, it is now possible to quickly and accurately extract information about people—personalities, individual differences, social processes, and even their mental health—all through the words that people write and speak. This chapter serves as a primer for researchers interested in learning about how language can provide powerful insights into the minds of others via well-established and easy-to-use psychometric methods. First, this chapter provides a general background on language analysis in the field of psychology, followed by an introduction to modern methods and developments within the field of psychological text analysis. Finally, a solid foundation to psychological text analysis is provided in the form of an overview of research spanning hundreds of studies from labs all over the world.


Notes

  1.

    Note that while it may seem trivial to create new categories of words to measure a psychological process, it can be an unspeakably difficult task in practice. Determining how specific words are often used out in the “real world,” establishing whether certain words are related to psychologically meaningful processes, and creating dictionaries that possess adequate statistical properties are all deceptively tricky tasks. Many researchers have spent years working on word dictionaries that have ultimately proven to be meaningless.

  2.

    At the time of this writing, various translations of the LIWC dictionary exist in Spanish, Dutch, German, Italian, French, Korean, Chinese, Portuguese, and Turkish, among others. These translations are typically available from their respective translators rather than the LIWC creators, and most have accompanying peer-reviewed publications that evaluate the psychometric properties of the translated dictionaries.

  3.

    Interestingly, the negation of an emotion (e.g., “not sad”) appears to be psychologically different from expressing the opposite of an emotion (“happy”). Research has found that people who think along a “sadness” dimension, even if they are repeatedly saying that they are “not sad” or talking about someone else’s sadness, are psychologically different from those who are thinking along the lines of a different emotion altogether (Pennebaker et al. 1997).

  4.

    To spare the curious reader from having to seek out a dictionary, “enantiodromia” refers to a tendency for something to convert into its opposite form.

  5.

    The general rule of thumb is that more data is almost always better. Data quantity is the foundation of both reliability and accuracy when quantifying psychological processes, which can be incredibly difficult to assess using any research method. The same holds true when using language to extract psychological information.

  6.

    These comments are in no way intended to disparage or discourage the use of heavy-hitting statistical algorithms and machine learning procedures. In fact, the author of this chapter uses these analytic techniques with absolute regularity in his own work, and he enjoys few things in life more than the intricacy of well-crafted, complex models (why yes, he is a huge hit at parties—how did you know?). However, the importance of considering tradeoffs between prediction power and being able to describe/understand one’s model in practical terms cannot be overstated.

  7.

    Note, however, that self-report measures of personality are not without their own drawbacks and imperfections. In order to accurately answer a self-report question about yourself, you must have both accurate information about yourself in a given domain and a willingness/ability to make accurate self-reports. The literature on self-report biases and pitfalls is rather extensive but beyond the scope of this chapter.

  8.

    Another interesting analysis of the Beatles using LIWC was performed by Kasser (2013), who explored the interpretation of the song Lucy in the Sky with Diamonds from a psychological perspective. While the song is often cited as being overtly about drug use, Kasser (2013) found that the psychological fingerprint of the song was generally quite similar to that of other lyrics authored by John Lennon. Kasser did find linguistic markers consistent with descriptions of drug experiences; however, Lucy in the Sky with Diamonds also scored relatively high on language measures pertaining to distancing oneself from painful experiences, such as a lack of emotional content and very few markers of “here and now” thinking (sometimes called “psychological distancing”).

  9.

    Interestingly, this study and others suggest that higher use of both positive and negative emotion words may generally reflect greater immersion in a given writing topic (e.g., Holmes et al. 2007; Tausczik and Pennebaker 2010).

  10.

    The degree to which people synchronize their function words is often not directly perceptible; however, higher language style matching among individuals can foster perceptions of social connectedness and support (Rains 2015).

References

  • J.L. Baddeley, G.R. Daniel, J.W. Pennebaker, How Henry Hellyer’s use of language foretold his suicide. Crisis 32(5), 288–292 (2011)


  • J.L. Borelli, K.A. Ramsook, P. Smiley, D. Kyle Bond, J.L. West, K.H. Buttitta, Language matching among mother-child dyads: associations with child attachment and emotion reactivity. Soc. Dev. (2016). doi:10.1111/sode.12200

  • R.L. Boyd, MEH: Meaning Extraction Helper (Version 1.4.13) [Software]. Available from http://meh.ryanb.cc (2016)

  • R.L. Boyd, J.W. Pennebaker, Did Shakespeare write double falsehood? Identifying individuals by creating psychological signatures with text analysis. Psychol. Sci. 26(5), 570–582 (2015)


  • R.L. Boyd, S.R. Wilson, J.W. Pennebaker, M. Kosinski, D.J. Stillwell, R. Mihalcea, Values in words: using language to evaluate and understand personal values, in Proceedings of the Ninth International AAAI Conference on Web and Social Media (2015), pp. 31–40


  • C.K. Chung, J.W. Pennebaker, Revealing dimensions of thinking in open-ended self-descriptions: an automated meaning extraction method for natural language. J. Res. Pers. 42(1), 96–132 (2008)


  • M.A. Cohn, M.R. Mehl, J.W. Pennebaker, Linguistic markers of psychological change surrounding September 11, 2001. Psychol. Sci. 15(10), 687–693 (2004)


  • M. De Choudhury, M. Gamon, S. Counts, E. Horvitz, Predicting depression via social media, in Annual Proceedings of the 2013 AAAI Conference on Web and Social Media (ICWSM) (2013)


  • J. Dewey, How we think (D.C. Heath, Boston, 1910)


  • M.J. Egnoto, D.J. Griffin, Analyzing language in suicide notes and legacy tokens: investigating clues to harm of self and harm to others in writing. Crisis 37(2), 140–147 (2016)


  • M. Fernández-Cabana, A. García-Caballero, M.T. Alves-Pérez, M.J. García-García, R. Mateos, Suicidal traits in Marilyn Monroe’s fragments. Crisis 34(2), 124–130 (2013)


  • A.K. Fetterman, M.D. Robinson, Do you use your head or follow your heart? Self-location predicts personality, emotion, decision making, and performance. J. Pers. Soc. Psychol. 105, 316–334 (2013)


  • L. Flekova, I. Gurevych, Personality profiling of fictional characters using sense-level links between lexical resources, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2015)


  • S. Freud, On Aphasia (International Universities Press, London, 1891)


  • E. Gortner, J.W. Pennebaker, The archival anatomy of a disaster: media coverage and community-wide health effects of the Texas A&M bonfire tragedy. J. Soc. Clin. Psychol. 22(5), 580–603 (2003)


  • D. Holmes, G.W. Alpers, T. Ismailji, C. Classen, T. Wales, V. Cheasty, A. Miller, C. Koopman, Cognitive and emotional processing in narratives of women abused by intimate partners. Violence Against Women 13(11), 1192–1205 (2007)


  • M.E. Ireland, J.W. Pennebaker, Language style matching in writing: synchrony in essays, correspondence, and poetry. J. Pers. Soc. Psychol. 99(3), 549–571 (2010)


  • M.E. Ireland, R.B. Slatcher, P.W. Eastwick, L.E. Scissors, E.J. Finkel, J.W. Pennebaker, Language style matching predicts relationship initiation and stability. Psychol. Sci. 22(1), 39–44 (2011)


  • O.P. John, L.P. Naumann, C.J. Soto, in Handbook of Personality: Theory and Research, ed. by O. P. John, R. W. Robins, L. A. Pervin. Paradigm shift to the integrative big-five trait taxonomy: history, measurement, and conceptual issues (Guilford Press, New York, 2008), pp. 114–158


  • K. Jordan, J.W. Pennebaker, How the candidates are thinking: analytic versus narrative thinking styles. Retrieved January 21, 2016, from https://wordwatchers.wordpress.com/2016/01/21/how-the-candidates-are-thinking-analytic-versus-narrative-thinking-styles/ (2016)

  • P. Juola, Authorship attribution. Found. Trends Inf. Retr. 1(3), 233 (2006)


  • D. Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, New York, 2011)


  • T. Kasser, Lucy in the Mind of Lennon (Oxford University Press, New York, 2013)


  • M. Komisin, C. Guinn, Identifying personality types using document classification methods, in Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference (2012)


  • M. Koppel, J. Schler, S. Argamon, Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2008)


  • C.M. Laserna, Y. Seih, J.W. Pennebaker, J. Lang. Soc. Psychol. 33(3), 328–338 (2014)


  • H.D. Lasswell, D. Lerner, I. De Sola Pool, The Comparative Study of Symbols: An Introduction (Stanford University Press, Stanford, 1952)


  • M. Liberman, Linguistic dominance in house of cards. Retrieved March 12, 2015, from http://languagelog.ldc.upenn.edu/nll/?p=18147 (2015)

  • R.D. Lowe, D. Heim, C.K. Chung, J.C. Duffy, J.B. Davies, J.W. Pennebaker, In verbis, vinum? Relating themes in an open-ended writing task to alcohol behaviors. Appetite 68, 8–13 (2013)


  • F. Mairesse, M.A. Walker, M.R. Mehl, R.K. Moore, Using linguistic cues for the automatic recognition of personality and conversation in text. J. Artif. Intell. Res. 30(1), 457–500 (2007)


  • C. Martindale, The grammar of altered states of consciousness: a semiotic reinterpretation of aspects of psychoanalytic theory. Psychoanal. Contemp. Thought 4, 331–354 (1975)


  • D.C. McClelland, J.W. Atkinson, R.A. Clark, E.L. Lowell, The Achievement Motive (Irvington, Oxford, 1953)


  • E. Mergenthaler, Emotion-abstraction patterns in verbatim protocols: a new way of describing psychotherapeutic processes. J. Consult. Clin. Psychol. 64(6), 1306–1315 (1996)


  • G.A. Miller, The Science of Words (Scientific American Library, New York, 1995)


  • F. Moretti, Distant Reading (Verso, London, 2013)


  • J.W. Pennebaker, J.F. Evans, Expressive Writing: Words that Heal (Idyll Arbor, Enumclaw, 2014)


  • J.W. Pennebaker, M.E. Francis, Linguistic Inquiry and Word Count (LIWC): A Computer-Based Text Analysis Program (Erlbaum, Mahwah, NJ, 1999)


  • J.W. Pennebaker, L.A. King, Linguistic styles: language use as an individual difference. J. Pers. Soc. Psychol. 77(6), 1296–1312 (1999)


  • J.W. Pennebaker, L.D. Stone, Words of wisdom: Language use over the life span. Pers. Processes Individ. Differ. 85(2), 291–301 (2003)


  • J.W. Pennebaker, T.J. Mayne, M.E. Francis, Linguistic predictors of adaptive bereavement. J. Pers. Soc. Psychol. 72, 863–871 (1997)


  • J.W. Pennebaker, C.K. Chung, J. Frazee, G.M. Lavergne, D.I. Beaver, When small words foretell academic success: the case of college admissions essays. PLoS One 9(12), e115844 (2014)


  • J.W. Pennebaker, R.L. Boyd, K. Jordan, K. Blackburn, The Development and Psychometric Properties of LIWC2015 (University of Texas, Austin, TX, 2015a)


  • J.W. Pennebaker, R.J. Booth, R.L. Boyd, M.E. Francis, Linguistic Inquiry and Word Count: LIWC2015 (Pennebaker Conglomerates, Austin, TX, 2015b)


  • K.J. Petrie, J.W. Pennebaker, B. Sivertsen, Things we said today: a linguistic analysis of the Beatles. Psychol. Aesthet. Creat. Arts 2(4), 197–202 (2008)


  • S.T. Piantadosi, Zipf’s word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21(5), 1112–1130 (2014)


  • C.S. Pulverman, R.L. Boyd, A.M. Stanton, C.M. Meston, Changes in the sexual self-schema of women with a history of childhood sexual abuse following expressive writing treatment. Psychol. Trauma. 9(2), 181–188 (2016). doi:10.1037/tra0000163

  • S.A. Rains, Language style matching as a predictor of perceived social support in computer-mediated interaction among individuals coping with illness. Commun. Res. 43(5), 694–712 (2015)


  • N. Ramirez-Esparza, C.K. Chung, E. Kacewicz, J.W. Pennebaker, The psychology of word use in depression forums in English and in Spanish: testing two text analytic approaches, in Annual Proceedings of the 2008 AAAI Conference on Web and Social Media (ICWSM) (2008)


  • B.H. Richardson, P.J. Taylor, B. Snook, S.M. Conchi, C. Bennell, Language style matching and police interrogation outcomes. Law Hum. Behav. 38(4), 357–366 (2014)


  • D.M. Romero, R.I. Swaab, B. Uzzi, A.D. Galinsky, Mimicry is presidential: Linguistic style matching in presidential debates and improved polling numbers. Personal. Soc. Psychol. Bull. 41(10), 1311–1319 (2015)


  • S. Ross, In praise of overstating the case: a review of Franco Moretti, distant reading. Dig. Humanit. Q. 8(1), 1 (2014)


  • R.M. Sapolsky, Why Zebras Don't Get Ulcers: A Guide To Stress, Stress Related Diseases, and Coping (W.H. Freeman, New York, 1994)


  • J. Schler, M. Koppel, S. Argamon, J.W. Pennebaker, Effects of age and gender on blogging, in Proceedings of the 2005 AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs (2006)


  • T.E. Senn, M.P. Carey, P.A. Vanable, Childhood sexual abuse and sexual risk behavior among men and women attending a sexually transmitted disease clinic. J. Consult. Clin. Psychol. 74(4), 720–731 (2006)


  • A.M. Stanton, R.L. Boyd, C.S. Pulverman, C.M. Meston, Determining women’s sexual self-schemas through advanced computerized text analysis. Child Abuse Negl. 46, 78–88 (2015)


  • S.W. Stirman, J.W. Pennebaker, Word use in the poetry of suicidal and nonsuicidal poets. Psychosom. Med. 63, 517–522 (2001)


  • P.J. Stone, D.C. Dunphy, M.S. Smith, D.M. Ogilvie, The General Inquirer: A Computer Approach to Content Analysis (MIT, Cambridge, 1966)


  • Y.R. Tausczik, J.W. Pennebaker, The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29(1), 24–54 (2010)


  • D. Watson, Mood and Temperament (Guilford Press, New York, 2000)


  • W. Weintraub, Verbal Behavior in Everyday Life (Springer, New York, 1989)


  • M. Wolf, C.K. Chung, H. Kordy, Inpatient treatment to online aftercare: e-mailing themes as a function of therapeutic outcomes. Psychother. Res. 20(1), 71–85 (2010)


  • T. Yarkoni, Personality in 100,000 words: a large-scale analysis of personality and word use among bloggers. J. Res. Pers. 44(3), 363–373 (2010)



Acknowledgments

Preparation of this chapter was aided by grants from the National Institutes of Health (5R01GM112697-02), the John Templeton Foundation (#48503), and the National Science Foundation (IIS-1344257). The views, opinions, and findings contained in this chapter are those of the author and should not be construed as the position, policy, or decision of the aforementioned agencies, unless so designated by other documents. The author would like to thank Elisavet Makridis, Natalie M. Peluso, James W. Pennebaker, and the anonymous reviewers for their helpful feedback on earlier versions of this chapter.

Author information


Correspondence to Ryan L. Boyd.


Appendix


When preparing and processing texts for psychological text analysis, there are some widely accepted (but often unspoken) guidelines. These guidelines help ensure accurate insights both while processing texts and in subsequent statistical analyses. This appendix can be thought of as a basic “need to know” reference and primer for technical considerations when performing psychological text analysis. Feel free to treat it as a “cheat sheet” for the methods covered in this chapter, one that should help give you a head start in the world of research with language.

Preparing Texts for Analysis

One of the most common questions that people ask during their first romp into the world of psychological text analysis is this: “How should I prepare my texts?” Ultimately, there is no answer to this question that will apply in all cases, as guidelines will vary as a function of text source, research questions, and goals. However, there are some basic considerations that apply to virtually all cases. As a general rule when it comes to the psychological analysis of text, “good enough” really is “good enough” for most purposes. One could literally spend years preparing a collection of text files so that they are 100% perfect for analysis; however, the conceptual (and, more importantly, statistical) gains from doing so are often nil.

Spelling and Spelling Variations

It is tempting to worry about correcting texts so that all words that could potentially be captured by a dictionary are successfully recognized. Note, however, that word frequencies follow what is known as a Zipf distribution (see Piantadosi 2014), wherein a relatively small number of words constitute the majority of words actually seen in texts, verbalizations, and so on. In practical terms, this means that unless a very common word is misspelled with high frequency, it is unlikely to have a measurable impact on LIWC-based and MEM-based measures of psychological processes. For example, if a single text contains the misspelling “teh” two times but contains 750 other uses of articles, the measured percentage of articles in the text will differ from the actual occurrence of articles by such a small amount as to be negligible.
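To make this concrete, here is a quick back-of-the-envelope calculation using the hypothetical counts above, assuming (for illustration only) a text of 10,000 total words:

```python
# Impact of leaving 2 "teh" misspellings uncorrected in a hypothetical
# 10,000-word text that already contains 750 recognized articles.
total_words = 10_000
articles_counted = 750   # articles the dictionary recognizes as-is
articles_missed = 2      # "teh" tokens that would have matched "the"

measured_pct = 100 * articles_counted / total_words
actual_pct = 100 * (articles_counted + articles_missed) / total_words

print(f"measured: {measured_pct:.2f}%  actual: {actual_pct:.2f}%")
# measured: 7.50%  actual: 7.52%
```

A difference of two-hundredths of a percentage point is far below the level at which any downstream statistical analysis would change.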

Texts with several high-frequency misspellings, however, may benefit from correction. Multiple software programs exist that allow users to automatically batch-correct text files to avoid the tedious job of manual spelling correction (e.g., GNU Aspell; http://aspell.net). While most of these applications are useful only for the technically savvy, other options exist that allow users to find specific misspellings and replace them with corrections, such as Find and Replace (FAR; http://findandreplace.sourceforge.net). Relatedly, regional spelling variants may benefit from standardization, depending on the nature of one’s research question. For example, given that the MEM looks for co-occurrences of words, we might expect the words “bloody,” “neighbour,” and “colour” to co-occur more often than “bloody,” “neighbor,” and “color.” Unless we are interested in identifying culture-specific word co-occurrences, we would want to standardize regional variants to have parallel spellings across all texts, ensuring more accurate psychological insights.
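For those who prefer to script this themselves, a minimal sketch of such standardization might look like the following (the variant mapping shown is hypothetical and would need to be tailored to one's own corpus):

```python
import re

# Hypothetical British-to-American spelling map; extend for the corpus at hand.
VARIANTS = {"colour": "color", "neighbour": "neighbor", "behaviour": "behavior"}

def standardize(text: str) -> str:
    """Replace regional spelling variants so parallel words match across texts."""
    def swap(match):
        word = match.group(0)
        replacement = VARIANTS.get(word.lower(), word)
        # Preserve the capitalization of the original token.
        return replacement.capitalize() if word[0].isupper() else replacement

    pattern = r"\b(" + "|".join(VARIANTS) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

print(standardize("My neighbour painted the fence a bloody awful colour."))
# My neighbor painted the fence a bloody awful color.
```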

Finally, certain special circumstances arise when analyzing transcribed verbal exchanges, particularly when using software such as LIWC. Several categories of “utterances” (such as nonfluency words like “uh” and “um,” and filler words like “like” and “y’know”) are psychologically meaningful (Laserna et al. 2014), but are often transcribed idiosyncratically according to certain traditions. The word “like” is particularly problematic given its various meanings as a result of homography (e.g., expressing evaluation—“I like him”—or filling spaces—“I, like, love sandwiches”). Primarily in the case of verbal transcriptions, many utterances must be converted to software-specific tokens that are recognized explicitly as filler words to improve text analysis accuracy (LIWC, for example, uses “rrlike” for filler word recognition).
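As a rough sketch of this kind of preprocessing, the snippet below normalizes elongated nonfluencies and retokenizes comma-delimited “like” as a filler. The comma-delimited convention assumed here is just one transcription tradition; real transcripts vary, and only the “rrlike” token itself comes from the LIWC documentation discussed above.

```python
import re

def mark_fillers(transcript: str) -> str:
    """Convert a transcription convention into tokens recognized as fillers.

    Assumes the transcriber sets off filler words with commas, so that
    "I, like, love sandwiches" contains a filler but "I like him" does not.
    """
    # Normalize elongated nonfluencies ("Ummm" -> "um", "uhh" -> "uh").
    text = re.sub(r"\bum+\b", "um", transcript, flags=re.IGNORECASE)
    text = re.sub(r"\buh+\b", "uh", text, flags=re.IGNORECASE)
    # Retokenize comma-delimited "like" as the filler token "rrlike".
    return re.sub(r"(?i),\s*like\s*,", ", rrlike,", text)

print(mark_fillers("Ummm, I, like, love sandwiches, but I like him."))
# um, I, rrlike, love sandwiches, but I like him.
```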

Working with the MEM and MEM Results

When performing any type of topic modeling procedure, including the MEM, several decisions must be made by the researcher performing the analysis. These typically include answering questions such as “What is the correct number of themes to extract?” and “How do I know what label to give each of these themes?” Topic modeling results are occasionally dismissed as purely subjective; however, this is seldom the case. Indeed, while some steps occasionally require some (arguably) arbitrary decisions, such decisions are typically made by relying on domain expertise. The author’s recommendation is that, when in doubt, the best course of action is to consult with experts familiar with the type of research being conducted.

How to Extract/Understand Themes Using MEM

While the heart of the MEM is a statistical procedure known as a Principal Components Analysis (PCA), most of the typical guidelines that are recommended for a PCA do not extend to the modeling of language patterns. Instead, it may be more useful to think of the MEM as co-opting a statistical procedure to reach an end-goal, rather than the PCA being the goal itself.

When extracting themes using a PCA, a researcher must specify a k parameter, where k is the number of themes for which the PCA will solve. What is the ideal k parameter? In other words, how many themes should be extracted? Is it 10? 15? 50? This is a contentious issue in virtually every field that uses some form of topic modeling. The best answer at this time comes in the form of another question: “What makes sense given your research question?”

When attempting to perform a psychological analysis of text, domain expertise on psychological constructs is immensely helpful. The primary recommendation for determining the optimal k parameter is to test multiple k’s, settling on the k parameter that appears to best represent the problem space. The ideal k parameter is also directly influenced by the amount of data that you are processing. If you have an extremely large number of wordy texts, you will be able to extract many, many more themes than if you are analyzing 100 posts made to Twitter.

Even with extremely large datasets, however, the smallest number of themes that can be coherently extracted is typically optimal, particularly in cases of psychological research. While it is not uncommon to see sophisticated studies that extract hundreds (or even thousands) of topics from a corpus, 95% of these topics will end up being an uninterpretable, highly inter-correlated grab-bag of words that does not represent anything in particular. Extracting large numbers of topics may still be highly useful in predictive modeling; however, this practice is often problematic from a social sciences/digital humanities perspective and also raises serious concerns about the replicability of one’s findings.
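As a sketch of what testing multiple k's can look like in practice, the snippet below runs a plain PCA (via NumPy's SVD) over a tiny hypothetical document-term matrix and reports how much variance each candidate k captures. This is a stand-in for a full MEM pipeline, not the method itself; a real analysis would use far more texts and would typically also apply the VARIMAX rotation discussed below.

```python
import numpy as np

# Tiny hypothetical document-term matrix: rows are texts, columns are word
# frequencies. In practice this would come from a tool such as the
# Meaning Extraction Helper, with many more texts and words.
words = ["college", "study", "test", "red", "blue", "green"]
X = np.array([
    [5.0, 4.0, 3.0, 0.0, 0.0, 1.0],
    [4.0, 5.0, 4.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 5.0, 4.0, 5.0],
    [1.0, 0.0, 0.0, 4.0, 5.0, 4.0],
    [2.0, 2.0, 1.0, 2.0, 2.0, 2.0],
])

Xc = X - X.mean(axis=0)                  # center columns before the PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # proportion of variance per component

# Try several candidate k's; a sharp drop-off in the variance gained by each
# additional component suggests the smallest coherent number of themes.
for k in (1, 2, 3):
    print(f"k = {k}: cumulative variance explained = {explained[:k].sum():.2f}")

# The highest-loading words on a retained component characterize its theme.
top = [words[i] for i in np.argsort(-np.abs(Vt[0]))[:3]]
print("component 1 top words:", top)
```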

Finally, when it comes to labeling themes that have been extracted using the MEM or other topic models, there are no hard and fast rules. Labels assigned to word clusters should primarily be treated as shorthand for referring to those clusters and, in most cases, should not be treated as objective operationalizations during research. For example, one researcher may see the words “happy,” “cry,” and “shout” cluster together in a MEM analysis and label this a broad “emotion” theme. Another researcher may see the same word cluster as a “joyous” theme, believing that the words “cry” and “shout” are being used in a positive manner in conjunction with happiness. Most of the time, the labels are fairly simple to agree upon (e.g., a word cluster comprised of “college,” “study,” “test,” and “class” is unlikely to be anything other than a broader “school” theme). However, when in doubt, one of the best ways to understand or interpret a questionable theme is to find texts that score high for that theme, then read them closely to see how the theme-related words are used.

How to Score Texts for MEM Themes

We will typically want to score each text in our corpus for every extracted theme. Once themes have been extracted using the MEM, there are two primary ways to make this information useful for subsequent statistical modeling. In the first method, “factor scores” can be calculated using a statistical approach, such as regression scoring. In this approach, each word is weighted for its “representativeness” of a given theme; these weights correspond to the “factor loadings” of each word on each theme. For example, if a theme is extracted pertaining to animals, the word “cat” may have a factor loading of 0.5, the word “bird” a loading of 0.25, and so on. These loadings can be used in a multiple linear regression model:

y = (a × x) + (b × z) + …, or

Animal Theme = (0.5 × cat) + (0.25 × bird) + …

This procedure is used to score a text for the “animal” theme, and so on for all other themes extracted during the MEM. Some statistical software, such as IBM’s SPSS, has options that allow users to perform this scoring method automatically.
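A stripped-down illustration of this weighting logic is shown below. Real factor scores are computed from standardized variables and the full loading matrix; the counts and loadings here are the hypothetical “animal” values from the example above.

```python
# Hypothetical factor loadings for an "animal" theme.
loadings = {"cat": 0.5, "bird": 0.25}

def theme_score(tokens, loadings):
    """Loading-weighted sum over tokens, mirroring y = (a * x) + (b * z) + ..."""
    return sum(loadings.get(token.lower(), 0.0) for token in tokens)

tokens = "My cat chased a bird while the other cat slept".split()
print(theme_score(tokens, loadings))  # 0.5 + 0.25 + 0.5 = 1.25
```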

If using the regression approach to scoring texts for MEM themes, it is extremely important to note that the MEM typically requires that one perform the PCA with something called a VARIMAX rotation. Without going too far into the details, an orthogonal axis rotation such as VARIMAX ensures that all themes are mathematically 100% independent of each other. In practical terms, this means that regression scores for themes will be perfectly uncorrelated, which is often not an accurate reflection of psychological processes. This method is occasionally used in the literature but is not recommended for most purposes unless you have a well-articulated reason for using it.

The second method for scoring MEM themes in texts is conceptually much simpler and strongly recommended. Essentially, this alternative method uses the MEM as a means of theme discovery: simply determining what common themes exist in a corpus, as well as which words are indicative of which themes. Texts are then scored for each theme using the word-counting approach employed by LIWC. For example, imagine that your use of the MEM uncovers two broad themes in your texts: a color theme (including the words “red,” “blue,” “green,” and “brown”) and a clothing theme (including the words “dress,” “shirt,” “shorts,” and “jeans”). The next step is to create a custom dictionary that places the “color” and “clothing” words into separate categories. You would then rescan your texts with software like LIWC to calculate the percentage of words pertaining to color and clothing, respectively. Unlike the regression method of scoring texts (which would force a perfect zero correlation between the two themes), one would likely find that use of color words and use of clothing words show a modest bivariate correlation, which makes sense from both an intuitive and a psychological perspective.
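A minimal sketch of this dictionary-based scoring, using the hypothetical color and clothing themes above:

```python
# Custom dictionary built from the (hypothetical) MEM themes above.
themes = {
    "color": {"red", "blue", "green", "brown"},
    "clothing": {"dress", "shirt", "shorts", "jeans"},
}

def score_text(text, themes):
    """Score a text as the percentage of its words falling in each category."""
    tokens = text.lower().split()
    return {
        name: 100 * sum(token in words for token in tokens) / len(tokens)
        for name, words in themes.items()
    }

print(score_text("She wore a blue dress and brown shorts", themes))
# {'color': 25.0, 'clothing': 25.0}
```

Because each theme is counted independently against the same texts, nothing forces the resulting percentages to be uncorrelated across themes.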


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Boyd, R.L. (2017). Psychological Text Analysis in the Digital Humanities. In: Hai-Jew, S. (eds) Data Analytics in Digital Humanities. Multimedia Systems and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-54499-1_7


  • DOI: https://doi.org/10.1007/978-3-319-54499-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54498-4

  • Online ISBN: 978-3-319-54499-1

  • eBook Packages: Computer Science (R0)
