Dictionaries and distributions: Combining expert knowledge and large scale textual data content analysis

Distributed dictionary representation
  • Justin Garten
  • Joe Hoover
  • Kate M. Johnson
  • Reihane Boghrati
  • Carol Iskiwitch
  • Morteza Dehghani

Abstract

Theory-driven text analysis has made extensive use of psychological concept dictionaries, leading to a wide range of important results. These dictionaries have generally been applied through word count methods, which have proven to be both simple and effective. In this paper, we introduce Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts. This allows for the measurement of the similarity between dictionaries and spans of text ranging from complete documents to individual words. We show how DDR enables dictionary authors to place greater emphasis on construct validity without sacrificing linguistic coverage. We further demonstrate the benefits of DDR on two real-world tasks and conduct an extensive study of the interaction between dictionary size and task performance. These studies allow us to examine how DDR and word count methods complement one another as tools for applying concept dictionaries and where each is best applied. Finally, we provide references to tools and resources to make this method both available and accessible to a broad psychological audience.
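
The contrast between word counts and semantic similarity can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a dictionary and a span of text are each represented by the mean of their word vectors and compared by cosine similarity, which the abstract does not spell out. The vectors in TOY_VECTORS are invented for illustration, and the function names (mean_vector, ddr_similarity, word_count_score) are hypothetical.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real application would load pretrained embeddings instead.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "joy": [0.8, 0.2, 0.1],
    "glad": [0.85, 0.15, 0.05],
    "sad": [-0.8, 0.1, 0.2],
    "table": [0.0, 0.9, 0.3],
}


def mean_vector(words, vectors):
    """Average the vectors of the words that have one; None if none do."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ddr_similarity(dictionary_words, text, vectors):
    """Similarity between a dictionary and a text (assumed: mean vectors + cosine)."""
    tokens = text.lower().split()
    dict_vec = mean_vector(dictionary_words, vectors)
    text_vec = mean_vector(tokens, vectors)
    if dict_vec is None or text_vec is None:
        return None
    return cosine(dict_vec, text_vec)


def word_count_score(dictionary_words, text):
    """Baseline: share of tokens that literally appear in the dictionary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in dictionary_words for t in tokens) / len(tokens)


positive = ["happy", "joy"]
print(word_count_score(positive, "glad"))           # no literal match
print(ddr_similarity(positive, "glad", TOY_VECTORS))  # high similarity
print(ddr_similarity(positive, "sad", TOY_VECTORS))   # negative similarity
```

In this toy setup a two-word dictionary still scores the unlisted word "glad" as close to the concept, while the word count baseline scores it as zero. That is the sense in which a small, carefully chosen dictionary need not sacrifice coverage.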

Keywords

Methodological innovation; Text analysis; Semantic representation; Dictionary-based text analysis

Copyright information

© Psychonomic Society, Inc. 2017

Authors and Affiliations

  • Justin Garten (1)
  • Joe Hoover (1)
  • Kate M. Johnson (1)
  • Reihane Boghrati (1)
  • Carol Iskiwitch (1)
  • Morteza Dehghani (1)

  1. Computational Social Science Laboratory, University of Southern California, Los Angeles, USA
