Concept Mover’s Distance: measuring concept engagement via word embeddings in texts

Abstract

We propose a method for measuring a text’s engagement with a focal concept using distributional representations of the meaning of words. More specifically, this measure relies on word mover’s distance, which uses word embeddings to determine similarities between two documents. In our approach, which we call Concept Mover’s Distance, a document’s engagement is measured by the minimum distance the words in the document need to travel to arrive at the position of a “pseudo document” consisting of only words denoting a focal concept. This approach captures the prototypical structure of concepts and is fairly robust to pruning sparse terms as well as to variation in text lengths within a corpus. With pre-trained embeddings, it can be used even when terms denoting the concept are absent from a corpus, and it can be applied to bag-of-words datasets. We close by outlining some limitations of the proposed method as well as opportunities for future research.
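
The idea of words "traveling" to a pseudo document can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional embeddings are invented for illustration, the function names are hypothetical, and the transport problem is relaxed so that each document word sends all of its mass to its nearest concept term. When the pseudo document holds a single term, this relaxation coincides with the exact distance; with several terms it is only a lower bound.

```python
import math
from collections import Counter

# Hypothetical 2-D embeddings, invented for illustration only. A real
# application would load pre-trained vectors (e.g. fastText or GloVe).
TOY_EMBEDDINGS = {
    "death": (0.0, 0.0),
    "grave": (1.0, 0.0),
    "funeral": (0.0, 1.0),
    "picnic": (3.0, 4.0),
    "sunshine": (6.0, 8.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relative_frequencies(tokens, embeddings):
    """Weight each in-vocabulary word by its share of the document."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no tokens with an embedding")
    return {word: n / total for word, n in counts.items()}


def relaxed_concept_distance(tokens, concept_terms, embeddings):
    """Relaxed transport cost from a document to a concept pseudo document.

    Each word's mass moves to its nearest concept term. This is exact for
    a single concept term and a lower bound for several terms.
    """
    weights = relative_frequencies(tokens, embeddings)
    cost = 0.0
    for word, weight in weights.items():
        nearest = min(
            euclidean(embeddings[word], embeddings[term])
            for term in concept_terms
        )
        cost += weight * nearest
    return cost


doc_a = ["grave", "funeral", "grave", "death"]
doc_b = ["picnic", "sunshine", "picnic", "sunshine"]

print(relaxed_concept_distance(doc_a, ["death"], TOY_EMBEDDINGS))  # 0.75
print(relaxed_concept_distance(doc_b, ["death"], TOY_EMBEDDINGS))  # 7.5
```

In this toy setup the first document sits much closer to the concept than the second, even though the concept term itself accounts for only a quarter of its words.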

Notes

  1. As the document-by-term matrix used with WMD is weighted by relative frequency, this is the same as saying that 100% of the words are the same word.

  2. Specifically, \(O(p^3 \log p)\), where \(p\) is the number of unique words in the collection.

  3. Replication materials are available at https://github.com/dustinstoltz/concept_movers_distance_jcss.

  4. https://github.com/statsmaths/fasttextM.

  5. We compared the results of including and removing stopwords for a variety of terms and corpora. Overall the results were highly correlated, and the larger the initial corpus, the higher the correlation. However, including stopwords tended to make the distances much more stark, i.e., documents which were close became much closer and documents which were far became much farther. Therefore, we chose to remove stopwords throughout. This is certainly an area for further research.

  6. CMD works with documents of any size; therefore, we could have compared the two works as wholes (or even sentence by sentence), rather than by individual chapters. Our choice is entirely for illustrative purposes, more specifically to show variation across more observations.

  7. It is outside the scope of this paper to unpack this further; however, it is worth noting that Jaynes saw a direct connection between “gods forsaking” people and the breakdown of bicamerality (see [21]).
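
The observation in note 1 can be spelled out as a short derivation. The symbols are assumptions introduced here and do not come from the text above: \(d_i\) is the relative frequency of word \(i\) in the document (so the \(d_i\) sum to 1), \(d'\) is the pseudo document, \(c(i,j)\) is the distance between the embeddings of words \(i\) and \(j\), \(T_{ij}\) is the mass moved from word \(i\) to word \(j\), and \(k\) is the single concept word. Under these assumptions, a pseudo document that puts all of its mass on one word leaves only one feasible flow, so the transport problem collapses to a weighted average of distances:

```latex
\[
\begin{aligned}
\mathrm{WMD}(d, d') &= \min_{T \ge 0} \sum_{i} \sum_{j} T_{ij}\, c(i,j)
  \quad \text{s.t.} \quad \sum_{j} T_{ij} = d_i, \qquad \sum_{i} T_{ij} = d'_j, \\
d'_j &= \begin{cases} 1 & j = k \\ 0 & j \neq k \end{cases}
  \;\Longrightarrow\; T_{ij} = 0 \ \ (j \neq k), \qquad T_{ik} = d_i, \\
\mathrm{WMD}(d, d') &= \sum_{i} d_i\, c(i,k).
\end{aligned}
\]
```

No optimization is needed in this special case; the general cost in note 2 applies when the pseudo document contains more than one word.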

References

  1. Artetxe, M., Labaka, G., & Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 2289–2294). Austin: Association for Computational Linguistics.

  2. Benoit, K., & Watanabe, K. (2019). quanteda.corpora: A Collection of Corpora for quanteda. R package version 0.86. https://github.com/quanteda/quanteda.corpora. Accessed 18 Feb 2019.

  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  4. Boas, F. S. (1896). Shakespeare and his predecessors. London: John Murray.

  5. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  6. Bonikowski, B., & Gidron, N. (2016). The populist style in American politics: Presidential campaign discourse, 1952–1996. Social Forces, 94, 1593–1621.

  7. Brokos, G.-I., Malakasiotis, P., & Androutsopoulos, I. (2016). Using centroids of word embeddings and Word Mover’s Distance for biomedical document retrieval in question answering. arXiv preprint arXiv:1608.03905.

  8. R Core Team. (2013). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.

  9. Dennett, D. C. (1991). Consciousness explained. Boston: Back Bay Books.

  10. Diuk, C. G., Fernandez Slezak, D., Raskovsky, I., Sigman, M., & Cecchi, G. A. (2012). A quantitative philology of introspection. Frontiers in Integrative Neuroscience, 6, 1–12.

  11. Dodds, E. R. (1951). The Greeks and the irrational. Berkeley: The University of California Press.

  12. Ellis, N. C. (2019). Essentials of a theory of language cognition. The Modern Language Journal, 103, 39–60.

  13. Emirbayer, M. (1997). Manifesto for relational sociology. American Journal of Sociology, 103, 281–317.

  14. Firth, J. (1957). A synopsis of linguistic theory, 1930–1955. In: Studies in linguistic analysis (pp. 168–205). Oxford: Blackwell.

  15. Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115, E3635–E3644.

  16. Garvin, P. L. (1962). Computer participation in linguistic research. Language, 38(4), 385–389.

  17. Greimas, A. (1983). Structural semantics: An attempt at a method. Lincoln: University of Nebraska Press.

  18. Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 1489–1501). Berlin: Association for Computational Linguistics.

  19. Ignatow, G. (2009). Culture and embodied cognition: Moral discourses in internet support groups for overeaters. Social Forces, 88, 643–670.

  20. Jaynes, J. (1976). The origins of consciousness in the breakdown of the bicameral mind. Boston: Houghton Mifflin.

  21. Jaynes, J. (1986). Consciousness and the voices of the mind. Lecture given at the Canadian Psychological Association Symposium on Consciousness. Halifax: Canadian Psychological Association.

  22. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

  23. Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In: Proceedings of COLING 2012: Technical Papers (pp. 1459–1474). Mumbai: Association for Computational Linguistics.

  24. Kozlowski, A. C., Taddy, M., & Evans, J. A. (2018). The geometry of culture: Analyzing meaning through word embeddings. arXiv preprint arXiv:1803.09288.

  25. Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From word embeddings to document distances. In: Proceedings of the 32nd International Conference on Machine Learning. Lille: International Machine Learning Society.

  26. Lakoff, G. (2002). Moral politics: How liberals and conservatives think. Chicago: The University of Chicago Press.

  27. Leaf, W. (1892). A companion to the Iliad, for English readers. London: MacMillan and Co.

  28. Lenci, A. (2018). Distributional models of word meaning. Annual Review of Linguistics, 4, 151–171.

  29. Levina, E., & Bickel, P. (2001). The Earth Mover’s Distance is the Mallows distance: Some insights from statistics. In: Proceedings of the Eighth IEEE International Conference on Computer Vision. Vancouver: Institute of Electrical and Electronics Engineers.

  30. Meyers, V. (1991). George Orwell. London: MacMillan.

  31. Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT 2013 (pp. 746–751). Atlanta: Association for Computational Linguistics.

  32. Mohr, J. W. (1998). Measuring meaning structures. Annual Review of Sociology, 24, 345–370.

  33. Mullins, D. A., Hoyer, D., Collins, C., Currie, T., Freeney, K., François, P., et al. (2018). A systematic assessment of ‘Axial Age’ proposals using global comparative historical evidence. American Sociological Review, 83, 596–626.

  34. Pagel, M., Atkinson, Q. D., & Meade, A. (2007). Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature, 449, 717–721.

  35. Pele, O., & Werman, M. (2009). Fast and robust Earth Mover’s Distances. In: 2009 IEEE 12th International Conference on Computer Vision (pp. 460–467). Kyoto: Institute of Electrical and Electronics Engineers.

  36. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1532–1543). Doha: Association for Computational Linguistics.

  37. Project Gutenberg. (2019). Project Gutenberg. https://www.gutenberg.org/wiki/Main_Page. Accessed 18 Feb 2019.

  38. Raskovsky, I., Fernández Slezak, D., Diuk, C. G., & Cecchi, G. A. (2010). The emergence of the modern concept of introspection: A quantitative linguistic analysis. In: Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas (pp. 68–75). Los Angeles: Association for Computational Linguistics.

  39. Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573–605.

  40. Rubner, Y., Tomasi, C., & Guibas, L. J. (1998). A metric for distributions with applications to image databases. In: Proceedings of the 1998 IEEE International Conference on Computer Vision. Bombay: Institute of Electrical and Electronics Engineers.

  41. Scheff, T. J. (2011). What’s love got to do with it? Emotions and relationships in pop songs. New York: Routledge.

  42. Schloerke, B., Crowley, J., Cook, D., Hofmann, H., Wickham, H., Briatte, F., Marbach, M., Thoen, E., Elberg, A., & Larmarange, J. (2018). GGally: Extension to ‘ggplot2’. R package version 1.4.0. https://cran.r-project.org/web/packages/GGally/GGally.pdf. Accessed 18 Feb 2019.

  43. Snell, B. (2013). The Discovery of the Mind: The Greek Origins of European Thought. Translated by T. G. Rosenmeyer. Tacoma: Angelico Press (1953).

  44. Selivanov, D., & Wang, Q. (2018). text2vec: Modern text mining framework for R. R package 0.5.1 documentation. https://cran.r-project.org/web/packages/text2vec/text2vec.pdf. Accessed 16 Feb 2019.

  45. Smith, S., Turban, D., Hamblin, S., & Hammerla, N. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.

  46. Taylor, J. R. (2003). Linguistic categorization. New York: Oxford University Press.

  47. Taylor, M. A., Stoltz, D. S., & McDonnell, T. E. (2019). Binding significance to form: Cultural objects, neural binding, and cultural change. Poetics, 73, 1–16.

  48. The American Presidency Project. (2018). Annual Messages to Congress on the State of the Union (Washington 1790–Trump 2018). https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union. Accessed 3 Feb 2019.

  49. Urban Institute Research. (2019). urbnthemes: Urban Institute’s ggplot2 Theme and Tools. https://github.com/UI-Research/urbnthemes. Accessed 18 Feb 2019.

  50. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006–1011). Denver: Association for Computational Linguistics.

  51. Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. New York: Springer.

Author information

Corresponding author

Correspondence to Dustin S. Stoltz.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Cite this article

Stoltz, D.S., Taylor, M.A. Concept Mover’s Distance: measuring concept engagement via word embeddings in texts. J Comput Soc Sc 2, 293–313 (2019). https://doi.org/10.1007/s42001-019-00048-6

Keywords

  • Cultural sociology
  • Concept Mover’s Distance
  • Word embeddings
  • Natural language processing
  • Text analysis