Integrating semantic directions with concept mover’s distance to measure binary concept engagement

Abstract

In an earlier article published in this journal (“Concept Mover’s Distance”, 2019), we proposed a method for measuring concept engagement in texts that uses word embeddings to find the minimum cost necessary for words in an observed document to “travel” to words in a “pseudo-document” consisting only of words denoting a concept of interest. One potential limitation we noted is that, because words associated with opposing concepts will be located close to one another in the embedding space, documents will likely be similarly close to starkly opposing concepts (e.g., “life” and “death”). Using aggregate vector differences between antonym pairs to extract a direction in the semantic space pointing toward a pole of the binary opposition (following “The Geometry of Culture,” American Sociological Review, 2019), we illustrate how Concept Mover’s Distance (CMD) can be used to measure a document’s engagement with binary concepts.

Notes

  1. This value could be obtained by subtracting the cosine similarity between “bowling” and “rich” (\(-0.754\)) from the cosine similarity between “bowling” and “poor” (0.962), and dividing by two: \((0.962 - (-0.754))/2 = 0.858\).

  2. See [20, pp. 296–9] and [11] for more detailed discussions of the underlying algorithm. Several teams have found computationally efficient methods of solving the transportation problem, and our method now incorporates linear-complexity relaxed word mover’s distance [2], as implemented in the text2vec package [19].

  3. There are two differences in the SOTU example from [20]. First, there are 241 speeches in this analysis, while there were 239 in [20]. Second, non-ASCII characters were removed as part of the processing procedure in the present analysis.

References

  1. Arseniev-Koehler, A., & Foster, J. (2020). Machine learning as a model for cultural learning: Teaching an algorithm what it means to be fat. SocArXiv. https://osf.io/preprints/socarxiv/c9yj3/.

  2. Atasu, K., Parnell, T., Dünner, C., Sifalakis, M., Pozidis, H., Vasileiadis, V., et al. (2017). Linear-complexity relaxed word mover’s distance with GPU acceleration. In J.-Y. Nie, Z. Obradovic, T. Suzumura, R. Ghosh, R. Nambiar, C. Wang, et al. (Eds.), 2017 IEEE international conference on big data (pp. 889–896). Boston: IEEE.

  3. Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv. https://arxiv.org/abs/1606.06121.

  4. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29, pp. 4349–4357). Curran Associates Inc.

  5. Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.

  6. Ethayarajh, K., Duvenaud, D., & Hirst, G. (2019). Understanding undesirable word embedding associations. arXiv. https://arxiv.org/abs/1908.06361.

  7. Garg, N., Schiebinger, L., Jurafsky, D., & Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences of the United States of America, 115(16), E3635–E3644.

  8. Goldberg, A. (2011). Mapping shared understandings using relational class analysis: The case of the cultural omnivore reexamined. American Journal of Sociology, 116(5), 1397–1436.

  9. Kassambara, A. (2020). ggpubr: ‘ggplot2’ based publication ready plots. R package version 0.2.5. https://cran.r-project.org/web/packages/ggpubr/ggpubr.pdf. Accessed 11 June 2020.

  10. Kozlowski, A. C., Taddy, M., & Evans, J. A. (2019). The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5), 905–949.

  11. Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From word embeddings to document distances. In International conference on machine learning (pp. 957–966).

  12. Lakoff, G. (2010). Moral politics: How liberals and conservatives think. Chicago: University of Chicago Press.

  13. Larsen, A. B. L., Sønderby, S. K., Larochelle, H., & Winther, O. (2016). Autoencoding beyond pixels using a learned similarity metric. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning (pp. 1558–1566). New York: ACM.

  14. Makrai, M., Nemeskey, D., & Kornai, A. (2013). Applicative structure in vector space models. In A. Allauzen, H. Larochelle, C. Manning, & R. Socher (Eds.), Proceedings of the workshop on continuous vector space models and their compositionality (pp. 59–63). Sofia, Bulgaria: ACL.

  15. Mikolov, T., Yih, W.-T., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 746–751). aclweb.org.

  16. Project Gutenberg. (2020). https://www.gutenberg.org/wiki/Main_Page.

  17. Rubner, Y., Tomasi, C., & Guibas, L. J. (1998). A metric for distributions with applications to image databases. In Sixth international conference on computer vision (IEEE Cat. No. 98CH36271) (pp. 59–66). IEEE.

  18. Sahlgren, M. (2008). The distributional hypothesis. Italian Journal of Linguistics, 20, 33–53.

  19. Selivanov, D., Bickel, M., & Wang, Q. (2020). text2vec: Modern text mining framework for R. R package version 0.6. https://cran.r-project.org/web/packages/text2vec/text2vec.pdf. Accessed 11 June 2020.

  20. Stoltz, D. S., & Taylor, M. A. (2019). Concept mover’s distance: Measuring concept engagement via word embeddings in texts. Journal of Computational Social Science, 2(2), 293–313.

  21. Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. (ISBN 0-387-95457-0).

  22. Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. New York: Springer.

  23. Woolley, J. T., & Peters, G. (2008). The American presidency project, Santa Barbara. Available from: http://www.presidency.ucsb.edu/ws.

Acknowledgements

A replication repository for this paper can be found at: https://github.com/Marshall-Soc/cmd_geometry.

Author information

Corresponding author

Correspondence to Marshall A. Taylor.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Procedures for deriving semantic directions

Deriving a semantic direction in an embedding space is a specific kind of relation extraction or induction. As such, there are many viable procedures one could use to find the pole of a binary concept in an embedding space. First, the simplest method would involve changing the order of operations used by Kozlowski et al. [10]: average the vectors for the words on each pole and then take the difference between these two averages. Arseniev-Koehler and Foster [1] refer to this method as the “Larsen method” following [13, p. 5]. Kozlowski et al. [10, p. 943 fn8] state that the Larsen method produced “nearly identical results” to theirs.
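The difference in order of operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors, the word pairs, and the function names (`pairwise_direction`, `larsen_direction`) are invented for the example and are not taken from [10], [13], or the replication repository.

```python
# Toy 3-dimensional "embeddings"; every number here is made up for illustration.
toy_vectors = {
    "rich": [0.9, 0.1, 0.3],
    "poor": [-0.8, 0.2, 0.3],
    "affluent": [0.7, 0.0, 0.4],
    "destitute": [-0.6, 0.1, 0.5],
}
# Hypothetical antonym pairs: (word on one pole, word on the opposite pole).
antonym_pairs = [("rich", "poor"), ("affluent", "destitute")]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def pairwise_direction(pairs, vectors):
    """Difference each antonym pair first, then average the differences."""
    return mean_vector([subtract(vectors[a], vectors[b]) for a, b in pairs])


def larsen_direction(pairs, vectors):
    """Average the words on each pole first, then difference the two averages."""
    pole_a = mean_vector([vectors[a] for a, _ in pairs])
    pole_b = mean_vector([vectors[b] for _, b in pairs])
    return subtract(pole_a, pole_b)


pairwise = pairwise_direction(antonym_pairs, toy_vectors)
larsen = larsen_direction(antonym_pairs, toy_vectors)
print(pairwise)
print(larsen)
```

With raw vectors and the same number of words on each pole, the two orderings agree up to floating-point error, because averaging and subtraction are both linear. Any gap between them in practice would have to come from steps this sketch does not model, such as normalizing vectors or using poles of unequal size.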

Second, Arseniev-Koehler and Foster [1] compare the Larsen method to one used in Bolukbasi et al. [3, pp. 42–43], which entails getting the vector offsets of antonym pairs through subtraction, then dividing each resulting offset by its Euclidean norm (see also [6]). Arseniev-Koehler and Foster find that the results are similar, but that the Larsen method is more accurate than this Bolukbasi method.
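A minimal sketch of the offset-normalization step follows. The toy vectors and function names are hypothetical, and the final averaging of the unit-length offsets is an assumption made here to produce a single direction; the passage above does not say how the normalized offsets are combined.

```python
import math

# Toy 3-dimensional "embeddings"; every number here is made up for illustration.
toy_vectors = {
    "rich": [0.9, 0.1, 0.3],
    "poor": [-0.8, 0.2, 0.3],
    "affluent": [0.7, 0.0, 0.4],
    "destitute": [-0.6, 0.1, 0.5],
}
antonym_pairs = [("rich", "poor"), ("affluent", "destitute")]


def unit_offset(u, v):
    """Offset u - v divided by its own Euclidean norm."""
    diff = [a - b for a, b in zip(u, v)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("antonym pair has identical vectors")
    return [x / norm for x in diff]


def normalized_offset_direction(pairs, vectors):
    """Average the unit-length offsets (the averaging step is an assumption)."""
    offsets = [unit_offset(vectors[a], vectors[b]) for a, b in pairs]
    n = len(offsets)
    return [sum(column) / n for column in zip(*offsets)]


unit_offsets = [unit_offset(toy_vectors[a], toy_vectors[b]) for a, b in antonym_pairs]
direction = normalized_offset_direction(antonym_pairs, toy_vectors)
print(direction)
```

Normalizing each offset before aggregating gives every antonym pair equal weight, whereas the un-normalized average lets pairs that sit farther apart in the space contribute more.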

Third, Bolukbasi et al. [4] offer an additional method involving taking the difference between antonym pairs (they specifically use gendered terms), but then using principal component analysis to find a suitable aggregate from the resulting vector differences.
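One way to picture the principal-component step is the sketch below, which takes the leading direction of a set of difference vectors by power iteration. It is an illustration only: the difference vectors are invented, the function name is hypothetical, and using the uncentered second-moment matrix of the differences is an assumption about how the aggregate is formed, not a reproduction of the procedure in [4].

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def first_principal_direction(diffs, iterations=200):
    """Leading eigenvector of sum_i d_i d_i^T, found by power iteration.

    The matrix is never formed explicitly: (sum_i d_i d_i^T) v is computed
    as sum_i (d_i . v) d_i. The all-ones start vector is assumed not to be
    orthogonal to the leading direction.
    """
    dim = len(diffs[0])
    v = normalize([1.0] * dim)
    for _ in range(iterations):
        mv = [0.0] * dim
        for d in diffs:
            weight = dot(d, v)
            for j in range(dim):
                mv[j] += weight * d[j]
        v = normalize(mv)
    # A principal direction has no inherent sign, so orient it toward the
    # mean difference to make it point at the first word of each pair.
    mean_diff = [sum(column) / len(diffs) for column in zip(*diffs)]
    if dot(v, mean_diff) < 0:
        v = [-x for x in v]
    return v


# Made-up difference vectors between antonym pairs (first word minus second).
toy_differences = [
    [1.7, -0.1, 0.0],
    [1.3, -0.1, -0.1],
    [1.5, 0.2, 0.1],
]
pca_direction = first_principal_direction(toy_differences)
print(pca_direction)
```

Unlike a plain average, the leading component captures the axis along which the differences vary most, so a single unusual pair pulls it less in toy cases like this one; whether that matters in a real embedding space is an empirical question.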

Finally, for completeness, there is another procedure which involves measuring individual target words’ associations with antonym pairs. This procedure does not, however, define a semantic direction against which any word could be compared and thus cannot be used directly with CMD. Caliskan et al. [5] incorporate this approach into a measure of gender bias in target terms, a technique they refer to as the Word-Embedding Association Test (WEAT). This entails first picking a target term, such as “wrench” or “boat.” Then one would take the mean of this target term’s distances to female-typed words, such as “girl,” “woman,” or “lady.” Next, one would take the mean of this same term’s distances to male-typed words, such as “boy,” “man,” and “gentleman.” Finally, the analyst subtracts the first mean from the second mean to arrive at a single measure of how strongly this target term is associated with either side of the binary (see also [7]).
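The per-word step described above can be sketched as follows. The two-dimensional vectors are invented, the function names are hypothetical, and cosine distance (one minus cosine similarity) is an assumed choice of distance; the sketch covers only the single-target calculation, not the full test in [5].

```python
import math


def cosine_distance(u, v):
    """One minus cosine similarity (an assumed choice of distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance(target, attribute_words, vectors):
    """Mean distance from the target word to a set of attribute words."""
    distances = [cosine_distance(vectors[target], vectors[w]) for w in attribute_words]
    return sum(distances) / len(distances)


def association_score(target, first_set, second_set, vectors):
    """Mean distance to second_set minus mean distance to first_set.

    Because these are distances, a positive score means the target sits
    closer to first_set, and a negative score means closer to second_set.
    """
    return mean_distance(target, second_set, vectors) - mean_distance(
        target, first_set, vectors
    )


# Toy 2-dimensional "embeddings"; every number here is made up for illustration.
toy_vectors = {
    "wrench": [0.8, 0.2],
    "girl": [-0.9, 0.3],
    "woman": [-0.8, 0.4],
    "lady": [-0.7, 0.5],
    "boy": [0.9, 0.3],
    "man": [0.8, 0.4],
    "gentleman": [0.7, 0.5],
}
female_words = ["girl", "woman", "lady"]
male_words = ["boy", "man", "gentleman"]

wrench_score = association_score("wrench", female_words, male_words, toy_vectors)
swapped_score = association_score("wrench", male_words, female_words, toy_vectors)
print(wrench_score)
```

In this toy space “wrench” was placed near the male-typed words, so the score comes out negative. The score is defined only for the chosen target word, which is why this procedure yields no reusable direction for CMD.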

About this article

Cite this article

Taylor, M.A., Stoltz, D.S. Integrating semantic directions with concept mover’s distance to measure binary concept engagement. J Comput Soc Sc 4, 231–242 (2021). https://doi.org/10.1007/s42001-020-00075-8

Keywords

  • Concept mover’s distance
  • Geometry of culture
  • Word embeddings
  • Text analysis
  • Cultural sociology
  • Natural language processing