
Interpretable Topic Extraction and Word Embedding Learning Using Row-Stochastic DEDICOM

  • Conference paper
Machine Learning and Knowledge Extraction (CD-MAKE 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12279)


Abstract

The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and provide a qualitative evaluation of its topic modeling and word embedding performance.
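As a reading aid, here is a minimal, hypothetical sketch of the factorization described above: a positive PMI matrix S is approximated as A R A^T, with the rows of A kept stochastic via a softmax reparameterization and both factors fitted with Adam [8]. This is not the authors' reference implementation (linked in the notes below); the function name, the softmax parameterization and all hyperparameters are illustrative assumptions.

    # Sketch only: row-stochastic DEDICOM on a (positive) PMI matrix.
    # Assumes PyTorch; names and hyperparameters are illustrative, not the paper's.
    import torch

    def row_stochastic_dedicom(S, k, steps=5000, lr=0.01):
        """S: (n, n) PMI matrix as a torch tensor, k: number of latent topics."""
        n = S.shape[0]
        logits = torch.randn(n, k, requires_grad=True)   # unconstrained parameters behind A
        R = torch.randn(k, k, requires_grad=True)        # affinity matrix between topics
        opt = torch.optim.Adam([logits, R], lr=lr)
        for _ in range(steps):
            A = torch.softmax(logits, dim=1)             # each row of A sums to 1
            loss = ((S - A @ R @ A.T) ** 2).sum()        # squared Frobenius reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.softmax(logits, dim=1).detach(), R.detach()

In this reading, each row of the returned A is an interpretable, k-dimensional word embedding whose entries can be read as topic memberships, and R captures how the k latent topics interact.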

L. Hillebrand and D. Biesner contributed equally (shared first authorship).


Notes

  1. More recent expansions of these methods can be found in [9, 14].

  2. All results are completely reproducible based on the information in this section. Our Python implementation to reproduce the results is available at https://github.com/LarsHill/text-dedicom-paper.

  3. We provide a large-scale evaluation of all article combinations listed in Table 3, including different choices for k, as supplementary material at https://bit.ly/3cBxsGI.

References

  1. Andrzej, A.H., Cichocki, A., Dinh, T.V.: Nonnegative DEDICOM based on tensor decompositions for social networks exploration. Aust. J. Intell. Inf. Process. Syst. 12 (2010)

  2. Bader, B.W., Harshman, R.A., Kolda, T.G.: Pattern analysis of directed graphs using DEDICOM: an application to Enron email (2006)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Chew, P., Bader, B., Rozovskaya, A.: Using DEDICOM for completely unsupervised part-of-speech tagging. In: Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, pp. 54–62. Association for Computational Linguistics, Boulder (2009)

  5. Furnas, G.W., et al.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of ACM SIGIR (1988)

  6. Harshman, R., Green, P., Wind, Y., Lundy, M.: A model for the analysis of asymmetric data in marketing research. Mark. Sci. 1, 205–242 (1982)

  7. Jolliffe, I.: Principal Component Analysis. Wiley, New York (2005)

  8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  9. Lebret, R., Collobert, R.: Word embeddings through Hellinger PCA. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 482–490. Association for Computational Linguistics, Gothenburg (2014)

  10. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS 2000, pp. 535–541. MIT Press, Cambridge (2000)

  11. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS 2014, pp. 2177–2185. MIT Press, Cambridge (2014)

  12. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018)

  13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)

  14. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)

  15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, Doha (2014)

  16. Sifa, R., Ojeda, C., Bauckhage, C.: User churn migration analysis with DEDICOM. In: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, pp. 321–324. Association for Computing Machinery, New York (2015)

  17. Sifa, R., Ojeda, C., Cvejoski, K., Bauckhage, C.: Interpretable matrix factorization with stochasticity constrained nonnegative DEDICOM (2018)

  18. Wang, Y., Zhu, L.: Research and implementation of SVD in machine learning. In: 2017 IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS), pp. 471–475 (2017)


Acknowledgement

The authors of this work were supported by the Competence Center for Machine Learning Rhine Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038C). We gratefully acknowledge this support.

Author information

Corresponding author

Correspondence to Lars Hillebrand.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1010 KB)

Appendix

Table 3. Overview of our semi-artificial dataset. Each synthetic sample consists of the corresponding Wikipedia articles 1–3. We differentiate between different articles, i.e. articles that have little thematic overlap (for example a person and a city, a fish and an insect or a ball game and a combat sport), and similar articles, i.e. articles with large thematic overlap (for example European countries, tech companies or aquatic animals). We group our dataset into different samples (3 articles that are pairwise different), similar samples (3 articles that are all similar) and mixed samples (2 similar articles, 1 different).
Table 4. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Table 5. Top half lists the top 10 representative words per dimension of the basis matrix A, bottom half lists the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Fig. 5. (a) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by topic assignment. (b) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by original Wikipedia article assignment (words that occur in more than one article are excluded). (c) Colored heatmap of affinity matrix \(\textit{\textbf{R}}\).

Table 6. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Table 7. Top half lists the top 10 representative words per dimension of the basis matrix A, bottom half lists the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Fig. 6. (a) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by topic assignment. (b) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by original Wikipedia article assignment (words that occur in more than one article are excluded). (c) Colored heatmap of affinity matrix \(\textit{\textbf{R}}\).

Table 8. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
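As a reading aid for the tables above, the following hypothetical snippet shows how such listings can be derived from a word-embedding matrix A: the vocabulary is ranked by its weight in each column to obtain the top topic words, and rows are ranked by cosine similarity to obtain the nearest neighbours of the two top words per topic. Function and variable names are illustrative assumptions, not taken from the paper's implementation.

    # Sketch only: derive "top words per topic" and "most similar words" listings
    # from a word-embedding matrix A (n_words x k) and a vocabulary list.
    import numpy as np

    def describe_topics(A, vocab, top_words=10, neighbours=5):
        A = np.asarray(A, dtype=float)
        A_unit = A / np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
        sims = A_unit @ A_unit.T                              # pairwise cosine similarities
        for topic in range(A.shape[1]):
            ranked = np.argsort(-A[:, topic])[:top_words]     # words weighted highest on this topic
            print(f"Topic {topic}: {[vocab[i] for i in ranked]}")
            for i in ranked[:2]:                              # 2 top words per topic, as in the tables
                nn = np.argsort(-sims[i])[1:neighbours + 1]   # skip the word itself
                print(f"  {vocab[i]} -> {[vocab[j] for j in nn]}")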


Copyright information

© 2020 IFIP International Federation for Information Processing

About this paper


Cite this paper

Hillebrand, L., Biesner, D., Bauckhage, C., Sifa, R. (2020). Interpretable Topic Extraction and Word Embedding Learning Using Row-Stochastic DEDICOM. In: Holzinger, A., Kieseberg, P., Tjoa, A., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2020. Lecture Notes in Computer Science, vol. 12279. Springer, Cham. https://doi.org/10.1007/978-3-030-57321-8_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-57321-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57320-1

  • Online ISBN: 978-3-030-57321-8

