Abstract
The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and provide a qualitative evaluation of its topic modeling and word embedding performance.
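To make the factorization concrete, the sketch below fits a row-stochastic DEDICOM model S ≈ A R Aᵀ to a precomputed pointwise mutual information matrix S. It is a minimal illustration, not the authors' implementation (linked in the Notes below): the function name `train_dedicom`, the softmax parametrization used to keep the rows of A on the probability simplex, the Adam optimizer, and all hyperparameters are assumptions made for this example.

```python
# Minimal sketch of row-stochastic DEDICOM (illustrative, not the authors' code).
# S: n x n (P)PMI matrix as a torch tensor, k: number of latent topics.
import torch

def train_dedicom(S, k, steps=5000, lr=1e-2):
    """Fit S ~ A @ R @ A.T with A (n x k) row-stochastic and R (k x k) unconstrained."""
    n = S.shape[0]
    logits = torch.randn(n, k, requires_grad=True)  # A = softmax(logits): rows are nonnegative and sum to 1
    R = torch.randn(k, k, requires_grad=True)       # affinity matrix between the k latent topics
    opt = torch.optim.Adam([logits, R], lr=lr)      # assumed optimizer for this sketch
    for _ in range(steps):
        A = torch.softmax(logits, dim=1)
        loss = torch.norm(S - A @ R @ A.T) ** 2     # squared Frobenius reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=1).detach(), R.detach()
```

Under this reading, row i of A is word i's distribution over the k latent topics (its interpretable embedding), a hard topic assignment follows from the row-wise argmax, and R expresses how strongly the topics interact in the PMI structure.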
Keywords
- Word embeddings
- Topic analysis
- Matrix factorization
- Natural language processing
L. Hillebrand and D. Biesner contributed equally to this work.
Notes
1.
2. All results are completely reproducible based on the information in this section. Our Python implementation to reproduce the results is available at https://github.com/LarsHill/text-dedicom-paper.
3. We provide a large-scale evaluation of all article combinations listed in Table 3, including different choices for k, as supplementary material at https://bit.ly/3cBxsGI.
Acknowledgement
The authors of this work were supported by the Competence Center for Machine Learning Rhine Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038C). We gratefully acknowledge this support.
Appendix
(a) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by topic assignment. (b) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by original Wikipedia article assignment (words that occur in more than one article are excluded). (c) Colored heatmap of affinity matrix \(\textit{\textbf{R}}\).
Copyright information
© 2020 IFIP International Federation for Information Processing
Cite this paper
Hillebrand, L., Biesner, D., Bauckhage, C., Sifa, R. (2020). Interpretable Topic Extraction and Word Embedding Learning Using Row-Stochastic DEDICOM. In: Holzinger, A., Kieseberg, P., Tjoa, A., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2020. Lecture Notes in Computer Science, vol. 12279. Springer, Cham. https://doi.org/10.1007/978-3-030-57321-8_22