
Interpretable Topic Extraction and Word Embedding Learning Using Row-Stochastic DEDICOM

  • Conference paper
Machine Learning and Knowledge Extraction (CD-MAKE 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12279)


Abstract

The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and provide a qualitative evaluation of its topic modeling and word embedding performance.
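As a reading aid, here is a minimal, hypothetical sketch of the factorization described above: a positive PMI matrix S is approximated as A R A^T, with the rows of A kept stochastic via a softmax reparameterization and both factors fitted with Adam [8]. This is not the authors' reference implementation (linked in the notes below); the function name, the softmax parameterization and all hyperparameters are illustrative assumptions.

    # Sketch only: row-stochastic DEDICOM on a (positive) PMI matrix.
    # Assumes PyTorch; names and hyperparameters are illustrative, not the paper's.
    import torch

    def row_stochastic_dedicom(S, k, steps=5000, lr=0.01):
        """S: (n, n) PMI matrix as a torch tensor, k: number of latent topics."""
        n = S.shape[0]
        logits = torch.randn(n, k, requires_grad=True)   # unconstrained parameters behind A
        R = torch.randn(k, k, requires_grad=True)        # affinity matrix between topics
        opt = torch.optim.Adam([logits, R], lr=lr)
        for _ in range(steps):
            A = torch.softmax(logits, dim=1)             # each row of A sums to 1
            loss = ((S - A @ R @ A.T) ** 2).sum()        # squared Frobenius reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.softmax(logits, dim=1).detach(), R.detach()

In this reading, each row of the returned A is an interpretable, k-dimensional word embedding whose entries can be read as topic memberships, and R captures how the k latent topics interact.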

L. Hillebrand and D. Biesner contributed equally (shared first authorship).


Notes

  1. More recent expansions of these methods can be found in [9, 14].

  2. All results are completely reproducible based on the information in this section. Our Python implementation to reproduce the results is available at https://github.com/LarsHill/text-dedicom-paper.

  3. We provide a large-scale evaluation of all article combinations listed in Table 3, including different choices for k, as supplementary material at https://bit.ly/3cBxsGI.

References

  1. Andrzej, A.H., Cichocki, A., Dinh, T.V.: Nonnegative DEDICOM based on tensor decompositions for social networks exploration. Aust. J. Intell. Inf. Process. Syst. 12 (2010)

  2. Bader, B.W., Harshman, R.A., Kolda, T.G.: Pattern analysis of directed graphs using DEDICOM: an application to Enron email (2006)

  3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  4. Chew, P., Bader, B., Rozovskaya, A.: Using DEDICOM for completely unsupervised part-of-speech tagging. In: Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics, pp. 54–62. Association for Computational Linguistics, Boulder (2009)

  5. Furnas, G.W., et al.: Information retrieval using a singular value decomposition model of latent semantic structure. In: Proceedings of ACM SIGIR (1988)

  6. Harshman, R., Green, P., Wind, Y., Lundy, M.: A model for the analysis of asymmetric data in marketing research. Mark. Sci. 1, 205–242 (1982)

  7. Jolliffe, I.: Principal Component Analysis. Wiley, New York (2005)

  8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  9. Lebret, R., Collobert, R.: Word embeddings through Hellinger PCA. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 482–490. Association for Computational Linguistics, Gothenburg (2014)

  10. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS 2000, pp. 535–541. MIT Press, Cambridge (2000)

  11. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS 2014, pp. 2177–2185. MIT Press, Cambridge (2014)

  12. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2018)

  13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)

  14. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)

  15. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, Doha (2014)

  16. Sifa, R., Ojeda, C., Bauckhage, C.: User churn migration analysis with DEDICOM. In: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, pp. 321–324. Association for Computing Machinery, New York (2015)

  17. Sifa, R., Ojeda, C., Cvejoski, K., Bauckhage, C.: Interpretable matrix factorization with stochasticity constrained nonnegative DEDICOM (2018)

  18. Wang, Y., Zhu, L.: Research and implementation of SVD in machine learning. In: 2017 IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS), pp. 471–475 (2017)


Acknowledgement

The authors of this work were supported by the Competence Center for Machine Learning Rhine Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038C). We gratefully acknowledge this support.

Author information

Corresponding author

Correspondence to Lars Hillebrand.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1010 KB)

Appendix

Table 3. Overview of our semi-artificial dataset. Each synthetic sample consists of the corresponding Wikipedia articles 1–3. We differentiate between different articles, i.e. articles that have little thematic overlap (for example a person and a city, a fish and an insect or a ball game and a combat sport), and similar articles, i.e. articles with large thematic overlap (for example European countries, tech companies or aquatic animals). We group our dataset into different samples (3 articles that are pairwise different), similar samples (3 articles that are all similar) and mixed samples (2 similar articles, 1 different).
Table 4. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Table 5. Top half lists the top 10 representative words per dimension of the basis matrix A, bottom half lists the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Fig. 5. (a) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by topic assignment. (b) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by original Wikipedia article assignment (words that occur in more than one article are excluded). (c) Colored heatmap of affinity matrix \(\textit{\textbf{R}}\).

Table 6. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Table 7. Top half lists the top 10 representative words per dimension of the basis matrix A, bottom half lists the 5 most similar words based on cosine similarity for the 2 top words from each topic.
Fig. 6. (a) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by topic assignment. (b) 2-dimensional representation of word embeddings \(\textit{\textbf{A}}'\) colored by original Wikipedia article assignment (words that occur in more than one article are excluded). (c) Colored heatmap of affinity matrix \(\textit{\textbf{R}}\).

Table 8. For each evaluated matrix factorization method we display the top 10 words for each topic and the 5 most similar words based on cosine similarity for the 2 top words from each topic.
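As a reading aid for the tables above, the following hypothetical snippet shows how such listings can be derived from a word-embedding matrix A: the vocabulary is ranked by its weight in each column to obtain the top topic words, and rows are ranked by cosine similarity to obtain the nearest neighbours of the two top words per topic. Function and variable names are illustrative assumptions, not taken from the paper's implementation.

    # Sketch only: derive "top words per topic" and "most similar words" listings
    # from a word-embedding matrix A (n_words x k) and a vocabulary list.
    import numpy as np

    def describe_topics(A, vocab, top_words=10, neighbours=5):
        A = np.asarray(A, dtype=float)
        A_unit = A / np.clip(np.linalg.norm(A, axis=1, keepdims=True), 1e-12, None)
        sims = A_unit @ A_unit.T                              # pairwise cosine similarities
        for topic in range(A.shape[1]):
            ranked = np.argsort(-A[:, topic])[:top_words]     # words weighted highest on this topic
            print(f"Topic {topic}: {[vocab[i] for i in ranked]}")
            for i in ranked[:2]:                              # 2 top words per topic, as in the tables
                nn = np.argsort(-sims[i])[1:neighbours + 1]   # skip the word itself
                print(f"  {vocab[i]} -> {[vocab[j] for j in nn]}")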


Copyright information

© 2020 IFIP International Federation for Information Processing

About this paper


Cite this paper

Hillebrand, L., Biesner, D., Bauckhage, C., Sifa, R. (2020). Interpretable Topic Extraction and Word Embedding Learning Using Row-Stochastic DEDICOM. In: Holzinger, A., Kieseberg, P., Tjoa, A., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2020. Lecture Notes in Computer Science, vol. 12279. Springer, Cham. https://doi.org/10.1007/978-3-030-57321-8_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-57321-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57320-1

  • Online ISBN: 978-3-030-57321-8

