Detecting Topics in Documents by Clustering Word Vectors

de Miranda, Guilherme Raiol; Pasti, Rodrigo; de Castro, Leandro Nunes

doi:10.1007/978-3-030-23887-2_27

Guilherme Raiol de Miranda, Rodrigo Pasti & Leandro Nunes de Castro

Conference paper
First Online: 22 June 2019

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1003)

Abstract

The automatic detection of topics in a set of documents is one of the most challenging and useful tasks in Natural Language Processing. Word2Vec has proven to be an effective tool for the distributed representation of words (word embeddings) usually applied to find their linguistic context. This paper proposes the use of a Self-Organizing Map (SOM) to cluster the word vectors generated by Word2Vec so as to find topics in the texts. After running SOM, a k-means algorithm is applied to separate the SOM output grid neurons into k clusters, such that the words mapped into each centroid represent the topics of that cluster. Our approach was tested on a benchmark text dataset with 19,997 texts and 20 groups. The results showed that the method is capable of finding the expected groups, sometimes merging some of them that deal with similar topics.
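
The pipeline described in the abstract (word vectors, then a SOM, then k-means over the SOM neurons, then reading off the words mapped to each cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy 2-D vectors stand in for Word2Vec output, and the function names, grid size, learning-rate schedule, and neighbourhood schedule are all hypothetical choices made only to keep the example small and dependency-free.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, points):
    """Index of the point in `points` closest to `vec`."""
    return min(range(len(points)), key=lambda i: sq_dist(vec, points[i]))


def train_som(vectors, rows, cols, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small rectangular SOM; returns one weight vector per neuron."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: neurons start at randomly chosen input vectors.
    weights = [list(rng.choice(vectors)) for _ in grid]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                 # linearly decaying learning rate
        sigma = sigma0 * (1 - frac) + 1e-3    # shrinking neighbourhood radius
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass
            bmu = nearest(vec, weights)       # best-matching unit
            for j, pos in enumerate(grid):
                # Gaussian neighbourhood measured on the grid, not in data space.
                h = math.exp(-sq_dist(pos, grid[bmu]) / (2 * sigma ** 2))
                weights[j] = [w + lr * h * (x - w)
                              for w, x in zip(weights[j], vec)]
    return weights


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster label for each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return [nearest(p, centroids) for p in points]


def detect_topics(word_vectors, rows=3, cols=3, k=2):
    """Group words by the k-means cluster of their best-matching SOM neuron."""
    words = list(word_vectors)
    vectors = [word_vectors[w] for w in words]
    weights = train_som(vectors, rows, cols)
    neuron_cluster = kmeans(weights, k)       # cluster the neurons, not the words
    topics = {c: [] for c in range(k)}
    for word, vec in zip(words, vectors):
        topics[neuron_cluster[nearest(vec, weights)]].append(word)
    return topics


# Hypothetical toy embeddings; real Word2Vec vectors have hundreds of dimensions.
toy_vectors = {
    "hockey": [0.0, 0.1], "baseball": [0.1, 0.0], "team": [0.05, 0.15],
    "orbit": [5.0, 5.1], "nasa": [5.1, 5.0], "launch": [4.9, 5.05],
}

topics = detect_topics(toy_vectors, rows=3, cols=3, k=2)
for cluster_id, cluster_words in topics.items():
    print(cluster_id, cluster_words)
```

Note that k-means runs on the neuron weight vectors rather than on the word vectors themselves, which mirrors the two-stage design in the abstract: the SOM first compresses the vocabulary onto a small grid, and the clustering step then groups grid neurons. In practice the toy dictionary would be replaced by embeddings trained on the corpus, and the grid size and k would need tuning.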

Acknowledgments

The authors thank CAPES, CNPq, Fapesp, Mackpesquisa and Intel for the financial support.

Author information

Authors and Affiliations

Natural Computing Laboratory, Mackenzie Presbyterian University, São Paulo, Brazil
Guilherme Raiol de Miranda & Leandro Nunes de Castro
AxonData Tecnologia Analítica, São Paulo, Brazil
Guilherme Raiol de Miranda, Rodrigo Pasti & Leandro Nunes de Castro

Corresponding author

Correspondence to Guilherme Raiol de Miranda.

Editor information

Editors and Affiliations

Department of Computer Science and Artificial Intelligence, University of Granada, ETS de Ingenierias Informática y de Telecomunicación, Granada, Spain
Francisco Herrera
Department of Engineering, Osaka Institute of Technology, Osaka, Japan
Kenji Matsui
Department of Computer Science, University of Salamanca, Salamanca, Spain
Sara Rodríguez-González

About this paper

Cite this paper

de Miranda, G.R., Pasti, R., de Castro, L.N. (2020). Detecting Topics in Documents by Clustering Word Vectors. In: Herrera, F., Matsui, K., Rodríguez-González, S. (eds) Distributed Computing and Artificial Intelligence, 16th International Conference. DCAI 2019. Advances in Intelligent Systems and Computing, vol 1003. Springer, Cham. https://doi.org/10.1007/978-3-030-23887-2_27

DOI: https://doi.org/10.1007/978-3-030-23887-2_27
Published: 22 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23886-5
Online ISBN: 978-3-030-23887-2
eBook Packages: Intelligent Technologies and Robotics (R0)