Abstract
Automatically detecting topics in a set of documents is one of the most challenging and useful tasks in Natural Language Processing. Word2Vec has proven to be an effective tool for building distributed representations of words (word embeddings), which are usually applied to capture a word's linguistic context. This paper proposes using a Self-Organizing Map (SOM) to cluster the word vectors generated by Word2Vec in order to find topics in the texts. After the SOM is run, a k-means algorithm separates the neurons of the SOM output grid into k clusters, such that the words mapped to each centroid represent the topics of that cluster. The approach was tested on a benchmark text dataset with 19,997 texts and 20 groups. The results showed that the method is capable of finding the expected groups, sometimes merging groups that deal with similar topics.
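The pipeline described in the abstract (word vectors, then a SOM, then k-means over the SOM neurons) can be illustrated with a minimal sketch. This is not the authors' implementation: the word vectors below are random toy data standing in for Word2Vec output, the SOM and k-means are small NumPy versions written for this example, and all names (`train_som`, `kmeans`, `words_by_topic`), the grid size, and the hyperparameters are assumptions.

```python
import numpy as np

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns weights of shape (rows*cols, dim)."""
    rng = np.random.default_rng(seed)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # Grid coordinates of each neuron, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise neuron weights from randomly chosen input vectors.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighbourhood radius
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best matching unit
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood on the grid
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1), centroids

def words_by_topic(words, vectors, weights, neuron_labels, k):
    """Map each word to its best matching neuron, then to that neuron's cluster."""
    topics = {j: [] for j in range(k)}
    for word, vec in zip(words, vectors):
        bmu = int(np.argmin(np.linalg.norm(weights - vec, axis=1)))
        topics[int(neuron_labels[bmu])].append(word)
    return topics

# Toy stand-in for Word2Vec output: two groups of 5-dimensional vectors (assumption).
rng = np.random.default_rng(42)
words = [f"sport_{i}" for i in range(20)] + [f"space_{i}" for i in range(20)]
vectors = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(10.0, 0.1, (20, 5))])

k = 2
weights = train_som(vectors, rows=3, cols=3)    # 3x3 output grid (assumed size)
neuron_labels, _ = kmeans(weights, k)           # cluster the neurons, not the words
topics = words_by_topic(words, vectors, weights, neuron_labels, k)
for topic_id, members in topics.items():
    print(topic_id, members[:5])
```

In a real setting the `vectors` array would come from a trained Word2Vec model and the grid would be much larger. Clustering the neuron weights rather than the raw word vectors keeps the k-means step small, since the number of neurons is typically far below the vocabulary size.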
Acknowledgments
The authors thank CAPES, CNPq, Fapesp, Mackpesquisa and Intel for the financial support.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
de Miranda, G.R., Pasti, R., de Castro, L.N. (2020). Detecting Topics in Documents by Clustering Word Vectors. In: Herrera, F., Matsui, K., Rodríguez-González, S. (eds) Distributed Computing and Artificial Intelligence, 16th International Conference. DCAI 2019. Advances in Intelligent Systems and Computing, vol 1003. Springer, Cham. https://doi.org/10.1007/978-3-030-23887-2_27
DOI: https://doi.org/10.1007/978-3-030-23887-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23886-5
Online ISBN: 978-3-030-23887-2
eBook Packages: Intelligent Technologies and Robotics (R0)