Topic Classification for Short Texts

Neagu, Dan Claudiu; Rus, Andrei Bogdan; Grec, Mihai; Boroianu, Mihai; Silaghi, Gheorghe Cosmin

doi:10.1007/978-3-031-32418-5_12

Part of the book series: Lecture Notes in Information Systems and Organisation (LNISO, volume 63)

Included in the following conference series:

International Conference on Information Systems Development

Abstract

In the context of TV and social media surveillance, automating topic identification for short texts is a key task. This paper builds models suitable for practical use, employing a Top-K multinomial classification methodology. We describe the full data processing pipeline, covering dataset selection, text preprocessing, feature extraction, model selection and training, and hyperparameter optimization. We test and compare popular methods: standard machine learning, deep learning, and a fine-tuned BERT model for topic classification.
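
As an illustration of the Top-K idea mentioned in the abstract, the sketch below shows one common way to score a multiclass classifier: a prediction counts as correct when the true topic appears among the k highest-scoring labels. This is a minimal sketch, not the chapter's implementation. The function names, the topic labels, and the score values are all hypothetical and chosen only for the example.

```python
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], labels: Sequence[str], k: int) -> List[str]:
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in ranked[:k]]


def top_k_accuracy(
    score_rows: Sequence[Sequence[float]],
    labels: Sequence[str],
    true_labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of texts whose true label is among the k best-scored labels."""
    hits = sum(
        truth in top_k_labels(row, labels, k)
        for row, truth in zip(score_rows, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical topics and per-text class scores (assumed, not from the chapter).
LABELS = ["politics", "sports", "tech"]
SCORES = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
TRUTH = ["politics", "sports", "politics", "tech"]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"top-{k} accuracy: {top_k_accuracy(SCORES, LABELS, TRUTH, k):.2f}")
```

Raising k trades precision of the single prediction for a higher chance that the correct topic is among the suggestions, which can be acceptable when a human reviews the shortlist.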

Notes

1. https://jmcauley.ucsd.edu/data/amazon/
2. http://trec.nist.gov/data/tweets
3. https://www.kaggle.com/datasets/rmisra/news-category-dataset
4. HuffPost (https://www.huffpost.com/), known as The Huffington Post until 2017, is an American news aggregator and blog with localized and international editions.
5. https://spacy.io/
6. https://radimrehurek.com/gensim/
7. https://huggingface.co/docs/transformers/main_classes/tokenizer
8. https://huggingface.co/docs/transformers/model_doc/bert
9. https://sklearn-genetic-opt.readthedocs.io/
10. https://github.com/deap/deap

Acknowledgements

This paper was financed by the project “Platformă inovativă pentru măsurarea audienţei TV, identificarea automată a telespectatorilor şi corelarea cu date analitice din platforme de socializare online” (Innovative platform for measuring TV audience, automatic identification of viewers and correlating it with analytic data from social media). The project was cofinanced by “Fondul European de Dezvoltare Regională prin Programul Operaţional Competitivitate (POC) 2014–2020, Axa prioritară: 2-Tehnologia Informaţiei şi Comunicaţiilor (TIC) pentru o economie digitală competitivă” (the European Regional Development Fund (ERDF) through the Competitiveness Operational Program 2014–2020, Priority Axis 2 - Information and Communication Technology (ICT) for a competitive digital economy), project code SMIS 2014+: 128960, beneficiary: CICADA TECHNOLOGIES S.R.L. The project is part of the call POC/524/2/2/ “Sprijinirea creşterii valorii adăugate generate de sectorul TIC şi a inovării în domeniu prin dezvoltarea de clustere” (Supporting the growth of added value generated by the ICT sector and innovation in the field through cluster development). The content of this material does not necessarily represent the official position of the European Union or the Romanian Government.

Author information

Authors and Affiliations

Cicada Technologies, Bd. Nicolae Titulescu 18 apt. 82, 400420, Cluj-Napoca, Romania
Dan Claudiu Neagu, Andrei Bogdan Rus, Mihai Grec & Mihai Boroianu
Babes-Bolyai University, Str. Theodor Mihali 58-60, 400591, Cluj-Napoca, Romania
Dan Claudiu Neagu & Gheorghe Cosmin Silaghi
Technical University, Str. Memorandumului 28, 400114, Cluj-Napoca, Romania
Andrei Bogdan Rus

Corresponding author

Correspondence to Gheorghe Cosmin Silaghi.

Editor information

Editors and Affiliations

Business Informatics Research Center, Babes-Bolyai University, Cluj-Napoca, Romania
Gheorghe Cosmin Silaghi
Business Informatics Research Center, Babes-Bolyai University, Cluj-Napoca, Romania
Robert Andrei Buchmann
Department of Computer Science, Babes-Bolyai University, Cluj-Napoca, Romania
Virginia Niculescu
Department of Computer Science, Babes-Bolyai University, Cluj-Napoca, Romania
Gabriela Czibula
Business Information Systems Discipline, University of Galway, Galway, Ireland
Chris Barry
Business Information Systems Discipline, University of Galway, Galway, Ireland
Michael Lang
Department of Human-Centred Computing, Faculty of Information Technology, Monash University, Clayton, VIC, Australia
Henry Linger
IESE Business School, University of Navarra, Barcelona, Spain
Christoph Schneider

About this chapter

Cite this chapter

Neagu, D.C., Rus, A.B., Grec, M., Boroianu, M., Silaghi, G.C. (2023). Topic Classification for Short Texts. In: Silaghi, G.C., et al. Advances in Information Systems Development. ISD 2022. Lecture Notes in Information Systems and Organisation, vol 63. Springer, Cham. https://doi.org/10.1007/978-3-031-32418-5_12

DOI: https://doi.org/10.1007/978-3-031-32418-5_12
Published: 27 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-32417-8
Online ISBN: 978-3-031-32418-5
eBook Packages: Business and Management, Business and Management (R0)
