Abstract
Nowadays, vast amounts of data are generated every day, and their use and interpretation have become fundamental in all fields. This is particularly true of e-mail classification, which, beyond its key role in organizing huge amounts of incoming information, poses several challenges. In addition to the well-known problems of textual data (such as ambiguity), e-mails are usually characterized by their short length and informal language. These difficulties increase when a relatively large number of highly imbalanced classes must be considered and manual labeling is expensive and must be carried out by specialized personnel. These are the main issues addressed in the present work, where Spanish-language e-mails sent by students of an Argentinian university need to be categorized into 16 different classes.
Our proposal to address this problem consists of a semi-supervised approach based on an automatic feature selection process complemented with an information retrieval strategy. From an initial data set of manually labeled e-mails, the main features are selected for each class using three techniques: logistic regression, TF-IDF, and SS3. Then, the remaining (unlabeled) instances are indexed with a general-purpose search engine (Elasticsearch), and documents of each class are retrieved based on the features selected by each technique.
Our very simple approach shows that classifiers trained with labeled documents plus those retrieved automatically improve performance (by up to 6%) over classifiers trained only with manually labeled instances. These improvements are observed both in traditional learning algorithms such as SVM and in more recent, state-of-the-art transformer-based models (BERT).
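The pipeline outlined in the abstract (select characteristic terms per class from the labeled e-mails, then use them to pull additional training examples from the unlabeled pool) can be illustrated with a minimal, standard-library sketch. Everything here is an assumption made for illustration: the toy English messages and class names are hypothetical, the term score is a simplified class-level TF-IDF rather than any of the three techniques as configured in the paper, and a plain in-memory term match stands in for Elasticsearch. Assigning each unlabeled document only to its single best-matching class is also a simplification introduced here.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def select_features(labeled, k=2):
    """Return the k highest-scoring terms per class.

    Score = term count in the class * log(n_classes / classes containing term),
    a simplified class-level TF-IDF used only for illustration.
    """
    class_tf = defaultdict(Counter)
    for text, label in labeled:
        class_tf[label].update(tokenize(text))

    class_freq = Counter()
    for counts in class_tf.values():
        class_freq.update(counts.keys())

    n_classes = len(class_tf)
    features = {}
    for label, counts in class_tf.items():
        scores = {t: c * math.log(n_classes / class_freq[t])
                  for t, c in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features[label] = ranked[:k]
    return features


def expand_training_set(labeled, unlabeled, features):
    """Add unlabeled documents that match exactly one best class."""
    expanded = list(labeled)
    for text in unlabeled:
        tokens = set(tokenize(text))
        hits = {label: sum(t in tokens for t in terms)
                for label, terms in features.items()}
        best = max(hits.values())
        winners = [label for label, h in hits.items() if h == best]
        if best > 0 and len(winners) == 1:
            expanded.append((text, winners[0]))
    return expanded


# Hypothetical toy data; the real corpus is Spanish and has 16 classes.
labeled = [
    ("exam date for algebra final", "exams"),
    ("when is the exam retake", "exams"),
    ("scholarship application deadline", "scholarships"),
    ("scholarship payment delayed", "scholarships"),
]
unlabeled = [
    "is the algebra exam postponed",
    "my scholarship was not paid",
    "library opening hours",
]

features = select_features(labeled, k=2)
expanded = expand_training_set(labeled, unlabeled, features)
print(features)
print(len(expanded), "training examples after expansion")
```

A classifier would then be trained on `expanded` instead of `labeled`. In a real setting the matching step would be a search-engine query per class, and the number of retrieved documents per class would need tuning to limit label noise.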
Notes
1. Taken from https://www.elastic.co/es/what-is/elasticsearch.
2. For this purpose, the default TF-IDF formula of the TfidfVectorizer class of the sklearn Python library was used.
3. Previously, a grid search was performed alternating different values of the parameter C of the Logistic Regression class of the sklearn library, varying the weighting schemes of the terms. The best value was obtained with \(C=1\).
4. The hyperparameters alternated throughout the experiments are C, Gamma, and the kernels used by the algorithm.
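For readers unfamiliar with the default weighting mentioned in the note on TfidfVectorizer, the sketch below reproduces it by hand using only the standard library. It assumes the library defaults (raw term counts, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector). The whitespace tokenizer is a simplification and does not match the library's own token pattern, and the example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute TF-IDF vectors with smoothed IDF and L2 normalization."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical example documents.
docs = ["exam date", "exam room change"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In this toy example "exam" occurs in both documents and therefore receives a lower weight than "date", which occurs in only one.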
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fernández, J.M., Errecalde, M. (2022). Multi-class E-mail Classification with a Semi-Supervised Approach Based on Automatic Feature Selection and Information Retrieval. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_6
DOI: https://doi.org/10.1007/978-3-031-14599-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14598-8
Online ISBN: 978-3-031-14599-5
eBook Packages: Computer Science, Computer Science (R0)