Multi-class E-mail Classification with a Semi-Supervised Approach Based on Automatic Feature Selection and Information Retrieval

Fernández, Juan Manuel; Errecalde, Marcelo

doi:10.1007/978-3-031-14599-5_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1634)

Included in the following conference series:

Conference on Cloud Computing, Big Data & Emerging Topics

Abstract

Nowadays, huge volumes of data are generated every day, and their use and interpretation have become fundamental in all fields. This is particularly true of e-mail classification, which, beyond its key role in organizing huge amounts of incoming information, presents several challenging problems. In addition to the well-known difficulties of textual data (such as ambiguity), e-mails are usually characterized by their short length and informal language. These difficulties increase when a relatively large number of highly imbalanced classes must be considered and manual labeling is expensive and must be carried out by specialized personnel. These are the main issues addressed in the present work, where Spanish-language e-mails sent by students of an Argentinian university need to be categorized into 16 different classes.

Our proposal to address this problem consists of a semi-supervised approach based on an automatic feature selection process complemented with an information retrieval strategy. From an initial data set of manually labeled e-mails, the main features are selected for each class using three techniques: logistic regression, TF-IDF, and SS3. Then, the remaining (unlabeled) instances are indexed with a general-purpose search engine (Elasticsearch), and documents of each class are retrieved based on the features selected by each technique.
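
The two steps described above (per-class feature selection, then retrieval of unlabeled e-mails) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a simple class-level TF-IDF-style score stands in for the three selection techniques named in the abstract, the toy e-mails and class names are invented, and the field name `body` and the `size` value in the query are assumptions. The query is only built as a dictionary in Elasticsearch's `match` query format; no search engine is contacted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would do more)."""
    return text.lower().split()


def top_terms_per_class(labeled, k=2):
    """Pick the k highest-scoring terms for each class.

    Score = (term count inside the class) * ln(n_classes / classes containing the term).
    This stand-in score is an assumption, not one of the paper's three techniques.
    """
    class_tf = defaultdict(Counter)
    for text, label in labeled:
        class_tf[label].update(tokenize(text))

    n_classes = len(class_tf)
    class_df = Counter()  # number of classes in which each term appears
    for counts in class_tf.values():
        class_df.update(counts.keys())

    selected = {}
    for label, counts in class_tf.items():
        scores = {
            term: tf * math.log(n_classes / class_df[term])
            for term, tf in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected[label] = ranked[:k]
    return selected


def build_query(terms, field="body", size=50):
    """Build an Elasticsearch `match` query body (field and size are assumed)."""
    return {
        "size": size,
        "query": {"match": {field: {"query": " ".join(terms), "operator": "or"}}},
    }


# Invented toy data: (text, class label), stop words already removed.
labeled = [
    ("consulta certificado alumno regular", "certificados"),
    ("solicito certificado materias aprobadas", "certificados"),
    ("consulta inscripcion examen final", "examenes"),
    ("problema inscripcion examen matematica", "examenes"),
]

features = top_terms_per_class(labeled, k=2)
queries = {label: build_query(terms) for label, terms in features.items()}
```

On this toy data the term `consulta` occurs in both classes and therefore scores zero, so it is never selected. Each resulting dictionary could serve as the request body of a search against an index holding the unlabeled e-mails.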

Our very simple approach shows that classifiers trained with labeled documents plus those retrieved automatically improve performance (by up to 6%) over classifiers trained only with manually labeled instances. These improvements are observed both in traditional learning algorithms such as SVM and in more recent, state-of-the-art transformer-based models (BERT).
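
How the retrieved documents are merged with the manually labeled ones is not detailed in the abstract. The sketch below shows one plausible way to build such an augmented training set. Everything in it is an assumption made for illustration: the tuple layouts, the `min_score` threshold, and the rule that a document retrieved for several classes keeps the class with the highest retrieval score.

```python
def augment_training_set(labeled, retrieved_by_class, min_score=0.0):
    """Combine manually labeled documents with automatically retrieved ones.

    labeled:            list of (doc_id, text, label) tuples
    retrieved_by_class: {label: [(doc_id, text, score), ...]} from the search engine
    Returns a new list of (doc_id, text, label) tuples.
    """
    labeled_ids = {doc_id for doc_id, _, _ in labeled}
    best = {}  # doc_id -> (score, text, label) of the best-scoring class so far

    for label, hits in retrieved_by_class.items():
        for doc_id, text, score in hits:
            # Never override a manual label, and drop weak matches.
            if doc_id in labeled_ids or score < min_score:
                continue
            if doc_id not in best or score > best[doc_id][0]:
                best[doc_id] = (score, text, label)

    pseudo_labeled = [
        (doc_id, text, label)
        for doc_id, (score, text, label) in sorted(best.items())
    ]
    return list(labeled) + pseudo_labeled


# Invented example: m9 is retrieved for two classes, m1 is already labeled.
labeled_docs = [("m1", "consulta certificado alumno regular", "certificados")]
retrieved = {
    "certificados": [("m7", "pedido certificado", 4.2), ("m9", "examen certificado", 1.1)],
    "examenes": [("m9", "examen certificado", 3.5), ("m1", "consulta certificado alumno regular", 2.0)],
}

train = augment_training_set(labeled_docs, retrieved)
train_labels = {doc_id: label for doc_id, _, label in train}
```

The resulting list can then be fed to any classifier in place of the original labeled set.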

Notes

1. Taken from https://www.elastic.co/es/what-is/elasticsearch.
2. For this purpose, the default TF-IDF formula of the TfidfVectorizer class of the sklearn Python library was used.
3. Previously, a grid search was performed over different values of the parameter C of the LogisticRegression class of the sklearn library, varying the weighting schemes of the terms. The best value was obtained with \(C=1\).
4. The hyperparameters varied throughout the experiments are C, gamma, and the kernel used by the algorithm.
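
Note 2 refers to the default weighting of TfidfVectorizer. As documented by scikit-learn, with default settings this is the raw term count multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The sketch below reproduces that formula in plain Python to make the arithmetic explicit. The whitespace tokenizer and the toy documents are simplifying assumptions (the library's default tokenizer differs, for example by dropping single-character tokens).

```python
import math
from collections import Counter


def tfidf_default_formula(docs):
    """Compute TF-IDF weights with smoothed idf and L2-normalized rows.

    Returns (idf, vectors): idf maps term -> idf value, and vectors is a list
    with one {term: weight} dictionary per document.
    """
    tokenized = [doc.lower().split() for doc in docs]  # simplified tokenizer
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, vectors


docs = ["examen final", "examen parcial"]  # invented toy documents
idf, vectors = tfidf_default_formula(docs)
```

Here `examen` appears in every document, so its idf is exactly 1, while the rarer terms `final` and `parcial` receive a higher weight within their documents.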

Author information

Authors and Affiliations

Department of Basic Sciences, National University of Lujan, Lujan, Argentina
Juan Manuel Fernández
LIDIC, National University of San Luis, San Luis, Argentina
Marcelo Errecalde

Corresponding author

Correspondence to Juan Manuel Fernández.

Editor information

Editors and Affiliations

National University of La Plata, La Plata, Argentina
Enzo Rucci
National University of La Plata, La Plata, Argentina
Marcelo Naiouf
National University of La Plata, La Plata, Argentina
Franco Chichizola
National University of La Plata, La Plata, Argentina
Laura De Giusti
National University of La Plata, La Plata, Argentina
Armando De Giusti

About this paper

Cite this paper

Fernández, J.M., Errecalde, M. (2022). Multi-class E-mail Classification with a Semi-Supervised Approach Based on Automatic Feature Selection and Information Retrieval. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_6

DOI: https://doi.org/10.1007/978-3-031-14599-5_6
Published: 05 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14598-8
Online ISBN: 978-3-031-14599-5
eBook Packages: Computer Science, Computer Science (R0)
