Multi-class E-mail Classification with a Semi-Supervised Approach Based on Automatic Feature Selection and Information Retrieval

  • Conference paper
Cloud Computing, Big Data & Emerging Topics (JCC-BD&ET 2022)

Abstract

Nowadays, massive volumes of data are generated every day, and their use and interpretation have become fundamental in all fields. This is particularly true in the area of e-mail classification, which, beyond its key role in organizing huge amounts of incoming information, presents several challenging aspects. In addition to the well-known problems of textual data (such as ambiguity), e-mails are usually characterized by their short length and informal language. These difficulties increase when a relatively large number of highly imbalanced classes must be considered and manual labeling is expensive and must be carried out by specialized personnel. Those are the main issues addressed in the present work, where Spanish-language e-mails sent by students of an Argentinian university need to be categorized into 16 different classes.

Our proposal to address this problem consists of a semi-supervised approach based on an automatic feature selection process complemented with an information retrieval strategy. From an initial data set of manually labeled e-mails, the main features are selected for each class using three techniques: logistic regression, TF-IDF, and SS3. Then, the remaining (unlabeled) instances are indexed with a general-purpose search engine (Elasticsearch), and documents of each class are retrieved based on the features selected by each technique.
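As a rough illustration of the feature selection and retrieval steps, the sketch below ranks the terms of each class by summed TF-IDF weight and turns the top terms into an Elasticsearch `match` query body. It is a minimal sketch under stated assumptions, not the authors' implementation: the toy e-mails, the class labels, the helper names `top_terms_per_class` and `build_match_query`, the cutoff `k`, and the index field name `body` are all hypothetical. The weighting follows the smoothed IDF and L2 normalization that sklearn's TfidfVectorizer applies by default, with whitespace tokenization as a simplification.

```python
import math
from collections import Counter, defaultdict

# Invented toy data; the paper's corpus and its 16 classes are not reproduced here.
docs = [
    "consulta beca beca ayuda",
    "solicitud beca economica",
    "fecha examen final",
    "inscripcion examen examen",
]
labels = ["becas", "becas", "examenes", "examenes"]


def top_terms_per_class(docs, labels, k=2):
    """Return the k terms with the highest summed TF-IDF weight for each class."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    scores = defaultdict(Counter)
    for tokens, label in zip(tokenized, labels):
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        # L2-normalize each document vector before accumulating per class.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            scores[label][t] += w / norm
    # Sort by descending score, breaking ties alphabetically for determinism.
    return {
        label: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, c in scores.items()
    }


def build_match_query(terms, field="body"):
    """Build an Elasticsearch query body matching any of the selected terms."""
    return {"query": {"match": {field: {"query": " ".join(terms)}}}}


selected = top_terms_per_class(docs, labels, k=2)
queries = {label: build_match_query(terms) for label, terms in selected.items()}
```

Each resulting dictionary could be sent as the body of a search request against the index of unlabeled e-mails, and the hits treated as candidate training documents for the corresponding class. How many hits to keep per class is a design choice that this sketch leaves open.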

Our very simple approach shows that classifiers trained with labeled documents plus those retrieved automatically improve performance (by up to 6%) over classifiers trained only with manually labeled instances. These improvements are observed both in traditional learning algorithms such as SVM and in more recent, state-of-the-art transformer-based models (BERT).

Notes

  1. Taken from https://www.elastic.co/es/what-is/elasticsearch.

  2. For this purpose, the default TF-IDF formula of the TfidfVectorizer class of the sklearn Python library was used.

  3. Previously, a grid search was performed over different values of the parameter C of the LogisticRegression class of the sklearn library, varying the weighting schemes of the terms. The best value was obtained with \(C=1\).

  4. The hyperparameters varied throughout the experiments are C, gamma, and the kernels used by the algorithm.
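
The hyperparameter search mentioned in the notes can be sketched with sklearn's GridSearchCV. This is only an illustration under assumptions: it takes the algorithm in the last note to be the SVM named in the abstract, and the toy corpus, the candidate values in `param_grid`, the two-fold cross-validation, and the default accuracy scoring are all invented for the example rather than taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented toy corpus with two classes; not the paper's data.
texts = [
    "consulta sobre beca de ayuda economica",
    "solicitud de beca para el cuatrimestre",
    "renovacion de beca estudiantil",
    "requisitos para pedir la beca",
    "fecha del examen final de algebra",
    "inscripcion al examen de febrero",
    "nota del examen parcial",
    "cambio de mesa de examen",
]
labels = ["becas"] * 4 + ["examenes"] * 4

# TF-IDF features (sklearn defaults) followed by an SVM classifier.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Hypothetical candidate values for C, gamma, and the kernel.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1],
    "svm__kernel": ["linear", "rbf"],
}

# Exhaustive search over the grid with stratified 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
best_params = search.best_params_
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` the pipeline refit on all the data. The same pattern would apply to the C parameter of LogisticRegression by swapping the final pipeline step.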

Author information

Corresponding author

Correspondence to Juan Manuel Fernández.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Fernández, J.M., Errecalde, M. (2022). Multi-class E-mail Classification with a Semi-Supervised Approach Based on Automatic Feature Selection and Information Retrieval. In: Rucci, E., Naiouf, M., Chichizola, F., De Giusti, L., De Giusti, A. (eds) Cloud Computing, Big Data & Emerging Topics. JCC-BD&ET 2022. Communications in Computer and Information Science, vol 1634. Springer, Cham. https://doi.org/10.1007/978-3-031-14599-5_6

  • DOI: https://doi.org/10.1007/978-3-031-14599-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14598-8

  • Online ISBN: 978-3-031-14599-5

  • eBook Packages: Computer Science, Computer Science (R0)
