Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 121)

Abstract

Email has become one of the primary means of communication for individuals and organizations, and categorizing emails has emerged as an important research field that supports data segregation, topic modeling, spam detection, and network analysis for investigative and analytical purposes. This paper clusters a corpus of 500,000 emails from the Enron email dataset, which was obtained by the Federal Energy Regulatory Commission during its investigation of Enron’s collapse, based on the relevance of words to the whole corpus. Two unsupervised clustering algorithms, k-means and hierarchical clustering, are implemented, and the cosine similarity of the words in each cluster is computed to assess the semantic similarity within that cluster. The emails were clustered into three groups, and a cohesion score measuring intra-cluster similarity was obtained for each cluster. The proposed method yields both the distribution of scores among the clusters and the intra-cluster similarity. Cluster 1 obtained the highest cohesion score of the three clusters: 0.1655 with the k-means algorithm and 0.2513 with the hierarchical clustering algorithm.
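
As a rough illustration of the kind of cohesion score the abstract describes, the sketch below weights words by their relevance to the corpus with a simple TF-IDF scheme and then averages the pairwise cosine similarity of the emails in each cluster. The abstract does not give the exact formula, so the averaging rule, the TF-IDF variant, the toy emails, and the function names (`tfidf_vectors`, `cosine`, `cohesion_scores`) are all assumptions made for illustration. Cluster labels are supplied by hand here; in practice they would come from k-means or hierarchical clustering.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumed variant, not taken from the paper).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cohesion_scores(vectors, labels):
    """Mean pairwise cosine similarity inside each cluster (assumed definition)."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    scores = {}
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        # Convention chosen here: a single-member cluster has no pairs, score 0.0.
        scores[label] = (
            sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        )
    return scores


# Toy stand-in for the email corpus; labels are hand-assigned for the demo.
emails = [
    "gas pipeline contract price",
    "gas price forecast pipeline",
    "lunch meeting friday",
    "friday lunch plans",
]
labels = [0, 0, 1, 1]
scores = cohesion_scores(tfidf_vectors(emails), labels)
for label in sorted(scores):
    print(f"cluster {label}: cohesion = {scores[label]:.4f}")
```

A higher score means the emails in a cluster share more heavily weighted vocabulary, which is the sense in which the abstract compares Cluster 1 under the two algorithms.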

Author information

Corresponding author

Correspondence to Abhishek Kathuria.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kathuria, A., Mukhopadhyay, D., Thakur, N. (2020). Evaluating Cohesion Score with Email Clustering. In: Singh, P., Pawłowski, W., Tanwar, S., Kumar, N., Rodrigues, J., Obaidat, M. (eds) Proceedings of First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019). Lecture Notes in Networks and Systems, vol 121. Springer, Singapore. https://doi.org/10.1007/978-981-15-3369-3_9
