Evaluating Cohesion Score with Email Clustering

Kathuria, Abhishek; Mukhopadhyay, Devarshi; Thakur, Narina

doi:10.1007/978-981-15-3369-3_9

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 121)

Abstract

Email has become one of the primary means of communication for individuals and organizations, and email categorization has emerged as an important research field that supports data segregation, topic modeling, spam detection, and network analysis for investigative and analytical purposes. This paper clusters 500,000 emails from the Enron email dataset, which was obtained by the Federal Energy Regulatory Commission during its investigation of Enron’s collapse, based on the relevance of each word to the whole corpus. The proposed algorithm calculates a cohesion score for each cluster from its intra-cluster similarity. Two unsupervised clustering algorithms, k-means and hierarchical clustering, are implemented, and the cosine similarity of the words in each cluster is evaluated to assess the semantic similarity within that cluster. The emails were clustered into three groups, and a cohesion score measuring intra-cluster similarity was obtained for each group. The proposed method yields the distribution of scores across the clusters as well as the intra-cluster similarity. Cluster 1 obtained the highest cohesion score of the three clusters: 0.1655 with the k-means algorithm and 0.2513 with the hierarchical clustering algorithm.
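As a rough illustration of the cohesion idea described in the abstract, the sketch below scores each cluster by the mean pairwise cosine similarity of its members. The abstract does not give the exact formula, so this definition, the hand-rolled TF-IDF weighting, the function names, and the four toy emails are all assumptions made for illustration; the cluster labels are supplied by hand here, whereas in the paper they would come from k-means or hierarchical clustering.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): terms in every document keep a nonzero weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cohesion_scores(vectors, labels):
    """Mean pairwise cosine similarity inside each cluster (assumed definition)."""
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    scores = {}
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        # A single-member cluster has no pairs; it is reported as 0.0 here.
        scores[label] = sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    return scores


# Hypothetical toy corpus; labels stand in for the output of a clustering step.
emails = [
    "gas pipeline contract price",
    "pipeline gas price forecast",
    "lunch meeting friday",
    "friday lunch plans",
]
labels = [0, 0, 1, 1]
vectors = tfidf_vectors(emails)
scores = cohesion_scores(vectors, labels)
print(scores)
```

A higher score means the emails in a cluster share more of their heavily weighted vocabulary. On a corpus the size of Enron, the all-pairs loop would be too slow, and a vectorized or sampled computation would be needed instead.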

Author information

Authors and Affiliations

Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Abhishek Kathuria, Devarshi Mukhopadhyay & Narina Thakur

Corresponding author

Correspondence to Abhishek Kathuria.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Solan, Himachal Pradesh, India
Pradeep Kumar Singh
Faculty of Mathematics, Physics, and Informatics, University of Gdańsk, Gdańsk, Poland
Wiesław Pawłowski
Department of Computer Engineering, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India
Sudeep Tanwar
Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
Neeraj Kumar
The National Institute of Telecommunications (Inatel), Santa Rita do Sapucaí, Brazil
Joel J. P. C. Rodrigues
King Abdullah II School of Information Technology, University of Jordan, Amman, Jordan
Mohammad Salameh Obaidat

About this paper

Cite this paper

Kathuria, A., Mukhopadhyay, D., Thakur, N. (2020). Evaluating Cohesion Score with Email Clustering. In: Singh, P., Pawłowski, W., Tanwar, S., Kumar, N., Rodrigues, J., Obaidat, M. (eds) Proceedings of First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019). Lecture Notes in Networks and Systems, vol 121. Springer, Singapore. https://doi.org/10.1007/978-981-15-3369-3_9

DOI: https://doi.org/10.1007/978-981-15-3369-3_9
Published: 28 April 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3368-6
Online ISBN: 978-981-15-3369-3
eBook Packages: Intelligent Technologies and Robotics (R0)
