Abstract
Email has become one of the primary means of communication for individuals and organizations, and categorizing email has emerged as an important research field that supports data segregation, topic modeling, spam detection, and network analysis for investigative and analytical purposes. This paper clusters 500,000 emails from the Enron email dataset, which was obtained by the Federal Energy Regulatory Commission during its investigation of Enron’s collapse, based on the relevance of words to the whole corpus. Two unsupervised clustering algorithms, k-means and hierarchical clustering, are implemented, and the cosine similarity of the words in each cluster is evaluated to assess the semantic similarity within that cluster. The proposed method computes a cohesion score for each cluster, which measures intra-cluster similarity, as well as the score distribution among the clusters. The emails were clustered into three groups. Cluster 1 obtained the highest cohesion score of the three clusters: 0.1655 with the k-means algorithm and 0.2513 with the hierarchical clustering algorithm.
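The abstract does not give the exact formula for the cohesion score, so the following is only a minimal sketch of one plausible reading: each email is turned into a TF-IDF vector, and a cluster's cohesion is taken to be the mean pairwise cosine similarity of its members. The function names, the toy emails, and the hand-assigned labels are all hypothetical; in practice the labels would come from k-means or hierarchical clustering.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cohesion_scores(vectors, labels):
    """Assumed definition: mean pairwise cosine similarity inside each cluster."""
    clusters = {}
    for vector, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vector)
    scores = {}
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        # A single-member cluster has no pairs; report 1.0 by convention (assumption).
        scores[label] = (
            sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0
        )
    return scores


# Toy data, not taken from the Enron corpus.
emails = [
    "gas pipeline capacity report",
    "pipeline capacity report for gas",
    "lunch meeting on friday",
    "friday lunch plans",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments

scores = cohesion_scores(tfidf_vectors(emails), labels)
for label, score in sorted(scores.items()):
    print(f"cluster {label}: cohesion = {score:.4f}")
```

With this definition a score near 1 means the emails in a cluster share most of their weighted vocabulary, and a score near 0 means they share almost none. Pairwise comparison grows quadratically with cluster size, so a corpus as large as the one described would likely need sampling or a centroid-based approximation.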
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kathuria, A., Mukhopadhyay, D., Thakur, N. (2020). Evaluating Cohesion Score with Email Clustering. In: Singh, P., Pawłowski, W., Tanwar, S., Kumar, N., Rodrigues, J., Obaidat, M. (eds) Proceedings of First International Conference on Computing, Communications, and Cyber-Security (IC4S 2019). Lecture Notes in Networks and Systems, vol 121. Springer, Singapore. https://doi.org/10.1007/978-981-15-3369-3_9
DOI: https://doi.org/10.1007/978-981-15-3369-3_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3368-6
Online ISBN: 978-981-15-3369-3
eBook Packages: Intelligent Technologies and Robotics (R0)