
Privacy Preserving Jaccard Similarity by Cloud-Assisted for Classification

Wireless Personal Communications

Abstract

In the current information era, data mining is extensively applied in many fields to discover knowledge from large volumes of data. It is desirable to design a system that lets a data owner and a client exchange sensitive data during the mining process without leaking information. Privacy-preserving mechanisms are thus essential for protecting data against increasingly sophisticated cyber-attacks. In particular, an approach that secures the data while mining it should ensure that neither the data owner nor the user learns anything about the other's data. Hence, we target data privacy by mining on encrypted data. In data mining, similarity is a key quantity: it measures how alike two data objects are, or is expressed as a distance whose dimensions represent features of the objects. In this work, we propose a cloud-assisted privacy-preserving approach for computing the Jaccard similarity in data mining. In particular, our design targets k-nearest neighbors classification, a conventional data mining algorithm. Because the Jaccard similarity can be measured efficiently, we present a solution that employs cloud services to compute the Jaccard similarity between the owner's data and the client's queries in a secure manner. We also implemented the proposed scheme on a workstation and observed that its computational efficiency is well suited to data mining applications.


References

  1. Björkegren, D., & Grissen, D. (2017). Behavior revealed in mobile phone usage predicts loan repayment. CoRR arXiv:1712.05840.

  2. Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles, 44, 223–270.

  3. Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11, 37–50.

  4. Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

  5. Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1–18.

  6. Güvenir, H. A., Demiröz, G., & Ilter, N. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3), 147–165.

  7. Jiang, L., Wang, D., Cai, Z., & Yan, X. (2007). Survey of improving naive bayes for classification. In Proceedings of the 3rd international conference on advanced data mining and applications, ADMA ’07 (pp. 134–145).

  8. Han, E.-H., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, PAKDD ’01 (pp. 53–65).

  9. Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the second international conference on knowledge discovery and data mining, KDD’96 (pp. 202–207).

  10. Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15, 177–206.

  11. Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 639–644).

  12. Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00 (pp. 439–450).

  13. Du, W., & Zhan, Z. (2003). Using randomized response techniques for privacy-preserving data mining. In Proceedings of the Ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03 (pp. 505–510).

  14. Zhan, J. Z., Matwin, S., & Chang, L. (2007). Privacy preserving multi party decision tree induction. International Journal of Business Intelligence and Data Mining, 2(2), 197–212.

  15. Dwork, C., & Nissim, K. (2004). Privacy-preserving data mining on vertically partitioned databases. In M. Franklin (Ed.), Advances in cryptology—CRYPTO 2004 (pp. 528–544).

  16. Kantarcioglu, M., & Clifton, C. (2004). Privately computing a distributed k-NN classifier. In Proceedings of the 8th European conference on principles of data mining and knowledge discovery databases, PKDD ’04 (pp. 279–290).

  17. Zhan, J. Z., Chang, L., & Matwin, S. (2005). Privacy preserving k-nearest neighbor classification. International Journal of Network Security, 1(1), 46–51.

  18. Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th international conference on the theory and applications of cryptographic techniques, EUROCRYPT '99 (pp. 223–238).

  19. Qi, Y., & Atallah, M. J. (2008). Efficient privacy-preserving k-nearest neighbor search. In 28th international conference on distributed computing systems (pp. 311–319).

  20. Shaneck, M., Kim, Y., & Kumar, V. (2006). Privacy preserving nearest neighbor search. In Sixth IEEE international conference data mining—Workshops (ICDMW’06) (pp. 541–545).

  21. Rong, H., Wang, H., Liu, J., & Xian, M. (2016). Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access, 4, 9589–9603.

  22. Singh, M. D., Krishna, P. R., & Saxena, A. (2009). A privacy preserving Jaccard similarity function for mining encrypted data. In TENCON 2009—2009 IEEE Region 10 conference (pp. 1–4).

  23. Blundo, C., Cristofaro, E. D., & Gasti, P. (2014). Espresso: Efficient privacy-preserving evaluation of sample set similarity. Journal of Computer Security, 22(3), 355–381.

  24. Wong, K., & Kim, M. H. (2013). Privacy-preserving similarity coefficients for binary data. Computers & Mathematics with Applications, 65(9), 1280–1290.

  25. Fan, J., & Vercauteren, F. (2012). Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012, 144.

  26. Chen, H., Laine, K., & Player, R. (2017). Simple Encrypted Arithmetic Library—SEAL (v2.1).

  27. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Author information

Corresponding author

Correspondence to Tho Thi Ngoc Le.


Appendix A: Performance of k-NN Algorithm Using Jaccard Similarity

This appendix describes experiments that use Jaccard similarity to measure distances among data samples when classifying with the k-NN algorithm; a code sketch of the data preparation follows the list. The data sets are MNIST, WDBC and DERM:

  1. MNIST [4] is a public data set of handwritten digits from 0 to 9 (10 classes), available at http://yann.lecun.com/exdb/mnist/. The data set is already split into 60,000 training samples and 10,000 test samples. Figure 6 shows examples of digits from the MNIST data set. Each sample is a \(28\times 28\) bitmap image of a handwritten digit. We flatten each bitmap matrix into a 784-dimensional vector and use it as input for classification. In short, each MNIST sample is a binary vector of 784 attributes, with 1s at the positions covered by the digit shape and 0s elsewhere.

  2. WDBC [5] is the Breast Cancer Wisconsin data set (available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29). WDBC includes 699 clinical samples labelled benign or malignant (two classes). We divide the data set into 466 training samples and 233 test samples. Originally, each sample has 9 cell-related attributes taking integer values in [1, 10]. We express each sample as a binary vector by mapping an attribute to 1 if its value is greater than 5, and to 0 otherwise.

  3. DERM [6] is the Dermatology data set (available at https://archive.ics.uci.edu/ml/datasets/Dermatology). DERM includes 366 real samples of six dermatological disorders (six classes); we divide it into 244 training samples and 122 test samples. Each sample has 34 numeric attributes: 32 take integer values in [0, 3], one takes integer values in [0, 1], and one takes values in [0, 75]. We again express each sample as a binary vector: an attribute maps to 1 if its value is greater than the mean of its range (i.e., 2, 0 and 33, respectively), and to 0 otherwise.
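For concreteness, here is a minimal Python/NumPy sketch of the three binarization steps described above. It assumes the raw attribute arrays are already loaded; the function names are illustrative, and the DERM threshold vector assumes the [0, 1] and [0, 75] attributes occupy the last two columns (adjust to the actual column order of the data set).

```python
import numpy as np

def binarize_mnist(images):
    # Flatten 28x28 bitmaps into 784-dimensional vectors; a pixel
    # becomes 1 where it belongs to the digit shape (nonzero), else 0.
    flat = images.reshape(len(images), 28 * 28)
    return (flat > 0).astype(np.uint8)

def binarize_wdbc(samples):
    # 9 integer attributes in [1, 10]; an attribute becomes 1 iff
    # its value is greater than 5.
    return (samples > 5).astype(np.uint8)

def binarize_derm(samples):
    # 34 attributes thresholded at 2, 0 and 33 as in the text.
    # Assumed column order: the 32 attributes in [0, 3] first, then
    # the [0, 1] attribute, then the [0, 75] attribute.
    thresholds = np.array([2] * 32 + [0, 33])
    return (samples > thresholds).astype(np.uint8)
```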

Fig. 6 Examples of handwritten digits in the MNIST data set

The characteristics of these three data sets are summarized in Table 3.

Table 3 Characteristics of the MNIST, WDBC and DERM data sets

The classification process follows Algorithm 1 in Sect. 2.1, with the parameter K set to 5, 10 and 15. The similarity between any two samples is the Jaccard similarity, computed using formula (3). The detailed performance of k-NN on the three data sets is shown in Table 4.

Table 4 Performance of k-NN with \(K\in \{5,10,15\}\) on three datasets using Jaccard similarity

Overall, k-NN performs well on all data sets across these parameter settings: accuracies range from 86.9% to 96.3%.
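Formula (3) is not reproduced in this excerpt; for binary vectors it is presumably the standard Jaccard coefficient \(J(x, y) = |x \wedge y| / |x \vee y|\), the number of attributes set in both vectors divided by the number set in either. Under that assumption, a minimal plaintext sketch of the k-NN classification used in these experiments (names are illustrative):

```python
import numpy as np
from collections import Counter

def jaccard(x, y):
    # Jaccard similarity of two binary vectors: |intersection| / |union|,
    # taken as 0 when both vectors are all-zero.
    union = np.sum(x | y)
    return np.sum(x & y) / union if union else 0.0

def knn_classify(train_X, train_y, query, K):
    # Label the query by majority vote among its K most similar
    # training samples under Jaccard similarity.
    sims = np.array([jaccard(query, x) for x in train_X])
    top_k = np.argsort(sims)[-K:]  # indices of the K largest similarities
    votes = Counter(train_y[i] for i in top_k)
    return votes.most_common(1)[0][0]
```

Running this with \(K \in \{5, 10, 15\}\) over the binarized training/test splits above matches the setup whose accuracies Table 4 reports; ties in the vote are broken arbitrarily.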


Cite this article

Le, T.T.N., Phuong, T.V.X. Privacy Preserving Jaccard Similarity by Cloud-Assisted for Classification. Wireless Pers Commun 112, 1875–1892 (2020). https://doi.org/10.1007/s11277-020-07131-6
