
Privacy Preserving Jaccard Similarity by Cloud-Assisted for Classification

Wireless Personal Communications

Abstract

In the current information era, data mining is extensively applied in many fields to discover knowledge from large volumes of data. It is desirable to design a system that lets a data owner and a client exchange sensitive data during the mining process without leaking information. Privacy-preserving mechanisms are thus essential for protecting data against increasingly sophisticated cyber-attacks. In particular, an approach that secures the data while mining it should ensure that neither the data owner nor the user learns anything about the other's data. Hence, we target data privacy by mining on encrypted data. In data mining, similarity is a key quantity: it measures how alike two data objects are, or is expressed as a distance whose dimensions represent features of the objects. In this work, we propose a cloud-assisted privacy-preserving approach for computing the Jaccard similarity in data mining. In particular, our design targets k-nearest neighbors classification, a conventional data mining algorithm. Because the Jaccard similarity can be measured efficiently, we present a solution that employs cloud services to compute the Jaccard similarity between the owner's data and the client's queries in a secure manner. We also implemented the proposed scheme on a workstation and observed that its computational efficiency is well suited to data mining applications.


References

  1. Björkegren, D., & Grissen, D. (2017). Behavior revealed in mobile phone usage predicts loan repayment. CoRR arXiv:1712.05840.

  2. Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles, 44, 223–270.

  3. Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11, 37–50.

  4. Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.

  5. Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1–18.

  6. Güvenir, H. A., Demiröz, G., & Ilter, N. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3), 147–165.

  7. Jiang, L., Wang, D., Cai, Z., & Yan, X. (2007). Survey of improving naive bayes for classification. In Proceedings of the 3rd international conference on advanced data mining and applications, ADMA ’07 (pp. 134–145).

  8. Han, E.-H., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, PAKDD ’01 (pp. 53–65).

  9. Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the second international conference on knowledge discovery and data mining, KDD’96 (pp. 202–207).

  10. Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15, 177–206.

  11. Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 639–644).

  12. Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00 (pp. 439–450).

  13. Du, W., & Zhan, Z. (2003). Using randomized response techniques for privacy-preserving data mining. In Proceedings of the Ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03 (pp. 505–510).

  14. Zhan, J. Z., Matwin, S., & Chang, L. (2007). Privacy preserving multi party decision tree induction. International Journal of Business Intelligence and Data Mining, 2(2), 197–212.

  15. Dwork, C., & Nissim, K. (2004). Privacy-preserving data mining on vertically partitioned databases. In M. Franklin (Ed.), Advances in cryptology—CRYPTO 2004 (pp. 528–544).

  16. Kantarcioglu, M., & Clifton, C. (2004). Privately computing a distributed k-NN classifier. In Proceedings of the 8th European conference on principles of data mining and knowledge discovery databases, PKDD ’04 (pp. 279–290).

  17. Zhan, J. Z., Chang, L., & Matwin, S. (2005). Privacy preserving k-nearest neighbor classification. International Journal of Network Security, 1(1), 46–51.

  18. Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th international conference on the theory and applications of cryptographic techniques, EUROCRYPT '99 (pp. 223–238).

  19. Qi, Y., & Atallah, M. J. (2008). Efficient privacy-preserving k-nearest neighbor search. In 28th international conference on distributed computing systems (pp. 311–319).

  20. Shaneck, M., Kim, Y., & Kumar, V. (2006). Privacy preserving nearest neighbor search. In Sixth IEEE international conference data mining—Workshops (ICDMW’06) (pp. 541–545).

  21. Rong, H., Wang, H., Liu, J., & Xian, M. (2016). Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access, 4, 9589–9603.

  22. Singh, M. D., Krishna, P. R., & Saxena, A. (2009). A privacy preserving Jaccard similarity function for mining encrypted data. In TENCON 2009—2009 IEEE Region 10 conference (pp. 1–4).

  23. Blundo, C., Cristofaro, E. D., & Gasti, P. (2014). Espresso: Efficient privacy-preserving evaluation of sample set similarity. Journal of Computer Security, 22(3), 355–381.

  24. Wong, K., & Kim, M. H. (2013). Privacy-preserving similarity coefficients for binary data. Computers & Mathematics with Applications, 65(9), 1280–1290.

  25. Fan, J., & Vercauteren, F. (2012). Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012, 144.

  26. Chen, H., Laine, K., & Player, R. (2017). Simple Encrypted Arithmetic Library—SEAL (v2.1).

  27. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.

Author information

Corresponding author

Correspondence to Tho Thi Ngoc Le.


Appendix A: Performance of k-NN Algorithm Using Jaccard Similarity

This appendix describes experiments that use Jaccard similarity to measure distances among data samples when classifying with the k-NN algorithm; a code sketch of the data preparation follows the list. The data sets are MNIST, WDBC and DERM:

  1. MNIST [4] is a public data set of handwritten digits from 0 to 9 (10 classes), available at http://yann.lecun.com/exdb/mnist/. The data set is already split into 60,000 training samples and 10,000 test samples. Figure 6 shows examples of digits from the MNIST data set. Each sample is a \(28\times 28\) bitmap image of a handwritten digit. We flatten each bitmap matrix into a 784-dimensional vector and use it as input for classification. In short, each MNIST sample is a binary vector of 784 attributes, with 1s at the positions covered by the digit shape and 0s elsewhere.

  2. WDBC [5] is the Breast Cancer Wisconsin data set (available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29). WDBC includes 699 clinical samples labelled benign or malignant (two classes). We divide the data set into 466 training samples and 233 test samples. Originally, each sample has 9 cell-related attributes taking integer values in [1, 10]. We express each sample as a binary vector by mapping an attribute to 1 if its value is greater than 5, and to 0 otherwise.

  3. DERM [6] is the Dermatology data set (available at https://archive.ics.uci.edu/ml/datasets/Dermatology). DERM includes 366 real samples of six dermatological disorders (six classes); we divide it into 244 training samples and 122 test samples. Each sample has 34 numeric attributes: 32 take integer values in [0, 3], one takes integer values in [0, 1], and one takes values in [0, 75]. We again express each sample as a binary vector: an attribute maps to 1 if its value is greater than the mean of its range (i.e., 2, 0 and 33, respectively), and to 0 otherwise.
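For concreteness, here is a minimal Python/NumPy sketch of the three binarization steps described above. It assumes the raw attribute arrays are already loaded; the function names are illustrative, and the DERM threshold vector assumes the [0, 1] and [0, 75] attributes occupy the last two columns (adjust to the actual column order of the data set).

```python
import numpy as np

def binarize_mnist(images):
    # Flatten 28x28 bitmaps into 784-dimensional vectors; a pixel
    # becomes 1 where it belongs to the digit shape (nonzero), else 0.
    flat = images.reshape(len(images), 28 * 28)
    return (flat > 0).astype(np.uint8)

def binarize_wdbc(samples):
    # 9 integer attributes in [1, 10]; an attribute becomes 1 iff
    # its value is greater than 5.
    return (samples > 5).astype(np.uint8)

def binarize_derm(samples):
    # 34 attributes thresholded at 2, 0 and 33 as in the text.
    # Assumed column order: the 32 attributes in [0, 3] first, then
    # the [0, 1] attribute, then the [0, 75] attribute.
    thresholds = np.array([2] * 32 + [0, 33])
    return (samples > thresholds).astype(np.uint8)
```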

Fig. 6 Examples of handwritten digits in the MNIST data set

The characteristics of these three data sets are summarized in Table 3.

Table 3 Characteristics of the MNIST, WDBC and DERM data sets

The classification process follows Algorithm 1 in Sect. 2.1, with the parameter K set to 5, 10 and 15. The similarity between any two samples is the Jaccard similarity, computed using formula (3). The detailed performance of k-NN on the three data sets is shown in Table 4.

Table 4 Performance of k-NN with \(K\in \{5,10,15\}\) on three datasets using Jaccard similarity

Overall, k-NN performs well on all data sets across these parameter settings: accuracies range from 86.9% to 96.3%.
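Formula (3) is not reproduced in this excerpt; for binary vectors it is presumably the standard Jaccard coefficient \(J(x, y) = |x \wedge y| / |x \vee y|\), the number of attributes set in both vectors divided by the number set in either. Under that assumption, a minimal plaintext sketch of the k-NN classification used in these experiments (names are illustrative):

```python
import numpy as np
from collections import Counter

def jaccard(x, y):
    # Jaccard similarity of two binary vectors: |intersection| / |union|,
    # taken as 0 when both vectors are all-zero.
    union = np.sum(x | y)
    return np.sum(x & y) / union if union else 0.0

def knn_classify(train_X, train_y, query, K):
    # Label the query by majority vote among its K most similar
    # training samples under Jaccard similarity.
    sims = np.array([jaccard(query, x) for x in train_X])
    top_k = np.argsort(sims)[-K:]  # indices of the K largest similarities
    votes = Counter(train_y[i] for i in top_k)
    return votes.most_common(1)[0][0]
```

Running this with \(K \in \{5, 10, 15\}\) over the binarized training/test splits above matches the setup whose accuracies Table 4 reports; ties in the vote are broken arbitrarily.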


Cite this article

Le, T.T.N., Phuong, T.V.X. Privacy Preserving Jaccard Similarity by Cloud-Assisted for Classification. Wireless Pers Commun 112, 1875–1892 (2020). https://doi.org/10.1007/s11277-020-07131-6
