# Outsourcing analyses on privacy-protected multivariate categorical data stored in untrusted clouds

- 26 Downloads

## Abstract

Outsourcing data storage and computation to the cloud is appealing due to the cost savings it entails. However, when the data to be outsourced contain private information, appropriate protection mechanisms should be implemented by the data controller. Data splitting, which consists of fragmenting the data and storing them in separate clouds for the sake of privacy preservation, is an interesting alternative to encryption in terms of flexibility and efficiency. However, multivariate analyses on data split among various clouds are challenging, and they are even harder when data are nominal categorical (i.e., textual, non-ordinal), because the standard arithmetic operators cannot be used. In this article, we tackle the problem of outsourcing multivariate analyses on nominal data split over several honest-but-curious clouds. Specifically, we propose several secure protocols to outsource to multiple clouds the computation of a variety of multivariate analyses on nominal categorical data (frequency-based and semantic-based). Our protocols have been designed to outsource as much workload as possible to the clouds, in order to retain the cost-saving benefits of cloud computing while ensuring that the outsourced stay split and hence privacy-protected versus the clouds. The experiments we report on the Amazon cloud service show that by using our protocols the controller can save nearly all the runtime because it can integrate partial results received from the clouds with very little computation.

## Keywords

Cloud computing Data privacy Data splitting Nominal data## Notes

### Acknowledgements

Partial support to this work has been received from the European Commission (projects H2020-700540 “CANVAS” and H2020-644024 “CLARUS”), from the Government of Catalonia (ICREA Acadèmia Prize to J. Domingo-Ferrer and grant 2017 SGR 705), and from the Spanish Government (projects RTI2018-095094-B-C21 “CONSENT” and TIN2016-80250-R “Sec-MCloud”). The authors are with the UNESCO Chair in Data Privacy, but the views in this paper are the authors’ own and are not necessarily shared by UNESCO.

## Supplementary material

## References

- 1.Aggarwal G, Bawa M, Ganesan P, Garcia-Molina H, Kenthapadi K, Motwani R, Srivastava U, Thomas D, Xu Y (2005) Two can keep a secret: a distributed architecture for secure database services. CIDR 2005:186–199Google Scholar
- 2.Agresti A, Kateri M (2011) Categorical data analysis. Springer, BerlinzbMATHGoogle Scholar
- 3.Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
- 4.Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M (2010) A view of cloud computing. Commun ACM 53(4):50–58CrossRefGoogle Scholar
- 5.Atallah MJ, Frikken KB (2010) Securely outsourcing linear algebra computations. In: 5th ACM symposium on information, computer and communications security—ASIACCS 2010, ACM, pp 48–59Google Scholar
- 6.Batet M, Harispe S, Ranwez S, Sánchez D, Ranwez V (2014) An information theoretic approach to improve semantic similarity assessments across multiple ontologies. Inf Sci 283:197–2010CrossRefGoogle Scholar
- 7.Batet M, Sánchez D (2015) A review on semantic similarity. In: Encyclopedia of information science and technology, 3rd edn. IGI Global, pp 7575–7583Google Scholar
- 8.California patient discharge data: California Office of Statewide Health Planning and Development (OSHPD), 2009. http://www.oshpd.ca.gov/HID/DataFlow/index.html
- 9.Calviño A, Ricci S, Domingo-Ferrer J (2015) Privacy-preserving distributed statistical computation to a semi-honest multi-cloud. In: IEEE conference on communications and network security (CNS 2015), IEEE, pp 506–514Google Scholar
- 10.Cimiano P (2006) Ontology learning and population from text: algorithms, evaluation and applications. Springer, BerlinGoogle Scholar
- 11.Ciriani V, De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P (2011) Selective data outsourcing for enforcing privacy. J Comput Secur 19(3):531–566CrossRefGoogle Scholar
- 12.CLARUS—a Framework for user centred privacy and security in the cloud, H2020 project (2015–2017). http://www.clarussecure.eu
- 13.Clifton C, Kantarcioglu M, Vaidya J, Lin X, Zhu M (2002) Tools for privacy preserving distributed data mining. ACM SiGKDD Explor Newsl 4(2):28–34CrossRefGoogle Scholar
- 14.Domingo-Ferrer J, Ricci S, Domingo-Enrich C (2018) Outsourcing scalar products and matrix products on privacy-protected unencrypted data stored in untrusted clouds. Inf Sci 436–437:320–342MathSciNetCrossRefGoogle Scholar
- 15.Domingo-Ferrer J, Sánchez D, Rufian-Torrell G (2013) Anonymization of nominal data based on semantic marginality. Inf Sci 242:35–48CrossRefGoogle Scholar
- 16.Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212MathSciNetCrossRefGoogle Scholar
- 17.Du W, Han Y, Chen S (2004) Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM, vol 4. SIAM, pp 222–233Google Scholar
- 18.Dubovitskaya A, Urovi V, Vasirani M, Aberer K, Schumacher M (2015) A cloud-based eHealth architecture for privacy preserving data integration. In: ICT systems security and privacy protection, Springer, pp 585–598Google Scholar
- 19.Fu Z, Sun X, Ji S, Xie G (2016) Towards efficient content-aware search over encrypted outsourced data in cloud. In: Computer communications, IEEE INFOCOM 2016-the 35th annual IEEE international conference, IEEE, pp 1–9Google Scholar
- 20.General data protection regulation. European Union. http://www.gdpr-info.eu
- 21.Ghattas B, Michel P, Boyer L (2017) Clustering nominal data using unsupervised binary decision trees: comparisons with the state of the art methods. Pattern Recognit 67:177–85CrossRefGoogle Scholar
- 22.Gelman A (2005) Analysis of variance—why it is more important than ever. Ann Stat 33(1):1–53MathSciNetzbMATHCrossRefGoogle Scholar
- 23.Goethals B, Laur S , Lipmaa H, Mielikäinen T (2005) On private scalar product computation for privacy-preserving data mining. In: Information security and cryptology—ICISC 2004, LNCS, vol 3506, Springer, pp 104–120Google Scholar
- 24.Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte Nordholt E, Spicer K, De Wolf P-P (2006) Statistical disclosure control. Wiley, HobokenGoogle Scholar
- 25.Karr A, Lin X, Sanil A, Reiter J (2009) Privacy-preserving analysis of vertically partitioned data using secure matrix products. J Off Stat 25(1):125–138Google Scholar
- 26.Lei X, Liao X, Huang T, Li H, Hu C (2013) Outsourcing large matrix inversion computation to a public cloud. IEEE Trans Cloud Comput 1(1):78–87Google Scholar
- 27.Lei X, Liao X, Huang T, Heriniaina F (2014) Achieving security, robust cheating resistance, and high-efficiency for outsourcing large matrix multiplication computation to a malicious cloud. Inf Sci 280:205–217CrossRefGoogle Scholar
- 28.Li H, Yang Y, Luan TH, Liang X, Zhou L, Shen XS (2016) Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data. IEEE Trans Dependable Secur Comput 13(3):312–25CrossRefGoogle Scholar
- 29.Li L, Lu R, Choo KK, Datta A, Shao J (2016) Privacy-preserving-outsourced association rule mining on vertically partitioned databases. IEEE Trans Inf Forensics Secur 11(8):1847–61CrossRefGoogle Scholar
- 30.Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, ICML 1998, pp 296–304Google Scholar
- 31.Nassar M, Erradi A, Sabry F, Malluhi Q M (2014) Secure outsourcing of matrix operations as a service. In: IEEE CLOUD 2013, IEEE, pp 918–925Google Scholar
- 32.Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Advances in cryptology—EUROCRYPT ’99, LNCS, vol 1592, Springer, pp 223–238Google Scholar
- 33.Rada R, Mili H, Bichnell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybern 9:17–30CrossRefGoogle Scholar
- 34.Ren K, Wang C, Wang Q (2012) Security challenges for the public cloud. IEEE Internet Comput 16(1):69–73MathSciNetCrossRefGoogle Scholar
- 35.Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, IJCAI, vol 1, pp 448–453Google Scholar
- 36.Ricci S, Domingo-Ferrer J, Sánchez D (2016) Privacy-preserving cloud-based statistical analyses on sensitive categorical data. In: Modeling decisions for artificial intelligence, Springer, pp 227–238Google Scholar
- 37.Rodríguez-García M, Batet M, Sánchez D (2017) A semantic framework for noise addition with nominal data. Knowl Based Syst 112:103–118CrossRefGoogle Scholar
- 38.Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027CrossRefGoogle Scholar
- 39.Sánchez D, Batet M (2017) Privacy-preserving data outsourcing in the cloud via semantic data splitting. Comput Commun 110:187–201CrossRefGoogle Scholar
- 40.Sánchez D, Batet M, Isern D, Valls A (2012) Ontology-based semantic similarity: a new feature-based approach. Expert Syst Appl 39(9):7718–7728CrossRefGoogle Scholar
- 41.Sánchez D, Batet M, Isern D (2011) Ontology-based information content computation. Knowl Based Syst 24(2):297–303CrossRefGoogle Scholar
- 42.Sánchez D, Batet M, Martínez S, Domingo-Ferrer J (2015) Semantic variance: an intuitive measure for ontology accuracy evaluation. Eng Appl Artif Intell 39:89–99CrossRefGoogle Scholar
- 43.SNOMED-CT Ontology. https://en.wikipedia.org/wiki/SNOMED_CT
- 44.Sun Y, Yu Y, Li X, Zhang K, Qian H, Zhou Y (2016) Batch verifiable computation with public verifiability for outsourcing polynomials and matrix computations. In: Australasian conference on information security and privacy—ACISP 2016, Lecture Notes in Computer Science, vol 9722, Springer, pp 293–309Google Scholar
- 45.Székely GJ, Rizzo ML (2009) Brownian distance covariance. Ann Appl Stat 3(4):1236–1265MathSciNetzbMATHCrossRefGoogle Scholar
- 46.Taha A, Hadi AS (2016) Pair-wise association measures for categorical and mixed data. Inf Sci 346:73–89CrossRefGoogle Scholar
- 47.Tugrul B, Polat H (2014) Privacy-preserving kriging interpolation on partitioned data. Knowl Based Syst 62:38–46zbMATHCrossRefGoogle Scholar
- 48.U.S. Federal Trade Commission: Data Brokers, A Call for Transparency and Accountability (2014)Google Scholar
- 49.Wang I-C, Shen C-H, Hsu T-S, Liao C-C, Wang DW, Zhan J (2009) Towards empirical aspects of secure scalar product. IEEE Trans Syst Man Cybern Part C 39(4):440–447CrossRefGoogle Scholar
- 50.Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the annual meeting of the association for computational linguistics, pp 133–139Google Scholar
- 51.Xia Z, Wang X, Sun X, Wangm Q (2016) A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE Trans Parallel Distrib Syst 27(2):340–52CrossRefGoogle Scholar
- 52.Yang JJ, Li JQ, Niu Y (2015) A hybrid solution for privacy preserving medical data sharing in the cloud environment. Future Gener Comput Syst 43:74–86CrossRefGoogle Scholar
- 53.Zhang X, Boscardin WJ, Belin TR, Wan X, He Y, Zhang K (2015) A Bayesian method for analyzing combinations of continuous, ordinal, and nominal categorical data with missing values. J Multivar Anal 135:43–58MathSciNetzbMATHCrossRefGoogle Scholar