Privacy-Preserving Data Analytics

  • Reference work entry
Encyclopedia of Big Data Technologies

Abstract

Real-time processing of user data streams in online services creates a tension between users and analysts: users want stronger privacy, while analysts want higher-utility data analytics in real time. To resolve this tension, this paper describes the design, implementation, and evaluation of PrivApprox, a data analytics system for privacy-preserving stream processing. PrivApprox provides three important properties: (i) privacy, a zero-knowledge privacy guarantee for users, which is a tighter privacy bound than state-of-the-art differential privacy; (ii) utility, an interface that lets data analysts systematically explore the trade-offs between output accuracy (with error estimation) and the query execution budget; and (iii) latency, near real-time stream processing based on a scalable "synchronization-free" distributed architecture. The key idea behind PrivApprox is to combine two techniques: sampling (used for approximate computation) and randomized response (used for privacy-preserving analytics). The two techniques are complementary: their combination achieves stronger privacy guarantees and also improves the performance of stream analytics.
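
The combination of sampling and randomized response named in the abstract can be illustrated with a minimal Python sketch. It is not PrivApprox's implementation: it uses the textbook randomized-response scheme for a single yes/no question, and the function names and the parameters `s` (sampling probability), `p` (probability of answering truthfully) and `q` (probability of a forced "yes") are assumptions introduced here for illustration.

```python
import random


def randomized_response(truth: bool, p: float, q: float, rng: random.Random) -> bool:
    """With probability p report the true answer; otherwise report
    'yes' with probability q, independent of the truth."""
    if rng.random() < p:
        return truth
    return rng.random() < q


def client_answers(truths, s: float, p: float, q: float, rng: random.Random):
    """Each client takes part with probability s (sampling) and, if it
    does, randomizes its answer before sending it (randomized response)."""
    return [randomized_response(t, p, q, rng) for t in truths if rng.random() < s]


def estimate_yes_fraction(answers, p: float, q: float) -> float:
    """Invert the randomization. The expected observed 'yes' rate is
    p * f + (1 - p) * q, where f is the true fraction of 'yes' answers."""
    if not answers:
        raise ValueError("no answers received")
    observed = sum(answers) / len(answers)
    return (observed - (1 - p) * q) / p


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical population: 30% of 100,000 clients hold a true 'yes'.
    truths = [i < 30_000 for i in range(100_000)]
    answers = client_answers(truths, s=0.6, p=0.7, q=0.5, rng=rng)
    print(len(answers), estimate_yes_fraction(answers, p=0.7, q=0.5))
```

In this sketch the analyst only ever sees randomized answers from a random subset of clients, yet the aggregate estimate stays close to the true fraction. Lowering `s` or `p` hides more about each individual at the cost of a noisier estimate, which is the kind of privacy/utility trade-off the abstract refers to.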



Author information

Corresponding author

Correspondence to Do Le Quoc.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this entry

Quoc, D.L., Beck, M., Bhatotia, P., Chen, R., Fetzer, C., Strufe, T. (2019). Privacy-Preserving Data Analytics. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_152
