Knowledge and Information Systems, Volume 20, Issue 2, pp 157–185

A distributed approach to enabling privacy-preserving model-based classifier training

  • Hangzai Luo
  • Jianping Fan
  • Xiaodong Lin
  • Aoying Zhou
  • Elisa Bertino
Regular Paper

Abstract

This paper proposes a novel approach to privacy-preserving, distributed, model-based classifier training. Our approach is an important step towards supporting customizable privacy modeling and protection. It consists of three major steps. First, each data site independently learns a weak concept model (i.e., a local classifier) for a given data pattern or concept from its own training samples; an adaptive EM algorithm is proposed to select the model structure and estimate the model parameters simultaneously. Second, a combined classifier is trained by integrating the weak concept models shared by multiple data sites. To reduce data transmission costs and potential privacy breaches, only the weak concept models are sent to the central site, where synthetic samples are generated directly from them. Both the shared weak concept models and the synthetic samples are then used to learn a reliable and complete global concept model. A computational approach is developed to automatically achieve a good trade-off among the privacy disclosure risk, the sharing benefit, and the data utility. Third, the combined classifier is validated by distributing the global concept model to all data sites in the collaboration network while limiting potential privacy breaches. Our approach has been validated through extensive experiments on four UCI machine learning data sets and two image data sets.
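
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes one-dimensional features, replaces the adaptive EM mixture model with a single Gaussian per class, trains the global model from synthetic samples only, and omits both the privacy/utility trade-off computation and the validation step. All function names and the toy data are hypothetical.

```python
import math
import random
import statistics

def fit_local_model(samples):
    """Fit one 1-D Gaussian per class (a stand-in for the paper's adaptive EM)."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    # Each class is summarised as (mean, std, sample count); raw samples are dropped.
    return {
        label: (statistics.fmean(xs), statistics.pstdev(xs) or 1e-6, len(xs))
        for label, xs in by_label.items()
    }

def synthesize(model, rng):
    """Draw synthetic labelled samples from a shared local model."""
    return [
        (rng.gauss(mean, std), label)
        for label, (mean, std, count) in model.items()
        for _ in range(count)
    ]

def train_global(shared_models, rng):
    """Central site: sees only model parameters, never the sites' raw samples."""
    synthetic = [s for m in shared_models for s in synthesize(m, rng)]
    return fit_local_model(synthetic)

def classify(model, x):
    """Pick the class with the highest (unnormalised) Gaussian log-posterior."""
    def log_score(params):
        mean, std, count = params
        return math.log(count) - math.log(std) - 0.5 * ((x - mean) / std) ** 2
    return max(model, key=lambda label: log_score(model[label]))

# Toy setup (assumed): two sites, each holding private samples of two classes.
rng = random.Random(0)
site_a = [(rng.gauss(0.0, 1.0), "neg") for _ in range(200)] + \
         [(rng.gauss(5.0, 1.0), "pos") for _ in range(200)]
site_b = [(rng.gauss(0.5, 1.0), "neg") for _ in range(200)] + \
         [(rng.gauss(5.5, 1.0), "pos") for _ in range(200)]

# Step 1: local training. Step 2: share models only, then train centrally.
shared = [fit_local_model(site_a), fit_local_model(site_b)]
global_model = train_global(shared, rng)

print(classify(global_model, -1.0), classify(global_model, 6.0))
```

In this sketch only three numbers per class leave each site, which is the sense in which sharing models rather than samples limits what the central site can observe. How much those parameters themselves disclose is the question the paper's trade-off computation addresses, and the sketch does not model it.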

Keywords

  • Privacy-preserving classifier training
  • Synthetic samples
  • Adaptive EM algorithm

Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Hangzai Luo (1)
  • Jianping Fan (2)
  • Xiaodong Lin (3)
  • Aoying Zhou (1)
  • Elisa Bertino (4)

  1. Shanghai Key Lab of Trustworthy Computing, East China Normal University, Shanghai, China
  2. Department of Computer Science, University of North Carolina, Charlotte, USA
  3. Department of Mathematical Sciences, University of Cincinnati, Cincinnati, USA
  4. Department of Computer Science, Purdue University, West Lafayette, USA
