A distributed decision support algorithm that preserves personal privacy
Abstract
Assuring confidentiality of personal information and preserving privacy are vital when data are harvested from multiple institutions for business decision-making. We present an algorithm that builds knowledge from statistics computed over subject data that satisfy specified selection criteria at distributed sites. The algorithm maintains complete fidelity of the information structures in the distributed data compared to the centralized equivalent. Heterogeneous data schemas across sites can be accommodated, and a threshold can be set on the global minimum saturation an attribute must reach to participate in building the prediction model. Policies for including and excluding attributes that are not present at every site are introduced. Unification of attributes is introduced to homogenize attribute values globally. Results of experiments using data from the medical, higher education, and social domains demonstrate the value of our algorithm in regulated industries, where shipping raw data outside the parent institution is not practical.
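The idea of building a model from per-site statistics rather than raw records can be illustrated with a small sketch. It is not the algorithm of the paper, whose details the abstract does not give. It assumes a decision-tree-style learner that needs only (attribute value, class label) counts, and it reads "saturation" as the fraction of all subjects with a non-missing value for an attribute. The function names, the sample records, and the attribute names `smoker` and `risk` are hypothetical.

```python
from collections import Counter
from math import log2

def local_counts(rows, attr, label):
    """Runs at one site: count (attribute value, class label) pairs.
    Only these counts leave the site, never the rows themselves."""
    return Counter((r[attr], r[label]) for r in rows if r.get(attr) is not None)

def merge_counts(per_site):
    """Runs at the coordinator: add up the counts received from all sites."""
    total = Counter()
    for counts in per_site:
        total.update(counts)
    return total

def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    class_counts = list(class_counts)
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c)

def information_gain(counts):
    """Information gain of the attribute, computed from counts alone."""
    by_label, by_value = Counter(), {}
    for (value, label), c in counts.items():
        by_label[label] += c
        by_value.setdefault(value, Counter())[label] += c
    n = sum(by_label.values())
    remainder = sum(sum(v.values()) / n * entropy(v.values())
                    for v in by_value.values())
    return entropy(by_label.values()) - remainder

def saturation(counts, total_subjects):
    """Assumed definition: share of all subjects with a non-missing value."""
    return sum(counts.values()) / total_subjects

# Hypothetical records held at two institutions.
site_a = [{"smoker": "yes", "risk": "high"},
          {"smoker": "no", "risk": "low"},
          {"smoker": "yes", "risk": "high"}]
site_b = [{"smoker": "no", "risk": "low"},
          {"smoker": "yes", "risk": "low"},
          {"smoker": None, "risk": "high"}]

merged = merge_counts([local_counts(s, "smoker", "risk") for s in (site_a, site_b)])
centralized = local_counts(site_a + site_b, "smoker", "risk")

distributed_gain = information_gain(merged)
centralized_gain = information_gain(centralized)
global_saturation = saturation(merged, len(site_a) + len(site_b))
```

Because the merged counts are identical to the counts a central copy of the data would produce, any split criterion computed from them matches the centralized value, which is one way to read the fidelity claim. Under the assumed definition, an attribute would be dropped from model building when `global_saturation` falls below the chosen threshold.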
Keywords
Data privacy, Privacy-preserving framework, Distributed decision support systems
Notes
Acknowledgments
This research was supported in part by the National Science Foundation through major research instrumentation grant number CNS-09-58854.