Computing Covariance and Correlation in Optimally Privacy-Protected Statistical Databases: Feasible Algorithms
In many real-life situations, e.g., in medicine, it is necessary to process data while preserving the patients’ confidentiality. One of the most efficient methods of preserving privacy is to replace the exact values with intervals that contain these values. For example, instead of an exact age, a privacy-protected database only contains the information that the age is, e.g., between 10 and 20, or between 20 and 30, etc. Based on this data, it is important to compute correlation and covariance between different quantities. For privacy-protected data, different values from the intervals lead, in general, to different estimates for the desired statistical characteristic. Our objective is then to compute the range of possible values of these estimates.
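To make the notion of a "range of possible values" concrete, here is a minimal brute-force sketch; it is not the efficient algorithm this paper develops. It assumes the population covariance (division by n), which the abstract does not specify, and the names `covariance` and `covariance_range` are hypothetical. It relies on the fact that covariance is linear in each individual value when all others are held fixed, so its extremes over a box of intervals are attained at interval endpoints. The enumeration takes exponential time and is only usable for very small samples.

```python
from itertools import product


def covariance(xs, ys):
    """Population covariance (divides by n; an assumption of this sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covariance_range(x_intervals, y_intervals):
    """Return (min, max) of the covariance over all values inside the intervals.

    Each argument is a list of (lower, upper) pairs, one pair per record.
    All endpoint combinations are enumerated, so the cost grows as 4**n.
    """
    lo, hi = float("inf"), float("-inf")
    for xs in product(*x_intervals):
        for ys in product(*y_intervals):
            c = covariance(xs, ys)
            lo = min(lo, c)
            hi = max(hi, c)
    return lo, hi


# Two records: x is known only up to an interval, y is known exactly.
x_intervals = [(0, 1), (2, 3)]
y_intervals = [(0, 0), (1, 1)]
print(covariance_range(x_intervals, y_intervals))
```

The same endpoint argument does not carry over directly to correlation, which is not linear in each value, so this sketch covers covariance only.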
Algorithms for efficiently computing such ranges have been developed for situations in which the intervals come from the original surveys, e.g., when a person indicates whether his or her age is between 10 and 20, between 20 and 30, etc. These intervals, however, do not always lead to optimal privacy protection; it turns out that a more complex, computer-generated “intervalization” can lead to better privacy under the same accuracy or, alternatively, to more accurate estimates of statistical characteristics under the same privacy constraints. In this paper, we extend the existing efficient algorithms for computing covariance and correlation based on privacy-protected data to this more general case of interval data.
Keywords: privacy protection, statistical database, computing covariance, computing correlation, interval uncertainty