Abstract
Anonymization, a privacy protection technology for data publishing, has been heavily researched over the past twenty years. However, far less research has addressed how to make better use of anonymized data for data mining. In this paper, we focus on training a regression model on anonymized data and using the trained model to predict on original samples. Anonymized training instances are generally represented as hyper-rectangles rather than points, which differs from most machine learning tasks. We propose several hyper-rectangle vectorization methods that are compatible with both anonymized and original data for model training. Anonymization also introduces additional uncertainty. To address this issue, we propose an Uncertainty-based Hyper-Rectangle Pruning method (UHRP) to reduce the disturbance introduced by anonymized data. In this method, we prune each hyper-rectangle according to its global uncertainty, which is calculated from all of its uncertain attributes. Experiments show that, under specific conditions, a linear regressor trained on anonymized data can be expected to perform as well as a model trained on the original data. Experimental results also show that our pruning method can further improve the model's performance.
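The pipeline described in the abstract (vectorize hyper-rectangles, prune by global uncertainty, fit a linear regressor, predict on original points) can be illustrated with a minimal sketch. The abstract does not give the paper's actual vectorization or uncertainty formulas, so everything specific here is an assumption made for illustration: midpoint vectorization, global uncertainty defined as the mean normalized interval width over all attributes, a fixed pruning threshold `max_uncertainty`, and all function names.

```python
import numpy as np

def vectorize_midpoint(lower, upper):
    # Assumed vectorization: represent each hyper-rectangle by its center.
    # An original (non-anonymized) sample has lower == upper, so it maps to itself.
    return (np.asarray(lower, float) + np.asarray(upper, float)) / 2.0

def global_uncertainty(lower, upper, domain_min, domain_max):
    # Assumed uncertainty measure: mean interval width across attributes,
    # each width normalized by the attribute's domain range (0 = exact point).
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    span = np.asarray(domain_max, float) - np.asarray(domain_min, float)
    span = np.where(span > 0, span, 1.0)  # avoid division by zero
    return ((upper - lower) / span).mean(axis=1)

def prune_and_fit(lower, upper, y, domain_min, domain_max, max_uncertainty=0.5):
    # Drop hyper-rectangles whose global uncertainty exceeds the threshold,
    # then fit ordinary least squares on the vectorized survivors.
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    y = np.asarray(y, float)
    keep = global_uncertainty(lower, upper, domain_min, domain_max) <= max_uncertainty
    X = vectorize_midpoint(lower[keep], upper[keep])
    A = np.column_stack([X, np.ones(len(X))])  # append intercept column
    coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
    return coef, keep

def predict(coef, X):
    # Predict on original (point) samples; the last coefficient is the intercept.
    X = np.asarray(X, float)
    return X @ coef[:-1] + coef[-1]
```

Because an original sample is a degenerate hyper-rectangle, the same vectorization applies at training time (anonymized data) and at prediction time (original data), which is the compatibility property the abstract asks for. The threshold value is arbitrary here and would need tuning in practice.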
Acknowledgments
This work is supported by the National Key R&D Program of China (No. 2017YFC0803700), NSFC grant (No. 61532021), the Shanghai Knowledge Service Platform Project (No. ZF1213), and SHEITC.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, K., Liu, W., Cheng, J., Lu, X. (2019). UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11448. Springer, Cham. https://doi.org/10.1007/978-3-030-18590-9_2
DOI: https://doi.org/10.1007/978-3-030-18590-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18589-3
Online ISBN: 978-3-030-18590-9
eBook Packages: Computer Science, Computer Science (R0)