UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression

Liu, Kun; Liu, Wenyan; Cheng, Junhong; Lu, Xingjian

doi:10.1007/978-3-030-18590-9_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11448)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Anonymization, a privacy protection technology for data publishing, has been heavily researched over the past twenty years. However, less research has addressed how to make better use of anonymized data for data mining. In this paper, we focus on training a regression model on anonymized data and using the trained model to predict on original samples. Anonymized training instances are generally treated as hyper-rectangles, unlike the point-valued instances used in most machine learning tasks. We propose several hyper-rectangle vectorization methods that are compatible with both anonymized and original data for model training. Anonymization also introduces additional uncertainty. To address this issue, we propose an Uncertainty-based Hyper-Rectangle Pruning method (UHRP) to reduce the disturbance introduced by anonymized data. In this method, we prune each hyper-rectangle according to its global uncertainty, which is calculated from all of its uncertain attributes. Experiments show that, under specific conditions, a linear regressor trained on anonymized data can be expected to perform as well as a model trained on the original data. Experimental results also show that our pruning method can further improve the model’s performance.
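
The pipeline summarized in the abstract (vectorize hyper-rectangles, prune the most uncertain ones, then fit a linear regressor) can be illustrated with a minimal Python sketch. The abstract does not specify the vectorization schemes or the uncertainty formula, so everything concrete here is an assumption made for illustration: midpoint vectorization, a global uncertainty defined as the mean domain-normalized interval width, a `keep_ratio` pruning threshold, and all function names. It should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]   # (low, high) of one generalized attribute
HyperRect = List[Interval]       # one anonymized training record


def midpoint_vector(rect: HyperRect) -> List[float]:
    """Vectorize a hyper-rectangle by its center (assumed scheme)."""
    return [(lo + hi) / 2.0 for lo, hi in rect]


def global_uncertainty(rect: HyperRect, domain: HyperRect) -> float:
    """Mean interval width over all attributes, each normalized by the
    width of the attribute's domain (assumed uncertainty measure)."""
    ratios = []
    for (lo, hi), (dlo, dhi) in zip(rect, domain):
        span = dhi - dlo
        ratios.append((hi - lo) / span if span > 0 else 0.0)
    return sum(ratios) / len(ratios)


def prune(rects: List[HyperRect], targets: List[float],
          domain: HyperRect, keep_ratio: float = 0.8):
    """Keep the keep_ratio fraction of records with the lowest uncertainty."""
    order = sorted(range(len(rects)),
                   key=lambda i: global_uncertainty(rects[i], domain))
    n_keep = max(1, int(round(keep_ratio * len(rects))))
    kept = sorted(order[:n_keep])  # restore original record order
    return [rects[i] for i in kept], [targets[i] for i in kept]


def fit_ols(X: List[List[float]], y: List[float]) -> List[float]:
    """Ordinary least squares with intercept, solved through the normal
    equations with Gauss-Jordan elimination. Returns [b0, b1, ...]."""
    A = [[1.0] + list(row) for row in X]
    d = len(A[0])
    # Augmented system [A^T A | A^T y]
    M = [[sum(a[i] * a[j] for a in A) for j in range(d)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        if abs(M[col][col]) < 1e-12:
            raise ValueError("singular system: features are collinear")
        for r in range(d):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]


def predict(coef: Sequence[float], x: Sequence[float]) -> float:
    """Apply the fitted model to an original (point-valued) sample."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


# Toy data: two attributes with domains [0, 100] and [0, 10].
domain = [(0.0, 100.0), (0.0, 10.0)]
rects = [
    [(10.0, 20.0), (1.0, 2.0)],
    [(30.0, 40.0), (4.0, 5.0)],
    [(50.0, 60.0), (2.0, 3.0)],
    [(70.0, 80.0), (8.0, 9.0)],
    [(0.0, 100.0), (0.0, 10.0)],  # fully generalized record, very uncertain
]
# Targets follow y = 1 + 2*x1 + 3*x2 at the midpoints, except the last one.
targets = [35.5, 84.5, 118.5, 176.5, 0.0]

kept_rects, kept_targets = prune(rects, targets, domain, keep_ratio=0.8)
X = [midpoint_vector(r) for r in kept_rects]
coef = fit_ols(X, kept_targets)
```

In this toy setup the fully generalized record is the one removed by pruning, and the remaining four midpoints lie on the generating plane, so `coef` should come out close to `[1, 2, 3]`. Because original samples are ordinary points, `predict(coef, [20.0, 2.0])` needs no special handling, which is the sense in which a vectorization is compatible with both anonymized and original data.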

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2017YFC0803700), NSFC grants (No. 61532021), the Shanghai Knowledge Service Platform Project (No. ZF1213), and SHEITC.

Author information

Authors and Affiliations

School of Computer Science and Software Engineering, East China Normal University, Shanghai, China
Kun Liu, Wenyan Liu & Junhong Cheng
School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
Xingjian Lu

Corresponding author

Correspondence to Xingjian Lu.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Guoliang Li
Duke University, Durham, NC, USA
Jun Yang
University of Porto, Porto, Portugal
Joao Gama
Chiang Mai University, Chiang Mai, Thailand
Juggapong Natwichai
Beihang University, Beijing, China
Yongxin Tong

About this paper

Cite this paper

Liu, K., Liu, W., Cheng, J., Lu, X. (2019). UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11448. Springer, Cham. https://doi.org/10.1007/978-3-030-18590-9_2

DOI: https://doi.org/10.1007/978-3-030-18590-9_2
Published: 24 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18589-3
Online ISBN: 978-3-030-18590-9
eBook Packages: Computer Science, Computer Science (R0)
