UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11448)

Abstract

Anonymization, as a privacy protection technique for data publishing, has been studied extensively over the past twenty years. However, far less research has addressed how to make better use of anonymized data for data mining. In this paper, we focus on training a regression model on anonymized data and using the trained model to predict on original samples. Anonymized training instances are generally treated as hyper-rectangles, which distinguishes this setting from most machine learning tasks. We propose several hyper-rectangle vectorization methods that are compatible with both anonymized and original data for model training. Because anonymization introduces additional uncertainty, we further propose an Uncertainty-based Hyper-Rectangle Pruning method (UHRP) to reduce the disturbance introduced by anonymized data: a hyper-rectangle is pruned according to its global uncertainty, computed from all of its uncertain attributes. Experiments show that a linear regressor trained on anonymized data can be expected to perform as well as a model trained on the original data under specific conditions. The results also show that our pruning method further improves the model's performance.
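
As a rough illustration of the pipeline described in the abstract, the following Python sketch trains a linear regressor on interval-valued (hyper-rectangle) training instances. It assumes midpoint vectorization and a width-based global uncertainty score; these are illustrative choices and are not guaranteed to match the paper's exact vectorization methods or uncertainty measure (the authors' implementation is linked in the Notes below).

    # Minimal sketch, not the authors' implementation. Hyper-rectangles are
    # arrays of per-attribute [low, high] intervals; the vectorization and
    # uncertainty score below are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def vectorize_midpoint(rects):
        # Map each hyper-rectangle (n_samples, n_features, 2) to the midpoint
        # of every attribute interval: an (n_samples, n_features) matrix.
        return rects.mean(axis=2)

    def global_uncertainty(rects):
        # Hypothetical global uncertainty: sum over attributes of the interval
        # width, normalized by that attribute's overall value range.
        widths = rects[:, :, 1] - rects[:, :, 0]
        ranges = rects[:, :, 1].max(axis=0) - rects[:, :, 0].min(axis=0)
        ranges[ranges == 0] = 1.0  # guard against constant attributes
        return (widths / ranges).sum(axis=1)

    def uhrp_fit(rects, y, prune_ratio=0.1):
        # Drop the prune_ratio fraction of most uncertain hyper-rectangles,
        # then fit a linear regressor on midpoint vectors of the remainder.
        u = global_uncertainty(rects)
        keep = u <= np.quantile(u, 1.0 - prune_ratio)
        return LinearRegression().fit(vectorize_midpoint(rects[keep]), y[keep])

    # Toy usage: 200 anonymized samples with 3 generalized (interval) attributes.
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(200, 3))
    half = rng.uniform(0.0, 0.5, size=(200, 3))
    rects = np.stack([centers - half, centers + half], axis=2)
    y = centers @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    model = uhrp_fit(rects, y, prune_ratio=0.2)
    print(model.predict(centers[:5]))  # predict on original (point-valued) samples

Because original test samples are ordinary points, midpoint vectorization keeps anonymized training inputs and original test inputs in the same feature space, which is what allows a single linear model to serve both.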


Notes

  1. http://archive.ics.uci.edu/ml/datasets/Auto+MPG.

  2. http://archive.ics.uci.edu/ml/datasets/Air+Quality.

  3. https://github.com/build2last/UHRP.


Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2017YFC0803700), NSFC grants (No. 61532021), the Shanghai Knowledge Service Platform Project (No. ZF1213), and SHEITC.

Author information

Corresponding author

Correspondence to Xingjian Lu.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, K., Liu, W., Cheng, J., Lu, X. (2019). UHRP: Uncertainty-Based Pruning Method for Anonymized Data Linear Regression. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol. 11448. Springer, Cham. https://doi.org/10.1007/978-3-030-18590-9_2


  • DOI: https://doi.org/10.1007/978-3-030-18590-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18589-3

  • Online ISBN: 978-3-030-18590-9

  • eBook Packages: Computer Science (R0)
