
Picket: guarding against corrupted data in tabular data during learning and inference

Abstract

Data corruption is an impediment to modern machine learning deployments. Corrupted data can severely bias the learned model and can also lead to invalid inferences. We present Picket, a simple framework that safeguards against data corruption during both training and deployment of machine learning models over tabular data. For the training stage, Picket identifies and removes corrupted data points from the training data to avoid learning a biased model. For the deployment stage, Picket flags, in an online manner, corrupted query points that, due to noise, will result in incorrect predictions from a trained machine learning model. To detect corrupted data, Picket uses a self-supervised deep learning model for mixed-type tabular data, which we call PicketNet. To minimize the burden of deployment, learning a PicketNet model does not require any human-labeled data. Picket is designed as a plugin that can increase the robustness of any machine learning pipeline. We evaluate Picket on a diverse array of real-world data under different corruption models, including systematic and adversarial noise, during both training and testing. We show that Picket consistently safeguards against corrupted data during both training and deployment of various models ranging from SVMs to neural networks, outperforming a range of competing methods that span data quality validation systems to robust outlier detection models.
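To make the train-time safeguard concrete, the following is a minimal sketch of the general idea described above: score each row by how "surprising" its values are under a model fit on the data itself, then drop the highest-scoring rows before training. The functions `corruption_scores` and `filter_training_data` are illustrative names, not Picket's API, and the per-column z-score surprise used here is a crude stand-in for PicketNet's learned self-supervised reconstruction loss over mixed-type features.

```python
import numpy as np

def corruption_scores(X):
    """Score each row by summed per-column surprise (|z-score|).

    A simple proxy for a learned self-supervised reconstruction loss:
    rows whose cells deviate strongly from the column statistics get
    high scores and are treated as likely corrupted.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8  # avoid division by zero
    return np.abs((X - mu) / sigma).sum(axis=1)

def filter_training_data(X, keep_fraction=0.95):
    """Drop the most suspicious rows before model training."""
    scores = corruption_scores(X)
    cutoff = np.quantile(scores, keep_fraction)
    return X[scores <= cutoff], scores

# Toy data: 200 clean rows, with the first 5 rows corrupted by a large shift.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 4))
X[:5] += 10.0
X_clean, scores = filter_training_data(X)
```

The same score can serve the deployment stage: instead of filtering a batch, each incoming query point is scored online and flagged when its score exceeds a threshold calibrated on clean data.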


Availability of data and material

Data are open source.


Acknowledgements

This work was supported by the National Science Foundation under Grants 1755676 and 1815538 and DARPA under Grant ASKE HR00111990013. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA or the US Government.

Funding

This work was supported by the National Science Foundation under Grants 1755676 and 1815538 and Defense Advanced Research Projects Agency under Grant ASKE HR00111990013.

Author information

Corresponding author

Correspondence to Zifan Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Code availability

The code and data are available at https://github.com/rekords-uw/Picket.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhechun Zhou: Work done at University of Wisconsin-Madison.

About this article

Cite this article

Liu, Z., Zhou, Z. & Rekatsinas, T. Picket: guarding against corrupted data in tabular data during learning and inference. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00699-w

Keywords

  • Data validation
  • Error detection
  • Robust outlier detection
  • Robust machine learning