
Finding High-Value Training Data Subset Through Differentiable Convex Programming

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques for computing the “value” of individual training data points have been proposed for explaining trained models. However, the value of a training data point also depends on the other data points selected alongside it, a dependency not explicitly captured by existing methods. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is a learnable framework for online subset selection that is trained using mini-batches of training data, which makes our method scalable. This yields a parameterised convex subset-selection problem amenable to differentiable convex programming, allowing us to learn the parameters of the selection model end-to-end. Using this framework, we design an online alternating-minimization algorithm that jointly learns the parameters of the selection model and of the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm consistently finds higher-value subsets of training data than recent state-of-the-art methods, sometimes by about 20%. The selected subsets are also useful for identifying mislabelled training data, and our algorithm's running time is comparable to that of existing valuation methods.
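To make the abstract's approach concrete, the following is a minimal, hedged sketch of the kind of differentiable convex selection layer it describes, written with the cvxpylayers library for differentiable convex optimization layers. The budget k, the quadratic regulariser gamma, the 784-dimensional inputs, and the names score_net, select_layer, weighted_batch_loss, and train_step are all illustrative assumptions, not the authors' exact formulation (their implementation, HOST-CP, is linked in the notes below).

```python
# A minimal sketch (not the authors' exact method): a convex, relaxed
# subset-selection problem embedded as a differentiable layer, so gradients
# flow from the training loss back into the selection model's parameters.
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

b, k, gamma = 32, 8, 0.1  # mini-batch size, subset budget, regulariser (assumed)

# Relaxed indicator w in [0,1]^b with sum(w) = k, maximising learned scores s.
# The quadratic term keeps the problem strongly convex, so the solution map
# s -> w*(s) is well defined and differentiable.
s = cp.Parameter(b)
w = cp.Variable(b)
problem = cp.Problem(
    cp.Maximize(s @ w - gamma * cp.sum_squares(w)),
    [w >= 0, w <= 1, cp.sum(w) == k],
)
select_layer = CvxpyLayer(problem, parameters=[s], variables=[w])

# Illustrative selection model: scores each example's "value" in the batch.
score_net = torch.nn.Sequential(
    torch.nn.Linear(784, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

def weighted_batch_loss(model, xb, yb):
    """Mini-batch loss reweighted by the selected (relaxed) subset."""
    scores = score_net(xb).squeeze(-1)
    (weights,) = select_layer(scores)  # differentiable w.r.t. scores
    per_example = torch.nn.functional.cross_entropy(
        model(xb), yb, reduction="none")
    return (weights * per_example).sum() / k

def train_step(model, opt_model, opt_select, xb, yb):
    """One online alternating-minimisation step (illustrative): update the
    ML model with the selection held fixed, then update the selection model
    through the convex layer."""
    with torch.no_grad():  # (i) freeze selection weights for the model step
        (w_fixed,) = select_layer(score_net(xb).squeeze(-1))
    loss_m = (w_fixed * torch.nn.functional.cross_entropy(
        model(xb), yb, reduction="none")).sum() / k
    opt_model.zero_grad(); loss_m.backward(); opt_model.step()

    loss_s = weighted_batch_loss(model, xb, yb)  # (ii) selection-model step
    opt_select.zero_grad(); loss_s.backward(); opt_select.step()
```

Because the relaxed problem is convex for any value of the score parameter s, the selection step can be solved exactly per mini-batch, and the argmin can be differentiated through, which is what allows the selection model to be trained end-to-end alongside the ML model.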


Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
  2. https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
  3. https://github.com/SoumiDas/HOST-CP


Acknowledgements

This project is funded by Hewlett Packard Labs, Hewlett Packard Enterprise.

Author information

Corresponding author: Soumi Das.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Das, S., Singh, A., Chatterjee, S., Bhattacharya, S., Bhattacharya, S. (2021). Finding High-Value Training Data Subset Through Differentiable Convex Programming. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_41


  • DOI: https://doi.org/10.1007/978-3-030-86520-7_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86519-1

  • Online ISBN: 978-3-030-86520-7

