Skip to main content
Log in

Sample size determination for biomedical big data with limited labels

  • Original Article
  • Published:
Network Modeling Analysis in Health Informatics and Bioinformatics Aims and scope Submit manuscript

Abstract

The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important for scenarios of limited labeled data, as annotation can be expensive and time consuming. If the required sample size for accurate model performance can be determined early, then resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, all with millions of instances each and<2% class ratio. The methods used involve approximating a learning curve for a large amount of data using a small amount of data. We evaluate an existing method that fits an inverse power law model to a small learning curve and introduce a novel semi-supervised method that utilizes the large amount of unlabeled data for estimating a learning curve. We find that the inverse power law method is applicable to big data, while the semi-supervised method can be better at detecting convergence. To the best of our knowledge, this is the first study to apply an inverse power law curve fitting method to big data with limited labels and compare it to a semi-supervised approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Notes

  1. http://cruncher.ico2s.org/bdcomp/.

  2. https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Physician-and-Other-Supplier.html.

  3. https://oig.hhs.gov/exclusions/exclusions_list.asp.

  4. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#splice-site.

  5. https://aws.amazon.com/ec2/.

  6. https://github.com/rikturr/semi_ssd.

References

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Aaron N. Richter.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Richter, A.N., Khoshgoftaar, T.M. Sample size determination for biomedical big data with limited labels. Netw Model Anal Health Inform Bioinforma 9, 12 (2020). https://doi.org/10.1007/s13721-020-0218-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13721-020-0218-0

Keywords

Navigation