A Standardised Approach for Preparing Imaging Data for Machine Learning Tasks in Radiology

  • Hugh Harvey
  • Ben Glocker


Medical imaging data is now extremely abundant, thanks to over two decades of digitisation of imaging protocols and data storage formats. However, clean, well-curated data that is amenable to machine learning is relatively scarce, and AI developers are, paradoxically, data starved. Imaging and clinical data is also heterogeneous and often unstructured and unlabelled, whereas current supervised and semi-supervised machine learning techniques rely on homogeneous, carefully annotated data. While imaging biobanks contain small volumes of well-curated data, it is the leveraging of ‘big data’ from the front line of healthcare that is the focus of many machine learning developers hoping to train and validate computer vision algorithms. The quest for sufficiently large volumes of clean data for training, validation and testing involves several hurdles, namely ethics and consent, security, the assessment of data quality, ground truth data labelling, bias reduction, reusability and generalisability. In this chapter we propose a new medical imaging data readiness (MIDaR) scale. The MIDaR scale is designed to objectively clarify data quality for both researchers seeking imaging data and clinical providers aiming to share their data. It is hoped that the MIDaR scale will be used globally during collaborative academic and business conversations, so that everyone can more easily understand and quickly appraise the relevant stages of data readiness for machine learning in relation to their AI development projects. We believe that the MIDaR scale could become essential in the design, planning and management of AI medical imaging projects, and could significantly increase their chances of success.


Keywords: Data readiness, Medical imaging, Machine learning, MIDaR scale



With thanks to Hugh Lyshkow, DesAcc Inc., for his invaluable input and insight.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Kheiron Medical Technologies, London, UK
  2. Imperial College, London, UK
