How to read and review papers on machine learning and artificial intelligence in radiology: a survival guide to key methodological concepts


In recent years, there has been a dramatic increase in research papers about machine learning (ML) and artificial intelligence in radiology. With so many papers being published, it is of paramount importance to assess their scientific quality properly in terms of validity, reliability, effectiveness, and clinical applicability. Because of their methodological complexity, papers on ML in radiology are often hard to evaluate, and doing so requires a good understanding of key methodological issues. In this review, we aimed to guide the radiology community through the key methodological aspects of ML to improve their academic reading and peer-review experience. Key aspects of the ML pipeline were presented within four broad categories: study design, data handling, modelling, and reporting. Sixteen key methodological items and related common pitfalls were reviewed with a fresh perspective: database size, robustness of reference standard, information leakage, feature scaling, reliability of features, high dimensionality, perturbations in feature selection, class balance, bias-variance trade-off, hyperparameter tuning, performance metrics, generalisability, clinical utility, comparison with traditional tools, data sharing, and transparent reporting.

Key Points

• Machine learning is new and rather complex for the radiology community.

• The validity, reliability, effectiveness, and clinical applicability of studies on machine learning can be evaluated with a proper understanding of key methodological concepts in study design, data handling, modelling, and reporting.

• Understanding key methodological concepts will provide a better academic reading and peer-review experience for the radiology community.





The authors state that this work has not received any funding.

Author information



Corresponding author

Correspondence to Burak Kocak.

Ethics declarations


The scientific guarantor of this publication is Burak Kocak, MD.

Conflict of interest

The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Statistics and biometry

No statistical methods were necessary for this paper.

Informed consent

Not required.

Ethical approval

Not required.


• Review Article

About this article

Cite this article

Kocak, B., Kus, E.A. & Kilickesmez, O. How to read and review papers on machine learning and artificial intelligence in radiology: a survival guide to key methodological concepts. Eur Radiol 31, 1819–1830 (2021).


  • Machine learning
  • Artificial intelligence
  • Deep learning
  • Radiology
  • Peer-review