Abstract
Deep learning models have revolutionized medical image analysis and offer significant promise for improved diagnostics and patient care. However, their reported performance can be misleadingly optimistic because of a hidden pitfall called ‘data leakage’. In this study, we investigate data leakage in 3D medical imaging, specifically in 3D Convolutional Neural Networks (CNNs) for brain MRI analysis. While 3D CNNs appear less prone to leakage than their 2D counterparts, improper data splitting during cross-validation (CV) can still cause leakage, especially with longitudinal imaging data containing repeated scans from the same subject. We examine how different data splitting strategies affect model performance in longitudinal brain MRI analysis and identify potential data leakage concerns. GradCAM visualization helps reveal shortcuts in CNN models caused by identity confounding, where the model learns to identify subjects alongside diagnostic features. Our findings, consistent with prior research, underscore the importance of subject-wise splitting and of further evaluating the model on hold-out data from different subjects to ensure the integrity and reliability of deep learning models in medical image analysis.
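To make the subject-wise splitting idea concrete, the following is a minimal sketch, not the paper's actual pipeline. It assumes each scan carries a subject identifier; the helper names `subject_wise_folds` and `leaked_subjects` and the toy cohort are hypothetical and introduced only for illustration. Folds are built by partitioning subjects rather than scans, so repeated scans from one subject never land on both sides of a split.

```python
import random
from collections import defaultdict


def subject_wise_folds(subject_ids, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject spans both sides."""
    # Group scan indices by the subject they belong to.
    scans_by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        scans_by_subject[sid].append(idx)

    subjects = sorted(scans_by_subject)
    if n_splits < 2 or n_splits > len(subjects):
        raise ValueError("n_splits must be between 2 and the number of subjects")
    # Shuffle subjects (not scans) so the fold assignment is random but reproducible.
    random.Random(seed).shuffle(subjects)

    for k in range(n_splits):
        # Every n_splits-th subject, starting at offset k, forms the test fold.
        test_subjects = set(subjects[k::n_splits])
        test_idx = sorted(
            i for s in test_subjects for i in scans_by_subject[s]
        )
        train_idx = sorted(
            i for s in subjects if s not in test_subjects for i in scans_by_subject[s]
        )
        yield train_idx, test_idx


def leaked_subjects(subject_ids, train_idx, test_idx):
    """Return the set of subjects that appear in both train and test."""
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    return train_subjects & test_subjects


# Hypothetical longitudinal cohort: six subjects, some with repeated scans.
subject_ids = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3", "s4", "s5", "s5", "s6"]
folds = list(subject_wise_folds(subject_ids, n_splits=3, seed=0))

# A naive record-wise split (alternating scans) for comparison.
naive_train = list(range(0, len(subject_ids), 2))
naive_test = list(range(1, len(subject_ids), 2))
naive_leak = leaked_subjects(subject_ids, naive_train, naive_test)
```

In this toy cohort the alternating record-wise split places several subjects on both sides, whereas every subject-wise fold has an empty overlap. In practice, a library utility such as scikit-learn's `GroupKFold` offers comparable grouping behaviour, and a further hold-out set can be built the same way by reserving whole subjects before cross-validation begins.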
Acknowledgement
This research was funded by the Ministry of Education and Research Technology, Indonesia through the PMDSU scholarship. Special thanks to Prof. I Ketut Eddy Purnama, the author’s PhD supervisor, for securing the research funding and for his valuable ideas and insights, and Prof. Tae-Seong Kim, whose insightful perspectives inspired the development of this paper. Additionally, the author would like to express gratitude to the Bio Imaging Laboratory at Kyung Hee University, South Korea, where the data collection for this study was conducted.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rumala, D.J. (2023). How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis. In: Wesarg, S., et al. Clinical Image-Based Procedures, Fairness of AI in Medical Imaging, and Ethical and Philosophical Issues in Medical Imaging. CLIP 2023, EPIMI 2023, FAIMI 2023. Lecture Notes in Computer Science, vol 14242. Springer, Cham. https://doi.org/10.1007/978-3-031-45249-9_23
DOI: https://doi.org/10.1007/978-3-031-45249-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45248-2
Online ISBN: 978-3-031-45249-9
eBook Packages: Computer Science, Computer Science (R0)