Abstract
In this paper, we discuss the importance of considering causal relations when developing machine learning solutions, in order to prevent factors that hamper the robustness and generalisation capacity of the models, such as induced biases. This issue often arises when the algorithm's decision is affected by confounding factors. We argue that integrating research assumptions as causal relationships can help identify potential confounders and, together with metadata, can enable meta-comparison of data acquisition pipelines. We call for standardised meta-information practices as a crucial step towards the proper development and validation of machine learning solutions and towards data sharing. Such practices include detailing the data acquisition process, with the aim of automatically integrating causal relationships and actionable metadata.
Acknowledgments
The authors would like to thank Andreas Husch and Matias Bossa for their support and review efforts. Beatriz Garcia Santa Cruz's work is supported by FNR-PRIDE17/12244779/PARK-QC and the Pelican award from the Fondation du Pelican de Mie et Pierre Hippert-Faber.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Garcia Santa Cruz, B., Vega, C., Hertel, F. (2022). The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions. In: Chicco, D., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2021. Lecture Notes in Computer Science, vol 13483. Springer, Cham. https://doi.org/10.1007/978-3-031-20837-9_16
DOI: https://doi.org/10.1007/978-3-031-20837-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20836-2
Online ISBN: 978-3-031-20837-9
eBook Packages: Computer Science, Computer Science (R0)