The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions

Garcia Santa Cruz, Beatriz; Vega, Carlos; Hertel, Frank

doi:10.1007/978-3-031-20837-9_16

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13483)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

In this paper, we discuss the importance of considering causal relations when developing machine learning solutions, in order to prevent factors that hamper the robustness and generalisation capacity of the models, such as induced biases. Such biases often arise when the algorithm's decision is affected by confounding factors. We argue that integrating research assumptions as causal relationships can help identify potential confounders. Together with metadata, these causal relationships can enable meta-comparison of data acquisition pipelines. We call for standardised meta-information practices as a crucial step towards the proper development and validation of machine learning solutions and towards data sharing. Such practices include detailing the data acquisition process, with the aim of automatically integrating causal relationships and actionable metadata.
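
As a rough illustration of how causal assumptions stored alongside dataset metadata could be used to flag potential confounders, consider the minimal Python sketch below. The metadata layout, the field names (model_input, model_target, causal_assumptions) and the variables (hospital_site, scanner_model and so on) are hypothetical and are not taken from the paper or from any existing metadata standard. The check is a simple common-cause search over the declared edges, not a full causal identification procedure.

```python
from collections import defaultdict

# Hypothetical metadata record: field names and variables are illustrative
# assumptions, not a published standard or the authors' format.
dataset_metadata = {
    "name": "example_imaging_cohort",
    "model_input": "image_intensity",
    "model_target": "diagnosis",
    "causal_assumptions": [  # each pair reads "cause -> effect"
        ("hospital_site", "scanner_model"),
        ("scanner_model", "image_intensity"),
        ("hospital_site", "diagnosis"),
        ("patient_age", "diagnosis"),
    ],
}


def ancestors(edges, node):
    """Return every variable with a directed path into `node`."""
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)
    seen, stack = set(), [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def potential_confounders(metadata):
    """Variables assumed to cause both the model input and the model target."""
    edges = metadata["causal_assumptions"]
    model_input = metadata["model_input"]
    model_target = metadata["model_target"]
    common = ancestors(edges, model_input) & ancestors(edges, model_target)
    return sorted(common - {model_input, model_target})


print(potential_confounders(dataset_metadata))  # prints ['hospital_site']
```

In this toy record, hospital_site is flagged because it is assumed to influence both the image intensities (through the scanner model) and the diagnosis, so a model could learn the acquisition site rather than the disease. Keeping the edges in machine-readable metadata is what would make this kind of check automatable across datasets.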

Acknowledgments

The authors would like to thank Andreas Husch and Matias Bossa for their support and review efforts. Beatriz Garcia Santa Cruz's work is supported by the FNR-PRIDE17/12244779/PARK-QC and the Pelican award from the Fondation du Pelican de Mie et Pierre Hippert-Faber.

Author information

Authors and Affiliations

National Department of Neurosurgery, Centre Hospitalier de Luxembourg, Luxembourg City, Luxembourg
Beatriz Garcia Santa Cruz & Frank Hertel
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
Carlos Vega

Corresponding author

Correspondence to Beatriz Garcia Santa Cruz.

Editor information

Editors and Affiliations

University of Toronto, Toronto, ON, Canada
Davide Chicco
Consiglio Nazionale delle Ricerche, Avellino, Italy
Angelo Facchiano
Università di Padova, Padua, Italy
Erica Tavazzi
Università di Padova, Padua, Italy
Enrico Longato
Università di Padova, Padua, Italy
Martina Vettoretti
Politecnico di Milano, Milan, Italy
Anna Bernasconi
Università di Verona, Verona, Italy
Simone Avesani
Università di Bergamo, Bergamo, Italy
Paolo Cazzaniga

About this paper

Cite this paper

Garcia Santa Cruz, B., Vega, C., Hertel, F. (2022). The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions. In: Chicco, D., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2021. Lecture Notes in Computer Science, vol 13483. Springer, Cham. https://doi.org/10.1007/978-3-031-20837-9_16

DOI: https://doi.org/10.1007/978-3-031-20837-9_16
Published: 26 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20836-2
Online ISBN: 978-3-031-20837-9
eBook Packages: Computer Science, Computer Science (R0)
