The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions

  • Conference paper
Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2021)

Abstract

In this paper, we discuss the importance of considering causal relations when developing machine learning solutions, in order to prevent factors that hamper the robustness and generalisation capacity of the models, such as induced biases. This issue often arises when the algorithm's decision is affected by confounding factors. We argue that integrating research assumptions as causal relationships can help identify potential confounders and that, together with metadata, these relationships can enable meta-comparison of data acquisition pipelines. We call for standardised meta-information practices as a crucial step towards the proper development and validation of machine learning solutions and towards data sharing. Such practices include detailing the data acquisition process, with the aim of automatically integrating causal relationships and actionable metadata.
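As an illustration of how causal assumptions stored as metadata might be used to flag potential confounders, the following minimal sketch may help. It is not the authors' method: the record layout, the field names (`exposure`, `outcome`, `causal_edges`) and the variables in the example are all hypothetical, and the check is a simplified heuristic (variables that cause the exposure and also reach the outcome through a path that avoids the exposure), not a full back-door analysis.

```python
import json


def ancestors(edges, target, blocked=None):
    """Return every node with a directed path to `target` that avoids `blocked`."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)
    found, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent != blocked and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def candidate_confounders(record):
    """Flag variables that cause the exposure and also reach the outcome
    without passing through the exposure (a simplified heuristic)."""
    edges = [tuple(edge) for edge in record["causal_edges"]]
    exposure, outcome = record["exposure"], record["outcome"]
    causes_of_exposure = ancestors(edges, exposure)
    causes_of_outcome = ancestors(edges, outcome, blocked=exposure)
    return sorted(causes_of_exposure & causes_of_outcome)


# Hypothetical metadata record; the field names are illustrative only and
# do not belong to any existing standard.
record = {
    "dataset": "example-study",
    "exposure": "treatment",
    "outcome": "recovery",
    "causal_edges": [  # [cause, effect] pairs stating research assumptions
        ["age", "treatment"],
        ["age", "recovery"],
        ["hospital", "treatment"],
        ["treatment", "recovery"],
    ],
}

# Round-trip through JSON to show the record can be shipped alongside the data.
loaded = json.loads(json.dumps(record))
print(candidate_confounders(loaded))
```

In this toy record, `age` is flagged because it is assumed to influence both the treatment and the recovery, whereas `hospital` is not, since it is assumed to affect recovery only through the treatment. Keeping the assumptions in a machine-readable form is what would allow such a check to be run automatically across datasets.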

Acknowledgments

The authors would like to thank Andreas Husch and Matias Bossa for their support and review efforts. Beatriz Garcia Santa Cruz's work is supported by the FNR-PRIDE17/12244779/PARK-QC and the Pelican award from the Fondation du Pelican de Mie et Pierre Hippert-Faber.

Author information

Corresponding author

Correspondence to Beatriz Garcia Santa Cruz.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Garcia Santa Cruz, B., Vega, C., Hertel, F. (2022). The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions. In: Chicco, D., et al. Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2021. Lecture Notes in Computer Science, vol 13483. Springer, Cham. https://doi.org/10.1007/978-3-031-20837-9_16

  • DOI: https://doi.org/10.1007/978-3-031-20837-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20836-2

  • Online ISBN: 978-3-031-20837-9

  • eBook Packages: Computer Science, Computer Science (R0)
