Reporting Standards

An essential element of successfully applying a previously developed and validated model is having documented exactly the intent of the model, its method of construction, the data used, the analytical modeling, the characteristics of the intended application population, and the expected generalization performance and safety characteristics. These pieces of information enable third parties, independent of the developers, to evaluate the rigor of the models, determine the appropriateness of applying them in their own settings, and perform further evaluations, refinements, and enhancements as needed. For all of the above to be feasible, sufficient documentation and reporting of the model development and validation process must exist. Certain narrower subsets of the above functions also fall under the rubric of scientific reproducibility, which is a major indicator of reliable science and of AI/ML models with a high likelihood of successful application in additional populations and settings. Reporting standards have been devised in the scientific publication sphere to facilitate these goals. However, these reporting standards can also be applied to internal organizational documentation practices.

The publication reporting standards prescribe a minimal set of information that different types of scientific publications in health sciences must include. They are not designed to improve research quality, study design or any aspect of research and development other than reproducibility. Even when publication is not a primary goal, these reporting standards can be useful as a guide for internal documentation of AI/ML model development.

In this chapter, we review and synthesize several existing reporting standards, and comment on their usefulness and applicability to AI/ML development.

Reporting standards prescribe a minimal set of information that needs to be included in the model description.

Problems with reporting have been recognized as early as 1929 [1], and several guidelines have been proposed since. The first evidence-based recommendations, CONSORT, were published for clinical trials in 1996. These guidelines established a protocol for developing and disseminating reporting standards, and recognized the need for collaboration among various stakeholders, including researchers and journals. The most established organization for reporting guidelines is Enhancing the QUAlity and Transparency Of health Research, EQUATOR [2]. In their own words, “The EQUATOR Network is an ‘umbrella’ organization that brings together researchers, medical journal editors, peer reviewers, developers of reporting guidelines, research funding bodies and other collaborators with mutual interest in improving the quality of research publications and of research itself.”

The EQUATOR Network has over 250 reporting standards, covering a multitude of study types and settings. There are core reporting standards for several areas relevant to this book, including randomized trials (CONSORT), observational studies (STROBE), diagnostic accuracy (STARD), genetic risk prediction (GRIPS), and others. The most relevant standard is the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD).

Purpose and Value of Documenting AI/ML Models

Decision makers at health care provider organizations may appraise published models to determine their appropriateness for particular clinical applications. Researchers may wish to replicate the same modeling process to compare the resulting model with alternatives. AI/ML data scientists may want to improve upon the model. The publication of the model must therefore contain sufficient information so that readers (i) can determine whether the research is sound, (ii) can determine whether the model has clinical utility in the intended or other contexts, and (iii) can test whether they can reproduce the results.

Moreover, change in healthcare is constant: demographics shift, practice patterns improve, diagnostic criteria are updated, and new treatments are introduced. As a result, models need to be re-evaluated, updated, and extended regularly. Complete documentation can help anticipate which elements need to be updated. The model or its predictors may have to be completely reconstructed, for example due to changes in the underlying technologies, or the model may have to be recomputed to demonstrate the correctness of the development process when determining liability. Proper documentation of the model is necessary for these purposes, and reporting standards help ensure that at least a minimal set of required elements is included in the documentation.

Pitfall 17.1

Poorly documented AI/ML models and their development and evaluation processes make independent review and replication efforts difficult.

Pitfall 17.2

Information required by reporting standards is a minimal set; additional information is often necessary.

Best Practice 17.1

Document the model, its development, validation and deployment process.

Best Practice 17.2

Document AI/ML models using reporting standards and extend with problem and technology-specific necessary information even if not part of the standard. Such extensions can be based on the various development stages and Best Practices in the present volume.

Relation to Reproducibility

Reproducibility is a cornerstone of science. Research is reproducible if independent scientists can recreate the findings (e.g., AI/ML models and their performance characteristics) based on the reported information. In the last decade, several large studies have been conducted to measure the reproducibility of published studies. In an attempt to reproduce results from 100 manuscripts in the psychology literature, a research team found that while 97% of the original studies reported a significant effect, only 36% of replication studies found that effect [3]. Replication efforts in biomedical and other fields found that only approximately 50% of research studies were reproducible [4,5,6]. Failure of studies to replicate was found to be predictable; e.g., predictions by peers about whether a study would replicate were highly correlated with the actual replication results [7]. The purpose of reporting standards in this context is not to advocate for a particular modeling methodology, but to include information in the AI/ML model documentation that allows readers to decide whether the model development process used in the manuscript was sound and to test its reproducibility if needed.

The key purpose of reporting standards/guidelines is to remind researchers what information to include in the manuscript and to remind peer-reviewers what information to look for. Reporting guidelines do not prescribe how research should be done.

Reporting Standards Adoption

In terms of current adoption of reporting standards, a 2022 study [8] examined 152 articles published in the year 2019 to determine which of the 22 key pieces of information recommended by the TRIPOD reporting standard were included. It found that some information, such as the interpretation of the results and the source of data, was included in almost all publications, while other information, such as the flow of subjects and the predictive performance of the model, was included in fewer than 10%. According to the TRIPOD authors, a model cannot be appraised properly without these pieces of information (we will appraise these claims and describe the TRIPOD standard later in the chapter).

Appraisal of the TRIPOD Standard

Among general purpose reporting standards, TRIPOD is the most applicable to AI/ML models. It comprises a checklist [9] of 22 items, a statement document [10], and an Explanation and Elaboration document [11]. Being a reporting standard for publications, the 22 items are organized by the section of the publication they must be included in: title, abstract, methods, results, etc.

At a high level, key information required includes the clinical context, the study objective, outcome (whether the outcome assessment was “blind”—carried out without knowing the predicted risk), data source, study setting (including the dates), participant information (eligibility, inclusion, exclusion criteria, treatments received), assessment of the predictors (including whether the assessment was “blind”—devoid of knowledge about the outcome), methods information (missing data, feature selection, model type), performance (calibration and discrimination), limitations, potential for clinical use, and funding information. As is immediately obvious from examining the TRIPOD reportable elements, there is no guidance on how to pursue AI/ML modeling to ensure safety and effectiveness. Therefore, full conformance with TRIPOD reporting does not lead to or guarantee these objectives.
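To make the two performance dimensions TRIPOD asks for concrete, the sketch below computes discrimination as the C-statistic (AUC) via pairwise comparisons and calibration via calibration-in-the-large. The function names and toy data are our own illustration, not part of TRIPOD; this is a minimal demonstration, not a complete performance evaluation.

```python
def discrimination_auc(y_true, y_prob):
    """C-statistic (AUC): probability that a randomly chosen positive case
    receives a higher predicted risk than a randomly chosen negative case."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    # Count pairwise "wins"; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_in_the_large(y_true, y_prob):
    """Mean observed outcome minus mean predicted risk;
    values near 0 indicate good overall calibration."""
    return sum(y_true) / len(y_true) - sum(y_prob) / len(y_prob)

# Toy data for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.3, 0.8, 0.6, 0.9, 0.2, 0.7, 0.4]
print(discrimination_auc(y_true, y_prob))        # -> 1.0 (perfect ranking)
print(calibration_in_the_large(y_true, y_prob))  # -> 0.0 (up to float rounding)
```

Note that a model can discriminate perfectly while being poorly calibrated (and vice versa), which is precisely why TRIPOD asks for both measures.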

Pitfall 17.3

TRIPOD does NOT guarantee that the resulting model is correct, free of bias and safe for clinical use.

TRIPOD is a reporting standard that seeks to bring to public view elements of the proper construction of AI/ML models (e.g., population choice, outcomes and model accuracy, data used, etc.). However, several fundamental limitations exist:

  (a) TRIPOD assumes that the reader knows all appropriate best practices for the right design and execution of the above elements.

  (b) TRIPOD is more of an ex post facto forensic tool than a proactive enabler of good AI/ML.

  (c) The reporting entity may misrepresent the reported information, and TRIPOD has no means of ensuring the validity of reporting.

  (d) TRIPOD's reported elements are neither complete nor necessary to ensure high-quality AI/ML models.

  (e) TRIPOD does not aim to avoid biases and does not assess the risk of pertinent biases.

To address some of TRIPOD’s limitations, other reporting standards have been devised, such as PROBAST [12], which focuses on assessing the risk of bias, and CHARMS [13], which aims to help reduce pitfalls in study design; both provide checklists for this purpose.

We also note that TRIPOD is aimed at multivariable models, primarily regression models. Some of its items are not appropriate for all machine learning methods. For example, not all machine learning methods have an intercept or a baseline hazard, which TRIPOD requires to be reported. At the time of this writing, it has been announced that new AI-focused versions of TRIPOD and PROBAST will be developed using a survey and Delphi methodology [14].

Synthesis of Reporting Recommendations from Multiple Existing Standards

Given that reporting standards are not tailored to AI/ML, it is useful to attempt a synthesis of recommendations adopting elements from multiple standards. Below we provide such an example, using items from four documentation standards for documenting ML models that use EHR data. The standards are TRIPOD [T] (predictive models), PROBAST [P] (assessment tool for the potential of biases), CHARMS [C] (model appraisal and systematic reviews), and RECORD [R] (retrospective studies using individual-level data). We reiterate that, depending on the specifics of the models, additional necessary documentation has to be provided in accordance with the Best Practices in the present volume.

In Table 1, we include the recommendation itself, organized by modeling steps, and denote which item it corresponds to in each standard.

Table 1 Synthesis of four reporting standards. The columns correspond to the four reporting standards and the number to the corresponding item in each reporting standard

The main purpose of this list is to demonstrate how multiple standards can be combined to arrive at a more complete documentation that helps the appraisal of the model, including the assessment of potential biases, and help determine the applicability of the model to another institution with different EHR data elements.
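A synthesized checklist of this kind can also be tracked programmatically during internal documentation. The sketch below is purely illustrative: the item wording and the standard tags ("T", "P", "C", "R") are examples in the spirit of Table 1, not quotations from the actual standards.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str      # what must be documented
    sources: tuple        # standards requiring it: "T", "P", "C", "R"
    documented: bool = False

# Illustrative items only -- consult the actual standards for complete lists.
checklist = [
    ChecklistItem("Describe the data source and study dates", ("T", "R")),
    ChecklistItem("Report how missing data were handled", ("T", "P", "C")),
    ChecklistItem("Report discrimination and calibration", ("T", "C")),
]

def missing_items(items):
    """Return the descriptions of items not yet documented."""
    return [i.description for i in items if not i.documented]

checklist[0].documented = True
print(missing_items(checklist))
# -> ['Report how missing data were handled', 'Report discrimination and calibration']
```

Keeping such a structured record alongside the model makes it straightforward to audit, before release, which required elements are still missing and which standard each element traces back to.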

Throughout the book, we present a multitude of suggested best practices designed to increase the likelihood of good modeling and to reduce the potential for error to a minimum. These range from high-level design down to fine-grained implementation details and can be used to provide complete reporting of critical factors affecting the quality of modeling. Such reportable dimensions and categories include aspects of causal modeling, unstructured data modeling, diagnostics and assurances against overfitting and underperformance, model selection strategies, modern feature selection, model explanation, regulatory conformance, equity and fairness considerations, and many other critical pieces of information not addressed in Table 1; these should supplement and enhance the current reporting standards.

HIMSS Analytic Maturity Model

Another important aid in ensuring the safe, fair, effective and efficient use of AI/ML in clinical practice is the certification of healthcare institutions by professional associations or societies with credible expertise and well-designed, validated certification processes for health AI/ML.

  • Certification ensures that certified healthcare institutions satisfy a minimum core set of requirements that are set forth by a reputable professional association or society with expertise in the area.

  • High quality certification ensures that meeting these requirements guarantees organizational competences.

  • Certification bodies can also offer advice on how to achieve the certification goals in areas of weakness.

Currently there is no professional association specializing in clinical or research AI/ML. HIMSS (Healthcare Information and Management Systems Society), which focuses on healthcare information technology, has developed an 8-stage model of analytics maturity, called the Adoption Model for Analytics Maturity (AMAM) [15], measuring an institution’s analytic capabilities. AMAM is also intended as a roadmap which, in consultation with HIMSS experts, lays out a path forward for organizations that wish to improve their analytic maturity. The eight stages, numbered 0–7, are as follows [15].

We caution the reader that, while the intent is to help organizations grow their analytical capabilities, there is a vast distance between conventional analytic capabilities and AI/ML capabilities. While the AMAM framework is not explicitly designed for AI/ML, its language strongly implies AI/ML competencies. Such organizational competencies require a deep level of advanced scientific and technological understanding that goes beyond conventional analytics and IT.

The high-level certification requirements of Table 2, for example, lack the level of testable technical rigor and specificity needed to ensure safe and effective AI/ML technology deployment in a high-risk domain such as clinical medicine. For example, there is little in the stated requirements that ensures that deployed technology is performant, reliable, and cost-effective, or that it can generate trust in patients, providers, regulators, and other stakeholders. Moreover, the much-needed protection against grave decision-making errors that can be produced by poorly understood and poorly applied AI/ML technology does not seem to be sufficiently addressed. In general (outside the narrow discussion of the AMAM model), we caution against the following general pitfall:

Table 2 HIMSS Adoption model of analytic maturity stages

Pitfall 17.4

There is a world of difference between conventional hospital analytics and AI/ML technology.

Certification processes that do not discriminate between these domains run a significant risk of creating false confidence in the existence of deep institutional competency in, and mastery of, technologies whose complexity in reality radically exceeds the technical capabilities of most healthcare institutions.

Such technologies have the potential to radically advance population health when designed correctly, but also to hurt patients if not properly developed and deployed.

Academic Accreditation and Professional Certification Efforts

Besides reporting standards that aim to promote the quality of AI/ML scientific and technological development, and the certification of healthcare organizations along the lines of AI/ML competencies, another pillar of ensuring safety, equity, efficiency, and effectiveness is training a specialized workforce with deep expertise in the science and technology of health AI/ML. In this section, we briefly look at three functions: (i) the availability of formal educational programs that educate health data science specialists in health AI/ML at the graduate and postgraduate levels; (ii) the training and certification of the broader health care and health sciences workforce to a minimally necessary level of understanding of these technologies; and (iii) the accreditation of educational programs that provide the above training and certifications.

Currently, few programs across the nation provide specialized undergraduate or graduate degrees in health data science and health AI/ML, and there is no accreditation specific to health AI/ML. Elements of healthcare- and health science-related AI/ML are often taught in health informatics programs, but very few institutions offer health AI/ML-focused degrees. The Commission on Accreditation for Health Informatics and Information Management Education (CAHIIM) [16] currently offers accreditation at the Master’s degree level. The CAHIIM standard is based on AMIA’s (American Medical Informatics Association) informatics competencies, which span health, social and behavioral science, information science and technology, leadership, professionalism, and areas at the intersections of these competencies [17]. These requirements were neither designed to nor do they achieve a comprehensive standard for specialty professional knowledge and competencies in AI/ML.

In terms of continuing certification of the broader healthcare workforce, no organization offers AI/ML-specific certification. HIMSS (discussed above) offers certification for health information management systems (Certified Associate/Professional in Healthcare Information and Management Systems; CAHIMS/CPHIMS) and digital transformation strategy (Certified Professional in Digital Health Transformation Strategy; CPDHTS). Alternatively, for board-certified physicians engaging in health information technology, the clinical informatics subspecialty [18, 19], developed jointly with AMIA, is offered by the American Board of Preventive Medicine and the American Board of Pathology, but it is not designed to develop deep competency in AI/ML.

Pitfall 17.5

There is no professional association/society specifically for health AI/ML.

Pitfall 17.6

There is no accreditation or certification specifically for health AI/ML.

Pitfall 17.7

There is a dearth of academic programs for educating health data scientists and health AI/ML experts.

Conclusions

The field of health care and health science research is in dire need of the development of focused educational programs, meaningful individual and institutional certification, and comprehensive reporting standards. Initial efforts along these lines are promising and directionally sound; however, significant and intensive efforts and investments are still needed in these areas.

Key Concepts Discussed in This Chapter

Reporting standards prescribe a minimal set of information that needs to be included in the model description.

The key purpose of reporting guidelines is to remind researchers what information to include in the manuscript and to remind peer reviewers what information to look for. They do not prescribe how research is done.

We discussed the TRIPOD standard for publishing predictive models.

We discussed additional standards. These include PROBAST, which is a tool for assessing the potential for biases, and CHARMS, for the appraisal of models and systematic reviews.

Certification aims to ensure that healthcare institutions satisfy a minimum core set of requirements toward developing technical competency.

Educational programs for specialists and certification for the broader workforce will need to be expanded greatly in order to meet the demand for experts necessitated by the explosive growth of AI/ML.

Pitfalls in This Chapter

Pitfall 17.1. Poorly documented AI/ML models and their development and evaluation processes make independent review and replication efforts difficult.

Pitfall 17.2. Information required by reporting standards is a minimal set; additional information is often necessary.

Pitfall 17.3. TRIPOD does NOT guarantee that the resulting model is correct, free of bias and safe for clinical use.

TRIPOD is a reporting standard that seeks to bring to public view elements of the proper construction of AI/ML models (e.g., population choice, outcomes and model accuracy, data used, etc.). However, several fundamental limitations exist:

  (a) TRIPOD assumes that the reader knows all appropriate best practices for the right design and execution of the above elements.

  (b) TRIPOD is more of an ex post facto forensic tool than a proactive enabler of good AI/ML.

  (c) The reporting entity may misrepresent the reported information, and TRIPOD has no means of ensuring the validity of reporting.

  (d) TRIPOD's reported elements are neither complete nor necessary to ensure high-quality AI/ML models.

  (e) TRIPOD does not aim to avoid biases and does not assess the risk of pertinent biases.

Pitfall 17.4. Significant dangers exist for imperfect or immature certification processes to create false institutional confidence in the existence of deep competency in, and mastery of, technologies whose complexity radically exceeds the technical capabilities of most healthcare providers. Such technologies have the potential either to radically advance population health when designed correctly or to hurt patients if not properly developed and deployed.

Pitfall 17.5. There is no professional association/society specifically for health AI/ML.

Pitfall 17.6. There is no accreditation or certification specifically for health AI/ML.

Pitfall 17.7. There is a dearth of academic programs for educating health data scientists and health AI/ML experts.

Best Practices in This Chapter

Best Practice 17.1. Document the model, its development, validation and deployment process.

Best Practice 17.2. Document AI/ML models using reporting standards and extend with problem- and technology-specific necessary information even if not part of the standard. Such extensions can be based on the various development stages and Best Practices in the present volume.

Questions and Discussion Topics in This Chapter

  1. List some benefits of internally documenting a predictive model in a way that the model can be recreated, including the predictor variables, the outcomes, the training and validation data sets, and the model.

  2. Discuss benefits and possible downsides of publishing models in peer-reviewed scientific journals.

  3. Can you think of ways to publish a model other than a research article?

  4. Consider the 22 items of the TRIPOD checklist and answer the following questions.

     (a) Which items are necessary and which can be omitted for the internal documentation of a model?

     (b) If you develop a deep learning model, which items in the checklist may not make sense? Would you change your answer if the model was a GBM?

     (c) Can you think of other data elements that you may want to record for a deep learning model?

     (d) The TRIPOD checklist is designed for diagnostic and prognostic models. Would a model that quantifies the effect of an intervention fall into the purview of TRIPOD?

     (e) Which of the TRIPOD items should be included for a model that quantifies the effect of an intervention for internal documentation? You can modify the items as necessary.

     (f) For the same model (that quantifies the effect of an intervention), are there additional pieces of information you would include for (a) internal documentation and (b) publication of the model in a journal?

     (g) If you were to build a multivariable regression model as a prognostic model based on genetic biomarkers, what additional information would you include?

     (h) Consult the GRIPS checklist (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175742/) and refine your answer to question (g).

  5. The PROBAST tool (https://www.probast.org/wp-content/uploads/2020/02/PROBAST_20190515.pdf) is designed to assess the risk of bias in predictive models. It is not intended as a reporting standard. Read the PROBAST items and propose new reporting items to add to the TRIPOD checklist that allow for assessing the risk of bias.

  6. Suppose you conduct a study for discovering the causal relationships among risk factors in Alzheimer’s disease. This is neither a diagnostic nor a prognostic study, so it falls outside the scope of TRIPOD. Which TRIPOD items are relevant and what additional information should be reported?

  7. Your goal is to document a prognostic predictive model, e.g., 7-year diabetes risk, for internal use in a manner that allows you to re-construct the predictors, outcome, training and validation data, and the model itself. Create a reporting checklist based on TRIPOD, PROBAST and CHARMS.

  8. The HIMSS certification has eight stages, while the CAHIIM accreditation is a pass/defer decision. Can you think of advantages to having multiple stages versus making a binary pass/fail decision?

  9. When you look at stages 1–7 of the AMAM, each stage requires several competencies. Which AMIA competency areas correspond to AMAM competencies?