Good Statistical Monitoring: A Flexible Open-Source Tool to Detect Risks in Clinical Trials

Background: Risk-based quality management is a regulatory-recommended approach to managing risk in a clinical trial. A key element of this strategy is to conduct risk-based monitoring to detect potential risks to critical data and processes earlier. However, there are limited publicly available tools to perform the analytics required for this purpose. Good Statistical Monitoring is a new open-source solution developed to help address this need.
Methods: A team of statisticians, data scientists, clinicians, data managers, clinical operations, regulatory, and quality compliance staff collaborated to design Good Statistical Monitoring, an R package, to flexibly and efficiently implement end-to-end analyses of key risks. The package currently supports the mapping of clinical trial data from a variety of formats, evaluation of 12 key risk indicators, interactive visualization of analysis results, and creation of standardized reports.
Results: The Good Statistical Monitoring package is freely available on GitHub and empowers clinical study teams to proactively monitor key risks. It employs a modular workflow to perform risk assessments that can be customized by replacing any workflow component with a study-specific alternative. Results can be exported to other clinical systems or viewed as an interactive report to facilitate follow-up risk mitigation. Rigorous testing and qualification are performed as part of each release to ensure package quality.
Conclusions: Good Statistical Monitoring is an open-source solution designed to enable clinical study teams to implement statistical monitoring of critical risks as part of a comprehensive risk-based quality management strategy.


Introduction
Clinical trials aim to evaluate the safety and efficacy of promising therapeutic candidates while protecting patients' welfare and rights. To reliably achieve this objective, it is essential that both critical data and processes are of high quality. Traditional monitoring approaches such as 100% source data verification and frequent site visits have been shown to be less efficient than a risk-focused strategy [1][2][3]. Regulatory authorities recommend risk-based monitoring (RBM) as a superior alternative, given that it is a more adaptive approach; even so, adoption has been slower than might be expected. One possible driver for slower adoption is the lack of effective, easy-to-use, and inexpensive tools to properly perform risk detection, compared to risk assessment. Recent reviews have found a breadth of tools available to assess potential risks to a trial at the start-up stage, but only limited information on how to develop or implement published methods for detecting study risk while the trial is ongoing [11][12][13]. In contrast, commercial and home-grown CRO solutions tend to be more sophisticated and include technical support, but are substantially more expensive to implement [14]. Given the proprietary nature of these systems, it is often difficult to share analysis findings and details of how the underlying risk detection algorithms work. Unfortunately, this trade-off between quality and cost may leave trial sponsors in a tough spot, especially when there are limited trial resources to support RBM.
To address this gap, we introduce a new open-source R package, Good Statistical Monitoring ({gsm}), as a free, flexible, and reliable tool to perform risk detection for RBM. R was chosen because it is freely available and widely used by the clinical trial community. {gsm} provides a supportive end-to-end framework for risk detection, from data ingestion and risk analysis to visualization and reporting. It includes a flexible mapping process capable of handling multiple data standards, and leverages a modular workflow structure that can easily be adjusted for study-specific customizations. It is also thoroughly tested and qualified prior to each release.

Methods
{gsm} was designed based on a series of extensive discussions with clinicians, statisticians, data scientists, data managers, clinical operations, regulatory, and quality compliance staff, including reviews of existing tools and literature. The goal was to create a scalable and customizable analytics engine that could support an end-to-end workflow for risk detection, including data ingestion, analysis, visualization, and reporting. Technical details, vignettes, and example reports can be found at: https://gilead-biostats.github.io/gsm/index.html.
Development and testing of the functions in {gsm} relied primarily on two repositories of anonymized clinical trial data: {clindata} and {safetyData}. {safetyData} is an R package that reformats PHUSE's sample ADaM and SDTM trial datasets [15]. {clindata} is a repository of anonymized and simulated clinical trial datasets from a variety of different sources and data formats [16].
Statistical analysis of KRIs in {gsm} relies on defining a numerator and a denominator for each metric (Table 1). Then, depending on whether the metric is a percentage or a rate, the user can select different statistical methods to be applied. The default method uses a normal approximation for percentages and rates, with an adjustment for over-dispersion, to calculate z-scores for flagging at-risk sites [17]. When m sites are in a trial, where m > 2, the adjusted z-score for a site i can be defined as

    z_i* = (y_i − μ_0) / √(V′(y_i | μ_0)),

where y_i is the KRI metric for site i, μ_0 is the overall mean, and V′(y_i | μ_0) is the over-dispersion-adjusted variance. The over-dispersion parameter ϕ is calculated as the average of the unadjusted squared z-scores:

    ϕ = (1/m) Σ_{i=1..m} z_i², with z_i = (y_i − μ_0) / √(V(y_i | μ_0)).

For percentages, the over-dispersion-adjusted variance is V′ = ϕ · p(1 − p) / n_i, where p is the observed overall proportion of events and n_i is the total number of study participants at site i. For rates, the over-dispersion-adjusted variance is V′ = ϕ · λ / T_i, where λ is the observed exposure-adjusted incidence rate, defined as the total number of events divided by the total study exposure time, and T_i is the total exposure time for participants at site i. Alternatively, users can choose to perform Fisher's exact tests for percentages and Poisson regression analyses for rates. More details can be found at https://gilead-biostats.github.io/gsm/articles/KRI%20Method.html.
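To make the default method concrete, the rate case can be sketched in a few lines. The following is a minimal Python sketch of the formulas above, not the {gsm} implementation (which is in R):

```python
import math

def adjusted_z_scores(events, exposures):
    """Over-dispersion-adjusted z-scores for site-level event rates.

    A hand-rolled sketch of the normal-approximation method for rates,
    not the {gsm} implementation.
    events[i]    -- number of events at site i
    exposures[i] -- total exposure time T_i at site i
    """
    m = len(events)
    lam = sum(events) / sum(exposures)  # overall incidence rate (lambda)
    rates = [e / t for e, t in zip(events, exposures)]
    # unadjusted z-scores: the variance of a site rate is lambda / T_i
    z = [(r - lam) / math.sqrt(lam / t) for r, t in zip(rates, exposures)]
    # over-dispersion parameter phi: average of squared unadjusted z-scores
    phi = sum(zi * zi for zi in z) / m
    # adjusted z-scores scale the variance by phi
    return [zi / math.sqrt(phi) for zi in z]
```

By construction, the mean of the squared adjusted z-scores equals one, so a site is flagged relative to the observed spread across the whole trial rather than the raw Poisson variance alone.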
Visualizations are built with R and JavaScript to create custom plots depicting analysis results. Interactive reports are produced as HTML documents using R Markdown. A detailed qualification report is automatically generated for each release using a set of machine-readable specifications and test cases to evaluate the expected performance of critical functions.
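Conceptually, this spec-driven qualification resembles data-driven testing: each machine-readable entry names a function under test, its cases, and the expected results, and the report tallies the outcomes. The toy Python illustration below conveys the idea only; the spec structure shown is invented, not {gsm}'s actual format:

```python
# Toy illustration of spec-driven qualification testing. Each spec pairs
# a function under test with cases and expected outputs; the structure
# is invented for illustration, not the {gsm} specification format.
def event_rate(events, exposure):
    """Example function under test: events per unit of exposure."""
    return events / exposure

SPECS = [
    {
        "function": event_rate,
        "cases": [
            {"args": (10, 100.0), "expect": 0.1},
            {"args": (0, 50.0), "expect": 0.0},
        ],
    },
]

def run_qualification(specs, tol=1e-12):
    """Evaluate every case in every spec; return pass/fail counts."""
    passed = failed = 0
    for spec in specs:
        for case in spec["cases"]:
            got = spec["function"](*case["args"])
            if abs(got - case["expect"]) <= tol:
                passed += 1
            else:
                failed += 1
    return {"passed": passed, "failed": failed}
```

Because the specifications are data rather than code, the same harness can regenerate the qualification summary automatically for each release.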

Results
The analysis of each KRI in {gsm} is defined as an assessment following a standard model: data is first inputted at the trial participant level, transformed into a site-level summary, analyzed to generate test statistics and p-values, flagged to identify sites that cross user-specified thresholds, and then summarized (Fig. 1). Optional customizable mapping functions are provided to support conversion of trial data from a variety of possible data sources and formats (ADaM, SDTM, raw, etc.) to the input data required for each assessment. Workflows expand upon assessments by adding more capabilities, such as support for country- or region-level analyses, analyses of data subsets, and automated data checking, and enable users to perform a set of workflows more easily and at scale through a single function (Fig. 2). As an example of the benefit of workflow customization, a user can easily expand an analysis of AE reporting rates for all enrolled patients by adding filter functions that repeat the analysis within the same workflow for only the subset of participants who were randomized and treated, or for a subset of participants with a specific category of adverse events.
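The modular assessment model can be pictured as a chain of replaceable functions, any one of which a study team could swap for a custom version. The sketch below is illustrative only; the step names and data shapes are invented for this example and are not the {gsm} API:

```python
# Schematic of the assessment model (transform -> analyze -> flag ->
# summarize). Names and data shapes are illustrative, not the {gsm} API.
from statistics import mean

def transform(participants):
    """Aggregate participant-level records into site-level event rates."""
    sites = {}
    for p in participants:
        s = sites.setdefault(p["site"], {"events": 0, "exposure": 0.0})
        s["events"] += p["events"]
        s["exposure"] += p["exposure"]
    return {k: v["events"] / v["exposure"] for k, v in sites.items()}

def analyze(site_rates):
    """Score each site's rate against the overall mean (toy metric)."""
    mu = mean(site_rates.values())
    return {site: rate - mu for site, rate in site_rates.items()}

def flag(scores, threshold):
    """Flag sites whose score crosses a user-specified threshold."""
    return {site: abs(s) > threshold for site, s in scores.items()}

def run_assessment(participants, threshold=0.05):
    """Chain the steps; any step can be replaced by a custom function."""
    scores = analyze(transform(participants))
    return {"scores": scores, "flags": flag(scores, threshold)}
```

A filter step, as in the AE example above, would simply be one more function inserted at the front of the chain to subset `participants` before transformation.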
{gsm} supports the creation of multiple interactive visualizations that lead to a better understanding of analysis results. For individual assessments, results can be depicted as a scatter plot or bar plot on different scales (Fig. 3). For an overview of results, a site-by-assessment heatmap can be generated to highlight the commonly flagged KRIs across sites or the sites with the most flagged KRIs. For assessments of a given site over time, longitudinal plots can be created to show changes in results over multiple analyses. To easily capture and share the analysis results and visualizations, users can create a standard report with supportive trial information and the ability to search, filter, or examine specific data points of interest in more detail.

The {gsm} R package has undergone extensive testing and qualification. As of v1.8.1, over 1,450 unit tests have been written, with 87.3% code coverage. A qualification report is automatically attached to each release, ensuring the package meets the expected standards and requirements to detect study risks (Fig. 4). Qualification testing currently covers 24 core functions, evaluating 88 use cases across 171 total tests.

Discussion
An effective RBM approach requires the ability to accurately detect study risks in a timely manner. {gsm} is a free, open-source, qualified solution developed for that purpose. It covers all the steps from data ingestion to reporting and allows R users to do so in a few lines of code. The modular structure of assessments and workflows facilitates study-specific customizations, and interactive visualizations allow users to better understand analysis results. Early efforts implementing {gsm} at Gilead have proven successful; we were able to detect similar risks to those found by other proprietary systems, and to more easily perform fit-for-purpose analyses for study-specific nuances across a diverse set of pilot studies.

Compared with alternative tools to detect risks as part of RBM, {gsm} offers a robust and effective solution for free. Among publicly available options, code to implement the proposed statistical methods may not exist and, when available, is usually provided in a piecewise fashion or limited to a much narrower scope [18][19][20], making it difficult to detect all potential critical risks in a study and impractical to apply systematically across a portfolio of studies. Commercial options typically offer a software-as-a-service approach [14] with more thorough and customizable analytics, but are substantially more expensive. Thus, {gsm} helps to fill an existing gap in risk detection tools and, we hope, will support increased adoption of RBM.

Future improvements planned for upcoming releases, in order of prioritization, include expanding the number of KRIs that can be analyzed, supporting qualified QTL analyses, conducting unsupervised statistical monitoring, and incorporating more methods for statistical testing. Current KRIs focus on critical areas related to study population, safety, deviations, and data quality, but do not yet cover other important areas such as primary and secondary endpoints, as these may require more complex study-specific derivations and analyses. Although users can choose from more than one statistical method, some commonly used models, such as beta-binomial models for binary outcomes [21] and linear mixed-effect models for continuous outcomes [22], have not been implemented. These methods may perform better in different situations; for example, the default method relying on the normal approximation will tend to perform better when there are more sites, while an exact method may perform better when there are only a few sites. Further, adding unsupervised approaches will allow users to agnostically survey the entirety of available trial data to find unknown risk signals. The {gsm} workflow can also easily be extended to perform QTL analyses, and experimental QTL functions, which need further refinement and validation, are being developed. Another interesting use case to explore is using {gsm} to analyze real-world data to detect potential risks across regions, data sources, or other groupings. Adding these features will take time; fortunately, {gsm} was purposely designed with a modular framework suited for quickly incorporating new improvements.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 1
Fig. 1 Assessment Model. This example illustrates the process to analyze the adverse event rate for each site, where datasets are shown as tables labeled in italics, functions in purple, and user-specified parameters in gold. First, the raw source data, including adverse event data (ae) and subject data (dm), is mapped into the required input data format (input) containing the relevant information per subject, such as study exposure in days, number of adverse events, and the corresponding event rate. Next, the input data is transformed into an aggregate summary per site before being analyzed using the normal approximation method adjusting for over-dispersion. The resulting score is then flagged per the user-specified thresholds. Finally, the output is summarized for aggregation across all assessments performed for a study

Fig. 2 Fig. 3
Fig. 2 Workflow Structure. Users can stack workflows into a larger assessment object, where each workflow uses metadata and specifications inputted from separate YAML files to capture study-specific variable mappings, modifications to analyze by different groups (e.g., site,

Table 1
Key risk indicators. The definition of each key risk indicator includes a numerator divided by a denominator (e.g., # of screen failures / total screened participants). This represents an initial list of key risk indicators available in the current release and will be further updated in future releases