Introduction

The pharmaceutical industry is steadily increasing its application of spectroscopic process analytical technology (PAT) for process monitoring [1]. PAT applications often depend on multivariate chemometric models to extract the relevant information from the collected spectra. While the lifecycle of a PAT method consists of multiple phases, this work focuses on the model calibration phase of method development [2]. Model calibration establishes relationships between spectral information and quality attributes using chemometric models, facilitating predictions via PAT methods. Model calibration requires the measurement of spectral data and the collection of corresponding reference samples. Reference samples are defined by ICH Q14 as having a known value for the property of interest, measured with a validated reference procedure [3].

There are several types of chemometric models applied alongside PAT tools, classified by their mathematical approach. Decomposition models, such as principal component analysis (PCA) and partial least squares (PLS), are the most popular modeling strategy for spectral interpretation. Representative spectral data and corresponding reference samples define the model space for decomposition models [4, 5]. Their straightforward mathematical basis allows the underlying structure to be interrogated and the prediction results to be evaluated. Machine learning (ML) models, such as support vector machines (SVM) and artificial neural networks (ANN), are increasing in popularity due to their generally superior prediction performance [6, 7]. However, the mathematical complexity of these models can render them pseudo "black boxes" whose outputs are difficult to interpret and verify. This challenge leads to a preference for decomposition models in PAT applications for pharmaceutical manufacturing processes, as evidenced by the widespread use of PLS models [8].
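As a concrete illustration of the calibration inputs a decomposition model requires, the following sketch fits a PLS model in Python with scikit-learn. All data are synthetic placeholders; the dataset sizes and number of latent variables are assumptions for illustration only.

```python
# Minimal PLS calibration sketch (synthetic data; sizes and settings assumed)
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X_cal = rng.random((30, 500))  # 30 calibration spectra x 500 wavelengths
y_cal = rng.random(30)         # corresponding reference values (e.g., assay)

# Both spectra and reference samples are needed to define the model space
pls = PLSRegression(n_components=3)
pls.fit(X_cal, y_cal)

X_new = rng.random((5, 500))   # new process spectra
y_pred = pls.predict(X_new)    # predictions from the calibrated model
```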

What these model classes share is a substantial calibration burden required to establish the model structure and ensure successful prediction performance. The calibration burden is defined as the sum of the time, material, and financial demands throughout the model calibration process (Fig. 1). Reducing the calibration burden is an attractive way to minimize the expense of PAT method development, which has driven a rapid increase in research focused on minimizing model calibration burden.

Fig. 1 Illustration of possible factors that contribute toward calibration burden

Pure component models are chemometric models that require only the pure component spectra of the constituent chemicals, and no reference samples, as model inputs to achieve predictions. Notable examples include the classical least squares (CLS) and iterative optimization technology (IOT) algorithms [9, 10]. Adjusting the base pure component models to account for non-chemical inputs can improve prediction performance and robustness, as demonstrated by pure augmented CLS (PACLS) and extended IOT (EIOT) [11, 12]. Advancements in pure component modeling have demonstrated a drastic reduction in calibration burden by eliminating the need for reference samples. However, the terms used to describe model calibration burden during PAT method development are applied inconsistently across the literature.
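To make the contrast concrete, the sketch below implements the core CLS step: given only pure component spectra, mixture composition is estimated by least squares, with no reference samples involved. The Gaussian "spectra" and noise level are illustrative assumptions, not data from any real system.

```python
# CLS prediction from pure component spectra alone (all data synthetic)
import numpy as np

wl = np.linspace(0, 1, 200)                    # arbitrary wavelength axis
S = np.vstack([                                # pure component spectra (rows)
    np.exp(-((wl - 0.3) ** 2) / 0.005),
    np.exp(-((wl - 0.7) ** 2) / 0.005),
])

true_c = np.array([0.6, 0.4])                  # "unknown" mixture composition
x = true_c @ S + np.random.default_rng(1).normal(0, 0.01, wl.size)

# Least-squares estimate of composition: x ≈ c @ S, no reference samples used
c_hat, *_ = np.linalg.lstsq(S.T, x, rcond=None)
print(c_hat)                                   # ≈ [0.6, 0.4]
```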

Implementing modeling strategies that reduce the calibration burden while maintaining the desired prediction performance hinges on how readily available techniques in the chemometric research space can be translated into practice. Harmonizing the language used to describe a research area has enabled effective research communication and provided a framework for advancing specific fields [13, 14]. The goal of this publication is to present recommendations for standard terms and definitions surrounding calibration burden to support both the pharmaceutical industry and regulatory agencies. By harmonizing the terminology used to describe model calibration strategies, the authors hope to facilitate continued learning and progress in minimizing model calibration burden during PAT method development. Expanding the understanding of modeling strategies that minimize calibration burden promotes widespread adoption throughout the industry and unlocks the potential for advancing the field of chemometric modeling.

Dataset Classifications

Different classifications of datasets are critical to the calibration of chemometric models. A dataset refers to the collection of spectral data and accompanying reference samples used throughout PAT method development. Datasets can be classified based on structure, role in model construction, and ability to inform the model. Structure refers to the intentional distribution of samples within the defined operating range of the model. The intentional structure of a dataset, referred to as the dataset design, is often derived from the concept of Design of Experiments [15]. Model construction refers to the establishment of a model space based on representative samples. Informing the model refers to setting and adjusting model parameters; it is distinct from construction because the spectral information is not directly incorporated into the model space. These classifiers (Table 1) are intended to clarify the dataset classifications used throughout the discussion of calibration burden levels.

Table 1 Dataset classifications according to structure, model construction, and model informing

While the terms training set and calibration set are often used interchangeably in the chemometric and data-science literature, it is important to distinguish between them for pharmaceutical applications. The authors suggest using dataset structure as the differentiator, with training sets being unstructured and calibration sets being intentionally structured according to a selected dataset design. The dataset design underlying a calibration set fixes the number of datapoints/samples required before collection begins; a training set, in contrast, can use as few or as many datapoints/samples as deemed necessary. The ICH definition of a calibration set specifies data within the operating range with corresponding reference samples [16]. Both training sets and calibration sets are used to construct and/or inform models.

Test and validation sets are structured datasets applied to assess the performance of the constructed model. A test set may additionally be used to further inform the model, whereas a validation set may not be used to continue constructing or informing the model in pharmaceutical manufacturing. This distinction is highlighted in the ICH definitions, where the test set is defined as a sample set similar to, but separate from, the calibration set in physical and chemical characteristics. A test set is intended to challenge the model with samples realistic to the application setting, and the results may further inform model parameter adjustments. Representative samples that adequately assess model performance are generated according to a dataset design. Validation sets are designed and applied from a regulatory perspective as a final test of model performance. This aligns with the ICH Q14 definition of a validation set, where the data provide an independent assessment of model performance, implying no model adjustments based on these data [3]. Dataset classification is essential as the pharmaceutical industry develops and tests novel modeling strategies.
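The distinct roles of these datasets can be sketched in code. In this hypothetical workflow (synthetic data, arbitrary parameter range), the calibration set constructs the model, the test set informs a parameter choice, and the validation set is used exactly once for an independent assessment.

```python
# Dataset roles in a model workflow (synthetic data; ranges are assumptions)
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X_cal, y_cal = rng.random((40, 300)), rng.random(40)    # constructs the model
X_test, y_test = rng.random((10, 300)), rng.random(10)  # informs parameters
X_val, y_val = rng.random((10, 300)), rng.random(10)    # assesses only

# The test set informs a model parameter (number of latent variables)
def test_error(k):
    model = PLSRegression(n_components=k).fit(X_cal, y_cal)
    return mean_squared_error(y_test, model.predict(X_test).ravel())

best_k = min(range(1, 6), key=test_error)

# The validation set is used once, with no further model adjustments
final = PLSRegression(n_components=best_k).fit(X_cal, y_cal)
print(mean_squared_error(y_val, final.predict(X_val).ravel()))
```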

Levels of Calibration Burden

The calibration burden for a particular model is best understood in terms of the required calibration design. The calibration design, a specific category of dataset design, refers to the organization of the spectral data and accompanying reference samples (collectively called calibration datapoints) used to construct and inform a particular model relative to the anticipated operating range. The calibration burden of a given calibration design reflects the number of calibration datapoints required to cover the operating range, relative to other calibration designs. Additionally, calibration datapoints collected according to a dataset design require control over the sample composition through careful generation or analytical methods. The relationships between the various model classifications (top) and the typically required calibration designs (bottom) are presented in Fig. 2.

Fig. 2 Model classification (top) and calibration design (bottom) organized from minimum to maximum number of reference samples required. The cube outlines in the calibration designs represent the operating range, and the red dots represent the unique calibration datapoints distributed over the operating range

A training set was described in the Dataset Classifications section as being unstructured. This is often the result of collecting a multitude of samples over time, defined here as an opportunistic training set. The goal of an opportunistic training set is to generate datapoints that cover as much of the operating range as possible, either from historical data or through data collection during routine production. This ensures that the model can predict new spectra based on similar or equivalent spectra included in the training set. ML models are the primary model class that may rely on an opportunistic training set [17]. However, manufacturing processes that limit the ability to create intentional calibration datapoints may also benefit from the opportunistic training set as a method for generating calibration data. The opportunistic training set can be very expensive in time, material, and financial terms, but may be necessary under specific manufacturing circumstances.

The full calibration design describes the entire operating range by capturing the relationships between all relevant sources of variance within it, rather than attempting to see "everything" as the opportunistic training set does. The definition of "calibration data set" provided by the ICH Q2(R1) document can generally be understood to refer to the full calibration design [16]. The focus on variance relationships promotes the use of dataset design, with the full factorial design as the archetypal dataset design for a full calibration. An efficient calibration design likewise seeks to describe the entire operating range, but with a reduced number of calibration datapoints relative to the full calibration. Several dataset designs lend themselves to efficient calibration, and proper selection enables model performance similar to a full calibration at a reduced calibration burden [18]. The full and efficient calibration designs are the main designs used to define the model space in decomposition models, as illustrated in the sketch below.
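The two designs can be contrasted with a toy example. The factors and levels below are hypothetical, and the "efficient" design shown is just one simple reduction (a center point plus one-factor-at-a-time extremes); in practice other reduced designs may be preferred.

```python
# Full factorial vs. one possible reduced design (factors/levels hypothetical)
from itertools import product

api = [90, 100, 110]         # API potency (% of target)
moisture = [1.0, 2.0, 3.0]   # moisture content (%)
filler = [48, 50, 52]        # filler fraction (%)

# Full calibration design: every combination of levels (3^3 = 27 datapoints)
full_design = list(product(api, moisture, filler))

# Reduced design: center point plus each factor's extremes (7 datapoints)
center = (100, 2.0, 50)
efficient_design = [center]
for i, levels in enumerate((api, moisture, filler)):
    for extreme in (min(levels), max(levels)):
        point = list(center)
        point[i] = extreme
        efficient_design.append(tuple(point))

print(len(full_design), len(efficient_design))  # 27 7
```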

A partial calibration design focuses on the sources of variance in the operating range that are directly relevant to the analyte(s) of interest. Under certain conditions, a partial calibration may only need to consider the variability of the target analyte(s); for pharmaceutical applications this can be achieved with a set of API potency steps. The minimal calibration design seeks to minimize the total number of calibration datapoints required to construct/inform the model and is not necessarily concerned with the operating range [19]. A minimal calibration may comprise as few as one calibration datapoint. The absence of any required calibration datapoints constitutes a calibration-free design [20]. Models utilizing a calibration-free design may still require spectral data as model inputs, but there are no accompanying reference samples. Pure component models may be considered to utilize a calibration-free design (the only spectral data required are the pure component spectra), while extended/augmented pure component models generally still depend on partial or minimal calibration designs.
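As a rough sketch of these lower-burden designs (all values hypothetical), a partial calibration might vary only API potency, while a minimal calibration might use a single reference datapoint, shown here as a simple one-point bias correction of an otherwise calibration-free model's output; the correction scheme is illustrative, not a prescribed method.

```python
# Partial and minimal calibration designs (illustrative values only)

# Partial calibration: vary only the target analyte across its range
partial_design = [70, 85, 100, 115, 130]   # API potency steps (% of target)

# Minimal calibration: a single datapoint informs the model, e.g., a
# one-point bias correction of a calibration-free model's prediction
raw_prediction = 101.3                     # model output for reference sample
reference_value = 100.0                    # validated reference measurement
bias = raw_prediction - reference_value

def corrected(prediction):
    """Apply the one-point bias correction to future predictions."""
    return prediction - bias

print(corrected(95.8))                     # 94.5
```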

Recommendations & Summary

Specific and consistent terminology is essential to describe model calibration throughout the field of chemometrics and spectroscopic PAT method development. As research toward minimizing the model calibration burden advances, the inconsistency of the vocabulary surrounding calibration burden has the potential to inhibit the progress of PAT method development in the pharmaceutical industry. Organizations such as ICH and IUPAC provide harmonized guidelines and terminology, which have contributed to the advancement of pharmaceutical research. By using harmonized language, regulators and industrial scientists can effectively communicate expectations and results, facilitating the review and approval of regulatory submissions. Establishing specific and consistent terminology around model calibration burden enables the rapid adoption and acceptance of reduced-calibration-burden modeling strategies by manufacturers and regulators.

Developing a chemometric model for deployment with a spectroscopic PAT method requires an understanding of the classes of datasets and the calibration burden associated with a particular modeling strategy. Clarifying the language used to describe dataset classifications guides the model calibration phase of the PAT method lifecycle. The proposed definitions for the varying levels of calibration burden are recommended for research investigating calibration burden reduction. Harmonization of these terms across the field of chemometrics is anticipated to enable effective knowledge communication and to advance modeling strategies in PAT applications.