Introduction

In recent years, the scientific community involved in human health risk assessment has directed growing attention toward the identification of human co-exposure to industrial chemicals or environmental substances and their potential effects on human health (Rotter et al. 2018). Despite this increased attention, most studies focus on exposure to a single substance, whereas human exposure generally results from diverse conditions and is therefore often a combination of several substances co-occurring in the environment (IPCS, 2009). The assessment of co-exposure to multiple substances or to substances in mixtures is generally known as combined or cumulative exposure assessment (Price and Chaisson 2005).

At the European level, initiatives such as the EuroMix project, which addresses a tiered strategy for the risk assessment of mixtures of multiple chemicals, tackle this challenge. The EuroMix project aims to establish and disseminate new, efficient, and validated test strategies for the toxicity of chemicals in a mixture in order to deliver refined information for the future safety assessment of chemicals (Beronius et al. 2020; Rotter et al. 2018). In addition, EuroMix aims for a refined strategy for grouping chemicals into cumulative assessment groups and prioritizing related data gaps. A further goal is to derive a harmonized approach to assessing risk that includes information on possible additive, synergistic, or antagonistic effects of the chemicals in the mixtures at real-life exposure levels (Beronius et al. 2020). To this end, human biomonitoring (HBM) studies can be a valuable research method and source of information. HBM investigates human exposure to chemicals and their effects through the systematic, standardized measurement of the concentrations of those compounds or their metabolites in human specimens (Angerer et al. 2007). HBM reflects the past and current aggregated internal exposures from all routes and sources at either an individual or a population level. Moreover, HBM studies are a reliable way of gaining insight into (individual) co-exposures to multiple substances. The application of HBM in exposure and risk assessment gained momentum at the European level with the European Human Biomonitoring Initiative HBM4EU (Ganzleben et al. 2017).

One important challenge in HBM studies is the interpretation of the HBM data from an exposure assessment and a risk management perspective with the aim, for instance, to identify the relevant exposure sources or pathways (Angerer et al. 2007). Such interpretation generally consists in linking biomarkers of exposure with external exposure estimates and information on exposure-relevant factors through multivariate analysis. The goal of this linkage is to understand the variation, for example, of external exposures, individual food consumption patterns, physiological parameters, socio-demographic factors, and their influences on concentrations of chemicals and their metabolites in human samples, such as blood and urine. Regarding the analysis of co-exposure to multiple substances, approaches jointly analyzing multiple input variables to multiple output variables are therefore of high importance.

Weighted quantile sum (WQS) regression, which was recently introduced for the analysis of health effects of chemical mixtures, is, to the best of our knowledge, one of the few available methods that can account for highly correlated data in co-exposure settings (Carrico et al. 2015; Keil et al. 2020; Lee et al. 2019; Tanner et al. 2019). However, in WQS regression, the co-exposure to multiple substances and the associated correlations are used to predict health effects in a single-output regression structure. The development of robust multi-output regression approaches that can extract insightful patterns from multivariate data is thus critical. In the field of machine learning, algorithms have recently been developed for multi-output regression and classification problems. These methods rely on diverse methodologies including, for instance, multiple independent single-target methods and input or output space expansion approaches. Tsoumakas et al. (2014) introduced the random linear target combinations (RLC) method, an output space expansion approach that generates new output variables from random linear combinations of the original ones. Spyromitros et al. (2016) presented two methods based on input space expansion: stacked single-target (SST), which fits single-output regressions on the input variables expanded by predictions of the output variables, and the ensemble of regressor chains (ERC), which is built on single-output regressions fitted on the input variables sequentially expanded by the output variables. Despite their efficiency from a predictive perspective, these algorithm-based methods are generally of limited help in understanding the dependence structure between output variables within the data. Copula-based regression models appear to be an adequate alternative since they provide, while relating multiple output and input variables, a mathematical representation of the dependence structure of the variables of interest.
Copulas are flexible probabilistic tools for modeling the joint distribution of a random vector and are particularly convenient for capturing the dependence structure among the vector components (Park et al. 2021; Smith 2013; Song et al. 2009). Copula-based regression modeling was recently implemented in diverse applications such as the electricity market, the prediction of crash counts, web marketing, and econometrics (Park et al. 2021; Pitt et al. 2006; Sahu et al. 2003; Smith et al. 2012). In addition, the combination of copula-based regression and the Bayesian approach offers an appropriate framework that makes it possible not only to estimate the parameters of the model but also to conveniently accommodate the complexity stemming from the copula representation. Indeed, the copula regression model is difficult to estimate by classical maximum likelihood estimation when the multivariate dimension is high, as the likelihood becomes intractable (Smith et al. 2012).

The main objective of the present study was to propose a flexible and robust mathematical modeling framework that could help advance the assessment of human co-exposure to multiple substances. Because of their wide distribution, accumulation, and persistence in the environment and in biological systems, cadmium (Cd) and lead (Pb) were selected as candidate substances. Both substances are relatively well characterized with respect to their exposure pathways and potential adverse health effects in humans (Heinemeyer and Bösing 2020; IARC 2006; Jan et al. 2015; Tchounwou et al. 2012; Timothy and Williams 2019; WHO, 2009).

In this work, a copula-based regression model was developed to analyze the internal co-exposure of children aged three to fourteen years to cadmium and lead. The copula regression model was applied to HBM data (Cd and Pb in whole blood) collected in the German Environmental Survey for Children (GerES IV) and to the individual information (e.g., age and other covariates such as tobacco smoking at home or regional aspects) collected using standardized interviews and questionnaires. This analysis is intended to improve the understanding of the variations of Cd and Pb concentrations in human blood in conjunction with potential determinants of the external exposure to these heavy metals. The unknown quantities of the copula regression model were estimated, under the Bayesian framework, by performing Markov chain Monte Carlo (MCMC) simulation. The influence of the characterization of correlations provided by the implemented model was assessed using GerES IV data, and the predictive performance of the model was evaluated against machine learning algorithms using simulated data.

Materials and Methods

Data

The modeling framework that was implemented here was built by considering:

  (i) The concentrations of Cd and Pb in whole blood samples of children aged 3 to 14 years participating in GerES IV as outputs, and

  (ii) The ancillary exposure information collected for these participants as input variables assumed to influence the co-exposure to Cd and Pb.

GerES IV was conducted from 2003 to 2006 by the German Environment Agency (UBA) with the aim to provide representative data for health-related environmental monitoring and reporting at a national level (Schulz et al. 2012). GerES IV is the environmental module of the German Health Interview and Examination Survey for Children and Adolescents (KiGGS baseline study) that was carried out by the Robert Koch Institute (RKI) (Kurth et al. 2008). GerES IV allowed for the first time to include children aged 3 to 5 years old and to update information collected in GerES II for children aged 6–14 years (Schulz et al. 2007). In total, 150 sampling points in Germany were included and 1790 children participated in GerES IV. In addition, GerES IV was organized in four modules (a base module providing measurements of substances in human samples and interviews as well as three additional modules in which indoor air measurements, stress hormones and noise pollution, and sensitization to indoor mold were analyzed, respectively), which resulted overall in more than 1000 variables with various pieces of information (Schulz et al. 2007, 2012). Methods for the analysis of Pb and Cd in blood (PbB and CdB) are described elsewhere (Becker et al. 2008).

In our analysis, an initial data extraction procedure was applied to the global GerES IV dataset contained in the public use file made available by UBA, which consisted in:

  • The selection of input variables by using preliminary multivariate analyses (linear regression for continuous variables or analysis of variance for categorical variables) to investigate the dependence between output variables and the considered input variables. The input variables that most significantly influenced the variations of PbB and CdB concentrations were selected. These developments are not presented here, but further information is provided in supplementary materials S1 and S2.

  • The removal of individual records with missing data and individual records for which the concentrations of CdB or PbB were below the respective limits of quantification.

This selection phase resulted in a dataset with 480 individual records and 5 input variables (age, number of smokers living in the same dwelling as the participant, sex, living in former East or West Germany, and the child's smoking status). The potential impact of this selection phase on the analysis is discussed in Section Study Limitations. Figure 1 illustrates the relationship between CdB and PbB.

Fig. 1 Illustration of the relationship between Cd and Pb in whole blood. The logarithm of the concentration data (in µg/L) drawn from GerES IV and considered in the analysis (i.e., the data resulting from the selection phase described in Section Data) is represented

Modeling

Copula-Based Regression Model

The modeling framework presented here relies on the concept of copulas. Copulas are multivariate functions defined, according to Sklar (1959), as joint multidimensional distribution functions having uniformly distributed margins (\({U}_{j}\) with \(j = 1,\cdots ,q\)) on [0,1] as described by Eq. (1).

$$\begin{array}{*{20}c} {{\mathbb{C}}: \left[ {0,1} \right]^{q} \to \left[ {0,1} \right]} \\ {{\mathbb{C}}\left( u \right)\, = \,{\mathbb{C}}\left( {u_{1} , \cdots , u_{q} } \right)\, = \,{\mathbb{P}}\left( {U_{1} \le u_{1} , \cdots , U_{q} \le u_{q} } \right)} \\ \end{array}$$
(1)

where \(u = ({u}_{1}, \dots , {u}_{q})\) is the vector of the uniformly distributed margins.

A broad variety of families of copula functions exists (Nelsen 2006). Depending on the nature of the problem (e.g., dependence between the considered variables, computational complexity, etc.), an adequate choice of copula function is required. However, methods to adequately select between copula functions are lacking (Manner 2007). Among these families, elliptical copulas and Archimedean copulas are widely used in high-dimensional problems, since they can straightforwardly be constructed from their parametric forms. Archimedean copulas (e.g., Clayton, Frank, Gumbel, Joe) are known to be easily derived and to accommodate different types of dependence, whereas elliptical copulas, which are based on elliptical distributions (e.g., the Gaussian and Student's t), offer classical correlation structures useful to fully describe the dependencies between variables in multivariate cases (Atique and Attoh-Okine, 2018). Elliptical copulas are of particular interest in this analysis because they can be constructed from a continuous multivariate distribution as follows (Smith et al. 2012; Song et al. 2009). Let \(Y=({Y}_{1}, \cdots , {Y}_{q})\) be a vector of random variables with a multivariate distribution function \({F}_{Y}(y;\theta )\), marginal distribution functions \({F}_{j}({y}_{j})\), and marginal densities \({f}_{j}({y}_{j})\) for \(j = 1,\cdots ,q\), where \(y=({y}_{1},\dots ,{y}_{q})\) and \(\theta\) represents the set of parameters of the copula function. This multivariate distribution yields the copula function below.

$${\mathbb{C}}\left( {u;\theta } \right)\, = \,{\mathbb{P}}\left( {F_{1} \left( {Y_{1} } \right) \le u_{1} , \cdots , F_{q} \left( {Y_{q} } \right) \le u_{q} } \right)\, = \,F_{Y} \left( {F_{1}^{ - 1} \left( {u_{1} } \right), \cdots ,F_{q}^{ - 1} \left( {u_{q} } \right);\theta } \right)$$
(2)

The copula framework that we adopted was built using a multivariate Gaussian distribution for several reasons, including the parsimony of the associated parameter inference (it has a less complex structure than the Student's t copula), the possibility it offers to capture the dependency structure of the data using the classical correlation matrix, and the symmetry observed between both variables of interest as shown in Fig. 1 (Joe 2014; Park et al. 2021). The Gaussian copula has the following general form:

$${\mathbb{C}}\left( {u;R} \right)\, = \,{\mathbb{C}}\left( {u_{1} , \cdots ,u_{q} ;R} \right)\, = \,{\Phi }_{R} \left( {\phi^{ - 1} \left( {u_{1} } \right), \cdots ,\phi^{ - 1} \left( {u_{q} } \right)} \right)$$
(3)

where \({\Phi }_{R}\) represents the cumulative distribution function (CDF) of a multivariate Gaussian \({N}_{q}(0,R)\) with a q-dimensional vector of zeros as mean and R as correlation matrix, and \({\phi }^{-1}\) corresponds to the inverse of the CDF (i.e., the quantile function) of a standard univariate Gaussian \(N(\mathrm{0,1})\).
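As an illustration of Eq. (3), the following minimal Python sketch evaluates a bivariate Gaussian copula and draws samples from it. It relies on NumPy and SciPy, which are assumptions of this example rather than tools reported in the study; the function names and the correlation value 0.6 are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def gaussian_copula_cdf(u, R):
    """Evaluate the Gaussian copula of Eq. (3) at u for correlation matrix R."""
    z = norm.ppf(np.asarray(u, dtype=float))      # standard normal quantiles of each u_j
    mvn = multivariate_normal(mean=np.zeros(len(z)), cov=R)
    return float(mvn.cdf(z))                      # CDF of N_q(0, R) at those quantiles


def sample_gaussian_copula(n, R, seed=0):
    """Draw n vectors with uniform margins and Gaussian dependence R."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(R.shape[0]), R, size=n)
    return norm.cdf(z)                            # map each margin to [0, 1]


R_indep = np.eye(2)
R_dep = np.array([[1.0, 0.6], [0.6, 1.0]])        # illustrative value only
u = [0.3, 0.7]
c_indep = gaussian_copula_cdf(u, R_indep)         # close to 0.3 * 0.7 under independence
c_dep = gaussian_copula_cdf(u, R_dep)             # larger under positive dependence
samples = sample_gaussian_copula(5000, R_dep)
```

With R equal to the identity matrix, the copula reduces to the product of its arguments, which is the independence case discussed later for the likelihood.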

Marginal Regressions

A critical aspect for building a copula-based regression model concerns the marginal distributions of the multivariate distribution. The concentrations of cadmium and lead in whole blood, which are positive and right-skewed variables, can be modeled using the lognormal distribution that is often well suited for such characteristics (Ott 1990).

In this work, we opted to first log-transform the Cd and Pb concentrations and to model log-Cd and log-Pb using univariate normal distributions, which is equivalent to assuming lognormal distributions for both marginal distributions. Therefore, the marginal regression models are expressed as follows:

$$\begin{array}{*{20}c} {Y_{ij} \, = \,X_{i} .\beta_{j} \, + \,\varepsilon_{ij} } \\ {\varepsilon_{ij} \sim N\left( {0,\sigma_{j} } \right)} \\ \end{array}$$
(4)

The \({Y}_{ij}\) are the logarithms of the Cd and Pb concentrations for individual \(i\) and substance \(j\), with \(i = 1,\dots ,n\) and \(j \in \{1, 2\}\) (1 for Cd and 2 for Pb). \({X}_{i}=(1, {X}_{i,1}, \cdots , {X}_{i,p})\) is the (p + 1)-dimensional vector of covariates for individual \(i\), with the first element equal to one for the intercept, and p = 5 is the number of input variables considered (see Section Data). The (p + 1)-dimensional vectors of regression coefficients \({\beta }_{j}\) and the real-valued residual standard deviations \({\sigma }_{j}\) for each substance are unknown quantities to be inferred from the data.
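The equivalence between a normal model for the log-concentration and a lognormal model for the raw concentration can be checked numerically. The sketch below uses SciPy (an assumption of this example); all covariate, coefficient, and concentration values are hypothetical and are not taken from GerES IV.

```python
import numpy as np
from scipy.stats import lognorm, norm

# Hypothetical values for one child and one substance (not from GerES IV).
x_i = np.array([1.0, 0.5, 2.0])        # intercept plus two illustrative covariates
beta_j = np.array([-1.2, 0.04, 0.07])  # illustrative regression coefficients
sigma_j = 0.45                         # illustrative residual standard deviation

mu_ij = x_i @ beta_j                   # linear predictor X_i . beta_j on the log scale
conc = 0.3                             # a concentration in µg/L

# CDF of the normal model evaluated at the log-concentration ...
p_normal = norm.cdf(np.log(conc), loc=mu_ij, scale=sigma_j)
# ... equals the CDF of the lognormal model evaluated at the raw concentration.
p_lognormal = lognorm.cdf(conc, s=sigma_j, scale=np.exp(mu_ij))
```

Both probabilities coincide, so fitting Eq. (4) to log-transformed data corresponds to lognormal margins on the original scale.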

Bayesian Inference

The Bayesian approach is, due to its flexibility, suitable to accommodate the complexity of the copula model. This flexibility arises from combining a priori knowledge on the parameters of interest (priors) with the likelihood that the chosen structure produced the observations at hand (likelihood) in order to generate posteriors. The posteriors give an inferred, a posteriori representation of the modeled system in which the ranges of variation of the parameters are characterized (Gelman et al. 2013).

Likelihood

In statistical modeling and specifically parameter inference, the likelihood describes the probability that a set of observed data was produced by a given parametric model. This joint probability of the observed data is then represented as a function of the model parameters. In this copula-based regression model, the likelihood is given by:

$${\mathbb{P}}\left( {Y|\beta_{j} ,\sigma_{j} ;R} \right) \propto \left| R \right|^{{ - \frac{n}{2}}} .exp^{{\left( { - \frac{1}{2}\mathop \sum \limits_{i = 1}^{n} Z_{i} \left( {R^{ - 1} - I_{q} } \right)Z_{i}^{t} } \right)}} .exp^{{\left( { - \frac{1}{2}\mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{q} \frac{1}{{\sigma_{j}^{2} }}\left( {y_{ij} - X_{i} .\beta_{j} } \right)^{2} } \right)}} .\mathop \prod \limits_{j = 1}^{q} \left( {\sigma_{j}^{2} } \right)^{{ - \frac{n}{2}}}$$
(5)

where \(\propto\) denotes proportionality up to a known normalizing constant and \({I}_{q}\) is the q-dimensional identity matrix. The \({Z}_{i}=({Z}_{i1},\dots ,{Z}_{iq})\) are q-dimensional vectors of latent variables with \({Z}_{ij}={\Phi }^{-1}\left({F}_{j}({y}_{ij})\right)\). The terms \({\Phi }^{-1}\), \({F}_{j}\), and \({f}_{j}\) correspond to the quantile function of a standard normal distribution, the cumulative distribution function, and the density of the marginal regression models, respectively.

The first two factors of Eq. (5), which are functions of the correlation matrix R, embody the dependence structure of the model. If no correlation between the variables of interest is considered (i.e., R is the identity matrix), these two factors equal one and the likelihood reduces to the product of the marginal probability density functions, which corresponds to the likelihood of a multivariate distribution with independent margins.
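The structure of Eq. (5) can be made concrete with a short Python sketch of the log-likelihood. This is a minimal illustration under the assumption of normal margins on the log scale, for which \({Z}_{ij}\) is simply the standardized residual; the function name `copula_loglik` and the simulated data are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np
from scipy.stats import norm


def copula_loglik(Y, X, beta, sigma, R):
    """Logarithm of Eq. (5), up to an additive constant.

    Y: (n, q) log-concentrations; X: (n, p+1) design matrix with intercept;
    beta: (p+1, q) coefficients; sigma: (q,) residual SDs; R: (q, q) correlation.
    """
    n, q = Y.shape
    Z = (Y - X @ beta) / sigma             # latent Z_ij for normal margins
    _, logdet = np.linalg.slogdet(R)
    A = np.linalg.inv(R) - np.eye(q)
    # First two factors of Eq. (5): the dependence structure.
    dependence = -0.5 * n * logdet - 0.5 * np.einsum("ij,jk,ik->", Z, A, Z)
    # Last two factors of Eq. (5): the marginal regressions.
    margins = -0.5 * np.sum(Z**2) - n * np.sum(np.log(sigma))
    return dependence + margins


# Simulated data for illustration only.
rng = np.random.default_rng(1)
n, p, q = 50, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=(p + 1, q))
sigma = np.array([0.45, 0.5])
Y = X @ beta + rng.normal(size=(n, q)) * sigma

ll_indep = copula_loglik(Y, X, beta, sigma, np.eye(q))
ll_normal = norm.logpdf(Y, loc=X @ beta, scale=sigma).sum()
const = 0.5 * n * q * np.log(2 * np.pi)    # constant dropped by the proportionality
```

With R set to the identity matrix, `ll_indep` matches the sum of the independent normal log-densities once the dropped constant is added back, which is the reduction described above.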

Priors

The specification of the priors for unknown quantities is an essential requirement in Bayesian analysis. The defined priors are described below:

$$\begin{array}{*{20}c} {\pi \left( {\sigma_{j}^{2} } \right) \sim InvGamma\left( {\alpha_{0} ,\gamma_{0} } \right)} \\ {\pi \left( {\beta_{j} } \right) \sim N_{p + 1} \left( {b_{0}^{j} ;B_{0}^{ - 1} } \right)} \\ {\pi \left( R \right) \sim I\left\{ {R_{jk} :R_{jk} \, = \,1 \left( {j\, = \,k} \right), \left| {R_{jk} } \right| < 1 \left( {j \ne k} \right) {\text{and}} R {\text{positive definite}}} \right\}} \\ \end{array}$$
(6)

where \({\alpha }_{0}={\gamma }_{0}=0.01\) are the hyperparameters of the inverse gamma distribution defined for the variance of the residuals. For the regression coefficients \({\beta }_{j}\), the hyperparameters \(b_{0}^{j} \, = \,\left( {0, \cdots ,0} \right)^{\prime }\), corresponding to the mean vector, and \({B}_{0}=0.01\times {I}_{p+1}\), corresponding to the precision matrix (the inverse of the covariance matrix), were adopted for the (p + 1)-dimensional normal distribution. For the correlation matrix R, a non-informative prior satisfying the constraints for a correlation matrix was defined. This prior is uniform on the restricted space of correlation matrices (Barnard et al. 2000; Chib and Winkelmann 2001; Park et al. 2021).
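The priors of Eq. (6) can be sketched as a log-density function. The example below assumes that the inverse gamma distribution is parameterized by shape \({\alpha }_{0}\) and scale \({\gamma }_{0}\), which is one common convention but is not stated explicitly in the text; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal


def is_valid_correlation(R, tol=1e-10):
    """Unit diagonal, symmetric, off-diagonals in (-1, 1), positive definite."""
    R = np.asarray(R, dtype=float)
    if not np.allclose(R, R.T) or not np.allclose(np.diag(R), 1.0):
        return False
    off = R[~np.eye(R.shape[0], dtype=bool)]
    if np.any(np.abs(off) >= 1.0):
        return False
    return bool(np.all(np.linalg.eigvalsh(R) > tol))


def log_prior(beta, sigma2, R, alpha0=0.01, gamma0=0.01, precision=0.01):
    """Sum of the log prior densities; -inf outside the correlation-matrix space.

    beta: (p+1, q) coefficients; sigma2: (q,) residual variances; R: (q, q).
    Assumption: InvGamma uses shape alpha0 and scale gamma0.
    """
    if not is_valid_correlation(R) or np.any(np.asarray(sigma2) <= 0):
        return -np.inf
    k = beta.shape[0]
    cov = np.eye(k) / precision            # B0^{-1} with B0 = 0.01 * I
    lp = invgamma.logpdf(sigma2, a=alpha0, scale=gamma0).sum()
    lp += sum(multivariate_normal.logpdf(beta[:, j], mean=np.zeros(k), cov=cov)
              for j in range(beta.shape[1]))
    return float(lp)                       # the uniform prior on R only adds a constant


beta0 = np.zeros((6, 2))                   # p + 1 = 6 coefficients, q = 2 substances
s2 = np.array([0.2, 0.25])                 # illustrative variances
lp_ok = log_prior(beta0, s2, np.array([[1.0, 0.1], [0.1, 1.0]]))
lp_bad = log_prior(beta0, s2, np.array([[1.0, 1.2], [1.2, 1.0]]))
```

The second call returns minus infinity because the off-diagonal value 1.2 violates the constraints of a correlation matrix.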

Posteriors and MCMC Simulation

By applying Bayes' theorem, the following full joint posterior distribution of all the unknown parameters is derived.

$${\mathbb{P}}\left( {\sigma_{1}^{2} , \cdots ,\sigma_{q}^{2} ;\beta_{1} , \cdots ,\beta_{q} ;R|Y} \right) \propto {\mathbb{P}}\left( {Y|\beta_{j} ,\sigma_{j} ;R} \right)\, \times \,\pi \left( R \right)\, \times \,\mathop \prod \limits_{j = 1}^{q} \pi \left( {\sigma_{j}^{2} } \right)\, \times \,\mathop \prod \limits_{j = 1}^{q} \pi \left( {\beta_{j} } \right)$$
(7)

Due to the complexity of this distribution, an MCMC simulation scheme consisting of 2 chains, \(10^{5}\) total iterations including a burn-in period of \(5 \times 10^{4}\) iterations, and a thinning parameter fixed at 10 was implemented and is organized as follows:

At each iteration and for each simulated Markov chain,

  1. The posterior of the standard deviations of the residuals \({\sigma }_{j}\) (with \(j=1, 2\)) is sampled using a random walk Metropolis–Hastings algorithm,

  2. The posterior of the regression coefficients \({\beta }_{j}\) is sampled using a random walk Metropolis–Hastings algorithm, and

  3. The posterior of the correlation matrix is sampled using the parameter expansion and reparameterization Metropolis–Hastings procedure, which was developed by Liu and Daniels (2006) and recently applied by Park et al. (2021).

Because of its non-triviality, a more detailed description of the sampling procedure applied for the correlation matrix R is given in Section S1 of the Supporting Information S1.
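The block-wise sampling scheme can be illustrated with a simplified random walk Metropolis–Hastings sketch for the bivariate case. Several simplifications are assumptions of this example and not the study's implementation: the correlation \({R}_{12}\) is updated by a plain random walk instead of the parameter expansion and reparameterization procedure (valid here only because, for q = 2, positive definiteness reduces to \(|{R}_{12}|<1\)); the chain is run on the residual variances and the standard deviations are obtained by taking square roots of the draws; the chain starts at least-squares estimates; the step sizes, iteration counts, and simulated data are illustrative.

```python
import numpy as np


def log_post(Y, X, beta, s2, r):
    """Unnormalized log posterior for q = 2 outputs.

    s2 holds the residual variances and r = R_12. Assumed priors:
    InvGamma(shape=0.01, scale=0.01) on each variance, N(0, 100 * I) on each
    coefficient vector, and a uniform prior on r over (-1, 1).
    """
    if np.any(s2 <= 0) or abs(r) >= 1:
        return -np.inf
    n = Y.shape[0]
    Z = (Y - X @ beta) / np.sqrt(s2)
    quad = (Z[:, 0] ** 2 - 2 * r * Z[:, 0] * Z[:, 1] + Z[:, 1] ** 2).sum() / (1 - r ** 2)
    loglik = -0.5 * n * np.log(1 - r ** 2) - 0.5 * quad - 0.5 * n * np.log(s2).sum()
    lp_s2 = (-(0.01 + 1) * np.log(s2) - 0.01 / s2).sum()
    lp_beta = -0.5 * 0.01 * (beta ** 2).sum()
    return loglik + lp_s2 + lp_beta


def rw_metropolis(Y, X, n_iter=4000, burn_in=2000, thin=10, seed=0,
                  step_beta=0.03, step_s2=0.03, step_r=0.1):
    """One chain; each iteration updates the variances, then beta, then r."""
    rng = np.random.default_rng(seed)
    beta0, *_ = np.linalg.lstsq(X, Y, rcond=None)       # least-squares start
    s2_0 = ((Y - X @ beta0) ** 2).mean(axis=0)
    state = {"beta": beta0, "s2": s2_0, "r": 0.0}
    steps = {"beta": step_beta, "s2": step_s2, "r": step_r}
    current = log_post(Y, X, **state)
    draws = []
    for it in range(n_iter):
        for name in ("s2", "beta", "r"):
            proposal = dict(state)
            noise = rng.normal(size=np.shape(state[name]))
            proposal[name] = state[name] + steps[name] * noise   # symmetric proposal
            cand = log_post(Y, X, **proposal)
            if np.log(rng.uniform()) < cand - current:           # accept or reject
                state, current = proposal, cand
        if it >= burn_in and (it - burn_in) % thin == 0:
            draws.append((state["beta"].copy(), np.sqrt(state["s2"]), float(state["r"])))
    return draws


# Simulated data with a known correlation (illustration only).
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([[-1.0, 2.5], [0.05, -0.1], [0.3, 0.1]])
sd_true, r_true = np.array([0.45, 0.5]), 0.5
L = np.linalg.cholesky(np.array([[1.0, r_true], [r_true, 1.0]]))
Y = X @ beta_true + (rng.normal(size=(n, 2)) @ L.T) * sd_true

draws = rw_metropolis(Y, X)
r_draws = np.array([d[2] for d in draws])
```

In practice, several chains with far more iterations would be run, as in the scheme described above, and convergence would be checked before the draws are summarized.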

Predictability of the Bayesian Copula Model

The predictive performances of the Bayesian copula model were evaluated by following two approaches:

  • In the first approach, the influence of characterizing the dependence structure between the Cd and Pb concentrations on the model predictability is evaluated by comparing the results drawn from the Bayesian copula-based model (i.e., the predicted Pb and Cd concentrations) with those of the multivariate model with no correlation, which is equivalent to defining R as an identity matrix.

  • In the second approach, an experiment was conducted to compare the Bayesian copula model with the random linear combinations, ensemble of regressor chains, and stacked single-target methods, which are machine learning (ML) algorithms developed for multi-target regression (Spyromitros et al., 2016; Tsoumakas et al., 2014). To this end, a synthetic dataset that consisted of three output variables and six input variables was generated. The generation of the synthetic data, the ML algorithms, and the metrics used to compare the predictions are described in the supplementary material S1.

Results

Bayesian Inference

Descriptive statistics of the unknown parameters of the Bayesian copula-based regression model were calculated from the posteriors sampled with MCMC and are displayed in Table 1.

Table 1 Descriptive statistics of the parameters´ posteriors drawn from the Bayesian copula-based regression (BCR) model applied to the GerES IV data

The median of the correlation coefficient \({R}_{12}\) between Cd and Pb concentrations in children's blood samples, which is central in the Bayesian copula-based regression model, is estimated at 0.067 with a 95% credible interval (abbreviated here as 95% CI; the Bayesian counterpart of the frequentist confidence interval) ranging from − 0.034 to 0.167. Figure 2 illustrates the posterior distribution of \({R}_{12}\). This posterior distribution indicates a minor to negligible dependency between the concentrations of Cd and Pb in whole blood of the children participating in GerES IV.
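Summaries such as the posterior median and the 95% CI are computed directly from the retained MCMC draws. The sketch below assumes an equal-tailed interval, which the text does not specify, and uses synthetic stand-in draws rather than the GerES IV posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for thinned MCMC draws of R_12 (synthetic, not the GerES IV posterior).
r12_draws = np.clip(rng.normal(loc=0.07, scale=0.05, size=10_000), -0.999, 0.999)

median = np.median(r12_draws)
ci_low, ci_high = np.percentile(r12_draws, [2.5, 97.5])   # equal-tailed 95% interval
ci_range = ci_high - ci_low                                # width of the interval
prob_positive = np.mean(r12_draws > 0)                     # share of draws with R_12 > 0
```

The interval width `ci_range` is the quantity referred to as the range of the 95% CI when the parametric uncertainty of different parameters or models is compared.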

Fig. 2 Density of the posterior distribution of the correlation coefficient between Cd and Pb in whole blood concentrations

The standard deviations obtained for Cd and Pb are in similar ranges with 95%CI varying from 0.422 to 0.487 for Cd and 0.480 to 0.551 µg/L for Pb, respectively. However, proportionally to the order of magnitude of the concentration values, these standard deviations denote a significantly higher variability of Cd concentrations in blood in comparison with Pb.

Because of the logarithmic transformation of Cd and Pb concentrations, the interpretation of the derived regression coefficients \({\beta }_{j}\) can be difficult. Nevertheless, the influence of the input variables was evaluated by analyzing the variations of Cd and Pb concentrations induced by variations of the input variables. Specifically, the age of the children generates a slight increase (about 6%) of the Cd concentration and a mild decrease (about 15%) of the Pb concentration, as the regression coefficient \({\beta }_{j1}\) is equal to 0.043 with 95% CI [− 0.008, 0.092] for Cd and to − 0.103 with 95% CI [− 0.159, − 0.047] for Pb. For sex and the former West/East Germany sampling point variable, mild variations in opposite directions are noticed between Cd and Pb. Indeed, about a 9% decrease between girls and boys and a 14% decrease from former West to East Germany are observed for the concentration of Cd, whereas increases of around 12% between girls and boys and 8% from former West to East Germany are observed for Pb. Regarding the input variables associated with the smoking behavior around the children (the number of smokers in the dwelling and the smoking status), significant increases (about 50% and 25%) are induced on Cd concentrations with \({\beta }_{12}=0.067\) with 95% CI [0.020, 0.112] and \({\beta }_{15}=0.733\) with 95% CI [0.529, 0.928]. For Pb, mild increases (about 10% and 12%) are observed. This observation is consistent with the association between the exposure to Cd and tobacco smoke, which is known to be much stronger than the association between Pb and tobacco smoke (Bernhard et al. 2005).

Overall, similarities are observed between both marginal regression models considering the parametric uncertainty. Indeed, the parameter-specific ranges of the 95% CIs are for Cd and for Pb in the same order of magnitudes. For instance, the ranges for the regression coefficient associated with children age \({\beta }_{j1}\), the number of smokers in the dwelling \({\beta }_{j2}\) , or their smoking status \({\beta }_{j5}\) are equal to 0.100, 0.092 and 0.399 for Cd, and 0.112, 0.103, and 0.458 for Pb, respectively. An identical observation is made for the standard deviations of the residuals \({\sigma }_{j}\), as the 95% CI ranges are 0.06 and 0.07 for Cd and Pb, respectively. Therefore, in both marginal regressions, the same level of imprecision or variations around the central input and output values (mean or median) is observed. This could result from the shrinkage effect stemming from the application of the log transformation on the marginal concentration values, which are both coming from a right-skewed distribution.

Influence of Characterizing the Dependence Structure

Fixing the correlation coefficient \({R}_{12}\) at zero reduced the Bayesian copula model to independent multivariate regression models (one each for Cd and Pb), which enabled us to evaluate the impact of modeling the dependence structure between Cd and Pb concentrations. The resulting posterior distributions of the model parameters are summarized by the descriptive statistics presented in Table 2. A significant shrinkage of the parametric uncertainty is noted since the ranges of the 95% CIs, which vary from 0.001 to 0.01, are tenfold lower than those produced by the Bayesian copula-based regression model. The standard deviations of the residuals are also smaller than those generated by the Bayesian copula-based regression model by a factor ranging from 40 to 50. This illustrates a better coverage of the global uncertainty using the copula-based framework. Indeed, the uncertainty that stems from the correlation matrix R, and thus from the dependence structure of the output space, is propagated to the other model parameters and consequently characterized by their posteriors.

Table 2 Descriptive statistics of the parameters' posteriors drawn from the independent multivariate regression models applied to the GerES IV data. The independent multivariate regression models are equivalent to the BCR model with the correlation coefficient \({R}_{12}\) being fixed at 0

An identical observation (i.e., a better characterization of the parametric uncertainty by the copula-based model) is made considering the evaluations using simulated data as shown in Table S1 of the supplementary material S1. The derived 95% CIs drawn from the copula-based model cover the simulated parameter values better than the independent multivariate regression models. Despite these discrepancies between both models, similar predictions on the test set extracted from the GerES IV data based on the medians of model parameters are produced as illustrated in Fig. 3.

Fig. 3 Observations vs Predictions graph. The median values of the predictions drawn from the Bayesian copula regression (with correlations) and from the Bayesian independent multivariate regressions (without correlations) models are represented by squares and triangles, respectively. The dotted red line corresponds to a perfect match, and the black dotted lines illustrate a two-order range of uncertainty

Discussion

Lessons Learned from this Analysis

In this study, we proposed a modeling approach with the underlying objective of enhancing the analysis and the assessment of the cumulative exposure to multiple chemicals. This modeling approach relies first on the use of biomarkers of exposure and exposure-related data collected from HBM studies, which generally carry valuable information about the past and current exposure of human populations and potential exposure-related determinants. As candidate substances, cadmium and lead, two heavy metals that have been extensively studied over the last decades, were selected for illustrative purposes. Second, the proposed approach was built on the implementation of a copula-based model to solve the multi-output and multivariate regression problem posed by the simultaneous consideration of Cd and Pb concentrations in whole blood of German children as outputs and the analysis of the influence of ancillary exposure-related information treated as inputs of the marginal regression models. Specifically, the copula framework made it possible to accommodate the complexity added by the characterization of the dependence structure between Cd and Pb concentrations. The developed model was fitted to the study data by using Bayesian inference, which helped us to characterize the uncertainty of the model parameters, namely the marginal regression coefficients, the standard deviations of the residuals, and the correlation matrix R, which is the key parameter in this copula model.

A minor to negligible dependence was observed between Cd and Pb concentrations in whole blood, with low correlation coefficient values (the 97.5th percentile of the correlation coefficient is smaller than 0.20). This means that high or low exposure to Cd does not imply high or low exposure to Pb, and vice versa. Even though both heavy metals co-occur in environmental and consumer matrices (air, soil, dust, food, drinking water, and other consumer products), the negligible dependence observed in our analysis does not support analyzing the co-exposure to these two elements using the notion of correlation. Exposure sources and pathways specific to each substance, or different kinetics within the human organism, could be investigated to explain this low or absent association. Moreover, slightly stronger associations between PbB and CdB have been reported in the literature and explained either by local co-contamination from soil (King et al. 2015) or by maternal age (Nakayama et al. 2019).

Higher parametric uncertainty and residual values were drawn from the copula-based model in comparison with the multivariate regression models without dependence characterization. This is explained by the inclusion, within the copula-based model structure, of a supplementary source of uncertainty related to the correlation between Cd and Pb concentrations in blood and its inter-dependencies with the marginal regression parameters \({\beta }_{j}\) and \({\sigma }_{j}\), as illustrated by the correlation-specific terms of the model likelihood given in Section Bayesian Inference. In this regard, Bayesian inference combined with MCMC simulation offered the flexibility required to tackle the complexity introduced in the model by the estimation of the coefficient of the correlation matrix. This observation highlights the importance of characterizing the global (parametric) uncertainty in the context of highly correlated outputs. Indeed, the evaluations drawn from the simulated data showed a better agreement between the estimated and observed parameter values and a better coverage of the parametric uncertainty around the various coefficients of the marginal regression models derived from the copula-based modeling approach.

The fitted copula-based model and the estimated parameter values revealed insightful relationships between the input variables and the Cd and Pb concentrations. For Cd, the whole blood concentration increases with children's age, whereas Pb in whole blood is negatively associated with age. Due to its ubiquity in the environment, humans can be exposed to Cd via numerous pathways, including various dietary sources (EFSA 2009). This is supported by first GerES V observations of higher internal Cd exposure being associated with a vegetarian diet in bivariate analysis for participants aged 3 to 17 years (Vogel et al. 2021). The decrease of Pb with age has been observed in several other studies. A possible explanation discussed in the literature is that Pb exposure can be higher in younger children because of the global oral exposure, which is highly relevant for this subgroup of the population due to hand-to-mouth behavior and the accumulation of Pb in soils and dusts (Burm et al. 2016). The population body burden of lead has decreased strongly in recent decades following the ban of Pb in fuel since the 1970s in Europe, which has considerably reduced the release of Pb into the environment, as indicated by the review and collation of historical data by Bierkens et al. (2011) and by Lermen et al. (2021). Slight differences in Cd and Pb concentrations were noticed between German boys and girls and between former West and East Germany. Physiological processes can induce differences between males and females regarding the growth of the human body, particularly during adolescence (Ramos et al. 1998; Schedler et al. 2019). Moreover, the body burden of Pb is associated with skeletal growth, which could thus explain the observed differences in Pb concentrations between German boys and girls (EFSA 2012; O'Flaherty 1991). This aspect might also be discussed in view of the observed decrease with age (Tebby et al. 2022).
Considering that GerES IV was performed only around 15 years after the German reunification, differences between former East and West Germany might be partly explained by spatial differences in heavy metal pollution (MSC-E, 2020). Higher PbB levels in children and adolescents residing in former East Germany have also been observed in multivariate analyses of data collected in GerES V, performed from 2014 to 2017 (Hahn et al. 2022). It has to be considered that our regression model includes only a small number of predictors. Therefore, we see the former East/West Germany variable as a reasonable, though imperfect, proxy for differences in exposure-relevant behaviors or conditions not included in the regression model. For example, ambient air pollution concentrations or sufficiently detailed information on general food consumption were not part of the GerES IV dataset. GerES V results also indicate a further alignment of internal metal exposures in former East and West Germany (Vogel et al. 2021). Therefore, a more detailed analysis in future studies of the reasons for the exposure differences still observed between former East and West Germany is warranted.

The impact of variables associated with smoking activity was found to be higher on Cd than on Pb. This observation is in line with the general knowledge on the associations between environmental tobacco smoke and exposure to heavy metals (Bernhard et al. 2005), with the previous findings of Conrad et al. (2010), who studied the exposure of German children (about 25%) to environmental tobacco smoke at home, and with the conclusions drawn from a recent study conducted on Swedish adolescents by Almerud et al. (2021).

Study Limitations

The inclusion of the correlation-specific terms generated convergence difficulties for the parameters whose estimated value is close to zero. This is the case, for instance, of the correlation coefficient, whose 95% CI covers the value zero. In such a situation, the MCMC chains might require more iterations to approximate the target distribution (Robert and Casella 2004). Collinearity within the input space or the identifiability of model parameters might be explored as possible explanations. Collinearity within the input space refers to the presence of one or more variables that are linear combinations of other input variables (Dormann et al. 2013; Kutner et al. 2004), which is not the case in our dataset. Generally, this occurs when a categorical variable with K classes is one-hot-encoded (i.e., transformed into K binary variables) and all K variables are considered in the input space. The second option refers to the concept of identifiability of model parameters, which is a classical issue in regression analysis. Model identifiability is a fundamental prerequisite for model identification and concerns the uniqueness of the model parameters determined from observations (Godfrey and DiStefano, 1985; Lecca 2020). In our model, the marginal regression models are built on a small number of input variables (five) and a high number of records (more than 300), which rules out the issue of non-unique parameter estimates. This is also confirmed by the results and predictive performances observed with the simulated data, which have a larger input space and a stronger dependence within the output space.
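
The one-hot-encoding case mentioned above can be illustrated with a short sketch. It uses a hypothetical categorical variable with K = 3 classes and assumes a regression with an intercept; the class labels and sample size are arbitrary. Because the K binary columns sum to one for every record, they reproduce the intercept column exactly, so the design matrix loses one rank unless one class is dropped as the reference.

```python
import numpy as np

# Hypothetical categorical input with K = 3 classes (labels are illustrative only).
classes = np.array([0, 1, 2, 0, 1, 2, 0, 1])
K = 3

# Full one-hot encoding: K binary columns, one per class.
one_hot = np.eye(K)[classes]

# Design matrix with an intercept and all K binary columns.
X_full = np.column_stack([np.ones(len(classes)), one_hot])

# Design matrix with an intercept and K - 1 binary columns (first class as reference).
X_reduced = np.column_stack([np.ones(len(classes)), one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

The full encoding has four columns but only rank three, so its regression coefficients are not uniquely determined, whereas the reduced encoding has full column rank.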

However, the selected input variables can be assumed to be insufficient to fully reproduce the variations in the internal exposures to Cd and Pb measured in GerES IV participants. Indeed, including individual-level data on the consumption or use of dietary and non-dietary consumer products, which can be major sources of exposure to Cd and Pb, could be considered to improve the analysis (Heinemeyer and Bösing, 2018; Järup 2003; Hahn et al. 2022). Because of the close cooperation between the KiGGS baseline study and GerES IV, health, socio-demographic, anthropometric, and environmental data from both studies could be combined at an individual level, jointly analyzed, and used to carry out a more comprehensive regression analysis as a next step (see for example Hahn et al. 2022). Moreover, the relationship between the factors potentially influencing human exposure (e.g., socio-demographics, smoking activity, and other exposure-relevant information) and the estimate of the internal exposure is not necessarily linear. There is growing advocacy for developing quantitative approaches such as physiologically based toxicokinetic models, which make it possible to better relate the determinants of the external exposure to the internal exposure estimates by simulating the kinetics of substances in the human body and integrating physiological and substance-specific parameters (Louro et al. 2019; Sarigiannis et al. 2019).

Another limitation of this study arises from discarding a large number of individual records (approximately 660 records) corresponding to Cd concentrations below the LOQ and to missing values. The treatment of missing data and of concentrations below the LOD or LOQ, especially in HBM datasets, is of high interest in exposure science. Different approaches going beyond the removal of records with missing values and the replacement of below-LOQ concentrations by a fixed value (e.g., LOQ/2) were recently suggested and discussed under the HBM4EU initiative, including for instance single and multiple imputation techniques (Vrijheid et al. 2019). In this specific analysis, we tested the replacement of below-LOQ concentrations by a fixed value, which transformed the marginal distribution of Cd concentrations into a bimodal distribution. This raised the complexity of the model and resulted in convergence issues. Multimodal distributions or mixtures of Gaussian distributions could be considered to improve the analysis. Other alternatives, such as single or multiple imputation methods, could also be evaluated, as they can help avoid this bimodality.
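
The bimodality induced by fixed-value substitution can be reproduced with a small simulation. The sketch below draws hypothetical log-normal concentrations, applies an arbitrary LOQ, and replaces every below-LOQ value by LOQ/2; none of the numbers correspond to GerES IV measurements. All substituted records collapse onto a single point, separated from the quantified values by an empty interval between LOQ/2 and LOQ, which produces a second mode in the marginal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-normal concentrations; parameters and LOQ are illustrative,
# not GerES IV values.
conc = rng.lognormal(mean=-2.0, sigma=0.6, size=2000)
loq = 0.12

# Fixed-value substitution: every below-LOQ record becomes LOQ/2.
below = conc < loq
substituted = np.where(below, loq / 2.0, conc)

# Share of substituted records and number of values left between LOQ/2 and LOQ.
share_below = below.mean()
in_gap = np.sum((substituted > loq / 2.0) & (substituted < loq))

# Histogram on the log scale: the first bin holds the spike at log(LOQ/2).
counts, edges = np.histogram(np.log(substituted), bins=20)

print(f"share below LOQ: {share_below:.2f}")
print(f"values strictly between LOQ/2 and LOQ: {in_gap}")
print("histogram counts:", counts)
```

The first histogram bin contains the substituted records and is followed by empty bins before the quantified values begin, which is the kind of two-mode shape that a single unimodal marginal model cannot capture.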

Perspectives

Every year, the chemical industry and, more globally, anthropogenic activities emit significant amounts of chemicals that are released into the environment or occur as mixtures in dietary and non-dietary consumer products, constituting in many cases an emerging risk to the general public (Egeghy et al. 2012; Huang et al. 2017; Mitchell et al. 2013). The developed approach is in line with the efforts invested in recent years, within the fields of chemical safety and exposure science, to enhance the analysis of the cumulative exposure to multiple substances occurring in the environment and to deepen the understanding of the determinants and health impacts of multiple substance exposure. Furthermore, this modeling framework, and more generally the rich class of copula functions, could provide exposure and risk assessors with an attractive approach to enhance the assessment of cumulative exposures, also on the basis of human biomonitoring data. Cumulative exposure to multiple substances can occur through behavioral co-exposure with distinct pathways or through co-exposure to mixtures. The approach is particularly relevant in the context of high correlations, as illustrated by the application to the simulated data and the comparison with state-of-the-art machine learning approaches.

Supplementary Information

The supplemental materials are contained in a Word document (supplemental materials S1) and an Excel document (supplemental materials S2), which can be accessed online on the journal website. The supplemental materials consist of further information on the variables discarded from the preliminary analysis, the descriptions of the posterior sampling of the correlation matrix R, of the generation of the simulated data, and of the machine learning algorithms compared with the implemented Bayesian copula-based regression.