1 Introduction

Radon, 222Rn, is a radioactive noble gas which occurs naturally as a byproduct of uranium decay. Lacking color and smell, radon can only be detected by measurement devices and is the main contributor to natural radiation. Due to its potential to cause lung cancer and possibly other diseases (Darby et al. 2005; Rericha et al. 2006; Smith et al. 2007; Bølviken et al. 2003) monitoring campaigns and surveys have been promoted all over the world to assess the risk and identify areas prone to high radon concentrations. Identifying such areas, as well as understanding potential risk factors is essential in dealing with prevention and mitigation of radon exposure.

Typically areas with higher radon potential are called radon prone areas, radon priority areas, radon areas, or radon protection areas. In this paper, those areas are named radon prone. Some countries define a radon-prone area as an area with a high probability of finding high indoor radon concentrations (IRC). Another method is to define a radon-prone area that has an average IRC exceeding a certain radon concentration levels. Austria followed the approach to define radon areas based on average radon concentrations for administrative areas (municipalities). The average radon concentrations can be compared to reference levels, which is in line with 2013/59/Euratom (2014). The majority of methods to identify such regions are based on adapted cluster detection methods (Sarra et al. 2016). More recent work focused on prediction with a quantile regression and utilizing splines (Borgoni et al. 2019), which is more in line with this work.

Besides geogenic factors such as uranium content and permeability of the soil, building characteristics and living habits also influence IRC. Studies on the effects of building factors and geology on measured IRC have been conducted before (Buchli and Burkart 1989; Gates and Gundersen 1992; Gunby et al. 1993; Price et al. 1996; Apte et al. 1999; Kemski et al. 2009; Borgoni 2011; Cinelli et al. 2011, and others). To account for these known effects, this paper aims to model average radon concentrations as a function of building factors as well as underlying geogenic information.

This paper roughly extends the methodology that has been introduced in Borgoni et al. (2014) by adopting a linear additive mixed effects regression approach with the main goal of prediction. Building factors, geology, and other explanatory variables, that were collected in the extensive Austrian ÖNRAP 2 survey (Gruber et al. 2021b), play an integral part along with a flexible spatial spline in predicting radon levels. Multilevel models that incorporate random effects have been used in the past for modeling radon concentrations, see for example Apte et al. (1999), Price et al. (1996), and Gelman (2006) to name a few, and account for the hierarchical structure of the data. The methodology presented in this paper has been tested in a European exercise on a data set for a region in Spain and compared to other modeling approaches by the same authors Gruber et al. (2021a).

The previous Austrian radon map was based on the data and methodology of the first Austrian radon programme (ÖNRAP 1) (Friedmann, 2005). Compared to the method of the previous Austrian radon map, the proposed additive mixed model enables prediction of the mean radon concentration of a dwelling at any coordinates while specifying building characteristics and other variables. The currently published Austrian radon map is based on the predictions of the proposed model.

The next sections are organized as follows. Section 2 introduces the data. Section 3 contains the methodology for modeling, model selection, and prediction. The results of Sect. 3 are outlined in Sect. 4. Finally, some concluding remarks and summary are provided in Sect. 6.

2 Data description

The second national Austrian radon survey (ÖNRAP 2, Gruber et al. 2021b) is used as the data basis for IRC. The characteristics of the data and the survey design from data collection to data wrangling along with descriptive statistics have already been described in Gruber et al. (2021b). In the following paragraphs only a short summary of the data is presented.

The ÖNRAP 2 data set includes about 50000 measurements of IRC of around 27000 dwellings. In each dwelling two measurements in two different rooms were intended. The measurements were carried out with passive track etch (CR39) detectors. The radon concentrations are the means over the measurement periods in Becquerel per cubic meter (\(Bq/m^{3}\)). The measurement period was about six months, where half of the measurements were carried out in the winter and the other half in the summer. This accounts for seasonal variations of indoor radon concentrations. The sampling distribution of the IRC measurements as well as log IRC measurements are shown in Fig. 1. The left histogram shows the highly skewed IRC measurements and visual inspection of the right histogram roughly confirms an approximate log-normal distribution, which was also found by other authors such as Nero et al. (1986, 1990).

In addition to the indoor radon concentration (IRC), building characteristics of the measured dwellings and rooms were collected in a questionnaire. The data has been geo-referenced and the coordinates were extracted using the addresses of the dwellings. The geo-referenced data was used to link the position of the measurements to the geological map of Austria (Geologische Bundesanstalt (GBA), 2020).

Table 1 summarizes the data set, where for each parameter the variable type, typical values, and the source of the parameter are described. In the model selection in subsequent sections, variables are selected from this table. Some of these variables are considered as particularly important for the variability of indoor radon and will be included by default in the modeling process as so-called ’default starters’. Other variables are called ’contenders’ and their importance is tested. The purpose of the variables for modeling is listed in the column ’model’ in Table 1, where ’dep’ stands for dependent variable and ’default’ and ’cont’ for default starters and contenders respectively. The following gives an overview of the relevance of the variables regarding IRC.

The coordinates of the dwellings are a proxy for all natural and artificial spatial dependent features. In context of radon this might include geogenic factors (uranium content, soil permeability), weather and climate conditions and building types. The main source of indoor radon is the soil gas. Geogenic factors determine the available amount of radon in the soil. The construction of a building determines how much radon can move from the soil inside the building and the ventilation rate of fresh air into the building. Weather and climate conditions also correlate with ventilation rates of fresh air into the building and heating habits.

Concerning geology, different types of bedrock show different uranium concentrations and permeability and, therefore, might describe variations of the indoor radon concentrations. The geological units in low resolution are used as described in Table 1. This ensures a high amount of samples within each geological unit. As geology is a spatial feature, the coordinates already cover a part of the information.

Typically, rooms that are in direct contact with the soil show higher radon concentrations than rooms that are not. This is also true regarding the floor level, where lower radon concentrations are usually found in rooms on higher floors. As for the year of construction of a dwelling, older buildings are usually more prone to higher radon concentrations than younger buildings, which might be due to improvements in construction technique and material fatigue. An unoccupied basement may serve as a buffer and typically reduces the radon concentrations in higher occupied floors.

Thermal retro fitting is also worth observing due to the fact that dwellings with air tight building envelopes may show higher indoor radon concentrations due to a lack of ventilation of fresh air. Finally, dwellings mainly built with stone walls typically show higher radon concentrations than dwellings mainly built with concrete walls. This may also correlate with the year of construction. Typically IRC is higher in winter months. Thus the proportion of time the dwelling is measured in winter months is also observed and included as ’proportion measurement time’ in Table 1.

Additional data used for prediction purposes in Sect. 3.3 are a national 250 by 250 m wide grid layer (Statistik Austria 2020b), the settlement area of Austria (Statistik Austria 2020a) and the geo-referenced municipalities of Austria.

Table 1 A summary of all variables of the Austrian indoor radon data set
Fig. 1
figure 1

Sampling distribution of the measured IRC in \(Bq/m^3\) (left) and \(log(Bq/m^3)\) (right) from the ÖNRAP 2 campaign

The data went through a significant cleaning and preparation process before modeling. The majority of this work dealt with corrections of answers in the questionnaire. It was not an exception to find unclear answers that had to be corrected and simplified where possible. Several questions had the answer choice of ’unknown’ and it was quite common to see combinations of known and unknown answers for different rooms in the same dwelling. Plausibility checks and rules were implemented to deal with these cases. A simple example of a plausibility check for the answer of room earth bound is checking the floor level and room type. Corrective actions can be set if these answers do not agree, for example if the the room is on the second floor, but it was indicated in the questionnaire that the room was earth bound. Similar checks can be implemented for the other explanatory variables obtained from the questionaire. Other data cleaning procedures were converting binary variables with original answers of ’yes’ and ’no’ to indicator variables or checking the plausibility of numeric variables. For the final data only complete cases regarding the variables were used which leads to 47010 observations.

3 Methodology

In this section three major issues related to the modeling methodology with generalized additive mixed models are presented: (1) Definition of the generalized additive mixed model in Sect. 3.1, (2) method to identify influential explanatory variables for a final model in Sect. 3.2, and (3) predicting radon-prone areas and assessing uncertainty of predictions in Sects. 3.3 and 3.4 respectively. The latter issue may serve as a useful tool for categorizing municipalities in radon classes by using the aggregated prediction from (3) and comparing its uncertainty with predefined class limits. Issues (1) and (2) allow the identification of a set of influential explanatory variables that may affect mean IRC.

3.1 A generalized additive mixed model for mean log IRC

Generalized additive models extend the class of generalized linear models to include penalized splines or smoothing functions, which are added in an additive nature in the model and depend on explanatory variables (Wood 2017). This flexibility that can be obtained from additive modeling allows to consider spatial dependencies, which come from the specific locations of the monitored dwellings as well as the geological typology.

Mixed models contain fixed as well as random effects, and in combination with a spline term these models are used to analyze complex and hierarchical data structures. For each dwelling that was part of the monitoring campaigns, measurements from two different rooms were taken. Measurements that come from the same dwelling may be dependent for different reasons such as building characteristics, the number of inhabitants, or the tendency of inhabitants to open windows to let in fresh air. To account for this dependency within a particular dwelling, a random effect for the dwelling is included in the additive model.

The dependent variable indoor radon concentration (IRC) is transformed with the natural logarithm, similarly as in Borgoni et al. (2014). The additive model in subject level notation for the dwelling j is as follows:

$$\begin{aligned} log({\textbf {IRC}}_{j})= & {} \pmb {X}_{j}\pmb {\beta } + \pmb {Z}_{j}\pmb {u}_{j} + \pmb {X}_{sp, j}\pmb {\beta }_{sp} + \pmb {\epsilon }_{j}\end{aligned}$$
(1)
$$\begin{aligned}{} & {} \pmb {u}_{j} \sim N(\pmb {0}, \sigma ^{2}_{d}\pmb {I}_{j}),\text { } \pmb {\epsilon }_{j} \sim N({\textbf {0}}, \sigma ^{2}\pmb {I}_{j}) \end{aligned}$$
(2)

where log(.) is the element-wise logarithmic function, \(\pmb {X}_{j}\) is a matrix of dimension \(n_{j} \times p\) of p explanatory variables for \(n_{j}\) measurements in dwelling j, such as building factors, geology, number of inhabitants, and duration of measurement, \(\pmb {\beta }\) is the \(p \times 1\) vector of fixed effects, \(\pmb {Z}_{j}\) is a matrix of dimension \(n_{j} \times q\) of indicator variables for dwellings, \(\pmb {u}_{j}\) is a \(q \times 1\) vector of random effects representing dwellings, \(\pmb {X}_{sp, j}\) is a \(n_{j} \times s\) matrix containing the terms required for the spline construction based on the coordinates of the dwellings, \(\pmb {\beta }_{sp}\) is a \(s \times 1\) vector of spline coefficients, \(\pmb {I}_{j}\) is an \(n_{j} \times n_{j}\) identity matrix, and \(\pmb {\epsilon }_{j}\) is a \(n_{j} \times 1\) vector of errors. The number of knots of the spline is represented by s, which can be pictured as points that connect piece-wise polynomial functions to generate a spline. A typical value for \(n_{j}\) is two, or two rooms per dwelling, but other scenarios with just one room (\(n_{j} = 1\) for dwelling j) are also possible. As there is only one random effect in the applied model, q takes the value one.

The purpose of the spline term is to account for the spatial dependency of the measurements and serves as an additional intercept term next to the intercept of fixed effects, \(\beta _{0}\) of \(\pmb {\beta }\), for a base level of indoor radon concentration. As noted in Borgoni et al. (2019), radon dynamics show patterns that are typically not linear. The inclusion of a flexible spline allows nonlinear modeling of mean IRC as a function of the coordinates of the measurement locations and can thus account for spatial local effects. More specifically, the spline is of the class thin plate regression splines. This class of splines is discussed in detail for generalized additive mixed models in Wood (2003), as well as in Pratesi et al. (2009) and Ruppert et al. (2003). What will be mentioned here is that the main reason for choosing this spline was its ability to represent smooths of more than one explanatory variable and that it is somewhat of an ideal smoother when it comes to smoothing functions of several variables (Wood 2003).

The random effects term \(\pmb {u}_{j}\) represents the effect of dwelling j. It is assumed that the effect of the dwelling is normally distributed with expectation \({\textbf {0}}\) and variance \(\sigma ^{2}_{d}\). The remaining variation of the log indoor radon concentration that cannot be explained by the fixed effects, the spline, or the random effects for the dwelling, is represented by the error term \(\pmb {\epsilon }_{j}\) that is normally distributed with expectation \({\textbf {0}}\) and variance \(\sigma ^{2}\pmb {I}_{j}\) as in Eq. 2. The covariance matrix of \(log({\textbf {IRC}}_{j})\) can be derived from the definition of the model in Eq. 1 and takes the following form in matrix notation

$$\begin{aligned} \begin{aligned} \pmb {cov}(log({\textbf {IRC}}_{j})) = \sigma _{d}^{2}\pmb {J}_{j} + \sigma ^{2}\pmb {I}_{j}, \end{aligned} \end{aligned}$$
(3)

where \(\pmb {J}_{j}\) a square matrix of ones with dimension \(n_{j} \times n_{j}\). The diagonal elements \(\sigma _{d}^{2} + \sigma ^{2}\) represent the variance of log IRC. Within dwelling j, the rooms have a positive covariance \(\sigma ^{2}_{d}\) at the off-diagonal entries in Eq. 3, which is assuming that rooms in dwellings have similar log IRC characteristics.

Usually the penalization of the spline requires the estimation of smoothing parameters commonly accomplished by cross-validation methods. However, the mixed model allows a representation of the spline as a random effect and the smoothing parameter can thus be estimated with mixed model methodology (Wood 2017). In particular, after a suitable Bayesian reformulation of the smoothing term in terms of random effects, the model in Eq. 1 can be fitted by maximizing the restricted maximum likelihood (REML) in a way that is well known for linear mixed models. The REML algorithm to fit the model in 1 is implemented in the R software (R Core Team 2019) in the package gamm4 written by Wood and Scheipl (2017).

3.2 Model selection

The model described in Sect. 3.1 should only contain variables, that contribute to the prediction accuracy of mean log IRC and have a significant effect. The classical model selection methods such as step-wise forward, backward, and best subset selection based on information criteria or statistical significance of estimated coefficients are not chosen due to their problems, which range from biased \(R^{2}\) values and p-values that are too small to residual confounding (Harrell 2015).

To offset some of the problems, the model selection chosen is based on cross validation and prediction error, see for example Hastie et al. (2001) for a comprehensive treatment, thereby deviating from the classical procedures to a more data-driven and predictive approach. The step-wise forward foundation in combination with a 50 times repeated stratified 5-fold cross validation is used to identify influential explanatory variables. The conventional k-folds cross validation splits the data set into k separate subsets, where \(k-1\) subsets act as the training set, which are used for model fitting, and the fifth set is used as test data. The procedure is repeated k times with test and training data chosen in different order, and overall test and training errors can be obtained.

The difference of the cross validation method used here, compared to conventional k-fold cross validation methods, is that the validation is stratified according to dwelling unit, district, and geological type within a district. Stratification results in a better coverage of the whole area in each step of the validation and also ensures that observations from a particular dwelling fall in the same fold. In each forward step, the cross validation is repeated 50 times, which in general guarantees more numerical stability and precision (Kim 2009; Molinaro 2005). The mean absolute error is used as the error measure for the validation. This procedure ensures that along with the default starters, outlined in in column ’Model’ in Table 1, only influential variables are included in the model.

As a first step, before validation of the explanatory variables, an optimal estimate of the basis dimension s of the thin plate spline is found by the same cross validation method, except that it was repeated 10 times. The search for an optimal s is run with a null model, a model that contains only the default explanatory variables, due to computing resources constraints. Values from 10 to 400 are chosen as a suitable range for s to be tested. An optimal s is chosen as the minimal number so that no significant improvement in the cross validation test error can be seen.

The validation of the explanatory variables is then started with the same null model that has been used in the validation of the spline dimension above, but now using the cross validated s for spline construction. In each step a variable from the pool of contenders, see again column ’Model’ in Table 1, is selected and tested in the 50 times repeated stratified 5-fold cross validation. Such a variable contributes to the improvement of the model, if the 75th percentile of the 50 cross validation test errors of the model containing the new variable is less than the 25th quantile of the 50 cross validation test errors of the currently best model. Out of all variables from the contender pool that improve the model according to the above criteria, the one with the smallest mean test error is added to the model.

3.3 Prediction

Using the Austrian definition of a radon-prone area, it is possible to directly model the means conditional on the building factors, geology, and coordinates, as in the model in Sect. 3.1. The final model obtained by the cross validation based model selection can be used for prediction. In particular, it is only required to plug in values for the building characteristics and other explanatory variables in the design matrix \(\pmb {X}_{j}\) along with the coordinates of the dwellings in the spline matrix \(\pmb {X}_{sp, j}\). The predictions on the log scale then are

$$\begin{aligned} \begin{aligned} \pmb {\hat{\tau }} = \pmb {X}\pmb {\hat{\beta }} + \pmb {X}_{sp}\pmb {\hat{\beta }}_{sp}, \end{aligned} \end{aligned}$$
(4)

where the index j is dropped indicating that the predictions are stacked row-wise for a certain number of dwellings in a particular region, such as a municipality. Transforming back to the original scale, \(Bq/m^3\), and considering the properties of the log-normal distribution, i.e. \(E(Y) = exp(\mu + \sigma ^{2}/2)\) for \(log(Y)\sim N(\mu , \sigma ^{2})\) (Shao 2003), the predicted values for m grid cells and thus m dwellings in a municipality are

$$\begin{aligned} \begin{aligned} \pmb {\hat{y}} = exp(\pmb {\hat{\tau }} + (\sigma ^{2} + \sigma ^{2}_{d})/2). \end{aligned} \end{aligned}$$
(5)

The random effect term for the dwelling in Eq. 1 contributes to the prediction only as a variance component. This is reasonable because there is no interest in any individual dwelling. Transforming back to the original scale in this way has also been mentioned in Borgoni et al. (2014). Thus, a dwelling with particular characteristics can be predicted at any location. Finally, the average prediction of a municipality is

$$\begin{aligned} \begin{aligned} \hat{y}_{mun}^{*} = \frac{1}{m}\pmb {1}_{m}^{t}\pmb {\hat{y}}, \end{aligned} \end{aligned}$$
(6)

where \(\pmb {1}_{m}\) is a column vector of length m, and t indicates matrix transposition. It has to be noted that two other types of transformations back to the original scale are possible and have been used for the development of an Austrian indoor radon map. Coming back to the properties of the log-normal distribution, it is also well known that \(exp(\pmb {\hat{\tau }})\) represents a median prediction, for example see again Shao (2003). A geometric mean prediction can be obtained by first averaging the predictions on the log scale in Eq. 4 and transforming back as a last step.

3.4 Uncertainty of predictions

To assess the uncertainty of the predicted average radon concentration, approximate asymptotic confidence intervals can be calculated for each municipality. A multivariate weak convergence theorem for continuously differentiable functions of estimators which are asymptotically normal and the Cramer-Wold Device, see for example theorem 1.9 (iii) and 1.12 of chapter 1 in Shao (2003), can be used to derive variances of the municipality predictor in Eq. 6. Applicability of the theorem is further justified by the fact that the fixed effect estimators \(\pmb {\hat{\beta }}\) and \(\pmb {\hat{\beta }}_{sp}\) of the additive mixed model in Eq. 1 are asymptotically normally distributed according to Wood (2017). The variance parameters in the exponential function in Eq. 5 are treated as known. Applying these results, it is a matter of calculus and matrix algebra to calculate an \(m\times (p+s)\) matrix \(\pmb {D}\) of first derivatives with respect to \(\pmb {\beta }\) and \(\pmb {\beta }_{sp}\) of \(\pmb {\hat{y}}\) in Eq. 5, and along with the \((p+s)\times (p+s)\) covariance matrix \(\pmb {V_{\beta }}\) of the fixed effects estimators, the asymptotic distribution of the municipality predictor is

$$\begin{aligned} \begin{aligned} \hat{y}_{mun}^{*} \sim AN(\frac{1}{m}\pmb {1}_{m}^{t}e^{\pmb {X}\pmb {\beta } + \pmb {X}_{sp}\pmb {\beta }_{sp} + (\sigma ^{2} + \sigma ^{2}_{d})/2}, \frac{1}{m^2}\pmb {1}_{m}^{t}\pmb {D}\pmb {V_{\beta }}\pmb {D}^{t}\pmb {1}_{m}), \end{aligned} \end{aligned}$$
(7)

which can be used to calculate confidence intervals. As for the other types of predictions briefly mentioned in Sect. 3.3, the result in Eq. 7 can still be used by leaving out the variance components in the exponent for the median predictor, and confidence intervals for the geometric mean prediction can be obtained by transforming the endpoints of the confidence intervals on the logarithmic scale. The confidence intervals can be a useful tool to classify municipalities into radon risk regions and assess the uncertainty of the municipality predictions.

4 Results

In this section the methodology is applied to the indoor radon data described in Sect. 2. The structure of the subsections roughly follows Sect. 3. Sections 4.1 and 4.2 contain the results of the model selection using the stratified step-wise forward k-fold cross validation and the final model with all influential explanatory variables along with an interpretation and discussion of the estimates. Predictions of grid cell midpoints and municipalities are shown in Sect. 4.3.

4.1 Selecting a best model

The validation of the basis dimension s of the spline is shown in Fig. 2, which marks an optimal s of 300 with a dashed vertical line. The difference in cross validation test error by increasing s is found to be negligible for values above 300. Figure,3 shows the very last step of the model selection procedure. Variables to the left of the dashed vertical line were selected for the final model. Here the variable type foundation was found to be a significant improvement on top of window condition in the model. The stopping criterion, described above, is marked by the gray horizontal bar. Inclusion of further contender explanatory variables does not further improve the model, and the selection is halted after six steps. The results of intermediary steps are available upon request.

Fig. 2
figure 2

Cross validation (CV) error in terms of mean absolute error (MAE) versus number of knots s for the thin-plate spline. (red: test data, blue: training data). The dashed vertical line marks an optimal number of knots (\(s=300\))

Fig. 3
figure 3

Last step of the cross validation (CV) procedure: cross validation errors in terms of mean absolute error (MAE) are shown for each explanatory variable tested in the model (red: test data, blue: training data, to the left of dashed vertical line: variables that enter the final model, gray horizonal bar: stopping criterion)

4.2 The final model and interpretation of effects

After completion of the cross validation, the influential explanatory variables are identified. Observations which are complete regarding the influential variables but have been removed from the cross validation data set because of missing entries, are added to the final data set. This increases the data set for a final model fit from 47010 to 47344 observations. The added observations were checked for systematic distributional deviations before inclusion, but none were found. Table 2 contains the estimated fixed effects coefficients for \(\pmb {\beta }\) from the final model. The estimated coefficients for \(\pmb {\beta }_{sp}\), which are required for the spline construction, are not shown because a lack of direct relevance. It is important to note that the building factors were obtained by a questionnaire as mentioned in Sect. 2, and it has to be assumed that not all answers are known with certainty by participants.

Table 2 Coefficient estimates of the explanatory variables which were selected by cross validation for the final additive mixed model

Care must be taken when interpreting and analyzing the coefficients in Table 2. The usual interpretation of regression coefficients can only be made on the log scale of IRC. Interpreting coefficients on the original \(Bq/m^{3}\) scale requires a transformation with the exponential function, i.e. \(exp(\beta _{k})\), of the coefficients. This allows an interpretation as multiplicative effects. In what follows, interpretation focuses on the logarithmic indoor radon scale. The reference levels of the categorical variables, which are contained in the intercept term, are shown in the respective rows in parentheses in Table 2.

Some of the estimated coefficients take values as expected and have already been reported by other authors (e.g. Gruber et al. 2021a; Borgoni et al. 2019). The ground is the main source of indoor radon and, therefore, the distance to the ground is a proxy for IRC. In the model this fact is covered by the three explanatory variables room earth bound, floor, and basement. Rooms with direct contact to the underground (earth bound) show 0.3155 higher log IRC predictions. This is the largest effect among all explanatory variables that relate to building characteristics.

The estimated effects of year of construction, main material of the walls, type of building, thermal retro fitting, type of construction, airtight windows, and geology in Table 2 largely follow expected patterns that are already well known in the radon community and confirm findings of Demoury et al. (2013), Kropat et al. (2014), Sundal et al. (2004), Collignan et al. (2016), Jiranek and Kacmarikova (2014), Pampuri et al. (2018), Yang et al. (2019), Hahn et al. (2015), Sundal et al. (2004), Tondeur et al. (2014), Gruber et al. (2021b) and Bossew et al. (2008).

Contrary to the intuition that a foundation blocks radon migration into a building, foundation partly, which is contained in the intercept term, has the weakest effect. The difference of foundation fully, which has an estimated coefficient of 0.042, and foundation partly is small, but this could indicate again that participants had problems filling out the questionnaire regarding foundation as a building characteristic.

The effect of the number of persons older than 14 years is also interesting and may relate directly to ventilation and its proxies. Its weak negative effect is statistically significant and predicted mean log IRC decreases by 0.039 per one person increase. This is reasonable because the number of persons living in a dwelling generally correlates with the frequency of opening doors and windows. The estimated coefficient of proportion measurement time, indicates that the longer the measurement duration in winter months, the higher the predicted mean IRC. This variable has the largest effect on predicted mean log IRC and the expected increase is statistically significant.

The geological units used in the final model can be divided into a higher and lower group of log IRC. Bohemian Massif (reference level), Quaternary, and Tertiary Basins belong to the group with higher IRC levels. Only the Bohemian Massif can be linked directly to high radon concentrations due to its composition of granitoid rocks which typically contain high uranium concentrations. The other estimated effects of geological units in the group with higher IRC levels are not easy to interpret. This is particularly true for Tertiary Basin, which usually shows low uranium concentration and lower permeability due to sealing layers of clay that are typical for this unit. It is worth noting that compared to the other geological units, the number of measurements within Tertiary Basin, Quaternary and Austroalpine units is high and may therefore also feature many different building characteristics. This may indicate that the resolution of some geological units is too low and may not add a lot of information to the model. The spatial spline incorporates the position of the measurements and thus also includes information pertaining to geological units.

It is worth emphasizing again, that the effects as seen in Table 2, were not primarily chosen based on their statistical significance, but because of their contribution to the prediction error in the cross validation procedure in the previous section. This circumvented model building based on statistical significance, or p-values, and affords a more data-driven approach. The result is a final model that can be used for predicting average radon concentrations based on coordinates of a dwelling and building factors as well as investigating the estimated effects. The inclusion of explanatory variables allows an easy interpretation and serves as a tool for prediction based on simple building and geology factors. Complex interactions among the various explanatory variables as well as the spline were not investigated due to time and computation resource constraints.

The estimates of the variance component of the random effect, \(\sigma _{d}^{2}\), and the error variance, \(\sigma ^{2}\), take the values 0.386 and 0.118, respectively. The correlation between two rooms in a dwelling is thus 0.770, which can be easily derived from the variance structure in Eq. 3. This positive correlation shows that higher log IRC values in one room tend to be accompanied by higher values in the other room. These estimated variances are not investigated further and only play a role as variance components in the REML estimators \(\pmb {\hat{\beta }}\) and \(\pmb {\hat{\beta }}_{sp}\) as well as transformation of predictions in subsequent Sect. 3.3.

The estimated spatial spline can be visualized for a fixed combination of building and geology factors as a three-dimensional surface, where the coordinates and predictions are shown on the respective axes. A contour of the spline is visible in the next subsection in Fig. 6 in terms of predicted grid cell midpoints. The fixed effects in Table 2 thus cause vertical shifts of the spline surface but have no influence on the spline curvature. This is a simple approach with strong assumptions, but serves its original purpose, where it was intended to subtract the effects of the building factors from the model for more precise spatial prediction on basis of simple predefined dwelling characteristics.

Model diagnostics checking was performed mainly on basis of standardized residuals, as suggested by Verbeke and Molenberghs (2000), Brown and Prescott (1999) for mixed models, and inspecting their sample distribution. A histogram with a density plot along with a quantile to quantile plot can be found in Fig. 4. It can be verified that the residuals reflect the postulated normal distribution reasonably well for log IRC.

Fig. 4
figure 4

Residuals of the final additive mixed model

4.3 Predicting radon areas

As for the exact specification of the design matrix \(\pmb {X}\) in Eq. 4 in Sect. 3.3, a reference dwelling was selected that is a good representation of an average Austrian dwelling found in all federal states. A most common dwelling can be found by looking at the frequency of building factor levels, which was used as a basis for the reference dwelling. Table 3 lists the characteristics of the reference dwelling along with the most frequent dwelling obtained from the questionnaire in Austria. Major differences among the dwellings are direct contact to the ground, the building year, and the type of cellar. The reference dwelling represents more modern houses in Austria that were build later than 2000 and do not have a cellar. Since the geology of Austria is known for each coordinate, the geological factor is taken as is at the particular location of prediction.

Table 3 Most frequent and reference dwellings used for prediction with the final additive mixed model

The grid cells mentioned in Sect. 2 were taken as the basis for prediction. In particular, the mean IRC is predicted for the mid points of the grid cells. As a rule, only cells that have at least 25% of its area in a municipality are used, which are shown as gold cells in Fig. 5. Furthermore, cells have to cover an area that is considered to be habitable. Highly alpine regions, where no habitation is possible or no human infrastructure is available are thereby excluded. A map of the federal state of Upper Austria with all grid cells satisfying these requirements and with transformed predictions as in Eq. 5 can be seen in Fig. 6.

Fig. 5
figure 5

All 146 relevant 250 by 250 m grid cells that cover a municipality (municipality number 40707) in Upper Austria. Red cells indicate cells with less than 25% coverage and are not used for prediction. The grid cells are not exactly to scale and increased in size for easy viewing

Fig. 6
figure 6

Predicted mean IRC for grid cell midpoints using Eq. 5 along with the reference dwelling settings from Table 3 in Austria

Figure 7 shows the municipalities in Austria ordered from lowest to highest prediction. The horizontal lines at 100 and 300 \(Bq/m^3\) indicate potential limits of radon risk categories, which can be compared to the upper and lower limits of the confidence intervals and categorize a municipality as a certain radon area. It can be seen that some municipalities have broader confidence intervals than others. The different widths of the confidence intervals mainly come from the variability of estimated geology effects, the spatial spline, and the amount of grid cells that cover the municipality. The radon expert as well as policy makers can thus make use of the point estimator along with its variability.

Fig. 7
figure 7

95% confidence invervals for 2095 predicted municipalities in Austria

A map of Austria in terms of predicted municipalities, using the estimator in Eq. 6, is shown in Fig. 8. The patterns that are visible in Fig. 6 are still prominent on the municipality level. Regions of higher radon concentrations can be clearly identified. The primary reason for the municipality estimator and thus the map in Fig. 8 is for administrative reasons, as the practical implementation of the relevant radon regulation needs to be done on the municipality level.

Municipalities that are covered by few grid cells, i.e. coverage requirements are not met, are thus only predicted by those few grid cells. Even though a prediction over the whole area by interpolation is possible with the additive mixed model, this way was chosen for the current map. The map of Austria has been published in 2019 and is available for public use (Austrian Agency for Food and Health Safety (AGES) 2021).

Fig. 8
figure 8

2095 predicted municipalities in Austria by averaging the transformed predicted grid cell midpoints in Eq. 6

5 Discussion

Reviewing the estimated coefficients of the final model, it is evident that most of the building factors are statistically significant while decreasing the prediction error in the cross validation. The building factor room earth bound is highly significant in this model. This confirms that rooms which are in direct contact with the ground have a higher indoor radon concentration. Other effects such as the year of construction, floor level, type of building, air tight windows, or the main material for walls largely behave as predicted.

The effects of the particlular levels of foundation or thermal retro fitting may reflect that these variables were often unknown by the participants. This could be a potential drawback of collecting information about building factors by questionnaire and may lead to unwanted uncertainty, due to the fact that the average person is not an expert on building characteristics. It has to be noted that the final model is a simplification and does not contain complex interactions that may take place in reality. This lack of interactions requires careful interpretation of the estimated coefficients.

Certainly indoor radon gas accumulation is influenced and dependent on more factors than those that were collected in the questionnaire. Adding more explanatory variables than collected in the questionnaire may further improve the prediction error. Other authors on indoor radon, such as Kemski et al. (2009) and Rowe et al. (2002), mention the potential of soil radon measurements, radiometric data, and weather-related variables as possible explanatory variables for IRC. The proposed methodology allows a straight-forward inclusion of further explanatory variables, as was shown in the Metro-Radon comparison study by Gruber et al. (2021a).

The reference dwelling plays a significant role in the prediction approach introduced in this paper. A standard situation for a dwelling, or reference dwelling as it is called in this paper, has been used before by Friedmann (2005) for the previous Austrian radon map. The purpose of the reference dwelling here is identical and this predefined dwelling is intended to represent a common class of residential dwellings in Austria. The settings of the reference dwelling in Table 3 are directly comparable to the settings in Friedmann (2005). This is also in accordance with the requirements of the Austrian radon map, where the focus is on residential dwellings.

The most important feature of the generalized additive mixed model is that the dependence on the actual measurements in order to generate a map is bypassed. This implies that radon risk, based on the reference dwelling, can be predicted at points where measurements from the campaign are not available. Compared to the previous method, this is a significant advantage and the main reason for using this approach for the second Austrian radon map.

6 Conclusions

In this paper an additive mixed model is introduced to predict a reference dwelling on grid cell midpoints on basis of influential building factors and geological and spatial information in order to generate a radon map of Austria. The additive model that incorporates highly non-linear spatial effects can be estimated within the restricted maximum likelihood framework and is applied to the data of the measurement campaigns carried out by the Austrian Agency for Health and Food Safety (AGES) within ÖNRAP 2 (Gruber et al. 2021b).

The additive model thus allows an investigation of the effects of building factors and geology next to estimating spatial effects of the measurement coordinates. The fact that at most two measurements were taken from dwellings is incorporated as a random effect for dwellings. All building characteristics from the questionnaire, as well as more information such as measurement duration, were tested in the models in a specific order of contender and default starter variables and their inclusion assessed by a repeated stratified 5-fold cross validation. With this data-driven approach, a high prediction quality can be ensured.

The findings about the effects of the building factors in the model not only provide useful insights into the characteristics of buildings in Austria, but prove to be a fundamental component of prediction. With these effects a representative standard dwelling can be defined. Along with the estimated spline in the generalized additive mixed model, which is a function of the coordinates, a standard dwelling can be predicted at grid cell midpoints that cover a municipality.

By averaging these predicted reference dwellings for all grid cell midpoints in a municipality, an overall average prediction along with a measure of uncertainty can be obtained and compared to known indoor radon reference levels. Local authorities can use these results to implement preventive indoor radon risk measures as well as to assess the current risk. The mapping method outlined in this paper has successfully been implemented in the official Austrian indoor radon map and was the basis for the delineation of radon areas in Austria. The results are published for public use by AGES.