The effect of local samples on the accuracy of mid-infrared (MIR) and X-ray fluorescence (XRF)-based spectral prediction models

Within the soil spectroscopy community, there is an ongoing discussion about how prediction models built on a global calibration database perform compared with models built on a local calibration database. This study addresses the issue by spiking a global database with local samples. The soil samples were analysed with MIR and XRF sensors and were also measured using traditional wet chemistry methods to provide reference values for the prediction models of seventeen major soil parameters. The prediction models applied by AgroCares, the company that assisted in this study, combine spectral information from MIR and XRF into a single ‘fused-spectrum’. The local dataset of 640 samples was split into 90% training and 10% test samples. To illustrate the benefits of using local calibration samples, three separate prediction models were built per soil property: 0%, 50% (randomly selected) or 100% of the local training samples were added to the global dataset. The remaining 10% of the local samples were used for validation. The seventeen soil parameters were selected to illustrate the differences in performance across a range of soil qualities. The results showed that many models already exhibit an excellent level of performance (R2 ≥ 0.95) even without local samples. However, there was a clear trend that, as more local calibration samples were added, both R2 and the ratio of performance to interquartile distance (RPIQ) increased.


Introduction
Hungarian agricultural soil fertility has been decreasing due to the imbalance between nutrient input and output. Simultaneously, soils have become more vulnerable due to extreme weather conditions (e.g. drought sensitivity, erosion). Managing the steadily decreasing soil organic matter and the lack of organic matter inputs (manures, crop residues, etc.) is therefore crucial for maintaining soil fertility in the medium term. To harmonize the preservation of soil fertility with farming objectives and environmental requirements in the 2021-2027 Common Agricultural Policy (CAP) of the European Union, proper soil nutrient management strategies are needed. These strategies should be based on information about the current status of soil fertility. Since traditional soil testing is time-consuming and expensive, techniques and instruments are needed that allow rapid, affordable and precise routine soil testing. As a result, general interest in using diffuse reflectance infrared spectroscopy to test soil physical, chemical and biological properties has increased (Soriano-Disla et al., 2014).
This technique has several advantages: it (a) is non-destructive, (b) requires relatively little sample preparation and (c) does not involve any (hazardous) chemicals. Measurements take only a few seconds and many soil properties can be estimated from a single scan. Moreover, the technique allows for flexible measurement configurations, and for in-situ as well as laboratory-based measurements (Viscarra Rossel et al., 2006).
Mid-infrared (MIR [3-50 μm]) spectroscopy is a widely used tool to estimate particle size distribution, organic carbon content and other chemical and physical soil properties (Terra et al., 2015) based on detection in the 4000-400 cm−1 wavenumber range. It typically performs better than near-infrared spectroscopy because it detects the fundamental vibrations instead of overtones (Seybold et al., 2019). Some soil properties, such as organic matter (aliphatics, aromatics, carbonyls, etc.) and mineral components (CaCO3, clay minerals, etc.), are believed to be associated with distinct wavenumbers. In contrast, others with no standard chemical composition (pH, texture, cation exchange capacity, aggregate stability) are estimated based on the whole spectrum (Ma et al., 2018). Prediction accuracy, however, is affected by potential overlaps of the peaks in the former case, and by noise and multicollinearity in the latter.
X-ray fluorescence (XRF [0.01 to 10 nm]) spectroscopy is based on detecting the secondary spectral lines at various wavelengths emitted by chemical elements (Nawar et al., 2019). This is a robust method for predicting heavy metals. However, it is less effective for analyzing low atomic number elements, i.e., K, P, Ca and Mg (Kaniu et al., 2012). The combined application of MIR and XRF spectroscopy was first reported by Towett et al. (2015) and found to be promising due to their synergistic effect on predicting soil properties (Naimi et al., 2022; O'Rourke et al., 2016).
To make better use of spectroscopy, the establishment of soil spectral libraries covering different soil types has been strongly recommended (Viscarra Rossel, 2009). Most investigations have focused on a single, quasi-homogeneous spectral library dataset. Nonetheless, these datasets are not necessarily directly comparable because of differences in sample handling (e.g. drying, milling, sieving) and spectroscopic conditions. Spiking such libraries with local samples was introduced to extend their applicability (Brown, 2007; Viscarra Rossel, 2009), and it significantly improved prediction accuracy (Guerrero et al., 2016; Seidel et al., 2019), even at the field scale (Breure et al., 2022).
Eight years ago, AgroCares, a company utilising spectroscopic sensors to estimate soil and feed quality, started a research program focusing on the development and implementation of a soil testing method using mid-infrared (MIR) and X-ray fluorescence (XRF) sensor technology (Dimpka et al., 2017). One of the major challenges of such a concept is the derivation of reliable prediction models. For this, machine learning techniques were applied to link the wet chemical parameters of the soil calibration samples to spectra obtained from the MIR/XRF sensors. AgroCares uses its global spectral calibration library, which covers various soil types, by building a temporary local model for each prediction request. To develop prediction models for Hungary, a calibration and validation study started in 2017. In total, 640 geo-referenced samples were collected to cover the different soil properties in Hungary. These samples are suitable for testing the effect of injecting local samples into a global model for Hungary. The aim of this study is to compare the effect on model performance within a local region of adding variable numbers of local calibration samples to a global soil calibration database.

Materials and methods
In Hungary, 640 geo-referenced soil calibration samples were taken. The locations were selected using the conditioned Latin Hypercube method (Minasny & McBratney, 2006; Roudier & Hedley, 2013). Variables used to stratify the samples were land use, soil type, climate data, accessibility and market value. The selection of sample locations also took into account whether or not a location lay in a high-priority agricultural area. The focus was on the most intensively used arable land (Fig. 1).
Soil samples were taken with an Edelman auger from the 0-200 mm top layer at a single point. The top 20 mm of soil in the auger was removed to discard any plant debris that might have fallen into the drill hole. The soil samples (1 kg each) were placed into 2 bags.
Soil samples were prepared according to ISO norm 11464:2006. The preparation included drying at 40 °C and, if necessary, crushing the soil sample and sieving it through a 2 mm sieve. The fraction finer than 2 mm was subdivided mechanically using a sample divider. A subsample of 30 g was retrieved and milled with a ball mill to 0.2 mm particle size. Each milled sample was analysed with Alpha I (MIR) (Bruker Corporation, Billerica, USA) and Epsilon 1 (EDXRF) (Malvern Panalytical, Malvern, UK) sensors. Exactly the same instrumentation and procedures were used to create the AgroCares spectral library. This adherence to protocol helped eliminate potential discrepancies related to analytical procedure.
The samples were measured using traditional wet chemistry methods to define the reference dataset for the calibration and validation sets. Seventeen parameters were measured, including pH in a 1:5 (volume fraction) suspension of soil in water (pH in H2O) and in 1 mol/l potassium chloride solution (pH in KCl); all seventeen parameters are listed in the Results.
The AgroCares workflow creates prediction models that combine the spectral information from MIR and XRF into a single 'fused-spectrum'. This means that spectral information present in either sensor is utilised simultaneously to obtain a better overall prediction than with either sensor individually (Elmenreich, 2002). This can be seen in the results section in Table 1.
To build a prediction model, spectral data from both sensors are required, as well as the wet-chemical reference data to be predicted by the model. In 2021, the models of AgroCares were built on approximately 17,000 soil samples, collected from 35 countries (primarily in Africa, Asia and Europe). The prediction models were built using the WEKA software (Hall et al., 2009), with the ADAMS knowledge flow (Reutemann & Vanschoren, 2012) used to prepare the data. Before training the primary model, the data were first cleaned of outliers by training a simple linear regression algorithm with 10-fold cross validation. All samples that were mispredicted by this cleaning algorithm by a factor of m (m = 6 in experiments) were removed. The cleaned samples were then used to build each prediction model, one model per soil property. Each model was built by first applying a Segmented Savitzky-Golay (SSG) filter (Geise and French, 1955) with n = 4 segments to both MIR and XRF spectra, then partial least squares (PLS) regression (Vinzi et al., 2010) to decompose the spectral information into approximately 10-20 components per sensor input (this varies per model). The PLS data from MIR and XRF were then concatenated and used as input for a locally weighted learning (LWL) algorithm (number of neighbours, k = 300) (Atkeson et al., 1997) wrapped around a Gaussian Processes (GP) regressor (Rasmussen & Williams, 2006). The parameters for each of these processes (e.g. number of PLS components, window size of Savitzky-Golay, etc.) were defined through an optimisation process for each soil property using the full global dataset. The LWL component is the crucial one here: despite using a global dataset, the LWL builds a temporary GP regressor on the k most similar training samples (based on the PLS components) at prediction time, and uses this model to predict the soil property value.
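The outlier-cleaning step described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the WEKA/ADAMS workflow used in the study: it uses a single hypothetical predictor `x`, and it interprets "by a factor of m" as an absolute cross-validated residual larger than m times the median absolute residual, which is an assumption because the text does not define the rule. All function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def cv_residuals(xs, ys, folds=10, seed=0):
    """Absolute out-of-fold residuals from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    resid = [0.0] * len(xs)
    for f in range(folds):
        held_out = idx[f::folds]          # every folds-th shuffled index
        held_set = set(held_out)
        train = [i for i in idx if i not in held_set]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            resid[i] = abs(ys[i] - (a + b * xs[i]))
    return resid

def clean_outliers(xs, ys, m=6.0, folds=10, seed=0):
    """Return indices of samples to keep.

    ASSUMED rule: keep a sample if its residual <= m * median residual.
    """
    resid = cv_residuals(xs, ys, folds, seed)
    limit = m * median(resid)
    return [i for i, r in enumerate(resid) if r <= limit]

# Synthetic demonstration: a clean linear relation plus one gross error.
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
ys[50] += 25.0  # hypothetical faulty reference value
kept = clean_outliers(xs, ys)
```

In this toy case the sample at index 50 is dropped while the ordinary samples are retained; with real spectra the predictor, the residual scale and the value of m would all need to be chosen with care.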
Other studies have also utilised a fusion of MIR and XRF sensors to show improved combined performance. O'Rourke et al. (2016) used Cubist regression models (Holmes et al., 1999; Quinlan, 1993) to predict soil parameters for the NSDB Republic of Ireland Database (Fay et al., 2007). This technique also showed improvements when using fused spectral inputs, but the scope of the data is much smaller than in this study. Kandpal et al. (2022) also presented favourable results for fused sensor approaches, using different PLS variants as the primary prediction method (and various preprocessing techniques). However, that study also used a relatively small number of samples (n = 196). The key difference in the present study is the use of the LWL algorithm to dynamically create localised prediction models for every new sample. This technique takes advantage of a large soil library to build models focused on the soil sample being predicted.
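The locally weighted learning idea can be illustrated with a minimal Python sketch. Several simplifications are assumed here: the inputs are already reduced to component scores (standing in for the PLS scores), fusion is a plain concatenation of the two sensors' scores, and a Gaussian-kernel weighted mean replaces the Gaussian Process regressor purely for brevity. The function names, the bandwidth rule and the toy library are hypothetical and are not taken from the AgroCares implementation.

```python
import math

def fuse(mir_scores, xrf_scores):
    """Concatenate per-sensor component scores into one fused feature vector."""
    return list(mir_scores) + list(xrf_scores)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def lwl_predict(query, train_x, train_y, k=300):
    """Predict one value from a temporary model built on the k nearest samples.

    A kernel-weighted mean stands in for the GP regressor used in the study.
    """
    # Rank the whole library by similarity to the query sample.
    ranked = sorted(zip(train_x, train_y), key=lambda s: distance(query, s[0]))
    neighbours = ranked[:min(k, len(ranked))]
    dists = [distance(query, x) for x, _ in neighbours]
    # ASSUMPTION: bandwidth = distance to the farthest selected neighbour.
    bandwidth = max(dists) or 1.0
    weights = [math.exp(-((d / bandwidth) ** 2)) for d in dists]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / total

# Toy library with two well-separated groups of soils (hypothetical values).
library_x = ([fuse([0.0 + 0.01 * i], [0.0]) for i in range(5)]
             + [fuse([10.0 + 0.01 * i], [10.0]) for i in range(5)])
library_y = [1.0] * 5 + [9.0] * 5

query = fuse([0.02], [0.0])
pred_local = lwl_predict(query, library_x, library_y, k=5)    # neighbours only
pred_global = lwl_predict(query, library_x, library_y, k=10)  # whole library
```

With k = 5 only the similar group contributes and the prediction stays at that group's value, whereas using the whole library pulls the prediction towards the dissimilar group. This is the motivation for selecting neighbours before fitting the temporary model.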
The local dataset of 640 Hungarian soil samples was split into 90% training and 10% test samples. To illustrate the benefits of using local calibration samples, three separate prediction models were built per soil property. For each model, 0%, 50% (randomly selected) or 100% of the local Hungarian training samples (90%) were added to the global calibration dataset of approximately 17,000 samples (of which only the k = 300 most similar samples are used per prediction, as described above). The remaining 10% of the local samples were used for validation. Seventeen soil parameters were selected to illustrate the differences in performance across a range of soil qualities, using the validation set to measure performance. Two evaluation metrics are presented: R2, the coefficient of determination, a measure of how closely the predictions match the actual values; and the ratio of performance to interquartile distance (RPIQ), which expresses the interquartile range of the observed values relative to the root mean squared error (RMSE) (Bellon-Maurel et al., 2010). Because the validation samples are constant, an increase in RPIQ represents a decrease in error. To interpret and compare the results of the model performance in Hungary, the predictions were classified by their R2 values according to Malley et al. (2004), as follows: excellent (R2 > 0.95), successful (0.90 ≤ R2 ≤ 0.95), moderately successful (0.80 ≤ R2 < 0.90) and moderately useful (0.70 ≤ R2 < 0.80).
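As a rough illustration of this evaluation design, the Python sketch below splits a local dataset 90/10, builds the three spiked calibration sets and computes R2, RMSE and RPIQ. Some choices are assumptions rather than details from the study: R2 is computed as 1 - SS_res/SS_tot (the squared correlation is another common choice), the quartiles use the "inclusive" interpolation of the standard library, the label returned for R2 below 0.70 is invented, and all function names and sample identifiers are hypothetical.

```python
import math
import random
from statistics import mean, quantiles

def split_local(samples, test_fraction=0.10, seed=0):
    """Shuffle the local samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def spiked_sets(global_set, local_train, fractions=(0.0, 0.5, 1.0), seed=0):
    """Global calibration set plus a random share of the local training set."""
    rng = random.Random(seed)
    return {f: list(global_set)
               + rng.sample(local_train, round(len(local_train) * f))
            for f in fractions}

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(obs, pred)))

def r2(obs, pred):
    """Coefficient of determination, ASSUMED form: 1 - SS_res / SS_tot."""
    m = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - m) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def rpiq(obs, pred):
    """Interquartile range of the observations divided by RMSE."""
    q1, _, q3 = quantiles(obs, n=4, method="inclusive")
    return (q3 - q1) / rmse(obs, pred)

def malley_class(r2_value):
    """R2 classes as listed in the text; the last label is an assumption."""
    if r2_value > 0.95:
        return "excellent"
    if r2_value >= 0.90:
        return "successful"
    if r2_value >= 0.80:
        return "moderately successful"
    if r2_value >= 0.70:
        return "moderately useful"
    return "unclassified"

# Hypothetical sample identifiers: 640 local and 100 global samples.
local_train, local_test = split_local(range(640))
calibration = spiked_sets(list(range(1000, 1100)), local_train)

# Small worked example of the metrics on invented values.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.1, 1.9, 3.2, 3.8, 5.0]
scores = (r2(obs, pred), rmse(obs, pred), rpiq(obs, pred))
```

The same validation samples are reused for all three calibration sets, so differences in the scores reflect only the amount of local data added.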

Results
The experiment was conducted for: pH(KCl), pH(H2O), organic C, total N, clay content, total P, total Ca, total Mg, total K, exchangeable Ca, exchangeable Mg, exchangeable K, CEC, plant available P, plant available Ca, plant available Mg and plant available K. Table 1 shows the results of the experiments incorporating varying levels of local samples in the calibration set. Many models already exhibit an excellent level of performance (R2 ≥ 0.95), even without local samples. Overall, there is a clear trend that as more local calibration samples were added, both R2 and RPIQ increased. For properties with already excellent performance, additional samples had little effect, but the effect is more pronounced for lower performing models. Notable examples include pH(KCl), clay, exchangeable Ca, available Ca and Mg and total K, each showing a substantial reduction in RMSE and an increase in R2. For some models (CEC, total Mg and available P), local samples had no effect on performance. The worst prediction was for plant available K, but adding calibration samples resulted in an R2 improvement from 0.13 to 0.30 and a 10% increase in RPIQ. Figure 2 illustrates the differences in performance of the prediction models on the Hungarian validation set when different amounts of Hungarian calibration data are used to train the prediction models. The parameters selected for visualization were those for which the addition of local samples has a larger effect on both R2 and RPIQ.
In models where R2 is already high (> 0.95), additional Hungarian samples increase the RPIQ. In weaker models, such as available Mg and available Ca, the addition of local samples has a larger effect on both R2 and RPIQ; CEC and available P are exceptions.
Clay showed an overall reduction in error across the entire range. Performance for plant available Ca is excellent from 0 to 10,000 in both models, but when local calibration samples are provided, the error above 10,000 is notably reduced (by 11.08%). An even bigger effect is seen with plant available Mg. While the performance of exchangeable K is poor, the addition of local samples aids in centralising the predictions. The model for plant available K is very poor, but the trend clearly improves with the addition of local samples. The effect of the local samples was most visible for available Mg, where the R2 improved by 20% and the RPIQ increased by 78%.
Figure 3 illustrates the performance improvement of using 'fused' models for prediction. In all but two of the models (total Ca and plant available K), fused performance matches or exceeds that of the individual sensor models. In the total Ca model, XRF alone far exceeds the performance of the fused model, but performance was already at a very high level for both sensors. For plant available K, performance was quite low for all three models and the increase seen for XRF alone is likely due to noise in the training process.
[...] parameters of local samples (Stevens et al., 2013; Gogé et al., 2014; Guerrero et al., 2014; Clairotte et al., 2016; Brown, 2007). It was found that the addition of local samples from Hungary notably improved the performance for pH(KCl), clay, exchangeable Ca, available Ca and Mg and total K. The improvement was mainly expressed as a reduction in RMSE values, with only moderately improved R2 values, though some models showed no increase in performance. In the current study, local samples did not affect the performance of CEC, total Mg and available P. It seems that, for the prediction of these parameters, the global dataset of AgroCares already contains similar information on comparable soils from other countries, so no further improvement could be gained from local samples.
The prediction of clay fraction was moderately successful without the local samples and became successful with local samples, but 100% of the local training dataset was needed for this improvement.
The predictions for plant available K were very poor, but could be seen to improve with the successive addition of local samples, which may indicate that the model performance would be further increased with extremely dense sampling. The positive effect of local samples on model performance could be explained by the statement of Viscarra Rossel et al. (2008) that NIR spectra and soil properties can vary with soil mineralogy and soil organic matter content. According to Fabien et al. (2012), when strong spectral features are related to the characteristic under study (as for CaCO3 content), a wide national database can be used alone to calibrate accurate prediction models. In the other cases, for properties involving more diverse spectral regions, the usefulness of a large database spiked with local samples should be established. This confirms the results of Soriano-Disla et al. (2014): 'variables that are predicted by virtue of their correlations with infrared-active soil properties (indirect calibrations) frequently require the development of models for specific soil types, locations and particular environments'.
In the case of CEC, including the local dataset did not increase RPIQ, possibly because the non-Hungarian soil samples in the database already provide sufficient information. Most studies aim to predict soil properties (including CEC) on a catchment or a regional scale. For example, Viscarra Rossel et al. (2006) reported a good R2 (0.73) for CEC prediction in Australian soils. Terra et al. (2015) and Pinheiro et al. (2017) found R2 = 0.72 and 0.68 and RMSE = 0.14 and 5.86, respectively, for CEC prediction in South American tropical soils. Also, Ulusoy et al. (2016) published an R2 of 0.83 and an RMSE of 1.45 for field-scale investigations in Turkey. However, the prediction quality might drop on a global scale, which includes several soil types under various environmental circumstances.
A notable divergence between R2 and RPIQ can be seen for the plant available P content. The drastic difference between the two measures is likely because the validation set contains a few large values which artificially increase the R2. These large values fall outside the interquartile range, but their errors are still factored into the RPIQ measurement, thereby resulting in a relatively small RPIQ. This is a weakness of the RPIQ metric.
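The divergence between the two measures can be reproduced with a small synthetic example in Python. The numbers below are invented for illustration and are not taken from the study; R2 is again assumed to be 1 - SS_res/SS_tot and the quartiles use the standard library's "inclusive" interpolation.

```python
import math
from statistics import mean, quantiles

def r2(obs, pred):
    """Coefficient of determination, ASSUMED form: 1 - SS_res / SS_tot."""
    m = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - m) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def rpiq(obs, pred):
    """Interquartile range of the observations divided by RMSE."""
    q1, _, q3 = quantiles(obs, n=4, method="inclusive")
    error = math.sqrt(mean((o - p) ** 2 for o, p in zip(obs, pred)))
    return (q3 - q1) / error

# Hypothetical validation set: 18 small values plus two very large ones.
obs = [10.0 + i for i in range(18)] + [200.0, 250.0]
# Predictions are biased by 8 units on the small values and are
# only slightly off on the two large values.
pred = [o + 8.0 for o in obs[:18]] + [195.0, 255.0]

with_extremes = (r2(obs, pred), rpiq(obs, pred))
without_extremes = (r2(obs[:18], pred[:18]), rpiq(obs[:18], pred[:18]))
```

In this invented case the two large values lift R2 from a negative value to above 0.95, because they dominate the total variance, while RPIQ stays close to 1 in both cases, because the interquartile range ignores them. Reporting both metrics therefore helps to detect validation sets that are dominated by a few extreme samples.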

Conclusions
In 2021, AgroCares had more than 17,000 soil samples in their global calibration dataset collected from 32 countries in Africa, Asia and Europe. The results of this study show that the global dataset of AgroCares already contains enough information on comparable soils from other countries to predict total calcium, total nitrogen and total phosphorus successfully, regardless of the number of Hungarian samples present. However, in weaker models with R2 lower than 0.95, such as clay content, exchangeable potassium and all magnesium forms, the addition of local Hungarian samples improved the quality of local predictions (except for total Mg, where it decreased).
In the case of special soil parameters, it can be important to identify the local effects that are unique to a country or a region. Based on the experiment, it can be concluded that the model performance in Hungary benefits from the inclusion of local samples. Based on the cases where model performance improved with an increasing number of local samples, it can be concluded that more data is useful. However, it is important to state that adding samples did not always improve model performance and in some cases even decreased it. According to the results, the AgroCares concept for routine soil testing using MIR + XRF sensor technology is viable for Hungarian agriculture.
Funding
Open access funding provided by Hungarian University of Agriculture and Life Sciences.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.