Introduction

Optimised phosphorus (P) fertilisation is desired to enhance farm input efficiencies, improve economic returns and minimize ecologic disturbances. In this respect, reducing P inputs is an actual aim in view of the high P status of many arable soils, at least in Western Europe (Csatho et al. 2007). Nonetheless, optimisation might also result in increased P dosage to exploit soil yield potential. Plant availability of P is a complex outcome of P fertilisation history, but also of multiple and partly interacting soil properties. However, Pavl makes up only a small percentage of total soil P. Diverse extractants are approved to estimate Pavl and to calculate fertiliser dosage. Jordan-Meille et al. (2012) reviewed European soil P tests and fertilising recommendations. The principle is similar everywhere: P extraction with chemical reagents is followed by calibration of Pavl values on the basis of field trials in order to achieve fertiliser recommendations. In Germany, the calcium acetate lactate (CAL) extraction (Schüller 1969) is a standard soil test method to determine Pavl and forms the basis of P fertilisation recommendations (LWK 2015). Note that Pavl as determined by conventional soil tests is operationally defined and does not account for chemical P binding forms (Hartmann et al. 2019).

Available nutrients and other soil properties that directly or indirectly impact crop growth may vary considerably within fields (Patzold et al. 2008). Precision agriculture addresses this problem by adapting fertilisation in different field-zones (Gebbers and Adamchuk 2010); such optimisation can lead to economic benefits in the long term (Schulte-Ostermann and Wagner 2018). However, common agricultural practice is to take composite soil samples that are regarded as representative for the sampled area. In Germany, official advisory services recommend one composite soil sample per 3 ha area (LWK 2015). In most cases, P fertiliser is still uniformly applied because farmers feel uncertain about how to divide a distinct field into management zones. Zones with varying soil properties can be distinguished when conducting a soil survey with diverse sensing techniques. Sensor maps often unravel soil heterogeneity and explain variation in, e.g., grain yield or total biomass (Mertens et al. 2008; Sun et al. 2011). Such information is commercially provided by consultants and helps to define management or soil sampling zones. However, routine soil sensing such as electromagnetic induction most often yields information that is not directly related to available decision rules (Sylvester-Bradley et al. 1999); this is especially the case for phosphorus.

For physical reasons, valuable information about Pavl in soil is not directly readable from sensor signals including electromagnetic induction, gamma spectrometry and optical instruments. However, in several studies, infrared (IR) spectroscopy in the visible and near-infrared (Vis–NIRS) and in the mid-infrared (MIRS) have performed satisfactorily in elucidating relationships between soil spectra and nutrient status as reviewed in Kuang et al. (2012). Spectroscopic models require calibration with conventionally analysed soil samples before predictions for the parameter of interest in unknown samples can be done. For model calibration, multivariate statistics such as partial least squares regression (PLSR) are in widespread use. Even if the soil property of interest has no specific spectral feature, PLSR can perform well for model calibration, when it finds surrogate spectral features that retain satisfactory estimations of the studied soil properties (Gomez et al. 2008).

Reviews by Kuang et al. (2012) and Soriano-Disla et al. (2014) have pointed out that no precise, but only approximate, quantitative prediction of Pavl is possible using IR spectroscopy. The reason is that the low dipole moment between P and oxygen (O) inhibits direct detection of orthophosphate. However, when organically bound P forms a tight relationship with other relevant soil properties, P quantification is generally possible, but requires further testing in soils with a wide range of properties (Abdi et al. 2016). In general, MIRS yields better results than Vis–NIRS where orthophosphate has no spectral response (Soriano-Disla et al. 2014; Kruse et al. 2015). In chemically pure systems, Ahmed et al. (2019) elucidated orthophosphate and molecular mechanisms for phosphate adsorption to goethite using MIRS. Mayrink et al. (2019) successfully predicted exchange resin-adsorbed P, but also without direct spectral influence of soil components. However, P binding mechanisms and partners in soil are manifold, and Pavl is not a chemically defined component. Consequently, Wijewardane et al. (2018) found poor Pavl (Mehlich) prediction in the MIR when analysing 4969 samples from a broad range of soils all over the USA.

The Pavl proportion of total P is partly controlled by organic matter (OM) content and properties that determine IR reflectance of soil. Consequently, IR spectra may contain indirect, i.e., surrogate, information on P fractions (Chang et al. 2001; Abdi et al. 2012; Soriano-Disla et al. 2014). Abdi et al. (2016) successfully predicted organic P in chernozem soil samples because of the close relationship between OM and organic P. Organically bound P (e.g., phytate) is not, or only to a minor degree, captured by conventional soil P tests (Steffens et al. 2010). Abdi et al. (2012, 2016) reported variable success in Pavl prediction depending on the extractant. The CAL method was designed to characterize Pavl (Schüller 1969) and extracts P from diverse binding partners in soil (Steffens et al. 2010; Hartmann et al. 2019). Until now, Vis–NIRS and MIRS have hardly been tested to predict CAL-extractable P (CAL-P).

Most published studies to predict Pavl with emphasis on precision agriculture have been conducted on very few or even single fields, and models were mostly not transferred to fields outside calibration. Mouazen et al. (2007) and Mouazen and Kuang (2016) predicted Pavl in the lab and field with accuracy that was regarded sufficient for practical purposes. Mayrink et al. (2019) successfully predicted Mehlich-3 extractable P by taking diffuse reflectance spectra (325–1075 nm) from resin exchange strips that had been used to extract P from soil. All soil samples came from a 20 ha field; prior to model building, samples were pre-selected to optimise results.

Application of Vis–NIRS directly in the field is generally possible, but taking Vis–NIR spectra in the laboratory still performs generally better, because measurements can be taken under controlled conditions (Rodionov et al. 2014a, b, 2016). Mouazen and Kuang (2016) published a case study on Pavl prediction with mobile Vis–NIRS. For the mid-infrared, portable spectrometers (pMIRS) are available and are increasingly tested in soil science (Soriano-Disla et al. 2017). For example, the recent study by Rogovska et al. (2019) tested the potential for nitrate estimation in precision agriculture. However, prediction of Pavl in situ using pMIRS failed (Ji et al. 2016).

Major challenges concerning Pavl prediction by IR spectroscopy are (i) overcoming the case study character of Pavl prediction both in the Vis–NIR and MIR region, (ii) transferring models from calibration sites to independent fields, and (iii) calibrating Pavl prediction models that are universally applicable in practical precision farming. To address these challenges, the ongoing joint research project “BonaRes-I4S” is developing a mobile sensor platform that integrates non-invasive sensors and soil sampling for chemical and spectroscopic analyses (Gebbers et al. 2016). The objective of this study was to test and to compare Vis–NIR and MIR models capable of Pavl prediction in samples from different fields. The study was conducted in a laboratory setting as needed to evaluate the potential for integrating a pMIRS-based Pavl prediction on the future “BonaRes-I4S” sensor platform.

Materials and methods

Sample set

This study was conducted on archived sets of arable topsoil samples from different research sites. All samples had been analysed in different projects with the CAL method to characterise Pavl and were re-evaluated for this study. Six different localities with soils from loess or loess-derived sediments were sampled (Table 1). Klein-Altendorf (KA; 50°36′57″ N, 6°59′37″ E; four fields), Poppelsdorf (PO; 50°43′34″ N, 7°05′09″ E), Dikopshof (DI; 50°48′29″ N, 6°57′11″ E), and Hilberath (HI; 50°34′46″ N, 6°59′27″ E) were situated close to Bonn. The field at Merklingsen (ME; 51°34′07″ N; 8°00′05″ E) was located 110 km NE of Bonn. The Görzig (GÖ; 51°40′08″ N; 12°00′52″ E) field was located 360 km ENE of Bonn and 20 km N of Halle. The topsoils at KA, DI, ME, and GÖ formed from Weichselian loess. In HI, Pleistocene periglacial slope deposits (PPSD) from loess and underlying weathered Lower Devonian silt and sandstones led to expressed topsoil heterogeneity as described by Heggemann et al. (2017). Soils at PO developed from silty Holocene fluvial deposits and were similar to loess soils with regard to texture, pH, pedogenic iron and OM content. Only the HI soils partly revealed stagnic properties. All samples fell in texture classes of silt loam and loam. Mineralogical composition was not investigated because loess in the Bonn region was considered homogeneous. For ME and GÖ, divergent mineralogical composition could not be excluded with respect to the regionally deviating loess origin, although the clay fraction generally reveals little difference in mineralogy throughout the German loess belt (Gehrt 2000). Soil samples for chemical and spectroscopic investigation were taken from the plough layer (0–0.3 m depth) and were carbonate-free except at GÖ where 18 of 109 samples contained 0.1 to 0.9% carbonate. The pH values hardly varied between locations except at HI where pH was approximately one unit lower than in the other fields (Table 1).

Table 1 Set of soil samples under study. At the localities Dikopshof and Klein-Altendorf, sample sub-sets [experimental blocks (same crop), treatments (different crops)], and fields, respectively] were investigated (sub-sets in parentheses)

To achieve a maximum range of P content at comparable soil properties, two long-term fertilising experiments (LTFE) at PO and DI with 68 and 120 plots were included in the study. For the DI-LTFE, effects of fertilisation, pre-crop and sampling date on spectral properties could be detected; therefore, sample sets were not only assigned to the treatments, but also re-grouped as listed in Table 1. The GÖ field of 30 ha comprises a variable-rate fertilising experiment on 72 m wide strips that was sampled on a 36 m grid. At KA, four fields with similar tilling and fertilising practices (only mineral fertiliser) were sampled. The large sample numbers from small fields rely on barley variety plot trials with variable N and uniform P fertilisation. The HI field had been cropped with cereals over the previous 4 years when sampled; before, it had been a pear orchard with grassed tracks for at least 30 years. The sampling points were located on a regular grid with 21 m width.

P analyses and classification of CAL-P contents

In all samples, Pavl was determined according to the CAL method (Schüller 1969). The CAL extraction (buffered at pH 4.1) is recommended as standard method by the Association of German Agricultural Analytical and Research Institutes (VDLUFA) and applied by official advisory services (LWK 2015). Fertilising recommendation in North Rhine-Westphalia, e.g., relies on five classes of CAL-P contents from very low (A) to very high (E) with allowance for soil texture. Wuenscher et al. (2015) compared various P extraction methods and found a close correlation between CAL-P and Olsen-P (r = 0.801***); yet CAL extracted slightly higher P amounts than Olsen.

Spectra acquisition

Vis–NIRS

Spectra were recorded in the laboratory using an AgriSpec Vis–NIR spectrometer with 350–2500 nm range (Analytical Spectral Devices Inc.; Boulder, CO, USA). Prior to spectra acquisition, soil samples were air-dried, sieved and pestled. Samples were placed and levelled in a petri dish and three mean spectra of 30 single spectra each in slightly different positions were taken as described by Rodionov et al. (2014b). Spectra were taken from a total of 477 samples. No Vis–NIR spectra could be recorded from the archived GÖ samples because not enough sample material was available after conventional soil analyses for the spectral investigation that was not initially planned.

MIRS

Mid-infrared spectra were recorded with a benchtop Bruker Tensor 27 spectrometer (Bruker Optik, Ettlingen, Germany). The laboratory spectrometer was equipped with an automated high throughput device (Bruker HTS-XT) as described by Patzold et al. (2008). Soil samples were air-dried, sieved, and milled prior to spectra acquisition as recommended by Stumpe et al. (2011). The recorded wavelengths were 1250–25,000 nm. Five composite measurements, comprising 120 scans at a time, were conducted for each sample. A total of 586 soil samples were examined.

Spectra analyses and model building

Partial least squares regression (PLSR) with leave-one-out cross-validation was carried out for model calibration using the software “ParLeS” (Viscarra Rossel 2008). Spectra were truncated to 500–2400 nm (Vis–NIR) and 2500–16,667 nm (MIR) wavebands to eliminate noise at the edges. Reflectance [R] was transformed to log (1/R). To account for site-to-site variance, all Vis–NIR spectra were mean-centered prior to modelling. After testing diverse data pre-treatments, the MIR spectra were subjected to PLSR as first derivative. For composite models from different sample sets, data were randomly divided into calibration (2/3) and validation sets (1/3). An exception was the entire DI-LTFE dataset, where three blocks with 24 plots each were used for calibration and two blocks with 24 plots each for test-set validation to achieve a balanced representation of the fertilising treatments.

The samples were taken from various previous projects (see above); in consequence, CAL-P values were not optimally distributed in some of the sample sub-sets (see Supplementary Material, Tables S1–S3). Nevertheless, the full dataset was utilized to encompass the expected variability of future prediction datasets and to ensure the representativeness of samples. For practical usefulness, prediction models should be applicable to independent, i.e., unknown sample sets. Therefore, diverse combinations of calibration and independent test-set validation sample sub-sets were utilized. Calibrated models were transferred to sites that did not contribute to model building. For this test-set validation, complete datasets of the validation site were used. To improve model transferability between localities, the spiking approach as reviewed by Stenberg et al. (2010) was tested for some exemplary cases. Prior to calibration, ten and eight randomly selected samples from other sites were added to samples sets from one (KA) and four (KA, DI, HI, ME) localities, respectively. Statistical parameters of CAL-P distribution for the diverse calibration and validation data sets and sub-sets are listed in Tables S1–S3.

The number of latent variables, herein denoted as “factors” (Viscarra Rossel 2008), was limited to ten to avoid overfitting. Quality of calibration models and validation was evaluated using R2, ratio of performance to deviation (RPD) and root mean square error (of cross-validation or of prediction, resp., RMSECV or RMSEP). Applicability of RPD was generally questioned by Bellon-Maurel et al. (2010), and RPD threshold values have been controversially discussed (Bellon-Maurel and McBratney 2011). Nevertheless, RPD is one of the most popular parameters to evaluate model performance (Bellon-Maurel et al. 2010) and still applied in recent studies (Wijewardane et al. 2018; Mayrink et al. 2019). Accordingly, this study reports RPD and model quality categories as proposed by Chang et al. (2001): (A) excellent models (RPD > 2); (B) acceptable models (1.4 < RPD < 2); (C) non-reliable models (RPD < 1.4). Values for RMSE are important for evaluation of model quality and so were reported for all cross-validations and predictions.

For Vis–NIRS and the MIRS models comprising the entire dataset, the most relevant spectral regions were identified as described in detail by Gomez et al. (2008) to ensure that soil constituents being potentially relevant for phosphate bonding were adequately considered. When (i) the variable importance in the projection (VIP; Viscarra Rossel 2008) exceeded the value 1 and (ii) at the same time, the PLS regression coefficient (b-coefficient) was greater than its standard deviation, the respective wavelength was considered significant.

Calculation of cellulose absorption index (CAI)

In order to better differentiate the various sample sets with respect to their spectral properties, the CAI was calculated. This index is used in Vis–NIRS analyses to detect the presence of plant litter. For MIR, the use of the CAI has not yet been reported in the literature. Values > 0 generally point to the presence of lignin and cellulose, and CAI values of ageing litter decrease over time due to lignin and/or cellulose decomposition (Nagler et al. 2000). Further, CAI can indicate presence of pre-crop residues (Aguilar et al. 2012). The CAI was derived from Vis–NIR spectra as CAI = 0.5·(R2025 + R2215) − R2110 from the sums of reflectance factors R at the wavebands 2000–2050 nm (R2025), 2089–2130 nm (R2110) and 2190–2240 nm (R2215), respectively (Rodionov et al. 2016).

Results and discussion

Testing the general performance of CAL-P prediction on different fields

Vis–NIRS

Quantification of CAL-P with Vis–NIRS was successful when the entire dataset (n = 477) was used for cross-validation (Fig. 1a). Prediction quality was acceptable with RPD = 1.84 and R2 = 0.703 though RMSECV equalled 30.3 and was not satisfactory for precise fertiliser dosage. Soils with high P demand would not have been fertilised and experience yield loss whereas soils without P demand would have unnecessarily received P fertiliser and experienced increased risk of loss to the environment.

Fig. 1
figure 1

Prediction of plant-available P (calcium acetate lactate extractable, CAL-P) with a Vis–NIR spectroscopy and b MIR spectroscopy in topsoil samples from different fields in Germany (laboratory spectra, PLSR leave-one-out cross validation)

The fundamental spectral response in the MIR region and the electron transitions and overtones in the Vis–NIR region pointed to iron oxides, clay minerals and diverse functional groups of SOM (Viscarra Rossel and Behrens 2010) as major binding partners for CAL-P. The exact spectral regions that were selected for model calibration by the ParLeS software are depicted in Figs. S1 and S2. Neighbouring components and functional groups can lead to spectral shifts in the NIR.

Calibrating individual models for single fields led to highly variable results, although the same data were used as for the overall calibration (Table 2). Excellent CAL-P prediction was obtained for two of the five sample sets and three of the 13 subsets. Figure 2a illustrates the reliable (RPD = 4.31, category A) and precise (RMSECV = 11.1 mg P kg−1 soil) result for field KA-2. Sample sets from other fields yielded worse, but acceptable, results (Fig. 2b for PO, Table 2). In some cases, field-individual prediction failed completely (Fig. 2c).

Table 2 Vis–NIRS calibration models (leave-one-out cross validation, CV) for different entire sample sets
Fig. 2
figure 2

Prediction of CAL-P with Vis–NIR spectroscopy (ac) and MIR spectroscopy (d) in topsoil samples of single fields (leave-one-out cross-validation): a KA-2, b PO, c HI and KA-4, and d GÖ. For sample set designations, refer to Table 1

One reason for the partly poor results in P prediction might be the data range that differed among the fields (see Table 1). On the KA fields, fertilisation was not systematically varied, and CAL-P variability consequently relies on natural soil heterogeneity. In contrast, the LTFEs at DI and PO were established on homogeneous fields and the fertilizer treatments have been established for more than 100 years. Consequently, some differences in OM and pH have developed (Table 1). In addition, indirect effects such as variable crop growth have influenced SOM content and P binding and availability. As CAL-P is not defined by a specific binding partner but correlated to various soil properties (Wuenscher et al. 2015), the fertiliser-induced variability of soil properties makes CAL-P prediction complex and difficult. This is perhaps the reason why P prediction was much better on fields KA-2 and KA-3, where fertilisation has not been systematically varied (Table 1). Conceptually, the same is true for field KA-1 where P prediction was acceptable and KA-4 where prediction failed (Fig. 2). However, KA-1 was sampled shortly after barley harvest and the small CAL-P range in KA-4 probably hampered a better calibration result.

Prediction failed in HI although the range in CAL-P was large. Here, the former long-term use as a pear orchard caused highly variable OM quality and P binding conditions due to wood residues in the former tree strips, and grassed inter-row spaces. Soil heterogeneity at HI also was pronounced with respect to relief and parent material (Heggemann et al. 2017).

MIRS

In general, MIRS cross-validations were comparable to those in the Vis–NIR range, with some exceptions (Fig. 1b and Table 3). For the entire dataset, RPD was slightly better, but still only acceptable. Although the data range was large, CAL-P prediction via MIRS was poor for the GÖ sample set (Fig. 2d). Excellent to acceptable performance was achieved for DI-all, PO and KA (except KA-4), as in the Vis–NIRS. Hence, in contrast to the statement of Kruse et al. (2015), MIRS did not generally yield generally better results than Vis–NIRS.

Table 3 MIRS cross-validation results for CAL-P (entire datasets)

Transferring calibration models to leave-out fields

Transferability of models between fields from one locality

All Klein-Altendorf fields belong to the same experimental farm, reveal similar soils and are managed in the same way. The dataset for each of the fields was randomly divided into calibration and validation data. Reference data distributions for all sample sets are compiled in Tables S1–S3. In a first step, fields KA-1–3 were merged, and field KA-4 was excluded because here CV had failed. Calibration (CV) and test-set validation (P) performed “A” in the Vis–NIR as well as in the MIR range (Table 4, lines 1–2 and 13–14). Again, Vis–NIRS and MIRS performed comparably. These results and good performance for the single datasets (Tables 2 and 3) gave rise to the expectation of satisfactory transferability tests.

Table 4 Testing model transferability within a farm at one locality: results of model calibration (leave-one-out cross-validation, CV) at different sites and subsequent application of prediction model to different but comparable fields (P; test-set validation); field KA-4 was excluded from this test because single-field CV failed (see Tables 2 and 3); spectra pre-treatment Vis–NIRS: log (1/R), MIRS: 1st derivative. Statistical parameters for the sample sets are listed in Table S2

However, calibrating individual models (CV) for each KA field and transferring them to other KA fields for test-set validation (P) yielded mixed results. In most cases, no CAL-P prediction was possible with Vis–NIRS calibration models, although model calibration had yielded RMSECV values that were acceptable (Table 4). For subsequent prediction, R2 as well as RPD were small and most RMSEP values were too large to serve as a basis for precise fertiliser dosage when compared to the recommended soil P values (Table 6). This result was insofar unexpected as the ranges in values of calibration and validation data were wide and overlapping (Table S2). Only CAL-P prediction for field KA-2 was acceptable after calibrating a model with KA-3 data. For the MIR range, some predictions were slightly better (Table 4), but the overall results were comparable to those of Vis–NIRS. Consequently, in this respect MIRS was not superior to Vis–NIRS.

Figure 3a illustrates Vis–NIRS prediction for KA-2 based on the KA-3 model. Prediction of KA-3 completely failed when KA-2 were used for calibration (Fig. 3b). It is assumed that soil properties connected to CAL-P or the IR-active soil components related to P extractability varied between the two fields. Sampling date, soil surface cover and pre-crop were different for KA-2 and KA-3 (Table 1). Field KA-2 was sampled in November approximately 6 weeks after sugar beet harvest prior to tillage, i.e., fresh beet leaf residues were present. In contrast, KA-3 sampling was conducted in March, approximately 7 months after grain maize harvest with subsequent ploughing, cultivation and wheat sowing (October). Conventionally measured CAL-P contents varied between the fields, but the CAL-P ranges largely overlapped.

Fig. 3
figure 3

Transferring Vis–NIRS calibration models between two fields of the same locality (i.e., on the same farm): calibration (CV, leave-one-out cross validation) and CAL-P prediction (P, test-set validation); a CV for field KA-3 and P for field KA-2 (entire dataset and for samples < 207 mg Pobserved kg−1 to avoid extrapolation); b CV for field KA-2 and P for field KA-3 (entire dataset and reduced dataset with > 98 mg Pobserved kg−1 and N = 32 to avoid extrapolation and imbalanced CV/P sample numbers. Values for RMSECV and RMSEP are given in Table 4

Transferring models between localities

The performance of calibration and test-set validation was almost identical for the Vis–NIR and the MIR range. Therefore, only the Vis–NIR results are described here and shown in Table 5. The respective MIR results can be found in the supplementary material (Table S4). Calibrating a model for the entire dataset of KA-1–4 (n = 210) yielded, as expected from the results already shown, a highly performing cross-validation (Table 5, line 1). Transferring this model to the localities PO, DI, HI and ME revealed non-satisfactory test-set validations (lines 2–5). In expectation of better validation results, farmyard manure (FYM) treated plots were excluded from the PO and DI sample sets (PO–FYM, DI–FYM) because no FYM had been applied to KA fields. However, prediction performance further decreased (Table 5, lines 6–7) thus indicating a need for widely varying calibration samples when model transfer is envisaged. On the other hand, restricting the DI samples to those receiving mineral fertiliser slightly improved prediction (line 8). Spiking the model by adding ten randomly selected DI samples to the calibration set improved independent P prediction and performed “B” instead of “C” (lines 9–10). Further combinations of single-site CV and prediction were tested, but no satisfactory results were achieved. Omitting KA-4 samples from the CV sample set led to worse P predictions in the test-sets, although KA-4 had failed in the single-field calibration. Overall, the results indicated that calibration sets with expressed variability—also concerning quality of OM as binding partner for Pavl—may lead to better model transferability.

Table 5 Testing vis–NIRS model transferability between localities: results of model calibration (leave-one-out cross-validation, CV) on large datasets from different calibration localities and prediction (P; test-set validation) for a single hold-out field (complete dataset of each field). For annotations, see Tables 2, 3, 4

Looking for a more universal applicability, the complete datasets of KA, DI, HI and ME were used for calibration and the model was applied to the PO dataset. This combination was chosen because PO yielded a well performing single-locality cross-validation; despite the large variability of the calibration set, test-set validation failed. Spiking the calibration data set with eight randomly selected samples from PO did not improve the result (Table 5, lines 11–14).

Reasons for limited transferability of calibration models in the Vis–NIR and MIR range

The DI-LTFE offered good opportunities to study the effects of pre-crop and sampling date as possible reasons for limited model transferability. All treatments were present in each of the five blocks (balanced sample set). Consequently, CAL-P contents covered similar ranges in each block (Table 1).

Vis–NIRS

Model calibration (CV) yielded satisfactory P prediction, when all 120 plots were considered. The same applied (except for DI-III) when calibration was performed block-wise as already shown in Table 2. Remarkably, selecting calibration and validation data with respect to pre-crop (block-wise) or, alternatively, to treatment led to differentiated results. Again, this points to better performance of heterogeneous sample-sets. Best results were achieved for block-wise calibration (i.e., integrating all treatments) and at least 5 months after harvesting the pre-crop. Note that visible straw and other particles had been thoroughly removed before spectra acquisition. Nevertheless, different combinations of Vis–NIR calibration and validation sample sub-sets led to contrasting results.

Predicting CAL-P in DI-I samples with the DI-II model was satisfactory (performance B), but failed the other way around (not shown). This was insofar surprising as both blocks were sampled at bare surface and 5 months after pre-crops with easily degradable residues. More calibration–validation combinations were tested, but no general rule could be found. It seems that crop residues, perhaps at the molecular scale, hamper model transferability. Separating the DI sample-set into the treatments “with” and “without FYM”, i.e., reducing the variability of OM quality, performed only “D” (Table 2, lines 7–8); transferability was consequently not further tested. Much better results were achieved when samples were separated in “with” or “without mineral P” (+ mineral P and – mineral P, respectively; Table 2, lines 9–10). However, validation with the respective opposite treatment failed.

These results point to the complexity of boundary conditions for successful CAL-P prediction. Further, they likely explain why published results on P prediction are somehow contradictory as reviewed by Stenberg et al. (2010) and Soriano-Disla et al. (2014). Over time, decreasing amounts of freshly added phosphates are being extracted by the CAL method (Hartmann et al. 2019). Lozier et al. (2017) found variable concentrations of water-extractable soil P in different cover crops and crop residues over time during the non-growing season. Accordingly, Delin (2016) reported that release of soluble P from different organic amendments disappeared within 2 months. During breakdown of straw, leaves and root residues, diverse intermediate organic substances formed dissolved OM while revealing seasonal dynamics (Kalbitz et al. 2000). In the MIR, spectral features of dissolved OM originating from maize and rye or wheat differed (Ellerbrock and Kaiser 2005). Consequently, it is assumed that crop, pre-crop, fertilization regimes, residence time of P released from fertilizer or crop residues and further factors affected CAL-P at the sampling date and model calibration. Such differences were obviously not, or not fully, mirrored in the spectra. Temporarily changing spectral features and their—perhaps incoherent—attribution to CAL-P by PLSR obviously led to highly variable prediction quality. The elemental analyses (organic carbon content, C:N ratio) could not explain the results (Table 1).

To summarize, the influences on Pavl were numerous, complex and differed from field to field. Mayrink et al. (2019) observed that lesser factors were sufficient for modelling when taking spectra from exchange resin with adsorbed P instead of scanning soil samples directly. They suggest that this fact is responsible for their very good results. Indeed, the rather high factor number that was needed for model building in this study underlines the problem complexity that probably has decreased model transferability.

MIRS

Overall, for MIRS, the same general statements can be summed up as for Vis–NIRS. However, MIRS-CV results for the single DI blocks and for the treatments ± FYM were mostly worse than using Vis–NIRS (Tables 2 and 3) and performed only “B” and “C”. Consequently no test-set validation was conducted. As for Vis–NIRS, separation into “with” and “without mineral P” yielded very good CV results, but test-set validation with the opposite treatment failed. The factors determining CAL-P are, as discussed above, mostly related to OM quality, and Vis–NIRS obviously better captures the complex situation.

Cellulose absorption index (CAI) as elucidation

The CAI potentially provides information about the presence of variable amounts of degrading plant material (Nagler et al. 2000). Aguilar et al. (2012) showed that CAI is sensitive to partly degraded residues insofar as CAI of soil with pea residues was affected if wheat residues from the previous year were present. Here, CAI was calculated to point out differences in OM quality between the sample sets. Note that CAI is only one possibility to better characterise spectral variability of soil samples; it is shown here as exemplary indicator. Figure 4 shows mean CAI’s for different sample sets and sub-sets.

Fig. 4
figure 4

Cellulose absorption index for different sample sets and sub-sets. Error bars represent standard deviation

CAIs vary in a large range between the localities but also within some sample sub-sets. Between treatments of DI-LTFE, variation is small. However, pre-crop seemed to affect CAI, because considerable variation arose when the LTFE dataset was sorted by blocks (compare blocks DI-I and DI-III). High standard deviation occurs because sorting by treatments integrates different crops and, in turn, each block comprises all treatments. Accordingly, CAI differences between the KA fields occurred. After sugar beet harvest, the CAI was lower than after cereal harvest at other KA fields, although soil samples were taken only 6 weeks after beet but ≥ 3 months after cereals. For the two fields with the highest CAI values, single-field CV failed: KA-4 was sampled shortly after cereal harvest (Table 2, line 16) and, at the ancient orchard HI (line 17), wooden residues (roots, branches) were still present. This underlines that type of harvest residue—perhaps also at the molecular scale—and sampling date influenced spectra and sometimes hampered CAL-P prediction.

Calibration for DI after separation into treatments + FYM and − FYM failed (Table 2, lines 7–8), although CAIs were similar. Thus, more factors than CAI alone must have been relevant. Note the small variability of organic carbon contents within the sample sets + FYM and − FYM compared to any other grouping within DI (Table 1). The better performance of the models + minP and − minP compared to the FYM grouping was perhaps due to the larger organic carbon content range. Accordingly, Abdi et al. (2012) were not able to predict Pavl with Vis–NIRS and concluded that the low correlation of the P related properties and soil C were the reason. However, the authors combined soil samples from 6 years of a plot experiment, and the results here additionally turned out the effect of pre-crop and harvest residues on P prediction (see above).

Of course, CAI can only serve as a proxy for changing OM properties and related P binding. This becomes obvious by the weak transferability of the universal calibration from DI–KA–HI–ME to the validation site PO. The CAI mean and standard deviation of the DI–KA–HI–ME dataset include CAI for PO (Fig. 4), suggesting good transferability. Nevertheless prediction (without or with spiking) was not satisfactory (Table 5). Consequently CAI alone is not an appropriate parameter to decide upon model application to unknown samples. Yet, in this study, neither the details of P binding nor spectral features of OM were elucidated. Further, more factors controlling CAL-P (e.g., iron oxides) and related spectral features should be considered but were not investigated in this study.

Utilisation of IR-based P prediction in fertilisation practice

In Germany, P dosage recommendation relies on soil CAL-P content, in some regions also on soil texture, and crop yield (Table 6, see also Jordan-Meille et al. 2012). In view of the primary goal to integrate spectroscopic P determination into agricultural practice, a correct estimation of the P class is sufficient and marks a step towards variable rate P dosage.

Table 6 Classes of CAL-soluble phosphorus contents in arable topsoils, P fertilization recommendation for North Rhine-Westphalia (LWK 2015), and percentage of correct and wrong fertiliser recommendation on the basis of spectroscopic P prediction

Applying the universal Vis–NIRS and MIRS models, > 60% of the predictions met the correct CAL-P class (Table 6). The CAL-P status prediction was better for soils in classes B (low P), C (optimal status), and D (high P status) than for soils with very low (class A) or very high (class E) CAL-P contents. In the latter cases, spectroscopic P estimation would often lead to wrong fertilising recommendations. Consequently, the procedure cannot yet fully be recommended for practical application. However, for some single locations, results were better than for the entire dataset (e.g., for KA 73%, not shown). Accordingly Stenberg et al. (2010) and Gholizadeh et al. (2013) reported that local calibrations outperform global ones, although spectral libraries from numerous sites can be advantageous for calibrating models (Wijewardane et al. 2018). Until now, Vis–NIR and MIR spectral libraries for Pavl in German arable soils are not available. Considering the results of this study will help to construct such libraries and to reduce costs for P analyses. The calculated acceptable costs for sensor-based variable-rate P fertilisation allow for future introduction of spectroscopic Pavl determination in precision farming practice as pointed out by Schulte-Ostermann and Wagner (2018) for the Görzig site.

Limitations of the present study

Currently, the results are only valuable for the German standard CAL method. Nevertheless, the findings can help to improve P model calibration also for other soil tests that are operationally and not chemically defined. However, individual calibration of spectral models is still mandatory for each P extraction method.

Surprisingly, no clear preference for Vis–NIRS or MIRS could be derived from the results, because performance was similar under laboratory conditions. Both techniques revealed pros and cons. Although MIRS has not so far been introduced into mobile field applications, the results of this study encourage testing portable MIR spectrometers in the field.

The overall unsatisfactory prediction results were perhaps the consequence of dataset composition: the calibration and validation samples were not pre-selected but taken from previous projects as entire sets. This decision had been taken in order to explore the value of archived samples and data for future model calibration on a regional basis. However, it is widely accepted that calibration and validation samples should be selected to cover a wide range of the parameter of interest (Stenberg et al. 2010; Soriano-Disla et al. 2014).

This study was based on the linear standard PLSR calibration. Non-linear machine learning approaches can overcome some modelling restrictions (Gholizadeh et al. 2013). Nawar and Mouazen (2019) successfully performed random forest SOC modelling on a spectral library that was spiked with selected samples from the target field. As a consequence, the authors recommended testing the approach with other soil properties. This present study confirms this statement.

While standardizing spectra recording (surface properties, light source, geometry, etc.) is known to be an important issue when building spectral libraries (Gholizadeh et al. 2013; Soriano-Disla et al. 2014), no related references were found concerning the soil sampling itself. This study showed that sampling season, pre-crop, etc. affect calibration and prediction performance. One reason was, presumably, the presence of different harvest residues of microscopic or molecular size. Such residues impeded successful P prediction when the test-set samples contained no, less, or other such substances. In this respect, spiking the calibration set resulted in no significant prediction improvement. Consequently, for more universal validity of P prediction models, attention can easily and must thoroughly be paid to composition of the sample sets. This can, e.g., necessitate strict co-ordination of sampling campaigns or standardization of sampling guidelines with respect to pre-crop and sampling date. Although not examined in this study that focused on non-stagnic loess soils, it must be assumed that also parent material and pedogenic iron oxides should be taken into account. Further work is needed to systematically elucidate spectral features of factors that control Pavl extractability. Future research should focus on standardisation of the sampling procedure in order to enhance performance of Vis–NIRS, and MIRS prediction of Pavl. The development of a sampling protocol to build a broader and valuable data pool as a basis for further model improvement is recommended. This guideline should include standards for calibration sample collection including pre-crop species, sampling date and presence of harvest residues. Such sampling protocol may make it possible to improve transferability and universal validity of Pavl models. With this precondition, Pavl prediction using Vis–NIRS or MIRS will probably further develop towards practical applicability in precision agriculture.

Conclusions

Arable soil samples from multiple sites in Germany were used to evaluate performance of Vis–NIR and MIR spectroscopy for predicting CAL-P. In the laboratory, Vis–NIRS and MIRS proved to be potentially useful for developing future applications in mobile sensing. Performance of Vis–NIRS or MIRS conditions was similar and no preference could be derived. Analysis of the spectral data resulted in the rapid quantification of plant available CAL-P at a moderate level of precision (30.3 mg kg−1). Unfortunately, P prediction often failed for independent test fields indicating a lack of transferability, occasionally even between neighbouring fields being similarly managed. Therefore, universal model validity and transferability to unknown sites are not yet satisfactory. A wide range of CAL-P alone did not ensure satisfactory modelling results. Variability in P extractability and spectral reflectance properties remained unexplained, but were ascribed to season, fertilising regime, pre-crop and harvest residues. The CAI—although originally designed to distinguish plants and soil in remote sensing—proved useful to explain varying transferability. However, CAI alone was not a reliable indicator to predict model transferability. Non-linear machine learning modelling instead of PLSR might help to better account for the complex problem of Pavl prediction. Anyway, Vis–NIRS and MIRS at their current state are already sufficient to outperform the actual practice of evaluating P requirement, because the models frequently predicted the correct P recommendation for soils classified as having low, optimal or high CAL-P contents. The application of a future sampling guideline will most probably lead to better Pavl predictions. The expected increase in prediction performance of Vis–NIRS and MIRS makes an introduction into precision agriculture seem possible.