1 Introduction

Due to limited time and financial resources, ore grade estimation generally depends on limited data, which mostly consist of drill hole samples. The mass of rock samples collected from a deposit is typically on the order of 1 in 1,000,000 of the mineral deposit that will be exploited [1]. This limited data should be used with care to produce reliable estimates of the spatial distribution of the target ore grade. These estimates are used in many areas, such as assessing economic viability, making investment decisions about a deposit, short- and long-term mine planning, and estimating mine life [2,3,4,5]. In ore grade estimation, classical methods such as inverse distance weighting (IDW), stochastic simulation, and, most widely, kriging and its variants are used [6,7,8,9,10,11,12]. These geostatistical techniques are readily available in several commercial and open-source software packages [13]. The availability of numerous software packages does not mean that these techniques can be easily applied to ore grade estimation: steps such as variogram modelling, detection of possible anisotropy and trends, and determination of kriging plans remain challenging tasks [14]. These steps generally require expert knowledge and experience [15]. Due to the complexity of classical geostatistical methods such as kriging, alternative techniques are being developed. Among these alternatives, machine learning (ML) algorithms provide a rich spectrum of approaches complementary to classical geostatistical methods.

ML algorithms learn directly from available data to perform mainly regression and classification tasks, generally without making assumptions about the data [16]. Owing to the power and simplicity of ML algorithms, their use is gaining popularity in ore grade estimation [17]. Many researchers have used artificial neural networks and their variants, fuzzy logic, support vector machines, regression trees, extreme learning machines, and random forests in ore grade estimation with encouraging results [10, 18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]. In these studies, the distributions of Au, Fe, Cu, Mo, Ag, Al2O3, and Zn contents were estimated. For resource estimation of iron ore deposits, neural networks and their variants are the dominant approaches [23, 25, 34,35,36], while other techniques such as random forests and fuzzy logic have also been used [27, 37]. Even though gradient boosting trees have been applied successfully to the estimation of gold deposits [38, 39], the method has not yet been applied to the estimation of Fe content. The estimation steps for an iron ore deposit are similar to those for other mineral resources. Nevertheless, mineral deposition processes are unique to each deposit, and this affects the spatial continuity of the underlying element [40,41,42]. For example, gold deposits show higher spatial variability than iron deposits. For this reason, it is important to show that methods that have been applied successfully to other commodities can also be used to estimate the content of iron ore, whose spatial variability differs from that of other commodities.

In this study, Fe grade estimation of an iron deposit is performed using the XGBoost algorithm, and the results are compared with traditional ordinary kriging. Due to the different natures of these estimation methods, different steps must be taken to perform the estimations. In estimation with XGBoost, hyper-parameter optimization is performed using grid search to determine the most appropriate hyper-parameter values, with root mean square error used to evaluate estimation performance. In estimation with ordinary kriging, experimental variograms are calculated in different directions, and model variograms are fitted to them. Since its introduction to geostatistics, cross-validation has been the cornerstone technique for assessing the acceptability of variogram models; it is therefore used here to assess the fitted models [43,44,45,46,47]. Estimations are then performed using the model variogram. For both methods, composites of Fe grades are used as input data. Finally, the estimations are compared with these composite values and with each other by means of summary statistics and, taking the kriging estimates as the base, residual values. Results show that the XGBoost algorithm produced estimates with a higher range than the kriging estimates. However, the method still suffers from smoothing, like kriging: the minimum and maximum values of the estimates were, respectively, higher and lower than those of the composites.

2 Methods

2.1 Extreme Gradient Boosting (XGBoost)

XGBoost is a member of the ensemble learning family: a supervised, parallel, scalable tree boosting system that can be used to solve both regression and classification problems [48]. An ensemble can be defined as the fusion of two or more trained models to improve on the performance of the underlying individual models [49, 50]. While the fusion of models increases generalization, local and specific information is also captured [50]. These properties make ensemble techniques among the most powerful machine learning alternatives [51]. Due to their simplicity and success, ensemble learners are used in many applications of classification, regression, ranking, and anomaly detection [52,53,54,55].

In recent years, a large number of ensemble machine learning approaches have been proposed [52]. Among the alternatives, boosting techniques show promising results, achieving the best performance as measured by squared correlation (R2) [56]. In boosting algorithms, weak learners, which are only slightly better than random guessing, are combined iteratively to generate a strong learner. The aim of gradient boosting is to approximate the function \({F}^{*}\left(x\right)\) that maps inputs x to outputs y, given the dataset \(D={\left\{{x}_{i},{y}_{i}\right\}}_{1}^{N}\). This approximation is reached by minimizing the loss function \(L\left(y,F\left(x\right)\right)\). The additive approximation of \({F}^{*}\left(x\right)\) is built as a weighted sum of functions:

$${F}_{m}\left(x\right)={F}_{m-1}\left(x\right)+{p}_{m}{h}_{m}\left(x\right)$$
(1)

where \({p}_{m}\) is the weight associated with the mth function \({h}_{m}\left(x\right)\). The approximation starts with a constant approximation of \({F}^{*}\left(x\right)\):

$${F}_{0}\left(x\right)=\text{arg}\;\underset{\alpha }{\text{min}}\sum _{i=1}^{N}L\left({y}_{i},\alpha \right)$$
(2)

and the subsequent models are obtained by minimizing:

$$\left({p}_{m},{h}_{m}\left(x\right)\right)=\text{arg}\;\underset{p,h}{\text{min}}\sum _{i=1}^{N}L\left({y}_{i},{F}_{m-1}\left({x}_{i}\right)+ph\left({x}_{i}\right)\right)$$
(3)

\({F}^{*}\) is optimized greedily using gradient descent. Each \({h}_{m}\) is trained on the dataset \(D=\left\{{x}_{i},{r}_{mi}\right\}\), where the pseudo-residuals \({r}_{mi}\) are calculated as follows:

$${r}_{mi}=-{\left[\frac{\partial L\left({y}_{i},F\left({x}_{i}\right)\right)}{\partial F\left({x}_{i}\right)}\right]}_{F\left(x\right)={F}_{m-1}\left(x\right)}$$
(4)

One can calculate the value of \({p}_{m}\) by applying line search optimization.
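The boosting recursion above can be sketched in a few lines of Python. This is an illustrative toy for squared loss only, using depth-1 trees (stumps) as the weak learners \({h}_{m}\) and a fixed shrinkage factor in place of the line-searched weight \({p}_{m}\); it is not the production XGBoost implementation:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (stump) to the pseudo-residuals r."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    f0 = y.mean()                 # Eq. (2): constant model minimizing squared loss
    F = np.full(len(y), f0)
    models = []
    for _ in range(n_rounds):
        r = y - F                 # Eq. (4): pseudo-residuals for squared loss
        h = fit_stump(x, r)       # weak learner h_m fitted to the residuals
        F = F + lr * h(x)         # Eq. (1): additive update, fixed weight p_m = lr
        models.append(h)
    return lambda q: f0 + sum(lr * h(q) for h in models)

# toy 1-D demonstration on synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)
predict = gradient_boost(x, y)
rmse = np.sqrt(np.mean((predict(x) - y) ** 2))
```

After 50 rounds the boosted ensemble fits the smooth signal far better than the initial constant model, illustrating how weak learners combine into a strong one.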

In XGBoost, only decision trees are considered as base regressors or classifiers in the boosted ensemble [48]. The complexity of the trees is controlled by a regularized variant of the loss function:

$${L}_{xgb}=\sum _{i=1}^{N}L\left({y}_{i},F\left({x}_{i}\right)\right)+\sum _{m=1}^{M}{\Omega }\left({h}_{m}\right)$$
(5)
$$\Omega\left(h\right)=\gamma T+\frac12\lambda\left\|w\right\|^2$$
(6)

where \(T\) is the number of leaves and \(w\) is the vector of output scores associated with the leaves. XGBoost uses parallel processing, regularization, and tree pruning, which markedly increase regression and classification accuracy compared to plain gradient boosting. These strengths have made XGBoost widely used in regression problems [57].
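To see how the penalty in Eq. (6) acts, consider a single leaf under squared loss. Following the closed-form leaf weight of [48], \(w^{*}=-G/\left(H+\lambda \right)\), where G and H are the sums of the first and second loss derivatives over the examples in the leaf. A short illustrative sketch (the residual values are hypothetical):

```python
import numpy as np

def leaf_weight(residuals, lam):
    # Squared loss: g_i = -r_i and h_i = 1, so w* = -G / (H + lambda)
    G = -residuals.sum()
    H = float(len(residuals))
    return -G / (H + lam)

r = np.array([2.0, 3.0, 4.0])        # residuals falling into one leaf
w_plain = leaf_weight(r, lam=0.0)    # no penalty: the plain leaf mean, 3.0
w_reg = leaf_weight(r, lam=3.0)      # lambda shrinks the leaf score toward zero
```

The larger \(\lambda\) is, the more the leaf output is shrunk, which is how the regularization term tempers individual trees.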

2.2 Ordinary Kriging (OK)

Kriging is a widely used spatial estimation method that assigns weights to neighboring data to make estimates at unsampled locations.

Estimation with ordinary kriging depends on fitting a variogram model to the experimental variogram values, which are calculated for separation distance h as:

$$\gamma \left(h\right)=\frac{1}{2n}\sum _{i=1}^{n}{\left({Z}_{i}-{Z}_{i+h}\right)}^{2}$$
(7)

where \(\gamma \left(h\right)\) is the experimental variogram value at distance h and n is the number of pairs used in the calculation. In resource estimation, experimental variograms are generally calculated in the horizontal and vertical directions. In the horizontal plane, experimental variograms are calculated in four directions, at azimuths 0°, 45°, 90°, and 135°. The next step is to fit a model to the experimental variogram; the most commonly used model is a nested combination of a spherical structure and a pure nugget effect. Once a model variogram is fitted, estimation can be made with the kriging approach. Despite its widespread use, kriging has limitations in spatial estimation.
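As a concrete illustration, an omnidirectional experimental variogram (the conventional squared-difference form, with pairs binned by separation distance) can be computed as follows; the coordinates, values, and lag settings are purely hypothetical:

```python
import numpy as np

def experimental_variogram(coords, values, lags, lag_tol):
    """Omnidirectional experimental variogram: mean of 0.5*(Z_i - Z_j)^2
    over all pairs whose separation falls within lag_tol of each lag."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    half_sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)          # count each pair once
    d, half_sq = d[iu], half_sq[iu]
    gamma, n_pairs = [], []
    for h in lags:
        in_bin = np.abs(d - h) <= lag_tol
        gamma.append(half_sq[in_bin].mean() if in_bin.any() else np.nan)
        n_pairs.append(int(in_bin.sum()))
    return np.array(gamma), np.array(n_pairs)

# hypothetical 2-D data: structure along X plus a small nugget-like noise
rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, (200, 2))
values = np.sin(coords[:, 0] / 20) + rng.normal(0, 0.1, 200)
gamma, n = experimental_variogram(coords, values, np.arange(5, 55, 10), lag_tol=5.0)
```

For spatially continuous data such as these, the computed variogram rises from a small short-range value toward the larger variability seen at long separations.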

One of the main limitations of kriging is that it requires estimation of a variogram model representing the spatial continuity of the target variable before unsampled locations can be estimated. Variogram modelling is subjective and often requires considerable experience. In addition, the variogram is sensitive to deviations from normality or symmetric distributions [58]. This sensitivity arises from its inherent reliance on squared differences: even a single outlier can significantly distort the experimental variogram, because it may participate in numerous paired comparisons across several, or even all, lag intervals [59]. Another limitation is that, while kriging theory assumes an infinitely large study area, practical applications are confined to finite domains, often delineated by geological boundaries. This discrepancy can manifest as the “string effect” when kriging along linear features such as drillholes. Due to the limited spatial extent of the domain, samples located at the ends of these linear features have fewer neighboring data points, so the kriging system assigns them greater weight, potentially leading to biased estimations, particularly in non-stationary domains where the spatial characteristics of the variable exhibit systematic variations. This effect becomes especially critical when the end and central samples within the drillhole data differ inherently [60, 61].

3 Case Study

3.1 Drillhole Data and Compositing

To estimate the Fe grade of a deposit located in Türkiye, 30 drillholes were drilled with an average horizontal spacing of 45 m. All drillholes are vertical except one with −75° inclination and 154° azimuth. The total length of all drillings is 10,650 m, with an average of 355 m per hole. A total of 1010 samples were collected from the drillholes, with an average length of approximately 1 m and lengths varying between 0.5 and 2 m. Each sample is represented by Easting (X), Northing (Y), Elevation (Z), and Fe (%) content. Due to the confidentiality agreement with the company that owns the deposit, no further information, such as the location and name of the deposit, can be given. Plan and oblique views of the drillholes are shown in Fig. 1.

Fig. 1
figure 1

a Plan and b oblique view of the drillhole traces

Compositing is a well-known and standard technique in grade estimation when data have unequal sampling lengths. All samples are composited into 1 m lengths using a length-weighted approach. The acceptance rate for composites is taken as 50%, meaning that composites shorter than 50 cm are considered short composites and discarded from the composite dataset. Only two short composites were discarded, and the remaining 1058 composites are used in the estimations. The relative frequency distribution of the Fe grades of the composites is shown in Fig. 2.
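The length-weighted compositing rule described above can be sketched as follows. This is a simplified single-hole, down-hole-depth version, and the sample intervals and grades are hypothetical:

```python
def composite(samples, comp_len=1.0, min_len=0.5):
    """Length-weighted compositing of (from, to, grade) intervals into
    fixed-length composites; composites shorter than min_len are discarded."""
    end = max(t for _, t, _ in samples)
    composites = []
    top = 0.0
    while top < end:
        bot = min(top + comp_len, end)
        length = grade_x_len = 0.0
        for f, t, g in samples:
            overlap = min(t, bot) - max(f, top)   # sample length inside interval
            if overlap > 0:
                length += overlap
                grade_x_len += overlap * g
        if length >= min_len:                     # the 50% acceptance rule
            composites.append((top, bot, grade_x_len / length))
        top += comp_len
    return composites

# hypothetical samples: 1.5 m at 40% Fe, 0.5 m at 50% Fe, 1.7 m at 46% Fe
comps = composite([(0.0, 1.5, 40.0), (1.5, 2.0, 50.0), (2.0, 3.7, 46.0)])
```

The second composite (1 to 2 m) mixes 0.5 m at 40% with 0.5 m at 50%, giving a length-weighted 45% Fe, and the final 0.7 m composite is kept because it exceeds the 0.5 m acceptance threshold.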

Fig. 2
figure 2

Distribution of Fe composites

As seen in Fig. 2, Fe grades show a negatively skewed distribution. Nearly 85% of the data lies between 45 and 50% Fe. Only 3 of the composites have Fe grades higher than 50%; these are isolated occurrences that do not show spatial continuity.

3.2 Estimations

Estimations are performed with both XGBoost and ordinary kriging. These methods demand different steps to estimate the ore grade distribution. All estimations are performed on a block model that represents the mineralization with 5 × 5 × 5 m blocks in the X, Y, and Z directions, consisting of 29,262 blocks. Estimations are made at the midpoints of these blocks and are taken as the Fe grade of the corresponding block. The estimation steps specific to each method are explained in detail in the following sections.

3.2.1 Estimation with XGBoost

The input data are the X, Y, and Z coordinates of the composites, and the output is the Fe analysis. Estimation with XGBoost requires selection of the algorithm's hyper-parameters. In general, the default parameters provided by the XGBoost package are not the best option: the performance of XGBoost is sensitive to the selected parameters, and inappropriate parameters result in unacceptable estimates. For this reason, some parameters should be tuned to estimate the Fe distribution of the deposit. In machine learning, data splitting is done to avoid possible over-fitting; the composite dataset was randomly divided into training and test sets of 80% and 20% of all data, respectively.

To find acceptable parameters, the grid search methodology is used. Grid search is easy to implement and understand [62]: it is an optimization method that tries all possible combinations of the given parameters, and the combination that gives the best result according to the performance criterion is selected. Many alternatives exist for parameter tuning of ML algorithms [63,64,65,66]. While grid search is computationally intensive, it offers the advantage of exhaustive exploration of the parameter space; this guarantees evaluation of all possible combinations, potentially leading to superior results [64, 67]. In this study, root mean square error (RMSE) is the performance criterion, and the combination with the lowest RMSE is considered the best option. Relying only on grid search may result in overfitting, which is undesirable in estimation with ML in general. Cross-validation techniques such as train/test split, K-fold, stratified K-fold, and leave-one-out can be used to mitigate overfitting, although completely avoiding it is impossible. In this study, the K-fold cross-validation technique is used: the data points are split into K equal-sized subsets, called folds, and each subset in turn is used to test the performance of a model trained on the remaining subsets. To tune the parameters of XGBoost, the grid search parameters given in Table 1 are used with K-fold cross-validation (K = 10).

Table 1 Parameters that are tuned in XGBoost estimations (Default parameters provided by Python XGBoost package are shown in bold)

The parameter ranges in Table 1 are determined as follows. The eta range starts from 0.01, a value close to zero, and its upper boundary is 1, the maximum value that eta can take. The max_depth parameter starts from 3, since smaller values prevent the model from fitting in practice, and its maximum is set to 15, since increasing tree depth increases the possibility of overfitting. The min_child_weight parameter is the minimum weight required to create a new node; its range is selected between 1 and 10, where 1 is the minimum integer value the parameter can take. An upper boundary higher than 10 generally smooths the estimation results, because large groups in the leaf nodes prevent the algorithm from capturing values at the upper and lower tails of the dataset. By nature, the subsample parameter can take values between 0 and 1. In ensemble learning methods like XGBoost, the subsample ratio typically ranges from 0.5 to 1.0 during decision tree construction; this parameter governs the proportion of training data points used to grow each individual tree, so a subsample ratio of 0.5 means that half of the training data is randomly sampled before building each tree. To prevent overfitting, the subsample range is set between 0.5 and 1. The parameter combination visited in grid search with the lowest RMSE is taken as the best alternative. The average RMSE in K-fold cross-validation was 0.68. The fitted model was then used to estimate the test data, whose values are already known; the RMSE between the test data and the predictions was 0.69, very close to the cross-validation RMSE. As an alternative to RMSE, mean absolute error (MAE) was also monitored during parameter tuning; the average MAE was 0.67 and 0.65 for K-fold cross-validation and test data, respectively, the lowest values among the alternatives. This can be interpreted as the model generalizing well enough to be used in estimation.
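The grid search plus K-fold loop can be sketched generically as below. To keep the example dependency-free, a simple nearest-neighbour regressor with one hypothetical hyper-parameter stands in for XGBoost, and the data and search grid are synthetic stand-ins for the composite coordinates and Fe grades:

```python
import itertools
import numpy as np

def knn_predict(Xtr, ytr, Xq, k):
    """Mean of the k nearest training targets for each query point."""
    d = np.linalg.norm(Xtr[:, None, :] - Xq[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=0)[:k]
    return ytr[nearest].mean(axis=0)

def kfold_rmse(X, y, k, K=10, seed=0):
    """Average test RMSE over K folds for a given hyper-parameter k."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
    mses = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train], y[train], X[test], k)
        mses.append(np.mean((pred - y[test]) ** 2))
    return float(np.sqrt(np.mean(mses)))

# hypothetical stand-ins for composite coordinates (X, Y, Z) and Fe grades
rng = np.random.default_rng(2)
X = rng.uniform(0, 100, (300, 3))
y = 45 + 3 * np.sin(X[:, 0] / 15) + rng.normal(0, 0.5, 300)

grid = {"k": [1, 3, 5, 10]}            # hypothetical one-parameter search grid
scores = {c: kfold_rmse(X, y, *c) for c in itertools.product(*grid.values())}
best = min(scores, key=scores.get)     # lowest cross-validated RMSE wins
```

With more parameters, `itertools.product` enumerates the full Cartesian grid, which is what makes grid search exhaustive but computationally intensive.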
Unlike the classical kriging method, most machine learning algorithms do not provide a statistical measure of the reliability of the estimation results; in kriging, even though it has limitations, the estimation error variance is used to assess the uncertainty of the estimates, and XGBoost is no exception among ML algorithms in lacking such a tool. In this study, estimation results are therefore assessed using swath plots of the XGBoost estimates and composites with respect to the X, Y, and Z directions. Visual checks of the swath plots show that XGBoost was able to capture both random and structural variation, which also indicates that the XGBoost estimates captured the anisotropy. All estimations are performed with Python 3.10 code written and run on a computer with 64 GB RAM and a 24-core CPU.

Complexity analysis is a formal technique for assessing the resource consumption of an algorithm. It characterizes the relationship between input size and execution time, independent of the specific hardware platform, programming language, or compiler used, and allows different algorithms solving the same problem to be compared in terms of their relative efficiency as the input size grows. In the seminal work proposing the XGBoost algorithm [48], the complexity of training the original algorithm is O(K d ||X||0 log n), where ||X||0 is the number of non-missing entries in the training data, d is the maximum depth of a tree, and K is the total number of trees. In the prediction step, the complexity is O(K d). On the current computer configuration, estimation of 26,291 blocks took only 3.99 × 10−3 s. To assess the performance of the algorithm further, 2 million artificial blocks were generated and estimated with the fitted model, which took 0.12 s. This means that, with the current computer configuration, XGBoost is able to handle large numbers of blocks in estimation.

3.2.2 Estimation with Ordinary Kriging

To estimate the variogram in 3D, vertical and horizontal experimental variograms are calculated from the composite values using Netpromine software. In the horizontal direction, experimental variograms are calculated at azimuths of 0°, 45°, 90°, and 135° to detect possible anisotropy, using a lag distance of 50 m, a lag tolerance of 25 m, and a tolerance angle of 22.5°. All experimental variograms showed similar behavior in the different directions. Due to this isotropic behavior, an omnidirectional horizontal variogram is calculated to fit the underlying variogram model. Variogram modelling continued with the vertical experimental values, calculated with a lag distance of 1 m and a tolerance angle of 5°. The model variogram is selected as a nested model consisting of a nugget effect and one spherical structure (Table 2).

Table 2 Variogram model used in estimation

To assess the acceptability of the fitted variogram model, cross-validation is used. In grade estimation with kriging, leave-one-out cross-validation (LOOCV) is the standard technique for assessing the acceptability of the estimation results. In LOOCV, each datum and its location are removed one at a time, and the grade at that location is estimated with the remaining data. From another perspective, LOOCV can be considered a specialized form of K-fold cross-validation with K set to the number of composites in the dataset. Cross-validation ends when all locations have been visited and the grades at those locations have been estimated with the selected model variogram and search ellipsoid parameters. In kriging, only neighboring data are used, to avoid excess smoothing; the neighboring data are those falling within an ellipsoid centered at the estimation point. The axis lengths of the ellipsoid are usually chosen slightly larger than the variogram ranges in the horizontal and vertical directions. When all data points have been visited, two values are available at each location: the estimated and the actual value. The difference between the measured and estimated values is called the residual and is calculated as follows:

$${x}_{residual}= x-{x}^{{\prime }}$$
(7)

where \(x\) is the data value and \({x}^{{\prime }}\) is the value estimated during cross-validation. The statistics of these residuals (kriging errors) provide insight into the acceptability of the underlying variogram model. The mean of the residuals should be as close to zero as possible, while the percentage of errors within two standard deviations (PEWTSD), which measures the spread of the residuals about their mean, should be higher than 95%. PEWTSD is calculated as:

$$\text{P}\text{E}\text{W}\text{T}\text{S}\text{D}= \frac{n}{Number\;of\;data}*100$$
(8)

where n is the number of residuals lying between the mean of the residuals ± 2 × the standard deviation of the kriging errors.
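Eq. (8) is straightforward to compute; a short sketch with hypothetical cross-validation residuals:

```python
import numpy as np

def pewtsd(residuals):
    """Percentage of residuals within two standard deviations of their mean."""
    m, s = residuals.mean(), residuals.std()
    n = np.count_nonzero(np.abs(residuals - m) <= 2 * s)
    return n / len(residuals) * 100

rng = np.random.default_rng(3)
res = rng.normal(0.0, 0.7, 1058)   # hypothetical cross-validation residuals
p = pewtsd(res)                    # roughly 95% for near-normal residuals
```

For approximately normal residuals the value lands near 95%, which is why the 95% threshold is a natural acceptance criterion.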

LOOCV is performed to assess the acceptability of the variogram model given in Table 2. A search ellipsoid with radii of 250 m and 30 m in the horizontal and vertical directions is selected. Summary statistics of the residuals are given in Table 3.

Table 3 Summary statistics of kriging cross-validation residuals

As seen from Table 3, the mean of the kriging errors is close to zero, which means that the estimations are unbiased. Moreover, 99.90% of the errors lie within two standard deviations of the mean of the residuals. The cross-validation results therefore show that the proposed variogram model can be used in the estimations.
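The LOOCV loop described above can be sketched with a minimal hand-rolled ordinary kriging solver. The spherical covariance model, synthetic data, and parameter values below are all hypothetical, and real workflows would use dedicated software as done in this study:

```python
import numpy as np

def spherical_cov(h, nugget, sill, rang):
    """Covariance implied by a nugget + spherical variogram model."""
    c = np.where(h < rang, sill * (1 - 1.5 * h / rang + 0.5 * (h / rang) ** 3), 0.0)
    return np.where(h == 0.0, nugget + sill, c)

def ok_estimate(coords, values, target, nugget, sill, rang):
    """Ordinary kriging estimate at one target point (all data as neighbours)."""
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = spherical_cov(d, nugget, sill, rang)
    A[n, n] = 0.0                                 # unbiasedness constraint
    b = np.ones(n + 1)
    b[:n] = spherical_cov(np.linalg.norm(coords - target, axis=-1),
                          nugget, sill, rang)
    w = np.linalg.solve(A, b)[:n]                 # ordinary kriging weights
    return float(w @ values)

def loocv_residuals(coords, values, **model):
    """Remove each datum in turn, re-estimate it, return measured - estimated."""
    idx = np.arange(len(values))
    est = [ok_estimate(coords[idx != i], values[idx != i], coords[i], **model)
           for i in idx]
    return values - np.array(est)

rng = np.random.default_rng(4)
coords = rng.uniform(0, 100, (80, 2))
values = 45 + np.sin(coords[:, 0] / 20) + rng.normal(0, 0.2, 80)
res = loocv_residuals(coords, values, nugget=0.04, sill=0.5, rang=40.0)
```

With a reasonable variogram model the residual mean sits near zero and the residual spread is well below the raw data spread, which is exactly what the Table 3 checks look for.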

After the fitting and cross-validation steps, estimations are performed with a moving neighborhood defined by an ellipsoid with radii of 250 m and 30 m in the horizontal and vertical directions, respectively. A maximum of the 16 closest data are used in each estimation to mitigate over-smoothing. The geometry of the search ellipsoid and the conditioning data were sufficient to perform the estimations in a single pass.

4 Results and Discussion

In spatial estimation of ore grade, the statistics of the estimation results are expected to be as close as possible to the statistics of the composites. However, the estimates are generally smoother, a phenomenon well known in geostatistical estimation [68]. In other words, the highest estimated value is expected to be lower than the highest composite value, and the lowest estimated value is expected to be higher than the lowest composite value. As a result of smoothing, the variance of the estimates is lower than the variance of the composites. Due to the unbiasedness property of the estimators, the averages of the estimates and of the composite values are expected to be close to each other.

Results of the estimations can be assessed by using summary statistics and cross-plots [1, 69]. Summary statistics of the composites, kriging, and XGBoost are given in Table 4.

Table 4 Summary statistics of composites, kriging, and XGBoost estimates

As seen from Table 4, kriging and XGBoost both produced estimation averages similar to the composite mean, which shows that both estimation methods produced unbiased results. For both methods, the absolute deviation of the estimation mean from the composite mean was less than 0.2%, a negligible deviation. The estimation range of XGBoost is 13% higher than that of the kriging estimates, where the range percentage is calculated as follows:

$$Estimation\;range\;\%=\frac{\left(max\left(XGBoost\right)-min\left(XGBoost\right)\right)-\left(max\left(OK\right)-min\left(OK\right)\right)}{max\left(OK\right)-min\left(OK\right)}\ast100$$
(9)

As expected, the well-known smoothing effect is observed in both estimation results: both methods produced smoother estimates, with variances notably lower than the variance of the composites. XGBoost produced estimates with higher variation than OK, consistent with its higher estimation range. The comparison of the estimates continues with a cross-plot of the kriging estimates against the XGBoost estimates (Fig. 3).
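Eq. (9) amounts to a one-line computation; the estimate values below are hypothetical, chosen simply to reproduce a 13% widening:

```python
def estimation_range_pct(xgb_est, ok_est):
    """Relative widening of the XGBoost estimate range over the kriging range."""
    r_xgb = max(xgb_est) - min(xgb_est)
    r_ok = max(ok_est) - min(ok_est)
    return (r_xgb - r_ok) / r_ok * 100

ok = [44.0, 46.2, 49.0]       # hypothetical kriging estimates, range 5.0
xgb = [43.7, 46.5, 49.35]     # hypothetical XGBoost estimates, range 5.65
pct = estimation_range_pct(xgb, ok)
```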

Fig. 3
figure 3

Cross-plot of kriging estimates vs. XGBoost estimates (dotted black line represents linear regression line, for clarity only randomly selected 10% of the estimation data are shown)

As seen from Fig. 3, the estimation results show a moderately positive linear association, with a Pearson correlation coefficient of 0.63. One reason for the moderate correlation is that the estimation range of XGBoost is higher than that of kriging: at the extreme points of the XGBoost estimates, kriging produced smoother estimates that resemble each other, which weakens the linear relation between the two sets of estimates. Another reason is that the variance of the XGBoost estimates is higher, as seen from Table 4: even though the mean and median values of both sets of estimates are similar, the XGBoost estimates are more variable in a spatial sense, while kriging produced less variable results. To continue the assessment, residuals are calculated by subtracting the XGBoost estimates from the kriging estimates. The histogram and summary statistics of the residuals are shown in Fig. 4.
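The correlation and residual comparison can be reproduced generically as below; the two estimate arrays are synthetic stand-ins, not the actual block estimates of this study:

```python
import numpy as np

rng = np.random.default_rng(5)
ok_est = 45 + rng.normal(0.0, 0.8, 26291)        # stand-in kriging estimates
xgb_est = ok_est + rng.normal(0.0, 0.7, 26291)   # stand-in XGBoost estimates

pearson_r = np.corrcoef(ok_est, xgb_est)[0, 1]   # strength of linear association
residuals = ok_est - xgb_est                     # kriging minus XGBoost
within_1 = np.mean(np.abs(residuals) <= 1) * 100 # % of residuals in [-1, 1] Fe (%)
```

The larger the extra variability of one estimate set relative to the other, the lower the Pearson coefficient and the heavier the residual tails, mirroring the moderate correlation observed here.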

Fig. 4
figure 4

Residuals of XGBoost estimates

As seen from Fig. 4, the residuals follow a normal distribution with an average close to zero, and maximum and minimum values of 4.88 Fe (%) and −5.25 Fe (%), respectively. As expected for a normal distribution, the histogram of the residuals is light-tailed: the number of data in the tails is much lower than at the center of the histogram, and 80.3% of the residuals lie between −1 and 1 Fe (%). This is the expected result given that the correlation between the estimates is moderate. Among the 26,291 block estimates, only two residuals in the upper tail were higher than 4 Fe (%) and three in the lower tail were lower than −4 Fe (%), representing only 0.02% of the data.

Kriging is an industry standard that requires steps such as variographic analysis, determination of possible anisotropy, and model variogram fitting in grade estimation. An expert is needed to perform these time-consuming and troublesome steps, and the estimation results generally rely on the knowledge and experience of that expert. In grade estimation, the XGBoost algorithm can be a good alternative to industry-standard kriging, since estimations can be performed with only input-output data pairs, consisting of X, Y, and Z coordinates and the grade measurements at those locations, respectively. In XGBoost estimation, grade continuity is captured implicitly by the algorithm, mainly through its hyper-parameters. The results of the estimations therefore depend on selecting parameter values appropriate to the given dataset.

5 Conclusions

In this paper, the XGBoost algorithm was used to estimate the Fe grade of a deposit. This study is the first in the literature in which spatial estimation of Fe grade is made using the XGBoost algorithm. For comparison, the kriging method was used. The results show that XGBoost can be used in grade estimation as an alternative to kriging. XGBoost produced a higher estimation range than kriging, which is desirable in grade estimation for an ore deposit. Consistent with the higher estimation range, the standard deviation of the XGBoost estimates was higher than that of the kriging estimates. The XGBoost estimates were moderately correlated with the kriging estimates. Like kriging, XGBoost suffers from smoothing: the standard deviation of its estimates was significantly lower than that of the composites. XGBoost requires hyper-parameter tuning to reach an acceptable level of estimation, since the default parameters are not the best option for estimating the grade distribution. For hyper-parameter tuning, grid search can be used, though it can demand substantial computing power.

The current work considers only the composite values of the samples, with X, Y, and Z coordinates as input variables and Fe grade as the target variable. Other attributes, such as alteration degree, possible minor faults, and rock type, are ignored. Further studies assessing the effect of these variables on estimation with the XGBoost algorithm should be conducted. As is well known, XGBoost, like all other ML algorithms, is data hungry and requires a large amount of data to establish the relation between input and output. It is not possible to know in advance how much data is required to make a reliable grade estimation with an XGBoost model. When the available data are insufficient for reliable estimation with the XGBoost algorithm, additional data must be collected from the field, which requires additional drilling. Such drilling involves costly and time-consuming steps, such as planning drillhole locations, depths, and inclinations, sampling, and chemical analysis of the samples, which stands as a disadvantage of using the XGBoost algorithm in grade estimation.