Introduction

Concrete is an indispensable construction material, popular due to its durability, cost-effectiveness, fire resistance, and suitability for rapid construction. It is made by combining several components, such as cement, water, fine aggregates, and coarse aggregates, in precise ratios that produce the desired compressive strength (CS). Many studies have focused on estimating the properties of hardened concrete. CS is one of the fundamental qualities of concrete that designers are interested in, and relevant information can be gathered through laboratory testing. Evaluating the CS of concrete makes it possible to estimate the residual capacity of a structure1.

Concrete specimens (cubes and cylinders) must be loaded to failure in order to determine the CS of concrete directly, so samples must be tested in a laboratory. However, this is a time-consuming and costly process. To determine the in-situ CS of structural members, concrete cores can be extracted; this method is called destructive testing. It is also costly, because the cores must be drilled, transported, and dressed as per the codal requirements before testing. To overcome this issue, in 1948 Swiss civil engineer and bridge builder Ernst O. Schmidt invented the rebound hammer (RH) to determine the CS of concrete without damaging the structural concrete2. Determining the properties of concrete without damaging it is called non-destructive testing (NDT) or non-destructive evaluation (NDE). A brief description of NDT is available in Section "Non-destructive testing". The commonly used NDT techniques to evaluate the CS of concrete are RH and ultrasonic pulse velocity (UPV)3. UPV is an NDT method used to evaluate the elastic properties and integrity of materials, particularly concrete and rock. It involves measuring the time it takes for an ultrasonic pulse to travel through a material and be reflected back by an internal defect or the opposite surface. The velocity of the pulse can then be calculated and used to assess the quality of the material. The UPV technique is widely used for detecting damage and assessing deterioration in civil engineering infrastructure.

Reliable and accurate models for evaluating the CS of concrete are needed because they reduce both time and cost. Various mathematical models are available in previous studies to calculate the CS of concrete from RH and UPV values. However, the accuracy of these mathematical models is low because of the complexity of concrete and because they are based on limited datasets. Machine learning (ML) algorithms are well suited to estimating the CS of concrete and have been used by numerous researchers in the field of concrete technology. Few studies in the literature have used ML models (step-by-step regression (SBSR), an adaptive neuro-fuzzy inference system (ANFIS), gene expression programming (GEP), high correlated variables creator machine (HCVCM)-SBSR, HCVCM-GEP, and HCVCM-ANFIS, artificial neural networks (ANN), support vector machines (SVM), linear regression (LR), lazy-learning algorithms (LLA), and tree-based learning algorithms (TBL)) to predict CS from RH and UPV tests. Some of these studies are explained below and summarized in Table 1.

Table 1 Summary of previously established ML models to determine the CS using RH and UPV values.

Shishegaran et al.4 used SBSR, ANFIS, GEP, and three other hybrid algorithms (HCVCM-SBSR, HCVCM-GEP, and HCVCM-ANFIS) to estimate the CS using NDTs. The collected dataset contains only 516 experimental values of RH and UPV. The results show that HCVCM-ANFIS outperforms all other ML models in estimating the CS of concrete. HCVCM increases ANFIS accuracy by 5%, 10%, 3%, 20% and 7% in coefficient of determination, RMSE, NMSE, MAPE, and maximum negative error, respectively. Asteris and Mokos5 predicted the CS of concrete obtained from RH and UPV testing using ANN. There were only 209 experimental datasets of RH and UPV values. ANN models with single and double hidden layers were trained. The single hidden layer with twenty-five neurons performed better than the double hidden layer. The R-value of the selected ANN was 0.9891 with an RMSE value of 1.4678 MPa. Shih et al.6 estimated the CS of concrete with NDT by utilizing SVM. To develop and validate the SVM model, information was gathered from 95 cylindrical concrete samples. In comparison to statistical regression methods, the SVM model has a greater level of precision. LR, LLA, and TBL were utilized by Erdal et al.7 to estimate the concrete CS using NDT results. The TBL algorithm performed better than LR and LLA.

The use of ML algorithms in the NDT of concrete structures can improve the accuracy and efficiency of the testing process. ML algorithms can be used to analyse the data collected by NDT methods such as ultrasonic testing, infrared thermography, and ground penetrating radar, to identify and classify defects in the concrete. This can help to reduce the need for costly and time-consuming manual inspections, and can also improve the ability to detect defects that might be missed by traditional NDT methods.

The distinction of the developed ML models arises from their reliance on the most comprehensive dataset among comparable studies centered on predicting CS through NDT data, as emphasized in Table 1.

Real-time NDT (RTNDT) is a promising alternative that allows for the measurement of material properties without causing any damage. RTNDT techniques use various sensors to collect data about the material's response to external loads or other stimuli and can provide information about the material's mechanical properties in real time. This study aims to develop a model for estimating CS based on RTNDT data. The main contribution of this study is the use of an ensemble learning (EL) approach, which is an ML technique that combines the predictions of multiple models to produce a more accurate and robust prediction.

By developing an EL approach for CS prediction based on RTNDT data, this study will provide a novel solution for NDE of materials and structures that is faster, cheaper, and safer than traditional destructive testing. The results of this study will be useful for engineers, researchers, and practitioners who are interested in the development of RTNDT techniques for material evaluation. The primary contributions of this research are: (i) the development of accurate prediction models for structural health monitoring of concrete structures, and (ii) the identification of the impact of the input parameters on the CS of concrete. Notably, the performance metrics of the developed ML models are better than those of existing models, indicating more accurate estimation of CS from NDT data.

Non-destructive testing

NDT refers to a range of analysis techniques used in engineering and/or science to assess the properties, integrity, and characteristics of materials, components, or structures without causing any damage or alteration to their physical properties. NDT methods are used to examine and evaluate flaws, defects, or anomalies in a non-invasive manner, ensuring the reliability, safety, and quality of the examined materials or structures9. The NDT methods commonly used to estimate the CS of concrete are described below:

Rebound hammer

The RH test is one of the most widely used NDT techniques for determining the CS of concrete, and it offers a practical and reasonably priced way to do so. RH test standards are provided by various nations and regions, including India, the USA, China, the UK, Russia, the European Union, Switzerland, and Japan, as shown in Fig. 1.

Figure 1

Standards of RH and UPV testing.

The concept behind the hardness test is that the rebound of an elastic mass depends on the hardness of the surface it strikes. The strength of concrete is inversely correlated with the amount of energy it can absorb. The test starts with carefully choosing and preparing the concrete surface to be tested. Once the surface has been chosen, it should be smoothed with an abrasive stone. The hammer is then driven against the test surface to impart a specific amount of energy.

The plunger should strike perpendicular to the surface. In older RH instruments, the inclination angle of the hammer affects the results, but it is unimportant in the latest instruments. The rebound number should be recorded after the impact10,11. A minimum of ten readings must be taken in each area being analysed. There is no unique relationship between concrete hardness and strength; according to IS 13311 (Part 2)12, the rebound number is affected by factors such as cement type, aggregate type, carbonation of concrete, surface condition, concrete age, concrete moisture content, curing time, etc.

Ultrasonic pulse velocity (UPV)

This method involves measuring the velocity of an ultrasonic wave propagating through a specimen to evaluate its strength and quality characteristics. A complicated system of stress waves is created as a result, including longitudinal (compressional), shear (transverse), and surface (Rayleigh) waves. The longitudinal waves, which move the quickest, are detected by the receiving transducer. The velocity of the ultrasonic wave can be used as a metric to grade the quality of the concrete, with higher velocities indicating better quality and homogeneity, and lower velocities indicating non-uniformity or the presence of defects such as cracks or voids.

To conduct this test, an ultrasonic pulse is introduced into the material under examination, and the time taken for the pulse to traverse the material is recorded. The pulse velocity is then computed by dividing the distance the pulse travelled within the material by the time taken for this traversal. Notably, the velocity of the ultrasonic wave is influenced by the density and elastic modulus of the material. Various standard methods are used globally to conduct the UPV test, as shown in Fig. 1. UPV testing methods can be categorized into three groups: direct testing, semi-direct testing, and indirect testing, as presented in Fig. 2.

Figure 2

Type of UPV testing (a) direct test, (b) semi-direct test, and (c) indirect test.

According to IS 13311 (Part 1)13, factors that can influence the pulse velocity include the surface condition and moisture content of the concrete, the shape and size of the concrete member, the temperature of the concrete, the presence of stress, the effect of reinforcing bars, etc. It is important to consider these factors to obtain accurate results. The pulse velocity (V) is given by:

$$V= \frac{L}{T}$$
(1)

where V, L, and T are the pulse velocity, path length, and effective transit time, respectively.
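As a small illustration of Eq. (1), the sketch below computes the pulse velocity from a path length and a transit time. The function name, the units (mm and µs), and the example reading are assumptions made for illustration and are not taken from the study.

```python
def pulse_velocity_km_s(path_length_mm: float, transit_time_us: float) -> float:
    """Return the pulse velocity V = L / T in km/s.

    1 mm/µs equals 1 km/s, so no extra conversion factor is needed.
    """
    if transit_time_us <= 0:
        raise ValueError("transit time must be positive")
    return path_length_mm / transit_time_us


# Hypothetical reading: 150 mm path (cube width) and 35.0 µs transit time.
velocity = pulse_velocity_km_s(150.0, 35.0)
print(f"V = {velocity:.2f} km/s")
```

Working in mm and µs keeps the arithmetic free of unit conversions, since the ratio is already in km/s.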

The velocity criterion for concrete quality grading according to IS 13311 (Part 1)13 is shown in Table 2, and the concrete quality classification based on the RH and UPV values is shown in Table 314.

Table 2 Velocity criterion for concrete quality grading.
Table 3 Concrete quality classification based on RH and UPV.

Experimental methodology

Materials

OPC 43 grade cement was used in the present research work. The physical properties of the ordinary Portland cement (OPC), namely consistency, fineness, specific gravity, and CS after 672 ± 4 h, were 30%, 310 m²/kg, 3.14, and 47.31 MPa, respectively. The coarse aggregates used in this study, 20 mm and 10 mm in size, were naturally crushed, with a fineness modulus of 2.25 and a specific gravity of 2.71. The fine aggregates were natural, with specific gravity and fineness modulus values of 2.69 and 2.78 (Zone III), respectively. The design mix was prepared according to IS 10262: 201615. Test specimens were batched on site using different concrete mix designs (varying the cement, water, coarse aggregate, fine aggregate, and admixture contents and the w/c ratio), with a nominal 28-day CS of 22 MPa to 44 MPa and a slump flow of more than 100 mm. The total number of cast samples was 111, and each cube sample measured 150 mm × 150 mm × 150 mm. Figure 3 depicts the complete procedure for the samples from the casting phase to the testing phase.

Figure 3

Preparation of specimens (casting, marking, NDT and compression testing).

CS using RH

The CS of concrete can be determined quickly and conveniently using the NDT method known as the "RH test". The RH, often referred to as a Schmidt hammer, consists of a spring-controlled mass that slides along a plunger inside a tubular casing. Before testing, all samples were taken out of the curing tank and kept in the laboratory environment for roughly 24 h. Small 15 mm circles were then marked on two faces of each concrete cube. The arrangement of the circles and their centre-to-centre spacing are presented in Fig. 4a. A total of twenty-five markings were made, as shown in Fig. 4b. The concrete cube specimens were placed in a compression testing machine and subjected to a constant load of approximately 7 N/mm² (based on the impact energy of the hammer) (Fig. 4c). The rebound number was measured, and the CS was calculated according to IS 516:195916. In the RH test, 10 to 12 rebound number readings were taken at each test location on two faces of the cube specimen (Fig. 4d).

Figure 4

Concrete cubes (a) Marking of 15 mm circle for RH test (b) Assigning number to each circle and (c) Impression of RH test.

The average rebound value of each test region is the arithmetic average of the readings that remain after the maximum and minimum sets of results17 have been removed.

$${R}_{m}= \sum_{i=1}^{10}\frac{{RN}_{i}}{10}$$
(2)

where, \({R}_{m}\) and \({RN}_{i}\) are the average value of each test area and the measured value of each impact point, respectively.
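The averaging in Eq. (2) can be sketched as follows. The text does not specify exactly how many extreme readings are discarded, so this sketch assumes that the highest and lowest readings are trimmed as evenly as possible until ten values remain; the function name and the sample readings are hypothetical.

```python
def mean_rebound_number(readings: list[float], keep: int = 10) -> float:
    """Average rebound number after trimming the extreme readings.

    Assumption: extremes are dropped as evenly as possible from both ends
    (any odd extra reading is dropped from the high end) until `keep`
    values remain; those values are then averaged as in Eq. (2).
    """
    if len(readings) < keep:
        raise ValueError(f"need at least {keep} readings, got {len(readings)}")
    ordered = sorted(readings)
    excess = len(ordered) - keep
    drop_low = excess // 2
    drop_high = excess - drop_low
    trimmed = ordered[drop_low:len(ordered) - drop_high]
    return sum(trimmed) / len(trimmed)


# Hypothetical set of 12 rebound readings from one test location.
readings = [30, 32, 31, 33, 29, 34, 32, 31, 30, 33, 25, 40]
print(mean_rebound_number(readings))
```

With 12 readings, one value is dropped from each end, so the outliers 25 and 40 in this example do not affect the average.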

Density using UPV

The working principle and related details of UPV are given in the subsection "Ultrasonic pulse velocity (UPV)". In the UPV test, markings were made on two faces of the concrete cube. The diameter of the probe is 50 mm, and five markings were made on each face, as shown in Fig. 5a and b. On average, only two readings were taken from the two selected faces, as shown in Fig. 5d. Probes of 54 kHz were used, and the direct testing method was adopted, as shown in Fig. 5c.

Figure 5

Concrete cubes (a) Marking of 50 mm circle for UPV test (b) Assigning number to each circle (c) testing of samples with UPV instrument, and (d) Impression of UPV test.

The average UPV value can be obtained using the expression below:

$${UPV}_{m}= \sum_{i=1}^{2}\frac{{UPV}_{i}}{2}$$
(3)

where \({UPV}_{m}\) and \({UPV}_{i}\) are the average value of each test area and the measured value of each point, respectively.

CS using UTM

After the RH and UPV testing, each sample was tested in the universal testing machine (UTM) according to IS 516: 195916. The load was applied gradually and continuously, without shock, until the cube either reached its peak capacity or showed signs of cracking. The maximum load at which the cube failed was recorded along with the type of failure (crushing, splitting, etc.), as shown in Fig. 6. IS 516 provides guidelines for the testing of concrete cubes, and its procedure was followed to obtain accurate results.

Figure 6

Testing of the specimen under UTM (a) setting of the specimen under UTM, (b) application of load, and (c) failure of the specimen.

Collected database

To collect RH and UPV data for concrete cubes, a thorough literature review was conducted. From the published studies, 476 data points were gathered5,18,19,20,21. A further 134 data points were obtained from in-situ NDT. In addition, 111 concrete cubes were cast and tested in the laboratory using RH, UPV, and UTM. In total, 721 data points were used to construct the ML models. Figure 7 shows the overall approach used in this study. Table 4 presents the statistical characteristics (minimum, maximum, mean, standard deviation (SD) and kurtosis (Ku)) of the literature, in-situ, laboratory, and merged datasets.

Figure 7

Methodology chart.

Table 4 Statistical parameters of all the datasets used to develop ML models.

A probability histogram is a graph that lists all possible outcomes along the x-axis and the likelihood of each outcome on the y-axis. The probability distribution is depicted graphically in Fig. 8.

Figure 8

Histogram probabilities plot (a) data from the literature, (b) in-situ data, (c) laboratory data, and (d) combined all data.

Processing of data

Data processing is an important step in ML workflows. Data standardization is the process of transforming data into a common format or scale so that it can be easily compared and combined with other data. This is particularly important when working with data from different sources, as each source may have its own format or scale. In this work, “Min–Max Scaling” has been used to normalize the input and output datasets. Standardization/normalization of the dataset is important because it allows for more accurate and fair comparisons between data and can improve the performance of ML algorithms22,23.
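A minimal sketch of min-max scaling is given below for readers who want to see the transformation explicitly. It maps each value x to (x − min) / (max − min), so that the scaled data lie in [0, 1]. The function names and sample values are hypothetical, and the source does not state whether the minimum and maximum were taken from the full dataset or from the training split only.

```python
def fit_min_max(values):
    """Return (minimum, maximum) of the data used to define the scaling."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("min-max scaling is undefined for constant data")
    return lo, hi


def scale(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Invert the scaling, e.g. to report predictions in original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical rebound numbers, for illustration only.
rebound = [20.0, 30.0, 40.0, 50.0]
lo, hi = fit_min_max(rebound)
scaled = scale(rebound, lo, hi)
print(scaled)
print(unscale(scaled, lo, hi))
```

Keeping the fitted minimum and maximum separate from the transformation makes it possible to apply the same scaling to new data and to convert scaled predictions back to MPa.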

Prediction models

Both mathematical and ML models have been applied to predict the CS from NDT results. The mathematical models have been divided into three categories: (a) prediction models based on RH values, (b) prediction models based on UPV values, and (c) prediction models based on combined RH and UPV values. Ensemble-based ML algorithms, namely AdaBoost, CatBoost, GBT, RF, stacking, and XGB, have been applied to develop the CS prediction models. Detailed information on the mathematical and ML models is given in the following sections.

Mathematical models

In this section, the details of the analytical models based on RH, UPV, and combined RH and UPV are given in Tables 5, 6 and 7, respectively, with the year of publication. These models are the most widely acknowledged empirical relationships in the literature for calculating the CS of concrete using NDT techniques. The equations rely on UPV, RH, or a combination of both, but they tend to show considerable deviation, so the predicted results differ significantly from the actual values.

Table 5 List of analytical models based on RH.
Table 6 List of analytical models based on UPV.
Table 7 List of analytical models based on combined RH and UPV.

A small correction has been applied to the CS predicted by three analytical models (UPV-M8, C-M10, and C-M13). In the UPV-M8, C-M10, and C-M13 models, the correction factor in the existing model is divided by 10000, 10, and 10, respectively.

ML models

EL combines the predictions of multiple models; two common strategies are bagging (bootstrap aggregating) and boosting. Bagging trains separate instances of a model on different subsets of the data and combines their predictions through voting or averaging. Boosting trains weak models sequentially, each correcting the errors of the previous one, and combines their predictions with weighted emphasis. Models such as random forest (RF), gradient boosting trees (GBT), AdaBoost, and XGBoost (XGB) are prominent for their strong performance and lower risk of overfitting. This study applied six EL models to improve on the accuracy of the existing models. An overview of the ML models is given in Table 8.

Table 8 Description of ML models.
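To make the difference between bagging and boosting concrete, the sketch below implements both with the simplest possible weak learner, a one-split regression stump, in plain Python. It is not the implementation used in this study: the models in Table 8 are full tree ensembles, whereas everything here (function names, hyperparameters, and the synthetic linear data) is an assumption chosen for illustration.

```python
import math
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[1:]:  # every candidate keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant predictor
        return (xs[0], mean(ys), mean(ys))
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagging(models, x):
    """Average the predictions of all bagged stumps."""
    return mean(predict_stump(m, x) for m in models)


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Boosting for squared error: each stump is fitted to the current residuals."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def rmse(ys, preds):
    return math.sqrt(mean((y - p) ** 2 for y, p in zip(ys, preds)))


# Synthetic, purely illustrative data: rebound number vs. strength (MPa).
xs = list(range(20, 50))
ys = [1.2 * x - 10.0 for x in xs]

single = fit_stump(xs, ys)
bagged = fit_bagging(xs, ys)
boosted = fit_boosting(xs, ys)

print("single stump RMSE:", round(rmse(ys, [predict_stump(single, x) for x in xs]), 3))
print("bagging RMSE:     ", round(rmse(ys, [predict_bagging(bagged, x) for x in xs]), 3))
print("boosting RMSE:    ", round(rmse(ys, [predict_boosting(boosted, x) for x in xs]), 3))
```

Bagging reduces variance by averaging independently trained learners, whereas boosting reduces bias by letting each new learner fit what the previous ones missed; the printed training errors only illustrate this behaviour on toy data and say nothing about the models evaluated in this study.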

Model validation

Several metrics are commonly used for model validation in regression problems, such as the correlation coefficient (R), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). An R-value close to one and an RMSE approaching zero both indicate a better fit. A Nash–Sutcliffe (NS) efficiency index equal to one indicates a perfect fit between the experimental and predicted values, and a higher a20-index value also indicates a better fit. It is always preferable to use multiple metrics to evaluate the performance of an individual model54,55,56,57,58,59,60,61,62,63.

$$R= \frac{{\sum }_{i=1}^{N}\left({r}_{i}-\overline{r }\right)\left({s}_{i}-\overline{s }\right)}{\sqrt{{\sum }_{i=1}^{N}{\left({r}_{i}-\overline{r }\right)}^{2}{\sum }_{i=1}^{N}{\left({s}_{i}-\overline{s }\right)}^{2}}}$$
(4)
$$MAE= \frac{1}{N}\sum_{i=1}^{N}\left|{r}_{i}-{s}_{i}\right|$$
(5)
$$RMSE= \sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left({r}_{i}-{s}_{i}\right)}^{2}}$$
(6)
$$MAPE= \frac{1}{N}\sum_{i=1}^{N}\left|\frac{{r}_{i}-{s}_{i}}{{r}_{i}}\right|\times 100$$
(7)
$$NS=1- \frac{{\sum }_{i=1}^{N}{({r}_{i}-{s}_{i})}^{2}}{{\sum }_{i=1}^{N}{({r}_{i}-\overline{r })}^{2}}$$
(8)
$$a20- index=\frac{m20}{N}$$
(9)

where r, s, \(\overline{r }\), and \(\overline{s }\) are the experimental values, predicted values, mean of the experimental values, and mean of the predicted values, respectively. N represents the number of data points, and m20 is the number of data points for which the ratio of the measured value to the predicted value lies in the range of 0.8 to 1.2.
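The six metrics of Eqs. (4)–(9) can be computed together, as in the sketch below. The function name and the sample values are hypothetical. The NS index is computed here in its standard Nash–Sutcliffe form, with the mean of the experimental values in the denominator, and the sketch assumes that no experimental or predicted value is zero.

```python
import math


def regression_metrics(r, s):
    """Return R, MAE, RMSE, MAPE (%), NS, and a20-index.

    r: experimental values, s: predicted values (same length, non-zero).
    """
    n = len(r)
    if n == 0 or n != len(s):
        raise ValueError("r and s must be non-empty and of equal length")
    r_mean = sum(r) / n
    s_mean = sum(s) / n
    cov = sum((ri - r_mean) * (si - s_mean) for ri, si in zip(r, s))
    ss_r = sum((ri - r_mean) ** 2 for ri in r)
    ss_s = sum((si - s_mean) ** 2 for si in s)
    sq_err = sum((ri - si) ** 2 for ri, si in zip(r, s))
    # m20: samples whose measured/predicted ratio lies within 0.8 to 1.2.
    m20 = sum(1 for ri, si in zip(r, s) if si != 0 and 0.8 <= ri / si <= 1.2)
    return {
        "R": cov / math.sqrt(ss_r * ss_s),
        "MAE": sum(abs(ri - si) for ri, si in zip(r, s)) / n,
        "RMSE": math.sqrt(sq_err / n),
        "MAPE": sum(abs((ri - si) / ri) for ri, si in zip(r, s)) / n * 100.0,
        "NS": 1.0 - sq_err / ss_r,
        "a20": m20 / n,
    }


# Hypothetical experimental and predicted strengths in MPa.
experimental = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 29.0, 41.0, 48.0]
for name, value in regression_metrics(experimental, predicted).items():
    print(f"{name}: {value:.4f}")
```

Returning all metrics from one function makes it easy to build a comparison table across models, in line with the recommendation above to rely on several metrics rather than one.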

Results and discussions

Mathematical models

The mathematical models are divided into three categories, namely (i) RH models, (ii) UPV models, and (iii) combined RH and UPV models, as mentioned in section "Prediction models". Among the RH models, RH-M2 has the highest R-value, but RH-M5 has the lowest MAE, RMSE, and MAPE values. RH-M5 also ranks first in terms of NS and the a20-index. The error ranges of the RH-M1, RH-M2, RH-M3, RH-M4, RH-M5, RH-M6, and RH-M7 models are +5.02 MPa to +51.94 MPa, −32.70 MPa to +13.75 MPa, −11.43 MPa to +16.96 MPa, −12.49 MPa to +16.48 MPa, −12.30 MPa to +16.05 MPa, −12.07 MPa to +19.74 MPa, and −17.87 MPa to +18.92 MPa, respectively. The values of all performance metrics are shown in Table 9, and the scatter plots of all the RH models are shown in Fig. 9.

Table 9 Results of RH models.
Figure 9

Results of analytical models based on RH (a) RH-M1, (b) RH-M2, (c) RH-M3, (d) RH-M4, (e) RH-M5, (f) RH-M6, and (g) RH-M7.

Therefore, based on all the performance indices, RH-M5 is the most accurate of the RH models.

Among the UPV models, UPV-M7 has the highest R-value. However, its errors are higher than those of the UPV-M6 model. The MAE value of the UPV-M6 model is 20.48% lower than that of the UPV-M7 model. Similarly, the RMSE and MAPE values of the UPV-M6 model are 22.71% and 15.62% lower than those of UPV-M7. However, the NS and a20-index of the UPV-M8 and UPV-M2 models are greater.

However, the overall performance of the UPV-M6 model is good. Therefore, it can be inferred that the UPV-M6 model outperforms other UPV models in terms of performance. A scatter plot and the values for all performance metrics are presented in Fig. 10 and Table 10 respectively.

Figure 10

Results of analytical models based on UPV (a) UPV-M1, (b) UPV-M2, (c) UPV-M3, (d) UPV-M4, (e) UPV-M5, (f) UPV-M6, (g) UPV-M7, (h) UPV-M8, (i) UPV-M9, and (j) UPV-M10.

Table 10 Results of UPV models.

The final category of analytical models incorporates a combination of RH and UPV measurements. The C-M6 model displays superior performance when compared to other combined models, with higher values for R-value, NS, and a20-index. Specifically, the R-value of the C-M6 model surpasses that of C-M1, C-M2, C-M3, C-M4, C-M5, C-M7, C-M8, C-M9, C-M10, C-M11, C-M12, and C-M13 by 5.99%, 6.26%, 9.30%, 7.38%, 4.01%, 29.39%, 1.59%, 13.65%, 10.94%, 52.19%, 0.97%, and 66.12%, respectively.

Overall, the C-M6 model demonstrates superior performance in comparison to the other combined models as well as the other two categories. Figure 11 and Table 11 present the scatter plots and performance metric values, respectively. Additionally, the C-M6 model exhibits the lowest MAE, RMSE, and MAPE values among all combined models.

Figure 11

Results of analytical models based on combined RH and UPV (a) C-M1, (b) C-M2, (c) C-M3, (d) C-M4, (e) C-M5, and (f) C-M6, (g) C-M7, (h) C-M8, (i) C-M9, (j) C-M10, (k) C-M11, (l) C-M12, and (m) C-M13.

Table 11 Results of combined RH and UPV models.

ML models

In this study, six ML models were developed and evaluated based on six distinct performance metrics. Along with these metrics, scatter plots, absolute error plots, and grouped marginal plots have also been utilized to display the accuracy and errors of the models. To compare the analytical models and the ML models, a raincloud graphic and Taylor diagram have been employed.

The AdaBoost model has an R-value of 0.9280 and a MAPE value of 10.81%, with NS and a20-index values of 0.8610 and 0.8724, respectively. In comparison, the CatBoost, GBT, RF, stacking, and XGB models have correlation coefficients of 0.9349, 0.9627, 0.9877, 0.9536, and 0.9970, respectively. Among all ML models, the XGB model has the highest R-value, which is 7.44%, 6.64%, 3.56%, 0.94%, and 4.55% higher than those of AdaBoost, CatBoost, GBT, RF, and stacking, respectively. Similarly, the NS and a20-index values of the XGB model are higher than those of all other developed ML models. Additionally, the MAPE and MAE values of the XGB model are significantly lower than those of the AdaBoost, CatBoost, GBT, RF, and stacking models, with reductions of 84.37%, 83.24%, 77.33%, 59.46%, and 81.08% for MAPE, and 84.99%, 83.43%, 77.60%, 60.37%, and 82.31% for MAE, respectively. Based on all performance metrics, the XGB model is the most accurate of the developed models.
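The relative differences quoted above follow from simple ratios of the reported metrics. The sketch below reproduces the R-value comparisons from the correlation coefficients given in the text; the helper names are hypothetical, and the same arithmetic applies to error metrics with the roles reversed.

```python
def percent_higher(value: float, reference: float) -> float:
    """Percentage by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100.0


def percent_lower(value: float, reference: float) -> float:
    """Percentage by which `value` falls below `reference` (for error metrics)."""
    return (reference - value) / reference * 100.0


# Correlation coefficients as reported in the text.
xgb_r = 0.9970
other_r = {
    "AdaBoost": 0.9280,
    "CatBoost": 0.9349,
    "GBT": 0.9627,
    "RF": 0.9877,
    "stacking": 0.9536,
}

for name, r_value in other_r.items():
    print(f"XGB vs {name}: {percent_higher(xgb_r, r_value):.2f}% higher R-value")
```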

The graphical representations of all the developed models are shown in Fig. 12a–f. Table 12 displays the values of all performance metrics for the developed ML models. Each panel of the figure contains three plots. The first is a scatter plot of the training and testing datasets.

Figure 12

Results of the developed ML models (a) AdaBoost, (b) CatBoost, (c) GBT, (d) RF (e) stacking, and (f) XGB model.

Table 12 Results of ML models.

The second plot is the grouped marginal plot, which combines a scatter plot with density curves along the margins to represent the distribution of multiple variables in a single plot. The last plot shows the absolute error values of the training and testing datasets. In a grouped marginal plot, each group of observations is represented by a different colour (red for experimental values and blue for predicted values), and the density curves along the margins show the distribution of each variable for each group. This plot helps to visualize the relationship between two variables and the distribution of each variable for different groups in the data.

In the AdaBoost, CatBoost, GBT, RF, stacking, and XGB models, only 30.37%, 33.70%, 41.61%, 61.44%, 25.38%, and 91.12% of the data, respectively, lie directly on the diagonal line (the best-fitting line, in dark blue), as shown in the scatter plots of Fig. 12a–f. The error ranges of the AdaBoost, CatBoost, GBT, RF, stacking, and XGB models are −13.57 MPa to +11.91 MPa, −16.17 MPa to +12.16 MPa, −12.31 MPa to +9.28 MPa, −7.18 MPa to +5.29 MPa, −10.16 MPa to +9.57 MPa, and −5.77 MPa to +4.79 MPa, respectively. The maximum absolute errors of the AdaBoost, CatBoost, GBT, RF, stacking, and XGB models are approximately 14 MPa, 17 MPa, 13 MPa, 8 MPa, 11 MPa, and 6 MPa, respectively, as shown in Fig. 12a–f. The grouped marginal plot of the XGB model (Fig. 12f) shows a closer match between experimental and predicted values than those of the other ML models. This graphical analysis further confirms the high precision of the XGB model. Based on the performance indices and graphical analysis, the ML models rank from best to worst as follows: XGB, RF, GBT, stacking, CatBoost, and AdaBoost.

Comparison between mathematical and ML models

The performance of the analytical and developed ML models has been compared with that of existing ML models. The metrics of all the models (existing ML models, analytical models, and the developed ML model) are shown in Table 13. The R-value of the XGB model is 7.39%, 0.80%, 8.09%, 25.71%, 72.25%, and 11.19% higher than those of Shishegaran et al.4, Asteris and Mokos5, Asteris et al.8, RH-M5, UPV-M6 and C-M6, respectively. Similarly, the NS value of the XGB model is 4.83%, 256.75%, and 3.91% higher than those of RH-M5, UPV-M6, and C-M6, respectively. In addition, the a20-index of the XGB model is higher than those of all the analytical and existing ML models. The MAPE value of the XGB model is 75.03%, 86.78%, 91.20%, 93.78%, and 88.67% lower than those of Shih et al.6, Asteris et al.8, RH-M5, UPV-M6 and C-M6, respectively. In a nutshell, the XGB model demonstrates superior performance compared to both the existing ML models and the analytical models.

Table 13 Comparison of existing ML models with best-fitted analytical and ML models.

Taylor diagrams and raincloud plots have been used to compare how well the ML and analytical models performed. A Taylor diagram is drawn from the R-value, the RMSE value, and the standard deviation. The dark dotted blue line represents the standard deviation of the experimental dataset, and the green star shows the position of the best-fitted model. Figure 13a represents the Taylor plot of the RH models. In Fig. 13a, not a single model shows a good fit, because the RMSE values of all the models are above 6 MPa.

Figure 13

Taylor plots (a) RH, (b) UPV, (c) combined RH and UPV models up to C-M7, (d) combined RH and UPV models from C-M8 to C-M13, (e) developed ML models, and (f) best analytical (RH-M5, UPV-M6 and C-M6) and ML (XGB) models.

Figure 13b represents the Taylor plot of the UPV models. None of the models in Fig. 13b shows a good fit, because the RMSE values of all the models are above 9 MPa. Figure 13c and d represent the Taylor plots of the combined RH and UPV models. Only two models (C-M5 and C-M6) in Fig. 13c have RMSE values lower than 6 MPa. Several analytical models, specifically RH-M1, RH-M2, RH-M7, UPV-M1, UPV-M4, C-M1, C-M2, C-M7, C-M8, C-M10, and C-M13, lie far from the ideal position in the Taylor plot. This deviation can be attributed to the pronounced disparity between their standard deviations and that of the experimental data. The Taylor plot of the developed ML models is shown in Fig. 13e. The RMSE values of the GBT, RF, stacking, and XGB models are less than 3 MPa, and among these models the XGB model shows the best fit. Figure 13f shows the Taylor plot of the best analytical models (RH-M5, UPV-M6, and C-M6) and the best-fitted ML model (XGB).
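The three statistics shown on a Taylor diagram can share one plot because they are linked by a law-of-cosines relation. In the notation of Eqs. (4)–(9), and with \({\sigma }_{r}\), \({\sigma }_{s}\), and \(E'\) denoting the standard deviation of the experimental values, the standard deviation of the predicted values, and the centred root mean square difference (symbols introduced here for illustration, not used elsewhere in this paper):

$$E'^{2}={\sigma }_{r}^{2}+{\sigma }_{s}^{2}-2{\sigma }_{r}{\sigma }_{s}R$$

$$RMSE^{2}=E'^{2}+{\left(\overline{s }-\overline{r }\right)}^{2}$$

The second identity shows that the RMSE of Eq. (6) coincides with the distance read from the diagram only when the mean bias \(\overline{s }-\overline{r }\) is zero. Both identities assume standard deviations computed with a 1/N normalization; which of the two quantities is plotted in Fig. 13 is not stated in the text.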

In addition, a raincloud plot has been used to compare the performance of the selected analytical models (RH-M5, UPV-M6, and C-M6) and all ML models, as shown in Fig. 14. The error ranges of the RH-M5, UPV-M6, C-M6, AdaBoost, CatBoost, GBT, RF, stacking, and XGB models are −12.30 MPa to +16.05 MPa, −17.57 MPa to +32.62 MPa, −14.75 MPa to +17.64 MPa, −13.57 MPa to +11.92 MPa, −16.17 MPa to +12.16 MPa, −12.31 MPa to +9.28 MPa, −7.18 MPa to +5.29 MPa, −10.16 MPa to +9.57 MPa, and −5.77 MPa to +4.79 MPa, respectively, as shown in Fig. 14. This plot also indicates that the XGB model performs better than all the analytical models and the other developed ML models.

Figure 14

Raincloud plot to show the error comparison of best analytical models and developed ML models.

Sensitivity analysis

Sensitivity analysis is a technique to determine how the uncertainty in the output of a model can be attributed to variations in its inputs. SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any ML model. It uses Shapley values, a well-established mathematical concept from cooperative game theory, to explain the output of a model by assigning a contribution to each feature64.

For an XGB model, SHAP values can be used to perform sensitivity analysis by calculating the contribution of each feature to the model's predictions. Observing the magnitude and direction of the SHAP values associated with each feature makes it possible to identify the features that most influence the model's predictions. Understanding these values helps ascertain how altering feature values will affect the predictions of the developed model. This information can be useful for interpreting the model's behaviour and for making decisions about feature selection. The RH value has a greater impact on the predicted CS of concrete than the UPV value: it accounts for 88.19% of the influence, and the rest is contributed by the UPV value, as shown in Fig. 15.

Figure 15

Feature importance.
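With only two input features, Shapley values can be written out exactly, which may help in reading Fig. 15. The sketch below is not the SHAP library and not the XGB model of this study: `toy_strength` is a made-up function, "missing" features are replaced by a single baseline point (a simplification of what SHAP does with a background dataset), and all numbers are hypothetical.

```python
def shapley_two_features(model, x, baseline):
    """Exact Shapley values for a model with two inputs.

    model: callable f(rh, upv); x: instance (rh, upv);
    baseline: reference point standing in for 'feature absent'.
    """
    x1, x2 = x
    b1, b2 = baseline
    f_none = model(b1, b2)   # neither feature known
    f_rh = model(x1, b2)     # only RH known
    f_upv = model(b1, x2)    # only UPV known
    f_both = model(x1, x2)   # both known
    # Average each feature's marginal contribution over both orderings.
    phi_rh = 0.5 * ((f_rh - f_none) + (f_both - f_upv))
    phi_upv = 0.5 * ((f_upv - f_none) + (f_both - f_rh))
    return phi_rh, phi_upv


def toy_strength(rh, upv):
    """Hypothetical strength model with an interaction term (not from the study)."""
    return 0.9 * rh + 2.0 * upv + 0.1 * rh * upv


phi_rh, phi_upv = shapley_two_features(toy_strength, x=(40.0, 4.5), baseline=(30.0, 4.0))
total = abs(phi_rh) + abs(phi_upv)
print(f"RH contribution:  {phi_rh:.2f} MPa ({abs(phi_rh) / total:.1%})")
print(f"UPV contribution: {phi_upv:.2f} MPa ({abs(phi_upv) / total:.1%})")
```

The two contributions always sum to the difference between the prediction at the instance and the prediction at the baseline. A global importance share such as the one in Fig. 15 is commonly obtained by averaging the absolute contributions over all samples and normalizing, although the exact procedure used for the figure is an assumption here.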

Conclusions

The compressive strength of concrete has been evaluated in the present study from NDT data using analytical as well as ML models. The three groups of analytical models (RH models, UPV models, and combined RH and UPV models) consist of seven, ten, and thirteen models, respectively. Ensemble-based ML algorithms (AdaBoost, CatBoost, GBT, RF, stacking, and XGB) have been used to improve on the accuracy of the existing models. Six performance metrics were employed to evaluate the accuracy of both the analytical and the ML models. Furthermore, graphical representations such as scatter plots, absolute error plots, and grouped marginal plots were utilized to analyse the fit of the ML models. In addition, Taylor and raincloud plots have been used to compare the performance of the selected analytical models and the developed ML models. Based on the performance metrics and graphical representations, the following conclusions can be drawn:

  • Among the selected analytical models, the correlation coefficients of the RH model (RH-M5), UPV model (UPV-M6), and combined RH and UPV model (C-M6) are 0.7931, 0.5788, and 0.8967, respectively. Similarly, the NS and a20-index of the C-M6 model are higher than those of the RH and UPV models, with values of 0.9565 and 0.6602, respectively.

  • All the developed ML models perform better than the existing analytical models. Among the ML models, the XGB model is the most accurate in terms of R, RMSE, MAPE, and MAE values.

  • The R-value of the XGB model is 25.71%, 72.25%, and 11.19% higher than those of the RH-M5, UPV-M6, and C-M6 models, respectively.

  • According to the sensitivity study, RH values have a substantially larger impact on concrete's CS than UPV values.

  • The Taylor and raincloud plots also confirm the reliability of the XGB model.