Determining the mix design method for normal strength concrete using machine learning

There exist many empirical data-based methods that facilitate the process of concrete mix design. The output of a mix design method is the set of proportions of concrete constituents that, when mixed together, produce hardened concrete that satisfies the required strength, workability and durability. Based only on the proportions of the mix, it can be challenging to determine the design method. Therefore, in this work, computer-generated data were used to train a simple machine learning model to determine the method by which a normal strength concrete mix was designed. The developed machine learning model only requires knowledge of the mix's proportions, i.e., the amounts of cement, water, sand, and gravel, to accurately determine the method by which the mix was designed. It was found that a simple machine learning model (decision tree) was able to determine the mix design method with high accuracy. Moreover, via principal component analysis and other similar techniques, it was found that the amount of cement is the best predictor of the mix design method. The findings of this work provide a method for determining mix design methods and promote the use of machine learning in the field of civil engineering.


Introduction
The process of designing a concrete mix is a vital skill that civil engineers acquire early in their careers. The goal of concrete mix design is to determine the amounts of the different concrete constituents required to produce concrete with predetermined workability and compressive strength. The required compressive strength is determined by the purpose for which the concrete is going to be employed, and achieving this strength depends on the proportions of its constituents. In addition to strength, workability and durability also influence the amounts of mix constituents. Various methods are widely used for designing concrete mixes, such as the American Concrete Institute (ACI) method [8], the Department of Environment (DOE) method [33] and the absolute volume method (AVM) [8]. For a required compressive strength, workability, and durability, the three methods suggest different proportions of cement, water, sand and gravel. It is important that the mix design method used is known, especially in countries where the design codes of neighboring countries are used interchangeably. In these codes, AVM, ACI, DOE or other mix design methods are used to design concrete mixes. Knowing the design method can help construction engineers understand the mix better and interpret the fresh concrete's test results accordingly. Each design method works according to certain assumptions and uses empirical tables or curves in the design process. Hence, for given strength, durability and workability requirements, each method suggests different amounts of each ingredient, and it is important to know which method was used to design the mix. For instance, the AVM does not account for uncertainty and hence does not add a margin to the required strength. Also, the margin added when using the ACI method is different from the one added when using the DOE method.
In addition, mixes designed using the ACI and DOE methods account (although with different approaches) for durability requirements such as the presence of sulphates and chlorides in soil and water. Moreover, the DOE method differentiates between crushed and uncrushed aggregates, while AVM and ACI do not. So, for two mixes designed for the same required strength, one using ACI and the other using AVM or DOE, the workability can differ, and hence slump test results in the field will differ.
Knowing that a mix was designed using ACI informs the construction engineer that it accounts for durability aspects and that it is highly probable that the mix will produce the targeted strength, because the ACI mix design method adds a margin that is based on statistical standard deviation values and on the production rate of the lab, plant or individual making the concrete. Also, knowing that a mix was designed using ACI or DOE indicates that the amount of water is based on the maximum nominal size (MNS) of the aggregates and the desired workability, so slump readings will be different unless aggregates of the correct MNS are used. Also, DOE suggests different amounts of water depending on whether the aggregates used are crushed or uncrushed.
Finally, knowing the method that was used to design the mix at hand will help troubleshoot some of the issues that may be encountered, such as why the mix is too stiff or too wet, why it uses such high or low amounts of a certain ingredient, whether the mix accounts for durability aspects, and, more importantly, how certain we can be that the mix will achieve its design compressive strength.
Nevertheless, it can be difficult to determine the design method based only on the proportions of the mix. Hence, in this work, machine learning was employed to classify concrete mix design methods based on the proportions of the constituents they produce, i.e., the amounts of cement, water, and coarse and fine aggregates.

Literature review
For normal strength concrete, the main constituents are cement, water, fine and coarse aggregates. These four constituents are mixed to make concrete structural members such as beams, columns, and slabs. Other components can be added to the concrete/cementitious mix to enhance its workability, strength, fracture toughness and other properties, and can be optimized to satisfy predefined constraints required for different types of concrete and cementitious materials [1,2,13,16,19,21,25,29,34,36]. The following section discusses some of the existing methods of normal strength concrete mix design.

Absolute volume method
The absolute volume method is a fast and often reliable method that is used by engineers for designing concrete mixes. It is part of the ACI method [8]. The method's main assumption is that the absolute volume of concrete is the sum of the absolute volumes of its components. Using this method, a designer chooses from a list of ratios for fine/coarse aggregate as well as the water/cement ratio.
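The main assumption of the method can be illustrated with a short calculation. The sketch below is written in Python for illustration (the programs in this work were written in MATLAB) and sums the absolute volumes of the constituents of one cubic meter of concrete. The specific gravities and batch weights are illustrative assumptions, not values taken from the dataset, and the function name absolute_volumes is hypothetical.

```python
# Assumed specific gravities (illustrative values, not from this work).
SPECIFIC_GRAVITY = {"cement": 3.15, "water": 1.00, "fine": 2.65, "coarse": 2.65}
WATER_DENSITY = 1000.0  # kg per cubic meter


def absolute_volumes(masses_kg):
    """Return the absolute volume (m^3) of each constituent from its mass."""
    return {name: mass / (SPECIFIC_GRAVITY[name] * WATER_DENSITY)
            for name, mass in masses_kg.items()}


# Hypothetical batch weights per cubic meter of concrete.
mix = {"cement": 350.0, "water": 170.0, "fine": 700.0, "coarse": 1100.0}
volumes = absolute_volumes(mix)
constituents_volume = sum(volumes.values())
air_volume = 1.0 - constituents_volume  # remainder of the unit volume
print(f"constituents: {constituents_volume:.3f} m^3, air: {air_volume:.3f} m^3")
```

Under these assumed specific gravities, the four constituents fill most of the unit volume and the remainder is attributed to air.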

American Concrete Institute method (ACI)
The ACI method [8] was developed by the American Concrete Institute and is widely used for designing concrete mixes. It was first published in 1944, and the last revised version was published in 2002. A designer who chooses this method for designing a concrete mix relies on empirical data presented in tables. The method considers many aspects, including the shape and weight of the aggregate used as well as whether the mix is air-entrained or non-air-entrained.

Department of environment method (DOE)
The United Kingdom's Department of Environment produced the well-known DOE method for designing concrete mixes based on empirical data that are provided to designers in the form of curves [33]. The method is also known as the British Standard method, and the latest version was published in 1988. As is the case with ACI, the DOE method also considers many aspects, including the type of aggregates used in the mix, i.e., crushed or uncrushed, in addition to different types of cement.

Machine learning applied to concrete mix design methods
Nowadays, substantial attention is directed to machine learning and deep learning techniques, since they have proved capable of solving complicated problems. Such techniques have been utilized to solve concrete mix design problems such as the prediction of concrete compressive strength. In 1998, [38] reported the efficacy of using artificial neural networks (ANN) and linear regression in predicting the strength of high-strength concrete. The use of ANN in predicting the strength of concrete continued as machine learning and deep learning methods improved, solving problems related to strength prediction of normal and high strength concrete [5,11,15,20,23,37], high strength and high-performance concrete [9,22,27], ultra-high-performance concrete [17], recycled aggregate concrete [32], structural lightweight concrete [3] and self-consolidating concrete [30]. Besides neural networks, decision trees have also been used to predict the compressive strength of different types of concrete such as high strength and high-performance concrete [6,14], FRP-confined concrete [26], as well as recycled aggregate concrete [7,12]. Despite these efforts, the classification of concrete mixes given their ingredients has not been well studied.

Decision trees
Decision trees, or classification trees [4], are a group of algorithms that predict the class of an observation by descending the tree's nodes and branches from the first (root) node down to a leaf node. The leaf node contains the predicted (decided) class. The example tree in Fig. 1 illustrates the use of a decision tree for the classification of a binary target variable, either Y = 0 or Y = 1, based on two predictors, X1 and X2, whose values are in the range from 0 to 1. Nodes and branches are the essential components of a decision tree model. During the development of a classification tree, three processes are often performed, i.e., splitting, stopping, and pruning [31].
A tree node can be a root node (also known as a decision node), an internal node, or a leaf node. An internal node is situated in the middle of the tree, connecting its parent node to its child nodes, and holds one of the possible values available to it at that level of the tree. A leaf node contains the final result of the multiple decisions made down the tree. Branches connect the nodes, transmitting decisions across the tree.
In order to create a working decision tree, a few processes are carried out, including splitting. In particular, a decision tree is trained by passing the training data along nodes and branches. The data are repeatedly split according to the provided features, with the goal of producing child nodes that are purer with respect to the target variable. In addition to splitting there is stopping, which prevents the tree from growing without limit and becoming both computationally expensive and ineffective. In some cases, stopping criteria do not work well, necessitating the use of an alternative method, pruning, in which a tree is first allowed to grow large and then unnecessary nodes are removed (pruned), yielding an optimally sized tree [31].
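The descent from the root node to a leaf node can be sketched in a few lines. The example below is a minimal Python illustration (this work used MATLAB), assuming a tree stored as nested dictionaries; the function name predict, the tree layout and all thresholds are hypothetical and do not reproduce the tree in Fig. 1.

```python
def predict(node, sample):
    """Descend from the root to a leaf and return the class stored there."""
    while "label" not in node:  # internal nodes hold a test, leaves a label
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Toy tree with hypothetical thresholds on two predictors in [0, 1].
toy_tree = {
    "feature": "X1", "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": "X2", "threshold": 0.3,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

print(predict(toy_tree, {"X1": 0.8, "X2": 0.6}))
```

Each internal node compares one predictor with a threshold and passes the observation to one of its two children, so a prediction costs at most as many comparisons as the tree is deep.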

Dataset
The backbone of a machine learning model is the data on which it is trained. In this work, thousands of mix designs were used to train the machine learning model. Since designing a concrete mix manually can be cumbersome, as it requires following a large number of steps and checks, computer programs that implement concrete mix design methods were used. In particular, we used the MATLAB programming language to write programs that generate the dataset for the machine learning model. These programs design concrete mixes using three popular methods, namely, the American Concrete Institute method, the Department of Environment method, and the absolute volume method. The output of these programs was compared with manual calculations to confirm that there were no discrepancies between the two. Ten such comparisons were conducted for each of the three programs. A representative example is shown in Table 1. Each method's program was used to generate 1000 concrete mix designs, giving a total of 3000 mix designs across the three methods for training and testing purposes. Further, all generated concrete mixes were designed to produce concrete with a compressive strength of no less than 14 MPa and no greater than 42 MPa. The inputs and outputs of each program are summarized below:

AVM
Input: Required strength, materials properties, w/c ratio, and state of control on placing and mixing concrete.
Output: Weights (per cubic meter) of cement, water, fine and coarse aggregates.

ACI
Input: Required strength, materials properties, maximum nominal size of coarse aggregates (MNS), required workability (slump) and whether previous test records are available.
Output: Weights (per cubic meter) of cement, water, fine and coarse aggregates.

DOE
Input: Required strength, materials properties, maximum nominal size of coarse aggregates (MNS), types of coarse and fine aggregates, type of cement, level of exposure to harsh conditions, required workability (slump) and whether previous test records are available.
Output: Weights (per cubic meter) of cement, water, fine and coarse aggregates, suggested ratios for coarse aggregates.

Visualizing the dataset
In Fig. 2, histograms of the concrete mixes produced from each method are shown. It can be seen that the dataset is uniformly distributed across different strengths for all mix design methods. More importantly, relationships between ingredients are not visually evident and intricate overlaps are present, making it very challenging to distinguish the design method based on the amount of ingredients it recommends for a given compressive strength, see Fig. 3a-d. Parallel lines of all mixes are shown in

Features
The features of the dataset used in the tree model are the main concrete constituents per one cubic meter of each mix design and the corresponding compressive strength, namely, amounts of cement in kg, water in L, fine aggregate in kg and coarse aggregates in kg, as well as compressive strength in MPa.

Coding environment
For the purpose of preprocessing and visualizing the dataset, training and testing the model, MATLAB programming language was used.

Preprocessing of dataset
Prior to training, the dataset was standardized using the mean and standard deviation of each feature in the training data.
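As an illustration of this step, the sketch below standardizes one feature using statistics computed from the training data only and then applies the same transformation to the testing data. It is written in Python rather than the MATLAB used in this work; the cement amounts are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption.

```python
from statistics import mean, stdev


def fit_standardizer(train_column):
    """Return the mean and sample standard deviation of a training feature."""
    return mean(train_column), stdev(train_column)


def standardize(column, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]


# Hypothetical cement amounts in kg per cubic meter.
train_cement = [300.0, 350.0, 400.0]
test_cement = [325.0, 450.0]

mu, sigma = fit_standardizer(train_cement)
train_scaled = standardize(train_cement, mu, sigma)
test_scaled = standardize(test_cement, mu, sigma)  # reuses training statistics
print(train_scaled, test_scaled)
```

Reusing the training statistics for the testing data keeps information from the testing set out of the preprocessing step.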

Splitting the dataset and cross validation
Data (3000 mix designs) was split into training data (2400 mix designs) and testing data (600). Further, during model training 5-fold cross validation was used to prevent overfitting.
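The split and the cross-validation folds can be sketched as follows. This is a minimal Python illustration, not the MATLAB workflow used in this work: it operates on observation indices only, the function names and the random seed are hypothetical, and the simple shuffle shown here does not balance the classes within each subset.

```python
import random


def train_test_split(indices, test_fraction, seed=0):
    """Shuffle the indices and hold out a fraction of them for testing."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(indices, k):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


train, test = train_test_split(list(range(3000)), test_fraction=0.2)
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold(train, k=5)]
print(len(train), len(test), fold_sizes)
```

In each fold the model would be trained on four fifths of the training data and validated on the remaining fifth, while the testing data stay untouched until the final evaluation.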

Machine learning model parameters
Three versions of decision trees were used, depending on the number of nodes in the tree, namely, fine, medium, and coarse trees. Furthermore, the maximum number of splits was set to 100 and the splitting criterion was Gini's Diversity Index [18].
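To make the splitting criterion concrete, the sketch below computes the Gini diversity index of a node and the weighted index of the two child nodes produced by a candidate split. It is a Python illustration under stated assumptions: the function names, cement amounts, labels and threshold are hypothetical, and the code is not the MATLAB implementation used in this work.

```python
from collections import Counter


def gini(labels):
    """Gini diversity index: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini index of the two child nodes created by a threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical cement amounts (kg) and design methods.
cement = [300.0, 320.0, 400.0, 420.0]
method = ["ACI", "ACI", "DOE", "DOE"]

parent = gini(method)
children = split_impurity(cement, method, threshold=360.0)
print(parent, children)
```

A split is attractive when the weighted index of the child nodes is well below that of the parent node; in this toy case the threshold separates the two methods completely.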

Methods of evaluating classifier's performance
To measure the performance of the decision tree classifier, accuracy was used, which is defined as the ratio of the number of correctly classified observations to the total number of observations. In addition to accuracy, the receiver operating characteristic (ROC) curve was used, which plots the true positive rate against the false positive rate for the trained tree classifier. It was originally designed for binary classification but can be used in multi-class classification by evaluating one-vs-rest curves. A perfect classifier, which assigns all points to their correct class, appears as a right-angle curve through the top-left corner. A poor classifier produces a curve close to the 45° line. The area under the curve (AUC) indicates the performance of the classifier, with 1 corresponding to a perfect classifier.
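The one-vs-rest AUC can be computed without drawing the curve, since the area under the ROC curve equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses this rank-based form; the function name, scores and labels are hypothetical, and the approach is an illustration rather than the MATLAB routine used in this work.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class against the rest via pairwise score comparisons."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for the class "ACI".
scores = [0.9, 0.4, 0.6, 0.2]
labels = ["ACI", "ACI", "DOE", "AVM"]
print(auc_one_vs_rest(scores, labels, positive="ACI"))
```

Repeating the calculation with each method in turn as the positive class gives one AUC value per class.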
To further evaluate the model, a confusion matrix was also used, in which the rows correspond to the predicted class (predicted mix design method) and the columns correspond to the true class (actual mix design method). The diagonal of the matrix corresponds to observations that have been correctly classified and the off-diagonal cells show the incorrectly classified observations.
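A small sketch may help connect the confusion matrix to accuracy, precision and recall. The Python example below follows the orientation described above (rows are predicted classes, columns are true classes); the function names and the six observations are hypothetical and are not results from this work.

```python
CLASSES = ["AVM", "ACI", "DOE"]


def confusion_matrix(true, predicted):
    """Rows index the predicted class, columns index the true class."""
    index = {c: i for i, c in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for t, p in zip(true, predicted):
        matrix[index[p]][index[t]] += 1
    return matrix


def accuracy(matrix):
    """Correctly classified observations divided by all observations."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def precision(matrix, i):
    """Correct predictions of class i over all predictions of class i (row)."""
    return matrix[i][i] / sum(matrix[i])


def recall(matrix, i):
    """Correct predictions of class i over all true members of i (column)."""
    return matrix[i][i] / sum(row[i] for row in matrix)


# Hypothetical true and predicted design methods.
true = ["AVM", "AVM", "ACI", "ACI", "DOE", "DOE"]
pred = ["AVM", "DOE", "ACI", "ACI", "DOE", "DOE"]
m = confusion_matrix(true, pred)
print(m, accuracy(m), precision(m, 2), recall(m, 0))
```

With this orientation, precision is read along a row and recall along a column.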

Feature importance
In addition to being accurate, a machine learning model should also be efficient; hence, the number of features provided to the model should be optimized. This can be achieved by employing principal component analysis (PCA) [35] and the minimum redundancy maximum relevance (MRMR) technique [10].

Principal component analysis (PCA)
To reduce the dimensionality of the predictors in the dataset, PCA linearly transforms the predictors with the goal of removing redundant dimensions, generating principal components (features of paramount importance to the classification process).
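The notion of explained variance can be illustrated for the simplest case of two predictors, where the eigenvalues of the covariance matrix have a closed form. The Python sketch below is restricted to that two-feature case and uses hypothetical data; it is not the MATLAB procedure applied to the five features in this work, and the function name is an assumption.

```python
from math import sqrt
from statistics import mean


def explained_variance_2d(x, y):
    """Fraction of total variance carried by each of two principal components."""
    mx, my = mean(x), mean(y)
    n = len(x)
    a = sum((v - mx) ** 2 for v in x) / (n - 1)                    # var(x)
    c = sum((v - my) ** 2 for v in y) / (n - 1)                    # var(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (n - 1)   # cov(x, y)
    centre = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [centre + radius, centre - radius]  # of [[a, b], [b, c]]
    return [ev / (a + c) for ev in eigenvalues]


# Two perfectly correlated hypothetical predictors: one component suffices.
ratios = explained_variance_2d([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(ratios)
```

When two predictors carry the same information, nearly all of the variance falls on the first component and the second can be dropped.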

Minimum redundancy maximum relevance (MRMR)
To determine the importance of each feature in the dataset, the MRMR algorithm was used. The algorithm identifies the features that are of high importance to the classification process by minimizing the redundancy among the predictors and maximizing their relevance to the outcome variable.
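The idea of trading relevance against redundancy can be sketched with a greedy ranking based on mutual information. The Python example below is an illustration under several assumptions: features are already discretized, the score is relevance minus mean redundancy (other formulations use a quotient), and the feature names f1 to f3 and all data are hypothetical. It is not the MATLAB routine used in this work.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())


def mrmr_rank(features, target):
    """Greedily rank features by relevance minus mean redundancy."""
    remaining = list(features)
    selected = []
    while remaining:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized features and class labels.
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "f1": [0, 0, 1, 1, 2, 2, 2, 2],  # most relevant
    "f2": [0, 0, 0, 0, 1, 1, 1, 1],  # relevant but fully redundant with f1
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],  # relevant and only partly redundant
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy case the second feature is ranked last even though it is as relevant as the third, because the information it carries is already provided by the first.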

Results and discussion
Upon training the model, it was tested using the testing dataset, which contains 20% of the entire dataset, i.e., 600 mix designs, 200 of which come from the AVM method, 200 from the ACI method and 200 from the DOE mix design method. The classification accuracies on the training and testing data for the three types of trees are summarized in Table 2 below.
To further evaluate the classifier, one-vs-rest ROC curves were plotted, which test the performance of the tree classifier on training data. Fig. 4a shows the ROC curve with the ACI method as the positive class and the remaining methods as the negative class. For this comparison, the area under the curve was found to be 0.99, with a true positive rate (TPR) of more than 0.99, which indicates excellent classification performance, since a TPR above 0.99 means that the classifier assigns more than 99% of the observations of the positive class to the correct class. Fig. 4b shows the curve with the AVM method as the positive class and the remaining methods as the negative class. For this comparison, the area under the curve was found to be 0.97, which indicates satisfactory classification performance for this class. Fig. 4c shows the curve with the DOE method as the positive class and the remaining methods as the negative class. For this comparison, the area under the curve was also found to be 0.97, which indicates satisfactory classification performance for this class.
In the confusion matrix shown in Fig. 5, both the number of observations and the percentage of the total number of observations are presented in each cell. The diagonal cells (green) correspond to observations that have been correctly classified, while the off-diagonal cells (red) correspond to observations that have been incorrectly classified. The column on the far right of the matrix (light blue) displays the percentages of all the examples predicted to belong to each class (AVM, ACI or DOE) that have been correctly classified (known as precision or positive predictive value) and incorrectly classified (false discovery rate). The row at the bottom of the matrix (light blue) displays the percentages of all the examples that belong to each class that have been correctly classified (known as recall or true positive rate) and incorrectly classified (false negative rate). The cell in the bottom right corner of the matrix (grey) displays the overall accuracy. It can be seen that the tree model is an excellent classifier, reaching an accuracy of 96.3%, a precision of no less than 92.5% and a recall of no less than 93.5% across all classes when evaluated using previously unseen testing data. The tree model can be viewed in its actual form; however, here we only show the coarse tree model, as the fine tree model is too dense to be included. The coarse tree model is shown in Fig. 6.
Fig. 5 Evaluating the performance of the tree classifier using a confusion matrix on testing data. The rows correspond to the predicted class (predicted mix design method) and the columns correspond to the true class (actual mix design method). The diagonal of the matrix corresponds to observations that have been correctly classified, and the off-diagonal shows the incorrectly classified observations. In each cell, both the number of observations and the percentage of the total number of observations are shown.
To explain how the tree model classifies mix design methods, the coarse tree model shown in Fig. 6 (74.3% accurate) will be used. Suppose we have a mix that requires the following amounts: water: 170 L, fine aggregates: 700 kg, coarse aggregates: 1100 kg and cement: 350 kg. Starting at the top node, the amount of water in the mix directs us to the right branch of the tree (reader's right). Next, because the amount of fine aggregates, 700 kg, is greater than 650.08 kg, a leaf node is reached and the decision is made that the design method is DOE.
To determine feature importance, the fine tree model was retrained, this time employing PCA. The PCA-enabled fine tree model kept three features that explain 95% of the variance. The explained variance per feature is as follows: amount of cement: 66.6%, amount of water: 27.9%, amount of fine aggregates: 5%, amount of coarse aggregates: 0.5%, and concrete's compressive strength: 0.0%. The results of PCA are shown in Fig. 7a.
Similarly, the degree of importance of the features was evaluated using the MRMR algorithm, which assigns scores representing the importance of features. The scores are as follows: amount of cement: 0.3468, amount of water: 0.1592, amount of fine aggregates: 0.0838, amount of coarse aggregates: 0.0 and corresponding concrete's compressive strength: 0.0. The drop in score between the amount of cement and the amount of water is large, which reinforces that cement is the most important predictor of the mix design method. Meanwhile, the amount of coarse aggregates and the compressive strength seem to contribute minimally to the mix design method prediction. The results of MRMR are shown in Fig. 7b.
Fig. 7 Results of feature importance analyses using a PCA and b MRMR
Using the three most important features (cement, water, and fine aggregates), the tree models were retrained. The reduced-dimensionality models were less accurate than the models with the full list of features. The accuracies of the reduced-dimensionality models are presented in Table 3.

Conclusions
It is very important that the method by which a concrete mix was designed is known to site and construction engineers, so that they can properly interpret fresh concrete test results and carry out quality control. However, to both untrained and trained eyes, it can be difficult to discern which method was used to design a given mix. Hence, machine learning, specifically decision trees, was used to classify concrete mix design methods based on the concrete constituents' proportions for a given compressive strength, i.e., the amounts of cement, water, and coarse and fine aggregates. It was found that decision trees can classify mix design methods with an accuracy of 96%. It was shown that knowledge of the amounts of a mix's four ingredients is enough to accurately determine the method by which it was designed. Additionally, upon performing PCA and MRMR analyses, it was found that the amount of cement in the concrete mix is the most important predictor of the mix design method, followed by the amounts of water and fine aggregates. The findings of this work provide a model that can be used to discern mixes designed by different methods. Furthermore, this work presents an example of the benefits and effectiveness of using simple machine learning algorithms in solving civil engineering problems.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.