1 Introduction

Fifth generation (5G) wireless networks and beyond are called upon to support various services ranging from ultra-reliable Internet of Things (IoT) to broadband multimedia services, offering high reliability and throughput, massive connectivity, and low latency [1,2,3]. New generation networks have to provide reliable and high-throughput communications, overcoming the issues that arise from limited resources and low transmission power [2, 3]. The adaptive modulation and coding (AMC) technique is the best candidate to fulfill the requirements of new-era communication networks and can be effectively utilized in the 5G new radio (NR) communication system [4].

Traditionally, in mobile wireless communication systems, AMC is a mechanism whereby the base station (BS) selects the modulation and coding scheme (MCS) that offers the highest link quality. For the decision process, the BS utilizes different parameters that characterize the channel state of the user equipment (UE) and are fed back from the latter to the former. Periodically, the UE estimates its channel quality, maps this information into a channel quality indicator (CQI), and reports this metric to the BS. In the general case, the CQI is fed as an input to the AMC process, which finally estimates the MCS [5].

Considering the LTE cellular network or a non-standalone 5G network, the BS maintains a lookup table that matches the CQI report with the corresponding modulation and coding scheme. These AMC solutions consist of two different feedback loops called the Inner-Loop Link Adaptation (ILLA) and the Outer-Loop Link Adaptation (OLLA) [6]. Such methods have been extensively studied in the past [7, 8]. In case of poor channel quality due to extreme channel conditions arising from high path loss, scattering, fading, shadowing, and diffraction, the physical layer (PHY layer) should select lower-rank modulation and coding to avoid communication disruptions, thus maintaining the communication link [9]. Higher-order modulations are employed when the UE experiences good channel conditions with the BS. This AMC mechanism is referred to as rule-based AMC [10]. The rule-based AMC mechanism provides fixed rules for all users in LTE communication systems. The emerging 5G systems and services need a more flexible approach that automatically adjusts physical layer parameters according to the user channel state and service type.

Conventional AMC techniques are limited to mapping the reported CQI or signal-to-noise ratio (SNR) metric to an MCS using a predefined lookup table [11]. Practically, this means that the selection of the MCS value is sensitive to the SNR and depends only on the latter, ignoring different channel models and other link losses experienced during transmission, e.g., fading, shadowing, and Doppler shift [12]. Therefore, a possible under- or over-estimation of the SNR can affect the estimation of the MCS. Hence, an approach that effectively combines and exploits the various system operating parameters and channel conditions is missing. To this end, machine learning (ML) methods have been foreseen to realize many challenging tasks in wireless communications and can be adopted to solve the MCS selection problem. Essentially, the MCS estimation problem can be considered as a supervised multi-class classification problem that can be solved using machine learning methods, such as single-hidden-layer artificial neural networks (ANNs) and deep neural networks (DNNs) [12], support vector machines (SVM) [13], random forests (RF) [14], and bagging with k-nearest neighbors (B-kNN) base predictors [15].

In this paper, the employment of different machine learning algorithms for MCS prediction is presented, within the context of a non-standalone 5G network. More specifically, the machine learning algorithms ANN, RF, B-kNN, and SVM are utilized to build the different prediction models and compared in terms of accuracy, recall, precision, and F1-score. For the performance evaluation of the proposed models, a simulated data set is considered with more than 13,500 samples produced by ray-tracing techniques. Furthermore, an extensive analysis of the physical layer attributes in an urban environment related to the MCS prediction for the 5G network under consideration is presented, including the reference signal received power (RSRP), the reference signal received quality (RSRQ), the carrier received signal strength indicator (RSSI), the signal-to-interference-plus-noise ratio (SINR), the propagation distance, the BS altitude, the path visibility, and the operating frequency. Moreover, since the quality of the features within the dataset has a substantial impact on the accuracy of our predictions, feature selection methods are employed and outlined to identify and isolate the most pertinent and non-redundant features for our modeling efforts. Also, the data splitting method was adopted to avoid overfitting issues, dividing the original data set into three subsets: training, validation, and testing [16]. In general, balancing between overfitting and underfitting is crucial in machine learning: an overfitted model memorizes the training data but fails to generalize, whereas an underfitted model is too simplistic to capture the underlying patterns. Common remedies include splitting the data into training, validation, and test sets, selecting relevant features, applying regularization, fine-tuning the hyperparameters, employing early stopping, and, where appropriate, simplifying the model or resorting to ensemble methods. First, a random selection following a uniform distribution is applied to the samples of the total data set to divide them into training, validation, and testing subsets of 70%, 15%, and 15%, respectively. Subsequently, each ML model is trained on the data from the training subset and its hyperparameters are tuned utilizing the validation set. Finally, the tuned ML models are evaluated on the test subset.
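For concreteness, the splitting step can be sketched as follows with scikit-learn; the arrays X and y are hypothetical placeholders for the feature matrix and the MCS labels, and a chained two-stage split yields the 70%/15%/15% proportions used throughout this work.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (samples x features) and MCS label vector y.
rng = np.random.default_rng(seed=0)
X = rng.random((13500, 8))
y = rng.integers(0, 13, size=13500)

# First split off 70% for training; the remaining 30% is halved to obtain
# the 15% validation and 15% testing subsets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, shuffle=True, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, shuffle=True, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # approximately 70% / 15% / 15%
```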

1.1 Related work

The application of optimal link adaptation in different cellular network technologies has been studied in the past [17,18,19,20,21,22]. Ohseki and Suegara in [17] propose a new OLLA scheme to realize low-latency transmission in LTE-Advanced and future wireless networks. The proposed technique controls the size of the compensation in the estimated SINR, based on the time elapsed after a UE transits from an idle state to an active state. This approach decreases the transmission latency, particularly in small-packet applications. Furthermore, Ramamurthi and Chen in [18] introduce a link adaptation scheme designed for LTE-Advanced cellular networks. This scheme efficiently determines the MCS as well as the Multiple-Input Multiple-Output (MIMO) configuration. The objective is to enhance the overall performance of the LTE downlink. The proposed scheme considers both the average SNR and the user mobility information. The results of this work revealed that link adaptation can be aggressive in choosing MIMO/MCS combinations for stationary terminals, while a conventional approach is better suited as the mobility of the terminals increases. Additionally, [19] and [20] designed MCS selection schemes to improve the overall wireless network performance in terms of quality of service (QoS); [19] aims to eliminate the obstacle of joint Resource Block (RB) allocation and MCS selection in the context of the LTE femtocell Downlink (DL), whilst [20] tries to meet the challenges of multilink adaptation in 5G ultra-reliable low-latency communications (URLLC) for better utilizing the available radio resources while assuring the QoS requirements of users. In another approach [21], a joint link adaptation and scheduling scheme for 5G URLLC is presented. The proposed scheduling mechanism reduces the URLLC latency from 1.3 to 1 ms at the 99.999th percentile, with less than 10% degradation of the enhanced mobile broadband (eMBB) throughput performance. Lastly, [22] has studied the application of the AMC technique for MCS selection in the downlink of a standard 5G network from a physical layer perspective. More specifically, the authors derive the mathematical expression that describes the relationship between SNR and CQI via a straight-line fitting method and propose a new SNR-to-CQI mapping algorithm for MCS selection with improved adjustment factor optimization. Through computer simulation results, it can be observed that the proposed algorithm achieves reduced errors and higher throughput compared to a traditional MCS selection approach.

Considering the application of machine learning-based methods for MCS prediction, many research efforts have been conducted, e.g., see [12, 23,24,25,26,27,28]. An ANN has been proposed in [12] to estimate the SNR utilized by an AMC scheme to select the appropriate modulation and coding rate for transmission. The obtained results have shown that, compared to a standalone Error Vector Magnitude (EVM) based link adaptation scheme, the proposed method for MCS selection provides high accuracy and improved throughput performance at a lower complexity. Furthermore, in [23], the authors introduced two ML-assisted AMC schemes for IEEE 802.11n wireless communication systems based on MIMO and OFDM transmission under different channel conditions. Comparisons of the proposed methods with traditional MCS schemes showed that higher MCS estimation accuracy is achieved. Also, the authors in [24] proposed a modified autoregressive integrated moving average (ARIMA) model for CQI prediction in LTE-based mobile satellite communications. Their approach predicts discrete CQI states instead of continuous values and provides the reference CQI for the AMC mechanism. This modified technique reduces the prediction complexity and delay. Also, they proved that their approach could realize CQI prediction with satisfying root mean square error (RMSE) performance. In [25], the authors presented a CQI prediction scheme using a Feed-Forward Neural Network algorithm in the context of MIMO 3rd Generation Partnership Project (3GPP) LTE systems. The predicted CQI values improve the trade-off between Bit Error Rate (BER) and Spectral Efficiency (SE). In [26], supervised machine learning methods, such as ANNs and SVM, are experimentally demonstrated for in-band optical signal-to-noise ratio (OSNR) estimation and modulation classification, respectively. The presented results show that the estimation of in-band OSNR and the modulation classification are achievable from directly detected signals employing advanced modulation schemes of up to 64-QAM with varying pulse shapes. Moreover, in [27], a convolutional neural network (CNN) approach is proposed to automatically classify the MCS. Numerical results have shown that the proposed scheme provides significant advantages compared to other deep learning-based methods in terms of accuracy and computational complexity. Furthermore, different ANN architectures were trained in [28]. The goal of the classifier was to predict the modulation class, given a set of 12 input features. The ANN method with two hidden layers achieved an accuracy of 98.4% with a model training time of 20 s.

1.2 Contributions

As presented in the previous detailed literature review, current research attempts [12, 23,24,25,26,27,28] apply machine learning-based methods for MCS prediction without evaluating and comparing their performance in terms of accuracy, precision, recall, and F1-score metrics. Moreover, there are no research efforts in the literature that treat the MCS prediction problem as a multiclass classification problem employing different ML methods. Only the approach in [28] follows a similar logic, but its operation is limited to predicting the modulation scheme alone using multiple ANN configurations, and in the context of optical networks. All in all, and to the best of our knowledge, this is the first time that different state-of-the-art supervised ML methods have been used for MCS prediction. Specifically, we address a multiclass classification problem within the context of a non-standalone 5G communication system. Regarding the multiclass classification task related to MCS, we have conducted a dedicated examination of the following machine learning methods: RF, B-kNN, SVM, and ANN. The training, validation, and testing phases of the ML models are performed exploiting data collected from a simulation tool in various urban locations for a 5G C-RAN network operating at 2.1 GHz. The main contributions of this paper can be summarized as follows:

  • This paper provides a comprehensive evaluation of contemporary machine learning techniques, namely ANN, RF, B-kNN, and SVM, for MCS prediction.

  • For the MCS assignment, a novel MCS prediction approach based on machine learning is offered in lieu of the conventional MCS selection method.

  • Investigation of the level of detail in terms of input features and their effect on prediction accuracy and generalization capability.

1.3 Structure

The remainder of this paper is organized as follows. Section 2 presents the 5G C-RAN network in an urban environment. Section 3 outlines the machine-learning-based methods for MCS prediction, while Sect. 4 presents the simulation procedure for the data collection. The data pre-processing, the learning and validation procedures, and the performance metrics are outlined in Sect. 5. Section 6 evaluates and discusses the performance of all tested machine learning algorithms, followed by a summary in Sect. 7.

2 System model

The 5G-NR is a new radio-access technology supporting an expanded range of new services, such as massive machine-type communications, enhanced mobile broadband, and ultra-reliable and low-latency communications. These services are provisioned for a variety of applications in different sectors. The modulation and coding scheme is a key enabling technology for broadband mobile internet and has been part of the 5G NR access technology [29]. The automatic adaptation of the MCS offers high data rates and reliable transmission, which underlines its importance. Traditionally, in 4G LTE cellular networks, the UE reports the CQI to the BS. The BS utilizes the CQI index value, which is mapped to an estimated SINR. Nevertheless, transmission delays and the state of the communication link make optimal MCS selection more challenging. The proposed technique enhances MCS selection by training machine-learning algorithms with physical layer parameters, hence eliminating the conventional dependence on lookup tables.

2.1 Network architecture

Fig. 1
figure 1

5G C-RAN architecture

As depicted in Fig. 1, a typical 5G C-RAN (Cloud Radio Access Network) architecture is considered, where the Baseband Units (BBUs) from multiple BSs are pooled into a centralized BBU Pool. The Remote Radio Head (RRH) of each BS is located on a tower and connected to the BBU pool, implemented in the cloud, through an optical fiber link defined as the Ir interface. Each BBU controls one or multiple RRHs, under the restriction of the maximum data volume that a BBU can handle [30, 31]. The virtualization of BBUs in a centralized cloud architecture provides increased flexibility in network upgrades and adaptability to non-uniform traffic. Moreover, direct communication between BBUs is enabled due to the virtualization of BBUs in a centralized cloud architecture. In this architecture, the RRH performs the functions related to layer 1, including radio frequency (RF) amplification, A/D and D/A conversion, filtering, digital processing, and interface adaptation. Meanwhile, various functions, including modulation, coding, fast Fourier transform, and the selection of suitable frequencies or channels at layers 2 and 3, are carried out within the BBU pool. The standard for the AMC process in downlink transmission is defined in the 3GPP specification [32]. Additionally, the switch manager assumes a pivotal role, as it is responsible for managing the connections between the virtualized BBUs and the RRHs. It serves as a central element in the network infrastructure, facilitating communication between the BBUs and RRHs. The MCS scheme relies on the UE's channel measurements and feedback of the channel state information reference signal (CSI-RS) transmitted by the BS.

2.2 Channel model

The communication channel between RRHs and UEs is characterized by dominant propagation phenomena, such as shadowing in an urban environment and reflection at building walls. Several propagation models based on the ray-optical technique can model these effects for obtaining highly accurate prediction results. However, these models are still time-consuming, even with accelerations like preprocessing. Consequently, the dominant path model (DPM) is used for the considered communication system, which determines the dominant path between the transmitter and the receiver [33]. This propagation model can achieve accuracy nearly identical to ray tracing techniques with reduced computation time. The path loss expression using DPM is given as follows:

$$\begin{aligned} \begin{aligned} PL&=20\cdot \log \left( \frac{4\pi }{\lambda }\right) +20\cdot p \cdot \log (d) \\ {}&\quad + \sum _{i=0}^{n} \alpha (\phi ,i)-\frac{1}{c}\sum _{k=0}^{c}w_{k}, \end{aligned} \end{aligned}$$
(1)

where d represents the distance between the RRH and the UE in meters. The factor p depends on the visibility state between the current pixel and the transmitter. Also, \(\lambda \) denotes the wavelength, and the function \(\alpha (\cdot )\) yields the loss (in dB) caused by interactions, such as changes in the direction of propagation due to walls. \(\phi \) represents the angle between the previous and new directions of signal propagation. Additionally, the parameter n indicates the total number of walls. Furthermore, the parameter \(w_{k}\) is referred to as the waveguiding factor, while the parameter c indicates the total number of accumulated angles that represent changes in the direction of the path. All of the aforementioned details are explained in depth in [33]. Moreover, since the UEs are fixed within the region of interest, the Doppler shift effect is negligible.
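To make the use of (1) concrete, the following sketch evaluates the DPM path loss for a single link; the interaction losses and waveguiding factors are placeholder values, not calibrated DPM parameters.

```python
import numpy as np

def dpm_path_loss(d, wavelength, p, interaction_losses_db, waveguiding_factors):
    """Dominant path model path loss in dB, following Eq. (1).

    d: RRH-UE distance in metres
    wavelength: carrier wavelength in metres
    p: visibility-dependent path-loss factor
    interaction_losses_db: losses alpha(phi, i) in dB, one per interaction
    waveguiding_factors: accumulated waveguiding factors w_k
    """
    free_space = 20.0 * np.log10(4.0 * np.pi / wavelength)
    distance_term = 20.0 * p * np.log10(d)
    interactions = float(np.sum(interaction_losses_db))
    c = len(waveguiding_factors)
    waveguiding = float(np.sum(waveguiding_factors)) / c if c > 0 else 0.0
    return free_space + distance_term + interactions - waveguiding

# Illustrative link at 2.1 GHz over 150 m with two wall interactions and
# three waveguiding contributions (all values are placeholders).
print(dpm_path_loss(d=150.0, wavelength=3e8 / 2.1e9, p=1.0,
                    interaction_losses_db=[6.0, 8.5],
                    waveguiding_factors=[1.2, 0.8, 1.0]))
```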

3 Machine-learning-based models for MCS prediction

This section presents the different machine learning methods proposed to predict the most accurate MCS for transmission. Essentially, what should be predicted is the class id shown in Table 3 of Sect. 4.2. Consequently, MCS prediction can be regarded as a supervised multiclass classification problem, which can be addressed through the utilization of various ML algorithms, such as ANN, SVM, RF, and B-kNN. The major principles of these four algorithms are introduced as follows.

3.1 Artificial neural networks

ANNs are inspired by biological neural networks, trying to imitate the behavior of the human brain by producing complex nonlinear relationships between the input features and the predicted class. ANNs are composed of neuron layers, containing an input layer, one or more hidden layers, and an output layer. Messages are passed from the neurons of the previous layer to the neurons of the next layer. When the ANNs structure consists of more than one hidden layer, it qualifies as a Deep Neural Network (DNN).

Fig. 2
figure 2

Developed ANN model

Consider the DNN architecture consisting of \(l=\{1,2,\dots ,L\}\) layers for MCS prediction, as illustrated in Fig. 2. The first layer for \(l = 1\) is the input layer, the last layer for \(l=L\) is the output layer, and there are \(L-2\) hidden layers in total. The input layer consists of k neurons which represent the input features vector \({\textbf {x}}\) for the DNN:

$$\begin{aligned} {\textbf {x}} = \left[ x_1,x_2,\dots ,x_k \right] . \end{aligned}$$
(2)

Each hidden layer, as well as the output layer for \(l=\{2,\dots ,L\}\), consists of \(C_l\) neurons in total. In particular, the output layer consists of \(C_L=t\) neurons, where t is the total number of classes that we want to predict. It is obvious that for the input layer \(C_1=k\). Let us now focus on the information flow between the neurons of the DNN. As an illustrative example, the input of the i-th neuron in the first hidden layer \(\left( l=2\right) \) can be expressed as follows [12]:

$$\begin{aligned} s_i^{2} = \sum _{j=1}^{C_1} w^{2}_{i,j} x_j + b^2_i, \end{aligned}$$
(3)

where \(w^{2}_{i,j}\) denotes the weight of the connection of the j-th neuron in the input layer with the i-th neuron in the first hidden layer, and \(b^2_i\) is the bias for the input of the i-th neuron in the first hidden layer. Therefore, the input of the i-th neuron in the l-th layer can be expressed via the following equation [12]:

$$\begin{aligned} s_i^{l} = \sum _{j=1}^{C_{l-1}} w^{l}_{i,j} y^{l-1}_j + b^l_i, \quad 2 \le l \le L, \end{aligned}$$
(4)

where \(y^{l-1}_j\) represents the output of the j-th neuron in the \(l-1\) layer. Thus, for \(l=2\) and \(y^{1}_j = x_j\) with \(j = \{1, 2, \dots , k\}\), from  (4) we can derive  (3). Subsequently, the output of the i-th neuron in the l-th layer \(\left( 2 \le l \le L\right) \) can be expressed as follows [12]:

$$\begin{aligned} y_i^{l} = f^{l}\left( s_i^{l}\right) = f^{l}\left( \sum _{j=1}^{C_{l-1}} w^{l}_{i,j} y^{l-1}_j + b^l_i \right) , \end{aligned}$$
(5)

where f is the activation function applied to the neurons of the l-th layer. Furthermore, to express the output of the neurons of the l-th layer with \(2 \le l \le L\), the previous equation can be written as follows [12]:

$$\begin{aligned} {\textbf {y}}^{l} = {\textbf {f}}^{l}\left( {\textbf {w}}^{l} {\textbf {y}}^{l-1} + {\textbf {b}}^{l}\right) , \end{aligned}$$
(6)

where the vector \({\textbf {y}}^{l}\) is composed of the outputs of the neurons of the l-th layer and equals [12]:

$$\begin{aligned} {\textbf {y}}^{l} = \left[ y^{l}_1,y^{l}_2,\dots ,y^{l}_{C_l} \right] . \end{aligned}$$
(7)

Note that for neurons in the hidden layers the Rectified Linear Unit (ReLU) activation function is used, which is expressed by [34]:

$$\begin{aligned} ReLU(x)={\left\{ \begin{array}{ll} x, &{} \text { if } x>0 \\ 0,&{} \text { otherwise } \end{array}\right. } \end{aligned}$$
(8)

while the activation function used for the neurons of the output layer L is the SoftMax. Therefore, each output from the neurons of the output layer is expressed as follows:

$$\begin{aligned} \begin{aligned} y_u^{L}&= SoftMax(s_u^{L}) = \frac{e^{s_u^{L}}}{\sum _{v=1}^{t} e^{s_v^{L}}}, \\&\textrm{for} \; u = \{1,2,\dots ,t\} \end{aligned} \end{aligned}$$
(9)

where e denotes Euler’s number, raised to the power of each \(s_u^{L}\).

The \({\textbf {y}}^{L}\) vector with the output of the neurons of the output layer equals to:

$$\begin{aligned} {\textbf {y}}^{L} = \left[ y^{L}_1,y^{L}_2,\dots ,y^{L}_{t} \right] . \end{aligned}$$
(10)

Finally, the predicted \(\widehat{MCS}\) value is determined through the following expression [12]:

$$\begin{aligned} \widehat{MCS} = \underset{u=\left\{ 1,2,\dots ,t \right\} }{\arg \max } y_{u}^{L} \end{aligned}$$
(11)

In order to evaluate the performance of the constructed ANN, we consider the SoftMax cross-entropy cost function, which is defined as:

$$\begin{aligned} Loss = - \sum _{i=1}^{t} y_i \log \left( y_i^{L}\right) , \end{aligned}$$
(12)

where \(y_i\) and \(y_i^{L}\) are the probability of occurrence of class i and the output probability of class i from the i-th neuron of the output layer, respectively. The minimization of the cost function (12) in ANN models is carried out by applying various optimization methods available in the existing literature, such as Stochastic Gradient Descent (SGD), Resilient Backpropagation, Levenberg-Marquardt, Scaled Conjugate Gradient, Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton, and many others [35]. The Stochastic Gradient Descent algorithm [12] is used in this study for the learning process of the neural networks under consideration. It is commonly applied to multiclass classification problems due to its rapid convergence [36].
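As an illustrative sketch of the architecture described above, an MLP with two ReLU hidden layers, a SoftMax output, the cross-entropy loss (12), and SGD with momentum can be instantiated with scikit-learn's MLPClassifier; the layer sizes and optimizer settings shown here anticipate the values tuned in Sect. 5.2.1, and the synthetic data stands in for the processed feature set.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the normalized feature matrix (k = 5 features)
# and the MCS class labels (t = 13 classes).
rng = np.random.default_rng(seed=0)
X_train = rng.random((1000, 5))
y_train = rng.integers(0, 13, size=1000)

# Two ReLU hidden layers; for multi-class targets the SoftMax output (9)
# and the cross-entropy loss (12) are applied internally.
ann = MLPClassifier(hidden_layer_sizes=(44, 43),
                    activation='relu',
                    solver='sgd',          # stochastic gradient descent
                    momentum=0.8,
                    learning_rate_init=0.01,
                    early_stopping=True,
                    n_iter_no_change=20,   # patience, cf. Sect. 5.2.1
                    max_iter=500,
                    random_state=0)
ann.fit(X_train, y_train)
mcs_hat = ann.predict(X_train[:5])  # arg-max over the SoftMax outputs, Eq. (11)
```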

3.2 Support vector machine

The one-against-all (OAA) method is an implementation of the SVM algorithm for multi-class classification problems [37]. For a t-class classification problem like MCS prediction, OAA SVM constructs t binary SVM classifiers. Thus, given a training data set:

$$\begin{aligned} D=\left\{ (x_{1},y_{1}),(x_{2},y_{2}),...,(x_{m},y_{m})\right\} , \end{aligned}$$
(13)

where \(x_{i}\in R^{k}, i=1,...,m\) is the instance vector as given in (2), and \(y_{i}\in \left\{ 1,...,t \right\} \) is the class of \(x_{i}\). The j-th binary SVM is trained with all the examples in the j-th class as positive labels, and all other examples as negative labels. Solving the optimization problem (1) in [37], the hyperplane created by each j-th SVM is given by the following linear expression [38]:

$$\begin{aligned} F_{j}(x)=W_{j}^{T}\phi (x)+b_{j}, \end{aligned}$$
(14)

where \(W_{j}\) is the normal vector that regulates the direction of the j-th hyperplane, \(\phi (\cdot )\) is the non-linear mapping function, and \(b_{j}\) stands for the bias. Then, using the Lagrange multipliers and solving the problem according to [38], the binary classification problem for each j-th SVM can be defined as:

$$\begin{aligned} F_{j}(x)=\sum _{i=1}^{m}a_{i}y_{i}K({x_{i},x})-b_{j}, \end{aligned}$$
(15)

where \(a_{i}\) are the Lagrange multipliers and K(\(\cdot \),\(\cdot \)) is a kernel function that realizes the nonlinear mapping from low to high dimensional space. The radial basis function (RBF) is a popular kernel for multiclass classification applications. The related expression can be described as [38]:

$$\begin{aligned} K_{RBF}(x_{i},x)=\exp (-\gamma \left\| x_{i}-x \right\| ^{2}), \end{aligned}$$
(16)

where \(\left\| \cdot \right\| \) denotes the norm, and \(\gamma \) is the adjustable parameter that is fitted to the data and controls the performance of the kernel. The overall prediction of the OAA SVM method is defined as the class whose binary SVM classifier yields the maximum decision value. The OAA SVM classifier’s decision function is defined as [37]:

$$\begin{aligned} \widehat{MCS}=\underset{ j\in \left\{ 1,...,t \right\} }{{\arg \max }}F_{j}(x) \end{aligned}$$
(17)
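A minimal sketch of the OAA strategy with an RBF kernel, using scikit-learn's OneVsRestClassifier around a binary SVC; the C and \(\gamma \) values anticipate those selected in Sect. 5.2.2, and the synthetic data is a placeholder.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic placeholder for the normalized training data (5 features, 13 classes).
rng = np.random.default_rng(seed=0)
X_train = rng.random((1000, 5))
y_train = rng.integers(0, 13, size=1000)

# One binary RBF-kernel SVM per MCS class, Eqs. (14)-(16); the final decision
# picks the class with the largest decision value, Eq. (17).
oaa_svm = OneVsRestClassifier(SVC(kernel='rbf', C=120, gamma=17))
oaa_svm.fit(X_train, y_train)
mcs_hat = oaa_svm.predict(X_train[:5])
```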

3.3 Random forest

Fig. 3
figure 3

Developed RF model

RF is an ensemble ML method consisting of multiple decision tree base learners, as depicted in Fig. 3. Each decision tree (DT) in the RF method is trained on bootstrapped subsets \(D_{s}\) \((D_{s} \subseteq D)\) of the training set D given in (13). The building of each DT is carried out by applying various methods available in the existing literature, such as Information Gain, Gain Ratio, Gini Index, and Least Squares [39]. However, their suitability depends on the type of problem the RF method has to deal with. In this work, the information gain (IG) method is adopted for the building process of the base learners in the RF method. The IG is commonly applied to binary and multi-class classification problems and is utilized by RF methods to minimize the uncertainty in the trees; it can be described as:

$$\begin{aligned} \begin{aligned} Gain\left( D_s,A\right)&=Entropy\left( D_s \right) -\\&\sum _{v\in Values\left( A \right) }\frac{\mid D_v \mid }{\mid D_s \mid } Entropy\left( D_v \right) \end{aligned} \end{aligned}$$
(18)

where Values(A) is the set of all possible values of attribute A, and \(D_{v}\) is the subset of the bootstrapped subset \(D_{s}\) for which attribute A takes the value v. \(Entropy(D_{s})\) is the entropy measure of the bootstrapped subset, which can be expressed as:

$$\begin{aligned} Entropy(D_{s})=\sum _{i=1}^{t}-P_{i}\log _{2}P_{i}, \end{aligned}$$
(19)

where t denotes the number of classes of the output variable, and \(P_{i}\) is the probability of the i-th class. The next step for each DT in the RF method is the node division, which randomly selects a subset of features, calculates the IG for each possible splitting point of each selected feature, and finds the best binary split among all binary splits on the selected features. Thus, during the training process, S DT models are defined in total. The overall prediction is then yielded as the most frequently predicted (“voting”) response from all the independently trained trees. The final prediction can be described as [39]:

$$\begin{aligned} \widehat{MCS}(x)=\underset{y\in Y}{\arg \max } \sum _{j=1}^{S}I(F_{j}(x)=y) \end{aligned}$$
(20)

where \(F_{j}(x)\) is the prediction of the response variable at input x using the j-th tree, Y is the set of possible class values over which y ranges, and I is the zero–one loss function [40] that can be described as:

$$\begin{aligned} I(F(x)=y)={\left\{ \begin{array}{ll} 1, &{} \text { if } F(x)=y \\ 0,&{} \text { otherwise } \end{array}\right. } \end{aligned}$$
(21)
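The RF predictor of (20) with entropy-based splits can be sketched as follows; the tree depth and ensemble size anticipate the values tuned in Sect. 5.2.3, and the synthetic data is a placeholder.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder for the normalized training data (5 features, 13 classes).
rng = np.random.default_rng(seed=0)
X_train = rng.random((1000, 5))
y_train = rng.integers(0, 13, size=1000)

# Each tree is grown on a bootstrapped subset and split with an entropy-based
# criterion, matching the information-gain rule of (18)-(19); the ensemble
# prediction is the majority vote of (20).
rf = RandomForestClassifier(n_estimators=10, max_depth=5,
                            criterion='entropy', bootstrap=True,
                            random_state=0)
rf.fit(X_train, y_train)
mcs_hat = rf.predict(X_train[:5])
```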

3.4 Bagging-kNN

B-kNN is an ensemble ML method consisting of multiple kNN base learners. During the bagging process, S bootstrap subsets \(\left( D_{1}, D_{2}, \dots , D_{S}\right) \) are selected with replacement from the initial training set D, as illustrated in Fig. 4. Then, each one of the S bootstrap subsets is fitted by a kNN base predictor \(F_{j}\) with \(j=1,\dots ,S\). The primary concept behind the kNN approach is to use distance metrics to locate the K training samples closest to the sample under prediction and then forecast the outcome based on the majority vote of the K neighbors. Let us define the subset \(D_{K}\) \((D_{K} \subseteq D_{j} \subseteq D)\) with the K training samples closest to the example under prediction as follows:

$$\begin{aligned} D_{k}=\left\{ (x_{1},f_{1}(x_{1})),(x_{2},f_{2}(x_{2})),...,(x_{K},f_{K}(x_{K}))\right\} , \end{aligned}$$
(22)

where \(D_{j}\) is the subset of the j-th kNN base learner, \(x_{i}\in R^{k} \; \text {with} \; i=1,...,K\) are the closest instances to the sample under prediction, and \(f_{i}(x_{i})\in \left\{ 1,...,t \right\} \) is the class of \(x_{i}\). Then, the final prediction of the kNN base learner is determined utilizing a voting method that considers the K nearest neighbors and can be expressed as follows:

$$\begin{aligned} \widehat{F_{j}}(x)=\underset{y\in Y}{\arg \max } \sum _{i=1}^{K}I(f_{i}(x_{i})=y). \end{aligned}$$
(23)
Fig. 4
figure 4

Developed B-kNN model

The distance between the training samples and the example under prediction can be calculated using different methods, such as the Euclidean distance (ED), the city block distance (CBD), also known as the Manhattan distance, or the cosine distance (CD) [41, 42]. For the considered multi-class classification problem of MCS prediction, the use of the ED is the most suitable approach, since the shortest paths between the K neighbors and the instance under prediction should be determined. The ED formula is expressed as follows:

$$\begin{aligned} ED({\textbf {p}},{\textbf {q}})=\sqrt{\sum _{i=1}^{k}(p_{i}-q_{i}) ^{2}}, \end{aligned}$$
(24)

where \({\textbf {p}}=\{p_1,p_2,\dots ,p_k\}\) and \({\textbf {q}}=\{q_1,q_2,\dots ,q_k\}\) are vectors with size equal to the total number of features k. Consider that the \({\textbf {p}}\) vector represents one instance of the training data set, while the \({\textbf {q}}\) vector is the instance whose class we want to predict.

During the training process, S kNN base learners are utilized in total. The final prediction is obtained as the most frequently predicted (“voting”) response from all the independently trained kNN base learners. Therefore, the final MCS prediction can be expressed as follows [42]:

$$\begin{aligned} \widehat{MCS}(x)=\underset{y\in Y}{\arg \max } \sum _{j=1}^{S}I(\widehat{F_{j}}(x)=y). \end{aligned}$$
(25)
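A sketch of the B-kNN ensemble of (25), combining bagging with Euclidean-distance kNN base learners; it assumes scikit-learn ≥ 1.2 (where the base learner is passed via the estimator argument), and the S = 30, k = 5 values anticipate Sect. 5.2.4.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic placeholder for the normalized training data (5 features, 13 classes).
rng = np.random.default_rng(seed=0)
X_train = rng.random((1000, 5))
y_train = rng.integers(0, 13, size=1000)

# S = 30 bootstrapped kNN base learners, each voting over its k = 5
# Euclidean-distance neighbours; the ensemble vote implements Eq. (25).
bknn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5, metric='euclidean'),
    n_estimators=30, bootstrap=True, random_state=0)
bknn.fit(X_train, y_train)
mcs_hat = bknn.predict(X_train[:5])
```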

4 Network simulation and data collection

Since ML methods learn how to match predictions to patterns observed from the training procedure, a data set containing various features that characterize the transmission over the physical layer should be created. To this end, this section presents the way of conducting the simulation of the considered network, as given in Sect. 2. Moreover, the MCS data set generated through the simulation is thoroughly described and analyzed.

4.1 Topology

Figure 5 illustrates the topology of the non-standalone 5G network under consideration, which is overlaid on an existing 4G LTE network operating in an urban region of Frankfurt. The solid gray outlines reflect the building infrastructure. The varying building heights and positions offer realistic shadowing and fading conditions. Moreover, the considered network consists of three RRH locations installed on the rooftops of three different buildings. Each RRH in the region of interest has a different height above ground level and operates at an individual frequency. Furthermore, the frequency division duplex (FDD) mode is employed, and the system operating frequency equals 2.1 GHz with an occupied channel bandwidth of 20 MHz. Additionally, the UEs are fixed at a height of 1.5 m above ground level, and the spatial resolution for their locations is 5 m. Since the primary objective is to optimize the statistical analysis, UEs are scattered across the urban area, where different geographical configurations exist, e.g., dense or sparse regions.

Fig. 5
figure 5

Top view of the urban area of Frankfurt consisting of three RRH locations for the considered 5G network

Further, it is noted that UEs can receive signals from all three RRH positions. However, a handover scheme is implemented so that each UE chooses to be served by the RRH that offers the highest QoS. The rest of the transmission parameters that outline the considered 5G network are listed in Table 1.

Table 1 Transmission parameters

4.2 MCS simulation and data set generation

For the simulation of the 5G communication system under consideration, a well-recognized software suite named WinProp® is used [43]. As illustrated in Fig. 6, the software considers each pixel as a receiver location and calculates all the network’s physical parameters for each pixel separately. For instance, in the cell assignment procedure, the best RRH for each receiver pixel is determined based on the maximum received power. For data rate predictions, the WinProp® software applies a minimum SINR criterion.

Fig. 6
figure 6

Simulated throughput results through MCS selection for all receiver locations in the region of interest

Specifically, at receiver sites where the predicted SINR is greater than a predetermined SINR threshold, an MCS reflecting this SINR value is chosen. Otherwise, the receivers experience outage. The selected MCS value determines the data rate for each receiver location, as depicted in Fig. 6.

Table 2 The entire feature set defined and explored in the developed models
Fig. 7
figure 7

Distribution of data set per class

Furthermore, the software exports prediction results for each metric of interest separately in tabular data format. In the following, eight network metrics that impact the MCS are collected and combined for each receiver location separately. Table 2 gives an outline of the features that were defined and used in the MCS prediction. In total, 13,675 raw data samples were collected. The distribution of raw data samples by MCS is shown in Fig. 7. It is evident that the instances are not uniformly distributed, so data manipulation is needed before training the ML methods. Moreover, we use 13 different MCS indices, which are distinguished by the type of modulation M and the code rate R. The values of the parameters M and R, as well as the corresponding throughput, are given in Table 3. At this point, it is crucial to highlight that our ML algorithms require a balanced training dataset encompassing both high and low metric scenarios to adapt effectively to real-world conditions. Training solely on high metrics can lead to a bias towards predicting higher MCS values, potentially causing suboptimal performance in denser network areas where lower MCS values are more appropriate for mitigating bottlenecks and ensuring network stability.

Table 3 Modulation, coding and throughput scheme table

5 Feature selection, model training and accuracy metrics

Fig. 8
figure 8

ML framework for MCS prediction: flow chart of data processing–training, validation, and testing of ML methods

Figure 8 presents the proposed ML framework for MCS prediction. Specifically, the data processing procedure, as well as the training, validation, and testing phases of the ML methods, are separated into three basic steps. The first step concerns the data processing procedure, which involves preparing, cleaning, and organizing the raw data to make it suitable for building and training the different ML models. The second step concerns the training and the validation procedure of the ML methods. This step focuses on evaluating the machine learning models based on hyper-parameter tuning, i.e., by choosing a set of optimal hyper-parameters for a specific learning algorithm. In the final step, the ML methods developed during the previous steps are evaluated in terms of different performance metrics, such as accuracy, precision, recall, and F1-score, utilizing the testing data set. All three steps are further analyzed in the following subsections.

5.1 Data pre-processing

The performance of the ML-based models strongly depends on the amount and quality of training data. Knowledge discovery during the training process is more challenging if there is irrelevant and redundant information or noisy and unreliable data. Therefore, no matter which classifier is applied, poor models are produced if the training data are incorrect. Considering the previous statement, various data pre-processing methods are utilized, so that the ML models achieve the best possible performance [44].

5.1.1 Missing values management

Commonly, in a raw data set used to train machine learning models, there may be instances where one or more features are not determined. These cases are identified as cases with missing values. Handling missing values is a vital part of pre-processing the data set, as ML algorithms do not support missing values and therefore cannot be trained. In the case of the data set used for training the considered machine learning algorithms, there are instances where some features were not identified during the simulation. These instances in the raw data set are recognized as samples where the simulator does not determine some of the network performance metrics (features). For example, in areas where buildings are located, it is not possible to calculate some or all of the performance metrics, as the intended user at that point experiences an outage. Consequently, instances that have at least one unknown feature are ignored and removed from the data set.

5.1.2 Instance selection

The selection of stratified samples to represent the characteristics of the overall data set is one of the most common and most challenging issues in any Big Data system. As an essential data pre-processing step, instance selection is employed not only to handle noise and missing values but also to cope with the infeasibility of learning from huge data sets. There is a variety of procedures for sampling instances from a large data set, the most well-known of which is stratified sampling [45]. Through this technique, the overall training set is reduced, and the class values are uniformly distributed in the training sets, as shown in Fig. 9. Overall, after eliminating missing values and removing redundant instances per class value, 6706 data samples were retained, which corresponds to a 49% reduction of the initial raw data set. It is noteworthy that a generalized and balanced training data set is formed through the stratified sampling method, thus reducing the probability of overfitting the trained machine learning models.

Fig. 9
figure 9

Distribution of data set per class after stratified sampling
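One possible realization of this balancing step is to down-sample every class to the size of the smallest class, e.g., with pandas; the data frame and column names below are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data frame (rows with missing features already dropped)
# with a placeholder label column 'mcs_class'.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.random((13675, 5)),
                  columns=['SINR', 'F', 'Ht', 'V', 'D'])
df['mcs_class'] = rng.integers(0, 13, size=len(df))

# Down-sample every class to the size of the smallest class so that the
# retained samples are (approximately) uniformly distributed over the classes.
n_per_class = df['mcs_class'].value_counts().min()
balanced = df.groupby('mcs_class').sample(n=n_per_class, random_state=0)
print(balanced['mcs_class'].value_counts())
```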

5.1.3 Data normalization

Machine learning models, such as kNN and ANN, cannot achieve the best possible performance if the values of the features are in different units and scales. For example, consider the kNN model, which calculates the distance between the instances of the training set and the sample under prediction to find the k nearest neighbors. Assume that we try to train this model with a data set that contains small-scale and large-scale features. In this case, the contribution of the large-scale features to the distance calculation is much more significant than the contribution of the small-scale features, and as a result, we obtain poor predictions. Therefore, to overcome these issues, a normalization method needs to be applied to avoid feature values of different units and scales. Using this method, the values of the features in the dataset are scaled into a specific range, maintaining the general distribution and ratios of the initial dataset. To this end, all input attributes were normalized before the training process. The normalization formula is given as follows [46]:

$$\begin{aligned} X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}} \end{aligned}$$
(26)

where X is a value of the corresponding feature under normalization, \(X_{max }\) and \(X_{min }\) are the maximum and the minimum values of this feature, respectively, and \(X_{norm } \in \left[ 0,1\right] \) is the final normalized value [47].
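In practice, (26) can be applied with a min–max scaler fitted on the training subset only, so that the same \(X_{min}\) and \(X_{max}\) are reused for the validation and test subsets; the arrays below are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder feature matrices; in practice these are the training and
# test subsets of the selected physical-layer features.
rng = np.random.default_rng(seed=0)
X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

# Fit X_min / X_max on the training subset only and reuse them for the other
# subsets, so that (26) maps every feature consistently into [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)
```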

5.1.4 Feature selection

The goal of feature selection is to select the optimal subset with the least number of features that contribute most to learning accuracy. The advantages of reducing the features to a subset of them are well described in the literature [48], and affect many aspects of an ML experiment, such as the speed of training, the accuracy, and the explainability of a model. The purpose at this stage is to recognize the highly correlated features and eliminate the redundant ones. Figure 10 presents the correlation matrix of the given data set calculated with Pearson’s correlation criterion. The last column of Pearson’s correlation matrix concerns the correlation of the attributes with the class, where it can be observed that the class has a strong correlation (above 0.78) with the SINR, RSSI, RSRQ, and RSRP features. At the same time, these features have a strong correlation among themselves (above 0.78). Due to the high correlation between those features, some are considered redundant, so it is necessary to retain the features with the highest correlation coefficient with the target class. As observed, both SINR and RSSI have the highest correlation. Still, because these two variables are entirely linearly dependent on each other, the feature RSSI is considered redundant, as it is included in the calculation of the SINR. From the remaining variables, it is observed that the distance between the terminals and the height of the RRH have the highest negative correlations with the MCS class, \(-\)0.44 and \(-\)0.35, respectively, which means that the higher the value of these two features, the lower the MCS rank. The frequency and the path visibility are weakly correlated with the other attributes, while at the same time the degree of correlation between these two attributes and the MCS class is moderate.

Consequently, models may be trained utilizing just five input variables, and the input vector in (2) is finally considered as follows:

$$\begin{aligned} {\textbf {x}}=[SINR, F, H_{t}, V, D] \end{aligned}$$
(27)

Table 4 shows the contribution of each parameter used in the models based on the IG criterion, as expressed in (18). The Information Gain criterion calculates the weight of each attribute with respect to the class attribute using the information gain; the higher the weight of an attribute, the more relevant it is considered. The SINR has the greatest impact, followed by the frequency, the Tx altitude, the path visibility, and the propagation distance. Therefore, our approach combines both the Pearson method and the IG criterion, with the intention of providing a comprehensive evaluation of feature significance within our dataset.
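A sketch of this two-fold ranking, computing Pearson correlations with pandas and approximating the IG ranking through mutual information with the class (scikit-learn's mutual_info_classif); the feature frame is a synthetic placeholder, so the printed rankings are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic placeholder for the eight candidate features of Table 2.
rng = np.random.default_rng(seed=0)
cols = ['RSRP', 'RSRQ', 'RSSI', 'SINR', 'D', 'Ht', 'V', 'F']
df = pd.DataFrame(rng.random((6706, len(cols))), columns=cols)
y = rng.integers(0, 13, size=len(df))

# Pearson correlation of every feature with the MCS class.
pearson_with_class = df.assign(mcs=y).corr(method='pearson')['mcs'].drop('mcs')

# Information-gain style ranking approximated by mutual information.
ig_scores = pd.Series(mutual_info_classif(df, y, random_state=0), index=cols)

print(pearson_with_class.sort_values(ascending=False))
print(ig_scores.sort_values(ascending=False))
```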

Fig. 10
figure 10

The correlation matrix of features calculated with Pearson’s correlation

Table 4 Importance of different features based on Information Gain criteria

5.1.5 Data shuffle

The raw data is generated in a specific order through the simulation and data collection processes, as presented in Sect. 4.2. In particular, the first measurement of the dataset concerns the point in the lower-left corner, while the last measurement regards the upper-right corner, as shown in Fig. 6. Therefore, consecutive measurements are related to specific geographical areas with different propagation impairments, e.g., reflection and diffraction. Since the original data set should be split into three subsets, i.e., training, validation, and testing, the instances of the original data set should be shuffled so that each generated subset comprises measurements from all sub-areas of the region of interest. Data shuffling prevents unwanted bias and keeps the model from learning the collection order during the training phase of the ML methods.

5.2 Hyper-parameter optimization and model training

Over-fitting and under-fitting are common problems in machine learning that result in poor generalization of the trained model to unknown input. In order to avoid this, the original data set was divided into training, validation, and testing subsets using the data splitting approach. On one hand, during the training phase of the ML models utilizing the training subset, the class that should be predicted is known to the model for learning purposes. On the other hand, during the validation and testing phases, the ML models are not aware of the class under prediction, and thus fine-tuning is achieved, reducing the probability of over-fitting. A common practice regarding data partitioning is to utilize 70 to 80% of the entire data set for training, while the remaining percentage should be employed to improve and evaluate the trained ML models. Consequently, 70% of the total samples are selected for the training phase, 15% for validation, and the remaining 15% for testing [49]. To preserve diversity and ensure that all possible patterns are distributed across the different subsets, samples are randomly selected and distributed to each subset from the initial dataset. Finally, it should be outlined that obtaining the optimal combination of hyperparameters of the ML models is indispensable for enhancing their performance by minimizing the error metrics and thus providing accurate predictions. The optimal combination of hyperparameters for the considered ML models is obtained using the grid search method [50].
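Because a fixed validation subset is used instead of cross-validation folds, the grid search can be wired to that subset through a PredefinedSplit, as sketched below for the RF parameter ranges of Sect. 5.2.3; the arrays are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Placeholder training (70%) and validation (15%) subsets.
rng = np.random.default_rng(seed=0)
X_train, y_train = rng.random((700, 5)), rng.integers(0, 13, size=700)
X_val, y_val = rng.random((150, 5)), rng.integers(0, 13, size=150)

# Mark training rows with -1 (never used for scoring) and validation rows
# with 0, so every hyper-parameter combination is scored on the fixed
# validation subset rather than on cross-validation folds.
X = np.vstack([X_train, X_val])
y = np.concatenate([y_train, y_val])
fold = np.concatenate([-np.ones(len(X_train)), np.zeros(len(X_val))])

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': list(range(5, 101, 5)),
                                'max_depth': list(range(1, 11))},
                    cv=PredefinedSplit(fold))
grid.fit(X, y)
print(grid.best_params_)
```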

5.2.1 ANN hyper-parameter tuning

To successfully train an ANN, it is essential to determine the layer type. Since in this work, a non-linear data set is used, we consider a fully connected multi-layer perceptron (MLP) network, where the input from the dataset propagates in one direction through one or more layers [35]. After that, the number of hidden layers and the number of neurons in each hidden layer must be determined. The number of neurons per hidden layer can be defined as:

$$\begin{aligned} m_{n} = \Bigg \lceil \dfrac{k_{in}+\sqrt{N_{t}}}{n}\Bigg \rceil , \end{aligned}$$
(28)

where \(\lceil \cdot \rceil \) denotes the standard ceiling function, \(k_{in}\) represents the number of input features in the input layer, \(N_{t}\) represents the total number of samples, and n indicates the number of hidden layers [49]. As presented in Sect. 5.1.4, to build the ML models for MCS prediction, we have considered \(k_{in} = 5\) input features and \(N_{t} = 6706\) samples in total. Therefore, for \(n = 1\) hidden layer, the number of neurons in the hidden layer equals 87, using (28). Additionally, we model ANNs with \(n=2,3,4\) hidden layers. For \(n \ge 2\), the sum of neurons over all hidden layers should equal the number of neurons in the case of one hidden layer \(\left( n=1\right) \), as calculated through (28). Hence, for \(n \ge 2\), our approach is to distribute these \(m_1\) neurons as equally as possible among the hidden layers. To achieve this, the number of neurons \(m_n\) per hidden layer is first calculated for a specific n, where \(n \le m_1\). Afterwards, we place \(m_n\) neurons in each hidden layer and calculate the remainder of neurons from \(m_{1}\) as \(r = m_1 - n m_n\). Subsequently, if \(r > 0\), we add one neuron to each of the first r hidden layers, where \(r \le n\). Otherwise, there is nothing to add, and thus the ANN consists of n hidden layers with \(m_n\) neurons per hidden layer. As an illustrative example, for \(n=2\), the number of neurons per hidden layer equals \(m_2=43\). Therefore, at first, we build an ANN with two hidden layers and 43 neurons per hidden layer. Then, we check whether there are remainder neurons, since the neurons of all hidden layers should sum to \(m_1\). In this case, there is \(r = 87 - 2 \cdot 43 = 1\) remaining neuron, which is added to the first hidden layer. So, the final ANN model with two hidden layers \(\left( n=2\right) \) has 44 neurons in the first hidden layer and 43 neurons in the second hidden layer. The next step in hyperparameter tuning is to specify the activation function type, which is related both to the task and to the input data. Considering our approach, the Rectified Linear Unit (ReLU) activation function is used in the hidden layers, because it is simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Since MCS prediction is modeled as a multi-class classification problem, the SoftMax activation function is used in the output layer. These two functions are expressed as in (8) and (9), respectively.
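The neuron-allocation rule described above (total budget from (28) for \(n=1\), divided as evenly as possible with the remainder assigned to the first hidden layers) can be written compactly as follows; the sketch follows the worked example in the text.

```python
import math

def neurons_per_layer(k_in, n_samples, n_hidden):
    """Distribute the neuron budget of Eq. (28) over n_hidden hidden layers.

    The total budget m1 (the single hidden-layer case) is split as evenly as
    possible, and any remainder neurons are added one-by-one starting from
    the first hidden layer, as in the worked example above.
    """
    m1 = math.ceil(k_in + math.sqrt(n_samples))      # Eq. (28) with n = 1
    base = m1 // n_hidden                            # neurons per hidden layer
    remainder = m1 - n_hidden * base                 # leftover neurons r
    return [base + 1 if i < remainder else base for i in range(n_hidden)]

# k_in = 5 features and N_t = 6706 samples, as in Sect. 5.1.4.
print(neurons_per_layer(5, 6706, 1))  # [87]
print(neurons_per_layer(5, 6706, 2))  # [44, 43]
print(neurons_per_layer(5, 6706, 3))  # [29, 29, 29]
print(neurons_per_layer(5, 6706, 4))  # [22, 22, 22, 21]
```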

After designing the DNN structure and before the training phase, an appropriate loss function should be defined. The SoftMax cross-entropy is the most widely used loss function for multiclass classification. In order to find the best ANN hyperparameters, the selected loss function should be minimized. The minimization of the loss function is achieved through Stochastic Gradient Descent (SGD), which is an iterative optimization algorithm. However, SGD exhibits high-variance oscillations and may not converge accurately. Therefore, the minimization problem should be solved by adding a momentum term [35], which navigates SGD along the relevant direction and softens the oscillations in irrelevant directions.

The last step of hyperparameter tuning is to select the learning rate, as well as the number of epochs, both of which are very important values. A low learning rate decelerates the training procedure, but it can also degrade the performance of the trained model. On the contrary, a high learning rate increases the prospect of building generalized ML models that can be used in various environments and conditions. Using the grid search approach, the learning rate is tested for values between 0.001 and 0.1 with a step equal to 0.001, whereas the momentum is tested for values between 0.2 and 1 with a step of 0.2. Furthermore, the early stopping criterion is used to improve the generalization ability of the model and avoid overfitting problems. The early stopping criterion uses the parameters patience and minimal score improvement to check whether the score has stopped improving, which leads to a stop. The patience defines the number of epochs for which the score needs to be considered constant and equals 20. The score is defined as the difference of the loss function (12) between two consecutive epochs, and its minimum improvement is set to zero. Therefore, the training phase stops if the score has not improved for 20 epochs. Furthermore, it is vital to evaluate the convergence of the analyzed ANNs during the training and validation phases; this assessment helps prevent significant overfitting. Finally, Table 5 presents a comprehensive compilation of the finalized hyperparameters for the ANNs, derived during both the training and validation phases. Furthermore, Table 6 provides the training time (\(t_{train}\) in seconds), the number of epochs, and the minimum loss score, offering a holistic view of the performance of each ANN method.
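The patience-based criterion described above can be sketched as a generic training loop; the train_one_epoch callable is a hypothetical hook returning the validation loss of one epoch and does not correspond to any specific library.

```python
import itertools

def train_with_early_stopping(train_one_epoch, patience=20, min_delta=0.0,
                              max_epochs=1000):
    """Generic patience-based early stopping for an iterative learner."""
    best_loss = float('inf')
    stagnant_epochs = 0
    for epoch in range(max_epochs):
        loss = train_one_epoch()                # run one epoch, get validation loss
        if best_loss - loss > min_delta:        # the score improved
            best_loss = loss
            stagnant_epochs = 0
        else:                                   # score considered constant
            stagnant_epochs += 1
        if stagnant_epochs >= patience:         # stop after 20 stagnant epochs
            break
    return best_loss, epoch

# Dummy loss sequence that stops improving after three epochs.
losses = itertools.chain([1.0, 0.5, 0.3], itertools.repeat(0.3))
print(train_with_early_stopping(lambda: next(losses)))
```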

Table 5 Chosen hyperparameters values for ANNs models
Table 6 Examined ANN layouts for the validation phase
Fig. 11
figure 11

Loss convergence progression versus iterations (epochs) for the training and validation phase of all the introduced ANNs

The assessment of the training and validation stages in terms of the loss function versus the number of iterations (epochs) is shown in Fig. 11. In essence, the number of epochs directly influences the convergence of the chosen approach. With too few epochs, the ANN may converge to a local minimum; nonetheless, too many epochs may lead to over-learning. The results in Fig. 11 concerning the modeled ANNs show that the loss function for both processes, i.e., training and validation, converges smoothly, obtaining constant loss values and reaching the global minimum in a short period. The obtained global minimum loss for the convergence during the validation phase, as well as the corresponding epoch values, are listed in Table 6.

5.2.2 SVM hyper-parameter tuning

The training of the SVM method is carried out by selecting the RBF kernel function given in (16). The hyperparameters that determine the SVM’s performance are \(\gamma \) and C. The \(\gamma \) parameter defines the degree to which a particular training example influences the creation of the decision boundary, and C is the penalty parameter of the error term concerning the misclassified instances. When \(\gamma \) has a small value, every training instance has a significant influence on the training procedure, while for high values of \(\gamma \) the impact is low. If the \(\gamma \) parameter has a high value, the training examples need to be very close to each other to be considered in the same class. Therefore, an SVM model with large \(\gamma \) values tends to overfit. However, if \(\gamma \) has a low value, then more instances are grouped in the same class. Thereby, an SVM model that is trained with a low \(\gamma \) value tends to underfit. Let us now focus on the effect of parameter C on the training procedure. On one hand, if C has a low value, the penalty for misclassified points is low, and thus a decision boundary with a large margin is chosen. On the other hand, large values of parameter C allow the constraints to be ignored, leading to decision boundaries with a small margin. Employing the RBF kernel, both the C and \(\gamma \) parameters need to be optimized simultaneously: if the \(\gamma \) value is large, the effect of C becomes negligible, whereas if \(\gamma \) is small, the impact of C becomes strong [51]. To address this problem, the grid search method is used. More specifically, \(\gamma \) is tested in the range of 0.0001–30 with a step of 0.01, and C takes values between 0.01 and 200 with a step of 10. Finally, the corresponding hyperparameters concerning the RBF kernel are selected to be C = 120 and \(\gamma = 17\), respectively. Figure 12 shows the effect of parameter C on the accuracy of the model for the training and validation subsets, respectively. A significant increase in accuracy is observed when C increases from 0.01 to 10; for these values, the accuracy rises from 44% to 95%. Subsequently, the accuracy rises smoothly until the parameter C reaches the value of 120, where the accuracy attains its maximum of 97.32%. For values of C greater than 120, the accuracy remains steady at 97%. Finally, the resulting training time for the SVM method is 1 s.

Fig. 12
figure 12

Accuracy evolution versus C for the training and validation phase in SVM method. The selected \(\gamma \) is 17

5.2.3 RF hyper-parameter tuning

In the case of an RF classifier, two main parameters affect the model’s performance and should be adjusted to obtain the optimal hyperparameter values. These parameters are the number of decision trees forming the forest and the depth of each tree, and they should be carefully selected to achieve the best possible performance. In practice, a limited number of high-depth trees is more susceptible to overfitting than a large number of low-depth trees [49]. Hence, an exhaustive search strategy to determine the depth and the number of decision trees is employed to ensure better performance of the random forest model. The tree depth is tested in the range of 1 to 10, whereas the number of trees is examined between 1 and 100 with a step of 5. Finally, the training phase lasts 0.7 s, and the grid search algorithm determines that the best tree depth is 5, whereas the best number of trees is 10.

Figure 13 presents the accuracy evolution as a function of the number of classification trees for the training and validation subsets, respectively. As can be observed, the training accuracy converges smoothly and reaches a stable value when the number of trees is \(j \ge 20\). However, the validation accuracy decreases when the number of trees exceeds 10. At this point, it is evident that when j exceeds 10 the model undergoes overfitting, while the corresponding accuracy is 87.74%. More specifically, as the number of DT-based learners increases, the training accuracy continually increases due to the additional mapping functions learned by the contributing members. In other words, the ensemble ML model has memorized the training data. However, the accuracy on the validation set decreases when the number of DTs exceeds 10. This reduction in validation accuracy serves as a crucial warning sign, indicating that the model’s learned knowledge has become overly specialized and fails to generalize to unseen data, a hallmark of overfitting.

Fig. 13 Accuracy evolution versus the number of tree-based learners for the training and validation phase in the RF method. The selected tree-depth is 10

5.2.4 Bagging k-nearest neighbors hyper-parameter tuning

The predictive behavior of B-kNN depends on the number of kNN base models and the number k of nearest neighbors used by each base predictor. More specifically, the value of k in the kNN algorithm is related to the model’s error rate: a small value of k can lead to overfitting, whereas a large value of k can lead to underfitting. Regarding the hyperparameter tuning, the number of base learners is tested in the range of 1 to 100 in steps of 5, whereas k is tested in the range of 1 to 10 with a step of 1. The best performance is obtained for \(j = 30\) bootstraps and \(k = 5\) nearest neighbors. Figure 14 presents the accuracy evolution as a function of the number of kNN base learners. It can be observed that the model gains no further benefit from additional base learners, as the accuracy on the validation set converges smoothly once the number of base learners reaches 30. At this point, the corresponding accuracy of the validation curve is about 88.65%. Finally, the training time is 0.5 s.
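For completeness, the B-kNN ensemble with the selected hyperparameters (30 bootstraps, k = 5) can be expressed as in the following sketch, assuming scikit-learn's bagging meta-estimator with kNN base predictors and synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 5-feature, 13-class MCS dataset (illustration only).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5, n_redundant=0,
                           n_classes=13, n_clusters_per_class=1, random_state=0)

# Bagging ensemble of kNN base predictors with the values reported in the text.
bknn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                         n_estimators=30, random_state=0)
bknn.fit(X, y)
print("training accuracy:", round(bknn.score(X, y), 4))
```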

Fig. 14 Accuracy evolution versus the number of the kNN-based learners for the training and validation phase in the B-kNN method. The selected k is 5

5.3 Performance metrics

To validate the performance of each ML model, it is essential to evaluate suitable performance metrics on the test set. There are various ways to validate the performance of each model; however, multi-class classification requires different metrics than those used in traditional regression problems. The accuracy, precision, recall, and F1-score metrics are commonly used to evaluate the performance of the developed ML classification models [52].

Accuracy is defined as the percentage of correctly classified instances among the total number of instances, and its expression is given as follows:

$$\begin{aligned} Accuracy=\frac{1}{\mid D \mid }\sum _{i=1}^{\mid D \mid }\frac{\mid Y_{i}\bigcap \hat{Y_{i}}\mid }{\mid Y_{i}\bigcup \hat{Y_{i}}\mid }, \end{aligned}$$
(29)

where \(Y_{i}\) and \(\hat{Y_{i}}\) are the true and the predicted labels, respectively, for each instance \(d_{i} \in D \). The precision is defined as the percentage of instances that actually belong to class Y among all those classified as class Y, with \(Y=\{y_1 \dots y_t\}\), and is expressed as follows:

$$\begin{aligned} Precision=\frac{1}{\mid D \mid }\sum _{i=1}^{\mid D \mid }\frac{\mid Y_{i}\bigcap \hat{Y_{i}}\mid }{\mid \hat{Y_{i}}\mid } \end{aligned}$$
(30)

The recall is the percentage of members of class Y correctly classified as belonging to class Y and is defined as follows:

$$\begin{aligned} Recall=\frac{1}{\mid D \mid }\sum _{i=1}^{\mid D \mid }\frac{\mid Y_{i}\bigcap \hat{Y_{i}}\mid }{\mid Y_{i}\mid } \end{aligned}$$
(31)

The F1-score is the harmonic mean of precision and recall and is defined as follows:

$$\begin{aligned} {F1=2\times \frac{Precision \times Recall}{Precision+Recall}} \end{aligned}$$
(32)
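In practice, for the single-label multi-class predictions produced here, these metrics can be computed per MCS class and then averaged with standard library routines. The sketch below uses scikit-learn as an illustration; y_test and y_pred denote hypothetical true and predicted MCS indices and are filled with random placeholders only so that the snippet runs.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Hypothetical true and predicted MCS indices (random placeholders for illustration).
rng = np.random.default_rng(0)
y_test = rng.integers(0, 13, size=1000)
y_pred = rng.integers(0, 13, size=1000)

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred,
                                                   average="macro", zero_division=0)
print(f"accuracy={acc:.4f}  precision={prec:.4f}  recall={rec:.4f}  F1={f1:.4f}")
print(classification_report(y_test, y_pred, zero_division=0))  # per-MCS-class breakdown
```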

6 Results and discussion

In this section, a performance evaluation of the machine learning methods on the test set is conducted, revealing findings about their performance in a controlled context. Moreover, critical factors such as training and execution times are investigated, highlighting the significance of computational efficiency.

6.1 Performance evaluation of ML methods on the test set

This section presents the evaluation results obtained from the ML methods on the testing set. The evaluation of the ML methods is based on the performance metrics calculated using (29)–(32). The entire training, validation, and testing procedures were carried out using RapidMiner\({}^{\text {TM}}\) [53], a fully transparent, end-to-end data science platform. Moreover, the experimental evaluation was performed on a computer running the Windows 10 64-bit operating system with an Intel Core i7-8700 CPU and 8 GB of RAM. The performance of the ML methods in terms of precision, recall, and F1-score for each MCS class is listed in Tables 7, 8, 9, 10, 11, 12, 13. Additionally, the accuracy of each ML model is depicted in Fig. 15, while Fig. 16 illustrates the mean precision, recall, and F1-score obtained from each ML method.

The precision, recall, and F1-score per MCS class using the ANN with one hidden layer (\(\text {ANN}_{5-87-13}\)) are listed in Table 7. The results demonstrate that for all three metrics there is no MCS class with a value below 0.91, which means that at least 91% of the predictions for any MCS configuration were accurate. The average precision and recall of the ANN with one hidden layer reached 96.99% and 96.91%, respectively, as illustrated in Fig. 16, demonstrating its high performance. Furthermore, the mean F1-score reached 96.88%, indicating that the harmonic mean of the model’s precision and recall is very high. Moreover, Fig. 15 shows that the ANN with one hidden layer achieves 96.91% accuracy. Such a high degree of accuracy on a balanced dataset indicates that the model has identified strong relationships between features and classes and has avoided overfitting issues.

The precision, recall, and F1-score per MCS class obtained from the ANN algorithm with two hidden layers (\(\text {ANN}_{5-44-43-13}\)) are listed in Table 8. The precision of the model ranges from 0.94 to 1, the recall from 0.95 to 1, and the F1-score from 0.96 to 1. It is observed that the precision and recall metrics are balanced with each other, as all values of the F1-score metric are over 96% with an average value of 98.69%. The corresponding mean performance metrics in Fig. 16 reveal that the best prediction result among all the examined machine learning methods is achieved by the ANN with two hidden layers. The capability of this model is also evident in Fig. 15, where the classification accuracy of the ANN with two hidden layers reaches 98.71%, which is a remarkable performance. This exemplary performance can be attributed to the neural network’s configuration with two hidden layers, which excels in approximating nonlinear functions and making accurate predictions of MCS class values.

Additionally, the prediction accuracy of the neural network with three hidden layers (\(\text {ANN}_{5-29-29-29-13}\)) is illustrated in Fig. 15. This model yields an accuracy of 98.11%, which is close to that of the neural network with two hidden layers (\(\text {ANN}_{5-44-43-13}\)). Moreover, the precision, recall, and F1-score per MCS class obtained from this model are listed in Table 9. The precision of the model ranges from 0.93 to 1, the recall from 0.92 to 1, and the F1-score from 0.94 to 1, proving the high efficiency of the algorithm. Furthermore, Fig. 16 shows that its prediction metrics are better than those of the single-hidden-layer ANN, yet the layout with two hidden layers outperforms this model.

The precision, recall, and F1-score per MCS class using the ANN with four hidden layers (\(\text {ANN}_{5-22-22-22-21-13}\)) are listed in Table 10. The precision of the model ranges from 0.91 to 1, the recall from 0.93 to 1, and the F1-score from 0.95 to 0.99. This model yields an accuracy of 97.30%, a mean precision of 97.35%, a mean recall of 97.30%, and a mean F1-score of 97.28%. As observed in Fig. 16, this model outperforms the network with one hidden layer, while it is outperformed by the ANNs with two and three hidden layers, respectively.

Comparing the performances of the ANN methods, the prediction accuracy rises gradually until the depth of the neural network reaches two hidden layers; increasing the depth beyond two hidden layers reduces the accuracy. More specifically, the prediction accuracy rises gradually from 96.91% for a single hidden layer (\(\text {ANN}_{5-87-13}\)) up to 98.71% for the two-hidden-layer network (\(\text {ANN}_{5-44-43-13}\)), and drops to 97.30% for the four-hidden-layer layout (\(\text {ANN}_{5-22-22-22-21-13}\)). In general, the assessed ANN models exhibit a remarkable performance with an F1-score in the range of 96.88\(-\)98.69%, preserving an average precision and average recall greater than 96%, as shown in Tables 7, 8, 9, 10. Moreover, the high accuracy achieved by the ANN models on a balanced dataset suggests that the models have effectively identified strong relationships between features and classes while avoiding overfitting, further corroborating their impressive performance. Among the evaluated ANNs, the neural network with two hidden layers (\(\text {ANN}_{5-44-43-13}\)) achieves the best prediction result, yielding an accuracy of 98.71%, a mean precision of 98.72%, a mean recall of 98.69%, and an average F1-score of 98.72%.
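For reference, the four hidden-layer layouts compared above can be written down as in the sketch below, assuming scikit-learn MLP classifiers and synthetic stand-in data; the models reported here were built in the RapidMiner platform, so this is only an indicative mapping of the topologies, not the actual experimental code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 5-feature, 13-class MCS dataset (illustration only).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5, n_redundant=0,
                           n_classes=13, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The four assessed layouts; the 5 inputs and 13 MCS outputs are implicit.
topologies = {
    "ANN_5-87-13":          (87,),
    "ANN_5-44-43-13":       (44, 43),
    "ANN_5-29-29-29-13":    (29, 29, 29),
    "ANN_5-22-22-22-21-13": (22, 22, 22, 21),
}
for name, layout in topologies.items():
    model = MLPClassifier(hidden_layer_sizes=layout, max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 4))
```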

Fig. 15 Accuracy comparison between the ML methods

The precision, recall, and F1-score per MCS class using the RF method are listed in Table 11. The precision of the RF method for each class ranges from 63.26 to 100%, with an average value of 91.70%. Also, the average recall is 88.96%. It is worth noting that the minimum recall value is 0.03, observed for the MCS class with index two. This low recall value indicates that the instances belonging to this class are almost always classified incorrectly by the RF model. Incorrect predictions by the RF model can impose a significant load on the network. For example, if a higher MCS configuration than the channel can support is assigned, retransmissions will increase because users will not demodulate their signal correctly. In contrast, the average network data rate decreases if the scheduler assigns lower MCS configurations than necessary. For this reason, erroneous MCS assignments should be avoided, and forecasting models should be as accurate as possible. Additionally, the model’s accuracy reached 87.74%, while the mean F1-score equals 86.23%, with minimum and maximum values of 0.06 and 100%, respectively. The low value of the F1-score metric is due to its strong dependence on recall, as defined in (32).

Fig. 16 F1-score, precision, and recall performance measurements of ML methods

Table 7 Precision, recall and F1-score per MCS class for ANN algorithm with 1 hidden layer
Table 8 Precision, recall and F1-score per MCS class for ANN algorithm with 2 hidden layers
Table 9 Precision, recall and F1-score per MCS class for ANN algorithm with 3 hidden layers
Table 10 Precision, recall and F1-score per MCS class for ANN algorithm with 4 hidden layers
Table 11 Precision, recall and F1-score per MCS class for RF algorithm
Table 12 Precision, recall and F1-score per MCS class for B-kNN algorithm
Table 13 Precision, recall and F1-score per MCS class for SVM algorithm

The classification accuracy of B-kNN reached 88.65%, as presented in Fig. 15; thus, its performance is slightly better than that of the RF classifier. Nevertheless, B-kNN performs worse than the ANN and SVM models. B-kNN achieved an average precision of 88.90%, while its average recall and F1-score reached 88.02% and 87.86%, respectively, as illustrated in Fig. 16. The precision, recall, and F1-score per MCS class using the B-kNN method are listed in Table 12. The precision of the B-kNN method for each class ranges from 71 to 100%, with an average value of 88.90%. Also, the recall for each class ranges from 66 to 88.02%. The lackluster performance of B-kNN can be attributed to its use of the entire feature space during training, which makes it challenging to effectively reduce bias. Furthermore, B-kNN’s inability to outperform the ANNs is primarily due to the limited pattern-capturing capacity of k-nearest neighbors compared to the inherent deep learning capabilities of ANNs.

The results of the examined OAA SVM algorithm per MCS class are listed in Table 13. Very high values are observed for all performance metrics, which means that the model did not over-adapt to the training set and avoided overfitting and underfitting issues. This is due to the hyperplane selected in the previous step using the grid search approach. More precisely, the chosen hyperplane maximizes the minimum distance (margin) between the hyperplane and the closest training points, making the SVM less prone to overfitting. The precision of the model ranges from 0.9 to 1, the recall from 0.94 to 1, and the F1-score from 0.94 to 1. It is observed that the precision and recall are balanced with each other, as all values of the F1-score metric are over 94% with an average value of 97.02%. In addition, the average precision and recall metrics are 97.12% and 97.04%, respectively, as presented in Fig. 16. Indeed, the OAA SVM can achieve results comparable to the ANNs due to its ability to effectively handle complex data patterns and high-dimensional feature spaces. The OAA SVM excels at finding optimal hyperplane boundaries that separate the different MCS classes, which allows it to capture intricate relationships within the dataset. Finally, the specific model yields an accuracy of 98.71%, which is a remarkable performance and indicates that the OAA SVM method with the RBF kernel is a candidate MCS prediction model.
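The one-against-all decomposition discussed here can be sketched by wrapping an RBF-kernel SVM, with the hyperparameters selected by the earlier grid search (C = 120, \(\gamma = 17\)), in a one-vs-rest meta-classifier. This is an illustrative scikit-learn formulation on synthetic stand-in data, not the exact pipeline used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 5-feature, 13-class MCS dataset (illustration only).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5, n_redundant=0,
                           n_classes=13, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One-against-all: one RBF-kernel SVM per MCS class, with the tuned hyperparameters.
oaa_svm = OneVsRestClassifier(SVC(kernel="rbf", C=120, gamma=17))
oaa_svm.fit(X_train, y_train)
print("test accuracy:", round(oaa_svm.score(X_test, y_test), 4))
```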

The above evaluations and results show that the worst performance in terms of error metrics is exhibited by the RF method; thus, it is not recommended for reliable MCS predictions in 5G networks. The same holds for the B-kNN method, since its performance metrics do not offer sufficient accuracy. On the other hand, the SVM method with the RBF kernel demonstrates a comparable performance and could be adopted as an alternative option for adaptive MCS operations, preserving high accuracy values and acceptably low error rates. Likewise, the ANN methods with one, three, and four hidden layers, which show comparable performance, could be adopted as alternative MCS prediction functions while maintaining high accuracy values and acceptably low error rates. However, as mentioned previously, the best performance among all the assessed models is provided by the neural network consisting of two hidden layers with 44 and 43 neurons, respectively. Indeed, all its performance metrics are exceptionally high, with mean metric values above 97%.

6.2 Training and execution times

Other essential performance indicators are the training and execution times of each ML method. The training time refers to the time needed to train the model on the dataset. The execution time is the total time required to build the ML framework for MCS prediction; more specifically, it includes the time for data collection and preprocessing as well as the training, validation, and testing of the ML methods, as depicted in Fig. 8. Both the training and execution times for all the examined ML methods are shown in Fig. 17. As can be observed, the time required to train the ANN model with one hidden layer (\(\text {ANN}_{5-87-13}\)) is 28 s, and the execution time to build and assess the whole model is 33 s. For the ANN method with two hidden layers (\(\text {ANN}_{5-44-43-13}\)), the training and execution times increase to 33 and 39 s, respectively. Likewise, the training and execution times of the ANN models with three (\(\text {ANN}_{5-29-29-29-13}\)) and four hidden layers (\(\text {ANN}_{5-22-22-22-21-13}\)) are higher than those of the ANN models with one and two hidden layers. More specifically, the ANN model with three hidden layers took 46 s for training and 51 s to execute the entire MCS prediction framework, while the training and execution times increase to 50 and 56 s for the ANN method with four hidden layers. As expected, increasing the number of hidden layers in an ANN model lengthens the training period due to the increased network capacity. Additional layers amplify the model’s ability to learn diverse mapping functions, making depth a parameter-efficient means of enhancing model capacity. Consequently, the observed increase in time with extra hidden layers aligns with the hierarchical structure of neural networks and the computational interdependency among neurons in different layers. Specifically, as the number of hidden layers increases, the neural network’s capacity and complexity grow, requiring a more extensive set of weight parameters to be adjusted during training, which in turn results in longer training and execution times due to the heightened computational load.
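The distinction between the two indicators can be made explicit with simple wall-clock timing, as in the sketch below; the model, the synthetic data, and the split are all hypothetical placeholders, so the numbers it prints are not comparable to those in Fig. 17.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

t0 = time.perf_counter()                          # start of the whole pipeline
# Data "collection" and preprocessing stand-in: synthetic 5-feature, 13-class data.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=5, n_redundant=0,
                           n_classes=13, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

t1 = time.perf_counter()
model = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
model.fit(X_train, y_train)                       # training time covers only this call
t2 = time.perf_counter()
model.score(X_test, y_test)                       # validation/testing
t3 = time.perf_counter()

print(f"training time:  {t2 - t1:.2f} s")
print(f"execution time: {t3 - t0:.2f} s")         # preprocessing + training + evaluation
```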

Fig. 17 Training and execution time (s) of the different ML methods

Furthermore, the SVM and RF models required less time for training, 1.0 and 0.7 s, respectively, compared to the ANNs. Concerning the execution time of these two methods, the SVM completed all processes in 3.5 s, while the RF method was slightly faster, executing in 3.0 s. The B-kNN method was trained on the dataset faster than the other assessed ML methods, in 0.5 s. However, its execution time is considerably higher and comparable to that of the ANNs. In particular, the execution time of the B-kNN method is disproportionate to its training time, since B-kNN consists of lazy base learners: each kNN learner does not learn a discriminative function from the training data but memorizes the whole training dataset. Consequently, its execution time indicates that B-kNN consumes more computation time in the testing phase than in the training phase, since each kNN learner must find the k samples closest to every instance under prediction.

7 Conclusions

This paper assesses various ML methods for their suitability in accurately predicting the MCS in OFDM systems. To this end, ANN, SVM, RF, and B-kNN models were evaluated. The training, validation, and testing procedures were conducted using an MCS dataset derived from simulation results that considered a non-standalone 5G network overlaid on an existing 4G LTE network. The numerical results revealed that the customized deep neural network model with two hidden layers outperformed all other examined ML methods, achieving the highest average accuracy of 98.71%. In contrast, the RF method exhibited the lowest average accuracy of 87.74%. The specific DNN model with two hidden layers is therefore recommended and can be effectively utilized for predicting the MCS in software platforms, incorporating digital terrain information as input attributes. Future research should aim to refine machine learning methods for predicting the MCS in OFDM systems, extending their applicability beyond 5G and into the evolving landscape of wireless communication technologies. Building upon the success of the DNN model with two hidden layers, it is worth exploring additional model architectures, hyperparameters, and ensemble approaches to bolster predictive accuracy, ensuring these methods are tailored to the unique challenges and requirements of advanced wireless systems, including 6G and beyond. Expanding the dataset to include real-world data and considering the specific characteristics of these next-generation networks will also be essential. Such research endeavors have the potential to make significant contributions to optimizing MCS prediction, thereby facilitating the evolution of wireless communication systems beyond 5G.