1 Introduction

Performance prediction in wireless mobile networks is essential for network optimization and management [8], application offloading decisions [10], and the deployment of unmanned aerial vehicles (UAVs), also known as flying base stations [3], to name a few. Performance prediction in mobile communication can be approached from different angles, from low-level channel performance [11, 12] to mobile application/device throughput [6, 8, 10]. In this study, we focus on predicting and evaluating the application throughput of mobile devices.

In the mobile communication setting, the position of a mobile device is crucial for estimating its performance. Even for a single mobile device, the measured performance may fluctuate considerably depending on its location (e.g., due to the density of devices, signal strength, and interference/reflection). In this study, we investigate mobile communication performance based on the coordinate information of mobile devices. Using a machine learning (ML) approach, we analyze a recent 5G data collection [7], which contains a set of features including the GPS coordinates, velocity, and application throughput of mobile devices.

As location information is key to performance prediction, the basic assumption behind any such prediction is that the given coordinate information of the devices is correct. However, a malfunctioning location chip (e.g., one failing to receive GPS signals) may result in an unacceptably erroneous estimation, although this is rare. A more common scenario is intentional location spoofing; that is, a location spoofing attack that falsifies the position information with malicious intent, which is one of the greatest security concerns in mobile communication networks [4, 9]. Given its criticality, this paper investigates the impact of position falsification on the presented ML-based performance predictor.

While this paper presents our initial experimental results and observations, it makes several non-trivial contributions to the research community. First, this paper examines the feasibility of location-based performance prediction. An interesting observation is that application throughput can be estimated with about 80% accuracy (measured in F1 score) using a small set of features readily available when establishing the communication channel. Second, the impact of location-spoofing attacks on performance prediction is evaluated, with the intuition that location-based performance prediction would be vulnerable to such threats. The experimental result shows a significant degradation of the prediction quality, signaling the need for effective defense mechanisms against location-spoofing attacks to enable reliable estimation.

The organization of this paper is as follows. We first introduce the 5G dataset employed for performance prediction in Sect. 2, with exploratory data analysis. In Sect. 3, location-based performance prediction is discussed with our initial experimental results for binary and multi-class classifications. Section 4 shows the impact of location spoofing attacks on performance prediction using two types of position falsification techniques (constant and constant-offset spoofing). Section 5 provides a summary of closely related studies, and we conclude our presentation with future research directions in Sect. 6.

2 Exploratory Analysis of 5G Dataset

This study employs a recent 5G dataset collected from an Irish mobile operator network [7]. The data was collected using different file access applications, including file transfer and video streaming. The throughput of these applications was measured at different locations and under different mobility options (stationary or driving), along with other channel and context information. The dataset defines 26 features in total. Table 1 lists the features used in our performance prediction study.

Table 1. Selected features defined in the 5G dataset
Fig. 1. Throughput map based on the measurement, showing the feasibility of estimating throughput based on the coordinate of the mobile device. (Color figure online)

The raw dataset contains roughly 189K samples. From the original dataset, we remove data instances meeting any of the following conditions: (i) DL_bitrate=0, (ii) State=Idle, and (iii) any feature contains a null value. Note that the State feature defines the state of the download process, that is, whether it is downloading or idle (i.e., not downloading). After this removal process, the pre-processed dataset contains 81,859 instances in total.
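The three removal conditions above can be sketched as a simple row filter. This is a minimal illustration rather than the pipeline used in the study: it assumes each record is a dictionary keyed by feature name, that null values appear as `None` or an empty string, and that the idle state is literally encoded as `"Idle"`. The helper name `preprocess` and the toy records are hypothetical.

```python
def preprocess(rows):
    """Keep only complete rows that are actively downloading with non-zero bitrate."""
    kept = []
    for row in rows:
        # (iii) drop rows with any null value (None / "" as null markers is an assumption)
        if any(value is None or value == "" for value in row.values()):
            continue
        # (i) drop rows whose downlink bitrate is zero
        if float(row["DL_bitrate"]) == 0:
            continue
        # (ii) drop rows recorded while the download process was idle
        if row["State"] == "Idle":
            continue
        kept.append(row)
    return kept


# Toy records (hypothetical values, not taken from the dataset)
sample = [
    {"DL_bitrate": "1200", "State": "Downloading", "CQI": "9"},
    {"DL_bitrate": "0", "State": "Downloading", "CQI": "7"},
    {"DL_bitrate": "300", "State": "Idle", "CQI": "8"},
    {"DL_bitrate": "450", "State": "Downloading", "CQI": None},
]
clean = preprocess(sample)
```

Only the first toy record survives all three checks; the null check runs first so that the bitrate conversion never sees a missing value.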

Fig. 2. Measured throughput based on CQI values (from 1 to 15)

Fig. 3. Correlation matrix of the selected features (positive correlation \(\rightarrow \) +1, negative correlation \(\rightarrow \) −1, less correlation \(\rightarrow \) 0)

Fig. 4. Feature importance (compiled by random forest): None of the features is dominant in predicting throughput (DL_bitrate).

We carried out initial explorations to understand potential correlations between the features and the throughput feature (Tput). Figure 1 shows the throughput information on the coordinate space. The figure shows four different throughput ranges: (i) Tput < 100 Kbps, (ii) 100 Kbps \(\le \) Tput < 1 Mbps, (iii) 1 Mbps \(\le \) Tput < 10 Mbps, and (iv) Tput \(\ge \) 10 Mbps. From the figure, we can see that the location information would be helpful for estimating throughput. While some spots (colored in red or orange) show relatively high throughput, the rest (in blue or dark blue) show quite low bit rates. The figure also reveals some clusters with higher throughput.
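The four throughput ranges above can be expressed as a small bucketing function. This is a sketch under the assumptions that throughput is given in Kbps and that 1 Mbps equals 1,000 Kbps; the function name and the integer range indices are illustrative and do not come from the dataset.

```python
def throughput_range(tput_kbps):
    """Map a throughput value in Kbps to one of the four ranges (i)-(iv), indexed 0-3."""
    if tput_kbps < 100:        # (i)   Tput < 100 Kbps
        return 0
    if tput_kbps < 1_000:      # (ii)  100 Kbps <= Tput < 1 Mbps
        return 1
    if tput_kbps < 10_000:     # (iii) 1 Mbps <= Tput < 10 Mbps
        return 2
    return 3                   # (iv)  Tput >= 10 Mbps


# Boundary values fall into the upper range, matching the inequalities in the text.
ranges = [throughput_range(v) for v in (50, 100, 999, 1_000, 10_000)]
```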

The box plot in Fig. 2 shows the measured throughput over different CQI values. The CQI of a mobile device is feedback provided to the base station (eNB) that indicates the channel data rate. A previous study [7] reported a partially proportional pattern between CQI and throughput. Our experimental result does not clearly show such proportionality; rather, it shows a different throughput range for each CQI value.

To see how the features are correlated with each other, Fig. 3 provides a correlation matrix. We can see that RSRP is strongly correlated with RSSI, while RSRP is also somewhat correlated with SNR. Additionally, RSRQ shows a high degree of correlation with SNR. For the throughput feature (DL_bitrate), none of the features shows a strong correlation. In addition, Fig. 4 shows the importance of the features in determining throughput, compiled using a random forest classifier (described in Sect. 3). While RSSI is the most important feature, the result shows that no single feature dominates in predicting throughput. In the next section, we examine the feasibility of throughput prediction using conventional ML methods.
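A correlation matrix of this kind can be computed pairwise over the feature columns. The sketch below assumes the Pearson coefficient, which matches the +1/−1/0 scale in Fig. 3 but is not named explicitly in the text; the helper names and the toy signal values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; 0.0 is a convention chosen here
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict {feature_a: {feature_b: r}} from a dict of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns (hypothetical values): RSSI rises with RSRP, bitrate moves against it
toy = {
    "RSRP": [-100.0, -95.0, -90.0, -85.0],
    "RSSI": [-80.0, -75.0, -70.0, -65.0],
    "DL_bitrate": [900.0, 700.0, 500.0, 300.0],
}
matrix = correlation_matrix(toy)
```

In the toy data, RSSI is an exact linear function of RSRP, so their coefficient is +1, while the bitrate column decreases linearly and gives −1.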

3 Performance Prediction

In this study, we reduce the performance prediction to a classification problem. We employ several conventional supervised learning methods for making the classification, as follows:

  • k-Nearest Neighbors (KNN) classifies data samples based on proximity information. To classify a given data point, the class label most frequently found among its k nearest neighbors is assigned to it (i.e., by majority vote).

  • Random Forest (RF) is a tree-based ensemble algorithm combining multiple decision trees. To make a final decision, the combining function incorporates the results produced by the individual trees, each trained in parallel on a randomly allocated subset of the data.

  • Extreme Gradient Boosting (XGB) is also a tree-based ensemble method, based on a gradient descent algorithm. XGB builds one tree at a time, whereas multiple decision trees are built independently in RF. This method iteratively minimizes a loss function, with each new tree correcting the errors observed in the previous iteration.
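As an illustration of the majority-vote idea behind KNN, the following minimal sketch classifies a query point from its k nearest training points. It assumes Euclidean distance, which the text does not specify, and it is not the implementation used in the experiments; `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points nearest to query."""
    # Rank training samples by Euclidean distance to the query (metric is an assumption)
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    # Majority vote over the labels of the k closest samples
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D coordinates with two well-separated classes (hypothetical values)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
near_low = knn_predict(points, labels, (0.5, 0.5), k=3)
near_high = knn_predict(points, labels, (5.5, 5.5), k=3)
```

With an odd k and two classes, the vote cannot tie, which is one reason odd values of k are commonly chosen for binary problems.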

The classifier takes a set of features as input and produces the predicted class as output. In this study, we set up three different feature sets to evaluate their impact on the classification performance, as described in Table 2. The performance prediction is primarily based on the position information. For Set-1, it is reasonable to assume that the velocity information is available when issuing the prediction request, whereas the other features defined in Table 1 might not be available before the actual communication takes place. Figure 2 shows the correlation between CQI and throughput (although not strong), and Set-2 adds the CQI value to the basic Set-1 features. Lastly, Set-3 refers to the entire feature set defined in Table 1 except State and Tput.

Table 2. Evaluated feature sets for performance prediction

For the evaluation, we partition the dataset into two disjoint sets for training (70%) and testing (30%). To report classification performance, we consider two standard measures, Accuracy and F1-score: Accuracy is the fraction of correctly classified samples, while F1-score is the harmonic mean of precision and recall and remains informative under an unbalanced class distribution (i.e., majority vs. minority classes). To account for class imbalance in the evaluation settings, we use the F1 score by default, unless otherwise mentioned.
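The evaluation protocol can be sketched with a 70/30 split and the two measures. The assumptions here are a random shuffle with a fixed seed (the text does not say how the partition was drawn) and the binary form of F1 for a single positive class; the multi-class averaging used in the study is not stated. All function names are illustrative.

```python
import random


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and split it into disjoint train/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle is an assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train, test = split_train_test(range(10))

# Imbalanced toy labels: accuracy looks high while F1 exposes the missed positives
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

In the toy example, accuracy is 0.8 while F1 is 0.5, which illustrates why F1 is the more cautious measure when classes are imbalanced.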

3.1 Binary Classification Performance

We first evaluate the binary classification performance. Two classes are defined as follows: low if Tput \(\le \) 1 Mbps, and high otherwise, which yields a balanced distribution of data instances. Class-0 (low) contains 39,730 samples (48.5%) and Class-1 (high) contains 42,129 samples (51.5%).

Fig. 5. Binary class classification performance: RF performs the best with Set-1, while XGB performs consistently over different feature sets.

Figure 5 shows the prediction performance in F1 score. The evaluation result shows that RF yields the best performance, while XGB performs consistently across the different feature sets. The KNN algorithm shows slightly lower performance than the other two schemes. Note that we set \(k=11\), which produced the best performance for KNN among k values between 1 and 100, while we simply use the default settings for RF and XGB (without intensive optimization).

An interesting observation is that using additional features does not improve the prediction performance. In fact, for all the classifiers, Set-1 performs better than or at least as well as the other feature sets. We conjecture that this is because none of the features added in Set-2 and Set-3 has a strong correlation with the throughput feature, as depicted in the correlation matrix in Fig. 3. The result here shows that the position information plays a significant role in estimating throughput. This is somewhat intuitive, since the application throughput of a mobile device may fluctuate considerably depending on its location for several reasons, such as the density of devices, signal strength, and interference/reflection.

It is important to note that the features in Set-1 are readily available when establishing actual communication channels, and that it is possible to estimate application performance (throughput) with 80% accuracy (more precisely, F1 score) using the RF predictor. In contrast, the additional features defined in Set-2 and Set-3 may not be available beforehand at connection set-up time.

Table 3. Class definition for multi-class prediction
Fig. 6. Multi-class prediction performance (RF): Defining a greater number of classes results in a significant degradation of the estimation performance.

3.2 Multi-class Prediction Performance

We also examine the performance predictors under multi-class classification settings. Table 3 shows the class definitions for the 3-class, 4-class, and 5-class classification settings.

Figure 6 shows the multi-class classification performance for RF. For comparison, the figure includes the binary classification result as well. As expected, defining a greater number of classes results in a significant degradation of the estimation performance. For 3-class classification, the performance goes down to 62% (from 80% for binary classification). As in the binary classification, the multi-class prediction result also shows that using Set-1 performs better than using the other feature sets.

The other two classifiers (KNN and XGB) show a similar pattern, with slightly lower performance than RF. Figure 7 shows the multi-class prediction result for the different classifiers when using Set-1. We can see that RF consistently shows the best performance, while XGB performs better than KNN.

Fig. 7. Multi-class prediction performance (Set-1): RF shows the best performance consistently, while XGB performs better than KNN.

4 Location Spoofing Attacks

We next investigate the impact of location-spoofing attacks on the coordinate-based performance prediction. Location-spoofing attacks are among the critical attacks in mobile communication environments. A widely used Vehicular Ad-hoc Networks (VANETs) dataset, VeReMi, assumes five different types of location spoofing attacks [2]: (i) Constant, transmitting a pre-defined coordinate; (ii) Constant offset, adding a pre-defined offset to the original coordinate; (iii) Random, transmitting a random coordinate; (iv) Random offset, providing a random coordinate in a predefined rectangle around the original coordinate; and (v) Eventual stop, transmitting the current coordinate without any change (although moving).

In this study, we evaluate the impact of spoofing attacks with constant spoofing and constant offset spoofing. The constant spoofing attack overwrites the location information with a constant value. We chose five random positions (within the coordinate space) to simulate the constant spoofing attack. The second scenario is the constant offset attack, in which a constant offset value is added to the original coordinate. For the constant offset attack, we use the notion of perturbation degree: in the coordinate space of the 5G dataset, it is straightforward to calculate the width of the latitude space (i.e., \(|x| = x_{max}-x_{min}\)) and the height of the longitude space (\(|y| = y_{max}-y_{min}\)). The constant offset for a perturbation degree p is defined as \(p \times (|x|, |y|)\). For the constant offset attack, we configure different perturbation degrees from 5% to 50% to define the offset.
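The two attack models can be sketched as simple transformations of a coordinate list. This is an illustrative reading of the definitions above, assuming coordinates are given as \((x, y)\) pairs and that the extents \(|x|\) and \(|y|\) are computed over the same list being perturbed; the function names and the toy coordinates are hypothetical.

```python
def constant_spoof(coords, fixed):
    """Constant attack: every reported coordinate is overwritten with one fixed position."""
    return [fixed for _ in coords]


def constant_offset_spoof(coords, p):
    """Constant offset attack: add p * (|x|, |y|) to every coordinate.

    |x| and |y| are the extents of the coordinate space, taken here from the
    given coordinates themselves (an assumption for this sketch).
    """
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    dx = p * (max(xs) - min(xs))  # p * |x|
    dy = p * (max(ys) - min(ys))  # p * |y|
    return [(x + dx, y + dy) for x, y in coords]


# Toy coordinate space with |x| = 10 and |y| = 20 (hypothetical values)
coords = [(0.0, 0.0), (10.0, 20.0), (5.0, 5.0)]
spoofed_constant = constant_spoof(coords, (3.0, 4.0))
spoofed_offset = constant_offset_spoof(coords, 0.1)  # perturbation degree p = 10%
```

With \(p = 10\%\), every toy coordinate is shifted by \((1, 2)\), so the relative geometry of the points is preserved while their absolute positions are wrong. This is what makes the offset attack harder to spot than a constant position.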

Table 4. Impact of constant spoofing attack (with Set-1)
Fig. 8. Impact of constant offset spoofing attacks on performance prediction (binary classification): Even a small perturbation degree (\(p=1\%\)) significantly degrades performance prediction, from 80% to lower than 60% in F1 score, regardless of classifier type. Note that \(p=0\) indicates that no spoofing attack is applied.

Table 4 shows the performance prediction result with and without spoofing attacks. The experiment was performed with Set-1 for binary prediction. Since five different coordinates were randomly picked, we report the average and standard deviation for the spoofing case. As can be seen from the table, even this simple spoofing attack considerably degrades the prediction performance. For instance, RF degrades from 80% to 34.2% in F1 score, while KNN is slightly less affected than RF and XGB.

The constant spoofing attack would be easy to detect and resist, as it relies on static positions. The constant offset attack is harder to detect, since the modified coordinate is derived from the original location. Figure 8 shows the binary classification performance over different perturbation degrees (p). Note that \(p=0\) indicates that no spoofing attack is applied. As can be seen from the figure, even a small perturbation degree (\(p=1\%\)) significantly degrades performance prediction, from 80% to lower than 60% in F1 score, regardless of classifier type. With a greater degree of perturbation (\(p \ge 3\%\)), the prediction performance drops below 50% for every classifier. The result here signals the need for effective defense mechanisms against location-spoofing attacks for reliable estimation of throughput in a mobile communication setting.

5 Related Work

A recent study [6] investigated mobile bandwidth prediction using 4G and 5G datasets. For bandwidth prediction, the authors applied a Recurrent Neural Network (RNN) structure by formulating the prediction problem as a time series forecasting task. Their experimental result shows better performance than conventional univariate and multivariate prediction models. This previous work treats bandwidth prediction as a (continuous) regression problem, while our study defines throughput estimation as a (discrete) classification problem.

The authors in [5] evaluated the impact of location spoofing attacks using the VeReMi dataset. In this previous work, two machine learning algorithms, KNN and Support Vector Machine (SVM), were examined. The measured detection performance against spoofing attacks exceeds 99% (in recall and precision). A recent study [1] investigated the detection of falsified positions and the corresponding attack types in vehicular communication networks using a boosting decision tree ensemble technique. Our study analyzes the 5G dataset to understand the impact of location spoofing attacks on performance prediction (rather than the detection of spoofed coordinates).

6 Conclusion

This paper investigates mobile communication performance based on the coordinate information of mobile devices using an ML approach. Using only the three features <Longitude, Latitude, Velocity>, we observed up to 80% correct decisions (in F1 score) for binary prediction using a conventional random forest classifier. However, the experimental result shows that location-based performance prediction degrades considerably when assuming more than two classes (i.e., multi-class prediction). This paper also investigated the impact of location-spoofing attacks on the coordinate-based performance prediction, since such attacks are among the critical attacks in mobile communication environments. Location spoofing attacks significantly degrade performance prediction, from 80% to lower than 50% correct decisions, signaling the need for effective defense mechanisms for reliable performance estimation.

In this initial study, we employed conventional ML methods (KNN, RF, and XGB) for predicting throughput in a mobile communication setting. The observed performance of 80% for binary classification could be improved by designing more sophisticated learning models (e.g., using deep structures), which is one of the future tasks of this study. Additionally, this paper showed the significant impact of location spoofing attacks on performance prediction by applying two spoofing attack types (constant spoofing and constant offset spoofing). Toward a more sophisticated ML model resilient to such attacks, it will be interesting to apply the other types of spoofing attacks (i.e., random, random offset, and eventual stop spoofing) to evaluate the robustness to location spoofing. Another interesting research avenue is the investigation of defense mechanisms against potential spoofing attacks and their effect on performance prediction.