This section presents the results of three analyses performed on the WiFi dataset to investigate hidden relationships between TCP throughput and the other variables: correlation analysis, principal component analysis, and predictive analysis. An in-depth discussion follows at the end of the section.
Correlation analysis
Figure 1 illustrates pairwise scatter plots with linear regression fits, pairwise correlations, and histograms of variables of interest. In this figure, the pairwise or bivariate scatter plots with linear regression fits are shown below the diagonal; the histograms of the variables are shown on the diagonal; and correlation coefficients between a pair of variables are shown above the diagonal.
As can be seen from this figure, a strong positive correlation is present between R (link speed) and I (received signal strength indicator). This is expected, as received signal strength affects the signal-to-interference-plus-noise ratio, which determines the link speed or link data rate. A significant positive correlation is observed between S (TCP throughput in WiFi) and R, and between S and I. This also makes sense, as throughput directly depends upon the link speed and indirectly depends upon the received signal strength. A relatively weak negative correlation is seen between S and T (round-trip time), and between S and M (number of available WiFi access points). This is also understandable, as a higher RTT or a larger number of available access points (i.e., interferers) would negatively impact the throughput. Based on these visible correlations, one can presume that R or I, or their combination with other variables, may act as reasonable predictors of S. On the other hand, based on simple correlation analysis alone, one would not expect T or M by itself to be a reasonable predictor of S. Note that the pairwise scatter plot between S and T in Fig. 1 indicates a nonlinear relationship between the two.
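The pairwise correlations above the diagonal of Fig. 1 can be computed in a few lines. The sketch below uses synthetic stand-in data (the generating model, sample size, and variable distributions are illustrative assumptions, not the paper's dataset); the variable names S, R, I, T, and M follow the paper's notation, with `I_` used to avoid an ambiguous identifier.

```python
import numpy as np

# Synthetic stand-ins for the paper's variables (illustrative only):
# R (link speed) tracks I (signal strength); S depends strongly on R
# and weakly (negatively) on T and M, mimicking the observed pattern.
rng = np.random.default_rng(0)
n = 500
I_ = rng.normal(-60, 10, n)                 # received signal strength (dBm)
R = 0.8 * (I_ + 90) + rng.normal(0, 2, n)   # link speed follows signal strength
T = rng.gamma(2.0, 20.0, n)                 # round-trip time (ms)
M = rng.poisson(5, n).astype(float)         # number of visible access points
S = 0.5 * R - 0.02 * T - 0.1 * M + rng.normal(0, 1, n)  # throughput

# Pairwise Pearson correlations, as shown above the diagonal in Fig. 1.
vars_ = np.vstack([S, R, I_, T, M])
corr = np.corrcoef(vars_)
print(np.round(corr[0], 2))  # correlations of S with [S, R, I, T, M]
```

With this construction, S–R and R–I come out strongly positive while S–T is weakly negative, matching the qualitative pattern described above.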
Principal component analysis
Table 5 gives the weights (also known as loadings) of the variables of interest for the principal components obtained by applying PCA. The weights represent positive or negative correlations between the original variables and the principal components. It can be seen from this table that the weights of all variables for the first principal component (PC1) are small except for the weight of T; PC1 thus appears to be highly correlated to the variable T. Similarly, the fifth principal component (PC5) turns out to be highly correlated to the variable S. Table 6 shows the proportions of variance for the principal components. Note that the proportion of variance for PC1 is 0.9838, meaning that PC1 alone explains over 98% of the total variance in the dataset, i.e., nearly all of the information in the dataset is encapsulated by the first principal component.
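The loadings (Table 5) and proportions of variance (Table 6) fall out of an eigendecomposition of the covariance matrix. A minimal NumPy sketch follows, again with synthetic data (an assumption); when one variable, here T, carries most of the total variance and the data are not standardised, PC1 is dominated by that variable, which is consistent with the paper's finding.

```python
import numpy as np

# Synthetic 5-variable data standing in for S, R, I, T, M (illustrative only);
# T is given a much larger variance than the other variables.
rng = np.random.default_rng(1)
n = 1000
S = rng.normal(10, 3, n)
R = rng.normal(50, 2, n)
I_ = rng.normal(-60, 2, n)
T = rng.gamma(2.0, 50.0, n)                # dominates the total variance
M = rng.poisson(5, n).astype(float)
X = np.column_stack([S, R, I_, T, M])

# PCA on the (unstandardised) covariance matrix: eigenvectors give the
# loadings (cf. Table 5), eigenvalues the proportions of variance (Table 6).
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
prop_var = eigvals / eigvals.sum()

print(np.round(prop_var, 4))               # PC1 dominates
print(np.round(eigvecs[:, 0], 2))          # PC1 loading is concentrated on T
```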
Table 5 Weights of variables of interest for principal components
Table 6 Proportions of variance for principal components
Predictive analysis
In predictive analysis, we compare the performance of different variables, their combinations, and different PCs, as employed by the machine learning techniques to predict the throughput, in terms of percentage of variance (PoV) and root mean square error (RMSE). PoV is the percentage of total variance in the dependent variable in the training set explained by the independent variable(s) in the model that is constructed using a machine learning technique. In our case, the dependent variable is S and the independent variables can be M, I, R, and T, or PC1, PC2, PC3, PC4, and PC5. PoV is based on the coefficient of determination (also known as the coefficient of multiple determination) [33] and is given by
$$\text{PoV} = \left(1 - \frac{\sum_{i=1}^{N}\left(A_{i} - F_{i}\right)^{2}}{\sum_{i=1}^{N}\left(A_{i} - \bar{A}\right)^{2}}\right) \times 100,$$
(1)
where \(A_i\) is the \(i\)th actual value of S in the training set, \(F_i\) is the corresponding \(i\)th fitted value of S that is generated by the machine learning model, N is the number of observations for S in the training set, and \(\bar{A}\) is the mean of the actual values of S in the training set.
RMSE between predicted values and actual values [34] of S in the test set is given by
$$\text{RMSE} = \sqrt{\frac{1}{N'}\sum_{i=1}^{N'}\left(A'_{i} - P'_{i}\right)^{2}},$$
(2)
where \(A'_i\) is the \(i\)th actual value of S in the test set, \(P'_i\) is the corresponding \(i\)th predicted value of S, and \(N'\) is the number of observations for S in the test set. A higher PoV and a lower RMSE indicate better performance.
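Equations (1) and (2) translate directly into code. The sketch below is a straightforward NumPy implementation of the two metrics as defined above; the function names `pov` and `rmse` are our own.

```python
import numpy as np

def pov(actual, fitted):
    """Percentage of variance explained, Eq. (1): 100 times R^2 on the training set."""
    a, f = np.asarray(actual, float), np.asarray(fitted, float)
    return (1.0 - np.sum((a - f) ** 2) / np.sum((a - a.mean()) ** 2)) * 100.0

def rmse(actual, predicted):
    """Root mean square error, Eq. (2), on the test set."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((a - p) ** 2))

# Tiny worked example: a perfect fit gives PoV = 100 and RMSE = 0,
# while fitting every point with the mean gives PoV = 0.
a = [1.0, 2.0, 3.0, 4.0]
print(pov(a, a), rmse(a, a))
```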
Linear regression
Table 7 shows the percentage of variance explained in S in the training set by one or more independent variables when used in generating the model based on linear regression. A very low PoV of 4.27% is observed when T alone is used in generating fitted values for S by this model. The PoV improves when multiple independent variables are used for model fitting, and a PoV of 24.49% is obtained when R, I, T, and M are used together.
Table 7 PoV and RMSE for independent variable(s) with LR model
The RMSE between actual values of S in the test set and predicted values is also given in Table 7. An RMSE of 3.71 is seen when T in the test set is used in combination with the fitted LR model to generate the predicted values for S. The RMSE decreases when multiple independent variables (or predictors) in the test set are used to generate the predicted values, and an RMSE of 3.31 is achieved when R, I, T, and M are used together for prediction.
Figures 2 and 3 illustrate actual values of S versus predictions generated when using the LR model in combination with a single predictor and multiple predictors, respectively. The black coloured line, referred to as S in test set in these figures, represents the actual values of S in the test set. It is clear from these figures that the quality of prediction based on the model that uses linear regression is poor as the predictions are far from the actual values of S. This poor performance is due to the fact that the LR model could only capture up to 24.49% of the variation in S.
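The fit-then-predict workflow used here can be sketched with ordinary least squares. The data below are synthetic (an illustrative assumption) and deliberately encode a nonlinear inverse S-T dependence, in the spirit of the nonlinear relationship noted in Fig. 1; the train/test split and sample size are likewise illustrative, not the paper's.

```python
import numpy as np

# Synthetic data with a nonlinear (inverse) S-T relationship, which a
# linear model can only partially capture.
rng = np.random.default_rng(2)
n = 1000
T = rng.gamma(2.0, 20.0, n)
S = 100.0 / (1.0 + T) + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), T])        # intercept + single predictor T
tr, te = slice(0, 800), slice(800, None)    # simple 80/20 split

# Fit on the training split, predict on the test split.
beta, *_ = np.linalg.lstsq(X[tr], S[tr], rcond=None)
fitted = X[tr] @ beta
predicted = X[te] @ beta

pov_lr = (1 - np.sum((S[tr] - fitted) ** 2)
            / np.sum((S[tr] - S[tr].mean()) ** 2)) * 100   # Eq. (1)
rmse_lr = np.sqrt(np.mean((S[te] - predicted) ** 2))        # Eq. (2)
print(round(float(pov_lr), 2), round(float(rmse_lr), 2))
```

The same pattern extends to multiple predictors by adding columns to X.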
Random forest
The percentage of variance explained in S in the training set by single or multiple independent variables with the model generated using random forest is shown in Table 8. As with the LR model, the PoV of the RF model improves when multiple independent variables are used for model fitting. The PoV is 83.45% when R, I, T, and M are used together. However, this PoV is lower than that obtained when using T alone: a PoV of 87.63% is observed when T is used in generating fitted values for S, which is also much higher than the PoV of 4.27% seen with T in the LR model.
Table 8 PoV and RMSE for independent variable(s) with RF model
Table 8 also reveals the RMSE when a single predictor or multiple predictors in the test set are used in combination with the fitted RF model to generate predicted values for S. The RMSE decreases with an increasing number of predictors, and when R, I, T, and M are used together as predictors, an RMSE of 1.53 is seen. However, an even lower RMSE of 1.33 is achieved when using T alone for prediction.
Figures 4 and 5 show actual versus predicted values of S generated when using the model based on RF with a single predictor and multiple predictors, respectively. When using a single variable for prediction, T outperforms others as is clearly visible in Fig. 4. The quality of prediction improves with an increasing number of predictors as noticed from Fig. 5. However, T outperforms the combination of all predictors as observed from Fig. 6. When used in model fitting that employs random forest, T is able to capture a very high percentage of variation in S, which results in good performance.
As can be seen from Tables 7 and 8 as well as Figs. 3 and 5, the combination of R, I, and T and that of R, T, and M perform similarly. This indicates that exhaustively exploring all combinations of the independent variables for predicting S is unnecessary; we therefore examine only a few combinations of two, three, or all four independent variables and compare their performance against using a single independent variable to predict S.
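The RF step mirrors the LR sketch above, swapping in a random forest regressor. This is a minimal sketch with synthetic data (hyperparameters, sample size, and split are illustrative assumptions); scikit-learn's `score` method returns the coefficient of determination, so multiplying it by 100 gives PoV as in Eq. (1).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same synthetic nonlinear S-T relationship as in the LR sketch; a random
# forest can capture it where a straight line cannot.
rng = np.random.default_rng(3)
n = 1000
T = rng.gamma(2.0, 20.0, n)
S = 100.0 / (1.0 + T) + rng.normal(0, 0.5, n)

X = T.reshape(-1, 1)                        # single predictor T
tr, te = slice(0, 800), slice(800, None)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[tr], S[tr])
pov_rf = rf.score(X[tr], S[tr]) * 100       # R^2 * 100 on the training set
rmse_rf = np.sqrt(np.mean((S[te] - rf.predict(X[te])) ** 2))
print(round(float(pov_rf), 2), round(float(rmse_rf), 2))
```

On this synthetic data the RF PoV is far higher than the linear fit's, illustrating why T performs so much better with RF than with LR in Tables 7 and 8.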
Random forest with principal components
When a principal component is used in combination with RF for model fitting and prediction, the PoV explained in S in the training set as well as the RMSE between actual values of S in the test set and predicted values are shown in Table 9. A very high PoV of 99.61% is achieved when PC1 is used in generating fitted values with the RF model. Also, a very low RMSE of 0.23 is observed when PC1 is used for prediction. This is reflected in Fig. 7, which shows actual versus predicted values of S. It is also observed from Table 9 that PC1 performs better than PC5 in terms of PoV and RMSE.
Table 9 PoV and RMSE for a principal component with RF model
Recall that PC1 was found to be highly correlated to T and was able to explain 98% of the total variance in the WiFi dataset, as highlighted in Tables 5 and 6. Also, PC5 had been found to be highly correlated to S during PCA. Previously, we found that when T was used for generating the fitted model using RF, it outperformed other single and multiple predictors. Here, we observe that PC1, which was earlier found to be highly correlated to T, provides excellent performance, thereby indirectly re-confirming the ability of T to closely predict S. The need to use more computationally complex ML techniques, such as neural networks, to achieve higher accuracy did not arise, as a very high prediction accuracy was already achieved with RF.
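The project-then-fit workflow of this subsection can be sketched as follows. Synthetic data again (an assumption); note that, unlike the paper's PCA, which included S among the variables of interest, this sketch applies PCA only to the four predictors so that S does not leak into PC1. Because T dominates the total variance, PC1 is essentially T, and an RF model trained on PC1 alone predicts S well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Synthetic predictors (illustrative only); T dominates the total variance,
# so unstandardised PCA makes PC1 essentially a copy of T.
rng = np.random.default_rng(4)
n = 1000
R = rng.normal(50, 2, n)
I_ = rng.normal(-60, 2, n)
T = rng.gamma(2.0, 50.0, n)
M = rng.poisson(5, n).astype(float)
S = 100.0 / (1.0 + T) + rng.normal(0, 0.5, n)

X = np.column_stack([R, I_, T, M])
tr, te = slice(0, 800), slice(800, None)

pca = PCA(n_components=4).fit(X[tr])        # fit projection on training data
pc1_tr = pca.transform(X[tr])[:, :1]        # keep only PC1
pc1_te = pca.transform(X[te])[:, :1]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(pc1_tr, S[tr])
pov_pc1 = rf.score(pc1_tr, S[tr]) * 100
rmse_pc1 = np.sqrt(np.mean((S[te] - rf.predict(pc1_te)) ** 2))
print(round(float(pca.explained_variance_ratio_[0]), 4),
      round(float(pov_pc1), 2), round(float(rmse_pc1), 2))
```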
Discussion
As mentioned in Section 3, the crowd-sourced WiFi dataset, which we have employed for these different analyses, was collected using the publicly available Cell vs WiFi app [24]. This app gathers packet-level traces between a smartphone (located anywhere in the world) and a server (located at the Massachusetts Institute of Technology). If WiFi is available and the smartphone is connected to the Internet, the app, when activated by the smartphone user, initiates a 1 MB TCP download at the smartphone from the server over the WiFi network and the Internet, and collects packet-level measurements for S and T among other variables. The app then uploads the collected data along with the smartphone’s geographic location to the server.
RTT (or T) consists of the time it takes a packet to reach the smartphone from the server and the corresponding acknowledgement to reach the server from the smartphone. This includes all different types of delays encountered over the WiFi link and the intermediate wired links along both paths (i.e., from server to smartphone and from smartphone to server) such as transmission delay (which includes delay caused by MAC layer-level retransmissions due to problems over the WiFi link), propagation delay, queueing delay, and processing delay.
Congestion is the primary concern for TCP. It occurs when routers are overwhelmed with traffic: their queues build up and eventually overflow, causing packet losses that in turn lead to delays due to packet retransmissions. TCP is unable to differentiate between WiFi-link losses and wired-link losses, and assumes that all packet losses indicate congestion. When TCP detects a loss, it retransmits the lost packet and, in addition, reduces its transmission rate [35]. This relieves congestion by draining router queues. Packet losses over WiFi can occur due to wireless channel errors, collisions while accessing the medium at the MAC layer, and/or buffer overflow at an access point [36]. A packet loss over WiFi is resolved by a TCP-level retransmission.
TCP needs to periodically measure RTT to set the value of its retransmission timeout (RTO) [37]. The value of RTO is set slightly higher than RTT. If RTT increases, RTO increases, and the sender has to wait longer to retransmit the lost packet, which negatively impacts throughput. Moreover, when RTO expires, TCP sees this packet loss as congestion and reduces its congestion window, which degrades throughput. On the other hand, if RTT decreases, TCP continues to operate in its current phase and keeps increasing its congestion window, which improves throughput. This indicates an inverse relationship between RTT and throughput, and generally the TCP throughput is given by cwnd/RTT, where cwnd is the size of TCP’s congestion window [38].
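The inverse cwnd/RTT relationship is easy to illustrate numerically. The helper below is a back-of-envelope sketch (the function name and example window/RTT values are our own, not from the source); it shows that, for a fixed window, halving the RTT doubles the achievable throughput.

```python
# Back-of-envelope TCP throughput from the cwnd/RTT relation [38]:
# throughput ~ cwnd / RTT (illustrative numbers only).
def tcp_throughput_mbps(cwnd_bytes, rtt_s):
    """Approximate steady-state TCP throughput in Mbit/s."""
    return cwnd_bytes * 8 / rtt_s / 1e6

# Classic 64 KB window: halving RTT doubles the achievable throughput.
print(tcp_throughput_mbps(65_535, 0.100))  # about 5.24 Mbit/s at 100 ms RTT
print(tcp_throughput_mbps(65_535, 0.050))  # about 10.49 Mbit/s at 50 ms RTT
```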
From correlation analysis, we observed a significant positive correlation between throughput (or S) and link speed (or R), and between throughput and received signal strength (or I). Relatively low negative correlation was seen between throughput and RTT (or T), and between throughput and number of available access points (or M). Based on visible correlations, one could presume that a combination of multiple variables, including R and I, would act as reasonable predictor of S, whereas T or M alone would not be expected to reasonably predict S. The principal component analysis revealed that the first principal component (or PC1) was highly correlated to T, whereas PC5 was highly correlated to S. The proportion of variance for PC1 turned out to be 0.9838 during PCA, implying that PC1 could explain 98% of the total variance in the dataset.
To compare the performance of different variables, their combinations, and different PCs employed by the machine learning techniques to predict S, the percentage of variance was used to measure the quality of the fitted model, whereas the root mean square error was used to measure the accuracy of the prediction resulting from the fitted model. A maximum of 24.49% of the variation in S was captured when all four variables (R, I, T, and M) were used for generating fitted values by the model that employed linear regression, and this low PoV resulted in poor quality of prediction. Also, a very low PoV of 4.27% was observed when only T was used in generating fitted values with the LR model.
When used in generating fitted and predicted values for S with the model that employed random forest, T was able to achieve a PoV of 87.63%, a RMSE of 1.33, and outperformed all single variables as well as their combinations by providing the highest PoV and the lowest RMSE. A very high PoV of 99.61% and a very low RMSE of 0.23 were observed when PC1 was used in combination with the RF model to generate fitted and predicted values, respectively. PC1 even performed better than PC5 in terms of PoV and RMSE. Recall that PC5 was earlier found to be highly correlated to S.
Previously, TCP throughput has been modelled as a function of different variables including RTT [11, 12]. The relationship between TCP throughput and measurements of path properties, including queueing delays and packet loss, has been investigated using machine learning [15]. It is shown that TCP throughput predictions can be improved by up to a factor of 3 when these path properties are considered by a support vector regression-based machine learning model. While investigating hidden relationships among variables in WiFi using random forest-based machine learning models, we discover a very significant relationship between S and T. In fact, our investigation using machine learning reveals RTT as the variable that most significantly affects TCP throughput in WiFi.