1 Introduction

Formula 1 (F1) is one of the most popular sports in the world. The thrill of speed and the nail-biting experience that fans enjoy while watching a race are the result of a great deal of engineering, data science, management, and of course extensive training on the track. What is often underappreciated is that races are won first at the factory and then on the circuit. F1 teams work hard to balance top speed against downforce, and here aerodynamics plays a major role [12]. Teams also try to predict their finishing position using the massive datasets they have accumulated from past seasons. It is therefore worthwhile to dig deeper into the sport and analyze the associated statistics to understand their impact on the total race points accumulated by an F1 driver. In this work, we provide a data-driven framework to understand the various car race statistics and to assess their impact on the performance of F1 racers.

The driver and team work through 4 basic stages over the course of a race. Reviewing them will help us understand the fundamentals of an F1 race before diving into the prediction methodologies.

  • Preparation: The engineering team develops a strategy based on simulations, on data acquired from the driver’s trial runs, and on past races.

  • Practice: The driver practices and uses the strategy provided. This is a great way to correct the shortcomings in the strategy. This acts as a stepping stone in fine-tuning the strategy for the qualifying and final race.

  • Qualifying: The driver takes part in the qualifying rounds, which determine the starting position. The practice and qualifying rounds feed critical information to the engineering team, which uses it to plan the pit stops and strategy for the final race.

  • Racing: Technical difficulties are part and parcel of F1 races, but a few other factors also act as catalysts for the victory or loss of the team/driver. These include weather conditions, traffic on the track, pit stops for tyre or oil changes or other quick fixes, and of course the safety car, which slows the field to prevent crashes when obstructions appear on the track.

Now that we are aware of the 4 important stages in devising a successful strategy for winning the race, it is time to dive deeper into our objective of improving the decision-making steps using statistical tools and techniques.

1.1 Related Work

In the literature, researchers have done a lot of work on predicting sports results using machine learning techniques. Bishell [4] performed experiments on horse racing data and implemented a neural network for predicting race outcomes. Bishell concludes that a simple neural network model gave more efficient and accurate results than the other benchmark models, achieving an accuracy of 66% for the top three ranks. William and Li [19] conducted similar research using data collected from Caymans Race Track in Jamaica. The authors implemented a neural network model and achieved an overall accuracy of 74% in predicting the top three positions. Similar research was done in 2010 by Dacoodi and Khanteymoori [7], who acquired their data from Aqueduct Race Track in NY. Their work proposed a neural network that achieved an accuracy of 77% in a comparison with other neural networks. In [13], Miljković et al. predicted the outcomes of basketball matches using Naive Bayes on data acquired from the NBA website; the model achieved an accuracy of 77.97%. Also, [9] predicted the outcomes of the matches played by Tottenham Hotspur in the English Premier League. In [14], the author proposes a Bayes network model, built on expert knowledge, for predicting results in the game of football. The author concludes that the Bayes network achieved an accuracy of 59.21% and outperformed the other benchmark algorithms. Recently, in 2017, [18] analyzed the prediction of football matches played in the English Premier League using Bayes networks, with an average prediction accuracy of 75.09% across three seasons.

The majority of this prior research focused on developing predictive models with high generalization accuracy (as measured by performance on test sets) rather than on analyzing the factors that contribute to the outcome of a sports event. Furthermore, F1 races and their related analysis have been largely ignored. To the best of our knowledge, there is no published work that provides a systematic analysis of the race features that influence the outcome of F1 races. In this paper, we attempt to bridge this gap and provide a detailed analysis of F1 car analytics.

1.2 Contributions of This Paper

The main contributions of this paper are as follows:

  • Firstly, we propose a novel and systematic analysis of the various F1 car race factors in our collected dataset that govern the finishing position of a driver, and of the manner in which they are related to each other;

  • Secondly, we reduce the data space comprising 21 race features to 4 orthogonal dimensions that explain approximately 70% of the captured variance, using principal components analysis. This helps us identify the key F1 car race factors influencing the race outcome;

  • Finally, in the spirit of reproducible research, we release all the code and the associated dataset with this work. Data for this domain of sports analytics is difficult to obtain directly as CSV files. We scraped the data from the web using the R language and subsequently converted it from multiple pages on the website into a reusable CSV file.

The rest of the paper is organized as follows. Section 2 discusses the various factors associated with an F1 car race. Section 3 describes their inter-dependency in detail. We perform a dimensionality reduction of the original feature space using PCA in Sect. 4. Subsequently, we analyze the impact of the different car race features on total race points in Sect. 5. Finally, Sect. 6 concludes the paper and discusses future work.

2 Formula-1 (F1) Car Race Factors

In this section, we briefly discuss data collection, data pre-processing, and transformation of the input data.

2.1 Dataset

The dataset used in this paper was acquired from a single source by web scraping using RStudio. In the spirit of reproducible research, the dataset and code for this work are available online (Footnote 1). The dataset was taken from We collected the data for a period of 5 years (2015–2019).

The dataset provides information on the following attributes:

  • Average number of pit stops taken by each racer across the board is represented by Average.Pit.Stop

  • Percentage usage of each tyre type is represented by the variables Hard, Medium, Soft, Super.soft, Ultra.soft, Hyper.soft, Wet and Intermediate

  • The number of laps each driver spent in first, second and third position during the season is represented by the variables FirstPosition, SecondPosition and ThirdPosition, respectively

  • # of races the driver started is represented by the variable Started, # of races the driver classified by completing 90% of the race is represented by the variable Classified and # of races the driver completed by covering 100% race distance in the season is represented by the variable Completed

  • Full season laps led (represented by Full.seasons.laps.led) and driver’s season laps led (represented by Driver.s.season.laps.led) give the laps led by the driver as a percentage of all laps in the season and of all race laps covered by that driver, respectively

  • # of accidents by each racer in the season is represented by the variable Accident

  • # of penalties given to the team and to the driver, for each driver, are represented by and respectively. If an incident resulted in no penalty, it counts as a no action, which is represented by the variable No.action

  • Average position where each driver started every race, after penalties were applied is represented by the Average.pole.position

  • The total number of points scored by the driver during the season is denoted by Total.Points. These points eventually decide the winner of each season

2.2 Data Pre-processing

This section gives us insights on how missing values, data transformation and data pruning were dealt with in order to carry the analysis forward.

Data Pruning: Data pruning refers to removing data that are not required for the analysis. In our case, we pruned attributes that took the same value for every driver and therefore had no significance for the analysis. The attribute Withdrawn (W) describes the drivers who withdrew from a race. In our data, all the racers participated and no driver withdrew, so this attribute was removed. Similarly, the attribute Did Not Qualify (DNQ) contains the data for racers who did not qualify. However, all the drivers qualified for the final race, and hence this attribute was also removed.

Handling the Missing Values: After data pruning, missing values were detected and the reason for their absence was analyzed. The missing values in the data are not due to faulty or omitted data entry; they arise because the driver was not involved in the corresponding event. As an illustration, only a couple of drivers are affected by any given accident, so the accident field is empty for the rest. Hence, the missing values were replaced with zero.

2.3 Data Transformation

The variables that underwent transformation are as follows:

  • Pit stop data was listed for each lap, i.e. 22 laps. The arithmetic mean was calculated to create the average pit stop variable for each driver.

  • Tyres data was in the form of a percentage. All the special characters were taken off and the percentage was normalized to a decimal format.

  • Full season laps led and Driver’s season laps led were in the form of percentages, which were normalized to a decimal format.
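The pre-processing and transformation steps above can be sketched in a few lines. This is a minimal illustration in Python, not the R code used for this work; the helper names `percent_to_decimal` and `average_pit_stops` and all sample values are hypothetical.

```python
from statistics import mean

def percent_to_decimal(raw):
    """Convert a scraped percentage such as '37.5%' to 0.375; missing -> 0.0."""
    if raw is None or raw.strip() == "":
        return 0.0  # a missing value means the driver was not involved
    return float(raw.replace("%", "").strip()) / 100.0

def average_pit_stops(stops):
    """Arithmetic mean of pit-stop counts, treating missing entries as zero."""
    filled = [0 if s is None else s for s in stops]
    return mean(filled) if filled else 0.0

# Hypothetical scraped values for one driver (not taken from the dataset).
tyre_usage = {"Soft": "37.5%", "Medium": "12%", "Wet": None}
tyre_decimal = {name: percent_to_decimal(v) for name, v in tyre_usage.items()}
avg_stops = average_pit_stops([2, 1, None, 3])
```

Replacing a missing entry with zero before averaging follows the rule stated in Sect. 2.2; whether the original pipeline applied the replacement before or after averaging is an assumption of this sketch.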

The variables that we have in the dataset are all considered important. However, there are 22 of them, so a feature selection process that yields a more independent and uncorrelated set of input variables becomes all the more important. Most classification algorithms work best with input variables that are independent of each other, as these explain the maximum variation and trends in the dataset. This paper essentially explores these variable selection processes. We first discuss a straightforward correlation analysis and then move on to a more comprehensive principal components analysis.

3 Interdependency of Variables

In this section, we perform a correlation analysis [5, 6] of all the variables described in the previous section, using the R function corrgram (Footnote 2). As mentioned, all the attributes are considered important for this research and no features were removed manually. It is important to understand the correlation trend [1, 17] between the different features before we perform any classification task, because if two features are perfectly correlated, then one feature can be efficiently described by the other [11, 16]. Figure 1 depicts how the attributes are correlated with each other.

Fig. 1. Correlation between the various F1 car race variables (best viewed in color).

We observe that the average pole position is strongly negatively correlated with the first, second and third position variables. This makes sense, as a numerically higher average pole position (i.e. a start further back on the grid) would perhaps mean that the racer did not finish in the first, second or third position at the end of the race. It also suggests that the average pole position is perhaps one of the key factors in determining the finishing position of the driver. Interestingly, we observe that team penalties appear to be related to the usage of soft and hyper soft tyres: more frequent use of soft tyres is associated with fewer penalties, while use of hyper soft tyres is associated with more penalties. Moreover, hyper soft tyres are positively correlated with the occurrence of accidents, in line with their association with more penalties. Additionally, the position features, viz. first, second and third position, are strongly positively correlated with the number of laps completed, the full-season laps led by the drivers, and whether the driver was classified. Another interesting relationship is the strong negative correlation between a driver being classified and the occurrence of accidents.
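For readers who want to reproduce the pairwise correlations behind Fig. 1 outside of R, the following is a minimal sketch of the Pearson coefficient in Python. It is not the corrgram call used in this work; the function names and the toy values are hypothetical and serve only to show the sign convention (a numerically larger grid position paired with fewer laps in first place gives a negative coefficient).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Illustrative values only (hypothetical, not from the scraped dataset).
toy = {
    "Average.pole.position": [1.0, 2.0, 3.0, 4.0],
    "FirstPosition": [8.0, 6.0, 4.0, 2.0],
}
r = correlation_matrix(toy)[("Average.pole.position", "FirstPosition")]
```

A constant column has zero variance and would cause a division by zero here, which is one practical reason to prune such attributes first, as done in Sect. 2.2.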

4 Principal Components Analysis

In addition to the inter-dependency of the different variables, we also use Principal Component Analysis (PCA) [3, 15] to understand the underlying structure of the dataset. Let our F1 race features be the column vectors \(\textbf{v}_{1}, \textbf{v}_{2},\ldots ,\textbf{v}_{22}\) (22 in our case), where \(\textbf{v}_j \in \mathrm{I\!R}^{n \times 1}\) for \(j=1,2,\ldots ,22\), and n is the total number of observations in the dataset. We stack the individual feature vectors \(\textbf{v}_j\) to create the variable matrix \({\textbf {X}} \in \mathrm{I\!R}^{n \times 22}\):

$$\begin{aligned} {\textbf {X}}=[\textbf{v}_1, \textbf{v}_2,\ldots ,\textbf{v}_{22}]. \end{aligned}$$

We normalize each feature vector \(\textbf{v}_j\) by subtracting its mean value \(\bar{v_{j}}\) and dividing by its standard deviation \(\sigma _{v_{j}}\), which gives the normalized matrix \(\ddot{\textbf{X}}\):

$$\begin{aligned} \ddot{\textbf{X}}= \left[ \frac{\textbf{v}_{1}-\bar{v_{1}}}{\sigma _{v_{1}}}, \frac{\textbf{v}_{2}-\bar{v_{2}}}{\sigma _{v_{2}}},\ldots ,\frac{\textbf{v}_{j}-\bar{v_{j}}}{\sigma _{v_{j}}},\ldots ,\frac{\textbf{v}_{22}-\bar{v_{22}}}{\sigma _{v_{22}}}\right] . \end{aligned}$$

We then compute the covariance matrix of \(\ddot{\textbf{X}}\). Subsequently, we perform an eigenvalue decomposition of this covariance matrix to obtain the eigenvalues and the eigenvectors. The eigenvalues describe the amount of variance captured by each of the principal components, and the principal components themselves are obtained from the eigenvectors.
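The normalization, covariance and eigen-decomposition steps can be illustrated with a small standard-library sketch in Python. To stay dependency-free it recovers only the leading principal component, using power iteration instead of a full eigenvalue decomposition; this is a substitute technique chosen for brevity, not the routine used in this work, and the function names and toy data are hypothetical.

```python
from math import sqrt

def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(column)
    mu = sum(column) / n
    sd = sqrt(sum((v - mu) ** 2 for v in column) / (n - 1))
    return [(v - mu) / sd for v in column]

def covariance_matrix(columns):
    """Sample covariance of already-centred columns (list of lists)."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def leading_component(cov, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector).

    The uniform start vector must not be orthogonal to the leading
    eigenvector; that holds for the toy data below but not in general.
    """
    p = len(cov)
    vec = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, vec

# Hypothetical toy data: three feature columns with four observations each.
features = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.0], [4.0, 1.0, 3.0, 2.0]]
z = [standardize(col) for col in features]
cov = covariance_matrix(z)
lam1, pc1 = leading_component(cov)
explained = lam1 / sum(cov[i][i] for i in range(len(cov)))
```

Because the first two toy columns are almost collinear, most of the variance falls on the first component, mirroring on a small scale the redundancy among race features discussed in Sect. 3.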

4.1 Variation Explained by the Components

In this section, we analyze the variance captured by the most important principal components. Figure 2 describes the variance captured by each of the orthogonal principal components. We observe that the first two principal components capture 50% of the total variance. Furthermore, the cumulative variance captured by the first 4 principal components is \(\approx 70\)%. This indicates that most of the race features are correlated with each other (as observed in Sect. 3), and the total information in the original feature space can be effectively reduced to a lower dimensional subspace without the loss of significant information.
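The percentages above follow directly from the eigenvalues. As a short worked relation, write \(\lambda _1 \ge \lambda _2 \ge \ldots \ge \lambda _{22}\) for the eigenvalues of the covariance matrix of \(\ddot{\textbf{X}}\) and \(\rho _k\) for the share of variance captured by the k-th principal component (both symbols are introduced here for illustration only):

$$\begin{aligned} \rho _k = \frac{\lambda _k}{\sum _{j=1}^{22}\lambda _j}, \qquad \sum _{j=1}^{22}\lambda _j = 22. \end{aligned}$$

The second equality assumes that no feature is constant and that the same variance estimator is used for normalization and for the covariance matrix, so that every normalized feature has unit variance and the trace of the covariance matrix equals the number of features. Under these assumptions, a cumulative share of \(\approx 70\)% for the first 4 components corresponds to \(\lambda _1+\lambda _2+\lambda _3+\lambda _4 \approx 0.70 \times 22 = 15.4\).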

Fig. 2. Amount of variance captured by the individual principal components.

4.2 Bi-plot Representation

We also represent the car race variables in the subspace spanned by the principal components. Figure 3 is the bi-plot representation [2, 8] of our race variables across the first two principal components in a two-dimensional space. The race observations in our dataset are shown as points, and the race car variables as vectors. The bi-plot provides interesting insights into the F1 car race variables: we can observe the contribution of each race variable to the principal components, as well as the correlation between variables. The position variables, viz. FirstPosition, SecondPosition and ThirdPosition, are correlated with each other and contribute strongly to the second principal component. In addition, other variables related to the driver’s position in the race contribute quite strongly to PC1, making it a PC that potentially explains the positional aspect of the driver. We also observe that accidents and penalties due to the team are correlated with each other. We do not see a similar dependence of variables on any other PC; hence, the other three components explain the variation in the input variables in a cumulative manner.

Fig. 3. Biplot representation of the F1 race variables across the first two principal components. The F1 variables are represented by the vectors and the observations in the dataset are represented as points.

4.3 PCA Factor Loadings

The PCA factor loadings describe the weight that each variable has on each of the components. They also show the range of loadings on each principal component from each variable [10]. Table 1 lists the loading factors of the various car race features onto the first four principal components. The loadings in bold are the 6 largest in magnitude on each principal component, which helps us understand what each principal component could potentially represent. For example, similar to the findings in the previous section, the first PC shows strong loadings for all position-related variables. Similarly, the third PC has its maximum loadings on the tyre-related variables, thus accounting for the variance based on the type of tyre used during the race. It is also possible for one variable to have high loadings on multiple principal components, as can be seen in the table.

Table 1. Loading factors of the various features onto the first four principal components.
Table 2. We show the corresponding estimate and p-value for all the car race features, while estimating the total race points accumulated by a driver in a complete season. The significance codes are represented by ‘+’, where 0: ‘+++’, 0.001: ‘++’, 0.01: ‘+’, 0.05: ‘.’, and 0.1: ‘’.

5 Impact on Season’s Total Championship Points

We have discussed the relationship between the different factors that determine the final race outcomes. In this section, we run a linear regression on the data obtained from web scraping, which covers 5 consecutive seasons from 2015 to 2019. The dependent variable is the total points scored by a driver in each season, denoted by Total.Points. It is chosen because the driver with the highest points eventually wins the season. We study the effect of our input variables on Total.Points. Table 2 shows the results of the linear regression model applied to our dataset; each estimate is read as the change in Total.Points for a unit change in that variable, with the other variables held fixed. We observe that the number of races completed by a driver in a season (Complete) has a significant effect on Total.Points: for every race that a driver completes in a season, Total.Points increases by 6 units. Amongst the tyre types, only Medium, Soft, Ultra.Soft and Intermediate have a significant effect on Total.Points. According to the regression results, a percentage increase in Intermediate during the season increases Total.Points by 4, while a percentage increase in the use of Medium, Soft or Ultra.Soft (which are also the most used tyre types in the season) increases the total points scored by 2 each. In addition, an increase in the number of laps spent by the driver in second position, denoted by SecondPosition, increases Total.Points by 0.20; the results are similar for ThirdPosition. An interesting finding is the effect of Average.Pol.Pos, which denotes the average starting position held by each driver over the course of the season: a unit increase in Average.Pol.Pos results in a decrease of 3 points in Total.Points. The linear regression model has an R-squared value of 99%, which means that the model captures almost 99% of the variation in Total.Points.
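To make the reading of a regression estimate and of R-squared concrete, the following is a minimal single-predictor least-squares sketch in Python. The model in this work uses all race features at once and was not fitted with this code; the function name and the season records below are hypothetical and purely illustrative.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical season records (illustrative only, not from the dataset):
# races completed by a driver and the total points scored that season.
completed = [10, 12, 15, 18, 20]
total_points = [55, 70, 84, 104, 115]
intercept, slope, r2 = fit_simple_ols(completed, total_points)
```

Here `slope` plays the role of an estimate in Table 2 (points gained per additional completed race) and `r2` the role of the reported R-squared. With several correlated predictors, as in our dataset, individual estimates can shift noticeably depending on which other variables are included.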

6 Conclusion and Future Work

In this paper, we have provided a systematic analysis of various variables associated with the F1 car race. We have identified the most important variables that assist in a favorable outcome of the car race. Using a set of statistical techniques, we concluded that most of the variables are strongly correlated with each other. We also surmised that the original feature space can be significantly reduced to a lower-dimensional subspace without a significant loss of information.

Future work includes extending this systematic analysis to a period of more than 5 years, in order to gather more data and investigate the findings further. Furthermore, we plan to refine the linear regression model so that it uses a selected set of race features, chosen by forward and/or backward stepwise regression.