Abstract
A range of factors affects the outcome of Formula 1 (F1) car races. Today, it is reasonable to say that F1 races are won first at the factory and then on the track. F1 teams accumulate enormous amounts of data during races. In this paper, we propose a data-driven approach to identify the most important factors that contribute to the overall points scored by each driver in an F1 season. We perform a correlation analysis along with a principal components analysis (PCA) to identify the factors that are closely related. Furthermore, using PCA, we reduce our 21 input variables to a lower-dimensional subspace that explains most of the variance in our data and is easier to comprehend. We obtain 5 years (2015–2019) of data describing F1 car characteristics from a publicly available website, https://www.racefans.net/. We use this web-scraped F1 race dataset to understand the impact of the different car features on the total points scored by a driver in a season. To the best of our knowledge, our work is the first of its kind in the area of F1 car races.
Keywords
Formula 1
Feature analysis
Data analytics
Open-source code
A. Patil and N. Jain—Authors contributed equally.
1 Introduction
Formula 1 (F1) is one of the most popular sports in the world. The thrill of speed and the nail-biting experience that fans get while watching a race are the result of a great deal of engineering, data science, management, and of course training on the track. What is often underappreciated is that races are won first at the factory and then on the circuit. F1 teams work hard to maintain a balance between top speed and downforce, and aerodynamics plays a major role here [12]. The teams try to predict the position in which they will finish by using the massive datasets they have accumulated from past seasons. It is therefore worthwhile to dig deeper into the sport and analyze the associated statistics to understand their impact on the total race points accumulated by an F1 driver. In this work, we provide a data-driven framework to understand the various car race statistics and to check their impact on the performance of F1 racers.
A driver and team work through four basic stages during a race. Outlining them helps in understanding the fundamentals of an F1 race before we dive into the analysis of race outcomes.

Preparation: The engineering team develops a strategy based on simulations and on data acquired from the driver's trial runs and from past races.

Practice: The driver practices with the strategy provided. This is a good way to correct shortcomings in the strategy, and it acts as a stepping stone in fine-tuning the strategy for qualifying and the final race.

Qualifying: The driver takes part in the qualifying rounds, which determine the starting position. The practice and qualifying rounds feed critical information to the engineering team, which uses it to plan the pit stops and strategy for the final race.

Racing: Technical difficulties are part and parcel of F1 races, but there are a few other things that can act as a catalyst for the victory or loss of the team/driver: weather conditions, traffic on the track, pit stops for tyre or oil changes or other quick fixes, and of course the safety car, which slows the cars to prevent crashes when obstructions appear on the track.
Now that we are aware of the four important stages in devising a successful race strategy, it is time to dive deeper into our objective of improving the decision-making steps using statistical tools and techniques.
1.1 Related Work
In the literature, researchers have devoted considerable effort to predicting sports results using machine learning techniques. Bishell [4] performed experiments on horse racing data and implemented a neural network for predicting race outcomes. Bishell concludes that a simple neural network model gave more efficient and accurate results than other benchmark models; the model achieved an accuracy of 66% for the top three ranks. Williams and Li [19] conducted a similar study using data collected from Caymans Race Track in Jamaica. The authors implemented a neural network model and achieved an overall accuracy of 74% in predicting the top three positions. Similar research was done in 2010 by Davoodi and Khanteymoori [7], who acquired their data from Aqueduct Race Track in NY. Their work proposed a neural network that achieved an accuracy of 77% in a comparison with other neural networks. In [13], Miljković et al. predicted the outcomes of basketball matches using Naive Bayes on data acquired from the NBA website; the model achieved an accuracy of 77.97%. Also, [9] predicted the outcomes of the matches played by Tottenham Hotspur in the English Premier League. In [14], the author proposes a model developed from Bayes networks to predict the expert knowledge in the game of football. The author concludes that the Bayes network achieved an accuracy of 59.21% and outperformed other benchmark algorithms. More recently, in 2017, [18] predicted football matches played in the English Premier League using Bayes networks, with an average prediction accuracy of 75.09% across three seasons.
The majority of this prior research focused on developing predictive models with high generalization accuracy (as measured by performance on test sets) rather than on analyzing the factors that contribute to the outcome of a sports event. Furthermore, F1 races and their related analysis have been largely ignored. To the best of our knowledge, there is no published work that provides a systematic analysis of the race features that influence the outcome of F1 races. In this paper, we attempt to bridge this gap and provide a detailed analysis of F1 car analytics.
1.2 Contributions of This Paper
The main contributions of this paper are as follows:

Firstly, we propose a novel and systematic analysis of the various F1 car race factors in our collected dataset that govern the finishing position of a driver, and of the manner in which they are related to each other;

Secondly, using principal components analysis, we reduce the data space comprising 21 race features to 4 orthogonal dimensions that explain approximately 70% of the captured variance. This helps us identify the key F1 car race factors that influence the race outcome;

Finally, in the spirit of reproducible research, we release all the code and the associated dataset with this work. Data for this domain of sports analytics is difficult to obtain directly as CSV files. We scraped the data from the web using the R language and subsequently converted it from multiple pages on the website into a reusable CSV file.
The rest of the paper is organized as follows. Section 2 discusses the various factors associated with an F1 car race. Section 3 describes their interdependency in detail. We perform a dimensionality reduction of the original feature space using PCA in Sect. 4. Subsequently, we analyze the impact of the different car race features on total race points in Sect. 5. Finally, Sect. 6 concludes the paper and discusses future work.
2 Formula 1 (F1) Car Race Factors
In this section, we briefly discuss data collection, data preprocessing, and the transformation of the input data.
2.1 Dataset
The dataset used in this paper was acquired from a single source by web scraping using RStudio. In the spirit of reproducible research, the dataset and code for this work are available online^{Footnote 1}. The dataset was taken from https://www.racefans.net/2018f1season/2018f1statistics/. We collected the data for a period of 5 years (2015–2019).
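The scraping step is not described in detail, and the authors carried it out in R. As a rough Python illustration of how an HTML statistics table can be turned into CSV rows, the sketch below assumes a plain `<table>` layout; the markup, column names, and values are hypothetical and are not taken from the website.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # row currently being read
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table cells found in `html` and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page fragment, for illustration only
SAMPLE_HTML = """
<table>
  <tr><th>Driver</th><th>Started</th><th>Classified</th></tr>
  <tr><td>Driver A</td><td>21</td><td>20</td></tr>
  <tr><td>Driver B</td><td>21</td><td>17</td></tr>
</table>
"""

csv_text = table_to_csv(SAMPLE_HTML)
```

In practice, one such table per page and per season would be parsed and the resulting rows appended to a single CSV file.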
The dataset provides information on the following attributes:

The average number of pit stops taken by each racer is represented by Average.Pit.Stop

The percentage usage of each tyre type is represented by the variables Hard, Medium, Soft, Super.soft, Ultra.soft, Hyper.soft, Wet and Intermediate

The number of laps each driver spent in first, second and third position during the season is represented by the variables FirstPosition, SecondPosition and ThirdPosition

The number of races the driver started is represented by the variable Started, the number of races in which the driver was classified by completing 90% of the race is represented by the variable Classified, and the number of races the driver completed by covering 100% of the race distance is represented by the variable Completed

Full season laps led (represented by Full.seasons.laps.led) and driver’s season laps led (represented by Driver.s.season.laps.led) give the laps led by the driver as a percentage of all laps in the season and of all race laps covered by that driver, respectively

The number of accidents involving each racer in the season is represented by the variable Accident

The numbers of penalties attributed to the team and to the driver are represented, for each driver, by Penalties.due.to.team and Penalties.due.to.driver respectively. If no penalty was given, the incident counts as a no action, which is represented by the variable No.action

The average position from which each driver started every race, after penalties were applied, is represented by Average.pole.position

The total number of points scored by the driver during the season is denoted by Total.Points. These points eventually decide the winner of each season
2.2 Data Preprocessing
This section describes how missing values, data transformation and data pruning were handled before carrying the analysis forward.
Data Pruning: Data pruning refers to removing data that are not required for the analysis. In our case, we pruned the attributes that carried no information for the analysis. The attribute Withdrawn (W) records the drivers who withdrew from a race. In our data, all the racers participated and no driver withdrew, so this attribute was removed. Similarly, the attribute Did Not Qualify (DNQ) records the racers who did not qualify. However, all the drivers qualified for the final race, and hence this attribute was also removed.
Handling the Missing Values: After data pruning, missing values were detected and the reasons for their absence were analyzed. The missing values in the data are not due to faulty or omitted data entry; they occur because the driver was not involved in the corresponding event. As an illustration, in the event of an accident, only a couple of drivers were affected. Hence, the missing values were replaced with zero.
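The two preprocessing steps above can be sketched as follows. This is a minimal illustration, assuming the records are held as a list of dictionaries with `None` marking a missing value; the function name and the sample records are hypothetical.

```python
def prune_and_fill(rows, drop_if_empty):
    """rows: list of dicts, with None standing for a missing value.

    Drops each column named in drop_if_empty when it holds no non-zero
    value for any driver, then replaces remaining missing values with 0.
    """
    uninformative = {
        col for col in drop_if_empty
        if all(not row.get(col) for row in rows)
    }
    return [
        {col: (0 if val is None else val)
         for col, val in row.items() if col not in uninformative}
        for row in rows
    ]


# Hypothetical records, for illustration only
raw = [
    {"Driver": "A", "Accident": None, "Withdrawn": None, "DNQ": 0},
    {"Driver": "B", "Accident": 2, "Withdrawn": None, "DNQ": None},
]
clean = prune_and_fill(raw, drop_if_empty=["Withdrawn", "DNQ"])
```

Replacing a missing value with zero is appropriate here only because, as stated above, absence means the driver was not involved in the event.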
2.3 Data Transformation
The variables that underwent transformation are as follows:

Pit stop data were reported for each lap, i.e. 22 laps. An arithmetic mean was calculated to create an average pit stop value for each driver.

Tyre data were in the form of percentages. All special characters were removed and each percentage was converted to a decimal fraction.

Full season laps led and Driver’s season laps led were in the form of percentages, which were converted to decimal fractions.
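The transformations listed above amount to simple arithmetic. A minimal sketch, assuming percentages arrive as strings such as `'37.5%'` and pit stop counts as a list with `None` for missing entries (both helper names and the sample values are hypothetical):

```python
from statistics import mean


def percent_to_fraction(text):
    """Convert a string such as '37.5%' to the decimal fraction 0.375."""
    return float(text.strip().rstrip("%")) / 100.0


def average_pit_stops(stop_counts):
    """Arithmetic mean of the available pit stop counts (0.0 if none)."""
    counts = [c for c in stop_counts if c is not None]
    return mean(counts) if counts else 0.0


soft_share = percent_to_fraction("37.5%")
avg_stops = average_pit_stops([1, 2, None, 3])
```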
All the variables in the dataset are considered to be important. However, there are 22 variables, so a feature selection process that yields a more independent and uncorrelated set of input variables becomes all the more important. Most classification algorithms work best with input variables that are independent of each other, so that they can explain the maximum variation and trends in the dataset. This paper essentially explores such variable selection processes. We first discuss a rather straightforward correlation analysis and then move on to a more comprehensive principal components analysis.
3 Interdependency of Variables
In this section, we perform a correlation analysis [5, 6] of all the variables described in the previous sections, using the R function corrgram^{Footnote 2}. As mentioned, all the attributes are considered important for this research and no features were removed manually. It is important to understand the correlation trend [1, 17] between the different features before we perform any classification task, because if two features are perfectly correlated, then one feature can be efficiently described by the other [11, 16]. Figure 1 depicts how the attributes are correlated with each other.
We observe that the average pole position is strongly negatively correlated with the first, second and third position variables. This makes sense, as a numerically higher average pole position (a starting place further back on the grid) would generally mean that the racer spent fewer laps in first, second or third position. It also suggests that the average pole position is one of the key factors in determining the finishing position of the driver. Interestingly, we observe that team penalties appear to be related to the usage of soft tyres and hyper soft tyres: more frequent use of soft tyres is associated with fewer penalties, while usage of hyper soft tyres is associated with more penalties. Moreover, hyper soft tyres are positively correlated with the occurrence of accidents, in line with their association with more penalties. Additionally, the position features, viz. first, second and third position, are strongly positively correlated with the number of laps completed, the full season laps led by the drivers, and whether the driver was classified or not. Another interesting relationship is the strong negative correlation between a driver being classified and the occurrence of accidents.
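To make the notion of pairwise correlation concrete, the following sketch computes a correlation matrix in plain Python. It assumes Pearson's coefficient; the paper itself relies on the R function corrgram, and the values below are hypothetical, not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """columns: dict mapping a variable name to its list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical values for four drivers, for illustration only
data = {
    "Average.pole.position": [2.0, 5.0, 9.0, 14.0],
    "FirstPosition": [600.0, 250.0, 40.0, 0.0],
    "Accident": [0.0, 1.0, 1.0, 3.0],
}
corr = correlation_matrix(data)
```

With these illustrative numbers, the entry for Average.pole.position and FirstPosition is negative, which mirrors the direction of the relationship reported above.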
4 Principal Components Analysis
In addition to the interdependency of the different variables, we also use Principal Component Analysis (PCA) [3, 15] to understand the underlying structure of the dataset. Let us assume that our F1 race features are the column vectors \(\textbf{v}_{1}, \ldots , \textbf{v}_{22}\) (22 in our case), where \(\textbf{v}_j \in \mathrm{I\!R}^{n \times 1}\), \(j=1,2,\ldots ,22\), and n is the total number of observations in the dataset. We stack the individual feature vectors \(\textbf{v}_j\) to create the variable matrix \({\textbf {X}} \in \mathrm{I\!R}^{n \times 22}\):
\[{\textbf {X}} = [\textbf{v}_1, \textbf{v}_2, \ldots , \textbf{v}_{22}]\]
We normalize each of the feature vectors \(\textbf{v}_j\) with the corresponding mean value \(\bar{v_{j}}\) and the standard deviation \(\sigma _{v_{j}}\) to compute the normalised matrix \(\ddot{\textbf{X}}\). We compute the matrix \(\ddot{\textbf{X}}\) as:
\[\ddot{\textbf{X}} = \left[ \frac{\textbf{v}_1 - \bar{v_{1}}}{\sigma _{v_{1}}}, \frac{\textbf{v}_2 - \bar{v_{2}}}{\sigma _{v_{2}}}, \ldots , \frac{\textbf{v}_{22} - \bar{v_{22}}}{\sigma _{v_{22}}} \right]\]
We then compute the covariance matrix of \(\ddot{\textbf{X}}\). Subsequently, we perform an eigenvalue decomposition of the computed covariance matrix to obtain the eigenvalues and the eigenvectors. The eigenvalues describe the amount of variance captured by each of the principal components. The principal components are obtained from the eigenvectors.
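The procedure described above (normalisation, covariance matrix, eigenvalue decomposition) can be sketched with NumPy as follows. This is an illustrative implementation rather than the authors' R code; it assumes the sample standard deviation is used for normalisation, and the data are synthetic.

```python
import numpy as np


def pca(X):
    """PCA via eigendecomposition of the covariance of standardised data.

    X has shape (n_observations, n_features); every column is assumed
    to have non-zero variance.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # normalised matrix
    C = np.cov(Z, rowvar=False)                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()               # variance share per PC
    scores = Z @ eigvecs                              # observations in PC space
    return eigvals, eigvecs, explained, scores


# Synthetic data: 4 features built from 2 latent factors plus small noise
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=50),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=50),
])
eigvals, eigvecs, explained, scores = pca(X)
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering.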
4.1 Variation Explained by the Components
In this section, we analyze the variance captured by the most important principal components. Figure 2 describes the variance captured by each of the orthogonal principal components. We observe that the first two principal components capture 50% of the total variance. Furthermore, the cumulative variance captured by the first 4 principal components is \(\approx 70\)%. This indicates that most of the race features are correlated with each other (as observed in Sect. 3), and that the information in the original feature space can be effectively reduced to a lower-dimensional subspace without significant loss of information.
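For reference, the cumulative variance quoted above can be written in terms of the eigenvalues of the covariance matrix. The symbols \(\lambda _i\) (eigenvalues sorted in decreasing order) and \(\mathrm{EV}(k)\) are introduced here for illustration and do not appear elsewhere in the paper:

\[\mathrm{EV}(k) = \frac{\sum _{i=1}^{k} \lambda _i}{\sum _{j=1}^{22} \lambda _j}, \qquad \mathrm{EV}(2) \approx 0.50, \quad \mathrm{EV}(4) \approx 0.70\]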
4.2 Biplot Representation
We also represent the car race variables in the new subspace spanned by the principal components. Figure 3 is the biplot representation [2, 8] of our race variables across the first two principal components in a two-dimensional space. The race observations in our dataset are represented by points in the biplot, and the race car variables by vectors. The biplot provides interesting insights into the F1 car race variables: we can observe the contribution of each race variable to the principal components, and also the correlation between variables. The position variables, viz. FirstPosition, SecondPosition and ThirdPosition, are correlated with each other and have a strong contribution to the second principal component. In addition, other variables related to the driver’s position in the race contribute quite strongly to PC1, thus making it a PC that potentially explains the positional aspect of the driver. We also observe that accidents and penalties due to the team are correlated with each other. We do not see a similar dependence of variables on any other PCs; hence the other three components explain the variation in the input variables in a cumulative manner.
4.3 PCA Factor Loadings
The PCA factor loadings describe the loading that each variable has on each of the components. They also show the range of loadings on each principal component from each variable [10]. Table 1 lists the loading factors of the various car race features on the first four principal components. The bold loadings mark the top 6 loadings by magnitude on each principal component. They help us understand what each principal component could potentially represent. For example, similar to the findings in the previous section, the first PC shows strong loadings for all position-related variables. Similarly, the third PC has its largest loadings on the tyre-related variables, thus accounting for the variance based on the type of tyre used during the race. It is also possible for one variable to have high loadings on multiple principal components, as can be seen in the table.
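The selection of the top 6 loadings by magnitude on a component can be sketched as below. The loading values are hypothetical and are not those of Table 1; the function name is likewise an assumption made for illustration.

```python
def top_loadings(loadings, k=6):
    """Return the k variable names with the largest absolute loading."""
    ranked = sorted(loadings.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical loadings on a single principal component
pc_loadings = {
    "FirstPosition": 0.31, "SecondPosition": 0.33, "ThirdPosition": 0.30,
    "Completed": 0.28, "Classified": 0.27, "Average.pole.position": -0.35,
    "Accident": -0.05, "Wet": 0.02,
}
highlighted = top_loadings(pc_loadings)
```

Ranking by absolute value matters because a strongly negative loading, such as the one for Average.pole.position in this example, is as informative as a strongly positive one.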
5 Impact on Season’s Total Championship Points
We have discussed the relationship between the different factors that determine the final race outcomes. In this section, we run a linear regression on the data obtained from web scraping, which covers 5 consecutive seasons from 2015 to 2019. The dependent variable in the linear regression is the total points scored by a driver in each season, denoted by Total.Points. It is chosen as the dependent variable because the driver with the highest points eventually wins the season. We study the effect of our input variables on Total.Points. Table 2 shows the results of the linear regression model applied to our dataset. We observe that the number of races completed by a driver in a season (Complete) has a significant effect on Total.Points: for every race that a driver completes in a season, Total.Points increases by 6 units. We also observe that, amongst all the tyre types, only the Medium, Soft, Ultra.Soft and Intermediate tyre types have a significant effect on Total.Points. According to the linear regression results, for a one-percent increase in Intermediate during the season, Total.Points increases by 4. For a one-percent increase in the use of the Medium, Soft and Ultra.Soft tyre types (which are also the most used tyre types in the season), the total points scored increase by 2 for each. In addition, for a unit increase in the number of laps spent by the driver in second position, denoted by SecondPosition, Total.Points increases by 0.20. The results are similar for ThirdPosition. An interesting finding of this model is the effect of Average.Pol.Pos on Total.Points. The feature Average.Pol.Pos denotes the average starting position held by each driver during the course of the season. A unit increase in Average.Pol.Pos results in a decrease of 3 points in Total.Points.
The linear regression model has an R-squared value of 99%, which means that the model explains about 99% of the variation in Total.Points.
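The interpretation of regression coefficients and of R-squared can be illustrated with a small ordinary least squares fit. This is a sketch on synthetic, noise-free data whose coefficients were chosen to echo two of the effects reported above; it is not the fitted model of Table 2, and the variable names are assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (beta, r_squared), where beta[0] is the intercept and
    beta[1:] are the slopes for the columns of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)       # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum()) # total variation
    return beta, 1.0 - ss_res / ss_tot


# Synthetic data: points = 10 + 6 * completed - 3 * avg_start
completed = np.array([21, 19, 20, 15, 18, 12])
avg_start = np.array([2.0, 4.5, 3.0, 12.0, 8.0, 16.5])
points = 10 + 6 * completed - 3 * avg_start
beta, r2 = fit_ols(np.column_stack([completed, avg_start]), points)
```

Each slope gives the change in the dependent variable for a unit increase in one input with the others held fixed, which is how the per-race and per-position effects above should be read.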
6 Conclusion and Future Work
In this paper, we have provided a systematic analysis of the variables associated with F1 car races. We have identified the most important variables that contribute to a favorable race outcome. Using a set of statistical techniques, we concluded that most of the variables are strongly correlated with each other. We also found that the original feature space can be reduced to a lower-dimensional subspace without a significant loss of information.
Future work includes extending this systematic analysis to a period longer than 5 years in order to gather more data and investigate the findings further. Furthermore, we plan to refine the linear regression model so that it uses a selected set of race features, obtained by applying forward and/or backward stepwise regression.
References
1. Alparslan, B., Jain, M., Wu, J., Dev, S.: Analyzing air pollutant concentrations in New Delhi, India. In: 2021 Photonics & Electromagnetics Research Symposium (PIERS), pp. 1191–1197. IEEE (2021)
2. AlSkaif, T., Dev, S., Visser, L., Hossari, M., van Sark, W.: A systematic analysis of meteorological variables for PV output power estimation. Renew. Energy 153, 12–22 (2020)
3. Batra, S., et al.: DMCNet: diversified model combination network for understanding engagement from video screengrabs. Syst. Soft Comput. 4, 200039 (2022)
4. Bishell, A.: Machine learning and New Zealand horse racing prediction. BSc. Report, Department of Computer Science, Massey University, New Zealand (2006)
5. Danesi, N., Jain, M., Lee, Y.H., Dev, S.: Monitoring atmospheric pollutants from ground-based observations. In: 2021 IEEE USNC-URSI Radio Science Meeting (Joint with AP-S Symposium), pp. 98–99. IEEE (2021)
6. Danesi, N., Jain, M., Lee, Y.H., Dev, S.: Predicting ground-based PM2.5 concentration in Queensland, Australia. In: 2021 Photonics & Electromagnetics Research Symposium (PIERS), pp. 1183–1190. IEEE (2021)
7. Davoodi, E., Khanteymoori, A.R.: Horse racing prediction using artificial neural networks. Recent Adv. Neural Netw. Fuzzy Syst. Evol. Comput. 2010, 155–160 (2010)
8. Dev, S., Lee, Y.H., Winkler, S.: Color-based segmentation of sky/cloud images from ground-based cameras. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10(1), 231–242 (2017)
9. Joseph, A., Fenton, N.E., Neil, M.: Predicting football results using Bayesian nets and other machine learning techniques. Knowl.-Based Syst. 19(7), 544–553 (2006)
10. Manandhar, S., Dev, S., Lee, Y.H., Winkler, S., Meng, Y.S.: Systematic study of weather variables for rainfall detection. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, pp. 3027–3030. IEEE (2018)
11. Manandhar, S., Dev, S., Lee, Y.H., Meng, Y.S., Winkler, S.: A data-driven approach for accurate rainfall prediction. IEEE Trans. Geosci. Remote Sens. 57(11), 9323–9331 (2019)
12. Martins, D., Correia, J., Silva, A.: The influence of front wing pressure distribution on wheel wake aerodynamics of a F1 car. Energies 14(15), 4421 (2021)
13. Miljković, D., Gajić, L., Kovačević, A., Konjović, Z.: The use of data mining for basketball matches outcomes prediction. In: Proceedings of IEEE 8th International Symposium on Intelligent Systems and Informatics, pp. 309–312. IEEE (2010)
14. Pariath, R., Shah, S., Surve, A., Mittal, J.: Player performance prediction in football game. In: Proceedings of Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 1148–1153. IEEE (2018)
15. Pathan, M.S., Nag, A., Dev, S.: Efficient rainfall prediction using a dimensionality reduction method. In: Proceedings of IEEE International Geoscience and Remote Sensing Symposium, pp. 6737–6740. IEEE (2022)
16. Pathan, M.S., Nag, A., Pathan, M.M., Dev, S.: Analyzing the impact of feature selection on the accuracy of heart disease prediction. Healthc. Anal. 2, 100060 (2022)
17. Pathan, M.S., Wu, J., Lee, Y.H., Yan, J., Dev, S.: Analyzing the impact of meteorological parameters on rainfall prediction. In: Proceedings of IEEE USNC-URSI Radio Science Meeting (Joint with AP-S Symposium), pp. 100–101. IEEE (2021)
18. Razali, N., Mustapha, A., Yatim, F.A., Ab Aziz, R.: Predicting football matches results using Bayesian networks for English Premier League (EPL). In: Proceedings of IOP Conference Series: Materials Science and Engineering, vol. 226, p. 012099. IOP Publishing (2017)
19. Williams, J., Li, Y.: A case study using neural networks algorithms: horse racing predictions in Jamaica. In: Proceedings of International Conference on Artificial Intelligence (ICAI 2008), pp. 16–22. CSREA Press (2008)
Acknowledgement
This research was conducted with the financial support of Science Foundation Ireland under Grant Agreement No. 13/RC/2106_P2 at the ADAPT SFI Research Centre at University College Dublin. ADAPT, the SFI Research Centre for AI-Driven Digital Content Technology, is funded by Science Foundation Ireland through the SFI Research Centres Programme. The authors would also like to thank Prof John D. Kelleher from Technological University Dublin, Ireland for helpful discussions on this work.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Patil, A., Jain, N., Agrahari, R., Hossari, M., Orlandi, F., Dev, S. (2023). A Data-Driven Analysis of Formula 1 Car Races Outcome. In: Longo, L., O’Reilly, R. (eds) Artificial Intelligence and Cognitive Science. AICS 2022. Communications in Computer and Information Science, vol 1662. Springer, Cham. https://doi.org/10.1007/978-3-031-26438-2_11
DOI: https://doi.org/10.1007/978-3-031-26438-2_11
Publisher Name: Springer, Cham
Print ISBN: 9783031264375
Online ISBN: 9783031264382
eBook Packages: Computer Science (R0)