A data science approach to risk assessment for automobile insurance policies

  • Regular Paper
International Journal of Data Science and Analytics


To determine a suitable automobile insurance policy premium, one must take into account three factors: the risk associated with the drivers and cars on the policy, the operational costs of managing the policy, and the desired profit margin. The premium should then be some function of these three values. We focus on risk assessment using a data science approach. Instead of using the traditional frequency and severity metrics, we predict the total claims that will be made by a new customer using historical data on current and past policies. Given multiple features of the policy (age and gender of drivers, value of car, previous accidents, etc.), one can try to provide personalized insurance policies based specifically on these features as follows. We compute the claims made per year for each past and current policy with identical features and then average these claim rates. Unfortunately, there may not be sufficient samples to obtain a robust average. We can instead include policies that are “similar” in order to obtain sufficient samples. We therefore face a trade-off between personalization (using only closely similar policies) and robustness (extending the domain far enough to capture sufficient samples). This is known as the bias–variance trade-off. We model this problem, determine the optimal trade-off between the two (i.e., the balance that provides the highest prediction accuracy), and apply it to the claim rate prediction problem. We demonstrate our approach using real data.
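The trade-off described in the abstract can be illustrated with a minimal sketch. This is not the author's model, which the abstract does not specify; it assumes a single numeric feature (driver age), synthetic data, and a hypothetical similarity window whose half-width is chosen by held-out prediction error. All function names and parameter values are illustrative assumptions.

```python
import random
from statistics import mean

def neighbourhood_estimate(train, target_age, half_width):
    """Average annual claims over training policies whose driver age lies
    within half_width of target_age; returns None if no policy qualifies."""
    similar = [claim for age, claim in train if abs(age - target_age) <= half_width]
    return mean(similar) if similar else None

def validation_error(train, valid, half_width):
    """Mean squared error of the neighbourhood estimate on held-out policies.
    Falls back to the global training mean when the neighbourhood is empty."""
    global_mean = mean(claim for _, claim in train)
    errors = []
    for age, claim in valid:
        estimate = neighbourhood_estimate(train, age, half_width)
        if estimate is None:
            estimate = global_mean
        errors.append((estimate - claim) ** 2)
    return mean(errors)

def best_half_width(train, valid, candidates):
    """Pick the candidate half-width with the lowest validation error."""
    return min(candidates, key=lambda w: validation_error(train, valid, w))

def synthetic_policies(n, rng):
    """Hypothetical data (an assumption, not the paper's confidential data):
    expected annual claims fall with driver age, plus Gaussian noise."""
    policies = []
    for _ in range(n):
        age = rng.randint(18, 80)
        expected = 2000.0 - 20.0 * age
        policies.append((age, max(0.0, rng.gauss(expected, 600.0))))
    return policies

rng = random.Random(0)
train = synthetic_policies(400, rng)
valid = synthetic_policies(200, rng)

# Half-width 0 uses only identical ages (most personalized);
# half-width 62 covers the whole age range (most robust, least personalized).
candidates = [0, 1, 2, 5, 10, 20, 62]
errors = {w: validation_error(train, valid, w) for w in candidates}
chosen = best_half_width(train, valid, candidates)
print(chosen, round(errors[chosen], 1))
```

A very narrow window averages few policies and is therefore noisy, while a very wide window averages over dissimilar drivers; comparing held-out errors across widths is one simple way to make that balance measurable.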

Availability of data and materials

The data used for this publication are confidential; hence, we are permitted to provide results but cannot share the data.

Code Availability

The code used to generate the results is also proprietary to the company, but we hope that our pseudo-code can be used by anyone who wishes to apply the model to their own datasets.




The authors did not receive support from any organization for the submitted work.

Author information

The sole author performed the research, wrote the code for evaluating the solution, and wrote the entire paper.

Corresponding author

Correspondence to Patrick Hosein.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hosein, P. A data science approach to risk assessment for automobile insurance policies. Int J Data Sci Anal 17, 127–138 (2024).
