The Journal of Supercomputing, Volume 72, Issue 7, pp 2815–2831

Highway traffic accident prediction using VDS big data analysis

  • Seong-hun Park
  • Sung-min Kim
  • Young-guk Ha


In modern society, road accidents are among the most life-threatening dangers to humans, and traffic accidents that cause heavy damage occur everywhere. The most effective countermeasure may be to predict accidents in advance, giving drivers a chance to avoid the danger or to reduce the damage by responding quickly. Accidents on the road can be predicted using classification analysis, a data mining procedure that requires enough data to build a learning model. However, building such a prediction system involves several problems. First, collecting and analyzing traffic data for accident prediction requires substantial hardware resources because the data are extremely large. Second, the amount of data related to traffic accidents is smaller than the amount not related to them, so the two classes of data (the class to be predicted and the other class) are imbalanced. The purpose of this paper is to build a prediction model that resolves these problems. This paper suggests using the Hadoop framework to process and analyze big traffic data efficiently, and a sampling method to resolve the data imbalance. Based on this, the prediction system first preprocesses the big traffic data and analyzes it to create data for the learning system. The imbalance of the created data is then corrected using a sampling method. To improve prediction accuracy, the corrected data are classified into several groups, to which classification analysis is applied.
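
The imbalance correction step can be illustrated with a small sketch. The abstract does not say which sampling method the paper uses, so the code below assumes plain random oversampling (duplicating minority-class records until the classes are the same size). The function name `random_oversample`, the toy readings, and the labels `"normal"` and `"accident"` are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records of each smaller class until every
    class has as many records as the largest class.

    This is an assumed, simplified sampling method, not the paper's own.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_records, out_labels = list(records), list(labels)
    for cls, n in counts.items():
        # All original records that belong to this class.
        pool = [r for r, y in zip(records, labels) if y == cls]
        # Draw (with replacement) as many extra copies as the class is short.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_records.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_records, out_labels


# Hypothetical toy data: (speed in km/h, vehicles per minute) readings.
records = [(98, 12), (101, 10), (95, 14), (102, 9), (97, 11), (40, 30)]
labels = ["normal"] * 5 + ["accident"]

balanced_records, balanced_labels = random_oversample(records, labels)
print(Counter(balanced_labels))
```

After the call, both classes have the same number of records, so a classifier trained on the result no longer sees accidents as a rare case. Oversampling by duplication is the simplest choice; it adds no new information, which is one reason other sampling methods exist.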


Accident prediction, Big data inference, Imbalance data, MapReduce, Classification



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Konkuk University, Seoul, Korea
