Multimedia Tools and Applications, Volume 78, Issue 21, pp 29783–29804

A comparison of learning methods over raw data: forecasting cab services market share in New York City

  • Fernando Turrado García
  • Luis Javier García Villalba
  • Ana Lucila Sandoval Orozco
  • Tai-Hoon Kim


Cab services, present in most cities, are among the most widely used forms of passenger transportation. Their business model is now threatened by emerging third parties powered by modern technologies. Using New York City cab data, we compare several machine learning techniques (linear regression, support vector machines and random forest) for forecasting the amount of dollars spent on cab services. The comparison focuses on the accuracy of their forecasts under three conditions: real data for all features, noisy data (real data with uniformly distributed noise added) for several key features, and estimated data (obtained from other statistical estimators) for the key features. The main goal of this comparison is to provide data on how these methods perform when they are used in conjunction with other estimators.
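The noisy-data condition can be illustrated with a minimal sketch. It is not the authors' experiment: the data are synthetic, only ordinary least squares is shown (not support vector machines or random forest), and the 20% noise level, the relative form of the noise, and all names (`fit_line`, `add_uniform_noise`, `trips`, `dollars`) are assumptions made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def add_uniform_noise(values, level, rng):
    """Perturb each value by a uniform relative error in [-level, +level].

    Assumption: the noise is relative to the value; the abstract only says
    that uniformly distributed noise is added.
    """
    return [v * (1.0 + rng.uniform(-level, level)) for v in values]


rng = random.Random(42)

# Synthetic stand-in for the cab data (hypothetical): one key feature
# (trips per period) and the target (dollars spent).
trips = [rng.uniform(300.0, 600.0) for _ in range(400)]
dollars = [12.5 * t + rng.gauss(0.0, 100.0) for t in trips]

train_x, test_x = trips[:300], trips[300:]
train_y, test_y = dollars[:300], dollars[300:]

intercept, slope = fit_line(train_x, train_y)


def predict(xs):
    """Forecast dollars spent from the key feature."""
    return [intercept + slope * x for x in xs]


# Condition 1: forecast from the real feature values.
rmse_real = rmse(test_y, predict(test_x))

# Condition 2: forecast from the same values with uniform noise added.
noisy_x = add_uniform_noise(test_x, 0.20, rng)
rmse_noisy = rmse(test_y, predict(noisy_x))

print(f"RMSE with real features:  {rmse_real:.1f}")
print(f"RMSE with noisy features: {rmse_noisy:.1f}")
```

The third condition described above would replace `noisy_x` with values produced by another estimator (for example, a time-series forecast of the key feature) and compare the resulting error in the same way.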


Keywords: Forecast; Linear regression; Random forest; Support vector machines; Time series



This research work was supported by Sungshin Women’s University. In addition, L.J.G.V. and A.L.S.O. thank the European Commission Horizon 2020 5G-PPP Programme (Grant Agreement number H2020-ICT-2014-2/671672-SELFNET - Framework for Self-Organized Network Management in Virtualized and Software-Defined Networks).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Fernando Turrado García (1)
  • Luis Javier García Villalba (1, corresponding author)
  • Ana Lucila Sandoval Orozco (1)
  • Tai-Hoon Kim (2)

  1. Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Information Technology and Computer Science, Office 431, Universidad Complutense de Madrid (UCM), Madrid, Spain
  2. Department of Convergence Security, Sungshin Women’s University, Seoul, Korea
