A Comparative Study of Sampling Methods and Algorithms for Imbalanced Time Series Classification

  • Guohua Liang
  • Chengqi Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7691)

Abstract

Mining time series data and mining imbalanced data are two of the ten most challenging problems in data mining research. Imbalanced time series classification (ITSC) combines these two challenges and arises in many real-world applications. In prior work, the structure-preserving over-sampling (SPO) method was proposed for ITSC, and its authors claim it outperforms other over-sampling and state-of-the-art methods in time series classification (TSC). However, since research has shown that under-sampling can be more effective and efficient than over-sampling, it remains unclear whether an under-sampling method paired with various learning algorithms can outperform over-sampling methods such as SPO for ITSC. We therefore present a comparative study of an under-sampling method, combined with various learning algorithms, against over-sampling methods such as SPO. The Friedman test and its corresponding post-hoc test are applied to determine whether the differences between methods are statistically significant. The experimental results demonstrate that the under-sampling technique with k-nearest neighbors (KNN) is the most effective method, achieving results superior to the more complicated SPO method for ITSC.
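To make the pipeline concrete, the following is a minimal Python sketch of the general under-sampling-plus-KNN approach and the Friedman significance test described above. It uses synthetic data and made-up scores; all names and parameters are hypothetical, and this is an illustration of the general technique, not the authors' implementation.

```python
# Hypothetical sketch: random under-sampling + 1-NN for imbalanced time
# series, followed by a Friedman test over per-dataset scores. Not the
# authors' code; the data and scores below are synthetic placeholders.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 200 majority vs. 20 minority series of
# length 50 (a stand-in for a UCR time series dataset).
X = np.vstack([rng.normal(0.0, 1.0, (200, 50)),
               rng.normal(0.7, 1.0, (20, 50))])
y = np.array([0] * 200 + [1] * 20)

# Random under-sampling: keep every minority series plus an equal-sized
# random subset of the majority class (one simple strategy; the paper's
# specific under-sampling technique may differ).
maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
keep = np.concatenate([rng.choice(maj, size=mino.size, replace=False), mino])
X_bal, y_bal = X[keep], y[keep]

# 1-NN on the balanced set; Euclidean distance is used here for brevity,
# though DTW-based 1-NN is the usual strong baseline for time series.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_bal, y_bal)
print("balanced training accuracy:", knn.score(X_bal, y_bal))

# Friedman test across datasets (rows) for three competing methods
# (columns), following Demsar's protocol; the scores are invented.
method_a = [0.91, 0.85, 0.78, 0.88, 0.90]
method_b = [0.89, 0.80, 0.75, 0.84, 0.86]
method_c = [0.70, 0.72, 0.68, 0.74, 0.71]
stat, p = friedmanchisquare(method_a, method_b, method_c)
print(f"Friedman chi-square = {stat:.2f}, p-value = {p:.4f}")
```

If the Friedman test rejects the null hypothesis of equal average ranks, a post-hoc test (e.g., the Nemenyi test) is then used to identify which pairs of methods differ significantly.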

Keywords

Imbalanced Time Series Classification · Supervised Learning Algorithms · Under-sampling · Over-sampling

References

  1. Yang, Q., Wu, X.: 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5(4), 597–604 (2006)
  2. Liang, G., Zhang, C.: An efficient and simple under-sampling technique for imbalanced time series classification. In: CIKM 2012 (in press, 2012)
  3. Acır, N.: Classification of ECG beats by using a fast least square support vector machines with a dynamic programming feature selection algorithm. Neural Computing & Applications 14(4), 299–309 (2005)
  4. Übeyli, E.: ECG beats classification using multiclass support vector machines with error correcting output codes. Digital Signal Processing 17(3), 675–684 (2007)
  5. Hidasi, B., Gáspár-Papanek, C.: ShiftTree: An Interpretable Model-Based Approach for Time Series Classification. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011, Part II. LNCS, vol. 6912, pp. 48–64. Springer, Heidelberg (2011)
  6. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 26(1), 43–49 (1978)
  7. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.: Fast time series classification using numerosity reduction. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 1033–1040 (2006)
  8. Buza, K., Nanopoulos, A., Schmidt-Thieme, L.: INSIGHT: Efficient and Effective Instance Selection for Time-Series Classification. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part II. LNCS, vol. 6635, pp. 149–160. Springer, Heidelberg (2011)
  9. Japkowicz, N., et al.: Learning from imbalanced data sets: A comparison of various strategies. In: AAAI Workshop on Learning from Imbalanced Data Sets, vol. 68 (2000)
  10. Chawla, N., Japkowicz, N., Kolcz, A.: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Data Sets (2003)
  11. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6(1), 1–6 (2004)
  12. Liang, G., Zhu, X., Zhang, C.: The effect of varying levels of class distribution on bagging with different algorithms: An empirical study. International Journal of Machine Learning and Cybernetics (in press, 2012)
  13. Liang, G.: An investigation of sensitivity on bagging predictors: An empirical approach. In: 26th AAAI Conference on Artificial Intelligence, pp. 2439–2440 (2012)
  14. Liang, G., Zhang, C.: An Empirical Evaluation of Bagging with Different Algorithms on Imbalanced Data. In: Tang, J., King, I., Chen, L., Wang, J. (eds.) ADMA 2011, Part I. LNCS, vol. 7120, pp. 339–352. Springer, Heidelberg (2011)
  15. Liang, G., Zhang, C.: Empirical study of bagging predictors on medical data. In: 9th Australasian Data Mining Conference, AusDM 2011, pp. 31–40 (2011)
  16. Liang, G., Zhu, X., Zhang, C.: An Empirical Study of Bagging Predictors for Imbalanced Data with Different Levels of Class Distribution. In: Wang, D., Reynolds, M. (eds.) AI 2011. LNCS, vol. 7106, pp. 213–222. Springer, Heidelberg (2011)
  17. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1), 321–357 (2002)
  18. Chawla, N., Lazarevic, A., Hall, L., Bowyer, K.: SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003)
  19. Han, H., Wang, W., Mao, B.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005)
  20. Drummond, C., Holte, R., et al.: C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In: Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets II (2003)
  21. Liu, X., Wu, J., Zhou, Z.: Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39(2), 539–550 (2009)
  22. Ling, C., Li, C.: Data mining for direct marketing: Problems and solutions. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 73–79 (1998)
  23. Cao, H., Li, X., Woon, Y., Ng, S.: SPO: Structure preserving oversampling for imbalanced time series classification. In: Proceedings of the IEEE 11th International Conference on Data Mining, ICDM 2011, pp. 1008–1013 (2011)
  24. Tan, P., Steinbach, M., Kumar, V., et al.: Introduction to Data Mining. Pearson Addison Wesley (2006)
  25. Keogh, E., Zhu, Q., Hu, B., Hao, Y., Xi, X., Wei, L., Ratanamahatana, C.A.: UCR Repository of time series classification/clustering homepage, http://www.cs.ucr.edu/~eamonn/time_series_data/ (2011)
  26. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
  27. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Guohua Liang (1)
  • Chengqi Zhang (1)
  1. The Centre for Quantum Computation & Intelligent Systems, FEIT, University of Technology Sydney, Australia
