Design issues in Time Series dataset balancing algorithms

  • Enrique A. de la Cal
  • José R. Villar
  • Paula M. Vergara
  • Álvaro Herrero
  • Javier Sedano
AI and ML applied to Health Sciences (MLHS)

Abstract

The Internet of Things and e-Health are producing huge collections of Time Series (TS) that are analyzed to classify the current status or to detect certain events, among other tasks. In two-class problems, when the positive events to detect are infrequent, the gathered data lack balance. Even in unsupervised learning, this imbalance reduces the generalization capability of the models. To solve this problem, Time Series balancing algorithms have been proposed, although they have barely been studied. The existing approaches either take a single bag of Time Series and extract some of them to generate a new synthetic one, or generate ghost points in the distance space. These solutions are suitable when there is only one data source and the datasets are univariate. However, in the context of the Internet of Things, where multiple data sources are available, these approaches may not perform coherently. Moreover, to the best of our knowledge, there are no TS balancing algorithms in the literature that handle multiple data sources and multivariate data. In this research, we study two main concerns that should be considered when designing Time Series balancing algorithms: on the one hand, TS balancing algorithms should deal with multiple multivariate data sources; on the other hand, they should be shape preserving. As part of this work, a new algorithm is proposed for balancing multivariate Time Series datasets. The algorithm is evaluated on two real-world multivariate Time Series datasets from the e-Health domain: one on epileptic seizure identification and the other on fall detection. The performance is analyzed in depth, showing the advantages of taking the Time Series issues into account within the balancing algorithm.
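
To make the idea of generating a synthetic Time Series from a bag of minority-class series more concrete, the following is a minimal sketch of generic SMOTE-style interpolation applied to multivariate series. It is not the algorithm proposed in this paper. The function names (`distance`, `synthesize`), the parameter `k`, the plain Euclidean distance, the equal-length assumption, and the use of a single interpolation gap per series are all assumptions made for illustration only.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length multivariate series.

    Each series is a list of time steps; each time step is a list of variables.
    (Assumption: a plain Euclidean distance; other measures could be swapped in.)
    """
    return math.sqrt(sum((x - y) ** 2
                         for step_a, step_b in zip(a, b)
                         for x, y in zip(step_a, step_b)))


def synthesize(minority, n_new, k=3, seed=0):
    """Create n_new synthetic series by interpolating between minority series.

    Hypothetical helper: needs at least two minority series of equal shape.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate partners: the k nearest minority series to the base one.
        others = [s for s in minority if s is not base]
        neighbours = sorted(others, key=lambda s: distance(base, s))[:k]
        mate = rng.choice(neighbours)
        # One gap for the whole series, so every variable and time step is
        # moved by the same proportion between the two real series.
        gap = rng.random()
        synthetic.append([[x + gap * (y - x) for x, y in zip(step_b, step_m)]
                          for step_b, step_m in zip(base, mate)])
    return synthetic


if __name__ == "__main__":
    # Three toy minority series: 2 time steps, 2 variables each.
    minority = [
        [[0.0, 1.0], [1.0, 2.0]],
        [[0.2, 1.1], [1.2, 2.1]],
        [[0.4, 0.9], [0.8, 1.9]],
    ]
    for series in synthesize(minority, n_new=2, k=2):
        print(series)
```

Using a single gap per synthetic series, rather than a different one per point, keeps all variables of the new series aligned between the same two real series. This is only a loose illustration of why a balancing step might treat a multivariate series as a whole instead of variable by variable.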

Keywords

  • Imbalanced Time Series
  • Correlation measures
  • Human activity recognition
  • Epilepsy onset recognition
  • Fall detection

Notes

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Computer Science Department, University of Oviedo, EIMEM, Oviedo, Spain
  2. Civil Engineering Department, University of Burgos, EPS, Burgos, Spain
  3. Instituto Tecnológico de Castilla y León, Burgos, Spain