
Multi-period classification: learning sequent classes from temporal domains

Abstract

As the majority of real-world decisions change over time, extending traditional classifiers to classify an attribute of interest across different time periods becomes increasingly important. Tackling this problem, referred to as multi-period classification, is critical for real-world tasks such as predicting upcoming healthcare needs or supporting administrative planning. In this context, although existing research provides principles for learning single labels from complex data domains, less attention has been given to the problem of learning sequences of classes (symbolic time series). This work motivates the need for multi-period classifiers and proposes a method, cluster-based multi-period classification (CMPC), that preserves local dependencies across the periods under classification. Evaluation against real-world datasets provides evidence of the relevance of multi-period classifiers, and shows the superior performance of CMPC against peer methods adapted from long-term prediction for multi-period tasks with a high number of periods.



Notes

  1. Available at http://web.tecnico.ulisboa.pt/rmch/software/evoc/

  2. In general, this classifier slightly outperforms kNN lazy learners (Aha et al. 1991) and C4.5 decision trees (Quinlan 1993) for the used data settings. We hypothesize that this is because the learned dependencies among subsets of informative events can model relevant temporal or cross-attribute dependencies.

  3. http://doc.gold.ac.uk/~mas02mg/software/hmmweka/

  4. http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data

  5. http://archive.ics.uci.edu/ml/datasets/Diabetes

  6. http://www.heritagehealthprize.com/c/hhp/data (under a granted permission)

  7. Complete list of results available at http://web.tecnico.ulisboa.pt/rmch/software/evoc/

References

  • Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37–66

  • Azuaje F (2011) Integrative data analysis for biomarker discovery. In: Bioinformatics and biomarker discovery: omic data analysis for personalized medicine, pp 137–154

  • Bache K, Lichman M (2013) UCI machine learning repository

  • Baldi P, Chauvin Y, Hunkapiller T, McClure M (1994) Hidden Markov models of biological primary sequence information. Proc Natl Acad Sci USA 91(3):1059–1063

  • Batista GEAPA, Wang X, Keogh EJ (2011) A complexity-invariant distance measure for time series. In: SDM’11. SIAM/Omnipress, Mesa, pp 699–710

  • Baxter RA, Williams GJ, He H (2001) Feature selection for temporal health records. In: PAKDD. Springer, London, pp 198–209

  • Ben Taieb S, Bontempi G, Atiya AF, Sorjamaa A (2012) A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Syst Appl 39(8):7067–7083

  • Ben Taieb S, Sorjamaa A, Bontempi G (2010) Multiple-output modeling for multi-step-ahead time series forecasting. Neurocomputing 73:1950–1957

  • Bengio S, Fessant F, Collobert D (1996) Use of modular architectures for time series prediction. Neural Process Lett 3:101–106

  • Bishop C (2006) Pattern recognition and machine learning. Information Science and Statistics. Springer, New York

  • Bontempi G, Ben Taieb S (2011) Conditionally dependent strategies for multiple-step-ahead prediction in local learning. Int J Forecast 27:689–699

  • Bontempi G, Birattari M, Bersini H (1998) Lazy learning for iterated time-series prediction. In: Suykens JAK, Vandewalle J (eds) IW on advanced black-box techniques for nonlinear modeling. Katholieke Universiteit Leuven, Leuven, pp 62–68

  • Bradley PS, Reina CA, Fayyad UM (2000) Clustering very large databases using EM mixture models. In: International conference on pattern recognition, vol 2, pp 2076+

  • Brahim-Belhouari S, Bermak A (2004) Gaussian process for nonstationary time series prediction. Comput Stat Data Anal 47(4):705–712

  • Cadez I, Heckerman D, Meek C, Smyth P, White S (2000) Visualization of navigation patterns on a web site using model-based clustering. In: KDD ’00. ACM, New York, pp 280–284

  • Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh EJ (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. Proc VLDB Endow 1(2):1542–1552

  • Geurts P (2001) Pattern extraction for time series classification. In: Principles of data mining and knowledge discovery. LNCS, vol 2168. Springer, Heidelberg, pp 115–127

  • Graves A (2012) Supervised sequence labelling with recurrent neural networks. Studies in Computational Intelligence. Springer, New York

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11(1):10–18

  • Hartigan JA, Wong MA (1979) A k-means clustering algorithm. Appl Stat 28(1):100–108

  • Henriques R, Antunes C (2012) On the need of new approaches for the novel problem of long-term prediction over multi-dimensional data. In: Lee R (ed) Computer and information science 2012. Studies in Computational Intelligence, vol 429. Springer, Berlin, pp 121–138

  • Henriques R, Antunes C (2014) Learning predictive models from integrated healthcare data: capturing temporal and cross-attribute dependencies. In: HICSS. IEEE

  • Henriques R, Pina S, Antunes C (2013) Temporal mining of integrated healthcare data: methods, revealings and implications. In: SDM IW on data mining for medicine and healthcare. SIAM, pp 52–60

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  • Ji Y, Hao J, Reyhani N, Lendasse A (2005) Direct and recursive prediction of time series using mutual information selection. In: IWANN. LNCS, vol 3512. Springer, Heidelberg, pp 1010–1017

  • Kirshner S (2005) Modeling of multivariate time series using hidden Markov models. PhD thesis, AAI3164062

  • Kriegel H-P, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240

  • Letham B, Rudin C, Madigan D (2013) Sequential event prediction. Mach Learn 93(2–3):357–380

  • Lockett AJ, Miikkulainen R (2009) Temporal convolution machines for sequence learning. Technical report AI-09-04, University of Texas at Austin

  • Mantaci S, Restivo A, Sciortino M (2008) Distance measures for biological sequences: some recent approaches. Int J Approx Reason 47(1):109–124

  • Moen P (2000) Attribute, event sequence and event type similarity notions for data mining. University of Helsinki

  • Mörchen F (2003) Time series feature extraction for data mining using DWT and DFT. Technical report, University of Marburg

  • Mörchen F (2006) Time series knowledge mining. Wissenschaft in Dissertationen. Görich & Weiershäuser

  • Murphy K (2002) Dynamic Bayesian networks: representation, inference and learning. PhD thesis, UC Berkeley, Computer Science Division

  • Nguyen H-L, Ng W-K, Woon Y-K (2013) Closed motifs for streaming time series classification. KAIS, pp 1–25

  • Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, San Francisco, CA

  • Povinelli RJ, Johnson MT, Lindgren AC, Ye J (2004) Time series classification using Gaussian mixture models of reconstructed phase spaces. IEEE Trans Knowl Data Eng 16(6):779–789

  • Quinlan R (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo, CA

  • Rahman S, Bakar A, Hussein Z (2008) A review on protein sequence clustering research. In: ICBE. IFMBE Proceedings, vol 21. Springer, Berlin-Heidelberg, pp 275–278

  • Roddick JF, Spiliopoulou M (2002) A survey of temporal knowledge discovery paradigms and methods. IEEE Trans Knowl Data Eng 14(4):750–767

  • Sorjamaa A, Hao J, Reyhani N, Ji Y, Lendasse A (2007) Methodology for long-term prediction of time series. Neurocomputing 70:2861–2869

  • Sorjamaa A, Lendasse A (2006) Time series prediction using DirRec strategy. In: ESANN’06, pp 143–148

  • Taieb SB, Bontempi G, Sorjamaa A, Lendasse A (2009) Long-term prediction of time series by combining direct and MIMO strategies. In: IJCNN. IEEE Press, Piscataway, NJ, pp 1559–1566

  • Toft P, Rostrup E, Nielsen FA, Hansen LK, Goutte C (1998) On clustering fMRI time series. Neuroimage 9:298–310

  • Tseng VS, Lee C-H (2009) Effective temporal data classification by integrating sequential pattern mining and probabilistic induction. Expert Syst Appl 36(5):9524–9532

  • Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Wareh Min 3(3):1–13

  • Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244

  • Xi X, Keogh E, Shelton C, Wei L, Ratanamahatana CA (2006) Fast time series classification using numerosity reduction. In: ICML. ACM, New York, pp 1033–1040

  • Zhang M-L, Zhou Z-H (2005) A k-nearest neighbor based algorithm for multi-label classification. In: IEEE international conference on granular computing, vol 2, pp 718–721


Acknowledgments

The authors deeply thank the reviewers of this manuscript for the detailed, attentive and insightful feedback. This work was supported by Fundação para a Ciência e Tecnologia under the multi-annual funding of INESC-ID PEst-OE/EEI/LA0021/2013 and the Ph.D. Grant SFRH/BD/75924/2011.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rui Henriques.

Additional information

Responsible editor: Dr. Eamonn Keogh.

Appendix: Complementary metrics


Multi-period classifiers can be evaluated when the attribute under classification is either nominal or ordinal. In the paper, we targeted nominal attributes on the learning codomain and adopted simple loss functions (based on matching operators) to evaluate the performance of the proposed methods. This appendix adds three complementary views: first, loss functions to deal with ordinal labels; second, meaningful evaluation metrics based on compact confusion matrices when a high number of labels is available; third, distance metrics that can account for misalignments, such as temporal shifts.

Multi-period classification with ordinal labels

Multi-period accuracy \(Acc_j\) can be derived from loss functions applied along the horizon of prediction. Representative loss functions include the simple, average normalized or relative root mean squared error. To draw comparisons with literature results, we suggest the use of the Normalized Root Mean Squared Error, NRMSE (5), and the Symmetric Mean Absolute Percentage Error, SMAPE (6) (Ben Taieb et al. 2010).

$$\begin{aligned} {\hbox {Acc}}_j(\varvec{y}_j,\hat{\varvec{y}}_j)= 1-{\hbox {NRMSE}}(\varvec{y}_j,\hat{\varvec{y}}_j)=1-\frac{\sqrt{\frac{1}{h}\Sigma _{i=1}^{h}(y_j^i-\hat{y}_j^i)^2}}{y_{\max }-y_{\min }}\in [0,1]\end{aligned}$$
(5)
$$\begin{aligned} {\hbox {Acc}}_j(\varvec{y}_j,\hat{\varvec{y}}_j)=1-{\hbox {SMAPE}}(\varvec{y}_j,\hat{\varvec{y}}_j)=1-\frac{1}{h}\Sigma _{i=1}^h\frac{\mid y_j^{i}-\hat{y}_j^{i}\mid }{(y_j^{i}+\hat{y}_j^{i})/2}\in [0,1] \end{aligned}$$
(6)
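As a quick illustration (not code from the paper), the per-instance accuracies in (5) and (6) can be implemented directly for ordinal labels encoded as numbers; the function names are ours:

```python
import numpy as np

def acc_nrmse(y, y_hat, y_min, y_max):
    """Accuracy derived from the normalized root mean squared error, Eq. (5)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    nrmse = np.sqrt(np.mean((y - y_hat) ** 2)) / (y_max - y_min)
    return 1.0 - nrmse

def acc_smape(y, y_hat):
    """Accuracy derived from the symmetric mean absolute percentage error, Eq. (6)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.mean(np.abs(y - y_hat) / ((y + y_hat) / 2.0))
```

Both functions return 1 for a perfect prediction over the \(h\) periods and decrease with the accumulated per-period error.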

Evaluation using compact confusion matrices

In order to account for further critical performance views, a classic confusion matrix can be computed for each period. This solution, illustrated in Fig. 7, has the undesirable property of not offering compact views to study performance. For instance, multiple metrics need to be computed for each label and period in order to obtain a global view of the multi-period classifier sensitivity. A simple option, similarly to (3) and (4), would be to average the values for an instance across the \(h\) periods. However, for the ordinal setting, instead of simply computing the matchings, a normalized distance needs to be applied between each pair of observed and estimated labels.

Fig. 7

Confusion matrices in multi-period classification settings. A confusion matrix in multi-period settings is the composition of classic confusion matrices per label and period, which results in a total of \({\mid }\Sigma {\mid }\times h\) views

However, with this option we lose the ability to understand which periods are affecting the score. A second option is to collapse the labels' axis by defining a predicate. For this goal, we can rely on a mapping function \(T\) that maps the set of \(h\) observed labels to a single label. An illustrative function is one that decides whether an instance is of interest (positive) or not based on the observed values. For example, relevant patients can be defined as those having at least one hospitalization across the horizon of prediction. Still, this option requires the computation of each metric for the \(h\) periods. Thus, we propose the use of this option with a simple test (based on a fixed \(\beta \)-threshold) to evaluate the adequacy of the \(h\) predictions for a particular instance, \({\hbox {Acc}}(y,\hat{y})\ge \beta \) ((7) and (8)). Understandably, this option comes at the cost of defining a new labeling function \(T\) and of working with \(\beta \)-threshold levels. Table 6 presents the revised confusion matrix for multi-period classification when two classes are considered. The resulting sensitivity (7) and specificity (8) metrics for this setting are computed as follows:

$$\begin{aligned} {\hbox {Sensitivity}}_c&= \frac{\Sigma _{j=1}^{m}(c=T(\varvec{y}_j))\wedge Acc(\varvec{y}_j,\hat{\varvec{y}}_j)\ge \beta }{\Sigma _{j=1}^{m}c=T(\varvec{y}_j)}, \end{aligned}$$
(7)
$$\begin{aligned} {\hbox {Specificity}}_c&= \frac{\Sigma _{j=1}^{m} (c\ne T(\varvec{y}_j))\wedge Acc(\varvec{y}_j,\hat{\varvec{y}}_j)\ge \beta }{\Sigma _{j=1}^{m} c\ne T(\varvec{y}_j)}. \end{aligned}$$
(8)
Table 6 Multi-period confusion matrix
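The \(\beta \)-threshold metrics in (7) and (8) can be sketched as follows; this is an illustration under our own naming, where the labeling function `T` and the per-instance accuracy `acc` are supplied by the user:

```python
def multi_period_sensitivity(c, Y, Y_hat, T, acc, beta):
    """Fraction of instances of class c = T(y) whose h predictions are
    adequate, i.e. acc(y, y_hat) >= beta (Eq. 7)."""
    positives = [(y, yh) for y, yh in zip(Y, Y_hat) if T(y) == c]
    if not positives:
        return 0.0
    return sum(1 for y, yh in positives if acc(y, yh) >= beta) / len(positives)

def multi_period_specificity(c, Y, Y_hat, T, acc, beta):
    """Analogous ratio over the instances with T(y) != c (Eq. 8)."""
    negatives = [(y, yh) for y, yh in zip(Y, Y_hat) if T(y) != c]
    if not negatives:
        return 0.0
    return sum(1 for y, yh in negatives if acc(y, yh) >= beta) / len(negatives)
```

Following the example in the text, `T` could flag instances with at least one hospitalization across the horizon, while `acc` could be a simple matching fraction over the \(h\) periods.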

Complementary evaluation metrics

Understandably, the distance functions used to evaluate the performance of multi-period classifiers are conservative for the cases where mismatches are caused by temporal shifts. To avoid a significant penalization of the performance of multi-period classifiers when misalignments occur on the time or cardinality axes, their evaluation can rely on more expressive time series’ similarity functions.

Ding et al. (2008) and Batista et al. (2011) compare the properties of alternative similarity functions when the attribute under classification is ordinal or numeric. Dynamic Time Warping (DTW) handles misalignments, which becomes critical when dealing with long horizons of prediction. Longest Common Subsequence deals with gap constraints. Pattern-based functions consider shifting and scaling in both the temporal and the amplitude axes.
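For reference, DTW can be computed with the standard dynamic-programming recurrence; this is the textbook formulation, not code from the surveyed works:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] holds the minimum cumulative cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Because the recurrence allows one point to align with several points of the other series, a prediction that is correct but temporally stretched incurs no penalty, unlike a period-by-period loss.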

When the output attribute is nominal, similarity functions proposed to compare biomolecular sequences based on significant functional or structural similarity can be applied (Mantaci et al. 2008). These functions are also able to identify temporal shifts, as they rely on sequence alignment operators. Moreover, they are able to deal with shifts on the amplitude axis by detecting character-level differences.
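As a minimal alignment-based illustration for symbolic series (an assumption on our part, not one of the specific measures from Mantaci et al. 2008), a normalized edit distance yields a similarity in \([0,1]\) that tolerates insertions, deletions and substitutions:

```python
def edit_distance(s, t):
    """Levenshtein distance between two symbolic sequences."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + sub)
    return D[n][m]

def nominal_similarity(s, t):
    """Alignment-based similarity in [0, 1] for nominal label sequences."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```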

On one hand, these similarity functions have the advantage of smoothing error accumulation by allowing temporal misalignments. On the other hand, their use can mask the structural accuracy of multi-period classifiers and lead to more optimistic results.


About this article


Cite this article

Henriques, R., Madeira, S.C. & Antunes, C. Multi-period classification: learning sequent classes from temporal domains. Data Min Knowl Disc 29, 792–819 (2015). https://doi.org/10.1007/s10618-014-0376-8


Keywords

  • Multi-period classification
  • Long-term prediction
  • Time-sensitive supervised learning