Very fast decision rules for classification in data streams

Abstract

Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. Many decision tasks can be formulated as stream mining problems, and many new algorithms for data streams are therefore being proposed. Decision rules are among the most interpretable and flexible models for predictive data mining. Nevertheless, few algorithms have been proposed in the literature to learn rule models from time-changing, high-speed flows of data. In this paper we present the very fast decision rules (VFDR) algorithm and discuss interesting extensions to the base version. All the proposed versions are one-pass, any-time algorithms: they work on-line and learn ordered or unordered rule sets. Algorithms designed to work with data streams should be able to detect changes and quickly adapt the decision model. To handle these situations we also present the adaptive extension (AVFDR), which detects changes in the process generating data and adapts the decision model. Detecting local drifts takes advantage of the modularity of rule sets: in AVFDR, each individual rule monitors the evolution of performance metrics to detect concept drift, and rules are pruned whenever a drift is signaled. This explicit change detection mechanism provides useful information about the dynamics of the process generating data, enables faster adaptation to changes, and yields more compact rule sets. The experimental evaluation demonstrates that the proposed algorithms achieve competitive results in comparison to alternative methods and that the adaptive variants are able to learn fast, compact rule sets from evolving streams.

Notes

  1. Note that decision lists are ordered rule sets.

  2. Weighted Max generally did not produce results much different from Weighted Sum; therefore, we opted not to include this setting in the results.

References

  • Baena-García M, del Campo-Ávila J, Fidalgo R, Bifet A, Gavaldà R, Morales-Bueno R (2006) Early drift detection method. In: Fourth international workshop on knowledge discovery from data streams. ECML-PKDD, Berlin, pp 77–86

  • Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Thiel K, Wiswedel B (2009) KNIME: the konstanz information miner: version 2.0 and beyond. SIGKDD Explor Newsl 11:26–31

  • Bifet A, Gavaldà R (2009) Adaptive learning from evolving data streams. In: Advances in intelligent data analysis VIII. Lecture notes in computer science, vol 5772. Springer, Berlin/Heidelberg, pp 249–260

  • Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res (JMLR) 11:1601–1604

  • Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM Press, New York, pp 139–148

  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees, 1st edn. Chapman and Hall/CRC, Boca Raton

  • Clark P, Boswell R (1991) Rule induction with CN2: some recent improvements. In: Proceedings of the European working session on machine learning, EWSL ’91. Springer, London, pp 151–163

  • Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3:261–283

  • Cohen W (1995) Fast effective rule induction. In: Proceedings of the 12th international conference on machine learning, ICML’95. Morgan Kaufmann, San Francisco, pp 115–123

  • Data Expo (2009) ASA sections on statistical computing and statistical graphics. http://stat-computing.org/dataexpo/2009/. Accessed 1 Feb 2013

  • Data Mining Group (2011) Predictive model markup language (PMML 4.1). http://www.dmg.org/v4-0-1/RuleSet.html. Accessed 1 Feb 2013

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Domingos P (1996) Unifying instance-based and rule-based induction. Mach Learn 24:141–168

  • Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’00. ACM Press, New York, pp 71–80

  • Ferrer F, Aguilar J, Riquelme J (2005) Incremental rule learning and border examples selection from numerical data streams. J Univ Comput Sci 11(8):1426–1439

  • Frank A, Asuncion A (2010) UCI machine learning repository. University of California, Irvine

  • Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: Proceedings of the 15th international conference on machine learning, ICML’98. Morgan Kaufmann, San Mateo, pp 144–151

  • Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  • Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92

  • Fürnkranz J (2001) Round robin rule learning. In: Proceedings of the 18th international conference on machine learning, ICML’01. Morgan Kaufmann, San Mateo, pp 146–153

  • Fürnkranz J, Gamberger D, Lavrač N (2012) Foundations of rule learning. Springer, New York

  • Gama J (2010) Knowledge discovery from data streams. Chapman and Hall/CRC, Boca Raton

  • Gama J, Kosina P (2011) Learning decision rules from data streams. In: Proceedings of the 22nd international joint conference on artificial intelligence. AAAI, Menlo Park, pp 1255–1260

  • Gama J, Rocha R, Medas P (2003) Accurate decision trees for mining high-speed data streams. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’03. ACM Press, New York, pp 523–528

  • Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: SBIA Brazilian symposium on artificial intelligence, LNCS 3171. Springer, Heidelberg, pp 286–295

  • Gama J, Fernandes R, Rocha R (2006) Decision trees for mining data streams. Intell Data Anal 10:23–45

  • Gama J, Sebastiao R, Rodrigues PP (2009) Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM Press, New York, pp 329–338

  • Grant E, Leavenworth R (1996) Statistical quality control. McGraw-Hill, New York

  • Harries M (1999) Splice-2 comparative evaluation: electricity pricing. Technical report, The University of New South Wales, Sydney

  • Hinkley D (1970) Inference about the change point from cumulative sum-tests. Biometrika 58:509–523

  • Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 97–106

  • Katakis I, Tsoumakas G, Banos E, Bassiliades N, Vlahavas I (2009) An adaptive personalized news dissemination system. J Intell Inf Syst 32:191–212

  • Klinkenberg R (2004) Learning drifting concepts: example selection vs. example weighting. Intell Data Anal 8(3):281–300

  • Kolter JZ, Maloof MA (2003) Dynamic weighted majority: a new ensemble method for tracking concept drift. In: Proceedings of the 3rd international IEEE conference on data mining. IEEE Computer Society, New York, pp 123–130

  • Kosina P, Gama J (2012a) Handling time changing data with adaptive very fast decision rules. In: Proceedings of the 2012 European conference on machine learning and knowledge discovery in databases, ECML PKDD’12, vol I. Springer, Berlin, Heidelberg, pp 827–842

  • Kosina P, Gama J (2012b) Very fast decision rules for multi-class problems. In: Proceedings of the 2012 ACM symposium on applied computing. ACM Press, New York, pp 795–800

  • Lindgren T, Boström H (2004) Resolving rule conflicts with double induction. Intell Data Anal 8(5):457–468

  • Maloof M, Michalski R (2004) Incremental learning with partial instance memory. Artif Intell 154:95–126

  • Moro S, Laureano R, Cortez P (2011) Using data mining for bank direct marketing: an application of the crisp-dm methodology. In: Proceedings of the European simulation and modelling conference, ESM’2011. EUROSIS, Guimaraes, pp 117–121

  • Nemenyi P (1963) Distribution-free multiple comparisons. PhD thesis, Princeton University

  • Oza NC, Russell S (2001) Online bagging and boosting. In: Artificial intelligence and statistics 2001. Morgan Kaufmann, San Mateo, pp 105–112

  • Quinlan JR (1991) Determinate literals in inductive logic programming. In: Proceedings of the 12th international joint conference on artificial intelligence, IJCAI’91, vol 2. Morgan Kaufmann Publishers Inc, San Francisco, pp 746–750

  • Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers, San Mateo

  • Rivest R (1987) Learning decision lists. Mach Learn 2:229–246

  • Schlimmer JC, Granger RH (1986) Incremental learning from noisy data. Mach Learn 1:317–354

  • Shaker A, Hüllermeier E (2012) IBLStreams: a system for instance-based classification and regression on data streams. Evol Syst 3:235–249

  • Street WN, Kim Y (2001) A streaming ensemble algorithm SEA for large-scale classification. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’01. ACM Press, New York, pp 377–382

  • Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03. ACM Press, New York, pp 226–235

  • Weiss SM, Indurkhya N (1998) Predictive data mining: a practical guide. Morgan Kaufmann Publishers, San Francisco

  • Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23:69–101

Acknowledgments

The authors would like to express their gratitude to the reviewers of previous versions of the paper. This work is partially funded by FCT - Fundação para a Ciência e a Tecnologia / MEC - Ministério da Educação e Ciência through National Funds (PIDDAC) and by the ERDF - European Regional Development Fund through the ON2 North Portugal Regional Operational Programme, within the project Knowledge Discovery from Ubiquitous Data Streams FCT-KDUS (PTDC/EIA/098355/2008), NORTE-07-0124-FEDER-000059. The authors also acknowledge the support of the European Commission through the project MAESTRA (Grant Number ICT-2013-612944). Petr Kosina also acknowledges the support of the Faculty of Informatics, MU, Brno.

Author information

Correspondence to João Gama.

Additional information

Responsible editor: Johannes Fürnkranz.

Appendices

Appendix 1: Datasets

In this section we describe the datasets used in the experiments. We have used large-scale artificial and real-world datasets. The real-world datasets were previously used in other works testing on-line learning algorithms: they are large and likely to contain drifts, although the presence and nature of such drifts are not known.

1.1 Artificial datasets

The artificial datasets are obtained using the generators proposed by Bifet et al. (2010); each generator was used to produce five datasets with different random seeds. In the hyperplane dataset the class is given by a rotating hyperplane (Hulten et al. 2001). A hyperplane in d-dimensional space is the set of points x that satisfy \(\sum \nolimits _{i=1}^{d} w_i x_i = w_0\), where \(x_i\) is the \(i\)th coordinate of x. Points with \(\sum \nolimits _{i=1}^{d} w_i x_i \ge w_0\) represent the positive concept and points with \(\sum \nolimits _{i=1}^{d} w_i x_i < w_0\) the negative one. This set of 100,000 examples has two classes and ten attributes, five of which change at speed 0.01, with 5 % noise (the probability for each instance to have its class inverted).
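
For concreteness, the following Python sketch shows how such a stream can be generated. It is illustrative only: the uniform attribute domain, the threshold \(w_0\) set to half the total weight, and the fixed per-weight drift directions are assumptions loosely following MOA's conventions, not details specified above.

```python
import numpy as np

def hyperplane_stream(n=100_000, d=10, n_drift=5, speed=0.01, noise=0.05, seed=1):
    """Sketch of a rotating-hyperplane stream (Hulten et al. 2001).

    Attributes are assumed uniform in [0, 1]; the label is positive when
    sum_i w_i * x_i >= w_0, with w_0 taken as half the total weight (an
    assumed convention). The first `n_drift` weights move by `speed` per
    example, slowly rotating the boundary, and each label is inverted
    with probability `noise`.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(0, 1, d)
    direction = rng.choice([-1.0, 1.0], n_drift)  # fixed drift directions (assumed)
    for _ in range(n):
        x = rng.uniform(0, 1, d)
        w0 = w.sum() / 2.0                        # threshold recomputed as w drifts
        y = int(np.dot(w, x) >= w0)
        if rng.random() < noise:                  # class noise: invert the label
            y = 1 - y
        w[:n_drift] += direction * speed          # gradual concept drift
        yield x, y
```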

Another artificial dataset is SEA concepts (Street and Kim 2001), commonly used in stream mining tasks that require time-changing data. It is a two-class problem, defined by three attributes (two of them relevant) and 10 % noise (of the same kind as above). The domain of the attributes is \(x_i \in [0,10]\), where \(i=1,2,3\). The target concept is \(x_1 + x_2 \le \beta \), where \(\beta \in \{7,8,9,9.5\}\). There are four concepts; each spans 15,000 examples, for a total of 60,000 in the whole dataset.
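
A minimal sketch of this generator follows; the concept order (8, 9, 7, 9.5) is the one used in the original SEA paper, and the rest is a direct transcription of the description above.

```python
import numpy as np

def sea_stream(betas=(8.0, 9.0, 7.0, 9.5), block=15_000, noise=0.10, seed=1):
    """Sketch of the SEA concepts stream (Street and Kim 2001).

    Three attributes uniform in [0, 10]; only x1 and x2 are relevant.
    The label is positive when x1 + x2 <= beta, and beta changes every
    `block` examples, giving four abrupt concepts. Each label is
    inverted with probability `noise`.
    """
    rng = np.random.default_rng(seed)
    for beta in betas:
        for _ in range(block):
            x = rng.uniform(0, 10, 3)      # x3 is irrelevant to the concept
            y = int(x[0] + x[1] <= beta)
            if rng.random() < noise:       # class noise
                y = 1 - y
            yield x, y
```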

LED is formed by examples (Breiman et al. 1984) with \(\{0, 1\}\) values for each attribute, signaling whether a given LED segment is off or on. Only seven of the 24 attributes are relevant. The class label reflects the digit (0–9) displayed by the diodes. 10 % noise is added to this dataset (the probability, for each attribute, that its value is inverted). The generated set has 200,000 instances. Drift in this dataset is caused by swapping relevant attributes with irrelevant ones.
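
A sketch of the stationary LED generator, assuming one standard 7-segment encoding (the exact attribute layout of the original dataset may differ); drift would be simulated by permuting which attribute positions hold the seven relevant segments.

```python
import numpy as np

# One standard 7-segment encoding of digits 0-9 (segments a-g); assumed layout.
SEGMENTS = np.array([
    [1,1,1,1,1,1,0], [0,1,1,0,0,0,0], [1,1,0,1,1,0,1], [1,1,1,1,0,0,1],
    [0,1,1,0,0,1,1], [1,0,1,1,0,1,1], [1,0,1,1,1,1,1], [1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1], [1,1,1,1,0,1,1],
])

def led_stream(n=200_000, noise=0.10, n_irrelevant=17, seed=1):
    """Sketch of the LED stream (Breiman et al. 1984): 7 relevant segment
    attributes plus 17 irrelevant random bits, 24 attributes in total;
    each segment value is inverted with probability `noise`."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        digit = rng.integers(10)                    # class label 0-9
        segs = SEGMENTS[digit].copy()
        flip = rng.random(7) < noise                # attribute noise
        segs[flip] ^= 1
        irrelevant = rng.integers(0, 2, n_irrelevant)
        yield np.concatenate([segs, irrelevant]), digit
```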

The goal in the Waveform dataset is to recognize three different classes of waveform, each generated from a combination of two of three base waves. The optimal Bayes classification rate is known to be 86 %. The dataset has 21 numeric attributes, all of which include noise, and consists of 100,000 examples. Drift switches the positions (attributes) of the generated attribute values.
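
For reference, a sketch of the underlying stationary generator following the construction in Breiman et al. (1984): three triangular base waves over 21 positions, with each class a random convex combination of a different pair of them. The base-wave centers (7, 15, 11) are taken from that construction, not from the text above.

```python
import numpy as np

def waveform_stream(n=100_000, seed=1):
    """Sketch of the waveform generator (Breiman et al. 1984).

    Each class mixes a different pair of the three triangular base
    waves with a random weight u, then adds standard Gaussian noise
    to every one of the 21 attributes.
    """
    rng = np.random.default_rng(seed)
    i = np.arange(1, 22)
    h = [np.maximum(6 - np.abs(i - c), 0) for c in (7, 15, 11)]  # base waves
    pairs = [(0, 1), (0, 2), (1, 2)]                             # class -> wave pair
    for _ in range(n):
        y = rng.integers(3)
        a, b = pairs[y]
        u = rng.random()                                         # mixing weight
        x = u * h[a] + (1 - u) * h[b] + rng.standard_normal(21)  # additive noise
        yield x, y
```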

The radial basis function (RBF) generator creates a fixed number of random centroids. Each center has a random position, a single standard deviation, a class label and a weight. A new example is generated from a randomly selected center; the weights are taken into account, so centers with higher weight are more likely to be chosen. A random direction is then chosen to offset the attribute values from the central point. The displacement length is drawn from a Gaussian distribution with standard deviation determined by the chosen centroid, which also determines the class label of the example. The generated RBF datasets have ten numerical attributes, 50 centers and two classes. The number of examples is 100,000, the speed of change of the centroids is 0.0001, and all 50 centroids drift.
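
A sketch matching this description; the attribute domain, the range of the standard deviations, and the drift mechanics (a fixed random unit direction per centroid) are assumptions, as the text does not pin them down.

```python
import numpy as np

def rbf_stream(n=100_000, d=10, n_centers=50, n_classes=2,
               drift_speed=0.0001, seed=1):
    """Sketch of the RBF stream with drifting centroids.

    Each centroid has a random position, weight, standard deviation and
    class label. A centroid is sampled proportionally to its weight; the
    example is offset from it in a random direction by a Gaussian
    displacement, and every centroid moves by `drift_speed` per example.
    """
    rng = np.random.default_rng(seed)
    centers = rng.uniform(0, 1, (n_centers, d))           # assumed domain [0, 1]^d
    stds = rng.uniform(0, 0.1, n_centers)                 # assumed SD range
    labels = rng.integers(n_classes, size=n_centers)
    weights = rng.uniform(0, 1, n_centers)
    probs = weights / weights.sum()
    move = rng.standard_normal((n_centers, d))
    move /= np.linalg.norm(move, axis=1, keepdims=True)   # unit drift directions
    for _ in range(n):
        c = rng.choice(n_centers, p=probs)                # weight-biased choice
        direction = rng.standard_normal(d)
        direction /= np.linalg.norm(direction)
        radius = rng.normal(0, stds[c])                   # Gaussian displacement
        x = centers[c] + radius * direction
        centers += move * drift_speed                     # all centroids drift
        yield x, labels[c]
```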

1.2 Real-world datasets

The real-world datasets are large and are used with the examples ordered as they were collected, since they are likely to contain drift. The intrusion detection dataset from KDDCUP 99, obtained from the UCI repository (Frank and Asuncion 2010), describes connections labeled either as normal or as one of four categories of attack. The dataset consists of 4,898,431 instances.

The next dataset is forestCovtype, also from the UCI repository (Frank and Asuncion 2010), which has 54 cartographic attributes, both continuous and categorical. The goal is to predict the forest cover type for a given area. The dataset contains 581,012 instances.

The elec dataset (Harries 1999) contains data collected from the electricity market of New South Wales, Australia. It has 45,312 instances.

The task in the Airlines dataset, based on data from Data Expo (2009), is to predict whether a flight will be delayed, given seven attributes describing the scheduled departure. It consists of 539,383 instances.

The connect-4 dataset from the UCI repository (Frank and Asuncion 2010) consists of 42 categorical attributes and contains 67,557 examples.

The pokerhand dataset (Frank and Asuncion 2010) consists of 829,201 instances with ten predictive attributes. Each example represents a hand of five playing cards drawn from a standard deck of 52; each card is described by two attributes (suit and rank), and the class describes the poker hand. This dataset was modified so that the cards are sorted by rank and suit, and duplicates were removed, as sketched below.
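
The preprocessing amounts to canonicalizing each hand and dropping repeats. A sketch follows; the exact sort key used by the authors is not specified, so rank-then-suit ordering is assumed here.

```python
def canonicalize(hand):
    """Sort a 5-card hand, given as 10 values (suit, rank) x 5, by rank
    then suit, so that permuted but equivalent hands share one key."""
    cards = sorted(zip(hand[::2], hand[1::2]), key=lambda c: (c[1], c[0]))
    return tuple(v for card in cards for v in card)

def deduplicate(examples):
    """Drop examples whose canonical hand has already been seen."""
    seen, out = set(), []
    for features, label in examples:
        key = canonicalize(features)
        if key not in seen:
            seen.add(key)
            out.append((list(key), label))
    return out

# Two orderings of the same pair-of-kings hand collapse to one example:
hands = [([1, 13, 2, 13, 3, 4, 4, 7, 1, 2], 1),
         ([2, 13, 1, 13, 3, 4, 1, 2, 4, 7], 1)]
assert len(deduplicate(hands)) == 1
```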

The bank dataset (Moro et al. 2011) is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed. The classification task is to predict whether the client will subscribe to a term deposit. The full dataset has 45,211 examples with 16 attributes.

Katakis et al. (2009) presented the spam dataset, a real-world text data stream chronologically ordered to represent the evolution of spam messages over time. There are two classes, legitimate and spam messages, and 9,324 examples with 500 attributes.

Appendix 2: Results from tests on shuffled real world datasets

See Appendix Tables 16, 17 and 18.

Table 16 Prequential error rates of the classifiers on shuffled real world data
Table 17 The p values of paired t tests on shuffled real world datasets
Table 18 Number of rules of rule classifiers and leaves of VFDTc on stationary data and shuffled real world datasets

About this article

Cite this article

Kosina, P., Gama, J. Very fast decision rules for classification in data streams. Data Min Knowl Disc 29, 168–202 (2015). https://doi.org/10.1007/s10618-013-0340-z
