Handbook of Genetic Programming Applications pp 451-480
Evolving GP Classifiers for Streaming Data Tasks with Concept Change and Label Budgets: A Benchmarking Study
Streaming data classification poses several challenges that are not typically encountered in offline supervised learning formulations. Specifically, at any training generation only a small subset of the data is accessible, and the data itself is potentially generated by a non-stationary process. Moreover, requesting labels carries a cost, so a label budget is enforced. Finally, an anytime classification requirement implies that it must be possible to identify a ‘champion’ classifier for predicting labels as the stream progresses. In this work, we propose a general framework for deploying genetic programming (GP) to streaming data classification under these constraints. The framework consists of a sampling policy and an archiving policy that enforce the criteria for selecting data to appear in a data subset. Only the exemplars of the data subset are labeled, and training epochs are performed against the content of the data subset alone. Specific recommendations include support for GP task decomposition/modularity and performing additional training epochs per data subset. Both recommendations significantly improve on the baseline performance of GP under streaming data with label budgets. Benchmarking issues addressed include the identification of datasets and performance measures.
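The interaction between the sampling policy, the archiving policy, the label budget, and the champion can be illustrated with a minimal Python sketch. It is not the chapter's implementation, and everything specific in it is an assumption made for illustration: the stream is presented as non-overlapping windows, the sampling policy is uniform, the archiving policy overwrites a random exemplar once the data subset is full, and the names `update_data_subset`, `run_stream`, `request_label`, `train_epoch`, and `pick_champion` are hypothetical. The GP learner itself is left abstract behind the two callbacks.

```python
import random
from typing import Callable, List, Sequence, Tuple

Exemplar = Tuple[Sequence[float], int]  # (features, label)


def update_data_subset(
    subset: List[Exemplar],
    window: Sequence[Sequence[float]],
    request_label: Callable[[Sequence[float]], int],
    label_budget: float,
    max_size: int,
    rng: random.Random,
) -> int:
    """Sample from the current window, label the samples, archive them.

    Returns the number of labels requested, which is
    int(label_budget * len(window)) and so never exceeds the budget.
    """
    n_labels = int(label_budget * len(window))
    # Sampling policy (assumed): uniform sampling without replacement.
    sampled = rng.sample(list(window), n_labels)
    for x in sampled:
        exemplar = (x, request_label(x))  # the only place a label is paid for
        if len(subset) < max_size:
            subset.append(exemplar)
        else:
            # Archiving policy (assumed): overwrite a random existing exemplar.
            subset[rng.randrange(max_size)] = exemplar
    return n_labels


def run_stream(
    windows,
    request_label,
    train_epoch,    # hypothetical callback: one training epoch on the subset
    pick_champion,  # hypothetical callback: returns the current champion
    label_budget=0.05,
    max_size=20,
    epochs_per_subset=5,
    seed=0,
):
    """Process the stream window by window under a label budget."""
    rng = random.Random(seed)
    subset: List[Exemplar] = []
    labels_used = 0
    champions = []
    for window in windows:
        labels_used += update_data_subset(
            subset, window, request_label, label_budget, max_size, rng
        )
        # Several training epochs per data subset, all on labeled data only.
        for _ in range(epochs_per_subset):
            train_epoch(subset)
        # Anytime requirement: a champion is available after every window.
        champions.append(pick_champion(subset))
    return champions, labels_used
```

In this sketch the learner only ever sees the labeled data subset, so the label cost is fixed by `label_budget` regardless of how many training epochs are run per subset.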