A survey on learning from data streams: current and future trends

Gama, João

doi:10.1007/s13748-011-0002-6

Review
Published: 13 January 2012

Volume 1, pages 45–55 (2012)

Progress in Artificial Intelligence

Abstract

Nowadays, there are applications in which the data are best modeled not as persistent tables but as transient data streams. In this article, we discuss the limitations of current machine learning and data mining algorithms. We discuss the fundamental issues in learning in dynamic environments, such as continuously maintaining learning models that evolve over time, learning and forgetting, concept drift, and change detection. Data streams produce a huge amount of data, which introduces new constraints in the design of learning algorithms: limited computational resources in terms of memory, CPU power, and communication bandwidth. We present some illustrative algorithms, designed to take these constraints into account, for decision-tree learning, hierarchical clustering, and frequent pattern mining. We identify the main issues and current challenges that emerge in learning from data streams and that open research lines for further developments.

Author information

Authors and Affiliations

LIAAD-INESC-Porto LA, and FEP-University of Porto, R. de Ceuta 118-6, 4050, Porto, Portugal
João Gama

Corresponding author

Correspondence to João Gama.

About this article

Cite this article

Gama, J. A survey on learning from data streams: current and future trends. Prog Artif Intell 1, 45–55 (2012). https://doi.org/10.1007/s13748-011-0002-6

Received: 02 February 2011
Accepted: 01 July 2011
Published: 13 January 2012
Issue Date: April 2012
DOI: https://doi.org/10.1007/s13748-011-0002-6
