Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification

  • Conference paper
Research and Development in Intelligent Systems XXXI (SGAI 2014)

Abstract

Advances in hardware technologies make it possible to capture and process data in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) develops data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a fast rate. This is problematic if an application requires little or no delay between a change in the patterns of the stream and the absorption of that change by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non-streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
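
To make the idea of a real-time adaptive KNN classifier concrete, the following is a minimal single-threaded sketch, not the method developed in the paper. It assumes a fixed-size sliding window of recent labelled instances, Euclidean distance, and majority voting; the class name `SlidingWindowKNN` and its parameters `k` and `window_size` are hypothetical and chosen only for illustration.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Illustrative KNN over the most recent labelled instances (hypothetical helper)."""

    def __init__(self, k=3, window_size=100):
        self.k = k
        # A deque with maxlen discards the oldest instance on overflow,
        # so outdated patterns are forgotten as new data arrives.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        """Absorb one labelled instance; no model rebuild is needed."""
        self.window.append((tuple(x), y))

    def predict_one(self, x):
        """Majority vote among the k nearest instances in the window."""
        if not self.window:
            return None
        query = tuple(x)
        nearest = sorted(
            self.window, key=lambda item: math.dist(item[0], query)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Usage: the labels attached to the same region change mid-stream,
# and the window lets the classifier follow the change.
clf = SlidingWindowKNN(k=3, window_size=4)
for point in [(0, 0), (0, 1), (1, 0)]:
    clf.learn_one(point, "a")
before = clf.predict_one((0.1, 0.1))
for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    clf.learn_one(point, "b")
after = clf.predict_one((0.1, 0.1))
print(before, after)
```

Because KNN is a lazy learner, adapting to the stream only means updating the window, while each prediction scans every stored instance. That per-query distance scan is the part that grows with window size and that a parallel approach would aim to distribute.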

Author information

Corresponding author

Correspondence to Mark Tennant.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tennant, M., Stahl, F., Di Fatta, G., Gomes, J.B. (2014). Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXI. SGAI 2014. Springer, Cham. https://doi.org/10.1007/978-3-319-12069-0_4

  • DOI: https://doi.org/10.1007/978-3-319-12069-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12068-3

  • Online ISBN: 978-3-319-12069-0

  • eBook Packages: Computer Science, Computer Science (R0)
