Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification

Tennant, Mark; Stahl, Frederic; Di Fatta, Giuseppe; Gomes, João Bártolo

doi:10.1007/978-3-319-12069-0_4

Included in the following conference series:

International Conference on Innovative Techniques and Applications of Artificial Intelligence

Abstract

Advances in hardware technologies allow data to be captured and processed in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) develops data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a fast rate. This is problematic if the application requires little or no delay between changes in the patterns of the stream and the absorption of these patterns by the classifier. Scalability problems of traditional data mining algorithms on static (non-streaming) Big Data sets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
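
The abstract does not describe the algorithm itself, so the following is only a minimal, sequential sketch of the general idea of KNN on a data stream, not the authors' method. It assumes a fixed-size sliding window as the adaptation mechanism and a test-then-train evaluation loop. The names SlidingWindowKNN, learn_one, predict_one and prequential_accuracy are hypothetical and chosen for illustration.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Illustrative KNN classifier over a window of the most recent instances.

    Assumption: adaptation is achieved purely by forgetting old instances.
    """

    def __init__(self, k=3, window_size=100):
        self.k = k
        # A deque with maxlen discards the oldest instance automatically,
        # so outdated patterns leave the model as new data arrives.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        """Absorb one labelled instance into the window."""
        self.window.append((tuple(x), y))

    def predict_one(self, x):
        """Majority vote among the k nearest instances currently in the window."""
        if not self.window:
            return None  # nothing learned yet
        x = tuple(x)
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        prediction = model.predict_one(x)
        if prediction is not None:
            total += 1
            correct += int(prediction == y)
        model.learn_one(x, y)
    return correct / total if total else 0.0
```

Each prediction scans the whole window, so its cost grows linearly with the window size. This distance computation is the natural part to distribute across workers when the stream arrives faster than a single process can handle.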

Author information

Authors and Affiliations

School of Systems Engineering, University of Reading, Whiteknights, Reading, RG6 6AY, UK
Mark Tennant, Frederic Stahl & Giuseppe Di Fatta
Institute for Infocomm Research (I2R), A*STAR, 1 Fusionopolis Way Connexis, 138632, Singapore, Singapore
João Bártolo Gomes

Corresponding author

Correspondence to Mark Tennant.

Editor information

Editors and Affiliations

University of Portsmouth, Portsmouth, United Kingdom
Max Bramer
School of Computing, Eng & Mathematics, University of Brighton, Brighton, West Sussex, United Kingdom
Miltos Petridis

About this paper

Cite this paper

Tennant, M., Stahl, F., Di Fatta, G., Gomes, J.B. (2014). Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXI. SGAI 2014. Springer, Cham. https://doi.org/10.1007/978-3-319-12069-0_4

DOI: https://doi.org/10.1007/978-3-319-12069-0_4
Published: 30 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12068-3
Online ISBN: 978-3-319-12069-0
eBook Packages: Computer Science, Computer Science (R0)
