Non-standard Distances in High Dimensional Raw Data Stream Classification
In this paper, we present a new approach to classifying high-dimensional raw (or close to raw) data streams. It is based on the k-nearest neighbour (kNN) classifier. The novelty of the proposed solution lies in its non-standard distances, which are computed with compression and hashing methods. We use the term “non-standard” to emphasize how the proposed distances are computed. Standard distances, such as the Euclidean, Manhattan and Mahalanobis distances, are calculated from numerical features that describe the data. The non-standard approach does not have to rely on extracted features; it can work on raw (not preprocessed) data, so the proposed method needs no feature selection or extraction step. Experiments were performed on datasets with more than 1000 features. The results show that in most cases the proposed method performs better than, or similarly to, other standard stream classification algorithms. All experiments and comparisons were performed in the Massive Online Analysis (MOA) environment.
Keywords: Stream classification; High-dimensional data; kNN classifier; Distance; MOA; Data compression; Hashing
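To make the idea of a compression-based distance concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state which distances the paper uses, so the sketch assumes the normalized compression distance (NCD) computed with zlib, and it pairs that distance with a simple sliding-window kNN. The names `ncd` and `SlidingWindowKNN`, and the parameters `k` and `window_size`, are hypothetical and chosen only for illustration. The experiments described above were run in MOA, not with this code.

```python
import zlib
from collections import Counter, deque


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two raw byte strings.

    Assumption: NCD stands in here for the paper's "non-standard" distance.
    No numerical features are extracted; only the raw bytes are compressed.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


class SlidingWindowKNN:
    """kNN over the most recent labelled stream instances (hypothetical helper)."""

    def __init__(self, k: int = 3, window_size: int = 100):
        self.k = k
        # deque with maxlen drops the oldest instance once the window is full
        self.window = deque(maxlen=window_size)

    def learn(self, raw: bytes, label) -> None:
        """Store one labelled instance from the stream."""
        self.window.append((raw, label))

    def predict(self, raw: bytes):
        """Majority label among the k stored instances nearest to `raw`."""
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: ncd(raw, item[0]))[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    clf = SlidingWindowKNN(k=3, window_size=10)
    clf.learn(b"the quick brown fox jumps over the lazy dog " * 20, "text")
    clf.learn(b"the quick brown fox jumps over the lazy dog! " * 20, "text")
    clf.learn(b"0123456789" * 80, "digits")
    clf.learn(b"9876543210" * 80, "digits")
    print(clf.predict(b"the quick brown fox jumps over the lazy cat " * 20))
```

In a stream setting, each arriving instance would typically be passed to `predict` first and to `learn` afterwards (the test-then-train scheme), so the window always holds the most recent labelled data. Sorting the whole window costs one compression per stored instance, which is why the window size is bounded.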
We would like to thank the reviewers for their valuable comments and effort to improve this paper. Computations performed as part of the experiments were carried out at the Computer Center of the University of Bialystok.