Entropy-based outlier detection using spark

Feng, Guilan; Li, Zhengnan; Zhou, Wengang; Dong, Shi

doi:10.1007/s10586-019-02932-2

Published: 16 April 2019

Cluster Computing, volume 23, pages 409–419 (2020)

Abstract

k-nearest-neighbors (kNN) outlier detection is a simple, effective, and widely used method in data mining. Applying it directly to big data is infeasible because of time and memory constraints. Several distributed alternatives based on MapReduce have been proposed to let the method handle large-scale data, but their performance can be improved further with designs that suit newer technologies. Moreover, kNN outlier detection gives every attribute the same importance when identifying outliers. Several approaches improve its precision, and entropy-based outlier detection is among the most successful. Entropy-based outlier detection computes the entropy of each attribute in the data set and uses it to weight the distance formula used for outlier detection. Although kNN outlier detection methods for big datasets exist, no entropy-based outlier detection method handles data of that volume. In this paper, we propose an entropy-based outlier detection method built on Spark. It consists of three separate stages. The first stage computes the attribute entropy. The second stage finds the k nearest neighbors and calculates each point's degree of outlierness using the attribute entropy computed previously. The third stage ranks the points by their degree of outlierness and declares the top n points in this ranking to be outliers. Extensive experimental results show the advantages of the proposed method. The algorithm can improve outlier detection precision, reduce runtime, and make outlier detection on large-scale datasets practical.
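
The three stages described above can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation, and the abstract does not give the formulas, so several details are assumptions made only for illustration: numeric attributes are discretized by equal-width binning before computing Shannon entropy, weights are the entropies normalized to sum to 1 (higher entropy gets more weight), the distance is a weighted Euclidean distance, and the degree of outlierness is the sum of distances to the k nearest neighbors. All function names are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(column, bins=4):
    """Stage 1 helper: Shannon entropy (in bits) of one numeric attribute.

    Assumption: values are discretized with equal-width binning.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        return 0.0  # a constant attribute carries no information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(points, bins=4):
    """Stage 1: one weight per attribute, normalized to sum to 1 (assumed scheme)."""
    dims = len(points[0])
    entropies = [attribute_entropy([p[j] for p in points], bins) for j in range(dims)]
    total = sum(entropies)
    if total == 0:
        return [1.0 / dims] * dims  # fall back to equal weights
    return [e / total for e in entropies]


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two points (assumed distance form)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def outlier_degrees(points, k, weights):
    """Stage 2: degree of outlierness = sum of distances to the k nearest neighbors."""
    degrees = []
    for i, p in enumerate(points):
        dists = sorted(
            weighted_distance(p, q, weights) for j, q in enumerate(points) if j != i
        )
        degrees.append(sum(dists[:k]))
    return degrees


def top_n_outliers(points, k, n, bins=4):
    """Stage 3: rank points by degree and return the indices of the top n."""
    weights = entropy_weights(points, bins)
    degrees = outlier_degrees(points, k, weights)
    ranked = sorted(range(len(points)), key=lambda i: degrees[i], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_n_outliers(data, k=2, n=1))
```

In a distributed setting, the brute-force neighbor search in `outlier_degrees` is the expensive step, since it compares every pair of points. That quadratic cost is what a Spark-based design would need to partition across workers.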

Acknowledgements

This work was supported by the Civil Aviation Flight Data Analysis under No. XM2852 and the Key Scientific and Technological Research Projects in Henan Province (Grant No. 192102210125).

Author information

Authors and Affiliations

Modern Education Technology Center, Civil Aviation Flight University of China, Guanghan, 618307, China
Guilan Feng
Institute of Aviation Engineering, Civil Aviation Flight University of China, Guanghan, 618307, China
Zhengnan Li
Institute of Flight Technology, Civil Aviation Flight University of China, Guanghan, 618307, China
Wengang Zhou
School of Computer Science and Technology, Zhoukou Normal University, Zhoukou, 466001, China
Shi Dong

Corresponding author

Correspondence to Shi Dong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Feng, G., Li, Z., Zhou, W. et al. Entropy-based outlier detection using spark. Cluster Comput 23, 409–419 (2020). https://doi.org/10.1007/s10586-019-02932-2

Received: 10 October 2018
Revised: 04 March 2019
Accepted: 12 April 2019
Published: 16 April 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10586-019-02932-2
