Concept Drift Based Multi-dimensional Data Streams Sampling Method

  • Ling Lin
  • Xiaolong Qi
  • Zhirui Zhu
  • Yang Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11439)

Abstract

A summary can greatly reduce the time and space complexity of an algorithm, and summary construction is a research hotspot in data stream mining. Data streams are characterized by continuous arrival, high speed, and large scale, and they cannot be stored completely in memory. A summary is therefore often maintained in memory to answer database queries or data mining tasks approximately. Sampling is a commonly used technique for constructing data stream summaries. Traditional simple random sampling algorithms do not account for concept drift, that is, changes in the data distribution over time. Sampling a summary that preserves the data distribution of a multi-dimensional data stream with concept drift is therefore a challenging task. This study proposes a sampling algorithm that keeps the distribution of the sample consistent with that of a data stream with concept drift. First, the probability of each data stream cell in the reference window is estimated to obtain the data distribution, and probability sampling is performed on the basis of this distribution. Second, a sliding window is used to continuously detect whether the data distribution has changed. If the distribution has not changed, the original sample is retained. Otherwise, the data distribution in the statistical window is re-estimated to form a new sampling probability. The proposed algorithm ensures that the data distribution in the summary remains consistent with the population distribution. We compare our algorithm with state-of-the-art algorithms on synthetic and real data sets. Experimental results demonstrate the effectiveness of our algorithm.
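
The two-step loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how cells are formed, which change test is used, or how the windows are managed, so the sketch assumes a fixed-width grid for cells, the total variation distance with a fixed threshold as the drift test, and non-overlapping windows in place of a true sliding window. All function names and parameters (`cell_of`, `width`, `threshold`, and so on) are hypothetical.

```python
import random
from collections import Counter


def cell_of(point, width):
    """Map a multi-dimensional point to a grid cell (assumed fixed-width grid)."""
    return tuple(int(x // width) for x in point)


def cell_distribution(window, width):
    """Empirical probability of each cell over the points in one window."""
    counts = Counter(cell_of(p, width) for p in window)
    return {c: n / len(window) for c, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two cell distributions (assumed drift test)."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


def sample_by_distribution(window, width, k, rng):
    """Draw roughly k points so each cell's share of the sample matches its share of the window."""
    by_cell = {}
    for p in window:
        by_cell.setdefault(cell_of(p, width), []).append(p)
    sample = []
    for pts in by_cell.values():
        # Every occupied cell gets at least one point, so the sample can exceed k slightly.
        quota = max(1, round(k * len(pts) / len(window)))
        sample.extend(rng.sample(pts, min(quota, len(pts))))
    return sample


def drift_aware_summary(stream, window_size, k, width, threshold, seed=0):
    """Return (summary, number of detected drifts) for a finite stream of points."""
    rng = random.Random(seed)
    reference, summary, window, drifts = None, [], [], 0
    for point in stream:
        window.append(point)
        if len(window) < window_size:
            continue
        current = cell_distribution(window, width)
        if reference is None:
            # First full window acts as the reference window.
            reference = current
            summary = sample_by_distribution(window, width, k, rng)
        elif total_variation(reference, current) > threshold:
            # Distribution changed: re-estimate it and resample the summary.
            drifts += 1
            reference = current
            summary = sample_by_distribution(window, width, k, rng)
        # Otherwise the existing summary is kept unchanged.
        window = []
    return summary, drifts
```

In this sketch the summary is rebuilt only when the cell distribution of the newest window differs from the reference by more than `threshold`. Choosing `width` and `threshold` is left open here, since the abstract gives no values for them.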

Keywords

Data stream clustering, Sampling, Summary

Notes

Acknowledgments

This work is supported by the National Key R&D Program of China (2017YFB0702600, 2017YFB0702601), the National Natural Science Foundation of China (61432008, U1435214, 61503178) and Yili Normal University Project (No. 2016WXDZD001).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. State Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University, Nanjing, China
  2. Electronics and Information Engineering College, Yili Normal University, Yining, China