Outlier Detection over Sliding Windows for Probabilistic Data Streams

Abstract

Outlier detection is useful in many applications where data are uncertain and can be described probabilistically. Although it has been studied intensively for deterministic data, outlier detection is still new in the emerging field of uncertain data. In this paper, we study the semantics of outlier detection on probabilistic data streams and present a new definition of distance-based outliers over sliding windows. We then show that the problem of detecting an outlier over a set of possible world instances is equivalent to the problem of finding the k-th element in its neighborhood. Based on this observation, a dynamic programming algorithm (DPA) is proposed to reduce the detection cost from O(2^|R(e; d)|) to O(|k·R(e; d)|), where R(e; d) is the d-neighborhood of e. Furthermore, we propose a pruning-based approach (PBA) to effectively and efficiently filter non-outliers within a single window and to incrementally detect outliers among the m most recent elements. Finally, detailed analysis and thorough experimental results demonstrate the efficiency and scalability of our approach.
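The cost reduction mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's DPA, whose details are not given here. It assumes that each neighbor in R(e; d) exists independently with a known probability, and that e counts as an outlier in a possible world when fewer than k of its neighbors exist. The function names `prob_fewer_than_k` and `prob_fewer_than_k_bruteforce` are hypothetical. Under these assumptions, enumerating every possible world takes time exponential in the neighborhood size, while a dynamic program over "exactly j neighbors exist" needs only about k times the neighborhood size.

```python
from itertools import product


def prob_fewer_than_k(neighbor_probs, k):
    """Probability that fewer than k neighbors exist (assumed independent).

    Runs in O(k * n) time for n neighbors.
    """
    if k <= 0:
        return 0.0
    # dp[j] = probability that exactly j of the neighbors seen so far exist.
    # Only j < k is tracked; mass for counts >= k is dropped on purpose.
    dp = [1.0] + [0.0] * (k - 1)
    for p in neighbor_probs:
        # Update from high j to low j so dp[j - 1] is still the old value.
        for j in range(k - 1, 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= 1.0 - p
    return sum(dp)


def prob_fewer_than_k_bruteforce(neighbor_probs, k):
    """Same quantity by enumerating all 2^n possible worlds (reference only)."""
    total = 0.0
    for world in product([0, 1], repeat=len(neighbor_probs)):
        if sum(world) < k:
            weight = 1.0
            for present, p in zip(world, neighbor_probs):
                weight *= p if present else 1.0 - p
            total += weight
    return total


if __name__ == "__main__":
    probs = [0.9, 0.4, 0.7]  # hypothetical existence probabilities
    print(prob_fewer_than_k(probs, 2))
    print(prob_fewer_than_k_bruteforce(probs, 2))
```

An element would then be reported as an outlier when this probability exceeds some user-chosen threshold; the threshold and the independence assumption are part of this sketch, not statements about the paper's definition.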

Author information

Correspondence to Bin Wang.

Additional information

This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 60973020, 60828004, and 60933001, the Program for New Century Excellent Talents in University of China under Grant No. NCET-06-0290, and the Fundamental Research Funds for the Central Universities under Grant No. N090504004.

About this article

Cite this article

Wang, B., Yang, X., Wang, G. et al. Outlier Detection over Sliding Windows for Probabilistic Data Streams. J. Comput. Sci. Technol. 25, 389–400 (2010). https://doi.org/10.1007/s11390-010-9332-2

Keywords

  • outlier detection
  • uncertain data
  • probabilistic data stream
  • sliding window