Data Streaming with Affinity Propagation

  • Xiangliang Zhang
  • Cyril Furtlehner
  • Michèle Sebag
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5212)

Abstract

This paper proposes StrAP (Streaming AP), which extends Affinity Propagation (AP) to data streaming. AP, a recent clustering algorithm, uses message passing to extract the data items, called exemplars, that best represent the dataset. StrAP is built in several steps. The first (Weighted AP) extends AP to weighted items with no loss of generality. The second (Hierarchical WAP) reduces the quadratic complexity of AP by applying AP to data subsets and then applying Weighted AP to the exemplars extracted from all subsets. Finally, StrAP extends Hierarchical WAP to handle changes in the data distribution. Experiments on artificial datasets, on the Intrusion Detection benchmark (KDD99), and on a real-world problem, clustering the stream of jobs submitted to the EGEE grid system, provide a comparative validation of the approach.
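To make the message-passing step concrete, the following is a minimal sketch of plain (unweighted) AP, using the standard responsibility and availability updates from Frey and Dueck. It is not the authors' StrAP code: the weighting, the hierarchical split, and the change detection described in the abstract are not shown. The function name, the damping value, the fixed iteration count, the toy 1-D data, and the choice of the median similarity as the preference are all assumptions made for illustration.

```python
import numpy as np


def affinity_propagation(S, damping=0.7, max_iter=200):
    """Plain AP on a float similarity matrix S; S[k, k] holds the preference of item k.

    Illustrative sketch only (hypothetical helper, fixed number of iterations,
    no convergence check).
    """
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new

    # Items with positive self-evidence r(k,k) + a(k,k) are taken as exemplars.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    # Every item joins its most similar exemplar; exemplars represent themselves.
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels


# Toy data (assumed for illustration): two well-separated groups on a line.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
S = -(x[:, None] - x[None, :]) ** 2                # negative squared distance
off_diagonal = S[~np.eye(len(x), dtype=bool)]
np.fill_diagonal(S, np.median(off_diagonal))       # shared preference
exemplars, labels = affinity_propagation(S)
print("exemplars:", exemplars)
print("labels:", labels)
```

The sketch stores two dense n-by-n message matrices, which illustrates the quadratic cost that the Hierarchical WAP step is meant to reduce.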

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Xiangliang Zhang (1)
  • Cyril Furtlehner (1)
  • Michèle Sebag (1)
  1. TAO − INRIA CNRS, Orsay Cedex, France
