D-Search: an efficient and exact search algorithm for large distribution sets

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Distribution data arise naturally in many domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find the distributions (or "clouds") similar to Q in order to discover patterns, rules and outlier clouds. For example, consider the numerical case of item sales, where for each item sold we record the unit price and quantity; each customer is then represented as a distribution of 2-d points (one for each item the customer bought). We want to find similar customers, e.g., for market segmentation or anomaly/fraud detection. To address this problem we present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) an approximate KL divergence, which speeds up cloud-similarity computations, and (2) a multistep sequential scan, which efficiently prunes a significant number of search candidates and thereby directly reduces the search cost. We also introduce an extended version of D-Search, (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution runs up to 2,300 times faster in wall-clock time than the naive implementation, without sacrificing accuracy.
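The abstract does not spell out how the approximate KL divergence or the multistep scan are built, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes each distribution is a histogram over the same buckets, uses the symmetric KL divergence as the distance, and substitutes simple merging of adjacent buckets for the paper's approximation. Merging buckets can never increase the KL divergence (a consequence of the log-sum inequality), so the coarse value is a lower bound that allows candidates to be pruned without losing the exact answer. The names `normalize`, `coarsen`, `multistep_nearest` and the constant `EPS` are hypothetical and do not come from the paper.

```python
import numpy as np

EPS = 1e-12  # assumed smoothing constant, avoids log(0) for empty buckets


def normalize(hist):
    """Turn raw bucket counts into a smoothed probability vector."""
    h = np.asarray(hist, dtype=float) + EPS
    return h / h.sum()


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p) = sum((p - q) * log(p / q))."""
    return float(np.sum((p - q) * np.log(p / q)))


def coarsen(p, factor):
    """Merge every `factor` adjacent buckets; len(p) must be divisible by factor."""
    return p.reshape(-1, factor).sum(axis=1)


def multistep_nearest(query, data, factors=(8, 2)):
    """Exact nearest neighbour under symmetric KL with coarse-to-fine pruning.

    `factors` lists the merge factors from coarsest to finest (an assumption
    made for this sketch). Returns (index, distance, number of full-resolution
    distance computations).
    """
    q = normalize(query)
    coarse_q = [coarsen(q, f) for f in factors]
    best_idx, best_dist = -1, float("inf")
    full_computations = 0
    for i, hist in enumerate(data):
        p = normalize(hist)
        # Each coarse distance is a lower bound on the exact one, so a
        # candidate whose bound already reaches best_dist cannot win.
        pruned = any(
            symmetric_kl(coarsen(p, f), cq) >= best_dist
            for f, cq in zip(factors, coarse_q)
        )
        if pruned:
            continue
        full_computations += 1
        d = symmetric_kl(p, q)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, full_computations


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.integers(0, 50, size=(200, 16))   # 200 synthetic histograms
    query = rng.integers(0, 50, size=16)
    idx, dist, full = multistep_nearest(query, data)
    print(f"nearest={idx} distance={dist:.4f} full computations={full}/200")
```

Because every pruning test uses a lower bound of the exact distance, the scan returns the same nearest neighbour as a brute-force pass; the saving comes from skipping the full-resolution computation for candidates that are pruned early. How much is saved depends on the data and on the chosen merge factors.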


Author information

Corresponding author

Correspondence to Yasuko Matsubara.

About this article

Cite this article

Matsubara, Y., Sakurai, Y. & Yoshikawa, M. D-Search: an efficient and exact search algorithm for large distribution sets. Knowl Inf Syst 29, 131–157 (2011). https://doi.org/10.1007/s10115-010-0336-6

  • DOI: https://doi.org/10.1007/s10115-010-0336-6
