Abstract
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find the clouds (i.e., distributions) similar to Q in order to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; each customer is then represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation or anomaly/fraud detection. We address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) an approximate KL divergence, which speeds up cloud-similarity computations, and (2) a multistep sequential scan, which efficiently prunes a significant number of search candidates and thereby directly reduces the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution achieves a wall-clock time up to 2,300 times faster than the naive implementation without sacrificing accuracy.
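To make the problem setting concrete, the following is a minimal sketch of the kind of naive baseline the abstract compares against: each cloud is bucketized into a normalized histogram, and the query is compared with every stored distribution using the symmetric KL divergence. This is an illustration under stated assumptions, not the D-Search algorithm itself: the function names (to_histogram, symmetric_kl, naive_search), the histogram representation and the smoothing constant are all hypothetical choices made here, and the approximation and pruning steps of D-Search are not reproduced.

```python
import numpy as np

def to_histogram(points, bins, value_range):
    """Bucketize a cloud of d-dimensional points into a normalized histogram.

    points: array of shape (num_points, d); value_range: list of (low, high) per dimension.
    """
    hist, _ = np.histogramdd(points, bins=bins, range=value_range)
    # Smoothing constant (an assumption of this sketch) so that log(0) never occurs.
    hist = hist.ravel() + 1e-9
    return hist / hist.sum()

def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p) = sum_i (p_i - q_i) * log(p_i / q_i)."""
    return float(np.sum((p - q) * np.log(p / q)))

def naive_search(query, clouds, k=1):
    """Linear scan: compute the divergence to every distribution and keep the k smallest."""
    dists = [symmetric_kl(query, c) for c in clouds]
    order = np.argsort(dists)[:k]
    return [(int(i), dists[i]) for i in order]
```

For instance, with the sales example from the abstract, each customer's (unit price, quantity) points would be passed through to_histogram with a shared bins and value_range, and naive_search would return the indices of the k customers whose purchase distributions are closest to the query. The cost of this scan grows with both the number of distributions and the number of buckets, which is the cost that the approximation and pruning described in the abstract aim to reduce.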
Cite this article
Matsubara, Y., Sakurai, Y. & Yoshikawa, M. D-Search: an efficient and exact search algorithm for large distribution sets. Knowl Inf Syst 29, 131–157 (2011). https://doi.org/10.1007/s10115-010-0336-6
DOI: https://doi.org/10.1007/s10115-010-0336-6