D-Search: an efficient and exact search algorithm for large distribution sets

Matsubara, Yasuko; Sakurai, Yasushi; Yoshikawa, Masatoshi

doi:10.1007/s10115-010-0336-6

Regular Paper
Published: 21 August 2010

Knowledge and Information Systems, Volume 29, pages 131–157 (2011)

Yasuko Matsubara¹,
Yasushi Sakurai² &
Masatoshi Yoshikawa¹

Abstract

Distribution data arise naturally in many domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find the clouds (i.e., distributions) similar to Q in order to discover patterns, rules and outlier clouds. For example, consider the numerical case of item sales, where, for each item sold, we record the unit price and quantity; each customer is then represented as a distribution of 2-d points (one for each item the customer bought). We want to find similar customers, e.g., for market segmentation or anomaly/fraud detection. To address this problem we present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) an approximate KL divergence, which speeds up cloud-similarity computations, and (2) a multistep sequential scan, which prunes a significant number of search candidates and thereby directly reduces the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multidimensional datasets show that our solution runs up to 2,300 times faster in wall-clock time than the naive implementation, without sacrificing accuracy.
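
The abstract does not spell out how the approximate divergence or the multistep scan are computed, so the following is only a minimal sketch of the general filter-and-refine idea, not the authors' implementation. It assumes each distribution is a histogram over the same buckets, uses the symmetric KL divergence, and obtains a cheap lower bound by merging adjacent buckets (merging can never increase the KL divergence, by the log-sum inequality). The bucket-merging bound and all function names (`normalize`, `sym_kl`, `coarsen`, `multistep_nn`) are assumptions made for illustration.

```python
import math


def normalize(hist, eps=1e-9):
    """Add a small eps to every bucket and rescale to sum to 1.

    The smoothing is an assumption of this sketch; it keeps the KL
    divergence finite when a bucket is empty.
    """
    total = sum(hist) + eps * len(hist)
    return [(h + eps) / total for h in hist]


def sym_kl(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def coarsen(p, factor):
    """Merge every run of `factor` adjacent buckets by summing them."""
    return [sum(p[i:i + factor]) for i in range(0, len(p), factor)]


def multistep_nn(query, dataset, factors=(8, 2), tol=1e-12):
    """Exact nearest-neighbor search with coarse-to-fine pruning.

    For each candidate, the divergence is first computed on coarsened
    histograms (coarsest first). Because each coarse value is a lower
    bound on the full divergence, a candidate whose bound already
    exceeds the best distance found so far can be skipped safely.
    Returns (best index, best distance, number of full computations).
    """
    q = normalize(query)
    q_coarse = [coarsen(q, f) for f in factors]
    best_idx, best_dist, full_evals = -1, math.inf, 0
    for idx, hist in enumerate(dataset):
        p = normalize(hist)
        pruned = False
        for f, qc in zip(factors, q_coarse):
            if sym_kl(coarsen(p, f), qc) > best_dist + tol:
                pruned = True  # the lower bound is already too large
                break
        if pruned:
            continue
        full_evals += 1
        d = sym_kl(p, q)  # full-resolution divergence
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist, full_evals


if __name__ == "__main__":
    # Toy data: 10 histograms with 16 buckets each (made-up values).
    data = [[(i * 7 + j * 3) % 11 + 1 for j in range(16)] for i in range(10)]
    q = [(j * 5) % 11 + 1 for j in range(16)]
    idx, dist, evals = multistep_nn(q, data)
    print(f"nearest={idx} distance={dist:.4f} full computations={evals}")
```

Because every pruning test uses a true lower bound, the answer is the same as that of a plain sequential scan; only the number of full-resolution computations changes. Note that the histograms are smoothed before coarsening, so the coarse histograms are exact merges of the fine ones and the bound continues to hold.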

Author information

Authors and Affiliations

Kyoto University, Kyoto, Japan
Yasuko Matsubara & Masatoshi Yoshikawa
NTT Communication Science Labs, Kyoto, Japan
Yasushi Sakurai

Corresponding author

Correspondence to Yasuko Matsubara.

About this article

Cite this article

Matsubara, Y., Sakurai, Y. & Yoshikawa, M. D-Search: an efficient and exact search algorithm for large distribution sets. Knowl Inf Syst 29, 131–157 (2011). https://doi.org/10.1007/s10115-010-0336-6

Received: 05 March 2010
Revised: 08 May 2010
Accepted: 30 May 2010
Published: 21 August 2010
Issue Date: October 2011
DOI: https://doi.org/10.1007/s10115-010-0336-6
