The VLDB Journal, Volume 18, Issue 3, pp 809–835

Top-k typicality queries and efficient query answering methods on large databases

  • Ming Hua
  • Jian Pei
  • Ada W. C. Fu
  • Xuemin Lin
  • Ho-Fung Leung
Regular Paper

Abstract

Finding typical instances is an effective approach to understanding and analyzing large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically and formulate three types of top-k typicality queries. To answer questions like “Who are the top-k most typical NBA players?”, the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Moreover, to answer questions like “Which k typical guards together best represent the different types of guards?”, the notion of representative typicality is used. Computing the exact answer to a top-k typicality query requires quadratic time, which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity, though it provides no theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; (3) a local typicality tree data structure can be exploited to index a large set of objects, so that typicality queries can be answered efficiently with quality guarantees by a tournament method based on the local typicality tree. An extensive performance study using two real data sets and a series of synthetic data sets shows that top-k typicality queries are meaningful and that our methods are practical.
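
To make the cost contrast in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not give the typicality formula, so the sketch assumes that simple typicality is scored by a Gaussian kernel density estimate over one-dimensional values with a fixed bandwidth `h`. The function names (`simple_typicality`, `top_k_exact`, `tournament_top1`) and the parameter `group_size` are hypothetical. The exact computation compares every pair of points, which is where the quadratic time comes from. The tournament variant scores points only inside small random groups, so its total work is roughly linear in the number of points, but it offers no quality guarantee.

```python
import math
import random


def simple_typicality(points, h=1.0):
    """Score each 1-D point by a Gaussian kernel density estimate.

    Assumption: typicality is a kernel density estimate with bandwidth h.
    Every point is compared with every other point, so this is O(n^2).
    """
    n = len(points)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-((p - q) ** 2) / (2.0 * h * h)) for q in points)
        for p in points
    ]


def top_k_exact(points, k, h=1.0):
    """Return the k points with the highest typicality scores (exact, quadratic)."""
    scores = simple_typicality(points, h)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in order[:k]]


def tournament_top1(points, group_size=4, h=1.0, rng=None):
    """Approximate the single most typical point by a randomized tournament.

    Each round shuffles the survivors, splits them into groups of at most
    group_size, and keeps the point that is most typical within its own group.
    Rounds shrink geometrically, so total work is about O(n * group_size).
    """
    if not points:
        raise ValueError("points must be non-empty")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rng = rng or random.Random(0)
    survivors = list(points)
    while len(survivors) > 1:
        rng.shuffle(survivors)
        next_round = []
        for start in range(0, len(survivors), group_size):
            group = survivors[start:start + group_size]
            scores = simple_typicality(group, h)  # scored within the group only
            best = max(range(len(group)), key=lambda i: scores[i])
            next_round.append(group[best])
        survivors = next_round
    return survivors[0]
```

Calling `top_k_exact(values, k)` gives the exact ranking under the assumed scoring, while `tournament_top1(values)` trades accuracy for speed. Because each group sees only a random subset, an unlucky split can promote a point that is not globally typical, and repeating the tournament with different seeds is one way to reduce that risk.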

Keywords

Typicality analysis, Top-k query, Efficient query answering

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  • Ming Hua (1)
  • Jian Pei (1)
  • Ada W. C. Fu (2)
  • Xuemin Lin (3, 4)
  • Ho-Fung Leung (2)
  1. Simon Fraser University, Burnaby, Canada
  2. The Chinese University of Hong Kong, Shatin, Hong Kong, China
  3. The University of New South Wales, Sydney, Australia
  4. NICTA, Sydney, Australia
