The VLDB Journal, Volume 18, Issue 2, pp 571–597

Guessing the extreme values in a data set: a Bayesian method and its applications

Special Issue Paper

Abstract

For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, Top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to four specific problems that arise in the context of data management.

Keywords

Sampling, Online aggregation, Monte Carlo, Extreme values, Bayesian

Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Computer and Information Science and Engineering Department, University of Florida, Gainesville, USA
