Abstract
Squashing is a lossy data compression technique that preserves statistical information. Specifically, squashing compresses a massive dataset to a much smaller one so that outputs from statistical analyses carried out on the smaller (squashed) dataset reproduce outputs from the same statistical analyses carried out on the original dataset. Likelihood-based data squashing (LDS) differs from a previously published squashing algorithm insofar as it uses a statistical model to squash the data. The results show that LDS provides excellent squashing performance even when the target statistical analysis departs from the model used to squash the data.
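As a minimal sketch of the general idea, and not the LDS algorithm itself, the toy example below collapses a large dataset into a few weighted pseudo-points and checks that a weighted regression on the squashed data reproduces the regression on the full data. The helper names `squash_by_group` and `weighted_least_squares`, the simulated data, and the choice of a straight-line Gaussian model are all assumptions made for illustration. Grouping on identical predictor values makes the match exact here; in general, squashing only approximates the original outputs.

```python
import numpy as np

def squash_by_group(x, y):
    """Collapse rows sharing the same x into one pseudo-point:
    (x value, mean of y in the group, weight = group size)."""
    xs, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    y_sums = np.bincount(inverse, weights=y)  # per-group sum of y
    return xs, y_sums / counts, counts.astype(float)

def weighted_least_squares(x, y, w):
    """Fit y ~ a + b*x by weighted least squares; returns [a, b]."""
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    sw = np.sqrt(w)  # scaling rows by sqrt(w) turns WLS into ordinary LS
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

# Hypothetical "massive" dataset: 100,000 rows, predictor takes 10 values.
rng = np.random.default_rng(0)
x = rng.integers(0, 10, size=100_000).astype(float)
y = 2.0 + 0.5 * x + rng.normal(size=x.size)

# Analysis on the original data (all weights equal to 1).
full_fit = weighted_least_squares(x, y, np.ones_like(x))

# Same analysis on the squashed data (one weighted row per distinct x).
sx, sy, sw = squash_by_group(x, y)
squashed_fit = weighted_least_squares(sx, sy, sw)

print(len(x), "rows squashed to", len(sx))
print("full fit:    ", full_fit)
print("squashed fit:", squashed_fit)
```

The two fits agree to floating-point precision because the weighted pseudo-points carry the same sufficient statistics for the coefficients as the original rows. Quantities that depend on within-group spread, such as the residual variance, are not preserved by this toy scheme, which is one reason a practical squashing method has to choose its pseudo-points and weights more carefully.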
Cite this article
Madigan, D., Raghavan, N., Dumouchel, W. et al. Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction. Data Mining and Knowledge Discovery 6, 173–190 (2002). https://doi.org/10.1023/A:1014095614948
DOI: https://doi.org/10.1023/A:1014095614948