Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction

Madigan, David; Raghavan, Nandini; Dumouchel, William; Nason, Martha; Posse, Christian; Ridgeway, Greg

doi:10.1023/A:1014095614948

Published: April 2002

Volume 6, pages 173–190 (2002)

Data Mining and Knowledge Discovery

David Madigan¹,
Nandini Raghavan²,
William Dumouchel²,
Martha Nason³,
Christian Posse³ &
Greg Ridgeway⁴

Abstract

Squashing is a lossy data compression technique that preserves statistical information. Specifically, squashing compresses a massive dataset to a much smaller one so that outputs from statistical analyses carried out on the smaller (squashed) dataset reproduce outputs from the same statistical analyses carried out on the original dataset. Likelihood-based data squashing (LDS) differs from a previously published squashing algorithm insofar as it uses a statistical model to squash the data. The results show that LDS provides excellent squashing performance even when the target statistical analysis departs from the model used to squash the data.
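
The core claim, that a much smaller dataset can reproduce the output of an analysis run on the full data, can be illustrated with a toy case. The sketch below is not the LDS algorithm (the abstract does not describe its steps); it is a minimal example under assumed conditions: a single discrete predictor, a linear regression as the target analysis, and one weighted pseudo-point per predictor level. The function names `fit_wls` and `squash_by_group` are hypothetical.

```python
import numpy as np

def fit_wls(X, y, w):
    """Weighted least squares via the normal equations (X'WX) b = X'Wy."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

def squash_by_group(x, y):
    """Collapse all rows sharing the same x into one weighted pseudo-point.

    Each pseudo-point carries the group mean of y and a weight equal to
    the group size. Within-group scatter is discarded, so this is lossy.
    """
    levels, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=y)
    return levels, sums / counts, counts.astype(float)

# Simulated "massive" dataset (assumed for illustration only).
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=10_000).astype(float)
y = 1.5 + 2.0 * x + rng.normal(size=x.size)

# Analysis on the original data: ordinary least squares (unit weights).
X_full = np.column_stack([np.ones_like(x), x])
beta_full = fit_wls(X_full, y, np.ones_like(y))

# Same analysis on the squashed data: 5 weighted rows instead of 10,000.
xs, ys, ws = squash_by_group(x, y)
X_sq = np.column_stack([np.ones_like(xs), xs])
beta_squashed = fit_wls(X_sq, ys, ws)

print(len(xs), beta_full, beta_squashed)
```

In this special case the coefficient estimates agree up to floating-point error, because the normal equations depend on the data only through the group counts and the group sums of y. Other outputs, such as the residual variance, are not preserved by this toy scheme, which is the sense in which the compression is lossy.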

References

Aha, D.W., Kibler, D., and Albert, M.K. 1991. Instance-based learning algorithms. Machine Learning, 6:37–66.
Box, G.E.P. and Draper, N.R. 1987. Empirical Model Building and Response Surfaces. New York: John Wiley & Sons.
Box, G.E.P., Hunter, W.G., and Hunter, J.S. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: John Wiley & Sons.
Bradley, P.S., Fayyad, U., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 9–15.
Breiman, L. and Friedman, J. 1984. Tools for large data set analysis. In Statistical Signal Processing, Edward J. Wegman and James G. Smith (Eds.). New York: M. Dekker, pp. 191–197.
Catlett, J. 1991. Megainduction: A test flight. In Proceedings of the Eighth International Workshop on Machine Learning, pp. 596–599.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. 1999. Squashing flat files flatter. In Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, pp. 6–15.
Furnival, G.M. and Wilson, R.W. 1974. Regression by leaps and bounds. Technometrics, 16:499–511.
Gibson, G.A., Vitter, J.S., and Wilkes, J. 1996. Report of the working group on storage I/O issues in large-scale computing. ACM Computing Surveys, 28:779–793.
Lawless, J. and Singhal, K. 1978. Efficient screening of nonnormal regression models. Biometrics, 34:318–327.
Provost, F. and Kolluri, V. 1999. A survey of methods for scaling up inductive algorithms. Journal of Data Mining and Knowledge Discovery, 3:131–169.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461–464.
Syed, N.A., Liu, H., and Sung, K.K. 1999. A study of support vectors on model independent example selection. In Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, pp. 272–276.
Venables, W.N. and Ripley, B.D. 1997. Modern Applied Statistics with S-PLUS. New York: Springer-Verlag.
Zhang, T., Ramakrishnan, R., and Livny, M. 1996. Birch: An efficient data clustering method for large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103–114.

Author information

Authors and Affiliations

Rutgers University, USA
David Madigan
AT&T Labs—Research, USA
Nandini Raghavan & William Dumouchel
Talaria, Inc, USA
Martha Nason & Christian Posse
University of Washington, USA
Greg Ridgeway

About this article

Cite this article

Madigan, D., Raghavan, N., Dumouchel, W. et al. Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction. Data Mining and Knowledge Discovery 6, 173–190 (2002). https://doi.org/10.1023/A:1014095614948

Issue Date: April 2002
DOI: https://doi.org/10.1023/A:1014095614948
