Data Squashing: Constructing Summary Data Sets

DuMouchel, William

doi:10.1007/978-1-4615-0005-6_16

Part of the book series: Massive Computing (MACO, volume 4)

Abstract

A “large dataset” is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined “data squashing” as the construction of a substitute smaller dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be efficient, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
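
The contrast between squashing and random sampling can be illustrated with a minimal sketch. It is a deliberately simplified illustration and not the procedure of any of the papers summarized in the chapter: it assumes one-dimensional data, splits the sorted values into equal-count bins, and replaces each bin by a single weighted pseudo-point at the bin mean. The function names `squash` and `weighted_mean`, the bin count, and the simulated exponential data are all assumptions made for the example.

```python
import random


def squash(values, n_bins):
    """Summarize values as one weighted pseudo-point per equal-count bin.

    Each pseudo-point is the mean of its bin and its weight is the bin size,
    so the weighted first moment of the summary matches the full data.
    (Hypothetical helper for illustration only.)
    """
    ordered = sorted(values)
    n = len(ordered)
    points, weights = [], []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = ordered[lo:hi]
        if not chunk:
            continue  # skip empty bins when n_bins exceeds the data size
        points.append(sum(chunk) / len(chunk))
        weights.append(len(chunk))
    return points, weights


def weighted_mean(points, weights):
    """Mean of the pseudo-points, each counted with its weight."""
    return sum(p * w for p, w in zip(points, weights)) / sum(weights)


# Simulated "large" dataset (assumed for the example).
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100_000)]
full_mean = sum(data) / len(data)

# Squashed summary: 50 weighted pseudo-points.
points, weights = squash(data, n_bins=50)
squashed_mean = weighted_mean(points, weights)

# Baseline: a simple random sample of the same size.
sample = rng.sample(data, 50)
sample_mean = sum(sample) / len(sample)

print("full mean:    ", full_mean)
print("squashed mean:", squashed_mean)
print("sample mean:  ", sample_mean)
```

In this sketch the 50 weighted pseudo-points reproduce the full-data mean up to floating-point rounding, whereas the mean of a 50-point random sample carries ordinary sampling error. Matching only the mean is the weakest possible form of the idea; preserving results for richer analyses would require the summary to carry more statistical information than this example does.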

Bibliography

W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon: Squashing flat files flatter. In Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, pages 6–15, 1999.
J. Friedman: Greedy function approximation: A gradient boosting machine. Technical report, Department of Statistics, Stanford University, 1999a.
J. Friedman: Stochastic gradient boosting. Technical report, Department of Statistics, Stanford University, 1999b.
D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway: Likelihood-based data squashing: A modeling approach to instance construction. Technical report, AT&T Labs Research, 1999.
A. Owen: Empirical likelihood ratio confidence regions. The Annals of Statistics, 18: 90–120, 1990.
A. Owen: Data squashing by empirical likelihood. Technical report, Department of Statistics, Stanford University, 1999.
SAS Institute. SAS Users Manual, 1998.
W.N. Venables and B.D. Ripley: Modern Applied Statistics with S-Plus. Springer-Verlag, 1997.

Author information

Authors and Affiliations

AT&T Labs Research, Florham Park, NJ, 07932, USA
William DuMouchel

Editor information

Editors and Affiliations

AT&T Labs Research, USA
James Abello & Mauricio G. C. Resende
University of Florida, USA
Panos M. Pardalos

About this chapter

Cite this chapter

DuMouchel, W. (2002). Data Squashing: Constructing Summary Data Sets. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_16

DOI: https://doi.org/10.1007/978-1-4615-0005-6_16
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive