Abstract
A “large dataset” is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined “data squashing” as the construction of a substitute smaller dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be efficient, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
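The contrast between squashing and random sampling can be illustrated with a deliberately simplified sketch. It is not the method of DuMouchel et al. (1999) or of the other papers summarized here. It assumes a single numeric variable, and the names `squash_by_bins` and `weighted_mean` are hypothetical. Each equal-count bin of the sorted data is replaced by one weighted pseudo-point located at the bin mean, so the weighted mean of the squashed data reproduces the full-data mean up to rounding, whereas a random sample of the same size only estimates it.

```python
import random


def squash_by_bins(values, n_bins):
    """Replace each equal-count bin of the sorted values by (bin mean, bin size).

    Illustrative assumption: one numeric variable, bins formed by rank.
    """
    ordered = sorted(values)
    n = len(ordered)
    pseudo_points = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = ordered[lo:hi]
        if chunk:  # skip empty bins when n_bins exceeds the data size
            pseudo_points.append((sum(chunk) / len(chunk), len(chunk)))
    return pseudo_points


def weighted_mean(points):
    """Mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in points)
    return sum(x * w for x, w in points) / total_weight


rng = random.Random(0)
full = [rng.gauss(10.0, 3.0) for _ in range(100_000)]  # stand-in "large" dataset

squashed = squash_by_bins(full, 50)  # 50 weighted pseudo-points
sample = rng.sample(full, 50)        # 50 unweighted random points

full_mean = sum(full) / len(full)
squashed_mean = weighted_mean(squashed)
sample_mean = sum(sample) / len(sample)

print(f"full mean:     {full_mean:.6f}")
print(f"squashed mean: {squashed_mean:.6f}")
print(f"sample mean:   {sample_mean:.6f}")
```

Matching only the first moment within bins is the simplest possible choice. Analyses that depend on variances or on relationships between several variables would need more information preserved in each region, and the weights must be passed to whatever analysis is run on the squashed data.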
Bibliography
W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon: Squashing flat files flatter. In Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, pages 6–15, 1999.
J. Friedman: Greedy function approximation: A gradient boosting machine. Technical report, Department of Statistics, Stanford University, 1999a.
J. Friedman: Stochastic gradient boosting. Technical report, Department of Statistics, Stanford University, 1999b.
D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway: Likelihood-based data squashing: A modeling approach to instance construction. Technical report, AT&T Labs Research, 1999.
A. Owen: Empirical likelihood ratio confidence regions. The Annals of Statistics, 18: 90–120, 1990.
A. Owen: Data squashing by empirical likelihood. Technical report, Department of Statistics, Stanford University, 1999.
SAS Institute: SAS Users Manual, 1998.
W.N. Venables and B.D. Ripley: Modern Applied Statistics with S-Plus. Springer-Verlag, 1997.
Copyright information
© 2002 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
DuMouchel, W. (2002). Data Squashing: Constructing Summary Data Sets. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_16
DOI: https://doi.org/10.1007/978-1-4615-0005-6_16
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-4882-5
Online ISBN: 978-1-4615-0005-6
eBook Packages: Springer Book Archive