
Data Squashing: Constructing Summary Data Sets


Part of the book series: Massive Computing (MACO, volume 4)

Abstract

A “large dataset” is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined “data squashing” as the construction of a substitute smaller dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be efficient, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
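
To make the contrast with random sampling concrete, here is a minimal sketch of the general idea, not the algorithm of any of the papers summarized in the chapter. It assumes a deliberately simple scheme: rows are grouped into equal-width bins of one variable, and each bin is replaced by a single pseudo-point (the bin means) weighted by the bin count. The names `squash_by_bins` and `weighted_mean` and the synthetic data are hypothetical, introduced only for illustration. This scheme preserves only first moments.

```python
import random
from collections import defaultdict


def squash_by_bins(rows, n_bins=10):
    """Collapse (x, y) rows into one weighted pseudo-point per equal-width bin of x.

    Hypothetical helper for illustration; returns a list of (mean_x, mean_y, weight).
    """
    xs = [x for x, _ in rows]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal x values
    groups = defaultdict(list)
    for x, y in rows:
        b = min(int((x - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        groups[b].append((x, y))
    squashed = []
    for b in sorted(groups):
        pts = groups[b]
        w = len(pts)
        squashed.append((sum(p[0] for p in pts) / w, sum(p[1] for p in pts) / w, w))
    return squashed


def weighted_mean(squashed, idx):
    """Weighted mean of column idx (0 for x, 1 for y) over the pseudo-points."""
    total = sum(p[2] for p in squashed)
    return sum(p[idx] * p[2] for p in squashed) / total


# Synthetic data (an assumption for the demo): y depends linearly on x plus noise.
random.seed(0)
rows = []
for _ in range(10_000):
    x = random.gauss(0.0, 1.0)
    rows.append((x, 2.0 * x + random.gauss(0.0, 0.5)))

squashed = squash_by_bins(rows, n_bins=20)
full_mean_y = sum(y for _, y in rows) / len(rows)

# A random sample of the same size as the squashed set, for comparison.
sample = random.sample(rows, len(squashed))
sample_mean_y = sum(y for _, y in sample) / len(sample)

print(len(squashed), "pseudo-points replace", len(rows), "rows")
print("full mean of y:    ", full_mean_y)
print("squashed mean of y:", weighted_mean(squashed, 1))
print("sample mean of y:  ", sample_mean_y)
```

Because the weights are bin counts, the weighted means of the pseudo-points reproduce the full-data means up to floating-point error, whereas a random sample of the same size carries sampling error. Preserving richer statistical information, as the abstract requires, would need more than matching means.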

Bibliography

  • W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon: Squashing flat files flatter. In Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, pages 6–15, 1999.

  • J. Friedman: Greedy function approximation: A stochastic boosting machine. Technical report, Department of Statistics, Stanford University, 1999a.

  • J. Friedman: Stochastic gradient boosting. Technical report, Department of Statistics, Stanford University, 1999b.

  • D. Madigan, N. Raghavan, W. DuMouchel, M. Nason, C. Posse, and G. Ridgeway: Likelihood-based data squashing: A modeling approach to instance construction. Technical report, AT&T Labs Research, 1999.

  • A. Owen: Empirical likelihood ratio confidence regions. The Annals of Statistics, 18: 90–120, 1990.

  • A. Owen: Data squashing by empirical likelihood. Technical report, Department of Statistics, Stanford University, 1999.

  • SAS Institute: SAS Users Manual, 1998.

  • W.N. Venables and B.D. Ripley: Modern Applied Statistics with S-Plus. Springer-Verlag, 1997.

Copyright information

© 2002 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

DuMouchel, W. (2002). Data Squashing: Constructing Summary Data Sets. In: Abello, J., Pardalos, P.M., Resende, M.G.C. (eds) Handbook of Massive Data Sets. Massive Computing, vol 4. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0005-6_16

  • DOI: https://doi.org/10.1007/978-1-4615-0005-6_16

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4613-4882-5

  • Online ISBN: 978-1-4615-0005-6

  • eBook Packages: Springer Book Archive
