Foundations of Science, Volume 22, Issue 3, pp 595–612

The Deluge of Spurious Correlations in Big Data

  • Cristian S. Calude
  • Giuseppe Longo

Abstract

Very large databases are a major opportunity for science, and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there would be no need to give scientific meaning to phenomena by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in “randomly” generated, large enough databases, which, as we will prove, implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
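The claim that regularities appear from size alone can be illustrated with the smallest classical case of Ramsey's theorem, often called the party problem: among any six people, however acquaintance is assigned, some three are mutual acquaintances or mutual strangers. The sketch below is not taken from the article; it is a brute-force check written for illustration, and the function names are hypothetical. It encodes "acquainted or not" as a 2-coloring of the edges of a complete graph and confirms that a single-colored triangle is unavoidable on six vertices but avoidable on five.

```python
from itertools import combinations, product


def has_monochromatic_triangle(n, coloring):
    """Return True if some triangle on vertices 0..n-1 has all three edges the same color.

    `coloring` maps each edge (i, j) with i < j to a color, 0 or 1.
    """
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )


def every_coloring_has_triangle(n):
    """Exhaustively test all 2-colorings of the edges of the complete graph on n vertices."""
    edges = list(combinations(range(n), 2))
    for colors in product((0, 1), repeat=len(edges)):
        if not has_monochromatic_triangle(n, dict(zip(edges, colors))):
            return False  # found a coloring with no single-colored triangle
    return True


if __name__ == "__main__":
    for n in (5, 6):
        print(n, every_coloring_has_triangle(n))
```

The pattern on six vertices carries no information about the people involved: it is forced by the number of vertices, which is the sense in which a correlation can be due to the size, not the nature, of the data. The exhaustive search is only feasible for very small `n`, since the number of colorings doubles with every added edge.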

Keywords

Big data, Ergodic theory, Ramsey theory, Algorithmic information theory, Correlation

Notes

Acknowledgments

The authors have been supported in part by Marie Curie FP7-PEOPLE-2010-IRSES Grant. Longo’s work is also part of the project “Lois des dieux, des hommes et de la nature”, Institut d’Etudes Avancées, Nantes, France. We thank A. Vulpiani for suggesting the use of Kac’s lemma, G. Tee for providing historical data and A. Abbott, F. Kroon, H. Maurer, J. P. Lewis, C. Mamali, R. Nicolescu, G. Smith, G. Tee, A. Vulpiani and the anonymous referees for useful comments and suggestions.

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. Department of Computer Science, University of Auckland, Auckland, New Zealand
  2. Centre Cavaillès (République des Savoirs), CNRS, Collège de France & École Normale Supérieure, Paris, France
  3. Department of Integrative Physiology and Pathobiology, Tufts University School of Medicine, Boston, USA
