The Deluge of Spurious Correlations in Big Data

Abstract

Very large databases are a major opportunity for science, and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of the data. They can be found in “randomly” generated, large enough databases, which, as we will prove, implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
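
The claim that correlations can be found in “randomly” generated, large enough databases can be illustrated numerically. The sketch below is not the Ramsey-theoretic argument of the paper: it assumes Pearson correlation as a stand-in for “correlation”, and the table sizes, the seed and the helper names (pearson, strongest_random_correlation) are arbitrary choices made for illustration only. It fills a table with independent random columns and reports the strongest correlation found between any two of them.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_random_correlation(n_columns, n_rows, seed=0):
    """Largest |r| over all pairs of independently generated random columns."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    columns = [[rng.random() for _ in range(n_rows)] for _ in range(n_columns)]
    return max(
        abs(pearson(a, b)) for a, b in itertools.combinations(columns, 2)
    )


if __name__ == "__main__":
    # The columns are independent by construction, so any correlation
    # reported here is spurious.
    for n_columns in (5, 50, 200):
        best = strongest_random_correlation(n_columns, n_rows=20)
        print(f"{n_columns:4d} columns -> strongest |r| = {best:.2f}")
```

Because the same seed is reused, each smaller table is a sub-table of the larger ones, so the reported maximum cannot decrease as columns are added. Nothing about the data changes except its size.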

Notes

  1.

    Anderson attributed the last sentence to Google’s research director Peter Norvig (2008), who denied it: “That’s a silly statement, I didn’t say it, and I disagree with it.”

  2.

    See more in Andrews (2012).

  3.

    This example points to another important issue: no data collecting is strictly objective—see the analysis in Grjebine (2015) of Reinhart and Rogoff’s bias in their collection of data in several countries for 218 years.

  4.

    European policy makers largely referred to that paper until 2013. For example, O. Rehn, EU Commissioner for Economic Affairs (2009–13), referred to the Reinhart-Rogoff correlation as a key guideline for his past and present economic views (Smith 2013; address to the ILO, April 9, 2013), and G. Osborne, British Chancellor of the Exchequer (since 2010), claimed in April 2013: “As Rogoff and Reinhart demonstrate convincingly, all financial crises ultimately have their origins in one thing [the public debt].” (Lyons 2013).

  5.

    See also the huge collection of spurious correlations (Spurious 2015) and the book (Vigen 2015) based on it, in which the old rule that “correlation does not equal causation” is illustrated through hilarious graphs.

  6.

    This was also observed informally in Smith (2014, p. 20): “With fast computers and plentiful data, finding statistical relevance is trivial. If you look hard enough, it can even be found in tables of random numbers”.

  7.

    They are branches of finite combinatorics and the theory of algorithms, respectively.

  8.

    A branch of mathematics which studies dynamical systems with an invariant measure and related problems.

  9.

    The measure of A, \(\mu (A)\), is the probability of A.

  10.

    Ehrenfest’s example (Walkden 2010) is a simple illustration. Let an urn \(U_1\) contain 100 numbered balls and let \(U_2\) be an empty urn. Each second, one ball is moved from its urn to the other, the ball being chosen by measuring events that produce numbers from 1 to 100. By Kac’s lemma, the expected time for (almost) all balls to return to \(U_1\) is (nearly) \(2^{100}\) s, which is about \(3 \times 10^{12}\) times the age of the Universe. Boltzmann already had an intuition of this phenomenon in his study of recurrence times in ergodic dynamics, for example of gas particles; see Cecconi et al. (2012).

  11.

    The dimension of an attractor is the number of effective degrees of freedom.

  12.

    Our footnote: see Lynch (2008), Chibbaro et al. (2014).

  13.

    For example, with one dimension far exceeding the age of the Universe in seconds, yotta of yottabytes.

  14.

    It appeared in the Putnam Mathematical Competition in 1953 and in the problem section of the American Mathematical Monthly in 1958 (Problem E 1321).

  15.

    This is a Ramsey type problem: the aim is to find out how large the party needs to be to guarantee similar pairwise acquaintanceship in (at least) one group of three people.

  16.

    The function K is incomputable.

  17.

    Traditionally, \(K_U\) is called the Kolmogorov complexity associated with U.

  18.

    The number of strings x of length n having \(K(x) \ge n-m\) is greater than or equal to \(2^{n}-2^{n-m}+1\).

  19.

    In view of results discussed in Sect. 6, they cannot have all properties associated with randomness.

  20.

    For every x, Zip(x) is an incompressible string for Zip, but for some x, Zip(x) is compressible by U.

  21.

    \(10^{82}\) is approximately the number of hydrogen atoms in the observable Universe.

  22.

    Latin: with this, therefore because of this.

  23.

    Latin: after this, therefore because of this.

  24.

    The danger of purely speculative theories in today’s physics is discussed in Ellis and Silk (2014).

  25.

    Big data can be used for scientific testing of hypotheses as well as for testing scientific theories and results.
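
Three of the notes above rest on small computations that can be checked directly: the return time in note 10, the party problem in note 15 and the counting bound in note 18. The following is a minimal sketch under stated assumptions. The age of the Universe is taken as 13.8 billion years, programs are taken to be binary strings each describing at most one string, and all function names are hypothetical. The function K itself is incomputable (note 16), so only the counting argument behind the bound is reproduced, not K.

```python
import itertools


def ratio_to_age_of_universe(n_balls=100, age_years=13.8e9):
    """Ratio of 2**n_balls seconds to the age of the Universe.

    The age (13.8e9 years) is an assumption, not a figure from the notes.
    """
    seconds_per_year = 365.25 * 24 * 3600
    return 2 ** n_balls / (age_years * seconds_per_year)


def every_party_has_uniform_trio(n_people):
    """True if every labelling of pairs as acquainted (1) or strangers (0)
    contains three people who are all acquainted or all strangers."""
    pairs = list(itertools.combinations(range(n_people), 2))
    trios = list(itertools.combinations(range(n_people), 3))
    for labels in itertools.product((0, 1), repeat=len(pairs)):
        relation = dict(zip(pairs, labels))
        if not any(
            relation[(a, b)] == relation[(a, c)] == relation[(b, c)]
            for a, b, c in trios
        ):
            return False  # this party has no uniform trio
    return True


def min_incompressible_count(n, m):
    """Lower bound on the number of strings x of length n with K(x) >= n - m.

    There are 2**n strings of length n and only 2**(n - m) - 1 binary
    programs of length below n - m, each describing at most one string.
    """
    shorter_programs = sum(2 ** i for i in range(n - m))  # lengths 0 .. n-m-1
    return 2 ** n - shorter_programs


if __name__ == "__main__":
    print(f"note 10: ratio = {ratio_to_age_of_universe():.2e}")
    print("note 15: 5 people suffice?", every_party_has_uniform_trio(5))
    print("note 15: 6 people suffice?", every_party_has_uniform_trio(6))
    print("note 18: bound for n=10, m=3:", min_incompressible_count(10, 3))
```

The exhaustive search over all parties of five and of six people is feasible only because these cases are tiny (1024 and 32768 labellings respectively); it is meant as a check of the smallest Ramsey-type case, not as a method that scales.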

References

  1. Ahn, A. (2015). The party problem. http://mathforum.org/mathimages/index.php/The_Party_Problem_(Ramsey's_Theorem). Accessed December 12, 2015.

  2. Andrews, G. E. (2012). Drowning in the data deluge. Notices of the American Mathematical Society, 59(7), 933–941.

  3. Calude, A. S. (2015). Does big data equal big problems? http://blogs.crikey.com.au/fullysic/2015/11/13/does-big-data-equal-big-problems. November 2015.

  4. Calude, C. (2002). Information and randomness: An algorithmic perspective (2nd ed.). Berlin: Springer.

  5. Calude, C. S., & Longo, G. (2015). Classical, quantum and biological randomness as relative. Natural Computing. doi:10.1007/s11047-015-9533-2

  6. Cecconi, F., Cencini, M., Falcioni, M., & Vulpiani, A. (2012). Predicting the future from the past: An old problem from a modern perspective. American Journal of Physics, 80(11), 1001–1008.

  7. Chibbaro, S., Rondoni, L., & Vulpiani, A. (2014). Reductionism, emergence and levels of reality. Berlin: Springer.

  8. Cooper, S. B. (2004). Computability theory. London: Chapman & Hall/CRC.

  9. Correlation and prediction. 1992. http://www.intropsych.com/ch01_psychology_and_science/correlation_and_prediction.html

  10. Devaney, R. L. (2003). An introduction to chaotic dynamical systems (2nd ed.). Redwood City, CA: Addison-Wesley.

  11. Downey, R., & Hirschfeldt, D. (2010). Algorithmic randomness and complexity. Berlin: Springer.

  12. Ellis, G., & Silk, J. (2014). Scientific method: Defend the integrity of physics. Nature, 516, 321–323.

  13. Ferber, R. (1956). Are correlations any guide to predictive value? Journal of the Royal Statistical Society Series C (Applied Statistics), 5(2), 113–121.

  14. Floridi, L. (2012). Big data and their epistemological challenge. Philosophy and Technology, 25(4), 435–437.

  15. Frické, M. (2015). Big data and its epistemology. Journal of the Association for Information Science and Technology, 66(4), 651–661.

  16. Gisin, N. (2014). Quantum chance: Nonlocality, teleportation and other quantum marvels. London: Springer.

  17. Gowers, T. (2001). A new proof of Szemerédi’s theorem. Geometric and Functional Analysis, 11(3), 465–588.

  18. Graham, R. (2007). Some of my favorite problems in Ramsey theory. INTEGERS: The Electronic Journal of Combinatorial Number Theory, 7(2), A2.

  19. Graham, R., Rothschild, B. L., & Spencer, J. H. (1990). Ramsey theory (2nd ed.). New York: Wiley.

  20. Graham, R., & Spencer, J. H. (1990). Ramsey theory. Scientific American, 262, 112–117.

  21. Grjebine, A. (2015). La dette publique et comment s’en débarrasser. Paris: Presses Universitaires de France.

  22. Grossman, L. (2015). What’s this all about? The massive volume of data that humanity generates is a new kind of problem. The solution is very old: art. Time Magazine, 6 July 2015 (double issue).

  23. Hoffman, C. (2015). Benchmarked: What’s the best file compression format? http://www.howtogeek.com/200698/benchmarked-whats-the-best-file-compression-format/. May 2015.

  24. IBM. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html. May 2011.

  25. Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bulletin of the American Mathematical Society, 53, 1002–1010.

  26. Khoussainov, B. (2016). Algorithmically random universal algebras. In M. Burgin & C. S. Calude (Eds.), Information and complexity. World Scientific Series in Information Studies, Singapore, 2016 (to appear).

  27. Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 1–12.

  28. Longo, G., & Montévil, M. (2014). Perspectives on organisms: Biological time, symmetries and singularities. Berlin: Springer.

  29. Longo, G. (2008). On the relevance of negative results. Influxus. http://www.influxus.eu/article474.html

  30. Lynch, P. (2008). The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7), 3431–3444.

  31. Lyons, J. (2013). George Osborne’s favourite “godfathers of austerity” economists admit to making error in research. http://www.mirror.co.uk/news/uk-news/george-osbornes-favourite-economists-reinhart-1838219. April 2013.

  32. Manin, Y. I. (2016). Cognition and complexity. In M. Burgin & C. S. Calude (Eds.), Information and complexity. World Scientific Series in Information Studies, Singapore, 2016 (to appear).

  33. Montelle, C. (2011). Chasing shadows: Mathematics, astronomy, and the early history of eclipse reckoning. Baltimore: Johns Hopkins University Press.

  34. NSF. (2010). Computational and data-enabled science and engineering. http://www.nsf.gov/mps/cds-e/

  35. Needham, J. (2008). Science and civilisation in China: Medicine (Vol. 6). Cambridge: Cambridge University Press.

  36. Norvig, P. (2008). All we want are the facts, ma’am. http://norvig.com/fact-check.html

  37. Oxford Dictionaries. Spurious. http://www.oxforddictionaries.com/definition/learner/spurious. Accessed November 30, 2015.

  38. O’Grady, C. (2015). Louder vowels won’t get you laid, and other tales of spurious correlation. http://arstechnica.co.uk/science/2015/06/louder-vowels-wont-get-you-laid-and-other-tales-of-spurious-correlation. June 2015.

  39. Paris, J., & Harrington, L. (1977). A mathematical incompleteness in Peano Arithmetic. In J. Barwise (Ed.), Handbook of mathematical logic (pp. 1133–1142). Amsterdam: North Holland.

  40. Poppelaars, J. (2015). OR at work. http://john-poppelaars.blogspot.fr/2015/04/do-numbers-really-speak-for-themselves.html. April 2015.

  41. Rajaraman, A., & Ullman, J. D. (2011). Mining of massive datasets. Cambridge: Cambridge University Press.

  42. Reed, D. A., & Dongarra, J. (2015). Exascale computing and big data. Communications of the ACM, 58(7), 56–68.

  43. Reinhart, C., & Rogoff, K. (2010). Growth in a time of debt. American Economic Review, 100(2), 573–578.

  44. Roberts, S., & Winters, J. (2013). Linguistic diversity and traffic accidents: Lessons from statistical studies of cultural traits. PLoS ONE, 8(8), e70902.

  45. Schmidt, E. (2010). Every 2 days we create as much information as we did up to 2003. http://techcrunch.com/2010/08/04/schmidt-data. August 2010.

  46. Schutt, R., & O’Neil, C. (2014). Doing data science. Newton, MA: O’Reilly Media.

  47. Sessions, J. (2011). The case for growth: Sessions lists benefits of discretionary cuts. http://www.sessions.senate.gov/public/index.cfm/news-releases?ID=E36C43B4-B428-41A4-A562-475FC16D3793. March 2011.

  48. Shen, A. (2015). Around Kolmogorov complexity: Basic notions and results. http://dblp.uni-trier.de/rec/bib/journals/corr/Shen15

  49. Smith, G. (2014). Standard deviations: Flawed assumptions, tortured data, and other ways to lie with statistics. New York: Overlook/Duckworth.

  50. Smith, J. (2013). From Reinhart and Rogoff’s own data: UK GDP increased fastest when debt-to-GDP ratio was highest—And the debt ratio came down! http://www.primeeconomics.org/articles/1785. April 2013.

  51. Spurious correlations. http://www.tylervigen.com/spurious-correlations. November 2015.

  52. Stanton, J. M. (2012). Introduction to data science. Syracuse: Syracuse University.

  53. Herndon, T., Ash, M., & Pollin, R. (2014). Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics, 38, 257–279.

  54. Vereshchagin, N. K. (2007). Kolmogorov complexity of enumerating finite sets. Information Processing Letters, 103(1), 34–39.

  55. Vigen, T. (2015). Spurious correlations. New York: Hachette Books.

  56. Walkden, C. (2010). Magic post-graduate lectures: Magic010 ergodic theory lecture 5. http://www.maths.manchester.ac.uk/~cwalkden/magic/

Acknowledgments

The authors have been supported in part by Marie Curie FP7-PEOPLE-2010-IRSES Grant. Longo’s work is also part of the project “Lois des dieux, des hommes et de la nature”, Institut d’Etudes Avancées, Nantes, France. We thank A. Vulpiani for suggesting the use of Kac’s lemma, G. Tee for providing historical data and A. Abbott, F. Kroon, H. Maurer, J. P. Lewis, C. Mamali, R. Nicolescu, G. Smith, G. Tee, A. Vulpiani and the anonymous referees for useful comments and suggestions.

Author information

Corresponding author

Correspondence to Giuseppe Longo.

Additional information

Italian: ...you were not born to live like brutes, but to pursue virtue and knowledge.

About this article

Cite this article

Calude, C.S., Longo, G. The Deluge of Spurious Correlations in Big Data. Found Sci 22, 595–612 (2017). https://doi.org/10.1007/s10699-016-9489-4

Keywords

  • Big data
  • Ergodic theory
  • Ramsey theory
  • Algorithmic information theory
  • Correlation