Advances in Data and Information Sciences pp 115-125
An Insight into Theory-Guided Climate Data Science—A Literature Review
Abstract
Data science models, though successful in a large number of commercial domains, have found limited application in scientific problems that involve complex physical phenomena. Most of these problems comprise multi-spectral data composites. Climate science and hydrology form one such scientific domain, and it faces several big data challenges. Climate data is difficult to work with because of its spatiotemporal characteristics, its high degree of variance, and, above all, its physical nature. Precipitation data is one such challenging data type in climate science and hydrology. It is vast and is generated at a fast pace from several sources, but because data science models lack underlying physical principles, they perform poorly on climatic problems such as precipitation. These challenges call for a novel approach that integrates domain knowledge with data science models. To that end, the paper surveys theory-guided data science (TGDS), an evolving paradigm in data science and analytics that aims to improve the generalization of data science models and their effectiveness in scientific discovery. Through the survey, the authors present the challenges posed by climate data, as represented by precipitation data, and the limitations of traditional data science methods. The paper suggests a shift in data science practice toward adopting theory-guided data science for precipitation data in the climate and hydrology domain, and it provides insights into TGDS, its models, and its approaches.
Keywords
Data science, Theory-guided, Knowledge discovery, Climate change, Climate science, Precipitation