Introduction

Big data touches increasingly personal aspects of each of our lives, from health to shopping and entertainment preferences. Ecosystem scientists curate very little of the types of social media, consumer-based, and medical data that have motivated much of the technological and analytical development in informatics (Chang and others 2014; Tirunillai and Tellis 2014; Han and others 2015; Hoegh and others 2015; Culotta and Cutler 2016; Flechet and others 2016). This is reflected in the fact that more than 70% of the articles indexed in Web of Science over the past ten years that refer to big data come from computer science, engineering, telecommunication, and business research fields. Yet a growing core of ecosystem work is steadily expanding how our field defines, handles, and exploits big-data products and approaches. This is evidenced by publications from data-driven networks like GLEON (Global Lake Ecological Observatory Network, http://gleon.org) and FLUXNET (http://fluxnet.ornl.gov), and by ventures like the U.S. National Science Foundation’s $400 million support for the National Ecological Observatory Network (NEON) (Wilson and others 2002; Baldocchi 2003; Michener and others 2012; Weathers and others 2013; McDowell 2015; Hanson and others 2016). Likewise, eco-informatics platforms like DataONE (Data Observation Network for Earth, www.dataone.org) have substantially advanced ‘discoverability’ of these big-data products (Michener and others 2012). Despite this progress, we identify a growing need for theoretical development and infrastructure support to better manage and integrate cross-scale data (for example, data from different organizational levels). These advances must address key challenges, including real-time prediction of ecosystem properties and the need for probabilistic forecasts to better understand how ecosystem function might be altered under global change scenarios.

For much of the past decade, big data has been used to indicate data volumes over a terabyte, a storage capacity that only became available on personal computers in 2007. However, volume is not the only dimension of value for big data (Figure 1), which also encompasses data variety, velocity, and, in some literature, veracity (Kwon and others 2014; Lovelace and others 2016). Variety, or structural heterogeneity, generally refers to data that integrate tabular (numeric) data with images, text, or other unstructured sources. We extend this to also encompass data from different scales and/or organizational (for example, taxonomic) levels. Velocity captures the speed at which data are generated, with the expectation that data collected at high velocities can inform real-time (or near-time) analytics. Veracity acknowledges the importance of identifying data reliability and might best be evaluated via ground-truthing and cross-validation across different data types (Lovelace and others 2016). Specific definitions vary in both time and disciplinary space, as what is ‘big’ is redefined with each technological and analytical advance that is adopted. However, volume is a key dimension across fields, and for all involved big data is notable for being bigger than the standard data that are collected and analyzed with conventional methods. The broadest definitions incorporate a hope and expectation that big data ultimately represents an evolving capacity to search, aggregate, and cross-reference large datasets to inform decision-making and to generate new understanding about emergent properties of complex systems (Evans and others 2013; Tinati and others 2014). In this sense, big data is not generated exclusively to address a specific hypothesis but instead offers opportunities to address many yet-to-be-defined hypotheses and to generate predictive inference.

Figure 1

The next decade will demand advances driven by integration across multiple scales and sources of data to address critical questions about how ecosystems function and change. Each of the bubbles, scaled by relative Volume, represents an individual data product that by itself might be considered ‘big’ by axes of Volume, Velocity, or Variety. Height on this figure is arbitrary for purposes of visualization. Measures of CO2 flux from a single tower can generate great volume due to the high velocity of measurement records. The volume is further increased if multiple tower sites are considered, although variety remains low because the data are all the same type of measurement. Likewise, even a single organism can generate high-volume genetic sequence data. Community ecology data, including biodiversity metrics and biotic interactions, are often ‘smaller’ data, although measuring biodiversity can require multiple types of data (high variety). Ecosystem function is a complex process that should be informed by many data types from multiple scales.

Figure 2

The Global Lake Ecological Observatory Network (GLEON) integrates buoy sensor data from more than 80 lakes in 51 countries across 6 continents.

Below we provide a brief synthesis of how the understanding and use of big data have developed in ecosystem research over the past ten years. We also highlight current technological and cultural challenges that limit our field’s ability to exploit big-data approaches. Finally, we describe a vision for the next decade of big data in ecosystem science, one that will redefine the scope of what is currently considered big as our capacity for managing and analyzing large datasets continues to grow. We anticipate that the field will further develop and support network and team science that can effectively integrate diverse data sources and generate cross-scale inference with clear and quantitative definitions of uncertainty. Further, we expect that ecosystem scientists will increasingly employ big-data approaches to understand how a growing human population and global climate change influence ecosystem function and stability. The next decade will certainly see growing demand for forecasts driven by big data and aimed at guiding policy and management, and ecosystem scientists will increasingly need to evaluate how research can concurrently support both theoretical understanding and forecasting needs.

Understanding Ecosystems with Big Data

Ecologists, and ecosystem scientists in particular, have a long history with big-data concepts. A number of recent publications illustrate the challenges and opportunities associated with big data (Michener and Jones 2012; Hampton and others 2013; Raffaelli and others 2014; Han and others 2015; Weathers and others 2016); each describes persistent challenges, including the standardization of metadata, units, and protocols, data ‘discoverability,’ and ease of visualization and analysis within or between connected cyberinfrastructure and analysis platforms (for example, DataONE and statistical software). While ecosystem scientists have engaged in and leveraged big-data approaches, ecosystem science is still largely a discipline that explains pattern and process rather than one adept at predicting them (Dietze and others 2013; Niu and others 2014). This is in part because there is still so much about pattern and process that requires explaining.

Ecosystem scientists strive to understand biophysical processes, as well as the complex applications and implications of those processes (for example, ecosystem function, stability, and services). Big data could substantially advance understanding and forecasting capacity in ecosystem science over the next decade. Although data quality and keen researcher insight will remain critical in any analysis, engaging big-data technologies and philosophies can provide robust approaches and tools for answering complex questions in ecosystem science, such as how ecosystem function is affected by changing climate across local to regional scales and how the stability of ecosystem processes is supported by biodiversity. Furthermore, the clear connections between big data and network science approaches signal a potential shift towards a future where early career faculty are both expected to develop collaborations and rewarded for doing so well.

Individual research in ecosystem science often generates smaller data but, even so, requires integration of information across space, time, and types of data to address fundamental questions about how earth’s biophysical systems function (Figure 1). Assembling and curating measurements and metadata from across spatially distinct sites, as in the GLEON model (Box 1), is a network science approach, in which individual researchers and/or collaborative groups build a database that, as a whole, can generate inference and predictions more effectively than any single data component could by itself. Network science engages multiple investigators in data collection, validation, curation, and, in some cases, synthesis. This requires collaborative, open data sharing and documentation (Hampton and Parker 2011; Stokstad 2011; Wallis and others 2013; Hampton and others 2015; Laney and others 2015). Examples include networks that prescribe and support experimental, processing, and/or data entry protocols (for example, FLUXNET, NEON); networks organized more loosely around shared technology and protocols (GLEON); and networks of independent research programs that share intent and a common data repository (for example, NSF’s LTER program, Box 2). These networks provide big data, although current infrastructure and analytics support often falls short of facilitating big-data analytics (see Boxes 1 and 2).
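
To make the pooling step concrete, the sketch below assembles a toy cross-site temperature table in the spirit of the network approach described above. It is a minimal illustration only: the site names, column names, and unit conversion are hypothetical assumptions rather than a prescribed GLEON or LTER format, and the synthetic generator stands in for reading each site’s actual data export.

```python
# Minimal, hypothetical sketch of pooling per-site records into one
# network-level table. Site names, columns, and units are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fake_site_export(unit):
    """Stand-in for pd.read_csv() on a single site's data export."""
    times = pd.date_range("2016-07-01", periods=48, freq="H")
    temp_c = 22 + 3 * np.sin(np.arange(48) / 24 * 2 * np.pi) + rng.normal(0, 0.3, 48)
    col = "temp_c" if unit == "C" else "temp_f"
    values = temp_c if unit == "C" else temp_c * 9 / 5 + 32
    return pd.DataFrame({"timestamp": times, col: values})

frames = []
for site, unit in [("lake_a", "C"), ("lake_b", "F")]:   # hypothetical sites
    df = fake_site_export(unit)
    if "temp_f" in df.columns:                           # harmonize units before pooling
        df["temp_c"] = (df["temp_f"] - 32.0) * 5.0 / 9.0
        df = df.drop(columns=["temp_f"])
    df["site"] = site                                    # retain provenance per record
    frames.append(df)

network = pd.concat(frames, ignore_index=True)           # the pooled network table
print(network.groupby("site")["temp_c"].agg(["mean", "std", "count"]))
```

Even in this toy form, most of the code is devoted to harmonizing units and tracking provenance, which is exactly where shared protocols and metadata standards pay off.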

Box 1 The Global Lake Ecological Observatory Network (GLEON)
Box 2 Long-Term Ecological Research and sharing big data across space and time

Generating and curating data to conform to a particular network platform can mean relinquishing investigator control to a wider community. This can be a daunting prospect for many scientists, especially early in their careers. The academic tenure process, for example, continues to undervalue an individual’s contribution to publications with many authors, or to network science more generally (Uriarte and others 2007; Goring and others 2014). Likewise, the growth of a big-data network that integrates across multiple sources of data takes time and requires sustained support beyond the traditional 3- to 5-year funding cycle (Ruegg and others 2014). The global FLUXNET program, funded by multiple international agencies, is one example of an ecosystem data network that has advanced capabilities for modeling global carbon cycles, developing new mechanistic hypotheses, and exploring predictive capacity (Baldocchi 2003; Xiao and others 2014; Rao and others 2015). However, the level of infrastructure support that has made FLUXNET successful has only recently gained prominence in the missions of funding agencies. The majority of funding has historically prioritized the generation of new data, and the technological infrastructure and curation required for synthesis and analytics remain a significant limitation for ecosystem scientists, especially those who do not focus on terrestrial carbon or climate research.

Detailed experimental and site-specific studies can deepen our understanding of how ecosystem processes work, but greater theoretical and informatics developments are still needed to guide data integration if we are to effectively predict ecosystem function. Cross-scale inference in space or time remains a challenge for developing predictive capacity in ecosystem science (Soranno and others 2014; Mouquet and others 2015; Petchey and others 2015; Price and Schmitz 2016). One recent ecological example highlighting the utility of big data for answering questions at relevant spatial scales is a study that examined species distributions and habitat boundaries using high-resolution, large-scale, long-term remotely sensed cloud cover data (Wilson and Jetz 2016). Combining this data product with existing knowledge about the role of cloud cover in key life history characteristics of animal species (for example, growth, reproduction, and behavior) offers a way to leverage high-frequency, high-volume data of high veracity to better estimate habitat transitions and species distributions across large spatial extents. Although this and other recent studies in terrestrial ecosystems demonstrate how big-data products across organizational levels can help scale up inference in space (Xiao and others 2014), developing the technological and analytical infrastructure for integrating diverse data sources remains a primary challenge for ecosystem science. Investing in these developments will pay off in the near term through the immediate use of currently available big data to generate new hypotheses, and will pay dividends in the future as macroecological insights arising from cross-scale analyses contribute to the development of ecosystem theory.

Big-Data Approaches for Ecosystem Science

Big-data approaches open up unprecedented opportunities to generate new knowledge in ecosystem science. Ecologists are trained to develop hypotheses through observation of the natural world. However, formal training is often directed more toward placing our observations (data) within the context of the literature than toward conducting synthesis or meta-analysis, and even less toward the eco-informatics approaches needed to create and use big data (Michener and Jones 2012; Touchon and McCoy 2016). The computational advances offered by machine learning (Peters and others 2014) and data mining (Hochachka and others 2007) tools, for example, enable analyses to accommodate ecological heterogeneity and common data caveats rather than ‘controlling’ for them. Thus, big data offers a quantitative departure from the constraints of a reductionist approach in ecosystem science, whose questions are often posed at much larger scales than those of other subdisciplines in ecology. Successfully exploiting big-data approaches will require scientific exchange between controlled experimental design and big-data products. Broadening the scope of ecological questions can mean giving up some of the control that we have been trained to consider a gold standard of empirical research, although individual data still derive from many well-controlled studies. In exchange, big-data approaches can reveal important contours of ecosystems that will remain invisible to us as long as our scope of view remains fixed on questions that are conducive to manipulation (statistical or otherwise). Shifting our perspective offers an opportunity to identify unforeseen answers to old questions, and new hypotheses that can be tested in empirically tractable, controlled settings (Stephens and others 2016).

Data Collation A predominant task in any big-data approach is collating existing data. Most ecological datasets do not adhere to common standards or methods of access (Parsons and others 2011). Following pre-processing, validation, and quality control, data (and/or their metadata) can be archived as standalone synthetic products through a number of online repositories (for example, Ecological Archives (http://esapubs.org/archive/), the Knowledge Network for Biocomplexity (KNB, https://knb.ecoinformatics.org/), and others). Although these online data products are becoming more commonplace in ecological research, the majority of the workflow is still developed on an individual basis, in part because the data products are individually unique (Michener and others 2012; Michener 2015). Finding and using existing datasets can be a time-consuming and error-prone process (Roche and others 2015). Developing a more standardized and streamlined process for data science in ecology is increasingly necessary, and the DataONE platform (www.dataone.org) represents an important advance toward this goal in big-data cyberinfrastructure and practice for ecosystem science (Michener and others 2012; Michener 2015). The platform is an organizational nexus for discovering data from across member nodes and a resource for training in data sharing, ethics, and informatics. There are currently 36 member nodes, including the U.S. and European LTER programs, NEON, GLEON, and USGS data repositories, and the site has recorded 348,243 data downloads (of the 404,097 datasets uploaded) as of September 2016. Software such as the EcoData Retriever has been developed to automate the tasks of finding, downloading, and reformatting ecological data files, streamlining the process of getting from data discovery to analysis (Morris and White 2013). The popularization of data and software carpentry workshops (http://software-carpentry.org//index.html, http://www.datacarpentry.org/) also points to greater movement in this direction. Still, the efficacy of any current platform to search, acquire, and support visualization or summary analyses across, for example, U.S. government-funded datasets has yet to be proven, at least partly because the productivity that results from platform use is difficult to track. Both funding agencies and individual journals increasingly require that data be made available through online resources, which is a marked improvement, although much work remains if we are to achieve the ultimate goal of maximizing the future utility of these archived data: a recent study showed that 56% of archived data in ecology and evolution are incomplete, and that the archiving methods for 64% of datasets effectively prevent their reuse (Roche and others 2015). Rooting out unreliable values and aligning scales and units across data types can be difficult to automate. Critical next steps will be to invest in improving the accessibility, and therefore the utility, of these data products.
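
As a concrete illustration of programmatic data discovery, the sketch below queries DataONE’s public search service. We assume here the coordinating-node Solr endpoint and the identifier/title index fields; both should be checked against current DataONE documentation, so treat this as a sketch of the pattern rather than a tested client.

```python
# Hedged sketch of dataset discovery against DataONE's search service.
# The endpoint URL and Solr field names below are assumptions to verify
# against current DataONE documentation before use.
import json
import urllib.parse
import urllib.request

DATAONE_SOLR = "https://cn.dataone.org/cn/v2/query/solr/"  # assumed endpoint

def search_datasets(terms, max_results=10):
    """Return (identifier, title) pairs for datasets matching all terms."""
    params = urllib.parse.urlencode({
        "q": " AND ".join(terms),        # simple keyword query
        "fl": "identifier,title",        # fields to return (assumed names)
        "rows": max_results,
        "wt": "json",                    # ask Solr for a JSON response
    })
    with urllib.request.urlopen(f"{DATAONE_SOLR}?{params}") as response:
        docs = json.load(response)["response"]["docs"]
    return [(d.get("identifier"), d.get("title", "")) for d in docs]

if __name__ == "__main__":
    for identifier, title in search_datasets(["lake", "temperature"]):
        print(identifier, "|", title)
```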

Data Analysis Although many big-data products already drive ecosystem models of global carbon and nutrient cycles (for example, Melillo and others 2002; McCarthy and others 2010; Xiao and others 2014), they are perhaps less appreciated for their value in developing theory and driving the discovery of new hypotheses. There are two general analytical approaches to drawing inference from big data. If there is sufficient understanding to define a model, statistical regression-based methods are used to estimate parameters and evaluate hypotheses. Hierarchical (often Bayesian) statistics are adept at integrating diverse data sources to inform understanding of more complex ecosystem processes (Hartig and others 2012; Niu and others 2014).
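
The value of partial pooling is easiest to see in a small model. The sketch below fits a hierarchical regression across simulated sites using the PyMC3 library: each site gets its own slope, but the slopes are drawn from a shared distribution, so data-poor sites borrow strength from data-rich ones. The response and driver names (a generic response y and driver x) and all numeric values are illustrative assumptions, not a published model.

```python
# Minimal hierarchical (partially pooled) regression sketch with PyMC3.
# Simulated multi-site data; all names and values are illustrative.
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(42)
n_sites, obs_per_site = 8, 30
site = np.repeat(np.arange(n_sites), obs_per_site)   # site index per observation
x = rng.normal(size=site.size)                       # e.g., a climate driver
true_slopes = rng.normal(0.8, 0.3, size=n_sites)     # site-level slopes to recover
y = 2.0 + true_slopes[site] * x + rng.normal(0, 0.5, size=site.size)

with pm.Model() as model:
    mu_beta = pm.Normal("mu_beta", mu=0.0, sd=10.0)    # cross-site mean slope
    sigma_beta = pm.HalfNormal("sigma_beta", sd=5.0)   # between-site slope variation
    beta = pm.Normal("beta", mu=mu_beta, sd=sigma_beta, shape=n_sites)
    alpha = pm.Normal("alpha", mu=0.0, sd=10.0)        # shared intercept
    sigma = pm.HalfNormal("sigma", sd=5.0)             # residual scale
    pm.Normal("y_obs", mu=alpha + beta[site] * x, sd=sigma, observed=y)
    trace = pm.sample(1000, tune=1000, chains=2, random_seed=42)

print(pm.summary(trace))
```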

However, when a priori understanding is insufficient to describe a process model, data mining approaches can reveal robust, multivariate patterns, identify important drivers, and generate predictions from the data themselves (Hochachka and others 2007). In ecology, exploratory approaches have historically meant making keen observations of our environment to identify ecological relationships and generate testable hypotheses. Data visualization and mining approaches have also been used as a step towards confirmatory (statistical inference) approaches. More formally, data mining refers to examining large, multivariate datasets to identify robust data patterns (for example, clusters of data points, anomalous signatures in data, or dependencies among variables) that are worth following up through focal hypothesis testing (confirmatory analysis). The use of computational algorithms, including ensemble regression and classification trees (Breiman 2001; Elith and others 2008) or association rule mining (Faust and Raes 2012), circumvents our limited ability to assign interactions and statistical distributions a priori to (possibly very) large numbers of variables. Such approaches are increasingly needed to draw insights across systems that are highly complex, or where replication is intractable, but they require some reorientation in our thinking about data analysis (Breiman 2001).
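
A short example can make the exploratory workflow concrete. The sketch below uses a random forest (an ensemble of regression trees; Breiman 2001) from scikit-learn to rank candidate drivers of a simulated response by importance, exactly the kind of pattern screening that should feed into, rather than replace, confirmatory analysis. The driver names and the simulated relationship are illustrative assumptions.

```python
# Exploratory data-mining sketch: rank candidate drivers with a random forest.
# Data are simulated; driver names are purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                              # five candidate drivers
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] ** 2 + rng.normal(0, 0.5, size=n)

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

drivers = ["temperature", "precipitation", "N_load", "land_use", "pH"]
for name, importance in sorted(zip(drivers, forest.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:>14s}  importance = {importance:.3f}")
print(f"out-of-bag R^2 = {forest.oob_score_:.3f}")       # honest internal validation
```

Note that the nonlinear driver (the squared term) is recovered without being specified a priori; that flexibility is the point of this class of methods.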

Looking Forward: The Next 10 Years

In an epoch of global change, determining our condition relative to the “safe operating space” of Earth’s planetary boundaries is of paramount importance (Rockstrom and others 2009; Steffen and others 2015). To do this, ecosystem scientists will be called upon to generate predictions and to identify solutions (Evans and others 2013). Indeed, a current focus for ecosystem scientists is to quantitatively predict how complex drivers interact to influence ecosystem change (Weathers and others 2016). Beyond the generation of new knowledge and insight, big-data products should help guide theoretical development for understanding how ecosystems function across spatial scales and how stable these functions are across time. Likewise, much remains to be learned about which ecosystem properties are actually predictable, and at what scales (Evans and others 2013; Mouquet and others 2015; Petchey and others 2015).

There are multiple sources of uncertainty in big-data products and in the models that use them (Schaefer and others 2012; Raczka and others 2013; Xiao and others 2014), and ecosystem scientists will continue to work to define and understand uncertainty in data, models, and predictions. Veracity is often listed as one of the V’s defining big data (Lovelace and others 2016). Although veracity is critical for defining the boundaries of how to use and interpret big data, our focus on veracity goes beyond the traditional definitions of measurement error and detection limits. Ecosystem management decisions based on big-data analytics will depend critically on data reliability across different spatial and temporal scales. Eddy covariance measurements from a single tower, for example, have veracity defined by sensor technology, gap-filling accuracy, and understanding of plant physiology and soil properties within the tower footprint (Falge and others 2001; Moffat and others 2007). However, the same data are less reliable for interpreting or predicting carbon dynamics at scales beyond that tower’s footprint. Concretely defining uncertainty in big data and incorporating it explicitly into big-data analytics is a current focus of ecosystem research that will ultimately determine the forecast horizon (Dietze and others 2013; Petchey and others 2015).
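
The idea of a forecast horizon can be illustrated with a toy ensemble. The sketch below propagates parameter uncertainty through a chaotic logistic map and reports the lead time at which ensemble spread exceeds a usefulness threshold; every number in it is an illustrative assumption rather than a calibrated ecosystem model.

```python
# Toy forecast-horizon sketch: propagate an uncertain parameter through a
# chaotic logistic map and find when ensemble spread exceeds a threshold.
# All values are illustrative assumptions, not calibrated estimates.
import numpy as np

rng = np.random.default_rng(1)
n_members, max_steps = 1000, 50
r = np.clip(rng.normal(3.9, 0.02, n_members), 3.8, 3.99)  # uncertain growth parameter
state = np.full(n_members, 0.4)                           # identical initial state
threshold = 0.1                                           # max useful ensemble SD

for step in range(1, max_steps + 1):
    state = r * state * (1.0 - state)                     # one logistic-map step
    if state.std() > threshold:
        print(f"forecast horizon ~ {step} steps (ensemble SD = {state.std():.3f})")
        break
else:
    print("ensemble spread stayed below the threshold over the full horizon")
```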

Ecosystem science as a discipline must forecast how changes at local to planetary scales affect human well-being, and scientific research has generated an incredible amount of data that can inform such forecasts (Boose and others 2007; Dietze and others 2013). Although mechanistic studies will remain the hallmark of hypothesis testing, network science approaches are increasingly important for deepening our understanding of variability, stochasticity, and uncertainty in large-scale ecosystem processes. The utility of data networks has been recognized through U.S. government infrastructure funding for the NEON (McDowell 2015) and FLUXNET networks. Even with these and other large investments in infrastructure, most ecosystem scientists have been slow to use this infrastructure (for example, DataONE) to access and curate big data. This usage lag reflects a growing need for greater technological capacity and workforce training to facilitate the visualization and analysis of diverse data. For example, technological infrastructure is needed to find and bring together existing disjointed datasets, and sustained support is needed to maintain and update these platforms. Additionally, the transfer of big data from a repository to a separate analytical platform is a non-trivial technical hurdle that impedes wider capacity for data exploration; developing tools that can be executed within the data discoverability platform itself is critical for increasing community use (Michener and others 2012). Funding sustained infrastructure to make data discoverable and usable will increase the returns on investments already made in ecosystem science, both through the near-term production of ecological insights (through testing of outstanding hypotheses and the generation of new hypotheses suggested by analysis of synthetic data) and by building the predictive capacity of ecosystem science.

Even with the best infrastructure, motivating a greater focus on big-data approaches will still require a shift in the way academia values scientific products that arise from big data in ecosystem ecology. For example, we need metrics that value the contribution of data products in addition to metrics designed to measure the impact of peer-reviewed publications, as well as recognition that analyses generating novel hypotheses from big data are as valuable as analyses designed to test a focused hypothesis; in fact, these two approaches should go hand in hand. Valuing the collaborative effort and skill set needed to contribute to and analyze big-data products as potentially equivalent to the design of novel, data-generating experiments that test a particular hypothesis will require a mental shift that may be unfamiliar to many and even uncomfortable for some (Breiman 2001). However, the scientific benefits and opportunities are clear. Scientific innovation has consistently been shown to scale with the size and diversity of research teams (Wuchty and others 2007; Jones and others 2008; Cheruvelil and others 2014; Read and others 2016). Those who have access to big data (integrating across the Vs in Figure 1) and can synthesize those data to generate new hypotheses and models are well positioned to derive inference at the scales necessary to understand ecosystem function, and to generate forecasts that can inform management and promote stability in a changing global environment.