Availability and coverage of hydrologic data in the US geological survey National Water Information System (NWIS) and US Environmental Protection Agency Storage and Retrieval System (STORET)
- First Online:
- 557 Downloads
The need for a unified and improved data access system for the nation’s vast hydrologic data holdings has increased over the past few years as researchers strive for better understanding the human impact on the nation’s water cycle. Large mission oriented data repositories such as the USGS’ National Water Information System (NWIS) and EPA’s Storage and Retrieval System (EPA STORET) play a crucial role in providing a substantial amount of the nationwide coverage, however they do differ regionally in terms of coverage (parameters) and geospatial data density. Besides the differences in geographic distribution, repositories tend to undergo changes in mission statements and as such have different foci in their data collection activities that change as time progresses. This paper places the two water information systems next to each in an attempt to work out the differences in terms of coverage and content and how they complement each other when overlaid. This is done through the use of a number the CUAHSI Hydrologic Information Systems components, namely a web-service suite called WaterOneFlow that permits interrogation of the available data content of a national water metadata catalogue into which these two information systems have been integrated.
KeywordsDatabases Data Coverage NWIS STORET WaterOneFlow Web-services
The data spectrum that hydrologists are in interested in by nature is extremely diverse and tends to span multiple time and spatial scales. As a result, it is difficult to actually define a precise limit or threshold that would identify data to be important for hydrology versus labeling it irrelevant. In addition, the ease of access to the data sources as well as the need to having to visit several of them to acquire all the data necessary to conduct an investigation, typically has been (and continuous to be) quite time consuming. Not surprisingly, in 2004 a survey carried out by the Consortium of the Universities for the Advancement of Hydrologic Science Inc., CUAHSI Hydrologic Information System project, HIS (Bandaragoda et al. 2006), returned, among many desires, as the number one wish formulated by hydrologists: “…better and easier access to hydrologic data.”
In response to this initial survey and after continued consulting of the hydrologic community (mostly academic at the time, but now more and more embracing other disciplines as well as private and governmental sectors) CUAHSI-HIS has embarked on developing cyberinfrastructure, CI, to respond to the voiced needs (Tarboton et al. 2007, Beran and Piasecki 2008, Zaslavsky et al. 2007). While the envisioned suite of CI developments is quite extensive (Maidment 2005), and concerns modeling, data access, data description, data storage, and higher end concepts like the development of a purely digital representation of a watershed including all its data contents as well as processes, the initial focus has been to work on a “WaterOneFlow” web services environment (Whiteaker et al. 2007) that permits a programmatic access to a national water metadata catalogue via the internet. Current web-services include a getSites, getParameter, and getValue procedure that permit querying and retrieving of national, regional, and local data sets that can be consumed in end user application using a standard return format (WaterML) encoded in XML.
A central component of the CUAHSI HIS developments is the point observations data model which in turn has been realized as a relational database (Tarboton et al. 2007). This data base model is used for the central water metadata catalogue (compiled and hosted at the San Diego Supercomputer Center) and acts as the underlying data information storage. The definition of the web-services and the standardized return messages (in WaterML format) is beyond the scope of this paper, but suffice it to say that these web-service signatures are designed such that they mimic each other regardless the data source. These web-services in turn can be invoked by back end applications, for example Hydroseek (Beran 2007) which was developed to provide uniform access to multiple local/national hydrologic data repositories.
While hydrologic data collection efforts are carried out by many entities, be it federal agencies, state or local agencies, NGOs, or individual investigators, the federal agencies typically have the largest scope in terms of number of different variables collected, and also their temporal and spatial coverage. Because it would be impossible to examine all data collection efforts, this paper focuses on the two biggest data stewards; the US Geological Survey (USGS) and the Environmental Protection Agency (EPA) which are responsible for the majority of the hydrologic and water quality data collected in the United States. EPA’s Storage and Retrieval System (STORET) (www.epa.gov/storet/) and USGS’ National Water Information System (NWIS) (http://waterdata.usgs.gov/nwis) make this data accessible through their web interfaces. While USGS is largely responsible for its own data collection efforts (however, it does receive finds from the US Corps of Engineers to collect data on USCOE’s behalf), EPA mostly receives data from other entities like state environmental agencies and tribes and thus largely acts as steward and manager for other collection efforts. However, because submission to STORET is not mandatory, not all the local/state agencies participate in STORET due to differences in business requirements. Those areas can be identified as bald spots in these agencies’ geographical coverage.
While there are overlaps in data collection activities, these two agencies have different foci, which reflects their different mission targets. STORET is a repository for predominantly water quality data (it does have some flow data also) while NWIS stores water quality data in addition to groundwater and stream data. Hence the two data repositories overlap on the water quality column and complement each other on stream flow and groundwater data. There have been several efforts by the two agencies to integrate their repositories at different levels over the past 6 years. EPA’s ‘Window to My Environment’ (WME) (http://www.epa.gov/enviro/wme/background.html) displays stations in both STORET and NWIS system on an interactive map. However WME does not deal with the heterogeneity problem. Thus for data discovery and retrieval the user is redirected to the respective data provider’s website. Following WME, Lockheed Martin Information Technology started a study for EPA and USGS (National Water Quality Monitoring Council Meeting Minutes, Durham, New Hampshire, July 26–28, 2005 http://water.usgs.gov/wicp/acwi/monitoring/minutes/nh_072605.html) which involves associating measurements of the two agencies with one another with the goal of solving the problems related to semantics.
The HIS CI activities first focused on the development of data access systems which necessitated the creation of extensive data source catalogs (a collection of detailed information about what the databases contain). These were compiled from both NWIS and STORET (and other much smaller data collections) using automated programmatic web-page parsing techniques. While the primary goal of harvesting was to create search engine indexes, it also made it possible to gain a deeper insight into these systems by examining the metadata and to deduce some basic understanding about the current content, spatial and temporal coverage, distinct differences between the systems, and the limitations of the two data sources.
The compiled catalogs, besides serving as a key to the WaterOneFlow environment, also provide considerable amount of information on the history of hydrologic data collection activities, data availability and effect of policy making on science. Both agencies deserve much praise for their willingness to cooperate in this endeavor. Several months after the initial harvesting of the NWIS site, USGS agreed to provide CUAHSI HIS developers with the data directly; which, besides making it faster, provided considerable relief from creating incorrect results due to errors created that were due to inconsistent contents in the respective databases. Ongoing discussions with EPA’s STORET team (status Spring 2008) will hopefully result in a similarly straight forward mechanism to retrieve and harvest STORET.
In order to execute the proposed study three steps were taken. First, the metadata catalogues for both databases were generated by using database dumps that were received from EPA (STORET) and USGS (NWIS) and then scattered into the underlying observation database model (ODM) metadata database. This step also included a sweep at correcting and supplying incorrect and missing metadata information on stations and their content. Second, the metadata catalogue was also stored in an OLAP cube (see later) for making it possible to rearrange the tables containing the metadata corresponding to desired queries and views. Third, the ODM metadata catalogue was then interrogated using the WaterOneFlow webservices described earlier to extract site data and content and coverage information.
Availability of geographic identifiers for stations in EPA STORET
Total number of sites
Sites with geographic coordinates
Sites with state/county information
Sites with hydrologic unit codes
From Table 1 it is clear that about 500 stations in the database while containing data cannot be discovered using a latitude/longitude based search box. The majority of these stations have been migrated from Legacy STORET. These stations are often geo-referenced by descriptions like; “CORNER OF CANAL NEAREST INTERSECTION OF 42ND AVE AND 37TH ST S, ST PETERSBURG”. Some 800 stations cannot be searched based on state–county information, while approximately only half of them actually have an accompanying Hydrologic Unit Code (HUC) stored alongside the site information. While 500 stations may not seem a lot compared to a total of approximately 274,900 it may be just those one or two stations that would have held crucial information. In addition, state information can be provided using abbreviations or state names or Federal Information Processing Standards (FIPS, www.nist.gov/itl/fipspubs) codes. The NWIS in contrast uses all three state, county and country FIPS codes where applicable. Geographic identifiers as used in Table 1 are available for all NWIS sites except about 500 that are missing geographic coordinates, about 1,800 that do not have state/county information, and 79,192 stations, out of approximately 1.7 million stations total, that are missing hydrologic unit codes (HUC) as of April 2007. Yet, it serves to demonstrate that these undoubtedly very extensive and valuable databases are not free of inconsistencies and missing data pieces. Missing geographic identifiers were computed using ancillary information. HUCs and states were assigned based on coordinates while sites in the ocean or estuaries were associated with the nearest state. When coordinates were not available, site and/or local data collection agency names were used to identify the state the site belongs in.
Site type mappings used in the analysis
STORET site type
Great Lake, lake, reservoir
Canal-transport, Waste pit, wetland
NWIS site type
Stream, stream-diversion, stream-outfall
Lake/reservoir + meteorological
Spring, ground-water other than spring
Land application, outfall
A similar heterogeneity problem exists concerning measured parameter names. Both NWIS and STORET individually use approximately 10,000 variable codes to identify their measurements. This means that one can find about 10,000 variables that are common to both systems albeit different using code structures. Approximately 725 parameters in NWIS have no match in STORET, and approximately 850 parameters in NWIS have no match (personal communication with EPA on the Water Quality Exchange effort, WQX 2008).
Because there are thousands of different parameters between the two databases it was decided to use a broader classification system so that a state-by-state comparison including parameter coverage could be carried out with numbers that can be compiled into concise graphics without overburdening the image. The parameters were classified into 10 categories: Streamflow, Stage/Gage Height, Reservoir Storage, Groundwater Flow, Groundwater Level, Water Quality, Soil Quality, Meteorology, Oceanography and Other. Streamflow, Stage/Gage Height, Reservoir Storage, Groundwater flow and Groundwater Level categories contain parameters related to quantity of surface- and groundwater. Water Quality and Soil Quality parameters contain physical, chemical and biological environmental quality parameters and the Meteorology category contains measurements related to precipitation, evaporation, air temperature, barometric pressure, winds, solar radiation and heat fluxes. “Other” represents ancillary data, i.e. parameters such as ‘Project Code’, ‘Sampler Type’ and ‘Location in cross-section’. In order to be able to analyze the considerable amount of data that is available, Online Analytical Processing (OLAP) (Chaudhuri and Dayal, 1997) technology was employed. OLAP uses a multidimensional data model, allowing for complex queries with significantly reduced execution times. This model supports pre-computed aggregations of records such that summaries based on a certain data attributes can easily be created. Maps were generated using ArcMap based on data from OLAP data cubes.
This also seems to suggest that from a research and science interest point of view groundwater figures prominently on the science agenda (particularly within USGS), while all other water bodies suffer from neglect or disinterest. This is of course not true as is it also important to take a close look at how many data channels each site operates, i.e. how many different parameters are being collected at the same site and thus what the data diversity is. Last but not least it should not be forgotten that different environmental monitoring needs require different densities of instrument distribution as well as sampling or measurement intervals. Environmental processes (hydrologic ones being a subarea) take place at different temporal and spatial scales that vastly influence the number of stations necessary to achieve a certain monitoring objective. In view of this it is perhaps clear why NWIS features so many groundwater well sites, because hydrologic sub-surface processes are difficult to access, they happen over large spatial and temporal scales, and they can exhibit substantial variation over smaller scales all of which contribute to the need to having a relatively dense array of wells. The question remains however, how much does the groundwater bin contribute to the overall data amount.
Above graphs document what both NWIS and STORET actually hold in terms of pure storage. However, even though all sites are accessible and will appear on the data access portals when queried, many do not offer up any data. One such site is the groundwater monitoring site 374028101001001 (located in Haskell County, Kansas, http://waterdata.usgs.gov/nwis/inventory/?site_no=374028101001001), which simply states that no data is available and that the state office needs to be contacted. Hence, while data is present it is not necessarily available to the public without special request. This distorts the data assessment and availability considerably (as presented in Fig. 4) and also somewhat explains the disproportional number of groundwater sites when compared to the total data they hold. Hence, a new analysis was conducted to extract only those sites that actually have something to offer when queried.
Available data sites and average data densities in NWIS and STORET
Number of sites
Number of records
Av. data density
Number of sites
Number of records
Av. data density
It is clear from Fig. 7 that collection of environmental data was literally nonexistent from 1800 to 1875, Colorado being the only state having data that reaches back that far. While coverage increased to encompass the entire continental US by 1900. Even by 1925 density, i.e. sites per state, was still very low—a clear indicator that during the period of increased industrialization, development, expansion, and settlement of the continent, little effort was spent on data collection programs that would monitor environmental health. Even by 1950 the density of sites has not increased by much, while the period from 1975 to 2007 has seen a dramatic increase in operational sites, no doubt also a result of legislative work like the Clean Air Act (1963), and the Clean Water Act (1977). Notice that the above figure, while certainly indicative, is not entirely accurate for reasons mentioned before: states like New Mexico, Oklahoma, Vermont and Maine (or even Texas) seem to have received less attention which is not necessarily true as they may have organizations that, while having substantial data collections, have not contributed to either NWIS or STORET (like the TCEQ in Texas).
It is clear from both figures that the number of active sites has decreased over the past 20 years (for NWIS) and 10 years for (STORET). Even though some of the funding for maintaining gages comes from cost-sharing partnerships between the USGS and more than 800 state, local, tribal, and other agencies, (Christen 2005) and EPA STORET having 305 similar partnerships, the primary reason appears to be that budget allocations have not kept up with needs to maintain all sites. While some fluctuations in the number of active sites are not unexpected, both USGS and EPA funds allocated for operating and maintaining monitoring sites have decreased (Ursery 2004; Renner 2007; Stokstad 2001; Bush 1992; Showstack 2004, Rogers, 2001) despite small and continuous increases in the overall budgets. For example, the water portion of the USGS budget has seen a slight increase of 0.6% (from 202.83 million in FY 2002, to 207.15 million in FY 2003), (from USGS Office of Budget http://www.usgs.gov/budget/funding_tables.asp). However, mandated pay raises in excess of 0.6% have effectively reduced the available funds for the monitoring programs.
While other reasons may have contributed to the decrease in available funding like the fact that some sites may have fulfilled their “duty” in collecting data, or a realignment of monitoring sites, or the addition of strategically placed sites that could have resulted in eliminating a larger number of sites previously needed to achieve the same monitoring objective, the primary reason is clearly budget related. In view of increasing demands for environmental protection (for example the Total Maximum Daily Load, TMDL, development), the need to look into the better understanding of interrelated processes at larger time and spatial scales, impacts of global warming and associated environmental changes on all scales, seem to suggest that this trend points into the opposite direction from where it should be heading. While an effort like that of CUAHSI to bring together as many data sources into a single data access environment and thus providing a means to amend the data contained in STORET and NWIS, it is clear that both NWIS and STORET are two of the most important databases this nation has for surface (and sub-surface) water point-monitoring data. A strong support for these two data stores will remain essential in providing nationwide coverage that serves as the foundation of all other data that can be added regionally to this base set.
▪ USGS NWIS has about 6 times as many stations as compared to STORET (1.7 million vs. 274,000). However, most of the stations are dedicated to groundwater monitoring (about 90% of all sites) while the remainder is largely dedicated to stream flow monitoring. STORET on the other side has a clear focus on collecting stream water quality data (it makes up ~40% of all sites even more when counting the lakes and reservoirs).
▪ There is a clearly defined geographical disparity in coverage, i.e. not all states enjoy the same number of sites. While this is not entirely surprising as small states should not necessarily have the same number of sites when compared to states that are 15–20 times larger, it does not fully explain the differences, i.e. similarly sized states still show large differences. USGS NWIS tends to have more coverage in the western states than in the eastern states, while EPA STORET has three states (Nevada, Florida, and Minnesota) that have significantly larger number of sites than the others. However, one needs to realize that low coverage is not synonymous with lack of data, as data may reside in other data centers, like TCEQ in Texas, an institution that does not participate in disseminating their data collections within NWIS or STORET.
▪ Even though the bulk of sites are related to groundwater monitoring, the bulk of actual data belongs in the stream monitoring bin. In other words, the stream monitoring sites collect a lot more data in terms of variety and also in terms of collecting frequency. 75% of NWIS and STORET combined data is dedicated to stream flow measurements (NWIS holding the bulk of this information). About 20% of the total is related water quality parameters (STORET holding about as many records as NWIS) while the rest is split among the other 8 bins.
▪ The database structure of both NWIS and STORET is not visible to the public, i.e. the total amount of data and ancillary information is not disclosed. NWIS and STORET each have a clearly defined access portal or window that is open to the public which limits access to a subset of the entire holdings. In fact, 66% of all sites offer up data that can be retrieved. The other 33%, while visible as existing stations, do not allow data querying and data must be requested (however, these are mostly groundwater monitoring sites and only marginally affect the number of stream and WQ sites). This limits the total number of immediately “useful” sites to about 1.17 million sites.
▪ While records reach back to the 1800s’ significant expansion of both databases took place in the 1970s continuing though the 1980s and 1990s. Coverage over time has reached all states in the US, even though regional differences are noticeable, i.e. certain states and watersheds have a fairly dense network of stations while others do not. The peak of available stations was reached in the mid 1990s with approximately 46,000 that were active. However, largely due to budgetary constraints in both host institutions, the number of active (data collecting) sites has continuously decreased over the last 10 to 15 years, a trend that is unfortunate. Particularly in view of increased public environmental awareness and realization that fresh water is not a commodity with endless supplies, one would expect monitoring efforts to increase not decrease. Both databases are viewed in the research communities as the two key data sources for water quality and stream data with nationwide coverage and continuation of these sources is vital for science as well as regulatory purposes. The CUAHSI HIS effort seeks to place these two data bases and their content in the spatial and temporal domain next to smaller data holdings thus hoping to produce a more complete and comprehensive data space that can serve a base for future expansion with specific foci.
The work presented in this paper was sponsored by the National Science Foundation through grant numbers 0609832 and 0412904. The authors would also like to acknowledge the Consortium for the Advancement of the Hydrologic Sciences, Inc. (CUAHSI) for their support and contribution in developing the ideas in this manuscript.