Space Physics Interactive Data Resource—SPIDR
- Cite this article as:
- Zhizhin, M., Kihn, E., Redmon, R. et al. Earth Sci Inform (2008) 1: 79. doi:10.1007/s12145-008-0012-5
SPIDR (Space Physics Interactive Data Resource) is a standard data source for solar-terrestrial physics, functioning within the framework of the ICSU World Data Centers. It is a distributed database and application server network, built to select, visualize and model historical space weather data distributed across the Internet. SPIDR can work as a fully-functional web-application (portal) or as a grid of web-services, providing functions for other applications to access its data holdings.
Keywords: Grid data mining, Phenomena-based subsetting, Product generation, Satellite data, Space weather
The World Data Center (WDC) system (ICSU 1996) was created by the International Council of Scientific Unions (ICSU) to archive and distribute data collected from the observational programs of the 1957–1958 International Geophysical Year (IGY). Originally established during the IGY in the United States, Europe, Russia, and Japan, the WDC system has since expanded to other countries and to new scientific disciplines. The WDC system now includes 52 centers in 12 countries. Its holdings include a wide range of solar, geophysical, environmental, and human dimensions data. These data cover timescales ranging from seconds to millennia, and they provide baseline information for research in many ICSU disciplines, especially for monitoring changes in the geosphere and biosphere, whether gradual or sudden, foreseen or unexpected, natural or man-made.
Since the IGY, technological advances have transformed the gathering and exchange of data. The WDCs started with paper tables and magnetic tapes, with total data holdings of ∼1 Gb and annual data traffic of ∼1 Mb/year. In the early 1990s they switched to the Internet FTP and HTTP protocols for regular environmental data exchange on a global scale, with digital archives of ∼1 Tb and traffic of ∼1 Gb/year. We project that for the Electronic Geophysical Year (eGY) in 2007 the worldwide Grid of WDC digital archives will reach petabyte size with terabyte-scale annual network traffic.
The Space Physics Interactive Data Resource (SPIDR) (http://spidr.ngdc.noaa.gov), originally developed in 1995 as a demonstration for the international Global Observation and Information Network (GOIN) project, is now a standard data source for solar-terrestrial physics, functioning within the framework of the World Data Centers for Solar-Terrestrial Physics. SPIDR has gone through four versions, including a complete rebuild using exclusively open-source tools between versions one and two. It is a distributed database and application server network, built to select, visualize and model historical space weather data spanning hundreds of years and distributed across the Internet. SPIDR can work as a fully functional web application (portal) or as a Grid of web services, providing functions for other applications to access its data holdings.
Currently SPIDR archives include solar activity and solar wind data; geomagnetic, ionospheric, and cosmic ray data; radio-telescope ground observations; and telemetry and images from NOAA, NASA, and DMSP satellites. SPIDR portals, databases and services are installed in the USA, Russia, China, Japan, Australia, South Africa, India, France and Ukraine. SPIDR has more than 20,000 registered users worldwide and a daily load of about 100 user sessions per site. SPIDR customers are predominantly academic and U.S. government users, but about 20% come from the commercial sector. SPIDR data and technology have applications in environmental data sharing, visualization and mining, not only in space physics but also in other environmental fields such as seismology, GPS measurements, and tsunami warning systems.
Background and related work
The availability of a scalable virtual organization proxy mechanism for individual users based on digital certificates used by Grid for secure access and authentication with multiple distributed resources compared to the local portal user-password authentication (Foster 2001);
A data request and/or processing from large environmental archives may take quite a while even if we specifically optimize the database structure and the processing algorithms for this type of request, and a synchronous web-services call mechanism is not always appropriate to handle data requests which involve a long time delay (Barkstrom et al. 2003).
The SPIDR system concept is similar to several emerging technologies for data access in the environmental sciences. Notable among these are Unidata’s Thematic Real-time Environmental Distributed Data Service (THREDDS) (Domenico et al. 2002), the Environmental Scenario Generator (ESG) from the USAF (Kihn et al. 2004), and the Coordinated Data Analysis Web (CDAWeb) from NASA (http://cdaweb.gsfc.nasa.gov).
THREDDS, like SPIDR, allows for data held in remote repositories, but it supports a different set of formats in that storage (notably netCDF). Both THREDDS and CDAWeb are file-system based repositories; SPIDR supports integration via plugin-based data virtualization from databases and data file repositories. Once an archive is made available in SPIDR, THREDDS or the CDAWeb system, it is automatically cataloged to generate metadata and that metadata is made available in an XML format to support cross system searching.
These data access systems use a common middle-layer computational data model. That is, the data are mapped from their storage format to a standard model which is used by the functionality in the middle layer (visualization, data mining, sub-setting, etc.). The THREDDS server uses the OpenDAP data model (Gallagher et al. 2007), the ESG uses the Five Dimensional Representation (FDR), and CDAWeb uses the data model imposed by the Common Data Format (CDF) (http://cdf.gsfc.nasa.gov). All these data models can be mapped to each other (with some reservations) and follow a pattern similar to that of the SPIDR data; that is, they represent the data in a model based on time, space, and parameter. The web services architecture style used by the data access layer can be either REST¹ (THREDDS) (Fielding 2000) or SOAP (ESG, CDAWeb) (Loughran and Smith 2005).
The SPIDR system architecture has the following main components: a web application (portal), metadata registry (Virtual Observatory), visualization and data mining engines, and a Grid of virtual data sources, which are exposed to the external clients, including the SPIDR portal, via data query and inject web services. Behind a data source’s web service one can have a database, a set of files in a local server file system, or a set of URLs to remote data sources. SPIDR is an Open Source project (http://sourceforge.net/projects/spidr).
A web portal serves as an agent between the user and the Grid of environmental data sources. It performs two main functions. The first function is metadata management, which allows for fast and efficient catalog-level metadata search. Here by catalog-level metadata we mean general descriptions of data resources, stored as a managed collection of XML documents with known XML schemas, including owner information, geographic coverage, time coverage, data description, and visualization methods (FGDC 1998). Our catalog-level metadata collection works much the same as other similar resources, including the Global Change Master Directory (GCMD 2008) from NASA (http://gcmd.nasa.gov) and the Master Environmental Library (MEL) from the US Defense Modeling and Simulation Office (Siquig and Lowe 1996). For a more detailed discussion of the role of metadata in distributed data networks see Nieto-Santisteban et al. (2004).
The SPIDR portal combines a central metadata registry with a set of distributed data web services, web map services, and replica sets of data files. A user can search catalog-level metadata and inventory, use a persistent data basket to save the selection between sessions, and plot or download the selected data in different formats, including ASCII text files, XML² and NetCDF. A database administrator can upload files into the SPIDR databases using either a web services or a web portal interface.
Metadata registry and data inventory
Both the catalog- and granule-level metadata records, which contain respectively a general description and a detailed inventory of SPIDR data resources, can be updated either manually by a system administrator or automatically by the data robot collecting records from the data grid (see Fig. 2). The catalog-level metadata registry uses a native XML database backend based on the open-source product eXist (Meier 2006). The metadata engine has no predefined XML schema; it is possible to have different metadata schemas for different data categories. For example, data sources with spatial content, such as OpenGIS Web Map Services (http://www.opengeospatial.org/standards/wms) and time series databases with ground observations, can use the FGDC metadata schema (FGDC 1998), while databases with satellite telemetry can have SPASE-formatted metadata records (SPASE 2006). The SPIDR high-level metadata engine has extended search capabilities that allow searching within specific metadata elements (keyword, title, provider, etc.) using a REST web-service API. In addition, it supports Web 2.0 style³ functionality with content versioning and moderation, direct collaborative editing of the metadata records at the SPIDR portal, a user discussion forum, e-mail and internal messaging, and wiki-style documentation and system help. SPIDR catalog-level metadata search and export functions are implemented using the open-source VxOware Virtual Observatory framework (http://sourceforge.net/projects/vxoware).
The SPIDR granule-level inventory metadata registry uses an SQL database backend built on the open-source product MySQL (MySQL 2004). The main purpose of the inventory is to list the available parameters and stations from each database with some granularity in time, currently monthly. That is, it records whether a given station has any data for a given month. This information is needed for early validation of data requests, both for availability and for the size of the data export, and for comparison of data holdings at different SPIDR nodes for database synchronization. When new data are added to SPIDR, the inventory can be updated either in real time or by periodic queries of the corresponding data source, depending on the input data load. Whenever the inventory metadata is updated, the inventory summary, such as the station and parameter list with maximum date ranges, is passed on to update the corresponding catalog-level metadata.
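The early validation step described above can be sketched as follows. This is a minimal illustration in Python, not SPIDR's actual schema or code: the in-memory `INVENTORY` dictionary stands in for the MySQL inventory tables, and the table name, station code, element names and month sets are all made-up sample values.

```python
from datetime import date

# Hypothetical inventory: (table, station, element) -> set of (year, month)
# for which at least one observation exists. Sample values only.
INVENTORY = {
    ("geomag_minute", "BOU", "H"): {(2001, 10), (2001, 11), (2001, 12)},
    ("geomag_minute", "BOU", "Z"): {(2001, 11)},
}


def months_between(start: date, end: date):
    """Yield every (year, month) touched by the closed interval [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def validate_request(table, station, element, start, end):
    """Return the requested months that hold data (empty list if none)."""
    available = INVENTORY.get((table, station, element), set())
    return [m for m in months_between(start, end) if m in available]


covered = validate_request(
    "geomag_minute", "BOU", "H", date(2001, 9, 15), date(2001, 11, 7)
)
print(covered)
```

A request that returns an empty list can be rejected before any data source is queried, and the number of covered months gives a rough upper bound on the export size.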
Grid of virtual data sources
SPIDR data archives are logically organized into thematic groups called viewGroups (e.g. Geomagnetic Indices, or Interplanetary Magnetic Field). Each viewGroup may include several databases or groups of database tables, which we call tables. Each table may be considered a virtual database with a single configuration file describing the access mode (local or remote) together with a URL and details required for accessing the resource.
Regarding the software architecture, each table is served by a set of Java classes for data selection, insertion and update that implement the same interface. The corresponding class names are also listed in the table configuration file. These classes are dynamically loaded by the web service container and used by the SPIDR system in place of the “abstract” data access classes that reside in the SPIDR core and are used for any “generic” table. Each table is composed of several elements, which represent scalar observables of the space environment, such as the Bx, By and Bz (east, south, vertical) components of the interplanetary magnetic field. In many cases we also need a station (observatory or satellite) to define the scalar observable. Element values vary in time, so within a given time interval and for a given station the element defines a time series which can be plotted or exported from SPIDR.
The standard procedure for adding a new database as a SPIDR data source involves implementation of the metadata manager and data export classes (we call them classGetter and classMetadataUpdate) and optional classes for parsing input and writing output data in special formats, such as WDC format for geomagnetic variations or SAO records from ionosondes.
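The plugin pattern described above (a per-table configuration naming the class that serves the data, with a generic fallback) can be sketched in a few lines. SPIDR itself uses Java classes loaded by the web service container; the Python below is only an illustration of the idea, and every name in it (`DataGetter`, `GeomagGetter`, the configuration keys other than `classGetter`, the URL) is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod


class DataGetter(ABC):
    """Common interface that every table-specific getter implements."""

    @abstractmethod
    def get(self, element, station, start, end):
        """Return a list of (time, value) pairs."""


class GenericGetter(DataGetter):
    """Fallback used for any table without a specialised class."""

    def get(self, element, station, start, end):
        return []


class GeomagGetter(DataGetter):
    """Hypothetical getter for a geomagnetic table; returns dummy data."""

    def get(self, element, station, start, end):
        return [(start, 0.0)]


# Name -> class lookup. A Java container would resolve the name by
# reflection; a plain registry keeps this sketch self-contained.
REGISTRY = {cls.__name__: cls for cls in (GenericGetter, GeomagGetter)}

# Hypothetical table configuration (in SPIDR this is a configuration file).
TABLE_CONFIG = {
    "table": "Geom",
    "mode": "remote",
    "url": "http://example.org/geomag",  # placeholder, not a real service
    "classGetter": "GeomagGetter",
}


def load_getter(config):
    """Instantiate the getter named in the config, or the generic one."""
    name = config.get("classGetter", "GenericGetter")
    return REGISTRY[name]()


getter = load_getter(TABLE_CONFIG)
print(type(getter).__name__)
```

Because callers only depend on the shared interface, adding a new data source means writing one class and one configuration entry, with no change to the core.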
Get a metadata record for a given viewGroup (Catalog Level Metadata WS).
For a given table, element, station and date interval get a data inventory and export data values in a variety of scientific formats, including XML and NetCDF (Inventory Level Metadata and Data Source WS).
Load several “standard” data files of several scientific formats into the database (Data Sink WS).
Synchronize two SPIDR archives by exporting data from one archive and loading into another (WS orchestration).
Data source web-service
Common data model (CDM) serialization formats include direct Java object serialization, XML, NetCDF, and for some databases also special formats introduced by the data user community. For example, geomagnetic field variations can be exported in WDC or Intermagnet formats. In any case, a SPIDR data source service will supply metadata describing parameter names, units of measure, visualization options, etc., together with the data accreditation describing the data origin. For geomagnetic variations the data accreditation describes the observatory which provided the data to SPIDR.
Data query options (time interval, data source, parameters, stations) are saved in the user basket, so the data can be re-selected in the future. Because of the real-time nature of the SPIDR databases, the data selection itself is transient, so in the next session the user may find different (updated) observations in the data basket. All data selection queries are logged, so the SPIDR administrator can view not only user session statistics but also the frequency of data requests by source.
Satellite data granules and image archive web-services
Remote sensing and imagery databases have a different data model as compared to a sequential database. Usually the data collection is divided into “elementary” blocks called granules. A granule can be a daily set of solar images from different observatories, or a fixed-length section of satellite orbit with Earth observations in different spectral bands.
All the granule-based web-services in SPIDR have the same design pattern. The user’s data export request specifies the date range and type of the image. The web-service returns a list of granules with metadata and links to the preview and high-resolution images or binary files for granule data products like DMSP satellite SSJ/4 sensor readings.
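The request and response shape of such a granule service can be illustrated with a small in-memory sketch. The `Granule` fields, the catalog entries and the `example.org` links are invented for illustration and do not reflect the actual SPIDR response format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Granule:
    day: date          # date covered by the granule
    image_type: str    # e.g. a sensor or product name
    preview_url: str   # link to a low-resolution preview
    data_url: str      # link to the full-resolution image or binary file


# Hypothetical catalog; all links are placeholders.
CATALOG = [
    Granule(date(2001, 11, 5), "visible",
            "https://example.org/p/1.png", "https://example.org/d/1.bin"),
    Granule(date(2001, 11, 6), "visible",
            "https://example.org/p/2.png", "https://example.org/d/2.bin"),
    Granule(date(2001, 11, 6), "infrared",
            "https://example.org/p/3.png", "https://example.org/d/3.bin"),
    Granule(date(2001, 11, 9), "visible",
            "https://example.org/p/4.png", "https://example.org/d/4.bin"),
]


def find_granules(catalog, start, end, image_type):
    """Return granules of one type whose day lies in [start, end]."""
    return [
        g for g in catalog
        if start <= g.day <= end and g.image_type == image_type
    ]


hits = find_granules(CATALOG, date(2001, 11, 5), date(2001, 11, 7), "visible")
for g in hits:
    print(g.day.isoformat(), g.preview_url, g.data_url)
```

The service returns only metadata and links, so the client decides which previews or full-resolution files to fetch.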
Fuzzy search web-service
SPIDR has a fuzzy search web-service which is an implementation of the Environmental Scenario Search Engine (Zhizhin et al. 2006). The web-service input is a combination of fuzzy conditions like “very low”, “average”, or “in the range between x1 and x2” to be applied to some space weather parameter values; in effect it is a formalized description of an environmental event. The fuzzy search web-service is not linked to a specific type of data source, which makes the data mining very flexible: one can search for an event over several parameters exported by different data sources. The output of this service is a time series with values between 0 and 1 giving the fuzzy likeliness of the occurrence of the event at every moment in time. Output scores above a user-specified threshold can be used as a filter for the selection of another time series.
The highest scores in the fuzzy search web-service output can be used as indications of single occurrences of the desired event. For example, a very important space weather event called a magnetic storm can be defined by the fuzzy logic expression “(low (DST)) and (high (Kp))”. A SPIDR search for the year 2001 yields the event of November 6, 2001 with the Kp and DST plotted in Fig. 6. In that case meteorological satellite night-time images selected with this timing can be used for independent verification of the magnetic storm conditions in the event by showing aurora in the polar region. This is exactly the way the DMSP satellite orbits were filtered to find the clear auroral images presented in Fig. 8.
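To make the expression “(low (DST)) and (high (Kp))” concrete, the sketch below scores a short series of (DST, Kp) pairs. It is an illustration under stated assumptions, not the ESSE implementation: the linear ramp membership functions, the breakpoints (-100 and -30 nT for DST, 4 and 7 for Kp), the use of the minimum as the fuzzy “and”, and the sample values are all choices made here for the example.

```python
def ramp_down(x, lo, hi):
    """Membership for 'low': 1 at or below lo, 0 at or above hi, linear between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def ramp_up(x, lo, hi):
    """Membership for 'high': the complement of ramp_down on the same range."""
    return 1.0 - ramp_down(x, lo, hi)


def storm_score(dst, kp):
    """Fuzzy score for '(low DST) and (high Kp)'; 'and' is taken as the minimum."""
    low_dst = ramp_down(dst, -100.0, -30.0)  # assumed breakpoints, in nT
    high_kp = ramp_up(kp, 4.0, 7.0)          # assumed breakpoints
    return min(low_dst, high_kp)


# Made-up (DST, Kp) samples: quiet, moderately disturbed, strongly disturbed.
series = [(-10.0, 2.0), (-60.0, 5.0), (-250.0, 8.0)]
scores = [storm_score(dst, kp) for dst, kp in series]

# Keep only the time steps whose score passes a user-chosen threshold.
threshold = 0.8
hits = [i for i, s in enumerate(scores) if s >= threshold]
print(scores, hits)
```

The list of indices in `hits` plays the role of the filter mentioned above: it can be used to select matching intervals from any other time series, such as satellite image granules.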
Data sink web-service
SPIDR databases are self-synchronizing. The synchronization has both push and pull modes and is based on the data source and data sink web-services. In push mode, when a new data set is successfully parsed and loaded into a database at one of the SPIDR nodes (we call it “master”) using the data sink web service, all other nodes which are subscribed to this data stream (we call them “slaves”) receive the same set of data exported from the “master” node using the data source web service. Each SPIDR node can be either “master” or “slave” depending on whether it receives data from external sources or from another node. Such peer-to-peer synchronization via web-services CDM object exchange has many advantages for a heterogeneous distributed system, where SPIDR nodes can run different operating systems, database engines, and network security policies. For a high volume of short input messages, we can use pull mode synchronization. In this case the “slave” node periodically calls the “master” data source and receives, say, the last day of observations as a single data set.
The SPIDR admin web interface has special tools to compare the same databases from several nodes and if necessary to order background synchronization from/to any of them. The inventory-level metadata from the “master” and “slave” nodes can be used to compare the data holdings and when there are any differences to start a background process at the “slave” node, which will pull the locally missing data from the “master” node using its data source web-service and load it into SPIDR by using the local data sink web-service.
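The inventory comparison and background pull can be sketched with two in-memory nodes. The `Node` class, its `source` and `sink` methods, and the sample data are hypothetical stand-ins for the data source and data sink web-services; only the overall flow (compare inventories, pull what is missing, load it locally) follows the description above.

```python
class Node:
    """Toy SPIDR node holding values keyed by (station, element, (year, month))."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def inventory(self):
        """Monthly inventory: (station, element) -> set of (year, month)."""
        inv = {}
        for station, element, month in self.store:
            inv.setdefault((station, element), set()).add(month)
        return inv

    def source(self, station, element, month):
        """Stand-in for the data source web-service: export one month."""
        return list(self.store[(station, element, month)])

    def sink(self, station, element, month, values):
        """Stand-in for the data sink web-service: load one month."""
        self.store[(station, element, month)] = list(values)


def pull_missing(master, slave):
    """Copy every month present on the master but absent on the slave."""
    slave_inv = slave.inventory()
    copied = []
    for (station, element), months in sorted(master.inventory().items()):
        missing = months - slave_inv.get((station, element), set())
        for month in sorted(missing):
            values = master.source(station, element, month)
            slave.sink(station, element, month, values)
            copied.append((station, element, month))
    return copied


master, slave = Node("master"), Node("slave")
master.sink("BOU", "H", (2001, 10), [1.0, 2.0])
master.sink("BOU", "H", (2001, 11), [3.0])
slave.sink("BOU", "H", (2001, 10), [1.0, 2.0])

print(pull_missing(master, slave))
```

Running `pull_missing` a second time copies nothing, which is what makes it safe to schedule as a periodic background job.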
This web-services based synchronization mechanism is a new step in the automation of data exchange between World Data Centers in different countries. The common data model used by all the SPIDR nodes eliminates unnecessary format translations when synchronizing databases at different nodes. The peer-to-peer push synchronization aligns with agency priorities by first loading data into the national “master” node and then exporting the data to a given list of subscribers abroad. The existence of several copies of the same database in a widely distributed network helps support long-term data preservation by protecting against local data loss, such as might occur in a natural disaster.
A virtual observatory (VO), a term now appearing within the scientific data community, is a distributed software system that allows users to find, access, and use resources from a collection of data repositories and service providers. A virtual observatory can provide either metadata or data services and is typically focused on presenting the collection of data, metadata and functional services to a given set of customers bound by a common interest. The virtual observatory is an implementation of what is typically called a service-oriented architecture (SOA), organized around a common theme.
Execute advanced search queries against environmental archives based not only on metadata but also on the data content;
Conduct content-based query and data retrieval from virtual observatories.
Generate on-the-fly products interactively using existing data and metadata, as well as conducting detailed analysis;
Expand their ability to use and incorporate data from disciplines other than their own.
The SPIDR system as implemented includes most of the elements of a VO, including a catalog-level metadata service. Catalog-level metadata follows the FGDC and SPASE standards. In addition, the metadata services within SPIDR provide inventory-level metadata (available observation dates) and more detailed metadata objects (station description and data manager reference).
Support of many possible levels of user interaction
Support of user community-centric views including
Portals to other VxO’s
Community specific on-line tutorials and document repositories
Forum, wiki and blog capabilities
Software libraries and downloadable tools.
The SPIDR user interface implements all of these criteria by using the Apache Struts⁵ (http://struts.apache.org) framework to define workflow. This makes it possible to easily create multiple system views focused on a particular user skill level (e.g. novice, advanced, admin) or discipline (e.g. geomagnetic, ionospheric, cosmic rays) without having to rewrite code. The interface can largely be adapted by editing XML documents from the Struts workflow configuration.
Web-services based data flow and data transformations;
Metadata search and data export API to interface with other VxO’s;
Common computational and data models;
Mechanisms for creating derived data products;
Markup for events of interest, e.g. magnetic storm detection.
Many of these items map directly to the Grid paradigm, providing a strong link between a VO and an implementation of Grid. It is likely that as both progress there will be a merging in several key areas, particularly in activities related to environmental science.
Space weather reanalysis
One of the more important projects enabled by the SPIDR system is the Space Weather Reanalysis (SWR) (Ridley and Kihn 2004). The objective of the SWR was to generate a space weather representation covering two solar cycles (22 years) using physically consistent data-driven space weather models. This was accomplished by using data made accessible through SPIDR and models integrated through Grid technologies. The resulting product is an enhanced view of the space environment on consistent grids, time resolutions and coordinate systems, containing key fields that allow a scientist to quickly and easily incorporate the impact of the near-Earth space climate into models which can ingest these data. Before this project there were no climatological archives for the space-weather environment. Just as with terrestrial weather, it is crucial for scientists to understand both daily weather forecasts and long-term climate changes which affect operations.
The results of this project support further tools for intelligent data mining, classification and event detection which are applied to the historical space-weather database. This provides a reasonable starting point for the user interested in modeling the effect of the near-Earth space environment on society.
Future work on grid data mining
In a June 1999 Nature article entitled “It’s sink or swim as a tidal wave of data approaches. ... Are scientists ready for the flood?” the author stated, “Most researchers are accustomed to studying a relatively small data set for a long time, using statistical models to tease out patterns. At some fundamental level that paradigm has broken down.” (Reichhardt 1999) This is more true today than ever. Scientists are no longer restricted to a small collection of data that they themselves collect and manage; rather, they can interact with literally petabytes of environmental archives. Many of the important discoveries are coming not from a single discipline but from cross-discipline work where the integration of previously separate archives yields new and important results.
In this arena the Grid plays an important part: faced with an enormous volume of data in diverse formats, archives and types, scientists need a tool kit to point them to the relevant parts and maximize scientific productivity. An important tool which can be brought to bear on this is data mining. Data mining is the discovery, by autonomous mathematical algorithms, of previously unknown relations between data. It is often used in the business community, for example to match consumer buying patterns to goods and advertising. In the environmental sciences data mining has several important applications; these include data quality control, human linguistic translation, event and trend detection, data classification and forecast, and deviation detection.
Making access to cross disciplinary data transparent to the user.
Providing access to services necessary to prepare the data (e.g. sub setting, transforms).
Integrating data mining services.
Unfortunately, Grids do not assist with the all-important data validation and verification that is required before data mining tools can do effective work.
SPIDR data services can be used as an Open Grid Services Architecture Data Access and Integration (OGSA-DAI, http://www.ogsadai.org.uk) resource using a data resource plugin mechanism. OGSA-DAI is a middleware product which supports the exposure of various data resources, such as relational or XML databases, to the Internet and Grids (Antonioletti et al. 2005). It can be used as a data service layer both in the Globus Toolkit 4 and soon in the gLite grid implementations. That makes it possible to build grid portals and to run the space weather models within the EGEE⁶ infrastructure (http://www.eu-egee.org) using SPIDR as an embedded (meta)data web service.
For example, the OGSA-DAI Grid data services can use the Environmental Scenario Search Engine (ESSE, http://esse.wdcb.ru) for data mining in SPIDR. The ESSE user interface can translate natural language descriptions of environmental events (e.g. “large magnetic storm”) into search queries without the user having to assign specific values to the parameters involved (Zhizhin et al. 2006). Using the ESSE engine, the “large magnetic storm” event can be described as a fuzzy logic query that searches for a combination of “low DST and large Kp” values of the global geomagnetic indices stored in SPIDR databases.
The data held in SPIDR can be quality controlled by peer-matching techniques in which stations are compared to their nearest neighbors to see if the observations are “similar” (in a fuzzy mathematical sense). This quality control technique was used for the data export from SPIDR to feed the numerical space weather models in the NGDC Space Weather Reanalysis. To mine the Space Weather Reanalysis products, we were able to use a similar analog forecast technique for ionospheric potentials (Kihn et al. 2006).
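One simple way to express peer matching is to score each observation by how close it lies to the median of the neighboring stations at the same time step. The sketch below is an assumption-laden illustration, not the technique used in the Space Weather Reanalysis: the median as the peer value, the linear similarity function, the tolerance, and the sample numbers are all chosen here for the example.

```python
from statistics import median


def peer_similarity(target, neighbors, tolerance):
    """Score each target value in [0, 1] by closeness to the neighbor median.

    1.0 means identical to the peer median; 0.0 means the difference is at
    least `tolerance`. The linear fall-off is an assumed membership shape.
    """
    scores = []
    for i, value in enumerate(target):
        peer = median(series[i] for series in neighbors)
        scores.append(max(0.0, 1.0 - abs(value - peer) / tolerance))
    return scores


# Made-up readings from one station and three nearby stations.
target = [10.0, 11.0, 50.0, 12.0]
neighbors = [
    [9.0, 12.0, 11.0, 13.0],
    [11.0, 10.0, 12.0, 12.0],
    [10.0, 11.0, 13.0, 11.0],
]

scores = peer_similarity(target, neighbors, tolerance=5.0)
suspect = [i for i, s in enumerate(scores) if s < 0.5]
print(scores, suspect)
```

Observations with a low score can be flagged for review or excluded before the data are passed to a numerical model.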
It is our belief that increasing data volumes demand the application of new tools and methods to maximize scientific efficiency. It is also our belief that software tools and mathematical methods exist which provide analysis, classification, access, discovery and forecast methods for large-volume data sets. The Grid will play an important part in making these tools available on the Internet for use with the distributed archives that are being developed now and in the future. The SPIDR system is an early implementation of a Grid system which, while discipline focused, exhibits the key operational components of a true Grid environmental data system. The SPIDR system itself may be used as a pattern for those interested in implementing such an environmental tool.
¹ Representational state transfer (REST) is a web-service architecture style that exploits the existing technology and protocols of the Web, including HTTP and XML. REST is simpler to use than the SOAP (Simple Object Access Protocol) approach, which requires writing or using a provided server program (to serve data) and a client program (to request data). SOAP, however, offers potentially more capability.
² The SPIDR simple XML data export schema has a semantic header followed by time-value element pairs, similar to a pen-plotter language.
³ Here by Web 2.0 style we mean a collaborative way of on-line editing of the SPIDR metadata with a basic set of standard services including content relations, tagging, change tracking and moderation, wiki markup, user blogs, etc.
⁵ Apache Struts is a free open-source framework for creating dynamic web content with JavaServer Pages. Struts can interact with databases and business logic engines to customize a response and to control the application workflow.
⁶ The Enabling Grids for E-sciencE (EGEE) project brings together computing resources and researchers from 240 institutions in 45 countries to provide a seamless Grid infrastructure for e-Science that is available to scientists 24 hours a day. The EGEE project is funded by the European Commission.