Automatic optimized discovery, creation and processing of astronomical catalogs
We present the design of a novel way of handling astronomical catalogs in Astro-WISE in order to achieve the scalability required for the data produced by large scale surveys. A high level of automation and abstraction is achieved in order to facilitate interoperation with visualization software for interactive exploration. At the same time flexibility in processing is enhanced and data is shared implicitly between scientists. This is accomplished by using a data model that primarily stores how catalogs are derived; the contents of the catalogs are only created when necessary and stored only when beneficial for performance. Discovery of existing catalogs and creation of new catalogs is done through the same process by directly requesting the final set of sources (astronomical objects) and attributes (physical properties) that is required, for example from within visualization software. New catalogs are automatically created to provide attributes of sources for which no suitable existing catalogs can be found. These catalogs are defined to contain the new attributes on the largest set of sources the calculation of the attributes is applicable to, facilitating reuse for future data requests. Subsequently, only those parts of the catalogs that are required for the requested end product are actually processed, ensuring scalability. The presented mechanisms primarily determine which catalogs are created and what data has to be processed and stored: the actual processing and storage itself is left to existing functionality of the underlying information system.
KeywordsData mining Data lineage
Billions of astronomical objects are detected in large astronomical surveys, for which thousands of properties are quantified. The classical way to handle catalog data produced by large surveys, is to create a static relational database with a direct interface to the user. This classical approach has several conceptual drawbacks. The catalogs are published in releases as is; there is very limited flexibility in the derivation of the data. Redoing part of the data reduction entails downloading a large part of the original data and reprocessing it offline. Scientists require knowledge about the internal representation of the data to access it.
Examples of such an approach are the Sloan Digital Sky Survey (SDSS) [5, 9] and the WFCAM Science Archive . It is possible to create user-defined tables in the ‘CasJobs’ service1 of SDSS. These are of a limited size and there is no facility to reprocess the data. This is an inherently pushing or forward chaining approach, that is, scientists create derived catalogs in a stepwise fashion—starting from the released catalogs—until they reach their required end product. The used database queries are stored within the tables, but there is no conceptual information about what the data in the catalogs represent.
This paper discusses the design of novel mechanisms to handle such large catalogs in Astro-WISE through request driven processing. That is, scientists request their required end product, and the information system autonomously determines the best way to provide this catalog. We achieve a high level of automation and implicit scalability, while enhancing flexibility in processing and sharing of data. This is done by using an object oriented data model that focuses on storing information about processing; storing the catalog data itself is of secondary importance.
The Astro-WISE consortium has designed a new paradigm and has implemented a fully scalable information system to overcome the huge information avalanche produced by wide-field astronomical surveys [7, 8]. This is achieved by capturing in a generic way the reality of end-to-end survey operations into a conceptual data model which is translated into hierarchical classes. The model maps all links between dependencies: objects are stored in the database, which links all data products to their dependencies. This creates a dependency graph with the full data lineage of the entire processing chain.
Astro-WISE uses the advantages of Object-Oriented Programming (OOP) to process data in the simplest and most powerful ways. In essence, it turns the objects that represent conventional astronomical science products, into OOP objects, called process targets. Every individual science product, such as frame or catalog, is an instantiation of a specific process target class. Each of these process target instances knows how to process itself to create the data product it represents. Each process target has associated processing parameters, which are configurable parameters that guide the processing of that target.
The most unique aspect of Astro-WISE is its ability to process data based on the final desired result to an arbitrary depth. This data pulling is the heart of Astro-WISE and is called target processing. Contrary to the typical case of forward chaining such as in the SDSS CasJobs service, the Astro-WISE database links allow the dependency chain to be examined from the intended process target all the way back to the raw data. A target’s dependencies are checked to see if it is up-to-date: if there is a newer dependency or if the target does not exist, the target is (re)created.
1.2 A functional approach to catalogs as Process Targets
Target processing has been incorporated in the image reduction part of the Astro-WISE information system since its inception . Originally, only a few classes were available in Astro-WISE to handle catalog data, of which the SourceList is the most prominent. The SourceList is primarily used to create catalogs with attributes derived from images and has limited functionality for creating new catalogs from existing catalogs. In particular such derived catalogs do not have full data lineage and can therefore not be pulled. Furthermore they can require large amounts of duplication of catalog data, leading to scalability problems.
This paper describes how data lineage and data pulling mechanisms are extended to cover astronomical catalogs with the design of process target classes—which we call Source Collections—for catalog data. A Source Collection instance represents a collection of sources (astronomical objects) with attributes (or parameters) that quantify physical properties. There are separate process target classes for different operations to create and manipulate catalogs (Section 3.1). The Source Collection classes take data pulling mechanisms to a higher level than is necessary for images; in particular it is not required to store the catalog data that a Source Collection represents in its entirety.
We allow Source Collections to be created—and used as a dependency for other process targets—by specifying their data lineage, without requiring them to be processed, unlike other process targets in Astro-WISE.
Dependency graphs of Source Collections are created automatically through data pulling. These mechanisms create new Source Collections in a way that maximizes their reusability for future data pulling requests.
We present a novel way to process only the part of a Source Collection that is required for the last process target in a dependency graph. This is done by using the power of backward chaining to temporarily optimize the dependency graph.
We use a novel algorithm (Buddelmeijer et al. , hereafter Paper II) to infer the logical relationships between catalogs from their data lineage directly. This is required because the exact set of sources that a catalog represents might not be evaluated. This algorithm is used to find Source Collections and for the optimization of dependency graphs.
The methods to calculate new attributes from existing attributes are decoupled from their application. This offers scientists flexibility in implementing their own methods while reinforcing the principles of data pulling.
The catalog objects and data pulling mechanisms are designed to be used in query driven visualization . The high level of automation allows the data pulling to be abstracted, which implicitly minimizes the processing required to create the visualized datasets.
The remainder of the paper is structured as follows. The Source Collection concept is introduced and demonstrated with an example in Section 2. This is followed by a short description of the different Source Collection classes that are implemented in Astro-WISE in Section 3 and a discussion about storing Source Collections and the catalog data they represent in Section 4. Subsequently the concept of dependency graphs is explained in Section 5 and their automatic creation through data pulling in Section 6. The optimization of dependency graphs is discussed in Section 7 and their processing in Section 8. A summary and conclusion is provided in Section 9.
2 Introducing Source Collections
A Source Collection is an Astro-WISE process target (Section 1.1) for the handling of astronomical catalogs. These catalogs consist of sets of sources and attributes that quantify properties of the sources. The exact set of sources and the values of the attributes is determined by processing a Source Collection. When we refer to catalog data, we mean this processing result. A Source Collection can also be seen as a representation of the action required to derive the catalog data, since a Source Collection can be created without being processed. We refer to this action in a conceptual sense as the operator of a Source Collection and define separate process target classes for different operations on catalogs (Section 3.1).
Every source in a Source Collection has a unique identifier and two Source Collections are considered to represent the same sources if and only if the identifiers of their sources are identical. A source itself can be seen as an object in the computer science reading of the term. A parametrized property of a source can then be seen as an attribute of such an object. We will use the term attribute instead of parameter, which originates from this object oriented approach. Attributes quantify physical properties of the sources in a Source Collection and the set of attributes forms the Source Collection dimensions. The label of an attribute only describes what physical property is represented by the attribute. This labeling could be standardized, for example with Unified Content Descriptors; for the scope of this paper we will refer to attributes by their name only.
Every Source Collection instance is linked to its dependencies, forming a dependency graph all the way to the raw data. Dependencies of a Source Collection that are Source Collections themselves are also called its parents, because the catalog represented by the Source Collection is derived from them. Such a dependency graph can be visualized as interconnected nodes. In the figures in this paper, the dependencies of a process target are shown above it. Therefore the data processing runs from top to bottom and the data lineage from bottom to top. Such a dependency graph can conceptually be extended in both directions. The top nodes will contain photometric attributes and can be connected to nodes representing frames that were used to measure these attributes. The bottom nodes can be connected to nodes representing hypothetical process targets for graphs or other analysis products.
2.1 Source Collection example
The data set from which the sources should be selected: Source Collection A.
The selection criterion for the sources: a redshift below 0.1.
The required attributes: absolute and apparent magnitudes.
Source Collection B is created to select all sources that match the given selection criterion.
The information system determines that no absolute magnitudes have been defined for these sources and it creates Source Collection C to calculate absolute magnitudes from apparent magnitudes.
The information system determines that the calculation can be performed on all sources in Source Collection A. Therefore, it optimizes for generality and uses Source Collection A as dependency for Source Collection C, instead of B. The Source Collection is not yet processed at this stage.
Source Collections D and E are created to combine the attributes represented by different Source Collections and select the required ones.
The information system creates a temporary copy of the dependency graph in order to optimize it for scalability to fulfill the request as quickly as possible.
It reorganizes the dependency graph to minimize the required processing by placing the selection of sources before the calculation of absolute magnitudes.
The information system retrieves the data of Source Collection b and uses this to process Source Collection c completely. The calculated attributes will be stored for future requests as part of Source Collection C, because they cannot be derived on the fly.
The other Source Collections are processed on the fly while retrieving the catalog data of Source Collection e. The catalog data is subsequently returned to the scientist.
2.2 Key features in example
Catalog data is pulled and new Source Collections are created to compute attributes that do not yet exist (Section 6).
The final catalog has full data lineage: any attribute value can be recalculated and the selection criterion is stored (Section 4).
Calculations are defined to be as general as applicable. Source Collection C can be reused if at a later stage absolute magnitudes are requested for another subset of Source Collection A (Section 6).
The information system reorganizes the order of the Source Collections to prevent the calculation of unnecessary data (Section 7). The algorithm to determine logical relationships between sets of sources of Paper II is used for more complex dependency graphs.
The calculation of the absolute magnitudes can be performed on the workstation of the scientist or on a distributed computing cluster, while the selection of data can be performed on the database (Section 7).
3 Source Collection classes: elementary operations on catalogs
Many of the novel features of the Source Collections originate from the ability of the information system to assess aspects of the catalogs by inspecting only the data lineage. This is achieved by having a predefined set of operations that can be used to process a Source Collection. Separate process target classes are designed for the different operations. We use the term operator to refer to the action required to create the catalog data.
These operators are designed to be as elementary as possible in order to maximize the information that can be inferred from the data lineage directly. Therefore, there are no Source Collection operators that are entirely user-defined. However, the behavior of Source Collections can be influenced by setting the process parameters. For example, we do define an operator to calculate new attributes of sources from existing attributes (Section 3.2). This allows scientist to specify their own calculation method as a process parameter.
There are two main effects of the elementary operators: firstly, they allow the information system to determine whether a Source Collection can be used in the construction of a dependency graph (Section 6). Secondly, they allow efficient reorganization of the dependency graph, e.g. for partial processing (Section 7).
Most operators we define are modeled after relation operations  to allow them to be evaluated on the Astro-WISE database. In essence, we extend SQL commands to target processing, although this is not directly our goal. The important aspect in the design of the operators is maximizing the information that can be inferred from the data lineage. Not all operators we describe can be evaluated on SQL and vice versa, most operators can be evaluated in the Astro-WISE Python environment as well.
3.1 List of classes
Select Attributes: Selects a subset of attributes from a parent Source Collection.
Concatenate Attributes: Combines the different attributes from several parent Source Collections that represent the same sources.
Rename Attributes: Renames attributes of a parent Source Collection.
Filter Sources: Selects a subset of sources from a parent Source Collection by evaluating a selection criterion.
Select Sources: Selects a subset of sources from a parent Source Collection by listing the required sources explicitly.
Concatenate Sources: Combines the different sources of several parent Source Collections that represent the same attributes.
Relabel Sources: Changes the source identifiers of a parent Source Collection.
Attribute Calculator: Calculates new attributes from existing attributes for the sources in a parent Source Collection (Section 3.2).
External: Represents a catalog without data lineage.
Pass: Represents the exact same catalog as its parent.
SourceList Wrapper: A special Source Collection to use the Astro-WISE SourceList class as a Source Collection. The SourceList class is used to detect sources from images and measure photometric and related attributes.
3.2 Generic operator for attribute calculation
A special Source Collection class is designed for the calculation of new attributes of sources from existing attributes. The calculation performed by a Source Collection of this class, is decoupled from the definition of the class and is stored as another persistent object, which can be created by scientists themselves.
This auxiliary object is called an Attribute Calculator Definition and contains both information about how to perform the calculation as well as information about the calculation itself: which attributes are calculated, which attributes are required and which process parameters can be set. This allows the information system to discover attribute derivation methods in order to instantiate Source Collections to calculate these attributes for a requested set of sources. This offers scientists flexibility in implementing their own methods while reinforcing the principles of data pulling.
Multiple Attribute Calculator Definitions might exist for the calculation of the same attribute, for example through different methods or different versions of the same method. Astro-WISE has functionality to indicate that stored objects should not be used anymore by invalidating them, for example when a newer version of the object exist. This is used within the Source Collections to indicate that newer versions of Attribute Calculator Definitions exist. This allows existing functionality to be used for ensuring that catalogs are always created with the latest method and that out-dated catalogs are flagged for possible recreation.
4 Storing data lineage instead of tables
Different Source Collections can represent partially identical catalogs without any duplication of stored data.
The processing of intermediate Source Collections can be limited to those subsets that are required for the end node of a dependency graph. Source Collections can therefore be created with arbitrary sizes without performance penalties. This ensures maximum reusability of the created Source Collections.
No results have to be stored at all for Source Collections that can be processed on the fly.
4.1 Source Collection persistent properties
Data Lineage: Properties that define the catalog that is represented by the Source Collection. These are dependencies and process parameters. Dependencies are other process targets from which the catalog represented by this Source Collection is derived, often Source Collections as well. Process parameters influence the processing and are defined by the class of the Source Collection. The dependencies and process parameters together unambiguously define the catalog that the Source Collection represents.
Processing Results: Results of processing the Source Collection, detailed in Section 4.2.
Other Properties: Properties that do not refer directly to the processing or the processing results. These include identifiers of the object, a human readable name of the Source Collection, a reference to its creator, status of the processing, etc. Some of these can be specified by the user, others are set automatically by the information system.
4.2 Processing results
The result of processing a process target instance (Section 1.1) can be stored persistently. The processing results of image classes are primarily the values of the pixels of the image, which in Astro-WISE are stored as FITS files on the dataserver. For Source Collections the primary result is the catalog data it represents, which in Astro-WISE is stored in the database.
The catalog the Source Collection represents: the values of all the attributes for all the sources. This is the primary processing result and can be decomposed in the partial results that follow.
A partial catalog: the values of the attributes for a subset of the sources or attributes.
The set of sources the Source Collection represents, which can be seen as a list of identifiers of the sources. This can be further split up into the number of sources, or an identification of the set without actually enumerating all the sources individually.
The set of attributes of the sources. That is, which physical properties the Source Collection represents, not the actual values of the attributes.
The sharing of processing results leads to multiple paths to the same stored data. The dependency graphs representing these different paths are only created automatically by the information system through modifications of existing dependency graphs (Section 5.1). The information system ensures that the different paths are equivalent by only performing modifications where this is guaranteed.
5 Source Collection dependency graphs
A Source Collection represents a catalog that is derived from its dependencies, which again have dependencies themselves. These dependencies chain a Source Collection back to the raw data and form a graph of process targets. The term dependency graph is used to refer to this complete set of dependencies of a Source Collection. These graphs are directed acyclic graphs, or acyclic digraphs, because there are no cyclic dependencies .
In the figures depicting dependency graphs in this paper, the dependencies of a Source Collection are shown above it. Therefore the data processing runs from top to bottom and the data lineage from bottom to top. There are no arrows on the shown edges, because the preferred direction is dependent on context. This paper only treats the part of such a dependency graph that considers Source Collections.
5.1 Modifications of dependency graphs
Replacing a Source Collection with another one that represents the exact same catalog. This is the only mechanism that is used in the dependency graph optimization (Section 7).
Replacing a Source Collection with one that represents a different catalog. This is only performed during the creation of new dependency graphs (Section 7) and only on dependencies of the Pass Source Collections at the end of the graph.
The individual modifications themselves are designed in a way that separates the knowledge of how to perform a modification and why to do so. How to perform a modification is part of the definition of the Source Collection classes. Whether a specific modification should be applied is the responsibility of the part of the information system that governs the entire dependency graph. Therefore, all modifications are between a Source Collection and its direct dependencies, because an individual Source Collection has no knowledge of other objects.
A specific kind of modifying a dependency graph is ‘moving’ Source Collections through the graph. The way this should be interpreted—in simplified form—is that copies of a Source Collection and its parent are created, but with their dependencies swapped. The original Source Collection is then replaced by these copies. As a result, Source Collections can only be moved ‘up’ the graph. To move a Source Collection down, the Source Collection with that Source Collection as a parent should be moved up.
Some modifications can only be performed if the relationship between the sets of sources of the involved Source Collections is known. The information system uses the algorithm of Paper II to provide this information to the individual Source Collections.
6 Pulling catalogs
The ‘pushing’ way to use catalogs such as represented by the Source Collections is to define the catalog, process and store it, and then request subsets of the catalog. This order is changed with target processing .
Source Collections are primarily created automatically by pulling data, which means that the evaluation of processing starts at the end of the chain by requesting the final catalog that is required. The information system will autonomously create a dependency graph of Source Collection which ends with a Source Collection that represents the requested catalog (Algorithm 1).
There are two main goals of the data pulling mechanisms with respect to the creation of the dependency graph: Firstly, they ensure that existing Source Collections will be reused as much as possible and secondly, new Source Collections are created in a way that maximizes their reusability.
6.1 Data pulling: formulating a request
A starting Source Collection from which a selection is made.
A list of required attributes, not necessarily represented by the starting Source Collection.
Optionally, a selection criterion for the sources.
6.2 Data pulling: derivation preferences
The information system will use existing and newly created Source Collections to create a dependency graph which ends with a Source Collection representing the requested catalog. The information system is able to autonomously decide how to proceed if there are multiple Source Collections that can be used to fulfill a particular dependency. This is done by applying a ranking function to all Source Collections that can be used and select the one with the highest ranking. Scientists can influence this process by specifying their own ranking function or by overruling the choices made by the information system manually.
6.3 Data pulling: selecting sources
In this paper we restrict ourselves to requesting subsets of sources that are already represented by an existing Source Collection, because our focus is on operations on catalogs. In particular we assume the existence of Source Collections with photometric and related attributes derived from images. These catalogs could be created through pulling mechanisms as well; this is beyond the scope of this paper.
The logical relations algorithm of Paper II is used to search for an existing Source Collection that represents the requested selection. First all Source Collections that represent the same sources as the original Source Collection are found. Subsequently a Filter Sources is sought, one with the specified selection criterion and with one of these Source Collections as parent. New Source Collections are created to select the required sources if no suitable Source Collection is found. This might require more than only a single Filter Sources Source Collection because the information system has to ensure that the attributes used in the selection criteria are available.
For example, the specified selection criterion in the example in Section 2.1 depends on the availability of the redshift attribute. The information system would have tried to find this attribute if it would not have been included in Source Collection A.
A Select Attributes Source Collection is created to select no attributes from the found or created Source Collection with the sources. The required attributes are subsequently added to this new Source Collection with only sources (Section 6.4).
6.4 Data pulling: selecting attributes
Requested attributes for which no suitable Source Collections can be found, are derived with new Source Collections (Algorithm 4). In this paper we limit ourselves to attributes that are derived from other attributes using Attribute Calculators Source Collections. The calculation performed by an Attribute Calculator is specified through a process parameter referencing an Attribute Calculator Definition object. New Attribute Calculator Source Collections are instantiated for all Attribute Calculator Definitions that can be used to derive the requested attribute. The search for attributes is applied recursively if more attributes are required for the derivation of the requested attributes, as specified by the Attribute Calculator Definition.
6.5 Data pulling: storing Source Collections
The result of data pulling is the creation of a dependency graph that ends with a Source Collection that represents the requested catalog. The Source Collections in this dependency graph might be stored persistently if necessary. The information system will subsequently optimize this graph to process it in the most optimal way (Section 7).
7 Optimization of dependency graphs and processing
The information system will optimize the dependency graph of a Source Collection before processing the Source Collections that it contains. There are two goals to these optimizations: minimization of the required processing and optimization of the processing itself. These optimizations are performed on a temporary transient copy of the dependency graph, which can be discarded after the processing is completed.
Reducing the necessary processing to the minimum required for the last Source Collection is the primary goal of this paper. In essence this is done by placing filtering Source Collections before Source Collections that create new attributes and removing parts of the dependency graph that are not necessary for the final result. This will ensure that the Source Collections in the dependency graph only represent data that is required for the requested catalog data. These mechanisms allow the information system, or the scientist, to create and store Source Collections instances in their most general and reusable form, (e.g. as in Fig. 2a), because the creation and storage of the catalog data is minimized automatically (e.g. as in Fig. 2b).
Optimization of the processing itself is a secondary goal of this paper. This is done by reorganizing the dependency graphs such that the processing can be performed on the most suitable subsystem of the information system. For example, Source Collections that can best be processed on the database are placed such that they can be combined into one SQL query and processed together. Parts of the dependency graph can be parallelized in order to process large Source Collections on a distributed cluster, especially those that cannot be processed on the database.
7.1 Dependency optimization: strategy
7.2 Dependency optimization: simplifications
A dependency graph of Source Collections can be simplified as part of the optimization routines (Algorithm 6). A Source Collection that has already been fully processed does not have to be processed again. All these Source Collections can be substituted with an External Source Collection that represents the same catalog. Furthermore, the complexity of a dependency graph can be reduced by combining operators or removing redundant ones.
For example, the initial Source Collection in Fig. 2 is an External Source Collection for simplicity. In a realistic scenario, this Source Collection would have dependencies of its own and would only be substituted with an External Source Collection just before processing. In Fig. 4a two serial Select Attributes Source Collections are combined and in Fig. 3d two parallel Select Sources Source Collections are combined.
7.3 Dependency optimization: removing dependencies
Unnecessary parts of a dependency graph can be removed by moving Select Attributes Source Collections up in the graph (Algorithm 5). The result of moving a Select Attributes up past a Concatenate Attributes, might be that one of the dependencies of the Concatenate Attributes does not represent attributes anymore. The part of the graph that ends with this dependency might be removed from the graph in its entirety.
The set of sources of a Concatenate Attributes is the intersection of the sets of sources of its parents. Therefore it is only possible to remove this part of the dependency graph if doing so does not influence the selection of sources. The logical relations algorithm of Paper II is used to determine whether this is the case.
7.4 Dependency optimization: sources limitation
Select Sources Source Collections are moved through the dependency graph to ensure that only those parts of the Source Collections are processed that are required to create the catalog data of the end node (Algorithm 5). Filter Sources Source Collections first have to be converted into a Select Sources. Before moving the Select Sources Source Collections up the dependency graph, they are moved down in order to copy them to all parts of the dependency graph they are applicable to.
In Fig. 3 the Select Sources Source Collection is moved up through the Attribute Calculator Source Collection. This creates a copy of the Attribute Calculator that represents a subset of the sources of the original.
7.5 Dependency optimization: parallelization
The Source Collections are well suited for parallelization because they are processed on a per-source basis. A Source Collection can be parallelized by creating a set of Select Sources (or Filter Sources) Source Collections that each select a subset of the original target, such that all sources are selected exactly once. The set of Select Sources Source Collections is then combined with a Concatenate Sources Source Collection which can replace the original Source Collection in the dependency graph. Further optimization can move the Select Sources up to parallelize the entire graph. The parallelization algorithm is currently not implemented in Astro-WISE.
8 Processing and storage
The result of the dependency graph optimization is a set of Source Collections that requires the least amount of processing to create the requested catalog data. The information system will recursively process the Source Collections and store the results if necessary.
The mechanisms designed for the research presented in this paper are intended be used in conjunction with existing large-scale data storage and processing facilities. Therefore, the precise way catalog data is processed and stored is largely beyond the scope of this paper and will depend on what is available in the information system. We give a general discussion of how the processing and storage could be achieved and highlight how this is implemented in Astro-WISE.
In particular, for this paper we assume the existence of mechanisms for authentication of users, privilege management and for queuing requests when processing Source Collections on shared resources.
8.1 Processing: processing Source Collections
Database: The selection and combining operators are designed to be evaluated on a database. The operators of consecutive Source Collections can be combined into one database query. The database can create indexes on columns containing attributes that are frequently used in selection criteria and can automatically cache the results. Some Source Collections, in particular Attribute Calculators, will not be suitable to be processed on the database.
Workstation: The processing can be performed on the workstation of the scientist for Source Collection that cannot be processed on the database. Furthermore, all the relational operators should also be evaluated on the local machine during interactive visualization of small datasets. The latency of a round trip to the database or distributed computing facility is too large for responsive interaction. Such a local implementation of the Source Collections holds all the catalog data in memory or in files.
Distributed Computation: Operators that require large computations can be performed on a distributed processing cluster. This is done by parallelizing the respective parts of the dependency graph and evaluating each sub-graph on a cluster node.
8.2 Processing: storing catalog data
The result of processing a Source Collection—the exact set of sources and the values of the attributes—only has to be stored if this is necessary for performance reasons. Therefore we make no explicit distinction between storing and caching of catalog data.
The optimization process (Section 7) creates copies of Source Collections that represent subsets of the originals. If the information system decides to store the processing result of such a copy, it will append the catalog data of the copy to that of the original. Deciding what should be stored is primarily the responsibility of the information system. The decision should be made for individual processing results.
For example, it can be useful to store the identifiers of the sources without storing the values of the attributes in order to store the result of evaluating a complex selection criterion. Different Source Collections that represent the same sources can share this processing result.
A key principle of the presented research, inherited from Astro-WISE, is that a Source Collection cannot be altered once stored. Therefore, any stored catalog data of a Source Collection cannot change either. Reliability of the data storage, e.g. ACID properties , is automatically achieved as a result. For example, an incomplete database transaction will not leave the database in an invalid state because these will only append data that could also be ingested partially in the first place.
Stored catalog data can, in principle, also be deleted at any time, since it can always be recreated. A deletion mechanism is currently not incorporated in Astro-WISE; instead, all catalog data is backed up regularly to be able to recover quickly from database failures.
8.3 Processing: retrieving catalog data
The last node in the dependency graph represents the requested catalog. Once it has been processed, the catalog data can be returned to the scientist or used for further analysis or visualization. Any temporary transient Source Collections instantiated for the processing are discarded.
9 Summary and conclusions
Source Collections are primarily created automatically by the information system through data pulling. Source Collections that derive new data are created as general as possible in order to facilitate reuse and to prevent duplication of data.
Source Collections allow a functional approach to target processing: they can be seen as the operation to create the catalog data. A Source Collection is only processed when this is required, not necessarily at the moment it is created. Every Source Collection class correspond to an elementary operation on catalogs; complex operations should be split over multiple Source Collection instances.
The Source Collection have full data lineage, which allows the information system to assess aspects of the catalogs without processing them. For example, it allows the information system to optimize the a dependency graph of Source Collections before processing it.
A Source Collection is processed by creating temporary copy of the dependency graph and reordering the dependencies so they are as specific as possible in order to minimize the required processing.
A generic Source Collection class is designed for the calculation of new attributes from existing attributes. This offers a framework for scientists to implement their own methods while enforcing the benefits of full data lineage and data pulling.
This research is part of the project “Astrovis”, research program STARE (STAR E-Science), funded by the Dutch National Science Foundation (NWO), project no. 643.200.501.
Astro-WISE is an on-going project which started from a FP5 RTD programme funded by the EC Action “Enhancing Access to Research Infrastructures”.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 4.Gray, J.: The transaction concept: virtues and limitations. In: Proceedings of the 7th International Conference on Very Large Databases, pp. 144–154 (1981)Google Scholar
- 5.Gray, J., Szalay, A.S., Thakar, A.R., Kunszt, P.Z., Stoughton, C., Slutz, D., van den Berg, J.: Data Mining the SDSS SkyServer Database. ArXiv Computer Science e-prints (2002)Google Scholar
- 6.Hambly, N.C., Collins, R.S., Cross, N.J.G., Mann, R.G., Read, M.A., Sutorius, E.T.W., Bond, I., Bryant, J., Emerson, J.P., Lawrence, A., Rimoldini, L., Stewart, J.M., Williams, P.M., Adamson, A., Hirst, P., Dye, S., Warren, S.J.: The WFCAM science archive. Mon. Not. R. Astron. Soc. 384, 637–662 (2008)ADSCrossRefGoogle Scholar
- 7.McFarland, J.P., Verdoes-Kleijn, G., Sikkema, G., Helmich, E.M., Boxhoorn, D.R., Valentijn, E.A.: The Astro-WISE optical image pipeline development and implementation (2011). doi:10.1007/s10686-011-9266-x
- 9.Szalay, A.S., Gray, J., Thakar, A.R., Kunszt, P.Z., Malik, T., Raddick, J., Stoughton, C., vandenBerg, J.: The SDSS SkyServer: Public Access to the Sloan Digital Sky Server Data. ArXiv Computer Science e-prints (2002)Google Scholar