Encyclopedia of Database Systems

Parallel and Distributed Data Warehouses

  • Todd Eavis, Concordia University

Synonyms

Scalable decision support systems; High-performance data warehousing

Definition

To support the burgeoning data volumes now encountered in decision support environments, parallel and distributed data warehouses are being deployed with increasing frequency. Having evolved from haphazard and often poorly understood repositories of operational information, the data warehouse has become one of the cornerstones of corporate IT architectures. However, as the underlying operational databases grow in size and complexity, so too do the associated data warehouses. In fact, it is not unusual for corporate or scientific repositories to exceed a terabyte in size, with the largest now reaching 100 TB or more. While processing power has grown significantly during the past decade, the sheer scale of the workload places enormous strain on single-CPU data warehousing servers. As a result, some form of data and/or query distribution is often employed in production environments. It is important to note, however, that while contemporary data warehouses are almost always built upon relational DBMS platforms, the unique characteristics and requirements of data warehouse environments often suggest design and optimization choices that are rarely employed in general-purpose parallel database systems.

Historical Background

The terms “parallel DW” and “distributed DW” are very often used interchangeably. In practice, however, the distinction between the two has historically been quite significant. Distributed DWs, much like distributed DBs, grew out of a need to place processing logic and data in close proximity to the users who rely upon them. In general, multi-location organizations consist of a small number of distinct sites, each typically associated with a subset of the information contained in the global data pool. In the data warehousing context, this has traditionally led to the development of some form of federated architecture. In contrast to monolithic, centralized DWs, federated models are usually constructed as a cooperative coalition of departmental or process-specific data marts. A simple example is illustrated in Fig. 1. For the most part, design and implementation issues in such environments are similar to those of distributed operational DBMSs [11]. For example, it is important to provide a single transparent conceptual model for the distinct sites and to distribute data so as to reduce the effects of network latency and bandwidth limitations. One feature unique to the distributed data warehouse environment is the emphasis on the integration and consolidation of distributed operational data, a process known in DW terminology as Extract, Transform, and Load (ETL).
Parallel and Distributed Data Warehouses. Figure 1

A simple federated model illustrating the mapping of a logical global schema on to a series of physically independent data marts.

With respect to parallelism, research has again been influenced by parallel DBMS projects such as Gamma [3] that were initiated in the mid-to-late 1980s. By the 1990s, it had become clear that commodity-based “shared nothing” databases provided significant advantages over the earlier SMP (Symmetric Multi-Processor) architectures in terms of cost and scalability [4]. Subsequent research therefore focused on partitioning and replication models for the tables of the parallelized DBMS [13]. In general, researchers identified the importance of full p-way horizontal striping for large database tables.

Data warehouse researchers have continued to explore issues related to table partitioning. In addition to complete p-way partitioning schemes, full or partial replication (i.e., duplication) of fragments has been investigated [12]. This technique has been further extended by virtualizing partial fragments over physically replicated tables [7]. Much of the recent research on table partitioning exploits “DBMS clusters.” Here, rather than constructing a complete, parallel DBMS platform, the parallel system is essentially assembled from a series of commodity DBMS systems, “glued together” with a thin partition-aware wrapper layer.

In addition to the continuation of the traditional partitioning work, a second important theme has been the parallelization of state-of-the-art sequential data cube generation methods. The data cube has become the primary abstraction for the multi-dimensional analysis that is central to modern data warehousing systems. Cube parallelization efforts have, in fact, taken two forms, one based upon array-based physical storage [8] and the other based upon relational storage [2,9]. In both cases, the complexity of the algorithms and implementation issues has led to the development of relatively complete DBMS prototypes.
Parallel and Distributed Data Warehouses. Figure 2

A three-dimensional OLAP space showing all 2^d cuboids and the parent-child relationships between them. Each cell would contain a Total Sales aggregate value.

Foundations

Central to data warehousing systems is a denormalized logical multi-dimensional model known as the Star Schema (the normalized version is referred to as a Snowflake). A Star Schema consists of a single, very large fact table housing the measurement records associated with a given organizational process. During query processing, this fact table is joined to one or more dimension tables, each consisting of a relatively small number of records that define specific business entities (e.g., customer, product, store). A complete data warehouse typically consists of multiple such Star Schema designs.

While the Star Schema forms the basis of the relational data warehouse, it can be extremely expensive to query the fact table directly, given that it often consists of tens of millions of records or more. Typically, the basic Star Schema is augmented with compact, pre-computed aggregates (called group-bys or cuboids) that can be queried much more efficiently at run-time. This collection of aggregates is known as the data cube. Specifically, for a d-dimensional space {A_1, A_2, ..., A_d}, the cube defines the aggregation of the 2^d unique dimension combinations across one or more relevant measure attributes. In practice, the generation and manipulation of the data cube is often performed by a dedicated OLAP (online analytical processing) server that runs on top of the underlying relational data warehouse. While the OLAP server may utilize either array-based (MOLAP) or table-based (ROLAP) storage, both provide the same, intuitive multi-dimensional representation for the end user.
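
The 2^d cuboids can be enumerated directly from the dimension list. The sketch below, again assuming pandas and a toy fact table like the one in the previous example, materializes every cuboid of a tiny three-dimensional cube; a real OLAP server would of course share computation among cuboids and use far more compact storage.

```python
# Enumerate and materialize all 2^d cuboids of a tiny cube (illustrative only).
from itertools import combinations

import pandas as pd

def build_cube(fact: pd.DataFrame, dims: list, measure: str) -> dict:
    """Return a mapping {dimension subset -> aggregated cuboid}."""
    cuboids = {}
    for k in range(len(dims) + 1):                       # subsets of size 0..d
        for subset in combinations(dims, k):
            if subset:
                cuboids[subset] = (fact.groupby(list(subset), as_index=False)
                                       [measure].sum())
            else:                                        # the apex (ALL) cuboid
                cuboids[subset] = pd.DataFrame({measure: [fact[measure].sum()]})
    return cuboids

fact_sales = pd.DataFrame({
    "product_id":  [10, 10, 20, 20],
    "store_id":    [100, 101, 100, 101],
    "customer_id": [1, 2, 1, 3],
    "total_sales": [25.0, 40.0, 15.0, 60.0],
})
cube = build_cube(fact_sales, ["product_id", "store_id", "customer_id"], "total_sales")
print(len(cube))   # 2^3 = 8 cuboids
```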

Given the enormous size of these new data warehouses, some form of parallelism is often employed in production environments. One option is a “shared everything” architecture. Here, designers would likely employ a CC-NUMA system (Cache Coherent Non Uniform Memory Access) that supports a single global memory pool and some form of single virtual disk (e.g., a disk array). Such systems have the advantage that they are relatively easy to administer as the hardware transparently performs much of the “magic.” That being said, shared everything designs also tend to be quite expensive and have limited scalability in terms of both the CPU count and the number of available disk heads. In terabyte-scale data warehouse environments, either or both of these constraints might represent a serious performance limitation.

For this reason, many vendors and researchers have turned towards distributed, “shared nothing” platforms for high performance database applications. In fact, the characteristics of DW processing environments make such models particularly attractive. The key distinctions between operational and DW processing include:
  1. Operational systems tend to have high-volume query streams. In contrast, DW systems typically process a much smaller number of user queries.

  2. Operational queries have high selectivity, meaning that they touch relatively few records. Conversely, DW queries are comprehensive in scope, leading to very low selectivity and high I/O and computational load.

Why does this bode well for the use of shared nothing architectures? In large operational environments, high performance can often be achieved through the exploitation of what is often termed a “high throughput engine.” Here, improved performance is obtained via inter-query parallelism; that is, a stream of small queries is executed concurrently across the parallel/distributed system. However, the low volume/low selectivity combination in DW environments generally argues against the use of inter-query parallelism, since such a division of work would likely lead to extensive disk thrashing (i.e., resource conflict). Note, as well, that the use of indexes, a staple in operational systems, is often of relatively little value in DW environments since full table scans are generally more cost effective for low selectivity queries.
As a consequence, the fully distributed, “shared nothing” DBMS model has been particularly attractive to high performance DW practitioners. By partitioning the core fact tables across a large number of independently controlled CPU/disk combinations, shared nothing designs are ideally suited to the exploitation of intra-query parallelism. In this case, an individual query q is decomposed and simultaneously executed across the p nodes of the system, with each partial query running against approximately 1/p of the records of the partitioned fact table (the small dimension tables are typically replicated on each node). Figure 3b provides a simple example of this technique, contrasting it with the non-partitioned model typically utilized on a single node server (Fig. 3a). Though merging of partial results may be necessary, it is important to note that while the input partitions may be massive, the output of user-directed analytical DW queries is typically quite small. As a result, shared nothing data warehousing excels precisely because it offers tremendous I/O performance while requiring only modest inter-node communication [10].
Parallel and Distributed Data Warehouses. Figure 3

Fundamental DW partitioning schemes. (a) Query executed on a single node server that houses a large fact table (T1) and two small dimension tables (T2, T3). (b) Physical partitioning, creating fact table fragments (F1, F2). (c) Virtual partitioning on the clustering attribute P of the replicated fact table. (d) Adaptive virtual partitioning, using fragments and multiple sub-queries.
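
As a rough illustration of the intra-query parallelism of Fig. 3b, the following sketch decomposes an aggregate query across p fact-table fragments and then merges the small partial results. Python's multiprocessing pool stands in for independent shared nothing nodes, and the fragment contents and the query itself are assumptions for illustration only.

```python
# Scatter/gather sketch of intra-query parallelism over p fact-table fragments.
# The fragment contents and the aggregate query (SUM(sales) BY region) are
# assumptions; in a real shared nothing system each fragment would reside on
# its own node and disk rather than in a local worker process.
from collections import Counter
from multiprocessing import Pool

def partial_query(fragment):
    """Run the decomposed sub-query against roughly 1/p of the fact table."""
    partial = Counter()
    for region, sales in fragment:
        partial[region] += sales
    return partial

def run_distributed(fragments, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(partial_query, fragments)    # scatter the sub-queries
    merged = Counter()
    for p in partials:                                   # merge the small results
        merged.update(p)
    return dict(merged)

if __name__ == "__main__":
    fragments = [
        [("East", 25.0), ("West", 40.0)],
        [("East", 15.0)],
        [("West", 60.0), ("East", 5.0)],
        [("West", 10.0)],
    ]
    print(run_distributed(fragments))   # {'East': 45.0, 'West': 110.0}
```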

It is against this backdrop that many recent data warehousing partitioning projects have been set. Essentially, there are three forms of DW partitioning. In the first case, the fact tables are physically partitioned in the manner just described. While the performance with this approach can be impressive [15], it is also true that imbalances caused by inherent data skew make effective a priori data striping quite challenging. In response, the virtual partition model was proposed [1]. Here, as illustrated in Fig. 3c, the fact tables are replicated in full across the nodes of the DBMS cluster. Queries are then decomposed into sub-queries (1/p of the records per node) that are run against virtual partitions mapped on top of the fully replicated tables. The advantage is that sub-queries may be (i) run against any cluster node and (ii) dynamically migrated to under-utilized nodes. The downside is (i) the storage requirement associated with p copies of the primary fact table and (ii) the fact that the commodity DBMS will fall back to a full table scan if the virtual partitions are too large. The third partitioning method attempts to combine the best features of the previous two. In adaptive virtual partitioning (AVP) [7], small virtual partitions are layered on top of larger physical partitions. The AVP algorithm dynamically determines the maximum sub-query size that does not invoke a table scan by iteratively probing the commodity DBMS with successively larger queries. The result is a set of relocatable sub-queries that can generally be answered efficiently without access to the code base of the commodity DBMS systems. A simple illustration is provided in Fig. 3d.
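
The probing idea behind AVP can be sketched as follows. The DBMS interface, the table scan test, and the tuning constants below are stand-ins rather than the published algorithm's exact details; a real implementation would inspect the optimizer's plan (e.g., EXPLAIN output) for each candidate sub-query size.

```python
# Sketch of AVP-style probing: grow the virtual partition (sub-query) size
# until the local DBMS optimizer would switch to a full table scan, then keep
# the largest size that still avoids the scan. The DBMS below is a stand-in.

class FakeDBMS:
    """Stand-in for a commodity DBMS whose optimizer prefers a full table scan
    once a range predicate covers more than `scan_threshold` rows."""
    def __init__(self, scan_threshold: int):
        self.scan_threshold = scan_threshold

    def plan_uses_table_scan(self, partition_size: int) -> bool:
        # A real implementation would issue EXPLAIN for the candidate sub-query.
        return partition_size > self.scan_threshold

def find_max_partition_size(dbms, start: int = 1024, limit: int = 10_000_000) -> int:
    """Probe the DBMS with successively larger virtual partitions."""
    size, best = start, start
    while size <= limit and not dbms.plan_uses_table_scan(size):
        best = size
        size *= 2                 # grow the probe geometrically
    return best

print(find_max_partition_size(FakeDBMS(scan_threshold=200_000)))   # 131072
```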

While partitioning is the cornerstone of contemporary high performance data warehouses, it should be clear that such methods represent a sort of “brute force” approach to query resolution. Though response times may indeed exhibit close to linear speedup as a function of the CPU/disk count, the implicit assumption is that full processing of the atomic, transactional data is necessary. This is certainly true for arbitrary or ad-hoc user queries, since nothing is known about the possible query format. However, most data warehouses support user interaction through some form of OLAP interface. At its core, OLAP represents an analytical environment for the intuitive manipulation of the data cube. As noted above, there are O(2^d) such views or cuboids. By materializing some (or occasionally all) of these pre-aggregated views, DW/OLAP systems can dramatically improve run-time performance on common queries without resorting to massive fact table scans. The trade-off, of course, is that the aggregates must be accurately maintained (i.e., updated). In fact, this requirement exerts considerable pressure on the underlying server as data volumes explode in size while, at the same time, update windows shrink. It is important to understand, however, that the run-time performance benefit resulting from cube materialization is likely to be significantly larger than the (at best) linear speedup achieved by a conventional parallel query engine. As such, parallel resources may be more valuable when utilized for the construction and ongoing maintenance of the OLAP data structures (e.g., summaries), rather than for brute force query execution.
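
A back-of-envelope comparison, using purely assumed row counts, illustrates why a materialized cuboid can outweigh even ideal linear speedup from brute force scanning:

```python
# Back-of-envelope comparison with assumed row counts: a brute force fact
# table scan with ideal linear speedup versus a query answered directly from
# a materialized cuboid on a single node.
fact_rows = 100_000_000      # assumed fact table size
nodes = 16                   # shared nothing nodes (ideal linear speedup)
cuboid_rows = 50_000         # assumed size of the relevant pre-aggregated view

rows_per_node = fact_rows / nodes          # 6,250,000 rows scanned per node
print(rows_per_node / cuboid_rows)         # 125.0: the cuboid does ~125x less work
```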

Because of the extensive computational requirements of cube generation in large data warehouses, a number of parallel cube construction and querying architectures have in fact been proposed. Note the use of the word “architectures” to indicate comprehensive models that include computation, memory management, and physical storage. In the parallel environment, both array-based (MOLAP) and relational (ROLAP) systems have been explored. With respect to MOLAP parallelism, a distributed memory framework targeting IBM’s SP2 was described in [8]. Here, cube construction begins by first equi-partitioning on a single attribute A_1. Using sequential MOLAP techniques, all partial cuboids containing A_1 (exactly 2^(d-1), half of the 2^d total) are constructed independently on each node. The data set is then re-partitioned on A_2 and the process is repeated. In total, d such rounds are required. The MOLAP mechanism is attractive in the parallel environment as it is well suited to the shared nothing model. Moreover, the parallelized MOLAP cuboids have the potential to provide extremely fast run-time performance (as is generally the case with MOLAP). That said, as with all MOLAP systems, the sparsity of the array-based storage necessitates complex sparse-array compression. In addition, data skew has the potential to generate significant load imbalance during each of the d re-partitioning phases.
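
One plausible reading of this d-round schedule, sketched below with hypothetical attribute names, is that round i re-partitions on A_i and builds every cuboid that contains A_i but none of the earlier attributes, with the empty (ALL) cuboid handled as a trivial final step.

```python
# One plausible reading of the d-round MOLAP schedule: round i re-partitions
# on A_i and builds every cuboid containing A_i but none of A_1..A_(i-1); the
# empty (ALL) cuboid is a trivial final step. Attribute names are hypothetical.
from itertools import combinations

def molap_rounds(dims):
    for i, anchor in enumerate(dims):
        later = dims[i + 1:]
        # every cuboid of the form {anchor} plus a subset of the later attributes
        round_cuboids = [(anchor,) + s
                         for k in range(len(later) + 1)
                         for s in combinations(later, k)]
        yield anchor, round_cuboids

for anchor, cuboids in molap_rounds(("A1", "A2", "A3")):
    print(f"re-partition on {anchor}: {len(cuboids)} cuboids -> {cuboids}")
# Rounds build 4, 2, and 1 cuboids respectively; together with the ALL cuboid
# this covers all 2^d = 8 group-bys of a three-dimensional cube.
```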

Alternatively, parallelism can be exploited in purely relational environments. The cgmCube project, for example, also utilizes state-of-the-art sequential cube construction methods but materializes results in the form of conventional relational tables [2]. Both shared nothing and shared disk platforms are targeted. In this case, the cube lattice is physically materialized by first creating a weighted minimum cost spanning tree representation that identifies the most cost effective means by which to materialize all O(2^d) cuboids. Parallelism is supported by decomposing the spanning tree into p sub-trees (using a k-min-max graph partitioning algorithm), each of which is computed in its entirety on a single node. The advantage here is that global costing decisions can be employed so as to significantly improve load balancing and to eliminate communication costs (note that the final cuboids can be re-partitioned in any way once the construction algorithm has terminated). Unlike array-based MOLAP systems, ROLAP does not provide implicit indexing, a problem of some significance given that traditional “single dimension” indexes such as the B-tree do not work particularly well in multi-dimensional spaces. cgmCube addresses this shortcoming by adding a forest of fully parallelized, multi-dimensional R-tree indexes. Ultimately, cgmCube’s algorithmic components are integrated into a fully parallelized server prototype called Sidera that essentially functions as a federation of single node OLAP servers [5]. In addition to cube generation and indexing, Sidera adds functionality for caching, selectivity estimation, and the management of dimension hierarchies, as well as fully parallelized sorting and aggregation support. A logical picture of the Sidera server is provided in [5]. It should be noted that due to the enormous complexity of parallelized OLAP systems, no direct performance comparison of the MOLAP and ROLAP alternatives has ever been performed.
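
In the same spirit (though greatly simplified, and with an assumed cost model based on dimension cardinalities rather than cgmCube's actual costing), the sketch below derives a materialization tree by computing each cuboid from its cheapest parent; the subsequent k-min-max decomposition of the tree into p sub-trees is omitted.

```python
# Greatly simplified materialization tree in the spirit of the spanning tree
# approach: compute each cuboid from its cheapest parent (one extra attribute).
# The cost model (product of assumed dimension cardinalities) is illustrative,
# and the k-min-max decomposition into p sub-trees is omitted.
import math
from itertools import combinations

dims = ("product", "store", "customer")
card = {"product": 1000, "store": 50, "customer": 20000}    # assumed cardinalities

# Estimated size of every cuboid in the 2^d lattice.
est_rows = {frozenset(c): math.prod(card[d] for d in c)
            for k in range(len(dims) + 1) for c in combinations(dims, k)}

def cheapest_parent_tree(dims, est_rows):
    """For each non-base cuboid, pick the smallest parent to compute it from."""
    base = frozenset(dims)
    tree = {}
    for cuboid in est_rows:
        if cuboid == base:
            continue                          # the base cuboid is the tree's root
        parents = [cuboid | {d} for d in dims if d not in cuboid]
        tree[cuboid] = min(parents, key=lambda p: est_rows[p])
    return tree

for child, parent in cheapest_parent_tree(dims, est_rows).items():
    print(sorted(child), "<-", sorted(parent))
```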

Finally, one should be aware that commercial parallel DW implementations are also available. Of particular interest in the current context are the dedicated appliance style architectures. Here, an integrated, parallel shared nothing hardware/software bundle is tailored specifically to data warehousing workloads and query patterns. Recently, the traditional “high end” DW system provider, Teradata, has been joined by new vendors such as Netezza, DATAllegro, and Greenplum that build upon commodity hardware and open source DBMS software. While each provides proprietary mechanisms for elements such as indexing and partitioning, the DW appliance vendors tend to rely primarily on the aforementioned “brute force” style of parallelism. At present, it does not appear that there is a direct movement of theoretical results from the research community to the industrial sector.

Key Applications

A quick review of the recent database literature indicates that the majority of current work in high performance databases is actually associated with data warehousing. This should perhaps not be surprising, since data warehouses represent some of the largest databases in existence today. In fact, with the emergence of the Internet as a vehicle for both data collection and distribution, one can only expect this trend to continue. As the average size of production DWs pushes past the terabyte mark, parallelism is likely to become an increasingly common theme.

Future Directions

In the short term, one would expect to see continued interest in the exploitation of commodity DBMS systems within loosely federated parallel DBMS clusters. This approach makes a great deal of sense given the considerable maturity of such platforms. Beyond this, OLAP query performance might be further improved by investigating the parallelization of more recent sequential cube methods. The tree-based Dwarf Cube is an obvious target in this regard [14]. We can also expect to see a greater focus on the emerging trend of real-time or near real-time data warehousing, particularly for parallel systems that target large, dynamic data environments.

In the longer term, one possible area for further investigation is the convergence of parallel/distributed data warehouses and the new data centric Grid technologies [6]. To this point in time, the Grid framework has primarily been associated with the concurrent processing of “flat files” rather than highly structured databases. Nevertheless, the Grid model offers interesting opportunities for the distributed, secure, and equitable processing of publicly accessible data repositories.

Experimental Results

Virtually all of the research mentioned in this discussion is accompanied by experimental evaluations in the associated references. In general, the parallel implementations described above – both partition oriented and OLAP-based – are capable of achieving near linear speedup on contemporary shared nothing parallel systems of 8–32 nodes.

Cross-references

Cube

Cube Implementations

Cube Lattice

Data Warehousing Systems: Foundations and Architectures

Dimension

Measure

Multidimensional Modeling

OLAP

On-Line Analytical Processing

Optimization and Tuning in Data Warehouses

Query Processing in Data Warehouses

Star Schema

Copyright information

© Springer Science+Business Media, LLC 2009