International Conference on Enterprise Information Systems

Enterprise Information Systems pp 280-299 | Cite as

Dimension Enrichment with Factual Data During the Design of Multidimensional Models: Application to Bird Biodiversity

  • Lucile Sautot
  • Sandro Bimonte
  • Ludovic Journaux
  • Bruno Faivre
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 241)

Abstract

Data warehouses (DW) and OLAP systems are technologies allowing the on-line analysis of huge volume of data according to decision-makers’ needs. Designing DW involves taking into account functional requirements and data sources (mixed design methodology) [1]. But, for complex applications, existing automatic design methodologies seem inefficient. In some cases, decision-makers need querying, as a dimension, data which have been defined as facts by actual automatic mixed approachs. Therefore, in this paper, we offer a new mixed refinement methodology relevant to constellation multidimensional schema. The proposed methodolgy allows to decision-makers to enrich a dimension with factual data. In order to validate our theoretical proposals, we have implemented an enrichment tool and we have tested it on a real case study from bird biodiversity.

Keywords

Multidimensional design Data warehouse OLAP Data mining 

1 Introduction

Data warehouses (DW) and OLAP systems are business intelligence technologies allowing the on-line analysis of huge volume of data. Warehoused data is organized according to the multidimensional model that defines the concepts of dimensions and facts. Dimensions represent analysis axes and they are organized in hierarchies. Facts are the analysis subjects and they are described by numerical indicators called measures. Warehoused data are then explored and aggregated using OLAP operators (e.g. Roll-up, Slice, etc.) [2].

The success of DW projects essentially depends on the design phase where functional requirements meet data sources [1]. Three main methodologies have been developed: user-driven, datadriven and mixed [3]. User-driven approach puts decision-makers at the center of the design phase by providing them tools to define the multidimensional model exclusively according to their analysis needs. Usually, data driven methodology proposals deduce the multidimensional model from structured and semistructured [4, 5] data sources exploiting metadata (e.g. foreign keys) and some empirical values. Finally, mixed approaches fusion the two previous described methods.

Hierarchies are crucial structures in DW since they allow aggregation of measures in order to provide a global and general analytic view of warehoused data. For that reasons, some works investigate definition of hierarchies by means of Data Mining (DM) algorithms [6, 7]. However, this design step is applied once the multidimensional model has been defined, and it takes into account only members of one dimension.

From our point of view, these methodologies present an important limitation since in real DW projects often those DM algorithms need data of different dimensions and facts. Thus, in this paper we present a framework for a mixed design of multidimensional models by integrating DM algorithms in a classical data driven-approach. This allows defining hierarchical structures, according to decisional users’ requirements, that cannot be deduced by classical datadriven methods. This hierarchical organization of dimensional data is translated in a complex multi-factual multidimensional model in order to represent as well as possible semantic of data sources.

The paper is organized in the following way: Sect. 2 introduces related work; a retail case study and the motivation are presented in Sect. 3; our design method is detailed in Sect. 4 and its implementation is shown on Sect. 5.

2 Related Work

Three types of approaches can be used to design a data warehouse: (i) Methods based on user specifications, or demand-driven approaches; (ii) Methods based on available data, or data-driven approaches; (iii) Mixed methods, or hybrid approaches. For example, [8] is an iterative demand-driven method where at each iteration, the system searches for the best data corresponding with the information required by the user in terms of dimensions or facts. Moreover, several other have proposed systems based on hybrid approach such as [9] that propose to express functional requirements using SQL queries.

Relational data driven approaches deduce multidimensional structures (facts and dimensions) from conceptual [1] and/or logical models [5, 10]. In particular some works investigate automatic discovering facts using some heuristics [10]. About dimensions some works propose using logical database metadata such as foreign keys [5] or some heuristics.

Other works use more complex algorithm to identify dimensions hierarchies. [11] propose a system to dynamically build hierarchies based on data from Twitter [11]. [12] present a new OLAP operator named OPAC that allows to aggregate facts that refer to complex objects, such as images. This operator is based on hierarchical clustering algorithm. [6] provide a framework for automatic defining hierarchies according to user rules. In order to personalize the multidimensional schema, [13] propose to create new levels in a hierarchy with the K-means algorithm. [14] propose to increase the OLAP cube exploration functionalities by providing the user data mining algorithms to analyze data. [15] use a hierarchical clustering to integrate continuous variables as dimensions in an OLAP schema. In the same line, [7] propose using Agglomerative Clustering for designing hierarchies, and the integration in a rapid prototyping methodology is presented in [16].

Concerning the multidimensional schema evolution issue, [17, 18] offer an interesting litterature review.

However, all existing works define hierarchies using only either dimensional data (i.e. attributes of dimension members) or factual data (i.e. measures) (see Table 1). But, in a constellation schema, a dimension can be enriched with a hierarchy created by using other dimensions and facts. It means that the creation of a new hierarchy can involved a refinement of facts and dimensions in the entier constellation schema. We detail this issue in the following section, using a real application case from bird biodiversity.
Table 1.

Summary of literature review related to automatic hierarchy building.

Data sources

Star schema

Constellation schema

Facts

One dimension

Facts and dimensions

Algorithm

K-means

[13]

[13]

Hierarchical classification

[7, 15, 16]

[12]

Our proposal

Other

[6, 11]

[14]

3 Motivations from a Real Case Concerning Bird Biodiversity

3.1 Case Study

Temporal Monitoring of Nesting Birds Along a River. Our case study comes from the STORI (Suivi Temporel des Oiseaux nicheurs en Rivière: Temporal Monitoring of Nesting Birds along a River), financed by the French Ministery of Environment since 1990 and by the European Union (ERDF: European regional development fund) since 2000.

The STORI was created to inventory bird species along two French rivers (the Loire river and the Allier river). The STORI has three aims:
  1. 1.

    First, the STORI aims to understand relationships between bird species and environments, particularly the impact of landscapes (habitat diversity, land cover,...) on the bird species distribution along a river.

     
  2. 2.

    The second aim of the STORI is the bird species monitoring, i.e. the appreciation of temporal changes in bird communities over several years.

     
  3. 3.

    Finally, the STORI aims to elaborate a bio-indicator based on bird species.

     

The STORI’s point of view is original compared to other works, which have similar aims. Classically, bio-indicators of river condition (fishes or insects) are not able to take into account the entire valley, because these bio-indicators are strongly dependant to the water [19]. Moreover, actual works are centered on one species whereas the STORI studies the entire bird community. Ecologically, studying communities is more relevant, in regards of the definition of an ecosystem.

In conclusion, the STORI is an original ecological study, which has produced an important data set on a long period (from 1989 to 2012) in a large spatial and biological area (the Loire river is 1000 Km long and harbores approximately 200 bird species).

Dataset Description.

Ornithological Data. Our ornithological data were collected during the STORI. To inventory bird species, we use a protocole based on point counts with unlimited distance, named the IPA method [20, 21]. The IPA method defines several precise “census points” (presented on Fig. 1), where each nesting bird is censused if it was detected by sight or by hearing (migratory birds and wintering birds are not censused). 198 census points have been defined along the Loire river. With the IPA method, we have obtained, at each census points, an semi-quantitative non-parametric index between 0 and 5. In conclusion, the ornithological data are a set of abundance index, censusing 213 bird species, in 198 census points, during four census campaigns (1990, 1996, 2002 et 2011).

Environmental Data. Our environmental data, describing the Loire river and the landscapes in the river valley, come from three main sources:
  • Field surveys executed in the same time that the bird species census: these surveys describe locally the riverbed (altitude, valley width, flow velocity, riverbank type, substratum, etc.).

  • Satellite images: they describe the river valley with attributes from image processing algorithms (percent of the area covered by forest, by grassland, connectance, cover land diversity, etc.).

  • Geographical data bases, which describe precisely the catchment area of the Loire river in terms of water system, geological area, ecological area and land cover.

The environmental data have been collected for each census point along the Loire river.

Numerous environmental data are time dependent. On the one hand, some environmental data can vary in value over years. On the other hand, many environmental data are not available at each bird census campaign. For examples:
  • The altitude of a census point is not a time dependent attribute, because we can easily assess that the altitude of a census point did not change between 1989 and 2012.

  • The percent of the area covered by forest around a census point has probably changed between 1989 and 2012, thus this attribute is time dependent. This attribute has been collected on a satellite image in 2001. This attribute is consequently available only for the bird census campaign in 2002.

Fig. 1.

Census points along the Loire river.

3.2 Motivation

In order to describe motivation of our new DW design methodology we present in this section a real case study concerning the bird biodiversity analysis [7]. This dataset has been collected to analyze spatio-temporal changes in bird populations along the Loire River (France) and to identify local and global environmental factors that can explain these changes. Data sources are stored in a relational database (PostGIS). Applying the data driven algorithm proposed in [9] we obtain the constellation schema depicted in Fig. 2, which presents two facts as described in the following. Abundances is one fact, and can be analyzed according to three dimensions (an instance is shown on Table 3): (i) the species dimension, which stores species names and attributes, (ii) the time dimension, which corresponds to the census years and (iii) the spatial dimension, which describes census points along the river. Using this model decision-maker can answer to queries like: “What is the total of birds per year and census point?” or “What is the total of birds per year and altitude?”. To complete bird census, the landscape and the river are described around each census point. Environment descriptions are represented by another fact, which is associated to the time dimension and the spatial dimension. With this model, it is possible to describe census points, for example a possible OLAP query is “What is the percentage of forest per census point in 2012?”.

Note that descriptions of census points that are not dependent from time, such as altitude and geology, are used as spatial dimension levels, while other attributes are represented as measures of another fact (e.g. percentage of forest). Unfortunately, abundances for a specie have not meaning if not related to environmental data of census points. In this situation a drill-across operation is not adequate since it will hide the species dimension. Indeed, with the drill-across operators facts are joined only on common dimensions. Moreover, the multidimensional model of Fig. 2 does not make possible to provide the decision-makers with OLAP queries aggregating abundance by classes of environmental variable (30 % of forest, 50 % of water, etc.), for example “What is the total of birds per year and group of census point with 30 % of forest?” or “What is the total of birds per year and group of census point with 50 % of water?”, since environmental parameters do not appear as levels, but as measures, prohibiting group-by queries.

Therefore, in our case study, decision-makers need for a new design method that group census points (dimensional data) by environmental parameters (factual data) and year (dimensional data).

Note that administrative division is not a spatial hierarchy which makes sense to study bird biodiversity. On the map presented on Fig. 1, boundaries of French departements and boundaries of large ecological areas do not match.

The multidimensional model allowing correct OLAP analysis should be the one shown on Fig. 3 [22]. This multidimensional schema presents only one fact and the spatial dimension is enriched with some levels representing group of environmental parameters for each year. Indeed, environmental parameters for census points in 2001 can be different from ones of 2002 implying that the same census point is not grouped in the same level on two different years as shown on Table 3.
Fig. 2.

Bird biodiversity case study: data-driven constellation schema.

Fig. 3.

Bird biodiversity case study: manually driven multi-version schema.

For example, data describing agricultural activities around the census points, are available only for the 2002 census campaign. Therefore, it is important to take into this different classification when navigating on the temporal dimension during an OLAP analysis session. For example, the query “What is the total of birds in 2002 and in census points with the same environmental parameters?” has to use the environment type 2002 level, and “What is the total of birds in 2011 and in census points with the same environmental parameters?” has to use the environment type 2011 level. For example an OLAP query using the environment type 2002 level and the temporal member 2011 is not coherent since it associates the number of birds on 2011 in the past geographical-environmental configuration of 2002, leading to erroneous interpretation.
Table 2.

Factual data of “Environments” node.

Years

Census points

Agencies

Percent of forest

Percent of grassland

2002

1

LE2I

0.176

0.250

2002

1

ONEMA

0.356

0.261

2002

2

LE2I

0.311

0.420

2002

2

ONEMA

0.255

0.574

2011

1

LE2I

0.189

0.278

2011

1

ONEMA

0.241

0.385

2011

2

LE2I

0.322

0.568

2011

2

ONEMA

0.257

0.575

Table 3.

Factual data of “Abundances” node.

Years

Census points

Species

Abundance

2002

1

Yellowhammer

1.5

2002

1

Coal Tit

0.5

2002

2

Yellowhammer

1.5

2002

2

Coal Tit

0

2011

1

Yellowhammer

1

2011

1

Coal Tit

3

2011

2

Yellowhammer

1

2011

2

Coal Tit

2

4 Our Proposal

In this section we introduce our framework for the refinement of multidimensional in a mixed approach. The main idea of our proposal is using an existing data driven methodology in a first step. Then, in our new design step, we collect user needs about hierarchies that are not been deduced in the multidimensional schema by means of the functional dependencies. These users’ needs are expressed in the form of facts existing in the constellation multidimensional model. In particular, the main idea is to provide an algorithm that transforms the constellation multidimensional schema by eliminating a fact node and integrating factual data in an associated dimension used for creating new levels.

To perform this algorithm, we translate the multidimensional model in a multidimensional graph.

In the following section we describe the multidimensional graph definitions (Sect. 4.1), the main algorithm is detailed in Sect. 4.2 and the calculation of new versioned hierarchies is explained in Sect. 4.3.

4.1 Preliminaries

In this subsection, we present some preliminary definitions.

We represent a multidimensional model using a graph.

Definition 1

Multidimensional graph. A multidimensional graph is a directed graph \(M_{G}=<D,F,A>\) with:

\(D=\left\{ d_{1},...,d_{m}\right\} \), dimensional nodes, which represent dimensions.

\(F=\left\{ f_{1},...,f_{n}\right\} \), fact nodes representing facts.

\(A=\left\{ a_{1},...,a_{p}\right\} \mid \forall i\in [\![1,p]\!],a_{i}=(f_{j},d_{k})\), with \(j\in [\![1,n]\!]\) and \(k\in [\![1,m]\!]\), are arcs1, meaning that arcs are only directed from a fact node to a dimensional node.

Moreover, \(M_{G}\) contains no alone node, isolated of another node, but can contain possibly disconnected sets of nodes if each sub-graph must contain at least one fact node.

Example

An example of multidimensional graph is shown on Fig. 4. “Species” dimension, “Census points” dimension, “Years” dimension, “Abundances” fact and “Environments” fact are described in previous sections. “Sources” dimension represents agencies, which collect data. “Budget” fact represents the funds allowed by each agency for each year to collecting data.

In our approach decision-maker want to enrich a dimension with some new hierarchies using some factual data. That dimension is called Target dimension.

Fig. 4.

Multidimensional graph \(M_{G}\).

Definition 2

Target dimension. The target dimension \(d_{t}\) of a multidimensional graph \(M_{G}\) is a dimension such as:

This means that dt is associated at least to two facts since one has to be removed and used to create its new levels.

Example

An example of possible target dimension is the “census point” dimension (Fig. 4).

Let us now formalize the fact node that is used to create levels.

Definition 3

Source node. The source node of a \(M_{G}\) with a target dimension \(d_{t}\) is a fact node \(f_{s}\in \left\{ f_{1},...,f_{u}\right\} \).

Example

With “census point” dimension as target node, an example of possible source node is the fact node “Environments”.

As we have said before our algorithm removes the source node from the graph. Therefore, a part of the structure of the graph is changed. Note that only nodes related to the source nodes are affected. We define this sub-graph in the following way

Definition 4

Source-target multidimensional sub-graph. Let \(M_{G}\) a multidimensional graph with a target dimension \(d_{t}\) and a source node \(f_{s}\) then the Source-target multidimensional sub-graph \(M_{G}'\) is a multidimensional graph such as: \(M_{G}'=<D',F',A'>\) with:

\(F'=\left\{ f_{i}\in F\mid \exists (f_{i},d_{t})\right\} \)

\(D'=\left\{ d_{i}\in D\mid \exists (f_{s},d_{i})\right\} \)

\(A'=\left\{ (f_{i},d_{j}) \mid f_{i} \in F',d_{j} \in D'\right\} \)

\(M_{G}'\) contains thereby only fact nodes linked to \(d_{t}\) and dimensional nodes linked to \(f_{s}\). In \(M_{G}'\), all fact nodes are so linked to at least one dimensional node and all dimensional nodes are so linked to at least one fact node. There is no isolated node in this sub-graph. \(M_{G}'\) is so a well-formed multidimensional graph.

Example

An example of Source-target multidimensional sub-graph using the previous example is shown on Fig. 4.

In order to formalize inputs of the agglomerative hierarchical clustering algorithm used for the creation of levels of the target dimension, we formalize factual data aggregated to a set of dimensions levels using the definition of instance fact node.

Definition 5

Instance fact node. Let \(M_{G}\) a multidimensional graph. Let \(m_{i}\) a member of the dimension \(d_{i}\). Then the instance fact node \(I(f, d_{1}.m_{1}, ...,d_{n}.m_{n})\) is the set of tuples representing facts of f aggregated to the dimensions members \(d_{1}.m_{1}\), ..., \(d_{n}.m_{n}\).

Example

Let, Table 2 representing the instance fact node for the node “Environments”, then Table 4 represents facts aggregated to the All member of the “Agencies” dimension:

(I(“Environments”, “Agencies.ALL”, “Years.1990”, “Census_points.*”))2

4.2 Algorithm

In this section we provide details and formalize our approach.

Removing a fact node from the multidimensional graph implies its redefinition. Thus, the main idea is in a first step to work on the source-target multidimensional graph exclusively, transform this sub-graph adding levels to the target dimension and removing the source node, and then finally re-integrate the new sub-graph in the rest of original multidimensional graph.

Removing the source node implies to handle its associated dimensions. It is possible to distinguish three types of dimensions:
  • The target dimension that will rest in the transformed sub-graph,

  • the Non Context dimensions \(D_{nc}\), and

  • the Context dimensions \(D_{c}\).

The Non context dimensions \(D_{nc}\) are dimensions that are only associated to the source node fact. In order to remove one dimension it is possible to provide a classical Dice operator, which consists in aggregating fact data to the top dimension member. Let us note that in order to avoid summarazability problems (aggregation cannot be reused) [23], in our approach we allow using only distributive and algebraic aggregation functions for the Dice operator.

Example

An example of Non contextual dimension is the “Agencies” node. In Table 4 is shown an example of the Dice operator on the Agencies dimension, which is a Non contextual dimension.

Table 4.

Factual data of “Environments” node aggregated on “Agencies”.

Years

Census Points

Percent of Forest

Percent of grassland

2002

1

0.266

0.256

2002

2

0.283

0.497

2011

1

0.215

0.332

2011

2

0.290

0.572

Formally,

Definition 6

Non contextual dimension. Let Source-target multidimensional sub-graph \(M_{G}'=<D' , F' , A' >\), then the set of non contextual dimension \(D_{nc}\) is
$$\begin{aligned} D_{nc}=\left\{ d_{1}^{nc},...,d_{v}^{nc}\right\} \subset D'\mid \forall i\in [\![1,v]\!]\exists !(d_{i}^{nc},f_{j})\mid f_{j} \in F' \end{aligned}$$

Note that in the previous formula, all dimensional nodes in \(D_{nc}\) are only linked to \(f_{s}\). Indeed, all dimensional nodes in \(M_{G}'\) are linked to \(f_{s}\) and dimensional nodes in \(D_{nc}\) are linked to one (and only one) dimensional node.

The Context dimensions \(D_{c}\) are dimensions in \(M_{G}'\) that are associated to \(f_{s}\) and another fact node f. With the future refined graph, users analyze facts in f according to \(d_{t}\). But, data used for calculating new hierarchies in \(d_{t}\) come from \(f_{s}\) and are thereby dependent of dimensions in \(D_{c}\). Therefore, we need to ensure that data used to create the hierarchy are coherent with data consulted by the user during their OLAP analysis. With this in mind, we offer a system that calculates hierarchies according a context, this context defining with \(D_{c}\).

Formally,

Definition 7

Contextual dimension. Let Source-target multidimensional sub-graph \(M_{G}'\), then the set of contextual dimension \(D_{c}\) is
$$\begin{aligned} D_{c} \subset D' \mid D_{c} = D' - (D_{nc} \cup \{ d_{t} \} ) \end{aligned}$$

Example

An example of contextual dimension is the “Years” node. On Table 3, we present data from “Abundances” node: data are dependent of “Years” dimensional node.

Once we have defined non context and contex dimensions let us provide our algorithm supposing that we have only one context dimension.

The input of this algorithm is the multidimensional graph \(M_{G}\) presented on Fig. 4.

Begin of the Refinement Algorithm
  • 1. Identify the Source-target multidimensional sub-graph\(M_{G}'\).

  • 2. Calculate a hierarchy for each instance of each context. This part of the algorithm is detailed in particular in the Sect. 4.3.

  • 3. Remove\(f_{s}\)from\(M_{G}\).

  • 4. Remove isolated nodes. The isolated nodes can be only dimensional nodes linked to\(f_{s}\). Then\(M_{G}\)is well formed.

End of the Refinement Algorithm
The output of this algorithm is a multidimensional graph, presented on Fig. 5. We note that \(f_{s}\) has been removed and there are new hierarchies in the “census points” node. Moreover, \(M_{G}\) remains a well-formed multidimensional graph and can be also implmented in a ROLAP architecture.
Fig. 5.

Refined multidimensional graph \(M_{G}\).

4.3 Automatic Creation of Hierarchies

In this section we describe how the is applied to create new levels of the target dimension.

A complete methodology to create new hierarchies in a multidimensional model with Hierarchical Agglomerative Clustering is presented in [7]. The main idea of this methodology is to build a new hierarchy into a dimension by using data, which describe items at the lowest level of the hierarchy. In our case, items are census points and description data are factual data. We suggest to use the Hierarchical Agglomerative Clustering, due to the similarity between the output of the Hierarchical Agglomerative Clustering and a hierarchy into an OLAP dimension [12].

Main steps of this algorithm are: (1) Calculation of distances between individuals; (2) Choice of the two nearest individuals. (3) Aggregation of the two nearest individuals in a cluster. The cluster is considered an individual. (4) Go back to the step 1 and loop while there is more than one individual.

In our approach the clustering (AHC) takes as inputs the instance of the source node \(f_{s}\) evaluated on each member of the context dimension and dicing it non context dimensions.

Formally, the step 2 of our algorithm is the following:

Begin of the Hierarchy Builder Algorithm

for each\(member_{i}\)of\(d_{c}\)

. create a new hierarchy of \(d_{t}\)

. AHC(I(\(f_{s}\),\(d_{1}^{nc}\).ALL,... ,\(d_{v}^{nc}\). ALL, \(d^{c}\).\(member_{i}\), \(d_{t}\).*))

End of the Hierarchy Builder Algorithm

An example is presented on Fig. 6. We note that two hierarchies for the spatial dimension have been created for years 2002 and 2011.
Fig. 6.

Contextual hierarchies of census points.

5 Validation and Experiments

In this section we present the implementation our proposal. Next, we present some calculated hierarchies. Finally, a semantic and performance evaluations are detailed in Sect. 5.3.

5.1 Implementation

The refinement tool implements our algorithm using Matlab3. It allows defining graph using a simple visual interface as shown on Fig. 7. The considered multidimensional graph is presented on the top part of the visual interface. On the bottom one, the algorithm ask inputs to users in a command window.

The data warehouse is implemented using PostgreSQL4. The OLAP server is Mondrian5, and the OLAP client is Saiku6.
Fig. 7.

Visual interface of the refinement tool.

5.2 Results

In this section, we present a comparison between two contextual hierarchies.

On Fig. 8, we provide the three main clusters of census points along the Loire river, using field survey data collected by the STORI in 1990. In contrast, the Fig. 9 provides the two main clusters of census points along the Loire river, using land cover data collected in a geographical information system7 by French state agencies in 2011.

These two clustering results correspond to the top levels of the hierarchies calculated by the proposed refinement algorithm. The refinement algorithm calculates obviously complete hierarchies, and our prototype implements these complete hierarchies. On Figs. 8 and 9, we provides only the top levels for a better reading.

If we compare the map in Figs. 1, 8 and  9, we can note that the typologies calculated by the refiement algorithm for 1990 and 2011 are not the same, and do not correspond to administrative division or ecological area defined in the catchment area of the Loire river. Thus, these typologies are original classifications of census points.
Fig. 8.

Census point clusters obtained by AHC based on field surveys in 1990.

Fig. 9.

Census point clusters obtained by AHC based on land cover in 2011.

5.3 Validation

Semantic Evaluation. In this subsection, we describe the added-value of our methodology from a design point of view (i.e. does the refinement methodology corresponds to decision-makers needs?). For that goal two we have investigated two aspects: (1) Do dimensions and facts created using our methodology correspond to decision-makers analysis needs?; (2) Do hierarchies created using our methodology improve analysis capabilities?

Therefore have decided to compare the result of our methodology with one proposed in [22]. Indeed, [22] propose a manually method to obtain a multi-version multidimensional schema, and when the time dimension is chosen as the context dimension our approach results a multi-version multidimensional schema. The result of this validation shows that the multidimensional schema produced with the manual methodology and our automatic methodology are equal.

Moreover, in order to validate the semantic correctness of using AHC for hierarchies definition, we have asked to ecologists of the project to choice between a spatial dimension with only one level, and a spatial dimension with a hierarchy created using AHC. When the number of created levels is not superior to 5, decision-makers prefer having hierarchies, since they can reveal interesting pattern such as agricultural profiles of census points. For example, data in the “Environments” fact table contains data that describe agriculture policies around each census point at each year. The data clustering according to these data can classify census points and allows decision-makers analyzing impact of agricultural practices on bird biodiversity. For example, decision-makers can analyze biodiversity according to agricultural forest and grassland parameters of census points, by using this simple OLAP query: “What is the biodiversity value per group of census points (first level of the hierarchy obtained with clustering) in 2002 and 2003?”. This query can reveal that for the same year, for example 2002, biodiversity is very affected by agricultural parameters since the aggregated biodiversity value for each group of census point is different.

Performance Evaluation. In this subsection, we test time performance of our methodology in order to validate its feasibility from a project deployment process point of view.

In particular we study time performance related to: (1) refinement algorithm for facts and dimension design, and (2) hierarchy creation using AHC.

In order to test the first point, we have created a set of 200 simulated constellation schema using from 2 to 100 dimensions, since real usable multidimensional schema presents maximum between 3 and 10 dimensions [2]. Finally, the worst time execution is 15.23 s. The average execution time is equal to 11.7 s with a standard deviation equal to 1.17 s. These performances are satisfactory for are good for an off-line design phase.

In this paragraph, we study time performances of the AHC algorithm. In this paragraph, “classified items” are census points (which are members of the “census points” dimension, the target dimension) and “attributes” are aggregated facts from the “Environments” fact node (which is the source fact node). The AHC algorithm has been also implemented in Matlab and its performance has been also tested. Using our case study data, we perform 2090 tests, with a number of classified items (source node instances-Enverinments facts) between 10 and 190, and a number of attributes (source node attributes-Enverinments fact measures) between 10 and 100, and the average calculation time is equal to 0.072 s, with a standard deviation equal to 0.002 s. To complete our evaluation, we simulate a data set with 10,000 classified items and 150 attributes. In this case, the AHC calculates a hierarchy in 147.36 s, with a standard deviation equal to 4.03, with a maximal calculation time equal to 214 s. All time performances are shown on Fig. 10. This calculation time (approximately four minutes) is efficient for an off-line design phase.
Fig. 10.

Execution times according the number of attributes and classified items.

6 Conclusion and Future Work

Design data warehouses system is a complex and crucial task depending on available data sources and decisional requirements. Existing work do not exploit the semantics of data to automatically create complex hierarchies. Thus in this paper, we present a mixed multidimensional refinement methodology, that transform constellation schema to define hierarchy level using a hierarchical clustering algorithm. Our refinement methodology enriches a dimension with factual data, and considers the context of factual data.

In our case, the context was the “Time” dimension. In this case, our work can be considered as a versionning methodology. But, the proposed prototype does not use temporal properties of the “Time” dimension. In other words, the proposed prototype is able to calculate new hierarchies for the spatial dimension depending on years, as well as new hierarchies for the temporal dimension depending on census points.

We present also the implementation of our method in a ROLAP architecture.

We perform the proposed methodology on a real application case from bird biodiversity. We have noted that actual automatic multidimensional design methodologies cannot produce a multidimensional schema, which covers all decision-maker needs due to the data complexity. Our methodology offers a solution to enrich dimensions with factual data and, by this way, to refine the multidimensional schema.

Our ongoing work is the extension of our methodology to simplify and reduce the number of created levels, using other DM algorithms such as SVM, etc., in order to provide decision-makers with easy OLAP exploration analysis and its implementation in a ROLAP architecture.

Moreover, we are also working to integrate our approach in the rapid prototyping methodology proposed in [16], and extending to help decision-makers and DW experts choose the right DM algorithms and parameters of the refinement algorithm (source node, contextual dimensions, etc.).

Future work concerns the usage of the formal evaluation framework Goal Question Metric [24] to evaluate our methodology.

Footnotes

  1. 1.

    In this paper, the notation \((f_{i},d_{j})\) represents the arc from fact node \(f_{i}\) to dimensional node \(d_{j}\).

  2. 2.

    ‘*’ means ‘all members of the dimension’.

  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.

Notes

Acknowledgements

Data acquisition received financial support from the FEDER Loire, Etablissement Public Loire, DREAL de Bassin Centre, the Région Bourgogne (PARI, Projet Agrale 5) and the French Ministry of Agriculture. We also thank heartily Pr. John Aldo Lee, from the Catholic University of Leuven, for his help.

References

  1. 1.
    Phipps, C., Davis, K.C.: Automating data warehouse conceptual schema design and evaluation. In: Proceedings of the 4th International Workshop on Design and Management of Data Warehouses (DMDW), vol. 2 (2002)Google Scholar
  2. 2.
    Kimball, R.: The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. Wiley, New York (1996)Google Scholar
  3. 3.
    Romero, O., Abello, A.: A survey of multidimensional modeling methodologies. Int. J. Data Warehouse. Min. 5, 1–23 (2009)CrossRefGoogle Scholar
  4. 4.
    Mahboubi, H., Ralaivao, J.C., Loudcher, S., Boussaïd, O., Bentayeb, F., Darmont, J., et al.: X-WACoDa: an XML-based approach for warehousing and analyzing complex data. In: Data Warehousing Design and Advanced Engineering Applications: Methods for Complex Construction, pp. 38–54 (2009)Google Scholar
  5. 5.
    Jensen, M.R., Holmgren, T., Pedersen, T.B.: Discovering multidimensional structure in relational data. In: Kambayashi, Y., Mohania, M., Wöß, W. (eds.) DaWaK 2004. LNCS, vol. 3181, pp. 138–148. Springer, Heidelberg (2004)CrossRefGoogle Scholar
  6. 6.
    Favre, C., Bentayeb, F., Boussaid, O.: A knowledge-driven data warehouse model for analysis evolution. Frontiers Artif. Intell. Appl. 143, 271 (2006)Google Scholar
  7. 7.
    Sautot, L., Faivre, B., Journaux, L., Molin, P.: The hierarchical agglomerative clustering with gower index: a methodology for automatic design of OLAP cube in ecological data processing context. Ecol. Inf. 26, 217–230 (2014) (in Press)Google Scholar
  8. 8.
    Jovanovic, P., Romero, O., Simitsis, A., Abelló, A.: Ore: An iterative approach to the design and evolution of multi-dimensional schemas. In: Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP, DOLAP 2012, pp. 1–8. ACM, New York (2012)Google Scholar
  9. 9.
    Romero, O., Abello, A.: Automatic validation of requirements to support multidimensional design. Data Knowl. Eng. 69, 917–942 (2010)CrossRefGoogle Scholar
  10. 10.
    Carmè, A., Mazon, J.N., Rizzi, S.: A model-driven heuristic approach for detecting multidimensional facts in relational data sources. In: Bach Pedersen, T., Mohania, M.K., Tjoa, A.M. (eds.) DAWAK 2010. LNCS, vol. 6263, pp. 13–24. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  11. 11.
    Nguyen, T.B., Tjoa, A.M., Wagner, R.R.: An object oriented multidimensional data model for OLAP. In: Lu, H., Zhou, A. (eds.) WAIM 2000. LNCS, vol. 1846, pp. 69–82. Springer, Heidelberg (2000)CrossRefGoogle Scholar
  12. 12.
    Messaoud, R.B., Boussaid, O., Rabaséda, S.: A new OLAP aggregation based on the AHC technique. In: DOLAP 2004, ACM Seventh International Workshop on Data Warehousing and OLAP, pp. 65–72 (2004)Google Scholar
  13. 13.
    Bentayeb, F.: K-means based approach for OLAP dimension updates. In: 10th International Conference on Enterprise Information Systems (ICEIS), pp. 531–534 (2008)Google Scholar
  14. 14.
    Leonhardi, B., Mitschang, B., Pulido, R., Sieb, C., Wurst, M.: Augmenting OLAP exploration with dynamic advanced analytics. In: 13th International Conference on Extending Database Technology (EDBT 2010) (2010)Google Scholar
  15. 15.
    Ceci, M., Cuzzocrea, A., Malerba, D.: OLAP over continuous domains via density-based hierarchical clustering. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds.) KES 2011, Part II. LNCS, vol. 6882, pp. 559–570. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  16. 16.
    Sautot, L., Bimonte, S., Journaux, L., Faivre, B.: A methodology and tool for rapid prototyping of data warehouses using data mining: application to birds biodiversity. In: Ait Ameur, Y., Bellatreche, L., Papadopoulos, G.A. (eds.) MEDI 2014. LNCS, vol. 8748, pp. 250–257. Springer, Heidelberg (2014)Google Scholar
  17. 17.
    Arora, M., Gosain, A.: Schema evolution for data warehouse: a survey. Int. J. Comput. Appl. (0975–8887) 22, 6–14 (2011)Google Scholar
  18. 18.
    Subotic, D., Poscic, P., Jovanovic, V.: Data warehouse schema evolution: state of the art. In: Proceedings of the Central European Conference on Information and Intelligent Systems, pp. 18–25 (2014)Google Scholar
  19. 19.
    Legube, B., Merlet, N.: Les indicateurs biologiques de la qualité de l’eau. In: L’analyse de l’eau. 9e edn., pp. 865–962. Dunod (2009)Google Scholar
  20. 20.
    Blondel, J., Ferry, C., Frochot, B.: Point counts with unlimited distance. In: Ralph, C.J., Scott, J.M. (eds.) Estimating Numbers of Terrestrial Birds. Studies in Avian Biology. vol. 6, pp. 414–420 (1981)Google Scholar
  21. 21.
    I.B.C.C.: Censuring breeding bird by the I.P.A. method. Pol. Ecol. Stud. 3, 15–17 (1977)Google Scholar
  22. 22.
    Miquel, M., Bédard, Y., Brisebois, A., Pouliot, J., Marchand, P., Brodeur, J.: Modeling multi-dimensional spatio-temporal data werehouses in a context of evolving specifications. Int. Arch. Photogrammetry Remote Sens. Spat. Inf. Sci. 34, 142–147 (2002)Google Scholar
  23. 23.
    Lenz, H.J., Thalheim, B.: A formal framework of aggregation for the OLAP-OLTP model. J. Univ. Comput. Sci. 15, 273–303 (2009)MathSciNetMATHGoogle Scholar
  24. 24.
    Briand, L.C., Morasca, S., Basili, V.R.: An operational process for goal-driven definition of measures. IEEE Trans. Softw. Eng. 28, 1106–1125 (2002)CrossRefGoogle Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Lucile Sautot
    • 1
  • Sandro Bimonte
    • 2
  • Ludovic Journaux
    • 3
  • Bruno Faivre
    • 1
  1. 1.UMR BiogéosciencesUniversité de BourgogneDijonFrance
  2. 2.IRSTEA Centre de Clermont-FerrandAubièreFrance
  3. 3.UMR LE2IUniversité de BourgogneDijonFrance

Personalised recommendations