Advertisement

Vietnam Journal of Computer Science

, Volume 2, Issue 3, pp 169–179 | Cite as

The design of valid multidimensional star schemas assisted by repair solutions

  • Ali SalemEmail author
  • Hanêne Ben-Abdallah
Open Access
Regular Paper

Abstract

Multidimensional schemas are used to model data warehouse systems, a special type of large databases dedicated for decision support. Several approaches have been proposed in this domain to ensure the satisfaction of decision makers needs and the accuracy of the generated schemas. This paper presents an approach for the validation of multidimensional star schema assisted by repair solutions. Our approach aims to assist designers by detecting constraint violations and proposing repair solutions based on a number of error-based rules formalized in Prolog. Its efficiency stems from using both linguistic similarity computation and a set of heuristics developed by bottom-up design methods to produce warnings about semantic constraint violations and their impact on the quality of the analysis results.

Keywords

Multidimensional model Star schema  Data warehousing Design Linguistic similarity 

1 Introduction

Face to the international, unrestrained economic competition, an increasing interest in decision support systems has emerged over the last two decades. A decision support system assists decision makers to extract data pertinent to their analysis interests from their transactional systems (called on-line transactional processing systems). Most decision support systems rely on on-line analytical processing systems where data is often stored in multidimensional databases. These latter can be organized in terms of two types of storage areas: a data warehouse that regroups all data required for any potential analysis requirements, and/or a set of data marts each regrouping data extracts required for one particular analytical requirement, e.g. analyzing the sales of the enterprise. Given the fact that a data warehouse can be developed based on a set of data marts, in the remainder of this paper we will not distinguish between the two types and we will mean data mart development whenever we talk about decision support systems development.

The literature provides several methods (notations and processes) proposed to develop and/or enhance decision support systems (cf. [6, 17, 21, 34, 41, 45]). This area of research, however, still faces critical challenges to foster the efficient development of this type of systems. Similar to any software system, the efficient development of a decision support system depends on the quality of its models (specification and design, in particular). In their attempts to model a decision support system based on the data warehouse approach, the proposed methods have often relied on various multidimensional models to specify data mart schemas, including the star, constellation, and snowflake schema models [20, 25]. One main advantage of the used multidimensional models is their relatively simple graphical notations (cf. [2, 27]). However, current development processes may produce ambiguous, and sometimes, inconsistent schemas. By inconsistent, we mean a schema that is either ill-structured (for instance, having a circular functional dependency among its elements), or not amenable to loading from a data source (for instance, using elements unmatchable within the data source). These inconsistencies may result in non-functional decision support systems. The present work was undertaken to investigate the feasibility and potential gain effects brought by an approach based on a formal language (Prolog) for the design of a valid star schema.

Our design approach keeps the graphical notation advantage by allowing designers to draw their multidimensional schema. It also complements existing design approaches by offering a framework for the analysis of the structural, conformity and semantic constraints satisfaction of the schemas they produce. It aims at assisting designers by detecting constraint violations and proposing repair solutions based on a number of rules formalized in Prolog. Its efficiency stems from three sources: (i) the formalization strategy used facilitates the traceability of the violated constraints pertinent to schema elements [37]; (ii) the use of linguistic/semantic similarity computation (widely used in various artificial intelligence applications) together with the use of a set of heuristics developed by bottom-up design methods assist in the analysis of the conformity and semantic constraints; and (iii) the definition of a set of error-based repair solutions offers vital feedback to the designers for repairing their schema based on the errors detected.

That is, our approach is capable of detecting violations of various classes of constraints and suggesting repair solutions for any violated structural and/or conformity constraint. In addition, for every violated semantic constraint, it produces warning messages and information about the impact of the constraint violation on the quality of the analysis results. This paper presents our proposed approach with special focus on repair rules as they represent the underlying knowledge base from which repair solutions can be suggested.

The remainder of this paper is organized as follows: Sect. 2 first overviews the multidimensional schema concept through an example; it then provides a review of the related works currently available in the literature on the topic at hand. Section 3 describes our proposed design validation approach. Section 4 presents an illustrative case study and Sect. 5 reports on the results of a preliminary evaluation of the proposed approach. Finally, Sect. 6 summarizes the herein presented propositions and highlights their future extensions.

2 Related works

2.1 Multidimensional modeling: the star schema

The multidimensional star schema model is the basic building block for the remaining models (including the constellation, and snowflake schema models [20, 25]), and therefore any work on the star schema can be generalized on the remaining models.

A star schema represents one subject of analysis, called fact, as a point in a multidimensional space. The fact itself contains a set of measures that represent different numerical analysis indicators. The analysis can be performed along various axes called dimensions; these latter organize data in terms of hierarchies representing different levels of data details/aggregations [13]. Each level (also called attribute) can be further described with weak attributes.

Let us consider the car rental example depicted in Fig. 1 to illustrate the star schema concepts mentioned above. In this example, a decision-maker is interested in analyzing the fact renting in terms of the amount measure. The analysis can be conducted along three dimensions, namely cars, agency and date. Each of these dimensions offers several levels of amount aggregation, along one hierarchy for the date and agency dimensions and along one of two possible hierarchies for the cars (hcars1 and hcars2) dimension. For example, the hierarchy hcars1 offers aggregations along reg, model, brand, and/or all, respectively. The all attribute is the closure of each hierarchy, representing the highest level of granularity. The attribute power is a weak/descriptive attribute associated with the level reg.
Fig. 1

Example of multidimensional star schema: car rental fact

2.2 Constrained multidimensional schema generation

The quality of a multidimensional schema (and/or a data warehouse) was extensively addressed in the literature on data warehousing, particularly through metrics defined in terms of the structural elements of the schemas (cf. [38, 39]) and through sets of constraints defined at the model level.

Inspired from the software engineering domain, the so-far proposed metrics offer a heuristic view about the quality of data warehouse schemas based on empirical studies and best practices (cf. [15, 23]). While they can be used as indicators to the structural and, to a limited extent, semantic quality of data warehouse schemas, these metrics do not offer a means to the validation of the loadability of a schema. On the other hand, several researchers (cf. [7, 20, 25]) have proposed a set of constraints that a data warehouse schema must respect to produce both syntactically correct and loadable data warehouse schemas. This herein presented work is interested in constraint-based consistency checking given their coverage of the syntactic, semantic and loadability aspects.

The literature of multidimensional modeling provides several types and classes of constraints (cf. [2, 9, 11, 12, 19, 20, 27, 28, 36]). Our approach supports all existing constraints presented in the literature; in addition, it uses a simple classification for the structural, conformity, and semantic constraints.

The existing three approaches to multidimensional schema design account for only some of the constraints when generating the schemas, depending on their objective. More specifically, focused on generating schemas that meet the decision makers’ needs, the top-down approach [1, 22, 32, 43] disregards the data source involved in the definition of a multidimensional model, which, as previously reported in [30], may result in unloadable schemas where the conformity constraints may also not hold; in addition, this approach does not explicitly address the structural well-formedness of the produced schemas.

On the other hand, starting from the data sources [8, 14, 16, 31], the bottom-up approach ensures that the resulting multidimensional schema is loadable and respects the conformity constraints; however, it does not guarantee the structural well-formedness of the schema nor the satisfaction of decision makers’ needs. In this approach, the satisfaction of the structural constraints depends on both the identification rules of the multidimensional elements and the design quality of the source schema.

To overcome these problems, the mixed approach [5, 21, 40, 42] combines the two previous approaches by confronting the constructed schema with the data source schema.

Except for the work of Bargui et al. [5] which uses linguistic relationships, existing mixed methods rely on an exact match between the names of the multidimensional schema and data source schema elements; such assumption is too restrictive given the domains differences (decision making and information system). In addition, none of the mixed approaches offers repair solutions when a constraint is violated. The work presented in this paper complements all three types of design approaches by explicitly verifying all types of constraints in an automated way. In addition, it assists designers by suggesting repair solutions. Finally, this approach integrates the semantic similarity computation [18] to solve the problem pertinent to the difference between the names used in the design of the star schema and the names existing in the data source schema.

3 Proposed verification and validation methodology

Our approach relies on a formalization approach using Prolog to facilitate the automated analysis of star schemas. This formal language was selected particularly because of its multiple advantages: Its highly expressive power covers all the multidimensional concepts and constraints and the relational data base concepts and constraints. Moreover, its notation is relatively user-friendly and conducive to the simple understanding of mathematical notation. Last but not least, it offers a variety of useful editing and verification tools.

3.1 Overview

As illustrated in Fig. 2, our approach of multidimensional star schema analysis proceeds by transforming the graphical star schema into Prolog predicates. For validation purposes, the source database schema must also be translated into Prolog predicates. Afterward, the verification and validation (V&V) process can be invoked via the interrogation of Prolog. The results of this latter step are then interpreted to produce a report with the list of errors (if any), suggested repairs, and quality warnings. To make use of our approach, we have developed a toolset which offers a graphical editor through which the designer can draw his/her schema, invoke the Prolog translation and analyzer, and receive graphical and/or textual analysis reports.

The remainder of this section details the steps of our analysis approach while discussing their theoretical bases.
Fig. 2

Verification and validation framework of multidimensional star schemas

3.2 Semantic similarity integration

When using our approach, designers can draw their star schemas independently of the sources. They can, therefore, use multidimensional elements having names that are different from those existing in the source schemas as far as they belong to the same context. Accordingly, the names used in the star schema must be semantically similar to those existing in the source schemas. The similarity determination problem can be resolved through computing the semantic similarity between the names existing in the sources and the names the designer used in the star schema.

The exploitation of computers to quantify and compare semantic similarities is an important area of research in artificial intelligence. In particular, the development of efficient measures to compute concept similarity is fundamental for computational semantics. In this research area, several similarity measures rely on knowledge resources to quantify semantic resemblance, such as the WordNet “is a” taxonomy; this latter quantifies the common characteristics between the concepts expressed by the relation of specialization/generalization. Some of these measures are based on topological parameters extracted from the taxonomical hierarchy to achieve the best expression possible of the semantic content. Among these parameters, we cite the depth, the ancestors’ number and the descendants’ number which are exploited to quantify the specificity of a concept. These measures are evaluated using benchmarks that are based on datasets formed by words pairs evaluated by experts, such as RG65 [35], MC30 [29] and AG203 [3].

Our method uses WordNet to discover the semantic similarity among the names used in the star and source schemas. More specifically, it uses WordNet synsets in adapting the similarity measure proposed in [26] by applying the information content1 quantification method of [17].

WordNet represents a polysemous word by a set of synsets each of which contains cognitively synonymous nouns. For a given word w, its sysents can be retrieved through the WordNet “is a” taxonomy. Given a word \(w_1\) used to name a star schema element and a word \(w_2\) used to name a source schema element, the similarity measure proceeds as follows: it extracts their two sets of synsets [Syn (\(w_1\)), Syn(\(w_2\))] from WordNet; then it computes the semantic similarity between the words (\(w_1\), \(w_2\)) as follows:
$$\begin{aligned} SemSim(w_{1},w_{2})=\underset{(s_{1},s_{2})\in Syn(w_{1})\times Syn(w_{2})}{max} ICSemSim(s_{1},s_{2}) \end{aligned}$$
where ICSemSim(\(s_1\),\(s_2\)) represents the semantic similarity estimation based on the quantification method of information content [17] and the similarity measure of Lin [26] as follows:
$$\begin{aligned} ICSemSim(s_{1},s_{2})= \frac{2 \times IC(LCS(s_{1},s_{2}))}{IC(s_{1})+IC(s_{1})} \end{aligned}$$
where: IC(\(s_i\)) represents the information content of each synset \(s_i\) representing the word \(w_i\) computed using the method from [17] which is based on the ancestors’ subgraphs in the “is a” taxonomy; and LCS(\(s_1\),\(s_2\)) represents the lowest common subsumer of the two synsets \(s_1\) and \(s_2\) from the WordNet “is a” taxonomy. It’s important to note that the provided values are contained in the interval [0, 1].

When a name w belonging to the star schema does not exist in the source schema, the semantic similarity is computed between w and each name \(w_i\) in the source schema and which may play the role of w in the star schema. For example, the name of a dimension or a fact should be compared with only relationship names; this restriction reduces the computational complexity of the similarity measurement process and respects the heuristics of the bottom-up approach to star schema design.

Let Candidates Words be the set of names in the data source schema.

Among these candidate names, we will pick the name \(w_i\) as the correspondent to the name w in the star schema such that \(w_i\) has the maximum \(SemSim(w,w_{i})\) value.

Note that when computing the semantic similarity degrees between a name w and the names in the CandidatesWords set, we may obtain low values; that is, none of the candidate words expresses the same meaning as w. To make sure that we only retain a name that is reasonably similar to w, we used a threshold below which no \(w_i\) would be chosen. As discussed in Sect. 5, the value of this semantic similarity threshold can be determined experimentally. It can be also fine-tuned by the star schema designer in consultation with the data source administrator. The retained couple must be added to the knowledge base of prolog. To do this, we add the couple (w, \(w_i\)) to the prolog file containing the description of the source and the star schema using a predicate.

3.3 Formalization in Prolog

Our formalization strategy uses a set of transformation rules that preserve the “informal” semantics of the graphical star schema. It also ensures the traceability between the Prolog predicates and the analyzed star schema. This advantage is crucial for giving feedback on potentially violated constraints to the designer. Table 1 summarizes the transformation rules of the star schema concepts into Prolog.
Table 1

Transformation rules of star schema concepts

Concept

Rule

Example (Fig. 1)

Fact

Each fact is transformed into a fact predicate. It has three arguments: the first is the fact name; the second is the set of measures; and the third is the set of dimension names

fact(renting,[amount],[cars, agency, date])

Dimension

Each dimension is transformed into dimension predicate. It has two arguments: the first is the dimension name; the second is the set of hierarchy names

dimension(cars,[hcras1, hcras2])

Hierarchy

Each hierarchy is transformed into hierarchy predicate. It has three arguments: the hierarchy name, its list of levels, and its list of attributes

hierarchy(hcras1,[reg, model, brand, all],[reg, power, model, brand, all])

Attribute

Each attribute is transformed into attribute predicate. It has two arguments: its name and weight. The weight indicates that an attribute is an identifier, level or weak attribute (descriptor)

attribute(reg,id)

To provide for the validation of a star schema with respect to a relation data source, we also have defined transformation rules of relational schemas into Prolog. Briefly, each relationship is expressed in Prolog by the relation predicate with four arguments: the name of the relation, its primary key, its foreign key list, and the non-key attributes list. The attributes of the foreign key list are not necessarily considered as primary keys. For this reason, we distinguish two lists: a primary keys list and a foreign keys list. In the case where a foreign key is considered as a primary key, it must belong to both lists.

3.4 Prolog interrogation

This step consists in executing the process of the structural verification, the validation or verification of the semantic constraints. The structural verification process is a syntactic test. It checks the structural constraints presented in Table 2.
Table 2

List of multidimensional star schema structuralconstraints

Concept

Constraints

Description

Hierarchy

Unique identifier (CUIh)

The hierarchy contains a unique identifier

Unique all (CUAh)

The hierarchy contains a single level read all

Finest granularity attribute (CFAh)

In a hierarchy, the levels are classified from the finest granularity to the highest. The identifier is the level of the finest granularity

Higher granularity attribute (CHAh)

The level all is higher granularity. It is used to close a hierarchy

Non empty hierarchy (CNEh)

Each hierarchy has at least two levels: the id and the level all

Non cyclic (CNCh)

This constraint denies the existence of a cycle in the hierarchy

Dimension

Non empty dimension (CNEd)

Each dimension has at least one hierarchy

Unique identifier (CUId)

The dimension has an unique identifier

Unique all (CUAd)

The dimension contains a unique level all

Fact

Orthogonality (CORf)

The dimensions of a fact must be functionally independent

NonEmpty fact (CNEf)

Each fact contains at least one measure

Non isolate fact (CNIf)

Each point is connected to at least one dimension

As far as the validation process is concerned, it necessarily involves the testing of conformity constraints (see Table 3). The latter are inspired from bottom-up approaches. Thanks to the conformity constraints, the validation of the designers’ schemas is reduced to a simple checking of conformity constraints. To do so, we need a connection to the source database to extract its relational schema. In fact, to ensure the conformity constraints, we will need the names of primary key, foreign key, and non-key attribute for each relation and the types of each attribute. For this reason, we define a type preducate with two arguments: the attribute name and its type. The naming conflict between the star schema and data source schema is resolved by interaction with the dictionary WordNet2 [24] to account for other linguistic relationships, such as synonym and similarity [18].
Table 3

List of multidimensional star schema conformity constraints

Concept

Constraints

Description

Fact

Fact conformity (CFC)

Fact name must coincide with the name of one relation and all measures must be numeric and non-key attributes

Measure conformity (CMC)

All fact measures must belong only to the list of the non-key attributes of the relation having the same fact name. If the measure does not exist in this list, then it must belong to list of non-key attributes of a parallel relation to the relation having the same fact name

Orthogonality from the sources (COS)

It checks the functional dependency between the levels of the pairwise dimension hierarchies of the fact at the source schema

Dimension

Dimension conformity (CDC)

The dimension identifier must be at the same time a primary key of this relation and a foreign key for the fact relation. Otherwise, the identifier is a non-key attribute in the fact relationship. It must be a Boolean or temporal attribute in this fact relationship or a non-key attribute in another relation. Otherwise, the identifier corresponds to a primary key and not a foreign key in the fact relation

Hierarchy

Level conformity (CLC)

An attribute is considered as level when it is a primary key in any relationship. Otherwise, it is non-key attribute in relationship having the same dimension name and whose type is date or Boolean, or it belongs to another relationship

Hierarchical dependency (CHD)

A functional dependency must exist between any level and its successor in the source schema

Weak-attribute conformity (CWC)

There must exist a direct functional dependency, in the source schema, between any level and its weak attributes

The semantic constraints can be divided into two categories (see Table 4): summarizability and domain constraints. Our approach verifies theses two categories of semantic constraints based on a data source. Accordingly, we defined a set of queries inspired from re-engineering works [4, 44]. The results of these queries enrich the knowledge base of Prolog.
Table 4

List of multidimensional star schema semantic constraints

Concept

Constraints

Description

Summarizability

Disjunction

One instance of the strongest granularity level must be associated to one instance of the finer level level at most

Completeness

When moving from one level to another, this constraint requires that any given instance must relate to an instance at the new level

Fact-dimension relationship

It requires that any fact instance must be associated with at least one instance of dimension

Domain constraints

Restriction domain instances

It expresses an interval of level instances defined by the designer

To illustrate the verification process of the semantic constraints, let us consider the verification process of the disjunction constraint for the supplier dimension shown in Fig. 3.
Fig. 3

Supplier dimension

The supplier dimension contains one hierarchy, called hSupp, composed of the following levels: id, city, country, continent, and all. Table 5 expresses three instances of the supplier dimension: Ins1, Ins2, and Ins3. We note that the instance Tripoli related to the city level is associated with two instances of the country level: Libya and Lebanon. This instance sharing means that the disjunction constraint is violated. Our approach detects this problem at the source level, and warns the designer that violation of the disjunction semantic constraint may to engender certain inconsistencies in the results of the future decisional system. Indeed, two instances of city level associated with more than one instance of id. According to id, certain measures are probably counted many times.
Table 5

Example of supplier dimension instances

 

Id

City

Country

Continent

Ins1

f1

Tripoli

Libya

Africa

Ins2

f2

Tripoli

Lebanon

Asia

Ins3

f3

Paris

France

Europe

3.5 Interpretation and repairing solutions

The interpreter generates two types of reports: error and warning reports. After the transformation of the multidimensional star schema and the source schema in Prolog, the error reports are generated by mapping between the interpreter and Prolog. An error report consists of two zones: error messages and repairing solutions.

The warning messages inform the designers about the result qualities of the generated cube. They are the results of the semantic verification process. They identify the percentage of instances that do not satisfy each condition of the summarizability constraint, the most accurate results and the incoherence reasons. As for the domain constraints, the interpreter returns to the designer with a list of the violated constraints, it indicates the fact, dimension, hierarchy and level names.

The repairing solutions are detected by a set of rules we defined in Prolog. These rules are inspired from the conformity constraints (see Table 2). Each rule takes as a starting point the predecessor of the element that violates the constraint in the structure of the multidimensional star schema. The repairing rules are:
  • Rule of searching facts (RSF): the role of RSF is to provide designers with propositions as to which relationships that are able to be facts. The RSF selects each relationship that respects the fact conformity constraints. For each selected fact, the interpreter interrogates Prolog to execute the RSD rule. The selected fact is rejected when the result of the RSD rule is “No”. So, we conclude that the proposed fact cannot be related to any dimension.

  • Rule of searching measures (RSM): it returns to designers all attributes respecting the conformity constraint measure. RSM checks all attributes of fact relationship and its parallel relationships.

  • Rule of searching dimensions (RSD): To search all possible dimensions, RDS selects each relationship that respects the conformity constraints dimension. After detecting all dimensions, the orthogonality test is performed. The returned results represent all possible combinations of dimensions.

  • Rule of searching level (RSL): RSL searches all attributes that depend functionally and directly on the given level and that respect the level conformity constraint. Recursively, RSL allows the deduction of some new hierarchies. In the case where the identifier is taken as the given level, RSL returns all possible hierarchies. From a given level, the RSL rule provides designers with all possible successors. It is possible to give new hierarchies. The RSL rule is applied when some constrains are not satisfied, including the non-empty dimension, non-cyclic hierarchy, level conformity, and hierarchic dependency.

  • Rule of searching weak attribute (RSW): the RSW provides all the attributes that directly depend on a given level and that do not satisfy a level conformity constraint. The interpreter applies the RSW when a weak attribute conformity constraint is violated.

The repairing rules are used after the V&V process. Table 6 indicates the rules appropriate to fix the violated constraints.
Table 6

Repair rules according to the violated constraints

 

CHAh 

CNEh 

CNCh 

CNEd 

CORf 

CNEf 

CNIf 

CFC 

CMC 

COS 

CDC 

CLC 

CHD 

CWC 

RSF

     

X

X

X

      

RSM

     

X

  

X

     

RSD

    

X

 

X

X

 

X

X

   

RSL

X

X

X

X

       

X

X

 

RSW

             

X

4 Case study

To explain the technical details of our approach, we propose the V&V of the sale fact analyzed by period, documents, and publisher dimensions (Fig. 4).
Fig. 4

Sale fact

Table 7 describes the source schema after its transformation into Prolog. The attribute types are extracted and added into Prolog knowledge bases.
Table 7

Code Prolog related to the relational source schema

relation(author,[codAut],[],[fnameAut, snameAut, adrAut, specialty]).

relation(bookstore,[codBoo],[],[nameBoo, city, state, gov, country]).

relation(client,[codCl],[],[fnameCl, snameCl, numTelCl, adrCl]).

relation(buying,[ codDoc, codDi, codBoo],[codDoc, codDi, codBoo],[days, quantity\(_b\)]).

relation(distributor,[codDi],[],[nameDi, adrDi, numTelDi]).

relation(documents,[codDoc],[codEd, codAut],[titleDoc, nbrPaDoc, price, language, specialty]).

relation(editor,[codEd],[],[nameEd, numTelEd, adrEd, city, country]).

relation(governorate,[idGov],[],[idCountry, gov]).

relation(local,[idCity],[idState, idGov],[city]).

relation(sale,[codDoc, codCl, codBoo, codEd],[codDoc, codCl, codBoo, codEd],[days, amount]).

relation(tCountry,[idCountry],[],[country]).

relation(tState,[idState],[],[idCountry, state]).

relation(command,[codDoc,codCl, codBoo],[codDoc, codCl, codBoo],[days, quantity]).

The sale fact presents three structural errors:
  • The fact does not respect the CNEf. In fact, the fact does not contain measures. RSM is executed and searches in the sources schema for the relation whose name is similar to the fact name. It then looks for all attributes that belong to this relation and that can respect the CCM. The repairing solution is: amount and quantity. The first is a numeric attribute belonging to the sale relation; the second belongs to the command relationship and is detected because the command is parallel to the sale relation.

  • The CNCh is violated in the hierarchy docAut of the dimension documents. In this case, one occurrence of specialty will be replaced or eliminated. The RSL is automatically executed. In the first instance of execution, RSL searches for the relation having a name that is similar to that of the dimension. In the second instance of execution, it takes specialty as a starting point. It checks the fact that docAut is not a successor of specialty. In order to validate the correctness of the second occurrence of specialty, RSL is executed for another time and takes codAut as a starting point. To search for the possibility of substituting the first occurrence of specialty, RSL is re-executed for another time with the predecessor of the specialty level, codDoc, taken as a starting point. The repair solution is: codDoc, codAut, specialty, and all. Overall, the RSL must be executed at least three times by considering the following as starting points: the first occurrence of level violating the CNCh, its predecessor, and the predecessor of the second occurrence.

  • The CNEd is not respected in the publisher dimension. In fact, the publisher dimension does not contain hierarchies. To compute a repair solution, the relation having a name similar to that of the dimension is selected. In our example, this relation is editor. The RSL is then executed, with the identifier of the editor relation being taken as a starting point is. A new hierarchy is provided for each detected level. Each extracted level is considered as a starting point of a new execution for the RSL. For the running example, two hierarchies are detected: hier1: codEd, city and all. hier2: codEd, city and all.

After the revision of the sale fact, the validation process is required. The checking of the conformity constraints shows three errors: COS, CDC and CLC.

The COS error relates to the existence of a functional dependency between the codDoc identifier of the documents dimension and the codEd identifier of the publisher dimension. To repair this error, the RSD is initially executed to provide all combinations of the orthogonal dimension. The designer then needs to analyze sale by documents and publisher. Accordingly, and considering that codDoc determines codEd, the RSL is re-executed to provide a new hierarchy belonging to documents, with codDoc being used as a starting point. The proposed hierarchy must contain codEd as a level. The repairing solutions are:

Fact dimensions: documents, period, and bookstore.

The hierarchies in the documents dimension are:

hierarchy 1: codDoc, codEd, city and all.

hierarchy 2: codDoc, codEd, country and all.

Concerning the CDC error, the publisher must be eliminated or replaced because the editor relationship, having a name similar the publisher, cannot be a dimension for the sale fact. The RSD is, therefore, re-executed to provide a new dimension that substitutes publisher. The proposed solution eliminates the publisher dimension or replaces it by the bookstore dimension.

The third error, CLC, refers to the fact that price cannot be considered as a level. After the RSL is executed with codDoc taken as a starting point, the proposed solutions are:

hierarchy 1: codDoc, codEd, city and all.

hierarchy 2: codDoc, codEd, country and all.

hierarchy 3: codDoc, specialty and all.

Among the results of the running example of summarizability verification, the following message is returned:

Warning—completeness (successor):

Fact: sale

Dimension: documents

Hierarchy: docAut

Level: codDoc

Next level : codAut

Number: 2, 6 %

2, 6 % of codDoc instances have no successors (codAut) in the hierarchy docAut of the documents dimension of the sale fact. Thereafter, the most accurate results on the measures of sale are those relative to codDoc. According to codAut, certain measures are probably not included

For the running example, the following warning message is provided:

Warning—domain constraint

Fact: sale

Dimension: documents

Hierarchy: docAut

Level: specialty

Constraint: specialty = “chemistry”

The domain restrictions and the sets of the level instances is empty

The message indicates that the domain constraint integrated by the designer implies an empty set as an intersection between the domain restriction and the set of the city level instances.

5 Evaluation

To evaluate our approach, we developed a tool supporting the automatic V&V process. To guide the designers during error revisions, our tool provides error messages and repair solutions. The warning messages are generated to inform the designers about the result qualities of the future cube. To evaluate experimentally our approach, we adopted a black box testing approach to evaluate: (1) its capacity in detecting errors and warnings, and (2) the quality of its repair solution reports. This black box methodology was adopted to test the structural verification and error detection. It consists in testing some star schemas and comparing the detected errors with the waited ones.

The tests we conducted for the structural constraints demonstrated that our approach is capable of detecting all structural errors. The conformity error detection requires the connection to the data sources and it depends on the threshold of the semantic similarity. For a semantic similarity threshold near 1, the validation process becomes stricter; that is, the elements from the star schema whose names do not match (exact or synonym) names in the source schema are considered as nonconform. Reducing this threshold, the number of elements considered as not conform and having similar elements in the source is reduced. The best results were found by fixing the similarity threshold to 0.2. We determined this value experimentally on a set of 150 name pairs we collected from the literature and student projects. After applying our similarity measure while varying the threshold, we manually analyzed the results to identify the threshold that chose the words express the closest meaning. The source schemas used in our tested validation process include Media-planning [32], Faculty members load [10], the bookstore, etc. Table 8 shows samples of name pairs. In this list, we cannot replace student by teacher or sale by buying, which would be allowed with a threshold less than 0.2.
Table 8

Example of similar words

w

wi

Similarity threshold

Customer

Client

0.90

Supplier

Provider

0.94

Command

Order

0.45

Sale

Buying

0.11

Editor

Publisher

0.23

Student

Teacher

0.16

The warning messages are the results of the semantic constraint verification process. To test these constraints, we constructed source databases for the above-mentioned case studies, that contain a limited number of records ranging from 100 to 3000 records. Comparing the provided results with the multiplicity between the entities in the conceptual schemas, we obtain coherent results.

To check the repair solutions, we tested each search rule in some of its worst-case and intermediate scenarios. For the intermediate case, the example presented in the previous section illustrates the result quality. For the worst-case scenarios, because the behavior of our approach is similar to the bottom-up approach, we compared our results with those generated by this type of approach. The methods in this approach extract facts based on two criteria: the existence of a numeric attribute and/or the cardinality of the association between entities. The methods based on the first criterion can extract facts with no dimensions, and those based only on the cardinality may extract a fact with no numerical attribute (assuming the count as a default measure). Our validation approach takes in consideration both criteria.

When we applied several bottom-up and automatic methods, the process of dimension extraction provides all possible dimensions without respecting the orthogonality constraint. Using our approach, the worst-case scenarios of RSD execution is when the drawn fact is not associated with any dimension. The number of dimensions by fact is the same. However, our approach offers all possible combinations of orthogonal dimensions. From a given level, RSL detects all possible levels that depend functionally and directly on the given level and that respect the level conformity constraint. Recursively, RSL returns all possible hierarchies.

Compared to other approaches, our approach is characterized by its interactivity. In fact, guided by the provided repairing solutions, the designers can repair their schemas according to their needs. In summary, the approach presented in this work complements all three types of design approaches by explicitly verifying all types of constraints in an automated way. Besides its automation, this approach has two main merits. The first merit relates to its ability to integrate the semantic similarity computation [18] to solve the problem pertinent to the difference between the names used in the design of the star schema and the names existing in the data source schema. The second merit pertains to the solutions proposed to correct a schema based on the errors provided by the check module and hence avoiding its redesign.

6 Conclusion

In this paper, we proposed an approach for the design of valid multidimensional schemas. The approach has three distinct features: it handles structural, conformity, and semantic constraints; it uses linguistic similarity computation in the validation process; and it provides designers with possible repair solutions. It is advantageous compared to existing mixed approaches where designers cannot intervene after the need definition phase. Our approach is interactive: In error cases, a complete report is returned to the decision maker wherein errors are indicated and possible repair solutions are offered. To provide the designer, who can be non-expert in formal methods, with the means needed for their schema checking, we developed an automatic multidimensional V&V framework. Moreover, our framework provides designers with three message types: error reports, repair solutions, and warning messages. (It is also worth noting that our repair solution proposition approach can be extended to handle domain-specific constraints as illustrated in [37].) Finally, the input of our approach is a graphical notation. Thereby, the use of our framework does not require the expertise of a specific language in contrast with other works. For example, the use of the approaches described in [32, 42] require an expert in UML and the use of approaches described in [21, 33] require an expert in SQL.

Considering the promising potential of warning messages, future studies, some of which are underway in our laboratory, are needed to further investigate the potential benefits of these results in the context of extract-transform-load procedure definition.

Footnotes

  1. 1.

    Information content-based approach quantifies the similarity between concepts as a function of the information content (IC) that both concepts have in common in a given ontology. The basic idea is that general and abstract entities found in a discourse present less IC than more concrete and specialized ones.

  2. 2.

References

  1. 1.
    Abdelhédi, F., Ravat, F., Teste, O., Zurfluh, G.: Selfstar: un système interactif pour la construction de schémas multidimensionnels. In: INFORSID, pp. 335–350 (2011)Google Scholar
  2. 2.
    Abelló, A., Samos, J., Saltor, F.: Yam2: a multidimensional conceptual model extending UML. Inf. Syst. 31(6), 541–567 (2006)CrossRefGoogle Scholar
  3. 3.
    Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A.: A study on similarity and relatedness using distributional and wordnet-based approaches. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL’09), pp. 19–27. Association for Computational Linguistics, Stroudsburg (2009)Google Scholar
  4. 4.
    Alhajj, R.: Extracting the extended entity-relationship model from a legacy relational database. Inf. Syst. 28(6), 597–618 (2003)CrossRefzbMATHGoogle Scholar
  5. 5.
    Bargui, F., Ben-Abdallah, H., Feki, J.: A hybrid approach for data mart schema design from NL-OLAP requirements. In: Proceedings of the 14th International Conference on Applications of Natural Language to Information Systems (NLDB’09), pp. 295–296. Springer, Berlin (2010)Google Scholar
  6. 6.
    Battaglia, A., Golfarelli, M., Rizzi, S.: Qbx: a case tool for data mart design. In: Troyer, O., Bauzer Medeiros, C., Billen, R., Hallot, P., Simitsis, A., Mingroot, H. (eds.) Advances in Conceptual Modeling. Recent Developments and New Directions. Lecture Notes in Computer Science, vol. 6999, pp. 358–363. Springer, Berlin (2011)Google Scholar
  7. 7.
    Boulil, K., Pinet, F., Bimonte, S., Carluer, N., Lauvernet, C., Cheviron, B., Miralles, A., Chanet, J.P.: Guaranteeing the quality of multidimensional analysis in data warehouses of simulation results: application to pesticide transfer data produced by the macro model. Ecol. Inf. 16, 41–52 (2013)CrossRefGoogle Scholar
  8. 8.
    Cabibbo, L., Torlone, R.: The design and development of a logical system for olap. Lect. Notes Comput. Sci. 1874, 1–10 (2000)CrossRefGoogle Scholar
  9. 9.
    Carpani, F., Ruggia, R.: An integrity constraints language for a conceptual multidimensional data model. In: Proceedings of XIII International Conference on Software Engineering and Knowledge Engineering (SEKE), vol. 1 (2001)Google Scholar
  10. 10.
    Feki, J., Hachaichi, Y.: Une démarche et un outil. J. Decis. Syst. 16(3), 303–333 (2007)CrossRefGoogle Scholar
  11. 11.
    Franconi, E., Kamble, A.: The GMD data model for multidimensional information: a brief introduction. In: Data Warehousing and Knowledge Discovery, pp. 55–65. Springer, Berlin (2003)Google Scholar
  12. 12.
    Ghozzi, F., Ravat, F., Teste, O., Zurfluh, G.: Contraintes pour modèle et langage multidimensionnels. Ingénierie des Systèmes d’Information 9(1), 9–34 (2004)CrossRefGoogle Scholar
  13. 13.
    Golfarelli, M., Rizzi, S.: A methodological framework for data warehouse design. In: Proceedings of the 1st ACM International Workshop on Data Warehousing and OLAP, pp. 3–9. ACM, New York (1998)Google Scholar
  14. 14.
    Golfarelli, M., Rizzi, S., Saltarelli, E.: Wand: a case tool for workload-based design of a data mart. In: 10th National Convention on Systems Evolution for Data Bases, pp. 422–426. Citeseer (2002)Google Scholar
  15. 15.
    Gupta, R., Gosain, A.: Validating data warehouse quality metrics using PCA. In: ICDEM, pp. 170–172 (2010)Google Scholar
  16. 16.
    Hachaichi, Y., Feki, J.: An automatic method for the design of multidimensional schemas from object oriented databases. Int. J. Inf. Technol. Decis. Mak. 12(06), 1223–1259 (2013)CrossRefzbMATHGoogle Scholar
  17. 17.
    Hachaichi, Y., Feki, J., Ben-Abdallah, H.: Modélisation multidimensionnelle de documents xml centrés-données. J. Decis. Syst. 19(3), 313–345 (2010)CrossRefGoogle Scholar
  18. 18.
    Hadj Taieb, M.A., Ben Aouicha, M., Ben Hamadou, A.: A new semantic relatedness measurement using WordNet features. Knowl. Inf. Syst. 41(2), 467–497 (2014). doi: 10.1007/s10115-013-0672-4
  19. 19.
    Hurtado, C.A., Gutierrez, C., Mendelzon, A.O.: Capturing summarizability with integrity constraints in olap. ACM Trans. Database Syst. (TODS) 30(3), 854–886 (2005)CrossRefGoogle Scholar
  20. 20.
    Hurtado, C.A., Mendelzon, A.O.: Olap dimension constraints. In: Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 169–179. ACM, New York (2002)Google Scholar
  21. 21.
    Jovanovic, P., Romero, O., Simitsis, A., Abelló, A.: Ore: an iterative approach to the design and evolution of multi-dimensional schemas. In: Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP (DOLAP’12), pp. 1–8. ACM, New York (2012)Google Scholar
  22. 22.
    Kimball, R.: The Data Warehouse Toolkit. Wiley, New York (2006)Google Scholar
  23. 23.
    Kumar, M., Gosain, A., Singh, Y.: On completeness and traceability metrics for data warehouse requirements engineering. Int. J. Comput. Syst. Eng. 1(4), 229–237 (2013)CrossRefGoogle Scholar
  24. 24.
    Leacock, C., Chodorow, M.: Combining Local Context and WordNet Similarity for Word Sense Identification, vol. 11, pp. 265–283. The MIT Press, USA (1998)Google Scholar
  25. 25.
    Lechtenbörger, J., Vossen, G.: Multidimensional normal forms for data warehouse design. Inf. Syst. 28(5), 415–434 (2003)CrossRefzbMATHGoogle Scholar
  26. 26.
    Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, pp. 296–304. Morgan Kaufmann, San Francisco (1998)Google Scholar
  27. 27.
    Luján-Mora, S., Trujillo, J., Song, I.Y.: A uml profile for multidimensional modeling in data warehouses. Data Knowl. Eng. 59(3), 725–769 (2006)CrossRefGoogle Scholar
  28. 28.
    Mazón, J.N., Lechtenbörger, J., Trujillo, J.: A survey on summarizability issues in multidimensional modeling. Data Knowl. Eng. 68(12), 1452–1469 (2009)CrossRefGoogle Scholar
  29. 29.
    Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Lang. Cogn. Process. 6(1), 1–28 (1991)CrossRefGoogle Scholar
  30. 30.
    Moody, D.L., Kortink, M.A.: From enterprise models to dimensional models: a methodology for data warehouse and data mart design. DMDW00, Sweden 5 (2000)Google Scholar
  31. 31.
    Nebot, V., Berlanga, R., Prez, J., Aramburu, M., Pedersen, T.: Multidimensional integrated ontologies: a framework for designing semantic data warehouses. In: Spaccapietra, S., Zimnyi, E., Song, I.Y. (eds.) Journal on Data Semantics XIII. Lecture Notes in Computer Science, vol. 5530, pp. 1–36. Springer, Berlin (2009)Google Scholar
  32. 32.
    Prat, N., Akoka, J., Comyn-Wattiau, I.: A UML-based data warehouse design method. Decis. Support Syst. 42(3), 1449–1473 (2006)CrossRefGoogle Scholar
  33. 33.
    Romero, O., Abelló, A.: Multidimensional design by examples. In: Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWaK’06), pp. 85–94. Springer, Berlin (2006)Google Scholar
  34. 34.
    Romero, O., Abelló, A.: Automatic validation of requirements to support multidimensional design. Data Knowl. Eng. 69(9), 917–942 (2010)CrossRefGoogle Scholar
  35. 35.
    Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Commun. ACM 8(10), 627–633 (1965)CrossRefGoogle Scholar
  36. 36.
    Salem, A., Jedidi, F.G., Ben-Abdallah, H.: Validation formelle de schma multidimensionnel vis vis de sa source. In: Ben-Abdallah, H., Feki, J. (eds.) EDA, RNTI, vol. B-6, pp. 137–151. Cpadus (2010)Google Scholar
  37. 37.
    Salem, A., Triki, S., Ben-Abdallah, H., Harbi, N., Boussaid, O.: Verification of security coherence in data warehouse designs. In: TrustBus, pp. 207–213 (2012)Google Scholar
  38. 38.
    Serrano, M.A., Calero, C., Piattini, M.: Experimental validation of multidimensional data models metrics. In: HICSS, p. 327 (2003)Google Scholar
  39. 39.
    Serrano, M.A., Trujillo, J., Calero, C., Piattini, M.: Metrics for data warehouse conceptual models understandability. Inf. Softw. Technol. 49(8), 851–870 (2007)CrossRefGoogle Scholar
  40. 40.
    Soussi, A., Feki, J., Gargouri, F.: Approche semi-automatisée de conception de schémas multidimensionnels valides. In: EDA, pp. 71–90 (2005)Google Scholar
  41. 41.
    Thenmozhi, M., Vivekanandan, K.: A tool for data warehouse multidimensional schema design using ontology. Int. J. Comput. Sci. Issues (IJCSI) 10(2), 161–168 (2013)Google Scholar
  42. 42.
    Tria, F.D., Lefons, E., Tangorra, F.: Hybrid methodology for data warehouse conceptual design by UML schemas. Inf. Softw. Technol. 54(4), 360–379 (2012)CrossRefGoogle Scholar
  43. 43.
    Tsois, A., Karayannidis, N., Sellis, T.: Mac: conceptual data modeling for olap. In: Proceedings of the International Workshop on DMDW, pp. 28–55. Citeseer (2001)Google Scholar
  44. 44.
    Yeh, D., Li, Y., Chu, W.: Extracting entity-relationship diagram from a table-based legacy database. J. Syst. Softw. 81(5), 764–771 (2008)Google Scholar
  45. 45.
    Zhao, J., Schewe, K.D., Koehler, H.: Dynamic data warehouse design with abstract state machines. J. Univers. Comput. Sci. 15(1), 355–397 (2009)zbMATHGoogle Scholar

Copyright information

© The Author(s) 2015

Open AccessThis article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Authors and Affiliations

  1. 1.Mir@cl LaboratoryUniversity of SfaxSfaxTunisia
  2. 2.King Abdulaziz UniversityJeddahKSA

Personalised recommendations