Enhancing Cubes with Models to Describe Multidimensional Data

The Intentional Analytics Model (IAM) has recently been envisioned as a new paradigm to couple OLAP and analytics. It relies on two basic ideas: (i) letting the user explore data by expressing her analysis intentions rather than the data she needs, and (ii) returning enhanced cubes, i.e., multidimensional data annotated with knowledge insights in the form of interesting model components (e.g., clusters). In this paper we contribute a proof-of-concept for the IAM vision by delivering an end-to-end implementation of describe, one of the five intention operators introduced by IAM. Among the research challenges left open in IAM, those we address are (i) automatically tuning the size of models (e.g., the number of clusters), (ii) devising a measure to estimate the interestingness of model components, (iii) selecting the most effective chart or graph for visualizing each enhanced cube depending on its features, and (iv) devising a visual metaphor to display enhanced cubes and interact with them. We assess the validity of our approach in terms of user effort for formulating intentions, effectiveness, efficiency, and scalability.


Introduction
Data warehousing and OLAP (On-Line Analytical Processing) have been progressively gaining a leading role in enabling business analyses over enterprise data since the early 90's. During these thirty years, the underlying technologies have evolved from the early relational implementations (still widely adopted in corporate environments), to the new architectures solicited by Business Intelligence 2.0 scenarios, and up to the challenges posed by the integration with big data settings. However, it has recently become more and more evident that the OLAP paradigm alone is no longer sufficient to keep pace with the increasing needs of new-generation decision makers. Indeed, the enormous success of machine learning techniques has consistently shifted the interest of corporate users towards more sophisticated analytical applications (Popovic et al. 2018; Schuff et al. 2018). In addition, recent research envisions cross-cutting data management, analytics, and artificial intelligence in various sectors, such as applied data science (Chiusano et al. 2021), behavioral research (Motiwalla et al. 2019) and social impact (Gupta et al. 2018).
In this direction, the Intentional Analytics Model (IAM) has been envisioned as a way to tightly couple OLAP and analytics. As sketched in Fig. 1, the IAM approach relies on two major cornerstones: (i) the user explores the data space by expressing her analysis intentions rather than by explicitly stating what data she needs, and (ii) in return she receives both multidimensional data and knowledge insights in the form of annotations of interesting subsets of data.
As to (i), five intention operators are proposed, namely, describe (describes one or more cube measures, possibly focused on one or more level members), assess (judges one or more cube measures with reference to some baseline), explain (reveals some hidden information in the data the user is observing, for instance in the form of a correlation between two measures), predict (shows data not in the original cubes, derived for instance with regression), and suggest (shows data similar to those the current user, or similar users, have been interested in). As to (ii), first-class citizens of the IAM are enhanced cubes, defined as multidimensional cubes coupled with highlights, i.e., sets of cube cells associated with interesting components of models automatically extracted from cubes. Each operator is applied to an enhanced cube and returns a new enhanced cube. To assess the interestingness of model components, a measure based on their significance (expressed in terms of how novel, peculiar, and surprising they are expected to be to the user) is used. Noticeably, having different models automatically computed and evaluated in terms of their interestingness relieves the user from the time-wasting effort of trying different possibilities.
Fig. 1 The IAM approach: the user expresses an intention and receives in return an enhanced cube
Example 1 Let a SALES cube be given, and let the user's intention be
with SALES describe quantity for month = '1997-04' by type using outliers
Firstly, the subset of cells for April 1997 is selected from the SALES cube, aggregated by product type, and projected on measure quantity (in OLAP terms, a slice-and-dice and a roll-up operator are applied). Then, the outliers are found in these cells based on the values of quantity. Finally, a measure of interestingness is computed for the two components obtained (the outlier cells, and the non-outlier ones), and the cells belonging to the component with maximum interestingness (in this case, outlier cells) are highlighted in the results shown to the user (see Fig. 2).
The IAM vision aims at facilitating exploratory analysis by redefining queries and answers, and by providing the user with a declarative language that enables her to specify her analytical intentions. Such a paradigm shift necessarily includes a degree of automation, and a balance is to be sought between the implementation of the analytical intentions and the freedom left to the user to specify them. This raises a number of research challenges, e.g., (i) investigate if there are any other intention operators that should be considered besides the basic ones proposed, and how different operators can be combined; (ii) find techniques for automatically tuning the algorithms that create enhanced cubes by computing models; (iii) devise a measure to estimate the interestingness of model components; (iv) enrich the IAM framework with an approach to select the most effective chart or graph for visualizing each cube depending on its features such as number of dimensions, size, etc.; and (v) devise a visual metaphor for displaying enhanced cubes and interacting with them.
In the direction of providing a proof-of-concept for the IAM vision, the potential of the assess operator has been recently investigated by proposing a syntax, a semantics, and a basic optimization strategy (Francia et al. 2021). The goal of this paper is to take one step forward in the same direction by delivering an end-to-end implementation of the describe operator. Specifically, we address challenge (ii) by experimenting with two techniques to automatically set the number of model components, and challenge (iii) by proposing and validating a new interestingness measure for model components. Notably, this measure is consistent with the multi-facet interestingness scheme introduced by Marcel et al. (2019). The present work gives a precise and motivated definition for both the facets used and the way they are aggregated to form a global score. We also address challenges (iv) and (v), by proposing a visualization that couples text-based representations and selected graphical representations with a component-driven interaction paradigm. In this way, the user will save the time required to try different visualizations; besides, by automatically selecting the most suitable charts based on the features of each cube, we discourage the user from adopting inappropriate visualizations which might lead her to wrong interpretations of data.
This paper significantly extends our previous work (Chédin et al. 2020) in different ways:
- Cube schemata are defined in more general terms, allowing branches in hierarchies rather than only allowing linear hierarchies.
- A new definition of interestingness is given based on three different facets of model components: surprise, novelty, and peculiarity.
- The computation of interestingness is generalized to cover situations where an intention changes both the group-by set and the selection predicate of the previous intention, and when there is no roll-up/drill-down relationship between the two group-by sets.
- The syntax of the describe operator has been extended.
- The visualization of enhanced cubes uses two more chart types to give users a more comprehensive and flexible description of data.
- The approach is evaluated through a comprehensive set of tests not only in terms of efficiency, but also of scalability, effectiveness, and formulation complexity.
The paper outline is as follows. After introducing a formalism to manipulate cubes and queries in Section 2, in Section 3 we introduce models, components, and enhanced cubes, and in Section 4 we define an interestingness measure. Then, in Section 5 we show how an intention is transformed into an execution plan, in Section 6 we discuss how to automatically set the model size, i.e., its number of components, and in Section 7 we explain how enhanced cubes are visualized. Section 8 shows the results of the experimental tests we performed to evaluate the approach. Finally, in Section 9 we discuss the related literature, while in Section 10 we draw the conclusion.

Formalities
In this section we introduce the formal notation we will use in the paper to manipulate cubes. We start by defining cube schemata; note that the definitions we give support hierarchies with branches and diamonds.

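Definition 1 (Hierarchy and Cube Schema) A hierarchy is a triple $h = (L_h, \succeq_h, \geq_h)$ where: (i) $L_h$ is a set of levels, each level $l \in L_h$ being coupled with a domain $Dom(l)$ including a set of members; (ii) $(L_h, \succeq_h)$ is a roll-up partial order; and (iii) $(\mathcal{L}, \geq_h)$, where $\mathcal{L} = \bigcup_{l \in L_h} Dom(l)$, is a part-of partial order.
The top level of $h$ is called dimension. The bottom level, denoted $ALL_h$, has a single member $ALL_h$. The part-of partial order is such that, for each pair of levels $l$ and $l'$ such that $l \succeq_h l'$ and for each member $u \in Dom(l)$, there is exactly one member $u' \in Dom(l')$ such that $u \geq_h u'$. A cube schema is a couple $\mathcal{C} = (H, M)$ where: (i) $H$ is a set of hierarchies; (ii) $M$ is a set of measures, each measure $m \in M$ being coupled with an aggregation operator $op(m)$.
The roll-up lattices of the hierarchies in $H$ are shown in Fig. 3 together with an excerpt of the part-of partial order of the customer hierarchy. Intuitively, having customer $\succeq_{Customer}$ gender means that customers can be grouped based on their gender, and having Mary $\geq_{Customer}$ Female means that Mary belongs to the group of females.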
Aggregation is the basic mechanism to query cubes, and it is captured by the following definition of group-by set.
Definition 2 (Group-by Set and Coordinate) Given cube schema $\mathcal{C} = (H, M)$, a group-by set $G$ of $\mathcal{C}$ is a set of levels, at least one from each hierarchy of $H$, such that for each pair of levels $l, l' \in G$ with $l, l' \in L_h$, we have $l \not\succeq_h l'$ and $l' \not\succeq_h l$. The lattice induced on the set of all group-by sets of $\mathcal{C}$ by the roll-up lattices of the hierarchies in $H$ is denoted with $\succeq_H$ and called multidimensional lattice. A coordinate of a group-by set $G$ is a tuple of members, one for each level of $G$. The partial order induced on the set of all coordinates of $\mathcal{C}$ by the part-of partial orders of the members in $H$ is denoted with $\geq$.
Intuitively, given two group-by sets $G$ and $G'$, if $G \succeq_H G'$ ($G$ rolls up to $G'$) then the coordinates of $G$ can be grouped by $G'$; given two specific coordinates of $G$ and $G'$, namely $\gamma$ and $\gamma'$, if $\gamma \geq \gamma'$ ($\gamma$ is part of $\gamma'$) then $\gamma$ belongs to the group defined by $\gamma'$.
To support the definition of interestingness in Section 4, we need to introduce a further notation to establish a mapping between coordinates of different group-by sets. Given two members $u$ and $u'$ of levels $l$ and $l'$ both belonging to the same hierarchy $h$, we will write $u \sim u'$ when either (i) $l = l'$ and $u = u'$, or (ii) $l \succeq_h l'$ and $u \geq_h u'$, or (iii) $l' \succeq_h l$ and $u' \geq_h u$. Intuitively, this means that there is a directed path in the part-of partial order connecting the two members, so one of them is an ancestor of the other. Given two coordinates $\gamma$ and $\gamma'$ of two group-by sets $G$ and $G'$, we will write $\gamma \sim \gamma'$ when $\forall u \in \gamma, \exists u' \in \gamma' : u \sim u'$. Note that $\gamma \sim \gamma' \Leftrightarrow \gamma' \sim \gamma$.
Example 3 Consider three group-by sets $G_1$, $G_2$, and $G_3$, where $G_1 \succeq_H G_2$ while $G_3$ is incomparable with both $G_1$ and $G_2$ (i.e., the coordinates of $G_3$ cannot be grouped by $G_1$ and $G_2$, and vice versa). $G_1$ aggregates sales by date, product type, and store country, $G_2$ by month and category, $G_3$ by year, gender, age range, category, and country. A small excerpt of the multidimensional lattice is shown in Fig. 4. Examples of coordinates of the three group-by sets are, respectively,
$\gamma_1 = \langle$1997-04-15, AllCustomers, Fresh Fruit, Italy$\rangle$
$\gamma_2 = \langle$1997-04, AllCustomers, Fruit, AllStores$\rangle$
$\gamma_3 = \langle$1997, Female, [30-39], Fruit, France$\rangle$
where $\gamma_1 \geq \gamma_2$ (meaning that $\gamma_1$ is part of $\gamma_2$), while $\gamma_3$ is incomparable in the part-of partial order with both $\gamma_1$ and $\gamma_2$ (meaning that none of them is part of the other). We also have $\gamma_1 \sim \gamma_2$ (because, for all levels, members are either the same, as for AllCustomers, or one is an ancestor of the other, as 1997-04 for 1997-04-15), $\gamma_1 \not\sim \gamma_3$ (because Italy is incomparable with France, i.e., neither is an ancestor of the other), and $\gamma_2 \sim \gamma_3$.
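The ancestor-based mapping between coordinates can be illustrated with a minimal Python sketch. It assumes that the part-of order is stored as a member-to-parents dictionary; the dictionary `PARENTS`, the member AllProducts, and the function names are hypothetical and only cover the members mentioned in Example 3.

```python
# Hypothetical excerpt of the part-of order: each member maps to its direct parents.
PARENTS = {
    "1997-04-15": ["1997-04"],
    "1997-04": ["1997"],
    "1997": ["AllDates"],
    "Fresh Fruit": ["Fruit"],
    "Fruit": ["AllProducts"],
    "Italy": ["AllStores"],
    "France": ["AllStores"],
    "Female": ["AllCustomers"],
    "[30-39]": ["AllCustomers"],
}

def ancestors(member):
    """Return all members reachable from `member` by following part-of edges."""
    seen, stack = set(), [member]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def members_related(u, v):
    """True if the two members coincide or one is an ancestor of the other."""
    return u == v or v in ancestors(u) or u in ancestors(v)

def coords_related(g1, g2):
    """True if every member of g1 is related to at least one member of g2."""
    return all(any(members_related(u, v) for v in g2) for u in g1)

g1 = ("1997-04-15", "AllCustomers", "Fresh Fruit", "Italy")
g2 = ("1997-04", "AllCustomers", "Fruit", "AllStores")
g3 = ("1997", "Female", "[30-39]", "Fruit", "France")
print(coords_related(g1, g2), coords_related(g1, g3), coords_related(g2, g3))
```

With these three coordinates, the first and third pairs are related while the second is not, because Italy and France share an ancestor but neither is an ancestor of the other.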
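The instances of a cube schema are called cubes and are defined as follows:
Definition 3 (Cube) A cube over $\mathcal{C}$ is a triple $C = (G_C, M_C, \omega_C)$ where: (i) $G_C$ is a group-by set of $H$; (ii) $M_C \subseteq M$; (iii) $\omega_C$ is a partial function that maps some coordinates of $G_C$ to a numerical value for each measure $m \in M_C$.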
The function is partial since cubes are normally sparse: not all possible business events actually occur, and a coordinate participates in the function only if the event it describes took place. Each coordinate $\gamma$ that participates in $\omega_C$, with its associated tuple $t$ of measure values, is called a cell of $C$ and denoted $\langle \gamma, t \rangle$. With a slight abuse of notation, we will also consider a cube as the set of the coordinates corresponding to its cells, so we will write $\gamma \in C$ to state that $\langle \gamma, t \rangle$ is a cell of $C$.
A cube whose group-by set G C includes all and only the dimensions of the hierarchies in H and such that M C = M, is called a base cube, the others are called derived cubes. In OLAP terms, a derived cube is the result of either a roll-up, a slice-and-dice, or a projection made over a base cube; this is formalized as follows.
Definition 4 (Cube Query) A query over cube schema $\mathcal{C}$ is a triple $q = (G_q, P_q, M_q)$ where: (i) $G_q$ is a group-by set of $H$; (ii) $P_q$ is a (possibly empty) set of selection predicates, each expressed over one level of $H$ using either a comparison operator ($=$, $\geq$, etc.) or the set inclusion operator (e.g., country in {Italy, France}); (iii) $M_q \subseteq M$.
Let $C_0$ be a base cube over $\mathcal{C}$. The result of applying $q$ to $C_0$ is a derived cube $C = q(C_0)$ such that (i) $G_C = G_q$, (ii) $M_C = M_q$, and (iii) $\omega_C$ assigns to each coordinate $\gamma \in C$ satisfying the conjunction of the predicates in $P_q$ and to each measure $m \in M_C$ the value computed by applying $op(m)$ to the values of $m$ for all the coordinates $\gamma_0$ of $C_0$ such that $\gamma_0 \geq \gamma$.
Example 4 The cube query over SALES used in Example 1 is $q = (G_q, P_q, M_q)$ where $G_q$ = {allDates, allCustomers, type, allStores}, $P_q$ = {month = '1997-04'}, and $M_q$ = {quantity}. A cell of the resulting cube $q(SALES_0)$ (where $SALES_0$ is the base cube) is $\langle$AllDates, AllCustomers, Canned Fruit, AllStores$\rangle$ with associated value 138 for quantity.
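As a rough illustration of how a cube query selects, groups, and aggregates base cells, consider the following Python sketch. It assumes that base cells are stored as denormalized dictionaries carrying the members of all the levels of interest, that predicates are equality-only, and that the aggregation operator is a sum; the function `apply_query` and the sample rows are hypothetical and not taken from the SALES cube.

```python
from collections import defaultdict

def apply_query(base_cells, group_by, predicates, measure, op=sum):
    """Apply selection, grouping, and aggregation to a list of base cells.

    group_by   : list of level names forming the group-by set
    predicates : dict level -> required member (equality predicates only)
    measure    : name of the measure to aggregate with `op`
    """
    groups = defaultdict(list)
    for cell in base_cells:
        if all(cell[level] == value for level, value in predicates.items()):
            coordinate = tuple(cell[level] for level in group_by)
            groups[coordinate].append(cell[measure])
    # The result is sparse: only coordinates with at least one base cell appear.
    return {coord: op(values) for coord, values in groups.items()}

# Illustrative base cells (made-up values).
BASE = [
    {"date": "1997-04-15", "month": "1997-04", "type": "Canned Fruit", "quantity": 100},
    {"date": "1997-04-20", "month": "1997-04", "type": "Canned Fruit", "quantity": 38},
    {"date": "1997-04-20", "month": "1997-04", "type": "Fresh Fruit", "quantity": 50},
    {"date": "1997-05-02", "month": "1997-05", "type": "Canned Fruit", "quantity": 70},
]

result = apply_query(BASE, ["type"], {"month": "1997-04"}, "quantity")
print(result)
```

Levels rolled up to their ALL member are simply omitted from `group_by` in this sketch, since they contribute a single constant member to every coordinate.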

Enhancing Cubes with Models
Models are concise, information-rich knowledge artifacts (Terrovitis et al. 2007) that represent relationships hiding in the cube cells. The possible models range from simple functions and measure correlations to more elaborate techniques such as decision trees, clusterings, etc. A model is bound to (i.e., is computed over the levels/measures of) one cube, and is made of a set of components (e.g., a clustering model is made of a set of clusters). In the IAM, a relevant role is taken by data-to-model mappings. Indeed, a model partitions the cube on which it is computed into two or more subsets of cells, one for each component (e.g., the subsets of cells belonging to each cluster).

Definition 5 (Model and Component) A model is a tuple $M = (t, alg, C, In, Out, \mu)$ where:
(i) $t$ is the model type; (ii) $alg$ is the algorithm used to compute $Out$; (iii) $C$ is the cube to which $M$ is bound; (iv) $In$ is the tuple of levels/measures of $C$ and parameter values supplied to $alg$ to compute $M$; (v) $Out$ is the set of components that make up $M$; (vi) $\mu$ is a function mapping each coordinate of $C$ to one component of $Out$.
Each model component is a tuple of a component identifier plus a variable number of properties that describe that component.
In the scope of this work, it is $t \in$ {top-k, bottom-k, skyline, outliers, clustering}. The components for these model types are as follows:
1. For t = top-k, there are two components: one for top-k cells, one for the others (similarly for bottom-k). Each component is described by the average z-score of its cells.
2. For t = skyline, there are two components: one for the cells in the skyline, one for the others. Each component is described by the average z-score of its cells. To compute the skyline, we resort to the algorithm proposed by Chomicki et al. (2003).
3. For t = outliers, there are two components: one for outlier cells, one for the others. Each component is described by its outlierness. To compute outliers, we adopt the isolation forest algorithm (Liu et al. 2008).
4. For t = clustering, there is one component for each cluster. Each component is described by the centroid of the corresponding cluster. To compute clustering we resort to the well-known k-means algorithm.
The model types listed above are suggested in the original proposition of the IAM as those that best meet the goal of describing a cube. Other effective model types are not taken into account here because they were considered to better meet the goals of other intentional operators (e.g., correlation and regression are used to explain, time-series decomposition and auto-regression to predict). We also note that the properties mentioned for each model type are not meant to be exhaustive.
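To make the notions of component and data-to-model mapping more concrete, here is a minimal Python sketch of a top-k model over a single-measure cube. It assumes that the cube is a dictionary from coordinates to measure values and that z-scores use the population standard deviation; the function names and the component labels "top" and "others" are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(cube):
    """z-score of each cell over the whole cube (population std is an assumption)."""
    values = list(cube.values())
    mu, sigma = mean(values), pstdev(values)
    return {g: 0.0 if sigma == 0 else (v - mu) / sigma for g, v in cube.items()}

def top_k_model(cube, k):
    """Return (components, mapping) for a top-k model over a single-measure cube."""
    z = z_scores(cube)
    ranked = sorted(cube, key=cube.get, reverse=True)
    top = set(ranked[:k])
    # Data-to-model mapping: each coordinate is assigned to exactly one component.
    mapping = {g: ("top" if g in top else "others") for g in cube}
    components = {}
    for name in ("top", "others"):
        cells = [g for g in cube if mapping[g] == name]
        if cells:
            # Each component is described by the average z-score of its cells.
            components[name] = {"avg_z": mean(z[g] for g in cells)}
    return components, mapping

cube = {"a": 10, "b": 20, "c": 30, "d": 100}
components, mapping = top_k_model(cube, k=1)
print(mapping, components)
```

A bottom-k model can be obtained in the same way by sorting in ascending order.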
Example 5 A possible model over the derived cube $q(SALES_0)$ in Example 4 is a clustering computed with the k-means algorithm, whose input $In$ includes parameters $n$ and $rndSeed$, where $n$ is the desired number of clusters and $rndSeed$ is the seed to be used by the k-means algorithm to randomly generate the 3 seed clusters. Component $c_1$ is characterized by property centroid with value 76.
As the last step in the IAM approach, cube C is enhanced by associating it with a set of models bound to C and with a highlight, i.e., with the subset of cells corresponding to the most interesting component of the model; these cells are determined via function μ.
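Definition 6 An enhanced cube $E$ is a triple of a cube $C$, a set of models $\{M_1, \ldots, M_r\}$ bound to $C$, and a highlight, i.e., the set of cells of $C$ that belong to the component with the highest interestingness among the components of $M_1, \ldots, M_r$.
How to estimate the interestingness of component $c$, $interest(c)$, is the subject of the next section.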

Estimating the Interestingness of Components
The basic idea of the IAM is that the user will work in sessions, similarly to the OLAP paradigm. Thus, starting from a base cube, the user will write a sequence of intentions; each intention, as explained in Section 5, will determine a cube query which will be applied to C 0 to obtain a derived cube. Now let C 0 be a base cube over schema C, C be the cube obtained by the current intention, M = (t, alg, C, I n, Out, μ) be a model bound to C, and c ∈ Out be one of the components of M.
The measure proposed by Chédin et al. (2020) to assess the interestingness of component $c$ is based on the idea of prior belief (Bie 2013): specifically, it defines the interestingness of $c$ as the difference of belief for corresponding cells in the cube before and after the application of the intention. In this work we develop a more sophisticated model, based on three facets of interestingness identified by Marcel et al. (2019), namely:
- The novelty of $c$, which measures its interestingness with respect to the history of the user with $C_0$. Intuitively, a component has more novelty if it concerns a larger number of previously-unseen cells.
- The peculiarity of $c$, which measures its interestingness with respect to the cells in the cube $C'$ obtained by the previous intention the user has formulated with $C_0$. Concretely, we compare the cells belonging to $c$ to some related cells in $C'$.
- The surprise of $c$, which assesses to what extent $c$ challenges the beliefs the user has formed from the cubes she has previously seen.
Therefore, for each component, we give three scores, one for each interestingness facet. We then define the global interestingness as a linear combination of the three facets. Choosing the weights of each facet enables the user to craft her own interestingness score. For instance, in a typical exploratory OLAP scenario, frequently-seen components may still be seen as interesting by the user, who should then switch off novelty and surprise.
Definition 7 (Interestingness) The interestingness of component $c$ is
$$interest(c) = \alpha_{nov} \cdot nov(c) + \alpha_{pec} \cdot pec(c) + \alpha_{sur} \cdot sur(c)$$
where $nov(c)$, $pec(c)$, and $sur(c)$ denote, respectively, the novelty, peculiarity, and surprise of $c$, and the $\alpha$'s are normalized weights.
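The linear combination of the three facets can be sketched in a few lines of Python. The equal default weights, the function names, and the representation of a component as a (novelty, peculiarity, surprise) triple are assumptions made for illustration only.

```python
def interest(nov, pec, sur, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three facets; weights are assumed to sum to 1."""
    a_nov, a_pec, a_sur = weights
    if abs(a_nov + a_pec + a_sur - 1.0) > 1e-9:
        raise ValueError("weights must be normalized")
    return a_nov * nov + a_pec * pec + a_sur * sur

def most_interesting(components, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return the name of the component with the highest interestingness.

    components: dict name -> (novelty, peculiarity, surprise), each in [0, 1]
    """
    return max(components, key=lambda name: interest(*components[name], weights))

scores = {"c1": (1.0, 1.0, 1.0), "c2": (1.0, 0.5, 1.0)}
print(most_interesting(scores))
```

Setting the weights to (0, 1, 0) corresponds to switching off novelty and surprise, so that components are ranked by peculiarity alone.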

Novelty
To define this score, we assume that the system keeps track of the user's history with C 0 through the set V of all the cubes that the user has computed during her current session on C 0 .
Intuitively, a coordinate is novel if it has never appeared in V and not novel otherwise. The novelty of a component is the average novelty of its coordinates.
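The intuition above can be written as a short Python sketch, assuming that the history $V$ is a list of cubes and that each cube is represented as a set of coordinates; the function name and the sample data are hypothetical.

```python
def novelty(component_coords, history):
    """Fraction of the component's coordinates that never appeared in the history.

    component_coords : iterable of coordinates (tuples of members)
    history          : list of cubes, each a set of coordinates
    """
    seen = set().union(*history)  # all coordinates seen so far in the session
    coords = list(component_coords)
    return sum(1 for g in coords if g not in seen) / len(coords)

history = [{("Fruit",), ("Beer",)}]
print(novelty([("Fruit",), ("Wine",)], history))
```

With an empty history every coordinate is novel, so the novelty of any component is 1.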

Peculiarity
Estimating peculiarity requires first of all defining the concept of "corresponding cell(s)" of each coordinate $\gamma$ of $C$ in the cube $C'$ obtained by the previous intention the user has formulated with $C_0$, which is done through a proxy function as follows. Intuitively, if the intention changes the group-by set, the corresponding coordinate(s) of $\gamma$ are determined via the part-of order; if the intention changes the selection predicate, the corresponding coordinates of $\gamma$ are $\gamma$ itself if it is part of $C'$, the empty set otherwise; if the intention changes the measure, the corresponding coordinates of $\gamma$ are the empty set.
Definition 9 (Proxies) Let $C$ be a cube over cube schema $\mathcal{C}$, and $C'$ be the cube occurring immediately before $C$ in the current session $V$. Let $\gamma$ be a coordinate of $C$, and $m$ be a measure in $C$. The proxies of $\gamma$ for $m$ are denoted $proxy_{C',m}(\gamma)$.
For the first intention in an analysis session, $C'$ is undefined; since in this case the user has no prior belief, we conventionally put $proxy_{C',m}(\gamma) = \emptyset$ for all $\gamma \in C$.
Note that, in OLAP terms, if $C$ is a roll-up of $C'$, the inter-cell mapping defined by the proxy function is many-to-one; if $C$ is a drill-down of $C'$, the mapping is one-to-many; in all other cases (drill-anywhere), the mapping is many-to-many.
Example 6 Consider a sequence of three intentions formulated by the user, whose plans rely on queries $q_1$, $q_2$, and $q_3$. When no level is specified in the by clause for hierarchy $h$, level $ALL_h$ is implicitly assumed. Thus, while the plan generated for the first intention relies on query $q_1 = q$ defined in Example 4, the ones for the second and third intentions rely on $q_2$ and $q_3$ with $G_{q_2}$ = {allDates, gender, category, allStores} and $G_{q_3}$ = {allDates, allCustomers, category, allStores}, respectively (the selection predicates and measures do not change). Let $C_1$, $C_2$, and $C_3$ be the cubes resulting from $q_1$, $q_2$, and $q_3$, respectively. Some of the inter-cell relationships induced by the proxy function are shown by green lines in Fig. 5. Since $C_2$ is a drill-anywhere of $C_1$, the relationship is many-to-many; conversely, since $C_3$ is a roll-up of $C_2$, the relationship here is many-to-one.
We can now define peculiarity (Definition 10) in terms of the proxies of a cell and of function $z_m()$, which returns the z-score of a cell for measure $m$ over the whole cube that the cell belongs to. Intuitively, the z-score captures to what extent the value of a measure for a cell deviates from the measure values for other cells in the cube, and peculiarity compares the z-scores of a cell with those of its proxies. A cell is more peculiar if this difference is higher. The peculiarity of a component is the average peculiarity of its coordinates, normalized by the highest peculiarity value.
Example 7 Consider again the intentions in Example 6. Figure 5 shows the z-score, the novelty, and the peculiarity of each cell of the three cubes. The novelty is 1 for all cells, since in all cases the coordinates are seen for the first time during the session. As to the peculiarity, in C 1 its values are simply the absolute values of the z-scores z m , as per Definition 10 (C 1 is the result of the first intention in the session, so the set of proxies is empty for all coordinates).

Surprise
While novelty describes whether a cell was previously unknown to the user (i.e., not present in V ), surprise assesses whether it challenges the user's previous beliefs (i.e., what the user learned from V ).

Definition 11 (Surprise) Let c be a component of model
Intuitively, a coordinate is more surprising if its members were not frequently seen in V . Hence, we count the number of cubes each member appears in; the surprise of coordinate γ is 0 when all of its members already appeared in all the cubes of V , 1 when all of its members never appeared in V . For the first intention in an analysis session, we set sur(c) = 1 for all components c. The surprise of a component is the average surprise of its coordinates.
Note that novelty and surprise are defined in such a way that a coordinate can be novel and still have a low surprise (if all its members are frequent in $V$) and, conversely, a coordinate can be surprising while not being novel (for instance if it was seen only once and all its members are infrequent in $V$).
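A possible reading of surprise can be sketched in Python as follows. The sketch assumes that the surprise of a coordinate is one minus the average fraction of cubes of $V$ in which its members appear, which matches the two boundary cases described above (0 when all members appear in all cubes, 1 when none ever appeared) but is not necessarily the exact formula of Definition 11; the function names and data are hypothetical.

```python
def coordinate_surprise(coord, history):
    """1 minus the average frequency of the coordinate's members in the history.

    coord   : tuple of members
    history : list of cubes, each a set of coordinates (tuples of members)
    """
    if not history:
        return 1.0  # first intention of the session: everything is surprising

    def frequency(member):
        hits = sum(1 for cube in history if any(member in g for g in cube))
        return hits / len(history)

    return 1.0 - sum(frequency(u) for u in coord) / len(coord)

def surprise(component_coords, history):
    """Average surprise of the coordinates of a component."""
    coords = list(component_coords)
    return sum(coordinate_surprise(g, history) for g in coords) / len(coords)

history = [{("Fruit", "Female")}, {("Beer", "Male")}]
print(coordinate_surprise(("Fruit",), history))
```

Here Fruit appears in one of the two cubes of the history, so the coordinate has surprise 0.5 even though the coordinate itself was never seen, which illustrates how novelty and surprise can diverge.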
Example 8 Consider again the intentions in Example 6. Figure 5 shows the surprise of each cell of the three cubes.
Note that for $C_1$ and $C_2$ all cells have surprise 1, since all the members of their coordinates were never seen before. Conversely, the cells of $C_3$ have surprise 0.5, since each of their members was already seen once within a history of two previous cubes ($|V| = 2$). Now, let $M_2$ be the model of type top-k, with $k = 1$, computed on $C_2$; this model has two components: $c_2^1$, including only the top-1 cell (in red), and $c_2^2$, including all the others. The interestingness values for these two components are $interest(c_2^1) = 1.00$ and $interest(c_2^2) = 0.83$, respectively. So, the enhanced cube $E_2$ resulting from the second intention includes $C_2$, $M_2$, and the highlight $c_2^1$. Finally, let $M_3$ be the top-1 model computed on $C_3$, with components $c_3^1$ (the top-1 cell, in red) and $c_3^2$ (all the other cells). It is $interest(c_3^1) = 0.83$ and $interest(c_3^2) = 0.78$, so the highlight here is $c_3^1$.

Example 9
As an example of computation of interestingness when an intention changes the selection predicate of the previous one, consider the session
with SALES describe quantity for type = 'Beer' by product
with SALES describe quantity for category = 'Beer and Wine' by product
The resulting cubes are shown in Fig. 6. Here, the proxy mapping for the cells included in both cubes is one-to-one; conversely, the cells in $C_2$ that were not present in $C_1$ map to all the cells of $C_1$.

Execution Plans for describe Intentions
The describe operator provides an answer to the user asking "show me my business" by describing one or more cube measures, possibly focused on one or more level members, at some given granularity. A describe intention names a cube $C$ (with clause) and the measures to be described (describe clause), optionally followed by a for, a by, a using, and a size clause, where $m_1, \ldots, m_z \in M$ are measures of $\mathcal{C}$, $P$ is a set of selection predicates each over one level of $H$, $\{l_1, \ldots, l_n\}$ denotes a group-by set of $H$, $t_1, \ldots, t_r$ are model types, and the $k_i$'s are the desired sizes to be applied to the models returned.
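The plan corresponding to a fully-specified intention, i.e., one where all optional clauses have been specified, follows the steps illustrated in Example 1: the cube query $q = (G_q, P_q, M_q)$ determined by the intention is executed on the base cube, a model is computed on the resulting cube $C$ for each model type specified, the interestingness of the model components is computed, and the cells of the most interesting component are highlighted. Partially-specified intentions are interpreted as follows:
- If the for clause has not been specified, we consider $P_q = TRUE$.
- If the by clause has not been specified, we consider $G_q = \{ALL_1, \ldots, ALL_n\}$.
- If the using $t_1, \ldots, t_r$ clause has not been specified, all model types listed in Section 3 are computed over $C$ (the skyline is computed only if $z > 1$, i.e., at least two measures have been specified).
- If the size clause has not been specified for one or more models, the value of $k_i$ is determined automatically as discussed in Section 6.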
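The interpretation of partially-specified intentions can be sketched as a small Python function that fills in the defaults listed above. The function name, the dictionary layout of the resulting plan, and the "auto" placeholder for sizes to be determined automatically are assumptions; the sketch only handles a by clause that is missing altogether, not one that omits individual hierarchies.

```python
ALL_MODEL_TYPES = ["top-k", "bottom-k", "skyline", "outliers", "clustering"]

def complete_intention(measures, hierarchies, for_=None, by=None, using=None, size=None):
    """Fill in the defaults of a partially-specified describe intention.

    measures    : list of measure names (describe clause)
    hierarchies : list of hierarchy names of the cube schema
    for_        : list of selection predicates, or None if the clause is missing
    by          : list of levels, or None if the clause is missing
    using       : list of model types, or None if the clause is missing
    size        : dict model type -> k, for the models whose size is given
    """
    if using is None:
        # All model types; the skyline needs at least two measures.
        using = [t for t in ALL_MODEL_TYPES if t != "skyline" or len(measures) > 1]
    sizes = size or {}
    return {
        "measures": list(measures),
        "predicates": for_ if for_ is not None else [],  # empty set = TRUE
        "group_by": by if by is not None else [f"ALL_{h}" for h in hierarchies],
        "models": [(t, sizes.get(t, "auto")) for t in using],
    }

plan = complete_intention(["quantity"], ["Date", "Customer", "Product", "Store"])
print(plan["group_by"], plan["models"])
```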
Example 10 Consider the following session on the SALES cube: The models computed for the first intention are top-k, bottom-k, clustering, and outliers (computing the skyline for a single measure makes no sense). For the second and the third intentions, a clustering producing 3 clusters and the skyline are computed, respectively.

Setting the Model Size
Our approach to finding the best value for the size parameter k when it is not specified in the intention is based on good practices in hierarchical clustering, especially when single linkage is used, meaning that inter-cluster distance is measured between the two closest points of the clusters. The best separation of clusters can then be found by locating the knee of the evaluation graph of the clustering algorithm, which is a two-dimensional plot where the x-axis is the number of clusters produced and the y-axis is a classical clustering evaluation metric (error, silhouette, etc.) computed for x clusters. In hierarchical clustering, since the cost for merging clusters constantly increases, the evaluation graph often looks like an L-shaped curve with a more or less defined knee. The assumption usually made is that the best merging cost threshold is at the curve knee, where the curve switches from a sharp slope to a slowly decreasing line. We tested two solutions from the literature, namely the L-method (Salvador and Chan 2004) and Kneedle (Satopaa et al. 2011), which have been proposed to find the knee in a curve of discrete data. These methods were compared using 3-dimensional non-random toy datasets specifically created for the experiment with the Scikit-Learn Python package, varying the size (6, 30, and 300 samples) and the shape of clusters, and defining a ground truth. We only report the main findings.
While both methods achieve similarly good results for knee detection, the L-method takes longer to execute and tends to shift the knee on large data sets. This can be seen, for instance, in Fig. 7 on the top-right graph: the correct knee seems to be located at x = 25, but the method returned a knee at x = 62. Since Kneedle is quicker and provides more consistent results, we have adopted it to determine k for clustering (k being the number of clusters), top/bottom-k (where k is the number of points in the first cluster, i.e., the one with higher values), and outliers (where k is the number of points in the first and last cluster).
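To give an intuition of what finding the knee of a discrete curve means, the following Python sketch uses a simplified geometric heuristic: after normalizing both axes, it returns the point farthest from the straight line joining the first and last points of the curve. This is neither the L-method nor Kneedle, only a heuristic in the same spirit; the function name and the sample curve are hypothetical.

```python
def find_knee(xs, ys):
    """Return the x value of the point farthest from the chord of the curve.

    xs must be sorted in increasing order; both axes are rescaled to [0, 1]
    so that the result does not depend on their units.
    """
    x_first, x_last = xs[0], xs[-1]
    y_min, y_max = min(ys), max(ys)
    if x_last == x_first or y_max == y_min:
        return xs[0]  # degenerate curve: no knee to detect
    xn = [(x - x_first) / (x_last - x_first) for x in xs]
    yn = [(y - y_min) / (y_max - y_min) for y in ys]
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    # Distance from the chord, up to a constant factor (the chord length).
    distances = [abs(dx * (yn[0] - y) - (xn[0] - x) * dy) for x, y in zip(xn, yn)]
    return xs[distances.index(max(distances))]

# L-shaped evaluation curve: error for 1..8 clusters (made-up values).
errors = [100, 60, 30, 10, 8, 6, 4, 2]
print(find_knee(list(range(1, 9)), errors))
```

On this curve the error drops sharply up to 4 clusters and flattens afterwards, so the heuristic returns 4.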

Visualizing Enhanced Cubes
In this section we discuss how to provide an effective description of an enhanced cube by coupling text-based representations (a pivot table and a ranked component list) and graphical representations (one or more charts) with an ad-hoc interaction paradigm. The guidelines we adopt to this end are explained below:
(i) For visualization purposes, we assume that an intention can select at most three measures (1 ≤ z ≤ 3) and three group-by levels (1 ≤ n ≤ 3). This is actually not a strong limitation, considering that a visualization of four or more dimensions and/or measures using a single table or chart is hardly interpretable and definitely not intuitive.
(ii) Since we are focusing on intentions aimed at describing data, we believe that providing multiple visualizations from different points of view should be preferred to just picking the "most effective one". Indeed, the effectiveness of a visualization type largely depends on the skills and personal tastes of each user.
(iii) We restrict to considering visualization types that can be easily understood both by lay users and skilled users, and are suitable for multidimensional data.
(iv) Clearly, the effectiveness of a visualization type also depends on the features of the specific dataset. Using an unsuitable visualization can generate confusion and misunderstandings in users, and can lead them to wrong conclusions. Thus, for each intention we visualize only the charts that are recognized to be suitable given the characteristics of the data to be shown.
(v) Models and components play a key role in the IAM approach. Thus, the visualizations we provide aim at showing not only dimension and measure values, but also the different components of a model using a color code. For the same reason, the interaction paradigm should be component-driven.
The visualization we provide for enhanced cube E based on guidelines (ii) and (v) includes three distinct but interrelated areas: a table area that shows the cube cells using a pivot table; a chart area that complements the table area by representing the cube cells through one or more charts; a component area that shows a list of model components sorted by their interestingness. The chart types we consider following guidelines (i) and (iii) are multiple line graphs, radar charts, grouped column charts, heat maps, bubble charts, parallel coordinate charts, and scatter plots. The heuristics we adopt to decide whether or not to use each chart type for a given enhanced cube E (guideline (iv)) were inspired by the work of Golfarelli and Rizzi (2020), where a suitability score is assigned to each chart type depending on the features of the dataset to be visualized. For instance, bubble charts are considered to be suitable to visualize n-dimensional data if the bubble size is mapped to a numerical attribute (such as a measure) and the bubble color is mapped to either a numerical attribute (such as a second measure) or a categorical attribute (such as a model component). Specifically, the features of E we take into account to this end are the number n of dimensions, the number z of measures, and the domain cardinality and type of the dimensions.
The pseudocode is shown in Algorithm 1; it is based on the heuristics described below:
-If E has one dimension d 1 (of temporal type) and one or more measures, draw a multiple line graph using the X axis for d 1 and the Y axis for the measure(s) values (Fig. 8a). Different line colors are used to distinguish the different measures. Markers take the colors of the components of model t, i.e., the model to which the highlight of E belongs.
-If E has one low-cardinality dimension d 1 (of nontemporal type) and one or more measures, draw a radar chart using the angle for d 1 and the radius for measure(s) values (Fig. 8b). Different line colors are used to distinguish the different measures. Markers take the colors of the components of t.
-If E has one dimension d 1 and one or more measures, draw a heat map using the X axis for d 1 and the Y axis for the different measures (Fig. 8c). Measure(s) values are shown using shades of color.
-If E has two low-cardinality dimensions d 1 , d 2 and one measure, draw a grouped column chart using the X axis for d 1 , the Y axis for measure values, and the color for d 2 (Fig. 8d).
-If E has two dimensions d 1 , d 2 and one measure, draw a heat map using the X axis for d 1 , the Y axis for d 2 , and the color shades for measure values.
-If E has two (three) dimensions d 1 , d 2 (d 3 ) and one or two measures, draw a 2D (3D) bubble chart using the X axis for d 1 , the Y axis for d 2 , (the Z axis for d 3 ), and the bubble size for the values of one measure (Fig. 8e). If there is a second measure, its values are shown using shades of color of bubbles; otherwise, bubbles take the colors of the components of t.
-If E has two (three) measures, draw a 2D (3D) scatter plot using the X, Y (Z) axes for the different measures (Fig. 8f). Points take the colors of the components of t.
-If E has three measures, draw a parallel coordinate chart using one coordinate for each measure (Fig. 8g). Lines take the colors of the components of t.
A summary of the chart types used depending on the number of dimensions n and the number of measures z is shown in Table 1.
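The heuristics listed above can be read as a small rule set over n, z, and the features of the dimensions. The following Python fragment is a minimal sketch of that reading, not a transcription of Algorithm 1: the names `Dimension` and `suitable_charts`, the threshold `LOW_CARDINALITY`, and the cardinalities in the usage note are assumptions introduced for illustration, since the text does not state what counts as low cardinality.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold: the text says "low-cardinality" without giving a value.
LOW_CARDINALITY = 12


@dataclass
class Dimension:
    """Hypothetical descriptor of one dimension of an enhanced cube."""
    name: str
    cardinality: int
    temporal: bool = False

    @property
    def low_cardinality(self) -> bool:
        return self.cardinality <= LOW_CARDINALITY


def suitable_charts(dims: List[Dimension], z: int) -> List[str]:
    """Return the chart types suggested by the listed heuristics
    for a cube with dimensions `dims` and `z` measures."""
    n = len(dims)
    charts: List[str] = []
    if n == 1 and z >= 1:
        d1 = dims[0]
        if d1.temporal:
            charts.append("multiple line graph")
        elif d1.low_cardinality:
            charts.append("radar chart")
        charts.append("heat map")
    if n == 2 and z == 1:
        if all(d.low_cardinality for d in dims):
            charts.append("grouped column chart")
        charts.append("heat map")
    if n in (2, 3) and z in (1, 2):
        charts.append("bubble chart")
    if z in (2, 3):
        charts.append("scatter plot")
    if z == 3:
        charts.append("parallel coordinate chart")
    return charts
```

With two dimensions and one measure, and assuming (hypothetically) 24 months and 45 categories, `suitable_charts([Dimension("month", 24, temporal=True), Dimension("category", 45)], z=1)` yields a heat map and a bubble chart, while the grouped column chart is excluded because neither dimension is low-cardinality under the assumed threshold.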
The interaction paradigm we adopt is component-driven (guideline (v)). Specifically, clicking on one component c in the component area leads to emphasizing the corresponding cube cells (i.e., those that map to c via function μ) both in the table area and in the chart area. The highlight is the top component in the list and is selected by default. Following the details-on-demand paradigm (Shneiderman 1996), interaction is enhanced using a tooltip that, when the mouse is positioned on a data point, shows its coordinate, its measure value(s), and the component(s) it belongs to.

Example 11
Figure 9 shows the visualization obtained when the following intention is formulated:
with SALES describe storeCost by month, category
The table area is on the top-left, the chart area on the right, and the component area on the bottom-left. Here n = 2 and z = 1, so a heat map and a bubble chart have been selected (the grouped column chart is not selected due to the high cardinality of the month dimension). The top-interestingness component is a cluster, so a color has been assigned to each component of the clustering model (i.e., to each cluster) and is uniformly used in all three areas. The highlight (in green) is currently selected and is emphasized using a thicker border in all areas. Note that a tooltip with all the details about a single cell is also shown (in yellow).
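The component-driven interaction described above amounts to inverting the function μ for the selected component. The fragment below is a minimal sketch under simplifying assumptions: μ is represented as a plain dictionary from cell coordinates to component names for a single model, and `cells_of` and `default_selection` are hypothetical helper names that are not part of the prototype.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, ...]  # coordinate of a cube cell


def cells_of(component: str, mu: Dict[Cell, str]) -> List[Cell]:
    """Cells to emphasize when `component` is clicked, i.e., those mapped to it by mu."""
    return [cell for cell, comp in mu.items() if comp == component]


def default_selection(interestingness: Dict[str, float]) -> str:
    """The highlight: the component with the highest interestingness."""
    return max(interestingness, key=interestingness.get)


# Toy data, invented for illustration only.
mu = {
    ("1997-01", "Fruit"): "cluster A",
    ("1997-02", "Fruit"): "cluster A",
    ("1997-01", "Meat"): "cluster B",
}
scores = {"cluster A": 0.7, "cluster B": 0.4}
selected = default_selection(scores)
emphasized = cells_of(selected, mu)
```

Here the highlight is "cluster A", so the two Fruit cells would be emphasized in both the table area and the chart area.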

Experimental Tests
In this section we discuss the results of the tests we made to evaluate our approach from four points of view: formulation effort (as compared to the one using plain SQL and Python), effectiveness (as compared to the interestingness measure used by Chédin et al. (2020)), efficiency, and scalability. The prototype implementation we used for the tests uses the simple multidimensional engine described by Francia et al. (2020), which in turn relies on the Oracle 11g DBMS to execute queries on a star schema based on multidimensional metadata (in principle, the prototype could work on top of any other multidimensional engine). The mining models are imported from the Scikit-Learn Python library. Finally, the web-based visualization is implemented in JavaScript and exploits the D3 library for chart visualization. The prototype implementation can be accessed at http://semantic.csr.unibo.it/describe/.

Formulation Effort
The first goal of our experiments is to evaluate the saving in user effort when writing a describe intention over the effort necessary to obtain the same result using plain SQL and Python. To this end we adopt the simple metric proposed by Jain et al. (2016), where the ASCII character length is used as an approximation for the effort it takes to craft a query. 5 For this evaluation we used a simple session including three intentions on the SALES cube, where the by clause is progressively enlarged and all the models are computed. The results are shown in Table 2; for SQL and Python we considered the code generated by our prototype to execute each intention. Remarkably, the total formulation effort using SQL+Python is, for each intention type, almost two orders of magnitude larger than using describe intentions.
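The character-length metric is straightforward to compute. The sketch below measures the intention of Example 11, used here only as sample input; `formulation_effort` is a hypothetical helper name, and the SQL string is an invented stand-in rather than the code generated by the prototype.

```python
def formulation_effort(text: str) -> int:
    """Effort approximated by the number of ASCII characters in the text."""
    return len(text.encode("ascii"))


intention = "with SALES describe storeCost by month, category"

# Invented SQL fragment, for illustration only (not the prototype's output).
sql_fragment = (
    "SELECT month, category, SUM(storeCost) AS storeCost "
    "FROM sales GROUP BY month, category"
)

intention_effort = formulation_effort(intention)  # 48 characters
ratio = formulation_effort(sql_fragment) / intention_effort
```

Note that the SQL fragment above only retrieves the cube cells; the figures in Table 2 also account for the Python code needed to compute the models, which is where most of the gap comes from.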
To gain some insight into the time required to operate manually, we also asked five PhD students in computer science to use Python to manually extract two types of models (outliers and clustering) from a 2000-tuple bidimensional cube. This real-world cube was created from the COVID dataset made available by the European Center for Disease Prevention and Control. 6 Table 3 shows, for each student, her skill in Python (beginner/intermediate/advanced), the time taken for doing the exercise (in minutes), the models she extracted, and the ASCII character length of the Python code she wrote, disregarding the quality of the models extracted. We remark that even skilled students needed quite a long time for extracting both models, and had to write substantial Python programs (even though, in comparison with Table 2, they were asked to compute two models only).

Effectiveness
Our second experimental goal is to assess the effectiveness of our approach. Specifically, we compare the 3-facets interestingness measure as of Definition 7 with the 1-facet measure adopted by Chédin et al. (2020); note that the latter mostly corresponds to peculiarity as of Definition 10. The experimental setting we use here is again that of a real-world cube extracted from the COVID dataset. On this cube we run 20 distinct describe sessions (including exactly 7 intentions each), of which 10 were created manually as done by Outa et al. (2020), and 10 were created with the CubeLoad workload generator (Rizzi and Gallinucci 2014).
6 www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases
To compare the two interestingness measures we compute the highlight coverage of each intention I as follows. Let $C_0$ be the base cube and c be the highlight of I; we define the coverage of c as
$$cov(c) = \frac{|\{\gamma \in C_0 : \exists \gamma' \in \mu^{-1}(c),\ \gamma \succeq \gamma'\}|}{|C_0|}$$
where $\gamma \succeq \gamma'$ denotes that $\gamma$ rolls up to $\gamma'$. Intuitively, the coverage of highlight c is the percentage of cells of $C_0$ that roll up to cells belonging to c. The cumulative highlight coverages at each session step, averaged over all 20 sessions, are reported in Fig. 10 (all α weights in Definition 7 are set to 1/3). Overall, the figure clearly shows that the cumulative coverage of the 3-facets interestingness is higher than the one of the 1-facet interestingness, which means that the enhanced formulation we adopted in this work is more effective in providing diversified highlights over the cube, leading to a more comprehensive exploration. We also noted that the by clause has a major impact on the highlights, i.e., in sessions mainly consisting of roll-ups and drill-downs the two measures of interestingness behave quite similarly since peculiarity is the main driver. On the other hand, the longer the session, the larger the effect of surprise and novelty in ensuring a more diversified coverage.
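The coverage measure can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: cells are represented by their coordinates, the roll-up relationship is given as a function from a base coordinate to the coordinate it rolls up to at the group-by set of the intention, and the cumulative coverage is taken as the fraction of base cells covered by at least one highlight so far (the text does not spell out how coverages are accumulated). The function names and the toy data are invented for illustration.

```python
from typing import Callable, Hashable, List, Sequence, Set

Coordinate = Hashable


def coverage(base_cells: Sequence[Coordinate],
             highlight_cells: Set[Coordinate],
             roll_up: Callable[[Coordinate], Coordinate]) -> float:
    """Fraction of base cells that roll up to a cell of the highlight."""
    if not base_cells:
        return 0.0
    covered = sum(1 for g in base_cells if roll_up(g) in highlight_cells)
    return covered / len(base_cells)


def cumulative_coverage(base_cells: Sequence[Coordinate],
                        highlights: Sequence[Set[Coordinate]],
                        roll_ups: Sequence[Callable[[Coordinate], Coordinate]]) -> List[float]:
    """Assumed accumulation: base cells covered by any highlight up to each step."""
    covered: Set[Coordinate] = set()
    result: List[float] = []
    for highlight, roll_up in zip(highlights, roll_ups):
        covered |= {g for g in base_cells if roll_up(g) in highlight}
        result.append(len(covered) / len(base_cells))
    return result


# Toy base cube with coordinates (date, country).
base = [("2020-03-01", "IT"), ("2020-03-02", "IT"),
        ("2020-03-01", "FR"), ("2020-04-01", "FR")]

to_month_country = lambda g: (g[0][:7], g[1])  # roll up date to month
to_country = lambda g: (g[1],)                 # roll up to country only

step1 = coverage(base, {("2020-03", "IT")}, to_month_country)
session = cumulative_coverage(
    base,
    [{("2020-03", "IT")}, {("FR",)}],
    [to_month_country, to_country],
)
```

In this toy session the first highlight covers the two Italian cells (coverage 0.5), and the second highlight adds the two French cells, bringing the cumulative coverage to 1.0.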

Efficiency
Our third experimental goal is to investigate if the performance of our approach is compatible with the near-real-time requirement of interactive analysis sessions. To this end we populated the SALES cube using the FoodMart data. 7 We reused the 3-intention session introduced in Section 8.1; from the performance point of view this corresponds to considering the worst case, in which all five models are computed on cubes obtained by progressively including in the group-by set the three dimensions with  Table 4 shows the total execution time and its breakdown into the times necessary to query the base cube, to compute the models, to measure the interestingness, and to generate the pivot table returned to the browser. Remarkably, it turns out that at most 18 seconds are necessary to retrieve and visualize an enhanced cube of more than 86000 cells, which is perfectly compatible with the execution time of a normal OLAP query. The table shows that the main cost component is, after model computation, the measurement of interestingness. The most computationally expensive facets are peculiarity and surprise, the former mostly depending on the cube cardinality, the latter increasing with the session length.

Scalability
Our last experimental goal is to evaluate the scalability of our approach. To this end we used the Star Schema Benchmark (SSB) cube, described by four hierarchies; please refer to the work by O'Neil et al. (2009) for the logical schema of the SSB dataset. Specifically, we generated three base SSB cubes, namely $SSB_1$, $SSB_{10}$, and $SSB_{100}$, with different scale factors resulting in the following cardinalities:
$$|SSB_1| = 6 \cdot 10^6, \quad |SSB_{10}| = 6 \cdot 10^7, \quad |SSB_{100}| = 6 \cdot 10^8$$
Note that the cardinality of each cube is equal to the number of tuples in the corresponding fact table. As commonly done in OLAP settings, primary and foreign keys were indexed using B-Trees, and materialized views were created to improve performance. The experiments were focused on three describe intentions similar to those introduced in Section 8.1, i.e., with progressively-enlarged group-by sets. Since the by and for clauses of each describe intention are not changed, scaling up the cardinality of the base cube implies that the cardinality of the resulting cube C also scales up, as shown in Table 5. To reduce the impact of caching, each intention was executed five times on each base cube, and the execution times were averaged. Figure 11 shows, on a logarithmic scale, the times in seconds for executing the three intentions on the three base cubes with increasing cardinalities. When $I_3$ is executed over $SSB_{100}$, yielding as a result a cube with almost 1.5 million cells, the overall time turns out to be about 95 seconds, which is still compatible with the requirements of an interactive analysis session. Of this time, 68 seconds are used to compute the models, and 24 seconds to compute the interestingness.
Fig. 11 Execution times for increasing cardinalities of the base cube
Though the chart shows an exponential trend, which clearly raises some concerns about further scalability, we observe that even dealing with a 1.5M-cells cube should be considered quite unusual in the context of an analysis session.
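The measurement protocol described in this section (five executions per intention, with averaged times) can be sketched as a small timing harness. The code below is only an illustration: `average_time` is a hypothetical helper, and `run_intention` is a placeholder standing in for the execution of a describe intention, which in the prototype involves querying the base cube, computing the models, and measuring interestingness.

```python
import time
from statistics import mean
from typing import Callable, List


def average_time(run: Callable[[], object], repetitions: int = 5) -> float:
    """Execute `run` several times and return the mean wall-clock time in seconds."""
    timings: List[float] = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return mean(timings)


calls = []


def run_intention() -> None:
    """Placeholder workload, invented for illustration."""
    calls.append(sum(range(10_000)))


avg_seconds = average_time(run_intention)
```

Averaging over repeated runs smooths out one-off effects such as a cold cache on the first execution, which is the stated reason for repeating each intention five times.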

Related Work
The idea of coupling data and analytical models was born in the 90's with inductive databases, where data were coupled with patterns meant as generalizations of the data (Raedt 2002). Later on, data-to-model unification was addressed in MauveDB (Deshpande and Madden 2006), which provides a language for specifying model-based views of data using common statistical models. However, achieving a unified view of data and models was still seen as a research challenge in business intelligence a few years later (Pedersen 2009). More recently, Northstar (Kraska 2018) has been proposed as a system to support interactive data science by enabling users to switch between data exploration and model building, adopting a real-time strategy for hyper-parameter tuning. Finally, the coupling of data and models is at the core of the IAM vision, on which this paper relies. The three basic pillars of IAM are (i) the redefinition of a query as expressing the user's intention rather than explicitly declaring what data are to be retrieved, (ii) the extension of query results from plain data cubes to cubes enhanced with models and highlights, and (iii) the characterization of model components in terms of their interestingness to users.
Coupling the OLAP paradigm with data mining, so that concise patterns are extracted from multidimensional data for the user's evaluation, was the goal of some approaches commonly labeled as OLAM (Han 1997). In this context, k-means clustering is used by Bentayeb and Favre (2009) to dynamically create semantically-rich aggregates of facts other than those statically provided by dimension hierarchies. Similarly, the shrink operator is proposed by Golfarelli et al. (2014) to compute small-size approximations of a cube via agglomerative clustering. Other operators that enrich data with knowledge extraction results are DIFF (Sarawagi 1999), which returns a set of tuples that most successfully describe the difference of values between two cells of a cube, and RELAX (Sathe and Sarawagi 2001), which verifies whether a pattern observed at a certain level of detail is also present at a coarser level of detail. Finally, Chen et al. (2005) reuse the OLAP paradigm to explore prediction cubes, i.e., cubes where each cell summarizes a predictive model trained on the data corresponding to that cell. The IAM approach can be regarded as OLAM since, like the approaches mentioned above, it relies on mining techniques to enhance the cube resulting from an OLAP query. However, while each of the approaches above uses one single technique (e.g., clustering) to this end, the IAM leans on multiple mining techniques to give users a wider variety of insights, using the interestingness measure to select the most relevant ones.
In the same direction, Sarawagi (2000) describes a method that profiles the exploration of a user and uses the Maximum Entropy principle to recommend which unvisited parts of the cube can be the most surprising in a subsequent query. The Cinecubes method (Gkesoulis and Vassiliadis 2013;Gkesoulis et al. 2015) aims at providing automated reporting in response to an original OLAP query. The proposed method enriches an original OLAP query with auxiliary queries to aid (a) the comparison and assessment of the result of the query to similar data and (b) the explanation of the result with values at the most detailed level. So, the results of the Cinecubes system can coarsely be grouped as the result of two operators: the first one computes queries for values similar to the ones defining the selection filters of the original query; the second one drills down into the dimensions of the result, one dimension at a time.
The characteristics of the different approaches for visualizing data and interacting with them have been deeply explored in the literature, also with reference to their suitability for datasets with different features and users with varying skills and goals. Börner (2015) surveys the classifications proposed in the literature for visualization types and integrates them into a single comprehensive framework. Abela (2008) proposes a decision tree to select the best visualization according to the user's goal and to the main features of data. More recently, SkyViz, by which our approach is inspired, starts from a visualization context based on seven coordinates for assessing the user's objectives and describing the data to be visualized (Golfarelli and Rizzi 2020). Then it uses skyline-based techniques to translate a visualization context into a set of suitable visualization types and to find the best bindings between the columns of the dataset and the graphical coordinates used by each visualization type.
To the best of our knowledge, though some tools (e.g., Spotfire and Tableau) integrate OLAP and analytics capabilities in the same environment, none of them allows users to formulate queries at a higher level of abstraction than OLAP (as done in the IAM using intentions), nor do they support the automated out-of-the-box enrichment of cubes with insights obtained by analytics (as done in the IAM through enhanced cubes). For instance, Tableau 8 enables OLAP sessions through a drag-and-drop metaphor. First, the user selects the levels and measures in which she is interested. Then, Tableau provides a single visualization based on such levels and measures (no cardinality checks are performed against level domains). Finally, the user can manually add some models (e.g., linear regression) and statistics. Thus, in comparison to the describe operator and the IAM, Tableau does not provide a high-level syntax (i.e., users must explicitly pick levels, measures, and models), an interestingness measure, or multiple visualizations combined with interesting highlights.
As stated in the Introduction, this paper extends our previous work (Chédin et al. 2020)  situations where an intention changes both the group-by set and the selection predicate of the previous intention, and when there is no roll-up/drill-down relationship between the two group-by sets.
-The syntax of the describe operator has been extended by supporting multiple levels in the by clause and by allowing users to specify different sizes for each model.
-The visualization of enhanced cubes uses two more chart types to give users a more comprehensive and flexible description of data.
-The approach is evaluated through a comprehensive set of tests not only in terms of efficiency, but also of scalability, effectiveness, and formulation complexity.

Conclusion
In this paper we have given a proof-of-concept for the IAM vision by delivering an end-to-end implementation of the describe operator, based on a novel measure of interestingness and relying on a visual metaphor to display enhanced cubes. This new measure of interestingness has been shown to be more effective than the one proposed by Chédin et al. (2020) in providing diversified highlights over enhanced cubes. We have also shown that our approach reduces the effort for formulating complex analyses while ensuring that performance is compatible with the near-real-time requirements of interactive sessions. The main directions for future research we wish to pursue are: (i) evaluate the effectiveness of the approach by conducting extensive experiments with real users; (ii) optimize the computation of interestingness, especially for long sessions; and (iii) extend the approach to operate with dashboards of enhanced cubes.
Funding Open access funding provided by Alma Mater Studiorum -Università di Bologna within the CRUI-CARE Agreement.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
His current research focuses on database, OLAP and data warehousing, personalization, recommender systems, exploratory data analysis, and data narration. He authored numerous publications in international conferences and journals on these subjects, including Information Systems, Decision Support Systems, Data and Knowledge Engineering and Knowledge and Information Systems. He served as program committee member in top tier international conferences, including ER, VLDB, EDBT, and chaired the international Workshop on Data Warehousing and OLAP (DOLAP) in 2017 and 2021. He served as guest editor for international journals, including Information Systems and the International Journal of Data Warehousing and Mining. He is a member of the regular editorial board of the international journal Data and Knowledge Engineering.

Verónika Peralta is an Associate Professor at the University of Tours (France) where she is head of the Computer Science department. She received her Ph.D. in 2006 from the University of Versailles (France) and the University of the Republic (Uruguay). Her current research interests include data and information quality, exploratory data analysis, business intelligence and data narration. She has published numerous papers in international refereed journals and conferences in these fields and served as program committee member and guest editor in many international conferences and journals. She has extensive experience in teaching information systems, databases, data warehousing and data quality, and has large professional experience as a data warehouse developer and consultant.
Stefano Rizzi received his Ph.D. in 1996 from the University of Bologna, Italy. Since 2005 he has been Full Professor at the University of Bologna. He has published more than 150 papers in international refereed journals and conferences mainly in the fields of data warehousing, business intelligence, and pattern recognition, and a research book on data warehouse design. He is a member of the steering committee of DOLAP and of the editorial board of the Data and Knowledge Engineering Journal of Elsevier, and has been a member of the steering committee of the ER Conference. He participated in the H2020-ICT-2015 TOREADOR project and in several national research projects contracts with companies. His research interests include data warehouse design and business intelligence, in particular OLAP on NoSQL data, social business intelligence, and analysis services for big data.