Exploratory computing: a comprehensive approach to data sensemaking
- 1.2k Downloads
The Big Data challenge has made the issue of “making sense” of data urgent and unavoidable. This paper introduces exploratory computing (EC), a novel paradigm whose aim is to support a comprehensive “exploratory” experience for the user. “Exploratory” because it supports search and discovery of information through various tasks (investigation, knowledge seeking, serendipitous discovery, comparison of information...) in a dynamic interaction, where meaningful feedbacks from the system play a crucial role, closely resembling a human-to-human dialogue. “Computing” because a complex interaction as the one outlined above requires powerful computational strength for the user to be able to fully profit from, and even enjoy, the interaction. EC is not associated with a predefined set of techniques: Rather, it is an approach that can be concretized in different ways. In the paper, two different implementations of the EC approach are presented, both of which interpret the EC high-level requirements. It is the authors’ hope that others will follow.
KeywordsBig Data Rich Data Exploratory computing Faceted search Data mining Data exploration Intensional query answering
Data are available nowadays like never before, creating a treasure of information in wait for the right time to be put into service. For this reason, the exploration of big (and possibly rich 1) datasets has received much attention lately, in almost all domains of knowledge. This paper is about a comprehensive approach for making sense of data: Exploratory Computing, EC in what follows [1, 2, 3]. The EC approach has at its core the idea that non-expert users (i.e., non-data analysts) can be empowered to perform complex queries within a set of items to understand a phenomenon, answer research questions, and support a decision-taking process etc.
Why “exploratory”? In the definition of Wikipedia, exploration is “the act of searching for the purpose of discovery of information or resources.” The closest analogy is with human-to-human dialogue. In a dialogue, interlocutors continuously query their respective “mental databases” in ways yet unparalleled by the most sophisticated information access systems. By interacting with each other, human beings can investigate, search inspiration, compare, look for specific data, learn, etc., switching easily from one mode to another.
Why “computing”? Because, in order to support an interaction where the user can pursue the above goals and switch from one to the other at her will and in short time, powerful computing capabilities are mandatory. Nobody would stay in a dialogue where minutes are required to get an answer.
In principle, EC can be applied to datasets of various sizes (from small to very large, to Big Data) and of various structural complexities (in terms of variety of attributes and interrelations among them).
Many effective ways for extracting knowledge are already (well) known and reported in the literature, among which querying, search, faceted search, data mining, data exploration and visualization are only few of them. With respect to these, the intent of EC is to offer a comprehensive approach, exploiting all the affordances that are already there—and possibly giving vent to new ones—to assist a user in a process that resembles the interaction between two human beings, in which questions are explorative, possibly imprecise, and answers may be a bit inaccurate but suggestive, conveying an idea that stimulates the interlocutor to further questions and so on.
From the Big Data perspective, the problems posed by the EC vision are diverse and challenging . Indeed, in the database research community, most methods for extracting significant and synthetic knowledge from data are tailored to be used by professional data scientists, who, besides having knowledge of the domain of interest, should also be proficient mathematicians, statisticians and computer engineers at the same time in order to use them effectively. Moreover, normally the analyst concentrates on one analysis method at a time, for example adopting a certain data mining or statistical technique to solve a specific problem.
The aim of EC—that, from the database researcher’s viewpoint, becomes a genre of database exploration —is instead to assist the so-called data enthusiasts : inexperienced, non-computer-savvy users who need the support of a sophisticated system that guides them in the inspection. The result is a reconnaissance path where, at each step, the system presents the users with relevant and concise facts about the data in order for them to understand better and progress to the next exploration step.
The notion of relevance is of paramount importance, since the main system’s task is to call the user attention to relevant (or surprising) differences or similarities between the datasets encountered along the route. This naturally leads us to the need of two basic instruments: On the one hand, we want to extract “the gist” of the data in order to present the user with a synthetic vision of them, and on the other hand, we also need some kind of comparison between facets, for the user to be aware of their mutual differences, and thus get an idea of relevant facts. Moreover, both operations must be carried out in as little time as possible, since the user will not be happy to wait too much to progress from one exploration step to the next one.
Focused on the intensional (in the logical sense) rather than on the extensional nature 2 of the sets and subsets being “explored.” Let us clarify with an example: a query about “hotels in Milan” to a search engine will provide the user with a list of hotels (extensional set); a question to a friend about “hotels in Milan” will probably give vent to intensional considerations like “most of them are quite expensive (reference to the whole set), but those around the central station are cheaper and still decent (reference to a subset dynamically created on the spot), just like those close to the University, but there you have more noise.” Another example of “intensionality” is the “comparison of different sets of objects”: most of the times, if a user wishes to understand the difference between objects (in a given dataset), what she is aiming at is not the set difference; but rather the “intensional” one, that is, the properties that distinguish one set from the other.
Dialogue-based. The linguist Emile Benveniste , who studied how bees communicate and compared their system of signs with human language, noticed that bees have no dialogue, in the sense that just one turn of the conversation is enacted: A bee passes a piece of information to another bee about where the food is located, and then the intercourse finishes. There is no reply of any sort. On the contrary, human beings dialogue, in the sense that they reply to what the interlocutor says, making reference to what she said and building on top of it. EC aims at supporting dialogic interactions where the user can build on top of the feedback provided by the system in the previous turn-taking without having to start every time from scratch. Moreover, and again like in a dialogue, the user must be allowed to make reference to previous turn-takings, not just the one immediately preceding the last.
Meaningful in terms of feedback. Again in the spirit of a dialogue, the system has to provide meaningful feedbacks, interpreting the user’s requests and proactively putting forth relevant information, likely to steer the conversation into unexpected directions.
2 Rationale and possible scenarios
In this section, we first discuss the overall features of EC, outlining why they are relevant and in what sense they are novel. In the following sections, we briefly compare EC with other well-known approaches. Then, we describe specific developments (systems and applications) implementing the EC approach. In the final section, we draw the conclusions and describe possible directions for future research.
Let us now try to define what is the vision of EC in general terms (warning the reader that probably the definition will be better understood after reading the following sections, where specific systems and applications are described). First of all, it must be clarified that there is no specific technique associated to EC, but several existing techniques can be used and novel techniques can be developed. In fact, EC is more about a vision than a specific technique and is characterized along 4 dimensions: (1) the user point of view (at the beginning and during an exploratory session), (2) the user experience (during the exploration), (3) the possible outcomes (of an exploratory session) and (4) the emphasis upon the “relevance” of data. Let us now go through each of them.
The user point of view EC cares for users with very generic goals or goals that are not easily translatable in terms of specific actions over a dataset. It must be clarified that by “generic goal” we do not mean “no goal,” since we believe that in all possible real-life situations a goal always exists, even if very generic like “understanding a situation (=dataset).” Let us assume, for example, that there is a dataset about museum exhibits, a set of ancient musical instruments. A potential user of the dataset may have very different starting points, backgrounds, attitudes and goals: For example, she may not know the specific domain well; or, alternatively, she may know (to some extent) the domain, but not that specific dataset; she may want to understand how objects are distributed across various facets (period of time in which they were manufactured, materials used for their various components, type of instruments, style of manufacturing, types of decoration,...); she may want to understand how the facets influence each other (“were instruments of the same material built in the same time period?”); she may want to understand how restorations were performed; she might try to identify the most interesting objects (in view of a possible visit to the museum); she might be trying to understand if a visit is worth a trip; etc.
In addition, it is very realistic to imagine that all the above are not just possible starting points for the user but also (and more important) possible steps within an on-going conversation between the user and the system that, just like in a human-to-human interaction, leads to changes of opinions, ideas, interests and goals. In other words, the situation can evolve during the exploration itself. A user may start with some point of view and end up with different ones. Somehow the essence of an exploratory session is discovery: Users change point of view along the way. Some of the above situations can be dealt with by means of other existing approaches (briefly examined in the next section); dealing with all of them is the goal of EC, and this is not possible with existing approaches.
The user experience (during an exploratory session) The user experience during an exploratory session can combine (in a highly interactive stream) several aspects: investigating properties of the initial dataset; creating subsets and investigating their properties; verifying how facets (attributes) affect other facets (attributes) in the initial dataset or in dynamically created subsets; visualizing properties of datasets; comparing properties of various subsets; identifying interesting phenomena in the initial dataset or subsets; saving/retrieving subsets; comparing properties of subsets, etc. To our knowledge, no system currently bundles all of the features of a user experience above described. EC, instead, aims at providing all (or at least most) of the above features together, in order to empower users with the possibility of continuously changing point of view, as her understanding of the dataset evolves.
The outcomes (of an exploratory session) An exploratory session may have various outcomes and it is nearly impossible to list all of them. Some obvious results could be identifying interesting objects (similar to results obtained via queries or search or faceted search or exploratory approaches). Other results could be understanding the overall properties of the initial dataset (similar to data mining results). Less obvious results could be identifying interesting subsets, by combining facets (attributes) that could not be anticipated at the beginning but were interactively discovered along the session. Even less obvious could be understanding interrelationships (beyond statistical analysis) among facets (attributes) in specific subsets. An example could be that while “style” and “material” have little correlation in general, “ebony” is often used for “keyboard” instruments built between the XVII and XVIII centuries, but not in previous centuries. Even less obvious results could be a better understanding of the domain and of that specific dataset: for example, understanding how “style,” “material” and “restoration technique” are interrelated. Another non-obvious result could be the discovery of niche properties (i.e., properties that make sense only in small subsets), etc.
The relevance of data By “relevance” of data we mean the meaning that users associate with them. Meaning is a subjective point of view, since it is related to how a specific item can be associated, in the user’s perception, with a set of other notions. Assume that an object is made of <gold>, <ebony> and <brass>. The question about the relevance of these materials brings to a deeper (than usual) interpretation of data. Relevance can be interpreted in various ways; one obvious interpretation could be about quantity, in percentage. Saying that the object is made 50% of brass, 40% of ebony and 10% of gold, may seem to provide an obvious answer. But we can interpret the relevance in a different way: It may be possible, in fact, that the brass component is of average quality, while the ebony component is of good quality and the gold component is of extraordinary quality. The relevance in quantity does not match the relevance in quality! So, if a user is exploring the <material> aspect (or facet, or attribute), should the system emphasize quantity or quality? Or other aspects of relevance?
Note that the notion of relevance is strictly related to that of “intensionality,” introduced before, since relevance involves the meaning of data (intension) and not just their “extension.” One of the aims of EC is to put at the center stage of a session the relevance of data.
In the next sections, we exemplify some of the consequences of an EC approach, describing systems and applications that show where EC can lead to, and, as we have done with the “EC manifesto” , try to stimulate readers to take on and build their tools and applications that exhibit the features of EC.
Exploration of a dataset gathering hundreds of educational experiences, each one described via a set of documents (content) and “profiled” according to a set of facets (e.g., subject-matter, technology used, benefits, organization,...), meant to help users to decide how to foster the introduction of technology at school.
Exploring historical medical data about thyroid diseases, trying to extract (possibly unexpected) results.
Civil Protection interacts with a dataset about natural disasters in Italy to gain information about floods.
3 State of the art
Since the early practical uses of computers, data have been at the center stage. Two main issues are relevant: how to organize them and how to use them. Since the 1970s, “querying” became known as a paradigm where “the user knows what she is looking for, and asks for it.” Recently, however, the need for a new search paradigm for the exploration of large datasets has been recognized and faced from different perspectives . Differently from the past, nowadays more dynamic data-driven applications are built that do not always have the same requirements as traditional database systems. On the contrary, they are usually discovery-oriented applications whose users have only a vague idea, if any, of what they are looking for. The EC vision [1, 2] borrows intuitions from many research areas, the most prominent of which are tackled below.
Search: you get what you are looking for Search is good at identifying subsets of potentially interesting objects but not good at making sense of datasets or at finding clusters of interesting subsets. EC offers a more relaxed way of specifying what the user is looking for and a much higher “recall” (the percentage of relevant objects retrieved). On the contrary, there is a loss of “precision” (the percentage of relevant objects vs. all retrieved objects) and the need of a “ranking” specifying which objects are more closely related to what the user is looking for. In fact, while search is mainly focused on exhibiting objects, EC is more prone to denote objects by means of their properties, i.e., intension [7, 8]. There are many different research areas that deal with developing and using methods and techniques to extract and maintain intensional information from data.
A key notion in the field of information retrieval but also in information science in general is that of relevance  whose definition has stimulated widely dissonant ideas and works. As introduced in Sect. 2, this is a center-stage notion also in the EC approach and its definition might vary depending on the point of view.
Data mining: data speak out Users understand data since they are analyzed in various statistical ways (distribution, outliers, clusters, regression,...) and algorithms and techniques have been widely studied to support the discovery of properties from data. There are several techniques in the field of data mining [10, 11], such as association rule mining, clustering, classification and so on whose aim is to find significant regularities, correlations or groupings in large amounts of data. This information provides implicit (intensional and approximate) knowledge about the initial data. A “preprocessing” is needed (to run the algorithms) so that the initial dataset cannot be modified. Data mining is good at identifying potential relationships across data but has two major problems (with respect to the EC requirements): The dataset cannot be broken down into subsets and the set of available relationships is somehow predefined, and not dynamically discovered via user interaction. In fact, the research in data mining is mostly concerned with finding the best algorithms used to extract knowledge and does not put much emphasis on the way such information should be made available to final users.
Still, there are many interesting approaches from this field of research that can be appropriately adapted to be used in EC. In [12, 13] the authors propose to extract approximate knowledge using data mining techniques, with the aim of storing association rules  as a synthetic (yet approximate) description of the data to be returned as intensional answers. These summarization techniques are proposed for relational or XML data and address also the objective of performance, achieved by querying instance patterns instead of using classical indexing techniques. However, these approaches provide very little interaction with users, that is, no incremental exploration is possible. Works in [15, 16] introduce the use of subspace analysis and interactive machine learning to help understand complex and high-dimensional medical data by also leveraging domain expert competences. However, an expert-in-the-loop approach might not always be feasible; moreover, this approach tends to use a single or limited number of algorithms as opposed to EC where many different techniques might be exploited during the data exploration process.
Another interesting data mining technique is subgroup discovery  that considers a set of items characterized by some properties, one of which being the target one. The aim is to identify the subgroups of the population that have the most unusual features with respect to the target property, and their descriptions, in most cases expressed through rules.
Data mining, machine learning and other summarization techniques have also been applied in the field of database summarization . The aim of this area of research is to summarize the content and schema of a database by producing a reduced version of it that might describe the data at different levels of granularity. Different techniques have been adopted to summarize both tables and views, among them random walks, hierarchical clustering, probabilistic summarization and fuzzy approaches. The summarized database is query-independent, and thus, a single employment of these techniques is not enough to support an entire exploration process; however, they can be used during each step to summarize the results of a given query.
Probabilistic and fuzzy approaches are also used to represent approximate or uncertain information in probabilistic  and fuzzy  databases. These areas of research adopt probabilities and fuzzy-set approaches to model data that is uncertain and provide ways to query them. Differently from these approaches, the EC approach is based on certain data and focuses on providing a multitude of ways to explore them.
Faceted search (popular with many commercial applications and not very popular in the scientific field): navigate properties. The field of faceted search , or faceted navigation, also deals with the idea of exploration as a process. It proposes a strategy of accessing sets of information items based on a taxonomy. To support faceted search, items have to be classified according to a number of “facets.” Let us start with a definition: “Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters [...] A faceted classification system classifies each information element along multiple explicit dimensions, called facets , enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order”. 3 To be more precise, we can define faceted search as faceted navigation enhanced by the use of retrieval systems making use of keywords. It was put forth for querying documents , databases  and semantic data [25, 26], giving vent to the so-called semantic faceted search. Facets are attributes of objects; in typical applications, users visualize how many objects have certain facets’ values; navigation is achieved by continuously redefining the dataset restricting it to those objects that have a specific value for a facet. Visualizing the possible values for the various facets and the number of objects having that values has a number of advantages: It helps users to understand the application domain; it helps users to understand the distribution of objects across the various facets; it “suggests” choices and navigation moves and prompts an interactive exploration (impossible with data mining, for example). The limitation is that the “sense” that can be made is very limited (basically cardinalities of the various attributes), and correlations across facets’ values are difficult to assess.
Facets add a great deal of flexibility to the search process while reflecting, or even “teaching,” about a domain’s concepts and terms . For example, a system based on faceted navigation within the domain of education could have as facets “pedagogical approaches” (with values like “constructivism,” “authentic learning,” etc.) or “inclusion issues” (with values like “learning disorders,” “sociocultural disadvantage,” etc.). The user can investigate the dataset by combining the values and while doing so she also implicitly “learns” about the domain. It is therefore clear that a faceted navigation system can support user sessions in which the user is encouraged to literally “explore” the dataset  instead of narrowing down almost immediately her search by means of keywords, running the risk, in so doing, of eliminating potentially interesting items that could have been found in a more step-by-step, explorative, process. Moreover, faceted search is apt to support investigations that are NOT aimed at finding one or some specific elements but rather at understanding, making sense, of a dataset (or subset), thanks to subsequent selection, or de-selection, of facets [29, 30].
For these reasons, faceted search has become “the” standard for the exploration of complex datasets [21, 23]. The outcomes of a faceted navigation session are basically two: (i) the set of items that satisfy the criteria; (ii) the set of facets’ values that can be combined with the query to produce a non-empty set of answers.
Recently there have been many proposals for semantic faceted search, including gFacet , Visor  and Sewelis . Faceted search, however, has no foundational model, and it does not support rich data that are also Big, ignoring the performance issues that derive from their coupling.
Exploratory data analysis and Data Exploration: see by yourself Exploratory data analysis is a field of statistics and data analysis mainly concentrating on the way data are displayed and on how their significant aspects are detected . It offers visualization clues about relevant features of the datasets where visualizations can range from simple histograms or multivar-charts, to 3D interactive trees to sophisticated multilinear principle component analysis. The clear advantage is that the properties of the dataset are “shown,” thus giving support to an immediate overall understanding of a dataset; however, the approach mainly focuses on visualization without giving much space to features such as supporting exploration as a process. Data Exploration , on the other hand, is about efficiently extracting knowledge from data, and its techniques range from data storage to user interaction. Mainly three lines of research are involved: (i) visualization tools for data exploration, (ii) novel optimizations for interactive exploration times and (iii) re-examination of the database architecture to match the features of the new exploration workloads. Thus, this approach is a databases’ perspective to the problem of exploring data with efficiency as its main focus.
Finally, in the context of relational data, the idea of extending database queries with recommendations has also been introduced  with the aim of suggesting to users tuples that are not in the results of their queries but might be of potential interest.
The EC vision borrows intuitions from all the introduced areas of research, in particular combining faceted search with data analysis, and especially data mining, in order to boost the explorative experience. Faceted search is in fact the perfect tool for creating sets of items the user is interested into checking “one by one,” so to speak, while data mining extracts approximate, intensional properties of the datasets, thus allowing a—more or less deep—understanding of the sets themselves, as it will be shown in the next paragraph.
4 A database perspective
Appropriate choice of summarization techniques that through a synthetic view of the data highlight their properties.
Fast summary computation.
Interesting user involvement techniques.
(a) Appropriate choice of the summarization techniques The use of statistical summaries—like for instance distributions—to summarize the values of an attribute may emphasize relevant aspects at a glance, permitting to see the data at a higher level of abstraction (highlighting their properties) and to take one or more further actions. In fact, a distribution can describe a dataset in an approximate, intensional way, in that it exhibits the data properties instead of the data themselves. In this direction, one possible technique is computing the distributions of the attribute values in the tuple sets resulting from two different queries. This analysis has the main aim of helping users to gain knowledge about the datasets they are exploring, by comparison between the main features of two query results. In this case, the differences between the distributions of the same attribute in the two datasets may hint at the relevance of one attribute in the query result(s) as shown by the example of query Q1 above, where the relevant fact is that the age values are distributed differently from that of the original dataset.
Similar results can be also obtained by data mining techniques: One is the well-known subgroup discovery , whose aim is discovering interesting relationships between different properties (attributes) of a set with respect to a target variable.The patterns extracted are normally represented in the form of rules and called subgroups. Subgroup discovery is mainly concerned with categorical attributes, while the use of distributions, or of entropy, allows also for numerical attributes; moreover, the adoption of these two typical statistical techniques also allows for the use of traditional statistical hypothesis tests.
Also mining association rules are useful, as shown in detail in the next section. For instance, by mining association rules from the result of query Q2 the system might discover that (i) “\(80\%\) of the patients from Lombardia lived between the Second World War and the sixties in the area of Valtellina” and (ii) “\(70\%\) of patients from Southern Italy are from a seaside location”.
(b) Fast summary computation The fast statistical analysis of large-scale databases becomes essential to the development of a reasonable exploratory system. Of course, while the problem can be faced by pre-computing as many summaries as possible in case of a slowly-changing database, it becomes much more difficult when the database may receive frequent updates. In the first case at least a portion of the feature summaries can be pre-computed, and, in the same fashion as in inductive DBs , all the intensional descriptions can be stored together with the data, to be queried when needed. In the second case, the summaries must necessarily be computed on the fly, which entails unfaltering investigation of fast data-intensive computation techniques: one of the eternal challenges for DB researchers.
Quick computation of attribute summaries is a long-studied area for databases, since relational DBMSs build and maintain statistics on the data in order to estimate the selectivity of attribute values for query optimization. Such techniques, along with others, can be applied to our case. In general, many techniques have been proposed for fast, incremental histogram computation, typically based on sampling. For example  propose the development of a statistical algorithm to measure the difference between two tuple sets Q1 and Q2 with a common target feature, in order to compute the relative relevance. The method relies on an ensemble of hypothesis tests operating on randomly extracted subsets of the original tuple sets. The main intuition is that the hypothesis tests should be conducted incrementally, in order to increase scalability, while at the same time keeping the emergence of false positives under control.
Similarly quick, albeit less precise, is the computation of other measures, such as the entropy, which establishes the “level of variety” of a set of values. When a query result shows a different entropy of a certain attribute w.r.t. that of the same attribute in the initial dataset, it means that its value set has been altered, suggesting a line of further investigation, like for instance submitting query Q2 above.
(c) Interesting user involvement techniques The first, immediate thought that comes to mind when thinking about effective user interaction is related to intuitive visualization techniques. An exploratory system should support intuitive visualization of the query answer synthesis, which highlights the relevant properties of the answers to the current and past queries. Some efforts in this respect have been presented in the next sections; however, the quest for this kind of tools should never stop.
Another feature that is badly needed in database exploration is the capability to provide explanations  and causal dependencies, whose aim is basically to understand the reasons for query results. Imagine that the user wants to understand “why there is a huge difference between Valtellina and the rest of Lombardia”: By joining the original dataset with one containing data about the Italian regions an explanation system might discover that Valtellina is very poor of iodine. The user will thus learn that iodine is strictly related to thyroid disorders, and possibly ask for another explanation, to see why such disorders are also found in seaside locations [rule (ii) extracted from the result of Q2]. The explanation facility is thus another formidable support for a user who tries to grasp from the data more than a flat sequence of items.
4.1 The IQ4EC system
a synthetic view of the data (extracted by applying efficient data mining techniques);
intensional query answering capabilities, i.e., the answer to a query is the synthetic knowledge rather than the extensional dataset;
After this feedback, Alice wants to learn about Central Italy (step 2) and discovers that “23% of flash floods in Central Italy are caused by intense rainfall” thus she asks the system to retrieve information about flash floods (step 3). Suppose the system, after querying the intensional knowledge, realizes that the result is empty; then it performs the query over the extensional data and: (i) it shows the result set to Alice and (ii) it mines this result set on-line to extract intensional knowledge. This new intensional knowledge is stored to be ready to answer future requests on the data
At this point, Alice wants to know about the countermeasures adopted in flooding events (step 4). The system looks for an answer in the intensional knowledge and, e.g., finds that: (i) 50% of countermeasures are structural and, (ii) 65% of structural countermeasures consist in constructing dikes.
Intensional knowledge mining intensional knowledge is mined from the data in the form of association rules (ARs) which are stored into a relational database that acts as an intensional repository.
Intensional knowledge querying the intensional repository is used to answer user requests: (i) users formulate queries as if addressed to the extensional data and the system rewrites them and executes them over the intensional repository; (ii) users formulate queries directly on the intensional repository.
NODE (idNode, tab, attribute, val)
RULE (idRule, tab, supp, conf, nAnt, nCons)
ANTEC_CONSEQ (idNode, idRule, role)
TUPLE_COUNT (table, nTuples)
SELECT clause: the following terms are allowed: ‘*’, attribute names, COUNT aggregate operator. They, however, may not be mixed in a single SELECT clause; moreover, the DISTINCT clause and aggregates different than COUNT are not allowed.
FROM clause: It may contain one relation, no aliasing or nested queries are allowed.
WHERE clause: It may contain AND and OR operators. No other operators (e.g., NOT, IN), nor nested queries are allowed.
- WHAT ABOUT query: allows users to directly query intensional knowledge. Its syntax is: and IQ4EC returns every AR containing:
attributes from the tables in the WHAT ABOUT
the elements that satisfy the conditions
confidence more or equal than the threshold
first, syntactic restructuring is performed where all unsupported clauses (such as GROUP BY, ORDER BY) and characters are deleted;
second, a semantic inspection is carried out where the query is divided into the three allowed clauses and the content of each clause is further analyzed.
Given a query we can ensure that the answer to it, provided by means of intensional information, conveys the exact properties of the extensional data that validate its requests (the formal proof can be found in ).
5 EXPLORE: a tool for exploratory portals
This section presents another specific approach to EC: exploratory portals. A development tool has been built, in various versions: the current one is EXPLORE-5. EXPLORE allows the fast deployment of portals that can be used for exploring “rich” datasets, i.e., datasets where items display a significant number of facets, with complex relationships among them. Various EC-portals have been built and deployed, some of them just for the sake of research and others for real-life usage. The application domains have been cultural heritage (for various sub-domains) and education (in various versions). The main idea of EXPLORE is to combine Faceted Search, Data Mining and novel set-based techniques in order to allow highly interactive and effective exploratory sessions [38, 39, 40].
The EXPLORE technology was first developed in 2007 by HOC-LAB of Politecnico di Milano. The current version, EXPLORE-5, was implemented by the GSA-LAB of University of Salento. Figure 8 presents the software architecture of EXPLORE-5 that currently (May 2016) implements most of the features described in this section .
5.1 The functionalities
In order to describe how an exploratory portal works, let us consider an example from research in educational technology (Fig. 1). The dataset consists of a number of educational experiences based on technology, carried on by teachers at school. On the left, a representation of some intensional properties of the current set can be seen. Each facet (attribute) is displayed with the possible values (visible on demand); for each facet, various representations are possible: absolute value, percentage, distribution, word-cloud, etc. In addition, some small icons above the facets allow the user to make use of the full Boolean expressions, spanning several attributes, which are provided by EXPLORE-5. On the right, data visualization techniques show the instances belonging to the current set (there is one circle for each instance and instances are ordered according to their timestamp). A typical exploratory session starts with the user taking a look at the various facets, looking at how their values are distributed.
The portal was created in cooperation between Politecnico di Milano and the University of Salento (Lecce, Italy) within a national research project (EDOC@WORK).
As can be seen from Fig. 1, emphasis is on the 43 facets, which entails more than 250 different values characterizing this “rich” dataset. The extension of the dataset is shown on the right (the small tiles within the mosaic): It can be used to visualize at a glance the current set and also to directly access each individual item. Various color schemes (unfortunately not shown in the figure) can be used, in order to visualize immediately the distribution of the values for a specific subset.
Several subsets have been created during the exploration. At almost every step, the standard answer search/query engines would provide (i.e., “which items correspond to my request”) would basically be irrelevant. As the example shows, the user is almost never interested in the “extensional answer”: She looks at the features and properties of the various subsets, not at the list of experiences.
The varieties of possible combination of attributes cannot be anticipated in advance. The user discovers what she is looking for along the way.
Even a user with a mild understanding of the educational technology domain can undergo a fruitful exploration session.
Exploration (with this type of portals, at least) is a complex interactive process, made of several attempts and trials.
It is important to save intermediate stages of exploration along the way, in order to be able to step back and follow promising tracks among the many that open up during a session.
5.2 The technology
6 Experimental evaluation: user experience
In order to test whether our approach is actually fulfilling the EC objectives, we set up a focus group with scholars from Italian universities and research centers from two domains: education and educational technology. There were representatives from five different research groups, each represented by one/two scholars. The focus group was deemed the best way to assess the approach’s validity because of the ease of going into details where issues emerge: Exploratory portals are meant to support non-trivial tasks, and therefore our aim was understanding what did or did not work, not just in terms of small technicalities, but from a general usability viewpoint.
The Unexpected Learning: How Unexpected Benefits Can Be Generated Through ICT-Based Experiences, by the Polytechnic University of Marche
PoliCultura and Moodle: A Blended Learning Environment, by the University of Perugia
Educational Technology at the Primary School Level: A Survey, by the University of Perugia
Inclusion and Group Work: What Benefits? by Politecnico di Milano
Educational Technology at Junior High-School Level: A Survey, by the University of Bologna
Technology@School: Analysis of an ICT-Based Format, by the University of Bologna.
Investigation on the Relationship Between Technology and Educational Benefits, by the University of Salento
Is there any relation between inclusion and group work?
How are experiences dealing with inclusion through group work characterized?
What are the differences between two relevant group work organizations, namely heterogeneous and homogenous groups?
After the report’s completion, the representatives from each group were interviewed (via Skype) on their usage of the portal. Results are reported elsewhere ; for the purpose of this paper, notice that one of the most significant feedbacks was on the “Browsing and serendipity” affordances. In other words, scholars explained that, as an effect of an interaction activity, the portal enticed them to perform further queries they had not thought about at first, and that in doing so they, so to speak, “stumbled” upon unexpected discoveries.
The portal was then redesigned, and further investigations were performed (this time by HOC-LAB of Politecnico di Milano) that again gave vent to scientific publications. Though certainly still subject to improvements, the portal is proving its usefulness as a research tool for non-expert users.
emphasis upon the “intensionality” of the datasets, i.e., their “meaning,” versus “extensionality” (i.e., the items they contain).
capability to accommodate users with various kinds/degrees of knowledge: those who know very well the domain and the specific dataset, those who know the domain but not the specific dataset, those who have a weak understanding of the domain, ...
support for in-depth exploration: discovery (e.g., of interesting features), understanding (e.g., of the domain and of the specific dataset), identification of interesting subsets, discovery of “niche features” (e.g., features belonging to subsets only), comparison of features of subsets, etc.
sensemaking: At the end of a session, the user has a better understanding of the domain and of the specific dataset being explored.
Relevance Since the aim of EC is to help users make sense of data, their “relevance” is an important issue to be considered . As explained in Sect. 2, an EC system must be able to support different interpretations of relevance (quantitative, qualitative) and a quick switch among them. We are therefore struggling with several technicalities in order to allow her to do so: data modeling, concrete data representation, interface design, interaction design, etc.
Subsets creation and comparisons After some observations of users’ sessions, we noted that oftentimes more than the properties of data in the “universe” (i.e., the initial set) the properties of “subsets” of the universe matter, in comparison with either the universe itself or other subsets. A sentence like “primary schools make use of tablets a lot” should not be interpreted at face value, but getting to know the data that support it, which are: \(25\%\) of primary schools make use of tablets, which is a higher percentage with respect to the average \(12\%\). Another example is as follows: If we wonder in what sense Lombardy schools are characterized with respect to, let us say, Sicilian (or any other region) schools, what we are hinting to is, in fact, a comparison between subsets. The above raises complexities since, as we have said many times, during an exploration a user may take different points of view and change her mind on the way. Here are some possibilities: (a) the user knows how to qualify subsets (e.g., by regions) and which facets she is interested in (e.g., how technology affects learning benefits); (b) the user knows how to qualify subsets (e.g., by regions) but has a vague if not totally missing idea about which facets she is interested in (she actually wants to discover them); (c) the user does not know how to qualify subsets, but she knows which facets she is interested in (e.g., how technology affects learning benefits); (d) the user does not have predefined hypothesis and tries to discover which subsets may help her sketch interesting hypothesis. The reader may well understand how the goal of supporting all these possibilities as well as letting the user continuously change her point of view is technically difficult in terms of performance as well as of interfaces’ design.
Long terms, collaborative and semi-interactive sessions While using an exploratory system, users create a lot of intermediate results (subsets, property analysis, interesting screenshots, etc.). First of all, there is the need of keeping these materials organized (the history of a browser would not be enough); secondly, after several minutes (or a few hours) of exploration, users need to suspend their work and retrieve it later. Another observation is that explorations are not a one-man-job; typically, researchers need to compare their work, exchange intermediate results, intervening in the exploration started by someone else. Exploration create trees (or forests!) of intermediate sessions. Keeping track of them and making them truly collaborative is a challenge. Finally, the notion of “interactive” session should be revised. Long-lived exploration sessions must be understood and properly addressed.
By rich data we mean data presenting many properties, or facets, as opposed to very simple data like sensor readings or numerical series.
“Meanings are often called intensions, and things designated, extensions” (Stanford Encyclopedia of Philosophy, retrieved April 2016—http://plato.stanford.edu/entries/logic-intensional/#WhaAbo).
In the database world, the notion of facet may well be assimilated to that of attribute. While not necessarily this means being constrained to the choice of a relational database, the examples used from now on will often draw from the relational world for the sake of simplicity.
- 1.Di Blas, N., Mazuran, M., Paolini, P., Quintarelli, E., Tanca, L.: Exploratory computing: a draft manifesto. In: International Conference on Data Science and Advanced Analytics, pp. 577–580 (2014)Google Scholar
- 2.Di Blas, N., Mazuran, M., Paolini, P., Quintarelli, E., Tanca, L.: Exploratory computing: a challenge for visual interaction. In: International Working Conference on Advanced Visual Interfaces, pp. 361–362 (2014)Google Scholar
- 4.Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 277–281 (2015)Google Scholar
- 6.Benveniste, E.: Problèmes de linguistique générale. Gallimard, Paris (1966)Google Scholar
- 11.Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co. Inc., Reading (2005)Google Scholar
- 13.Baralis, E., Garza, P., Quintarelli, E., Tanca, L.: Answering XML queries by means of data summaries. ACM Trans. Inf. Syst. 25(3), Article No. 10 (2007)Google Scholar
- 14.Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: International Conference on Very Large Data Bases, pp. 487–499 (1994)Google Scholar
- 15.Wartner, S., Girardi, D., Wiesinger-Widi, M., Trenkler, J., Kleiser, R., Holzinger, A.: Ontology-Guided Principal Component Analysis: Reaching the Limits of the Doctor-in-the-Loop. Springer, Berlin (2016)Google Scholar
- 18.Saint-Paul, R., Raschia, G., Mouaddib, N.: General purpose database summarization. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pp. 733–744. VLDB Endowment (2005)Google Scholar
- 20.Petry, F.E.: Fuzzy databases: principles and applications, vol. 5. Springer Science and Business Media, Berlin (2012)Google Scholar
- 21.Tunkelang, D.: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, San Rafael (2009)Google Scholar
- 22.Ranganathan, S.: Colon classification. Asia Publishing House, Kolkata (1964)Google Scholar
- 23.Yee, K.-P., Swearingen, K., Li, K., Hearst, M.A.: Faceted metadata for image search and browsing. In: Conference on Human Factors in Computing Systems, pp. 401–408 (2003)Google Scholar
- 24.Dakka, W., Ipeirotis, P.G., Wood, K.R.: Automatic construction of multifaceted browsing interfaces. In: International Conference on Information and Knowledge Management, pp. 768–775 (2005)Google Scholar
- 26.Ferré, S., Hermann, A.: Semantic search: reconciling expressive querying and exploratory search. In: The Semantic Web: International Semantic Web Conference, pp. 177–192 (2011)Google Scholar
- 28.Sacco, G.M., Tzitzikas, Y.: Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience, 1st edn. Springer, Berlin (2009)Google Scholar
- 29.Spagnolo, L., Bolchini, D., Paolini, P., Di Blas, N.: Beyond findability search-enhanced information architecture for content-intensive rich internet applications. J. Inf. Archit. 2(1), 19–36 (2010)Google Scholar
- 30.Spagnolo, L.: Exploratory computing: designing discovery-driven user experiences. Ph.D. thesis, Politecnico di Milano (2013)Google Scholar
- 31.Heim, P., Ertl, T., Ziegler, J.: Facet graphs: complex semantic querying made easy. In: The Semantic Web: Research and Applications, pp. 288–302 (2010)Google Scholar
- 32.Popov, I.O., Schraefel, M.M.C., Hall, W., Shadbolt, N.: Connecting the dots: a multi-pivot approach to data exploration. In: The Semantic Web: International Semantic Web Conference, pp. 553–568 (2011)Google Scholar
- 34.Stefanidis, K., Drosou, M., Pitoura, E.: you may also like results in relational databases (2009)Google Scholar
- 36.Alippi, C., Quintarelli, E., Roveri, M., Tanca, L.: Towards a statistical framework for attribute comparison in very large relational databases. 10th Scientific Meeting of the Classification and Data Analysis Group (CLADAG 2015), Santa Margherita di Pula, 8–10 October 2015 (2015)Google Scholar
- 38.Di Blas, N., Fiore, A., Mainetti, L., Vergallo, R., Paolini, P.: A portal of educational resources: providing evidence for matching pedagogy with technology. Res. Learn. Technol., 22, 1–26 (2014)Google Scholar
- 39.Paolini, P., Di Blas, N.: Exploratory portals: the need for a new generation. In: International Conference on Data Science and Advanced Analytics, pp. 581–586 (2014)Google Scholar
- 40.Di Blas, N., Mainetti, L., Paolini, P., Vergallo, R.: Exploratory computing: combining faceted search with data mining. In: Proceedings of the 24th Italian Symposium on Advanced Database Systems (2016)Google Scholar
- 42.Di Blas, N., Paolini, P.: Distributed and dynamic tpack: evidences from a (large) case study. In: Proceedings of Society for Information Technology and Teacher Education International Conference 2016, pp. 2815–2821. Association for the Advancement of Computing in Education (AACE) (2016)Google Scholar
- 43.Nayrolles, M.M.: Mastering Apache Solr: A practical guide to get to grips with Apache Solr. inKstall Solutions, Tigard (2014)Google Scholar
- 45.Di Blas, N., Paolini, P.: Technology and group work: inclusion or diversification of talents? pp. 218–231 (2013)Google Scholar
- 46.Sperber, D., Wilson, D.: Relevance. Communication and Cognition, 2nd edn. Blackwell, Oxford (1995)Google Scholar