Query Adaptation and Privacy for Real-Time Business Intelligence

Extended Abstract
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 206)

Abstract

This extended abstract discusses several technical challenges and issues that need special attention when dealing with real-time business intelligence (RTBI) systems. While most contributions to previous BIRTE Workshops focused on (database) technology, this extended abstract takes a more holistic view by covering both technical and non-technical aspects. First, we introduce and discuss two real-world applications to derive the quite diverse technical and non-technical requirements that arise in the context of real-time business intelligence. Based on those requirements and on our experience in developing the Stratosphere database management system [1], we outline our existing and future approaches to query adaptation and statistics building that are about to be implemented in Stratosphere to support RTBI.

In the second part of the extended abstract we discuss important aspects of privacy when dealing with personal data, and outline necessary requirements for implementing real-time business intelligence systems to protect people’s privacy (to some extent). It will become apparent that often there exists a trade-off between the level of privacy and the utility expected by those who perform real-time business analytics.

Keywords

Real-time business intelligence, Big data, MapReduce paradigm, Query optimization, Privacy, k-anonymity, Adversary knowledge

1 Looking at the Real World

During the last decade, the challenges in RTBI systems have centered on extending existing database technology to better fit the needs of this area. However, we are now at the point where this technology is used more and more, thus penetrating everyone’s lives. For example, a new startup in Berlin named Tazaldoo/Tame [2] uses Twitter feeds to perform sentiment analysis, providing journalists with leads for the next big stories (before anyone else is aware of them) and making politicians aware of positive or negative popularity trends in their realm. Tazaldoo/Tame provides this analysis incrementally in (almost) real time. The same techniques carry over to discovering economic trends and changes early enough for companies to react appropriately.

A second example indicates the trade-offs of RTBI technology when it infringes on people’s privacy. In the Square Mile of London, the company Renew London [4] installed garbage bins with screens on all four sides that can show any kind of visual information in HD quality, see Fig. 1. By connecting to people’s (unprotected) phones and by gathering enough information about the people passing by, the company provides personalized advertisements on the spot.
Fig. 1. Stalking bins in London

Their solution uses technology developed by the technology company PresenceOrb [5]. The company claims that its technology allows “for rich analytical insight and immediate on site reactions ...”. This massive intrusion into privacy triggered public protests that led to the removal of the stalking bins.

Both examples show the new abilities and consequences that RTBI technology may have. At the same time, they motivate the two topics that we discuss in the rest of this paper. The next section outlines how to improve MapReduce-style execution environments such as Hadoop [3] or Stratosphere [1] for RTBI by gathering and storing statistical information about the data sources accessed. Such information will be the basis for adapting query execution “on the fly” (the adaptation itself is not part of this extended abstract). The following section then outlines how to model the knowledge an adversary gathers when asking a sequence of queries to a database that returns anonymized answers. We do not provide a detailed presentation of these two topics, as they are work in progress by Ph.D. students of the DBIS research group at Humboldt-Universität zu Berlin.

2 Gathering Statistics in a MapReduce Environment

Over the last years, the MapReduce paradigm has gained momentum as a model and a programming paradigm for data-intensive applications that should take advantage of parallelism in a compute cluster environment. Hadoop is probably the most prominent system implementing the MapReduce paradigm [11]. Stratosphere, developed by several research groups in Berlin (TU Berlin, HU Berlin, HPI Potsdam), is an alternative system that extends the MapReduce paradigm with additional second-order functions while using database-oriented concepts for its implementation, including query optimization [1, 6]. As the execution of a MapReduce program may take minutes, hours, or even days, it might be advantageous to check whether the current execution plan is still the best possible one. That is, the system could switch to an alternative plan that executes faster and/or with fewer resources than the current one. As this is a classical query optimization problem, we explore adaptive optimization in the context of MapReduce systems as a promising addition to the query execution environment. Several approaches for traditional object-relational DBMS have been reported in [7, 8]. In our context, we see four major steps to develop an integrated approach for adaptive query optimization in Stratosphere as shown in Fig. 2:
  1. Measure, i.e. understand the current status of the executing query;
  2. Analyze whether the current status should/must be adapted;
  3. Re-optimize by generating alternative (partial) query execution plans;
  4. Deploy alternative (partial) plans into the currently executing query.

     
Fig. 2. Cycle for query adaptation
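
The four steps of the cycle can be illustrated with a minimal Python sketch. It is not Stratosphere code: the `Plan` class, the linear cost model, the `tolerance` factor, and the plan names are all assumptions made for illustration. A plan is only reconsidered when the measured cardinality deviates strongly from what the optimizer assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    """Hypothetical execution plan with a simple linear cost model."""
    name: str
    assumed_rows: int      # cardinality the optimizer assumed at planning time
    cost_per_row: float
    fixed_cost: float = 0.0

    def cost(self, rows: int) -> float:
        return self.fixed_cost + self.cost_per_row * rows


def needs_adaptation(plan: Plan, observed_rows: int, tolerance: float = 2.0) -> bool:
    """Step 2 (analyze): is the estimate off by more than the tolerance factor?"""
    ratio = max(observed_rows, 1) / max(plan.assumed_rows, 1)
    return ratio > tolerance or ratio < 1.0 / tolerance


def adaptation_step(current: Plan, alternatives: List[Plan], observed_rows: int) -> Plan:
    """One pass through the cycle; `observed_rows` is the result of step 1 (measure)."""
    if not needs_adaptation(current, observed_rows):
        return current
    # Step 3 (re-optimize): re-cost all candidate plans with the measured cardinality.
    best = min([current, *alternatives], key=lambda p: p.cost(observed_rows))
    # Step 4 (deploy): here the chosen plan is simply returned to the caller.
    return best


broadcast = Plan("broadcast_join", assumed_rows=1_000, cost_per_row=1.0)
repartition = Plan("repartition_join", assumed_rows=1_000, cost_per_row=0.1, fixed_cost=5_000)

print(adaptation_step(broadcast, [repartition], observed_rows=1_200).name)
print(adaptation_step(broadcast, [repartition], observed_rows=100_000).name)
```

In this sketch, deployment is reduced to returning the chosen plan; merging a new partial plan into a running query without changing its result is the hard part and is not modeled here.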

Rather than restarting query execution from scratch, the underlying idea is to smoothly integrate changes of the execution plan into the currently executing plan without changing the outcome of the query (for example, without generating duplicate results). For the purpose of this paper, we focus on Step 1 of this cyclic adaptation process, i.e. how to measure the current state by gathering statistics about the data sources accessed. As in traditional DBMS, we see such statistics as the basis for making decisions about alternative execution plans. The goal of our work is to provide an environment for building histograms incrementally and adaptively with minimal overhead for the running query. Furthermore, we envisage that the plan generator, rather than a programmer or an administrator, automatically adds statistical operators to an execution plan. That is, we gather statistics about the results of partial plans that could guide the current (and future) execution of queries in a more informed manner, most likely improving response time for RTBI. To implement our vision, we are currently designing a Statistics Store that incrementally stores gathered statistics in such a way that future queries can benefit from past executions of similar queries. Additionally, we are designing algorithms to detect the need for new statistics, to find similar statistics in the Statistics Store, and to determine which statistics to collect by adding operators to an (existing) execution plan (operator injection), thus combining statistics collection with regular query execution (piggybacking). The general steps for injecting statistical operators into an execution plan are the following:
  1. The Query Optimizer requests specific statistics from the Cost Estimator component;
  2. The Cost Estimator component in turn asks the Statistics Store whether such statistics exist. If so, the Statistics Store returns the requested statistics;
  3. If the requested statistics are not found, a request for collecting them is generated;
  4. Whenever the Optimizer component has generated an execution plan, the Injector component checks whether there are pending statistics requests that the current query could serve;
  5. If so, the Injector component selects the corresponding statistics request(s) and generates appropriate statistics operators that are integrated into the current query execution plan;
  6. During query execution, the injected statistics operators emit statistical data that are transmitted to and written into the (distributed) Statistics Store for further use.
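
The six steps above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the Stratosphere design: the class `StatisticsStore`, the functions `inject` and `execute`, the use of column names as statistics keys, and the value-frequency histogram are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set


class StatisticsStore:
    """Hypothetical in-memory stand-in for the (distributed) Statistics Store."""

    def __init__(self) -> None:
        self._histograms: Dict[str, Counter] = {}
        self.pending: Set[str] = set()   # open statistics requests

    def lookup(self, column: str) -> Optional[Counter]:
        """Steps 2 and 3: return known statistics or record a collection request."""
        if column in self._histograms:
            return self._histograms[column]
        self.pending.add(column)
        return None

    def write(self, column: str, histogram: Counter) -> None:
        """Step 6: merge newly emitted counts into the stored histogram."""
        self._histograms.setdefault(column, Counter()).update(histogram)
        self.pending.discard(column)


def inject(plan_columns: Set[str], store: StatisticsStore) -> List[str]:
    """Steps 4 and 5: pick the pending requests that this plan can serve."""
    return sorted(c for c in store.pending if c in plan_columns)


def execute(rows: Iterable[dict], plan_columns: Set[str], store: StatisticsStore) -> List[dict]:
    """Run the 'query' and piggyback histogram building on the same pass over the data."""
    requested = inject(plan_columns, store)
    histograms = {c: Counter() for c in requested}
    result = []
    for row in rows:
        for c in requested:
            histograms[c][row[c]] += 1   # injected statistics operator
        result.append(row)               # regular query processing, left unchanged
    for c, h in histograms.items():
        store.write(c, h)
    return result


store = StatisticsStore()
first_lookup = store.lookup("country")   # step 1: the optimizer asks; nothing stored yet
rows = [{"country": "DE", "amount": 10},
        {"country": "UK", "amount": 7},
        {"country": "DE", "amount": 3}]
result = execute(rows, {"country", "amount"}, store)
print(store.lookup("country"))
```

The first lookup misses and leaves a pending request; the next query that touches the column collects the histogram as a side effect, so a later lookup succeeds without a dedicated scan.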

     

Based on these steps, we are currently designing and implementing the Statistics Store, which we shall integrate into the Stratosphere system. Such a Statistics Store is one of the necessary steps to make Stratosphere more amenable to RTBI queries. R. Bergmann will present more detailed results in his Ph.D. thesis [13].

3 Modeling the Adversary’s Knowledge for Detecting Privacy Breaches

Privacy has been an active research field for over ten years. The work by Samarati and Sweeney [9] on k-anonymity as well as the work by Dwork [10] on differential privacy spawned an increasing amount of research, but neither line of work is the focus of this paper. Rather, we focus on an often-neglected aspect of privacy, i.e. how to model and represent the knowledge that an adversary gains from the results of (a sequence of) queries. Such an explicit representation of the adversary’s knowledge could provide the basis for deciding how many queries to answer before rejecting further queries. In the following, we present first steps on how to represent the increasing knowledge of an adversary by bipartite graphs before outlining challenges and some initial algorithms for determining when to stop answering queries. We use the notion of k-anonymity to outline our approach. We assume that the reader is familiar with the k-anonymity approach: if we want to publish a table with personal (sensitive) data, we transform the given table into its anonymized equivalent as follows: the identifying attributes are removed, and the values of the quasi-identifying attributes are anonymized. We show such a scenario in Fig. 3 with Name being the identifying attribute, Zipcode, Age, and Sex being quasi-identifiers, and Disease being the sensitive attribute.
Fig. 3. k-anonymity example
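
As a reminder of what the k-anonymity condition means operationally, the following Python sketch checks that every combination of quasi-identifier values in a release table occurs at least k times. The table contents are invented for illustration and do not reproduce Fig. 3; the function name `is_k_anonymous` is likewise an assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence


def is_k_anonymous(rows: List[Dict[str, str]], quasi_identifiers: Sequence[str], k: int) -> bool:
    """True if each quasi-identifier combination is shared by at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in groups.values())


# Hypothetical anonymized release table: Name removed, quasi-identifiers generalized.
release = [
    {"Zipcode": "101**", "Age": "20-29", "Sex": "M", "Disease": "Flu"},
    {"Zipcode": "101**", "Age": "20-29", "Sex": "M", "Disease": "Cold"},
    {"Zipcode": "102**", "Age": "30-39", "Sex": "F", "Disease": "Earache"},
    {"Zipcode": "102**", "Age": "30-39", "Sex": "F", "Disease": "Flu"},
]
qi = ("Zipcode", "Age", "Sex")

print(is_k_anonymous(release, qi, k=2))
print(is_k_anonymous(release, qi, k=3))
```

A check of this kind considers a single release in isolation and therefore cannot detect breaches that only arise from combining several query answers.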

If we know a person’s Zipcode, Age, and Sex, we cannot determine the exact disease that (s)he has when accessing the anonymized release table. However, when considering the example of Fig. 4, we recognize that we can derive that Clark has a Cold based on answers R1 and R2. Similarly, we can derive that Gary must have Earache based on the answers to queries Q2 and Q3.
Fig. 4. Sequence of queries with privacy breach

These privacy breaches are not immediately obvious and not always easy to detect. Therefore, it becomes necessary to develop a systematic approach to determine when a privacy breach occurs. Our approach uses bipartite graphs to model the set of existing identifiers and the set of existing sensitive values. Figure 5 shows edges between identifier vertices and vertices for sensitive values to indicate possible value assignments between both. In general, there must exist at least k edges between identifier vertices and vertices for sensitive values to satisfy k-anonymity. Such a graph is the basis for computing the different perfect matchings between identifiers and sensitive values, i.e. matchings where each identifier vertex is connected to exactly one vertex for a sensitive value and vice versa. The light red edges in the six different graphs of Fig. 6 indicate the six possible alternative matchings between identifiers and sensitive values. However, for a sequence of query results, the underlying edge structure might change due to additional constraints (i.e. an increase of knowledge) resulting from answers to new queries, as shown for the example in Fig. 7.
Fig. 5. Bipartite graph modeling relationships between identifiers and sensitive values

Fig. 6. Perfect matchings between identifiers and sensitive values

Fig. 7. Two query answers with corresponding graph and perfect matchings

Here we see that G1 changes due to the query answer modeled by G2: ID Vertex 3 now has only one edge, relating it to Vertex C, since sensitive value C is the only one that allows a consistent value assignment for ID Vertex 3 in both graphs. Thus, in every remaining perfect matching, ID Vertex 3 is matched with Vertex C, which represents the corresponding sensitive value.
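
The reasoning above can be reproduced with a small brute-force Python sketch. It is explicitly not one of the polynomial algorithms developed in this work: it enumerates all perfect matchings, which is only feasible for tiny graphs. The graphs `g1` and `g2` are hypothetical stand-ins for Fig. 7, and the function names are assumptions.

```python
from itertools import permutations
from typing import Dict, List, Set

Graph = Dict[int, Set[str]]   # identifier vertex -> candidate sensitive values


def combine(g1: Graph, g2: Graph) -> Graph:
    """Keep only the edges that are consistent with both query answers."""
    return {i: g1[i] & g2[i] for i in g1}


def perfect_matchings(graph: Graph) -> List[Dict[int, str]]:
    """Enumerate all one-to-one assignments of sensitive values to identifiers."""
    ids = sorted(graph)
    values = sorted(set().union(*graph.values()))
    if len(ids) != len(values):
        return []
    return [dict(zip(ids, perm))
            for perm in permutations(values)
            if all(v in graph[i] for i, v in zip(ids, perm))]


def disclosed(graph: Graph) -> Dict[int, str]:
    """Identifiers whose sensitive value is the same in every perfect matching."""
    matchings = perfect_matchings(graph)
    result = {}
    for i in graph:
        candidates = {m[i] for m in matchings}
        if len(candidates) == 1:
            result[i] = candidates.pop()
    return result


# Hypothetical data: the first answer leaves all assignments open (six matchings),
# the second answer restricts ID vertex 3 to sensitive value C.
g1 = {1: {"A", "B", "C"}, 2: {"A", "B", "C"}, 3: {"A", "B", "C"}}
g2 = {1: {"A", "B", "C"}, 2: {"A", "B", "C"}, 3: {"C"}}
combined = combine(g1, g2)

print(len(perfect_matchings(g1)), disclosed(g1))
print(len(perfect_matchings(combined)), disclosed(combined))
```

A query would be rejected as soon as `disclosed` returns a non-empty result for the combined graph, which is the decision the real-time algorithms have to make without exhaustive enumeration.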

Using this approach to model the (increasing) knowledge of the adversary with an increasing number of queries, we developed a series of polynomial approximation algorithms (the general problem is NP-complete) to detect privacy violations before returning an answer to a submitted query. These polynomial algorithms are necessary to decide in real time whether to answer a query or not. Only then can we guarantee that such an approach is usable and feasible in an RTBI query execution environment. More detailed results are to be published soon in the Ph.D. thesis by Lukas Dölle [12].

4 Summary

This paper discusses two important challenges for implementing RTBI. The first challenge is to provide an adequate basis for long-running analytical queries by gathering statistical information about the data sources accessed. Our approach is to integrate statistical operators into user queries automatically and to store the statistical results in a Statistics Store for future use. Second, we focus on better supporting privacy protection by modeling the adversary’s knowledge as (a set of) bipartite graphs that allow us to perform inferences about the knowledge gained by each query. Lukas Dölle and Rico Bergmann are currently developing both approaches as part of their Ph.D. theses at the DBIS research group at Humboldt-Universität zu Berlin.

References

  1. Stratosphere. http://www.stratosphere.eu. Accessed Dec 2013
  2. Tazaldoo/tame. http://www.tame.it. Accessed Dec 2013
  3. Hadoop. http://hadoop.apache.org/. Accessed Dec 2013
  4. Renew London. http://renewlondon.com. Accessed Dec 2013
  5. PresenceOrb. http://www.presenceorb.com/. Accessed Dec 2013
  6. Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. PVLDB 5(11), 1256–1267 (2012)
  7. Deshpande, A., Ives, Z.G., Raman, V.: Adaptive query processing. Found. Trends Databases 1(1), 1–140 (2007)
  8. Rundensteiner, E.A., Ding, L., Sutherland, T.M., Zhu, Y., Pielech, B., Mehta, N.: CAPE: continuous query engine with heterogeneous-grained adaptivity. In: VLDB Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, pp. 1353–1356 (2004)
  9. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information (abstract). In: PODS 1998, p. 188 (1998)
  10. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
  11. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI 2004, pp. 137–150 (2004)
  12. Dölle, L.: Detecting privacy breaches when answering a sequence of queries. Ph.D. thesis (in German), Humboldt-Universität zu Berlin (2014)
  13. Bergmann, R.: Gathering statistics for query adaptation. Ph.D. thesis (in German), Humboldt-Universität zu Berlin (2014)

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Humboldt-Universität zu Berlin, Berlin, Germany
