Query Adaptation and Privacy for Real-Time Business Intelligence
This extended abstract discusses several technical challenges that need special attention when dealing with real-time business intelligence (RTBI) systems. While most contributions to previous BIRTE Workshops focused on (database) technology, this extended abstract takes a more holistic view by covering both technical and non-technical aspects. First, we introduce and discuss two real-world applications to derive the quite diverse technical and non-technical requirements that arise in the context of real-time business intelligence. Based on those requirements and on our experience in developing the Stratosphere database management system, we outline our existing and planned approaches to query adaptation and statistics building that are about to be implemented in Stratosphere to support RTBI.
In the second part of the extended abstract we discuss important aspects of privacy when dealing with personal data, and outline necessary requirements for implementing real-time business intelligence systems to protect people’s privacy (to some extent). It will become apparent that often there exists a trade-off between the level of privacy and the utility expected by those who perform real-time business analytics.
Keywords: Real-time business intelligence, Big data, MapReduce paradigm, Query optimization, Privacy, k-anonymity, Adversary knowledge
1 Looking at the Real World
During the last decade, the challenges in RTBI systems have centered on extending existing database technology to better fit the needs of this area. Now, however, this technology is being used more and more widely and is thus entering everyone’s lives. For example, Tazaldoo/Tame, a new startup in Berlin, uses Twitter feeds to perform sentiment analysis: for journalists, to provide them with leads for the next big stories (before anyone else is aware of them), and for politicians, to make them aware of positive or negative popularity trends in their realm. Tazaldoo/Tame provides this analysis incrementally in (almost) real time. The same techniques carry over to discovering economic trends and changes early enough for companies to react appropriately.
Their solution uses the technology developed by the tech company PresenceOrb. The company claims that its technology allows “for rich analytical insight and immediate on site reactions ...”. Such a massive intrusion into privacy triggered general protests that led to the removal of the stalking bins.
Both examples show the new abilities and consequences RTBI technology may have; at the same time, they motivate the two topics that we discuss in the rest of this paper. The next section outlines how to improve MapReduce-style execution environments such as Hadoop or Stratosphere for RTBI by gathering and storing statistical information about the data sources accessed. Such information will be the basis for adapting query execution “on the fly” (which is not part of this extended abstract). The following section then outlines how to model the knowledge an adversary gathers when asking a sequence of queries to a database that returns anonymized answers. We do not provide a detailed presentation of these two topics, as they are work in progress by Ph.D. students of the DBIS research group at Humboldt-Universität zu Berlin.
2 Gathering Statistics in a MapReduce Environment
Measure, i.e., understand the current status of the executing query;
Analyze whether the current status should/must be adapted;
Re-optimize by generating alternative (partial) query execution plans;
Deploy alternative (partial) plans into the currently executing query.
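The four steps above can be read as one pass of a control loop. The following is a minimal sketch under stated assumptions, not Stratosphere's implementation: the `Plan` class, the use of row counts as the measured status, the `tolerance` factor, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    name: str
    estimated_rows: int  # cardinality the optimizer assumed for this plan (assumption)


def measure(observed_rows: int) -> int:
    """Measure: report the status of the running query (here: rows seen so far)."""
    return observed_rows


def analyze(plan: Plan, observed_rows: int, tolerance: float = 2.0) -> bool:
    """Analyze: adapt only if the observation is off by more than a factor of `tolerance`."""
    low = plan.estimated_rows / tolerance
    high = plan.estimated_rows * tolerance
    return not (low <= observed_rows <= high)


def reoptimize(candidates: List[Plan], observed_rows: int) -> Plan:
    """Re-optimize: pick the candidate whose assumption is closest to the observation."""
    return min(candidates, key=lambda p: abs(p.estimated_rows - observed_rows))


def adapt(current: Plan, candidates: List[Plan], observed_rows: int) -> Plan:
    """Run one pass of the loop and return the plan that should be deployed."""
    status = measure(observed_rows)
    if analyze(current, status):
        return reoptimize(candidates, status)  # Deploy: the caller swaps in this plan
    return current
```

In this sketch the deploy step is reduced to returning the chosen plan; replacing part of a plan inside a running query is the hard part and is deliberately left out.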
The Query Optimizer requests specific statistics from the Cost Estimator component;
The Cost Estimator component in turn asks the Statistics Store whether such statistics exist. If so, the Statistics Store returns the requested statistics;
If the requested statistics are not found, a request for collecting them is generated;
Whenever the Optimizer component has generated an execution plan, the Injector component checks whether there are statistics requests that the current query could serve;
If possible, the Injector component selects the corresponding statistics request(s) and generates appropriate statistics operators that are integrated into the current query execution plan;
During query execution, the injected statistics operators emit statistical data that are transmitted to and written into the (distributed) statistics store for further use.
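The interplay of the components in the steps above can be sketched as follows. This is only an illustration under assumptions: the class and function names, the `rowcount:<source>` key format, and the choice of a row count as the only statistic are hypothetical and do not describe the actual Stratosphere design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class StatisticsStore:
    values: Dict[str, int] = field(default_factory=dict)  # statistics collected so far
    pending: Set[str] = field(default_factory=set)        # open collection requests

    def lookup(self, key: str) -> Optional[int]:
        return self.values.get(key)

    def request(self, key: str) -> None:
        self.pending.add(key)

    def write(self, key: str, value: int) -> None:
        self.values[key] = value
        self.pending.discard(key)


class CostEstimator:
    def __init__(self, store: StatisticsStore) -> None:
        self.store = store

    def cardinality(self, source: str) -> Optional[int]:
        """Return the stored statistic, or file a collection request if it is missing."""
        key = f"rowcount:{source}"
        value = self.store.lookup(key)
        if value is None:
            self.store.request(key)
        return value


def inject(plan_sources: List[str], store: StatisticsStore) -> List[str]:
    """Injector: select the pending requests that the current plan can serve."""
    keys = [f"rowcount:{source}" for source in plan_sources]
    return [key for key in keys if key in store.pending]


def execute(data: Dict[str, list], injected: List[str], store: StatisticsStore) -> None:
    """Execution: each injected statistics operator writes its result to the store."""
    for key in injected:
        source = key.split(":", 1)[1]
        store.write(key, len(data[source]))
```

A first query over a source finds no statistic and leaves a request behind; the next plan touching that source carries the injected operator, so later queries can be optimized with the collected value.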
Based on these steps, we are currently designing and implementing a Statistics Store that we shall integrate into the Stratosphere system. Such a Statistics Store is one of the necessary steps to make Stratosphere more amenable to RTBI queries. R. Bergmann will publish more detailed results in his Ph.D. thesis.
3 Modeling the Adversary’s Knowledge for Detecting Privacy Breaches
Here we see that G1 changed due to the query answer modeled by G2: ID Vertex 3 now has only one edge, relating it to Vertex C, since sensitive value C is the only one present in both graphs and hence the only value assignment for ID Vertex 3 that is consistent with both of them. Thus, at most one matching is possible between ID Vertex 3 and Vertex C, which represents the corresponding sensitive value.
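The effect described above can be reproduced with a small sketch that represents each bipartite graph as a map from ID vertices to their candidate sensitive values. The representation, the function names, and the concrete edge sets used in the usage note are assumptions made for illustration; only the outcome for ID Vertex 3 is taken from the text. The sketch covers plain edge intersection only and does not perform the further matching-based inference.

```python
from typing import Dict, Set

Graph = Dict[int, Set[str]]  # ID vertex -> candidate sensitive values (its edges)


def combine(known: Graph, answer: Graph) -> Graph:
    """Keep, for every ID vertex, only the edges that are consistent with both graphs."""
    merged = {ident: set(values) for ident, values in known.items()}
    for ident, candidates in answer.items():
        if ident in merged:
            merged[ident] &= candidates
        else:
            merged[ident] = set(candidates)
    return merged


def disclosed(graph: Graph) -> Dict[int, str]:
    """Return the ID vertices whose sensitive value is uniquely determined."""
    return {ident: next(iter(values))
            for ident, values in graph.items() if len(values) == 1}
```

With the hypothetical graphs `g1 = {1: {"A", "B", "C"}, 2: {"A", "B", "C"}, 3: {"A", "B", "C"}}` and `g2 = {3: {"C", "D"}, 4: {"C", "D"}}`, `combine(g1, g2)` leaves ID Vertex 3 with the single candidate `C`, and `disclosed` reports it.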
Using this approach to model the (increasing) knowledge of the adversary with an increasing number of queries, we developed a series of polynomial approximation algorithms (the general problem is NP-complete) to detect privacy violations before returning an answer to a submitted query. These polynomial algorithms are necessary to decide in real time whether to answer a query or not. Only then can we guarantee that such an approach is usable and feasible in an RTBI query execution environment. More detailed results are to be published soon in the Ph.D. thesis by Lukas Dölle.
This paper discusses two important challenges for implementing RTBI. The first challenge is to provide an adequate basis for long-running analytical queries by gathering statistical information about the data sources accessed. Our approach is to integrate statistical operators into user queries automatically and to store the statistical results in a statistics store for future use. Second, we focus on better supporting privacy protection by modeling the adversary’s knowledge as (a set of) bipartite graphs that allow us to perform inferences about the knowledge gained by each query. Lukas Dölle and Rico Bergmann are currently developing these approaches as part of their Ph.D. theses at the DBIS research group at Humboldt-Universität zu Berlin.
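To illustrate what deciding before answering can look like, the following sketch refuses a query whenever answering it would leave some individual with fewer than `k` candidate sensitive values. It is a deliberately simple stand-in that runs in time linear in the size of the answer, not one of the approximation algorithms mentioned above; the graph representation, the function name, and the threshold `k` are assumptions.

```python
from typing import Dict, Set

Graph = Dict[int, Set[str]]  # ID vertex -> candidate sensitive values


def safe_to_answer(known: Graph, answer: Graph, k: int = 2) -> bool:
    """Return False if the answer would narrow any ID vertex to fewer than k candidates."""
    for ident, candidates in answer.items():
        # Candidates that remain consistent with what the adversary already knows.
        remaining = known[ident] & candidates if ident in known else candidates
        if len(remaining) < k:
            return False
    return True
```

The check is conservative in one direction only: it catches direct narrowing of a single vertex but not breaches that follow from reasoning over matchings across several vertices.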
- 1. Stratosphere. http://www.stratosphere.eu. Accessed Dec 2013
- 2. Tazaldoo/tame. http://www.tame.it. Accessed Dec 2013
- 3. Hadoop. http://hadoop.apache.org/. Accessed Dec 2013
- 4. Renew London. http://renewlondon.com. Accessed Dec 2013
- 5. PresenceOrb. http://www.presenceorb.com/. Accessed Dec 2013
- 6. Hueske, F., Peters, M., Sax, M., Rheinländer, A., Bergmann, R., Krettek, A., Tzoumas, K.: Opening the black boxes in data flow optimization. PVLDB 5(11), 1256–1267 (2012)
- 8. Rundensteiner, E.A., Ding, L., Sutherland, T.M., Zhu, Y., Pielech, B., Mehta, N.: CAPE: continuous query engine with heterogeneous-grained adaptivity. In: VLDB Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, pp. 1353–1356 (2004)
- 9. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing information (abstract). In: PODS 1998, p. 188 (1998)
- 11. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI 2004, pp. 137–150 (2004)
- 12. Dölle, L.: Detecting privacy breaches when answering a sequence of queries. Ph.D. thesis (in German). Humboldt-Universität zu Berlin (2014)
- 13. Bergmann, R.: Gathering statistics for query adaptation. Ph.D. thesis (in German). Humboldt-Universität zu Berlin (2014)