Encyclopedia of Big Data Technologies

Living Edition
| Editors: Sherif Sakr, Albert Zomaya

Approximate Computing for Stream Analytics

  • Do Le Quoc
  • Ruichuan Chen
  • Pramod Bhatotia
  • Christof Fetzer
  • Volker Hilt
  • Thorsten Strufe
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_153-1

Abstract

Approximate computing has become a promising mechanism to trade off accuracy for efficiency. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing – based on the chosen sample size – can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, the state-of-the-art systems for approximate computing primarily target batch analytics, where the input data remains unchanged during the course of computation. Thus, they are not well-suited for stream analytics. This motivated the design of StreamApprox – a stream analytics system for approximate computing. To realize this idea, an online stratified reservoir sampling algorithm is designed to produce approximate output with rigorous error bounds. Importantly, the proposed algorithm is generic and can be applied to two prominent types of stream processing systems: (1) batched stream processing such as Apache Spark Streaming, and (2) pipelined stream processing such as Apache Flink.

Introduction

Stream analytics systems are extensively used in the context of modern online services to transform continuously arriving raw data streams into useful insights (Foundation 2017a; Murray et al. 2013; Zaharia et al. 2013). These systems target low-latency execution environments with strict service-level agreements (SLAs) for processing the input data streams.

In the current deployments, the low-latency requirement is usually achieved by employing more computing resources. Since most stream processing systems adopt a data-parallel programming model (Dean and Ghemawat 2004), almost linear scalability can be achieved with increased computing resources (Quoc et al. 2013, 2014, 2015a,b). However, this scalability comes at the cost of ineffective utilization of computing resources and reduced throughput of the system. Moreover, in some cases, processing the entire input data stream would require more than the available computing resources to meet the desired latency/throughput guarantees.

To strike a balance between these two desirable but contradictory design requirements – low latency and efficient utilization of computing resources – the approximate computing paradigm explores a novel design point to resolve this tension. In particular, approximate computing is based on the observation that many data analytics jobs are amenable to an approximate rather than the exact output (Doucet et al. 2000; Natarajan 1995). For such workflows, it is possible to trade off a small amount of output accuracy by computing over a subset instead of the entire data stream. Since computing over a subset of the input requires less time and fewer computing resources, approximate computing can achieve the desired latency and computing resource utilization.

Unfortunately, the advancements in approximate computing are primarily geared towards batch analytics (Agarwal et al. 2013; Srikanth et al. 2016), where the input data remains unchanged during the course of computation. In particular, these systems rely on pre-computing a set of samples over the static database and select an appropriate sample for query execution based on the user’s requirements (i.e., the query execution budget). Therefore, the state-of-the-art systems cannot be deployed in the context of stream processing, where new data continuously arrives as an unbounded stream.

As an alternative, one could in principle repurpose the sampling mechanisms available in well-known big data processing frameworks such as Apache Spark to build an approximate computing system for stream analytics. In fact, as a starting point for this work, such a system was designed and implemented for stream processing in Apache Spark based on its available sampling mechanisms. Unfortunately, Spark’s stratified sampling algorithm suffers from three key limitations for approximate computing. First, it operates in a “batch” fashion, i.e., all data items are first collected in a batch as Resilient Distributed Datasets (RDDs) (Zaharia et al. 2012), and only thereafter is the actual sampling carried out on the RDDs. Second, it does not handle the case where the arrival rate of sub-streams changes over time, because it requires a pre-defined sampling fraction for each stratum. Lastly, the stratified sampling algorithm implemented in Spark requires synchronization among workers for an expensive join operation, which imposes a significant latency overhead.

To address these limitations, this work designed an online stratified reservoir sampling algorithm for stream analytics. Unlike existing Spark-based systems, the algorithm performs the sampling process “on-the-fly” to reduce the latency as well as the overheads associated with forming RDDs. Importantly, the algorithm generalizes to two prominent types of stream processing models: (1) batched stream processing employed by Apache Spark Streaming (Foundation 2017b), and (2) pipelined stream processing employed by Apache Flink (Foundation 2017a).

More specifically, the proposed sampling algorithm makes use of two techniques: reservoir sampling and stratified sampling. It performs reservoir sampling for each sub-stream by creating a fixed-size reservoir per stratum. Thereafter, it assigns weights to all strata according to their arrival rates to preserve the statistical quality of the original data stream. The proposed sampling algorithm naturally adapts to varying arrival rates of sub-streams and requires no synchronization among workers (see section “Design”). Based on this sampling algorithm, StreamApprox – an approximate computing system for stream analytics – is designed.

Overview and Background

This section gives an overview of StreamApprox (section “System Overview”), its computational model (section “Computational Model”), and its design assumptions (section “Design Assumptions”).

System Overview

StreamApprox is designed for real-time stream analytics. In this system, the input data stream usually consists of data items arriving from diverse sources. The data items from each source form a sub-stream. The system makes use of a stream aggregator (e.g., Apache Kafka; Foundation 2017c) to combine the incoming data items from disjoint sub-streams. StreamApprox then takes this combined stream as the input for data analytics.
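
To illustrate the ingestion step, the following is a minimal sketch of consuming a combined stream from a Kafka topic and recovering each item's sub-stream (source) identifier, using the kafka-python client. The topic name, broker address, and message layout are illustrative assumptions, not part of StreamApprox.

```python
# Minimal sketch (not StreamApprox's actual ingestion code): consume a combined
# stream from a Kafka topic and tag each item with its source (sub-stream) id.
# The topic name, broker address, and message layout are illustrative assumptions.
import json
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "sensor-readings",                      # hypothetical topic fed by all sources
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    item = msg.value                        # e.g., {"source_id": "sensor-42", "value": 21.3}
    sub_stream_id = item["source_id"]       # identifies the stratum (sub-stream)
    # ... hand (sub_stream_id, item["value"]) to the sampling stage ...
```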

StreamApprox facilitates data analytics on the input stream by providing an interface for users to specify the streaming query and its corresponding query budget. The query budget can be in the form of expected latency/throughput guarantees, available computing resources, or the accuracy level of query results.

StreamApprox ensures that the input stream is processed within the specified query budget. To achieve this goal, the system makes use of approximate computing by processing only a subset of data items from the input stream and producing an approximate output with rigorous error bounds. In particular, StreamApprox uses a parallelizable online sampling technique to select and process a subset of data items, where the sample size can be determined based on the query budget.

Computational Model

The state-of-the-art distributed stream processing systems can be classified into two prominent categories: (i) the batched stream processing model and (ii) the pipelined stream processing model. These systems offer three main advantages: (a) efficient fault tolerance, (b) “exactly-once” semantics, and (c) a unified programming model for both batch and stream analytics. The proposed algorithm for approximate computing is generalizable to both stream processing models and preserves their advantages.

Batched stream processing model.

In this computational model, an input data stream is divided into small batches using a pre-defined batch interval, and each such batch is processed via a distributed data-parallel job. Apache Spark Streaming (Foundation 2017b) adopted this model to process input data streams.
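
For concreteness, the following is a minimal PySpark Streaming sketch of the batched (micro-batch) model. The socket source, host, port, and the 1-second batch interval are illustrative assumptions; it is not StreamApprox code.

```python
# Minimal sketch of the batched (micro-batch) model with Spark Streaming.
# The socket source, host/port, and the 1-second batch interval are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="BatchedStreamExample")
ssc = StreamingContext(sc, batchDuration=1)        # each 1-second batch becomes an RDD

lines = ssc.socketTextStream("localhost", 9999)    # unbounded input, chopped into batches
counts = lines.count()                             # a data-parallel job runs per batch
counts.pprint()

ssc.start()
ssc.awaitTermination()
```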

Pipelined stream processing model.

In contrast to the batched stream processing model, the pipelined model streams each data item to the next operator as soon as the item is ready to be processed without forming the whole batch. Thus, this model achieves low latency. Apache Flink (Foundation 2017a) implements this model to provide a truly native stream processing engine.

Note that both stream processing models support the time-based sliding window computation (Bhatotia et al. 2014). The processing window slides over the input stream, whereby the newly incoming data items are added to the window and the old data items are removed from the window. The number of data items within a sliding window may vary in accordance with the arrival rate of data items.
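
The following small sketch illustrates a time-based sliding window, assuming items arrive as (timestamp, value) pairs; the window length is an illustrative choice.

```python
# Minimal sketch of a time-based sliding window: keep only items whose timestamps
# fall within the last `window_size_sec` seconds of the most recent arrival.
from collections import deque

class SlidingWindow:
    def __init__(self, window_size_sec: float):
        self.window_size = window_size_sec
        self.items = deque()                     # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value):
        self.items.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now: float):
        # Drop items that have fallen out of the window.
        while self.items and self.items[0][0] <= now - self.window_size:
            self.items.popleft()

    def contents(self):
        return [v for (_, v) in self.items]

# The number of items in the window varies with the arrival rate:
w = SlidingWindow(window_size_sec=10.0)
w.add(1.0, "a"); w.add(4.0, "b"); w.add(12.5, "c")   # "a" is evicted at t = 12.5
```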

Design Assumptions

StreamApprox is based on the following assumptions. The possible means to address these assumptions are discussed in section “Discussion”.
  1. There exists a virtual cost function which translates a given query budget (such as the expected latency guarantees, or the required accuracy level of query results) into the appropriate sample size.
  2. The input stream is stratified based on the source of data items, i.e., the data items from each sub-stream follow the same distribution and are mutually independent. Here, a stratum refers to one sub-stream. If multiple sub-streams have the same distribution, they are combined to form a stratum.

Design

This section first presents StreamApprox’s workflow (section “System Workflow”), and then describes its sampling mechanism (section “Online Adaptive Stratified Reservoir Sampling”) and its error estimation mechanism (section “Error Estimation”) (see Quoc et al. 2017c,d for details).

System Workflow

This section shows the workflow of StreamApprox. The system takes the user-specified streaming query and the query budget as the input. Then it executes the query on the input data stream as a sliding window computation (see section “Computational Model”).

For each time interval, StreamApprox first derives the sample size (sampleSize) using a cost function based on the given query budget. Next, the system applies the proposed sampling algorithm (detailed in section “Online Adaptive Stratified Reservoir Sampling”) to select the appropriate sample in an online fashion. This sampling algorithm further ensures that data items from all sub-streams are fairly selected for the sample and that no single sub-stream is overlooked.

Thereafter, the system executes a data-parallel job to process the user-defined query on the selected sample. As the last step, the system performs an error estimation mechanism (as described in section “Error Estimation”) to compute the error bounds for the approximate query result in the form of output ± error bound. The whole process repeats for each time interval as the computation window slides (Bhatotia et al. 2012a).
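
To make the per-interval workflow concrete, the following is a minimal orchestration sketch. The four callables (cost_fn, sample_fn, query_fn, error_fn) are hypothetical placeholders for the steps described above – cost function, OASRS sampling, query execution, and error estimation – and are not part of StreamApprox’s actual API.

```python
# Minimal orchestration sketch of one sliding-window interval. The four callables
# are hypothetical placeholders for the steps described above, not StreamApprox's API.
def process_interval(window_items, budget, cost_fn, sample_fn, query_fn, error_fn):
    sample_size = cost_fn(budget)                     # query budget -> sample size
    sample, weights, counts = sample_fn(window_items, sample_size)
    output = query_fn(sample, weights)                # data-parallel job on the sample
    bound = error_fn(sample, weights, counts)         # rigorous error bound
    return output, bound                              # reported as "output ± error bound"
```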

Online Adaptive Stratified Reservoir Sampling

To realize real-time stream analytics, a novel sampling technique called Online Adaptive Stratified Reservoir Sampling (OASRS) is proposed. It achieves the benefits of both stratified and reservoir sampling without their drawbacks. Specifically, OASRS does not overlook any sub-stream regardless of its popularity, does not need to know the statistics of sub-streams before the sampling process, and runs efficiently in real time in a distributed manner.

The high-level idea of OASRS is simple. The algorithm first stratifies the input stream into sub-streams according to their sources. The data items from each sub-stream are assumed to follow the same distribution and to be mutually independent. (Here, a stratum refers to one sub-stream. If multiple sub-streams have the same distribution, they can be combined to form a stratum.) The algorithm then performs reservoir sampling for each sub-stream independently. To do so, every time a new sub-stream Si is encountered, its sample size Ni is determined according to an adaptive cost function that considers the specified query budget. For each sub-stream Si, the algorithm performs traditional reservoir sampling to select items at random from this sub-stream, ensuring that the total number of selected items from Si does not exceed its sample size Ni. In addition, the algorithm maintains a counter Ci to measure the number of items received from Si within the concerned time interval.

Applying reservoir sampling to each sub-stream Si ensures that the algorithm randomly selects at most Ni items from each sub-stream. The selected items from different sub-streams, however, should not be treated equally. In particular, for a sub-stream Si, if Ci > Ni (i.e., the sub-stream Si has more than Ni items in total during the concerned time interval), the algorithm randomly selects Ni items from this sub-stream and each selected item represents Ci/Ni original items on average; otherwise, if Ci ≤ Ni, the algorithm selects all the received Ci items, so that each selected item represents only itself. As a result, in order to statistically recreate the original items from the selected items, the algorithm assigns a specific weight Wi to the items selected from each sub-stream Si:
$$\displaystyle \begin{aligned} W_i = \begin{cases} C_i / N_i & \text{if } C_i > N_i \\ 1 & \text{if } C_i \leq N_i \end{cases} \end{aligned} $$
(1)
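
The following self-contained Python sketch illustrates the OASRS idea exactly as described above – one fixed-size reservoir and one counter per stratum, plus the weight rule of Eq. (1). It is an illustration of the published algorithm, not the authors’ implementation, and the fixed per-stratum sample size is a simplifying assumption.

```python
# Self-contained sketch of OASRS as described above: one fixed-size reservoir and
# one counter per sub-stream (stratum), plus the weight rule of Eq. (1).
# Illustration only; the fixed per-stratum sample size `n` is a simplifying assumption.
import random

class OASRS:
    def __init__(self, sample_size_per_stratum):
        self.n = sample_size_per_stratum   # N_i (same for every stratum in this sketch)
        self.reservoirs = {}               # stratum id -> list of sampled items
        self.counters = {}                 # stratum id -> C_i, items seen this interval

    def add(self, stratum, item):
        """Process one arriving item with classic reservoir sampling per stratum."""
        res = self.reservoirs.setdefault(stratum, [])
        c = self.counters.get(stratum, 0) + 1
        self.counters[stratum] = c
        if len(res) < self.n:
            res.append(item)               # reservoir not yet full: keep the item
        else:
            j = random.randrange(c)        # keep the new item with probability N_i / C_i
            if j < self.n:
                res[j] = item

    def weight(self, stratum):
        """Eq. (1): each sampled item represents C_i / N_i originals (or itself)."""
        c, y = self.counters[stratum], len(self.reservoirs[stratum])
        return c / y if c > y else 1.0

    def sample(self):
        """Return (item, weight) pairs across all strata for downstream queries."""
        return [(item, self.weight(s))
                for s, res in self.reservoirs.items() for item in res]
```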

StreamApprox supports approximate linear queries, which return an approximate weighted sum of all items received from all sub-streams. Though linear queries are simple, they can be extended to support a wide range of statistical learning algorithms (Blum et al. 2005, 2008). It is also worth mentioning that OASRS not only works for a concerned time interval (e.g., a sliding time window) but also works with unbounded data streams.
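
The approximate answer to such a linear query can be obtained directly from the weighted sample; the one-line sketch below assumes numeric items and the (item, weight) pairs produced by the sampler sketched above.

```python
# Sketch: approximate SUM over all sub-streams as the weighted sum of sampled items.
# Assumes numeric items and the (item, weight) pairs produced by the sampler above.
def approximate_sum(weighted_sample):
    return sum(weight * value for value, weight in weighted_sample)
```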

Distributed execution.

OASRS can naturally run in a distributed fashion as it does not require synchronization. One straightforward approach is to have each sub-stream Si handled by a set of w worker nodes. Each worker node samples an equal portion of items from this sub-stream and generates a local reservoir of size no larger than Ni/w. In addition, each worker node maintains a local counter measuring the number of items it has received within a concerned time interval, for weight calculation. The rest of the design remains the same.
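
The following sketch shows the merge step of this distributed variant: each of the w workers runs the sampler above on its share of a sub-stream with a reservoir of size at most Ni/w, and a coordinator only concatenates reservoirs and sums counters – no synchronization is needed during sampling. The uniform partitioning of the input across workers is an assumption of the sketch.

```python
# Sketch of merging per-worker results for one stratum in the distributed variant.
def merge_workers(local_reservoirs, local_counters):
    """local_reservoirs: list of item lists (one per worker for the same stratum);
       local_counters:   list of item counts seen by each worker."""
    merged_sample = [item for res in local_reservoirs for item in res]
    c_i = sum(local_counters)                    # total items received for this stratum
    y_i = len(merged_sample)                     # total items actually sampled
    weight = c_i / y_i if c_i > y_i else 1.0     # Eq. (1) applied to the merged sample
    return merged_sample, weight
```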

Error Estimation

This section describes how OASRS is applied to randomly sample the input data stream and generate approximate results for linear queries. It then presents a method to estimate the accuracy of these approximate results via rigorous error bounds.

As in section “Online Adaptive Stratified Reservoir Sampling”, suppose the input data stream contains X sub-streams \(\{S_i\}_{i=1}^{X}\). StreamApprox computes the approximate sum of all items received from all sub-streams by randomly sampling only Yi items from each sub-stream Si. As each sub-stream is sampled independently, the variance of the approximate sum is: \(\mathrm{Var}(\mathrm{SUM}) = \sum_{i=1}^{X} \mathrm{Var}(\mathrm{SUM}_i)\).

Further, as items are randomly selected for a sample within each sub-stream, according to the random sampling theory (Thompson 2012), the variance of the approximate sum can be estimated as:
$$\displaystyle \begin{aligned} \widehat{\mathrm{Var}}(\mathrm{SUM}) = \sum_{i = 1}^{X} \Big(C_i\times(C_i - Y_i)\times \frac{s^2_{i}}{Y_{i}} \Big) \end{aligned} $$
(2)
Here, Ci denotes the total number of items from the sub-stream Si, and si denotes the standard deviation of the sub-stream Si’s sampled items:
$$\displaystyle \begin{aligned} s^2_{i} &= \frac{1}{Y_i - 1} \times \sum_{j = 1}^{Y_i} (I_{i,j} - \bar{I_i})^2, \mathrm{where}\notag\\ \bar{I_i} &= \frac{1}{Y_i}\times \sum_{j = 1}^{Y_i} I_{i,j} \end{aligned} $$
(3)
Next, the estimation of the variance of the approximate mean value of all items received from all the X sub-streams is described. This approximate mean value can be computed as:
$$\displaystyle \begin{aligned} \mathrm{MEAN} &= \frac{\mathrm{SUM}}{\sum_{i=1}^{X} C_i} = \frac{\sum_{i=1}^{X} (C_i \times \mathrm{MEAN}_i)}{\sum_{i=1}^{X} C_i}\notag\\ &= \sum_{i=1}^{X} (\omega_i \times \mathrm{MEAN}_i) \end{aligned} $$
(4)
Here, \(\omega _i = \frac {C_i}{\sum _{i=1}^{X} C_i}\). Then, as each sub-stream is sampled independently, according to the random sampling theory (Thompson 2012), the variance of the approximate mean value can be estimated as:
$$\displaystyle \begin{aligned} \widehat{\mathrm{Var}}(\mathrm{MEAN}) &= \sum_{i=1}^{X} \mathrm{Var}(\omega_{i} \times \mathrm{MEAN}_i) \notag\\ &= \sum_{i=1}^{X} \Big( \omega_{i}^2 \times \mathrm{Var}(\mathrm{MEAN}_i) \Big)\notag\\ &= \sum_{i=1}^{X} \Big( \omega_{i}^2 \times \frac{s^2_{i}}{Y_i}\times \frac{C_i - Y_i}{C_i} \Big){} \end{aligned} $$
(5)

Above, the estimation of the variances of the approximate sum and the approximate mean of the input data stream has been shown. Similarly, the variance of the approximate result of any linear query can also be estimated by applying the random sampling theory.

Error bound.

According to the “68-95-99.7” rule (Wikipedia 2017), the approximate result falls within one, two, and three standard deviations of the true result with probabilities of 68%, 95%, and 99.7%, respectively, where the standard deviation is the square root of the variance computed above. This error estimation is critical because it gives a quantitative understanding of the accuracy of the proposed sampling technique.
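
The sketch below puts Eqs. (1)–(3) and the “68-95-99.7” rule together for an approximate SUM: it assumes, as inputs, the list of sampled numeric values Yi and the counter Ci for each sub-stream, and returns the estimate together with a two-standard-deviation (roughly 95% confidence) error bound.

```python
# Sketch of error estimation for an approximate SUM (Eqs. (1)-(3)) with a
# "68-95-99.7" error bound. `samples[i]` holds the Y_i sampled numeric values of
# sub-stream S_i and `counts[i]` holds C_i; both are assumed inputs.
import math
from statistics import variance

def approx_sum_with_error(samples, counts, num_std_devs=2):
    total, var_sum = 0.0, 0.0
    for values, c_i in zip(samples, counts):
        y_i = len(values)
        if y_i == 0:
            continue                                   # nothing sampled from this stratum
        w_i = c_i / y_i if c_i > y_i else 1.0          # Eq. (1): per-item weight
        total += w_i * sum(values)                     # weighted contribution to SUM
        if y_i > 1:
            s2_i = variance(values)                    # Eq. (3): sample variance s_i^2
            var_sum += c_i * (c_i - y_i) * s2_i / y_i  # Eq. (2): variance of SUM_i
    error = num_std_devs * math.sqrt(var_sum)          # two std devs ~ 95% confidence
    return total, error                                # reported as "total ± error"

# Example: two sub-streams with 1,000 and 400 items seen, 100 sampled from each.
est, err = approx_sum_with_error(
    samples=[[1.0] * 50 + [2.0] * 50, [5.0] * 100],
    counts=[1000, 400],
)
```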

Discussion

The design of StreamApprox is based on the assumptions mentioned in section “Design Assumptions”. This section discusses some approaches that could be used to meet the assumptions.

I: Virtual cost function.

This work currently assumes that there exists a virtual cost function to translate a user-specified query budget into the sample size. The query budget could be specified as available computing resources, desired accuracy, or desired latency.

For instance, with an accuracy budget, the sample size for each sub-stream can be determined based on the desired width of the confidence interval using Eq. (5) and the “68-95-99.7” rule (see the sketch after this paragraph). A latency budget can be specified by defining the window time interval or the slide interval for the computations over the input data stream. It becomes a bit more challenging to specify a budget for resource utilization. Nevertheless, there are two existing techniques that could be used to implement such a cost function to achieve the desired resource target: (a) a virtual data center (Angel et al. 2014) and (b) a resource prediction model (Wieder et al. 2012) for latency requirements.
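
The following sketch shows how such a cost function might translate an accuracy budget – the desired half-width of a roughly 95% confidence interval for a stratum’s mean – into a per-stratum sample size, by solving Eq. (5) specialized to a single stratum. The standard-deviation estimate s_i (e.g., from earlier windows) and the expected stratum size C_i are assumed inputs; this is one possible cost function, not the system’s actual one.

```python
# Sketch of a virtual cost function for an accuracy budget: pick the smallest
# per-stratum sample size Y_i such that the ~95% confidence half-width of the
# stratum mean (two standard deviations, via Eq. (5)) stays below `half_width`.
import math

def sample_size_for_accuracy(s_i, c_i, half_width, num_std_devs=2):
    v = (half_width / num_std_devs) ** 2        # target variance of the stratum mean
    n0 = (s_i ** 2) / v                         # sample size ignoring the finite population
    y_i = n0 / (1.0 + n0 / c_i)                 # finite-population correction for C_i items
    return min(c_i, math.ceil(y_i))

# Example: s_i = 4.0, about 10,000 items expected, ±0.5 accuracy at ~95% confidence.
print(sample_size_for_accuracy(4.0, 10_000, 0.5))   # roughly 250
```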

Pulsar (Angel et al. 2014) proposes the abstraction of a virtual data center (VDC) to provide performance guarantees to tenants in the cloud. In particular, Pulsar makes use of a virtual cost function to translate the cost of processing a request into the required computational resources using a multi-resource token algorithm. The cost function could be adapted for StreamApprox as follows: a data item in the input stream is considered as a request, and the “amount of resources” required to process it as the cost in tokens. The given resource budget is also converted into tokens using the pre-advertised cost model per resource. This allows computing the sample size that can be processed within the given resource budget.
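
A very rough sketch of this token-based adaptation follows: the resource budget for one interval is expressed in tokens, each item costs some tokens to process, and the sample size is simply the number of items that fit in the budget. The constant per-item cost and all numbers are illustrative assumptions, not Pulsar’s or StreamApprox’s actual accounting.

```python
# Very rough sketch of a Pulsar-style token budget: how many items fit in the
# per-interval token budget. The constant per-item cost is an illustrative assumption.
def sample_size_for_tokens(token_budget, tokens_per_item):
    return max(0, token_budget // tokens_per_item)

print(sample_size_for_tokens(token_budget=50_000, tokens_per_item=8))  # 6250 items
```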

For any given latency requirement, a resource prediction model (Wieder et al. 2010a,b, 2012) could be employed. In particular, the prediction model could be built by analyzing the diurnal patterns in resource usage (Charles et al. 2012) to predict the future resource requirement for the given latency budget. This resource requirement can then be mapped to the desired sample size using the same approach as described above.

II: Stratified sampling.

This work currently assumes that the input stream is already stratified based on the source of data items, i.e., the data items within each stratum follow the same distribution – it does not have to be a normal distribution. This assumption ensures that the error estimation mechanism remains valid, since StreamApprox applies the Central Limit Theorem. For example, consider an IoT use-case which analyzes data streams from sensors to measure the temperature of a city. The data stream from each individual sensor follows the same distribution since it measures the temperature at the same location in the city. Therefore, a straightforward way to stratify the input data streams is to consider each sensor’s data stream as a stratum (sub-stream), as sketched below. In more complex cases where StreamApprox cannot classify strata based on the sources, the system needs a pre-processing step to stratify the input data stream. This stratification problem is orthogonal to this work; nevertheless, for completeness, two proposals for the stratification of evolving streams, bootstrap (Dziuda 2010) and semi-supervised learning (Masud et al. 2012), are discussed in this section.
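
The tiny sketch below illustrates this straightforward stratification by source; the item layout (a sensor identifier and a temperature reading per item) is an illustrative assumption.

```python
# Tiny sketch of stratifying by source: items carrying a sensor id are grouped so
# that each sensor's readings form one stratum (sub-stream).
from collections import defaultdict

def stratify_by_source(items):
    strata = defaultdict(list)
    for item in items:                    # e.g., {"sensor_id": "s-7", "temp": 18.4}
        strata[item["sensor_id"]].append(item["temp"])
    return strata
```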

Bootstrap (Dziuda 2010) is a well-studied non-parametric sampling technique in statistics for estimating the distribution of a given population. In particular, the bootstrap technique randomly selects “bootstrap samples” with replacement to estimate unknown parameters of a population, for instance, by averaging the bootstrap samples. A bootstrap-based estimator can be employed for the stratification of incoming sub-streams. Alternatively, a semi-supervised algorithm (Masud et al. 2012) could be used to stratify a data stream. The advantage of this algorithm is that it can work with both labeled and unlabeled streams to train a classification model.

Related Work

Over the last two decades, the database community has proposed various approximation techniques based on sampling (Al-Kateb and Lee 2010; Garofalakis and Gibbon 2001), online aggregation (Hellerstein et al. 1997), and sketches (Cormode et al. 2012). These techniques make different trade-offs with respect to the output quality, supported queries, and workload. However, the early work in approximate computing was mainly geared towards centralized database architectures.

Recently, sampling-based approaches have been successfully adopted for distributed data analytics (Agarwal et al. 2013; Srikanth et al. 2016; Krishnan et al. 2016; Quoc et al. 2017b,a). In particular, BlinkDB (Agarwal et al. 2013) proposes an approximate distributed query processing engine that uses stratified sampling (Al-Kateb and Lee 2010) to support ad-hoc queries with error and response time constraints. Like BlinkDB, Quickr (Srikanth et al. 2016) also supports complex ad-hoc queries in big-data clusters. Quickr deploys distributed sampling operators to reduce execution costs of parallelized queries. In particular, Quickr first injects sampling operators into the query plan; thereafter, it searches for an optimal query plan among sampled query plans to execute input queries. However, these “big data” systems target batch processing and cannot provide required low-latency guarantees for stream analytics.

IncApprox (Krishnan et al. 2016) is a data analytics system that combines two computing paradigms together, namely, approximate and incremental computations (Bhatotia et al. 2011a,b, 2012b) for stream analytics. The system is based on an online “biased sampling” algorithm that uses self-adjusting computation (Bhatotia 2015; Bhatotia et al. 2015) to produce incrementally updated approximate output. Lastly, PrivApprox (Quoc et al. 2017a,b) supports privacy-preserving data analytics using a combination of randomized response and approximate computation. By contrast, StreamApprox supports low-latency in stream processing by employing the proposed “online” sampling algorithm solely for approximate computing, while avoiding the limitations of existing sampling algorithms.

Conclusion

This entry presents StreamApprox, a stream analytics system for approximate computing. StreamApprox allows users to make a systematic trade-off between the output accuracy and the computation efficiency. To achieve this goal, StreamApprox employs an online stratified reservoir sampling algorithm which ensures the statistical quality of the sample selected from the input data stream. The proposed sampling algorithm is generalizable to two prominent types of stream processing models: the batched and pipelined stream processing models.

References

  1. Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the ACM European conference on computer systems (EuroSys)
  2. Al-Kateb M, Lee BS (2010) Stratified reservoir sampling over heterogeneous data streams. In: Proceedings of the 22nd international conference on scientific and statistical database management (SSDBM)
  3. Angel S, Ballani H, Karagiannis T, O’Shea G, Thereska E (2014) End-to-end performance isolation through virtual datacenters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
  4. Bhatotia P (2015) Incremental parallel and distributed systems. PhD thesis, Max Planck Institute for Software Systems (MPI-SWS)
  5. Bhatotia P, Wieder A, Akkus IE, Rodrigues R, Acar UA (2011a) Large-scale incremental data processing with change propagation. In: Proceedings of the conference on hot topics in cloud computing (HotCloud)
  6. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquini R (2011b) Incoop: MapReduce for incremental computations. In: Proceedings of the ACM symposium on cloud computing (SoCC)
  7. Bhatotia P, Dischinger M, Rodrigues R, Acar UA (2012a) Slider: incremental sliding-window computations for large-scale data analysis. Technical report MPI-SWS-2012-004, MPI-SWS. http://www.mpi-sws.org/tr/2012-004.pdf
  8. Bhatotia P, Rodrigues R, Verma A (2012b) Shredder: GPU-accelerated incremental storage and computation. In: Proceedings of the USENIX conference on file and storage technologies (FAST)
  9. Bhatotia P, Acar UA, Junqueira FP, Rodrigues R (2014) Slider: incremental sliding window analytics. In: Proceedings of the 15th international middleware conference (Middleware)
  10. Bhatotia P, Fonseca P, Acar UA, Brandenburg B, Rodrigues R (2015) iThreads: a threading library for parallel incremental computation. In: Proceedings of the 20th international conference on architectural support for programming languages and operating systems (ASPLOS)
  11. Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the SuLQ framework. In: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS)
  12. Blum A, Ligett K, Roth A (2008) A learning theory approach to non-interactive database privacy. In: Proceedings of the fortieth annual ACM symposium on theory of computing (STOC)
  13. Charles R, Alexey T, Gregory G, Randy HK, Michael K (2012) Towards understanding heterogeneous clouds at scale: Google trace analysis. Technical report
  14. Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases. Now Publishers, Boston
  15. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)
  16. Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Stat Comput 10:197–208
  17. Dziuda DM (2010) Data mining for genomics and proteomics: analysis of gene and protein expression data. Wiley, Hoboken
  18. Foundation AS (2017a) Apache Flink. https://flink.apache.org
  19. Foundation AS (2017b) Apache Spark Streaming. https://spark.apache.org/streaming
  20. Foundation AS (2017c) Kafka – a high-throughput distributed messaging system. https://kafka.apache.org
  21. Garofalakis MN, Gibbon PB (2001) Approximate query processing: taming the terabytes. In: Proceedings of the international conference on very large data bases (VLDB)
  22. Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
  23. Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th international conference on World Wide Web (WWW)
  24. Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33:213–244
  25. Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M (2013) Naiad: a timely dataflow system. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)
  26. Natarajan S (1995) Imprecise and approximate computation. Kluwer Academic Publishers, Boston
  27. Quoc DL, Martin A, Fetzer C (2013) Scalable and real-time deep packet inspection. In: Proceedings of the 2013 IEEE/ACM 6th international conference on utility and cloud computing (UCC)
  28. Quoc DL, Yazdanov L, Fetzer C (2014) Dolen: user-side multi-cloud application monitoring. In: International conference on future internet of things and cloud (FICLOUD)
  29. Quoc DL, D’Alessandro V, Park B, Romano L, Fetzer C (2015a) Scalable network traffic classification using distributed support vector machines. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)
  30. Quoc DL, Fetzer C, Felber P, Rivière É, Schiavoni V, Sutra P (2015b) UniCrawl: a practical geographically distributed web crawler. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)
  31. Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017a) Privacy preserving stream analytics: the marriage of randomized response and approximate computing. https://arxiv.org/abs/1701.05403
  32. Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017b) PrivApprox: privacy-preserving stream analytics. In: Proceedings of the 2017 USENIX annual technical conference (USENIX ATC)
  33. Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017c) Approximate stream analytics in Apache Flink and Apache Spark Streaming. CoRR, abs/1709.02946
  34. Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017d) StreamApprox: approximate computing for stream analytics. In: Proceedings of the international middleware conference (Middleware)
  35. Srikanth K, Anil S, Aleksandar V, Matthaios O, Robert G, Surajit C, Ding B (2016) Quickr: lazily approximating complex ad-hoc queries in big data clusters. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)
  36. Thompson SK (2012) Sampling. Wiley series in probability and statistics. Wiley, Hoboken
  37. Wieder A, Bhatotia P, Post A, Rodrigues R (2010a) Brief announcement: modelling MapReduce for optimal execution in the cloud. In: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on principles of distributed computing (PODC)
  38. Wieder A, Bhatotia P, Post A, Rodrigues R (2010b) Conductor: orchestrating the clouds. In: Proceedings of the 4th international workshop on large scale distributed systems and middleware (LADIS)
  39. Wieder A, Bhatotia P, Post A, Rodrigues R (2012) Orchestrating the deployment of computations in the cloud with Conductor. In: Proceedings of the 9th USENIX symposium on networked systems design and implementation (NSDI)
  40. Wikipedia (2017) 68-95-99.7 rule
  41. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI)
  42. Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Do Le Quoc (1)
  • Ruichuan Chen (2)
  • Pramod Bhatotia (3)
  • Christof Fetzer (1)
  • Volker Hilt (2)
  • Thorsten Strufe (1)

  1. TU Dresden, Dresden, Germany
  2. Nokia Bell Labs, Stuttgart, Germany
  3. University of Edinburgh and Alan Turing Institute, Edinburgh, UK

Section editors and affiliations

  • Asterios Katsifodimos (1)
  • Pramod Bhatotia (2)

  1. Delft University of Technology, Delft, Netherlands
  2. School of Informatics, University of Edinburgh, Edinburgh, United Kingdom