Introduction

In this work, we consider resource allocation problems in database systems using quantitative methods from operations research. Specifically, we seek to run database workloads efficiently and, at the same time, to avoid poor performance in case of stochastic or even uncertain future workloads. In this context, we look for risk-averse decision support models that help to optimize whether to store, create, or remove indexes by taking into account different potential workload scenarios, which can—to some extent—be anticipated based on historic data.

Motivation and Background

Indexes are auxiliary data structures used in a relational database system to reduce the execution time required for generating the result of a database query. The shorter the execution time of a workload’s query set, the more queries can be executed per time unit. Consequently, reducing query execution times implicitly increases the throughput of the database. However, indexes have to be stored in addition to the data of the database itself, which leads to additional memory consumption and increases the overall memory footprint of the database. Memory capacity is limited and, therefore, a valuable resource. For this reason, it is important to take memory consumption into account when deciding which indexes to store in the system’s memory.

Table 1 Sample index combinations with their memory consumption and the resulting execution times of a hypothetical query, cf. Weisgut et al. [23]

For each database query, different indexes may exist, each of which can improve the query execution time differently. Table 1 shows an exemplary scenario in which different combinations of indexes lead to different execution times of a single hypothetical query. The first combination, without any index, leads to the longest execution time of 500 ms. The best execution time of 300 ms can be achieved using both index 1 and index 2. However, the best-performing combination regarding execution time also has the largest memory footprint. The second-best solution from an execution time perspective requires only 40% of the index memory of the optimal solution and is only about 17% slower. This simple example illustrates the need to consider index memory consumption when selecting which indexes should be used.
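The trade-off described above can be reproduced with a small calculation. The following sketch uses hypothetical memory values (the text only states the 40% ratio, so the absolute numbers 25 and 10 are assumptions) together with the execution times from the example:

```python
# Hypothetical numbers matching the example in the text: execution time (ms)
# and index memory consumption (units) per index combination. The absolute
# memory values are assumptions; only their 40% ratio is given in the text.
combinations = {
    "none":      {"time_ms": 500, "memory": 0},
    "index1+2":  {"time_ms": 300, "memory": 25},  # fastest, largest footprint
    "index2":    {"time_ms": 350, "memory": 10},  # second-best execution time
}

best = combinations["index1+2"]
second = combinations["index2"]

slowdown = second["time_ms"] / best["time_ms"] - 1   # about 17% slower
memory_ratio = second["memory"] / best["memory"]     # 40% of the memory
```

The second-best combination trades roughly one sixth more execution time for a 60% reduction in index memory, which is exactly the kind of trade-off the memory budget in the later models arbitrates.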

Of course, in real-world database scenarios, a database management system processes more than only a single query. Instead, a set of queries is executed on a database with a certain frequency in a specific time frame for each query. The set of queries with their frequencies is referred to as workload. Executing a workload using a selection of indexes has a certain performance. This performance is characterized by the total execution time of the workload and the selected indexes’ memory consumption. The workload execution time should be as low as possible, and the index memory consumption must not exceed a specific memory budget.

A further challenge of selecting the set of indexes that shall be present and used by queries is index interaction. “Informally, an index a interacts with an index b if the benefit of a is affected by the presence of b and vice versa” [19]. For example, assume a particular index i provides the best performance improvement for each query in a subset S of the overall workload, and no other index has a better accumulated performance improvement. Suppose i has such a high memory consumption that the available index memory budget is fully exhausted. In that case, no other index can be created, and only queries of the subset S are improved by index i. Another index selection might be worse for the workload subset S but better for the overall workload. Consequently, a (greedily chosen) single index whose accumulated performance improvement is the highest is not necessarily part of the set of indexes that provides the best performance improvement for the total workload.
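This greedy-versus-optimal effect can be demonstrated on a tiny constructed instance. The following sketch uses entirely hypothetical indexes and benefit values: index A helps all queries a little, while B and C each help one query a lot; with a budget of 10 units, the greedy choice A exhausts the budget but is not part of the optimal selection:

```python
from itertools import combinations

# Hypothetical instance: per-index memory and per-query benefit (time saved).
memory = {"A": 10, "B": 5, "C": 5}
benefit = {  # benefit[index][query]
    "A": {"q1": 30, "q2": 30, "q3": 30},
    "B": {"q1": 50, "q2": 0, "q3": 0},
    "C": {"q1": 0, "q2": 50, "q3": 0},
}
budget = 10

def total_benefit(selection):
    # Each query uses its best available index (or none).
    return sum(max((benefit[i][q] for i in selection), default=0)
               for q in ("q1", "q2", "q3"))

# Greedy: the single index with the highest accumulated benefit (A, 90 units)
# fully exhausts the memory budget on its own.
greedy = max(memory, key=lambda i: total_benefit({i}))

# Exact: enumerate all memory-feasible subsets.
feasible = [set(s) for r in range(len(memory) + 1)
            for s in combinations(memory, r)
            if sum(memory[i] for i in s) <= budget]
optimal = max(feasible, key=total_benefit)  # {B, C} with benefit 100
```

Here the greedily best index A yields a benefit of 90, whereas the feasible pair {B, C} yields 100, so the greedy winner does not appear in the optimal selection.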

Contributions

This paper is an extended version of [23], where we present solver-based approaches to address specific challenges of index selection that occur in practice. Besides a basic problem, solution concepts for several extended problem versions are proposed. The main contributions of [23] are the following: (i) We study solver-based approaches for single- and multi-index selection problems. (ii) We use a flexible chunk-based heuristic approach to tackle larger problems. (iii) We consider extensions with multiple stochastic workload scenarios and reconfiguration costs. (iv) We derive risk-aware index selections using worst-case and variance-based objectives. (v) We use reproducible examples to test our approaches, which can be easily combined.

Compared to [23], in this paper, we present extended studies and the following additional contributions:

  • We use utility functions in a piecewise linear framework as a further criterion for robust index selections in case of stochastic workloads (Section “Expected Utility Using Piecewise Linear Functions”).

  • We use the workloads of the standard database benchmarks TPC-H and TPC-DS for the evaluation.

  • We evaluate our approaches in stochastic dynamic settings using experiments with in-sample and out-of-sample data (Section “Numerical Evaluation for Stochastic Dynamic Workloads”).

  • We show how to use historic workload data to adapt to unknown stochastic seasonal workload patterns in a robust and anticipating way.

The remainder of this paper is organized as follows. Section “Related Work” discusses related work. In Section “Solution Approach”, a basic version of the index selection problem as well as some extensions are formulated, e.g., the additional consideration of reconfiguration costs. For the described problems, different models and solution approaches are presented. An evaluation of the proposed models using synthetic, scalable workloads is given in Section “Implementation and Evaluation”. In Section “Robust Selection Approaches for Stochastic Workloads”, we derive different robust selection approaches for dealing with stochastic workload scenarios. In Section “Numerical Evaluation for Stochastic Dynamic Workloads”, we evaluate those approaches for the TPC-H and TPC-DS benchmark workloads. Final conclusions and directions for future work are given in Section “Conclusion and Future Work”.

Related Work

The recommendation and automated selection of indexes have been a focus of database research for many years and are still important today, particularly with the rise of self-optimizing databases [13, 15]. In the following, we give an overview of index selection algorithms.

A recent and comprehensive overview of the historic development as well as an evaluation of index selection algorithms is given by Kossmann et al. [11]. Current state-of-the-art index selection algorithms are, e.g., AutoAdmin [5], DB2Advis [21], CoPhy [6], DTA [4], and Extend [18]. All those selection approaches focus on deterministic workloads; risk aversion in case of multiple potential workloads is not supported. As these approaches typically use iterative or recursive methods, it is not straightforward how they would have to be amended to address the extensions considered in this paper, such as multiple workloads, risk aversion, or transition costs.

Early approaches tried to derive optimal index configurations by evaluating attribute access statistics [7]. Newer index selection approaches are mostly coupled with the query optimizer of the database system [11]. By doing so, the cost models of the index selection algorithm and the optimizer are the same. As a result, the benefit of considered indexes can be estimated consistently [5]. As optimizer invocations are costly, especially for complex queries, along with improved index selection algorithms, techniques to reduce and speed up optimizer calls have been developed [5, 14, 22].

An increasing number of possible optimizer calls for index selection algorithms opens the possibility to investigate an increasing number of index candidates. In this context, we perceive a shift away from greedy algorithms [5, 21] toward approaches using mathematical optimization models and methods of operations research, especially integer linear programming (ILP) [3, 6], which can evaluate index combinations efficiently. A major challenge of these solver-based approaches is to deal with the increasing complexity of the integer programs. An obvious remedy is reducing the number of initially considered index candidates, which may, however, reduce the solution quality.

Machine learning-based approaches for index selection are a further emerging research direction. For example, deep reinforcement learning (RL) techniques have already been applied, cf., e.g., Sharma et al. [20] or Kossmann et al. [12]. Such approaches, however, require extensive training, are still limited with regard to large workloads or multi-attribute indexes, and, in particular, do not support risk-averse optimization criteria.

A further research direction seeks to solve joint tuning problems, where besides the index selection, additional features, such as sorting, compression or data tiering selections, are taken into account (see, e.g., Richly et al. [16]).

Solution Approach

In this section, we present approaches to solve different versions of index selection problems. Section “Basic Problem” describes a fundamental index selection problem, which is considered the basic problem in this work. In addition to the problem’s description, we formulate an integer linear programming model, which allows solving this problem. Section “Chunking Extension” (chunking), Section “Extension to Multi-Index Configurations” (multiple indexes per query), and Section “Consideration of Reconfiguration Costs” (reconfiguration costs) each describe an extension of the basic problem and explain which adjustments can be made to the solution of the basic problem to solve the extended problems. After an evaluation of the mentioned models, cf. Section “Implementation and Evaluation”, in Section “Robust Selection Approaches for Stochastic Workloads”, we will finally present our approaches to obtain robust index selections in scenarios where workloads are stochastic. Those models are evaluated in Section “Numerical Evaluation for Stochastic Dynamic Workloads”.

Basic Problem

In this subsection, we first describe a basic version of the index selection problem, which captures its typical properties. The basic index selection problem is about finding a subset of a given set of (multi-attribute) index candidates used by a hypothetical database to minimize the total execution time of a given workload. The given workload consists of a set of queries and a frequency for each query. A query can use either no index or exactly one index for support. Different indexes induce different improvements for a single query. As a result, the execution time of a query highly depends on the used index. A query has the longest execution time if no index is used. For each query, it has to be decided whether and which index is to be used. An index can belong to the set of selected indexes only if at least one query uses it. Each index involves a certain amount of memory consumption. The total memory consumption of the selected indexes must not exceed a predefined index memory budget.

Table 2 Basic model parameters and decision variables, cf. Weisgut et al. [23]

Table 2 shows the formal representation of the given parameters and the decision variables of our model. The binary variable \(u_{q,i}\in \{0,1\}\) is used to control whether an index i is used for query q. Variable \(v_i \in \{0,1\}\) indicates whether index i is selected overall. It is used to calculate the overall memory consumption of selected indexes. Similar to the work of Schlosser and Halfpap [17], the considered standard basic index selection problem can be formulated as an integer LP model:

$$\begin{aligned} \underset{v_i,u_{q,i} \in \{0,1\}^{I + Q \times (I+1)}}{\text {minimize}} \sum \limits _{\begin{array}{c} q = 1,...,Q, \quad i = 0,...,I \end{array}} {u_{q,i}\cdot t_{q,i}\cdot f_q}, \end{aligned}$$
(1)
$$\begin{aligned}&\textit{s.t.} \quad \sum \limits _{i=1,...,I}{v_i\cdot m_i}\le M, \end{aligned}$$
(2)
$$\begin{aligned}&\quad \qquad \sum \limits _{i=0,...,I}{u_{q,i}} = 1, \qquad q = 1,...,Q ,\end{aligned}$$
(3)
$$\begin{aligned}&\quad \qquad \sum \limits _{q=1,...,Q}{u_{q,i}} \ge v_i, \qquad i = 1,...,I, \end{aligned}$$
(4)
$$\begin{aligned}&\quad \qquad \frac{1}{Q}\cdot \sum \limits _{q=1,...,Q}{u_{q,i}}\le v_i, \qquad i = 1,...,I. \end{aligned}$$
(5)

The objective (1) minimizes the execution time of the overall workload, taking into account the index usage of queries, the index-dependent execution times, and the frequencies of queries. Constraint (2) ensures that the selected indexes do not exceed the given memory budget M. Constraint (3) ensures that at most one index is used for a single query: exactly one option has to be chosen, including the no-index option. Thus, if \(u_{q,i}\) with \(i = 0\) is set, no index is used for query q. Constraints (4) and (5) are required to connect \(u_{q,i}\) with \(v_i\). If no query uses a specific index i, constraint (4) ensures that \(v_i\) equals 0 for that index. If at least one query uses index i, constraint (5) ensures that \(v_i\) equals 1 for that index.
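The semantics of (1)–(5) can be checked by brute force on a toy instance: since constraints (3)–(5) tie the usage variables \(u_{q,i}\) to the selection variables \(v_i\), choosing a memory-feasible index set and letting each query pick its best available option is equivalent. A minimal sketch with assumed data (not the ILP formulation itself):

```python
from itertools import combinations

# Toy instance with assumed data. t[q][i] is the execution time of query q
# with option i, where i = 0 means "no index" (the slowest option).
t = [[10, 4, 7],   # query 1: no index, index 1, index 2
     [8, 6, 2]]    # query 2
f = [3, 1]         # query frequencies f_q
m = [0, 5, 4]      # memory m_i; m[0] is the (free) no-index option
M = 5              # memory budget

def workload_cost(selected):
    # selected: set of index ids >= 1; each query picks exactly one option
    # (constraint (3)), either "no index" or a selected index.
    cost = 0
    for q in range(len(t)):
        options = [t[q][0]] + [t[q][i] for i in selected]
        cost += min(options) * f[q]   # objective (1) for this query
    return cost

# Enumerate all index sets satisfying the memory budget, constraint (2).
feasible = [set(s) for r in range(3) for s in combinations([1, 2], r)
            if sum(m[i] for i in s) <= M]
best = min(feasible, key=workload_cost)
```

In this instance the pair {1, 2} needs 9 memory units and is infeasible, so the solver's choice falls on index 1 alone with a total workload cost of 18 (versus 38 without any index).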

Chunking Extension

The number of possible solutions of the index selection problem grows exponentially with the number of index candidates. Databases for modern enterprise applications consist of hundreds of tables and thousands of columns, which leads to long execution times for finding the optimal solution of the resulting large problems. In this extension, the set of possible indexes is split into chunks of indexes. The index selection problem is then solved via (1)–(5) with the reduced set of indexes of each chunk, and the indexes of each chunk's optimal solution are returned. After solving the problem for each chunk, the best indexes of each chunk advance to a second round, in which the reduced number of remaining indexes is used for a final selection, again via (1)–(5).

The approach allows an effective problem decomposition and accounts for index interaction. Naturally, chunking remains a heuristic approach and does not guarantee an optimal solution, but its main advantage is to avoid large problems. Of course, chunks should not be chosen too small, as splitting the global problem into too many local problems can also add overhead (see the evaluations presented in Section “Results for Index Chunking”). Another advantage is that, in the first round, all chunks could be solved in parallel, which could reduce the overall execution time even further.

Extension to Multi-index Configurations

Our basic problem introduced in Section “Basic Problem” cannot handle the interaction of indexes as described in the introduction: a query could be accelerated by more than one index, and the performance gain of an index could be affected by other indexes. We tackle this part of the index selection problem by adding one level of indirection called index configurations. An index configuration maps to a set of indexes.

For simplicity, we consider the use of at most two indexes per query. Assuming the index selection problem has ten indexes, the first configuration (configuration 0) means that no index is used for a query. The next ten configurations point to the respective single indexes, e.g., configuration 1 points to index 1, configuration 2 to index 2, and configuration 10 to index 10. The subsequent configurations map to sets containing combinations of two indexes. Database queries could be accelerated by more than two indexes, but we simplified the configurations in our implementation so that they consist of a maximum of two different indexes. We use a binary parameter \(d_{c, i}\) indicating whether configuration c contains index i, where \(c=0,...,C\) and \(i=0,...,I\) with C being the number of index configurations and I being the number of indexes. Furthermore, we assume that ten percent of all possible index combinations interact in configurations. Our approach to index selection works with configurations in the same way as with indexes, cf. (1)–(5). The constraints (3)–(5) of the basic problem, cf. Section “Basic Problem”, are adapted in the following way for multi-index configurations:

$$\begin{aligned} \sum \limits _{c=0,...,C}{u_{q,c}} = 1, \qquad \quad q = 1,...,Q, \end{aligned}$$
(6)
$$\begin{aligned} \sum \limits _{c=1,...,C}\, \sum \limits _{q=1,...,Q}{u_{q,c}\cdot d_{c, i}} \ge v_i, \qquad i = 1,...,I, \end{aligned}$$
(7)
$$\begin{aligned} \frac{1}{Q}\cdot \sum \limits _{q=1,...,Q}{u_{q,c}}\cdot d_{c, i}\le v_i, \quad \begin{array}{l} c = 1,...,C\\ i = 1,...,I. \end{array} \end{aligned}$$
(8)

Again, the binary variable \(u_{q,c}\) is used to control whether configuration c is used for query q, and C is the number of index configurations. Similar to the basic approach, \(c = 0\) represents a configuration that contains no index. Constraint (6) ensures that a single query uses exactly one configuration option instead of one index. In constraints (7) and (8), the parameter \(d_{c,i}\) is included to activate the indexes of used configurations.
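The enumeration of configurations and the parameter \(d_{c,i}\) can be sketched directly. The following snippet builds, for an assumed I = 4 indexes, the no-index configuration, all single-index configurations, and all two-index configurations (the paper's implementation keeps only a random 10% of the pairs; here all pairs are kept for clarity):

```python
from itertools import combinations

# Assumed toy size: I = 4 indexes, configurations of at most two indexes.
I = 4
indexes = list(range(1, I + 1))

configs = [frozenset()]                                   # c = 0: no index
configs += [frozenset({i}) for i in indexes]              # c = 1..I: singles
configs += [frozenset(p) for p in combinations(indexes, 2)]  # index pairs

# Binary parameter d[c][i]: does configuration c contain index i?
d = {c: {i: int(i in cfg) for i in indexes}
     for c, cfg in enumerate(configs)}
```

With I = 4 this yields 1 + 4 + 6 = 11 configurations; in constraints (7)–(8), selecting a configuration c then activates exactly the indexes i with d[c][i] = 1.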

Consideration of Reconfiguration Costs

In this problem extension, we consider the costs of a transition from one index configuration to another. We assume that the database removes indexes that are no longer used and loads indexes that are to be used into memory. Typically, the database needs to perform I/O operations, which are time-consuming and generate additional costs. In our final extension, we model such creation and removal costs in order to reduce transition costs.

To adapt the index configurations, the algorithm identifies the differences between the previous configuration (now characterized by parameters \({{\bar{v}}}_i\)) and a new target configuration governed by the variables \(v_i\). For each removed index i, the algorithm looks up the removal costs \(rm_i\) and adds them to the total removal costs RM. Analogously, the algorithm calculates the total creation costs MK using the creation costs \(mk_i\) of index i. The sum of the removal and creation costs is then added to any of the previous objectives, which allows avoiding high transition costs. Since the \({{\bar{v}}}_i\) are fixed parameters, the costs can be modeled linearly:

$$\begin{aligned} RM=\sum \limits _{i=1,...,I}{{{\bar{v}}}_i \cdot (1-v_i) \cdot rm_i}, \end{aligned}$$
(9)
$$\begin{aligned} MK=\sum \limits _{i=1,...,I}{v_i \cdot (1-{{\bar{v}}}_i) \cdot mk_i}. \end{aligned}$$
(10)

Table 3 describes an exemplary calculation of the transition costs between two index selections. The previous (and thus fixed) selection, cf. \({{\bar{v}}}\), uses indexes 1 and 3. The new selection, cf. v, uses indexes 1 and 2. The resulting transition costs are 50.

Table 3 Transition cost calculation example, cf. Weisgut et al. [23]
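Formulas (9)–(10) can be evaluated directly for the transition described above. The individual removal and creation cost values below are assumptions (Table 3's per-index values are not restated in the text); they are chosen so that the total matches the stated transition cost of 50:

```python
# Transition costs per (9)-(10). v_bar is the previous (fixed) selection
# {1, 3}; v is the new selection {1, 2}. The per-index cost values rm and mk
# are assumed for illustration.
I = 3
v_bar = {1: 1, 2: 0, 3: 1}
v     = {1: 1, 2: 1, 3: 0}
rm    = {1: 15, 2: 25, 3: 20}   # removal costs rm_i (assumed)
mk    = {1: 40, 2: 30, 3: 35}   # creation costs mk_i (assumed)

# (9): only indexes present before but dropped now incur removal costs.
RM = sum(v_bar[i] * (1 - v[i]) * rm[i] for i in range(1, I + 1))
# (10): only indexes absent before but created now incur creation costs.
MK = sum(v[i] * (1 - v_bar[i]) * mk[i] for i in range(1, I + 1))

transition = RM + MK   # removing index 3 plus creating index 2
```

Index 1 is kept and contributes nothing; only the removal of index 3 and the creation of index 2 enter the total. Because \({{\bar{v}}}_i\) is a constant, both sums are linear in the decision variables \(v_i\).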

Implementation and Evaluation

In this section, we demonstrate the applicability of our proposed models. The implementation is described in Section “Implementation and Solvers”, and the considered setup and input data in Section “Setup and Scalable Test Datasets”. In Section “Results for Basic Index Selection Problems”, we reflect on the scalability of the basic approach. Then, in Section “Results for Index Chunking”, we investigate when index chunking is beneficial for performance compared to the basic approach and reflect on the cost trade-off that the heuristic entails. Afterward, in Section “Results for Multi-Index Configurations”, we determine the computational overhead of the multi-index extension.

Implementation and Solvers

To evaluate the described approaches, we implemented our different models using AMPL,Footnote 1 a modeling language tailored for optimization problems [8]. The syntax of AMPL is quite close to algebraic notation and should be easy to read and understand, even for readers who have never seen AMPL syntax before. AMPL itself translates the given syntax into a format solvers can understand.

The solver is a separate program that needs to be specified by the developer. Our models are integer linear/quadratic programs, and solvers such as CPLEXFootnote 2 and GurobiFootnote 3 are suited to solve them. Several tests showed that Gurobi is faster than CPLEX in most cases, which is why we used Gurobi.

All code files are available at GitHub.Footnote 4 Our implementation in AMPL allows the reader to evaluate the different approaches and to reproduce our results, see Section “Implementation and Evaluation”. The .run-file contains information about the selected solver and loads the specified model and data specifications. After a given problem is solved, the solution is displayed. The .mod-file contains the description of the mathematical model, i.e., the parameters, constraints, and objectives used. The input data required for solving a certain problem are specified in the .dat-file.

Setup and Scalable Test Datasets

All performance measurements were performed on the same machine, featuring an Intel i5 8th generation CPU (4 cores) and 8 GB of main memory. All measurements were repeated three times. For each time measurement, we used the AMPL built-in function \(\_total\_solve\_time\), which returns the user and system CPU seconds used by all solve commands [9]. The final value was determined as the mean of the three measurement results. All non-related applications were closed to reduce any side effects of the operating system.

The datasets used for the evaluation are generated randomly, using multiple fixed random seeds. Each dataset is defined by the number of indexes, the number of queries, and the available memory budget. The algorithm provided in the index-selection.data-file then generates the execution time of each query, depending on the utilized index. First, the “original” execution time of the query without using any index is chosen randomly within the interval [10; 100]. Based on the drawn costs, the speedup for each index is calculated by choosing a random value between the “original” costs and a 90% speedup. The memory consumption of an index can be an integer between 1 and 20. The frequencies can be between 1 and 1000.

The extensions applied on top of the basic approach introduce further variables that need to be generated. For the stochastic workload extension, we introduced a workload intensity, which is drawn randomly for each workload. This also applies to the transition cost extension, where the creation and removal costs are random. The multi-index configuration extension requires a more complex generation process since each index configuration should be a unique set of indexes. Configuration zero represents the option that no index is used. The configurations 1 to I point to their respective single index. All other generated configurations consist of up to two indexes, where the combinations are drawn randomly. By using a second data structure, it is ensured that no index combination is used multiple times. The speedup s of a combination consisting of two indexes i and j is then bounded by the following formulas:

$$\begin{aligned} min\_speedup_{s_{i,j}}=max\left( s_i, s_j\right) , \end{aligned}$$
(11)
$$\begin{aligned} max\_speedup_{s_{i,j}}=s_i + s_j. \end{aligned}$$
(12)

The minimum speedup and the maximum speedup are then passed to a function that returns a uniformly distributed random number within the continuous interval \([min\_speedup_{s_{i,j}}; max\_speedup_{s_{i,j}}]\), cf. (11)–(12).
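The combined-speedup generation per (11)–(12) can be sketched as a small helper (the function name and signature are illustrative, not the paper's actual AMPL code):

```python
import random

def combined_speedup(s_i, s_j, rng):
    """Draw the speedup of a two-index combination per (11)-(12):
    uniformly between max(s_i, s_j) and s_i + s_j."""
    lo = max(s_i, s_j)   # (11): at least as good as the better single index
    hi = s_i + s_j       # (12): at most the sum of both single speedups
    return rng.uniform(lo, hi)
```

The lower bound reflects that adding a second index should not hurt, while the upper bound caps the interaction benefit at the sum of the individual speedups.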

Outsourcing the generation of input data into the index-selection.data-file allows for an easy replacement with actual data, e.g., benchmarking data of a real system. Moreover, it enables the reader to validate basic example cases on their own.

Results for Basic Index Selection Problems

Fig. 1 Execution times in seconds of the basic solution, cf. (1)–(5), for different numbers of index candidates, cf. Weisgut et al. [23]

“The complexity of the view- and index-selection problem is significant and may result in high total cost of ownership for database systems” [10]. In this section, we evaluate our basic solution, cf. Section “Basic Problem”, and show the scalability of our implemented solution, which we later compare to the chunking approach. We set the memory budget to 100 units and assume 100 queries with frequencies drawn uniformly between 1 and 1000. To test the scalability on our machine, we generate data with 50, 100, 150, ..., 1450, 1500 index candidates. The outcome measurements are shown in Fig. 1; the total solve time on the y-axis is plotted on a logarithmic scale.

Figure 1 shows the growing total solve time as the number of indexes rises. With an increasing index count, the execution times vary more. Naturally, the solve time depends on the specific generated data input. In some cases with over 1000 indexes, the generated input could not be solved with our setup in a meaningful time. Note that the number of possible index combinations grows exponentially. To limit the number of index candidates, one might only consider smaller subsets of a workload’s queries that are responsible for the majority of the workload costs.

Results for Index Chunking

To tackle the exponentially growing number of admissible index combinations, we divide the problem into chunks, find the best indexes of each chunk, and then find the best indexes among the winners of all chunks, cf. Section “Chunking Extension”. Compared to the basic index selection solution, problems with many more indexes can be solved with chunking.

Figure 2 shows the total solve time for different numbers of chunks between 1 and 50 (the case without chunking is shown in blue). In this experiment, all other parameters were fixed (see the previous Section “Results for Basic Index Selection Problems”). The dots of the different chunking versions show a linear relationship for index counts between 500 and 2500. In the beginning, the total solve time of the chunking curve has a higher slope.

Fig. 2 Execution times in seconds of basic solutions and with different numbers of chunks (lower is better), cf. Weisgut et al. [23]

The overhead introduced by chunking, i.e., splitting the index candidates into chunks, does not always have a positive impact on the total execution time of our linear program. The chunking solution shows lower scattering than the basic solution. In each execution, some chunks could be solved faster than the mean while others needed more time. We observe that long and short solve times of single chunks balance each other out, and chunking leads to lower variation.

As described in Section “Chunking Extension”, the heuristic chunking approach might cause the final solution not to be a globally optimal solution of the initial problem. In this context, Table 4 shows the growth of the total costs of the found solution compared to the optimal solution. The lower the chunk count, the higher the mean and the maximum growth: with a lower chunk count, the probability is higher that an index of the optimal solution is not a winner of its chunk. The more chunks, the more indexes advance to the final round.

Table 4 Total workload costs growth for different numbers of chunks compared to the optimal solution in percent (%), cf. [23]

Chunking reduces the total solve time, and fewer outliers with very long execution times occur. The degradation of the calculated solution is surprisingly low: we observe that the growth of the total workload costs is consistently lower than 1%.

Results for Multi-index Configurations

In this section, we evaluate the potential solve time overhead introduced by the multi-index configuration extension, see Section “Extension to Multi-Index Configurations”. To this end, we compare the solve times of the extension with those of the basic approach. For both implementations, we tested multiple settings, each defined by the number of indexes: 10, 50, 100, 200, 500, or 1000. Independent of the setting, we assume a memory budget of 100 units and 100 queries. Figure 3 compares the solve times of both implementations.

Fig. 3 Execution times in seconds of the basic approach and the multi-index configuration extension in comparison, cf. Weisgut et al. [23]

Overall, as expected, the multi-index configuration extension has an execution time overhead compared to the basic solution, which assigns at most one index to each query. However, the additionally required solve time is acceptable. Further, the relative solve time overhead decreases with an increasing number of indexes. One explanation for this effect could be that increasing the number of index candidates also increases the number of dominated indexes excluded by the (pre)solver.

Robust Selection Approaches for Stochastic Workloads

Until now, we considered only a single given workload. However, in the context of enterprise applications, we observe that each day has a different workload. The workloads at the end of the week or on the weekend can systematically contain fewer or different compositions of requests compared to workloads during the rest of the week. Further, future workloads are not perfectly plannable and can be seen as stochastic. Hence, a good index selection should be optimized such that it provides suitable results for a variety of potential workload scenarios. In this section, we propose an approach that can take multiple workloads into account. The solution seeks to provide a robust index selection, where robust means that the performance is good no matter which of the different potential workloads occurs.

In the following, we assume W potential workload scenarios, e.g., for the next period of time. We consider the probability for workload w denoted by \(P_w\), \(w=1,...,W\), where \(\sum \nolimits _{w=1,...,W} P_w = 1\). A workload scenario is characterized by a set of queries, which occur with given frequencies within a certain time frame.

Risk Neutral Baseline Approach

As a standard baseline approach, we use the risk-neutral model. The expected total workload cost T across all workloads is calculated as

$$\begin{aligned} T=\sum \limits _{w=1,...,W}{g_w \cdot P_w }, \end{aligned}$$
(13)

where we use the untuned execution time \(g_w^{(0)} := \sum \nolimits _{q = 1,...,Q} {u_{q, 0} \cdot t_{q, 0} \cdot f_{w,q}}\) (i.e., when no indexes are used) to define the normalized execution time \(g_w\) of a workload w with \(f_{w,q}\) being the frequency of query q in workload w, \(w=1,...,W\), i.e.,

$$\begin{aligned} g_w= 1/g_w^{(0)} \cdot \sum \limits _{q = 1,...,Q,\, i = 0,1,...,I} {u_{q, i} \cdot t_{q, i} \cdot f_{w,q}}. \end{aligned}$$
(14)

The information whether a configuration c is used for a query q of a workload w is shared between the workloads, leaving it to the solver to minimize the total costs across all workloads. In this context, Fig. 4 shows exemplary total workload costs when minimizing the expected execution time T with \(W=4\) workload scenarios, cf. W1–W4, for some distribution P. Following the risk-neutral objective, the actual costs of the individual workloads typically differ significantly, leading, e.g., to poor performances (for W1 and W3) in favor of others (W2 and W4).
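The normalized scenario costs (14) and the risk-neutral objective (13) can be evaluated on a toy instance. The following sketch assumes a fixed index usage (the vector `use` plays the role of the solved \(u_{q,i}\)) and toy execution times and frequencies:

```python
# Toy instance with assumed data: two queries, one index, two scenarios.
t = [[10, 4],   # t[q][i]: execution time of query q with option i (0 = none)
     [8, 5]]
use = [1, 0]    # fixed index usage per query (stand-in for solved u_{q,i})
f = [[3, 1],    # f[w][q]: frequency of query q in workload scenario w
     [1, 4]]
P = [0.5, 0.5]  # scenario probabilities, summing to 1

def g(w):
    # (14): workload cost normalized by the untuned cost g_w^(0).
    untuned = sum(t[q][0] * f[w][q] for q in range(2))
    tuned = sum(t[q][use[q]] * f[w][q] for q in range(2))
    return tuned / untuned

# (13): expected normalized workload cost across all scenarios.
T = sum(g(w) * P[w] for w in range(2))
```

Because query 1 benefits from the index while query 2 does not, the scenario that weights query 1 heavily (w = 0) ends up with a much lower normalized cost than the other, illustrating the uneven per-scenario performance of the risk-neutral objective.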

Fig. 4 Exemplary costs for each workload when minimizing the expected total costs (T) (lower is better), cf. Weisgut et al. [23]

Next, we propose different risk-averse approaches to make the index selection performance more robust.

Weighted Worst-Case Optimization Approach

We include the worst-case performance by penalizing the maximum total workload costs as additional costs besides the mean T. The maximum workload costs L (modeled as a continuous variable) are determined by the constraint:

$$\begin{aligned} L \ge g_w \qquad \forall w = 1,...,W. \end{aligned}$$
(15)

The following (mixed) ILP, cp. (1)–(5), now includes this maximum workload cost (L) in the objective using the penalty factor \(a\ge 0\), cf. (13)–(14):

$$\begin{aligned}&\underset{v_i,u_{q,i} \in \{0,1\}^{I + Q \times (I+1)},\, L \in {\mathbf {R}}}{\textit{minimize}} T + a \cdot L. \end{aligned}$$
(16)

Note that L does not depend on the workload distribution P. Further, for \(a \rightarrow \infty\), we obtain the pure worst-case optimization criterion.
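To make the criterion (16) concrete without an ILP solver, the following sketch evaluates \(T + a \cdot L\) by brute-force enumeration over all index subsets within the memory budget (hypothetical toy data; only viable for tiny candidate sets, unlike the paper's solver-based MILP):

```python
import itertools

def wwc_select(t, m, f, P, M, a):
    """Brute-force illustration of objective (16): minimize T + a*L, where
    L = max_w g_w (constraint (15)), over index subsets within budget M.

    t[q] = [t_{q,0}, t_{q,1}, ...]: cost of query q without/with index i;
    m[i]: memory of index i; f[w][q]: frequency of query q in workload w.
    """
    Q, I, W = len(t), len(m), len(f)
    g0 = [sum(f[w][q] * t[q][0] for q in range(Q)) for w in range(W)]
    best = None
    for sel in itertools.product([0, 1], repeat=I):
        if sum(m[i] for i in range(I) if sel[i]) > M:
            continue  # memory budget violated
        g = []
        for w in range(W):
            # each query uses its cheapest available configuration
            cost = sum(f[w][q] * min([t[q][0]] +
                                     [t[q][i + 1] for i in range(I) if sel[i]])
                       for q in range(Q))
            g.append(cost / g0[w])
        T = sum(g[w] * P[w] for w in range(W))
        L = max(g)  # tight value of the continuous variable L, cf. (15)
        if best is None or T + a * L < best[0]:
            best = (T + a * L, sel)
    return best

obj, sel = wwc_select([[10, 4, 10], [8, 8, 3]], [1, 1],
                      [[10, 1], [1, 10]], [0.5, 0.5], M=1, a=100)
```

With a large penalty a, the enumeration reproduces the pure worst-case choice; with a = 0, it reduces to the risk-neutral selection.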

Fig. 5

Exemplary workload costs with increased robustness by optimizing the worst-case costs (L), cf. Weisgut et al. [23]

Fig. 6

Exemplary workload costs with increased robustness by using mean-variance optimization, cf. Weisgut et al. [23]

Figure 5 illustrates a typical solution leading to better worst-case costs, cp. Fig. 4. However, the costs of W2 and W3 increased, also leaving a bigger gap between W4 and the rest.

Mean–Variance Optimization Approach

To obtain even (and thus plannable) performances, we apply the mean-variance optimization criterion, i.e., we weight the mean T against the variance V of the workload costs, defined by, cf. (13)–(14),

$$\begin{aligned} V=\sum _{w=1,...,W}{\left( g_w- T \right) ^2}\cdot P_w. \end{aligned}$$
(17)

The variance serves as a penalty to minimize the scenarios’ cost differences. The factor b and the term \(b\cdot V\) are used in the objective, cp. (16)–(17),

$$\begin{aligned} \underset{v_i,u_{q,i} \in \{0,1\}^{I + Q \times (I+1)}}{\text {minimize}} T + b \cdot V. \end{aligned}$$
(18)

Note that the problem (18) becomes a binary quadratic programming (BQP) problem due to the variance V in the penalty term. Using the mean–variance criterion (18) may lead to results as illustrated in Fig. 6. Typically, all costs tend to lie within a similar range. However, in comparison to the previous figures, not only W1, W2, and W3 have been brought into a plannable range, but also W4. The total costs of W4 may thus not be reduced, which makes the result more robust but, in the end, less effective. An alternative option to resolve this issue is to use the semivariance instead of V. In contrast to the variance, the semivariance penalizes only those workloads whose costs are higher than the mean cost of all workloads, i.e., workloads with lower costs do not increase the applied penalty.
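The difference between the two penalties can be made concrete. The snippet below (illustrative, with hypothetical cost values) computes V per (17) and the semivariance variant, which leaves below-mean workloads unpenalized:

```python
def variance(g, P):
    """V per Eq. (17): probability-weighted squared deviations from the mean T."""
    T = sum(gw * pw for gw, pw in zip(g, P))
    return sum((gw - T) ** 2 * pw for gw, pw in zip(g, P))

def semivariance(g, P):
    """Only workloads costlier than the mean contribute to the penalty."""
    T = sum(gw * pw for gw, pw in zip(g, P))
    return sum(max(gw - T, 0.0) ** 2 * pw for gw, pw in zip(g, P))

g, P = [0.4, 0.8], [0.5, 0.5]   # hypothetical normalized workload costs
```

Here the mean is 0.6; the variance penalizes both workloads symmetrically, while the semivariance counts only the expensive one, so a solution that improves the cheap workload further is not penalized for it.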

Expected Utility Using Piecewise Linear Functions

In Sections “Weighted Worst-Case Optimization Approach” and “Mean-Variance Optimization Approach”, we presented risk-averse objectives that minimize an upper bound to optimize the execution time of the worst workload or use the mean-variance criterion to achieve robust execution times with small deviations. However, optimizing the mean–variance criterion also penalizes execution times that are better than the average execution time, which can be a disadvantage as short execution times are desirable from a database perspective.

As an alternative way to model more flexible risk-averse objectives, we use utility functions, which allow us to identify index selections such that all workload scenarios are addressed while a higher focus is put on improving scenarios with high costs. However, the resulting non-linearity is a challenge for solvers, particularly as we have an integer problem. To resolve this issue in a pragmatic way, we model the non-linear objective using piecewise linear approximations.

As we have a cost minimization problem, we use the scenario-specific costs \(g^{(0)}_w\), representing the total workload costs of workload w when no indexes are used at all, as an upper bound. Hence, the (normalized) benefit of using indexes for workload w can be expressed as \(\beta _w := (g^{(0)}_w - g_w)/g^{(0)}_w\), cf. (14). In this context, we consider continuous concave utility functions u(z), \(z\ge 0\), to measure the decreasing marginal utility of the benefit of a workload. Further, to approximate u(z), we use 1000 equidistant values between 0 and 1 for a piecewise linear representation of u, denoted by \({{\tilde{u}}}\), i.e.,

$$\begin{aligned} {{\tilde{u}}}(z) \approx u(z), \qquad z \in [0,1]. \end{aligned}$$
(19)

Finally, the objective to maximize the expected utility of benefits over all workload scenarios reads as

$$\begin{aligned} \underset{{v_i},{u_{q,i}} \in {{\{ 0,1\} }^{I + Q \times (I + 1)}}}{\text {maximize}} \sum\limits_{w = 1,...,W} {\tilde u\left( {1 - {g_w}/g_w^{(0)}} \right)} \cdot {P_w}. \end{aligned}$$
(20)

Note that the piecewise linear integer problem (20) can be solved efficiently using standard solvers.
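The approximation (19) can be sketched as follows; this is a minimal illustration that evaluates \({\tilde{u}}\) directly in code, whereas the model itself would hand the breakpoints to the solver's piecewise-linear facilities:

```python
def piecewise_linear(u, n=1000):
    """Tabulate u at n+1 equidistant points of [0, 1] and return the
    piecewise linear interpolant u_tilde, cf. Eq. (19)."""
    ys = [u(k / n) for k in range(n + 1)]

    def u_tilde(z):
        z = min(max(z, 0.0), 1.0)       # clamp to the approximation interval
        k = min(int(z * n), n - 1)      # segment index
        w = z * n - k                   # position within the segment
        return (1 - w) * ys[k] + w * ys[k + 1]

    return u_tilde

# the concave utility used in the experiments: u(z) = z^gamma with gamma = 0.5
u_tilde = piecewise_linear(lambda z: z ** 0.5)
```

With 1000 equidistant breakpoints, the interpolation error for such smooth concave functions is negligible relative to the modeling error of the cost estimates.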

To sum up, the risk-averse approaches proposed in this section enable us to use multiple workloads (e.g., observed in the past) to optimize index selections for potential/stochastic scenarios under risk considerations.

Numerical Evaluation for Stochastic Dynamic Workloads

In this section, we evaluate the performance of the risk-averse selection approaches proposed in Section “Robust Selection Approaches for Stochastic Workloads”. Our evaluation uses the workloads of the established standard database benchmarks TPC-H and TPC-DS, which are introduced in Section “Setup and Model Input”. Further, in Section “Experiment with Stochastic Dynamic Workload Scenarios”, we define our experiments with stochastic dynamic workload scenarios, which are based on Section “Setup and Model Input” and different randomized query frequencies. In Section “Evaluation Results for Stochastic Dynamic Workloads (In-Sample)”, we evaluate our index selection models for stochastic workloads, where we use drawn samples of dynamic workload data, which serve as exemplary historic data. Further, in Section “Evaluation Results for Stochastic Dynamic Workloads (Out-of-Sample)”, we verify the robustness of the selections obtained in Section “Evaluation Results for Stochastic Dynamic Workloads (In-Sample)” by evaluating them for unseen generated out-of-sample workloads. Finally, in Section “Discussion”, we discuss the results obtained.

Setup and Model Input

We consider the case with at most one index per query. To generate realistic workload data, we set up a PostgreSQL 11 [1] database system with the extension HypoPG [2]. HypoPG allows creating hypothetical indexes and estimating their sizes. We loaded the TPC-H and TPC-DS tables with scale factor 1. For the TPC-H (\(Q=22\), \(I=275\)) and TPC-DS queries (\(Q=99\), \(I=1\,422\)) and their various index candidates (with 1–2 attributes), we derived query costs \(t_{q,i}\), \(i = 0,...,I\), by calling PostgreSQL’s EXPLAIN on the query template string with fixed parameters, see [17]. Index sizes for the required memory, cf. \(m_i\), are modeled using HypoPG’s function hypopg_relation_size(). The obtained model inputs to reproduce the calculations are available online.\(^{5}\)

Experiment with Stochastic Dynamic Workload Scenarios

We consider workloads for different days of the week, referred to by \(j=0,...,6\). Further, we generate workload data for 20 weeks as historic data. Hence, for days of type j, we consider \(W_j=20\) historic workload scenarios. Without loss of generality, for all j, we use uniform probabilities \(P^{(j)}_w=1/20\) for the workload scenarios w, \(w=1,...,W_j\), \(W_j=20\), \(j=0,...,6\). To generate random seasonal patterns for query q and the type of day j, we use skewed random numbers \(z_{q,j}:=1/t_{q,0} \cdot (Uniform(0,0.1)^{(1/0.1)})\). Based on \(z_{q,j}\) for the 20 days of type j, we use the randomized query frequencies \(\bar{f}_{q,j,k}:=z_{q,j} \cdot (Uniform(0,0.2)^{(1/0.2)})\), for all \(k=1,...,20\).

The total workload costs of day j in week k amount to \({C^{(0)}_{j,k}} = \sum \nolimits _{q = 1,...,Q} {{{{{\bar{f}}}}_{q,j,k}}} \cdot {t_{q,0}}\). As we seek to optimize index selections against different mixtures of workloads and not primarily against different amplitudes of load, in our experiment, we normalize the overall load, i.e., in our models for day j in week k, we consider the final query frequencies \(f_{q,j,k} := {{{\bar{f}}}}_{q,j,k} \cdot C^{(0)}_{1,1}/C^{(0)}_{j,k}\). For day j in week k without using indexes, we have the untuned total workload costs \(g^{(0)}_{j,k} = \sum \nolimits _{q = 1,...,Q} f_{q,j,k} \cdot {t_{q,0}}\). Finally, for the memory budget M, we use \(M:= \delta \cdot \sum \nolimits _{i = 1,...,I} m_i\) for a given parameter value \(\delta \in [0,1]\).
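The frequency generation and load normalization above can be sketched as follows (illustrative; variable names are ours, and week/day indices are shifted to start at 0, so \(C^{(0)}_{1,1}\) corresponds to `C0[0][0]`):

```python
import random

def generate_workloads(t0, days=7, weeks=20, seed=0):
    """Randomized query frequencies f_{q,j,k} with seasonal patterns z_{q,j}.

    t0[q]: untuned cost t_{q,0} of query q. Uniform(0, x)**(1/x) yields
    heavily skewed draws, so most queries are (nearly) inactive per day type.
    """
    rng = random.Random(seed)
    Q = len(t0)
    # seasonal pattern z_{q,j} per query and type of day
    z = [[rng.uniform(0, 0.1) ** (1 / 0.1) / t0[q] for j in range(days)]
         for q in range(Q)]
    # raw frequencies fbar_{q,j,k} per week k
    fbar = [[[z[q][j] * rng.uniform(0, 0.2) ** (1 / 0.2) for k in range(weeks)]
             for j in range(days)] for q in range(Q)]
    # untuned total costs C^(0)_{j,k}, used to normalize the overall load
    C0 = [[sum(fbar[q][j][k] * t0[q] for q in range(Q)) for k in range(weeks)]
          for j in range(days)]
    # final frequencies f_{q,j,k} = fbar_{q,j,k} * C0[0][0] / C0[j][k]
    return [[[fbar[q][j][k] * C0[0][0] / C0[j][k] for k in range(weeks)]
             for j in range(days)] for q in range(Q)]
```

After normalization, every (day, week) combination has the same untuned total cost, so the models compare workload mixtures rather than load amplitudes.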

Evaluation Results for Stochastic Dynamic Workloads (In-Sample)

For generated workload scenarios over 20 weeks, as described in Section “Experiment with Stochastic Dynamic Workload Scenarios”, for each type of day j, \(j=0,...,6\), we separately apply the models proposed in Section “Robust Selection Approaches for Stochastic Workloads”, which are based on different optimization criteria. Besides the risk-neutral model (RN), the weighted worst-case approach (WWC), the mean-variance optimization model (MVO), and the expected utility model (EU), we consider—as another baseline—the approach that optimizes the selection for a certain single workload (SW) only. For this, we used the last observed workload of the respective type of day. Alternatively, one could also use the most likely scenario, cf. \(P_w\), or the average workload over all types of days.

The results of Table 5 are based on the TPC-H benchmark (cf. \(Q=22\) queries, \(I=275\) index candidates) for a budget M determined by \(\delta =0.0017\). For the WWC model, we used \(a=100\) to basically optimize the worst case L. For the MVO model, we used \(b=5\). For the EU model, we used the utility function \(u(z)=z^\gamma\), \(\gamma ={0.5}\). For the SW approach, for each type of day j, we used the workload of week \(k=1\).

Table 5 shows the results when optimizing for the five different approaches. Each model was solved seven times, once for each of the seven types of days j, for 20 different associated workload scenarios each. The numbers in Table 5 are average values of the seven performance values for the seven types of days of the week. Further, all selections were also evaluated (as averages) for the criteria they were not optimized for. A row contains results that are optimized for the criterion given in the first column. The different columns contain an evaluation of the associated index selection with respect to all different objectives. Hence, within each column, the value on the diagonal (marked in bold) represents the best possible objective value. Note that all criteria are minimizations except the EU approach, where large objective values are preferable.

As expected, we observe that, e.g., the WWC approach provides the best worst-case objective (L=0.976225) but falls short in the RN performance. Here, of course, the RN approach provides the best result (0.743616) but performs comparably poorly regarding the WWC or MVO criterion. Naturally, optimizing for one single scenario only, cf. SW, can lead to fairly random results when evaluating other criteria. Overall, in this experiment, the EU approach provides good results regarding all criteria.

Table 5 In-sample results of the five different objectives regarding the performance metrics: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 scenarios of data each (TPC-H benchmark workloads)

Table 6 contains corresponding results based on the larger TPC-DS benchmark (cf. \(Q=99\) queries, \(I=1\,422\) index candidates), where we used the same model parameters, i.e., \(\delta =0.0017\), \(a=100\), \(b=5\), and \(\gamma ={0.5}\). The optimization of the MVO model is excluded as its computation was not possible within 1 h due to its non-linear (quadratic) structure. The solutions for all other models could be computed within seconds. The results obtained are similar to the TPC-H case. Further, they show the applicability of our linear models for larger workloads.

Table 6 In-sample results of the five different objectives regarding the performance metrics: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 scenarios of data each (TPC-DS benchmark workloads)

Evaluation Results for Stochastic Dynamic Workloads (Out-of-Sample)

Table 7 shows the average performance results of the determined index selections associated to Table 5 when evaluated for 20 new unseen workload scenarios for each type of day j. Those new scenarios respect the same seasonality pattern, cf. realized \(z_{q,j}\) values, but are based on new i.i.d. random variables for \({{\bar{f}}}\) (see Section “Experiment with Stochastic Dynamic Workload Scenarios”). The 20 new unseen scenarios can be seen as potential workloads that effectively occur on the next day of a certain type j. Those scenarios serve as an out-of-sample validation of our (robust) index selections that have been computed based on 20 seen historic workloads, cf. Section “Evaluation Results for Stochastic Dynamic Workloads (In-Sample)”.

Table 7 Out-of-sample results of the five different objectives regarding the performance metrics: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 (unseen) scenarios of data each (TPC-H benchmark)
Table 8 Upper performance bounds for out-of-sample results of the five different objectives: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 scenarios of data each (TPC-H benchmark)

In Table 7, again, the performance results for the different objectives have been averaged over all seven types of days. The performance values on the diagonal (italicized) show the results that were optimized for, based on similar but different workload data. Overall, these values are among the best (within one column). Naturally, it is possible that selections associated with one of the other criteria accidentally provide better results for the new data.

To finally determine the quality of our out-of-sample results for the different objectives, we compare the results obtained in Table 7 to the best possible results that could have been obtained when (i) knowing all the 20 new scenarios for all types of days and (ii) computing optimized solutions as done in Table 5—but now for the unseen data. Those upper bound results are shown in bold in Table 8, including their evaluations for the non-optimized criteria. The results confirm that the italicized numbers of Table 7 are indeed close to the bold numbers of Table 8, which verifies the quality of our models to determine the intended robust index selections.

Similar to Tables 7 and 8, Tables 9 and 10 contain the results for the TPC-DS benchmark. The optimization of the MVO model is again excluded due to runtime issues. More importantly, the results show that the quality and robustness of our linear solutions also hold for larger workloads.

Table 9 Out-of-sample results of the five different objectives regarding the performance metrics: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 (unseen) scenarios of data each (TPC-DS benchmark)
Table 10 Upper performance bounds for out-of-sample results of the five different objectives: SW, RN, MVO, WWC, and EU; average performance results over seven types of days for 20 scenarios of data each (TPC-DS benchmark)

Recall that reconfiguration costs, cf. Section “Consideration of Reconfiguration Costs”, have been excluded in this experiment so as not to influence the out-of-sample robustness analysis. As they are modeled linearly, they could be directly included to incrementally adapt index selections dynamically over time, taking into account different workload scenarios of different kinds (from one day/period to another).

Discussion

Overall, we can summarize that our models and extensions are applicable and effective for the index selection problem. Further, in contrast to the synthetic workloads generated in [23], we demonstrated the models’ effectiveness with the analytical database workloads of the TPC-H and TPC-DS benchmarks established in industry and the research community. Being able to (re)allocate indexes in uncertain and changing environments addresses a relatively untapped field and, hence, complements existing work. We argue that the obtained results showed the intended characteristics for the following reasons: First, being able to integrate the targeted goals directly into a model’s objective is effective. Second, solver-based LP models—in contrast to heuristic or even black-box algorithms—guarantee optimality. The price for optimality is impaired scalability, i.e., our models are limited to small and midsize problem instances.

Conclusion and Future Work

We considered different variants of the index selection problem and proposed solver-based solution concepts. For the basic version of the problem, assuming a single deterministic workload, we presented an ILP-based solution and analyzed a chunk-based heuristic to lower the problem complexity for large index candidate sets. We found that, compared to the optimal solution of the basic problem, this heuristic performs near-optimally and significantly reduces the overall solution time.

Further, to address settings where index selections are adapted over time, e.g., under (expected) changing workload patterns, we endogenized the costs to create and remove indexes in the presence of an existing index configuration, balancing performance against minimal required reconfiguration costs. The extended model remains linear and the numerical overhead is small.

In the second part of the paper, we considered multiple potential workload scenarios in a stochastic framework. To identify robust index selections for such uncertain workload scenarios, we proposed the following risk-averse objectives: (i) weighted worst-case costs (linear), (ii) mean-variance optimization (quadratic), and (iii) expected utility criteria (piecewise linear). Our results show that the different models are applicable—particularly the linear ones—and that the targeted effect of avoiding bad and uneven performances is achieved.

Using numerical experiments based on TPC-H and TPC-DS benchmark workloads, we studied the performance robustness of the risk-averse selection models when trained/optimized for observable historic dynamic workload data and finally applied for a variety of unseen randomly generated (out-of-sample) workloads. We verified the high quality of the results obtained by comparing the achieved performance against upper bounds, which are based on exploiting full knowledge of the out-of-sample workload scenarios.

In future work, further experiments could be performed to study how performance robustness is affected when using the chunking heuristic or reconfiguration costs. Note that all our concepts can easily be combined. Other research directions include scalability issues of solver-based approaches as well as the integration of risk-averse objectives into rule-based or self-learning index selection algorithms.