Robust Index Selection for Stochastic Dynamic Workloads

Fast query processing is a primary goal of modern database systems. The use of indexes is crucial to reduce the execution times of database queries. Hence, it is of great interest to determine an efficient selection of indexes for a database management system (DBMS). However, index selection problems are highly challenging as indexes cause additional memory consumption and the individual benefit of an index is influenced by the selection of others. In this paper, we consider index selection problems accounting for non-standard features, such as (i) multiple potential workloads, (ii) different risk-averse objectives, (iii) multi-index configurations, (iv) reconfiguration costs, and (v) anticipation of dynamic workload scenarios. For the different problem extensions, we propose specific model formulations, which can be solved efficiently using solver-based solution techniques. The applicability and performance of our concepts are demonstrated using reproducible synthetic workloads as well as standard TPC-H and TPC-DS-based benchmark workloads.


Introduction
In this work, we consider resource allocation problems in database systems using means of quantitative methods and operations research. Specifically, we seek to run database workloads efficiently and, at the same time, try to avoid poor performance in case of stochastic or even uncertain future workloads. In this context, we look for risk-averse decision support models that help to optimize whether to store, create, or remove indexes by taking into account different potential workload scenarios, which can, to some extent, be anticipated based on historic data.

Motivation and Background
Indexes are auxiliary data structures used in a relational database system to reduce the execution time required for generating the result of a database query. The shorter the execution time of a workload's query set, the more queries can be executed per time unit; reducing query execution times thus implicitly increases the throughput of the database. However, indexes have to be stored in addition to the data of the database itself, which leads to additional memory consumption and increases the overall memory footprint of the database. Memory capacity is limited and, therefore, a valuable resource. For this reason, it is important to take memory consumption into account when deciding which indexes to store in the system's memory.
For each database query, different indexes may exist, each of which can improve the query execution time differently. Table 1 shows an exemplary scenario in which different combinations of indexes lead to different execution times of a single hypothetical example query. The first combination without any index leads to the longest execution time of the query with 500 ms. The best execution time of 300 ms can be achieved using both index 1 and index 2. However, the best-performing combination regarding the execution time also involves the largest memory footprint. The second-best solution from an execution time perspective results in an index memory consumption of only 40% of the optimal solution and is only about 17% slower. This simple example illustrates the need to consider the index memory consumption for selecting which indexes should be used.
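The trade-off can be replayed with a short script. The execution times and memory values below are those of the example scenario (cf. Table 1), while the budget values are hypothetical choices for illustration:

```python
# Evaluate the example index combinations under a memory budget.
# Keys are (index 1 used, index 2 used); times in ms, memory in units.
combinations = {
    (False, False): {"time_ms": 500, "memory": 0},
    (True, False):  {"time_ms": 350, "memory": 100},
    (False, True):  {"time_ms": 400, "memory": 150},
    (True, True):   {"time_ms": 300, "memory": 250},
}

def best_combination(budget):
    """Return the fastest combination whose memory fits the budget."""
    feasible = {k: v for k, v in combinations.items() if v["memory"] <= budget}
    return min(feasible, key=lambda k: feasible[k]["time_ms"])

print(best_combination(100))   # only index 1 fits the budget
print(best_combination(1000))  # unconstrained: both indexes are selected
```

With a budget of 100 units, only the second-best combination (index 1 alone) is feasible; with a large budget, the fastest combination using both indexes wins.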
Of course, in real-world database scenarios, a database management system processes more than only a single query. Instead, a set of queries is executed on a database with a certain frequency in a specific time frame for each query. The set of queries with their frequencies is referred to as workload. Executing a workload using a selection of indexes has a certain performance. This performance is characterized by the total execution time of the workload and the selected indexes' memory consumption. The workload execution time should be as low as possible, and the index memory consumption must not exceed a specific memory budget.
A further challenge in selecting the set of indexes that shall be present and used by queries is index interaction. "Informally, an index a interacts with an index b if the benefit of a is affected by the presence of b and vice versa" [19]. For example, assume that a particular index i provides the best performance improvement for each query in a subset S of the overall workload, and that no other index has a better accumulated performance improvement. Suppose i has such a high memory consumption that the available index memory budget is fully exhausted. In that case, no other index can be created, and only the queries of the subset S are improved by index i. Another index selection might be worse for the workload subset S but better for the overall workload. Consequently, a (greedily chosen) single index whose accumulated performance improvement is the highest is not necessarily in the set of indexes that provides the best performance improvement for the total workload.
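The greedy pitfall described above can be illustrated with a tiny, hypothetical instance (names, benefit values, and memory values are invented for illustration): a single index a with the highest accumulated benefit exhausts the budget and crowds out a pair b, c whose combined benefit is larger.

```python
from itertools import combinations

# Hypothetical index candidates: (name, accumulated benefit, memory).
candidates = [("a", 10, 10), ("b", 6, 5), ("c", 6, 5)]
BUDGET = 10

def total_benefit(selection):
    return sum(benefit for _, benefit, _ in selection)

def optimal_selection(candidates, budget):
    """Exhaustively search all subsets that fit the memory budget."""
    best = ()
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            if sum(mem for _, _, mem in subset) <= budget and \
               total_benefit(subset) > total_benefit(best):
                best = subset
    return {name for name, _, _ in best}

greedy_first = max(candidates, key=lambda c: c[1])[0]
print(greedy_first, optimal_selection(candidates, BUDGET))
```

The greedy choice is a (benefit 10, exhausting the budget), while the optimal selection is {b, c} with a combined benefit of 12.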

Contributions
This paper is an extended version of [23], where we present solver-based approaches to address specific challenges of index selection that occur in practice. Besides a basic problem, solution concepts for several extended problem versions are proposed. The main contributions of [23] are the following: (i) We study solver-based approaches for single- and multi-index selection problems. (ii) We use a flexible chunk-based heuristic approach to tackle larger problems. (iii) We consider extensions with multiple stochastic workload scenarios and reconfiguration costs. (iv) We derive risk-aware index selections using worst-case and variance-based objectives. (v) We use reproducible examples to test our approaches, which can be easily combined.
Compared to [23], in this paper, we present extended studies and the following additional contributions:
- We use utility functions in a piecewise linear framework as a further criterion for robust index selections in case of stochastic workloads (Section "Expected Utility Using Piecewise Linear Functions").
- We use the workloads of the standard database benchmarks TPC-H and TPC-DS for the evaluation.
- We evaluate our approaches in stochastic dynamic settings using experiments with in-sample and out-of-sample data (Section "Numerical Evaluation for Stochastic Dynamic Workloads").
- We show how to use historic workload data to adapt to unknown stochastic seasonal workload patterns in a robust and anticipating way.
The remainder of this paper is organized as follows. Section "Related Work" discusses related work. In Section "Solution Approach", a basic version of the index selection problem as well as some extensions are formulated, e.g., the additional consideration of reconfiguration costs. For the described problems, different models and solution approaches are presented. An evaluation of the proposed models using synthetic, scalable workloads is given in Section "Implementation and Evaluation". In Section "Robust Selection Approaches for Stochastic Workloads", we derive different robust selection approaches for dealing with stochastic workload scenarios. In Section "Numerical Evaluation for Stochastic Dynamic Workloads", we evaluate those approaches for the TPC-H and TPC-DS benchmark workloads. Final conclusions and directions for future work are given in Section "Conclusion and Future Work".

Table 1  Exemplary index combinations with their memory consumption and the resulting execution time of the example query

Index 1   Index 2   Memory consumption (units)   Execution time (ms)
False     False     0                            500
True      False     100                          350
False     True      150                          400
True      True      250                          300

SN Computer Science

Related Work
The recommendation and automated selection of indexes have been a focus of database research for many years and are still important today, particularly with the rise of self-optimizing databases [13,15]. In the following, we give an overview of index selection algorithms. A recent and comprehensive overview of the historic development, as well as an evaluation of index selection algorithms, is given by Kossmann et al. [11]. Current state-of-the-art index selection algorithms are, e.g., AutoAdmin [5], DB2Advis [21], CoPhy [6], DTA [4], and Extend [18]. All those selection approaches focus on deterministic workloads. Risk aversion in case of multiple potential workloads is not supported. As typically iterated or recursive methods are used, it is not straightforward how they would have to be amended to address the extensions considered in this paper, such as multiple workloads, risk aversion, or transition costs.
Early approaches tried to derive optimal index configurations by evaluating attribute access statistics [7]. Newer index selection approaches are mostly coupled with the query optimizer of the database system [11]. By doing so, the cost models of the index selection algorithm and the optimizer are the same. As a result, the benefit of considered indexes can be estimated consistently [5]. As optimizer invocations are costly, especially for complex queries, along with improved index selection algorithms, techniques to reduce and speed up optimizer calls have been developed [5,14,22].
An increasing number of possible optimizer calls for index selection algorithms opens the possibility to investigate an increasing number of index candidates. In this context, we perceive a shift away from greedy algorithms [5,21] toward approaches using mathematical optimization models and methods of operations research, especially integer linear programming (ILP) [3,6], which are able to evaluate index combinations efficiently. A major challenge of these solver-based approaches is to deal with the increasing complexity of integer programs. An obvious remedy is reducing the number of initially considered index candidates, which may, however, reduce the solution quality.
Machine learning-based approaches for index selection are also an emerging research direction. For example, deep reinforcement learning (RL) techniques have already been applied, cf., e.g., Sharma et al. [20] or Kossmann et al. [12]. Such approaches, however, require extensive training, are still limited with regard to large workloads and multi-attribute indexes, and, in particular, do not support risk-averse optimization criteria.
A further research direction seeks to solve joint tuning problems, where besides the index selection, additional features, such as sorting, compression or data tiering selections, are taken into account (see, e.g., Richly et al. [16]).

Solution Approach
In this section, we present approaches to solve different versions of index selection problems. Section "Basic Problem" describes a fundamental index selection problem, which is considered the basic problem in this work. In addition to the problem's description, we formulate an integer linear programming model, which allows solving this problem. Section "Chunking Extension" (chunking), Section "Extension to Multi-Index Configurations" (multiple indexes per query), and Section "Consideration of Reconfiguration Costs" (reconfiguration costs) each describe an extension of the basic problem and explain which adjustments can be made to the solution of the basic problem to solve the extended problems. After an evaluation of the mentioned models, cf. Section "Implementation and Evaluation", in Section "Robust Selection Approaches for Stochastic Workloads", we finally present our approaches to obtain robust index selections in scenarios where workloads are stochastic. Those models are evaluated in Section "Numerical Evaluation for Stochastic Dynamic Workloads".

Basic Problem
In this subsection, we first describe a basic version of the index selection problem, which captures its typical properties. The basic index selection problem is about finding a subset of a given set of (multi-attribute) index candidates for a hypothetical database to minimize the total execution time of a given workload. The workload consists of a set of queries and a frequency for each query. A query can use no index or exactly one index for support. Different indexes yield different improvements for a single query; as a result, the execution time of a query highly depends on the index used. A query has the longest execution time if no index is used. For each query, it has to be decided whether and which index is to be used. An index can belong to the set of selected indexes only if at least one query uses it. Each index involves a certain amount of memory consumption, and the total memory consumption of the selected indexes must not exceed a predefined index memory budget. Table 2 shows the formal representation of the given parameters and the decision variables of our model. The binary variable u_{q,i} ∈ {0, 1} controls whether an index i is used for query q. The variable v_i ∈ {0, 1} indicates whether index i is selected at all; it is used to calculate the overall memory consumption of selected indexes. Similar to the work of Schlosser and Halfpap [17], the considered standard basic index selection problem can be formulated as an integer LP model: The objective (1) minimizes the execution time of the overall workload, taking into account the index usage for queries, the index-dependent execution times, and the frequency of queries. Constraint (2) ensures that the selected indexes do not exceed the given memory budget M. Constraint (3) ensures that at most one index is used for a single query; here, a unique option has to be chosen, including the no-index option. Thus, if u_{q,i} with i = 0 is true, no index is used for query q.
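Since the model equations (1)-(5) are referenced but not reproduced above, the following is a reconstruction that is consistent with the textual description (t_{q,i}, m_i, f_q, and M as in Table 2); it should be read as a sketch rather than a verbatim copy of the original formulation:

```latex
\begin{align}
\min_{u,v}\quad & \sum_{q=1}^{Q} \sum_{i=0}^{I} f_q \, t_{q,i} \, u_{q,i} & (1)\\
\text{s.t.}\quad & \sum_{i=1}^{I} m_i \, v_i \le M & (2)\\
& \sum_{i=0}^{I} u_{q,i} = 1, \quad q = 1, \dots, Q & (3)\\
& v_i \le \sum_{q=1}^{Q} u_{q,i}, \quad i = 1, \dots, I & (4)\\
& v_i \ge u_{q,i}, \quad q = 1, \dots, Q,\; i = 1, \dots, I & (5)
\end{align}
```

Here, (4) forces v_i to 0 if no query uses index i, and (5) forces v_i to 1 as soon as some query uses it, matching the explanation below.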
Constraints (4) and (5) are required to connect u_{q,i} with v_i. If no query uses a specific index i, constraint (4) ensures that v_i equals 0 for that index. If at least one query uses index i, constraint (5) ensures that v_i equals 1.

Chunking Extension
The number of possible solutions of the index selection problem grows exponentially with the number of index candidates. Databases for modern enterprise applications consist of hundreds of tables and thousands of columns, which leads to long execution times for finding the optimal solution of the growing problem. In this extension, the set of possible indexes is split into chunks. The index selection problem (1)-(5) is then solved with the reduced set of indexes of each chunk, and the indexes of the respective optimal solutions are returned. After solving the problem for each chunk, the best indexes of each chunk advance to a second round, in which the reduced number of remaining indexes is used for a final selection, again via (1)-(5). The approach allows an effective problem decomposition and accounts for index interaction. Naturally, chunking remains a heuristic and does not guarantee an optimal solution, but its main advantage is to avoid large problems. Of course, chunks should not be chosen too small, as splitting the global problem into too many local problems can also add overhead (see the evaluations presented in Section "Results for Index Chunking"). Another advantage is that, in the first round, all chunks could be solved in parallel, reducing the overall execution time even further.
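The two-round chunking heuristic can be sketched as follows. The sketch uses a simplified additive-benefit model and a brute-force sub-solver in place of the full ILP (1)-(5); the candidate triples and function names are illustrative assumptions, not the paper's AMPL implementation:

```python
from itertools import combinations

def solve_exact(candidates, budget):
    """Brute-force stand-in for solving (1)-(5) on a small candidate set;
    candidates are (name, benefit, memory) triples of a simplified
    additive-benefit model."""
    best, best_val = (), 0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            if sum(m for _, _, m in subset) <= budget:
                val = sum(b for _, b, _ in subset)
                if val > best_val:
                    best, best_val = subset, val
    return list(best)

def chunked_selection(candidates, budget, n_chunks):
    """Round 1: solve each chunk independently (these could run in parallel).
    Round 2: re-solve over the union of the chunk winners."""
    chunks = [candidates[k::n_chunks] for k in range(n_chunks)]
    winners = [idx for chunk in chunks for idx in solve_exact(chunk, budget)]
    return solve_exact(winners, budget)

# six candidates with benefits 1..6 and memory 2 each, budget 6
cands = [(f"i{k}", k + 1, 2) for k in range(6)]
print(chunked_selection(cands, 6, 2))
```

In this small instance, the chunk winners still contain the globally best indexes, so the second round recovers the optimal selection; in general, as discussed above, this is not guaranteed.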

Extension to Multi-index Configurations
Our basic problem introduced in Section "Basic Problem" cannot handle the interaction of indexes as described in the introduction: one query could be accelerated by more than one index, and the performance gain of an index could be affected by other indexes. We tackle this part of the index selection problem by adding one level of indirection called index configurations. An index configuration maps to a set of indexes.
Table 2  Parameters and decision variables of the basic model (cf. Section "Basic Problem")

t_{q,i}   Parameter           Execution time of query q = 1, ..., Q using index i = 0, ..., I; i = 0 indicates that no index is used by query q
m_i       Parameter           Memory consumption of index i = 1, ..., I
f_q       Parameter           Frequency of query q = 1, ..., Q
u_{q,i}   Decision variable   Binary variable indicating whether index i = 0, ..., I is used for query q = 1, ..., Q; i = 0 indicates that no index is used by query q
v_i       Decision variable   Binary variable indicating whether index i = 1, ..., I is used by at least one query and is thus part of the selection

For simplicity, we consider the use of at most two indexes per query. Assuming the index selection problem has ten indexes, the first configuration (configuration 0) means that no index is used for a query. The next ten possible configurations point to the respective single indexes, e.g., configuration 1 points to index 1, configuration 2 points to index 2, and configuration 10 points to index 10. The subsequent configurations map to sets containing combinations of two indexes. Database queries could be accelerated by more than two indexes, but we simplified the configurations in our implementation so that they consist of a maximum of two different indexes. We use a binary parameter d_{c,i} indicating whether configuration c contains the index i, where c = 0, ..., C and i = 0, ..., I, with C being the number of index configurations and I being the number of indexes. Furthermore, we assume that ten percent of all possible index combinations interact in configurations. Our approach to index selection works with configurations in the same way as with indexes, cf.
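The configuration scheme can be sketched as follows; the function name, the seed, and the num_pairs knob (standing in for the "ten percent of combinations" rule) are our illustrative assumptions:

```python
import random

def generate_configurations(num_indexes, num_pairs, seed=0):
    """Build configurations as index sets: configuration 0 is 'no index',
    configurations 1..I are the single indexes, and the remaining
    configurations are unique, randomly drawn pairs."""
    rng = random.Random(seed)
    configs = [frozenset()] + [frozenset({i}) for i in range(1, num_indexes + 1)]
    seen = set(configs)
    while len(configs) < 1 + num_indexes + num_pairs:
        pair = frozenset(rng.sample(range(1, num_indexes + 1), 2))
        if pair not in seen:          # second data structure avoids duplicates
            seen.add(pair)
            configs.append(pair)
    # binary parameter d[c][i]: does configuration c contain index i?
    d = [[int(i in cfg) for i in range(num_indexes + 1)] for cfg in configs]
    return configs, d

configs, d = generate_configurations(10, 5)
print(len(configs))  # 16 = 1 + 10 + 5
```

The set `seen` plays the role of the "second data structure" mentioned in Section "Setup and Scalable Test Datasets" that guarantees unique index combinations.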
(1)-(5). Constraints (3)-(5) of the basic problem, cf. Section "Basic Problem", are adapted in the following way for multi-index configurations: again, the binary variable u_{q,c} controls whether configuration c is used for query q, where C is the number of index configurations. Similar to the basic approach, c = 0 represents the configuration that contains no index. Constraint (6) ensures that a single query uses exactly one configuration option instead of one index. In constraints (7)-(8), the parameter d_{c,i} is included to activate the indexes of used configurations.
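The adapted constraints can be reconstructed, consistent with the description above, as the following sketch (with u_{q,c} replacing u_{q,i} and the parameter d_{c,i} linking configurations to indexes; the exact original form may differ):

```latex
\begin{align}
& \sum_{c=0}^{C} u_{q,c} = 1, \quad q = 1, \dots, Q & (6)\\
& v_i \le \sum_{q=1}^{Q} \sum_{c=0}^{C} d_{c,i} \, u_{q,c}, \quad i = 1, \dots, I & (7)\\
& v_i \ge d_{c,i} \, u_{q,c}, \quad q = 1, \dots, Q,\; c = 0, \dots, C,\; i = 1, \dots, I & (8)
\end{align}
```

Analogously to (4)-(5), constraint (7) forces v_i to 0 if no used configuration contains index i, and (8) forces v_i to 1 as soon as some used configuration contains it.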

Consideration of Reconfiguration Costs
In this problem extension, we consider the costs of a transition from one index configuration to another. We assume that the database removes indexes that are no longer used and loads indexes that are to be used into memory. Typically, the database needs to perform I/O operations for this, which are time-expensive and generate additional costs. We model such creation and removal costs in our final extension in order to reduce transition costs.
To adapt the index configurations, the algorithm identifies the differences between the previous configuration (now characterized by fixed parameters, denoted v̄_i) and a new target configuration governed by the variables v_i. For each removal of an index i, the algorithm looks up the removal costs rm_i and adds them to the total removal costs RM. Analogously, the algorithm calculates the total creation costs MK using the creation costs mk_i of index i. The sum of the removal and creation costs is then added to any of the previous objectives, which allows high transition costs to be avoided. The costs can be modeled linearly:
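Since the cost terms themselves are not reproduced above, the following is a sketch of one linear formulation consistent with the description, where the previous selection is given by fixed parameters (denoted \bar{v}_i here) and v_i are the decision variables:

```latex
\begin{align}
RM &= \sum_{i=1}^{I} rm_i \cdot \bar{v}_i \,(1 - v_i), &
MK &= \sum_{i=1}^{I} mk_i \cdot (1 - \bar{v}_i)\, v_i
\end{align}
```

The term RM + MK is added to the objective. Both sums are linear in the variables v_i because \bar{v}_i is a parameter, so the model class (ILP/BQP) of the underlying problem is preserved.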

Implementation and Evaluation
In this section, we demonstrate the applicability of our proposed models. The considered setup and the input data are described in Section "Setup and Scalable Test Datasets". In Section "Results for Basic Index Selection Problems", we reflect on the scalability of the basic approach. Then, in Section "Results for Index Chunking", we investigate when index chunking is beneficial for the performance compared to the basic approach and reflect on the cost trade-off that the heuristic entails. Afterward, in Section "Results for Multi-Index Configurations", we determine the computational overhead of the multi-index extension.

Implementation and Solvers
To evaluate the described approaches, we implemented our different models using AMPL, which is a modeling language tailored for working with optimization problems [8]. The syntax of AMPL is quite close to the algebraic notation and should be easy to read and understand, even for readers who have never seen AMPL syntax before. AMPL itself translates the given syntax into a format that solvers can understand. The solver is a separate program that needs to be specified by the developer. Our approach is based on linear/quadratic programming with integer variables. Solvers such as CPLEX and Gurobi are suited to solve our models. Several tests showed that Gurobi is faster than CPLEX in most cases, which is why we used Gurobi. All code files are available at GitHub. Our implementation in AMPL allows the reader to evaluate the different approaches and to reproduce our results, see Section "Implementation and Evaluation". The .run-file contains information about the selected solver and loads the specified model and data specifications; after a given problem has been solved, the solution is displayed. The .mod-file contains the description of the mathematical model, i.e., the parameters, constraints, and objectives used. The input data required for solving a certain problem are specified in the .dat-file.

Setup and Scalable Test Datasets
All performance measurements were performed on the same machine, featuring an Intel i5 8th generation CPU (4 cores) and 8 GB of memory. All measurements were repeated three times. For each time measurement, we used the AMPL built-in function _total_solve_time, which returns the user and system CPU seconds used by all solve commands [9]. The final value was determined as the mean of the three measurement results. All non-related applications were closed to reduce any side effects of the operating system.
The datasets used for the evaluation are generated randomly, using multiple fixed random seeds. Each dataset is defined by the number of indexes, the number of queries, and the available memory budget. The algorithm provided in the indexselection.data-file then generates the execution time of each query, depending on the utilized index. First, the "original" execution time for the query without using any index is chosen randomly within the interval [10; 100]. Based on the drawn costs, the speedup for each index is calculated by choosing a random value between the "original" costs and a 90% speedup. The memory consumption of an index is an integer between 1 and 20. The frequencies can be between 1 and 1,000.
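A sketch of this generation process in Python (the helper name and the seed are ours; the value ranges follow the text):

```python
import random

def generate_dataset(num_queries, num_indexes, seed=42):
    """Generate a reproducible synthetic dataset as described above:
    base execution times in [10; 100], per-index times between the base
    time and a 90% speedup, index memory in [1; 20], frequencies in [1; 1000]."""
    rng = random.Random(seed)
    times = {}  # times[(q, i)]: execution time of query q using index i
    for q in range(num_queries):
        base = rng.uniform(10, 100)        # "original" time, no index (i = 0)
        times[(q, 0)] = base
        for i in range(1, num_indexes + 1):
            times[(q, i)] = rng.uniform(0.1 * base, base)
    memory = {i: rng.randint(1, 20) for i in range(1, num_indexes + 1)}
    freq = {q: rng.randint(1, 1000) for q in range(num_queries)}
    return times, memory, freq
```

Using a fixed seed makes each dataset reproducible, mirroring the "multiple fixed random seeds" of the indexselection.data-file.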
The extensions that are applied on top of the basic approach introduce further variables that need to be generated. For the stochastic workload extension, we introduced a workload intensity, which is drawn randomly for each workload. This also applies to the transition cost extension, where the creation costs and removal costs are random. The multi-index configurations package requires a more complex generation process since each index configuration should be a unique set of indexes. Configuration zero represents the option that no index is used. The configurations 1 to I point to their respective single index. All other generated configurations consist of up to two indexes, where the combinations are drawn randomly. By using a second data structure, it is ensured that no index combination is used multiple times. The speedup s for a combination consisting of two indexes i and j is then calculated via the formulas (11)-(12): the minimum speedup and the maximum speedup are passed to a function that returns a uniformly distributed random number within the continuous interval [min_speedup_{i,j}; max_speedup_{i,j}].
Outsourcing the generation of input data into the indexselection.data-file allows for an easy replacement with actual data, e.g., benchmarking data of a real system. However, this also enables the reader to validate basic example cases on their own.

Results for Basic Index Selection Problems
"The complexity of the view-and index-selection problem is significant and may result in high total cost of ownership for database systems" [10]. In this section, we evaluate our basic solution, cf. Section "Basic Problem". We show the scalability of our implemented solution, which we later compare to the chunking approach. We set up the memory limit with 100 units and assume 100 queries with random occurrences uniform between 1 and 1000. To test the scalability on our machine, we generate data with 50, 100, 150,..., 1450, 1500 index candidates. The outcome measurements are shown in Fig. 1. The total solve time on the y-axis is a logarithmic scale. Figure 1 shows the growing total solve time while the number of indexes rises. With an increasing index count, the execution times vary more. Naturally, the solve time depends on the specific generated data input. In some cases with over 1000 indexes, the generated input could not be solved with our setup in a meaningful time. Note that the possible combinations of the index selection problem grows exponentially. To limit the number of index candidates, one might only consider smaller subsets of a workload's queries that are responsible for the majority of the workload costs.

Results for Index Chunking
To tackle the exponentially growing number of admissible index combinations, we divide the problem into chunks, find the best indexes of each chunk, and then find the best indexes among the winners of all chunks, cf. Section "Chunking Extension". Compared to the basic index selection solution, problems with many more indexes can be solved with chunking.
Figure 2 shows the total solve time for using different numbers of chunks between 1 and 50 (the case without chunking is shown in blue). In this experiment, all other parameters were fixed (see the previous Section "Results for Basic Index Selection Problems"). The dots of the different chunking versions show a linear relationship for index counts between 500 and 2500. In the beginning, the total solve time of the chunking curve has a higher slope.
The overhead introduced by chunking, i.e., splitting the index candidates into chunks, does not always have a positive impact on the total execution time of our linear program. The chunking solution shows a lower scattering than the basic solution: in each execution, some chunks could be solved faster than the mean while others needed more time. We observe that long and short solve times of single chunks balance each other out, and chunking thus leads to lower variation.
As described in Section "Chunking Extension", the heuristic chunking approach might cause the final solution not to be an optimal global solution of the initial problem. In this context, Table 4 shows the total cost growth of the found solution compared to the optimal solution. The lower the chunk count, the higher the mean and the maximum growth. With a lower chunk count, the possibility that an index of the optimal solution is not a winner of the related chunk is higher. The more chunks, the more indexes get on to the final round.
Chunking reduces the total solve time, and fewer outliers with very long execution times occur. The degradation of the calculated solution is surprisingly low: we observe that the total workload cost growth is consistently below 1%.

Results for Multi-index Configurations
In this section, we evaluate the potential solve time overhead introduced by the multi-index configuration extension, see Section "Extension to Multi-Index Configurations". To this end, we compare the solve times of the extension with those of the basic approach. For both implementations, we tested multiple settings. A setting is defined by the number of indexes, which is one of 10, 50, 100, 200, 500, or 1000. Independent of the setting, we assume a memory budget of 100 units and 100 queries. Figure 3 compares the solve times of both implementations.
Overall, as expected, the multi-index configuration extension has an execution time overhead compared to the basic solution, which assigns at most only one index to each query. However, the additional required solve time is acceptable. Further, the relative solution time overhead decreases with an increasing number of indexes. One explanation for this

Robust Selection Approaches for Stochastic Workloads
Until now, we considered a single given workload only. However, in the context of enterprise applications, we observe that each day has a different workload. Further, the workloads at the end of the week or on the weekend can systematically contain fewer or different compositions of requests compared to workloads during the rest of the week. Moreover, future workloads are not perfectly plannable and can be seen as stochastic. Hence, a good index selection should be optimized such that it provides suitable results for a variety of potential workload scenarios. In this section, we propose an approach that can take multiple workloads into account. The solution seeks to provide a robust index selection, where robust means that the performance is good no matter which of the potential workloads occurs. In the following, we assume W potential workload scenarios, e.g., for the next period of time. We denote the probability of workload w by P_w, w = 1, ..., W, where ∑_{w=1,...,W} P_w = 1. A workload scenario is characterized by a set of queries, which occur with given frequencies within a certain time frame.

Risk Neutral Baseline Approach
As a standard baseline approach, we use the risk-neutral model. The expected total workload cost T across all workloads is calculated as, cf. (13),

T = ∑_{w=1,...,W} P_w ⋅ g_w,    (13)

where we use the untuned execution time g_w^{(0)} := ∑_{q=1,...,Q} t_{q,0} ⋅ f_{w,q} (i.e., the workload costs when no indexes are used) to define the normalized execution time g_w of a workload w, with f_{w,q} being the frequency of query q in workload w, w = 1, ..., W, i.e.,

g_w := (∑_{q=1,...,Q} ∑_{c=0,...,C} u_{q,c} ⋅ t_{q,c} ⋅ f_{w,q}) / g_w^{(0)}.    (14)

The information whether a configuration c is used for a query q of a workload w is shared between the workloads, leaving it to the solver to minimize the total costs across all workloads. In this context, Fig. 4 shows exemplary total workload costs when minimizing the expected execution time T with W = 4 workload scenarios, cf. W1-W4, for some distribution P. Following the risk-neutral objective, it is typical that the actual costs of the individual workloads differ a lot, leading, e.g., to poor performance for some workloads (W1 and W3) in favor of others (W2 and W4).
Next, we propose different risk-averse approaches to make the index selection performance more robust.

Weighted Worst-Case Optimization Approach
We include the worst-case performance by penalizing the maximum total workload costs as additional costs besides the mean T. The maximum workload costs L (modeled as a continuous variable) are determined by the constraint

L ≥ g_w    ∀w = 1, ..., W.    (15)

The following (mixed) ILP, cp. (1)-(5), now includes this maximum workload cost L in the objective using the penalty factor a ≥ 0, cf. (13)-(14), i.e., the objective is to minimize T + a ⋅ L subject to (15).

Note that L does not depend on the workload distribution P. Further, for a → ∞, we obtain the pure worst-case optimization criterion. Figure 5 illustrates a typical solution leading to better worst-case costs, cp. Fig. 4. However, the costs of W2 and W3 increased, also leaving a bigger gap between W4 and the rest.

Mean-Variance Optimization Approach
To obtain even (and plannable) performances, we apply the mean-variance optimization criterion, i.e., we weight the mean T and the variance V of the workload costs, cf. (13)-(14), defined by V = ∑_{w=1,...,W} P_w ⋅ (g_w − T)², as a penalty to minimize the scenarios' cost differences. Now, the factor b and the term b ⋅ V are used in the objective, cp. (16)-(17). Remark that the problem, cf. (18), becomes a binary quadratic programming (BQP) problem when using the variance V in the penalty term. Using the mean-variance criterion (18) may lead to results as illustrated in Fig. 6. Typically, all costs tend to be within a similar range. However, in comparison to the previous figures, not only W1, W2, and W3 have been brought into a plannable range, but also W4. The total costs of W4, however, may not be reduced, which makes the result indeed more robust but less effective in the end. An alternative option to resolve this issue is to use the semivariance instead of V: the semivariance penalizes only those workloads whose costs are higher than the mean cost of all workloads, i.e., workloads with lower costs would not increase the applied penalty.
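The difference between the variance and the semivariance penalty can be sketched numerically; the scenario costs and probabilities below are hypothetical values for four workloads W1-W4:

```python
def variance_penalty(costs, probs):
    """Symmetric penalty: any deviation from the mean is penalized."""
    mean = sum(p * c for p, c in zip(probs, costs))
    return sum(p * (c - mean) ** 2 for p, c in zip(probs, costs))

def semivariance_penalty(costs, probs):
    """One-sided penalty: only costs above the mean are penalized."""
    mean = sum(p * c for p, c in zip(probs, costs))
    return sum(p * max(c - mean, 0.0) ** 2 for p, c in zip(probs, costs))

# Hypothetical normalized workload costs for scenarios W1..W4
costs, probs = [0.8, 0.5, 0.7, 0.2], [0.25] * 4
print(variance_penalty(costs, probs) > semivariance_penalty(costs, probs))  # True
```

The semivariance leaves the cheap scenario (cost 0.2) unpenalized, so an optimizer using it is not pushed to worsen already-good workloads just to reduce spread.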

Expected Utility Using Piecewise Linear Functions
In Sections "Weighted Worst-Case Optimization Approach" and "Mean-Variance Optimization Approach", we presented risk-averse objectives that minimize an upper bound to optimize the execution time of the worst workload or use the mean-variance criterion to achieve robust execution times with small deviations. However, optimizing the mean-variance criterion also penalizes execution times that are better than the average execution time, which can be a disadvantage as short execution times are desirable from a database perspective.
As an alternative to model more flexible risk-averse objectives, we use utility functions, which allow us to identify index selections such that all workload scenarios are addressed but a higher focus is put on improving scenarios with high costs. However, the included non-linearity is a challenge for solvers, particularly as we have an integer problem. To resolve this issue in a pragmatic way, we model the non-linear objective by using piecewise linear approximations.
As we have a cost minimization problem, we use the scenario-specific costs g_w^(0), representing the total workload costs of workload w when no indexes are used at all, as an upper bound. Hence, the (normalized) benefit of using indexes for workload w can be expressed as z_w := (g_w^(0) − g_w) / g_w^(0), cf. (14). In this context, we consider continuous concave utility functions u(z), z ≥ 0, to measure the decreasing marginal utility of the benefit of a workload. Further, to approximate u(z), we use 1000 equidistant values between 0 and 1 for a piecewise linear representation of u denoted by ũ. Finally, the objective is to maximize the expected utility of benefits over all workload scenarios, i.e., ∑_{w=1,...,W} P_w · ũ(z_w), cf. (20). Note that the piecewise linear integer problem (20) can be solved efficiently using standard solvers. To sum up, the risk-averse approaches proposed in this section enable us to use multiple workloads (e.g., observed in the past) to optimize index selections for potential/stochastic scenarios under risk considerations.
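The piecewise linear approximation ũ with 1000 equidistant breakpoints can be sketched as follows. The choice of u(z) = sqrt(z) as the concave utility and the benefit values z_w are assumptions for illustration only; in a solver, ũ would be encoded via piecewise-linear constraints rather than evaluated directly.

```python
import math

# Piecewise linear approximation u~ of a concave utility on [0, 1]
# with 1000 equidistant breakpoints; u(z) = sqrt(z) is an assumed example.
N = 1000
zs = [k / N for k in range(N + 1)]   # equidistant breakpoints
us = [math.sqrt(z) for z in zs]      # exact utility at the breakpoints

def u_tilde(z):
    """Evaluate the piecewise linear interpolant u~ at z in [0, 1]."""
    k = min(int(z * N), N - 1)       # index of the active segment
    t = z * N - k                    # relative position within the segment
    return (1 - t) * us[k] + t * us[k + 1]

# Expected utility of (normalized) benefits z_w over workload scenarios
probs = [0.5, 0.3, 0.2]              # hypothetical probabilities P_w
benefits = [0.36, 0.81, 0.09]        # hypothetical z_w values
expected_u = sum(p * u_tilde(z) for p, z in zip(probs, benefits))
print(expected_u)
```

Because u is concave, the interpolation error of ũ is largest where u curves most (near z = 0); 1000 segments keep this error negligible in practice.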

Numerical Evaluation for Stochastic Dynamic Workloads
In this section, we evaluate the performance of the risk-averse selection approaches proposed in Section "Robust Selection Approaches for Stochastic Workloads". Our evaluation uses the workloads of the established standard database benchmarks TPC-H and TPC-DS, which are introduced in Section "Setup and Model Input". Further, in Section "Experiment with Stochastic Dynamic Workload Scenarios", we define our experiments with stochastic dynamic workload scenarios, which are based on Section "Setup and Model Input" and different randomized query frequencies. In Section "Evaluation Results for Stochastic Dynamic Workloads (In-Sample)", we evaluate our index selection models for stochastic workloads, where we use drawn samples of dynamic workload data, which serve as exemplary historic data. Further, in Section "Evaluation Results for Stochastic Dynamic Workloads (Out-of-Sample)", we verify the robustness of the selections obtained in Section "Evaluation Results for Stochastic Dynamic Workloads (In-Sample)" by evaluating them for unseen generated out-of-sample workloads. Finally, in Section "Discussion", we discuss the results obtained.

Setup and Model Input
We consider the case with at most one index per query. To generate realistic workload data, we set up a PostgreSQL 11 [1] database system with the extension HypoPG [2]. HypoPG allows creating hypothetical indexes and estimating their sizes. We loaded TPC-H and TPC-DS tables with scale factor 1. For the TPC-H queries (Q = 22, I = 275) and the TPC-DS queries (Q = 99, I = 1 422) and their various index candidates (with 1-2 attributes), we derived query costs t_q,i, i = 0, ..., I, by calling PostgreSQL's EXPLAIN on the query template string using fixed parameters, see [17]. Index sizes for the required memory, cf. m_i, are modeled using HypoPG's function hypopg_relation_size(). The obtained model inputs to reproduce the calculations are available online.

Experiment with Stochastic Dynamic Workload Scenarios
We consider workloads for different days of the week, referred to by j = 0, ..., 6. Further, we generate workload data for 20 weeks as historic data. Hence, for days of type j, we consider W_j = 20 historic workload scenarios. Without loss of generality, for all j, we use uniform probabilities P_w^(j) = 1/20 for the workload scenarios w, w = 1, ..., W_j, W_j = 20, j = 0, ..., 6. To generate random seasonal patterns for query q and the type of day j, we use skewed random numbers z_q,j := 1/t_q,0 · Uniform(0, 0.1)^(1/0.1). Based on z_q,j, for the 20 days of type j, we use the randomized query frequencies f̃_q,j,k := z_q,j · Uniform(0, 0.2)^(1/0.2) for all k = 1, ..., 20. The total workload costs of day j in week k amount to C_j,k^(0) = ∑_{q=1,...,Q} f̃_q,j,k · t_q,0. As we seek to optimize index selections against different mixtures of workloads and not primarily against different amplitudes of load, in our experiment, we normalize the overall load, i.e., in our models for day j in week k, we consider the final query frequencies f_q,j,k := f̃_q,j,k · C_1,1^(0) / C_j,k^(0). For day j in week k without using indexes, we thus have the untuned total workload costs ∑_{q=1,...,Q} f_q,j,k · t_q,0. Finally, for the memory budget M, we use a fixed fraction of the total index candidate size ∑_{i=1,...,I} m_i, determined by a given budget parameter in [0, 1].
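The generation and normalization procedure above can be sketched compactly. The toy dimensions (Q = 5 queries) and the random base costs t_q,0 are assumptions to keep the example small; the skewing transform Uniform(0, h)^(1/h) and the load normalization follow the description in the text.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

Q, DAYS, WEEKS = 5, 7, 20   # toy dimensions (the paper uses Q = 22 / 99)
t0 = [random.uniform(1.0, 10.0) for _ in range(Q)]  # untuned costs t_q,0

def skewed(h):
    """Skewed random factor Uniform(0, h)^(1/h), as used in the text."""
    return random.uniform(0.0, h) ** (1.0 / h)

# Seasonal pattern z_q,j per query and type of day
z = [[skewed(0.1) / t0[q] for q in range(Q)] for _ in range(DAYS)]
# Raw weekly frequencies f~_q,j,k and untuned totals C^(0)_{j,k}
f_raw = [[[z[j][q] * skewed(0.2) for q in range(Q)]
          for _ in range(WEEKS)] for j in range(DAYS)]
C0 = [[sum(f_raw[j][k][q] * t0[q] for q in range(Q))
       for k in range(WEEKS)] for j in range(DAYS)]
# Normalize so that every (j, k) workload has the same total load C^(0)_{1,1}
f = [[[f_raw[j][k][q] * C0[0][0] / C0[j][k] for q in range(Q)]
      for k in range(WEEKS)] for j in range(DAYS)]
```

After normalization, the scenarios differ only in their query mixtures, not in their overall load, which matches the stated goal of the experiment.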

Evaluation Results for Stochastic Dynamic Workloads (In-Sample)
For generated workload scenarios over 20 weeks, as described in Section "Experiment with Stochastic Dynamic Workload Scenarios", for each type of day j, j = 0, ..., 6, we separately apply the models proposed in Section "Robust Selection Approaches for Stochastic Workloads", which are based on different optimization criteria. Besides the risk-neutral model (RN), the weighted worst-case approach (WWC), the mean-variance optimization model (MVO), and the expected utility model (EU), we consider, as another baseline, the approach that optimizes the selection for a certain single workload (SW) only. For this, we used the last workload of the day's type. Alternatively, one could also use the most likely scenario, cf. P_w, or the average workload over all types of days.
The results of Table 5 are based on the TPC-H benchmark (cf. Q = 22 queries, I = 275 index candidates) for a budget M determined by a budget fraction of 0.0017. For the WWC model, we used a = 100 to essentially optimize the worst case L. For the MVO model, we used b = 5. For the EU model, we used the utility function u(z) = z^0.5. For the SW approach, for each type of day j, we used the workload of week k = 1. Table 5 shows the results when optimizing for the five different approaches. Each model was solved seven times, once for each of the seven types of days j, for the 20 associated workload scenarios each. The numbers in Table 5 are average values of the seven performance values for the seven different types of days of the week. Further, all selections were also evaluated (as averages) for the criteria not optimized for. A row contains results that are optimized for the criterion given in the first column. The different columns contain an evaluation of the associated index selection with respect to all different objectives. Hence, within the different columns, the value on the diagonal (marked in bold style) represents the best possible objective value. Note that all criteria are minimizations except the EU approach, where large objective values are desirable. As expected, we observe that, e.g., the WWC approach provides the best worst-case objective (L = 0.976225) but falls short in the RN performance. Here, of course, the RN approach provides the best results (0.743616) but performs comparably badly regarding the WWC or MVO criterion. Naturally, optimizing for one single scenario only, cf. SW, can lead to fairly random results when evaluating other criteria. Overall, in this experiment, the EU approach provides good results regarding all criteria. Table 6 contains corresponding results based on the larger TPC-DS benchmark (cf. Q = 99 queries, I = 1 422 index candidates), where we used the same model parameters, i.e., a budget fraction of 0.0017, a = 100, b = 5, and a utility exponent of 0.5.
The optimization of the MVO model is excluded as its computation was not possible within 1 h, due to its non-linear (quadratic) structure. The solutions for all other models could be computed within single seconds. The results obtained are similar to the TPC-H case. Further, they show the applicability of our linear models for larger workloads. Table 7 shows the average performance results of the determined index selections associated with Table 5 when evaluated for 20 new unseen workload scenarios for each type of day j. Those new scenarios respect the same seasonality pattern, cf. the realized z_q,j values, but are based on new i.i.d. random variables for f (see Section "Experiment with Stochastic Dynamic Workload Scenarios"). The 20 new unseen scenarios can be seen as potential workloads that effectively occur on the next day of a certain type j. Those scenarios serve as an out-of-sample validation of our (robust) index selections that have been computed based on 20 seen historic workloads, cf. Section "Evaluation Results for Stochastic Dynamic Workloads (In-Sample)".

Evaluation Results for Stochastic Dynamic Workloads (Out-of-Sample)
In Table 7, again, the performance results for the different objectives are averaged over all seven types of days. The performance values on the diagonal (italicized) show the results for the criterion that was optimized, albeit based on similar but different workload data. Overall, these values are among the best (within one column). Naturally, it is possible that selections associated with one of the other criteria accidentally provide better results for the new data.
To finally determine the quality of our out-of-sample results for the different objectives, we compare the results obtained in Table 7 to the best possible results that could have been obtained when (i) knowing all the 20 new scenarios for all types of days and (ii) computing optimized solutions as done in Table 5, but now for the unseen data. Those upper bound results are shown in bold in Table 8, including their evaluations for the non-optimized criteria. The results confirm that the italicized numbers of Table 7 are indeed close to the bold numbers of Table 8, which verifies the quality of our models to determine the intended robust index selections. Similar to Tables 7 and 8, Tables 9 and 10 contain the results for the TPC-DS benchmark. The optimization of the MVO model is again excluded due to runtime issues. More importantly, the results show that the quality and robustness of our linear solutions also hold for larger workloads.
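The out-of-sample protocol described above (optimize on historic scenarios, evaluate on fresh ones, compare against the hindsight optimum computed with full knowledge of the fresh scenarios) can be mimicked on a toy instance. All numbers and the additive cost model are hypothetical; by construction, the gap to the hindsight optimum is nonnegative, which is exactly the upper-bound property used in Table 8.

```python
from itertools import combinations
import random

random.seed(7)

# Toy instance: 5 index candidates, 20 scenarios per sample.
I_CAND, W = 5, 20
sizes = [random.randint(1, 4) for _ in range(I_CAND)]  # memory m_i (toy)
budget = 6                                             # memory budget M (toy)

def draw_scenarios(n):
    """Draw n hypothetical workload scenarios (base costs and savings)."""
    base = [100.0 + random.uniform(-5, 5) for _ in range(n)]
    sav = [[random.uniform(0, 15) for _ in range(I_CAND)] for _ in range(n)]
    return base, sav

def mean_cost(sel, base, sav):
    """Risk-neutral objective: average total workload cost over scenarios."""
    return sum(b - sum(sav[w][i] for i in sel)
               for w, b in enumerate(base)) / len(base)

def optimize(base, sav):
    """Brute-force optimal selection within the memory budget."""
    feas = (frozenset(s) for r in range(I_CAND + 1)
            for s in combinations(range(I_CAND), r)
            if sum(sizes[i] for i in s) <= budget)
    return min(feas, key=lambda s: mean_cost(s, base, sav))

train = draw_scenarios(W)        # in-sample ("historic") workloads
test = draw_scenarios(W)         # out-of-sample workloads
sel_train = optimize(*train)     # selection computed from historic data
sel_hindsight = optimize(*test)  # upper-bound solution, cf. Table 8
gap = mean_cost(sel_train, *test) - mean_cost(sel_hindsight, *test)
print(gap)  # nonnegative by construction
```

A small gap indicates that the in-sample selection generalizes well to the unseen scenarios, mirroring the comparison of the italicized and bold numbers in Tables 7 and 8.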
Recall that reconfiguration costs, cf. Section "Consideration of Reconfiguration Costs", have been excluded in this experiment so as not to influence the out-of-sample robustness analysis. Since they are modeled linearly, they could be directly included to incrementally adapt index selections dynamically over time by taking into account different workload scenarios of different kinds (from one day/period to another).

Discussion
Overall, we can summarize that our models and extensions are applicable and effective for the index selection problem. Further, in contrast to the synthetic workloads generated in [23], we demonstrated the models' effectiveness with the analytical database workloads of the TPC-H and TPC-DS benchmarks established in the industry and the research community. Being able to (re)allocate indexes in uncertain and changing environments addresses a relatively untapped field and, hence, contributes to existing works. We argue that the obtained results showed the intended characteristics for the following reasons: First, being able to integrate the targeted goals directly in a model's objective is effective. Second, solver-based LP models, in contrast to heuristic or even black-box algorithms, guarantee optimality. The price for optimality is impaired scalability, i.e., our models are limited to small and midsize problem instances.

Conclusion and Future Work
We considered different variants of index selection problems and proposed solver-based solution concepts. For the basic version of the problem, assuming one deterministic workload only, we presented an ILP-based solution and analyzed a chunk-based heuristic to lower the problem complexity in case of large index candidate sets. We found that, compared to the optimal solution of the basic problem, this heuristic performs near-optimally and allows us to significantly reduce the overall solution time.
Further, to address settings where index selections are adaptively chosen over time, e.g., under (expected) changing workload patterns, we endogenized the costs to create and remove indexes in the presence of an existing index configuration, balancing performance against minimal required reconfiguration costs. The extended model remains linear and the numerical overhead is small.
In the second part of the paper, we considered multiple potential workload scenarios in a stochastic framework. To identify robust index selections for such uncertain workload scenarios, we proposed the following risk-averse objectives: (i) weighted worst-case costs (linear), (ii) mean-variance optimization (quadratic), and (iii) expected utility criteria (piecewise linear). Our results show that the different models are applicable, particularly the linear ones, and that the targeted effect of avoiding bad and uneven performances is achieved.
Using numerical experiments based on TPC-H and TPC-DS benchmark workloads, we studied the performance robustness of the risk-averse selection models when trained/optimized on observable historic dynamic workload data and finally applied to a variety of unseen randomly generated (out-of-sample) workloads. We verified the high quality of the results obtained by comparing the achieved performance against upper bounds, which are based on exploiting full knowledge of the out-of-sample workload scenarios.
In future work, further experiments could be performed to study how the performance robustness is affected when using the chunking heuristic or reconfiguration costs. Note that all our concepts can be easily combined. Other research directions may include scalability issues of solver-based approaches as well as the integration of risk-averse objectives in rule-based or self-learning index selection algorithms.

Appendix
See Table 11, which summarizes the decision variables of our models:

- Binary variable whether index i = 0, ..., I is used for query q = 1, ..., Q; i = 0 indicates that no index is used by query q
- u_q,c: Binary variable (for the multi-index extension) whether configuration c = 0, ..., I is used for query q = 1, ..., Q; c = 0 represents an empty configuration with no indexes
- v_i: Binary variable whether index i = 1, ..., I is used (for at least one query)

Funding Open Access funding enabled and organized by Projekt DEAL.

Availability of Data and Materials Not applicable.
Code Availability Not applicable, could be provided.

Declarations
Conflicts of Interest/Competing Interests Not applicable.
Ethics Approval Not applicable.
Consent to Participate Not applicable.

Consent for Publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.