1 Introduction

A self-adaptive system can modify its own structure and behavior at runtime based on its perception of the environment, of itself and of its requirements [9, 24, 34]. An example is a self-adaptive web service which, faced with a sudden increase in workload, may reconfigure itself by deactivating optional system features. An online store, for instance, may deactivate its resource-intensive recommender engine in the presence of a high workload. By adapting itself at runtime, the web service is able to maintain its quality requirements (here: performance) under changing workloads.

To develop a self-adaptive system, software engineers have to develop self-adaptation logic that encodes when and how the system should adapt itself. However, in doing so, software engineers face the challenge of design time uncertainty [6, 45]. Among other concerns, developing the adaptation logic requires anticipating the potential environment states the system may encounter at runtime to define when the system should adapt itself. Yet, anticipating all potential environment states is in most cases infeasible due to incomplete information at design time. As an example, take a service-oriented system which dynamically binds concrete services at runtime. The concrete services that will be bound at runtime and thus their quality are typically not known at design time. As a further concern, the precise effect of an adaptation action may not be known and thus accurately determining how the system should adapt itself is difficult. As an example, while software engineers may know in principle that activating more features will have a negative impact on performance, exactly determining the performance impact is more challenging [36].

Online reinforcement learning is an emerging approach to realize self-adaptive systems in the presence of design time uncertainty. Online reinforcement learning means that reinforcement learning is employed at runtime (see existing solutions discussed in Sect. 7). The system can thus learn from actual operational data and thereby leverage information only available at runtime. In general, reinforcement learning aims to learn suitable actions via an agent’s interactions with its environment [37]. The agent receives a reward for executing an action. The reward expresses how suitable that action was. The goal of reinforcement learning is to optimize cumulative rewards.

1.1 Problem statement

Reinforcement learning faces the exploration–exploitation dilemma [37]. To optimize cumulative rewards, actions should be selected that have been shown to be suitable, which is known as exploitation. However, to discover such actions in the first place, actions that were not selected before should be selected, which is known as exploration. How exploration happens has an impact on the performance of the learning process [4, 13, 37]. We focus on two problems related to how a system’s set of possible adaptation actions, i.e., its adaptation space, is explored.

Random exploration: Existing online reinforcement learning solutions for self-adaptive systems propose randomly selecting adaptation actions for exploration (see Sect. 7).

The effectiveness of exploration therefore directly depends on the size of the adaptation space, because each adaptation action has an equal chance of being selected. Some reinforcement learning algorithms can cope with a large space of actions, but require that the space of actions is continuous in order to generalize over unseen actions [29]. Self-adaptive systems may have large, discrete adaptation spaces. Examples include service-oriented systems, which may adapt by changing their service compositions [28], or reconfigurable software systems, which may adapt by changing their active set of features at runtime [23]. A simple example is a service composition consisting of eight abstract services that allows dynamically binding two concrete services each. Assuming no temporal or logical constraints on adaptation, this gives \(2^{8} = 256\) possible adaptations. In the presence of such large, discrete adaptation spaces, random exploration may lead to slow learning at runtime [4, 13, 37].

Evolution-unaware exploration: Existing online reinforcement learning solutions are unaware of system evolution [20]. They do not consider that a self-adaptive system, like any software system, typically undergoes evolution [15]. In contrast to self-adaptation, which refers to the automatic modification of the system by itself, evolution refers to the manual modification of the system [24]. Due to evolution, the adaptation space may change, e.g., existing adaptation actions may be removed or new adaptation actions may be added. Some reinforcement learning algorithms can cope with environments that change over time (non-stationary environments) [29, 37]. However, they cannot cope with changes of the adaptation space. Existing solutions thus explore new adaptation actions only with low probability (as all adaptation actions have an equal chance of being selected), and may therefore take a long time until new adaptation actions have been explored.

Thus, this paper addresses two problems of exploration in online reinforcement learning for self-adaptation: (1) coping with large discrete adaptation spaces and (2) coping with changes of the adaptation space due to evolution.

1.2 Contributions

We introduce exploration strategies for online reinforcement learning that address the above two problems. Our exploration strategies use feature models [25] to give structure to the system’s adaptation space and thereby leverage additional information to guide exploration. A feature model is a tree or a directed acyclic graph of features, organized hierarchically. An adaptation action is represented by a valid feature combination specifying the target run-time configuration of the system.

Our strategies traverse the system’s feature model to select the next adaptation action to be explored. By leveraging the structure of the feature model, our strategies guide the exploration process. In addition, our strategies detect added and removed adaptation actions by analyzing the differences between the feature models of the system before and after an evolution step. Adaptation actions removed as a result of evolution are no longer explored, while added adaptation actions are explored first.

This article has been substantially extended from our earlier conference publication [26] and provides the following main new contributions:

Broader scope: We extended the scope to cover self-adaptive software systems, thereby generalizing from the self-adaptive services focused on in [26]. This is reflected by providing a conceptual framework for integrating reinforcement learning into the MAPE-K reference model of self-adaptive systems, by adding an additional subject system from a different domain, and by expanding the discussion of related work.

Additional reinforcement learning algorithm: In addition to integrating our strategies into the Q-Learning algorithm, we integrate them into the SARSA algorithm. These two algorithms differ with respect to how the knowledge is updated during the learning process. Q-Learning updates the knowledge on the basis of the best possible next action. SARSA updates the knowledge on the basis of the action that the already learned policy takes [37]. As a result, Q-Learning tends to perform better in the long run. However, SARSA is better at avoiding expensive adaptations. If, for a given system, executing “wrong” adaptations is expensive, then SARSA is more appropriate, otherwise Q-Learning is preferable. Our strategies work for both algorithms.

Additional subject system: In addition to the adaptive cloud service in [26], we validate our approach with a reconfigurable database system. The two systems differ in terms of their adaptation space, the structure of their feature model, and their quality characteristics (response time instead of energy and virtual machine migrations), thereby contributing to the external validity of our experiments.

In what follows, Sect. 2 explains fundamentals of feature models and self-adaptation, explains the integration of reinforcement learning into the MAPE-K reference model, as well as introduces a running example. Section 3 describes our exploration strategies and how they are integrated with the Q-Learning and SARSA algorithms. Section 4 presents the design of our experiments, and Sect. 5 presents our experimental results. Section 6 provides a discussion of current limitations and assumptions. Section 7 analyzes related work. Section 8 provides a conclusion and outlook on future work.

2 Fundamentals

2.1 Feature models and self-adaptation

A feature model is a tree of features organized hierarchically [25] and describes the possible and allowed feature combinations. A feature f can be decomposed into mandatory, optional or alternative sub-features. If feature f is activated, its mandatory sub-features have to be activated, its optional sub-features may or may not be activated, and at least one of its alternative sub-features has to be activated. Additional cross-tree constraints express inter-feature dependencies. A feature model can be used to define a self-adaptive system’s adaptation space, where each adaptation action is expressed in terms of a possible runtime configuration, i.e., feature combination [12, 16].

Fig. 1 Feature model and adaptation of example web service

Figure 1 shows the feature model of a self-adaptive web service as a running example. The DataLogging feature is mandatory (which means it is always active), while the ContentDiscovery feature is optional. The DataLogging feature has three alternative sub-features, i.e., at least one data logging sub-feature must be active: Min, Medium or Max. The ContentDiscovery feature has two optional sub-features Search and Recommendation. The cross-tree constraint Recommendation \(\Rightarrow \) Max \(\vee \) Medium specifies that a sufficient level of data logging is required to collect enough information about the web service’s users and transactions to make good recommendations.

Let us consider that the above web service should adapt to a changing number of concurrent users to keep its response time below 500 ms. A software engineer may express an adaptation rule for the web service such that it turns off some of its features in the presence of more users, thereby reducing the resource needs of the service. The right-hand side of Fig. 1 shows a concrete example for such an adaptation. If the service faces an environment state of more than 1000 concurrent users, the service self-adapts by deactivating the Search feature.
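To make the running example more tangible, the following sketch shows one possible encoding of the feature model from Fig. 1 and of the above adaptation rule in plain Java. The class and method names (WebServiceConfig, isValid, adapt) are purely illustrative and not part of our implementation.

```java
import java.util.Set;

// Illustrative encoding of the feature model from Fig. 1 (names are hypothetical).
final class WebServiceConfig {

    // A runtime configuration is represented as a set of activated feature names.
    static boolean isValid(Set<String> features) {
        // DataLogging is mandatory.
        if (!features.contains("DataLogging")) return false;
        // At least one of the alternative logging sub-features must be active.
        boolean loggingLevel = features.contains("Min")
                || features.contains("Medium") || features.contains("Max");
        if (!loggingLevel) return false;
        // Search and Recommendation require their parent feature ContentDiscovery.
        if ((features.contains("Search") || features.contains("Recommendation"))
                && !features.contains("ContentDiscovery")) return false;
        // Cross-tree constraint: Recommendation => Max v Medium.
        if (features.contains("Recommendation")
                && !(features.contains("Max") || features.contains("Medium"))) return false;
        return true;
    }

    // Example adaptation rule from Fig. 1: deactivate Search above 1000 concurrent users.
    static void adapt(Set<String> activeFeatures, int concurrentUsers) {
        if (concurrentUsers > 1000) {
            activeFeatures.remove("Search");
        }
    }
}
```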

2.2 Reinforcement learning and self-adaptation

As illustrated in Fig. 2a, reinforcement learning aims to learn an optimal action selection policy via an agent’s interactions with its environment [37].

Fig. 2 Integration of reinforcement learning into the MAPE-K reference model: a basic reinforcement learning model, b MAPE-K model, c integrated model

At a given time step t, the agent selects an action a (from its adaptation space) to be executed in environment state s. As a result, the environment transitions to \(s'\) at time step \(t+1\) and the agent receives a reward r for executing the action. The reward r together with the information about the next state \(s'\) is used to update the action selection policy of the agent. The goal of reinforcement learning is to optimize cumulative rewards. As mentioned in Sect. 1, a trade-off between exploitation (using current knowledge) and exploration (gathering new knowledge) must be made. That is, to optimize rewards, actions should be selected that have been shown to be useful (exploitation), but to discover such actions in the first place, actions that were not selected before must also be selected (exploration).

A self-adaptive system can conceptually be structured into two main elements [19, 34]: the system logic (aka. the managed element) and the self-adaptation logic (aka. the autonomic manager). To understand how reinforcement learning can be leveraged for realizing the self-adaptation logic, we use the well-established MAPE-K reference model for self-adaptive systems [9, 44]. As depicted in Fig. 2b, MAPE-K structures the self-adaptation logic into four main conceptual activities that rely on a common knowledge base [17]. These activities monitor the system and its environment, analyze monitored data to determine adaptation needs, plan adaptation actions, and execute these adaptation actions at runtime.

Figure 2c depicts how the elements of reinforcement learning are integrated into the MAPE-K loop.

For a self-adaptive system, “agent” refers to the self-adaptation logic of the system and “action” refers to an adaptation action [30]. In the integrated model, action selection of reinforcement learning takes the place of the analyze and plan activities of MAPE-K. The learned policy takes the place of the self-adaptive system’s knowledge base. At runtime, the policy is used by the self-adaptation logic to select an adaptation action a based on the current state s determined by monitoring. The action selected using the policy may be either to leave the system in the current state (i.e., no need for adaptation), or a specific adaptation, which is then executed.
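The following minimal sketch illustrates this integrated model: the reinforcement learning interaction loop of Fig. 2a, with action selection replacing the analyze and plan activities and the learned policy serving as the knowledge base. The interfaces Policy, Monitor and AdaptationExecutor are hypothetical placeholders used only for illustration.

```java
// Minimal sketch of the integrated model of Fig. 2c. Policy, Monitor and
// AdaptationExecutor are hypothetical interfaces used only for illustration.
interface Policy {
    String selectAction(String state);                        // exploit or explore adaptation actions
    void update(String s, String a, double r, String sNext);  // learn from the observed reward
}

interface Monitor { String currentState(); double reward(); }

interface AdaptationExecutor { void execute(String adaptationAction); }

final class SelfAdaptationLogic {
    static final String NO_ADAPTATION = "NO_ADAPTATION";      // leave the system as it is

    void run(Policy policy, Monitor monitor, AdaptationExecutor executor) {
        String state = monitor.currentState();                 // monitor activity
        while (true) {                                          // simplified: no termination handling
            // Action selection takes the place of the analyze and plan activities.
            String action = policy.selectAction(state);
            if (!NO_ADAPTATION.equals(action)) {
                executor.execute(action);                       // execute activity
            }
            String nextState = monitor.currentState();          // monitor activity
            double reward = monitor.reward();
            policy.update(state, action, reward, nextState);    // learned policy = knowledge base
            state = nextState;
        }
    }
}
```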

3 Feature-model-guided exploration (FM-guided exploration)

As motivated in Sect. 1, our exploration strategies use feature models (FM) to guide the exploration process. We first explain how these FM-guided exploration strategies can be integrated into existing reinforcement learning algorithms. Thereby, we also provide a realization of the integrated conceptual model from Sect. 2. We then introduce the realization of the actual FM-guided exploration strategies.

3.1 Integration into reinforcement learning

We use two well-known reinforcement learning algorithms for integrating our FM-guided exploration strategies: Q-Learning and SARSA. We chose Q-Learning, because it is the most widely used algorithm in the related work (see Sect. 7). We chose SARSA, as it differs from Q-Learning with respect to how the knowledge is updated during learning. Q-Learning (an off-policy algorithm) updates the knowledge based on selecting the next action which has the highest expected reward [37]. SARSA (an on-policy algorithm) updates the knowledge based on selecting the next action by following the already learned action selection policy.

Algorithm 1 shows the extended Q-Learning algorithm. A value function \(Q(s,a)\) represents the learned knowledge, giving the expected cumulative reward when performing an action a in a state s [37]. There are two hyper-parameters: the learning rate \(\alpha \), which defines to what extent newly acquired knowledge overwrites old knowledge, and the discount factor \(\gamma \), which defines the relevance of future rewards. After the initialization (lines 2–3), the algorithm repeatedly selects the next action (line 5), performs the action and observes its results (line 6), and updates its learned knowledge and other variables (lines 7–8). Algorithm 2 shows the extended SARSA algorithm, which follows a similar logic. However, while Q-Learning updates the knowledge by selecting the action with the highest Q value (Algorithm 1, line 7), SARSA selects the action according to the current policy (Algorithm 2, line 8).
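For reference, the standard update rules underlying the two algorithms [37] are \(Q(s,a) \leftarrow Q(s,a) + \alpha \, [r + \gamma \max _{a'} Q(s',a') - Q(s,a)]\) for Q-Learning and \(Q(s,a) \leftarrow Q(s,a) + \alpha \, [r + \gamma \, Q(s',a') - Q(s,a)]\) for SARSA, where \(a'\) denotes, in the first case, the action with the highest Q value in \(s'\) and, in the second case, the action actually selected in \(s'\) by the current policy.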

Our strategies are integrated into reinforcement learning in the getNextAction function, which selects the next adaptation action while trading off exploration and exploitation. We use the \(\epsilon \)-greedy strategy as a baseline, as it is a standard action selection strategy in reinforcement learning, widely used in the related work (see Sect. 7). With probability \(1-\epsilon \), \(\epsilon \)-greedy exploits existing knowledge, while with probability \(\epsilon \), it selects a random action. In contrast to random exploration, we use our FM-guided exploration strategies by calling the getNextConfiguration function (Algorithm 1, line 17). To prevent FM-guided exploration from prematurely converging to a local minimum, we follow the literature and introduce a small amount of randomness [31], i.e., we perform random exploration with probability \(\delta \cdot \epsilon \) (lines 15, 16). Here, \(0\le \delta \le 1\) is the probability of choosing a random action, given that we have chosen to perform exploration.

To facilitate convergence of the learning process, we use the \(\epsilon \)-decay approach. This is a typical approach in reinforcement learning, which starts at \(\epsilon = 1 \) and decreases it at a predefined rate \(\epsilon _\mathrm {d}\) after each time step. We also follow this decay approach for the FM-guided strategies to incrementally decrease \(\delta \) with rate \(\delta _\mathrm {d}\).

Algorithm 1: Extended Q-Learning (figure a)
Algorithm 2: Extended SARSA (figure b)
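To illustrate the action selection described above, the following Java sketch combines \(\epsilon \)-greedy exploitation, random exploration with probability \(\delta \cdot \epsilon \), FM-guided exploration via getNextConfiguration, and the decay of \(\epsilon \) and \(\delta \). It is a simplified sketch: the decay is shown as multiplicative, and randomAction and bestKnownAction are placeholders standing in for the corresponding parts of Algorithms 1 and 2.

```java
import java.util.Random;

// Simplified sketch of getNextAction (cf. Algorithm 1): epsilon-greedy action
// selection in which FM-guided exploration largely replaces random exploration.
final class ActionSelection {
    private final Random rnd = new Random();
    private double epsilon = 1.0;       // start with full exploration (epsilon-decay approach)
    private double delta;               // probability of random exploration, given exploration
    private final double epsilonDecay;  // decay rates; shown as multiplicative for illustration
    private final double deltaDecay;

    ActionSelection(double delta, double epsilonDecay, double deltaDecay) {
        this.delta = delta;
        this.epsilonDecay = epsilonDecay;
        this.deltaDecay = deltaDecay;
    }

    String getNextAction(String state) {
        String action;
        if (rnd.nextDouble() < epsilon) {                 // explore
            if (rnd.nextDouble() < delta) {
                action = randomAction(state);             // random exploration, probability delta * epsilon
            } else {
                action = getNextConfiguration(state);     // FM-guided exploration (e.g., Algorithm 3)
            }
        } else {                                          // exploit
            action = bestKnownAction(state);              // action with the highest learned Q value
        }
        epsilon *= epsilonDecay;                          // decay epsilon and delta after each time step
        delta *= deltaDecay;
        return action;
    }

    // Placeholders for the corresponding parts of the learning agent.
    private String randomAction(String state) { throw new UnsupportedOperationException(); }
    private String getNextConfiguration(String state) { throw new UnsupportedOperationException(); }
    private String bestKnownAction(String state) { throw new UnsupportedOperationException(); }
}
```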

3.2 Feature-model-structure exploration for large adaptation spaces

To cope with large adaptation spaces, we propose the FM-structure exploration strategy, which takes advantage of the semantics typically encoded in the structure of feature models. Non-leaf features are typically abstract features used to better structure variability [40]. Abstract features do not directly impact the implementation, but delegate their implementation to their sub-features. Sub-features thereby offer different implementations of their abstract parent feature. As such, the sub-features of a common parent feature, i.e., sibling features, can be considered semantically connected.

In the example from Sect. 2, the ContentDiscovery feature has two sub-features Search and Recommendation, offering different concrete ways in which a user may discover online content. The idea behind FM-structure exploration is to exploit the information about these potentially semantically connected sibling features and explore them first before exploring other features.Footnote 1 Table 1 shows an excerpt of a typical exploration sequence of the FM-structure exploration strategy, with the step-wise exploration of sibling features highlighted in gray. Exploration starts with a randomly selected leaf feature, here: Recommendation. Then all configurations involving this leaf feature are explored before moving to its sibling feature, here: Search.

Table 1 Example for FM-structure exploration (excerpt)

FM-structure exploration is realized by Algorithm 3, which starts by randomly selecting a leaf feature f among all leaf features that are part of the current configuration (lines 5, 6). Then, the set of configurations \({\mathscr {C}}_{f}\) containing feature f is computed, while the sibling features of feature f are gathered into a dedicated siblings set (line 7). While \({\mathscr {C}}_{f}\) is non-empty, the strategy explores one randomly selected configuration from \({\mathscr {C}}_{f}\) and removes the selected configuration from \({\mathscr {C}}_{f}\) (lines 11–13). If \({\mathscr {C}}_{f}\) is empty, then a new set of configurations containing a sibling feature of f is randomly explored, provided such a sibling feature exists (lines 15–17). If no configuration containing f or a sibling feature of f is found, the strategy moves on to the parent feature of f; this is repeated until a configuration is found (line 13) or the root feature is reached (line 22).

Algorithm 3: FM-structure exploration (figure d)
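The following condensed Java sketch illustrates the core of Algorithm 3. The FeatureModel interface and its query methods are hypothetical placeholders for a feature model analysis component, and the bookkeeping of already explored configurations is simplified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Condensed sketch of the FM-structure exploration strategy (cf. Algorithm 3).
// A configuration is a set of feature names; FeatureModel is a hypothetical interface.
final class FmStructureExploration {
    private final Random rnd = new Random();
    private String currentFeature;                       // feature whose configurations are explored
    private List<Set<String>> toExplore = new ArrayList<>();
    private List<String> siblings = new ArrayList<>();

    Set<String> getNextConfiguration(Set<String> currentConfig, FeatureModel fm) {
        if (currentFeature == null) {
            // Randomly select a leaf feature that is part of the current configuration.
            List<String> leaves = fm.leafFeaturesOf(currentConfig);
            currentFeature = leaves.get(rnd.nextInt(leaves.size()));
            toExplore = new ArrayList<>(fm.unexploredConfigurationsContaining(currentFeature));
            siblings = new ArrayList<>(fm.siblingsOf(currentFeature));
        }
        while (toExplore.isEmpty()) {
            if (!siblings.isEmpty()) {
                // Move on to a randomly chosen sibling feature (semantically connected).
                currentFeature = siblings.remove(rnd.nextInt(siblings.size()));
            } else if (fm.parentOf(currentFeature) != null) {
                // No siblings left: climb up to the parent feature and continue there.
                currentFeature = fm.parentOf(currentFeature);
                siblings = new ArrayList<>(fm.siblingsOf(currentFeature));
            } else {
                return currentConfig;                    // root reached: nothing left to explore
            }
            toExplore = new ArrayList<>(fm.unexploredConfigurationsContaining(currentFeature));
        }
        // Explore one randomly selected, not yet explored configuration.
        return toExplore.remove(rnd.nextInt(toExplore.size()));
    }

    // Hypothetical interface for feature model queries (e.g., backed by a feature model analysis tool).
    interface FeatureModel {
        List<String> leafFeaturesOf(Set<String> configuration);
        List<Set<String>> unexploredConfigurationsContaining(String feature);
        List<String> siblingsOf(String feature);
        String parentOf(String feature);                 // null for the root feature
    }
}
```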

3.3 Feature-model-difference exploration strategy for system evolution

To capture changes in the system’s adaptation space due to system evolution, we propose the FM-difference exploration strategy, which leverages the differences in feature models before (\({\mathscr {M}}\)) and after (\(\mathscr {M'}\)) an evolution step. Following the product line literature, we consider two main types of feature model differences [39]:

Added configurations (feature model generalization). New configurations may be added to the adaptation space by (i) introducing new features to \(\mathscr {M'}\), or (ii) removing or relaxing existing constraints (e.g., by changing a sub-feature from mandatory to optional, or by removing cross-tree constraints). In our running example, a new sub-feature Optimized might be added to the DataLogging feature, providing a more resource-efficient logging implementation. Thereby, new configurations are added to the adaptation space, such as {DataLogging, Optimized, ContentDiscovery, Search}. As another example, the Recommendation implementation may have been improved so that it can now work with the Min logging feature. This removes the cross-tree constraint shown in Fig. 1 and adds new configurations such as {DataLogging, Min, ContentDiscovery, Recommendation}.

Removed configurations (feature model specialization). Symmetrically to the above, configurations may be removed from the adaptation space by (i) removing features from \({\mathscr {M}}\), or (ii) adding or tightening constraints in \(\mathscr {M'}\).

To determine these changes of feature models, we compute a set-theoretic difference between valid configurations expressed by feature model \({\mathscr {M}}\) and feature model \(\mathscr {M'}\). Detailed descriptions of feature model differencing as well as efficient tool support can be found in [1, 5]. The feature model differences provide us with adaptation actions added to the adaptation space (\(\mathscr {M'} \setminus {\mathscr {M}}\)), as well as adaptation actions removed from the adaptation space (\({\mathscr {M}} \setminus \mathscr {M'}\)).

Our FM-difference exploration strategies first explore the configurations that were added to the adaptation space, and then explore the remaining configurations if needed. The rationale is that added configurations might offer new opportunities for finding suitable adaptation actions and thus should be explored first. Configurations that were removed are no longer executed and thus the learning knowledge can be pruned accordingly. In the reinforcement learning realization (Sect. 3.1), we remove all tuples (sa) from Q, where a represents a removed configuration.
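The sketch below illustrates the set-theoretic difference computation and the pruning of the learned knowledge. Representing configurations as sets of feature names and the Q function as a map are simplifications for illustration only.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: compute added/removed configurations between the valid
// configurations of feature model M (before evolution) and M' (after evolution),
// and prune the learned knowledge for removed configurations.
final class FmDifference {

    // A state-action pair; the action is a configuration (set of feature names).
    record StateAction(String state, Set<String> configuration) {}

    // M' \ M: configurations added by the evolution step (explored first).
    static Set<Set<String>> added(Set<Set<String>> validM, Set<Set<String>> validMPrime) {
        Set<Set<String>> result = new HashSet<>(validMPrime);
        result.removeAll(validM);
        return result;
    }

    // M \ M': configurations removed by the evolution step (no longer executed).
    static Set<Set<String>> removed(Set<Set<String>> validM, Set<Set<String>> validMPrime) {
        Set<Set<String>> result = new HashSet<>(validM);
        result.removeAll(validMPrime);
        return result;
    }

    // Prune the learned knowledge: drop all (s, a) tuples whose action was removed.
    static void pruneQ(Map<StateAction, Double> q, Set<Set<String>> removedConfigurations) {
        q.keySet().removeIf(sa -> removedConfigurations.contains(sa.configuration()));
    }
}
```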

FM-difference exploration can be combined with FM-structure exploration, but also with \(\epsilon \)-greedy. In both cases, this means that instead of exploring the whole new adaptation space, exploration is limited to the set of new configurations.

4 Experiment setup

We experimentally assess our FM-guided exploration strategies and compare them with \(\epsilon \)-greedy as the strategy used in the related work (see Sect. 7). In particular, we aim to answer the following research questions:

RQ1: How do learning performance and system quality using FM-structure exploration (from Sect. 3.2) compare to using \(\epsilon \)-greedy?

RQ2: How do learning performance and system quality using FM-difference exploration (from Sect. 3.3) compare to evolution-unaware exploration?

4.1 Subject systems

Our experiments build on two real-world systems and datasets. The CloudRM system is an adaptive cloud resource management service with 63 features, 344 adaptation actions, and a feature model that is 3 levels deep. The BerkeleyDB-J system is an open-source reconfigurable database written in Java with 26 features, 180 adaptation actions, and a feature model that is 5 levels deep.

CloudRM System: CloudRM [21] controls the allocation of computational tasks to virtual machines (VMs) and the allocation of virtual machines to physical machines in a cloud data center.Footnote 2 CloudRM can be adapted by reconfiguring it to use different allocation algorithms, and the algorithms can be adapted by using different sets of parameters. We implemented a separate adaptation logic for CloudRM by using the extended learning algorithms as introduced in Sect. 3.1.

We define the reward function as \(r = -(\rho \cdot e + (1-\rho ) \cdot m)\), with energy consumption e and number of VM manipulations m (i.e., migrations and launches), each normalized to be within [0, 1]. We use \(\rho = 0.8\), meaning we give priority to reducing energy consumption, while still maintaining a low number of VM manipulations.
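For illustration, a direct encoding of this reward function (assuming e and m are already normalized):

```java
// Illustrative encoding of the CloudRM reward function; e and m are assumed
// to be already normalized to [0, 1].
final class CloudRmReward {
    static final double RHO = 0.8; // priority on reducing energy consumption

    // r = -(rho * e + (1 - rho) * m)
    static double reward(double e, double m) {
        return -(RHO * e + (1.0 - RHO) * m);
    }
}
```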

Our experiments are based on a real-world workload trace with 10,000 tasks, spanning a time frame of 29 days in total [22]. The CloudRM algorithms decide on the placement of new tasks whenever tasks are entered into the system (as driven by the workload trace). For RQ2, the same workload was replayed after each evolution step to ensure consistency among the results.

To emulate system evolution, we use a 3-step evolution scenario.

Starting from a system that offers 26 adaptation actions, these three evolution steps respectively add 30, 72 and 216 adaptation actions.

BerkeleyDB-J: The BerkeleyDB-J dataset was collected by Siegmund et al. [36] and was used for experimentation with reconfigurable systems to predict their response times.Footnote 3 We chose this system because the configurations are expressed as a feature model and the dataset includes performance measurements for all system configurations, which were measured using standard benchmarks.Footnote 4 Adaptation actions are the possible runtime reconfigurations of the system. We define the reward function as \(r = -t\), with t being the response time normalized to be within [0, 1].Footnote 5

Given the smaller size of BerkeleyDB-J’s adaptation space, we use a 2-step evolution scenario to emulate system evolution. We first randomly change two of the five optional features into mandatory ones, thereby reducing the size of the adaptation space. We start from this reduced adaptation space and randomly change the mandatory features back into optional ones. Starting from a system that offers 39 adaptation actions, these two evolution steps respectively add 20 and 121 adaptation actions.

4.2 Measuring learning performance

We characterize the performance of the learning process by using the following metrics from [38]: Asymptotic performance measures the reward achieved at the end of the learning process. Time to threshold measures the number of time steps it takes the learning process to reach a predefined reward threshold (in our case, \(90\%\) of the difference between maximum and minimum performance). Total performance measures the overall learning performance by computing the area between the reward curve and the asymptotic reward. In addition, we measure how the different strategies impact the quality characteristics of the subject systems.
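The following sketch illustrates how these metrics can be computed from a recorded reward curve rewards[0..T-1]; the exact computation in our experiments may differ in details such as averaging and smoothing.

```java
// Illustrative computation of the learning performance metrics from [38],
// given a reward curve rewards[0..T-1] (higher rewards are better).
final class LearningMetrics {

    // Reward achieved at the end of the learning process.
    static double asymptoticPerformance(double[] rewards) {
        return rewards[rewards.length - 1];
    }

    // Number of time steps until the reward reaches 90% of the max-min difference.
    static int timeToThreshold(double[] rewards) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double r : rewards) { min = Math.min(min, r); max = Math.max(max, r); }
        double threshold = min + 0.9 * (max - min);
        for (int t = 0; t < rewards.length; t++) {
            if (rewards[t] >= threshold) return t;
        }
        return rewards.length;
    }

    // Area between the reward curve and the asymptotic reward
    // (a smaller area indicates faster convergence toward the asymptote).
    static double totalPerformance(double[] rewards) {
        double asymptote = asymptoticPerformance(rewards);
        double area = 0.0;
        for (double r : rewards) area += Math.abs(asymptote - r);
        return area;
    }
}
```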

Given the stochastic nature of the learning strategies (both \(\epsilon \)-greedy and to a lesser degree our strategies involve random decisions), we repeated the measurements 500 times and averaged the results.

4.3 Prototypical realization

The learning algorithms, as well as the \(\epsilon \)-greedy and FM-based exploration strategies were implemented in Java. Feature model management and analysis were performed using the FeatureIDE framework,Footnote 6 which we used to efficiently compute possible feature combinations from a feature model.

4.4 Hyper-parameter optimization

To determine suitable hyper-parameter values (see Sect. 3.1), we performed hyper-parameter tuning via exhaustive grid search for each of the subject systems and each of the reinforcement learning algorithms. We measured the learning performance for our baseline \(\epsilon \)-greedy strategy for 11,000 combinations of learning rate \(\alpha \), discount factor \(\gamma \), and \(\epsilon \)-decay rate. For each of the subject systems and reinforcement learning algorithms we chose the hyper-parameter combination that led to the highest asymptotic performance. We used these hyper-parameters also for our FM-guided strategies.
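The grid search itself is conceptually simple; the sketch below illustrates it, with measureAsymptoticPerformance standing in as a hypothetical placeholder for running the baseline learner with a given hyper-parameter combination.

```java
import java.util.List;

// Illustrative exhaustive grid search over learning rate, discount factor and epsilon-decay rate.
final class GridSearch {

    record HyperParams(double alpha, double gamma, double epsilonDecay) {}

    static HyperParams search(List<Double> alphas, List<Double> gammas, List<Double> decays) {
        HyperParams best = null;
        double bestPerformance = Double.NEGATIVE_INFINITY;
        for (double alpha : alphas) {
            for (double gamma : gammas) {
                for (double decay : decays) {
                    HyperParams candidate = new HyperParams(alpha, gamma, decay);
                    // Hypothetical helper: run the baseline epsilon-greedy learner and
                    // measure its asymptotic performance for this combination.
                    double performance = measureAsymptoticPerformance(candidate);
                    if (performance > bestPerformance) {
                        bestPerformance = performance;
                        best = candidate;
                    }
                }
            }
        }
        return best;
    }

    private static double measureAsymptoticPerformance(HyperParams p) {
        throw new UnsupportedOperationException("placeholder for running the experiment");
    }
}
```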

5 Results

To facilitate reproducibility and replicability, our code, the used data and our experimental results are available online.Footnote 7

5.1 Results for RQ1 (FM-structure exploration)

Figure 3 visualizes the learning process by showing how rewards develop over time, while Table 2 quantifies the learning performance using the metrics introduced above.

Fig. 3 Learning performance for large adaptation spaces (RQ1)

Table 2 Comparison of exploration strategies for large adaptation spaces (RQ1)

Across the two systems and learning algorithms, FM-structure exploration performs better than \(\epsilon \)-greedy wrt. total performance (33.7% on average) and time to threshold (25.4%), while performing comparably wrt. asymptotic performance (0.33%). A higher improvement is visible for CloudRM than for BerkeleyDB-J, which we attribute to the much larger adaptation space of CloudRM, whereby the effects of systematically exploring the adaptation space become more pronounced.

For CloudRM, FM-structure exploration consistently leads to fewer VM manipulations and lower energy consumption. While the savings in energy are rather small (0.1% resp. 0.23%), FM-structure exploration reduces the number of virtual machine manipulations by 7.8% resp. 9.15%. This is because the placement algorithms of CloudRM differ only slightly wrt. energy optimization, but much more wrt. the number of virtual machine manipulations. For BerkeleyDB-J, we observe an improvement in response times of 1.55% resp. 4.13%. This smaller improvement is consistent with the smaller improvement in learning performance.

Analyzing the improvement of FM-structure exploration for the different learning algorithms, we observe an improvement of 24.2% (total performance) resp. 15.1% (time to threshold) for Q-Learning, and a much higher improvement of 43.2% resp. 35.8% for SARSA. Note, however, that the overall learning performance of SARSA is much lower than that of Q-Learning. SARSA performs worse wrt. total performance (\(-23\%\) on average), time to threshold (\(-27.6\%\) on average), and asymptotic performance (\(-3.82\%\) on average). In addition, SARSA requires around 19.4% more episodes than Q-Learning to reach the same asymptotic performance. The reason is that SARSA is more conservative during exploration [37]. If an adaptation action that leads to a large negative reward is close to an adaptation action that leads to the optimal reward, Q-Learning runs the risk of choosing the adaptation action with the large negative reward. In contrast, SARSA will avoid that adaptation action, but will learn the optimal adaptation actions more slowly. In practice, one may thus choose between Q-Learning and SARSA depending on how expensive it is to execute “wrong” adaptations.

5.2 Results for RQ2 (FM-difference exploration)

We compare FM-difference exploration combined with \(\epsilon \)-greedy and FM-structure exploration with their respective evolution-unaware counterparts (i.e., the strategies used for RQ1). It should be noted that even though we provide the evolution-unaware strategies with the information about the changed adaptation space (so they can fully explore it), we have not modified them to differentiate between old and new adaptation actions.

Like for RQ1, Fig. 4 visualizes the learning process, while Table 3 quantifies learning performance. We computed the metrics separately for each of the evolution steps and report their averages. After each evolution step, learning proceeds for a given number of time steps, before moving to the next evolution step.

Fig. 4 Learning performance across system evolution (RQ2)

Table 3 Comparison of exploration strategies across evolution steps (RQ2)

The FM-difference exploration strategies consistently perform better than their evolution-unaware counterparts wrt. total performance (50.6% on average) and time to threshold (47%), and perform comparably wrt. asymptotic performance (1.7%). Like for RQ1, the improvements are more pronounced for CloudRM, which exhibits a larger action space than BerkeleyDB-J.

For CloudRM, FM-difference exploration reduces the number of virtual machine manipulations by 19.8% resp. 30.9%, while keeping energy consumption at around the same level as the evolution-unaware strategies. For BerkeleyDB-J, FM-difference exploration leads to a reduction in response time of 1.24% resp. 2.56%. Like for RQ1, this smaller reduction is consistent with the smaller learning performance.

The improvement of FM-difference exploration is more pronounced for \(\epsilon \)-greedy than for FM-structure exploration; e.g., it shows a 94.4% improvement in total performance for \(\epsilon \)-greedy compared with an improvement of only 6.85% for FM-structure exploration. This suggests that, during evolution, considering the changes of the adaptation space has a much larger effect than considering the structure of the adaptation space. In addition, we note that due to the way we emulate evolution in our experiments, the number of adaptation actions introduced by an evolution step is much smaller (66 on average) than the size of the whole adaptation space of the subject systems (262 on average), thus diminishing the effect of FM-structure exploration.

Analyzing the improvement of FM-difference exploration for the different learning algorithms, we can observe the same effect as for RQ1. While FM-difference exploration shows a much higher improvement for SARSA, the overall learning performance for SARSA is much lower than for Q-Learning.

5.3 Validity risks

We used two realistic subject systems and employed real-world workload traces and benchmarks to measure learning performance and the impact of the different exploration strategies on the systems’ quality characteristics. The results reinforce our earlier findings from [26] and also indicate that the size of the adaptation space may have an impact on how much improvement may be gained from FM-structure exploration. As part of our future work, we plan experiments with additional subject systems to confirm this impact for larger action spaces.

We chose \(\epsilon \)-greedy as a baseline, because it is the exploration strategy used in existing online reinforcement learning approaches for self-adaptive systems (see Sect. 7). Alternative exploration strategies have been proposed in the broader field of machine learning. Examples include Boltzmann exploration, where actions with a higher expected reward (e.g., Q value) have a higher chance of being explored, or UCB action selection, where actions that have been explored less frequently are favored [37]. A comparison among these alternatives is beyond the scope of this article, because a fair comparison would require the careful variation and analysis of many additional hyper-parameters. We plan to address this as part of future work.

6 Limitations and assumptions

Below, we discuss current limitations and assumptions of our approach.

6.1 Completeness of feature models

We assume that feature models are complete with respect to the coverage of the adaptation space and that during an evolution step they are always consistent and up to date. A further possible change during system evolution is the modification of a feature’s implementation, which is currently not visible in the feature models. Encoding such modifications could thus further improve our FM-guided exploration strategies.

6.2 Structure of feature models

One aspect that impacts FM-structure exploration is how the feature model is structured. As an example, if a feature model has only few levels (and thus little structure), FM-structure exploration behaves similarly to random exploration, because such a “flat” feature model does not provide enough structural information. On the other hand, providing reinforcement learning with too much structural information might hinder the learning process. As a case in point, we realized during our experiments that the alternative FM-structure exploration strategy from our earlier work [26] indeed had such a negative effect for the BerkeleyDB-J system. This alternative strategy used the concept of “feature degree”Footnote 8 to increase the amount of structural information used during learning.

6.3 Types of features

Our approach currently only supports discrete features in the feature models, and thus only discrete adaptation actions. Capturing feature cardinalities or allowing numeric feature values is currently not possible, and thus continuous adaptation actions cannot be captured.

6.4 Adaptation constraints

When realizing the exploration strategies (both \(\epsilon \)-greedy and FM-guided), we assumed that the system can always switch from its current configuration to any other possible configuration. We were not concerned with the technicalities of how to reconfigure the running system (which, for example, is addressed in [8]). We also did not consider constraints concerning the order of adaptations. In practice, only certain paths may be permissible to reach a configuration from the current one. To consider such paths, our strategies may be enhanced by building on work such as [32].

7 Related work

We first review papers that apply online reinforcement learning to self-adaptive systems but do not consider large discrete adaptation spaces or system evolution, and then review papers that do.

7.1 Applying online reinforcement learning to self-adaptive systems

Barrett et al. use Q-Learning with \(\epsilon \)-greedy for autonomic cloud resource allocation [3]. They use parallel learning to speed up the learning process. Caporuscio et al. use two-layer hierarchical reinforcement learning for multi-agent service assembly [7]. They observe that by sharing monitoring information, learning happens faster than when learning in isolation. Arabnejad et al. apply fuzzy reinforcement learning with \(\epsilon \)-greedy to learn fuzzy adaptation rules [2]. Moustafa and Zhang use multi-agent Q-Learning with \(\epsilon \)-greedy for adaptive service compositions [28]. To speed up learning, they use collaborative learning, where multiple systems simultaneously explore the set of concrete services to be composed. Zhao et al. use reinforcement learning (with \(\epsilon \)-greedy) combined with case-based reasoning to generate and update adaptation rules for web applications [46]. Their approach may take as long to converge as learning from scratch, but may offer higher initial performance. Shaw et al. apply reinforcement learning for the consolidation of virtual machines in cloud data centers [35].

Recently, deep reinforcement learning has gained popularity. In deep reinforcement learning, a deep neural network is used to store the learned knowledge (for example, the Q function in Deep Q-Learning). Wang et al. use Q-Learning (using \(\epsilon \)-greedy) together with function approximation. They use neural networks to generalize over unseen environment states and thereby facilitate learning in the presence of many environment states [42]. Yet, they do not address large action spaces. Moustafa and Ito use deep Q-Networks enhanced with double Q-Learning and prioritized experience replay for adaptively selecting web services for a service workflow [27]. Wang et al. also address the service composition problem, and apply deep Q-Learning with a Recurrent Neural Network (RNN), which can also handle partially observable states [43]. Restuccia and Melodia propose an efficient implementation of reinforcement learning using deep Q-Networks for adaptive radio control in wireless embedded systems [33]. In our earlier work, we used policy-based reinforcement learning for self-adaptive information systems [30], where the policy is represented as a neural network. Thereby, we addressed continuous environment states and adaptation actions. Using deep neural networks, these approaches can better generalize over environment states and actions. Thereby, deep reinforcement learning in general may perform better in large adaptation spaces. However, to be able to generalize, the adaptation space must be continuous [10]. A continuous space of actions is represented by continuous variables, such as real-valued variables. Setting a specific angle for a robot arm or changing the set-point of a thermostat are examples of a continuous space of actions [18]. However, as motivated in Sect. 1, many kinds of self-adaptive systems have a discrete, i.e., non-continuous, space of adaptation actions.

In conclusion, none of the approaches reviewed above addresses the influence of large discrete adaptation spaces or of system evolution on learning performance. Thus, these approaches may suffer from poor performance over an extended period of time while the system randomly explores the large adaptation space, as it may take a long time to find suitable adaptations. In addition, such approaches may only recognize adaptation possibilities added by evolution late, thereby negatively impacting the system’s long-term overall performance.

7.2 Considering large adaptation spaces and evolution

Bu et al explicitly consider large adaptation spaces [4]. They employ Q-Learning (using \(\epsilon \)-greedy) for self-configuring cloud systems. They reduce the size of the adaptation space by splitting it into coarse-grained sub-sets for each of which they find a representative adaptation action using the simplex method. Their experiments indicate that their approach indeed can speed up learning. Yet, they do not consider service evolution.

Dutreilh et al explicitly consider service evolution [11]. They use Q-Learning for autonomic cloud resource management. To speed up learning, they provide good initial estimates for the Q function, and use statistical estimates about the environment behavior. They indicate that system evolution may imply a change of system performance and sketch an idea on how to detect such drifts in system performance. Yet, they do not consider that evolution may also introduce or remove adaptation actions.

A different line of work uses supervised machine learning to reduce the size of the adaptation space. As an example, Van Der Donckt et al. use deep learning to determine a representative and much smaller subset of the adaptation space [41]. However, supervised learning requires labeled training data representative of the system’s environment, which may be challenging to obtain due to design time uncertainty.

Our earlier work made a first attempt to address both large adaptation spaces and evolution in online reinforcement learning for self-adaptive systems by means of FM-guided exploration [26]. The present paper extends our earlier work in multiple respects, including a conceptual framework for integrating reinforcement learning into the MAPE-K reference model of self-adaptive systems, covering a broader range of subject systems, and integrating our strategies into two different reinforcement learning algorithms.

8 Conclusion and outlook

We introduced feature-model-guided exploration strategies for online reinforcement learning that address potentially large adaptation spaces and changes of the adaptation space due to system evolution. Experimental results for two adaptive systems indicate a speed-up of learning and, in turn, an improvement of the systems’ quality characteristics.

As future work, we plan to use extended feature models, which offer a more expressive notation that allows capturing feature cardinalities and even numeric feature values [25]. Thereby, we can express adaptation spaces that combine discrete and continuous adaptation actions. This will require more advanced feature analysis methods [14] to be used as part of our FM-based exploration strategies. In addition, we aim to extend our strategies to also consider changes in existing features (and not only additions and removals of features) during system evolution. This, among others, will require extending the feature modeling language used. Finally, we plan to use deep reinforcement learning, which represents the policy as a neural network and thereby can generalize over environment states and adaptation actions [30].