1 Introduction

Over the past few years, cloud computing has achieved significant success in industry since it provides economical, scalable, and elastic access to computing resources, liberating people from installing, configuring, securing, and updating a variety of hardware and software [1,2,3]. More and more firms and individual users have been using cloud computing services over the Internet, which contributes to the growth of the cloud computing market. The global cloud computing market is expected to grow at a 30% compound annual growth rate (CAGR), reaching $270 billion in 2020. Competing for these hundreds of billions of dollars, many firms have entered the cloud market as service providers [4]. There now exist many dominant cloud platforms offering cloud services, such as Microsoft’s Azure, IBM’s SoftLayer and Amazon’s AWS. In a cloud market with multiple providers, cloud users have various choices, and they usually choose the provider that can satisfy their demands at the lowest price. In fact, when multiple providers offer a similar quality of service [5,6,7], the price significantly affects users’ choices and thus providers’ profits. Therefore, cloud providers need to set effective prices to compete against each other. Furthermore, the competition among providers usually lasts for a long time, i.e. the providers compete against each other repeatedly, and thus they need to maximize their long-term profits. In this paper, we analyze how a cloud provider designs an appropriate pricing policy to maximize its long-term profit while remaining attractive to cloud users.

There exist some works on designing pricing policies for cloud providers. In [8], a non-cooperative competition model based on game theory is proposed, which computes the equilibrium price for a one-shot game and does not consider long-term profits. In [9, 10], the authors assume that there is only one provider, while in today’s cloud market multiple providers exist and compete against each other. The authors in [11, 12] analyze user behavior with respect to the providers’ prices, but ignore the competition among providers. In [13], the authors analyze the pricing policy in a competitive environment by assuming that there is only one proactive provider, and the other providers simply follow the proactive one’s pricing policy. Some other works, such as [14, 15], consider the competition among providers but do not capture the market dynamics, and their algorithms can only be applied to a very small market with few users.

To the best of our knowledge, few works have considered the situation of multiple providers competing against each other repeatedly. In this paper, we analyze how a competing cloud provider sets prices effectively to maximize its long-term profits in a setting with two competing providers. In more detail, we first describe the basic settings of cloud users and providers. Specifically, we consider the uncertainty of users choosing cloud providers, which is consistent with realistic user behavior. Furthermore, users’ choices of cloud providers are affected by the prices, and the prices set by the providers are in turn affected by users’ choices, so this is a sequential decision problem. Moreover, this problem involves two self-interested cloud providers, and thus it is a Markov game [16]. In this paper, we model the competition between cloud providers as a Markov game, and then use two typical reinforcement learning algorithms, minimax-Q learning [17] and Q learning [18], to solve this game and design the pricing policy. We then run extensive experiments to evaluate the policies in different situations. We find that although minimax-Q learning, which was specifically designed for Markov games, is more suitable for this problem, the Q learning based pricing policy performs better in terms of making profits. We also find that the minimax-Q based pricing policy is better for remaining attractive to cloud users.

The structure of the paper is as follows. In Sect. 2, we describe basic settings of cloud users and providers. In Sect. 3, we describe how to use Q learning and minimax-Q learning algorithms to design the pricing policy. We run extensive experiments to evaluate the pricing policies in different situations in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 Basic Settings

In this section, we describe the basic settings of cloud providers and users. We assume that there are N users and two cloud providers, A and B. The cloud providers compete against each other repeatedly, i.e. the competition consists of multiple stages. At the beginning of each stage, each provider publishes its price, and then each user chooses which provider to be served by, based on its choice model. According to the users’ choices, the two providers compute the profits obtained at the current stage, and the competition enters the next stage.

2.1 Cloud Providers

Cloud providers make profits by charging fees to users, while they also need to pay the cost of offering services (e.g. power, hardware, infrastructure maintenance and so on). At stage t, provider i pays a cost for offering each unit of service [19], which is denoted as \(c_{i,t}\). We assume that each user only requests one unit of service; therefore, the amount of requested service at stage t is equal to the number of users. At the beginning of the competition, the initial marginal cost of provider i is \(c_{i,0}\). At stage t, the number of users choosing provider i is \(N_{i,t}\), and the marginal cost at this stage is:

$$\begin{aligned} c_{i,t}=c_{i,0}(N_{i,t})^{-\beta } e^{-\theta t}\end{aligned}$$
(1)

This equation indicates that the marginal cost decreases as more users require the service and as time goes on [20]. Specifically, when the provider receives more demand for services, its marginal cost decreases because of economies of scale, where \(\beta \) is the parameter for the economies of scale, and \(\beta >0\). Furthermore, the reduction of hardware cost and the development of technology contribute to the temporal decay of the marginal cost, where \(\theta \) is the parameter of the temporal decay factor, and \(\theta >0\).

We denote the price by p, and all allowable prices constitute a finite set P. The price is the action used in Sect. 3. After the providers publish their prices, users choose their providers. We can then calculate the immediate reward of each provider, which is the immediate profit made at the current stage t:

$$\begin{aligned} r_{i,t}=N_{i,t}(p_{i,t}-c_{i,t})\end{aligned}$$
(2)

where \(p_{i,t}\) is the price set by provider i at stage t.
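To make the cost and reward models concrete, the following Python sketch computes the marginal cost of Eq. (1) and the immediate reward of Eq. (2). The numerical values of \(\beta \) and \(\theta \) below are illustrative placeholders, not the settings of Table 1.

```python
import math

def marginal_cost(c_i0, n_users, t, beta, theta):
    """Marginal cost of Eq. (1): c_{i,t} = c_{i,0} * N_{i,t}^(-beta) * exp(-theta * t)."""
    return c_i0 * (n_users ** -beta) * math.exp(-theta * t)

def immediate_reward(n_users, price, cost):
    """Immediate profit of Eq. (2): r_{i,t} = N_{i,t} * (p_{i,t} - c_{i,t})."""
    return n_users * (price - cost)

# Illustrative values only; beta and theta are placeholders, not the Table 1 settings.
cost = marginal_cost(c_i0=5.0, n_users=60, t=10, beta=0.1, theta=0.001)
profit = immediate_reward(n_users=60, price=40, cost=cost)
```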

2.2 Cloud Users

Each user has a marginal value for each unit of requested service, which is denoted as \(\delta \). At stage t, after all providers publish their prices, user j can calculate its expected revenue when entering provider i, which is:

$$\begin{aligned} R_{j,i}^t=\delta _j-p_{i,t} \end{aligned}$$
(3)

Intuitively, based on Eq. 3, cloud users can determine from which provider they can obtain the maximal revenue at the current stage, and then choose that provider. However, in the real world, users keep requiring cloud services, and they usually take into account the prices at previous stages. Specifically, in this paper we assume that users consider the prices at the current stage t and the last stage \(t-1\) when choosing cloud providers. We do not need to consider the prices at all previous stages since the providers’ prices and the users’ choices affect each other, and thus the price of the last stage already reflects the dynamic interaction of all previous stages. Therefore, the expected utility that user j obtains when entering provider i is:

$$\begin{aligned} v_{j,i}^t= \xi R_{j,i}^t + (1-\xi )R_{j,i}^{t-1} \end{aligned}$$
(4)

where \(\xi \) is the weight the user places on the price at the current stage. Furthermore, in reality, when agents make decisions, their choices are affected by some unobservable factors [21], such as a customer’s loyalty to a product brand, which we denote as \(\eta _{j,i}\). This term introduces uncertainty into users’ choices. The utility that cloud user j obtains from provider i at stage t is then defined as follows:

$$\begin{aligned} u_{j,i}^t=v_{j,i}^t+\eta _{j,i}\end{aligned}$$
(5)

We assume that the random variables \(\eta _{j,i}\) are independently and identically distributed extreme values, i.e. they follow the Gumbel (type I extreme value) distribution [21], and the density of \(\eta _{j,i}\) is

$$\begin{aligned} f(\eta _{j,i})=e^{-\eta _{j,i}}e^{-e^{-\eta _{j,i}}}\end{aligned}$$
(6)

and the cumulative distribution is

$$\begin{aligned} F(\eta _{j,i})=e^{-e^{-\eta _{j,i}}}\end{aligned}$$
(7)

The probability of user j choosing provider i at stage t, denoted as \(P_{j,i}^t\), is

$$\begin{aligned} P_{j,i}^t&=Prob(u_{j,i}^t>u_{j,i'}^t,\ \forall i'\ne i)\\ &=Prob(v_{j,i}^t+\eta _{j,i}>v_{j,i'}^t+\eta _{j,i'},\ \forall i'\ne i)\\ &=Prob(\eta _{j,i'}<\eta _{j,i}+v_{j,i}^t-v_{j,i'}^t,\ \forall i'\ne i)\end{aligned}$$
(8)

According to (7), for a given value of \(\eta _{j,i}\) and a particular alternative \(i'\), this probability is

$$\begin{aligned} P_{j,i}^t=e^{-e^{-(\eta _{j,i}+v_{j,i}^t-v_{j,i'}^t)}}\end{aligned}$$
(9)

Since the \(\eta _{j,i}\) are independent across providers, the cumulative distribution over all \(i'\ne i\) is the product of the individual cumulative distributions

$$\begin{aligned} P_{j,i}^t\mid \eta _{j,i}=\prod _{i'\ne i}e^{-e^{-(\eta _{j,i}+v_{j,i}^t-v_{j,i'}^t)}}\end{aligned}$$
(10)

Since \(\eta _{j,i}\) is unknown to the providers, the choice probability is the integral of \(P_{j,i}^t\mid \eta _{j,i}\) over all values of \(\eta _{j,i}\), weighted by its density

$$\begin{aligned} P_{j,i}^t=\int \Big (\prod _{i'\ne i}e^{-e^{-(\eta _{j,i}+v_{j,i}^t-v_{j,i'}^t)}}\Big )e^{-\eta _{j,i}}e^{-e^{-\eta _{j,i}}}d\eta _{j,i}\end{aligned}$$
(11)

The closed-form expression is

$$\begin{aligned} P_{j,i}^t=\frac{e^{v_{j,i}^t}}{\sum _{i'}e^{v_{j,i'}^t}}\end{aligned}$$
(12)

which is the probability of user j choosing to be served by provider i at stage t.
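As a sketch of the user choice model, the following Python code computes the logit choice probabilities of Eq. (12) from the weighted revenues of Eqs. (3) and (4) and samples a user's choice. The numerical values in the example are illustrative only.

```python
import math
import random

def choice_probabilities(delta, prices_t, prices_prev, xi):
    """Logit choice probabilities of Eq. (12) for a single user.

    delta       : the user's marginal value per unit of service (Eq. 3)
    prices_t    : current prices, one entry per provider
    prices_prev : prices published at the previous stage
    xi          : weight on the current-stage revenue (Eq. 4)
    """
    v = [xi * (delta - pt) + (1 - xi) * (delta - pp)
         for pt, pp in zip(prices_t, prices_prev)]
    m = max(v)                                   # subtract the max for numerical stability
    exp_v = [math.exp(x - m) for x in v]
    total = sum(exp_v)
    return [x / total for x in exp_v]

def sample_choice(probs):
    """Sample which provider the user joins, according to the probabilities of Eq. (12)."""
    return random.choices(range(len(probs)), weights=probs)[0]

# Example: two providers with current prices (40, 60) and previous prices (50, 50).
probs = choice_probabilities(delta=100.0, prices_t=[40, 60], prices_prev=[50, 50], xi=0.7)
provider = sample_choice(probs)
```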

3 Reinforcement Learning Algorithms

After describing the basic settings, we now introduce how to design a pricing policy for the cloud provider. Setting an effective price is a decision-making problem, and reinforcement learning algorithms have been widely used to solve similar problems. Specifically, we adopt the Q learning algorithm [18] to determine how the provider sets the price. Note that Q learning is usually used to solve sequential decision problems involving only one agent, and therefore when using Q learning we let the opponent’s action be part of the environment. Moreover, since our problem involves two providers competing against each other repeatedly, it can be modeled as a Markov game [16], and for such a game we use the minimax-Q learning algorithm [17]. In the following, we introduce how to design the pricing policy based on the Q learning and minimax-Q learning algorithms respectively.

At stage t, provider A sets its price according to its own and the opponent B’s prices at the last stage \(t-1\), which together form the state \(s_{t-1}=(p_{A,t-1},p_{B,t-1})\). Note that the state does not include the number of users participating in each provider, since the prices already imply the users’ choices, and therefore we use only the prices to represent the state. The state space is denoted as \(S=P\times P\). For simplicity, in the following we use \(a\in P\) and \(b\in P\) to represent the actions of providers A and B respectively. The pricing policies of providers A and B are denoted as \(\varPi _A\) and \(\varPi _B\) respectively. Based on these notations, the Q learning algorithm is shown in Algorithm 1, and the minimax-Q learning algorithm is shown in Algorithm 2. In this setting, both algorithms are guaranteed to converge [17, 18]. The final output \(\varPi _A\) is the designed pricing policy.

Algorithm 1. Q learning based pricing policy
Algorithm 2. Minimax-Q learning based pricing policy
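Since the listings of Algorithms 1 and 2 are not reproduced here, the following Python sketch shows the core updates one would expect: the standard Q learning update [18] over the joint-price states, and the minimax-Q update [17], whose state value is obtained by solving a small linear program over mixed strategies. The table layout, the hyperparameter values and the use of scipy.optimize.linprog are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

P = list(range(10, 101, 10))   # allowable prices (actions), |P| = 10
n = len(P)

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q learning update of [18]; the opponent's price is treated as part of
    the environment, so Q is indexed by (state, own action)."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def minimax_value(Q_s):
    """Minimax value of one state in minimax-Q [17]:
    max over mixed strategies pi of min over opponent actions b of sum_a pi(a) * Q_s[a, b],
    solved as a small linear program over the variables [pi(a_1), ..., pi(a_n), v]."""
    c = np.zeros(n + 1)
    c[-1] = -1.0                                  # maximize v  <=>  minimize -v
    A_ub = np.hstack([-Q_s.T, np.ones((n, 1))])   # for each b: v - sum_a pi(a) Q_s[a,b] <= 0
    b_ub = np.zeros(n)
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0                             # probabilities sum to 1
    b_eq = np.ones(1)
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]                   # mixed strategy pi(s, .) and value V(s)

def minimax_q_update(Q, V, Pi, s, a, b, r, s_next, alpha=0.1, gamma=0.9):
    """Minimax-Q update on the joint-action table Q[s, a, b]."""
    Q[s, a, b] += alpha * (r + gamma * V[s_next] - Q[s, a, b])
    Pi[s], V[s] = minimax_value(Q[s])

# Tables: states are indices of last-stage price pairs (|S| = n * n).
Q_single = np.zeros((n * n, n))        # Q learning
Q_joint  = np.zeros((n * n, n, n))     # minimax-Q
V = np.ones(n * n)
Pi = np.full((n * n, n), 1.0 / n)
```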

4 Experimental Analysis

In this section, we run numerical simulations to analyze the reinforcement learning based pricing policies in different situations. We first describe the parameter setup used in the experiments.

Table 1. Experimental parameters

4.1 Experimental Parameters

First, we assume that each cloud provider has the same initial marginal cost, i.e. \(c_{.,0}=5\), and that the marginal cost decreases as demand increases. In addition, we assume that the set of allowable prices P from which cloud providers choose is \(\{10,20,\ldots ,100\}\). Furthermore, we assume that there are \(N=100\) cloud users in total. The marginal values \(\delta \) of the users are independent random variables, and for illustrative purposes we assume that they are drawn from a uniform distribution with support [50, 150]. The other parameters used in the following simulations follow typical settings in the related literature and are shown in Table 1.

Fig. 1. Pricing policies trained by the Q and minimax-Q learning algorithms

4.2 Pricing Policy

We first describe the pricing policies trained and output by the minimax-Q and Q learning algorithms respectively. We consider the case where the cloud provider trains its Q learning or minimax-Q learning policy against an opponent choosing actions randomly, and the case where both cloud providers are trained against each other with Q learning or minimax-Q learning. We name the resulting pricing policies QR, QQ, MR and MM respectively; for example, QQ means that both cloud providers adopt the Q learning algorithm and are trained against each other. We show these four pricing policies in Fig. 1. From these figures, we can read the probability of the provider choosing each price (action) at each state. Note that in this paper the state is a tuple containing the two providers’ prices at the last stage, and in order to show the state on a one-dimensional state axis, we map state (10, 10) to 1, state (10, 20) to 2, ..., state (20, 10) to 11, ..., and state (100, 100) to 100; see the sketch below. Furthermore, we find that no provider intends to set the minimal price 10 to attract users, since on one hand a low price is not beneficial for the long-term profit, and on the other hand cloud users’ choices of providers are also affected by unobservable factors, so a provider with the minimal price cannot attract all users, but only loses profit. We also find that these pricing policies never set the highest price, since such a high price would drive all users away. Moreover, we find that the surfaces of QQ and QR are sharper than those of MR and MM. Specifically, at some states QR and QQ choose a deterministic action, whereas MR and MM play mixed actions. This is because, in contrast to Q learning, which tries to maximize the profit regardless of the opponent’s action, minimax-Q learning needs to randomize its actions in order to maximize the profit in the worst case.
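The state-to-index mapping described above can be written compactly as follows; the helper function is ours, added for illustration, with the index running from 1 to 100 as in Fig. 1.

```python
PRICES = list(range(10, 101, 10))  # allowable prices {10, 20, ..., 100}

def state_to_index(p_a_prev, p_b_prev):
    """Map the state (p_A, p_B) of last-stage prices to a 1-based index:
    (10, 10) -> 1, (10, 20) -> 2, ..., (20, 10) -> 11, ..., (100, 100) -> 100."""
    return PRICES.index(p_a_prev) * len(PRICES) + PRICES.index(p_b_prev) + 1

assert state_to_index(10, 10) == 1
assert state_to_index(20, 10) == 11
assert state_to_index(100, 100) == 100
```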

4.3 Evaluation

In this section, we run simulations to evaluate the above pricing policies in different situations. Specifically, we use the average profit and the winning percentage as the evaluation metrics.
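As a sketch of how these metrics can be computed, the following helpers assume that the winning percentage is the fraction of stages in which a provider's per-stage profit exceeds its opponent's; the paper does not define the metric formally, so this definition is our assumption.

```python
def average_profit(profits):
    """Average per-stage profit of one provider over the whole competition."""
    return sum(profits) / len(profits)

def winning_percentage(profits_a, profits_b):
    """Fraction of stages in which provider A earns strictly more than provider B
    (our assumed definition; the paper does not spell it out)."""
    wins = sum(1 for ra, rb in zip(profits_a, profits_b) if ra > rb)
    return wins / len(profits_a)
```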

Fig. 2. MM, MR, QQ, QR vs. Random

vs. Random Pricing Policy. We first evaluate these four pricing policies against an opponent adopting a random pricing policy, which chooses each price with equal probability. The reason for doing this is that when entering the cloud market, fresh cloud providers often explore the competing environment randomly in order to collect more market information. The competing results are shown in Fig. 2, where Fig. 2(a) shows the average profit over 10000 stages, and Fig. 2(b) shows the winning percentage of these four policies against the random pricing policy. We find that all four pricing policies can beat the opponent. Not surprisingly, QR is the best of the four when competing against the random pricing policy, since QR is trained specifically against the random policy. In contrast, MR is the worst. This is because when the opponent chooses the price randomly, i.e. does not try to put the other side in the worst case, minimax-Q cannot perform well. In fact, we find that QQ and QR perform better than MR and MM. This may indicate that an agent should adopt Q learning when its opponent does not act in an intelligent way.

Fig. 3. MM, MR, QQ, QR vs. \(Linear\ Reduction\)

Fig. 4. MM, MR, QQ, QR vs. \(Exp\ Reduction\)

Fig. 5. Q-X vs. X

vs. Price Reduction Policy. In the real world, cloud providers usually attract cloud users by decreasing the price continuously. For example, when a fresh cloud provider enters the market, it may keep decreasing its price to attract users. We therefore also evaluate the four pricing policies against a cloud provider which keeps reducing its price. Specifically, we consider two typical price reduction policies, Linear Reduction and Exp Reduction. In the Linear Reduction policy, the price decreases linearly with time, where at stage t the price is \(p_t=p_0-0.01t\) (\(p_0\) is the initial price, and we set it to the maximal price, i.e. 100), while in Exp Reduction the price decreases exponentially with time, where \(p_t=p_0 e^{-0.0003t}\) (with \(p_0=100\) as before); a sketch of both baselines follows. The results of QQ, QR, MM and MR competing against the Linear Reduction policy and the Exp Reduction policy are shown in Figs. 3 and 4. We again find that the reinforcement learning based pricing policies can beat these price reduction policies. We also find that MM and MR do not outperform the reduction policies as significantly as QQ and QR do.
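The two baseline opponents can be sketched directly from the formulas above; whether the resulting prices are rounded back onto the discrete set P is not stated in the paper, so the sketch leaves them continuous.

```python
import math

def linear_reduction_price(t, p0=100.0):
    """Linear Reduction baseline: p_t = p_0 - 0.01 * t."""
    return p0 - 0.01 * t

def exp_reduction_price(t, p0=100.0):
    """Exp Reduction baseline: p_t = p_0 * exp(-0.0003 * t)."""
    return p0 * math.exp(-0.0003 * t)
```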

Q-X vs. X. In the above, it seems that when competing against an opponent using simple pricing policies (i.e. random or price reduction), the Q learning based pricing policy is better. However, after investigating the fundamentals of the minimax-Q and Q learning algorithms, we can see that minimax-Q is more suitable for this Markov game with two competing providers. Since this is not demonstrated by the above experiments, in the following we further investigate this issue by using the Q learning algorithm to train pricing policies against the above four policies, i.e. QQ, QR, MR and MM. We thus obtain four new pricing policies, Q-QR, Q-QQ, Q-MR and Q-MM; for example, Q-QR is a Q learning based pricing policy trained against the QR policy from the previous section. We then run the following simulations: Q-MM vs MM, Q-MR vs MR, Q-QQ vs QQ and Q-QR vs QR. The results are shown in Fig. 5, where QQ, QR, MR and MM are denoted as X. From Fig. 5(b), in terms of winning percentage, we find that Q-QQ outperforms QQ and Q-QR outperforms QR, i.e. their winning percentages are above \(50\%\). However, even though Q-MM is trained against MM, it is outperformed by MM, i.e. its winning percentage is below \(50\%\); a similar result holds for Q-MR. However, from Fig. 5(a), we find that even though Q-MM is outperformed by MM in terms of winning percentage, it obtains more profit than MM, and similarly for Q-MR. This is because the minimax-Q based pricing policies try to do their best in the worst case, and therefore their winning percentages remain at a good level, whereas the Q learning based policies try to maximize profits at all times, and therefore perform better in terms of making profits.

5 Conclusions

How to set prices effectively is an important issue for cloud providers, especially in an environment where multiple cloud providers compete against each other. In this paper, we use reinforcement learning algorithms to address this issue. Specifically, we model the issue as a Markov game, and use the minimax-Q and Q learning algorithms to design the pricing policies respectively. We then run extensive experiments to analyze the pricing policies. We find that although minimax-Q is more suitable for analyzing the competing game with multiple self-interested agents, the Q learning based pricing policy performs better in terms of making profits. We also find that the minimax-Q based pricing policy is better for remaining attractive to cloud users. The experimental results can provide useful insights for designing practical pricing policies in different situations.