Online revenue maximization for server pricing

Efficient and truthful mechanisms to price resources on servers/machines have been the subject of much work in recent years due to the importance of the cloud market. This paper considers revenue maximization in the online stochastic setting with non-preemptive jobs and a unit capacity server. One agent/job arrives at every time step, with parameters drawn from the underlying distribution. We design a posted-price mechanism which can be efficiently computed and is revenue-optimal in expectation and in retrospect, up to additive error. The prices are posted prior to learning the agent’s type, and the computed pricing scheme is deterministic, depending only on the length of the allotted time interval and on the earliest time the server is available. We also prove that the proposed pricing strategy is robust to imprecise knowledge of the job distribution and that a distribution learned from polynomially many samples is sufficient to obtain a near-optimal truthful pricing strategy.


Introduction
Designing mechanisms for a desired outcome with strategic and selfish agents is an extensively studied problem in economics, with classical work by Myerson [1] and Vickrey-Clarke-Groves [2] on truthful mechanisms. The advent of online interaction and e-commerce has added an efficiency constraint on mechanisms, going so far as to prioritize computational efficiency over classical objectives, e.g., choosing simple approximate mechanisms when optimal mechanisms are computationally difficult or impossible. Beginning with Nisan and Ronen [3], the theoretical computer science community has contributed greatly to the field, in both fundamental problems and specific applications. These contributions include designing truthful mechanisms for the maximization of welfare and revenue, as well as work on learning distributions of agent types, menu complexity, and dynamic mechanisms (e.g., [4,5]).
We consider this question in the setting of selling computational resources on remote servers or machines (cf. [6,7]). This is arguably one of the fastest growing markets on the Internet. The goods (resources) are assigned non-preemptively and thus have strong complementarities. Furthermore, since the supply (server capacity) is limited, any mechanism trades immediate revenue for future supply. Finally, mechanisms must be incentive-compatible, as non-truthful, strategic behaviour from the agents can skew the performance of a mechanism away from its theoretical guarantees. This leads us to the following question:
Question. Can we design an efficient, truthful, and revenue-maximizing mechanism to sell time slots non-preemptively on a single server?
We design a posted-price mechanism which maximizes expected revenue up to additive error, for agents/buyers arriving online, with parameters of value, length and maximum delay, drawn from an underlying distribution.
Three key aspects distinguish our problem from standard online scheduling: (i) In our setting, as time progresses, the server clears up, allowing longer jobs to be scheduled in the future if no smaller jobs are scheduled until then. (ii) Scheduling a job is not exclusively at the discretion of the mechanism designer: it must also be desired by the job itself, while producing sufficient revenue. (iii) As the mechanism designer, we do not have access to job parameters in an incentive-compatible way before deciding on a posted price menu. These three features lie at the core of the difficulty of our problem. Our focus will be on devising online mechanisms in the Bayesian setting.
In our online model, time on the server is discrete. At every time step, an agent arrives at the server with a value $V$, length requirement $L$, and maximum delay $D$. These parameters are drawn from a common distribution, i.i.d. across jobs. The job wishes to be scheduled for at least $L$ consecutive time slots, no more than $D$ time units after its arrival, and wishes to pay no more than $V$. Jobs are assumed to have quasi-linear utility in money, and so prefer the least-price interval within their constraints. The mechanism designer never learns the parameters of the job. Instead, she posts a price menu of (length, price) pairs and the minimum available delay $s$. The job accepts to be scheduled so long as $D \ge s$ and there is some (length, price) pair in the menu of length at least $L$ and price at most $V$. We note that the pricing scheme can be dynamic, changing through time. If, at time epoch $t$, an agent chooses option $(\ell, \pi_\ell)$, then she pays $\pi_\ell$ and her job is allocated to the interval $[t + s, t + s + \ell]$. She will choose the option which minimizes $\pi_\ell$. Throughout this paper we assume that the random variables $L, V, D$ are discrete and have finite support, unless specified otherwise.
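The agent's decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_option`, the dictionary representation of the menu, and the tie-breaking rule (ties in price go to the shorter interval) are hypothetical choices made here, not part of the paper.

```python
from typing import Dict, Optional, Tuple

def choose_option(menu: Dict[int, float], s: int,
                  length: int, value: float, max_delay: int
                  ) -> Optional[Tuple[int, float]]:
    """Return the (length, price) pair the job buys, or None if it declines.

    menu maps each offered length to its posted price; s is the posted
    minimum available delay. (Hypothetical representation.)
    """
    if max_delay < s:                       # server is not free soon enough
        return None
    # Options that are long enough for the job and affordable to it.
    feasible = [(price, ell) for ell, price in menu.items()
                if ell >= length and price <= value]
    if not feasible:
        return None
    price, ell = min(feasible)              # cheapest; ties go to the shorter interval
    return ell, price

menu = {1: 2.0, 2: 3.0, 3: 5.0}
print(choose_option(menu, s=1, length=2, value=4.0, max_delay=1))
```

With a monotone non-decreasing menu such as the one above, the cheapest feasible option is always the shortest feasible one, so the purchased length reveals the job's true length.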

Summary of Our Results
1. We model the problem of finding a revenue-maximizing pricing strategy as a Markov Decision Process (MDP). Given a price menu of (length, price) pairs and a state (minimum available delay) $s$ at time $t$, the probability of transition to any other state at time $t + 1$ is obtained from the distribution of the job's parameters. The revenue-maximizing pricing strategy can be efficiently computed via backwards induction. We also present, in Appendix C.2, an approximation scheme for the case where $V$ is a continuous random variable.
samples collected through the observations of the agents' decisions. We provide a truthful posted-price ε-approximate mechanism if the number of samples is polynomial in 1/ε and the size of the support of the distribution.

Related Work
Much recent work has focused on designing efficient mechanisms for pricing cloud resources. Chawla et al. [8] recently studied "time-of-use" pricing mechanisms, to match demand to supply with deadlines and online arrivals. Their result assumes large-capacity servers, and seeks to maximize welfare in a setting in which the jobs arriving over time are not i.i.d. [9] provides a mechanism for preemptive scheduling with deadlines, maximizing the total value of completed jobs. Another possible objective for the design of incentive-compatible scheduling mechanisms is the total value of completed jobs, which have release times and deadlines. [10] solves this problem in an online setting, [11] in the offline setting for parallel machines, and [12] in the online competitive setting with uncertain supply. [13] focuses on social welfare maximization for non-preemptive scheduling on multiple servers, and obtains a constant competitive ratio as the number of servers increases. Our work differs from these by considering revenue maximization and stochastic job types which are i.i.d. over time. [14] addresses computing a price menu for revenue maximization with different machines. Finally, [7] proposes a system architecture for scheduling and pricing in cloud computing.
Posted price mechanisms (PPMs) were introduced by [15] and have gained attention due to their simplicity, robustness to collusion, and ease of implementation in practice. One of the first theoretical results concerning PPMs is an asymptotic comparison to classical single-parameter mechanisms [16]. They were later studied by [17] for the objective of revenue maximization, and these results were further strengthened by [18] and [19]. [20] shows that sequential PPMs can 1/2-approximate social welfare for XOS valuation functions, if the price for an item is equal to the expected contribution of the item to the social welfare.
Sample complexity for revenue maximization has recently been studied in [5], which shows that polynomially many samples suffice to obtain near-optimal Bayesian auction mechanisms. An approach based on statistical learning, which allows one to learn mechanisms with expected revenue arbitrarily close to optimal from a polynomial number of samples, has been proposed in [21]. The problem of learning simple auctions from samples has been studied in [22].

Structure of the Paper
In Section 2 we describe the model of the problem as a Markov Decision Process. In Section 3 we present an efficient algorithm for computing optimal policies for the finite time horizon given full knowledge of the distribution of the jobs' parameters. This is extended to other settings in Appendix C. In Section 3.3, we demonstrate that the optimal policy is monotone, and in Section 3.4 we describe the concentration bounds on the revenue of a pricing policy. Section 4.2 gives the learning algorithm and error bounds for computing the pricing policies with only (partial) sample access to the job distribution. Finally, Section 4.3 and Section 5 are devoted to describing and summarizing the final result and future directions of research.
Proof details are provided in Appendix B.

Model
Notation. In what follows, the variables $t$, $\ell$ or $L$, $v$ or $V$, and $d$ or $D$ are reserved for describing the parameters of a job that wishes to be scheduled. Respectively, they represent the arrival time, required length, value, and maximum allowed delay. Lowercase variables represent fixed values, whereas uppercase variables represent random variables. The script-uppercase letters $\mathcal{L}, \mathcal{V}, \mathcal{D}$ represent the supports of the distributions of $L$, $V$, and $D$, respectively, and the bold-uppercase letters $\mathbf{L}, \mathbf{V}, \mathbf{D}$ represent the maximum values in these respective sets. Finally, $\pi$ is reserved for pricing policies, whereas $p$ is reserved for probabilities.
Single-Machine, Non-Preemptive Job Scheduling. A sequence of random jobs wish to be scheduled on a server, non-preemptively, for a sufficiently low price, within a time constraint. Formally, at every time step $t$, a single job with parameters $(L, V, D)$ is drawn from an underlying distribution $Q$ over the space $\mathcal{L} \times \mathcal{V} \times \mathcal{D}$. It wishes to be scheduled for a price $\pi \le V$ in an interval $[a, b]$ such that $a - t \le D$ and $b - a \ge L$.
Price Menus. Our goal is to design a take-it-or-leave-it, posted-price mechanism which maximizes expected revenue. At each time period, the mechanism posts a "price menu" and an earliest-available-time $s_t$, indicating that times $t$ through $t + s_t - 1$ have already been scheduled. ($s_t$ will henceforth be referred to as the state of the server.) We let $\mathcal{S} := \{0, \dots, \mathbf{D} + \mathbf{L}\}$ be the set of all possible states. The state of the server at a given time $t$ is naturally a random variable which depends on the earlier jobs and on the adopted policy $\pi$. As before, we denote by $s$ or $s_t$ the fixed value, and by $S$ or $S_t$ the corresponding random variable. The price menu is given by the function $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$, i.e., if we are at time $t$ and the server is in state $s_t$, then the prices are set according to $\pi_t(s_t, \cdot) : \mathcal{L} \to \mathbb{R}$. The reported pair $(\pi_t(s_t, \cdot), s_t)$ is computed by the scheduler's strategy, which we determine in this paper. Once this is posted, a job $(L, V, D)$ is sampled i.i.d. from the underlying distribution $Q$.
If $V \ge \pi_t(s_t, \ell)$ for some $\ell \ge L$, and $D \ge s_t$, then the job accepts the schedule and reports the length $\ell \ge L$ which minimizes the price. Otherwise, the job reports $\ell = 0$ and is not scheduled. To guarantee truthfulness, it suffices to have $\pi_t(s, \cdot)$ be monotonically non-decreasing for every state $s$: the agent would not want a longer interval since it costs at least as much, and would not want a shorter interval since it cannot accommodate the job. It should be clear that the mechanism's strategy is to always report monotone non-decreasing prices, as a decrease in the price menu will only cause more utilization of the server, without accruing more revenue. The main technical challenge in this paper, then, is to show that, under some assumptions, the optimal strategy is monotone non-decreasing and efficiently computable.
Revenue Objective. Revenue can be measured over either a finite horizon or an infinite discounted horizon. In the former (finite) case, only $T$ time periods will occur, and we seek to maximize the expected sum of revenue over these periods.
In the infinite-horizon setting, future revenue is discounted at an exponentially decaying rate. Formally, revenue at time $t$ is worth a $\gamma^t$ fraction of revenue at time 0, for some fixed $\gamma < 1$. See Appendix C.1.
Recall that the job parameters are drawn independently at random from the underlying distribution, so the scheduler can only base their "price menu" on the state of the system and the current time. Thus, the only realistic strategy is to fix a state-and-time-dependent pricing policy $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$.
Let $\mathcal{X} = \{X_1 = (L_1, V_1, D_1), X_2 = (L_2, V_2, D_2), X_3, \dots\}$ be the random sequence of jobs arriving, sampled i.i.d. from the underlying distribution. Let $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$ be the pricing policy. We denote by $\mathrm{Rev}_t(\mathcal{X}, \pi)$ the revenue earned at time $t$ with policy $\pi$ and sequence $\mathcal{X}$. If $X_t$ does not buy, then $\mathrm{Rev}_t(\mathcal{X}, \pi) = 0$, and otherwise it is equal to $\pi_t(s_t, L_t)$. We denote by $\mathrm{CmlRev}_T$ the total (cumulative) revenue earned over the $T$ periods. Thus,
$$\mathrm{CmlRev}_T(\mathcal{X}, \pi) := \sum_{t=1}^{T} \mathrm{Rev}_t(\mathcal{X}, \pi). \qquad (1)$$
We will also need the expected future revenue, given a current time and server state, which we denote as follows:
$$U^{\pi}_t(s) := \mathbb{E}_{\mathcal{X}_{\ge t}}\Big[\, \sum_{t' = t}^{T} \mathrm{Rev}_{t'}(\mathcal{X}, \pi) \;\Big|\; S_t = s \Big]. \qquad (2)$$
The subscript $\mathcal{X}_{\ge t}$ of the expectation denotes that we consider only jobs arriving from time $t$ onward. Our objective is to find the pricing policy $\pi$ which maximizes $U^{\pi}_0(s = 0)$. Call this policy $\pi^*$, and denote the expected revenue under $\pi^*$ by $U^*_t(\cdot)$.
3 Bayes-optimal Strategies for Server Pricing
In this section we seek to compute an optimal monotone pricing policy $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$ which maximizes revenue in expectation over $T$ jobs sampled i.i.d. from an underlying known distribution $Q$. This is extended to the infinite-horizon, discounted setting in Appendix C.1.
We first model the problem of maximizing the revenue in online server pricing as a Markov Decision Process that admits an efficiently-computable, optimal pricing strategy.The main contribution of this section is to show that, for a natural assumption on the distribution Q, the optimal policy is monotone.We recall that this allows us to derive truthful Bayes-optimal mechanisms.

Markov Decision Processes.
We show that the theory of Markov Decision Processes is well suited to model our problem. A Markov Decision Process is, in essence, a Markov chain whose transition probabilities depend on the action chosen at each state, and where a reward is assigned to each transition. A policy is then a function $\pi$ mapping states to actions. In our setting, the states are the states of the system outlined in Section 2 (i.e., the possible delays before the earliest available time on the server), and the actions are the "price menus." At every state $s$, a job of a random length arrives and, with some probability, chooses to be scheduled, given the choice of prices. The next state is either $\max\{s - 1, 0\}$, if the job does not choose to be scheduled (since we have moved forward in time), or $s + \ell - 1$, if a job of length $\ell$ is scheduled, since we have occupied $\ell$ more units. The transition probabilities depend on the distribution of job lengths and on the probability that a job accepts to be scheduled given the pricing policy (action). Formally,
$$\mathbb{P}[S_{t+1} = s + \ell - 1 \mid S_t = s] = \mathbb{P}[L = \ell,\, V \ge \pi_t(s, \ell),\, D \ge s] \quad \text{for each } \ell \in \mathcal{L},$$
$$\mathbb{P}[S_{t+1} = s - 1 \mid S_t = s] = 1 - \sum_{\ell \in \mathcal{L}} \mathbb{P}[L = \ell,\, V \ge \pi_t(s, \ell),\, D \ge s].$$
(Transitions to state "−1" should be read as transitions to state "0".) Note that a job of length $\ell$ may choose to purchase an interval of length greater than $\ell$, which would render these transition probabilities incorrect. However, this may only happen if the larger interval is more affordable. It is therefore in the scheduler's interest to guarantee that $\pi_t(s, \cdot)$ is monotone non-decreasing in $\ell$, which incentivizes truthfulness, since this increases the amount of server time available without affecting revenue. Thus we restrict ourselves to this case.
It remains to define the transition rewards. They are simply the revenue earned. Formally, a transition from state $s_t$ to $s_t + \ell - 1$ incurs a reward of $\pi_t(s_t, \ell)$, whereas a transition from state $s_t$ to $s_t - 1$ incurs 0 reward. We wish to compute a policy $\pi$ that maximizes the expected cumulative revenue, given as the (possibly discounted) sum of all transition rewards in expectation.
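As a concrete reading of the transitions and rewards just defined, the following Python sketch simulates one step of the process. It assumes a monotone menu, so that an accepting job buys exactly its own length; the function name `step` and the dictionary menu are hypothetical choices made for illustration.

```python
from typing import Dict, Tuple

Job = Tuple[int, float, int]  # (length, value, max_delay)

def step(s: int, menu: Dict[int, float], job: Job) -> Tuple[int, float]:
    """One transition of the MDP: returns (next_state, reward).

    Assumes menu is monotone non-decreasing in length, so a job that
    accepts purchases an interval of exactly its own length.
    """
    length, value, max_delay = job
    price = menu.get(length)
    if price is not None and value >= price and max_delay >= s:
        return s + length - 1, price      # job is scheduled, revenue collected
    return max(s - 1, 0), 0.0             # job declines; state "-1" is read as 0

def cumulative_revenue(menu: Dict[int, float], jobs) -> float:
    """Total revenue of a fixed (time-independent) menu on a job sequence."""
    s, total = 0, 0.0
    for job in jobs:
        s, reward = step(s, menu, job)
        total += reward
    return total

print(cumulative_revenue({1: 1.0, 2: 3.0}, [(2, 4.0, 0), (1, 5.0, 0), (1, 5.0, 0)]))
```

In the usage line, the first job occupies two slots, so the second job (which tolerates no delay) is turned away and the third is scheduled once the server is free again.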

Solving for the Optimal Policy with Distributional Knowledge
In this section, we present a modified MDP whose optimal policies can be efficiently computed, and show that these policies are optimal for the original MDP. We assume here that the mechanism designer is given access to the underlying distribution $Q$. In the following sections, however, we will show that if the distribution $Q$ is estimated from samples, then solving the MDP on this estimated distribution is sufficient to ensure good revenue guarantees.
Since the problem has been modelled as a Markov Decision Process (MDP), we may rely on the wealth of literature available on MDP solutions. In particular, we will leverage the backwards induction algorithm (BIA) of [23], Section 4.5, included in Appendix B as Algorithm 1. We will, however, need to ensure that this standard algorithm (i) runs efficiently, and (ii) returns a monotone pricing policy.
Note that past prices do not contribute to future revenue insofar as the current state remains unchanged. Thus, to compute optimal current prices, we need only know the current state and expected future revenue. This allows us to use the BIA. The idea is to compute the optimal time-dependent policy, and the incurred expected reward, for shorter horizons, then use this to recursively compute the optimal policies for longer horizons.
The total runtime of the BIA is $O(T|\mathcal{S}||\mathcal{A}|)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. Note that the dependence on $T$ is unavoidable, since any optimal policy must be time-dependent. Recall that $\mathbf{L}$ and $\mathbf{D}$ denote the maximum values that $L$ and $D$ can take, respectively, and $\mathcal{V}$ is the set of possible values that $V$ can take. Denote $K := \max\{\mathbf{D} + \mathbf{L}, |\mathcal{V}|\}$. If we define the action space naïvely, we have $|\mathcal{S}| = \mathbf{D} + \mathbf{L} \le K$ and $|\mathcal{A}| \le K^{\mathbf{L}}$. Thus, a naïve definition of the MDP bounds the runtime at $K^{O(K)}$, which is far from efficient. Requiring monotonicity only affects lower-order terms.
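Modified MDP. To avoid this exponential dependence, we define the state space more carefully: instead of states being the possible server states, we define our state space as the possible (state, length) pairs. Thus, when the MDP is in state $(s, \ell)$, the server is in state $s$, and a job of length $\ell$ has been sampled from the distribution. Our action space is then simply the set of possible values of $\pi_t(s, \ell)$. Writing $\mu = \pi_t(s, \ell)$ for the chosen price, the transition probabilities and rewards become:
$$\mathbb{P}[(s, \ell) \to (s + \ell - 1, \ell')] = \mathbb{P}[L = \ell'] \cdot \mathbb{P}[V \ge \mu,\, D \ge s \mid L = \ell], \quad \text{with reward } \mu,$$
$$\mathbb{P}[(s, \ell) \to (s - 1, \ell')] = \mathbb{P}[L = \ell'] \cdot \big(1 - \mathbb{P}[V \ge \mu,\, D \ge s \mid L = \ell]\big), \quad \text{with reward } 0.$$
Therefore, we get $|\mathcal{S}| = (\mathbf{D} + \mathbf{L}) \cdot \mathbf{L} \le K^2$ and $|\mathcal{A}| \le K$. Thus, the runtime of the algorithm becomes $O(TK^3)$. A full description of the procedure is given in Appendix B as Algorithm 2. It remains to prove that it is correct. We begin by claiming that these two MDPs are equivalent in the following sense:
Lemma 1. For any fixed pricing policy $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$,
$$U^{\pi}_t(s) = \mathbb{E}_{L}\big[u^{\pi}_t(s, L)\big],$$
where the $U^{\pi}_t(\cdot)$'s are as in (2), and the $u^{\pi}_t(\cdot, \cdot)$'s are from the modified MDP.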
(See Appendix B for a proof.) This lemma, however, does not suffice on its own, as agents may behave strategically by over-reporting their length if the prices are not increasing. This would alter the transition probabilities, breaking the analysis. We will see that, under a mild assumption, this cannot happen, as the optimal policy for non-strategic agents will be monotone, and therefore truthful.
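The backward induction over (state, length) pairs described in this section can be sketched as follows. This is a minimal, unoptimized Python illustration, not the paper's Algorithm 2 verbatim: the representation of $Q$ as a dictionary from (length, value, delay) triples to probabilities, the function name, the restriction of candidate prices to the support of $V$, and the tie-breaking toward the larger price are all assumptions made here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Job = Tuple[int, float, int]  # (length, value, max_delay)

def backward_induction(q: Dict[Job, float], T: int):
    """Return (U, policy): U[t][s] is the optimal expected future revenue,
    policy[t][(s, l)] the price posted for length l in state s at time t."""
    lengths = sorted({l for l, _, _ in q})
    prices = sorted({v for _, v, _ in q})           # candidate prices: support of V
    n_states = max(d for _, _, d in q) + max(lengths) + 1
    p_len: Dict[int, float] = defaultdict(float)    # marginal P[L = l]
    for (l, _, _), pr in q.items():
        p_len[l] += pr

    def accept(l: int, mu: float, s: int) -> float:
        # P[V >= mu, D >= s | L = l], by direct summation (not optimized)
        joint = sum(pr for (l2, v, d), pr in q.items()
                    if l2 == l and v >= mu and d >= s)
        return joint / p_len[l]

    U = [[0.0] * n_states for _ in range(T + 1)]    # terminal values U[T][s] = 0
    policy: List[Dict[Tuple[int, int], float]] = [dict() for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for s in range(n_states):
            stay = U[t + 1][max(s - 1, 0)]          # value if the job is not scheduled
            for l in lengths:
                # Clamping the index only matters when no job can accept (s > max delay).
                go = U[t + 1][min(s + l - 1, n_states - 1)]
                best_val, best_mu = max(
                    (accept(l, mu, s) * (mu + go) + (1 - accept(l, mu, s)) * stay, mu)
                    for mu in prices)               # ties go to the larger price
                policy[t][(s, l)] = best_mu
                U[t][s] += p_len[l] * best_val      # average over the sampled length
    return U, policy

# Unit-length jobs worth 1 or 3 with equal probability, no tolerated delay.
U, policy = backward_induction({(1, 1.0, 0): 0.5, (1, 3.0, 0): 0.5}, T=2)
print(U[0][0], policy[0][(0, 1)])
```

In the toy usage, posting price 3 earns 1.5 per step in expectation against 1.0 for price 1, so the sketch selects the higher price at every step.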

Monotonicity of the Optimal Pricing Policies
Recall that the solution of the more efficient MDP formulation is only correct if we can show that it is always monotone without considering the strategic behaviour of agents, ensuring incentive-compatibility of the optimum.
An optimal monotone strategy cannot be obtained for all distributions on $L$, $V$, and $D$. As an example, for any distribution where a job's value is a deterministic function of its length, the optimal policy is to price-discriminate by length. If this function is not monotone, the optimum will not be either. To this end, we introduce the following assumption, which we will discuss below, and which will imply monotonicity of the pricing policy.
Assumption 1. The quantity
$$\frac{\mathbb{P}[V \ge \mu',\, D \ge s \mid L = \ell]}{\mathbb{P}[V \ge \mu,\, D \ge s \mid L = \ell]}$$
is monotone non-decreasing as $\ell$ grows, for any fixed state $s$ and fixed $0 \le \mu < \mu'$.
This is not a natural or immediately intuitive assumption. However, we will show that it is satisfied if the valuation of jobs follows a log-concave distribution which is parametrized by the job's length, and where the valuation is (informally) positively correlated with this length. Log-concave distributions are also commonly referred to as distributions possessing a monotone hazard rate, and it is common practice in economic settings to require this property of the agent valuations.
Lemma 2. Let $V^s_\ell$ denote the marginal r.v. $V$ conditioned on $L = \ell$ and $D \ge s$. Let $Z$ be a continuously-supported random variable, and
A discussion of log-concave random variables and a proof of this fact is given in Appendix A. Many standard (discrete) distributions are (discrete) log-concave random variables, including the uniform, Gaussian, logistic, exponential, Poisson, binomial, etc. These can be proved to be log-concave from the discussion in Appendix A. In the above, the $\gamma$ terms represent a notion of spread or shifting, parametrized by the length, indicating some amount of positive correlation.
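For a finitely supported distribution, Assumption 1 can be checked numerically. The Python sketch below is one way to do so, under assumptions made here: the joint distribution is a dictionary from (length, value, delay) triples to probabilities, lengths whose denominator is zero are skipped, and it suffices to test thresholds at support points because the tail probabilities are step functions. The conditional ratio equals the ratio of joint probabilities, since $\mathbb{P}[L = \ell]$ cancels.

```python
from itertools import combinations
from typing import Dict, Tuple

Job = Tuple[int, float, int]  # (length, value, max_delay)

def satisfies_assumption_1(q: Dict[Job, float], tol: float = 1e-12) -> bool:
    """Check that P[V>=mu', D>=s | L=l] / P[V>=mu, D>=s | L=l] is
    non-decreasing in l for all s and all mu < mu' (hypothetical helper)."""
    lengths = sorted({l for l, _, _ in q})
    values = sorted({0.0} | {v for _, v, _ in q})
    delays = sorted({0} | {d for _, _, d in q})

    def tail(l: int, mu: float, s: int) -> float:
        # Joint P[L = l, V >= mu, D >= s]; P[L = l] cancels in the ratio.
        return sum(pr for (l2, v, d), pr in q.items()
                   if l2 == l and v >= mu and d >= s)

    for s in delays:
        for mu, mu_hi in combinations(values, 2):        # mu < mu_hi
            pairs = [(tail(l, mu_hi, s), tail(l, mu, s)) for l in lengths]
            pairs = [(num, den) for num, den in pairs if den > 0]
            for (n1, d1), (n2, d2) in zip(pairs, pairs[1:]):
                if n1 * d2 > n2 * d1 + tol:              # ratio strictly decreased
                    return False
    return True

# Value increasing in length satisfies the assumption; decreasing does not.
print(satisfies_assumption_1({(1, 1.0, 0): 0.5, (2, 2.0, 0): 0.5}))
print(satisfies_assumption_1({(1, 2.0, 0): 0.5, (2, 1.0, 0): 0.5}))
```

The second usage line mirrors the counterexample mentioned earlier in this section: a value that is a non-monotone (here, decreasing) deterministic function of length.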
It remains to show price monotonicity under the above assumption. We begin with the following, which holds for arbitrary distributions.
Lemma 3. Let $U^*_t(s)$ be the expected future revenue earned starting at time $t$ in state $s$, for the optimal policy computed by Algorithm 2. Then the function $s \mapsto U^*_t(s)$ is monotone non-increasing in $s$ for any fixed $t$.
See Appendix B for the proof. This lemma ensures that over-selling time on the server can only hurt the mechanism. This allows us to conclude:
Lemma 4. If the distribution on job parameters satisfies the above assumption, then for all $\ell, s, t$, we have $\pi^*_t(s, \ell) \le \pi^*_t(s, \ell + 1)$.

Sketch.
A full proof may be found in Appendix B. The idea is to show that, for any price $\mu$ less than the optimum $\pi^*_t(s, \ell)$, the difference in revenue between charging $\mu$ and $\pi^*_t(s, \ell)$ to jobs of length $\ell$ is less than the difference in revenue between the same prices for jobs of length $\ell + 1$. This is achieved by applying the assumption to the recursive definition of future revenue, along with the previous lemma. Thus, we can conclude that the optimal price $\pi^*_t(s, \ell + 1)$ must be at least $\pi^*_t(s, \ell)$.
With Lemma 4 and the results of Appendix C, we finally have:
Theorem 5. The online server pricing problem admits an optimal monotone pricing strategy when the variables $L$, $V$, and $D$ satisfy Assumption 1. Also,
1. In the finite horizon setting, when $V$ is finitely supported, an exact optimum can be computed in time $O(TK^3)$.
2. In the infinite horizon setting, when $V$ is finitely supported, for all $\varepsilon > 0$, an $\varepsilon$-additive-approximate policy can be computed in time
3. In the finite horizon setting, when $V$ is continuously supported, for all $\eta > 0$, an $\eta T$-additive-approximate policy can be computed in time $O(TK^2\mathbf{V}/\eta)$.

Concentration Bounds on Revenue for Online Scheduling
In this section, we show that the revenue of an arbitrary policy concentrates around its mean. In particular, this holds for the optimal or approximately optimal strategies described above. This will also allow us to argue later that, if we have an estimate $\hat{Q}$ of $Q$ and execute Algorithm 2 given the distribution $\hat{Q}$, then the output policy will perform well with respect to $Q$, both in expectation and with high probability. To show this concentration, we consider the Doob, or exposure, martingale of the cumulative revenue function introduced in Section 2. Define
$$R^{\pi}_i := \mathbb{E}_{X_{i+1}, \dots, X_T}\big[\mathrm{CmlRev}_T(\mathcal{X}, \pi) \;\big|\; X_1, \dots, X_i\big],$$
where the $X_i$'s are the jobs in the sequence $\mathcal{X}$ and the expected value is taken with respect to $X_{i+1}, \dots, X_T$. Thus, $R^{\pi}_0$ is the expected cumulative revenue, and $R^{\pi}_T$ is the random cumulative revenue.
To formally describe this martingale sequence, we introduce some notation and formalize some previous notation. Recall that $X_1, X_2, \dots$ is a sequence of jobs sampled i.i.d. from an underlying distribution $Q$. Fix a pricing policy $\pi : [T] \times \mathcal{S} \times \mathcal{L} \to \mathbb{R}$. Note that the state at time $t$ is a random variable depending on both the (deterministic) pricing policy and the (random) $X_1, \dots, X_{t-1}$. We denote it $S_t(\pi, \mathcal{X})$, or $S_t$ for short. Formally, suppose $X_t = (V_t, L_t, D_t)$; then $S_{t+1}(\pi, \mathcal{X}) = S_t(\pi, \mathcal{X}) - 1$ if either $V_t < \pi_t(S_t, L_t)$ or $D_t < S_t$, and otherwise $S_{t+1}(\pi, \mathcal{X}) = S_t(\pi, \mathcal{X}) + L_t - 1$. Furthermore, let $\mathrm{Rev}_t(\pi, \mathcal{X})$ be equal to 0 in the first case above (the $t$-th job is not scheduled), and to $\pi_t(S_t, L_t)$ otherwise. Thus, $S_t(\pi, \mathcal{X})$ and $\mathrm{Rev}_t(\pi, \mathcal{X})$ are functions of the random values $X_1, \dots, X_t$ for $\pi$ fixed. Note that $\mathrm{Rev}_t$ implicitly depends on $S_t$. Let $\mathcal{X}_{>i} := (X_{i+1}, X_{i+2}, \dots)$ and $\mathcal{X}_{\le i} := (X_1, \dots, X_i)$. Recalling that $\mathrm{CmlRev}_T(\mathcal{X}, \pi) = \sum_{t=1}^{T} \mathrm{Rev}_t(\mathcal{X}, \pi)$, we have
$$R^{\pi}_i = \sum_{t=1}^{i} \mathrm{Rev}_t(\mathcal{X}_{\le i}, \pi) + \mathbb{E}_{\mathcal{X}_{>i}}\Big[\, \sum_{t=i+1}^{T} \mathrm{Rev}_t(\mathcal{X}, \pi) \;\Big|\; \mathcal{X}_{\le i} \Big].$$
We wish to show that $\mathrm{CmlRev}_T(\mathcal{X}, \pi)$ concentrates around its mean. Since $R^{\pi}_0$ is the expected revenue due to $\pi$, and $R^{\pi}_T$ is the (random) revenue observed, it suffices to show that $|R^{\pi}_0 - R^{\pi}_T|$ is small, which we do by applying Azuma's inequality after showing the bounded-differences property. This gives the following; see Appendix B.3 for details.
Theorem 6. Let $\mathcal{X}$ be a finite sequence of $T$ jobs sampled i.i.d. from $Q$, and let $\pi$ be any monotone policy. Then, with probability $1 - \delta$,
$$\big|\mathrm{CmlRev}_T(\mathcal{X}, \pi) - \mathbb{E}[\mathrm{CmlRev}_T(\mathcal{X}, \pi)]\big| \le \mathbf{V}\sqrt{2\log(2/\delta)(T + 1)}$$
in the finite horizon, and in the infinite
In particular these results hold true for the (approximately) optimal pricing strategies of Theorem 5.
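For reference, the standard form of the Azuma-Hoeffding inequality invoked above can be written as follows for the martingale $R^{\pi}_0, \dots, R^{\pi}_T$. The difference bounds $c_i$ and the deviation $\lambda$ are notation introduced here for illustration; the specific values of $c_i$ used in the paper are derived in Appendix B.3 and are not reproduced.

```latex
\[
  |R^{\pi}_{i} - R^{\pi}_{i-1}| \le c_i \ \text{ for all } i
  \quad\Longrightarrow\quad
  \Pr\bigl[\, |R^{\pi}_{T} - R^{\pi}_{0}| \ge \lambda \,\bigr]
  \;\le\; 2\exp\!\left( -\frac{\lambda^{2}}{2\sum_{i=1}^{T} c_i^{2}} \right).
\]
```

Setting the right-hand side equal to $\delta$ and solving for $\lambda$ gives a deviation of $\sqrt{2\log(2/\delta)\sum_i c_i^2}$ with probability at least $1 - \delta$, which is how a bound on the differences translates into a high-probability bound on the revenue.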

Robustness of Pricing with Approximate Distributional Knowledge
In this section, we show that results analogous to Theorems 5 and 6 may be obtained even in the case in which we do not have full knowledge of the distribution $Q$, but only an estimate $\hat{Q}$. We then show how to obtain a valid $\hat{Q}$ from samples.

Robustness of the pricing strategy
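Suppose that, instead of knowing the exact distribution $Q = (D, L, V)$ of the jobs, we only have access to some estimate $\hat{Q} = (\hat{D}, \hat{L}, \hat{V})$ with the following property, for some $\varepsilon > 0$:
$$\big|\, \mathbb{P}[L = \ell,\, V \ge v,\, D \ge s] - \mathbb{P}[\hat{L} = \ell,\, \hat{V} \ge v,\, \hat{D} \ge s] \,\big| < \varepsilon \quad \text{for all } \ell, v, s. \qquad (8)$$
For the sake of brevity, we abuse notation and denote the condition in (8) as $|Q - \hat{Q}| < \varepsilon$. Later, we will need to estimate, given $\hat{Q}$, the value $\mathbb{P}[L = \ell,\, (V < v \text{ or } D < s)]$, that is, the probability that the job has length $\ell$, but either cannot afford price $v$ or cannot be scheduled $s$ slots in the future. This is equal to
$$\mathbb{P}[L = \ell] - \mathbb{P}[L = \ell,\, V \ge v,\, D \ge s].$$
The left-hand term is equal to $\mathbb{P}[L = \ell,\, V \ge 0,\, D \ge 0]$, and so we have access to both terms. The estimation error is additive, so the deviation is at most $2\varepsilon$. Denote $p^{\ell}_{t,s} := \mathbb{P}[V \ge \pi_t(s, \ell),\, D \ge s \mid L = \ell]$, and recall the expected revenue from time $t$ onwards, conditioning on $S_t = s$:
$$U^{\pi}_t(s) = \sum_{\ell \in \mathcal{L}} \mathbb{P}[L = \ell] \Big( p^{\ell}_{t,s}\big(\pi_t(s, \ell) + U^{\pi}_{t+1}(s + \ell - 1)\big) + \big(1 - p^{\ell}_{t,s}\big)\, U^{\pi}_{t+1}(s - 1) \Big). \qquad (9)$$
Let $\hat{U}^{\pi}_t(\cdot)$ be the same as $U^{\pi}_t(\cdot)$, but where the variables are distributed as $\hat{Q}$. As before, let $U^*_t(\cdot)$ be $U^{\pi}_t(\cdot)$ for $\pi = \pi^*$, the Bayes-optimal policy returned by Algorithm 2, and let $\hat{U}^*_t(\cdot)$ be defined similarly but with respect to $\hat{Q}$. We will show that $\hat{U}^*_t(\cdot)$ is a good estimate for $U^*_t(\cdot)$.
Lemma 7. Let $Q$ and $\hat{Q}$ be such that $|Q - \hat{Q}| < \varepsilon$.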
The proof of 1 is in Appendix B.4, and the proof of 2 in Appendix C.1.

Learning the Underlying Distribution from Samples
As discussed above, we show here how to compute a $\hat{Q}$ from samples of $Q$, such that $|Q - \hat{Q}|$ is small with high probability. In particular, we present a sampling procedure which respects the rules of the server pricing mechanism. When a job arrives, we only learn its length, and only if it agrees to be scheduled. Thus, we are not given full samples of $Q$, which complicates the learning procedure. Thanks to the previous section, we know that a policy which is optimal with respect to $\hat{Q}$ will be close to optimal with respect to $Q$. We remark, however, that the power of the results of the previous section is not exhausted by this application: one may directly apply the robustness results to specific problems in which $Q$ is subject to (small) noise, or in which an approximate distribution is already known from other sources.
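Let $\mathcal{X} = \{X_1, \dots, X_n\}$, with $X_i = (L_i, V_i, D_i)$, be an i.i.d. sample of $n$ jobs from the underlying distribution $Q$. Note that the expectation of an indicator is the probability of the indicated event. Fix a length $\ell$, a state $s$, and a value $v$. As a consequence of Hoeffding's inequality, with probability $1 - \delta$,
$$\Big|\, \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[L_i = \ell,\, V_i \ge v,\, D_i \ge s] - \mathbb{P}[L = \ell,\, V \ge v,\, D \ge s] \,\Big| \le \sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (10)$$
Sampling Procedure. We wish to estimate the value $\mathbb{P}[L = \ell,\, V \ge v,\, D \ge s]$ for all choices of $\ell$, $v$, and $s$. Fixing $v$ and $s$, we may repeatedly post prices $\pi_t(s, \ell) = v$ and declare that the earliest available time is $s$, then record (i) which jobs accept to be scheduled, and (ii) the length of each scheduled job. Let $\varepsilon > 0$ and $n \ge \log(2/\delta)/(2\varepsilon^2)$; then by (10), the sample average of each value will have error at most $\varepsilon$ with probability $1 - \delta$, for any one choice of $(\ell, v, s)$.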
Repeating this process for all $\le K^2$ choices of $v \in \mathcal{V}$ and $s \in \mathcal{S}$ gives us estimates for each. Now, if we want the estimate to hold over all choices of $\ell, v, s$, it suffices to take the union bound over all $\le K^3$ values (including $\ell$) and scale accordingly. If we take $n \ge 3\log(2K/\delta)/(2\varepsilon^2)$ samples for each of the $\le K^2$ choices of $v$ and $s$, then simultaneously for all $\ell$, $v$, and $s$, the quantity in (10) is at most $\varepsilon$. So we have obtained the "$|Q - \hat{Q}| < \varepsilon$" condition.
It should be noted that, for this sampling procedure, if a job of length $\ell$ is scheduled, we may have to wait up to $\ell$ time units before taking the next sample, in order to clear the buffer. This blows up the sampling time by a factor of $O(\mathbf{L})$.
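The sample sizes quoted in the sampling procedure, and the sample-average estimator itself, can be illustrated with a short Python sketch. The function names and the convention that a declined offer is recorded as length 0 (matching the report $\ell = 0$ in Section 2) are choices made here for illustration.

```python
import math
from typing import Iterable

def samples_per_query(eps: float, delta: float) -> int:
    """Smallest integer n with n >= log(2/delta) / (2 eps^2): enough for one
    fixed choice of (length, value, state)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def samples_per_query_uniform(eps: float, delta: float, K: int) -> int:
    """Smallest integer n with n >= 3 log(2K/delta) / (2 eps^2): enough for
    the union bound over all <= K^3 choices of (length, value, state)."""
    return math.ceil(3 * math.log(2 * K / delta) / (2 * eps ** 2))

def estimate_tail(observed_lengths: Iterable[int], ell: int) -> float:
    """Sample average of the indicator [L = ell, V >= v, D >= s], where the
    observations were collected at posted price v and posted state s.
    Each observation is the purchased length, or 0 if the job declined."""
    obs = list(observed_lengths)
    return sum(1 for x in obs if x == ell) / len(obs)

print(samples_per_query(0.1, 0.05))                  # samples for one (l, v, s)
print(samples_per_query_uniform(0.1, 0.05, K=10))    # samples per (v, s), all l, v, s
print(estimate_tail([2, 0, 2, 1], ell=2))            # estimated probability
```

Multiplying the per-query sample size by the number of $(v, s)$ pairs, and by the $O(\mathbf{L})$ waiting factor noted above, gives the total sampling time.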
The following result follows immediately from Lemma 7 and Höffding's inequality for the right choice of n.
Lemma 8. Let $n$, $Q$, and $\hat{Q}$ be as above. In the finite horizon, for all $\varepsilon > 0$, if $n \ge 6TK^4\log(2K/\delta)/\varepsilon^2$, we have that with probability

Performance of the Computed Policy
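We use here the results of the previous sections to analyze the performance of the policy output by Algorithm 2 after the learning procedure. By the estimation of revenue, the best policy in estimated expectation is near-optimal in expectation. Since revenues from arbitrary policies concentrate, we get near-optimal revenue in hindsight. Formally, for $\varepsilon > 0$, Lemma 8 gives us that if the sample distribution $\hat{Q}$ is computed on $n \ge 6TK^4\log(2K/\delta)/\varepsilon^2$ samples, then with probability $1 - \delta$ over the samples, $|U^{\pi^*}_0(0) - \hat{U}^{\pi^*}_0(0)| < \varepsilon$, where $U^{\pi^*}_0(0) = U^*_0(0)$ is exactly the expected cumulative revenue of the optimal policy. For clarity of notation, denote $\mathrm{ECRev}_T(\pi \mid Q) := \mathbb{E}_{\mathcal{X} \sim Q}[\mathrm{CmlRev}_T(\mathcal{X}, \pi)]$. We have shown that, for sufficiently many samples, $|\mathrm{ECRev}_T(\pi^* \mid Q) - \mathrm{ECRev}_T(\pi^* \mid \hat{Q})| < \varepsilon$. This observation allows us to conclude:
Theorem 9 (Finite Horizon). Let $Q$ be the underlying distribution over jobs. Let $\varepsilon > 0$, and $n \ge 24TK^4\log(8K/\delta)/\varepsilon^2$. Then in time $O(TK^3 + n\mathbf{L})$, we may compute a policy $\hat{\pi}$ which is monotone in length, and therefore incentive compatible, such that for any policy $\pi$, with probability $1 - \delta$,
$$\mathrm{CmlRev}_T(\mathcal{X}, \hat{\pi}) \ge \mathrm{CmlRev}_T(\mathcal{X}, \pi) - \varepsilon - 2\mathbf{V}\sqrt{2\log(8/\delta)(T + 1)}.$$
Furthermore, if the distribution over values $V$ is continuous rather than discrete, we may compute in time $O(TK^2\mathbf{V}/\eta + n\mathbf{L})$ a monotone policy $\hat{\pi}$ such that for any policy $\pi$, with probability $1 - \delta$,
$$\mathrm{CmlRev}_T(\mathcal{X}, \hat{\pi}) \ge \mathrm{CmlRev}_T(\mathcal{X}, \pi) - \varepsilon - \eta T - 2\mathbf{V}\sqrt{2\log(8/\delta)(T + 1)}.$$
Proof. We have chosen $n \ge 6TK^4\log(2K/(\delta/4))/(\varepsilon/2)^2$. Let $\pi^*$ be the optimal policy for the true distribution $Q$. By Theorem 6, we have $|\mathrm{CmlRev}_T(\mathcal{X}, \pi) - \mathrm{ECRev}_T(\pi \mid Q)| < \mathbf{V}\sqrt{2\log(8/\delta)(T + 1)}$ with probability $1 - \delta/4$, for each of $\pi$ and $\hat{\pi}$. Furthermore, by Lemma 8, $|\mathrm{ECRev}_T(\pi' \mid Q) - \mathrm{ECRev}_T(\pi' \mid \hat{Q})| < \varepsilon/2$ with probability $1 - \delta/4$, for each of $\pi' = \hat{\pi}$ and $\pi' = \pi^*$. This is because, from the point of view of $\hat{\pi}$, $\hat{Q}$ is the true distribution and $Q$ is the estimate.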
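Taking the union bound over all four events above, and recalling that $\hat{\pi}$ maximizes $\mathrm{ECRev}_T(\cdot \mid \hat{Q})$ and $\pi^*$ maximizes $\mathrm{ECRev}_T(\cdot \mid Q)$, we get the following with probability $1 - \delta$, writing $\Delta := \mathbf{V}\sqrt{2\log(8/\delta)(T + 1)}$:
$$\mathrm{CmlRev}_T(\mathcal{X}, \pi) \le \mathrm{ECRev}_T(\pi \mid Q) + \Delta \le \mathrm{ECRev}_T(\pi^* \mid Q) + \Delta \le \mathrm{ECRev}_T(\pi^* \mid \hat{Q}) + \tfrac{\varepsilon}{2} + \Delta \le \mathrm{ECRev}_T(\hat{\pi} \mid \hat{Q}) + \tfrac{\varepsilon}{2} + \Delta \le \mathrm{ECRev}_T(\hat{\pi} \mid Q) + \varepsilon + \Delta \le \mathrm{CmlRev}_T(\mathcal{X}, \hat{\pi}) + \varepsilon + 2\Delta,$$
as desired. When $V$ is continuously distributed, choose prices which are multiples of $\eta$ between 0 and $\mathbf{V}$, as is outlined in Appendix C.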
For the $\gamma$-discounted infinite horizon case, we have the following:
Theorem 10 (Infinite Horizon, Discounted). Let $Q$ be the underlying distribution over jobs. Let $\varepsilon > 0$, and $n \ge \frac{24K^4\log(8K/\delta)}{\varepsilon^2(1 - \gamma)}$. Then we may compute a policy $\hat{\pi}$ in time + $n\mathbf{L}$, which is monotone, and thus incentive compatible, such that for any policy $\pi$, with probability $1 - \delta$, Furthermore, if the distribution over values $V$ is continuous rather than discrete, we may compute in time + $n\mathbf{L}$ a monotone policy $\hat{\pi}$ such that for any $\pi$, with probability $1 - \delta$,
As above, this policy $\hat{\pi}$ is computed by learning $\hat{Q}$ from $n$ samples as in Section 4.2, then running the modified Algorithm 2 on the estimated distribution as in Appendix C.1. In case $V$ is continuously distributed, we restrict ourselves to prices which are multiples of $\eta$ between 0 and $\mathbf{V}$. The details of the proof are in Appendix C. We recall that all these results need the distributional assumption from Section 3.3.

Conclusions and Future Work
In summary, we propose to price time on a server by first learning the distribution over jobs from samples, then computing the Bayes-optimal policy from the estimated distribution. Our learning algorithm is simple: we sample the distribution by observing n jobs at artificially fixed prices and server states, and learn the job parameters depending on whether the jobs agree to be scheduled. Using these observations, we build an observed distribution Q̂. We then run Algorithm 2 with Q̂ and compute an optimal policy π̂ for Q̂. We are guaranteed that the policy prices monotonically (due to Lemma 3), and therefore it is incentive compatible, which implies the correctness of the estimated revenue.
Future Work. There are many natural extensions to this work. For example, one could consider a multi-server setting, settings where jobs can request to be scheduled later than the earliest available time, or settings where jobs need various quantities of differing resources, such as memory and computation time.
For V s ℓ ∼ Z + γ s ℓ , observe that for x′ > x and γ′ > γ, we have since log F is a non-increasing and concave function, by assumption. Also where the first inequality is the same as the previous equation, and the second is by monotonicity. This completes the continuous case.
We present a final fact that justifies the use of ⌊Z⌋-type random variables:
Lemma 13. If Y is a discrete log-concave random variable, then there exists a continuous log-concave Z such that Y ∼ ⌊Z⌋.
Proof. Let P : ℤ → [0, 1] be the right-hand cumulative mass function for Y. Then, it suffices to have P[Z ≥ n] = P(n) for all integers n. Let φ : ℝ → ℝ be the piecewise-linear function such that φ(x) → 0 as x → −∞, φ(x) → −∞ as x → ∞, and φ(n) = log(P(n)) for all n. Since log(P) is a discretely concave and non-increasing function, φ must be concave and non-increasing. We can then set Z to be the random variable whose density is given by −(d/dx) exp(φ(x)).
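The construction in the proof of Lemma 13 can be checked numerically. The sketch below assumes Y is Poisson with mean 2 (a standard log-concave example, chosen here only for illustration), interpolates log P linearly between consecutive integers to obtain φ, and confirms that ⌊Z⌋ has the same mass function as Y. All function names are hypothetical.

```python
import math

LAM = 2.0  # assumed Poisson mean, for illustration only


def poisson_pmf(n: int, lam: float = LAM) -> float:
    """P[Y = n] for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**n / math.factorial(n)


def tail(n: int, lam: float = LAM) -> float:
    """Right-hand cumulative mass P(n) = P[Y >= n]; equals 1 for n <= 0."""
    if n <= 0:
        return 1.0
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(n))


def survival_z(x: float) -> float:
    """P[Z >= x] = exp(phi(x)), with phi the linear interpolation of log P."""
    n = math.floor(x)
    w = x - n  # position of x inside [n, n + 1)
    return math.exp((1 - w) * math.log(tail(n)) + w * math.log(tail(n + 1)))


def floor_pmf(n: int) -> float:
    """P[floor(Z) = n] = P[Z >= n] - P[Z >= n + 1]."""
    return survival_z(n) - survival_z(n + 1)


if __name__ == "__main__":
    for n in range(6):
        print(n, round(poisson_pmf(n), 6), round(floor_pmf(n), 6))
```

Because φ agrees with log P at every integer, the two printed columns coincide up to floating-point error; concavity of φ is inherited from the discrete concavity of log P.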

B Detailed Proofs
We present in this section the detailed proofs of the lemmas and theorems from the text. B.1 gives the pseudocode for the dynamic programs that compute the optimal pricing policies, outlined in Section 3; B.2 gives the proofs for the monotonicity of the pricing policies, along with the discussion on log-concave random variables from Appendix A; B.3 gives the concentration bounds from the last part of Section
Initialize U*_T(s) ← 0 for all s ∈ S, and u*_T(s, ℓ) ← 0 for all s ∈ S, ℓ ∈ L.
for t from T − 1 to 0, descending do
  for s ∈ S do
    for ℓ ∈ L do
Algorithm 2: Optimal policy in finite horizon
(Lemma 1, p.5) For any fixed pricing policy π : where the U^π_t(·)'s are as in (2), and the u^π_t(·, ·)'s are from the modified MDP.
For simplicity, let ∆_ℓ := U*_{t+1}(s + ℓ − 1) − U*_{t+1}(s − 1), and so for any µ ≠ µ_0, Note that, as discussed in the proof of the previous lemma, µ_0 + ∆_ℓ ≥ 0, as otherwise it would be beneficial to set π*_t(s, ℓ) ← ∞. The above inequality is then equivalent to We wish to show that, if µ ≤ µ_0, then as ℓ increases, the above inequality still holds. This would imply that the price µ_0 =: π*_t(s, ℓ) gives better return than µ for jobs of length ℓ + 1, implying that the optimal price for length ℓ + 1 must be at least π*_t(s, ℓ), which is our desired goal. Now, by Assumption 1, the left-hand side is non-decreasing in ℓ, so it remains to show that the right-hand side is non-increasing in ℓ. The only changing term is ∆_ℓ, which by Lemma 3 is non-increasing in ℓ. Since it is in the denominator of a subtracted, non-negative term, we have our desired result.
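The backward induction behind Algorithm 2 can be sketched in a few lines, using the per-length objective P[V ≥ µ](µ + ∆_ℓ) with ∆_ℓ = U*_{t+1}(s + ℓ − 1) − U*_{t+1}(s − 1) from the argument above. This is a minimal sketch under several assumptions that are not taken from the paper: the state s is the number of steps until the server is free, capped at a hypothetical `s_max`; a rejected job moves the state to max(s − 1, 0); prices come from a finite candidate list; and `length_prob` and `survival` are hypothetical inputs describing the job distribution.

```python
import math
from typing import Dict, List, Tuple


def backward_induction(
    T: int,
    s_max: int,
    length_prob: Dict[int, float],            # assumed input: P[L = l]
    survival: Dict[int, Dict[float, float]],  # assumed input: P[V >= mu | L = l]
    prices: List[float],                      # assumed finite price grid
) -> Tuple[List[List[float]], Dict[Tuple[int, int, int], float]]:
    """Return (U, policy): U[t][s] is the optimal expected future revenue,
    policy[(t, s, l)] the posted price (math.inf means the job is priced out)."""
    U = [[0.0] * (s_max + 1) for _ in range(T + 1)]  # U[T][s] = 0 for all s
    policy: Dict[Tuple[int, int, int], float] = {}
    for t in range(T - 1, -1, -1):
        for s in range(s_max + 1):
            idle = U[t + 1][max(s - 1, 0)]  # continuation value without a sale
            total = idle
            for l, p_len in length_prob.items():
                best_price, best_gain = math.inf, 0.0
                nxt = s + l - 1  # state after scheduling a job of length l
                if nxt <= s_max:
                    delta = U[t + 1][nxt] - idle
                    for mu in prices:
                        gain = survival[l][mu] * (mu + delta)
                        if gain > best_gain:
                            best_price, best_gain = mu, gain
                policy[(t, s, l)] = best_price
                total += p_len * best_gain
            U[t][s] = total
    return U, policy


if __name__ == "__main__":
    surv = {1: {1.0: 1.0, 2.0: 0.6, 3.0: 0.2}}
    U, pol = backward_induction(2, 1, {1: 1.0}, surv, [1.0, 2.0, 3.0])
    print(U[0][0], pol[(0, 0, 1)])
```

Setting the price to infinity whenever no candidate price yields positive gain mirrors the observation that µ_0 + ∆_ℓ ≥ 0 at the optimum. The triple loop over times, states and lengths, with an inner scan over prices, is a direct transcription and makes no attempt to match the running time stated in Theorem 9.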

B.3 Concentration Bounds on Revenue for Online Scheduling
(Theorem 6, p. 7) Let X be a finite sequence of jobs sampled i.i.d. from an underlying distribution Q, and let π be any monotone policy. Then, with probability in the finite horizon, and in the infinite-horizon-discounted, In particular, these results hold true for the (approximately) optimal pricing strategy computed in the previous part of the section.
Proof. For the finite horizon, we apply Azuma's inequality to the martingale R^π_t. We begin by showing the bounded-differences property. Note that we do not require truthful behaviour from the jobs, since taking strategic behaviour into account for a non-monotone policy is equivalent to modifying the distribution over the jobs, and making the distribution state-dependent, by increasing the length of those jobs that would rather buy a longer interval. Thus, where the last inequality follows from properties of conditional expectation. With this property, we can apply Azuma's inequality, and get
For the infinite-horizon-discounted case, we observe that equation (7)
However, the argument of the supremum in the left-hand term in the summand must be at most V, since if U*_{t+1}(σ + ℓ − 1) ≤ U*_{t+1}(s − 1), it is best to set π*_t(σ) = ∞, which makes p ℓ t,s = 0, putting all the weight on U*_{t+1}(s − 1). Furthermore, we have shown in Lemma 3 that U*_{t+1}(s + ℓ − 1) ≤ U*_{t+1}(s − 1). Thus, we get Inductively applying this gives U*_t(s) − Û*_t(s) ≤ 2(T − t)LVε as desired.

C Extensions
In this section, we extend the finite-horizon results to compute the optimal policies in the infinite-horizon-discounted setting, and also to argue that the optimal policy may be computed within some error when the distribution over values is continuous, rather than discrete.
These results are needed to show the full statements of Theorems 5-10.