Strategic bidding in freight transport using deep reinforcement learning

van Heeswijk, W. J. A.

doi:10.1007/s10479-022-04572-z

Strategic bidding in freight transport using deep reinforcement learning

Original Research
Open access
Published: 22 February 2022

(2022)
Cite this article

Download PDF

You have full access to this open access article

Annals of Operations Research Aims and scope Submit manuscript

Strategic bidding in freight transport using deep reinforcement learning

Download PDF

W. J. A. van Heeswijk ORCID: orcid.org/0000-0002-5413-9660¹

1684 Accesses
3 Citations
Explore all metrics

Abstract

This paper presents a multi-agent reinforcement learning algorithm to represent strategic bidding behavior by carriers and shippers in freight transport markets. We investigate whether feasible market equilibriums arise without central control or communication between agents. Observed behavior in such environments serves as a stepping stone towards self-organizing logistics systems like the Physical Internet, while also offering valuable insights for the design of contemporary transport brokerage platforms. We model an agent-based environment in which shipper and carrier actively learn bidding strategies using policy gradient methods, posing bid- and ask prices at the individual container level. Both agents aim to learn the best response given the expected behavior of the opposing agent. Inspired by financial markets, a neutral broker allocates jobs based on bid-ask spreads. Our game-theoretical analysis and numerical experiments focus on behavioral insights. To evaluate system performance, we measure adherence to Nash equilibria, fairness of reward division and utilization of transport capacity. We observe good performance both in predictable, deterministic settings ($\sim $ 95% adherence to Nash equilibria) and highly stochastic environments ($\sim $ 85% adherence). Risk-seeking behavior may increase an agent’s reward share, yet overly aggressive strategies destabilize the system. The results suggest a potential for full automation and decentralization of freight transport markets. These insights ease the design of real-world market platforms, suggesting an innate tendency of markets to reach equilibria without behavioral models, information sharing or explicit incentives.

Smart Containers with Bidding Capacity: A Policy Gradient Algorithm for Semi-cooperative Learning

The third party logistics provider freight management problem: a framework and deep reinforcement learning approach

Article 26 February 2024

Customized Dynamic Pricing for Air Cargo Network via Reinforcement Learning

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

While the transport sector traditionally relies on fixed contracts and manual negotiations, modern technology allows for automated negotiation and more appropriate responses to its inherently dynamic nature. Paradigms such as the Physical Internet and self-organizing logistics provide conceptual outlines to organize such systems, yet many facets remain unexplored. This paper investigates a strategic bidding mechanism based on multi-agent reinforcement learning, deliberately exploring a setting without communication or centralized control and presenting decentralized planning in a pure form. For present-day applications, the resulting insights are useful for market platform design, which deal with challenges such as horizontal collaboration, information-sharing infrastructures and incentive design (Cruijssen 2020; Karam et al. 2021). With the platform economy rapidly growing, finding ways to connect independent market participants becomes increasingly important (Kenney and Zysman 2016).

We study a multi-agent environment containing a carrier, shipper and broker, as envisioned in the Physical Internet (Montreuil et al. 2013). For each individual transport job (e.g., a smart container), the shipper and carrier pose bid- and ask prices respectively. The neutral broker agent—with a business model inspired by transport matching platforms and highly decentralized financial markets—matches these bids and asks at a batch level, solving a knapsack problem that maximizes bid-ask spread. Being a rule-based agent, the broker may be considered as the ‘environment’ rather than an active agent. The shipper and carrier compete against each other, actively learning strategies to maximize their own reward given the deployed strategy of the opponent. We stress that the learning algorithm is model-free; agents do not explicitly model the expected strategy of the opponent. By investigating a model-free and stochastic setting, we minimize behavioral assumptions. Both agents constantly adjust their strategy in an online learning setting, representing a self-organizing freight transport market. The behavior of such a market is the key interest of this work, keeping future extensions towards multiple carriers and shippers in mind. Inspired by game theory, efficiency and fairness are the primary performance metrics.

This paper expands upon Van Heeswijk (2019)—which introduced a learning bidding agent—in several ways. First, the carrier, which was a passive price-taker in the earlier work, now is a learning agent with a dynamic strategy. This adds a strategic dimension that vastly increases the complexity. Second, we add a (non-learning) broker agent to the setting, exploring the role of a neutral transport planner in self-organizing systems. Third, we test a variety of actor-critic models and deep learning techniques, whereas the earlier work relies on a basic policy gradient algorithm. Fourth, we root the performance metrics and numerical results in game-theoretical foundations.

The remainder of this paper is structured as follows. Section 2 discusses related literature. In Sect. 3 we define the system as a Markov Decision Process (MDP) model. Section 4 presents the solution method; we describe several variants of policy gradient algorithms to learn bidding- and asking prices. Section 5 presents the experimental design and Sect. 6 the numerical results. Section 7 ends the paper with the main conclusions.

2 Literature review

This literature overview is structured as follows. First, we discuss the core concepts of self-organizing logistics and the Physical Internet. Second, we highlight several studies on reinforcement learning in the Delivery Dispatching Problem, which relates to our setting. Third, we discuss optimal bidding and (contemporary) transport platforms. Fourth, we assess the links with game theory, which will be used for our analysis and experimental design.

The inspiration for this paper stems from the Physical Internet paradigm (Montreuil 2011; Montreuil et al. 2013). The Physical Internet envisions an open market in which logistics services are offered, with automated interactions between smart containers and other constituents determining routes and schedules. Sallez et al. (2016) emphasize the active role of smart containers, communicating, memorizing, negotiating, and learning both individually and collectively. Ambra et al. (2019) present a recent literature review of work performed in the domain of the Physical Internet. Interestingly, their overview does not mention works that define the smart container itself as an actor. Instead, existing works focus on traditional actors such as carriers, shippers and logistics service providers, even though smart containers supposedly route themselves in the Physical Internet.

The problem studied in this paper is related to the Delivery Dispatching Problem (Minkoff 1993), which entails dispatching decisions from a carrier’s perspective. Transport jobs arrive at a hub according to some external stochastic process. The carrier subsequently decides which subset of jobs to accept, anticipating future job arrivals. Basic instances may be solved with queuing models, but more complicated variants are computationally intractable; researchers often resort to reinforcement learning to learn high-quality strategies. We highlight some recent works in this domain. Klapp et al. (2018) develop an algorithm that solves the dispatching problem for a transport service operating on the real line. Van Heeswijk and La Poutré (2018) compare centralized and decentralized transport for networks with fixed line transport services, concluding that decentralized planning yields considerable computational benefits. Van Heeswijk et al. (2019a) study a variant including a routing component, using value function approximation to find strategies. Voccia et al. (2019) solve a variant that includes both pickups and deliveries. Our current paper distinguishes itself from the aforementioned works by assigning jobs based on bid-ask spreads (neutral perspective) rather than transport efficiency (carrier perspective).

Next, we highlight related works on optimal bidding in freight transport; in most of these studies competing carriers bid on transport jobs. For instance, Yan et al. (2018) propose a particle swarm optimization algorithm to place carrier bids on jobs. Miller and Nie (2020) emphasize the importance of integrating carrier competition, routing and bidding. Wang et al. (2018) design a reinforcement learning algorithm based on knowledge gradients, solving for a bidding structure with a broker intermediating between carriers and shippers. The broker aims to propose a price that satisfies both carrier and shipper, taking a percentage of accepted bids as its reward. In the context of market platforms, Zha et al. (2017) study market equilibriums, concluding that carriers and the broker benefit in times of scarce supply. Atasoy et al. (2020) address interaction between a brokerage platform and multiple carriers, finding that the broker retains profitability even after providing financial incentives to carriers. In a Physical Internet context, Qiao et al. (2019) model hubs as spot freight markets where carriers can place bids on transport bids. To this end, they propose a dynamic pricing model based on an auction mechanism. Most studies assume that shippers have limited to no influence in the bidding process; we aim to add a fresh perspective with this work. This paper builds onto the work of Van Heeswijk (2019), in which the shipper is a learning agent and the carrier a passive price taker. The author uses a policy gradient algorithm to learn bidding strategies. To the best of our knowledge, there are no studies that explicitly model both carriers and shippers as intelligent bidding agents.

We analyze our experimental results from a game theoretical perspective. Conceptually, the bid-ask problem may be classified as an infinitely repeated non-cooperative game, in which both agents aim to maximize their average reward. More specifically, it may be classified as a bargaining game such as defined by Nash (1953), in which both agents claim a share of system-wide gains. Folk theorems provide insights into equilibria in such settings (Friedman 1971). Each payoff profile that is feasible and individually rational in the one-shot game constitutes a Nash equilibrium in the repeated game. For bargaining games, the presence of a threat or disagreement point for deviating opponents is essential to prove the existence of Nash equilibria. Aumann and Shapley (1994) and Rubinstein (1980, 1994) present solutions for Nash equilibria under temporary punishments. Performance metrics and normative solutions are discussed in Sect. 5.2.

3 System description

This section formally defines our system as a Markov Decision Process model. In Sect. 3.1 we provide a high-level outline of the system and the agents involved. Section 3.2 describes the system state; Sect. 3.3 follows up with the decisions and reward functions. Finally, Sect. 3.4 provides the transition function to complete the model definition.

3.1 Model outline

The system contains three agent types: (i) shipper (S), (ii) carrier (C) and (iii) broker (B). We provide a global outline here; more detail follows in the subsequent sections.

We consider a singular transport service with a fixed origin (e.g., a transport hub). Every day, the shipper places individual bids (autonomous smart containers using a joint stochastic policy) for each job to be transported to its destination on a real line. The job is shipped if its bid is accepted. If the job reaches its due date and is still not shipped, it is removed from the system as a failed job. Second, the carrier is responsible for performing the transport service. Without knowing the bid price, it poses a daily ask price for each job. Depending on volume and distance, each job has its own marginal transport costs. When the broker assigns a job, the carrier is obliged to transport it and receives the requested ask price. Third, the freight broker is responsible for scheduling jobs. After receiving all bids and asks for the day, the broker assigns transport jobs to the carrier. Its profit is the difference between the bid- and ask price of each shipped job. This means that (i) jobs are never shipped when the ask exceeds the bid and (ii) in case transport capacity is insufficient, the broker assigns jobs in a way that maximizes its own total profit. Unlike the other agents, the broker’s behavior is rule-based rather than learned (i.e., a passive agent). Furthermore, it makes decisions at the batch level (scheduling all jobs for the day) rather than at the individual job level. We illustrate the strategic bidding problem in Fig. 1.

We consider an infinite horizon setting in which strategies are updated daily (i.e., an online learning setting). For practical and notational purposes, the horizon length is set to $N=\infty $ and the corresponding decision epochs (representing days) to $n \in \{0,1,\ldots ,N\}$. Every day n, a transport service with fixed capacity $\zeta ^{C}$ departs along the real line; all shipped jobs are delivered the same day (i.e., before the next epoch). Neither past bid- and ask prices nor job allocations impact current decisions, satisfying the Markovian memoryless property.

3.2 State description

The system state is defined as the set of transport jobs. An individual job is defined by an attribute vector ${\varvec{j}}$. In addition to the global time horizon that runs till N, each job has an individual time horizon ${\mathcal {T}}_{n,{\varvec{j}}}$ that corresponds to the number of decision epochs till due date (i.e., decreasing with time). Each job is represented by an attribute vector:

$$\begin{aligned} {\varvec{j}} =\left( \begin{array}{llll} &{}j_\tau = \#\text { epochs till due date} \\ &{}j_d = \text {distance to destination} \\ &{}j_v = \text {container volume} \end{array}\right) \end{aligned}$$

The integer attribute $j_\tau \in [0,\tau ^{max}]$ indicates how many decision epochs remain until the latest possible shipment date. Whenever a new job arrives in the system, we initialize ${\mathcal {T}}_{n,{\varvec{j}}}=\{j_\tau ,j_\tau -1,\ldots ,0\}$. The time horizon is tied to the individual job; the corresponding number of decision epochs till due date—represented by attribute $j_\tau $—is decremented with each time step. If $j_\tau =0$ and the job has not been shipped, it is removed from the system. Next, the attribute $j_d \in [1,d^{max}]$ indicates the distance between origin and destination. Finally, the job volume $j_v \in [1,\zeta ^{max}]$, with $\zeta ^{max}\le \zeta ^{C}$ represents the transport capacity required for the job. The system state at day n is represented by the set ${\varvec{J}}_{n}$, which contains all jobs ${\varvec{j}}$ present in the system. We use ${\mathcal {J}}_{n}$ to denote the set of feasible states.

3.3 Decisions and rewards

We introduce the decisions and reward functions. For each job, the shipper places a bid $b_{n,{\varvec{j}}} \in {\mathbb {R}}$ and the carrier poses an ask price $a_{n,{\varvec{j}}} \in {\mathbb {R}}$. The broker assigns jobs to the carrier, earning $b_{n,{\varvec{j}}}-a_{n,{\varvec{j}}}$ for each shipped job. As strategic behavior is the focal point of interest in this study, for each job only one bid/ask per epoch may be posed.

We first discuss the decisions and reward function of the shipper. The shipper bids according to a strategy $\pi _n^{shp}={\mathbb {P}}(b_{n,{\varvec{j}}} \mid ({\varvec{j}},{\varvec{J}}_n))$—used by all smart containers—with $\Pi ^S$ denoting the set of feasible strategies. How to obtain a strategy will be explained in Sect. 4. All bids are stored in a vector ${\varvec{b}}_n = [b_{n,{\varvec{j}}}]_{\forall {\varvec{j}} \in {\varvec{J}}_{n}}$. Remind that the payoff depends on the broker’s decision—denoted by binary variable $\gamma _{n,{\varvec{j}}}$, with 1 indicating the job will be shipped—and is unknown when posing the bid. The maximum willingness to pay for transporting job j is represented by $ c_{{\varvec{j}}}^{S,max}=c^{S,max}\cdot j_v \cdot j_d$ (i.e., depending on volume and distance); this value is used to compute the reward. As a well-functioning system is desired, we add variable penalties to express regret on failed bids, with lower bids yielding higher penalties (note that reward functions of shippers and carriers are not symmetric). At any given decision epoch, the direct reward function for individual jobs is defined as follows:

$$\begin{aligned} r_{{\varvec{j}}}^{S}(\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}})= {\left\{ \begin{array}{ll} c_{{\varvec{j}}}^{S,max}- b_{n,{\varvec{j}}} &{} \quad \text {if}\; \gamma _{n,{\varvec{j}}}=1 \\ -\max \big (0,c_{{\varvec{j}}}^{S,max}-b_{n,{\varvec{j}}}\big ) &{}\quad \text {if} \; \gamma _{n,{\varvec{j}}}=0 \\ \end{array}\right. }. \end{aligned}$$

(1)

The carrier makes its decision according to a strategy $\pi _n^{C}={\mathbb {P}}(a_{n,{\varvec{j}}} \mid ({\varvec{j}},{\varvec{J}}_n))$, with $\Pi ^C$ denoting the set of strategies. Ask prices during an episode are stored in a vector ${\varvec{a}}_n$. Each job has a marginal transport cost that depends on its distance and volume: $c_{{\varvec{j}}}^{C,trn}= c^{C,trn} \cdot j_v \cdot j_d$. Posing an ask price below the transport costs yields a loss when accepted. A penalty is incurred when (i) the ask price exceeds the shipment costs and (ii) there is sufficient idle capacity to accommodate the rejected job. If a job is assigned, the carrier’s reward is the difference between the ask price and the transport costs:

$$\begin{aligned} r_{{\varvec{j}}}^{C}(\gamma _{n,{\varvec{j}}},a_{n,{\varvec{j}}})= {\left\{ \begin{array}{ll} a_{n,{\varvec{j}}}- c_{{\varvec{j}}}^{C,trn} &{} \quad \text {if} \; \gamma _{n,{\varvec{j}}}=1 \\ -\max \big (0,a_{n,{\varvec{j}}}- c_{{\varvec{j}}}^{C,trn}\big ) &{}\quad \text {if} \; \gamma _{n,{\varvec{j}}}=0 \wedge \zeta ^{C} - \left( \sum _{{\varvec{j}} \in {\varvec{J}}_{n}} {\gamma }_{n,{\varvec{j}}} \cdot j_v \right) >0\\ 0 &{}\quad \text {otherwise} \\ \end{array}\right. }. \end{aligned}$$

(2)

For the broker, a job’s value is its bid price ($b_{n,{\varvec{j}}}$) minus its ask price ($a_{n,{\varvec{j}}}$). Using these values and the available capacity, the broker solves a 0–1 knapsack problem for the entire batch of jobs, using dynamic programming (Kellerer et al. 2004). The broker maximizes its direct rewards by selecting $\varvec{\gamma }_n$ as follows:

$$\begin{aligned} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\varvec{\gamma _n}\in \Gamma ({\varvec{J}}_{n})} \left( \sum _{{\varvec{j}} \in {\varvec{J}}_{n}} {\gamma }_{n,{\varvec{j}}}(b_{n,{\varvec{j}}} - a_{n,{\varvec{j}}})\right) , \end{aligned}$$

(3)

s.t.

$$\begin{aligned} \sum _{{\varvec{j}} \in {\varvec{J}}_{n}} {\gamma }_{n,{\varvec{j}}} \cdot j_v \le \zeta ^{C}. \end{aligned}$$

The corresponding reward function for the broker is:

$$\begin{aligned} r_{{\varvec{j}}}^{B}(\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}},a_{n,{\varvec{j}}})= {\left\{ \begin{array}{ll} b_{n,{\varvec{j}}}-a_{n,{\varvec{j}}} &{}\quad \text {if} \; \gamma _{n,{\varvec{j}}}=1 \\ 0 &{}\quad \text {otherwise} \end{array}\right. }. \end{aligned}$$

(4)

Note that the broker never selects jobs with a higher ask than bid; this would yield a negative payoff. We formalize this minor result in Lemma 1, serving as a building block for later proofs.

Lemma 1

(Job selection when bid price is lower than ask price) If $b_{n,{\varvec{j}}}<a_{n,{\varvec{j}}}$, the broker will always set $\varvec{\gamma }_{n,{\varvec{j}}}=0$ to maximize its profits.

Proof

The proof is found in “Appendix A”. The broker always rejects jobs that would yield a negative payoff, since rejecting yields a payoff of 0. $\square $

3.4 Transition function

To conclude the system definition, we describe the transition function $X :({\varvec{J}}_{n},\varvec{\omega }_{n+1},\varvec{\gamma }_n) \mapsto {\varvec{J}}_{n+1}$—formally outlined in Algorithm 1—for the system state that occurs in the time step from decision epoch n to $n+1$. Three state changes occur during a time step; (i) number of decision epochs till due dates are decreased for all unshipped jobs, (ii) failed and shipped jobs are removed and (iii) newly arrived jobs (denoted by set $\varvec{\omega }_{n+1} \in \varvec{\Omega }$) are added.

Algorithm 1

Transition function $X({\varvec{J}}_{n},\omega _{n+1},\varvec{\gamma }_{n})$

0:	Input: ${\varvec{J}}_{n},\varvec{\omega }_{n+1},\varvec{\gamma }_{n}$	$\blacktriangleright $ Current state, job arrivals, shipping selection
1:		$\blacktriangleright $ Initialize next state
2:		$\blacktriangleright $ Copy state
3:	$\forall {\varvec{j}} \in {\varvec{J}}_{n}$	$\blacktriangleright $ Loop over all jobs
4:		$\blacktriangleright $ Remove shipped job
5:		$\blacktriangleright $ Remove unshipped job at due date
6:		$\blacktriangleright $ Decrement number of epochs till due date
7:		$\blacktriangleright $ Merge existing and new job sets
8:	Output: ${\varvec{J}}_{n+1}$	$\blacktriangleright $ New state

3.5 Policies and game-theoretical properties

From a game-theoretical perspective, the system has some interesting properties that are worth exploring before turning to solutions. In infinitely repeated games without discounting (corresponding to our infinite horizon problem), a common objective is to maximize average profits (Friedman 1971). Following the limit of means theorem, the optimal strategy (for the shipper, carrier is near-equivalent) looks as follows:

$$\begin{aligned} \pi ^{S,*}=\lim _{n \mapsto \infty } \frac{1}{n} \sum _{n=0}^N \sum _{{\varvec{j}} \in {\varvec{J}}_n} \left( \sum _{n^{\prime }=n}^{n+|{\mathcal {T}}_{{\varvec{j}}}|} \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{b_{n^{\prime },{\varvec{j}}} \in {\mathbb {R}}} {\mathbb {E}}(r_{{\varvec{j}}}(\gamma _{n^{\prime },{\varvec{j}}},b_{n^{\prime },{\varvec{j}}}))\right) \end{aligned}$$

(5)

The problem may be classified as a non-cooperative bargaining game (Nash 1953), in which the difference between transport costs and maximum willingness to pay is a surplus to be divided between shipper and carrier (remind the broker does not actively influence the game). Both agents independently ‘claim’ a proportion of the system-wide gain. The so-called feasibility set contains all solutions in which both agents achieve a nonnegative payoff. If no agreement is reached (i.e., the cumulative claim exceeds the system-wide gain or agents only earn negative payoffs), a disagreement point should exist. That is, each agent can execute a credible threat when the opponent deviates from the set of feasible solutions, as such capping its payoff. As shown in Lemma 2, each agent may cap the opponent’s payoff at 0. We prove this result for the carrier; a similar proof can be constructed for the shipper.

Lemma 2

(Existence of disagreement point) For the carrier, there exists a strategy $\pi ^C$ that ensures the shipper’s payoff $r_{{\varvec{j}}}^S$ equals at most 0 for any job ${\varvec{j}}$, regardless of the opponent’s strategy $\pi ^S$.

Proof

The full proof is found in “Appendix A”. If the carrier asks a price higher than the shipper’s maximum willingness to pay, the shipper must either bid higher than that (resulting in a negative payoff when accepted) or bid below the ask and forfeit the agreement (payoff of 0). $\square $

Following Nash (1953), uncertainty vanishes in the limit and the game converges to an equilibrium. In such an equilibrium, agents cannot improve their payoff by unilaterally changing their strategy. The folk theorem (Friedman 1971) states that an equilibrium payoff profile should satisfy two properties. First, the payoff should be a convex combination of payoffs of the stage game, e.g., a weighted average as defined in Eq. (5). Second, the equilibrium payoff must be individually rational, paying at least as much as the deviation point. Lemma 3 shows that the latter condition holds and formalizes the Nash equilibrium.

Lemma 3

(Existence of Nash equilibrium) Any payoff profile satisfying $r_{{\varvec{j}}}^C+r_{{\varvec{j}}}^S \equiv c_{{\varvec{j}}}^{S,max}-c_{{\varvec{j}}}^{C,trn}$ is a Nash equilibrium. This payoff profile is achieved by any pair of strategies satisfying $c_{{\varvec{j}}}^{C,trn}\le a_{n,{\varvec{j}}} = b_{n,{\varvec{j}}} \le c_{{\varvec{j}}}^{S,max}$.

Proof

The full proof is found in “Appendix A”. Intuitively, when $a_{n,{\varvec{j}}} < b_{n,{\varvec{j}}}$, an agent could unilaterally improve its payoff without triggering the disagreement point. All profiles not satisfying $c_{{\varvec{j}}}^{C,trn}\le a_{n,{\varvec{j}}} = b_{n,{\varvec{j}}} \le c_{{\varvec{j}}}^{S,max}$ are outside the feasible set and invoke the disagreement point. $\square $

The key result of this section is that any payoff profile satisfying $c_{{\varvec{j}}}^{C,trn} \le a_{n,{\varvec{j}}} = b_{n,{\varvec{j}}} \le c_{{\varvec{j}}}^{S,max}$ is a Nash equilibrium. Although the bargaining game is non-cooperative, several normative solutions exist that incorporate utility symmetry and Pareto efficiency as fairness criteria. The Nash bargaining solution (not be confounded with the equilibrium) describes the product of utilities as optimal (Nash 1950), the Kalai–Smorodinsky solution maximizes utility ratios (Kalai and Smorodinsky 1975), and Kalai’s solution maximizes the minimum surplus utility (Kalai 1977). In Sect. 5 we use the game-theoretical foundation to define appropriate measures for the experimental design.

4 Solution method

This section explains the solution method, which is based on deep reinforcement learning techniques. Finding optimal strategies to solve Eq. (5) is hard. The expectation depends on both the (unknown and potentially mixed) strategy of the opposing agent and the stochastic realization of new jobs; the optimal strategy today might fail tomorrow. Furthermore, bid and ask prices are generated at the individual job level, whereas the broker allocates jobs at the batch level. Finally, continuous action spaces by definition contain infinitely many actions. Thus, we use reinforcement learning to learn approximate strategies. Both shipper and carrier actively learn their pricing strategies based on observations. As changing the strategy of one agent influences the other, we are dealing with a highly non-stationary system.

In Sect. 4.1 we present a stochastic policy gradient algorithm, in which the strategy is constantly adjusted to maximize expected rewards given the expected value. Section 4.2 presents several extensions to the base algorithm, including value function approximations (actor-critic models). Finally, Sect. 4.3 describes the update procedure.

4.1 Policy gradient learning

The reinforcement learning algorithms used in this paper are policy gradient algorithms, operating directly on the strategy. The strategy itself is stochastic, meaning the bid/ask is drawn from a distribution. This makes the strategy harder to counter than deterministic or rule-based strategies. In addition, stochastic policies tend to work better in uncertain environments; the opponent’s strategy is never known with certainty. Policy gradient methods can be expanded by adding a value function approximation to estimate downstream rewards related to current actions—in that case we speak of actor-critic algorithms. We first discuss the vanilla policy gradient algorithm REINFORCE (Williams 1992) and extend to actor-critic models in Sect. 4.2. For readability, we only present notation for the shipper; carrier variants are near-identical.

Learning takes place in episodes $m \in \{0,1,\ldots ,M\}$, with each episode containing a finite number of decision epochs n. Every episode m yields a batch of job observations with corresponding rewards for shipper and carrier, used to update the policies. Rewards are weighted equally and thus not discounted.

In policy gradient reinforcement learning, actions are applied directly on the state. The actions are determined by a stochastic strategy:

$$\begin{aligned} \pi _{\varvec{\theta }^S}^{S,m}\big (b_{n,{\varvec{j}}}^m \mid {{\varvec{j}}},{\varvec{J}}_{n}\big )={\mathbb {P}}^{\varvec{\theta }^S}\big (b_{n,{\varvec{j}}}^m \mid {\varvec{j}},{\varvec{J}}_{n}\big ), \end{aligned}$$

(6)

where $\varvec{\theta }^{S}$ is the parametrization of the strategy. The probability distributions representing the strategies are Gaussian distributions; bids (asks) are drawn independently for each container:

$$\begin{aligned} b_{n,{\varvec{j}}}^m \sim {\mathcal {N}}(\mu _{\varvec{\theta }^S}({\varvec{j}},{\varvec{J}}_{n}), \sigma _{\varvec{\theta }^S}), \forall {\varvec{j}} \in {\varvec{J}}_{n}. \end{aligned}$$

(7)

The parameterized standard deviation $\sigma _{\varvec{\theta }^{S}}$ determines the level of exploration while learning, but is also an integral part of the strategy itself. Standard deviations typically decrease to small values once appropriate means are identified, but may also be larger when retaining some fluctuation is beneficial. For instance, strategies embedding randomness are more difficult to counter.

The stored action-reward trajectories during each episode m indicate which actions resulted in good rewards. We compute gradients which ensure strategy updates in that direction. Only completed jobs (i.e., shipped or removed) are used to update the strategy, such that we capture full reward trajectories. We might also learn from single observations (i.e., uncompleted jobs), yet full reward trajectories are unbiased and the trajectories are fairly short in this problem setting. Completed jobs are stored in a set $\hat{{\varvec{J}}}^m$. For each episode the cumulative rewards per job—shown for the shipper here, with a slight abuse of notation—are defined as follows:

$$\begin{aligned} {\hat{v}}_{n,{\varvec{j}}}^{S,m}(\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}})= {\left\{ \begin{array}{ll} r_{n,{\varvec{j}}}\big (\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}}^m\big )^{S,m} &{}\quad \text {if} \; j_\tau =0 \\ r_{n,{\varvec{j}}}\big (\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}}^m\big )^{S,m}+{\hat{v}}_{n+1,{\varvec{j}}}^{S,m} &{}\quad \text {if} \; j_\tau >0 \end{array}\right. } \quad , \forall j_\tau \in {\mathcal {T}}_{{\varvec{j}}}. \end{aligned}$$

The cumulative rewards observed at time n in episode m are stored in vectors $\hat{{\varvec{v}}}_{n}^{S,m}=\bigl [[{\hat{v}}_{n,{\varvec{j}}}^{S,m}]_{j_\tau \in {\mathcal {T}}_{{\varvec{j}}}}\bigr ]_{\forall {\varvec{j}} \in {\varvec{J}}_n}$ and $\hat{{\varvec{v}}}_{n}^{C,m}=\bigl [[{\hat{v}}_{n,{\varvec{j}}}^{C,m}]_{j_\tau \in {\mathcal {T}}_{{\varvec{j}}}}\bigr ]_{\forall {\varvec{j}} \in {\varvec{J}}_n}$. At the end of each episode, we can construct the information vector:

$$\begin{aligned} {\varvec{I}}^{S,m} =\biggl [[{\varvec{J}}_{n}^m, {\varvec{b}}_{n}^m, \hat{{\varvec{v}}}_{n}^{S,m}, \varvec{\gamma }_{n}^m]_{\forall n \in \{0,\ldots ,N\}}, \hat{{\varvec{J}}}^m\biggr ] . \end{aligned}$$

The information vector contains the states, actions and rewards required for the strategy updates. For this purpose we utilize the policy gradient theorem; see Sutton and Barto (2018) for a detailed description. We present the theorem for the shipper:

$$\begin{aligned} \nabla _{\varvec{\theta }^S} v_{n,{\varvec{j}}}^{\pi _{\varvec{\theta }^S}} \propto \sum _{{n}=1}^{N} \left( \int _{{\varvec{J}}_{n} \in {\mathcal {J}}_{n}} {\mathbb {P}}^{\pi _{\varvec{\theta }}^S}({\varvec{J}}_{n} \mid {\varvec{J}}_{{n}-1}) \int _{b_{n,{\varvec{j}}}^m \in {\mathbb {R}}} \nabla _{\varvec{\theta }^S}{\pi _{\varvec{\theta }}^S}\big (b_{n,{\varvec{j}}}^m \mid {\varvec{j}}, {\varvec{J}}_{n}\big )v_{n,{\varvec{j}}}^{\pi _{\varvec{\theta }}^S}\big (\gamma _{n,{\varvec{j}}},b_{n,{\varvec{j}}}^m\big )\right) . \end{aligned}$$

We proceed to apply the policy gradient theorem to our system, adopting a Gaussian decision-making strategy, using a neural network (actor network) to output the distribution parameters. The carrier and shipper both deploy a neural network, which is utilized at the individual container level (i.e., a jointly used policy). Let $\varvec{\theta }^S$ define the set of weight parameters describing the decision-making strategy $\pi _{\varvec{\theta }^S} :({\varvec{j}},{\varvec{J}}_{n}) \mapsto b_{n,{\varvec{j}}}^m$. Furthermore, let $\varvec{\phi }({\varvec{j}},{\varvec{J}}_{n})$ be a feature vector distilling the most salient state attributes, for instance the average number of epochs till due date or the number of jobs waiting (note that information is shared between containers). The features used for our study are described in Sect. 5.4. For the actor network, the feature vector $\varvec{\phi }$ is the input, $\varvec{\theta }^S$ represents the network weights, and the mean bid (ask) $\mu _{\varvec{\theta }^S}$ and standard deviation $\sigma _{\varvec{\theta }^S}$ are the output. We formalize the strategy as $\pi _{\varvec{\theta }^S}=\frac{1}{\sqrt{2\pi }\sigma _{\varvec{\theta }}}e^{-\frac{\left( b_{n,{\varvec{j}}}^m-\mu _{\varvec{\theta }^S}\left( {\varvec{j}},{\varvec{J}}_{n}\right) \right) ^2}{2\sigma _{\varvec{\theta }^S}^2}}$, with $b_{n,{\varvec{j}}}^m$ being the bid price, $\mu _{\varvec{\theta }^S}({\varvec{j}},{\varvec{J}}_{n})$ the Gaussian mean and $\sigma _{\varvec{\theta }^S}$ the parametrized standard deviation.^{Footnote 1} The corresponding action $b_{n,{\varvec{j}}}^m$ is acquired from the inverse normal distribution. Parameter updates take place after each episode, utilizing a function $ U({\varvec{\theta }}^{S,m},{\varvec{I}}^{S,m})$ that is detailed in Sect. 4.3.

The core concept behind the policy gradient algorithm is that the strategy converges to a price distribution appropriate for the state. Actions with high rewards and low probabilities yield the strongest update signals. The algorithmic procedure to update the parametrized strategy is formalized in Algorithm 2. To summarize the procedure: we perform M training episodes containing N decision epochs each, with new jobs arriving stochastically each epoch. For every individual job, bids and asks are generated at each epoch by the actor networks. Based on bid-ask pairs, the broker allocates jobs by solving a knapsack problem. Reward trajectories per job are stored; if a job is shipped or reaches its due date, it is added to the set of completed jobs. After each episode m, the reward trajectories of all completed job are used to update the actor networks for shipper and carrier respectively.

Algorithm 2

Outline of the policy gradient algorithm (based on Williams 1992)

0:	Input: $\pi _{\varvec{\theta }^{S}}^0,\pi _{\varvec{\theta }^{C}}^0$	$\blacktriangleright $ Differentiable parametrized strategies
1:		$\blacktriangleright $ Initialize parameters
2:	$\forall m \in \{0,\ldots ,M\}$	$\blacktriangleright $ Loop over episodes
3:		$\blacktriangleright $ Initialize completed job set
4:		$\blacktriangleright $ Generate initial state
5:	$\forall n \in \{0,\ldots ,N\}$	$\blacktriangleright $ Loop over finite time horizon
6:		$\blacktriangleright $ Bid placement jobs
7:		$\blacktriangleright $ Ask placement jobs
8:		$\blacktriangleright $ Job allocation broker, Eq. (3)
	$\left( \sum _{{\varvec{j}} \in {\varvec{J}}_{t^\prime }} {\gamma }_{n,{\varvec{j}}}^m(b_{n,{\varvec{j}}}^m - a_{n,{\varvec{j}}}^m)\right) $
9a:		$\blacktriangleright $ Compute cumulative rewards shipper
9b:		$\blacktriangleright $ Compute cumulative rewards carrier
10:	$\forall {\varvec{j}} \in {\varvec{J}}_{n} \mid j_{\tau }=0 \vee {\gamma }_{n,{\varvec{j}}}^m=1 $	$\blacktriangleright $ Loop over completed jobs
11:		$\blacktriangleright $ Update set of completed jobs
12:		$\blacktriangleright $ Generate job arrivals

13:		$\blacktriangleright $ Transition function, Algorithm 1
14a:		$\blacktriangleright $ Store information shipper
14b:		$\blacktriangleright $ Store information carrier
15a:		$\blacktriangleright $ Update actor network shipper
15b:		$\blacktriangleright $ Update actor network carrier
16:	Output: $\pi _{\varvec{\theta }^S}^M,\pi _{\varvec{\theta }^C}^M$	$\blacktriangleright $ Return tuned strategies

4.2 Policy gradient extensions

Section 4.1 presented the basic policy gradient algorithm. We now introduce four extensions, namely (i) policy gradient with baseline, (ii) actor-critic with Q-value, (iii) temporal difference learning—more specifically TD(1)—and (iv) actor-critic with advantage function. The algorithms are summarized in Table 1; for detailed descriptions we refer to Sutton and Barto (2018).

Table 1 Algorithmic variants of the policy gradient algorithm

Strategic bidding in freight transport using deep reinforcement learning

Abstract

Similar content being viewed by others

Smart Containers with Bidding Capacity: A Policy Gradient Algorithm for Semi-cooperative Learning

The third party logistics provider freight management problem: a framework and deep reinforcement learning approach

Customized Dynamic Pricing for Air Cargo Network via Reinforcement Learning

1 Introduction

2 Literature review

3 System description

3.1 Model outline

3.2 State description