1 Introduction

Mobility-as-a-service (MaaS) is a new type of mobility service that integrates various transport modes through a single platform by providing real-time information, journey planning, and booking across various transport operators. Passengers can enjoy the combined transport services arranged by the MaaS scheduler across multiple transport operators. MaaS has the potential to reduce traffic congestion, air pollution, noise nuisance, energy consumption, transport-related social exclusion, and household transport expenditure, as well as to improve traffic safety, health, social cohesion, and accessibility [1]. To implement MaaS in intelligent transportation systems, several essential components are required: (1) a mobile application and a web-based interface that enable passengers to plan and confirm their journeys, (2) a software system that connects operators providing mobility services, and (3) an intelligent scheduler that handles passenger journey queries and optimizes transport resources and utility, such as the scheduler in [2]. The quality of these components significantly impacts the successful transformation of how people travel in future intelligent transportation systems.

Among the components of MaaS, the intelligent scheduler is the only component that determines the journeys offered to passengers, and thus it directly influences the effectiveness of the transportation system and passenger utility. Designing such a scheduler is challenging, and a poorly designed scheduler can diminish the utility of both transport operators and passengers. Therefore, the community has put considerable effort into scheduler design. The problem of determining the best route from an origin to a destination can be viewed as a shortest path problem [3, 4]. Many algorithms, such as Dijkstra’s algorithm [5] and the A* search algorithm [6], have been proposed to determine the shortest path between two points in a transportation network. With more practical considerations such as time windows, the number of transfers, and parking constraints in the transport network, various models and algorithms have been proposed. In [7], the authors use a choice model to offer passengers an optimized menu of travel options complying with seat capacity and committed time schedule constraints, which maximizes the profit of the operator. In [8], a cost allocation problem is formulated to minimize passenger dissatisfaction in the system. All of the aforementioned models are valid designs, but they target different objectives, including but not limited to journey time, dissatisfaction, and profit.

Different journeys attract different potential passengers, and their willingness to pay is not the same [9]. Especially for multimodal transportation systems such as MaaS, the weighting of each objective may have a significant effect on the choice of transport services. Therefore, a multi-objective formulation that minimizes the overall cost, time, and user discommodity in multimodal transportation networks, as in [10], would be appreciated by passengers. In particular, it presented a utility measure that takes different passengers’ propensities into account. However, we argue that a static utility weight vector is still not favored by all transport operators and passengers. Moreover, the utilities of transport operators and passengers can be in conflict with each other. From the transport operators’ perspective, profit is the primary consideration rather than the passenger’s utility. They take passengers’ utility into account only because the retention rate is related to passenger satisfaction and thus affects profit. In other words, passengers may not remain in the system if transport operators provide only high-profit transport services without considering passenger satisfaction. Hence, passenger behavior is better modeled by a dynamic function (e.g., a Markov chain or differential equation) than by a static utility, since the transport experience influences the passenger’s decision on their next transport service.

A Markov decision process (MDP) is a discrete-time stochastic control process for modeling outcomes that are partly random and partly under the control of a decision-maker [11]. Many transport operations can be modeled as MDPs. For example, the optimal routing of a taxi searching for a new passenger was modeled as an MDP to account for long-term profit over the working period [12]. The transfer of travelers between integrated transportation modes was also modeled as an MDP to explore the optimal transfer trip chain of different income groups [13]. The purchasing behavior of air-ticket passengers was taken into account to model the dynamic pricing process as an MDP in [14]. When waiting for a taxi, the passenger may be modeled as an MDP that decides whether to keep waiting at the current location or move to a nearby road segment [15]. To solve an MDP, dynamic programming [16] is a common method that determines the optimal solution of the large problem from the values of its sub-problems, and it establishes the foundation of later advanced methods such as reinforcement learning (RL). RL is a learning-based method that performs actions based on given observations of the system without the need for a mathematical model. The recent success of deep RL (DRL), which combines reinforcement learning and deep learning, makes it an excellent approach for solving complex problems. Therefore, DRL is a promising method for addressing the complex multi-objective journey planning problem in multimodal transport while considering dynamic passenger behavior, satisfying both transport operators’ and passengers’ utilities.

To the best of our knowledge, in the context of MaaS, no prior research has considered passenger satisfaction behavior as an MDP for multi-objective journey planning problems that satisfy the utilities of both transport operators and passengers. One potential way of analyzing personalized behavior preferences is to assume there is a hidden customer agent with its own personal reward function and policy, in which case inverse RL is a reasonable way of inferring them. In our case, we assume there is only a hidden MDP common to all agents’ behaviors, and the MaaS central controller infers it to benefit its own central reward function and optimize its policy. In this paper, we aim to address this research gap by leveraging DRL to obtain the utility weight vectors for both transport operators and passengers. These weight vectors are unique to each passenger, allowing the transport operator to identify their preference of transport service. A multi-objective journey planning problem is formulated with a utility weight vector covering travel time, comfort, price, and operating cost. Different plans can be generated from the problem with different weight vectors. The system obtains optimal utility weight vectors that enhance the passenger satisfaction level and the transport operator profit driven by a high retention rate. A proportional fairness-based variant is presented to balance the profit received by each transport operator. Detailed experiments are conducted to evaluate the effectiveness of the proposed approach. The experimental results show that our approach is effective and efficient in enhancing transport operator profit and passenger satisfaction. The main contributions of this paper can be summarized as follows:

  1. We propose a novel approach that considers passenger experience as a Markov model, where prior experiences have a transient effect on future long-term satisfaction and retention rate. As such, we have formulated a multi-objective journey planning problem with individual passenger preferences, experiences, and memories. This is novel compared to most multimodal transport models that consider passenger experience as a non-time-varying utility function.

  2. As such, we are motivated to design a DRL-based approach to determine optimal utility weights by learning from the passenger’s past traveling experience and the transport operator profit. This is a step up from heuristic optimization of non-time-varying utility functions, as it can balance short-term gains with long-term success across time-varying personalized passenger experiences.

  3. A variant of the DRL-based approach is designed to balance the profit received by each transport operator based on a proportional fairness (PF) reward scheme. This is a novel solution for balancing competing agendas between operators and addressing those that are naturally disadvantaged.

  4. We conduct experiments using both real-world and synthetic datasets, and the results show that our proposed approach enhances passenger satisfaction and transport operator profit while maintaining PF of profit across transport operators. These results demonstrate the effectiveness of our approach in addressing the complex multi-objective journey planning problem in MaaS.

The rest of this paper is organized as follows. Section 2 reviews the related work of MaaS, transportation factors, MDP problems, and the artificial intelligence approach in transportation. Section 3 illustrates the system model including the problem formulation, the multi-objective journey planning problem, and the MDP model of the passenger. The DRL algorithm is presented in Sect. 4. Experiments and results are presented in Sect. 5. Finally, Sect. 6 concludes this paper.

2 Related work

2.1 MaaS planner

MaaS is a rapidly growing and innovative mobility concept combining different kinds of transport services. Numerous studies have explored different characteristics of MaaS, including the MaaS ecosystem [17, 18], bundling services [19, 20], suppliers [21, 22], pilot projects [23, 24], and simulations [25, 26]. Since MaaS is composed of multiple transport operators, providing a high-quality route over the multimodal transport network is crucial in MaaS research. To determine a route with minimum travel time from an origin to a destination, a dynamic shortest path problem was formulated in [27] using both historical and real-time information. The problem is solved by a hybrid approximate dynamic programming algorithm with a clustering approach that combines value function approximation and a deterministic lookahead policy. In [28], the authors developed a routing strategy in schedule-based transit networks with stochastic vehicle arrival times using an online shortest path algorithm. In [29], the authors studied system congestion effects on the route choice model, where link capacities are a function of flow rather than link cost. To capture the structural effects that flows have on capacities and the resulting impacts on route choice utilities, the authors proposed a method to obtain unique congestible capacity shadow prices in a multimodal network and verified its ability to capture congestion effects on capacities. In [30], the authors formulated a passenger-centric vehicle routing problem to maximize the quality of service in terms of waiting and riding time. In [31], a bus transit system planner was presented to optimize the passengers’ experience by limiting the probability of collisions among passengers. However, these studies focus on improving the service by a single criterion, while the issue of accommodating passengers with different preferences and longitudinal, time-varying experiences and memories remains unresolved. Therefore, a problem formulation that treats successive passenger traveling experiences as a series of related states and actions could better model passenger behavior.

2.2 Studies of transportation factors

Several studies have examined the relationships between various factors in transportation systems. For example, Molina et al. [32] investigated the connections between loyalty and passengers’ profiles, experience, and values for taxi services and private-hire driver companies. The study suggested that passenger characteristics influence the choice of transportation mode, and that younger passengers tend to be less loyal, more price-sensitive, and more concerned with sustainability. In [33], the authors analyzed the public acceptance of MaaS by investigating the intention to subscribe to MaaS and the willingness to pay for extra features of the service. The study found that service attribute characteristics such as price, as well as social influence variables, have a significant effect on the subscription intention, and that the transportation modes included in the bundle are related to socio-demographic profiles and individual transportation-related characteristics. Another study [34] on the effect of individual passenger preferences supports the hypothesis of heterogeneous passenger preferences in MaaS. It reports significant heterogeneity in preferences, which results in different MaaS package preferences and individual characteristics in the latent class choice model. The authors also suggested that age, gender, income, education, and current travel behavior are important factors influencing an individual’s propensity to purchase MaaS packages. In [35], the authors analyzed satisfaction surveys and found that higher passenger satisfaction leads to an increase in public transport demand. From an empirical analysis of passenger satisfaction data, reference [36] concluded that passenger perceived quality of service, passenger expectations, passenger perceived value, and passenger loyalty all have a significant correlation with passenger satisfaction. The authors in [37] identified five factors affecting passenger satisfaction and willingness to pay: emotional, economic, social, service, and functional values. All of these studies indicate the relationship between passenger behavior and preference, as well as the causality between retention rate and satisfaction level, which can be modeled as an MDP.

2.3 MDP in transportation

MDPs are widely used to model stochastic processes, especially transport systems that are full of uncertainty. Lautenbacher et al. [38] solved the single-leg airline revenue management problem by formulating it as a discrete-time MDP. Hong et al. [39] applied the MDP formulation to ordering and delivery problems with different transportation modes, costs, and inventory issues. In [40], the authors modeled the process of a passenger seeking a taxi as an MDP that finds the best move for a vacant taxi to maximize total revenue. To understand driver behavior and driving decisions, an MDP was used to analyze basic safety message data from vehicles [41], which revealed that drivers prefer to accelerate in order to escape the crowdedness around them. For motion planning of connected and automated vehicles, an MDP can be used to incorporate network-level data and make decisions on platoon membership, lane changing, and route choice [42]. The traffic light control problem can also be modeled as an MDP to enhance the junction flow rate. Khamis et al. presented a multi-agent-based multi-objective traffic signal control that simulates driver behavior continuously in space and time based on a Bayesian interpretation of probability [43] and RL [44].

2.4 Artificial intelligence in transportation

The success of artificial intelligence and deep learning in recent years has accelerated research in various fields including transportation [45]. For example, deep neural networks can be used to accurately predict short-term traffic flow [46, 47]. Besides traffic flow, travel demand [48] and origin–destination pairs [49] can also be predicted by a deep learning model, called the Multi-Scale Convolutional LSTM Network, which considers temporal and spatial correlations and high-level prediction results of historical traffic data. Lane detection is another example that uses deep neural networks [50]. Among various artificial intelligence technologies, DRL is a sub-field that focuses on complex control problems [51], especially those with MDP characteristics. For example, the capacitated vehicle routing problem was solved by an RL-based method in [52], which outperformed existing heuristics and optimization tools on medium-sized instances. In [53], routing strategies determined by a trained DRL agent were treated as the initial solution for a local search method to refine the solution quality. In autonomous driving, an RL-based strategy used shared information to improve travel efficiency, fuel consumption, and safety at traffic intersections [54]. RL is also helpful in situations where multiple autonomous vehicles coordinate driving maneuvers with each other [55]. In the traffic signal control problem, DRL can take images from road surveillance cameras as input to efficiently control the signal duration in an end-to-end fashion [56]. Using multi-agent DRL, the problem size can be extended to city scale [57]. For the lane change problem, although rule-based models may perform well in known scenarios, control in an unforeseen scenario is prone to failure; an RL-based controller can perform lane changes under complex and unforeseen scenarios [58, 59]. By considering the planning decision as the action in DRL, multi-modal journey planning can also be solved by DRL [60, 61]. Although these works have not incorporated passenger behavior, they serve as a great pilot study for solving a multi-modal journey planning problem with DRL. All of the above literature indicates that DRL has excellent potential to deal with highly dynamic transportation systems filled with uncertainty.

2.5 Research gap

From the existing works, we learn that multimodal transport, including MaaS, is an actively studied mobility concept for future intelligent transportation systems. However, current journey planning optimization methods treat passengers with the same origin and destination as identical and model each trip as a non-time-varying, single-experience problem, which cannot capture passenger preferences, experiences, and memories over many journeys over time. Given this property, an MDP appears to be an ideal model for the aforementioned heterogeneous and dynamic process. Among the state-of-the-art technologies, deep reinforcement learning is widely used to solve MDP problems and could be integrated into journey planning optimization methods to address these challenges.

3 System model

In this section, we first introduce the model and problem formulation of multimodal transport and then discuss the MDP behavior of the passenger and its effect on the model. The notations used in this section are summarized in Table 1.

Table 1 Notation summary

3.1 Problem formulation

MaaS consists of multiple transport operators, and the transport operators may operate different transport services for the same route. Indirect routes or detours may be taken if the journey fits the passenger’s expectation in terms of utility. For instance, passengers may purchase transit flight tickets to obtain a cheaper price, even if this results in a longer flight duration. The transport network can be modeled by a directed graph \(G({\mathcal {N}}, {\mathcal {A}})\), where \({\mathcal {N}}\) and \({\mathcal {A}}\) are the sets of nodes and links in the network, respectively. For example, a node represents the starting or ending point of a mobility service, while a link represents the journey of a mobility service. Let \({\mathcal {F}}\) be the set of transport operators; each operator \(f \in {\mathcal {F}}\) manages its own sub-network \({\mathcal {A}}_f \subseteq {\mathcal {A}}\). Each link (i, j) that provides transport service from i to j is associated with a time cost \(\beta ^f_{ij}\), discomfort cost \(\delta ^f_{ij}\), price \(\rho ^f_{ij}\), and operating cost \(\mu ^f_{ij}\), as managed by operator f. When a passenger requests transport service from origin O(k) to destination D(k), an intelligent scheduler determines a planned journey that satisfies the transport requirements. The journey may include services from different transport operators f offered to the passenger.

We first define two decision variables \(x_{ij}^k\) and \(y_{ij}^f\) to facilitate the formulation of the problem. Binary variables \(x_{ij}^k\) are used to indicate which transport service will be offered to the passenger:

$$\begin{aligned} x^{k}_{ij}= {\left\{ \begin{array}{ll} 1 &{} \text {if link } (i, j) \text { is offered to passenger } k, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

Binary variables \(y^{f}_{ij}\) are defined for the transport link operation:

$$\begin{aligned} y^{f}_{ij}= {\left\{ \begin{array}{ll} 1 &{} \text {if link } (i, j) \text { is operated by } f, \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(2)

Let \(U_{ij}^k\) and \(U_{ij}^f\) be the utility vectors of passengers and transport operators, respectively. The objective function is to minimize the total utility cost of passengers and transport operators as

$$\begin{aligned} \sum _{(i,j) \in {\mathcal {A}}_f, k \in {\mathcal {K}}} U_{ij}^k x^{k}_{ij} + \sum _{(i,j) \in {\mathcal {A}}_f, f \in {\mathcal {F}}} U_{ij}^f y^{f}_{ij}. \end{aligned}$$
(3)

In this paper, we consider four utility costs formulated as

$$\begin{aligned} U_{ij}^k&= [w_\beta ^k \beta _{ij}^f; w_\delta ^k \delta _{ij}^f; w_\rho ^k \rho _{ij}^f], \end{aligned}$$
(4)
$$\begin{aligned} U_{ij}^f&= [w_\mu ^f \mu _{ij}^f], \end{aligned}$$
(5)

where \(\beta _{ij}^f\), \(\delta _{ij}^f\), \(\rho _{ij}^f\), and \(\mu _{ij}^f\) are the travel time, discomfort, price, and operating cost of link \((i, j) \in {\mathcal {A}}_f\), respectively. \(W^k = [w_\beta ^k; w_\delta ^k; w_\rho ^k; w_\mu ^f]\) is the weight vector of the corresponding utility cost terms. Without loss of generality, more utility costs such as carbon emissions can be included; this paper considers only four utility costs for simplicity.

Let \({\mathcal {N}}^+(i)\) and \({\mathcal {N}}^-(i)\) be the sets of incoming and outgoing locations of i, respectively, i.e., \({\mathcal {N}}^+(i) = \{j \in {\mathcal {N}}|(j, i) \in {\mathcal {A}}_f\}\) and \({\mathcal {N}}^-(i) = \{j \in {\mathcal {N}}|(i, j) \in {\mathcal {A}}_f\}\). The following equation ensures the feasibility of flow in the network:

$$\begin{aligned} \sum _{j \in {\mathcal {N}}^-(i)} x_{ij}^{k} - \sum _{j \in {\mathcal {N}}^+(i)} x_{ji}^{k}= {\left\{ \begin{array}{ll} 1 &{} \text {if } i=O(k), \\ -1 &{} \text {if } i=D(k), \\ 0 &{} \text {otherwise,}\\ \end{array}\right. } \quad \forall i \in {\mathcal {N}}, k \in {\mathcal {K}} \end{aligned}$$
(6)

We have to ensure the transport capacity constraint is not violated in the problem:

$$\begin{aligned} \sum _{k \in {\mathcal {K}}} x_{ij}^k \le C^f_{ij} y_{ij}^f, \quad \forall (i, j) \in {\mathcal {A}}_f, f \in {\mathcal {F}} \end{aligned}$$
(7)

where \(C_{ij}^f\) is the capacity of transport link \((i, j) \in {\mathcal {A}}_f\) operated by \(f \in {\mathcal {F}}\).

Therefore, the multi-objective journey planning problem is formulated as follows:

Problem 1

(Multi-objective Journey Planning Problem)

$$\begin{aligned} \text {minimize} \quad&(3) \\ \text {subject to} \quad&(6)\text {--}(7) \end{aligned}$$

The problem is an integer linear program. To solve this problem, the scheduler must coordinate with the transport operators and obtain the information required in the problem, such as transport link details. However, the utility weight vector \(W^{k}\) is an abstract term that cannot be easily quantified. For example, a passenger may not explicitly state that they prefer a journey with a combination of 40% travel time, 20% discomfort, and 40% price. Additionally, passenger behavior and expectation are influenced by their past experience, which can be modeled as an MDP. To address these challenges, we propose using DRL (to be discussed in Sect. 4), which has the ability to learn from the passenger status and infer their expectation. Once the \(W^{k}\) vector is determined, the scheduler can solve Problem 1 using a conventional integer programming solver or a heuristic.
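As an illustration of this last step, below is a minimal sketch of solving Problem 1 for a single passenger with CVXPY once a weight vector is given. The toy network, cost values, and weights are illustrative assumptions rather than data from the paper, and the capacity constraint (7) is omitted since only one passenger is planned.

```python
# Minimal sketch: plan one passenger's journey by solving Problem 1 with CVXPY.
import cvxpy as cp
import numpy as np

nodes = [0, 1, 2, 3]
# (i, j, time beta, discomfort delta, price rho, operating cost mu) -- illustrative values
links = [(0, 1, 0.3, 0.2, 0.5, 0.2),
         (1, 3, 0.4, 0.1, 0.6, 0.3),
         (0, 2, 0.2, 0.6, 0.3, 0.1),
         (2, 3, 0.3, 0.5, 0.2, 0.1)]
origin, destination = 0, 3
w = {"beta": 0.4, "delta": 0.2, "rho": 0.3, "mu": 0.1}   # action from the DRL agent

x = cp.Variable(len(links), boolean=True)                # x_ij^k of Eq. (1)

# Weighted scalar cost of each link, combining Eqs. (3)-(5).
cost = np.array([w["beta"]*b + w["delta"]*d + w["rho"]*r + w["mu"]*m
                 for (_, _, b, d, r, m) in links])

# Flow-conservation constraints, Eq. (6).
constraints = []
for n in nodes:
    outgoing = sum(x[e] for e, (i, j, *_) in enumerate(links) if i == n)
    incoming = sum(x[e] for e, (i, j, *_) in enumerate(links) if j == n)
    rhs = 1 if n == origin else (-1 if n == destination else 0)
    constraints.append(outgoing - incoming == rhs)

prob = cp.Problem(cp.Minimize(cost @ x), constraints)
prob.solve()                                             # requires a MIP-capable solver
journey = [(links[e][0], links[e][1]) for e in range(len(links)) if x.value[e] > 0.5]
print("planned journey:", journey, "cost:", prob.value)
```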

3.2 Markov decision process

The transport journey and retention of a passenger are modeled as a 4-tuple Markov decision process \(\langle {\mathcal {S}}, {\mathcal {A}}, P, R \rangle\), where \({\mathcal {S}}\) and \({\mathcal {A}}\) are the sets of states and actions, respectively. \(P(s_{t+1}|s_t, a_t)\) represents the state transition probability from state \(s_t \in {\mathcal {S}}\) to \(s_{t+1} \in {\mathcal {S}}\) when performing action \(a_t \in {\mathcal {A}}\). \(R(s_t, a_t, s_{t+1})\) is the reward received for the transition from \(s_t\) to \(s_{t+1}\) after performing \(a_t\). In particular, the passenger satisfaction level in the state is an N-level integer value indicating the retention rate. In general, passengers tend to remain in the system if they are satisfied with it; therefore, satisfaction is proportional to the retention rate. A sample relationship is shown in Table 2.

Table 2 Sample relationship between satisfaction level and retention rate
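As a small illustration of how the retention step can be simulated, the sketch below samples whether a passenger returns; the level-to-probability mapping is an assumed stand-in for the values in Table 2.

```python
import random

# Assumed satisfaction-level-to-retention-rate mapping (illustrative stand-in for Table 2).
RETENTION = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8, 5: 0.95}

def passenger_returns(satisfaction_level, rng=random.Random(0)):
    """Sample whether the passenger submits another journey query."""
    return rng.random() < RETENTION[satisfaction_level]
```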

The satisfaction level of passenger k, \(H^k\), can change after each journey offered to the passenger. For example, if the offered journey matches the passenger’s expectation, the passenger’s satisfaction increases by n levels. If the service is far below what the passenger expects, say more expensive than usual, the satisfaction decreases, thereby reducing the retention rate after the journey. The expectation difference for passenger k depends on the expected and actual utility, which is defined as

$$\begin{aligned} E^k = {\tilde{w}}_\beta ^k({\tilde{\beta }}_{od}^k - \beta _{od}^k) + {\tilde{w}}_\delta ^k({\tilde{\delta }}_{od}^k - \delta _{od}^k) + {\tilde{w}}_\rho ^k({\tilde{\rho }}_{od}^k - \rho _{od}^k), \end{aligned}$$
(8)

where \({\tilde{W}}^k = [{\tilde{w}}_\beta ^k; {\tilde{w}}_\delta ^k; {\tilde{w}}_\rho ^k]\) is the actual weighting of the passenger, \({\tilde{\beta }}_{od}^k, {\tilde{\delta }}_{od}^k, {\tilde{\rho }}_{od}^k\) are the utilities expected by the passenger, and o and d are the origin and destination, respectively. The expected utility is the utility of the ideal best journey the passenger could get; it can be determined by solving the planning problem as if the passenger were alone in the system. The actual utility is the utility of the journey planned by the scheduler, which may deviate from the expected utility under an incompetent planner and limited capacity. Therefore, an incompetent planner may produce a large negative expectation difference on average.

The satisfaction function modeling the change in satisfaction can be expressed as

$$\begin{aligned} H^k:= {\left\{ \begin{array}{ll} H^k+n &{} \text {if } E^k \ge {\overline{E}}^k, \\ H^k-n &{} \text {if } E^k \le {\underline{E}}^k, \\ H^k &{} \text {otherwise},\\ \end{array}\right. } \end{aligned}$$
(9)

where \({\overline{E}}^k\) and \({\underline{E}}^k\) are the upper and lower expectation thresholds, and n is the step change of the satisfaction level. A sample state diagram for \(n = 1\) is summarized in Fig. 1. Similar customer satisfaction models can be found in supply chains [62] and product management [63].
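For concreteness, a minimal Python sketch of Eqs. (8) and (9) follows; the threshold values, the clipping to a 5-level scale, and the dictionary-based data layout are assumptions for illustration only.

```python
# Sketch of the expectation difference (Eq. 8) and satisfaction update (Eq. 9).
def expectation_difference(w_true, expected, actual):
    """E^k: weighted gap between expected and actual utility over time, discomfort, and price."""
    return sum(w_true[key] * (expected[key] - actual[key])
               for key in ("beta", "delta", "rho"))

def update_satisfaction(h, e, e_upper=0.1, e_lower=-0.1, n=1, h_min=1, h_max=5):
    """Move the satisfaction level up/down by n when E^k crosses a threshold (assumed bounds)."""
    if e >= e_upper:
        return min(h + n, h_max)
    if e <= e_lower:
        return max(h - n, h_min)
    return h
```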

Fig. 1 Satisfaction transition representation

4 Deep reinforcement learning with proportional fairness

In this section, we present the algorithm of the DRL agent and the corresponding components interacting with the agent. A typical interaction framework between the DRL agent and the environment is adopted, as shown in Fig. 2. When the MaaS scheduler receives a journey query, the trained DRL agent determines the output actions for the journey planner. Then, the journey planner solves Problem 1 to determine the optimal journey. The journey is offered to the passenger, who may take the suggestion based on the retention rate. The process repeats for another iteration when the passenger submits a new journey query.

Fig. 2 Interaction between the environment and the scheduler

4.1 Environment

The environment represents the set of passengers and the transportation system. At each time t, the state \(s_t\) of a set of passengers is transmitted to the agent. Based on the action \(a_t\) received from the agent, an optimal journey planning problem is solved for the passengers, as presented in Sect. 3.1. Note that the time t does not represent a fixed time interval as in other commonly seen RL problems. The time interval in our model depends on when the consumer chooses to query for their next journey, and the agent performs journey-to-journey sequential actions. The transport service taken by the passenger affects the next state \(s_{t+1}\), which is captured by the transition probability \(P(s_{t+1}|s_t, a_t)\).

4.1.1 State

The state \(s_t\) of a passenger is their satisfaction with the system together with the passenger’s characteristics, such as income and age. Satisfaction with the system is dynamic and follows the MDP discussed in Sect. 3.2, while the characteristics are static for each passenger. In this paper, we assume for simplicity that the characteristics of the passengers are the factors affecting their travel expectations. Nevertheless, other characteristics can be included in the state without loss of generality.

4.1.2 Reward

The reward of the transport operators is the profit, which is equal to the price minus the operating cost summed over all passengers, as given by Eq. (10).

$$\begin{aligned} \sum _{k,f} (\rho ^{kf} - \mu ^{kf}). \end{aligned}$$
(10)

The agent’s objective is to maximize this reward function by selecting an optimal action. Here, we introduce the concept of proportional fairness into the decision-making process of the agent. PF is a well-known principle in resource allocation which states that a change in allocation is justified only if the proportional gain of one party exceeds the sum of the proportional losses of all other parties [64, 65]. To ensure PF across the transport operators, following [66], a PF variant of the reward function can be formulated as

$$\begin{aligned} \sum _f w^f_{PF} \log \sum _k (\rho ^{kf} - \mu ^{kf}), \end{aligned}$$
(11)

where \(w^f_{PF}\) is the proportional weighting factor associated with each transport operator, which can be simplified to a common constant. This PF variant of the reward function is expected to ensure proportional fairness of profit among the transport operators. A naive approach to increasing profit would be to increase the price and decrease the cost in the utility weight vector. However, simply increasing the price and lowering the cost may offend the passenger and lower their satisfaction. According to the MDP, passengers with lower satisfaction have a lower chance of returning to the system. Thus, the agent should learn to increase passenger satisfaction by selecting a reasonable utility weight vector in order to increase the long-term profit.
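A minimal sketch of the two reward signals, Eqs. (10) and (11), is given below; the per-operator dictionary layout and the common PF weight are implementation assumptions rather than the paper's notation.

```python
import numpy as np

# profit[f] is assumed to hold the list of (price - operating cost) values, one per passenger,
# collected for operator f over the current step.
def plain_reward(profit):
    """Total profit across operators and passengers, Eq. (10)."""
    return sum(sum(per_passenger) for per_passenger in profit.values())

def pf_reward(profit, w_pf=1.0):
    """Proportional-fairness reward, Eq. (11): weighted log of each operator's total profit."""
    # Operators with small profit dominate the gradient of the log, pushing the agent toward balance.
    return sum(w_pf * np.log(sum(per_passenger)) for per_passenger in profit.values())
```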

4.2 DRL agent

The DRL agent is the component of the intelligent scheduler that determines the utility weight vector. Based on the given state \(s_t\) of the passenger, the agent computes an optimal utility weight vector that matches the passenger’s expectation of the journey in order to increase passenger satisfaction and the retention rate, for example, an expensive first-class itinerary versus an economy trip, or a comfortable detour journey versus a crowded direct route. The route is determined by solving the multi-objective journey planning problem in Sect. 3.1 together with the given action (utility weight vector) produced by the DRL agent.

4.2.1 Action

The action is the utility weight vector, with a dimension equal to the number of objective function terms. For example, the problem formulation presented in Sect. 3.1 contains four objectives: time, discomfort, price, and operating cost, and thus the action is a four-dimensional utility weight vector indicating the weighting of each objective. Without loss of generality, this four-dimensional utility weight vector is used in the experiments for simplicity. One can add more utilities, such as carbon emissions, to the action and the problem formulation.

4.2.2 Neural network

Neural networks are used to learn and perform the transition functions of states, actions, and rewards. To preserve the generality of the proposed approach, we use deep fully connected neural networks as the network structure in this paper. The reason is that the inputs and outputs of the network are numeric features and values, for which simple fully connected neural networks can likely model the relationship already, unlike other data types such as images and time series, which usually call for convolutional and recurrent neural networks, respectively. We are fully aware that many state-of-the-art neural network structures could replace the fully connected network, but the network structure is outside the scope of this paper. Nonetheless, the deep neural networks have to be designed to match the nature of the problem. For example, the action is an array ranging from 0 to 1, and thus we use the sigmoid as the output activation function.
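The following is a minimal sketch of such fully connected actor and critic networks, assuming a PyTorch implementation; the layer widths and depth are illustrative choices, not values reported in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a passenger state to a utility weight vector in [0, 1] (sigmoid output)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid(),
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Estimates Q(s, a) from the concatenated state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```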

4.3 Deep reinforcement learning algorithm

Our DRL algorithm is modified from the deep deterministic policy gradient (DDPG) [67], a model-free off-policy algorithm for continuous control problems. The algorithm is an actor-critic approach in which an actor function \(\pi (s|\theta ^\pi )\) deterministically performs an action based on a given state, and a critic function Q(s, a) learns the Q-value of the state-action pair following the Bellman equation.

A replay buffer R is used to store the transition tuple \((s_t, a_t, r_t, s_{t+1})\) for sampling minibatch to update the actor and critic. The motivation for using a replay buffer is that the training of neural networks usually assumes the samples are independently and identically distributed, which cannot be achieved if the updates are based on sequential transitions in reinforcement learning.

In the classic DDPG, Ornstein–Uhlenbeck (OU) noise is added to the action for exploration. However, in our scenario, the action space is a vector with values ranging from 0 to 1, and adding OU noise to the action does not guarantee this range; we therefore use epsilon-greedy action selection instead of OU noise for exploration. A uniformly random action vector is generated with the probability defined by \(\epsilon\).
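A sketch of this exploration rule is shown below, reusing the hypothetical Actor from the previous sketch; the NumPy/PyTorch interface is an assumption.

```python
import numpy as np
import torch

def select_action(actor, state, epsilon, action_dim, rng=np.random.default_rng()):
    """With probability epsilon return a uniformly random weight vector in [0, 1];
    otherwise return the actor's deterministic output."""
    if rng.random() < epsilon:
        return rng.uniform(0.0, 1.0, size=action_dim)
    with torch.no_grad():
        return actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
```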

There are four function approximators, namely the actor local, actor target, critic local, and critic target networks. The update rules of the four function approximators, as presented in [67], are given as follows. The critic local network \(\theta ^Q\) is updated based on the loss function:

$$\begin{aligned} L = \frac{1}{B} \sum _i (r_i + \gamma Q'(s_{i+1}, \pi '(s_{i+1}|\theta ^{\pi '})|\theta ^{Q'}) - Q(s_i, a_i|\theta ^Q))^2 \end{aligned}$$
(12)

where B is the minibatch size, i is the index of a sample in the minibatch, and \(\gamma\) is the discount factor. The actor local network \(\theta ^\pi\) is updated using the policy gradient:

$$\begin{aligned} \nabla _{\theta ^\pi } J \approx \frac{1}{B} \sum _i \nabla _a Q(s, a|\theta ^Q)|_{s=s_i, a=\pi (s_i)} \nabla _{\theta ^\pi } \pi (s|\theta ^\pi )|_{s_i} \end{aligned}$$
(13)

The use of local and target networks is a technique to stabilize training. In each iteration, the parameters of the local networks are updated using the aforementioned rules, while the target networks are updated by softly copying the parameters from the local networks with a scale of \(\tau\). For the critic target \(\theta ^{Q'}\):

$$\begin{aligned} \theta ^{Q'}:= \tau \theta ^Q + (1-\tau )\theta ^{Q'}. \end{aligned}$$
(14)

Similarly, for actor target \(\theta ^{\pi '}\):

$$\begin{aligned} \theta ^{\pi '}:= \tau \theta ^\pi + (1-\tau )\theta ^{\pi '}. \end{aligned}$$
(15)

The detailed procedure is shown in Algorithm 1.

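A condensed sketch of the training loop of Algorithm 1, following Eqs. (12)-(15), is given below. The Actor, Critic, and select_action definitions come from the earlier sketches; the env, ReplayBuffer, and epsilon_schedule interfaces and all hyperparameter values are assumptions rather than the paper's settings.

```python
import copy
import torch
import torch.nn.functional as F

state_dim, action_dim = 4, 4                    # assumed dimensions
gamma, tau, batch_size = 0.99, 0.005, 64        # assumed hyperparameters
num_episodes, num_iterations = 2000, 100        # as in the experiments

actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
buffer = ReplayBuffer(capacity=100_000)         # assumed replay buffer interface

for episode in range(num_episodes):
    state, epsilon = env.reset(), epsilon_schedule(episode)
    for t in range(num_iterations):
        action = select_action(actor, state, epsilon, action_dim)
        # env.step solves Problem 1 with the weight vector, then updates satisfaction and retention.
        next_state, reward = env.step(action)
        buffer.add(state, action, reward, next_state)
        state = next_state

        s, a, r, s2 = buffer.sample(batch_size)  # tensors; r assumed shaped (batch_size, 1)
        with torch.no_grad():                    # target Q-value, Eq. (12)
            y = r + gamma * critic_target(s2, actor_target(s2))
        critic_loss = F.mse_loss(critic(s, a), y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        actor_loss = -critic(s, actor(s)).mean()  # policy gradient, Eq. (13)
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft target updates, Eqs. (14)-(15).
        for target, local in ((critic_target, critic), (actor_target, actor)):
            for tp, lp in zip(target.parameters(), local.parameters()):
                tp.data.mul_(1 - tau).add_(tau * lp.data)
```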

5 Experiments

5.1 Experiment setup

We conduct experiments in two scenarios: New York City (NYC) and a synthetic scenario to simulate the multimodal transport system and passenger behaviors.

5.1.1 New York City scenario

To evaluate our proposed approach, we use a real-world transportation network and datasets of NYC. The transportation network is extracted from the Manhattan region of NYC based on the taxi zone maps. Each zone represents a node in the network, and we assume there is an edge between two nodes if the zones are connected on the map. Isolated regions without connections are ignored. The resulting graph contains 63 nodes in an irregular shape. We assume 3 transport operators on each connection, resulting in 3 edges between connected nodes and a total of 963 edges. Each edge is associated with a time, discomfort, price, and operating cost. The values are randomly generated between 0 and 1, except that the price must be larger than the operating cost for each edge.
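A sketch of this cost generation under the stated constraint is shown below; the exact sampling scheme and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_link_costs():
    """Draw time, discomfort, price, and operating cost in [0, 1] with price above operating cost."""
    beta, delta = rng.random(), rng.random()   # time, discomfort
    mu = rng.random()                           # operating cost
    rho = rng.uniform(mu, 1.0)                  # price no smaller than the operating cost
    return beta, delta, rho, mu
```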

A set of passengers \({\mathcal {K}}\) is constructed from open datasets of NYC. We sample the traffic queries and passengers’ characteristics from the NYC Taxi and Limousine Commission Trip Record Data and the Citywide Mobility Survey, respectively. The expected utility weight vector in Eq. (8) is calculated based only on the characteristics and is unknown to the intelligent scheduler throughout the experiments. All initial passenger satisfaction levels are set to level 3. The multi-objective journey planning problem with a given utility weight vector from the DRL agent is solved by a standard optimizer in CVXPY [68].

5.1.2 Synthetic scenario

We also evaluate our proposed approach on a synthetic system. We use a square grid with 36 nodes to represent the transportation system, where each node is a stop or a transfer point to other transport modes, and each edge is a transport sub-journey. Between any two neighboring nodes, 6 edges are set, each representing a transport operator, resulting in a total of 360 edges for the 36 nodes. The network costs are randomly generated, as in the NYC scenario.

A set of passengers \({\mathcal {K}}\) is generated with random origins, destinations, and characteristics. The calculation of the expected utility weight vector, the initial passenger satisfaction levels, and the journey planning problem-solving process are the same as in the NYC scenario. Table 3 summarizes the parameters used in the experiments.

Table 3 Parameter settings

5.1.3 Benchmarks

The main purpose of the DRL approach is to determine a utility weight vector for each passenger. Hence, two benchmark approaches for determining the utility weight vector are compared with the DRL approach, namely the “fixed” and “random” policies. The utility weight vector of the former is fixed to all ones, and that of the latter is a uniformly random vector between 0 and 1.
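These two benchmarks reduce to the one-liners sketched below (the NumPy interface is an assumption).

```python
import numpy as np

def fixed_policy(action_dim):
    """Benchmark 1: all utility weights fixed to one."""
    return np.ones(action_dim)

def random_policy(action_dim, rng=np.random.default_rng()):
    """Benchmark 2: a uniformly random utility weight vector in [0, 1]."""
    return rng.uniform(0.0, 1.0, size=action_dim)
```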

5.2 Profit

We test the transport operator profit received using the proposed and benchmark approaches in both the NYC and synthetic scenarios. As discussed in Sect. 4.1.2, the rewards received from the environment are equivalent to the profit in Eq. (10), or Eq. (11) for PF.

5.2.1 NYC scenario

The average rewards over 2000 episodes for the NYC scenario are shown in Table 4. Among the compared approaches, the DRL approach performs the best in terms of obtaining the highest rewards (profits), with an average of 336.48 units per episode. The DRL with PF is the second highest, with an average of 287.47 units per episode. The random policy is the lowest, resulting in 171.20 units per episode. The fixed utility weight vector policy lies between the random and DRL agents, obtaining an average reward of 223.55 units per episode. We note that there is a drop in reward for DRL with PF. This is expected since PF ensures that the sum of proportional changes among transport operators is non-positive rather than maximizing the total reward, which can be considered an additional constraint on the maximization problem. The fairness of transport operators is discussed in Sect. 5.5.

We are also interested in the variation of the approaches across episodes. To mitigate the fluctuation due to randomness and to see the trends clearly, we calculated the moving average of the reward with a time window of 20 episodes and plotted the approaches in Fig. 3. In general, the DRL approach performs the best in all episodes. The DRL with PF is slightly lower than the plain DRL approach. Similarly, the random policy is the worst, and the fixed utility weight vector policy lies between the random and DRL approaches for all episodes. A clear increasing trend is observed for the DRL agents: the reward starts low at the beginning, then increases with the training episodes until it converges, as expected. The agent behaves like this for two reasons. First, the \(\epsilon\) value is high at the beginning, which means the random policy dominates the policy of the DRL agent, and thus the performance is similar to the random policy early on. Second, the initial random policy provides replay experience for the DRL agent to explore and learn from. In later episodes, as the \(\epsilon\) value gradually decays, the actions are mainly performed by the agent, which is helpful for further exploitation.
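The smoothing used for Figs. 3 and 4 is a plain moving average; a short sketch is given below.

```python
import numpy as np

def moving_average(rewards, window=20):
    """Moving average of episode rewards with a 20-episode window, as used in Figs. 3 and 4."""
    return np.convolve(rewards, np.ones(window) / window, mode="valid")
```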

Table 4 Average reward of approaches in the NYC scenario
Fig. 3 Moving average reward of the NYC scenario. The time window of the moving average is 20

5.2.2 Synthetic scenario

The average rewards over 2000 episodes for the synthetic scenario are shown in Table 5. Similar to the NYC scenario, among the compared approaches, the DRL agent performs the best in terms of obtaining the highest rewards, with an average of 376.17 units per episode. The DRL with PF is the second highest, with an average of 363.41 units per episode. The random policy is the lowest, resulting in 111.58 units per episode. The fixed utility weight vector policy lies between the random and DRL approaches, obtaining an average reward of 164.32 units per episode.

The moving average of the reward with a time window of 20 episodes is plotted in Fig. 4. Similar to the results in the NYC scenario, the DRL approaches outperform all other approaches in all episodes. The random policy performs the worst, and the fixed utility weight vector policy is between the random policy and the DRL agent for all episodes. For both DRL agents, a clear increasing trend is observed, as in the NYC scenario.

We can observe similar results in both scenarios. Therefore, the proposed approach successfully increases the profit of the transport operators compared to the benchmarks.

Table 5 Average reward of approaches in the synthetic scenario
Fig. 4 Moving average reward of the synthetic scenario. The time window of the moving average is 20

5.3 Passenger satisfaction

We study the passenger satisfaction level in both scenarios to understand the performance of each approach from the passengers’ perspective. For the simulated 2000 episodes and the 100 iterations in each episode, we plot the average satisfaction level of each iteration, the average satisfaction level of each episode, the total count of each satisfaction level for each method, and the average satisfaction level against the number of nodes in Figs. 5, 6, 7 and 8, respectively, for the NYC scenario. The same plots for the synthetic scenario are shown in Figs. 9, 10, 11 and 12. The color in each figure corresponds to the satisfaction level, as indicated by the color bar on the right side of the plots. Recall that a higher satisfaction level \(H^k\) indicates better service for the passenger.

5.3.1 NYC scenario

Figures 5 and 6 illustrate the variation of satisfaction levels across iterations and episodes, respectively. In Fig. 5, the average satisfaction level of the DRL agent is the highest across iterations compared to the other approaches, with DRL with PF slightly lower than DRL. Since the satisfaction level shown is the average over 2000 episodes, the fluctuation is mitigated and the figure shows a clear distinction between the approaches. This suggests that the methods converge to certain satisfaction levels across iterations. Figure 6 shows the average satisfaction level across episodes, which fluctuates more than in Fig. 5. Nonetheless, we can still observe that the satisfaction levels of DRL and DRL with PF are higher than those of the fixed and random policies. The fluctuation may be caused by the random origins, destinations, and characteristics of the passengers across episodes.

Fig. 5 Average satisfaction level of each iteration of the NYC scenario

Fig. 6 Average satisfaction level of each episode of the NYC scenario

Fig. 7 Total number of satisfaction levels of each method of the NYC scenario

Fig. 8 Average satisfaction level by number of nodes of the NYC scenario

5.3.2 Synthetic scenario

The observations for the synthetic scenario are similar to those for the NYC scenario. In Fig. 9, the average satisfaction level of both DRL approaches is the highest across iterations compared to the fixed and random policies. Figure 10 shows the average satisfaction level across episodes, which fluctuates more than in Fig. 9. Nevertheless, we can still observe that the satisfaction levels of both DRL methods are higher than those of the fixed and random policies.

Fig. 9 Average satisfaction level of each iteration of the synthetic scenario

Fig. 10 Average satisfaction level of each episode of the synthetic scenario

Fig. 11 Total number of satisfaction levels of each method of the synthetic scenario

Fig. 12 Average satisfaction level by number of nodes of the synthetic scenario

5.4 Interpreting the satisfaction level

5.4.1 NYC scenario

In Fig. 7, we can observe that most of the satisfaction levels are high for both DRL methods compared to the fixed and random policies, which means the agent can offer journeys that meet passengers’ preferences. A notable observation about the fixed policy is that its satisfaction levels are polarized between levels 1 and 5. In other words, passengers’ satisfaction tends to converge toward either the lowest or the highest level when the fixed utility policy is used. This is because the fixed utility policy is static and cannot adapt to the preferences of different passengers, as is the case with any static policy that has no intelligent design. For the random policy, level 1 occurs most of the time, indicating its instability. To interpret these observations, we plot the average satisfaction level against the number of nodes between the origin and destination in Fig. 8. We observe a valley in the satisfaction levels of the fixed utility policy from the third to the fifth bar. This could be related to the complexity of the journey search space and the number of origin–destination combinations. First, for journeys with fewer nodes in between, the search space is smaller, so there are fewer choices for the scheduler, leading to a sharply decreasing trend in satisfaction from the first to the third bar. Second, there are fewer origin–destination combinations for journeys with more nodes in between in the network. Table 6 lists the origin–destination combinations of the networks. Origin–destination pairs with three nodes in between have the largest number of combinations for the NYC scenario. The decreasing number of combinations from 3 to 13 nodes implies less variation in the problem, and the expectation difference is mitigated. Hence, satisfaction rises from the fourth bar onward when the two effects are combined.

5.4.2 Synthetic scenario

In Fig. 11, we see that both DRL approaches yield high satisfaction levels compared to the fixed and random policies, indicating that the agent can offer journeys that meet passengers’ preferences. Similar to the NYC scenario, the satisfaction levels of the fixed utility policy are quite diverse, polarized between levels 1 and 5. We plot the average satisfaction level against the number of nodes between the origin and destination in Fig. 12 to interpret this observation. We observe a clear valley in the satisfaction levels of the fixed utility policy against the number of nodes, where journeys with three nodes in between have the lowest satisfaction. This could be related to the same complexity effects as in the NYC scenario. Table 6 lists the origin–destination combinations of the networks. Origin–destination pairs with two or three nodes in between have the largest number of combinations. The decreasing number of combinations from 3 to 9 nodes implies less variation in the problem, and the expectation difference is mitigated. Hence, satisfaction rises from the fourth bar onward when the two effects are combined. In Fig. 11, level 1 occurs most of the time for the random utility policy, which indicates the instability of this policy.

Therefore, based on the results of both NYC and synthetic scenarios, we can conclude that the DRL approaches can offer journeys that meet passengers’ preferences, and passengers are generally satisfied with the service.

Table 6 The shortest number of nodes between origin and destination versus the number of origin–destination combinations

5.5 Fairness among transport operators

We analyze the reward of transport operators for the DRL approach with and without PF. Table 7 summarizes the results of both scenarios.

5.5.1 NYC scenario

For the three transport operators in the NYC scenario, the normalized rewards differ slightly between DRL with and without PF. Although the rewards of all three transport operators are close to the mean, we can assess fairness by comparing their variance and mean absolute deviation around the mean, as shown in Table 7. Smaller variance and mean absolute deviation mean that the reward difference between transport operators is smaller and the passengers are more evenly distributed across the transport operators. We can see that the reward is more balanced for DRL with PF.
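The dispersion measures reported in Table 7 can be computed as sketched below; how the rewards are normalized upstream is an assumption not specified here.

```python
import numpy as np

def fairness_metrics(normalized_rewards):
    """Variance and mean absolute deviation around the mean of the operators' normalized rewards."""
    r = np.asarray(normalized_rewards, dtype=float)
    variance = r.var()
    mad = np.abs(r - r.mean()).mean()
    return variance, mad
```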

5.5.2 Synthetic scenario

For the six transport operators in the synthetic scenario, the normalized rewards also differ slightly between DRL with and without PF. Similar to the observation in the NYC scenario, the reward differences among transport operators are smaller under DRL with PF, as indicated by the variance and mean absolute deviation around the mean.

Therefore, from the results in both scenarios, we can conclude that DRL with PF can balance the profit and passengers across the transport operators.

Table 7 Analysis of reward of transport operators

6 Conclusion

MaaS is a new concept that unites various modes of transportation into a single platform for future intelligent transportation systems. However, passengers in multimodal transportation systems have diverse demands and preferences for their transport service when facing multimodal transport choices. It is even more challenging to consider the consequences of disappointing passengers, as they may not remain in the system. To address this challenge, we proposed a DRL approach to handle the unknown utility weights in the multi-objective journey planning problem. The agent can determine appropriate weights for each passenger based on their characteristics. Given the weights, the scheduler can determine the optimal journey that matches the passenger’s expectations instead of an arbitrary non-dominated Pareto solution. A PF-based variant of the DRL approach is presented to balance the profit across transport operators during journey planning. Our experiments with real-world and synthetic datasets show that the proposed approach effectively increases the passenger satisfaction level and can lead to a 2.3-times increase in profit. This research can serve as a useful reference for implementing practical intelligent transportation systems dealing with journey planning problems among multiple passengers and transport operators.

Furthermore, there are several possible future research directions beyond this work. First, although Problem 1 defines a necessary set of objectives and constraints for demonstration and experimentation purposes, there are many variants targeting different, more subtle applications that require further study. Second, as a pioneering work, we used classic fully connected neural networks as function approximators, which produced promising results; however, exploring different function approximator structures could further enhance performance. Third, we aim to integrate this research with EV-charging grid management for a low-carbon MaaS ecosystem [69].