A Reinforcement Learning approach for bus network design and frequency setting optimisation

This paper proposes a new approach to solve the problem of bus network design and frequency setting (BNDFS). Transit network design must satisfy the needs of both service users and transit operators. Numerous optimisation techniques have been proposed for BNDFS in the literature. Previous approaches tend to adopt a sequential optimisation strategy that conducts network routing and service frequency setting in two separate steps. To address the limitation of sequential optimisation, our new algorithm uses Reinforcement Learning for a simultaneous optimisation of three key components of BNDFS: the number of bus routes, the route design and service frequencies. The algorithm can design the best set of bus routes without defining the total number of bus routes in advance, which can reduce the overall computational time. The proposed algorithm was tested on the benchmark Mandl Swiss network. The algorithm is further extended to the routing of express services. The validation includes additional test scenarios which modify the transit demand level on the Mandl network. The new algorithm can be useful to assist transit agencies and planners in improving existing routing and service frequency to cope with changing demand conditions.


Introduction
Transit networks are becoming increasingly complex as new transit systems, routes and services are added to cope with growing and diverse travel demand. Transit authorities are facing significant challenges in integrating different transit modes and services and improving the quality of transit service. The transit service planning process consists of five main components, which are typically undertaken in sequential order: network design, frequency setting, timetable development, vehicle scheduling and crew scheduling. The first two components (i.e., network design and frequency setting) determine the shape and number of transit routes, route lengths and the number of fleet vehicles based on demand, land use, community characteristics and travel behaviour. They require the most complex strategies because they establish an optimal and balanced solution to meet the needs of both transit users and operators (Ceder and Wilson 1986). Since network design and frequency setting are also interrelated, the first two steps of transit planning are indeed the most challenging tasks in the transit planning process (Shih et al. 1998).
Different terms are used in the literature to describe the transit planning process. For instance, the first two steps of network design and frequency setting were named 'bus transit route network design' by Fan and Machemehl (2006), 'transit route network design' by Baaj and Mahmassani (1995), 'line planning in public transport' by Borndörfer et al. (2005), and 'transit network design and frequency setting' by Guihaire and Hao (2008). Focusing on bus transport planning, this study uses the terminology of the BNDFS problem, referring to the combined problems of bus network design and frequency setting.
Some early studies in the literature attempted to design a bus network and set service frequency separately to avoid computational complexity and intractable issues (Marwah et al. 1984;van Nes et al. 1988). The network design step determines the number of bus routes and their route design by connecting pre-determined bus stops to maximise user benefits such as convenience and reduced travel time. The service frequency of each bus route is then determined in such a way as to reduce operating costs. This two-step planning may not always generate the overall optimal design, because user benefits and operator costs often conflict with each other. In practice, both network design and frequency setting should be handled simultaneously to provide a comprehensive solution (Zhao et al. 2015).
The BNDFS problem is known to be a complex optimisation problem because of the conflicting interests of transit users and operators. For example, users always desire a greater number of routes and more frequent services, but this naturally increases the running costs for operators. Therefore, a compromise solution must be sought with an acceptable trade-off from the perspectives of both users and the operator. To deal with the computational complexity, various heuristic and metaheuristic approaches have been proposed including simulated annealing (Zhao and Zeng 2007; Yan et al. 2013; Ahern et al. 2022), genetic algorithm (Owais and Osman 2018; Jha et al. 2019; Mahdavi Moghaddam et al. 2019), tabu search (Fan and Machemehl 2008; Roca-Riu et al. 2012; Yao et al. 2014), ant colony optimisation (Hu et al. 2005; Yu et al. 2012), artificial bee colony optimisation (Nikolić and Teodorović 2014; Szeto and Jiang 2014), and particle swarm optimisation (Kechagiopoulos and Beligiannis 2014; Buba and Lee 2019; Cipriani et al. 2020). A common limitation of those techniques is that the performance of metaheuristic algorithms relies heavily on the setting of the algorithm parameters.
This study proposes a new approach to solve the BNDFS problem by using Reinforcement Learning, a branch of machine learning. Reinforcement Learning, a stochastic search technique, has recently emerged with considerable success as it allows a flexible and efficient optimisation of complex problems. Unlike most metaheuristic approaches, model-free Reinforcement Learning algorithms are applicable in different environments and can solve problems on a trial-and-error basis (Sutton and Barto 1998). Darwish et al. (2020) applied deep Reinforcement Learning to solve the BNDFS problem. Their study mostly focused on designing a bus network to serve passengers with direct trips, and waiting time was not considered in the calculation of total travel time. In this study, the BNDFS problem is solved by proposing a Reinforcement Learning-based algorithm that accounts for all the relevant factors of travel time, including waiting time, in-vehicle time, and transfer time. In addition, the Reinforcement Learning-based algorithm finds the best number of bus routes, their route configurations, and their service frequencies simultaneously.
The Reinforcement Learning algorithm developed in this study attempts to optimise the well-known Mandl's Swiss network (Mandl 1980). Mandl's Swiss network was used for testing and validating routing algorithms (Chakroborty 2003;Chew et al. 2013;Mumford 2013;Nikolić and Teodorović, 2014;Buba and Lee 2018;Capali and Ceylan 2020). To enable more comprehensive applications, the algorithm is extended to formulate express and local routing options to better serve passenger demands. Mandl's Swiss network is also modified to create three additional test scenarios for validation of the algorithm's performance in various environments.
The paper is organised as follows. Section 2 presents the literature review. Section 3 briefly explains Reinforcement Learning. Section 4 describes the BNDFS problem with relevant decision features and variables. Section 5 presents the research methodology to develop the proposed new bus network design and frequency setting algorithm using Reinforcement Learning. Section 6 provides the computational experiments and their results including the different test scenarios with the proposed Reinforcement Learning algorithm. The conclusions and future work are in Sect. 7.

Literature review
Relevant reviews of the transit network design have been published by Guihaire and Hao (2008), Karlaftis (2009), Farahani et al. (2013), Ibarra-Rojas et al. (2015) and Durán-Micco and Vansteenwegen (2022). From these reviews, this section presents the studies which have attempted to solve the bus network design and frequency setting simultaneously.
The simultaneous approach for the BNDFS problem is an assignment problem that exhibits highly combinatorial characteristics requiring a large search space with exponential time (Fan et al. 2009). As searching the entire space is impractical, the BNDFS problem was addressed by using approximate methods that can find out a satisfactory solution. Mandl (1980) developed a two-stage procedure based on the heuristic method for the bus network design problem. The procedure generates an initial set of feasible routes, and the initial solution is improved iteratively by improving the total passenger in-vehicle travel time. Later, Zhao (2006) also developed a heuristic-based algorithm that minimised the total passenger travel time under the constraints of route length, fleet size and headway. Baaj and Mahmassani (1995) considered a combination of user and operator costs to determine a set of routes. Their proposed algorithm initially identifies the origin-destination pairs with the highest demand, and then the frequency of services and the number of buses required on each route. The initial set of bus routes is modified heuristically to improve its global effectiveness such as the number of direct trips, the total waiting time and the transfer time.
Because of advances in computation power, researchers began to use metaheuristic approaches. The metaheuristic approach can design larger and more complicated transit networks. There are broadly two types of metaheuristics: population-based metaheuristics (e.g., genetic algorithm, ant colony optimisation and bee colony optimisation) and single-solution-based metaheuristics (e.g., tabu search, hill climbing, simulated annealing and GRASP). Almost a third of previous metaheuristic-based algorithms for the BNDFS problem used Mandl's (1980) benchmark network to analyse the performance of their proposed models (Iliopoulou et al. 2019;Durán-Micco and Vansteenwegen 2022).
The most popular method of population-based metaheuristics is the genetic algorithm (Iliopoulou et al. 2019). The genetic algorithm is a search heuristic that reflects the process of natural selection to find the fittest individuals. Bielli et al. (2002) proposed a genetic algorithm to maximise network performance and minimise the required fleet size and total travel time. Their proposed algorithm calculates fitness function values with a multicriteria analysis to indicate performance. Later, Amiripour et al. (2014) developed an extended genetic algorithm to determine a set of bus routes for an entire year by considering seasonal demand patterns with a probability of occurrence. Their proposed algorithm aimed to minimise a different set of objectives, including the weighted sum of passengers' total waiting time, unused seat capacity, unsatisfied demand, and the required fleet size. A more recent study applied a genetic algorithm to design a bus network with limited-stop bus routes using the existing bus network in Beijing, China (Wu et al. 2015). Multi-objective genetic algorithms were developed to solve the bus network design and frequency setting problems simultaneously in recent studies (Owais and Osman 2018; Jha et al. 2019; Mahdavi Moghaddam et al. 2019).
Another form of genetic algorithm, namely a nondominated sorting genetic algorithm, was adopted by Chai and Liang (2020) for the BNDFS problem. The nondominated sorting genetic algorithm can solve multi-objective optimisation problems. Momenitabar and Mattson (2021) adopted the nondominated sorting genetic algorithm II to find a Pareto front solution between multi-objective functions. Ali and Roman (2021) used a memetic algorithm, another extension form of the traditional genetic algorithm, to find the best set of bus routes and their structures for bus network design. Their proposed algorithm was combined with the hill climbing local search algorithm to improve route design in the global search procedure.
Other population-based metaheuristics, namely bee colony optimisation (Lučić and Teodorović 2003), ant colony optimisation (Dorigo et al. 1996), particle swarm optimisation (Eberhart and Kennedy 1995) and intelligent water drops (Hosseini 2007), can be categorised into the swarm intelligence domain. Swarm intelligence is inspired by the social behaviour of a large group like a swarm or colony. In bee colony optimisation algorithms, artificial bees look for feasible sets of bus routes by exploring search space, and the collaboration between the artificial bees helps to locate more promising bus route structures and frequencies (Nikolić and Teodorović 2014; Szeto and Jiang 2014). Other studies recognised the similarity between ants' search for food from their nest and the way of finding an optimal bus route from a terminal and proposed ant colony optimisation models for the BNDFS problem (Hu et al. 2005; Yu et al. 2012). Particle swarm optimisation is also a population-based metaheuristic that solves a problem by trying to improve a candidate solution through an iterative process. It was first used by Kechagiopoulos and Beligiannis (2014) to solve transit network design problems. Subsequently, particle swarm optimisation-based algorithms were improved and introduced by Buba and Lee (2019) and Cipriani et al. (2020). A new population-based metaheuristic, intelligent water drops, has been introduced in the literature (Hosseini 2007). It was inspired by observations of water drops in a river and environmental changes resulting from the flowing river. The water drops naturally flow along the optimal path of a river. Capali and Ceylan (2020) used the intelligent water drops algorithm to design bus routes on the Mandl benchmark network.
Another branch of metaheuristic approaches is single solution-based metaheuristics. In complex optimisation problems, finding the globally optimal solution is very challenging and limiting the searching space to neighbours often stagnates in a locally optimal solution. One of the advantages of using single solution-based metaheuristics is that they can intensify the search in local space regions (Talbi 2016). As one of the single solution-based metaheuristics, simulated annealing randomises the local search procedure and allows the selection of worse solutions with a certain probability to prevent the search from being stuck in a suboptimal space. Simulated annealing was first introduced to the BNDFS problem by Fan and Machemehl (2006), who adopted a multi-objective function of the weighted sum of user cost, operator cost and unmet demand. Simulated annealing has since been widely used for transit network design (Zhao and Zeng 2007;Yan et al. 2013;Ahern et al. 2022).
Tabu search is also another single solution-based metaheuristic adopted in the literature to solve the BNDFS problem. Similar to simulated annealing, tabu search attempts to find the optimal solution by exploring the search space beyond local optimality. Tabu search is a memory-based technique that enhances the searching performance by avoiding already searched solutions. Pacheco et al. (2009) developed a transit network design algorithm using tabu search for varying passenger demands. Other tabu search-based algorithms for the BNDFS problem were proposed by Fan and Machemehl (2008), Roca-Riu et al. (2012), and Yao et al. (2014).
It is crucial to consider the perspectives of both users and operators in the model for the BNDFS problem, with extra parameters adding complexity. Recent metaheuristic approaches are reportedly able to produce acceptable quality solutions within reasonable computational time. However, the performance of metaheuristic algorithms relies heavily on the setting of the algorithm parameters. Finding the right parameter values is an onerous task that requires expertise in the relationship between the algorithm and its parameters (Neumüller et al. 2011). As an alternative, Reinforcement Learning provides partially stochastic, rule-based modelling outcomes and derives optimal or near-optimal solutions without explicit modelling or formulation of the problem and instructions. Reinforcement Learning has been successfully applied to transport issues such as traffic signal controls (Shoufeng et al. 2008;Arel et al. 2010;Balaji et al. 2010) and traffic management (Rezaee et al. 2013;Ivanjko et al. 2015;Sun and Liu 2015). Recent advances in technology with high-performance computing systems can help derive reasonably good quality solutions within reasonable computation time for complicated problems requiring large search spaces.

Reinforcement Learning
Reinforcement Learning is a goal-oriented type of machine learning technique developed to guide an autonomous agent to maximise the agent's reward throughout repetitions of computational procedures in a certain environment (Sutton and Barto 1998). The key advantage of Reinforcement Learning is its ability to learn how to solve an optimal control problem without any external supervision. Instead, the agent senses the environment to decide on its state, and the agent takes the action that changes the state of the environment. This self-learning ability of the learning agent is the key characteristic that defines the intelligence of the system.
A traditional Reinforcement Learning model has five components, as shown in Fig. 1: agent, action, environment, state and reward. An agent learns and makes actions interacting with an environment where the environment comprises everything outside the agent. When the agent takes an action in the environment, the action is interpreted as a reward depending on how closely the agent reaches the pre-defined optimisation objective. The agent learns with a reward of action and a representation of a state, which are fed back into the following actions of the agent. Therefore, the agent seeks to maximise rewards over time through its choice of actions.
Q-learning is one of the most widely used methods for Reinforcement Learning (Sutton and Barto 1998). As the agent interacts with the environment in a sequence of steps, the environment is formulated based on the Markov decision process framework. The Markov decision process is widely used for decision making in situations where actions influence not just immediate reward, but also subsequent states and, through those states, future rewards. The Markov decision process is an abstract mathematical form of the reinforcement problem. The state-transition probabilities from state $s$ to the following state $s'$ under action $a$ at time $t$ can be expressed as:
$$\mathcal{P}^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\} \tag{1}$$
where $S$ is a set of states of the agent in the environment; $A$ is a set of actions of the agent; $s', s \in S$ and $a \in A$.
The expected reward after the transition from state $s$ to the following state $s'$ under action $a$ can be expressed as:
$$\mathcal{R}^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\} \tag{2}$$
where $R$ is a set of rewards of the agent; $A$ is a set of actions of the agent; $r \in R$.
The agent's mapping from state to action is called the policy, $\pi$. The policy is improved iteratively as the agent gains experience. Numerous trials are executed during the training phase for the agent to gain experience and to enable the agent to find $\pi^{*}$, the best policy, which determines the best action when the agent is in a particular state. The value associated with a state-action pair is updated from $Q_t(s', a)$, its current value, using $r$, the instant reward that the agent receives for the executed action, and $Q_{t+1}(s', a)$, the expected return starting from that state. The data structure of the Q-value is a matrix, where each cell element represents the corresponding value of taking a particular action when in a particular state. The state-action value is maximised when the agent follows the best policy. The Q-value table is updated by using the following one-step equation:
$$Q^{new}_{t}(s', a) = (1 - \alpha_t)\, Q^{old}_{t}(s', a) + \alpha_t \left[ r_t + \gamma_t \max_{b} Q^{old}_{t+1}(s', b) \right] \tag{3}$$
where $Q^{new}_{t}(s', a)$ is the new Q-value of the following state $s'$ under action $a$ at time $t$; $Q^{old}_{t}(s', a)$ is the previously recorded Q-value of the following state $s'$ under action $a$ at time $t$; $\max_{b} Q^{old}_{t+1}(s', b)$ is the maximum Q-value among previously recorded Q-values of possible actions at time $t + 1$ from the state $s'$; $r_t$ is the instant reward at time $t$; $\gamma_t$ is the discount factor at time $t$ $(0 \le \gamma_t \le 1)$; $\alpha_t$ is the learning rate at time $t$ $(0 \le \alpha_t \le 1)$.
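The one-step update can be illustrated with a small tabular sketch. This is the standard textbook form of Q-learning rather than the authors' implementation; the function name `q_update`, the dictionary-based table and the default value of zero for unseen entries are assumptions made for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions, alpha, gamma):
    """Apply one tabular Q-learning step and return the new Q-value.

    q            : dict mapping (state, action) -> value (assumed to default to 0.0)
    next_actions : actions available from next_state
    alpha, gamma : learning rate and discount factor, both in [0, 1]
    """
    # Largest previously recorded value reachable from the next state.
    best_next = max((q[(next_state, b)] for b in next_actions), default=0.0)
    # Blend the old estimate with the new target (reward + discounted best value).
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (
        reward + gamma * best_next
    )
    return q[(state, action)]


q = defaultdict(float)
q[("s1", "x")] = 2.0  # a value learned earlier for the next state
new_value = q_update(
    q, "s0", "go", reward=1.0, next_state="s1",
    next_actions=["x", "y"], alpha=0.5, gamma=0.9,
)
```

With these illustrative numbers the target is 1.0 + 0.9 × 2.0 and the old estimate is zero, so the new value is half of the target.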

Bus network design and frequency setting (BNDFS) problem
The BNDFS problem has two components: bus network design and frequency setting. The network design finds a set of bus routes in a given area where pre-determined nodes of stops are provided. The bus routes provide transit services to passengers between origins and destinations. The second step is frequency setting for the bus routes determined in the previous step. The frequency setting typically assumes the maximum level of passenger demand on each route to calculate the number of bus services required to serve all the passenger demands. The frequency setting enables the calculation of the total travel time across the network as the sum of waiting time, transfer time and in-vehicle time for each passenger. The total travel time becomes a general indicator to evaluate the fitness of designed bus routes.
Essential inputs to the bus network design problem include the road network where transit is to operate, locations of bus stops, a travel time matrix between stops, and a passenger origin-destination matrix. Figure 2 presents an example of a small-sized road network. The road network can be expressed as follows:
$$G = (N, L)$$
where $N$ is a set of nodes; $L$ is a set of links.
In a bus network, nodes or the points in the figure represent bus stops, while links or the connecting lines in the figure represent roads linking the stops. The origin-destination matrix and travel time matrix are also expressed as:
$$D = \{\, d_{ab} \mid a, b \in N \,\}$$
$$TT = \{\, tt_{ab} \mid a, b \in N \,\}$$
where $d_{ab}$ is the number of passengers who wish to travel between stop $a$ and stop $b$; $tt_{ab}$ is the in-vehicle travel time via a direct link between two stops $a$ and $b$.
If no direct link exists between two stops, the in-vehicle travel times of the links along the path are added together ($\sum tt$). To solve the BNDFS problem, information must be obtained in advance on $G$, the road network; $D$, the demand matrix; and $TT$, the travel time matrix. The travel time matrix ($TT$) includes the travel time required for each direct link connected between two adjacent nodes.
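As a minimal sketch of how link travel times can be summed for stops without a direct link, the following uses the Floyd-Warshall algorithm on a tiny hypothetical network. The choice of the shortest path, the function name `all_pairs_travel_time` and the link times are assumptions for illustration; the source states only that link travel times are added together.

```python
import math


def all_pairs_travel_time(n_stops, link_times):
    """Return a matrix of minimum summed in-vehicle times between all stops.

    link_times maps an undirected link (a, b) to its travel time in minutes.
    """
    tt = [[0.0 if a == b else math.inf for b in range(n_stops)]
          for a in range(n_stops)]
    for (a, b), minutes in link_times.items():
        tt[a][b] = tt[b][a] = float(minutes)
    # Floyd-Warshall: allow each stop c in turn as an intermediate stop.
    for c in range(n_stops):
        for a in range(n_stops):
            for b in range(n_stops):
                if tt[a][c] + tt[c][b] < tt[a][b]:
                    tt[a][b] = tt[a][c] + tt[c][b]
    return tt


# Hypothetical four-stop network: stop 1 is connected to stops 0, 2 and 3.
links = {(0, 1): 8, (1, 2): 2, (1, 3): 3}
tt = all_pairs_travel_time(4, links)
```

Here stops 0 and 2 have no direct link, so their travel time is the sum of the two link times through stop 1.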
In general, the design of a transit network must aim to provide a high level of service satisfaction to passengers at a reasonable cost to the service operator. More specifically, the satisfaction of passengers is defined in this study by combining two terms: the total travel time of all passengers and the total unmet passenger demand. The unmet passenger demand is the total number of those passengers who cannot find any possible route from their desired origin to their destination. The optimisation's objective function is therefore defined as a minimisation of these two terms with a constraint of the number of required buses, which determines the operating cost. This can be expressed in mathematical form as follows:
$$\min A = \sum_{a} \sum_{b} t_{ab}\, d_{ab} + \beta \sum_{a} \sum_{b} \bar{d}_{ab}$$
subject to a limit on the total number of required buses, $\sum_{k} v_k$, where $A$ is the objective function of a bus transit network; $t_{ab}$ is the travel time between stop $a$ and stop $b$; $d_{ab}$ is the demand between stop $a$ and stop $b$; $\beta$ is the coefficient of the unmet demand; $\bar{d}_{ab}$ is the unmet demand between stop $a$ and stop $b$; $v_k$ is the number of required buses for route $k$.
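The two-term objective described above can be sketched as a weighted sum. The name `beta` for the coefficient of the unmet demand, the dictionary layout and all numbers are hypothetical; the fleet-size constraint is not checked here.

```python
def network_objective(travel_time, demand, unmet, beta):
    """Total passenger travel time plus a weighted unmet-demand term.

    travel_time, demand : dicts keyed by (origin, destination) for served pairs
    unmet               : dict of unserved passengers keyed by (origin, destination)
    beta                : assumed name for the coefficient of the unmet demand
    """
    total_time = sum(travel_time[od] * demand[od] for od in demand)
    total_unmet = sum(unmet.values())
    return total_time + beta * total_unmet


# Hypothetical values: two served pairs and one unserved pair.
travel_time = {("a", "b"): 12.0, ("a", "c"): 20.0}
demand = {("a", "b"): 100, ("a", "c"): 50}
unmet = {("b", "c"): 30}
value = network_objective(travel_time, demand, unmet, beta=50.0)
```

A larger coefficient makes leaving passengers unserved more costly relative to longer travel times, which is the trade-off the objective is meant to capture.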

Mathematical formulation and constraints
The transit network design is validated using selected indicators which also include the objective function as presented in the previous section. The sum of travel times of all passengers is computed for a given bus network. The travel time of a passenger is defined as the time period from the start of waiting at the origin or departure stop until the arrival at the destination stop. The travel time consists of three elements: waiting time, in-vehicle time and transfer time. The transfer time occurs when a passenger is required to change from one bus route to another one to reach the desired destination. The number of transfers is limited to one only in this study. The travel time from stop $a$ to stop $b$ via a transfer stop $c$ is expressed in mathematical form as follows:
$$t_{ab} = t_{wt,ab} + t_{inv,ab} + t_{tr,ab}$$
where $t_{wt,ab}$ is the total waiting time taken while travelling from stop $a$ to stop $b$; $t_{inv,ab}$ is the total in-vehicle time taken while travelling from stop $a$ to stop $b$; $t_{tr,ab}$ is the total transfer time taken while travelling from stop $a$ to stop $b$. These totals are built from $t^{a}_{wt}$, the waiting time at stop $a$; $t^{a,b}_{inv}$, the in-vehicle time between stop $a$ and stop $b$; and $t^{c}_{tr}$, the transfer time at stop $c$. A binary variable defined for each combination of $a$, $b$ and $c$ takes the value of 1 when the transfer occurs at stop $c$ while travelling from stop $a$ to stop $b$, and 0 otherwise; a penalty coefficient is applied at stop $c$ when a transfer occurs; $T$ is the converting unit of time (e.g., 60 min · vehicle); $f^{a,c}_{k}$ is the frequency of service along route $k$ [vehicle], connecting stop $a$ and stop $c$.
The waiting time varies by the service frequency of the desired bus route. Increasing service frequency to cope with high passenger demand will increase the running costs for the service operator as well. The required number of buses for route $k$ therefore depends on both the demand level and the service frequency.
Transit network design typically applies some constraints to produce design outcomes that are reasonably practical. First, the length of a bus route or the number of stops in a route must be restricted between the maximum and minimum allowable values. Bus routes that are too short serve only a very small number of passengers, and routes that are too long require long driving times for bus drivers, which could violate safety regulations on maximum driving time without a break. Second, the maximum number of bus routes needs to be constrained. Too many bus routes would make network design and service scheduling very complicated for both users and bus operators. Third, service frequencies must be higher than a minimum setting for passenger satisfaction (Cipriani et al. 2012). Fourth, the maximum passenger flows on a link must be limited to a reasonable setting. A link is a part of a route connecting two adjacent stops. The passenger flow on a link can be measured by summing all the passengers travelling on passing bus routes. Excessive passenger flows on a link could cause extreme congestion leading to service delays and increased travel time, unreliability and passenger discomfort. Lastly, the total fleet size should not exceed the maximum number of vehicles of the operator. The number of vehicles is a direct indicator of operating cost.
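The five constraints listed above can be collected into a simple feasibility check. The sketch below is illustrative only: the `Limits` fields, the function name and every threshold value are hypothetical, since the source does not give numerical limits at this point.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    min_stops: int        # shortest allowed route (number of stops)
    max_stops: int        # longest allowed route (number of stops)
    max_routes: int       # upper bound on the number of routes
    min_frequency: float  # lowest acceptable service frequency
    max_link_flow: float  # highest acceptable passenger flow on a link
    max_fleet: int        # vehicles available to the operator


def violated_constraints(routes, frequencies, link_flows, buses_per_route, limits):
    """Return a list describing every violated constraint (empty if feasible)."""
    problems = []
    if len(routes) > limits.max_routes:
        problems.append("too many routes")
    for k, route in enumerate(routes):
        if not limits.min_stops <= len(route) <= limits.max_stops:
            problems.append(f"route {k}: stop count out of range")
        if frequencies[k] < limits.min_frequency:
            problems.append(f"route {k}: frequency below minimum")
    if any(flow > limits.max_link_flow for flow in link_flows.values()):
        problems.append("link flow too high")
    if sum(buses_per_route) > limits.max_fleet:
        problems.append("fleet size exceeded")
    return problems


limits = Limits(min_stops=2, max_stops=8, max_routes=15,
                min_frequency=1.0, max_link_flow=500.0, max_fleet=20)
problems = violated_constraints(
    routes=[[0, 1, 2], [3]],
    frequencies=[4.0, 0.5],
    link_flows={(0, 1): 120.0},
    buses_per_route=[10, 12],
    limits=limits,
)
```

In this example the second route has a single stop and a low frequency, and the two routes together need more buses than the assumed fleet allows.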

Passenger assignment
Evaluating the fitness of a bus network design requires a passenger assignment method in advance to allocate passengers to each of the designed bus routes. This study adopted the passenger assignment method which has been used in past studies (Baaj and Mahmassani 1995;Mauttone and Urquhart 2009;Yan et al. 2013;Nikolić and Teodorović, 2014). This method allocates passengers to bus routes to minimise the number of service transfers required to reach the trip destination. Mandl's Swiss network is relatively small with only 15 nodes or stops. Figure 3 illustrates the overall procedures for passenger assignment to routes.
An OD pair between two stops $a$ and $b$ is the passenger demand travelling from $a$ to $b$. If any direct path exists between $a$ and $b$, the whole demand is allocated to the direct trip demand ($D_0$). If more than one direct path is available, the in-vehicle travel time of each direct path is computed. Any path incurring in-vehicle travel time more than 50% above the minimum value is rejected (Baaj and Mahmassani 1995). The demand is proportionally assigned to direct paths based on service frequencies (Nikolić and Teodorović 2014). This process reflects passenger preference for direct paths with shorter in-vehicle travel time and higher service frequencies. The demand share function from stop $a$ to stop $b$ by $n$ possible direct paths is expressed as:
$$d_{ab,k} = \frac{f_{ab,k}}{\sum_{j=1}^{n} f_{ab,j}}\, d_{ab}$$
where $d_{ab,k}$ is the distributed demand for path $k$ from stop $a$ to stop $b$; $f_{ab,k}$ is the service frequency of path $k$ from stop $a$ to stop $b$.
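The filtering and frequency-proportional split for direct paths can be sketched as follows. The function name, the tuple layout and the numbers are hypothetical; the `tolerance` parameter is set to 0.5 to match the 50% rule quoted above.

```python
def assign_by_frequency(demand, paths, tolerance):
    """Split demand over acceptable paths in proportion to service frequency.

    paths     : list of (in_vehicle_time, frequency) tuples, one per candidate path
    tolerance : a path is rejected if its in-vehicle time exceeds the
                shortest time by more than this fraction (0.5 means 50%)
    Returns a list of assigned demand, aligned with `paths`.
    """
    shortest = min(time for time, _ in paths)
    kept = [(i, freq) for i, (time, freq) in enumerate(paths)
            if time <= (1 + tolerance) * shortest]
    total_frequency = sum(freq for _, freq in kept)
    shares = [0.0] * len(paths)
    for i, freq in kept:
        shares[i] = demand * freq / total_frequency
    return shares


# Hypothetical direct paths: (in-vehicle minutes, buses per hour).
paths = [(20.0, 6.0), (26.0, 2.0), (31.0, 4.0)]
shares = assign_by_frequency(400, paths, tolerance=0.5)
```

The third path takes 31 minutes against a cut-off of 30 minutes, so it receives no demand, and the remaining 400 passengers are split 6:2 between the first two paths.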
To assign passengers to a transit path with one transfer, the method identifies all the bus routes connecting an origin and destination via any transfer node (stop). For one transfer, paths are attractive when their in-vehicle travel times are within a 10% tolerance compared to the shortest path. Passengers reject paths whose in-vehicle travel time is more than 10% above the minimum value (Baaj and Mahmassani 1995). The transit demand is assigned proportionally to the relative service frequency level of each bus route, which can be expressed as follows:
$$P_{ac} = \{ P_{ac,1}, \ldots, P_{ac,k}, \ldots, P_{ac,n} \} \tag{13}$$
$$d_{cb,k,l} = \frac{f_{cb,k,l}}{\sum_{m=1}^{n} f_{cb,k,m}}\, d_{ac,k} \tag{14}$$
$$P_{cb,k} = \{ P_{cb,k,1}, \ldots, P_{cb,k,l}, \ldots, P_{cb,k,n} \} \tag{15}$$
where $P_{ac}$ is a set of available paths from stop $a$ to a transfer stop $c$; $d_{ac,k}$ is the demand distributed for path $k$ from stop $a$ to a transfer stop $c$; $f_{ac,k}$ is the service frequency of path $k$ from stop $a$ to a transfer stop $c$; $P_{cb,k}$ is a set of available paths from a transfer stop $c$ to stop $b$ (continuing from the path $P_{ac,k}$); $d_{cb,k,l}$ is the demand distributed for path $l$ from transfer stop $c$ to stop $b$ (continuing from the path $P_{ac,k}$); $f_{cb,k,l}$ is the service frequency of path $l$ from transfer stop $c$ to stop $b$ (continuing from the path $P_{ac,k}$).
This study, and all the other previous studies in Table 3, adopted the same passenger assignment rule that assumed that passengers always attempt to reach their destinations by following the path requiring the fewest possible transfers (Baaj and Mahmassani 1995). Under this passenger assignment rule, passengers will not use a two-transfer path if a one-transfer (or direct) path exists. This study recognised that a maximum of one transfer is enough to serve all demands with reasonable travel times on the Mandl network. Therefore, it is not meaningful to consider two-transfer paths, which can increase the overall computational time needed to finalise the best network.

Algorithm formulation
Reinforcement Learning is reportedly highly effective for solving sequential decision-making problems such as BNDFS. Designing a bus network involves selecting bus stops in a sequential manner, so the order of the selected stops shapes the bus routes. The agent's actions can be categorised broadly into three components: adding stops, ceasing adding stops to the current route, and ending an iteration. At the beginning of an iteration, the agent starts the network and route design by selecting the first stop to form the first bus route. Additional stops might be added to the route sequentially. Each bus route must have at least two bus stops (i.e., an origin and a destination). The agent might continue forming the current route by adding more bus stops or complete the current route and then start forming the next route. The agent can cease adding stops when no more stops are available to connect. The network design ends when no more improvement can be made by adding an additional bus route and bus stop. The network design also ends when both the total numbers of bus routes and bus stops have reached a maximum at the same time. In this study, the maximum number of bus routes is defined as the total number of nodes in a planning area. For instance, the network being designed consists of 15 nodes in total, and so the maximum number of bus routes is also set as 15. The end of designing bus routes refers to the completion of one iteration (Fig. 4).
The agent explores and exploits the environment to build the best set of Q-values. Exploration investigates unexplored actions, whereas exploitation uses the current best actions. Every action is recorded with the current state and a reward value (Q-value). The state refers to the agent's current situation, which is defined by the indexed bus routes designed up to the present status, the number of selected bus stops for each route, and their selection order. The reward is the assessment result of an action taken by the agent based on the total passenger travel time and the percentage of the served transit demand. The total travel time is the sum of in-vehicle time, transfer time and waiting time. The waiting time at the origin or at the transfer stops is computed depending on the frequency of each route. The frequency of each route is always set simultaneously to evaluate an action of network design. The reward function is defined in terms of $t_t$, the total travel time of all passengers, and $d_{sat}$, the ratio of serviced transit demand.
Because all the actions in the same iteration are correlated with each other, any smaller reward values recorded for the previous states in the current iteration are replaced with the reward value calculated for the current state. This process leads the agent to select the best stops for bus routes during the exploitation phase. As mentioned earlier, a reward value is calculated for every action, and the rewards are recorded as Q-values. The discount factor always remains zero (refer to Eq. 3). The equation for updating the Q-value table is reformulated from Eq. (3) accordingly: the learning rate ($\alpha_t$) is equal to 1.0 if the calculated reward value ($r_t(s', a)$) is larger than the previously recorded Q-value ($Q^{old}_t(s', a)$); otherwise the learning rate ($\alpha_t$) equals zero. As an exception, the Q-value for the action of ending an iteration is the reward calculated for the current state reached by the action taken immediately before. The Q-value for the action of ending an iteration indicates whether any further sequential actions could still improve the network. The action candidates are ranked in descending order of reward value, and the agent selects the first-ranked action. If multiple action candidates are expected to produce the same reward value, priority is given in the following order: adding a stop to the current route, starting the design of a new route, and the end option. Adding a stop to the current route is given the highest priority because continuing the current route can improve passengers' convenience and reduce network complexity by avoiding transfer trips.
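The update rule and the tie-breaking order described above can be sketched as follows. If Eq. (3) is taken to be the standard Q-learning update (an assumption, since Eq. 3 is not reproduced here), a zero discount factor with a learning rate of either 1 or 0 reduces to keeping the larger of the stored Q-value and the newly calculated reward. The names `update_q`, `backfill` and `greedy_action`, the action labels and the dictionary-based Q-table are hypothetical choices for illustration; actions are simplified to their type and do not carry the selected stop.

```python
from typing import Dict, Hashable, List, Tuple

QTable = Dict[Tuple[Hashable, str], float]

# Tie-break order from the text: extend the current route first, then start
# a new route, then end the network design.
PRIORITY = {"add_stop": 0, "new_route": 1, "end": 2}
UNSEEN = float("-inf")  # assumed placeholder for actions not yet tried


def update_q(q: QTable, state: Hashable, action: str, reward: float) -> None:
    """Zero-discount update with a learning rate of 1 or 0."""
    old = q.get((state, action), UNSEEN)
    alpha = 1.0 if reward > old else 0.0
    if alpha == 1.0:
        q[(state, action)] = reward  # overwrite only when the reward is larger


def backfill(q: QTable, trajectory: List[Tuple[Hashable, str]],
             reward: float) -> None:
    """Replace smaller Q-values of earlier steps in the same iteration."""
    for state, action in trajectory:
        update_q(q, state, action, reward)


def greedy_action(q: QTable, state: Hashable, candidates: List[str]) -> str:
    """Pick the highest Q-value; break ties using PRIORITY."""
    return min(candidates,
               key=lambda a: (-q.get((state, a), UNSEEN), PRIORITY[a]))
```

With this rule a Q-value never decreases, so each entry stores the best reward observed so far for that state-action pair.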
The probabilities of exploration and exploitation are defined with the epsilon-greedy ($\epsilon$-greedy) strategy:

$$\text{Action at state } k = \begin{cases} \text{Exploration} & \text{with probability } \epsilon \\ \text{Exploitation} & \text{with probability } (1 - \epsilon) \end{cases} \tag{18}$$

The epsilon ($\epsilon$) value decreases over the learning process. The $\epsilon$-value at the $k$th iteration is expressed in terms of $I_{max}$, the pre-defined maximum number of iterations, and $I_k$, the number of the current iteration.

In early iterations, actions are more likely to be taken randomly (exploration); as the number of iterations increases, actions are more likely to be taken based on experience (exploitation). The learning process is terminated when the pre-defined maximum number of iterations is reached. The set of actions that has maximised the rewards is selected as the final network and route design. The action candidates include adding stops to the current route, designing a new route, and ending the network design.

Figure: Example of the action process in exploitation

In exploitation, the action of designing a new route is taken when no more benefit is foreseen from adding more stops to the current route. The action option of designing a new route can therefore optimise the length of each route. Similarly, the action of ending the network design is taken when no more benefit is foreseen from adding more stops to the current route or from adding another route. With the action option of ending the network design, the agent will not continue designing additional routes unless better networks are anticipated with further sequential actions. Therefore, the total number of bus routes is optimised. In our algorithm, the end of designing a network refers to the completion of one iteration, reaching a goal state. Once the goal state is reached, the agent is re-initialised to start by selecting the first stop of the first route in the next iteration. The overall optimisation procedure of the proposed algorithm is illustrated in Fig. 6.
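The exploration and exploitation behaviour described above can be sketched with a decaying epsilon. The decay schedule used here, a linear decrease from 1 to 0 over the iterations, is an assumption made purely for illustration and should not be read as the authors' schedule; the names `epsilon` and `choose_action` are likewise hypothetical.

```python
import random
from typing import Dict, List


def epsilon(k: int, i_max: int) -> float:
    """Assumed linear decay: 1.0 at iteration 0, 0.0 at iteration i_max."""
    return max(0.0, 1.0 - k / i_max)


def choose_action(candidates: List[str], q_values: Dict[str, float],
                  k: int, i_max: int, rng: random.Random) -> str:
    """Epsilon-greedy choice among the candidate actions at iteration k."""
    if rng.random() < epsilon(k, i_max):
        return rng.choice(candidates)  # exploration: random action
    # Exploitation: best recorded Q-value; untried actions rank last.
    return max(candidates, key=lambda a: q_values.get(a, float("-inf")))


rng = random.Random(0)
q_values = {"add_stop": 1.0, "end": 2.0}
print(choose_action(["add_stop", "end"], q_values, 10_000, 10_000, rng))
```

Under this assumed schedule the first iterations are almost entirely random, while the final iterations follow the recorded Q-values, which matches the qualitative behaviour described in the text.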

Experimental design
Validation of the algorithm is carried out on the well-known Mandl's Swiss network (Mandl 1980). Mandl's Swiss network consists of 15 nodes (stops) and 21 bidirectional links, as shown in Fig. 7. The road network contains the data of bus stop locations, link connectivity, link lengths and link travel times. The attributes on the links denote the travel times, and the travel times for both directions of a link are assumed to be the same. Mandl's network has been widely used as a benchmark for the BNDFS problem (Mandl 1980; Chakroborty 2003; Chew et al. 2013; Mumford 2013; Nikolić and Teodorović 2014; Buba and Lee 2018; Capali and Ceylan 2020). It has been a popular benchmark because the size of the network is manageable, and all the datasets required for solving the BNDFS problem are readily available. The theoretical minimum value of the total passenger travel time on Mandl's network was calculated at 155,790 min by Blum and Mathew (2010). Note that this can be achieved only when all passengers travel by bus along the shortest paths from their origins to their destinations without any waiting time or transfer time. However, this theoretical minimum of 155,790 min is impossible to achieve with a realistic transit network because every shortest path from each origin to each destination would require a dedicated bus route. A relatively high number of bus routes is therefore required for only 15 nodes. Vermeir et al. (2021) proved that 13 bus routes are needed to achieve 155,790 min total travel time without accounting for any waiting time and fleet size limits. If the total fleet size is limited, waiting time and transfer trips are unavoidable.
The total demand for bus trips in the Mandl network is 15,570 passengers. The origin and destination matrix of the demand in the bus network is presented in Table 1. The critical node pair with the highest demand is 880 passengers between stop 6 and stop 10 in each direction. Overall, 82% of the 225 demand pairs are non-zero and have some demand.
Instead of conducting 50,000 iterations at once, we performed five independent replications of 10,000 iterations each. Each replication produces one set of bus routes based on the agent's experiences, and the best result over the five replications is selected as the best bus network. This process helps prevent the agent from settling on a local optimum without a sufficient number of explorations. The experiments were run on a computer with an Intel Core i7-8650U CPU (1.9 GHz) and 16 GB RAM. A number of parameters and their values assumed in our experiments are identical to those of previously published experiments (Nikolić and Teodorović 2014; Buba and Lee 2018).

Fig. 7 Mandl's road network (Mandl 1980)

Table 1 Origin-destination passenger demand matrix for 15 stops on Mandl's network

The additional parameters required to run the algorithm include the maximum number of iterations and the exploration/exploitation probability ratio. The maximum iteration number of 10,000 indicates where each experiment ends. The exploration and exploitation probabilities represent the rate at which the agent's action is chosen randomly or not.

• Maximum number of iterations: 10,000 iterations
• Ratio of exploration and exploitation probabilities

This study uses eight performance metrics: percentage of demand served with a direct trip ($d_0$), percentage of demand served with a one-transfer trip ($d_1$), percentage of unmet demand ($d_{un}$), fleet size ($F_s$), total travel time of all passengers ($t_t$), in-vehicle time of all passengers ($t_{inv}$), waiting time of all passengers ($t_{wt}$) and transfer time of all passengers ($t_{tr}$). The primary performance criteria are the total passenger travel time ($t_t$) and the percentage of unmet demand ($d_{un}$). The total passenger travel time represents how promptly and efficiently a transit network provides service to passengers. The best transit network has the shortest total travel time without unmet transit demand.
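Selecting the best network across replications using the two primary criteria can be sketched as below. The `Metrics` container and the `select_best` helper are hypothetical names, and the lexicographic rule (lowest unmet demand first, then shortest total travel time) is one assumed reading of "shortest total travel time without unmet transit demand". The numbers in the example are toy values, not results from the experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    d0: float     # % of demand served with a direct trip
    d1: float     # % of demand served with a one-transfer trip
    d_un: float   # % of unmet demand
    fleet: int    # fleet size
    t_inv: float  # total in-vehicle time (min)
    t_wt: float   # total waiting time (min)
    t_tr: float   # total transfer time (min)

    @property
    def t_t(self) -> float:
        """Total travel time: in-vehicle + waiting + transfer time."""
        return self.t_inv + self.t_wt + self.t_tr


def select_best(results: List[Metrics]) -> Metrics:
    """Assumed rule: lowest unmet demand, then shortest total travel time."""
    return min(results, key=lambda m: (m.d_un, m.t_t))


# Toy values for three hypothetical replications.
replications = [
    Metrics(90.0, 10.0, 0.0, 99, 100.0, 20.0, 5.0),
    Metrics(95.0, 5.0, 0.0, 99, 90.0, 20.0, 5.0),
    Metrics(95.0, 4.0, 1.0, 99, 80.0, 10.0, 5.0),
]
print(select_best(replications).t_t)
```

The third toy replication has the shortest travel time but leaves 1% of demand unserved, so the sketch passes over it in favour of the second.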

Results and discussion
The bus networks designed by the algorithm are illustrated in Fig. 8. Each network consists of four bus routes, shown in different colours. When the maximum number of stops per route is restricted to 8, all four routes include stop 10 and three routes include stop 6. Stop 10 and stop 6 are the two busiest nodes, with the highest levels of transit demand (4145 and 1870 passengers, respectively). When the number of stops per route is limited to 15, all four bus routes contain the link between stop 6 and stop 8. This common link has likely emerged to serve the highest demand (880 passengers), which is between stop 6 and stop 10.
The sequences of stops designed by the algorithm are provided in Table 2, where the corresponding fleet size required for each route is given in column 4.
In the case of limiting the number of stops allowed per route to 8 nodes, the final set of bus routes can serve passengers with 202,074 min of total passenger travel time. The total passenger travel time includes the total in-vehicle time of 172,058 min, the total waiting time of 21,566 min, and the total transfer time of 8450 min.
In the case of limiting the number of stops allowed per route to 15 nodes, the four bus routes are expected to produce 198,273 min of total passenger travel time, or 12.73 min of average travel time per passenger. The theoretical minimum total in-vehicle travel time is 155,790 min (Blum and Mathew 2010). The algorithm resulted in 168,306 min of in-vehicle travel time, which is 8.0% more than the theoretical minimum. Other performance metrics include waiting time, the proportion of direct trips and transfer time. The results show that the proposed bus network can serve all the passengers (i.e., 0% unmet demand) with a total waiting time of 26,767 min, or 1.72 min of average waiting time per passenger.
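The per-passenger averages and the gap to the theoretical minimum quoted above follow directly from the reported totals, as the short check below shows. The variable names are illustrative; the transfer time is derived here as a residual, assuming the total travel time is the sum of in-vehicle, waiting and transfer time as defined earlier, and is not a figure quoted in the text.

```python
TOTAL_DEMAND = 15_570   # passengers on Mandl's network
T_MIN = 155_790         # theoretical minimum in-vehicle time (min)

# Reported totals for the 15-stop case (min).
t_total, t_inv, t_wait = 198_273, 168_306, 26_767

avg_travel = t_total / TOTAL_DEMAND     # average travel time per passenger
avg_wait = t_wait / TOTAL_DEMAND        # average waiting time per passenger
gap_pct = (t_inv - T_MIN) / T_MIN * 100  # excess over the theoretical minimum
t_transfer = t_total - t_inv - t_wait    # residual transfer time (derived)

print(f"{avg_travel:.2f} {avg_wait:.2f} {gap_pct:.1f} {t_transfer}")
# prints: 12.73 1.72 8.0 3200
```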
The optimisation results are given in Table 3, which also lists the routing optimisation results of several past studies (Mandl 1980; Chakroborty 2003; Chew et al. 2013; Mumford 2013; Nikolić and Teodorović 2014; Buba and Lee 2018; Capali and Ceylan 2020). It should be noted that those studies were tested under different constraints and settings (e.g., the maximum number of stops on each route or the passenger assignment method). A direct comparison of their optimisation performance is therefore not possible; the intention is merely to provide reference points demonstrating that the proposed algorithm can produce competitive results.
The proposed algorithm can determine the number of bus routes, their configurations (routes), and the service frequency for each route simultaneously. This could be a significant benefit in designing large-sized transit networks. To demonstrate this, the algorithm is tested under various arbitrary scenarios in the following sections (e.g., a required express route and demand growth).

Routing express service
Because transit demand varies at different stops along a bus route, a bus operational stopping strategy plays a pivotal role in improving the efficiency and reliability of transit systems. Implementing an express service alongside a regular bus service enhanced passenger satisfaction in a number of case studies, including Chicago, USA (Conlon et al. 2001), Montréal, Canada (El-Geneidy and Surprenant-Legault 2010), Santiago, Chile and Bogota, Colombia (Soto et al. 2017). This study further extended the proposed algorithm to implement additional express bus routes that provide services parallel to existing bus routes. Buses on express routes skip stops and serve only a few of them, while the parallel routes serve transit demand at all the intermediate stops.
Express service will decrease passengers' in-vehicle times and operational costs. However, an express service also increases waiting time for those passengers whose origin or destination stops are skipped. To enable a comparison with the previous network design, the algorithm is implemented to add the additional express route(s), if required, to Mandl's network by keeping the existing bus routes. For a fair comparison before and after the express service was implemented, this study limited the operational resources to the same fleet size of 99 buses for both the original network and the new network with an additional express route.
The extended Reinforcement Learning algorithm with the option of express routes took an average execution time of 1286 s to run through 1000 iterations, or about 128.6 s for every 100 iterations. Compared to Ai and Roman (2021), the extended algorithm is competitive in computational time with the genetic algorithm and hill climbing local search metaheuristics.

Table 3 Test results of routing algorithms
*The values are as obtained by applying the passenger assignment method used in this study to the resulting route sets reported by the authors of the previous studies

Fig. 9 The best bus network obtained by Reinforcement Learning algorithm including an express route

Figure 9 illustrates the modified network design produced by the algorithm. One express route has been added to connect stop 1, stop 2, stop 6 and stop 10. This new route can effectively improve the service to passengers travelling from stop 6 to stop 10, where the transit demand is highest at 880 passengers. Table 4 shows the fleet size allocated to each corresponding route before and after an express service has been added. Although the maximum total fleet size is unchanged at 99, the agent allocated 7 buses to the new express route by taking them from routes 1, 2 and 4. Note that the fleet size for route 2 has increased by 2 buses. Table 5 compares the selected metrics before and after an express service has been added to the network. The new express route serves 7.66% of the total passengers by using 7 buses. The reduced number of buses serving the four all-stop routes increased the passengers' total waiting time by 2205 min. However, the new service also reduced the total in-vehicle travel time by 5353 min, saving 3148 min overall. This result implies that the express route created by the algorithm can effectively improve the bus service without increasing resources or purchasing new vehicles.
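The net saving quoted above can be written out as a short worked equation. It uses only the two changes reported in the text and assumes that the total transfer time is unchanged, which the text does not state explicitly; $\Delta$ denotes the change after the express route is added.

```latex
\Delta t_t = \Delta t_{wt} + \Delta t_{inv}
           = (+2205) + (-5353)
           = -3148 \ \text{min}
```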

Additional test scenarios
The algorithm is further validated by applying it to three additional scenarios in which the transit demand and maximum fleet size are modified, as summarised in Table 6. Scenario A assumes that transit demand across the entire network increases by 50%. The transit demand is increased by 100% at stop 9 in Scenario B and by 100% at stop 10 in Scenario C. Stop 9 is selected for Scenario B because it is located at the corner of the network and has low transit demand. The area near stop 9 could represent the location of a large infrastructure development that increases demand, such as an airport or stadium. Note that the total number of passengers is the largest in Scenario C because stop 10 is the busiest bus stop in the network. Both the trips originating from and those finishing at the affected stop are modified. To accommodate the increased number of passengers, the maximum fleet size is adjusted to 129 buses for Scenario A and 119 buses for Scenarios B and C.
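The demand modifications behind the three scenarios can be sketched as simple operations on an origin-destination matrix. The helper names `scale_all` and `scale_stop_demand`, the zero-based stop index and the toy 3 x 3 matrix are assumptions for illustration; the sketch is not applied to the actual Mandl demand matrix here.

```python
from typing import List

Matrix = List[List[float]]


def scale_all(od: Matrix, factor: float) -> Matrix:
    """Scenario A style: scale every origin-destination pair."""
    return [[v * factor for v in row] for row in od]


def scale_stop_demand(od: Matrix, stop: int, factor: float) -> Matrix:
    """Scenario B/C style: scale trips starting or finishing at one stop."""
    scaled = [row[:] for row in od]  # copy so the input is left untouched
    for j in range(len(od)):
        if j != stop:
            scaled[stop][j] *= factor  # trips originating at the stop
            scaled[j][stop] *= factor  # trips finishing at the stop
    return scaled


def total(od: Matrix) -> float:
    return sum(sum(row) for row in od)


# Toy symmetric demand matrix for three stops (zero-based indices).
toy_od = [[0, 10, 20], [10, 0, 30], [20, 30, 0]]
print(total(scale_all(toy_od, 1.5)), total(scale_stop_demand(toy_od, 2, 2.0)))
```

Because both the row and the column of the affected stop are scaled, doubling demand at a busy stop raises the network total far more than doubling it at a quiet one, which is why Scenario C has the largest number of passengers.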
To enable a comparison with the previous network designs, the algorithm could make only three types of changes: (1) adjusting the frequency of existing bus routes; (2) adding local bus route(s) and adjusting frequency; or (3) adding express bus route(s) and adjusting frequency. All the experimental parameters for Reinforcement Learning remain the same as those presented earlier. Table 7 provides the validation results for selected performance metrics. The computational outcomes are compared to those of the benchmark network presented in Sect. 6.2. With the increased transit demand, the total travel time of passengers increased from 198,273 min to 298,909 min (+50.8%), 212,801 min (+7.3%) and 304,535 min (+53.6%) in Scenarios A, B and C, respectively, when no improvement is made to the service setting. Note that these rates of increase in travel time are similar to the rates of increase in the number of passengers (see Table 6).
Adjusting the service frequency setting was the first redesign option. Service frequency is determined by the number of bus vehicles allocated to each route. The updated fleet allocations are presented in Table 8. Scenario A assumes an increase in fleet size of 30 buses, from 99 to 129 (+30.3%). Since the passenger demand was increased by 50% across the whole network, the agent distributed the new buses almost evenly, in proportional terms, across the existing bus routes. The fleet size increased by 28.5%, 30.8%, 27.6% and 33.3% for routes 1, 2, 3 and 4, respectively. Both Scenarios B and C assume an extra 20 buses, and the agent allocated more of the new buses to the routes serving high-demand origin and destination pairs. In Scenario B, more buses were allocated to route 1 (+21.4%) and route 3 (+24.1%); these two routes connect stop 9 with other high-demand stops, such as stop 6 (1870 passengers) and stop 10 (4145 passengers). Similarly, more buses were allocated to route 2 (+31%) and route 3 (+24.1%) in Scenario C. Those two routes serve the two highest-demand bus stops, stop 10 and stop 6.

Adding more local route(s) is relatively ineffective in improving transit performance. The fleet sizes of existing routes decreased, which increased the waiting time of passengers compared to the other two options. Adding a new local service increased the passenger waiting time by 13.2%, 15.7% and 18.9% in Scenarios A, B and C, compared to only adjusting the service frequency. The agent added only one short local route in all three scenarios (see Fig. 10). This implies that additional local services are unlikely to reduce the total passenger travel time.
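The link noted above between fleet allocation and waiting time, where moving buses away from existing routes lengthens passenger waits, can be illustrated with a common textbook approximation. This is a sketch under stated assumptions rather than the waiting-time model used in this study: buses are assumed to be evenly spaced over a fixed round-trip (cycle) time, and passengers are assumed to arrive at random, so the mean wait is half the headway. The function names and the 60-minute cycle time are hypothetical.

```python
def headway_min(cycle_time_min: float, fleet: int) -> float:
    """Headway for evenly spaced buses on a route (assumed model)."""
    if fleet <= 0:
        raise ValueError("a route needs at least one bus")
    return cycle_time_min / fleet


def mean_wait_min(cycle_time_min: float, fleet: int) -> float:
    """Mean wait for randomly arriving passengers: half the headway."""
    return headway_min(cycle_time_min, fleet) / 2.0


# Hypothetical route with a 60-minute round trip.
print(mean_wait_min(60.0, 10), mean_wait_min(60.0, 8))
# prints: 3.0 3.75
```

In this toy case, removing two of ten buses raises the mean wait from 3.0 to 3.75 min, which illustrates why a new local route funded from the existing fleet can increase total waiting time.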
The express service option effectively improved transit performance, reducing total passenger travel time by between 3.07% and 3.94% compared to the benchmark. In Scenario A, the algorithm added one express route connecting stop 2, stop 6 and stop 10. In both Scenarios B and C, the additional express route connects stop 1, stop 6 and stop 10. These new express services appear to target passengers travelling between stop 6 and stop 10, where the transit demand is highest at 1320 passengers (Scenario A), 880 passengers (Scenario B) and 1760 passengers (Scenario C). Stop 1 or stop 2 is probably included in the additional express route because it has relatively higher transit demand than the other stops to which a route between stop 6 and stop 10 could be extended.
The validation results indicate that the algorithm can provide reasonably reliable network design and frequency optimisation under various test environments. The results show that the algorithm could be a robust tool to assist transit planners in designing a transit network and service, or in making incremental changes in response to changing traffic and demand conditions.

Conclusions
This study proposed a new approach to solve the bus network design and frequency setting problem. The algorithm uses Reinforcement Learning to optimise the number of bus routes, route shapes and service frequencies simultaneously. This could be a significant benefit in designing large-sized transit networks. To the best of the authors' knowledge, this simultaneous optimisation approach has not yet been tested in the literature. The algorithm was implemented on Mandl's Swiss benchmark network.
As transit agencies and planners must ensure that transit networks and services reflect people's rapidly changing travel needs, the proposed Reinforcement Learning algorithm was extended to assist in solving real-life problems. The extended algorithm could formulate an express service and redesign only selected network design and frequency components. The algorithm can be used to plan a new transit network, to improve selected routes or to set the service frequency to meet increasing transit demand across the whole network or parts of it.

Fig. 10 Bus network design: Scenarios A (left), B (middle) and C (right)

Future research directions include investigating the capability and reliability of the proposed algorithm on larger bus transit networks, including Mumford's benchmark network (Mumford 2013), where various needs of both passengers and operators must be addressed to design an efficient transit network. Reinforcement Learning can be combined with other supporting tools such as deep learning to address the limitations of its ordinary form (e.g., slow training for complicated environments or policies). For example, Chu et al. (2019) used Deep Reinforcement Learning for adaptive traffic signal control in large and complex traffic networks. Deep Reinforcement Learning was also used to provide a near-optimal inter-terminal truck route for container exchange between different terminals, which coped with hundreds and thousands of state-action spaces (Adi et al. 2020). The next step is expanding the algorithm to a large urban area with complex transit routes, including different types of services such as a trunk-and-feeder public transport system.