Constrained nested logit model: formulation and estimation

A model of traveller behaviour should recognise the exogenous and endogenous factors that limit the choice set of users. These factors impose constraints on the decision maker, which may be considered implicitly, as soft constraints imposing thresholds on the perception of changes in attribute values, or explicitly, as hard constraints. The purpose of this paper is twofold: (1) to present a constrained nested logit-type choice model that copes with hard constraints, derived from the entropy-maximizing framework; and (2) to describe a general framework for dealing with (dynamic) non-linear utilities, based on Reproducing Kernel Hilbert Spaces. The resulting model allows the dynamic aspect and the constraints on the choice process to be represented simultaneously. A novel estimation procedure is introduced in which the utilities themselves are viewed as the parameters of the proposed model, instead of the attribute weights of classical linear models. A discussion of the over-specification of the proposed model is presented. The model is applied to a synthetic test problem and to a railway service choice problem in which users choose a service depending on the timetable, ticket price, travel time and seat availability (which imposes capacity constraints). Results show (1) the relevance of incorporating constraints into choice models, (2) that the constrained models fit the data better than their unconstrained counterparts, and (3) the viability of the approach in a real case study of railway services on the Madrid–Seville corridor (Spain).


Introduction
Discrete choice models have long been recognized for their ability to capture a broad range of transport-related choice phenomena. For quite some time there has been growing research interest in traveller behaviour, exploring choice set formation and its representation. Proper modelling of user behaviour requires including the endogenous and exogenous factors that affect the decision-making process, as they induce constraints on choice set formation. Endogenous factors are inherent to the user and limit the universal choice set; exogenous factors instead originate in the decisions of other users and in the existing supply of goods or services.
Constraints imposed by endogenous or exogenous factors are called soft if they reduce the probability of choosing a given alternative but do not completely exclude it from the choice set. Hard constraints, by contrast, cannot be violated and impose the exclusion of the alternative. For example, in the choice of residential location a user may eliminate from the choice set all alternatives whose price is above a threshold (a hard constraint due to endogenous factors), or the utility of an alternative may drop when an attribute takes a value above or below a given threshold while the alternative remains available (a soft constraint due to endogenous factors). Analogously, the allocation of seats between railway services and the choices of other passengers (exogenous factors) may define the latent choice set for a specific user (a hard constraint). Note that exogenous factors may also induce soft constraints: for example, the vehicle capacity in transit services (subway or bus) does not impose an upper bound on the number of passengers but represents a source of discomfort.
As will be clarified in the literature review in the next section, most work concentrates on soft constraints used to model endogenous factors. Hard constraints have received less attention, despite their importance in modelling exogenous factors such as the interaction of supply and demand. If we are analysing a supply-demand equilibrium problem and wish to estimate a demand model, the available demand data are the result of the choices made by that demand under the limitations imposed by the current supply. Most applications simply ignore these effects; thus, forecasts of demand for a supply scenario different from the one used in the estimation are likely to violate constraints and miscalculate demand.
A notable example of the interaction of supply and demand is the problem of rolling out a refuelling infrastructure for Alternative Fuel Vehicles (AFVs), often described as the chicken-or-egg dilemma. The refuelling infrastructure imposes hard constraints on the user choice set: users with no effective access to a refuelling station will not contemplate buying an AFV. Most discrete choice models designed to analyse the introduction of AFVs have so far treated the infrastructure as just another attribute (a soft constraint). Fúnez-Guerra et al. (2016) study the problem for the Spanish case and show that a discrete choice model which does not treat the refuelling infrastructure as a hard constraint significantly overestimates AFV sales compared with a model which does. Hard constraints separate the effects of supply from those associated with demand, allowing changes in supply to be properly accounted for in forecasting. The main aim of this study is to explore an approach that can handle hard constraints in the decision-making process and to find a mathematical formulation for the specific case of (nested) logit models.
A second problem addressed in this study is how to introduce general non-linear utilities into this type of constrained model. This is especially important when considering dynamic phenomena. The machine learning community has made extensive and successful use of Reproducing Kernel Hilbert Spaces, and this paper adapts those techniques to the second problem. The challenge lies in the estimation, since there may be a parameter for each observation rather than one for each attribute. For this reason, and because constrained nested logit models do not admit a closed-form solution, we have developed a novel estimation method for this problem.

Literature review
As mentioned before, most of the literature concentrates on soft constraints used to model endogenous factors. Several theoretical frameworks have been proposed to account for these constraints. A rough taxonomy classifies the approaches followed in the literature into: 1. Implicit choice-set approaches: these methods incorporate thresholds in the perception/availability of an alternative within the random utility choice model; this has typically been carried out by introducing thresholds (cut-offs) as penalties in the utility function (see Swait 2001; Cascetta and Papola 2001; Elrod et al. 2004; Martínez et al. 2009; Bierlaire et al. 2010, among others). 2. Explicit choice-set approaches: these methods consist in a two-stage representation of decision making. In the first stage, choice-set generation is simulated: the decision-makers screen alternatives and eliminate from their choice sets those that do not reach the relevant attribute cut-off levels. In the second stage, the decision-makers choose, applying compensatory decision rules, only from the alternatives remaining in the reduced choice set (see Manski 1977; Swait and Ben-Akiva 1987; Ben-Akiva and Boccara 1995; Cantillo and Ortúzar 2005; Cantillo et al. 2006, among others).
The implicit approaches propose an extension to the linear compensatory utility model which accommodates both attribute cut-offs and cut-off violations in choice modelling. The major advantage of these methods is the lower computational time required. In particular, Swait (2001) incorporates attribute cut-offs into the utility maximization formulation, making it possible for the consumer to treat the constraints as soft by violating them at some cost; this approach assumes a linear penalty function. Cascetta and Papola (2001) propose the implicit availability/perception (IAP) model, in which the choice set is a fuzzy set where each element has a degree of membership. Elrod et al. (2004) propose and test a model of decision making that integrates variants of one compensatory and two non-compensatory (conjunctive and disjunctive) decision strategies, capable of providing probabilistic predictions for objects anywhere on a closed interval. Martínez et al. (2009) formulate the constrained multinomial logit (CMNL), which implements cut-offs as a binomial logit function embedded in multinomial logit models. The CMNL model is a heuristic based on convenient assumptions about the functional form of the utility function; it allows the choice domain to be constrained by as many cut-offs as required, bounding the variables both above and below. Later, Castro et al. (2013) studied the estimation of the CMNL model by maximum likelihood, finding the model suitable for general applications. Using real data, these authors found significant differences in the elasticities between compensatory MNL and semi-compensatory CMNL models.
In the explicit approach, the endogenous factors are brought together in latent choice set models, which consider a set of alternatives for each decision maker. The Manski model (Manski 1977) has served as the standard workhorse for discrete choice modelling with latent choice sets. The problem with this approach is that it requires enumerating the (exponentially many) possible choice sets; the estimation of the two-stage models is computationally intensive, and their severely restrictive assumptions impede practical application. Several studies (see Kaplan et al. 2009, 2012) relax the assumptions embedded in these models with respect to the number of alternatives and choice sets, the representation of threshold selection, and independently and identically distributed error terms across alternatives at the choice stage. Bierlaire et al. (2010) show on simple examples that the CMNL model is not adequate for modelling the choice set generation process consistently with Manski's framework. Although the results of Li et al. (2015) are consistent with the findings of Bierlaire et al. (2010), they differ in that Li et al. find the CMNL model can successfully recover both the cut-off and scale parameters of the choice set probability function, while Bierlaire et al. (2010) find that at most one can be recovered. Moreover, Paleti (2015) proposes higher-order approximations of the Manski approach, in which the CMNL model constitutes a first-order approximation. That work also carries out a simulation study and shows that additional orders of approximation offer incremental improvements in the quality of the parameter estimates.
Less attention is paid to exogenous factors. This problem is studied in the analysis of endogeneity in choice modelling (Louviere et al. 2005), where endogeneity refers to the fact that individual choice decisions may depend on one another. Ding et al. (2012) show in a behavioural experiment that some respondents are willing to take a utility penalty (soft cut-off) rather than eliminate an alternative when a cut-off violation occurs (hard cut-off). With exogenous constraints, however, such as social, temporal, spatial and resource constraints, fulfilment is mandatory.
Endogeneity may appear for a variety of reasons, such as the omission of unobservable variables or the ability of individuals to influence the formation of the choice sets. In this paper the second reason is analysed. Recently, De Grange et al. (2015) proposed a logit model that explicitly includes endogeneity in attributes (explanatory variables) due to network externalities or social interactions, tackling endogeneity with a fixed-point equation.
The previous models consider linear utility functions. In a dynamic choice modelling context, however, it is essential to consider non-linear utility functions when changes in demand trends are to be captured. Further support for non-linear specifications of the utility function is found in the work of Cherchi and Ortúzar (2002), who test different specifications of the utility function for a new train service design. These authors found the non-linear specifications more suitable: not only are better model results obtained, but the real distribution of the error terms is also revealed. In the context of health economics, Van Der Pol et al. (2014) show that welfare estimates are sensitive to the specification of the utility function: the willingness to wait (WTW) for hip and knee replacement varied considerably across patient profiles, and assuming a linear utility function led to much higher estimates of the marginal rates of substitution (WTWs) than non-linear specifications.
Despite the importance of specifying the utility function, this matter has received limited attention in the literature on dynamic choice modelling. Popuri et al. (2008) introduced continuous (dynamic) systematic utility functions for departure time modelling, via sinusoidal functions interacting with covariates. Anas (1983) demonstrated that information minimizing (entropy-maximizing modelling) and utility maximizing (behavioural demand modelling) should be seen as two equivalent views of the same problem, proving that the doubly-constrained gravity model is identical to a multinomial logit model of joint origin-destination choice. Donoso and de Grange (2010) give an interpretation of the entropy maximization problem in the context of microeconomic modelling, attempting to explain the origin of the two problems' equivalence.
In this paper we adopt the entropy-maximizing approach; a recent literature review of this area can be found in Swait and Marley (2013). The entropy-maximizing approach makes it possible to add non-linear constraints to the formulation. These constraints introduce new information into the forecasting process in order to represent the complex interrelations between the decisions of individuals. The key difference between our entropy maximization problem and those presented in Anas (1983), Donoso and de Grange (2010), Grange et al. (2013) and De Grange et al. (2015) is that general utilities are considered, which are not necessarily linear in the attributes.

Summary and contributions of this paper
This paper makes two major contributions.
• The paper proposes a novel approach for modelling people's behaviour through a discrete choice process considering the existence of hard constraints. Specifically, we propose a mathematical formulation of the nested logit model to cope with hard constraints. We provide a Monte Carlo simulation study on an application with hard constraints to show that the demand estimations performed by the logit-type models are in error when they predict new scenarios in which the hard constraints of the base estimation scenario are modified. This disadvantage does not show up in the constrained models.
• Secondly, we address the building of a general framework to specify dynamic non-linear utilities. The main feature of this approach is that the specification of the utility function is not centered on a specific functional form (linear, polynomial, sinusoidal, etc.) but on belonging to a specific space of functions, the so-called Reproducing Kernel Hilbert Spaces (RKHS). The essential advantage of this design is the search for the most suitable shape for the utility within the set of functions for the problem at hand. Furthermore, we propose a method for estimating the constrained nested model with general utilities based on the novel point of view of considering a subset of utilities as parameters for the estimation, instead of the classical weightings of the attributes. This approach has been illustrated with the modelling of the selection of railway services.
The paper is organised as follows. ''The constrained nested logit model'' section formulates the constrained nested logit model; in this section the RKHS are defined to represent generic utility functions, and the Tikhonov regularization method used to estimate these functions is explained. ''Estimation of the CNL model'' section discusses the procedure followed to estimate the constrained nested logit model with this type of non-linear utility. The fifth section numerically compares the logit-type models with their constrained counterparts and illustrates the methodology by solving a railway service selection problem. Finally, the last section concludes with a discussion of our findings and future work.

The constrained nested logit model
This section formulates the constrained nested logit model. The proposed scheme is utilized to introduce constraints in the users' individual decision-making processes which may influence their behaviour.

Formulation of the constrained nested logit
Logit-type discrete choice models are developed using random utility models (see McFadden 1974; Ortúzar and Willumsen 2011) or a maximum entropy optimization problem (see Anas 1983; Donoso and de Grange 2010; Grange et al. 2013; De Grange et al. 2015). The first approach obtains the multinomial logit model from the assumption that the random component of each utility function is independent and identically Gumbel-distributed. Anas (1983) gives a proof of the equivalence between the two frameworks. This section presents a constrained nested logit-type choice model derived from the entropy-maximizing framework. A selection process similar to the one shown in Fig. 1 is considered. This type of model represents a decision tree for the discrete choice problem, in which the root of the tree represents the first choice users of type $\ell \in L$ may make between alternatives (denoted by the index $m \in S_\ell$), and each branch of the tree contains another selection process in which users may select among the available sub-alternatives (denoted by the index $s \in S_\ell^m$). This type of model has been widely used in transport modelling (see Oppenheim et al. 1995; Ortúzar and Willumsen 2011; Fernández et al. 1994; García and Marín 2005).
It is assumed that there exist various types of individual $\ell \in L$. The parameter $\bar g_\ell$ represents all the individuals of type $\ell$ who interact with the system, the variable $g_{m\ell}$ is the number of individuals of type $\ell$ who choose alternative $m$ at the first level, and the variable $g^{m\ell}_s$ is the number of individuals of type $\ell$ who choose alternative $s$ at the second level, given that they chose $m$ at the first level. The entropy maximization problem allows the inclusion of constraints, leading to the Constrained Nested Logit model (CNL). These constraints may introduce new factors into the forecasting process which could influence the behaviour of the users. The CNL model is formulated as the entropy maximization program

$$\max_{g \ge 0} \; \sum_{\ell \in L} \sum_{m \in S_\ell} \sum_{s \in S_\ell^m} V^{m\ell}_s\, g^{m\ell}_s \;-\; \sum_{\ell \in L} \frac{1}{\lambda^\ell_1} \sum_{m \in S_\ell} g_{m\ell}\,(\ln g_{m\ell} - 1) \;-\; \sum_{\ell \in L} \sum_{m \in S_\ell} \frac{1}{\lambda^{m\ell}_2} \sum_{s \in S_\ell^m} g^{m\ell}_s\,(\ln g^{m\ell}_s - 1)$$

$$\text{s.t.} \quad \sum_{m \in S_\ell} g_{m\ell} = \bar g_\ell \;\; \forall \ell, \qquad \sum_{s \in S_\ell^m} g^{m\ell}_s = g_{m\ell} \;\; \forall m, \ell, \qquad \hat h_r(g) \le \hat b_r \;\; \forall r,$$

where $V^{m\ell}_s$ are the systematic utilities, and $\lambda^\ell_1$ and $\lambda^{m\ell}_2$ are scalars associated with the variance of the error terms of the utilities. Moreover, $U_\ell$, $H_{m\ell}$ and $\mu_r$ are the dual variables associated with the constraints.
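To make the construction concrete, the single-level version of this program can be solved numerically: maximizing the entropy objective subject to the flow constraint recovers exactly the multinomial logit probabilities, which is the equivalence (Anas 1983) the CNL builds on. The utilities and scale below are illustrative values, not taken from the paper, and a generic solver stands in for a dedicated CNL code:

```python
import numpy as np
from scipy.optimize import minimize

# Entropy-maximization view of the logit model (single level, one user class):
#   maximize  sum_m g_m * V_m - (1/lam) * sum_m g_m * (ln g_m - 1)
#   subject to sum_m g_m = 1.
V = np.array([1.0, 0.5, -0.2])   # systematic utilities (assumed values)
lam = 1.5                         # scale parameter (assumed value)

def neg_objective(g):
    return -(g @ V - (1.0 / lam) * np.sum(g * (np.log(g) - 1.0)))

res = minimize(neg_objective, x0=np.full(3, 1 / 3), method="SLSQP",
               bounds=[(1e-9, 1.0)] * 3,
               constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}])

# The optimum coincides with the multinomial logit probabilities:
logit = np.exp(lam * V) / np.exp(lam * V).sum()
print(res.x, logit)  # the two vectors agree to solver tolerance
```

Adding the hard constraints $\hat h_r(g) \le \hat b_r$ to this program is what turns the nested logit into the CNL.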
The first and second constraints of the model express the logical requirements that the sum of users across the branches of the first level must equal the total number of users, and that within each branch the sum of users must equal the number of users who previously selected that branch. The remaining constraints are the hard constraints imposed upon the choice process; they are formulated mathematically via the functions $\hat h_r$ and the parameters $\hat b_r$, which are known and depend on the problem to be solved.
We now give some examples of the possibilities opened up by the inclusion of these constraints. Suppose each class $\ell$ of users consists of only one individual $i$, $\ell = \{i\}$. In this case $\bar g_\ell = 1$ and the variables $g_{m\ell}$, $g^{m\ell}_s$ represent, respectively, the probability that individual $\ell$ chooses alternative $m$ and the probability that, having chosen alternative $m$, it chooses sub-alternative $s$. The first observation is that the model takes all user decisions collectively, and so can capture the interactions between the decisions of users.
Example 1 (Exogenous factors: capacity constraints) Suppose we are modelling the flight choices of airline users. At the higher level users decide between standard and low-cost companies; at the lower level they choose between the different flights available. Each flight $s$ has a capacity limit given by its number of seats $K_s$, and there exists a set of users $L_s$ who are likely to choose flight $s$ because of the origin-destination of their trip. In this problem capacity constraints will be active on many flights, especially low-cost ones, and this will significantly affect user choice. This leads to the constraint $\sum_{\ell \in L_s} g^{m\ell}_s \le K_s$ being imposed on the demand estimation, which states that the expected number of passengers (the left-hand side) on flight $s$ cannot exceed the number of seats available. For some users the choice set will be affected by capacity and they must choose from among the remaining options. In this case, the demand model embeds an equilibrium problem in which all users compete against one another for tickets.
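A minimal numerical sketch of this capacity coupling, with assumed numbers (two user types, two flights, one shared seat constraint) and a generic solver standing in for the CNL program:

```python
import numpy as np
from scipy.optimize import minimize

# Two user types choosing between two flights; flight 0 has a shared seat
# capacity K0. Entropy-max objective per type, coupled only through the
# capacity constraint. All numbers are illustrative, not from the paper.
V = np.array([[2.0, 0.0],       # utilities of flights 0 and 1 for type A
              [1.5, 0.0]])      # ... for type B
g_bar = np.array([60.0, 50.0])  # number of users of each type
lam, K0 = 1.0, 70.0             # scale parameter; seats on flight 0

def neg_obj(x):
    g = x.reshape(2, 2)
    return -np.sum(g * V - (1.0 / lam) * g * (np.log(g) - 1.0))

cons = [
    # each type's choices sum to its population:
    {"type": "eq", "fun": lambda x: x.reshape(2, 2).sum(axis=1) - g_bar},
    # shared hard constraint: total demand on flight 0 cannot exceed K0:
    {"type": "ineq", "fun": lambda x: K0 - x.reshape(2, 2)[:, 0].sum()},
]
res = minimize(neg_obj, np.full(4, 25.0), method="SLSQP",
               bounds=[(1e-9, None)] * 4, constraints=cons)
g = res.x.reshape(2, 2)
print(g, g[:, 0].sum())  # total demand on flight 0 is capped at K0
```

Without the inequality the two types would together demand more than 70 seats on flight 0; with it, the constraint binds and the excess demand is redistributed, which is precisely the equilibrium effect described above.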
Example 2 (Exogenous factors: limited-availability products) Suppose a set of consumers make purchases over a period of time, ordered according to the instant at which they purchase, $\ell_1 < \cdots < \ell_n$. If the available resources of each product (alternative $s$) are limited, a product may run out after a certain time, after which that alternative is no longer available to later consumers. Let $K^\ell_s$ be the number of available units of product $s$ when purchaser $\ell$ makes the purchase. The constraint $g^{m\ell}_s \le K^\ell_s$ must then be satisfied. In this example the optimization problem CNL is separable for each individual $\ell$, leading to $n$ independent problems in which there is no interaction between individuals.
The parameters $K^\ell_s$ are unknown in many practical situations, and in their place the initial quantity $K_s$ of product $s$ is known. In this case constraint (2) is replaced with an estimation of the available capacity: $K^{\ell_{j+1}}_s = K_s - \sum_{k=1}^{j} g^{m\ell_k}_s$. The decisions of consumers are affected by the decisions of those preceding them, which leads to an iterative solution process for the CNL: suppose the variables $g^{m\ell_k}_s$ with $k = 1, \ldots, j$ are known; we then evaluate the right-hand side of Eq. (3), solve the CNL problem to calculate the choice probabilities $g^{m\ell_{j+1}}_s$ of consumer $\ell_{j+1}$, and iterate the process again.
Example 3 (The polarized logit model) Grange et al. (2013) propose the so-called polarized logit model, which consists of introducing one instrumental constraint into the MNL. The motivation for this constraint is to force the predicted choice probabilities towards values of 0 or 1. The polarized logit model may be extended to nested logit models by introducing the corresponding constraints into the CNL.

Example 4 (Endogenous factors) Martínez et al. (2009) discuss potential applications of constrained logit models. In modelling the transport system, the constraints take into account the endogenous factors associated with thresholds in the attributes, such as a minimum activity level at the destination for attracting trips, maximum waiting time, maximum travel expenditure, and access times to public transport. Examples in location and land-use modelling are housing choices constrained by the income budget, or relevant location options where the cut-offs help to model the scope of an individual's spatial search.
Assume that the utility function depends on a set of $K$ attributes, denoted by the vector $X$. Each alternative $s$ is characterised by the vector of attributes $(\cdots, X_{s,k}, \cdots)$. A user of type $\ell$ endogenously screens the universal choice set and eliminates all alternatives whose attribute vector lies outside the consumer's choice domain. For example, the user may eliminate the alternatives with a price higher than a self-imposed maximum expenditure $\hat b^\ell_k$. The set of constraints defines the individual's feasible domain.
The binary variable $\delta^\ell_s$ indicates the validity of alternative $s$ for user type $\ell$. Lower cut-offs can be introduced analogously in the CNL model.
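The screening described above can be written as a simple vectorized test; the attribute values and cut-offs below are purely hypothetical:

```python
import numpy as np

# Hard-cutoff screening of the universal choice set: delta[l, s] = 1 iff
# every attribute of alternative s lies within user type l's upper bounds.
# Attribute values and cut-offs are illustrative only.
X = np.array([[30.0, 45.0],    # alternative 0: price, travel time
              [55.0, 25.0],    # alternative 1
              [80.0, 20.0]])   # alternative 2
b_upper = np.array([[60.0, 50.0],   # type 0: max price, max travel time
                    [90.0, 30.0]])  # type 1

delta = (X[None, :, :] <= b_upper[:, None, :]).all(axis=2).astype(int)
print(delta)  # rows: user types; columns: alternatives
```

Lower cut-offs would add a symmetric `>=` test against a matrix of lower bounds before taking the conjunction.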

Equilibrium issues
The CNL model represents an equilibrium between users: they compete for the existing resources (alternatives) considering their preferences and the imposed constraints, such as the capacity of the system, which may affect their choices. This leads to an equilibrium situation in which no user can improve their utility by selecting a different alternative. ''Appendix 1'' proves that the solution of the CNL satisfies the classic nested logit probability equations, but shows that the constraints introduced in the CNL model penalize the utilities of the lower level, modifying the demand forecast depending on which constraints are active. A function of the multipliers, denoted $W^{m\ell'}_s$, can be interpreted as the shadow price that user type $\ell'$ must pay for choosing alternative $s$: alternative $s$ consumes scarce resources that a set of users must compete for, and this is the price, expressed in terms of utility, that users are prepared to pay to choose that alternative.

Non-linear utility specifications using Reproducing Kernel Hilbert Spaces
Linear-in-attributes utility is the most commonly used approach in the literature. In this paper the temporal nature of the attributes is considered, so non-linear utilities appear better suited to the problem. We propose a framework to specify non-linear utility functions based on Reproducing Kernel Hilbert Spaces (RKHS); a quick introduction to RKHS may be found in Daumé (2004). We begin with the following definition: a Hilbert space of functions that admits a reproducing kernel is called an RKHS. The reproducing kernel of an RKHS is uniquely determined. Conversely, if a function $K : X \times X \to \mathbb{R}$ is positive definite and symmetric (a Mercer kernel), then it generates a unique RKHS in which the given function acts as the reproducing kernel.
We assume that the systematic utility function $V^\ell : X \subseteq \mathbb{R}^p \to \mathbb{R}$, where $X$ is the feasible set for the attributes, belongs to a given RKHS $H_K$. For simplicity only one type of user $\ell$ is considered, so this index is dropped in this subsection. This assumption leads to the reproducing relationship $V(x) = \langle V, K(x, \cdot) \rangle_{H_K}$, so the utility function can be expressed as a linear combination of the basis of the space $H_K$. The kernel function defines a basis $\{K(\cdot, y)\}_{y \in X}$ of the vector space $H_K$, and thus

$$V(x) = \sum_{y} \alpha_y K(x, y). \qquad (9)$$

Note that the choice of the kernel function $K(x, y)$ plays the same role as the selection of the functional form of a non-linear utility function. It is convenient to consider reproducing kernels that lead to spaces $H_K$ containing a large range of functions, in order to represent the utility function $V(x)$ appropriately. Eq. (9) gives the functional form of the utilities. Traditionally in the literature the weightings $\alpha_y$ are the parameters of the model and the utilities are computed from the estimated parameters. In this paper we interchange these roles. To illustrate this, consider the linear utility function

$$V(x) = \alpha_0 + \alpha^T x. \qquad (10)$$

The most common steps followed in the estimation of the utilities are: first, know the attributes of the alternatives $x_s$ with $s \in \cup_m S_m$; second, estimate the parameters $(\alpha_0, \alpha^T)$ of the utility function (10); and third, calculate the utilities (see Ben-Akiva and Boccara 1995) by evaluating the utility function at the attribute values, i.e. $V^m_s = V(x_s)$. In our approach the parameters to be calibrated are a subset of utilities $V^m_s$, and the weightings $\alpha_y$ are computed from these estimated parameters. This view leads to a different order in the estimation process of the utility function: Step (i), know the attribute values of the alternatives; Step (ii), estimate the utilities $V^m_s$ for a subset of alternatives; and Step (iii), calculate the parameters $(\alpha_0, \alpha^T)$ from the estimated utilities.
The fact that the CNL requires the parameters $V^m_s$ in order to be used focuses the estimation process on these parameters, so for the alternatives in $D_1$ it is not necessary to know the functional form of the utility or the attributes which are relevant for the decision maker. Moreover, the parameters $V^m_s$ have a clear interpretation, whereas the interpretation of the weightings $\alpha_y$ is unclear.
Tikhonov regularization theory allows Step (iii) to be carried out for utility specifications based on RKHS. We now briefly describe discrete Tikhonov regularization by RKHS for the problem at hand. The general theory of Tikhonov regularization is explained in the book by Tikhonov and Arsenin (1977), and the general theory of RKHS is developed in Aronszajn (1950).
Assume that the utilities of a subset of alternatives $D_1 \subseteq \cup_m S_m$ are known, and let $n$ be the cardinality of $D_1$. We denote these estimates by $\{U_s\}_{s \in D_1}$; note that they are not denoted by $V$ because they may contain certain random errors. Let $\{x_s\}_{s \in D_1}$ be the set of attributes of the utilities to be estimated and let $W_n = \{(x_s, U_s)\}_{s \in D_1}$ be a random sample. Tikhonov regularization theory considers the function space $\mathrm{span}\{K(x_s, \cdot) : s \in D_1\}$, where span denotes the linear hull, and projects $V(x)$ onto this space by using the sample $W_n$. It makes a stable reconstruction of $V(x)$ by solving the following optimization problem:

$$V^* = \arg\min_{V \in H_K} \; \frac{1}{n} \sum_{s \in D_1} \big(U_s - V(x_s)\big)^2 + \gamma \|V\|_{H_K}^2, \qquad (13)$$

where $\gamma > 0$. The representation theorem gives a closed-form solution $V^*(x)$ of the optimization problem (13). This theorem was introduced by Kimeldorf and Wahba (1970) in a spline smoothing context and has been extended and generalized to the problem of minimizing risk functionals in RKHS; see Schölkopf et al. (2001) and Cox and O'Sullivan (1990).
Theorem 1 (Representation) Let $W_n$ be a sample of $V(x)$, let $K$ be a (Mercer) kernel and let $\gamma > 0$. Then there is a unique solution $V^*$ of (13), which admits the representation

$$V^*(x) = \sum_{j=1}^{n} \alpha_j K(x, x_{s_j}), \qquad (14)$$

where $\alpha = (\alpha_1, \ldots, \alpha_n)^T$ is the solution of the system of linear equations

$$(\gamma n I_n + K_x)\,\alpha = U, \qquad (15)$$

where $I_n$ is the $n \times n$ identity matrix, $U = (U_{s_1}, \ldots, U_{s_n})^T$ and the matrix $K_x$ is given by $(K_x)_{ij} = K(x_{s_i}, x_{s_j})$. The following example illustrates numerically how to carry out Step (iii) using Tikhonov regularization.
Example 1 (A numerical example of the use of Tikhonov regularization) Suppose that the utilities have been estimated for a given set of attributes $x_s$ (in this case, the departure time $t$ of a transit service), with the data shown in Table 1.
In this approach no functional expression is specified for the utility function; instead, it is assumed to belong to an RKHS. For example, we consider the RKHS $H_{K_1}$ with a Gaussian reproducing kernel $K_1$ and $H_{K_2}$ with a multiquadric kernel $K_2$, with fixed values for the kernel parameters. These selections lead to two estimated utility functions of the form (14). To calculate the parameters $\alpha$, the linear system (15) must be solved; the value $\gamma = 0.001$ has been used. The values of $\alpha$ obtained by this method can be seen in Table 2, and Fig. 2 represents both utility functions. In the first case the space $H_{K_1}$ is considered, while the second case considers the space $H_{K_2}$, obtaining different utility functions starting from the same set of data. In conclusion, this process allows a non-linear utility function to be derived from a set of known utilities.
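The steps of the example can be sketched as follows. The departure times, utility estimates and kernel parameters are illustrative stand-ins (Table 1's values are not reproduced here); the linear solve implements the system $(\gamma n I_n + K_x)\alpha = U$ of the representation theorem:

```python
import numpy as np

# Tikhonov-regularized reconstruction of a utility function from a few
# estimated utilities: solve (gamma * n * I + K_x) a = U, then
# V*(x) = sum_j a_j K(x, x_j). Data below are illustrative only.
t = np.array([7.0, 8.0, 9.5, 12.0, 18.0])   # attribute: departure time
U = np.array([0.2, 1.0, 0.6, -0.3, 0.8])    # estimated utilities (assumed)
gamma, sigma = 1e-3, 2.0                     # ridge and kernel width

def k_gauss(x, y):                 # Gaussian reproducing kernel
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def k_multiquadric(x, y, c=1.0):   # multiquadric kernel
    return np.sqrt((x - y) ** 2 + c ** 2)

def fit(kernel):
    n = len(t)
    Kx = kernel(t[:, None], t[None, :])                 # kernel matrix
    a = np.linalg.solve(gamma * n * np.eye(n) + Kx, U)  # system (15)
    return (lambda x: kernel(x, t) @ a), a              # V*(x) and weights

V1, a1 = fit(k_gauss)
V2, a2 = fit(k_multiquadric)
print(V1(8.0), V2(8.0))  # two different utility curves through the same data
```

As in the example, the two kernels yield two different non-linear utility curves from the same set of known utilities, so the kernel choice plays the role of the functional-form choice.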
We end the section with the following remark.
• Remark 1 (Functional expression) It is worth noting that calculating the vector of parameters $\alpha$ by solving the system of equations (15) yields the analytical expression of $V(x)$, Eq. (14), which makes it possible to calculate the marginal utilities $\partial V^* / \partial x_s$. This allows the subjective values of travel time (SVT) to be calculated (see Jara-Díaz).

Estimation of the CNL model

The CNL parameters are $V$, $\lambda = (\cdots, \lambda^\ell_1, \cdots, \lambda^{m\ell}_2, \cdots)$, $\hat b_r$ and $\bar g_\ell$. It is assumed that the upper bounds $\hat b_r$ of the constraints and the totals $\bar g_\ell$ are known; the remaining parameters $(V, \lambda)$ must be estimated. In this section a generic estimation methodology is described. As the CNL is a strictly convex program (assuming the functions $\hat h_r$ are convex), it has a single optimum, so the CNL implicitly defines a function $g = \mathrm{CNL}(V, \lambda)$ which gives the disaggregation of the demand by alternatives for each pair $(V, \lambda)$. The main idea of the approach presented in this paper for estimating the CNL model is to select and estimate a subset of utilities $\hat V_1$, from which the other utilities $\hat V_2$ are calculated, repeating this process iteratively and changing the values of $\hat V_1$ so as to bring the demand estimated by the CNL close to the real known values.
Assume a sample of N decision-makers, where N_ℓ is the number of individuals of type ℓ. Suppose also that the number of individuals of type ℓ who select alternative s ∈ S^ℓ_m is denoted by N^{mℓ}_s and is known for a set of combinations (s, ℓ) ∈ D_0. Denote N = (..., N^{mℓ}_s, ...) with (s, ℓ) ∈ D_0, and assume that the vector of attributes for each alternative s is known and denoted by x_s. Let D_1 be the subset of alternatives (s, ℓ) for which the utility will be estimated, and let D_2 be the alternatives for which the utility will be calculated from the estimated function V*ℓ(x) using the Tikhonov regularization theory described in ''Non-linear utility specifications using Reproducing Kernel Hilbert Spaces'' section. The set of alternatives is thus decomposed into D_1 and D_2. In the first stage the utility vector is estimated, and in the second stage the utility function V*ℓ(·) is calculated using Eqs. (19) and (15) for each ℓ and all non-estimated utilities U^{mℓ}. Using Eq. (23), Eq. (20) can be rewritten, and the estimation problem can finally be stated as a minimization in which F is a similarity function between the demand predicted by the CNL model, g, and the observed values, N. It is worth noting that the parameters to be estimated are the utility vector V̂_1 and the vector of scale parameters λ. A widely used choice for F(g, N) is the minus log-likelihood (LL) function, which leads to the maximum likelihood (ML) estimation problem. In some cases, as in numerical Experiment 2 of this paper, the disaggregated values by alternatives N^{mℓ}_s are not known; in these cases the least squares method can be adapted to the data.
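Both choices of similarity function F(g, N) mentioned above can be sketched directly; the function names and the toy counts are illustrative:

```python
import numpy as np

# Two similarity functions F(g, N) between the demand g predicted by the
# CNL and the observed counts N. Names and data are illustrative.

def minus_log_likelihood(g, N):
    """Minus LL: predicted shares are the normalized predicted demand."""
    p = np.asarray(g, dtype=float)
    p = p / p.sum()
    return -np.sum(np.asarray(N) * np.log(p))

def least_squares(g, N):
    """Least-squares alternative, usable with aggregate data only."""
    return float(np.sum((np.asarray(g, dtype=float) - np.asarray(N)) ** 2))

N = np.array([30.0, 50.0, 20.0])
# The minus LL is minimized when predicted shares match observed shares:
print(minus_log_likelihood(N, N) <= minus_log_likelihood([1.0, 1.0, 1.0], N))  # True
```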
The likelihood maximization or generalized least squares estimation is achieved by embedding the computation of F within a non-linear optimization framework, as shown in Fig. 3. The estimation problem of the CNL is thus formulated as a bi-level optimization model, and Fig. 3 shows the application of derivative-free optimization methods to this problem. This calculation scheme is conceptual and admits many implementations. Convergence to the optimal parameters is not guaranteed and will depend on the derivative-free optimization method applied.
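A minimal sketch of this bi-level scheme follows, with a closed-form logit share standing in for the inner CNL solve (the real inner problem is the convex program described earlier) and Nelder-Mead as the derivative-free outer search; all names and data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Conceptual sketch of the bi-level scheme of Fig. 3: an outer
# derivative-free search over the utilities (treated as parameters)
# wrapping an inner demand model. The inner CNL solve is replaced here
# by a closed-form logit share; names and counts are illustrative.

N = np.array([120.0, 60.0, 20.0])   # observed counts per alternative
total = N.sum()

def inner_model(V):
    """Stand-in for solving the CNL: logit demand for utilities V."""
    e = np.exp(V - V.max())
    return total * e / e.sum()

def outer_objective(V):
    g = inner_model(V)                      # predicted demand
    return -np.sum(N * np.log(g / total))   # minus log-likelihood F(g, N)

res = minimize(outer_objective, x0=np.zeros(3), method="Nelder-Mead")
g_hat = inner_model(res.x)
print(np.round(g_hat))   # approximately the observed counts
```

In the paper's setting each outer evaluation requires solving a full CNL program, which is why the computational burden discussed later matters.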
It is worth noting that there exist infinitely many solutions to the estimation problem (25). Ben-Akiva and Lerman (1985) indicate that discrete choice models have two sources of over-specification. The first is scale indeterminacy: the units of the underlying latent utilities are not observed. The second is origin indeterminacy: the zero of the latent utility scale is not observable, so one must be set. The sources of over-specification of the parameters of the proposed CNL are explained in ''Appendix 2''.

Numerical analysis
This article describes a methodology for introducing constraints into the Nested Logit (NL) model and a further methodology for estimating it. The first key question to be analysed is how omitting the constraints, where they are relevant to the problem, affects the accuracy of the predictions of the NL model. The second step is to assess the estimation methodology set out in Sect. 4. The estimation problem to be solved has a bi-level nature in which, to evaluate the objective function, the CNL model must be solved. The proposed method is conceptual and does not prescribe any given optimization algorithm. The only assumption about the algorithm is that it be derivative-free; as the problem is bi-level, there is no guarantee of convergence. Moreover, the bi-level nature presents the challenge of addressing the computational burden in real applications.
These questions have been analysed in the following two experiments.
• Experiment 1 This experiment was carried out on synthetic data. The objective is to compare the logit model and the nested logit model with its constrained counterparts.
As well as checking numerically that the constrained models display a better fit to the data, we show that the predictions of the unconstrained models may not be reliable in scenarios where the constraints are changed. • Experiment 2 This experiment is a real application consisting of fitting the CNL model to the rail service choice problem. In this experiment we describe a metaheuristic estimation procedure based on hybridizing Particle Swarm Optimization and the Nelder-Mead method.

A simulation study (Experiment 1)
An important class of problems in which constraints appear in demand modelling is dynamic pricing with limited inventories (Boer 2015). This type of model enables certain firms, such as those in the airline industry, to increase revenue by better matching supply with demand. Chen and Chen (2015) consider that most dynamic pricing problems share three main characteristics: (1) products are typically time-sensitive with a fixed selling season, (2) a given, finite amount of inventory of a product is available at the beginning of the selling season and (3) multiple prices are applied during the selling season. Feature (2) imposes (exogenous) capacity constraints on the consumers.
In this section we consider the fitting of a demand model to a dynamic pricing problem. The aim is to assess how the constraints affect user choices. For this reason we have greatly simplified the problem and we have considered fixed prices and thus constant utilities during the selling period.
We consider that a given railway service has K_1 = 25 seats for first class and K_2 = 40 for second class. Users arrive according to a Poisson process with a mean time between arrivals of λ = 0.8 min. We assume that the sales process begins 60 min before the departure time. The probability of a user wishing to travel in first class is 0.2 and in second class 0.8 (i.e., the utilities are not time-sensitive). If on buying the ticket there are no seats available in the desired class, the user chooses, with probability 0.5, to change class, and with probability 0.5 not to make the trip. By Monte Carlo simulation we have generated data for 25 different days, which can be downloaded from http://bit.ly/simData.
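The sales process just described can be sketched as a short Monte Carlo routine; the random seed and function names are illustrative, not part of the original experiment:

```python
import numpy as np

# Monte Carlo sketch of the sales process described above: Poisson
# arrivals with 0.8 min mean interarrival time over a 60 min window,
# first class chosen with probability 0.2 and second class with 0.8,
# capacities K1 = 25 and K2 = 40, and a 50/50 switch-or-refuse rule
# when the desired class is sold out.

rng = np.random.default_rng(0)

def simulate_day():
    cap = {1: 25, 2: 40}
    sold = {1: 0, 2: 0}
    refused = 0
    t = rng.exponential(0.8)            # first arrival
    while t < 60.0:
        wanted = 1 if rng.random() < 0.2 else 2
        other = 2 if wanted == 1 else 1
        if sold[wanted] < cap[wanted]:
            sold[wanted] += 1
        elif sold[other] < cap[other] and rng.random() < 0.5:
            sold[other] += 1            # capacity binding: switch class
        else:
            refused += 1                # capacity binding: do not travel
        t += rng.exponential(0.8)
    return sold, refused

sold, refused = simulate_day()
print(sold, refused)
```

Repeating `simulate_day()` 25 times produces a dataset of the kind used in this experiment.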
We have fitted the MNL model and the NL model shown in Fig. 4 to this problem. Both models have been fitted with and without capacity constraints (i.e., the number of available seats). We consider that the index ℓ represents one individual, and individuals are ordered according to the instant t_ℓ at which they buy their ticket. The decisions of the individuals who arrive before instant t_ℓ affect the choice set of individual ℓ, since they may have taken all the seats in one or other class. If we let K^ℓ_i be the residual capacity (number of available seats) of the i-th class when user ℓ buys his ticket, the constrained MNL and NL models are formulated with capacity constraints over the alternatives s ∈ {1, 2, Ref}. We used the parameter η^ℓ_1 = 1 in the MNL and CMNL models, and the values λ^ℓ_1 = 1, λ^{1,ℓ}_2 = 2, λ^{2,ℓ}_2 = 3 (and thus η^{1ℓ}_1 = 1/2, η^{2ℓ}_1 = 2/3, η^{1ℓ}_2 = 1/2, η^{2ℓ}_2 = 1/3) for the NL and CNL models. The first numerical trial aims to assess the capacity of the models to describe the data. To this end we estimate the models by maximum likelihood using three classical functional specifications for the utilities: constant, linear and quadratic. As regards the models' goodness of fit, three indexes are reported in Table 3: the log-likelihood evaluated at the parameter estimates (L*), rho-square (ρ²) and adjusted rho-square (ρ̄²), where k is the number of estimated parameters and L_0 is the log-likelihood evaluated at zero. The values of L_0 for the MNL and NL models are L_0 = −2170.85 and L_0 = −3694.94 respectively. Table 3 shows two important facts. The first is that the constrained models improve on their unconstrained counterparts; indeed, the constrained model with constant utility fits better than the unconstrained model with quadratic utilities. The second is that the computational cost of estimating the constrained models is high.
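The goodness-of-fit indexes of Table 3 follow the standard definitions ρ² = 1 − L*/L₀ and ρ̄² = 1 − (L* − k)/L₀, which can be computed as follows (the L* value used here is illustrative, not one from Table 3):

```python
# Standard goodness-of-fit indexes for discrete choice models:
# rho-square = 1 - L*/L0 and adjusted rho-square = 1 - (L* - k)/L0,
# with k the number of estimated parameters and L0 the log-likelihood
# evaluated at zero.

def rho_square(L_star, L0):
    return 1.0 - L_star / L0

def adjusted_rho_square(L_star, L0, k):
    return 1.0 - (L_star - k) / L0

# E.g., with the MNL null log-likelihood reported above and an
# illustrative fitted value L* = -1800:
L0 = -2170.85
print(round(rho_square(-1800.0, L0), 3))
print(round(adjusted_rho_square(-1800.0, L0, 4), 3))
```

The adjusted index penalizes extra parameters, so it is always slightly below ρ² for k > 0.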
This reveals the need to solve the estimation problem efficiently in order to apply the methodology to real problems.
We now discuss how to test whether the improvement introduced by a constrained model with respect to an unconstrained one is statistically significant. This was done using the likelihood ratio test. Let μ*ℓ_s be the optimal Lagrange multipliers of the constraints of the CMNL model; these constraints can then be penalized in the objective function and the CMNL reformulated accordingly, taking μ*ℓ_Ref = 0 to unify the notation. This formulation allows the CMNL to be interpreted as an MNL in which the utilities are given by the penalized expression Ṽ^ℓ_s = V^ℓ_s − μ*ℓ_s. In this experiment we have tested the hypothesis that the constraints are active from 20 min before the train leaves, so that from that moment the model should be considered constrained. We perform a likelihood ratio test; the test statistic is twice the difference in L*, and these values are shown in Table 4. It can be seen that the inclusion of the constraints is only significant in the models with constant utilities. When we estimate the linear and quadratic utilities V^ℓ_s in an unconstrained model, the real effect of the adjustment is to determine the best parameters to reproduce the penalized utilities Ṽ^ℓ_s; the unconstrained models thus consider the constraints implicitly via the fitting of the penalized utilities. The conclusion of the hypothesis test is that, in the case of linear and quadratic utilities, the explicit inclusion of the constraints does not significantly improve on this implicit treatment. Figures 5 and 6 show the estimated probabilities that a user will choose a given alternative depending on the instant of ticket purchase: Fig. 5 shows the estimates obtained using the logit models and Fig. 6 those using the nested logit models, both with quadratic utilities. The true probabilities, estimated by Monte Carlo simulation with a 10,000-day sample, have been overlaid. All the models approximate the true probabilities.
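The likelihood ratio test used here can be sketched as follows; the log-likelihood values and degrees of freedom below are illustrative, not those of Table 4:

```python
from scipy.stats import chi2

# Likelihood ratio test: twice the log-likelihood gap between the richer
# (constrained) fit and the restricted (unconstrained) fit is compared
# with a chi-square quantile. Values here are illustrative.

def likelihood_ratio_test(L_unrestricted, L_restricted, df, alpha=0.05):
    statistic = 2.0 * (L_unrestricted - L_restricted)
    critical = chi2.ppf(1.0 - alpha, df)
    return statistic, statistic > critical

stat, reject = likelihood_ratio_test(-2100.0, -2150.0, df=3)
print(stat, reject)  # 100.0 True
```

A `reject` of True means the improvement from the richer model is statistically significant at the chosen level.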
The central question is to determine in which situations it is necessary to use the constrained models. The answer is when estimations are to be carried out for scenarios different from the fitted situation, that is, when the initial constraints are going to vary. Assume we wish to estimate the number of tickets that will be sold by the transport operator if the rolling stock carrying out the service varies; for example, we assume 3 new types of rolling stock with different numbers of seats. Table 5 shows these estimations for each scenario using the constrained models and the Monte Carlo simulation method. The sample size was 10,000 days and, as this is a large value, we take this estimation as the true value with which to make the comparison. The unconstrained models predict the same value for all scenarios, as shown in Table 6. If we observe the scenario which produces the worst estimation using the models with quadratic utility, the relative errors (expressed as percentages) are, for the constrained models, 10.49 and 6.40% for the first and second class respectively, while for the unconstrained models these errors are 145.63 and 36.01% respectively. This shows the need to use constrained models for this type of estimation. Finally, Fig. 7 shows the estimation of the ticket choice probabilities for the three new scenarios using a constrained MNL model. A good fit is seen for these new scenarios, even though they differ from the scenario used in the estimation.

Application of the CNL model for railway service choice modelling (Experiment 2)
Suppose there are various types of users depending on their origin-destination pair. The index ℓ = (i, j) ∈ L refers to a trip from station i to station j. Assume an origin-destination matrix {ĝ_ℓ}_{ℓ∈L} which defines the potential demand, and assume the total demand is disaggregated into two alternatives: (a) (high-speed) train trips; (b) another means of transport.
where V^{mℓ} is the utility of alternative m for the users of origin-destination pair ℓ. Note that the index ℓ is omitted in the parameter λ_1, which means that this value is the same for all origin-destination pairs ℓ.
The model uses a nested logit structure to disaggregate the demand considering the feasible timetable for a trip of type ℓ. Denote by S^ℓ_a the feasible set of railway services for making a trip of type ℓ. The second level of the nested logit model disaggregates the demand between the different railway services. As in the upper decision level, the parameter λ_2 is assumed independent of the origin-destination pair ℓ. Figure 8 shows the nested logit model combined with the capacity constraints of the trains. When a train reaches station j it has already picked up passengers at the preceding stations, so the number of passengers who can board is restricted by the capacity of the vehicle. Denote by L⁺_{sj} the set of origin-destination pairs whose users board service s before station j and leave the vehicle after station j, and by L_{sj} the set of origin-destination pairs ℓ whose origin is station j and which use s. The capacity constraint of service s at station j is formulated as

Σ_{ℓ ∈ L⁺_{sj}} g^ℓ_s + Σ_{ℓ ∈ L_{sj}} g^ℓ_s ≤ K_s,  s ∈ S, j ∈ J_s,

where K_s is the capacity of train s, S is the set of services and J_s is the set of stations at which service s stops. The complete demand model combines the nested logit structure with these capacity constraints. Espinosa-Aranda et al. (2015) apply this model to the high-speed train timetabling problem.
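How a binding capacity constraint reshapes the predicted demand can be illustrated with a single-constraint sketch, in the spirit of the penalized (Lagrangian) interpretation used for the CMNL in Experiment 1; all names and numbers are illustrative:

```python
import numpy as np

# Sketch: a binding capacity constraint acts like a utility penalty
# mu >= 0 on the constrained alternative, chosen (here by bisection)
# so that its demand does not exceed capacity. Illustrative values only.

def constrained_shares(V, total, cap_idx, capacity):
    def demand(mu):
        U = np.array(V, dtype=float)
        U[cap_idx] -= mu                    # penalized utility
        e = np.exp(U - U.max())
        return total * e / e.sum()
    if demand(0.0)[cap_idx] <= capacity:
        return demand(0.0)                  # constraint not binding
    lo, hi = 0.0, 50.0                      # bisection on the multiplier
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if demand(mid)[cap_idx] > capacity:
            lo = mid
        else:
            hi = mid
    return demand(hi)

g = constrained_shares(V=[1.0, 0.5, 0.0], total=100.0, cap_idx=0, capacity=30.0)
print(np.round(g, 1))   # first alternative held at its capacity
```

The displaced demand is redistributed among the remaining alternatives in logit proportions, which is exactly the behaviour the unconstrained models cannot reproduce when capacities change.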

Case study
To test the above model, a case study has been generated; its main objective is to study the possibility of estimating the model parameters. This numerical example looks at the Madrid-Seville corridor of the Spanish High Speed Railway network. The corridor consists of 5 stations: Madrid (MAD), Ciudad Real (CR), Puertollano (PU), Córdoba (COR) and Sevilla (SEV), which produces 20 origin-destination demand pairs (10 per direction of travel) comprising 15,115 passengers/day. Currently this demand is completely covered by 100 services. Figure 9 shows the corridor used by these services, Table 7 indicates the route of each type of service and Table 8 shows the maximum capacity of each type of train.
In this example there are 20 types of users (i.e., 20 origin-destination pairs ℓ). Each user type ℓ can travel using a set of services s. Considering the planned schedule, the set of alternatives (ℓ, s) ∈ D consists of 298 possibilities; the proposed model could thus estimate up to 298 parameters.
To estimate the model, 25 services have been selected randomly, generating a set D_1 with 66 possibilities (ℓ, s) and, consequently, 66 parameters V^{a,ℓ}_s to be estimated, from which the utility of each alternative is calculated as explained in ''Non-linear utility specifications using Reproducing Kernel Hilbert Spaces'' section. Note that the solution of the linear system (15) produces the values a^ℓ_s. The attributes considered for each alternative (ℓ, s) are: (1) the price x^ℓ_{s,1}, (2) the travel time x^ℓ_{s,2} and (3) the timetable x^ℓ_{s,3}. The vector of attributes is denoted x^ℓ_s = (x^ℓ_{s,1}, x^ℓ_{s,2}, x^ℓ_{s,3}). The data used for the experiment can be downloaded from http://bit.ly/1gCFw5e.
To simplify, we have set V*ℓ(x) = V*(x) for all origin-destination pairs ℓ in this experiment, and the utility function V*(x) is defined using a Gaussian kernel K(x, y) = e^{−α‖x−y‖²}, where ‖·‖ is the Euclidean norm and α ∈ R⁺; in this case α = 5 has been considered. The regularization parameter c has been set to 0.00001. The parameter c prevents the system of linear equations (15) from being singular, and is chosen so that c → 0⁺: first a small value is tested, as in this example, and if there are no numerical problems (an ill-conditioned system) the value is accepted; otherwise c is increased until ill-conditioning is avoided.
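This rule for choosing the regularization parameter can be sketched as a simple loop on the condition number of the regularized kernel matrix; the conditioning threshold and the data are assumptions of this sketch, not values from the case study:

```python
import numpy as np

# Sketch of the regularization rule described above: start from a small
# value (called c in the text, gamma here) and increase it until the
# system (K + gamma I) a = V is no longer ill-conditioned. The
# conditioning threshold is an illustrative choice.

def choose_gamma(K, gamma=1e-5, max_cond=1e10):
    while np.linalg.cond(K + gamma * np.eye(K.shape[0])) > max_cond:
        gamma *= 10.0
    return gamma

# Gaussian kernel matrix with alpha = 5 on illustrative 1-D attribute data:
X = np.linspace(0.0, 1.0, 20)
K = np.exp(-5.0 * (X[:, None] - X[None, :]) ** 2)
gamma = choose_gamma(K)
print(gamma, np.linalg.cond(K + gamma * np.eye(20)) < 1e10)
```

If the small starting value already yields an acceptable condition number, the loop terminates immediately, matching the "test a small value first" procedure in the text.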

Estimation methods
The data used as N are in the public domain and therefore contain only aggregated information, such as the total demand in a given origin-destination pair or for a type of service. An estimation procedure based on the ML approach is not available because the disaggregated observations for each pair (ℓ, s) are unknown. In this test we have therefore adapted the generalized least squares technique to compare the known demand behaviour with the demand predicted by the CNL model. In this example the values of λ_1 and λ_2 have also been estimated.
The optimization method selected for solving the estimation problem (25) is a hybridization of the Standard Particle Swarm Optimization (SPSO) (Zambrano-Bigiarini et al. 2013) and the Nelder-Mead (NM) (Nelder and Mead 1965) algorithm based on the framework presented in Espinosa-Aranda et al. (2013). Hybrid algorithms try to make full use of the merits of various optimization techniques in order to obtain an efficient method in the search for global optima.
The SPSO has been used successfully in global optimization problems, particularly in transportation research (see Angulo et al. 2011, 2013). The main advantages of PSO algorithms can be summarized as follows: they are capable of escaping local optima by searching the entire solution space, they are robust with respect to the initialization parameters, they are efficient with a modest computational burden, and the right parameter values are simple to select. The NM method is a direct search method that does not use numerical or analytic gradients and has local convergence with a high exploitation capacity. The CNL model itself has been solved with GAMS 24 and the CONOPT solver; on an Intel i7 4-core 3.2 GHz computer with 16 GB RAM the CPU time for each problem is around 0.12 s. The SPSO+NM has been implemented in MATLAB, which calls GAMS to solve each individual CNL problem. The stopping criterion is based on the total number of CNL models solved. The SPSO algorithm was run for 50,000 objective function evaluations, with a random start on an interval defined by Eq. (33) and a swarm of 40 particles. The PSO parameters w, c_1 and c_2 for updating the velocity are defined as w = 1/(2 ln 2), c_1 = 0.5 + ln 2 and c_2 = c_1 (Zambrano-Bigiarini et al. 2013). NM is then run for 50,000 function evaluations starting from the best solution found by the SPSO algorithm, to improve the solution.
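The hybridization can be illustrated on a toy objective: a small particle swarm using the velocity parameters quoted above, followed by a Nelder-Mead refinement of the best particle. The swarm size, iteration budget and objective function below are illustrative and far smaller than in the real experiment:

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of the SPSO + NM hybridization: global PSO exploration with
# w = 1/(2 ln 2), c1 = c2 = 0.5 + ln 2, then local NM refinement of the
# best particle. The objective and all sizes are illustrative.

def f(x):                      # toy objective with known optimum at (1, 1, 1)
    return np.sum((x - 1.0) ** 2)

rng = np.random.default_rng(1)
w = 1.0 / (2.0 * np.log(2.0))
c1 = 0.5 + np.log(2.0)
c2 = c1
n, dim = 20, 3
x = rng.uniform(-5.0, 5.0, (n, dim))
v = np.zeros((n, dim))
pbest, pval = x.copy(), np.array([f(p) for p in x])

for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    gbest = pbest[pval.argmin()]
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    fx = np.array([f(p) for p in x])
    improved = fx < pval
    pbest[improved], pval[improved] = x[improved], fx[improved]

# NM refinement starting from the best particle found by the swarm:
res = minimize(f, pbest[pval.argmin()], method="Nelder-Mead")
print(np.round(res.x, 2))      # close to the optimum at (1, 1, 1)
```

In the paper each objective evaluation is a full CNL solve via GAMS, so the structure is the same but each `f(x)` call is far more expensive.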
''Appendix 2'' deals with over-specification based on two considerations. Firstly, the non-observed utilities are set to V^{bℓ} = 0 (the second source of over-specification). Secondly, each interval with B > 0 contains optimal solutions of the estimation problem (the first source of over-specification), so imposing this bound limits the search space without reducing the quality of the fit. Selecting limits with a very small value of B (equivalent to δ_ℓ → 0 in the first source of over-specification) could lead to large values of λ_i, making the estimation more complex. A trade-off between the range of the interval of the utilities and the order of magnitude of the parameters λ_i must therefore be achieved. Table 9 shows the computational results depending on the feasible region considered (33). The mean computational cost of each run of the SPSO algorithm was 1.7 h, and with NM, 4.4 h. The estimation of the CNL model can therefore be computed in an affordable time.
As can be seen, the results show that with small intervals the algorithms are capable of finding a better solution than when searching in a bigger feasible region. This can also be seen in Fig. 10, which depicts the evolution of the objective function during the running of the SPSO algorithm for each case study: the red graphs represent case studies 1-5, the blue 6-14 and the black line 15.

Study of the best solution obtained
The best solution is obtained using the interval (−B, B) = (−1, 1) as the parameter space. This section examines this solution. Figures 11 and 12 depict the utility function fixing, respectively, the departure time at 8:30 and the travel time at 63 min.
It can be seen in Fig. 11 that the specification V^ℓ(x) = V*(x) for each pair ℓ lets the travel time attribute capture the demand effect in each pair ℓ. For example, the largest travel time corresponds to the longest trip, ℓ = (MAD, SEV), while the smallest occurs for the pair ℓ = (CR, PU). The utility function thus captures the demand in each pair.

Conclusions
This paper describes the CNL to model both the dynamic and constrained decision spaces in discrete choice contexts. This type of approach is suited to modelling problems in which exogenous and endogenous factors limit the universal choice set of the decision-makers. Applying the model requires additional data to specify the constraints and derivative-free optimization methods for solving the estimation problem.
A key contribution of this paper is the use of Reproducing Kernel Hilbert Spaces for the specification of non-linear utility functions. The paper presents a novel point of view for the model estimation, based on treating the utilities of a set of alternatives as parameters instead of the classical attribute weights. The over-specification issues associated with the CNL formulation are also discussed. The introduction of the estimation method requires testing its capacity to infer unbiased estimates of the true parameters. Further research into this subject, and into how to specify the type of kernel and its parametrization, is necessary to assess whether this methodology can give a faithful reconstruction of the original utility.
Experiment 1 is a simulation study on a constrained problem. The results obtained show that the constrained models fit better than their unconstrained counterparts. The unconstrained models take the constraints into account via non-linear utilities, and thus for the baseline case there are significant differences between constrained and unconstrained models only when the utility function is constant. The key conclusion of Experiment 1 is that the constrained models are robust with respect to modifications of the baseline constraints; if the unconstrained models are used in new scenarios, forecasts of demand are likely to violate the new constraints and miscalculate the true demand. In Experiment 2 a novel railway demand model is used to test the suitability of the proposed approach. Experiment 2 is based on real data for the Madrid-Seville high-speed corridor and proposes a metaheuristic methodology, based on the hybridization of Particle Swarm Optimization and the Nelder-Mead method, to estimate the CNL. The computational cost of solving the bi-level model is 4.4 h. The importance of eliminating over-specification of the model can be seen in the quality of the results obtained. The results show how the CNL model can represent the behaviour of users of the railway network.
Next, we focus on the probabilities at the lower level of the CNL. Solving for g^{m'ℓ'}_s in (34) and summing over s ∈ S^{ℓ'}_{m'}, the set of sub-alternatives of alternative m' and user type ℓ', gives (39). Dividing (38) by (39) then yields the probabilities at the lower level of the CNL. At this point the probabilities of the CNL at the upper level can also be calculated: finding H^{m'ℓ'} from (39), where L^{m'ℓ'} is the classical log-sum, then finding g^{m'ℓ'} from (42) and summing with respect to m gives ĝ_{ℓ'}. Finally, the probability of selecting alternative m' for user type ℓ' at the upper level is obtained.

Appendix 2: Over-specification of the CNL model

In this Appendix we show that there exist infinitely many solutions to the estimation problem (25). This is due to over-specification of the parameters (García-Ródenas and Marín 2009; Bierlaire et al. 1997; Daganzo and Kusnic 1993; Ben-Akiva and Lerman 1985). Parameter over-specification must be avoided because, although some of the more robust methods succeed in solving the problem, their speed of convergence may be very slow. This problem is due to the singularity of the second derivative matrix of the log-likelihood function.
First source of over-specification

The first source of over-specification arises from the interaction between the structure of the utilities and the parameters λ^ℓ_j with j ∈ {1, 2}. The relationships are schematically denoted Ṽ = δV̂ and λ̃ = λ/δ. Let (V̂_1, λ) be a vector of parameters for the CNL model and let g be the estimated demand. The objective function of the CNL model is separable in ℓ; if each term in ℓ is multiplied by a constant δ_ℓ > 0, the optimal solution associated with ℓ is not changed, and the system constraints still hold. It is worth noting that when the utilities V̂_1 are multiplied by δ, the solution of system (15) is multiplied by δ, and thus the utilities V̂_2 are also multiplied by δ, because they are linear in the parameters a. Using (47), (48) and (49), and since the objective function of the estimation model (25), F(g, N), depends only on g, both solutions (V̂_1, λ) and (Ṽ_1, λ̃) have the same objective value. This shows that there exist infinitely many optimal solutions of the estimation model.
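This scale invariance is easy to verify numerically for a simple logit kernel; the utilities, scale parameter and δ below are arbitrary:

```python
import numpy as np

# Numerical check of the first source of over-specification: scaling the
# utilities by delta while dividing the scale parameter lambda by delta
# leaves the logit choice probabilities unchanged.

def logit_probs(V, lam):
    e = np.exp(lam * np.asarray(V))
    return e / e.sum()

V = np.array([1.2, 0.4, -0.3])
lam, delta = 2.0, 5.0
p1 = logit_probs(V, lam)
p2 = logit_probs(delta * V, lam / delta)   # (V~, lambda~) = (delta V, lambda/delta)
print(np.allclose(p1, p2))                 # True
```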
Thus the scale parameters of the Gumbel error terms are undetermined. In practice, fixing one Gumbel scale for each ℓ is sufficient for identification.

Second source of over-specification
The second source of over-specification in CNL models is that adding the same value to the utilities of all the alternatives does not affect the log-likelihood of the sample. In this case, we assume that D_2 = ∅. The set of utilities Ṽ^{mℓ} produces the same solution of the optimization model: if the utilities V̂ in the objective function of the CNL are replaced by the utilities Ṽ, the same objective function value plus a constant is obtained. Bierlaire et al. (1997) analysed over-specification in nested logit models with respect to the log-likelihood function. These authors studied the relationship between any two arbitrary strategies to avoid over-specification, and showed that the two strategies are equivalent under a linear transformation of the variables. Some algorithms are invariant under such transformations, such as Newton's method and the quasi-Newton methods of the Broyden family combined with line searches; if these are used, the way in which the over-specification is eliminated is unimportant. Daganzo and Kusnic (1993) suggested setting one parameter to zero for each set of parameters involved in every source of over-specification, and estimating the rest, in order to avoid over-specification in the nested logit model.
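This translation invariance can likewise be checked numerically; the utilities and the shift are arbitrary:

```python
import numpy as np

# Numerical check of the second source of over-specification: adding the
# same constant to every utility leaves the choice probabilities (and
# hence the log-likelihood) unchanged, so one utility can be fixed to 0.

def logit_probs(V):
    e = np.exp(np.asarray(V) - np.max(V))   # subtract max for stability
    return e / e.sum()

V = np.array([2.0, 1.0, 0.5])
shift = 7.3
print(np.allclose(logit_probs(V), logit_probs(V + shift)))  # True
```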