An axiomatic foundation of conditional logit

This paper considers a decision maker (DM) choosing from a set of options when options have multiple real-valued attributes. Assuming DM chooses all options with positive probability, four invariance assumptions are necessary and sufficient for choice probabilities to take McFadden's conditional logit form: independence of irrelevant alternatives, translation invariance, presentation independence and context independence. Variations on these assumptions yield generalized logit and contextual logit models. This shows that even specific logit models have behavioral foundations in simple invariance assumptions that involve observables only and are therefore directly testable.


Introduction
Economic analyses typically rest on preference assumptions, and the resulting necessity to understand preferences inspired a large body of work developing methods to infer preferences from choice. The main difficulty is that choice is inherently stochastic, implying that preferences are not directly revealed. Structural modeling attempts to control for noise in the choice process, using explicit models of stochastic choice, and has a wide range of applications in empirical and behavioral analyses (for a "primer," see Wilcox 2008). In order to apply models of stochastic choice, however, researchers need to specify the formal link between choice propensities and observables, as demonstrated by Axiom 4 in McFadden's (1974) seminal characterization of conditional logit. This suggests that models of stochastic choice cannot be applied without making functional-form assumptions that are not directly testable (see, e.g., Keane 2010a; Nevo and Whinston 2010).
The present paper seeks to contribute to this discussion by demonstrating that specific models of stochastic choice, several widely used logit models with linear links between propensities and observables, have axiomatic foundations in invariance assumptions and therefore do not require functional-form assumptions. These invariance assumptions involve solely observable attributes of options and are directly testable. Extensions to nonlinear link functions that are additively separable in observables (such as CES utilities) are equally possible.
Formally, I consider a decision maker (DM) choosing from a finite set of options. Each option is characterized by an observable attribute vector (say, payoffs in different states of the world, or prices, quantities and qualities of products), and each option is chosen with positive probability. Within this framework, and given an essentialness condition requiring that DM is not indifferent with respect to any of the attributes, IIA and two simple invariance assumptions, clarifying when choice responds to changes in attributes and when it does not, uniquely characterize the conditional logit model of McFadden (1974). On the one hand, "π-invariance" requires choice to be invariant to translation of the attribute vector, and on the other, "π-relevance" requires that choice responds "uniformly" to attribute changes other than translation (comprising "presentation independence" and "context independence," in a sense to be made precise). Empirical evidence seems to suggest, however, that observed choice also exhibits a form of scaling invariance. After strengthening π-invariance and weakening π-relevance correspondingly, I obtain a multi-attribute generalization of the contextual utility model of Wilcox (2011).
Four points appear to be worth noting about these results. (1) All invariance assumptions solely involve choice probabilities and option attributes, both of which are considered observable in applied work (for a foundation, see Gul et al. 2014) and are thus directly testable. Applied work may therefore first test, for a given dataset, whether the choice postulates underlying logit are satisfied, or whether the postulates are satisfied after a transformation of attributes as suggested by McFadden (1974), and only then apply logit if adequate. (2) The widely used conditional logit model, which assumes a linear utility function linking attributes and log-propensities of choice, is provided with a foundation void of the functional-form assumptions it has been criticized for (for discussion, see Keane 2010b; Rust 2010; Blundell 2010; Heckman and Urzua 2010).
(3) For any given choice profile satisfying the invariance assumptions, the utility function linking attributes and log-propensities of choice is shown to be unique up to linear transformation, which is a novel uniqueness result that may be helpful in applications. (4) Observed behavior tends to exhibit a form of invariance to rescaling attributes that conditional logit does not accommodate, suggesting that alternate models such as the generalized form of contextual utility may indeed be more adequate in applied work.
To provide some context for the results, the general family of logit models is known to be characterized by positivity and IIA in the sense that choice probabilities then take the logit form for an unknown, potentially nonlinear utility function linking attributes and log-propensities of choice. My results provide testable conditions for this utility function to be uniquely linear in attributes or in pre-specified transformations thereof, as postulated by McFadden in his definition of conditional logit. The main condition is translation invariance, and interestingly, linear logit models are already known to predict choice probabilities that are invariant to translation of option attributes (or, utilities in some work). This paper's contribution is therefore the result that translation invariance is sufficient if complemented by a condition clarifying when attribute changes are relevant, and the sufficiency finally removes the necessity to make functional-form assumptions.
The need for the additional condition arises, because translation invariance yields a functional equation and its solution includes non-trivial integration constants. These integration constants resemble prior probabilities as observed in the generalized logit model of Matejka and McKay (2015) and may be interpreted to represent presentation effects in choice (i.e., effects due to ordering and positioning of options). In order to formally represent and capture such presentation effects, I extend the standard framework of stochastic choice by explicitly distinguishing "contexts" of decisions, i.e., different mappings between options and attributes. This generalized framework, in turn, is the main difference to existing analyses of stochastic choice when options have multiple attributes, such as Gul et al. (2014), and allows us to establish the necessity and sufficiency of simple invariance assumptions in characterizing logit models with specific (here, linear) utility functions.

Related literature
We may distinguish at least three classes of approaches in the characterization of logit models. The original approach due to Luce (1959) demonstrates that a general family of nonlinear logit models is characterized by positivity and independence of irrelevant alternatives (IIA). That is, if choice probabilities satisfy positivity and IIA, then there exists a utility function v such that choice probabilities take the multinomial logit form. This utility function is not known, however, and applied research needs to specify a utility function. In order to provide a foundation for such work, McFadden (1974) introduces the conditional logit model, explicitly linking option attributes and choice probabilities without resorting to an unknown utility function. McFadden achieves this by introducing two additional axioms that fix the exponential structure (Axiom 3) and the additive separability (Axiom 4), but both assumptions involve non-observable utilities rendering them untestable (for discussion, see Breitmoser 2018).
The third class of approaches seeks to model the choice process explicitly in order to establish conditions such that the resulting choice probabilities take the logit form. All of these approaches involve non-observable entities as well, rendering direct tests impossible. For example, logit can be formulated as a random utility model (Thurstone 1927; Block and Marschak 1960), but this involves non-observable utilities, a specific functional form assuming additive separability of utilities and perturbations, and it requires unobservable utility perturbations to be independently and identically distributed as extreme-value type I. Logit can also be characterized as the outcome of rational inattention (Matejka and McKay 2015), but this relies on assumptions involving non-observable utilities and assumptions relating non-observable costs of information acquisition to Shannon's measure of entropy. Logit's foundation in an additive perturbed utility representation (e.g., Fudenberg et al. 2015) requires DM to maximize the difference of expected utility and perturbation costs and requires the unobservable perturbation costs to be proportional to the Shannon entropy. Finally, in another recent paper, Woodford (2014) characterizes logit choice as the solution to a specific optimization problem if a certain unobservable parameter is equal to 1.
The present paper is most closely related to McFadden (1974), in its attempt to establish testable conditions for specific (linear) utility functions to form the link between observable attributes of options and choice propensities. This objective of McFadden has received renewed interest in recent work on the theoretical foundation of logit. Specifically, Ahn et al. (2018) characterize the linear logit model if only average choices are observable, and Allen and Rehbeck (2019) analyze stochastic choice more generally if attributes vary between observations rather than option sets, noting that the latter was Luce's original approach when defining independence of irrelevant alternatives. In this paper, we allow for both, variation of option sets and variation of attributes, showing that they jointly characterize linear logit models. 1

Definitions
Decision maker DM chooses option x from menu B. Each menu B is a finite subset of some set X, and the set of all finite subsets of X is denoted as P(X). Each option x ∈ X is associated with an attribute vector via π: X → ℝ^n, which may define payoffs in different states of the world (as in decision theory) or payoffs to different agents as a function of the option chosen (as in game theory), or product bundles and prices (as in consumer choice). I refer to the attribute mapping π as the context of DM's decision and to the pair (π, B) as a choice task. Given choice task (π, B), the probability that DM chooses x is denoted as Pr(x | π, B).
The set of choice tasks (π, B) that can be constructed is D = Π × P(X). Π denotes the set of attribute mappings that may be constructed by changing attributes such as quantities or prices, or, for example, by permuting the attribute mapping (rearranging options in a shop or on a screen, or relabeling states of the world or co-players in experiments). As indicated, a formal expression of variations of π will be necessary to address "presentation effects" that come with integration constants below. For the purpose of the present paper, I assume that attributes are exogenously given by an experimental design or the analyst, but note that this may in practice not always be trivial (Gul et al. 2014).
1 To be clear, while these two studies appear to be the most closely related amongst the recent ones, the Luce model in particular and stochastic choice in general have been studied fairly comprehensively recently. To give just a few examples, Koida (2018) studies stochastic choice influenced by the positioning of objects in the menu, Ryan (2018) studies axiomatic characterizations of logit in choice under risk or uncertainty, and Echenique and Saito (2019) study generalized Luce models that relax positivity.

I assume that the set of choice tasks D satisfies the following conditions. Throughout this paper, ℝ_+ denotes the set of positive reals. 2

Assumption 1 (Framework) The set of choice tasks D = Π × P(X) satisfies transformability (A1), surjectivity (A2) and richness: [...] and π_{−k}(x) = π_{−k}(y).
Besides being standard assumptions in microeconomic theory and in recent analyses of stochastic choice (Gul et al. 2014; Fudenberg et al. 2015), these conditions help us ensure that the representation derived below is unique. Specifically, transformability ensures that we may discuss reactions to affine transformations of attributes, by ensuring that all affine transformations are well-defined objects. Surjectivity rules out scarce choice environments where the set of feasible attribute vectors is finite or even singleton; but it will be notationally convenient to know that π[X] is convex and bounded in all dimensions. These assumptions are satisfied in choice tasks typically of interest in behavioral work (such as in choice under risk or in games, as payoffs can be varied almost continuously up to exogenous bounds). Note that the attribute functions may still be fairly ill-behaved, violating smoothness, monotonicity and continuity at any number of points. Richness finally ensures that we may discuss reactions to uni-dimensional variations of the attribute vector. This is straightforwardly satisfied in decision tasks typically of interest to analysts, for example by direct manipulations of prices or payoffs.
Within this framework, we assume DM's choice profile Pr adheres to the following postulates.
Assumption 2 (Postulates on choice probabilities) There exists [...]: Π → ℝ_+ such that for all (π, B): [...] (the postulates are referred to below as P1 essentialness, P2 positivity, P3 IIA, P4 π-invariance and P5 π-relevance).

2 With slight abuse of notation, I further identify all real numbers as constant functions such that addition and multiplication of a function with a real are well defined. Thus, for any π: X → ℝ^n and any [...] for all k ≤ n and x ∈ X. As usual, I use π_k(x) to denote attribute k ≤ n of option x and π_{−k}(x) to denote the list of all attributes but k.
Essentialness requires that all dimensions of the attribute vector are relevant to DM. With respect to non-essential dimensions, the representation derived below would not be unique. Positivity allows that DM fails to maximize utility, however rarely, and captures the widely documented phenomenon that individual choice fluctuates and involves dominated options (Hey 2005). The above formulation of positivity requires that choice probabilities in all binary choice tasks are bounded below at a value strictly above zero, but this bound may be arbitrarily close to zero. Following McFadden (1974), positivity captures stochastic choice in a comparably mild manner, as an event occurring with zero probability is empirically indistinguishable from one occurring with positive but small probability.
Next, π-invariance requires the choice profile Pr to be invariant to translation of attribute mappings. While translation invariance is directly testable, e.g., by varying show-up fees in experiments, I am not aware of studies directly testing it. A string of evidence suggesting that choice is translation invariant is provided by neuro-economic studies, which consistently find "adaptive coding," as I discuss in more detail when deliberating scaling invariance. Finally, π-relevance requires relative choice probabilities to be invariant across contexts if the option attributes are equivalent.
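McFadden introduces the conditional logit model as a logit model where the log-propensities of choice are linear in option attributes (potentially after transforming attributes). This conditional logit model is a special case of the Luce model, which can be defined as follows:

Definition 1 The choice profile Pr has a Luce representation if there exists a family V = {V_π}_{π∈Π} of propensity functions on X such that Pr(x | π, B) = V_π(x) / ∑_{y∈B} V_π(y) for all x ∈ B and (π, B) ∈ D. Any such V is said to admit a Luce representation.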
Based on this, we define conditional logit as follows (following McFadden 1974):

Definition 2 The choice profile Pr has a conditional logit representation if there exists β ∈ ℝ^n such that V with V_π(x) = exp{β ⋅ π_x} for all x, admits a Luce representation.

Note the abbreviated vector notation involving option attributes π_x ∈ ℝ^n for all x ∈ B.
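To make the definition concrete, the following minimal Python sketch computes conditional logit probabilities for a small menu and checks numerically that they are invariant to translation of the attribute vectors and that relative probabilities satisfy IIA. The coefficient vector `beta`, the attribute values and the translation `r` are hypothetical illustration values, not taken from the paper.

```python
import math

def conditional_logit(attrs, beta):
    """Choice probabilities exp(beta . pi_x) / sum_y exp(beta . pi_y) over a menu.

    attrs: list of attribute vectors, one per option in the menu
    beta:  coefficient vector (hypothetical values in this sketch)
    """
    scores = [sum(b * a for b, a in zip(beta, x)) for x in attrs]
    m = max(scores)  # subtracting the maximum cancels in the ratio and avoids overflow
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

beta = [0.5, -1.0]                             # hypothetical coefficients
menu = [[2.0, 1.0], [0.0, 0.5], [1.0, 0.0]]    # hypothetical attribute vectors
p = conditional_logit(menu, beta)

# Translation: add the same vector r to the attributes of every option.
r = [8.0, -3.0]
shifted = [[a + d for a, d in zip(x, r)] for x in menu]
p_shifted = conditional_logit(shifted, beta)

# IIA: the odds of option 0 over option 1 do not depend on option 2 being available.
p_binary = conditional_logit(menu[:2], beta)
odds_full = p[0] / p[1]
odds_binary = p_binary[0] / p_binary[1]
```

In this sketch the common factor exp{β ⋅ r} cancels between numerator and denominator, which is why `p` and `p_shifted` coincide up to floating-point error.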
Given Assumption 1, we will see that the above choice postulates are equivalent to the choice profile Pr taking the conditional logit form. That is, conditional logit is adequate if and only if the choice postulates are satisfied. Since the choice postulates solely involve observables, they are straightforwardly testable, which allows analysts to verify whether logit is an adequate model given their dataset. In practice, it may also be appropriate for analysts to determine whether the postulates are satisfied after applying pre-specified transforms to the attributes and then run their analysis using these transforms, similar to power transforms such as Box-Cox transformations that align data with normal distributions in other applied work. Such transforms allow the utility to simply be additively separable in attributes, which contains many well-known utility functions (most obviously CES utilities) as special cases.

Analysis
As shown by Luce, positivity and IIA imply that choice probabilities have a Luce representation, i.e., for each π, a propensity function V_π: X → ℝ exists such that Pr(x | π, B) = V_π(x) / ∑_{y∈B} V_π(y). The following result also clarifies that the choice propensities V_π are context dependent, as IIA itself does not restrict choice across contexts π, and thus may involve arbitrary statistics of π (such as suprema or infima) as constants. Given context π, however, V_π can be expressed as a function of the option itself (x) and its attributes π_x.

Lemma 1 Pr satisfies P2-P3 ⇒ Pr has a Luce representation.
The routine proof is relegated to "Appendix." It should be clear that the functional forms of V_π are unrestricted and not directly observable by the analyst. There simply exists a family of unknown utility functions {V_π} admitting a Luce representation, and the question we ask is whether there are testable conditions such that {V_π} assumes specific functional forms, focusing on the linear form assumed in McFadden's definition of conditional logit.
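The construction behind a Luce representation can be sketched as follows: under positivity and IIA, propensities within a fixed context can be read off binary choice probabilities relative to an arbitrary reference option. The Python snippet below is a minimal illustration under these assumptions; the function names, the reference option and the numerical propensities are hypothetical.

```python
def luce_propensities(prob, options, reference):
    """Recover propensities V(x) = Pr(x | {x, ref}) / Pr(ref | {x, ref}), with V(ref) = 1.

    prob(x, menu) returns the probability of choosing x from menu
    (assumed to satisfy positivity and IIA within a fixed context).
    """
    V = {}
    for x in options:
        if x == reference:
            V[x] = 1.0
        else:
            pair = (x, reference)
            V[x] = prob(x, pair) / prob(reference, pair)
    return V

# Hypothetical choice profile generated by unobserved propensities.
true_v = {"a": 3.0, "b": 1.0, "c": 0.5}

def prob(x, menu):
    return true_v[x] / sum(true_v[y] for y in menu)

V = luce_propensities(prob, ["a", "b", "c"], reference="b")

# The recovered propensities reproduce choice probabilities on larger menus.
menu = ("a", "b", "c")
predicted = {x: V[x] / sum(V[y] for y in menu) for x in menu}
```

The recovered `V` is pinned down only up to the choice of reference option, i.e., up to multiplication by a positive constant, and nothing in this construction restricts how `V` depends on attributes.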
Next assume that choice is also π-invariant, i.e., invariant to translations of contexts. On its own, this implies a comparably modest refinement of the set of propensity functions in relation to those compatible with just positivity and IIA. Here and in the following, writing "cinf π" and later "csup π", I refer to π's componentwise infimum and supremum, respectively, over its domain X, i.e., cinf π = (inf_{x∈X} π_k(x))_{k≤n} and csup π = (sup_{x∈X} π_k(x))_{k≤n}.
Lemma 2 Pr satisfies P2-P4 ⇒ Pr has a Luce representation where for all r ∈ ℝ^n, V_{π+r} is a linear transformation of V_π.
The requirement of π-invariance has further implications once we take its counterpart π-relevance into account, but on its own, it poses no restriction on the functional form in our framework. This will be illustrated after the proof.
Proof By Lemma 1, there exists a collection of functions (V_π)_{π∈Π} such that Pr(x | π, B) = V_π(x) / ∑_{y∈B} V_π(y) for all x, B, π. Now fix π ∈ Π and note that, given this representation of Pr, by P4 we obtain

V_π(x) / ∑_{y∈B} V_π(y) = V_{π+r}(x) / ∑_{y∈B} V_{π+r}(y)

for all r ∈ ℝ^n and (π, B) ∈ D. By positivity (P2), the values of V_π and V_{π+r} are nonzero. Hence, there exists c ∈ ℝ such that V_{π+r} = c ⋅ V_π. To see this, assume for contradiction that there is no such constant. Then, there exist x, y such that V_{π+r}(x) = c_1 ⋅ V_π(x) and V_{π+r}(y) = c_2 ⋅ V_π(y) with c_1 ≠ c_2. By P4, applied to B = {x, y}, V_π(x) / V_π(y) = V_{π+r}(x) / V_{π+r}(y) = c_1 V_π(x) / (c_2 V_π(y)), implying c_1 = c_2, the contradiction, i.e., V_{π+r} is a linear transformation of V_π. ◻

Now, to illustrate, fix a one-dimensional attribute mapping π: X → ℝ and assume π_x = 2 and π_y = 0. Also consider π′ = π + 8, which implies π′_x = 10 and π′_y = 8. By π-invariance, the relative probability of choosing x over y is equal in both contexts π and π′. Two seemingly related invariances are not implied, and those prevent us from taking full advantage of π-invariance. On the one hand, assume there exist x′, y′ ∈ X with payoffs 10 and 8 in the original context π, i.e., π_{x′} = 10 and π_{y′} = 8. Translation invariance does not imply that the relative probability of choosing 10 (x′) over 8 (y′) in context π is equal to the one of choosing 2 (x) over 0 (y) in context π, although we know that choosing between 2 and 0 under π is equivalent to choosing between 10 and 8 in a different context π′. I refer to this phenomenon as "presentation effect": The probability of choosing an option with a given outcome may depend on which option it is. Presentation effects may reflect labeling, ordering or positioning of options. They are compatible with V_π(x) as the option x itself is choice relevant, not just its attributes π_x. Presentation independence results if choice is invariant to permutation of options, and in order to express invariance to permutation, we need to distinguish contexts, since each permutation represents a different context π.
On the other hand, fix π′′ such that π′′_x = 2 and π′′_y = 0, but π ≠ π′′. Hence, π′′ is not a translation of π, and choice propensities in contexts π and π′′ are unrelated by Lemma 2. Hence, the relative probabilities of choosing options x and y may well differ between these contexts, despite the equality of attributes and options assuming these attributes, which I will call "context dependence." Strict context independence obtains if, for all π, π′ ∈ Π and all x, y ∈ X, equal attributes of x and y in the two contexts imply equal relative choice probabilities. The postulate of π-relevance introduced above combines presentation and context independence, and with its help, we arrive at the conditional logit representation with a necessarily (log-)linear value function.
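The presentation effect in the numerical illustration above can be sketched in Python. The propensity form V_π(x) = w(x) ⋅ exp{β π_x}, with option-specific weights w(x) standing in for the integration constants, is an assumption made for this sketch only, as are the names `x2`, `y2` (for x′, y′) and all numerical values.

```python
import math

def propensity(option, context, weight, beta):
    # Hypothetical form: V_pi(x) = w(x) * exp(beta * pi_x), one-dimensional attributes.
    return weight[option] * math.exp(beta * context[option])

def odds(a, b, context, weight, beta):
    # Relative probability of choosing a over b in the binary menu {a, b}.
    return propensity(a, context, weight, beta) / propensity(b, context, weight, beta)

beta = 0.3                                              # hypothetical coefficient
weight = {"x": 1.0, "y": 1.0, "x2": 3.0, "y2": 1.0}     # hypothetical presentation weights
pi = {"x": 2.0, "y": 0.0, "x2": 10.0, "y2": 8.0}        # original context
pi_shifted = {k: v + 8.0 for k, v in pi.items()}        # translated context pi + 8

odds_xy = odds("x", "y", pi, weight, beta)                  # 2 versus 0 in context pi
odds_xy_shifted = odds("x", "y", pi_shifted, weight, beta)  # 10 versus 8 in context pi + 8
odds_x2y2 = odds("x2", "y2", pi, weight, beta)              # 10 versus 8 in context pi
```

Here translation leaves the odds of x over y unchanged, while the odds of x′ over y′ in the original context differ by the ratio of the weights, although the attribute values 10 and 8 are the same as those of x and y after translation.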
Theorem 1 Pr has a conditional logit representation with β_k ≠ 0 for all k ≤ n ⇔ Pr satisfies P1-P5. In addition, any V = {V_π}_{π∈Π} admitting a Luce representation is a collection of functions that are linear transformations of one another.
Proof Step 1 (Representation independently of x): Pick any π ∈ Π and x, y ∈ X. By P5, π_x = π_y implies Pr(x | π, B) = Pr(y | π, B) and thus V_π(x) = V_π(y). Hence, choice propensities in any given context π ∈ Π solely depend on attributes, and we can define a function Ṽ_π: ℝ^n → ℝ_+ such that V_π(x) = Ṽ_π(π_x) for all x ∈ X. Note that this does not rule out presentation effects entirely, but π_x contains the information required to implicitly represent presentation effects for any π.

Hence,

Ṽ_π(π_x + r) / Ṽ_π(π_y + r) = Ṽ_π(π_x) / Ṽ_π(π_y)

for all x, y and r, implying that there exists some function h_π: ℝ^n → ℝ such that Ṽ_π(π_x + r) = Ṽ_π(π_x) ⋅ h_π(r) for all x and r. Defining V̂_π = log Ṽ_π as well as ĥ_π = log h_π, we obtain V̂_π(π_x + r) = V̂_π(π_x) + ĥ_π(r). By positivity, choice probabilities are bounded away from zero, implying that V̂_π = log Ṽ_π is bounded above and below. By surjectivity (A2), V̂_π is defined on sets of positive measure in ℝ for all dimensions k ≤ n of the attribute vector. Thus, all solutions of this fundamental Pexider functional equation satisfy V̂_π(π_x) = β_π ⋅ π_x + c_π for all x, with unique β_π ∈ ℝ^n and some c_π ∈ ℝ (Aczél and Dhombres 1989, Corollary 10, p. 43, in conjunction with Theorem 8, p. 17). Using Ṽ_π = exp V̂_π and the relation of Ṽ_π to V_π, this yields V_π(x) = exp{β_π ⋅ π_x + c_π} for all x ∈ X. Hence,

Pr(x | π, B) = exp{β_π ⋅ π_x} / ∑_{y∈B} exp{β_π ⋅ π_y}

for all x ∈ B and (π, B) ∈ D.

Finally, again by P4 (translation invariance),

exp{β_π ⋅ π_x} / ∑_{y∈B} exp{β_π ⋅ π_y} = exp{β_{π+r} ⋅ π_x} / ∑_{y∈B} exp{β_{π+r} ⋅ π_y},

which implies β_π = β_{π+r} for all r ∈ ℝ^n.

Hence, there exists β ∈ ℝ^n such that β_π = β for all π, implying V_π(x) = exp{β ⋅ π_x + c_π} for all x, and that V has the conditional logit form defined up to linear transformation. ◻
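As a complement to the sufficiency argument, the necessity of translation invariance can be checked in one line. The following worked derivation is a sketch in the notation used here, where the symbol β for the coefficient vector and the subscript notation π_x for the attributes of x are assumed; it also shows that the multiplicative factor in the functional equation is h(r) = exp{β ⋅ r} in the linear case.

```latex
\begin{align*}
\Pr(x \mid \pi + r, B)
  &= \frac{\exp\{\beta \cdot (\pi_x + r)\}}{\sum_{y \in B} \exp\{\beta \cdot (\pi_y + r)\}}
   = \frac{\exp\{\beta \cdot r\}\,\exp\{\beta \cdot \pi_x\}}
          {\exp\{\beta \cdot r\}\,\sum_{y \in B} \exp\{\beta \cdot \pi_y\}} \\
  &= \frac{\exp\{\beta \cdot \pi_x\}}{\sum_{y \in B} \exp\{\beta \cdot \pi_y\}}
   = \Pr(x \mid \pi, B)
  \qquad \text{for all } r \in \mathbb{R}^n .
\end{align*}
```

The factor exp{β ⋅ r} depends on the translation r but not on the option, which is exactly the structure that the functional equation in the proof exploits.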

Incorporating scaling invariance
A range of empirical and experimental studies suggests that π-relevance as introduced above may be too strict. Specifically, choice in many experiments appears to be invariant to scaling option attributes (i.e., to scaling payoffs of players), and this seems to contradict π-relevance. There is a caveat to this, however. The evidence suggesting that behavior is scale invariant stems from comparing observations between experiments (or, between experimental subjects). The most direct evidence that I am aware of is provided by meta-analyses of experimental behavior, which consistently find that decisions are independent of the amounts of money at stake, for example in dictator games (Engel 2011), ultimatum games (Oosterbeek et al. 2004; Cooper and Dutcher 2011) and trust games (Johnson and Mislin 2011). An explanation for such scale invariance, and thus indirect evidence, is provided by the neuroeconomic result called "adaptive coding" (see Tremblay and Schultz 1999, and the recent survey of Camerer et al. 2017): The neuronal representation of option payoffs adapts to the range of feasible payoffs. The baseline activity of the cell encoding the value of a given object adapts to the minimum of the payoff range in a given context, and its peak activity adapts to the maximum of the payoff range. Thus, choice ends up being invariant to scaling, and to changes in background income as indicated above, which appears to falsify the strict form of π-relevance postulated before. 3 However, there is little direct evidence for scale invariance within subjects (see, however, Wilcox 2011, 2015). That is, if subjects are presented with pairs of decision problems that are equivalent up to scaling, in sufficiently quick succession such that the neuronal representation does not adapt, it is not clear that behavior would actually be scale invariant.
Intuitively, if a high-stake decision is immediately followed by a "seemingly trivial" low-stake decision, scale invariance might be violated. I am not aware of experimental evidence directly testing this intuition, but such tests are clearly conceivable.
In general, though, substantial rescaling of option attributes of decision problems presented in quick succession is rarely a concern in analyses. Instead, rescaling tends to be a concern in analyses merging data obtained independently under various conditions, under varying treatments or in different experiments, and in such cases, where scale invariance seems confirmed by meta-analyses, one may wish to adapt π-relevance in order to acknowledge scale invariance. The following two choice postulates weaken π-relevance and strengthen π-invariance in order to additionally acknowledge such scale invariance.

Assumption 3 (Alternative postulates on choice probabilities)
P6 Strong π-invariance: Pr(⋅ | π, B) = Pr(⋅ | a + bπ, B) for all a ∈ ℝ^n and b ∈ ℝ^n_+
P7 Weak π-relevance: π-relevance if csup π − cinf π = csup π′ − cinf π′

If we adopt these postulates instead of P4 and P5, then it follows immediately that the scale of the attribute range must factor out. That is, Pr satisfies P1-P3 and P6 if and only if Pr has a Luce representation with V_π being linear transformations of some function [...] (⊘ is used to denote componentwise division of vectors) where Ṽ_π = Ṽ_{a+bπ} for all a ∈ ℝ^n, b ∈ ℝ^n_+. Additionally invoking P7 yields a multi-attribute variation of the contextual logit model proposed by Wilcox (2011).

Definition 3
The choice profile Pr has a contextual logit representation if there exists β ∈ ℝ^n such that V with V_π(x) = exp{β ⋅ [π_x ⊘ (csup π − cinf π)]} for all x, admits a Luce representation.
Theorem 2 Pr has a contextual logit representation with β_k ≠ 0 for all k ≤ n ⇔ Pr satisfies P1-P3 and P6-P7.
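A minimal Python sketch of the contextual logit form may help to see why rescaling factors out. It assumes a finite option set, so that `min` and `max` stand in for the componentwise infimum and supremum over X, and all names and numbers (`beta`, the contexts, the affine transformation `a`, `b`) are hypothetical. Note that the attribute range is computed over all options in the context, including those not in the menu.

```python
import math

def contextual_logit(context, menu, beta):
    """Probabilities proportional to exp(beta . [pi_x / (csup pi - cinf pi)]).

    context: dict mapping every option in X to its attribute vector
    menu:    list of available options (a subset of the keys of context)
    """
    n = len(beta)
    cinf = [min(v[k] for v in context.values()) for k in range(n)]
    csup = [max(v[k] for v in context.values()) for k in range(n)]

    def score(x):
        return sum(beta[k] * context[x][k] / (csup[k] - cinf[k]) for k in range(n))

    m = max(score(x) for x in menu)  # stabilizing shift, cancels in the ratio
    w = {x: math.exp(score(x) - m) for x in menu}
    total = sum(w.values())
    return {x: w[x] / total for x in menu}

beta = [1.5, -0.7]                                    # hypothetical coefficients
pi = {"x": [2.0, 1.0], "y": [0.0, 4.0], "z": [5.0, 3.0], "u": [1.0, 0.0]}
menu = ["x", "y", "z"]                                # "u" is unavailable but sets the range

# Affine transformation a + b * pi with b > 0 in every dimension.
a, b = [5.0, -2.0], [10.0, 0.5]
pi_scaled = {x: [a[k] + b[k] * v[k] for k in range(2)] for x, v in pi.items()}

p = contextual_logit(pi, menu, beta)
p_scaled = contextual_logit(pi_scaled, menu, beta)
```

Under the transformation, each attribute and the corresponding range are scaled by the same factor, and the remaining additive constant is common to all options and cancels, so `p` and `p_scaled` agree up to floating-point error.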
The formal proof is relegated to "Appendix," as it largely resembles that for conditional logit. In conclusion, let me briefly discuss a few points related to Theorem 2. To begin with, let me clarify the relation to contextual logit as defined by Wilcox (2011). Wilcox considers binary choice from lotteries, where lotteries are distributions over three outcomes z_1 < z_2 < z_3. These three outcomes are called the context of the decision; the specification of the lotteries may change while the context (z_1, z_2, z_3) is held constant. Lotteries over these outcomes are denoted S and T, and utilities from lotteries are denoted V(S) and V(T). Given this, Wilcox defines the contextual logit representation as

Pr(S) = H([V(S) − V(T)] / [V(z_3) − V(z_1)]),

where V(z_3) and V(z_1) denote the utilities from the degenerate lotteries yielding the maximal outcome and minimal outcome (respectively) in context (z_1, z_2, z_3) with probability 1. Given the logit specification of H, this is equivalent to

Pr(S) = exp{f(S)} / (exp{f(S)} + exp{f(T)}) with f(y) = V(y) / (V_max − V_min),

where V_max := V(z_3) and V_min := V(z_1). The above definition of contextual logit straightforwardly extends Wilcox' definition, which considers a decision maker caring about one option attribute (expected utility), to multiple attributes.
One concern about applying the contextual logit model may be that the attributes of unavailable options matter. In Wilcox' formulation, the "context" is defined by the degenerate lotteries that yield either outcome with certainty, which may be unavailable to DM yet tend to be observable by the analyst, but in general, the context of a decision may be subjective and therefore unobservable by the analyst (see, for example, Panizza et al. 2019). In order to capture choice in line with the contextual logit model defined above, it suffices to find some measure for the scale of the choice task and rescale attributes correspondingly, but to the extent the scale is subjective, rescaling may not be trivial in all applications.
Finally, invariance with respect to heterogeneous scaling across dimensions, i.e., with respect to scaling the attribute vector by a vector b ∈ ℝ^n, can be shown (analogously to the proof of Theorem 1) to yield a so-called strict utility model: V(π(x)) = ∏_k π_k(x)^{β_k}. To see this, note that scaling invariance is equivalent to translation invariance of logarithmized attributes. 4 Along these lines, many more models of stochastic choice may be found to have behavioral foundations in invariance assumptions.

Now, fix π ∈ Π such that csup π − cinf π = 1 and cinf π = 0 (such π exists by transformability A1). Hence, using Ṽ_π as defined in Step 1, i.e., conditional on the fixed context π, we may follow the arguments in the proof of Theorem 1 up to Eq. (9) and obtain (abusing the fraction sign to denote componentwise division ⊘) [...] with β_π ∈ ℝ^n and c_π ∈ ℝ, where β_{π+r} = β_π for all r. As demonstrated in Step 1, this implies for any π′ such that π′ = a + bπ for some a ∈ ℝ^n and b ∈ ℝ^n_+, still abusing the fraction sign to denote componentwise division ⊘, [...] for all B ∈ P(X), x ∈ B. Hence, for any such π′, [...] with β_{π′} = β_π.
Step 3 (Weak context and presentation independence): Let Π′ ⊂ Π denote the set of contexts such that π ∈ Π′ if and only if csup π − cinf π = 1. Given this, we can follow Step 3 in the proof of Theorem 1 to establish that there exists β ∈ ℝ^n such that, for all π ∈ Π′, [...] for some c ∈ ℝ (which cancels out). The claimed extension to contexts with csup π − cinf π ≠ 1 follows directly from P7. ◻