1 Introduction

Consider the following simple linear regression model:

$$\begin{aligned} y=x\cdot \theta +\epsilon , \end{aligned}$$
(1)

where y is a dependent variable, x is a d-dimensional random vector, \(\epsilon \) is an error term, \(\theta =(\theta ^1,\dots ,\theta ^d)\in \mathbb {R}^d\), and \(x\cdot \theta =\sum _{i=1}^d x_i\theta ^i\). The model seems to be very “flat” owing to its linear appearance. If we change the parameter of the model as follows:

$$\begin{aligned} \theta ^i \mapsto 1/\xi _i \end{aligned}$$

for each \(i=1,\dots , d\), the model becomes a nonlinear regression model,

$$\begin{aligned} y=\sum _{i=1}^d \frac{x_i}{\xi _i}+\epsilon , \end{aligned}$$

which does not appear very “flat” anymore, although the nature of the model remains unchanged. This rather simple example highlights the importance of the geometric point of view in understanding the shape of statistical models: flatness of a statistical model must be defined independently of the choice of parameters, that is, in the manner of information geometry.

In econometrics, information geometry has been used to characterize the flat nature of statistical models, including the standard linear regression model, Poisson regression, Wald tests, the ARMA model, and many other examples [4, 5, 9, 11]. The objective of this study is to explore its application to binary choice models.

In the binary choice model, the value of dependent variable y can be 1 or 0, based on whether or not some event occurs. The standard model is represented as

$$\begin{aligned} y= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if}\quad x\cdot \theta \ge \epsilon \\ 0 &{}\quad \text {if}\quad x\cdot \theta < \epsilon , \end{array}\right. } \end{aligned}$$

where x is an \(\mathbb {R}^d\)-valued random vector distributed according to density \(p_X(x)\), \(\theta \in \mathbb {R}^d\), and \(\epsilon \) is a random term independent of x. The choice probability is given by

$$\begin{aligned} \textbf{P}\{y=1\mid x\}=\textbf{P}\{\epsilon \le x\cdot \theta \mid x\}=F(x\cdot \theta ), \end{aligned}$$

where F is the distribution function of \(\epsilon \). The joint density function of \((y,x)\in \{0,1\}\times \mathbb {R}^d\) is

$$\begin{aligned} p_\theta (y,x)=F(x\cdot \theta )^y(1-F(x\cdot \theta ))^{1-y}p_X(x). \end{aligned}$$
(2)

The model is commonly used in social sciences to describe the choices made by decision-makers between two alternatives. These alternatives may represent school, labor supply, marital status, or transportation choices. See [10, 15] for a list of empirical applications. In particular, the model is referred to as the probit model when F is the standard normal distribution, and as the logit model when F is the standard logistic distribution:

$$\begin{aligned} F(u)=\frac{\exp u}{1+\exp u},\quad u\in \mathbb {R}. \end{aligned}$$
(3)

The probit model is often considered a plausible model owing to its normally distributed random errors, whereas the logit model is considered merely as a closed-form approximation of the probit. Contrary to this common belief, we argue that the logit model is the most natural model among the parametric binary choice models from the point of view of information geometry.
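For concreteness, the following sketch (ours, not part of the original exposition) simulates data from model (2) for a user-supplied distribution function F; `scipy.stats.norm.cdf` gives the probit case and expression (3) the logit case. The uniform choice of \(p_X\) and the function name `simulate_binary_choice` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def logistic_cdf(u):
    # F of Eq. (3): exp(u) / (1 + exp(u)), written in a numerically stable form
    return 1.0 / (1.0 + np.exp(-u))

def simulate_binary_choice(theta, n, F=logistic_cdf):
    """Draw n pairs (y, x) from the joint density (2):
    x is uniform on a compact set (cf. (A2)) and y = 1 with probability F(x . theta)."""
    d = len(theta)
    x = rng.uniform(-1.0, 1.0, size=(n, d))   # x ~ p_X with compact support
    y = rng.binomial(1, F(x @ theta))         # binary outcome
    return y, x

# Logit and probit samples with the same coefficient vector
y_logit, x_logit = simulate_binary_choice(np.array([1.0, -0.5]), n=1_000)
y_probit, x_probit = simulate_binary_choice(np.array([1.0, -0.5]), n=1_000, F=norm.cdf)
```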

The remainder of this paper is organized as follows. In Sect. 2, the geometry of the binary choice model is formulated and the model is shown to be a dually flat space. In Sect. 3, the logit model is investigated in detail as a special case of the framework in Sect. 2. The canonical divergence and the Kullback–Leibler (KL) divergence are introduced to the model. We demonstrate that the logit model is the unique model whose canonical and KL divergences coincide. In Sect. 4, we offer a geometric interpretation of the maximum likelihood estimation of the binary choice model. In Sect. 5, we summarize the conclusions of this study.

2 Geometry of the binary choice model

The model set is given by

$$\begin{aligned} \mathcal {P}=\{p_\theta \mid \theta \in \Theta \}, \end{aligned}$$

where \(\Theta \) is an open subset of \(\mathbb {R}^d\). This study is based on the following technical assumptions:

  1. (A1)

    F is an infinitely differentiable function on \(\mathbb {R}\) with positive derivative \(f=F'>0\);

  2. (A2)

    x has a compact support \(\mathcal {X}\subset \mathbb {R}^d\) such that \(\mathcal {X}^{int}\not =\emptyset \).

Under these assumptions, model \(\mathcal {P}\) is regarded as a d-dimensional \(C^\infty \) manifold with a canonical coordinate system \(\Theta \rightarrow \mathcal {P}\), \(\theta \mapsto p_\theta \).

Proposition 1

The coordinate system \(\Theta \rightarrow \mathcal {P}\), \(\theta \mapsto p_\theta \), is bijective.

Proof

Assume that there exists \(\theta \not =\theta '\) such that \(p_\theta =p_{\theta '}\). Then, \(F(x\cdot \theta )=F(x\cdot \theta ')\) holds for every \(x\in \mathcal {X}\). Because (A1) implies that F is strictly monotone, \(x\cdot \theta =x\cdot \theta '\) for every \(x\in \mathcal {X}\); hence \(\mathcal {X}\subset \{x\in \mathbb {R}^d\mid x\cdot (\theta -\theta ')=0\}\), a hyperplane with empty interior, which contradicts \(\mathcal {X}^{int}\not =\emptyset \). \(\square \)

Unless otherwise specified, \(\theta \) is used as the (global) coordinate of manifold \(\mathcal {P}\) hereafter.

For every \(p=p_\theta \), let \(E_p\) be the expectation operator defined as

$$\begin{aligned} E_p\beta (y,x)= & {} \int \beta (y,x)p(y,x)\,dydx\\= & {} \int \beta (1,x)F(x\cdot \theta )p_X(x)\, dx+\int \beta (0,x)(1-F(x\cdot \theta ))p_X(x)\, dx \end{aligned}$$

for an arbitrary measurable function \(\beta \) of \((y,x)\). The conditional expectation operator \(E_p[\,\cdot \mid x\,]\) is also defined as follows:

$$\begin{aligned} E_p\left[ \,\beta (y,x)\mid x\,\right] = \!\int \! \beta (y,x)p(y|x)\,dy = \beta (1,x)F(x\cdot \theta )+\beta (0,x)(1-F(x\cdot \theta )). \end{aligned}$$

In particular, \(E_p[\,y\mid x\,]=F(x\cdot \theta )\) holds. The expectation operator with respect to x is simply denoted by E because the value of \(E\beta (x)=\int _{\mathcal {X}}\beta (x)p_X(x)\, dx\) is independent of \(\theta \).

The score function of \(p=p_\theta \) is

$$\begin{aligned} \frac{\partial }{\partial \theta }\log p (y,x)=\frac{y-F(x\cdot \theta )}{F(x\cdot \theta )(1-F(x\cdot \theta ))}f(x\cdot \theta )x. \end{aligned}$$
(4)

Because

$$\begin{aligned} E_p[\,(y-F(x\cdot \theta ))^2\mid x\,]=E_p[\,y\mid x\,]-F(x\cdot \theta )^2=F(x\cdot \theta )(1-F(x\cdot \theta )), \end{aligned}$$

the Fisher information matrix \(G(\theta )\) is given as

$$\begin{aligned} G(\theta )= E_p\left( \frac{\partial }{\partial \theta }\log p\right) \left( \frac{\partial }{\partial \theta }\log p\right) ^\top =E\left[ \frac{ f(x\cdot \theta )^2 }{F(x\cdot \theta )(1-F(x\cdot \theta ))}xx^\top \right] . \end{aligned}$$

For simplicity, define \(r:\mathbb {R}\rightarrow \mathbb {R}_{++}\) as

$$\begin{aligned} r(u)=\frac{ f(u)^2 }{ F(u) (1-F(u) ) } \end{aligned}$$
(5)

for every \(u\in \mathbb {R}\), so that \(G(\theta )=E\left[ r(x\cdot \theta )xx^\top \right] \). By the assumptions, r is bounded on an arbitrary compact interval. Because x has compact support, \(G(\theta )\) is finite at every \(\theta \in \Theta \). In addition, we assume that

  1. (A3)

    \(G(\theta )\) is positive definite at every \(\theta \in \Theta \).
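As a numerical aside (our sketch, under the illustrative assumption that \(p_X\) is uniform on \([-1,1]^d\)), r in (5) and the Fisher information \(G(\theta )=E[r(x\cdot \theta )xx^\top ]\) can be approximated by a Monte Carlo average over draws of x; the logit and probit cases are compared below.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def r_factor(u, F, f):
    # r(u) = f(u)^2 / (F(u)(1 - F(u))) as in Eq. (5)
    Fu = F(u)
    return f(u) ** 2 / (Fu * (1.0 - Fu))

def fisher_information(theta, F, f, n_draws=100_000):
    """Monte Carlo approximation of G(theta) = E[r(x.theta) x x^T]."""
    d = len(theta)
    x = rng.uniform(-1.0, 1.0, size=(n_draws, d))   # draws standing in for p_X
    w = r_factor(x @ theta, F, f)
    return (x * w[:, None]).T @ x / n_draws         # average of r(x.theta) x x^T

theta = np.array([1.0, -0.5])
logit_F = lambda u: 1.0 / (1.0 + np.exp(-u))
logit_f = lambda u: logit_F(u) * (1.0 - logit_F(u))      # for the logit, f = F(1 - F)
G_logit = fisher_information(theta, logit_F, logit_f)    # here r(u) reduces to f(u)
G_probit = fisher_information(theta, norm.cdf, norm.pdf)
```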

The tangent space of \(\mathcal {P}\) at \(p=p_\theta \) is \(T_p\mathcal {P}=\text {Span}\left\{ (\partial _1)_p,\ldots , (\partial _d)_p \right\} \), where \(\partial _i= \frac{\partial }{\partial \theta _i}\) for \(i=1,\ldots , d\). For example, the unconditional expectation \(E_p y\) is obtained as

$$\begin{aligned} E_p y=E\Bigl ( E_\theta [\,y\mid x\,]\Bigr )=\int _{\mathcal {X}}F(x\cdot \theta )p_X(x)\, dx, \end{aligned}$$

which is a smooth function on \(\mathcal {P}\). A tangent vector \(X=X^i(\partial _i)_p\) operates on this as

$$\begin{aligned} X(E_p y)= & {} X^i (\partial _i)_p\int _{\mathcal {X}}F(x\cdot \theta )p_X(x)\, dx\\= & {} X^i \int _{\mathcal {X}}x_if(x\cdot \theta )p_X(x)\, dx=\sum _{i=1}^d X^i E\left[ f(x\cdot \theta )x_i\right] . \end{aligned}$$

Moreover, at every (yx),

$$\begin{aligned} X(\log p(y,x))= & {} X^i\left( \partial _i\right) _p (y \log F(x\cdot \theta )+(1-y)\log (1-F(x\cdot \theta ))+\log p_X(x))\\= & {} X^i\left( \frac{y-F(x\cdot \theta )}{ F(x\cdot \theta )(1-F(x\cdot \theta )) }f(x\cdot \theta )x_i \right) . \end{aligned}$$

The Fisher information metric g on \(\mathcal {P}\) is introduced as

$$\begin{aligned} g_p(X,Y)=E_p\left( X\log p(y,x) \right) \left( Y\log p(y,x) \right) =X^iY^jg_{ij}(p), \end{aligned}$$

where \(g_{ij}(p)=E\left[ r(x\cdot \theta )x_ix_j \right] \) is the (ij) element of \(G(\theta )\).

Given metric g, the binary choice model is considered to be a Riemannian manifold \((\mathcal {P},g)\). Moreover, function \(\psi :\Theta \rightarrow \mathbb {R}\) is defined as

$$\begin{aligned} \psi (\theta )= E\left[ \int _0^{x\cdot \theta }\left( \int _0^v r(u)\, du \right) \, dv \right] . \end{aligned}$$
(6)

Then,

$$\begin{aligned} \partial _i \partial _j \psi (\theta ) =\frac{\partial }{\partial \theta ^i} E\left[ \left( \int _0^{x\cdot \theta } r(u)\, du \right) x_j \right] = E\left[ r({x\cdot \theta }) x_ix_j \right] = g_{ij}(p). \end{aligned}$$

Hence, \((\mathcal {P},g)\) is a Hessian manifold with potential \(\psi \) when it is equipped with flat connections [12, 14]. For later convenience, we introduce gradient \(\partial \psi :\Theta \rightarrow \mathbb {R}^d\) as

$$\begin{aligned} \partial \psi (\theta )= \left[ \begin{array}{c} \partial _1\psi (\theta ) \\ \vdots \\ \partial _d\psi (\theta ) \end{array} \right] =E \left[ \left( \int _0^{x\cdot \theta }r(u)\, du \right) x \right] , \end{aligned}$$

where \(\partial _i\psi (\theta )=E \left[ \left( \int _0^{x\cdot \theta }r(u)\, du \right) x_i \right] \) for \(i=1,\dots , d\). Because the Hessian \(\partial ^2\psi (\theta )=G(\theta )\) is positive definite by (A3), an inverse mapping \((\partial \psi )^{-1}: \partial \psi (\Theta ) \subset \mathbb {R}^d \rightarrow \mathbb {R}^d\) exists and is continuously differentiable at every point.

Let \(\mathfrak {X}(\mathcal {P})\) denote the class of \(C^\infty \) tangent vector fields on \(\mathcal {P}\). For \(X=X^i \partial _i\), \(Y=Y^j\partial _j \in \mathfrak {X}(\mathcal {P})\), the Levi-Civita connection \(\nabla \) of \((\mathcal {P},g)\) is introduced as

$$\begin{aligned} \nabla _X Y=X^i(\partial _i Y^j) \partial _j+X^iY^j\Gamma _{ij}^k\partial _k, \end{aligned}$$
(7)

where \(\Gamma _{ij}^k\) is the Christoffel symbol:

$$\begin{aligned} \Gamma _{ij}^k(\theta )= & {} \frac{1}{2}\Bigl [ \partial _i g_{l j}(\theta )+ \partial _j g_{il}(\theta )- \partial _l g_{ij}(\theta ) \Bigr ] g^{kl}(\theta )\nonumber \\= & {} \frac{1}{2} E\left[ r'(x\cdot \theta ) x_ix_jx_l \right] g^{kl}(\theta ) \end{aligned}$$
(8)

for \(i,j,k\in \{1,\dots ,d\}\), where \(g^{kl}\) denotes the (k, l) element of \(G(\theta )^{-1}\). Let \(\Gamma _{ijk}(\theta )=\Gamma _{ij}^l(\theta )g_{kl}(\theta )=\frac{1}{2} E\left[ r'(x\cdot \theta ) x_ix_jx_k \right] \) so that

$$\begin{aligned} \Gamma _{ijk}=\frac{1}{2}\partial _i g_{jk}=\frac{1}{2}\partial _j g_{k i}=\frac{1}{2}\partial _k g_{ij}. \end{aligned}$$
(9)

The curvature and torsion tensors \(R:\mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\rightarrow \mathfrak {X}(\mathcal {P})\) and \(T:\mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\rightarrow \mathfrak {X}(\mathcal {P})\) of \(\nabla \) are, respectively, defined as

$$\begin{aligned} R(X,Y,Z)=\nabla _X\left( \nabla _Y Z \right) -\nabla _Y\left( \nabla _X Z \right) -\nabla _{[X,Y]}Z \end{aligned}$$
(10)

and

$$\begin{aligned} T(X,Y)=\nabla _X Y -\nabla _Y X - [X,Y], \end{aligned}$$
(11)

where \([X,Y]=X^i (\partial _iY^j)\partial _j-Y^j(\partial _jX^i)\partial _i\).

Proposition 2

Let \(\nabla \) be the Levi-Civita connection (7) with coefficients (8). Then,

$$\begin{aligned} R_{ijk}:=R(\partial _i,\partial _j,\partial _k)= \left( \Gamma _{ik}^m\Gamma _{j m}^l - \Gamma _{j k}^m\Gamma _{im}^l \right) \partial _{l} \end{aligned}$$
(12)

for \(i,j,k\in \{1,\dots ,d\}\) and \(T\equiv 0\).

Proof

Using (8), \(T(\partial _i,\partial _j)=(\Gamma _{ij}^k-\Gamma _{ji}^k)\partial _k=0\) is trivially shown. Because \(g_{mh}g^{hl}=1\) if \(m=l\) and 0 if \(m\not = l\),

$$\begin{aligned} \partial _i(g_{mh}g^{hl}) =(\partial _ig_{mh})g^{hl}+g_{mh}(\partial _ig^{hl}) = 2\Gamma _{im}^{l}+g_{mh}(\partial _ig^{hl}) = 0, \end{aligned}$$

which implies \(\partial _i g^{hl}=-2\Gamma _{im}^l g^{mh}\). Using the definition of the curvature tensor,

$$\begin{aligned} R_{ijk}&= \nabla _{\partial _i}(\Gamma _{jk}^l \partial _l)-\nabla _{\partial _j}(\Gamma _{ik}^l \partial _l)\\&=\left\{ \partial _i(\Gamma _{jkh}g^{hl})\partial _l+ \Gamma _{jk}^l\Gamma _{il}^h \partial _h \right\} -\left\{ \partial _j(\Gamma _{ikh}g^{hl})\partial _l+ \Gamma _{ik}^l\Gamma _{jl}^h \partial _h \right\} \\&= \left( \partial _i\Gamma _{jkh}- \partial _j\Gamma _{ikh} \right) g^{hl}\partial _l \\&\quad -2 \left( \Gamma _{jkh}\Gamma _{im}^l-\Gamma _{ikh}\Gamma _{jm}^l\right) g^{mh}\partial _l +\left( \Gamma _{im}^l \Gamma _{j k}^m-\Gamma _{j m}^l \Gamma _{ik}^m \right) \partial _{l}. \end{aligned}$$

Because \(\partial _i\Gamma _{jkh}=\partial _j\Gamma _{ikh}=\frac{1}{2}E\left[ r''(x\cdot \theta )x_ix_jx_kx_h \right] \), the first term vanishes; writing \(\Gamma _{jkh}g^{mh}=\Gamma _{jk}^m\) and \(\Gamma _{ikh}g^{mh}=\Gamma _{ik}^m\) and combining the remaining terms yields Eq. (12). \(\square \)

Proposition 2 implies that the curvature of the binary choice model under the Fisher information geometry is built entirely from the Christoffel symbols (8); in this sense, the model is essentially a flat manifold. Let S be an arbitrary symmetric (0, 3)-tensor on \(\mathcal {P}\). A family of \(\alpha \)-connections \(\{\nabla ^{(\alpha )}\}_{\alpha \in \mathbb {R}}\) is defined as

$$\begin{aligned} g(\nabla ^{(\alpha )}_XY,Z)=g(\nabla _XY,Z)- \alpha S(X,Y,Z) \end{aligned}$$

for every \(\alpha \in \mathbb {R}\). The corresponding connection coefficients are given by

$$\begin{aligned} \Gamma _{ijk}^{(\alpha )}=\Gamma _{ijk}-\alpha S_{ij k}, \end{aligned}$$

where \(S_{ijk}=S(\partial _i,\partial _j,\partial _k)\). See chapter 6 of [2] for definitions and details of \(\alpha \)-connections.

A pair \((\nabla ^{( \alpha )}, \nabla ^{( -\alpha )})\) of the connections provides the dual connections of \((\mathcal {P},g)\) such that

$$\begin{aligned} X( g (Y,Z) )=g ( \nabla _X^{( \alpha )}Y,Z )+g (Y, \nabla _X^{( -\alpha )}Z ) \end{aligned}$$

because

$$\begin{aligned} X( g (Y,Z) )= & {} g ( \nabla _X Y,Z )+g (Y, \nabla _X Z ) \\= & {} \left\{ g ( \nabla _X Y,Z ) - \alpha S (X,Y,Z) \right\} + \left\{ g (Y, \nabla _X Z )-\left( - \alpha S (Y,X,Z)\right) \right\} \\= & {} g ( \nabla _X^{( \alpha )} Y,Z ) + g (Y, \nabla _X^{(-\alpha )} Z ) \end{aligned}$$

by symmetry \(S(X,Y,Z)=S(Y,X,Z)\). Let \(R^{( \alpha )}\) and \(T^{(\alpha )}\) be the curvature and torsion tensors of \(\nabla ^{( \alpha )}\), respectively. When \(R^{( \alpha )}=R^{( -\alpha )}=0\) and \(T^{( \alpha )}=T^{( -\alpha )}=0\) hold, \((\mathcal {P},g,\nabla ^{( \alpha )},\nabla ^{( -\alpha )})\) is said to be a dually flat space. A dually flat space has the dual affine coordinates \((\xi ,\zeta )\), where \(\xi =(\xi ^1,\dots ,\xi ^d)\) is the \(\nabla ^{( \alpha )}\)-affine coordinate such that \(\Gamma _{ijk}^{(\alpha )}(\xi )\equiv 0\) and \(\zeta =(\zeta _1,\dots ,\zeta _d)\) is the \(\nabla ^{(-\alpha )}\)-affine coordinate such that \(\Gamma _{ijk}^{(-\alpha )}(\zeta )\equiv 0\). Furthermore,

$$\begin{aligned} g \left( \frac{\partial }{\partial \xi ^i},\, \frac{\partial }{\partial \zeta _j} \right) =\delta _i^j=\left\{ \begin{array}{cc} 1 &{}\quad (i=j) \\ 0 &{}\quad (i \not = j). \end{array}\right. \end{aligned}$$

Theorem 1

Let S be a (0, 3)-tensor on \((\mathcal {P},g)\) given by \(S(X,Y,Z)=X^iY^jZ^k S_{ijk}\) with

$$\begin{aligned} S_{ijk}(p)=\Gamma _{ijk}(\theta )=\frac{1}{2}E_\theta \left[ r'(x\cdot \theta )x_ix_jx_k \right] . \end{aligned}$$

For \(\alpha =\pm 1\), let \(\nabla ^{(\alpha )}\) be defined as

$$\begin{aligned} g(\nabla ^{( \alpha )}_XY,Z)=g(\nabla _XY,Z) -\alpha S(X,Y,Z). \end{aligned}$$

Then, \((\mathcal {P},g,\nabla ^{( +1 )},\nabla ^{( -1 )})\) is a dually flat space with dual affine coordinates \((\theta ,\eta )\), where \(\eta =\partial \psi (\theta )\).

Proof

While the results can be obtained immediately from the general theory of Hessian manifolds [12, 14], we give a direct proof to keep the paper self-contained.

By the definition of S, \(\Gamma _{ij}^{(+1)k}(\theta )\equiv 0\) trivially holds. To confirm that \(\eta =\partial \psi (\theta )\) is the \(\nabla ^{(-1)}\)-affine coordinate, let \(\Gamma _{ab}^{(-1)c}(\eta )\) for \(a,b,c\in \{1,\dots ,d\}\) be the \(\nabla ^{(-1)}\)-connection coefficients expressed in terms of \(\eta \). By the definition of \(\eta \), \(\partial _k \eta _l=g_{kl}\) holds. This implies that \(\frac{\partial \theta ^k}{\partial \eta _l}=g^{kl}\) and that

$$\begin{aligned} \frac{ \partial ^2 \eta _l }{ \partial \theta ^i \partial \theta ^j }\frac{ \partial \theta ^k }{ \partial \eta _l } =E\left[ r'({x\cdot \theta }) x_i x_j x_l \right] g^{kl}=2\Gamma _{ij}^k(\theta ). \end{aligned}$$

From the definition of \(\nabla ^{(-1)}\), \(\Gamma _{ij}^{( -1 )k}(\theta )=2\Gamma _{ij}^{k}(\theta )\). By the change-of-variables formula for the connection coefficients,

$$\begin{aligned} \Gamma _{ij}^{(-1)k}(\theta )= & {} \frac{ \partial ^2 \eta _l }{ \partial \theta ^i \partial \theta ^j }\frac{ \partial \theta ^k }{ \partial \eta _l } +\frac{ \partial \eta _a }{ \partial \theta ^i}\frac{ \partial \eta _b }{ \partial \theta ^j} {\Gamma }_{ab}^{(-1)c}(\eta )\frac{ \partial \theta ^k }{ \partial \eta _c }\\= & {} 2\Gamma _{ij}^{k}(\theta ) +g_{ai}g_{bj}{\Gamma }_{ab}^{(-1)c}(\eta )g^{kc}, \end{aligned}$$

which implies

$$\begin{aligned} \Gamma ^{(-1)c}_{ab}(\eta )=g^{ai}g^{bj}\left( \Gamma _{ij}^{(-1)k}(\theta )-2\Gamma _{ij}^{k}(\theta ) \right) g_{kc} =0. \end{aligned}$$

Moreover,

$$\begin{aligned} g\left( \frac{\partial }{\partial \theta ^i},\ \frac{\partial }{\partial \eta _j} \right) =\frac{\partial \theta ^k}{\partial \eta _j}g\left( \frac{\partial }{\partial \theta ^i},\ \frac{\partial }{\partial \theta ^k} \right) =g^{jk}g_{ik}=\delta ^j_i. \end{aligned}$$

\(\square \)

Corollary 1

For \(\alpha =\pm 1\), the \(\nabla ^{(\alpha )}\)-geodesic path \(\gamma ^{(\alpha )}=\{\gamma ^{(\alpha )}_t|\, 0\le t\le 1\}\) connecting \(p,q\in \mathcal {P}\) is given by

$$\begin{aligned} \gamma ^{(\alpha )}_t(y,x)=F( x\cdot \theta ^{(\alpha )}_t )^y(1-F( x\cdot \theta ^{(\alpha )}_t ))^{1-y}p_X(x), \end{aligned}$$

where

$$\begin{aligned} \theta ^{(+1)}_t=(1-t)\theta _p+t\theta _q \end{aligned}$$

and

$$\begin{aligned} \theta ^{(-1)}_t=(\partial \psi )^{-1}( (1-t)\eta _p+t\eta _q). \end{aligned}$$
(13)

In particular, (13) is a solution to the ordinary differential equation,

$$\begin{aligned} \frac{d}{dt}\theta _t^{(-1)}=G\left( \theta _t^{(-1)}\right) ^{-1}(\eta _q-\eta _p), \end{aligned}$$

with initial condition \(\theta _0=\theta _p\).

Proof

Let \(\eta _t^{(-1)}=\partial \psi (\theta _t^{(-1)})=(1-t)\eta _p+t\eta _q\). Then,

$$\begin{aligned} \frac{d}{dt}\eta _t^{(-1)}=\partial ^2 \psi \left( \theta _t^{(-1)}\right) \frac{d}{dt}\theta _t^{(-1)}= G\left( \theta _t^{(-1)}\right) \frac{d}{dt}\theta _t^{(-1)} =\eta _q-\eta _p. \end{aligned}$$

\(\square \)
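As an illustration of Corollary 1 (our sketch, specialized to the standard logit so that \(r=f\) and \(\int _0^{v}r(u)\,du=F(v)-F(0)\); the sample standing in for \(p_X\) is an assumption), the \(\nabla ^{(-1)}\)-geodesic can be traced by Euler integration of the ordinary differential equation above.

```python
import numpy as np

rng = np.random.default_rng(2)
x_sample = rng.uniform(-1.0, 1.0, size=(50_000, 2))   # draws standing in for p_X
F = lambda u: 1.0 / (1.0 + np.exp(-u))                 # standard logistic distribution

def eta_of_theta(theta):
    # eta = dpsi(theta) = E[(int_0^{x.theta} r(u) du) x]; for the logit the inner
    # integral is F(x.theta) - F(0) = F(x.theta) - 1/2
    w = F(x_sample @ theta) - 0.5
    return x_sample.T @ w / len(x_sample)

def fisher(theta):
    # G(theta) = E[r(x.theta) x x^T]; for the logit r = f = F(1 - F)
    p = F(x_sample @ theta)
    w = p * (1.0 - p)
    return (x_sample * w[:, None]).T @ x_sample / len(x_sample)

def dual_geodesic(theta_p, theta_q, n_steps=200):
    """Euler steps for d theta/dt = G(theta)^{-1} (eta_q - eta_p), theta_0 = theta_p."""
    rhs = eta_of_theta(np.asarray(theta_q)) - eta_of_theta(np.asarray(theta_p))
    theta, path, dt = np.array(theta_p, float), [np.array(theta_p, float)], 1.0 / n_steps
    for _ in range(n_steps):
        theta = theta + dt * np.linalg.solve(fisher(theta), rhs)
        path.append(theta.copy())
    return np.array(path)       # path[-1] should be close to theta_q

path = dual_geodesic([0.5, -0.5], [1.0, 1.0])
```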

3 Two divergences of the binary choice model

The dual potential \(\varphi \) of \(\psi \) is given as

$$\begin{aligned} \varphi (\eta )=\max _{\theta }\ \eta \cdot \theta -\psi (\theta ), \end{aligned}$$

which is the Legendre transformation of \(\psi (\theta )\). Because \(\theta \mapsto \eta \cdot \theta -\psi (\theta )\) is strictly concave by (A3), the maximum of \(\eta \cdot \theta -\psi (\theta )\) is attained at \(\theta =(\partial \psi )^{-1}(\eta )\), which is a solution to the first-order condition, \(\eta -\partial \psi (\theta )=0\). Let \(\theta _p\) and \(\eta _p\) denote the canonical coordinate and its dual at \(p\in \mathcal {P}\), respectively. Then, the dual potential is explicitly given as

$$\begin{aligned} \varphi (\eta _p)= & {} \eta _p\cdot \theta _p-\psi (\theta _p) \\= & {} E\left[ \left( \int _0^{x\cdot \theta _p}r(u)\, du \right) x \right] \cdot \theta _p-E\left[ \int _0^{x\cdot \theta _p} \left( \int _0^vr(u)\, du \right) dv \right] . \end{aligned}$$

In general, for a dually flat space with dual affine coordinates \((\theta ,\eta )\) and dual potentials \((\psi ,\varphi )\), the canonical divergence between p and q in \(\mathcal {P}\) is defined as follows [3, 7, 8]:

$$\begin{aligned} D(p\, \Vert \, q)=\varphi (\eta _p)+\psi (\theta _q)-\eta _p\cdot \theta _q. \end{aligned}$$
(14)

For binary choice model (2),

$$\begin{aligned} D(p \,\Vert \, q)= & {} \left\{ E\left[ \left( \int _0^{x\cdot \theta _p}r(u)\, du \right) x \right] \cdot \theta _p-E\left[ \int _0^{x\cdot \theta _p} \left( \int _0^vr(u)\, du \right) dv \right] \right\} \nonumber \\{} & {} +E\left[ \int _0^{x\cdot \theta _q} \left( \int _0^v r(u)du \right) dv \right] -E\left[ \left( \int _0^{x\cdot \theta _p} r(u)du \right) x \right] \cdot \theta _q \nonumber \\= & {} E\left[ \int _{x\cdot \theta _p}^{x\cdot \theta _q} \left( \int _{x\cdot \theta _p}^v r(u)du \right) dv \right] . \end{aligned}$$
(15)

For given p, the function \(\theta \mapsto D(p\,\Vert \, p_\theta )\) is strictly convex because a direct computation shows

$$\begin{aligned} \partial _i\partial _j D(p\,\Vert \, p_\theta )= \frac{\partial ^2}{\partial \theta ^i \partial \theta ^j}\Bigl ( \varphi (\eta _p)+\psi (\theta )-\eta _p\cdot \theta \Bigr )=g_{ij}(\theta ). \end{aligned}$$
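For a general F, the iterated integral in (15) can be evaluated numerically. The sketch below (ours) uses the identity \(\int _{a}^{b}\int _{a}^{v} r(u)\,du\,dv=\int _a^b (b-u)r(u)\,du\) (Fubini), so that each draw of x requires a single quadrature; the sample standing in for \(p_X\) is an illustrative assumption.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(3)
x_sample = rng.uniform(-1.0, 1.0, size=(500, 2))   # draws standing in for p_X

def r_factor(u, F, f):
    Fu = F(u)
    return f(u) ** 2 / (Fu * (1.0 - Fu))            # r(u) of Eq. (5)

def canonical_divergence(theta_p, theta_q, F, f):
    """Approximate D(p || q) of Eq. (15), averaging over the sample of x.
    The inner double integral is reduced to int_a^b (b - u) r(u) du."""
    a, b = x_sample @ theta_p, x_sample @ theta_q
    vals = [quad(lambda u, hi=hi: (hi - u) * r_factor(u, F, f), lo, hi)[0]
            for lo, hi in zip(a, b)]
    return float(np.mean(vals))

D_probit = canonical_divergence(np.array([1.0, 0.0]), np.array([0.5, 0.5]),
                                norm.cdf, norm.pdf)
```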

Theorem 2

Let p, q, r \(\in \mathcal {P}\). Let \(\gamma ^{(-1)}=(\gamma ^{(-1)}_t)_{0\le t \le 1}\) be the \(\nabla ^{(-1)}\)-geodesic path connecting p and q, and let \(\gamma ^{(+1)}=(\gamma ^{(+1)}_s)_{0\le s \le 1}\) be the \(\nabla ^{(+1)}\)-geodesic path connecting q and r. Then, \(\gamma ^{(-1)}\) and \(\gamma ^{(+1)}\) are orthogonal at the intersection q, in the sense that

$$\begin{aligned} g_q \left( \left( \frac{d}{dt}\right) _q \gamma _t^{(-1)},\left( \frac{d}{ds}\right) _q {\gamma }_s^{(+1)} \right) =0, \end{aligned}$$

if and only if

$$\begin{aligned} D(p\,\Vert \, r)= D(p\,\Vert \, q)+D(q\,\Vert \, r). \end{aligned}$$
(16)

Proof

The result is standard. See e.g., [3, 7, 13] for a proof. \(\square \)

Corollary 2

The Pythagorean formula (16) holds if and only if \((\eta _p-\eta _q)\cdot (\theta _q-\theta _r)=0\).

Proof

From Corollary 1,

$$\begin{aligned} \left( \frac{d}{dt}\right) _q\gamma ^{(-1)}_t=g^{jk}(q)((\eta _q)_k-(\eta _p)_k)\left( \frac{\partial }{\partial \theta ^j}\right) _q \end{aligned}$$

and

$$\begin{aligned} \left( \frac{d}{ds}\right) _q\gamma ^{(+1)}_s=(\theta _r^i-\theta _q^i)\left( \frac{\partial }{\partial \theta ^i}\right) _q. \end{aligned}$$

Therefore,

$$\begin{aligned} g_q \left( \left( \frac{d}{dt}\right) _q \gamma _t^{(-1)},\left( \frac{d}{ds}\right) _q {\gamma }_s^{(+1)} \right)&= g_{ij}(q)g^{jk}(q)((\eta _q)_k-(\eta _p)_k)(\theta _r^i-\theta _q^i)\\&= (\eta _p-\eta _q)\cdot (\theta _q-\theta _r). \end{aligned}$$

\(\square \)
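The following self-contained sketch (ours; logit case, with a fixed sample standing in for \(p_X\)) verifies Corollary 2 numerically: choosing \(\theta _r\) so that \((\eta _p-\eta _q)\cdot (\theta _q-\theta _r)=0\) makes the two sides of (16) agree up to floating-point error. The closed form of D used below is the one derived for the logit later in this section.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=(100_000, 2))   # fixed draws standing in for p_X
F = lambda u: 1.0 / (1.0 + np.exp(-u))          # standard logistic distribution

def eta(theta):
    # dual coordinate eta = E[(F(x.theta) - F(0)) x] for the logit (Sec. 2 convention)
    return ((F(x @ theta) - 0.5)[:, None] * x).mean(axis=0)

def D(theta_p, theta_q):
    # canonical divergence (15) specialized to the logit
    a, b = x @ theta_p, x @ theta_q
    return np.mean(np.log1p(np.exp(b)) - np.log1p(np.exp(a)) - F(a) * (b - a))

theta_p, theta_q = np.array([1.0, -0.5]), np.array([0.2, 0.4])
v = eta(theta_p) - eta(theta_q)
w = np.array([-v[1], v[0]])                     # w is orthogonal to eta_p - eta_q
theta_r = theta_q + 0.7 * w                     # so (eta_p - eta_q).(theta_q - theta_r) = 0
print(D(theta_p, theta_r), D(theta_p, theta_q) + D(theta_q, theta_r))  # the two agree
```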

An alternative divergence on \(\mathcal {P}\) is the KL divergence,

$$\begin{aligned} KL(p\, \Vert \, q )=E_p \left[ \log \frac{p(y,x)}{q(y,x)} \right] . \end{aligned}$$

For the binary choice model,

$$\begin{aligned} KL(p\, \Vert \, q )= & {} E_p \left[ y\log \frac{ F(x\cdot \theta _p)}{F(x\cdot \theta _q)} +(1-y)\log \frac{1-F(x\cdot \theta _p)}{1-F(x\cdot \theta _q)} \right] \nonumber \\= & {} E \left[ F(x\cdot \theta _p) \log \left( \frac{ F(x\cdot \theta _p) }{ F(x\cdot \theta _q) } \right) \right] \nonumber \\{} & {} +E\left[ (1-F(x\cdot \theta _p)) \log \left( \frac{1-F(x\cdot \theta _p) }{ 1-F(x\cdot \theta _q) } \right) \right] \end{aligned}$$
(17)

based on the law of iterated expectations. Canonical divergence (15) and KL divergence (17) generally do not coincide. However, in the special case where F is a logistic distribution, they are shown to be equal.

Theorem 3

\(D=KL\) holds for arbitrary \(p_X\) if and only if F is a logistic distribution; that is,

$$\begin{aligned} F(u)=\frac{ \exp (\beta u)}{1+ \exp (\beta u)}, \end{aligned}$$

where \(\beta >0\).

Proof

If F is a logistic distribution with parameter \(\beta >0\),

$$\begin{aligned} \beta \int F(u)\, du=\log (1+\exp (\beta u))+C \end{aligned}$$

and

$$\begin{aligned} f(u) =\frac{d}{du}\left( \frac{ \exp (\beta u) }{1+\exp (\beta u) } \right) =\beta F(u)(1-F(u)). \end{aligned}$$

Hence, \(r(u)=\beta f(u)\), \(\int _{x\cdot \theta _p}^v r(u)\,du=\beta (F(v)-F(x\cdot \theta _p))\), and

$$\begin{aligned} D(p\, \Vert \, q)= & {} E\left[ \int _{x\cdot \theta _p}^{x\cdot \theta _q} \beta \left( F(v)-F(x\cdot \theta _p) \right) dv \right] \nonumber \\= & {} E\left[ \log \left( \frac{1+\exp (\beta x\cdot \theta _q)}{ 1+\exp (\beta x\cdot \theta _p) } \right) \right] -E\left[ F(x\cdot \theta _p)\log \left( \frac{\exp (\beta x\cdot \theta _q)}{\exp (\beta x\cdot \theta _p)} \right) \right] \nonumber \\= & {} E\left[ (1-F(x\cdot \theta _p)) \log \left( \frac{1-F( x\cdot \theta _p)}{ 1-F(x\cdot \theta _q) } \right) \right] \nonumber \\{} & {} -E\left[ F(x\cdot \theta _p)\log \left( \frac{F(x\cdot \theta _q)}{F( x\cdot \theta _p)} \right) \right] \nonumber \\= & {} KL(p\,\Vert \,q). \end{aligned}$$

On the other hand, if \(D\equiv KL\) holds for an arbitrary \(p_X\),

$$\begin{aligned} \frac{ f(x\cdot \theta _p)^2 }{ F(x\cdot \theta _p)(1-F(x\cdot \theta _p)) } \equiv \frac{ f(x\cdot \theta _p)f(x\cdot \theta _q) }{ F(x\cdot \theta _q)(1-F(x\cdot \theta _q)) } \end{aligned}$$

holds for arbitrary p and q because

$$\begin{aligned} (\partial _\theta )_p(\partial _\theta )_qD(p\,\Vert \, q)=-E\left[ \frac{ f(x\cdot \theta _p)^2 }{ F(x\cdot \theta _p)(1-F(x\cdot \theta _p)) }xx^\top \right] \end{aligned}$$

and

$$\begin{aligned} (\partial _\theta )_p(\partial _\theta )_q KL(p\,\Vert \, q)=-E\left[ \frac{ f(x\cdot \theta _p)f(x\cdot \theta _q) }{ F(x\cdot \theta _q)(1-F(x\cdot \theta _q)) }xx^\top \right] . \end{aligned}$$

By separation of variables, this is possible only if there exists a positive constant \(\beta \) such that

$$\begin{aligned} \frac{ f(u) }{ F(u)(1-F(u)) } \equiv \beta . \end{aligned}$$

Therefore, F is the logistic distribution. \(\square \)

For the standard logit model (\(\beta =1\)), the results presented above are further simplified. The Fisher information metric is given as

$$\begin{aligned} g_{ij}(p)=E\left[ f(x\cdot \theta )x_ix_j \right] . \end{aligned}$$

The \(\nabla ^{(-1)}\)-affine coordinate \(\eta \) may be taken as

$$\begin{aligned} \eta = E\left[ F(x\cdot \theta )x \right] =E_p[yx ]. \end{aligned}$$

which is the gradient of the potential \(\psi (\theta )= E\left[ \log \left( 1+\exp (x\cdot \theta )\right) \right] \); this choice differs from (6) only by terms that are affine in \(\theta \) and therefore induces the same metric, connections, and divergence. The divergence is

$$\begin{aligned} D(p\, \Vert \, q)=E\left[ \log \left( \frac{1+\exp (x\cdot \theta _q)}{1+\exp (x\cdot \theta _p)}\right) \right] -E\left[ \frac{\exp (x\cdot \theta _p)}{1+\exp (x\cdot \theta _p)}x\right] \cdot (\theta _q-\theta _p). \end{aligned}$$
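The equality \(D=KL\) of Theorem 3 can also be checked numerically; the sketch below (ours, with a sample standing in for \(p_X\)) evaluates the closed-form logit divergence above and the KL divergence (17) on the same draws and finds them equal up to rounding. Repeating the exercise with a probit F would produce two different numbers.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=(200_000, 2))   # draws standing in for p_X
F = lambda u: 1.0 / (1.0 + np.exp(-u))          # standard logit

def D_logit(theta_p, theta_q):
    # canonical divergence in the closed form displayed above
    a, b = x @ theta_p, x @ theta_q
    return np.mean(np.log1p(np.exp(b)) - np.log1p(np.exp(a)) - F(a) * (b - a))

def KL(theta_p, theta_q):
    # KL divergence (17), with E[.] replaced by the sample average
    a, b = x @ theta_p, x @ theta_q
    Fa, Fb = F(a), F(b)
    return np.mean(Fa * np.log(Fa / Fb) + (1 - Fa) * np.log((1 - Fa) / (1 - Fb)))

theta_p, theta_q = np.array([1.0, -0.5]), np.array([0.2, 0.4])
print(D_logit(theta_p, theta_q), KL(theta_p, theta_q))   # equal up to rounding
```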

We can generalize Theorem 3 to cover the multinomial discrete choice model. Let \(\{1,\ldots , k\}\) be the set of choices. Assume that the choice probability conditioned on x is now given by

$$\begin{aligned} \textbf{P}\{y=i\mid x\}= \frac{F(x\cdot \theta _i)}{\sum _{j=1}^k F(x\cdot \theta _j)} \end{aligned}$$

for \(i\in \{1,\dots , k\}\), where F is a smooth distribution function and \(\theta =\left[ \begin{array}{ccc}\theta _1&\cdots&\theta _k \end{array} \right] \in (\mathbb {R}^d)^k\) with \(\theta _i=(\theta _i^1,\ldots ,\theta _i^d)\in \mathbb {R}^d\). Let \(p_X\) be the marginal density of x and \(\Theta \subset (\mathbb {R}^d)^k\) be the set of parameters. Then, the multinomial choice model is obtained as

$$\begin{aligned} p_\theta (y,x)=\frac{\sum _{i=1}^k \delta _i(y) F(x\cdot \theta _i)p_X(x)}{\sum _{j=1}^k F(x\cdot \theta _j)}. \end{aligned}$$

where \(\delta _i(y)=1\) if \(y=i\) and 0 otherwise. In particular, when F is the standard logistic distribution, the model becomes the multinomial logit model with the choice probability

$$\begin{aligned} p_\theta (y=i|x)=\frac{ \exp (x\cdot \theta _i) }{ \sum _{j=1}^k \exp (x\cdot \theta _j) } \end{aligned}$$

for \(i\in \{1,\dots , k\}\). The model set \(\{p_\theta | \theta \in \Theta \}\) is a dually flat space with the dual affine coordinates \((\theta ,\eta )\) and the potential

$$\begin{aligned} \psi (\theta )=E\left[ \log \sum _{j=1}^k \exp (x\cdot \theta _j)\right] , \end{aligned}$$

where \(\eta =\left[ \begin{array}{ccc}\eta _1&\cdots&\eta _k \end{array} \right] \in (\mathbb {R}^d)^k\), \(\eta _i=(\eta _{i,1},\ldots ,\eta _{i,d})\in \mathbb {R}^d\), and

$$\begin{aligned} \eta _{i,l}=E\left[ \frac{ \exp (x\cdot \theta _i) }{ \sum _{j=1}^k \exp (x\cdot \theta _j) }x_l \right] \end{aligned}$$

for \(i\in \{1,\dots ,k\}\) and \(l\in \{1,\dots , d\}\). Furthermore, the canonical divergence D and the KL divergence of the model are equal.
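A minimal sketch of the multinomial logit quantities above (ours; the sample standing in for \(p_X\) and the array layout of \(\theta \) are assumptions): the choice probabilities, the potential \(\psi (\theta )\), and the dual coordinates \(\eta _{i,l}\) are all simple softmax averages.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 2, 3, 100_000
x = rng.uniform(-1.0, 1.0, size=(n, d))          # draws standing in for p_X
theta = rng.normal(size=(k, d))                  # rows theta_1, ..., theta_k

def choice_probs(theta, x):
    # multinomial-logit probabilities p_theta(y = i | x), one row per draw of x
    z = x @ theta.T                              # entries x . theta_i
    z -= z.max(axis=1, keepdims=True)            # stabilize the softmax
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

psi = np.mean(np.log(np.sum(np.exp(x @ theta.T), axis=1)))   # potential psi(theta)
eta = choice_probs(theta, x).T @ x / n           # eta[i, l] = E[softmax_i(x) x_l]
```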

4 Maximum likelihood estimation of the binary choice model

Most of the results presented in Sects. 2 and 3 are independent of the choice of \(p_X\). Therefore, by replacing \(p_X\) with its estimates based on empirical data, we might obtain some geometric view of the statistical inference of the model. Let \(\mathcal {P}\) be the set of the binary choice model (2). Let \((y_1,x_1)\), \(\dots \), \((y_T,x_T)\) be an i.i.d. sample from \(p=p_\theta \in \mathcal {P}\). Then the empirical expectation operator \(\hat{E}\) is given by

$$\begin{aligned} \hat{E}\beta (y,x)=\frac{1}{T}\sum _{t=1}^T \beta (y_t,x_t). \end{aligned}$$

The empirical Fisher information matrix is given by \(\hat{G}(\theta )=\hat{E}[r(x\cdot \theta )xx^\top ]\). Again, we assume that

  1. (A3’)

    \(\hat{G}(\theta )\) is positive definite.

The empirical versions of the Fisher information metric \(\hat{g}_{ij}=\hat{E}[r(x\cdot \theta )x_ix_j]\), the potential \(\hat{\psi }(\theta )\), the Levi-Civita connection \(\hat{\nabla }\) and the connection coefficients \(\hat{\Gamma }_{ij}^k\) are also introduced simply by replacing E with \(\hat{E}\).

The “true” parameter \(\theta \) of \(p=p_\theta \) is well approximated by the maximum likelihood estimator,

$$\begin{aligned} \hat{\theta }=\text {arg}\max _\theta \ \hat{E} \log p_\theta (y,x ), \end{aligned}$$
(18)

which is an empirical analog of the KL divergence minimization,

$$\begin{aligned} \theta = \text {arg}\min _\theta \ KL(p\,\Vert \,p_\theta )= \text {arg}\max _\theta \ E_p \log p_\theta \end{aligned}$$

(see e.g., [7, 10]). The estimator is a solution to the first-order conditions of maximization (18):

$$\begin{aligned} \frac{\partial }{\partial \theta }\hat{E}\log p_\theta (y,x) = \frac{1}{T}\sum _{t=1}^T \frac{y_t-F(x_t\cdot \theta )}{F(x_t\cdot \theta )(1-F(x_t\cdot \theta ))}f(x_t\cdot \theta )x_t=0. \end{aligned}$$

In particular, when the logit model is considered, the condition is simplified to

$$\begin{aligned} \frac{1}{T}\sum _{t=1}^T\bigl (y_t-F(x_t\cdot \theta )\bigr )x_t=0, \end{aligned}$$

which implies \(\hat{E}[yx]=\hat{E}\left[ F(x\cdot \hat{\theta })x\right] =(\partial \hat{\psi })(\hat{\theta })\). Therefore, in the maximum likelihood estimation of the logit model, we first estimate the dual parameter \(\eta \) directly as \(\hat{\eta }=\hat{E}[yx]=\frac{1}{T}\sum _{t=1}^T y_tx_t\), and secondly we estimate the canonical parameter \(\theta \) using \(\hat{\theta }=(\partial \hat{\psi })^{-1}(\hat{\eta })\). The first step is a simple sample average and involves no optimization; the second step amounts to inverting the smooth map \(\partial \hat{\psi }\), which is equivalent to solving the logit score equations.
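The two-step estimation just described can be sketched as follows (our illustration on simulated data). Step 1 computes \(\hat{\eta }\) as a sample average; step 2 inverts \(\partial \hat{\psi }\) by Newton's method with the empirical Fisher information \(\hat{G}\) as Jacobian, which is nothing but Fisher scoring for the logit likelihood.

```python
import numpy as np

rng = np.random.default_rng(7)
F = lambda u: 1.0 / (1.0 + np.exp(-u))

# Simulated sample (y_t, x_t), t = 1, ..., T, from a logit model
T, theta_true = 5_000, np.array([1.0, -0.5])
x = rng.uniform(-1.0, 1.0, size=(T, 2))
y = rng.binomial(1, F(x @ theta_true))

# Step 1: estimate the dual parameter directly as a sample average
eta_hat = (y[:, None] * x).mean(axis=0)          # (1/T) sum_t y_t x_t

# Step 2: solve (d psi_hat)(theta) = E_hat[F(x.theta) x] = eta_hat by Newton's method;
# the Jacobian of d psi_hat is the empirical Fisher information G_hat(theta)
theta_hat = np.zeros(2)
for _ in range(50):
    p = F(x @ theta_hat)
    grad_psi = (p[:, None] * x).mean(axis=0)
    G_hat = (x * (p * (1 - p))[:, None]).T @ x / T
    step = np.linalg.solve(G_hat, eta_hat - grad_psi)
    theta_hat = theta_hat + step
    if np.linalg.norm(step) < 1e-10:
        break

print(theta_hat)   # close to theta_true for moderately large T
```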

One objective of empirical studies of an econometric model is to test the statistical significance of its coefficients \(\theta ^1,\dots ,\theta ^d\). When we want to test a joint null hypothesis such as, say, \(H_0:\theta ^1=\theta ^2=0\), where \(d\ge 2\) is assumed, the value of \(\theta \) is estimated subject to the linear constraint:

$$\begin{aligned} \left[ \begin{array}{ccccc} 1 &{} \ 0 &{} \ 0 &{} \ \cdots &{}\ 0\\ 0 &{} \ 1 &{} \ 0 &{} \ \cdots &{}\ 0 \end{array}\right] \left[ \begin{array}{c} \theta ^1 \\ \vdots \\ \theta ^d \end{array}\right] =\left[ \begin{array}{c} 0 \\ 0 \end{array}\right] . \end{aligned}$$

The restriction is generalized to the case of \(H_0:H^\top \theta =c\), where \(H=\left[ \begin{array}{ccc} h_1&\cdots&h_m \end{array}\right] \) is a \(d\times m\) matrix with \(\text {rank}(H)=m<d\), and \(c=(c_1,\ldots ,c_m)^\top \in \mathbb {R}^m\). Let \(\mathcal {H}=\{ \theta \in \Theta \mid H^\top \theta =c\}\) be the constraint set, and let \(\mathcal {P}_{\mathcal {H}}=\left\{ p_\theta \in \mathcal {P} \mid \theta \in \mathcal {H} \right\} \) be the constrained model. Because \(\theta \) is the \(\nabla ^{(+1)}\)-affine coordinate of \(\mathcal {P}\), \(\mathcal {P}_{\mathcal {H}}\) is an affine-flat submanifold of \(\mathcal {P}\).

If the logit model is assumed, the constrained maximum likelihood estimator

$$\begin{aligned} \hat{\theta }|_{\mathcal {H}}=\text {arg}\max _\theta \ \hat{E} \log p_\theta (y,x ) \quad \text {subject to} \quad \theta \in \mathcal {H} \end{aligned}$$

is found by orthogonally projecting the unconstrained estimator \(\hat{\theta }\) onto the constraint set. We define the D-projection operator \(\Pi :\mathcal {P}\rightarrow \mathcal {P}_{\mathcal {H}}\) by

$$\begin{aligned} \Pi p=\text {arg}\min _q\ D(p\,\Vert \, q) \quad \text {subject to} \quad q\in \mathcal {P}_\mathcal {H} \end{aligned}$$
(19)

for every \(p\in \mathcal {P}\), where D is the canonical divergence (14). The operator is well defined because \(D(p\,\Vert \, q)\) is strictly convex in \(\theta _q\).

Theorem 4

For \(p\in \mathcal {P}\) and \(q\in \mathcal {P}_{\mathcal {H}}\), \(q=\Pi p\) holds if and only if \(\eta _q-\eta _p\in \text {Image}(H)\).

Proof

Let \(\mathcal {L}(\theta ,\lambda )=D(p\,\Vert \, p_\theta )-\sum _{i=1}^m \lambda ^i(h_i\cdot \theta -c_i)\) be the Lagrangian with multipliers \(\lambda =(\lambda ^1,\ldots ,\lambda ^m)\). As \(\theta \mapsto D(p\,\Vert \, p_\theta )\) is strictly convex, a necessary and sufficient condition for minimization is given by

$$\begin{aligned} \frac{\partial }{\partial \theta }\mathcal {L}(\theta ,\lambda )= \frac{\partial }{\partial \theta }D(p\,\Vert \, p_\theta )-\sum _{i=1}^m\lambda ^i h_i=0, \end{aligned}$$

which implies

$$\begin{aligned} \frac{\partial }{\partial \theta }D(p\,\Vert \, p_\theta )=\frac{\partial }{\partial \theta }\bigl ( \varphi (\eta _p)+\psi (\theta )-\eta _p\cdot \theta \bigr ) =\eta -\eta _p\in \text {Image}(H) \end{aligned}$$

at a solution \(\theta \) to (19). Therefore, \(\eta _q-\eta _p\in \text {Image}(H)\) is satisfied if and only if q solves (19). \(\square \)

The empirical version of Theorem 4 offers us graphical images of the maximum likelihood estimation. Let \(\hat{D}\) be the empirical version of the canonical divergence, and \(\hat{\Pi }\) be the \(\hat{D}\)-projection operator. Then, if and only if the logit model is assumed, the \(\hat{D}\)-projection becomes equivalent to the constrained maximum likelihood estimation; that is,

$$\begin{aligned} p_{\hat{\theta }|_{\mathcal {H}}}=\hat{\Pi }p_{\hat{\theta }}. \end{aligned}$$

This is because

$$\begin{aligned} \hat{\Pi }p_{\hat{\theta }}= & {} \text {arg}\min _q\ \hat{D}(p_{\hat{\theta }}\,\Vert \, q) \quad \text {subject to} \quad q\in \mathcal {P}_\mathcal {H} \\= & {} \text {arg}\min _{p_\theta }\ \hat{E}\left[ \log \frac{ p_{\hat{\theta }}(y,x)}{p_\theta (y,x)}\right] \quad \text {subject to} \quad \theta \in \mathcal {H} \\= & {} \text {arg}\max _{p_\theta }\ \hat{E}{\log p_\theta (y,x)} \quad \text {subject to} \quad \ H^\top \theta =c \\= & {} p_{\hat{\theta }|_{\mathcal {H}}}. \end{aligned}$$

For the dual parameter \(\hat{\eta }|_{\mathcal {H}}=\partial \hat{\psi }(\hat{\theta }|_{\mathcal {H}})\), the condition

$$\begin{aligned} \hat{\eta }|_{\mathcal {H}}-\hat{\eta }\in \text {Image}(H) \end{aligned}$$

holds if and only if the model is logit. Furthermore, if \(\hat{\eta }|_\mathcal {H}\) satisfies the condition, there exists \(\lambda \in \mathbb {R}^m\) such that \(\hat{\eta }-\hat{\eta }|_{\mathcal {H}}=H\lambda \). Therefore, for any \(\theta \in \mathcal {H}\),

$$\begin{aligned} (\hat{\eta }-\hat{\eta }|_{\mathcal {H}})\cdot (\theta -\hat{\theta }|_{\mathcal {H}})= \lambda ^\top H^\top (\theta -\hat{\theta }|_{\mathcal {H}})= \lambda ^\top (c-c)=0 \end{aligned}$$

is satisfied. The situation is shown in Fig. 1. The figure may seem obvious, but it should be remarked that this naive picture of orthogonal projection is consistent with estimation under linear restrictions if and only if the logit model is assumed; in other models such as the probit model, the orthogonal projection with respect to the Fisher information metric fails to maximize the likelihood on the affine linear submodel.
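The discussion above can be illustrated numerically. The sketch below (ours, on simulated logit data) computes the unconstrained and constrained maximum likelihood estimators by Fisher scoring and checks that \(\hat{\eta }|_{\mathcal {H}}-\hat{\eta }\) lies in \(\text {Image}(H)\); the data-generating values and the hypothesis \(\theta ^1=0\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
F = lambda u: 1.0 / (1.0 + np.exp(-u))

# Simulated logit sample with d = 3 regressors
T, theta_true = 10_000, np.array([0.8, -0.4, 0.3])
x = rng.uniform(-1.0, 1.0, size=(T, 3))
y = rng.binomial(1, F(x @ theta_true))
d = x.shape[1]

def mle(H=None, c=None, iters=100):
    """Fisher scoring for the logit MLE; with H, c given, maximize subject to
    H^T theta = c via the Lagrangian first-order (KKT) conditions."""
    theta = np.zeros(d)
    for _ in range(iters):
        p = F(x @ theta)
        score = ((y - p)[:, None] * x).mean(axis=0)     # gradient of the log-likelihood
        G = (x * (p * (1 - p))[:, None]).T @ x / T      # empirical Fisher information
        if H is None:
            theta = theta + np.linalg.solve(G, score)
        else:
            m = H.shape[1]
            KKT = np.block([[G, H], [H.T, np.zeros((m, m))]])
            rhs = np.concatenate([score, c - H.T @ theta])
            theta = theta + np.linalg.solve(KKT, rhs)[:d]
    return theta

H = np.array([[1.0], [0.0], [0.0]])                     # null hypothesis theta^1 = 0
c = np.array([0.0])
theta_hat, theta_hat_H = mle(), mle(H, c)

eta_hat = (y[:, None] * x).mean(axis=0)                 # unconstrained dual estimate
eta_hat_H = (F(x @ theta_hat_H)[:, None] * x).mean(axis=0)
print(eta_hat_H - eta_hat)   # lies in Image(H): only the first entry is non-negligible
```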

Fig. 1 Orthogonal projection \(\hat{\Pi }:\hat{\eta }\mapsto \hat{\eta }|_{\mathcal {H}}\) and maximum likelihood estimator \(\hat{\theta }|_{\mathcal {H}}\) under null hypothesis \(H^\top \theta =c\)

5 Discussion

In this study, we investigated the geometry of parametric binary choice models. The model was established as a dually flat space, where the canonical coefficient parameter \(\theta \) acts as an affine coordinate. The dual flatness introduces a canonical divergence into the model. The divergence is equivalent to the KL divergence if and only if the model is a logit model. As an example application, the projection onto an affine linear subspace was geometrically characterized.

The dual flatness of the binary choice model is caused by the single-index structure of the model, which depends on parameter \(\theta \) only through the linear index \(x\cdot \theta \), making the Levi-Civita connection coefficients \(\Gamma _{ijk}\) symmetric in (i, j, k). Therefore, the results of this study can be extended to a more general class of single-index models, including nonlinear regressions, truncated regressions, and ordered discrete choice models. The studied model might also be extended to neural network models, which consist of connected binary response models. Each node of the network is considered as a single-index model. However, the entire structure of the model could be highly nonlinear in terms of parameter \(\theta \), which leads to the non-flatness of the model.

Among the binary choice models, the logit is shown to have good properties geometrically as well as statistically. This is not merely because the logit admits explicit integration. In general, we say that a statistical model \(\mathcal {P}=\{p_\theta \mid \theta \in \Theta \}\) is an exponential family if it is expressed as

$$\begin{aligned} p_\theta (z)=\exp \left[ C(z)+\sum _{i=1}^d\theta ^i \beta _i(z)-\psi (\theta ) \right] . \end{aligned}$$
(20)

It is widely known that the (curved) exponential family possesses desirable properties such as higher-order efficiency of the maximum likelihood estimation [1, 6]. Although the logit model is not truly exponential, the conditional density \(p_\theta (y|x)\) is still written as

$$\begin{aligned} p_\theta (y|x) =\exp \left( (x\cdot \theta )\delta _1(y)-\psi (\theta |x)\right) , \end{aligned}$$
(21)

where

$$\begin{aligned} \delta _i(y)= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if}\quad y=i\\ 0 &{}\quad \text {if}\quad y\not =i, \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \psi (\theta |x)=\log \left( 1+\exp (x\cdot \theta )\right) . \end{aligned}$$

Conditioned on x, model (21) belongs to an exponential family with potential \(\psi (\theta |x)\). Notably, \(\psi (\theta )=E\left[ \psi (\theta |x)\right] \). Because the marginal density \(p_X\) does not appear in the score (4) of the model, the statistical properties of the model are primarily determined by \(p_\theta (y|x)\). Our study suggests that the logit model is the unique binary choice model that belongs to a conditional exponential family.