Abstract
In this study, we consider parametric binary choice models from the perspective of information geometry. The set of models is a dually flat manifold with dual connections, naturally derived from the Fisher information metric. Under the dual connections, the canonical divergence and the Kullback–Leibler divergence of the binary choice model coincide if and only if the model is a logit model.
1 Introduction
Consider the following simple linear regression model:
where y is a dependent variable, x is a d-dimensional random vector, \(\epsilon \) is an error term, \(\theta =(\theta ^1,\dots ,\theta ^d)\in \mathbb {R}^d\), and \(x\cdot \theta =\sum _{i=1}^d x_i\theta ^i\). The model seems to be very “flat” owing to its linear appearance. If we change the parameter of the model as follows:
for each \(i=1,\dots , d\), the model becomes a nonlinear regression model,
which does not appear very “flat” anymore, although the nature of the model remains unchanged. This rather simple example highlights the importance of the geometric point of view in understanding the shape of statistical models: the flatness of a statistical model must be defined independently of the choice of parameters, that is, in the manner of information geometry.
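The two omitted displays can be sketched as follows; the particular change of variables (a cubic reparametrization here) is only illustrative, as the original display is not recoverable:

```latex
% Linear regression model, flat in appearance:
y = x\cdot\theta + \epsilon .
% An illustrative reparametrization (the original choice may differ):
\theta^i = (\vartheta^i)^3, \qquad i=1,\dots,d,
% which turns the model into the nonlinear regression
y = \sum_{i=1}^d x_i\,(\vartheta^i)^3 + \epsilon .
```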
In econometrics, information geometry has been used to characterize the flat nature of statistical models, including the standard linear regression model, Poisson regression, Wald tests, the ARMA model, and many other examples [4, 5, 9, 11]. The objective of this study is to explore its application potential for binary choice models.
In the binary choice model, the value of the dependent variable y is either 1 or 0, according to whether or not some event occurs. The standard model is represented as
where x is an \(\mathbb {R}^d\)-valued random vector distributed according to density \(p_X(x)\), \(\theta \in \mathbb {R}^d\), and \(\epsilon \) is a random term independent of x. The choice probability is given by
where F is the distribution function of \(\epsilon \). The joint density function of \((y,x)\in \{0,1\}\times \mathbb {R}^d\) is
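The omitted display follows from the choice probability \(P(y=1\mid x)=F(x\cdot \theta )\), giving the standard Bernoulli form:

```latex
p_\theta(y,x) = F(x\cdot\theta)^{y}\,\bigl(1-F(x\cdot\theta)\bigr)^{1-y}\,p_X(x),
\qquad (y,x)\in\{0,1\}\times\mathbb{R}^d .
```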
The model is commonly used in social sciences to describe the choices made by decision-makers between two alternatives. These alternatives may represent school, labor supply, marital status, or transportation choices. See [10, 15] for a list of empirical applications. In particular, the model is referred to as the probit model when F is the standard normal distribution, and as the logit model when F is the standard logistic distribution:
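The standard logistic distribution function referred to here is

```latex
F(u) = \frac{1}{1+e^{-u}} = \frac{e^{u}}{1+e^{u}},
\qquad
f(u) = F'(u) = F(u)\bigl(1-F(u)\bigr).
```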
The probit model is often considered a plausible model owing to its normally distributed random errors, whereas the logit model is regarded merely as a closed-form approximation of the probit. Contrary to this common belief, we argue that the logit model is the most natural parametric binary choice model from the point of view of information geometry.
The remainder of this paper is organized as follows. In Sect. 2, the geometry of the binary choice model is formulated and the model is shown to be a dually flat space. In Sect. 3, the logit model is investigated in detail as a special case of the framework of Sect. 2. The canonical divergence and the Kullback–Leibler (KL) divergence are introduced on the model. We demonstrate that the logit model is the unique model whose canonical and KL divergences coincide. In Sect. 4, we offer a geometric interpretation of the maximum likelihood estimation of the binary choice model. In Sect. 5, we summarize the conclusions of this study.
2 Geometry of the binary choice model
The model set is given by
where \(\Theta \) is an open subset of \(\mathbb {R}^d\). This study is based on the following technical assumptions:
- (A1) F is an infinitely differentiable function on \(\mathbb {R}\) with positive derivative \(f=F'>0\);
- (A2) x has a compact support \(\mathcal {X}\subset \mathbb {R}^d\) such that \(\mathcal {X}^{int}\not =\emptyset \).
Under these assumptions, model \(\mathcal {P}\) is considered to be a d-dimensional \(C^\infty \) manifold with a canonical coordinate system \(\Theta \rightarrow \mathcal {P}\), \(\theta \mapsto p_\theta \).
Proposition 1
The coordinate system \(\Theta \rightarrow \mathcal {P}\), \(\theta \mapsto p_\theta \), is bijective.
Proof
Assume that there exists \(\theta \not =\theta '\) such that \(p_\theta =p_{\theta '}\). Then, \(F(x\cdot \theta )=F(x\cdot \theta ')\) holds for every \(x\in \mathcal {X}\). Because (A1) implies that F is strictly increasing, \(x\cdot \theta =x\cdot \theta '\) for every \(x\in \mathcal {X}\); hence \(\mathcal {X}\subset \{x\in \mathbb {R}^d\mid x\cdot (\theta -\theta ')=0\}\), which contradicts \(\mathcal {X}^{int}\not =\emptyset \). \(\square \)
Unless otherwise specified, \(\theta \) is used as the (global) coordinate of manifold \(\mathcal {P}\) hereafter.
For every \(p=p_\theta \), let \(E_p\) be the expectation operator defined as
for an arbitrary measurable function \(\beta \) of (y, x). The conditional expectation operator \(E_p[\,\cdot \mid x\,]\) is also defined as follows:
In particular, \(E_p[\,y\mid x\,]=F(x\cdot \theta )\) holds. The expectation operator with respect to x is simply denoted by E because the value of \(E\beta (x)=\int _{\mathcal {X}}\beta (x)p_X(x)\, dx\) is independent of \(\theta \).
The score function of \(p=p_\theta \) is
Because
the Fisher information matrix \(G(\theta )\) is given as
For simplicity, define \(r:\mathbb {R}\rightarrow \mathbb {R}_{++}\) as
for every \(u\in \mathbb {R}\), so that \(G(\theta )=E\left[ r(x\cdot \theta )xx^\top \right] \). By the assumptions, r is bounded on every compact interval. Because x has a bounded support, \(G(\theta )\) is finite at every \(\theta \in \Theta \). In addition, we assume that
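The omitted displays for the score and the Fisher information can be reconstructed from the joint density: differentiating \(\log p_\theta (y,x)\) with respect to \(\theta ^i\), and using \(E_p[(y-F(x\cdot \theta ))^2\mid x]=F(x\cdot \theta )(1-F(x\cdot \theta ))\), one presumably has

```latex
\partial_i \log p_\theta(y,x)
 = \frac{\bigl(y-F(x\cdot\theta)\bigr)\,f(x\cdot\theta)}
        {F(x\cdot\theta)\bigl(1-F(x\cdot\theta)\bigr)}\;x_i ,
\qquad
G(\theta) = E\!\left[\frac{f(x\cdot\theta)^2}
        {F(x\cdot\theta)\bigl(1-F(x\cdot\theta)\bigr)}\;xx^\top\right],
```

so that the function r would be

```latex
r(u) = \frac{f(u)^2}{F(u)\bigl(1-F(u)\bigr)} > 0 .
```

This is consistent with the proof of Theorem 3 below, where \(r(u)=\beta f(u)\) for the logistic F with \(f=\beta F(1-F)\).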
- (A3) \(G(\theta )\) is positive definite at every \(\theta \in \Theta \).
The tangent space of \(\mathcal {P}\) at \(p=p_\theta \) is \(T_p\mathcal {P}=\text {Span}\left\{ (\partial _1)_p,\ldots , (\partial _d)_p \right\} \), where \(\partial _i= \frac{\partial }{\partial \theta ^i}\) for \(i=1,\ldots , d\). For example, the unconditional expectation \(E_p y\) is obtained as
which is a smooth function on \(\mathcal {P}\). A tangent vector \(X=X^i(\partial _i)_p\) operates on this as
Moreover, at every (y, x),
The Fisher information metric g on \(\mathcal {P}\) is introduced as
where \(g_{ij}(p)=E\left[ r(x\cdot \theta )x_ix_j \right] \) is the (i, j) element of \(G(\theta )\).
Given metric g, the binary choice model is considered to be a Riemannian manifold \((\mathcal {P},g)\). Moreover, function \(\psi :\Theta \rightarrow \mathbb {R}\) is defined as
Then,
Hence, \((\mathcal {P},g)\) is a Hessian manifold with potential \(\psi \) when it is equipped with flat connections [12, 14]. For later convenience, we introduce gradient \(\partial \psi :\Theta \rightarrow \mathbb {R}^d\) as
where \(\partial _i\psi (\theta )=E \left[ \left( \int _0^{x\cdot \theta }r(u)\, du \right) x_i \right] \) for \(i=1,\dots , d\). Because the Hessian \(\partial ^2\psi (\theta )=G(\theta )\) is positive definite by (A3), an inverse mapping \((\partial \psi )^{-1}: \partial \psi (\Theta ) \subset \mathbb {R}^d \rightarrow \mathbb {R}^d\) exists and is continuously differentiable at every point.
Let \(\mathfrak {X}(\mathcal {P})\) denote the class of \(C^\infty \) tangent vector fields on \(\mathcal {P}\). For \(X=X^i \partial _i\), \(Y=Y^j\partial _j \in \mathfrak {X}(\mathcal {P})\), the Levi-Civita connection \(\nabla \) of \((\mathcal {P},g)\) is introduced as
where \(\Gamma _{ij}^k\) is the Christoffel symbol:
for \(i,j,k\in \{1,\dots ,d\}\), where \(g^{kl}\) denotes the (k, l)-element of \(G(\theta )^{-1}\). Let \(\Gamma _{ijk}(\theta )=\Gamma _{ij}^l(\theta )g_{kl}(\theta )=\frac{1}{2} E\left[ r'(x\cdot \theta ) x_ix_jx_k \right] \) such that
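Since \(g_{ij}=E[r(x\cdot \theta )x_ix_j]\) has coordinate derivatives \(\partial _i g_{jl}=E[r'(x\cdot \theta )x_ix_jx_l]\) that are fully symmetric in \((i,j,l)\), the Christoffel symbols (8) presumably reduce to

```latex
\Gamma_{ij}^{k}(\theta)
 = \frac{1}{2}\,g^{kl}\bigl(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\bigr)
 = \frac{1}{2}\,g^{kl}\,E\!\left[r'(x\cdot\theta)\,x_i x_j x_l\right],
```

which agrees term by term with \(\Gamma _{ijk}=\frac{1}{2} E\left[ r'(x\cdot \theta ) x_ix_jx_k \right] \) stated above.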
The curvature and torsion tensors \(R:\mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\rightarrow \mathfrak {X}(\mathcal {P})\) and \(T:\mathfrak {X}(\mathcal {P})\times \mathfrak {X}(\mathcal {P})\rightarrow \mathfrak {X}(\mathcal {P})\) of \(\nabla \) are, respectively, defined as
and
where \([X,Y]=X^i (\partial _iY^j)\partial _j-Y^j(\partial _jX^i)\partial _i\).
Proposition 2
Let \(\nabla \) be the Levi-Civita connection (7) with coefficients (8). Then,
for \(i,j,k\in \{1,\dots ,d\}\) and \(T\equiv 0\).
Proof
Using (8), \(T(\partial _i,\partial _j)=(\Gamma _{ij}^k-\Gamma _{ji}^k)\partial _k=0\) is trivially shown. Because \(g_{mh}g^{hl}=1\) if \(m=l\) and 0 if \(m\not = l\),
which implies \(\partial _i g^{hl}=-2\Gamma _{im}^l g^{mh}\). Using the definition of the curvature tensor,
Because \(\partial _i\Gamma _{jkh}=\partial _j\Gamma _{ikh}=\frac{1}{2}E\left[ r''(x\cdot \theta )x_ix_jx_kx_h \right] \), Eq. (12) is obtained. \(\square \)
Proposition 2 implies that the binary choice model with the Fisher information geometry is essentially a flat manifold. Let S be an arbitrary symmetric (0, 3)-tensor on \(\mathcal {P}\). A family of \(\alpha \)-connections \(\{\nabla ^{(\alpha )}\}_{\alpha \in \mathbb {R}}\) is defined as
for every \(\alpha \in \mathbb {R}\). The corresponding connection coefficients are given by
where \(S_{ijk}=S(\partial _i,\partial _j,\partial _k)\). See chapter 6 of [2] for definitions and details of \(\alpha \)-connections.
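With the sign convention that makes \(\nabla ^{(+1)}\) flat in the \(\theta \)-coordinate for the choice of S in Theorem 1 below (a reconstruction; conventions differ across references), the coefficients are presumably

```latex
\Gamma^{(\alpha)}_{ijk} = \Gamma_{ijk} - \frac{\alpha}{2}\,S_{ijk},
```

so that for \(S_{ijk}=E[r'(x\cdot \theta )x_ix_jx_k]=2\Gamma _{ijk}\) one obtains \(\Gamma ^{(+1)}_{ijk}=0\) and \(\Gamma ^{(-1)}_{ijk}=2\Gamma _{ijk}\), matching the relation \(\Gamma _{ij}^{(-1)k}(\theta )=2\Gamma _{ij}^{k}(\theta )\) used in the proof of Theorem 1.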
A pair \((\nabla ^{( \alpha )}, \nabla ^{( -\alpha )})\) of the connections provides the dual connections of \((\mathcal {P},g)\) such that
because
by symmetry \(S(X,Y,Z)=S(Y,X,Z)\). Let \(R^{( \alpha )}\) and \(T^{(\alpha )}\) be the curvature and torsion tensors of \(\nabla ^{( \alpha )}\), respectively. When \(R^{( \alpha )}=R^{( -\alpha )}=0\) and \(T^{( \alpha )}=T^{( -\alpha )}=0\) hold, \((\mathcal {P},g,\nabla ^{( \alpha )},\nabla ^{( -\alpha )})\) is said to be a dually flat space. A dually flat space has the dual affine coordinates \((\xi ,\zeta )\), where \(\xi =(\xi ^1,\dots ,\xi ^d)\) is the \(\nabla ^{( \alpha )}\)-affine coordinate such that \(\Gamma _{ijk}^{(\alpha )}(\xi )\equiv 0\) and \(\zeta =(\zeta _1,\dots ,\zeta _d)\) is the \(\nabla ^{(-\alpha )}\)-affine coordinate such that \(\Gamma _{ijk}^{(-\alpha )}(\zeta )\equiv 0\). Furthermore,
Theorem 1
Let S be a (0, 3)-tensor on \((\mathcal {P},g)\) given by \(S(X,Y,Z)=X^iY^jZ^k S_{ijk}\) with
For \(\alpha =\pm 1\), let \(\nabla ^{(\alpha )}\) be defined as
Then, \((\mathcal {P},g,\nabla ^{( +1 )},\nabla ^{( -1 )})\) is a dually flat space with dual affine coordinates \((\theta ,\eta )\), where \(\eta =\partial \psi (\theta )\).
Proof
While the results can be immediately obtained from the general theory of Hessian manifolds [12, 14], we give a direct proof to keep the paper self-contained.
Based on the assumption, \(\Gamma _{ij}^{(+1)k}(\theta )\equiv 0\) trivially holds. To confirm that \(\eta =\partial \psi (\theta )\) is the \(\nabla ^{(-1)}\)-affine coordinate, let \(\Gamma _{ab}^{(-1)c}(\eta )\) for \(a,b,c\in \{1,\dots ,d\}\) be the \(\nabla ^{(-1)}\)-connection coefficients expressed in terms of \(\eta \). By the definition of \(\eta \), \(\partial _k \eta _l=g_{kl}\) holds. This implies that \(\frac{\partial \theta ^k}{\partial \eta _l}=g^{kl}\) and that
From the definition of \(\nabla ^{(-1)}\), \(\Gamma _{ij}^{( -1 )k}(\theta )=2\Gamma _{ij}^{k}(\theta )\). By the change-of-variables formula for the connection coefficients,
which implies
Moreover,
\(\square \)
Corollary 1
For \(\alpha =\pm 1\), the \(\nabla ^{(\alpha )}\)-geodesic path \(\gamma ^{(\alpha )}=\{\gamma ^{(\alpha )}_t|\, 0\le t\le 1\}\) connecting \(p,q\in \mathcal {P}\) is given by
where
and
In particular, (13) is a solution to the ordinary differential equation,
with initial condition \(\theta _0=\theta _p\).
Proof
Let \(\eta _t^{(-1)}=\partial \psi (\theta _t^{(-1)})=(1-t)\eta _p+t\eta _q\). Then,
\(\square \)
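In the dual affine coordinates \((\theta ,\eta )\), the geodesics of Corollary 1 are presumably straight lines in the respective coordinates:

```latex
\theta^{(+1)}_t = (1-t)\,\theta_p + t\,\theta_q,
\qquad
\theta^{(-1)}_t = (\partial\psi)^{-1}\bigl((1-t)\,\eta_p + t\,\eta_q\bigr),
```

and the ordinary differential equation mentioned after (13), obtained by differentiating \(\eta _t^{(-1)}=(1-t)\eta _p+t\eta _q\) and using \(\partial _k\eta _l=g_{kl}\), would read

```latex
G(\theta_t)\,\frac{d\theta_t}{dt} = \eta_q - \eta_p,
\qquad \theta_0 = \theta_p .
```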
3 Two divergences of the binary choice model
The dual potential \(\varphi \) of \(\psi \) is given as
which is the Legendre transformation of \(\psi (\theta )\). Because \(\theta \mapsto \eta \cdot \theta -\psi (\theta )\) is strictly concave by (A3), the maximum of \(\eta \cdot \theta -\psi (\theta )\) is attained at \(\theta =(\partial \psi )^{-1}(\eta )\), which is a solution to the first-order condition, \(\eta -\partial \psi (\theta )=0\). Let \(\theta _p\) and \(\eta _p\) denote the canonical coordinate and its dual at \(p\in \mathcal {P}\), respectively. Then, the dual potential is explicitly given as
In general, for a dually flat space with dual affine coordinates \((\theta ,\eta )\) and dual potentials \((\psi ,\varphi )\), the canonical divergence between p and q in \(\mathcal {P}\) is defined as follows [3, 7, 8]:
For binary choice model (2),
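The omitted displays can be reconstructed as follows; the ordering convention (which varies across references) is fixed here so that the Pythagorean condition of Corollary 2 comes out as stated. The canonical divergence would be

```latex
D(p\,\Vert\,q) = \varphi(\eta_p) + \psi(\theta_q) - \eta_p\cdot\theta_q
             = \psi(\theta_q) - \psi(\theta_p) - \eta_p\cdot(\theta_q - \theta_p),
```

which, for the binary choice model with \(\partial _i\psi (\theta )=E[(\int _0^{x\cdot \theta }r(u)\,du)x_i]\), specializes to

```latex
D(p\,\Vert\,q)
 = E\!\left[\int_{x\cdot\theta_p}^{x\cdot\theta_q}
     \Bigl(\int_{x\cdot\theta_p}^{v} r(u)\,du\Bigr)\,dv\right],
```

whose inner integral matches the expression used in the proof of Theorem 3.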
For given p, the function \(\theta \mapsto D(p\,\Vert \, p_\theta )\) is strictly convex because a direct computation shows
Theorem 2
Let p, q, r \(\in \mathcal {P}\). Let \(\gamma ^{(+1)}=(\gamma ^{(+1)}_t)_{0\le t \le 1}\) be the \(\nabla ^{(+1)}\)-geodesic path connecting p and q, and let \(\gamma ^{(-1)}=(\gamma ^{(-1)}_s)_{0\le s \le 1}\) be the \(\nabla ^{(-1)}\)-geodesic path connecting q and r. If and only if \(\gamma ^{(+1)}\) and \(\gamma ^{(-1)}\) are orthogonal at the intersection q in the sense that
we have
Proof
The result is standard. See e.g., [3, 7, 13] for a proof. \(\square \)
Corollary 2
The Pythagorean formula (16) holds if and only if \((\eta _p-\eta _q)\cdot (\theta _q-\theta _r)=0\).
Proof
From Corollary 1,
and
Therefore,
\(\square \)
An alternative for the divergence on \(\mathcal {P}\) is the KL divergence,
For the binary choice model,
based on the law of iterated expectations. Canonical divergence (15) and KL divergence (17) do not coincide in general. However, in the special case where F is a logistic distribution, they are shown to coincide.
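Concretely, writing \(F_p=F(x\cdot \theta _p)\) and \(F_q=F(x\cdot \theta _q)\), the omitted display for the KL divergence of the binary choice model presumably reads

```latex
KL(p\,\Vert\,q)
 = E_p\!\left[\log\frac{p_{\theta_p}(y,x)}{p_{\theta_q}(y,x)}\right]
 = E\!\left[F_p\log\frac{F_p}{F_q}
      + (1-F_p)\log\frac{1-F_p}{1-F_q}\right],
```

where the marginal \(p_X\) cancels and the outer expectation E is over x (the law of iterated expectations mentioned above).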
Theorem 3
\(D=KL\) holds for arbitrary \(p_X\) if and only if F is a logistic distribution; that is,
where \(\beta >0\).
Proof
If F is a logistic distribution with parameter \(\beta >0\),
and
Hence, \(r(u)=\beta f(u)\), \(\int _{x\cdot \theta _p}^v r(u)\,du=\beta (F(v)-F(x\cdot \theta _p))\), and
On the other hand, if \(D\equiv KL\) holds for an arbitrary \(p_X\),
holds for arbitrary p and q because
and
From the principle of the separation of variables, this is possible only if there exists a positive constant \(\beta \) such that
Therefore, F is the logistic distribution. \(\square \)
For the standard logit model (\(\beta =1\)), the results presented in the preceding section are further simplified. The Fisher information metric is given as
The \(\nabla ^{(-1)}\)-affine coordinate \(\eta \) is expressed as
The potential is \(\psi (\theta )= E\left[ \log \left( 1+\exp (x\cdot \theta )\right) \right] \), and the divergence is
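For the standard logit, \(f=F(1-F)\) implies \(r=f\), so the omitted displays presumably specialize to

```latex
g_{ij}(p) = E\!\left[f(x\cdot\theta)\,x_i x_j\right],
\qquad
\eta = \partial\psi(\theta) = E\!\left[F(x\cdot\theta)\,x\right],
```

and, by Theorem 3, the divergence coincides with the KL divergence:

```latex
D(p\,\Vert\,q) = KL(p\,\Vert\,q)
 = E\!\left[F_p\log\frac{F_p}{F_q} + (1-F_p)\log\frac{1-F_p}{1-F_q}\right],
\qquad F_p = F(x\cdot\theta_p).
```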
We can generalize Theorem 3 to cover the multinomial discrete choice model. Let \(\{1,\ldots , k\}\) be the set of choices. Assume that the choice probability conditioned on x is now given by
for \(i\in \{1,\dots , k\}\), where F is a smooth distribution function and \(\theta =\left[ \begin{array}{ccc}\theta _1&\cdots&\theta _k \end{array} \right] \in (\mathbb {R}^d)^k\) with \(\theta _i=(\theta _i^1,\ldots ,\theta _i^d)\in \mathbb {R}^d\). Let \(p_X\) be the marginal density of x and \(\Theta \subset (\mathbb {R}^d)^k\) be the set of parameters. Then, the multinomial choice model is obtained as
In particular, when F is the standard logit distribution, the model becomes the multinomial logit model with the choice probability
for \(i\in \{1,\dots , k\}\). The model set \(\{p_\theta | \theta \in \Theta \}\) is a dually flat space with the dual affine coordinates \((\theta ,\eta )\) and the potential
where \(\eta =\left[ \begin{array}{ccc}\eta _1&\cdots&\eta _k \end{array} \right] \in (\mathbb {R}^d)^k\), \(\eta _i=(\eta _{i,1},\ldots ,\eta _{i,d})\in \mathbb {R}^d\), and
for \(i\in \{1,\dots ,k\}\) and \(l\in \{1,\dots , d\}\). Furthermore, the canonical divergence D and the KL divergence of the model are equal.
4 Maximum likelihood estimation of the binary choice model
Most of the results presented in Sects. 2 and 3 are independent of the choice of \(p_X\). Therefore, by replacing \(p_X\) with its estimate based on empirical data, we might obtain a geometric view of the statistical inference of the model. Let \(\mathcal {P}\) be the set of the binary choice model (2). Let \((y_1,x_1)\), \(\dots \), \((y_T,x_T)\) be an i.i.d. sample from \(p=p_\theta \in \mathcal {P}\). Then the empirical expectation operator \(\hat{E}\) is given by
The empirical Fisher information matrix is given by \(\hat{G}(\theta )=\hat{E}[r(x\cdot \theta )xx^\top ]\). Again, we assume that
- (A3’) \(\hat{G}(\theta )\) is positive definite.
The empirical versions of the Fisher information metric \(\hat{g}_{ij}=\hat{E}[r(x\cdot \theta )x_ix_j]\), the potential \(\hat{\psi }(\theta )\), the Levi-Civita connection \(\hat{\nabla }\) and the connection coefficients \(\hat{\Gamma }_{ij}^k\) are also introduced simply by replacing E with \(\hat{E}\).
The “true” parameter \(\theta \) of \(p=p_\theta \) is well approximated by the maximum likelihood estimator,
which is an empirical analog of the KL divergence minimization,
(see e.g., [7, 10]). The estimator is a solution to the first-order conditions of maximization (18):
In particular, when the logit model is considered, the condition is simplified to
which implies \(\hat{E}[yx]=\hat{E}\left[ F(x\cdot \hat{\theta })x\right] =(\partial \hat{\psi })(\hat{\theta })\). Therefore, in the maximum likelihood estimation of the logit model, we first estimate the dual parameter \(\eta \) directly as \(\hat{\eta }=\hat{E}[yx]=\frac{1}{T}\sum _{t=1}^T y_tx_t\), and secondly we estimate the canonical parameter \(\theta \) using \(\hat{\theta }=(\partial \hat{\psi })^{-1}(\hat{\eta })\). This method is easily implemented because it does not involve numerical optimizations.
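The two-step procedure can be sketched as follows; the function names are ours, and since \((\partial \hat{\psi })^{-1}\) has no closed form, the sketch inverts it numerically with Newton's method, which for the logit coincides with the usual Newton–Raphson iteration on the log-likelihood:

```python
import numpy as np

def sigmoid(u):
    """Standard logistic distribution function F."""
    return 1.0 / (1.0 + np.exp(-u))

def fit_logit_dual(X, y, tol=1e-10, max_iter=100):
    """Two-step dual estimation of the logit model.

    Step 1: estimate the dual parameter eta_hat = (1/T) sum_t y_t x_t.
    Step 2: recover theta_hat = (d psi_hat)^{-1}(eta_hat) by Newton's
    method on the moment equation (1/T) sum_t F(x_t . theta) x_t = eta_hat.
    """
    T, d = X.shape
    eta_hat = X.T @ y / T                                # step 1: dual coordinate
    theta = np.zeros(d)
    for _ in range(max_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ p / T - eta_hat                     # d psi_hat(theta) - eta_hat
        G_hat = (X * (p * (1.0 - p))[:, None]).T @ X / T  # empirical Fisher information
        step = np.linalg.solve(G_hat, grad)
        theta -= step
        if np.max(np.abs(step)) < tol:
            break
    return theta, eta_hat
```

At convergence, \(\hat{\theta }\) satisfies \(\hat{E}[F(x\cdot \hat{\theta })x]=\hat{E}[yx]=\hat{\eta }\), i.e., the first-order condition above.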
One objective of empirical studies of an econometric model is to test the statistical significance of its coefficients \(\theta ^1,\dots ,\theta ^d\). When we want to test a joint null hypothesis such as, say, \(H_0:\theta ^1=\theta ^2=0\), where \(d\ge 2\) is assumed, the value of \(\theta \) is estimated subject to the linear constraint:
The restriction is generalized to the case of \(H_0:H^\top \theta =c\), where \(H=\left[ \begin{array}{ccc} h_1&\cdots&h_m \end{array}\right] \) is a \(d\times m\) matrix with \(\text {rank}(H)=m<d\), and \(c=(c_1,\ldots ,c_m)^\top \in \mathbb {R}^m\). Let \(\mathcal {H}=\{ \theta \in \Theta \mid H^\top \theta =c\}\) be the constraint set, and let \(\mathcal {P}_{\mathcal {H}}=\left\{ p_\theta \in \mathcal {P} \mid \theta \in \mathcal {H} \right\} \) be the constrained model. Because \(\theta \) is the \(\nabla ^{(+1)}\)-affine coordinate of \(\mathcal {P}\), \(\mathcal {P}_{\mathcal {H}}\) is an affine-flat submanifold of \(\mathcal {P}\).
If the logit model is assumed, the constrained maximum likelihood estimator
is found by orthogonally projecting the unconstrained estimator \(\hat{\theta }\) onto the constraint set. We define the D-projection operator \(\Pi :\mathcal {P}\rightarrow \mathcal {P}_{\mathcal {H}}\) by
for every \(p\in \mathcal {P}\), where D is the canonical divergence (14). The operator is well defined because \(D(p\,\Vert \, q)\) is strictly convex in \(\theta _p\).
Theorem 4
\(q=\Pi p\) holds for \(q,p\in \mathcal {P}\) if and only if \(\eta _q-\eta _p\in \text {Image}(H)\).
Proof
Let \(\mathcal {L}(\theta ,\lambda )=D(p\,\Vert \, p_\theta )-\sum _{i=1}^m \lambda ^i(h_i\cdot \theta -c_i)\) be the Lagrangian with multipliers \(\lambda =(\lambda ^1,\ldots ,\lambda ^m)\). As \(\theta \mapsto D(p\,\Vert \, p_\theta )\) is strictly convex, a necessary and sufficient condition for minimization is given by
which implies
at a solution \(\theta \) to (19). Therefore, \(\eta _q-\eta _p\in \text {Image}(H)\) is satisfied if and only if q solves (19). \(\square \)
The empirical version of Theorem 4 offers us graphical images of the maximum likelihood estimation. Let \(\hat{D}\) be the empirical version of the canonical divergence, and \(\hat{\Pi }\) be the \(\hat{D}\)-projection operator. Then, if and only if the logit model is assumed, the \(\hat{D}\)-projection becomes equivalent to the constrained maximum likelihood estimation; that is,
This is because
For the dual parameter \(\hat{\eta }|_{\mathcal {H}}=\partial \hat{\psi }(\hat{\theta }|_{\mathcal {H}})\), the condition
holds if and only if the model is logit. Furthermore, if \(\hat{\eta }|_\mathcal {H}\) satisfies the condition, there exists \(\lambda \in \mathbb {R}^m\) such that \(\hat{\eta }-\hat{\eta }|_{\mathcal {H}}=H\lambda \). Therefore, for any \(\theta \in \mathcal {H}\),
is satisfied. The situation is shown in Fig. 1. The figure may seem obvious, but it should be remarked that this naive image of the orthogonal projection is consistent with the estimation under linear restrictions if and only if the logit model is assumed; in other models, such as the probit model, the orthogonal projection with respect to the Fisher information metric fails to maximize the likelihood on the affine linear submodel.
5 Discussion
In this study, we investigated the geometry of parametric binary choice models. The model was established as a dually flat space, in which the canonical coefficient parameter \(\theta \) acts as an affine coordinate. The dually flat structure introduces a canonical divergence on the model. The divergence is equivalent to the KL divergence if and only if the model is a logit model. As an example application, the projection onto an affine linear submodel was geometrically characterized.
The dual flatness of the binary choice model stems from the single-index structure of the model, which depends on parameter \(\theta \) only through the linear index \(x\cdot \theta \), making the Levi-Civita connection coefficients \(\Gamma _{ijk}\) symmetric in (i, j, k). Therefore, the results of this study can be extended to a more general class of single-index models, including nonlinear regressions, truncated regressions, and ordered discrete choice models. The studied model might also be extended to the neural network model, which consists of connected binary response models. Each node of the network is considered a single-index model. However, the entire structure of the network could be highly nonlinear in terms of parameter \(\theta \), which leads to non-flatness of the model.
Among the binary choice models, the logit has been shown to have good properties geometrically as well as statistically. This is not only because of the explicit integrability of the logit. In general, we say that a statistical model \(\mathcal {P}=\{p_\theta \mid \theta \in \Theta \}\) is an exponential family if it is expressed as
It is widely known that the (curved) exponential family possesses desirable properties such as higher-order efficiency of the maximum likelihood estimation [1, 6]. Although the logit model is not truly exponential, the conditional density \(p_\theta (y|x)\) is still written as
where
and
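For the logit, the conditional density in the omitted displays presumably takes the exponential-family form (a reconstruction, using the fact that \(\log \frac{F(u)}{1-F(u)}=u\) for the standard logistic F):

```latex
p_\theta(y\mid x) = \exp\bigl(y\,(x\cdot\theta) - \psi(\theta\mid x)\bigr),
\qquad
\psi(\theta\mid x) = \log\bigl(1+\exp(x\cdot\theta)\bigr),
```

which is consistent with \(\psi (\theta )=E[\log (1+\exp (x\cdot \theta ))]\) given earlier.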
Conditioned on x, model (21) belongs to an exponential family with potential \(\psi (\theta |x)\). Notably, \(\psi (\theta )=E\left[ \psi (\theta |x)\right] \). Because the marginal density \(p_X\) does not appear in the score of model (4), the statistical properties of the model are primarily determined by \(p_\theta (y|x)\). Our study suggests that the logit model is the unique binary choice model that belongs to a conditional exponential family.
Data Availability
No datasets were generated or used in this study.
References
Amari, S.: Differential geometry of curved exponential families-curvature and information loss. Ann. Stat. 10(2), 357–385 (1982)
Amari, S.: Information Geometry and Its Applications. Springer Japan KK, Tokyo (2016)
Amari, S., Nagaoka, H.: Methods of Information Geometry. Oxford University Press, Tokyo (2000)
Andrews, I., Mikusheva, A.: A geometric approach to nonlinear econometric models. Econometrica 84(3), 1249–1264 (2016)
Critchley, F., Marriott, P., Salmon, M.: On the differential geometry of the Wald test with nonlinear restrictions. Econometrica 64(5), 1213–1222 (1996)
Eguchi, S.: Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 11(3), 793–803 (1983)
Eguchi, S., Komori, O.: Minimum Divergence Methods in Statistical Machine Learning. From an Information Geometric Viewpoint. Springer Japan KK, Tokyo (2022)
Eguchi, S., Komori, O., Ohara, A.: Duality of maximum entropy and minimum divergence. Entropy 16(7), 3552–3572 (2014)
Kemp, G.C.R.: Invariance and the Wald test. J. Econom. 104(2), 209–217 (2001)
Lee, M.J.: Micro-Econometrics: Methods of Moments and Limited Dependent Variables, 2nd edn. Springer, New York (2010)
Marriott, P., Salmon, M.: An introduction to differential geometry in econometrics. In: Applications of Differential Geometry to Econometrics, pp. 7–63. Cambridge University Press, Cambridge (2000)
Nakajima, N., Ohmoto, T.: The dually flat structure for singular models. Inf. Geom. 4, 31–64 (2021)
Nielsen, F.: On geodesic triangles with right angles in a dually flat space. In: Progress in Information Geometry, pp. 153–190. Springer Nature Switzerland AG, Cham (2021)
Shima, H., Yagi, K.: Geometry of Hessian manifolds. Differ. Geom. Appl. 7, 277–290 (1997)
Train, K.E.: Discrete Choice Methods with Simulations. Cambridge University Press, New York (2003)
Acknowledgements
The author thanks the anonymous reviewer who provided valuable comments.
Author information
Contributions
The paper is written by a single author.
Ethics declarations
Conflict of interest
The author states that there is no conflict of interest to declare.
Additional information
Communicated by Frank Nielsen.
Cite this article
Tanaka, H.: Dually flat structure of binary choice models. Inf. Geom. (2024). https://doi.org/10.1007/s41884-024-00136-1