In this section, we go through the derivation of the Bayesian network Fisher kernel at a high level. To reduce clutter, some details of the derivations are postponed to the appendices.
Univariate Fisher kernel
While our interest is in a multivariate model, much of the derivation can be reduced to computing the Fisher kernel for a single categorical variable with r possible values, since a Bayesian network can be represented as a collection of conditional probability tables, and the tables themselves consist of univariate categorical distributions. Therefore, we will first study the simple univariate case.
The main result of this section is formulated as follows.
Lemma 1
Assume \(D\) is a single categorical variable taking values from 1 to \(r\) with probabilities \(\theta = (\theta _1,\ldots , \theta _r)\), where \(\theta _k > 0\) and \(\sum _{k=1}^r\theta _k = 1\). Now, the Fisher similarity for two observations \(x\) and \(y\) is given as:
$$\begin{aligned} K(x,y;\theta ) = {\left\{ \begin{array}{ll} \frac{1-\theta _{x}}{\theta _{x}} &{} \text{ if } x=y\text{, } \text{ and }\\ -1 &{} \text{ if } x\ne y, \end{array}\right. } \end{aligned}$$
where \(\theta _x = P(D = x ; \theta )\).
Proof
We will proceed in a straightforward fashion, dividing our proof into three parts: (1) we first compute the gradient of the log-likelihood; (2) we then find an expression for the inverse of the Fisher information matrix; and (3) lastly, we combine these to get the statement of Lemma 1.
Gradient of the log likelihood
Let \(d\) denote a value of \(D\). We can express the likelihood as:
$$\begin{aligned} P(d;\theta )=\prod _{k=1}^{r}\theta _{k}^{I_{d}(k)}, \end{aligned}$$
where \(\theta _{r}\) is not a real parameter of the model but a shorthand notation \(\theta _{r}\equiv 1-\sum _{k=1}^{r-1}\theta _{k}\). Thus, the log-likelihood function is \(\log P(d;\theta )=\sum _{k=1}^{r}I_{d}(k)\log \theta _{k}\). Taking the partial derivative with respect to \(\theta _k\) yields:
$$\begin{aligned} s(d;\theta )_{k} = \frac{\partial \log P(d;\theta )}{\partial \theta _k} = I_{d}(k)\theta _{k}^{-1}-I_{d}(r)\theta _{r}^{-1}. \end{aligned}$$
Using \(e_{k}\) to denote the \(k\)th standard basis vector of \({\mathbb {R}}^{r-1}\), and \(\mathbf {1}\) for the vector of all ones, the whole vector \(s(d;\theta )\) can be written as:
$$\begin{aligned} s(d;\theta )&= {\left\{ \begin{array}{ll} \theta _{d}^{-1}e_{d} &{} \text{ if } d<r, \text{ and }\\ -\theta _{r}^{-1}\mathbf {1} &{} \text{ if } d=r, \end{array}\right. } \end{aligned}$$
or using indicators
$$\begin{aligned} s(d;\theta )=\sum _{k=1}^{r-1}I_{d}(k)\theta _{k}^{-1}e_{k}-I_{d}(r)\theta _{r}^{-1}\mathbf {1}. \end{aligned}$$
(3)
It is worth emphasizing that each partial derivative depends on the data item \(d\), but is a function of the parameter \(\theta \). It may be useful to further reveal the structure by writing:
$$\begin{aligned} s(d;\theta )_{k} = {\left\{ \begin{array}{ll} 0 &{} \text{ if } d\ne k<r,\\ \theta _{k}^{-1} &{} \text{ if } d=k<r, \text{ and }\\ -\theta _{r}^{-1} &{} \text{ if } d=r. \end{array}\right. } \end{aligned}$$
So, depending on the data item \(d\), the partial derivative \(s(d;\theta )_{k}\) is one of the three functions of \(\theta \) above.
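To illustrate Eq. (3) concretely, the following NumPy sketch (not part of the original derivation; the three-valued distribution \(\theta = (0.5, 0.3, 0.2)\) is an arbitrary example) evaluates the score and compares it against a finite-difference gradient of the log-likelihood:

```python
import numpy as np

def score(d, theta):
    """Score s(d; theta) of Eq. (3).

    theta holds the full probability vector (theta_1, ..., theta_r);
    only theta_1, ..., theta_{r-1} are free parameters.
    """
    r = len(theta)
    s = np.zeros(r - 1)
    if d < r - 1:              # 0-based index: d corresponds to a value below r
        s[d] = 1.0 / theta[d]
    else:                      # d corresponds to the last value r
        s[:] = -1.0 / theta[-1]
    return s

def log_lik(d, free):
    """Log-likelihood log P(d; theta) as a function of the free parameters."""
    theta = np.append(free, 1.0 - free.sum())
    return np.log(theta[d])

theta = np.array([0.5, 0.3, 0.2])     # arbitrary example distribution, r = 3
free = theta[:-1]
eps = 1e-6
for d in range(3):
    # central finite-difference gradient with respect to the free parameters
    num = np.array([(log_lik(d, free + eps * e) - log_lik(d, free - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
    assert np.allclose(num, score(d, theta), atol=1e-5)
```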
The Fisher information and its inverse
The Fisher information matrix for the multinomial model (which is equivalent to our categorical case when the number of trials is 1) takes the form (Bernardo and Smith 1994, p. 336):
$$\begin{aligned} \mathcal {I}(\theta ) = \frac{1}{\theta _r} \begin{bmatrix} \theta _r\theta _1^{-1}+1&\ 1&\ \cdots&\ 1 \\ 1&\ \theta _r\theta _2^{-1}+1&\ \cdots&\ 1 \\ \vdots&\ \vdots&\ \ddots&\ \vdots \\ 1&\ 1&\ \cdots&\ \theta _r\theta _{r-1}^{-1}+1 \end{bmatrix} \end{aligned}$$
(4)
and its inverse is given as:
$$\begin{aligned} \mathcal {I}^{-1}(\theta ) = \begin{bmatrix} \theta _1(1-\theta _1)&\ -\theta _1\theta _2&\ \cdots&\ -\theta _1\theta _{r-1}\\ -\theta _1\theta _2&\ \theta _2(1-\theta _2)&\ \cdots&\ -\theta _2\theta _{r-1} \\ \vdots&\ \vdots&\ \ddots&\ \vdots \\ -\theta _1\theta _{r-1}&\ -\theta _2\theta _{r-1}&\ \cdots&\ \theta _{r-1}(1-\theta _{r-1}) \end{bmatrix}. \end{aligned}$$
(5)
Bernardo and Smith (1994) omit the explicit derivations of (4) and (5). We present the derivation of \(\mathcal {I}(\theta )\) in Appendix A. The inverse can be obtained by noting that \(\mathcal {I}(\theta )\) is expressible as a sum of two terms: an invertible diagonal matrix and a rank-one matrix. Inverting a matrix with this structure is straightforward (Miller 1981). Even more directly, computing the matrix product \(\mathcal {I}(\theta )\mathcal {I}^{-1}(\theta )\) and verifying that it yields the identity matrix shows that the inverse of \(\mathcal {I}(\theta )\) must have the form given in (5).
For further purposes, it is useful to write the elements of \(\mathcal {I}(\theta )\) as:
$$\begin{aligned} \mathcal {I}(\theta )_{kl} = \delta _{kl}\theta _{k}^{-1}+\theta _{r}^{-1}, \end{aligned}$$
(6)
where we used the Kronecker delta symbol \(\delta _{xy}\) that equals 1 if \(x=y\), and 0 otherwise.
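As a numerical sanity check (again using an arbitrary example distribution, not taken from the text), the sketch below builds \(\mathcal {I}(\theta )\) from Eq. (6) and the claimed inverse from Eq. (5), verifies that their product is the identity, and cross-checks \(\mathcal {I}(\theta )\) against the exact expectation \(E[s(D;\theta )s(D;\theta )^{T}]\):

```python
import numpy as np

theta = np.array([0.5, 0.3, 0.2])          # arbitrary example distribution, r = 3
r = len(theta)

# Fisher information, Eq. (6): I_kl = delta_kl / theta_k + 1 / theta_r
I = np.diag(1.0 / theta[:-1]) + 1.0 / theta[-1]

# Claimed inverse, Eq. (5): diag(theta_k) - theta theta^T over the free parameters
I_inv = np.diag(theta[:-1]) - np.outer(theta[:-1], theta[:-1])

assert np.allclose(I @ I_inv, np.eye(r - 1))

# Cross-check: Fisher information as the expectation E[s s^T] over D ~ theta
def score(d):
    s = np.zeros(r - 1)
    if d < r - 1:
        s[d] = 1.0 / theta[d]
    else:
        s[:] = -1.0 / theta[-1]
    return s

I_expect = sum(theta[d] * np.outer(score(d), score(d)) for d in range(r))
assert np.allclose(I, I_expect)
```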
The kernel
The last step is to combine our results to obtain the expression for the univariate kernel \(K(x,y) = s^T(x;\theta )\mathcal {I}^{-1}(\theta )s(y;\theta )\). As the gradient \(s(d;\theta )\) takes different forms depending on the value \(d\), it is perhaps easiest to enumerate all the cases corresponding to the different combinations of values \(x\) and \(y\). After considering all the cases, it turns out that the kernel takes different forms depending on whether \(x = y\) or \(x \ne y\). This derivation is done explicitly in Appendix A. We state only the final result here, which is also the statement of Lemma 1:
$$\begin{aligned} K(x,y;\theta )&= \delta _{xy}\theta _{x}^{-1}-1 \\&={\left\{ \begin{array}{ll} \frac{1-\theta _{x}}{\theta _{x}} &{} \text{ if } x=y\text{, } \text{ and }\\ -1 &{} \text{ if } x\ne y. \end{array}\right. } \end{aligned}$$
\(\square \)
We notice that dissimilar items are always dissimilar at level \(-1\), whereas the similarity of an item to itself lies in \([0,\infty )\). The similarity of a value to itself is greater for rare values, that is, when \(\theta _x = P(D = x ; \theta )\) is small.
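Lemma 1 can also be checked directly by comparing \(s^{T}(x;\theta )\mathcal {I}^{-1}(\theta )s(y;\theta )\) with the closed form \(\delta _{xy}\theta _{x}^{-1}-1\) for all pairs \((x,y)\); the sketch below does this for the same arbitrary example distribution (an illustration only, not part of the text):

```python
import numpy as np

theta = np.array([0.5, 0.3, 0.2])          # arbitrary example distribution, r = 3
r = len(theta)
I_inv = np.diag(theta[:-1]) - np.outer(theta[:-1], theta[:-1])   # Eq. (5)

def score(d):
    s = np.zeros(r - 1)
    if d < r - 1:
        s[d] = 1.0 / theta[d]
    else:
        s[:] = -1.0 / theta[-1]
    return s

for x in range(r):
    for y in range(r):
        kernel = score(x) @ I_inv @ score(y)
        closed_form = (1.0 / theta[x] - 1.0) if x == y else -1.0
        assert np.isclose(kernel, closed_form)
```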
Multivariate Fisher kernel
We will now move on to the derivation in the multivariate case, again leaving some details to Appendix B. As seen from Eq. (1), the log-likelihood function for Bayesian networks is:
$$\begin{aligned} \log P(d;\theta )&=\sum _{i=1}^n\sum _{j=1}^{q_i}\sum _{k=1}^{r_i} {I_{d_i}(k)\cdot I_{\pi _i(d)}(j)} \log \theta _{ijk} \nonumber \\&=\sum _{i=1}^n\sum _{j=1}^{q_i}I_{\pi _i(d)}(j)\sum _{k=1}^{r_i} {I_{d_i}(k)} \log \theta _{ijk}, \end{aligned}$$
(7)
where \(\theta _{ijr_{i}}\) is never a real parameter but just a shorthand \(\theta _{ijr_{i}}\equiv 1-\sum _{k=1}^{r_{i}-1}\theta _{ijk}\). From Eq. (7), it is easy to compute the components of the score using the calculations already made in the univariate case. This gives:
$$\begin{aligned} s(d;\theta )_{ijk} = I_{\pi _{i}(d)}(j)[I_{d_{i}}(k)\theta _{ijk}^{-1}-I_{d_{i}}(r_{i})\theta _{ijr_{i}}^{-1}], \end{aligned}$$
(8)
which also has the familiar form from the univariate case, apart from the extra indicator function due to the presence of parent variables. To get some intuition from the above equation, we can write it case by case as:
$$\begin{aligned} s(d;\theta )_{ijk} = {\left\{ \begin{array}{ll} 0 &{} \text{ if } \pi _{i}(d)\ne j,\\ 0 &{} \text{ if } \pi _{i}(d)=j\wedge d_{i}\ne k< r_i\wedge d_{i}\ne r_{i},\\ \theta _{ijk}^{-1} &{} \text{ if } \pi _{i}(d)=j\wedge d_{i}=k < r_i, \text{ and }\\ -\theta _{ijr_{i}}^{-1} &{} \text{ if } \pi _{i}(d)=j\wedge d_{i}=r_{i}. \end{array}\right. } \end{aligned}$$
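To make Eq. (8) concrete, the following sketch uses a hypothetical two-node network \(D_{1}\rightarrow D_{2}\) with binary variables and arbitrarily chosen parameters (purely illustrative, not taken from the text), and checks the score components against a finite-difference gradient of the Bayesian network log-likelihood:

```python
import numpy as np

# Hypothetical two-node network D1 -> D2 with binary variables.
# Free parameters: theta_{1,1,1} = P(D1=1), theta_{2,j,1} = P(D2=1 | D1=j).
free = np.array([0.6, 0.7, 0.2])

def log_lik(d, free):
    d1, d2 = d                         # values in {1, 2}
    t111, t211, t221 = free
    p1 = t111 if d1 == 1 else 1.0 - t111
    t2 = t211 if d1 == 1 else t221
    p2 = t2 if d2 == 1 else 1.0 - t2
    return np.log(p1) + np.log(p2)

def score(d, free):
    """Score vector of Eq. (8), ordered as (theta_111, theta_211, theta_221)."""
    d1, d2 = d
    t111, t211, t221 = free
    s = np.zeros(3)
    s[0] = 1.0 / t111 if d1 == 1 else -1.0 / (1.0 - t111)
    if d1 == 1:                        # parent configuration j = 1 is active
        s[1] = 1.0 / t211 if d2 == 1 else -1.0 / (1.0 - t211)
    else:                              # parent configuration j = 2 is active
        s[2] = 1.0 / t221 if d2 == 1 else -1.0 / (1.0 - t221)
    return s

eps = 1e-6
for d in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    num = np.array([(log_lik(d, free + eps * e) - log_lik(d, free - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
    assert np.allclose(num, score(d, free), atol=1e-5)
```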
Fisher information matrix
The derivation of the Fisher information matrix also proceeds along the same lines as in the univariate case. The indicator function corresponding to the parent configuration results in the marginal probability of the parent variables appearing in the formula. Details can be found in Appendix B. The end result is as follows:
$$\begin{aligned} \mathcal {I}(\theta )_{(xyz),(ijk)} = \delta _{ij,xy}P(\pi _{i}(D)=j)(\delta _{kz}\theta _{ijk}^{-1}+\theta _{ijr_{i}}^{-1}), \end{aligned}$$
(9)
where \(\delta _{ij,xy} = \delta _{ix}\cdot \delta _{jy}\) is a generalization of the Kronecker delta notation to pairs of indices.
Looking at Eq. (9), we see that the matrix has a block-diagonal structure, with one block for each parent configuration of every variable. For variable \(D_i\), the \(j\)th block is an \((r_i - 1) \times (r_i - 1)\) matrix with diagonal entries \(\theta _{ijk}^{-1} + \theta _{ijr_{i}}^{-1}\) for \(k = 1, \ldots , r_i - 1\) and off-diagonal entries \(\theta _{ijr_{i}}^{-1}\) (all values multiplied by \(P(\pi _{i}(D)=j)\)). So each block looks just like the matrix in the univariate case (see Eq. (4)), which we have already inverted.
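Continuing the hypothetical two-node binary example (an assumption for illustration only), the sketch below assembles \(\mathcal {I}(\theta )\) from Eq. (9) and verifies it against the exact expectation \(E[s(D;\theta )s(D;\theta )^{T}]\) over all joint configurations; with binary variables every block is \(1\times 1\), so the matrix is diagonal:

```python
import numpy as np

# Same hypothetical two-node network D1 -> D2 (binary variables) as above.
t111, t211, t221 = 0.6, 0.7, 0.2        # theta_111, theta_211, theta_221

def score(d):
    d1, d2 = d
    s = np.zeros(3)
    s[0] = 1.0 / t111 if d1 == 1 else -1.0 / (1.0 - t111)
    if d1 == 1:
        s[1] = 1.0 / t211 if d2 == 1 else -1.0 / (1.0 - t211)
    else:
        s[2] = 1.0 / t221 if d2 == 1 else -1.0 / (1.0 - t221)
    return s

def prob(d):
    d1, d2 = d
    p1 = t111 if d1 == 1 else 1.0 - t111
    t2 = t211 if d1 == 1 else t221
    return p1 * (t2 if d2 == 1 else 1.0 - t2)

configs = [(1, 1), (1, 2), (2, 1), (2, 2)]

# Fisher information as the exact expectation E[s s^T] over the joint distribution.
I_expect = sum(prob(d) * np.outer(score(d), score(d)) for d in configs)

# Fisher information assembled from Eq. (9): P(parent config) * (1/theta_ij1 + 1/theta_ij2)
# on the diagonal of each (here 1 x 1) block.
I_eq9 = np.diag([1.0 / t111 + 1.0 / (1.0 - t111),                    # D1: empty parent set, P = 1
                 t111 * (1.0 / t211 + 1.0 / (1.0 - t211)),           # D2, parent D1 = 1
                 (1.0 - t111) * (1.0 / t221 + 1.0 / (1.0 - t221))])  # D2, parent D1 = 2

assert np.allclose(I_expect, I_eq9)
```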
Inverse of the Fisher information matrix
The inverse of a block-diagonal matrix is obtained by inverting the blocks separately. Applying the univariate result to each of the blocks, we obtain that the \(ij\)th block has the form:
$$\begin{aligned} ^{ij\!}[\mathcal {I}^{-1}(\theta )]=\frac{1}{P(\pi _{i}(D)=j)}[^{ij\!}A-^{ij\!}F], \end{aligned}$$
where \(^{ij\!}A\) is a diagonal matrix with entries \(^{ij\!}A_{kk}=\theta _{ijk}\) and \(^{ij\!}F_{kl}=\theta _{ijk}\theta _{ijl}\).
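The claimed form of the blocks of the inverse can be checked numerically one block at a time; in the sketch below, the block parameters \(\theta _{ij\cdot }\) (with \(r_i=3\)) and the parent-configuration probability are arbitrary example values chosen for illustration:

```python
import numpy as np

# One hypothetical block: variable D_i with r_i = 3 values under parent configuration j.
theta_ij = np.array([0.5, 0.3, 0.2])    # theta_ij1, theta_ij2, theta_ij3
p_parent = 0.4                          # P(pi_i(D) = j), an arbitrary example value

# Block of the Fisher information matrix, Eq. (9)
block = p_parent * (np.diag(1.0 / theta_ij[:-1]) + 1.0 / theta_ij[-1])

# Claimed block of the inverse: (A - F) / P(pi_i(D) = j)
A = np.diag(theta_ij[:-1])
F = np.outer(theta_ij[:-1], theta_ij[:-1])
block_inv = (A - F) / p_parent

assert np.allclose(block @ block_inv, np.eye(2))
```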
Fisher kernel
In calculating \(K(x,y;\theta )=s(x;\theta )^{T}\mathcal {I}^{-1}(\theta )\,s(y;\theta )\), the block diagonality of \(\mathcal {I}^{-1}(\theta )\) saves us a lot of computation. We therefore partition the score vector into sub-vectors \(s(d;\theta )_{i}\) that contain the partial derivatives for all the parameters \(\theta _{i} = \cup _{j = 1}^{q_i}\cup _{k=1}^{r_i -1}\{ \theta _{ijk} \}\), and then further subdivide these into parts \(s(d;\theta )_{ij}\).
Since the kernel \(K(x,y;\theta )\) can be expressed as the inner product \(\langle s(x;\theta )^{T}\mathcal {I}^{-1}(\theta ),\,s(y;\theta )\rangle \), we may decompose it into a sum over the variables:
$$\begin{aligned} K(x,y;\theta )=\sum _{i=1}^{n}K_{i}(x,y;\theta ), \end{aligned}$$
where \(K_{i}(x,y;\theta )=\langle [s(x;\theta )^{T}\mathcal {I}^{-1}(\theta )]_{i},\,s(y;\theta )_{i}\rangle \). It is therefore sufficient to study the terms \(K_{i}.\)
Since in \(s(x;\theta )_{i}\) only the entries corresponding to the indices \((i,\pi _{i}(x))\) may be non-zero, the same applies to the row vectors \([s(x;\theta )^{T}\mathcal {I}^{-1}(\theta )]_{i}\) due to the block diagonality of \(\mathcal {I}^{-1}(\theta )\).
This immediately implies that if \(\pi _{i}(x)\ne \pi _{i}(y)\), then \(K_{i}(x,y;\theta )=0\). If \(\pi _{i}(x)=\pi _{i}(y)=j\), the computation reduces to the univariate case for \(K(x_{i},y_{i};\theta _{ij})\), after taking into account the factors \(P(\pi _{i}(D)=j)\) in the Fisher information matrix blocks. Here \(\theta _{ij} = \{\theta _{ijk} \mid k = 1, \ldots ,r_i \}\).
Summarizing these results, we get \(K(x,y;\theta )=\sum _{i=1}^{n}K_{i}(x,y;\theta )\), where
$$\begin{aligned} K_{i}(x,y;\theta )={\left\{ \begin{array}{ll} 0 &{} \text{ if } \pi _{i}(x)\ne \pi _{i}(y)\text{, }\\ -\frac{1}{P(\pi _{i}(x);\theta )} &{} \text{ if } \pi _{i}(x)=\pi _{i}(y) \wedge (x_{i}\ne y_{i})\text{, } \text{ and }\\ \frac{1}{P(\pi _{i}(x);\theta )}\frac{1-\theta _{i\pi _{i}(x)x_{i}}}{\theta _{i\pi _{i}(x)x_{i}}} &{} \text{ if } \pi _{i}(x)=\pi _{i}(y) \wedge (x_{i}=y_{i}), \end{array}\right. } \end{aligned}$$
or more compactly
$$\begin{aligned} K_{i}(x,y;\theta )&= \frac{\delta _{\pi _{i}(x)\pi _{i}(y)}}{P(\pi _{i}(x);\theta )}\left[ \frac{\delta _{x_{i}y_{i}}}{\theta _{i\pi _{i}(x)x_{i}}}-1\right] \\&= \frac{\delta _{\pi _{i}(x)\pi _{i}(y)}K(x_{i},y_{i};\theta _{i\pi _{i}(x)})}{P(\pi _{i}(x);\theta )}. \end{aligned}$$
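Finally, the closed form above can be compared against the direct computation \(s(x;\theta )^{T}\mathcal {I}^{-1}(\theta )s(y;\theta )\). The sketch below reuses the hypothetical two-node binary network from the earlier snippets (arbitrary parameters, for illustration only, not taken from the text):

```python
import numpy as np

# Hypothetical two-node network D1 -> D2 with binary variables (as above).
t111, t211, t221 = 0.6, 0.7, 0.2

def score(d):
    d1, d2 = d
    s = np.zeros(3)
    s[0] = 1.0 / t111 if d1 == 1 else -1.0 / (1.0 - t111)
    if d1 == 1:
        s[1] = 1.0 / t211 if d2 == 1 else -1.0 / (1.0 - t211)
    else:
        s[2] = 1.0 / t221 if d2 == 1 else -1.0 / (1.0 - t221)
    return s

# Inverse Fisher information: with binary variables every block is 1 x 1,
# so we simply invert the diagonal entries of Eq. (9).
I_inv = np.diag(1.0 / np.array([1.0 / t111 + 1.0 / (1.0 - t111),
                                t111 * (1.0 / t211 + 1.0 / (1.0 - t211)),
                                (1.0 - t111) * (1.0 / t221 + 1.0 / (1.0 - t221))]))

def kernel_closed_form(x, y):
    """Sum of the per-variable terms K_i(x, y; theta) from the final formula."""
    k = 0.0
    # Variable D1: empty parent set, so the parent configurations always agree
    # and P(pi_1(x)) = 1.
    th = t111 if x[0] == 1 else 1.0 - t111
    k += (1.0 - th) / th if x[0] == y[0] else -1.0
    # Variable D2: parent configurations agree iff x_1 = y_1.
    if x[0] == y[0]:
        p_parent = t111 if x[0] == 1 else 1.0 - t111
        t2 = t211 if x[0] == 1 else t221
        th = t2 if x[1] == 1 else 1.0 - t2
        k += ((1.0 - th) / th if x[1] == y[1] else -1.0) / p_parent
    return k

configs = [(1, 1), (1, 2), (2, 1), (2, 2)]
for x in configs:
    for y in configs:
        direct = score(x) @ I_inv @ score(y)
        assert np.isclose(direct, kernel_closed_form(x, y))
```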