
1 Introduction

Regression models are ubiquitous in computer vision applications (e.g., medical imaging [1] and face alignment by shape regression [2]). In scientific data analysis, regression models are the default tool of choice for identifying the association between a set of input feature vectors (covariates) \(\mathbf{x} \in \mathcal {X}\) and an output (dependent) variable \(y \in \mathcal {Y}\). In most applications, the regressor is obtained by minimizing (or maximizing) the loss (or fidelity) function assuming the dependent variable y is corrupted with noise \(\epsilon \): \(y = f(\varvec{x}) + \epsilon \). Consequently, solving for the regressor is, in fact, equivalent to estimating the expectation \(\mathbb {E}[y | \varvec{x} ]\); in statistics and machine learning, this construction is typically referred to as forward (or standard) regression [3]. The above formulation does not attempt to model noise in \(\varvec{x}\) directly. Even for linear forms of \(f(\cdot )\), if the noise characteristics are not strictly additive and normally distributed (i.e., so that \(y = f(\varvec{x}+\epsilon ) \Leftrightarrow y = f(\varvec{x})+\epsilon '\)), parameter estimates and consistency properties will not hold in general [4, 5].

Fig. 1. Dynamic feature weights for two tasks: ambient temperature prediction (left) and age estimation (right). Our formulation provides a way to determine, at test time, which features are most important to the prediction. Our results are competitive, which demonstrates that we achieve this capability without sacrificing accuracy. (Color figure online)

The above issues have long been identified in the statistics literature and a rich body of work has emerged. One form of these models (among several) is typically referred to as inverse regression [6]. Here, the main idea is to estimate \(\mathbb {E}[\varvec{x}|y]\) instead of \(\mathbb {E}[y|\varvec{x}]\), which offers asymptotic and practical benefits and is particularly suitable in high-dimensional settings. To see this, consider the simple setting in which we estimate a regressor for n samples of \(\varvec{x} \in \mathbf R ^p\), \(p \gg n\). The problem is ill-posed and regularization (e.g., with \(\ell _0\) or \(\ell _1\) penalty) is required. Interestingly, for linear models in the inverse regression setting, the problem is still well specified since it is equivalent to a set of p univariate regression models going from y to a particular covariate \(x^j\) where \(j \in \{1,\cdots , p\}\).

As an illustrative example, let us compare the forward and inverse regression models for \(p > n\). For forward regression, we have \(y = \mathbf {b}^\mathsf {T}\varvec{x}+\epsilon \) and so \(\mathbf {b}^*=(X^\mathsf {T}X)^{-1}X^\mathsf {T}\mathbf {y}\), where \(X = [\varvec{x}_{1} \;\cdots \;\varvec{x}_{n}]^\mathsf {T}\) and \(\mathbf {y} = [y^{1} \;\cdots \;y^{n}]^\mathsf {T}\). This is problematic due to a rank deficient \(X^\mathsf {T}X\). But in the inverse regression case, we have \(\varvec{x} = \mathbf {b}y+\epsilon \) and so \(\mathbf {b}^*=(\mathbf {y}^\mathsf {T}\mathbf {y})^{-1}\mathbf {y}^\mathsf {T}X\), which can be computed easily. In statistics, a widely used algorithm based on this observation is Sliced Inverse Regression (SIR) [3]. At a high level, SIR is a dimensionality reduction procedure that calculates \(\mathbb {E}[\varvec{x}|y]\) for each ‘slice’ (i.e., bin) in the response variable y and finds subspaces where the projection of the set of covariates is dense. The main idea is that, instead of using the full covariance of the covariates \(\varvec{x}\)’s or \([\varvec{x}^\mathsf {T}\;\; y]^\mathsf {T}\), we use the \(\mathbb {E}[\varvec{x}|y]\) for each bin within y as a new feature with a weight proportional to the number of samples (or examples) within that specific bin. Then, a principal components-derived subspace for such a covariance matrix yields a lower-dimensional embedding that incorporates the proximity between the y’s for subsets of \(\varvec{x}\).
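For concreteness, the following is a minimal NumPy sketch of the SIR procedure described above (slice y, average \(\varvec{x}\) within each slice, and take the principal directions of the weighted covariance of the slice means). The function, its simplifications (per-coordinate standardization instead of full whitening), and the toy data are our own illustration, not code from [3].

```python
import numpy as np

def sliced_inverse_regression(X, y, n_slices=10, n_directions=2):
    """Minimal SIR sketch: estimate E[x|y] per slice of y and take the
    leading eigenvectors of the weighted covariance of slice means."""
    n, p = X.shape
    # Standardize covariates (full SIR whitens with the sample covariance;
    # we only standardize per coordinate for simplicity).
    Z = (X - X.mean(0)) / (X.std(0) + 1e-12)
    # Slice the response into bins with roughly equal counts.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Weighted covariance of slice means, weights proportional to slice size.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(0)            # E[x | y in slice]
        M += (len(idx) / n) * np.outer(m, m)
    # Principal directions of M span the estimated reduction subspace.
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, ::-1][:, :n_directions], eigvals[::-1]

# Toy usage: y depends on a single direction of a high-dimensional x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, 0] + 0.1 * rng.normal(size=200)
B, _ = sliced_inverse_regression(X, y)
```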

This idea has seen renewed interest in the machine learning and computer vision communities [7, 8]. For example, consider the following simple example demonstrated in a relatively recent paper [9]: the goal was to utilize an intrinsic low-dimensional (e.g., 2D or 3D) representation of the input image (or a silhouette) to predict body pose, parameterized as 3D joint angles at articulation points. Exploiting the structure in the Gram matrix of the output space makes it possible to identify conditional dependencies between the input covariates and the multivariate output responses, which is not otherwise possible; for example, we do not typically know which low-dimensional representation of the input images best predicts specific values of the output label.

The above discussion suggests that SIR models can effectively find a single global subspace for the input samples \(\varvec{x}\) considering the conditional distribution of \((\varvec{x}|y)\). However, there are a number of practical considerations that the SIR model is ill-equipped to handle. For example, in computer vision applications operating in the wild, such as the temperature prediction task shown in Fig. 1, we can rarely find a global embedding that fully explains the relationship between the covariates and the response. In fact, subsets of samples may be differently associated with slices of the output space. Further, many ‘relevant’ features may be systematically corrupted or unavailable in a non-trivial fraction of images. In practice, one finds that these issues strongly propagate as errors in the calculated embedding, making the downstream analysis unsatisfactory. Of course, in the forward regression setting, this problem is tackled by performing feature selection via sparsity-type penalties, which emphasize the reliable features in the estimation. The direct application of this idea in the inverse regression model is awkward since the ‘predictor’ y (which is the response in forward regression) for \(\varvec{x}\) is just one dimensional.

It turns out that the desirable properties we seek to incorporate within inverse regression actually fall out of an inherent characteristic in many vision datasets, namely an abundance of features. In other words, in associating a large set of features derived from an image to an output label (or response) y, it is often the case that different subsets of features/covariates predict the label equally well. In the inverse regression context, this property enables adapting associations between density windows of the output space with different subsets of covariates dynamically on a sample-by-sample basis. If a covariate is generally relevant but missing for a small subset of examples (e.g., due to occlusion, noise, or corruption), the formulation allows switching the hypothesis to a distinct ‘support’ of covariates for these samples alone.

In summary, exploiting abundance in inverse regression yields robust and highly flexible models. Perhaps more importantly, we obtain highly interpretable models (in an individual, sample-specific way), which is crucial in many applications. In a mammogram exam, for example, an explanation of why a patient was assigned a high probability of malignancy is critical. Such functionality, obtained with no compromise in performance, is valuable in applications but natively available in very few models; beyond Decision Trees and Inductive Logic Programming, regression models seldom offer this flexibility. Next, we give a few motivating examples and then list the main contributions of this paper.

1.1 Motivating Examples

Consider the two tasks in Fig. 1 where the relevance of features/covariates varies depending on the context and the specific samples under consideration. In facial age estimation, a feature from a local patch at a fiducial point (e.g., lip, eye corners) carries a great deal of information for predicting age. But if the patch is occluded, this feature is not relevant for that particular image. Consider another example focused on ambient temperature estimation from outdoor scene images recently tackled in [10]. Here, we must deal not only with occlusion and corruption, but, depending on the context, the relevance of an otherwise predictive feature may also vary. For example, the appearance of a tree (e.g., leaf color and density) may enable identifying subtle changes even within a specific season (e.g., early or late spring). But in winter, after the trees have shed their leaves, this feature carries little useful information for predicting day-to-day temperature. In this case, the relevance of the feature varies with the specific values assigned to the response y. Importantly, being able to evaluate the different features driving a specific prediction can guide improvement of learning algorithms by enabling human interpretability.

1.2 Contributions

To summarize, our contribution is a novel formulation using inverse regression and sufficient reduction that provides end-to-end statistical strategies for enabling (1) adaptive and dynamic associations between abundant input features and prediction outputs on an image-by-image basis, and (2) human interpretability of these associations. Our model dynamically updates the relevance of each feature on a sample-by-sample basis and allows for missing or randomly corrupted covariates. Less formally, our algorithm explains why a specific decision was made for each example (based on feature-level dynamic weights). We analyze the statistical properties of our formulation and show experimental results in three different problem settings, which demonstrate its wide applicability.

2 Estimating the Conditional Confidence of Covariates

Given a supervised learning task, our overall workflow consists of two main modules. We will first derive a formulation to obtain the confidence associated with individual covariates \(x^j\) conditioned on the label y. Once the details of this procedure are derived, we will develop algorithms that exploit these conditional confidences for prediction while also providing information on which covariates were responsible for that specific prediction. We start by describing the details of the first module.

2.1 A Potential Solution Based on Sufficient Dimension Reduction

The ideal mechanism to assign a confidence score to individual covariates, \(x^j\), should condition the estimate based on knowledge of all other (uncorrupted) covariates \(x^{-j}\) as well as the response variable y. This is a combinatorial problem that quickly becomes computationally intractable. For example, even when we consider only a single pair of covariates and a response, the number of terms will quadratically increase as \(f(x^1|x^2,y),f(x^1|x^3,y),f(x^1|x^4,y),\ldots f(x^1|x^p,y)\). Another related issue is that, when considering dependencies between multiple variables \(f(x^1|x^2,x^3,\ldots ,y)\), estimation is challenging because the conditional distribution is high-dimensional and the number of samples may be small in comparison. Further, in the prediction phase, we do not have access to the true y, which makes conditioning somewhat problematic. An interesting starting point in formulating a solution is the concept of sufficient dimension reduction [11]. We provide a definition and subsequently describe our idea.

Definition 1

Given a regression model \(h: X \rightarrow Y\), a reduction \(\phi : \mathbf R ^{p} \rightarrow \mathbf R ^{q}, q \le p\), is sufficient for the regression task if it satisfies one of the following conditions:

  • (1) inverse reduction, \(X|(Y,\phi (X)) \ \sim \ X|\phi (X)\),

  • (2) forward reduction, \(Y|X \ \sim \ Y|\phi (X)\),

  • (3) joint reduction, \(X \perp \!\!\! \perp Y \mid \phi (X)\),

where \(\perp \!\!\! \perp \) indicates independence, \(\sim \) means identically distributed, and A|B refers to the random vector A given the vector B [11, 12].

Example 1

Suppose we are interested in predicting the obesity y of a subject using a regression model \(h:x\rightarrow y\) with 10 covariates such as weight \(x^{1}\), height \(x^{2}\), education \(x^{3}\), age \(x^{4}\), gender \(x^{5}\), \(\ldots ,\) BMI \(x^{10}\). Since obesity is highly correlated with weight and height, \((y| x^{1},x^{2},\ldots , x^{10}) \sim (y|\phi (x^{1},\ldots , x^{10})) \sim (y|x^{1},x^{2})\). Here, we call \(\phi :(x^{1}, \ldots , x^{10}) \rightarrow (x^{1}, x^{2})\) a sufficient reduction for the given regression task. Also, for predicting BMI, i.e., \(h':(x^{1},\ldots , x^{9},y) \rightarrow x^{10}\), \(\phi ':(x^{1},\ldots , x^{9}) \rightarrow (x^{1}, x^{2})\) is a sufficient reduction since \((x^{10}| x^{1},\ldots , x^{9},y) \sim (x^{10}| x^{1},x^{2})\).

Our goal is to address the intractability problem by characterizing \((x^{j}|x^{-j},y)\) in a simpler form based on the definition of sufficient reduction. Notice that sufficient reduction relies on specifying an appropriate regression model and we seek to derive identities for the expression \((x^{j}|x^{-j},y)\). It therefore makes sense to structure our regression problem as \(h:x^{-j},y \rightarrow x^j\). The definition of forward reduction states that if \(Y|X \sim Y|\phi (X)\) holds, \(\phi (X)\) is a sufficient reduction for the regression problem h. In this definition, if we let \(X = x^j\), \(Y= (x^{-j},y)\), and \(\phi (X)=\phi (x^{-j},y)\), we directly have \((x^j | x^{-j},y) \sim (x^j|\phi (x^{-j},y))\), as desired.

Why is this useful? The conditional distribution \(f(x^{j}|x^{-j}, y) = f(x^{j}|\phi (x^{-j}))\) can be estimated more efficiently in a lower-dimensional space using sufficient reduction. In addition, once we assume that the sufficient reduction values coincide with y, i.e., \(\phi (x^{-j})=y\), estimating the conditional distribution simplifies to \(f(x^{j}|x^{-j}, y) = f(x^{j}|\phi (x^{-j}))=f(x^{j}|y)\). Intuitively, this special case is closely related to the well-known conditional independence of features given the response used in the naïve Bayes relationship:

$$\begin{aligned} f(y|x) = \frac{f(y)\prod _j f(x^{j}|y)}{f(x)}. \end{aligned}$$
(1)

In other words, given a sufficient reduction, all covariates \(x^j\) are conditionally independent. The form in Eq. (1) is simply the special case where \(\phi (\cdot )\) is y; the general form, on the other hand, allows significant flexibility in specifying other forms for \(\phi (\cdot )\) (e.g., any lower-dimensional map) as well as setting up the conditional dependence concretely in the context of conditional confidence. Note that sufficient reduction methods are related to generative models (including naïve Bayes). It is tempting to think that generative models with lower-dimensional hidden variables play the same role as sufficient reduction. However, the distinction is that the sufficient reduction \(\phi \) from SIR can be obtained once, independently of the downstream analysis (regression), whereas hidden variables in generative models must be specified and learned for each regression model. Now, the remaining piece is to give an expression for the conditional confidence distribution. For simplicity, in this work, we use a (multivariate) Gaussian, which makes evaluating \(\mathbb {E}[x^{j}|y]\) and \(\text {VAR}[x^{j}|y]\) straightforward.
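To make the Gaussian choice concrete, the sketch below estimates \(\mathbb {E}[x^{j}|y]\) and \(\text {VAR}[x^{j}|y]\) by slicing the response and fitting a univariate Gaussian per slice, then evaluates \(f(x^{j}|y)\) as a confidence score. The slice-based estimator and all names are illustrative assumptions (and a univariate simplification), not the exact implementation used in our experiments.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fit_conditional_gaussian(xj, y, n_slices=10):
    """Estimate E[x^j|y] and VAR[x^j|y] on slices (bins) of the response y."""
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    means, variances = np.empty(n_slices), np.empty(n_slices)
    for k, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (y >= lo) & (y <= hi)
        means[k] = xj[mask].mean()
        variances[k] = xj[mask].var() + 1e-8   # small jitter for stability
    return edges, means, variances

def conditional_confidence(xj_value, y_value, edges, means, variances):
    """Gaussian density f(x^j | y), evaluated with the slice containing y_value."""
    k = int(np.clip(np.searchsorted(edges, y_value) - 1, 0, len(means) - 1))
    return gauss_pdf(xj_value, means[k], variances[k])
```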

Remarks. Notice that \(x^{j}\) may not always correspond to a unique covariate. Instead, it may refer to a subset of covariates, e.g., multiple features from a local patch in an image may constitute a specific \(x^{j}\). In practice, one or more of these features may be irrelevant to the given regression problem. This situation requires special handling: briefly, we consider the support of the regression coefficients for \(\mathbb {E}[ y|x^{j}]\) and assess the confidence of the feature by its deviation from \(\mathbb {E}[x|y]\) only along the relevant regression direction. These extensions will be described later.

2.2 A Simple Estimation Scheme Based on Abundant Features

The above description establishes the identity, \(f(x^{j}|\phi (x^{-j})) = f(x^{j}|y)\), assuming \(\phi (x^{-j}) = y\) and gives us a general expression to calculate the conditional confidence of individual covariates. What we have not addressed so far is a constructive scheme to actually calculate \(\phi (x^{-j})\) so that it serves as a surrogate for y. We describe this procedure below based on sufficient reduction.

A natural strategy is to substitute y using predicted estimates, \(\hat{y}\), derived from a subset of covariates, \(\{1,\cdots ,p\}{\setminus }j\). The difficulty, however, is that many of these subsets may be corrupted or unavailable. Fortunately, we find that in most situations (especially with image data), multiple exclusive subsets of the covariates can reliably predict the response. This corresponds to the abundant features assumption described earlier, and seems to be valid in many vision applications including the three examples studied in this paper. This means that we can define \(\phi ^I(x^I)\) for distinct subsets I of the covariate set, \(\{1,\cdots ,p\}{\setminus }j\). Intuitively, a potentially large number of I’s will each index unique subsets and can eventually be used to obtain a reliable prediction for y, which makes the sufficient reduction condition, \(\phi ^I(x^I) = y\), sensible. Marginalizing over distinct I’s, we can obtain \(\mathbb {E}[x^j | \phi ^I(x^I) ]\) (described below). Then, by calculating the discrepancy between \(\mathbb {E}[x^j | \phi ^I(x^I) ] =\mathbb {E}[x^j | \hat{y}^I ]\) and \(x^j\), we can evaluate the conditional confidence of each specific covariate \(x^j\).

Marginalizing over I to calculate \(\mathbb {E}[f(x^j | \phi ^I(x^I) )]\). To calculate \(\mathbb {E}[f(x^j | \phi ^I(x^I) )]\), the only additional piece of information we need is the probability of the index set I. This can be accomplished by imposing a prior over each corresponding sufficient reduction, \(\phi ^I(\cdot )\), as \(w_{\phi ^I}:=\mathbb {E}[(y- \phi ^I(x^I) )^2]^{-1}\), which expresses the belief that the reliability of distinct sufficient reductions \(\phi ^I(\cdot )\) varies as a function of the subset of patches each one indexes. This means that the conditional confidence for a covariate is calculated as a weighted mean of \(f(x^j|\phi ^I(x^I))=f(x^j|\hat{y}^I)\) using the weights \(w_{\phi ^I}\) (see Line 4 in Algorithm 1). With these ingredients, we present the complete algorithm in Algorithm 1.

Algorithm 1 (pseudocode figure).
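Because the pseudocode figure is not reproduced here, the following minimal sketch reflects our reading of Algorithm 1: each covariate subset I provides a surrogate prediction \(\hat{y}^I\), subsets are weighted by \(w_{\phi ^I}\), and the confidence of \(x^j\) is the \(w_{\phi ^I}\)-weighted average of \(f(x^j|\hat{y}^I)\). The callables and data structures are placeholders, not the paper's exact components.

```python
import numpy as np

def confidence_of_covariate(j, x_test, subset_models, subset_weights,
                            cond_density):
    """Sketch of Algorithm 1 for a single test sample.

    subset_models[I]        : callable x_I -> y_hat (a weak sufficient reduction phi^I)
    subset_weights[I]       : prior w_{phi^I} ~ 1 / E[(y - phi^I(x^I))^2], from training
    cond_density(xj, y_hat) : estimate of f(x^j | y), evaluated at y = y_hat
    """
    num, den = 0.0, 0.0
    for I, model in subset_models.items():
        if j in I:                      # phi^I must not use covariate j itself
            continue
        y_hat = model(x_test[list(I)])  # surrogate for the unknown response
        w = subset_weights[I]
        num += w * cond_density(x_test[j], y_hat)
        den += w
    return num / max(den, 1e-12)        # weighted conditional confidence of x^j
```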

2.3 Deriving Priors for Sufficient Dimension Reduction

We now describe how to derive priors for sufficient dimension reduction using a convex combination of multiple sufficient reductions. We assume that each weak sufficient reduction \(\phi ^I(\cdot )\) is an unbiased estimator for y. Since a convex combination of unbiased estimators (an expectation over estimators) is also unbiased, our problem is to find the weights of the combination that minimize its variance. Once calculated, we will directly use these weights as the prior for each \(\phi ^I(\cdot )\).

Let \(\phi ^1(x^1) \sim \mathcal {N}(y, \sigma _1^2), \ldots , \phi ^K(x^K) \sim \mathcal {N}(y, \sigma _K^2)\) denote a set of sufficient reductions for different subsets I of \(\{1,\cdots ,p\}{\setminus }j\), where the subset index I ranges over \(\{1, \cdots , K\}\). This means that \(y=\mathbb {E}[\phi ^I(x^I)]\) since each estimator is unbiased. We further assume that the estimators are independent given y, which means, roughly speaking, that the prediction errors of the different sufficient reductions are uncorrelated. The problem of calculating the weights, \(\varvec{w}\), then reduces to the following optimization model,

$$\begin{aligned} \min _{\varvec{w}} \text {VAR}\left[ \sum _{I=1}^K \phi ^I(x^I)w^I \right] \; \text {s.t.} \ \sum _I w^I = 1 \text { and } w^I \ge 0 \text {, for all } I \in \{1, \ldots , K\}. \end{aligned}$$
(2)

Since we assume that the error is independent given y, Eq. (2) can be written as

$$\begin{aligned} \min _{\varvec{w}} \sum _{I=1}^K \sigma ^2_I (w^I)^2 \;\; \text {s.t.} \ \sum _I w^I = 1 \text { and } w^I \ge 0 \text {, for all } I \in \{1, \ldots , K\}. \end{aligned}$$
(3)

The optimal weights \(\varvec{w}\) have a closed form due to the following result.

Lemma 1

Based on the KKT optimality conditions, one can verify (see the extended paper) that the optimal weights for Eq. (3) are \(w^I = \sigma ^{-2}_I /\sum _{k=1}^K \sigma ^{-2}_k\). This is the unique global optimum of Eq. (3) when \(\sigma _I^2 >0, \forall I \in \{1,\ldots , K \}\).

This provides a weight for each subset \(I \in \{1,\ldots , K \}\) for an arbitrary constant K.
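A short sanity check of Lemma 1 (our own, not part of the paper): compute the inverse-variance weights directly and verify by simulation that the combined estimator has lower variance than any individual \(\phi ^I\).

```python
import numpy as np

def optimal_weights(sigmas):
    """Lemma 1: w^I proportional to 1/sigma_I^2, normalized to sum to one."""
    inv_var = 1.0 / np.asarray(sigmas) ** 2
    return inv_var / inv_var.sum()

# Combine K=3 unbiased estimators of y with different noise levels.
rng = np.random.default_rng(0)
y_true, sigmas = 5.0, np.array([0.5, 1.0, 2.0])
w = optimal_weights(sigmas)                       # approx [0.76, 0.19, 0.05]
estimates = y_true + sigmas * rng.normal(size=(100000, 3))
combined = estimates @ w
print(combined.var(), (sigmas ** 2).min())        # combined variance < min sigma_I^2
```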

In the extended paper, we present a scheme to estimate the conditional confidence of specific features within a particular covariate by considering the sufficient reduction direction. This reduces the influence of irrelevant features within a multivariate covariate, given a regression task. Next, we introduce a variant of kernel regression when covariates (and their multivariate features) have an associated conditional confidence score.

3 Conditional Confidence Aware Kernel Regression

In this section we modify an existing kernel regressor formulation to exploit the conditional confidence of covariates. This final module is needed to leverage the conditional confidence towards constructions that can be applied to applications in machine learning and computer vision.

We start from Nadaraya-Watson kernel regression with a Gaussian kernel. Since this estimator requires a dissimilarity measure between samples, we only need to define a meaningful measure using the covariate confidences. A simple adjustment suffices: the distance measure should use covariates (both univariate and multivariate) differentially, in proportion to their confidence level. One such measure is the confidence-weighted expectation of the per-covariate distances:

$$\begin{aligned} \text {d}_w(x_1, x_2,w_1,w_2) := \sqrt{\frac{\sum _j w_{x_1^j} w_{x_2^j} (x_1^j-x_2^j)^2}{\sum _j w_{x_1^j} w_{x_2^j}}}. \end{aligned}$$
(4)

The expression in Eq. (4) is agnostic to the example-specific labels (even if they were available). Interestingly, the weights \(w_{x^j}\) are obtained through sufficient reduction, which serves as a surrogate for the unknown labels/responses. This scheme still provides meaningful distances even when one or more covariates are corrupted or unavailable. Next, we modify Eq. (4) so that, under some conditions, it is guaranteed to be an unbiased estimator of distances between uncorrupted covariates.
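For reference, a minimal sketch of the confidence-aware Nadaraya-Watson estimator built on Eq. (4); the bandwidth and the per-covariate weights are assumed to be supplied by the training procedure above.

```python
import numpy as np

def weighted_distance(x1, x2, w1, w2):
    """Eq. (4): confidence-weighted Euclidean distance between two samples."""
    ww = w1 * w2
    return np.sqrt(np.sum(ww * (x1 - x2) ** 2) / np.sum(ww))

def kernel_regression(x_query, w_query, X_train, W_train, y_train, bandwidth=1.0):
    """Nadaraya-Watson estimate with a Gaussian kernel on Eq. (4) distances."""
    d = np.array([weighted_distance(x_query, x, w_query, w)
                  for x, w in zip(X_train, W_train)])
    k = np.exp(-d ** 2 / (2 * bandwidth ** 2))
    return np.sum(k * y_train) / (np.sum(k) + 1e-12)
```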

3.1 Unbiased Estimator for Distance Between Uncorrupted Covariates

This section covers a very important consequence of utilizing inverse regression. Notice that it is quite uncommon in the forward regression setting to derive proofs of unbiasedness for distance estimates in the presence of corrupted or missing covariates or features. This is primarily because few, if any, methods directly model the covariates \(x^j\). Interestingly, inverse regression explicitly characterizes \(f(x^j |\phi ^I(x^I))\), which means that we have access to \(\mathbb {E}[x^j |x^{-j}]\). Let us assume that the ‘true’ but unobserved value of the covariate is \(z^j \approx \mathbb {E}[x^j |x^{-j}]\). Since our model assumes that \(x^j\) is observed with noise, we can model the variance of \(x^j\) given \(\mathbb {E}[x^j |x^{-j}]\) using \(\sigma _{x^j|z^j}^2 = \mathbb {E}[ (x^j -\mathbb {E}[x^j |x^{-j}])^2 ]\), i.e., \(x^j \sim \mathcal {N}(z^j, \sigma _{x^j|z^j}^2 )\). This allows us to obtain a powerful “corrected” distance measure. We now have:

Proposition 1

Assume that we observe covariate values \(x^j_1\) and \(x^j_2\) for two samples with Gaussian noise around the ground truth feature values \(z^j_1\) and \(z^j_2\), i.e., \(x^j_1 \sim \mathcal {N}(z^j_1, \sigma ^2_{x^j|z^j})\) and \(x_2^j \sim \mathcal {N}(z^j_2, \sigma ^2_{x^j|z^j})\). Then, dropping the superscript j for brevity, we have

$$\begin{aligned} \mathbb {E}[(x_1 - x_2)^2 ]&= \mathbb {E}[x_1]^2+\mathbb {E}[x_2]^2 - 2 \mathbb {E}[x_1]\mathbb {E}[x_2] - 2 \text {COV}(x_1,x_2) + \text {VAR}[x_1]+\text {VAR}[x_2]\nonumber \\&= z_1^2+z_2^2-2z_1z_2+2\sigma ^2_{x|z} = (z_1-z_2)^2+2\sigma ^2_{x|z}. \end{aligned}$$
(5)

Thus, \((x_1 - x_2)^2 -2\sigma ^2_{x|z} \) is an unbiased estimator of the squared distance between the true (but unobserved) covariate values, i.e., \((z_1-z_2)^2 = \mathbb {E}[(x_1 - x_2)^2 -2\sigma ^2_{x|z} ]\).

Once we have access to \(2\sigma ^2_{x|z}\), deriving the unbiased estimate simply involves a correction. So, we obtain the corrected distances:

$$\begin{aligned} \text {d}(x_1, x_2, w_1,w_2)^2&:= \mathbb {E}_j \left[ \left( (x_1^j-x_2^j)^2 -2\sigma _j^2 \right) \right] =\frac{\sum _j \left( (x_1^j-x_2^j)^2 -2\sigma _j^2 \right) w_{x_1^j}w_{x_2^j}}{\sum _{j} w_{x^j_1} w_{x_2^j}}. \end{aligned}$$
(6)
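The corrected distance of Eq. (6) amounts to subtracting \(2\sigma _j^2\) per covariate before taking the confidence-weighted average. In the sketch below, the clamp at zero is our own numerical safeguard (individual realizations of the unbiased estimate can be negative) and is not part of the derivation.

```python
import numpy as np

def corrected_squared_distance(x1, x2, w1, w2, sigma2):
    """Eq. (6): estimate (unbiased in expectation) of the squared distance
    between uncorrupted covariates, given per-covariate noise variances
    sigma2 = sigma^2_{x^j|z^j}."""
    ww = w1 * w2
    d2 = np.sum(ww * ((x1 - x2) ** 2 - 2.0 * sigma2)) / np.sum(ww)
    return max(d2, 0.0)   # clamp is our numerical safeguard, not part of Eq. (6)
```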

4 Results and Discussion

Our method is broadly applicable, and so we show results on three different computer vision datasets, each with an associated task: (1) outdoor photo archives for temperature prediction, (2) face images for age estimation, and (3) magnetic resonance imaging (MRI) of brains for Alzheimer’s disease prediction. For temperature prediction on the Hot or Not dataset [10], we show that our algorithm can help explain why a specific prediction was made without sacrificing accuracy compared to the state-of-the-art. We use age estimation as a familiar example to demonstrate several properties of our approach, namely that our global weights (\(w_{\phi ^I}\)) and dynamic weights (\(w_{x^j}\)) are meaningful and intuitive. Finally, we show that our method can pinpoint the regions of a brain image that contribute most to Alzheimer’s disease prediction, which is valuable to clinicians.

4.1 Temperature Prediction

Hot or Not [10] consists of geo-located image sequences from outdoor webcams (see supplement). The task is to predict ambient outdoor temperature using only an image of the scene. For fair comparison, we evaluated our method on the same 10 sequences selected by [10]. Like [10], we used the first-year images for training and the second-year images for testing.

We decompose temperature \(T\) into a low-frequency component \(T_{\mathrm {lo}}\) and a high-frequency component \(T_{\mathrm {hi}}\) as in [10]. We train our algorithm to predict \(T_{\mathrm {lo}}\) and \(T_{\mathrm {hi}}\) separately, and then estimate the final temperature as \(T= T_{\mathrm {lo}} + T_{\mathrm {hi}}\). Intuitively, \(T_{\mathrm {lo}}\) is correlated with seasonal variations (e.g., the position of the sun in the sky at 11:00 am, the presence or absence of tree leaves) and \(T_{\mathrm {hi}}\) is correlated with day-to-day variations (e.g., atmospheric conditions).
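The decomposition itself only requires a fixed low-pass filter. As one plausible sketch (the exact filter of [10] may differ, so the moving-average choice and window length here are assumptions):

```python
import numpy as np

def decompose_temperature(T, window=15):
    """Split a temperature time series into a low-frequency (seasonal) part and
    a high-frequency (day-to-day) residual. The moving-average window is an
    illustrative choice; [10] may use a different filter."""
    kernel = np.ones(window) / window
    T_lo = np.convolve(T, kernel, mode="same")   # smooth seasonal trend
    T_hi = T - T_lo                              # daily deviation from the trend
    return T_lo, T_hi
```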

Glasner et al. [10] demonstrated good performance using each pixel and color channel as a separate feature. Our approach assumes a set of consistent landmarks across the image set. In principle, we could treat each pixel and color channel as a ‘landmark,’ but doing so would result in impractically slow training. Therefore, we adopt a two-level (hierarchical) approach.

We first describe our lowest-level features. Let \(z_{t} = I_{i,j,c,t}\) be the image intensity at pixel \(i,j\), color channel \(c\in \{\mathrm {red}, \mathrm {green}, \mathrm {blue}, \mathrm {gray} \}\), and time \(t\in \mathcal {T}\). Let \(T_{t}\) be the ground truth temperature at time \(t\). We omit the \(\mathrm {lo}\)/\(\mathrm {hi}\) subscript below. Each pixel produces a temperature estimate according to a simple linear model, \(\hat{T}_{i,j,c,t} = a_{i,j,c} z_{t} + b_{i,j,c}\), where \(\hat{T}_{i,j,c,t}\) is the estimated temperature at time \(t\) according to pixel \(i,j\) and channel \(c\). We compute the regression coefficients \(a^{*} = a_{i,j,c}^{*}\) and \(b^{*} = b_{i,j,c}^{*}\) by solving \(a^{*},b^{*} = \mathop {\mathrm {arg\,min}}_{a,b} \sum _{t\in \mathcal {T}} \Vert a z_{t} + b - T_{t} \Vert _{2}^{2}\). A straightforward way to produce a single prediction, \(\hat{T}_{t}\), is to combine the pixel-wise predictions using a weighted average. We form two feature vectors at each pixel: \(\mathbf {t}_{i,j,t} = [\hat{T}_{\mathrm {red}}, \hat{T}_{\mathrm {green}}, \hat{T}_{\mathrm {blue}}, \hat{T}_{\mathrm {gray}}]\), corresponding to temperature estimates, and \(\mathbf {v}_{i,j,t} = [z_{\mathrm {gray}}, g_x, g_y]\), where \(z_{\mathrm {gray}}\) is the grayscale pixel intensity and \(g_x\) and \(g_y\) are the x and y grayscale intensity gradients, respectively.
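The per-pixel fit has a closed form. The sketch below assumes the images are stacked into an array of shape (time, height, width, channels), which is our own layout choice.

```python
import numpy as np

def fit_pixelwise_linear(Z, T):
    """Fit T_t ~ a*z_t + b independently for every pixel/channel.

    Z : array (n_times, H, W, C) of pixel intensities z_t
    T : array (n_times,) of ground-truth temperatures
    Returns slopes a and intercepts b, each of shape (H, W, C).
    """
    n = Z.shape[0]
    z_mean = Z.mean(axis=0)
    t_mean = T.mean()
    cov_zt = np.einsum('thwc,t->hwc', Z - z_mean, T - t_mean) / n
    var_z = ((Z - z_mean) ** 2).mean(axis=0) + 1e-12
    a = cov_zt / var_z                 # least-squares slope per pixel/channel
    b = t_mean - a * z_mean            # intercept per pixel/channel
    return a, b
```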

We divide the image into non-overlapping \(h \times w\)-pixel patches and assign a landmark to each patch (we empirically set \(h=w=15\)). At each landmark \(k\) we construct a region covariance descriptor [13]. Specifically, for each patch \(\mathcal {P}_{k,t}\) centered at \(k\) at time \(t\) we compute two covariance matrices, \(\varSigma _{\mathbf {v}}\) and \(\varSigma _{\mathbf {t}}\), of the \(\mathbf {v}\) and \(\mathbf {t}\) vectors within the patch. The feature vector for landmark \(k\) at time \(t\) is then \(\mathbf {f}_{k,t} = {[\mathbf {\sigma }_{\mathbf {v}}, \mathbf {\sigma }_{\mathbf {t}}]}^\mathsf {T}\), where \(\mathbf {\sigma }_{\mathbf {v}}\) is a \(1 \times 6\) vector of the upper-triangular entries of \(\varSigma _{\mathbf {v}}\) and \(\mathbf {\sigma }_{\mathbf {t}}\) is a \(1 \times 10\) vector of the upper-triangular entries of \(\varSigma _{\mathbf {t}}\). We trained and tested our algorithm using the set \(\{\mathbf {f}_{k,t}\}\).
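A sketch of the descriptor construction for one landmark and time step, using per-patch covariances of the \(\mathbf {v}\) (3-dim) and \(\mathbf {t}\) (4-dim) features flattened to their upper-triangular entries (6 and 10 values); patch extraction is omitted and the array layout is assumed.

```python
import numpy as np

def upper_triangular(C):
    """Flatten the upper-triangular part (including the diagonal) of a covariance."""
    iu = np.triu_indices(C.shape[0])
    return C[iu]

def landmark_descriptor(patch_v, patch_t):
    """Region covariance descriptor [13] for one landmark at one time step.

    patch_v : (n_pixels, 3) per-pixel [gray intensity, grad_x, grad_y]
    patch_t : (n_pixels, 4) per-pixel temperature estimates (R, G, B, gray)
    Returns a 16-dimensional feature vector f_k = [sigma_v (6), sigma_t (10)].
    """
    sigma_v = upper_triangular(np.cov(patch_v, rowvar=False))   # 3x3 -> 6 values
    sigma_t = upper_triangular(np.cov(patch_t, rowvar=False))   # 4x4 -> 10 values
    return np.concatenate([sigma_v, sigma_t])
```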

Figure 2 illustrates several interesting qualitative results of our approach on the Hot or Not dataset. Table 1 provides a quantitative comparison between the accuracy of variants of our proposed approach and the accuracy of seven different estimation methods proposed by [10] on the Hot or Not dataset. The first seven rows are results reported by [10]. The bottom four rows are variants of our method. We note that, unlike Glasner et al. [10], our “Kernel Est. with \(w_\phi w_x\)” method is capable of producing time-varying (dynamic) landmark weights (see Fig. 2), which provides a meaningful and intuitive way to understand which parts of the image contribute most significantly to the temperature estimate. At the same time, the accuracy of “Kernel Est. with \(w_\phi w_x\)” is competitive, which shows that our method does not sacrifice accuracy to achieve this capability.

Fig. 2. Qualitative results on scene (a) from the Hot or Not dataset [10]: summer, late autumn, and winter. Notice that low-frequency global weights (row 1, middle column) tend to be larger around the background trees and at the edge of the foreground tree, which reflects that leaf appearance is well correlated with the season. Observe that high-frequency global weights (row 1, right column) tend to be larger on distant buildings, which reflects the intuition that daily weather variations (e.g., fog, precipitation) can dramatically change the appearance of the atmosphere, which is especially noticeable against the backdrop of distant buildings. Note that our method correctly reduces the high-frequency weights on the crane (row 3, right column), which suggests that unpredictably occluded landmarks should not contribute to the estimate (appearance temporarily becomes uncorrelated with temperature). Best viewed in color. (Color figure online)

Table 1. Accuracy of Celsius temperature prediction on Hot or Not [10]. Each cell contains two values: \(R^2\)/RMSE, where \(R^2 = 1 - \frac{\mathrm {MSE}}{\sigma ^2}\), MSE is the mean squared error of the temperature estimation, \(\sigma ^2\) is temperature variance, and RMSE is root MSE. The first seven rows are results from [10]. The bottom four rows are variants of our method. Our method produces a time-varying (dynamic) weight for each landmark, which provides a richer, more intuitive explanation of the estimation process.

4.2 Face Age Estimation

Face age estimation is a well-studied area in computer vision. For example, apparent age estimation [14] was a key topic in the 2015 Looking At People ICCV Challenge [15]. The top performers in that challenge all used a combination of deep convolutional neural networks and large training databases (e.g., \(\sim \)250k images). Given the significant engineering overhead required, we do not focus on achieving state-of-the-art accuracy using such large datasets. Instead, here we show qualitative results on a smaller age estimation dataset to illustrate several aspects of our approach. For experimentation, we used the Lifespan database [16], which has been previously used for age estimation [17] and modeling the evolution of facial landmark appearance [18].

The Lifespan database contains frontal face images with neutral and happy expressions, with ages ranging from 18 to 94 years. We used the 590 neutral expression faces with associated manually labeled landmarks from [17]. Following [17], we used five-fold cross-validation for our experiments.

Figure 3 shows the age estimates and landmark weights produced by our method. We see that certain regions of the face (e.g., eyes, mouth corners) generally received higher weights than others (e.g., nose tip). However, this is not true for all faces. For example, cosmetics can alter appearance in ways that conceal apparent age, and landmarks can be occluded (e.g., by hair or sunglasses). This implies that a globally consistent weight for each landmark is suboptimal. In contrast, our dynamic weights \(w_x\) adapt to each face instance to better handle such variations. See the supplementary material for additional results.

Fig. 3. Qualitative age estimation results on images from the Lifespan database [16]. Notice that landmarks occluded by hair are correctly down-weighted. Eye and mouth landmarks tend to have higher weight, which suggests that their appearance is more predictive of age than, for example, the nose. However, the weights at the eye and mouth corners of the 36-year-old woman (second row, first column) are very low, perhaps due to her cosmetics. Our method is not always accurate. For example, the age estimate for the 20-year-old man (second row, third column) is technically incorrect; however, his apparent age is arguably closer to the estimate than his actual age. See the supplementary material for additional results. Best viewed electronically in color. (Color figure online)

Fig. 4. Conditional confidence maps of two representative subjects from the normal control group (left) and the AD group (right). The maps are overlaid on the population mean FA map. Observe that different white matter regions play important roles in the prediction. For example, the frontal white matter is bilaterally important in the CN subject, whereas there is asymmetry in the AD subject.

4.3 Alzheimer’s Disease (AD) Classification

We further demonstrate the performance of our model on a clinically relevant task: predicting disease status from neuroimaging data. For this set of experiments we used diffusion tensor imaging (DTI) data from an Alzheimer’s disease (AD) dataset. We use fractional anisotropy (FA) maps, i.e., the normalized standard deviation of the DTI eigenvalues, as a single-channel image for deriving the feature vectors. We used standard DTI processing [19] to derive these measures over the entire white matter region of the brain for a total of 102 subjects: 44 subjects with an AD diagnosis and 58 matched normal control (CN) subjects. We defined 186 regularly-placed landmarks on the lattice of the brain volume. At each of these landmarks we derived a mean feature vector \([I, I_x, I_y, I_z]\) using a local 3D patch of size \(10\times 10\times 10\), where I is the FA value and \(I_x, I_y, I_z\) are the FA derivatives in the x, y, and z directions, respectively. Since our algorithm performs regression, we used \(\{0,2\}\) as targets for \(\{CN, AD\}\) and thresholded the prediction results at 1. Using these features we obtained a classification accuracy of 86.17% with 10-fold cross-validation. Even though our method is a regression model, this outperforms SVM with PCA on the same dataset, which yields 80%–85% [20]. The resulting conditional confidence maps (computed using \(w_{x^j}\) in Algorithm 1) for the top 20 landmarks (of the 186) for two sample subjects are shown in Fig. 4.
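As a hedged sketch of this evaluation protocol: labels are mapped to \(\{0, 2\}\), predictions are thresholded at 1, and accuracy is computed with 10-fold cross-validation. The scikit-learn kernel ridge model below is only a stand-in for our confidence-aware kernel regressor, and the feature matrix layout is assumed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge   # stand-in for our kernel estimator

def ad_classification_accuracy(X, labels, n_splits=10):
    """Regress on {0 (CN), 2 (AD)} targets, threshold predictions at 1,
    and report 10-fold cross-validated classification accuracy."""
    y = np.where(labels == 'AD', 2.0, 0.0)
    correct, total = 0, 0
    for train, test in KFold(n_splits=n_splits, shuffle=True,
                             random_state=0).split(X):
        model = KernelRidge(kernel='rbf').fit(X[train], y[train])
        pred = (model.predict(X[test]) > 1.0).astype(float) * 2.0
        correct += np.sum(pred == y[test])
        total += len(test)
    return correct / total
```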

5 Conclusions

This paper presented a statistical algorithm for identifying the conditional confidence of covariates in a regression setting. We utilized the concept of sufficient reduction within an Abundant Inverse Regression (AIR) model to obtain formulations that offer individual-level relevance of covariates. On all three applications described here, we found that, in addition to gross accuracy, the ability to explain the prediction for each test example can be valuable. Our approach comes with various properties, such as optimal weights, unbiasedness, and procedures to calculate conditional densities along only the relevant dimensions for a given regression task; these are interesting side results. Our evaluations suggest that there is substantial value in further exploring how Abundant Inverse Regression can complement current regression approaches in computer vision, offer a viable tool for interpretation/feedback, and guide the design of new methods that exploit these conditional confidence capabilities directly.