1 Introduction

Consider an experimental design setting which involves a cohort \(\mathcal {S}\) comprised of N individuals (or examples) in total. We are allowed to obtain a maximum of p measurements (or features) for each participant (or example) in \(\mathcal {S}\). Depending on the application, these p measurements may be variously interpreted — for example, in a machine learning experiment, we may have p distinct numerical preferences a user assigns to each item whereas in computer vision, the measurements may reflect p specific requests for supervision or indication on each image in \(\mathcal {S}\) [14]. In a neuroscience experiment, the cohort corresponds to individual subjects — the p measurements will denote various types of imaging and clinical measures we can acquire. Of course, independent of the application, the “cost” of measurements is quite variable: while features such as gender and age of a participant have negligible cost, requesting a user to rate an image in abstract terms, “How natural is this image on a scale of 1 to 5?”, may be more expensive. In neuroimaging, acquiring some clinical and cognitive measures is cheap, whereas certain image scans can cost several thousands of dollars [5, 6].

In the past, when datasets were smaller, these issues were understandably not very important. But as we move towards acquiring and annotating large-scale datasets in machine learning and vision [7–9], the cost implications can be substantial. For instance, if the budget for a multi-modal brain imaging study involving several different types of image scans for \(\sim \)200 subjects is \(\$3\)M+ and we know a priori which type of inference models will finally be estimated using this data, it seems reasonable to ask if “adaptive” data acquisition can bring down costs by \(25\,\%\) with negligible deterioration in statistical power. While experiment design concepts in classical statistics provide an excellent starting point, they provide little guidance on the practical technical issues one faces in addressing the question above. Outside of a few recent works [10–12], this topic is still not extensively studied within mainstream machine learning and vision.

In this paper, we study a natural form of the experimental design problem in the context of an important brain imaging application. Assume that we have access to a cohort \(\mathcal {S}\) of N subjects. In principle, we can acquire p measurements for each participant. But not all p measures are easily available — say, we start only with a default set of \(p^\prime \) measures for each subject which may be considered “inexpensive”. This yields a matrix of size \(N \times p^{\prime }\). We are also provided the remaining set of (\(p - p^{\prime }\)) measurements but only for a small subset \(\mathcal {S}^{\prime }\) of \(n^\prime \) subjects — possibly due to the associated expense of the measurement. We can, if desired, acquire these additional (\(p-p^\prime \)) measures for each individual participant in \(\mathcal {S}{ \setminus }\mathcal {S^{\prime }}\), but at a high per-individual cost. Our goal is to eventually estimate a statistical model that has high fidelity to the “true” model estimated using the full set of p measures/features for the full cohort \(\mathcal {S}\). The key question is whether we can design an adaptive query strategy that minimizes the overall cost we incur and yet provides high confidence in the parameter estimates we obtain. The problem statement is quite general and models experimental design considerations in numerous scientific disciplines including systems biology and statistical genomics where an effective solution can drive improvements in efficiency.

1.1 Related Work

There are three distinct areas of the literature that are loosely related to the development described in this paper. At a high level, perhaps the most closely related to our work is active learning, which is motivated by similar cost-benefit considerations, but in terms of minimizing the number of queries (seeking the label of an example) [13]. Here, we start with a pool of unlabeled data and pick a few examples at random to obtain their labels. Then, we repeatedly fit a classifier to the labeled examples seen so far and query the unlabeled example that is most uncertain or likely to decrease overall uncertainty. This strategy is generally successful, though it may asymptotically converge to a sub-optimal classifier [14]. Adaptive query strategies have been presented to guarantee that the hypothesis space is fully explored and to obtain theoretically consistent results [15, 16]. Much of active learning focuses on learning discriminative classifiers; while the Bayesian versions of active learning can, in principle, be applied to far more general settings, it is not clear whether such formulations can be adapted for the stratified cost structure we encounter in the motivating example above and for general parameter estimation problems where the likelihood expressions are not computationally ‘nice’.

Within the statistics literature, the problem of experiment design has a rich history going back at least four decades [17–19], and seeks to formalize how one deals with the non-deterministic nature of physical experiments. In contrast to the basic setting here, and even to data-driven measures of merit such as D-optimality [20, 21], experiment design concepts such as the Latin hypercube design [22] intentionally assume very little about the relationship between the input features and the output labels. Instead, with d features, such procedures generate a space-filling design so that each of the dimensions is divided into equal levels — the calculated configuration merely provides a selection of inputs at which to compute the output of an experiment to achieve specific goals. Despite a similar name, the goals of these ideas are quite different from ours.

Within machine learning and vision, papers related to collaborative filtering (and matrix completion) [23–26] share a number of technical similarities with the development in our work. For instance, one may assume that in a matrix of size \(N \times p\) (subjects \(\times \) measurements), the first \(p^\prime \) columns are fully observed whereas multiple rows in the remaining (\(p - p^\prime \)) columns are missing. This clearly yields a matrix completion problem; unfortunately, the setup lies far from the incoherent sampling and the matrix versions of the restricted isometry property (RIP) that make the low-rank completion argument work in practice [27, 28]. This observation has been made in recent works where collaborative filtering was generalized to the graph domain [29] and where random sampling was introduced for graphs in [30]. However, these approaches, which will serve as excellent baselines, do not exploit the band-limited nature of measurements in frequency space. Separately, matrix completion within an adaptive query setting [31, 32] yields important theoretical benefits, but so far no analogs for the graph setting exist.

The contribution of this paper is to provide a harmonic-analysis-inspired algorithm to estimate band-limited signals defined on graphs. It turns out that such solutions directly yield an efficient procedure to conduct adaptive queries for designing experiments involving stratified costs of measurements, i.e., where the first subset of measures is inexpensive whereas the second set of \((p - p^\prime )\) measures is expensive and should be requested for only a small fraction of participants. Our framework relies on the design of an efficient decoder to recover a partially observed, multi-channel band-limited signal. In order to accomplish these goals, the paper makes the following contributions.

  • (i) We propose a novel sampling and signal recovery strategy on a graph that is derived via harmonic analysis of the graph.

  • (ii) We show how a band-limited multi-variate signal on a graph can be reconstructed with only a few observations via a simple optimization scheme.

  • (iii) We provide an extensive set of experiments on two independent datasets which demonstrate that our framework works well in estimating expensive image-derived measurements based on (a) a partial set of observations (involving less expensive image-scan data) and (b) a full set of measurements on only a small fraction of the cohort.

2 Preliminaries: Linear Transforms in Euclidean and Non-Euclidean Spaces

Well-known forward/inverse signal transforms such as the wavelet and Fourier transforms (in non-Euclidean spaces) are fundamental to our proposed framework. These transforms are well understood in the Euclidean setting; however, their analogues in non-Euclidean spaces were not studied until recently [33]. We provide a brief overview of these transforms in both Euclidean and non-Euclidean spaces.

2.1 Continuous (Forward) Wavelet Transform

The Fourier transform is a fundamental tool for the frequency analysis of a signal, transforming the signal f(x) into the frequency domain as

$$\begin{aligned} \small \hat{f}(\omega ) = \langle f, e^{j\omega x} \rangle = \int f(x) \mathrm e^{-j\omega x} \mathrm d x \end{aligned}$$
(1)

where \(\hat{f}(\omega )\) is the resultant Fourier coefficient. The wavelet transform is similar to the Fourier transform, but it uses a different type of oscillating basis function (i.e., a mother wavelet). Unlike the Fourier bases (i.e., sinusoids), which have infinite support, a wavelet \(\psi \) is a localized function with finite support. One can define a mother wavelet \(\psi _{s,a}(x) = \frac{1}{s} \psi (\frac{x-a}{s})\) with scale and translation properties controlled by s and a respectively; changing s controls the dilation and varying a controls the location of \(\psi \). Using \(\psi _{s,a}\) as bases, the wavelet transform of a function f(x) yields wavelet coefficients \(\mathcal W_f(s,a)\) at scale s and location a as

$$\begin{aligned} \small \mathcal W_f(s,a) = \langle f, \psi _{s,a} \rangle = \frac{1}{s} \int f(x) \psi ^* (\frac{x-a}{s}) \mathrm {d}x \end{aligned}$$
(2)

where \(\psi ^*\) is the complex conjugate of \(\psi \) [34].
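To make the discretized computation of (2) concrete, here is a minimal numpy sketch; the Mexican-hat mother wavelet and the toy signal are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def mexican_hat(t):
    # A real-valued mother wavelet (second derivative of a Gaussian), chosen for simplicity.
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(f, x, scales, shifts, psi=mexican_hat):
    """Discretization of Eq. (2): W_f(s, a) = (1/s) * sum_x f(x) * conj(psi((x - a)/s)) * dx."""
    dx = x[1] - x[0]
    W = np.zeros((len(scales), len(shifts)), dtype=complex)
    for i, s in enumerate(scales):
        for j, a in enumerate(shifts):
            W[i, j] = (1.0 / s) * np.sum(f * np.conj(psi((x - a) / s))) * dx
    return W

# Toy usage: a signal whose local frequency increases with x,
# analyzed at a few scales and shift locations.
x = np.linspace(0.0, 10.0, 1000)
f = np.sin(2.0 * np.pi * x**1.5)
W = cwt(f, x, scales=[0.2, 0.5, 1.0], shifts=x[::100])
```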

Interestingly, \(\psi _s\) is localized not only in the original domain but also in the frequency domain, where it behaves as a band-pass filter covering different bandwidths corresponding to the scales s. These band-pass filters do not cover the low-frequency components; therefore, an additional low-pass filter \(\phi \), a scaling function, is typically introduced. A transform with the scaling function \(\phi \) results in a low-pass filtered representation of the original function f. In the end, filtering with the wavelet at multiple scales s offers a multi-resolution view of the given signal.

2.2 Wavelet Transform in Non-Euclidean Spaces

Defining a wavelet transform in Euclidean space is convenient because of the regularity of the domain (i.e., a regular lattice). In this case, one can easily define the shape of a mother wavelet in the context of an application. However, in non-Euclidean spaces (e.g., graphs that consist of a set of vertices and edges with arbitrary connections), implementing a mother wavelet becomes difficult because dilation and translation are ambiguous. Because of these issues, the classical definition of the wavelet transform was not suitable for analyzing data in non-Euclidean spaces until recently, when [33, 35] proposed wavelet and Fourier transforms for such spaces.

The key idea in [33] for constructing a mother wavelet \(\psi \) on the nodes of a graph is simple. Instead of defining it in the original domain, where the properties of \(\psi \) are ambiguous, we define a mother wavelet in a dual domain where its representation is clear and then transform it back to the original domain. The core ingredients for such a construction are (1) a set of “orthonormal” bases that provide the means to transform a signal between a graph and its dual domain (i.e., an analogue of the frequency domain) and (2) a kernel function h() that behaves as a band-pass filter determining the shape of \(\psi \). Utilizing these ingredients, a mother wavelet is first constructed as a kernel function in the frequency domain and then localized in the original domain using a \(\delta \) function and the orthonormal bases. Such an operation will implement a mother wavelet \(\psi \) on the original graph. Defining a kernel function in the 1-D frequency domain is simple, and one can rely on spectral graph theory to obtain the orthonormal bases of a graph [33], which can be used for the graph Fourier transform.

A graph \(\mathcal G = \{\mathcal V, \mathcal E\}\) is formally defined by a vertex set \(\mathcal V\) with N vertices and an edge set \(\mathcal E\) with edges that connect the vertices. Such a graph is generally represented by an adjacency matrix \(\mathcal A_{N \times N}\) where each element \(a_{ij}\) denotes the connection between the ith and jth vertices by a corresponding edge weight. Another matrix that summarizes the graph, the degree matrix \(\mathcal D_{N \times N}\), is a diagonal matrix whose ith diagonal is the sum of the edge weights connected to the ith vertex. The graph Laplacian is then defined from these two matrices as \(\mathcal L = \mathcal D - \mathcal A\), which is a self-adjoint and positive semi-definite operator. The matrix \(\mathcal L\) can be decomposed into pairs of eigenvalues \(\lambda _l \ge 0\) and corresponding eigenvectors \(\chi _l\) where \(l = 0,1,\cdots , N-1\). The orthonormal bases \(\chi \) can be used as analogues of the Fourier bases in Euclidean space to define the graph Fourier transform of a function f(n) on the vertices n as

$$\begin{aligned} \small \hat{f} (l) = \sum _{n=1} ^{N} \chi ^* _l (n) f(n) ~~~\text{ and }~~~ f (n) = \sum _{l=0} ^{N-1} \hat{f}(l) \chi _l (n) \end{aligned}$$
(3)

where the forward transform yields the graph Fourier coefficient \(\hat{f}(l)\) and the inverse transform reconstructs the original function f(n). If the signal f(n) lies in the spectrum of the first k eigenvectors \(\chi _l\) in the dual space, we say that f(n) is k band-limited. Just as with the conventional Fourier transform, the graph Fourier transform offers a mechanism to transform a signal on graph vertices back and forth between the original and frequency domains.
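As a minimal sketch of (3) and of the Laplacian eigendecomposition that produces the bases \(\chi \), assuming a dense symmetric adjacency matrix and real eigenvectors (so that \(\chi ^* = \chi \)):

```python
import numpy as np

def graph_fourier_basis(A):
    """Eigen-decomposition of the combinatorial graph Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    lam, chi = np.linalg.eigh(D - A)   # lam[0] <= ... <= lam[N-1]; chi[:, l] is chi_l
    return lam, chi

def gft(f, chi):
    """Forward graph Fourier transform of Eq. (3): f_hat(l) = sum_n chi_l(n) f(n)."""
    return chi.T @ f

def igft(f_hat, chi):
    """Inverse graph Fourier transform of Eq. (3): f(n) = sum_l f_hat(l) chi_l(n)."""
    return chi @ f_hat

# A signal f is (approximately) k band-limited when gft(f, chi)[k:] is (close to) zero,
# i.e., when f lies in the span of the first k eigenvectors chi[:, :k].
```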

Fig. 1.

Examples of basis functions on a graph. (a) Cat shaped graph, (b) A graph Fourier basis \(\chi _2\), (c) Graph wavelet bases \(\psi _1\) at two different locations (ear and paw), (d) Graph wavelet basis \(\psi _4\) as in (c). Notice that the wavelet bases in (c) and (d) are localized while \(\chi _2\) is spread all over the mesh.

Using the graph Fourier transform, a mother wavelet \(\psi \) is implemented by first defining a kernel function h() and then localizing it by a Dirac delta function \(\delta _n\) in the original graph through the inverse graph Fourier transform. Since \(\langle \delta _n, \chi _l \rangle = \chi _l^*(n)\), the mother wavelet \(\psi _{s,n}\) at vertex n at scale s is defined as

$$\begin{aligned} \small \psi _{s,n}(m) = \sum _{l=0} ^{N-1} h(s \lambda _l) \chi _l ^* (n) \chi _l (m). \end{aligned}$$
(4)

Here, using the scaling property of the Fourier transform [36], the scale s can be defined as a parameter in the kernel function h() independently of the bases \(\chi \). Representative examples of a graph Fourier basis and graph wavelet bases are shown in Fig. 1. A cat shaped graph is given in Fig. 1(a), and one of its graph Fourier bases, \(\chi _2\), is shown in (b). Graph wavelets at two different scales (i.e., dilations) and two different locations (ear and paw) are shown in Fig. 1(c) and (d). Notice that \(\chi _2\) in Fig. 1(b) is diffused all over the graph, while the wavelet bases in (c) and (d) are localized with finite support.

Once the bases \(\psi \) are defined, the wavelet transform of a function f on graph vertices at scale s follows the classical definition of the wavelet transform:

$$\begin{aligned} \small \mathcal W_f(s,n) = \langle f, \psi _{s,n} \rangle = \sum _{l=0} ^{N-1} h(s\lambda _l) \hat{f}(l) \chi _l (n) \end{aligned}$$
(5)

resulting in wavelet coefficients \(\mathcal W_f (s,n)\) at scale s and location n. This transform offers a multi-resolution view of signals defined on graph vertices through multi-scale filtering. Our framework, to be described shortly, will utilize the definition of the mother wavelet in (4) for the data sampling strategy on graphs as well as the graph Fourier transform for signal recovery.
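The following sketch implements (4) and (5) on top of the graph Fourier sketch above; the band-pass kernel \(h(x) = x e^{-x}\) is only a placeholder (the kernel design of [33] or a Meyer-type kernel could be substituted), and a single-channel signal and real eigenvectors are assumed.

```python
import numpy as np

def band_pass(x):
    # Placeholder band-pass kernel: vanishes at x = 0 and decays for large x.
    return x * np.exp(-x)

def graph_wavelet(lam, chi, s, n, h=band_pass):
    """Mother wavelet of Eq. (4): psi_{s,n}(m) = sum_l h(s*lam_l) chi_l(n) chi_l(m)."""
    return chi @ (h(s * lam) * chi[n, :])

def graph_wavelet_transform(f, lam, chi, s, h=band_pass):
    """Wavelet coefficients of Eq. (5): W_f(s, n) = sum_l h(s*lam_l) f_hat(l) chi_l(n)."""
    f_hat = chi.T @ f                  # graph Fourier coefficients of a length-N signal f
    return chi @ (h(s * lam) * f_hat)

# lam, chi are the outputs of graph_fourier_basis() from the previous sketch;
# since chi is real here, chi_l^* = chi_l.
```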

3 Adaptive Sampling and Signal Recovery on Graphs

Suppose there exists a band-limited signal (of p channels/features) defined on graph vertices, and we can observe it on only a few of the vertices in the graph. Our goal is to estimate the entire signal using only these partial observations. Since the signal is band-limited, we do not need to sample every location in the native domain (i.e., at the Nyquist rate). Unfortunately, we do not have equally powerful sampling theorems for graphs. In this regime, in order to recover the original signal, we need an efficient sampling strategy for the data. In the following, we describe how the vertices should be selected for accurate recovery of the band-limited signal and propose a novel decoder, working in a dual space, that is more efficient than alternative techniques.

3.1 Graph Adaptive Sampling Strategy

In order to derive a random sampling of the data measurements on a graph (i.e., signal measurements on vertices), we first need to assign a probability distribution \(\mathsf p\) on the graph nodes. This probability tells us which vertices are more likely to be sampled for measurements, and it needs to satisfy the definition of a probability distribution, i.e., \(\sum _{n=1} ^N \mathsf p (n) = 1\) where \(\mathsf p>0\). The construction of \(\mathsf p\) is based on how the energy spreads over the graph vertices, given the graph structure. The intuition is that, with a limited number of bases, a given signal is easier to reconstruct at some vertices than at others, and prioritizing those vertices for sampling will yield a better estimate of the original signal.

In order to define the probability distribution \(\mathsf p\) over the vertices, we make use of the eigenvalues and eigenvectors from spectral graph theory to describe the energy propagation on the graph. In [30], the authors measure how well a \(\delta _n\) can be reconstructed at a vertex n with the first k eigenvectors and normalize these quantities to construct a probability distribution as

$$\begin{aligned} \small \mathsf p(n) = \frac{1}{k} ||V_k^T \delta _n||_2 ^2 = \frac{1}{k} \sum _{l=0}^{k-1} \chi _l(n)^2 \end{aligned}$$
(6)

where \(V_k = [\chi _0 ~ \cdots ~ \chi _{k-1}]\) is the matrix whose columns are the first k eigenvectors. Their solution puts the same weight on each eigenvector to compute the distribution, assuming that the signal is uniformly distributed in the k-band (i.e., the spectrum of the first k eigenvectors). Such a strategy uses the graph Fourier bases to reconstruct a delta function, which is typically not desirable in many applications since Fourier bases suffer from ringing artifacts. Moreover, in many cases the signal may be localized even within the k-band, which necessitates scaling (i.e., filtering) the signal at multiple scales in the frequency domain.

Interestingly, it turns out that the definition of \(\mathsf p\) above can be viewed entirely via the non-Euclidean wavelet expansion described in Sect. 2. Recall that a mother wavelet \(\psi _{s,n}\) is implemented by localizing a wavelet operation at scale s as in (4); it can be viewed as a unit of energy propagating from n to neighboring vertices as a diminishing wave. Looking at \(\psi _{s,n}(n)\), the self-effect of the mother wavelet at vertex n is written as

$$\begin{aligned} {\small \psi _{s,n}(n) = \sum _{l=0} ^{N-1} h(s \lambda _l) \chi _l (n)^2. } \end{aligned}$$
(7)

At a high level, (7) tells us how much of the unit energy is maintained at n itself at scale s. Notice that (7) is a kernelized version of (6) using a kernel function h(). Depending on the design of h(), we may interpret it as a robust graph-based signature such as the heat kernel signature (HKS) [37], wave kernel signature (WKS) [38], global point signature (GPS) [39] or wavelet kernel descriptor (WKD) [40], which were introduced in the computer vision literature for detecting interest points on graphs and for mesh segmentation.
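A small sketch of the self-energy in (7), evaluated at every vertex at once; with \(h(x) = e^{-x}\) it reduces (up to the scale parameterization) to an HKS-style descriptor. Here lam and chi are again the outputs of graph_fourier_basis() from the earlier sketch.

```python
import numpy as np

def self_energy(lam, chi, s, h):
    """Eq. (7) for all vertices at once: psi_{s,n}(n) = sum_l h(s*lam_l) * chi_l(n)^2."""
    return (chi**2) @ h(s * lam)

# An HKS-like descriptor at scale s (an interpretation, not the exact definition of [37]):
# hks_s = self_energy(lam, chi, s, h=lambda x: np.exp(-x))
```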

Fig. 2.

Sampling probability distribution \(\mathsf p_s\) at different scales, derived from the “Meyer” wavelet on the Minnesota graph. Left: at scale \(s=1\), Middle: at scale \(s=2\), Right: at scale \(s=3\).

Our idea is to make use of the wavelet expansion to define a probability distribution at scale s as

$$\begin{aligned} \small \mathsf p_s(n) = \frac{1}{Z_s} \psi _{s,n}(n) = \frac{1}{Z_s} \sum _{l=0} ^{N-1} h(s \lambda _l) \chi _l (n)^2 \end{aligned}$$
(8)

where \(Z_s = \sum _{n=1}^N \psi _{s,n}(n)\). Then \(\mathsf p_s\) is used as the sampling probability distribution which drives how we adaptively query the measurements at the unobserved vertices. Depending on the application, h() can be designed as any of the known wavelet filters such as the Morlet, Meyer, or difference of Gaussians (DOG) kernels. Examples of \(\mathsf p_s\) using the Meyer wavelet are shown in Fig. 2.
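Building on the sketch above, (8) simply normalizes the self-energy into a distribution, which is then used to draw the m vertices to query. The kernel and the rng seed below are illustrative.

```python
import numpy as np

def sampling_distribution(lam, chi, s, h):
    """Eq. (8): p_s(n) = psi_{s,n}(n) / Z_s, where Z_s sums the self-energy over all vertices."""
    e = (chi**2) @ h(s * lam)   # Eq. (7) at every vertex
    return e / e.sum()

def sample_vertices(p_s, m, rng=None):
    """Draw m distinct vertices with probabilities p_s (sampling without replacement)."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.choice(len(p_s), size=m, replace=False, p=p_s)

# e.g., with the placeholder band-pass kernel from the wavelet sketch:
# p1 = sampling_distribution(lam, chi, s=1.0, h=lambda x: x * np.exp(-x))
# omega = sample_vertices(p1, m=int(0.2 * len(p1)))
```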

Our formulation in (8) is especially useful when we know the distribution of \(\lambda \) prior to the analysis, since we can impose higher weights on the band where the signal is concentrated. We also work with only k eigenvectors when a full diagonalization of \(\mathcal L\) is expensive. We will see that this observation is important in the next section, where we utilize the low dimensional space spanned by the k eigenvectors for an efficient solver, while other methods require the full eigenspectrum.

3.2 Recovery of a Band-Limited Signal in a Dual Space

Consider a setting where we observe only a partial signal \(y \in \mathbb R^{m \times p}\) of a full signal \(f \in \mathbb R^{N \times p}\) where \(m\ll N\), and our goal is to recover the original signal f given y. Suppose that our budget allows querying m vertices (to acquire measurements) in the setup phase. Let the locations where we observe the signal be denoted as \(\varOmega = \{\omega _1, \cdots , \omega _m\}\) yielding \(y(i) = f(\omega _i),~~\forall i \in \{1, 2, \cdots , m\}\). Now the question is how \(\varOmega \) should be selected for optimal (or high fidelity) recovery of f. Our framework uses the strategy described in Sect. 3.1 to sample the data according to the sampling probability. Based on the m samples (observations), we can build a projection operator \(M_{m \times N}\) (i.e., a sampling matrix) yielding \(Mf = y\) as

$$\begin{aligned} M_{i,j} = {\left\{ \begin{array}{ll} 1~~~ \text {if} ~~~j = \omega _i \\ 0~~~ \text {o.w.} \end{array}\right. } \end{aligned}$$
(9)

Using the ideas described above, a typical decoder would solve for an estimate g of the original signal f using a convex problem as

$$\begin{aligned} \small g^* = \arg \min _{g \in \mathbb R^N} || \mathcal P_{\varOmega } ^ {-\frac{1}{2}} (Mg-y) ||_{2}^2 + \gamma g^{T} h(\mathcal L) g \end{aligned}$$
(10)

where \(\mathcal P_{\varOmega } = \mathrm {diag} (\mathsf p(\varOmega ))\) and \(h(\mathcal L) = \sum _{l=0}^{N-1} h(\lambda _l) \chi _l \chi _l^T\). The formulation above prioritizes minimizing the error between the estimate and the observations at the sampled locations (with weights \(\frac{1}{\sqrt{\mathsf p_{\varOmega }}}\)), while the remaining missing elements are filled in by the regularizer, which encourages graph smoothness. Such a recovery, explained in [30], has three weaknesses. (1) It does not take into account whether the recovered signal is band-limited. (2) The main objective (i.e., the first term) in (10) implies that it does not matter whether the estimated elements at the unsampled locations are correct. (3) Finally, the analytic solution to the above problem is not easily obtainable without the regularizer or when the regularizer is not full rank. This becomes computationally problematic in real cases when the given graph is large, since the filtering operation in (10) requires a full eigendecomposition of the graph Laplacian \(\mathcal L\).

Fig. 3.

A toy example of our framework on a cat mesh (\(N=3400\)). (a) Band-limited random signal in [0, 1] with noise, (b) Sampling probability \(p_1\) derived from (8), (c) Sampled signal at \(m=340\) locations out of 3400, (d) Recovered signal using our method with only \(k=50\).

To deal with the problems above, we propose to encode the band-limited nature of the recovered signal as a constraint. Our framework solves for a solution to (10) entirely in a dual space by projecting the problem to a low dimensional space where we search for a solution of size \(k \ll N\).

Let \(\hat{g}(l) = \sum _{n=1}^N g(n)\chi _l(n)\) be the graph Fourier transform of a function g and let \(\hat{g}_k\) denote the vector of its first k coefficients. Reformulating the model in (10) using \(g = V_k \hat{g}_k\) (assuming that g is k band-limited) yields

$$\begin{aligned} \small \hat{g}_k^* = \arg \min _{\hat{g}_k \in \mathbb R^k} || \mathcal P_{\varOmega } ^ {-\frac{1}{2}}(MV_k \hat{g}_k -y) ||_{2}^2 + \gamma (V_k \hat{g}_k)^{T} h(\mathcal L) V_k \hat{g}_k. \end{aligned}$$
(11)

An analytic solution to this problem can be obtained by taking the derivative of (11) and setting it to 0. The optimal solution \(\hat{g}_k^*\) must satisfy the condition

$$\begin{aligned} \small (V_k ^TM^T \mathcal P_\varOmega ^{-1} M V_k + \gamma V_k ^Th(\mathcal L)V_k)\hat{g}_k^* = V_k^TM^T \mathcal P_\varOmega ^{-1} y \end{aligned}$$
(12)

which reduces to

$$\begin{aligned} \small (V_k ^TM^T \mathcal P_\varOmega ^{-1} M V_k + \gamma h(\varLambda _k))\hat{g}_k^* = V_k^TM^T \mathcal P_\varOmega ^{-1} y \end{aligned}$$
(13)

where \(\varLambda _k\) is the \(k \times k\) diagonal matrix whose diagonal entries are the first k eigenvalues of \(\mathcal L\). Using the optimal \(\hat{g}_k^*\), we can easily recover a low-rank estimate \(g^* = V_k \hat{g}_k^*\) that reconstructs f. Notice that we only need to find a solution of much smaller dimension, which is significantly more efficient. Moreover, the filtering operation h() in the regularizer in (12) becomes much simpler, and the solution natively maintains the k band-limited property of the original signal.
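A minimal numpy sketch of the closed-form recovery in (13), assuming real eigenvectors; only the first k eigenpairs are actually needed, and the sampling matrix M of (9) is applied implicitly by row indexing.

```python
import numpy as np

def recover_bandlimited(y, omega, lam, chi, k, h, p, gamma=0.01):
    """
    Solve Eq. (13):
        (V_k^T M^T P_Omega^{-1} M V_k + gamma * h(Lambda_k)) g_hat_k = V_k^T M^T P_Omega^{-1} y
    and return the low-rank estimate g* = V_k g_hat_k of the full N x p signal.
    y     : m x p array of observed rows, y[i] = f(omega[i])
    omega : indices of the m sampled vertices
    p     : sampling distribution over all N vertices (P_Omega = diag(p[omega]))
    """
    Vk = chi[:, :k]                   # first k eigenvectors
    MVk = Vk[omega, :]                # M V_k: rows of V_k at the sampled vertices
    w = 1.0 / p[omega]                # diagonal of P_Omega^{-1}
    A = MVk.T @ (w[:, None] * MVk) + gamma * np.diag(h(lam[:k]))
    b = MVk.T @ (w[:, None] * y)
    g_hat_k = np.linalg.solve(A, b)   # solution of size k << N
    return Vk @ g_hat_k
```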

A toy example demonstrating this idea is shown in Fig. 3. Given a cat mesh with \(N=3400\) vertices, we first define a random signal \(f \in [0,1]\) that is band-limited in the spectrum of \(\mathcal L\), with additive Gaussian noise N(0, 0.1). We take \(\mathsf p_1\) as the sampling distribution and sample \(m=340\) (\(10\,\%\) of the total) vertices without replacement. Our estimate g using only \(k=50\) bases is shown in Fig. 3(d), where the error between the true f and g is extremely small despite using so little data to begin with. We can also see that our method is robust to noise.
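Putting the sketches together, the following toy run mirrors the spirit of Fig. 3 on a random geometric graph (the cat mesh itself is not available here); all sizes, kernels, and thresholds are illustrative, and graph_fourier_basis, sampling_distribution, sample_vertices and recover_bandlimited are the sketches defined earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, m = 500, 20, 100                              # vertices, bandwidth, samples (~20%)

# Random geometric graph with RBF edge weights between nearby points.
pts = rng.random((N, 2))
d2 = ((pts[:, None, :] - pts[None, :, :])**2).sum(-1)
A = np.exp(-d2 / 0.01) * (d2 < 0.02)
np.fill_diagonal(A, 0.0)

lam, chi = graph_fourier_basis(A)                   # sketch from Sect. 2.2
f = chi[:, :k] @ rng.random((k, 1))                 # exactly k band-limited, single channel
f_noisy = f + 0.1 * rng.standard_normal(f.shape)    # additive Gaussian noise (std 0.1, illustrative)

band_pass = lambda x: x * np.exp(-x)
p1 = sampling_distribution(lam, chi, s=1.0, h=band_pass)
omega = sample_vertices(p1, m, rng)
g = recover_bandlimited(f_noisy[omega], omega, lam, chi, k, band_pass, p=p1)
print("relative recovery error:", np.linalg.norm(g - f) / np.linalg.norm(f))
```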

4 Experiment Design in Neuroimaging

In this section, we present proof-of-principle experimental results on two different neuroimaging studies: (1) the Human Connectome Project (HCP) dataset and (2) the Wisconsin Registry for Alzheimer’s Prevention (WRAP) dataset. In both studies, we demonstrate the performance of our method in estimating expensive neuroimage-derived measurements at regions of interest (ROIs) in the brain using (1) the set of \(p^{\prime }\) less expensive measures (out of all p measures) available for the full cohort \(\mathcal {S}\) of N subjects and (2) the set of (\(p-p^{\prime }\)) expensive measures available for a small cohort subset \(\mathcal {S}^{\prime }\) of m subjects. Given these datasets, the goal of the experiments is to see whether we can obtain accurate estimates of the (\(p-p^{\prime }\)) expensive measures for the full cohort \(\mathcal {S}\) of N subjects such that statistical power for the follow-up analysis is not greatly compromised.

4.1 Experimental Setup

We compare the performance of our method with two other state-of-the-art methods: (1) collaborative filtering by Rao et al. [29] and (2) random sampling of band-limited signals by Puy et al. [30]. For all three methods: (a) We derived adjacency matrices \(\mathcal A\) using data from the full set \(\mathcal {S}\) of N samples and the \(p^{\prime }\) economical measures (i.e., more widely available and/or less expensive modalities) with the radial basis function \(\exp (-||x-y||^2/\sigma ^2)\). We then constructed the normalized graph Laplacians \(\mathcal L=\mathcal D^{-1/2}(\mathcal D-\mathcal A)\mathcal D^{-1/2}\) used in our framework. (b) We set \(h(\lambda _l) = \lambda _l^4\) for the filtering operation \(h(\mathcal L)\) in the regularizer and set \(\gamma = 0.01\) in (11). (c) We show estimation results for the (\(p-p^{\prime }\)) expensive measures using \(R \in \{20, 40, 60\}\,\% \) of the total N samples for both studies and assess the \(\ell _2\)-norm error between the estimated and observed measures. Because of the stochastic nature of the sampling step, we ran the estimation 100 times and use the average of the corresponding errors for comparisons. In addition, we also compare the predicted values of the (\(p-p^{\prime }\)) neuroimaging measures at each ROI (averaged across subjects) against the true values and the estimates of the two baseline methods. For example, given a cohort of \(N=100\) subjects, suppose we have full data for the \(p^\prime =10\) low-cost measurements. Then, the goal is to acquire the \(p-p^{\prime } = 5\) remaining measurements on only \(m = 20\) subjects (i.e., 20 % of the cohort) and estimate these (\(p-p^{\prime }\)) measurements on the remaining \(N-m\) subjects.
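As a concrete reading of steps (a) and (b), here is a sketch of the graph construction and parameter choices described above (the function name is ours, and any standardization or sparsification details not stated in the text are omitted):

```python
import numpy as np

def build_normalized_laplacian(X_cheap, sigma):
    """RBF adjacency on the p' inexpensive measures, then L = D^{-1/2} (D - A) D^{-1/2}."""
    d2 = ((X_cheap[:, None, :] - X_cheap[None, :, :])**2).sum(-1)
    A = np.exp(-d2 / sigma**2)
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (np.diag(d) - A) @ D_inv_sqrt
    lam, chi = np.linalg.eigh(L)
    return lam, chi

h = lambda lam: lam**4    # filter in the regularizer, h(lambda_l) = lambda_l^4
gamma = 0.01              # regularization weight in (11)
# HCP: sigma = 5, k = 100;  WRAP: sigma = 3, k = 50  (as reported in the subsections below)
```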

4.2 Prediction on the Human Connectome Project

Dataset. The diffusion weighted MR images (DW-MRI) from HCP [42] were acquired on custom built hardware using advanced pulse sequences [43] with a lengthy scan time (\(\sim \)1 h). This allows estimating microstructural properties of the brain and accurate reconstruction of the white matter pathways [44] (e.g., see Fig. 4), which form a crucial component in mapping the structural connectome of the human brain [45–48]. Such a DW-MRI acquisition is typically not feasible at many research sites due to limitations of hardware and software. On the other hand, the set of non-imaging measurements is cheaper and easier to acquire. Hence, the ability to predict such high quality diffusion metrics (e.g., fractional anisotropy (FA)) from only a small sample of the DW-MRI scans together with the non-imaging measurements has value. HCP provides a broad set of non-imaging covariates for the subjects [49] spanning several different categories. (The full list of covariates is given in the appendix.) We demonstrate the performance of our model on the task of FA prediction in 17 widely studied fiber bundles (shown in Fig. 4) [41, 50] using 27 variables related to cognition, demographics, education and so on.

Fig. 4.

Top: The 17 major white matter pathways analyzed in the HCP study [41], Bottom: ROIs and measures analyzed in the WRAP study (Left: A sample FA map and the 162 gray matter ROIs for DTI, Right: Sample \(^{11}\)C PiB DVR map and the 16 gray matter ROIs).

Fig. 5.

Sampling ratio versus error plot (left) on the HCP dataset (dashed lines) and the WRAP dataset (solid lines). The corresponding values are in the table on the right. (Color figure online)

Results. Given the full cohort \(\mathcal {S}\) of \(N=487\) subjects from the HCP dataset with the selected \(p^{\prime }=27\) low-cost covariates, we recovered the high-cost FA measures in \(p-p^\prime =17\) ROIs (i.e., pathways) using the \(p^{\prime }\) covariates and the FA values from \(m \ll N\) participants. The \(p^{\prime }\) measures were used to construct \(\mathcal {L}\) with \(\sigma =5\) and \(k=100\) for generating the sampling distribution \(\mathsf {p}\) in our framework. We analyzed three cases by sampling 20 %, 40 %, and 60 % of the total population according to \(\mathsf p\) to obtain the m observations and predict FA for all N subjects.

Fig. 6.

Distribution of mean errors over the ROIs from 100 runs using 20 % (left column), 40 % (middle column) and 60 % (right column) samples on the HCP dataset (top row) and the WRAP dataset (bottom row). Ours (red) shows lower errors than Puy et al. (green) and Rao et al. (blue). (Color figure online)

Fig. 7.

Spherical representations of the prediction errors (\(\ell _2\)-norm) in the HCP study (top) and in the WRAP study (bottom). Left: errors using Ours, Middle: errors using Puy et al., Right: errors using Rao et al. The spheres are centered at the center-of-mass of the specific bundle/regional volumes, and the radii of the spheres are proportional to the prediction errors. (Color figure online)

Figure 5 (dashed lines) summarizes the overall estimation errors using \(R=\{20, 40, 60\} \%\) samples of the total population. For all three methods, the errors decreased with an increase in sample size, and our method (red) consistently outperformed the other two methods (blue and green). When we look at the distribution of errors, shown in the top row of Fig. 6, the center of the error distribution using our framework (red) is far lower than those of the other methods (blue and green). Anatomical specificity of the estimated measures (using 40 % samples) is illustrated in the top panel of Fig. 7, where the locations of the spheres represent the positions of the ROIs and their sizes and colors correspond to the mean errors. As seen in Fig. 7, our method (top-left) clearly has smaller and bluer spheres compared to the other methods (middle and right). The quantitative errors for the individual ROIs used for the spheres are provided in the left table of Table 1, and the predicted FA values for all ROIs (averaged across subjects) are presented in Fig. 8. For all 17 FA measures, with \(40\,\%\) sampling, we see that our results (blue) are closest to the ground truth (red) while the other methods under- or over-estimate. (Additional results are shown in the supplement.) When the \(\ell _2\)-norm error is small, we expect that results from downstream statistical analyses (e.g., p-values) will be accurate, since the distributions of the estimated measurements are closer to the true sample distribution.

4.3 Prediction on a Preclinical Alzheimer’s Disease Project

Dataset. Alzheimer’s disease (AD) is known as a disconnection syndrome [51, 52] because connectivity disruption can impede functional communication between brain regions, resulting in reduced cognitive performance [53, 54]. Currently, positron emission tomography (PET) using radio-chemicals such as \(^{11}\)C Pittsburgh compound B (PiB) is important in mapping AD pathology. Distribution volume ratios (DVR) of PiB in the brain offer a good measure of the plaque pathology which is considered specific to AD. Unfortunately, these PET scans are costly and involve lengthy procedures. The WRAP dataset consists of participants in preclinical stages of AD [54, 55] and contains 140 samples with both low-cost FA measures and high-cost PiB DVR measures (examples shown in Fig. 4). Utilizing the FA values over the entire set of subjects and a partial observation of the PiB measures from a fraction of the population, we investigate the performance of our model for the recovery of PiB measures.

Remark. From a neuroimaging perspective, predicting PiB measures accurately enough for actual scientific analysis is problematic. Utilizing another modality (e.g., cerebrospinal fluid measures) would be more appropriate for predicting PiB measures, and such results are available on the project homepage. The results below demonstrate that this prediction task yields numerically better estimates than the baseline strategies, although they are not directly deployable for neuroscientific studies.

Table 1. Region-wise mean \(\ell _2\)-norm errors over 100 runs for HCP-FA (left) and PiB DVR (right) with 40 \(\%\) samples. Errors from our method are the lowest and are shown in bold.

Results. For this set of experiments, we selected the \(p^{\prime }=17\) pathways with the most reliable FA measures to construct a graph with \(N=140\) vertices (i.e., subjects). Utilizing the graph and a partial set of PiB DVR measurements from \(m \ll N\) participants (20 %, 40 % and 60 % of the total population), we predicted the expensive PiB DVR values in 16 ROIs for all subjects. To define \(\mathcal L\) and \(\mathsf p\), we used \(\sigma =3\) and \(k=50\). As shown by the solid lines in Fig. 5, our estimation (red) yields the smallest error compared to [30] (green) and [29] (blue) for all three sampling cases. The bottom row of Fig. 6 shows that the centers of the error distributions using our algorithm (red) are lower than those of the other methods (green and blue). As seen in the bottom panel of Fig. 7, similar to the HCP results in Sect. 4.2, we observe smaller errors in every ROI; the actual region-wise errors are given in the right table of Table 1. Figure 8 presents the predicted regional PiB DVR values against the ground truth, where our predictions (blue) are consistently closer to the ground truth (red). Additional results using 20 % and 60 % of the subjects are presented in the appendix.

Fig. 8.

Average estimates of the HCP-FA in the fiber bundles (top) and the WRAP PiB DVRs (bottom) using 40 % samples. For each measurement, the bars from left to right show the ground truth, ours, Puy et al., and Rao et al. Ours most closely estimates the actual ground truth values for all the measurements. (Color figure online)

5 Conclusion

In this paper, we presented an adaptive sampling scheme for signals defined on a graph. Using a dual space of these measurements obtained via a non-Euclidean wavelet transform, we showed how signals can be recovered with high fidelity from a stratified set of partial observations on the nodes of a graph. We demonstrated the application of this core technical development to accurately estimating diffusion imaging and PET imaging measures in two independent neuroimaging studies, so that one can perform standard analyses just as if the measurements were acquired directly. We presented experimental results demonstrating that our framework can provide accurate recovery using observations from only a small fraction of the full samples. We believe that this ability to estimate unobserved data based on a partial set of measurements can have impact in numerous computer vision and machine learning applications where the acquisition of large datasets often involves varying degrees of stratified human interaction. Many real experiments involve entities that have intrinsic relationships best captured as a graph. Mechanisms to exploit the properties of these graphs using formulations similar to those presented in this work may have important practical and immediate ramifications for experimental design considerations in numerous scientific domains.