Knowledge Elicitation via Sequential Probabilistic Inference for High-Dimensional Prediction

Prediction in a small-sized sample with a large number of covariates, the "small n, large p" problem, is challenging. This setting is encountered in multiple applications, such as precision medicine, where obtaining additional samples can be extremely costly or even impossible, and extensive research effort has recently been dedicated to finding principled solutions for accurate prediction. However, a valuable source of additional information, domain experts, has not yet been efficiently exploited. We formulate knowledge elicitation generally as a probabilistic inference process, where expert knowledge is sequentially queried to improve predictions. In the specific case of sparse linear regression, where we assume the expert has knowledge about the values of the regression coefficients or about the relevance of the features, we propose an algorithm and computational approximation for fast and efficient interaction, which sequentially identifies the most informative features on which to query expert knowledge. Evaluations of our method in experiments with simulated and real users show improved prediction accuracy already with a small effort from the expert.


INTRODUCTION
Datasets with a small number of samples n and a large number of variables p are nowadays common. Statistical learning, for example regression, in these kinds of problems is ill-posed, and it is known that statistical methods have limits in how low in sample size they can go [1]. A lot of recent research in statistical methodology has focused on finding different kinds of solutions via well-motivated trade-offs in model flexibility and bias. These include strong assumptions about the model family, such as linearity, low rank, sparsity, meta-analysis and transfer learning from related datasets, efficient collection of new data via active learning, and, less prominently, prior elicitation.
There is, however, a certain disconnect between the development of state-of-the-art statistical methods and their application in challenging data analysis problems. Many applications have significant amounts of previous knowledge to incorporate into the analysis, but this is often unstructured and tacit. Building it into the analysis would require tailoring the model and eliciting the knowledge in a suitable format for the analysis, which would be burdensome for both experts in statistical methods and experts in the problem domain. More commonly, new methods are developed to work well in some broad class of problems and data, and domain experts use default approaches and apply their previous knowledge post-hoc for interpretation and discussion. Even when experts in both fields are directly collaborating, the feedback loop between the method development and application is often slow.
We propose to directly integrate the user into the modelling loop by formulating knowledge elicitation as a probabilistic inference process. We study a specific case of sparse linear regression with the aim of solving prediction problems where the number of available samples ("training data") is insufficient for statistically accurate prediction. A core characteristic of the formulation is that it adapts to the feedback obtained from the expert and it sequentially integrates every piece of information before deciding on the next query for the expert. The sequential aspect of the approach efficiently reduces the burden on the expert, since the most informative queries will be asked first, thus reducing the overall number of needed interactions and allowing knowledge elicitation for high-dimensional parameters (such as the regression weights). By interactively eliciting and incorporating expert knowledge, our approach fits into the interactive learning literature. Our focus is here on the probabilistic modelling part of the interaction and we leave the design of supporting user interfaces for future work.

Contributions and Outline
The outline of the paper and our main contributions are as follows. After discussing related work (Sect. 2), we rigorously formulate the expert knowledge elicitation as a probabilistic inference process (Sect. 3). We study a specific case of sparse linear regression, and in particular, consider cases where the user has knowledge about the values of the regression coefficients and about the relevance of the features (Sect. 4). We present an algorithm for efficient interactive sequential knowledge elicitation for high-dimensional models that makes knowledge elicitation in "small n, large p" problems feasible (Sect. 4.3). We describe an efficient computational approach using deterministic posterior approximations allowing real-time interaction for the sparse linear regression case (Sect. 4.4). Simulation studies are presented to demonstrate the performance and to gain insight into the behaviour of the approach (Sect. 5). Finally, we demonstrate that real users are able to improve the predictive performance of sparse linear regression in a proof-of-concept experiment (Sect. 5.4).

RELATED WORK
The problem we study relates to several topics studied in the literature, either by the method, goal, or by the considered setting. In this section, we highlight the main connections.
Interactive Learning. Interactive machine learning includes a variety of ways to employ user's knowledge, preferences, and human cognition to enhance statistical learning [2,3]. These methods have been used successfully in several applications, such as learning user intent [4] and preferential clustering. For instance, the semi-supervised clustering method in [5,6] uses feedback on pairs of items that should or should not be in the same cluster, to learn user preferences. In addition to the differences coming from the learning task, one notable contrast between these works and our method is that their aim is to identify user preferences or opinions, whereas our goal is to use expert knowledge as an additional source of information for an improved prediction model, by integrating it with the knowledge coming from the (small n) data. As a probabilistic approach, our work relates to [7] and [8], where expert feedback is used for improved learning of Bayesian networks and for visual data exploration, respectively. In Sect. 3.3, we show how these works can be seen as instances of the general approach we propose.
Active Learning. The method we propose for efficiently using expert feedback is related to active learning techniques (for a survey, see, for instance, [9]), where the algorithms actively select the most informative data points to be used in prediction tasks. Our method similarly queries the user for information with the goal of maximising the information gain from each feedback and thus learning more accurate models with less feedback. The same definition of efficiency with respect to the use of samples also connects our work with experimental design techniques, recently used for linear settings by Seeger [31], Hernández-Lobato et al. [32], and Ravi et al. [12]. Our task, however, is different: we do not aim at collecting new data samples; instead, the additional information comes from a different source, the expert, with its respective bias and uncertainty. Indeed, our method will be most useful in cases where obtaining additional input samples would be too expensive.
Prior Elicitation and Privileged Information. Many works have studied approaches for efficient elicitation of human and, in particular, expert knowledge. In prior elicitation [13], the goal is to use expert knowledge to construct a prior distribution for Bayesian data analysis and restrict the range of parameters to be later used in learning models. Notably, an important line of work [14,15] studies methods of quantifying subjective opinion about the coefficients of linear regression models through the assessment of credible intervals. Our approach goes beyond pure prior elicitation as the training data is used to facilitate efficient user interaction. Another line of work considers expert feedback as privileged information [16], where additional human knowledge is allowed in the training phase only. In contrast to our method, these works typically do not consider an interactive integration of the expert knowledge with the training data, and do not model the reliability of the human feedback thus received; rather, they use it as a guideline for improving the performance of learning tasks.

KNOWLEDGE ELICITATION AS INTERACTIVE PROBABILISTIC MODELLING
In the following, we formulate expert knowledge elicitation as a probabilistic inference process.

Key Components
Let y and x denote the outputs (target variables) and inputs (covariates), and θ and $\phi_y$ the model parameters. Let f encode input from the user (feedback based on the user's knowledge) and $\phi_f$ be related model parameters. We identify the following key components:
1. An observation model $p(y \mid x, \theta, \phi_y)$ for y.
2. A feedback model $p(f \mid \theta, \phi_f)$ for f.
3. A prior model $p(\theta, \phi_y, \phi_f)$ completing the hierarchical model description.
4. A query algorithm and user interface that facilitate gathering f iteratively from the user.
5. Update process of the model after user interaction.
The observation model can be any appropriate probability model. It is assumed that there is some parameter θ, possibly high-dimensional, that the user has knowledge about. The user's knowledge is encoded as (possibly partial) feedback f that is transformed into information about θ via the feedback model. Of course, there could be a more complex hierarchy tying the observation and feedback models, and the feedback model can also be used to model more user-centric issues, such as the quality of or uncertainty in the knowledge or user's interests. The feedback model, together with a query algorithm and a user interface, is used to facilitate an efficient interaction with the user. The term "query algorithm" is used here in a broad sense to describe any mechanism that is used to intelligently guide the user's focus in providing feedback to the system. This enables considering a high-dimensional f without overwhelming the user as the most useful feedbacks can be queried first. Crucially, this enables going beyond pure prior elicitation as the observed data can be used to inform the queries via the dependence of the feedback and observation models. For example, the queries can be formed as solutions to decision or experimental design tasks that maximize the expected information gain from the interaction.
Finally, as the user's feedback is modelled as additional data, Bayes' theorem can be used to sequentially update the model during the interaction. For real-time interaction, this may present a challenge as computation in probabilistic models can be demanding. It is known that slow computation can impair effective interaction [17] and, thus, efficient computational approaches are important.

Examples
The goal in this paper is to use the interaction scheme to help solve prediction problems in the "small n, large p" setting. The approach as described above is, however, more general and applicable to other problems. We briefly describe two earlier works that can be seen as instances of it. Cano et al. [7] present a method for integrating expert knowledge into learning of Bayesian networks. The observation model is a multinomial Bayesian network with Dirichlet priors. The user provides answers to queries about the presence or absence of edges in the graph and the feedback model assumes the answers to be correct with some probability. Which edge to query about next is selected by maximising the information gain with regard to the inclusion probability of the edges. Monte Carlo algorithms are used for the computation.
House et al. [8] present a framework for interactive visual data exploration. They describe two observation models, principal component analysis and multidimensional scaling, that are used for dimensionality reduction to visualise the observations in a two dimensional plot. They do not have a query algorithm, but their user interface allows moving points in a low-dimensional plot closer or further apart, which is interpreted by a feedback model that transforms the feedback into appropriate changes in the shared parameters with the observation model to allow exploration of different aspects of the data. Their model affords closed form updates.

FEEDBACK MODELS AND QUERY ALGORITHM FOR SPARSE LINEAR REGRESSION
We next introduce the knowledge elicitation approach for sparse linear regression.

Sparse Regression Model
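Let $y \in \mathbb{R}^n$ be the observed output values and $X \in \mathbb{R}^{n \times m}$ the matrix of covariate values. The regression is modelled with a Gaussian observation model, a spike-and-slab sparsity-inducing prior [18] on the regression coefficients $w \in \mathbb{R}^m$, and a Gamma prior on the inverse of the residual noise variance $\sigma^2$:
$$y \sim \mathrm{N}(Xw, \sigma^2 I), \qquad \sigma^{-2} \sim \mathrm{Gamma}(\alpha_\sigma, \beta_\sigma),$$
$$w_j \sim \gamma_j \, \mathrm{N}(0, \psi^2) + (1 - \gamma_j) \, \delta_0, \qquad \gamma_j \sim \mathrm{Bernoulli}(\rho), \qquad j = 1, \ldots, m. \qquad (1)$$
Here, the $\gamma_j$ are latent binary variables indicating inclusion or exclusion of the covariates in the regression ($\delta_0$ is a point mass at zero) and $\rho$ is the prior inclusion probability controlling the expected sparsity. The $\alpha_\sigma$, $\beta_\sigma$, $\psi^2$, and $\rho$ are assumed fixed hyperparameters.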

Feedback Models
We consider two simple and natural feedback models encoding knowledge about the individual regression coefficients:
• User has knowledge about the value of the coefficient ($f_{w,j} \in \mathbb{R}$):
$$f_{w,j} \sim \mathrm{N}(w_j, \omega^2). \qquad (2)$$
• User has knowledge about the relevance of the coefficient ($f_{\gamma,j} \in \{0, 1\}$ for not-relevant, relevant):
$$f_{\gamma,j} \sim \gamma_j \, \mathrm{Bernoulli}(\pi) + (1 - \gamma_j) \, \mathrm{Bernoulli}(1 - \pi). \qquad (3)$$
Here, $\omega^2$ and $\pi$ control the uncertainty or strength of the knowledge. In detail, $\omega^2$ is the uncertainty in the user's estimate of the coefficient, and $\pi$ is the probability that the user gives correct feedback relative to the state of the covariate inclusion indicator $\gamma_j$.
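The two feedback models can be illustrated with a small simulation. The sketch below assumes the Gaussian form for value feedback (noise standard deviation ω) and the "correct with probability π" form for relevance feedback, as described above. The function names, the toy coefficient vector, and the helper for the marginal probability of "relevant" feedback are illustrative assumptions and not part of the original method description.

```python
import numpy as np


def simulate_value_feedback(w, j, omega, rng):
    """Noisy report of the coefficient value: f ~ N(w[j], omega^2)."""
    return rng.normal(loc=w[j], scale=omega)


def simulate_relevance_feedback(gamma, j, pi, rng):
    """Report the inclusion indicator gamma[j] correctly with probability pi."""
    correct = rng.random() < pi
    return int(gamma[j]) if correct else 1 - int(gamma[j])


def prob_relevant_feedback(inclusion_prob, pi):
    """P(f = 1) when gamma_j ~ Bernoulli(inclusion_prob), marginalising gamma_j."""
    return pi * inclusion_prob + (1.0 - pi) * (1.0 - inclusion_prob)


rng = np.random.default_rng(0)
w = np.array([0.0, 1.5, 0.0, -0.7])   # toy coefficients (hypothetical values)
gamma = (w != 0).astype(int)          # relevance indicators implied by w

f_value = simulate_value_feedback(w, 1, omega=0.1, rng=rng)
f_relevance = simulate_relevance_feedback(gamma, 1, pi=0.95, rng=rng)
```

The marginal probability in `prob_relevant_feedback` is the kind of quantity needed when taking an expectation over feedback that has not yet been observed.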

Query Algorithm
Our aim is to improve prediction. Thus, the user interaction should focus on aspects of the model (here, predictive features) that would be most beneficial towards this goal. We use the query algorithm to rank the features for choosing which feature to ask feedback about next. The ranking is formulated as a Bayesian experimental design task [19]. More specifically, the feature $j^*$ that maximizes the expected information gain is chosen next:
$$j^* = \arg\max_{j \notin F} \sum_i \mathrm{E}_{p(f_j \mid D)} \left[ \mathrm{KL}\left[ p(\tilde{y} \mid D, x_i, f_j) \,\|\, p(\tilde{y} \mid D, x_i) \right] \right],$$
where j indexes the features, F is the set of feedbacks that have already been given (to simplify notation, those are here included in D), and the summation over i goes over the training dataset. The information gain is defined as the Kullback-Leibler divergence (KL) between the current posterior predictive distribution $p(\tilde{y} \mid D, x) = \int p(\tilde{y} \mid x, \theta) \, p(\theta \mid D) \, d\theta$, where $\theta = (w, \gamma, \sigma^2)$, and the posterior predictive distribution with the new feedback $f_j$, $p(\tilde{y} \mid D, x, f_j)$. The bigger the information gain, the bigger impact the new feedback has on the predictive distribution. Since the feedback itself will only be observed after querying the user, we take the expectation over the posterior predictive distribution of the feedback $p(f_j \mid D)$. More details about the Bayesian experimental design are provided in the supplementary material (Sec. B).
We note that, were the predictive distribution of y Gaussian, the problem would be simple. The expected information gain would be independent of y and the actual values of the feedbacks (when feedback is on values of the regression coefficients) and would only depend on the x and which features were given feedback on [31]. The sparsity-promoting prior, however, makes the problem non-trivial.
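The Gaussian special case mentioned above can be made concrete. The sketch below assumes a plain Gaussian prior N(0, ψ²I) on w (no sparsity), a known noise variance σ², value feedback with noise variance ω², and the KL divergence taken from the updated to the current predictive distribution. Under these assumptions the expected KL at an input x reduces to half the log-ratio of the predictive variances before and after the feedback, so it involves neither y nor the feedback value. All function names are illustrative.

```python
import numpy as np


def gaussian_posterior_cov(X, sigma2, psi2):
    """Posterior covariance of w for a N(0, psi2 * I) prior and known noise sigma2."""
    m = X.shape[1]
    return np.linalg.inv(X.T @ X / sigma2 + np.eye(m) / psi2)


def expected_info_gain(X, Sigma, sigma2, omega2, j):
    """Sum over rows x_i of E_f[KL] for value feedback on feature j.

    Uses the closed form 0.5 * log(var_old / var_new), valid only in the
    fully Gaussian case sketched here.
    """
    s_j = Sigma[:, j]
    # Rank-one covariance update after observing f_j ~ N(w_j, omega2).
    Sigma_new = Sigma - np.outer(s_j, s_j) / (Sigma[j, j] + omega2)
    var_old = np.einsum("ij,jk,ik->i", X, Sigma, X) + sigma2
    var_new = np.einsum("ij,jk,ik->i", X, Sigma_new, X) + sigma2
    return 0.5 * np.sum(np.log(var_old / var_new))


rng = np.random.default_rng(1)
X = rng.normal(size=(10, 30))
Sigma = gaussian_posterior_cov(X, sigma2=1.0, psi2=1.0)
gains = np.array(
    [expected_info_gain(X, Sigma, 1.0, 0.1 ** 2, j) for j in range(X.shape[1])]
)
next_query = int(np.argmax(gains))  # feature to ask about next in this toy case
```

With the spike-and-slab prior this shortcut no longer applies, which is why the approximations described in the Computation section are needed.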

Computation
The model does not have a closed form posterior distribution, predictive distribution, or solution to the information gain maximization problem. To achieve fast computation, we use deterministic posterior approximations. Expectation propagation [25] is used to approximate the spike-and-slab prior [29] and the feedback models, and variational Bayes (e.g., [27, Chapter 10]) is used to approximate the residual variance $\sigma^2$. The form of the posterior approximation for the regression coefficients w is Gaussian. The posterior predictive distribution for y is also approximated as Gaussian. Details are provided in the supplementary material (Sect. A.2).
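Because the posterior predictive distribution is approximated as Gaussian, each KL term in the information gain has a standard closed form. The expression below is the general identity for univariate Gaussians; the symbols $\mu_0, s_0$ (current predictive mean and variance) and $\mu_1, s_1$ (predictive mean and variance after a candidate feedback) are notation introduced here for illustration, and the direction of the divergence is an assumption.

```latex
\mathrm{KL}\left[ \mathcal{N}(\mu_1, s_1) \,\|\, \mathcal{N}(\mu_0, s_0) \right]
  = \frac{1}{2} \left( \log \frac{s_0}{s_1}
  + \frac{s_1 + (\mu_1 - \mu_0)^2}{s_0} - 1 \right)
```

For relevance feedback, which takes only two values, the expectation over the unobserved feedback would then be a two-term weighted sum of such divergences, with weights given by the predictive probabilities of the two answers.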
Expectation propagation has been found to provide good estimates of uncertainty, which is important in experimental design [29,31,32]. In evaluating the expected information gain for a large number of candidate features, running the approximation iterations to full convergence for each, however, is too slow. We follow the approach of Seeger [31] and Hernández-Lobato et al. [32] in computing only a single iteration of updates on the essential parameters for each candidate. We show in the results that this already provides a good performance for the query algorithm in comparison to random queries. Details on the computations are provided in the supplementary material (Sect. A.3).

EXPERIMENTS
The performance of the proposed method (Sect. 4) is evaluated in several "small n, large p" regression problems on both simulated and real data. A proof-of-concept user study is presented to demonstrate the feasibility of the method with real users.

Simulated Data
We use synthetic data to study the behaviour of the approach in a wide range of controlled settings.
Setting. The covariates of n training data points are generated from X ∼ N(0, I). Out of the m regression coefficients w 1 , . . . , w m ∈ R, m * are generated from w j ∼ N(0, ψ 2 ) and the rest are set to zero. The observed output values are generated from y ∼ N(Xw, σ 2 I). We consider cases where the user has knowledge about the value of the coefficients (Eq. 2 with noise value ω = 0.1) and where the user has knowledge about non-relevant/relevant features (Eq. 3 with γ j = 1 if w j is non-zero, and γ j = 0 otherwise, and π = 0.95). For a generated set of training data, all algorithms query feedback about one feature at a time. Mean squared error (MSE) is used as the performance measure to compare the query algorithms. For the simulated data setting, we use the known data-generating values for the fixed hyperparameters, namely: ψ 2 = 1, ρ = m * /m, and σ 2 = 1 (here we do not use the distribution assumption on σ 2 ).
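The data-generating process of this setting can be sketched in a few lines. The sketch assumes that the m* non-zero coefficients are placed at uniformly random positions, which the text does not specify; the function names and the seed are likewise illustrative.

```python
import numpy as np


def simulate_dataset(n, m, m_star, psi2=1.0, sigma2=1.0, seed=0):
    """Draw X ~ N(0, I), a sparse w with m_star non-zero entries, and y ~ N(Xw, sigma2 I)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, m))
    w = np.zeros(m)
    # Assumption: the relevant features are chosen uniformly at random.
    relevant = rng.choice(m, size=m_star, replace=False)
    w[relevant] = rng.normal(scale=np.sqrt(psi2), size=m_star)
    y = X @ w + rng.normal(scale=np.sqrt(sigma2), size=n)
    gamma = (w != 0).astype(int)  # ground-truth relevance indicators
    return X, y, w, gamma


def mse(y_true, y_pred):
    """Mean squared error used as the performance measure."""
    return float(np.mean((y_true - y_pred) ** 2))


X, y, w, gamma = simulate_dataset(n=10, m=100, m_star=10)
```

The returned `w` and `gamma` can serve as the ground truth from which simulated value and relevance feedback are drawn.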
Results. In Fig. 2, we consider a "small n, large p" scenario, with n = 10, m * = 10 and with increasing dimensionality (hence also increasing sparsity) from m = 12, . . . , 200. The heatmaps show the average MSE values over 100 runs (repetitions of the data generation) for both feedback models, as obtained by our sequential experimental design algorithm and by a strategy that randomly selects the sequence of features on which to ask for expert feedback. The result shows that our method achieves a faster improvement in the prediction, starting from the very first user feedbacks, for both feedback types, and at all the dimensionalities.
Notably, in the case of the random strategy, the performance decreases rapidly with the growing dimensionality (even with 50 feedbacks, in the setting with 200 dimensions, the prediction error for random strategy stays high), while the user feedback via the sequential experimental design is informative enough to provide good predictions even in large dimensionalities. Comparing the two types of feedback, the feedback on the coefficient values gives better performance for both strategies.
Sect. C.1.1 in the supplement shows heatmaps for the same setting but with a fixed dimension m = 100 and a varying number of training data n = 5, . . . , 50. For those experiments, we can again see superior improvement for the sequential experimental design compared to random, for both feedback models, and in particular for small sample sizes. Moreover, a comparison of the sequential experimental design algorithm to its non-sequential version (Sect. C.1.2 in the supplement) shows that the former achieves a better performance, indicating that the user feedback affects the next query. Finally, for further insight into the behaviour of the approach, a simulation experiment with n = 10 in Sect. C.2 in the supplementary material shows that the training set error begins to increase as a function of the number of feedbacks while the test error decreases. This happens because the initial fit exhausts the information in the training data, but at this small sample size is insufficient to provide good generalization performance.

Review Rating Prediction
We test our method for the task of predicting review ratings from textual reviews in subsets of Amazon and Yelp datasets. Each review is one data point, and each distinct word is a feature with the corresponding covariate value given by the number of appearances of the word in the review. In addition to being fit for sparse linear regression models (as shown in previous studies, for instance, in [29]), we also chose this type of dataset due to the uncomplicated interpretation of the features, which allows us to easily test our method on real users.

Datasets
Amazon data. The Amazon data is a subset of the sentiment dataset of [23]. This dataset contains textual reviews and their corresponding 1-5 star ratings for Amazon products. Here, we only consider the reviews for products in the kitchen appliances category, which amounts to 5149 reviews. The preprocessing of the data follows the method described in [29], where this dataset was used for testing the performance of a sparse linear regression model. Each review is represented as a vector of features, where the features correspond to unigrams and bigrams, as given by the data provided by [23]. For each distinct feature and for each review, we created a matrix of occurrences and only kept for our analysis the features that appeared in at least 100 reviews, that is, 824 features.
Yelp data. The second dataset we use is a subset of the Yelp (academic) dataset. The dataset contains 2.7 million restaurant reviews with ratings ranging from 1 to 5 stars (rounded to half-stars). Here, we consider the 4086 reviews from the year 2004. Similarly to the preprocessing done for Amazon data, each review is represented as a vector of features (distinct words). After removing non-alphanumeric characters from the words and removing words that appear fewer than 100 times, we have 465 words for our analysis.

Simulated User Feedback
For all experiments on Amazon and Yelp datasets, we proceeded as follows: First, each dataset was partitioned in three parts: (1) a training set of 100 randomly selected reviews, (2) a test set of 1000 randomly selected reviews, and (3) the rest as a "user-data set" for constructing simulated user knowledge. The data were normalised to have zero mean and unit standard deviation on the training and user-data sets. The simulated user feedback was generated based on the posterior inclusion probabilities E[γ] in a spike-and-slab model trained on the user-data partition. We only considered the more realistic case where the user can give feedback about the relevance of the words. For a word j selected by the algorithm, the user gives feedback that the word is relevant or not-relevant when the posterior inclusion probability E[γ_j] is sufficiently high or low, respectively, and uncertain otherwise. The intuition is that if the user-data indicate that a feature is zero/non-zero with high probability, then the simulated user would select that feature as not-relevant/relevant. However, for uncertain words, the feedback iteration passes without receiving any feedback. The model parameters were set to π = 0.9, ψ² = 0.01, α_σ = 1, β_σ = 1, and ρ = 0.3.
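The partitioning and the simulated user can be sketched as follows. The threshold `tau` is a hypothetical value chosen only for illustration, since the exact decision rule is not given here; the function names and the random seed are also assumptions.

```python
import numpy as np


def partition_indices(n_total, n_train=100, n_test=1000, seed=0):
    """Randomly split review indices into training, test, and user-data parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    train = perm[:n_train]
    test = perm[n_train:n_train + n_test]
    user = perm[n_train + n_test:]
    return train, test, user


def simulated_relevance_feedback(inclusion_prob, tau=0.9):
    """Map a posterior inclusion probability to simulated feedback.

    tau is a hypothetical confidence threshold, not a value from the paper.
    Returns 1 (relevant), 0 (not-relevant), or None (uncertain, no feedback).
    """
    if inclusion_prob >= tau:
        return 1
    if inclusion_prob <= 1.0 - tau:
        return 0
    return None


# 4086 is the number of Yelp reviews considered in the text.
train_idx, test_idx, user_idx = partition_indices(4086)
```

Returning `None` for uncertain words mirrors the described behaviour that such iterations pass without any feedback being received.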

Results
We compare three query algorithms:
• random feature suggestion (green line, triangle up),
• a strategy that knows the relevant features beforehand (inferred from the posterior inclusion probabilities over all data) and asks exclusively about them first, and then chooses at random from the features not already selected (red line, triangle down),
• our sequential experimental design algorithm (Sect. 4.3) (blue line, squares).
[Figure: Mean Squared Error against the number of user feedbacks, 0 to 200.]
All algorithms query feedback about one feature at a time and MSE is used as the performance measure. The ground truth line represents the MSE after receiving user feedback for all words in each dataset. A first observation is that the use of additional knowledge coming from the simulated expert indeed reduces the prediction errors, for all algorithms and on both datasets. Yet, the reduction in the prediction error differs significantly depending on whether the methods manage to query feedback on the most informative features first. Indeed, the goal is to make the elicitation as little burdensome as possible for the experts. To reach the goal, a strategy needs to rapidly extract a maximal amount of information from the expert, which here amounts to the careful selection of the features on which to query feedback. As expected, the random query selection strategy has a constant and slow improvement rate, as the number of feedbacks grows, leaving a big gap from the ground-truth performance in both datasets, even after 200 user feedbacks. In contrast, the (unrealistic) strategy that first asks about relevant features begins with a steep increase in performance for the first iterations (only 26 words for Amazon and 23 for Yelp are marked as relevant, as computed from the full dataset), then it continues with a very slow improvement rate coming from asking non-relevant words.
Our method manages to identify the informative features rapidly and thus has a higher improvement compared to random from the first user feedbacks. In the case of Yelp data, our strategy manages to be very close to the strategy knowing the relevant words in the initial feedbacks and then getting very close to the ground-truth after 200 interactions. Furthermore, there is a significant gap compared to the random strategy for all amounts of feedbacks. In the more difficult (in terms of rating prediction error and size of dimensions) Amazon dataset, although the gap to the random strategy is clear, our strategy exceeds the level of information obtained in the 26 non-zero features only after 140 feedbacks.

Expert Knowledge Elicitation vs. Collecting More Samples
We next contrast the improvements in the predictions brought by eliciting the expert feedback to improvements gained by adding samples from the user-data set to the training set. For the latter, we use two alternative strategies: randomly selecting a sequence of reviews to be included in the training set, and an active learning strategy, which selects samples based on maximizing expected information gain (an adaptation of the method in [31]). Table 2 shows how many feedbacks (for the knowledge elicitation strategies in the last two columns: random and our method; see Sect. 4.3) and, respectively, how many additional samples (that is, additional reviews to be included in the training set) are needed to reach set levels of MSE, noting that all strategies have the same "small n, large p" regression setting as a starting point, with n = 100 and a corresponding MSE of 1.2036.
Even with the relatively weak type of expert feedback (feedback on the relevance of features), a specific performance is reached by a comparable number of expert feedbacks and additional data. For instance, the same level of MSE = 1.18 is obtained either by asking an expert about the relevance of 25 features or by actively selecting 12 extra samples. When the active selection is not possible, we can see that the same information gain requires 94 additional randomly selected samples. Naturally, the results obtained are specific for this Yelp data and for the feedback model we assume. Nevertheless, the comparison shows the potential of expert knowledge elicitation in prediction for settings where actively selecting samples is not possible, or even more so, when getting additional samples is impossible or very expensive. The same observations and intuitions about the information gain comparison remain valid for the Amazon data (see Sect. C.3).

User Study
The goal of the user study is to investigate the prediction improvement and convergence speed of the proposed sequential method based on human feedback. Our focus is on testing the accuracy of feedback from real users on the easily interpretable Amazon data rather than on details of the user interface. Hence, we asked ten university students and researchers to go through all the 824 words and give us feedback in the form of not-relevant, relevant, or uncertain. This allowed for a fast collection of feedbacks and we could use the pre-given feedback to test the effectiveness of several query algorithms. We assumed that the algorithms had access to 100 training data points and at each iteration they could query the pre-given feedback of the participant about one word. The whole process was repeated for 40 independent runs, where training data were randomly selected. The hyperparameters of the model were set to the same values as in the simulated data study with the only difference that the strength of user knowledge was lowered to π = 0.7. Fig. 4 shows the average MSE improvements for each of the 10 participants, when using our proposed method and the random query order. From the very first feedbacks, the sequential experimental design approach performs better for all users and captures the expert knowledge more efficiently. The random strategy exhibits a relatively constant rate of performance improvement with the increasing number of feedbacks, while the sequential experimental design shows a faster improvement rate in the beginning, implying that it can query about the more important features first.
To further quantify the statistical evidence for the difference, we computed paired-sample t-tests between the random suggestion and the proposed method at each iteration (green and blue curves in Fig. 4). Already after the first feedback, the difference between the methods is significant at the Bonferroni corrected level α = 0.05/200. Further analysis of the convergence speed and the suggested words is reported in the supplementary material (Sect. C.4).
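The testing procedure can be sketched with SciPy's paired t-test. The sketch assumes the MSE values are arranged as arrays with one row per paired unit and one column per feedback iteration; what exactly constitutes a pair in the study is not specified here, and the synthetic numbers below are placeholders for illustration only, not results from the experiments.

```python
import numpy as np
from scipy import stats


def bonferroni_paired_ttests(mse_a, mse_b, alpha=0.05):
    """Paired t-test per iteration (column), with a Bonferroni-corrected level.

    mse_a, mse_b: arrays of shape (n_pairs, n_iterations) with paired rows.
    Returns a boolean array of significant iterations and the p-values.
    """
    n_tests = mse_a.shape[1]
    _, p_values = stats.ttest_rel(mse_a, mse_b, axis=0)
    return p_values < alpha / n_tests, p_values


# Synthetic placeholder data: 10 paired rows, 200 feedback iterations.
rng = np.random.default_rng(0)
random_mse = 1.2 - 0.001 * np.arange(200) + rng.normal(scale=0.01, size=(10, 200))
sequential_mse = random_mse - 0.05 + rng.normal(scale=0.01, size=(10, 200))

significant, p_values = bonferroni_paired_ttests(random_mse, sequential_mse)
```

With 200 iterations, the corrected level is 0.05/200 = 0.00025, matching the threshold stated above.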

CONCLUSION
We presented a knowledge elicitation approach for high-dimensional sparse linear regression. The results for "small n, large p" problems in simulated and real data with simulated and real users, and with user knowledge on the regression weight values and on the relevance of features, showed improved prediction accuracy already with a small number of user interactions. The knowledge elicitation problem was formulated as a probabilistic inference process that sequentially acquires and integrates user knowledge with the training data. Compared to pure prior elicitation, the approach can facilitate richer interaction and be used in knowledge elicitation for high-dimensional parameters without overwhelming the user.
As a by-product of our study, we noticed that even for the rather weak feedback on the relevance of features, the number of expert feedbacks and the number of randomly acquired additional data samples needed to reach a certain level of MSE reduction were of the same order. Although this observation was obtained on a noisy dataset and for a simplifying user interaction setting, the fact that the considered feedback type was rather weak sets the ground for a further and more robust comparison of the performance gain obtained from these two different sources of information.
The presented knowledge elicitation method is general, and as all assumptions have been explicated as a probabilistic model, the approach can be rigorously analyzed and tailored to match specifics of other knowledge elicitation settings. The presented results considered rather simple types of feedback as a proof-of-concept of the approach. In future, we will work on extending the types of interactions and outlining new types of interactive machine learning problems.

A.1 Model
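The posterior distribution of the regression model is
$$p(w, \sigma^{-2}, \gamma \mid D) \propto p(y \mid X, w, \sigma^2) \, p(f_w \mid w) \, p(f_\gamma \mid \gamma) \, p(w \mid \gamma) \, p(\gamma) \, p(\sigma^{-2}),$$
where $D = (y, X, f_\gamma, f_w)$ are the training data observations together with the sets of observed user feedback and
$$p(y \mid X, w, \sigma^2) = \mathrm{N}(y \mid Xw, \sigma^2 I),$$
$$p(f_w \mid w) = \prod_{j \in F_w} \mathrm{N}(f_{w,j} \mid w_j, \omega^2),$$
$$p(f_\gamma \mid \gamma) = \prod_{j \in F_\gamma} \left[ \gamma_j \, \mathrm{Bernoulli}(f_{\gamma,j} \mid \pi) + (1 - \gamma_j) \, \mathrm{Bernoulli}(f_{\gamma,j} \mid 1 - \pi) \right],$$
$$p(w \mid \gamma) = \prod_{j=1}^{m} \left[ \gamma_j \, \mathrm{N}(w_j \mid 0, \psi^2) + (1 - \gamma_j) \, \delta_0(w_j) \right],$$
$$p(\gamma) = \prod_{j=1}^{m} \mathrm{Bernoulli}(\gamma_j \mid \rho),$$
$$p(\sigma^{-2}) = \mathrm{Gamma}(\sigma^{-2} \mid \alpha_\sigma, \beta_\sigma).$$
Here, $F_\gamma$ and $F_w$ denote the sets of indices of the features that have received relevance feedback and weight feedback, respectively. $\pi$, $\omega^2$, $\alpha_\sigma$, $\beta_\sigma$, and $\psi^2$ are assumed fixed hyperparameters. The parametrizations of the distributions follow Gelman et al. [24] and we use the generic p(·) notation, where it is understood that the parameters identify the separate terms.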

A.2 Posterior approximation
In the corresponding posterior approximation, a bar distinguishes the parameters of the approximation: $q(w) = \mathrm{N}(w \mid \bar{m}, \bar{\Sigma})$, $q(\sigma^{-2}) = \mathrm{Gamma}(\sigma^{-2} \mid \bar{\alpha}_\sigma, \bar{\beta}_\sigma)$, and $q(\gamma) = \prod_j \mathrm{Bernoulli}(\gamma_j \mid \bar{\rho}_j)$. In the site term approximations, $\tilde{t}_\cdot$ denote the exponential family forms of the corresponding distributions, parametrized by the precision-adjusted mean and precision for the normal distribution and by the natural parameters for the Bernoulli and gamma distributions. Note that the terms $p(\sigma^{-2})$, $p(f_w \mid w)$, and $p(\gamma)$ need not be approximated, as they are already of the correct exponential family form. The parameters of the full approximation can be identified from the products of the corresponding site term approximations; for example, $\bar{m} = \bar{\Sigma}(\mu_y + \mu_w + \mu_{f_w})$. Here, $\mathrm{diag}(\cdot)$ is a diagonal matrix with the parameter as the diagonal, and the feedback term approximation parameters are zero for feedbacks that have not been observed.
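A minimal sketch of how the Gaussian part of the full approximation could be assembled from site parameters is given below. It assumes that the precision of $q(w)$ is the sum of a full-matrix likelihood site precision `Gamma_y` and two diagonal site precisions; the names `tau_w` and `tau_fw` are hypothetical and chosen only for this example. The mean follows the relation $\bar{m} = \bar{\Sigma}(\mu_y + \mu_w + \mu_{f_w})$ stated above.

```python
import numpy as np

def combine_gaussian_sites(mu_y, Gamma_y, mu_w, tau_w, mu_fw, tau_fw):
    """Combine Gaussian site terms given in natural parameters.

    mu_* are precision-adjusted means; Gamma_y is a full precision matrix;
    tau_w and tau_fw are vectors of per-feature precisions (assumed names).
    Entries of mu_fw and tau_fw are zero for features without feedback.
    """
    precision = Gamma_y + np.diag(tau_w) + np.diag(tau_fw)
    Sigma_bar = np.linalg.inv(precision)
    m_bar = Sigma_bar @ (mu_y + mu_w + mu_fw)
    return m_bar, Sigma_bar
```

With a Gaussian likelihood site, a zero-mean Gaussian prior site, and no feedback, this reduces to the ridge regression solution, which gives a simple sanity check.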

A.3 Computation of the posterior approximation
Expectation propagation (EP) and variational Bayes (VB) inference are used to find the parameters of the posterior approximation [25, 26, 27]. Expectation propagation for linear regression with a spike-and-slab prior was introduced by Hernández-Lobato et al. [28] (see [29] for a more extensive treatment). We update the $\tilde{t}_{\mathrm{N}}(w \mid \mu_y, \Gamma_y)$ and $\tilde{t}_{\mathrm{Gamma}}(\sigma^{-2} \mid \alpha_y, \beta_y)$ term approximations using VB and all other terms using EP. The parameter update steps in the algorithm, to be iterated until convergence, are 1. $p(w \mid \gamma)$ approximation using parallel EP update.
The individual terms are updated following the pattern in [25]:
1. Computation of the cavity distribution, $q^{\setminus}(\cdot) \propto q(\cdot) / \tilde{t}(\cdot)$. In the natural parametrization, this corresponds to subtracting the parameters of the site approximation from the parameters of the full approximation for the processed model parameter.
2. Minimization of a Kullback-Leibler divergence between the tilted distribution $\hat{p}(\cdot)$ (the cavity distribution combined with the exact term) and the new approximation: for the EP update, $\mathrm{KL}[\hat{p} \,\|\, q]$, and for the VB update, $\mathrm{KL}[q \,\|\, \hat{p}]$. The former corresponds to setting the moments of the sufficient statistics of $q$ to match those of $\hat{p}$, and the latter has the solution $q(\cdot)^{\mathrm{new}} \propto \exp(\mathrm{E}_{q_{-\cdot}}[\log \hat{p}(\cdot)])$, where the expectation is over the approximate posterior of all model parameters other than the one being processed [26, 27].
3. Updating of the parameters of the site approximation, $\tilde{t}^{\mathrm{new}} \propto q(\cdot)^{\mathrm{new}} / q^{\setminus}(\cdot)$. This can be thought of as the inverse of step 1, now yielding the updated site approximation; in the natural parametrization, it is a subtraction of the cavity parameters from the parameters of the new full approximation. We use damping of the updates (the parameters are set to a convex combination of the old parameters and the new parameters computed above) [30].
All of the computations have closed-form solutions.
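The three-step pattern above can be sketched for a single scalar Gaussian factor in natural parameters (precision-adjusted mean, precision). This is an illustrative sketch, not the implementation used in the experiments: the function names are hypothetical, and the example term is a Gaussian observation $\mathrm{N}(f \mid w, \omega^2)$, chosen because its tilted distribution is exactly Gaussian so the moment matching is trivial to verify.

```python
import numpy as np

def to_natural(mean, var):
    """Natural parameters of a Gaussian: (precision-adjusted mean, precision)."""
    return np.array([mean / var, 1.0 / var])

def ep_site_update(full, site, match_moments, damping=0.5):
    """One damped EP update of a scalar Gaussian site approximation."""
    cavity = full - site                        # step 1: remove the site
    mean_new, var_new = match_moments(cavity)   # step 2: moments of the tilted dist.
    full_new = to_natural(mean_new, var_new)
    site_new = full_new - cavity                # step 3: recover the new site
    # Damping: convex combination of the old and the new site parameters.
    site_damped = damping * site_new + (1.0 - damping) * site
    return site_damped, cavity + site_damped

def gaussian_term_moments(f, omega2):
    """Moment matcher for the exact term N(f | w, omega2) (illustrative choice)."""
    def match(cavity):
        eta1 = cavity[0] + f / omega2
        eta2 = cavity[1] + 1.0 / omega2
        return eta1 / eta2, 1.0 / eta2
    return match
```

With `damping=1.0` the site for this term converges in one step to the natural parameters of the term itself, $(f/\omega^2, 1/\omega^2)$, which is consistent with the remark that Gaussian feedback terms need no approximation.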

B Bayesian Experimental Design
The task is to find the feedback that maximises the expected information gain, where $F$ is the set of feedbacks that have already been given (to simplify notation, those are here assumed to be included in $D$) and the summation over $i$ goes over the training dataset. The evaluation of the expected information gain is described in the following. The posterior predictive distribution is approximated as Gaussian, where $\bar{s}^2 = \bar{\beta}_\sigma / \bar{\alpha}_\sigma$ is the posterior mean approximation for the residual variance. Similarly, the posterior predictive distributions of the feedbacks for the two feedback types follow as approximate Gaussian and Bernoulli distributions. The information gain between the predictive distributions is given in (8). As running the EP algorithm to full convergence would be too costly for evaluating a large number of candidates, we approximate the posterior distribution with the new feedback using partial EP updates. This is similar to the approach of Seeger [31] and Hernández-Lobato et al. [32] for experimental design for sparse linear models. We consider the two types of feedback separately.
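The building blocks of this computation can be sketched as follows. The predictive form used here (mean $\bar{m}^\top x$, variance $\bar{s}^2 + x^\top \bar{\Sigma} x$) is the standard Gaussian predictive of a linear model under $q(w)$ with a plug-in residual variance, and the direction of the Kullback-Leibler divergence (updated predictive relative to current predictive) is an assumption of this example; the function names are hypothetical.

```python
import numpy as np

def predictive(x, m_bar, Sigma_bar, s2_bar):
    """Mean and variance of the approximate Gaussian predictive at input x."""
    return x @ m_bar, s2_bar + x @ Sigma_bar @ x

def kl_gaussians(mean_new, var_new, mean_old, var_old):
    """KL[N(mean_new, var_new) || N(mean_old, var_old)] for univariate Gaussians."""
    return 0.5 * (np.log(var_old / var_new)
                  + (var_new + (mean_new - mean_old) ** 2) / var_old
                  - 1.0)

def information_gain(X, m_new, Sigma_new, m_old, Sigma_old, s2_bar):
    """Sum of predictive KL divergences over the training inputs (rows of X)."""
    total = 0.0
    for x in X:
        mean_n, var_n = predictive(x, m_new, Sigma_new, s2_bar)
        mean_o, var_o = predictive(x, m_old, Sigma_old, s2_bar)
        total += kl_gaussians(mean_n, var_n, mean_o, var_o)
    return total
```

The expected information gain of a candidate query would then average `information_gain` over the predictive distribution of the not-yet-observed feedback value.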
In the case of feedback directly on the regression weight, we add the corresponding site term (which is already of Gaussian form and does not need approximation, as noted above) and do not update the approximations of the other site terms (including assuming $\bar{s}^2_{\tilde{f}} = \bar{s}^2$). In the new posterior approximation of $w$ under these assumptions, $e$ is a vector of zeros except for 1 at the $j$th element, $T = 1/\omega^2$, and $h = \tilde{f}_{w,j}/\omega^2$. Notably, $\bar{\Sigma}^{-1}$ and $\bar{\Sigma}^{-1}\bar{m}$ are the precision and the precision-adjusted mean of the posterior approximation without the new feedback and are directly available from the previous EP approximation. The new posterior covariance is independent of the value of the feedback $\tilde{f}_{w,j}$, and it can be efficiently evaluated using the matrix inversion lemma as $\bar{\Sigma}_{\tilde{f}} = \bar{\Sigma} - \frac{1}{T^{-1} + \bar{\Sigma}_{jj}} \bar{\Sigma} e e^\top \bar{\Sigma}$. Furthermore, the expectation over the feedback in the expected information gain affects only the term with the squared difference of the means. In that expression, the first equality follows from substituting Equation 10 and using the matrix inversion lemma, and the second equality from $h/T = \tilde{f}_{w,j}$ and the remaining expectation being equal to the variance of the predictive distribution of the feedback.
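The rank-one update described here can be sketched as below, assuming the new posterior has precision $\bar{\Sigma}^{-1} + T e e^\top$ and precision-adjusted mean $\bar{\Sigma}^{-1}\bar{m} + h e$, which is how the surrounding text describes $T$ and $h$. Under the additional assumptions that the information gain is the Kullback-Leibler divergence from the updated to the current Gaussian predictive, and that the feedback predictive is $\mathrm{N}(\bar{m}_j, \bar{\Sigma}_{jj} + \omega^2)$, the expectation over the feedback collapses to a log-ratio of predictive variances; `expected_gain_weight_feedback` uses that simplification. Function names are hypothetical.

```python
import numpy as np

def weight_feedback_update(m_bar, Sigma_bar, j, f, omega2):
    """Posterior mean and covariance after weight feedback f on feature j.

    Uses the matrix inversion lemma, so no matrix inverse is needed.
    """
    col = Sigma_bar[:, j]                    # Sigma_bar @ e (Sigma_bar is symmetric)
    c = 1.0 / (omega2 + Sigma_bar[j, j])     # 1 / (T^{-1} + Sigma_jj) with T = 1/omega2
    Sigma_f = Sigma_bar - c * np.outer(col, col)
    m_f = m_bar + c * (f - m_bar[j]) * col   # mean shift is linear in the feedback
    return m_f, Sigma_f

def expected_gain_weight_feedback(X, Sigma_bar, s2_bar, j, omega2):
    """Expected summed predictive KL over rows of X (assumed KL direction)."""
    col = Sigma_bar[:, j]
    c = 1.0 / (omega2 + Sigma_bar[j, j])
    var_old = s2_bar + np.einsum("ij,jk,ik->i", X, Sigma_bar, X)
    var_new = var_old - c * (X @ col) ** 2
    return 0.5 * np.sum(np.log(var_old / var_new))
```

Because the covariance update does not depend on `f`, scoring every candidate feature costs one column of $\bar{\Sigma}$ and a matrix-vector product per candidate.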
In the case of relevance feedback, we add the corresponding site term for the feedback and run a single EP update on it and on the corresponding prior term $p(w_j \mid \gamma_j)$. These updates are purely scalar operations and do not require any costly matrix operations. Other site term approximations are not updated. In the new posterior approximation of $w$ under these assumptions, $T$ and $h$ are now the changes in the precision and the precision-adjusted mean in the $j$th feature, and these are available with cheap scalar operations. The expectation over the value of the feedback in the expected information gain is in this case a sum of two terms, and we evaluate both terms separately using the above scheme. Again, we use the matrix inversion lemma to avoid full inversions in computing the new posterior covariance.
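The two-term expectation can be sketched as follows. The scalar EP update that produces the changes $(T, h)$ for each possible feedback value is not implemented here; `delta_one` and `delta_zero` are hypothetical inputs standing in for its output, and `p_one` stands in for the Bernoulli predictive probability of positive feedback. The direction of the Kullback-Leibler divergence is again an assumption of this example.

```python
import numpy as np

def rank_one_update(m_bar, Sigma_bar, j, T, h):
    """Posterior after adding precision T and precision-adjusted mean h at feature j."""
    col = Sigma_bar[:, j]
    s_jj = Sigma_bar[j, j]
    c = T / (1.0 + T * s_jj)   # requires 1 + T * s_jj > 0 (T may be negative in EP)
    Sigma_f = Sigma_bar - c * np.outer(col, col)
    m_f = m_bar + col * (h * (1.0 - c * s_jj) - c * m_bar[j])
    return m_f, Sigma_f

def summed_kl(X, m_new, S_new, m_old, S_old, s2_bar):
    """Sum over rows of X of KL[new predictive || old predictive]."""
    var_n = s2_bar + np.einsum("ij,jk,ik->i", X, S_new, X)
    var_o = s2_bar + np.einsum("ij,jk,ik->i", X, S_old, X)
    diff = X @ m_new - X @ m_old
    return np.sum(0.5 * (np.log(var_o / var_n) + (var_n + diff ** 2) / var_o - 1.0))

def expected_gain_relevance(X, m_bar, Sigma_bar, s2_bar, j, p_one, delta_one, delta_zero):
    """Expected gain as a probability-weighted sum over the two feedback values."""
    gain = 0.0
    for prob, (T, h) in ((p_one, delta_one), (1.0 - p_one, delta_zero)):
        m_f, S_f = rank_one_update(m_bar, Sigma_bar, j, T, h)
        gain += prob * summed_kl(X, m_f, S_f, m_bar, Sigma_bar, s2_bar)
    return gain
```

Each candidate again needs only a column of $\bar{\Sigma}$, so the cost per query stays linear in the number of features for the update itself.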

C.1 Synthetic data
For the synthetic experiments with simulated data, we continue the study of the behaviour of our algorithm, through additional experiments and visualisations. The setting stays the same as in Sect. 5.1, except for the specifications below.

C.1.1 Heatmaps with varying number of training data
We now study the performance when the number of training data varies from 1 to 50 (since we consider small-sample settings in particular). The dimensionality is fixed to 100, and the number of relevant features is 10. Fig. 5 illustrates the behaviour of our strategy and that of the random feature selection for the previously described synthetic data setting, with a fixed dimension m = 100 and with increasing numbers of training data points n = 5, . . . , 50. For very small sample sizes (n < 10), a difference between the performance of the two methods starts being visible after 20-30 received feedbacks. Then, for larger training sample sizes (10 < n < 30), the MSE reduction in our method is more visible from the first feedbacks, while for n > 30, both strategies have a much smaller MSE.

C.1.2 Sequential vs Non-sequential Experimental Design
For a simple setting with simulated data, we now study the difference between our method and its non-sequential version for the two feedback models discussed previously: user feedback on the coefficients and on their relevance. The non-sequential version chooses the sequence of features to be queried before observing any expert feedback. We note that the behaviour and ranking of the query algorithms remain similar to those observed in the previous plots. In Fig. 6, we consider a "small n, large p" scenario, with n = 10, m = 100, m* = 10, and we report the average MSE value over 500 runs.
The results in the plots are shown for an increasing number of feedbacks, up to the number of dimensions, at which point all methods converge. However, in the plausible scenario where the number of user interactions is limited, both experimental design methods achieve a larger decrease in prediction loss than the other methods, even in the first iterations.
This reflects the fact that both experimental design strategies manage to identify and ask with priority about the most informative coefficients. This is more evident in the feedback model about coefficient relevance (Fig. 6(b)), where the performance of the two experimental design strategies is very close to that of the strategy that first suggests only relevant features. However, one can also notice an improved performance for the sequential version of the experimental design strategy. Indeed, the more carefully selected sequence of queries made by the sequential experimental design strategy reduces the prediction error faster than the non-sequential selection, where the observed expert feedback is not taken into account. Also, as expected, the difference between the sequential and non-sequential experimental designs is more significant in the case of the stronger feedback model on coefficient values (Fig. 6(a)).

C.2 Comparison of Training and Test Set Errors and the Average Accumulated Suggestion Behaviour
We can get some insight into the behaviour of the approach by comparing the training and test set errors shown in Fig. 6(a) and Fig. 7(a) for the simulated data scenario described in the previous section with feedback on the coefficient values. The training set error begins to increase as a function of the number of expert feedbacks. This happens because the model without any feedbacks has exhausted the information in the training data (to the extent allowed by the regularizing priors) and fits the training data well. The user feedback, however, moves the model away from the training data optimum and towards better generalization performance. Indeed, the MSE curves for the training and test errors converge close to each other as the number of feedbacks increases. Moreover, the convergence is faster for the query algorithms that start by suggesting the features with non-zero effects implying that they are more informative (Fig. 7(b)).

C.4 User Study
We complement the analysis of the results of the user study with two illustrations. First, to compare the convergence speed of the different methods, we normalised the MSE improvements at each iteration by the total improvement obtained by each of the users when considering all of their individual feedback. Figure 8(a) depicts the convergence speed of the methods based on this measure. As can be seen from the figure, for all participants, the proposed method was able to capture most of the participants' knowledge with a small budget of 20 feedback queries (stabilizing at around 200 out of the total 824 features in the considered subset of Amazon data). Then, in Figure 8(b), we show the average percentage of relevant words that were asked from the participants at each iteration. It is evident from the figure that the proposed algorithm started by mostly asking about the limited number of relevant words. The relevant words were identified by considering all the data in the Amazon dataset, training a spike-and-slab model, and then choosing words with E[γ j ] > 0.7 (words with high posterior inclusion probability). Based on this threshold, only 39 of the 824 words were considered relevant. The coefficients of all words in the spike-and-slab model, along with the names of the relevant words, are shown in Fig. 9.
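The two quantities used in these illustrations can be sketched as follows. The function names are hypothetical, and the sketch assumes that the first entry of a participant's MSE curve is the error before any feedback and that `mse_all_feedback` is the error after incorporating all of that participant's feedback.

```python
import numpy as np

def normalised_improvement(mse_curve, mse_all_feedback):
    """Fraction of a participant's total MSE improvement reached at each iteration."""
    mse_curve = np.asarray(mse_curve, dtype=float)
    total = mse_curve[0] - mse_all_feedback
    return (mse_curve[0] - mse_curve) / total

def relevant_words(inclusion_prob, threshold=0.7):
    """Indices of words whose posterior inclusion probability exceeds the threshold."""
    return np.flatnonzero(np.asarray(inclusion_prob) > threshold)
```

A value of 1 in the normalised curve means that the method has recovered everything that the participant's complete feedback would have contributed.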