1 Introduction

Bayesian optimization (BO) (Shahriari et al. 2016b) is a promising approach for black-box optimization of an expensive-to-evaluate unknown function. Its applications include hyperparameter tuning of machine learning models with the highest predictive accuracy (Snoek et al. 2012) and searching of compound structures with desirable properties (Korovina et al. 2020). BO is conducted in an iterative manner. In each iteration, a BO method determines the values of input variables in an input domain \({\mathcal {X}} \subset {\mathbb {R}}^d\), evaluates a black-box function \(f: {\mathcal {X}} \rightarrow {\mathbb {R}}\) at that input query, and observes the corresponding output \(y \in {\mathbb {R}}\). The final goal is to find the input that maximizes the function with as few function evaluations as possible.

Although a learner can fully specify all values of input variables in the standard BO setting, this procedure is expensive or difficult to perform for some cases. One example is crowdsourcing, by which human-intelligent tasks are outsourced to an undetermined number of people on the Internet. Most existing crowdsourcing platforms allow us to ask workers to perform the task by specifying conditions such as age, country, and experience to obtain high-quality deliverables (Li et al. 2014). However, if the conditions are too strict, there is a risk that there will be no workers available to participate in the task. Therefore, we specify some conditions to increase the probability that one of the workers who meet the conditions will participate in the task. Upon completion of the task by a worker, we observe the quality of the deliverable and the unspecified conditions of the worker.

Another example is the design of airfoils with low self-noise in an outsourcing scenario. We ask a manufacturer to produce an airfoil by specifying values of its various variables, including angle of attack and chord length. However, specifying values of many variables in outsourcing incurs high cost, and our budget is often limited. Therefore, we are restricted to specify only values of certain variables. After obtaining the ordered airfoil, the values of the unspecified variables are revealed, and self-noise of the airfoil is measured via an acoustic test.

In this paper, we propose a novel Gaussian process (GP) bandit framework, which we call Bayesian optimization with partially specified queries (BOPSQ). In BOPSQ, unlike the standard BO setting, a learner selects a subset of input variables and specifies their values. We call such an input query a partially specified query. Next, the learner observes the values of the unspecified input variables determined according to a known or unknown distribution. Then, the learner evaluates the black-box function at the input values to obtain the corresponding output. Hence, the learner is required to consider the extent to which the input variables contribute to the function’s output as well as which values the unspecified input variables will take. Williams et al. (2000) and Lattimore et al. (2016) also handled partially specified queries. However, in their works, the input variables whose values are specified are fixed or infinite input spaces are not considered for the GP bandit.

We propose two algorithms for known and unknown input distributions. In the case of the known input distribution, the uncertainty lies only in the function estimation, and we can naturally extend a posterior sampling approach to this setting. In the case of the unknown input distribution, the learner further needs to take the uncertainty in the distribution estimation into account. We adopt a multi-armed bandit approach to appropriately explore the input distribution.

The contributions of the present study are summarized as follows:

  1. 1.

    We propose a novel GP bandit framework with partially specified queries, which is a natural generalization of the standard BO problem and BO with environmental variables (Williams et al. 2000).

  2. 2.

    We propose two algorithms for known and unknown input distribution cases with their sublinear regret bounds for popular kernels.

  3. 3.

    We demonstrate the empirical effectiveness of the proposed algorithms using a set of test functions and real-world datasets.

The rest of the paper is organized as follows. We briefly describe the standard BO setting and approaches in Sect. 2. Section 3 formalizes the problem setting of BOPSQ as an extension of the standard BO setting. In Sect. 4, we propose the two algorithms for known and unknown input distribution cases with theoretical guarantees. We validate the effectiveness of the proposed algorithms in Sect. 5. Section 6 summarizes the related works. We provide the concluding remarks in Sect. 7.

2 Review of Bayesian optimization

BO (Srinivas et al. 2012; Shahriari et al. 2016b) is an approach for finding the maximizer of an expensive-to-evaluate black-box function interactively. Let \({\mathcal {X}} \subset {\mathbb {R}}^d\) be a d-dimensional compact set and \(f: {\mathcal {X}} \rightarrow {\mathbb {R}}\) be an unknown real-valued function. In each iteration, a learner determines the values of input variables \(x \in {\mathcal {X}}\). The learner evaluates the function at x and obtains an output \(y = f(x) + \epsilon\), where \(\epsilon \sim {\mathcal {N}}(0, \sigma ^2)\). The goal is to find the maximizer

$$\begin{aligned} x^* = \mathop {\mathrm{argmax}}\limits _{x \in {\mathcal {X}}} f(x) \end{aligned}$$

with as few evaluations as possible.

BO methods typically assume a GP (Rasmussen and Williams 2006) as a prior of the black-box function: \(f \sim {{\mathcal {G}}}{{\mathcal {P}}}(\mu _0, k)\), where \(\mu _0: {\mathcal {X}} \rightarrow {\mathbb {R}}\) is a mean function and \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\) is a covariance function. Without loss of generality, we assume a zero mean \(\mu _0 = \varvec{0}\). Suppose that a set of t input-output observations \({\mathcal {D}}_{t} = \{(x_{i}, y_i)\}_{i=1}^{t}\) is given. Then, we have a posterior mean \(\mu _t(x)\), posterior variance \(\sigma _t^2(x)\), and posterior covariance \(k_t(x,y)\) as follows:

$$\begin{aligned}&\mu _t(x) = {{k}_x}^\top {\left( K + \sigma ^2 I\right) }^{-1} \varvec{y}_t,\quad \sigma _t^2(x) = k_t(x,x),\\&{k_t(x, y) = k(x,y) - {{k}_x}^\top {\left( K + \sigma ^2 I\right) }^{-1} {{k}_y},} \end{aligned}$$

where \({{k}_x}\) is a t-dimensional vector whose i-th entry is \(k(x_i, x)\), K is a \(t \times t\) matrix whose (ij)-th entry is \(k(x_i, x_j)\), I is an identity matrix, and \(\varvec{y}_t\) is a t-dimensional vector whose i-th entry is \(y_i\).

BO methods often use an acquisition function \(a_{t+1}: {\mathcal {X}} \rightarrow {\mathbb {R}}\) that quantifies how much an input should be evaluated next. The next input query is determined as

$$\begin{aligned} x_{t+1} = {\mathop {\mathrm{argmax}}\limits }_{x \in {\mathcal {X}}} a_{t+1} (x). \end{aligned}$$

Popular choices for the acquisition function are the Gaussian process upper confidence bound (GP-UCB) (Srinivas et al. 2012), the probability of improvement (Kushner 1964), and the expected improvement (Mockus et al. 1978; Jones et al. 1998). Thompson sampling (Thompson 1933; Russo and Roy 2014) based on a posterior sample \(g_{{t+1}} \sim {{\mathcal {G}}}{{\mathcal {P}}}(\mu _{t}, k_{t})\) is also widely used, and its acquisition function is defined as

$$\begin{aligned} a_{t+1} (x) = g_{t+1}(x). \end{aligned}$$

3 Problem setting

We propose a novel GP bandit problem that utilizes partially specified queries, BOPSQ. We first define some terms and notations. Then, we introduce the framework and define regret for BOPSQ.

3.1 Notation

Let \({\mathcal {X}} = {\mathcal {X}}^{(1)} \times \ldots \times {\mathcal {X}}^{(d)} \subset {\mathbb {R}}^d\), \(d \in {\mathbb {N}}\), be a compact domain and \(f: {\mathcal {X}} \rightarrow {\mathbb {R}}\) be an unknown function of interest. A control variable set, \(I \subset [d]\), represents the indices of input variables which a learner can specify the values of. An uncontrol variable set, \({\bar{I}} = [d] \setminus I\), represents the remaining indices of input variables which the learner can not specify the values of. Let \({\mathcal {I}} \subset 2^{[d]}\) be a family of control variable sets. We call the values of control (uncontrol) variable set control (uncontrol) variables, denoted by \(x^I \in {\mathcal {X}}^I\) (\(x^{{\bar{I}}} \in {\mathcal {X}}^{{\bar{I}}}\)). Here, \(x^I = \{x^{(i)} \in {\mathcal {X}}^{(i)} \mid i \in I\}\) and \({\mathcal {X}}^I = {\mathcal {X}}^{i_1} \times \ldots \times {\mathcal {X}}^{i_{k}}\) for \(I = \{i_1, \ldots , i_k\} \subset [d]\). Let \(X = (X^{(1)},\ldots ,X^{(d)})^\top \sim P(X)\) be a random vector according to an input distribution P over \({\mathcal {X}}\) and Y be a random variable over \({\mathbb {R}}\) corresponding to the output variable. With a slight abuse of notation, we write \(f(x^I,x^{{\bar{I}}})\) for f(x).

3.2 Framework

The optimization procedure consists of T iterations. A learner is given a non-empty family of control variable sets \({\mathcal {I}} \subset 2^{[d]}\), which we assume is static over T iterations for the sake of simplicity in this paper. In each iteration, the learner selects a control variable set \(I \in {\mathcal {I}}\) and specifies their values \(x^{I} \in {\mathcal {X}}^I\). Next, the learner observes uncontrol variables \(X^{{\bar{I}}}\) drawn from a conditional distribution \({P}(X^{{\bar{I}}} \mid X^{I}=x^{I})\), which is either known or unknown. By evaluating a black-box function f at the input variables \((x^I, X^{{\bar{I}}})\), the learner observes the corresponding output \(Y = f(x^{I}, X^{{\bar{I}}}) + \epsilon\), where \(\epsilon \sim {\mathcal {N}}(0,\sigma ^2)\).

In the standard BO setting, the goal is to find the optimal input variables \(x^*\). However, in BOPSQ, the learner may not always select \(x^*\) and obtain \(f(x^*)\) because the input query is random. Therefore, for the objective function, we consider the conditional expectation of \(f(x^{{I}}, X^{{\bar{I}}})\) over \(X^{{\bar{I}}}\) given \(X^{{I}} = x^{{I}}\). We define the goal as finding the optimal control variable set \(I^* \in {\mathcal {I}}\) and control variables \(x^{I^*} \in {\mathcal {X}}^{I^*}\) in as few evaluations as possible, defined as

$$\begin{aligned} (I^*, x^{I^*}) = \mathop {\mathop {\mathrm{argmax}}\limits }_{I \in {\mathcal {I}}, x^{I} \in {\mathcal {X}}^I} {\mathbb {E}} [f(X^{I}, X^{{\bar{I}}}) \mid f, X^{I}=x^{I}]. \end{aligned}$$

Hereafter, we write \(x^{I^*}\) instead of \((I^*, x^{I^*})\) for simplicity.

BOPSQ in the unknown input distribution case can be seen as a mixture of BO and multi-armed bandits. By selecting an action \(x^I\) from an infinite action space \({\mathcal {X}}^I\), a learner observes information \(f(x^I, X^{{\bar{I}}})\), which corresponds to the standard BO. By selecting an action I (and \(x^I\)) from a finite action space \({\mathcal {I}}\), a learner observes information \(X^{{\bar{I}}} \sim P(X^{{\bar{I}}} \mid X^I = x^I)\), which corresponds to a multi-armed bandit problem.

Some specific forms set as the family of control variable sets \({\mathcal {I}}\) recover existing BO settings. The simple setting in which the learner can select all the input variables as a control variable set, \({\mathcal {I}} = \{[d]\}\), corresponds to the standard BO setting. When the family of control variable sets is always fixed to one particular subset, \({\mathcal {I}} = \{I\}, I \subset [d], 0< |I| < d\), we consider this setting as BO with environmental variables (Williams et al. 2000). In this sense, BOPSQ is a generalization of these BO settings. The major difference appears when the learner is required to select a control variable set for \(|{\mathcal {I}}| > 1\). Moreover, when the input distribution P(X) is unknown, the learner is required to consider the uncertainty of the input distribution estimation as well as the function estimation.

In practice, the sizes of all control variable sets may be same, i.e., \({\mathcal {I}} = \{I \mid |I| = k\}\) for \(0<k<d\). This is because control variable sets that are a subset of other control variable sets are useless. Let \(I \subset J \in {\mathcal {I}}\), that is, I be a subset of J. It is clear that the maximum of the objective function for I in Eq. (1) is not larger than that for J, i.e., \(\max _{x^{I} \in {\mathcal {X}}^{I}} {\mathbb {E}} [f(X^{I}, X^{{\bar{I}}}) \mid f, X^{I}=x^{I}] \le \max _{x^{J} \in {\mathcal {X}}^{J}} {\mathbb {E}} [f(X^{J}, X^{{\bar{J}}}) \mid f, X^{J}=x^{J}]\). Further, we typically want to specify the values of as many input variables as possible in black-box optimization. Therefore, in the experimental part, we set the sizes of all control variable sets to be same.

3.3 Regret

The final goal is to find the optimal control variables \(x^{I^*}\) defined in Eq. (1). We define instantaneous regret \(\text {IR}(t)\) in iteration t, which evaluates the gap between the optimal solution and the choice \(x^{I_t}\), described as

$$\begin{aligned} \text {IR}(t) = {\mathbb {E}}[f(X^{I^*}, X^{\bar{I^*}}) \mid f, X^{I^*} = x^{I^*}] - {\mathbb {E}}[f(X^{I_t}, X^{\bar{I_t}}) \mid f, X^{I_t} = x^{I_t}]. \end{aligned}$$

The resultant performance measure, Bayes cumulative regret \(\text {BayesRegret}(T)\), is the expectation and summation over T iterations of the instantaneous regret, defined as

$$\begin{aligned} \text {BayesRegret}(T) = \sum _{t=1}^T {\mathbb {E}}\left[ \text {IR}(t)\right] . \end{aligned}$$

Our target is designing algorithms that achieve sublinear regret, that is, \(\lim _{T \rightarrow \infty } \text {BayesRegret}(T) / T \rightarrow 0\). In some situations, one may prefer simple regret, the best instantaneous regret so far \(\min _{t=1,\ldots ,T} {\mathbb {E}}\left[ \text {IR}(t)\right]\), to cumulative regret. However, sublinear cumulative regret guarantees that simple regret asymptotically converges to 0 since \(\min _{t=1,\ldots ,T} {\mathbb {E}}\left[ \text {IR}(t)\right] \le \text {BayesRegret}(T) / T\).

4 Algorithms

We consider two cases where the input distribution is known and unknown, propose two algorithms based on posterior sampling for these cases, and also provide their regret bounds that are sublinear for popular kernels.

4.1 TSPSQ-known for known input distribution

We first focus on the case where the joint input distribution P(X) or the set of conditional distributions \(\{P(X^{{\bar{I}}} \mid X^I)\}_{I \in {\mathcal {I}}}\) is known. For example, the case of the the airfoil design provided by a manufacturer corresponds to the situation in which the manufacturer publishes random options for the unspecified features. We propose an algorithm based on Thompson sampling, which we call Thompson sampling with partially specified queries for the known input distribution case (TSPSQ-known).

4.1.1 Acquisition function

In iteration t, TSPSQ-known determines the next control variables as

$$\begin{aligned} x^{I_{t}} = \mathop {\mathrm{argmax}}\limits _{I \in {\mathcal {I}}, x^I \in {\mathcal {X}}^I} {\mathbb {E}}[g _t(X^{I}, X^{{\bar{I}}}) \mid g _t, X^{I}=x^{I}], \end{aligned}$$

where \(g_t\) is a sample from the GP posterior \(g _t \sim {{\mathcal {G}}}{{\mathcal {P}}}(\mu _{t-1}, k_{t-1})\). The pseudo-code is given in Algorithm 1.

Equation (2) approaches the ground-truth objective in Eq. (1) as the posterior sample \(g_t\) approximates the ground-truth function f better with more observations. TSPSQ-known is a natural extension of Thompson sampling; if we set \({\mathcal {I}} = \{[d]\}\) as the standard BO setting, the acquisition function in Eq. (2) is reduced to \(g_t(x)\).

4.1.2 Regret bound

We first introduce the notion of maximum information gain, which is the basis of typical regret analysis for BO. The maximum information gain in iteration T, denoted by \(\gamma _T\), is the maximum possible information gain achievable by any algorithm for f via queries \(A = \{x_1, x_2,\ldots , x_T\} \subset {\mathcal {X}}\) and the corresponding outputs \(y_A = \{y_1, y_2,\ldots , y_T\}\), defined as

$$\begin{aligned} \gamma _T = \max _{A \subset {\mathcal {X}}, |A| = T} I(f; y_A). \end{aligned}$$

Here, \(I(f; y_A)\) denotes the mutual information between f and \(y_A\). Srinivas et al. (2012) provided the bounds of \(\gamma _T\) for various types of kernels: \(\gamma _T = O(d \log T)\) for the linear kernel \(k(x,y) = x^\top y\), \(\gamma _T = O\left( (\log T)^{d+1}\right)\) for the Gaussian kernel \(k(x, y) = \exp {(0.5 \Vert x-y\Vert ^2 / l^2)}\), and \(\gamma _T = O\left( T^{d(d+1)/(2\nu +d(d+1))} \log T\right)\) for the Matérn kernel \(k(x,y) = \left( 2^{1-\nu } / \Gamma (\nu )\right) r^{\nu } B_{\nu }(r)\), where \(r = (\sqrt{2 \nu } / l)\left\| x - y\right\|\), \(B_{\nu }\) is the modified Bessel function, and \(l,\nu > 0\).

We then derive the regret bound of TSPSQ-known for the case where the ground-truth input distribution is known.

Theorem 1

Let \({\mathcal {X}} \subset [0,1]^d\), \(d \in {\mathbb {N}}\), be a compact domain. Assume \(k(x, x') \le 1\) and that there exist constants \(a',b'>0\) for the partial derivatives of sample paths f such that

$$\begin{aligned} l>0, \forall j \in [d], ~{\mathbb {P}} \left( \sup _{x \in {\mathcal {X}}} \left| \frac{\partial f(x)}{\partial x^{(j)}}\right| > l\right) \le a' e^{-(l/b')^2}. \end{aligned}$$

Then, by running TSPSQ-known for T iterations, it holds that

$$\begin{aligned} \mathrm{BayesRegret}(T) = {O}(\sqrt{d T \gamma _T \log T}). \end{aligned}$$

The analysis is mainly based on the techniques by Russo and Roy (2014), Srinivas et al. (2012), where the cumulative regret is decomposed into terms with respect to discretization errors of \({\mathcal {X}}\), the difference between f and the upper confidence bound \(\mu _{t-1}(x) + \beta _t^{1/2} \sigma _{t-1}(x)\), and the difference between \(x^{I^*}\) and \(x^{I_t}\). A detailed proof of Theorem 1 can be found in Appendix A.

Note that the growth rate of the regret bound in T matches the existing results of Thompson sampling and GP-UCB in the standard BO setting shown by Russo and Roy (2014) and Srinivas et al. (2012), respectively, though their definitions of regret are different. This is because the ground-truth input distribution is given, and the uncertainty in estimation is therefore only in f, as in the standard BO setting. The regret bound is sublinear for the popular kernels, such as the linear kernel, Gaussian kernel, and Matérn kernel.

figure a

4.2 TSPSQ-unknown for unknown input distribution

Next, we consider the case where the joint input distribution P(X), or the set of conditional distributions \(\{P(X^{{\bar{I}}} \mid X^I)\}_{I \in {\mathcal {I}}}\), is unknown. We propose Thompson sampling with partially specified queries for the unknown input distribution (TSPSQ-unknown) based on a multi-armed bandit approach.

In the unknown input distribution case, a set of conditional distributions of uncontrol variables needs to be estimated. For example, when the size of uncontrol variables is d/2, there could be \(|{\mathcal {I}}| = \left( {\begin{array}{c}d\\ d/2\end{array}}\right) = \varOmega (2^{d/2})\) combinations of uncontrol variables. However, samples from a conditional distribution \(P(X^{{\bar{I}}} \mid X^{{I}}=x^I)\) may be uninformative for estimating other conditional distributions \(P(X^{{\bar{J}}} \mid X^{{J}})\) for \(I \ne J\) without any assumption (a further discussion can be found in Appendix C). Then, the estimation of different conditional distributions may be separately addressed, which requires an exponential number of samples with respect to d.

Therefore, to make the estimation statistically tractable, we assume only that input variables are independent of each other. That is, the cumulative distribution function \(F(X \le x)\) is decomposed as \(F(X \le x) = \prod _{i=1}^d F^{(i)}(X^{(i)} \le x^{(i)})\). Then, the number of distributions to be estimated does not increase exponentially with d. Note that we do not make further assumptions such as continuity or a certain form of the distribution.

4.2.1 Acquisition function

Let \({\bar{g}}_t\) be a function defined as

$$\begin{aligned}&{\bar{g}}_{t}\left( x^{I}, x^{{\bar{I}}}\right) = g_{t}\left( x^{I}, x^{{\bar{I}}}\right) + \sum _{i \in {\bar{I}}} \frac{\alpha _{t}}{\sqrt{|{\mathcal {S}}_{t-1}^{(i)}|}}, \end{aligned}$$

where \(g_t \sim {{\mathcal {G}}}{{\mathcal {P}}}(\mu _{t-1}, k_{t-1})\), \({\mathcal {S}}_{t-1}^{(i)}\) is a set of observations of the uncontrol variables until iteration \({t-1}\) for \(i \in [d]\), defined as \({\mathcal {S}}_{t-1}^{(i)} = \{X_s^{(i)} \mid i \in {\bar{I}}_s, s \in [t-1]\}\), and \(\alpha _t\) is a monotonically increasing function in t. TSPSQ-unknown determines the next control variables as

$$\begin{aligned} x^{I_{t}} = \underset{x^I \in {\mathcal {X}}^I, I \in {\mathcal {I}}}{{\text {argmax}}}~ {\mathbb {E}}_{X^{{\bar{I}}} \sim {\hat{F}}^{I}}\left[ {\bar{g}}_{t}\left( x^{I}, X^{{\bar{I}}}\right) \right] , \end{aligned}$$

where \({\hat{F}}^{I}\) denotes the empirical distribution function for I, defined as \({\hat{F}}^{I}(X^{I} \le x^{I}) = \prod _{i \in I} {\hat{F}}^{(i)}(X^{(i)} \le x^{(i)})\), \({\hat{F}}^{(i)}(X^{(i)} \le x^{(i)}) = {1}/{|{\mathcal {S}}_{t-1}^{(i)}|} \sum _{{\hat{x}}^{(i)} \in {\mathcal {S}}_{t-1}^{(i)}} \mathbb {1}[{\hat{x}}^{(i)} \le {x}^{(i)}]\). The pseudo-code is given in Algorithm 1.

The acquisition function of TSPSQ-unknown in Eq. (6) works similarly to that of TSPSQ-known in Eq. (2) for the known input distribution case. TSPSQ-unknown computes expectation using the empirical distribution. However, the empirical distribution may not be accurate when the number of samples is small. Based on the observation discussed in Sect. 3.2, we employ a UCB approach for multi-armed bandits (Auer et al. 2002). The second term on the right hand side of Eq. (5) encourages increasing the number of samples from the input distributions with few samples. A hyperparameter \(\alpha _t\) controls the exploration-exploitation trade-off in the distribution estimation: large \(\alpha _t\) improves the distribution estimation, while small \(\alpha _t\) attempts to achieve small regret.

TSPSQ-unknown avoids a regret bound exponentially dependent on the number of input variables d for the independence assumption, as shown later, because it does not need to consider the combinations of d input variables in the distribution estimation. However, the computational cost of TSPSQ-unknown in each iteration still exponentially depends on d due to the combinations of \({\mathcal {S}}^{(i)}\) for \(i \in {\bar{I}}\). As a simple case, when \(|{\bar{I}}|=d/2\) and \(|{\mathcal {S}}^{(i)}| = n\) for \(i \in {\bar{I}}\), the computation of the empirical expectation in Eq. (6) requires \(n^{d/2}\) evaluations of \({\bar{g}}\) for each \(x^I \in {\mathcal {X}}^I\) and \(I \in {\mathcal {I}}\). This computational cost could be soon intractable as n increases through iterations for large \({\mathcal {I}}\) and d. In general, an exact computation with small cost may not be possible; however, the cost can be reduced when some assumptions hold. For example, Kandasamy et al. (2015) and Mutny and Krause (2018) assumed the additivity of f with respect to each \(x^{(i)}\) for \(i \in [d]\), i.e., \(f(x) = \sum _{i \in [d]} f_i(x^{(i)})\). If that assumption holds, the computational cost is reduced to O(dn).

4.2.2 Regret bound

We then derive a regret bound for the unknown input distribution case.

Theorem 2

Let \({\mathcal {X}} \subset [0,1]^d, d \in {\mathbb {N}}\), be a compact domain. Assume \(k(x, x') \le 1\) and that there exist constants \(a,b,a',b'>0\) for the sample paths f and their partial derivatives such that Assumption (4) holds and

$$\begin{aligned} l>0, ~{\mathbb {P}} \left( \sup _{x \in {\mathcal {X}}} \left| f(x) \right| > l\right) \le a e^{-(l/b)^2}. \end{aligned}$$

Suppose an unknown cumulative distribution function of the input variables is factorized into \(F(X \le x) = \prod _{i=1}^d F^{(i)}(X^{(i)} \le x^{(i)})\). Define \(\alpha _t = 2 b' \log t\). Then, by running TSPSQ-unknown for T iterations, it holds that

$$\begin{aligned} \mathrm{BayesRegret}(T) = O\left( {\sqrt{d T \log T}\left( \sqrt{\gamma _T} + \sqrt{m \log T}\right) } \right) , \end{aligned}$$

where \(\gamma _T\) is the maximum information gain defined in Eq. (3) and \(m = \max _{I \in {\mathcal {I}}} |{\bar{I}}|\) is the maximum number of dimensions of uncontrol variables.

The main difference of the proof from that for Theorem 1 is the evaluation of the distribution estimation. We employ techniques in multi-armed bandit problems and the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al. 1956; Massart 1990) to derive the upper bound. The whole proof can be found in Appendix B.

The growth rate of the regret bound in T in Theorem 2 matches the existing results for the standard BO setting (Russo and Roy 2014; Srinivas et al. 2012), as with TSPSQ-known. This is because, in most cases, the first term corresponding to the function estimation is dominant compared to the second term corresponding to the input distribution estimation thanks to the independence assumption. Recall that \(m \le d\). Since d is typically moderate in BO, e.g., \(d<10\), \(\sqrt{m}\) is not large. Therefore, for the Gaussian kernel for example, where \({\gamma _T} = O\left( (\log T)^{(d+1)}\right)\), the first term \(\sqrt{d T (\log T)^{(d+2)}}\) can be major compared to the second term \({\sqrt{m d T \log ^2 T}}\) in the total regret bound. This argument also can be applied to the Matérn kernel. However, for the linear kernel, where \(\gamma _T = O(d \log T)\), the first and second terms are equal in growth rate for \(m=O(d)\).

The bound linearly depends on d for \(m=O(d)\), and no exponential term appears in d. As m decreases, the bound also decreases. This is because the learner can specify the values of the many input variables as desired and the error with respect to the distribution estimation becomes small.

5 Experiments

We demonstrate the effectiveness of the proposed algorithms for the known and unknown input distributions using a set of test functions and real-world datasets.

5.1 Comparing methods

We compare the proposed algorithms to four baselines: DropoutBO, WrapperBOs, RandomBO, and Random. DropoutBO (Li et al. 2017), which was originally proposed for the high-dimensional BO, sequentially determines a partially specified query by performing optimization over a low-dimensional input domain. In each iteration, DropoutBO randomly selects \(I_t \in {\mathcal {I}}\) and then performs the standard BO over \({\mathcal {X}}^I\) to determine the next control variables \(x^{I_t}\). WrapperBOs performs the standard BO over \({\mathcal {X}}^I\) separately for \(I \in {\mathcal {I}}\) and selects \(x^{I_t}\) based on a wrapper method (Guyon and Elisseeff 2003) as feature selection. For each \(I \in {\mathcal {I}}\), it applies GP regression to \(\{(x^I_i, y_i)\}_{i=1}^t\), observations of input variables of I, where \(x_i = (x^{I_i}, X^{\bar{I_i}})\). Then, it selects \(x^{I_t}\) using \(|{\mathcal {I}}|\) acquisition functions \(\{a^I\}_{I\in {\mathcal {I}}}\) as \(x^{I_t} = {\mathop {\mathrm{argmax}}\limits }_{I \in {\mathcal {I}}, x^I \in {\mathcal {X}}^I} a^I(x^I)\). RandomBO directly uses the standard BO method with a randomly determined control variable set. First, it determines the next d-dimensional input as in the standard BO, \(x_t = {\mathop {\mathrm{argmax}}\limits }_{x \in {\mathcal {X}}} a_t (x)\). Then, \(I_t\) is randomly chosen from a uniform distribution over \({\mathcal {I}}\). Subsequently, it queries for \(x_t^{I_t}\). Random simply determines the next control variables according to a uniform distribution.

We use the Gaussian kernel function for GP models and employ Thompson sampling as an acquisition function for all methods. In the preliminary experiment in Sect. 5.3, we set the hyperparameters of the kernel function to those obtained by the type II maximum likelihood estimation using thousands of data points. In the other experiments in Sects. 5.4 and 5.5, the hyperparameters are learned every 10 iterations by using the type II likelihood estimation.

5.2 Approximation of sample paths

All methods except Random use a sample path \(g_t \sim {{\mathcal {G}}}{{\mathcal {P}}}(\mu _{t-1}, k_{t-1})\). By approximating the GP with a finite-dimensional Bayesian linear model, we can efficiently optimize \(g_t\) (Hernández-Lobato et al. 2014). Bochner’s theorem (Bochner 1959) guarantees that for any continuous shift-invariant kernel \(k(x-y)\), its Fourier transform is a non-negative measure \({\hat{k}}(\omega )/\alpha\), where \(\alpha = \int _{\omega }{\hat{k}}(\omega )d\omega\) is a normalization constant. Random Fourier features (Rahimi and Recht 2007) approximate the kernel function as

$$\begin{aligned} k(x-y) = \alpha {\mathbb {E}}[e^{-i\omega ^\top (x-y)}] \approx \varPhi (x)^\top \varPhi (y), \end{aligned}$$

where \(\varPhi (x) = \sqrt{{2\alpha }/{M}}\big (\sin {(\omega _{1}^\top x)}, \cos {(\omega _1^\top x)},\) \(\ldots , \sin {(\omega _{M/2}^\top x)}, \cos {(\omega _{M/2}^\top x )}\big )^\top\) and \(\{\omega _i\}_{i=1}^{M/2} \overset{\text {i.i.d.}}{\sim } {\hat{k}}(\omega )\). Then, the GP is approximated with an M-dimensional Bayesian linear model. The sample path is approximated as

$$\begin{aligned} g_t(x) \approx \varPhi (x)^\top \theta _t,\quad \theta _t \sim {\mathcal {N}}({\tilde{\mu }}_t, {\tilde{\varSigma }}_t), \end{aligned}$$

where \({\tilde{\mu }}_{t} = A^{-1} \varPhi \left( {X}_{{t-1}}\right) ^{\top } Y_{t-1}\), \({\tilde{\varSigma }}_{t} = \sigma ^2 A^{-1}\), \(A = {\varPhi \left( {X}_{{t-1}}\right) ^{\top } \varPhi \left( {X}_{{t-1}}\right) + \sigma ^{2} {I}}\), \(\varPhi \left( {X}_{{t-1}}\right) =\left( \varPhi \left( x_{1}\right) , \ldots \varPhi \left( x_{{t-1}}\right) \right) ^{\top } \in {\mathbb {R}}^{({t-1}) \times M}\), and \(Y_{t-1} = (y_1, \ldots , y_{t-1})^\top\). The random Fourier features enable us to deal with a large number of samples. In the experiments, we set \(M=1000\).

5.3 Preliminary experiments

Here we study the effectiveness of our algorithms for known and unknown input distributions under simple settings. We use the Branin-Hoo function (Picheny et al. 2013) (Fig. 1(left)) over a two-dimensional input domain as a black-box target function without observation noise. We model a joint distribution of input variables with a Gaussian distribution \({\mathcal {N}}(\mu , \varSigma )\) truncated over \([0,1]^2\), where we set \(\mu = {(0.5,0.5)}^\top\) and \(\varSigma = ({(0.01, 0)}^\top , {(0, 0.05)}^\top )\) (Fig. 1 (right)). We set a family of control variable sets as \({\mathcal {I}} = \{\{1\}, \{2\}\}\). The Branin-Hoo function has the three minimum points (dots in Fig. 1 (left)); however, the optimal control variable differs from them (line in Fig. 1 (left)). In Theorem 2 for TSPSQ-unknown, \(b' > 0\) to define \(\alpha _t = 2 b' \log t\) may not be known in advance. Hence, we change the value of a hyperparameter c for \(\alpha _t = c \log t\) and study the difference in performance. We set \(T=100\) and repeat each experiment 30 times.

The results for the known input distribution case are shown in Fig. 2, where the lines and shaded areas are the means and standard deviations for 30 trials, respectively. TSPSQ-known achieved small regret in the early iterations and outperformed the baselines by finding the nearly optimal solution. Figure 3 (left) shows the cumulative regret at the 100th iteration of TSPSQ-unknown, with different values of the hyperparameter c for the unknown input distribution case. We observed that the smaller c is better than the larger c in the mean for 30 trials. In particular, TSPSQ-unknown with \(c=0.12\) worked the best and found the nearly optimal solution, as shown in Fig. 3 (right), by controlling the balance between exploration and exploitation in the distribution estimation. The result of TSPSQ-unknown with \(c=0.12\) is competitive with that of TSPSQ-known for the known input distribution. The larger c values suffered from the higher regret because they heavily focused on exploration and obtained low rewards; however, the result is still better than the baselines. The variance of small c was large because it heavily focused on exploitation and sometimes fell into the local optimum.

Fig. 1
figure 1

Branin-Hoo function (left) and its input density function (right)

Fig. 2
figure 2

Cumulative regret for the Branin-Hoo function with the known input distribution

Fig. 3
figure 3

Cumulative regret at 100th iteration (left) and the cumulative regret over the iterations (right) of TSPSQ-unknown with different hyperparameter values c for the Branin-Hoo function with the unknown input distribution

5.4 Experiments using test functions

Next, we conduct numerical simulations using two test functions: a cosine mixture function (Anderson et al. 2000) over a two-dimensional domain and the Rosenbrock function (Picheny et al. 2013) over a four-dimensional domain. We model a joint distribution of input variables with a Gaussian distribution \({\mathcal {N}}(\mu , \varSigma )\), where we set \(\mu = {(0.7,0.7)}^\top\), \(\varSigma = ({(0.01, 0)}^\top , {(0, 0.05)}^\top )\) for the cosine mixture function and \(\mu _i = 0.7\), \(\varSigma _{i,i} = 0.01, \varSigma _{i,j} = 0\), for \(i,j \in [4], i\ne j\) for the Rosenbrock function. We set a family of control variable sets \({\mathcal {I}}\) as a set of combinations of 1 of 2 input variables for the cosine mixture function and a set of combinations of 2 of 4 input variables for the Rosenbrock function. We set the observation noise variance \(\sigma ^2 = 10^{-3}\) for both functions, following Hernández-Lobato et al. (2014). According to the results in Sect. 5.3, we set \(\alpha _t = 0.12 \log t\) for TSPSQ-unknown. We set \(T=100\) and repeat each experiment iterations 30 times.

Figures 4 and 5 show the cumulative regret for the known and unknown input distribution cases for the cosine mixture function and the Rosenbrock function, respectively. Note that the performance of the baselines does not change for both cases. The proposed algorithms, TSPSQ-known and TSPSQ-unknown, outperformed the baselines and achieved small regret. They finally yielded the nearly optimal solutions, while the baselines increased their cumulative regrets linearly. TSPSQ-unknown is again competitive with TSPSQ-known although it is not given the ground-truth input distribution. This result might be because the error in the distribution estimation is much smaller than the error in the function estimation, as discussed in Sect. 4.2.2.

Fig. 4
figure 4

Cumulative regret for the cosine mixture function

Fig. 5
figure 5

Cumulative regret for the Rosenbrock function

5.5 Experiments using real-world datasets

Finally, we conduct experiments using two real-world datasets: the airfoil self-noise datasetFootnote 1 and the word similarity dataset (Snow et al. 2008). The airfoil self-noise dataset is a set of airfoil self-noise as an output variable and five input variables including angle of attack and chord length. The goal is to design airfoils with low self-noise via the five input variables in an outsourcing scenario as described in Sect. 1. The word similarity dataset is a set of answers given by workers for word pair similarity tasks in crowdsourcing. In a task, a word pair is given to a worker, and the worker assesses their similarity with an integer from 0 to 10, where a large number indicates stronger similarity. We randomly select 10 of 30 answers as the input variables. For the output variable, we calculate the negative mean of the absolute difference between the remaining 20 answers and the true answers so that the good workers have the high output value. The goal is to find good workers whose answers are accurate via the 10 input variables as described in Sect. 1. We set a family of control variable sets \({\mathcal {I}}\) as a set of combinations of 2 of 5 input variables for the airfoil self-noise dataset and a set of combinations of 5 of 10 input variables for the word similarity dataset. To reduce the computational cost, we randomly reduce the size of the family to 10 for the word similarity dataset. The discussion on large \(|{\mathcal {I}}|\) can be found in Sect. 4.2.1. For these datasets, the ground-truth functions and the ground-truth input distributions are unknown and only available through noisy observations. Output data corresponding to the input points we query for is not always available. Therefore, we substitute a posterior mean function of a GP obtained by learning each dataset as a ground-truth functionFootnote 2. The observation noise variance is also set to the learned parameter. For a proxy for the ground-truth input distributions, we apply kernel density estimation with the Gaussian kernel to the datasets, whose bandwidth is obtained using the median heuristic (Garreau et al. 2017). We set \(\alpha _t = 0.12 \log t\) for TSPSQ-unknown based on the results in Sect. 5.3, \(T=100\), and repeat each experiment 30 times.

Figures 6 and 7 illustrate the cumulative regret for the known and unknown input distribution cases for the airfoil self-noise function and the word similarity function, respectively. Both TSPSQ-known and TSPSQ-unknown performed better than the baselines. In the early iterations, there was no significant difference in regret for all methods, but the proposed algorithms were able to keep their regrets small after 20 iterations, as shown in Fig. 6, and after 30 iterations, as shown in Fig. 7. TSPSQ-unknown is competitive with TSPSQ-known in these experiments, possibly due to the more difficult function estimation than the distribution estimation, as discussed in Sect. 4.2.2.

Fig. 6
figure 6

Cumulative regret for the airfoil self-noise function

Fig. 7
figure 7

Cumulative regret for the word similarity function

6 Related work

There is a significant body of literature on BO, where a learner has control of only some of the input values but not the rest. Prior works are classified into two settings based on whether the remaining input values are revealed to the learner to determine the controllable input values.

First, in contextual GP bandits (Krause and Ong 2011), a black-box function takes control variables and context information as input. In each iteration, the learner receives context information and then determines the values of control variables. In BOPSQ, a learner has no access to the context information.

Second, several works tackled situations where a learner determines the controllable input values without knowing the remaining input values, as well as BOPSQ. In BO with environmental variables (Williams et al. 2000), the objective is to maximize the expected cumulative reward on the remaining input values determined according to a distribution. This setting can be seen as a special case of BOPSQ, as shown in Sect. 3. In a game theory setting (Sessa et al. 2019; Dai et al. 2020), each agent determines its action sequence to maximize each cumulative reward against the other agents. Bogunovic et al. (2018) dealt with an adversarial setting, where an adversary determines the remaining input values to minimize the learner’s reward. Kirschner et al. (2020) and Nguyen et al. (2020) extended this adversarial setting into a setting where an adversary selects a distribution of the remaining input values. These works are closely related to BOPSQ; however, a learner can not select input variables to determine their values. BOPSQ involves the feature selection procedure.

Causal bandit (Lattimore et al. 2016) for causal intervention is also closely related to BOPSQ, which involves a feature selection procedure. In the problem setting, a causal model is provided to a learner as a directed acyclic graph over random variables of input and output variables. In each iteration, the learner specifies values of any subset of input variables and then observes the remaining subset of input variables and the output variable. The objective is to maximize the expectation of the output variable conditioned on specified values. The primary difference is that BOPSQ deals with infinite input spaces as a GP bandit, while causal bandit deals with finite input spaces as a multi-armed bandit.

Noisy inputs have been tackled by several works (Oliveira et al. 2019; Luong et al. 2020). In BO with noisy inputs (Oliveira et al. 2019), the input values are perturbed, and a black-box function takes the unobservable perturbed values as input. In BO with missing inputs (Luong et al. 2020), some of the input values are replaced with unobservable random values. The input values in BOPSQ are also random; however, all input variables are controllable for a learner in those works, although input values are noisy. Further, feature selection is not addressed in those works.

The feature selection problem (Guyon and Elisseeff 2003) is related to BOPSQ in a wide sense, which is the task of selecting a subset of features that are most relevant to an output. Chen et al. (2012) proposed a method that jointly optimizes the objective and selects features for high-dimensional BO where it is assumed that only a subset of input variables are associated with a black-box function. Li et al. (2017) also dealt with high-dimensional BO by performing optimization only with respect to randomly chosen features. Random values or the values of the input variables with the best function value so far are used as the values of the unchosen features. However, these works operate in the standard BO setting: a learner determines all values of input variables. Further, the methods do not deal with the random input variables.

7 Conclusion

We proposed BOPSQ, a novel problem setting that utilizes partially specified queries for the situations in which the specification of all input variables is expensive or difficult. BOPSQ involves a features selection procedure, and values of unspecified input variables are randomly determined. In particular, BOPSQ with unknown input distribution corresponds to a mixture of BO and multi-armed bandits. We proposed the algorithms based on posterior sampling, TSPSQ-known and TSPSQ-unknown for the known and unknown input distribution cases. We derived their regret upper bounds that are sublinear for the popular kernels. In the experiments, we demonstrated the effectiveness of the proposed algorithms using the test functions and the real-world datasets.

In BOPSQ, uncontrol variables are sampled from a conditional distribution described by \(X^{{\bar{I}}} \sim P(X^{{\bar{I}}} \mid X^I = x^I)\). For this sampling, we may consider another scenario: a causal intervention in a directed acyclic graph over random variables (Lattimore et al. 2016), described by \(X^{{\bar{I}}} \sim P(X^{{\bar{I}}} \mid \text {do}(X^I = x^I))\). If the input distribution is neither known nor independent, the intervention scenario would differ from the scenario in this paper.

To make BOPSQ applicable to many real-world problems, it is important to extend BOPSQ to the setting in which selection costs of control variable sets are different (Tran-Thanh et al. 2012). In the problem setting of BOPSQ, in principle, we assume that the cost is uniform; however, specifying the values of more control variables would be more costly in many real-world problems. Handling unbounded domains (Shahriari et al. 2016a) is another important extension of BOPSQ. We may not know the unbounded domain, especially for the unknown input distribution case.