Bayesian optimization with partially specified queries

Bayesian optimization (BO) is an approach to optimizing an expensive-to-evaluate black-box function that sequentially determines the values of input variables at which to evaluate the function. However, specifying values for all input variables can be expensive or even infeasible in some cases, for example, in outsourcing scenarios where producing input queries with many specified variables incurs significant cost. In this paper, we propose a novel Gaussian process bandit problem, BO with partially specified queries (BOPSQ). In BOPSQ, unlike the standard BO setting, a learner specifies only the values of some input variables, and the values of the unspecified input variables are randomly determined according to a known or unknown distribution. We propose two algorithms based on posterior sampling for the cases of known and unknown input distributions. We further derive their regret bounds, which are sublinear for popular kernels. We demonstrate the effectiveness of the proposed algorithms using test functions and real-world datasets.


Introduction
Bayesian optimization (BO) (Shahriari et al. 2016b) is a promising approach for black-box optimization of an expensive-to-evaluate unknown function. Its applications include hyperparameter tuning of machine learning models for the highest predictive accuracy (Snoek et al. 2012) and searching for compound structures with desirable properties (Korovina et al. 2020). BO is conducted in an iterative manner. In each iteration, a BO method determines the values of input variables in an input domain X ⊂ ℝ^d, evaluates a black-box function f : X → ℝ at that input query, and observes the corresponding output y ∈ ℝ. The final goal is to find the input that maximizes the function with as few function evaluations as possible.
Although a learner can fully specify all values of the input variables in the standard BO setting, this procedure is expensive or difficult in some cases. One example is crowdsourcing, by which human-intelligence tasks are outsourced to an undetermined number of people on the Internet. Most existing crowdsourcing platforms allow us to ask workers to perform a task by specifying conditions such as age, country, and experience to obtain high-quality deliverables (Li et al. 2014). However, if the conditions are too strict, there is a risk that no workers will be available to participate in the task. Therefore, we specify only some conditions to increase the probability that a worker who meets them will participate in the task. Upon completion of the task by a worker, we observe the quality of the deliverable and the unspecified conditions of the worker.
Another example is the design of airfoils with low self-noise in an outsourcing scenario. We ask a manufacturer to produce an airfoil by specifying the values of various variables, including the angle of attack and chord length. However, specifying the values of many variables in outsourcing incurs high cost, and our budget is often limited. Therefore, we are restricted to specifying the values of only certain variables. After obtaining the ordered airfoil, the values of the unspecified variables are revealed, and the self-noise of the airfoil is measured via an acoustic test.
In this paper, we propose a novel Gaussian process (GP) bandit framework, which we call Bayesian optimization with partially specified queries (BOPSQ). In BOPSQ, unlike the standard BO setting, a learner selects a subset of input variables and specifies their values. We call such an input query a partially specified query. Next, the learner observes the values of the unspecified input variables, which are determined according to a known or unknown distribution. Then, the learner evaluates the black-box function at the resulting input values to obtain the corresponding output. Hence, the learner must consider both the extent to which the input variables contribute to the function's output and which values the unspecified input variables are likely to take. Williams et al. (2000) and Lattimore et al. (2016) also handled partially specified queries. However, in their work, the input variables whose values are specified are fixed, or infinite input spaces are not considered for the GP bandit.
We propose two algorithms for the known and unknown input distribution cases. In the case of a known input distribution, the uncertainty lies only in the function estimation, and we can naturally extend a posterior sampling approach to this setting. In the case of an unknown input distribution, the learner must additionally account for the uncertainty in the distribution estimation. We adopt a multi-armed bandit approach to appropriately explore the input distribution.
The contributions of the present study are summarized as follows:
1. We propose a novel GP bandit framework with partially specified queries, which is a natural generalization of the standard BO problem and BO with environmental variables (Williams et al. 2000).
2. We propose two algorithms for the known and unknown input distribution cases, with sublinear regret bounds for popular kernels.
3. We demonstrate the empirical effectiveness of the proposed algorithms using a set of test functions and real-world datasets.
The rest of the paper is organized as follows. We briefly review the standard BO setting and approaches in Sect. 2. Section 3 formalizes the problem setting of BOPSQ as an extension of the standard BO setting. In Sect. 4, we propose the two algorithms for the known and unknown input distribution cases with theoretical guarantees. We validate the effectiveness of the proposed algorithms in Sect. 5. Section 6 summarizes related work. We provide concluding remarks in Sect. 7.
Review of Bayesian optimization

BO (Srinivas et al. 2012; Shahriari et al. 2016b) is an approach for interactively finding the maximizer of an expensive-to-evaluate black-box function. Let X ⊂ ℝ^d be a d-dimensional compact set and f : X → ℝ be an unknown real-valued function. In each iteration, a learner determines the values of input variables x ∈ X. The learner evaluates the function at x and obtains an output y = f(x) + ε, where ε ∼ N(0, σ²). The goal is to find the maximizer with as few evaluations as possible. BO methods typically assume a GP (Rasmussen and Williams 2006) as a prior of the black-box function: f ∼ GP(μ₀, k), where μ₀ : X → ℝ is a mean function and k : X × X → ℝ is a covariance function. Without loss of generality, we assume a zero mean μ₀ = 0. Suppose that a set of t input-output observations D_t = {(x_i, y_i)}_{i=1}^t is given. Then, we have a posterior mean μ_t(x), posterior variance σ_t²(x), and posterior covariance k_t(x, y) as follows:

μ_t(x) = k_x^⊤ (K + σ² I)^{-1} y_t,
σ_t²(x) = k(x, x) − k_x^⊤ (K + σ² I)^{-1} k_x,
k_t(x, y) = k(x, y) − k_x^⊤ (K + σ² I)^{-1} k_y,

where k_x is a t-dimensional vector whose i-th entry is k(x_i, x), K is a t × t matrix whose (i, j)-th entry is k(x_i, x_j), I is an identity matrix, and y_t is a t-dimensional vector whose i-th entry is y_i.
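As a concrete illustration of the posterior formulas above, the following sketch computes the GP posterior mean and variance with NumPy. The Gaussian kernel, lengthscale, and test points are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    # k(x, y) = exp(-0.5 * ||x - y||^2 / l^2), computed pairwise
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-3):
    # Posterior mean and variance at X_test given t noisy observations
    K = gaussian_kernel(X_train, X_train)         # t x t Gram matrix
    k_star = gaussian_kernel(X_train, X_test)     # t x m cross-covariances
    A = np.linalg.solve(K + noise_var * np.eye(len(X_train)), k_star)
    mean = A.T @ y_train                          # k_x^T (K + s^2 I)^-1 y_t
    var = 1.0 - np.einsum('ij,ij->j', k_star, A)  # k(x, x) = 1 for this kernel
    return mean, var

X = np.array([[0.1], [0.5], [0.9]])
y = np.sin(2 * np.pi * X[:, 0])
mu, var = gp_posterior(X, y, np.array([[0.5]]))
# At an observed point with small noise, the posterior variance is small.
```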
BO methods often use an acquisition function a_{t+1} : X → ℝ that quantifies how promising it is to evaluate an input next. The next input query is determined as

x_{t+1} = argmax_{x ∈ X} a_{t+1}(x).

Popular choices for the acquisition function are the Gaussian process upper confidence bound (GP-UCB) (Srinivas et al. 2012), the probability of improvement (Kushner 1964), and the expected improvement (Mockus et al. 1978; Jones et al. 1998). Thompson sampling (Thompson 1933; Russo and Roy 2014) based on a posterior sample g_{t+1} ∼ GP(μ_t, k_t) is also widely used, and its acquisition function is defined as

a_{t+1}(x) = g_{t+1}(x).

Problem setting
We propose a novel GP bandit problem that utilizes partially specified queries, BOPSQ.We first define some terms and notations.Then, we introduce the framework and define regret for BOPSQ.

Notation

Let [d] = {1, …, d} denote the index set of input variables and let I ⊂ 2^[d] be a family of control variable sets. We call the values of a control (uncontrol) variable set control (uncontrol) variables, denoted by x_I (x_Ī), where Ī = [d] \ I. Let X = (X^{(1)}, …, X^{(d)})^⊤ ∼ P(X) be a random vector distributed according to an input distribution P over X, and let Y be a random variable over ℝ corresponding to the output variable. With a slight abuse of notation, we write f(x_I, x_Ī) for f(x).

Framework
The optimization procedure consists of T iterations. A learner is given a non-empty family of control variable sets I ⊂ 2^[d], which we assume is static over the T iterations for simplicity in this paper. In each iteration, the learner selects a control variable set I ∈ I and specifies its values x_I ∈ X_I. Next, the learner observes the uncontrol variables X_Ī drawn from a conditional distribution P(X_Ī | X_I = x_I), which is either known or unknown. By evaluating the black-box function f at the input variables (x_I, X_Ī), the learner observes the corresponding output Y = f(x_I, X_Ī) + ε, where ε ∼ N(0, σ²).
In the standard BO setting, the goal is to find the optimal input variables x*. However, in BOPSQ, the learner may not always be able to select x* and obtain f(x*) because the input query is partially random. Therefore, as the objective function, we consider the conditional expectation of f(x_I, X_Ī) over X_Ī given X_I = x_I. We define the goal as finding, in as few evaluations as possible, the optimal control variable set I* ∈ I and control variables x_{I*} ∈ X_{I*}, defined as

(I*, x_{I*}) = argmax_{I ∈ I, x_I ∈ X_I} E[f(x_I, X_Ī) | X_I = x_I].    (1)

Hereafter, we write x_{I*} instead of (I*, x_{I*}) for simplicity. BOPSQ in the unknown input distribution case can be seen as a mixture of BO and multi-armed bandits. By selecting an action x_I from an infinite action space X_I, a learner observes information f(x_I, X_Ī), which corresponds to the standard BO. By selecting an action I (and x_I) from a finite action space I, a learner observes information X_Ī ∼ P(X_Ī | X_I = x_I), which corresponds to a multi-armed bandit problem. Some specific choices of the family of control variable sets I recover existing BO settings. The simple setting in which the learner can select all the input variables as a control variable set, I = {[d]}, corresponds to the standard BO setting. When the family of control variable sets is fixed to one particular subset, I = {I}, I ⊂ [d], 0 < |I| < d, the setting reduces to BO with environmental variables (Williams et al. 2000). In this sense, BOPSQ is a generalization of these BO settings. The major difference appears when the learner is required to select a control variable set for |I| > 1. Moreover, when the input distribution P(X) is unknown, the learner must account for the uncertainty of the input distribution estimation as well as the function estimation.
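To make the objective in Eq. (1) concrete, the following sketch estimates the conditional expectation E[f(x_I, X_Ī) | X_I = x_I] by Monte Carlo for a toy quadratic f and independent uniform inputs; the function, distribution, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy black-box function on [0, 1]^2, maximized at (0.7, 0.7)
    return -np.sum((x - 0.7) ** 2, axis=-1)

def objective(I, x_I, n_samples=100000):
    # Monte Carlo estimate of E[f(x_I, X_Ibar) | X_I = x_I]
    X = rng.uniform(0.0, 1.0, size=(n_samples, 2))  # uncontrolled draws
    X[:, I] = x_I                                   # clamp controlled entries
    return f(X).mean()

# Controlling variable 0 at its optimum beats leaving everything random:
v_controlled = objective([0], [0.7])
v_random = f(rng.uniform(0.0, 1.0, size=(100000, 2))).mean()
```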
In practice, the sizes of all control variable sets may be the same, i.e., I = {I | |I| = k} for 0 < k < d. This is because control variable sets that are subsets of other control variable sets are useless. Let I ⊂ J ∈ I, that is, let I be a subset of J. It is clear that the maximum of the objective function in Eq. (1) for I is not larger than that for J, i.e., max_{x_I ∈ X_I} E[f(x_I, X_Ī) | X_I = x_I] ≤ max_{x_J ∈ X_J} E[f(x_J, X_J̄) | X_J = x_J]. Further, we typically want to specify the values of as many input variables as possible in black-box optimization. Therefore, in the experiments, we set the sizes of all control variable sets to be the same.

Regret
The final goal is to find the optimal control variables x_{I*} defined in Eq. (1). We define the instantaneous regret IR(t) in iteration t, which evaluates the gap between the optimal solution and the choice x_{I_t}, as

IR(t) = E[f(x_{I*}, X_{Ī*}) | X_{I*} = x_{I*}] − E[f(x_{I_t}, X_{Ī_t}) | X_{I_t} = x_{I_t}].

The resulting performance measure, the Bayes cumulative regret BayesRegret(T), is the expectation of the sum of the instantaneous regrets over T iterations, defined as

BayesRegret(T) = E[∑_{t=1}^T IR(t)].

Our target is to design algorithms that achieve sublinear regret, that is, BayesRegret(T)/T → 0 as T → ∞. In some situations, one may prefer simple regret, the best instantaneous regret so far, min_{t=1,…,T} IR(t), to cumulative regret. However, sublinear cumulative regret guarantees that simple regret asymptotically converges to 0, since min_{t=1,…,T} IR(t) ≤ BayesRegret(T)/T.
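The inequality min_{t≤T} IR(t) ≤ BayesRegret(T)/T quoted above holds because the minimum of nonnegative terms is at most their average; a quick numeric check on a synthetic regret sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
IR = rng.uniform(0.0, 1.0, size=100)  # synthetic nonnegative instantaneous regrets
cum = np.cumsum(IR)                   # cumulative regret after each iteration
# min_{t <= T} IR(t) <= Regret(T) / T for every horizon T
ok = all(IR[:T].min() <= cum[T - 1] / T for T in range(1, 101))
```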

Algorithms
We consider the two cases where the input distribution is known and where it is unknown, propose a posterior-sampling-based algorithm for each case, and provide regret bounds that are sublinear for popular kernels.

TSPSQ-known for known input distribution
We first focus on the case where the joint input distribution P(X), or the set of conditional distributions {P(X_Ī | X_I)}_{I∈I}, is known. For example, in the airfoil design scenario, this corresponds to the situation in which the manufacturer publishes the random options for the unspecified features. We propose an algorithm based on Thompson sampling, which we call Thompson sampling with partially specified queries for the known input distribution case (TSPSQ-known).

Acquisition function
In iteration t, TSPSQ-known determines the next control variables as

x_{I_t} = argmax_{I ∈ I, x_I ∈ X_I} E[g_t(x_I, X_Ī) | X_I = x_I],    (2)

where g_t is a sample from the GP posterior, g_t ∼ GP(μ_{t−1}, k_{t−1}). The pseudo-code is given in Algorithm 1.
Equation (2) approaches the ground-truth objective in Eq. (1) as the posterior sample g_t approximates the ground-truth function f better with more observations. TSPSQ-known is a natural extension of Thompson sampling; if we set I = {[d]} as in the standard BO setting, the acquisition function in Eq. (2) reduces to g_t(x).
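A minimal sketch of this acquisition step: draw one posterior sample g_t (a toy quadratic stands in for an actual GP draw here), then maximize its conditional expectation over a grid of control variable sets and values. The grid, the uniform input distribution, and g_t itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.uniform(0.0, 1.0, size=(5000, 2))  # shared Monte Carlo draws

def g_t(x):
    # Stand-in for a posterior sample; maximized at (0.3, 0.8)
    return -np.sum((x - np.array([0.3, 0.8])) ** 2, axis=-1)

def expected_g(I, x_I):
    # E[g_t(x_I, X_Ibar)] under the known (uniform) input distribution
    X = U.copy()
    X[:, I] = x_I
    return g_t(X).mean()

# Candidates: control variable 0 or variable 1, with a value on a grid
grid = np.linspace(0.0, 1.0, 21)
candidates = [([i], (v,)) for i in (0, 1) for v in grid]
best = max(candidates, key=lambda c: expected_g(*c))
# Controlling variable 1 wins: leaving it random incurs the larger expected loss.
```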

Regret bound
We first introduce the notion of maximum information gain, which is the basis of typical regret analyses for BO. The maximum information gain in iteration T, denoted by γ_T, is the maximum possible information gain achievable by any algorithm for f via queries A = {x_1, x_2, …, x_T} ⊂ X and the corresponding outputs y_A = {y_1, y_2, …, y_T}, defined as

γ_T = max_{A ⊂ X, |A| = T} I(f; y_A).    (3)

Here, I(f; y_A) denotes the mutual information between f and y_A. Srinivas et al. (2012) provided bounds on γ_T for various types of kernels: γ_T = O(d log T) for the linear kernel k(x, y) = x^⊤ y, γ_T = O((log T)^{d+1}) for the Gaussian kernel k(x, y) = exp(−0.5‖x − y‖²/l²), and γ_T = O(T^{d(d+1)/(2ν+d(d+1))} log T) for the Matérn kernel k(x, y) = (2^{1−ν}/Γ(ν)) r^ν B_ν(r), where r = (√(2ν)/l)‖x − y‖, B_ν is the modified Bessel function, and l, ν > 0. We then derive the regret bound of TSPSQ-known for the case where the ground-truth input distribution is known.
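For a GP with noise variance σ², the information gain of a query set A has the closed form I(f; y_A) = ½ log det(I + σ⁻² K_A) (Srinivas et al. 2012). The sketch below evaluates it for spread-out versus repeated queries under a Gaussian kernel; the points and lengthscale are illustrative.

```python
import numpy as np

def gaussian_kernel(A, lengthscale=0.2):
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def info_gain(K, noise_var):
    # I(f; y_A) = 0.5 * log det(I + K / noise_var)
    _, logdet = np.linalg.slogdet(np.eye(K.shape[0]) + K / noise_var)
    return 0.5 * logdet

A_spread = np.linspace(0.0, 1.0, 5)[:, None]  # well-spread queries
A_repeat = np.full((5, 1), 0.5)               # the same query five times
gain_spread = info_gain(gaussian_kernel(A_spread), 0.01)
gain_repeat = info_gain(gaussian_kernel(A_repeat), 0.01)
# Spread-out queries are more informative than repeated ones.
```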
Theorem 1 Let X ⊂ [0, 1]^d, d ∈ ℕ, be a compact domain. Assume k(x, x′) ≤ 1 and that there exist constants a′, b′ > 0 for the partial derivatives of sample paths f such that

P(sup_{x ∈ X} |∂f/∂x^{(j)}| > L) ≤ a′ e^{−(L/b′)²}, j = 1, …, d.

Then, by running TSPSQ-known for T iterations, it holds that

BayesRegret(T) = O(√(dTγ_T log T)).

The analysis is mainly based on the techniques of Russo and Roy (2014) and Srinivas et al. (2012), where the cumulative regret is decomposed into terms with respect to the discretization errors of X, the difference between f and the upper confidence bound, and the difference between x_{I*} and x_{I_t}. A detailed proof of Theorem 1 can be found in Appendix A.
Note that the growth rate of the regret bound in T matches the existing results for Thompson sampling and GP-UCB in the standard BO setting shown by Russo and Roy (2014) and Srinivas et al. (2012), respectively, although their definitions of regret differ. This is because the ground-truth input distribution is given, and the uncertainty in estimation therefore lies only in f, as in the standard BO setting. The regret bound is sublinear for the popular kernels, such as the linear kernel, Gaussian kernel, and Matérn kernel.

Algorithm 1 TSPSQ-known (TSPSQ-unknown)
Input: A GP prior GP(μ₀, k₀), a query budget T, and a family of control variable sets I (and a set of conditional distributions {P(X_Ī | X_I)}_{I∈I} for TSPSQ-known)
for t = 1, …, T do
  Determine the next control variables x_{I_t} using Eq. (2) for TSPSQ-known (or Eq. (6) for TSPSQ-unknown)
  Observe the uncontrol variables X_{Ī_t}
  Evaluate f at (x_{I_t}, X_{Ī_t}), and observe the corresponding output
end for

TSPSQ-unknown for unknown input distribution
Next, we consider the case where the joint input distribution P(X), or the set of conditional distributions {P(X_Ī | X_I)}_{I∈I}, is unknown. We propose Thompson sampling with partially specified queries for the unknown input distribution case (TSPSQ-unknown), based on a multi-armed bandit approach.
In the unknown input distribution case, the set of conditional distributions of the uncontrol variables needs to be estimated. For example, when the size of the uncontrol variable set is d/2, there are (d choose d/2) combinations of uncontrol variables. However, samples from a conditional distribution P(X_Ī | X_I = x_I) may be uninformative for estimating other conditional distributions P(X_J̄ | X_J) for I ≠ J without any assumption (a further discussion can be found in Appendix C). The estimation of different conditional distributions may then have to be addressed separately, which requires a number of samples exponential in d. Therefore, to make the estimation statistically tractable, we assume only that the input variables are independent of each other. That is, the cumulative distribution function is factorized as

F(x) = ∏_{i∈[d]} F^{(i)}(x^{(i)}),    (4)

where F^{(i)} is the marginal distribution function of X^{(i)}. Then, the number of distributions to be estimated does not increase exponentially with d. Note that we do not make further assumptions such as continuity or a particular parametric form of the distribution.
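Under this independence assumption, each marginal can be estimated from whatever coordinates have been observed so far, and a joint sample is drawn from the product of the per-dimension empirical distributions. A sketch with illustrative names and a toy Gaussian input distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
hidden = rng.normal(0.5, 0.1, size=(2000, d))  # unknown input distribution

# Per-dimension samples observed so far (sizes may differ across dimensions)
observed = [hidden[:500, i] for i in range(d)]

def resample(n):
    # Draw from the product of per-dimension empirical distributions
    cols = [rng.choice(obs, size=n) for obs in observed]
    return np.stack(cols, axis=1)

X = resample(1000)  # joint samples consistent with the factorized estimate
```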

Acquisition function
Let ḡ_t be a function defined as

ḡ_t(x_I, x_Ī) = g_t(x_I, x_Ī) + β_t ∑_{i∈Ī} 1/√(n_{t−1}^{(i)}),    (5)

where g_t ∼ GP(μ_{t−1}, k_{t−1}) is a posterior sample, n_{t−1}^{(i)} is the number of observations of the i-th input variable so far, and β_t is a monotonically increasing function in t. TSPSQ-unknown determines the next control variables as

x_{I_t} = argmax_{I ∈ I, x_I ∈ X_I} E_{F̂_Ī}[ḡ_t(x_I, X_Ī)],    (6)

where F̂_Ī denotes the empirical distribution function for Ī, defined as F̂_Ī(x_Ī) = ∏_{i∈Ī} F̂^{(i)}(x^{(i)}), with F̂^{(i)} the empirical distribution function of the observed samples of the i-th input variable. The acquisition function of TSPSQ-unknown in Eq. (6) works similarly to that of TSPSQ-known in Eq. (2) for the known input distribution case. TSPSQ-unknown computes the expectation using the empirical distribution. However, the empirical distribution may not be accurate when the number of samples is small. Based on the observation discussed in Sect. 3.2, we employ a UCB approach for multi-armed bandits (Auer et al. 2002). The second term on the right-hand side of Eq. (5) encourages increasing the number of samples from the input distributions with few samples. The hyperparameter β_t controls the exploration-exploitation trade-off in the distribution estimation: a large β_t improves the distribution estimation, while a small β_t attempts to achieve small regret.
TSPSQ-unknown avoids a regret bound exponentially dependent on the number of input variables d thanks to the independence assumption, as shown later, because it does not need to consider the combinations of d input variables in the distribution estimation. However, the computational cost of TSPSQ-unknown in each iteration still depends exponentially on d due to the combinations of the sample sets S^{(i)} for i ∈ Ī. As a simple case, when |Ī| = d/2 and |S^{(i)}| = n for i ∈ Ī, the computation of the empirical expectation in Eq. (6) requires n^{d/2} evaluations of ḡ_t for each x_I ∈ X_I and I ∈ I. This computational cost can quickly become intractable as n increases over iterations for large I and d. In general, an exact computation at small cost may not be possible; however, the cost can be reduced when certain assumptions hold. For example, Kandasamy et al. (2015) and Mutny and Krause (2018) assumed the additivity of f with respect to each x^{(i)}. If that assumption holds, the computational cost is reduced to O(dn).

Regret bound
We then derive a regret bound for the unknown input distribution case.
Theorem 2 Let X ⊂ [0, 1]^d, d ∈ ℕ, be a compact domain. Assume k(x, x′) ≤ 1 and that there exist constants a, b, a′, b′ > 0 for the sample paths f and their partial derivatives such that

P(sup_{x ∈ X} |f(x)| > L) ≤ a e^{−(L/b)²},  P(sup_{x ∈ X} |∂f/∂x^{(j)}| > L) ≤ a′ e^{−(L/b′)²}, j = 1, …, d.

Assume further that Assumption (4) holds and β_t = 2b′ log t.

Suppose the unknown cumulative distribution function of the input variables is factorized into F(x) = ∏_{i∈[d]} F^{(i)}(x^{(i)}).
Then, by running TSPSQ-unknown for T iterations, it holds that

BayesRegret(T) = O(√(dTγ_T log T) + √(mdT log² T)),

where γ_T is the maximum information gain defined in Eq. (3) and m = max_{I∈I} |Ī| is the maximum number of dimensions of the uncontrol variables.
The main difference of the proof from that of Theorem 1 is the evaluation of the distribution estimation. We employ techniques from multi-armed bandit problems and the Dvoretzky-Kiefer-Wolfowitz inequality (Dvoretzky et al. 1956; Massart 1990) to derive the upper bound. The whole proof can be found in Appendix B.
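The Dvoretzky-Kiefer-Wolfowitz inequality used here states that, with probability at least 1 − δ, sup_x |F̂_n(x) − F(x)| ≤ √(log(2/δ)/(2n)) for the empirical CDF of n i.i.d. samples. A quick empirical check with uniform samples, where F(x) = x:

```python
import numpy as np

rng = np.random.default_rng(4)
n, delta, trials = 500, 0.05, 200
eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))  # DKW radius
violations = 0
for _ in range(trials):
    u = np.sort(rng.uniform(size=n))            # samples from U[0, 1]
    grid = np.arange(1, n + 1) / n
    # sup-deviation of the empirical CDF, checked just before/after each jump
    sup_dev = max(np.abs(grid - u).max(), np.abs(grid - 1.0 / n - u).max())
    violations += sup_dev > eps
rate = violations / trials  # should not exceed delta by much
```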
The growth rate of the regret bound in T in Theorem 2 matches the existing results for the standard BO setting (Russo and Roy 2014; Srinivas et al. 2012), as with TSPSQ-known. This is because, in most cases, the first term, corresponding to the function estimation, dominates the second term, corresponding to the input distribution estimation, thanks to the independence assumption. Recall that m ≤ d. Since d is typically moderate in BO, e.g., d < 10, √m is not large. Therefore, for the Gaussian kernel, for example, where γ_T = O((log T)^{d+1}), the first term √(dT(log T)^{d+2}) dominates the second term √(mdT log² T) in the total regret bound. This argument also applies to the Matérn kernel. However, for the linear kernel, where γ_T = O(d log T), the first and second terms have equal growth rates for m = O(d).
The bound depends linearly on d for m = O(d), and no exponential term in d appears. As m decreases, the bound also decreases. This is because the learner can specify the values of many input variables as desired, and the error with respect to the distribution estimation becomes small.

Experiments
We demonstrate the effectiveness of the proposed algorithms for the known and unknown input distributions using a set of test functions and real-world datasets.

Comparing methods
We compare the proposed algorithms to four baselines: DropoutBO, WrapperBOs, RandomBO, and Random. DropoutBO (Li et al. 2017), which was originally proposed for high-dimensional BO, sequentially determines a partially specified query by performing optimization over a low-dimensional input domain. In each iteration, DropoutBO randomly selects I_t ∈ I and then performs standard BO over X_{I_t} to determine the next control variables x_{I_t}. WrapperBOs performs standard BO over X_I separately for each I ∈ I and selects x_{I_t} based on a wrapper method (Guyon and Elisseeff 2003) for feature selection. For each I ∈ I, it applies GP regression to {(x_i^I, y_i)}_{i=1}^t, the observations of the input variables in I, where x_i = (x_i^I, X_i^Ī). Then, it selects x_{I_t} using |I| acquisition functions {a^I}_{I∈I} as x_{I_t} = argmax_{I∈I, x_I∈X_I} a^I(x_I). RandomBO directly uses the standard BO method with a randomly determined control variable set. First, it determines the next d-dimensional input as in standard BO, x_t = argmax_{x∈X} a_t(x). Then, I_t is randomly chosen from a uniform distribution over I. Subsequently, it queries the I_t components of x_t. Random simply determines the next control variables according to a uniform distribution.
We use the Gaussian kernel for the GP models and employ Thompson sampling as the acquisition function for all methods. In the preliminary experiment in Sect. 5.3, we set the hyperparameters of the kernel function to those obtained by type II maximum likelihood estimation using thousands of data points. In the other experiments, in Sects. 5.4 and 5.5, the hyperparameters are learned every 10 iterations using type II maximum likelihood estimation.

Approximation of sample paths
All methods except Random use a sample path g_t ∼ GP(μ_{t−1}, k_{t−1}). By approximating the GP with a finite-dimensional Bayesian linear model, we can efficiently optimize g_t (Hernández-Lobato et al. 2014). Bochner's theorem (Bochner 1959) guarantees that for any continuous shift-invariant kernel k(x − y), its Fourier transform is a non-negative measure k̂(ω)/α, where α = ∫ k̂(ω) dω is a normalization constant. Random Fourier features (Rahimi and Recht 2007) approximate the kernel function as

k(x − y) ≈ φ(x)^⊤ φ(y),  φ(x) = √(2α/M) (cos(ω_1^⊤ x + b_1), …, cos(ω_M^⊤ x + b_M))^⊤,

where ω_i ∼ k̂(ω)/α and b_i ∼ U[0, 2π]. Then, the GP is approximated with an M-dimensional Bayesian linear model. The sample path is approximated as

g_t(x) = φ(x)^⊤ θ,  θ ∼ N(A^{-1} Φ^⊤ Y_{t−1}, σ² A^{-1}),

where A = Φ^⊤ Φ + σ² I, Φ = (φ(x_1), …, φ(x_{t−1}))^⊤ ∈ ℝ^{(t−1)×M}, and Y_{t−1} = (y_1, …, y_{t−1})^⊤. The random Fourier features enable us to deal with a large number of samples. In the experiments, we set M = 1000.
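A minimal sketch of the random Fourier feature approximation for the Gaussian kernel with unit lengthscale, whose spectral density is a standard normal; M and the test points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, M = 2, 2000
omega = rng.normal(size=(M, d))              # spectral samples for exp(-0.5||x-y||^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=M)

def phi(x):
    # Random Fourier feature map: k(x - y) ~= phi(x)^T phi(y)
    return np.sqrt(2.0 / M) * np.cos(omega @ x + b)

x, y = np.array([0.1, 0.4]), np.array([0.3, 0.9])
approx = phi(x) @ phi(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))  # ground-truth kernel value
```

The approximation error shrinks at rate O(1/√M), which is why a moderate M suffices in practice.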
Preliminary experiment
In Theorem 2 for TSPSQ-unknown, the constant b′ > 0 needed to define β_t = 2b′ log t may not be known in advance. Hence, we vary the value of a hyperparameter c in β_t = c log t and study the difference in performance. We set T = 100 and repeat each experiment 30 times.
The results for the known input distribution case are shown in Fig. 2, where the lines and shaded areas are the means and standard deviations over 30 trials, respectively. TSPSQ-known achieved small regret in the early iterations and outperformed the baselines by finding a nearly optimal solution. Figure 3 (left) shows the cumulative regret at the 100th iteration of TSPSQ-unknown with different values of the hyperparameter c for the unknown input distribution case. We observed that smaller values of c performed better than larger ones in the mean over 30 trials. In particular, TSPSQ-unknown with c = 0.12 worked best and found a nearly optimal solution, as shown in Fig. 3 (right), by controlling the balance between exploration and exploitation in the distribution estimation. The result of TSPSQ-unknown with c = 0.12 is competitive with that of TSPSQ-known for the known input distribution. Larger values of c suffered from higher regret because they focused heavily on exploration and obtained low rewards; however, the results were still better than the baselines. The variance for small c was large because it focused heavily on exploitation and sometimes fell into a local optimum.

Experiments using test functions
Next, we conduct numerical simulations using two test functions: a cosine mixture function (Anderson et al. 2000) over a two-dimensional domain and the Rosenbrock function (Picheny et al. 2013) over a four-dimensional domain. We model the joint distribution of the input variables with a Gaussian distribution N(μ, Σ), where we set μ = (0.7, 0.7)^⊤ and Σ = diag(0.01, 0.05) for the cosine mixture function, and μ_i = 0.7, Σ_{i,i} = 0.01, and Σ_{i,j} = 0 for i, j ∈ [4], i ≠ j, for the Rosenbrock function. We set the family of control variable sets I as the set of combinations of 1 of the 2 input variables for the cosine mixture function and the set of combinations of 2 of the 4 input variables for the Rosenbrock function. We set the observation noise variance σ² = 10⁻³ for both functions, following Hernández-Lobato et al. (2014). Based on the results in Sect. 5.3, we set β_t = 0.12 log t for TSPSQ-unknown. We set T = 100 and repeat each experiment 30 times.
Figures 4 and 5 show the cumulative regret for the known and unknown input distribution cases for the cosine mixture function and the Rosenbrock function, respectively. Note that the performance of the baselines does not change between the two cases. The proposed algorithms, TSPSQ-known and TSPSQ-unknown, outperformed the baselines and achieved small regret. They ultimately found nearly optimal solutions, while the cumulative regrets of the baselines grew linearly. TSPSQ-unknown is again competitive with TSPSQ-known although it is not given the ground-truth input distribution. This result might be because the error in the distribution estimation is much smaller than the error in the function estimation, as discussed in Sect. 4.2.2.

Experiments using real-world datasets
Finally, we conduct experiments using two real-world datasets: the airfoil self-noise dataset and the word similarity dataset (Snow et al. 2008). The airfoil self-noise dataset consists of airfoil self-noise as an output variable and five input variables including the angle of attack and chord length. The goal is to design airfoils with low self-noise via the five input variables in an outsourcing scenario, as described in Sect. 1. The word similarity dataset is a set of answers given by workers to word pair similarity tasks in crowdsourcing. In a task, a word pair is given to a worker, and the worker assesses the similarity of the pair with an integer from 0 to 10, where a larger number indicates stronger similarity. We randomly select 10 of the 30 answers as the input variables. For the output variable, we calculate the negative mean of the absolute difference between the remaining 20 answers and the true answers, so that good workers have high output values. The goal is to find good workers whose answers are accurate via the 10 input variables, as described in Sect. 1. We set the family of control variable sets I as the set of combinations of 2 of the 5 input variables for the airfoil self-noise dataset and the set of combinations of 5 of the 10 input variables for the word similarity dataset. To reduce the computational cost, we randomly reduce the size of the family to 10 for the word similarity dataset. A discussion on large |I| can be found in Sect. 4.2.1. For these datasets, the ground-truth functions and the ground-truth input distributions are unknown and only available through noisy observations. Output data corresponding to the input points we query is not always available. Therefore, we substitute the posterior mean function of a GP learned on each dataset as the ground-truth function. The observation noise variance is also set to the learned parameter. As a proxy for the ground-truth input distributions, we apply kernel density estimation with the Gaussian kernel to the datasets, with bandwidth obtained using the median heuristic (Garreau et al. 2017). We set β_t = 0.12 log t for TSPSQ-unknown based on the results in Sect. 5.3, set T = 100, and repeat each experiment 30 times.
Figures 6 and 7 illustrate the cumulative regret for the known and unknown input distribution cases for the airfoil self-noise function and the word similarity function, respectively. Both TSPSQ-known and TSPSQ-unknown performed better than the baselines. In the early iterations, there was no significant difference in regret among the methods, but the proposed algorithms were able to keep their regrets small after 20 iterations, as shown in Fig. 6, and after 30 iterations, as shown in Fig. 7. TSPSQ-unknown is competitive with TSPSQ-known in these experiments, possibly because the function estimation is more difficult than the distribution estimation, as discussed in Sect. 4.2.2.

Related work
There is a significant body of literature on BO in which a learner has control over only some of the input values but not the rest. Prior works can be classified into two settings based on whether the remaining input values are revealed to the learner before it determines the controllable input values.
First, in contextual GP bandits (Krause and Ong 2011), a black-box function takes control variables and context information as input.In each iteration, the learner receives context information and then determines the values of control variables.In BOPSQ, a learner has no access to the context information.
Second, several works have tackled situations where, as in BOPSQ, a learner determines the controllable input values without knowing the remaining input values. In BO with environmental variables (Williams et al. 2000), the objective is to maximize the expected cumulative reward over the remaining input values determined according to a distribution. This setting can be seen as a special case of BOPSQ, as shown in Sect. 3. In a game-theoretic setting (Sessa et al. 2019; Dai et al. 2020), each agent determines its action sequence to maximize its own cumulative reward against the other agents. Bogunovic et al. (2018) dealt with an adversarial setting, where an adversary determines the remaining input values to minimize the learner's reward. Kirschner et al. (2020) and Nguyen et al. (2020) extended this adversarial setting to one where an adversary selects a distribution over the remaining input values. These works are closely related to BOPSQ; however, in them the learner cannot select which input variables to specify. BOPSQ involves a feature selection procedure.
Causal bandits (Lattimore et al. 2016) for causal intervention are also closely related to BOPSQ, as they also involve a feature selection procedure. In this problem setting, a causal model is provided to the learner as a directed acyclic graph over the random variables corresponding to the input and output variables. In each iteration, the learner specifies the values of any subset of the input variables and then observes the remaining input variables and the output variable. The objective is to maximize the expectation of the output variable conditioned on the specified values. The primary difference is that BOPSQ deals with infinite input spaces as a GP bandit, while causal bandits deal with finite input spaces as a multi-armed bandit.
Noisy inputs have been tackled by several works (Oliveira et al. 2019; Luong et al. 2020). In BO with noisy inputs (Oliveira et al. 2019), the input values are perturbed, and the black-box function takes the unobservable perturbed values as input. In BO with missing inputs (Luong et al. 2020), some of the input values are replaced with unobservable random values. The input values in BOPSQ are also random; in those works, however, all input variables are controllable by the learner, even though the input values are noisy. Further, those works do not address feature selection.
The feature-selection problem (Guyon and Elisseeff 2003), the task of selecting a subset of features that are most relevant to an output, is related to BOPSQ in a broad sense. Chen et al. (2012) proposed a method that jointly optimizes the objective and selects features for high-dimensional BO, under the assumption that only a subset of input variables is associated with the black-box function. Li et al. (2017) also dealt with high-dimensional BO by performing optimization only with respect to randomly chosen features; random values, or the values of the input with the best function value so far, are used for the unchosen features. However, these works operate in the standard BO setting, in which the learner determines all input values, and they do not deal with random input variables.

Conclusion
We proposed BOPSQ, a novel problem setting that utilizes partially specified queries for situations in which specifying all input variables is expensive or difficult. BOPSQ involves a feature-selection procedure, and the values of unspecified input variables are randomly determined. In particular, BOPSQ with an unknown input distribution corresponds to a mixture of BO and multi-armed bandits. We proposed two algorithms based on posterior sampling, TSPSQ-known and TSPSQ-unknown, for the known and unknown input distribution cases, and derived their regret upper bounds, which are sublinear for popular kernels. In the experiments, we demonstrated the effectiveness of the proposed algorithms using test functions and real-world datasets.
In BOPSQ, the uncontrolled variables are sampled from a conditional distribution, X Ī ∼ P(X Ī | X I = x I ). For this sampling, we may consider another scenario: a causal intervention in a directed acyclic graph over random variables (Lattimore et al. 2016), described by X Ī ∼ P(X Ī | do(X I = x I )). If the input distribution is neither known nor independent, the intervention scenario would differ from the scenario in this paper.
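The gap between conditioning and intervening can be illustrated with a minimal toy model (a hypothetical example for intuition, not taken from the paper): a hidden confounder U drives both a controlled bit X1 and an uncontrolled bit X2, so P(X2 | X1 = 1) and P(X2 | do(X1 = 1)) differ.

```python
import random

random.seed(0)

def sample(intervene_x1=None):
    """Sample (x1, x2) from a toy confounded model:
    U ~ Bernoulli(1/2), X1 = U, X2 = U.
    If intervene_x1 is set, X1 is forced (the do-operator),
    but X2 still follows the confounder U."""
    u = random.random() < 0.5
    x1 = u if intervene_x1 is None else intervene_x1
    return int(x1), int(u)

N = 100_000

# Conditional: restrict observational samples to those with X1 = 1.
cond = [x2 for x1, x2 in (sample() for _ in range(N)) if x1 == 1]
p_cond = sum(cond) / len(cond)            # exactly 1, since X2 = X1 = U

# Interventional: force X1 = 1 via do(X1 = 1); X2 remains Bernoulli(1/2).
intv = [sample(intervene_x1=1)[1] for _ in range(N)]
p_do = sum(intv) / len(intv)              # close to 1/2
```

Under P(X2 | X1 = 1) the conditioning is informative about U, while under the do-operator it is not, which is exactly why the two scenarios can diverge when the input distribution is dependent.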
To make BOPSQ applicable to more real-world problems, it is important to extend it to settings in which the selection costs of control variable sets differ (Tran-Thanh et al. 2012). In the problem setting of BOPSQ, we assume that the cost is uniform; in many real-world problems, however, specifying the values of more control variables would be more costly. Handling unbounded domains (Shahriari et al. 2016a) is another important extension of BOPSQ; we may not know the domain bounds, especially in the unknown input distribution case.

A Proof of Theorem 1
In this section, we present the proof of Theorem 1, where the input distribution is known.
We first introduce a relevant result: Srinivas et al. (2012) provided a bound on the sum of posterior variances over T iterations in terms of the maximum information gain γ_T. The analysis basically follows the frameworks of Russo and Van Roy (2014) and Srinivas et al. (2012). First, we discretize the domain X in iteration t into grid points. Using the UCB sequence, we decompose BayesRegret(T) into the terms A_1 through A_5. We first bound the terms A_1 and A_5; by Assumption 4, we obtain the corresponding bound.
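For reference, the sum-of-posterior-variances bound alluded to above is commonly stated in the following form (a standard statement of the lemma in Srinivas et al. 2012, with σ² denoting the observation-noise variance), which is the form the subsequent Cauchy–Schwarz step relies on:

```latex
\sum_{t=1}^{T} \sigma_{t-1}^{2}(x_t)
\;\le\;
\frac{2}{\log\left(1+\sigma^{-2}\right)}\,\gamma_T .
```

Combined with the Cauchy–Schwarz inequality, this yields the familiar $\sum_{t=1}^{T}\sigma_{t-1}(x_t) \le \sqrt{T \cdot \tfrac{2}{\log(1+\sigma^{-2})}\gamma_T}$ rate that makes the regret sublinear whenever γ_T grows sublinearly.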
By applying the same argument to A_5 as above, we obtain the same upper bound.
Next, we upper-bound A_2, using σ_{t−1}(x) ≤ 1 and β_t = 4(d + 1) log t. Since X_{I*} and X_{I_t} are identically distributed conditioned on D_{t−1}, and U_t is deterministic, we obtain Inequalities (10) and (11). Finally, we bound A_4. Since f and (X_{I_t}, X_{Ī_t}) are independent conditioned on D_{t−1}, we obtain the bound, where we used the monotonic increase of β_t in the first inequality, the Cauchy–Schwarz inequality in the second, and Inequality (8) in the third. By combining Eq. (9) and Inequalities (10), (12), (13), and (14), we complete the proof. ◻

B Proof of Theorem 2
In this section, we present the proof of Theorem 2, where the input distribution is unknown and the input variables are assumed to be mutually independent.
We define a UCB sequence U_t and further define the auxiliary sampled functions g_t and g′_t. Then, we decompose the regret at iteration t into the terms B_{1,t}, …, B_{6,t}. First, since f and g_t follow the same distribution given D_{t−1}, we obtain the bound on B_{1,t}. Next, we upper-bound B_{2,t} and B_{4,t}. We decompose B_{2,t} into the terms B_{2-1,t}, …, B_{2-4,t}, where b′, b > 0 are the constants in Assumptions (7) and (4); here, the last equality holds because f and g′_t are independently distributed from the same distribution given D_{t−1}. Using the independence assumption on the input random variables and integration by parts, we obtain a bound in which the expectation operator over X_{J̄_t ⧵ J̄_{t,i}} is introduced for abbreviation. Using Inequality (18), and since the samples x_i ∈ S^i_{t−1} for i ∈ J̄_t are independently distributed given |S^i_{t−1}|, applying the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al. 1956; Massart 1990) conditioned on n = |S^i_{t−1}| yields, for ε > 0, a uniform bound on the empirical distribution error. Define ε_{t,n} := ε + √((2 log t)/n); we then use ∑_{n=1}^∞ x^n/n = −log(1 − x) in Equality (20) and e^x ≥ x + 1 in Inequalities (21) and (22). By taking the summation over T and combining Inequalities (19) and (23), we obtain a bound involving m = max_{I∈𝓘} |Ī|.
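The Dvoretzky–Kiefer–Wolfowitz inequality invoked above, with Massart's (1990) tight constant, reads as follows: for n i.i.d. real-valued samples with empirical distribution function \hat{F}_n of a distribution function F,

```latex
\Pr\!\left( \sup_{x \in \mathbb{R}} \left| \hat{F}_n(x) - F(x) \right| > \epsilon \right)
\;\le\; 2 e^{-2 n \epsilon^{2}},
\qquad \epsilon > 0 .
```

Setting the right-hand side to the desired failure probability and solving for ε is what produces the √((log t)/n)-type term in ε_{t,n}.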
The term B_{2-3,t}, after taking the summation over T, is upper-bounded by Assumption (4).
where we used the monotonic increase of B in the second inequality. The term B_{2-4,t} is upper-bounded using Assumptions (7) and (4) in Inequality (26). Therefore, by taking the summation over T, we obtain a bound, where we used Inequality (27) in the first inequality and the monotonic increase of β_t in the second. By combining Inequalities (17), (24), (25), and (28), we obtain the bound on ∑_{t=1}^T B_{2,t}. Applying the same argument to ∑_{t=1}^T B_{4,t}, we obtain the same bound. By the definition of the acquisition function given D_{t−1}, we bound B_{3,t}. Next, we upper-bound B_{5,t}, where we used the monotonic increase of β_t in the first inequality; Equality (31) follows by conditioning on the events {i ∈ Ī_t, |S^i_{t−1}| = n}. To upper-bound the term B_{6,t}, we decompose it into three terms. Because f and g′_t are independently distributed from the same distribution given D_{t−1}, we upper-bound the term B_{6-1,t} after summing over T by applying the same argument as for Inequality (10).
By setting ε = √((log T)/T), we obtain the stated bound, which completes the proof. ◻

C Discussion on estimating unknown dependent input distributions
We consider a simple toy example that illustrates the essential hardness of the problem when the input distribution is not necessarily independent.
For simplicity, we consider the binary input space X = {0, 1}^6 with 𝓘 = {I | I ⊂ [d], |I| = 3}, but the discussion below extends easily to the continuous case with a general input dimension d and number of control variables |I|. Suppose the learner already knows the objective function f(x) completely. Now, consider the following two distributions P and Q. Under P, the X^(i) are i.i.d. with P(X^(i) = 0) = ε for sufficiently small ε > 0. Under Q, the sequences in a set S appear more frequently. We can see that, under Q, it is uniquely optimal to choose I = {1, 2, 3} with x_I = 000, because for this choice Q(X_Ī = 011 | X_I = 000) ≈ 1/2, whereas X_Ī almost always becomes 111 (resulting in f(x) = 0) for any other choice. Nevertheless, it requires an enormous number of samples to distinguish P from Q unless one chooses I = {1, 2, 3} with x_I = 000, because X_Ī almost always takes the value 111 in all other cases.
By the same argument, for arbitrary I with |I| = 3, we can construct a distribution Q′ such that choosing I with x_I = 000 is optimal, yet Q′ cannot be distinguished from P without trying that choice. Therefore, it is necessary to try combinatorially many candidates of I to find an input for which f(x) becomes 1 with nonnegligible probability. From this observation, we conclude that dependent input distributions make the task essentially exponentially hard.
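A back-of-envelope calculation makes the cost concrete (a sketch under the toy parameters above; the value of ε is chosen arbitrarily for illustration): under P, the three uncontrolled bits deviate from 111 only with probability about 3ε, so informative observations are rare, and there are C(6, 3) = 20 candidate control sets that may each need probing.

```python
import math

eps = 1e-3          # P(X_i = 0) under P; arbitrary small value for illustration
d, k = 6, 3         # input dimension and number of control variables

# Probability that the d - k = 3 uncontrolled bits are NOT all 1 under P.
p_deviate = 1 - (1 - eps) ** (d - k)      # ≈ 3 * eps for small eps

# Expected number of rounds before a single informative (non-111) observation.
rounds_per_signal = 1 / p_deviate         # ≈ 1 / (3 * eps)

# Number of candidate control variable sets, each of which may need probing.
n_candidates = math.comb(d, k)            # C(6, 3) = 20
```

Even in this tiny d = 6 example, distinguishing P from Q′ for every candidate I multiplies hundreds of rounds per informative sample by C(d, |I|) candidates, and C(d, |I|) grows exponentially in d.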
Let X ⊂ ℝ^d be a compact domain and f ∶ X → ℝ be an unknown function of interest. A control variable set, I ⊂ [d], represents the indices of the input variables whose values a learner can specify. An uncontrol variable set, Ī = [d] ⧵ I, represents the remaining indices, whose values the learner cannot specify. Let 𝓘 ⊂ 2^[d] denote the set of candidate control variable sets.
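With this notation, the learner's objective in BOPSQ (sketched here for the known-distribution case, consistent with the definitions in Sect. 3) is to choose both the control set and its values so as to maximize the expected function value over the random uncontrolled variables:

```latex
\max_{I \in \mathcal{I}} \; \max_{x_I} \;
\mathbb{E}_{X_{\bar{I}} \sim P(\,\cdot\, \mid X_I = x_I)}
\left[ f\bigl(x_I, X_{\bar{I}}\bigr) \right] .
```

The outer maximization over I ∈ 𝓘 is what distinguishes BOPSQ from BO with environmental variables, where the control set is fixed in advance.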

Fig. 2 Cumulative regret for the Branin-Hoo function with the known input distribution

Fig. 4 Cumulative regret for the cosine mixture function