Pre-Training Acquisition Functions by Deep Reinforcement Learning for Fixed Budget Active Learning

There are many situations in supervised learning where the acquisition of data is very expensive and is sometimes constrained by a user's budget. One way to address this limitation is active learning. In this study, we focus on a fixed budget regime and propose a novel active learning algorithm for the pool-based active learning problem. The proposed method performs active learning with a pre-trained acquisition function so that the maximum performance can be achieved when the number of data that can be acquired is fixed. To implement this active learning algorithm, the proposed method uses reinforcement learning based on deep neural networks as a pre-trained acquisition function tailored for the fixed budget situation. By using the pre-trained deep Q-learning-based acquisition function, we can realize an active learner that selects a sample for annotation from the pool of unlabeled samples while taking the fixed-budget situation into account. The proposed method is experimentally shown to be comparable with or superior to existing active learning methods, suggesting the effectiveness of the proposed approach for fixed-budget active learning.


Introduction
In the framework of supervised learning, the predictive performance of a learned model should improve when the number of samples increases. However, it is not desirable to simply increase the number of samples when the cost of annotation or acquiring labels is high. Examples of such cases include problems that require expert knowledge for annotation as well as problems that require large-scale experiments. Also, the effect of each set of data on the training of the predictive model is not equal. Some data hardly change the model, and some data can deteriorate the performance of the model. Therefore, it is necessary to selectively annotate data that is considered useful for the model. To maximize the performance of the model, a framework for selective annotation called active learning [19,38] has been developed. Active learning methods are roughly divided into two categories: pool-based and stream-based methods. Pool-based methods assume that there is a large amount of data (called pool data) that consists of features only (i.e., without labels or response values), and the active learner selects which datum should be annotated next. The annotated datum is then included in the training dataset. In the next step of active learning, the learner then selects a datum from the remainder of the pool dataset. Stream-based active learning is a method that deals with the problem of determining whether to annotate sequentially given data individually. In this work, we focus on pool-based active learning.
In active learning, the most critical issue is how to design the acquisition function used for determining which datum to annotate in the current circumstances. The majority of existing approaches adopt some measure of the difficulty of prediction as the sample selection criterion. These criteria often give good results empirically, but in most conventional active learning methods a single fixed criterion is used throughout the learning process, even though the best strategy for designing the acquisition function would differ according to context. As a result, the selected data may include outliers and other samples that degrade the performance of the model. For example, uncertainty sampling [30] selects the data with the smallest discriminant posterior probability. In other words, the most difficult data for the current learning model is selected. However, this method may select outliers on the discrimination boundary. To avoid such a situation, it is necessary to change the criterion flexibly according to the context of the learning process.
Suppose we want to learn a prediction model when a budget is fixed in advance, namely, the number of data to be labeled is pre-determined, which is an extremely common situation when developing a machine learning system. In this case, a better model should be obtained if we acquire the labeled data in a manner that considers the data acquisition order or context within the budget. To meet these demands, we propose a method that applies reinforcement learning [41] to active learning. The use of reinforcement learning for learning the acquisition function used in active learning has already been considered in [49]. However, the method proposed in [49] is designed only for classification problems and does not consider the budget of the learner. In our method, data is selected according to a criterion that reflects the current state of the learning model and the context in which the data is acquired, so that the model performance is maximized when the specified number of data has been acquired. For this purpose, a deep Q-network (DQN) [32] is used to learn an acquisition function that takes the data context into account.
The major contributions of this paper are summarized as follows:
- We tackle the problem of learning an acquisition function suitable for the fixed-budget active learning problem. Recent studies on active learning focus on data-driven acquisition function design, but, to the best of the authors' knowledge, an acquisition function tailored for the fixed-budget situation has not yet been investigated.
- To learn the fixed-budget acquisition function, we utilize reinforcement learning. In particular, we adopt the DQN, and the acquisition function is trained in advance of the operational phase of active learning. By using reinforcement learning, the acquisition function is trained so as to select appropriate samples when the number of available samples is fixed.
The rest of the paper is organized as follows. In Sect. 2, related works on active learning and recent results on learning acquisition functions from data are summarized. Section 3 introduces the notion of reinforcement learning and its modern implementation, the deep Q-networks. Then, our proposed approach for the fixed budget active learning is presented in Sect. 4, and it is experimentally evaluated in Sect. 5. The last section is devoted to the discussion and conclusion.

Related Work
Active learning has a long history and is still actively researched [18,20,25] and applied to a variety of problems [9,31,35,36,40,42,44]. However, in most of the literature, the criterion used to select data from the pool data does not change according to the environment or context, so if the criterion is not appropriate for the current status of the learning model or pool data, the selected dataset will not improve the predictive model as expected.
Several recent studies have employed the meta-active learning approach, which aims to learn the acquisition function for active learning from data [3,14,28,33,47,50]. The authors of [50] and [3] applied the active learning strategy to one-shot learning. Active one-shot learning [50] is designed for stream-based active learning, in which reinforcement learning is adopted to learn whether to label a given sample or to ignore it. The method in [3] is a modified version of [50] for pool-based active learning. These methods utilize reinforcement learning [10,41] to learn the acquisition function. In particular, [50] is similar to our proposed method in that it uses a DQN. The method proposed in [50] has the advantage of being able to learn the environment with high precision using a DQN. However, because it is designed for cases in which an extremely small subset of pool data should be labeled, it is not suitable for a standard active learning problem. In addition, [12] utilizes a DQN for learning an appropriate acquisition function, but its formulation is heavily dependent on the Markov decision process, meaning that it is only applicable to stream-based active learning, whereas we consider pool-based active learning in this study.
Recently, a methodology called learning active learning (LAL) was proposed in [28], in which the acquisition function for active learning is pre-trained before the actual learning phase. In the pre-training stage, a large number of datasets are collected. These datasets can be collected from other problem domains or even be artificially generated. Then, features are designed using the dataset and the predictive model to be learned. For example, the distances between a candidate datum to be annotated and its k-nearest neighbor data in the pool, the coefficients of a linear classifier (predictive model), or the average depth of random forest classifiers could be employed. These features are calculated and stacked to form a feature vector, and the feature vectors are used to train an acquisition function for improving the prediction model. Then, the trained acquisition function is used in the predictive model learning phase, in which the same feature vectors are extracted from the actual data and the predictive model.
A method that takes the feature extraction performed by LAL one step further by performing feature engineering has also been proposed [33]. Using reinforcement learning, the acquisition function for the embedded features is trained. This method is designed to perform the feature embedding and learn the acquisition function in an end-to-end manner. However, the predictive model for the method is currently limited to a two-class linear support vector machine [46]. Although we concentrate on the classification setting in this paper, our method is applicable to both classification and regression settings.
In this study, we propose a method to pre-train the acquisition function for active learning by using a deep Q-network with datasets from other domains. The method enables us to select data to be annotated according to the context of the learning process of a predictive model. For the predictive model, we adopt random forest [6], which can realize multi-class classification and regression in a unified framework and from which it is easy to extract different kinds of features, as explained in Sect. 4; other predictive models can also be plugged into our method. We note that deep learning models can also be used for classification in active learning, but their hypothesis space is very large, which poses its own difficulties [23] when they are applied to active learning.
We consider the fixed budget regime in this study and seek the optimal acquisition function under this circumstance, which is the main difference from existing work on learning acquisition functions for active learning [28,33]. We note that in some cases, active learning algorithms are run without an explicit limit or budget for data annotation. In such cases, we encounter another problem, namely, when to stop learning. There are several works on the optimal stopping time of active learning [2,4,22,26]. Only a few works in the active learning literature consider the budget explicitly [7,13]; these works derive budget-aware stream-based active learning methods and do not consider learning the acquisition function from data.

Preliminary for Reinforcement Learning
Our active learning model uses a pre-trained acquisition function, which is learned by reinforcement learning. There are many possibilities for implementing reinforcement learning and our main idea does not assume any specific realization of reinforcement learning. One of the modern and promising approaches is that based on the deep neural networks. In particular, DQN [32] is used to account for dynamic phenomena where time is explicitly involved. The consecutive data acquisition process corresponds to the notion of "time", and DQN is shown to work well in the literature of learning acquisition function for active learning [12,50].
In this section, we introduce a reinforcement learning method based on DQN to realize active learning in consideration of the context of data acquisition.

Q-Learning
This subsection presents an overview of Q-learning, which is a representative reinforcement learning method. Reinforcement learning [41] is a field of machine learning in which an agent tries to maximize reward by taking actions under a state in which the agent is located. Q-learning is a theoretically sound reinforcement learning method that has been empirically shown to perform well.
The aim in Q-learning is to obtain a function that calculates the value of taking action $a_t$ when a learner or an agent is in state $s_t$ at a certain time $t$. Here, $s_t$ is a collection of parameters that specify the state or current situation of the agent, and $a_t$ is a collection of parameters that represent the action that is taken in the current situation. We define a Q-function that accepts $(s_t, a_t)$ and outputs a reward. With this function, an action that maximizes the reward is taken. The ideal Q-function $Q^*(s_t, a_t)$ is defined by
\[ Q^*(s_t, a_t) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t, a_t, \pi \right]. \tag{1} \]
Here, $\pi$ is a mapping from state $s$ to action $a$; that is, it is the strategy for deciding which action to take. In other words, $Q^*(s, a)$ is the expectation of the obtained reward when the agent takes the ideal action $a$ in a certain situation $s$. Reward $R_t$ is composed of the cumulative immediate rewards up to the final time $T$ with a discount factor as follows:
\[ R_t = \sum_{\tau = t}^{T} \gamma^{\tau - t} r_\tau, \tag{2} \]
where $T$ is the time in the final state, $\gamma \in [0, 1]$ represents the discount rate, and $r_\tau$ represents the immediate reward at time $t = \tau$. Applying Eq. (2) to Eq. (1) yields the following expression:
\[ Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P_{t+1}}\left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t \right], \]
where $P_{t+1}$ is the probability distribution of state $s_{t+1}$. The actions $a^*$ that maximize the reward at each time constitute the sequence of actions $\{a^*_\tau\}_{\tau=t}^{T}$, and in this sense, reinforcement learning considers the context of learning.
In Q-learning, the ideal Q-function in Eq. (1) is not available because of the lack of the ground truth distribution of $R_t$. As a result, the Q-function is typically represented as a table created from discrete states $s$ and discrete actions $a$. The value (reward) of each cell in this table is updated by the following formula:
\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right], \]
where $\alpha \in [0, 1]$ is a parameter indicating the learning rate. There are various ways to select the action during learning, but here we use a simple method called the $\epsilon$-greedy method [41]. The $\epsilon$-greedy method is a strategy that takes a random action with a probability of $\epsilon$ and takes the action that maximizes the currently obtained Q-function with a probability of $1 - \epsilon$.
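To make the table update and the $\epsilon$-greedy rule concrete, the following is a minimal sketch of tabular Q-learning on a toy chain environment. The environment, the function names, and all hyperparameter values are illustrative assumptions and are not taken from the experiments in this paper.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so that the all-zero initial table still explores.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states.

    Action 1 moves right, action 0 moves left. Reaching the right-most
    state yields reward 1 and ends the episode; all other rewards are 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            a = epsilon_greedy(q[s], epsilon, rng)
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # Target r + gamma * max_a' Q(s', a'); no bootstrap at the end.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


q_table = q_learning_chain()
# Greedy action in every non-terminal state of the learned table.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
```

In this toy problem the learned greedy policy moves right in every non-terminal state, since the only reward lies at the right end of the chain.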

Deep Q-Network
A deep Q-network (DQN) uses a deep neural network [15,29] instead of a table for representing the Q-function; hence, unseen states and actions can be evaluated. Here, we approximate the Q-function using a deep neural network determined by the parameter $\theta$ as $Q^*(s, a) \approx Q(s, a; \theta)$. The DQN starts with a randomly initialized parameter $\theta$ and optimizes the objective function in Eq. (3), which measures the discrepancy between a target value $q_i$ and the network output $Q(s, a; \theta_i)$. The index $i$ indicates the $i$-th label annotation on the unlabeled data from the pool dataset. Parameter $\theta_{i+1}$ is then optimized with $q_{i+1}$, calculated using $\theta_i$, until $i$ reaches a predefined time $T$, which is the maximum number of data to be acquired as determined by the budget. The final Q-function is the deep neural network determined by $\theta$. The biggest difference between conventional Q-learning and a DQN is that conventional Q-learning treats the states and actions as discrete, whereas a DQN treats states and actions as continuous values. Because of its high empirical performance, we use a DQN in active learning by designing the state and action appropriately.
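For orientation, the following is a sketch of the form such an objective commonly takes. It assumes the squared-error loss of the standard DQN formulation [32]; the cost actually used here may differ, since the experiments report a mean absolute error cost. The notation $L_i$, $s'$, and $a'$ is introduced only for this illustration.

```latex
L_i(\theta_i) = \mathbb{E}_{s,a}\!\left[ \bigl( q_i - Q(s, a; \theta_i) \bigr)^2 \right],
\qquad
q_i = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right].
```

Under this reading, the target $q_i$ is computed with the previous parameter $\theta_{i-1}$ and held fixed while $\theta_i$ is optimized, which matches the description that $\theta_{i+1}$ is optimized with $q_{i+1}$ calculated using $\theta_i$.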

Proposed Method
This section describes the procedure for applying DQN to active learning. The framework is similar to that of [28], which considers the reduction of loss for the predictive model when data is added to the training set, and learns a predictor of the reduction by using datasets from other domains.
Learning the acquisition function would improve the performance of active learning. However, active learning is in general used under a scarce data regime, and we cannot expect an acquisition function learned with a small number of data to generalize well. In the proposed method, we use a large number of datasets that are collected from other domains or artificially generated to learn the acquisition function (i.e., the predictor of the reduction in loss). The acquisition function is modeled and learned within the framework of a DQN. The state and action in our formulation have the following design:
- State: Parameters that represent the predictive model and that describe the training data (e.g., coefficients for a regressor, averaged distance to other data in the pool).
- Action: A parameter that determines which data to select (e.g., uncertainty of prediction evaluated by the current predictive model).
By designing the state and action in this way, we can learn a Q-function that predicts the amount of test loss reduction (reward) obtained by selecting an unlabeled datum (action) given the current predictive model and pool data (state). Once the Q-function has been obtained, it is possible to select the data that reduces the test loss the most when a certain number of data are added to the training data.
In this work, we consider the supervised learning problem of predicting a response variable $y$ from an explanatory variable $x \in \mathbb{R}^d$. A dataset $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$ is assumed to be given for training a predictive model.
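The overall procedure in which an acquisition function is used under a fixed budget can be sketched as the loop below. This is only a schematic under stated assumptions: `fit`, `select`, and `oracle` are hypothetical hooks standing in for the predictive model, the acquisition function, and the annotator, and the toy criterion in the usage example is not the pre-trained DQN.

```python
def run_fixed_budget_loop(pool, labeled, budget, fit, select, oracle):
    """Pool-based active learning under a fixed labelling budget.

    fit(labeled) -> model, select(model, labeled, pool) -> index into pool,
    oracle(x) -> label. All three are hypothetical caller-supplied hooks.
    """
    pool, labeled = list(pool), list(labeled)  # work on copies
    for _ in range(budget):
        if not pool:  # nothing left to annotate
            break
        model = fit(labeled)                # retrain on the current labels
        idx = select(model, labeled, pool)  # acquisition function picks a datum
        x = pool.pop(idx)
        labeled.append((x, oracle(x)))      # one annotation uses one unit of budget
    return labeled, pool


# Toy usage: no real model; the stand-in criterion picks the point closest to 0.5.
toy_pool = [0.1, 0.4, 0.52, 0.9, 0.3]
toy_labeled = [(0.0, 0), (1.0, 1)]
final_labeled, remaining = run_fixed_budget_loop(
    toy_pool, toy_labeled, budget=3,
    fit=lambda data: None,
    select=lambda model, data, pool: min(
        range(len(pool)), key=lambda i: abs(pool[i] - 0.5)),
    oracle=lambda x: int(x >= 0.5),
)
```

The loop stops after exactly `budget` annotations (or earlier if the pool is exhausted), which is the constraint the acquisition function is meant to exploit.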

Design of the State, Action and Reward
For implementing DQN, we have to design state and action as input and output of DQN. In this subsection, we first define the state, namely, the feature vector extracted from the predictive model and annotated dataset. Then the action of the learner is defined, which corresponds to the data selection policy in terms of the acquisition function for active learning. Finally, the reward for a certain action is defined as the amount of increase of accuracy by adding the annotated datum selected by the learner. We note that there are various possibilities for designing state, action, and reward, and we do not rule out other designs not adopted here. Appropriateness of those designs would depend on the dataset, predictive model, budget, and many other factors, and optimization of the design is an open issue.

Design of the State
The state used for the DQN should consist of parameters that reflect the performance and structure of the current learning model. In [28], it is empirically shown that simple features, such as the variance of the classifier output or the predicted probability distribution over possible labels for a specific datapoint on synthetic data, are effective for training the acquisition function. Here, we explain the parameters adopted for defining the state in our framework. An arbitrary predictive model can be used in the proposed framework, but here we consider a random forest [6]. We adopt the out-of-bag (OOB) accuracy $A_o$ as one of the features of the acquisition function because it should be useful for expressing the performance of the predictive model.
The random forest used as the predictive model performs random sampling with replacement of the given dataset $k$ times. Let $\Phi_i$, $i = 1, \dots, k$, be a subset of the given dataset used to construct the $i$-th decision tree for ensemble learning. In sampling with replacement, approximately 36% of all data are not sampled: the probability that a certain sample $x_i$ in a dataset of size $n$ is not selected in a single draw is $(n-1)/n$, and since the random forest draws $n$ times with replacement from the size-$n$ dataset, the probability that $x_i$ is never selected is $((n-1)/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows. The unsampled data subset is called the out-of-bag (OOB) sample, and is used to assess the generalization performance of the predictive model [6]. The OOB accuracy $A_o$ is obtained by performing verification using this OOB sample: it is the average, over the OOB samples, of the indicator $1_{y_i}(S)$ evaluated at the output $S$ of the random forest trained over the dataset $D$ for the input $x_i$, where $1_{y_i}(S)$ returns 1 when the statement $y_i = S$ holds and 0 otherwise. We adopt the OOB accuracy $A_o$ because it directly expresses the performance of the predictive model.
We use decision trees as weak learners in the random forest. A decision tree divides the feature space into regions and determines the output depending on which region the input feature belongs to. The average number of divided regions and the average number of divisions in the decision trees are given by $\frac{1}{k}\sum_{i=1}^{k} N_T^i$ and $\frac{1}{k}\sum_{i=1}^{k} N_S^i$, respectively, where $N_T^i$ is the number of terminal regions in the $i$-th tree, and $N_S^i$ is the number of splits in the $i$-th tree among the $k$ decision trees. The averages of the numbers of regions that were above and below the threshold when splitting just before the end region of the decision tree are expressed by Eqs. (6) and (8), respectively. Here, $\Gamma_e^i$ is the $e$-th region immediately before the end region in the decision tree $h(\cdot; \Phi_i)$ constructed using the $i$-th sample $\Phi_i$, $\tau_e^i$ is the threshold for splitting in the $e$-th region of a decision tree, and $N^i$ is the number of regions immediately preceding the end region in the decision tree $h(\cdot; \Phi_i)$. These values are considered to be useful for characterizing the intrinsic data structure, and are used as features for the acquisition function.
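The figure of roughly 36% for the out-of-bag fraction can be checked numerically. The sketch below compares the closed-form probability $((n-1)/n)^n$ with a single simulated bootstrap draw; the function names and the sample sizes are illustrative choices.

```python
import math
import random


def oob_probability(n):
    """Probability that a given sample is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def simulated_oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the fraction left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


limit = math.exp(-1)                    # value approached as n grows
exact = oob_probability(1000)           # closed form for n = 1000
empirical = simulated_oob_fraction(10000)
```

Both `exact` and `empirical` lie close to `limit`, which is about 0.368.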
We also use the first $m$ contribution ratios of the eigenvalues of the $d \times d$ matrix $X^\top X \in \mathbb{R}^{d \times d}$, where the rows of the matrix $X$ are the annotated data $x_i \in \mathbb{R}^d$ at the current stage of active learning. We note that $m \le d$, where $d$ is the dimension of the explanatory variable. The intuition for the use of the eigenvalues and the associated contribution ratios of the design matrix is that, in statistical learning theory [45], the eigenvalues of the data matrix play an essential role in characterizing learnability. The contribution ratio of the $j$-th principal component is given by $\lambda_j / \sum_{i=1}^{n} \lambda_i$, where $\lambda_i$ is the $i$-th eigenvalue, and $n$ is the dimension of the feature vector $x$.
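As a small illustration of the contribution ratios, the sketch below computes the first $m$ ratios from a list of eigenvalues. It assumes that the eigenvalues of $X^\top X$ have already been obtained with any symmetric eigen-solver and that "first" means largest first; both are assumptions of this example, as is the function name.

```python
def contribution_ratios(eigenvalues, m):
    """First m contribution ratios lambda_j / sum_i lambda_i, largest first."""
    if not 0 < m <= len(eigenvalues):
        raise ValueError("m must satisfy 1 <= m <= number of eigenvalues")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [lam / total for lam in ordered[:m]]


# Hypothetical eigenvalues of a 4 x 4 matrix; keep the two leading ratios.
ratios = contribution_ratios([1.0, 4.0, 3.0, 2.0], m=2)
```

For these made-up eigenvalues the two leading ratios are 4/10 and 3/10.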
Finally, the state vector $s$ is defined by concatenating the above features (Eq. (10)).

Design of the Action
In the proposed method, the data selection corresponds to the action in the reinforcement learning procedure. In this study, the action is designed using the indices used in existing active learning methods. We combine uncertainty sampling (US) [30] and the variance for prediction [1] to improve the performance. Combining other indices used for active learning, e.g., disagree probability [39] or the margin of the classification surface [43], would be possible, at a possible increase in computational cost. In the proposed method, the action value is determined so that the reward is maximized for a certain state. Then, from the pool dataset, we select the datum closest to that optimal action and assign a label to it. The posterior probability of class discrimination is an index used in US [30], a commonly used active learning method. For a datum $x$ in the pool $\Pi$, it is the largest class posterior, $\max_{c} P(Y = c \mid x)$, and US selects the datum for which this value is smallest. Here, $P(Y = c \mid x)$ is the probability that the class of the response variable $Y$ is $c$ given $x \in \mathbb{R}^d$, and $\Pi$ is the pooled dataset. We note that our method can in principle address both regression and classification problems, but the acquisition function is trained with classification datasets. This is not absolutely necessary, but using classification datasets makes the design of the action easier. Variance for prediction is another index widely used in active learning methods such as query by committee (QBC) [1], and is defined as the variance of the predictions of the individual trees $h(x; \Phi_i)$, $i = 1, \dots, k$, where $h(\cdot; \Phi_i)$ is a tree within the random forest and $\Phi_i$ is the subset used for training the tree.
Using the pre-trained DQN, the proposed method outputs the discriminant posterior probability and variance as a feature vector to characterize an ideal action. The action vectors are calculated for each datum in the pool, and the one closest to the ideal action is selected and labeled (Eq. (12)). Here, $u$ and $\mathrm{var}$ are the values of the discriminant posterior probability and variance, respectively, determined to be optimal by the DQN.
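The selection of the datum closest to the ideal action can be sketched as a nearest-neighbor search in the two-dimensional action space. The Euclidean distance used below is an assumption of this example (the distance is not specified in the surrounding text), and the numerical values are made up.

```python
import math


def select_closest(pool_actions, ideal_action):
    """Index of the pool datum whose (u, var) pair is nearest to the ideal one."""
    return min(range(len(pool_actions)),
               key=lambda i: math.dist(pool_actions[i], ideal_action))


# Hypothetical (posterior probability, variance) pairs for three pool data.
pool_actions = [(0.90, 0.01), (0.55, 0.20), (0.60, 0.05)]
ideal = (0.50, 0.25)  # stand-in for the action output by the pre-trained DQN
chosen = select_closest(pool_actions, ideal)
```

Here the second datum (index 1) is selected because its action vector lies nearest to the ideal one.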

Design of Reward
We define the immediate reward as the amount of increase in accuracy obtained by adding a new training sample $(x, y)$ to the predictive model. That is,
\[ r = \mathrm{acc}(D_t; D \cup \{(x, y)\}) - \mathrm{acc}(D_t; D), \tag{13} \]
where $\mathrm{acc}(D_t; D)$ is the accuracy of the prediction of the model trained using dataset $D$ and evaluated using dataset $D_t$.
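To illustrate how this reward interacts with the fixed budget, the sketch below turns a sequence of accuracies into immediate rewards and a discounted return. The accuracy values and function names are hypothetical.

```python
def immediate_rewards(accuracies):
    """accuracies[t] is the accuracy after t acquisitions; reward = increase."""
    return [after - before for before, after in zip(accuracies, accuracies[1:])]


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


accs = [0.50, 0.56, 0.55, 0.63]      # made-up accuracies for a budget of three
rewards = immediate_rewards(accs)    # an individual reward may be negative
undiscounted = discounted_return(rewards, gamma=1.0)
```

With $\gamma = 1$ the sum telescopes to the final accuracy minus the initial accuracy, so maximizing the cumulative reward amounts to maximizing the accuracy reached when the budget is exhausted, even if an intermediate acquisition temporarily lowers it.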

Advantage of the Proposed Method
Because the reward in Q-learning is the cumulative sum of the immediate reward at each decision, the learned acquisition function takes into account the context, i.e., the situation of data acquisition under the condition that the maximum number of obtainable training data is fixed. Additionally, the output behavior (which is treated as optimal) can be used in combination with the criteria of any existing method; hence, the design of the state in the proposed method is highly flexible, unlike that of the method proposed in [33], which uses heuristics specific to certain tasks and predictive models. Although the use of data from other domains was inspired by [28], the proposed method updates DQN parameters by reinforcement learning. The conceptual diagram and pseudo-code of the proposed method are shown in Fig. 1 and Algorithm 1, respectively.

Experiments
This section describes the evaluation of the proposed active learning method via a set of multi-class classification experiments with both artificial and real-world datasets. We first investigate which dataset is useful for learning the acquisition function. Then, the proposed method is compared with existing methods over six datasets obtained from the UCI Machine Learning Repository. Finally, through a simple experiment, we demonstrate that the proposed method is able to select appropriate samples considering the context of the learning process.

Algorithm 1
for each epoch j do
    Split D into training dataset and pool dataset
    for t = 1 to T do
        Train RF model using the training dataset, and calculate state s by Eq. (10)
        Calculate action value from Q(s, a; θ) and select data from the pool according to Eq. (12)
        Calculate immediate reward by Eq. (13)
    end for
    Train DQN. Find θ_j that minimizes Eq. (3), and update the parameter: θ = θ_j
end for
Output: θ

Throughout the experiments, the architecture of DQN is fixed as follows:
- input layer: 10 dimensions for the state features defined above.
- three repetitions of a 16-dimensional fully connected layer followed by ReLU activation.
- output layer: 2 dimensions for the action features defined above.
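The layer sizes listed above can be sketched as a plain forward pass. The sketch below uses only the standard library; the random initialization, the linear (activation-free) output layer, and all names are assumptions made for illustration, and no training is performed.

```python
import random

LAYER_SIZES = [10, 16, 16, 16, 2]  # state -> three hidden layers -> action


def init_params(sizes, seed=0):
    """Small random weights and zero biases for each fully connected layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        params.append((w, b))
    return params


def forward(params, x):
    """Forward pass with ReLU after every layer except the last."""
    h = list(x)
    for layer, (w, b) in enumerate(params):
        h = [sum(wi * hi for wi, hi in zip(row, h)) + bi
             for row, bi in zip(w, b)]
        if layer < len(params) - 1:  # output layer assumed to be linear
            h = [max(0.0, v) for v in h]
    return h


def count_params(params):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


net = init_params(LAYER_SIZES)
action = forward(net, [0.5] * 10)  # 10-dimensional state in, 2-dimensional action out
n_params = count_params(net)
```

With these layer sizes the network has 754 trainable parameters (176 + 272 + 272 + 34), which is consistent with the remark that the architecture is relatively small.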
The architecture is relatively small and the performance could be improved by neural architecture search [11], which is left for our future work. The network is trained using Adam [24] with learning rate 0.001 and the cost function is the mean absolute error.
To define the state vector s in Eq. (10), the parameter m should be determined. Throughout the experiments, we set m = 6, which is the smallest number of dimensions of all of the datasets used in the experiments. In our preliminary experiments, we saw that larger m tends to offer better classification performance, but the differences were not significant.

Dependence on the pre-training dataset
In the proposed method, as in [28], the data selection criterion (the acquisition function) is learned beforehand using datasets from other domains. In this study, we created six datasets for classification problems following the procedure proposed in [16]. Details of the generated artificial datasets are reported in Table 1.
Of these datasets, datasets A and B were used for training the DQN, and the remaining four datasets were used for verification. For all datasets, in the active learning experiment, 1,000 data were used as verification data, and the remaining 9,000 data were divided into training data and pool data. During learning, the number of training data for DQN was randomly varied to enable it to handle various situations. The budget for active learning was set to 100 samples, and the training of the DQN was completed when 100 samples were taken. The result of experimenting under these conditions is shown in Fig. 2.
Of all four types of datasets, the proposed method trained with dataset A yielded superior performance. This experiment shows that there is a large difference in performance depending on the learning source dataset. In the following experiment, the model trained with dataset A was used.

Experiments with Real-World datasets
In this section, we compare the performance of the proposed method with those of existing methods over six datasets from the UCI Machine Learning Repository: adult [27], car [5], winequality-red and winequality-white [8], googletrip-review, and tripadvisor-review [34]. Profiles of the datasets used are summarized in Table 2. These datasets are popular benchmarks in machine learning and active learning; they range from classical (adult, car, winequality) to relatively new (googletrip-review and tripadvisor-review), cover various input dimensionalities (6-23), and include both qualitative and quantitative features. In particular, the latter four datasets are considered suitable for active learning because annotation is hard: evaluating wine quality requires tasting by expert sommeliers, and reviewing a tourist spot requires actually visiting the place.
US, QBC, LAL, and random sampling were used as comparison methods. A similar method [33] relies on a two-class support vector machine as a learning model, so it is not suitable for an experiment with a multi-class discriminant. As stated in Sect. 5.1, the number of data to be acquired (the budget) from the pool data is 100. The result of the experiment using this setting is shown in Fig. 3.
The proposed method performed significantly better than the other methods on the googletrip-review and tripadvisor-review datasets, and it performed comparably to the other [...] Table 2, these two datasets have relatively smaller pool datasets, and it is possible that our proposed method requires larger pool datasets than other methods to ensure that the actual and pre-trained datasets have a large enough intersection. The difference in performance between datasets could be partly due to the similarity between the dataset used for pre-training and the dataset used for active learning. We also conjecture that the similarity of the feature distribution to that of the pre-training datasets is the most important factor in the performance of the proposed method. Investigating feature similarity and selecting the best dataset for learning acquisition functions are important future work.

Figure caption (partially recovered): [...] Table 1, "1" is the one learned with dataset A, and "2" is the one learned with dataset B (line types are described in (a)). The vertical axis is the correct answer rate and the horizontal axis is the number of data to be acquired. Each plot shows values averaged over 5 runs.

Evaluation of the Context Awareness
In this subsection, we compare the active learning methods with the oracle data selector to demonstrate that the proposed method considers the context of data selection. Here, the oracle is a method that selects the most appropriate set of data. To make the combinatorial calculation feasible, the size of the pool dataset is restricted to 25 and the number of acquisitions (the budget) is set to five. For this setting, the acquisition function was trained using dataset A, and the active learner was tested on the tripadvisor-review dataset. In this experiment, we compare the classification accuracy of the final models and the matching rate of the selected subset of data. Table 3 shows the results of five-fold cross-validation. When comparing the proposed method with the oracle, the averages of the five trials are exactly the same. The matching rate of the data selected by the proposed method is 32%, which is higher than that of random sampling. This indicates that the probability of obtaining a combination close to that of the oracle is increased by considering the context. The five data points actually selected are different from those obtained by the oracle because the data were acquired so that the performance is maximized over the combination of all five. Although the number of pool data was very low (25), the number of combinations of data acquisition ($\binom{25}{5} = 53{,}130$) is sufficiently large. When acquiring data at random, the probability that all five selected data would match those of the oracle is $1/53{,}130 \approx 0.0019\%$. Hence, the results obtained by the proposed method are much better than the expected value of those obtained at random.
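The combinatorial quantities in this comparison are easy to verify. The sketch below computes the number of possible subsets, the chance of an exact match under random selection, and the matching rate expected from random selection. The last quantity assumes that the matching rate is the fraction of selected data shared with the oracle subset, which is an interpretation made for this example.

```python
from math import comb

pool_size, budget = 25, 5

n_subsets = comb(pool_size, budget)   # number of ways to pick 5 data out of 25
p_exact_match = 1 / n_subsets         # random pick equals the oracle subset

# The overlap with a fixed 5-element subset follows a hypergeometric
# distribution, whose mean is budget * budget / pool_size.
expected_overlap = budget * budget / pool_size
expected_matching_rate = expected_overlap / budget
```

There are 53,130 subsets, so an exact match by chance has probability of about $1.9 \times 10^{-5}$, and under the stated interpretation random selection would share on average one datum (20%) with the oracle subset.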

Computational Costs
Active learning is a methodology required in situations where measurement and experiments are costly, and it is unlikely that the calculation cost of the acquisition function will become a problem. For reference, Fig. 4 shows the time required to evaluate the acquisition function for each method used in our comparative experiment. Since computational time is affected by various factors such as the dimensionality of the data, the size of the pool dataset, and the distribution of the pool or population dataset, we report computational times relative to those of random sampling, which are of the order of milliseconds. We note that for LAL and the proposed method, the acquisition functions have to be trained in advance. The computational cost of training the acquisition function is around one hour for LAL, and around 20 hours for the DQN ($N_e = 5000$ epochs) in our method. Because the acquisition functions are trained in advance, this training cost does not affect the running time of active learning. The training time could also be reduced by parallel computation.
From Fig. 4, we see that the uncertainty sampling method is consistently faster than other methods. For the other three methods, the computational time is comparable.
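As a rough illustration of how a relative computational time of this kind can be measured, the following minimal sketch times one acquisition step of a method and divides it by the time of random sampling. It is not the benchmark used for Fig. 4: the least-confidence rule is only one common form of uncertainty sampling and is assumed here, the class probabilities are synthetic, and all function names are hypothetical.

```python
import random
import time


def random_acquisition(pool_size, rng):
    """Baseline: pick a pool index uniformly at random."""
    return rng.randrange(pool_size)


def uncertainty_acquisition(probabilities):
    """Least-confidence rule (assumed variant): pick the sample
    whose most probable class has the smallest probability."""
    return min(range(len(probabilities)),
               key=lambda i: max(probabilities[i]))


def mean_time(fn, repeats=200):
    """Average wall-clock time of one call to fn, in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


rng = random.Random(0)
pool_size = 1000

# Synthetic binary class probabilities standing in for model output.
probabilities = []
for _ in range(pool_size):
    p = rng.random()
    probabilities.append((p, 1.0 - p))

t_random = mean_time(lambda: random_acquisition(pool_size, rng))
t_uncertainty = mean_time(lambda: uncertainty_acquisition(probabilities))

# Relative computational time with respect to random sampling.
relative_time = t_uncertainty / t_random

if __name__ == "__main__":
    print(f"relative time of uncertainty sampling: {relative_time:.1f}x")
```

Reporting the ratio rather than the absolute time removes part of the dependence on hardware, although it still depends on the pool size and the data dimensionality.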

Conclusion and Future Work
We proposed an active learning method suitable for a fixed budget regime. The proposed method considers the context of data acquisition using a random forest as a learning model and a reinforcement-learning DQN. When the data used for reinforcement learning in advance match the data at the time of operation, a significant improvement in accuracy was observed compared with the results of the existing methods.
Fig. 4 Relative computational time to random sampling
In the proposed method, learning is performed by extracting features from the target learning model and dataset. However, because this part relies on human-generated heuristics, it may affect performance depending on how the features are selected. In the future, we will make this part automatically learnable. In addition, we used a single dataset for training the acquisition function, but the generalization would be improved by mixing datasets from different domains. For that purpose, it will be necessary to investigate the difference in performance depending on the dataset used for learning the acquisition function.
As important future work, the theoretical underpinning of the proposed approach remains to be investigated. For example, the sample complexity of active learning has been established for the case where disagreement-based acquisition functions are employed [17,21,51]. However, the approach of learning acquisition functions was proposed only recently, and its learning theory is yet to be developed. We expect that developing a learning theory for the proposed method would require combining an analysis of online learning (which requires treatment as stochastic processes and martingale analysis) with a regret analysis for assessing the quality of the acquisition function trained with reinforcement learning, which makes the problem difficult. Another important direction of future research is multiple-selection, or batch selection, from the pool dataset. When a deep neural network is used as the predictive model, adding only one sample per iteration of active learning is not reasonable because of the high cost of training the model. Also, it is unlikely that the performance of a CNN changes with only one additional training datum. Several methods for selecting multiple samples at a time in active learning have recently been studied [37,48], and incorporating the notion of context into multiple-sample selection would be of great practical importance.