1 Introduction

With advances in information technology, the Internet advertising market has continued to grow year after year. Internet advertising revenue in the USA in 2021 was 189 billion USD, which is 2.9 times greater than TV ad spending in the same year [1, 2]. Against this background, Internet advertising has become indispensable for corporate promotion, regardless of the size of the company.

In this paper, we propose a machine learning method that automatically determines some of the configuration items in advertising operations from the perspective of an advertiser (a company placing Internet advertisements). Our focus point is the difficulty advertisers face in managing advertisements. To effectively distribute advertisements, a great deal of information in addition to user responses is necessary, including detailed attribute information on users who have viewed advertisements and information on competing advertisers. However, most of this information is not disclosed to advertisers, even though it is owned by the advertising platform. Advertisers thus need to make decisions based only on limited information, and many of them currently outsource their ad operations to agencies as a result. Our objective is to offer a machine learning method that automatically determines some delivery settings using only the information available to the advertiser, thereby supporting efficient operation of Internet advertising by the advertiser alone, independent of an agency.

We focus on determining bid prices for sponsored search ads, particularly Amazon ads. Sponsored search ads are ads that appear on search results screens in conjunction with the search terms entered, and as of 2021, they accounted for 41.4% of U.S. Internet advertising revenue (78.3 billion USD) [2]. The bid price is the maximum amount an advertiser is willing to pay for one click on an ad. This parameter is crucial because it directly affects both the cost of advertising and the sales generated via advertising. Note that while many studies on bid price optimization in sponsored search have utilized the search rank of the advertisement, the Amazon advertisements covered in this paper do not use search rank, since it cannot be obtained under Amazon's terms of use. We therefore propose a method for automatically determining bid prices for sponsored search ads using the limited information available to advertisers.

The key contributions of this work are as follows:

  • We mathematically modeled the bid optimization in a realistic setting consisting of three characteristics: “The setting targets the optimization of bids for each keyword in pay-per-click sponsored search advertising”, “The only information available to advertisers is the number of impressions, clicks, conversions, and advertising cost for each keyword”, and “Advertisers bid daily and set monthly budgets on a campaign basis”. This setting was proposed in a joint research project with negocia, Inc., which provides an automated Internet advertising management system, and is a common situation faced by many companies. To the best of our knowledge, there have been no prior studies on bid optimization in similar settings.

  • We proposed Bayesian AdComB, an algorithm that optimizes the bid price from the limited information available to advertisers. The algorithm models the relationships among advertising metrics by constructing a Bayesian network and then implements Bayesian inference to estimate the distribution of those metrics, taking into account the uncertainty of Internet advertising. The bid price optimization is then performed online with the aid of a bandit algorithm, which is designed to maximize the cumulative long-term reward in situations where estimation and optimization must be done sequentially. This bid price optimization problem can be solved by reformulating it as an integer programming problem.

  • We ran simulations using real data from negocia, Inc. and the iPinYou dataset [3], which is an open dataset. The results demonstrated not only the effectiveness of our proposed method but also the flexibility and extensibility of the framework.

Section 2 of this paper describes the problem setting of bid amount optimization, and Section 3 introduces related work. In Section 4, we explain the proposed method, and in Section 5, we present the numerical experiments we conducted to verify the effectiveness of our method. We conclude in Section 6 with a brief summary and mention of future work.

2 Problem Setting

2.1 Bid Optimization for Sponsored Search Ads

The focus of this work is bid price optimization at the keyword level for pay-per-click sponsored search advertising. Sponsored search advertising generates sales through a five-step process: search, auction, impression, click, and conversion, as shown in Fig. 1. Advertisers using sponsored search ads register multiple phrases called keywords for each advertisement. When a user performs a search, an auction is held among advertisers whose keywords match the search phrase, and the winning advertisement is displayed to the user (impression). This auction is typically a second-price auction, in which the price paid is the bid of the second-place bidder. On advertising platforms such as Google Ads and Amazon Ads, the results of this auction are affected not only by the bid price for specific keywords but also by an undisclosed score calculated by the advertising platform [4, 5]. In pay-per-click advertising, advertising costs are incurred only when a user clicks on the ad. After a click, the advertiser seeks to generate a conversion (e.g., purchase of a product).
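The quality-adjusted second-price mechanism described above can be illustrated with a short sketch. This is a simplified model with invented bidder values; real platforms additionally use undisclosed scoring and reserve prices:

```python
def run_auction(bidders):
    """Rank bidders by quality-weighted bid (ad rank) and apply the
    second-price rule: the winner pays the runner-up's quality-weighted
    bid divided by the winner's own quality score.

    bidders: list of (name, bid, quality_score) tuples.
    """
    ranked = sorted(bidders, key=lambda x: x[1] * x[2], reverse=True)
    (w_name, _, w_qs), (_, r_bid, r_qs) = ranked[0], ranked[1]
    return w_name, r_bid * r_qs / w_qs

# Advertiser A wins on ad rank despite the lower bid, thanks to quality,
# and pays less than its own bid of 1.0.
winner, price = run_auction([("A", 1.0, 0.9), ("B", 1.2, 0.5)])
```

The example shows why advertisers cannot infer their per-click cost from the bid alone: the price paid depends on the undisclosed scores and on competitors' bids.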

Fig. 1
figure 1

Flow of a user’s search through a search engine and conversion through sponsored search advertising

The interaction between advertisers and ad platforms is shown in Fig. 2. Advertisers manage multiple keywords in a unit called a campaign, and bid prices are determined daily for all keywords in the campaign. Note that advertisers cannot change their bid prices during the course of the day. Advertisers receive daily feedback from the ad platform, including the number of impressions, clicks, and conversions as well as the advertising cost for each keyword. However, some information is not available, such as the number of times the keyword was searched, the bid prices of other advertisers, and ad positions. Also, advertisers set budgets for the entire campaign but cannot set budgets for individual keywords. This budget is set over a longer span of time than the bid, e.g., on a monthly basis. If the budgeted amount runs out in the middle of the period, ads will not be displayed unless the budget is increased. It is extremely important for advertisers to avoid situations like this, as it represents a serious opportunity loss.
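The daily feedback described above can be represented as a small record per keyword. The field names below are our own illustration, not the ad platform's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DailyKeywordReport:
    """One day's observable metrics for a single keyword: everything the
    platform discloses in our setting. Search volume, competitor bids,
    and ad positions are not available and hence not represented."""
    keyword: str
    impressions: int
    clicks: int
    conversions: int
    cost: float  # advertising cost charged for the day

report = DailyKeywordReport("running shoes", impressions=520, clicks=12,
                            conversions=1, cost=8.4)
```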

Fig. 2
figure 2

Interaction between advertisers and ad platforms

Overall, the setting we examine is characterized by the following three main features:

  1. Bid price optimization is targeted at the keyword level for pay-per-click sponsored search ads

  2. The only information available to advertisers is the number of impressions, clicks, and conversions and the advertising cost for each keyword

  3. Advertisers determine bid prices daily and set monthly budgets for the entire campaign

This setting was proposed in a joint research project with negocia, Inc., which provides an automated Internet advertising management system.

2.2 Problem Formulation

The bid price optimization in our setting can be formulated as follows. Let N be the total number of keywords, T the total number of bidding rounds, and \(b_{i}^t\) the bid price for keyword i in round t. \(\underline{b}, \overline{b}\) are the minimum and maximum bid prices that can be set, and \(c_{i}(b), u_{i}(b)\) are the ad cost and the number of conversions obtained when bid price b is set for keyword i in a given round. Also, we define \([n]=\{1,\dots , n\}\) for \(n\in \mathbb {N}\). The advertiser’s goal is to maximize the total number of conversions while keeping the advertising cost below B.

$$\begin{aligned} \begin{aligned}&\max _{b_i^t}{} & {} \sum ^T_{t=1}\sum ^N_{i=1} u_{i}(b_{i}^t) \\&\text {subject to}{} & {} \sum ^T_{t=1}\sum ^N_{i=1} c_{i}(b_{i}^t) \le B ,\\{} & {} {}&\underline{b} \le b_i^t \le \overline{b} ~~~~~~\forall i \in [N], \forall t \in [T]. \end{aligned} \end{aligned}$$
(1)

The objective function of optimization problem (1) is the total number of conversions over all rounds. The first constraint requires the total ad cost over all rounds to stay within the budget B, and the second restricts the bid price of every keyword in every round to a fixed settable range. In this formulation, the ad cost and the number of conversions for each keyword are expressed as functions of the bid price, \(c_{i}(b), u_{i}(b)\), but these functions are unknown and must be estimated from past data.
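As a concrete illustration of optimization problem (1), the following sketch evaluates a candidate bid schedule against the budget constraint. The cost and value functions here are toy stand-ins for the unknown \(c_i, u_i\):

```python
def total_conversions(bids, u):
    """Objective of (1): sum of u(i, b) over all rounds and keywords.
    bids[t][i] is the bid for keyword i in round t."""
    return sum(u(i, b) for row in bids for i, b in enumerate(row))

def within_budget(bids, c, B):
    """Budget constraint of (1): total ad cost over all rounds <= B."""
    return sum(c(i, b) for row in bids for i, b in enumerate(row)) <= B

# Toy stand-ins for the unknown functions (linear cost, concave value).
c = lambda i, b: 0.5 * b
u = lambda i, b: b ** 0.5
bids = [[1.0, 4.0], [1.0, 4.0]]           # T = 2 rounds, N = 2 keywords
feasible = within_budget(bids, c, B=6.0)  # total cost 5.0 <= 6.0
value = total_conversions(bids, u)        # (1 + 2) per round = 6.0 total
```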

2.3 Challenges

There are two main challenges with bid price optimization in the problem setting discussed in Section 2.1: high uncertainty and the need for sequential forecasting and optimization. The uncertainty stems from three main factors. The first is the scarcity of positive example data. In pay-per-click advertising, neither conversions nor advertising costs occur unless an ad is clicked on. However, click-through rates for Internet advertising are very low: according to a survey, the average click-through rates are 3.17% for sponsored search ads on Google and 0.46% for display ads [6]. Some keywords therefore have very few (or no) clicks, making the prediction of conversions and advertising cost inaccurate. The second factor is the paucity of data available to advertisers. To optimize bid prices, information such as competitors’ bids, the demographics of the ad’s viewers, and the position of the ad is very important, but such information is often not available to advertisers. Advertisers must therefore determine bid prices from limited information such as daily ad impressions, clicks, conversions, and advertising cost, which makes accurate forecasting difficult. The third factor is uncertainty derived from human behavior. User behavior such as clicks and conversions is difficult to predict because it depends on various factors such as user demographics, personality, and time. This uncertainty creates the need for sequential estimation and optimization. Bid price optimization requires knowing the conversions, advertising cost, and other advertising metrics for all possible bid price candidates, but the aforementioned uncertainties make it difficult to predict these with sufficient accuracy. An approach that optimizes the bid price once and fixes it thereafter is thus unlikely to be optimal, and sequential estimation and optimization are required each time new information arrives.

3 Related Work

Bid price optimization for Internet advertising has been studied under various conditions, including different types of ads, bidding styles, and constraints.

Examples of studies on bid price optimization for sponsored search advertising include works by [7,8,9,10,11]. Feng et al. [7] formulated the bid price optimization problem as an adversarial bandit problem and proposed an algorithm based on Exponential-weight policy for Exploration and Exploitation (Exp3) [12] to optimize bid prices for sponsored search advertising. Their algorithm focuses on optimizing a single bid price without considering budget constraints. Abhishek et al. [8] optimize bids by creating a model that predicts the number of clicks and ad costs based on multiple assumptions. Thomaidou et al. [9] proposed a comprehensive operational framework for sponsored search that includes bid price optimization and automatic ad text generation. Zhang et al. [10] approximate and optimize the bid price optimization problem as a sequential quadratic programming problem. However, the methods proposed in these papers require information that is difficult for advertisers to obtain, such as the location of the ads on the search result page. Uncertainty in sponsored search advertising is a challenge in contexts other than bid price optimization. Rusmevichientong et al. [13] proposed a method to prioritize keywords for bidding with uncertainty, and [14, 15] optimize the budget allocation for sponsored search advertising with uncertainty on the basis of investment portfolio theory.

Avadhanula et al. [16] and Nuara et al. [17] examined bid price optimization for multiple platforms. Avadhanula et al. [16] provided a multi-platform bid price optimization algorithm based on a method called stochastic bandits with knapsacks [18]. Their algorithm is designed for a setting with a high frequency of bid price decisions, such as real-time bidding for display advertising, and thus is not suitable for the setting of this paper, where bid prices can be changed only once a day at most. Nuara et al. [17] formulated a bid price/daily budget optimization problem for multiple platforms as a combinatorial semi-bandit problem [19] and provided an algorithm for simultaneous optimization. Their algorithm is based on the Gaussian process upper confidence bound (GP-UCB) [20], which predicts conversions and advertising costs for each bid price candidate and then optimizes bid prices using dynamic programming. Although their algorithm targets bid price optimization for multiple platforms, it can be extended to our setting, namely, multiple keywords on a single platform. However, we cannot directly apply their algorithm because it is not possible to control advertising costs by setting individual daily budgets in our setting.

Studies that have applied Bayesian inference to the field of Internet advertising include works by [21,22,23,24,25,26]. Hou et al. [24] predicted the ROI of a portfolio consisting of multiple keywords in sponsored search advertising. They modeled the relationships among bid prices, clicks, conversions, and other advertising metrics using Bayesian networks, an approach similar to the Bayesian modeling utilized in this study. However, their study aimed to predict the ROI of advertising and did not optimize bid prices. Agarwal et al. [21], Du et al. [22], Ghose and Yang [23], and Yang and Ghose [26] model the relationships among advertising metrics linearly and discuss them by examining the posterior distributions of the coefficients. Yang et al. [25] focus on the hierarchical structure that exists among ads, platforms, and users and utilize hierarchical Bayes to estimate click-through rates for specific ads, platforms, and users.

4 Method

Our proposed method, Bayesian advertising combinatorial bandit (Bayesian AdComB), consists of three subroutines inspired by the method of Nuara et al. [17]. The flow of the proposed method is shown in Algorithm 1.

Algorithm 1
figure a

Bayesian AdComB

The algorithm first updates the probability model \(\{\mathcal {M}_i^{(t-1)}\}^N_{i=1}\) for each keyword using the advertising metrics obtained in the previous round via the update subroutine (line 5). We model the relationships among advertising metrics with a Bayesian network and derive the posterior distribution by Bayesian inference. Next, the exploration subroutine (line 6) selects the functions \(\hat{c}_{i}^t(b)\) and \(\hat{u}_{i}^t(b)\) for optimization from the model distribution; we utilize a bandit algorithm for this selection. Finally, the optimization subroutine (line 8) performs optimization using the functions selected in the exploration subroutine. We reformulate Eq. (1) as an integer programming problem and optimize it using a general-purpose integer programming solver. Each subroutine is described in detail in the following subsections.
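In outline, one round of the algorithm can be sketched as follows. The subroutine callables are placeholders for the update, exploration, and optimization steps described below, not the authors' implementation:

```python
def bayesian_adcomb_round(models, feedback, remaining_budget, rounds_left,
                          update, explore, optimize):
    """One bidding round of the Bayesian AdComB loop (Algorithm 1):
    update posteriors, select per-keyword functions, then optimize bids
    under the per-round share of the remaining budget."""
    # Update subroutine (line 5): posterior inference from yesterday's metrics.
    models = [update(m, fb) for m, fb in zip(models, feedback)]
    # Exploration subroutine (line 6): pick cost/conversion functions.
    selected = [explore(m) for m in models]
    # Optimization subroutine (line 8): budgeted assignment of bids.
    per_round_budget = remaining_budget / rounds_left
    bids = optimize(selected, per_round_budget)
    return models, bids

# Trivial stubs standing in for the three subroutines:
models, bids = bayesian_adcomb_round(
    models=[0, 0], feedback=[1, 2], remaining_budget=10.0, rounds_left=5,
    update=lambda m, fb: m + fb,
    explore=lambda m: (lambda b: b, lambda b: 2 * b),
    optimize=lambda selected, budget: [min(budget, 1.0)] * len(selected))
```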

4.1 Update Subroutine

The purpose of this subroutine is to estimate the distribution of conversions and advertising costs for each bid price candidate. Estimating these factors as a distribution rather than a point enables estimation that accounts for the high uncertainty. Our method constructs a Bayesian network to model the relationship between the bid price, the number of impressions, clicks, and conversions and the advertising costs, which are the advertising metrics available to advertisers. We then use Bayesian inference to obtain the posterior distribution.

For the probability model (prior distribution and likelihood function) of the process generating the advertising metrics, if accumulated log data exist, WAIC [27] computed on the log data is used as a criterion to select among several candidates. In this paper, we propose four likelihood functions and two prior distributions that we consider reasonable. Although it is difficult to claim a priori that any particular probability model is valid, the goodness of fit of a model to log data can be quantitatively evaluated by computing its WAIC, so models beyond those proposed in this paper can also be compared quantitatively and selected. See Section 5.3.1 for the WAIC comparison on log data of the probability models proposed in this paper.
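For reference, WAIC can be computed from the pointwise log-likelihoods of posterior samples. The following is a generic sketch of the standard definition (log pointwise predictive density minus an effective-parameter penalty, on the deviance scale), not the authors' code:

```python
import math

def waic(log_lik):
    """log_lik[s][n]: log-likelihood of data point n under posterior
    sample s (requires at least two samples). Returns WAIC on the
    deviance scale, where lower values indicate better fit."""
    S, N = len(log_lik), len(log_lik[0])
    lppd, p_waic = 0.0, 0.0
    for n in range(N):
        col = [log_lik[s][n] for s in range(S)]
        # Log pointwise predictive density via a stable log-mean-exp.
        m = max(col)
        lppd += m + math.log(sum(math.exp(x - m) for x in col) / S)
        # Effective parameters: sample variance of the log-likelihood.
        mean = sum(col) / S
        p_waic += sum((x - mean) ** 2 for x in col) / (S - 1)
    return -2.0 * (lppd - p_waic)
```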

Fig. 3
figure 3

Bayesian network with likelihood function 1

Fig. 4
figure 4

Bayesian network with likelihood function 2

We consider four likelihood functions: (1) expressed by Fig. 3 and Eq. (2), (2) expressed by Fig. 4 and Eq. (3), (3) expressed by Fig. 5 and Eq. (4), and (4) expressed by Fig. 6 and Eq. (5). Note that these are the likelihood functions for keyword i, where \(\mathcal {N}(\mu , \sigma ^2)\) represents the normal distribution with mean \(\mu \) and standard deviation \(\sigma \), and \(\textrm{Binomial}(n, p)\) represents the binomial distribution with number of trials n and success probability p.

$$\begin{aligned} \begin{aligned} \textrm{volume}_i&\sim \mathcal {N}\left( \mu ^\textrm{vol}_i, {\sigma ^\textrm{vol}_i}^2\right) \\ \textrm{imp}_i&\sim \mathcal {N}\left( \frac{b_i^2}{b_i^2+c_i^2} \cdot \textrm{volume}_i, {\sigma ^\textrm{imp}_i}^2\right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cv}_i&\sim \textrm{Binomial}\left( \textrm{click}_i, \textrm{CVR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( \alpha _i \cdot b_i \cdot \textrm{click}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(2)
$$\begin{aligned} \begin{aligned} \textrm{imp}_i&\sim \mathcal {N}\left( a_{3i}b_i^3 + a_{2i}b_i^2 + a_{1i}b_i + a_{0i}, {\sigma ^\textrm{imp}_i}^2 \right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cv}_i&\sim \textrm{Binomial}\left( \textrm{click}_i, \textrm{CVR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( \alpha _i \cdot b_i \cdot \textrm{click}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \textrm{volume}_i&\sim \mathcal {N}\left( \mu ^\textrm{vol}_i, {\sigma ^\textrm{vol}_i}^2\right) \\ \textrm{imp}_i&\sim \mathcal {N}\left( \frac{b_i^2}{b_i^2+c_i^2} \cdot \textrm{volume}_i, {\sigma ^\textrm{imp}_i}^2\right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cv}_i&\sim \textrm{Binomial}\left( \textrm{click}_i, \textrm{CVR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( (\alpha _i \cdot b_i + \beta _i ) \cdot \textrm{click}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} \begin{aligned} \textrm{imp}_i&\sim \mathcal {N}\left( a_{3i}b_i^3 + a_{2i}b_i^2 + a_{1i}b_i + a_{0i}, {\sigma ^\textrm{imp}_i}^2 \right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cv}_i&\sim \textrm{Binomial}\left( \textrm{click}_i, \textrm{CVR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( (\alpha _i \cdot b_i + \beta _i ) \cdot \textrm{click}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(5)
Fig. 5
figure 5

Bayesian network with likelihood function 3

Fig. 6
figure 6

Bayesian network with likelihood function 4

In addition to the aforementioned advertising metrics, the model includes the search volume for each keyword, which is a metric that cannot be observed by the advertiser, and the keyword-specific parameters \(\mu ^\textrm{vol}\), \(\sigma ^\textrm{vol}\), \(\sigma ^\textrm{imp}\), c, \(\textrm{CTR}\), \(\textrm{CVR}\), \(\alpha \), and \(\sigma ^\textrm{cost}\). The process of generating the number of clicks and conversions for keywords is the same for all likelihood functions. In these formulations, the number of clicks and conversions is generated by a binomial distribution based on the click-through rate (CTR) and conversion rate (CVR) for each keyword. The assumption that each keyword has a specific CTR and CVR is natural, given that there has been much research on CTR and CVR prediction in the Internet advertising field. The characteristics of each likelihood function are summarized in Table 1. Differences occur due to the changes made in the process of generating impressions and ad costs. The process of generating impressions in likelihood functions 1 and 3 takes into account search volume, which is information that cannot be observed by the advertiser. Since the search volume of a keyword indicates the number of times the keyword matches the search phrase, it is assumed to be generated from a normal distribution in which \(\mu ^\textrm{vol}_i\), indicating the keyword’s searchability, is the mean and \(\sigma ^\textrm{vol}_i\), indicating the variation in the search volume of the keyword, is the standard deviation. We use \(\frac{b_i^2}{b_i^2+c_i^2}\) as the ratio of the average number of impressions to the search volume (corresponding to the winning percentage of the auction), which is the formulation proposed by Zhang et al. [28]. In contrast, likelihood functions 2 and 4 assume that the number of impressions is generated from a normal distribution whose mean is a cubic function of bid prices. 
For the ad cost generation process, likelihood functions 1 and 2 assume that the ad cost per click is a constant multiple of the bid price, while likelihood functions 3 and 4 assume that it is a linear function of the bid price. In a generalized second-price auction, which is the auction format used on many advertising platforms, if the advertiser's quality score for an impression involving a click on keyword i is \(QS_i\), and the quality score and bid price of the competitor one rank below the advertiser are \(QS^{\prime }\) and \(b^\prime \), then the ad cost incurred for the click is \(\frac{QS^\prime \cdot b^\prime }{QS_i}\). Although quality scores and competitor bid prices are not disclosed, likelihood functions 1 and 2 approximate this cost as a constant multiple of the bid price (\(\alpha _i \cdot b_i\)), and likelihood functions 3 and 4 approximate it as a linear function of the bid price (\(\alpha _i \cdot b_i + \beta _i\)).

Table 1 Characteristics of each likelihood function
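The generative story of likelihood function 1 (Eq. (2)) can be made concrete with a forward simulation. All parameter values below are arbitrary illustrations, and the truncation of the normal draws to nonnegative values is our addition, since the normal distribution can produce negative samples:

```python
import random

def simulate_keyword_day(b, mu_vol=1000.0, sigma_vol=50.0, c=2.0,
                         sigma_imp=20.0, ctr=0.03, cvr=0.1,
                         alpha=0.8, sigma_cost=1.0, rng=random):
    """Draw one day's observable metrics for a keyword at bid price b
    under likelihood function 1 (Eq. (2))."""
    volume = rng.gauss(mu_vol, sigma_vol)       # latent search volume
    win_rate = b * b / (b * b + c * c)          # auction win ratio [28]
    imp = max(0, round(rng.gauss(win_rate * volume, sigma_imp)))
    clicks = sum(rng.random() < ctr for _ in range(imp))    # Binomial(imp, CTR)
    convs = sum(rng.random() < cvr for _ in range(clicks))  # Binomial(click, CVR)
    cost = max(0.0, rng.gauss(alpha * b * clicks, sigma_cost))
    return imp, clicks, convs, cost

random.seed(0)
imp, clicks, convs, cost = simulate_keyword_day(b=2.0)
```

The nesting of the draws mirrors the Bayesian network in Fig. 3: conversions are conditioned on clicks, which are conditioned on impressions, which are conditioned on the latent search volume and the bid.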

The two prior distributions considered in this paper are prior distribution A, expressed by Eq. (6), and prior distribution B, expressed by Eq. (7). Note that these are the prior distributions for keyword i, where \(\textrm{Beta}(\alpha , \beta )\) represents the beta distribution with shape parameters \(\alpha , \beta \), \(\tilde{\mu }, \tilde{\alpha }, \tilde{\beta }\) are parameters estimated by maximum likelihood from the first round of logs, and C is a hyperparameter. Prior distribution A is designed to have as little effect as possible on the posterior distribution. A prior that has no effect on the posterior is called a non-informative prior, but it is difficult to implement non-informative priors for complex likelihood functions using Bayesian modeling packages; prior distribution A is therefore not strictly non-informative and does have some effect on the posterior. Prior distribution B determines some parameters by maximum likelihood estimation and uses a normal distribution with large variance for the others. Maximum likelihood estimation is utilized to improve the fit of the posterior distribution to the data used in the exploration and optimization subroutines. Since \(a_{0i}\), \(a_{1i}\), \(a_{2i}\), \(a_{3i}\), \(\alpha _i\), and \(\beta _i\) are latent variables that may take negative values and thus cannot have their maximum-likelihood estimates used as standard deviations of normal distributions, the same distributions as in prior distribution A are used as the priors of these latent variables.

$$\begin{aligned} \begin{aligned} \mu ^\textrm{vol}_i&= |\mathcal {N}(0, 10^6)| \\ \sigma ^\textrm{vol}_i&= |\mathcal {N}(0, 10^6)| \\ \sigma ^\textrm{imp}_i&= |\mathcal {N}(0, 10^6)| \\ c_i&= |\mathcal {N}(0, 10^6)|\\ a_{0i}&= \mathcal {N}(0, 10^6)\\ a_{1i}&= \mathcal {N}(0, 10^6)\\ a_{2i}&= \mathcal {N}(0, 10^6)\\ a_{3i}&= \mathcal {N}(0, 10^6)\\ \textrm{CTR}_i&= \textrm{Beta}(1, 1) \\ \textrm{CVR}_i&= \textrm{Beta}(1, 1) \\ \alpha _i&= \mathcal {N}(0, 10^6) \\ \beta _i&= \mathcal {N}(0, 10^6)\\ \sigma ^\textrm{cost}_i&= |\mathcal {N}(0, 10^6)|. \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \mu ^\textrm{vol}_i&= \left| \mathcal {N}\left( \tilde{\mu }_1, \left( \tilde{\mu }_1/C\right) ^2\right) \right| \nonumber \\ \sigma ^\textrm{vol}_i&= \left| \mathcal {N}\left( \tilde{\mu }_2, \left( \tilde{\mu }_2/C\right) ^2\right) \right| \nonumber \\ \sigma ^\textrm{imp}_i&= \left| \mathcal {N}\left( \tilde{\mu }_3, \left( \tilde{\mu }_3/C\right) ^2\right) \right| \nonumber \\ c_i&= \left| \mathcal {N}\left( \tilde{\mu }_4, \left( \tilde{\mu }_4/C\right) ^2\right) \right| \nonumber \\ a_{0i}&= \mathcal {N}(0, 10^6)\nonumber \\ a_{1i}&= \mathcal {N}(0, 10^6)\nonumber \\ a_{2i}&= \mathcal {N}(0, 10^6)\nonumber \\ a_{3i}&= \mathcal {N}(0, 10^6)\nonumber \\ \textrm{CTR}_i&= \textrm{Beta}(\tilde{\alpha }_1, \tilde{\beta }_1) \nonumber \\ \textrm{CVR}_i&= \textrm{Beta}(\tilde{\alpha }_2, \tilde{\beta }_2) \nonumber \\ \alpha _i&= \mathcal {N}(0, 10^6)\nonumber \\ \beta _i&= \mathcal {N}(0, 10^6)\nonumber \\ \sigma ^\textrm{cost}_i&= \left| \mathcal {N}\left( \tilde{\mu }_5, \left( \tilde{\mu }_5/C\right) ^2\right) \right| . \end{aligned}$$
(7)

Since the probability model used in the proposed method is too complex for the posterior distribution to be calculated analytically, we approximate it by stochastic variational inference [29]. Although the Markov chain Monte Carlo (MCMC) method [30] could also be used to approximate the posterior, differences in the detailed shape of the posterior computed by the update subroutine do not significantly affect the performance of the proposed method, so there is no need to obtain the exact shape of the posterior distribution. We therefore use stochastic variational inference from the viewpoint of computation time. In the stochastic model presented in this paper, \(\mu ^\textrm{vol}_i\), \(\sigma ^\textrm{vol}_i\), \(a_{0i}\), \(a_{1i}\), \(a_{2i}\), \(a_{3i}\), \(\sigma ^\textrm{imp}_i\), \(\sigma ^\textrm{cost}_i\), \(\alpha _i\), and \(\beta _i\) are approximated by normal distributions, \(\textrm{CTR}_i\) and \(\textrm{CVR}_i\) are approximated by beta distributions, and the variational parameters are optimized. From the second round onward, we reuse the posterior distribution from the previous round as the prior distribution and perform Bayesian estimation only on the newly obtained data. This approach, called sequential learning, speeds up the computation because less data is used when computing the posterior distribution.
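Sequential learning is easiest to see in a conjugate special case: the Beta posterior for CTR after one round becomes the prior for the next, so each update touches only the new day's data. This is a simplified conjugate stand-in for the variational updates actually used in the method:

```python
def update_ctr_posterior(alpha, beta, clicks, impressions):
    """Beta(alpha, beta) prior + Binomial(impressions, CTR) likelihood
    gives a Beta(alpha + clicks, beta + impressions - clicks) posterior."""
    return alpha + clicks, beta + (impressions - clicks)

# Round 1: start from the flat Beta(1, 1) prior of prior distribution A.
a, b = update_ctr_posterior(1.0, 1.0, clicks=3, impressions=100)
# Round 2: reuse the round-1 posterior as the prior (sequential learning).
a, b = update_ctr_posterior(a, b, clicks=5, impressions=120)
mean_ctr = a / (a + b)  # posterior mean of CTR after two rounds
```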

Our Bayesian AdComB framework can be flexibly extended by modifying the model. For example, the framework can be utilized for bid price optimization in forms other than sponsored search advertising and also for using prior knowledge, such as relationships among metrics.

4.2 Exploration Subroutine

In this subroutine, functions \(\hat{c}_{i}^t(b)\) and \(\hat{u}_{i}^t(b)\) are determined from the predicted distribution of conversions and advertising cost for each bid price candidate. We use bandit algorithms adapted to the setting of this study. The bandit problem collectively refers to the problem of repeatedly selecting one of several candidates whose expected reward is unknown, with the goal of maximizing the cumulative reward. Equation (1) can be regarded as a problem of maximizing the sum of the cumulative rewards of N bandit problems with an overall budget constraint, which is basically a variant of a combinatorial semi-bandit problem [19]. In a bandit problem, a tradeoff called the exploration-exploitation dilemma arises: if the optimal point among the data obtained so far continues to be selected, the globally optimal solution may not be reached, and if a point with a small amount of data and high uncertainty continues to be selected, the cumulative reward will be reduced. Bandit algorithms such as UCB [31] and Thompson sampling [32] create a balance between exploration and exploitation by assuming a distribution of rewards and making selections based on its mean and variance.

We propose two algorithms to determine the function value of the conversion function \(\hat{u}_{i}\), which corresponds to the reward in the bandit problem. The first algorithm, Bayesian AdComB Percentile (Bayesian AdComB-PT), uses the qth percentile of the distribution, while the second algorithm, Bayesian AdComB Thompson sampling (Bayesian AdComB-TS), uses randomly sampled values from the distribution (q is a hyperparameter). Function \(\hat{u}_{i}\) is determined by Eq. (8) for Bayesian AdComB-PT and Eq. (9) for Bayesian AdComB-TS. Note that \(\lceil \mathcal {D} \rceil ^q\) represents the qth percentile of the distribution \(\mathcal {D}\), and \(\mathcal {M}_i(u_i(b))\) represents the distribution at bid price b of the conversion function predicted by the model \(\mathcal {M}_i\).

$$\begin{aligned} \hat{u}_{i}^\textrm{PT}(b) = \lceil \mathcal {M}_i[u_i(b)] \rceil ^q, \end{aligned}$$
(8)
$$\begin{aligned} \hat{u}_{i}^\textrm{TS}(b) \sim \mathcal {M}_i[u_i(b)]. \end{aligned}$$
(9)

To reduce the risk of violating the budget constraint in the aforementioned uncertain environment, we adopt the 60th percentile value as the advertising cost function, that is, \(\hat{c}_{i}(b) = \lceil \mathcal {M}_i[c_i(b)] \rceil ^{60}\). Since this number is a parameter, we can control how strictly the budget constraint is adhered to by changing it. For example, we could decrease the risk of running out of budget by making this number larger at only the end of the month.
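Both selection rules reduce to simple operations on posterior samples. The sketch below uses a nearest-rank percentile and arbitrary sample values for illustration; it is not the authors' implementation:

```python
import random

def percentile_value(samples, q):
    """qth percentile of posterior samples (nearest-rank on sorted values)."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(round(q / 100.0 * (len(s) - 1))))
    return s[idx]

def select_u_hat(posterior_samples, mode, q=90, rng=random):
    """Bayesian AdComB-PT takes an optimistic percentile of the posterior;
    Bayesian AdComB-TS draws one random posterior sample (Thompson
    sampling)."""
    if mode == "PT":
        return percentile_value(posterior_samples, q)
    return rng.choice(posterior_samples)

samples = [2.0, 3.0, 1.0, 4.0, 5.0]       # hypothetical posterior draws
u_pt = select_u_hat(samples, "PT", q=90)  # optimistic conversion estimate
c_hat = percentile_value(samples, 60)     # conservative cost estimate
```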

4.3 Optimization Subroutine

In the optimization subroutine (line 8 of Algorithm 1), optimization is performed using the functions determined in the exploration subroutine. For the functions \(c_{i}(b), u_{i}(b)\) in optimization problem (1), the formulation of our setting, we cannot make any assumptions about function shape (e.g., convexity), so it is not easy to obtain globally optimal solutions with continuous optimization methods. Therefore, we solve optimization problem (10), in which bid prices are discretized. Note that \(\mathcal {B}\) denotes the set of possible bid price candidates. The globally optimal solution of this problem is expected to be close to that of the original problem (1).

$$\begin{aligned} \begin{aligned}&\max _{b_i^t}{} & {} \sum ^T_{t=1}\sum ^N_{i=1} u_{i}(b_{i}^t) \\&\text {subject to}{} & {} \sum ^T_{t=1}\sum ^N_{i=1} c_{i}(b_{i}^t) \le B ,\\{} & {} {}&b_i^t \in \mathcal {B} ~~~~~~\forall i \in [N], \forall t \in [T]. \end{aligned} \end{aligned}$$
(10)

Since the proposed method is an online optimization method in which prediction and optimization are performed sequentially, we consider the optimization in the tth round. Let \(\hat{c}_{i}^t(b), \hat{u}_{i}^t(b)\) be the ad cost and conversion functions obtained by the update and exploration subroutines in the tth round. Since the budget B covers all rounds, a budget per round must be set for the optimization. The opportunity loss can be serious if an ad cannot be displayed for some period of time. For balance, it is therefore reasonable to set the budget for the day being optimized to the remaining budget \(\bar{B}\) divided by the number of days remaining in the month. The optimization problem for the tth round is formulated as

$$\begin{aligned} &\max _{b_i^t} \quad \sum ^N_{i=1} \hat{u}_i^t(b_{i}^t) \\ &\text {subject to} \quad \sum ^N_{i=1} \hat{c}_i^t(b_{i}^t) \le \frac{\bar{B}}{T-t+1}, \\ &\qquad \qquad \quad b_i^t \in \mathcal {B} \quad \forall i \in [N]. \end{aligned}$$
(11)
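The per-round budget on the right-hand side of (11) is simple even pacing; a one-line sketch (the function name is ours):

```python
def per_round_budget(remaining_budget, total_rounds, t):
    # Spread the remaining budget evenly over the rounds still to come;
    # t is 1-indexed, so total_rounds - t + 1 rounds remain (including round t).
    assert 1 <= t <= total_rounds
    return remaining_budget / (total_rounds - t + 1)
```

For a 30-round month with 300 budget units left at round 1 this allocates 10 per round; underspending in early rounds automatically raises later allocations.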

Furthermore, using the variable \(z_i^j\), which takes the value 1 if \(b_i^t=b_j\) (that is, if the bid price of the ith keyword is the jth element of \(\mathcal {B}\)) and 0 otherwise, the problem can be transformed into optimization problem (12):

$$\begin{aligned} &\max _{z_i^j} \quad \sum ^N_{i=1} \sum ^{|\mathcal {B}|}_{j=1}\hat{u}_{i}^j z_i^j\\ &\text {subject to} \quad \sum ^N_{i=1} \sum ^{|\mathcal {B}|}_{j=1}\hat{c}_{i}^j z_i^j \le \frac{\bar{B}}{T-t+1},\\ &\qquad \qquad \quad \sum ^{|\mathcal {B}|}_{j=1}z_i^j = 1 \quad \forall i \in [N],\\ &\qquad \qquad \quad z_i^j \in \{0, 1\} \quad \forall i \in [N],~\forall j \in [|\mathcal {B}|]. \end{aligned}$$
(12)

Note that we define the constants \(\hat{u}_{i}^j, \hat{c}_{i}^j\) as \(\hat{u}_{i}^j:= \hat{u}_{i}^t(b_j), \hat{c}_{i}^j:= \hat{c}_{i}^t(b_j)\). Problem (12) is an integer programming problem, and we use the Gurobi optimizer [33], a general-purpose integer programming solver, to solve it. Gurobi uses the branch-and-cut method to obtain exact solutions quickly. In our setting, even with \(N=1000\) and \(|\mathcal {B}|=1000\), the optimization completes within 40 s.
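Problem (12) is a multiple-choice knapsack: exactly one bid index per keyword under a single budget constraint. Besides a general-purpose solver, small instances can be solved exactly by dynamic programming over an integer-valued budget. The following is our own illustrative sketch, not the paper's Gurobi implementation:

```python
def solve_bid_assignment(u, c, budget):
    # u[i][j], c[i][j]: estimated conversions and (integer) ad cost for
    # keyword i at bid candidate j; budget: integer per-round budget.
    # Returns (max total conversions, chosen bid index per keyword).
    NEG = float("-inf")
    dp = [0.0] * (budget + 1)   # best value with total cost <= b, no keywords yet
    picks = []                  # picks[i][b]: bid index chosen for keyword i
    for ui, ci in zip(u, c):
        new_dp = [NEG] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            for j, (uj, cj) in enumerate(zip(ui, ci)):
                if cj <= b and dp[b - cj] != NEG and dp[b - cj] + uj > new_dp[b]:
                    new_dp[b], pick[b] = dp[b - cj] + uj, j
        dp = new_dp
        picks.append(pick)
    if dp[budget] == NEG:
        return None, []         # no feasible assignment within the budget
    bids, b = [], budget
    for i in reversed(range(len(u))):
        j = picks[i][b]
        bids.append(j)
        b -= c[i][j]
    return dp[budget], bids[::-1]
```

The DP runs in O(N·|B|·budget) time, so it is practical only when costs can be discretized coarsely; a branch-and-cut solver such as the one used in the paper avoids this dependence on the budget value.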

5 Numerical Experiments

This section describes the numerical experiments we conducted to evaluate the effectiveness of the proposed method.

5.1 Setup

We built two simulators for the numerical experiments: a sponsored search simulator and an iPinYou simulator. The sponsored search simulator was created on the basis of analysis of the real log data of sponsored search advertising. As in actual ad operations, when we input bid prices for each keyword into this simulator, it outputs ad impressions, clicks, conversions, and ad costs for each keyword. We used ad log data from negocia, Inc., a Japanese company that provides an automated Internet advertising management system.

The iPinYou simulator is a version of the iPinYou dataset [3] modified into a form usable in our setting. The dataset was originally used in a competition organized by iPinYou, a Chinese advertising company. It consists of log data from multiple advertisers’ display ads and includes the winning bid prices from auctions, which are generally private, allowing a simulator to be built. However, because the data come from display ads rather than sponsored search ads, there were several differences from our setting, and the simulator was constructed by preprocessing the dataset. First, since there are no keywords in the iPinYou dataset, we split the logs of one advertiser, treated each split as a keyword, and used the logs of multiple advertisers to replicate a setup with multiple keywords in one campaign. Next, we changed the objective from conversion maximization to click maximization and modified our proposed method and the baseline algorithms accordingly. This is because the dataset incurs advertising costs per impression, and very few conversions are logged (some advertisers have none at all). As probabilistic models for the update subroutine of the proposed method in the iPinYou simulator, we considered four likelihood functions and two prior distributions. The four likelihood functions are likelihood function \(1^\prime \), expressed by Fig. 7 and Eq. (13); likelihood function \(2^\prime \), expressed by Fig. 8 and Eq. (14); likelihood function \(3^\prime \), expressed by Fig. 9 and Eq. (15); and likelihood function \(4^\prime \), expressed by Fig. 10 and Eq. (16).

$$\begin{aligned} \begin{aligned} \textrm{volume}_i&\sim \mathcal {N}\left( \mu ^\textrm{vol}_i, {\sigma ^\textrm{vol}_i}^2\right) \\ \textrm{imp}_i&\sim \mathcal {N}\left( \frac{b_i^2}{b_i^2+c_i^2} \cdot \textrm{volume}_i, {\sigma ^\textrm{imp}_i}^2\right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( \alpha _i \cdot b_i \cdot \textrm{imp}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(13)
$$\begin{aligned} \begin{aligned} \textrm{imp}_i&\sim \mathcal {N}\left( a_{3i}b_i^3 + a_{2i}b_i^2 + a_{1i}b_i + a_{0i}, {\sigma ^\textrm{imp}_i}^2 \right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( \alpha _i \cdot b_i \cdot \textrm{imp}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(14)
$$\begin{aligned} \begin{aligned} \textrm{volume}_i&\sim \mathcal {N}\left( \mu ^\textrm{vol}_i, {\sigma ^\textrm{vol}_i}^2\right) \\ \textrm{imp}_i&\sim \mathcal {N}\left( \frac{b_i^2}{b_i^2+c_i^2} \cdot \textrm{volume}_i, {\sigma ^\textrm{imp}_i}^2\right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( (\alpha _i \cdot b_i + \beta _i ) \cdot b_i \cdot \textrm{imp}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned} \textrm{imp}_i&\sim \mathcal {N}\left( a_{3i}b_i^3 + a_{2i}b_i^2 + a_{1i}b_i + a_{0i}, {\sigma ^\textrm{imp}_i}^2 \right) \\ \textrm{click}_i&\sim \textrm{Binomial}\left( \textrm{imp}_i, \textrm{CTR}_i\right) \\ \textrm{cost}_i&\sim \mathcal {N}\left( (\alpha _i \cdot b_i + \beta _i ) \cdot b_i \cdot \textrm{imp}_i, {\sigma ^\textrm{cost}_i}^2\right) . \end{aligned} \end{aligned}$$
(16)
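To make the generative story of Eq. (13) concrete, a forward sampler for one keyword can be written as follows. This is a hedged sketch: the parameter values and the clipping of impressions to non-negative integers are our own choices for illustration, not values from the paper:

```python
import numpy as np

def simulate_metrics(b, rng, mu_vol=1e4, sigma_vol=1e3, c=0.5,
                     sigma_imp=50.0, ctr=0.02, alpha=0.8, sigma_cost=10.0):
    # One forward draw from likelihood function 1' (Eq. 13) at bid price b.
    volume = rng.normal(mu_vol, sigma_vol)              # latent search volume
    mean_imp = b**2 / (b**2 + c**2) * volume            # saturating share of volume
    imp = max(int(rng.normal(mean_imp, sigma_imp)), 0)  # impressions
    click = rng.binomial(imp, ctr)                      # clicks given impressions
    cost = rng.normal(alpha * b * imp, sigma_cost)      # cost scales with bid x imps
    return imp, click, cost
```

The saturating factor \(b^2/(b^2+c^2)\) caps impressions at the latent search volume as the bid grows; likelihood function \(2^\prime\) instead models impressions with a cubic polynomial in the bid price.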
Fig. 7

Bayesian network with likelihood function \(1^\prime \)

Fig. 8

Bayesian network with likelihood function \(2^\prime \)

Common to all likelihood functions is the process generating the number of clicks on keywords, which is the same as in the case of sponsored search advertisements discussed in Section 4.1. The characteristics of each likelihood function are summarized in Table 2. The same formulation as in Section 4.1 was used for the process generating the number of impressions. For the ad cost generation process, we focused on the ad cost per impression, since ad costs are charged per impression. In particular, in likelihood functions \(3^\prime \) and \(4^\prime \), the ad cost per impression is assumed to be a quadratic function of the bid price. This is based on our analysis of the iPinYou data, in which we found that the ad cost per impression divided by the bid price can be approximated as a linear function of the bid price.

Fig. 9

Bayesian network with likelihood function \(3^\prime \)

The two prior distributions considered in this paper are prior distribution A\(^\prime \), expressed by Eq. (17), and prior distribution B\(^\prime \), expressed by Eq. (18). These distributions were designed with the same intention as in Section 4.1.

$$\begin{aligned} \begin{aligned} \mu ^\textrm{vol}_i&= |\mathcal {N}(0, 10^6)| \\ \sigma ^\textrm{vol}_i&= |\mathcal {N}(0, 10^6)| \\ \sigma ^\textrm{imp}_i&= |\mathcal {N}(0, 10^6)| \\ c_i&= |\mathcal {N}(0, 10^6)|\\ a_{0i}&= \mathcal {N}(0, 10^6)\\ a_{1i}&= \mathcal {N}(0, 10^6)\\ a_{2i}&= \mathcal {N}(0, 10^6)\\ a_{3i}&= \mathcal {N}(0, 10^6)\\ \textrm{CTR}_i&= \textrm{Beta}(1, 1) \\ \alpha _i&= \mathcal {N}(0, 10^6) \\ \beta _i&= \mathcal {N}(0, 10^6)\\ \sigma ^\textrm{cost}_i&= |\mathcal {N}(0, 10^6)|. \end{aligned} \end{aligned}$$
(17)
$$\begin{aligned} \mu ^\textrm{vol}_i&= \left| \mathcal {N}\left( \tilde{\mu }_1, \left( \tilde{\mu }_1/C\right) ^2\right) \right| \\ \sigma ^\textrm{vol}_i&= \left| \mathcal {N}\left( \tilde{\mu }_2, \left( \tilde{\mu }_2/C\right) ^2\right) \right| \\ \sigma ^\textrm{imp}_i&= \left| \mathcal {N}\left( \tilde{\mu }_3, \left( \tilde{\mu }_3/C\right) ^2\right) \right| \\ c_i&= \left| \mathcal {N}\left( \tilde{\mu }_4, \left( \tilde{\mu }_4/C\right) ^2\right) \right| \\ a_{0i}&= \mathcal {N}(0, 10^6)\\ a_{1i}&= \mathcal {N}(0, 10^6)\\ a_{2i}&= \mathcal {N}(0, 10^6)\\ a_{3i}&= \mathcal {N}(0, 10^6)\\ \textrm{CTR}_i&= \textrm{Beta}(\tilde{\alpha }_1, \tilde{\beta }_1) \\ \alpha _i&= \mathcal {N}(0, 10^6)\\ \beta _i&= \mathcal {N}(0, 10^6)\\ \sigma ^\textrm{cost}_i&= \left| \mathcal {N}\left( \tilde{\mu }_5, \left( \tilde{\mu }_5/C\right) ^2\right) \right| . \end{aligned}$$
(18)
Fig. 10

Bayesian network with likelihood function \(4^\prime \)

Table 2 Characteristics of each likelihood function

We set the simulation parameters to reflect an actual ad operation environment. The parameters include the simulator type, total number of rounds T, number of keywords N, number of bid price candidates \(|\mathcal {B}|\), budget B, and number of rounds of accumulated logs \(T^\prime \). We conducted several experiments in which these settings were varied. The parameters of the experiments with the sponsored search simulator are shown in Table 3 and those with the iPinYou simulator in Table 4. The budget B is expressed as a ratio of the ad cost incurred when the bid price for all keywords is set to the maximum in all rounds. For the accumulated logs, bid prices were determined by a random algorithm, and the same logs were used for all algorithms. For simulations in which the accumulated logs come from an algorithm other than Random, refer to Section 5.3.5.

Table 3 Parameters of the experiment with the sponsored search simulator
Table 4 Parameters of the experiment with the iPinYou simulator

As the evaluation metric, we use the total number of conversions on the simulator for the sponsored search simulator and the total number of clicks for the iPinYou simulator. We ran each experiment five times with different random states for all settings and averaged the results. Our method was implemented entirely in Python, and the Bayesian modeling package NumPyro [34] was used for Bayesian inference.

5.2 Algorithms

We compared the following nine algorithms in our experiments.

  • Random is an algorithm that randomly selects bid prices.

  • Greedy is an algorithm that solves an optimization problem using previous log data to determine bid prices. In this algorithm, only bid prices that have been previously selected are used as candidates for optimization. In other words, optimization problem (12) is solved with bid prices that have never been selected excluded from the variables. For the \(\hat{u}_i^j, \hat{c}_i^j\) of previously selected bid prices, we used the average number of conversions and average ad costs in the log data.

  • \(\varvec{\epsilon }\)-greedy is an algorithm that selects a random bid price with probability \(\epsilon \) and, with probability \(1 - \epsilon \), solves the optimization problem using the log data accumulated so far, in the same way as Greedy. In other words, the algorithm exploits with probability \(1 - \epsilon \) and explores with probability \(\epsilon \). We adopted \(\epsilon =0.1\) in our experiments.

  • k-NN is an optimization method that predicts the number of conversions and ad costs for each bid price candidate using the k-nearest neighbor method [35]. The optimization procedure is the same as the optimization subroutine in Section 4.3. We set the hyperparameter k (the number of neighbors used) to \(k=10\) in our experiments. In addition, k-NN selects bid prices randomly while fewer than k data points have been accumulated.

  • AdComB-UCB and AdComB-TS are modifications of the algorithms proposed by Nuara et al. [17]. These algorithms, like the proposed method, consist of three subroutines: an update subroutine, an exploration subroutine, and an optimization subroutine. In the update subroutine, the distributions of conversions and ad costs are estimated by a Gaussian process with an RBF kernel instead of Bayesian estimation. In the selection subroutine, the mean and variance obtained from the Gaussian process are used to select the function, with UCB for AdComB-UCB and Thompson sampling for AdComB-TS. We modified the formulation and algorithm because, unlike in [17], a daily budget for each bidding target cannot be set in our setting. While [17] estimates the number of clicks and the click value (equivalent to the conversion rate in our setting) with a Gaussian process in the update subroutine and performs optimization with dynamic programming in the optimization subroutine, AdComB-UCB and AdComB-TS estimate the number of conversions and ad costs with a Gaussian process and perform optimization with a general-purpose integer programming solver.

  • Bayesian AdComB-MEAN is a modification of our proposed method to take the average in the exploration subroutine.

  • Bayesian AdComB-PT and Bayesian AdComB-TS are the algorithms of our proposed method. The same hyperparameters were used for Bayesian AdComB-MEAN, Bayesian AdComB-PT, and Bayesian AdComB-TS. We optimized the variational parameters using Adam [36] with a learning rate of 0.001 and 100,000 steps. The number of samples for the latent variable in the exploration subroutine was set to 10,000, the number of samples for ad cost and number of conversions (clicks in the case of the iPinYou simulator) for each bid price to 10,000, and \(q=70\) for Bayesian AdComB-PT.
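The difference between the three Bayesian AdComB variants lies only in how the exploration (selection) subroutine turns posterior samples into the function values passed to the optimizer. A minimal sketch under our own naming (the paper's implementation works with NumPyro posteriors; here `samples` is simply an array of draws):

```python
import numpy as np

def select_function(samples, mode="PT", q=70, rng=None):
    # samples: posterior draws of a metric at each candidate bid price,
    # shape (num_draws, num_candidates).
    if mode == "MEAN":                       # Bayesian AdComB-MEAN
        return samples.mean(axis=0)
    if mode == "PT":                         # Bayesian AdComB-PT (q=70 in the paper)
        return np.percentile(samples, q, axis=0)
    if mode == "TS":                         # Bayesian AdComB-TS: one joint draw
        rng = rng or np.random.default_rng()
        return samples[rng.integers(len(samples))]
    raise ValueError(f"unknown mode: {mode}")
```

A percentile above 50 acts optimistically for the conversion function (or conservatively for the cost function, as with the 60th-percentile cost estimate), while TS injects randomness by committing to a single posterior draw.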

Table 5 WAIC in log data for setting 1

5.3 Results

In this section, we present and discuss the results of two preliminary experiments and three simulations.

5.3.1 Preliminary Experiment 1: Comparison of Probabilistic Models

In the update subroutine of the proposed method described in Section 4.1, the probabilistic model (prior distribution and likelihood function) in the process of generating advertising metrics is selected from several candidates based on the WAIC in the log data, provided that accumulated log data exist. In Preliminary Experiment 1, we measure the WAIC of each probabilistic model on the log data available in each setting. The Python library ArviZ [37] was used to calculate the WAIC.

The WAICs for all combinations of likelihood functions and prior distributions in the sponsored search simulator considered in Section 4.1 are shown in Tables 5, 6, and 7, which use the log data available for settings 1, 2, and 3, respectively, with the best WAIC shown in bold. As the log data for settings 4 and 5 are the same as those for setting 1, and setting 6 has only one round of log data, no experiments were performed on the log data for those settings. Note that \(C=3\) was used as the hyperparameter for prior distribution B.

Table 6 WAIC in log data for setting 2
Table 7 WAIC in log data for setting 3

Among the three settings examined in the experiment, likelihood function 3 with prior distribution B had the best WAIC, and likelihood functions 3 and 4 tended to have a better WAIC than likelihood functions 1 and 2. This can be attributed to the high expressiveness of the ad cost generation process. WAIC tended to be better for likelihood function 3 than for likelihood function 4 and for likelihood function 1 than for likelihood function 2. This finding demonstrates the validity of modeling that considers search volume (an advertising metric that is not actually observed) as a latent variable. As for prior distributions, prior distribution B tended to have a better WAIC than prior distribution A for all likelihood functions, presumably because some parameters were determined by maximum likelihood estimation, resulting in a better fit to the data.

The WAIC results for all combinations of likelihood functions and prior distributions in the iPinYou simulator considered in Section 5.1 are listed in Tables 8–12. The log data available in settings A–E correspond to Tables 8, 9, 10, 11, and 12, respectively, with the best WAIC shown in bold.

The design of the probabilistic models in the iPinYou simulator was similar to that of the sponsored search simulator. However, whereas the WAIC trends for each probabilistic model in the sponsored search simulator were similar across all settings of log data, in the iPinYou simulator they differed across settings. In terms of the overall trend, as in the sponsored search simulator, the WAIC of likelihood function 3\(^\prime \) was better than that of likelihood function 4\(^\prime \), the WAIC of likelihood function 1\(^\prime \) was better than that of likelihood function 2\(^\prime \), and the WAIC of prior distribution B\(^\prime \) was better than that of prior distribution A\(^\prime \). However, the opposite trend was observed for the first comparison in setting A and for the prior distributions in some of the likelihood functions in settings A, D, and E. This difference is presumably due to differences in the distribution of advertising metrics across the data.

Utilizing the above experimental results as a basis, we use likelihood function 3 with prior distribution B in settings 1–6, likelihood function 4\(^\prime \) with prior distribution B\(^\prime \) in setting A, likelihood function 1\(^\prime \) with prior distribution B\(^\prime \) in setting B, likelihood function 3\(^\prime \) with prior distribution B\(^\prime \) in settings C and D, and likelihood function 3\(^\prime \) with prior distribution A\(^\prime \) in setting E.

Table 8 WAIC in log data for setting A
Table 9 WAIC in log data for setting B

5.3.2 Preliminary Experiment 2: Computation Time for Optimization

The proposed method formulates the bid price optimization as integer programming problem (12) and solves it with a general-purpose integer programming solver. The same formulation is also used for all comparison methods except Random. Note that in the Greedy and \(\epsilon \)-greedy formulations, the selectable bid prices are limited to those that have been selected before. In Preliminary Experiment 2, we investigate how integer programming problem (12) scales by measuring the computation time of the optimization while varying the parameters that determine the problem size, namely, the number of keywords N and the number of bid price candidates \(|\mathcal {B}|\).

This experiment was conducted using a sponsored search simulator that allows the number of keywords to be easily changed. We fixed \(T=1, B=1/3\) and measured the time taken to find the optimal solution using a general-purpose integer programming solver for all combinations of \(N\in \{10, 50, 100, 1000\}, |\mathcal {B}|\in \{10, 20, 200, 1000\}\).

The estimated numbers of conversions and ad costs in integer programming problem (12), \(\hat{u}_{i}^j, \hat{c}_{i}^j\), were sampled according to Eq. (19), where \(u_i(b_j), c_i(b_j)\) are the true number of conversions and the true ad cost incurred when bidding on keyword i at bid price \(b_j\). Equation (19) indicates that the true values are subject to noise generated from a normal distribution with a standard deviation of one third of the true value; this noise represents estimation error. In the experiment, \(\hat{u}_{i}^j, \hat{c}_{i}^j\) were sampled five times, and optimization was performed for each sample.

$$\begin{aligned} \begin{aligned} \hat{u}_{i}^j&\sim \mathcal {N}\left( u_i(b_j), u_i(b_j)/3\right) \\ \hat{c}_{i}^j&\sim \mathcal {N}\left( c_i(b_j), c_i(b_j)/3\right) . \end{aligned} \end{aligned}$$
(19)
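Eq. (19) amounts to perturbing each true value with Gaussian noise whose standard deviation is one third of that value; a short sketch (the function name is ours):

```python
import numpy as np

def noisy_estimates(u_true, c_true, rng):
    # Eq. (19): perturb the true per-bid conversions and costs with Gaussian
    # noise whose standard deviation is one third of the true value.
    u_true = np.asarray(u_true, dtype=float)
    c_true = np.asarray(c_true, dtype=float)
    return rng.normal(u_true, u_true / 3), rng.normal(c_true, c_true / 3)
```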

The results representing the average number of seconds taken to find the optimal solution are shown in Table 13.

As we can see, the optimization subroutine computes quickly. In the actual operation of sponsored search advertising, the number of keywords handled is several hundred at most, and the number of bid price candidates is usually only a few dozen. Furthermore, in the setting of this paper, the bid price optimization algorithm is executed only once a day, so the computation times in Table 13 are sufficiently fast.

Table 10 WAIC in log data for setting C
Table 11 WAIC in log data for setting D

5.3.3 Experiment 1: Sponsored Search Simulator

The results of the experiments with the six settings of Table 3 using the sponsored search simulator are shown in Table 14. These results represent the improvement when the total number of conversions obtained by Random is normalized to 100%, with the best number in each setting shown in bold.

These results confirm that the proposed method, Bayesian AdComB-PT, performs better than the existing algorithms in all settings. Bayesian AdComB-MEAN, which modifies the proposed method to take the average in the exploration subroutine, also outperformed the existing algorithms, indicating that the framework of the proposed method is effective even apart from the selection subroutine. In contrast, Bayesian AdComB-TS, which selects points from the distribution by Thompson sampling, performed unstably and worse than Bayesian AdComB-MEAN on average. This is possibly because the variance of the distribution estimated by Bayesian inference is large, so the shape of the function determined by the bandit algorithm is far from the actual shape. This problem could likely be solved by imposing constraints on the sampling of points from the distribution.

Table 12 WAIC in log data for setting E
Table 13 Computation time of optimization subroutine for each parameter (s)

Among the existing methods, Greedy outperformed Random in all settings. This result strengthens the validity of the formulation of optimization problem (12), as Greedy can be regarded as using only the optimization subroutine of the proposed method. Greedy also outperformed \(\epsilon \)-greedy in all settings except setting 6. This suggests that a completely random search leads to a worse cumulative reward, as the search is already improved to some extent by the log data. k-NN, which predicts ad costs and numbers of conversions for all bid price candidates with the k-nearest neighbor method and thus accounts for the ordering of bid prices, performed the same optimization as the proposed method, but its performance was only comparable to that of Greedy. This is presumably due to the low accuracy, caused by high uncertainty, of k-nearest-neighbor predictions of ad costs and conversions at bid prices that have never been selected in the log data. While AdComB-UCB performed well in all settings, the performance of AdComB-TS was not stable across settings. This result mirrors that of the proposed Bayesian AdComB method and suggests that, in our problem setting, selecting function values by sampling within the framework of update, selection, and optimization subroutines used by these methods may increase the variance in performance.

Next, we examine the influence of the simulator hyperparameters. Note that the indices in this experiment are relative, so a precise discussion is difficult. In setting 2, which features an increased total number of keywords, Bayesian AdComB-TS and AdComB-TS achieve a higher relative performance among all methods than in the other settings. This may be because a larger number of keywords compensates for sampling bias. In setting 4, where the budget is halved, the improvement over Random is greater for all methods than in the other settings, presumably because the methods other than Random take the budget constraint into account. In setting 6, where there are fewer rounds in the log data, the performances of Greedy, \(\epsilon \)-greedy, AdComB-UCB, AdComB-TS, and Bayesian AdComB-TS are significantly lower, suggesting that these methods struggle when little log data is available. It is probably also necessary to increase the amount of exploration in this case, for example, by bidding a random bid price with a certain probability.

Table 14 Results of Experiment 1
Table 15 Results of Experiment 2

5.3.4 Experiment 2: iPinYou Simulator

The results of the experiments with the five settings of Table 4 using the iPinYou simulator are shown in Table 15. These results represent the improvement when the total number of clicks obtained by Random is normalized to 100%, with the best number in each setting shown in bold.

In Experiment 2, Bayesian AdComB-PT again performed well, although in setting A it underperformed Bayesian AdComB-MEAN. A possible cause is underspending of the budget due to overestimated ad costs. The instability of Bayesian AdComB-TS and AdComB-TS is the same as in Experiment 1 and is presumably likewise due to the large variance of the advertising metrics. The highest performance of AdComB-TS in setting B is presumably attributable to the same cause.

Overall, the smaller improvement over Random compared with Experiment 1 can be attributed to the irregular noise of the advertising metrics. Since the iPinYou simulator is built directly on real data, it is strongly influenced by factors that advertisers cannot observe (such as time-series factors, e.g., day of the week and time of day, and user attributes), which may make it difficult to approximate each advertising metric as a function of the bid price. Another possible factor is that in the iPinYou simulator, the number of impressions is the only observable intermediate advertising metric between the bid and the number of clicks in the objective function.

Table 16 Results of Experiment 3

5.3.5 Experiment 3: Disproportionate Log Data

In this experiment, we examine the performance of each algorithm when the policy that generated the available log data is not Random and the bid prices in the log data are therefore biased. Since the bid prices in available log data are rarely random in Internet advertising, the settings used in this experiment are realistic. The experiments were conducted for settings 1 and 5 of the sponsored search simulator and settings A and C of the iPinYou simulator. The log data were generated by bidding with Random for the first five rounds and with k-NN for the remaining rounds. The results are shown in Table 16. These results represent the improvement when the total number of conversions (clicks for the iPinYou simulator) obtained by Random is normalized to 100%, with the best number in each setting shown in bold. The results of the Random simulation do not depend on the log data policy, so they can be compared with the results for the same settings in Experiments 1 and 2.

The performance of each algorithm was lower in settings 1, A, and C than in the corresponding settings of Experiments 1 and 2. This may be because the log data were not generated by Random, so the bid prices in the logs are biased toward specific values and carry little information about other bid prices. On the other hand, for setting 5, there is little difference from the results of Experiment 1. This may be because the total number of rounds in setting 5 is 60, twice that of the other settings, so the bias in the log data has little effect on the overall results.

In Experiment 3, the proposed method, Bayesian AdComB-PT, again performed well in all settings. It also showed a smaller performance drop from the results of Experiments 1 and 2 than Bayesian AdComB-MEAN. These findings support the validity of a selection subroutine that balances exploitation and exploration instead of simply using averages. In setting C, the performances of Greedy, \(\epsilon \)-greedy, and k-NN were noticeably worse than in Experiment 2. The reason may be that the number of rounds of log data was too small to find bid prices that would yield a large number of clicks.

6 Conclusion

In this paper, we proposed an algorithm for bid price optimization at the keyword level for sponsored search advertising. The framework consists of Bayesian inference, a bandit algorithm, and online optimization. The update subroutine uses Bayesian estimation to update a model that takes the keyword bid price as input and predicts the number of conversions, ad cost, and other advertising metrics for the keyword as distributions. Bayesian estimation makes it possible to produce predictions that account for the high level of uncertainty, which is a challenge unique to Internet advertising. In the exploration subroutine, a bandit algorithm determines the function to be used for optimization. The optimization subroutine uses the function determined in the selection subroutine to perform optimization with a general-purpose integer programming solver. By using the function determined by the bandit algorithm, bid price optimization is performed with a view to maximizing the cumulative reward. We also constructed two simulators, one using real data from sponsored search ads and one using the iPinYou dataset, an open dataset. In preliminary experiments, we used log data available in multiple settings of these two simulators to measure the WAIC of the probabilistic models considered in this paper and discussed their validity. We also measured the computation time of the optimization subroutine of the proposed method while varying the size of the optimization problem and confirmed that it is fast enough. Experiments using the sponsored search simulator confirmed the effectiveness of the proposed method, and the performance of each method under each simulator setting was discussed. Experiments using the iPinYou simulator also confirmed the effectiveness of the proposed method, suggesting that it is highly scalable. Experiments with different log data generation policies in these simulators confirmed the effectiveness of the proposed method in more realistic settings.

The first direction for future work is to refine the optimization algorithm of the proposed method. One possibility is the use of boosting machines and global Bayesian optimization theory. Boosting machines could be used in the update subroutine: NGBoost [38] can estimate distributions in the same way as in this paper, treating the parameters of the conditional distribution as targets of a multi-parameter boosting algorithm. Global Bayesian optimization theory could be used to make the update subroutine and the optimization subroutine theoretically consistent. Žilinskas and Calvin [39] considered a global optimization problem in which the objective function is assumed to be an expensive black box. Their method is not directly applicable to this study, as it does not consider the exploration-exploitation dilemma, but we believe it could be adapted by using Bayesian estimation.

The second is to extend the proposed method so that it adapts to more realistic environments. Factors affecting advertising metrics that were not considered in this paper include keyword match types and text characteristics, time-series trends, and the time difference between when a click and a conversion occur.

Features such as keyword match type, text length, and abstraction level affect the click and conversion rates of keywords. Ramaboa and Fish [40] and Yang et al. [41] reported that click and conversion rates tend to be higher for keywords with an exact match type than for keywords with a partial match type. Du et al. [22] classified keywords into three types (keywords related to a company’s services, keywords related to competitors’ services, and general keywords with a high level of abstraction), in addition to the keyword match type, and investigated the impact on performance metrics such as click rates. The proposed method could accommodate this by classifying keywords manually based on the advertising groups to which they belong, or automatically by acquiring distributed word representations and clustering them, and then making the probabilistic model hierarchical so that some latent variables are shared by keywords in the same group. In this paper, we focused on optimizing bid prices for keywords with a fixed match type, but the optimization could be extended to include the match type and to propose new keywords.

As for time-series trends, the click rate and search volume of an advertisement may exhibit such characteristics because the attributes of users who visit a site differ by day of the week and by season. Zhang et al.'s [42] analysis of the iPinYou dataset, which we also used for the experiments in this paper, showed advertiser-specific patterns in click rates by day of the week and by time of day. One way to apply the proposed method would be to estimate these time-series characteristics from the available log data and reflect them as latent variables in the probabilistic model.
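
A minimal version of such an estimate is a day-of-week multiplier on the click rate, computed from aggregated logs, which could then enter the probabilistic model as a latent variable. The logs and the weekend uplift below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical logs over 8 weeks x 7 days, with an assumed weekend CTR uplift.
true_mult = np.array([0.9, 0.9, 0.9, 0.9, 1.0, 1.3, 1.3])  # Mon..Sun
base_ctr = 0.03
imps = rng.integers(500, 1500, size=(8, 7))
clicks = rng.binomial(imps, base_ctr * true_mult)

# Per-day CTR relative to the overall CTR = day-of-week multiplier.
day_ctr = clicks.sum(axis=0) / imps.sum(axis=0)
overall = clicks.sum() / imps.sum()
mult = day_ctr / overall
```

With enough log data the estimated multipliers recover the weekday/weekend pattern, and the same construction extends to hour-of-day or seasonal effects.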

Conversions do not necessarily occur at the same time as clicks; they may occur days or even weeks later. Chapelle [43] proposed a conversion rate prediction method for display advertising that accounts for the time difference until a conversion occurs. Vernade et al. [44] formulated a bandit problem in which the reward may arrive with a delay and proposed a bandit algorithm to solve it. Although the setting in this paper assumes that conversion delays do not occur, we expect that these methods can be adapted to our setting to handle delayed conversions.
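
The delayed-reward bandit setting can be illustrated with a Thompson sampling sketch in which each conversion signal is queued and only folded into the posterior after a fixed delay. This is an illustration of the setting, not the algorithm of [44]; the two bid levels, their conversion rates, and the delay length are assumptions.

```python
import heapq
import random

random.seed(3)
true_cvr = [0.05, 0.15]   # assumed conversion rates of two candidate bid levels
DELAY = 20                # a conversion is observed 20 rounds after the click

alpha = [1.0, 1.0]        # Beta posterior parameters per arm
beta = [1.0, 1.0]
pending = []              # min-heap of (arrival_round, arm, converted)
pulls = [0, 0]

for t in range(3000):
    # Incorporate conversion feedback whose delay has elapsed.
    while pending and pending[0][0] <= t:
        _, arm, conv = heapq.heappop(pending)
        alpha[arm] += conv
        beta[arm] += 1 - conv
    # Thompson sampling on the (delayed) posterior.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
    arm = samples.index(max(samples))
    pulls[arm] += 1
    conv = 1 if random.random() < true_cvr[arm] else 0
    heapq.heappush(pending, (t + DELAY, arm, conv))
```

Despite acting on stale posteriors for the first rounds, the sampler still concentrates on the better arm; longer or random delays mainly slow down this concentration, which is the difficulty the delayed-feedback literature addresses.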

In addition, in the actual operation of sponsored search advertising, practical constraints are often imposed, such as on the ratio of conversions to ad cost, whereas this paper considered only budget constraints. Yang et al. [45] addressed the bid price optimization problem for impression-based display advertising, in which a maximum cost-per-click constraint is imposed in addition to the budget constraint; their method is based on PID control, a classical feedback control technique. He et al. [46] proposed a reinforcement learning-based bidding optimization method that can handle constraints such as an upper bound on the cost-per-impression and a lower bound on the click rate in addition to those of [45]. The method proposed in this paper uses a probabilistic model that includes each advertising metric, which enables estimation of the distributions of advertising metrics other than conversions and ad cost. It is therefore possible to define a function of each advertising metric in the selection subroutine and to handle constraints other than budget constraints in the optimization subroutine. Furthermore, by adjusting the hyperparameters of the selection subroutine, the priority of each constraint can be controlled.
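
One way such hyperparameter-weighted constraints could look is a penalty-scored choice among candidate bids, where each weight sets the priority of one constraint. The metric values, the penalty form, and the weights below are hypothetical, not part of the proposed method.

```python
def score(exp_conv, exp_cost, exp_clicks, budget, cpc_cap,
          w_budget=1.0, w_cpc=0.5):
    """Penalty-weighted score of one candidate bid price.

    exp_* are summaries a probabilistic model could supply for that bid;
    the w_* hyperparameters set the priority of each constraint.
    """
    cpc = exp_cost / max(exp_clicks, 1e-9)
    pen_budget = max(exp_cost - budget, 0.0)   # expected overspend
    pen_cpc = max(cpc - cpc_cap, 0.0)          # expected CPC excess
    return exp_conv - w_budget * pen_budget - w_cpc * pen_cpc

# Hypothetical candidates: (expected conversions, expected cost, expected clicks).
candidates = [
    (4.0,  80.0, 100.0),   # cheap but few conversions
    (7.0, 120.0, 150.0),   # most conversions but exceeds the 100 budget
    (6.0,  95.0, 110.0),   # good conversions, within budget
]
best = max(candidates, key=lambda c: score(*c, budget=100.0, cpc_cap=1.0))
```

Raising `w_budget` makes budget violations dominate the score, while lowering it trades occasional overspend for more conversions, so constraint priorities become tunable rather than hard-coded.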

By adding these extensions to the proposed method and verifying its effectiveness in a realistic environment rather than in a simulation, we aim to develop a method that will assist advertisers in their advertising operations.