Self-Adaptive Agents in a Dynamic Pricing Duopoly: Competition, Collusion, and Risk Considerations

In many online markets, we observe fierce competition and highly dynamic price adjustments. Competitors frequently adjust their prices to respond to changing market situations caused by competitors’ price adjustments. In this paper, we examine price response strategies within an infinite horizon duopoly where the competitor’s strategy has to be learned. The goal is to derive knowledge about the opponent’s pricing strategy in a self-adaptive way and to balance exploration and exploitation. Our models are based on anticipated price reaction probabilities and efficient dynamic programming techniques. We show that our approach works when played against unknown strategies. Further, we analyze the mutual interplay of our self-learning strategies as well as their tendencies to form a cartel when motivated accordingly. Moreover, we propose two extensions of our model to integrate risk aversion. Finally, we demonstrate the effectiveness of parallelization techniques to speed up the computation of strategies as well as their simulation.


Introduction
Motivation
E-commerce platforms allow consumers as well as market participants to observe all competitors' prices in real time. As frequent price updates are also allowed, online markets have become increasingly dynamic and competitive.
Hence, dynamic pricing strategies, which take into account the competitors' anticipated price reactions by learning from historical price responses and gradually adjusting the own strategy accordingly, are of high interest. Nevertheless, efficiently determining optimized price reactions to maximize expected long-term profits in competitive markets is challenging, especially for wide price ranges and large numbers of goods, as the underlying dynamic optimization problems are complex.
In online markets, both perishable goods (e.g., food products [42], event tickets [40], and seasonal pieces of clothing [14]) and durable goods (e.g., electronic devices [12] and licenses for software [11]) are subject to automated price adjustment strategies. Oftentimes, these strategies follow a periodically recurring pattern over time (e.g., Edgeworth cycles) [29,30]. In the case of a duopoly, where two market participants are competing against each other, Edgeworth cycles entail that both market participants undercut each other until one market participant's lower bound is reached (e.g., the profit yields zero), and the market participant raises the price to secure future profits.
In this paper, we propose an approach to compute optimized pricing strategies under duopoly competition. We consider the sale of durable goods under an infinite time horizon without inventory considerations. Our goal is to derive price response strategies that optimize the expected long-term future profits under an uncertain competitor strategy by learning from the observed actions of the competitor and adapting to them effectively in a self-tuning manner. An additional focus lies on mutual learning strategies and collusion effects as well as risk considerations and efficient numerical computations.
This article is part of the topical collection "Operations Research and Enterprise Systems" guest edited by Federico Liberatore, Greg H. Parlier and Marc Demange.

Related Work
Efficiently determining optimized prices for the sale of goods is key in revenue management. In this context, the comprehensive books by [31,41], and [9] as well as conceptual papers, cf., e.g., [2,27], cover the broad field of dynamic pricing. In addition, [4,5,39], and [18] provide a good overview of dynamic pricing developments in recent years.
As is common in the literature, we consider so-called myopic customers who arrive and decide. In contrast, there are also approaches, cf. [22,25], that analyze dynamic pricing models with customers who strategically time their purchase by anticipating future prices.
In the context of pricing competition, the work by [10] considers a continuous time multi-product oligopoly for differentiated perishable goods by harnessing optimality conditions to solve the multi-dimensional dynamic pricing problem. In a more general oligopoly model for the sale of perishable goods, Gallego and Hu [8] analyze structural characteristics of equilibrium strategies.
The paper [26] studies duopoly pricing models for identical products with fixed capacities. The sale of perishable goods is typically subject to incomplete market information as sales and remaining inventory of competitors are usually not observable. In this context, Schlosser and Richly [37] look at dynamic pricing strategies in a finite horizon duopoly using a hidden Markov model to estimate the competitor's remaining capacity.
Further, regarding price agents in uncertain environments, [1,17,43], and [6] study dynamic pricing models under competition with limited demand information by employing robust optimization techniques and simulation-based learning approaches. In the area of data-driven pricing approaches, Schlosser and Boissier [36] analyze stochastic dynamic pricing models in competitive markets with multiple offer dimensions, such as price, quality, and rating. However, in their model, no price anticipation strategies are considered.
In [35], the authors analyze optimal repricing strategies in a stochastic duopoly model with full information. Schlosser and Richly [38] extend this work by including endogenous reference price effects and price anticipations. The authors consider both known and unknown competitor strategies. However, they use a different demand setup and price exploration mechanism to anticipate competitor prices. Moreover, they consider neither cartel formations nor risk aversion. Kastius and Schlosser [16] study similar problems using reinforcement learning techniques (Deep Q-learning Networks and Soft Actor Critic), refraining from explicit knowledge about demand and state dynamics.

Contribution
This paper is an extended version of [15]. The main contributions of [15] are the following. We derive mechanisms to find effective self-tuning responses against (i) fixed but unknown competitor strategies including deterministic as well as randomized (mixed) strategies. Based on these mechanisms, we analyze the interaction of (ii) two self-adapting strategies over time. Furthermore, we study (iii) how self-tuning strategies can be adapted to naturally form a cartel in which market participants settle on a fixed price and thereafter stop competing with price adjustments.
Compared to [15], in this paper, we present extended studies and the following additional contributions. First, as a short response time of the approach is crucial for practical applicability, we present a two-phase parallelization approach which accelerates finding the optimal dynamic pricing strategy. We find that the algorithm's runtime performance can be increased by orders of magnitude. Second, we extend the repricing framework by additional risk considerations. We propose two utility-based risk-averse models, which seek to reduce the probability of critical price reactions and, in turn, the risk of poor performance. The applicability and the effectiveness of both models are also shown numerically.
The remainder of this paper is structured as follows. "Model Description" describes our infinite time horizon duopoly consisting of two competing market participants. In "Solution Approach", we outline the dynamic programming framework on which our approach to determining optimized pricing strategies rests. For the case of an unknown competitor's strategy, we provide an in-depth description of our self-adapting strategy for optimized price reactions. Further, we propose concepts to tackle scenarios in which the decision-maker is risk-averse. "Numerical Performance Optimization and Parallelization" contains performance considerations for real-world market situations and proposes parallelization techniques to gain a competitive advantage in such situations based on the response time for price adjustments. In "Experimental Evaluation", we evaluate our proposed pricing strategies for all selected market setups. "Conclusion" summarizes our findings.
SN Computer Science

Model Description
Assume two competing market participants A and B want to sell goods on an online marketplace. The marketplace allows frequent price adjustments as well as the observation of price updates of the competitor. Nowadays, computing power enables competing market participants to perform market analyses for thousands of products frequently to support almost real-time price anticipation strategies.
For this work, we make several assumptions that abstract away from a real price competition but capture its main aspects. The product supply of each market participant is considered to be unlimited. Hence, we assume that both market participants have the ability to reorder and do not have to consider inventory limitations.
Both competing market participants use discrete prices from an arbitrary but finite set $\mathit{prices} := \{p_1, \dots, p_n\}$. In most markets, the smallest difference in price is one cent. Thus, most competitions can be easily simulated by our models. Moreover, the coherent costs c, c ≥ 0 (e.g., for delivery) are predefined, as they do not change over time. In most cases, these costs do not play a major role in computing optimal strategies, so we set c = 0 for our experiments. As a consequence, in our experiments, a sale's net profit equals the offer price.
Since most of the products on big online marketplaces are present for a long period of time and do not need to be sold until a specific date, we consider an infinite time horizon. In this context, to express the desire to gain profits early, we use a discount factor δ, 0 < δ < 1.
Moreover, we use a discrete time model, which allows us to model different scenarios. While market participant A might react to the current market situation at t and t + 1, market participant B might react at t + h and t + 1 + h, h ∈ (0, 1). A visualization of this time model can be found in Fig. 1. The delay parameter h allows for the simulation of various scenarios. While h = 0.5 results in a fair duopoly competition, h ≠ 0.5 results in a biased scenario. Here, one can think of one competitor having access to more computing power and thus, on average, reacting faster to price adjustments by the other party.
An opponent's strategy can be deterministic or stochastic (randomized). Deterministic strategies are characterized by only allowing for a single price reaction to a given price. Meanwhile, stochastic strategies have a larger pool of reactions for a single price from which they can choose one. In our case, stochastic strategies select the reaction randomly, although not necessarily uniformly distributed.
This allows for interesting observations, as the optimized strategy against an unpredictable opponent can be counterintuitive. In general, a market participant's strategy can be characterized by a probability distribution of how to respond to a certain competitor price. In this context, the probability that B reacts to A's price $p_A \in \mathit{prices}$ (under a delay h) with the price $p_B \in \mathit{prices}$ is denoted by
$$P_{react}(p_A, p_B) : \mathit{prices} \times \mathit{prices} \to [0, 1]. \tag{1}$$
Customers can only base their buying decision on the two competitors' prices at time t. For simplicity, we do not represent different product conditions (e.g., used or new) or seller ratings in our models. Further, as demand learning is not in focus, we assume that the customer's behavior is known or has already been estimated. In our models, we assume that exactly one customer arrives within each time interval [t, t + 1] (at a random, uniformly distributed point in time) and chooses to buy a product based on the current prices. The probability that a customer buys a product is described as $P_{buy}(p_A, p_B)$, $p_A, p_B \in \mathit{prices}$. The customer purchases from the competitor with the lower price or randomly chooses a competitor if the offer prices are equal.
From firm A's perspective, the probability that a customer buys a product (within one period) is denoted by
$$P_{buy}^{A}(p_A, p_B) : \mathit{prices} \times \mathit{prices} \to [0, 1], \tag{2}$$
whereas firm B's sales probability is $P_{buy}^{B}(p_B, p_A)$, $p_A, p_B \in \mathit{prices}$. Note that in our model, the sales probability (e.g., of firm A) can be summarized as a function which depends on (i) the current competitor price ($p_B$) and (ii) the price $p_A$ chosen for one period. Further, it is also affected by (iii) the competitor's price reaction $p_B'$ and (iv) the reaction delay h (of firm B). Hence, based on (1), i.e., if $p_B'$ and h can be anticipated, firm A sells in the first time span (t, t + h) with probability $h \cdot P_{buy}^{A}(p_A, p_B)$ and in the remaining share of the period, i.e., (t + h, t + 1), with probability $(1 - h) \cdot P_{buy}^{A}(p_A, p_B')$, $p_A, p_B, p_B' \in \mathit{prices}$. Hence, based on (1), the sales probabilities (2) can be expressed via conditional probabilities, cf. (3). Finally, the expected total future profit G of market participant A (discounted with δ ∈ (0, 1)), given both players' strategies, can be computed from (1)-(3), cf. (4), and the objective of player A is to maximize their expected profit.
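The split of one period into the spans before and after B's reaction can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's equations (3) and (4): `toy_p_buy` is a hypothetical demand function (the cheaper offer always sells, ties are split), and all function and parameter names are chosen here for illustration.

```python
def toy_p_buy(p_own, p_other):
    """Hypothetical demand: the cheaper offer always sells, ties are split evenly."""
    if p_own < p_other:
        return 1.0
    if p_own == p_other:
        return 0.5
    return 0.0


def period_sale_probability(p_buy, p_a, p_b, p_b_next, h):
    """Sale probability of A in one period if B moves from p_b to p_b_next at t + h."""
    return h * p_buy(p_a, p_b) + (1.0 - h) * p_buy(p_a, p_b_next)


def expected_period_profit(p_buy, reactions, p_a, p_b, h, cost=0.0):
    """Average A's one-period profit over B's reactions.

    reactions maps each possible reaction price of B to its probability.
    """
    return sum(
        prob * period_sale_probability(p_buy, p_a, p_b, p_b_next, h) * (p_a - cost)
        for p_b_next, prob in reactions.items()
    )
```

With h = 0.5, a current competitor price of 6, and a certain reaction of 4, offering 5 sells only in the first half of the period under this toy demand.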

Solution Approach
Our basic optimization model to solve the problem defined in "Model Description" for known inputs is described in "Basic Optimization Model". Based on this model, in "Dealing with Unknown Strategies", we address the case when the competitor's strategy is unknown. In "Competition of Two Self-Adaptive Strategies", we study the case when both participants use adaptive learning strategies. In "Incentivizing Cartel Formations", we analyze how cartels form. In "Risk-Averse Model Extensions", we present two models to include risk aversion.

Basic Optimization Model
We take participant A's perspective and assume given buying probabilities (2) for one period (with reaction delay h) as well as certain price reaction probabilities (1). Then, the value function $V(p_B)$ of the duopoly problem can be computed using dynamic programming methods (e.g., value iteration) with T recursion steps, starting from initial values $V_T(p_B)$, $p_B \in \mathit{prices}$, cf. (5). The price reaction policy associated with (5) (i.e., how to respond to participant B's price $p_B$) is determined by the arg max at the last step of the recursion in t = 0. Note that the number of recursion steps T and the starting values $V_T(p_B)$ can be chosen such that the approximation satisfies a given accuracy (based on the discount factor and the maximum attainable reward). In addition, when (5) is solved repeatedly with slight changes (e.g., with updated price reaction probabilities), previous solutions can serve as suitable starting values.
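The backward recursion described here can be sketched as follows. This is one plausible reading of the model description, not a transcription of (5): the state is B's current price, the reward is the expected one-period profit, B's reaction becomes the next state, and the terminal values are assumed to be zero. The names `value_iteration`, `toy_p_buy`, and `delta` (the discount factor) are illustrative assumptions.

```python
def toy_p_buy(p_own, p_other):
    """Hypothetical demand: the cheaper offer always sells, ties are split evenly."""
    if p_own < p_other:
        return 1.0
    return 0.5 if p_own == p_other else 0.0


def value_iteration(prices, p_react, p_buy, h=0.5, delta=0.99, steps=200, cost=0.0):
    """Run `steps` backward recursion steps; return (values, policy) keyed by B's price.

    p_react[p_a] maps each reaction price of B to its anticipated probability.
    """
    values = {p_b: 0.0 for p_b in prices}  # assumed terminal values V_T = 0
    policy = {}
    for _ in range(steps):
        new_values = {}
        for p_b in prices:
            best_value, best_price = float("-inf"), None
            for p_a in prices:
                total = 0.0
                for p_b_next, prob in p_react[p_a].items():
                    sale = h * p_buy(p_a, p_b) + (1 - h) * p_buy(p_a, p_b_next)
                    total += prob * (sale * (p_a - cost) + delta * values[p_b_next])
                if total > best_value:
                    best_value, best_price = total, p_a
            new_values[p_b] = best_value
            policy[p_b] = best_price  # arg max of the most recent step
        values = new_values
    return values, policy
```

Passing the values of a previous run instead of zeros would implement the warm start mentioned above; the sketch omits this to stay short.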

Dealing with Unknown Strategies
In this section, we assume that market participant A does not know market participant B's strategy and that participant B's strategy is fixed and does not change over time. The objective of A is to approximate B's strategy based on the reactions to proposed prices while still being competitive on the market. Therefore, A adjusts their strategy every $T_a$ time steps to remain competitive while gathering more data about the reactions of the opponent B. This process is visualized in Fig. 2. The reactions are recorded in a two-dimensional data structure tr as follows: $tr(p_A, p_B)$ is the total number of times B reacted with price $p_B$ to A's price $p_A$. Furthermore, we define $\#tr(p_A)$ as the total number of times A has seen a reaction to price $p_A$, i.e.,
$$\#tr(p_A) := \sum_{p_B \in \mathit{prices}} tr(p_A, p_B).$$
After $T_a$ time steps, participant A computes the best anticipation strategy via (5), using the estimated probability distribution over B's price reactions recorded so far as $P_{react}(p_A, p_B)$ from (1). If there are no reactions recorded for $p_A$, we assume a uniform distribution over all available prices. Therefore,
$$P_{react}(p_A, p_B) := \begin{cases} tr(p_A, p_B) / \#tr(p_A), & \text{if } \#tr(p_A) > 0, \\ 1 / |\mathit{prices}|, & \text{otherwise.} \end{cases}$$
Then, participant A acts according to the computed strategy for the next $T_a$ time steps, until the next strategy computation, which also takes the newly collected data into consideration. $T_a$ should be as small as possible so that the participant's strategy is updated often; it is only limited by the available computation time and the available computational resources.
Fig. 2 Participant A competing against an unknown pricing strategy by adjusting their own strategy over time, cf. [15]
Over time, participant A gets to know B's strategy, as the observed distribution over the price reactions will become closer to the expected distribution. In the optimal case, this model will deliver the same price anticipation strategy as if the competitor's price response probabilities were exactly known. However, if A receives an unprofitable reaction for a specific price, it is likely that the model will not propose this price again in the future. A's strategy might get stuck and no longer change. To counteract this behavior, we need to explore further. We keep proposing prices that have not seen enough reactions, even though these prices would not be part of the optimal anticipation strategy that could be built based on the recorded price reactions. The participant is thereby able to gather new reactions and extend their data foundation significantly. Below, we propose two procedures to explore participant B's pricing strategy. Note that both mechanisms differ from the one used in [38], where artificially added observations of high price reactions of the competitor are used to organize the price exploration in an incentive-driven framework based on an optimistic initiation.
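The bookkeeping of recorded reactions and the estimate with a uniform fallback can be sketched as a small Python class. The class name `ReactionTracker` and its method names are assumptions made for this illustration; only the counting rule and the uniform fallback are taken from the text.

```python
from collections import defaultdict


class ReactionTracker:
    """Counts B's reactions to A's prices and estimates reaction probabilities."""

    def __init__(self, prices):
        self.prices = list(prices)
        # tr[p_a][p_b]: how often B answered A's price p_a with p_b
        self.tr = defaultdict(lambda: defaultdict(float))

    def record(self, p_a, p_b):
        self.tr[p_a][p_b] += 1.0

    def total(self, p_a):
        """Total number of reactions seen for p_a."""
        return sum(self.tr[p_a].values())

    def p_react(self, p_a, p_b):
        """Relative frequency of the reaction, or uniform if p_a is unexplored."""
        seen = self.total(p_a)
        if seen == 0:
            return 1.0 / len(self.prices)
        return self.tr[p_a][p_b] / seen
```

Storing the counts as floats keeps the structure compatible with a later rescaling of old observations.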
Assurance. We choose a specific number of time steps $T_i$ and let the participant randomly propose prices that have not received enough reactions. By doing so, the participant gains more confidence in the next strategy evaluation. We only apply this procedure in the first $T_i$ time steps to build a profound first strategy, but it is also reasonable to apply it at a later point in time (e.g., when the strategy has not changed for a long time). In the former case, there are no recorded data, and if $T_i \le |\mathit{prices}|$, A proposes a different price in each time step. If $T_i > |\mathit{prices}|$, A starts over and proposes every price at least once before proposing any price a second time. Which price is proposed exactly is decided randomly to account for $|\mathit{prices}| \bmod T_i \not\equiv 0$. During exploration, the model only cares about gaining new information about participant B's strategy and takes neither competitiveness nor profits into account. Afterwards, participant A continuously adjusts its strategy as described before.
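For the initial phase without recorded data, the Assurance schedule can be sketched as repeated random permutations of the price set, cut off after the chosen number of steps. The function name `assurance_schedule` and the parameter `t_i` are hypothetical, and the sketch assumes a non-empty price set.

```python
import random


def assurance_schedule(prices, t_i, rng=None):
    """Return t_i exploration prices; each price is used once before any repeats.

    Assumes `prices` is non-empty. The order within each round is random.
    """
    rng = rng or random.Random()
    schedule = []
    while len(schedule) < t_i:
        round_prices = list(prices)
        rng.shuffle(round_prices)
        # The last round is truncated if t_i is not a multiple of len(prices).
        schedule.extend(round_prices[: t_i - len(schedule)])
    return schedule
```

Passing a seeded `random.Random` instance makes an exploration phase reproducible across simulation runs.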
Incentive. To motivate the model to include suitable prices in its strategy that have not yet seen enough reactions by participant B, the price anticipation, cf. (1), is modified. The way to motivate depends on the customer's buying behavior. We search for the combination of prices $(p_A^*, p_B^*)$ that gives participant A the highest possible profit in the next iteration. For this, we utilize the part of the value function (5) that calculates the immediate profit. Intuitively, it is desirable to propose a price that receives the reaction $p_B^*$ because participant A can then react with $p_A^*$ and gain the highest possible profit. Thus, in the process of collecting observations, we slightly overestimate $P_{react}(p_A, p_B^*)$. The algorithm, cf. (5), needs to decide whether a high immediate profit is worth risking long-term profits. This way, participant A slowly adjusts its strategy by taking new prices into consideration until enough reactions for every available price have been received. The modified $P_{react}(p_A, p_B)$ depends on an incentive weight $\varepsilon \in \mathbb{R}^+$. The value of ε influences how many reactions to a price participant A will receive before deeming this price unprofitable. If ε ≈ 0 and A receives an unprofitable reaction for a specific price, this price will not be proposed again as we do not assume a desirable price reaction in the future. However, if we choose ε to be larger (e.g., ε ≥ 1), there need to be multiple undesirable price reactions before they outweigh the possible chance of a high profit in the next iteration.
Both exploration approaches have their own advantages and disadvantages. The advantage of the Assurance exploration procedure is that participant A gains a sparse but broad data foundation very quickly. After the initial exploration, participant A is able to build their first competitive strategy. Furthermore, if participant B uses a deterministic strategy and $T_i \ge |\mathit{prices}|$, the evaluated strategy after exploration will not change in later strategy adaptions as A has already seen every possible reaction from B. In this case, A fully reveals the strategy after $T_i$ time steps. A major downside of this procedure is that participant A does not care about profits for $T_i$ time steps. As we consider an infinite time horizon, it is negligible if the participant does not work efficiently for a finite number of time steps. If we instead consider a real-world market situation, the participant might not be able to survive the exploration phase. Therefore, $T_i$ needs to be chosen wisely and in proportion to |prices|. It is not feasible to try out most of the available prices if |prices| is large. In this case, it might be better to use the Incentive approach. The participant considers profits and losses starting from the first proposed price and progressively explores prices that have not seen a reaction because exploring is part of the strategy evaluation. On the other hand, it might take the Incentive procedure several strategy adaptions before every price has been proposed at least once and even more rounds of strategy adaptions before the incentive weight has been smoothed out completely. Both approaches could be further tuned, e.g., using interpolations. For example, in real-world scenarios where prices $p_A - 1$ and $p_A + 1$ are unprofitable, it is very likely that price $p_A$ is also unprofitable. However, as both procedures have not seen a reaction for $p_A$, they are influenced to propose this price.
Therefore, both procedures have the problem that they might propose prices unnecessarily. Additionally, both exploration procedures each have one hyper-parameter that needs to be tuned. We will discuss choosing $T_i$ and the incentive weight ε further in "Results for Fixed but Unknown Competitor Strategies".

Competition of Two Self-Adaptive Strategies
Based on "Dealing with Unknown Strategies", we consider the following extension. Instead of competing with an unknown but fixed strategy, both parties can adapt their pricing strategies over time to react to the current market pricing situation and the other participant's pricing strategy. Similar to the model presented in "Dealing with Unknown Strategies", both participants need a data foundation to base their strategy decision on. We, therefore, collect the respective opponent's price reactions over time in the data structure tr. After a predefined number $T_d$ of price reactions, one market participant is allowed to analyze their collected price reactions to adapt their own pricing strategy. Another $T_d$ price reactions later, the other market participant reacts to the changed market situation. Figure 3 illustrates this process. The collection of the mentioned $T_d$ price reactions is grouped together in a data collection block lasting for $T_d$ time steps. A data collection block represents the real market competition. All the tracked price reactions are passed into the strategy adaption of the respective participant at time step t. The computation of the newly adapted strategy is the same as in "Dealing with Unknown Strategies". Participant A's (i − 1)th strategy is replaced with the ith strategy, which will be used for the next two data collection blocks while participant B still uses their (i − 1)th strategy. The next data collection block starting at t runs another $T_d$ time steps. At $t + T_d$, participant B updates their strategy, which will be used within the subsequent two data collection blocks.
Note that, in contrast to "Dealing with Unknown Strategies", both strategies now change over time (cf. Markov property). Reaction data collected at t = 0 will probably be outdated at some later point $t_i$, which might result in inaccurate price reaction strategies. The model needs to anticipate that. To do so, we introduce a vanishing of values within our tr data structure. After a pricing strategy adaption has been performed, the values are multiplied by a constant factor α ∈ [0, 1]. This allows us to choose how much of the collected, but possibly outdated, data is kept. For example, with α = 1 all recorded reactions are kept, and with α = 0 the data structure is reset. While α = 0 leads to better anticipation strategies when only the last simulation run is taken into account, α > 0 is expected to account for the long-term trend and to be less prone to over-fitting the own strategy to a single data collection run.
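The vanishing of recorded reactions can be sketched as a rescaling of all counts after each strategy adaption. The function name `decay_reactions` and the parameter name `alpha` for the constant factor are assumptions for this illustration; the nested dictionary stands in for the tr data structure.

```python
def decay_reactions(tr, alpha):
    """Return a copy of the reaction counts with every entry scaled by alpha.

    tr[p_a][p_b] holds (possibly fractional) counts; alpha must lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        p_a: {p_b: alpha * count for p_b, count in row.items()}
        for p_a, row in tr.items()
    }
```

Scaling leaves the relative frequencies of old reactions unchanged but reduces their weight against reactions recorded afterwards. With a factor of zero, all counts vanish, so a frequency-based estimate would fall back to its rule for unexplored prices.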

Incentivizing Cartel Formations
In this section, we present a way to allow both market participants to form a cartel in which they constantly price their products equally. Note that, instead of predefining a cartel price in advance, we study whether it is possible to modify our self-tuning price anticipation/optimization framework such that two of our independently applied learning strategies form a cartel without direct communication.
The cartel price, denoted by $p^*$, can be identified as follows:
$$p^* = \arg\max_{p \in \mathit{prices}} P_{buy}^{A}(p, p) \cdot p.$$
We still use the response policy derived by (5) but include the following slight adaption. We manually overwrite the reaction of market participant A to $p^*$ with $p^*$ (all other reactions remain unchanged). Thus, market participant A signals to market participant B its willingness to support a cartel price. The rest of our approach to define price reactions and to decide on prices, as described in "Dealing with Unknown Strategies" and "Competition of Two Self-Adaptive Strategies", remains unchanged.
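The two steps, finding the cartel price and overwriting a single entry of the response policy, can be sketched as follows. The function names are hypothetical, and the policy is assumed to be a dictionary mapping the competitor's price to the own reaction.

```python
def cartel_price(prices, p_buy_a):
    """Price that maximizes A's expected one-period profit when both offer it."""
    return max(prices, key=lambda p: p_buy_a(p, p) * p)


def with_cartel_signal(policy, p_star):
    """Copy of the response policy in which p_star is answered with p_star."""
    signalled = dict(policy)
    signalled[p_star] = p_star
    return signalled
```

Returning a copy keeps the unmodified policy available, for example to compare profits with and without the signal.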

Risk-Averse Model Extensions
Our risk-neutral basic model, cf. (5), can be modified to endogenize risk considerations. To avoid the risk of poor performances, we use a utility function u(x) to address the stochasticity in each period, which comes from the customer's buying probability and the competitor's price reaction probability.
Additional utility model. We adapt (5) such that in each period the expected utility is considered and finally accumulated. For this additional utility model, we use initial values $V_T^{(AU)}(p_B) := u(0)$ and perform T steps of the recursion (6), t = 0, 1, …, T − 1, $p_B \in \mathit{prices}$. Note that in (6) various utility functions can be applied, e.g., $u(x) = \ln(1 + x)$ or $u(x) = x^{\gamma}$, $\gamma \in (0, 1)$, which are used to focus on a single period's reward.
Exponential utility model. Instead of the additional utility model above, more specific risk models such as the well-known exponential utility model can be used to consider total future rewards. The value function $V^{(EU)}(p_B)$, $p_B \in \mathit{prices}$, represents the best expected utility of discounted future rewards $E(u(G) \mid p_B)$ for $u(x) = -e^{-\gamma \cdot x}$, $\gamma > 0$. Here, we use initial values $V_T^{(EU)}(p_B) = -1$ and perform T steps of the multiplicative Bellman equation (7), t = 0, 1, ..., T − 1, $p_B \in \mathit{prices}$. The two associated risk-averse policies $a_0^{(AU)}(p_B)$ and $a_0^{(EU)}(p_B)$, $p_B \in \mathit{prices}$, are again determined by the arg max of (6) and (7), respectively, cp. (5). Further, as the basic numerical structure is unchanged and the state space is not extended, the runtime of both modified models remains similar to that of (5).
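To illustrate how a concave utility changes the evaluation of a single period, the following sketch computes the expected utility of one period's reward. It is an assumed reading of the per-period building block of the additional utility model, not a transcription of (6): a sale yields the reward p_a minus cost, no sale yields zero. All names, including the exponent `gamma`, are illustrative.

```python
import math


def log_utility(x):
    return math.log(1.0 + x)


def power_utility(x, gamma=0.5):
    """Concave power utility; gamma is assumed to lie in (0, 1)."""
    return x ** gamma


def expected_period_utility(u, sale_probability, p_a, cost=0.0):
    """Expected utility of one period: reward p_a - cost on a sale, zero otherwise."""
    return sale_probability * u(p_a - cost) + (1.0 - sale_probability) * u(0.0)
```

Under the identity utility, a certain sale at price 5 and a 50% chance of a sale at price 10 are equivalent. Under the logarithmic utility, the certain sale is preferred, which is the risk-averse behavior the model extensions aim for.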

Numerical Performance Optimization and Parallelization
Our proposed strategy evaluation for dynamic pricing has a worst-case time complexity of O(T ⋅ k ⋅ |prices|²), where T denotes the number of evaluated time periods, |prices| denotes the number of available prices, and k, 1 ≤ k ≤ |prices|, the number of different prices the competitor's strategy reacts with to a given price. Thus, while a deterministic strategy results in a time complexity of O(T ⋅ |prices|²), a stochastic pricing strategy might have a time complexity of up to O(T ⋅ |prices|³). Moreover, the space complexity of our dynamic pricing algorithm is O(T ⋅ |prices|).
The space complexity directly results from the dynamic programming table, which spans the time and price dimensions. The time complexity is less straightforward to explain because we have to distinguish between computing a single time point and a time period. When looking at a single time point, we have to evaluate all |prices| prices as a reaction to each of the |prices| possible competitor prices. Additionally, the model has to consider all possible price reactions of the competitor to its own price to estimate the future impact of its decision. For deterministic strategies, there is only one possible competitor reaction. However, stochastic strategies can potentially have |prices| price reactions. Therefore, the time complexity for computing a single time point is O(|prices|²) for deterministic strategies and O(|prices|³) for stochastic strategies. Since we want to simulate T time points, the total time complexity is O(T ⋅ |prices|³) in the worst case.
In a real-world setting, this leads to a high computational effort even for basic scenarios. If we consider all prices below $5 up to a ¢1 precision, there are already 500 available prices. Therefore, we need to evaluate up to 500³ = 1.25 ⋅ 10⁸ possible price settings per time point.
To improve the run time of our algorithm, we can employ parallelization techniques. Parallelization, whereby large problems are divided into smaller subproblems which can be solved individually and simultaneously across multiple processing units [32], integrates nicely with dynamic programming models [3,44]. We parallelize two steps in our dynamic pricing algorithm.
First, we compute the price reactions for all possible competitor prices at any given point in time t in parallel. Thereby, it is necessary to synchronize the partial results for all |prices| prices in the evaluation of any time point t across multiple processing units before starting the evaluation of t − 1. Consequently, we cannot parallelize the evaluation for two different time points, such as t and t + 1, as the evaluation for t depends on the results for t + 1. It is possible to parallelize the evaluation even further, though. Each of the price reactions to each of the competitor prices can be evaluated in parallel. However, there is no practical performance gain, since there are most likely more available prices than parallel computing units, which are already fully utilized by the first step. Furthermore, parallelization in general, and nested parallelization in particular, always imposes additional synchronization overhead, which can lead to performance degradation [20].
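The first parallelization step can be sketched with Python's `concurrent.futures`: within one time step, every competitor price is evaluated independently, and the next time step starts only after all results have been collected. The recursion body is an assumed reading of the model with c = 0 and zero terminal values, and all names are illustrative. A thread pool is used to keep the sketch self-contained; in CPython, CPU-bound work would need a process pool (with module-level functions) to obtain an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def best_response(p_b, prices, p_react, p_buy, values, h, delta):
    """Evaluate a single competitor price p_b; return (value, best own price)."""
    def q(p_a):
        return sum(
            prob * ((h * p_buy(p_a, p_b) + (1 - h) * p_buy(p_a, nxt)) * p_a
                    + delta * values[nxt])
            for nxt, prob in p_react[p_a].items()
        )
    best = max(prices, key=q)
    return q(best), best


def parallel_value_iteration(prices, p_react, p_buy, h=0.5, delta=0.9,
                             steps=50, workers=4):
    prices = list(prices)
    values = {p: 0.0 for p in prices}  # assumed terminal values
    policy = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            # list(map(...)) blocks until every p_b is done: the per-step barrier.
            results = list(pool.map(
                lambda p_b: best_response(p_b, prices, p_react, p_buy,
                                          values, h, delta),
                prices,
            ))
            values = {p_b: v for p_b, (v, _) in zip(prices, results)}
            policy = {p_b: a for p_b, (_, a) in zip(prices, results)}
    return values, policy
```

The time steps remain strictly sequential; only the loop over competitor prices is distributed.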
The theoretical speedup that can be achieved for any parallel program is limited by the longest serial execution path according to Amdahl's Law [13]. In our parallelized strategy evaluation, the longest serial execution path is evaluating each price reaction to a given competitor price and estimating the future impact of the price reaction for each of the T time points. Therefore, the theoretically achievable speedup factor is |prices| with at least that many parallel processing units present. The real-world performance is further limited by the synchronization overhead (i.e., combining the partial results of the strategy evaluation) and the parallelization overhead (e.g., spawning multiple threads).
Second, we run multiple simulations of the optimal strategy determined by our dynamic pricing algorithm in parallel. This problem is known to be embarrassingly parallel [28], as there is no dependency between the instances of the simulation. Therefore, we can easily accelerate the simulation phase by orders of magnitude simply by running it on multiple processing units in parallel with one final synchronization step to collect the results.
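Because the simulation runs are independent, they can be distributed with a plain parallel map and collected once at the end. The following sketch makes several assumptions: A follows a fixed policy, B samples from given reaction probabilities, one customer arrives per period, profits are summed without discounting, and c = 0. All names are illustrative, and a thread pool stands in for whichever pool of processing units is available.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_run(seed, policy, react, p_buy, start_price, periods, h=0.5):
    """One independent simulation run; returns A's undiscounted total profit."""
    rng = random.Random(seed)  # a separate generator per run, no shared state
    p_b, profit = start_price, 0.0
    for _ in range(periods):
        p_a = policy[p_b]
        options = react[p_a]
        p_b_next = rng.choices(list(options), weights=list(options.values()))[0]
        # The customer arrives uniformly within the period; B reacts at time h.
        visible_p_b = p_b if rng.random() < h else p_b_next
        if rng.random() < p_buy(p_a, visible_p_b):
            profit += p_a
        p_b = p_b_next
    return profit


def simulate_many(seeds, **kwargs):
    """Run one simulation per seed in parallel; collect results at the end."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: simulate_run(s, **kwargs), seeds))
```

Seeding each run separately keeps the results reproducible regardless of the order in which the runs are scheduled.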

Experimental Evaluation
In this section, we study the results of our different approaches presented in "Dealing with Unknown Strategies"-"Risk-Averse Model Extensions" and "Numerical Performance Optimization and Parallelization". In "Setup", we define the buying behavior and the competitor's strategies to be learned in our examples. Within this setup, we consider the competition against unknown strategies ("Results for Fixed but Unknown Competitor Strategies"), mutually self-adapting strategies ("Results for Mutual Self-Adaptive Strategies"), cartel formations ("Results for Cartel Formations"), risk considerations ("Results for Risk Considerations"), and performance optimizations based on parallelization ("Evaluation of Performance Optimizations using Parallelization").

Setup
At first, in "Test Strategies of the Competitor", we define example approaches for deterministic and stochastic strategies. Thereafter, in "Customer Buying Behavior", we analyze the customer behavior.

Test Strategies of the Competitor
Next, we introduce two different groups of pricing strategies with a single representative each that will be referred to in the subsequent evaluation.

Deterministic This common class of strategies always reacts with the same price to a certain proposed price. Their behavior can be formally described as a mapping from prices to prices. A simple and widely used deterministic strategy, which we call Underbid, is the following. The other participant's price is always underbid by one unit (e.g., $\Delta$) but respects the minimum available price. This response strategy can be expressed by

$$F: \text{prices} \to \text{prices}, \quad p \mapsto \max(\min(\text{prices}),\; p - \Delta).$$

An example of the Underbid strategy with $\Delta = 1$ is shown in Fig. 4a. We used twenty possible prices, $\text{prices}_{20} = \{\Delta, 2\Delta, \ldots, 20\}$. Each cell shows the probability that the competitor reacts with $p_B$ to a current price $p_A$. In other words, each cell shows the result of $P_{react}(p_A, p_B)$. As a result, a column contains a distribution over all price reactions $p_B$ to a given price $p_A$. As Underbid is a deterministic strategy, in each column a single $p_B$ makes up 100% of the occurrences. This can be clearly seen in Fig. 4a.

Fig. 4 Visualization of price response probabilities for deterministic (Underbid) and randomized (Stochastic) price reaction strategies, cf. [15]
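The Underbid strategy and its reaction probabilities can be written down directly. The following sketch assumes Δ = 1 and the twenty prices used above; the helper names `underbid` and `reaction_matrix` are chosen for this illustration.

```python
DELTA = 1
PRICES = [DELTA * i for i in range(1, 21)]   # prices_20 = {Δ, 2Δ, ..., 20}


def underbid(p):
    """Underbid response: undercut by Δ, but never go below the minimum price."""
    return max(min(PRICES), p - DELTA)


def reaction_matrix(strategy):
    """matrix[p_A][p_B] = probability that the competitor answers p_A with p_B."""
    return {
        p_a: {p_b: 1.0 if strategy(p_a) == p_b else 0.0 for p_b in PRICES}
        for p_a in PRICES
    }


P_react = reaction_matrix(underbid)
```

For a deterministic strategy, each `P_react[p_a]` contains exactly one entry equal to 1, which corresponds to the single filled cell per column in Fig. 4a.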
Stochastic This second class of strategies is characterized by a randomized behavior. A given price p A might result in different price reactions p B . One can compare this behavior with playing multiple pricing strategies at the same time. Figure 4b shows a stochastic strategy which, using the indicator function 1 {⋅} , can be described as

Customer Buying Behavior
In our evaluation, we use the following customer buying behavior. Nonetheless, all models are also capable of handling other customer buying behaviors. We decided to pick a realistic buying behavior in order to make this evaluation as practical as possible. We define the probability that a customer buys a product given the current prices $p_A$ and $p_B$ as follows: Hence, the customer is more likely to buy a product if the minimal price on the market is lower. If both prices are very high, it is less likely that the customer buys a product. As mentioned earlier, we assume that the customer always chooses the lower price. If both proposed prices are the same, the customer randomly chooses one market participant's product. Therefore, the probability that the customer decides for participant A's product offer is defined via Finally, in the context of (1) and (2) in (5), the resulting conditional buying probability for one period of time is described by, cp.
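The choice rule described in words above (lower price wins, ties are split evenly) can be sketched as follows. The overall buying probability is passed in as a function, since its exact form is not reproduced here; `linear_buy_probability` is a purely hypothetical placeholder that only respects the stated property that lower minimal prices make a purchase more likely.

```python
def prob_choose_a(p_a, p_b):
    """Probability that a buying customer picks A: lower price wins, ties are split."""
    if p_a < p_b:
        return 1.0
    if p_a == p_b:
        return 0.5
    return 0.0


def sale_probability_a(p_a, p_b, buy_probability):
    """Probability that A sells in one period.

    buy_probability is a caller-supplied function of the minimal market price.
    """
    return buy_probability(min(p_a, p_b)) * prob_choose_a(p_a, p_b)


def linear_buy_probability(min_price, max_price=20):
    """Hypothetical placeholder: demand falls linearly in the minimal price."""
    return max(0.0, 1.0 - min_price / (max_price + 1))
```

For example, `sale_probability_a(7, 7, linear_buy_probability)` is half of the buying probability at price 7, because the customer picks either participant with equal probability.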

Results for Fixed but Unknown Competitor Strategies
Evaluating our model for an unknown opponent's strategy, cf. "Dealing with Unknown Strategies", we compare how long the exploration procedures Assurance and Incentive take to approximate the real opponent's strategy and what profits they achieve along the way. The quality of different learning strategies can be verified by comparing them to the optimal strategy, which can be obtained by solving (5) for the opponent's strategy. We test the two exploration procedures (Assurance and Incentive) in the setting described in "Setup", where the underlying opponent's strategy is either Underbid or Stochastic. We use a discount factor of 0.99 and intervals with $h = 0.5$. We found $T = 100$ to be sufficient for the strategy anticipation. We choose the number of time steps after which A adjusts its strategy as $T_a = 1$. This, in turn, implies that the strategy is reevaluated after every new price reaction from B.
We compare the Assurance procedure under different numbers of time steps for exploration $T_i$ and the Incentive procedure under different incentive weights, respectively. For this purpose, we choose $T_i$ and the incentive weight as follows: As a baseline for each approach, we use $T_i = 0$ and an incentive weight of 0.001. To assess the strategy calculated at a specific time step t, we run the simulation $S = 1\,000$ times for $T_S = 100$ time steps. The average of A's expected profits in these simulations is divided by the simulation length $T_S$ and will be denoted $E_t$. Therefore, $E_t$ is A's expected profit with the strategy used at time step t.

We denote by $O$ the expected profit that is achieved when the optimal strategy is used. $O$ is constant as the optimal strategy does not change over time. If $E_t \approx O$, we know that A either found the optimal strategy or another strategy that produces very similar profits. If this holds for a larger number of time steps, A has successfully identified B's strategy. An example is visualized in Fig. 5. The figure depicts the development of expected profits $E_t$ over time when utilizing the Assurance procedure. We used $\text{prices} = \text{prices}_{20}$ with Underbid as B's underlying strategy and $T_i = 20$ time steps for initial exploration. As discussed in "Dealing with Unknown Strategies", A will be able to find the optimal strategy because B's strategy is deterministic and $T_i \geq |\text{prices}|$. This can be seen clearly in Fig. 5. During exploration with Assurance, A's expected profits are mediocre, but after exploration, the expected profits are equal to the optimal profits.

To compare the two procedures over a longer time period, we set A's cumulative profit in proportion to the cumulative profit under the optimal strategy. A's cumulative profit will be denoted $CE_t$ and the cumulative profit under the optimal strategy will be denoted $CO_t$. If $CE_t \approx CO_t$, A found the optimal strategy or another one that achieves equal profits. Furthermore, in order for the cumulative profits to be approximately equal, the strategy needs to be used for several time steps to make up for inferior strategies applied in the past. We will call $CE_t / CO_t$ the profit ratio. Figure 6 depicts the profit ratio of the Assurance and Incentive procedures with their respective configurations for 400 time steps.
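The profit ratio can be illustrated with a few lines of arithmetic. The sketch assumes that the cumulative optimal profit grows by the constant per-period value O in every step, so that it equals O times the number of elapsed steps; the profit numbers in the example are made up.

```python
from itertools import accumulate


def profit_ratio(expected_profits, optimal_profit):
    """Cumulative profit of A divided by the cumulative optimal profit, per time step."""
    cumulative = accumulate(expected_profits)           # CE_t
    return [
        ce / (optimal_profit * (t + 1))                 # CO_t = O * (t + 1), assumed
        for t, ce in enumerate(cumulative)
    ]


# Hypothetical run: three exploration steps earning 2, then the optimal profit 4.
ratios = profit_ratio([2, 2, 2, 4, 4], optimal_profit=4)
# ratios == [0.5, 0.5, 0.5, 0.625, 0.7]
```

The example shows why long exploration phases are costly: even after switching to the optimal strategy, the ratio approaches 1 only slowly because earlier losses remain in the cumulative sum.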
Assurance and underbid strategy (Fig. 6a) The figure shows that it takes a long time for larger $T_i$ to make up for losses during exploration, given that A would find the optimal strategy after $T_i = 20$ time steps. $T_i = 10$ seems to have found the optimal strategy after initial losses as well. This can be seen because its plot is approached by that of $T_i = 20$. These two configurations as well as $T_i = 40$ and $T_i = 100$ will approach the optimal profit ratio of 1 over the infinite horizon. With $T_i = 0$, the model was not able to find the optimal strategy, which explains why its plot is overtaken by that of $T_i = 20$.

SN Computer Science
Assurance and stochastic strategy (Fig. 6b) As expected, the illustration shows that more exploration is needed for a stochastic strategy. The configurations $T_i = 0$ and $T_i = 10$ converge to the same point, which means that the additional exploration did not yield any beneficial information. $T_i = 20$ results in a higher profit ratio but, similar to the deterministic scenario, it takes very long for larger $T_i$ to make up for missed profits during the exploration phase.

Incentive and underbid strategy (Fig. 6c) In the figure, we can see that every configuration, after some initial profits, experiences a drop in the profit ratio. The reason is the following. Initially, the merchant has no information about the competitor's price reaction. The price exploration is organized via (5) based on the current estimation of the competitor's price reaction probabilities. Initially, no undercutting is expected and prices that provide comparably high profits are greedily explored. For these prices, the merchant learns that he/she will be undercut by the competitor. Hence, the exploration via (5) tries out prices that are not yet associated with an undercut by the competitor. In turn, playing such prices provides lower profits (resulting in the profit ratio drop). After the merchant has learned that he/she will be undercut for these prices as well, the exploration via (5) automatically switches back to more profitable prices, as wrong expectations regarding the competitor's reaction have been corrected (resulting in an increased profit ratio). The amplitude of the drop is pronounced for incentive weights close to 0. For larger incentive weights, the price exploration holds on more strongly to the beneficial price $p_A^*$, which stabilizes the profit ratio in the early exploration phase. This is visualized in Fig. 7 for an incentive weight of 1. The model continuously tries out higher prices. For a lower incentive weight, proposing these prices happens very fast. That is the reason why the drop is greater for lower incentive weights than for larger ones.

Model configurations with larger incentive weights take longer to gain confidence in the profitable prices before trying out less profitable prices. This also means that larger incentive weights take longer to accept that the opponent strategy is deterministic. It is therefore not surprising that the order of profit ratios at $t = 400$ is the ascending order of the incentive weights. Every configuration is able to find the optimal strategy. However, we see that the larger the incentive weight, the longer it takes the model to be certain.
Incentive and stochastic strategy (Fig. 6d) The plots have a similar shape as those in Fig. 6c (against Underbid). The drop is less significant, which is likely due to Stochastic being a more forgiving strategy than Underbid. The configuration with an incentive weight of 0.001 seems to not have received enough opponent reactions, which can be seen as its plot stagnates for larger t. Moreover, larger incentive weights perform equally well.
Overall, we find that the Incentive procedure should be preferred over the Assurance procedure for exploration. The Incentive procedure produces a higher profit ratio than the Assurance procedure. This is because the latter needs exploration for $T_i \geq |\text{prices}|$ time steps to produce a good strategy, which can be clearly seen in the scenario of the Stochastic strategy. However, if $T_i$ is too large, it takes very long to compensate for the exploration phase. For the Incentive procedure, an incentive weight of approximately 1 seems to be ideal. Moreover, configurations with large incentive weights take too long to be confident about the opponent's strategy, while configurations with small incentive weights are considerably less likely to propose a price multiple times.

Results for Mutual Self-Adaptive Strategies
Our evaluation of the interaction between two self-adapting strategies is divided into three major parts. The first part runs the competition with two identically configured models and observes how the competition affects each of them. The second part focuses on the parameter $\alpha$ and its effect on the models' performance. The final part examines whether both strategies can form a cartel. For all of these tests, we use an interval split $h = 0.5$, a discount factor of 0.99, and an evaluation time horizon $T = 50$. Strategy updates are performed frequently with $T_d = 10$ in order to shorten the initial exploration phase. We simulate each configuration for 2 000 time points to account for long-term effects. Additionally, the models use the Incentive technique, presented in "Dealing with Unknown Strategies", to explore prices the competitor has not reacted to.

Symmetric Strategies
In this scenario, we let two identically configured models compete. We evaluated different (symmetric) values $\alpha \neq 1$ weighting new and old reactions. We identified $\alpha = 0.8$ as most suitable for focusing on newer reactions while keeping track of old ones, too. Figure 8 shows the progress of the profits of both strategies. In the beginning, market participant B is ahead because its strategy has one period of additional data during the strategy reevaluation. Therefore, it can finish its exploration phase earlier. However, the earlier update has its disadvantages, too. Market participant B notices first that it has to restore a low price level at some point to gain higher profits in the future. However, in this scenario, market participant A can exploit this behavior by choosing a low price, so participant B has to increase the price level. Therefore, market participant A never has to increase the price level itself, but can force market participant B to do so once the price level drops too low. Accordingly, market participant A wins out at some point and never loses the profit lead again. We see that market participant B is not able to stop the downward trend once it has started. When competing for an extended period of time, both strategies enter a loop of chosen prices, which both models profit from. While market participant B loses the competition, its strategy is still optimal from its point of view. The alternative of matching a low price of market participant A is not lucrative, as there is no guarantee that market participant A will restore the price level, and market participant B would lose profit in the long run.

When looking at the strategy evolution, we see that both models start with similar strategies to explore the respective competitor's strategy. Figure 9a shows that market participant B is ahead during the exploration phase.
Both strategies evaluate the competitor reactions from most profitable to least profitable. As we can see, participant B is already using price 12, while participant A is still evaluating the more profitable price of 14.
Further, Fig. 9b shows the learning progress of both strategies at $t = 400$. Market participant B has learned that it has to restore the price level at some point, while market participant A exploits B's strategy by matching lower prices in order to force B to raise the price level afterwards. After 1 000 time periods, both strategies do not change anymore. The final strategies are presented in Fig. 9c. We can see that both strategies underbid each other in the mid price ranges. However, they differ in their behavior once the price drops too low and also in their price reaction to excessively high market prices. Moreover, market participant A already deems prices below 7 too low, while market participant B only deems prices below 3 too low.

Asymmetric Strategies with α Deviations
While the previous section focused on symmetric setups ($\alpha = 0.8$ for A and B), this section investigates the impact of selected $\alpha$ values on the models' performances. Figure 10 shows the competition of two extreme values (i.e., $\alpha_A = 0$ and $\alpha_B = 1$). We see that participant B wins the competition very decidedly. $\alpha = 0$ is observed to be the worst possible setting, as the incentive-based learning has to start over and over again. This is due to the fact that the model loses every recorded price reaction after each strategy evaluation. Therefore, previously played prices appear to be new to the unsuspecting model. In the short term, this strategy can work out because it focuses on high-profit prices first.

For this reason, we also present a more competitive setting where participant A uses $\alpha = 1$ and participant B uses $\alpha = 0.5$. An $\alpha$ value of 0.5 allows a model to focus on newer reactions without losing information on older ones. Figure 11a shows the profits of both competing strategies over time. We can see that participant B's strategy using $\alpha = 0.5$ is more effective.
As we saw earlier, participant B is at a structural disadvantage. Nonetheless, B is able to win over participant A due to the superior $\alpha$ value. In contrast to the first experiment, participant B is able to force A to restore the price level, as we can see in Fig. 11b. Participant A is not able to remove misleading reactions from the early exploration. Thus, market participant A is not able to compete with market participant B, who adapts its strategy accordingly.
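The qualitative effect of the weighting parameter can be illustrated with a small sketch. It assumes, for illustration only, that at every strategy reevaluation all previously recorded reaction counts are multiplied by alpha before new observations are added; the class `ReactionEstimator` and the helper `replay` are hypothetical names and not the exact update rule of our model. Under this assumption, alpha = 0 discards all earlier reactions and alpha = 1 keeps every reaction forever.

```python
class ReactionEstimator:
    """Estimates reaction probabilities from exponentially weighted counts."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.weights = {}   # weights[p_a][p_b] = weighted number of observations

    def reevaluate(self):
        """Called at each strategy reevaluation: old observations are down-weighted."""
        for reactions in self.weights.values():
            for p_b in reactions:
                reactions[p_b] *= self.alpha

    def observe(self, p_a, p_b):
        reactions = self.weights.setdefault(p_a, {})
        reactions[p_b] = reactions.get(p_b, 0.0) + 1.0

    def probability(self, p_a, p_b):
        reactions = self.weights.get(p_a, {})
        total = sum(reactions.values())
        if total == 0.0:
            return None     # no information about reactions to p_a
        return reactions.get(p_b, 0.0) / total


def replay(alpha):
    """One misleading early reaction (10 -> 12), then five reactions 10 -> 9."""
    estimator = ReactionEstimator(alpha)
    estimator.observe(10, 12)
    for _ in range(5):
        estimator.reevaluate()
        estimator.observe(10, 9)
    return estimator
```

In this toy run, `replay(1.0)` still assigns one sixth of the probability mass to the misleading early reaction, whereas `replay(0.5)` has almost forgotten it, and `replay(0.0)` retains only the single most recent observation.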

Results for Cartel Formations
This section studies a cartel formation between two self-learning strategies. While the previous experiments showed that the two models do not form a cartel on their own, the introduction of an artificial price reaction as discussed in "Competition of Two Self-Adaptive Strategies" helps with that. Figure 12 shows the prices $p_A$ and $p_B$ of both market participants over time.
We find that both strategies explore the possible prices at first, as we see many different prices played. Both participants occasionally choose the cartel price, but they do not form a cartel instantly. Given that participant A will always react to a price $p_B = p_B^* = 11$ with $p_A = p_A^* = 11$, we see that participant B at some point $t \approx 250$ decides to consistently react to $p_A^*$ with $p_B^*$ to form a cartel. After the formation phase, both parties continue to stick with the cartel price. As expected, the earned profits of both participants are high and both strategies outperform their competing counterparts.

Results for Risk Considerations
In this subsection, we test the risk-averse versions of our basic model (5). We consider the setup of "Setup". As an exemplary competitor strategy, we assume the randomized strategy Stochastic defined in "Test Strategies of the Competitor". Figure 13a illustrates the results of the additive utility (AU) model (6) for different degrees of risk aversion, together with the risk-neutral strategy, which is depicted by a black line in Fig. 13a. Curves with lighter shades of gray indicate strategies with higher risk aversion. Overall, we observe similar patterns, i.e., there are ranges where it is best to slightly undercut the competitor, and for ranges with low prices it is best to significantly raise the price level. Further, we find that the higher the degree of risk aversion, the lower the response prices, and the lower the competitor prices at which the price level is raised. The corresponding results of the exponential utility model (7) with $u(x) = 1 - e^{-\gamma \cdot x}$ and $\gamma \in (0, 2]$ are shown in Fig. 13b. Here, the risk-neutral solution is associated with $\gamma \to 0$. Overall, compared to the AU model, we observe similar results. The examples demonstrate that both proposed risk models are applicable in our framework and that, as expected, a higher degree of risk aversion will, in general, lead to lower prices, avoiding the risk of poor performance by generating lower rewards with high confidence. As in the risk-neutral model, $T = 100$ iterations were enough to obtain convergence (cf. value iteration).

Fig. 11 Profit progression and strategy comparison of two competing, asymmetric self-adapting price strategies with $\alpha_A = 1$ and $\alpha_B = 0.5$, cf. [15]
Note that the runtime of (6) and (7) scales like that of (5), as the computational complexity is the same.
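The effect of an exponential utility function on the valuation of uncertain rewards can be illustrated with a certainty equivalent, i.e., the sure reward that yields the same utility as a risky one. In the sketch below, the risk-aversion parameter is called `gamma` (a name assumed for this illustration), and the two-outcome lottery is a made-up example rather than a result of our model.

```python
import math


def exponential_utility(x, gamma):
    """u(x) = 1 - exp(-gamma * x) for a risk-aversion parameter gamma > 0."""
    return 1.0 - math.exp(-gamma * x)


def certainty_equivalent(outcomes, probabilities, gamma):
    """Sure reward whose utility equals the expected utility of the lottery."""
    expected_utility = sum(
        p * exponential_utility(x, gamma) for x, p in zip(outcomes, probabilities)
    )
    return -math.log(1.0 - expected_utility) / gamma


# Hypothetical lottery: reward 0 or 10 with equal probability (mean 5).
for gamma in (0.0001, 0.1, 0.5, 2.0):
    print(gamma, round(certainty_equivalent([0, 10], [0.5, 0.5], gamma), 3))
```

For very small `gamma`, the certainty equivalent is close to the mean reward of 5, which mirrors the risk-neutral limit. As `gamma` grows, the certainty equivalent falls, so a smaller but safer reward becomes preferable to the lottery.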

Evaluation of Performance Optimizations Using Parallelization
We evaluate the proposed two-step parallelization approach for our dynamic pricing algorithm in a Python-based reference implementation which utilizes the library pathos for configuring and launching multiple threads for parallel execution.
We run our experiments on a bare-metal server with Intel(R) Xeon(R) Gold 6148 CPUs @ 2.40 GHz and 750 GB RAM @ 2666 MT/s. The server runs Ubuntu 18.04 LTS with Python 3.6.9. We use the high-precision chronometry library timeit for measuring the runtime of both evaluation and simulation of our dynamic pricing algorithm.
We conduct all of our performance experiments with a deterministic Underbid strategy ("Test Strategies of the Competitor") for participant B. Since our goal is to quantify the performance improvement of our parallelization approach, we measure the runtime of each experiment for 1, 2, 4, and 8 threads. To reduce potential measurement deviations of the execution times, we repeat each experiment ten times and take the arithmetic mean over all execution times.
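The measurement procedure (ten repetitions, arithmetic mean) can be sketched with the standard library as follows. The helper name `mean_runtime` and the dummy workload are assumptions for illustration; in the experiments, the measured function would be the strategy evaluation or the simulation.

```python
import statistics
import timeit


def mean_runtime(func, repetitions=10):
    """Arithmetic mean of the wall-clock runtimes of `func` over several repetitions."""
    # number=1: each repetition executes the function exactly once
    runtimes = timeit.repeat(func, repeat=repetitions, number=1)
    return statistics.mean(runtimes)


def dummy_workload():
    """Stand-in for one strategy evaluation."""
    return sum(i * i for i in range(10_000))


average_seconds = mean_runtime(dummy_workload, repetitions=10)
```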
We first analyze the impact of increasing the number of available prices |prices| from 100 to 500 on the runtime of our strategy evaluation for participant A using different numbers of threads (Fig. 14). In the single-threaded scenario, where only 1 thread determines the optimal pricing strategy, the strategy evaluation runtime grows quadratically with the number of available prices |prices|, since k = 1 due to B's deterministic reaction behavior ("Numerical Performance Optimization and Parallelization"). The execution time ranges from 13 s for 100 available prices to 722 s for 500 prices. In the multi-threaded scenarios, which harness the computational power of multiple cores by spawning 2, 4, or 8 threads, respectively, to determine the optimal pricing strategy in parallel, the runtime growth is considerably slower. 8 threads perform the strategy evaluation in almost linear time, yielding a speedup of ×6.4 for 500 available prices. In addition, we see that the parallel approach already outperforms its single-threaded counterpart for |prices| = 100, which in a real-world scenario corresponds to a price range of just $1 at 1-cent precision.

Our second experiment measures how the number of evaluation steps T, which strongly influences the runtime of (5), impacts the runtime of our algorithm's strategy evaluation for various numbers of threads (Fig. 15a). We see that in all thread configurations (i.e., 1, 2, 4, and 8 threads) the strategy evaluation runtime grows linearly with T. However, doubling the number of threads leads to a considerable performance boost across the entire range between T = 100 and T = 500. More precisely, for T = 100 the speedup between 1 and 8 threads is ×5.2 (i.e., 183 s vs. 35 s). In the case of T = 500, the speedup is ×5.0 (i.e., 921 s vs. 185 s). As the strategy evaluation builds the foundation for all three models presented in this paper, its performance is crucial for the respective participant. In a real-life setting, a faster strategy computation helps to anticipate the current market behavior more precisely and therefore enables more profitable pricing behavior.
Our last experiment measures the time it takes to simulate the optimal pricing strategy produced by our strategy evaluation S times using 1 to 8 threads (Fig. 15b). Since all simulations can run fully independently without any synchronization in-between ("Numerical Performance Optimization and Parallelization"), it is not surprising that 2, 4, and 8 threads outperform the single-threaded execution for the entire range of S. 10 000 simulations run in 24.6 s on 1 thread and just 3.8 s on 8 threads which yields a speedup of ×6.5 . Thus, multiple different market places can be tracked simultaneously which helps with generating a reliable data foundation for future price adaptions.
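The reported speedups follow directly from the measured runtimes. The short calculation below recomputes them and additionally derives the parallel efficiency (speedup divided by the number of threads), a standard metric that is added here for illustration only.

```python
THREADS = 8

# (runtime with 1 thread, runtime with 8 threads) in seconds, as reported above
REPORTED_RUNTIMES = {
    "evaluation, T = 100": (183.0, 35.0),
    "evaluation, T = 500": (921.0, 185.0),
    "simulation, S = 10 000": (24.6, 3.8),
}


def speedup(serial_seconds, parallel_seconds):
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, threads):
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup(serial_seconds, parallel_seconds) / threads


for name, (serial, parallel) in REPORTED_RUNTIMES.items():
    print(
        name,
        round(speedup(serial, parallel), 1),
        round(efficiency(serial, parallel, THREADS), 2),
    )
```

The simulation phase reaches a higher efficiency than the strategy evaluation, which is consistent with the fact that it requires only a single final synchronization step.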

Conclusion
Online markets offer optimal conditions to employ dynamic pricing strategies, as it is easy to observe competitors' prices and to change one's own price. To gain a competitive advantage, market participants seek to optimize their prices and to update them frequently. We have analyzed optimized pricing strategies for different duopoly scenarios with an infinite time horizon. Our setup allows fairly general demand probabilities as well as various potential competitor strategies.
First, we show how our self-adaptive strategy manages to explore a competitor's strategy efficiently and to derive an optimized response strategy. To reveal the competitor's price responses, we use an incentive-driven approach to organize the price exploration such that prices that have not been proposed before and prices that maximize expected rewards are effectively balanced. Second, we let our self-learning strategies interact with each other. Both strategies estimate the respective competitor's strategy and regularly adapt their price responses. Our simulations show that strategies evolve over an extended period of time, but stop evolving at some point. Moreover, regarding the data to anticipate prices during this process, we find that considering a suitably chosen share of past observed price reactions significantly outperforms settings in which either all or too few observations are used for price anticipation.
Third, we slightly modify our self-adapting strategy for one of the two firms such that the response to the cartel price is the cartel price itself; all other responses remain unchanged, i.e., competitive. We find that both strategies stop competing once they discover the cartel price as it is the most beneficial scenario for both players.
Fourth, we show how the underlying risk-neutral dynamic programming model for computing response strategies for given price anticipations can be adapted to address risk-averse decision-making. Besides a utility-based model considering the utility of single periods in an additive way, we also show how to apply the well-known exponential utility model optimizing the expected utility of total discounted future rewards. Our numerical examples demonstrate the applicability of both models as well as their risk-averse impact on response strategies.
Finally, we present a two-phase parallelization approach to accelerate both computation and simulation of pricing strategies. We achieve a significant performance boost for all evaluated scenarios. Larger numbers of available prices and simulations amplify this performance boost even further. We further observe that our parallelization efforts approach the theoretical speedup limit. Not only does this substantiate our claim that parallelization for our dynamic pricing algorithm is reasonable, but it also indicates that, for real-world scenarios with high-frequency price adaptions, it is an essential performance optimization technique.
Future research may focus more on strategies in unfair or asymmetric market environments. This might be caused by one party using more efficient algorithms, the use of higher computational resources, or limited access to marketplaces around the world. Strategies need to anticipate those difficult circumstances in order to gain long-term profits. Furthermore, besides the exit and entry of other firms, market settings which change over time caused by evolving (strategic) customer behaviors as well as the consideration of different product qualities or ratings could be investigated.
Funding Open Access funding enabled and organized by Projekt DEAL.

Conflict of interest
The authors confirm that there is no conflict of interest.
Code availability Not applicable, could be provided.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.