Market making (MM) is a well-known high frequency trading (HFT) strategy widely used in large stock exchanges around the world, including NYSE and NASDAQ [1]. The two are not identical, however: unlike market makers, HFT firms are not obligated to trade at all times. A market maker, or MM agent, is responsible for providing market liquidity through high-speed execution of a large number of orders within a fraction of a second. In the absence of MM agents the market becomes illiquid: orders take time to execute, and market participants lose interest in trading. MM is therefore a crucial component of the market and a means of attracting investors. The profit in MM comes from the difference between the quoted ask (sell) price and the quoted bid (buy) price of a stock. An MM firm has an obligation to continuously place buy and sell limit orders, standing ready to trade with other market participants throughout market operational hours in order to add liquidity to the market. Each stock listed on the market has a highest bid price (best-bid) and a lowest ask price (best-ask), and the difference between the best-ask and the best-bid at an instant is known as the market spread. Market liquidity can be measured by the quoted spread of an MM agent and the number of successful trades: the higher the number of trades and the lower the quoted spread, the more liquid the market.

With the advent of algorithmic trading, humans have vanished from MM roles, except in over-the-counter markets such as the corporate bond market, because it is no longer possible to make a profit manually. MM is nowadays based entirely on high-speed trading systems that convert a speed advantage into profit. However, high-speed trading alone is not enough for MM agents to compete with other traders. There is a need for automated MM that incorporates human-level expertise into high-speed trading. Reinforcement learning (RL) is a machine learning technique capable of serving as an automated MM agent. In the latest work on RL-based MM, [2] developed an automated MM agent using RL whose goal is to maximise profit and minimise inventory. This work is the state of the art and is used as a baseline model in this paper to evaluate our new method. Moreover, it is the first attempt to design a practical model of MM that considers all necessary market phenomena, including price-to-tick conversion, limit order book handling operations (insertion, deletion and amendment of orders) and latency in order execution. For simplicity, we refer to this work as RMM-Spooner. RMM-Spooner used a well-known state-discretization function approximation method of RL, called tile-codings, to design an RL-based MM agent.

We also use a traditional MM model by [3] as the second baseline model for comparison with our proposed method. This traditional model used a reservation price to calculate the optimal bid and ask quotes of MM agent. For simplicity, we refer to this traditional MM model by AS model. Existing MM methods consider only the current price. If MM additionally takes into consideration the predicted future price, we call it predictive market making (PMM). If the prediction is accurate enough, higher profits are expected with PMM. There is ample evidence in the literature that HFT performance can be improved significantly by accurate prediction of future prices, see, for example, [4,5,6]. However, as far as we know, there is no existing study on PMM in the literature. In this paper we study PMM by answering the following questions: 1) How to predict the future price of a stock?; 2) How to incorporate the predicted price in MM?; 3) Will such a PMM method generate higher profit and higher market liquidity?; and 4) How does this PMM method compare with the baseline methods namely RMM-Spooner and AS model? To answer these questions, different PMM strategies (see Section 3.2 for detail), based on the type of deep neural network (DNN) architecture for price prediction and the value of consolidated price equation (CPE) weight (w \(\in [0.5, 1]\)), are designed. CPE uses the weighted mean of two asset prices, where the weight of each component is denoted by w and 1-w. DNN architectures consist of multiple layers of artificial neurons connected fully with each other. Three different DNN architectures are used, namely multi-layer perceptron (MLP), long-short term memory (LSTM) and convolutional neural network (CNN) to design the price prediction model of PMM. The DNN prediction models use w to combine the current and the predicted price of a stock to generate a consolidated price. 
Moreover, the value of w controls the contribution of the current and predicted prices to the consolidated price. Our empirical observations indicate that higher returns are obtained when the contribution of the predicted price in the CPE is increased as the prediction accuracy increases.

The limit order book (LOB) simulated in this research uses intraday trading, where stocks are bought and sold on the same day during market operational hours. The LOB is updated at a fixed interval, typically five seconds, and an MM agent makes its decisions based on up-to-date market information. Exchange traded funds (ETFs) are nowadays popular with traders for a number of reasons, including the portfolio diversification they offer and the fact that they can be traded like regular stocks [7]. The PMM strategies are evaluated on a basket of ten stocks to find an optimal strategy for ETFs. These ten stocks and three ETFs are picked randomly from Yahoo finance.

The main contributions of this paper are: 1) the introduction of a novel concept of PMM, which enhances the returns of a RL-based MM agent and hence the market liquidity through LOB midprice prediction; 2) the design of a consolidated price equation (CPE), which is responsible for the amalgamation of present and future market prices of stocks; 3) a RL-based MM agent which can anticipate the future market trend in real-time.

The remainder of this paper is organised as follows. Section 2 reviews the related work. Section 3 provides the necessary background details. Section 4 describes the proposed method. Section 5 explains the empirical study and the statistical analysis of out-of-sample backtesting. Finally, Section 6 concludes the paper.

Related Work

MM has been studied in multiple fields, including AI. [8] used RL to develop the first practical model of MM. They investigated the influence of uninformed market participants on the quote placement behaviour of market makers and argued that RL policies converge successfully while keeping the balance between profit and spread. [3] approached the MM problem as one of optimal bid and ask quote placement in the LOB using probabilistic knowledge, where order arrivals and executions follow their model (AS model). [9] studied the impact of MM on price formation and treated the exploration versus exploitation problem as price discovery versus profit earning. However, this work concentrated on price prediction and stability rather than on enhancing the quality of the market as measured by market spread. [10] designed an automated MM framework based on convex hull optimisation. [11] used an online learning approach to design a market maker and evaluated it empirically while assuming sufficient liquidity. [12] designed an intelligent market maker which uses a prediction signal to improve profits but fails to suggest any successful MM strategy in general. [13] designed a high frequency MM model that takes advantage of speed in placing quotes and also argues that speed and profit are positively correlated. [14] proposed a trade execution model for high frequency MM that predicts sudden price fluctuations from price tick information (a tick is the smallest unit of change in price). They argue that recurrent neural network-based price flip prediction is advantageous in MM, but do not test their model on other MM strategies, including RL, which limits the practicality of their model. Then, [2] developed a state-of-the-art model (RMM-Spooner) based on RL, which is the first attempt to study this problem on realistic grounds, taking into consideration all market phenomena that affect MM. [15] employs asset price momentum signals in the decision-making strategy of the MM agent. 
This momentum signal-based decision making closely resembles this work. However, we minimise the risk of sudden or unexpected momentum shift (due to the irrational behavior of traders towards new information) using CPE, unlike [15]. Moreover, this work designs a more robust MM strategy which predicts an actual price rather than just the direction (upward or downward trend) like [15]. The similarity between this work and [15] is directional trading based on price changes.

As far as machine learning (ML) in finance and economics is concerned, [16] improved the results of a recurrent RL-based trading system using a genetic algorithm. [17] designed an automated trading system which combines DNN and RL and predicts the number of shares to enhance trading decisions. They evaluated their system on different stock indices, including the S&P 500, and reported a significant improvement in profits compared to their baseline RL trading method. Research conducted by [17] improved financial trading decisions using a deep Q-network. Many researchers have applied machine learning-based forecasting techniques to enhance the current state-of-the-art methodologies used in economics and finance. [18] used an extreme learning machine for bankruptcy prediction, whereas [19] used an extreme learning machine to design a financial soundness predictor from bank data. Moreover, [20] evaluated various ML algorithms for money laundering detection. Another work, published by [21], analyzed the role of market sentiment and technical indicators in conjunction with ML techniques to predict stock trends.

In financial time-series prediction, neural networks, particularly LSTMs, have been a popular choice among the AI community. [22] studied stock market price prediction using backpropagation and a multi-layer feed-forward neural network. They stated that mathematical or statistical techniques are not appropriate because market indicators do not exhibit any significant relationships, which makes stock markets quite difficult to predict. [23] focused on MLP, CNN and LSTM in proposing a hybrid architecture using wavelets for stock price prediction. [24] used LSTM to predict stock market movement based on historical prices and technical indicators. They reported 55.9% accuracy in their prediction results and described them as promising. [25] used backpropagation neural networks (BPNN) and the improved sine cosine algorithm (ISCA) with Google Trends to predict opening price movement for the S&P 500 and DJI indices, respectively. [26] used LSTM for price prediction with a wavelet transform to remove noise. Recently, [27] found that MLP outperforms the vector autoregressive model in predicting crude oil prices. Based on these existing studies, we chose MLP, CNN and LSTM neural network models to solve the time-series price prediction problem for use in the various PMM strategies.


Fig. 1 Flowchart of the PMM model, comprised of DNN-based price predictor and RL-based MM agent connected via CPE

In this research, we integrate the RMM-Spooner model with a market price prediction feature. This integrated machine learning-based model is known as the PMM model and is shown in Fig. 1. This section comprises three subsections: the first describes the LOB simulation method, the second explains the MDP formulation of the PMM model, and the last presents the proposed RL-based MM method.

LOB Model

A limit order (LO) arrives at a certain price and volume and has two sides, namely the ask and the bid. An LOB is a sorted list of LOs awaiting execution at their quoted or a better price. The LOB simulator designed here displays the top five ask and bid orders at multiple price levels, as shown in Fig. 2.

Fig. 2 LII order book snapshot containing price and volume information

An order that executes immediately on arrival against the LOs resting in the LOB, starting with the best price, is known as a market order (MO). Each MO is matched with the best available LO on the opposite side; that is, a buy MO is filled against the lowest ask LO, and vice versa. The difference between prices at consecutive levels is denoted by the tick size. An LO, on the other hand, arrives and rests in a queue at its price level until an opposite MO arrives in the LOB. If no MO arrives but an opposite LO already exists at the same price level, then both LOs are executed immediately.

LOBs are widely accepted and presumably almost half of the financial markets use them for maintaining arrival and execution of orders [28]. Furthermore, they provide remarkable market insight, hence we use a realistic data driven LOB simulator for the RL environment. The PMM agent places both ask and bid LOs together with some intended spread, known as \(quoted_{spread}\) (denoted by qs in Eq. 1).

$$\begin{aligned} Return = qs \times v \times z \end{aligned}$$

A trade is considered successful when both the ask-side and the bid-side LOs of a quote are executed. Here, v denotes the number of units (e.g. the number of shares of a stock) bought and sold in a trade, and z is the number of trades that occurred between the PMM agent and other traders. The PMM agent makes a profit on every successful trade. If only one side executes, inventory accumulates: when only the ask-side LO is filled the inventory decreases, and when only the bid-side LO is filled it increases. There is a set inventory range (e.g. -100 to 100) beyond which all LOs are automatically cancelled in order to bring the inventory back to zero. Nonzero inventory is cleared against MOs; any LOs that remain unfilled are cancelled at the end of the day. The Return is accumulated incrementally with each trade of v units, for each equity traded.
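As an illustration of Eq. 1 and the inventory rule described above, the following minimal Python sketch may help. The function names (`mm_return`, `update_inventory`) are hypothetical and chosen for this example only; the sketch is not the simulator used in this work.

```python
def mm_return(quoted_spread: float, volume: int, trades: int) -> float:
    """Eq. 1: Return = qs * v * z."""
    return quoted_spread * volume * trades


def update_inventory(inventory: int, ask_filled: bool, bid_filled: bool,
                     volume: int) -> int:
    """Inventory changes only when exactly one side of the quote is filled."""
    if ask_filled and not bid_filled:
        return inventory - volume  # the agent sold without buying back
    if bid_filled and not ask_filled:
        return inventory + volume  # the agent bought without selling
    return inventory  # both sides or neither side filled: no change


# Example: a quoted spread of 0.5, 10 units per trade, 4 successful trades
example_return = mm_return(0.5, 10, 4)
```

Under these assumptions, a wider quoted spread raises the return per trade, while the number of trades z captures how often both sides are actually filled.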

MDP Formulation

Markov decision processes (MDPs) are stochastic control processes, and [29] state that MDPs are suitable candidates for modelling sequential decision-making problems, including MM. [30] describe RL as a method of mapping situations to actions by assessing a scalar reward signal, which can be used to solve MDPs. RL is a type of machine learning that is neither supervised nor unsupervised: the learning agent interacts directly with its surrounding environment without any supervision. The RL agent observes the environment state and picks the best action to execute, and the environment then returns a scalar reward signal. The agent uses this reward as feedback to assess the quality of the chosen action. The agent then observes the new environment state, picks the best action and again receives a reward. In this manner a sequence of rewards is obtained over discrete time steps, and the goal of the RL agent is to maximize the cumulative reward over this sequence.

State Space

From the concepts of MDPs, an RL problem consists of an agent moving from one situation to another in discrete time steps. Mathematically, these situations are known as states and are either discrete or continuous in nature. The RL agent starts in some initial state, then transits from one state to another, and finally reaches some final or terminal state. This transition from the initial to the terminal state is known as an episode. The agent transits from one particular state to another with a certain probability, which is governed by the state transition function. In other words, the state transition function describes how the states are connected, i.e. which state leads to which other state. This connection information of the state space is known as the model of the RL environment. There are two types of RL approaches, namely model-based and model-free. In the model-based case, the state transition function is known in advance, whereas in the model-free case no transition probability information is available. In most practical scenarios the model of the environment is unknown or difficult to obtain because the transition dynamics are unavailable, and so the model-free RL approach is widely used in solving real problems, including MM. Moreover, when the states are only partially observable to the agent at the time of decision making, the RL algorithm solves a partially observable Markov decision process (POMDP) instead of an MDP.

Some technical indicators provide useful insight and facilitate the LOB simulation for the PMM agent, namely:

  • Volatility: the dispersion of returns of an equity

  • Volume imbalance: the ratio of ask to bid volume in LOB

  • Relative strength index: measures the recent change in price

  • Market spread: the difference between lowest ask and the highest bid of LOB

  • Mid-price movement: the change in the average value of lowest ask and highest bid prices in LOB

We use a lookback window method to calculate the volatility, relative strength index and market spread from historical time-series LOB data. The standard formulae are used for the calculation of each of the above-mentioned indicators. The length of the lookback windows is 60 days. The historical volatility is computed as the square root of the variance (\(\frac{1}{n} \sum _{i=1}^{n} (x_i - \mu )^2\)) of the window of historical prices, or simply the standard deviation of historical prices. The relative strength index is computed using the formula (midprice up - midprice down)/(midprice up + midprice down), where midprice up is the mean of the upward mid-price moves in the historical window and midprice down is the mean of the downward mid-price moves in the same window. The market spread is computed using the formula (lowest ask price - highest bid price). In our formulation, the state is continuous in nature, similar to RMM-Spooner and [31], and is denoted by 8 variables, namely volatility, volume imbalance, relative strength index, market spread, mid-price move, inventory, ask distance and bid distance. The inventory denotes the volume of asset(s) held by the PMM agent, either positive (long position) or negative (short position). The ask distance is the distance between the best open order in the ask book and the best ask price, whereas the bid distance is the difference between the best open order on the bid side and the best bid price.
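The indicator formulae described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, downward moves are taken as positive magnitudes (an assumption, since the text does not state the sign convention), and the handling of a flat window is likewise assumed.

```python
from statistics import mean, pstdev


def volatility(prices):
    """Population standard deviation (1/n variance) of the prices in the window."""
    return pstdev(prices)


def relative_strength(mid_prices):
    """(up - down) / (up + down), using mean upward and downward mid-price moves."""
    moves = [b - a for a, b in zip(mid_prices, mid_prices[1:])]
    ups = [m for m in moves if m > 0]
    downs = [-m for m in moves if m < 0]  # magnitudes: an assumption of this sketch
    up = mean(ups) if ups else 0.0
    down = mean(downs) if downs else 0.0
    if up + down == 0:
        return 0.0  # flat window: assumed to give a neutral value
    return (up - down) / (up + down)


def market_spread(best_ask, best_bid):
    """Lowest ask price minus highest bid price."""
    return best_ask - best_bid


def volume_imbalance(ask_volume, bid_volume):
    """Ratio of ask volume to bid volume in the LOB."""
    if bid_volume == 0:
        raise ValueError("bid volume must be nonzero")
    return ask_volume / bid_volume
```

With this convention the relative strength value lies in [-1, 1], being positive when upward moves dominate the window and negative when downward moves dominate.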

Action Space

As defined in the section above, the RL agent transits between states within the state space. These transitions occur when the RL agent executes an action. Like states, actions can be discrete or continuous in nature. We design our action space similarly to RMM-Spooner. Each action has two parts, namely an ask-book level and a bid-book level: the ask-book level specifies the LOB level at which the ask quote is placed, and the bid-book level specifies the LOB level for the bid quote. The action space contains 9 discrete actions, namely \(Quote(1,1), Quote(2,2), Quote(3,3), Quote(1,0), Quote(0,1), Quote(2,0), Quote(0,2), Quote(3,0), clear_{inventory}\).


Reward Function

A reward is a scalar feedback signal for a specified action in a state of the environment, and it distinguishes RL from unsupervised learning, where the goal is to extract hidden patterns in unlabeled data [30]. In other words, a reward function is a mathematical representation of the targeted goal of the RL agent. In the realm of MM, RMM-Spooner defined the reward as a function of asset returns and the RL agent’s inventory. We use a similar reward function, represented by Eq. 2.

$$\begin{aligned} Reward_t = \phi _t - (\lambda \times max((inv_t \times \xi _t),0)) \end{aligned}$$

The components of the reward function are as follows: \(\phi _t\) denotes the net profit/loss received, \(inv_t\) is the inventory of the PMM agent (see Section 3.2.1 for detail), \(\lambda \in [0,1]\) is a parameter that controls the influence of inventory on the reward (\(\lambda\)=0.7, from RMM-Spooner) and \(\xi _t\) is the LOB midprice at time t. The PMM agent trains itself to make optimal decisions through trial and error, and its behaviour after learning relies solely on the components of the reward function. Correct modelling of the reward function for the problem at hand is therefore crucial to achieving the desired behaviour of the RL agent. The intended goal of a PMM agent is to maximize the value of the reward function by maximizing the asset return and minimizing the inventory. Moreover, the inventory risk can be controlled through the regulator \(\lambda\) (Eq. 2), while the PMM agent learns to pick, in every state of the environment, an action that yields a higher long-term return with minimum risk.
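Eq. 2 can be transcribed directly into code. The sketch below follows the equation exactly as written; the function and argument names are hypothetical. Note that, as written, the \(max(\cdot , 0)\) term is nonzero only when the product of inventory and midprice is positive.

```python
def reward(pnl: float, inventory: float, mid_price: float,
           lam: float = 0.7) -> float:
    """Eq. 2: Reward_t = phi_t - lambda * max(inv_t * xi_t, 0)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    inventory_penalty = lam * max(inventory * mid_price, 0.0)
    return pnl - inventory_penalty
```

Setting `lam` to 0 removes the inventory term entirely, so the reward reduces to the net profit/loss; larger values penalise held inventory more heavily.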

Value Function

The objective of a PMM agent is to discern the relationship between the state space and the action space through an evaluation of reward signals. This objective is achieved by deriving an optimal policy (a mapping from states to actions which leads to the desired behaviour of the agent) from either an optimal state-value function, denoted by V*(s), or an optimal action-value function, denoted by Q*(s, a). We prefer Q*(s, a) over V*(s) for finding an optimal policy, since Q*(s, a) provides a better estimate of the policy than V*(s) [32]: the action-value function can distinguish the expected cumulative reward of different actions, whereas the state-value function considers only the state when estimating the expected cumulative reward.

The state space is continuous in nature. This makes traditional RL methods, such as tabular methods for storing the learning experience, impractical because there are infinitely many state-action pairs. Therefore, a function approximation method is required to estimate the Q*(s, a) function. We use a linear parametric function approximation method, namely tile-codings [33], for this purpose. Q*(s, a) estimates the quality of an action a in terms of the expected total cumulative reward obtained when the PMM agent starts in a state s and continues until it reaches the terminal state. A sequence of interactions between the agent and its environment, from the initial to the terminal state, is known as an episode. As mentioned earlier, Q*(s, a) denotes the optimal policy of the PMM agent, which means that the agent uses it to select an optimal action in an unseen state.

There are two types of RL policies, namely the target policy and the action-selection policy. The target policy is the policy being learned in order to estimate the action-value function, whereas the action-selection policy is used to select actions while interacting with the environment. RL algorithms can be categorized as: 1) off-policy; and 2) on-policy. An off-policy method estimates a target policy while following a different, fixed action-selection policy, whereas an on-policy RL agent uses the target policy as the action-selection policy while learning. An off-policy method requires an action-selection policy in advance, which is difficult to obtain in most real-world problems, including MM. Therefore, we prefer to employ a well-known on-policy approach, namely SARSA (see Eq. 3), based on the pioneering temporal-difference (TD) learning algorithm [33].

$$\begin{aligned} Q_{new}(s_t, a_t) \leftarrow Q_{old}(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q_{old}(s_t, a_t)) \end{aligned}$$

As mentioned earlier, the state space of the PMM agent is continuous; therefore a function approximator (FA) (e.g. a linear function of features such as tile-codings, a neural network, a Gaussian distribution-based FA [34], etc.) is required to generalize the learning experience across the entire state space. SARSA uses the state and action at time t, along with the reward, state and action at time t+1, to estimate the TD-error term \((r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q_{old}(s_t, a_t))\) in Eq. 3. The TD-error is computed by bootstrapping the future Q value discounted by \(\gamma\), adding the reward \(r_{t+1}\), and subtracting the current Q estimate. The FA, whether a linear function or a neural network, represents the continuous state space as the product of features and their weights. The PMM agent estimates the weights by minimizing the TD-error. The learned weights of the FA provide an estimate of the true action-value function. Hence, the learned Q*(s, a) function is treated as an optimal RL policy, denoted by \(\pi ^*\).
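The SARSA update of Eq. 3 with a linear function approximator over binary features can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: the state is assumed to be already encoded as a list of active tile indices, the step size is divided by the number of active tiles (a common convention, assumed here), and the values of `alpha` and `gamma` are placeholders, not the settings in Table 1.

```python
def q_value(weights, active_tiles, action):
    """Linear Q estimate: sum of the weights of the active binary features."""
    return sum(weights[action][i] for i in active_tiles)


def sarsa_update(weights, active_tiles, action, reward,
                 next_active_tiles, next_action,
                 alpha=0.1, gamma=0.9, terminal=False):
    """One SARSA step (Eq. 3); updates weights in place and returns the TD-error."""
    # Bootstrap from the next state-action pair unless the episode has ended.
    target = reward
    if not terminal:
        target += gamma * q_value(weights, next_active_tiles, next_action)
    td_error = target - q_value(weights, active_tiles, action)
    # Spread the step across the active tiles (assumed convention).
    step = alpha / len(active_tiles)
    for i in active_tiles:
        weights[action][i] += step * td_error
    return td_error


# Example: 2 actions, 4 tiles, all weights initialised to zero
weights = [[0.0] * 4 for _ in range(2)]
td = sarsa_update(weights, [0, 1], 0, 1.0, [2, 3], 1, alpha=0.5, gamma=0.9)
```

Because the next action is the one actually chosen by the agent's own policy, the update is on-policy, in contrast to an update that bootstraps from the maximum Q value over actions.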

The Proposed Method

We propose a PMM model-based MM agent which comprises two main components: 1) an RL-based MM agent; 2) a DNN-based price predictor. The PMM agent merges two market prices, i.e. the current price and the predicted price, to estimate a final consolidated market price. The agent places LOs in the LOB with a certain price and volume. All LOs are placed in the LOB at discrete points in time during trading hours. At an instant of time t, the PMM agent can access the market price (the average of the best ask and best bid prices) from the LOB. This market price of an equity is used as a reference point to calculate the simultaneous bid and ask quotes [3]. Unlike the two benchmarks (RMM-Spooner and AS model), the PMM agent has the flexibility to merge the future market movement with the current market trend. However, like RMM-Spooner, the PMM agent might not handle large-volume orders effectively (a limitation of the market replay method).

The time-series market price prediction problem is solved using a sliding-window supervised learning method. The sliding window is an array containing s historical market prices at time t. This array feeds the historical prices into the DNN-based price predictor (right-hand side of Eq. 4) to predict the equity price at the next time step, denoted by \(\psi _{t+1}\) (left-hand side of Eq. 4).

$$\begin{aligned} \psi _{t+1} = \Gamma (\xi _t, \xi _{t-1}, \dots , \xi _{t-s}) \mid s \ge 1 \end{aligned}$$
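The sliding-window construction behind Eq. 4 can be illustrated with a short Python sketch that turns a price series into supervised (input, target) pairs. The function name and the `window` parameter are hypothetical; `window` is the number of past prices fed to the predictor, so it should be chosen to match the number of arguments of \(\Gamma\) in Eq. 4.

```python
def sliding_windows(prices, window):
    """Build supervised pairs: each input holds `window` consecutive prices,
    and the target is the price that immediately follows them."""
    if window < 1:
        raise ValueError("window must be at least 1")
    inputs, targets = [], []
    for end in range(window, len(prices)):
        inputs.append(prices[end - window:end])  # prices up to time t
        targets.append(prices[end])              # price at time t+1
    return inputs, targets


# Example: five mid-prices with a window of two past prices
X, y = sliding_windows([1, 2, 3, 4, 5], 2)
```

Each row of `X` plays the role of the arguments of \(\Gamma\), and the matching entry of `y` is the value the predictor is trained to output.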

At time t, the historical LOB data provides the state space information, in terms of state variables, to the PMM agent. Then, the agent amalgamates the market prices of two discrete time steps, i.e. (\(\xi _t\)) and (\(\psi _{t+1}\)), using CPE (Eq. 5) and computes the consolidated price denoted by \(\Psi _t\) (left-hand side of Eq. 5).

$$\begin{aligned} \Psi _t = (w \times \xi _t) + ((1-w) \times \psi _{t+1}) \end{aligned}$$
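A direct transcription of the CPE (Eq. 5) is given below as a minimal sketch; the function name is hypothetical and the range check reflects the interval \(w \in [0.5, 1]\) stated earlier.

```python
def consolidated_price(current_price: float, predicted_price: float,
                       w: float) -> float:
    """Eq. 5: Psi_t = w * xi_t + (1 - w) * psi_{t+1}."""
    if not 0.5 <= w <= 1.0:
        raise ValueError("w is expected to lie in [0.5, 1]")
    return w * current_price + (1.0 - w) * predicted_price
```

With w=1 the consolidated price equals the current price and the prediction is ignored; with w=0.5 the current and predicted prices contribute equally.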

A linear function approximator, named tile-codings, is used to retain the learning experience. The Q(s, a) function values for all actions in the discrete action space are then computed from the tile-codings. An epsilon-greedy action-selection policy is used to choose an action from the action space based on these Q(s, a) values. The selected action (an ask and a bid quote calculated using the consolidated price \(\Psi _t\)) is then executed, and a reward value is obtained from the reward function. The returned reward is then used to update the Q(s, a) function using the SARSA algorithm. In this manner, the PMM agent reduces the risk of sudden price fluctuations, which is always present in the markets. CPE makes PMM flexible by automatically adjusting the quotes according to the future movement of the market price. Moreover, the predicted price \(\psi _{t+1}\) carries a risk of sudden market momentum reversal due to the irrational behaviour of traders in the market. The CPE weight acts as a risk controller for the PMM agent, a mechanism which is not present in RMM-Spooner or the AS model.
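The epsilon-greedy selection step described above can be sketched as follows. The function name and the optional `rng` argument are assumptions introduced for this illustration; the value of epsilon is a hyperparameter and is not fixed here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the index of the chosen action.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the largest Q value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For the 9-action space used here, `q_values` would hold one Q(s, a) estimate per action for the current state, and the returned index identifies the quote (or inventory-clearing action) to execute.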

As mentioned earlier, the DNN models responsible for market price prediction use the sliding-window method for training on the historical time-series LOB data. These DNNs are built using the TensorFlow framework and are trained on 80% of the data. In the MLP architecture, each hidden layer contains 100 neurons with the rectified linear unit activation function, and the mean-squared-error loss function is used. The LSTM architecture also contains 100 neurons in each hidden layer, with the adam optimizer and the mean-squared-error loss function. The CNN has a convolutional layer as its input layer and max-pooling and flattening layers as hidden layers. Each convolutional layer contains filters of size 2 with rectified linear unit activation. The CNN uses the adam optimizer for optimizing the network. The numbers of hidden layers in MLP, LSTM and CNN are 2, 3 and 3, respectively.

The hyperparameters of the first component (an RL-based MM agent) of the PMM model are taken from RMM-Spooner (1-11 in Table 1) and are therefore treated as optimal. For the second component (the DNN model for price prediction), the values of hyperparameters 12-14 in Table 1 are selected randomly, and trial and error is used to find the best combination for each DNN architecture.

Table 1 Optimal hyperparameters of the RL-based MM agents, including PMM

The DNN models are trained using the well-known backpropagation method, which propagates the absolute error (the positive difference between the predicted and the actual value) and then adjusts the weight coefficients of the input neurons. This training process includes trying different random values of hyperparameters 12-14 shown in Table 1 in order to minimize the root-mean-square error (RMSE) and to obtain the best hyperparameter values for each of the DNN architectures. Multiple PMM strategies are developed based on the hyperparameters shown in Table 1 and two further variables: 1) the CPE weight w; 2) the type of DNN model (MLP, LSTM or CNN). We test numerous PMM strategies based on combinations of these two further hyperparameters, such as \(PMM_{(LSTM, w=0.90)}\), \(PMM_{(CNN, w=0.95)}\), etc.


Section 5.1 describes the LOB simulation datasets which are collected from an open source data service owned by CBOE. The dataset contains quotes and trades data, where quotes represent the unfilled limit orders resting in LOB and trades are the filled quotes. Section 5.2 analyses the empirical performance of various PMM strategies in terms of MM returns on individual stocks and ETFs.


An LII (or L2) LOB keeps track of quotes placed by different market participants, including MM agents. CBOE provides intraday high-frequency (typically 5 seconds) tick-by-tick LII book trades and quotes data, which contains the top five asks and bids along with the recent trades during market operational hours. Each quote contains the price and volume at which it is waiting to be filled. Moreover, the price and volume information is updated every five seconds throughout the day. A computerized method randomly chooses ten options from the top 100 listed on CBOE, namely: Vodafone Group Plc (VOD), American Airlines Group Inc (AAL), GlaxoSmithKline Plc (GSK), Altria Group Inc (MO), Amazon Inc (AMZN), Walmart Inc (WMT), Nvidia Corporation (NVDA), Chevron Corporation (CVX), United Parcel Service (UPS) and Texas Instruments Incorporated (TXN), which belong to different sectors and thus provide diversity similar to that of ETFs. The quotes and trades data of the ten stocks and three ETFs (SPY, DIA and XLF) for ten months (1 Jan 2019 - 30 Sep 2019), from market opening to closing time, i.e. 8.30 to 16.30 Monday to Friday, is gathered and preprocessed for experimentation.

Results and Discussion

Table 2 Out-of-sample backtesting averaged returns (average value of PnL returns of multiple RL agent’s testing episodes) of the best PMM strategies and the benchmarks
Table 3 Out-of-sample backtesting averaged returns (average value of PnL returns of multiple RL agent’s testing episodes) of the PMM strategies, based on arbitrary CPE weight (w) values for each DNN architecture

As mentioned earlier, the hyperparameters used to develop each MM strategy, including both benchmarks, are the same and are shown in Table 1. The PMM strategies are the MM strategies developed by optimizing the CPE weight w (Eq. 5) for each DNN model (MLP, LSTM and CNN). We design an “ideal” PMM strategy, denoted by \({ PMM_{(perfect, w=0.5)}}\), against which to compare the best performing PMM methods. The DNN-based price prediction models are trained on a historical data range (1 Jan 2019 - 30 Sep 2019), and the “ideal” method was aware of future prices while training, hence there is no error in its prediction. Because the “ideal” PMM method knows the true future market price in advance, we consider it a theoretical PMM benchmark for our empirical study. For this method, the contributions of the current and the predicted market prices to the consolidated price are kept equal (w=0.5 in Eq. 5). The “ideal” approach outperformed every other MM strategy, including PMM, in terms of the total number of positive returns (USD) across the ten stocks (Tables 2 and 3). Our empirical analysis shows that the CPE weight w and the RMSE of the DNN model are linearly related (Fig. 3). In simple terms, the more accurate the prediction of the model, the lower the w; this suggests that w depends on accuracy, rather than accuracy on w. The curve recommends using a lower w with a lower RMSE and a higher w with a higher RMSE.

Fig. 3 Linear relationship between RMSE and CPE weight (w)

The contribution of the predicted market price should decrease (i.e. w should increase) as the RMSE of the DNN models increases. When RMSE > 0, w \(\in\) (0.5, 1], and when RMSE is 0, w=0.5. Moreover, the existing practical MM benchmark, namely RMM-Spooner, uses w=1. Therefore, the range of w used in developing and evaluating PMM strategies is (0.5, 1].
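
The role of w can be illustrated with a short sketch. Eq. 5 is not reproduced in this section, so the code assumes that the consolidated price is a convex combination in which w weights the current market price and 1 - w weights the predicted price; this is consistent with the statements that w=1 corresponds to RMM-Spooner (no prediction) and w=0.5 weights both prices equally. The function names, the `slope` parameter of the linear RMSE-to-w mapping and the example prices are hypothetical; the actual slope would have to be read from Fig. 3.

```python
def consolidated_price(current: float, predicted: float, w: float) -> float:
    """Assumed form of the consolidated price: w * current + (1 - w) * predicted."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * current + (1.0 - w) * predicted


def cpe_weight_from_rmse(rmse: float, slope: float) -> float:
    """Linear RMSE-to-w mapping: w = 0.5 at RMSE = 0, capped at w = 1.

    `slope` is a hypothetical parameter, not a value taken from the paper.
    """
    if rmse < 0.0 or slope <= 0.0:
        raise ValueError("rmse must be non-negative and slope positive")
    return min(1.0, 0.5 + slope * rmse)


# Illustrative prices only.
print(consolidated_price(100.0, 102.0, 1.0))   # only the current price is used
print(consolidated_price(100.0, 102.0, 0.5))   # equal contribution of both prices
print(cpe_weight_from_rmse(2.0, slope=0.125))  # a larger RMSE gives a larger w
```

Under this assumed form, a less accurate predictor (higher RMSE) is given less influence on the price the agent quotes around, which is the behaviour described above.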

However, in practice no DNN time-series prediction model has an RMSE of exactly 0. In fact, the price prediction component of PMM needs to predict the future price in real time, and such a prediction cannot be perfectly accurate (RMSE \(\ne\) 0). For all three DNN models (MLP, LSTM and CNN), the w values lie in the range (0.5, 1], as they are real prediction models and do not have an RMSE of 0. Therefore, we tried multiple arbitrary w values for each architecture, as shown in Table 3. Then, we selected the best performing w value for each of the three architectures, as shown in Table 2. Strategies 1, 2 and 3 are the best performing practical PMM strategies, with their respective w values depending on the RMSE values (Table 4) of their corresponding DNN architectures.
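
The selection of w for one architecture can be sketched as a simple search over candidate values. The selection criterion is an assumption here: the sketch ranks each candidate w by the number of stocks with a positive averaged return and breaks ties by total return. The function names and all return values are hypothetical and are not taken from Table 3.

```python
from typing import Dict, List


def count_positive(returns: List[float]) -> int:
    """Number of stocks whose averaged return is strictly positive."""
    return sum(1 for value in returns if value > 0)


def select_best_w(results: Dict[float, List[float]]) -> float:
    """Pick the w with the most positive-return stocks; ties go to the higher total return."""
    return max(results, key=lambda w: (count_positive(results[w]), sum(results[w])))


# Hypothetical averaged returns (USD) per stock for three candidate w values.
hypothetical_results = {
    0.6: [-5.0, 2.0, -1.0],
    0.8: [1.0, 2.0, -4.0],
    0.95: [3.0, 1.0, -0.5],
}
best_w = select_best_w(hypothetical_results)
print(best_w)
```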

Table 4 RMSE of DNN models and market spread of stocks

We optimize the hyperparameter w of the PMM model developed using the hyperparameters in Table 1 and design multiple PMM strategies, as shown in Tables 2 and 3. The curves shown in Fig. 4 look noisy, and therefore a direct conclusion is difficult to draw.

Fig. 4 The out-of-sample backtesting investment returns of different RL-based MM strategies for twenty episodes in ten equities. Strategy 3 beats RMM-Spooner in (a) despite the unclear pattern. The RMM-Spooner model yields a smaller loss than all PMM strategies in (b). Strategy 3 is more profitable than RMM-Spooner in (c), (d), (e), (f), (g), (i) and (j). Strategy 2 is better than RMM-Spooner in (h)

Hence, we use the averaged return value of each equity for all strategies. Table 2 shows the average return over the testing episodes: twenty episodes yield twenty return values, and the average of these twenty values is the average return value of an equity. Using average return values helps reduce noise, so that a statistical conclusion can be drawn more easily. The experiments behind these results were repeated five times for each MM strategy with optimal hyperparameter settings, and the same curves were obtained in every repetition. No variation was observed across repetitions; hence the results do not originate from a statistical anomaly. Moreover, confidence intervals can be computed only when the results vary between repetitions.
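
The averaging step amounts to taking the mean of the per-episode returns of each equity. A minimal sketch follows; the tickers, the number of episodes (four instead of twenty, for brevity) and the return values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def average_returns(episode_returns: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-episode PnL returns (USD) of each equity."""
    return {ticker: mean(values) for ticker, values in episode_returns.items()}


# Hypothetical per-episode returns for two placeholder tickers.
hypothetical_returns = {
    "AAA": [10.0, -4.0, 6.0, 0.0],
    "BBB": [-2.0, -1.0, 3.0, -4.0],
}
averaged = average_returns(hypothetical_returns)
print(averaged)
```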

The out-of-sample backtesting returns (USD) of stocks are averaged over the number of backtesting episodes and reported in Tables 2 and 3. Table 2 shows the optimized PMM strategies (Fig. 4), and Table 3 shows how the returns (USD) differ with w and the type of DNN model architecture. Out of the optimized PMM strategies (strategies 1, 2 and 3), we select the best one based on the number of stocks yielding positive returns in the basket of ten stocks. A careful analysis of Table 2 shows that, after \({ PMM_{(perfect, w=0.5)}}\) (the “ideal” PMM method or theoretical PMM benchmark), strategy 3 outperforms all others, including the benchmarks (RMM-Spooner and AS model). Strategy 3 obtains positive returns (USD) in 4/10 stocks, whereas RMM-Spooner obtains positive returns (USD) in 3/10 stocks. We observe that, in the case of AMZN, the returns (USD) are the highest in every MM strategy, as AMZN has the highest market spread (Table 4). Another interesting observation is that the mean absolute difference between the stock returns (USD) of strategy 3 and RMM-Spooner is $19.66 in cases where strategy 3 outperforms RMM-Spooner and $3.23 in cases where RMM-Spooner wins. We also used a simple regression tree-based model for price prediction and conducted out-of-sample backtesting (see Table 3). Empirical results show that the best PMM strategies outperform the “Regression Tree”-based MM strategy as well.
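
The two comparison measures used above, the number of stocks with positive returns and the mean absolute difference in returns split by which strategy wins, can be computed as in the following sketch. The function names, tickers and return values are hypothetical and do not reproduce Table 2.

```python
from statistics import mean
from typing import Dict, Optional


def count_positive(returns: Dict[str, float]) -> int:
    """Number of stocks with a strictly positive averaged return."""
    return sum(1 for value in returns.values() if value > 0)


def mean_abs_gap(strategy: Dict[str, float], benchmark: Dict[str, float],
                 strategy_wins: bool) -> Optional[float]:
    """Mean absolute return difference over the stocks where the chosen side wins.

    Returns None when that side wins on no stock.
    """
    gaps = []
    for ticker, s in strategy.items():
        b = benchmark[ticker]
        wins = s > b if strategy_wins else s < b
        if wins:
            gaps.append(abs(s - b))
    return mean(gaps) if gaps else None


# Hypothetical averaged returns (USD) for three placeholder tickers.
strategy = {"AAA": 5.0, "BBB": -1.0, "CCC": 2.0}
benchmark = {"AAA": -1.0, "BBB": 0.5, "CCC": -2.0}

print(count_positive(strategy), count_positive(benchmark))
print(mean_abs_gap(strategy, benchmark, strategy_wins=True))
print(mean_abs_gap(strategy, benchmark, strategy_wins=False))
```

A large gap on the stocks where one strategy wins combined with a small gap where it loses indicates that its losses against the other strategy are comparatively mild.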

So far, strategy 3, or \(PMM_{(CNN, w=0.95)}\), is the best practical PMM strategy among all. MM agents are typically large firms or banks that trade in large collections of stocks rather than in individual stocks. Moreover, Fig. 4 shows that no single PMM strategy outperforms the benchmarks in all individual stocks. Hence, we evaluate our identified PMM strategy on diverse and large collections of stocks, so that it can be applied in a real market. ETFs, as mentioned earlier, are popular among traders as they provide a large and diversified collection of stocks and can be traded like regular stocks. Based on these advantages of ETFs, strategy 3 is tested on 3 different ETFs, namely SPY (SPDR S&P 500 ETF, one of the largest ETFs in the world), DIA (SPDR Dow Jones Industrial Average ETF) and XLF (Financial Select Sector SPDR ETF). Table 5 contains the average returns (USD) obtained from out-of-sample backtesting of strategy 3, RMM-Spooner and the AS model.

Table 5 Out-of-sample backtesting of Strategy 3, RMM-Spooner and AS model on ETFs
Fig. 5 Out-of-sample backtesting of strategy 3, RMM-Spooner and AS model over ETFs

The returns (USD) obtained using strategy 3 in all 3 ETFs are significantly higher than those of RMM-Spooner (Fig. 5).

The overall aim of any MM agent is to place a large number of orders with tight qs in order to enhance the market liquidity. As a general rule, market liquidity is inversely proportional to qs and directly proportional to z. The average values of qs and z of strategy 3, as shown in Table 6, are significantly higher than those of the two benchmarks, both for ETFs and for individual stocks. The final outcome of this study suggests that PMM with CNN and w=0.95 is a practical PMM approach and can be applied in a large stock exchange for MM in a large collection of stocks, including ETFs.

Table 6 Q-Spread (quoted spread of MM agent) and Trades (number of successfully executed orders) collectively represent the market liquidity provided


Market making plays a vital role in preserving market liquidity and the interest of participants, in return for a minute profit on every successful trade. Almost all major stock exchanges employ MM agents to boost market liquidity and to smooth trade execution. Therefore, it becomes necessary to develop MM methods that can enhance the profits of MM agents and bolster market liquidity further. This paper proposes a novel concept, known as PMM, which aims to improve the investment returns of an RL-based MM agent by integrating a DNN-based market price prediction feature. The proposed PMM method was evaluated on individual stocks and on large collections of stocks, known as ETFs. Empirical analysis of out-of-sample backtesting suggests that \(PMM_{(CNN, w=0.95)}\), or strategy 3, is a practical MM approach that can trade profitably in large collections of stocks. The research concludes that the PMM agent is intelligent due to its market anticipation feature and can provide more liquidity to the market than the benchmarks (RMM-Spooner and AS model).