1 Introduction

In financial media, extensive attention is given to the study of charts and visual patterns. Known as technical analysis or chartism, this form of financial analysis relies solely on historical price and volume data to produce forecasts, on the assumption that specific graphical patterns hold predictive information for future asset price fluctuations [1]. Early research into genetic algorithms devised solely from technical data (as opposed to e.g. fundamentals or sentiment analysis) showed promising results, sustaining the view that there could be substance to the practice [2, 3].

The rising popularity of neural networks in the past decade, fuelled by advances in computational processing power and data availability, renewed interest in their applicability to the domain of finance. Krauss et al. [4] applied multilayer perceptrons (MLPs) to find patterns in the daily returns of the S&P500 stock market index. Dixon et al. [5] further demonstrated the effectiveness of neural nets on intraday data, deploying MLPs to classify returns on commodity and FX futures over discrete 5-min intervals. Architectures comprising 4 dense hidden layers were sufficient to generate annualised Sharpe ratios in excess of 2.0 on their peak performers. In each instance, patterns were sought in the time series of returns rather than in the price process itself.

Seminal findings by Lo et al. [6] employed instead the visuals emerging from line charts of stock closing prices, relying on kernel regression to smooth out the price process and enable the detection of salient trading patterns. An equally common visual representation of price history in finance is the candlestick. Candlesticks encode the opening price, closing price, maximum price and minimum price over a discrete time interval, visually represented by a vertical bar with lines extending on either end. Much as with line charts, technical analysts believe that specific sequences of candlesticks reliably foreshadow impending price movements. A wide array of such patterns is commonly watched for [7], each with its own pictogram and associated colourful name (‘inverted hammer’, ‘abandoned baby’, etc.).

Though recurrent neural networks, and in particular Long Short-Term Memory (LSTM) models [8], have been the most popular choice for deep learning on time series data [9, 10, 11], promising results have begun to appear from the application of convolutional neural networks to financial data. Neural networks have frequently been labelled as black boxes, limiting their deployment in domains where interpretability is sought. Convolutional neural networks partially overcome this, by extracting locally interpretable features in their early layers. Furthermore, recent research suggests that these models bear the capacity to generalise not merely across time but across assets as well, identifying universal features of stock market behaviour [12, 13].

The contributions of this paper are threefold: firstly, we rigorously evaluate the practice of candlestick chartism, and find little evidence to support it. The human-engineered features prescribed by technical analysis produce classifiers that barely outperform guesswork, unlike the patterns identified through deep learning. Secondly, we show that filters learned and tested on 22 years of S&P500 price data in a CNN architecture yield modest gains in accuracy over both technical methods and machine learning alternatives, including MLPs unsupported by the feature extraction capabilities of a convolutional layer. Thirdly, we demonstrate that considerable gains in forecasting capability are achievable through ensemble methods and thresholding based on model confidence.

This paper evaluates quantitatively the merits of candlestick-driven technical analysis before proposing an improved, data-driven approach to financial pattern recognition. Formally, we reframe candlestick patterns as a form of feature engineering intended by chartists to extract salient features, facilitating the classification of future returns with higher fidelity than the raw price process would otherwise allow. After a brief review of neural networks, we define the data used throughout the paper (Sect. 2) and motivate the pursuit of new, better visual heuristics for finance by assessing the predictiveness of candlestick formations (Sect. 3). Feeding candlestick data through a neural network involving separate filters for each technical pattern, we classify next-day returns with the filters implied by chartist doctrine (Sects. 4.1, 4.2) and set this cross-correlational approach as a baseline to improve upon [14]. We then compare the model’s accuracy when filters are not preset but instead learned by convolutional neural networks (CNNs) during their training phase (Sect. 4.3), and benchmark deep learning against alternative methods drawn from both traditional finance and machine learning (Sect. 4.4). We enhance the accuracy of CNNs through the addition of thresholding and ensembling (Sect. 4.5), and finish with two practically minded extensions: the backtested performance of the model (Sect. 4.6) and the visual interpretation of the features extracted by the CNN (Sect. 4.7).

2 Methodology overview

Over the last decade, neural networks have risen dramatically in popularity, propelled by the success of deep learning in a wide range of practical applications. Formally, neural networks map inputs to outputs through a collection of nonlinear computation nodes, called neurons, stacked into hidden layers. Inputs and outputs are connected by potentially many such hidden layers, leading to the so-called deep learning architectures. Neural networks can be interpreted as an ultra-parametric extension of linear regression models, wherein each neuron computes a weighted linear combination of its inputs, applies a nonlinear transformation to the newfound value and forwards its output to the next layer’s neurons.

2.1 Multilayer perceptrons

The benefit of nonlinear transformation is particularly pronounced as architectures are extended in depth: without nonlinearity, additional layers would not confer any incremental value, as the linear combinations of linear combinations would themselves just be linear combinations with different weights. In other words, multiple hidden layers without nonlinear transformations would be equivalent to a single hidden layer with appropriately chosen weights. The inclusion of nonlinear transformations, termed activation functions, between the hidden layers allows neural networks to learn complex functional mappings. MLPs harness this potential by including multiple layers between input and output and allowing each layer to possess many neurons (Fig. 1). The activation functions employed are commonly the hyperbolic tangent function \(\tanh (o)\), logistic function \(\sigma (o)\) and rectified linear unit ReLU(o).

Fig. 1 A multilayer perceptron with 2 hidden layers. Each neuron within a layer computes a linear combination of its inputs followed by a nonlinear transformation

$$\begin{aligned} \tanh (o)= & {} \frac{\sinh (o)}{\cosh (o)} = \frac{e^o-e^{-o}}{e^o+e^{-o}} \end{aligned}$$
(1)
$$\begin{aligned} \sigma (o)= & {} \frac{1}{1+e^{-o}} \end{aligned}$$
(2)
$$\begin{aligned} \text{ ReLU }(o)= & {} \max (0,o) \end{aligned}$$
(3)

We adopt ReLU as our preferred activation function, a common choice in neural network architectures given its computational efficiency and performance [15].
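The following minimal sketch illustrates the two points made above: the three activation functions of Eqs. 1 to 3, and the fact that two stacked linear layers without an activation collapse into a single linear layer. The weight matrices `W1` and `W2` and the input `x` are toy values chosen for illustration, not parameters from our models.

```python
import math

def tanh(o):
    # Eq. 1: hyperbolic tangent
    return (math.exp(o) - math.exp(-o)) / (math.exp(o) + math.exp(-o))

def sigmoid(o):
    # Eq. 2: logistic function
    return 1.0 / (1.0 + math.exp(-o))

def relu(o):
    # Eq. 3: rectified linear unit
    return max(0.0, o)

def matvec(W, x):
    # Weighted linear combination computed by one layer (no bias, for brevity)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Toy weights and input (illustrative assumptions only)
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [1.0, 1.0]

# Two linear layers in sequence equal one layer with weights W2 @ W1
two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)

# Inserting ReLU between the layers breaks this equivalence
with_relu = matvec(W2, [relu(h) for h in matvec(W1, x)])
```

With these toy values the first hidden unit is negative before activation, so ReLU zeroes it and the final output differs from the purely linear case.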

2.2 Convolutional neural networks

Convolutional neural networks extend multilayer perceptrons, by adding one or several additional layers at the beginning of the architecture. These layers, termed convolutional layers, consist of a set of learned filters. These filters are typically much smaller than the input, and measure local similarity (calculated by sliding dot or Hadamard product). The output of a convolutional layer is a feature map, identifying regions where the input to the layer was similar to the learned filter. In effect, convolution functions as bespoke feature extractors for neural network architectures, enabling in the process vastly superior model performance (Fig. 2).

Fig. 2 A convolutional neural network with 2 convolutional layers. The red hashed outline contains the feature-extractive convolutional layers, and the blue hashed outline is effectively a single-layer perceptron. In this architecture, convolutional outputs over a local area are reduced to a single value via subsampling or pooling, an operation designed to improve the model’s memory footprint and invariance to translations/rotations. Image from Nvidia (https://developer.nvidia.com/discover/convolutional-neural-network)
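As a rough illustration of the sliding dot product and the pooling step described above, the sketch below slides a small filter along a 4-row signal and then max-pools the resulting feature map. The signal, the filter and the pool size are toy assumptions; in a CNN the filter weights would be learned rather than set by hand.

```python
def cross_correlate(signal, kernel):
    """Slide a 4 x m kernel along a 4 x n signal.

    Returns the n - m + 1 dot products (the feature map).
    """
    n, m = len(signal[0]), len(kernel[0])
    feature_map = []
    for start in range(n - m + 1):
        total = 0.0
        for signal_row, kernel_row in zip(signal, kernel):
            for j in range(m):
                total += signal_row[start + j] * kernel_row[j]
        feature_map.append(total)
    return feature_map

def max_pool(feature_map, size):
    """Reduce each non-overlapping block of `size` outputs to its maximum."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]

# Toy 4 x 4 signal, rows ordered open, close, high, low
signal = [
    [1.0, 2.0, 3.0, 2.0],  # open
    [2.0, 3.0, 2.0, 1.0],  # close
    [2.5, 3.5, 3.5, 2.5],  # high
    [0.5, 1.5, 1.5, 0.5],  # low
]
# Hand-made 4 x 2 filter that responds to close > open on two consecutive days
kernel = [[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

feature_map = cross_correlate(signal, kernel)
pooled = max_pool(feature_map, 2)
```

The feature map is largest where the two-day window contains two rising days, which is the sense in which the output identifies regions similar to the filter.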

2.3 Definition of candlestick data

Both the financial time series data and the candlestick technical filters used by chartists take the same form. Asset price data for a discrete time interval is represented by four features: the opening price (price at the start of the interval), closing price (price at the end of the interval), high price (maximum over the interval) and low price (minimum over the interval). The candlestick visually encodes this information (Fig. 3): the bar’s extremities denote the open and close prices, and the lines protruding from the bar (the candle’s ‘wicks’ or shadow) denote the extrema over the interval. The colour of the bar determines the relative ordering of the open and close prices: a white bar denotes a positive return over the interval (close price > open price) and a black or shaded bar denotes a negative return (close price < open price).

Fig. 3 Candlestick representation of financial time series data

We can therefore summarise the candlestick representation of a financial time series of length n timesteps as a \(4\times n\) price signal matrix F capturing its four features. Throughout this paper, we rely on daily market data, but the methods can be extended to high-frequency pattern recognition on limit order books—an active area for current research [13].
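A minimal sketch of this representation is given below. The `Candle` type, the helper names and the row order (open, close, high, low) are illustrative assumptions; the label `"flat"` for the case where open equals close is also an assumption, since the colour convention above only covers strict inequalities.

```python
from typing import NamedTuple

class Candle(NamedTuple):
    open: float
    close: float
    high: float
    low: float

def to_price_matrix(candles):
    """Return the 4 x n matrix F with rows open, close, high, low."""
    return [
        [c.open for c in candles],
        [c.close for c in candles],
        [c.high for c in candles],
        [c.low for c in candles],
    ]

def colour(candle):
    """White for a positive interval return, black for a negative one."""
    if candle.close > candle.open:
        return "white"
    if candle.close < candle.open:
        return "black"
    return "flat"  # assumption: equal open and close is not covered by the text

# Two toy daily candles (illustrative prices only)
candles = [Candle(10.0, 11.0, 12.0, 9.0), Candle(11.0, 10.5, 11.5, 10.0)]
F = to_price_matrix(candles)
colours = [colour(c) for c in candles]
```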

2.4 Definitions of technical patterns

We include major candlestick patterns cited by practitioners of technical analysis at three timescales: 1-day, 2-day and 3-day patterns. The simple 1-day patterns include the hammer (normal and inverted), hanging man, shooting star, dragonfly doji, gravestone doji, and spinning tops (bullish and bearish, where bullish implies a positive future return and bearish implies a negative future return). Our 2-day patterns cover the engulfing (bullish and bearish), harami (bullish and bearish), piercing line, cloud cover, tweezer bottom and tweezer top. Finally our 3-day patterns cover some of the most cited cases in chartist practice: the abandoned baby (bullish and bearish), morning star, evening star, three white soldiers, three black crows, three inside up and three inside down. Figure 4 provides both the visual template associated with each pattern and the future price direction it is meant to presage. As before, we summarise a technical pattern P of length m timesteps as a \(4\times m\) matrix \(T_{P_m}\), standardised for comparability to have zero mean and unit variance.

Fig. 4 For each timescale (1-day, 2-day and 3-day), we specify 8 chartist patterns and the future direction they predict (‘bullish’ for positive returns, ‘bearish’ for negative returns)

2.5 Empirical data

Throughout our work, we use daily technical (i.e. open, close, high and low price) data from the S&P500 stock market index constituents for the period January 1994–December 2015, corresponding to n = 2,439,184 entries of financial data in the price signal F (Footnote 1). This data set covers a representative cross section of US companies across a wide timeframe suitable for learning the patterns, if any, of both expansionary and recessionary periods in the stock market.

3 Evaluation of current tools

As a preliminary motivation for the adoption of machine learning for technical forecasts, we assess the merits of candlestick chartism in finance. We run several diagnostics to assess separately the informativeness and predictiveness for each technical pattern.

3.1 Conditioning returns on the presence of a pattern

The \(4\times m\) matrix representation \(T_{P_m}\) for pattern P of length m and equal-length, standardised rolling windows \(F_{n}\) of the full price signal F at timestep n can be cross-correlated together to generate a time series \(S_P\) measuring the degree of similarity between the price signal and the pattern. For a given pattern P, at each timestep n:

$$\begin{aligned} S_{P,n} = \bigg \langle \frac{T_{P_m}}{\Vert T_{P_m}\Vert },\frac{F_{n}}{\Vert F_{n}\Vert }\bigg \rangle \end{aligned}$$
(4)

where \(\langle \cdot ,\cdot \rangle\) is the inner product of the two matrices and \(\Vert \cdot \Vert\) is the \(L^{2}\) norm.

For each pattern P, we produce a conditional distribution of next-day returns by extracting the top quantile (in our study, decile and centile) of similarity scores \(S_P\).
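The computation of Eq. 4 and the quantile selection can be sketched as follows. This is an illustration under stated assumptions rather than our exact pipeline: standardisation is taken over all entries of a window, the window for timestep n is assumed to end at n, and all function names are hypothetical.

```python
import math

def standardise(matrix):
    """Rescale all entries of a matrix to zero mean and unit variance."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - mean) / std for v in row] for row in matrix]

def inner(a, b):
    """Inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def similarity(pattern, window):
    """Eq. 4: inner product of the normalised pattern and window."""
    t, f = standardise(pattern), standardise(window)
    norm = math.sqrt(inner(t, t) * inner(f, f))
    return inner(t, f) / norm if norm > 0.0 else 0.0

def similarity_series(pattern, F):
    """Score every rolling window of F of the same length as the pattern."""
    m, n = len(pattern[0]), len(F[0])
    return [similarity(pattern, [row[end - m + 1:end + 1] for row in F])
            for end in range(m - 1, n)]

def top_quantile_indices(scores, fraction):
    """Indices of the highest scores, e.g. fraction=0.1 for the top decile."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy 1-day pattern and a 2-day price signal (rows: open, close, high, low)
pattern = [[1.0], [2.0], [3.0], [0.0]]
F = [[10.0, 5.0], [20.0, 4.0], [30.0, 3.0], [0.0, 6.0]]
scores = similarity_series(pattern, F)
matches = top_quantile_indices(scores, 0.5)
```

The next-day returns at the indices in `matches` would then form the conditional distribution studied in the remainder of this section.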

3.2 Informativeness

For our purposes, we define a technical pattern to be informative if its presence significantly alters the distribution of next-day returns in a Kolmogorov–Smirnov two-sample (K–S) test [16], comparing the unconditional distribution of all next-day returns to the distribution conditioned on having just witnessed the pattern. The K–S test is a nonparametric test which can be used to compare similarity between two empirical or analytical distributions. The two-sample K–S test is one of the most useful and general nonparametric methods, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples. As the test is nonparametric, no assumptions are made regarding the nature of the sample distributions. Denoting by \(\{R_{P,t}\}_{t=1}^{n_1}\) the subset of returns conditioned on matching pattern P and \(\{R_{t}\}_{t=1}^{n_2}\) the full set of unconditional returns, we compute their empirical cumulative distribution functions \(F_1(z)\) and \(F_2(z)\). The K–S test evaluates the null hypothesis that the distributions generating both samples have identical cdfs, by computing the K–S statistic:

$$\begin{aligned} \gamma = \bigg (\frac{n_1 n_2}{n_1+n_2}\bigg )^{1/2} \sup _{-\infty<z<\infty } |F_1(z) - F_2(z)| \end{aligned}$$
(5)

The limiting distribution of \(\gamma\) provides percentile thresholds above which we reject the null hypothesis. When this occurs, we infer that conditioning on the pattern does materially alter the future returns distribution.
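A dependency-free sketch of Eq. 5 follows. Because both empirical cdfs are right-continuous step functions, the supremum is attained at one of the observed values, so it suffices to evaluate the difference at every sample point. The constant 1.36 is the familiar asymptotic 5% critical value of the Kolmogorov distribution; it is an assumption added here for illustration and is not a threshold quoted in this paper.

```python
import bisect
import math

def ks_statistic(sample_1, sample_2):
    """Return (sup |F1 - F2|, gamma) for two samples, following Eq. 5."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    n1, n2 = len(s1), len(s2)
    sup_distance = 0.0
    for z in s1 + s2:
        # Empirical cdfs: share of each sample that is <= z
        f1 = bisect.bisect_right(s1, z) / n1
        f2 = bisect.bisect_right(s2, z) / n2
        sup_distance = max(sup_distance, abs(f1 - f2))
    gamma = math.sqrt(n1 * n2 / (n1 + n2)) * sup_distance
    return sup_distance, gamma

# Assumed asymptotic 5% critical value (not taken from the paper)
CRITICAL_5_PERCENT = 1.36

def is_informative(conditional_returns, unconditional_returns):
    """Reject the null of identical cdfs when gamma exceeds the critical value."""
    _, gamma = ks_statistic(conditional_returns, unconditional_returns)
    return gamma > CRITICAL_5_PERCENT
```

For example, `ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])` gives a supremum distance of 1 because the two samples do not overlap at all.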

3.3 Predictiveness

While these patterns may bear some information, it does not follow that their information is actionable, or even aligns with the expectations prescribed by technical analysis. Notched boxplots of both unconditional returns and returns conditioned on each of the filters (Fig. 5) allow us to gauge whether the pattern’s occurrence does in fact yield significant returns in the intended direction.

Fig. 5 Notched boxplots of the distributions of returns in basis points (100th of a percent), conditional on observing each of the technical patterns (similarity score \(S_P\) in its top centile). Whiskers cover twice the interquartile range. At a glance, none of the conditional distribution medians diverge substantively from the unconditional baseline of zero, and the distributions’ deviations dwarf their medians by two orders of magnitude

Fig. 6 Close-up of boxplot notches for the distributions of returns in basis points (100th of a percent), conditional on observing each of the technical patterns (similarity score \(S_P\) in its top centile). Absence of overlap between the boxplot notches of a conditional distribution and the unconditional distribution provides evidence at the 95% confidence threshold that the medians of the 2 distributions differ [17]. Surprisingly, several single-day patterns do in fact correlate with abnormal next-day returns. Almost all of the multi-day patterns exhibit notches that overlap with the unconditional distribution’s, implying that the distribution medians are not meaningfully changed by conditioning. Only ‘Bearish Engulfing’ and ‘Three Black Crows’ seem to be significant, as harbingers of better times despite their names

A closer examination suggests that several of the 1-day patterns are in fact relevant, but that the more elaborate 2-day and 3-day formations are not. Conditioning on 14 of the 16 multi-day patterns produces no significant alteration in the median of next-day returns distributions (Fig. 6): only the ‘Bearish Engulfing’ and ‘Three Black Crows’ patterns produce a conditional distribution for which the 95% confidence interval of the median (denoted by the notch) differs markedly from its unconditional counterpart.

Table 1 Summary statistics for the next-day return distributions conditioned on matching technical patterns
Table 2 Summary statistics for the next-day return distributions conditioned on matching technical patterns more stringently

3.4 Findings

We report the empirical results of the K–S goodness-of-fit tests and top decile and centile (Tables 1, 2, respectively) conditional distribution summary statistics, using daily stock data from the S&P500. Though several of the patterns do indeed bear information altering the distribution of future returns, their occurrence is neither a reliable predictor of price movements (high standard deviation relative to the mean) nor even, in many instances, an accurate classifier of direction. Elaborate multi-day patterns systematically perform worse than their single-day counterparts. Surprisingly, 6 of the 8 single-day patterns do in fact produce meaningful deviations from the unconditional baseline, with the dragonfly and gravestone doji standing out as significant outliers (−25.81 bp and +22.41 bp, respectively, when conditioning on the top centile of similarity score, Table 2; see Footnote 2). But even in those instances, technical analysis forecasts incorrectly, as prices move in the direction opposite to chartism’s predictions. McLean and Pontiff [18] showed that predictor variables lose on average 58% of their associated return, post-publication. In a similar vein, we hypothesise that these patterns may have once been predictive on a historical data set, but that their disclosure and subsequent overuse has long since negated their value. Conceptually, the notion of using filters in financial data to extract informative feature maps may bear merit, but the chartist filter layer is demonstrably an improper specification today.

4 Proposed improved approach

The approach of searching for informative intermediate feature maps in classification problems has seen widespread success in domains ranging from computer vision [15] to acoustic signal processing [19]. Where technical analysis uses filters that are arbitrarily pictographic in nature, we turn instead to convolutional layers to extract features from data. We evaluate the performance of passing the raw data both with and without chartist filters, and subsequently measure the incremental gain from learning optimal feature maps by convolution. The findings are then benchmarked against widely recognised approaches to time series forecasting including autoregression, recurrent neural networks, nearest neighbour classifiers, support vector machines (SVM) and random forests.

In the experimental results that follow, we split our S&P500 time series data into training and test sets corresponding to single stock prices from 1994–2004 and 2005–2015, respectively (Footnote 3). This specific train-test split of the data includes both expansionary and recessionary periods in each subset, to ensure the model is capable of learning the financial patterns that may emerge in different economic regimes (Footnote 4).

4.1 Multi-layer perceptron

To address issues of scale and stationarity, we process the original \(4 \times n\) price signal matrix F into a new \(80 \times n\) price signal matrix F* where each column is a standardised encoding of 20 business days of price data. This encoding provides 4 weeks of price history, a context or ‘image’ within which neural network filters can scan for the occurrence of patterns and track their temporal evolution. We pass F* through a multilayer perceptron (MLP) involving fully connected hidden layers. Preliminary fivefold cross-validation experiments with financial time series determined the network topology required for the model to learn from its training data (Footnote 5). Insufficient height (neurons per hidden layer) and depth (number of hidden layers) led to models incapable of learning their training data. We settled on 2 fully connected layers of 64 neurons with ReLU activation functions, followed by a softmax output layer to classify positive and negative returns. Regularisation was achieved via the inclusion of dropout [21] in the fully connected layers of the network, limiting the model’s propensity towards excessive co-adaptation across layers. A heavily regularised (dropout \(=0.5\)) 2-layer MLP is already able to identify some structure in its data (out-of-sample accuracy of 50.6% after 100 epochs, Table 3). From our experiments with alternative dropout rates, we found that insufficient regularisation (dropout \(=0.3\)) led to overfitted models with poor out-of-sample performance, but that erring on the side of excessive regularisation (dropout \(=0.7\)) was an acceptable choice, leading to similar generalisation to our base case at the expense of needing more epochs to converge.
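The encoding of F into F* can be sketched as below. Two details are assumptions made for illustration, as the text does not fix them: each 20-day window is standardised on its own, and the 4 features are flattened day by day (open, close, high, low of day 1, then day 2, and so on). Only complete windows are emitted, so the sketch yields n − 19 columns rather than n.

```python
import math

def encode_windows(F, window=20):
    """Encode a 4 x n price matrix as columns of 4 * window standardised values.

    Returns a list of columns, one per complete rolling window.
    """
    n = len(F[0])
    columns = []
    for end in range(window, n + 1):
        # Flatten day by day: all four features of a day, then the next day
        flat = [row[t] for t in range(end - window, end) for row in F]
        mean = sum(flat) / len(flat)
        std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
        columns.append([(v - mean) / std if std > 0.0 else 0.0 for v in flat])
    return columns

# Toy signal with n = 3 days, encoded with a 2-day window for brevity
F_toy = [
    [1.0, 2.0, 3.0],  # open
    [2.0, 3.0, 4.0],  # close
    [3.0, 4.0, 5.0],  # high
    [0.0, 1.0, 2.0],  # low
]
encoded = encode_windows(F_toy, window=2)
```

With the default `window=20`, each column has 80 entries, matching the first dimension of F*.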

Table 3 Accuracy (%) obtained after training a 2-layer MLP on single stock data from the S&P500, using open-close-high-low price data

4.2 Technically filtered MLP

Reframing technical patterns as pre-learned cross-correlational filters, we consider for each pattern length m the 8 pattern matrices \(T_{P_m}\) defined visually in Fig. 4. Each such formation, of form \(4 \times m\), is stacked along the depth dimension, producing a \(4 \times m \times 8\) tensor T whose inner product with standardised windows of the raw price signal F yields a new \(8 \times n\) input matrix \(F_T\),

$$\begin{aligned} F_T = \big \langle T,F\big \rangle . \end{aligned}$$
(6)

This new input is the result of cross-correlating the raw price signal F with the technical analysis filter tensor T, and can be interpreted as the feature map generated by technical analysis. We now use \(F_T\) as the input to the same MLP as before and look for improvements in model forecasts. The results we find are consistent with Sect. 3: using technical analysis for feature extraction hinders the classifier, slightly degrading model performance (out-of-sample accuracy of 49.8% after 100 epochs using the 1-day patterns, Table 4).
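Written out element-wise, and assuming that the window matched at timestep n is the standardised \(4\times m\) window \(F_n\) of Sect. 3.1 (the alignment of the window is an assumption made here for concreteness), Eq. 6 reads:

$$\begin{aligned} (F_T)_{k,n} = \sum _{i=1}^{4}\sum _{j=1}^{m} T_{i,j,k}\,(F_{n})_{i,j}, \qquad k = 1,\dots ,8 \end{aligned}$$

Each row k of \(F_T\) is therefore the unnormalised counterpart of the similarity series \(S_P\) of Eq. 4 for the k-th pattern.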

Table 4 Accuracy (%) obtained after training a technically filtered MLP (filter length \(m = 1\)) on single stock data from the S&P500, using open-close-high-low price data

4.3 Convolutional neural network

Table 5 Details of the architecture for a CNN scanning patterns of length m
Table 6 Accuracy (%) obtained In-Sample (IS) and Out-of-Sample (OoS) after training a deep neural network with a single convolutional layer learning 1-day, 2-day and 3-day patterns

We now deepen the neural network by adding a single convolutional layer with 8 filters (so chosen to match the number of technical filters at each timescale, per Fig. 4) to our earlier MLP (architecture detailed in Table 5). Separate experiments are run for convolutional filters of size 4, 8 and 12, corresponding to scanning for 1-day, 2-day and 3-day patterns. Their performance is reported in Table 6. The CNN finds much greater structure in its training data than the MLP could, and generalises better. Accounting for the size of the test set (n = 1,332,395), the leap from the MLP’s out-of-sample accuracy of 50.6% to the 1-day CNN’s out-of-sample accuracy of 51.3% is considerable.
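To give a sense of scale, the back-of-envelope sketch below expresses both accuracies as a number of standard errors above a 50% chance rate. It assumes independent test observations and a balanced null, neither of which holds exactly for daily returns of index constituents, so the figures are indicative only; the formal significance tests are those of Sect. 4.4.

```python
import math

def z_against_chance(accuracy, n, p0=0.5):
    """Standard errors separating an observed accuracy from chance rate p0."""
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    return (accuracy - p0) / standard_error

n_test = 1_332_395  # test set size quoted above

# Standard error of an accuracy estimate under the null, in percentage points
se_percentage_points = 100.0 * math.sqrt(0.25 / n_test)

z_mlp = z_against_chance(0.506, n_test)
z_cnn = z_against_chance(0.513, n_test)
```

Under these assumptions the standard error is roughly 0.04 percentage points, so a gap of 0.7 percentage points between the two models spans many standard errors.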

4.4 Model evaluation

To investigate whether the predictive performance of the neural network classifiers is not merely considerable but statistically significant, we derive the area under the curve (AUC) of each model’s receiver operating characteristic curve (ROC), and exploit an equivalence between the AUC and Mann–Whitney–Wilcoxon test statistic U [22]:

$$\begin{aligned} \mathrm{AUC} = \frac{U}{n_P n_N} \end{aligned}$$
(7)

where \(n_P\) and \(n_N\) are the number of positive and negative returns in the test set, respectively. In our binary classification setting, the Mann–Whitney–Wilcoxon test evaluates the null hypothesis that a randomly selected value from one sample (e.g. the subset of test data classified as positive next-day returns) is equally likely to be less than or greater than a randomly selected value from the complement sample (the remaining test data, classified as negative next-day returns). Informally, we are testing the null hypothesis that our models have classified at random. The test statistic U is approximately Gaussian for our sample size, so we compute each model’s standardised Z-score and look for extreme values that would violate this null hypothesis.

$$\begin{aligned} Z = \frac{U - \mu _U}{\sigma _U} \end{aligned}$$
(8)

where:

$$\begin{aligned} \mu _U = \frac{n_P n_N}{2} \end{aligned}$$
(9)

and

$$\begin{aligned} \sigma _U = \sqrt{\frac{n_P n_N (n_P + n_N + 1)}{12}} \end{aligned}$$
(10)
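Eqs. 7 to 10 can be sketched as follows, computing U from the rank sum of the positive class. The sketch assumes that `scores` holds the model's predicted probability of a positive return and `labels` the realised direction (1 for positive, 0 for negative); tied scores receive their average rank, and no tie correction is applied to \(\sigma _U\), in line with Eq. 10.

```python
import math

def mann_whitney_auc(scores, labels):
    """Return (AUC, U, Z) following Eqs. 7 to 10."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_p = sum(1 for y in labels if y == 1)
    n_n = len(labels) - n_p
    rank_sum_positive = sum(r for r, y in zip(ranks, labels) if y == 1)

    u = rank_sum_positive - n_p * (n_p + 1) / 2.0
    auc = u / (n_p * n_n)                                   # Eq. 7
    mu_u = n_p * n_n / 2.0                                  # Eq. 9
    sigma_u = math.sqrt(n_p * n_n * (n_p + n_n + 1) / 12.0)  # Eq. 10
    z = (u - mu_u) / sigma_u                                # Eq. 8
    return auc, u, z
```

A classifier that ranks every positive above every negative attains an AUC of 1, whereas one whose ranking is unrelated to the labels gives an AUC near 0.5 and a Z-score near zero.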

We benchmark our CNNs against traditional linear models popular in finance (AR(1) and AR(5) models), a buy-and-hold strategy and a range of machine learning alternatives detailed below.

4.4.1 Recurrent neural networks (RNN)

Deep learning for time series analysis has typically relied on recurrent architectures capable of learning temporal relations in the data. Long Short-Term Memory (LSTM) networks have achieved prominence for their ability to memorise patterns across vast spans of time by addressing the vanishing gradient problem. A thorough RNN architecture search [23] identified a small but persistent gap in performance between LSTMs and the recently introduced Gated Recurrent Unit (GRU) [24] on a range of synthetic and real-world data sets. Our benchmark RNNs involve a preliminary recurrent layer (LSTM and GRU, in separate experiments) of 8 neurons followed by 2 dense layers of 64 neurons with dropout, comparable in architectural complexity to the CNN models of Sect. 4.3.

4.4.2 k-Nearest Neighbours (k-NN)

We evaluate a range of nearest neighbour classifiers, labelling each day of the test set with the most frequently observed class label (positive or negative next-day return) in the k training points that were closest in Euclidean space.
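A minimal sketch of the majority-vote rule is given below, using brute-force distance computation on toy two-dimensional points. The function name and data are illustrative assumptions; an odd k avoids tied votes in the binary setting.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Label a query point with the majority class of its k nearest neighbours."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy training set: class 1 for positive next-day returns, 0 for negative
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [0, 0, 1, 1, 1]

near_origin = knn_predict(train_x, train_y, [0.2, 0.1], k=3)
near_cluster = knn_predict(train_x, train_y, [5.5, 5.5], k=3)
```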

4.4.3 Support vector machines (SVM)

SVMs have been applied to financial time series forecasting in prior literature, and achieved moderate success when the input features were not raw price data but hand-crafted arithmetic derivations like Moving Averages and MACD [25]. We report SVM performance under different kernel assumptions (linear and RBF), where the model hyperparameters (regularisation parameter C to penalise margin violations, RBF kernel coefficient \(\gamma\) to control sensitivity) were selected by cross-validation on a subset of the training data.

In their study of European financial markets, Ballings et al. [26] evaluated the classification accuracy of ensemble methods against single classifiers. Their empirical work highlighted the effectiveness of random forests in classifying stock price movements and motivates their inclusion in our list of benchmarks, under varying assumptions for the number of trees hyperparameter n.

Table 7 Benchmark performance across a range of models trained on S&P500 technical data for January 1994–December 2004 and tested on January 2005–December 2015

4.4.4 Benchmark findings

Table 7 provides the AUC, Z-score and significance of each model, where significance measures the area of the distribution below Z. We disregard significance for negative Z-scores (as is the case for the technically filtered neural network) as they imply classifiers that performed (significantly) worse than random chance. The results underscore the scale of the challenge for pattern recognition in finance: deep learning achieved the best results by a significant margin, and most alternative methods yielded accuracies that were not statistically distinguishable from guesswork (Footnote 6). Convolution also outperforms recurrence in our experiments, suggesting that a 20-day window may be sufficient to capture temporal dependencies in markets.

4.5 Methodological extensions to the ConvNet framework

Learning neural network filter specifications via convolution yields a significant boost to predictive prowess over the baseline model of Sect. 4.1 and technically filtered variant of Sect. 4.2. The CNNs’ outperformance of autoregressive and machine learning techniques further confirms the aptitude of convolutional feature extraction on technical data, and spurs us to target domain-specific enhancements to our deep learning models.

4.5.1 Confidence thresholding

In contrast to mission-critical application domains like autonomous navigation, finance does not require an algorithmic agent to be accurate at all times. It is acceptable (and factoring in friction costs, preferable) for a model to be sparse in making decisions, only generating ‘high conviction’ calls, if this results in greater accuracy. Furthermore, and unlike several other common classifiers in machine learning like SVMs or Nearest Neighbours, the output values in the final layer of the CNN can be assigned a probabilistic interpretation, enabling a filtered, nuanced approach to classification. We replicate this by adding a confidence threshold \(\alpha\) to the classification output of the final softmax layer of Table 5: test points where neither class is assigned a probability greater than \(\alpha\) are deemed uncertain, and disregarded by the thresholded convolutional neural network (TCNN). For each model (1-, 2- and 3-day TCNN), the confidence threshold \(\alpha\) is tuned through fivefold cross-validation. Accuracy as a function of confidence threshold \(\alpha\) is presented in Fig. 7, and demonstrates in all 3 cases that a substantial increase in model prowess can be achieved by thresholding the softmax output to only consider class assignments with high certainty. We also highlight the \(\alpha\) threshold which retains the top centile of test outputs, corresponding to the model’s most confident assignments. These thresholds vary by model (54.2%, 54.1% and 55.3% for the 1-, 2- and 3-day TCNNs, respectively), but in each case form a reliable heuristic for balancing model confidence and sample size. A notable analogue to the study of technical analysis in Sect. 3: models searching for more elaborate multi-day patterns tend to underperform the single-day TCNN.
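The thresholding rule can be sketched as follows. The logits are toy values, the class order (index 0 for negative, index 1 for positive returns) is an assumption, and the function names are hypothetical; the sketch only illustrates how uncertain test points are discarded.

```python
import math

def softmax(logits):
    """Convert final-layer outputs to class probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def thresholded_predictions(probabilities, alpha):
    """Keep (index, class) only where some class probability exceeds alpha."""
    kept = []
    for i, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence > alpha:
            kept.append((i, probs.index(confidence)))
    return kept

def coverage(kept, n_points):
    """Share of test points on which the thresholded model takes a decision."""
    return len(kept) / n_points

# Toy final-layer outputs for three test points (illustrative only)
probabilities = [softmax([2.0, 0.0]), softmax([0.1, 0.0]), softmax([0.0, 3.0])]
calls = thresholded_predictions(probabilities, alpha=0.6)
share_retained = coverage(calls, len(probabilities))
```

Raising `alpha` lowers coverage, which is the trade-off between model confidence and sample size discussed above.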

Fig. 7 Model accuracy as a function of softmax threshold \(\alpha\). For each model, we indicate by a cross the threshold level that retains the 1% of test data for which the model’s output probabilities imply the highest confidence

4.5.2 Ensembling TCNNs

An effective technique in image processing involves homogeneous ensembling of multiple copies of the same CNN architecture, averaging across the class assignments of the constituent models [15, 27]. Combining this probabilistic interpretation of the softmax layer with model averaging, we construct a heterogeneous ensemble out of our 1-day, 2-day and 3-day TCNNs. The ensemble benefits from learning patterns manifesting at different timescales, and achieves a higher accuracy (57.5%) on its top-confidence centile than any of the individual learners (56.7%, 56.3% and 55.9% for the 1-day, 2-day and 3-day TCNN, respectively, Fig. 7).
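One simple way to realise such an ensemble is sketched below, assuming an unweighted mean of the softmax probabilities of the constituent models followed by the same confidence threshold; the exact combination rule is an assumption of this sketch, and the probabilities are toy values.

```python
def ensemble_average(model_outputs):
    """Average class probabilities across models for each test point.

    model_outputs[m][i] is the probability vector of model m for test point i.
    """
    n_models = len(model_outputs)
    return [
        [sum(class_probs) / n_models for class_probs in zip(*point)]
        for point in zip(*model_outputs)
    ]

def confident_calls(probabilities, alpha):
    """Keep (index, class) where the averaged winning probability exceeds alpha."""
    return [
        (i, probs.index(max(probs)))
        for i, probs in enumerate(probabilities)
        if max(probs) > alpha
    ]

# Toy outputs of three models on two test points: [P(negative), P(positive)]
outputs_1_day = [[0.4, 0.6], [0.5, 0.5]]
outputs_2_day = [[0.2, 0.8], [0.6, 0.4]]
outputs_3_day = [[0.3, 0.7], [0.4, 0.6]]

averaged = ensemble_average([outputs_1_day, outputs_2_day, outputs_3_day])
ensemble_calls = confident_calls(averaged, alpha=0.6)
```

Averaging tends to retain only those points on which the constituent models agree, since disagreement pulls the mean probability back towards 0.5.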

Performance metrics of both the TCNNs and TCNN ensemble are provided in Table 8. While the Z-scores of the TCNN models are lower than those of unthresholded CNN models, this is primarily the consequence of sample size on statistical significance tests—AUC improves markedly under thresholding.

Fig. 8 Activity level of the various TCNN models through an 11-year period. As we retain the top centile from the 1,408,679 test points, each model is generating 14,087 trading decisions over 2868 business days, or on average 4.91 trades per day. Though the model is active throughout the window, discernible spikes in activity occur around major events, most notably the US debt ceiling crisis in August 2011

Fig. 9 Cumulative profit (as a multiple of starting wealth, per Table 7) generated by the various TCNN models between January 2005 and December 2015, in the absence of friction costs. The models are steadily profitable, with occasional spikes related to major events. Drawdowns are infrequent and of limited scale

Table 8 Performance comparison between MLP, CNN and TCNN models trained on S&P500 technical data for January–December 1994 and tested on January 2005–December 2015. Precision and recall are computed as weighted averages across both classes

4.6 Practical implementation

Through thresholding, we enforce sparsity in the model’s decision-making. In a real-world deployment, infrequent activity keeps friction costs low—a desirable outcome for trading algorithms. We track the activity level of the various models over time, as well as the cumulative profit they would generate over the 11-year test window. We assume the model fully captures the 1-day return associated with the top centile of its thresholded class assignments, additively for positive class predictions and subtractively for negative class predictions.
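The bookkeeping described above can be sketched as follows. This is an illustrative reading only: class 1 is assumed to denote the positive (buy) class, each retained decision is assumed to carry an integer day index, and profits are accumulated by simple summation of signed returns, which may differ from the wealth convention used for the reported figures.

```python
import numpy as np

def signed_returns(pred_class, next_day_return):
    """Add the 1-day return for positive-class calls, subtract it otherwise."""
    direction = np.where(np.asarray(pred_class) == 1, 1.0, -1.0)
    return direction * np.asarray(next_day_return, dtype=float)

def daily_activity_and_pnl(day_index, pred_class, next_day_return, n_days):
    """Per-day trade counts, per-day summed signed returns, and running total.

    day_index: integer day (0 .. n_days - 1) on which each decision is made.
    """
    day_index = np.asarray(day_index)
    pnl = signed_returns(pred_class, next_day_return)
    trades_per_day = np.bincount(day_index, minlength=n_days)
    pnl_per_day = np.bincount(day_index, weights=pnl, minlength=n_days)
    return trades_per_day, pnl_per_day, np.cumsum(pnl_per_day)
```

The first returned array corresponds to an activity series and the last to a cumulative profit curve. Days without any retained decision simply contribute zero to both.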

The models are heavily skewed towards buying activity, with accurately timed spikes centred around major world events (Fig. 8). The 2 largest single-day buy orders occurred on 9 August 2011 (328 buys), at the tail end of the US debt ceiling crisis which caused the S&P500 to drop 20% in 2 weeks, and on 24 August 2015 (241 buys), following a flash crash in which US markets erased 12% of their value before recovering. The largest sell volume occurred on 22 September 2008 (31 sells), a full week after the collapse of Lehman Brothers. This coincides with market-wide relief over Nomura’s decision to buy Lehman’s operations—and presented the last opportunity to sell before the nosedive of the Great Financial Crisis in late 2008. Despite having no information about world news in their technical data set, the models were capable of both inferring crucial moments in history, and timing trading decisions around them.

Figure 9 presents the model’s profitability over time to highlight the relative steadiness of convolution’s performance in identifying stock market patterns, when the decisions are generated by TCNNs and their ensemble. Table 9 translates this performance into compounded annual returns and Sharpe ratios under various assumptions for friction. Even in the absence of tight execution (average trading cost of 0.25% from the mid-market price), the models remain highly profitable. This sensitivity analysis does nevertheless highlight the importance of good execution in any real-world deployment of algorithmic trading: the TCNN ensemble can only just break even if the per-transaction cost rises to 0.35%.
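To make the friction sensitivity concrete, the sketch below shows one conventional way of computing the quantities involved. It rests on several assumptions not stated in the text: the cost is deducted once per trade from its signed return, the Sharpe ratio uses a zero risk-free rate and 252 trading periods per year, and CAGR is derived from a final wealth multiple. All function names are hypothetical.

```python
import numpy as np

def net_trade_returns(signed_returns, cost):
    """Deduct a fixed per-transaction cost (e.g. 0.0025 for 0.25%) per trade."""
    return np.asarray(signed_returns, dtype=float) - cost

def cagr(final_multiple, years):
    """Compound annual growth rate from a final multiple of starting wealth."""
    return final_multiple ** (1.0 / years) - 1.0

def annualised_sharpe(daily_returns, periods_per_year=252):
    """Mean over standard deviation of daily returns, annualised.
    Assumes a zero risk-free rate."""
    daily = np.asarray(daily_returns, dtype=float)
    sd = daily.std(ddof=1)
    if sd == 0:
        return float("nan")
    return float(daily.mean() / sd * np.sqrt(periods_per_year))

def break_even_cost(signed_returns):
    """Per-trade cost at which the average net return per trade is zero."""
    return float(np.mean(signed_returns))
```

Under these assumptions the break-even cost is simply the mean gross return per trade, which is why a strategy with a small average edge per trade is so sensitive to execution quality.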

Table 9 Compound Annual Growth Rate (CAGR, in %) and Sharpe ratio of the TCNN models under various assumptions for the cost of trading

4.7 Interpretable feature extraction

The convolutional filters learned by the network provide a basis for feature extraction. In particular, the convolutional layer’s filters define patches whose cross-correlation with the original input data was informative in minimising both in-sample and out-of-sample categorical cross-entropy. We produce a mosaic of these filters as Hinton diagrams (Fig. 10; see footnote 7) and visualise them in the language of technical analysis as candlestick patterns (Figs. 11 and 12), cross-correlational templates whose occurrence is informative for financial time series forecasting. Unlike technical patterns, however, these templates have no set meaning: the purpose of individual neurons in a convolutional layer is not readily interpretable.
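As a rough illustration of how a learned filter can be read as a candlestick sequence, the sketch below converts each column of a filter into one candle and flags columns whose open or close falls outside the high-low range. The row order (open, close, low, high), the 4 × n_days filter shape and the function name are assumptions made for the example.

```python
import numpy as np

ROWS = ("open", "close", "low", "high")  # assumed row order of the filter

def filter_to_candles(weights):
    """Read each column of a (4, n_days) filter as one candlestick.

    A column is 'valid' when its open and close both lie within [low, high].
    Otherwise it only describes a patch of a candlestick.
    """
    w = np.asarray(weights, dtype=float)
    if w.ndim != 2 or w.shape[0] != len(ROWS):
        raise ValueError("expected a filter of shape (4, n_days)")
    candles = []
    for column in w.T:  # one column per day
        o, c, lo, hi = (float(v) for v in column)
        candles.append({
            "open": o, "close": c, "low": lo, "high": hi,
            "bullish": c > o,  # close above open
            "valid": lo <= min(o, c) and max(o, c) <= hi,
        })
    return candles
```

Each valid candle can then be drawn in the usual way, with the body spanning open to close and the wicks extending to low and high. Invalid columns need a partial rendering.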

Fig. 10 Weight-space visualisation as Hinton diagrams for the 24 cross-correlational filters learned from the first layer of each CNN (8 per constituent model)

Fig. 11 Hinton diagram of the sixth cross-correlational filter learned in the first layer of the 3-day CNN. The relative values of the standardised open, close, low and high for each column in the filter define, in a chartist sense, a specific candlestick sequence (or patch thereof, in instances where the filter’s open or close is incompatible with the high-low range) which the neural network extracted as informative for time series forecasting

Fig. 12 Candlestick pattern translation of the cross-correlational filter mosaic for the 3-day CNN

5 Conclusion

Our results present, to our knowledge, the first rigorous statistical evaluation of candlestick patterns in time series analysis, using normalised signal cross-correlation to identify pattern matches. We find little evidence of predictive prowess in any of the standard chartist pictograms, and suspect that the enduring quality of such practices owes much to their subjective and hitherto unverified nature. Nevertheless, it is not inconceivable that price history might contain predictive information, and much of quantitative finance practice relies on elements of technical pattern recognition (e.g. momentum-tracking) for its success. Through a deep learning lens, technical analysis is merely an arbitrary and incorrect specification of the feature-extractive early layers of a neural network. Within relatively shallow architectures, learning more effective filters from data improves accuracy significantly while also providing an interpretable replacement for chartism’s visual aids. The simplicity of our architecture showcases the potential for deep learning to supplant technical analysis: we do not expect shallow convolution to be optimal. In the context of computer vision, accuracy improved significantly through the expansion of model depth (see footnote 8). We hypothesise that a thorough neural architecture search will yield similar incremental gains in our space. Hybrid neural architectures, including recurrent layers capable of learning long-term dependencies between the patterns identified by convolution, may further enhance the potential for deep learning in finance. Thresholding and deep ensembling of such models would form a robust framework for systematic decision-making in financial markets.