Thresholded ConvNet Ensembles: Neural Networks for Technical Forecasting

Much of modern practice in financial forecasting relies on technicals, an umbrella term for several heuristics applying visual pattern recognition to price charts. Despite its ubiquity in financial media, the reliability of its signals remains a contentious and highly subjective form of 'domain knowledge'. We investigate the predictive value of patterns in financial time series, applying machine learning and signal processing techniques to 22 years of US equity data. By reframing technical analysis as a poorly specified, arbitrarily preset feature-extractive layer in a deep neural network, we show that better convolutional filters can be learned directly from the data, and provide visual representations of the features being identified. We find that an ensemble of shallow, thresholded CNNs optimised over different resolutions achieves state-of-the-art performance on this domain, outperforming technical methods while retaining some of their interpretability.


INTRODUCTION
In financial media, extensive attention is given to the study of charts and visual patterns. Known as technical analysis or chartism, this form of financial analysis relies solely on historical price and volume data to produce forecasts, on the assumption that specific graphical patterns hold predictive information for future asset price fluctuations (Blume et al, 1994). Early research into genetic algorithms devised solely from technical data (as opposed to e.g. fundamentals or sentiment analysis) showed promising results, sustaining the view that there could be substance to the practice (Neely et al, 1997; Allen and Karjalainen, 1999). Research in finance has typically restricted itself to the time series of closing prices and the visuals emerging from line charts (Lo et al, 2000), relying on kernel regression to smooth out the price process and enable pattern recognition.
An equally common visual representation of price history in finance is the candlestick. Candlesticks encode opening price, closing price, maximum price and minimum price over a discrete time interval, visually represented by a vertical bar with lines extending on either end. Much as with line charts, technical analysts believe that specific sequences of candlesticks reliably foreshadow impending price movements. A wide array of such patterns are commonly watched for (Taylor and Allen, 1992), each with their own pictogram and associated colourful name ('inverted hammer', 'abandoned baby', etc).
Recent research on candlestick chartism has debunked the validity of several highly-cited patterns (Ghoshal and Roberts, 2017). Drawing on a modern intuition for pattern recognition in vision and language (Bengio, 2009), candlestick patterns are reframed as a form of feature engineering intended by chartists to extract salient features, facilitating the classification of future returns with higher fidelity than the raw price process would otherwise allow. Filters learned through convolution show promise, and serve as the blueprint for an interpretable application of deep learning to the financial domain.
After defining a format for cross-correlating time series data with chartist filters (Section 2.1-2.3), we undertake a thorough statistical assessment of the predictive prowess of 1-day, 2-day and 3-day candlestick formations (Section 2.4). Feeding candlestick data through a neural network involving separate filters for each technical pattern, we classify next-day returns with the filters implied by chartist doctrine (Section 3.1-3.2) and set this cross-correlational approach as a baseline to improve upon (Romaszko, 2015). We then compare the model's accuracy when filters are not preset but instead learned by thresholded convolutional neural networks (TCNNs) during their training phase (Section 3.3). Finally we assess the significance of our findings (Section 3.4), and benchmark deep learning in finance against alternative methods (Section 3.5).
The contributions of this paper are threefold. Firstly, we rigorously evaluate the practice of candlestick chartism, and find little evidence to support it. We agree with Lo et al (2000) that the distribution of future returns conditioned on observing technical patterns diverges significantly from the unconditional distribution, but upon close inspection the resulting classifier barely outperforms guesswork. Secondly, we show that filters learned and tested on 22 years of S&P500 price data in a CNN architecture can yield modest gains in accuracy over technical methods. Thirdly, we demonstrate that considerable gains in forecasting capability are achievable through ensemble methods and confidence thresholds.

EVALUATING TECHNICAL ANALYSIS

2.1 Definition of Candlestick Data
Both the financial time series data and the candlestick technical filters used by chartists take the same form. Asset price data for a discrete time interval is represented by four features: the opening price (price at the start of the interval), closing price (price at the end of the interval), high price (maximum over the interval) and low price (minimum over the interval). The candlestick visually encodes this information (Fig. 1): the bar's extremities denote the open and close prices, and the lines protruding from the bar (the candle's 'wicks' or shadow) denote the extrema over the interval. The colour of the bar determines the relative ordering of the open and close prices: a white bar denotes a positive return over the interval (close price > open price) and a black or shaded bar denotes a negative return (close price < open price). We can therefore summarise the candlestick representation of a financial time series of length n timesteps as a 4 × n price signal matrix F capturing its four features.
Throughout this paper we rely on daily market data, but the methods can be extended to high-frequency pattern recognition using tick data and full order books.
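As a minimal sketch of this representation, the snippet below collapses each interval's ordered trade prices into a candlestick and stacks the result into the 4 × n matrix F. The function names (`candle`, `price_matrix`, `colour`), the row order (open, close, high, low) and the toy prices are illustrative assumptions, not taken from the original implementation.

```python
def candle(prices):
    """Collapse one interval's ordered trade prices into (open, close, high, low)."""
    return (prices[0], prices[-1], max(prices), min(prices))


def price_matrix(intervals):
    """Build the 4 x n price signal matrix F as four rows of length n.

    Row order (open, close, high, low) is an assumption of this sketch.
    """
    candles = [candle(p) for p in intervals]
    return [list(row) for row in zip(*candles)]


def colour(open_price, close_price):
    """White bar for a positive return over the interval, black for a negative one."""
    if close_price > open_price:
        return "white"
    if close_price < open_price:
        return "black"
    return "flat"  # open == close: the text does not assign a colour


# Two toy trading days, each given as an ordered list of traded prices.
days = [[10.0, 10.4, 9.8, 10.2], [10.2, 10.3, 9.5, 9.7]]
F = price_matrix(days)
colours = [colour(o, c) for o, c in zip(F[0], F[1])]
```

Here the first day closes above its open (a white bar) and the second closes below it (a black bar).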

2.2 Definitions of Technical Patterns
We undertake a comprehensive review of all the major candlestick patterns cited by practitioners of technical analysis at multiple timescales. The simple 1-day patterns include the hammer (normal and inverted), hanging man, shooting star, dragonfly doji, gravestone doji, and spinning tops (bullish and bearish). Our 2-day patterns cover the engulfing (bullish and bearish), harami (bullish and bearish), piercing line, cloud cover, tweezer bottom and tweezer top. Finally our 3-day patterns cover some of the most cited cases in chartist practice: the abandoned baby (bullish and bearish), morning star, evening star, three white soldiers, three black crows, three inside up and three inside down. Fig. 2 provides both the visual template associated with each pattern, as well as the future price direction it is meant to presage. As before, we summarise a technical pattern P of length m timesteps as a 4 × m matrix $T^P_m$, standardised for comparability to have zero mean and unit variance.

2.3 Identification by Template Matching
Matrix representations for both the template $T^P_m$ and equal-length, standardised rolling windows $F_n$ of the full price signal F at timestep n can be cross-correlated together to generate a time series $S^P$ measuring the degree of similarity between the price signal and the filter. For a given pattern P, at each timestep n:

$$S^P_n = \frac{\langle T^P_m, F_n \rangle}{\lVert T^P_m \rVert \, \lVert F_n \rVert}$$

where $\langle \cdot, \cdot \rangle$ is the inner product of the two matrices and $\lVert \cdot \rVert$ is the $L_2$ norm. Our algorithm extracts the top centile of similarity scores $S^P$ as pattern matches and produces a distribution of next-day returns conditional on matching pattern P.
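The matching procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the original code: each window is standardised jointly across its 4 × m entries, only complete windows are scored, and the helper names and toy data are hypothetical.

```python
import math


def standardise(matrix):
    """Rescale all entries of a 4 x m matrix to zero mean and unit variance."""
    flat = [x for row in matrix for x in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat))
    if std == 0:
        return [[0.0 for _ in row] for row in matrix]  # flat window: no shape
    return [[(x - mean) / std for x in row] for row in matrix]


def similarity(template, window):
    """Inner product of two equal-shape matrices divided by their L2 norms."""
    t = [x for row in template for x in row]
    w = [x for row in window for x in row]
    dot = sum(a * b for a, b in zip(t, w))
    norm = math.sqrt(sum(a * a for a in t)) * math.sqrt(sum(b * b for b in w))
    return dot / norm if norm else 0.0


def similarity_series(F, template):
    """Score every complete length-m window of the 4 x n signal F."""
    m, n = len(template[0]), len(F[0])
    T = standardise(template)
    scores = []
    for end in range(m, n + 1):
        window = standardise([row[end - m:end] for row in F])
        scores.append(similarity(T, window))
    return scores


def top_centile_matches(scores):
    """Indices of the highest-scoring 1% of windows (at least one)."""
    k = max(1, len(scores) // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-day template and a 5-day signal whose days 2-3 equal 2 * template + 5.
template = [[1, 2], [2, 1], [3, 3], [0, 0]]
F = [[1, 1, 7, 9, 3], [1, 2, 9, 7, 3], [2, 2, 11, 11, 4], [0, 1, 5, 5, 2]]
scores = similarity_series(F, template)
matches = top_centile_matches(scores)
```

Because both operands are standardised, a window that is a positively scaled and shifted copy of the template scores 1, which is why the third window is the sole match in this toy example.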

2.4 Evaluating Technical Analysis
We run several diagnostics to assess separately the informativeness and predictive prowess of each technical pattern.
2.4.1 Empirical Data. Throughout our work, we use technical (i.e. open, close, high and low price) data from the S&P500 stock market index constituents for the period Jan 1994 - Dec 2015, corresponding to n = 2,817,849 entries of financial data in the price signal F. This dataset covers a representative cross-section of US companies across a wide timeframe suitable for learning the patterns, if any, of both expansionary and recessionary periods in the stock market.

2.4.2 Informativeness. We begin by comparing the top centile of conditional returns with their unconditional counterparts, with the view that conditioning on informative patterns should yield significantly different distributions. Denoting by $\{R^P_t\}_{t=1}^{n_1}$ the subset of returns conditioned on matching pattern P and $\{R_t\}_{t=1}^{n_2}$ the full set of unconditional returns, we compute their empirical cumulative distribution functions $F_1(z)$ and $F_2(z)$. The two-sample Kolmogorov-Smirnov (K-S) test evaluates the null hypothesis that the distributions generating both samples have identical cdfs, by computing the K-S statistic:

$$\gamma = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \; \sup_{z} \left| F_1(z) - F_2(z) \right|$$

The limiting distribution of γ provides percentile thresholds above which we reject the null hypothesis. When this occurs, we infer that conditioning on the pattern does materially alter the future returns distribution. As an example of this approach, we provide the empirical cdfs of both unconditional returns and returns conditioned on the pattern 'Three Black Crows' (Fig. 3).

Figs. 4-5 (caption fragments): Surprisingly, several single-day patterns do in fact correlate with abnormal next-day returns. Almost all of the multi-day patterns exhibit notches that overlap with the unconditional distribution's, implying that the distribution medians are not meaningfully changed by conditioning. Only 'Bearish Engulfing' and 'Three Black Crows' seem to be significant, as harbingers of better times, despite their names.
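For concreteness, the scaled two-sample statistic can be computed directly from the two empirical cdfs. The sketch below uses only the standard library; the function name `ks_statistic` is an illustrative choice, and the scaling by $\sqrt{n_1 n_2 / (n_1 + n_2)}$ is assumed so that the result is comparable with the 1.95 critical value quoted in the caption of Table 1.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_1, sample_2):
    """Scaled two-sample K-S statistic: sqrt(n1*n2/(n1+n2)) * sup|F1 - F2|."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    n1, n2 = len(s1), len(s2)
    # Both empirical cdfs are step functions, so the supremum of their
    # absolute difference is attained at one of the observed values.
    d = max(
        abs(bisect_right(s1, z) / n1 - bisect_right(s2, z) / n2)
        for z in s1 + s2
    )
    return math.sqrt(n1 * n2 / (n1 + n2)) * d


# Identical samples give 0; fully separated samples give the maximum sup of 1.
gamma_same = ks_statistic([1, 2, 3], [1, 2, 3])
gamma_apart = ks_statistic([0, 0, 0, 0], [1, 1, 1, 1])
```

In use, `sample_1` would hold the next-day returns conditioned on a pattern match and `sample_2` the unconditional returns.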

2.4.3 Predictive Prowess. Whilst these patterns may bear some information, it does not follow that their information is actionable, or even aligns with the expectations prescribed by technical analysis. Notched boxplots of both unconditional returns and returns conditioned on each of the filters (Fig. 4) allow us to gauge whether the pattern's occurrence does in fact yield significant returns in the intended direction.
A closer examination suggests several of the 1-day patterns are in fact relevant, but that the more elaborate 2-day and 3-day formations are not. Conditioning on 14 of the 16 multi-day patterns produces no significant alteration in the median of next-day returns distributions (Fig. 5): only the 'Bearish Engulfing' and 'Three Black Crows' patterns produce a conditional distribution for which the 95% confidence interval of the median (denoted by the notch) differs markedly from its unconditional counterpart.

2.4.4 Results. We report the empirical results of the K-S goodness-of-fit tests and the top-centile conditional distribution summary statistics (Table 1), using daily stock data from the S&P500. Though several of the patterns do indeed bear information altering the distribution of future returns, their occurrence is neither a reliable predictor of price movements (high standard deviation relative to the mean) nor even, in many instances, an accurate classifier of direction. Elaborate multi-day patterns systematically perform worse than their single-day counterparts. Surprisingly, 6 of the 8 single-day patterns do in fact produce meaningful deviations from the unconditional baseline, with the dragonfly and gravestone doji standing out as significant outliers (-25.81 bp and +22.41 bp respectively when conditioning on the top centile of similarity score, Table 1). But even in those instances, technical analysis gets the direction wrong: the deviations are the polar opposite of what chartist doctrine would imply. Conceptually, the notion of using filters in financial data to extract informative feature maps may bear merit, but the chartist filter layer is demonstrably an improper specification.

FEATURE ENGINEERING IN FINANCE
The approach of searching for informative intermediate feature maps in classification problems has seen widespread success in domains ranging from acoustic signal processing to computer vision (Krizhevsky et al, 2012). Where technical analysis uses filters that are arbitrarily pictographic in nature, we learn layers for feature extraction from data. We begin by splitting our S&P500 time series data into training and test sets corresponding to stock prices from 1994-2004 and 2005-2015 respectively. We evaluate the performance of passing the raw data both with and without chartist filters, and subsequently measure the incremental gain from learning optimal feature maps by convolution. The findings are then benchmarked against widely recognised approaches to time series forecasting including recurrent neural networks, nearest neighbour classifiers, support vector machines (SVM) and random forests.

Table 1: Summary statistics for the next-day return distributions conditioned on matching technical patterns. A match on pattern P is deemed to have occurred when the cross-correlational similarity score $S^P$ is in its top centile. K-S statistics γ above 1.95 are significant at the 0.001 level. Mean return µ for each pattern is expressed as a difference from the unconditional baseline. The incremental mean returns are dwarfed by their standard deviation, and do not even always move in the direction prescribed by chartism.

3.1 Multi-Layer Perceptron
To address issues of scale and stationarity, we process the original 4 × n price signal matrix F into a new 80 × n price signal matrix $F^*$ where each column is a standardised encoding of 20 business days of price data. This encoding provides 4 weeks of price history, a context or 'image' within which neural network filters can scan for the occurrence of patterns and track their temporal evolution. We pass $F^*$ through a multilayer perceptron (MLP) involving fully-connected hidden layers. Preliminary cross-validation experiments with financial time series determined the network topology required for the model to learn from its training data. Insufficient height (neurons per hidden layer) and depth (number of hidden layers) led to models incapable of learning their training data. We settled on 2 fully-connected layers of 64 neurons with ReLU activation functions, followed by a softmax output layer to classify positive and negative returns. Early stopping during the cross-validation phase determined the length of each experiment: 50 to 100 epochs were optimal to avoid the risk of overfitting. Further regularisation was achieved via the inclusion of dropout (Srivastava, 2014) in the fully-connected layers of the network, limiting the model's propensity towards excessive co-adaptation across layers. A heavily-regularised (dropout = 0.5) 2-layer MLP is already able to identify some structure in its data (out-of-sample accuracy of 50.6% after 100 epochs, Table 3).
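The 20-day encoding can be illustrated as follows. Several details are assumptions of this sketch rather than statements from the text: days without a full 20-day look-back are skipped, each window is standardised jointly across its 80 entries, and the entries are laid out day by day (open, close, high, low for the oldest day first), so that 4 consecutive entries describe one day. The function name `encode` and the synthetic prices are hypothetical.

```python
import math

WINDOW = 20  # business days per encoded column, as in the text


def encode(F, window=WINDOW):
    """Return the columns of F*: one standardised 4*window vector per day.

    Assumptions of this sketch: incomplete windows are skipped, and each
    window is standardised jointly across all of its 4*window entries.
    """
    n = len(F[0])
    columns = []
    for end in range(window, n + 1):
        # Day-major layout: (open, close, high, low) of each day in turn.
        flat = [row[t] for t in range(end - window, end) for row in F]
        mean = sum(flat) / len(flat)
        std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat))
        columns.append([(x - mean) / std if std else 0.0 for x in flat])
    return columns


# Synthetic 25-day series: rows are open, close, high, low.
F = [
    [100.0 + i for i in range(25)],
    [100.5 + i for i in range(25)],
    [101.0 + i for i in range(25)],
    [99.0 + i for i in range(25)],
]
columns = encode(F)
```

With 25 days of data the sketch yields 6 columns of 80 entries each, every one with zero mean and unit variance.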

3.2 Technically-Filtered MLP
Reframing technical patterns as pre-learned cross-correlational filters, we consider for each pattern length m the 8 pattern matrices $T^P_m$ defined visually in Fig. 2. Each such formation, of form 4 × m, is stacked along the depth dimension, producing a 4 × m × 8 tensor T whose inner product with standardised windows of the raw price signal F yields a new 8 × n input matrix $F_T$. This new input is the result of cross-correlating the raw price signal F with the technical analysis filter tensor T, and can be interpreted as the feature map generated by technical analysis. We now use $F_T$ as the input to the same MLP as before and look for improvements in model forecasts. The results we find are consistent with Section 2: using technical analysis for feature extraction hinders the classifier, slightly degrading model performance (out-of-sample accuracy of 49.8% after 100 epochs using the 1-day patterns, Table 4).

3.3 Convolutional Neural Network
We now deepen the neural network by adding a single convolutional layer with 8 filters to our earlier MLP (architecture detailed in Table 2). Separate experiments are run for convolutional filters of size 4, 8 and 12, corresponding to scanning for 1-day, 2-day and 3-day patterns. Their performance is reported in Table 5. The CNN finds much greater structure in its training data than the MLP could, and generalises better. Accounting for the size of the test set (n = 1,408,679), the leap from the MLP's out-of-sample accuracy of 50.6% to the 1-day CNN's out-of-sample accuracy of 51.3% is highly significant.
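A rough back-of-the-envelope check conveys why a gap of 0.7 percentage points matters at this sample size. The sketch below applies a two-proportion z-test, which is not the test used in this paper (Section 3.4 relies on the Mann-Whitney-Wilcoxon statistic) and which treats the two accuracies as independent binomial proportions. That independence is a simplifying assumption, since both models are scored on the same, cross-sectionally correlated test set.

```python
import math


def two_proportion_z(acc_a, acc_b, n_a, n_b):
    """z-score for the difference between two independent accuracies."""
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return (acc_b - acc_a) / se


n_test = 1_408_679  # test-set size quoted in the text
z = two_proportion_z(0.506, 0.513, n_test, n_test)
```

Under these assumptions the difference is more than ten standard errors wide, so the gap is very unlikely to be sampling noise even though it is small in absolute terms.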

3.3.1 Confidence Thresholding. In contrast to other application domains, finance does not require an algorithmic agent to be accurate at all times. It is acceptable (and factoring in friction costs, preferable) for a model to be sparse in making decisions, only generating 'high conviction' calls, if this results in greater accuracy. We replicate this by adding a confidence threshold α to the classification output of the final softmax layer of Table 2: test points where neither class is assigned a probability greater than α are deemed uncertain, and disregarded by the classifier. Accuracy as a function of confidence threshold α is presented in Fig. 6, and demonstrates in all 3 cases that a significant increase in model prowess can be achieved by thresholding the softmax output to only consider class assignments with high certainty. For each model, we also highlight the α threshold which retains the top centile of test outputs, corresponding to the model's most confident assignments. These vary by model (54.2%, 54.1% and 55.3% for the 1-, 2- and 3-day TCNNs respectively), but in each case form a reliable heuristic for balancing model confidence and sample size. A notable analogue to the study of technical analysis in Section 2: models searching for more elaborate multi-day patterns tend to underperform the single-day TCNN.

Figure 6: Model accuracy as a function of softmax threshold α. For each model, we indicate by a cross the threshold level that retains the 1% of test data for which the model's output probabilities imply the highest confidence.
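The thresholding rule is simple enough to state in code. In this sketch the model's output is summarised by the softmax probability of the positive class, so the confidence of a call is the larger of p and 1 - p; the function names and the toy probabilities are illustrative assumptions.

```python
def thresholded_accuracy(probs_up, labels, alpha):
    """Accuracy and coverage when only calls with confidence > alpha are kept.

    probs_up: softmax probability of a positive next-day return, per sample.
    labels:   1 for a positive realised return, 0 for a negative one.
    """
    kept = [(p, y) for p, y in zip(probs_up, labels) if max(p, 1 - p) > alpha]
    if not kept:
        return None, 0.0  # the model abstains on every sample
    correct = sum((p > 0.5) == (y == 1) for p, y in kept)
    return correct / len(kept), len(kept) / len(labels)


def top_fraction_accuracy(probs_up, labels, fraction=0.01):
    """Accuracy on the most confident `fraction` of samples, together with
    the confidence level of the least confident sample retained."""
    ranked = sorted(
        zip(probs_up, labels),
        key=lambda pair: max(pair[0], 1 - pair[0]),
        reverse=True,
    )
    k = max(1, int(len(ranked) * fraction))
    kept = ranked[:k]
    correct = sum((p > 0.5) == (y == 1) for p, y in kept)
    alpha = max(kept[-1][0], 1 - kept[-1][0])
    return correct / k, alpha


probs = [0.9, 0.2, 0.55, 0.45, 0.6]
labels = [1, 0, 0, 1, 1]
acc_all, cov_all = thresholded_accuracy(probs, labels, alpha=0.5)
acc_high, cov_high = thresholded_accuracy(probs, labels, alpha=0.7)
```

On this toy data, raising α from 0.5 to 0.7 discards the three low-confidence calls, two of which were wrong, which is the trade-off between accuracy and sample size that Fig. 6 traces out.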

3.3.2 Ensembling TCNNs. An effective technique in image processing involves homogeneous ensembling of multiple copies of the same CNN architecture, averaging across the class assignments of the constituent models (Krizhevsky et al, 2012; Antipova et al, 2016). Combining this probabilistic interpretation of the softmax layer with model averaging, we construct a heterogeneous ensemble out of our 1-day, 2-day and 3-day TCNNs. The ensemble benefits from learning patterns manifesting at different timescales, and achieves a higher accuracy (57.5%) on its top-confidence centile than any of the individual learners (56.7%, 56.3% and 55.9% for the 1-day, 2-day and 3-day TCNN respectively, Fig. 6).
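A minimal sketch of the averaging step follows, assuming equal weights across the three constituent models (the weighting scheme is not specified in the text) and summarising each model by its positive-class softmax probability. The names `ensemble_probs` and `ensemble_calls` are hypothetical.

```python
def ensemble_probs(model_probs):
    """Average per-sample positive-class probabilities across models.

    model_probs: one list of probabilities per model, aligned by sample.
    Equal weighting is an assumption of this sketch.
    """
    n_models = len(model_probs)
    return [sum(ps) / n_models for ps in zip(*model_probs)]


def ensemble_calls(model_probs, alpha):
    """+1 / -1 for confident buy / sell calls, 0 when the ensemble abstains."""
    calls = []
    for p in ensemble_probs(model_probs):
        if p > alpha:
            calls.append(1)
        elif 1 - p > alpha:
            calls.append(-1)
        else:
            calls.append(0)
    return calls


# Toy outputs of the 1-day, 2-day and 3-day models on three samples.
model_probs = [[0.9, 0.4, 0.2], [0.7, 0.6, 0.1], [0.8, 0.5, 0.3]]
averaged = ensemble_probs(model_probs)
calls = ensemble_calls(model_probs, alpha=0.75)
```

The ensemble issues a call only when the averaged probability clears the threshold, so disagreement between timescales (as on the second sample) naturally leads to abstention.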
3.3.3 Practical Implementation. Through thresholding, we enforce sparsity in the model's decision making. In a real-world deployment, infrequent activity keeps friction costs low, a desirable outcome for trading algorithms. We track the activity level of the various models over time, as well as the cumulative profit they would generate over the 11-year test window. We assume the model fully captures the 1-day return associated with the top centile of its thresholded class assignments, additively for positive class predictions and subtractively for negative class predictions.
The models are heavily skewed towards buying activity, with accurately-timed spikes centred around major world events (Fig. 7). The 2 largest single-day buy orders occur on the 9th of August 2011 (328 buys), at the tail end of the US debt ceiling crisis which caused the S&P500 to drop 20% in 2 weeks, and on the 24th of August 2015 (241 buys), following a flash crash in which US markets erased 12% of their value before recovering. The largest sell volume occurs on the 22nd of September 2008 (31 sells), a full week after the collapse of Lehman Brothers. This coincides with market-wide relief over Nomura's decision to buy Lehman's operations, and presented the last opportunity to sell before the nosedive of the Great Financial Crisis in late 2008. Despite having no information about world news in their technical dataset, the models were capable of both inferring crucial moments in history, and timing trading decisions around them.
Fig. 8 presents the model's profitability over time to highlight the relative steadiness of convolution's performance in identifying stock market patterns, when the decisions are generated by TCNNs and their ensemble. Table 6 translates this performance into compounded annual returns and Sharpe ratios under various assumptions for friction. Even in the absence of tight execution (average trading cost of 0.25% from the mid-market price), the models remain highly profitable. This sensitivity analysis does nevertheless highlight the importance of good execution in any real-world deployment of algorithmic trading: the TCNN ensemble can only just break even if the per-transaction cost rises to 0.35%.

Figure 8: [...] (cf. Table 6) generated by the various TCNN models between Jan-2005 and Dec-2015, in the absence of friction costs. The models are steadily profitable, with occasional spikes related to recognising major events. Drawdowns are infrequent and of limited scale.
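The profit accounting described above can be sketched as a running sum. The assumptions here are that returns are summed rather than compounded, that the friction cost is charged once per executed trade, and that calls are encoded as +1 (buy), -1 (sell) or 0 (no trade); the function names and the toy data are hypothetical.

```python
def cumulative_profit(calls, next_day_returns, cost=0.0):
    """Running sum of captured returns, net of a per-trade friction cost.

    calls: +1 (buy), -1 (sell) or 0 (no trade) per sample.
    next_day_returns: realised 1-day returns as fractions (0.01 == 1%).
    cost: friction charged once per executed trade, as a fraction.
    """
    total, path = 0.0, []
    for call, ret in zip(calls, next_day_returns):
        if call != 0:
            # Added for buys, subtracted for sells, minus the trading cost.
            total += call * ret - cost
        path.append(total)
    return path


def break_even_cost(calls, next_day_returns):
    """Per-trade cost at which the summed profit falls to zero."""
    trades = [(c, r) for c, r in zip(calls, next_day_returns) if c != 0]
    if not trades:
        return 0.0
    return sum(c * r for c, r in trades) / len(trades)


calls = [1, 0, -1, 1]
returns = [0.02, 0.05, -0.01, -0.004]
gross_path = cumulative_profit(calls, returns)
cost_star = break_even_cost(calls, returns)
net_path = cumulative_profit(calls, returns, cost=cost_star)
```

Under this additive accounting, the break-even cost is simply the mean captured return per executed trade, which is why sparse, high-accuracy calls tolerate friction better than frequent marginal ones.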

3.3.4 Interpretable Feature Extraction. The convolutional filters learned by the network provide a basis for feature extraction. In particular, the convolutional layer's filters define patches whose cross-correlation with the original input data was informative in minimising both in-sample and out-of-sample categorical cross-entropy. We produce a mosaic of these filters as Hinton diagrams (Fig. 9) and visualise them in the language of technical analysis as candlestick patterns (Fig. 10 and 11), cross-correlational templates whose occurrence is informative for financial time series forecasting. Unlike technical patterns however, these templates have no set

3.4 Significance of Model Results
To investigate whether the predictive performance of the neural network classifiers is statistically significant, we derive the area under the curve (AUC) of each model's receiver operating characteristic curve (ROC), and exploit an equivalence between the AUC and Mann-Whitney-Wilcoxon test statistic U (Mason and Graham, 2002):

$$\mathrm{AUC} = \frac{U}{n_P \, n_N}$$

where $n_P$ and $n_N$ are the number of positive and negative returns in the test set, respectively. In our binary classification setting, the Mann-Whitney-Wilcoxon test evaluates the null hypothesis that a randomly selected value from one sample (e.g., the subset of test data classified as positive next-day returns) is equally likely to be less than or greater than a randomly selected value from the complement sample (the remaining test data, classified as negative next-day returns). Informally, we are testing the null hypothesis that our models have classified at random. U is approximately Gaussian for our sample size, so we compute each model's standardised Z-score and look for extreme values that would violate this null hypothesis:

$$Z = \frac{U - \mu_U}{\sigma_U}$$

where:

$$\mu_U = \frac{n_P \, n_N}{2}$$

and

$$\sigma_U = \sqrt{\frac{n_P \, n_N \, (n_P + n_N + 1)}{12}} \tag{7}$$

Table 7 provides the AUC, Z-score and significance of each model, where significance measures the area of the distribution below Z. We disregard significance for negative Z-scores (as is the case for the technically-filtered neural network) as they imply classifiers that performed (significantly) worse than random chance. Learning neural network filter specifications via convolution yields a significant boost to predictive prowess over the baseline model of Section 3.1 and technically-filtered variant of Section 3.2. Whilst the Z-scores of the TCNN models are lower than those of unthresholded CNN models, this is primarily the consequence of sample size on statistical significance tests: AUC improves markedly under thresholding.

Figure 11: Candlestick pattern translation of the cross-correlational filter mosaic for the 3-day CNN.
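The relationship between U, the AUC and the Z-score can be made concrete with a rank-based computation. In this sketch U is obtained from the rank sum of the positive class, with average ranks assigned to tied scores; the standard deviation follows equation (7) and, like it, applies no tie correction. The function name `mann_whitney_auc` and the toy scores are illustrative assumptions.

```python
import math


def mann_whitney_auc(scores, labels):
    """Return (U, AUC, Z) for classifier scores against binary labels.

    scores: model score for the positive class (higher = more bullish).
    labels: 1 for a positive realised return, 0 for a negative one.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):  # walk through groups of tied scores
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based; ties share the mean
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_p = sum(labels)
    n_n = len(labels) - n_p
    rank_sum_p = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_p - n_p * (n_p + 1) / 2
    auc = u / (n_p * n_n)
    mu_u = n_p * n_n / 2
    sigma_u = math.sqrt(n_p * n_n * (n_p + n_n + 1) / 12)
    return u, auc, (u - mu_u) / sigma_u


u, auc, z = mann_whitney_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this four-sample example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; with so few samples the corresponding Z-score is far from significant, which mirrors the sample-size effect noted above for thresholded models.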

3.5 Performance Benchmarks
Deep learning has garnered significant attention in recent years for its ability to outperform alternative methods, setting the state-of-the-art in computer vision and speech recognition benchmarks. The lack of commonly-agreed datasets such as MNIST for digit recognition or ImageNet for image classification means finance has lacked a stable backdrop for model benchmarking. For our purposes, we propose the use of the S&P500 technicals dataset for Jan 1994 - Dec 2015 as a baseline against which to evaluate other classifiers and benchmark deep learning in finance.

3.5.2 k-Nearest Neighbours (k-NN). We evaluate a range of nearest neighbour classifiers, labelling each day of the test set with the most frequently observed class label (positive or negative next-day return) in the k training points that were closest in Euclidean space.
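The labelling rule amounts to a majority vote over the k nearest training points. The brute-force sketch below is for illustration only: the function name `knn_predict` and the two-dimensional toy data are assumptions, and an exhaustive search of this kind would be impractical on the full dataset without an index structure.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training points nearest to x (Euclidean)."""
    nearest = sorted(
        range(len(train_X)), key=lambda i: math.dist(train_X[i], x)
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Toy training set: label 0 = negative next-day return, 1 = positive.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
train_y = [0, 0, 0, 1, 1]
pred_near_origin = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
pred_far = knn_predict(train_X, train_y, [5.5, 5.0], k=3)
```

Choosing an odd k avoids tied votes in the binary setting considered here.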
3.5.3 Support Vector Machines (SVM). SVMs have been applied to financial time series forecasting in prior literature, and achieved moderate success when the input features were not raw price data but hand-crafted arithmetic derivations thereof called technical indicators (Kim, 2003). We report SVM performance under different kernel assumptions (linear and RBF), where the model hyperparameters (regularisation parameter C to penalise margin violations, RBF kernel coefficient γ to control sensitivity) were selected by cross-validation on a subset of the training data.

3.5.4 Random Forests (n-RF). In their study of European financial markets, Ballings et al (2015) evaluated the classification accuracy of ensemble methods against single classifiers. Their empirical work highlighted the effectiveness of random forests in classifying stock price movements and motivates their inclusion in our list of benchmarks, under varying assumptions for the number of trees hyperparameter n.
The results in Table 8 underscore the scale of the challenge for pattern recognition in finance: deep learning achieved the best results by a significant margin, and most alternative methods yielded accuracies that were not statistically distinguishable from guesswork. Convolution outperforms recurrence in our experiments, suggesting that a 20-day window may be sufficient to capture temporal dependencies in markets.

CONCLUSION
Our results present to our knowledge the first rigorous statistical evaluation of candlestick patterns in time series analysis, using normalised signal cross-correlation to identify pattern matches. We find little evidence of predictive prowess in any of the standard chartist pictograms, and suspect that the enduring quality of such practices owes much to their subjective and hitherto unverified nature. Nevertheless, it is not inconceivable that price history might contain predictive information, and much of quantitative finance practice relies on elements of technical pattern recognition (e.g., momentum-tracking) for its success. Through a deep learning lens, technical analysis is merely an arbitrary and incorrect specification of the feature-extractive early layers of a neural network. Within relatively shallow architectures, learning more effective filters from data improves accuracy significantly while also providing an interpretable replacement for chartism's visual aids. Thresholding and deep ensembles yield a robust framework for systematic decision making in financial markets, further enhancing performance, though only up to a point. The predictive information embedded in price history appears limited, and even state-of-the-art techniques in pattern recognition will remain subject to that upper bound.