Introduction

As a relatively new type of currency introduced in the new millennium, cryptocurrency is a digital or virtual currency that employs advanced encryption techniques to manage and regulate its currency units [1]. It relies heavily on blockchain technology and hence inherits the blockchain's properties, such as decentralization, transparency, and immutability [2]. No single party or authority controls cryptocurrency transactions, making cryptocurrency an attractive option for many people to use and invest in.

The first and most popular cryptocurrency to date is Bitcoin. It was established based on the concept introduced in Satoshi Nakamoto's 2008 publication entitled 'Bitcoin: A Peer-to-Peer Electronic Cash System' [3], which was later implemented as open-source code and released to the public on SourceForge [4]. The concept was widely accepted and adopted in various applications, helped shape the cryptocurrency ecosystem, and gave birth to many other cryptocurrencies, such as Ethereum, Ripple, Monero, Stellar, Litecoin, and Dash [2]. Figure 1 shows Bitcoin's and several other cryptocurrencies' prices in the last few years, together with their growth percentages.

Fig. 1

Cryptocurrencies’ price fluctuation (a) and growth percentages (b) in the last few years [5]

As can be seen from Fig. 1a, cryptocurrency prices are very dynamic, making it a challenging task to predict their future values. Moreover, cryptocurrencies are also highly volatile, with sudden rises and dips over time caused by many factors, as shown in Fig. 1b. Therefore, the trading community needs an accurate prediction method to help them make strategic decisions and benefit from their investments [6].

Several prediction techniques have been introduced in previous studies, but they can mainly be grouped into two approaches: traditional empirical analysis and machine learning algorithms [7, 8]. In traditional empirical analysis, various statistical methods, ranging from simple ones (such as moving averages) to very complex analyses (such as hidden Markov models and sentiment analysis techniques), have been applied to predict cryptocurrency price movements [9]. Studies that focused on this approach include Anupriya and Garg [10], Abraham et al. [11], Mohapatra et al. [12], Nasir et al. [13], Bakar and Rosbi [14], and Wolk [15].

On the other hand, the machine learning approach has been widely applied in various fields, including cryptocurrency price prediction. Radityo et al. [16] tried to predict the Bitcoin exchange rate to USD using artificial neural networks (ANNs). Comparing four types of ANNs, they found that the backpropagation neural network (BPNN) method performed best, giving a relatively low mean absolute percentage error (MAPE) and the shortest training time. Phaladisailoed and Numnonda [17] also tried to predict the Bitcoin price, but they used other machine learning methods, namely Theil-Sen regression and Huber regression, as well as two deep learning methods, namely long short-term memory (LSTM) and gated recurrent unit (GRU). Their experiments on 2195 daily trading exchange records (BTC/USD) found that GRU achieved the lowest mean square error (MSE) and the highest R-square values at 0.00002 and 0.992, respectively. Recently, Akyildirim et al. [18] published their work on predicting cryptocurrency returns using several machine learning methods, such as support vector machine (SVM), logistic regression, artificial neural networks, and random forest. Their study focused not on a single cryptocurrency price but on twelve cryptocurrencies' prices, and they found that the SVM is the best performing method, giving consistent results. Other researchers have developed ensemble methods to tackle cryptocurrency price prediction, as seen in the works of Sin and Wang [19] and Ji et al. [20].

In this research, we further apply and analyze three popular deep learning methods classified as recurrent neural networks (RNNs): the long short-term memory (LSTM), the bidirectional LSTM (Bi-LSTM), and the gated recurrent unit (GRU). Although several studies have employed these methods, most of them focused on univariate prediction models, whilst in this research we focus on a multivariate prediction model. Moreover, we use the five major cryptocurrencies by market capitalization, i.e., Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), Tether (USDT), and Binance Coin (BNB) [21]. Hence, the contributions of this research are: (1) we propose a simple three-layer network architecture for the regression task of predicting cryptocurrency prices, (2) we compare and analyze the proposed architecture on three popular RNNs, namely the LSTM, Bi-LSTM, and GRU models, (3) we use the multivariate approach on five major cryptocurrencies and evaluate the results, and lastly, (4) we run the proposed architecture on each model several times to obtain robust evaluation results.

This paper is organized as follows. The data collection and the three RNNs used as deep learning methods in this research are explained first. Next, the pre-processing steps, experimental results, and discussion are given in the following section. Finally, some concluding remarks and suggestions for future research directions end the paper.

Data collection and research method

We start this section by explaining the data source and its characteristics. Next, a brief description of three popular RNN models, namely LSTM, Bi-LSTM, and GRU, is given. Finally, we explain three commonly used prediction error criteria, i.e., mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE), which serve as the performance evaluation metrics in this research.

Data source and characteristics

We focus our study on the five major cryptocurrencies by market capitalization [21]: Bitcoin (BTC), Ethereum (ETH), Cardano (ADA), Tether (USDT), and Binance Coin (BNB). We collected the daily recorded price of each cryptocurrency against USD from Yahoo! Finance [22] and took the maximum available data from the source. Each collected dataset contains seven attributes with different data types, namely Date (date), Open (float), High (float), Low (float), Close (float), Adj Close (float), and Volume (int). Since each cryptocurrency was introduced and first recorded at a different time, the start date of the available data differs between cryptocurrencies. This also results in a different total number of records for each cryptocurrency. The characteristics of the collected datasets are shown in Table 1.

Table 1 Data characteristics

Long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), and gated recurrent unit (GRU)

As an improved version of feedforward neural networks, recurrent neural networks (RNNs) can keep past information by taking a previous layer's output as input for the next layer in the network [23]. However, an RNN cannot retain information over very long sequences and suffers from the long-term dependency problem [24]. Therefore, Hochreiter and Schmidhuber introduced a special type of RNN to solve this problem [25]. The proposed method was later called the long short-term memory (LSTM) model and has been widely used in various applications, including regression tasks such as time series prediction.

To store state in the network, LSTM uses a three-gate mechanism. The first gate is the forget gate (\({f}_{t}\)), which controls the degree of information loss from the previous cell's hidden state. The second gate is the input gate (\({i}_{t}\)), which controls the degree of new information that will be stored in the current cell state. Lastly, the third gate is the output gate (\({o}_{t}\)), which calculates the new output value for the current cell [26]. Figure 2 illustrates the three-gate mechanism in an LSTM cell, while the related equations used in an LSTM cell are shown as Eqs. (1)–(6).

$${f}_{t}=\sigma \left({W}_{f}{h}_{t-1}+{U}_{f}{x}_{t}+{b}_{f}\right)$$
(1)
$${i}_{t}=\sigma \left({W}_{i}{h}_{t-1}+{U}_{i}{x}_{t}+{b}_{i}\right)$$
(2)
$${\tilde{C }}_{t}=\mathrm{tanh}\left({W}_{C}{h}_{t-1}+{U}_{C}{x}_{t}+{b}_{C}\right)$$
(3)
$${C}_{t}={f}_{t} \odot {C}_{t-1}+{i}_{t} \odot {\tilde{C }}_{t}$$
(4)
$${o}_{t}=\sigma \left({W}_{o}{h}_{t-1}+{U}_{o}{x}_{t}+{b}_{o}\right)$$
(5)
$${h}_{t}={o}_{t} \odot \mathrm{tanh}\left({C}_{t}\right)$$
(6)

Here, \({\tilde{C }}_{t}\) is the candidate cell state and \({C}_{t}\) is the current cell state. The network weights are denoted as \({W}_{f}, {W}_{i}, {W}_{C}, {W}_{o}, {U}_{f}, {U}_{i}, {U}_{C}, {U}_{o}\) and the bias variables as \({b}_{f}, {b}_{i}, {b}_{C}, {b}_{o}\). \({h}_{t}\) represents the current hidden state and \({x}_{t}\) the new input at the current cell. Two types of activation functions are used here, i.e., the sigmoid (\(\sigma \)) and the hyperbolic tangent (\(\mathrm{tanh}\)) activation functions. They are the most frequently used nonlinear activation functions in artificial neural networks [27].
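As a minimal illustration of how Eqs. (1)–(6) interact, a single LSTM update step can be sketched in NumPy. The toy dimensions (4 hidden units, 5 input features) and the random weight initialization are assumptions for demonstration only, not the configuration used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step following Eqs. (1)-(6); p holds the W_*, U_*, b_* parameters."""
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])      # Eq. (1): forget gate
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])      # Eq. (2): input gate
    C_tilde = np.tanh(p["W_C"] @ h_prev + p["U_C"] @ x_t + p["b_C"])  # Eq. (3): candidate state
    C_t = f_t * C_prev + i_t * C_tilde                                # Eq. (4): new cell state
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])      # Eq. (5): output gate
    h_t = o_t * np.tanh(C_t)                                          # Eq. (6): new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
n_h, n_x = 4, 5  # toy sizes: hidden units, input features
p = {f"W_{g}": rng.standard_normal((n_h, n_h)) * 0.1 for g in "fiCo"}
p.update({f"U_{g}": rng.standard_normal((n_h, n_x)) * 0.1 for g in "fiCo"})
p.update({f"b_{g}": np.zeros(n_h) for g in "fiCo"})

h, C = np.zeros(n_h), np.zeros(n_h)   # initial hidden and cell states
h, C = lstm_step(rng.standard_normal(n_x), h, C, p)
```

Because \({h}_{t}\) passes through \(\mathrm{tanh}\), its entries always stay within \((-1, 1)\), whereas the cell state \({C}_{t}\) is not bounded this way.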

Fig. 2

LSTM cell and its gates mechanism [28]

A newer variant of LSTM, called the bidirectional LSTM (Bi-LSTM) model, was introduced later. Bi-LSTM employs the bidirectional structure of RNNs: rather than a single forward hidden layer, two hidden layers running in opposite directions feed the same output [29], as shown in Fig. 3. After the network has learned from both directions separately, the two outputs are combined using one of the merge modes available for Bi-LSTM [30]. In this study, we used the default merge mode, i.e., concatenation. With this approach, both past and future information contained in the dataset can be preserved [31].
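The effect of the merge mode can be illustrated with toy per-timestep outputs of the two directional layers; the numbers below are made up for demonstration and do not come from the study's data.

```python
import numpy as np

# Toy per-timestep outputs of the forward layer and of the backward layer
# (already re-reversed into forward time order), shape (timesteps, units)
h_fwd = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
h_bwd = np.array([[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]])

# Typical Bi-LSTM merge modes:
concat = np.concatenate([h_fwd, h_bwd], axis=-1)  # default: feature dim doubles
summed = h_fwd + h_bwd                            # element-wise sum, same shape
average = (h_fwd + h_bwd) / 2.0                   # element-wise average

print(concat.shape)  # (3, 4): 2 * units features per timestep
```

Concatenation preserves both directional representations separately, at the cost of doubling the feature dimension passed to the next layer; sum and average keep the dimension but mix the two directions.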

Fig. 3

LSTM versus Bi-LSTM architectures [32]

Another type of RNN, recently introduced by Cho et al. [33], is known as the gated recurrent unit (GRU). It is quite similar to LSTM; however, it uses fewer gates and simpler equations. It combines LSTM's forget and input gates into a single update gate and merges the cell and hidden states [6]. Four equations are incorporated in the GRU cell, as shown in Eqs. (7)–(10) below.

$${z}_{t}=\sigma \left({W}_{z}\cdot\left[{h}_{t-1},{x}_{t}\right]\right)$$
(7)
$${r}_{t}=\sigma \left({W}_{r}\cdot\left[{h}_{t-1},{x}_{t}\right]\right)$$
(8)
$${\tilde{h }}_{t}=\mathrm{tanh}\left(W\cdot\left[{r}_{t} \odot {h}_{t-1},{x}_{t}\right]\right)$$
(9)
$${h}_{t}=\left(1-{z}_{t}\right) \odot {h}_{t-1}+{z}_{t}\odot {\tilde{h }}_{t}$$
(10)

Here, \({z}_{t}\) and \({r}_{t}\) are the update and reset gates, \({W}_{z}\) and \({W}_{r}\) are the networks’ weights, \({\tilde{h }}_{t}\) is the current memory state, \({h}_{t}\) is the final memory or output of the GRU unit to be passed on to the next unit, and \({x}_{t}\) is the new information at the current cell. Figure 4 shows a GRU cell and its gates.
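Analogously to the LSTM case, a single GRU update following Eqs. (7)–(10) can be sketched in NumPy; the toy dimensions and random weights are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following Eqs. (7)-(10); weights act on [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                     # Eq. (7): update gate
    r_t = sigmoid(W_r @ hx)                                     # Eq. (8): reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # Eq. (9): candidate memory
    h_t = (1 - z_t) * h_prev + z_t * h_tilde                    # Eq. (10): final memory
    return h_t

rng = np.random.default_rng(1)
n_h, n_x = 4, 5  # toy sizes: hidden units, input features
W_z, W_r, W = (rng.standard_normal((n_h, n_h + n_x)) * 0.1 for _ in range(3))
h = gru_step(rng.standard_normal(n_x), np.zeros(n_h), W_z, W_r, W)
```

Note that the GRU keeps only one recurrent state \({h}_{t}\), interpolated between the previous state and the candidate via the update gate, instead of LSTM's separate cell and hidden states.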

Fig. 4

GRU cell and its gates mechanism [34]

Performance evaluation

To evaluate the prediction results, we use three popular error criteria, i.e., mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). MAE and RMSE express the error in the units of the data, whilst MAPE expresses it as a percentage. Equations (11)–(13) define these three error measurements, where a smaller score implies better prediction results [35, 36].

$$MAE=\frac{1}{n}\sum_{t=1}^{n}\left|{Y}_{t}-{F}_{t}\right|$$
(11)
$$RMSE=\sqrt{\frac{1}{n}\sum_{t=1}^{n}{\left({Y}_{t}-{F}_{t}\right)}^{2}}$$
(12)
$$MAPE=\left(\frac{1}{n}\sum_{t=1}^{n}\left|\frac{{Y}_{t}-{F}_{t}}{{Y}_{t}}\right|\right)\times 100\%$$
(13)

Here, the total number of data points is denoted as \(n\), the actual observed value as \({Y}_{t}\), and the predicted value as \({F}_{t}\).
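Equations (11)–(13) translate directly into code; the sample actual and predicted values below are made up for demonstration.

```python
import numpy as np

def mae(y, f):
    """Mean absolute error, Eq. (11)."""
    return np.mean(np.abs(y - f))

def rmse(y, f):
    """Root mean square error, Eq. (12)."""
    return np.sqrt(np.mean((y - f) ** 2))

def mape(y, f):
    """Mean absolute percentage error in percent, Eq. (13); assumes no zero actuals."""
    return np.mean(np.abs((y - f) / y)) * 100.0

y = np.array([100.0, 200.0, 400.0])  # actual values Y_t (made-up)
f = np.array([110.0, 190.0, 380.0])  # predicted values F_t (made-up)
print(mae(y, f), rmse(y, f), mape(y, f))  # ~13.33, ~14.14, ~6.67
```

Note that MAPE weights each error relative to the actual value, so the 20-unit miss on the 400-unit observation contributes the same percentage error as the 10-unit miss on the 200-unit one.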

Results and discussion

In this section, we explain the pre-processing and model development phases of the study. Next, several examples of the prediction plots for each cryptocurrency are shown, followed by the performance results and analysis.

Pre-processing and model development

We conducted several pre-processing steps on each cryptocurrency dataset, from data imputation to handle missing values to data reshaping so that the data can be processed by the deep learning methods applied in this study, namely LSTM, Bi-LSTM, and GRU. Firstly, out of the seven data attributes, we dropped the Adjusted Close attribute and focused on predicting the Close value as the target feature. Hence, rather than a univariate prediction approach, we used a multivariate prediction model incorporating the Close, Open, High, Low, and Volume attributes. Next, we checked for missing values in the dataset and used a simple imputation technique, replacing each missing value with its previous known record. Then, we normalized all data features using a MinMaxScaler transformation and reframed the dataset as a multivariate supervised learning problem. The next step was to split the data into training and test sets, using an 80:20 train-test ratio for each cryptocurrency. Lastly, we reshaped both sets into 3D arrays for the deep learning model development phase.
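A minimal NumPy sketch of this pipeline follows; the look-back window length of 3, the toy data, and the convention of placing Close in the first column are illustrative assumptions, not the study's exact settings.

```python
import numpy as np

def make_windows(data, n_steps):
    """Reframe a (samples, features) array into (samples, n_steps, features)
    windows X, with the next step's Close (assumed column 0) as target y."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])
        y.append(data[i + n_steps, 0])
    return np.array(X), np.array(y)

# Toy multivariate series: columns = [Close, Open, High, Low, Volume] (made-up)
rng = np.random.default_rng(0)
raw = rng.uniform(100, 200, size=(50, 5))

# Per-feature min-max normalization to [0, 1] (what MinMaxScaler does by default)
lo, hi = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - lo) / (hi - lo)

X, y = make_windows(scaled, n_steps=3)   # 3D array for the RNN layers
split = int(len(X) * 0.8)                # 80:20 train-test split, in time order
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(X_train.shape, X_test.shape)
```

Splitting in time order (rather than shuffling) keeps the test set strictly in the future relative to the training set, which is the appropriate evaluation setup for time series prediction.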

In this study, we propose a simple three-layer network architecture to be used in each deep learning model. We argue that a simple network architecture can achieve performance comparable to deeper and more complex ones, especially for regression tasks in the time series domain. The proposed architecture consists of one deep learning layer under consideration (LSTM, Bi-LSTM, or GRU) with 100 neurons, one Dropout layer with a rate of 0.1 (dropping 10% of the processed information), and one Dense layer with a single neuron as the output. During compilation, we used the mean square error as the loss function and the Adam optimizer. Each deep learning model was trained for 50 epochs with a batch size of 32 per run.
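The architecture can be sketched in Keras, the deep learning API used in this study; the look-back window length and feature count below are illustrative assumptions, since they are not fixed by this paragraph.

```python
from tensorflow import keras

n_steps, n_features = 3, 5  # look-back window (assumed) and input features

model = keras.Sequential([
    keras.layers.Input(shape=(n_steps, n_features)),
    keras.layers.LSTM(100),     # swap in Bidirectional(LSTM(100)) or GRU(100) for the other models
    keras.layers.Dropout(0.1),  # drop 10% of the processed information during training
    keras.layers.Dense(1),      # single-neuron regression output
])
model.compile(loss="mse", optimizer="adam")

# Training as described in the text (X_train/y_train come from the pre-processing step):
# model.fit(X_train, y_train, epochs=50, batch_size=32)
```

A single recurrent layer of 100 units over 5 input features already carries a few tens of thousands of trainable weights, which is typically sufficient capacity for a univariate regression target like the Close price.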

The models were trained and run on Google Colab with a single-core (two-thread) Intel® Xeon® processor @ 2.20 GHz, ~12.69 GB of RAM, and ~107.72 GB of disk space. In conducting the experiments, we used Python 3 and several core libraries: NumPy for numerical computing, Pandas for data processing and analysis, Matplotlib for data visualization, Keras as the deep learning application programming interface (API), and scikit-learn (sklearn) for data pre-processing.

Prediction results

After training the proposed deep network architecture on each cryptocurrency dataset, we evaluated it on the test set. Model development and evaluation were run ten times for each deep learning network (LSTM, Bi-LSTM, and GRU) to obtain robust performance results. The prediction results for each deep learning model are plotted in Figs. 5, 6, 7 (only the first run is shown). The blue line shows the actual value, while the red line shows the predicted value. As can be seen from the plots, each approach (LSTM in Fig. 5, Bi-LSTM in Fig. 6, and GRU in Fig. 7) gives similar results and could follow each cryptocurrency's pattern. Further discussion and analysis of the prediction results are given in the following sub-section.

Fig. 5

Prediction results for five cryptocurrencies using LSTM

Fig. 6

Prediction results for five cryptocurrencies using Bi-LSTM

Fig. 7

Prediction results for five cryptocurrencies using GRU

Analysis

We used three popular error measurement criteria in conducting the evaluation, i.e., MAE, RMSE, and MAPE. Ten consecutive runs were conducted for each deep learning method and each cryptocurrency pair, and the mean error values were recorded. Table 2 shows the error measurement results of each cryptocurrency pair. The lowest score of each pair is shown in bold text.

Table 2 Performance results of each cryptocurrency (average of ten consecutive running)

The performance evaluations using MAE and RMSE gave similar results. For USDT-USD and BNB-USD, the deep network architecture trained using LSTM gave the best average results at 0.0025/0.0034 (MAE/RMSE) for USDT-USD and 18.0877/27.6245 for BNB-USD. Bi-LSTM, on the other hand, gave the best average result at 100.1383/147.8453 for ETH-USD. Meanwhile, for both BTC-USD and ADA-USD, the GRU method achieved the best results at 1167.3462/1777.3070 and 0.0782/0.1134, respectively.

We got a slightly different result when performance was measured by MAPE. For the ETH-USD pair, which previously obtained the best results when trained using Bi-LSTM, the best MAPE result was achieved by GRU. In fact, GRU is the preferred deep learning method among the three RNNs, giving the lowest average MAPE score of 0.0447 across the five considered cryptocurrencies. Bi-LSTM is slightly worse than GRU under MAPE, while LSTM is the worst. The boxplot diagram comparing the MAPE scores is shown in Fig. 8.

Fig. 8

MAPE comparison of each RNNs deep learning method

To further determine whether the performance results of the applied deep learning methods differ significantly, we conducted paired sample t-tests using the Microsoft Excel 'Data Analysis ToolPak' add-in [37]. Using the tool, we compared the MAE, RMSE, and MAPE results between LSTM vs. Bi-LSTM, LSTM vs. GRU, and Bi-LSTM vs. GRU. The paired sample t-test results are shown in Table 3. The two-tailed p-values for all comparisons are greater than the 0.05 significance level. Therefore, although GRU is the preferred method based on the error measurement results, the performance differences between the deep learning methods are not statistically significant.
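The same paired sample t-test can be reproduced in Python with SciPy's `ttest_rel` in place of the Excel add-in; the per-pair scores below are hypothetical stand-ins, not the values from Table 2.

```python
import numpy as np
from scipy import stats

# Hypothetical MAPE scores of two methods over the same five cryptocurrency
# pairs (values are made up; the paper's Table 2 holds the real numbers)
mape_lstm = np.array([0.062, 0.051, 0.048, 0.055, 0.049])
mape_gru = np.array([0.058, 0.047, 0.045, 0.050, 0.046])

# Paired sample t-test: pairs the scores per cryptocurrency before comparing
t_stat, p_value = stats.ttest_rel(mape_lstm, mape_gru)

# A two-tailed p-value above 0.05 would mean the difference between the two
# methods is not statistically significant at that level
print(t_stat, p_value)
```

The paired form is the right choice here because the two methods are evaluated on the same five cryptocurrency datasets, so the scores are naturally matched per pair.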

Table 3 t-test results

We also recorded the execution time for model development of each considered method during the experiments. Table 4 shows the average execution time over ten runs of each model on each cryptocurrency. As the results show, the LSTM method needs a shorter execution time for model development than the Bi-LSTM method. However, GRU gives the shortest execution time among the three RNNs at 7.5791 on average, slightly better than LSTM at 7.9768. LSTM and GRU thus have similar performance in terms of execution time, while Bi-LSTM is the worst. Figure 9 depicts the execution time comparison for each RNN deep learning method in a boxplot diagram.

Table 4 Execution time comparison
Fig. 9

Execution time comparison of each RNNs deep learning method

Lastly, we compare the performance of the proposed approach with that of similar studies on predicting future cryptocurrency values. Table 5 shows a relative comparison with similar studies based on the best average MAPE results. Our proposed approach compares favourably with other, more complex methods.

Table 5 Relative comparison with similar studies

Conclusion

We have proposed a simple three-layer deep network architecture and compared it across three popular RNNs, namely the long short-term memory (LSTM), the bidirectional LSTM (Bi-LSTM), and the gated recurrent unit (GRU). We used a multivariate prediction approach on five major cryptocurrencies against USD and performed a robust evaluation by running the proposed network architecture ten times. The experimental results show that Bi-LSTM and GRU achieve similar averaged MAPE performance. LSTM obtained better results for USDT-USD and BNB-USD, but it also showed greater variation than Bi-LSTM and GRU. Moreover, LSTM and GRU have similar execution times, with GRU slightly faster and showing lower variation on average.

To obtain better prediction results, experiments tailoring the deep network architecture to each cryptocurrency pair could be conducted. Moreover, in this study, we used the default merge mode (concatenation) for Bi-LSTM; a follow-up study could compare the effect of the various merge modes for Bi-LSTM on cryptocurrency prediction. Similarly, the effect of different activation functions in model development could also be explored in future research.