
1 Introduction

The system mainly concentrates on machine learning algorithms that are used in prediction modeling. Machine learning algorithms are self-programming methods that deliver better results after being exposed to data. The learning portion of machine learning signifies that the models that are built change according to the data they encounter over the course of fitting.

The idea behind building this system was to determine which one among the chosen time series forecasting algorithms is the most suitable for these operations. The uniqueness of this work is established in the literature review section of this study. The five algorithms that were chosen are Linear Regression, K-Nearest Neighbour, Auto ARIMA, Support Vector Machine, and Facebook’s Prophet, which had never been compared altogether on a common platform. Also, several datasets were extracted for building and testing these models, along with the evaluation metrics.

Since the extracted datasets are of the time-series type, algorithms that are most suitable for this kind of work were chosen for this system. The term time series forecasting means that the system makes a prediction based on time-series data. Time series data are those where records are indexed on the basis of time, which can be anything like a proper date, a timestamp, a quarter, a term, or a year. In this type of forecasting, the date column is used as the predictor/independent variable for predicting the target value.

A machine learning algorithm builds a model with the help of a dataset by being trained and tested. The dataset is split into two parts, the train and test datasets; generally, the records of these two do not overlap, and there are different mechanisms in machine learning for this task. After fitting/training the model on the train portion, it must be tested, and for that, the test dataset comes into play. Further, the results that are generated are matched against the desired targets with the help of evaluation metrics. The two evaluation metrics considered for comparison, viz. the Mean Absolute Percentage Error and the Root Mean Squared Error, are broadly discussed in the Methodology chapter.
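As an illustration of the non-overlapping split described above, here is a minimal sketch of a chronological train/test split in Python; the 80/20 ratio is an assumption for illustration and is not taken from this study:

```python
def time_series_split(records, train_fraction=0.8):
    """Split time-ordered records into non-overlapping train and test parts.

    Unlike a random split, the chronological order is preserved, so the
    model is always tested on records that come after the training period.
    """
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]
```

For example, splitting ten ordered records yields the first eight for training and the last two for testing, with no overlap between the parts.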

2 Literature Review

This section delivers the opinions and conclusions of several researchers who contributed their works to the field of machine learning algorithms. Also, this section presents the comparative outcomes of the machine learning algorithms.

Vansh Jatana mentioned in his paper Machine Learning Algorithms [1] that machine learning is a branch of AI that allows a system to train on and learn from past data and activities. The paper also explores a set of regression, classification, and clustering algorithms through several parameters, including memory size, overfitting tendency, time for learning, and time for predicting. In comparison with Random Forest, Boosting, SVM, and Neural Networks, Linear Regression is weaker in time for learning. Also, like Logistic Regression and Naive Bayes [2], the overfitting tendency of Linear Regression is low. However, in that research, Linear Regression is the only pure regression model, as the others serve as classification and clustering models too.

Ariruna Dasgupta and Asoke Nath [3] discuss the broader classification of prominent machine learning algorithms in their journal and also specify new applications of them. In supervised learning, a priori knowledge is necessary, and the same output is always produced for a specific input. Similarly, reinforcement learning requires a priori knowledge too, but the output changes if the environment does not remain the same for a specific result. Nevertheless, unsupervised learning does not require a priori knowledge.

Talking about Auto ARIMA, Prapanna Mondal, Labani Shit, and Saptarsi Goswami [4] carried out a study in their paper on 56 stocks from seven sectors. Stocks that are registered on the National Stock Exchange (NSE) were considered. The authors chose 23 months of data for the observational research and calculated the accuracy of the ARIMA model in predicting stock prices. For all the sectors, the ARIMA model’s accuracy in predicting stock prices is higher than 85%, which indicates that ARIMA provides reasonable accuracy.

A work by Kemal Korjenić, Kerim Hodžić, and Dženana Đonk [5] evaluates Prophet’s performance in real-world use cases. The Prophet model tends to generate fairly conservative monthly as well as quarterly forecasts. It also shows enormous potential for classifying the portfolio into classes according to the expected level of forecast reliability: some 50% of the product portfolio (with a large amount of data) can be projected with MAPE < 30% monthly, whereas around 70% can be predicted with MAPE < 30% quarterly (out of that, 40% with MAPE < 15%).

Sibarama Panigrahi and H.S. Behra [6] used FTSF-DBN, FTSF-LSTM, and FTSF-SVM models as comparative algorithms for their Fuzzy Time Series Forecasting (FTSF) in their journal. These machine learning algorithms are used to model FLRs (Fuzzy Logic Relationships) [7]. The paper concluded that FTSF-DBN outperformed the DBN (Deep Belief Network) method. But it also reported that the statistical difference between FTSF-LSTM and LSTM is insignificant.

Talking about K-Nearest Neighbour (KNN), it has been stated in a paper [8] that KNN, as a data mining algorithm, has a broad range of uses in regression and classification scenarios. It is mostly used for data mining or data categorization. In agriculture, it can be applied to simulating daily precipitation and weather forecasts. KNN can be used efficiently in determining required patterns and correlations between data. Along with it, other techniques such as hierarchical clustering, k-means, regression models, ARIMA [9], and decision tree analysis can also be applied over this massive field of exploration. Also, KNN [10] can be applied in the medical field to predict the reason for a patient’s admission to the hospital.

In the end, the whole analysis of the different journals published in recent years features a broad perspective of different machine learning algorithms, specifically time series and prediction algorithms, that are about to be featured in the implementation of this system. Also, from the above study, it can be concluded that each algorithm belongs to a different category and has significant applications. Further, some of the comparative studies define the best machine learning techniques based on several parameters. Nevertheless, in this whole process of encountering these brilliant works, the team never came across any work where the five algorithms they had chosen were compared on one platform with a common dataset. That is why the team saw this as an opportunity to compare these five algorithms, which are of different natures but also share some similarities, so that they can be used for time series forecasting as well.

3 Methodology

The idea was to create an interface that could display result matrices and multiple analyses with words, numbers, statistics, and pictorial representations. The visual interface created by the team should not let the audience deviate from the topic and should only include limited and necessary items, such as which algorithms are used, which datasets are used, their data analysis, and the respective comparative results. The construction of the interface was therefore the ultimate concern in the entire research and system construction campaign.

3.1 Linear Regression

Linear regression [11] is a simple and well-known machine learning algorithm. It is a mathematical procedure that is applied for predictive analysis. Simple linear regression delivers forecasts for continuous or numeric variables like sales, wages, age, product price, etc.

Mathematically, it can be represented as shown in “Eq. (1)”,

$$ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n $$
(1)

Here, y is the target variable and x1, x2, …, xn are predictive variables that represent every other feature in the dataset. θ0, θ1, θ2, …, θn represent the parameters that are calculated by fitting the model.

In the case of using two variables i.e., 1 independent and 1 dependent variable, it can be represented as shown in “Eq. (2)”:

$$ y = \theta_0 + \theta_1 x $$
(2)

where θ0 is the intercept formed on the y-axis, and θ1 is the slope that is obtained once the model is trained.
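The two-variable case of Eq. (2) can be sketched with the closed-form least-squares estimates; this is an illustrative stand-alone version, whereas the actual system would typically rely on a library implementation:

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate theta0 (intercept) and theta1 (slope) for y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # theta1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    theta1 = num / den
    theta0 = mean_y - theta1 * mean_x  # intercept on the y-axis
    return theta0, theta1
```

For example, fitting the points (1, 3), (2, 5), (3, 7), (4, 9) recovers the line y = 1 + 2x.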

3.2 K Nearest Neighbour

K-Nearest Neighbour [12] calculates the similarity between the new data and recorded cases and places the new records into the section where similar data exist.

It computes the distance between the input and the stored data and provides the prediction subsequently, as shown in “Eq. (3)”.

$$ d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2} = \sqrt{\sum_{i = 1}^{n} (q_i - p_i)^2} $$
(3)

Here, n features are taken into consideration. The new point is assigned to the same class as the stored point situated at the nearest position from it; q and p are the new and existing data points respectively.
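Eq. (3) and the neighbour lookup can be sketched as follows; the averaging of the k nearest targets is an illustrative regression variant (classification would take a majority vote instead):

```python
import math

def euclidean(p, q):
    """Distance of Eq. (3) between an existing point p and a new point q."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_points, train_targets, query, k=3):
    """Predict the target of `query` as the mean target of its k nearest points."""
    ranked = sorted(zip(train_points, train_targets),
                    key=lambda pair: euclidean(pair[0], query))
    return sum(target for _, target in ranked[:k]) / k
```

With the four stored points 0, 1, 2, and 10 on a line, a query at 1.5 takes its prediction from the three closest points (0, 1, and 2), so a distant outlier like 10 does not influence the result.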

3.3 Auto ARIMA

ARIMA [13] is an acronym that refers to Auto-Regressive Integrated Moving Average. It is a simple and efficient ML algorithm used to perform time-series forecasting. It combines two components: auto-regression and moving average.

It takes past values into account for future prediction. There are 3 essential parameters in ARIMA:

p => the number of historical observations used for predicting the upcoming value

q => the number of historical prediction errors used for forecasting the upcoming value

d => the order of differencing
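The differencing behind the d parameter can be sketched as follows; repeated first-order differencing is what the "I" (integrated) part of ARIMA applies to make a series stationary:

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing to a series."""
    for _ in range(d):
        # Each value is replaced by its change from the previous value.
        series = [b - a for a, b in zip(series, series[1:])]
    return series
```

For example, differencing the series [1, 3, 6, 10] once gives [2, 3, 4], and differencing it twice gives the constant series [1, 1].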

3.4 Prophet

Prophet [14] is an open-source library by Facebook made for predicting time series data, in order to learn from and forecast the series. Seasonal variations occur over a short duration and are not notable enough to be described as a trend. The equation relating the terms is defined as shown in “Eq. (4)”,

$$ fn(t) = g(t) + s(t) + h(t) + e(t) $$
(4)

where,

g(t)   => trend

s(t)  => seasonality

h(t)  => holiday effect on the forecast

e(t) => error term

fn(t)  =>  the forecast

The behaviour of the given terms is mathematically dependent, and if not studied properly, it might lead to a wrong prediction, which may be very problematic for the customer or for the business in practice.
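The additive structure of Eq. (4) can be illustrated with made-up component functions; these toy definitions are assumptions for demonstration and are not Prophet's actual fitted components:

```python
def g(t):
    return 10 + 0.5 * t            # linear trend (assumed)

def s(t):
    return 2 if t % 7 < 2 else 0   # simple weekly seasonality (assumed)

def h(t):
    return 5 if t == 25 else 0     # one-off holiday effect (assumed)

def e(t):
    return 0                       # error term, zero in this sketch

def fn(t):
    """The forecast of Eq. (4): trend + seasonality + holidays + error."""
    return g(t) + s(t) + h(t) + e(t)
```

Evaluating fn at different t shows how each component contributes independently: the trend grows steadily, the seasonality repeats every seven steps, and the holiday term fires only at t = 25.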

3.5 Support Vector Machine

The SVM [15] is a machine learning algorithm that is employed for both regression and classification depending upon the problem. In a linear SVM, the features are linearly separable [16], so a simple straight line can be utilized to implement the SVM. The formula for obtaining the hyperplane in this case is as shown in “Eq. (5)”:

$$ y = mx + c $$
(5)

If the features being used are of a non-linear type, then more dimensions need to be added, and in that case, one needs to use a plane. The formula for obtaining the hyperplane in this case is as shown in “Eq. (6)”:

$$ z = x^2 + y^2 $$
(6)
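The feature lift behind Eq. (6) can be sketched directly: each 2-D point gains a third coordinate z = x² + y², after which ring-shaped classes (small z for inner points, large z for outer points) become separable by a flat plane in three dimensions:

```python
def lift(point):
    """Map a 2-D point (x, y) to (x, y, z) with z = x**2 + y**2."""
    x, y = point
    return (x, y, x ** 2 + y ** 2)
```

For instance, a point near the origin receives a much smaller z than a point far from it, so a horizontal plane between the two z-values separates the classes.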

In this system, to determine the accuracy, the two evaluation metrics used for generating results are the Mean Absolute Percentage Error and the Root Mean Squared Error, and both depend on the obtained values and the actual values.

The Root Mean Squared Error, a.k.a. RMSE, is obtained by taking the square root of the mean of the individually calculated squared errors. The formula for the same is given in “Eq. (7)”:

$$ RMSE = \sqrt{\sum_{i = 1}^{n} \frac{(\hat{y}_i - y_i)^2}{n}} $$
(7)

Here, ŷ1, ŷ2, ŷ3, …, ŷn are the obtained (predicted) values, y1, y2, y3, …, yn are the respective actual values, and n is the number of observations.
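Eq. (7) translates directly into code; a minimal sketch:

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error of Eq. (7): sqrt of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

A perfect prediction gives an RMSE of exactly zero; any deviation raises the value, with large errors penalized more heavily because of the squaring.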

In MAPE, or Mean Absolute Percentage Error, the value is calculated by taking the absolute difference between the actual value and the obtained value divided by the actual value; the individual values are then averaged to obtain the result, as shown in “Eq. (8)”:

$$ M = \frac{1}{n} \sum_{t = 1}^{n} \left| \frac{A_t - F_t}{A_t} \right| $$
(8)

Here, A1, A2, A3, …, An represent the actual values, F1, F2, F3, …, Fn represent the obtained data, and n is the number of observations taken under consideration.
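Eq. (8) can likewise be sketched in a few lines; note that every actual value A_t must be non-zero for the percentage to be defined:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error of Eq. (8), as a fraction.

    Multiply the result by 100 to express it as a percentage.
    """
    n = len(actual)
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / n
```

Unlike RMSE, MAPE is scale-independent: an error of 10 on an actual value of 100 contributes the same 10% as an error of 20 on an actual value of 200.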

4 System Design

The design of the whole system depends on the flow of modules. The work is segregated into six modules, and the team developed the whole system going through these six modules, which are discussed in this section of the study. Figure 1 describes the modules and processes involved in the long process of implementing the required interface (Fig. 1).

Fig. 1. Flow of modules

4.1 Data Requirements and Collection

In this phase of the whole implementation, the main objective is to understand what kind of datasets are required in the massive process. Understanding the data requirements plays a vital role in upcoming modules in this long process. Further, after understanding the data requirements, the next step is to focus on the collection of the required datasets.

4.2 Data Preparation

This phase of the implementation is the most crucial. It lets the implementer determine the flaws in the collected data. To operate with the data, it needs to be prepared in a way that addresses missing or erroneous values, eliminates duplicates, and ensures that it is accurately formatted for modeling.

4.3 Modelling

In this module, the implementation of the algorithms is done as per the requirement in Python with the help of some Python libraries. It is the phase that allows deciding how the information can be visualized to find the required solution. All five algorithms mentioned in the previous section, whether predictive or descriptive, were implemented here.

4.4 Model Evaluation

Model evaluation assesses the algorithms that were implemented in the previous module. It is intended to decide the right analytical approach or strategy to address the problem. With the help of RMSE and MAPE, it can be determined which model is most suitable for a particular time series dataset. The closer the values of RMSE and MAPE are to zero, the better the model for that dataset.

4.5 Interface Building

In this module, the work covered the interface development of the system. The team also established a connection between the interface and the models that were implemented in the previous phases. Also, as per the requirement, the team can revert to the fourth phase of the implementation. Django was used as the web framework for this phase of the implementation.

4.6 Deployment

Once the models are evaluated and the interface is developed, the system is deployed and put to the ultimate test. It showed the required comparative results and satisfied the objectives the team had set prior to initiating the hands-on work on this system.

5 Results

As per the discussion, the results that needed to be generated were nothing else but the comparative values of the evaluation metrics for the respective datasets. First in that trail was the stock prediction dataset, and Table 1 shows the comparative values for the same (Table 1).

Table 1. Results of stock prediction dataset.
Table 2. Results of earthquake forecasting dataset.
Table 3. Results of sales forecasting dataset.

Auto ARIMA has been the best performer with the lowest values of RMSE and MAPE. However, SVM and KNN are the worst performers according to RMSE and MAPE respectively. Similarly, Table 2 shows the output generated for the earthquake dataset, and here the reader can observe that Auto ARIMA and Linear Regression are the best performers with the lowest values of RMSE and MAPE respectively. However, KNN was the worst performer according to both RMSE and MAPE, although the values were very close in this case (Table 3).

The results of the sales forecasting dataset are described in Table 3, where it can be observed that Linear Regression and SVM turn out to be the best performers with the lowest values of RMSE and MAPE respectively. However, KNN was the worst performer according to both RMSE and MAPE (Fig. 2).

Fig. 2. Model performance comparative graph

The graphs in Fig. 2 show the comparison of the values attained by the evaluation metrics. The Tata Global Beverage graph signifies that RMSE has higher values than MAPE; however, the other two datasets say otherwise. Ultimately, it all depends on the target variable and dataset.

The trend of the first dataset says Auto ARIMA has significantly lower values of RMSE (3.74366) and MAPE (0.72129) than the other models. However, talking about the worst performers, KNN tops the other algorithms according to MAPE (16.92529), and SVM according to RMSE (69.81082).

Looking at the trend of the second dataset, one can say that there is minimal difference between the models according to RMSE; however, among all of them, Auto ARIMA (0.41603) gave a slightly more satisfying result. But according to MAPE, Linear Regression (2.49101) went on top, followed by Auto ARIMA (2.58689). Both RMSE and MAPE signified that KNN would not be a good choice for this dataset.

The third dataset, i.e., for sales prediction, made it very difficult to choose an optimal algorithm according to the graph. Nevertheless, Linear Regression became the most favorable algorithm according to the RMSE figures (2.22990). Similarly, SVM became the more optimal algorithm according to MAPE (22.56927). But again, KNN was clearly not a good choice.

6 Conclusion

An experimental performance analysis of five algorithms, viz. Linear Regression, K-Nearest Neighbour, Auto ARIMA, Prophet, and Support Vector Machine, is done. Stock market, earthquake, and sales forecasting data are analyzed. To compare the performance and accuracy of these algorithms, RMSE and MAPE are used as the evaluation metrics. The lower the values of RMSE and MAPE, the better the algorithm.

As per the results, according to RMSE, Auto ARIMA is the most optimal algorithm in two cases out of three. However, MAPE states that Auto ARIMA is suitable for only one case. Taking it all into consideration, it can be said that Auto ARIMA outperformed all the other four algorithms, followed by Linear Regression in second place. Also, KNN is going to be the worst choice for time-series forecasting. In the end, it would not be wrong to say that everything depends upon the trends and variables of the dataset, and that is why choosing an appropriate machine learning model becomes a priority before going for a business idea. Here, one can observe that there is a small difference between the results of the evaluation metrics for the earthquake and sales datasets. Yet, the numerical gaps between Auto ARIMA and the other models on the stock prediction dataset are clearly observed.