
1 Introduction

Existing studies of EU funds have mainly focused on their implementation by country (Nigohosyan & Vutsova, 2017), by region (Iribas & Pavia, 2010), or by program (Andrade, 2016). These studies examine the levels of investment made or the percentage used in local projects (De Iuliis, 2016). Despite the many methods available for evaluating the European Structural and Investment Funds (ESIF), predictive analytics has not yet been applied in this context.

Against this background, this work proposes a broader perspective on the use of predictive analytics. To this end, the analysis tries to predict the total investment that will occur during the next programming period, 2021–2027, from the data available for the previous programming period, 2014–2020. The purpose is to predict the total amount that will be devoted to each fund based on the difference between the total investment spent and the total amount planned, for each fund, i.e., ERDF, ESF, Youth Employment Initiative (YEI), CF, European Agricultural Fund for Rural Development (EAFRD), and European Maritime and Fisheries Fund (EMFF).

An artificial intelligence tool, RapidMiner, was used to test the best prediction model for each ESIF fund. In this way, it becomes possible to anticipate the amount that should be assigned, thus decreasing the difference between the amounts spent and planned.

This paper is structured as follows. Section 2 delivers the literature review on the subject. Section 3 briefly goes through the main underpinning assumptions of the methodology employed. Section 4 provides the description of the implementation of the models. The discussion of results is given in Sect. 5. Finally, conclusions are drawn, highlighting the main limitations found and future work developments.

2 Literature Review

Several authors underline the relevance of big data analytics and of machine learning in predictive analytics, reinforcing the role of data-driven decision-making in business environments (Lismont et al., 2017; Meyer et al., 2019; Psarras et al., 2020). The methodology and the processes used to reach the results are equally important. Ge et al. (2017) describe the processes required to extract the dataset from the database, namely the analysis of the metadata for data preparation and exploration. Only after the dataset has been identified and prepared is the regression model selected to perform the analysis. Different regression methods allow different hypotheses to be tested. In this context, Linear Regression, Support Vector Regression (SVR), Artificial Neural Network (ANN), k-Nearest Neighbors (k-NN), and the M5 model tree are some of the regression models that can be employed, depending on the main goal, to model the relationship between the dependent and independent variables. Each model has its own merits and demerits. If the main concern is to keep the error within a narrow interval, SVR should be used (Hotzlast, 2022). Although CRISP-DM is not new, it is a well-tested model and serves as the main structure for the data science process.
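The remark about keeping the error within a narrow interval can be illustrated with the epsilon-insensitive loss that SVR is usually trained with: deviations inside a tube of half-width epsilon cost nothing, and only the excess beyond it is penalized. The following is a minimal sketch of that loss, not of a full SVR; the function name and the sample values are hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Mean epsilon-insensitive loss: errors inside the +/- epsilon tube cost nothing."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Hypothetical values, for illustration only.
actual = [10.0, 12.0, 15.0]
predicted = [10.5, 11.0, 18.0]

# Absolute errors are 0.5, 1.0 and 3.0; with epsilon = 1.0 only the last one is penalized.
loss = epsilon_insensitive_loss(actual, predicted, epsilon=1.0)
```

A larger epsilon tolerates wider deviations, so the choice of epsilon is what defines the "short interval" in practice.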

3 Methodology

The predictive analytics process described here follows the CRISP-DM process model, the most widely used data science process. The step-by-step analysis of CRISP-DM focuses on the different predictive models. In many cases, it will be the user, not the data scientist, who carries out the deployment steps. The user will test the model with his or her own business values (i.e., model hyperparameters). This means that the model should be generic enough to adapt to different business variables.

3.1 RapidMiner Automation Procedure

To test the most used predictive methods, RapidMiner was chosen, a data science platform that supports data engineering, model building, and machine learning operations, among others. It supports the application of the CRISP-DM model and was therefore used to run the prediction analyses and simulations for each of the ESIF funds. The tool has a two-phase automation procedure: TurboPrep for data preparation, and AutoModel to test and simulate the different prediction models. The first three phases of the CRISP-DM model (business understanding, data understanding, and data preparation) were applied to each of the EU funds, i.e., ERDF, CF, ESF, EAFRD, and EMFF.

Then, after the dataset preparation, the prediction models were simulated and tested using AutoModel, in order to complete the modeling phases of the process.

4 Implementation Models

4.1 Data Preparation of EU Funds Using RapidMiner TurboPrep

The first step was to collect the dataset for each fund. The data were obtained directly from the European Commission data center. The preparation began by identifying every attribute in the dataset and its meaning, which corresponds to step two of the CRISP-DM model. The attributes are given below:

Ms—country initials.

Programme Title—program name.

TO_short—main program thematic objective.

National_Amount_planned—investment planned for each country.

Total_Amount_planned—investment planned with all the contributions.

Year—year id of the fund.

EU_co_financing—percentage of European financing over total investment.

Total_eligible_cost_decided_(selected)—total investment costs of all programs for each country.

Total_eligible_spending—after eligible costs, what was spent.

Reference Date—when the investment was available.

Some attributes, such as the country’s name and the program’s acronym, were eliminated because they were redundant. The metadata analysis table was then obtained (see Fig. 1).

Fig. 1 ERDF metadata table (role, name, type, range, missing values, and comment for each attribute of the ERDF dataset). Source: Author’s own elaboration.
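A metadata table of this kind can be approximated outside RapidMiner with a few lines of code. The sketch below is only an illustration under stated assumptions: the rows are invented, only four of the attributes listed above are used, and `summarize_metadata` is a hypothetical helper, not part of TurboPrep.

```python
# Hypothetical rows; attribute names follow the list above, values are invented.
rows = [
    {"Ms": "PT", "Year": 2014, "Total_Amount_planned": 100.0, "Total_eligible_spending": 80.0},
    {"Ms": "ES", "Year": 2015, "Total_Amount_planned": 250.0, "Total_eligible_spending": None},
]


def summarize_metadata(records):
    """Return, per attribute, the observed type names and the count of missing values."""
    summary = {}
    for record in records:
        for name, value in record.items():
            entry = summary.setdefault(name, {"types": set(), "missing": 0})
            if value is None:
                entry["missing"] += 1  # None stands in for a missing cell
            else:
                entry["types"].add(type(value).__name__)
    return summary


metadata = summarize_metadata(rows)
```

Checking the missing-value counts before modeling helps decide whether an attribute should be kept, imputed, or dropped.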

Fig. 2 shows that a new attribute was created for this study, calculated as the difference between the total amount spent and the total amount planned. This attribute, “Implemented-Planned 20142020”, was defined for all the countries and for all years between 2014 and 2020.

Fig. 2 ERDF data preparation with TurboPrep, showing the new attribute computed as the difference between the total amount spent and the total amount planned. Source: Author’s own elaboration.
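The derived attribute is a simple row-wise subtraction. A minimal sketch, assuming that `Total_eligible_spending` and `Total_Amount_planned` are the two columns involved; the output name `Implemented_minus_Planned` and the sample rows are hypothetical.

```python
def add_implementation_gap(records):
    """Return copies of the records with a new attribute: amount spent minus amount planned."""
    enriched = []
    for record in records:
        gap = record["Total_eligible_spending"] - record["Total_Amount_planned"]
        enriched.append({**record, "Implemented_minus_Planned": gap})
    return enriched


# Invented sample rows, for illustration only.
sample = [
    {"Ms": "PT", "Year": 2014, "Total_Amount_planned": 100.0, "Total_eligible_spending": 80.0},
    {"Ms": "ES", "Year": 2015, "Total_Amount_planned": 250.0, "Total_eligible_spending": 260.0},
]
prepared = add_implementation_gap(sample)
```

A negative value means that less was spent than planned; a positive value means that spending exceeded the plan.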

A summary of the distribution of the new variable is shown in Fig. 3, together with its measures of central tendency.

Fig. 3 Measures of central tendency: summary and distribution of the “Implemented-Planned 20142020” attribute. Source: Author’s own elaboration.
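The same kind of summary can be reproduced with Python's standard `statistics` module. The values below are invented for illustration and are not taken from the ERDF dataset.

```python
import statistics

# Hypothetical implementation gaps (spent minus planned), for illustration only.
gaps = [-20.0, 10.0, -5.0, -35.0, 0.0]

summary = {
    "mean": statistics.mean(gaps),      # arithmetic average
    "median": statistics.median(gaps),  # middle value of the sorted data
    "stdev": statistics.stdev(gaps),    # sample standard deviation (spread, not central tendency)
    "min": min(gaps),
    "max": max(gaps),
}
```

When the mean lies well below the median, a few large negative gaps are pulling the average down, which is worth knowing before a regression model is fitted.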

4.2 Data Modeling of EU Funds Using RapidMiner AutoModel

The next phase of the CRISP-DM model involves modeling and simulation for the ERDF. Here, AutoModel selects and executes all the predictive models available in its library. The ERDF dataset with the new attribute was tested with all available predictive models, as shown in Figs. 4, 5 and 6.

Fig. 4 ERDF AutoModel prediction preparation, Part 1: the ERDF dataset with the new attribute, prepared for all available predictive models. Source: Author’s own elaboration.

Fig. 5 ERDF AutoModel prediction preparation, Part 2: the model types listed include linear models, deep learning, decision tree, random forest, and gradient-boosted trees. Source: Author’s own elaboration.

Fig. 6 Overview of the prediction model analysis results for ERDF: two charts comparing the relative error and the run times of eight models. Source: Author’s own elaboration.

The same procedure was followed for the other five EU funds, applying all the steps of the CRISP-DM model: TurboPrep for data preparation and exploration, followed by the modeling analysis in AutoModel.
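The comparison of models by relative error can be sketched as follows. This assumes that relative error means the mean of |prediction - actual| / |actual|; the exact definition used by AutoModel may differ, and the predictions below are invented.

```python
def mean_relative_error(y_true, y_pred):
    """Average of |prediction - actual| / |actual|, skipping rows whose actual value is zero."""
    ratios = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    if not ratios:
        raise ValueError("relative error is undefined when all actual values are zero")
    return sum(ratios) / len(ratios)


# Hypothetical actual gaps and model predictions, for illustration only.
actual = [-20.0, 10.0, -40.0]
predictions_by_model = {
    "Generalized Linear Model": [-10.0, 5.0, -20.0],
    "Decision Tree": [-18.0, 9.0, -44.0],
}

errors = {
    name: mean_relative_error(actual, preds)
    for name, preds in predictions_by_model.items()
}
best_model = min(errors, key=errors.get)  # the model with the lowest relative error
```

Relative error is undefined for rows where the actual value is zero, which is why such rows are skipped here; a different handling may be preferable for a given dataset.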

4.3 Simulation

AutoModel makes it possible to choose the best prediction model, i.e., the one with the lowest relative error, by simulating, for each model, the best value of the implementation variable, which is the difference between the total amount spent and the total amount planned. For the ERDF, the values obtained for the dependent variable are shown in Figs. 7, 8 and 9.

Fig. 7 Important factors for prediction in the decision tree simulator, with a bar chart of the factors that support or contradict the prediction. Source: Author’s own elaboration.

Fig. 8 Decision tree prediction chart, with the plotted predictions and the results data. Source: Author’s own elaboration.

Fig. 9 Final prediction value with the decision tree simulator. Source: Author’s own elaboration.
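To make the idea of a decision tree prediction concrete, the sketch below fits the simplest possible regression tree, a single split (a "stump") on one feature, by minimizing the sum of squared errors. It is a toy illustration and not the model built by AutoModel; the feature choice (co-financing share) and all values are hypothetical.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold and one mean prediction per side."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct feature values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(model, x):
    """Predict with the mean of the side of the split that x falls on."""
    threshold, left_mean, right_mean = model
    return left_mean if x <= threshold else right_mean


# Hypothetical data: EU co-financing share (x) against implementation gap (y).
co_financing = [0.50, 0.55, 0.60, 0.80, 0.85]
gap = [-10.0, -12.0, -11.0, -40.0, -42.0]
model = fit_stump(co_financing, gap)
```

A full decision tree repeats this split search recursively on each side and over several attributes, which is also what allows a tool to rank the attributes by how much they influence a prediction.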

All the phases of the CRISP-DM model were repeated for the other five EU funds. For the ESF, the YEI, and the CF, the simulator identified the Generalized Linear Model as the best prediction model, while for the remaining funds the best prediction model was the decision tree. From the values obtained, it is possible to build a prediction table with all the funds, the best prediction algorithm for each, and their results (see Table 1).

Table 1 Final prediction values and best prediction algorithm for all the ESIF funds from 2014–2020

The values presented in Table 1 show that the execution of some funds differs quite significantly from the amount planned (i.e., they present large negative values). Special attention should be given to the execution of the EMFF. Although some priorities for the 2021–2027 ESIF are different and the objectives have been reduced from eleven to five, this research proposes a method to reduce the total amounts planned.

5 Conclusions and Further Research

By using CRISP-DM and an artificial intelligence tool such as RapidMiner, it is possible to obtain predictions for the next ESIF period, 2021–2027. Further and deeper research should be carried out for each country, following the new priorities and using machine learning algorithms for better predictions. An interesting future study could introduce new variables that account for the impact of the war in Ukraine and for the inflation caused by the shortage of materials, mainly electronic components. These two factors are currently a concern and will affect the first five priorities and project goals of ESIF. Future work should also involve a similar analysis for each thematic objective.