1 Introduction

The Sustainable Development Goals (SDGs), a set of 17 global goals, were drawn up to seek a better and more sustainable future for all. In 2015, 195 countries agreed that they could change the world for the better. This is to be achieved by bringing together their governments, companies, media, higher education and non-governmental institutions to improve the lives of people in their countries by 2030 [1]. Although the SDGs are not legally binding, governments are expected to take charge and set up a national structure to pursue them. States therefore bear the greatest responsibility for monitoring and reviewing the progress made. This requires the timely collection of quality data, so that follow-ups and reviews at the regional level are based on analyses conducted at the national level, which in turn contribute to global follow-up and review. One of the SDGs, goal number 3, is good health and wellbeing; reducing the number of deaths is part of this goal (goal number 3.9). Increasing birth rates without increasing neonatal and child mortality and, more generally, a decreasing mortality rate are considered signs of the good health and wellbeing of a community. One way to find out where a community stands with regard to this goal is to use statistical forecasts.

A time series is a set of observations ordered in time. The study of time series has become a fertile field, as it is closely related to human life and daily activities [2]. Time series analysis aims to find an appropriate model that simulates and explains the studied behavior, and thereby makes it possible to forecast future values [3]. Forecasting is of great importance in all fields, so researchers and experts have taken an interest in it, developing many predictive techniques and models and using them to obtain more accurate results. Artificial neural network models have proved to be efficient in several areas.

Artificial intelligence is a branch of computer science and information technology, and can be defined as the representation, simulation and modelling of human mental abilities by means of a computer [4]. It has become one of the most active fields in recent times and has many uses, one of which is to predict the future behavior of a phenomenon or event from data observed over a certain period of time. The most commonly used neural networks consist of three kinds of layers: an input layer, one or more hidden layers, and an output layer. The input is fed into the network through the input layer, the information is processed inside the hidden layer(s), and the output is provided by the output layer. The processing elements in each layer are called neurons or nodes. The number of hidden layers and of neurons within them depends on the problem, and is determined by trial and error [5].

Forecasting using artificial neural networks is a modern method that has received wide attention in many fields. It has been widely used because it can explain the behavior of non-linear data, and does not require strict and precise conditions for the purpose of forecasting. In this paper, we forecast deaths in Northern Iraq (Kurdistan Region) using a backpropagation algorithm to train a multi-layer perceptron neural network, to support future planning and improve knowledge of community characteristics. Moreover, none of the studies published in the region has dealt with all the characteristics of the data studied in this paper. We also look at the strengths and weaknesses of births and deaths registry offices in the region, and attempt to remedy data input problems by suggesting a unified data input strategy.

1.1 Literature Review [6,7,8]

In today's culture, accurate mortality estimates are more vital than ever. Because of the societal benefit of more precise death estimates, the field of mortality prediction is expanding and progressing, and several strategies for modeling and predicting mortality have been developed in recent years. For many years, the reference method for estimating deaths has been the Lee-Carter (LC) mortality projection technique (1992). This age-period mortality model looks at the overall time trend, mortality patterns by age, and age disparities in the magnitude of total time change for a given population over a given time period. The Lee-Carter model has inspired many variants, extensions and alternatives since its inception. Renshaw and Haberman (2003) proposed a multi-factor variation. Brouhns et al. (2002) estimated the parameters of an LC model using a Poisson distribution and log-likelihood maximization. Renshaw and Haberman (2006) updated the one-factor LC model to include cohort effects. Young-adult mortality was studied by O'Hare and Li (2012). In his most recent publication, Currie (2016) provides a comprehensive discussion of generalized linear and non-linear mortality models. Han Li Shang and Steven Haberman discuss the recent practice of combining multiple projections rather than relying on just one. Advances in mortality modeling based on machine and deep learning models have also been reported in the literature. In the early 1990s, Kramer (1991) reported a neural-network-based generalization of PCA for the problem of non-linear feature extraction in the chemical engineering literature. To forecast maternal mortality in a Nigerian region, Abdulkarim and Garko (2015) employed a feed-forward neural network with a particle swarm approach. Puddu and Menotti (2012) extended the approach to rural Italy, showing no difference in performance between multi-layer perceptrons and multiple logistic regressions when predicting 45-year all-cause death. Hainaut (2018) investigated a neural network design for mortality analysis with the purpose of extrapolating acceptable non-linearity in the observed force of mortality. In a multiple-population setting, Richman and Wuthrich (2018) employed deep neural networks to determine the parameters of the Lee-Carter model. Deep learning models, particularly recurrent neural networks (RNNs), are gaining popularity in a wide range of forecasting applications, including mortality prediction.

2 Methodology

2.1 Data

In most developing countries the burden of recording a birth or a death falls entirely on the family. In many cases it requires significant effort and expense, and can take several weeks. This partly explains why so many births and deaths are not recorded.

Unfortunately, there are still significant gaps in the availability and quality of this important data in many parts of the world. Only about 65% of all births worldwide are recorded, about a third of the world's annual deaths are recorded through civil registration, and up to 80% of deaths occurring outside health facilities are not counted [9].

In all health directorates in the Iraqi governorates, health and vital statistics departments form part of the Health Planning Directorate. These departments receive health-related data from health facilities either on CDs or paper spreadsheets. There are also Registration Bureaus of Births and Deaths (RBBD) in most regions, and they are responsible for registering births and deaths in the surrounding areas [10]. The RBBD records deaths reported either by hospitals or by relatives. When a death occurs in a hospital, a death certificate is issued containing information about the deceased person (age, gender) and also the probable cause of death, which is confirmed by the doctor [11]. This registration system provides fragmented information, however, especially in rural areas. In the Kurdistan Region, many gaps in the recording of deaths still occur, such as:

  • the death recording process varies from one region to another, with format, accuracy, completeness and accessibility of data being a challenge when processing local statistics.

  • reported data on mortality is inaccurate, since there are no laws forcing relatives to register the death,

  • data collection is mostly paper based, therefore important information is lacking, making vital statistics and the evaluation of outcomes difficult to measure efficiently.

We collected data on 62,000 individuals, including information such as: date of birth, date of death, gender, cause of death, place of death, hospital, from the births and deaths registry bureaus and forensic medicine departments in Sulaimaneyah, Duhok, Zakho, Aqrah, and Amedi. From the collected data we found that the number one cause of death was Diseases of the circulatory system (I00-I99), and the number two cause of death was Neoplasms (C00-D49); External causes of morbidity and mortality (V00-Y99), and Certain infectious and parasitic diseases (A00-B99) are ranked third and fourth respectively. The number of deaths in the age group 60–79 is the highest among all age groups, confirming the indicator of life expectancy in Iraq at 73.6 years. Male life expectancy is 71.7 years, female life expectancy is 75.6 years [12]. We observed a large number of deaths for the age group 0–1; this aspect must be deeply investigated looking at the complete data records of offices, and will be done in the next phase of our work. Generally speaking, the infant mortality rate in Iraq fell from 83.883 deaths per 1000 live births in 1970 to 22.918 deaths per 1000 live births in 2020 [13].

2.2 Methods

In 2015 the University of Rome Tor Vergata (Italy) and the Kurdistan Ministry of Health started cooperating on the implementation of an electronic epidemic monitoring and health surveillance system, called KRG-HIS (Kurdistan Region Governorate Health Information System). This system is currently implemented in 125 primary healthcare centers and hospitals in different Provinces: Duhok, Erbil, Sulaymaniyah and Halabja. The KRG-HIS derives from the DHIS2 (District Health Information Software) [14] free online software system which is based on ICD-10( International Classification of Disease) coding, and is designed to manage healthcare data by collecting, storing, managing and transmitting electronic medical records of patients, such as diagnoses, vaccinations, births and deaths [15].

Part of the information we get from this system relates to the cause of death. According to WHO (World Health Organization), analyzing why people die is one of the key elements to understanding the health situation of a region, and that is why the system is also concerned with this aspect. We collected data on deaths that are recorded in the RBBDs and Forensic Medicine Departments in different areas. The data we collected were not coded or classified, but were written as text in either Arabic or Kurdish. Each individual record was checked and coded according to the WHO international classification of diseases (ICD-10). After coding we placed our data in different categories.
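The coding step can be pictured as a lookup from an ICD-10 code to its class range. The sketch below is a hypothetical illustration, not the script used in this work: the names `ICD10_CLASSES` and `icd10_class` are assumptions, and the table is deliberately partial, containing only the class ranges quoted in this paper.

```python
# Partial, illustrative table: only the class ranges quoted in this paper.
ICD10_CLASSES = [
    ("A00", "B99", "Certain infectious and parasitic diseases"),
    ("C00", "D49", "Neoplasms"),
    ("H60", "H95", "Diseases of the ear and mastoid process"),
    ("I00", "I99", "Diseases of the circulatory system"),
    ("V00", "Y99", "External causes of morbidity and mortality"),
]


def icd10_class(code):
    """Return (range, label) for an ICD-10 code, or None if it is not in the table."""
    # Keep the letter and the first two digits, e.g. "I21.9" -> "I21".
    key = code.strip().upper()[:3]
    if len(key) != 3 or not key[0].isalpha() or not key[1:].isdigit():
        return None
    for start, end, label in ICD10_CLASSES:
        # Letter-plus-two-digit keys compare correctly as plain strings.
        if start <= key <= end:
            return (f"{start}-{end}", label)
    return None


print(icd10_class("I21.9"))
print(icd10_class("c50"))
```

A code outside the listed ranges returns `None`, which makes records that still need a class easy to spot.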

2.3 Aim

The aim of our work is to predict deaths in Kurdistan over the next two years. These predictions are quarterly based. We also aim to predict gender, cause of death, age and governorate. To achieve this, first of all the data were normalized and classified.

The simplest data to classify is gender, with just two possible values: Male (M) or Female (F). Age data is also simple; we chose seven classes: < 1, 1–4, 5–14, 15–34, 35–59, 60–79, 80+. A more complex classification was the cause of death. We chose to use the ICD-10 code, and defined the 21 classes shown in Table 1.

Table 1 Disease surveillance—Duhok events grouped by ICD10 classes

At the end of this step, a single line of data has the following fields:

  • Quarter of the year (related to deaths), e.g. 01/01/2008 to 04/30/2008;

  • Age class, e.g. class 1–4;

  • Hospital, e.g. Heevi Children’s Hospital;

  • Governorate, e.g. Duhok;

  • Cause of death, following the ICD-10 code, e.g. in the range C00–D49.
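The record layout above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the names `DeathRecord`, `age_class`, `quarter_of` and `make_record` are hypothetical, calendar quarters are assumed for the period field, and the example values are taken from the field list rather than from a real record.

```python
from dataclasses import dataclass
from datetime import date

# Upper bounds (exclusive, in years) and labels for the first six age classes.
AGE_CLASSES = [(1, "<1"), (5, "1-4"), (15, "5-14"), (35, "15-34"),
               (60, "35-59"), (80, "60-79")]


def age_class(age_years):
    """Map an age in years to one of the seven age classes."""
    for upper, label in AGE_CLASSES:
        if age_years < upper:
            return label
    return "80+"


def quarter_of(d):
    """Calendar quarter of a date as (year, quarter number 1..4)."""
    return (d.year, (d.month - 1) // 3 + 1)


@dataclass
class DeathRecord:
    quarter: tuple
    age_class: str
    hospital: str
    governorate: str
    icd10_class: str


def make_record(birth, death, hospital, governorate, icd10_range):
    # Age in completed years at the date of death.
    age = death.year - birth.year - ((death.month, death.day) < (birth.month, birth.day))
    return DeathRecord(quarter_of(death), age_class(age), hospital, governorate, icd10_range)


record = make_record(date(2005, 6, 1), date(2008, 2, 10),
                     "Heevi Children's Hospital", "Duhok", "C00-D49")
print(record)
```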

2.4 Statistical Method

The artificial neural network method is a modern knowledge processing method that has become popular, taking on a prominent role worldwide, as it continuously simulates data and non-linear functions to obtain success for a model dealing with the analysis, classification, forecasting, etc. of a phenomenon.

2.4.1 Components of Artificial Neural Networks

An artificial neural network consists of a collection of connected neurons that work in parallel. A neural network consists of three kinds of layers: an input layer, hidden layer(s) and an output layer. It works as a simple system that collects the inputs and offers an answer in the form of a numeric value. A neural network with one input layer and one output layer, in which the inputs are fed directly to the output through a series of weights, is called a single-layer neural network, while a network with one or more hidden layers between the input and output layers is called a multilayer neural network. The system learns by determining the number of neurons in each layer and adjusting the connection weights on the basis of training data.

2.4.1.1 Supervised Learning

When training the network, the required output is known, to be compared with the network output at each learning cycle, and the calculated error is used in the process of adjusting weights until the correct result is obtained, after which the network does not need to be trained.

2.4.1.2 Unsupervised Learning

The learning process occurs without predetermining the desired output. In this case, the network is trained to discover features that are not directly visible in the training dataset, and then uses those features to divide the input data into distinct groups whose members are similar to one another.

In Fig. 1, each of the inputs \(x_i\) has a weight \(w_i\) that represents the strength of that connection. The weighted sum of the inputs plus the bias \(b\) is passed to the activation function \(f(x)\) to obtain the output \(\hat{x}_i\) [16]. Activation functions help the network learn complex relationships and data patterns. The bias is added to the weighted sum of the inputs to the neuron, shifting the activation function to the right or left. The mathematical form of this process can be written as follows:

$$\hat{x}_{i} = f\left( {\mathop \sum \limits_{i = 1}^{n} w_{i} x_{i} + b} \right)$$
(1)
Fig. 1 Single neuron in a neural network
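Equation (1) can be illustrated with a few lines of code. The sketch below is only an example: the function names are hypothetical, the input and weight values are made up, and the sigmoid is used purely as a simple, familiar activation function (the network in this paper uses SELU).

```python
import math


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x = [0.5, 0.2, 0.1]      # made-up inputs
w = [0.4, -0.6, 0.9]     # made-up weights
print(neuron_output(x, w, bias=0.0, activation=sigmoid))
print(neuron_output(x, w, bias=2.0, activation=sigmoid))
```

Raising the bias from 0 to 2 increases the output for the same inputs, which is the shift of the activation function described above.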

2.4.2 Learning Algorithm

Neural networks can be classified into two categories based on the pattern of their connections. In feedforward neural networks, the signals entering the network propagate forward only: all the interconnecting lines run in one direction from the input layer to the output layer, so the outgoing signals depend only on the incoming signals. In feedback (recurrent) neural networks, the outgoing signals can be fed back into the network as incoming signals, so the signal leaving any cell depends both on the signals entering it and on the signals that previously left it.

  • Forward propagation stage: This stage begins by presenting the input pattern to the network, where each processing element of the input layer is allocated to one component of the input. In other words, the network operates as a forward-feeding system, and there is no adjustment of the weights. The general forward propagation equations are [17]:

    $$z^{\left[ l \right]} = w^{\left[ l \right]} a^{{\left[ {l - 1} \right]}} + b^{\left[ l \right]}$$
    (2)
    $$a^{\left[ l \right]} = g^{\left[ l \right]} \left( {z^{\left[ l \right]} } \right)$$
    (3)

    where \(z^{\left[ l \right]}\) is the weighted input (pre-activation) of layer \(l\); \(a^{\left[ l - 1 \right]}\) is the input from layer \(l - 1\); \(a^{\left[ l \right]}\) is the output of layer \(l\); \(g^{\left[ l \right]}\) is the activation function in layer \(l\); \(w^{\left[ l \right]}\) are the weights in layer \(l\); and \(b^{\left[ l \right]}\) is the bias in layer \(l\).

The output of the final layer, which is the output of the network, is denoted by \(a^{\left[ L \right]} = \hat{y}\).

  • Back propagation and weight adjustment stage: This is when the network weights are adjusted. During training, the network outputs are compared with the values supplied from outside (the target values) and the difference between the two is calculated. The error signal is then propagated in reverse, from the output layer to the input layer, and the weights are adjusted accordingly. The process is repeated until the network outputs are sufficiently close to the target values. The standard backpropagation algorithm is gradient descent on the cost function:

    $$C_{0} = \left( {a^{\left[ L \right]} - a^{\left[ 0 \right]} } \right)^{2} = \left( {x - \hat{y}} \right)^{2}$$
    (4)

An important operation in the backward pass is the calculation of derivatives, which are used to update each weight in the network so that the actual output moves closer to the target output. We calculate the partial derivative of the cost function with respect to the weights by using the chain rule. The equations below are written for layer l and apply directly to the output layer (l = L):

$$\frac{{\partial C_{0} }}{{ \partial w^{\left[ l \right]} }} = \frac{{\partial z^{\left[ l \right]} }}{{ \partial w^{\left[ l \right] } }}*\frac{{\partial a^{\left[ l \right]} }}{{ \partial z^{\left[ l \right] } }}*\frac{{\partial C_{0} }}{{ \partial a^{\left[ l \right] } }}$$
(5)
$$\frac{{\partial z^{\left[ l \right]} }}{{ \partial w^{\left[ l \right] } }} = \left( {a^{{\left[ {l - 1} \right]}} } \right)$$
(6)
$$\frac{{\partial a^{{\left[ l \right]}} }}{{\partial z^{{\left[ l \right]}} }} = g^{{\prime \left[ l \right]}} \left( {z^{{\left[ l \right]}} } \right)$$
(7)
$$\frac{{\partial C_{0} }}{{ \partial a^{\left[ L \right] } }} = 2\left( {a^{\left[ L \right]} - a^{\left[ 0 \right]} } \right)$$
(8)

We then use this result to adjust the weights with the following equation:

$$w_{new}^{\left[ l \right]} = w^{\left[ l \right]} - \eta \frac{{\partial C_{0} }}{{\partial w^{\left[ l \right]} }}$$
(9)

where η is the learning rate, which scales the size of the weight updates made to minimize the network's loss function. The direction of each update is given by the negative gradient; η only controls how far the weights move in that direction. This parameter can be fixed or updated adaptively [18]. The learning rate must be chosen carefully, because it has a direct impact on the adjusted weights. This process of adjusting weights is applied to the output layer and all hidden layers, and is repeated until the desired result is obtained.
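The two stages can be put together in a short, self-contained sketch. It is an illustration under stated assumptions rather than the code used in this work: the function names are hypothetical, the hidden layers use tanh and the output layer is linear, the target value is passed explicitly as `target`, and the toy network and training example are made up.

```python
import math
import random


def forward(x, weights, biases):
    """Eqs. (2)-(3): return the pre-activations z and activations a of every layer."""
    activations, zs = [x], []
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(W, b)]
        is_output = l == len(weights) - 1
        a = z if is_output else [math.tanh(v) for v in z]  # linear output layer
        zs.append(z)
        activations.append(a)
    return zs, activations


def train_step(x, target, weights, biases, eta):
    """One gradient-descent update (Eq. 9) on the squared error of one example."""
    zs, acts = forward(x, weights, biases)
    # Error signal at the linear output layer: 2 * (output - target), cf. Eq. (8).
    delta = [2.0 * (a - t) for a, t in zip(acts[-1], target)]
    for l in reversed(range(len(weights))):
        W, b = weights[l], biases[l]
        if l > 0:
            # Propagate the error to the previous layer before W is changed.
            prev_delta = [
                sum(W[j][k] * delta[j] for j in range(len(W)))
                * (1.0 - math.tanh(zs[l - 1][k]) ** 2)  # g'(z) for tanh, cf. Eq. (7)
                for k in range(len(W[0]))
            ]
        for j in range(len(W)):
            for k in range(len(W[0])):
                # Eq. (6): dz/dw is the activation coming from the previous layer.
                W[j][k] -= eta * delta[j] * acts[l][k]
            b[j] -= eta * delta[j]
        if l > 0:
            delta = prev_delta
    # Loss measured before the update was applied.
    return sum((a - t) ** 2 for a, t in zip(acts[-1], target))


random.seed(0)
sizes = [2, 3, 1]  # toy network: 2 inputs, 3 hidden nodes, 1 output
weights = [[[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
losses = [train_step([0.2, 0.7], [0.5], weights, biases, eta=0.05) for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the update on the same example drives the squared error towards zero, which is the behavior the weight adjustment stage relies on.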

The choice of optimizer affects both the speed of convergence and whether convergence occurs at all. Several alternatives to the usual gradient descent algorithm have been developed. Adaptive optimization algorithms such as Adam or RMSprop perform well in the early part of training, but have been found to generalize poorly in later stages compared with stochastic gradient descent.

Before beginning training, we have to choose the appropriate size of the network. This is the most difficult step in the design of neural networks. In addition to the numerous options available for the activation function for each layer, there is the matter of selecting an acceptable number of layers and the choice of the optimal number of nodes in each of those layers. Choosing an improper network size leads to unacceptable results. The trial and error method is the easiest way of selecting the network size. The designer should try a variety of networks and choose the best one.

In this paper, the goal of the designed neural network was to produce, for each row of the dataset, the forecast number of deaths in a single governorate for a specific age class, gender, hospital and ICD-10 class. The network is a multilayer perceptron consisting of one input layer with five nodes, three hidden layers with thirty nodes, and one output layer with one node. The activation function is the scaled exponential linear unit (SELU), and a 'normal' kernel initializer is used to initialize the weights. The loss function, which measures how well the model predicts the expected outcome, is the mean squared error (MSE). The optimizer is Root Mean Square Propagation (RMSprop) with a learning rate of 0.01, and training ran for 500 epochs. RMSprop restrains oscillations in the vertical direction, so the learning rate can be increased and the algorithm can take larger steps in the faster-converging horizontal direction.
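The effect of RMSprop can be seen in a small numerical sketch of its standard update rule. The decay rate `rho = 0.9` and the constant `eps` are common default values assumed here, not settings reported in this paper, and the toy function and the names `rmsprop_step` and `grad` are hypothetical.

```python
import math


def rmsprop_step(params, grads, cache, lr=0.01, rho=0.9, eps=1e-7):
    """Apply one RMSprop update in place.

    cache holds a running average of squared gradients, one entry per parameter.
    """
    for i, g in enumerate(grads):
        cache[i] = rho * cache[i] + (1.0 - rho) * g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


def grad(p):
    """Gradient of the toy function f(p) = 0.5 * (p[0]**2 + 100 * p[1]**2)."""
    return [p[0], 100.0 * p[1]]


# First step: the steep coordinate p[1] has a gradient 100 times larger,
# yet both coordinates move by almost the same amount after rescaling.
p, cache = [1.0, 1.0], [0.0, 0.0]
rmsprop_step(p, grad(p), cache)
first_steps = [1.0 - p[0], 1.0 - p[1]]

# Keep iterating towards the minimum at (0, 0).
for _ in range(299):
    rmsprop_step(p, grad(p), cache)

print(first_steps, p)
```

Because each gradient is divided by the root of its own running average, the steep direction cannot dominate the update, which is why a larger learning rate can be tolerated.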

The SELU activation function was chosen because it has a self-normalizing property that mitigates vanishing gradients, and it is reported to learn faster and better than other activation functions [19]. It also suited the pattern of our dataset, giving more accurate results with the minimum MSE.
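For reference, SELU can be written out in a few lines. The sketch below uses the standard published constants for the function; the names `selu` and `selu_derivative` are chosen here for illustration and are not taken from the code used in this work.

```python
import math

# Standard SELU constants (alpha and scale).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


def selu_derivative(x):
    """Slope of SELU; it stays positive for negative inputs."""
    if x > 0:
        return SELU_SCALE
    return SELU_SCALE * SELU_ALPHA * math.exp(x)


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, selu(value), selu_derivative(value))
```

For positive inputs the function is linear with a slope slightly above 1, and for large negative inputs it saturates at minus the product of the two constants.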

This designed network produced good forecasting results, as shown in the results section. We should not forget that with more data we can obtain better and more accurate forecasting results. All these characteristics are valid for our datasets, and it is not necessary to get the same result for a different dataset.

In contrast to other machine learning methods, neural networks require substantially larger datasets for training. They also require a lot of computing power to be trained. A problem arises when either the dataset or the scale of the neural network has become too large. When the amount of data or the number of layers and neurons in a neural network grows, it usually does not scale well. This behavior is caused by a number of factors:

Non-linear activation functions are common in neural networks. The sequence in which we feed training data into a neural network has an impact on the outcome. Training frequently ends up in local minima of the error function. The only way around this is to train the network several times on different batches of the training dataset, but this multiplies the training effort.

Another issue is the weights of the network. The number of weights grows faster than linearly as the neural network grows in size. Finally, not all neural network topologies are created equal. Different architectures may tackle the same problem with similar precision despite using vastly different amounts of processing. This means that we frequently use trial and error to find the most successful architecture, but we have no theoretical understanding of why it works.
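The growth in the number of weights is easy to check with a short calculation. The sketch below counts weights and biases for a fully connected network; the function name is hypothetical, and the layer sizes assume thirty nodes in each of the three hidden layers, which is one reading of the architecture described in this paper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(count_parameters([5, 30, 30, 30, 1]))   # 2071
print(count_parameters([5, 60, 60, 60, 1]))   # 7741
```

Doubling the width of the hidden layers multiplies the parameter count by almost four, because the weights between two hidden layers grow with the product of their sizes.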

Lastly, we should emphasize that neural networks are often closed systems. While they can learn abstract representations of a dataset, human analysts find it difficult to interpret these representations. This means that, while neural networks can make correct predictions in theory, we're unlikely to gain insights into the structure of a dataset through them [20].

3 Results

We can divide the results of this paper into the following sections:

3.1 Data Processing

After collecting records on 62,000 individuals from the RBBDs, we coded the causes of death according to the WHO classification (ICD-10). Table 2 shows a sample of the prepared data used in this study.

Table 2 Part of the coded data

3.2 Designing the Network

We used the Python programming language to design the neural network, following these steps:

  • Created a csv file for every class (governorates, hospital, age, gender).

  • Created the csv file relating to ICD-10 codes;

  • Developed a script to add the specific ICD-10 class.

  • Developed a script to take a single row from the starting dataset and increase the death counter of the specific row in the new dataset.

  • Developed a script to create a csv file containing data for forecasting. This script had two goals: first, to divide the dataset into true and false examples, where true rows are those with a number of deaths > 1 and false rows are the remaining ones; second, to put the data in a common format for the forecast data.

  • All data was normalized between 0 and 1, for false, true and forecast data.

  • Developed a validation script to set up the validation and training sets (the latter called test in the script) for the neural network. The idea was to take the false and true rows, shuffle them, and use 80% of the rows as a training set and 20% as a validation set.

  • Converted forecast data to a JavaScript Object Notation (JSON) file;

  • Converted elements of the forecast data from numbers to strings, to gain a better explanation of results.
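Two of the steps above, normalization to the range 0 to 1 and the shuffled 80/20 split, can be sketched as follows. This is a minimal illustration and not the scripts used in this work: the function names, the fixed random seed and the sample values are assumptions.

```python
import random


def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def shuffle_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into training and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


print(min_max_normalize([2, 4, 6]))
train, validation = shuffle_split(range(10))
print(len(train), len(validation))
```

Fixing the seed makes the split reproducible, which helps when different network sizes are compared on the same validation set.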

The exact number of records used as a training set was 1,819,272, for the validation set the number was 363,856. The forecast records numbered 279,888. The best designed network for our forecasts consisted of one input layer with five nodes, three hidden layers with thirty nodes and one output layer with one node.

With a learning rate of 0.01 and 500 epochs, we obtained the forecasting results with a loss function Mean Squared Error (MSE) = 0.43, Mean Absolute Error = 0.04, Root Mean Squared Error (RMSE) = 0.65, Akaike information criterion (AIC) = 3.6879, and Bayesian information criterion (BIC) = 9.3469.
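The error measures quoted above follow standard definitions, sketched below for reference. The function name and the sample values are made up for illustration; AIC and BIC are left out because their exact form depends on the likelihood assumed.

```python
import math


def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error and root mean squared error."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}


# Made-up observed and predicted counts, for illustration only.
metrics = regression_metrics([0, 2, 1, 4], [0, 1, 1, 6])
print(metrics)
```

Since the RMSE is the square root of the MSE, an MSE of 0.43 corresponds to an RMSE of roughly 0.656.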

3.3 Forecasting Results

From Table 3, we can conclude that the number one cause of death will be Diseases of the circulatory system (I00-I99) and the number two cause of death will be Neoplasms (C00-D49), while External causes of morbidity and mortality (V00-Y99) and Certain infectious and parasitic diseases (A00-B99) are ranked third and fourth respectively. The least frequent cause of death is Diseases of the ear and mastoid process, coded as H60-H95.

Table 3 Forecast number of deaths classified by cause of death (1/1/2021–30/9/2021)

Table 4 gives the forecast number of deaths by gender. Table 5 shows the forecast number of deaths by age; we conclude that the number of deaths increases in the elderly population, as the number of deaths in the 80+ age group is higher than in all other groups.

Table 4 Forecast number of deaths classified by gender (1/1/2021–31/12/2022)
Table 5 Forecast number of deaths classified by age group (1/1/2021–30/9/2021)

Table 6 shows the forecast number of deaths in the 20 hospitals with the highest number of deaths, starting with Azadi Teaching Hospital, with 2496 deaths.

Table 6 Forecast number of deaths by hospital

In Table 7, we note that the total number of deaths in Sulaimaneyah governorate is greater than in the other governorates; this is because Sulaimaneyah has more residents than the other governorates. According to the population projection by governorate and region for 2019 published by the Iraqi Central Statistical Organization, the total population of Sulaimaneyah is 2,219,194, while the population of Duhok totals 1,326,562 [21].

Table 7 Forecast number of deaths classified by residence of the deceased (Governorate) (1/1/2021–31/12/2022)

The results shown in these tables are part of forecast results that include only one category in each table, since we cannot show all the categories in one table. Also, some tables are so long they could not be presented in this paper. We are working on implementing a web application to show all the results. In the WebApp users can view all the results of the forecast together or the results of a specific category. Users can for example view the total forecast number of deaths in the age class 5–14 in the period 1/1/2021–31/3/2021, and in this specific category we can determine the proportion of males and females, proportions of cause of death, places of death and governorate.
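The kind of query described for the web application can be sketched as a filter and a breakdown over forecast rows. Everything in the example is hypothetical: the function names, the field names and above all the rows, which are made up for illustration and are not results from this paper.

```python
from collections import defaultdict


def total_deaths(rows, **filters):
    """Sum the forecast deaths over the rows that match every filter."""
    return sum(r["deaths"] for r in rows
               if all(r.get(k) == v for k, v in filters.items()))


def breakdown(rows, by, **filters):
    """Forecast deaths per value of one field, within the filtered rows."""
    totals = defaultdict(int)
    for r in rows:
        if all(r.get(k) == v for k, v in filters.items()):
            totals[r[by]] += r["deaths"]
    return dict(totals)


# Made-up rows for illustration only; these are not results from the paper.
forecast = [
    {"quarter": "2021Q1", "age_class": "5-14", "gender": "M", "deaths": 3},
    {"quarter": "2021Q1", "age_class": "5-14", "gender": "F", "deaths": 2},
    {"quarter": "2021Q1", "age_class": "80+", "gender": "F", "deaths": 7},
    {"quarter": "2021Q2", "age_class": "5-14", "gender": "M", "deaths": 1},
]
print(total_deaths(forecast, quarter="2021Q1", age_class="5-14"))
print(breakdown(forecast, by="gender", quarter="2021Q1", age_class="5-14"))
```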

Table 8 shows the total forecast number of deaths by quarterly period between 1/1/2021 and 31/12/2022. From this table we can see a fairly stable situation, with a slight increase in the number of deaths. This is not a bad indicator in view of the continuous increase in the population [22].

Table 8 Total forecast number of deaths

4 Conclusion and Discussion

Setting up a good forecasting model is important because of the impact of its results on the various processes of social and economic planning of a country. In this paper we chose artificial neural networks to forecast mortality in the Kurdistan Region for the next two years because they can produce more accurate and efficient forecasts than traditional statistical methods. The best network design for our data consisted of one input layer with five nodes, three hidden layers with thirty nodes and one output layer with one node. We obtained forecasting results with a loss function Mean Squared Error = 0.43 and Mean Absolute Error = 0.04.

From the results of running our network we observe that the number one cause of death is Diseases of the circulatory system, while the number two cause of death is Neoplasms. The number of deaths among males is greater than among females, and the number of deaths increases in the elderly population, as the number of deaths in the 80+ age group is higher than in all other groups. For the total number of deaths, we observed a fairly stable situation, with a slight increase in the number of deaths, which is not a bad indicator in view of the continuous increase in the population.

The forecast number of deaths by gender, age, hospital, governorate and cause of death, classified according to ICD-10, could not be shown in one table; we are therefore working on a web application to show all the forecast results. In the web application, users will be able to find the forecast results for any specific category.

List of abbreviations

SDGs: Sustainable development goals
LC: Lee-Carter
RNNs: Recurrent neural networks
RBBD: Registration Bureaus of Births and Deaths
KRG-HIS: Kurdistan Region Governorate Health Information System
DHIS: District Health Information Software
ICD: International Classification of Disease
WHO: World Health Organization
SELU: Scaled Exponential Linear units
MSE: Mean Squared Error
RMSprop: Root Mean Square Propagation
M: Male
F: Female
JSON: JavaScript Object Notation
RMSE: Root Mean Squared Error
AIC: Akaike information criterion
BIC: Bayesian information criterion

When we started collecting data for this research, we discovered that most information was paper-based and incomplete. To overcome this problem, we recommend using the KRG-HIS program in all health facilities to collect more correct daily information about health situations. Since its use, the health information system has begun to provide high-quality data, which is essential for planning, policy implementation and monitoring of health outcomes and services. Therefore, if this system is applied in all hospitals and health centers in the whole Kurdistan Region, we can obtain more reliable data and provide timely statistics on the health situation of the region, and (in our case) obtain more accurate forecasts. KRG-HIS is a system that can fill in the gaps for obtaining reliable and timely information on recorded diseases and deaths among the population.

We recommend that statisticians in local health departments use modern forecasting and estimating concepts and technologies, and keep up with the latest developments concerning these methods, as they are of great importance for the health services provided to society.