Introduction

Air pollution can be defined as the phenomenon in which substances that should be present in the air fall outside specified limits, while substances that should not be present occur at levels hazardous to humans, plants, animals and the environment (Cavkaytar et al. 2013). Along with various natural factors, air pollution, which increases in parallel with human factors related to population growth, technological development and industrialization, has long been a global concern. Among all these factors, increasing population density is without doubt one of the key human causes of air pollution. Indeed, the impacts of the natural pace of population growth were visibly augmented by migration from rural to urban areas in Turkey, especially since 1980. According to TurkStat, while Turkey’s population was 43.9 million in 1980, this figure reached 63.2 million in 2000 and 80.8 million in 2017. Not surprisingly, the most dramatic change in Turkey’s demographics has been the rate of urbanization: while the urban–rural divide was 24.2–75.8% in 1927, this ratio has been reversed to 92.5–7.5% as of 2017.

A similar tendency is seen in metropolitan areas as well. For instance, the population of Ankara grew from 3,889,199 in 2000 to 4,771,716 in 2010 and 5,445,026 in 2017, with 94.3% of the city's population living in the center (TurkStat 2018). Such an increase in population and urbanization gives rise to a number of environmental issues. In Ankara, the second largest city and capital of Turkey, air pollution has been a constant problem. Its most important sources are the increase in the number of vehicles in traffic and their consumption of fossil fuels, driven largely by population growth. Although the adoption of natural gas, the phase-out of leaded gasoline and the emphasis on natural gas-powered public transport vehicles initiated in 1988 slowed this increase, Ankara's susceptibility to air pollution in winter has not yet disappeared.

A brief look at the contaminants that threaten human health underlines carbon monoxide (CO), particulate matter (PM2.5 and PM10), sulfur dioxide (SO2), nitrogen oxides (NOx) and ozone (O3), as noted by Li and Shi (2016). Particulate matter (PM), defined as a mixture of solid particles and liquid droplets suspended in the atmosphere, affects human health more than the other pollutants (WHO 2018). PM10 represents thoracic particles whose aerodynamic diameter is less than 10 μm and which can penetrate the lower airways. PM10 pollutants contain airborne particles of human origin, due to the use of carbon-containing fuels, cigarette consumption and various industrial wastes, as well as particles of natural origin mixed into the atmosphere by volcanic gases, seawater vapor, dust storms and forest fires. PM10 causes many lung diseases, especially asthma, as well as cardiovascular diseases (Cavkaytar et al. 2013; Bayram and Dikensoy 2006). The World Health Organization (WHO) set the annual average PM10 standard at 20 μg/m3 and the 24-hour average PM10 standard at 50 μg/m3 in the 2005 Air Quality Directive (WHO 2018). The European Parliament and Council set the annual average PM10 limit at 40 μg/m3 and the 24-hour average PM10 limit at 50 μg/m3 under Directive 2008/50/EC of May 21, 2008, allowing these values to be exceeded at most 35 times per year (European Commission 2018). In Turkey, according to the Chamber of Environmental Engineers, the 24-hour average PM10 limit was set at 80 μg/m3 in 2016 and updated to 60 μg/m3 in 2018. According to the Air Pollution Report for 2017, the PM10 concentrations recorded for Ankara in 2017 exceeded the upper bounds on more days than in 2016. As the Chamber of Environmental Engineers (2018) noted, such an incidence may point at potential hazards to human health.

Note that, among the different particulate matters, PM2.5 could also have been considered in our analysis. PM2.5 comprises particulate matter with a diameter of less than 2.5 μm (micrometers), about 3% of the diameter of a human hair. Since these particles are even smaller than their PM10 counterparts, they are also called fine particles and are more likely to float longer in air. Nevertheless, our analysis could not consider PM2.5 due to the unavailability of data on our side. Another important avenue could be the consideration of the source of the PMs; i.e., a separate analysis of PMs from local sources (road traffic, industries, firewood burning, etc.) and from distant sources (background PMs) could have shed more light on our understanding, basically because the variation in the concentration of PMs throughout the day and over the months of the year differs greatly depending on the source under study. However, we could not pursue this in our analysis due to the unavailability of data on the sources of PM10.

Against this background, in this study, we propose an appropriate time series model for the monthly PM10 amounts observed in Ankara from January 1993 to December 2017 and generate a set of forecasts for the future. The monthly PM10 measurements (μg/m3) for Ankara from January 1993 to December 2017 are used. The measurements between October 1993 and January 2011 are taken from TurkStat, while those between November 2011 and December 2017 are taken from the Air Pollution Monitoring Network of the Ministry of Environment and Urban Affairs. Data are compiled as monthly averages of observations taken from eight stations located in Sıhhiye (39.928105° N, 32.852785° E), Bahçelievler (39.928741° N, 32.823128° E), Dikmen (39.877900° N, 32.834940° E), Cebeci (39.931606° N, 32.877861° E), Demetevler (39.964339° N, 32.780970° E), Keçiören (39.996689° N, 32.811290° E), Sincan (39.966766° N, 32.575474° E) and Kayaş (39.911815° N, 32.964695° E) in Ankara, where the parentheses give their approximate locations.

Considering the inherent seasonality and periodicity of the data, we pursue a periodogram-based investigation. Our forecasting models are obtained through harmonic regression techniques, which account for periodicity alongside classical time series ingredients. The results obtained for the two models are then statistically compared, and the forecast values from both models are compared with the observed values over the course of 2018. These comparisons reveal a substantial forecasting advantage for our proposed estimation framework. What distinguishes the analysis of this study from earlier work is that our methodology makes use of a single time series and extracts the information embedded in it by explicitly handling the repeating behavior, i.e., the seasonality and periodicity. So, even in the absence of other variables, which might be of concern under another analytical scheme, our work shows that a better forecasting performance is viable.

In the remainder of the article, a brief review of the related literature is provided in Sect. 2 and our methodology is introduced in Sect. 3. Section 4 provides the implementation of the method, prior to discussions in Sect. 5.

Literature

As mentioned earlier, people today are exposed to air pollutants beyond standards, especially in areas with high population density, and to the consequent hazards to health (Cavkaytar et al. 2013). When the literature on air pollution is examined, the studies cluster around three main themes: (1) studies intended to determine the relationship between pollutants and meteorological parameters and to identify the most important causes of air pollution, (2) studies that demonstrate the relationship between pollutants and various health indicators and (3) studies intended to forecast various pollutants for the near future.

Among the studies intended to determine the relationship between pollutants and meteorological parameters and to determine the key causes of air pollution, Çiçek et al. (2004) use stepwise regression to exhibit the relationship between SO2, PM10, NO, NO2 and CO values and meteorological parameters such as temperature, wind speed and relative humidity for the period of November 2001–April 2002 in the Sıhhiye District of Ankara. They reveal, especially for March, a middle-level relation between climate elements and SO2, PM10, NO, NO2 and CO concentrations. Genç et al. (2010) employ a multiple linear regression setup for the air pollution index formed by the amounts of PM10 and SO2 taken from different settlements of Ankara in 1999–2000, and they propose an air pollution index (Ic) as in Eq. (1):

$$I_{\text{c}} = 0.212\Delta T + 0.043P + 0.162I_{d - 1} - 1.705V - 27.945,$$
(1)

where ΔT is the daily temperature change, P is the daily mean atmospheric pressure, Id−1 is the air pollution index of the previous day and V is the daily average wind speed. For such a regression model to be applicable, the explanatory variables must be non-random and the dependent variable must consist of independent random observations. Moreover, since the observed values are obtained at unit time intervals, time series models are more suitable than a plain regression model. Regression techniques may still be used, but the data must first meet the requirements of stationarity tests. Finally, Koutrakis et al. (2005) examine the relationships between PM2.5, PM10, PM2.5–10 and meteorological variables using mixed regression models, estimating the specific factors affecting particle concentrations and their relative effects in Santiago, Chile, in 1989–2001. They report significant relationships between meteorological variables and particulate matter amounts, such as the reduction in particulate matter on Sundays as a result of reduced traffic and other polluting activities.
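For illustration, Eq. (1) can be evaluated directly; the input values below are hypothetical and serve only to show the mechanics of the index.

```python
# Evaluating the air pollution index of Eq. (1).
# All input values below are hypothetical, for illustration only.

def pollution_index(delta_t, p, i_prev, v):
    """Air pollution index I_c of Genc et al. (2010), Eq. (1)."""
    return 0.212 * delta_t + 0.043 * p + 0.162 * i_prev - 1.705 * v - 27.945

# Example: 5 degC daily temperature change, 900 hPa mean pressure,
# previous-day index 80 and mean wind speed 2 m/s.
ic = pollution_index(delta_t=5.0, p=900.0, i_prev=80.0, v=2.0)
print(round(ic, 3))  # 21.365
```

Note that the negative coefficient on wind speed captures the dispersing effect of wind on pollutants.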

Silva (2015) is another important work, in which the authors developed an index for air and noise quality by aggregating the relevant measurements and presented the results in comparison with the standardized legal limits for air pollution and noise. As an aid in decision making for urban planners and various policy makers, Silva's (2015) index offers high value added and widespread benefits. Owing to its intuitive methodology, which computes a weighted linear combination of two base indices (for noise and air separately), Silva (2015) provides a good description of the subject matter. The recent work by Ganguly et al. (2019) is also worthwhile with regard to the measurement of air pollution and its connection to other relevant factors. The study provides not only an assessment of measurements at urban stations in comparison with background monitoring facilities, but also the linkages of measured pollutant concentrations to seasonal factors and to the behavior of other pollutants. These two recent papers, in addition to those mentioned before, lay down a solid basis for understanding the problem at hand.

In the second strand, among the studies that investigate the relationship between pollutants and health indicators, Berktaş and Bircan (2003), in order to establish the relationship between asthmatic response and exposure to SO2 and PM10, consider the number of patients admitted to the emergency room (Ankara Atatürk Chest Diseases and Chest Surgery Training and Research Hospital) with complaints of asthma between January 1, 1998, and December 31, 1998, meteorological conditions (daily mean rainfall, actual pressure, relative humidity, wind speed, duration and direction, as well as minimum, average and maximum daily temperatures) and SO2 and PM10 concentrations in the Ankara region. Pearson and Spearman correlation tests and the chi-square test were used to reveal the inherent statistical relationships. A significant correlation was found between the amount of PM10 and the number of admissions, whereas only a low-level association was found between the amount of SO2 and the number of emergency admissions due to asthma. The results show that even short-term exposure to low-level air pollution in Ankara increases emergency room visits for suspected asthmatic reactions. Pope and Dockery (1992) study the acute health impact of respirable particulate pollution on symptomatic and asymptomatic children during 1990–1991 in the Utah Valley. Using logistic regression analysis, they found a positive association between PM10 and respiratory symptoms. Ostro et al. (1999) reported a significant correlation between the PM10 amount and daily mortality in the Coachella Valley by means of Poisson regressions.

The last strand of the reviewed literature is devoted to forecasting various pollutants for the near future. Turgut and Temiz (2015) apply the Box–Jenkins methodology to the weekly PM10 concentration data obtained from the Ankara Sıhhiye station from January 1, 2010, to October 31, 2014, and forecast the future values of PM10. The study forecasts via an ARIMA(3,0,0) specification for November 2014, December 2014 and January 2015, yet omits the seasonal effects inherent to the data. Another study for Ankara Province estimates an Air Quality Health Index comprising concentrations of pollutants such as PM2.5, O3 and NO2 (Bozkurt et al. 2015). Kurt et al. (2008) and Saral (2000) employ neural networks to predict air pollution in Istanbul, Kaplan et al. (2014) use the Levenberg–Marquardt learning algorithm in artificial neural networks for PM10 and SO2 estimations for Kütahya Province and Yüksek et al. (2007) estimate SO2 pollution for the Sivas centrum using a backpropagation artificial neural network framework.

All in all, it is salutary that the literature is by no means silent on air pollution or on its determinants and consequences. Nevertheless, there is a common tendency to use regression rather than time series techniques to study the PM10 pollutant in Ankara, despite the genuine time series structure of the data sets. Indeed, this very time series structure might allow researchers to extract rich information from the data even in the absence of other explanatory variables or modeling peculiarities. Even when a pure time series approach is maintained, improper or absent handling of seasonality is not rare. Among practitioners, it is common to see the standard ADF (augmented Dickey–Fuller) test used instead of the DHF (Dickey, Hasza and Fuller) or HEGY (Hylleberg, Engle, Granger and Yoo) tests even when seasonality is obvious. So, in what follows, we try to avoid these pitfalls through a genuine time series approach to modeling and forecasting.

Materials and methods

The fundamental aim of time series analysis is to forecast the future values of a series using its observed (past) values. Stationarity is the most important assumption for a series to be forecastable: if the series is non-stationary, the forecasts obtained and the statistical inference about the model parameters will not be meaningful. MA series are always stationary, but AR series may not be; moreover, most economic series are non-stationary. In order to forecast with a non-stationary time series, stationarity must first be achieved through various transformations. There are many methods in the literature for testing the stationarity of a time series, but two tests come into prominence in terms of both practicality and applicability: the Dickey–Fuller test, based on the distribution of the least squares estimators of the parameters, and the Phillips–Perron test, which uses the critical values of this distribution. For these methods, the test statistics and p values are directly calculated by many software packages. For seasonal series, however, the DHF method, developed from the distribution of the symmetric least squares estimator, or the HEGY method is used to test stationarity. In either case, an auxiliary regression model is utilized. If there is no suspicion of periodicity in the series, one of the above-mentioned tests can be used for the stationarity test. If the series contains a periodic component, a stationarity test based on periodograms can be used; this test is applicable to seasonal series and remains valid even if the series contains no periodicity (Akdi and Dickey 1999).

Periodograms are usually used to reveal hidden periodicities found in the series (Fuller 1996; Wei 2006; Brockwell and Davis 1987). Akdi and Dickey (1998) proposed a test based on periodograms. General explanations about periodograms are given below.

Periodic behavior suggests trigonometric functions. Thus, to test whether the series contains a periodic component, any time series \(\left\{ {Y_{1} ,Y_{2} , \ldots ,Y_{n} } \right\}\) can be modeled as

$$Y_{t} = \mu + R\cos \left( {wt + \phi } \right) + e_{t} ,\quad t = 1,2, \ldots ,n,$$
(4)

where \(\mu\), \(R\), \(w\) and \(\phi\) are referred to as the expected value, amplitude, frequency and phase, respectively, and these parameters need to be estimated. Furthermore, when \(w_{k} = 2\pi k/n\), the Fourier frequencies are obtained. Owing to the characteristics of cosine functions, for \(\alpha = R\cos \left( \phi \right)\) and \(\beta = - R\sin \left( \phi \right)\), this model can be written as

$$Y_{t} = \mu + \alpha \cos \left( {w_{k} t} \right) + \beta \sin \left( {w_{k} t} \right) + e_{t} , \quad t = 1,2, \ldots ,n.$$
(5)

According to this model, if the null hypothesis \(H_{0} :\alpha = \beta = 0\) is rejected, it is concluded that the data contain a periodic component. The standard F test can be used to test this hypothesis; however, the \(F\) statistic is not appropriate when the \(w_{k}\) frequencies are unknown (Wei 2006). According to this model, the least squares estimators of the \(\mu\), \(\alpha\) and \(\beta\) parameters are, respectively:

$$ \begin{aligned}& \hat{\mu } = \overline{Y}_{n} ,\quad a_{k} = \frac{2}{n}\mathop \sum \limits_{t = 1}^{n} \left( {Y_{t} - \overline{Y}_{n} } \right)\cos \left( {w_{k} t} \right)\quad {\text{and}} \\ &b_{k} = \frac{2}{n}\mathop \sum \limits_{t = 1}^{n} \left( {Y_{t} - \overline{Y}_{n} } \right)\sin \left( {w_{k} t} \right),\end{aligned} $$
(6)

where \(a_{k}\) and \(b_{k}\) are called the Fourier coefficients. Owing to the characteristics of cosine functions, since

$$\mathop \sum \limits_{t = 1}^{n} \cos \left( {w_{k} t} \right) = \mathop \sum \limits_{t = 1}^{n} \sin \left( {w_{k} t} \right) = 0,$$
(7)

the periodogram ordinates at the Fourier frequencies are invariant to the mean of the series. By means of these Fourier coefficients, the periodogram ordinate of the time series at frequency \(w_{k}\) is calculated as

$$I_{n} \left( {w_{k} } \right) = \frac{n}{2}\left( {a_{k}^{2} + b_{k}^{2} } \right).$$
(8)
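Equations (6)–(8) translate directly into code. The sketch below (our illustration, using a synthetic series whose period is known in advance) computes the Fourier coefficients and periodogram ordinates at the Fourier frequencies and recovers the hidden period from the largest ordinate.

```python
import numpy as np

def periodogram_ordinates(y):
    """Fourier coefficients a_k, b_k (Eq. (6)) and periodogram ordinates
    I_n(w_k) (Eq. (8)) at the Fourier frequencies w_k = 2*pi*k/n."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    dev = y - y.mean()                 # mean correction, cf. Eq. (7)
    t = np.arange(1, n + 1)
    w = 2 * np.pi * np.arange(1, n // 2 + 1) / n
    a = (2.0 / n) * (dev @ np.cos(np.outer(t, w)))
    b = (2.0 / n) * (dev @ np.sin(np.outer(t, w)))
    return w, (n / 2.0) * (a**2 + b**2)

# Synthetic monthly-style series with a hidden 12-point period plus noise.
rng = np.random.default_rng(0)
n = 300
t = np.arange(1, n + 1)
y = 10 + 5 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 1, n)

w, I = periodogram_ordinates(y)
period = 2 * np.pi / w[np.argmax(I)]
print(round(period))  # 12: the hidden period is recovered
```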

Time series are usually examined in the time domain or the frequency domain. While the autocorrelation function is central in the time domain, the spectral density function is central in the frequency domain; the Herglotz theorem establishes the correspondence between the two. If \(f\left( {w_{k} } \right)\) is the spectral density function of a stationary time series, then the asymptotic distribution of the statistic \(I_{n} \left( {w_{k} } \right)/f\left( {w_{k} } \right)\) is a chi-square distribution with 2 degrees of freedom (an exponential distribution with expected value 1). That is, the probability density function of the asymptotic distribution of the normalized periodogram is

$$f\left( x \right) = \left\{ {\begin{array}{*{20}l} {e^{ - x} ,} \hfill & {x > 0} \hfill \\ {0,} \hfill & {\text{otherwise}.} \hfill \\ \end{array} } \right.$$
(9)

Hence, the periodogram can be taken as an estimator of the spectral density function. Periodograms are also used to investigate probable periodicities in the data (i.e., to test the hypothesis \(H_{0}\) above). For any stationary time series, the periodogram values \(I_{n} \left( {w_{k} } \right)\) at each frequency \(w_{k}\) are calculated. The statistic \(V\) is defined by

$$V = I_{n} \left( {w_{\left( 1 \right)} } \right)\left[ {\mathop \sum \limits_{k = 1}^{m} I_{n} \left( {w_{k} } \right)} \right]^{ - 1} ,$$
(10)

where \(I_{n} \left( {w_{\left( 1 \right)} } \right)\) is the greatest periodogram value and \(m = n/2\). If there is no periodic component in the data (under \(H_{0} :\alpha = \beta = 0\)), then for the \(V\) statistic

$$P\left( {V > c_{\alpha } } \right) = \alpha \cong m\left( {1 - c_{\alpha } } \right)^{m - 1}$$
(11)

can be written (Wei 2006). For any selected significance level \(\alpha\), the critical value \(c_{\alpha }\) is calculated by

$$c_{\alpha } = 1 - \left( {\alpha /m} \right)^{{1/\left( {m - 1} \right)}} .$$
(12)

If \(V > c_{\alpha }\), the hypothesis \(H_{0} :\alpha = \beta = 0\) is rejected and it is concluded that the series includes a periodic component.
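A sketch of the test in Eqs. (10)–(12), applied to a synthetic series (the series, its length and the significance level are assumptions for illustration):

```python
import numpy as np

def fisher_v_test(y, alpha=0.05):
    """Largest normalized periodogram ordinate V (Eq. (10)) against the
    critical value c_alpha (Eq. (12)); returns (V, c_alpha, reject H0)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    dev = y - y.mean()
    t = np.arange(1, n + 1)
    m = n // 2
    w = 2 * np.pi * np.arange(1, m + 1) / n
    a = (2.0 / n) * (dev @ np.cos(np.outer(t, w)))
    b = (2.0 / n) * (dev @ np.sin(np.outer(t, w)))
    I = (n / 2.0) * (a**2 + b**2)
    V = I.max() / I.sum()                      # Eq. (10)
    c = 1 - (alpha / m) ** (1.0 / (m - 1))     # Eq. (12)
    return float(V), float(c), bool(V > c)

rng = np.random.default_rng(1)
t = np.arange(1, 241)
y = 3 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 1, 240)
V, c, reject = fisher_v_test(y)
print(reject)  # True: the hidden 12-point periodic component is detected
```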

If the given time series is stationary, as stated above, the asymptotic distribution of the normalized periodogram is a chi-square distribution with 2 degrees of freedom. In that case, it is written as

$$I_{n} \left( {w_{k} } \right)/f\left( {w_{k} } \right)\mathop \to \limits^{D} \chi_{2}^{2} , n \to \infty$$
(13)

(Fuller 1996; Wei 2006; Brockwell and Davis 1987). Under the assumption that the series is non-stationary, in other words that it has a unit root, for each constant \(w_{k}\) we have

$$T_{n} \left( {w_{k} } \right) = \frac{{2\left( {1 - { \cos }\left( {w_{k} } \right)} \right)}}{{\hat{\sigma }_{n}^{2} }}I_{n} \left( {w_{k} } \right)\mathop \to \limits^{D} Z_{1}^{2} + 3Z_{2}^{2} , n \to \infty ,$$
(14)

where \(Z_{1}\) and \(Z_{2}\) are independent standard normal random variables and \(\hat{\sigma }_{n}^{2}\) is an estimator of the variance of the error term (Akdi and Dickey 1998). Briefly, the asymptotic distribution is

$$T_{n} \left( {w_{k} } \right)\mathop \to \limits^{D} \chi_{1}^{2} + 3\chi_{1}^{2} , n \to \infty .$$
(15)

Akdi and Dickey (1999) have shown that the method is also applicable to seasonal time series. That is, the statistic \(T_{n} \left( {w_{k} } \right)\) can also be used to test the stationarity of a seasonal series (i.e., whether it has a unit root). Although the asymptotic distribution is valid for each constant \(w_{k}\), the frequency \(w_{1}\) is usually used in hypothesis tests. The critical values of the distribution are given by the authors.

The structure elaborated in Eqs. (4) through (15) has an array of advantages for the modeler with regard to assumptions, distributions of test statistics and accuracy of numerical outcomes. First, no model assumption is needed and the method is invariant to model specifications, as the periodograms can be calculated without reference to model assumptions. Second, no parameter estimation is required except for the variance of the white noise series, as opposed to the standard unit root tests, which require estimated model parameters first. Third, as the distribution of the statistic \(T_{n} \left( {w_{k} } \right)\) is known under both \(H_{0}\) and \(H_{a}\), an analytical power function exists for the test. Fourth, the critical values of the test statistic do not depend on the sample size. Finally, since the periodograms are calculated through trigonometric transformations of the data, any periodic components of the data are captured by the method, a clear strength of the framework that eventually yields more meaningful and accurate results.
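As a sketch of the statistic in Eq. (14), the code below rejects the unit root null for a stationary white noise series. This is our illustrative implementation: in particular, the innovation variance is estimated here from first differences, one simple choice rather than necessarily the estimator of Akdi and Dickey (1998).

```python
import numpy as np

def t_n_statistic(y, k=1):
    """Periodogram-based unit root statistic T_n(w_k) of Eq. (14).
    The innovation variance is estimated from first differences
    (an illustrative choice)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1)
    w = 2 * np.pi * k / n
    dev = y - y.mean()
    a = (2.0 / n) * np.sum(dev * np.cos(w * t))
    b = (2.0 / n) * np.sum(dev * np.sin(w * t))
    I = (n / 2.0) * (a**2 + b**2)
    sigma2 = np.var(np.diff(y), ddof=1)
    return 2 * (1 - np.cos(w)) * I / sigma2

rng = np.random.default_rng(2)
y = rng.normal(0, 1, 500)      # a stationary series
stat = t_n_statistic(y)
# Small values of T_n(w_1) reject the unit root null; 0.0348 is the
# 1% critical value reported by Akdi and Dickey (1998).
print(stat < 0.0348)  # True: non-stationarity is rejected
```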

Analysis

As mentioned earlier, a chief capability of the methodology maintained in this paper is to reveal the information embedded in a time series by focusing solely on the series itself, i.e., without explicitly referring to other variables. Such a capability allows us to keep some factors outside the analysis. For instance, the prevailing meteorological conditions, such as wind direction and wind speed, are not under our direct consideration. The bright side is that this omission does not reduce the information content of our findings, since the impacts of these factors are already established in the evolution of the time series over time. Equivalently, the numerous effects that seem to have been omitted are already captured as an array of periodic components in our periodogram-based analysis. Consequently, in this section the monthly PM10 measurements (μg/m3) for Ankara from January 1993 to December 2017 are used. The measurements between October 1993 and January 2011 are taken from TurkStat, while those between November 2011 and December 2017 are taken from the Air Pollution Monitoring Network of the Ministry of Environment and Urban Affairs. Data are compiled as monthly averages of observations taken from eight stations located in Ankara. The eight missing observations from 2006 to 2007 were interpolated using the behavior of 1993:01–2006:03 (Fig. 1).

Fig. 1
figure 1

Time series graphs of the PM10 data

Although there is a significant reduction in PM10 amounts, especially after the substitution of coal with natural gas in the late 1990s, the values are still above the European Union, World Health Organization and United Nations standards. When the monthly behavior of the PM10 values is considered (Table 1), the averages for the winter months are seen to be higher than those for the other months. The highest values are observed in November, December and January; the lowest values are observed in June, July and August, though they do not fall below international standards.

Table 1 Monthly average PM10 quantities

To see whether the PM10 values differ across months, the following one-way ANOVA (analysis of variance) model is considered:

$$y_{ij} = \mu + \alpha_{i} + e_{ij} ,\quad i = 1,2, \ldots ,12,j = 1,2, \ldots ,25,$$
(16)

where the \(e_{ij}\)’s are independent normally distributed errors. As a result of the analysis (Table 2), the null hypothesis \(H_{0} :\alpha_{1} = \alpha_{2} = \cdots = \alpha_{12} = 0\) is rejected (\(F\,{\text{value}} = 34.33\), \(p\,{\text{value}} < 0.0001\)), i.e., air pollution is said to differ across months.

Table 2 Generalized linear model procedure
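The same kind of one-way ANOVA can be sketched with scipy; the monthly means below are invented stand-ins for Table 1, not the actual PM10 data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical monthly mean levels (higher in winter), 25 values per month,
# mimicking the 12 x 25 layout of the model in Eq. (16).
monthly_means = [90, 75, 65, 55, 45, 40, 38, 40, 50, 65, 95, 100]
groups = [rng.normal(mu, 10, 25) for mu in monthly_means]

f_value, p_value = stats.f_oneway(*groups)
# A small p-value rejects H0: alpha_1 = ... = alpha_12 = 0, i.e.,
# the mean level differs across months.
print(p_value < 0.0001)  # True
```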

Time-domain approach

To determine a suitable time series model for the data, different candidates are considered and the model with the smallest AIC (Akaike information criterion) value (Table 3) is chosen.

Table 3 Values of the AIC statistics

Considering the results in Table 3, it can be said that the AR(13) is the most suitable model (Model I) for the data. Accordingly, the model considered is

$$ \begin{aligned}Y_{t} &= \alpha_{0} + \alpha_{1} Y_{t - 1} + \alpha_{2} Y_{t - 2} + \cdots + \alpha_{12} Y_{t - 12} \\ &\quad+ \alpha_{13} Y_{t - 13} + e_{t} , \quad t = 1,2, \ldots ,n,\end{aligned} $$
(17)

where \(e_{t} \sim WN\left( {0,\sigma^{2} } \right)\). The parameter estimates for this model are given in Table 4.

Table 4 Parameter estimates for the Model I

According to the results obtained with both the PROC ARIMA and the PROC REG, the most suitable model (Model II) for the data is

$$Y_{t} = \alpha_{0} + \alpha_{1} Y_{t - 1} + \alpha_{12} Y_{t - 12} + \alpha_{13} Y_{t - 13} + e_{t} ,\quad t = 1,2, \ldots ,n.$$
(18)

The value of the AIC statistic obtained for this model, on the other hand, is very close to that obtained for the AR(13) model. The parameter estimates are presented in Table 5. The results show that the model parameters are significant.

Table 5 Parameter estimates for the Model II
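Model II in Eq. (18) is a subset AR model with lags 1, 12 and 13. A minimal sketch of fitting such a model by ordinary least squares, on a synthetic seasonal series standing in for the PM10 data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 312
t = np.arange(n)
# Synthetic seasonal series (illustrative only, not the PM10 data).
y = 60 + 20 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 5, n)

# Subset AR with lags 1, 12 and 13, as in Eq. (18), fit by OLS.
lags = [1, 12, 13]
p = max(lags)
# Row for time index s regresses y[s] on y[s-1], y[s-12], y[s-13].
X = np.column_stack([np.ones(n - p)] + [y[p - L:n - L] for L in lags])
coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
alpha0, alpha1, alpha12, alpha13 = coef

# One-step-ahead forecast from the last observed values.
y_next = alpha0 + alpha1 * y[-1] + alpha12 * y[-12] + alpha13 * y[-13]
print(coef.round(3))
```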

Stationarity

If the sum of the estimated parameter values in the model is close to 1, this indicates that the model may be non-stationary (i.e., close to having a unit root). When the estimation results obtained for Model II are considered (Table 5), the two procedures differ from each other, with \(\hat{\alpha }_{1} + \hat{\alpha }_{12} + \hat{\alpha }_{13} \cong 0.9717\) for PROC ARIMA and \(\hat{\alpha }_{1} + \hat{\alpha }_{12} + \hat{\alpha }_{13} \cong 0.8361\) for PROC REG. In order to determine whether the assumption of stationarity holds, unit root tests have been performed. The results of the ADF and PP (Phillips–Perron) unit root tests with a maximum lag length of 13 are presented in Table 6. The results show that the series is stationary.

Table 6 Results of the ADF and PP unit root test

After determining that the series is stationary, 12 monthly forecast values for 2018 are calculated using Model II, the model considered appropriate for the data. The forecast values, their standard errors and 95% confidence limits are given in Table 7. When the forecast values are examined, the highest value is expected in November, followed by December and February.

Table 7 PM10 forecasts for 2018 year (PROC ARIMA)

Owing to the advantages offered by the periodogram-based unit root test, the stationarity of the series is also tested with periodograms. For Model II, the periodogram value and the variance estimate are calculated as \(I_{n} \left( {w_{1} } \right) = 6133.66\) and \(\hat{\sigma }_{n}^{2} = 322.2444\), respectively. So, the value of the test statistic is computed as

$$T_{n} \left( {w_{1} } \right) = 2\left( {1 - \cos \left( {w_{1} } \right)} \right)*I_{n} \left( {w_{1} } \right)/\hat{\sigma }_{n}^{2} = 0.008349.$$
(19)

Following Akdi and Dickey (1998), the critical values of the test statistic are given as

$$ \begin{aligned}& P\left( {T_{n} \left( {w_{1} } \right) \le 0.0348} \right) = 0.01,\;P\left( {T_{n} \left( {w_{1} } \right) \le 0.178} \right) = 0.05,\;\\ &P\left( {T_{n} \left( {w_{1} } \right) \le 0.368} \right) = 0.10.\end{aligned} $$
(20)

So, the null hypothesis that the series is not stationary is rejected even at the significance level \(\alpha = 0.01\) (\(T_{n} \left( {w_{1} } \right) = 0.008349 < 0.0348\)), and hence at \(\alpha = 0.05\) as well (\(0.008349 < 0.178\)).
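The arithmetic in Eq. (19) can be reproduced directly from the reported quantities; with n = 300 monthly observations, w_1 = 2π/300.

```python
import math

n = 300                 # number of monthly observations, 1993-2017
w1 = 2 * math.pi / n
I_w1 = 6133.66          # reported periodogram value at w_1
sigma2 = 322.2444       # reported variance estimate

t_n = 2 * (1 - math.cos(w1)) * I_w1 / sigma2
print(round(t_n, 6))  # 0.008349, matching Eq. (19)
```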

Periodicity

We then investigate whether there is hidden periodicity in the series by means of periodograms. Figure 2 displays the periodogram against frequencies from 0 to \(\pi\).

Fig. 2
figure 2

Graph of the periodograms against the frequencies

Figure 2 suggests that one of the periodogram ordinates differs markedly from the others. This can be regarded as a sign of a possible periodicity in the data (Wei 2006). The five largest periodogram values obtained from the data, the corresponding frequencies and their periods are given in Table 8. Also, the sum of the periodogram values over all frequencies is calculated as 218,213.92.

Table 8 Largest 5-periodogram values

The statistic \(V\) is

$$V = \left[ {\mathop \sum \limits_{k = 1}^{n/2} I_{n} \left( {w_{k} } \right)} \right]^{ - 1} \hbox{max} \left\{ {I_{n} \left( {w_{k} } \right)} \right\} = \frac{113026.59}{218213.92} = 0.51796228,$$
(21)

and the corresponding critical values, using the approximation \(P\left( {V > c_{\alpha } } \right) = \alpha \cong m\left( {1 - c_{\alpha } } \right)^{m - 1}\) (Wei 2006), are calculated as \(c_{\alpha } = 1 - \left( {\alpha /m} \right)^{{1/\left( {m - 1} \right)}}\), where \(m\) is defined by

$$m = \left\{ \begin{array}{*{20}l} {\left( {n - 1} \right)/2,} \hfill & {{\text{if}}\;n\;{\text{is}}\;{\text{odd}}} \hfill \\ {\left( {n/2} \right) - 1,} \hfill & {{\text{if}}\;n\;{\text{is}}\;{\text{even}}.} \hfill \\ \end{array} \right.$$
(22)

To examine whether the other frequencies also exhibit periodicity, the following statistic is used:

$$V_{i} = \frac{{I_{n} \left( {w_{\left( i \right)} } \right)}}{{\mathop \sum \nolimits_{k = 1}^{n/2} I_{n} \left( {w_{k} } \right) - \mathop \sum \nolimits_{k = 1}^{i - 1} I_{n} \left( {w_{\left( k \right)} } \right)}},$$
(23)

where \(I_{n} \left( {w_{\left( i \right)} } \right)\), \(i = 1,2,3,4,5\), are the periodogram values in descending order. From these, \(V_{1} = 0.518\), \(V_{2} = 0.088\), \(V_{3} = 0.064\), \(V_{4} = 0.063\) and \(V_{5} = 0.043\) are obtained. For \(\alpha = 0.01\), \(\alpha = 0.05\) and \(\alpha = 0.10\), the critical values are \(c_{0.01} = 0.0624972682\), \(c_{0.05} = 0.0523158530\) and \(c_{0.10} = 0.047896912\).

If \(V > c_{\alpha }\), the null hypothesis that the series contains no periodic component is rejected, and it is concluded that the series has a periodic component. According to the obtained results, since \(V = 0.5180 > c_{0.01} = 0.0625\), the data contain a periodic component. The period of the series is obtained as \(P = 2\pi /w_{k} = 2\pi /0.52360 = 12\), an expected result because the data are monthly.

On the other hand, although \(V_{1} > c_{\alpha }\), \(V_{2} > c_{\alpha }\), \(V_{3} > c_{\alpha }\) and \(V_{4} > c_{\alpha }\), we have \(V_{5} < c_{\alpha }\). This shows that there is no periodicity at the frequency corresponding to the fifth largest periodogram value; therefore, only the periodic components corresponding to the first four frequencies need to be considered.
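The stepwise statistics of Eq. (23) can be computed directly from the periodogram values. The helper below is an illustrative sketch (the function name and toy input are ours); its first step reproduces \(V_{1}\) from the quantities reported in Eq. (21).

```python
import numpy as np

def stepwise_V(I, k=5):
    """Stepwise statistics V_i of Eq. (23): the i-th largest periodogram
    value divided by the total with the i-1 larger values removed."""
    s = np.sort(np.asarray(I, dtype=float))[::-1]  # descending order
    total = s.sum()
    out, removed = [], 0.0
    for i in range(min(k, len(s))):
        out.append(s[i] / (total - removed))
        removed += s[i]
    return out

# the first step reproduces the paper's V1 from Eq. (21):
V1 = 113026.59 / 218213.92
print(round(V1, 5))  # 0.51796
```

Each \(V_{i}\) is then compared with \(c_{\alpha}\) exactly as for \(V\); the loop stops contributing significant frequencies once \(V_{i} < c_{\alpha}\), which here occurs at \(i = 5\).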

The harmonic regression model

Given that the series is stationary and contains periodic components, the following regression model (Trigonometric Model I) is assumed to be appropriate for the data:

$$\begin{aligned} Y_{t} & = \mu + A_{1} \cos \left( {\frac{2\pi t}{12}} \right) + B_{1} \sin \left( {\frac{2\pi t}{12}} \right) \\ &\quad+ A_{2} \cos \left( {\frac{2\pi t}{12.5}} \right) + B_{2} \sin \left( {\frac{2\pi t}{12.5}} \right) \\ & \quad + A_{3} \cos \left( {\frac{2\pi t}{300}} \right) + B_{3} \sin \left( {\frac{2\pi t}{300}} \right) + A_{4} \cos \left( {\frac{2\pi t}{6}} \right) \\ & \quad + B_{4} \sin \left( {\frac{2\pi t}{6}} \right) + e_{t} , \quad t = 1,2, \ldots ,300. \\ \end{aligned}$$
(24)

Estimates of model parameters are given in Table 9.

Table 9 Parameter estimates for the Trigonometric Model I

Based on the p values in Table 9, it is decided that the appropriate model (Trigonometric Model II) is

$$\begin{aligned} Y_{t} & = \mu + A_{1} \cos \left( {\frac{2\pi t}{12}} \right) + B_{1} \sin \left( {\frac{2\pi t}{12}} \right) + A_{2} \cos \left( {\frac{2\pi t}{12.5}} \right) \\ & \quad + B_{2} \sin \left( {\frac{2\pi t}{12.5}} \right) + A_{3} \cos \left( {\frac{2\pi t}{300}} \right) + B_{3} \sin \left( {\frac{2\pi t}{300}} \right) \\ & \quad + A_{4} \cos \left( {\frac{2\pi t}{6}} \right) + e_{t} , \quad t = 1,2, \ldots ,300. \\ \end{aligned}$$
(25)

The parameter estimates obtained according to this model are as given in Table 10.

Table 10 Parameter estimates for the Trigonometric Model II

Based on the Trigonometric Model II, the following prediction model is established for the monthly PM10 data:

$$\begin{aligned} \hat{Y}_{t} & = 59.93 + 27.27\cos \left( {\frac{2\pi t}{12}} \right) + 3.14\sin \left( {\frac{2\pi t}{12}} \right) + 3.71\cos \left( {\frac{2\pi t}{12.5}} \right) \\ & \quad - 6.93\sin \left( {\frac{2\pi t}{12.5}} \right) + 4.76\cos \left( {\frac{2\pi t}{300}} \right) - 4.27\sin \left( {\frac{2\pi t}{300}} \right) \\ & \quad + 5.84\cos \left( {\frac{2\pi t}{6}} \right). \\ \end{aligned}$$
(26)
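The fitted prediction equation (26) is a simple closed form, so forecasts follow by direct evaluation. The sketch below (function name ours) encodes the estimated coefficients; forecasts are obtained by extending \(t\) past the sample, e.g., \(t = 301, 302\) for the first two months of 2018.

```python
import numpy as np

def pm10_hat(t):
    """Prediction equation (26) for monthly PM10, t = 1, 2, ..., 300;
    values of t beyond 300 give forecasts."""
    t = np.asarray(t, dtype=float)
    return (59.93
            + 27.27 * np.cos(2 * np.pi * t / 12) + 3.14 * np.sin(2 * np.pi * t / 12)
            + 3.71 * np.cos(2 * np.pi * t / 12.5) - 6.93 * np.sin(2 * np.pi * t / 12.5)
            + 4.76 * np.cos(2 * np.pi * t / 300) - 4.27 * np.sin(2 * np.pi * t / 300)
            + 5.84 * np.cos(2 * np.pi * t / 6))

# forecasts for the first two months of 2018
print(pm10_hat([301, 302]))
```

At \(t = 0\) all sine terms vanish and the cosines equal one, so the function returns the sum of the intercept and the cosine coefficients, a quick sanity check on the encoding.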

Figure 3a shows the observed PM10 values together with the predictions calculated from this model, and Fig. 3b shows the observed PM10 values together with the predictions calculated from the ARIMA model (Model II).

Fig. 3
figure 3

a Observed PM10 values and predictions of the Trigonometric Model II. b Observed PM10 values and predictions of the Model II. c The comparison of the monthly averages

The monthly average PM10 values observed for the last 23 years and the monthly average prediction values obtained from the ARIMA (Model II) and the harmonic regression (Trigonometric Model II) are presented in Fig. 3c. Note that the observations in the first 13 months are not used in the predictions, so the graphs cover the averages over the last 23 years.

As shown in Fig. 3c, the monthly average predictions obtained by the harmonic regression are closer to the realized averages than those obtained by the ARIMA. The sum of the squared distances between these averages and the realized averages is calculated as

$$ \begin{aligned}& SS\left( {\text{ARIMA}} \right) = \mathop \sum \limits_{i = 1}^{12} \left( {\overline{y}_{i} - \overline{y}_{a,i} } \right)^{2} = 470.18105, \quad \\ &SS\left( {\text{HARMONIC}} \right) = \mathop \sum \limits_{i = 1}^{12} \left( {\overline{y}_{i} - \overline{y}_{h,i} } \right)^{2} = 279.63572,\end{aligned} $$
(27)

where \(\overline{y}_{i}\) are the realized monthly averages, \(\overline{y}_{a,i}\) are the monthly average predictions obtained by the ARIMA and \(\overline{y}_{h,i}\) are those obtained by the harmonic regression. These sums of squares again show that the harmonic regression predictions are closer to the realized values than the ARIMA predictions. This result is even more apparent in Table 11, which presents the realized values for the first two months of 2018 together with the forecasts calculated from the predictions of both methods.

Table 11 PM10 values forecast for 2018
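The comparison of Eq. (27) reduces to a sum of squared distances between monthly averages. The sketch below illustrates the computation; the monthly averages shown are hypothetical placeholders, since the paper's raw averages are only reported graphically (for the real data, the paper obtains SS(ARIMA) = 470.18105 and SS(HARMONIC) = 279.63572).

```python
import numpy as np

def ss_distance(realized, predicted):
    """Sum of squared distances between monthly averages, Eq. (27)."""
    r = np.asarray(realized, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sum((r - p) ** 2))

# hypothetical monthly averages, purely for illustration
realized = [90, 80, 65, 55, 45, 38, 35, 37, 48, 60, 75, 88]
harmonic = [88, 79, 66, 54, 46, 39, 36, 38, 47, 61, 74, 87]
arima    = [84, 74, 60, 50, 42, 36, 34, 36, 44, 56, 70, 82]
print(ss_distance(realized, harmonic) < ss_distance(realized, arima))  # True
```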

When Fig. 3a and b are examined, it is seen that the predicted values obtained from both models are very close to the observed values.

When the forecast values given above are examined, it is seen that the forecasts obtained from both models are lower in the summer months (May, June, July and August); lower air pollution values in the summer months are expected. The forecast values obtained from the harmonic model are much closer to the monthly averages, whereas those obtained with the ARIMA are well above them. Accordingly, the harmonic regression model can be said to be more consistent and reliable than the ARIMA. This result can also be seen in Fig. 3c. The monthly average values and the graph of the prediction equation are presented in Fig. 4.

Fig. 4
figure 4

Monthly averages and the graph of prediction equation (the harmonic regression model)

Results and discussion

The model estimates and the forecasts they yield, presented in Sect. 4, proved superior to those produced by the rival framework, as evidenced by the comparison of the alternative forecasts with the actuals. The analytical approach of this paper thus offers sizable benefits to analysts and policy makers, benefits further augmented by the reduced cost of modeling as a professional effort. Among these, the absence of model assumptions is appealing to any researcher or analyst: the ability, especially for the field specialist, to compute periodograms without reference to model assumptions improves the reliability of institutional projection practices. Similarly, as opposed to standard unit root tests that require preliminary estimation of model parameters, the approach maintained here removes that need, except for the variance of the white noise series.

The third advantage, from a purely academic rather than fieldwork perspective, is that the distribution of the statistic \(T_{n} \left( {w_{k} } \right)\) is known under both \(H_{0}\) and \(H_{a}\), implying the existence of an analytical power function for the test. That the critical values of the test statistic do not depend on the sample size provides another strength.

Finally, the periodograms, being calculated through trigonometric transformations of the data, can capture practically any periodic component, promising more meaningful and accurate results. In practice, analysts or policy makers can easily obtain forecasts from our framework simply by substituting a particular value of \(t\), the forecast period, into the estimated function.

In a nutshell, the harmonic regression approach presented in this paper offers ease and flexibility to field practitioners while providing modelers with desirable statistical properties.

With regard to maintaining a univariate modeling scheme, we again adopt a benefit–cost perspective. Following the widespread principle of parsimony, also called Occam's razor, among models that yield similar explanatory or forecasting performance, the one(s) with fewer explanatory factors are to be preferred. This simple-looking yet strong principle ensures that a researcher avoids redundant factors in a well-guided manner. In the case presented in this study, our pure time series approach based on periodograms rests on (1) the embodiment of the periodic and non-periodic effects of all related factors in the time series of interest and (2) the ability to express any time series as a sum of properly defined sinusoids. As a matter of fact, these allowed us to produce quality forecasts even without explicit reference to potentially relevant factors such as weather conditions or the behavior of other pollutants.