Introduction

Time-series analysis is the most frequent and commonly used method for assessing the association between counts of health outcomes over a considered period of time and exposure to ambient air pollution. The problem of controlling for unmeasured confounders, such as seasonal and long-term trends, in the constructed statistical models, is not fully resolved. Time still remains to be the main concern in modelling Poisson time-series data. There are many criteria to determine an optimal number of knots used in the applied splines for representing the smooth function of time.

The case-crossover (CC) design introduced by Maclure (1991) eliminated time from the model. This fact probably is among the reasons that the CC method is so popular and attractive. In the CC method, time is controlled by the realized design. The time-stratified approach (Janes et al. 2005) to determine the control periods to an event period separates time into 1-month segments. The standard CC method uses events (individual cases) and assigns the corresponding controls located in one common month. The case and control days are chosen to be the same days of the week (dow). In the CC method, time is still present in relation between case periods and control periods, as the controls are pre- and post-event time. It was especially influential in the original presentation of the CC method (Maclure 1991), where the control period was always before the case period. Such unidirectional (or symmetrical to the event time) strategy to assign controls may result in bias (Janes et al. 2005).

In 2006, Szyszkowicz (2006) proposed considering daily counts of the health events on the hierarchical clusters of the form <year:month:dow>. Such a structure is naturally embedded in the hierarchy of the calendar year. The statistical methodology used was mixed-effect models (Szyszkowicz 2006). In the publication, it was also observed that the structure of the form <year:2 weeks:dow> could also be implemented in the proposed models. It reduced each cluster to the same two weekdays during a 2-week period; and therefore, each cluster has only two data points. Since two points determine a line, the slope related to the association between exposure and health counts on these two-point clusters can also be determined. In another publication (Szyszkowicz and Burr 2016), an approach to construct the chained clusters was presented, where each day was used twice in neighboured two point clusters. This technique increases the number of clusters and results in a smaller standard error of the estimated slope.

Further advancement in the statistical methodology proposed the use of a conditional Poisson regression (Armstrong et al. 2014) as a flexible alternative to the standard CC method, with stratum defined as three levels hierarchical cluster <year:month:dow>. As in the previous publications (Szyszkowicz 2006; Szyszkowicz and Burr 2016), daily counts are used here, not individual events. In these approaches, adjustment for day of week (dow) is realized by the design of the study.

In this paper, we propose to use clusters of the following form <year:2 days>. Therefore, in 1 year, there are 183 clusters of such type. The clusters are different during each year, as each year (data) has its own specific characteristics. It is the smallest possible time-window in a time-series study of daily counts; just neighboured 2 days are considered a common group. The conditional Poisson regression model was applied using the 2-day (2D) stratum, where day of week was adjusted in the constructed model. The impact of weekdays is modelled, and the time-window considered is narrowed to the smallest one.

Methods

The presented methodology is based on clusters of the form <year:2 days>. For example, the days of a year are grouped as follows: (1—first day of a year, 2—second day of a year), (3, 4), (5, 6), etc. For each pair, we assigned the cluster number: 1 for (1, 2), 2 for (3, 4), 3 for (5, 6), and so on. For example, the days January 1 and January 2 will have cluster number 1. Therefore, the clusters (2000, 1) and (2001, 1) are different, as they belong to different years. This construction results in 183 clusters per year, where the cluster number 183 has 2 days only in leap years. These clusters are used to control time. Time smoothers are not required in the models. The clusters are considered as time units. In addition, days of the week are adjusted for in the final model, which imposes another control for time. In this presentation, conditional Poisson regression model was applied for daily counts (Armstrong et al. 2014), where stratum was defined as the constructed 2-day cluster. This approach is referred to as the 2D method. The model has the following form in the R software (package gnm).

Model2D = gnm (Counts ~ AirPollution + dow + covariates, family = poisson, eliminate = factor(stratum)).

The calculations were carried out with the option family=quaspoisson, which allows for overdisperrsion in the used Poisson models. Other covariates can be included in the model, although dow should always be present to control effects of weekdays as other forms of time influence. In addition, for comparison purpose, the clusters of the form <year:3 days> were also used with 122 such structures per year. This approach is referred to as the 3D method.

Very detailed analysis of the impact of time (seasonal, trends, and concurvity) was carried out by the authors in the study related to air pollution in the form of particulate matter exposure and mortality (Peng et al. 2006). The authors generated simulated data for four different scenarios. The scenarios reflect the intensity of the used smoothing for time and air pollution with two effects of concurvity levels, moderate and high. The results from the simulated data should not demonstrate any effect, i.e. there are no associations with the considered exposures and health outcomes. The true slopes (Beta) are zero. Four scenarios were considered and described in greater detail in (Peng et al. 2006): scenario 1 has more smoothing on time than on air pollution and moderate concurvity; scenario 2 similar to scenario 1 but with high concurvity; scenario 3 has more smoothing on air pollution than on time and moderate concurvity; and scenario 4, similar to scenario 3 but with high levels of concurvity.

Data from Minneapolis-St Paul for the years 1987–1994 were used to calculate “true” parameter estimates in order to specify smoothing functions, which were applied to the controlled corresponding simulations. Daily counts and air pollution were simulated 500 times for each considered scenario (Peng et al. 2006). Temperature data remain fixed and is the same in each of the simulations.

Data from the corresponding repository (Data from Peng et al. 2006) representing scenarios 1–4 were used in the form of compressed R workspace files. Each file contains 500 datasets corresponding to each of the four scenarios being considered. Therefore, for each scenario, 500 estimations of the slope are generated. In the current approach, the calendar dates starting from January 1, 1987 and ending on December 31, 1994 (2922 days) were added to the data files. The dates allow us to construct the two-point clusters. In a similar manner, 3-day clusters (3D) were created in order to fit the 3D model. Both variables of daily counts and daily concentrations of air pollution (particulate matter) have simulated values and were used in the analysis. Temperature was represented in the model in the form of natural spline with three degrees of freedom.

For comparison purposes, the same data were assessed using the standard CC method on clusters of the form <year:month:dow>; where each cluster includes 4 or 5 days (referred as CCM, where M is for month). The number of days depends on day of the week and length of the month. Also, the CC method was applied on the clusters of the form (year:2 weeks:day of week> (referred as CC2W, where W is for week). These clusters are composed of two of the same weekdays with weekdays separated by 7 days in the calendar, e.g. Monday to Monday of the following week. This differs from the 2D structure, which has two neighboured days in the calendar.

The results from the simulated data should produce estimated slopes (Beta) equal to zero; therefore, real data was also considered where the true value of the slopes was unknown. The considered database represents mortality in Chicago for the years 1987–2000 (Wand 2003; Wood 2006). The endpoints of interest included all non-accidental death (Death), cardiovascular (CVD) death, respiratory (Resp) death, a combination of cardiovascular and respiratory (CVD&Resp) deaths, and other (Others) deaths defined as mortality not related to cardiovascular and respiratory. This last category was defined as follows: Others = Deaths – CVD&Resp, where CVD&Resp is a summary of both indicated types. These data were analysed using the models presented in this paper, namely 2D, 3D, CCM, and CC2W. Ambient air pollution exposure particulate matters (PM10) and ozone (O3) were considered in the model, as well as temperature which were represented by natural splines of three degrees of freedom.

The main objective of this paper is to present and compare the 2D method and its performance to the CCM method when the models are applied to simulated data (scenarios 1–4) and real Chicago (1987–2000) data. The 2D methodology can be extended to have clusters composed with many days. Following the proposed 3D clusters construction, by analogy one could also form 4D, 5D, and 6D clusters or include even more days, e.g. MD clusters, where the clusters MD are constructed using all days in 1 month.

In this presentation, we used and tested the following methods: CCM, CC2W, 2D, and 3D on four sets of the simulated data and on Chicago mortality data. In the appendix are provided the results for another mortality data generated by various forms of the methods.

Results

The results for the corresponding databases with the simulated data (scenario 1–scenario 4) are summarized in Tables 1 and 2, and illustrated in Figs. 1, 2, and 3. The estimated average of the slopes (Beta) and corresponding standard deviations (SD) are shown for the CCM method and the 2D (2 days) method in Table 1. The results demonstrate that the proposed 2D method gives an overall average slope less than 0.00001 in all four scenarios, compared with the CC method where the overall average slopes were less than 0.001. Only for a small number of cases (maximum 26), the proposed method estimated 95% confidence intervals (CI) which did not include zero. This is equivalent to the type I error rate being less than 5% (26/500) under the proposed 2D model. In contrast, the standard CC method assessed the slopes incorrectly in the presence of high concurvity (scenarios 2 and 4), i.e. the type I error rate for scenario 2 was 34% (172/500) and for scenario 4, it was 38% (192/500).

Table 1 Estimated values by two methods, CCM and 2D, for the simulations # (1–4): average of Beta’s values, their standard deviation (SD), and the number of rejections null hypothesis (RNH)
Table 2 Estimated values by two methods, CC2W and 3D, for the simulations # (1–4): average of Beta’s values, standard deviation (SD), and the number of rejections null hypothesis (RNH)
Fig. 1
figure 1

Estimated slopes (Beta) and 95% CIs obtained by using the standard CC and 2D method. The results are ordered by Beta values based on the CCM method. The results for the CC method are represented by small black symbols (squares: Beta, triangles: lower limit of 95% CI, diamonds: upper limit of 95% CI). The same symbols, but larger and in colour (i.e. red squares: Beta, blue triangles: lower limit of 95% CI, and green diamonds: upper limit of 95% CI) are used to illustrate the corresponding values generated by the 2D method. The x-axis identifies the simulated samples (1–500), ordered differently for each scenario. ad Results for scenario 1–scenario 4, respectively

Fig. 2
figure 2

Estimated slopes (Beta) and 95% CIs obtained by using the CC2W and 3D method. The results are ordered by Beta values based on the CC2W method. The results for the CC2W method are represented by small black symbols (squares: Beta, triangles: lower limit of 95% CI, diamonds: upper limit of 95% CI). The same symbols, but larger and in colour (i.e. red squares: Beta, blue triangles: lower limit of 95% CI, and green diamonds: upper limit of 95% CI) are used to illustrate the corresponding values generated by the 3D method. The x-axis identifies the simulated samples (1–500), ordered differently for each scenario. ad Results for scenario 1–scenario 4, respectively

Fig. 3
figure 3

Estimated beta values generated by the four considered methods (CCM, CC2W, 2D, and 3D) for the simulated data. The results are grouped into the pairs: (CCM, CC2W) and (2D, 3D). Top panel shows the values for the pairs (CCM, CC2W) sorted using CCM estimations. Bottom panel shows the values for the pairs (2D, 3D) sorted by using 2D estimations. The x-axis identifies the simulated samples (1–500) ordered differently for each scenario. The results are presented for four scenarios # (1–4) identified by (a)–(d), respectively

Table 2, similar to Table 1, summarizes the results for the methods CC2W and 3D. Comparing these two models, it was observed that the type I error rate (i.e. number of cases that were rejected when the null hypothesis is true) was similar between the two models. For the CC2W model, the type I error rate was less than 6.6% again highest in the scenarios with high concurvity; whereas for the 3D model, the type I error rate was less than 5.6%.

In Fig. 1 there are 4 panels, where each panel represents the results from the four scenarios considered in this analysis, e.g. panels a–d represent scenarios 1–4, respectively. Within each panel, the estimated slopes and corresponding 95% CI based on the CCM and 2D methods for each of the 500 runs are depicted. The slopes were ordered, smallest to largest, according to the values obtained from the standard CCM method. Black symbols were used to represent estimations generated by the CCM method; these appear on the graphs as continuous lines. A horizontal line showing the value of zero is also represented in each of the panels. In scenarios 2 and 4 (panel b and d), the lower limit of the 95% CI is above the zero level for the CCM method for almost 40% of cases. The numerical results from the 2D method are also sorted by the slopes from the CCM method; the represented runs for both methods are similar.

Figure 2 represents the results generated by the CC2W and 3D methods. The estimations are sorted by the values obtained from the CC2W method. The produced estimations from both methods are symmetrical with respect to the line representing zero. Figure 3 displays only the estimated slopes (Beta) based on all 4 methods of analysis (CCM, CC2W, 2D, and 3D) for the simulated data scenarios 1 to 4. Top panel represents the results for the CC methods where each pair of results, generated by CCM and CC2W, is sorted by CCM values. Bottom panel shows the results based on the pairs 2D and 3D, sorted by 2D values. In this figure, only cases which are in one common pair represent the same sample. This figure illustrates the distributions of the estimated slopes. The 2D and 3D methods show symmetrical distribution with respect to the horizontal line at zero. In contrast, the CCM method shows skewed values (above zero), mainly for the scenario 2 and 4.

Figure 4 illustrates the results obtained using the four models (2D, 3D, CCM, and CC2W) applied to the Chicago mortality database where daily death counts belong to five defined categories (Death, CVD, Resp, CVD&Resp, and Others). The most intriguing result is that the standard CC method (CCM) estimates statistically significant positive slope for all deaths (Deaths) and exposure to particulate matter, but it is not the case when the CC2W, 2D, and 3D methods were applied. The proposed 2D and 3D method estimated positive and statistical significant association for cardiovascular (CVD) death and exposure to ambient ozone.

Fig. 4
figure 4

Estimated slopes and 95% CIs for five causes of deaths and two types of exposure: particulate matter (PM10) and ambient ozone (O3). Four statistical methods (CCM, CC2W, 2D, and 3D) were applied to the Chicago mortality, 1987–2000 data

In addition, the proposed method 2D was realized on the simulation data described and presented in the publication (Burr et al. 2014). The authors created the set of four simulation data (Sim1–Sim4), each with 250 samples for a 10-year period. These data present various difficulties for the standard time-series methods. In this case, their true slope is 1.0 (Beta = 1.0). Using the 2D method, the minimum among lower bounds of the 95% confidence intervals were estimated as 0.97, 0.97, 0.95, and 0.94, for Sim1–Sim4, respectively. The corresponding data, codes in R, and the results are presented at https://github.com/szyszkowiczm/Data2D.

Discussion

The techniques presented here are relatively simple and practical. They can be very easily applied without additional efforts. Existing software allows constructing the corresponding models. The four simulated scenarios that are publically available data sets (Data from Peng et al. 2006) used in this paper allow readers to execute the described calculations. The results also are discussed in publication (Peng et al. 2006). In addition, one real and freely available mortality data (Chicago) can be used to test the methods proposed here. Also the data presented at https://github.com/szyszkowiczm/Data2D.

In the situation of the simulated data, the standard CC method demonstrates some difficulties in the presence of high concurvity. It is a signal that some real analysis results may be affected by the collinearity for non-linear correlation. The proposed 2D method does not have problems with data sets having high concurvity, e.g. scenarios 2 and 4.

Many details, results, and discussion are provided in publication (Peng et al. 2006), which evaluated the problems related to unmeasured confounders. Briefly, for the simulated data, the authors of the presentation (Peng et al. 2006) applied the five df (degree of freedom) selection methods and investigated under which circumstances the used method wrongly reports a statistically significant air pollution association. They found that the bias in the estimates generally decreases with more intensive smoothing. They also reported that the used model selection methods which optimize prediction may not be suitable to create an estimate with small bias. The 2D method approach presented in this paper is one of the possible solutions of these problems in environmental epidemiology. At least is allows to obtain an additional information.

As the results for the real data (Chicago) show we may have various responses generated by the proposed methods, qualitative (statistically significant or non-significant) and quantitative (estimated slopes and confidence intervals) in nature. For Chicago all mortality (Death), the CCM method indicates associations but 2D method does not. The opposite is true for CVD and exposure to ambient ozone. Such discrepancy is difficult to judge and decide “who is right.” For these data estimated, parametric non-linear concentration-response shapes were reported (Szyszkowicz 2018). The shapes suggest existence of a threshold for analysed exposure. The presence of the threshold may affect associations when we consider 2-day (2D) and whole month (CCM or MD) time-window.

Probably the associations are also related to the properties of the analysed health outcomes. The 2D method realizes many comparisons between two neighboured days with various (lagged) air pollution exposure but the health events are considered and compared in these 2 days. It may result that for acute health events (such as stroke), the association will be shown for the 2D method (two points) but not for the CC method (4 or 5 points).

The results for Chicago indicate that cardiovascular mortality (CVD) is related to ambient ozone. These associations are estimated by 2D and 3D methods, which use a short distance among the considered data points. In contrast, these methods, 2D and 3D, did not indicate significant associations of total non-accidental mortality (Death) with exposure to ambient particulate matter.

In conclusion, as the proposed methods are easy to implement and realize, it is suggested that during a short-term exposure study, the use of two methods, one with a narrow time-window (e.g. 2D, 3D methods) and another with a wider time-window (e.g. CCM or MD) be fit to the data.