Introduction

The Multi-centre Analysis of the Dynamics of Internal Migration and Health (MADIMAH) project was conceived in 2011 to provide much-needed evidence on relationships between migration and health in sub-Saharan Africa [1]. The project recognised the potential for Health and Demographic Surveillance Systems (HDSS) data to be employed using a standardised methodology and analytical framework to generate comparative results across diverse settings. HDSS monitor all births, deaths and in- and out-migrations in a geographically-defined population, generating prospective longitudinal data with a precise temporal dimension. Employing these data to produce evidence on migration dynamics has been the focus of the MADIMAH project.

Following the experience of the MADIMAH project, the International Network for the Demographic Evaluation of Populations and their Health (INDEPTH) have facilitated the public release of HDSS data from low- and middle-income countries (LMIC) through the iSHare data repository [2]. To date there are 34 core standardised longitudinal datasets from HDSSs located in the African, Asian and Pacific Regions available in this open resource [3].

A central aim of MADIMAH has been to advance a set of tools for data management and application of event history analysis (EHA) to encourage the use of these high quality, publically available data. This initiative seeks to fill the gap in longitudinal population data available in LMIC, which are crucial to understanding population dynamics and their consequences. The objective of this research note is to document a set of EHA tools to produce reliable and comparable statistical results. The research note is accompanied by a training manual (Additional file 1) that guides the user through EHA, illustrating how to produce standard cross-sectional and longitudinal demographic rates and advanced EHA using individual-level datasets. These tools build on a previously published data management training manual [4] that was developed to guide users through a set of procedures to produce HDSS datasets in a harmonised structure.

The EHA methods illustrated in this research note and described in detail in the accompanying training manual (Additional file 1), represent a collection of tools for analysis of longitudinal HDSS data. The MADIMAH project team has collated these methods based on its experiences of conducting multi-centre analyses of migration and mortality. The methods described have been tested on and applied to more than 30 HDSS datasets. Over the past 8 years, the MADIMAH team has brought together data managers, analysts and students from HDSS centres across sub-Saharan Africa to train on and apply these techniques to HDSS data. The accompanying manual, written in an accessible language but with the necessary statistical rigour, is targeted at researchers and analysts from multidisciplinary backgrounds (including demography, public health, epidemiology and statistics) who are interested in conducting longitudinal data analysis of demographic events.

Main text

Methods

Traditionally, demographic estimates have been based on cross-sectional or aggregate data. These calculations of demographic rates, dominant in publications, usually involve estimating the population at mid-period of interest as well as a count of the number of events of interest over the period. For example, a death rate that is computed according to the following formula requires that the total number of deaths in a population be counted and divided by the total mid-year population:

$$ {\text{Crude death rate }} = \frac{Total \,number\,of\,deaths\,\,in\,a\,given\,year }{total\,\,mid - year\,\,population} = \frac{{D_{{\left( {t, t + n} \right)}} }}{{\left( {P_{t} + P_{t + n} } \right)/2}} $$

This is often estimated based on the population at the start of the year added to the population at the end of the year, divided by two. These methods suffer from inaccuracies regarding the handling of events such as migration, and cannot easily deal with the issue of censoring [5]. Also, with such aggregates, it is not straightforward to obtain cohort measures of probabilities except through the application of formulas that convert rates to probabilities using approximate average person-years lived in the age interval [5]. The event history analyses (EHA) approach allows for the computation of exact person-years, and can successfully handle right- and left-censored data to produce estimates based on both calendar years and age groups. In addition to the computation of descriptive indicators (such as birth, death, in- and out-migration rates and probabilities), longitudinal data sources may be effectively utilised for more sophisticated EHA [6].

The analytic methods presented in this research note are illustrated using HDSS data but can also be applied to register or retrospective survey data. We use the Agincourt HDSS core micro dataset available for download through the INDEPTH iSHARE2 data repository [3]. The analytical dataset was extended to include data on causes of death (CoD) to exemplify the analysis of competing risks in the last section of the attached manual (Additional file 1), and these data are available upon reasonable request to the Agincourt HDSS site (https://www.agincourt.co.za/). The Agincourt HDSS was established in 1992 and is located in the rural north-east of South Africa. The surveillance population currently comprises over 90,000 individuals living in 11,500 households [7].

The core micro dataset, or core residency file, is a standardised file format containing the key events for each individual in the surveillance population with each event being documented as a single record. This type of dataset considers events that change the residency status of the individual (such as: enumeration, birth, death, in-migration, out-migration and end of observation). For each event, a corresponding event date is captured (see the MADIMAH team’s first manual of data management for more detail [4]).

The results below illustrate with the Agincourt HDSS micro data how to use standard commands available in most statistical software packages. Our illustrations and corresponding code in the attached manual (Additional file 1) uses a suite of Stata® version 15 commands. We highlight below new techniques such as the cumulative incidence function for competing risks such a causes of death or the reverse-time for the computation of in-migration rates. The results illustrate how a set of techniques applied to longitudinal HDSS data can be integrated to avoid unnecessary division between descriptive and more complex analyses.

Results

The foundation statistic in EHA is the hazard rate by age [5]. This rate represents the risk in a given short age interval of experiencing the event. It is expressed as an annualised probability, i.e. a number of events per 1000 person-years. The hazard curve is usually represented by age, sometimes for a specific calendar period. However, the hazard function need not be represented by age. Using the same data, one can represent the hazard function by calendar time, for the whole population but more often for a specific age group. Figure 1 is an illustration of hazard curves, with infant and child death hazards from 1 January 2013 to 31 December 2015. One can clearly see a drop in infant mortality from 2009 (antiretroviral treatment were largely made available free-of-charge from 2008 in the study area). The attached manual (Additional file 1) gives time-scale recommendations for smoothing hazard rates in a meaningful way in relation to data collection precision in dates and proportion of events.

Fig. 1
figure 1

(source: Agincourt HDSS 2003–2015)

Infant and child death hazard functions by calendar time

The above figure is for data exploration and for communication (to show levels and changes in trends) but may also be presented in tables. Two different indicators are used in the literature: rates and probabilities. Rates (nmx) most closely correspond to hazard rates except that they are usually defined for conventional age groups [5]. They are defined as the number of events over the total person-years accounted for in a given age interval, as exemplified in Table 1. The attached manual (Additional file 1) shows how to produce such a table for each calendar periods to identify mortality, migration or fertility trends, e.g. by 5-year age group and 5-year period.

Table 1 Death rates and survival probability by age group for males.

The other way to represent event intensity is through the survivor function that represents the probability to survive until a given age (nqx) for a synthetic cohort, i.e. a cohort of individuals that would have been subjected over their lifetimes to the conditions prevailing over the observed period (see Table 1). Both the death rates (nmx) and survival probabilities (nqx) may be computed from the same data without resorting to conversion formulas as necessary with aggregates. The distribution of events by age interval is the same for nmx and nqx. Aggregates (column 8) are not accurate since, as noted in the Stata output, the “survivor function is calculated over full data and evaluated at indicated times; it is not calculated from aggregates.” More reliable are the person-years displayed in column 2. Common summary cohort measures, such as life expectancy or median age at death are derived from the probabilities.

Another useful synthetic cohort descriptive tool is the cumulative incidence function (CIF) [8] that has not so far been presented in published manuals. We recommend this over the cumulative hazard function also known as the Nelson-Aalen function (NAF) to analyse competing events such as causes of death, which is based on the assumption of independence between competing events that doesn’t always hold. The advantage of the CIF over the NAF is that the sum of CIF for each competing event is equal to the Kaplan–Meier failure function, unlike the NAF whose scale has no clear interpretation (it frequently exceeds the value 1). However the NAF is still useful for repeatable events (competing or not) since the CIF does not handle repeatable events. Figure 2 presents the CIF for large categories of death. AIDS/TB represents about half of the mortality intensity in the 2003–2007 period.

Fig. 2
figure 2

(source: Agincourt HDSS 2003–2007, indeterminate causes of death excluded)

Cumulative incidence function (CIF) for three large causes of death for males

An original contribution that the MADIMAH team has streamlined is the detailed procedure to analyse in-migration [9, 10]. This is a special case in event history analysis that involves reversing analysis time to compute rates using destination population at risk instead of the origin population at risk (as done for out-migration analysis).

The full potential of longitudinal data relates not only to the ability to produce standard descriptive estimates as we have seen above, but also to the ability to produce more complex regression estimates. The well-known Cox model (semi-parametric proportional hazard model is its full name) and the less known Fine and Gray model for non-independent competing risks [11] can easily be implemented using the same micro data that we used to produce rates and probabilities. The MADIMAH team has successfully applied these methods to analyses of determinants and outcomes of demographic processes, to produce results that are comparable across diverse settings [12, 13].

Limitations

The computer programs and analyses outlined in this research note are flexible and can be applied to renewable or non-renewable events, competing risks or non-competing risks. However, consideration should be given as to the time-precision of the data, the precision of recorded dates for data collection (e.g., days) should always be higher than the unit of time of analysis (e.g., years). The manual (Additional file 1) has been designed for Stata users and the provided computer programs would require adaptation for use in other statistical software packages. The manual follows the previously published “Manual of event history data management using HDSS data” [4], which outlines the steps to structure the data into the required format for EHA.