Data Sources
The analysis in this paper is based on data on asset indicators collected by the HDSS. The Agincourt system has collected detailed longitudinal data on vital events including births, deaths, in- and out-migrations, as well as complementary data covering health, social and economic indicators in a predominantly rural population in northeast South Africa every year since 1992 (Kahn et al. 2007, 2012). Until 2006, the study included 21 villages. The study area was extended to 26 villages in 2007. Another five villages were added between 2010 and 2012 in response to an expanding trials and evaluation portfolio. The population, of approximately 115,000 people in 2014, is largely Shangaan-speaking and almost a third are former Mozambican refugees who arrived in the area in the early to mid-1980s and their descendants.
Collection of data on household asset indicators that include construction materials of the main dwelling, type of toilet facilities, sources of water and energy, ownership of modern assets and livestock only started in 2001 and has been repeated every 2 years. To assess changes in the asset indicators over the period 2001–2013, we use only the data collected from households in the original 21 villages.
Statistical Analysis
There are three parts to the analysis. The first part summarizes changes in ownership of various household assets in the Agincourt study population from 2001 to 2013. The second part involves constructing, from the household asset items, three composite indices that can be used as measures of SES. The three indices, namely the absolute index, the principal components analysis (PCA) index and the multiple correspondence analysis (MCA) index, are among the most widely utilized indices in the literature. Using all three allows us to assess the robustness of our findings. Similar to the approach adopted by Howe et al. (2008), the three indices are compared with each other using scatter plots and the percentage of households classified into the same and different SES quintiles. The agreement of classification of households into SES quintiles between indices is assessed using Kappa statistics. The Kappa statistic, which takes values between 0 (no agreement better than chance) and 1 (perfect agreement), measures agreement in classification between two methods taking into account the agreement that is expected based on chance alone (Howe et al. 2008). Also similar to the approach adopted by Balen et al. (2010), Spearman’s rank correlation coefficient is utilized for further comparisons of the three indices. The last part of the analysis applies the method of relative distributions developed by Handcock and Morris (1998, 1999) to the asset indices to assess changes in the distribution of SES over time in terms of location and shape. This part of the analysis also takes into account ethnic differences in the distribution of SES, as a previous study by Sartorius and colleagues covering the period 2001–2007 showed persistent differentials in SES between the South African and Mozambican populations (Sartorius et al. 2013).
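As an illustration of the quintile comparison described above, the following minimal Python sketch classifies households into quintiles and computes an unweighted Cohen's Kappa between two classifications. The function names `quintiles` and `cohen_kappa` are hypothetical, and the choice of the unweighted form of Kappa and of rank-based quintile cut-offs are assumptions made for the example; the analysis itself was carried out in Stata.

```python
import numpy as np

def quintiles(scores):
    """Assign each household to a quintile (0 = lowest, 4 = highest) by rank.

    Ties are broken by input order (stable sort), which is an assumption
    of this sketch rather than a documented choice of the paper.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort(kind="stable").argsort(kind="stable")  # 0 .. n-1
    return (ranks * 5) // len(scores)

def cohen_kappa(a, b, k=5):
    """Unweighted Cohen's Kappa for two classifications into k categories."""
    a = np.asarray(a)
    b = np.asarray(b)
    table = np.zeros((k, k))
    np.add.at(table, (a, b), 1)          # cross-tabulate the two classifications
    table /= table.sum()                 # convert counts to proportions
    p_observed = np.trace(table)         # share classified identically
    p_expected = table.sum(axis=1) @ table.sum(axis=0)  # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)
```

Applied to two indices, `cohen_kappa(quintiles(index_a), quintiles(index_b))` returns 1 when every household falls in the same quintile under both indices and 0 when agreement is no better than chance.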
Construction of Asset Indices
The absolute index that we construct has been utilized by a number of other researchers that have analyzed data from the Agincourt HDSS (Houle et al. 2013; Gomez-Olive et al. 2014; Houle et al. 2014; Madhavan et al. 2012). To construct this index, first the items of each asset indicator are assigned a weight so that increasing values correspond to items associated with higher SES. For example, for the asset indicator wall material, 5 = brick; 4 = cement; 3 = other modern material; 2 = mud; and 1 = other traditional material. Thereafter, the value assigned to each item of an asset indicator is normalized by dividing it by the value assigned to the item associated with the highest SES. This results in items of a given asset indicator taking values within the range [0, 1]. The asset indicators are then grouped into five broad asset subcategories (modern assets, livestock, power supply, water and sanitation, and dwelling structure). The normalized values of the asset indicators within each subcategory are then summed to yield a subcategory-specific value. Each subcategory-specific value is further normalized so that it too is in the range [0, 1]. Finally, the five subcategory-specific normalized values are summed to produce an overall household asset index that falls in the range [0, 5].
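The steps above can be sketched in Python as follows. The indicator names, their grouping and their highest codes in `INDICATORS` are hypothetical placeholders, not the actual Agincourt variable list. The sketch also assumes that each subcategory total is normalized by dividing it by the number of indicators in the subcategory, which is one way of bringing it into [0, 1]; the text does not specify the divisor.

```python
# Hypothetical indicators: name -> (subcategory, code of the highest-SES item).
INDICATORS = {
    "wall_material": ("dwelling", 5),
    "roof_material": ("dwelling", 4),
    "toilet_type": ("water_sanitation", 3),
    "water_source": ("water_sanitation", 4),
    "electricity": ("power", 1),
    "television": ("modern_assets", 1),
    "fridge": ("modern_assets", 1),
    "cattle": ("livestock", 1),
}

def absolute_index(household):
    """Sum of five subcategory scores, each in [0, 1], giving a value in [0, 5]."""
    totals, counts = {}, {}
    for name, (subcategory, highest_code) in INDICATORS.items():
        # Normalize the item code by the code of the highest-SES item.
        normalized = household[name] / highest_code
        totals[subcategory] = totals.get(subcategory, 0.0) + normalized
        counts[subcategory] = counts.get(subcategory, 0) + 1
    # Assumed normalization: divide each subcategory total by its indicator count.
    return sum(totals[s] / counts[s] for s in totals)
```

Because the weights are fixed in advance, the same household record yields the same score in any survey round, which is why this index does not require pooling across years.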
The PCA index was first recommended by Filmer and Pritchett (2001) and is one of the most widely used asset indices (Gwatkin et al. 2007; McKenzie 2005; Minujin and Delamonica 2004). Construction of this index starts by constructing an \(n \times p\) matrix, \({\mathbf{X}}\), representing ownership of p asset items collected from n households. Thereafter, each element of \({\mathbf{X}}\) is normalized by subtracting from it the column mean and dividing the difference by the column standard deviation to produce another \(n \times p\) matrix, \({\mathbf{Y}}\). Next, a \(p \times p\) correlation matrix, \({\mathbf{R}}\), is computed from the normalized data matrix, \({\mathbf{Y}}\). This is followed by solving the equation \(\left( {{\mathbf{R}} - {\lambda }{\mathbf{I}}} \right){\mathbf{V}} = 0\) for \({{\lambda }}\) and \({\mathbf{V}}\), where \({{\lambda }}\) is a vector of eigenvalues, \({\mathbf{I}}\) is an identity matrix and \({\mathbf{V}}\) is a matrix of eigenvectors associated with the eigenvalues in \({{\lambda }}\). Each eigenvector is then scaled so that its sum of squares equals the total variance. The product of the normalized matrix of assets variables, \({\mathbf{Y}}\), and the matrix of scaled eigenvectors, \({\mathbf{V}}^{\varvec{*}}\) produces a set of uncorrelated linear combinations of the asset variables for each household j, known as principal components. For each household, the number of principal components equals the number of asset items, and the rank of each component corresponds to the rank of its associated eigenvector. The first component is associated with the most dominant (largest) eigenvalue and explains as much as possible of the variation in the original data. The second component is associated with the second largest eigenvalue and explains as much as possible of the remaining variation in the data, subject to being uncorrelated with the first component. 
Similarly, each subsequent component explains as much as possible of the remaining variation in the data, while being uncorrelated with the other components. Formally, for household j, the PCA index is computed as
$$A_{j} = v_{11}^{*} \left( {\frac{{{\text{x}}_{j1} - {\bar{\text{x}}}_{1} }}{{s_{1} }}} \right) + v_{21}^{*} \left( {\frac{{{\text{x}}_{j2} - {\bar{\text{x}}}_{2} }}{{s_{2} }}} \right) + \ldots + v_{p1}^{*} \left( {\frac{{{\text{x}}_{jp} - {\bar{\text{x}}}_{p} }}{{s_{p} }}} \right)$$
where \(v_{i1}^{*}\) are the elements of the scaled eigenvector associated with the largest eigenvalue, \({\text{x}}_{ji}\) are the asset ownership values for household j and asset \(i, i \in \left\{ {1, 2, \ldots, p} \right\}\), and \({\bar{\text{x}}}_{i}\) and \(s_{i}\) are, respectively, the mean and standard deviation of the asset ownership values across all households for asset item i. In our description of the steps to derive the PCA index we have kept the mathematical details to a minimum. More detailed mathematical descriptions of the steps involved in the PCA technique can be found in Everitt and Hothorn (2011) and Rencher (2003).
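The PCA steps described above can be illustrated with a short NumPy sketch. The function name `pca_index` is hypothetical. The sketch uses the unit-length first eigenvector returned by `numpy.linalg.eigh` as the weights, which is an assumption: a different eigenvector scaling multiplies every household's score by the same constant and leaves the ranking of households unchanged. It also assumes that every asset column varies across households, since a constant column has zero standard deviation.

```python
import numpy as np

def pca_index(X):
    """First principal component score for each household.

    X is an n x p array of asset ownership values (rows are households).
    Returns the scores, the weights (first eigenvector) and the largest eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardize each column: subtract the mean, divide by the standard deviation.
    Y = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the assets, computed from the standardized data.
    R = Y.T @ Y / (n - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    weights = eigenvectors[:, -1]
    # The sign of an eigenvector is arbitrary; orient it so the weights sum is positive.
    if weights.sum() < 0:
        weights = -weights
    return Y @ weights, weights, eigenvalues[-1]
```

With unit-length weights, the sample variance of the resulting scores equals the largest eigenvalue, which provides a quick check on the computation.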
The procedure used to construct the MCA index is similar to the one used to construct the PCA index but does not assume that the data are continuous or that there is a linear relationship between the observations (Traissac and Martin-Prevel 2012; Booysen et al. 2008; Howe et al. 2012). Because all the asset indicators are discrete or categorical, others have argued that the MCA index is the most appropriate asset-based measure of SES (Booysen et al. 2008; Traissac and Martin-Prevel 2012; Asselin and Anh 2008). In constructing the MCA index we follow the guidelines provided by Booysen et al. (2008) and Asselin and Anh (2008). First, the indicators of asset ownership of all households are organized into a matrix \({\mathbf{X}}\) of ones and zeros called the “indicator matrix”. In the indicator matrix, each categorical asset indicator is decomposed into a set of mutually exclusive and exhaustive binary categories that each take only the value 0 or 1 such that every household has a ‘1’ in exactly one of each asset’s set of categories and a ‘0’ in the rest of the asset’s categories. Second, a matrix \({\mathbf{S}}\) is calculated by taking the \(\chi^{2}\) metric on row/column profiles of \({\mathbf{X}}\). Greenacre (2007) provides the formula for computing \({\mathbf{S}}\) as
$${\mathbf{S}} = {\mathbf{D}}_{r}^{{ - \frac{1}{2}}} \left( {{\mathbf{P}} - {\mathbf{rc}}^{\text{T}} } \right){\mathbf{D}}_{c}^{{ - \frac{1}{2}}}$$
where \({\mathbf{P}}\) is the matrix formed by dividing each element of the matrix \({\mathbf{X}}\) by the sum of its elements, \({\mathbf{r}}\) is a vector whose elements are the sums of the row elements of the matrix \({\mathbf{P}}\), \({\mathbf{c}}\) is a vector whose elements are the sums of the column elements of the matrix \({\mathbf{P}}\), and \({\mathbf{D}}_{\varvec{r}}\) and \({\mathbf{D}}_{\varvec{c}}\) are diagonal matrices formed from \({\mathbf{r}}\) and \({\mathbf{c}}\) respectively. Finally, singular value decomposition (SVD) is performed on the matrix \({\mathbf{S}}\) to decompose it into three matrices such that \({\mathbf{S}} = {\mathbf{UD}}_{\alpha } {\mathbf{V}}^{\text{T}}\) (Greenacre 2007). The columns of the matrices \({\mathbf{U}}\) and \({\mathbf{V}}\), referred to as left and right singular vectors, are respectively the eigenvectors of the matrices \({\mathbf{SS}}^{\text{T}}\) and \({\mathbf{S}}^{\text{T}} {\mathbf{S}}\), and the diagonal elements of the matrix \({\mathbf{D}}_{\alpha }\), known as singular values, are the square roots of the common positive eigenvalues of \({\mathbf{SS}}^{\text{T}}\) and \({\mathbf{S}}^{\text{T}} {\mathbf{S}}\). As in the PCA approach, in constructing a single asset index, the elements in the first column vector of the matrix \({\mathbf{V}}\) derived by the SVD are then used as weights of the asset categories. Consequently, as provided by Booysen et al. (2008), the MCA index score for household i is calculated as
$$MCA_{i} = R_{i1} W_{1} + R_{i2} W_{2} + \cdots + R_{ij} W_{j}$$
where \(R_{ij}\) is the response of household i to asset category j and \(W_{j}\) is the MCA weight of asset category j.
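The MCA steps described above can be sketched with NumPy as follows. The function name `mca_index` and the integer category coding are hypothetical. The sketch takes the first right singular vector of \({\mathbf{S}}\) directly as the category weights, as the text describes; published implementations often rescale these weights further, so the scores here should be read as illustrative only. The sign of a singular vector is arbitrary and may need to be flipped so that higher scores correspond to higher SES.

```python
import numpy as np

def mca_index(categories):
    """MCA score per household from an n x q array of integer category codes."""
    categories = np.asarray(categories)
    # Indicator matrix X: one 0/1 column per observed category of each indicator.
    blocks = []
    for column in categories.T:
        levels = np.unique(column)
        blocks.append((column[:, None] == levels[None, :]).astype(float))
    X = np.hstack(blocks)
    P = X / X.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}, computed element by element.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, singular_values, Vt = np.linalg.svd(S, full_matrices=False)
    weights = Vt[0]                      # first column of V
    # Each household has a 1 in exactly one category per indicator,
    # so X @ weights is the sum of R_ij * W_j over categories j.
    return X @ weights, weights, X
```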
The PCA and MCA indices are derived from pooled data from all the available years. This approach ensures that indices explain variation over time as well as across households and are not affected by changes in the contribution of particular assets to household welfare (McKenzie 2005). Pooling of the data is not necessary for the absolute index as the procedure used to generate this index assigns the same weight to the same asset item across time.
Assessing Distributional Changes in SES
The method of relative distributions that we apply to the three indices to assess trends in the distribution of SES quantifies differences between the distributions of a set of measurements of an attribute of interest from a population at one time period and another set of measurements of the same attribute from a different population, or from the same population at a later time period. It takes the values of one distribution (the comparison distribution) and expresses them as positions in another distribution (the reference distribution) (Handcock and Morris 1998, 1999). Compared to the standard approach of comparing distributions using summary statistics such as mean, median and variance, which do not consider the entire distributions, the relative distribution analytic approach allows direct comparisons between outcomes across the entire distributions and provides insights that may be missed by the former approach.
Taking 2001 as the baseline year, we obtain the relative distribution for each later time period, t, using the density function of the percentile rank, r, of asset index value, \(y\), in 2001 as
$$g_{t} \left( r \right) = \frac{{f_{t} \left( y \right)}}{{f_{0} \left( y \right)}}, \quad 0 < r \le 1$$
where \(f_{0}(y)\) and \(f_{t}(y)\) are the density functions of the asset index values in 2001 and at a later time period respectively. Basically, the relative distribution, \(g_{t}(r)\), represents the ratio of the population density at asset index value, y, at each later time period, t, to the density in 2001. When there are no differences between the comparison and reference distributions, the relative distribution is uniform or “flat” (taking a value of 1 throughout). When there are differences between the distributions, the relative distribution “rises” or “falls” depending on the direction of the difference. For example, if the proportion of households at a later time period, t, with asset index values equal to the median asset index value in 2001 is less than 50 %, the relative distribution will have a value below 1 at a point on the vertical axis corresponding to 50 % on the horizontal axis.
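To make the construction concrete, the following Python sketch estimates the relative distribution from two samples of asset index values. The function names are hypothetical, and the histogram estimator is a simplification chosen for the example; it is not the smoothed estimator provided by the reldist package. Each comparison value is first converted to its percentile rank in the reference sample (the "relative data"), and the relative density is then the share of relative data in each bin divided by the bin width.

```python
import numpy as np

def relative_data(reference, comparison):
    """Percentile rank, r, of each comparison value within the reference sample."""
    reference = np.sort(np.asarray(reference, dtype=float))
    comparison = np.asarray(comparison, dtype=float)
    # Share of reference values less than or equal to each comparison value.
    return np.searchsorted(reference, comparison, side="right") / len(reference)

def relative_density(reference, comparison, bins=10):
    """Histogram estimate of g(r) on equal-width bins of [0, 1]."""
    r = relative_data(reference, comparison)
    counts, edges = np.histogram(r, bins=bins, range=(0.0, 1.0))
    # Proportion in each bin divided by the bin width (1 / bins).
    return counts / counts.sum() * bins, edges
```

If the two samples come from the same distribution, every bin holds roughly the same share of households and the estimate is close to 1 throughout. If all comparison households lie above the highest reference value, all the mass falls in the last bin.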
Following the approach by Handcock and Morris (1998, 1999), the changes in the relative distribution of the asset index values in 2001 and at later time periods are statistically summarized using the entropy statistic and median relative polarization (MRP) index. The entropy statistic used is based on the Kullback–Leibler divergence, which is a measure of the distance between two distributions and is defined by:
$$D\left( {F:F_{0} } \right) = \mathop \int \limits_{ - \infty }^{\infty } { \log }\left( {\frac{f\left( y \right)}{{f_{0} \left( y \right)}}} \right)dF\left( y \right) = \mathop \int \limits_{0}^{1} { \log }\left( {g\left( r \right)} \right)g\left( r \right)dr$$
where g(r) is the probability density function of the relative distribution of asset index values in the reference and comparison distributions and \(F_{0}\) and F respectively represent the cumulative distribution functions of the reference and comparison distributions of asset index values. We use the entropy statistic to quantify: (1) overall divergence between the comparison and reference distributions; (2) divergence between the location-adjusted reference distribution and the reference distribution; and (3) divergence between the comparison distribution and the location-adjusted reference distribution. The location adjustment used is median adjustment. This is preferred over mean adjustment because of the well-known drawbacks of the mean when distributions are skewed. As for the MRP index, we use it to quantify the extent to which the shape difference between the distributions of asset index values in 2001 and at later time periods takes the form of relative polarization or rising inequality. It is computed as:
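The three entropy quantities listed above can be approximated from sample data as in the sketch below. The function names are hypothetical, the relative density is estimated with a simple histogram, and the median adjustment is assumed to be an additive shift of the reference sample by the difference in medians. With a binned density, the integral of \(g(r)\log g(r)\) becomes a sum over bins multiplied by the bin width, and empty bins contribute zero.

```python
import numpy as np

def binned_relative_density(reference, comparison, bins=10):
    """Histogram estimate of the relative density on equal-width bins of [0, 1]."""
    ref = np.sort(np.asarray(reference, dtype=float))
    r = np.searchsorted(ref, np.asarray(comparison, dtype=float), side="right") / len(ref)
    counts, _ = np.histogram(r, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum() * bins

def entropy(g):
    """Kullback-Leibler divergence from a binned relative density g."""
    g = np.asarray(g, dtype=float)
    positive = g > 0                      # g * log(g) tends to 0 as g tends to 0
    return float(np.sum(g[positive] * np.log(g[positive])) / len(g))

def entropy_decomposition(reference, comparison, bins=10):
    """Overall, location and shape divergences using a median adjustment."""
    reference = np.asarray(reference, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    # Assumed additive median adjustment of the reference sample.
    adjusted = reference + (np.median(comparison) - np.median(reference))
    return {
        "overall": entropy(binned_relative_density(reference, comparison, bins)),
        "location": entropy(binned_relative_density(reference, adjusted, bins)),
        "shape": entropy(binned_relative_density(adjusted, comparison, bins)),
    }
```

For a comparison sample that is a pure shift of the reference sample, the shape component is close to zero and the overall and location components coincide.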
$$MRP_{t} = 4\mathop \int \limits_{0}^{1} \left| {r - \frac{1}{2}} \right| \times g_{t} \left( r \right)dr - 1$$
where \(g_{t}(r)\) is the relative population density at asset index value, \(y\), at each time period, t, weighted by the absolute difference between the baseline rank of y and the median, \(\left| {r - \frac{1}{2}} \right|\). Its value varies between −1 and 1, with 0 representing no change in the distribution of asset index values at time period t relative to the baseline year, positive values representing more polarization (i.e. increases in the tails of the distribution) and negative values representing less polarization (i.e. convergence towards the center of the distribution). In order to distinguish the contributions from the lower and upper tails of the distribution to the overall polarization, the MRP index is decomposed into lower (LRP) and upper (URP) polarization indices defined respectively as:
$$LRP_{t} = 8\mathop \int \limits_{0}^{{\frac{1}{2}}} \left| {r - \frac{1}{2}} \right| \times g_{t} \left( r \right)dr - 1$$
$$URP_{t} = 8\mathop \int \limits_{{\frac{1}{2}}}^{1} \left| {r - \frac{1}{2}} \right| \times g_{t} \left( r \right)dr - 1$$
These indices also vary between −1 and 1 and have similar interpretations as the MRP index.
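The three polarization indices can be estimated from sample data by replacing each integral with a sample mean, since the integral of \(\left| {r - \frac{1}{2}} \right| g_{t}(r)\) over r is the expected value of \(\left| {r - \frac{1}{2}} \right|\) over the relative data. The Python sketch below does this; the function name is hypothetical, and it assumes that the reference sample is first median-adjusted by an additive shift so that only shape differences remain.

```python
import numpy as np

def polarization_indices(reference, comparison):
    """Sample estimates of the MRP, LRP and URP indices."""
    reference = np.asarray(reference, dtype=float)
    comparison = np.asarray(comparison, dtype=float)
    # Assumed additive median adjustment of the reference sample.
    adjusted = np.sort(reference + (np.median(comparison) - np.median(reference)))
    # Relative data: rank of each comparison value in the adjusted reference.
    r = np.searchsorted(adjusted, comparison, side="right") / len(adjusted)
    deviation = np.abs(r - 0.5)
    mrp = 4 * deviation.mean() - 1
    # Restrict the mean to the lower or upper half of the reference distribution.
    lrp = 8 * np.mean(deviation * (r <= 0.5)) - 1
    urp = 8 * np.mean(deviation * (r > 0.5)) - 1
    return mrp, lrp, urp
```

Under these definitions the MRP estimate is the average of the LRP and URP estimates. A comparison sample that is more spread out around the same median than the reference yields positive values, and a more concentrated one yields negative values.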
The analysis of ethnic differences in the distribution of SES between the South African and Mozambican populations uses the distribution of the asset index values of the Mozambican households as the reference distribution and that of the asset index values of the South African households as the comparison distribution.
Software
We use STATA version 13.1 (Stata Corp., College Station, USA) to construct the asset indices and to perform the descriptive analyses. We also utilize the R statistical package reldist to conduct the relative distribution analysis (Handcock and Aldrich 2002).
Ethics Statement
The Human Research Ethics Committee (Medical) of the University of the Witwatersrand reviewed and approved the Agincourt HDSS (protocol M960720 and M081145). At the start of surveillance in 1992, community consent was secured from civic and traditional leadership and has continuously been reaffirmed for over two decades through frequent meetings. This is facilitated by the Agincourt Unit’s LINC (Learning, Information dissemination and Networking with Community) Office. Three local people working under a coordinator in the LINC office regularly engage with Community Development Forums as well as a Community Advisory Group in the study site. Both are elected committees comprising village members. Community Development Forums, the lowest level of local government, include the Induna who represents the Traditional Council. The LINC office ensures that Forum members understand research objectives and results and are able to raise concerns about the Unit’s research in their communities, and provide feedback of research results at community meetings. The Community Advisory Group ensures information flows between the Unit and the community, voices concerns, assesses the potential impact of the Unit’s research on the community, and maintains ongoing dialogue and consultation. At the individual and household level, informed verbal consent is obtained from the head of the household or an eligible adult in the household at each annual follow-up surveillance visit. Prior to conducting any interview, a local fieldworker who is well-trained and versed in the Agincourt HDSS methods and the process of verbal informed consent explains in the local language to the respondent the purpose, aims and justification of the HDSS as well as information about confidentiality, privacy and the right to refuse to participate or withdraw from the HDSS. 
The responsible fieldworker documents the consent process by marking out the respondent on the household roster as well as recording the fieldworker details and date on the spaces provided at the top of the household roster. A verbal consenting process is normal practice for HDSS and the processes followed in the Agincourt HDSS have continued to be accepted by the aforementioned ethics committee. Furthermore, additional ethical clearance was obtained from the same ethics committee for the primary study reported in this paper (protocol M120488).
Data Availability
Detailed documentation of the Agincourt HDSS data and an anonymized database containing data from 10 % of the surveillance households are freely available on the Agincourt HDSS website (www.agincourt.co.za). The specific customized data used in this study are available on request to interested researchers.