Abstract
Background
Verbal autopsy is gaining increasing acceptance as a method for determining the underlying cause of death when the cause of death given on death certificates is unavailable or unreliable, and there are now a number of alternative approaches for mapping from verbal autopsy interviews to the underlying cause of death. For public health applications, the population-level aggregates of the underlying causes are of primary interest, expressed as the cause-specific mortality fractions (CSMFs) for a mutually exclusive, collectively exhaustive cause list. Until now, CSMF Accuracy has been the primary metric used for measuring the quality of CSMF estimation methods. Although it allows for relative comparisons of alternative methods, CSMF Accuracy provides misleading numbers in absolute terms, because even random allocation of underlying causes yields relatively high CSMF Accuracy. Therefore, the objective of this study was to develop and test a measure of CSMF estimation quality that corrects this problem.
Methods
We developed a baseline approach of random allocation and measured its performance analytically and through Monte Carlo simulation. We used this to develop a new metric of population-level estimation accuracy, the Chance-Corrected CSMF Accuracy (CCCSMF Accuracy), which has a value near zero for random guessing, and negative values for estimation methods that are worse than random at the population level.
Results
The CCCSMF Accuracy formula was found to be CCCSMF Accuracy = (CSMF Accuracy − 0.632) / (1 − 0.632), which indicates that, at the population level, some existing and commonly used VA methods perform worse than random guessing.
Conclusions
CCCSMF Accuracy should be used instead of CSMF Accuracy when assessing VA estimation methods because it provides a more easily interpreted measure of the quality of population-level estimates.
Introduction
Understanding the leading cause of death (CoD) is vital information for health decision-making [1]. The civil and vital registration system (CVRS) constitutes the most timely and accurate source of this information [2, 3], but is unavailable in many regions of the world [4]. Verbal autopsy interviews (VAIs) provide a promising alternative (and potentially a complement) to the CVRS approach in settings where CVRS information is unavailable or unreliable [5, 6]. In populations where medical certification of causes of death is difficult to achieve, particularly those poorly serviced by health facilities, the only viable option to obtain information on causes of death is to use verbal autopsy (VA) methods. VA includes three components: (1) a VA instrument, used to elicit information from the family or relatives about signs and symptoms experienced by the deceased prior to death; (2) a diagnostic method to derive the most probable cause of death from these responses to the VA interview with families, which has traditionally been accomplished by physician review, but can also be assessed using a diagnostic algorithm that recognizes and associates response patterns with likely causes of death; and (3) a target cause of death list that covers the universe of causes of death which can be diagnosed from the VA interview, irrespective of the diagnostic approach followed. Worldwide, less than 40 % of deaths are medically certified each year [7], and an additional 100,000 or so are currently assigned a cause by some variant of verbal autopsy, mostly in routine mortality surveillance systems operating in China [8], India [9], or the INDEPTH network [10]. There is now increasing momentum worldwide to apply cost-effective VA methods to facilitate the introduction of VA into routine civil registration systems in countries across Asia, Africa and Latin America [11].
It is technically challenging to predict the underlying cause of death from VAIs. A recent paper compared the quality of six prediction methods on VAIs where the underlying cause was known to meet rigorous clinical diagnostic criteria [12]. In that work, prediction quality was assessed with five different metrics. Most of these measure predictive quality on the individual level, to quantify how well a method predicts the cause of each death. However, for public health policy, it is of great importance to make accurate predictions at the population level. Cause-specific mortality fraction (CSMF) accuracy is a recently developed metric for quantifying prediction quality at the population level [13]. CSMF accuracy is an index of absolute deviation of a set of estimated CSMFs from the true CSMF distribution, with a value of one meaning perfect agreement, and a value of zero meaning as far apart as possible. This metric is specific to validation studies which make use of a database of VAIs with known underlying cause of death (labeled data, in the parlance of machine learning). To protect against “overfitting” (where an algorithm, or even a physician coder, estimates a CSMF distribution based on what they have seen in the past instead of the data they are currently examining), CSMF accuracy requires repeated calculation of this absolute deviation index for multiple random samples of the underlying cause distribution of the test data.
CSMF accuracy, as originally formulated, is misleading, however. It is always a value that falls between zero and one, but in practice it is rarely lower than 0.5 [12]. This is a limitation in interpretation, because even a very low-quality approach scores well above zero. An extreme example is that of a “prediction” method that resorts simply to random guessing. Even this information-free approach yields a CSMF accuracy substantially above zero.
In this paper, we propose a modification of the CSMF Accuracy metric, which we call chance-corrected cause-specific mortality fraction (CCCSMF) accuracy, to adjust the baseline of the metric so that allocating causes uniformly at random (i.e. just “by chance”) achieves an expected accuracy score of zero. We believe that this will improve the interpretation of the absolute and comparative performance of different methods for estimating cause-of-death patterns in populations.
The remainder of the paper is organized as follows: in the methods section, we define the baseline algorithm of Random Allocation and review the definitions of chance-corrected concordance and CSMF accuracy. We then introduce the RandomFromTrain algorithm, reiterate the importance of randomly resampling the distribution of the test set when the population-level predictions are of primary interest, and define our new metric of CCCSMF accuracy. In the results section, we use Monte Carlo simulation to find the exact formula of CCCSMF accuracy, and to replicate the analytically derived chance-corrected concordance metric. We then apply this formula to produce a plot comparing three existing methods of coding VAIs in terms of CCC and CCCSMF accuracy, which we view as chance-correcting previous results. We conclude the results section with a demonstration of the importance of randomly resampling the distribution of the test set. We follow this with discussion and conclusions sections, including a subsection discussing the limitations of our work.
Methods
In Machine Learning (ML), mapping from VAIs to CoD is an example of a classification problem, and ML methods for classification, such as neural networks [14], k-nearest neighbor [15], and random forests [16], have been applied to VAIs previously. An ML concept that is also quite relevant for VA applications is that of the “baseline approach”, where a simple approach is used as a comparison for the more sophisticated classification methods. A baseline approach for mapping VAIs to CoD is Random Allocation, which allocates the cause of death uniformly at random from a mutually exclusive, collectively exhaustive cause list (Table 1).
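As a concrete illustration, the Random Allocation baseline can be sketched in a few lines of Python (the function name `random_allocation` is ours, for illustration, not from the study's code):

```python
import random

def random_allocation(n_deaths, cause_list, seed=None):
    """Assign each death a cause drawn uniformly at random from a
    mutually exclusive, collectively exhaustive cause list."""
    rng = random.Random(seed)
    return [rng.choice(cause_list) for _ in range(n_deaths)]

# Example: "predict" causes for 5 deaths from a 3-cause list
predictions = random_allocation(5, ['cause_A', 'cause_B', 'cause_C'], seed=0)
```

Because the baseline uses no information from the interview at all, any serious method should beat it; the rest of the paper quantifies by how much.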
The machine learning construct of the confusion matrix is a useful tool for understanding the performance of a classifier on labeled validation data. The confusion matrix M is a cross-tabulation of the number of true and predicted cases for each cause, which is to say a J × J matrix, where J is the length of the mutually exclusive, collectively exhaustive cause list, with the entry in row j and column j′ given by

\( {M}_{j,{j}^{\prime }}=\#\left\{\mathrm{VAIs\ with\ true\ cause\ }j\mathrm{\ and\ predicted\ cause\ }{j}^{\prime}\right\}. \)
Table 2 shows the confusion matrix for physician-coded VA and random allocation for the Population Health Metrics Research Consortium (PHMRC) validation database for gold-standard-level-one adult deaths (this database consisted of 2,702 VAIs gathered from six sites in four countries, for deaths that met stringent clinical diagnostic criteria [17], and were subsequently coded by physicians in a validation of the PCVA approach to VA [18]).
Chance-correcting concordance and CSMF accuracy
Recent work by Murray et al. developed robust metrics for individual-level and population-level prediction quality: chance-corrected concordance and CSMF accuracy [13]. Both can be written easily in terms of the confusion matrix. Cause-specific concordance (C_{ j }) is a measure of predictive quality at the individual level, which quantifies how likely a prediction is to be correct for a single VAI. It is equal to the fraction of VAIs with true cause j where the prediction was correct, or in other words,

\( {C}_j={M}_{j,j}/{\sum}_{j^{\prime }=1}^J{M}_{j,{j}^{\prime }}. \)
Then cause-specific chance-corrected concordance (CCC_{ j }) has the form

\( {\mathrm{CCC}}_j=\frac{{C}_j-1/J}{1-1/J}. \)
This scales and shifts the concordance so that the expected CCC_{ j } of Random Allocation is zero. Finally, an overall metric of chance-corrected concordance is calculated as an unweighted mean of the cause-specific values:

\( \mathrm{CCC}=\frac{1}{J}{\sum}_{j=1}^J{\mathrm{CCC}}_j. \)
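The cause-specific concordance and its chance correction can be computed directly from a confusion matrix. The following Python sketch is our own illustration (rows as true causes, columns as predicted causes), not the authors' code:

```python
def chance_corrected_concordance(M):
    """Overall CCC from a J x J confusion matrix M
    (rows = true cause, columns = predicted cause)."""
    J = len(M)
    ccc = []
    for j in range(J):
        row_total = sum(M[j])
        # Cause-specific concordance: fraction of true-cause-j deaths
        # whose predicted cause was also j
        c_j = M[j][j] / float(row_total) if row_total else 0.0
        # Chance correction: random allocation is correct 1/J of the time
        ccc.append((c_j - 1.0 / J) / (1.0 - 1.0 / J))
    return sum(ccc) / J  # unweighted mean over causes

# Perfect prediction on a 3-cause list scores 1.0; a matrix whose rows
# are uniform (like random allocation in expectation) scores 0.0
print(chance_corrected_concordance([[10, 0, 0], [0, 10, 0], [0, 0, 10]]))  # → 1.0
```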
Chance-corrected concordance is an adaptation of a generalization of the sensitivity metric so familiar in epidemiology. It is generalized to account for the polytomous nature of the prediction task. The chance correction is important for making comparisons between classifiers designed for different-length cause lists—shortening the cause list makes the concordance of Random Allocation go up, but leaves CCC unchanged at zero.
CSMF accuracy is a measure of predictive quality at the population level, which quantifies how closely the estimated CSMF values approximate the truth. It can be defined in terms of the normalized row and column sums of the confusion matrix, \( {\mathrm{CSMF}}_j^{\mathrm{true}}={\sum}_{j^{\prime }=1}^J{M}_{j,{j}^{\prime }}/\mathrm{n} \), and \( {\mathrm{CSMF}}_j^{\mathrm{pred}}={\sum}_{j^{\prime }=1}^J{M}_{j^{\prime },j}/\mathrm{n} \), where n is the total number of VAIs.
In this notation,

\( \mathrm{CSMF\ Accuracy}=1-\frac{{\sum}_{j=1}^J\left|{\mathrm{CSMF}}_j^{\mathrm{true}}-{\mathrm{CSMF}}_j^{\mathrm{pred}}\right|}{2\left(1-{\min}_j\,{\mathrm{CSMF}}_j^{\mathrm{true}}\right)}, \)

which has minimum value zero and maximum value one. Unlike CCC, the CSMF accuracy of Random Allocation is greater than zero, a deficiency that this paper seeks to remedy.
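Given vectors of true and predicted cause fractions, CSMF accuracy is a one-line computation; here is a minimal Python sketch (ours, not the authors' code):

```python
def csmf_accuracy(csmf_true, csmf_pred):
    """CSMF accuracy: one minus the total absolute error, normalized by
    the largest error possible, 2 * (1 - min_j CSMF_true_j)."""
    abs_error = sum(abs(t - p) for t, p in zip(csmf_true, csmf_pred))
    return 1.0 - abs_error / (2.0 * (1.0 - min(csmf_true)))

print(csmf_accuracy([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # perfect agreement → 1.0
# Worst case: all predicted mass on the rarest true cause scores ~0
print(csmf_accuracy([0.5, 0.3, 0.2], [0.0, 0.0, 1.0]))
```

The normalizing denominator is what guarantees the worst possible prediction (everything assigned to the rarest true cause) scores exactly zero.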
The importance of randomly resampling the CSMF distribution
These metrics have been widely used in measuring and comparing the quality of a range of verbal autopsy analysis methods [16, 18–22], but their use is complicated by the need to average CCC and CSMF accuracy over the range of possible CSMF distributions. This is particularly relevant for CSMF accuracy, because a classifier that knows the CSMF distribution a priori could perform very well at the population level for that CSMF distribution without getting anything right at the individual level. This might seem like a purely theoretical concern, but a recent paper comparing four approaches to computer-certified verbal autopsy omitted CSMF distribution resampling, which led to counterintuitive and misleading results [23]. To demonstrate this in an extreme example, we developed the population-level prediction scheme RandomFromTrain, where the prediction is random, but with a distribution derived from the training dataset (hence the name RandomFromTrain). This is subtly different from the distribution used in the Random Allocation predictor, and designed so that, in expectation, the CSMFs predicted for the test set match the CSMFs observed in the training set (Table 3).
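In code, the only difference from Random Allocation is the sampling distribution; a sketch (function name ours, for illustration):

```python
import random

def random_from_train(train_causes, n_test, seed=None):
    """RandomFromTrain: predict test-set causes by sampling from the
    empirical cause distribution of the training data."""
    rng = random.Random(seed)
    # Sampling uniformly from the list of training labels reproduces
    # the training CSMF distribution in expectation
    return [rng.choice(train_causes) for _ in range(n_test)]
```

If the training and test sets happen to share a CSMF distribution, this information-free predictor will look excellent at the population level, which is exactly the failure mode the resampling procedure is designed to expose.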
The confusion matrix for RandomFromTrain prediction on the PHMRC validation database adult deaths is shown in Table 4.
Chance-correcting previous results
Although previous work has used a version of CSMF accuracy that is not chance-corrected [12, 13, 16, 18, 19, 21, 22], it would be generally useful to have a metric of population-level accuracy for which a score of zero indicates predictive accuracy equal to Random Allocation. We therefore set out to correct CSMF accuracy for chance, analogously to chance-corrected concordance, and to develop a formula for Chance-Corrected CSMF (CCCSMF) accuracy in which the quality of Random Allocation is 0.0, while perfect prediction scores 1.0. To do this, we performed a Monte Carlo calculation of the CSMF accuracy of Random Allocation, by simulating a dataset with known CSMF distribution, assigning “predicted” causes of death uniformly at random, and measuring the CSMF accuracy of the predictions.
The distribution of the simulated dataset is an important and subtle detail of this calculation. We sampled the true CSMF distribution from an uninformative Dirichlet distribution (a probability distribution over CSMFs which gives equal probability to all possible CSMF distributions) [24]. We generated 10,000 replicates of the Monte Carlo simulation, and calculated the mean CSMF accuracy across all replicates.
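This Monte Carlo procedure can be sketched as follows. The sketch is a simplified illustration using only the standard library (a flat Dirichlet sample is obtained by normalizing independent unit exponentials); the function name and parameter values are ours:

```python
import random

def mean_random_allocation_csmf_accuracy(J, n_deaths, n_reps, seed=0):
    """Monte Carlo estimate of the expected CSMF accuracy of Random
    Allocation, with true CSMFs drawn from a flat Dirichlet(1, ..., 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Flat Dirichlet draw: normalized unit exponentials
        draws = [rng.expovariate(1.0) for _ in range(J)]
        s = sum(draws)
        csmf_true = [d / s for d in draws]
        # Uniformly random "predicted" causes
        counts = [0] * J
        for _ in range(n_deaths):
            counts[rng.randrange(J)] += 1
        csmf_pred = [c / float(n_deaths) for c in counts]
        err = sum(abs(t - p) for t, p in zip(csmf_true, csmf_pred))
        total += 1.0 - err / (2.0 * (1.0 - min(csmf_true)))
    return total / n_reps

# For a 34-cause list the mean is close to 1 - 1/e ≈ 0.632
acc = mean_random_allocation_csmf_accuracy(J=34, n_deaths=500, n_reps=300)
```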
We then used the calculated values to chance-correct the CSMF accuracy, according to the formula

\( \mathrm{CCCSMF\ Accuracy}=\frac{\mathrm{CSMF\ Accuracy}-0.632}{1-0.632}. \)
We also used this simulation framework to perform a Monte Carlo calculation of the concordance of random allocation, which provides a cross-check for the analytical derivation of CCC in Murray et al. [13]. We repeated the simulations for cause lists ranging from 3 to 50 causes.
To demonstrate the utility of this view, we updated the comparative performance plot from Murray et al. [12] for all commonly used methods, to use CCCSMF accuracy as the metric of population-level accuracy. This plot compared a range of VA prediction methods in a range of settings according to CCC and CSMF accuracy, using a database of VAIs whose underlying causes of death were known according to gold-standard clinical diagnostic criteria. As in the previous work, we present results for three age groups separately: Adult, Child, and Neonatal deaths (N = 7,846, 2,064, and 2,625 respectively). For each age group, in addition to analyzing with all available information, we also excluded all answers to questions that require the deceased to have had contact with the health system, such as “Was [name] ever told by a health professional that he or she ever suffered from one of the following?” Following the terminology developed in our previous work, we call these scenarios with and without healthcare experience (HCE).
This simulation setting also provided an opportunity to demonstrate the importance of randomly resampling the cause fractions of the test set from an uninformative Dirichlet distribution (a technical point that has perhaps not been sufficiently appreciated since its introduction in Murray et al. [13]). To do so, we compared the CCCSMF accuracy of Random Allocation with that of RandomFromTrain, where the training data were either uniformly distributed among causes (as we strongly recommend) or distributed according to the same distribution as the test data (as has sometimes been the case in other work [23]).
We conducted all analysis with Python 2.7 (Additional file 1: Supplementary Text 2).
Results
We found that the CSMF Accuracy of Random Allocation decreased slightly and nonlinearly as a function of J across the range considered (Fig. 1), and we proved analytically that it tends towards an asymptotic value of 1 − e^{− 1} ≈ 0.632 as J and N tend to ∞ (Additional file 2: Supplementary Text 1). For simplicity, we use this value to produce the same formula for CCCSMF accuracy for all lengths J of the CoD list (J = 6, 21, and 34 are the lengths of the PHMRC cause lists for neonatal, child, and adult deaths [17]):

\( \mathrm{CCCSMF\ Accuracy}=\left(\mathrm{CSMF\ Accuracy}-0.632\right)/\left(1-0.632\right). \)
We used a Monte Carlo estimation procedure to calculate the concordance of random allocation. The results of the estimates agree precisely with the analytical value of 1/J used for correcting for chance in Murray et al. [13] (Fig. 2, R^2 = 1.0).
Using the chance-corrected metrics for the x- and y-axes, we produced an updated version of the master graphic comparing the individual- and population-level quality of all commonly used VA analysis methods from Murray et al. [12] for neonates, children, and adults, with and without HCE (Fig. 3). For simplicity, we did not include uncertainty quantification, but the same adjustment formula used to transform the point estimate of CSMF accuracy to CCCSMF accuracy can be applied to the upper and lower limits of the CSMF accuracy 95 % CI.
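The transformation applied to the point estimate and CI limits is a simple rescaling; in Python (the numeric inputs below are illustrative, not values from the study):

```python
import math

def chance_correct(csmf_acc, baseline=1.0 - math.exp(-1.0)):
    """Rescale CSMF accuracy so Random Allocation scores ~0 while
    perfect prediction still scores 1 (baseline = 1 - 1/e ≈ 0.632)."""
    return (csmf_acc - baseline) / (1.0 - baseline)

# Apply the same transform to a point estimate and its 95% CI limits
point, lower, upper = 0.74, 0.71, 0.77  # illustrative values only
ccc_point, ccc_lower, ccc_upper = (chance_correct(v) for v in (point, lower, upper))
```

Because the transform is linear and increasing, it preserves the ordering of methods and maps CI limits to CI limits.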
When using the RandomFromTrain approach with training data drawn from the same CSMF distribution as the test data, we measured an unreasonably high CCCSMF Accuracy. Resampling the test-set CSMF distribution from an uninformative Dirichlet fixed this problem, resulting in a CCCSMF accuracy for RandomFromTrain near zero, similar to that of Random Allocation (Table 5).
Discussion
The objective of this study was to develop and test a measure of population-level predictive accuracy that is informative in absolute, as well as relative, terms. We believe that our new metric, Chance-corrected CSMF Accuracy, improves interpretability by increasing the dynamic range of the population-level quality measure; although a method that attains a CSMF Accuracy of 0.632 may sound promising in absolute terms, it is not. As shown above, this is the CSMF Accuracy of random guessing. By subtracting 0.632 from the CSMF Accuracy, random guessing and methods of similar quality are given a score near zero. Rescaling the scores by dividing through by 1 − 0.632 maintains the meaningful upper limit of the quality score, where a CCCSMF Accuracy of 1.0 indicates perfect agreement between truth and prediction.
Unlike chance-corrected concordance, CCCSMF Accuracy is not essential for comparing different-length cause lists. This is because the CSMF Accuracy of Random Allocation is relatively insensitive to changes in cause-list length (it dropped from 0.67 to 0.63 as J ranged from 3 to 50 in Fig. 1). This can be compared with the concordance of random allocation for different-length cause lists, which ranged from 0.35 to 0.02 as J ranged from 3 to 50 in Fig. 2.
Resampling the CSMF distribution is essential when evaluating CSMF and CCCSMF Accuracy; without it, the trivial approach of RandomFromTrain appears to be nearly perfect at the population level. The issue exemplified by RandomFromTrain is not merely a theoretical or pathological concern. It has also shown up in practice when evaluating the King-Lu method (a recently developed method for mapping from VAIs directly to CSMFs) [23]. It is likely also relevant in physician-certified verbal autopsy (PCVA), because physicians may rely on their prior beliefs about the cause composition of deaths. Without resampling the test data, a validation method will not be able to contradict a prior belief, even if the belief is incorrect. In other words, just as in all machine learning evaluations, it is essential to measure CCCSMF accuracy out-of-sample; but, because CSMF and CCCSMF Accuracy are population-level metrics, measuring out-of-sample predictive validity is more complicated than simply using a train/test split. The PHMRC developed a methodology for this which we recommend [13], and we hope that the demonstration of its importance here will aid its uptake.
Limitations
Despite the importance of resampling the CSMF distribution of the test set, the approach is not without limitations. The uninformative Dirichlet assumes that anything can happen in the test CSMF distribution, because the out-of-sample CSMFs are selected uniformly at random. This is the simplest way to address the risk of overfitting, but it is perhaps too tough a challenge, since there is some structure to CSMF distributions that could be assumed.
The VAIs held out for out-of-sample validation were from the same population, selected uniformly at random. This approach may be overly optimistic about performance on VAIs from a completely different population. It would be prudent to replicate validation studies periodically, to guard against differential item functioning and changes in symptomatology.
The premise that every death has a single, underlying cause has been challenged [25, 26], and as the epidemiological transition continues and more individuals experience multiple comorbidities, this simplifying assumption will become even more tenuous. However, we may still hope to provide meaningful information at the population level.
Conclusion
Chance-corrected CSMF accuracy is a simple transformation of CSMF accuracy, but we believe that it provides additional clarity on the absolute and relative performance of VA analysis methods at the population level.
The chance-correction of CSMF Accuracy does not change the overall recommendations from Murray et al.: namely, that the Tariff 2.0 method is preferred for all applications of automated VA methods [12].
As the epidemiological transition, technology, and costs evolve, the accuracy and costeffectiveness of alternative approaches to measuring causes of deaths should continue to be assessed. Further innovation will improve the quality of this critical information for decisionmaking.
Abbreviations
CCC: Chance-corrected concordance
CoD: Cause of death
CCCSMF: Chance-corrected cause-specific mortality fraction
CSMF: Cause-specific mortality fraction
CVRS: Civil and vital registration system
HCE: Health care experience
ML: Machine learning
PCVA: Physician-certified verbal autopsy
PHMRC: Population Health Metrics Research Consortium
VAI: Verbal autopsy interview
References
Mathers CD, Ma Fat D, Inoue M, Rao C, Lopez AD. Counting the dead and what they died from: an assessment of the global status of cause of death data. Bull World Health Organ. 2005;83:171–7.
AbouZahr C, Boerma T. Health information systems: the foundations of public health. Bull World Health Organ. 2005;83:578–83.
Mahapatra P, Shibuya K, Lopez AD, Coullare F, Notzon FC, Rao C, et al. Civil registration systems and vital statistics: successes and missed opportunities. Lancet. 2007;370:1653–63.
Phillips DE, Lozano R, Naghavi M, Atkinson C, GonzalezMedina D, Mikkelsen L, et al. A composite metric for assessing data on mortality and causes of death: the vital statistics performance index. Popul Health Metr. 2014;12:14.
Setel PW, Sankoh O, Rao C, Velkoff VA, Mathers C, Gonghuan Y, et al. Sample registration of vital events with verbal autopsy: a renewed commitment to measuring and monitoring vital statistics. Bull World Health Organ. 2005;83:611–7.
Setel PW, Macfarlane SB, Szreter S, Mikkelsen L, Jha P, Stout S, et al. A scandal of invisibility: making everyone count by counting everyone. Lancet. 2007;370:1569–77.
Mikkelsen L, Phillips DE, AbouZahr C, Setel PW, de Savigny D, Lozano R, et al. A global assessment of civil registration and vital statistics systems: monitoring data quality and progress. Lancet. 2015. doi:10.1016/S0140-6736(15)60171-4.
Yang G, Hu J, Rao KQ, Ma J, Rao C, Lopez AD. Mortality registration and surveillance in China: History, current situation and challenges. Popul Health Metr. 2005;3:3.
Jha P, Gajalakshmi V, Gupta PC, Kumar R, Mony P, Dhingra N, et al. Prospective Study of One Million Deaths in India: Rationale, Design, and Validation Results. PLoS Med. 2005;3, e18.
Sankoh O, Byass P. Causespecific mortality at INDEPTH Health and Demographic Surveillance System Sites in Africa and Asia: concluding synthesis. Glob Health Action. 2014;7.
Lopez AD, Setel PW. Better health intelligence: a new era for civil registration and vital statistics? BMC Med. 2015;13:73.
Murray CJ, Lozano R, Flaxman AD, Serina P, Phillips D, Stewart A, et al. Using verbal autopsy to measure causes of death: the comparative performance of existing methods. BMC Med. 2014;12:5.
Murray CJ, Lozano R, Flaxman AD, Vahdatpour A, Lopez AD. Robust metrics for assessing the performance of different verbal autopsy cause assignment methods in validation studies. Popul Health Metr. 2011;9:28.
Boulle A, Chandramohan D, Weller P. A case study of using artificial neural networks for classifying cause of death from verbal autopsy. Int J Epidemiol. 2001;30:515–20.
Reeves BC, Quigley M. A review of dataderived methods for assigning causes of death from verbal autopsy data. Int J Epidemiol. 1997;26:1080–9.
Flaxman AD, Vahdatpour A, Green S, James SL, Murray CJ. Random forests for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards. Popul Health Metr. 2011;9:29.
Murray CJ, Lopez AD, Black R, Ahuja R, Ali SM, Baqui A, et al. Population Health Metrics Research Consortium gold standard verbal autopsy validation study: design, implementation, and development of analysis datasets. Popul Health Metr. 2011;9:27.
Lozano R, Lopez AD, Atkinson C, Naghavi M, Flaxman AD, Murray CJ, et al. Performance of physiciancertified verbal autopsies: multisite validation study using clinical diagnostic gold standards. Popul Health Metr. 2011;9:32.
James SL, Flaxman AD, Murray CJ. Performance of the Tariff Method: validation of a simple additive algorithm for analysis of verbal autopsies. Popul Health Metr. 2011;9:31.
Murray CJ, James SL, Birnbaum JK, Freeman MK, Lozano R, Lopez AD, et al. Simplified Symptom Pattern Method for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards. Popul Health Metr. 2011;9:30.
Flaxman AD, Vahdatpour A, James SL, Birnbaum JK, Murray CJ. Direct estimation of causespecific mortality fractions from verbal autopsies: multisite validation study using clinical diagnostic gold standards. Popul Health Metr. 2011;9:35.
Lozano R, Freeman MK, James SL, Campbell B, Lopez AD, Flaxman AD, et al. Performance of InterVA for assigning causes of death to verbal autopsies: multisite validation study using clinical diagnostic gold standards. Popul Health Metr. 2011;9:50.
Desai N, Aleksandrowicz L, Miasnikof P, Lu Y, Leitao J, Byass P, et al. Performance of four computercoded verbal autopsy methods for cause of death assignment compared with physician coding on 24,000 deaths in low and middleincome countries. BMC Med. 2014;12:20.
Kotz S, Balakrishnan N, Johnson NL. Continuous Multivariate Distributions, Volume 1, Models and Applications. Wiley; 2000.
Dorn HF, Moriyama IM. Uses and Significance of Multiple Cause Tabulations for Mortality Statistics. Am J Public Health Nations Health. 1964;54:400–6.
Désesquelles AF, Salvatore MA, Pappagallo M, Frova L, Pace M, Meslé F, et al. Analysing Multiple Causes of Death: Which Methods For Which Data? An Application to the CancerRelated Mortality in France and Italy. Eur J Popul Rev Eur Démographie. 2012;28:467–98.
Acknowledgements
This work was supported by a National Health and Medical Research Council of Australia project grant, Improving methods to measure comparable mortality by cause (Grant no. 631494). CIs – ADL, IR, CJLM. The funders had no role in study design, data collection and analysis, interpretation of data, decision to publish, or preparation of the manuscript. The corresponding author had full access to all data analyzed and had final responsibility for the decision to submit this original research paper for publication.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors have developed a software tool that implements the Tariff Method for mapping VAIs to CoDs. This tool is available without charge online, but could constitute a nonfinancial competing interest.
Authors’ contributions
All authors participated in design of the study. ADF conducted the analysis and prepared the first draft. All authors revised the draft and approved the manuscript.
Additional files
Additional file 1:
Supplementary text 2 Python script.
Additional file 2:
Supplementary text 1 Asymptotic calculation. (PDF 183 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Flaxman, A.D., Serina, P.T., Hernandez, B. et al. Measuring causes of death in populations: a new metric that corrects cause-specific mortality fractions for chance. Popul Health Metrics 13, 28 (2015). https://doi.org/10.1186/s12963-015-0061-1