Key Points for Decision Makers

Tuberculosis (TB) unit cost data are available for low- and middle-income countries; however, there is methodological variability in how these data are reported.

There are gaps in the current tuberculosis unit cost literature, particularly for 12 high TB burden countries, for TB vaccination, and for the interventions of above-service costs, TB prevention, and active and intensified case finding.

Going forward, we need to consistently define interventions and follow standard methods to align costing work.

1 Introduction

The epidemic of tuberculosis (TB) remains a global challenge, which has persisted despite longstanding efforts to eliminate it as a public health problem [1]. Substantial TB investment requires economic analysis for guidance and justification; however, such analysis is often constrained both by a lack of data and by the substantial resources needed to produce the necessary cost data [2]. Furthermore, easily accessible TB unit cost data are essential for many purposes, such as National Strategic Planning, budgeting and priority setting [3]. However, unit cost data of variable quality may lead to poor decision making because of biased results. Quality relates to the robustness, precision and reliability of the data, as well as to standard reporting of cost methods and results. If the quality of reporting is high, comparability across cost estimates and settings is enhanced; conversely, if reporting quality is low, it is difficult to use cost estimates appropriately. To improve the nature and use of cost data in priority setting and decision making, understanding the quality of cost data and reporting can help to inform which methods need strengthening and which areas of reporting should be standardised.

Here we present a systematic review of papers reporting TB unit costs from a healthcare provider perspective. The working definition of a unit cost used in this paper is a cost per output, calculated by dividing the total cost by the quantity of output (through top-down, bottom-up or mixed methods); examples include the 'cost per patient treated', the 'cost per test' or the 'cost per patient retained' [4]. Papers were assessed using study quality index indicators that included reporting standards, key cost components, precision and bias [5]. We aimed to summarise, compile and analyse all relevant historical TB cost data in low- and middle-income settings, which could be used to assess the current quality of TB cost data, highlight data gaps in research, inform necessary standards and methods, and be presented as an open access Unit Cost Study Repository on the Global Health Cost Consortium (GHCC) website [6], allowing for the addition of future TB costing work in a standardised way. The focus on low- and middle-income countries was intended to aid decision makers by uncovering available TB unit costs for countries where there is often thought to be a general lack of data. Specifically, the objectives were (1) to descriptively review papers reporting TB unit costs from a provider perspective in low- and middle-income settings, looking at methodological variation; (2) to assess the quality of the standards and methods used in TB unit cost papers with a study quality rating system and to investigate the indicators of reporting quality through machine learning; and (3) to highlight and discuss the data gaps to inform intervention-standardised, targeted TB unit cost collection going forward.
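Expressed as a formula (a restatement of the working definition above rather than any additional methodology), this is:

$$\text{unit cost} = \frac{\text{total cost}}{\text{quantity of output}}$$

so that, for example, the cost per patient treated is the total cost of treatment divided by the number of patients treated.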

2 Methods

2.1 Search Strategy

A systematic review was conducted using disease- and intervention-specific key words relating to cost, tuberculosis and tuberculosis health services including tuberculosis treatment and prevention, which were combined using Boolean operators (see Table 1). We included papers (the term ‘papers’ is used to inclusively refer to the reports, studies, records and articles searched for and retrieved in this review) reporting TB unit costs published between January 1990 and March 2018, with no language or geographical restriction. Eight electronic databases were searched (PubMed, EMBASE, Medline, Econlit, The National Health Service Economic Evaluation Database and The Cost-Effectiveness Analysis Registry, between May and July 2016; Web of Science and Literatura Latino-Americana e do Caribe em Ciências da Saúde in February and March 2017; and through focused Google searches between November 2017 and March 2018). We excluded papers if they had no empirically collected cost data or if the currency or costing perspective could not be identified either from the paper or by contacting the authors. The search results from each database were downloaded into one library in EndNote X8 (Clarivate Analytics).

Table 1 Tuberculosis search terms

The screening of these papers was undertaken in three stages: by title, by abstract and by full text. In stage one, a researcher reviewed all papers by title (15,161), while a second researcher assessed the papers that were excluded based on title screening. Titles that identified animal research led to paper exclusion. In stage two, the resulting 6307 abstracts were screened independently by two researchers and were excluded if they did not report unit costs. In stage three, the 704 papers deemed eligible for full-text review (3% yield from the original search) were screened independently by two researchers. Papers were excluded if the studies were conducted in high-income countries based on the World Bank categorisation for 2017 (n = 177); if no empirical data on prices, quantities or TB-related costs (relevant data) were reported (n = 313), for instance if all the data were modelled or from secondary data sources; if the paper was only a correspondence, editorial, commentary, news piece or protocol (n = 41); or if it contained duplicate cost data (n = 3). We also cross-checked papers against a recent systematic literature review of TB costs for health services and patients [7].

In addition to the published literature search, we searched for grey literature, including through focused Google searches, which yielded a total of 398 potential papers. The grey literature search focused on relevant websites, while the Google searches focused on identifying specific interventions by region. We searched the following sources for relevant papers: the European Association for Grey Literature Exploitation (EAGLE); the System for Information on Grey Literature in Europe (SIGLE, a bibliographic database); and documents and meeting reports from the World Bank and World Health Organization (WHO) websites. The exclusion criteria used for the peer-reviewed literature were also applied to the focused Google searches. These focused Google searches were completed between November 2017 and March 2018 with the following search string format: “[intervention name]” costs [one of the following: Africa, Asia, East Europe] -US (where “-” is the Boolean operator that excludes a term) [5]. We reviewed the first 50 documents returned by Google for different websites (see Supplementary Table 1 in the electronic supplementary material [ESM]). The final number of grey literature papers identified as potentially containing TB unit cost data was 31 (8%). We excluded duplicates and followed the same screening process as for the peer-reviewed literature. When an abstract was not available, the executive summary was assessed.

We expected that most unit costs would be sourced from papers with noneconomic primary outcomes, as there have been very few costing studies in TB with large sample sizes. Searches were deliberately broad to encompass as many papers with costs as possible, whether their primary outcome was economic or noneconomic. This increased the number of papers that were not relevant, and so care was taken to ensure that the exclusion criteria were correctly applied.

2.1.1 Data Extraction

We developed a comprehensive extraction tool in Microsoft Excel as part of the GHCC [5], which allowed us to describe the methods used, appraise the quality and reporting standards of included papers, and extract costs (not presented here). The extraction tool was developed by GHCC in order to assess quality and compliance with the GHCC Reference Case for Estimating the Costs of Global Health Services and Interventions [3] and was aligned with research needs (paper metadata and methodology) while also optimising data collection for open access publication in the Unit Cost Study Repository on the GHCC website (which displays fully disaggregated cost data).

Categorisation of unit costs was done according to GHCC’s intervention-standardised unit cost typology. Unit cost data were allocated to one of six TB interventions: above-service costs, active case finding (ACF), intensified case finding (ICF), passive case finding (PCF), TB prevention and TB treatment (see Supplementary Table 2 in the ESM). TB case detection and diagnosis can be done in three different ways: through passive case finding, the detection of tuberculosis cases among individuals who present themselves at health facilities; active case finding, the detection of cases among those not attending health facilities, such as through outreach services; and intensified case finding, the detection of cases in high-risk populations (see ESM for further explanation). Unit costs were further allocated to standard categories including geography (e.g. country, urbanicity), target population (e.g. demographic, clinical), implementation (e.g. platform, ownership, target populations, technology), costing methodology (e.g. perspective, economic or financial cost), and detailed information about how the intervention and study were conducted (e.g. year of the cost data collection, discount rate). Data were extracted on study scope, sampling, methods, inclusion of costs, valuation and analysis. We assessed whether specific information was explicitly stated, easily inferable or not reported (NR). Double extraction was performed by two teams of two extractors each, with significant interaction among the extractors. Further quality assurance work was undertaken by a senior researcher.

A study quality rating system was developed and applied to this database (and to the HIV costing papers in the GHCC Unit Cost Study Repository, which is available online) [5]. This study quality rating system assesses four categories and results in a composite quality rating made up of four letters, each with four possible levels (A–D), with A representing the highest quartile of scores and B–D representing the three successively lower quartiles (see Table 2 for the quality rating of each paper in this review). The four quality rating indicators have equal weighting for assessing quality; AAAA would be the highest composite score and DDDD the lowest possible composite score under this system. Under the first category, Key Cost Components (note that different terms/labels were used in the original GHCC study quality rating system: ‘Bias low’ has been changed to Key Cost Components in this review and ‘Bias high’ has been changed to Bias), a paper receives an ‘A’ if it appropriately accounts for and reports key cost components such as above-service delivery costs, overhead costs, personnel inefficiency/downtime adjustment, and the valuation of volunteer time. This provides a signal of the completeness of the cost estimate. For the second category, Bias, an ‘A’ is given if the paper appropriately annuitises capital costs (deemed appropriate depending on the stated time since the programme started) and omits unrelated costs, for instance unrelated research costs. These are not the only sources of bias in costing papers, but as with the Key Cost Components, these indicators are considered a signal of potential bias. For the third category, Precision, an ‘A’ is given if the paper’s sampling, data collection and reporting of the cost estimate correctly reflect the level of precision of the study. Relevant aspects include sampling at a country or site level as appropriate, selecting and reporting a relevant cost allocation method, resource identification, the method of measuring output, and the number of sites selected. The fourth category is Reporting. An ‘A’ is given for this category if the authors explicitly report key methods and results, such as the urbanicity (rural, urban, peri-urban or a mixture), the ownership (public, NGO, etc.), intervention components and the breakdown by activity. In all the categories outlined above, completely omitting or failing to account for these aspects would result in a ‘D’. A low score for Key Cost Components could indicate an underestimation of unit costs, while a low score for Bias could indicate an overestimation (for instance, if the paper inappropriately included research costs, did not amortise, or only costed the initial time period (< 6 months) of a new intervention or programme). The scoring rubric can be found in Supplementary Table 3 (see ESM), and further information on this quality rating can be found in the paper “Developing the Global Health Cost Consortium Unit Cost Study Repository for HIV and TB: Methodology and Lessons Learned” [5].
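As context for the annuitisation criterion under Bias, one standard way to annuitise a capital cost (shown here only as an illustration; the reviewed papers used their own stated or unstated methods) is to spread the purchase price over the item's useful life using an annuity factor:

$$E = \frac{P \times r}{1 - (1 + r)^{-n}}$$

where $E$ is the equivalent annual cost, $P$ the purchase (replacement) price, $r$ the discount rate and $n$ the expected useful life in years. As a purely hypothetical example, a $1000 item with a 10-year useful life at a 3% discount rate would be annuitised at approximately $117 per year.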

Table 2 List of all papers in review including final quality rating

2.1.2 Data Analysis

We conducted a descriptive analysis looking at study characteristics and the frequencies of reporting that met the GHCC Reference Case [3] checklist (see Supplementary Table 4 in the ESM). The 17 principles of the GHCC Reference Case [3], which cover study design, resource use measurement, application of pricing and valuation, and application of reporting and analysis, were kept in mind when developing the data analysis. Many of the variables analysed were included in the study quality rating system; however, this descriptive analysis was done separately from the application of the quality rating.

In addition, to investigate the indicators of reporting quality (variables in a costing that may result in a higher reporting quality rating), we used R in RStudio to perform basic machine learning and determine which variables (features) were important in the reporting quality rating. Machine learning is a way to understand the structure of the data. In the case of this review, it automates the process of predicting a categorical variable by developing a classification tree based on decision rules (an algorithm). It takes the full dataset and classifies it as best it can into subsets, so that the data in each subset are as homogeneous as possible. A classification tree starts with a node that represents all the data, and then branches out, with a decision being made at every branch point. It has the strength of being easy to interpret by looking at the tree. However, it has the drawback that small changes in the data can result in very different trees.

This machine learning meant taking the reporting quality score variable (either an ‘A’, ‘B’ or ‘C’ as no papers received a ‘D’) and objectively ascertaining which of a subset of variables was the most important in influencing this score in order to comment on which aspects should be focused on in future studies. This method of recursive partitioning (utilising the ‘rpart’ package) allows one to train a classification model using the data and then evaluate that model with a retained portion of that same database. Recursive partitioning means that the model takes the data and splits it based on the variables in the model in order to better understand the variable of interest, in this case the reporting quality variable.

The database was relatively small (only 103 rows and 140 variables/features), and within each variable there was a high level of variability. After exploratory analysis, the variables were limited to a subset of study characteristics and methods used in the papers (see Table 3) that had complete data and less variability in their categorisation and that were thought to be important in predicting reporting quality (for instance, the disaggregated unit cost data in US dollars were not included, and neither were more subjective variables such as which costs were omitted and the justification for this). The model was trained using all the papers (103) in the database, and a random 20% of the same dataset was subsequently used for testing. A confusion index was used to assess the accuracy of the model’s predictive power; it compares the known ratings with those predicted by the model to test how well the model predicts the reporting quality rating in the test data. Diagrammatically, this model is visualised as a classification tree (see Fig. 1).
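As an illustration of this workflow, a minimal sketch in R using the ‘rpart’ package is given below. The data frame and variable names (e.g. papers, reporting_quality) are hypothetical placeholders standing in for the Table 3 fields rather than the actual code used for this review.

library(rpart)

# 'papers' is a hypothetical data frame with one row per paper and the
# Table 3 features; 'reporting_quality' holds the A/B/C rating
fit <- rpart(reporting_quality ~ intervention + urbanicity + site_sampling +
               platform + timing + ownership + reference_year,
             data = papers, method = "class")  # classification, not regression

# Retain a random 20% of the same dataset for testing, as described above
set.seed(1)
test <- papers[sample(nrow(papers), size = round(0.2 * nrow(papers))), ]

# Compare known and predicted ratings in a confusion matrix ("confusion index")
pred <- predict(fit, newdata = test, type = "class")
confusion <- table(observed = test$reporting_quality, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)

# The fitted tree can be drawn in the style of Fig. 1 with, for example,
# rpart.plot::rpart.plot(fit, extra = 104)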

Table 3 Key features with number of papers by category in brackets
Fig. 1 Classification tree. The label for the node reflects the dominant class—i.e. node 2 = A. The next set of numbers in the node are the proportions of the classes in the node—i.e. node 2 = 53% A, 37% B and 10% C (note the proportions here are conditional on the decisions that have already been made to reach that point in the tree). The final number in the node is the proportion of data in the node—i.e. node 2 = 60% of the data (40% falls in node 3)

3 Results

3.1 Paper Selection

We identified 21,691 papers through our search strategy. A total of 15,161 papers remained after excluding duplicates (see Fig. 2 for results by database). The systematic searches described above resulted in 170 papers containing empirically collected patient and provider costs relating to TB in low- and middle-income countries. Data from 103 cost papers (see Table 2) relating to TB costs from the provider perspective were included, representing 627 unit costs for 34 countries (see Fig. 3 and Supplementary Table 5 [ESM]).

Fig. 2 PRISMA flowchart

Fig. 3 Number of unit costs for TB interventions available in low- and middle-income countries

3.2 Study Characteristics

The costing purpose of the studies was extracted for each paper and was quite diverse given that the studies were a mix of costing done at different health facility levels, in different settings, for the purposes of economic evaluation, priority setting, budgeting, financial planning and technical efficiency analysis (Supplementary Table 4, see ESM).

Table 4 shows the number of papers and the number of unit costs by type of intervention (Supplementary Table 5 indicates the number of papers and the number of unit costs for all 34 countries, see ESM). Some papers reported on several unit costs across multiple interventions and several countries. There were three multi-country studies [8,9,10] covering Cambodia, Georgia, Kenya and Eswatini [8]; Syrian Arab Republic and Egypt [9]; and Malawi, Mozambique and United Republic of Tanzania [10]. Only 18 of the 30 high TB burden countries (60%) (see Table 5) as listed by WHO [1] have unit costs, while TB interventions in the remaining 12 high burden countries (to our knowledge and for the search time period) have not been costed or there are no costs in the public domain (Angola, Central African Republic, Congo, Democratic People’s Republic of Korea, Democratic Republic of the Congo, Lesotho, Liberia, Myanmar, Namibia, Papua New Guinea, Sierra Leone and Vietnam). These 18 high burden countries represent 75 papers in the database (73%, i.e. 27% of papers are for countries that are not on the high TB burden list) and 443 unit costs (71%). South Africa has the highest number of TB unit costs in the dataset (154 unit costs from 28 papers, see Table 5 and Fig. 3). These cover ACF, PCF, ICF, TB treatment and TB prevention, but no papers explicitly look only at above-service costs.

Table 4 Representation of intervention unit cost by articles/reports
Table 5 Visual representation of the number of tuberculosis papers and unit costs for countries on the World Health Organization’s 30 high burden country list

The period between 1991 and the end of 2005 accounts for 29% of the papers obtained and extracted (Fig. 4). No costing papers were identified for TB vaccination, either as an intervention or as a technology under TB prevention (despite further targeted searching for Bacillus Calmette–Guérin [BCG] vaccination costing work). Only one paper in the past decade estimated above-service costs [11] as its main intervention, and none did so before then. There were 19 papers that identified and listed some above-service-level costs [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29], but these were not the core intervention reported on. Unit costs for ACF have only been estimated from 2006 onwards and remain underrepresented compared with PCF and TB treatment. There were papers looking at the costs of ICF from as early as 2000, but these are again underrepresented (compared with PCF and TB treatment). The majority of the unit costs (87% combined) in this dataset are for the two interventions of TB treatment (409 unit cost estimates) and PCF (139 unit cost estimates) (see Table 4). Extracted unit costs were not distributed equally across included papers. For example, within technology types for PCF interventions, there is only one paper that looks at loop-mediated isothermal amplification (LAMP) [30], two that look at line-probe assay (LPA) [31, 32], and three that assess polymerase chain reaction (PCR) [24, 33, 34]. In comparison, among papers that included unit costs for TB prevention, 63% of the unit costs are provided by seven papers assessing isoniazid preventive therapy (IPT) [35,36,37,38,39,40,41] in five countries, only three of which are among the 30 high burden countries.

Fig. 4 Represented years

The mean time lag between reported currency year and publication of costs (reference year) was 3 years with a standard deviation of ± 1.6 years, a maximum of 6 and a minimum of 0 years. Seven lead authors have more than one paper in this dataset: Atif [42, 43], Floyd [12, 44, 45], Kamolratanakul [46, 47], Schnippel [15, 48], Sinanovic [49,50,51,52], van Cleeff [33, 53] and Vassall [9, 54, 55]. Of the 560 unique authors, 9% of the contributions are made by 2% of the authors (11 authors: Floyd, Sinanovic, Vassall, Churchyard, Dowdy, McCarthy, Ramma, Sanne, Schnippel, Sohn, and Stevens). There are two single-authored papers [28, 56], and one 20-author paper [25] (see Table 2).

The setting (urbanicity) was not reported for 17 papers, and the predominant category (40%) was a mixed (mixture) setting (Table 3). The majority of sampling at the site level was purposive (34 papers) or convenience (40 papers) in nature (72% together), while only 10% of papers reported random sampling methods. The main type of ownership was public (84 papers), and fixed facility (83) platforms were the most numerous (nine papers did not comment on the platform type). Cross-sectional timing was the most frequent, with 61 papers reporting this method; the remainder had multiple time points (34) or timing was not reported (8). Principle 16 of the GHCC Reference Case [3] is linked to characterising uncertainty associated with unit cost estimates; however, no sensitivity analyses were reported or undertaken for 56 papers and, similarly, economies of scale were not reported in 57 papers. Papers with economic methods (77) outweighed financial ones (23); however, the number of papers using real-world costing was almost equal to the number using per protocol (normative) costing (45 and 46, respectively). Of those that reported the number of sites selected, the maximum number of sites in a study was 149 [57] and the minimum was one; however, the mean was 10, with a standard deviation of 21 (22 papers did not report on the number of sites selected). There were 15 papers with 10 or more sites in their selection (see Table 2).

3.3 Methods Utilised in Papers

We found heterogeneity in the methods used in the papers reviewed. The output units were diverse even after efforts were made to group and classify them into more standard categories. Only four papers explicitly stated that research costs were included, while 10 stated that research costs were excluded (linked to Principle 6 of the GHCC Reference Case [3]). Overhead costs were included, at least in part, in 31 of the papers; nine reported that overhead costs were not included and 22 papers did not mention or include overhead costs. Approximately half (52 papers) mentioned currency exchange methods, but the remainder did not. Just over a quarter of papers (27%) stated a discount rate, of which the majority (71% of those that reported a discount rate) cited a 3% discount rate. The division between economic and financial costing was relatively even within the database, with the number of cross-sectional papers outweighing papers with multiple time points. Top-down methodology appeared to be used in nine papers, bottom-up in 27 papers, and a mixed methodology (use of both top-down and bottom-up elements) in four; however, the remaining 63 papers did not state the methodology used, nor could it be inferred. Cost allocations were recorded for 27 papers and were specific to study setting and context, making them difficult to standardise. Only seven papers acknowledged their amortisation methods (six were deemed to be appropriate and one stated that no amortisation was undertaken); for 49 papers it appeared not to be appropriate to amortise, and the remainder (47 papers) did not state their amortisation method, nor could it be inferred by the reviewers. For some of the papers that did not report their amortisation methods, amortisation may not have been necessary. The timing of data collection was prospective in 26 papers, retrospective in 55 and not reported (or inferable) in 22 (linked to Principle 10 of the GHCC Reference Case [3]). Almost three quarters of papers (73%) reported their study limitations. In terms of conflicts of interest, 42 papers (41% of the database) made a declaration, and in 31% of these (13 of the 42 papers) the authors declared a conflict of interest. Of the 103 papers, 61 did not make a declaration regarding conflicts of interest.

3.4 Quality Rating

Only one paper received an AAAA composite quality rating [58], while the next highest ratings of AAAB, AABA or ABAA were received by nine papers. Nine papers were assigned an ABDC, ACDB or CADB rating and, as equal weighting has been attributed to the categories, these papers had the ‘lowest’ quality rating (Table 2). Overall, the highest ratings for the papers in the database were given for the category of Key Cost Components (96% of the papers received an ‘A’), while the lowest were given for Precision (48% obtained a ‘D’ rating) (see Table 6).

Table 6 Quality rating as a percentage for 103 articles/reports

3.5 Classification Model

A classification model was built as a function of 12 exogenous and endogenous variables [Reference Year, Urbanicity, Ownership, Platform, Intervention, Timing, Site Sampling, Economic versus Financial, Real World versus Per Protocol, Sensitivity Analysis, Economies of Scale, and the Omitted Summary (see Table 3)] and the resulting classification tree was constructed focusing on their link to reporting quality (see Fig. 1). It indicates the features that were more influential in the reporting quality rating (highlighted in dark grey in Table 3 and indicated in the nodes of Fig. 1); however, the accuracy (as displayed in the confusion index) when tested on 20% of the data (test data) continued to vary (from 40% to 85%) depending on the random selection of the test data from the database. Hence, the model is not a good predictor of quality, but we can take from this classification tree that these features are important signals of reporting quality rating.

The classification tree built using recursive partitioning (n = 103) has 13 nodes, of which seven are terminal nodes (reached once the data have been subset as much as possible). The first node represents all the data, with subsequent nodes representing a categorisation or decision being made, the purpose being to subset the data repeatedly until they are as homogeneous as possible. This helps us to identify which variables could be indicators of quality. The first node is blue because there are more ‘B’s (n = 47; 46% of the data [written as 0.46 in Fig. 1]) than ‘A’s (n = 45; 44% of the data [written as 0.44 in Fig. 1]) and ‘C’s (n = 11; 11% of the data [written as 0.11 in Fig. 1]) in the dataset (i.e. ‘B’ is the dominant class). The probabilities in each node are conditional on the decisions made in the tree on the way to that node.

The features of ‘Intervention’, followed by ‘Urbanicity’ and ‘Site Sampling’, are the most likely indicators of quality of reporting, as displayed by the classification tree built using machine learning (Fig. 1). ‘Intervention’ type appears to be the most important variable for signalling a reporting quality rating (using the study quality rating system [5]), specifically if ICF or TB treatment are the interventions costed (the first branch of the tree results in 60% of the data falling in node two on the right and 40% in node three on the left; there is a 0.53 probability of receiving an ‘A’ rating for reporting, as indicated in node two). Following the branch to the left, the next most important aspect for obtaining an ‘A’ is ‘Urbanicity’: an ‘A’ rating was more likely if rural, urban or a mixture of settings was selected, as opposed to the setting not being reported (NR) or peri-urban settings being listed; if the entire programme was assessed or random sampling was undertaken; or if cluster, convenience, NR or purposive site sampling was used but timing was at a single point in time. However, following the right-hand branch, if the ‘Intervention’ was PCF, ACF, above-service costs or TB prevention, ‘Site Sampling’ was cluster, purposive or random and the ‘Platform’ was a fixed facility, an ‘A’ rating was also more likely.

4 Discussion

The contribution of this review is to expose the availability of 103 TB unit cost papers, containing 627 unit costs, in which authors took a provider perspective for low- and middle-income countries; these were systematically extracted from a diverse set of papers and presented for analysis using an intervention-standardised typology. The use of the study quality rating system, as well as the descriptive analysis of methodology, has highlighted variability in reporting, particularly around uncertainty, sampling methods and the use of top-down, bottom-up and mixed methods for cost collection. Machine learning, which to our knowledge has not previously been used to examine signals of quality in a systematic review, was utilised to identify indicators of confidence in the unit costs produced.

In these data, we have identified three main gaps. First, TB cost data from a provider’s perspective span almost three decades, leaving some areas out of date; second, 12 of the 30 high TB burden countries do not have unit costs available, or these have not been reported; and third, some TB interventions are not well represented: for instance, unit costs for TB vaccination (0% of the unit costs in this database), comprehensive above-service (level) costs, ACF, ICF and TB prevention are limited to just a handful of papers (13% of unit costs in this review).

While the purpose of the costing studies was well defined for the 103 extracted provider perspective papers, there was heterogeneity in the methods used to estimate costs, especially with respect to the methods used to measure and allocate costs. Some aspects of the papers had poor reporting, such as whether ‘above facility costs’ and unrelated costs were included, which resulted in these fields having many NR (not reported) entries. Principle 13 of the GHCC Reference Case [3] assesses whether currency conversion and discount rates are clearly stated (amongst other criteria). There was under-reporting of discount rates and currency conversion; however, this may in part have been because (especially with regard to discounting) they were not relevant to these studies.

In looking at the study characteristics, the descriptive findings and the methodological principles laid out by the GHCC Reference Case [3], this review of TB costing papers indicates that transparency in methods is limited owing to a lack of standard reporting of methods and results. In cases where methods are reported well, variation in the approaches used to measure costs is still observed. Variability makes comparison difficult and creates concern about which unit costs to rely on [59]. It is not just variability in methods and approaches that makes it difficult to know which costs are reliable: some costs are incomplete, others are poorly described or may have been incorrectly categorised by the authors. The findings from this review of cost methods and reporting have been used to iteratively guide the development of the GHCC Reference Case for Estimating the Costs of Global Health Services and Interventions [3] and the Costing Guidelines for Tuberculosis Interventions [4]; as they were developed in parallel, there was cross-learning between the two. The GHCC Reference Case [3] encourages researchers to be explicit and open in how they estimate the cost of interventions, while the Costing Guidelines for Tuberculosis Interventions inform empirical cost collection methods [4]. Subsequent to this review, two intervention categories will be added (vaccination and TB infection control), and above-service-level costs may be renamed TB policy, planning, coordination and management.

Currently, the Key Cost Components, Bias, Precision and Reporting indicators have equal weight, so we are unable to comment on whether one indicator is more important or relevant than another. The four dimensions of quality used in this review are quite different. Lack of completeness and bias are important issues and could indicate overall poor quality. Reporting allows one to assess quality and completeness and so is critical; hence it was a central focus of this review. Precision relates to sampling, which has not been a historical focus of TB costing, particularly because it is common for only a few sites to be selected, often purposively or for convenience. There were very few papers with a large number of sites, and around 20% did not report on the number of sites at all; these factors are likely to have affected precision. Going forward, sampling should be encouraged, as should larger TB costing studies for TB interventions where there is an identified lack of data (vaccination, above-service costs, TB prevention, ACF and ICF). It is important to keep in mind that the study quality rating system assesses the appropriateness of a paper’s methods rather than giving a definitive measure of quality or causation.

Understanding the quality of cost data and reporting can help to inform which methods need strengthening and which areas of reporting should be standardised. When we used machine learning and ran the model repeatedly, the accuracy level, as determined using a confusion index, fluctuated markedly, indicating that the model did not reliably predict that a paper would score a particular rating (‘A’, ‘B’ or ‘C’; no papers received a ‘D’ for this category) for reporting, owing to the variability within the features and to the limited training afforded by the comparatively small dataset. The machine learning was focused on the reporting of these variables. Nevertheless, papers that report on ‘Intervention’ (in particular), ‘Urbanicity’, ‘Site Sampling’, ‘Platform’ and ‘Timing’ were identified by the machine learning model as more indicative of the quality of reporting standards overall than reporting on ‘Ownership’, ‘Real World versus Per Protocol Costing’, ‘Economies of Scale’, ‘Omissions’, ‘Economic versus Financial’ costing, ‘Sensitivity Analysis’ and the ‘Reference Year’.

The economic implications of this review are that TB unit costs have been identified, aligned to an intervention-standardised typology and placed in an open access Unit Cost Study Repository. These easily accessible data are particularly useful for modellers and TB decision makers. The indication from machine learning is that reporting on ‘Intervention’ (in particular), ‘Urbanicity’ and ‘Site Sampling’ can guide those undertaking costing work, which often feeds into National Strategic Planning, economic evaluation, budgeting and priority setting.

There are, of course, several considerations. The first is that the database primarily assesses what has been reported, so something may have been done but not written down, or may have been reported but only done at a superficial level. A second consideration is that the reporting quality rating is determined by the data that are included, and some of those aspects are assessed again in the recursive partitioning, which could have led to overfitting of the model. The database did require inference by the study team when authors had not been explicit in their reporting, and so there could also have been limitations in the fields used to assess quality or even in the choice of fields to include in the quality rating. The search strategy may have been a limiting factor in uncovering TB vaccination costs (BCG in particular); however, a separate targeted search was conducted specifically for this and no empirical TB costs were found. Since the completion of the search, a valuable database of vaccination costs, the Immunization Delivery Cost Catalogue, has been made publicly available and can be found online at: http://immunizationeconomics.org/ican-idcc. This catalogue made clear that a search for BCG costs would need to be aimed at childhood vaccinations, which often form part of an immunisation programme or package of vaccines, and so we were unlikely to pick up studies related to BCG with our selected search terms. In future reviews, we suggest using more search terms linked to TB vaccination and TB infection control, which have been added to the GHCC intervention-standardised list subsequent to this review.

5 Conclusion

This study demonstrates the variability in the reporting and methods used to estimate TB costs from a provider perspective. Those working in TB and TB decision making would benefit if researchers explicitly reported what they did and increased their transparency regarding the methods they used to estimate the costs of TB interventions. We suggest being unambiguous about the methods used, such as reporting mixed or separate top-down and bottom-up costing, and standardising unit costs to ease comparison between different papers’ findings (Supplementary Table 4, see ESM). A set of guidelines for TB costing drawing from this review further explains the important methodological aspects to focus on [4]. Going forward, the unit cost gap needs to be addressed for TB vaccination (especially new vaccine candidates as they emerge), above-service costs, ACF, ICF and TB prevention, and for the remaining 12 high burden countries without TB unit costs in the literature (Angola, Democratic People’s Republic of Korea, Democratic Republic of the Congo, Myanmar, Vietnam, Central African Republic, Congo, Lesotho, Liberia, Namibia, Papua New Guinea and Sierra Leone).