Introduction

In recent years, the increasing incidence of thyroid cancer has repeatedly been discussed in the literature. Several studies from different countries document this increase [1,2,3,4,5]. According to a study by Vacarella et al. [6], the age-standardized incidence of thyroid cancer of women in the United States increased from 9.1 cases per 100,000 inhabitants in 1988–1992 to 19.2 cases per 100,000 inhabitants in 2003–2007. A similar change can also be observed in European countries, where for instance the incidence of thyroid cancer of women in France increased from 6.9 to 16 cases per 100,000 inhabitants over the same period. The most significant change in the incidence of thyroid cancer was probably observed in South Korea. There has been an increase from 12.2 to 59.9 cases per 100,000 inhabitants. However, in terms of absolute numbers thyroid cancer mortality has hardly changed or even decreased [1,2,3, 7]. This puzzle can partly be explained by the majority of cases detected being so-called papillary carcinomas, which mostly have a favorable prognosis. Patients with papillary cancer have a 10-year survival rate of 80–90% [8, 9].

The phenomenon of an increasing incidence combined with almost stable mortality is widely accepted as a sign of overdiagnosis, which can be defined as “the detection of indolent pathology where treatment cannot provide benefit”[10],Footnote 1 while overtreatment means “that a treatment provides no benefit for the diagnosed condition” [11]. A high number of thyroid nodules suspicious of cancer turn out to be benign postoperatively and therefore their removal dispensable in retrospect. In Germany having the highest rates of thyroidectomies worldwide the ratio of malign to benign nodules as diagnosed histologically after their removal is 1:15, doing harm also to those without cancer [12]. Both, overdiagnosis and overtreatment which includes but goes beyond the treatment of overdiagnosed conditions, are summarized under the catchword medical overuse [13].

One of the consequences of non-indicated diagnostics is the risk of so-called cascade effects, defined as processes that, once initiated, proceed step-by-step until their seemingly almost inevitable outcome [14]. Ultrasound examination of the thyroid gland may act as a starting point of such a cascade initiating a process of further testing and controls finally ending in unnecessary thyroidectomies and radioiodine therapy. The increased sensitivity of diagnostic tests (e.g. ultrasound with 13 instead of 7.5 megahertz) contributes significantly to overdiagnosis, because there is a high risk of identifying benign nodules and non-fatal carcinomas, which are then further diagnosed and treated [15].

Identifying such cascades, in particular the factors by which they are triggered, is an important topic to be researched. Preventing such cascades from starting and accelerating can contribute to a better allocation of limited resources in a health system and improve patients’ wellbeing [16]. Since the number of persons diagnosed with thyroid nodules and thyroid cancer continues to rise and the duration of treatment is usually substantial due to the majority of thyroid carcinomas being not metastatic [17], better knowledge of the associated costs is required. Overdiagnosis and overtreatment are potential problems from an ethical as well as from an economic perspective. Medical services that do not provide any or only little benefit to patients or the harms of which exceed their potential benefits are wasting money that is urgently needed in other places [18].

As of now, the monetary costs of thyroid nodules and cancer care caused by medical overuse in Germany are not comprehensively evaluated. The contribution of this paper is to provide approximate answers to the questions whether questionable early thyroid ultrasound leads to cascade effects in the care of affected patients and what effects these avoidable cascades have on expenditures in the health care system. The aim of this paper is, therefore, to get more insights into the costs of medical overuse.

Methods

Data, sample selection and matching

Data

The analysis is based on two sources of quarterly administrative billing data from Germany, for the years 2012 to 2016. Two data sets have been analyzed: one data set was provided by the Bavarian Association of Statutory Health Insurance Physicians (Kassenärztliche Vereinigung Bayerns, KVB), which represents all physicians and psychotherapists licensed for outpatient care, under the roof of the social health insurance, in the state of Bavaria. All ambulatory care physicians send their reimbursement claims quarterly to their corresponding Association of Statutory Health Insurance Physicians. The data set contains information about patient characteristics (age, sex, ICD diagnosis, billing codes for medical and diagnostic treatment) and physicians’ characteristics, e.g. whether it is a group practice or single handed. The data can also be linked to information of the region as for example density of population according to the Federal Office for Building and Regional Planning or deprivation [19]. These data hence do not comprise information on inpatient treatment, results of medical tests or procedures and they are confined to patients from a single federal state. Yet, since the data comprises (almost) all publicly insured outpatient cases in Bavaria, the second biggest federal state of Germany, the number of observations is still very large. The second data set was provided by the Corporation for Efficiency and Quality in Health Insurance (Gesellschaft für Wirtschaftlichkeit und Qualität bei Krankenkassen, GWQ). The GWQ is owned by—and provides data services to—25 of in particular company-basedFootnote 2 sickness funds. However, only ten health insurance companies agreed to participate in the project. Accordingly, only the data from these funds could be used. Like the KVB data, the GWQ data include information about both patients and physicians (see above). The GWQ data are not limited to outpatient care but also include information on inpatient stays, prescribed drugs, and sick pay. Moreover, since company-based sickness funds operate nationwide, the data cover Germany as a whole.

The populations from which the two data sets originate hence differ. Besides the geographic confinement of the KVB data, self-selection into company-based health insurance is an issue. Individuals insured with company-based funds on average are younger [20] and healthier [21] than the overall population. Neither of the data sets used includes privately insured individuals. Though some patients may be included in both the KVB and the GWQ data, the number of observations for which this applies is likely to be small. The intention to use two different sources of observations is hence not to match information on the individual patient levelFootnote 3 but to compare results across different data sources that vary in several dimensions as described above and to exploit the relative advantages of the two data sources, which in terms of size and self-selection issues are with the KVB data and in terms of comprehensiveness of the cost information are with the GWQ data.

Sample description

Both data sources comprise patients who were 18 years or older and received a TSH test (EBM code 32101) for the first time in 2012. Patients who received any thyroid-specific tests or diagnoses in 2010 or 2011 have been excluded.Footnote 4 Patients had to be insured throughout 2010 to 2012, while gaps of a maximum of 30 days were allowed. Patients aged 110 years and older and those who lacked unique information regarding gender and date of birth were excluded. This also applies to individuals for whom implausible diagnoses were reported.Footnote 5

The sample taken was then split into two groups: a so-called ‘observation group’ and a ‘control group’. We used this terminology to emphasize the quasi experimental design of the analysis that we aimed to achieve through matching (described below). The analysis was based on observational data and did not involve any randomized treatment or intervention that was under the control of the experimenter. Patients in the observation group had an initial TSH test in 2012Footnote 6 and a thyroid ultrasound (EBM code 33012) within four subsequent weeks (0–28 days). The control group consisted of patients who received a TSH test in 2012 but received either no ultrasound at all or 28 days after the TSH test at the earliest. The definition of the two groups is based on the recommendations of the guideline of the German College of General Practitioners and Family Physicians (DEGAM) [22]. The guideline recommends in case of a first abnormal TSH test and an inconspicuous anamnesis, a second TSH test to be performed. In case of a second abnormal TSH test, further laboratory tests (ft4, ft3, antibodies) are recommended. According to international guidelines an ultrasound of the thyroid gland would only make sense in case of a palpably enlarged goiter, palpable thyroid nodules or lymph nodules or in case of hyperthyroidism in the absence of serological markers for thyroiditis or Graves’ disease, because then it might be caused by an autonomous adenoma [23, 24]. As all these findings are rare, it can be assumed that the vast majority of early ultrasound examinations have to be regarded as non-indicated, constituting an unjustified screening. Patients in the observation group with an initial diagnosis of hypo- or hyperthyroidism in the uptake quarter were excluded, as in these cases the early use of an ultrasound might have been reasonable.

The considered outcomes are all cost measures that are calculated from the billing information available in the two data sources used. Since outpatient services are reimbursed on a quarterly basis, all cost measures refer to costs per quarter. In other words, the unit of time considered in the analysis is the quarter of a year.

Matching

We applied propensity score matching to establish comparable baseline conditions for the observation and control groups to be able to measure the cost effects of the (potentially redundant) ultrasound examination. The propensity score, which equals the likelihood of receiving an ultrasound within 28 days of the initial TSH test, was calculated via logistic regression.Footnote 7 Our matching variables were based on information gathered prior to the initial TSH. It included socioeconomic information such as the patient’s age, gender and place of residence. Additionally, we included medical variables like the reason for physician consultation—in particular whether a TSH test was accompanied by a specific diagnosis i.e. whether a relevant complaint existed or whether the TSH was determined only routinely,Footnote 8 and the number of applicable risk groups according to the grouper suggested by InBA (Institut des Bewertungsausschusses) as an indicator for multi-morbidity [25]. We opted for one-to-one nearest neighbor matching. That is, each patient from the observation group was assigned one patient from the control group, whose probability of receiving the treatment was most similar. The observed outcome (costs) of this individual, hence, served as the estimated counterfactual outcome of the treated patient i.e. as an estimate of the health cost that would have been observed if the patient had not undergone a non-indicated ultrasound examination.

Regarding the data from the KVB, 665,126 patients entered the matching process. After matching, 68,862 patients were part of the observation and control group, respectively. Regarding the GWQ data 132,613 patients entered the matching process. After propensity score matching, 11,306 patients were included in the observation and the control group, respectively.

Table 1 shows the mean values of the covariates in the year 2012 for the observation group as well as the control group before the propensity score matching (columns one and two) and after the propensity score matching (columns three and four). We evaluated matching quality by means of the standardized bias in percent [26]Footnote 9 Columns five and six of Table 1 show the standardized differences between the observation and control group before and after propensity score matching. The matching successfully established similarity in terms of means of our observable variables across our observation and control groups. It reduced the standardized bias of all covariates substantially. The largest standardized bias was only at 0.85 for the KVB data, while 3.58 was the largest standardized bias for the GWQ data. All post-matching standardized biases were thus far below the rule of thumb threshold of five percent [27].

Table 1 Pre-treatment means of observation and control group and standardized bias

Presentation of results and statistical inference

To allow straightforwardly compare how the mean health costs in both groups evolved over time, we present the results in the form of bar plots. The figures below all have the same structure. The vertical axis represents the health costs while the horizontal axis represents the respective quarter since the initial TSH test, ranging from quarter 0 (the quarter where the initial TSH test was applied) to quarter 19. The bars in light grey depict the average cost of the control group, while the bars in dark grey represent the average cost of the observation group. Each bar is accompanied by a 95% normal distributionFootnote 10-based confidence interval. To decide by eyeballing whether observation and control group exhibit cost differentials that can most likely not be attributed to sampling error, we examine whether the confidence intervals do not overlap. This graphically easily depictable approach does not one-to-one correspond to a t test on equal group mean costs but is more conservative, since the confidence bands may overlap despite the t test rejecting the null, but not the other way round [28]. The results of the corresponding formal t tests are documented—together with the precise values of the group-specific costs—in the Appendix, see Tables 617.

Robustness check

The main specification of the empirical analysis rests on the assumption that an ultrasound within four weeks after a TSH test is in the vast majority of cases unnecessary or at least is not medically indicated. However, the data does not include information regarding the results of the TSH test. This is why we cannot rule out that TSH tests of patients in the observation group yielded more frequently suspicious results, making the physician consider an ultrasound to be reasonable although it was not yet indicated by the criterion defined above. To be able to rule out that the ultrasound was performed on the basis of an abnormal TSH test result, in a robustness check we changed the definition of the observation and control group as follows: the observation group includes only cases where both a TSH test and thyroid ultrasound were performed on the same day. The control group, on the other hand, also requires a TSH test to be performed, but a thyroid ultrasound was not performed on the same day, but one day after the TSH test at the earliest. Since in the analysis the TSH test and the ultrasound were performed simultaneously in the observation group, the result of the blood test cannot have triggered the ultrasound examination. This alternative empirical analysis is confined to the KVB data and uses, just like the main specification, a matched sample. The sample that enters the matching procedure is the same as for the main specification. Yet the alternative definition of the observation group reduces its size to 36,120 observations. In consequence, the robustness check is based on these observation and the same number of matching partners from the control group. Information for the alternative matching that parallels what is reported for the main specification in Table 1 is reported in the Appendix (Table 18).

Results

Outpatient costs

We started with comparing outpatient costs for the observation and control group based on data from the KVB and GWQ, respectively. In doing this, we distinguish between thyroid-specific costs (EBM codes that refer to thyroid-specific outpatient services as listed in Table 2 in the appendix) and total costs. The term total costs, hence, does not refer to aggregating costs over different sectors (outpatient, inpatient etc.) but means that cost are considered irrespective of whether or not they are thyroid-specific. If costs are aggregated over different sectors, we refer to this as overall costs (see subsection overall costs).Footnote 11 By design both patients in the observation and the control group had to have received a TSH test in quarter 0 and thus visited a physician. Consequently, the costs in this quarter were naturally higher compared to other quarters, in which neither members of the observation group nor members of the control group had even necessarily visited a physician. The remarkable difference in cost of the initial quarter compared to the following quarters is hence partly an artifact of the design of the analysis.

Thyroid-specific outpatient costs

Figure 1 displays the development of thyroid-specific outpatient costs. As already noted above, we see significantly higher thyroid-specific outpatient costs for the observation group in the quarter in which the TSH test was performed. The difference between the observation group and the control group is slightly higher than the pure cost of the ultrasound itself.Footnote 12 This suggests that, on average, further thyroid-specific examinations were carried out in the observation group. In the following quarters, a significant cost difference between the observation group and the control group persists. Yet, given that thyroid-specific costs are on average less than 5 € in the subsequent quarters, the deviation in costs between groups is of little relevance. With respect to thyroid-specific outpatient costs the pattern is almost the same for the KVB and GWQ data. Yet, the much larger size of the KVB sample results in much narrower confidence intervals compared to the GWQ data.

Fig. 1
figure 1

Thyroid-specific outpatient costs. Upper panel a KVB data, lower panel, b GWQ data; see Appendix Tables 6 and 7 for precise numerical values

Total outpatient costs

When considering the total outpatient costs (displayed in Fig. 2), we see—as for the thyroid-specific costs—that the costs in quarter 0 were significantly higher in the observation group compared to the control group. Yet, this cost difference exceeded the thyroid-specific cost differences. This finding suggests that patients who got the possibly non-indicated ultrasound also received more services that were not directly thyroid-related. However, this difference only occurs in quarter 0 and almost completely disappears in the following quarters. This pattern, once again, is the same for the analysis based on the KVB and the GWQ data.

Fig. 2
figure 2

Total outpatient costs. Upper panel a KVB data, lower panel b GWQ data; see Appendix Tables 8 and 9 for precise numerical values

Inpatient costs

To address possible effects of the possibly non-indicated ultrasound on the costs of inpatient care we had to focus on the GWQ data since the KVB data lack information on inpatient treatment and the associated costs. Based on the analysis of the data from GWQ, the pattern of how inpatient treatment costs differentially evolve for the two groups is similar to that for outpatient treatment costs.

Figure 3(a) shows the development of thyroid-specific inpatient costs.Footnote 13 The average thyroid-specific costs in quarter 0 were clearly higher in the observation group, i.e. the confidence intervals do not overlap. The difference increased further in the quarter after the TSH test and the ultrasound, caused by an increase in the average cost of the observation group. In the second quarter after the initial TSH determination the costs fell back below the level of quarter 0, although the difference between the two groups was still significant. Even three and four quarters after the initial TSH test, the difference between the observation group and the control group remained significant, even though the difference continued to decrease. After five quarters of the initial TSH determination, we could no longer detect any significant difference in inpatient costs.

Fig. 3
figure 3

Inpatient costs (GWQ data). Upper panel a thyroid-specific costs, lower panel, b total costs; see Appendix Tables 10 and 11 for precise numerical values

Total inpatient costs

An overview of the development of total inpatient costs is shown in Fig. 3(b). Similar to the outpatient results, inpatient thyroid-specific costs account for only a very small part of total inpatient costs. In contrast to outpatient total costs, however, there is not even a significant difference between our two groups in inpatient total costs in quarter 0. In terms of the point estimates, inpatient costs seem to be even higher for the control group in later quarters. Given the rather noisy estimates and in consequence rather wide confidence bands, this may be attributed to sampling error.

Pharmaceutical costs

Unlike the cost measures considered so far, we find an almost constant difference over time between the observation group and the control group with regard to thyroid-specific pharmaceutical costs.Footnote 14 The average thyroid-specific pharmaceutical costs of the observation group were significantly higher than those of the control group. Concerning the total costs for pharmaceuticals, no significant differences could be identified as a result of the presumably unnecessary ultrasound. This corroborates our earlier finding that even if some effects on thyroid-related costs exist, these specific costs are too small compared to the total health costs to significantly matter (see Fig. 4).

Fig. 4
figure 4

Pharmaceutical costs (GWQ data). Upper panel, a thyroid-specific costs, lower panel, b total costs; see Appendix Tables 12 and 13 for precise numerical values

Overall costs

Figure 7 depicts group-specific mean overall healthcare costs that is the sum of outpatient, inpatient, and pharmaceutical costs. Since the latter two cost categories are not part of the KVB data, the comparison of overall cost is only possible for the GWQ data. Considering only thyroid-specific cost, the overall cost differentials roughly mirror the pattern found for the inpatient cost (Fig. 3a). While clearly higher costs are observed for the observation group in the two quarters that directly follow the TSH test, the cost differential shrinks in subsequent quarters an vanishes in terms of statistical significance after six quarters. If the analysis is not confined to thyroid-specific costs (Fig. 7b) the analysis provides little evidence for systematically higher overall health costs in the observation group. This also mirrors the earlier results regrading inpatient costs (Fig. 3b) including the finding, that average costs are even lower in the observation group some years after the TSH tests.

Robustness check

Figures 5 and 6 compare the results of the alternative design, in which the observation group consisted only of patients who received the TSH test and the thyroid ultrasound on the same day, to the previously presented results, which are based on the original treatment definition. In this comparison, we focused on the differences in thyroid-specific outpatient costs and total outpatient costs, respectively, that were found in the KVB data. The alternative definition of observation and control group hardly changed our results. Again, there was a difference in thyroid-specific costs that was still significant several quarters after the original TSH test. In terms of total costs, the picture was very similar to our main specification. In the quarters following the original examination, the ultrasound did not lead to a cascade that would be reflected in increased costs. We hence conclude that the pattern of how health costs evolved over time for the two groups was not driven by the unobserved results of the TSH blood test (see Fig. 7).

Fig. 5
figure 5

Overall costs (GWQ data). Upper panel, a thyroid-specific costs, lower panel, b total costs; see Appendix Tables 14 and 15 for precise numerical values

Fig. 6
figure 6

Thyroid-specific outpatient costs. Upper panel, a KVB data, lower panel, b KVB alternative matching; see Appendix Table 16 for precise numerical values

Fig. 7
figure 7

Total outpatient costs. Upper panel, a KVB data (same information as in Fig. 2(a)), lower panel, b KVB alternative matching; see Appendix Table 17 for precise numerical values

Discussion

The results do not suggest that the presumably unnecessary ultrasound examination of the thyroid gland generally leads to a long-term increase in overall inpatient, outpatient or pharmaceutical costs. Hence the ultrasound examination does not seem to frequently act as a trigger of treatment cascades medicalizing patients and leading to a major waste of financial resources.

The picture is somewhat different if the focus is on costs specific to thyroid-related treatment. Here we see a significant short-term effect on both outpatient and inpatient costs. The latter finding may point to the ultrasound examination revealing abnormalities which were immediately examined in an inpatient setting or which led to surgeries on the thyroid gland within approximately one year. One possible explanation is that early application of ultrasound may indeed result in immediate hospital treatment that is not just earlier but additionally carried out compared to patients who received a—according to the guidelines—properly timed ultrasound examination or no ultrasound examination. This explanation is in line with the hypothesis that treatments of thyroid anomalies actually carried out, surgeries in particular, are frequently unnecessary since numerous patients would not have suffered from any disorders related to these anomalies probably until he or she dies from an unrelated disease. An alternative explanation is, however, that—even in the matched sample—an immediate ultrasound is selective in the sense that physicians do not stick to the guidelines if they—for unobserved reasons—think that further treatment might be required [29].

Moreover, costs for thyroid-specific pharmaceuticals were significantly higher in the observation group. This difference could still be observed several years after the ultrasound examination took place. Hence this result provides some support for the hypothesis that over-diagnosing thyroid anomalies may result to some extent in thyroid-related overtreatment and in consequence excess expenditures. An alternative explanation is that physicians who do not stick to medical guidelines with regard to the timing of the ultrasound examination were also less reluctant in prescribing drugs to treat possibly thyroid-related symptoms. Anyway, the share of thyroid-specific costs of total costs is relatively small. Therefore, the thyroid-specific cost effects we see in the data are of little importance to total health expenditures. Our results are against what we had expected from the literature. The steep rise of the incidence of papillary cancer of the thyroid gland, as shown in the iconic graph in Ahn’s et al. paper, was attributed to ultrasound screening of the thyroid gland as practiced routinely in South Korea [2]. Also studies from other countries suggested ultrasound screening as the reason for this rise [4, 6]. Given the high prevalence of thyroid nodules in the population and the high number of thyroidectomies in Germany in comparison to other European countries [12], we would have expected a far higher number of cascades of medical procedures with a corresponding rise in expenditures following an initial non-indicated ultrasound examination. The rise of cancer diagnoses should at least have been accompanied by a rise of diagnoses of nodular goiter.

Limitations and strengths

Routine data mostly have the problem that the coding quality of the diagnoses has limitations [30]. A further limitation was that the data did not contain the results of diagnostic tests and the data do not provide clinical information. Therefore, it is impossible to exactly define which ultrasound examination was necessary and which not. A strength of our study can be seen in the high number of cases examined, which allows for a matching procedure that is picky in finding good matching partners. Furthermore, carrying out the analysis on basis of two data sets that complement each other with respect to the information they include can is also a strength. Therefore, it is improbable that procedures which were applied in reality were not captured in our data. The empirical analysis rets on propensity score matching which is subject to inherent limitations. While successful matching makes to considered groups comparable in terms of observables it is in principle possible that observation and control remain very different in terms of unobserved factors. Moreover, external validity is always limited for matching analyses since the effect of interest is only identified for that part of the papulation where the distribution of the propensity score overlap for the considered groups. Limited generalizability to the entire population applies in particular to matching designs like the one used in the present analysis that estimate the ATT (average treatment effect on the treated) as this design intentionally focusses on the average effect in the observation group but not the population in general [27].

Conclusion

The data did not show the expected cascade of medical procedures after an initial unnecessary ultrasound of the thyroid gland. In consequence, the resulting costs effects are small, especially if seen as a fraction of total health care expenditures.