Introduction

As has been well documented, the proportion of gross domestic product (GDP) devoted to healthcare expenditures (HCE) in the USA has more than tripled since 1960 (CMS 2021a). The growth over that time has been far faster than in any other country in the world, and the current level of US HCE also represents the largest share spent on health care by a wide margin (OECD 2020). These trends have spurred research into the factors driving healthcare costs and have helped motivate substantial policy changes at the US federal and state levels designed, at least in part, to control costs and ‘bend the curve’ (Cutler 2010). One example of such efforts is the “1 Percent Steps for Health Care Reform Project” (https://onepercentsteps.com/), an organization whose goal is to “offer a roadmap to policy makers of tangible steps we as a country can take to lower the cost of health care in the US ... (and) to leverage leading scholars’ work to identify discrete problems in the US health system and offer evidence-based steps for reform.” This group recommends a list of possible reforms that, if adopted, are estimated to reduce health care expenditures by 9%.

Our work complements these efforts by examining whether the two most substantial health care reforms to date—the creation of Medicare and Medicaid and enactment of the Affordable Care Act—appear to have affected health care expenditures as a share of US income. Specifically, we investigate whether there has been a structural break in the US HCE–GDP relationship since 1960 using methods designed to deal with the challenges of exploring changes near the ends of the sample. In so doing, we offer an empirical methodology that may prove useful to evaluating future reforms.

Ever since the seminal work by Newhouse (1977) found income to be the primary driving factor of health care expenditures and an income elasticity suggesting health care is a luxury good, many studies have estimated the health expenditure income elasticity using variation across countries/states and/or time (Parkin et al. 1987; Gerdtham et al. 1992; Newhouse 1992; Hitiris 1997; Baltagi and Moscone 2010; Farag et al. 2012; Hartwig and Sturm 2014; Baltagi et al. 2017).Footnote 1 While these results are mixed on whether aggregate health spending is a luxury or a necessity, they generally have concluded that the growth in income per capita is the major factor behind the surge in health care expenditures for developed countries.Footnote 2 This result especially holds when focusing solely on the more uniform US experience (Freeman 2003; Wang and Rettenmaier 2007; Wang 2009; Moscone and Tosetti 2010a; Woodward and Wang 2012).

If health policy reforms help control costs, affect access to care, or otherwise alter health care consumer and provider decision making, then they may have an impact on the HCE–GDP relationship in general and the income elasticity specifically. Using US quarterly data from 1985Q1 to 2017Q1, Cheng and Nopphol (2019) investigate the impact of health care policy uncertainty (HCPU) and find that HCPU shocks lead to temporary decreases in both HCE and GDP. While they do not investigate the effect on the income elasticity specifically, health care policy reforms like Medicare/Medicaid or the ACA seem likely to affect HCPU, particularly if they are subject to lengthy and contentious political debate.

In addition to HCPU, the income elasticity may vary due to other factors that are plausibly affected by health reforms and other policy changes. Barati and Fariditavana (2020) posit that the relationship with income could be asymmetric due to the behavioral tendencies of loss aversion (which leads to stronger reactions to income losses than gains) and stockpiling (which predicts the opposite). Their results using US annual data suggest that income gains have bigger impacts. More generally, this finding suggests that the income elasticity may vary depending on the business cycle and overall health of the economy, a possibility we explore here as well. And, if health policy alters the incentives facing consumers and providers, these relationships could change.

Using a 1971–2009 panel of 14 OECD countries, Blazquez-Fernandez et al. (2014) allow for over-time and across-country heterogeneity in the income elasticity. They find it is larger for higher-income countries and has declined over time (if technological progress is controlled for with a time trend). On the other hand, Baltagi et al. (2017) find the opposite effect when using annual data on 167 countries over 18 years. Lee, Oh and Meng (2019, Table 5) estimate the income elasticities for each of 14 OECD countries and find the elasticity for the USA (at 2.34) is well above the rest. As one possible explanation for this high value, we posit that the US health care system depends more on private insurance and private providers, which may have facilitated marketing more expensive luxury health care products and services. Many of the reforms offered by the 1 Percent Steps website suggest that regulating or reforming questionable billing practices, concentrated markets and other anti-competitive features could lower health care expenditures substantially; these features could thus help explain why the US is such an outlier among OECD countries. To the extent that reforms to the US system curb (or enhance) those tendencies, the HCE–GDP relationship may change.

To our knowledge, only Woodward and Wang (2012) use formal structural break tests to investigate whether the log–log relationship between US HCE and GDP has changed over time.Footnote 3 Using US annual data for 1960–2008 and the Kejriwal and Perron (2010) structural break test (henceforth KP test), Woodward and Wang (2012) show that the relationship has been surprisingly stable, suggesting that US policies have not changed the income elasticity nor ‘bent’ the curve. However, their data end before the Affordable Care Act (ACA), and their structural break methods preclude testing the effects of Medicare/Medicaid because that policy occurred too early in the sample.Footnote 4

As shown by the Kaiser Family Foundation’s compilations (Kaiser Family Foundation 2011, 2013), the USA has experienced a large number of health policy reforms since 1960, with Medicare/Medicaid and the ACA being the most ambitious.Footnote 5 As such, empirically analyzing whether they have affected the HCE–GDP relationship is important as both were designed to improve access and/or control costs. While the ACA’s first objective was to expand access, stemming the rise in health care expenditures was a second and critical goal. As President Barack Obama noted in an address to the House Democratic Caucus, “Every single good idea to bend the cost curve and start actually reducing health care costs [is] in this bill” (Obama 2010). Subsequent research, however, provides mixed evidence that the ACA has delivered on this goal. Focusing specifically on hospital utilization, Gaffney et al. (2019) find no evidence that either the ACA or Medicare/Medicaid impacted days spent in the hospital or hospital discharges. Chandra et al. (2013) and Weiner et al. (2017) suggest that the Great Recession was more likely to be responsible for the immediate slowdown in costs and the delayed expansion of coverage.

In this study, we focus on whether the HCE–GDP relationship in the USA changed during times of health care policy changes and reforms over the past 60 years. We investigate the possibility of these changes with 1960–2020 national data and 1963–2020 state-level data using structural break tests. As also pointed out by Piehl et al. (2003), structural break tests can be helpful in a policy evaluation setting, especially when the timings of the event and the effect of intervention are not clear. We employ both conventional KP tests and tests that permit end-of-sample (EOS) testing (Andrews and Kim 2006). To our knowledge, ours is the first health economics study to use these EOS structural break tests. These EOS tests are critical for studying the years during which the most substantial reforms took place. The ACA contains several provisions that did not become effective until as late as 2014 (Kaiser Family Foundation 2013) and some, such as the Medicaid expansion, continue to be debated and adopted by the states. The end-of-sample tests we use permit testing for structural breaks as late as 2018 and so should catch breaks associated with the ACA’s later changes. Similarly, these tests can search for breaks at the beginning of the sample, when Medicaid/Medicare was enacted (1965) and implemented (1966).

In addition to analyzing the nation as a whole, we examine each state independently and conduct more extensive analyses for Massachusetts, focusing on the period around its 2006 health care reform law. Using two different data sources for Massachusetts health expenditures, we are able to perform both traditional and EOS structural break tests that include 2006. The first source, which is also the one used in the separate analysis for each state, is the Bureau of Economic Analysis (BEA). It spans 1963 to 2020, thereby allowing 2006 to be subjected to both types of tests. The second is the more commonly used data from the Centers for Medicare & Medicaid Services’ (CMS) National Health Expenditure Accounts (NHEA) (CMS 2019). Because it only includes 1980 through 2014, we cannot use it to test Medicaid/Medicare or the ACA, but it can test the 2006 Massachusetts policy reform. Our Massachusetts analyses provide complementary evidence and robustness checks to our national- and individual state-level findings because (1) they do not rely solely on EOS structural break tests; (2) they use two different data sources; and (3) the Massachusetts health care reform occurred before the Great Recession and the state was not as hard hit as most.Footnote 6 While the 2006 Massachusetts reform differed in key aspects, like the generosity of low-income subsidies and the level of employer responsibility, “the basic structure of the ACA was pioneered in the Bay State,” and it was “the explicit model for the coverage and insurance market reform portions of the ACA” (Gruber 2013). The similar results from the two state-level datasets lend support for using the much longer and continually updated BEA data in future state-level research.

All our findings support the hypothesis that the log–log relationship between aggregate US HCE and GDP has been remarkably stable. Even the periods during the most substantial reforms like Medicare/Medicaid and the ACA yield no evidence of a change in the relationship. Across a range of tests, specifications, and samples, we find no consistent evidence of a structural break and instead find estimated income elasticities that barely budge over our data sample. We also find that all income elasticities well exceed 1.0, a result that is consistent with the hypothesis that health care is a luxury good in the USA. These two findings are also in line with Hall and Jones (2007) whose model based on standard assumptions predicts an income elasticity of health expenditures far greater than one. This finding suggests that the steady rise in health expenditures may in fact be a rational response to rising income in part because the marginal utility of extending life never decreases.

The rest of the paper is organized as follows. The next section describes the data used in the analyses and also provides details of the techniques designed to mitigate the problems encountered in estimating this long-run relationship in general and the income elasticity in particular and credibly testing for structural breaks that occur close to the ends of the sample. Section "Results" reports the results of our estimates and tests, and Section "Concluding Remarks" concludes.

Data and Empirical Strategy

To investigate the stability of the HCE–GDP relationship and income elasticity over time, we estimate time-series models using annual data on HCE and GDP, first at the national level and then for each state individually, with an in-depth focus on Massachusetts. As is common in this literature, we estimate the log of per capita HCE as a function of the log of per capita GDP (Carrion-i-Silvestre 2005; Wang and Rettenmaier 2007; Baltagi and Moscone 2010; Moscone and Tosetti 2010a; Woodward and Wang 2012). We also follow most research in using real values (inflation-adjusted via the Consumer Price Index (CPI) for all goods and services), although in robustness checks we explore using nominal values instead as in Woodward and Wang (2012) and Hartwig (2011). As summarized briefly below, we follow recent research that tests the time-series properties of each data series and adjust our estimation accordingly.

The basic model is

$$\begin{aligned} h_t = \alpha + \beta y_t + \varepsilon _t \; \; \text {for} \; \; t=1,\ldots ,T \end{aligned} \tag{1}$$

where \(h_t\) and \(y_t\) are the natural logs of health care expenditures per capita (HCEPC) and gross domestic product per capita (GDPPC) at year t, \(\beta\) represents the income elasticity of health expenditures and T is the sample size. In a model without structural breaks, \(\alpha\) and \(\beta\) are constants over time, but with M structural breaks, there could be an \(\alpha\) and \(\beta\) for each separate regime (for a total of M+1 regimes).Footnote 7 Our empirical approach is to test for such regime changes and, for those that emerge, see if they coincide with the timing of major health care reforms. To test for changes near the beginning or end of the sample, we use EOS tests. We also explore the sensitivity of income elasticity estimates as more years of data are excluded at the ends of our sample to search for evidence of less abrupt structural changes and to verify that our findings for the overall relationship extend to the income elasticity in particular (i.e., are not driven by the intercept).
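To make the regime structure concrete, equation (1) with M breaks can be written as the following sketch. The break-date symbols \(T_1,\ldots ,T_M\) (with \(T_0 = 0\) and \(T_{M+1} = T\)) and the regime index \(j\) are notation assumed here for illustration; they do not appear in the model as originally stated.

$$\begin{aligned} h_t = \alpha _j + \beta _j y_t + \varepsilon _t \; \; \text {for} \; \; t=T_{j-1}+1,\ldots ,T_j, \; \; j=1,\ldots ,M+1 \end{aligned}$$

Under this notation, the null hypothesis of no structural break corresponds to \(\alpha _1 = \cdots = \alpha _{M+1}\) and \(\beta _1 = \cdots = \beta _{M+1}\), in which case the expression collapses back to equation (1).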

Data Description

National-level health expenditure data comes from the CMS’s NHEA (CMS 2021b) and is available from 1960 through 2020. Similar to the production-based framework used to measure GDP, the NHEA’s national health expenditure series represents the total annual amount spent on final health care consumption in the USA, as well as final spending on administration, public health activities and investment in structures, equipment and non-commercial research in the medical sector.Footnote 8

The NHEA data is also the most commonly used in state-level analyses (e.g., Moscone and Tosetti (2010a), Moscone and Tosetti (2010b) and Panopoulou and Pantelidis (2013)). The NHEA state-level data does not begin until 1980 and has not been updated since 2014, which precludes investigating the adoption of either Medicaid/Medicare or the ACA. The NHEA state-level data can be used, however, in our investigation of the 2006 Massachusetts health policy reform, a precursor to the ACA. In our NHEA-MA analyses, we use the State of Provider data which provides estimates of health care spending based on where the provider of care is located. The Massachusetts data differs slightly from the national level in that it does not include spending on administrative, public health or investment endeavors.

For all other state-level analyses, we use health expenditures from the BEA. This data spans the time period 1963–2020 and therefore allows us to test almost the same period as we do at the national level. Because the BEA changed its industry classifications in 1997, we create this aggregate measure using spending on “health services” from 1963 to 1997 and then sum spending on health and personal care stores, ambulatory health care services, hospitals and nursing/residential care services from 1998 to 2020.Footnote 9 We perform the 2006 Massachusetts analyses using both data sources, and the similar findings lend support to our use of the BEA data for the other states and reforms. The other measures required for equation (1) come from the BEA (national- and state-level GDP) and the Census Bureau (population).

We begin our empirical investigation with a descriptive look at the relationship between HCE and GDP over our sample periods. Figure 1a plots the US HCE as a percentage of GDP and confirms that HCE has steadily grown as a percentage of GDP during this time period. This trend is consistent with an income elasticity that is greater than 1.0, the key parameter of interest in equation (1). The uptick at the end of Fig. 1a also highlights the impact of COVID-19 in 2020, which led to both an increase in health expenditures and a stark fall in GDP. Though we cannot test 2020 using our formal structural break tests, we do find that the estimated income elasticity is very robust to excluding 2020. The figure also denotes with dashed lines the 15% at both ends of the sample trimmed for the standard structural break tests and shows how this trimming excludes from the search most of the years likely affected by the most substantial US health care reforms.
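The link between a rising HCE share and an elasticity above 1.0 can be read directly off equation (1). As a brief sketch, note that the per capita terms cancel in the ratio, so the log share equals \(h_t - y_t\); subtracting \(y_t\) from both sides of equation (1) then gives

$$\begin{aligned} \ln \left( \frac{HCE_t}{GDP_t}\right) = h_t - y_t = \alpha + (\beta - 1)\, y_t + \varepsilon _t \end{aligned}$$

so, holding \(\alpha\) and \(\beta\) fixed, the share of income spent on health care rises with per capita income if and only if \(\beta > 1\).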

Figure 1b provides a direct look at the relationship estimated in equation (1), namely that between log HCE per capita and log GDP per capita. The slope of the plotted line (\(\frac{\partial \ln [HCE]}{\partial \ln [Y]}\)), which is the HCE income elasticity, is strikingly constant over time and greater than 1.0. A closer look at both figures does reveal a temporary leveling off in 2009 in the share of GDP made up by HCE in Fig. 1a and a corresponding blip around the same time in the income elasticity in Fig. 1b, suggesting the possibility of a breakpoint at the start of the ACA or the Great Recession. Apart from this blip, this simple descriptive look at the data provides evidence that the relationship has remained stable and suggests that the income elasticity is greater than 1.0.

Figures 1c and 1d repeat these plots for the two Massachusetts samples. In contrast to the US sample, Fig. 1c shows how the period spanning the 2006 reform remains in the BEA data even after trimming, which allows us to perform both standard and EOS structural break tests.Footnote 10 This figure also shows that the two data sources for Massachusetts track each other reasonably well and thus lend support to our use of the BEA data; their correlation coefficient is 0.99. While both figures suggest less stability in the relationship than for the USA as a whole, they do not reveal an obvious difference before and after the 2006 Massachusetts reform and the slope of both series in Fig. 1d is also fairly constant. These figures thus provide the first evidence that the HCE–GDP relationship is quite stable over time and echo the findings of Woodward and Wang (2012), who find a similar stability using nominal data and a sample that ends in 2008. However, these figures provide us with only descriptive information, and as such, we move on to our formal statistical analyses.

Time Series Properties

Before formally investigating the stability of the log–log health expenditure–GDP relationship, we determine the time-series properties of our different variables so that we can properly model the relationship between them over time.Footnote 11 Table 1 summarizes the key findings from these exercises and the different structural break tests we are able to perform on our national-level data and two Massachusetts datasets, while Appendix Figure A1 illustrates those time series. For the sake of brevity, we do not report this information for the other states and DC, but the results are in general similar to what we find for the national dataset and are available upon request.

Table 1 Time-series properties of the USA and Massachusetts ln(NHEPC) (\(h_t\)) and ln(GDPPC) (\(y_t\))

To assess the order of integration for our series, we use both the standard augmented Dickey–Fuller (ADF) test and the modified generalized least squares (GLS) de-trended Dickey–Fuller (DF-GLS) test. We adopt the DF-GLS test because of the documented power problems of the standard Dickey–Fuller test in the case of negative serial correlation (Perron and Ng 1996; Elliott et al. 1996; Ng and Perron 2001).

It is now well recognized that a structural break in a series can invalidate the ADF and DF-GLS tests. In particular, these tests tend to fail to reject the null hypothesis of a unit root when a break is present because they cannot reliably distinguish non-stationarity from a shift in the series (Perron 1990). To account for this possibility, we implement the test proposed by Clemente et al. (1998) (CMR). We prefer the CMR approach because it allows for up to two breaks, which keeps the stationarity tests of each series robust when multiple breaks are present.

Once the stationarity properties of the health expenditure and income series are established, it is important to then determine whether or not the two series are cointegrated. To test for cointegration, we use the Engle–Granger (EG) two-stage procedure.Footnote 12
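The two-stage logic can be illustrated with a minimal sketch, assuming plain OLS in the first stage and a Dickey–Fuller regression without lag augmentation in the second. The function name `engle_granger_stat` and the simulated series are hypothetical and are not the data or code used in the paper. Because the residuals are estimated, the resulting statistic has non-standard critical values and should not be compared with ordinary t tables.

```python
import numpy as np

def engle_granger_stat(h, y):
    """Two-stage sketch: (1) OLS of h on a constant and y,
    (2) Dickey-Fuller t-statistic on the first-stage residuals (no lags)."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), y])
    coef = np.linalg.lstsq(X, h, rcond=None)[0]
    u = h - X @ coef                      # estimated cointegration errors
    du, u_lag = np.diff(u), u[:-1]
    rho = (u_lag @ du) / (u_lag @ u_lag)  # slope in du_t = rho * u_{t-1} + e_t
    e = du - rho * u_lag
    se = np.sqrt((e @ e) / (len(du) - 1) / (u_lag @ u_lag))
    return rho / se

# Hypothetical simulated data: y is a random walk with drift and h is tied
# to y through a stable long-run relationship, so the pair is cointegrated.
rng = np.random.default_rng(0)
T = 200
y_sim = np.cumsum(0.02 + 0.02 * rng.standard_normal(T))
h_sim = -1.0 + 1.5 * y_sim + 0.02 * rng.standard_normal(T)
eg_stat = engle_granger_stat(h_sim, y_sim)
print(eg_stat)
```

A strongly negative statistic indicates mean-reverting residuals, which is the pattern expected when the two series are cointegrated.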

The top two panels of Table 1 report the results of the unit root and cointegration tests. The battery of tests reported in the top panel overall suggests we cannot reject the null hypothesis of a unit root in the levels of either \(y_t\) or \(h_t\) for all three series. In unreported results, we find the same result for the other states and DC. This suggests that each series exhibits sustained persistence over time and that this serial correlation may bias our results if not properly treated.Footnote 13 The second panel presents the results for both the preferred EG test and the Johansen test as confirmation. Both tests strongly confirm that the variables in all three series reported in Table 1 are cointegrated, as are the other states and DC. These time-series property conclusions are consistent with the literature (Blomqvist and Carter 1997; Gerdtham and Löthgren 2000; Freeman 2003; Carrion-i-Silvestre 2005; Wang and Rettenmaier 2007) and suggest a modification of our primary model is necessary before testing for stability.

Due to these time-series properties, standard ordinary least squares (OLS) regression estimation between \(h_t\) and \(y_t\) may be biased due to the correlation between the right-hand side variable and the cointegration error. To overcome this, Stock and Watson (1993) propose the dynamic OLS (DOLS) estimator, which they show to be asymptotically efficient, by including leads and lags of the first difference of the integrated right-hand side variable in the regression. Adding these variables not only helps mitigate issues with autocorrelation, but also accounts for some of the simultaneity bias, which may occur in a regression of cointegrated variables.Footnote 14 However, using a Monte Carlo study, Hayakawa and Kurozumi (2008) investigate the finite sample properties of DOLS regressions and find that models without the first differenced leads are optimal if the cointegration errors do not Granger-cause the first difference of the integrated right-hand side variable. In other words, if past values of the cointegrating errors do not contain any information that could help predict the first differenced integrated regressor (\(y_t\)), then the simultaneity bias does not need to be addressed and the exclusion of the leads increases the degrees of freedom and the efficiency of the estimation, particularly in small samples.
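The construction of the DOLS regressors can be sketched as follows. This is a minimal illustration, assuming a single integrated regressor and ordinary least squares on the augmented design matrix; the function name `dols`, its arguments and the simulated series are hypothetical, and the sketch omits the standard-error corrections a full application would need.

```python
import numpy as np

def dols(h, y, leads=2, lags=2):
    """Regress h_t on a constant, y_t and the first differences
    dy_{t-lags}, ..., dy_{t+leads}; return (alpha_hat, beta_hat)."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                          # dy[i] = y[i+1] - y[i], so dy_t = dy[t-1]
    T = len(y)
    t_idx = np.arange(lags + 1, T - leads)   # periods where every lead and lag exists
    cols = [np.ones(len(t_idx)), y[t_idx]]
    for k in range(-lags, leads + 1):
        cols.append(dy[t_idx + k - 1])       # column holding dy_{t+k}
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, h[t_idx], rcond=None)[0]
    return coef[0], coef[1]

# Hypothetical simulated series with a true long-run elasticity of 1.5.
rng = np.random.default_rng(0)
T = 61
y_sim = np.cumsum(0.02 + 0.02 * rng.standard_normal(T))
h_sim = -1.0 + 1.5 * y_sim + 0.02 * rng.standard_normal(T)

alpha_hat, beta_hat = dols(h_sim, y_sim, leads=2, lags=2)
alpha_nl, beta_nl = dols(h_sim, y_sim, leads=0, lags=2)   # variant without leads
print(beta_hat, beta_nl)
```

Setting `leads=0` gives the lags-only variant discussed above; it keeps more observations at the end of the sample because no future differences are required.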

We use the data-driven procedures outlined by Hayakawa and Kurozumi (2008) to determine the optimal number of leads and lags to augment equation (1) before proceeding with the structural break tests.Footnote 15 We find differences across the three samples. While the national-level data suggests two leads and two lags, the MA NHEA sample implies no leads and the MA BEA sample requires four leads and lags. The other states and DC mostly follow the national data with two leads and two lags (28 states) or one lead and lag (16 states). The remaining six states require at least three leads and lags, and this higher number of lags precludes testing 1966 for them. We explain this selection process in more detail in the Appendix and discuss how the results differ across our three main time series in Appendix Tables 5, 6 and 7.

Moving forward, we augment Equation 1 with the number of leads and lags as defined above for each of our series to reach our baseline DOLS models, and test for structural breaks using these augmented equations. The actual number of potential breakpoints, and where they occur, are not specified a priori, and both the intercept and the slope are allowed to vary in our tests.

Testing for Structural Change

To test the stability of the aggregate log–log HCE–GDP relationship over time, we first utilize the sup-Wald, the UDmax and the sequential multiple break testing procedures proposed by Kejriwal and Perron (2010).Footnote 16 These procedures fit the scope of our question because they (1) allow for breaks in both the intercept and slope of the DOLS equation, (2) allow for multiple breaks over the sample, (3) do not require the break dates to be specified a priori and (4) are consistent even in the presence of non-stationary and cointegrated variables, such as both we and the literature more generally have found for \(h_t\) and \(y_t\) (Bai and Perron 1998; Bai and Perron 2003; Perron 2006; Kejriwal and Perron 2010). Sequential procedures are also known to have a tendency to stop too early; we alleviate this issue by combining the UDmax test and the sequential procedure, following Bai and Perron (1998), Perron (2006) and Kejriwal (2008).
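The idea of searching over unknown break dates can be illustrated with a deliberately simplified sketch. It assumes a single break, plain OLS rather than DOLS, and a textbook F statistic with no correction for serial correlation, so it is not the Kejriwal and Perron (2010) statistic and their critical values do not apply to it. The function names and simulated data are hypothetical.

```python
import numpy as np

def ssr(X, z):
    """Sum of squared OLS residuals from regressing z on X."""
    coef = np.linalg.lstsq(X, z, rcond=None)[0]
    r = z - X @ coef
    return float(r @ r)

def sup_f_single_break(h, y, trim=0.15):
    """Return (largest F statistic, index of the first observation of regime 2)
    over all candidate break dates inside the trimmed window."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    T = len(h)
    X = np.column_stack([np.ones(T), y])
    q = X.shape[1]                       # parameters allowed to change
    ssr_restricted = ssr(X, h)           # no-break model
    cut = int(trim * T)
    best_f, best_k = -np.inf, None
    for k in range(cut, T - cut + 1):
        ssr_unrestricted = ssr(X[:k], h[:k]) + ssr(X[k:], h[k:])
        f = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (T - 2 * q))
        if f > best_f:
            best_f, best_k = f, k
    return best_f, best_k

# Hypothetical series: a stable relationship, and a copy with an intercept
# shift starting at observation 30 (exaggerated so the break is obvious).
rng = np.random.default_rng(0)
T = 61
y_sim = np.cumsum(0.02 + 0.02 * rng.standard_normal(T))
h_sim = -1.0 + 1.5 * y_sim + 0.02 * rng.standard_normal(T)
h_shift = h_sim.copy()
h_shift[30:] += 0.5

f_break, k_break = sup_f_single_break(h_shift, y_sim)
f_stable, k_stable = sup_f_single_break(h_sim, y_sim)
print(f_break, k_break, f_stable)
```

Because both the intercept and the slope are re-estimated in each regime, the search mirrors the feature that either parameter may change at the break.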

While these testing procedures fit our question well, two issues may bias these tests toward a failure to find structural breaks. The first is that given our relatively small sample (1960–2020 or T = 61 for the USA, 1980–2014 or T = 35 for the NHEA Massachusetts sample and 1963–2020 or T = 58 for the BEA Massachusetts and the rest of the state samples), the size and power of these procedures may be limited. Second, in each test, the related DOLS equation is subject to “trimming,” i.e., the removal of a certain portion of the beginning and the end of the sample, to determine the range over which the breaks will be searched. The trimming ensures that each testing segment does not get too small, which is necessary to ensure adequate power, especially when there is serial correlation in the data (Andrews 1993, 2003; Bai and Perron 2003; Perron 2006). While trimming helps increase power, it excludes years at both ends of the sample and, as such, eliminates two periods when large-scale health reforms took place in our sample. For example (and as reported at the bottom of Table 1), the standard 15% trimming that we adopt means that the years actually tested for structural breaks in the US national data are 1969–2011. The advent of Medicaid/Medicare in 1966 and the majority of the ACA rollout, which occurred well after its enactment in 2010, are therefore not investigated as potential points of structural change. The bottom of Table 1 reports the years of each sample that can be subjected to the KP structural break tests, followed by the years investigated with the EOS tests described next.

To address both of these concerns, we adopt the EOS structural change tests proposed by Andrews (2003) and extended by Andrews and Kim (2006). The P test developed in Andrews and Kim (2006) fits our research question because it addresses changes in short time periods, such as in the beginning or the end of the sample, unlike most structural change tests which are designed to identify breaks over a long span of data. Following Andrews and Kim (2006), we also know that the P test is appropriate for models with cointegrated variables, as \(h_t\) and \(y_t\) are here. These tests therefore allow us to check whether breaks occurred prior to 1969 (Medicare/Medicaid) or post 2011 (ACA). One criticism of this procedure is that it typically requires pre-specifying the break points. To mitigate this issue, we alternatively pre-specify the full range of possible break points, allowing us to test each year in both tails of the sample.

To carry out the P test, one first pre-selects a hypothesized break date at time t = \(t_i\). The DOLS model is estimated over the entire sample, and the sum of squared residuals is calculated only for the post-change period (\(t_i\) + 1,...,T). This sum is the P statistic associated with time \(t_i\). Because the number of observations at the beginning or end of the sample is small, one cannot rely on the standard asymptotic distribution to generate critical values. The P test instead relies on an iterative sub-sampling procedure to derive the empirical distribution of the test statistic. Specifically, we follow Andrews and Kim (2006) by estimating \(T-2(T -t_i+1)+1\) hypothetical test statistics using a moving window over the pre-change period, which together form the empirical distribution. If the P statistic exceeds the (1 − significance level) quantile of this empirical distribution, the null hypothesis of no structural change can be rejected in favor of a structural change at the year corresponding to time \(t_i\) at that significance level.Footnote 17
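The sub-sampling logic can be sketched in a simplified form. The sketch assumes plain OLS rather than DOLS, takes the number of post-change observations `m` as given, and computes each pre-change window statistic from an estimate that leaves out the first half of that window; these estimator details are assumptions made for illustration and may differ from the exact choices in Andrews and Kim (2006). The function name `p_test` and the simulated data are hypothetical.

```python
import numpy as np

def p_test(h, y, m):
    """End-of-sample sketch: return (P statistic, empirical p-value) for a
    hypothesized break affecting the last m observations."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y, dtype=float)
    T = len(h)
    n = T - m                                  # pre-change observations
    X = np.column_stack([np.ones(T), y])
    beta_full = np.linalg.lstsq(X, h, rcond=None)[0]
    r_post = h[n:] - X[n:] @ beta_full
    P = float(r_post @ r_post)                 # post-change sum of squared residuals

    half = int(np.ceil(m / 2))
    P_j = []
    for j in range(n - m + 1):                 # moving windows of length m before the break
        keep = np.ones(n, dtype=bool)
        keep[j:j + half] = False               # leave out part of the window being scored
        beta_j = np.linalg.lstsq(X[:n][keep], h[:n][keep], rcond=None)[0]
        r = h[j:j + m] - X[j:j + m] @ beta_j
        P_j.append(float(r @ r))
    p_value = float(np.mean(np.array(P_j) >= P))   # share of windows at least as extreme
    return P, p_value

# Hypothetical series: a stable relationship, and a copy with a large shift
# in the final five observations (exaggerated so the break is obvious).
rng = np.random.default_rng(0)
T = 61
y_sim = np.cumsum(0.02 + 0.02 * rng.standard_normal(T))
h_sim = -1.0 + 1.5 * y_sim + 0.02 * rng.standard_normal(T)
h_break = h_sim.copy()
h_break[-5:] += 1.0

P_stable, p_stable = p_test(h_sim, y_sim, m=5)
P_break, p_break = p_test(h_break, y_sim, m=5)
print(p_stable, p_break)
```

Repeating the call for each candidate year in both tails, as described above, amounts to looping over `m`; a beginning-of-sample version can be obtained by reversing the series before calling the function.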

Results

We start by presenting formal structural break results for the national-level HCE–GDP relationship and income elasticity using the standard KP tests and a trimmed sample that spans 1969–2011. To explore whether the relationship is stable at the ends of the sample, when Medicare/Medicaid and the ACA are enacted, we use the EOS P test. We then subject these findings to three different robustness checks. The first estimates several variations of the log–log national-level HCE–GDP relationship and repeats the P tests. The second repeats this estimation and two types of tests for each of the other states and DC individually. Estimating each state independently helps avoid possible aggregation biases present in national data while the national scope of these reforms suggests that state-level relationships could also be impacted—with the possible exceptions of Massachusetts and Hawaii which enacted their own reforms. The final exercise focuses on Massachusetts and its 2006 reform, using two different sources of data and both types of tests. These complementary analyses permit testing for structural breaks during the reform period using both EOS P and KP tests, as well as abstracting from the effects of the Great Recession. Finding similar results for the national-, state-level and two focused Massachusetts analyses helps corroborate our key findings.

Structural Breaks in the US National log–log HCE–GDP Relationship

Table 2 presents the results for both sets of structural break tests applied to our main US HCE–GDP specification in equation (1), estimated in line with the time-series properties described in Table 1. The KP tests, applied to appropriately trimmed data, cover the 1969–2011 period, while the EOS P test looks for breaks in 1963–1972 (Medicaid/Medicare era) and 2009–2018 (ACA era). As the constant slope in Fig. 1b suggests, the different types of tests are unanimous in finding no evidence of structural breaks over the sample, including at the ends. For the inner 70% of the sample (1969–2011) examined with the KP tests, we fail to reject the null hypothesis of stability for all \(SupF^*\) tests, as well as the UDmax test; all tests fall far short of the 10% significance level. As a final level of confirmation, the sequential test finds no evidence of breaks either.

Table 2 Kejriwal and Perron (2010) and end-of-sample structural break tests for the US log–log HCE–GDP relationship

The last four columns of Table 2 report the results for the Andrews and Kim (2006) P test, which searches for breaks at the two critical ends of our sample. To conduct this test, we first must pre-specify a break year and then test the null hypothesis of no change at that year against the alternative hypothesis of a change at that year. The fact that we have to pre-specify allows us to conduct a falsification test in that we manually test all of the years in the beginning and the end of the sample, regardless of whether we expect them to be break years or not. If the EOS test finds many or all years to be significant, this casts doubt on the ability of the EOS test to reliably identify breaks. On the other hand, if only years reasonably expected to be breakpoints are selected, then we can be more confident that these are points of change in the curve. Similarly, having no years selected as breakpoints suggests that the relationship is in fact stable, even at the ends of the sample. The staggered rollout of the different provisions of the ACA suggests a need to test over many years as well.

Starting with the beginning of our sample (1963–1972), which spans the enactment of Medicare/Medicaid, we find no evidence of a structural break in any year. For example, in 1966, the p-value is 0.50, meaning that this year was found less likely to be a point of change than 50% of the hypothetical break years constructed from different periods in the post-1966 sample. As such, we find no evidence that the US HCE–GDP relationship changed in the aftermath of Medicare/Medicaid’s enactment. Performing a similar set of tests for the end of the sample (2009–2018), which spans the Great Recession and the ACA, likewise yields no evidence of a structural break. In fact, each year tested is associated with a high p-value (ranging from 0.34 in 2016 to 0.81 in 2009), suggesting that none of these years is close to being statistically significantly related to a change in the health expenditure income elasticity or the log–log HCE–GDP relationship overall.
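
The logic of the end-of-sample test and the falsification loop can be illustrated with a short sketch. It is not the authors' implementation: it uses a plain bivariate OLS rather than equation 1 with its leads and lags, a simplified leave-one-block-out scheme in place of the exact procedure in Andrews and Kim (2006), and a synthetic series. The function names, the mapping from a break year to the number of end-of-sample observations `m`, and all numbers below are assumptions made for illustration only.

```python
import math


def ols(x, y):
    """Closed-form bivariate OLS; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def block_ssr(x, y, start, m, coef):
    """Sum of squared residuals over observations start..start+m-1."""
    a, b = coef
    return sum((y[t] - a - b * x[t]) ** 2 for t in range(start, start + m))


def eos_p_value(x, y, m):
    """Subsampling p-value for a break in the last m observations."""
    T = len(y) - m  # observations before the candidate break
    # Statistic: squared residuals over the last m points, full-sample fit.
    P = block_ssr(x, y, T, m, ols(x, y))
    # Reference distribution: the same statistic on every m-length block of
    # the pre-break sample, fitted on the pre-break data without that block
    # (a simplification of the published leave-out scheme).
    ref = []
    for j in range(T - m + 1):
        keep = [t for t in range(T) if not j <= t < j + m]
        coef = ols([x[t] for t in keep], [y[t] for t in keep])
        ref.append(block_ssr(x, y, j, m, coef))
    # Share of hypothetical break periods that look at least as unusual.
    return sum(pj >= P for pj in ref) / len(ref)


# Hypothetical, synthetic log series for 1960-2020 (not the paper's data).
years = list(range(1960, 2021))
log_gdp = [10.0 + 0.02 * i for i in range(len(years))]
log_hce = [-5.0 + 1.4 * g + 0.01 * math.sin(1.7 * i)
           for i, g in enumerate(log_gdp)]

# Falsification loop: test every candidate year, expected break or not.
p_by_year = {yr: eos_p_value(log_gdp, log_hce, years[-1] - yr + 1)
             for yr in range(2009, 2019)}

# Same series with an artificial level shift from 2012 onward.
shifted = [h + (0.5 if yr >= 2012 else 0.0) for yr, h in zip(years, log_hce)]
p_shift = eos_p_value(log_gdp, shifted, years[-1] - 2012 + 1)
```

Read this way, a p-value is the share of hypothetical break periods whose fit is at least as poor as that of the candidate period, so a large value signals stability and a value near zero signals a break. A beginning-of-sample version would mirror the same steps from the other end of the series.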

Additional Exercises with National Data

We first subject these findings to alternative specifications of equation 1. The first uses nominal instead of real values of HCE and GDP, as in Woodward and Wang (2012). Consistent with Table 2 and Woodward and Wang (2012), we find no evidence of a structural break within the internal 70% of the samples.Footnote 18 As shown in Table 3, the P-tests once again provide no evidence of a structural break during either the beginning or end of the sample and the p-values are similar to Table 2.

Table 3 Estimated p-values from end-of-sample structural break tests for the US log–log HCE–GDP relationship: alternative specifications

While past research varies widely in the factors included (e.g., see Hartwig and Sturm (2014)), including some measure of the age distribution or dependency ratio is common (e.g., Wang 2009; Barati and Fariditavana 2020). We therefore redo the analyses including the percentage of the population aged between 18 and 64 and the percentage aged 65 and older as additional controls. Table 3 reports this exercise in the third column of each panel. While the p-values are lower, the results continue to show no evidence of a structural break. Next, we follow Blazquez-Fernandez et al. (2014), who explicitly investigate whether the elasticity has changed over time and find their results sensitive to including a proxy for technological change (a time trend). This addition, reported in the fourth column of each panel, likewise has no qualitative impact.

To deal with both of those factors together, along with the potential influence of other unobserved variables, we next consider whether there are breaks in the US log–log HCE–GDP relationship measured relative to the rest of the OECD countries; the results appear in the next column. Data on HCE and GDP for the OECD countries are available only from 1970 to 2020, meaning the impact of Medicare/Medicaid at the beginning of our sample cannot be tested in this exercise.Footnote 19 We calculate the median health expenditures and GDP by year across the OECD countries with data available over the full 1970–2020 period and subtract these medians from the US values, giving US health expenditures and GDP net of the “global normal.”Footnote 20 Though the increase has not been as stark as in the USA, all other developed countries have also seen a rise in their health expenditures over time, suggesting that shared factors could be at least partly responsible for this growth across developed nations. These common factors could therefore also be influencing the stability of the HCE–GDP relationship documented so far, and subtracting out the yearly OECD median should help control for their potential effect. As before, the HCE–GDP relationship remains stable throughout this exercise, with p-values ranging from 0.41 to 0.83 in the EOS P-test. In unreported analyses, we also subject this net of “global normal” measure to the KP test, finding no evidence of a break in the interior of the sample either.Footnote 21
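
The “global normal” adjustment described above amounts to a yearly cross-country median subtracted from the US series. A minimal sketch follows; the function name, the dictionary layout and the numbers are hypothetical, and it assumes the US is excluded from the median and that countries lacking any year of data are dropped, neither of which the text states explicitly.

```python
from statistics import median


def net_of_global_normal(us, others, years):
    """Subtract the yearly cross-country median from the US series.

    us:     {year: value} for the United States
    others: {country: {year: value}} for the comparison countries
    Only countries observed in every requested year enter the median.
    """
    complete = [s for s in others.values() if all(y in s for y in years)]
    return {y: us[y] - median(s[y] for s in complete) for y in years}


# Hypothetical values; country "D" lacks 1970 and is therefore excluded.
years = [1970, 1971]
us = {1970: 10.0, 1971: 11.0}
others = {
    "A": {1970: 4.0, 1971: 5.0},
    "B": {1970: 6.0, 1971: 7.0},
    "C": {1970: 8.0, 1971: 9.0},
    "D": {1971: 100.0},
}
net = net_of_global_normal(us, others, years)
```

The same function would be applied once to health expenditures and once to GDP before re-running the break tests on the two netted series.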

Finally, we allow the effect of GDP on HCE to be asymmetric in the last column in each panel as in Barati and Fariditavana (2020). Across all of these different model specifications, we find no evidence that the log–log HCE–GDP relationship changed during the rollout of Medicare/Medicaid in the early part of our sample or during the Great Recession and the ACA in the latter part. The stability over the 1960–2020 period remains unchanged. As a final check, we also re-estimate these national-level models excluding 2020, as the COVID-19 pandemic may have had an undue effect on both HCE and GDP. The results are nearly identical.
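
The text does not spell out how the asymmetry is constructed. One common approach in the asymmetric (nonlinear ARDL) literature splits the regressor into cumulative sums of its increases and of its decreases and lets each enter with its own coefficient. The sketch below shows only that decomposition; it is an assumption about the general technique rather than a description of this paper's specification, and the function name and input values are hypothetical.

```python
from itertools import accumulate


def partial_sums(series):
    """Split a series into cumulative positive and negative changes."""
    changes = [b - a for a, b in zip(series, series[1:])]
    pos = [0.0] + list(accumulate(max(c, 0.0) for c in changes))
    neg = [0.0] + list(accumulate(min(c, 0.0) for c in changes))
    return pos, neg


# Hypothetical log GDP path with one downturn.
log_gdp = [1.0, 1.5, 1.25, 2.0]
gdp_pos, gdp_neg = partial_sums(log_gdp)

# The two components plus the starting value rebuild the original series.
rebuilt = [log_gdp[0] + p + n for p, n in zip(gdp_pos, gdp_neg)]
```

Regressing log HCE on `gdp_pos` and `gdp_neg` separately would then allow expansions and contractions in GDP to carry different elasticities.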

The last set of exercises compares the estimated income elasticities obtained from different time periods and in past research. Whether nominal or real values are used, our estimated income elasticities barely budge when the sample is extended and are similar to those in past work that uses shorter series. For the main specification that uses real HCE–GDP values, we find a large income elasticity of 2.59 for the full 1960–2020 sample, similar to what Lee, Oh and Meng (2019, Table 5) find for 1960–1997 (2.34) and in line with the predictions made in Hall and Jones (2007). When we eliminate the post-ACA period (dropping 2011 and later years), the estimated income elasticity changes by only 0.03, to 2.62. Woodward and Wang (2012), using nominal data instead, report an estimated income elasticity of 1.388 for their 1960–2008 sample, which excludes the entire post-ACA period. Our sample using updated data adds 12 years and increases the sample by more than 20%, yet the corresponding estimate is strikingly similar at 1.40.Footnote 22 In another exercise similar to the recursive approach of Blazquez-Fernandez et al. (2014), we systematically add and subtract the years included in the sample and re-estimate the model.Footnote 23 Across both nominal and real national data, this exercise yields elasticity estimates that differ by at most 0.042. The bigger difference in elasticities arises from whether one estimates the relationship in real or nominal terms. In sum, neither the structural break tests nor the estimated elasticities themselves provide evidence that the national HCE–GDP relationship has changed during 1960–2020.
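
The recursive exercise can be sketched as a loop that re-estimates the log–log slope as the end year of the sample moves. This is an illustration under stated assumptions: plain bivariate OLS stands in for equation 1, only the end year is varied, and the series is synthetic with a built-in elasticity of 1.4. The function names and all values are hypothetical.

```python
import math


def elasticity(gdp, hce):
    """Slope from a bivariate OLS of log(hce) on log(gdp)."""
    x = [math.log(g) for g in gdp]
    y = [math.log(h) for h in hce]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def recursive_elasticities(years, gdp, hce, first_end):
    """Re-estimate with the sample ending in each year from first_end on."""
    out = {}
    for end in range(first_end, years[-1] + 1):
        k = years.index(end) + 1  # number of observations through `end`
        out[end] = elasticity(gdp[:k], hce[:k])
    return out


# Hypothetical series: 2% annual GDP growth and a true elasticity of 1.4,
# with a small deterministic wiggle standing in for noise.
years = list(range(1960, 2021))
gdp = [100.0 * 1.02 ** i for i in range(len(years))]
hce = [0.5 * g ** 1.4 * (1 + 0.01 * math.sin(i)) for i, g in enumerate(gdp)]

estimates = recursive_elasticities(years, gdp, hce, first_end=2008)
spread = max(estimates.values()) - min(estimates.values())
```

The `spread` value plays the role of the maximum difference across re-estimated elasticities reported in the text; a small spread is what a stable relationship would produce.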

Estimates and Structural Break Tests Using State-Level Data

Our next set of robustness checks takes advantage of the BEA state-level data spanning nearly the same time period (1963–2020) to repeat this estimation (including the diagnostic tests outlined in Table 1) and the two types of structural break tests for each individual state. This exercise may help alleviate possible aggregation bias; given the national scope of these reforms, it also provides additional, credible evidence as to whether the HCE–GDP relationship was affected at a disaggregated level. However, it also yields an enormous number of tests, summarized here and available in greater detail upon request, and its results are complicated by the fact that the prescribed leads and lags (and thus the possible years subjected to EOS tests) differ across states. Because the data do not begin until 1963 and most states require two or more lags, testing the period just before and during the Medicaid/Medicare enactment is not feasible for most states. We therefore limit our structural break analyses to the interior 70% of the sample (1972–2011), which can use traditional KP tests, and apply the EOS tests to the period just before, during and after the ACA rollout. As the highest number of leads found is four, all states can be tested at least through 2016 and the overwhelming majority (44) can be tested through 2018.
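
The sample windows quoted here follow from simple arithmetic, sketched below. The rounding rule for the 15% trimming is an assumption chosen because it reproduces the interior windows reported for both the national (1960–2020) and state (1963–2020) samples; the paper may define the trimming differently. The function names are hypothetical.

```python
def interior_window(first, last, trim=0.15):
    """Years left after trimming a share `trim` from each end of the sample."""
    cut = round(trim * (last - first + 1))  # assumed rounding rule
    return first + cut, last - cut


def last_testable_year(last, leads):
    """Final year whose required leads still fall inside the sample."""
    return last - leads


national = interior_window(1960, 2020)   # (1969, 2011)
state = interior_window(1963, 2020)      # (1972, 2011)
four_leads = last_testable_year(2020, 4) # 2016
two_leads = last_testable_year(2020, 2)  # 2018
```

Under this reading, a state that needs four leads loses the last four years of the sample to the lead terms, which is why its EOS tests stop at 2016, while a state that needs two leads can be tested through 2018.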

Performing the KP tests for the 1972–2011 period for each of the other states and DC yields only seven states with possible breaks and no real pattern: the breaks occur at different times and none is even close to 2011.Footnote 24 Turning to the EOS tests and the end of the sample, we focus on the four years prior to the ACA (2007–2010), the four years during the rollout (2011–2014) and the four years (when possible) afterward (2015–2018).Footnote 25 Figure 2 summarizes the results of the EOS structural break tests for 2007–2018 in all 50 states and DC. We once again find seven states with possible breaks, and both the timing and the states involved appear fairly random. Two of the breaks occur in the pre-period, three in the rollout period and two in the post-period. Only one break occurs in the critical years of 2011 and 2014. Two of the states (AZ and DE) also had suggested breaks in the interior period, while the remaining five (AK, MD, MI, NH and SC) display no obvious pattern. Both types of structural break tests therefore suggest a small number of breaks but with no systematic pattern linking them to policy reforms. Finding more structural breaks is perhaps to be expected in what seem to be less stable series in general; individual states are buffeted by forces that may have little effect in the aggregate. As an additional robustness check, we also apply the structural break test proposed by Ditzen et al. (2021), which extends the Bai and Perron (1998) and Bai and Perron (2003) method to the panel level, to our full set of states together, finding no evidence of a structural break in the interior 70% of the sample.Footnote 26

The estimated income elasticities are also reasonably stable across the states and, as in the national data, over time. The top panel of Figure 3 reports the estimated income elasticities and their 95% confidence intervals for each state; the two horizontal lines show the 95% confidence interval of the income elasticity estimated with the national BEA data. With the exception of Alaska, the elasticities fall in a reasonably narrow range between 1.51 and 3.42, and most of the confidence intervals overlap. Even Alaska’s outlier estimate of 0.92 is imprecise enough that its confidence interval overlaps with most others. These estimates are also in the same general range as the one we find in the national NHEA data (2.62), despite using an entirely different data source. Also similar to the national analyses, re-estimating these models using nominal data yields substantially lower estimated elasticities, ranging from 1.17 to 1.52, which are again similar to the estimate produced in the national analysis (1.40). These analyses therefore highlight that the little-utilized BEA state-level data yield estimates similar to the NHEA data and that the elasticities do appear sensitive to whether real or nominal values are used.

A final question is whether these estimates are stable over time, especially with respect to the period during and after the ACA. We therefore re-estimate the income elasticities for every state, ending the sample first in 2010, before the ACA, and then in 2014 (thus omitting the years after most ACA policies were implemented). The bottom panel of Figure 3 reports these three sets of estimates (using 1963–2010, 1963–2014 and 1963–2020) and clearly shows, once again, how little the estimated income elasticities are affected by extending the sample into the ACA era.

Evidence of Structural Breaks in Massachusetts’ log–log HCE–GDP Relationship

We now turn to testing for structural breaks in the state of Massachusetts, especially in the period immediately following its 2006 health care reform law. We analyze MA in more detail because we view this exercise as a complement to the US analyses, given the similarity of the reform to the ACA; because the law was enacted before the Great Recession and more time has elapsed, it is also more straightforward to test than the ACA. Table 4 reports the structural break test results for both of the Massachusetts samples. The more commonly used NHEA sample is too short to permit testing for structural breaks around the 2006 reform with the traditional KP tests, and so we perform only the EOS P-test. While we are primarily interested in the end of the sample (the period after the reform), for completeness we test at the beginning as well. Neither end of the sample provides evidence of a structural break.Footnote 27 The p-values during the reform period, in particular, are all quite high (0.60 or greater).

Table 4 Kejriwal and Perron (2010) & End-of-Sample Structural Break Tests for the Massachusetts log–log HCE–GDP Relationship

The BEA series, with its greater length, permits the full set of KP tests, and its interior 70% (1972–2011) includes several years from the post-reform period. We can therefore perform two different types of tests to search for a structural break after the 2006 policy. Moreover, comparing the results of these more traditional structural break tests to those from the EOS P-tests provides evidence on the validity of the P-tests used to test the years immediately following Medicare/Medicaid and the Great Recession/ACA in our US analyses. The full set of KP tests is reported in the first two columns of the BEA panel in Table 4, and the remaining columns report the EOS P-tests. For completeness, we again test both ends of the sample. None of these tests yields evidence of a structural break at either end of the sample or in the interior. In particular, there is no evidence of a structural break in 2006 or any subsequent year, corroborating the findings from the Great Recession/ACA era in the US analyses. Not only do the traditional KP tests find no breaks, but the p-values from the EOS P-tests also strongly support this finding.

As in the state-level analyses, we also re-estimate the income elasticity both with and without the post-reform years included (i.e., ending in 2006 versus ending in 2014 for the NHEA or 2020 for the BEA). The income elasticity estimates are similar across time periods and data sets. Using the NHEA data, the estimated income elasticity changes from 1.70 to 1.61 when 2007–2014 is dropped. In the BEA data, the estimate changes from 1.71 to 1.70 when 2007–2020 is dropped. In both data sets, the estimates with and without the post-reform years are strikingly similar, and the changes fall well within the estimated confidence intervals.

Concluding Remarks

The US has enacted many health care reforms over the last 60 years, most with the goal of expanding access to care and/or controlling cost, which in turn could presumably alter the relationship between GDP and health care expenditures. We explore the possible impact of these reforms by testing for structural breaks in the US health care expenditure–GDP relationship using the longest possible time series of annual data (1960–2020). However, because the most substantial reforms (Medicaid/Medicare and the ACA) occurred near the beginning and the end of this period, respectively, we also use tests that perform well over short time periods, such as the end of the sample. Neither the traditional tests on the interior 70% of the sample nor the EOS tests yield evidence of a structural break. These results are consistent with the hypothesis that the relationship in general, and the income elasticity in particular, is unchanged over the 1960–2020 period. This conclusion is robust to using both real and nominal measures and to controlling for additional factors.

This stability and our consistent finding that the US income elasticity of health expenditures is greater than one (so that health care is a luxury good) also provide empirical support for the conclusions of Hall and Jones (2007). That research presents a model based on standard assumptions and shows that a rising health expenditure share is the rational response to increased income because, while the marginal utility of consumption falls with income, the marginal utility of extending life (obtained through more health spending) does not decrease. As income rises, it is therefore utility maximizing to shift spending from general consumption toward the relatively more valuable health expenditures, a response that would explain the stable and high income elasticity we document here. While some aspect of rising health expenditures is likely the result of the various inefficiencies detailed in the “1 Percent Steps for Health Care Reform Project,” the fact that the income elasticity remains stable and high even when controlling for the impact of technological change, the rising age share and the influence of other unobserved factors suggests that there is also another, more fundamentally entrenched side to the driving forces behind the rise in health expenditures.

We caution, however, against interpreting our findings about past reforms as necessarily being predictive of any impacts of future policies. Put simply, our analyses suggest that the major reforms put in place so far do not appear to have fundamentally changed the US HCE–GDP relationship. Future reforms such as those offered on the 1% steps website could very well have an effect, as the studies listed there suggest. Indeed, the empirical methodology presented here offers one approach to testing their success, if enacted.

A challenge to investigating the ACA era is its staggered rollout and coincidental timing with the Great Recession and subsequent recovery. Our parallel analyses using Massachusetts data address these challenges by testing for a structural break in the years following its 2006 health care reform, a reform that is widely viewed as the blueprint for the ACA (Gruber 2013). Its earlier adoption means that more years of post-reform data are available, which permits both traditional KP and EOS structural break tests. The Great Recession is also less of a confounding factor given the reform’s earlier timing and the lesser impact of the Great Recession on Massachusetts. As in the national analyses, the tests provide no evidence of a break in Massachusetts’ HCE–GDP relationship.

All of our analyses therefore lend support to the hypothesis that neither the initiation of Medicare/Medicaid nor the ACA altered the health expenditure income elasticity or overall log–log HCE–GDP relationship. A final concern is that these tests may simply suffer from a lack of statistical power; while we use the best available data, our data series are still fairly short. In addition, the structural break tests consider the entire relationship, including the intercept, rather than only the income elasticity (the slope). Reassurance regarding these concerns comes from the remarkable precision, consistency and stability of the estimated income elasticities from these samples. Taken together, our results suggest that US health care is a luxury good and that the HCE–GDP relationship in general and health expenditure income elasticity in particular have remained quite stable since 1960 despite numerous reforms to health care policy.

Fig. 1 Plots of health care expenditures and GDP per capita (HCEPC and GDPPC) over time

Fig. 2 End-of-sample breaks across the States

Fig. 3 Health expenditure income elasticity across states and time