1 Introduction

The synthetic control method (SCM), originally introduced by Abadie and Gardeazabal (2003), is an appealing tool for evaluating the causal treatment effects of policy interventions and programs in comparative case studies (Athey & Imbens, 2017). SCM has been employed in a large number of important applications (e.g., Abadie et al., 2010, 2015; Acemoglu et al., 2016; Cavallo et al., 2013; Gobillon & Magnac, 2016; Kleven et al., 2013; Bayer & Aklin, 2020). Since the outbreak of the COVID-19 pandemic, SCM has been applied extensively to identify the impacts of pandemic-related restrictions (e.g., Alfano et al., 2021; Bonander et al., 2021; Cole et al., 2020; Lang et al., 2022; Mills & Rüttenauer, 2022; Mitze et al., 2020; Sehgal, 2021; Xin et al., 2021).

SCM estimates the causal treatment effect by constructing a counterfactual of the treated unit (i.e., synthetic control) using a convex combination of similar units not exposed to the treatment (i.e., donors). The convex combination requires non-negative weights that sum to one to avoid extrapolation. The weights are determined to ensure that the treated unit and the synthetic control resemble each other as closely as possible prior to the treatment, both with respect to the outcome of interest and some observed predictors. Since there are typically multiple predictors, the predictors are also weighted using another set of non-negative weights. Abadie and Gardeazabal (2003) and Abadie et al. (2010) discuss several alternative approaches to specify the predictor weights, including the use of subjective weights. In practice, a majority of published SCM applications resort to a data-driven procedure where the weights of predictors and donors are jointly optimized to minimize the mean squared prediction error of the synthetic control over the pre-treatment period, applying the Synth package described in Abadie et al. (2011), which is available for R, Matlab, and Stata.

Despite the popularity of SCM, rather surprisingly, no explicit mathematical formulation of how the predictor weights and the donor weights are jointly optimized has been presented in the literature. Several recent studies note that the synthetic controls produced by standard computational packages available for SCM may encounter numerical instability or fail to achieve the optimum (e.g., Albalate et al., 2021; Becker & Klößner, 2017, 2018; Becker et al., 2018; Klößner, 2015; Kuosmanen et al., 2021).

The purpose of the present paper is to provide a comprehensive investigation into the optimization problem that needs to be solved to compute the synthetic control weights. Unfortunately, the explicit formulation of the SCM problem reveals that computing the synthetic controls entails solving a bilevel optimization problem, a class of problems that is NP-hard in general (e.g., Hansen et al., 1992; Vicente et al., 1994). In essence, the task of computing synthetic controls turns out to be more challenging than previous SCM studies recognize. This insight sheds light on the numerical instability reported by Klößner et al. (2015), among others. To address this problem, we develop an iterative algorithm for solving the original SCM problem, based on Tykhonov regularization or Karush–Kuhn–Tucker approximations. We formally prove that the proposed algorithm is guaranteed to converge to the optimal solution.

The rest of the paper is organized as follows. Section 2 introduces the SCM method and formulates the data-driven approach to compute the predictor and donor weights as a bilevel optimization problem. Section 3 develops an iterative algorithm that is guaranteed to converge to the optimal solution. Section 4 applies the proposed algorithm to the data of the seminal SCM application to the California tobacco control program and compares the empirical results with those produced by the existing computational tools for SCM. Section 5 presents our concluding remarks and discusses potential avenues for future research. Proofs of theorems and the implementation of the descent algorithm are presented in the Appendices.

2 Synthetic Control Method

2.1 Preliminaries

Following the usual notation (e.g., Abadie, 2021), suppose we observe units \(j = 1,\dots , J+1\), where the first unit is exposed to the intervention and the J remaining units are control units that can contribute to the synthetic control. The set of J control units is referred to as the donor pool. For the sake of clarity, we denote the number of time periods prior to treatment as \(T^{\text {pre}}\) and the number of time periods after the treatment as \(T^{\text {post}}\). The outcome of interest is denoted by Y: column vectors \(Y_1^{\text {pre}}\) and \(Y_1^{\text {post}}\) with \(T^{\text {pre}}\) and \(T^{\text {post}}\) rows, respectively, refer to the time series of the pre-treatment and post-treatment outcomes of the treated unit. Similarly, matrices \(Y_0^{\text {pre}}\) and \(Y_0^{\text {post}}\) with J columns refer to the pre-treatment and post-treatment outcomes of the control group, respectively.

Ideally, the impact of treatment could be measured as

$$\begin{aligned} {\alpha } = Y_1^{\text {post}}- Y_1^{\text {post,N}}, \end{aligned}$$
(1)

where \(Y_1^{\text {post,N}}\) refers to the counterfactual outcome that would occur if the unit was not exposed to the treatment. If one could observe the outcomes \(Y_1^{\text {post,N}}\) in an alternative state of nature, where the unit was not exposed to the treatment, then one could simply calculate the elements of vector \(\alpha \). The main challenge in the estimation of the treatment effect is that only \(Y_1^{\text {post}}\) is observable, whereas the counterfactual \(Y_1^{\text {post,N}}\) is not.

The goal of SCM is to construct a synthetic control group to estimate the counterfactual \(Y_1^{\text {post,N}}\). The key idea of the SCM is to use the convex combination of the observed outcomes of the control units \(Y_0^{\text {post}}\) as an estimator of \(Y_1^{\text {post,N}}\). Formally, the SCM estimator is defined as

$$\begin{aligned} {\hat{\alpha }}=Y_1^{\text {post}}- Y_0^{\text {post}}W, \end{aligned}$$
(2)

where the elements of vector \(W\) are non-negative and sum to one. The weights \(W\) characterize the synthetic control, that is, a counterfactual path of outcomes for the treated unit in the absence of treatment.

To set the weights \(W\), the simplest approach considered by Abadie and Gardeazabal (2003) is to track the observed path of pre-treatment outcomes as closely as possible to minimize the mean squared prediction error (MSPE). That is, one could apply the weights \(W\) that solve the following constrained least squares problem

$$\begin{aligned} \min _{W\in \mathbf {{\mathcal {W}}}} L(W) = \frac{1}{T^{\text {pre}}}\left\| Y_1^{\text {pre}}- Y_0^{\text {pre}}W\right\| ^2, \end{aligned}$$
(3)

where

$$\begin{aligned} \mathbf {{\mathcal {W}}}= \left\{ W \in {\mathbb {R}}^J: \ \sum _{j=2}^{J+1} W_j = 1, \ W_j \ge 0, \ j=2,\dots ,J+1 \right\} \end{aligned}$$
(4)

is the set of admissible weights for control units and \(\Vert \cdot \Vert \) denotes the usual Euclidean norm. The constraints on the weights \(W\) ensure that the synthetic control is a convex combination of the control units in the pool of donors. The fact that SCM does not involve extrapolation is considered one of its greatest advantages over regression analysis (e.g., Abadie, 2021). Note that if we relax the constraints on weights \(W\), then the unconstrained minimization problem reduces to the classic OLS problem without the intercept term. In that case, one could simply regress the time series \(Y_1^{\text {pre}}\) on the parallel outcomes of the J donors in the control group and set the weights \(W\) equal to the corresponding OLS coefficients. While the OLS problem has the well-known closed-form solution that satisfies the first-order conditions, the optimal solution to the constrained least squares problem stated above is typically a corner solution where at least some of the constraints on weights \(W\) are binding. The constrained least squares problem can be efficiently solved by quadratic programming (QP) solvers such as CPLEX, which are guaranteed to converge to the global optimum.
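To fix ideas, the following minimal Python sketch (our own illustration with hypothetical data, not part of any SCM package) solves problem (3) with scipy's SLSQP routine standing in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

def simplex_lsq(Y1_pre, Y0_pre):
    """Problem (3): minimize (1/T) * ||Y1_pre - Y0_pre @ W||^2
    subject to W >= 0 and sum(W) == 1."""
    T, J = Y0_pre.shape
    loss = lambda w: np.sum((Y1_pre - Y0_pre @ w) ** 2) / T
    res = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
                   bounds=[(0.0, None)] * J,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   options={"ftol": 1e-12, "maxiter": 500})
    return res.x

# hypothetical data: the treated unit is an exact convex combination
# of donors 0 and 1, so the optimal weights are (0.5, 0.5, 0, 0)
rng = np.random.default_rng(0)
Y0_pre = rng.normal(size=(10, 4))
Y1_pre = 0.5 * Y0_pre[:, 0] + 0.5 * Y0_pre[:, 1]
W_hat = simplex_lsq(Y1_pre, Y0_pre)
```

Because the objective is a convex quadratic and the constraints are linear, any competent QP solver recovers the exact-fit weights here; SLSQP is used only for convenience.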

In addition to the outcome of interest, an integral part of SCM is to utilize additional information observed during the pre-treatment period. Suppose we observe K variables referred to as predictors (also known as growth factors, characteristics, or covariates), which are observed prior to the treatment or are unaffected by it, and which can influence the evolution of Y. These predictors are denoted by a \((K\times 1)\) vector \(X_1\) and a \((K \times J)\) matrix \(X_0\), respectively. Abadie et al. (2010) prove unbiasedness and consistency of the SCM under the condition that the synthetic control yields perfect fit to the predictors, that is, \(X_1 = X_0 W\). Abadie (2021) notes that “In practice, the condition \(X_1 = X_0 W\) is replaced by the approximate version \(X_1 \approx X_0 W\).” It is important to notice, however, that for any particular dataset, there are no ex-ante guarantees on the size of the difference \(X_1 - X_0 W\). When this difference is large, Abadie et al. (2010) recommend against the use of synthetic controls because of the potential for substantial biases.

Since the K predictors included in X do not necessarily have the same effect on the outcomes Y, Abadie and Gardeazabal (2003) introduce a \((K \times K)\) diagonal matrix V whose diagonal elements are weights that reflect the relative importance of the predictors. The diagonal elements of V must be non-negative and are usually normalized to sum to unity. That is,

$$\begin{aligned} V\in \left\{ \text {diag}(V): \ V\in {\mathbb {R}}^{K\times K}, \ \sum _{k=1}^K V_{kk}=1, \ V_{kk}\ge 0\right\} =:\mathbf {{\mathcal {V}}}, \end{aligned}$$
(5)

which is a subset of all non-negative diagonal matrices.

Both Abadie and Gardeazabal (2003) and Abadie et al. (2010) suggest that weights V could be subjectively determined. However, virtually all known applications of SCM resort to the data-driven procedure suggested by the authors. Unfortunately, these seminal papers do not explicitly state the required optimization problem. A closer examination of the SCM problem in the next section reveals that the SCM problem is far from trivial from the computational point of view.

2.2 Bilevel Optimization Problem

Since Abadie and Gardeazabal (2003) and Abadie et al. (2010) only state the SCM problem implicitly, to gain a better understanding of the data-driven approach, the first step is to formulate the SCM problem explicitly. By comparing with the original SCM articles, it is easy to verify that the optimal weights \(V^{\star }\), \(W^{\star }\) must be obtained as an optimal solution to the following optimistic bilevel optimization problem (cf. Albalate et al., 2021)

$$\begin{aligned} \min _{V, \ W} \; L_V (V,W)=\frac{1}{T^{\text {pre}}} \Vert Y_1^{\text {pre}}- Y_0^{\text {pre}}W(V)\Vert ^2 \end{aligned}$$
(6)
$$\begin{aligned} \text {s.t.}\quad & W(V) \in \Psi (V):=\mathop {\textrm{argmin}}\limits _{W \in \mathbf {{\mathcal {W}}}} L_W(V,W) = \Vert X_1 - X_0 W\Vert _V^2, \\ & V \in \mathbf {{\mathcal {V}}}, \end{aligned}$$
(7)

where \(\Vert \cdot \Vert _V\) is a semi-norm parametrized by V, and \(\Psi :\mathbf {{\mathcal {V}}}\rightrightarrows \mathbf {{\mathcal {W}}}\) denotes the solution set mapping from upper-level decisions to the set of global optimal solutions of the lower-level problem. For any \((K\times 1)\) real vector Z, we define \(\Vert Z\Vert _V=(Z^{\top }V Z)^{1/2}\). This becomes a proper norm only when V is positive-definite. If we denote the diagonal elements of V by \(v_1,\dots ,v_K\), we can write the lower-level objective as

$$\begin{aligned} L_W(V,W)=\sum _{k=1}^K v_k \left( X_{k,1}-\sum _{j=2}^{J+1} X_{k,j}W_j\right) ^2, \end{aligned}$$

which allows the lower-level problem to be interpreted as an importance-weighted least squares with weight constraints. As pointed out by Klößner and Pfeifer (2015), this original setup can be easily extended to allow the treatment of predictor data as time series, while maintaining the original structure of the optimization problem.
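For a fixed V, the lower-level problem is thus an ordinary simplex-constrained weighted least squares problem. A minimal Python sketch (our own illustration with hypothetical data) absorbs the weights \(v_k\) into the data by scaling each predictor row by \(\sqrt{v_k}\), with scipy's SLSQP standing in for a QP solver:

```python
import numpy as np
from scipy.optimize import minimize

def lower_level(X1, X0, v):
    """Lower-level problem (7) for fixed predictor weights v (the diagonal
    of V): minimize sum_k v_k * (X1[k] - X0[k, :] @ W)^2 over the simplex."""
    K, J = X0.shape
    A = np.sqrt(v)[:, None] * X0   # scale each predictor row by sqrt(v_k)
    b = np.sqrt(v) * X1
    loss = lambda w: np.sum((b - A @ w) ** 2)
    res = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
                   bounds=[(0.0, None)] * J,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   options={"ftol": 1e-12, "maxiter": 500})
    return res.x

# hypothetical data: K = 3 predictors, J = 3 donors; the treated unit's
# predictors are matched exactly at W = (0.2, 0.3, 0.5)
X0 = np.array([[1.0, 2.0, 3.0],
               [4.0, 0.0, 2.0],
               [1.0, 1.0, 5.0]])
W_true = np.array([0.2, 0.3, 0.5])
X1 = X0 @ W_true
W_fit = lower_level(X1, X0, v=np.array([1 / 3, 1 / 3, 1 / 3]))
```

Here \(X_0\) has full rank, so the zero-loss solution on the simplex is unique; with fewer predictors than donors the argmin set \(\Psi(V)\) can contain multiple points, which is exactly the complication discussed below.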

The explicit formulation of the optimization problem reveals several points worth noting. First, the SCM problem is a bilevel optimization problem, which is far from trivial from the computational point of view. The minimization problem (7) is referred to as the lower-level problem, and problem (6) is called the upper-level problem; the SCM literature commonly uses the terms inner and outer problems, but the meaning is the same. The problem is solvable when it is interpreted as an optimistic bilevel problem, but the global optimum is not necessarily unique.

Proposition 1

The synthetic control problem defined by (6)–(7) has a global optimistic solution \(({\bar{V}},{\bar{W}})\in \mathbf {{\mathcal {V}}}\times \mathbf {{\mathcal {W}}}\).

Unfortunately, bilevel optimization problems are generally NP-hard (Hansen et al., 1992; Vicente et al., 1994). In particular, the hierarchical optimization structure can introduce difficulties such as non-convexity and disconnectedness (e.g., Sinha et al., 2013), which are also problematic in the present setting, as will be demonstrated in the next section.

Second, the explicit statement of the optimization problem makes it more evident that the optimal solution will typically be a corner solution where at least some of the first-order conditions do not hold. This causes a serious problem for the usual derivative-based optimization tools. This observation helps to explain, at least in part, the numerical instability of the SCM results observed by Becker and Klößner (2017) and Klößner et al. (2015), among others. General-purpose computational tools are simply ill-equipped for the task at hand. If the weights \(W,V\) are arbitrarily determined by an ad hoc computational tool that fails to converge to a feasible and unique global optimum, then the attractive theoretical properties of the estimator are no longer guaranteed.

3 Iterative Algorithm

The purpose of this section is to discuss a general algorithm for solving the original SCM problem (6)–(7), where the predictor weights V are jointly optimized with the donor weights \(W\). Since the general algorithm proves computationally demanding, we start by checking whether the solution to the relaxed problem (3) is feasible for the bilevel problem, as well as the possibility of corner solutions. It is noteworthy that surprisingly many of the SCM problems encountered in practice admit either an unconstrained solution or a corner solution. In case the optimal solution is not found through these feasibility checks, we suggest continuing the search for an optimal solution using a descent algorithm based on the Tykhonov regularization technique or Karush–Kuhn–Tucker (KKT) approximations.

To highlight the importance of coordination between the upper-level and lower-level problems, we can rephrase the lower-level problem (7) as

$$\begin{aligned} \min _{W\in \mathbf {{\mathcal {W}}}} L_W^{\varepsilon }(V^{\star },W) = \frac{1}{K}\Vert X_1 - X_0 W\Vert _{V^{\star }}^2 + \varepsilon \Vert Y_1^{\text {pre}}- Y_0^{\text {pre}}W\Vert ^2 \end{aligned}$$
(8)

where \(\varepsilon >0\) denotes an infinitesimally small non-Archimedean scalar. Introducing the upper-level objective as a part of the lower-level QP problem in (8) makes a subtle but important difference compared to problem (7): the primary objective of both (7) and (8) is to minimize the loss function \(L_W\) with respect to predictors X. However, if there are alternate optima \(W^{\star }\) that minimize the loss function \(L_W\), problem (8) will choose the best solution for the upper-level problem.

Proposition 2

For a given set of weights \(V^{\star }\), let \(W_{\varepsilon }(V^{\star })\) denote the unique optimal solution to problem (8) for any \(\varepsilon >0\). Then, we have that

$$\begin{aligned} \lim _{\varepsilon \rightarrow 0+} W_{\varepsilon }(V^{\star })\in \mathop {\textrm{argmin}}\limits _W \{L_V(V^{\star },W): \ W\in \Psi (V^{\star })\}. \end{aligned}$$

The proof of the proposition is simple and can be omitted. It is important to note that the optimal weights \(W\) that minimize \(\Vert X_1 - X_0 W\Vert _{V^{\star }}^2\) subject to the weight constraints (4) need not be unique. This is particularly relevant when there exist \(W\) that satisfy \(\Vert X_1 - X_0 W\Vert _{V^{\star }}^2 = 0\). In such cases, the non-Archimedean \(\varepsilon \) plays an important role by allowing us to select among the alternate optima of the lower-level problem (7) the optimal weights \(W\) that minimize the upper-level objective (6).

Proposition 2 provides a useful result for SCM applications where the weights V are given. Recall that weights V might be subjectively determined, as Abadie and Gardeazabal (2003) and Abadie et al. (2010) suggest. Proposition 2 also demonstrates the critical importance of introducing an explicit link between the lower-level problem and the upper-level problem. In general, there can be many alternate optima where the loss function goes to zero, \(L_W = 0\). Without coordination, there is no guarantee that the SCM package would converge to the optimum. The lack of an explicit link between the upper-level and the lower-level problem is the most fundamental reason why the Synth package fails to reach the optimum.
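The role of the \(\varepsilon \) term can be illustrated with a deliberately degenerate example (our own construction with hypothetical data): two donors match the single predictor equally well, so the lower-level problem has a whole segment of alternate optima, and the regularized problem (8) selects the one that best fits the pre-treatment outcomes. A small finite \(\varepsilon \) stands in for the non-Archimedean scalar:

```python
import numpy as np
from scipy.optimize import minimize

def regularized_lower_level(X1, X0, v, Y1_pre, Y0_pre, eps=1e-3):
    """Problem (8): lower-level loss plus a small multiple of the
    upper-level loss, which breaks ties among alternate lower-level optima."""
    K, J = X0.shape
    T = Y0_pre.shape[0]
    def loss(w):
        lw = np.sum(v * (X1 - X0 @ w) ** 2)            # L_W(V*, W)
        lv = np.sum((Y1_pre - Y0_pre @ w) ** 2) / T    # L_V(V*, W)
        return lw + eps * lv
    res = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
                   bounds=[(0.0, None)] * J,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   options={"ftol": 1e-12, "maxiter": 500})
    return res.x

# degenerate example: both donors match the single predictor exactly, so
# every W in the simplex solves the lower-level problem; the eps term picks
# the donor whose pre-treatment outcome path matches the treated unit
X1 = np.array([1.0])
X0 = np.array([[1.0, 1.0]])
Y0_pre = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 9.0]])
Y1_pre = Y0_pre[:, 0].copy()
W_sel = regularized_lower_level(X1, X0, np.array([1.0]), Y1_pre, Y0_pre)
```

Without the \(\varepsilon \) term, any point on the segment between the two donors would be reported as "optimal"; with it, essentially all weight goes to donor 0.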

3.1 Checking the Feasibility of an Unconstrained Solution

Consider first the situation where no predictors are used (i.e., \(K = 0\)). In this case, the bilevel optimization problem (6)–(7) reduces to the constrained regression problem (3). Problem (3) has a quadratic objective function and a set of linear constraints, which guarantees the existence of a unique global optimum when the usual assumptions of regression analysis hold (i.e., no rank deficiency). Such quadratic programming problems are considered straightforward from the computational point of view. While general-purpose derivative-based tools may struggle with the constraints, dedicated QP solvers (e.g., CPLEX) will converge to the global optimum.

Let \(L(W^{\star \star })=\min _{W\in \mathbf {{\mathcal {W}}}} L(W)\) denote the optimal value of problem (3), attained by the solution \(W^{\star \star }\), which is unique when no rank deficiency is present. As Kaul et al. (2022) correctly note, this value is a lower bound for the optimal value of problem (6):

$$\begin{aligned} L_V(V,W)\ge L(W^{\star \star }) \text { for all } V\in \mathbf {{\mathcal {V}}}, W\in \mathbf {{\mathcal {W}}}. \end{aligned}$$
(9)

Intuitively, imposing additional constraints can never improve the optimal solution. To test if there exist importance weights \(V\in \mathbf {{\mathcal {V}}}\) such that \(W^{\star \star }\) is a feasible solution to the lower-level problem (7), we next solve the following linear programming (LP) problem

$$\begin{aligned} \min _{V\in \mathbf {{\mathcal {V}}}} L_W(V,W^{\star \star })= (X_1-X_0W^{\star \star })^{\top }V (X_1-X_0W^{\star \star }). \end{aligned}$$
(10)

While the objective function of problem (10) is the same as that of the lower-level problem (7) in that both problems minimize the same loss function, problem (7) is minimized with respect to weights W, whereas problem (10) is minimized with respect to weights V, taking \(W^{\star \star }\) as given. This LP problem finds the optimal predictor weights V to support the relaxed problem (3). Denote the optimal solution to problem (10) as \(V^{\star \star }\). If \(L_W(V^{\star \star },W^{\star \star })=0\), the optimal solution has been found. In other words, there exists matrix \(V^{\star \star }\in \mathbf {{\mathcal {V}}}\) such that \(W^{\star \star }\) is a feasible solution to the lower-level problem (7), i.e. \(W^{\star \star }\in \Psi (V^{\star \star })\). Hence, this is also the optimal solution to the bilevel optimization problem (6)–(7).
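Because problem (10) minimizes a linear objective over the unit simplex, its optimum is attained at a vertex: all weight goes to the predictor with the smallest squared residual \((X_1 - X_0 W^{\star \star })_k^2\). The feasibility check therefore reduces to a one-line computation; a sketch with hypothetical inputs (our own helper, not the paper's code):

```python
import numpy as np

def feasibility_check(X1, X0, W_star2):
    """LP (10): minimize sum_k v_k * r_k^2 over the simplex of predictor
    weights, where r = X1 - X0 @ W_star2.  A linear objective over the unit
    simplex is minimized at a vertex, i.e. all weight on the predictor with
    the smallest squared residual."""
    r2 = (X1 - X0 @ W_star2) ** 2
    k = int(np.argmin(r2))
    V_star2 = np.zeros_like(r2)
    V_star2[k] = 1.0
    return V_star2, r2[k]   # optimal weights and the value L_W(V**, W**)

# hypothetical inputs: predictor 0 is fitted exactly by W**, predictor 1 is not
X0 = np.array([[1.0, 2.0], [3.0, 5.0]])
W_star2 = np.array([0.5, 0.5])
X1 = np.array([1.5, 10.0])
V_star2, val = feasibility_check(X1, X0, W_star2)
```

In this toy case the returned value is zero, so \(W^{\star \star }\) together with \(V^{\star \star }\) would already solve the bilevel problem; a strictly positive value would mean the search must continue.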

3.2 Establishing an Upper Bound for \(L_V\)

In the context of SCM, the domain of predictor weights V has K basic solutions, with the following diagonal elements: \(V_1 = (1,0,\ldots ,0)\), \(V_2 = (0,1,\ldots ,0)\), ..., \(V_K = (0,0,\ldots ,1)\). That is, we assign all weight to just one of the predictors and leave zero weight to all other predictors. We can insert the basic solution \(V_k\), \(k = 1, \ldots , K\), as the weights V in problem (8), and solve the QP problem to find the optimal \(W_k\) for each \(k = 1, \ldots , K\). For each candidate weights \(W_k\), \(k = 1, \ldots , K\), we calculate the value of the upper-level loss function \(L_V\) stated in (6). Finally, we select the basic solution \(s \in \{1,\ldots , K\}\) that minimizes \(L_V\). If \(L_W(V_s,W_s) = 0\) and \(L_V(V_s,W_s) = L(W^{\star \star })\), then the corner solution \((V_s,W_s)\) is one of the optimal solutions. If only \(L_W(V_s,W_s)=0\) but \(L_V(V_s,W_s)>L(W^{\star \star })\), the corner solution can be viewed as an upper bound for the optimal value.
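This enumeration of the K basic solutions can be sketched as follows (our own illustration with hypothetical data; the regularized QP (8) is again approximated with a small finite \(\varepsilon \) and scipy's SLSQP):

```python
import numpy as np
from scipy.optimize import minimize

def best_corner(X1, X0, Y1_pre, Y0_pre, eps=1e-3):
    """Enumerate the K basic solutions V_k (all predictor weight on one k),
    solve the regularized lower-level QP (8) for each, and return the corner
    that minimizes the upper-level loss L_V."""
    K, J = X0.shape
    T = Y0_pre.shape[0]
    cons = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    best = None
    for k in range(K):
        def loss(w, k=k):
            lw = (X1[k] - X0[k] @ w) ** 2                 # L_W with V = V_k
            lv = np.sum((Y1_pre - Y0_pre @ w) ** 2) / T   # L_V tie-breaker
            return lw + eps * lv
        w = minimize(loss, np.full(J, 1.0 / J), method="SLSQP",
                     bounds=[(0.0, None)] * J, constraints=cons,
                     options={"ftol": 1e-12, "maxiter": 500}).x
        lw = (X1[k] - X0[k] @ w) ** 2
        lv = np.sum((Y1_pre - Y0_pre @ w) ** 2) / T
        if best is None or lv < best["L_V"]:
            best = {"s": k, "W": w, "L_W": lw, "L_V": lv}
    return best

# hypothetical data: predictor 0 can be matched exactly, predictor 1 cannot
X0 = np.array([[1.0, 2.0, 3.0], [5.0, 1.0, 0.0]])
W_true = np.array([0.2, 0.3, 0.5])
X1 = np.array([X0[0] @ W_true, 9.0])
rng = np.random.default_rng(1)
Y0_pre = rng.normal(size=(6, 3))
Y1_pre = Y0_pre @ W_true
corner = best_corner(X1, X0, Y1_pre, Y0_pre)
```

In this example the procedure selects \(s = 0\), the only predictor admitting a perfect fit, and the associated corner attains \(L_W \approx 0\).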

Proposition 3

If there exist weights \(({\tilde{V}}, {\tilde{W}})\in \mathbf {{\mathcal {V}}}\times \mathbf {{\mathcal {W}}}\) satisfying \(x_{0k}{\tilde{W}} = x_{1k}\) for some predictor k, then there exists another feasible solution \((V_k, {\tilde{W}})\) for the SCM problem (6)–(7), where \(V_k \in \mathbf {{\mathcal {V}}}\) is a corner solution satisfying \(L_W(V_k,{\tilde{W}})=0\). If \(({\tilde{V}}, {\tilde{W}})\) is an optimal solution, then \((V_k, {\tilde{W}})\) is also an alternative optimal solution for the SCM problem.

This result demonstrates that whenever the donor weights \(W\) satisfy the basic condition required for the consistency of the SCM, \(X_1 = X_0W\), even just for a single predictor k, then it is easy to generate feasible solution candidates that are obtained by considering corner solutions with respect to predictor weights V. Intuitively, when the number of predictors is large, it is practically impossible to construct a convex combination of control units that matches the treated unit; in other words, no weight vector \(W\) that satisfies \(X_0W = X_1\) exists. But if we use weights V to reduce the dimensionality of X by assigning some of the predictors a zero weight, then it becomes considerably easier to find vectors W that satisfy \(x_{0k}W = x_{1k}\) at least for some predictor k (note that \(x_{0k}\) is the kth row of matrix \(X_0\) and \(x_{1k}\) is a scalar). Consequently, the set of feasible solutions for the SCM problem often contains several candidate solutions that “switch off” the constraints concerning predictors X by assigning zero weight, except for a single predictor k for which a perfect fit is possible. Therefore, it is understandable that many ad hoc tools attempting to solve the SCM problem (6)–(7) may end up assigning all weight to the most favorable predictor and discarding all other predictors by assigning them zero weight. These observations help to explain the empirical finding, noted by several authors (e.g., Ben-Michael et al., 2021; Doudchenko & Imbens, 2017; Kaul et al., 2022), that the predictors often turn out to have little impact on the synthetic control. While these solutions may not necessarily be optimal for the SCM problem, they can still provide good approximations for the optimal value of the upper-level objective.
Note that the previous steps provide us with the corner solution \((V_s,W_s)\) and the unconstrained solution \(W^{\star \star }\), which can be used for constructing the following bounds for the loss function of the true optimum \((V^{\star },W^{\star })\):

$$\begin{aligned} L_V(V_s,W_s) \ge L_V(V^{\star },W^{\star }) \ge L(W^{\star \star }). \end{aligned}$$

If the gap between these bounds is small and \(W_s \approx W^{\star \star }\) within a reasonable tolerance, there is no need to iterate further. But if there is a significant gap, the following iterative procedure is guaranteed to find the optimum.

3.3 Finding an Optimal Solution Using Tykhonov Regularization

Building on Proposition 2, the basic idea is to construct an iterative descent algorithm to find the bilevel optimal solution by using the following regularized lower-level problem:

$$\begin{aligned} \min _{W\in \mathbf {{\mathcal {W}}}} L_W^{\varepsilon }(V,W) = L_W(V,W)+\varepsilon L_V(V,W), \end{aligned}$$
(11)

where \(\varepsilon >0\). Note that problem (11) is just a re-stated version of the QP problem (8) above. When the optimal solution to the upper-level problem is uniquely defined, the regularized lower-level problem has considerably better regularity properties than the original formulation. In the literature on bilevel programming, this approach is known as Tykhonov regularization (Dempe, 2002). By requiring positive definiteness in the upper-level problem, we can make relatively strong claims regarding the properties of the optimal solutions for the regularized problem. Specifically, it can be shown that the unique optimal solution function to the problem (11), denoted by \(W_{\varepsilon _k}^{\star }(V)\), is Lipschitz continuous and directionally differentiable.

Definition 1

(Lipschitz continuity) A function \(z:{\mathbb {R}}^n\rightarrow {\mathbb {R}}^m\) is called locally Lipschitz continuous at a point \(x^0\in {\mathbb {R}}^n\) if there exist an open neighborhood \(U_{\varepsilon }(x^0)\) of \(x^0\) and a constant \(l<\infty \) such that

$$\begin{aligned} ||z(x)-z(x')||\le l ||x-x'|| \ \forall x, x' \in U_{\varepsilon }(x^0). \end{aligned}$$

Definition 2

(Directional differentiability) A function \(z:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is directionally differentiable at \(x^0\) if for each direction \(r\in {\mathbb {R}}^n\) the following one-sided limit exists:

$$\begin{aligned} z'(x^0;r)=\lim _{t\rightarrow 0+} t^{-1}[z(x^0+tr)-z(x^0)]. \end{aligned}$$

The value \(z'(x^0;r)\) is called the directional derivative of z at \(x=x^0\) in direction r.

Proposition 4

Consider the synthetic control problem in (6)–(7) and let the upper-level cross-product matrix \(Y_0^{\top }Y_0\) be positive definite. Take any sequence of positive numbers \(\{\varepsilon _k\}_{k=1}^{\infty }\) converging to \(0+\). Then,

  1.

    the optimal value of the regularized bilevel problem converges to the optimal value of the original problem as \(k\rightarrow \infty \), i.e.,

    $$\begin{aligned} \min _{V,W}\left\{ L_V(V,W): \ W\in \Psi _{\varepsilon _k}(V), \ V\in \mathbf {{\mathcal {V}}}\right\} \rightarrow L_V^{\star }, \end{aligned}$$

    where

    $$\begin{aligned} \Psi _{\varepsilon _k}(V)=\mathop {\textrm{argmin}}\limits _{W\in \mathbf {{\mathcal {W}}}} L_W^{\varepsilon _k}(V,W), \\ L_V^{\star } = \min _{V,W}\left\{ L_V(V,W): \ W\in \Psi (V), \ V\in \mathbf {{\mathcal {V}}}\right\} \end{aligned}$$

    denote the optimal solution set mapping for (11) and the upper-level optimal value of the original problem, respectively.

  2.

    for each \(\varepsilon _k\), the unique optimal solution to the regularized lower-level problem (11), denoted by \(W_{\varepsilon _k}^{\star }(V)\in \Psi _{\varepsilon _k}(V)\), is directionally differentiable and

    $$\begin{aligned} \lim _{k\rightarrow \infty }\{W_{\varepsilon _k}(V)\}\in \mathop {\textrm{argmin}}\limits _W \{L_V(V,W): \ W\in \Psi (V)\} \end{aligned}$$

    for every fixed \(V\in \mathbf {{\mathcal {V}}}\).

Based on this result, solving the synthetic control problem is equivalent to considering a sequence of problems

$$\begin{aligned} \min _{V} \{ L_{\varepsilon _k}(V): \ V\in \mathbf {{\mathcal {V}}}\} \ \text {for } \varepsilon _k\rightarrow 0+, \end{aligned}$$
(12)

where the implicitly defined objective function \(L_{\varepsilon _k}(V)=L_V(V,W_{\varepsilon _k}^{\star }(V))\) is directionally differentiable with respect to V. The implementation of the descent algorithm is discussed in Appendix B.1. As an alternative to the Tykhonov algorithm, the problem can be also solved using a recently developed approach based on KKT conditions for bilevel problems (Dempe & Franke, 2019). This alternative is briefly described in Appendix B.2.

To summarize this section, the good news is that the SCM problem (6)–(7) is solvable. The bad news is that the required computations prove much more demanding than the original SCM studies assumed. Worse yet, the optimal solution is often a corner solution where most predictors are assigned a zero weight or have a negligible impact. We stress that imposing some small bounds for V (e.g., \(V_{kk} \ge 0.01\)) would have little impact in practice; the corner solution would simply assign the minimum weight to all predictors, except for the most favorable predictor that would get the maximum weight (\(=1-0.01(K-1)\)).

4 Empirical Comparisons

Applying the iterative algorithm proposed in Sect. 3 to the data of the seminal SCM application to the California tobacco control program (Abadie et al., 2010), we empirically verify that the optimal solution in this original case is indeed a corner solution. Table 1 reports the loss function values of the upper-level problem (\(L_V\)) and the lower-level problem (\(L_W\)) as well as the donor weights (\(W\)) and the predictor weights (V) estimated by different SCM packages available for R.

The corner solution is found to be superior to the solutions obtained by the standard implementation of the Synth package described in Abadie et al. (2011) and the MSCMT (Multivariate Synthetic Control Method using Time Series) package proposed by Becker and Klößner (2018). This observation demonstrates that the existing SCM packages fail to find the optimal solution even in one of the original applications of SCM, which is also used as one of the examples to demonstrate the Synth package.

Table 1 California tobacco control application revisited: donor weights, predictor weights, loss functions, and empirical fit by different algorithms

Recall that the value of \(L_V\) measures how well the synthetic control matches the pre-treatment outcomes of the treated unit, and this is the upper-level objective to be minimized. In this respect, all computational packages come relatively close to the global optimum. It is worth noting that the magnitude of \(L_V\) is contingent upon the measurement units of outcomes: for example, multiplying \(Y_1^{\text {pre}}\) and \(Y_0^{\text {pre}}\) by 1 thousand would increase \(L_V\) by a factor of 1 million. Therefore, it is helpful to measure empirical fit with respect to the pre-treatment outcomes in terms of the coefficient of determination (\(R^2\))—after all, the upper-level problem is just constrained least squares regression without intercept. Such a comparison reveals that the differences in empirical fit are rather marginal; the \(R^2\) statistic varies between 0.97518 (Synth) and 0.97878 (optimum). In contrast, the differences in weights \(W\) and V across different computational packages are rather dramatic. The results of Table 1 help to illustrate that good empirical fit may be achieved with a wide variety of weights \(W\) and V, but there is only one unique global optimum.

Fig. 1 The impact of suboptimal \(W\) weights on the evolution of synthetic California

The value of \(L_W\) measures how well the synthetic control matches the predictors \(X_1\). While the minimization of \(L_W\) is the lower-level objective, the consistency of SCM depends on the (nearly) perfect match with the predictors. In this regard, the value of \(L_W\) approaches zero at the global optimum, suggesting a perfect match in terms of the weighted predictors. In contrast, the relatively high value of \(L_W\) given by the standard Synth command points to the fact that Synth fails to converge to the global optimum in the California example. Furthermore, the MSCMT procedure greatly improves \(L_W\) in this case and converges to the global optimum. However, the optimal solution is a corner solution that assigns all weight to a single predictor: cigarette sales per capita in 1980 in the California tobacco control application (see Table 1). The MSCMT package allocates the weight evenly across three predictors, while the Synth package appears to use more balanced weights for predictors; however, note that Synth also assigns almost 90% of the predictor weight to cigarette sales per capita (the outcome variable) during two years of the pre-treatment period. Unfortunately, the Synth package proves inadequate in solving the optimization problem it is supposed to solve; its predictor weights are not what they are claimed to be, but just artifacts of a computational failure.

Figure 1 illustrates the impact of suboptimal donor weights on the evolution of synthetic California. Fortunately, the qualitative conclusions of this original and highly influential application remain intact, although the suboptimal weights yield a smaller estimated treatment effect.

5 Conclusions

SCM has proved a highly appealing approach to estimating causal treatment effects within the context of comparative case studies, as demonstrated by numerous published applications. Unfortunately, the standard computational packages aimed at jointly solving for the donor weights and the predictor weights have proved numerically unstable. The explicit formulation of SCM as an optimistic bilevel optimization problem highlights that the problem is far from trivial from the computational perspective: it is generally NP-hard, significantly exceeding the scope of the computational packages currently in use.

The main contribution of our paper is an iterative computational algorithm for solving the original SCM problem. We proved that this algorithm converges to the optimal solution under relatively mild assumptions, which establishes the existence of a theoretically valid approach for solving the SCM problem. However, the optimal solutions to the original SCM formulation are still typically obtained as corner solutions, where most of the predictors carry zero weight. Thus, in practice, it is rarely necessary to apply Tykhonov regularization or KKT approximations to locate the optimal solution; an optimal solution is usually identified already during the early stages of the iterative procedure.

The computational difficulties of the original SCM formulation do not diminish the conceptual allure of synthetic controls. While we do recognize the value of the data-driven approach to weight determination, it remains crucial to ensure the optimality of the synthetic controls, rather than allowing them to be artifacts of a suboptimal computational tool.

Our findings open various avenues for future research, encompassing both empirical and methodological studies. From the empirical point of view, it would be interesting to apply the proposed algorithm to replicate published SCM studies in order to examine the potential impacts of suboptimal weights on the qualitative conclusions. Becker and Klößner (2017) is an excellent example of such a replication study. We hope that the qualitative results of the influential SCM studies prove robust to the optimization errors that are evidently present, yet this remains to be tested empirically.

From the methodological point of view, the joint optimization of the predictor weights and the donor weights calls for further examination. In particular, the loss function to be minimized requires careful reconsideration to ensure that the optimal solution is reasonable for the intended purposes of using the predictors and that the problem remains computationally tractable. One possibility could involve adopting stepwise optimization of the predictor weights and donor weights, such that the predictor weights are first determined based on alternative criteria (e.g., regression analysis) and subsequently the donor weights are optimized taking the predictor weights as given. We leave this as a fascinating avenue for future research.
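To make the stepwise idea concrete, step 1 might fix the predictor weights through an auxiliary regression of pre-treatment outcomes on the predictors, in the spirit of the regression-based approaches discussed by Abadie and Gardeazabal (2003), before step 2 optimizes the donor weights with those predictor weights held fixed. The sketch below is purely illustrative: the function name and the normalization of absolute coefficients into weights are our own choices, not a procedure from the literature.

```python
import numpy as np

def regression_predictor_weights(Z, y):
    """Illustrative step 1 of a stepwise scheme: derive predictor weights
    from an OLS regression of pre-treatment outcomes y on predictors Z.

    Z: (n, K) predictor values across units (rows = units).
    y: (n,) pre-treatment outcomes of the same units.
    Returns a (K,) non-negative weight vector summing to one.
    """
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS coefficients
    v = np.abs(beta)                               # importance proxy (our choice)
    return v / v.sum()                             # normalize onto the simplex
```

The resulting weight vector would then be passed, as given data, to the donor-weight optimization, sidestepping the bilevel structure entirely at the cost of a less data-driven choice of \(V\).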

Finally, we hope that the insights of our paper could foster further integration of SCM with other estimation approaches such as difference-in-differences, panel data regression, and machine learning; several recent studies (e.g., Abadie, 2020; Amjad et al., 2018; Arkhangelsky et al., 2021; Ben-Michael et al., 2021; Doudchenko & Imbens, 2017; Xu, 2017) have made impressive progress in this direction.