Abstract
Objective
We address several issues concerning standard error bias in pooled time-series cross-section regressions. These include autocorrelation, problems with unit root tests, nonstationarity in levels regressions, and problems with clustered standard errors.
Methods
We conduct unit root tests for crimes and other variables. We use Monte Carlo procedures to illustrate the standard error biases caused by the above issues in pooled studies. We replicate prior research that uses clustered standard errors with difference-in-differences regressions and only a small number of policy changes.
Results
Standard error biases in the presence of autocorrelation are substantial when standard errors are not clustered. Importantly, clustering greatly mitigates bias resulting from the use of nonstationary variables in levels regressions, although in some circumstances clustering can fail to correct for standard error biases due to other reasons. The “small number of policy changes” problem can cause extreme standard error bias, but this can be corrected using “placebo laws”. Other biases are caused by weighting regressions, having too few units, and having dissimilar autocorrelation coefficients across units.
Conclusions
With clustering, researchers can usually conduct regressions in levels even with nonstationary variables. They should, however, be leery of pitfalls caused by clustering, especially when conducting difference-in-differences analyses.
Introduction
Fixed-effects (FE) models estimated on pooled time-series cross-section (TSCS) data, also known as panel data sets, are commonly used in crime studies. Such studies typically use data from many units, frequently states, over many time periods. We use the terms “states” and “years” here, although TSCS can be used with other units. Pooling time series and cross sections creates large sample sizes, a benefit for almost any statistical procedure, and fixed effects regressions are consistent in the presence of unobserved heterogeneity, which is an intractable problem for analyses of cross-section models.
In criminology, the dependent variable is usually a crime rate, although other outcome measures are frequently used. The independent variables of interest are often policy variables, scored as dummies, a procedure called difference-in-differences (DD). Continuous independent variables are also common, such as economic variables (e.g., unemployment rates, poverty rates, income levels), demographic variables (race, age groups), imprisonment rates, police numbers, and executions. Nearly all TSCS regressions in criminology use the fixed effects model, with dummies for each state. Dummies for each year are also used to control for nationwide changes and to reduce cross-state covariance. This paper is limited to fixed effects TSCS models.^{Footnote 1}
We develop the theory of fixed effects in more detail in the “Appendix”. In the absence of lagged dependent variables and omitted variable bias, the FE model, no matter how estimated (demeaning, least squares dummy variables, or first differences), is unbiased and consistent, even in the presence of serial correlation and heteroskedasticity. Positive autocorrelation, a common problem for any time series analysis, causes FE regressions to suffer from second order bias in which the standard errors are underestimated. Since the coefficient estimates are unbiased, the t-ratios are overestimated, due exclusively to standard error bias, often leading to spurious regression results.
Serial correlation can also be addressed parametrically in panel data studies by including one or more lags of the dependent variable.^{Footnote 2} However, because the need for lags is determined by a standard test for serial correlation (Breusch 1978; Godfrey 1978), a false negative may result in only a partial correction. More recently, clustering the standard errors at the group level (e.g., states) has become common. Arellano (1987) originated this procedure, but it did not become widely used in criminology until Bertrand et al. (2002, 2004) recommended it. Clustered standard errors employ estimates of the error covariances within the cluster and ignore the remaining covariances. Since the only variance within the cluster is time series variance, this procedure corrects the standard errors for serial correlation, but the procedure can fail in several circumstances.
The clustering procedure raises several unresolved issues in fixed-effects TSCS regressions.^{Footnote 3} The first is nonstationarity, a potential problem with any time series analysis. We find that clustering greatly reduces the chances of spurious regressions even without differencing nonstationary variables. Next we describe four situations where clustering fails to correct standard error bias: small number of units, heterogeneous autocorrelation coefficients, weighting regressions, and using variables that vary in only one or a few states. The latter is especially important because it affects many DD studies, and we suggest procedures to address this problem.
The next section describes serial correlation, clustering, and nonstationarity. The third section uses unit root tests to suggest how common nonstationarity is in crime studies. The fourth section uses Monte Carlo simulation to estimate how damaging nonstationarity is in TSCS regressions, emphasizing regressions with clustered standard errors. The fifth section repeats this analysis with DD regressions. The sixth section discusses pitfalls to be avoided with clustered TSCS regressions.
Serial Correlation and Unit Roots
Serial Correlation
For a long time, serial correlation in TSCS models was ignored or partly mitigated by entering lagged dependent variables. It is well known that positive serial correlation, the most common kind, causes standard errors to be underestimated (Wooldridge 2016, pp. 372ff). Bertrand et al. (2002, 2004) showed that autocorrelation produces inflated t-ratios in DD studies. The reason is that a dummy variable is highly autocorrelated since it typically consists of a series of zeroes followed by a series of ones so that each year’s value is the same as the prior year’s value except in the year of the policy change. Many crime-related continuous variables are also highly serially correlated, producing similar problems. For example, incomplete correction for serial correlation can falsely indicate that executions have a significant deterrent impact on murder (Donohue and Wolfers 2006; Chalfin et al. 2012). Bertrand et al. (2002, 2004) recommend that researchers use clustered standard errors (for example by using the “cluster” option in Stata regression commands). Hanson (2007), using Monte Carlo simulations, found that clustering works well to produce correct standard errors in data sets of sizes commonly used in criminology TSCS research (see also Cameron and Miller 2015). It is now a standard procedure in criminology.^{Footnote 4}
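The difference between conventional and clustered standard errors can be made concrete with a minimal Python sketch for a one-regressor model with state fixed effects. It is an illustration only, not the estimator used in any of the studies cited: the function names, the AR(1) data-generating process with a coefficient of 0.9, the seed, and the simple G/(G − 1) small-sample factor are all assumptions chosen for brevity.

```python
import math
import random


def ar1_series(rng, n_years, rho):
    """AR(1) series: each value is rho times the prior value plus N(0, 1) noise."""
    value, series = 0.0, []
    for _ in range(n_years):
        value = rho * value + rng.gauss(0.0, 1.0)
        series.append(value)
    return series


def demean_by_state(panel):
    """Subtract each state's own mean (the within, or fixed-effects, transform)."""
    demeaned = []
    for row in panel:
        mean = sum(row) / len(row)
        demeaned.append([v - mean for v in row])
    return demeaned


def fe_standard_errors(x, y):
    """Return (beta, conventional_se, clustered_se) for y on x with state effects.

    x and y are balanced panels stored as one list of yearly values per state.
    """
    xd, yd = demean_by_state(x), demean_by_state(y)
    n_states, n_years = len(xd), len(xd[0])
    sxx = sum(a * a for row in xd for a in row)
    sxy = sum(a * b for xs, ys in zip(xd, yd) for a, b in zip(xs, ys))
    beta = sxy / sxx
    resid = [[b - beta * a for a, b in zip(xs, ys)] for xs, ys in zip(xd, yd)]

    # Conventional variance: every residual is treated as independent.
    dof = n_states * n_years - n_states - 1
    sigma2 = sum(e * e for row in resid for e in row) / dof
    conventional_se = math.sqrt(sigma2 / sxx)

    # Clustered variance: x * e is summed within each state before squaring, so
    # within-state error covariances are kept and cross-state ones are ignored.
    meat = sum(sum(a * e for a, e in zip(xs, es)) ** 2
               for xs, es in zip(xd, resid))
    correction = n_states / (n_states - 1)  # assumed small-sample factor
    clustered_se = math.sqrt(correction * meat) / sxx
    return beta, conventional_se, clustered_se


rng = random.Random(0)
n_states, n_years, rho = 50, 30, 0.9
x = [ar1_series(rng, n_years, rho) for _ in range(n_states)]
y = [ar1_series(rng, n_years, rho) for _ in range(n_states)]  # unrelated to x
beta, conventional_se, clustered_se = fe_standard_errors(x, y)
print(f"beta={beta:.3f} conventional={conventional_se:.3f} "
      f"clustered={clustered_se:.3f}")
```

With strongly autocorrelated data of this kind, the clustered standard error should come out noticeably larger than the conventional one, which is the understatement that clustering is meant to correct.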
Nonstationarity and Differencing
Nonstationarity is an extreme version of autocorrelation, with an autocorrelation coefficient (rho) equal to one. The result is a random walk in which each period’s value is determined by the prior period’s value plus a random component. Such series typically show long waves of increases and decreases over time and may also exhibit a trend or drift. It has long been known that using nonstationary variables results in spurious regressions, with t-ratios potentially overestimated by a factor of five or more (Granger and Newbold 1974). The bias increases with the length of the time series, so more time series observations make the problem worse. It can be understood as the usual problem caused by positive serial correlation, but magnified as the serial correlation coefficient approaches its limiting value of unity. The key point is that autocorrelation corrections, such as clustering, may not eliminate all of the standard error bias due to nonstationarity (Hamilton 1994; Moody 2016).
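The spurious regression problem can be illustrated with a minimal Python sketch. It regresses one simulated random walk on another, independent one and counts how often the conventional t-ratio exceeds 1.96 in absolute value. The function names, the unit-variance normal shocks, the number of replications, and the seed are assumptions made for illustration; this is not the Monte Carlo design used in this paper.

```python
import math
import random


def ols_t_ratio(x, y):
    """Slope t-ratio from a bivariate OLS regression of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    rss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    return beta / math.sqrt(rss / (n - 2) / sxx)


def random_walk(rng, n_years):
    """Random walk: each value is the prior value plus N(0, 1) noise (rho = 1)."""
    level, path = 0.0, []
    for _ in range(n_years):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def rejection_rate(n_years, n_reps=300, seed=1):
    """Share of replications in which two unrelated random walks look 'related'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        x = random_walk(rng, n_years)
        y = random_walk(rng, n_years)  # independent of x by construction
        if abs(ols_t_ratio(x, y)) > 1.96:
            hits += 1
    return hits / n_reps


for n_years in (20, 50, 100):
    print(n_years, rejection_rate(n_years))
```

Because the two series are unrelated, a correctly sized test would reject about five percent of the time. Rejection rates far above that, and rising with the number of years, are the pattern described above.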
Nonstationary variables can almost always be converted into stationary variables by differencing, and much criminology research uses differenced variables.^{Footnote 5} However, one might not want to difference because the coefficients measure only short-term relationships, which may behave differently from long-term relationships. Thus, DD crime research is almost always conducted in levels, rarely mentioning nonstationary problems and the likelihood that t-ratios are inflated. It is likely that any crime-prevention strategy will have impacts well beyond the current year because of a snowballing effect (Glaeser et al. 1996). Lagged crime rates are typically highly significant in crime equations (Spelman 2017). More police, arrests, prison sentences and the like, to the extent that they reduce crime, free up criminal justice resources to reduce crime in later years, an impact that might not be accurately captured using differenced variables. In all, one cannot expect coefficients when differencing to be the same as, or as important as, coefficients in levels regressions.^{Footnote 6}
Stationarity Tests
Traditionally researchers have conducted stationarity tests to determine whether to difference data, but the tests have serious problems. For single time series, the classic test is the augmented Dickey-Fuller (ADF) unit root test (Dickey and Fuller 1979). The test equation is
\[ \Delta y_{t} = \alpha + \delta t + (\rho - 1) y_{t-1} + \sum_{j=1}^{p} \gamma_{j} \Delta y_{t-j} + \varepsilon_{t} \quad (1) \]
where δt is a deterministic trend term, and the lagged dependent variables are included to address autocorrelation in the residuals. The null hypothesis is ρ = 1 so that the coefficient on the lagged level is zero. Rejecting the null hypothesis results in accepting the alternative, that the variable is stationary, around a trend if included or stationary in levels if omitted. The ADF and similar tests are notoriously weak because it is difficult to distinguish between a unit root (ρ = 1) and a near unit root autocorrelation (for example ρ = 0.99). That is, failure to reject nonstationarity does not rule out stationarity. The common practice, nevertheless, is to exercise caution by differencing variables when such failures occur.^{Footnote 7}
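The weakness of the test against near unit roots can be seen in a minimal Python sketch. It is a simplified Dickey-Fuller test with a constant, no trend, and no lagged differences, so it is not the full ADF regression. The critical value of −2.93 is the approximate tabulated five percent value for about 50 observations, and the series length of 56 mirrors the 1960–2015 span; both, along with the function names and the seed, are assumptions made for illustration.

```python
import math
import random

CRITICAL_5PCT = -2.93  # approximate tabulated value: constant only, about 50 obs.


def dickey_fuller_t(y):
    """t-ratio on y[t-1] in the regression  dy[t] = a + phi * y[t-1] + e[t]."""
    lagged = y[:-1]
    diff = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(diff)
    mean_x, mean_d = sum(lagged) / n, sum(diff) / n
    sxx = sum((v - mean_x) ** 2 for v in lagged)
    sxy = sum((v - mean_x) * (d - mean_d) for v, d in zip(lagged, diff))
    phi = sxy / sxx  # estimate of rho - 1
    a = mean_d - phi * mean_x
    rss = sum((d - a - phi * v) ** 2 for v, d in zip(lagged, diff))
    return phi / math.sqrt(rss / (n - 2) / sxx)


def rejection_rate(rho, n_years=56, n_reps=400, seed=0):
    """Share of simulated AR(1) series for which the unit root null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        value, y = 0.0, []
        for _ in range(n_years):
            value = rho * value + rng.gauss(0.0, 1.0)
            y.append(value)
        if dickey_fuller_t(y) < CRITICAL_5PCT:
            rejections += 1
    return rejections / n_reps


for rho in (1.0, 0.99, 0.95, 0.5):
    print(rho, rejection_rate(rho))
```

If the rejection rate for ρ = 0.99 is close to the rate for ρ = 1, the test is not distinguishing a stationary but highly persistent series from a true random walk at this sample size.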
An alternative to the ADF test is the KPSS test, in which the null hypothesis is stationarity (Kwiatkowski et al. 1992). The test equation is as follows, where the time series y_{t} is assumed to be decomposable into a trend, a random walk, and a stationary error term:
\[ y_{t} = \xi t + r_{t} + \varepsilon_{t} \quad (2) \]
where
\[ r_{t} = r_{t-1} + u_{t}, \quad u_{t} \sim \text{iid}(0, \sigma_{u}^{2}) \quad (3) \]
The initial value of the random walk, r_{0}, is treated as a constant.
If y_{t} is stationary (around a trend if ξ is not zero), then \(\sigma_{u}^{2}\) must be zero and r_{0} is the intercept term. If the null hypothesis of zero variance is rejected, the series is nonstationary and differencing is required to avoid biased standard errors.
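A minimal Python sketch of the level-stationarity version of the KPSS statistic (no trend term) is given below. It uses partial sums of the demeaned series and a long-run variance estimate with Bartlett weights. The value 0.463 is the standard tabulated five percent critical value for this version of the test; the lag length of three, the series length of 56, the function names, and the seed are assumptions made for illustration.

```python
import random

KPSS_5PCT_LEVEL = 0.463  # tabulated 5% critical value, level-stationarity null


def kpss_level_stat(y, n_lags):
    """KPSS statistic for the null that y is stationary around a constant level."""
    n = len(y)
    mean = sum(y) / n
    resid = [v - mean for v in y]
    # Partial sums of the residuals.
    partial, running = [], 0.0
    for e in resid:
        running += e
        partial.append(running)
    # Long-run variance estimate with Bartlett weights.
    long_run = sum(e * e for e in resid) / n
    for lag in range(1, n_lags + 1):
        weight = 1.0 - lag / (n_lags + 1)
        cov = sum(resid[t] * resid[t - lag] for t in range(lag, n)) / n
        long_run += 2.0 * weight * cov
    return sum(s * s for s in partial) / (n * n * long_run)


def rejection_rate(rho, n_years=56, n_lags=3, n_reps=300, seed=0):
    """Share of simulated AR(1) series (rho = 1 is a random walk) rejected at 5%."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_reps):
        value, y = 0.0, []
        for _ in range(n_years):
            value = rho * value + rng.gauss(0.0, 1.0)
            y.append(value)
        if kpss_level_stat(y, n_lags) > KPSS_5PCT_LEVEL:
            rejected += 1
    return rejected / n_reps


print("white noise:", rejection_rate(0.0))
print("random walk:", rejection_rate(1.0))
```

Here a rejection means evidence against stationarity, the reverse of the ADF logic, so the two tests can be read side by side for the same series.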
In practice, because these tests lack power in the presence of near unit roots, they often fail to distinguish properly between unit and nonunit roots, and on occasion one test will indicate nonstationarity while the other indicates stationarity, leaving the issue unresolved. This is especially true with crime variables (Spelman 2008). Researchers, thus, do not know whether differencing is advisable, although they often use first differences as the conservative option.^{Footnote 8}
With respect to TSCS data, there are many panel unit root tests available (Breitung and Pesaran 2008; Maddala and Wu 1999). Because TSCS regressions tend to have more observations than single time series, panel unit root tests have more power than tests on single time series. Stata has six such tests, five of which are based on the null hypothesis of a unit root: Breitung (2000), Fisher (Maddala and Wu 1999), Harris and Tzavalis (1999), Im et al. (2003), and Levin et al. (2002). The null hypothesis for these tests is that all states have unit roots, that is, \(\rho_{1} = \rho_{2} = \cdots = \rho_{n} = 1\), where n is the number of states. For the Breitung, HT, and Levin–Lin tests, the alternative hypothesis is \(\rho_{1} = \rho_{2} = \cdots = \rho_{n} = \rho\), that is, all of the states have the same stationary serial correlation coefficient. The IPS and Fisher tests allow each state to have its own autocorrelation coefficient if the null hypothesis is rejected. Since there is no reason to expect that all states have the same serial correlation, which can take any value less than one in absolute value, we will focus on the IPS and Fisher tests.
The sixth test, Hadri (2000), a generalization of the KPSS, tests the null hypothesis that the variable is stationary in all states. The alternative hypothesis is that all states have unit roots. There is some evidence from Monte Carlo simulations that the Hadri test overrejects the null of stationarity (Hlouskova and Wagner 2006).
We use these panel unit root tests on Uniform Crime Rates (UCR) with data from 1960 to 2015, available from the Bureau of Justice Statistics.^{Footnote 9} Since stationarity is a characteristic of the variable generally, it is best to use all years for which adequate data exists. The results vary (Table 1).^{Footnote 10} The tests usually find stationarity, either with or without a trend. The preferred tests, IPS and Fisher, find stationarity without a trend for all crimes except burglary, but again they fail to show how many states have stationary series. For every crime, the Hadri test rejects stationarity, indicating that at least one state series is nonstationary. However, we know that the test tends to overreject stationarity, making it difficult to determine the extent of nonstationarity.
Thus a major limitation of panel unit root tests is that there is no obvious way, whether the null hypothesis is accepted or rejected, to determine how many crime series are stationary and how many are nonstationary. Given the results in Table 1, the various crime rates could be stationary in only a few states, or they could be stationary in nearly all states. Thus, the researcher cannot know the extent of standard error bias caused by nonstationarity, which, again, leads many researchers to conduct regressions in differences. For this reason we explore stationarity state-by-state, using the DF-GLS test (Elliot et al. 1996), a more powerful version of the ADF test. The power of the test derives from the method of detrending. Elliot et al. note that, when the sample size is large, any reasonable test will have high power unless ρ is very close to one. Therefore, we should use a procedure such that the parameter space is a shrinking neighborhood around unity as the sample size grows. The ADF and PP tests do not use such “local to unity” asymptotics. Details of the DF-GLS test are given in the “Appendix”. Still, the test has limited power given the number of observations for each state, but it does supply better information about the extent of nonstationarity than the panel tests do.
Another problem with the unit root tests for single time series is determining the critical value for deciding whether a variable is stationary, which is especially important given the weakness of the tests. The usual critical value corresponds to a p value of 0.05. Stata programs also give critical values for 0.10. Even the latter often fails to identify stationarity, and a recent paper suggests that 0.30 probability be used (Kim and Choi 2017).
For our UCR data, the DF-GLS test applied state-by-state almost never rejects nonstationarity, with the exception of murder in a few states (Table 2).^{Footnote 11} This is true with and without a trend, and it is true even at the 0.10 significance level. The KPSS test (Kwiatkowski et al. 1992) rejects stationarity (without a trend) for a majority of states only for rape and assault. For the other five crime types, the DF-GLS and KPSS tests reject neither stationarity nor nonstationarity for most states, leaving the stationarity status in doubt. These results are consistent with similar analyses by Spelman (2008, 2017), but the results here are based on more years of data.
In Table 3 we present the results of panel unit root tests on independent variables commonly used in crime regressions. The TSCS tests, again, give little help, especially because rejecting the null of nonstationarity does not indicate how many states have stationary data. The state-by-state DF-GLS tests give more information (Table 4). They rarely rule out nonstationarity for prison population, state real personal income, and police numbers.^{Footnote 12} The KPSS test almost always rejects stationarity, although the rejection occurs less often for trend stationarity. The percent population aged 20–24 presents a mixed picture, and the DF-GLS is inconsistent with the KPSS. The variable appears to be nonstationary in many states, but very likely trend stationary in most. Poverty and unemployment rates are probably stationary in most states, but they could be nonstationary in a few.
In sum, this stationarity/nonstationarity topic with TSCS research is confusing and uncertain. We next show that the common procedure of clustering standard errors allows the researcher, in most cases, to avoid these problems.
Critical Values of t-ratios from Simulations
What exactly is the danger of conducting pooled regressions in levels in the presence of unit roots in some states? To what extent does clustering, which addresses autocorrelation, also mitigate the similar problem of standard error bias due to nonstationarity? We use Monte Carlo simulations with 50 states, with separate simulations for 20, 30, 40, and 50 years (years typical for pooled crime regressions using UCR data), and with separate simulations using varying numbers of states having nonstationary dependent and independent variables: 0, 10, 20, 30, 40 or 50. Thus, there are 144 variations. We conduct each with (1) simple OLS, (2) OLS with a lagged dependent variable, (3) clustered standard errors, and (4) clustering with a lagged dependent variable.^{Footnote 13} The simulations with clustering are the most important since researchers now routinely cluster. The OLS results are useful to evaluate the numerous earlier TSCS studies, which used OLS without regard to autocorrelation and nonstationarity and, later, with lagged dependent variables that partly address these issues. The “Appendix” describes the Monte Carlo procedure in more detail and the Monte Carlo programs are available in the online appendix.
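The general idea of simulating critical values can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure described in the “Appendix” or the programs in the online appendix: the function names are hypothetical, the same number of states is nonstationary for both variables, the stationary states draw rho between 0.80 and 0.99, state and year effects are removed by two-way demeaning, the clustered variance uses a simple G/(G − 1) factor, and only 200 replications are run.

```python
import math
import random


def simulate_series(rng, n_years, rho):
    """AR(1) series; rho = 1.0 gives a random walk (a nonstationary series)."""
    value, series = 0.0, []
    for _ in range(n_years):
        value = rho * value + rng.gauss(0.0, 1.0)
        series.append(value)
    return series


def simulate_panel(rng, n_states, n_years, n_nonstationary):
    """The first n_nonstationary states get unit roots; the rest get high rho."""
    return [simulate_series(rng, n_years,
                            1.0 if i < n_nonstationary else rng.uniform(0.80, 0.99))
            for i in range(n_states)]


def two_way_demean(panel):
    """Remove state and year means from a balanced panel (state and year dummies)."""
    n, t = len(panel), len(panel[0])
    state_means = [sum(row) / t for row in panel]
    year_means = [sum(row[s] for row in panel) / n for s in range(t)]
    grand = sum(state_means) / n
    return [[row[s] - state_means[i] - year_means[s] + grand for s in range(t)]
            for i, row in enumerate(panel)]


def t_ratios(x, y):
    """Return (conventional_t, clustered_t) for the slope of y on x."""
    xd, yd = two_way_demean(x), two_way_demean(y)
    n, t = len(xd), len(xd[0])
    sxx = sum(a * a for row in xd for a in row)
    beta = sum(a * b for xs, ys in zip(xd, yd) for a, b in zip(xs, ys)) / sxx
    resid = [[b - beta * a for a, b in zip(xs, ys)] for xs, ys in zip(xd, yd)]
    sigma2 = sum(e * e for row in resid for e in row) / (n * t - n - t)
    conventional_se = math.sqrt(sigma2 / sxx)
    meat = sum(sum(a * e for a, e in zip(xs, es)) ** 2
               for xs, es in zip(xd, resid))
    clustered_se = math.sqrt(n / (n - 1) * meat) / sxx
    return beta / conventional_se, beta / clustered_se


def critical_values(n_states=50, n_years=30, n_nonstationary=50,
                    n_reps=200, seed=0):
    """Simulated 5% two-sided critical values: 95th percentile of |t| under the null."""
    rng = random.Random(seed)
    conventional, clustered = [], []
    for _ in range(n_reps):
        x = simulate_panel(rng, n_states, n_years, n_nonstationary)
        y = simulate_panel(rng, n_states, n_years, n_nonstationary)  # unrelated to x
        t_ols, t_cl = t_ratios(x, y)
        conventional.append(abs(t_ols))
        clustered.append(abs(t_cl))
    index = math.ceil(0.95 * n_reps) - 1
    return sorted(conventional)[index], sorted(clustered)[index]


ols_cv, clustered_cv = critical_values()
print(f"conventional: {ols_cv:.2f}  clustered: {clustered_cv:.2f}")
```

Because x and y are generated independently, the null hypothesis is true in every replication, so the 95th percentile of the absolute t-ratio is the value a researcher would need to exceed to hold the false rejection rate at five percent. With so few replications the estimates are noisy and should be read as indicative only.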
The complete output of the simulation is in the online appendix, and a summary is in Table 5. We find that, except for simple OLS, the critical values are very similar irrespective of the number of years used, the number of nonstationary dependent variables, and the number of nonstationary independent variables. As expected, the OLS critical values increase with more years, but that pattern is not found when entering a lagged dependent variable or when clustering.
Also as expected, the critical values are by far the largest with OLS and somewhat smaller (approximately 2.40–2.50) when adding a lagged dependent variable. The key finding in Table 5 is that the critical values when clustering are approximately correct, around 2.07, and around 2.04 when adding a lagged dependent variable.^{Footnote 14} The clustered critical values never exceed 2.14. As a practical matter, with data sets approximating the size used in these simulations, researchers are justified in conducting regressions in levels with nonstationary data, with a small increase in the critical value and with due regard to the pitfalls that we discuss later.
Standard Error Bias with Difference-in-Differences
We conducted similar Monte Carlo experiments with a step dummy independent variable. This variable takes the unit value in a random year, and years thereafter, in half the states (enough to mitigate problems resulting from having too few states with dummies, discussed later). The results are reported in Table 6. The critical values for clustered standard errors are similar to those with continuous variables, with a median of 2.11 and a high of 2.15. Consistent with Table 5, the number of years and the number of nonstationary state series make virtually no difference, except in the OLS simulations. In all, this should come as a relief to the many researchers who conduct DD in levels with clustering, without considering nonstationarity problems, but they must use slightly higher critical values. As expected, dummy coefficients in studies without clustering are not significant unless t-ratios are much higher, around 2.50 when using a lagged dependent variable, and very large without (consistent with Bertrand et al. (2002, 2004)).
Pitfalls in Clustering
Clustering is far from a complete panacea for standard error bias. We have identified four situations that can cause problems with clustered standard errors: (1) regressions with only a few states, (2) regressions with data that have dissimilar autocorrelation coefficients for the states, (3) regressions in which a variable is equal to zero (or otherwise constant) in most states, and (4) regressions weighted by population. The third is the most important, mainly because it leads to greatly understated standard errors in many DD analyses.
Small “N”
In the simulations so far in this paper the number of cross sections, N = 50, was selected because that is a common size in criminology TSCS studies (and city-level and county-level studies usually have even larger numbers of cross sections). The results in Table 5, however, do not apply when there are substantially fewer states. A small N causes some standard error bias with OLS (Cameron and Miller 2015), and the bias is much larger when clustering. We conducted simulations with 3–50 states,^{Footnote 15} using 50 years and assuming that both variables are nonstationary in all states. Figure 1 illustrates the results for clustered regressions and for OLS with a lagged dependent variable. The critical values when clustering are 5.68 with three states, 2.65 with ten, and 2.23 with 20. They stabilize when there are more than about 40 states, where the values are similar to those in Table 5.^{Footnote 16}
Wildly Heterogeneous Autocorrelation Coefficients
The second pitfall with clustering occurs when most states have small autocorrelation coefficients for the independent variable, and a few have large values. In our simulations (Table 5), we use a high value for rho, the autoregressive coefficient (random numbers between 0.80 and 0.99) because the topic there is nonstationary variables, and we assume that if some variables are nonstationary then the remaining are likely to have high autocorrelation coefficients.
On the other hand, it is possible that a variable has small autocorrelation coefficients in some states and large values in others. We illustrate the issue with simulations setting autocorrelation coefficients for independent variables at random numbers, between 0.10 and 0.20, in stationary states, and then we vary the number of states with nonstationary variables from one to fifty. As seen in Fig. 2, the standard error bias increases with more years, and it varies in an unusual way with the number of nonstationary states. The bias is negligible if only one state has nonstationary data, then increases and reaches a peak at seven to ten states, and finally declines and levels off at about 45 states. The peak is a critical value of 2.50 with 50 years and seven states having nonstationary independent variables. With 20 years the peak is only 2.20. We are not aware of any explanation for this pattern.^{Footnote 17} The problem has been minor in TSCS crime studies to date because they seldom have more than 30 years, but it may become a concern as more data becomes available.
Figure 2 only illustrates the possibility of standard error bias, and comprehensive guidelines are not feasible because the bias depends on, among other things, the particular distribution of rho in the data set. Of course, the bias is less when the value of the autocorrelation coefficient is larger; setting rho equal to a series of random numbers between zero and one generates a peak critical value of 2.27 with 50 years. Researchers might well calculate the state-level autocorrelation coefficients to rule out heterogeneous values. A danger sign is an OLS standard error that is larger than the clustered standard error, since we found no sign of this particular bias with OLS.
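A check of this kind can be sketched in a few lines of Python. The sketch computes the sample lag-1 autocorrelation coefficient for each state and reports the spread across states. The function names and the toy numbers are hypothetical, and the sample coefficient is only one of several possible estimators (it is known to be biased toward zero in short series).

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation coefficient of one state's series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return sum(a * b for a, b in zip(dev[:-1], dev[1:])) / denom


def rho_summary(panel):
    """Return (lowest, middle, highest) state-level coefficient.

    `panel` maps a state label to its list of yearly values. With an even
    number of states the "middle" value is the upper of the two central ones.
    """
    rhos = sorted(lag1_autocorrelation(s) for s in panel.values())
    return rhos[0], rhos[len(rhos) // 2], rhos[-1]


# Made-up numbers for illustration only (not real crime or policy data).
panel = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],        # smooth, persistent series
    "B": [1.0, -1.0, 1.0, -1.0],           # rapidly alternating series
    "C": [2.0, 1.0, 3.0, 2.0, 4.0, 3.0],
}
low, middle, high = rho_summary(panel)
print(low, middle, high)
```

A wide gap between the lowest and highest coefficients, with most states near the low end and a handful near one, would correspond to the heterogeneous case discussed above.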
Small Number of Policy Changes and Placebo Laws
A serious and common problem with clustering occurs in DD analyses with what Conley and Tabor (2011) call the “small number of policy changes.” That is, if the key independent variable is constant in all but a few states, the standard error for the coefficient is biased downward, the more so the fewer the number of states with policy changes. In practice, this problem occurs mainly in DD studies in which the dummy variable indicating a policy change takes the unit value in only a few states.
The use of clustered standard errors is justified by appealing to asymptotic assumptions in which the number of states goes to infinity. We have shown that clustering works quite well if policy changes occur in many states (Table 6). However, if there is a policy change in just one state, for example, standard errors derived from assuming a large number of observations are biased because clustering reduces the effective number of observations to the number of years in that state.
Conley and Tabor report the results of Monte Carlo experiments in which clustered standard errors are underestimated, and t-ratios overestimated, by a factor of three. That is, instead of rejecting the null hypothesis when it is true five percent of the time, the rejection rate is 15%. The problem is much worse when only one state has a policy change: clustered standard errors are underestimated by a factor of 17. Similarly, we conducted simulations using the procedure in Table 6, but varying the number of states with policy changes from 1 to 50. The simulations for one or two states with policy changes would not converge, so Fig. 3 starts with three states. The t-ratio critical values, when standard errors are clustered, begin at 6.78 for three states with policy changes, then decline before leveling off at about 40 states.^{Footnote 18} The bias is perhaps tolerable after about 25 states, where the critical value is 2.13.
There are more precise procedures to estimate the correct critical value. Conley and Tabor suggested an approach that employs the distribution of the coefficient on the treatment dummy for the untreated states to facilitate inference on the dummy for the treated state. However, this implementation is for a single state only, and if more than one state passed a similar law then the procedure is quite involved.
Fortunately, a simple alternative approach is available, the “placebo law” method (Bertrand et al. 2002, 2004; Bound et al. 1995; Abadie et al. 2010). It is closely related to the randomized inference test, also known as Fisher’s exact test (Rosenbaum 1996). Helland and Tabarrok (2004) use it to evaluate shall-issue laws.^{Footnote 19} The procedure is implemented as follows. Suppose the regression model is
\[ y_{it} = \alpha_{i} + \delta_{t} + X_{it}\gamma + \beta D_{it} + \varepsilon_{it} \quad (4) \]
where i = 1,…,N refers to states and t = 1,…,T refers to years, y_{it} is the outcome variable, e.g. crime, α_{i} is the fixed state effect, δ_{t} is the fixed year effect, X_{it} is a matrix of control variables, and D_{it} is a step dummy variable equal to one in those states (the “treated” states) and those years for which the law or policy was in force, zero otherwise. To gain knowledge of the true distribution of the parameter estimates and the corresponding standard errors, we replace the dummy D_{it} with another dummy P_{it} for which the state and years have been randomly chosen. We set the number of states at the number of treated states and choose a range of years in the neighborhood of the actual policy change.
We repeat this process many times, each time estimating (5), thereby generating a distribution of the estimates of β, which we call \(G(\hat{\beta })\). This distribution is centered on zero because a randomly assigned law is fictitious and can be expected to have an estimated coefficient of zero on average. However, random error will generate “significant” effects five percent of the time, even though the null hypothesis is true. The \(G(\hat{\beta })\) distribution can also be used to generate standard errors and the corresponding t-ratios, which allow us to determine critical values and p-values under the null hypothesis of no effect. Using \(G(\hat{\beta })\) we can see the simulated five percent critical value and compare it to the biased t-statistic generated by the estimation of (4).
A placebo law program can be appended to any Stata do-file, for example. It yields the distribution of the estimated coefficient, from which correct 95% confidence intervals follow, or the corresponding distribution of the t-ratios, whose 0.025 and 0.975 quantiles (or the 95th percentile of the absolute value, assuming symmetry) give the correct critical values at the two-sided 0.05 significance level. A sample program is in the online appendix.
An important property of this method is that it uses the actual data and, therefore, correctly reflects the effects of serial correlation, non-normality, cross-state covariance, and other problems of real world panel data that plague statistical inference. We need make no assumptions concerning the distribution of the error term, such as the usual assumption of independent, identically distributed normal errors.
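The placebo law procedure can be sketched in Python as follows. This is a simplified illustration and not the sample program in the online appendix: it omits control variables, removes state and year effects by two-way demeaning, uses a simple G/(G − 1) factor in the clustered variance, and takes the 95th percentile of the absolute t-ratio (which assumes symmetry). The function names, the simulated outcome panel, the year window, and the seed are hypothetical.

```python
import math
import random


def two_way_demean(panel):
    """Remove state and year means from a balanced panel (state and year dummies)."""
    n, t = len(panel), len(panel[0])
    state_means = [sum(row) / t for row in panel]
    year_means = [sum(row[s] for row in panel) / n for s in range(t)]
    grand = sum(state_means) / n
    return [[row[s] - state_means[i] - year_means[s] + grand for s in range(t)]
            for i, row in enumerate(panel)]


def clustered_t(dummy, y_demeaned):
    """Clustered t-ratio on the dummy in a two-way fixed effects regression."""
    xd = two_way_demean(dummy)
    n = len(xd)
    sxx = sum(a * a for row in xd for a in row)
    beta = sum(a * b for xs, ys in zip(xd, y_demeaned)
               for a, b in zip(xs, ys)) / sxx
    meat = sum(sum(a * (b - beta * a) for a, b in zip(xs, ys)) ** 2
               for xs, ys in zip(xd, y_demeaned))
    return beta * sxx / math.sqrt(n / (n - 1) * meat)


def placebo_dummies(rng, n_states, n_years, n_treated, first, last):
    """Step dummies switched on in randomly chosen states and start years.

    first and last are zero-based year positions with 1 <= first <= last < n_years,
    so every placebo dummy contains both zeroes and ones.
    """
    dummies = [[0.0] * n_years for _ in range(n_states)]
    for i in rng.sample(range(n_states), n_treated):
        start = rng.randint(first, last)  # inclusive on both ends
        for s in range(start, n_years):
            dummies[i][s] = 1.0
    return dummies


def placebo_critical_value(y, n_treated, first, last, n_reps=999, seed=0):
    """95th percentile of |t| across randomly assigned placebo laws."""
    rng = random.Random(seed)
    n_states, n_years = len(y), len(y[0])
    yd = two_way_demean(y)
    stats = sorted(
        abs(clustered_t(placebo_dummies(rng, n_states, n_years,
                                        n_treated, first, last), yd))
        for _ in range(n_reps))
    return stats[math.ceil(0.95 * n_reps) - 1]


# Simulated outcome with no true policy effect (illustration only).
rng = random.Random(42)
outcome = []
for _ in range(20):                       # 20 states
    value, series = 0.0, []
    for _ in range(15):                   # 15 years, AR(1) with rho = 0.8
        value = 0.8 * value + rng.gauss(0.0, 1.0)
        series.append(value)
    outcome.append(series)

cv_one_state = placebo_critical_value(outcome, n_treated=1, first=5, last=10,
                                      n_reps=299)
print(f"placebo critical value with one treated state: {cv_one_state:.2f}")
```

In use, the t-ratio from the actual law would be compared with the placebo critical value rather than with 1.96. With a real data set the outcome panel and the year window would come from the study being replicated.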
A key question is what is meant by a “small” number of policy changes. It depends on how much standard error bias one deems acceptable. Figure 3 suggests that some bias occurs when most states have policy changes. Conley and Tabor (2011) state that clustering fails for ten or fewer states. With five states, clustering rejects the correct null hypothesis 15% of the time as opposed to the nominal five percent. Obviously, one or two states with policy changes present serious problems. The severity of the standard error bias can only be determined on a case by case basis. In the following paragraphs we illustrate this bias with three studies of gun laws. The first evaluates only one policy change, and the bias is huge. We then give two examples of research where eight states adopted a particular policy. Here the bias is moderate, but it is enough to affect the results.
Illustration 1: Permit to Purchase Handguns
Webster et al. (2014a) studied Missouri’s August 2007 repeal of a law that required a permit to purchase (PTP) a handgun. Using state-level data for 1999–2010 and clustering standard errors at the state level, the authors find that, “…the repeal of Missouri’s PTP law was associated with an increase in annual firearm homicides rates….”^{Footnote 20} (p. 297). Our replication of their analysis yielded the impressive t-ratio of 9.04, apparently confirming their results.
We subjected this model to the placebo law test. Using their data, we generated repeals of placebo laws in random years after 1999 from the same number of states that have permit to purchase laws. The results were in the direction expected, but nevertheless surprising. The 975th observation in the distribution of the 999 estimated t-statistics on the placebo law dummy variable is 10.60; thus the t-ratio of 9.04 is not significant at the five percent level, two-tailed.^{Footnote 21} Similarly, the Conley-Tabor analysis, which is relatively easy with just one policy change, yields a 95% confidence interval around the estimated coefficient on the policy dummy of (0.040, 2.19). We reject the null hypothesis if the coefficient (1.179) is outside the 95% confidence interval. Both these tests indicate that any impact of the PTP on gun homicides is not statistically significant.
Illustration 2: Large Capacity Magazine Bans
We estimated a DD model of the effect of large capacity magazine (LCM) bans on firearm homicide. Eight states have bans on LCMs holding more than a set number of rounds, usually 10 (California 1999, Colorado 2013, Connecticut 2013, Hawaii 1992, Maryland 1994, Massachusetts 1998, New Jersey 1990, and New York 1994). Our sample consists of observations on 50 states from 1977 to 2015. The dependent variable is the log of the firearm homicide rate. The variables of interest are a dummy variable taking the unit value for states in years in which an LCM ban is implemented, and a spline consisting of zeroes before the ban and the number of years the ban has been in place after the year of passage. We also include control variables that are frequently used in criminal policy regressions, and we include state and year dummy variables.
The results (Table 7) appear to show that the spline (post-law trend) is significantly negative (t = −2.03), indicating that state LCM bans reduce firearm homicide over time. This result is not unexpected since most state LCM bans grandfather existing LCMs, so it is possible that the ban could become more effective as time passes and the stock of LCMs declines.
We generated eight placebo laws in random states beginning in random years. The random years were constrained to the years 1987–1997 for six states, reflecting the early adopters, and to the years 2002–2012 for two states, reflecting the late adopters Colorado and Connecticut. We then estimated the regression reported in Table 7, replacing the LCM ban dummy and post-law trend with the placebo law dummy and placebo post-law trend. We repeated this process 999 times. The resulting 999 t-ratios represent the distribution of t-statistics under the null of no relationship between LCM bans and firearm homicide. The 975th observation, which is the 0.05 two-sided critical value under the null, is 3.05 for the LCM ban dummy and 2.95 for the post-law trend. Thus, the coefficients on the LCM ban dummy and the post-law trend are far from significant.^{Footnote 22}
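The placebo law procedure can be sketched in a few lines of Python. The sketch is illustrative only and is not the program in the online appendix: the function names, the exact coding of the dummy and post-law trend, and the user-supplied `fit_t_ratio` callable (which would rerun the regression and return the clustered t-ratio of interest) are all assumptions.

```python
import numpy as np

def make_placebo_vars(n_states, years, n_laws, first_year, last_year, rng):
    """Build placebo dummy and post-law trend arrays, shape (n_states, n_years)."""
    years = np.asarray(years)
    dummy = np.zeros((n_states, years.size))
    trend = np.zeros((n_states, years.size))
    treated = rng.choice(n_states, size=n_laws, replace=False)  # random states
    for s in treated:
        start = rng.integers(first_year, last_year + 1)  # random start year (inclusive)
        dummy[s] = (years >= start).astype(float)        # assumed coding: 1 from start year on
        trend[s] = np.clip(years - start, 0, None)       # 0 before, years elapsed afterwards
    return dummy, trend

def placebo_distribution(fit_t_ratio, n_states, years, n_laws,
                         first_year, last_year, reps=999, seed=0):
    """Collect t-ratios from `reps` placebo regressions (fit_t_ratio is hypothetical)."""
    rng = np.random.default_rng(seed)
    t_ratios = np.empty(reps)
    for r in range(reps):
        dummy, trend = make_placebo_vars(n_states, years, n_laws,
                                         first_year, last_year, rng)
        t_ratios[r] = fit_t_ratio(dummy, trend)
    return t_ratios

def placebo_critical_values(t_ratios):
    """Return the 25th and 975th ordered values of 999 placebo t-ratios."""
    ordered = np.sort(np.asarray(t_ratios, dtype=float))
    if ordered.size != 999:
        raise ValueError("expected 999 placebo t-ratios")
    return ordered[24], ordered[974]
```

With 999 replications the 975th ordered value can be read off directly, with no need to interpolate between two observations. An estimated t-ratio that lies between the two returned values would not be significant at the 0.05 level, two-tailed.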
Illustration 3: Right-to-Carry Gun Laws
Abhay et al. (2014) find that right-to-carry gun laws have a significant positive impact on murder rates.^{Footnote 23} The study was limited to the eight states that adopted these laws in 1999–2010 (many more states enacted earlier laws). They use a dummy and spline, as in the large capacity magazine ban example above. They estimate a fixed effects model with clustered standard errors. The key result is a positive and apparently highly significant break in the trend (coefficient = 0.014, t = 2.68). We applied the placebo law technique with eight randomly chosen states starting in randomly chosen years after 1999. The resulting 0.05 critical value, two-tailed, is 2.73, indicating that the break in trend is not significant.^{Footnote 24}
Excessive Weighting and Heterogeneous Variation
The small-number-of-policy-changes problem is an extreme example of problems that occur when an independent variable has much greater variation in some states than others. The coefficient is based on states with policy changes, but the clustered standard error is based on the full sample. The same problem exists, to a lesser degree, when an independent variable has considerable variation within a few states and little variation in the rest.
An example occurs when a regression is weighted. Crime TSCS studies commonly weight regressions by state population, which in effect multiplies each variable by the square root of population. The reasons for weighting, seldom explained, appear to be to give more emphasis to the most important states or to equalize the within-state variation of crime rates across states.^{Footnote 25} With DD regressions, the weighting transforms a dummy variable into the square root of population, such that the onset of the dummy is a much bigger jump in large states. That is, the weighted dummy varies much more in large states, such that the large states can have outsized influence on the dummy coefficient.
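The equivalence between weighting by population and multiplying every variable by the square root of population can be checked numerically. The following is a minimal sketch on made-up cross-section data; the populations, the dummy, and the coefficient values are hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
pop = rng.uniform(0.5, 40.0, size=n)          # hypothetical populations (millions)
dummy = (rng.random(n) < 0.3).astype(float)   # hypothetical policy dummy
y = 1.0 + 0.5 * dummy + rng.normal(size=n)    # hypothetical outcome

X = np.column_stack([np.ones(n), dummy])

# Weighted least squares with weights equal to population: solve (X'WX) b = X'Wy.
W = np.diag(pop)
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Ordinary least squares after scaling every variable, including the
# constant, by the square root of population.
root = np.sqrt(pop)
X_scaled = X * root[:, None]
y_scaled = y * root
b_ols, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)

print("coefficients agree:", np.allclose(b_wls, b_ols))

# The scaled dummy is no longer 0/1: where the law is in effect it equals
# sqrt(population), so its jump is larger in large states.
on = dummy == 1
print("range of scaled dummy where law is on:",
      X_scaled[on, 1].min(), X_scaled[on, 1].max())
```

The second printout shows the range of the transformed dummy across units, which is the source of the outsized influence of large states described above.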
We illustrate these issues with Monte Carlo simulations, similar to those in Table 6, weighting by state population. We begin by varying the within-state variation of the dependent variable by dividing by state population, using population to the powers of 0, 0.25, 0.50, 0.75, and 1.00 in separate simulations. This produces more variation in small states, which is typically the case for per capita variables. We then vary the regression weight by the same five powers of population, leading to 25 combinations. As seen in Table 8, the resulting critical values for the dummy coefficients decrease when the dependent variable is divided by larger powers of population, that is, when larger states have less influence. The major feature of Table 8 is that standard error bias increases with greater regression weight. The critical value is 2.43 when the dependent variable is not divided by population and the regression is weighted by population.^{Footnote 26}
The simulations in Table 8 are only illustrations and do not apply to any particular empirical study. Nevertheless, researchers should be cautious when interpreting t-ratios in a regression weighted by population. A placebo law analysis might be useful in such cases. The problems are less likely to arise when the crime rate is based on few crimes in small states, such as homicides, than when the crimes are plentiful, such as larcenies.
Summary
We have discussed several sources of standard error bias in crime studies using fixed-effects panel data models. The first, serial correlation, used to be a serious problem, but most researchers now control it by clustering. The second is nonstationary variables in regressions estimated with levels. Researchers usually avoid such regressions and, instead, difference the data to avoid biasing standard errors if the data might be nonstationary. An exception is difference-in-differences analysis, where researchers tend to ignore unit root issues altogether. It is difficult to determine whether variables are stationary or nonstationary and, thus, whether differencing is necessary. The available stationarity tests are often weak and difficult to interpret, such that the researcher has only a rough estimate of the number of nonstationary state series and, thus, the extent of any unit root problems.
Our results provide a way to avoid this difficulty. Monte Carlo simulations show that, with clustering, nonstationarity causes very little standard error bias, using either continuous or dummy independent variables. We suggest that a conservative t-ratio critical value be 2.15 for the 0.05 significance level.
Still, we must be leery of pitfalls when clustering. Our Monte Carlo simulations show that standard error bias is large when the number of units is small, noticeable when the independent variable has small autocorrelation coefficients in a few states and is nonstationary elsewhere, and severe when studying a small number of policy changes. Bias can also occur when weighting regressions.
These simulations show the likelihood of bias in specific circumstances, but they do not show the extent in any particular empirical research. For example, the simulations do not include logged or differenced variables. In one area, however, we present a procedure that does give the extent of bias in empirical regressions: the placebo law procedure in DD regressions. If we want to study the effect of a policy implemented in one or a few states, we can use the placebo law procedure to estimate the appropriate critical values. We illustrate the procedure with three case studies, describe the procedure in the “Appendix”, and have placed sample programs in the online appendix.
Notes
 1.
The main alternative TSCS design is the random effects model. Random effects parameter estimates are biased and inconsistent in the presence of unobserved heterogeneity (Wooldridge 2002). The results here are not directly applicable to random effects regressions.
 2.
 3.
We are concerned here with issues that affect only standard errors in TSCS studies, and not issues causing biased coefficients and, thus, usually t-ratios as well. The major cause of the latter is missing variable bias. A widely discussed example is spatial dependency, where variables are related to their counterparts in other units.
 4.
It should be noted that addressing autocorrelation by clustering does not fix problems associated with omitted variables if the autocorrelation is caused by “misspecified dynamics,” e.g., not including lags in a dynamic regression model.
 5.
Differencing stationary data typically generates negative serial correlation in the residuals. This will cause the t-ratios to be underestimated, but that can be corrected by clustering in panels or using Newey and West (1987) standard errors in single time series.
 6.
Aside from the case where T = 2, in which the FE and the first-difference estimates are identical.
 7.
 8.
 9.
https://bjs.gov/ucrdata. These data are corrections of the original data published by the Federal Bureau of Investigation. The Bureau of Justice Statistics has not revised earlier data.
 10.
 11.
We also used two older tests, the ADF test and the Phillips and Perron (1988) test. They found more state series to be stationary, but never a majority, especially for rape, robbery, and assault.
 12.
Similarly, Spelman (2017) found that police is nonstationary in the great majority of cities.
 13.
We include the latter because the lagged dependent variable was often used to mitigate autocorrelation and is still often used as an important control variable. See the “Appendix” concerning the possibility that the lagged dependent variable can bias coefficients.
 14.
The simulations were conducted without year dummies, an important element of the fixed effects model, because in the program the year effects are random. The results in Table 5 are the same when year effects are included. We explored using block bootstrapping in the simulations, and the critical values are approximately two to five percent larger, even if rho is set at zero. In practice, bootstrapping is probably not useful for crime regressions because we find that it usually fails when year effects are entered.
 15.
We could not use only 1 or 2 states because the variance matrix was nonsymmetric or highly singular.
 16.
In separate simulations, we found that the standard error biases are very similar when all variables are stationary or when only 20 years are used. The biases are slightly less throughout when adding a lagged dependent variable to the clustered regression. The biases for OLS, with and without a lagged dependent variable, are slightly higher than in Table 5 when only a few states are included. Also, the standard error biases for OLS are approximately a third less when variables are stationary or when 20 years are used. Simulations with block bootstrapping produced critical values similar to those in Fig. 1.
 17.
The general results apply when the nonstationary variables are substituted with variables that are stationary with large autocorrelation coefficients. The results are similar when the dependent variable, rather than the independent variable, has heterogeneous auto correlation coefficients. The standard error bias is greatly reduced by entering a lagged dependent variable.
 18.
This simulation assumes 50 states and 50 years, and it assumes that the dependent variable is nonstationary in all states. The results appear to be robust to other situations. They are very similar with 20 years and with stationary dependent variables. Critical values for OLS are quite uniform.
 19.
MacKinnon and Webb (2017) show that wild bootstrap-t methods do not work in this context.
 20.
The authors later published corrections to these results that did not change the overall conclusion or the order of magnitude of the estimated effects (Webster et al. 2014b). We thank Daniel Webster for supplying the data used in their paper. We believe that the authors can be faulted for not showing OLS results, where presumably the standard error is larger than with clustering, which should not be the case.
 21.
We use 999 observations instead of 1000 so that there is no need to average between two observations to get the 975th observation.
 22.
The 3.05 critical value for the dummy compares to the critical value of 2.72 for eight dummies in Fig. 3. The program, data, and log files for these examples are in the online appendix.
 23.
We thank John Donohue for supplying the data.
 24.
The 2.73 figure is the same as the critical value for eight state dummies in Fig. 3. Other examples of dummies or trends in a small number of states, with clustering, are Kovandzic et al (2004) and Crifasi et al (2015). Many earlier studies used single state variables with OLS, where the standard errors are biased due to autocorrelation, as discussed above.
 25.
In the past, the major reason for weighting was to mitigate heteroskedasticity, but this correction is now routine using robust regression procedures, which are automatically included when clustering standard errors in Stata. The results of the simulation here are very similar when heteroskedasticity is created by introducing the dissimilar state variation into the error term.
 26.
Table 8 is based on simulations using 30 years. The results vary little with the number of years, the number of nonstationary dependent variables, and whether a lagged dependent variable is entered. The results are also the same when dummies are entered for all states.
 27.
References
Abadie A, Diamond A, Hainmueller J (2010) Synthetic control methods for comparative case studies: estimating the effect of California’s tobacco control program. J Am Stat Assoc 105:493–505
Abhay A, Donohue JJ, Zhang A (2014) The impact of right to carry laws and the NRC Report: the latest lessons for the empirical evaluation of law and policy. NBER Working Paper No. 18294 http://www.nber.org/papers/w18294
Arellano M (1987) Computing robust standard errors for within-groups estimators. Oxf Bull Econ Stat 49(4):431–434
Bertrand M, Duflo E, Mullainathan S (2002) How much should we trust differences-in-differences estimates? NBER Working Paper No. 8841 http://www.nber.org/papers/w8841
Bertrand M, Duflo E, Mullainathan S (2004) How much should we trust differences-in-differences estimates? Q J Econ 119(1):249–275
Bound J, Jaeger DA, Baker RM (1995) Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Stat Assoc 90:443–450
Breitung J (2000) The local power of some unit root tests for panel data. In: Baltagi BH (ed) Advances in econometrics, volume 15: nonstationary panels, panel cointegration, and dynamic panels. JAI Press, Amsterdam, pp 161–178
Breitung J, Pesaran MH (2008) Unit roots and cointegration in panels. In: Matyas L, Sevestre P (eds) The economics of panel data. Springer, Berlin, pp 279–302
Breusch TS (1978) Testing for autocorrelation in dynamic linear models. Aust Econ Pap 17:334–355
Cameron AC, Miller DL (2015) A practitioner’s guide to cluster-robust inference. J Hum Resour 50(2):317–372
Chalfin A, Haviland AM, Raphael S (2012) What do panel studies tell us about a deterrent effect of capital punishment? A critique of the literature. J Quant Criminol 29:5–43
Conley TG, Taber CR (2011) Inference with “difference in differences” with a small number of policy changes. Rev Econ Stat 93:113–125
Crifasi CK, Meyers JS, Vernick JS, Webster DW (2015) Effects of changes in permit-to-purchase handgun laws in Connecticut and Missouri on suicide rates. Prev Med 79:43–49
Dickey DA, Fuller WA (1979) Distribution of the estimators for autoregressive time series with a unit root. J Am Stat Assoc 74:427–431
Donohue JJ, Wolfers J (2006) Uses and abuses of statistical evidence in the death penalty debate. Stanford L Rev 58:791–845
Elliott G, Rothenberg TJ, Stock JH (1996) Efficient tests for an autoregressive unit root. Econometrica 64:813–831
Glaeser EL, Sacerdote B, Scheinkman JA (1996) Crime and social interactions. Q J Econ 3:507–548
Godfrey LG (1978) Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables. Econometrica 46:1293–1301
Granger CWJ, Newbold P (1974) Spurious regressions in econometrics. J Econom 2:111–120
Hadri K (2000) Testing for stationarity in heterogeneous panel. Econom J 3:148–161
Hamilton JD (1994) Time series analysis. Princeton University Press, Princeton
Hansen CB (2007) Asymptotic properties of a robust variance matrix estimator for panel data when T is large. J Econom 141(2):597–620
Harris RDF, Tzavalis E (1999) Inference for unit roots in dynamic panels where the time dimension is fixed. J Econom 91:201–226
Helland E, Tabarrok A (2004) Using placebo laws to test “more guns, less crime”. Adv Econ Anal Pol 4(1):1–7
Hlouskova J, Wagner M (2006) The performance of panel unit root and stationarity tests: results from a large scale simulation study. Econ Rev 25:85–116
Im KS, Pesaran MH, Shin Y (2003) Testing for unit roots in heterogeneous panels. J Econom 115:53–74
Judson RA, Owen AL (1999) Estimating dynamic panel data models: a guide for macroeconomists. Econ Lett 65:9–15
Kim JH, Choi I (2017) Unit roots in economic and financial time series: a re-evaluation at the decision-based significance levels. Econometrics 5(3):41
Kovandzic TV, Sloan JS, Vieraitis LM (2004) “Striking out” as crime reduction policy: the impact of “three strikes” laws on crime rates in U.S. cities. Justice Q 21:207–239
Kwiatkowski D, Phillips PCB, Schmidt P, Shin Y (1992) Testing the null hypothesis of stationarity against the alternative of a unit root. Econom J 54:91–115
Levin A, Lin CF, Chu SJ (2002) Unit roots in panel data: asymptotic and finite-sample properties. J Econom 108:1–24
MacKinnon JG, Webb MD (2017) Wild bootstrap inference for wildly different cluster sizes. J Appl Econom 32(1):233–254
Maddala GS, Wu S (1999) A comparative study of unit root tests with panel data and a new simple test. Oxf Bull Econ Stat 61:631–652
Marvell TB (2010) Prison population and crime. In: Benson BL, Zimmerman PR (eds) Handbook on the economics of crime. Edward Elgar, Northampton, pp 145–183
Moody CE (2016) Fixed-effects panel data models: to cluster or not to cluster. SSRN: https://ssrn.com/abstract=2840273
Newey WK, West KD (1987) A simple, positive semidefinite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55:703–708
Nickell S (1981) Biases in dynamic models with fixed effects. Econometrica 49:1417–1426
Phillips PCB, Perron P (1988) Testing for a unit root in time series regression. Biometrika 75:335–346
Rosenbaum PR (1996) Observational studies and nonrandomized experiments. In: Ghosh S, Rao CR (eds) Handbook of statistics, vol 13, pp 181–197
Spelman W (2008) Specifying the relationship between prison and crime. J Quant Criminol 24:149–178
Spelman W (2017) The murder mystery: police effectiveness and homicide. J Quant Criminol 33:859–886
Stock JH, Watson MW (2003) Introduction to econometrics. Pearson Education, Boston
Webster D, Crifasi CK, Vernick JS (2014a) Effects of the repeal of Missouri’s handgun purchaser licensing law on homicides. J Urban Health 91(2):293–302
Webster D, Crifasi CK, Vernick JS (2014b) Erratum to: Effects of the repeal of Missouri’s handgun purchaser licensing law on homicides. J Urban Health 91(3):598–601
Wooldridge JM (2002) Econometric analysis of cross section and panel data. MIT Press, Cambridge
Wooldridge JM (2016) Introductory econometrics: a modern approach, 6th edn. Cengage Learning, Boston
Acknowledgements
We thank the editor and three anonymous referees for constructive comments on an earlier draft.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Appendix: Fixed Effects Regression and Monte Carlo
The fixed effects (FE) regression model has the advantage of being unbiased in the presence of unobserved heterogeneity. That is, if each state has long-run, permanent features that are correlated both with the dependent variable and the independent variables in the model, then any regression procedure, such as the random effects model or the pooled ordinary least squares model, that uses variation across states will be biased and inconsistent. This is very likely to be the case in criminology since Massachusetts, for example, is permanently different from Louisiana because of history, culture, climate, and a number of other dimensions. The same can be said for Arizona and Vermont, Hawaii and any other state.
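The FE model has the form^{Footnote 27}
$$y_{it} = \alpha_{i} + \gamma_{t} + \sum\limits_{k = 1}^{K} {\beta_{k} X_{k,it} } + u_{it} \tag{6}$$
where i = 1, …, N, t = 1, …, T, X_{k,it} is the value of the kth regressor for state i in year t, α_{i} are state-specific fixed effects, and γ_{t} are year-specific fixed effects. This model requires four assumptions (assuming one regressor, x_{it}, for simplicity):
$$E(u_{it} \mid x_{i1} , \ldots ,x_{iT} ,\alpha_{i} ) = 0$$
that is, the conditional expected value of the error term is zero, given the value of the regressor(s);
$$(x_{i1} , \ldots ,x_{iT} ,u_{i1} , \ldots ,u_{iT} ),\;i = 1, \ldots ,N,\;{\text{are i.i.d. draws from their joint distribution}}$$
that is, the variable(s) over all the years in one state are distributed identically to, but independently of, the same variable(s) over the same time span in other states;
$$(x_{it} ,u_{it} )$$
have nonzero finite fourth moments; this assumption is important for asymptotic results because it limits the probability of observing extreme values of the regressor(s) or the errors;
$${\text{cov}}(u_{it} ,u_{is} \mid x_{it} ,x_{is} ,\alpha_{i} ) = 0\;{\text{for}}\;t \ne s$$
the errors are uncorrelated over time, conditional on the regressor(s).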
With these assumptions we can estimate the fixed-effects model, generating unbiased and consistent estimates. (The FE model is not efficient because it ignores cross-section variation.) The model is estimated by applying ordinary least squares to the demeaned variable(s). Define the state means \(\bar{y}_{i} = (1/T)\sum\nolimits_{t = 1}^{T} {y_{it} } ,\;\bar{x}_{i} = (1/T)\sum\nolimits_{t = 1}^{T} {x_{it} } ,\;\bar{u}_{i} = (1/T)\sum\nolimits_{t = 1}^{T} {u_{it} }\). Suppressing the year effects to keep the notation simple, the cross-section equation is
$$\bar{y}_{i} = \alpha_{i} + \beta \bar{x}_{i} + \bar{u}_{i}$$
Subtracting the cross-section equation from (6), still assuming only one regressor, yields
$$y_{it} - \bar{y}_{i} = \beta (x_{it} - \bar{x}_{i} ) + (u_{it} - \bar{u}_{i} )$$
The fixed effects have been “swept out” or “absorbed.” We can write this equation as
$$\tilde{y}_{it} = \beta \tilde{x}_{it} + \tilde{u}_{it} \tag{13}$$
where a tilde denotes a deviation from the state mean. The fixed-effects estimator is just OLS applied to (13). This estimator is also known as the “within” estimator because it only uses variation within each state.
Identical results can be achieved by regressing y on x and a set of dummy variables, one for each state such that the dummy for state i is one if state = state i, otherwise zero. This is the “least squares dummy variable” or LSDV model.
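The equivalence of the within estimator and the LSDV model can be verified on simulated data. The sketch below uses a made-up balanced panel with one regressor; the sample sizes, the true coefficient of 2.0, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 15
state = np.repeat(np.arange(N), T)                 # state index for each observation
alpha = rng.uniform(0, 100, size=N)                # state fixed effects
x = rng.normal(size=N * T) + 0.05 * alpha[state]   # regressor correlated with the effects
y = alpha[state] + 2.0 * x + rng.normal(size=N * T)

def demean_by_state(v, groups, n_groups):
    """Subtract each state's mean from its own observations."""
    means = np.bincount(groups, weights=v, minlength=n_groups) \
        / np.bincount(groups, minlength=n_groups)
    return v - means[groups]

# Within estimator: OLS on the demeaned data (no constant needed).
x_tilde = demean_by_state(x, state, N)
y_tilde = demean_by_state(y, state, N)
beta_within = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)

# LSDV: regress y on x and one dummy per state.
D = (state[:, None] == np.arange(N)[None, :]).astype(float)
Z = np.column_stack([x, D])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_lsdv = coef[0]

print(beta_within, beta_lsdv)
```

The two slope estimates agree up to floating-point error, while the LSDV regression additionally returns the estimated state effects.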
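Generalizing to multiple regression, define the matrix of observations on the demeaned regressors for state i as \(\tilde{X}_{i}\) and the vector of demeaned observations on the dependent variable as \(\tilde{y}_{i}\), so that
$$\hat{\beta } = \left( {\sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \tilde{X}_{i} } } \right)^{ - 1} \sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \tilde{y}_{i} }$$
The asymptotic variance–covariance matrix is
$$AVAR(\hat{\beta }) = \hat{\sigma }_{u}^{2} \left( {\sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \tilde{X}_{i} } } \right)^{ - 1}$$
where \(\hat{\sigma }_{u}^{2} = \sum\nolimits_{i = 1}^{N} {\sum\nolimits_{t = 1}^{T} {\hat{u}_{it}^{2} } } /(N(T - 1) - K)\) is a consistent estimator of the error variance and
$$\hat{u}_{it} = \tilde{y}_{it} - \tilde{X}_{it} \hat{\beta }$$
with \(\tilde{X}_{it}\) the row of \(\tilde{X}_{i}\) for year t. The square roots of the principal diagonal of the AVAR matrix are the standard errors.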
Clustered Standard Errors
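The clustered asymptotic variance–covariance matrix (Arellano 1987) is a modified sandwich estimator (White 1984, Chapter 6):
$$AVAR_{cluster} (\hat{\beta }) = \left( {\sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \tilde{X}_{i} } } \right)^{ - 1} \left( {\sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \hat{u}_{i} \hat{u}_{i}^{\prime } \tilde{X}_{i} } } \right)\left( {\sum\limits_{i = 1}^{N} {\tilde{X}_{i}^{\prime } \tilde{X}_{i} } } \right)^{ - 1}$$
where \(\hat{u}_{i}\) is the T × 1 vector of residuals for state i. The “meat” of the sandwich contains the estimated covariances among the error terms. The residuals are “clustered” in the sense that only covariances within state i are used; covariances across states are ignored. The formula therefore corrects for heteroskedasticity (using the squared terms) and for autocorrelation (using the remaining terms).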
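To make the difference between the conventional and clustered formulas concrete, the following sketch computes both for a single regressor on simulated data with strongly autocorrelated regressor and errors. It is an illustration under stated assumptions (AR(1) coefficient of 0.9, 50 states, 30 years, a true coefficient of zero) and omits the small-sample adjustments that statistical packages typically apply.

```python
import numpy as np

def ar1_panel(n_units, n_periods, rho, rng):
    """Simulate independent AR(1) series, one row per unit."""
    shocks = rng.normal(size=(n_units, n_periods))
    out = np.zeros_like(shocks)
    out[:, 0] = shocks[:, 0]
    for t in range(1, n_periods):
        out[:, t] = rho * out[:, t - 1] + shocks[:, t]
    return out

def fe_standard_errors(y, x):
    """Within estimator for one regressor; y and x are (N, T) arrays."""
    n_units, n_periods = y.shape
    y_t = y - y.mean(axis=1, keepdims=True)      # demean by state
    x_t = x - x.mean(axis=1, keepdims=True)
    sxx = (x_t ** 2).sum()                       # the "bread" is 1 / sxx
    beta = (x_t * y_t).sum() / sxx
    u = y_t - beta * x_t
    sigma2 = (u ** 2).sum() / (n_units * (n_periods - 1) - 1)
    se_conventional = np.sqrt(sigma2 / sxx)
    scores = (x_t * u).sum(axis=1)               # X_i' u_i for each state
    se_clustered = np.sqrt((scores ** 2).sum()) / sxx   # sandwich: meat / sxx^2
    return beta, se_conventional, se_clustered

rng = np.random.default_rng(0)
x = ar1_panel(50, 30, 0.9, rng)
u = ar1_panel(50, 30, 0.9, rng)
y = rng.uniform(0, 100, size=(50, 1)) + 0.0 * x + u   # fixed effects plus AR(1) error
beta, se_conv, se_clu = fe_standard_errors(y, x)
print(beta, se_conv, se_clu)
```

Squaring the per-state sum of products keeps all within-state cross terms, which is why the clustered standard error picks up the serial correlation that the conventional formula ignores.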
Bias with Lagged Dependent Variables
A problem particular to TSCS regression is potential bias when a lagged dependent variable is included in the list of regressors. This variable is correlated with the error term, which biases its coefficient (Nickell 1981). The order of the bias is 1/T, and the bias declines relatively rapidly with more years. Judson and Owen (1999) show that, for T of 30 or more, fixed effects TSCS performs as well as or better than other methods, such as the generalized method of moments. For T = 20 the bias is roughly 20% for the coefficient on the lagged dependent variable. The impact on the lagged dependent variable affects the coefficients and standard errors on other independent variables if these are correlated with the lagged dependent variable. In practice, the researcher does not know the direction and extent of this bias. We do not encounter the Nickell problem in our simulations because we set the coefficient on “x” to be zero, thereby dropping it from the regression and assuring that its coefficient remains unbiased.
DF-GLS Unit Root Test
Elliott et al. (1996) propose the following two-step procedure.
If the data has a trend:
1. Detrend by estimating the following regression using OLS.
$$(y_{t} - ay_{t - 1} ) = \alpha (1 - a) + \delta (t - a(t - 1)) + v_{t}$$
ERS do Monte Carlo experiments to determine the optimal value of a:
$$a = 1 - \frac{13.5}{T}$$
Note that \(a \to 1\) as T goes to infinity (local to unity).
2. Compute the detrended series \(e_{t} = y_{t} - \hat{\alpha } - \hat{\delta }t\) and estimate the usual ADF test equation.
$$\Delta e_{t} = (\rho - 1)e_{t - 1} + \sum\limits_{j = 1}^{p} {\gamma_{j} } \Delta e_{t - j} + v_{t}$$
Use the modified AIC to choose the lag length p. Test using the standard t-ratio on the lagged level, taking the critical values from the ERS tables.
If the data does not have a trend:
1. Estimate the constant-only model (demeaned, not detrended).
$$(y_{t} - ay_{t - 1} ) = \alpha (1 - a) + v_{t} \;{\text{where}}\;a = 1 - \frac{7}{T}$$
2. Compute the demeaned series \(\tilde{y}_{t} = y_{t} - \hat{\alpha }\) and estimate the same ADF test.
$$\Delta \tilde{y}_{t} = (\rho - 1)\tilde{y}_{t - 1} + \sum\limits_{j = 1}^{p} {\gamma_{j} } \Delta \tilde{y}_{t - j} + v_{t}$$
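The two steps can be sketched for a single series as follows. This is a minimal illustration, not the routine used for the results reported in the paper: the function name `dfgls_t` is hypothetical, the lag length p is fixed by the user rather than chosen by the modified AIC, the first observation is left untransformed in the quasi-differencing step, and the series passed to the ADF regression is the original series minus the fitted constant (and trend). No critical values are included.

```python
import numpy as np

def dfgls_t(y, p=1, trend=True):
    """Return the DF-GLS t-ratio on the lagged level for a single series."""
    y = np.asarray(y, dtype=float)
    T = y.size
    a = 1.0 - (13.5 if trend else 7.0) / T          # local-to-unity parameter
    if trend:
        z = np.column_stack([np.ones(T), np.arange(1, T + 1)])
    else:
        z = np.ones((T, 1))

    # Step 1: quasi-difference y and the deterministic terms, then run OLS.
    y_q = np.r_[y[0], y[1:] - a * y[:-1]]
    z_q = np.vstack([z[:1], z[1:] - a * z[:-1]])
    b, *_ = np.linalg.lstsq(z_q, y_q, rcond=None)
    e = y - z @ b                                   # detrended (or demeaned) series

    # Step 2: ADF regression without deterministic terms,
    # delta e_t = (rho - 1) e_{t-1} + sum_j gamma_j delta e_{t-j} + v_t.
    de = np.diff(e)                                 # de[i] = e[i+1] - e[i]
    rows = np.arange(p, de.size)
    lhs = de[rows]
    cols = [e[rows]]                                # lagged level e_{t-1}
    for j in range(1, p + 1):
        cols.append(de[rows - j])                   # lagged differences
    R = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(R, lhs, rcond=None)
    resid = lhs - R @ coef
    s2 = resid @ resid / (lhs.size - R.shape[1])
    cov = s2 * np.linalg.inv(R.T @ R)
    return coef[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)                  # clearly stationary
random_walk = np.cumsum(rng.normal(size=500))       # unit root
print(dfgls_t(white_noise, p=1, trend=False),
      dfgls_t(random_walk, p=1, trend=False))
```

A large negative t-ratio, as for the white-noise series, is evidence against a unit root; the statistic must be compared with the appropriate tabulated critical values rather than with the usual normal values.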
Monte Carlo Program
The model is generated as follows. The researcher specifies the number of states with nonstationary dependent variables, n_{y}, and the number of states with nonstationary independent variables, n_{x}. In each case, the remaining state series are stationary with an autocorrelation coefficient randomly chosen with values between 0.80 and 0.99.
The state fixed effects (\(a_{i}\),\(b_{i}\)) are drawn from a uniform distribution with range [0, 100] and \(\sigma_{\varepsilon }^{2}\) = \(\sigma_{\nu }^{2}\) = 1. The estimated fixed-effects model is
We ran 10,000 regressions of this model for each variation of years and numbers of nonstationary dependent and independent variables. The results are presented in Table A in the online appendix, along with the Stata do-files. We also prepared a lengthy table with critical values for clustered regressions comparable to Table A, but giving the critical values from 0.10 to 0.01 as well as 0.005 (Table B).
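A stripped-down version of such a simulation can be written in Python. The sketch below is not the Stata program in the online appendix. It assumes that both variables are independent random walks in every state (so that n_{y} and n_{x} equal the number of states), omits year effects, uses one regressor, and runs far fewer replications than the 10,000 reported above. Because the true coefficient is zero, the 95th percentile of the absolute t-ratios estimates the two-sided 0.05 critical value.

```python
import numpy as np

def simulate_critical_values(n_states=20, n_years=20, reps=300, seed=0):
    """Return simulated 0.05 two-sided critical values (unclustered, clustered)."""
    rng = np.random.default_rng(seed)
    t_ols = np.empty(reps)
    t_clu = np.empty(reps)
    dof = n_states * (n_years - 1) - 1
    for r in range(reps):
        a = rng.uniform(0, 100, size=(n_states, 1))          # fixed effects for y
        b = rng.uniform(0, 100, size=(n_states, 1))          # fixed effects for x
        y = a + np.cumsum(rng.normal(size=(n_states, n_years)), axis=1)
        x = b + np.cumsum(rng.normal(size=(n_states, n_years)), axis=1)

        y_t = y - y.mean(axis=1, keepdims=True)              # within transformation
        x_t = x - x.mean(axis=1, keepdims=True)
        sxx = (x_t ** 2).sum()
        beta = (x_t * y_t).sum() / sxx
        u = y_t - beta * x_t

        se_ols = np.sqrt((u ** 2).sum() / dof / sxx)         # conventional formula
        scores = (x_t * u).sum(axis=1)                       # X_i' u_i by state
        se_clu = np.sqrt((scores ** 2).sum()) / sxx          # clustered formula

        t_ols[r] = beta / se_ols
        t_clu[r] = beta / se_clu
    return (np.quantile(np.abs(t_ols), 0.95),
            np.quantile(np.abs(t_clu), 0.95))

crit_ols, crit_clu = simulate_critical_values()
print(crit_ols, crit_clu)
```

A simulated critical value above the conventional 1.96 indicates that nominal t-ratios overstate significance; the gap between the two numbers shows how much of the bias clustering removes in this particular setup.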
Cite this article
Moody, C.E., Marvell, T.B. Clustering and Standard Error Bias in Fixed Effects Panel Data Regressions. J Quant Criminol 36, 347–369 (2020). https://doi.org/10.1007/s10940-018-9383-z
Keywords
 Panel data regression
 - Autocorrelation
 Nonstationarity
 Clustered standard errors
 Small N
 - Difference-in-differences
 Weighted regressions