I don’t often post about econometric topics, as I am not an econometrician, but as an applied microeconomist, I have been reading a lot about the difference-in-difference revolution and the new approaches taken to deal with staggered adoption. As the best way to learn something is to teach it, I thought I’d write a blog post explaining my understanding of the problem and the various solutions available. As I am not an econometrician, I will be mainly explaining this through intuition and will add links to the various sources for those interested in a deeper dive on the technical details.
The Problem
For a long time, empiricists would use the two-way fixed effects estimator to estimate the effect of a treatment in a panel dataset. This specification is an extension of the simple difference-in-difference framework to multiple treatment and control groups.
In its simplest form, a difference-in-difference study compares a treatment group (a country, state, firm etc) with a control group, taking the difference across entity (treatment versus control) and across time (before intervention and after intervention). By combining these differences, we can isolate the effect of a treatment, so long as we believe parallel trends holds. Obviously, if we just take the difference across time for the treated group, then we may miss the fact that the broader environment has changed (e.g. the treatment was useless and the control group experienced the same change which we are assigning to treatment). This means that there needs to be an assumption that the treatment and control groups are somewhat similar and, in the absence of treatment, would follow a similar trend (the parallel trends assumption: this can be visually assessed by seeing if the pre-treatment trends follow a similar pattern in both the control and treatment groups, although there is no way to definitively confirm it). We also require a no anticipation assumption, such that the treatment group does not change behaviour before treatment actually begins, in anticipation.
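In symbols, the two-group, two-period comparison described above can be written as follows. The notation (a bar for group means, subscripts for group and period) is my own shorthand rather than anything taken from the papers cited here:

```latex
\hat{\delta}_{DiD}
  = \left( \bar{Y}_{\text{treat},\,\text{post}} - \bar{Y}_{\text{treat},\,\text{pre}} \right)
  - \left( \bar{Y}_{\text{control},\,\text{post}} - \bar{Y}_{\text{control},\,\text{pre}} \right)
```

Under parallel trends, the second bracket stands in for the change the treated group would have experienced without treatment, which is why it is subtracted off.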
We can extend the difference-in-difference framework to a larger panel dataset where we have multiple units and time periods. For instance, we might have data for all EU Member States across a 20-year time horizon and wish to study the effect of a particular policy which was implemented in certain Member States at a given time. This framework becomes a two-way fixed effects (TWFE) study, as we estimate a simple linear regression using OLS after controlling for unit-fixed effects and time-fixed effects.
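Written out, a static TWFE specification typically looks like the following, where the symbols are my own labels: the unit fixed effect is \alpha_i, the time fixed effect is \lambda_t, and D_{it} is a dummy equal to one when unit i is treated in period t:

```latex
Y_{it} = \alpha_i + \lambda_t + \beta D_{it} + \varepsilon_{it}
```

The coefficient \beta is the single TWFE estimate that the rest of this post is concerned with.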
This is all fine and well if the treatment across units occurs in the same period. For example, a bunch of Member States implement a National Minimum Wage program at the same time. Then the TWFE estimator provides an unbiased estimate of the average treatment effect on the treated (ATT), which is what we want (so long as the parallel trends and no anticipation assumptions hold). The problem occurs if the treatment is staggered and the treatment effects are heterogeneous across either time or units. This means that the treatment occurs at different time periods and the effect of the treatment is not the same in all time periods and for all units. In this case, the TWFE estimator is no longer a sensibly weighted average of unit-level treatment effects (some weights can be negative) and it is no longer an unbiased estimate of the ATT.
Fundamentally, the TWFE is making both clean comparisons (between treated and not-yet-treated, or never-treated, groups) as well as forbidden comparisons (between units which are both already treated but at different times) (Goodman-Bacon, 2021). These forbidden comparisons are a problem with staggered adoption and heterogeneous treatment effects, as they can give some treatment effects a negative weight, which can result in the TWFE estimate having the opposite sign to all of the individual-level treatment effects (de Chaisemartin and D’Haultfoeuille, 2020). Consequently, the TWFE estimator may not just be biased but it might be telling us that the treatment effect is the complete opposite to what it actually is!
Note that this negative weighting arises because the early-treated group, when used as a control, has already been treated, so the change in its own treatment effect over time gets differenced out by the difference-in-difference estimator, resulting in negative weights (de Chaisemartin and D’Haultfoeuille, 2020).
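To make the negative weighting concrete, here is a minimal sketch in Python of a toy design with two cohorts and three periods. The cohort labels, treatment effects and fixed effects are all made-up numbers of my own. The calculation relies on the standard result that, in a balanced panel, the TWFE coefficient equals the regression of the outcome on the treatment dummy after removing unit and time means from the dummy.

```python
from fractions import Fraction as F

# Hypothetical toy design: the late cohort is treated in period 3 only,
# the early cohort in periods 2 and 3.
D = {"late": [0, 0, 1], "early": [0, 1, 1]}
# Made-up treatment effects: all positive, growing with time since treatment.
tau = {"late": [0, 0, 1], "early": [0, 1, 4]}
unit_fe = {"late": 10, "early": 20}
time_fe = [0, 3, 5]
T = 3

def double_demean(X):
    """Remove unit means and time means from a balanced panel {unit: [x_t]}."""
    units = list(X)
    unit_mean = {u: F(sum(X[u]), T) for u in units}
    time_mean = [F(sum(X[u][t] for u in units), len(units)) for t in range(T)]
    grand = F(sum(sum(X[u]) for u in units), len(units) * T)
    return {u: [X[u][t] - unit_mean[u] - time_mean[t] + grand for t in range(T)]
            for u in units}

# Observed outcome: fixed effects plus the treatment effect where treated.
Y = {u: [unit_fe[u] + time_fe[t] + tau[u][t] * D[u][t] for t in range(T)] for u in D}

D_res = double_demean(D)
denom = sum(d * d for u in D for d in D_res[u])

# TWFE coefficient on the treatment dummy.
beta_twfe = sum(D_res[u][t] * Y[u][t] for u in D for t in range(T)) / denom

# Implied weight on each treated cell's own treatment effect (periods 1-indexed).
weights = {(u, t + 1): D_res[u][t] / denom
           for u in D for t in range(T) if D[u][t] == 1}

print("TWFE coefficient:", beta_twfe)
for cell, w in weights.items():
    print(cell, "weight:", w)
```

In this toy design every treatment effect is positive, yet the early cohort's period-3 cell receives a weight of -1/2 and the TWFE coefficient comes out at -1/2.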
In other words, the TWFE estimate is a weighted average of comparisons of: treated units with never-treated units (good comparison); treated units with not-yet-treated units (good comparison so long as there is no anticipation effect); and later-treated units with early-treated units (bad/forbidden comparison when there are heterogeneous treatment effects).
We can visualise these forbidden comparisons by imagining that an early-treated group is used as a control for a later-treated group, and consider heterogeneity in treatment effects across time since treatment. The early-treated group is not a good control group though, as its own treatment is on a different trajectory (given the amount of treatment time exposure) compared with the later-treated group. We therefore don’t want to be making such comparisons, and this is essentially what the solutions (later on) try to avoid.
Thus far, we have discussed only static set-ups, where we wish to find an ATT estimate. Dynamic set-ups, or event studies, are also possible, but run into similar problems. In these set-ups, we have a similar formulation as the TWFE above but instead of having a single TWFE estimate, we have dummy variables for all time periods relative to treatment, so that we can see how treatment impacts our outcome variable over time. This formulation yields an unbiased estimate when there is heterogeneity in time since treatment (unlike above), but not when there is heterogeneity in cohorts, where a cohort is a group of units receiving treatment in the same period (Sun and Abraham, 2021).
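A typical dynamic specification can be sketched as below. The notation is mine: G_i is the period in which unit i is first treated, and the period just before treatment (k = -1) is omitted as the reference category, which is a common convention rather than a requirement.

```latex
Y_{it} = \alpha_i + \lambda_t
       + \sum_{k \neq -1} \beta_k \, \mathbf{1}\{ t - G_i = k \}
       + \varepsilon_{it}
```

Each \beta_k is read as the effect k periods away from treatment, relative to the omitted period, and never-treated units have all of the relative-time dummies set to zero.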
Before we get to the solution, we need to be able to diagnose the problem. As mentioned, the problem only arises if we have staggered adoption and heterogeneous treatment effects. Therefore, we want to be able to check for these heterogeneous treatment effects and can use a number of diagnostic tests to see whether we are making a significant number of forbidden comparisons when calculating the TWFE estimate. Simply defaulting to a heterogeneity-robust estimator in every panel data scenario is not ideal, since these alternative estimators typically have lower statistical power (Jakiela, 2021).
The simplest check is to see whether we can reject the constant treatment effects assumption. To do this, Jakiela (2021) proposes testing whether the slope of the relationship between the residualised outcome and the residualised treatment differs between treated and control observations (i.e. in a scatter plot of these residuals, separated by treatment and control observations, the two groups should share the same linear slope if treatment effects are homogeneous).
Goodman-Bacon (2021) demonstrates that the TWFE estimator can be decomposed into a weighted average of difference-in-difference comparisons between pairs of groups and time periods where one group changed treatment status and the other did not. Some of these comparison pairs will use early-treated units in the control group (i.e. a forbidden comparison). These early-treated units can get negative weights if they are used as a control for a large number of later-treated units and “negative weights will tend to arise for early-treated units in periods late in the sample” (Roth et al., 2023). We can use the Goodman-Bacon decomposition to report the weight placed on each two-group, two-period estimate that makes up the overall TWFE estimate. This then allows us to evaluate how much weight is being placed on forbidden comparisons between already-treated units. In Stata, this can be achieved by using the bacondecomp command or the ddtiming command. The latter command shows the average TWFE estimate and the TWFE weight for the various comparison groups: (i) treated versus never treated, (ii) treated versus not yet treated (called “Earlier T vs. Later C” in the resulting output), and (iii) later-treated versus early-treated (this is the forbidden comparison and is called “Later T vs. Earlier C” in the resulting output).
De Chaisemartin and D’Haultfoeuille (2020) similarly suggest diagnosing the problem by investigating what fraction of group-time ATTs receive negative weights and the degree of heterogeneity in treatment effects that would be necessary for the estimated treatment effect to have the wrong sign. In particular, they suggest the ratio of the absolute value of the expectation of the TWFE estimator to the standard deviation of the weights. If the ratio is close to zero, then even a small amount of treatment heterogeneity can lead to the TWFE estimator having the opposite sign to the individual underlying ATTs, whilst if the ratio is very large then it is unlikely that the TWFE estimator will be of opposite sign to the underlying ATTs, unless treatment heterogeneity is “implausibly” large (de Chaisemartin and D’Haultfoeuille, 2020). They create a Stata command, twowayfeweights, which shows the weights attached to the TWFE estimator (using the option path(“…”) to save the weights to a new .dta file), a summary of the number of positive and negative weights, and the sum of the positive and negative weights, respectively. The output also reports the ratio described above.
Remember, we would only consider the TWFE estimate biased if we detect that we have heterogeneous treatment effects or that our TWFE estimate is constructed from many forbidden comparisons. Then we need a solution, which we turn to next.
The Solution
As discussed so far, TWFE estimation in the presence of staggered adoption when treatment effects are heterogeneous across time or units can lead to biased estimates arising from a forbidden comparison problem of using earlier-treated units as a control group for later-treated units.
Jakiela (2021) proposes to drop the last (few) periods of data, when most groups are treated, to reduce the presence of negative weights in the TWFE estimator. Clearly, this takes advantage of the observation that negative weighting tends to occur in later years for early adopter units. Additionally, she suggests retaining only a fixed number of post-treatment years per unit, to reduce the presence of negative weights. Nonetheless, this may not completely remove negative weights, especially when there are not many never-treated units, and so a number of new estimators, robust to heterogeneous treatment effects, have been devised.
Different groups of solutions (new estimators) take different approaches: some carefully outline the control group and exclude the forbidden group from being used as a control group (we will discuss the Callaway and Sant’Anna approach in this class of solutions), whilst others use an imputation method (e.g. Gardner, 2022), fitting a TWFE regression using observations only for units and time periods that are not yet treated and then inferring the untreated potential outcome for each treated unit. Another class of solutions estimates regressions with interaction terms to capture the heterogeneity in treatment (Sun and Abraham, 2021; Wooldridge, 2021). Finally, another solution is to use a stacked regression where each treated unit is matched to not-yet-treated (or never-treated) controls and there are separate fixed effects for each set of treated units and its controls (Cengiz et al., 2019).
Generally speaking, these different solutions should arrive at similar estimates as each other, most of the time. Still speaking generally, these different solutions are typically trying to remove the forbidden comparisons and focus only on the good comparisons. They differ in their exact implementation, what this means for the underlying assumptions propping them up, and how they can handle covariates (control variables). I outline some of the solutions I have worked with below, along with the command to implement this in Stata, but note that this is not a comprehensive list and nor does it fully outline the required assumptions for the solution and the difficulties in implementing it (for this, read the help / implementation files!).
The Callaway and Sant’Anna (2021) estimator builds up the ATT from a series of clean comparisons, using either never-treated units or not-yet-treated units as the control group, thereby removing the forbidden comparison group and allowing for an unbiased estimate of the ATT. This approach avoids the negative weighting issue and makes transparent exactly which units are being used as the control group. Using the not-yet-treated units (selected as an option) increases the size of the control group and improves the efficiency of this estimator. More specifically, their approach treats each cohort separately, and estimates the average treatment effect for the group first treated in a particular period, in each time period (the group-time ATT). When covariates are included, the propensity score and/or an outcome regression is used to compare these cohorts with similar units that were never (or not yet) treated. This solution can be used in Stata with the csdid (user-written) command. In my experience, if you have a large number of treatment groups, then this can take a (very) long time in Stata and it might be best to use R (the did package). There are three different options for estimation: outcome regression, inverse probability weighting or doubly-robust. These approaches are “equivalent from the identification perspective” but not the “estimation/inference perspective” (Callaway and Sant’Anna, 2021). The choice of which method to use depends on the modelling assumptions one is willing to make and the interested reader should check out the paper for more details. Post-estimation, the individual ATTs can be aggregated along a number of different dimensions, the most obvious being estat simple.
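The group-time building block can be sketched in a few lines of Python. This is only an illustration of the idea in the simplest case (never-treated controls, no covariates, the period before adoption as the base period), not the csdid or did implementation. The panel and the helper name att_gt are hypothetical, and the final aggregation, which weights each post-treatment cell by cohort size, is my reading of the "simple" aggregation.

```python
from statistics import mean

# Hypothetical toy panel: unit -> (first treated period or None, outcomes in periods 1..4).
panel = {
    "A": (2, [1.0, 4.0, 6.0, 8.0]),
    "B": (2, [2.0, 5.0, 7.0, 9.0]),
    "C": (4, [3.0, 4.0, 5.0, 8.0]),
    "D": (None, [0.0, 1.0, 2.0, 3.0]),
    "E": (None, [5.0, 6.0, 7.0, 8.0]),
}
T = 4

def att_gt(panel, g, t):
    """Group-time ATT for cohort g in period t, with never-treated controls
    and period g-1 as the base period (no covariates)."""
    def change(y):
        # Outcome in period t minus outcome in period g-1 (lists are 0-indexed).
        return y[t - 1] - y[g - 2]
    treated = [change(y) for first, y in panel.values() if first == g]
    controls = [change(y) for first, y in panel.values() if first is None]
    return mean(treated) - mean(controls)

cohorts = sorted({first for first, _ in panel.values() if first is not None})
cohort_size = {g: sum(1 for first, _ in panel.values() if first == g) for g in cohorts}

# Post-treatment (g, t) cells, then a cohort-size-weighted average across them.
cells = [(g, t) for g in cohorts for t in range(g, T + 1)]
simple_att = (sum(cohort_size[g] * att_gt(panel, g, t) for g, t in cells)
              / sum(cohort_size[g] for g, t in cells))

for g, t in cells:
    print(f"ATT(g={g}, t={t}) = {att_gt(panel, g, t)}")
print("Aggregated ATT:", simple_att)
```

In this made-up panel the cohort treated in period 2 has group-time effects of 2, 3 and 4, and the already-treated units A and B are never used as controls for unit C.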
The Wooldridge (2021) approach, also called the Mundlak TWFE estimator, takes a different approach and continues using OLS but incorporates treatment-cohort heterogeneity by including a set of interaction terms. More specifically, this approach interacts cohort effects with time-specific effects to capture the heterogeneity which was previously missing. Intuitively, incorporating these interactions ensures that treated units are only being compared to clean controls. This can be implemented in Stata using the jwdid (user-written) command, followed by use of estat and the relevant aggregation method (e.g. simple or event).
The imputation approach has been taken by Gardner (2022) and Borusyak et al. (2021), and sets up a two-step regression method: (1) estimate a model of the non-treated potential outcomes using only the never-treated (or not-yet-treated) observations; (2) extrapolate the model estimated in (1) to the treated observations, obtaining an estimate of the treatment effect for each treated observation. These estimated treatment effects are then averaged. In Stata, this can be implemented using the did_imputation command.
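The two steps can be sketched in plain Python as follows. This is a rough illustration under my own assumptions, not the did_imputation implementation: the panel is hypothetical, the first-step model contains only unit and time fixed effects, and it is fitted by alternating between the two sets of fixed effects as a simple stand-in for OLS. Every unit needs at least one untreated observation for this to run.

```python
# Hypothetical toy panel: unit -> (first treated period or None, outcomes in periods 1..4).
panel = {
    "A": (2, [1.0, 4.0, 6.0, 8.0]),
    "B": (2, [2.0, 5.0, 7.0, 9.0]),
    "C": (4, [3.0, 4.0, 5.0, 8.0]),
    "D": (None, [0.0, 1.0, 2.0, 3.0]),
    "E": (None, [5.0, 6.0, 7.0, 8.0]),
}
T = 4

def fit_untreated_model(panel, n_iter=500):
    """Step 1: fit Y = alpha_unit + lam_period using untreated observations only."""
    # Untreated observations: never-treated units, and treated units before adoption.
    obs = [(u, t) for u, (first, _) in panel.items() for t in range(1, T + 1)
           if first is None or t < first]
    alpha = {u: 0.0 for u in panel}
    lam = {t: 0.0 for t in range(1, T + 1)}
    for _ in range(n_iter):
        for u in alpha:
            resid = [panel[u][1][t - 1] - lam[t] for (v, t) in obs if v == u]
            alpha[u] = sum(resid) / len(resid)
        for t in lam:
            resid = [panel[u][1][t - 1] - alpha[u] for (u, s) in obs if s == t]
            lam[t] = sum(resid) / len(resid)
    return alpha, lam

alpha, lam = fit_untreated_model(panel)

# Step 2: impute the untreated outcome for each treated observation and
# take the gap between the observed and imputed outcome.
effects = {(u, t): y[t - 1] - (alpha[u] + lam[t])
           for u, (first, y) in panel.items() if first is not None
           for t in range(first, T + 1)}

# Average the observation-level effects to get an overall ATT.
att = sum(effects.values()) / len(effects)
print("Imputation ATT:", round(att, 4))
```

Because the individual effects are kept in the effects dictionary, they could just as easily be averaged by cohort or by time since treatment instead of overall.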
Sun and Abraham (2021) turn to an event study to avoid the issue of treatment heterogeneity, with their solution using a fully interacted regression which recovers estimates of cohort-specific ATTs, and is similar to the Wooldridge (2021) approach. In particular, the estimator uses the TWFE specification including interactions of relative time (from treatment) and cohort dummies, weighting each cohort-specific estimate by the sample share of the cohort in a given relative time period. The key difference between the Sun and Abraham and Wooldridge approaches is that the Wooldridge approach provides estimates of treatment effect parameters only for post-treatment time periods, whereas Sun and Abraham provide estimates for all time periods, including pre-treatment. This means that Wooldridge (2021) is more efficient whilst Sun and Abraham (2021) can be used to check whether pre-treatment trends are consistent with the parallel trends assumption. To implement the Sun and Abraham approach, we can use eventstudyinteract in Stata.
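As a sketch of what the interacted specification looks like, using my own notation (G_i for the period unit i is first treated, k for time relative to treatment, and k = -1 as the omitted reference period, which is an assumption on my part):

```latex
Y_{it} = \alpha_i + \lambda_t
       + \sum_{g} \sum_{k \neq -1} \delta_{g,k} \,
         \mathbf{1}\{ G_i = g \} \, \mathbf{1}\{ t - G_i = k \}
       + \varepsilon_{it},
\qquad
\hat{\beta}_k = \sum_{g} \hat{\omega}_{g,k} \, \hat{\delta}_{g,k}
```

Here \hat{\omega}_{g,k} is the sample share of cohort g among the treated cohorts observed at relative period k, and the sum over g leaves out whichever cohort serves as the control group.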
The final approach that we will discuss here (but there are many other approaches!) is the stacked regression. Again, this method ensures that there are no problematic comparisons (i.e. already-treated units are not used as a control group), and hence no bias in the estimator. It does this by restructuring the dataset to contain a series of sub-datasets, each containing an individual experiment (treatment) and excluding the problematic controls. These individual datasets are then stacked by appending them together (Freedman et al., 2023). Once this stacked dataset has been created, we have a two-group design and we can proceed using TWFE. In detail, one has to be careful to ensure balance in the number of pre- and post-periods for each sub-experiment and to weight the ATTs by the fraction of all trimmed treated units that adopt in a given period (Wing et al., 2024).
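The restructuring step can be sketched in plain Python as below. The adoption dates, window lengths and column names are all hypothetical choices of mine, and the sketch only builds the stacked dataset; the regression and the weighting discussed above would come afterwards.

```python
# Hypothetical adoption dates: unit -> first treated period (None = never treated).
adoption = {"A": 3, "B": 3, "C": 5, "D": None}
periods = range(1, 8)   # calendar periods 1..7
PRE, POST = 2, 1        # event window: 2 periods before adoption, 1 period after

def build_stack(adoption, periods, pre, post):
    """Build one sub-experiment per adoption cohort and append them together."""
    rows = []
    cohorts = sorted({a for a in adoption.values() if a is not None})
    for g in cohorts:
        window = [t for t in periods if g - pre <= t <= g + post]
        if len(window) < pre + post + 1:
            continue  # trim sub-experiments that lack a full, balanced window
        for unit, first in adoption.items():
            is_treated = first == g
            # Clean controls: never treated, or first treated after the window closes.
            is_clean_control = first is None or first > g + post
            if not (is_treated or is_clean_control):
                continue
            for t in window:
                rows.append({
                    "sub_exp": g,          # identifies the sub-experiment
                    "unit": unit,
                    "period": t,
                    "event_time": t - g,
                    "treat": int(is_treated),
                    "post": int(t >= g),
                })
    return rows

stack = build_stack(adoption, periods, PRE, POST)
print(len(stack), "rows in the stacked dataset")
```

In this example unit C serves as a clean control in the period-3 sub-experiment, whereas the already-treated units A and B are excluded from the period-5 sub-experiment.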
Further Reading
Borusyak, K., Jaravel, X. and Spiess, J., 2021. Revisiting event study designs: Robust and efficient estimation. arXiv preprint arXiv:2108.12419.
Callaway, B. and Sant’Anna, P.H., 2021. Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), pp.200-230.
Callaway, B., 2023. Difference-in-differences for policy evaluation. Handbook of Labor, Human Resources and Population Economics, pp.1-61.
Cengiz, D., Dube, A., Lindner, A. and Zipperer, B., 2019. The effect of minimum wages on low-wage jobs. The Quarterly Journal of Economics, 134(3), pp.1405-1454.
De Chaisemartin, C. and d’Haultfoeuille, X., 2020. Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), pp.2964-2996.
Freedman, S.M., Hollingsworth, A., Simon, K.I., Wing, C. and Yozwiak, M., 2023. Designing Difference in Difference Studies With Staggered Treatment Adoption: Key Concepts and Practical Guidelines (No. w31842). National Bureau of Economic Research.
Gardner, J., 2022. Two-stage differences in differences. arXiv preprint arXiv:2207.05943.
Goodman-Bacon, A., 2021. Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), pp.254-277.
Jakiela, P., 2021. Simple diagnostics for two-way fixed effects. arXiv preprint arXiv:2103.13229.
Roth, J., Sant’Anna, P.H., Bilinski, A. and Poe, J., 2023. What’s trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics.
Sun, L. and Abraham, S., 2021. Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), pp.175-199.
Wing, C., Freedman, S.M. and Hollingsworth, A., 2024. Stacked difference-in-differences (No. w32054). National Bureau of Economic Research.
Wooldridge, J.M., 2021. Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators. Available at SSRN 3906345.