Whether and when to use lagged dependent variables (LDVs) has been a long-standing question in political science (Achen 2000; Keele and Kelly 2006). Of particular concern have been the consequences of including an LDV in a model in the presence of residual autocorrelation. Because the LDV has power against error persistence, the coefficient for the LDV will generally be inflated, and the coefficients for (persistent) predictors will be deflated.
In a recent paper, Wilkins (2018) re-engages this question, suggesting that including an additional lag of the outcome and predictor, that is, an ADL(2,1) model, offers leverage against such biases and should be preferred as a more general model specification. While Wilkins (2018) correctly notes that a time-lagged error can be re-expressed as a time lag of the outcome and predictor (see Sargan 1964), we are concerned that other aspects of this discussion invite confusion.
First, such a model only suffices insofar as it is dynamically complete. That is, one first needs to ensure that any model fully characterizes the dependence in the series, imposing only those restrictions supported by the data (Hendry 1995). Second, even when the ADL(2,1) model is sufficient, the strategy offered by Wilkins (2018) assumes that the underlying data-generating process (DGP) is a first-order partial adjustment (i.e., PA[1]) process (familiarly known as the LDV model) with autocorrelation in the residuals. If the DGP is actually a more general ADL(2,1) process, with meaningful effects of $x_{t-1}$ and $y_{t-2}$, Wilkins’s approach mischaracterizes the dynamic process, inviting incorrect interpretations of the model coefficients and the long-run multiplier (LRM). Rather than proceed by assumption, we argue analysts should test whether these parameter restrictions are supported by their data.
To this end, we identify the nonlinear common factor restriction required to support Wilkins’s interpretation, suggest an associated Wald test to compare his proposed specification to alternative models, and demonstrate its efficacy via stochastic simulation. We caution researchers against privileging any single specification by default. Instead, we advocate that they undertake a general-to-specific specification search using a higher-order ADL(p,q) model, test whether lags in the structural equations are proxying for residual autocorrelation, and properly calculate quantities of interest.
1 Model Equivalence and Common Factor Restrictions
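As first illustrated by Sargan (1964), time-lagged realizations of a model’s structural terms (the outcome and its predictors) can be used to proxy for time-lagged realizations of its stochastic error process. Consider

$$y_{t} = \beta x_{t} + u_{t}, \tag{1a}$$

$$u_{t} = \rho u_{t-1} + e_{t}, \tag{1b}$$

where $y_{t}$ and $x_{t}$ are the outcome and predictor measured at time t, and $u_{t}$ is an autoregressive error term, which is a function of prior realizations (via $\rho$) and contemporaneous white noise residuals $e_{t}$. Using the familiar backshift operator L and rearranging terms, this process can be expressed as an ADL(1,1):

$$(1 - \rho L) y_{t} = (1 - \rho L) \beta x_{t} + e_{t}, \tag{1c}$$

$$y_{t} = \rho y_{t-1} + \beta x_{t} - \rho \beta x_{t-1} + e_{t}, \tag{1d}$$

$$y_{t} = \alpha y_{t-1} + \beta _{1} x_{t} + \beta _{2} x_{t-1} + e_{t}, \tag{1e}$$

where $\alpha = \rho$, $\beta _{1} = \beta$, and $\beta _{2} = -\rho \beta$.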
This demonstrates how an ADL(1,1) model captures the error persistence from a static process with autocorrelated residuals using lags of observed variables. This approach is widely known and has been discussed at length in the time-series literature (Hendry and Mizon 1978; Sargan 1964, 1980). Historically, this specification was valuable, because ADL models could be estimated using ordinary least squares.
Yet, Sargan (1980) and others cautioned that this approach has limitations. First, the equivalence is only obtained if the implied common factor restrictions of the reduced-form parameters are valid. That is, Equation (1e) can only be interpreted as a static model with autocorrelation in the residuals if $\beta _{2} = - \alpha \beta _{1}$, allowing the simplification undertaken in the step from Equation (1d) to (1e). If $x_{t-1}$ has independent effects, these restrictions are not met, and the estimator is biased (Sargan 1964). Second, even when this restriction is satisfied, the ADL(1,1) model is inefficient, since the static model has fewer parameters to be estimated (Hendry and Mizon 1978).
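In a recent piece, Wilkins (2018) uses similar reasoning to argue that an ADL(2,1) model can be used to estimate a PA(1) process with autocorrelation in the residuals, thereby resolving the issue of LDVs raised in Achen (2000). As above, this can be achieved as follows:

$$y_{t} = \alpha y_{t-1} + \beta x_{t} + u_{t}, \quad u_{t} = \rho u_{t-1} + e_{t},$$

$$(1 - \rho L) y_{t} = (1 - \rho L)(\alpha y_{t-1} + \beta x_{t}) + e_{t},$$

$$y_{t} = (\alpha + \rho ) y_{t-1} - \alpha \rho y_{t-2} + \beta x_{t} - \rho \beta x_{t-1} + e_{t},$$

$$y_{t} = \alpha _{1} y_{t-1} + \alpha _{2} y_{t-2} + \beta _{1} x_{t} + \beta _{2} x_{t-1} + e_{t}. \tag{2}$$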
From this, Wilkins (2018) argues in favor of the ADL(2,1) model as a more general approach, since, unlike the LDV model considered by Achen (2000), the ADL(2,1) model is robust to this additional error persistence.
While Wilkins’s strategy echoes that of Sargan and others, he is silent on their cautions. Most importantly, he does not discuss the common factor restriction required for this model equivalence to hold. As we detail in the Online Appendix, an ADL(2,1) process can reduce to a PA(1) process with autocorrelation in the residuals if, and only if, $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2} = 0$. When this does not hold, it implies that the second-order lag of the outcome or the first-order lag of the predictor has true, independent effects on the contemporaneous outcome. That is, they have an effect above and beyond proxying for the lagged stochastic error, so interpreting these estimates as such can lead to a mischaracterization of the dynamic process. For example, in a regression of presidential approval (y) on consumer sentiment (x), the Wilkins (2018) approach assumes that there is no lagged effect of consumer sentiment ($x_{t-1}$) and no second-order autocorrelation in approval ($y_{t-2}$), with the coefficients of these covariates understood to exclusively reflect error persistence. When these assumptions are invalid (e.g., lagged consumer sentiment affects contemporaneous approval), this interpretation of the parameters is not supported.
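A brief sketch of the algebra behind this restriction may help. It assumes the PA(1) process has parameters $\alpha$ and $\beta$ with AR(1) errors governed by $\rho$ (symbols used here for illustration), that $\beta _{1} \neq 0$, and that the ADL(2,1) coefficients are $\alpha _{1}, \alpha _{2}, \beta _{1}, \beta _{2}$. Four reduced-form coefficients are then functions of three structural parameters, which leaves exactly one restriction:

```latex
\begin{align*}
% Mapping implied by quasi-differencing a PA(1) with AR(1) errors
\alpha_{1} &= \alpha + \rho, \quad \alpha_{2} = -\alpha\rho, \quad
\beta_{1} = \beta, \quad \beta_{2} = -\rho\beta \\
% Solve the beta equations for rho, then the alpha_1 equation for alpha
\rho &= -\frac{\beta_{2}}{\beta_{1}}, \qquad
\alpha = \alpha_{1} - \rho = \alpha_{1} + \frac{\beta_{2}}{\beta_{1}} \\
% Substitute both into alpha_2 = -alpha * rho
\alpha_{2} &= \left(\alpha_{1} + \frac{\beta_{2}}{\beta_{1}}\right)\frac{\beta_{2}}{\beta_{1}} \\
% Multiply through by beta_1^2 and collect terms
0 &= \beta_{2}^{2} + \beta_{1}\beta_{2}\alpha_{1} - \alpha_{2}\beta_{1}^{2}
\end{align*}
```

The second line also gives the substitution $\alpha = \beta _{2} / \beta _{1} + \alpha _{1}$ that appears in the denominator of the reinterpreted LRM.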
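Not only would this mischaracterize specific coefficient estimates, but any marginal effects obtained from these parameters will also be incorrect. For example, the LRM for the effect of x on y for an ADL(2,1) model is

$$\text{LRM} = \frac{\beta _{1} + \beta _{2}}{1 - \alpha _{1} - \alpha _{2}}, \tag{3}$$

which Wilkins (2018) argues can recover the LRM for the PA(1) process as

$$\text{LRM} = \frac{\beta _{1}}{1 - (\beta _{2} / \beta _{1} + \alpha _{1})} = \frac{\beta }{1 - \alpha }, \tag{4}$$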
where the reduced-form coefficients from the ADL(2,1) model are substituted in for their functional relations in the PA(1) process with autocorrelation in the residuals. However, Equations (3) and (4) will only be equal if the common factor restriction, i.e., $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2} = 0$, holds. This restriction can be satisfied, given the right set of reduced-form coefficients; however, it should not be assumed. Only in stylized cases (e.g., $\alpha _{2} = 0$ and $\beta _{2} = 0$) will researchers be able to determine easily whether the restriction is satisfied; more often, it will entail complicated combinations of the coefficients (e.g., $\beta _{1} = 5, \beta _{2} = 2, \alpha _{1} = 0.5,$ and $\alpha _{2} = 0.36$). When this restriction is not satisfied, Equation (4) mischaracterizes the LRM, inaccurately reflecting both the direct effect (i.e., ${\beta }_{1} + {\beta }_{2} \neq \beta $) and the persistence (i.e., $1 - {\alpha }_{1} - {\alpha }_{2} \neq 1 - \alpha $). Using our earlier example of presidential approval, misinterpreting the effect of lagged consumer sentiment as error persistence not only distorts the interpretation of $\beta _{2}$, but also propagates through the LRM to bias our understanding of the effect of consumer sentiment on approval more generally.
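The arithmetic in this paragraph can be checked directly. The following is a minimal sketch, not the authors' replication code: the function names are hypothetical, the first parameter set is the example quoted in the text, and the second set (with $\alpha _{2} = 0.20$) is an illustrative violation chosen here.

```python
def common_factor_gap(a1, a2, b1, b2):
    """Left-hand side of the restriction b2^2 + b1*b2*a1 - a2*b1^2 = 0."""
    return b2**2 + b1 * b2 * a1 - a2 * b1**2


def lrm_adl21(a1, a2, b1, b2):
    """Equation (3): long-run multiplier of a general ADL(2,1) process."""
    return (b1 + b2) / (1.0 - a1 - a2)


def lrm_pa1_reinterpreted(a1, b1, b2):
    """Equation (4): LRM when ADL(2,1) coefficients are read as a PA(1)
    process with AR(1) errors, using alpha = b2/b1 + a1 and beta = b1."""
    return b1 / (1.0 - (b2 / b1 + a1))


# Example from the text: the restriction holds exactly.
holds = dict(a1=0.5, a2=0.36, b1=5.0, b2=2.0)
# Illustrative violation (assumed here): same values except alpha_2.
violated = dict(a1=0.5, a2=0.20, b1=5.0, b2=2.0)

for label, p in (("holds", holds), ("violated", violated)):
    gap = common_factor_gap(**p)
    eq3 = lrm_adl21(**p)
    eq4 = lrm_pa1_reinterpreted(p["a1"], p["b1"], p["b2"])
    print(f"restriction {label}: gap={gap:.3f}, Eq.(3)={eq3:.3f}, Eq.(4)={eq4:.3f}")
```

Because Equation (4) does not involve $\alpha _{2}$ at all, changing $\alpha _{2}$ moves Equation (3) while leaving Equation (4) fixed, which is why the two agree only on the restricted surface.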
As such, it is important for researchers to determine whether they have a traditional ADL(2,1) process, i.e., Scenario A, where the structural lags have independent effects, or a Wilkins (2018) ADL(2,1) process, i.e., Scenario B, where these lags are proxies for the stochastic error process. Even when Wilkins’s interpretation is correct (Scenario B), and an ADL(2,1) model can rightly be reparameterized as a PA(1) model, Wilkins’s estimation strategy is inefficient, because it estimates one more reduced-form parameter than necessary to identify the structural equations. Moreover, one loses an additional observation, since $y_{t-2}$ is used as an input. Neither of these matters asymptotically, but efficiency losses are greater in shorter series. This is especially important in time-series analysis, where sample coefficient estimates are used as inputs for additional quantities of interest (e.g., the LRM and impulse response functions). For these nonlinear combinations of coefficients, slight efficiency losses may have severe consequences.
In the next section, we use simulated data to quantify the costs of using the Wilkins (2018) strategy when its assumptions are not maintained. Given these costs, we also demonstrate in Section 2.2 how researchers can use a simple Wald test, evaluating $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2} = 0$, to distinguish between Scenarios A and B.
2 Simulations
We use simulations to evaluate the bias in the LRM under incorrect assumptions and the efficacy of our proposed Wald test. The outcome, y, is generated as

$$y_{t} = \alpha _{1} y_{t-1} + \alpha _{2} y_{t-2} + \beta _{1} x_{t} + \beta _{2} x_{t-1} + u_{t}, \tag{5}$$

where $x_{t} \sim N(0,1)$ and $u_{t} = \rho u_{t-1} + e_{t}$ with $e_{t} \sim N(0,1)$. We hold the contemporaneous effect fixed at $\beta _{1} = 5$, varying the strength of the lagged predictor via $\beta _{2} \in \{0.00, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50\}$, the lagged outcomes via $\alpha _{1} \in \{0.00, 0.20, 0.40\}$ and $\alpha _{2} \in \{0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50\}$, and the residual autocorrelation via $\rho \in \{0.00, 0.20, 0.40\}$. For each combination of parameters, we generate 1,000 simulated data sets with sample sizes of $T \in \{50, 100, 200, 1{,}000\}$.
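A data-generating process of this kind could be simulated along the following lines. This is a hedged sketch rather than the replication code: it assumes the outcome follows an ADL(2,1) recursion with AR(1) errors, as the parameter list suggests, and the function name, the zero initial values, and the `burn_in` period are choices made here for illustration.

```python
import random


def simulate_adl21(T, a1, a2, b1, b2, rho, burn_in=100, seed=None):
    """Simulate y_t = a1*y_{t-1} + a2*y_{t-2} + b1*x_t + b2*x_{t-1} + u_t,
    with u_t = rho*u_{t-1} + e_t and x_t, e_t drawn from N(0, 1).

    Assumptions (not from the text): pre-sample values are zero and the
    first `burn_in` draws are discarded to reduce start-up effects.
    """
    rng = random.Random(seed)
    n = T + burn_in
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0] * n
    u_prev = 0.0
    for t in range(n):
        u = rho * u_prev + rng.gauss(0.0, 1.0)  # AR(1) error
        y_lag1 = y[t - 1] if t >= 1 else 0.0
        y_lag2 = y[t - 2] if t >= 2 else 0.0
        x_lag1 = x[t - 1] if t >= 1 else 0.0
        y[t] = a1 * y_lag1 + a2 * y_lag2 + b1 * x[t] + b2 * x_lag1 + u
        u_prev = u
    # Keep only the last T observations.
    return x[burn_in:], y[burn_in:]


# One draw from a single cell of the design: T = 50, alpha_1 = 0.4,
# alpha_2 = 0.1, beta_1 = 5, beta_2 = 1, rho = 0.2.
x, y = simulate_adl21(T=50, a1=0.4, a2=0.1, b1=5.0, b2=1.0, rho=0.2, seed=1)
print(len(x), len(y))
```

Repeating the call 1,000 times per parameter combination, with a different seed each time, would mirror the structure of the design described above.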
2.1 LRM Bias Associated with Incorrect Specification
One traditionally uses Equation (3) to calculate the LRM from the estimates of an ADL(2,1) model. For comparison, we directly calculate the LRM interpretation proposed by Wilkins (2018) given in Equation (4), which uses the ADL(2,1) estimates to capture a PA(1) process with autocorrelation. We calculate the bias for the LRM as the difference between the true LRM, $\frac {\beta _{1} + \beta _{2}}{1 - \alpha _{1} - \alpha _{2}}$, and the LRM suggested by the Wilkins (2018) strategy, $\frac {\hat {\beta }_{1}}{1 - (\hat {\beta }_{2} / \hat {\beta }_{1} + \hat {\alpha }_{1})}$, where the coefficients from the ADL(2,1) model are substituted to represent the LRM for the PA(1) process, $\frac {\beta }{1 - \alpha }$. If the data support the Wilkins (2018) interpretation, these two will be equal to one another, as in Equation (4). If the data do not support the Wilkins (2018) interpretation, the difference between the two will reflect the extent of the bias from reinterpreting the ADL(2,1) estimates as if they had been produced by a PA(1) process with residual autocorrelation.
Because these LRMs are nonlinear in parameters, the resultant biases will also change in a nonlinear fashion. In Figure 1, we focus on the consequences of changes to $\alpha _{2}$ (x-axis), holding ${\beta }_{2}$ at fixed values (0.0 in panel 1, 1.0 in panel 2, and 2.0 in panel 3). The curves show the median bias in the LRMs (y-axis). In each panel, there is only one set of conditions where the bias is equal to zero. In the first panel, there is no bias when ${\beta }_{2} = 0$ and $\alpha _{2} = 0$. When ${\beta }_{2} \neq 0$ in the second and third panels, the LRMs are biased except where the value of $\alpha _{2}$ exactly offsets ${\beta }_{2}$: $\alpha _{2} = 0.12$ when $\beta _{2} = 1$, and $\alpha _{2} = 0.32$ when $\beta _{2} = 2$, respectively. The bias increases as $\alpha _{2}$ increases beyond these thresholds. In sum, where there is a true effect of $y_{t-2}$, the LRM proposed by Wilkins (2018) is biased. However, in applied data settings, we would not know whether this bias attenuates or inflates the LRM, because it is a nonlinear combination of several parameters.
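The zero-bias thresholds follow from solving the common factor restriction for $\alpha _{2}$, which gives $\alpha _{2} = (\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1}) / \beta _{1}^{2}$. The short sketch below evaluates that expression; the function name is hypothetical, and the value $\alpha _{1} = 0.4$ is inferred here from the arithmetic (it reproduces the quoted thresholds with $\beta _{1} = 5$) rather than stated in the text.

```python
def zero_bias_alpha2(b1, b2, a1):
    """Value of alpha_2 at which b2^2 + b1*b2*a1 - a2*b1^2 = 0,
    obtained by solving the restriction for alpha_2 (requires b1 != 0)."""
    return (b2**2 + b1 * b2 * a1) / b1**2


# beta_1 = 5 as in the simulation design; alpha_1 = 0.4 is an inferred value.
for b2 in (0.0, 1.0, 2.0):
    a2_star = zero_bias_alpha2(b1=5.0, b2=b2, a1=0.4)
    print(f"beta_2 = {b2:.1f}: zero-bias alpha_2 = {a2_star:.2f}")
```

At any other value of $\alpha _{2}$ in the grid, the restriction fails and the two LRM formulas diverge.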
We show similar results in Figure 2, where we focus on changes to $\beta _{2}$ (x-axis) while holding ${\alpha }_{2}$ at fixed values (0.0 in panel 1, 0.3 in panel 2, and 0.5 in panel 3). As before, in the first plot, the LRM bias is 0 when $\beta _{2} = 0$ and $\alpha _{2} = 0$. Equations (3) and (4) are equivalent in this condition. As the value of $\beta _{2}$ increases along the x-axis, the LRM implied by the Wilkins (2018) strategy underestimates the true value of the LRM at an increasing rate. The same pattern exists in the second ($\alpha _{2} = 0.3$) and third ($\alpha _{2} = 0.5$) plots, but the y-intercept for the bias ($\beta _{2} = 0$) is different in both cases.
The results presented in Figures 1 and 2 demonstrate the potential problems with assuming the restrictions proposed by Wilkins (2018). If the true DGP is an ADL(2,1), the LRM formula proposed by Wilkins (2018) is a biased estimator. Moreover, in applied research, the direction and magnitude of the bias are difficult to predict, because the nature of the bias depends on the values of $\alpha _{1}$, $\alpha _{2}$, $\beta _{1}$, and $\beta _{2}$. As such, researchers cannot confidently assume the effects are being under- or overestimated.
2.2 A Wald Test for the ADL(2,1) Against a PA(1) with Autocorrelation
The results presented in the last section highlight the perils associated with incorrectly calculating the LRM for an ADL(2,1) process as though it were generated by a PA(1) process. On the other hand, Wilkins (2018) demonstrates the biases risked by failing to impose these restrictions when the true DGP is a PA(1) process with autocorrelation in the residuals. In either case, proceeding purely from assumption is a risky strategy.
In Section 1, we discussed a possible test to evaluate the restrictions assumed by Wilkins (2018). This draws from a strategy outlined by Sargan (1964, 1980), which demonstrates how Wald tests can be used to compare a wide range of time-series specifications. Specifically, we test whether estimated ADL(2,1) coefficients are consistent with a PA(1) process with residual autocorrelation by testing

$$H_{0}: \beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2} = 0. \tag{6}$$
This nonlinear Wald test is $\chi ^{2}$ distributed with one degree of freedom, the number of restrictions being tested. The null hypothesis is that the ADL(2,1) is indistinguishable from a PA(1) with residual autocorrelation. The alternative hypothesis is that the data were generated by an alternative ADL(2,1) process, where $y_{t-2}$ and $x_{t-1}$ have independent effects. As such, this test enables researchers to evaluate whether their data are consistent with the interpretation suggested by Wilkins (2018) or not, avoiding the biases demonstrated above.
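One common way to compute a nonlinear Wald statistic of this kind is the delta method: evaluate the restriction at the estimates, and scale its square by the variance implied by the gradient and the coefficient covariance matrix. The sketch below takes that route as an assumption, since the text does not spell out the computation. The function names and the parameter ordering $(\alpha _{1}, \alpha _{2}, \beta _{1}, \beta _{2})$ are hypothetical, the covariance matrix is expected to come from the analyst's own ADL(2,1) estimation, and the diagonal matrix in the usage example is made up for illustration.

```python
import math


def chi2_sf_1df(w):
    """Upper-tail probability of a chi-squared(1) variable,
    using P(Z^2 > w) = erfc(sqrt(w / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(w / 2.0))


def common_factor_wald(a1, a2, b1, b2, cov):
    """Delta-method Wald test of g = b2^2 + b1*b2*a1 - a2*b1^2 = 0.

    `cov` is the 4x4 covariance matrix of the estimates, ordered as
    (a1, a2, b1, b2). Returns the Wald statistic and its p-value.
    """
    g = b2**2 + b1 * b2 * a1 - a2 * b1**2
    # Gradient of g with respect to (a1, a2, b1, b2).
    grad = [
        b1 * b2,
        -(b1**2),
        b2 * a1 - 2.0 * a2 * b1,
        2.0 * b2 + b1 * a1,
    ]
    # Quadratic form grad' * cov * grad: approximate variance of g.
    var_g = sum(grad[i] * cov[i][j] * grad[j] for i in range(4) for j in range(4))
    wald = g**2 / var_g
    return wald, chi2_sf_1df(wald)


# Illustrative inputs only: made-up estimates and a made-up diagonal covariance.
cov = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
wald, p_value = common_factor_wald(a1=0.4, a2=0.0, b1=5.0, b2=1.0, cov=cov)
print(f"Wald = {wald:.3f}, p = {p_value:.3f}")
```

A small p-value would point toward Scenario A (independent effects of the lags), while a failure to reject is consistent with Scenario B. Many statistical packages also offer a built-in nonlinear Wald test that could be used in place of this hand-rolled version.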
We demonstrate the efficacy of the proposed test using the simulations described in the previous section. The results are presented in Table 1, which has four panels, one for each of the sample sizes. Each cell in each panel gives the rejection rate for the respective combination of $\alpha _{1}$, $\alpha _{2}$, and $\beta _{2}$.
Notes: Rejection rates are the proportion of the 1,000 simulations in which the null hypothesis $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2} = 0$ is rejected. The Wald tests are $\chi ^{2}$ distributed with $q = 1$ degree of freedom.
Looking first at the rejection rates when $\alpha _{2} = \beta _{2} = 0$ (in italics), we demonstrate the size of the test. These rejection rates are, approximately, the expected 0.05, with somewhat worse performance in small samples. Demonstrating the power of the test is not straightforward, since, as noted above, increases to the individual parameters do not always increase the total of $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2}$ . Therefore, we focus on a particular case ( $\alpha _{1}=0.4$ and $\alpha _{2}=0.0$ ), where $\beta _{2}^{2} + \beta _{1}\beta _{2}\alpha _{1} - \alpha _{2}\beta _{1}^{2}$ strictly increases (0.00, 0.56, 1.25, 2.06, and 3.00) as $\beta _{2}$ increases (0.00, 0.25, 0.50, 0.75, and 1.00), thereby giving us clearer insight into the power of the test.
First, for each sample size, the power of the test is strictly increasing in the magnitude of the population parameter, that is, as we move to the right across the table in this row. In the largest sample ($T = 1,000$), for example, the rejection rates corresponding to each condition are 0.05, 0.89, 1.00, 1.00, and 1.00. Encouragingly, we see that the size of the test is exact, and the power of the test quickly increases to unity. Comparing these conditions across T demonstrates the importance of sample size. In the $T = 50$ case, for example, the rejection rates are 0.06, 0.14, 0.35, 0.69, and 0.90. This indicates, as one would expect, that the magnitudes of the parameters will need to be larger to discriminate between models when sample sizes are small.
The results presented in this section demonstrate that the test proposed by Sargan (1964) to distinguish static processes with residual autocorrelation from ADL(1,1) processes can be extended to PA(1) processes with residual autocorrelation and ADL(2,1) processes. While an ADL(2,1) model can be used to approximate a PA(1) process with serially correlated errors, one cannot assume that all ADL(2,1) models are simply capturing dynamics in the error process. Even small coefficients can produce large differences in the two models. As such, we offered a test for analysts to distinguish the two processes.
3 Discussion
Dynamic specification is critical to sound inference. However, accurately specifying model dynamics is complicated, because (a) theory is usually silent on the specific structure of long-run relationships and (b) we typically rely on data that are not collected with our specific hypotheses in mind. Given these challenges, researchers have long sought single, plug-and-play models that can be used to ensure results are not a consequence of mismodeled dynamics. These efforts are misguided, however, as there is no single best model that can be applied in all conditions. Minimally, researchers need to first consider whether their data are stationary (Webb, Linn, and Lebo 2020), and then determine whether their estimated models are balanced (Granger 1990) and dynamically complete (Hendry 1995). Only then should the specific model specification concerns discussed here be taken up.
We demonstrate that Wilkins’s (2018) ADL(2,1)-as-PA(1) with autocorrelation approach is only appropriate under a restrictive set of assumptions about the reduced-form parameters of the ADL(2,1). When these conditions are not met, this approach risks misunderstanding the dynamic process and produces biased quantities of interest, such as the LRM. To avoid this, we detail a test that can be used to determine whether the conditions assumed by Wilkins (2018) are satisfied. We note, however, that the conditions highlighted by Wilkins (2018) suggest a more parsimonious model is appropriate. In general, we argue that testing whether lagged systematic terms, as in ADL($p,q$) models, are proxying for error persistence is sound practice, as it helps to avoid overparameterized models and possible misattribution of coefficient effects. The results presented in Table 1 highlight that the test generally performs well, helping researchers to determine whether their ADL(2,1) coefficients are indistinguishable from a PA(1) process with residual autocorrelation, or indicate a more general ADL(2,1) process.
While our discussion and simulations above are limited to a specific case where the analyst is arbitrating between two well-defined DGPs, applied researchers are likely to face less clear-cut choices. Time-series analyses are bedeviled by a number of practical challenges including more complex dynamic processes, inappropriate sampling and aggregation, and under-powered tests. Despite this, the strategy we articulate here can, and should, be incorporated as part of standard practice. First, analysts should begin with a plausible general model that reflects what their theory and pretesting tell them about their data and test restrictions on this model to arrive at a dynamic specification that is simultaneously parsimonious and dynamically complete (Hendry 1995). Second, since lagged systematic terms have power against error persistence, researchers should use the test discussed above in conjunction with traditional testing-down approaches. Finally, analysts should draw complete inferences from their models by calculating the LRM and other quantities of interest (De Boef and Keele 2008). We also caution researchers against overinterpreting coefficients for direct effects of lagged covariates. As demonstrated both here and in Wilkins (2018), these terms have power against stochastic processes, which invites misinterpretation, as the coefficients may reflect systematic effects, stochastic effects, or both.
Acknowledgement
This work was supported by the HPC facilities operated by, and the staff of, the University of Kansas Center for Research Computing. Thanks to Ali Kagalwala, Guy Whitten, the anonymous reviewers, and the editor for their helpful feedback. All remaining errors are ours alone.
Data Availability Statement
The replication materials for this paper can be found at Webb and Cook (2020).
Supplementary Material
For supplementary material accompanying this paper, please visit https://dx.doi.org/10.1017/pan.2020.53.