In this post, we describe Granger causality, which helps us answer the question of whether one time series is useful for predicting another, and in some cases can be used to make stronger causal statements.
A common question one would like to ask is: does one time series cause another? For instance, does your location cause your behavior: does being in a bar that allows smoking cause you to smoke? Does the number of chickens cause the future number of eggs or vice versa?
In practice, making strong causal statements is hard, but we can more easily ask: is one time series predictive of future values of another, controlling for the latter's own lags? Granger causality is a testing framework for asking this question and, in some cases, for getting closer to answering whether one time series causes future values of another.
In this post, we go over the basic univariate testing framework, including how to choose the number of lags, and apply it to a chicken and egg dataset. We then ask what might prevent us from making a causal statement, and how we can address some of those issues.
The Basic Univariate Test
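In the basic univariate Granger causality test, we have two time series, $x_t$ and $y_t$, and we ask the question: are lags of $x_t$ predictive of $y_t$, controlling for lags of $y_t$? How can we test this? We start with the linear model
$$y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{i=1}^{p} \beta_i x_{t-i} + \epsilon_t \qquad (1)$$
where we assume that $E[\epsilon_t \mid \mathcal{F}_{t-1}] = 0$. Here $\mathcal{F}_{t-1}$ summarizes the information up to time $t-1$ of both $x$ and $y$. We then posit the null and alternative hypotheses:
- $H_0: \beta_1 = \dots = \beta_p = 0$: the lags of $x$ provide no additional predictive information about $y$ beyond the lags of $y$.
- $H_1: \beta_i \neq 0$ for some $i \in \{1, \dots, p\}$, so that at least one lag provides additional information.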
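One common way to carry out this test, sketched here under the assumption of homoskedastic errors, is an F-test that compares the model with and without the lags of the candidate predictor. The notation is introduced only for this sketch: $p$ is the number of lags, $T$ is the number of usable observations after dropping the first $p$, $\mathrm{RSS}_0$ is the residual sum of squares of the model containing only the response's own lags, and $\mathrm{RSS}_1$ is that of the model which also contains the predictor's lags.

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1) / p}{\mathrm{RSS}_1 / (T - 2p - 1)}
```

Under the null hypothesis, this statistic is compared against an F distribution with $p$ and $T - 2p - 1$ degrees of freedom, where $2p + 1$ counts the intercept and the two sets of lag coefficients. Large values are evidence against the null.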
An Application: The Chicken or the Egg?
Let’s look at an application: is the number of eggs produced predictive of the future number of chickens, controlling for the current and past number of chickens? To test this, we can load a dataset [1] that has two annual time series from 1930 to 1983: the number of chickens in the US and US egg production. We then run the grangertest function from lmtest with three lags.
library(lmtest)
data(ChickEgg)
grangertest(chicken ~ egg, order = 3, data = ChickEgg)
Granger causality test
Model 1: chicken ~ Lags(chicken, 1:3) + Lags(egg, 1:3)
Model 2: chicken ~ Lags(chicken, 1:3)
  Res.Df Df     F   Pr(>F)
1     44
2     47 -3 5.405 0.002966 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We see that the lags of egg production are jointly highly significant (F = 5.405, p = 0.002966), so we reject the null and conclude that eggs are predictive of future chickens, over and above what past chickens already tell us.
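To see what the test is doing, the same comparison can be sketched by hand as an F-test between two nested linear models. This assumes that grangertest amounts to a standard F-test between lm fits with and without the egg lags, which is consistent with the Res.Df values of 44 and 47 in the output above. The variable names (chick, egg, y_lags, x_lags) are illustrative only.

```r
library(lmtest)  # used here only for the ChickEgg data
data(ChickEgg)

p <- 3  # number of lags, matching the grangertest call above

# embed() returns a matrix whose first column is the series at time t
# and whose remaining columns are lags 1..p.
chick <- embed(as.numeric(ChickEgg[, "chicken"]), p + 1)
egg   <- embed(as.numeric(ChickEgg[, "egg"]), p + 1)

y      <- chick[, 1]                   # chickens at time t
y_lags <- chick[, -1, drop = FALSE]    # lags 1..p of chickens
x_lags <- egg[, -1, drop = FALSE]      # lags 1..p of eggs

restricted   <- lm(y ~ y_lags)           # own lags only
unrestricted <- lm(y ~ y_lags + x_lags)  # own lags plus egg lags

# F-test of the null that all egg-lag coefficients are zero
by_hand <- anova(restricted, unrestricted)
print(by_hand)
```

Writing the test out this way makes it clear that "controlling for lags" simply means the chicken lags are present in both models, so the egg lags only get credit for what they add on top.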
How do we choose the number of lags?
Choosing the number of lags involves a tradeoff between bias and power. With too few lags, the residuals can remain autocorrelated, which biases the test. With too many, we might incorrectly reject the null due to spurious correlation, and each extra lag also uses up degrees of freedom. See https://stats.stackexchange.com/questions/107954/lag-order-for-granger-causality-test for more details.
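One rough way to explore this tradeoff is to fit the full model for several candidate lag orders and compare information criteria alongside the test's p-value. This is a sketch rather than a recommended procedure: the cap of five lags and the use of AIC and BIC are assumptions made for illustration.

```r
library(lmtest)
data(ChickEgg)

max_lag <- 5  # assumed upper bound on the lag order, chosen for illustration

# Build the lag matrices once so every candidate model is fit on the same
# rows; information criteria are only comparable on a common sample.
chick <- embed(as.numeric(ChickEgg[, "chicken"]), max_lag + 1)
egg   <- embed(as.numeric(ChickEgg[, "egg"]), max_lag + 1)
y     <- chick[, 1]

lag_table <- do.call(rbind, lapply(seq_len(max_lag), function(p) {
  y_lags <- chick[, 2:(p + 1), drop = FALSE]  # lags 1..p of chickens
  x_lags <- egg[, 2:(p + 1), drop = FALSE]    # lags 1..p of eggs
  fit <- lm(y ~ y_lags + x_lags)
  # grangertest uses all rows available for this p, so its sample differs
  # slightly from the common sample used for AIC and BIC.
  gt <- grangertest(chicken ~ egg, order = p, data = ChickEgg)
  data.frame(lags = p, AIC = AIC(fit), BIC = BIC(fit),
             p_value = gt[["Pr(>F)"]][2])
}))

print(lag_table)
```

If the conclusion flips as the lag order changes, that is a sign the result is fragile and should be reported with that caveat.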
Is This Causality?
From the statistical test, can we conclude that the number of eggs causes the future number of chickens? Not yet. There are several potential issues when making causal statements:
- latent confounders: there may be some other variable, which is correlated with number of eggs, that is the true cause of the number of chickens.
- reverse causation: this is a bit subtle since the future can’t cause the past. However, the number of chickens could cause the future number of eggs, and the current chickens could be correlated with the future chickens.
- bidirectional causality: they both cause each other. In the time series setting economists call this a feedback system.
- spurious correlation: there is a correlation between the two variables, but it is coincidental.
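The latent confounder case can be made concrete with a small simulation. In this sketch, which uses made-up data and hypothetical names (z, x, y), a hidden series z drives both observed series, but it shows up in x one period before it shows up in y. There is no direct effect of x on y, yet the Granger test still finds x predictive of y.

```r
library(lmtest)
set.seed(1)

n <- 500

# Hypothetical latent driver, simulated as an AR(1) process.
z <- as.numeric(arima.sim(model = list(ar = 0.5), n = n + 1))

# x reflects z immediately; y reflects the same z one period later.
# Neither series enters the other's equation.
x <- z[2:(n + 1)] + rnorm(n)
y <- z[1:n] + rnorm(n)

# Default interface: tests whether lags of the first argument help
# predict the second, controlling for lags of the second.
confounded <- grangertest(x, y, order = 2)
print(confounded)
```

The test rejects because lagged x is a noisy reading of the hidden value that y is about to reflect, which is exactly the situation a significant Granger test cannot distinguish from true causation.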
How Can We Get Closer to Causal Statements?
In order to get closer to saying that the number of eggs causes the future number of chickens, we’d like to rule out some of the above. We go through each:
- latent confounders: intuition can help here. Is there likely to be some other variable, associated with the number of eggs, that is the true cause of the number of chickens in the next generation? Probably not. However, for other problems, such as complex diseases, this is more difficult to answer.
- reverse causation: one way to test for this is to swap the response variable and ask: is the number of chickens predictive of the future number of eggs, controlling for lags of eggs? If it isn’t, that suggests the direction of causality, if there is causality at all, is the one we originally tested.
- bidirectional causality: this is handled by the same reverse test. If both directions are significant, we are likely looking at a feedback system rather than a one-way effect.
- spurious correlation: similar to latent confounders in that intuition can help guide us. It’s pretty clear that there should be some relationship between the number of eggs and the number of chickens.
Let’s test the other direction: are chickens predictive of the future number of eggs, controlling for lags of eggs?
grangertest(egg ~ chicken, order = 3, data = ChickEgg)
Granger causality test
Model 1: egg ~ Lags(egg, 1:3) + Lags(chicken, 1:3)
Model 2: egg ~ Lags(egg, 1:3)
  Res.Df Df      F Pr(>F)
1     44
2     47 -3 0.5916 0.6238
This is not significant (p = 0.6238). Since we have now run two tests (one for each direction), we should apply a Bonferroni correction: if we would normally use a 0.05 threshold for rejecting the null, we should now use 0.025 for each test. The conclusions do not change: the p-value for eggs predicting chickens (0.002966) is still below 0.025, and the p-value for chickens predicting eggs is still above it.
We cannot make a causal statement yet, but we have at least ruled out some of the possible pitfalls. To make a causal statement, we need to rule out latent confounders. The gold standard for that is a randomized experiment in which we randomize the number of eggs and observe the effect on the number of chickens.
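The Bonferroni step can also be checked in code. Instead of halving the threshold, this sketch takes the equivalent route of multiplying each p-value by the number of tests (capped at 1) using base R's p.adjust, then comparing against the usual 0.05. The names p_raw and p_adj are illustrative.

```r
library(lmtest)
data(ChickEgg)

# Raw p-values from the two directional tests, both with three lags.
p_raw <- c(
  egg_to_chicken = grangertest(chicken ~ egg, order = 3,
                               data = ChickEgg)[["Pr(>F)"]][2],
  chicken_to_egg = grangertest(egg ~ chicken, order = 3,
                               data = ChickEgg)[["Pr(>F)"]][2]
)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_adj <- p.adjust(p_raw, method = "bonferroni")

print(rbind(raw = p_raw, adjusted = p_adj))
print(p_adj < 0.05)  # which directions remain significant at the 0.05 level
```

Adjusting the p-values and adjusting the threshold lead to the same decisions; the adjusted p-values are just easier to report.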
[1] Thurman W.N. & Fisher M.E. (1988), Chickens, Eggs, and Causality, or Which Came First?, American Journal of Agricultural Economics, 237-238.