When mere correlations are not enough: The Granger Causality test

In most data science problems, datasets contain multiple variables, and variables that are treated as independent may in fact depend on one another.

When the variables in a dataset represent observations recorded at different points in time, we call the dataset a time series.

The time interval in these datasets may be hourly, daily, weekly, monthly, quarterly, annual, etc.

One way to quantify these multivariate relationships in datasets is linear regression. A regression model might indicate a strong relationship between two or more variables even when these variables are unrelated in reality. In this case, predictions based on these relationships fail due to a lack of domain knowledge [1]. For instance, researchers might build multilinear regression models without knowing the nature of the relationship between the variables. If such a regression model produces a high R-squared value, it might further mislead the interpretation and generate poor predictions or forecasts [1].

Consider the following time-series graph: variable X has a direct influence on variable Y, but there is a lag (i.e., time difference) of 5 between X and Y, so we cannot use the correlation matrix [2].
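To see why a same-time correlation misses such a relationship, here is a minimal sketch with synthetic data. The two series, the noise level, and the sample size are assumptions made for illustration; they are not the data behind the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lag = 500, 5

# Hypothetical data: y copies x with a delay of 5 steps, plus a little noise.
x = rng.normal(size=n)
y = np.empty(n)
y[:lag] = rng.normal(size=lag)
y[lag:] = x[:-lag] + 0.1 * rng.normal(size=n - lag)

# Same-time correlation: what a plain correlation matrix would report.
corr_same_time = np.corrcoef(x, y)[0, 1]

# Correlation after aligning y[t] with x[t - 5].
corr_lagged = np.corrcoef(x[:-lag], y[lag:])[0, 1]

print(f"correlation at lag 0: {corr_same_time:.3f}")
print(f"correlation at lag {lag}: {corr_lagged:.3f}")
```

The same-time correlation stays close to zero, while the correlation at the right lag is close to one, so the dependence only shows up once the lag is taken into account.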


For example, let X be the increase in positive COVID-19 cases in a city and Y the increase in the number of people hospitalized. For better forecasting, we would like to know if there is a causal relationship between the variables X and Y [1].

To solve this issue, Prof. Clive W.J. Granger, recipient of the 2003 Nobel Prize in Economics, developed the causality concept to improve forecasting performance [4].

It is essentially an econometric hypothesis test that checks whether one variable is useful for forecasting another in multivariate time series data at a particular lag.

The requirements for performing the Granger Causality test include the following:

  • Test whether the variables are stationary. Stationarity is a prerequisite for the Granger Causality test: the data should have a constant mean, constant variance, and no seasonal component.
  • If the data is not stationary, transform it with first-order or, if needed, second-order differencing.
  • Do not proceed with the Granger Causality test if the data is still not stationary after second-order differencing.
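As a small illustration of what differencing does, the sketch below applies it to two made-up trending series (both are assumptions for illustration). It only shows the mechanics; on real data, a stationarity test such as the ADF test is still needed after each differencing step.

```python
import numpy as np

t = np.arange(20, dtype=float)

# Hypothetical non-stationary series: the mean keeps changing over time.
linear_trend = 3.0 * t + 2.0
quadratic_trend = 0.5 * t**2 + t

# First-order differencing: y[t] - y[t-1]
first_diff = np.diff(linear_trend, n=1)

# Second-order differencing: the difference of the first differences
second_diff = np.diff(quadratic_trend, n=2)

print(first_diff[:5])
print(second_diff[:5])
```

A linear trend becomes a constant after one round of differencing, and a quadratic trend after two. Each round also shortens the series by one observation.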

Performing hypothesis testing: check for the null hypothesis as follows:

  • Null Hypothesis (H0): $Y_t$ does not “Granger cause” $X_{t+1}$, i.e., $\alpha_1 = \alpha_2 = \dots = \alpha_p = 0$.
  • Alternate Hypothesis (HA): $Y_t$ does “Granger cause” $X_{t+1}$, i.e., at least one of the lags of Y is significant.

Calculate the F-statistic using the following equations:

  • $F_{p,\,n-2p-1} = \dfrac{\text{Estimate of Explained Variance}}{\text{Estimate of Unexplained Variance}}$
  • $F_{p,\,n-2p-1} = \dfrac{(SSE_{RM} - SSE_{UM})/p}{SSE_{UM}/(n-2p-1)}$

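where n is the number of observations, p is the number of lags, SSE is the sum of squared errors, and the subscripts RM and UM denote the restricted model (without the lags of Y) and the unrestricted model (with the lags of Y).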
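The F-statistic can be computed by hand with two ordinary least squares fits. The sketch below is one possible way to do it with NumPy on synthetic data. The helper names (`lagged_matrix`, `sse`, `granger_f_statistic`) are hypothetical, and n is taken here to be the number of usable observations after dropping the first p rows, which is an assumption.

```python
import numpy as np

def lagged_matrix(series, p):
    """Columns are series[t-1], ..., series[t-p] for t = p .. len(series)-1."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def sse(design, target):
    """Sum of squared errors of an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)

def granger_f_statistic(x, y, p):
    """F-statistic for H0: the lags of y do not help predict x."""
    target = x[p:]
    n = len(target)  # usable observations (assumption, see lead-in)
    const = np.ones((n, 1))
    restricted = np.hstack([const, lagged_matrix(x, p)])          # lags of x only
    unrestricted = np.hstack([restricted, lagged_matrix(y, p)])   # plus lags of y
    sse_rm = sse(restricted, target)
    sse_um = sse(unrestricted, target)
    return ((sse_rm - sse_um) / p) / (sse_um / (n - 2 * p - 1))

# Synthetic example: x follows y with a delay of one step.
rng = np.random.default_rng(1)
n_obs = 300
y = rng.normal(size=n_obs)
x = np.zeros(n_obs)
x[1:] = 0.8 * y[:-1] + 0.5 * rng.normal(size=n_obs - 1)

f_y_to_x = granger_f_statistic(x, y, p=2)
f_x_to_y = granger_f_statistic(y, x, p=2)
print(f"F (y -> x): {f_y_to_x:.2f}")
print(f"F (x -> y): {f_x_to_y:.2f}")
```

A large F-statistic means that adding the lags of the other variable reduces the error substantially. The p-value is then read from the F distribution with p and n-2p-1 degrees of freedom.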

If the p-value is less than the significance level (0.05) for at least one of the lags, reject the null hypothesis.

Once all the requirements are met, perform the test in both directions, Xt → Yt and Yt → Xt. Try different lags (p). The optimal lag can be determined using the Akaike Information Criterion (AIC) [1].
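One way to compare lags with AIC is to fit the regression for each candidate p on the same sample and keep the lag with the lowest score. The sketch below assumes a Gaussian AIC computed from the sum of squared errors, a hypothetical helper `aic_for_lag`, and synthetic data in which x follows y with a delay of two steps.

```python
import numpy as np

def aic_for_lag(x, y, p, max_lag):
    """Gaussian AIC (up to a constant) of x regressed on p lags of x and y."""
    n = len(x)
    target = x[max_lag:]  # same sample for every p, so the scores are comparable
    cols = [np.ones(n - max_lag)]
    for k in range(1, p + 1):
        cols.append(x[max_lag - k : n - k])
        cols.append(y[max_lag - k : n - k])
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sse = float(resid @ resid)
    m = len(target)
    return m * np.log(sse / m) + 2 * design.shape[1]

# Synthetic example: x follows y with a delay of two steps.
rng = np.random.default_rng(7)
n_obs = 300
y = rng.normal(size=n_obs)
x = np.zeros(n_obs)
x[2:] = 0.9 * y[:-2] + 0.3 * rng.normal(size=n_obs - 2)

max_lag = 4
aic = {p: aic_for_lag(x, y, p, max_lag) for p in range(1, max_lag + 1)}
best_lag = min(aic, key=aic.get)
print(aic)
print("lag with the lowest AIC:", best_lag)
```

A lag that is too short misses the relationship entirely and scores much worse, while extra lags are penalized by the 2k term for each added parameter.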

Now that you know some background on this statistical causality test, we will show you how to perform the Granger Causality test in Python using the yearly volume of toxic comments and links from Reddit (available here) [5].

In the Reddit comments dataset above, the time series consists of years (from 2005 to 2020), and the first variable consists of the total number of toxic comments, while the second variable consists of the total number of links in comments.

First, we import the required libraries:

#Import the required libraries 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd


Then, we read the downloaded dataset (keep it in the same path as your script) as follows:

#Read the data and print the contents of the file
print('Redditor comments:')
print("=============")
df = pd.read_csv("commentCounts.csv") 
print(df)


The first requirement for the Granger Causality test is that the dataset is stationary. To check this, we can run an Augmented Dickey-Fuller test (ADF test) on each variable and inspect its test statistic and p-value:

from statsmodels.tsa.stattools import adfuller
result = adfuller(df['TotalLinks'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")
result = adfuller(df['TotalToxic'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

The result here shows that the links series (TotalLinks) is not stationary. Thus, we have to make the dataset stationary by performing first-order differencing as follows:

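# First-order differencing; the first row becomes NaN and is dropped
df_transformed = df.diff().dropna()
# Drop the first row of the original data so both frames cover the same years
df = df.iloc[1:]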

By repeating the ADF test, we can check if the transformed dataset is stationary or not:

result = adfuller(df_transformed['TotalLinks'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")
result = adfuller(df_transformed['TotalToxic'])
print(f'Test Statistics: {result[0]}')
print(f'p-value: {result[1]}')
print(f'critical_values: {result[4]}')
if result[1] > 0.05:
    print("Series is not stationary")
else:
    print("Series is stationary")

Both tests show that the variables are now stationary, which means that we can perform the Granger Causality test in both directions as follows:

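from statsmodels.tsa.stattools import grangercausalitytests
# The test checks whether the second column Granger-causes the first column.
# Does TotalToxic Granger-cause TotalLinks?
grangercausalitytests(df_transformed[['TotalLinks', 'TotalToxic']], maxlag=4)
# Does TotalLinks Granger-cause TotalToxic?
grangercausalitytests(df_transformed[['TotalToxic', 'TotalLinks']], maxlag=4)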

Suppose that we pick a lag value of 3. In that case, the test indicates Granger causality in both directions: toxic comments Granger-cause links, and links Granger-cause toxic comments. Thus, the Granger Causality test suggests that, in user-generated comments, the presence of links and the volume of toxic comments help predict each other [5].

Of course, the Granger Causality test is not suitable for every data science case, and just like any other statistical hypothesis method, it has its strengths and limitations, which we summarize as follows:

Strengths of Granger Causality test

  • Simple to compute and can be applied in many applications
  • It provides a much more rigorous rule for causation (or information flow) than simply observing a high correlation with some lag-lead relationship [3].
  • When time information is available, it characterizes the underlying spatiotemporal dynamics of variables rather than just modest correlations [3].

Limitations of Granger Causality test

  • Granger causality does not provide any insight into the mechanism behind the relationship between the variables; hence, unlike “cause and effect” analysis, it does not establish true causality [1].
  • Sampling that is too infrequent or too frequent can produce misleading test results [2]. This is particularly true in the case of the posts collection in Reddit from [5], where around 99% of posts had links, so it was not possible to perform the same Granger Causality test on posts and comments.
  • The Granger causality test cannot be performed on non-stationary data.

The entire script is available on Google Colab

Happy coding!

References:

(Blog Post Author: Hind Almerekhi)