Regression with Count Data: Poisson Regression, Overdispersion, Negative Binomial Regression, and Zero Inflation in R

In this post we describe how to do regression with count data using R.

In many applications we want to model the relationship between features and responses where the responses take count values. For instance, we may want to model the number of times an event happens in a fixed interval, given covariates. We might want to model the number of cigarettes smoked, given context, the number of calls arriving at a call center as a function of business characteristics, or the number of people arriving at a location, given characteristics of a location.

A first thought is to use linear regression naively: this poses problems including skewness, use of a continuous distribution, and interpretation. A second idea is to use a Poisson distribution to model $y|X$ , where $X$ is the design matrix of features. However, the Poisson distribution has the limitation that its variance equals its mean: fitting it often leads to overdispersion, where the observed variance is higher than that expected by the model. One can handle this in two ways: by quasi-likelihood estimation, or by modeling $y|X$ with a negative binomial distribution, which generalizes the Poisson distribution to two parameters, allowing the variance to differ from the mean. A final common problem is zero inflation, where the observed number of $0$ ‘s is higher than that expected by either a Poisson or negative binomial distribution: this can be solved via zero inflated models. In this article, we describe each modeling approach, its limitations, how we can handle those limitations, and show examples in R.

Using Linear Regression

Without Transforming Response

We can naively fit a linear regression model

(1) $\begin{align*}y&=X\beta+\epsilon\end{align*}$

here $y$ is the vector of count data observations and $X$ is a design matrix of features. The first problem with this formulation is that our true $y$ are lower bounded by $0$ , while our fitted values $\hat{y}$ are not. Thus we will have poor predictive accuracy and inferences about behavior close to $0$ . Further, when it comes to hypothesis testing, unless we have a lot of data, the skewness of $y|X$ will be problematic. This is generally a bad model, although it can be less bad in cases where all the observed $y$ are far from $0$ and the true $y|X$ has low skewness.

Taking Log of Response

We can also fit

(2) $\begin{align*}\log (y+1)&=X\beta+\epsilon\end{align*}$

Under this setting for hypothesis testing we assume (and need to test) that $y+1|X$ is log-normal. This is good in that it captures the skewness often present in count data, and $\hat{y}$ is also is now lower bounded by $0$ . However, this model also presents a few issues. Firstly, one fits a continuous distribution to discrete data, leading to questionable inference results.

Let’s look at the interpretation in the best case, where $y$ never has any $0$ values and thus we can fit $\log y=X\beta+\epsilon$ . In this case a unit change in $x_j^{(i)}$ is associated with an $\exp(\beta_j)$ change in $y_i$ , holding noise fixed, i.e. changes $y_i$ by $100(\exp(\beta_j))\%$ . This has two problems: 1) we don’t want $y_i$ to take arbitrary real values and 2) the holding noise fixed interpretation is not intuitive at all.

Poisson Regression

An alternative to using linear regression is to use Poisson regression, which assumes that $y|X\sim \textrm{Pois}(X\beta)$ . This is a generalized linear model, and the equation to represent this is

(3) $\begin{align*}\log\left(E(y|X)\right)&=X\beta\end{align*}$

The interpretation becomes a unit change in $x_j^{(i)}$ is associated with an $\exp(\beta_j)$ multiplicative change in $\mathbb{E}(y_i)$ . This treats discrete observations with a discrete distribution and is more intuitive than log-transforming the response.

Let’s fit a Poisson regression model to the warpbreaks dataset in R

pois_reg <- glm(formula = breaks ~ wool+tension, data = warpbreaks,
   family = poisson)
summary(pois_reg)


Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6871  -1.6503  -0.4269   1.1902   4.2616  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 297.37  on 53  degrees of freedom
Residual deviance: 210.39  on 50  degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4

Zero-Inflation

In practical datasets, one often sees more zeros than would be predicted by a Poisson distribution. For instance, say our $y_i$ is the number of cigarettes smoked in the first day of a quit attempt for some participant $i$ . Often this participant will come from one of two groups: those who successfully quit and those who don’t. If those who don’t have a Poisson distribution as their true distribution for number of cigarettes smoked, then the distribution across all participants will have more $0$ s than posited by a Poisson.

We can model this via a mixture model with two components, where

(4) $\begin{align*}P(y_i=0)&=\pi+(1-\pi)P(z_i=0|X^{(i)})\\P(y_i=k)&=(1-\pi)P(z_i=k|X^{(i)})\end{align*}$

here $\pi$ is the probability of being in component $1$ , which takes value $0$ with probability $1$ , and $1-\pi$ is the probability of being in component $2$ , and $z_i|X^{(i)}$ is a Poisson distributed random variable with mean $X\beta$ . In order to test whether this is a good model, we can use the hypothesis test

$H_0:\pi=0$ , that is, the component that takes $0$ with probability $1$ has $0$ weight
$H_1:\pi\neq 1$ , this component has non-zero weight.

We can use a score test here [1], which can be found in the R package countreg. Let’s perform this test on the warpbreaks dataset in R (which incidentally doesn’t have any $0$ values in the responses).

library(countreg)
zitest(pois_reg)

	Zero inflation test

data:  pois_reg
S = 1.2158e-15, p-value = 0.5

As expected, we do not have a significant p-value at any of the standard cutoffs.

Overdispersion

Another issue often found in practice is that real data has overdispersion, where the variance is higher than the mean, while the Poisson distribution has variance equal to the mean. A common technique to ‘detect’ this is via a deviance goodness of fit test. However, this only tells us how good a fit a Poisson model is, not whether overdispersion is present.

An alternative is to instead use negative binomial regression. The negative binomial distribution has an additional parameter, allowing both the mean and variance to be estimated. Since the Poisson distribution is a special case of the negative binomial and the latter has one additional parameter, we can do a $\mathcal{X}_1^2$ test where

$H_0$ : true model is Poisson regression
$H_1$ : true model is negative binomial regression

Let’s test for overdispersion in R on the warpbreaks dataset.

library(pscl)
nb_reg <- glm.nb(formula=breaks ~ wool+tension, data = warpbreaks)
odTest(nb_reg)


Likelihood ratio test of H0: Poisson, as restricted NB model:
n.b., the distribution of the test-statistic under H0 is non-standard
e.g., see help(odTest) for details/references

Critical value of test statistic at the alpha= 0.05 level: 2.7055 
Chi-Square Test Statistic =  86.2922 p-value = < 2.2e-16

And thus we reject the null hypothesis that the true model is Poisson regression in favor of negative binomial regression.

Putting it Together

If you didn’t notice, we performed two hypothesis tests here: one for a zero inflated model, and one for a negative binomial model. We thus need to adjust our p-value threshold. Instead of using some value $\alpha$ , we have to do $\alpha/n$ , where $n$ is the number of hypothesis tests we ran. This is the Bonferroni correction.

From our tests, we decided to use a negative binomial model without using a zero-inflated model. Let’s look at the results.

summary(nb_reg)

Call:
glm.nb(formula = breaks ~ wool + tension, data = warpbreaks, 
    init.theta = 9.944385436, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0144  -0.9319  -0.2240   0.5828   1.8220  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.6734     0.0979  37.520  < 2e-16 ***
woolB        -0.1862     0.1010  -1.844   0.0651 .  
tensionM     -0.2992     0.1217  -2.458   0.0140 *  
tensionH     -0.5114     0.1237  -4.133 3.58e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(9.9444) family taken to be 1)

    Null deviance: 75.464  on 53  degrees of freedom
Residual deviance: 53.723  on 50  degrees of freedom
AIC: 408.76

Number of Fisher Scoring iterations: 1


              Theta:  9.94 
          Std. Err.:  2.56 

 2 x log-likelihood:  -398.764

Note that if we consider p-values for each individual coefficient, we actually need a Bonferroni correction. We performed two hypothesis tests to arrive at our final model, and have three features and one intercept, leading to six hypothesis tests. In order to ensure the probability of no type I errors is less than $0.05$ , we look for individual p-values of $0.05/6=.008$ . Under this criteria we can consider the intercept and tensionH to be significant, but not the other features.

[1] Van den Broek, Jan. “A score test for zero inflation in a Poisson distribution.” Biometrics (1995): 738-743.