r – negative binomial glm: issues with model diagnostics when using offset term

I’m trying to model count data using a negative binomial glm with an offset in R and am having some issues with getting the model to fit properly when I use an offset term. Here is a reproducible example (in this example I happen to be using the gamlss package, but I get similar results using other packages):

library(gamlss)

#construct df
df <- structure(list(offset_var = c(1.28599085563929, 0.56011798945881, 
                                    1.40150012655384, 0.910539373132146, 1.99332290786881, 0.431352036569322, 
                                    1.9664387276399, 0.39501204122542, 0.837952086514098, 1.49309174747163, 
                                    0.340780160418832, 0.185176379335315, 1.9358463846208, 0.23261768703236, 
                                    0.766384544663353, 0.683228994093252, 2.36951602293399, 0.570222733062088, 
                                    2.035225147398, 2.66190465894079, 2.45851192989865, 2.59441609615926, 
                                    1.79085721952421, 9.10131475225225, 1.14341199635633, 1.86464965904907, 
                                    0.969684569635742, 1.48586083003075, 1.57819408350658, 0.398395739771472, 
                                    1.01779798183851, 0.348242838034098, 1.56020949397414, 1.14211413937976, 
                                    2.17372502761175, 1.15472189286642, 0.52428786649856, 0.91480376034087, 
                                    1.85426680323653, 0.663761139444733, 1.2592920835499, 0.799109081286816, 
                                    1.9473416892704, 1.32010595330908, 1.92731761020335, 1.74079702185659
), response_var = c(31, 154, 71, 40, 3, 43, 66, 180, 62, 27, 
                    81, 84, 14, 66, 35, 40, 48, 155, 23, 37, 88, 33, 8, 38, 73, 29, 
                    11, 12, 70, 74, 40, 298, 24, 35, 27, 116, 60, 83, 68, 77, 8, 
                    56, 234, 64, 71, 57), pred1 = c(0.286190036347746, -0.343709975349617, 
                                                    0.669042107677564, -0.349882890393357, 0.613485872283906, -0.53507034170555, 
                                                    0.280017121304006, -0.53507034170555, 0.601140042196427, -0.436303701005714, 
                                                    -0.0966620349820732, -0.695566132842785, 0.502242062878638, -0.423957870918234, 
                                                    0.415821252266281, -0.491859936399372, 0.477550402703679, -0.386920380655795, 
                                                    0.594967127152687, -0.559762001880509, 0.959300453351286, -0.331364145262137, 
                                                    0.687560852808784, -0.232466165944349, 0.174946226942477, -0.646182812492866, 
                                                    0.366437931916363, -0.565934916924249, 0.613485872283906, -0.436303701005714, 
                                                    0.366437931916363, -0.417784955874494, 0.656696277590085, -0.485687021355632, 
                                                    0.588662873490995, -0.430130785961974, 0.298535866435225, -0.479514106311892, 
                                                    0.471377487659939, -0.491859936399372, 0.119389991548819, -0.423957870918234, 
                                                    0.292362951391485, -0.54124325674929, 0.79250040855236, -0.473341191268152
                    ), pred3 = c(0.144609842211824, 0.144609842211824, 0.499320654740666, 
                                 0.499320654740666, -0.466506214802146, -0.466506214802146, 0.282440329365891, 
                                 0.282440329365891, 0.486145681703882, 0.486145681703882, -0.84554005447582, 
                                 -0.84554005447582, 0.124340652924466, 0.124340652924466, 0.408109302947536, 
                                 0.408109302947536, -0.662103891425194, -0.662103891425194, 0.409122762411905, 
                                 0.409122762411905, 0.531751357600445, 0.531751357600445, 0.099004166315261, 
                                 0.099004166315261, 0.422297735448688, 0.422297735448688, -0.453331241765359, 
                                 -0.453331241765359, 0.461822654559046, 0.461822654559046, 0.272305734722211, 
                                 0.272305734722211, 0.186161680250921, 0.186161680250921, -0.220235564960693, 
                                 -0.220235564960693, -1.22356043468513, -1.22356043468513, 0.183121301857816, 
                                 0.183121301857816, -0.0712570236985814, -0.0712570236985814, 
                                 0.273319194186577, 0.273319194186577, -0.865809243763182, -0.865809243763182
                    ), pred6 = c(0.688053824835414, 0.540952466204903, 0.688189445154199, 
                                 0.628399305213738, 0.780506864063453, 0.596192469087064, 0.664351792193705, 
                                 0.60163003791155, 0.725643181043038, 0.617634868802872, 0.804339479465123, 
                                 0.811784880984014, 0.847678636187369, 0.576349468957153, 0.71955464755009, 
                                 0.625702671539389, 0.363763777714112, 0.446477292582477, 0.474674252997288, 
                                 0.480035185899025, 0, 0.4937661725622, 0.903418101749097, 0.513379111687292, 
                                 0.695930304479286, 0.617335579129071, 0.759590008590574, 0.654198263624173, 
                                 0.758157124611394, 0.638351745840282, 0.749652691590418, 0.616963963666818, 
                                 0.626922408721354, 0.587580912470206, 0.349235608213707, 0.430636110291635, 
                                 1, 0.723455668440276, 0.696660949537828, 0.615287637567703, 0.51486263616898, 
                                 0.570881476208878, 0.661291018750271, 0.599650096633512, 0.317335721700075, 
                                 0.449950453216749), pred2 = c(-0.680321210960666, -0.680321210960666, 
                                                               0.558882145665921, 0.558882145665921, -0.70320931057483, -0.70320931057483, 
                                                               0.50053372153711, 0.50053372153711, 0.547164958730963, 0.547164958730963, 
                                                               0.0776857795127178, 0.0776857795127178, -0.673005886144517, -0.673005886144517, 
                                                               0.486124748414392, 0.486124748414392, 0.117508381055176, 0.117508381055176, 
                                                               0.381145087091728, 0.381145087091728, -0.555834016794938, -0.555834016794938, 
                                                               -0.76505705674773, -0.76505705674773, 0.481691218222786, 0.481691218222786, 
                                                               -0.507516454724635, -0.507516454724635, 0.524838967408949, 0.524838967408949, 
                                                               -0.523896765378978, -0.523896765378978, 0.339264060817454, 0.339264060817454, 
                                                               0.201191263421734, 0.201191263421734, -0.132891070552154, -0.132891070552154, 
                                                               0.445985466143962, 0.445985466143962, -0.692062148950222, -0.692062148950222, 
                                                               0.493883426249702, 0.493883426249702, 0.100249281380711, 0.100249281380711
                                 ), pred4 = c(-0.0609439824226133, -0.0609439824226133, 0.212599822223845, 
                                              0.212599822223845, 0.35399400122385, 0.35399400122385, 0.17893107767439, 
                                              0.17893107767439, 0.397120623433909, 0.397120623433909, -0.15784307490735, 
                                              -0.15784307490735, 0.789710871384117, 0.789710871384117, -0.642029377404221, 
                                              -0.642029377404221, -0.986917827531154, -0.986917827531154, 0.637147236566486, 
                                              0.637147236566486, -0.186827334718838, -0.186827334718838, -0.631079132675486, 
                                              -0.631079132675486, -0.0654763965499504, -0.0654763965499504, 
                                              0.137659499227558, 0.137659499227558, -0.59614801294535, -0.59614801294535, 
                                              -0.314691328777094, -0.314691328777094, -0.39609462940867, -0.39609462940867, 
                                              -0.0211897356486334, -0.0211897356486334, -0.790591014881012, 
                                              -0.790591014881012, 0.442309840597833, 0.442309840597833, 0.687979445696367, 
                                              0.687979445696367, 0.379541632269314, 0.379541632269314, 0.630483382500631, 
                                              0.630483382500631), pred5 = c(0.792887432988887, 0.792887432988887, 
                                                                            -0.398666450848724, -0.398666450848724, 0.756245610872723, 0.756245610872723, 
                                                                            0.282670104231291, 0.282670104231291, -0.132107743020526, -0.132107743020526, 
                                                                            0.496085551149254, 0.496085551149254, 0.0987836918462516, 0.0987836918462516, 
                                                                            -0.253249103383483, -0.253249103383483, -0.695753927551872, -0.695753927551872, 
                                                                            0.132880314145315, 0.132880314145315, 0.581565542594452, 0.581565542594452, 
                                                                            -0.765249231086272, -0.765249231086272, 0.365348387506002, 0.365348387506002, 
                                                                            -0.719644009235505, -0.719644009235505, 0.00177392404073831, 
                                                                            0.00177392404073831, -0.855971101527024, -0.855971101527024, 
                                                                            0.233767666307952, 0.233767666307952, 0.446673294454707, 0.446673294454707, 
                                                                            -0.515722472384031, -0.515722472384031, 0.133732386579586, 0.133732386579586, 
                                                                            0.511333497308832, 0.511333497308832, 0.125627521889691, 0.125627521889691, 
                                                                            -0.573380520827721, -0.573380520827721)), row.names = c(NA, -46L
                                                                            ), class = "data.frame")

#model with offset

model <- gamlss(response_var~pred1*pred2 + pred1*pred3 + pred1*pred4 + pred1*pred5 +  pred6 + offset(log(offset_var)), data=df, family=NBI)

This produces the following model output and diagnostic plots. As you can see, the residuals vs fitted plot in the top right indicates heteroskedasticity, with higher fitted values ​​being overpredicted by the model (if I’m interpreting correctly).

> summary(model)
******************************************************************
Family:  c("NBI", "Negative Binomial type I") 

Call:  gamlss(formula = response_var ~ pred1 * pred2 + pred1 *      pred3 + pred1 * pred4 + pred1 * pred5 + pred6 +      offset(log(offset_var)), family = NBI, data = df) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.62174    0.58613   7.885 3.50e-09 ***
pred1       -1.50606    0.30134  -4.998 1.73e-05 ***
pred2        0.14898    0.34118   0.437   0.6651    
pred3       -0.05911    0.33767  -0.175   0.8621    
pred4       -0.54450    0.39455  -1.380   0.1766    
pred5        0.03808    0.36817   0.103   0.9182    
pred6       -0.52710    0.93088  -0.566   0.5750    
pred1:pred2  1.48585    0.67606   2.198   0.0349 *  
pred1:pred3  0.23541    0.63307   0.372   0.7123    
pred1:pred4  0.33970    0.72884   0.466   0.6441    
pred1:pred5  0.03169    0.74800   0.042   0.9665    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04399    0.18768  -0.234    0.816

------------------------------------------------------------------
No. of observations in the fit:  46 
Degrees of Freedom for the fit:  12
      Residual Deg. of Freedom:  34 
                      at cycle:  3 
 
Global Deviance:     497.0233 
            AIC:     521.0233 
            SBC:     542.967 
******************************************************************

plot(model)

What confuses me is that when I do the “incorrect” thing (see here, for example) and model the rate directly (count response variable/offset variable), the issue with the residuals vs fitted plots. Note that the only change I have made to this second model is to get rid of the offset term and instead use it to transform the original response variable (a count variable) into a rate:

#model rate directly

df$rate <- round((df$response_var)/(df$offset_var), 0) #construct rate variable, rounded to zero to resemble "discrete" data and enable model fitting

model2 <- gamlss(rate~pred1*pred2 + pred1*pred3 + pred1*pred4 + pred1*pred5 +  pred6, data=df, family=NBI)

Model coefficients are very similar to first model:

> summary(model2)
******************************************************************
Family:  c("NBI", "Negative Binomial type I") 

Call:  gamlss(formula = rate ~ pred1 * pred2 + pred1 * pred3 +      pred1 * pred4 + pred1 * pred5 + pred6, family = NBI,      data = df) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.61072    0.58763   7.846 3.91e-09 ***
pred1       -1.50873    0.30023  -5.025 1.59e-05 ***
pred2        0.14653    0.34105   0.430    0.670    
pred3       -0.06176    0.33559  -0.184    0.855    
pred4       -0.53680    0.39252  -1.368    0.180    
pred5        0.05368    0.36600   0.147    0.884    
pred6       -0.50076    0.93131  -0.538    0.594    
pred1:pred2  1.47773    0.67678   2.183    0.036 *  
pred1:pred3  0.22012    0.62725   0.351    0.728    
pred1:pred4  0.35012    0.72501   0.483    0.632    
pred1:pred5  0.06869    0.74227   0.093    0.927    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04853    0.18958  -0.256      0.8

------------------------------------------------------------------
No. of observations in the fit:  46 
Degrees of Freedom for the fit:  12
      Residual Deg. of Freedom:  34 
                      at cycle:  3 
 
Global Deviance:     487.4998 
            AIC:     511.4998 
            SBC:     533.4435 
******************************************************************

But there is no longer the same issue of heteroskedasticity in the residuals vs fitted plot, or at least not nearly to the same extent:

plot(model2)

Diagnostic plot for model2

I was under the impression that modeling with an offset was mathematically the same thing as modeling a rate directly (see here), but that using an offset was the “proper” way to do it because a rate variable technically does not follow a discrete distribution like Poisson or negative binomial. So I would expect the two models to produce the same results – and yet, they don’t. In fact, it seems that directly modeling the rate produces the better model.

Can anyone tell me what might be going on here? Would I be wrong to use the model with rate as the response variable (model2)? Is there any issue with my code that might explain the difference between the two models?

Note that I’ve posted this on StackOverflow because I thought there might be a coding component to the problem, but please let me know if this question is more appropriate for CrossValidated.

Leave a Comment