Models for proportion: Binomial GLMs

Examples

  • A study produced similar symptoms by exposing volunteers’ lower bodies to negative air pressure, likewise decreasing oxygen to the brain. The data record the ages of eight volunteers and whether they showed syncopal blackout-related signs (pallor, sweating, slow heartbeat, unconsciousness) during an 18 min period. Does resistance to blackout decrease with age?
    • The explanatory variable is Age. The response variable is Signs, coded as 1 if the subject showed blackout-related signs and 0 otherwise. The response variable is binary, taking only two distinct values, and no transformation can change that.
  • Second example, the response variable is the number of damaged O-rings out of six for each of the previous 23 launches with data available, so only seven values are possible for the response. No transformation can change this.
  • Rather than linear regression, a more sensible model would be to use a binomial distribution with mean proportion \(\mu\) for modelling the proportion \(y\) of O-rings damaged out of m at various temperatures x.

\[ \log \frac{\mu}{1-\mu}=\beta_0+\beta_1x \] combining the systematic and random components, a possible model for the data is:

\[ \begin{cases} ym \sim \text{Bin}(\mu,m) \quad &(\text{random component})\\ \log \frac{\mu}{1-\mu}=\beta_0+\beta_1x \quad &(\text{systematic component}) \end{cases} \]

Binomial PMF in EDM form

the outcome of many studies is a proportion y of a total number m: the proportion of individuals having a diseasel; the proportion of voters who vote in favor of a particular election candidate; … each case, the m individuals in each group are assumed to be independent and each individual can be classified into one of two possible outcomes.

\[ \mathcal{P}(y|\mu, m)={m \choose my}\mu^{my}(1-\mu)^{m(1-y)} \]

where \(m\) is known and \(\phi=1, 0<\mu<1\). For binomial GLMs, the use of quantile residuals is strongly recommended for diagnostic analysis.

Binomial responses may be specified in the glm() formula in one of three ways:

  • The response can be supplied as the observed proportions \(y_i\), when the sample sizes \(m_i\) are supplied as the weights in the call to glm().
  • The response can be given as a two-column array, the columns giving the numbers of successes and failures respectively in each group of size \(m_i\). The prior weights weights do not need to be supplied.
  • The response can be given as a factor or as a logical. This specification is useful if the data have one row for each observation.
data(turbines)

tur.m1 <- glm( Fissures/Turbines ~ Hours, family=binomial,
               weights=Turbines, data=turbines) # 1st 
tur.m2 <- glm( cbind(Fissures, Turbines-Fissures) ~ Hours,
               family=binomial, data=turbines) # 2nd
coef(tur.m1) 
##   (Intercept)         Hours 
## -3.9235965551  0.0009992372
coef(tur.m2)
##   (Intercept)         Hours 
## -3.9235965551  0.0009992372

Median Effective Dose: ED50

Binomial GLMs are commonly used to examine the relationship between the dose d of a drug and the proportion y of insects that survive. These models are called dose-response models. ED50: the dose of poison affecting \(50\%\) of the insects. By definition, \(\mu=0.5\) at the ED50. For a binomial GLM using logit link function, \(\eta=\log(\mu/(1-\mu))=0\) when \(\mu=0.5\), where \(\eta=\beta_0+\beta_1d\) and d is the dose. Solving for dose d gives that ED50 \(=-\hat{\beta_0}/\hat{\beta_1}\).

Overdispersion

For a binomial distribution, var\([y]=\mu(1-\mu)\). The amount of variation in the data can exceed this value, called overdispersion. Consequences are standard errors by GLM are underestimated, and tests on predictors will generally be more significant leading to overly complex models.

Overdispersion can be detected by goodness-of-fit test. Lack of fit may be caused by an inadequate model. But if all possible predictors are already in the model and no outliers exist when lack of fit persists ,then overdispersion is the alternative interpretation.

Overdispersion arises in two cases: the probabilities \(\mu_i\) are not constant between observations; the \(m_i\) cases, of which observations \(y_i\) is a proportion, are not independent.

Quasi-binomial GLMs are the remedies for overdispersion.

Example: Since seedsare usually planted together in common plots, it is highly possible that they might interact or be affected by common causes; in other words we might well expect seeds to be positively correlated, leading to overdispersion.

data(germ)
gm.m1 <- glm(Germ/Total ~ Seeds * Extract, family=binomial,
             data=germ, weights=Total)
printCoefmat(coef(summary(gm.m1)))
##                           Estimate Std. Error z value Pr(>|z|)  
## (Intercept)               -0.41224    0.18418 -2.2383  0.02520 *
## SeedsOA75                 -0.14593    0.22317 -0.6539  0.51318  
## ExtractCucumber            0.54008    0.24981  2.1619  0.03062 *
## SeedsOA75:ExtractCucumber  0.77810    0.30643  2.5392  0.01111 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(gm.m1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Germ/Total
## 
## Terms added sequentially (first to last)
## 
## 
##               Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                             20     98.719             
## Seeds          1    2.544        19     96.175  0.11070    
## Extract        1   56.489        18     39.686 5.65e-14 ***
## Seeds:Extract  1    6.408        17     33.278  0.01136 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c( deviance(gm.m1), df.residual(gm.m1) )
## [1] 33.27779 17.00000
# diagnostic plot
qres <- qresid(gm.m1)
qqnorm(qres, las=1)
abline(0, 1)

# deviance residual vs fitted values
scatter.smooth( qres ~ fitted( gm.m1 ), 
                col="grey", las=1, 
                ylab="Standardized residuals", 
                xlab="Fitted values")

the residual deviance is much larger than the residual degrees of freedom; from QQ-plot no large residuals present that would suggest outliers; influential observations cannot be an issue. So overdispersion is present.

gm.od <- update(gm.m1, family=quasibinomial)
anova(gm.od, test="F")
## Analysis of Deviance Table
## 
## Model: quasibinomial, link: logit
## 
## Response: Germ/Total
## 
## Terms added sequentially (first to last)
## 
## 
##               Df Deviance Resid. Df Resid. Dev       F    Pr(>F)    
## NULL                             20     98.719                      
## Seeds          1    2.544        19     96.175  1.3665   0.25854    
## Extract        1   56.489        18     39.686 30.3407 3.839e-05 ***
## Seeds:Extract  1    6.408        17     33.278  3.4418   0.08099 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The quasi-binomial analysis of deviance suggests that only Extract is significant in the model, so germination frequency differs by root stock but not by seed type, unlike the binomial glm which showed a significant Extract by Seeds interaction.

The binomial and quasi-binomial glms give identical coefficient estimates, but the standard errors from the quasi-binomial GLM are \(\sqrt{\phi}\) times those from the binomial model:

sqrt(summary(gm.od)$dispersion)
## [1] 1.36449
beta <- coef(summary(gm.m1))[,"Estimate"]
m1.se <- coef(summary(gm.m1))[,"Std. Error"]
od.se <- coef(summary(gm.od))[,"Std. Error"]
data.frame(Estimate=beta, Binom.SE=m1.se,
           Quasi.SE=od.se, Ratio=od.se/m1.se)
##                             Estimate  Binom.SE  Quasi.SE   Ratio
## (Intercept)               -0.4122448 0.1841784 0.2513095 1.36449
## SeedsOA75                 -0.1459269 0.2231659 0.3045076 1.36449
## ExtractCucumber            0.5400782 0.2498130 0.3408672 1.36449
## SeedsOA75:ExtractCucumber  0.7781037 0.3064332 0.4181249 1.36449

No goodness-of-fit for binary data

For binary data, likelihood ratio tests and score tests should be used, making sure that p’ is much smaller than n.

case study

an experiment exposed batches of insects to various deposits of insecticides. The proportion of insects y killed after six days of exposure in each batch of size m is potentially a function of the dose of insecticide and the type of insecticide.

data(deposit)
deposit$Prop <- deposit$Killed / deposit$Number


plot( Prop ~ Deposit, type="n", las=1, ylim=c(0, 1),
      data=deposit, main="Proportion of\ninsects killed",
      xlab="Deposit (in mg)", ylab="Proportion killed")
points( Prop ~ Deposit, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ Deposit, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ Deposit, pch="C", subset=(Insecticide=="C"), data=deposit)

# Add lines for each insecticide type with different line types
lines(Prop ~ Deposit, subset=(Insecticide == "A"), data=deposit, lty=1) # solid line
lines(Prop ~ Deposit, subset=(Insecticide == "B"), data=deposit, lty=2) # dashed line
lines(Prop ~ Deposit, subset=(Insecticide == "C"), data=deposit, lty=3) # dotted line

# Optionally, add a legend to identify the lines
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"), 
       lty=c(1, 2, 3), pch=c("A", "B", "C"))

fitting a GLM model using deposit amount the type of insecticide as predictors and plot the fitted lines:

ins.m1 <- glm(Killed/Number ~ Deposit + Insecticide,
              family = binomial, weights = Number, data = deposit)

newD <- seq( min(deposit$Deposit), max(deposit$Deposit), length=100)
newProp.logA <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="A") )
newProp.logB <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="B") )
newProp.logC <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="C") )

plot( Prop ~ Deposit, type="n", las=1, ylim=c(0, 1),
      data=deposit, main="Proportion of\ninsects killed",
      xlab="Deposit (in mg)", ylab="Proportion killed")
points( Prop ~ Deposit, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ Deposit, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ Deposit, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)

### disgnostic

plot( qresid(ins.m1) ~ fitted(ins.m1), type="n", las=1, ylim=c(-3, 3),
      main="Quantile resids. plotted\nagainst fitted values",
      xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m1) ~ fitted(ins.m1), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m1) ~ fitted(ins.m1), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m1) ~ fitted(ins.m1), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))

For each insecticide, the proportions are under-estimated at the lower and higher values of deposit.

LogOdds <- with(deposit, log(Prop/(1-Prop)) )
plot( LogOdds ~ Deposit, type="n", xlab="Deposit", data=deposit,
      main="Logits plotted\nagainst Deposit", las=1)
points( LogOdds ~ Deposit, pch="A", type="b", lty=1, data=deposit, subset=(Insecticide=="A") )
points( LogOdds ~ Deposit, pch="B", type="b", lty=2, data=deposit, subset=(Insecticide=="B") )
points( LogOdds ~ Deposit, pch="C", type="b", lty=3, data=deposit, subset=(Insecticide=="C") )

Plotting the log-odds against the deposit shows the relationship is not linear on log-odds. try logarithm of dose:

deposit$logDep <- log( deposit$Deposit )
ins.m2 <- glm(Killed/Number ~ logDep + Insecticide - 1,
              family = binomial, weights = Number, data = deposit)

# proportion vs. log-logDep
newD <- seq( min(deposit$logDep), max(deposit$logDep), length=100)
newProp.logA <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="A") )
newProp.logB <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="B") )
newProp.logC <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="C") )

plot( Prop ~ logDep, type="n", las=1, ylim=c(0, 1),
      data=deposit, main="Proportion of\ninsects killed",
      xlab="logDep (in mg)", ylab="Proportion killed")
points( Prop ~ logDep, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ logDep, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ logDep, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)

# quantile residual vs. fitted values
plot( qresid(ins.m2) ~ fitted(ins.m2), type="n", las=1, ylim=c(-3, 3),
      main="Quantile resids. plotted\nagainst fitted values",
      xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m2) ~ fitted(ins.m2), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m2) ~ fitted(ins.m2), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m2) ~ fitted(ins.m2), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))

# fitted log-odds vs. log deposit
plot(LogOdds ~ logDep, type="n", las=1,
     data=deposit, main="Log-Odds of Proportion of\nInsects Killed vs Log Deposit",
     xlab="Log Deposit (in mg)", ylab="Log-Odds of Proportion Killed")
# Add points for each insecticide type
points(LogOdds ~ logDep, pch="A", col="blue", subset=(Insecticide == "A"), data=deposit)
points(LogOdds ~ logDep, pch="B", col="green", subset=(Insecticide == "B"), data=deposit)
points(LogOdds ~ logDep, pch="C", col="purple", subset=(Insecticide == "C"), data=deposit)

# Predict the log-odds using the fitted model
logDep_seq <- seq(min(deposit$logDep), max(deposit$logDep), length.out = 100)
predict_data <- data.frame(logDep = logDep_seq, Insecticide = rep("A", 100))
pred_A <- predict(ins.m2, newdata = predict_data, type = "response")

predict_data$Insecticide <- rep("B", 100)
pred_B <- predict(ins.m2, newdata = predict_data, type = "response")

predict_data$Insecticide <- rep("C", 100)
pred_C <- predict(ins.m2, newdata = predict_data, type = "response")

# Add the fitted model lines for each insecticide
lines(logDep_seq, log(pred_A / (1 - pred_A)), col="blue", lty=1) # solid line for insecticide A
lines(logDep_seq, log(pred_B / (1 - pred_B)), col="green", lty=2) # dashed line for insecticide B
lines(logDep_seq, log(pred_C / (1 - pred_C)), col="purple", lty=3) # dotted line for insecticide C
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"),
       col=c("blue", "green", "purple"), pch=c("A", "B", "C"), lty=c(1, 2, 3))

While model 2 improves on model 1, proportions are still under-estimated for all types at the lower and higher values of deposit; plotting the log-odds against the logarithm of Deposit indicates that the log-odds are not constant, but are perhaps quadratic:

ins.m3 <- glm(Killed/Number ~ poly(logDep, 2) + Insecticide,
              family = binomial, weights = Number, data = deposit)


# diagnostic plots
# proportion vs. log-logDep
newD <- seq( min(deposit$logDep), max(deposit$logDep), length=100)
newProp.logA <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="A") )
newProp.logB <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="B") )
newProp.logC <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="C") )

plot( Prop ~ logDep, type="n", las=1, ylim=c(0, 1),
      data=deposit, main="Proportion of\ninsects killed",
      xlab="logDep (in mg)", ylab="Proportion killed")
points( Prop ~ logDep, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ logDep, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ logDep, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)

# quantile residual vs. fitted values
plot( qresid(ins.m3) ~ fitted(ins.m3), type="n", las=1, ylim=c(-2, 2),
      main="Quantile resids. plotted\nagainst fitted values",
      xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m3) ~ fitted(ins.m3), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m3) ~ fitted(ins.m3), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m3) ~ fitted(ins.m3), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))

# fitted log-odds vs. log deposit
plot(LogOdds ~ logDep, type="n", las=1,
     data=deposit, main="Log-Odds of Proportion of\nInsects Killed vs Log Deposit",
     xlab="Log Deposit (in mg)", ylab="Log-Odds of Proportion Killed")
# Add points for each insecticide type
points(LogOdds ~ logDep, pch="A", col="blue", subset=(Insecticide == "A"), data=deposit)
points(LogOdds ~ logDep, pch="B", col="green", subset=(Insecticide == "B"), data=deposit)
points(LogOdds ~ logDep, pch="C", col="purple", subset=(Insecticide == "C"), data=deposit)

# Predict the log-odds using the fitted model
logDep_seq <- seq(min(deposit$logDep), max(deposit$logDep), length.out = 100)
predict_data <- data.frame(logDep = logDep_seq, Insecticide = rep("A", 100))
pred_A <- predict(ins.m3, newdata = predict_data, type = "response")

predict_data$Insecticide <- rep("B", 100)
pred_B <- predict(ins.m3, newdata = predict_data, type = "response")

predict_data$Insecticide <- rep("C", 100)
pred_C <- predict(ins.m3, newdata = predict_data, type = "response")

# Add the fitted model lines for each insecticide
lines(logDep_seq, log(pred_A / (1 - pred_A)), col="blue", lty=1) # solid line for insecticide A
lines(logDep_seq, log(pred_B / (1 - pred_B)), col="green", lty=2) # dashed line for insecticide B
lines(logDep_seq, log(pred_C / (1 - pred_C)), col="purple", lty=3) # dotted line for insecticide C
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"),
       col=c("blue", "green", "purple"), pch=c("A", "B", "C"), lty=c(1, 2, 3))

anova( ins.m2, ins.m3, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Killed/Number ~ logDep + Insecticide - 1
## Model 2: Killed/Number ~ poly(logDep, 2) + Insecticide
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1        14     23.385                        
## 2        13     15.090  1   8.2949 0.003976 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

the quadratic model is a statistically significantly improvement; no evidence exists to support overdispersion: the deviance is 15 and df is 13.

Models for counts

Example

  • quoting the number of diseased people in various cities is an unfair comparison, as some cities have a far larger population; comparing the number of people with the disease per unit population is a fairer comparison.
    • the disease rate is often more suitable for modelling than the actual number of people with the disease.
    • Poisson glms are more convenient when the populations are large and the rates are relatively small.
  • A study of the habitats of the noisy miner (a small but aggressive native Australian bird) counted the number of noisy miners y and the number of eucalypt trees x in two-hectare buloke woodland transects. Buloke woodland patches with more eucalypts tend to have more noisy miners
    • The number of noisy miners is more variable where more eucalypts are present. Between 0 and 10 eucalypts, the number of noisy miners is almost always zero; between 10 and 20 ucalypts, the number of noisy miners increases. This shows that the systematic relationship between the number of eucalypts and the number of noisy miners is not linear.
    • Between 0 and 10 eucalypts, the number of noisy miners varies little. Between 10 and 20 eucalypts, a larger amount of variation exists in the number of noisy miners. This shows that the randomness does not have constant variance.
    • the variation in the data may be modelled using a Poisson distribution.
    • A possible model for the systematic component is \(\log \mu = \beta_0+\beta_1x\), where x is the number of eucalypt trees at a given site, and μ is the expected number of noisy miners.

\[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]

introduction & model summary

Poisson distribution is used to model counts, which has PMF: \[ \mathcal{P}(y;\lambda)=\frac{\exp(-\lambda)\lambda^y}{y!} \] The unit deviance for Poisson distribution is: \[ d(y, \lambda)=2\{y\log\frac{y}{\lambda}-(y-\lambda)\} \] and the residual deviance is \[ D(y,\hat{\lambda})=\sum w_i d(y_i, \hat{\lambda}_i) \] where \(w_i\) are the prior weights.

action

model defined as \[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\log T+\beta_0+\sum_{j=1}^{p}\beta_jx_j \quad (\text{systematic component}) \end{cases} \]

data(danishlc)
danishlc$Rate <- danishlc$Cases / danishlc$Pop * 1000 # Rate per 1000
danishlc$Age <- ordered(danishlc$Age, # Ensure age-order is preserved
                        levels=c("40-54", "55-59", "60-64", "65-69", "70-74", ">74") )

danishlc$City <- abbreviate(danishlc$City, 1)  # Abbreviate city names
matplot( xtabs( Rate ~ Age+City, data=danishlc), pch=1:4, lty=1:4,
         type="b", lwd=2, col="black", axes=FALSE, ylim=c(0, 25),
         xlab="Age group", ylab="Cases/1000")
axis(side=1, at=1:6, labels=levels(danishlc$Age))
axis(side=2, las=1)
box()
legend("topleft", col="black", pch=1:4, lwd=2, lty=1:4, merge=FALSE,
       legend=c("Fredericia", "Horsens", "Kolding", "Vejle") )

dlc.m1 <- glm( Cases ~ offset( log(Pop) ) + City * Age,
               family=poisson, data=danishlc)
anova(dlc.m1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Cases
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                        23    129.908             
## City      3    3.393        20    126.515  0.33495    
## Age       5  103.068        15     23.447  < 2e-16 ***
## City:Age 15   23.447         0      0.000  0.07509 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

city seems not significant; alternatively, consider age as continuous and a quadratic term:

dlc.m2 <- update(dlc.m1, . ~ offset(log(Pop)) + Age )

danishlc$AgeNum <- rep( c(40, 55, 60, 65, 70, 75), 4)
dlc.m3 <- update(dlc.m1, . ~ offset( log(Pop) ) + AgeNum)
dlc.m4 <- update(dlc.m3, . ~ offset( log(Pop) ) + poly(AgeNum, 2) )

anova( dlc.m3, dlc.m4, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Cases ~ AgeNum + offset(log(Pop))
## Model 2: Cases ~ poly(AgeNum, 2) + offset(log(Pop))
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        22     48.968                          
## 2        21     32.500  1   16.468 4.948e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

now compare 4 models using AIC/BIC:

c( "With interaction"=AIC(dlc.m1), "Without interaction"=AIC(dlc.m2),
"Age (numerical)"=AIC(dlc.m3), "Age (numerical; quadratic)"=AIC(dlc.m4) )
##           With interaction        Without interaction 
##                   144.3880                   136.6946 
##            Age (numerical) Age (numerical; quadratic) 
##                   149.3556                   134.8876
# goodness-of-fit test
D.m2 <- deviance(dlc.m2)
df.m2 <- df.residual( dlc.m2 )
c( Dev=D.m2, df=df.m2, P = pchisq( D.m2, df.m2, lower = FALSE) )
##         Dev          df           P 
## 28.30652745 18.00000000  0.05754114
D.m4 <- deviance(dlc.m4)
df.m4 <- df.residual( dlc.m4 )
c( Dev=D.m4, df=df.m4, P = pchisq( D.m4, df.m4, lower = FALSE) )
##         Dev          df           P 
## 32.49959158 21.00000000  0.05206888

Both models are reasonably adequate. now consider diagnostic plots:

# factor age model
scatter.smooth( rstandard(dlc.m2) ~ sqrt(fitted(dlc.m2)),
                ylab="Standardized residuals", xlab="Sqrt(Fitted values)",
                main="Factor age model", las=1 )

plot( cooks.distance(dlc.m2), type="h", las=1, main="Cook's D", ylab="Cook's distance, D")

qqnorm( qr<-qresid(dlc.m2), las=1 ); abline(0, 1)

# quadratic age model
scatter.smooth( rstandard(dlc.m4) ~ sqrt(fitted(dlc.m4)),
                ylab="Standardized residuals", xlab="Sqrt(Fitted values)",
                main="Quadratic age model", las=1 )

plot( cooks.distance(dlc.m4), type="h", las=1, main="Cook's D", ylab="Cook's distance, D")

qqnorm( qr<-qresid(dlc.m4), las=1 ); abline(0, 1)

Two models are reasonable. But we prefer model 2 since model 4 has three influential obs compared to the rest and model 2 is simpler.

contingency tables: log-linear models

systematic component

  • two-way contingency table: factor A with I levels and factor B with J levels
    • \(y_{ij}\) refers to the observed count in row i and column j, \(i=1,2,\dots, I, j=1,2,\dots,J\)
    • let \(\mu_{ij}\) be the expected count in cell (i,j), \(\pi_{ij}\) be the expected probability that an observation is in cell (i,j), and \(\mu_{ij}=m\pi_{ij}\) where m is the total number of obs.
    • let \(m_{i \cdot}\) be the sum of counts in row i over all columns, \(m_{\cdot j}\) be the sum of counts in column j over all rows. -\(\pi_{ij}=\pi_{i \cdot}\pi_{\cdot j}\) if factors A and B are independent; then \(\mu_{ij}=m\pi_{i\cdot}\pi_{\cdot j}\)
    • taking logarithms: \(\log \mu_{ij}=\log m+\log \pi_{i \cdot} + \log \pi_{\cdot j}\)
    • the probabilities \(\pi_{i \cdot}\) depend on which row the observation is in; the probabilities \(\pi_{\cdot j}\) depend on which colum the obs. is in
  • example by simplest contingency table
    • to test for independence of row and column factors
    • a probabilistic model for the counts is needed. a complete model for the data depends on how the sample of individuals was collected.
Counts <- c(263, 258, 151, 222)
Att <- gl(2, 2, 4, labels=c("For", "Against") )
Inc <- gl(2, 1, 4, labels=c("High", "Low") )
data.frame( Counts, Att, Inc)
##   Counts     Att  Inc
## 1    263     For High
## 2    258     For  Low
## 3    151 Against High
## 4    222 Against  Low
gm.table <- xtabs( Counts ~ Att + Inc )
gm.table
##          Inc
## Att       High Low
##   For      263 258
##   Against  151 222

random component

sample scheme by presetting row/column margins

  • m obs. are randomly allocated to factor A and B as they arrive; neither row nor column total were fixed
    • no margins were fixed and no limit on how large the number of counts can be
    • if the total number of individuals observed follow a Poisson distribution and individuals are independent, then each of the counts in the contingency table follow a Poisson fdistribution

the log-likelihood function for the \(2 \times 2\) table is given as \[ \mathcal{L}(\mu;y)=\sum_{i=1}^{2}\sum_{j=1}^{2}(-\mu_{ij}+y_{ij}\log \mu_{ij}) \]

gm.1 <- glm( Counts ~ Att + Inc, family=poisson)
anova( gm.1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Counts
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     3     38.260              
## Att   1  24.6143         2     13.646 7.003e-07 ***
## Inc   1   4.8769         1      8.769   0.02722 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(gm.1)
## (Intercept)  AttAgainst      IncLow 
##   5.4859102  -0.3341716   0.1479201

the model has the systematic component \(\log \mu_{ij}=5.486-0.334x_1 + 0.1479x_2\) where \(x_1=1\) for obs located at the second row i.e., \(i=2\) and \(x_2=1\) if obs at second column i.e., \(j=2\). unlogging gets us the systematic component in their original units.

to test for the interaction between factor A and B:

gm.int <- glm( Counts ~ Att * Inc, family=poisson)
anova( gm.int, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Counts
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        3     38.260              
## Att      1  24.6143         2     13.646 7.003e-07 ***
## Inc      1   4.8769         1      8.769  0.027218 *  
## Att:Inc  1   8.7686         0      0.000  0.003065 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of deviance table shows the interaction term is necessary in the model. the data suggest an association between A and B.

  • the total number of obs. m was fixed, i.e., the grand total is fixed

instead of a Poisson distribution, a multi-nomial distribution is appropriate.

the log-likelihood function for the \(2 \times 2\) table is given as \[ \mathcal{L}(\mu;y)=\sum_{i=1}^{2}\sum_{j=1}^{2}y_{ij}\log \mu_{ij} \] Estimating \(\mu_{ij}\) by maximizing the log-likelihood for the multinomial distribution constrained by \(\sum_{i=1}^{2}\sum_{j=1}^{2}y_{ij}=m\). The Poisson model, conditioning on the grand total, is equivalent to a multinomial model.

  • row totals were fixed, and obs. were allocated to column factor within each level of row factor
    • or column totals are fixed
    • the residuals deviance are exactly the same as above, Poisson dist. is applicable here

three-way contingency tables

data(kstones)

ks.mutind <- glm( Counts ~ Size + Method + Outcome, family=poisson, data=kstones) # mutual indep.
ks.SM <- glm( Counts ~ Size * Method + Outcome, family=poisson, data=kstones )  # partial independence
ks.SO <- update(ks.SM, . ~ Size * Outcome + Method)
ks.OM <- update(ks.SM, . ~ Outcome * Method + Size)
ks.noMO <- glm( Counts ~ Size * (Method + Outcome), family=poisson, data=kstones ) # conditional indep.
ks.noOS <- update(ks.noMO, . ~ Method * (Outcome + Size) )
ks.noMS <- update(ks.noMO, . ~ Outcome * (Method + Size) )
ks.no3 <- glm( Counts ~ Size*Method*Outcome - Size:Method:Outcome, family=poisson, data=kstones ) # uniform assoc.
ks.all <- glm( Counts ~ Size * Method * Outcome, family=poisson, data=kstones ) # saturated model

# model comparison
  • Simpson’s Paradox

higher-order tables

data(dyouth)
dy.m1 <- glm( Obs ~ Age*Depression*Gender + Age*Group, data=dyouth, family=poisson)
anova(dy.m1, test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Obs
## 
## Terms added sequentially (first to last)
## 
## 
##                       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                     23     368.05              
## Age                    2   11.963        21     356.09  0.002525 ** 
## Depression             1  168.375        20     187.71 < 2.2e-16 ***
## Gender                 1   58.369        19     129.34 2.172e-14 ***
## Group                  1   69.104        18      60.24 < 2.2e-16 ***
## Age:Depression         2    3.616        16      56.62  0.163964    
## Age:Gender             2    3.631        14      52.99  0.162718    
## Depression:Gender      1    7.229        13      45.76  0.007175 ** 
## Age:Group              2   27.090        11      18.67 1.311e-06 ***
## Age:Depression:Gender  2    8.325         9      10.35  0.015571 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Males <- subset(dyouth, Gender=="M")
Females <- subset(dyouth, Gender=="F")
table.M <- prop.table( xtabs(Obs~Age+Depression, data=Males), 1)
table.F <- prop.table( xtabs(Obs~Age+Depression, data=Females), 1)
round(table.F * 100) # FEMALES
##        Depression
## Age      H  L
##   12-14 36 64
##   15-16 31 69
##   17-18 10 90
round(table.M * 100) # MALES
##        Depression
## Age      H  L
##   12-14 20 80
##   15-16 12 88
##   17-18 20 80

overdispersin for Poisson GLMs

for a Poisson dist., \(var[y]=E[y]\). in practise, however, the variance of the data exceeds \(E[y]\). this is called overdispersion. underdispersion occurs, but less commonly.

reason for occurrence, effects, and detection of overdispersion:

  • the mean retains some innate variability even after all predictors are fixed
    • clustered data
  • overdispersion may or may not affect the parameter estimates
  • standard errors \(\hat{\beta}_j\) are necessarily underestimated
    • tests on predictors generally appear to be more significant
    • confidence intervals will be narrower
  • overdispersion can be detected by a goodness-of-fit test
    • if residual deviance and Pearson goodness-of-fit statistics are much larger than the residual df., then either the fitted model is inadequate or the data are overdispersed.
    • if after fitting a saturated model and eliminating outliers, lack of fit remains meaning overdispersion is the alternative explanation.
data(pock)
plot( Count ~ jitter(log2(Dilution)), data=pock, las=1,
      xlab="Log (base 2) of dilution", ylab="Pock mark count")

mn <- with(pock, tapply(Count, log2(Dilution), mean) ) # Group means
vr <- with(pock, tapply(Count, log2(Dilution), var) ) # Group variances
plot( log(vr) ~ log(mn), las=1,
      xlab="Group mean", ylab="Group variance")

data.frame(mn, vr, ratio=vr/mn)
##        mn         vr    ratio
## 0 186.625 1781.12500 9.543871
## 1 104.700  667.34444 6.373872
## 2  50.800  360.40000 7.094488
## 3  27.100  194.98889 7.195162
## 4   9.100   17.65556 1.940171

here we have an example of clustered data. the sample variance is much larger than the mean for each group, clear evidence of overdispersion. fitting a Poisson GLM shows substantial lack of fit:

m1 <- glm( Count ~ log2(Dilution), data=pock, family=poisson )
printCoefmat(coef(summary(m1)))
##                 Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)     5.267932   0.022552 233.596 < 2.2e-16 ***
## log2(Dilution) -0.680944   0.015443 -44.093 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
X2 <- sum(residuals(m1, type="pearson")^2)
c(Df=df.residual(m1), Resid.Dev=deviance(m1), Pearson.X2=X2)
##         Df  Resid.Dev Pearson.X2 
##    46.0000   290.4387   291.5915

remedy for overdispersion: negative binomial GLMs

we can model overdispersion by assuming a second layer of variability by allowing \(E[y_i]=\lambda_i\) to be a random variable: \[ \begin{aligned} y_i|\lambda_i \sim \text{Pois}(\lambda_i), \lambda_i \sim \text{Gamma}(\mu_i, \phi) \end{aligned} \] by hierarchical mixtures of random variables, we can show that

\[ y_i \sim \text{NB}(\mu_i, k) \] where \(k=1/\phi\) and \(var[y_i]=\mu_i+\phi\mu_i^2\).

library(MASS) # Provides the function glm.nb()
m.nb <- glm.nb( Count ~ log2(Dilution), data=pock ) # fitting a negativev binomial GLM on y
m.nb$theta # phi 
## [1] 9.892894
m.nb <- glm.convert(m.nb)
printCoefmat(coef(summary(m.nb, dispersion=1)))
##                Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)     5.33284    0.08786  60.697 < 2.2e-16 ***
## log2(Dilution) -0.72460    0.03886 -18.646 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

remedy for overdispersion: quasi-Poisson GLMs

  • quasi-Poisson models keep the Poisson variance function \(V(\mu)=\mu\) but allows a general dispersion parameter \(V(\mu_i)=\phi \mu_i, \phi > 1\).

  • mean-variance relationshio

    • As compared to negative binomial models that use a quadratic variance function \(V(\mu)=\mu+\phi \mu^2\), quasi-Poisson models use a linear variance function \(V(\mu_i)=\phi \mu_i\).
  • quasi-Poisson models return identical parameter estimates as those from corresponding Poisson GLMs.

    • identical fitted values returned since identical estimates
  • the standard errors are inflated by a factor of \(\sqrt{\phi}\)

m.qp <- glm( Count ~ log2(Dilution), data=pock, family="quasipoisson")
se.m1 <- coef(summary(m1))[, "Std. Error"]
se.qp <- coef(summary(m.qp))[, "Std. Error"]
data.frame(SE.Pois=se.m1, SE.Quasi=se.qp, ratio=se.qp/se.m1)
##                   SE.Pois   SE.Quasi    ratio
## (Intercept)    0.02255150 0.05677867 2.517733
## log2(Dilution) 0.01544348 0.03888257 2.517733

case study

In a study of nesting female horseshoe crabs, each with an attached male, the number of other nearby male crabs (called satellites) were counted. The colour of the female, the condition of her spine, her carapace width, and her weight were also recorded. The purpose of the study is to understand the factors that attract satellite crabs. Are they more attracted to larger females? Does the condition or colour of the female play a role?

data(hcrabs)

hcrabs$Col <- ordered(hcrabs$Col, levels=c("LM", "M", "DM", "D"))
hcrabs$Spine <- ordered(hcrabs$Spine, levels=c("NoneOK", "OneOK", "BothOK"))

# EDA
with(hcrabs,{
      logSat <- log(Sat+1)
      plot( jitter(Sat) ~ Wt, ylab="Sat", las=1)
      plot( jitter(logSat) ~ log(Wt), ylab="log(Sat+1)", las=1)
      plot( logSat ~ Col, ylab="log(Sat+1)", las=1)
      plot( jitter(Sat) ~ Width, ylab="Sat", las=1)
      plot( jitter(logSat) ~ log(Width), ylab="log(Sat+1)", las=1)
      plot( logSat ~ Spine, ylab="log(Sat+1)", las=1)
      })

with(hcrabs,{
      plot( log(Wt) ~ log(Width), las=1 )
      plot( log(Wt) ~ Col, las=1 )
      plot( log(Wt) ~ Spine, las=1 )
      })

inter-correlation between weight and width was observed; crabs tend to cluster and not to be independent, hence overdispersion was expected:

cr.m1 <- glm(Sat ~ log(Wt) + log(Width) + Spine + Col, family=quasipoisson, data=hcrabs)
anova(cr.m1, test="F")
## Analysis of Deviance Table
## 
## Model: quasipoisson, link: log
## 
## Response: Sat
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev       F    Pr(>F)    
## NULL                         172     632.79                      
## log(Wt)     1   83.084       171     549.71 25.9645 9.429e-07 ***
## log(Width)  1    0.007       170     549.70  0.0023    0.9621    
## Spine       2    1.125       168     548.58  0.1758    0.8389    
## Col         3    7.630       165     540.95  0.7948    0.4984    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check overdispersion
deviance(cr.m1)
## [1] 540.946
df.residual(cr.m1)
## [1] 165

The residual deviance is more than three times the residual degrees of freedom, overdispersion was thus confirmed; only logWt turned out to be significant.

The final step is to refit the quasi-Poisson model or Negative Binomial model with only logWt term and plot diagnostics to identify potential deviations.

Models for positive continuous data

modeling positive continuous data

  • the boundary at zero limits the left tail of the distribution, thus introduces skewness
  • variance approaches 0 as the mean approaches 0, producing interesting mean-variance relationship
    • \(V(\mu)=\mu^2\): gamma distribution
    • \(V(\mu)=\mu^3\): inverse Gaussian distribution

pdf for a gamma dist. is given as \[ \mathcal{P}(y;\alpha, \beta)=\frac{y^{\alpha-1}\exp(-y/\beta)}{\Gamma(\alpha)\beta^{\alpha}}, y>0, \alpha>0, \beta>0 \]

where \(\alpha, \beta\) are the shape and scale parameters, respectively. \(E[y]=\alpha \beta, var[y]=\alpha\beta^2\). the canonical link function for a gamma dist. is the inverse link function \(\eta=1/\mu\); however, the logarithmic link function is often used because it avoids the need for contrains on the linear predictor in view of \(\mu>0\).

writing the gamma pdf in terms of \(\mu, \phi\) gives:

\[ \mathcal{P}(y;\alpha, \beta)=\frac{y}{\phi\mu}^{1/\phi}\frac{1}{y}\exp(-\frac{y}{\phi\mu})\frac{1}{\Gamma(1/\phi)} \]

the variance function is \(V[\mu]=\mu^2\). the coefficient of variation (CV) is defined as the ratio of the variance to the mean squared which is a measurement of the relative variation in the data. Gamma dist. has a CV\(=1\).

example modelling foliage biomass y as a function of tree trunk diameter d:

data(lime)
str(lime)
## 'data.frame':    385 obs. of  4 variables:
##  $ Foliage: num  0.1 0.2 0.4 0.6 0.6 0.8 1 1.4 1.7 3.5 ...
##  $ DBH    : num  4 6 8 9.6 11.3 13.7 15.4 17.8 18 22 ...
##  $ Age    : int  38 38 46 44 60 56 72 74 68 79 ...
##  $ Origin : Factor w/ 3 levels "Coppice","Natural",..: 2 2 2 2 2 2 2 2 2 2 ...
# Plot Foliage against DBH
plot(Foliage ~ DBH, type="n", las=1, xlab="DBH (in cm)", 
     ylab="Foliage biomass (in kg)",
     ylim = c(0, 15), xlim=c(0, 40), data=lime)
points(Foliage ~ DBH, data=subset(lime, Origin=="Coppice"),pch=1)
points(Foliage ~ DBH, data=subset(lime, Origin=="Natural"),pch=2)
points(Foliage ~ DBH, data=subset(lime, Origin=="Planted"),pch=3)
legend("topleft", pch=c(1, 2, 3), legend=c("Coppice", "Natural","Planted"))

# Plot Foliage against DBH, on log scale
plot( log(Foliage) ~ log(DBH), type="n", las=1,
      xlab="log of DBH (in cm)", ylab="log of Foliage biomass (in kg)",
      ylim = c(-5, 3), xlim=c(0, 4), data=lime)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Coppice"),pch=1)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Natural"),pch=2)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Planted"),pch=3)

# Plot Foliage against Age
plot(Foliage ~ Age, type="n", las=1,
     xlab="Age (in years)", ylab="Foliage biomass (in kg)",
     ylim = c(0, 15), xlim=c(0, 150), data=lime)
points(Foliage ~ Age, data=subset(lime, Origin=="Coppice"), pch=1)
points(Foliage ~ Age, data=subset(lime, Origin=="Natural"), pch=2)
points(Foliage ~ Age, data=subset(lime, Origin=="Planted"), pch=3)

# Plot Foliage against Origin
plot( Foliage ~ Origin, data=lime, ylim=c(0, 15), las=1, ylab="Foliage biomass (in kg)")

the response is always positive; the variance in foliage biomass increases as the mean DBH increases;