\[ \log \frac{\mu}{1-\mu}=\beta_0+\beta_1x \] combining the systematic and random components, a possible model for the data is:
\[ \begin{cases} ym \sim \text{Bin}(\mu,m) \quad &(\text{random component})\\ \log \frac{\mu}{1-\mu}=\beta_0+\beta_1x \quad &(\text{systematic component}) \end{cases} \]
the outcome of many studies is a proportion y of a total number m: the proportion of individuals having a diseasel; the proportion of voters who vote in favor of a particular election candidate; … each case, the m individuals in each group are assumed to be independent and each individual can be classified into one of two possible outcomes.
\[ \mathcal{P}(y|\mu, m)={m \choose my}\mu^{my}(1-\mu)^{m(1-y)} \]
where \(m\) is known and \(\phi=1, 0<\mu<1\). For binomial GLMs, the use of quantile residuals is strongly recommended for diagnostic analysis.
Binomial responses may be specified in the glm()
formula in one of three ways:
glm()
.weights
do not need to be supplied.data(turbines)
tur.m1 <- glm( Fissures/Turbines ~ Hours, family=binomial,
weights=Turbines, data=turbines) # 1st
tur.m2 <- glm( cbind(Fissures, Turbines-Fissures) ~ Hours,
family=binomial, data=turbines) # 2nd
coef(tur.m1)
## (Intercept) Hours
## -3.9235965551 0.0009992372
coef(tur.m2)
## (Intercept) Hours
## -3.9235965551 0.0009992372
to ensure \(0<\mu<1\), possible choice of link function for binomial GLMs are:
logit
link function, which is the canonical link function for the binomial distribution and the default link function in R:\[ \eta=\log \frac{\mu}{1-\mu}=logit(\mu) \]
-probit
link function: \(\eta=\Phi^{-1}(\mu)=probit(\mu)\), where \(\Phi(\cdot)\) is the CDF for the normal distribution. This link function is specified in R as link="probit
.
link="cloglog"
.tr.logit <- glm( Fissures/Turbines ~ Hours, data=turbines,
family=binomial, weights=Turbines)
tr.probit <- update( tr.logit, family=binomial(link="probit") )
tr.cll <- update( tr.logit, family=binomial(link="cloglog") )
tr.array <- rbind( coef(tr.logit), coef(tr.probit), coef(tr.cll))
tr.array <- cbind( tr.array,
c(deviance(tr.logit),
deviance(tr.probit),
deviance(tr.cll)) )
colnames(tr.array) <- c("Intercept", "Hours","Residual dev.")
rownames(tr.array) <- c("Logit","Probit","Comp log-log")
tr.array
## Intercept Hours Residual dev.
## Logit -3.923597 0.0009992372 10.331466
## Probit -2.275807 0.0005783211 9.814837
## Comp log-log -3.603280 0.0008104936 12.227914
newHrs <- seq( 0, 5000, length=100)
newdf <- data.frame(Hours=newHrs)
newP.logit <- predict( tr.logit, newdata=newdf, type="response")
newP.probit <- predict( tr.probit, newdata=newdf, type="response")
newP.cll <- predict( tr.cll, newdata=newdf, type="response")
plot( Fissures/Turbines ~ Hours, data=turbines,
pch=19, las=1,
xlim=c(0, 5000), ylim=c(0, 0.7),
xlab="Hours run", ylab="Proportion with fissures")
lines( newP.logit ~ newHrs, lty=1, lwd=2)
lines( newP.probit ~ newHrs, lty=2, lwd=2)
lines( newP.cll ~ newHrs, lty=4, lwd=2)
legend("topleft", lwd=2, lty=c(1, 2, 4),
legend=c("Logit","Probit","Comp. log-log"))
LogOdds <- predict( tr.logit )
odds <- exp( LogOdds )
plot( LogOdds ~ turbines$Hours, type="l", las=1,
xlim=c(0, 5000), ylim=c(-5, 1),
ylab="Empirical Log-odds", xlab="Run-time (in hours)" )
my <- turbines$Fissures
m <- turbines$Turbines
EmpiricalOdds <- (my + 0.5)/(m - my + 0.5) # empirical log-odds: avoid log of 0s
points( log(EmpiricalOdds) ~ turbines$Hours)
plot( odds ~ turbines$Hours, las=1, xlim=c(0, 5000), ylim=c(0, 2),
type="l", ylab="Odds", xlab="Run-time (in hours)")
points( EmpiricalOdds ~ turbines$Hours)
\[ \log \frac{\mu}{1-\mu}=\log-\text{odds}=\beta_0+\beta_1x \]
where x is a dummy variable taking the values 0 or 1. From this equation, we see that the odds of observing a success when \(x = 0\) is \(\exp(\beta_0)\), and the odds of observing a success when \(x = 1\) is \(\exp(β0 + β1) = \exp(\beta_0) \exp(\beta_1)\). The ratio of these two odds is \(\exp(\beta_1)\). This means that the odds of a success occurring when \(x = 1\) is \(\exp(\beta_1)\) times greater than when \(x = 0\). This ratio is called the odds ratio, often written OR.
Binomial GLMs are commonly used to examine the relationship between the dose d of a drug and the proportion y of insects that survive. These models are called dose-response models. ED50: the dose of poison affecting \(50\%\) of the insects. By definition, \(\mu=0.5\) at the ED50. For a binomial GLM using logit link function, \(\eta=\log(\mu/(1-\mu))=0\) when \(\mu=0.5\), where \(\eta=\beta_0+\beta_1d\) and d is the dose. Solving for dose d gives that ED50 \(=-\hat{\beta_0}/\hat{\beta_1}\).
For a binomial distribution, var\([y]=\mu(1-\mu)\). The amount of variation in the data can exceed this value, called overdispersion
. Consequences are standard errors by GLM are underestimated, and tests on predictors will generally be more significant leading to overly complex models.
Overdispersion can be detected by goodness-of-fit test. Lack of fit may be caused by an inadequate model. But if all possible predictors are already in the model and no outliers exist when lack of fit persists ,then overdispersion is the alternative interpretation.
Overdispersion arises in two cases: the probabilities \(\mu_i\) are not constant between observations; the \(m_i\) cases, of which observations \(y_i\) is a proportion, are not independent.
Quasi-binomial
GLMs are the remedies for overdispersion.
Example: Since seedsare usually planted together in common plots, it is highly possible that they might interact or be affected by common causes; in other words we might well expect seeds to be positively correlated, leading to overdispersion.
data(germ)
gm.m1 <- glm(Germ/Total ~ Seeds * Extract, family=binomial,
data=germ, weights=Total)
printCoefmat(coef(summary(gm.m1)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.41224 0.18418 -2.2383 0.02520 *
## SeedsOA75 -0.14593 0.22317 -0.6539 0.51318
## ExtractCucumber 0.54008 0.24981 2.1619 0.03062 *
## SeedsOA75:ExtractCucumber 0.77810 0.30643 2.5392 0.01111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(gm.m1, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Germ/Total
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 20 98.719
## Seeds 1 2.544 19 96.175 0.11070
## Extract 1 56.489 18 39.686 5.65e-14 ***
## Seeds:Extract 1 6.408 17 33.278 0.01136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c( deviance(gm.m1), df.residual(gm.m1) )
## [1] 33.27779 17.00000
# diagnostic plot
qres <- qresid(gm.m1)
qqnorm(qres, las=1)
abline(0, 1)
# deviance residual vs fitted values
scatter.smooth( qres ~ fitted( gm.m1 ),
col="grey", las=1,
ylab="Standardized residuals",
xlab="Fitted values")
the residual deviance is much larger than the residual degrees of freedom; from QQ-plot no large residuals present that would suggest outliers; influential observations cannot be an issue. So overdispersion is present.
gm.od <- update(gm.m1, family=quasibinomial)
anova(gm.od, test="F")
## Analysis of Deviance Table
##
## Model: quasibinomial, link: logit
##
## Response: Germ/Total
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F Pr(>F)
## NULL 20 98.719
## Seeds 1 2.544 19 96.175 1.3665 0.25854
## Extract 1 56.489 18 39.686 30.3407 3.839e-05 ***
## Seeds:Extract 1 6.408 17 33.278 3.4418 0.08099 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The quasi-binomial analysis of deviance suggests that only Extract is significant in the model, so germination frequency differs by root stock but not by seed type, unlike the binomial glm which showed a significant Extract by Seeds interaction.
The binomial and quasi-binomial glms give identical coefficient estimates, but the standard errors from the quasi-binomial GLM are \(\sqrt{\phi}\) times those from the binomial model:
sqrt(summary(gm.od)$dispersion)
## [1] 1.36449
beta <- coef(summary(gm.m1))[,"Estimate"]
m1.se <- coef(summary(gm.m1))[,"Std. Error"]
od.se <- coef(summary(gm.od))[,"Std. Error"]
data.frame(Estimate=beta, Binom.SE=m1.se,
Quasi.SE=od.se, Ratio=od.se/m1.se)
## Estimate Binom.SE Quasi.SE Ratio
## (Intercept) -0.4122448 0.1841784 0.2513095 1.36449
## SeedsOA75 -0.1459269 0.2231659 0.3045076 1.36449
## ExtractCucumber 0.5400782 0.2498130 0.3408672 1.36449
## SeedsOA75:ExtractCucumber 0.7781037 0.3064332 0.4181249 1.36449
For binary data, likelihood ratio tests and score tests should be used, making sure that p’ is much smaller than n.
an experiment exposed batches of insects to various deposits of insecticides. The proportion of insects y killed after six days of exposure in each batch of size m is potentially a function of the dose of insecticide and the type of insecticide.
data(deposit)
deposit$Prop <- deposit$Killed / deposit$Number
plot( Prop ~ Deposit, type="n", las=1, ylim=c(0, 1),
data=deposit, main="Proportion of\ninsects killed",
xlab="Deposit (in mg)", ylab="Proportion killed")
points( Prop ~ Deposit, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ Deposit, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ Deposit, pch="C", subset=(Insecticide=="C"), data=deposit)
# Add lines for each insecticide type with different line types
lines(Prop ~ Deposit, subset=(Insecticide == "A"), data=deposit, lty=1) # solid line
lines(Prop ~ Deposit, subset=(Insecticide == "B"), data=deposit, lty=2) # dashed line
lines(Prop ~ Deposit, subset=(Insecticide == "C"), data=deposit, lty=3) # dotted line
# Optionally, add a legend to identify the lines
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"),
lty=c(1, 2, 3), pch=c("A", "B", "C"))
fitting a GLM model using deposit amount the type of insecticide as predictors and plot the fitted lines:
ins.m1 <- glm(Killed/Number ~ Deposit + Insecticide,
family = binomial, weights = Number, data = deposit)
newD <- seq( min(deposit$Deposit), max(deposit$Deposit), length=100)
newProp.logA <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="A") )
newProp.logB <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="B") )
newProp.logC <- predict(ins.m1, type="response",newdata=data.frame(Deposit=newD, Insecticide="C") )
plot( Prop ~ Deposit, type="n", las=1, ylim=c(0, 1),
data=deposit, main="Proportion of\ninsects killed",
xlab="Deposit (in mg)", ylab="Proportion killed")
points( Prop ~ Deposit, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ Deposit, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ Deposit, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)
### disgnostic
plot( qresid(ins.m1) ~ fitted(ins.m1), type="n", las=1, ylim=c(-3, 3),
main="Quantile resids. plotted\nagainst fitted values",
xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m1) ~ fitted(ins.m1), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m1) ~ fitted(ins.m1), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m1) ~ fitted(ins.m1), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))
For each insecticide, the proportions are under-estimated at the lower and higher values of deposit.
LogOdds <- with(deposit, log(Prop/(1-Prop)) )
plot( LogOdds ~ Deposit, type="n", xlab="Deposit", data=deposit,
main="Logits plotted\nagainst Deposit", las=1)
points( LogOdds ~ Deposit, pch="A", type="b", lty=1, data=deposit, subset=(Insecticide=="A") )
points( LogOdds ~ Deposit, pch="B", type="b", lty=2, data=deposit, subset=(Insecticide=="B") )
points( LogOdds ~ Deposit, pch="C", type="b", lty=3, data=deposit, subset=(Insecticide=="C") )
Plotting the log-odds against the deposit shows the relationship is not linear on log-odds. try logarithm of dose:
deposit$logDep <- log( deposit$Deposit )
ins.m2 <- glm(Killed/Number ~ logDep + Insecticide - 1,
family = binomial, weights = Number, data = deposit)
# proportion vs. log-logDep
newD <- seq( min(deposit$logDep), max(deposit$logDep), length=100)
newProp.logA <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="A") )
newProp.logB <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="B") )
newProp.logC <- predict(ins.m2, type="response",newdata=data.frame(logDep=newD, Insecticide="C") )
plot( Prop ~ logDep, type="n", las=1, ylim=c(0, 1),
data=deposit, main="Proportion of\ninsects killed",
xlab="logDep (in mg)", ylab="Proportion killed")
points( Prop ~ logDep, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ logDep, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ logDep, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)
# quantile residual vs. fitted values
plot( qresid(ins.m2) ~ fitted(ins.m2), type="n", las=1, ylim=c(-3, 3),
main="Quantile resids. plotted\nagainst fitted values",
xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m2) ~ fitted(ins.m2), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m2) ~ fitted(ins.m2), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m2) ~ fitted(ins.m2), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))
# fitted log-odds vs. log deposit
plot(LogOdds ~ logDep, type="n", las=1,
data=deposit, main="Log-Odds of Proportion of\nInsects Killed vs Log Deposit",
xlab="Log Deposit (in mg)", ylab="Log-Odds of Proportion Killed")
# Add points for each insecticide type
points(LogOdds ~ logDep, pch="A", col="blue", subset=(Insecticide == "A"), data=deposit)
points(LogOdds ~ logDep, pch="B", col="green", subset=(Insecticide == "B"), data=deposit)
points(LogOdds ~ logDep, pch="C", col="purple", subset=(Insecticide == "C"), data=deposit)
# Predict the log-odds using the fitted model
logDep_seq <- seq(min(deposit$logDep), max(deposit$logDep), length.out = 100)
predict_data <- data.frame(logDep = logDep_seq, Insecticide = rep("A", 100))
pred_A <- predict(ins.m2, newdata = predict_data, type = "response")
predict_data$Insecticide <- rep("B", 100)
pred_B <- predict(ins.m2, newdata = predict_data, type = "response")
predict_data$Insecticide <- rep("C", 100)
pred_C <- predict(ins.m2, newdata = predict_data, type = "response")
# Add the fitted model lines for each insecticide
lines(logDep_seq, log(pred_A / (1 - pred_A)), col="blue", lty=1) # solid line for insecticide A
lines(logDep_seq, log(pred_B / (1 - pred_B)), col="green", lty=2) # dashed line for insecticide B
lines(logDep_seq, log(pred_C / (1 - pred_C)), col="purple", lty=3) # dotted line for insecticide C
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"),
col=c("blue", "green", "purple"), pch=c("A", "B", "C"), lty=c(1, 2, 3))
While model 2 improves on model 1, proportions are still under-estimated for all types at the lower and higher values of deposit; plotting the log-odds against the logarithm of Deposit indicates that the log-odds are not constant, but are perhaps quadratic:
ins.m3 <- glm(Killed/Number ~ poly(logDep, 2) + Insecticide,
family = binomial, weights = Number, data = deposit)
# diagnostic plots
# proportion vs. log-logDep
newD <- seq( min(deposit$logDep), max(deposit$logDep), length=100)
newProp.logA <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="A") )
newProp.logB <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="B") )
newProp.logC <- predict(ins.m3, type="response",newdata=data.frame(logDep=newD, Insecticide="C") )
plot( Prop ~ logDep, type="n", las=1, ylim=c(0, 1),
data=deposit, main="Proportion of\ninsects killed",
xlab="logDep (in mg)", ylab="Proportion killed")
points( Prop ~ logDep, pch="A", subset=(Insecticide=="A"), data=deposit)
points( Prop ~ logDep, pch="B", subset=(Insecticide=="B"), data=deposit)
points( Prop ~ logDep, pch="C", subset=(Insecticide=="C"), data=deposit)
lines( newProp.logA ~ newD, lty=1)
lines( newProp.logB ~ newD, lty=2)
lines( newProp.logC ~ newD, lty=3)
# quantile residual vs. fitted values
plot( qresid(ins.m3) ~ fitted(ins.m3), type="n", las=1, ylim=c(-2, 2),
main="Quantile resids. plotted\nagainst fitted values",
xlab="Fitted values", ylab="Residuals")
abline(h = 0, col="grey")
points( qresid(ins.m3) ~ fitted(ins.m3), pch="A", type="b", lty=1,subset=(deposit$Insecticide=="A") )
points( qresid(ins.m3) ~ fitted(ins.m3), pch="B", type="b", lty=2,subset=(deposit$Insecticide=="B") )
points( qresid(ins.m3) ~ fitted(ins.m3), pch="C", type="b", lty=3,subset=(deposit$Insecticide=="C"))
# fitted log-odds vs. log deposit
plot(LogOdds ~ logDep, type="n", las=1,
data=deposit, main="Log-Odds of Proportion of\nInsects Killed vs Log Deposit",
xlab="Log Deposit (in mg)", ylab="Log-Odds of Proportion Killed")
# Add points for each insecticide type
points(LogOdds ~ logDep, pch="A", col="blue", subset=(Insecticide == "A"), data=deposit)
points(LogOdds ~ logDep, pch="B", col="green", subset=(Insecticide == "B"), data=deposit)
points(LogOdds ~ logDep, pch="C", col="purple", subset=(Insecticide == "C"), data=deposit)
# Predict the log-odds using the fitted model
logDep_seq <- seq(min(deposit$logDep), max(deposit$logDep), length.out = 100)
predict_data <- data.frame(logDep = logDep_seq, Insecticide = rep("A", 100))
pred_A <- predict(ins.m3, newdata = predict_data, type = "response")
predict_data$Insecticide <- rep("B", 100)
pred_B <- predict(ins.m3, newdata = predict_data, type = "response")
predict_data$Insecticide <- rep("C", 100)
pred_C <- predict(ins.m3, newdata = predict_data, type = "response")
# Add the fitted model lines for each insecticide
lines(logDep_seq, log(pred_A / (1 - pred_A)), col="blue", lty=1) # solid line for insecticide A
lines(logDep_seq, log(pred_B / (1 - pred_B)), col="green", lty=2) # dashed line for insecticide B
lines(logDep_seq, log(pred_C / (1 - pred_C)), col="purple", lty=3) # dotted line for insecticide C
legend("bottomright", legend=c("Insecticide A", "Insecticide B", "Insecticide C"),
col=c("blue", "green", "purple"), pch=c("A", "B", "C"), lty=c(1, 2, 3))
anova( ins.m2, ins.m3, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Killed/Number ~ logDep + Insecticide - 1
## Model 2: Killed/Number ~ poly(logDep, 2) + Insecticide
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 14 23.385
## 2 13 15.090 1 8.2949 0.003976 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
the quadratic model is a statistically significantly improvement; no evidence exists to support overdispersion: the deviance is 15 and df is 13.
\[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]
Poisson distribution is used to model counts, which has PMF: \[ \mathcal{P}(y;\lambda)=\frac{\exp(-\lambda)\lambda^y}{y!} \] The unit deviance for Poisson distribution is: \[ d(y, \lambda)=2\{y\log\frac{y}{\lambda}-(y-\lambda)\} \] and the residual deviance is \[ D(y,\hat{\lambda})=\sum w_i d(y_i, \hat{\lambda}_i) \] where \(w_i\) are the prior weights.
model defined as \[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\log T+\beta_0+\sum_{j=1}^{p}\beta_jx_j \quad (\text{systematic component}) \end{cases} \]
data(danishlc)
danishlc$Rate <- danishlc$Cases / danishlc$Pop * 1000 # Rate per 1000
danishlc$Age <- ordered(danishlc$Age, # Ensure age-order is preserved
levels=c("40-54", "55-59", "60-64", "65-69", "70-74", ">74") )
danishlc$City <- abbreviate(danishlc$City, 1) # Abbreviate city names
matplot( xtabs( Rate ~ Age+City, data=danishlc), pch=1:4, lty=1:4,
type="b", lwd=2, col="black", axes=FALSE, ylim=c(0, 25),
xlab="Age group", ylab="Cases/1000")
axis(side=1, at=1:6, labels=levels(danishlc$Age))
axis(side=2, las=1)
box()
legend("topleft", col="black", pch=1:4, lwd=2, lty=1:4, merge=FALSE,
legend=c("Fredericia", "Horsens", "Kolding", "Vejle") )
dlc.m1 <- glm( Cases ~ offset( log(Pop) ) + City * Age,
family=poisson, data=danishlc)
anova(dlc.m1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Cases
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 23 129.908
## City 3 3.393 20 126.515 0.33495
## Age 5 103.068 15 23.447 < 2e-16 ***
## City:Age 15 23.447 0 0.000 0.07509 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
city seems not significant; alternatively, consider age as continuous and a quadratic term:
dlc.m2 <- update(dlc.m1, . ~ offset(log(Pop)) + Age )
danishlc$AgeNum <- rep( c(40, 55, 60, 65, 70, 75), 4)
dlc.m3 <- update(dlc.m1, . ~ offset( log(Pop) ) + AgeNum)
dlc.m4 <- update(dlc.m3, . ~ offset( log(Pop) ) + poly(AgeNum, 2) )
anova( dlc.m3, dlc.m4, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: Cases ~ AgeNum + offset(log(Pop))
## Model 2: Cases ~ poly(AgeNum, 2) + offset(log(Pop))
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 22 48.968
## 2 21 32.500 1 16.468 4.948e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
now compare 4 models using AIC/BIC:
c( "With interaction"=AIC(dlc.m1), "Without interaction"=AIC(dlc.m2),
"Age (numerical)"=AIC(dlc.m3), "Age (numerical; quadratic)"=AIC(dlc.m4) )
## With interaction Without interaction
## 144.3880 136.6946
## Age (numerical) Age (numerical; quadratic)
## 149.3556 134.8876
# goodness-of-fit test
D.m2 <- deviance(dlc.m2)
df.m2 <- df.residual( dlc.m2 )
c( Dev=D.m2, df=df.m2, P = pchisq( D.m2, df.m2, lower = FALSE) )
## Dev df P
## 28.30652745 18.00000000 0.05754114
D.m4 <- deviance(dlc.m4)
df.m4 <- df.residual( dlc.m4 )
c( Dev=D.m4, df=df.m4, P = pchisq( D.m4, df.m4, lower = FALSE) )
## Dev df P
## 32.49959158 21.00000000 0.05206888
Both models are reasonably adequate. now consider diagnostic plots:
# factor age model
scatter.smooth( rstandard(dlc.m2) ~ sqrt(fitted(dlc.m2)),
ylab="Standardized residuals", xlab="Sqrt(Fitted values)",
main="Factor age model", las=1 )
plot( cooks.distance(dlc.m2), type="h", las=1, main="Cook's D", ylab="Cook's distance, D")
qqnorm( qr<-qresid(dlc.m2), las=1 ); abline(0, 1)
# quadratic age model
scatter.smooth( rstandard(dlc.m4) ~ sqrt(fitted(dlc.m4)),
ylab="Standardized residuals", xlab="Sqrt(Fitted values)",
main="Quadratic age model", las=1 )
plot( cooks.distance(dlc.m4), type="h", las=1, main="Cook's D", ylab="Cook's distance, D")
qqnorm( qr<-qresid(dlc.m4), las=1 ); abline(0, 1)
Two models are reasonable. But we prefer model 2 since model 4 has three influential obs compared to the rest and model 2 is simpler.
Counts <- c(263, 258, 151, 222)
Att <- gl(2, 2, 4, labels=c("For", "Against") )
Inc <- gl(2, 1, 4, labels=c("High", "Low") )
data.frame( Counts, Att, Inc)
## Counts Att Inc
## 1 263 For High
## 2 258 For Low
## 3 151 Against High
## 4 222 Against Low
gm.table <- xtabs( Counts ~ Att + Inc )
gm.table
## Inc
## Att High Low
## For 263 258
## Against 151 222
sample scheme by presetting row/column margins
the log-likelihood function for the \(2 \times 2\) table is given as \[ \mathcal{L}(\mu;y)=\sum_{i=1}^{2}\sum_{j=1}^{2}(-\mu_{ij}+y_{ij}\log \mu_{ij}) \]
gm.1 <- glm( Counts ~ Att + Inc, family=poisson)
anova( gm.1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Counts
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3 38.260
## Att 1 24.6143 2 13.646 7.003e-07 ***
## Inc 1 4.8769 1 8.769 0.02722 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(gm.1)
## (Intercept) AttAgainst IncLow
## 5.4859102 -0.3341716 0.1479201
the model has the systematic component \(\log \mu_{ij}=5.486-0.334x_1 + 0.1479x_2\) where \(x_1=1\) for obs located at the second row i.e., \(i=2\) and \(x_2=1\) if obs at second column i.e., \(j=2\). unlogging gets us the systematic component in their original units.
to test for the interaction between factor A and B:
gm.int <- glm( Counts ~ Att * Inc, family=poisson)
anova( gm.int, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Counts
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3 38.260
## Att 1 24.6143 2 13.646 7.003e-07 ***
## Inc 1 4.8769 1 8.769 0.027218 *
## Att:Inc 1 8.7686 0 0.000 0.003065 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The analysis of deviance table shows the interaction term is necessary in the model. the data suggest an association between A and B.
instead of a Poisson distribution, a multi-nomial distribution is appropriate.
the log-likelihood function for the \(2 \times 2\) table is given as \[ \mathcal{L}(\mu;y)=\sum_{i=1}^{2}\sum_{j=1}^{2}y_{ij}\log \mu_{ij} \] Estimating \(\mu_{ij}\) by maximizing the log-likelihood for the multinomial distribution constrained by \(\sum_{i=1}^{2}\sum_{j=1}^{2}y_{ij}=m\). The Poisson model, conditioning on the grand total, is equivalent to a multinomial model.
data(kstones)
ks.mutind <- glm( Counts ~ Size + Method + Outcome, family=poisson, data=kstones) # mutual indep.
ks.SM <- glm( Counts ~ Size * Method + Outcome, family=poisson, data=kstones ) # partial independence
ks.SO <- update(ks.SM, . ~ Size * Outcome + Method)
ks.OM <- update(ks.SM, . ~ Outcome * Method + Size)
ks.noMO <- glm( Counts ~ Size * (Method + Outcome), family=poisson, data=kstones ) # conditional indep.
ks.noOS <- update(ks.noMO, . ~ Method * (Outcome + Size) )
ks.noMS <- update(ks.noMO, . ~ Outcome * (Method + Size) )
ks.no3 <- glm( Counts ~ Size*Method*Outcome - Size:Method:Outcome, family=poisson, data=kstones ) # uniform assoc.
ks.all <- glm( Counts ~ Size * Method * Outcome, family=poisson, data=kstones ) # saturated model
# model comparison
data(dyouth)
dy.m1 <- glm( Obs ~ Age*Depression*Gender + Age*Group, data=dyouth, family=poisson)
anova(dy.m1, test="Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: Obs
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 23 368.05
## Age 2 11.963 21 356.09 0.002525 **
## Depression 1 168.375 20 187.71 < 2.2e-16 ***
## Gender 1 58.369 19 129.34 2.172e-14 ***
## Group 1 69.104 18 60.24 < 2.2e-16 ***
## Age:Depression 2 3.616 16 56.62 0.163964
## Age:Gender 2 3.631 14 52.99 0.162718
## Depression:Gender 1 7.229 13 45.76 0.007175 **
## Age:Group 2 27.090 11 18.67 1.311e-06 ***
## Age:Depression:Gender 2 8.325 9 10.35 0.015571 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Males <- subset(dyouth, Gender=="M")
Females <- subset(dyouth, Gender=="F")
table.M <- prop.table( xtabs(Obs~Age+Depression, data=Males), 1)
table.F <- prop.table( xtabs(Obs~Age+Depression, data=Females), 1)
round(table.F * 100) # FEMALES
## Depression
## Age H L
## 12-14 36 64
## 15-16 31 69
## 17-18 10 90
round(table.M * 100) # MALES
## Depression
## Age H L
## 12-14 20 80
## 15-16 12 88
## 17-18 20 80
for a Poisson dist., \(var[y]=E[y]\). in practise, however, the variance of the data exceeds \(E[y]\). this is called overdispersion. underdispersion occurs, but less commonly.
reason for occurrence, effects, and detection of overdispersion:
data(pock)
plot( Count ~ jitter(log2(Dilution)), data=pock, las=1,
xlab="Log (base 2) of dilution", ylab="Pock mark count")
mn <- with(pock, tapply(Count, log2(Dilution), mean) ) # Group means
vr <- with(pock, tapply(Count, log2(Dilution), var) ) # Group variances
plot( log(vr) ~ log(mn), las=1,
xlab="Group mean", ylab="Group variance")
data.frame(mn, vr, ratio=vr/mn)
## mn vr ratio
## 0 186.625 1781.12500 9.543871
## 1 104.700 667.34444 6.373872
## 2 50.800 360.40000 7.094488
## 3 27.100 194.98889 7.195162
## 4 9.100 17.65556 1.940171
here we have an example of clustered data. the sample variance is much larger than the mean for each group, clear evidence of overdispersion. fitting a Poisson GLM shows substantial lack of fit:
m1 <- glm( Count ~ log2(Dilution), data=pock, family=poisson )
printCoefmat(coef(summary(m1)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.267932 0.022552 233.596 < 2.2e-16 ***
## log2(Dilution) -0.680944 0.015443 -44.093 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
X2 <- sum(residuals(m1, type="pearson")^2)
c(Df=df.residual(m1), Resid.Dev=deviance(m1), Pearson.X2=X2)
## Df Resid.Dev Pearson.X2
## 46.0000 290.4387 291.5915
we can model overdispersion by assuming a second layer of variability by allowing \(E[y_i]=\lambda_i\) to be a random variable: \[ \begin{aligned} y_i|\lambda_i \sim \text{Pois}(\lambda_i), \lambda_i \sim \text{Gamma}(\mu_i, \phi) \end{aligned} \] by hierarchical mixtures of random variables, we can show that
\[ y_i \sim \text{NB}(\mu_i, k) \] where \(k=1/\phi\) and \(var[y_i]=\mu_i+\phi\mu_i^2\).
library(MASS) # Provides the function glm.nb()
m.nb <- glm.nb( Count ~ log2(Dilution), data=pock ) # fitting a negativev binomial GLM on y
m.nb$theta # phi
## [1] 9.892894
m.nb <- glm.convert(m.nb)
printCoefmat(coef(summary(m.nb, dispersion=1)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.33284 0.08786 60.697 < 2.2e-16 ***
## log2(Dilution) -0.72460 0.03886 -18.646 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
quasi-Poisson models keep the Poisson variance function \(V(\mu)=\mu\) but allows a general dispersion parameter \(V(\mu_i)=\phi \mu_i, \phi > 1\).
mean-variance relationshio
quasi-Poisson models return identical parameter estimates as those from corresponding Poisson GLMs.
the standard errors are inflated by a factor of \(\sqrt{\phi}\)
m.qp <- glm( Count ~ log2(Dilution), data=pock, family="quasipoisson")
se.m1 <- coef(summary(m1))[, "Std. Error"]
se.qp <- coef(summary(m.qp))[, "Std. Error"]
data.frame(SE.Pois=se.m1, SE.Quasi=se.qp, ratio=se.qp/se.m1)
## SE.Pois SE.Quasi ratio
## (Intercept) 0.02255150 0.05677867 2.517733
## log2(Dilution) 0.01544348 0.03888257 2.517733
In a study of nesting female horseshoe crabs, each with an attached male, the number of other nearby male crabs (called satellites) were counted. The colour of the female, the condition of her spine, her carapace width, and her weight were also recorded. The purpose of the study is to understand the factors that attract satellite crabs. Are they more attracted to larger females? Does the condition or colour of the female play a role?
data(hcrabs)
hcrabs$Col <- ordered(hcrabs$Col, levels=c("LM", "M", "DM", "D"))
hcrabs$Spine <- ordered(hcrabs$Spine, levels=c("NoneOK", "OneOK", "BothOK"))
# EDA
with(hcrabs,{
logSat <- log(Sat+1)
plot( jitter(Sat) ~ Wt, ylab="Sat", las=1)
plot( jitter(logSat) ~ log(Wt), ylab="log(Sat+1)", las=1)
plot( logSat ~ Col, ylab="log(Sat+1)", las=1)
plot( jitter(Sat) ~ Width, ylab="Sat", las=1)
plot( jitter(logSat) ~ log(Width), ylab="log(Sat+1)", las=1)
plot( logSat ~ Spine, ylab="log(Sat+1)", las=1)
})
with(hcrabs,{
plot( log(Wt) ~ log(Width), las=1 )
plot( log(Wt) ~ Col, las=1 )
plot( log(Wt) ~ Spine, las=1 )
})
inter-correlation between weight and width was observed; crabs tend to cluster and not to be independent, hence overdispersion was expected:
cr.m1 <- glm(Sat ~ log(Wt) + log(Width) + Spine + Col, family=quasipoisson, data=hcrabs)
anova(cr.m1, test="F")
## Analysis of Deviance Table
##
## Model: quasipoisson, link: log
##
## Response: Sat
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F Pr(>F)
## NULL 172 632.79
## log(Wt) 1 83.084 171 549.71 25.9645 9.429e-07 ***
## log(Width) 1 0.007 170 549.70 0.0023 0.9621
## Spine 2 1.125 168 548.58 0.1758 0.8389
## Col 3 7.630 165 540.95 0.7948 0.4984
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check overdispersion
deviance(cr.m1)
## [1] 540.946
df.residual(cr.m1)
## [1] 165
The residual deviance is more than three times the residual degrees of freedom, overdispersion was thus confirmed; only logWt turned out to be significant.
The final step is to refit the quasi-Poisson model or Negative Binomial model with only logWt term and plot diagnostics to identify potential deviations.
pdf for a gamma dist. is given as \[ \mathcal{P}(y;\alpha, \beta)=\frac{y^{\alpha-1}\exp(-y/\beta)}{\Gamma(\alpha)\beta^{\alpha}}, y>0, \alpha>0, \beta>0 \]
where \(\alpha, \beta\) are the shape and scale parameters, respectively. \(E[y]=\alpha \beta, var[y]=\alpha\beta^2\). the canonical link function for a gamma dist. is the inverse link function \(\eta=1/\mu\); however, the logarithmic link function is often used because it avoids the need for contrains on the linear predictor in view of \(\mu>0\).
writing the gamma pdf in terms of \(\mu, \phi\) gives:
\[ \mathcal{P}(y;\alpha, \beta)=\frac{y}{\phi\mu}^{1/\phi}\frac{1}{y}\exp(-\frac{y}{\phi\mu})\frac{1}{\Gamma(1/\phi)} \]
the variance function is \(V[\mu]=\mu^2\). the coefficient of variation (CV) is defined as the ratio of the variance to the mean squared which is a measurement of the relative variation in the data. Gamma dist. has a CV\(=1\).
example modelling foliage biomass y as a function of tree trunk diameter d:
data(lime)
str(lime)
## 'data.frame': 385 obs. of 4 variables:
## $ Foliage: num 0.1 0.2 0.4 0.6 0.6 0.8 1 1.4 1.7 3.5 ...
## $ DBH : num 4 6 8 9.6 11.3 13.7 15.4 17.8 18 22 ...
## $ Age : int 38 38 46 44 60 56 72 74 68 79 ...
## $ Origin : Factor w/ 3 levels "Coppice","Natural",..: 2 2 2 2 2 2 2 2 2 2 ...
# Plot Foliage against DBH
plot(Foliage ~ DBH, type="n", las=1, xlab="DBH (in cm)",
ylab="Foliage biomass (in kg)",
ylim = c(0, 15), xlim=c(0, 40), data=lime)
points(Foliage ~ DBH, data=subset(lime, Origin=="Coppice"),pch=1)
points(Foliage ~ DBH, data=subset(lime, Origin=="Natural"),pch=2)
points(Foliage ~ DBH, data=subset(lime, Origin=="Planted"),pch=3)
legend("topleft", pch=c(1, 2, 3), legend=c("Coppice", "Natural","Planted"))
# Plot Foliage against DBH, on log scale
plot( log(Foliage) ~ log(DBH), type="n", las=1,
xlab="log of DBH (in cm)", ylab="log of Foliage biomass (in kg)",
ylim = c(-5, 3), xlim=c(0, 4), data=lime)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Coppice"),pch=1)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Natural"),pch=2)
points( log(Foliage) ~ log(DBH), data=subset(lime, Origin=="Planted"),pch=3)
# Plot Foliage against Age
plot(Foliage ~ Age, type="n", las=1,
xlab="Age (in years)", ylab="Foliage biomass (in kg)",
ylim = c(0, 15), xlim=c(0, 150), data=lime)
points(Foliage ~ Age, data=subset(lime, Origin=="Coppice"), pch=1)
points(Foliage ~ Age, data=subset(lime, Origin=="Natural"), pch=2)
points(Foliage ~ Age, data=subset(lime, Origin=="Planted"), pch=3)
# Plot Foliage against Origin
plot( Foliage ~ Origin, data=lime, ylim=c(0, 15), las=1, ylab="Foliage biomass (in kg)")
the response is always positive; the variance in foliage biomass increases as the mean DBH increases;