Background

where linear regression fails

linear regression models assume the variance is constant, with the responses typically assumed to come from a normal distribution. However, many data types exist for which the randomness is not constant; least-squares estimation then fails and other methods become necessary.

three specific situations where the variation is not constant and linear regression models fail:

  • the response \(y\) is a proportion of \(m\) binary trials, ranging between 0 and 1: a binomial distribution may be appropriate. \[ \begin{cases} ym \sim \text{Bin}(\mu, m) \quad (\text{random component}) \\ \log\frac{\mu}{1-\mu}=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]

  • the response is a count: a Poisson distribution may be appropriate

\[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]

  • the response is positive and continuous, often right skewed: a gamma or inverse Gaussian distribution may be appropriate

\[ \begin{cases} y \sim \text{Gamma}(\mu;\phi) \quad (\text{random component}) \\ \mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]

in these situations, the relationship between \(y\) and the predictors is usually non-linear, and the response is bounded, so a linear relationship cannot hold for all values of the response.

The exponential family of distributions covers most common types of data:

  • continuous data over the entire real line may be modeled by the normal distribution
  • proportion of a total number of counts may be modeled by the binomial distribution
  • discrete count data may be modeled by the Poisson or negative binomial distribution
  • continuous data over the positive real line may be modeled by the gamma and inverse Gaussian distributions
  • positive data with exact zeros may be modeled by a special case of Tweedie distribution.
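
In R these choices map onto the family argument of glm() and related functions; a hedged sketch of the correspondence (y and x are placeholders, not variables from the text):

# glm(y ~ x, family=gaussian)                           # continuous, whole real line
# glm(cbind(successes, failures) ~ x, family=binomial)  # proportions of counts
# glm(y ~ x, family=poisson)                            # counts; MASS::glm.nb() for negative binomial
# glm(y ~ x, family=Gamma(link="log"))                  # positive continuous
# glm(y ~ x, family=inverse.gaussian(link="log"))       # positive continuous
# positive data with exact zeros: a Tweedie family, e.g. statmod::tweedie(var.power, link.power)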

maximum likelihood method

  • least-squares is an appropriate criterion for fitting regression models to response data that are normally distributed.
  • maximum likelihood is appropriate for estimating the parameters of non-normal models with binomial, Poisson or gamma distributed responses.
    • can be applied whenever a specific probability distribution has been proposed for the data
    • estimating unknown parameters by choosing those that maximize the probability density of the data

suppose \(y_1, \dots, y_n\) are independent observations from an exponential distribution with rate parameter \(\theta\), with pdf \(\mathcal{P}(y;\theta)=\theta\exp (-y\theta)\); the joint pdf (the likelihood) and the log-likelihood of the data are \[ \begin{aligned} \mathcal{P}(y_1, \dots, y_n|\theta)&=\prod_{i=1}^{n}\mathcal{P}(y_i|\theta)=\theta^n\exp \bigg(-\theta\sum_{i=1}^{n} y_i \bigg) \\ \mathcal{L}(\theta;y)&=n(\log \theta-\bar{y}\theta) \end{aligned} \] It is easy to show that least squares is a special case of maximum likelihood: consider a normal linear regression model, \(y_i \sim N(\mu_i, \sigma^2)\) with \(\mu_i=\beta_0+\beta_1x_{1i}+\dots+\beta_px_{pi}\); the normal pdf is \[ f(y_i; \mu_i, \sigma^2)=(2 \pi \sigma^2)^{-1/2}\exp\bigg \{-\frac{(y_i-\mu_i)^2}{2\sigma^2} \bigg \} \] hence the log pdf for a single observation \(y_i\) is
\[ \log f(y_i; \mu_i, \sigma^2)=-\frac{1}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2}(y_i-\mu_i)^2 \] the log-likelihood function for the unknown parameters is \[ \begin{aligned} \mathcal{L}(\beta_0, \dots, \beta_p, \sigma^2;y)&=-\frac{n}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2 \\ &=-\frac{n}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2} \text{SS} \end{aligned} \] maximizing the likelihood over the \(\beta_j\) is therefore the same as minimizing the sum of squares SS, so maximum likelihood coincides with least squares for normal regression models.
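
A small numeric check with simulated data (not from the text): maximizing the normal log-likelihood with a general-purpose optimizer reproduces the least-squares coefficients from lm().

set.seed(1)
x <- runif(50); y <- 1 + 2*x + rnorm(50, sd=0.3)
negll <- function(par) {   # negative log-likelihood; par = (beta0, beta1, log sigma)
  -sum( dnorm(y, mean=par[1] + par[2]*x, sd=exp(par[3]), log=TRUE) )
}
fit.ml <- optim( c(0, 0, 0), negll )
rbind( ML=fit.ml$par[1:2], LS=coef(lm(y ~ x)) ) # essentially identical coefficients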

MLE properties

  • invariant
  • consistent
  • asymptotically unbiased: \(E(\hat{\beta}_{\text{MLE}}) \rightarrow \beta\) as \(n \rightarrow \infty\)
  • asymptotically efficient: its variance attains the Cramér–Rao lower bound asymptotically (lowest variance among asymptotically unbiased estimators)
  • asymptotically normally distributed

MLE parameter estimation

  • single parameter
  • multiple parameters

MLE hypothesis testing

  • \(H_0: \beta_j=0\) vs. \(H_1: \beta_j \ne 0\)
    • LRT
    • Wald test and score test, based on the asymptotic \(\chi^2_1\) distribution of the test statistic
  • global test: \(H_0: \beta_1=\beta_2=\dots=\beta_p=0\)

MLE confidence interval

  • invert the acceptance region of a hypothesis test (e.g., a Wald or likelihood ratio test)

example

The total July rainfall (in millimetres) at Quilpie, Australia, has been recorded, together with the monthly mean value of the southern oscillation index (SOI). We define the response variable \(y\) as an indicator of whether the rainfall exceeds 10 mm, so \(y\) follows a Bernoulli distribution with probability \(\mu = P(y=1)\).

require(GLMsData)
data(quilpie)
names(quilpie)
## [1] "Year"   "Rain"   "SOI"    "Phase"  "Exceed" "y"
mu <- c(0.2, 0.4, 0.5, 0.6, 0.8) # Candidate values to test
ll <- rep(0, 5) # A place-holder for the log-likelihood values
for (i in 1:5){
  ll[i] <- sum( dbinom(quilpie$y, size=1, prob=mu[i], log=TRUE))
}
data.frame(Mu=mu, LogLikelihood=ll)
##    Mu LogLikelihood
## 1 0.2     -63.69406
## 2 0.4     -48.92742
## 3 0.5     -47.13401
## 4 0.6     -48.11649
## 5 0.8     -60.92148
muhat <- mean(quilpie$y)
muhat # mle = sample mean
## [1] 0.5147059
boxplot( SOI ~ Exceed, horizontal=TRUE, data=quilpie, las=2,
         xlab="July average SOI", ylab="Rainfall exceeds threshold" )

plot( jitter(y, 0.15) ~ SOI, data=quilpie, pch=19, axes=FALSE, las=2,
      xlab="July average SOI", ylab="Rainfall exceeds threshold" )
axis(side=1, las=2)
axis(side=2, at=0:1, labels=c("No", "Yes"), las=2); box()

cdplot( Exceed ~ SOI, data=quilpie,
        xlab="July average SOI", ylab="Rainfall exceeds threshold" )

mu0 <- 0.5 # null value
n <- length(quilpie$y)
varmu <- muhat*(1-muhat)/n

# wald test
W <- (muhat - mu0)^2 / varmu
W
## [1] 0.05887446
# score test
S <- (muhat - mu0)^2 / ( mu0*(1-mu0)/n )
S
## [1] 0.05882353
# LRT
Lmu0 <- sum( dbinom(quilpie$y, 1, mu0, log=TRUE ) )
Lmuhat <- sum( dbinom(quilpie$y, 1, muhat, log=TRUE ) )
L <- 2*(Lmuhat - Lmu0)
L
## [1] 0.05883201
c( Wald=W, score=S, LLR=L)
##       Wald      score        LLR 
## 0.05887446 0.05882353 0.05883201
P.W <- pchisq(W, df=1, lower.tail=FALSE) # Wald
P.S <- pchisq(S, df=1, lower.tail=FALSE) # Score
P.L <- pchisq(L, df=1, lower.tail=FALSE) # Likelihood ratio
round(c(Wald=P.W, Score=P.S, LLR=P.L), 3 )
##  Wald Score   LLR 
## 0.808 0.808 0.808
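
The same quantities give a Wald confidence interval for \(\mu\) (a brief sketch):

se.muhat <- sqrt(varmu)
muhat + c(-1, 1) * qnorm(0.975) * se.muhat # 95% Wald CI for mu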

GLM

generalized linear models: structure

random component

Specifying the random component by assuming the responses come from the EDM family:

The exponential dispersion model (EDM) family includes the normal and gamma distributions among continuous probability density functions, and the Poisson, binomial and negative binomial distributions among discrete probability mass functions.

GLMs are regression models linear in the parameters. GLM = systematic component + random component.

  • What probability distribution is appropriate? EDMs/Dispersion models
    • The answer determines the random component of the model.
    • the answer depends on the response data or knowledge of how the variance changes with the mean.
  • How are the explanatory variables related to the mean of the response? The answer suggests the systematic component of the model.
    • link function: \(g(\mu)=\eta=\beta_0+\sum\beta_jx_j\), where the mean \(\mu=E[Y]\) is linked to the linear combination of predictors through the link function \(g(\cdot)\); for the canonical link, \(g(\mu)=\theta\).
      • \(g(\cdot)\) is a monotonic, differentiable function relating the fitted \(\mu\) to \(\eta\).
      • this systematic component shows that GLMs are regression models linear in the parameters.
    • offsets: commonly seen in Poisson GLMs, may be present in other GLMs too

EDMs in canonical form: \[ \mathcal{P}(y|\theta, \phi)=a(y,\phi)\exp\bigg\{ \frac{y\theta-k(\theta)}{\phi}\bigg\} \] where

  • \(\theta\) is the canonical parameter.
  • \(k(\theta)\) is a known function, called the cumulant function.
  • \(\phi>0\) is the dispersion parameter.
  • \(a(y,\phi)\) is a normalizing function ensuring that \(\mathcal{P}(y|\theta,\phi)\) is a proper probability function, i.e., it integrates (or sums) to 1 over its support.

The mean and variance of the response can be found from the cumulant function:

  • \(E[Y]=\mu=\frac{d k(\theta)}{d \theta}\) which is a known function of \(\theta\)
  • \(var[Y]=\phi\frac{d^2 k(\theta)}{d \theta^2}=\phi\frac{d \mu}{d \theta}\)
  • define variance function as \(V(\mu)=\frac{d \mu}{d \theta}\), then \(var[Y]=\phi V(\mu)\)
    • the variance function \(V(\mu)\) uniquely defines the distribution within the EDMs
    • \(var[Y]=\phi V(\mu)\) shows that the variance of an EDM depends on the mean, with the exception of the normal distribution, for which \(V(\mu)=1\)

an example, using the noisy miner data (nminer), follows:

data(nminer)
breaks <- c(-Inf, 4, 11, 15, 19, Inf) + 0.5 # Break points
Eucs.cut <- cut(nminer$Eucs, breaks )
summary(Eucs.cut)
##  (-Inf,4.5]  (4.5,11.5] (11.5,15.5] (15.5,19.5] (19.5, Inf] 
##           9           6           5           6           5
mn <- tapply( nminer$Minerab, Eucs.cut, "mean" ) # Mean of each group
vr <- tapply( nminer$Minerab, Eucs.cut, "var" ) # Var of each group
sz <- tapply( nminer$Minerab, Eucs.cut, "length" ) # Num. in each group
cbind("Group size"=sz, "Group mean"=mn, "Group variance"=vr)
##             Group size Group mean Group variance
## (-Inf,4.5]           9  0.1111111      0.1111111
## (4.5,11.5]           6  0.5000000      1.5000000
## (11.5,15.5]          5  3.8000000     11.2000000
## (15.5,19.5]          6  4.3333333      7.8666667
## (19.5, Inf]          5  7.0000000     48.5000000
plot(jitter(Minerab)~(Eucs), pch=1, las=1, data=nminer, ylim=c(0, 20),
     xlab="Number of eucalypts/2 ha.", ylab="Number of noisy miners")
# Draw the dashed vertical lines
abline(v=breaks, lwd=1, lty=2, col="gray")

plot( log( vr ) ~ log ( mn ), pch=19, las=1, cex=0.45*sqrt(sz),
      xlab="Log of means", ylab="Log of variances" )

hm.lm <- lm( log( vr ) ~ log ( mn ), weights=sz )
coef(hm.lm)
## (Intercept)     log(mn) 
##    0.802508    1.295222
confint(hm.lm)
##                   2.5 %   97.5 %
## (Intercept) 0.007812159 1.597204
## log(mn)     0.821058278 1.769386

the slope of the linear regression line (weighted by the number of observations in each group) is \(b \approx 1.3\), suggesting the variance is approximately proportional to the mean. The intercept on the log scale is not far from zero, so the proportionality constant \(\phi\) is of order one, roughly as required for the Poisson distribution. In other words, \(V (\mu) = \mu\) approximately.

EDMs in dispersion model form: instead of writing the pdf in terms of \(\theta\), express it in terms of \(\mu\), since \(\mu\) has a clear interpretation as the mean of the distribution.

\[ \mathcal{P}(y|\theta, \phi)=b(y,\phi)\exp\bigg\{ -\frac{1}{2 \phi} d(y, \mu) \bigg\} \]

where:

  • \(d(y, \mu)\) is called the unit deviance: \(d(y, \mu)=2\{t(y,y)-t(y,\mu)\}\), where \(t(y, \mu)=y\theta-k(\theta)\); it can be interpreted as a type of distance measure between \(y\) and \(\mu\).
  • \(D(y,\mu)\) is the deviance function, whose value is the total deviance: an overall measure of the distance between the \(y_i\) and the \(\mu_i\).
    • \(D(y,\mu)=\sum_{i=1}^{n} w_id(y_i, \mu_i)\)
  • \(D^*(y, \mu)=D(y,\mu)/\phi\) is the scaled deviance function, whose value is the scaled total deviance

Common pdfs of the EDM family in the dispersion model form:

  • Poisson distribution: \(\begin{cases} d(y,\mu)=2\bigg \{ y \log\frac{y}{\mu} -(y-\mu)\bigg \}, \quad y \ne 0 \\ d(0,\mu)=2\mu, \quad y=0 \end{cases}\)
  • Normal distribution: \(d(y, \mu)=(y-\mu)^2\)
  • Gamma distribution: \(d(y,\mu)=2\bigg \{ - \log\frac{y}{\mu} + \frac{(y-\mu)}{\mu}\bigg \}\)
  • and more
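
A small numeric check with toy values (not from the text) that the Poisson unit deviance above agrees with the unit deviance computed by R's own family object:

y  <- c(0, 1, 3, 5)                                # toy counts
mu <- c(0.5, 1.2, 2.5, 4.0)                        # toy fitted means
d  <- ifelse( y==0, 2*mu, 2*( y*log(y/mu) - (y - mu) ) )
sum(d)                                             # total deviance D(y, mu)
sum( poisson()$dev.resids(y, mu, wt=rep(1, 4)) )   # same value from R's family object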

systematic component

GLMs assume a specific form of the systematic component, linking the mean to the linear predictor through a link function \(g(\cdot)\) so that \(g(\mu)=\eta\).

\[ \eta=\beta_0+\sum_{j=1}^{p}\beta_jx_j \]

the systematic component shows that GLMs are regression models linear in the parameters.

  • normal: \(\theta=\mu\), the canonical link function is \(g(\mu)=\mu \Rightarrow \eta=\mu\), identity link function.
  • Poisson: the canonical link function is \(g(\mu)=\log \mu\) so that \(\log \mu =\eta\)

in some cases, the linear predictor contains a term that requires no estimation, called an offset. For example, consider modelling the annual number of hospital births in various cities: the annual number of births is a count, so a Poisson distribution may be suitable; however, the expected annual number of births \(\mu_i\) in city \(i\) depends on the population \(P_i\) of that city, since cities with larger populations would generally be expected to have more births each year. The number of births per unit of population, assuming a log link function, can be modelled using the systematic component \(\log(\mu/P)=\eta\); rearranging gives \(\log\mu=\log P+\eta\). The term \(\log P\) is known and need not be estimated; it is called an offset. Offsets commonly appear in Poisson GLMs, but may appear in any GLM.

  • the offset term is commonly a measure of exposure. For example, the number of cases of a disease recorded in various mines depends on the number of workers and on the number of years each worker has worked in the mine; the exposure is the number of person-years worked in each mine, which could enter as an offset. A mine with many workers employed for many years is exposed to a greater likelihood of disease than one with only a few, less experienced workers. See the sketch below.
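
A minimal sketch of the hospital-births model as a log-link Poisson GLM with an offset; the numbers below are made up purely for illustration:

# log(mu) = log(P) + eta, so log(P) enters the linear predictor as an offset
births <- data.frame( count = c(120, 350, 1570, 2100),          # annual births (hypothetical)
                      pop   = c(25000, 81000, 310000, 450000),  # city populations (hypothetical)
                      x     = c(0.3, 0.7, 1.2, 1.5) )           # some explanatory variable
glm( count ~ x + offset(log(pop)), family=poisson, data=births )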

GLMs defined

Generalized Linear Models is defined as follows: \[ \begin{cases} y_i \sim \text{EDM}(\mu_i, \phi/w_i) \\ g(\mu_i)=o_i+\beta_0+\sum\beta_jx_j \end{cases} \]

where \(o_i\) is the offset term.

GLM parameter estimation

mle of regression coefficients

deriving the MLEs of the GLMs' coefficients:

  • write out the log-likelihood
  • set the first derivative (the score) with respect to each \(\beta_j\) to zero; the second derivative gives the information used by the algorithms below. In general these score equations have no closed-form solution.

computational algorithms for MLE calculation:

  • Newton-Raphson
  • Fisher scoring (equivalent to iteratively reweighted least squares), used by R's glm(); see the sketch after the output below
data("nminer")
nm.m1 <- glm( Minerab ~ Eucs, data=nminer,
              family=poisson(link="log"),
              control=list(trace=TRUE) ) # Shows the deviance each iteration
## Deviance = 82.14668 Iterations - 1
## Deviance = 64.49515 Iterations - 2
## Deviance = 63.32603 Iterations - 3
## Deviance = 63.31798 Iterations - 4
## Deviance = 63.31798 Iterations - 5
nm.m1
## 
## Call:  glm(formula = Minerab ~ Eucs, family = poisson(link = "log"), 
##     data = nminer, control = list(trace = TRUE))
## 
## Coefficients:
## (Intercept)         Eucs  
##     -0.8762       0.1140  
## 
## Degrees of Freedom: 30 Total (i.e. Null);  29 Residual
## Null Deviance:       150.5 
## Residual Deviance: 63.32     AIC: 121.5
  • R reports the deviance at each iteration and declares convergence when its relative change is sufficiently small
  • the residual deviance is the minimized total deviance after computing the MLEs of the \(\beta_j\) and the corresponding fitted values \(\hat{\mu}\): \(D(y, \hat{\mu})=\sum_{i=1}^{n}w_id(y_i, \hat{\mu}_i)\). It plays the same role as the RSS, which measures the residual variability across the \(n\) observations in normal linear regression models.
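
A hedged sketch of Fisher scoring in its IRLS form for the Poisson log-link model above; this is not the internal code of glm(), but it should reproduce coef(nm.m1) to within the convergence tolerance:

X    <- model.matrix( ~ Eucs, data=nminer )   # design matrix
y    <- nminer$Minerab
beta <- c( log(mean(y)), 0 )                  # starting values
for (i in 1:25) {
  eta  <- drop( X %*% beta )                  # linear predictor
  mu   <- exp(eta)                            # inverse log link
  z    <- eta + (y - mu)/mu                   # working response
  W    <- mu                                  # working weights: (dmu/deta)^2 / V(mu) = mu
  beta <- solve( crossprod(X, W*X), crossprod(X, W*z) )
}
drop(beta)                                    # compare with coef(nm.m1)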

MLE of scale parameter

  • analytical solution for linear regression scale param.

for normal linear regression models, the MLE of \(\phi=\sigma^2\) is \(\hat{\sigma}^2=\frac{1}{n}\sum w_i(y_i-\hat{\mu}_i)^2\), which is biased. Instead, the sample variance \(s^2=\frac{1}{n-p'}\sum w_i(y_i-\hat{\mu}_i)^2\) is unbiased and is used in practice.

  • generalize to GLMs: modified profile log-likelihood (MPL) estimator

write out the log-likelihood as a function of the scale param,

\[ \mathcal{L^{0}}(\phi)=\frac{p'}{2}\log \phi+\mathcal{L}(\hat{\beta}_0, \dots, \hat{\beta}_p;\phi;y) \] compute \(\hat{\phi}^{0}\) as the value that maximizes the modified profile log-likelihood \(\mathcal{L^{0}}(\phi)\). The MPL estimator is consistent and approximately unbiased, even when the sample size is small.

GLM inference

If \(\phi\) is known:

Wald tests for single coefficient

Wald tests are the simplest tests since they depend only on the estimated coefficients and standard errors.

The estimated regression coefficients \(\hat{\beta}_j\) are approximately normally distributed when the sample size is reasonably large (by the central limit theorem); this is the basis of Wald tests.

\[ H_0: \beta_j=\beta_j^0 \quad \text{vs.} \quad H_1: \beta_j \ne \beta_j^0 \]

the Wald test statistic is given by \(Z=\frac{\hat{\beta}_j-\beta_j^0}{\text{se}(\hat{\beta}_j)}\), where \(\text{se}(\hat{\beta}_j)=\sqrt{\phi}\,v_j\); if \(H_0\) is true, \(Z\) approximately follows a standard normal distribution.

confidence intervals for individual coefficients

the \(100(1-\alpha)\%\) equal-tail confidence interval for \(\beta_j\) when \(\phi\) is known is given by \(\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{se}(\hat{\beta}_j)\).
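
In R, Wald-type intervals of this form are given by confint.default(), while confint() applied to a GLM profiles the likelihood instead; for example, using the Poisson model nm.m1 fitted earlier:

confint.default(nm.m1)   # Wald intervals: estimate +/- z_{alpha/2} * se
confint(nm.m1)           # profile-likelihood intervals (MASS method on older R versions)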

LRTs for nested models

\[ \begin{aligned} &\text{Model A}: g(\mu_A)=\beta_0+\beta_1x_1+\cdots+\beta_ax_a \\ &\text{Model B}: g(\mu_B)=\beta_0+\beta_1x_1+\cdots+\beta_ax_a+\cdots+\beta_bx_b \\ &H_0: \beta_{a+1}=\cdots=\beta_b=0 \end{aligned} \] the LRT statistic is given by \(L=2\{\mathcal{L}(B)-\mathcal{L}(A)\}=\frac{D(y, \hat{\mu}_A)-D(y, \hat{\mu}_B)}{\phi} \sim \chi^2_{p_B-p_A}\) approximately, under \(H_0\).

Analysis of deviance tables to compare nested models

Score tests

Whereas Wald and likelihood ratio tests (LRTs) are used to test hypotheses about explanatory variables in the current fitted model, score tests enable testing of hypotheses about explanatory variables not (yet) in the current model, but which might be added.
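
A hedged sketch of a score test in R using glm.scoretest() from the statmod package, here applied to a made-up candidate predictor x.new that is not yet in nm.m1:

library(statmod)
x.new   <- rnorm( nrow(nminer) )              # hypothetical candidate predictor
z.score <- glm.scoretest( nm.m1, x.new )      # approximately N(0, 1) under H0
2 * pnorm( abs(z.score), lower.tail=FALSE )   # two-sided p-value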

goodness-of-fit tests

the tests above concern individual regression coefficients (or sets of them); a different question is whether the explanatory variables in the current model are sufficient. A test of this kind is only possible when \(\phi\) is known, because it requires a known distribution for the residual variability.

A goodness-of-fit test compares the current model (Model A, say) with an alternative model (Model B) of a particular type: Model B is the largest possible model that can, in principle, be fitted to the data (the saturated model).

  • Deviance goodness-of-fit test
  • Pearson goodness-of-fit test
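
A brief sketch of both goodness-of-fit tests applied to the Poisson model nm.m1, for which \(\phi=1\) is known:

D  <- deviance(nm.m1)                          # residual (total) deviance
X2 <- sum( resid(nm.m1, type="pearson")^2 )    # Pearson statistic
df <- df.residual(nm.m1)
c( Deviance=pchisq(D, df, lower.tail=FALSE),   # small p-values suggest lack of fit
   Pearson =pchisq(X2, df, lower.tail=FALSE) )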

If \(\phi\) is unknown:

Wald tests for single regression coefficients

the Wald test statistic becomes

\[ T=\frac{\hat{\beta}_j-\beta_j^0}{\text{se}(\hat{\beta}_j)} \] where the standard error \(\text{se}(\hat{\beta}_j)=s\,v_j\) requires a suitable estimator \(s^2\) of \(\phi\); R uses the Pearson estimator \(s^2=\bar{\phi}\). Under \(H_0\), \(T\) approximately follows a \(t\)-distribution.

confidence intervals for individual coefficients

switch to the \(t\)-distribution instead of the standard normal distribution: \(\hat{\beta}_j \pm t_{\alpha/2,\,n-p'} \cdot \text{se}(\hat{\beta}_j)\).

LRTs to compare nested models: F-tests

if \(\phi\) is unknown, an estimate of \(\phi\) has to be used. An appropriate test statistic for comparing Model A, nested in Model B, is \(F=\frac{[D(y,\hat{\mu}_A)-D(y,\hat{\mu}_B)]/(p_B'-p_A')}{s^2}\), where \(p_A'\) and \(p_B'\) are the numbers of parameters in Models A and B respectively.

data(trees)

# F tests for model comparison
tr.m0 <- glm( Volume ~ 1, family=Gamma(link="log"), data=trees)
tr.m1 <- update(tr.m0,.~.+ log(Girth) )
tr.m2 <- update(tr.m1,.~.+ log(Height) )
c( deviance(tr.m0), deviance(tr.m1), deviance(tr.m2) )
## [1] 8.3172012 0.3840839 0.1835153
c( df.residual(tr.m0), df.residual(tr.m1), df.residual(tr.m2) )
## [1] 30 29 28
dev1 <- deviance(tr.m0) - deviance(tr.m1)
dev2 <- deviance(tr.m1) - deviance(tr.m2)
df1 <- df.residual(tr.m0) - df.residual(tr.m1)
df2 <- df.residual(tr.m1) - df.residual(tr.m2)
c( dev1, dev2)
## [1] 7.9331173 0.2005686
c( df1, df2)
## [1] 1 1
phi.meandev <- deviance(tr.m2) / df.residual(tr.m2) # Mean dev. estimate
phi.Pearson <- summary(tr.m2)$dispersion # Pearson estimate
c("Mean deviance" = phi.meandev, "Pearson" = phi.Pearson )
## Mean deviance       Pearson 
##   0.006554117   0.006427286
F.Pearson <- c( dev1/df1, dev2/df2 ) / phi.Pearson # calculate obs.
F.meandev <- c( dev1/df1, dev2/df2 ) / phi.meandev
P.Pearson <- pf( F.Pearson, df1, df.residual(tr.m2), lower.tail=FALSE ) # calculate p-value
P.meandev <- pf( F.meandev, df2, df.residual(tr.m2), lower.tail=FALSE )
tab <- data.frame(F.Pearson, P.Pearson, F.meandev, P.meandev)
rownames(tab) <- c("Girth","Height")
print(tab, digits=3)
##        F.Pearson P.Pearson F.meandev P.meandev
## Girth     1234.3  1.05e-24    1210.4  1.38e-24
## Height      31.2  5.60e-06      30.6  6.50e-06

analysis of deviance tables to compare nested models

to compare a series of GLMs, an analysis of deviance table is used:

anova(tr.m2, test="F")
## Analysis of Deviance Table
## 
## Model: Gamma, link: log
## 
## Response: Volume
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev        F    Pr(>F)    
## NULL                           30     8.3172                       
## log(Girth)   1   7.9331        29     0.3841 1234.287 < 2.2e-16 ***
## log(Height)  1   0.2006        28     0.1835   31.206 5.604e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GLM Comparison

  • Nested models
    • LRT
    • Drop-in-Deviance test
  • Non-nested models
    • if \(\phi\) is unknown
      • AIC\(=-2\times \mathcal{L}(\hat{\beta}, \hat{\phi};y)+2(p'+1)\)
      • BIC\(=-2\times \mathcal{L}(\hat{\beta}, \hat{\phi};y)+(\log n)(p'+1)\)
      • \(\hat{\phi}\) is the MLE of \(\phi\)
      • smaller values of the AIC/BIC (closer to \(-\infty\)) represent better models.
    • if \(\phi\) is known
      • AIC\(=-2\times \mathcal{L}(\hat{\beta}, \hat{\phi};y)+2p'\)
      • BIC\(=-2\times \mathcal{L}(\hat{\beta}, \hat{\phi};y)+(\log n)p'\)
  • Automated model selection
    • forward/backward/mixed step-wise variable selection
    • the same automatic procedures used in normal linear regression case can also be used for GLMs: drop1(), add1(), step()
# information criteria
tr.aic1 <- glm( Volume ~ offset(2*log(Girth)) + log(Height),
                family=Gamma(link="log"), data=trees)
tr.aic2 <- glm( Volume ~ log(Girth) + offset(log(Height)),
                family=Gamma(link="log"), data=trees)
c(AIC(tr.aic1), AIC(tr.aic2))
## [1] 137.9780 138.3677
# automated selection
min.model <- glm( Volume~1, data=trees, family=Gamma(link="log"))
max.model <- glm( Volume~log(Girth) + log(Height), data=trees, family=Gamma(link="log"))
m.b <- step(min.model, scope=list(lower=min.model, upper=max.model), direction="both")
## Start:  AIC=255.48
## Volume ~ 1
## 
##               Df Deviance    AIC
## + log(Girth)   1   0.3841 230.75
## + log(Height)  1   4.7827 245.57
## <none>             8.3172 255.48
## 
## Step:  AIC=160.83
## Volume ~ log(Girth)
## 
##               Df Deviance    AIC
## + log(Height)  1   0.1835 147.82
## <none>             0.3841 160.83
## - log(Girth)   1   8.3172 752.55
## 
## Step:  AIC=139.9
## Volume ~ log(Girth) + log(Height)
## 
##               Df Deviance    AIC
## <none>             0.1835 139.90
## - log(Height)  1   0.3841 169.11
## - log(Girth)   1   4.7827 853.48
m.a <- step(min.model, scope=list(lower=min.model, upper=max.model), direction="backward") # note: starting from min.model, a backward-only search has nothing to drop; a true backward search would start from max.model
## Start:  AIC=255.48
## Volume ~ 1
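
drop1() and add1() examine single-term deletions or additions without running a full stepwise search; a brief sketch using the models defined above (F tests, since \(\phi\) is unknown for the gamma model):

drop1(max.model, test="F") # try deleting each term from the full model
add1(min.model, scope=~log(Girth)+log(Height), test="F") # try adding each term to the null model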

GLM Diagnostics

Assumptions of GLMs

the assumptions are never exactly true; it is important to be aware of how sensitive the conclusions are to deviations from the assumptions.

  • Lack of outliers: All responses were generated from the same process, so that the same model is appropriate for all the observations.
  • Link function: The correct link function g() is used.
  • Linearity: All important explanatory variables are included, and each explanatory variable is included in the linear predictor on the correct scale.
  • Variance function: The correct variance function \(V(\mu_i)\) is used.
  • Dispersion parameter: The dispersion parameter \(\phi\) is constant.
  • Independence: The responses \(y_i\) are independent of each other.
  • Distribution: The responses \(y_i\) come from the specified EDM.

residuals for GLMs

the distances \(y_i-\hat{\mu}_i\) are called response residuals and are the basis for residuals in linear regression. The response residuals are inadequate for assessing a fitted GLM because, for GLMs based on EDMs, the variance depends on the mean through the variance function \(V(\mu_i)\).

two responses can have the same distance \(y_i-\hat{\mu}_i\) while one is extreme and the other is not, because their variances differ. Ideally, we want residuals that behave like those from linear regression models: normally distributed with mean 0 and constant variance.

  • Pearson residuals: \(r_p=\frac{y-\hat{\mu}}{\sqrt{V(\hat{\mu})/w}}\)
  • Deviance residuals: \(r_D=\text{sign}(y-\hat{\mu})\sqrt{wd(y, \hat{\mu})}\), residual(fit) computes deviance residuals by default.
  • Quantile residuals: when the guidelines for GLM asymptotics are not met and the Pearson and deviance residuals clearly cannot be normally distributed, use qresid(fit) from the statmod package.
    • quantile residuals are especially encouraged for discrete EDMs
    • standardized or Studentized residuals are encouraged as they have more nearly constant variance.

checking GLM assumptions

  • independence: independence of the responses is the most important assumption.
    • plot residuals against lagged residuals
  • plot to check systematic component
    • plot standardized/quantile residuals against fitted values \(\hat{\mu}_i\)
      • to further check the link function, plot the working responses against \(\hat{\eta}_i\); any curvature suggests another link function is needed.
    • plot standardized/quantile residuals against predictor \(x_j\)
      • to determine if predictor is included on the correct scale, use partial residuals: resid(fit, type="partial")
      • Two features to check in these residual plots:
        • Trends: any trend indicates that the systematic component can be improved. This could mean changing the link function, adding extra explanatory variables, or transforming the explanatory variables.
        • Constant variation: if the random component is correct (that is, the correct EDM is used), the variance of the points is approximately constant.
data(trees)
cherry.m1 <- glm( Volume ~ log(Girth) + log(Height),
                  family=Gamma(link=log), data=trees)

scatter.smooth( rstandard(cherry.m1) ~ log(fitted(cherry.m1)), las=1,
                ylab="Standardized deviance residual", xlab="log(Fitted values)" )

scatter.smooth( rstandard(cherry.m1) ~ log(trees$Girth), las=1,
                ylab="Standardized deviance residual", xlab="log(Girth)" )

scatter.smooth( rstandard(cherry.m1) ~ log(trees$Height), las=1,
                ylab="Standardized deviance residual", xlab="log(Height)" )

# working responses vs. eta to check on systematic component
z <- resid(cherry.m1, type="working") + cherry.m1$linear.predictor
plot( z ~ cherry.m1$linear.predictor, las=1, 
      xlab="Linear predictor, eta-hat", 
      ylab="Working responses, z")
abline(a=0, b=1)

# partial residuals vs predictor for predictor scale 
termplot(cherry.m1, partial.resid=TRUE, las=1)

The plot of \(z_i\) against \(\hat{\eta}_i\) is also approximately linear, suggesting a suitable link function. The line shown on each termplot() represents the ideal relationship, suggesting the predictors are included on a suitable scale.

  • plot to check the random component
    • Q-Q plots can be used to determine if the choice of distribution is appropriate.
    • quantile residuals are used since they have an exact normal distribution.
library(statmod) # provides qresid()
qr.cherry <- qresid( cherry.m1 ) # extract quantile residuals
qqnorm( qr.cherry, las=1 ); qqline( qr.cherry)

  • outliers & influential obs
    • outliers: obs. inconsistent with the rest of data
    • influential obs: obs. that substantially change the fitted model when removed
      • influential observations are outliers with high leverage.
# outliers by studentized residual
rs <- cbind( rD=resid(cherry.m1), 
             "r'D"=rstandard(cherry.m1),
             "r''"=rstudent(cherry.m1), 
             rQ=qresid(cherry.m1))

apply( abs(rs), 2, max) # The maximum absolute for each residual
##       rD      r'D      r''       rQ 
## 0.166763 2.197761 2.329122 2.053011

Since \(\phi\) is small in this case, the saddlepoint approximation is suitable, and the quantile, standardized and Studentized residuals are very similar. No large residuals exist.

An approximation to Cook’s distance for GLMs is D as computed by the function cooks.distance() in R, where the Pearson estimator \(\bar{\phi}\) of \(\phi\) is used if it is unknown.

im <- influence.measures(cherry.m1)

im$infmat <- round(im$infmat, 3 )
head(im$infmat)
##   dfb.1_ dfb.l(G) dfb.l(H)  dffit cov.r cook.d   hat
## 1  0.015   -0.083    0.005  0.107 1.305  0.004 0.151
## 2  0.120   -0.082   -0.090  0.197 1.311  0.014 0.167
## 3  0.065   -0.021   -0.054  0.087 1.385  0.003 0.198
## 4 -0.011    0.021    0.004 -0.041 1.181  0.001 0.059
## 5  0.145    0.171   -0.170 -0.228 1.218  0.018 0.121
## 6  0.186    0.191   -0.212 -0.261 1.261  0.023 0.152
head(im$is.inf)
##   dfb.1_ dfb.l(G) dfb.l(H) dffit cov.r cook.d   hat
## 1  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 2  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 3  FALSE    FALSE    FALSE FALSE  TRUE  FALSE FALSE
## 4  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 5  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
## 6  FALSE    FALSE    FALSE FALSE FALSE  FALSE FALSE
colSums( im$is.inf )
##   dfb.1_ dfb.l(G) dfb.l(H)    dffit    cov.r   cook.d      hat 
##        0        0        0        0        3        0        0
# plot out im
cherry.cd <- cooks.distance( cherry.m1)
plot( cherry.cd, type="h", ylab="Cook's distance", las=1)

plot( dffits(cherry.m1), type="h", las=1, ylab="DFFITS")

infl <- which.max(cherry.cd) # The Observation number of largest D
infl # Which observation?
## 18 
## 18
cherry.cd[infl] # The value of D for that observation
##        18 
## 0.2067211

Three observations are identified as influential, but only by the covariance ratio (cov.r). Since none of the other measures identify these observations as influential, we should not be too concerned.

The value of Cook’s distance for Observation 18 is much larger than any others, but the observation is not identified as significantly influential.

Remedy

quasi-likelihood

when no EDM exists for a given mean-variance relationship, quasi-likelihood provides a way to conduct inference, e.g. via family=quasipoisson(). A sketch follows.
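
A minimal sketch using the noisy miner data loaded earlier: refitting the model by quasi-likelihood estimates the dispersion instead of fixing it at 1.

nm.quasi <- glm( Minerab ~ Eucs, data=nminer, family=quasipoisson() )
summary(nm.quasi)$dispersion             # Pearson estimate of the dispersion phi
printCoefmat( coef(summary(nm.quasi)) )  # same point estimates; standard errors scaled by sqrt(phi-hat)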

example

The GLM fitted to model the number of noisy miners from the number of eucalypt trees is:

data(nminer)
nm.m1 <- glm( Minerab ~ Eucs, data=nminer, family=poisson)
printCoefmat(coef(summary(nm.m1)))
##              Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -0.876211   0.282793 -3.0984  0.001946 ** 
## Eucs         0.113981   0.012431  9.1691 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# systematic & random components, outliers
qr <- qresid( nm.m1 )  # To find randomized quantile residuals
qqnorm(qr, las=1)
qqline(qr)

plot( qr ~ sqrt(fitted(nm.m1)), las=1 )

# influential points
plot( cooks.distance(nm.m1), type="h", las=1 )

plot( hatvalues(nm.m1), type="h", las=1)

maxhat <- which.max( hatvalues(nm.m1) ) # Largest leverage
maxqr <- which.max( abs(qr) ) # Largest abs. residual
maxinfl <- which.max( cooks.distance(nm.m1)) # Most influential
c( MaxLeverage=maxhat, MaxResid=maxqr, MaxInfluence=maxinfl)
##  MaxLeverage.11        MaxResid MaxInfluence.17 
##              11               7              17
which(influence.measures(nm.m1)$is.inf[,"cook.d"] )
## 17 
## 17

Only Observation 17 is flagged as influential: it has both a reasonably large residual and high leverage, and so it is influential.

Observe the changes in the regression coefficients after omitting obs.17:

nm.m2 <- glm( Minerab ~ Eucs, 
              family=poisson, data=nminer,
              subset=(-maxinfl)) # A negative index removes the obs.
c( "Original model"=coef(nm.m1), "Without Infl"=coef(nm.m2))
## Original model.(Intercept)        Original model.Eucs 
##                 -0.8762114                  0.1139813 
##   Without Infl.(Intercept)          Without Infl.Eucs 
##                 -1.0112791                  0.1247156
plot( Minerab ~ jitter(Eucs), data=nminer,
      xlab="Number of eucalypts", ylab="Number of noisy miners")
newE <- seq( 0, 35, length=100)
newM1 <- predict( nm.m1, newdata=data.frame(Eucs=newE), type="response")
newM2 <- predict( nm.m2, newdata=data.frame(Eucs=newE), type="response")
lines( newM1 ~ newE, lty=1)
lines( newM2 ~ newE, lty=2)