Linear regression models assume the variance of the response is constant, typically because the response is assumed to be normally distributed. Many data types, however, have variability that is not constant, which invalidates least-squares estimation and makes other methods necessary.
Three specific situations where the variation is not constant and linear regression models fail:
the response \(y\) is a proportion of \(m\) binary trials, so it lies between 0 and 1: a binomial distribution may be appropriate. \[ \begin{cases} my \sim \text{Bin}(\mu, m) \quad (\text{random component}) \\ \log\frac{\mu}{1-\mu}=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]
the response is a count: a Poisson distribution may be appropriate
\[ \begin{cases} y \sim \text{Pois}(\mu) \quad (\text{random component}) \\ \log\mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]
the response is positive and continuous: a gamma distribution may be appropriate \[ \begin{cases} y \sim \text{Gamma}(\mu;\phi) \quad (\text{random component}) \\ \mu=\beta_0+\beta_1x \quad (\text{systematic component}) \end{cases} \]
in these situations the relationship between \(y\) and the predictors is usually non-linear, and the response is bounded, so a linear relationship cannot hold for all values of the response.
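As a minimal sketch (using simulated data, so all numbers below are hypothetical), the three situations correspond to different family and link choices in glm():

set.seed(1)
n <- 50; x <- runif(n); m <- 10              # m binary trials per observation
# proportion out of m trials: binomial family, logit link
y.prop <- rbinom(n, size=m, prob=plogis(-1 + 2*x)) / m
fit.bin <- glm( y.prop ~ x, family=binomial, weights=rep(m, n))
# count response: Poisson family, log link
y.count <- rpois(n, lambda=exp(0.5 + 1.5*x))
fit.pois <- glm( y.count ~ x, family=poisson)
# positive continuous response: gamma family, identity link (as in the equations above)
y.pos <- rgamma(n, shape=2, rate=2/(1 + 2*x))   # mean = 1 + 2*x
fit.gam <- glm( y.pos ~ x, family=Gamma(link="identity"))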
The exponential family of distributions covers most common types of data:
Suppose \(y_1, \dots, y_n\) are independent observations from an exponential distribution with rate parameter \(\theta\), with pdf \(\mathcal{P}(y;\theta)=\theta\exp (-y\theta)\). The joint pdf/likelihood function of the data is \[ \begin{aligned} \mathcal{P}(y_1, \dots, y_n|\theta)&=\prod_{i=1}^{n}\mathcal{P}(y_i|\theta)=\theta^n\exp \bigg(-\theta\sum_{i=1}^{n} y_i \bigg) \\ \mathcal{L}(\theta;y)&=n(\log \theta-\bar{y}\theta) \end{aligned} \] It is easy to show that least squares is a special case of maximum likelihood. Consider a normal linear regression model, \(y_i \sim N(\mu_i, \sigma^2)\) with \(\mu_i=\beta_0+\beta_1x_{1i}+\dots+\beta_px_{pi}\); the normal pdf is \[ f(y_i; \mu_i, \sigma^2)=(2 \pi \sigma^2)^{-1/2}\exp\bigg \{-\frac{(y_i-\mu_i)^2}{2\sigma^2} \bigg \} \] hence the log pdf for a single observation \(y_i\) is
\[ \log f(y_i; \mu_i, \sigma^2)=-\frac{1}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2}(y_i-\mu_i)^2 \] and the log-likelihood function for the unknown parameters is \[ \begin{aligned} \mathcal{L}(\beta_0, \dots, \beta_p, \sigma^2;y)&=-\frac{n}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2 \\ &=-\frac{n}{2}\log(2 \pi \sigma^2)-\frac{1}{2\sigma^2} \text{SS} \end{aligned} \] Maximizing the likelihood over the coefficients is therefore the same as minimizing the sum of squares, so maximum likelihood and least squares give the same coefficient estimates for normal regression models.
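A quick sketch of this equivalence: fitting a normal linear model by maximum likelihood (glm with the gaussian family) and by least squares (lm) gives identical coefficient estimates. The trees data set (base R) is used only as a convenient example.

data(trees)
coef( lm(Volume ~ Girth + Height, data=trees) )                      # least squares
coef( glm(Volume ~ Girth + Height, family=gaussian, data=trees) )    # maximum likelihood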
The total July rainfall (in millimetres) at Quilpie, Australia, has been recorded, together with the monthly mean southern oscillation index (SOI). We define the response variable \(y\) as the indicator that the rainfall exceeds 10 mm; \(y\) therefore follows a Bernoulli distribution, whose probability we write as \(\mu\).
require(GLMsData)
data(quilpie)
names(quilpie)
## [1] "Year" "Rain" "SOI" "Phase" "Exceed" "y"
mu <- c(0.2, 0.4, 0.5, 0.6, 0.8) # Candidate values to test
ll <- rep(0, 5) # A place-holder for the log-likelihood values
for (i in 1:5){
ll[i] <- sum( dbinom(quilpie$y, size=1, prob=mu[i], log=TRUE))
}
data.frame(Mu=mu, LogLikelihood=ll)
## Mu LogLikelihood
## 1 0.2 -63.69406
## 2 0.4 -48.92742
## 3 0.5 -47.13401
## 4 0.6 -48.11649
## 5 0.8 -60.92148
muhat <- mean(quilpie$y)
muhat # mle = sample mean
## [1] 0.5147059
boxplot( SOI ~ Exceed, horizontal=TRUE, data=quilpie, las=2,
xlab="July average SOI", ylab="Rainfall exceeds threshold" )
plot( jitter(y, 0.15) ~ SOI, data=quilpie, pch=19, axes=FALSE, las=2,
xlab="July average SOI", ylab="Rainfall exceeds threshold" )
axis(side=1, las=2)
axis(side=2, at=0:1, labels=c("No", "Yes"), las=2); box()
cdplot( Exceed ~ SOI, data=quilpie,
xlab="July average SOI", ylab="Rainfall exceeds threshold" )
mu0 <- 0.5 # null value
n <- length(quilpie$y)
varmu <- muhat*(1-muhat)/n
# wald test
W <- (muhat - mu0)^2 / varmu
W
## [1] 0.05887446
# score test
S <- (muhat - mu0)^2 / ( mu0*(1-mu0)/n )
S
## [1] 0.05882353
# LRT
Lmu0 <- sum( dbinom(quilpie$y, 1, mu0, log=TRUE ) )
Lmuhat <- sum( dbinom(quilpie$y, 1, muhat, log=TRUE ) )
L <- 2*(Lmuhat - Lmu0)
L
## [1] 0.05883201
c( Wald=W, score=S, LLR=L)
## Wald score LLR
## 0.05887446 0.05882353 0.05883201
P.W <- pchisq(W, df=1, lower.tail=FALSE) # Wald
P.S <- pchisq(S, df=1, lower.tail=FALSE) # Score
P.L <- pchisq(L, df=1, lower.tail=FALSE) # Likelihood ratio
round(c(Wald=P.W, Score=P.S, LLR=P.L), 3 )
## Wald Score LLR
## 0.808 0.808 0.808
Specifying the random component by assuming the responses come from the EDM family:
The exponential dispersion model (EDM) family includes the normal and gamma distributions among continuous distributions, and the Poisson, binomial and negative binomial distributions among discrete distributions.
GLMs are regression models linear in the parameters. GLM = systematic component + random component.
EDMs in canonical form: \[ \mathcal{P}(y|\theta, \phi)=a(y,\phi)\exp\bigg\{ \frac{y\theta-\kappa(\theta)}{\phi}\bigg\} \] where \(\theta\) is the canonical parameter, \(\kappa(\theta)\) is the cumulant function, \(\phi>0\) is the dispersion parameter, and \(a(y,\phi)\) is a normalizing function.
The mean and variance of the response can be found from the cumulant function: \(\text{E}[y]=\mu=\kappa'(\theta)\) and \(\text{var}[y]=\phi\,\kappa''(\theta)=\phi\,V(\mu)\), where \(V(\mu)\) is the variance function.
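For example (a standard calculation, included here for completeness), writing the Poisson distribution in canonical form identifies \(\theta\), \(\kappa(\theta)\) and \(\phi\), and differentiating the cumulant function recovers the familiar mean and variance:
\[
\begin{aligned}
\mathcal{P}(y\mid\mu) &= \frac{\mu^y e^{-\mu}}{y!} = \frac{1}{y!}\exp\{y\log\mu-\mu\},
\quad\text{so } \theta=\log\mu,\ \kappa(\theta)=e^{\theta},\ \phi=1, \\
\text{E}[y] &= \kappa'(\theta)=e^{\theta}=\mu, \qquad
\text{var}[y]=\phi\,\kappa''(\theta)=e^{\theta}=\mu \;\Rightarrow\; V(\mu)=\mu.
\end{aligned}
\]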
An empirical example follows, using the noisy miner data (nminer) to examine the mean–variance relationship:
data(nminer)
breaks <- c(-Inf, 4, 11, 15, 19, Inf) + 0.5 # Break points
Eucs.cut <- cut(nminer$Eucs, breaks )
summary(Eucs.cut)
## (-Inf,4.5] (4.5,11.5] (11.5,15.5] (15.5,19.5] (19.5, Inf]
## 9 6 5 6 5
mn <- tapply( nminer$Minerab, Eucs.cut, "mean" ) # Mean of each group
vr <- tapply( nminer$Minerab, Eucs.cut, "var" ) # Var of each group
sz <- tapply( nminer$Minerab, Eucs.cut, "length" ) # Num. in each group
cbind("Group size"=sz, "Group mean"=mn, "Group variance"=vr)
## Group size Group mean Group variance
## (-Inf,4.5] 9 0.1111111 0.1111111
## (4.5,11.5] 6 0.5000000 1.5000000
## (11.5,15.5] 5 3.8000000 11.2000000
## (15.5,19.5] 6 4.3333333 7.8666667
## (19.5, Inf] 5 7.0000000 48.5000000
plot(jitter(Minerab)~(Eucs), pch=1, las=1, data=nminer, ylim=c(0, 20),
xlab="Number of eucalypts/2 ha.", ylab="Number of noisy miners")
# Draw the dashed vertical lines
abline(v=breaks, lwd=1, lty=2, col="gray")
plot( log( vr ) ~ log ( mn ), pch=19, las=1, cex=0.45*sqrt(sz),
xlab="Log of means", ylab="Log of variances" )
hm.lm <- lm( log( vr ) ~ log ( mn ), weights=sz )
coef(hm.lm)
## (Intercept) log(mn)
## 0.802508 1.295222
confint(hm.lm)
## 2.5 % 97.5 %
## (Intercept) 0.007812159 1.597204
## log(mn) 0.821058278 1.769386
the slope of the linear regression line (weighted by the number of observations in each group) is \(b \approx 1.3\), with a confidence interval that includes 1, suggesting the variance is approximately proportional to the mean. In addition, the intercept (an estimate of \(\log\phi\) under the model \(\text{var}=\phi\,\mu^b\)) is not large, so \(\phi\) is roughly of order one, broadly as required for the Poisson distribution. In other words, \(V(\mu) = \mu\) approximately.
EDMs in dispersion model form: instead of writing the pdf in terms of \(\theta\), express it in terms of \(\mu\), which has a direct interpretation as the mean of the distribution.
\[ \mathcal{P}(y|\theta, \phi)=b(y,\phi)\exp\bigg\{ -\frac{1}{2 \phi} d(y, \mu) \bigg\} \]
where \(d(y,\mu) \ge 0\) is the unit deviance and \(b(y,\phi)\) is a normalizing function.
The common members of the EDM family (normal, Poisson, binomial, gamma) can each be written in this dispersion model form.
GLMs assume a specific form of the systematic component, linking the mean to the linear predictor through a link function \(g(\cdot)\) so that \(g(\mu)=\eta\), where
\[ \eta=\beta_0+\sum_{j=1}^{p}\beta_jx_j \]
the systematic component shows that GLMs are regression models linear in the parameters.
In some cases the linear predictor contains a term that requires no estimation, called an offset. For example, consider modelling the annual number of hospital births in various cities. The annual number of births is a count, so a Poisson distribution may be suitable; however, the expected annual number of births \(\mu_i\) in city \(i\) depends on the population \(P_i\) of the city, since cities with larger populations would generally be expected to have more births each year. The number of births per unit of population, assuming a log link, can be modelled with the systematic component \(\log(\mu/P)=\eta\); rearranging gives \(\log\mu=\log P+\eta\). The term \(\log P\) is known and need not be estimated; it is the offset (see the sketch after the model definition below). Offsets appear most often in Poisson GLMs, but may appear in any GLM.
Generalized Linear Models is defined as follows: \[ \begin{cases} y_i \sim \text{EDM}(\mu_i, \phi/w_i) \\ g(\mu_i)=o_i+\beta_0+\sum\beta_jx_j \end{cases} \]
where \(o_i\) is the offset term and the \(w_i\) are known prior weights.
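A minimal sketch with simulated (hypothetical) data showing how an offset enters a Poisson GLM: log(P) is added to the linear predictor with a fixed coefficient of 1 and is not estimated.

set.seed(2)
cities <- data.frame( P=round(runif(20, 1e4, 1e6)),   # city populations (simulated)
                      x=rnorm(20))                    # an explanatory variable (simulated)
cities$births <- rpois(20, lambda=cities$P * exp(-4 + 0.3*cities$x))
births.glm <- glm( births ~ x + offset(log(P)), family=poisson, data=cities)
coef(births.glm)   # only beta_0 and beta_1 are estimated; the offset is not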
Analytical solutions for the MLEs of the GLM coefficients do not exist in general, so the estimates are computed iteratively. R fits GLMs by iteratively reweighted least squares (Fisher scoring); the trace below shows the deviance at each iteration:
data("nminer")
nm.m1 <- glm( Minerab ~ Eucs, data=nminer,
family=poisson(link="log"),
control=list(trace=TRUE) ) # Shows the deviance each iteration
## Deviance = 82.14668 Iterations - 1
## Deviance = 64.49515 Iterations - 2
## Deviance = 63.32603 Iterations - 3
## Deviance = 63.31798 Iterations - 4
## Deviance = 63.31798 Iterations - 5
nm.m1
##
## Call: glm(formula = Minerab ~ Eucs, family = poisson(link = "log"),
## data = nminer, control = list(trace = TRUE))
##
## Coefficients:
## (Intercept) Eucs
## -0.8762 0.1140
##
## Degrees of Freedom: 30 Total (i.e. Null); 29 Residual
## Null Deviance: 150.5
## Residual Deviance: 63.32 AIC: 121.5
R uses the (relative change in) total deviance to declare convergence.

Estimating \(\phi\): for normal linear regression models, the MLE of \(\phi=\sigma^2\) is \(\hat{\sigma}^2=\frac{1}{n}\sum w_i(y_i-\hat{\mu}_i)^2\), which is biased. Instead, the sample variance \(s^2=\frac{1}{n-p'}\sum w_i(y_i-\hat{\mu}_i)^2\), which is unbiased, is used in practice.
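A quick sketch of the bias correction for a normal linear model (with all prior weights \(w_i = 1\)), again using the trees data for illustration:

data(trees)
fit.lm <- lm(Volume ~ Girth + Height, data=trees)
n <- nrow(trees); p.prime <- length(coef(fit.lm))
c( MLE = sum(resid(fit.lm)^2)/n,                # divides by n: biased
   s2  = sum(resid(fit.lm)^2)/(n - p.prime),    # divides by n - p': unbiased
   summary.s2 = summary(fit.lm)$sigma^2 )       # squared residual standard error reported by R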
More generally, write out the log-likelihood as a function of the dispersion parameter (the modified profile log-likelihood, MPL),
\[ \mathcal{L}^{0}(\phi)=\frac{p'}{2}\log \phi+\mathcal{L}(\hat{\beta}_0, \dots, \hat{\beta}_p;\phi;y) \] and compute \(\hat{\phi}^{0}\) as the value that maximizes \(\mathcal{L}^{0}(\phi)\). The MPL estimator is consistent and approximately unbiased, even when the sample size is small.
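A rough sketch of the MPL idea for a gamma GLM, assuming the gamma density is written in terms of the mean \(\mu\) and dispersion \(\phi\) (shape = 1/phi, scale = mu*phi); the model fitted here is for illustration only.

data(trees)
fit.tmp <- glm( Volume ~ log(Girth) + log(Height), family=Gamma(link="log"), data=trees)
mpl <- function(phi, fit) {
  mu <- fitted(fit); y <- fit$y; p.prime <- length(coef(fit))
  sum( dgamma(y, shape=1/phi, scale=mu*phi, log=TRUE) ) + (p.prime/2)*log(phi)
}
optimize(mpl, interval=c(1e-4, 1), fit=fit.tmp, maximum=TRUE)$maximum  # MPL estimate of phi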
If \(\phi\) is known:
Wald tests are the simplest tests since they depend only on the estimated coefficients and standard errors.
The regression coefficients \(\hat{\beta}_j\) are asymptotically normally distributed when the sample size is reasonably large (by the central limit theorem), which is the basis of Wald tests.
\[ H_0: \beta_j=\beta_j^0 \]
The Wald test statistic is \(Z=\frac{\hat{\beta}_j-\beta_j^0}{\text{se}(\hat{\beta}_j)}\), where \(\text{se}(\hat{\beta}_j)=\sqrt{\phi}\,v_j\); if \(H_0\) is true, \(Z\) approximately follows a standard normal distribution.
the \(100(1-\alpha)\%\) equal-tail confidence interval for \(\beta_j\) when \(\phi\) is known is given by \(\hat{\beta}_j \pm z_{\alpha/2} \cdot \text{se}(\hat{\beta}_j)\).
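For a Poisson GLM, \(\phi=1\) is known, so a Wald interval can be computed directly from the estimate and its standard error; a minimal sketch using the noisy miner model nm.m1 fitted earlier:

beta.j <- coef(summary(nm.m1))["Eucs", "Estimate"]
se.j   <- coef(summary(nm.m1))["Eucs", "Std. Error"]
beta.j + c(-1, 1) * qnorm(0.975) * se.j   # 95% Wald CI for the Eucs coefficient
confint.default(nm.m1)                    # Wald intervals for all coefficients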
\[ \begin{aligned} &\text{Model A}: g(\hat{\mu}(A))=\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_ax_a \\ &\text{Model B}: g(\hat{\mu}(B))=\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_ax_a+\cdots+\hat{\beta}_bx_b \\ &H_0: \beta_{a+1}=\cdots=\beta_b=0 \end{aligned} \] The LRT statistic is \(L=2\{\mathcal{L}(B)-\mathcal{L}(A)\}=\frac{D(y, \hat{\mu}_A)-D(y, \hat{\mu}_B)}{\phi} \sim \chi^2_{p_B-p_A}\) approximately, where \(p_A\) and \(p_B\) are the numbers of coefficients in Models A and B.
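A minimal sketch of the LRT using the noisy miner data: compare the null Poisson model (Model A) with the model containing Eucs (Model B); since \(\phi=1\) for the Poisson, the change in deviance is referred to a \(\chi^2_1\) distribution.

nm.m0 <- glm( Minerab ~ 1, family=poisson, data=nminer)   # Model A: null model
anova( nm.m0, nm.m1, test="Chisq")   # L = drop in deviance, compared to chi-square on 1 df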
Whereas Wald and likelihood ratio tests (LRTs) are used to test hypotheses about explanatory variables in the current fitted model, score tests enable testing of hypotheses about explanatory variables not (yet) in the current model, but which might be added.
The above tests concern individual coefficients or comparisons of nested models; a different question is: how many explanatory variables are sufficient? Such a goodness-of-fit test is only possible when \(\phi\) is known, because it requires a known distribution for the residual variability.
A goodness-of-fit test compares the current model (Model A say) with an alternative model (Model B) of a particular type where Model B is the largest possible model which can, in principle, be fitted to the data (saturated model).
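As a sketch, a deviance goodness-of-fit test compares the residual deviance of the current model with a \(\chi^2\) distribution on the residual degrees of freedom (the approximation can be poor for small counts); for the Poisson model nm.m1:

pchisq( deviance(nm.m1), df=df.residual(nm.m1), lower.tail=FALSE)   # small p-value suggests lack of fit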
If \(\phi\) is unknown:
the Wald test statistic becomes
\[ T=\frac{\hat{\beta}_j-\beta_j^0}{\text{se}(\hat{\beta}_j)} \] where the standard error \(\text{se}(\hat{\beta}_j)=s\,v_j\) requires a suitable estimator \(s^2\) of \(\phi\); the Pearson estimator \(s^2=\bar{\phi}\) is used by R. Because \(\phi\) is estimated, \(T\) is compared to a t-distribution rather than the standard normal distribution.
If \(\phi\) is unknown, an estimate of \(\phi\) has to be used. An appropriate test statistic for comparing Model A, nested within Model B, is \(F=\frac{[D(y,\hat{\mu}_A)-D(y,\hat{\mu}_B)]/(p_B'-p_A')}{s^2}\), where \(p_A'\) and \(p_B'\) are the numbers of parameters in Models A and B respectively.
data(trees)
# F tests for model comparison
tr.m0 <- glm( Volume ~ 1, family=Gamma(link="log"), data=trees)
tr.m1 <- update(tr.m0,.~.+ log(Girth) )
tr.m2 <- update(tr.m1,.~.+ log(Height) )
c( deviance(tr.m0), deviance(tr.m1), deviance(tr.m2) )
## [1] 8.3172012 0.3840839 0.1835153
c( df.residual(tr.m0), df.residual(tr.m1), df.residual(tr.m2) )
## [1] 30 29 28
dev1 <- deviance(tr.m0) - deviance(tr.m1)
dev2 <- deviance(tr.m1) - deviance(tr.m2)
df1 <- df.residual(tr.m0) - df.residual(tr.m1)
df2 <- df.residual(tr.m1) - df.residual(tr.m2)
c( dev1, dev2)
## [1] 7.9331173 0.2005686
c( df1, df2)
## [1] 1 1
phi.meandev <- deviance(tr.m2) / df.residual(tr.m2) # Mean dev. estimate
phi.Pearson <- summary(tr.m2)$dispersion # Pearson estimate
c("Mean deviance" = phi.meandev, "Pearson" = phi.Pearson )
## Mean deviance Pearson
## 0.006554117 0.006427286
F.Pearson <- c( dev1/df1, dev2/df2 ) / phi.Pearson # calculate obs.
F.meandev <- c( dev1/df1, dev2/df2 ) / phi.meandev
P.Pearson <- pf( F.Pearson, df1, df.residual(tr.m2), lower.tail=FALSE ) # calculate p-value
P.meandev <- pf( F.meandev, df2, df.residual(tr.m2), lower.tail=FALSE )
tab <- data.frame(F.Pearson, P.Pearson, F.meandev, P.meandev)
rownames(tab) <- c("Girth","Height")
print(tab, digits=3)
## F.Pearson P.Pearson F.meandev P.meandev
## Girth 1234.3 1.05e-24 1210.4 1.38e-24
## Height 31.2 5.60e-06 30.6 6.50e-06
to compare a series of GLMs, an analysis of deviance table is used:
anova(tr.m2, test="F")
## Analysis of Deviance Table
##
## Model: Gamma, link: log
##
## Response: Volume
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev F Pr(>F)
## NULL 30 8.3172
## log(Girth) 1 7.9331 29 0.3841 1234.287 < 2.2e-16 ***
## log(Height) 1 0.2006 28 0.1835 31.206 5.604e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Related functions for stepwise term selection: drop1(), add1() and step().
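A brief sketch of the first two, applied to the cherry-tree models fitted above: drop1() tests each term currently in the model, and add1() tests candidate terms not yet included.

drop1(tr.m2, test="F")                                        # test each term in the current model
add1(tr.m1, scope= ~ log(Girth) + log(Height), test="F")      # test terms not yet in the model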
# information criteria
tr.aic1 <- glm( Volume ~ offset(2*log(Girth)) + log(Height),
family=Gamma(link="log"), data=trees)
tr.aic2 <- glm( Volume ~ log(Girth) + offset(log(Height)),
family=Gamma(link="log"), data=trees)
c(AIC(tr.aic1), AIC(tr.aic2))
## [1] 137.9780 138.3677
# automated selection
min.model <- glm( Volume~1, data=trees, family=Gamma(link="log"))
max.model <- glm( Volume~log(Girth) + log(Height), data=trees, family=Gamma(link="log"))
m.b <- step(min.model, scope=list(lower=min.model, upper=max.model), direction="both")
## Start: AIC=255.48
## Volume ~ 1
##
## Df Deviance AIC
## + log(Girth) 1 0.3841 230.75
## + log(Height) 1 4.7827 245.57
## <none> 8.3172 255.48
##
## Step: AIC=160.83
## Volume ~ log(Girth)
##
## Df Deviance AIC
## + log(Height) 1 0.1835 147.82
## <none> 0.3841 160.83
## - log(Girth) 1 8.3172 752.55
##
## Step: AIC=139.9
## Volume ~ log(Girth) + log(Height)
##
## Df Deviance AIC
## <none> 0.1835 139.90
## - log(Height) 1 0.3841 169.11
## - log(Girth) 1 4.7827 853.48
m.a <- step(min.model, scope=list(lower=min.model, upper=max.model), direction="backward") # note: backward selection from the null model has no terms to drop, so it stops at the start
## Start: AIC=255.48
## Volume ~ 1
The assumptions behind a GLM are never exactly true; it is important to be aware of how sensitive the conclusions are to deviations from these assumptions.
The distances \(y_i-\hat{\mu}_i\) are called response residuals, and they are the basis of residuals in linear regression. Response residuals are inadequate for assessing a fitted GLM because, for GLMs based on EDMs, the variance of the response depends on the mean through the variance function \(V(\mu_i)\).
Two observations can have the same distance \(y_i-\hat{\mu}_i\) while one is extreme and the other is not, because their variances differ. Ideally, we want residuals that behave like residuals from linear regression models: normally distributed with mean 0 and constant variance.
residuals(fit) computes deviance residuals by default; qresid(fit) from the statmod package computes randomized quantile residuals; resid(fit, type="partial") computes partial residuals.
data(trees)
cherry.m1 <- glm( Volume ~ log(Girth) + log(Height),
family=Gamma(link=log), data=trees)
scatter.smooth( rstandard(cherry.m1) ~ log(fitted(cherry.m1)), las=1,
ylab="Standardized deviance residual", xlab="log(Fitted values)" )
scatter.smooth( rstandard(cherry.m1) ~ log(trees$Girth), las=1,
ylab="Standardized deviance residual", xlab="log(Girth)" )
scatter.smooth( rstandard(cherry.m1) ~ log(trees$Height), las=1,
ylab="Standardized deviance residual", xlab="log(Height)" )
# working responses vs. linear predictor, to check the link function
z <- resid(cherry.m1, type="working") + cherry.m1$linear.predictors
plot( z ~ cherry.m1$linear.predictors, las=1,
xlab="Linear predictor, eta-hat",
ylab="Working responses, z")
abline(a=0, b=1)
# partial residuals vs predictor for predictor scale
termplot(cherry.m1, partial.resid=TRUE, las=1)
The plot of \(z_i\) against \(\hat{\eta}_i\) is also approximately linear, suggesting a suitable link function. The line shown on each termplot() panel represents the ideal relationship, suggesting the predictors are included on a suitable scale.
library(statmod)                  # provides qresid() for randomized quantile residuals
qr.cherry <- qresid( cherry.m1 )  # extract quantile residuals
qqnorm( qr.cherry, las=1 ); qqline( qr.cherry)
# outliers by studentized residual
rs <- cbind( rD=resid(cherry.m1),
"r'D"=rstandard(cherry.m1),
"r''"=rstudent(cherry.m1),
rQ=qresid(cherry.m1))
apply( abs(rs), 2, max) # The maximum absolute for each residual
## rD r'D r'' rQ
## 0.166763 2.197761 2.329122 2.053011
Since \(\phi\) is small in this case, the saddlepoint approximation is suitable, and the quantile, standardized and Studentized residuals are very similar. No large residuals exist.
An approximation to Cook's distance for GLMs is D, as computed by the function cooks.distance() in R, where the Pearson estimator \(\bar{\phi}\) of \(\phi\) is used if \(\phi\) is unknown.
im <- influence.measures(cherry.m1)
im$infmat <- round(im$infmat, 3 )
head(im$infmat)
## dfb.1_ dfb.l(G) dfb.l(H) dffit cov.r cook.d hat
## 1 0.015 -0.083 0.005 0.107 1.305 0.004 0.151
## 2 0.120 -0.082 -0.090 0.197 1.311 0.014 0.167
## 3 0.065 -0.021 -0.054 0.087 1.385 0.003 0.198
## 4 -0.011 0.021 0.004 -0.041 1.181 0.001 0.059
## 5 0.145 0.171 -0.170 -0.228 1.218 0.018 0.121
## 6 0.186 0.191 -0.212 -0.261 1.261 0.023 0.152
head(im$is.inf)
## dfb.1_ dfb.l(G) dfb.l(H) dffit cov.r cook.d hat
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
colSums( im$is.inf )
## dfb.1_ dfb.l(G) dfb.l(H) dffit cov.r cook.d hat
## 0 0 0 0 3 0 0
# plot out im
cherry.cd <- cooks.distance( cherry.m1)
plot( cherry.cd, type="h", ylab="Cook's distance", las=1)
plot( dffits(cherry.m1), type="h", las=1, ylab="DFFITS")
infl <- which.max(cherry.cd) # The Observation number of largest D
infl # Which observation?
## 18
## 18
cherry.cd[infl] # The value of D for that observation
## 18
## 0.2067211
Three observations are identified as influential, but only by the covariance ratio (cov.r). Since none of the other measures identify these observations as influential, we should not be too concerned.
The value of Cook’s distance for Observation 18 is much larger than any others, but the observation is not identified as significantly influential.
When no EDM exists for a given mean-variance relationship, quasi-likelihood provides a way to conduct inference, e.g. via family=quasipoisson(), family=quasibinomial() or family=quasi().
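A minimal sketch of a quasi-Poisson fit to the noisy miner data: the variance function \(V(\mu)=\mu\) is retained, but \(\phi\) is estimated rather than fixed at 1.

nm.quasi <- glm( Minerab ~ Eucs, data=nminer, family=quasipoisson())
summary(nm.quasi)$dispersion   # estimated dispersion; standard errors scale by the square root of this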
The GLM fitted to model the number of noisy miners from the number of eucalypt trees is:
data(nminer)
nm.m1 <- glm( Minerab ~ Eucs, data=nminer, family=poisson)
printCoefmat(coef(summary(nm.m1)))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.876211 0.282793 -3.0984 0.001946 **
## Eucs 0.113981 0.012431 9.1691 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# systematic & random components, outliers
qr <- qresid( nm.m1 ) # To find randomized quantile residuals
qqnorm(qr, las=1)
qqline(qr)
plot( qr ~ sqrt(fitted(nm.m1)), las=1 )
# influential points
plot( cooks.distance(nm.m1), type="h", las=1 )
plot( hatvalues(nm.m1), type="h", las=1)
maxhat <- which.max( hatvalues(nm.m1) ) # Largest leverage
maxqr <- which.max( abs(qr) ) # Largest abs. residual
maxinfl <- which.max( cooks.distance(nm.m1)) # Most influential
c( MaxLeverage=maxhat, MaxResid=maxqr, MaxInfluence=maxinfl)
## MaxLeverage.11 MaxResid MaxInfluence.17
## 11 7 17
which(influence.measures(nm.m1)$is.inf[,"cook.d"] )
## 17
## 17
Only Observation 17 is influential: it has a reasonably large residual combined with a reasonably large leverage, and so it is influential.
Observe the changes in the regression coefficients after omitting obs.17:
nm.m2 <- glm( Minerab ~ Eucs,
family=poisson, data=nminer,
subset=(-maxinfl)) # A negative index removes the obs.
c( "Original model"=coef(nm.m1), "Without Infl"=coef(nm.m2))
## Original model.(Intercept) Original model.Eucs
## -0.8762114 0.1139813
## Without Infl.(Intercept) Without Infl.Eucs
## -1.0112791 0.1247156
plot( Minerab ~ jitter(Eucs), data=nminer,
xlab="Number of eucalypts", ylab="Number of noisy miners")
newE <- seq( 0, 35, length=100)
newM1 <- predict( nm.m1, newdata=data.frame(Eucs=newE), type="response")
newM2 <- predict( nm.m2, newdata=data.frame(Eucs=newE), type="response")
lines( newM1 ~ newE, lty=1)
lines( newM2 ~ newE, lty=2)