All models are wrong, but some are useful: a model is not an exact representation of reality, but a useful approximation that captures the detail necessary for the purpose at hand.
Consider a lung capacity study in which lung capacity (FEV) was recorded along with age, height, gender and smoking status. At any given combination of age, height, gender and smoking status, many different values of lung capacity could be recorded, producing a distribution of recorded lung capacity values. A model for this distribution of values is called the random component of the statistical model. At a given combination of age, height, gender and smoking status, the distribution of lung capacity values has a mean. The mathematical relationship between this mean and the given values of the predictors is called the systematic component of the statistical model.
A statistical model consists of a random component and a systematic component to explain these two features of real data: it mathematically represents both the systematic and the random components of the data.
Many systematic components for the lung capacity data are possible. One simple systematic component is
\[ \mu_i = \beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\dots+\beta_p x_{pi} \] for observation \(i\), where \(\mu_i\) is the expected value of \(y_i\), so that \(\mu_i=E[y_i]\) for \(i=1,2,\dots, n\). The \(\beta_j\) are unknown regression parameters. Plotting \(y\) against the explanatory variables indicates that this is a poor systematic component for these data. The random component about this systematic component may take many forms. For example, assuming \(var[y_i]=\sigma^2\) states that the variance of the responses \(y_i\) is constant about \(\mu_i\), without assuming any particular distribution for the responses. A popular, stronger assumption is that the responses are normally distributed about \(\mu_i\), written \(y_i \sim N(\mu_i, \sigma^2)\). Either way, the constant-variance assumption is likely to be poor here, because the variation in the observed lung capacity increases for larger values of FEV.
Combining the random and systematic components, a possible linear regression model for the lung capacity data is \[ \begin{cases} var[y_i]=\sigma^2 \\ \mu_i=\beta_0+\beta_1x_{1i}+\dots+\beta_px_{pi} \end{cases} \]
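As a concrete sketch, such a model could be fitted in R along the following lines (assuming the lungcap data frame from the GLMsData package supplies the lung capacity data; an assumption, since the data source is not restated here):

```r
# A minimal sketch, assuming the lungcap data (GLMsData package)
library(GLMsData)
data(lungcap)
# Linear regression: mean linear in the predictors, constant variance
LC.lm <- lm(FEV ~ Age + Ht + Gender + Smoke, data = lungcap)
coef(LC.lm)   # estimates of beta_0, ..., beta_p
```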
Generally, a regression model assumes that the mean response \(\mu_i\) for observation \(i\) depends on the \(p\) predictors \(x_{1i}, \dots, x_{pi}\) via some general function \(f\) through a number of regression parameters \(\beta_j\): \[ E[y_i]=\mu_i=f(x_{1i}, \dots, x_{pi};\beta_0, \dots, \beta_p) \] Commonly, the parameters \(\beta_j\) are assumed to combine the effects of the predictors linearly, so the systematic component takes the more specific form
\[ \mu_i=f(\beta_0+\beta_1x_{1i}+\dots+\beta_px_{pi}) \] Such models are linear in the parameters. Two special cases of regression models linear in the parameters are particularly well studied: linear regression models and generalized linear models.
The role of a statistical model is to accurately represent the important systematic and random features of the data. The purpose of developing a statistical model is usually two-fold:

- Prediction: to produce accurate predictions of the response from new or future values of the explanatory variables.
- Understanding: to describe and quantify how the explanatory variables relate to the mean response.
In general, an adequate statistical model balances two criteria:

- Accuracy: the model should accurately describe both the systematic and random components of the data.
- Parsimony: the model should be as simple as possible, with no more parameters than necessary.
“All models must be used and understood within limitations imposed by how the data were collected.” The method of data collection affects the conclusions that can be drawn from the analysis.
In an observational study, researchers may use elaborate equipment to collect physical measurements or ask well-designed questions, but they do not control the process being observed. Observational studies therefore generally permit conclusions only about associations, not about cause and effect.
In a designed experiment, researchers do intervene to control the values of the explanatory variables that appear in the data: they determine which experimental condition is applied to each subject. A well-designed randomized experiment permits conclusions about cause-and-effect relationships.
Statistical models treat experimental and observational studies the same way, and the statistical conclusions are similar; however, the scientific conclusions from experiments are usually much stronger. The best we can do in an observational study is to record as many extraneous variables as are likely to affect the response, so that as many uncontrolled effects as possible can be adjusted for.
A general form of the linear regression model, allowing each observation a known positive prior weight \(w_i\), is \[ \begin{cases} var[y_i]=\sigma^2/w_i \\ \mu_i=\beta_0+\sum_{j=1}^{p}\beta_jx_{ji} \end{cases} \]
The necessary assumptions for linear regression models are:

- Lack of outliers: all responses were generated by the same process, so the same model applies to every observation.
- Linearity: the stated systematic component correctly describes the mean.
- Constant variance: the responses have constant variance, apart from the known prior weights.
- Independence: the responses are statistically independent of each other.

For formal inference, normality of the responses is also commonly assumed.
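The two model summaries below (first weighted, then unweighted) could be produced along these lines; a sketch assuming the gestation data from the GLMsData package (mean birthweights by gestational age, each mean based on a different number of births):

```r
library(GLMsData)
data(gestation)
# Weighted fit: weight each mean birthweight by the number of births
# it summarizes, so means based on more births count for more
gest.wtd <- lm(Weight ~ Age, data = gestation, weights = Births)
summary(gest.wtd)
# Ordinary (unweighted) fit for comparison
gest.ord <- lm(Weight ~ Age, data = gestation)
summary(gest.ord)
```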
##
## Call:
## lm(formula = Weight ~ Age, data = gestation, weights = Births)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.62979 -0.60893 -0.30063 -0.08845 1.03880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.678389 0.371172 -7.216 7.49e-07 ***
## Age 0.153759 0.009493 16.197 1.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7753 on 19 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9289
## F-statistic: 262.3 on 1 and 19 DF, p-value: 1.416e-12
##
## Call:
## lm(formula = Weight ~ Age, data = gestation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4179 -0.1176 0.0628 0.1435 0.2640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.049879 0.232945 -13.09 5.87e-11 ***
## Age 0.159483 0.006778 23.53 1.63e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1999 on 19 degrees of freedom
## Multiple R-squared: 0.9668, Adjusted R-squared: 0.9651
## F-statistic: 553.7 on 1 and 19 DF, p-value: 1.627e-15
Consider the case of simple linear regression, where the intercept and slope, as well as the variance, need to be estimated.
For any given intercept and slope, the deviations between the observed data and the model mean \(\mu_i\) are \(e_i=y_i-\mu_i=y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ji}\). To summarize the deviations, square them (so that negative and positive deviations do not cancel) and then sum them:
\[ S(\beta_0, \beta_1)=\sum_{i=1}^{n}w_ie_i^2=\sum_{i=1}^{n}w_i(y_i-\mu_i)^2=\sum_{i=1}^{n}w_i(y_i-\beta_0-\beta_1x_i)^2 \] The non-negative weights \(w_i\) may be used to weight observations according to their precision; \(S\) summarizes how far the fitted line deviates from the observations.
Intuitively, the least-squares estimators of the \(\beta_j\) are found by minimizing \(S\).
Using calculus, solving \[ \frac{\partial S}{\partial \beta_0}=0 \Rightarrow \hat{\beta}_0; \qquad \frac{\partial S}{\partial \beta_1}=0 \Rightarrow \hat{\beta}_1 \] The least-squares estimators are unbiased, and the fitted means are given by \(\hat{\mu}_i=\hat{\beta}_0+\hat{\beta}_1x_i\) for \(i=1,\dots,n\).
The minimized value of \(S\), evaluated at the least-squares estimates, is called the residual sum-of-squares (RSS):
\[ \text{RSS}=\sum_{i=1}^{n}w_i(y_i-\beta_0-\beta_1x_i)^2 \,\bigg |_{\hat{\beta}_0, \hat{\beta}_1}=\sum_{i=1}^{n}w_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2=\sum_{i=1}^{n}w_ir_i^2, \] where the \(r_i=y_i-\hat{\mu}_i\) are called the raw residuals.
Estimating the variance \(\sigma^2\) by \(\hat{\sigma}^2=\text{RSS}/n\), following the definition of variance, gives a biased estimator. In practice \(s^2=\text{RSS}/(n-2)\) is used, which is unbiased because the RSS loses two degrees of freedom to the two estimated regression parameters. More generally, for a linear regression model with \(p'\) estimated regression parameters (including the intercept), \(s^2=\text{RSS}/(n-p')\).
The term standard error denotes the standard deviation of an estimated quantity; for example, \(se(\hat{\beta}_0)\) and \(se(\hat{\beta}_1)\) are the standard errors of the parameter estimates.
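A minimal R sketch of these formulas for simple weighted least squares (the function and its arguments are illustrative, not from the source):

```r
# Weighted least squares for simple linear regression from first principles;
# x, y are data vectors and w the non-negative prior weights
wls <- function(x, y, w = rep(1, length(x))) {
  xbar  <- weighted.mean(x, w)
  ybar  <- weighted.mean(y, w)
  beta1 <- sum(w * (x - xbar) * (y - ybar)) / sum(w * (x - xbar)^2)
  beta0 <- ybar - beta1 * xbar     # line passes through the weighted means
  mu    <- beta0 + beta1 * x       # fitted means mu-hat
  RSS   <- sum(w * (y - mu)^2)     # residual sum-of-squares
  list(coef = c(beta0, beta1), RSS = RSS,
       s2 = RSS / (length(y) - 2)) # unbiased estimate of sigma^2 (n - 2 df)
}
```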
For the lung capacity data, two nested models (ModelA and ModelB below) can be compared: first their residual sums-of-squares, then an F-statistic and its p-value computed by hand:
## ModelA ModelB
## 28.91982 13.73356
## [1] 358.8249
## [1] 1.128849e-105
## Analysis of Variance Table
##
## Model 1: log(FEV) ~ Age + Smoke
## Model 2: log(FEV) ~ Age + Ht + Gender + Smoke
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 651 28.920
## 2 649 13.734 2 15.186 358.82 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 3.000 -2033.551
## [1] 3.000 -2470.728
## [1] 3.000 -2020.102
## [1] 3.000 -2457.278
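The comparisons above could be generated roughly as follows (a sketch; the model objects and the use of extractAIC() are assumptions about how the output was produced):

```r
# Two nested models for the lung capacity data
LC.A <- lm(log(FEV) ~ Age + Smoke, data = lungcap)
LC.B <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
c(ModelA = deviance(LC.A), ModelB = deviance(LC.B))   # RSS of each model
# F-statistic and p-value computed by hand from the two RSS values
df.A <- df.residual(LC.A); df.B <- df.residual(LC.B)
Fstat <- ((deviance(LC.A) - deviance(LC.B)) / (df.A - df.B)) /
         (deviance(LC.B) / df.B)
pf(Fstat, df.A - df.B, df.B, lower.tail = FALSE)
anova(LC.A, LC.B)                           # the same test, via anova()
# AIC and a BIC-type criterion: extractAIC() returns c(edf, criterion)
extractAIC(LC.A)
extractAIC(LC.A, k = log(nrow(lungcap)))    # BIC-type penalty k = log(n)
```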
## Single term deletions
##
## Model:
## log(FEV) ~ Age + Ht + Gender + Smoke
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 13.734 -2516.6
## Age 1 1.0323 14.766 -2471.2 48.7831 7.096e-12 ***
## Ht 1 13.7485 27.482 -2064.9 649.7062 < 2.2e-16 ***
## Gender 1 0.1325 13.866 -2512.3 6.2598 0.01260 *
## Smoke 1 0.1027 13.836 -2513.7 4.8537 0.02794 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Single term additions
##
## Model:
## log(FEV) ~ Smoke
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 68.192 -1474.5
## Age 1 39.273 28.920 -2033.5 884.045 < 2.2e-16 ***
## Ht 1 53.371 14.821 -2470.7 2344.240 < 2.2e-16 ***
## Gender 1 2.582 65.611 -1497.8 25.616 5.426e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
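These tables have the form produced by drop1() and add1(); a sketch of plausible calls:

```r
# Single term deletions from the full main-effects model
LC.full <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
drop1(LC.full, test = "F")
# Single term additions to a minimal model containing only Smoke
LC.smoke <- lm(log(FEV) ~ Smoke, data = lungcap)
add1(LC.smoke, scope = ~ Smoke + Age + Ht + Gender, test = "F")
```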
## Start: AIC=-2516.58
## log(FEV) ~ Age + Ht + Gender + Smoke
##
## Df Sum of Sq RSS AIC
## <none> 13.734 -2516.6
## + Smoke:Age 1 0.040136 13.693 -2516.5
## + Ht:Gender 1 0.006336 13.727 -2514.9
## + Smoke:Gender 1 0.003488 13.730 -2514.7
## + Age:Gender 1 0.003119 13.730 -2514.7
## + Smoke:Ht 1 0.001732 13.732 -2514.7
## + Age:Ht 1 0.001455 13.732 -2514.6
## Start: AIC=-2507.58
## log(FEV) ~ (Smoke + Age + Ht + Gender)^2
##
## Df Sum of Sq RSS AIC
## - Smoke:Gender 1 0.000626 13.671 -2509.6
## - Age:Gender 1 0.000859 13.671 -2509.5
## - Age:Ht 1 0.002852 13.674 -2509.4
## - Smoke:Ht 1 0.005607 13.676 -2509.3
## - Ht:Gender 1 0.007841 13.678 -2509.2
## <none> 13.671 -2507.6
## - Smoke:Age 1 0.048232 13.719 -2507.3
##
## Step: AIC=-2509.55
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht +
## Age:Ht + Age:Gender + Ht:Gender
##
## Df Sum of Sq RSS AIC
## - Age:Gender 1 0.000581 13.672 -2511.5
## - Age:Ht 1 0.003282 13.675 -2511.4
## - Ht:Gender 1 0.007945 13.679 -2511.2
## - Smoke:Ht 1 0.009919 13.681 -2511.1
## <none> 13.671 -2509.6
## - Smoke:Age 1 0.047923 13.719 -2509.3
##
## Step: AIC=-2511.52
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht +
## Age:Ht + Ht:Gender
##
## Df Sum of Sq RSS AIC
## - Age:Ht 1 0.003749 13.676 -2513.3
## - Smoke:Ht 1 0.009377 13.681 -2513.1
## - Ht:Gender 1 0.011841 13.684 -2512.9
## <none> 13.672 -2511.5
## - Smoke:Age 1 0.047397 13.719 -2511.3
##
## Step: AIC=-2513.34
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Smoke:Ht +
## Ht:Gender
##
## Df Sum of Sq RSS AIC
## - Smoke:Ht 1 0.007326 13.683 -2515.0
## - Ht:Gender 1 0.008717 13.684 -2514.9
## <none> 13.676 -2513.3
## - Smoke:Age 1 0.050584 13.726 -2512.9
##
## Step: AIC=-2514.99
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age + Ht:Gender
##
## Df Sum of Sq RSS AIC
## - Ht:Gender 1 0.010484 13.693 -2516.5
## <none> 13.683 -2515.0
## - Smoke:Age 1 0.044284 13.727 -2514.9
##
## Step: AIC=-2516.49
## log(FEV) ~ Smoke + Age + Ht + Gender + Smoke:Age
##
## Df Sum of Sq RSS AIC
## - Smoke:Age 1 0.040136 13.734 -2516.6
## <none> 13.693 -2516.5
##
## Step: AIC=-2516.58
## log(FEV) ~ Smoke + Age + Ht + Gender
## (Intercept) Smoke Age Ht GenderM
## -1.9400 -0.0461 0.0234 0.0428 0.0293
## Start: AIC=-2516.58
## log(FEV) ~ Age + Ht + Gender + Smoke
##
## Df Sum of Sq RSS AIC
## <none> 13.734 -2516.6
## + Smoke:Age 1 0.040136 13.693 -2516.5
## + Ht:Gender 1 0.006336 13.727 -2514.9
## + Smoke:Gender 1 0.003488 13.730 -2514.7
## + Age:Gender 1 0.003119 13.730 -2514.7
## + Smoke:Ht 1 0.001732 13.732 -2514.7
## + Age:Ht 1 0.001455 13.732 -2514.6
## (Intercept) Age Ht GenderM Smoke
## -1.9400 0.0234 0.0428 0.0293 -0.0461
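The stepwise traces above (a forward search for two-way interactions starting from the main-effects model, and a backward elimination starting from the model with all two-way interactions) resemble the following calls (a sketch):

```r
# Forward: consider adding two-way interactions to the main effects
LC.full <- lm(log(FEV) ~ Age + Ht + Gender + Smoke, data = lungcap)
step(LC.full, scope = ~ (Smoke + Age + Ht + Gender)^2, direction = "forward")
# Backward: start from all two-way interactions and eliminate terms
LC.int <- lm(log(FEV) ~ (Smoke + Age + Ht + Gender)^2, data = lungcap)
step(LC.int, direction = "backward")
```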
The dental data (summarized below) record, for 90 countries, industrialization status (Indus), sugar consumption (Sugar) and the DMFT index (mean number of decayed, missing and filled teeth at age 12):
## Country Indus Sugar DMFT
## Albania : 1 Ind :29 Min. : 0.97 Min. :0.300
## Algeria : 1 NonInd:61 1st Qu.:14.53 1st Qu.:1.600
## Angolia : 1 Median :33.79 Median :2.300
## Argentina: 1 Mean :30.14 Mean :2.656
## Australia: 1 3rd Qu.:44.32 3rd Qu.:3.350
## Austria : 1 Max. :63.02 Max. :8.100
## (Other) :84
The analysis of variance table for the model DMFT ~ Sugar * Indus follows:
## Analysis of Variance Table
##
## Response: DMFT
## Df Sum Sq Mean Sq F value Pr(>F)
## Sugar 1 49.836 49.836 26.3196 1.768e-06 ***
## Indus 1 1.812 1.812 0.9572 0.33065
## Sugar:Indus 1 6.674 6.674 3.5248 0.06385 .
## Residuals 86 162.840 1.893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Returning to the lung capacity model diagnostics, the output below shows a raw residual and the corresponding standardized residual for one observation, the two largest leverages (observations 629 and 631), the mean leverage \(p'/n\), and the ratio of each large leverage to the mean:
## Raw Standardized
## 0.1812849 1.0027232
## 629 631
## 0.02207842 0.02034224
## [1] 0.006116208
## [1] 0.006116208
## 629 631
## 3.609822 3.325956
The leverages of observations 629 and 631 are several times greater than the mean leverage.
## Age FEV Ht Gender Smoke
## 629 9 1.953 58 M Smoker
## 631 11 1.694 60 M Smoker
The two largest leverages correspond to the two unusual observations in the bottom left corner of the plot.
- Residuals vs. each \(x_j\): checks linearity. Here the relationship appears slightly non-linear and the variance increases, suggesting the model is poor.
- Residuals vs. the fitted values \(\hat{\mu}\): checks the constant-variance assumption.
- Q-Q plot of the residuals: checks the normality assumption.
The Box-Cox transformation finds the optimal \(\lambda\) in the family \(y^{(\lambda)}=(y^\lambda-1)/\lambda\) (with \(y^{(0)}=\log y\)) to achieve linearity, normality and constant variance simultaneously.
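A sketch of estimating the transformation parameter with boxcox() from the MASS package (the formula and data names are placeholders, not the document's actual variables):

```r
library(MASS)   # provides boxcox()
# Profile log-likelihood of lambda for a positive response y
bc <- boxcox(y ~ x, data = dat, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]   # lambda maximizing the profile log-likelihood
```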
## (Intercept) Temp I(Temp^2)
## (Intercept) 1.0000000 -0.9984975 0.9941781
## Temp -0.9984975 1.0000000 -0.9985344
## I(Temp^2) 0.9941781 -0.9985344 1.0000000
Observe that the correlations involving the two polynomial terms are extremely close to plus or minus one: raw polynomial terms are highly collinear.
More numerically stable fits use orthogonal polynomials, fitted with poly() in R.
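A sketch contrasting the two parameterizations (assuming the heat capacity data heatcap from GLMsData, with response Cp and predictor Temp; an assumption about the data in use):

```r
library(GLMsData)
data(heatcap)
# Raw polynomial: powers of Temp are nearly collinear
hc.raw <- lm(Cp ~ Temp + I(Temp^2), data = heatcap)
summary(hc.raw, correlation = TRUE)$correlation   # entries near +/- 1
# Orthogonal polynomial basis: numerically stable
hc.orth <- lm(Cp ~ poly(Temp, 4), data = heatcap)
printCoefmat(coef(summary(hc.orth)))
```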
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2755556 0.0077737 1450.4766 < 2.2e-16 ***
## poly(Temp, 4)1 1.8409086 0.0329810 55.8173 < 2.2e-16 ***
## poly(Temp, 4)2 0.3968900 0.0329810 12.0339 2.02e-08 ***
## poly(Temp, 4)3 0.1405174 0.0329810 4.2606 0.0009288 ***
## poly(Temp, 4)4 -0.0556088 0.0329810 -1.6861 0.1156150
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Natural smoothing splines are linear at the end points, and hence can be extrapolated in a predictable way outside the interval of the data used to estimate the curve. For this reason, natural smoothing splines are a good practical choice in most cases for fitting data-driven curves.
## [1] 4.0000 -117.1234
## [1] 4.0000 -119.2705
## [1] 4.0000 -117.1234
## [[1]]
## [1] 4.0000 -119.2705
##
## [[2]]
## [1] 7.0000 -117.1797
##
## [[3]]
## [1] 10.0000 -128.3144
Of these three models, lm.ns has the smallest AIC.
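A sketch of such a comparison using natural cubic splines of increasing flexibility (y, x and dat are placeholders for the actual variables):

```r
library(splines)   # provides ns()
# Natural cubic spline fits with increasing degrees of freedom
fits <- lapply(c(3, 6, 9), function(k) lm(y ~ ns(x, df = k), data = dat))
lapply(fits, extractAIC)   # each element: c(edf, AIC)
```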
After applying remedies to ensure linearity and constant variance, observations identified as outliers or influential may no longer be present. Sometimes, however, they remain.
Collinearity means that different combinations of the covariates may lead to nearly the same fitted values. Collinearity is therefore mainly a problem for interpretation rather than prediction.
A special case of collinearity occurs when a covariate and a power of the covariate are included in the same model. Using orthogonal polynomials or regression splines avoids this problem.
If collinearity is detected or suspected, remedies include: omitting some of the highly correlated explanatory variables; combining correlated covariates into a single index; collecting new data over a wider range of conditions; or using specialized methods such as ridge regression.
## Cases Eligible OpRooms MainHours
## Cases 1.0000000 0.9602926 0.9264237 0.9802365
## Eligible 0.9602926 1.0000000 0.9399181 0.9749010
## OpRooms 0.9264237 0.9399181 1.0000000 0.9630730
## MainHours 0.9802365 0.9749010 0.9630730 1.0000000
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -114.58953 130.33919 -0.8792 0.40494
## Eligible 2.27138 1.68197 1.3504 0.21384
## OpRooms 99.72542 42.21579 2.3623 0.04580 *
## Cases 2.03154 0.67779 2.9973 0.01714 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
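The correlation matrix and the fit above could arise from calls like these (a sketch, assuming the naval hospital maintenance data nhospital from GLMsData; an assumption):

```r
library(GLMsData)
data(nhospital)
cor(nhospital)   # every pairwise correlation exceeds 0.92: strong collinearity
hosp.lm <- lm(MainHours ~ Eligible + OpRooms + Cases, data = nhospital)
printCoefmat(coef(summary(hosp.lm)))   # inflated standard errors
```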
Returning to the dental data, counting how many observations each influence diagnostic flags for the fitted model (for example via colSums(influence.measures(dental.lm)$is.inf)) shows several flagged points:
## dfb.1_ dfb.Sugr dfb.InNI dfb.S:IN dffit cov.r cook.d hat
## 0 0 0 0 2 11 0 2
DMFT is a strictly positive response variable that varies over an order of magnitude between countries, so a log-transformation may well be helpful:
## Analysis of Variance Table
##
## Response: log(DMFT)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sugar 1 10.9773 10.9773 36.8605 3.332e-08 ***
## Indus 1 0.6183 0.6183 2.0761 0.15326
## Sugar:Indus 1 1.3772 1.3772 4.6245 0.03432 *
## Residuals 86 25.6113 0.2978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## dfb.1_ dfb.Sugr dfb.InNI dfb.S:IN dffit cov.r cook.d hat
## 0 0 0 0 0 11 0 2
## AIC (DMFT) AIC (log-DMFT)
## 61.36621 -105.10967
## BIC (DMFT) BIC (log-DMFT)
## 413.3662 246.8903
By both criteria, the model using log(DMFT) as the response variable is preferred.
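A sketch of the refit on the log scale and its diagnostics (assuming the dental data from GLMsData; an assumption):

```r
library(GLMsData)
data(dental)
dental.lm  <- lm(DMFT ~ Sugar * Indus, data = dental)        # original scale
dental.log <- lm(log(DMFT) ~ Sugar * Indus, data = dental)   # log scale
anova(dental.log)
colSums(influence.measures(dental.log)$is.inf)   # far fewer flagged points
extractAIC(dental.lm); extractAIC(dental.log)    # compare AIC
```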
To understand how the chemical composition of cheese is related to its taste:
## Single term deletions
##
## Model:
## Taste ~ Acetic * log(H2S) * Lactic
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2452.3 148.11
## Acetic:log(H2S):Lactic 1 36.467 2488.8 146.55 0.3272 0.5731
## Single term deletions
##
## Model:
## Taste ~ Acetic + log(H2S):Lactic + Acetic:log(H2S):Lactic
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2679.1 142.76
## Acetic:log(H2S):Lactic 1 24.269 2703.4 141.03 0.2355 0.6315
## Single term deletions
##
## Model:
## Taste ~ log(H2S) + Lactic + Acetic
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2660.9 142.56
## log(H2S) 1 1012.39 3673.3 150.23 9.8922 0.004126 **
## Lactic 1 527.53 3188.4 145.98 5.1546 0.031706 *
## Acetic 1 8.05 2668.9 140.65 0.0787 0.781291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Taste Acetic logH2S Lactic
## Taste 1.0000000 0.5131983 0.7557637 0.7042362
## Acetic 0.5131983 1.0000000 0.5548159 0.5410837
## logH2S 0.7557637 0.5548159 1.0000000 0.6448351
## Lactic 0.7042362 0.5410837 0.6448351 1.0000000
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.591089 8.981801 -3.071888 0.004813785
## log(H2S) 3.946425 1.135722 3.474817 0.001742652
## Lactic 19.885953 7.959175 2.498494 0.018858866
## dfb.1_ dfb.l(H2 dfb.Lctc dffit cov.r cook.d hat
## 0 0 0 0 4 0 0
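A sketch of the final model and its influence diagnostics (assuming the cheese data from GLMsData, with raw H2S concentrations so that log(H2S) is taken in the formula; an assumption):

```r
library(GLMsData)
data(cheese)
# Final model after backward elimination: Acetic is dropped
cheese.lm <- lm(Taste ~ log(H2S) + Lactic, data = cheese)
printCoefmat(coef(summary(cheese.lm)))
colSums(influence.measures(cheese.lm)$is.inf)   # only cov.r flags any points
```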