Statistical-Learning

Overview

  • statistical learning (Gareth James 2021) refers to a vast set of tools for understanding data. These tools can be classified as:
    • supervised: building a statistical model for predicting, or estimating, an output based on one or more inputs.
    • unsupervised: there are inputs but no supervising output; learn relationships and structure from such data.

History

  • 19th century: the least squares method was developed
  • 1936: linear discriminant analysis, to predict qualitative values
  • 1940s: logistic regression
  • 1970s: the generalized linear model was developed to describe an entire class of statistical learning methods that include both linear and logistic regression as special cases.
  • 1980s: classification and regression trees were developed, followed by generalized additive models. Neural networks gained popularity.
  • 1990s: support vector machines arose.

organization

  • chap 2: terminology and concepts behind statistical learning + the K-nearest neighbors classifier
  • chap 3: linear regression
  • chap 4: logistic regression + linear discriminant analysis
  • chap 5: cross-validation + bootstrap: choosing the best method
  • chap 6: improvement on classical linear regression - stepwise selection, ridge regression, principal components regression, lasso
  • chap 7: non-linear methods that work well with a single input variable, and non-linear additive methods for more than one input variable
  • chap 8: tree-based methods, including bagging, boosting, random forests
  • chap 9: support vector machines, which perform both linear and non-linear classification
  • chap 10: deep learning, which performs non-linear regression and classification
  • chap 11: survival analysis, a regression approach specialized to the setting in which the output is censored, i.e., not fully observed.
  • chap 12: unsupervised settings, principal component analysis, K-means clustering, hierarchical clustering
  • chap 13: multiple hypothesis testing

chap 2. statistical learning

our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets. More generally, suppose that we observe a quantitative response \(Y\) and \(p\) different predictors, \(X_1, X_2, \dots, X_p\). We assume that there is some relationship between Y and \(X=(X_1, X_2, \dots, X_p)\), which can be written in the very general form: \[ Y=f(X)+\epsilon \] Here \(f\) is some fixed but unknown function of \(X_1, X_2, \dots, X_p\) and \(\epsilon\) is a random error term, which is independent of X and has mean zero. In this formulation, \(f\) represents the systematic information that X provides about Y.

In essence, statistical learning refers to a set of approaches for estimating \(f\).

What’s statistical learning?

Why estimate f?

  • prediction: in cases where X are readily available but the output Y cannot be easily obtained.
    • \(\hat{Y}=\hat{f}(X)\)
    • the accuracy of \(\hat{Y}\) as a prediction of Y depends on two quantities: reducible error and irreducible error

In general, \(\hat{f}\) is not a perfect estimate for \(f\), and this inaccuracy introduces some error. This error is reducible because we can potentially improve the accuracy of \(\hat{f}\) by using a more appropriate statistical learning technique. However, Y is also a function of \(\epsilon\), which by definition cannot be predicted using X; the error associated with \(\epsilon\) is known as the irreducible error.

Why is the irreducible error larger than zero? The quantity \(\epsilon\) may contain unmeasured variables that are useful in predicting Y. The focus of this book is on techniques for estimating \(f\) with the aim of minimizing the reducible error.

  • inference: We are now interested in understanding the association between Y and \(X_1, X_2, \dots, X_p\). In this situation, we wish to estimate \(f\), but our goal is not to make predictions for Y. Therefore, \(\hat{f}\) cannot be treated as a black box because we need to know its exact form.

    • which predictors are associated with the response?
    • what is the relationship between the response and each predictor?
    • Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

How do we estimate f?

  • parametric methods: involve a two-step model-based approach

    • first, we make an assumption about the functional form, or shape, of \(f\). For example, we assume \(f\) is linear:

    \[ f(X)=\beta_0+\beta_1X_1+\dots+\beta_pX_p \]

instead of estimating the black-box function \(f(X)\), we only need to estimate the \(p+1\) coefficients \(\beta_0, \beta_1, \dots, \beta_p\).

    • second, we need a procedure that uses the training data to fit or train the model; that is, we want to find estimates of the parameters such that

\[ Y \approx \beta_0+\beta_1X_1+\dots+\beta_pX_p \]

the most common approach to fitting/training the linear model is referred to as (ordinary) least squares.

This model-based approach is referred to as parametric: it reduces the problem of estimating f down to one of estimating a set of parameters. Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters than it is to fit an entirely arbitrary function f. The downside of the parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen form is too far off, then the estimate will be poor. To address this, we can choose flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters, which can lead to a phenomenon known as overfitting the data, meaning the model follows the errors, or noise, too closely.

  • non-parametric methods: no assumption about the functional form of f

Instead, they seek an estimate of f that gets as close to the data points as possible. The advantage over the parametric approach is that, by avoiding the assumption of a particular functional form for f, non-parametric methods have the potential to accurately fit a wider range of possible shapes for f. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate for f. Splines are one example of a non-parametric approach.
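To make the contrast concrete, here is a minimal sketch (my own illustration, not from the book) fitting a parametric linear model and a non-parametric K-nearest-neighbors regression to the same simulated data; the simulated f and the scikit-learn estimators are assumptions of the example.

```python
# Sketch: parametric (linear) vs. non-parametric (KNN) estimates of f.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=(n, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(n)   # assumed true f is non-linear

linear = LinearRegression().fit(X, y)                 # parametric: estimate 2 coefficients
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)   # non-parametric: follow the local data

grid = np.linspace(0, 10, 5)[:, None]
print("linear fit :", linear.predict(grid).round(2))
print("KNN fit    :", knn.predict(grid).round(2))
print("true f     :", np.sin(grid[:, 0]).round(2))
```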

The trade-off between prediction accuracy and model interpretability

Why choose an inflexible method instead of a very flexible approach?

  • when inference is the goal, the linear model may be a good choice since it is quite easy to understand the relationship between Y and X; in contrast, flexible approaches such as splines and boosting methods can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.


supervised vs. unsupervised

  • supervised
    • linear regression
    • logistic regression
    • GAM
    • boosting
    • support vector machines
  • unsupervised
    • clustering
  • semi-supervised learning
    • for part of the observations, we do not observe the response

regression vs. classification

There is no clear-cut distinction:

  • we tend to refer to problems with a quantitative response as regression problems
  • those involving qualitative response as classification problems

Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (binary) response. Thus, despite its name, logistic regression is a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well.

We tend to select statistical methods based on whether the response is quantitative or qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important, provided that any qualitative predictors are properly coded before the analysis is performed.

Assessing model accuracy

There is no free lunch in statistics: no one method dominates all others over all possible data sets. Hence, it is an important task to decide for any given set of data which method produces the best results.

measuring the quality of fit

To measure the performance of a statistical learning method on a given data set, we need some ways to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.

in the regression setting, the most commonly used measure is the mean squared error (MSE): \[ MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2 \]

  • training MSE
  • test MSE: we care about this!

When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. That is, our statistical learning procedure is working too hard to find patterns in the training data and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f.
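A small sketch (simulated data and polynomial fits are my own assumptions) of this point: as flexibility grows, training MSE keeps falling while test MSE eventually rises again.

```python
# Sketch: training MSE vs. test MSE for polynomial fits of increasing flexibility.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + 0.5 * rng.standard_normal(n)   # Y = f(X) + eps
    return x, y

x_train, y_train = simulate(50)
x_test, y_test = simulate(1000)

for degree in (1, 3, 10, 15):
    coefs = np.polyfit(x_train, y_train, degree)    # more flexible as degree grows
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```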

bias-variance trade-off

The expected test MSE, for a given value \(x_0\), can always be decomposed into the sum of three fundamental quantities: \[ E(y_0-\hat{f}(x_0))^2=Var(\hat{f}(x_0))+[Bias(\hat{f}(x_0))]^2+Var(\epsilon) \]

  • the variance of \(\hat{f}(x_0)\)
  • the squared bias of \(\hat{f}(x_0)\)
  • the variance of the error term \(\epsilon\)

A good statistical learning method simultaneously achieves low variance and low bias.

  • variance: the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different \(\hat{f}\). If a method has high variance, then small changes in the training data can result in large changes in \(\hat{f}\). In general, more flexible statistical methods have higher variance.

  • bias: the error introduced by approximating a real-life problem by a much simpler model. Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases.

The relationship between bias, variance, and test MSE is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. The challenge lies in finding a method for which both the variance and the squared bias are low. In a real-life setting in which \(f\) is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, we should always keep the bias-variance trade-off in mind. Cross-validation is a way to estimate the test MSE using the training data. The bias-variance trade-off transfers to the classification setting with only some modifications.
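As a rough illustration (entirely my own simulation, with an assumed true f), one can estimate the variance and squared bias of \(\hat{f}(x_0)\) by repeatedly drawing training sets and refitting models of different flexibility:

```python
# Sketch: empirical bias-variance decomposition of f_hat(x0) over repeated training sets.
import numpy as np

rng = np.random.default_rng(2)
f = np.sin                      # assumed true f
x0, sigma = 1.0, 0.5            # evaluation point and noise standard deviation
reps = 500

for degree in (1, 3, 10):
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(-3, 3, 60)
        y = f(x) + sigma * rng.standard_normal(60)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    variance = preds.var()
    sq_bias = (preds.mean() - f(x0)) ** 2
    print(f"degree {degree:2d}: variance {variance:.4f}, bias^2 {sq_bias:.4f}, "
          f"expected test MSE ~ {variance + sq_bias + sigma**2:.4f}")
```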

classification setting

  • training error rate
  • test error rate

Suppose that we seek to estimate f on the basis of training observations \(\{(x_1, y_1), \dots, (x_n,y_n)\}\), where \(y_1, \dots, y_n\) are qualitative. The most common approach for quantifying the accuracy of our estimate \(\hat{f}\) is the training error rate, the proportion of mistakes that are made if we apply our estimate \(\hat{f}\) to the training observations:

\[ \frac{1}{n}\sum_{i=1}^{n}I(y_i\ne\hat{y_i}) \]

I is an indicator variable that equals 1 if the observed y does not equal the predicted y, and 0 if it does. A good classifier is one for which the test error rate is smallest:

\[ Ave(I(y_0\ne\hat{y_0})) \]

The Bayes classifier: we should simply assign a test observation with predictor vector \(x_0\) to the class j for which \[ Pr(Y=j|X=x_0) \] is largest. This is a conditional probability: the probability that Y=j given the observed predictor vector \(x_0\). This very simple classifier is called the Bayes classifier. In a two-class problem where there are only two possible response values, the Bayes classifier corresponds to predicting class one if \(Pr(Y=1|X=x_0)>0.5\), and class two otherwise.

K-nearest Neighbors:

  • in theory, we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. The Bayes classifier therefore serves as an unattainable gold standard against which to compare other methods.
  • K-nearest neighbors (KNN), like many other approaches, attempts to estimate the conditional distribution of Y given X. Although KNN is a very simple approach, it can often produce classifiers that are surprisingly close to the optimal Bayes classifier.
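The following sketch (simulated data; scikit-learn assumed) compares KNN with the Bayes classifier in a setting where the true conditional probabilities are known, so the Bayes error rate can actually be computed:

```python
# Sketch: KNN vs. the (normally unattainable) Bayes classifier on simulated data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(-2, 2, size=(n, 2))
p1 = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))          # true Pr(Y=1 | X), known here
y = rng.binomial(1, p1)

train, test = slice(0, 1000), slice(1000, None)
knn = KNeighborsClassifier(n_neighbors=15).fit(X[train], y[train])

knn_err = np.mean(knn.predict(X[test]) != y[test])
bayes_err = np.mean((p1[test] > 0.5).astype(int) != y[test])   # Bayes classifier rule
print(f"KNN test error:   {knn_err:.3f}")
print(f"Bayes test error: {bayes_err:.3f}")
```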

chap 3. Linear regression

estimating the coefficients

Let \(\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i\) be the prediction for Y based on the ith value of X. Then \(e_i=y_i-\hat{y_i}\) represents the ith residual. We define the residual sum of squares (RSS) as:

\[ RSS=e^2_1+e^2_2+\dots+e^2_n \] The least squares approach chooses \(\hat{\beta}_0, \hat{\beta}_1\) to minimize the RSS.
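A minimal sketch (toy data assumed) of the closed-form least squares estimates for simple linear regression, \(\hat{\beta}_1=\sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2\) and \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\):

```python
# Sketch: closed-form least squares estimates minimizing the RSS.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.standard_normal(100)     # assumed true model

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
print(f"beta0_hat={beta0_hat:.3f}, beta1_hat={beta1_hat:.3f}, RSS={rss:.2f}")
```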

assess the accuracy of coefficient estimates

Notice that different data sets generated from the same true model result in slightly different least squares lines, but the unobserved population regression line does not change. By analogy, consider estimating the population mean \(\mu\) of a random variable with the sample mean \(\hat{\mu}\): how accurate is the sample mean \(\hat{\mu}\) as an estimate of \(\mu\)? We have established that the average of \(\hat{\mu}\)s over many data sets will be very close to \(\mu\), but that a single estimate \(\hat{\mu}\) may be a substantial underestimate or overestimate of \(\mu\).

  • How far off will that single estimate of \(\hat{\mu}\) be?

In general, we answer this question by computing the standard error of \(\hat{\mu}\), written as SE(\(\hat{\mu}\)). We have the well-known formula:

\[ Var(\hat{\mu})=SE(\hat{\mu})^2=\frac{\sigma^2}{n} \] The estimate of \(\sigma\) is known as the residual standard error (RSE), and is given by: \[ RSE=\sqrt{RSS/(n-2)}=\sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y_i})^2} \]

  • Standard errors can be used to construct confidence intervals.
  • Standard error can be used to perform hypothesis tests.

assess the accuracy of the model

The quality of a linear regression fit is typically assessed using two related quantities:

  • residual standard error (RSE)
  • \(R^2\)

The RSE is considered an absolute measure of the lack of fit of the linear model to data. If the predictions obtained using the model are very close to the true outcome values, then the RSE will be small; otherwise, RSE will be large.

The \(R^2\) statistic provides an alternative measure of fit. It takes the form of a proportion (the proportion of variance explained), so it always takes on a value between 0 and 1, and is independent of the scale of Y.

\[ R^2=\frac{TSS-RSS}{TSS}=1-\frac{RSS}{TSS} \] The \(R^2\) has an interpretational advantage over the RSE, since it always takes values between 0 and 1; however, it can still be challenging to determine what is a good \(R^2\).
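A small sketch (toy data assumed) computing RSE and \(R^2\) directly from their definitions for a simple linear regression fit:

```python
# Sketch: RSE = sqrt(RSS/(n-2)) and R^2 = 1 - RSS/TSS for a simple linear fit.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.standard_normal(100)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (len(y) - 2))   # residual standard error (n - 2 degrees of freedom)
r2 = 1 - rss / tss                  # proportion of variance explained
print(f"RSE = {rse:.3f}, R^2 = {r2:.3f}")
```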

multiple linear regression

The multiple linear regression model takes the form \[ Y=\beta_0+\beta_1X_1+\dots+\beta_pX_p+\epsilon \]

We interpret \(\beta_j\) as the average effect on Y of a one unit increase in \(X_j\) , holding all other predictors fixed.

estimating the regression coefficients

We choose \(\beta_0, \beta_1,\dots,\beta_p\) to minimize the sum of squared residuals.

some important questions

when performing multiple linear regression, we are usually interested in answering a few important questions (a brief worked sketch follows the list below):

  • is at least one of the predictors useful in predicting the response?
    • F-test
  • do all the predictors help to explain Y, or is only a subset of the predictors useful?
    • forward selection
    • backward selection
    • mixed selection
  • how well does the model fit the data?
    • RSE
    • \(R^2\)
  • given a set of predictor values, what response value should we predict, and how accurate is our prediction?
    • confidence intervals
    • prediction intervals
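A hedged sketch touching each question; the toy data and the use of statsmodels are assumptions of the example, not part of the book's text:

```python
# Sketch: F-test, R^2, and confidence/prediction intervals for a multiple regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 200
X = rng.standard_normal((n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)   # X3 is irrelevant

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.fvalue, fit.f_pvalue)          # is at least one predictor useful?
print(fit.rsquared)                      # how well does the model fit?
print(fit.params)                        # which predictors matter (with fit.pvalues)?

new_x = np.array([[1.0, 0.0, 0.0, 0.0]])          # intercept column plus X1=X2=X3=0
print(fit.get_prediction(new_x).summary_frame())  # confidence and prediction intervals
```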

other considerations in the regression model

qualitative predictors

So far, we have assumed that all variables in our linear regression model are quantitative. In practice, this is not necessarily the case.

  • predictors with two levels: indicator or dummy variables using one-hot encoding
  • qualitative predictors with more than two levels

extensions of the linear model

  • additive: we assume the association between a predictor \(X_j\) and the response Y does not depend on the values of the other predictors
  • linear: the change in response associated with a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).

  • interaction terms
  • non-linear relationships: polynomial functional forms
  • collinearity: inspect the correlation matrix for pairs of predictors; a better way uses the variance inflation factor (VIF)
  • outliers: unusual \(y_i\)
  • high leverage points: unusual \(x_i\)

chap 4. Classification

There are many classification techniques, or classifiers: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors.

  • the discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular, Poisson regression.
  • more computer-intensive classification methods: generalized additive models, trees, random forests, and boosting; and support vector machines.

Why not linear regression?

Suppose we have a qualitative response with three levels: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable as follows:

\[ Y= \begin{cases} {1}, & \text{if stroke;} \\ {2}, & \text{if drug overdose;} \\ {3}, & \text{if epileptic seizure.} \\ \end{cases} \]

Least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors. However, this encoding implies an ordering on the outcomes, placing drug overdose in between stroke and epileptic seizure and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice, there is no particular reason that this needs to be the case.

Even with only two classes, say a binary response with 0/1 coding where least squares is not completely unreasonable, a regression method has some problems: it can be shown that the \(X\hat{\beta}\) obtained using linear regression is an estimate of the probability of Y given X in this case. However, if we use linear regression, some of our estimates might be outside the [0,1] interval, making them hard to interpret as probabilities.

There are at least two reasons not to perform classification using a regression method:
  • a regression method cannot accommodate a qualitative response with more than two classes
  • a regression method will not provide meaningful estimates of Pr(Y|X), even with just two classes.

logistic regression

logistic regression models the probability of Y given X, which can be written as

\[ p(X)=Pr(Y=1|X) \] The values of p(X) will range between 0 and 1, then for any given value of X, a prediction can be made for Y.

The logistic model

How should we model the relationship between \(p(X)=Pr(Y=1|X)\) and X? we consider using a linear regression model to represent these probabilities: \[ p(X)=\beta_0+\beta_1X \]

The problem with this approach is that for X close to 0 we predict a negative probability of Y; if we were to predict for very large X, we would get values bigger than 1. These predictions are not sensible, since the true probabilities must fall between 0 and 1.

To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this requirement. In logistic regression, we use the logistic function: \[ p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \] To fit the model, we use a method called maximum likelihood. We rewrite the logistic function: \[ \frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X} \]

The quantity \(\frac{p(X)}{1-p(X)}\) is called the odds, taking values in \([0, \infty)\). By taking the logarithm of both sides, we arrive at \[ log(\frac{p(X)}{1-p(X)})=\beta_0+\beta_1X \] The left-hand side is called the log odds or logit. We see that the logistic regression model has a logit that is linear in X. What we can observe from the relationship between p(X) and X:

  • the relationship is not linear

  • the change in p(X) caused by one unit change in X depends on the actual value of X

  • p(X) and X are positively associated if \(\beta_1\) is positive, and vice versa.

estimating the regression coefficients

The coefficients are unknown and must be estimated from the available training data. Non-linear least squares could be used to fit the model, but maximum likelihood is preferred because of its better statistical properties. We seek estimates for \(\beta_0, \beta_1\) such that the predicted probability \(\hat{p}(x_i)\) for each individual corresponds as closely as possible to the individual's observed response. In other words, we try to find estimates such that plugging them into the model for p(X) yields a number close to 1 for all individuals who belong to group 1, and a number close to 0 for all individuals who belong to group 0. This is formalized through a likelihood function: \[ \ell(\beta_0, \beta_1)=\prod_{i:y_i=1} p(x_i)\prod_{i':y_{i'}=0}(1-p(x_{i'})) \] The estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are chosen to maximize this likelihood function. Maximum likelihood is a very general approach that is used to fit many non-linear models. In the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. One can use qualitative predictors with the logistic regression model via the dummy variable approach.
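Below is a sketch of the fit; the simulated data are assumptions, and scikit-learn's LogisticRegression (with a very large C to approximate unpenalized maximum likelihood) stands in for the book's fitting routine.

```python
# Sketch: logistic regression coefficients estimated by (approximate) maximum likelihood.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2000
x = rng.standard_normal((n, 1))
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x[:, 0])))    # assumed true beta0 = -1, beta1 = 2
y = rng.binomial(1, p)

fit = LogisticRegression(C=1e6).fit(x, y)        # large C ~ essentially no regularization
print(fit.intercept_, fit.coef_)                 # estimates of beta0, beta1

# log-likelihood of the fitted model, matching the likelihood in the text
p_hat = fit.predict_proba(x)[:, 1]
print(np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)))
```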

multiple logistic regression

The results obtained using one predictor may be different from those obtained using multiple predictors when there is correlation among the predictors. In general, the phenomenon is known as confounding.

multinomial logistic regression

when we have a response variable with more than two levels, we extend the two-level case to multinomial logistic regression. To do this, we first select a single level (the Kth level) to serve as the baseline, and then we rewrite the model : \[ Pr(Y=k|X=x)=\frac{e^{\beta_{k0}+\beta_{k1}x_1+\dots+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+\dots+\beta_{lp}x_p}} \] for \(k=1,\dots,K-1\), and \[ Pr(Y=K|X=x)=\frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+\dots+\beta_{lp}x_p}} \] it can be shown that for \(k=1,\dots,K-1\): \[ log\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}=\beta_{k0}+\beta_{k1}x_1+\dots+\beta_{kp}x_p \] This equation again indicates that the log odds between any pair of classes is linear in the features/predictors.

generative models for classification

In statistical jargon, logistic regression models the conditional distribution of the response Y given the predictors. We now consider an alternative and less direct approach to estimating these probabilities: we model the distribution of the predictors X separately in each of the response classes, and then use Bayes' theorem to flip these around into estimates for Pr(Y=k|X=x).

why do we need another method when we already have logistic regression?

  • when there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.
  • if the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then these approaches may be more accurate than logistic regression.
  • these methods can be naturally extended to the case of more than two response classes.

suppose we wish to classify an observation into one of K (\(\ge2\)) classes, meaning the qualitative response Y can take on K possible distinct and un-ordered values. Let \(\pi_k\) denote the prior probability that a randomly chosen observation comes from the kth class, and \(f_k(X)=Pr(X|Y=k)\) denote the density function of X for an observation that comes from the kth class. According to Bayes’ theorem: \[ Pr(Y=k|X=x)=\frac{\pi_kf_k(x)}{\sum_{l=1}^{K}\pi_lf_l(x)} \] this is the posterior probability that an observation \(X=x\) belongs to the kth class. That is, it is the probability that the observation belongs to the kth class, given the predictor value for that observation. The posterior probability equation suggests that instead of directly computing the posterior probability, we can simply plug in estimates of \(\pi_k\) and \(f_k(x)\) into it. In general, estimating \(\pi_k\) is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the kth class. However, estimating the density function \(f_k(x)\) is much more challenging.

classifiers with different estimates of the density function

  • linear discriminant analysis for p=1
    • assumption: \(f_k(x)\) is normal or Gaussian
    • estimates omitted

The LDA classifier plugs in the estimates and assigns an observation \(X=x\) to the class for which \[ \hat{\delta}_k(x)=x\frac{\hat{\mu}_k}{\hat{\sigma}^2}-\frac{\hat{\mu}_k^2}{2\hat{\sigma}^2}+log(\hat{\pi}_k) \] is largest. The word linear in the classifier's name stems from the fact that the discriminant functions \(\hat{\delta}_k(x)\) are linear functions of x, as opposed to a more complex function of x.

  • linear discriminant analysis for p>1
    • assumption: we assume that \(\vec{X}=(X_1, \dots, X_p)\) is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix.

The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors. In mathematical terms, we write \(X \sim N(\mu, \mathbb{\Sigma})\). Here \(E(X)=\mu\) is the mean of X (a vector of p components), and \(Cov(X)=\mathbb{\Sigma}\) is the \(p \times p\) covariance matrix of X.

in practice, a binary classifier can make two types of errors:

  • incorrectly assigning an individual who belongs to class 1 to class 2
  • incorrectly assigning an individual who belongs to class 2 to class 1

A confusion matrix is a convenient way to display this information.

  • sensitivity: the percentage of true Yes observations that are correctly identified
  • specificity: the percentage of No observations that are correctly identified

The Bayes classifier works by assigning an observation to the class for which the posterior probability \(p_k(X)\) is greatest. In the two-class case this amounts to assigning an observation to class 1 if the posterior probability of class 1 is greater than 0.5. However, if we are more concerned about incorrectly predicting the class status of individuals who truly belong to class 1, then we can consider lowering the threshold. The choice of the best threshold must be based on domain knowledge, such as detailed information about the costs associated with class 1.

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds. ROC is an acronym for receiver operating characteristics. The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. We expect a classifier that performs no better than chance to have an AUC of 0.5 on test sets. ROC curves are useful for comparing different classifiers since they take into account all possible thresholds.

When we apply a classifier, the possible results are:

  • true positive (TP) and true negative (TN)
  • false positive (FP) and false negative (FN)

  • false positive rate = FP/N, type I error, 1-specificity
  • true positive rate = TP/P, 1-type II error, power, sensitivity
  • positive predictive value = TP/P*, precision, 1-false discovery proportion (P* is the number of predicted positives)
  • negative predictive value = TN/N* (N* is the number of predicted negatives)
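A sketch tying these quantities together (the labels and scores are simulated assumptions; scikit-learn's metrics are used for convenience):

```python
# Sketch: confusion matrix, sensitivity, specificity, and AUC at a chosen threshold.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(8)
y_true = rng.binomial(1, 0.3, 500)
scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.2, 500), 0, 1)   # fake Pr(Y=1|X)

threshold = 0.5                       # lower it to trade specificity for sensitivity
y_pred = (scores > threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # true positive rate (power)
specificity = tn / (tn + fp)          # 1 - false positive rate
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, "
      f"AUC={roc_auc_score(y_true, scores):.3f}")
```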

quadratic discriminant analysis

LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Quadratic discriminant analysis (QDA) provides an alternative approach.

  • assumption: observations are drawn from a Gaussian distribution
  • unlike LDA, QDA assumes that each class has its own covariance matrix.

that is, QDA assumes that an observation from the kth class is of the form \(X \sim N(\mu_k, \Sigma_k)\), where \(\Sigma_k\) is a covariance matrix for the kth class. For QDA, the quantity x appears as a quadratic function in the discriminant function; this is where QDA gets its name.

Why assume different covariance matrices for different classes of Y? Or why prefer LDA over QDA? The answer lies in the bias-variance trade-off.

  • when there are \(p\) predictors, then estimating a covariance matrix requires estimating \(p(p+1)/2\) parameters.
  • QDA estimates a separate covariance matrix for each class, for a total of \(Kp(p+1)/2\) parameters.
  • LDA estimates \(Kp\) linear coefficients

Consequently, LDA is a much less flexible classifier than QDA, and has substantially lower variance, which can potentially lead to improved prediction performance. However, if the assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias.

In general, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. QDA is recommended if the training data set is very large where the variance of the classifier is not a major concern, or the assumption of a common covariance matrix for all K classes is clearly untenable.

naive Bayes

The naive Bayes classifier takes a different tack for estimating \(f_1(x), f_2(x), \dots, f_K(x)\). Instead of assuming that these functions belong to a particular family of distributions, e.g., the multivariate normal distribution, it makes a single assumption: within the kth class, the p predictors are independent.

Stated mathematically, this assumption means that for \(k=1,\dots, K\), \[ f_k(x)=f_{k1}(x_1) \times f_{k2}(x_2) \times \dots \times f_{kp}(x_p) \] where \(f_{kj}\) is the density function of the jth predictor among observations in the kth class. This is a powerful assumption. Estimating a p-dimensional density function is challenging because we must consider not only the marginal distribution of each predictor (that is, the distribution of each predictor on its own) but also the joint distribution of the predictors (that is, the association among the different predictors). In the case of the multivariate normal distribution, the association among predictors is summarized by the off-diagonal elements of the covariance matrix. In general, however, this association can be hard to characterize, and exceedingly challenging to estimate. By assuming independence of the predictors, we completely eliminate the need to worry about the association.

In most settings, we do not believe the naive Bayes assumption that all predictors are independent. Essentially, the naive Bayes assumption introduces some bias but reduces variance, leading to a classifier that works quite well in practice as a result of the bias-variance trade-off. When p is large or n is small, so that reducing variance is very important, naive Bayes presents a greater pay-off relative to LDA or QDA.
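As a quick illustration (the Gaussian simulation and the scikit-learn implementations are assumptions of this sketch), the four classifiers can be compared on the same simulated data:

```python
# Sketch: LDA, QDA, naive Bayes, and logistic regression on the same simulated classes.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 500
cov0, cov1 = [[1, 0.5], [0.5, 1]], [[1, -0.5], [-0.5, 1]]   # class-specific covariances

def draw(m):   # one data set: two Gaussian classes with different covariance matrices
    X = np.vstack([rng.multivariate_normal([0, 0], cov0, m),
                   rng.multivariate_normal([1, 1], cov1, m)])
    return X, np.repeat([0, 1], m)

X, y = draw(n)
X_test, y_test = draw(n)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis()),
                  ("naive Bayes", GaussianNB()),
                  ("logistic", LogisticRegression())]:
    err = np.mean(clf.fit(X, y).predict(X_test) != y_test)
    print(f"{name:12s} test error {err:.3f}")
```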

comparison of classification methods

To compare performance of LDA, QDA, naive Bayes, and logistic regression, we consider these approaches in a setting with K classes, so that we assign an observation to the class that maximizes Pr(Y=k|X=x). Equivalently, we can set K as the baseline class and assign an observation to the class that maximizes \[ log\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)} \] for \(k=1,\dots,K\).

comparison

  • LDA is a special case of QDA with \(c_{kjl}=0\) for all \(j, l=1, \dots, p, k=1,\dots,K\), which is not surprising since LDA is simply a restricted version of QDA with \(\Sigma_1=\Sigma_2=\dots=\Sigma_K=\Sigma\).
  • Any classifier with a linear decision boundary is a special case of naive Bayes with \(g_{kj}(x_j)=b_{kj}(x_j)\). This means LDA is a special case of naive Bayes.
  • naive Bayes is a special case of LDA when each \(f_{kj}(x_j)\) is modeled as a one-dimensional Gaussian, i.e., when the shared covariance matrix \(\Sigma\) is diagonal.
  • neither QDA nor naive Bayes is a special case of the other. QDA has the potential to be more accurate in settings where interactions among predictors are important in discriminating classes.

None of these methods dominates the others: in any setting, the choice of method will depend on the true distribution of the predictors in each of the K classes, as well as other considerations, such as the values of n and p.

  • logistic regression: the posterior probability function is identical to that of LDA. We expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and we expect logistic regression to perform better when it does not.

K-nearest neighbors (KNN)

No assumptions are made in KNN so it’s a non-parametric approach: no assumptions are made about the shape of the decision boundary.

  • because KNN is completely non-parametric, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear, provided that n is very large and p is small
  • in order to provide accurate classification, KNN requires a lot of observations relative to the number of predictors.
  • in settings where the decision boundary is non-linear but n is only modest, or p is not very small, QDA may be preferred to KNN because QDA can provide a non-linear boundary while taking a parametric form, meaning it requires a smaller sample size for accurate classification, relative to KNN.
  • unlike logistic regression, KNN does not tell us which predictors are important: we don't get a table of coefficients as from the model summary of a logistic regression model.

generalized linear models

Sometimes we have a response variable that is neither qualitative nor an ordinary quantitative variable, such as a count, so that neither linear regression nor the classification methods above fit well. Suppose we have counts of bikers as Y; problems with fitting a linear regression model are:

  • negative predictions
  • violation of constant variance assumption
  • continuous predictions instead of integers

Poisson regression

Suppose a random variable Y takes on non-negative integer values, and it follows the Poisson distribution; then \[ Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}, k=0,1,2\dots \] An important property of the Poisson distribution is \(E(Y)=Var(Y)=\lambda\). Instead of modeling the count directly, we allow the mean to vary as a function of the covariates. In particular, we consider the following model for the mean \(\lambda=E(Y)\), which we write as \(\lambda(X_1,\dots,X_p)\) to emphasize that it is a function of the covariates \(X_1,\dots,X_p\): \[ log\lambda(X_1,\dots,X_p)=\beta_0+\beta_1X_1+\dots+\beta_pX_p \] We model the log of \(\lambda(X_1,\dots,X_p)\) as linear in the predictors in order to ensure that \(\lambda(X_1,\dots,X_p)\) takes on non-negative values for all values of the covariates. To estimate the coefficients \(\beta_0,\beta_1,\dots,\beta_p\), we use the same maximum likelihood approach that we adopted for logistic regression. Specifically, given n observations from the Poisson regression model, the likelihood takes the form \[ \ell(\beta_0, \beta_1, \dots,\beta_p)=\prod_{i=1}^{n}\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!} \] where \(\lambda(x_i)=e^{\beta_0+\beta_1x_{i1}+\dots+\beta_px_{ip}}\). We choose the coefficients that maximize the likelihood \(\ell(\beta_0, \beta_1, \dots,\beta_p)\).
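A brief sketch of the fit (the simulated counts and the statsmodels GLM call are assumptions of this example):

```python
# Sketch: Poisson regression fit by maximum likelihood via a GLM with a log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 1000
x = rng.uniform(0, 1, (n, 1))
lam = np.exp(0.5 + 1.5 * x[:, 0])     # log lambda is linear in x (assumed truth)
y = rng.poisson(lam)

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(fit.params)                      # estimates of beta0, beta1
print(np.exp(fit.params[1]))           # multiplicative change in E(Y) per unit of x
```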

  • interpretation: an increase in \(X_j\) by one unit is associated with a change in \(E(Y)=\lambda\) by a factor of \(exp(\beta_j)\).
  • mean-variance relationship: by modeling with Poisson regression, we implicitly assume that, in the Bike example, mean bike usage in a given hour equals the variance of bike usage during that hour. By contrast, under a linear regression model, the variance of bike usage always takes on a constant value. In general, the Poisson regression model is able to handle the mean-variance relationship in a way that the linear regression model is not.
  • non-negative fitted values: there are no negative predictions using the Poisson regression model. By contrast, when we fit a linear regression model to the Bike data set, almost 10% of the predictions are negative.
  • over-dispersion: in the Bike example, the variance appears to be much higher than the mean, a situation referred to as over-dispersion. This causes the reported z-values to be inflated.

generalized linear models in greater generality

we have covered three types of regression models: linear, logistic, and Poisson. they share some characteristics:

  • each approach uses predictors \(X_1, \dots, X_p\) to predict a response Y. we assume that conditional on \(X_1, \dots, X_p\), Y belongs to a certain family of distributions. for linear regression, we typically assume that Y follows a Gaussian or normal distribution; for logistic regression, we assume that Y follows a Bernoulli distribution; for Poisson regression, we assume that Y follows a Poisson distribution.
  • each approach models the mean of Y as a function of the predictors. in linear regression, the mean of Y takes the form \[ E(Y|X_1, \dots, X_p)=\beta_0+\beta_1X_1+\dots+\beta_pX_p \] i.e., it is a linear function of the predictors. For logistic regression, the mean takes the form \[ E(Y|X_1, \dots, X_p)=Pr(Y=1|X_1, \dots, X_p)=\frac{e^{\beta_{0}+\beta_{1}x_1+\dots+\beta_{p}x_p}}{1+e^{\beta_{0}+\beta_{1}x_1+\dots+\beta_{p}x_p}} \]

while for Poisson regression it takes the form \[ E(Y|X_1, \dots, X_p)=\lambda(X_1, \dots, X_p)=e^{\beta_{0}+\beta_{1}x_1+\dots+\beta_{p}x_p} \] The equations above can be expressed using a link function, \(\eta\), which applies a transformation to \(E(Y|X_1, \dots, X_p)\) so that the transformed mean is a linear function of the predictors. That is, \[ \eta(E(Y|X_1, \dots, X_p))=\beta_{0}+\beta_{1}x_1+\dots+\beta_{p}x_p \]

The link functions for linear, logistic, and Poisson regression are, respectively:

  • \(\eta(\mu)=\mu\)
  • \(\eta(\mu)=log(\frac{\mu}{1-\mu})\)
  • \(\eta(\mu)=log(\mu)\)

The Gaussian, Bernoulli, and Poisson distributions are all members of a wider class of distributions, known as the exponential family. Other well-known members of this family include the exponential distribution, the Gamma distribution, and the negative binomial distribution. In general, we can perform a regression by modeling the response Y as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors via the link function. Any regression approach that follows this very general recipe is known as a generalized linear model (GLM).

chap 5. Resampling

Re-sampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model on each sample to obtain additional information about the fitted model.

For example:

  • to estimate the variability of a linear regression fit, we repeatedly draw different samples from the training data set, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ.
  • such information would not be available from fitting the model only once using the original training sample.

Two of the most widely used re-sampling methods are

  • cross-validation: used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility.
  • bootstrap: used to measure the accuracy of a parameter estimate or of a given statistical learning method.

The process of evaluating a model’s performance is known as model assessment, whereas the process of choosing the proper level of flexibility for a model is known as model selection.

cross-validation

  • test error rate: the average error that results from using a statistical learning method to predict the response on a new observation.

The test error rate can be easily calculated if a designated test set is available. However, this is not usually the case. In contrast, the training error rate can be obtained by applying the statistical learning method to the observations used in its training.

  • training error rate: quite different from the test error rate, and can dramatically underestimate the test error rate.

In the absence of a very large designated test set, a number of techniques can be used to estimate this quantity using the available training data. Some methods make an adjustment to the training error rate to estimate the test error rate; others do so by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.

validation set approach

The validation set approach randomly divides the available set of observations into two parts, a training set and a validation set or hold-out set. The statistical model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate (the mean squared error in the case of a quantitative response) provides an estimate of the test error rate.

  • the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
  • only observations included in the training set are used to fit the model. That is, only part of the original training observations are used to fit the model. Statistical learning methods tend to perform worse when trained on fewer observations, which suggests that the validation set error rate may overestimate the test error rate for the model fit on the entire data set.

cross-validation

Cross-validation is a refinement of the validation set approach that addresses these two issues.

leave-one-out cross-validation

leave-one-out cross-validation (LOOCV) is closely related to the validation set approach, but it attempts to address that method's drawbacks. Like the validation set approach, LOOCV involves splitting the set of observations into two parts. Instead of splitting the data set into two comparable subsets, however, LOOCV uses a single observation for validation and the remaining observations for training. That is, the statistical learning method is fit on the \(n-1\) training observations, and a prediction \(\hat{y_i}\) is made for the one held-out observation using its value \(x_i\).

However, an estimate based on a single validation observation is highly variable, so we repeat the procedure \(n\) times, holding out a different observation each time, and calculate \(n\) squared errors. The LOOCV estimate for the test MSE is the average of these \(n\) test error estimates: \[ \hat{MSE}=CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}MSE_i \] LOOCV has major advantages over the validation set approach:

  • far less bias: by using as many as \(n-1\) observations to fit the model, LOOCV tends not to overestimate the test error rate as much as the validation set approach, which fits on only about half of the observations.
  • the validation set approach yields different results due to randomness in the training/validation set split, while performing LOOCV multiple times will always yield the same result, since there is no randomness in the training/validation set splits.

k-fold cross-validation

K-fold cross-validation involves randomly dividing all observations into k groups, or folds, of approximately equal size. The first fold is used as the validation set, and the remaining \(k-1\) folds are used to train the statistical learning model. Then the MSE is calculated for the held-out first fold. This procedure is repeated \(k\) times, with a different fold treated as the validation set each time. This process produces \(k\) estimates of the test error, \(MSE_1, MSE_2,\dots, MSE_k\), and we have \[ CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i \]

It is not hard to see that LOOCV is a special case of k-fold cross-validation (CV) in which \(k\) is set to equal \(n\). Typically, we perform \(k=5\) or \(k=10\) CV. The advantage of doing this rather than \(k=n\) is computational ease: LOOCV requires fitting the statistical learning method n times, which can be computationally expensive, especially for algorithms with computationally intensive fitting procedures or an extremely large number of observations. In contrast, performing 10-fold CV requires fitting the learning procedure only ten times, which may be much more feasible.

  • moreover, although there is some variability in the CV estimates due to how the observations are divided into the ten folds, this variability is typically much lower than the variability in the test error estimates from the validation set approach.
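The sketch below (toy data assumed; scikit-learn used for the splitting machinery) estimates the test MSE of the same linear model with the validation set approach, 10-fold CV, and LOOCV:

```python
# Sketch: validation-set, 10-fold CV, and LOOCV estimates of test MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)

rng = np.random.default_rng(11)
X = rng.uniform(0, 10, (200, 1))
y = 1 + 2 * X[:, 0] + rng.standard_normal(200)
model = LinearRegression()

# validation set approach: a single random split
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
val_mse = np.mean((model.fit(X_tr, y_tr).predict(X_va) - y_va) ** 2)

# 10-fold CV and LOOCV (scikit-learn reports negative MSE as the score)
kfold_mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error",
                             cv=KFold(10, shuffle=True, random_state=0)).mean()
loocv_mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error",
                             cv=LeaveOneOut()).mean()
print(f"validation {val_mse:.3f}, 10-fold {kfold_mse:.3f}, LOOCV {loocv_mse:.3f}")
```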

There are two scenarios in the statistical learning process:

  • to determine how well a given statistical learning procedure can be expected to perform on independent data: here, the actual estimate of the test MSE is of interest.
  • to locate the minimum point of the estimated test MSE curve: here, the location of the minimum point is important, but the actual value of the estimated test MSE is not.

bias-variance trade-off for k-fold cross-validation

In addition to advantage on computation, k-fold CV has another less obvious but potentially more important advantage over LOOCV. That is, k-fold CV often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off.

Bias:

  • the validation set approach tends to overestimate the test error rate, since the training set used to fit the statistical learning model contains only about half of the observations.
  • LOOCV, in contrast, gives approximately unbiased estimates of the test error, since each training set contains \(n-1\) observations, almost as many as in the entire data set.
  • performing k-fold CV (k<n) leads to an intermediate level of bias, since each training set contains approximately \((k-1)n/k\) observations, fewer than in the LOOCV approach.

Therefore, from the perspective of bias reduction, LOOCV is to be preferred to k-fold CV. However, bias is not the only source of concern in an estimating procedure; we must consider the procedure’s variance. From variance perspective, LOOCV has higher variance than k-fold CV (k<n). Why?

Variance:

  • the training sets in a LOOCV procedure are almost identical to each other, so the resulting fits (and their error estimates) are highly correlated.
  • in contrast, k-fold CV with k<n uses training sets that are less similar to each other, with smaller overlap between training sets.
  • in statistics, the mean of highly correlated quantities has higher variance than does the mean of quantities that are not highly correlated. When variables are highly correlated, they tend to move together in a similar direction; the individual deviations from the mean tend to align and reinforce each other, which results in a larger overall spread of the values around the mean and therefore a higher variance. On the other hand, if the quantities are not highly correlated, their movements are more independent of each other, and the deviations from their means are likely to cancel each other out to some extent when the mean is taken.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Using k=5 or k=10 has been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

cross-validation on classification problems

cross-validation is a very general technique that can be applied not only to regression problems but also to classification problems. In classification settings, instead of using the MSE, we use the number of mis-classified observations. The LOOCV error rate, for example, takes the form: \[ CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}Err_i \] where \(Err_i=I(y_i \ne \hat{y}_i)\). The k-fold CV error rate and validation set error rate are defined analogously.

bootstrap

The bootstrap is an extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

  • the bootstrap can be used to estimate the standard errors of the coefficients from a linear regression fit.
  • the bootstrap is powerful because it can be applied to a wide range of statistical learning methods for which a measure of variability is otherwise difficult to obtain and is not automatically output by statistical software.

Example: suppose we wish to invest in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities. We will invest a fraction \(\alpha\) of our money in X, and will invest the remaining \(1-\alpha\) in Y. Since there is variability associated with the returns on these two assets, we wish to choose \(\alpha\) to minimize the total risk, or variance, of our investment. That is, we want to minimize \(Var(\alpha X+(1-\alpha)Y)\). It can be shown that the value that minimizes the risk is given by \[ \alpha=\frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}} \] where \(\sigma^2_X=Var(X), \sigma^2_Y=Var(Y), \sigma_{XY}=Cov(X,Y)\).

in reality, the quantities \(\sigma^2_X, \sigma^2_Y, \sigma_{XY}\) are unknown. We can compute estimates of these quantities, \(\hat{\sigma}^2_X,\hat{\sigma}^2_Y, \hat{\sigma}_{XY}\), using a data set that contains measurements for X and Y. We can then estimate the value of \(\alpha\) using \[ \hat{\alpha}=\frac{\hat{\sigma}^2_Y-\hat{\sigma}_{XY}}{\hat{\sigma}^2_X+\hat{\sigma}^2_Y-2\hat{\sigma}_{XY}} \]

we simulate 100 pairs of returns for the investments X and Y, and then use these returns to estimate \(\sigma^2_X, \sigma^2_Y, \sigma_{XY}\), which we then substitute into the estimator equation to obtain an estimate of \(\alpha\).

It's natural to wish to quantify the accuracy of our estimate of \(\alpha\). To estimate the standard deviation of \(\hat{\alpha}\), we repeat the process of simulating 100 paired observations of X and Y and estimating \(\alpha\) 1000 times. We thereby obtain 1000 estimates of \(\alpha\), \(\hat{\alpha_1}, \hat{\alpha_2}, \dots, \hat{\alpha_{1000}}\), and we calculate the mean and standard deviation of the estimates: \[ \begin{aligned} \bar{\alpha}=\frac{1}{1000}\sum_{i=1}^{1000}\hat{\alpha_i} \\ \rm SE_{\hat{\alpha}}=\sqrt{\frac{1}{1000-1}\sum_{i=1}^{1000}(\hat{\alpha}_i-\bar{\alpha})^2} \end{aligned} \]

this gives us a very good idea of the accuracy of \(\hat{\alpha}\): we would expect \(\hat{\alpha}\) to differ from \(\alpha\) by approximately \(\rm SE_{\hat{\alpha}}\), on average.

In practice, however, the procedure for estimating \(\rm SE(\hat{\alpha})\) above cannot be applied, because we cannot generate new samples from the original population. This is where the bootstrap comes into play. The bootstrap allows us to use a computer to emulate the process of obtaining new sample sets, so that we can estimate the variability of \(\hat{\alpha}\) without generating additional samples. Instead of repeatedly drawing independent data sets from the population, we obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.
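A sketch of the bootstrap for this portfolio example (the simulated returns are assumptions; the estimator follows the formula above):

```python
# Sketch: bootstrap SE of alpha_hat by resampling the original data with replacement.
import numpy as np

rng = np.random.default_rng(12)
n = 100
returns = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], n)   # columns: X, Y

def alpha_hat(data):
    x, y = data[:, 0], data[:, 1]
    sxy = np.cov(x, y)[0, 1]
    return (y.var(ddof=1) - sxy) / (x.var(ddof=1) + y.var(ddof=1) - 2 * sxy)

boot = np.array([alpha_hat(returns[rng.integers(0, n, n)]) for _ in range(1000)])
print(f"alpha_hat = {alpha_hat(returns):.3f}, bootstrap SE = {boot.std(ddof=1):.3f}")
```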

The bootstrap can be used to estimate the \(\rm SE\)s of the regression coefficients, so how does it compare to the estimates obtained using the summary() function or the standard formula (based on the RSS)? In fact, the bootstrap likely provides a more accurate estimate of the standard errors of the regression coefficients, since it does not rely on the assumptions (such as a correctly specified model) behind the standard formulas.

chap 6. linear model selection and regularization

The linear model has distinct advantages in terms of inference and, on real-world problems, is often surprisingly competitive compared with non-linear methods.

In practice, the assumptions of the linear model are often violated. Here we learn ways in which the simple linear model can be improved by replacing least squares fitting with alternative fitting procedures.

Why do we want to use another fitting procedure instead of least squares? Alternative fitting procedures can yield better:

  • prediction accuracy
  • model interpretability

  • prediction accuracy: given that the true relationship is approximately linear, the least squares estimates will have low bias. When \(n \gg p\), the least squares estimates also have low variance. However, if the number of observations is not much larger than the number of predictors, then the least squares estimates can have a lot of variability. When \(p > n\), there is no longer a unique least squares coefficient estimate, and the method cannot be used at all.

  • model interpretability: irrelevant variables lead to unnecessary complexity in the resulting model; by removing these variables (that is, by setting the corresponding coefficient estimates to 0) we obtain a model that is more easily interpretable.

there are many alternatives to using least squares to fit \[ Y = \beta_0+\beta_1X_1+\dots+\beta_pX_p+\epsilon \]

  • subset selection: this approach involves identifying a subset of predictors that are believed to be related to the response. we then fit the model using least squares on the reduced set of predictors

  • shrinkage: this approach involves fitting a model using all predictors. However, the estimated coefficients are shrunken towards 0 relative to the least squares estimates. this shrinkage (aka regularization) has the effect of reducing variance. depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly 0. therefore, shrinkage methods can also perform variable selection.

  • dimension reduction: this approach involves projecting the p predictors into an M-dimensional subspace, where \(M < p\). This can be achieved by computing M different linear combinations, or projections, of the p variables. Then these M projections are used as predictors to fit the linear regression model by least squares.

subset selection

best subset + stepwise model selection procedures

best subset selection

to perform best subset selection, we fit a separate least squares regression for each possible combination of the predictors. That is, we fit p models that contain exactly one predictor, \(\binom{p}{2}=p(p-1)/2\) models that contain exactly 2 predictors, and so forth. We then look at all of the resulting models to identify the one that is best.

the algorithm is described below:

  • let \(M_0\) denote the null model, with no predictors. this model simply predicts the sample mean for each observation.
  • for \(k \in 1,2,\dots,p\):
    • fit all \(\binom{p}{k}\) models that contain exactly k predictors
    • pick the best among the \(\binom{p}{k}\) models, and call it \(M_k\). best is defined as having the smallest residual sum of squares (RSS), or equivalently the largest \(R^2\) (both measures of training error).
  • select a single best model from among \(M_0, \dots, M_p\) using the cross-validated test error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

in the case of logistic regression, instead of ordering models by RSS in step 2, we use deviance, a measure that plays the role of RSS for a broader class of models. The deviance is negative 2 times the maximized log-likelihood; the smaller the deviance, the better the fit.

disadvantage of best subset selection:

  • computational limitation: the number of possible models (\(2^p\)) grows rapidly as p increases. If \(p=10\), there are approximately 1000 possible models to consider; if \(p=20\), there are over one million possibilities.

solution to the computational challenge: stepwise selection. For computational reasons, best subset selection cannot be applied with very large p.

forward stepwise selection

forward stepwise selection is an alternative to best subset selection as it considers a much smaller set of models. forward stepwise selection starts with no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

best subset selection fits \(2^p\) models, while forward stepwise selection fits only \(1+p(1+p)/2\) models. when \(p=20\), best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models, which is a substantial difference!

  • disadvantage of forward stepwise selection: it is not guaranteed to find the best set of predictors

backward stepwise selection

backward stepwise selection begins with the full least squares model containing all predictors, and then iteratively removes the least useful predictor, one at a time.

hybrid approaches

forward + backward stepwise selection: after adding each new variable, the method may also remove any variables that no longer provide an improvement to the model fit. such approach attempts to closely mimic best subset selection while maintaining the computational advantages of forward and backward stepwise selection.

choosing the optimal model

to use above methods to select the best model, we need a way to determine which of these models is best.

recall from chapter 2 that training error is a poor estimate of test error. therefore training RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors, since both are training-error measures.

to select the best model with respect to test error, we need to estimate test error. there are two common approaches:

  • indirectly estimate test error by adjusting the training error to account for the bias due to over-fitting.
    • \(C_p\)
    • AIC
    • BIC
    • adjusted \(R^2\)
  • directly estimate the test error, using either a validation set approach or a cross-validation approach.
    • a problem with the validation set and cross-validation approaches is that changing the validation set, or using a different k-fold split, can change which model is selected as best.
    • solution: select the model using the one-standard-error rule. we first calculate the standard error of the estimated test MSE (RSS/n) for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.

shrinkage methods

subset selection methods involve selecting a subset of the predictors and then using least squares to fit a linear model that contains that subset. an alternative is to fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards 0.

The two best-known techniques for shrinking the regression coefficients towards 0 are ridge regression and lasso.

ridge regression

recall that the least squares fitting procedure estimates \(\beta_0, \beta_1, \dots, \beta_p\) to minimize \[ \rm RSS= \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2 \]

ridge regression is similar to least squares, except that the coefficients are estimated to minimize a slightly different quantity \[ \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2+\lambda\sum_{j=1}^{p}\beta^2_j=\rm RSS + \lambda\sum_{j=1}^{p}\beta^2_j \] where \(\lambda \ge 0\) is a tuning parameter, to be determined separately. this equation trades off two different criteria:
  • as with least squares, ridge regression seeks coefficient estimates that make the RSS small.
  • the second term, \(\lambda\sum_{j=1}^{p}\beta^2_j\), called the shrinkage penalty, is small when \(\beta_1, \dots, \beta_p\) are close to 0. the tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates.

as \(\lambda \rightarrow \infty\), the impact of the shrinkage penalty grows; when \(\lambda =0\), ridge regression produces least squares estimates. For each \(\lambda\), ridge regression produces a different set of coefficient estimates. so selecting a good value for \(\lambda\) is critical.

when the ridge regression coefficient estimates are shrunken all the way to 0, this corresponds to the null model.

term: \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2 \in [0,1]\). here \(\hat{\beta}\) denotes the vector of least squares coefficient estimates; \(||\hat{\beta}||_2\) denotes the \(\ell_2\) norm of a vector, defined as \(||\beta||_2=\sqrt{\sum_{j=1}^{p}\beta_j^2}\), which measures the distance of \(\beta\) from the zero vector.

as \(\lambda\) increases, the \(\ell_2\) norm of \(\hat{\beta}_\lambda^R\) will always decrease, and so will \(||\hat{\beta}_\lambda^R||_2/||\hat{\beta}||_2\). when \(\lambda=0\), the ridge estimates equal the least squares estimates and the quantity equals 1; as \(\lambda \rightarrow \infty\), the ridge regression coefficient estimates shrink towards a vector of 0s with \(\ell_2\) norm 0, and the quantity approaches 0. a very small value of this quantity therefore indicates that the ridge regression coefficient estimates have been shrunken heavily towards 0.

attention: ridge regression is sensitive to the scaling of the predictors. in fact, the value of \(X_j\hat{\beta}_{j,\lambda}^R\) depends not only on the value of \(\lambda\), but also on the scaling of the \(j\)th predictor. therefore, it’s best to apply ridge regression after standardizing the predictors so that they all have a standard deviation of 1.
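because of this scale sensitivity, standardization is usually wrapped together with the fit; a minimal scikit-learn sketch under that assumption (the data and the value of \(\lambda\), passed as `alpha`, are placeholders).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10)) * np.array([1, 10, 100] + [1] * 7)  # predictors on mixed scales
y = X[:, 0] + 0.01 * X[:, 2] + rng.normal(size=200)

# standardize the predictors to unit standard deviation, then apply the ridge penalty
# (scikit-learn's `alpha` plays the role of the tuning parameter lambda)
model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```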

why does ridge regression improve over least squares?

  • ridge regression’s advantage over least squares is rooted in the bias-variance trade-off.
    • as \(\lambda\) increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
    • in general, when the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance, meaning a small change in the training data can cause a large change in the least squares coefficient estimates. in particular, when p is close to n, the least squares estimates will be extremely variable; if \(p > n\), the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance.
    • therefore, ridge regression works best when the least squares estimates have high variance.
    • ridge regression has computational advantages over best subset selection, which requires searching through \(2^p\) models. in contrast, for a given \(\lambda\), ridge regression fits only a single model, and it can be done very quickly.

disadvantages of ridge regression:

  • unlike best subset, forward stepwise and backward stepwise selection which generally select models that contain a subset of predictors, ridge regression will include all predictors in the final model
    • the penalty \(\lambda\sum_{j=1}^{p}\beta_j^2\) will shrink all coefficients towards 0, but it will not set any of them exactly to 0 (unless \(\lambda = \infty\)).
    • it may not be a problem for prediction accuracy, but it causes problems to interpretation especially when p is large.

lasso

the lasso is a more recent alternative to ridge regression that can drop noise variables entirely: \[ \sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij})^2+\lambda\sum_{j=1}^{p}|\beta_j|=\rm RSS + \lambda\sum_{j=1}^{p}|\beta_j| \] the lasso has a formulation similar to ridge regression; the only difference is that the \(\beta_j^2\) term in the ridge penalty is replaced with \(|\beta_j|\) in the lasso penalty. in statistical parlance, the lasso uses an \(\ell_1\) penalty instead of an \(\ell_2\) penalty. the \(\ell_1\) norm of a coefficient vector \(\beta\) is defined by \(||\beta||_1=\sum_{j=1}^{p}|\beta_j|\).

like ridge regression, lasso shrinks the coefficient estimates towards 0. however, the \(\ell_1\) penalty will force some of the coefficient estimates to be exactly 0 when the tuning parameter \(\lambda\) is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection. consequently, models generated from lasso are generally easier to interpret than those produced by ridge regression.

we say lasso yields sparse models-that is, models that contain only a subset of predictors.

  • when \(\lambda=0\), the lasso simply gives the least squares estimates
  • when \(\lambda \rightarrow \infty\), the lasso gives the null model in which all coefficient estimates are 0
  • in between two extremes, lasso is quite different from ridge regression, depending on the value of \(\lambda\), lasso can produce a model involving any number of variables.

we can re-formulate the lasso and ridge regression as constrained optimization problems: \[ \underset{\beta}{\text{minimize}} \sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jx_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s \] for the lasso, and the same problem subject to \(\sum_{j=1}^{p}\beta_j^2 \le s\) for ridge regression.

we can think of above equation as follows: when we perform lasso we are trying to find the set of coefficient estimates that lead to the smallest RSS, subject to the constraint that there is a budget of \(s\) for how large \(\sum_{j=1}^{p}|\beta_j|\) can be.

  • when \(s\) is extremely large, the budget is not very restrictive, and the coefficient estimates can be large. in fact, if s is large enough that the least squares solution falls within the budget, then the problem simply yields the least squares solution.
  • if s is small, then \(\sum_{j=1}^{p}|\beta_j|\) must be small to avoid violating the budget, so the coefficient estimates are shrunken.

lasso vs. ridge regression
  • it is clear that lasso leads to simpler and more interpretable models than ridge regression.
  • as \(\lambda\) increases, there is a drop in variance in return for an increase in bias.
  • but which one produces more accurate models or better prediction accuracy?
    • overall, neither ridge regression nor lasso will universally dominate the other.
    • lasso will perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or 0.
    • ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

selecting the tuning parameter

implementing ridge regression and lasso requires a method for selecting a value for the tuning parameter \(\lambda\), or equivalently, the value of the constraint s.

cross-validation provides a simple way to tackle this problem. we choose a grid of \(\lambda\) values, and compute the cross-validation error for each value of \(\lambda\). we then select the tuning parameter value for which the cross-validation error is smallest. finally, the model is re-fit using all observations and the selected value of the tuning parameter.
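a sketch of this grid-plus-cross-validation recipe using scikit-learn's `LassoCV` and `RidgeCV` helpers; the grid bounds and fold count are arbitrary choices, not prescriptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)

alphas = np.logspace(-3, 2, 100)          # grid of lambda values

lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=10))
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
lasso.fit(X, y)
ridge.fit(X, y)

# chosen tuning parameters; the final models are already re-fit on all observations
print(lasso.named_steps["lassocv"].alpha_, ridge.named_steps["ridgecv"].alpha_)
```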

dimension reduction methods

subset selection and shrinkage methods have controlled variance by using a subset of the original variables or by shrinking their coefficients towards 0. all of these methods are defined using the original predictors. next we explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables.

let \(Z_1, Z_2, \dots, Z_M\) denote \(M < p\) linear combinations of our original p predictors. that is \[ Z_m=\sum_{j=1}^{p}\phi_{jm}X_j \] for some constants \(\phi_{1m}, \phi_{2m}, \dots, \phi_{pm}\), \(m=1,\dots,M\). we can then fit the linear regression model \[ y_i=\theta_0+\sum_{m=1}^{M}\theta_mz_{im}+\epsilon_i, \quad i=1,\dots,n \] using least squares.

we can derive \[ \beta_j=\sum_{m=1}^{M}\theta_m\phi_{jm} \] here the \(\phi_{jm}\) are the principal component loadings, which define the directions.

in situations where p is large relative to n, selecting a value of \(M \ll p\) can significantly reduce the variance of the fitted coefficients; if \(M=p\), then no dimension reduction occurs.

all dimension reduction methods work in two steps:

  • first, the transformed predictors are obtained.
  • second, the model is fit using these M transformed predictors.

however, we can choose different ways to transform the original predictors.

principal components regression

The first principal component \(Z_1\) corresponds to the direction of the data along which the observations vary the most. that is, if we project the observations onto this line, the resulting projected observations would have the largest possible variance.

the second principal component \(Z_2\) is a linear combination of the variables that is uncorrelated with \(Z_1\), and has the largest variance subject to this constraint. it turns out that the zero-correlation condition between the two directions is equivalent to requiring that the second direction be perpendicular, or orthogonal, to the first principal component direction.
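there is no single "PCR" estimator in scikit-learn, but the two-step recipe (compute the components, then regress on them) can be sketched as a pipeline; the number of components M is a hand-picked placeholder that would normally be chosen by cross-validation.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 15))
y = X[:, :3].sum(axis=1) + rng.normal(size=100)

M = 5  # number of principal components to keep (would normally be tuned by CV)
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.named_steps["pca"].explained_variance_ratio_)   # variance captured by each component
```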

advantages of PCR:

  • PCR tends to do well in cases when the first few principal components are sufficient to capture most of the variation in the predictor as well as the relationship with the response.
    • when many principal components are required to adequately model the response, PCR performs worse.
    • overall, PCR offers a significant improvement over least squares
    • PCR and ridge regression slightly outperform lasso.
    • PCR is not performing variable selection, since each principal component used in PCR is a linear combination of all the predictors.
    • PCR and ridge regression are closely related.

disadvantage

  • PCR suffers from a drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

partial least squares

PCR involves identifying linear combinations, or directions, that best capture the variance. these directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.

partial least squares (PLS) is a supervised alternative to PCR. like PCR, PLS is a dimension reduction method, which first identifies a new set of features \(Z_1, \dots, Z_M\) that are linear combinations of the original features, and then fits a linear model via least squares using these M new features. but unlike PCR, PLS identifies these new features in a supervised way, with the response involved, so that the M new features not only approximate the original features well but are also related to the response. overall, PLS attempts to find directions that help explain both the response and the predictors.

implementation of PLS:

  • after standardizing all predictors, PLS computes the first direction \(Z_1\) by setting each \(\phi_{j1}\) equal to the coefficient from the simple linear regression of Y onto \(X_j\). hence, in computing \(Z_1=\sum_{j=1}^{p}\phi_{j1}X_j\), PLS places the highest weight on the variables that are most strongly related to the response.

  • to identify the second PLS direction,

    • we first adjust each of the variables for \(Z_1\), by regressing each variable on \(Z_1\) and taking the residuals. these residuals can be interpreted as the remaining information that has not been explained by the first PLS direction.
    • we then compute \(Z_2\) using this orthogonalized data in exactly the same fashion as \(Z_1\) was computed based on the original data.
    • this iterative approach can be repeated M times to identify multiple PLS components
    • finally, we use least squares to fit a linear model to predict Y using all PLS components.
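a minimal sketch of PLS using scikit-learn's `PLSRegression`; the number of directions M is an arbitrary placeholder that would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 15))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)

pls = PLSRegression(n_components=3)   # M = 3 supervised directions
pls.fit(X, y)                         # PLSRegression centers and scales internally by default
print(pls.predict(X[:5]).ravel())
```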

high-dimension considerations

terminology: data sets containing more features than observations are often referred to as high-dimensional. classical approaches such as least squares linear regression are not appropriate in this setting.

what goes wrong in high dimensions?

classical statistical methods such as least squares are not intended for the high-dimensional setting: when the number of features is as large as, or larger than, the number of observations (\(p \approx n\) or \(p > n\)), least squares should not be performed. the reason is that:

  • regardless of whether or not there truly is a relationship between the features and the response, least squares will yield a set of coefficient estimates that result in a perfect fit to the data, such that the residuals are 0.
    • this is problematic because this perfect fit will almost certainly lead to over-fitting.
    • that is, although it is possible to perfectly fit the training data in the high-dimensional setting, the resulting linear model will perform extremely poorly on an independent test set, and therefore does not constitute a useful model.
    • thus a simple least squares regression line is too flexible and hence overfits the data.
    • \(R^2, C_p\), AIC, BIC are all not applicable in high-dimensional setting

regression in high-dimensions

many of the methods introduced so far, such as forward stepwise selection, ridge regression, lasso, and principal components regression, are particularly useful for performing regression in the high-dimensional setting.

essentially, these methods avoid overfitting by using a less flexible fitting approach than least squares.

  • regularization or shrinkage plays a key role in high-dimensional problems.

  • appropriate tuning parameter selection is crucial for good predictive performance.

  • test error tends to increase as the dimensionality of the problem increases, unless the additional features are truly associated with the response.

    curse of dimensionality: one would expect better performance from a model fitted with more features; however, that is not generally the case.

  • adding additional signal features that are truly associated with the response will improve the fitted model, with reduced test set error.

  • adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error. this is because noise features increase the dimensionality, exacerbating the risk of over-fitting.

interpreting results in high dimensions

in high-dimensional settings, the multi-collinearity problem is extreme: any variable in the model can be written as a linear combination of all of the other variables.

essentially, this means that we can never know exactly which variable (if any) truly are predictive of the outcome. at most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.

for example, suppose that we are trying to predict blood pressure on the basis of half a million SNPs, and that forward stepwise selection suggests that 17 of those SNPs lead to a good predictive model on the training data. it would be incorrect to conclude that these 17 SNPs predict blood pressure more effectively than the other SNPs not included in the model. there are likely to be many sets of 17 SNPs that would predict blood pressure just as well as the selected model. if we were to obtain another independent data set and perform the same statistical procedure, we would likely obtain a model containing a different, and perhaps even non-overlapping, set of SNPs. however, this does not detract from the value of the first model obtained. the model may be very effective in predicting blood pressure on an independent set of patients, and might be clinically useful for physicians.

  • we must be careful not to overstate the results obtained, and to make it clear that what we have identified is simply one of many possible models for predicting blood pressure, and that it must be further validated on independent data sets.
  • it is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. we have seen that when \(p>n\), it is easy to obtain a useless model that has 0 residuals. therefore one should never use the sum of squared errors, p-values, \(R^2\) statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting.
    • it is easy to obtain an \(R^2=1\) when \(p>n\).
    • it is important to instead report results on an independent test set, or cross-validation errors. for instance, the MSE or \(R^2\) on an independent test set is a valid measure of model fit, but MSE on the training set certainly is not.

chap 7. move beyond linearity

linear models are relatively simple to describe and implement, and have advantages over other methods in terms of interpretation and inference.

however, standard linear regression can have significant limitations in terms of predictive power:

  • linearity assumption is almost always an approximation, and sometimes a poor one

we can improve upon least squares using ridge regression, lasso, principal components regression, and other techniques. in this setting, improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model.

we can relax the linearity assumption while still attempting to maintain as much interpretability as possible, using approaches such as polynomial regression and step functions, as well as more sophisticated ones such as splines, local regression, and generalized additive models.

  • polynomial regression extends linear model by adding extra predictors, obtained by raising each of the original predictors to a power. for example, a cubic regression involves three variables, \(X, X^2, X^3\) as predictors.

  • step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. this has the effect of fitting a piece-wise constant function.

  • regression splines are more flexible than polynomials and step functions, and are in fact an extension of the two. they involve dividing the range of X into K distinct regions. within each region, a polynomial function is fit to the data. however, these polynomials are constrained so that they join smoothly at the region boundaries, or knots.

  • smoothing splines are similar to regression splines, but arise in a slightly different situation. smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

  • local regression is similar to splines, but differs in an important way: the regions are allowed to overlap, and do so in a very smooth way.

  • generalized additive models allow us to extend the models above to deal with multiple predictors.

polynomial regression

using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X.

step functions

we can instead use step functions in order to avoid imposing such a global structure, by breaking the range of X into bins and fitting a different constant in each bin. this amounts to converting a continuous variable into an ordered categorical variable.

basis functions

polynomial and piecewise-constant regression models are special cases of a basis function approach. the idea is to apply a family of functions or transformations to a variable X: \(b_{1}(X), b_2(X),\dots, b_K(X)\). instead of fitting a linear model in X, we fit the model \[ y_i=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dots+\beta_Kb_K(x_i)+\epsilon_i \]

note that the basis functions are fixed and known ahead of time.

  • for polynomial regression, the basis functions are \(b_j(x_i)=x_i^j\)
  • for piecewise constant functions, they are \(b_j(x_i)=I(c_j\le x_i < c_{j+1})\)

we can think of this as a standard linear model with predictors \(b_1(x_i), b_2(x_i), \dots, b_K(x_i)\). hence, we can use least squares to estimate the unknown regression coefficients. that means all of the inference tools for linear models, such as standard errors for the coefficient estimates and F-statistics for the model’s overall significance, are available in this setting.
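as a concrete illustration of the basis-function view, here is a sketch of piecewise-constant (step function) regression fit by ordinary least squares on indicator basis functions; the data and cut points are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)

# basis functions b_j(x) = I(c_j <= x < c_{j+1}) for a grid of cut points
cuts = np.array([0.0, 2.5, 5.0, 7.5, 10.0001])
B = np.column_stack([(x >= cuts[j]) & (x < cuts[j + 1])
                     for j in range(len(cuts) - 1)]).astype(float)

# without an intercept, each coefficient is simply the fitted constant within its bin
fit = LinearRegression(fit_intercept=False).fit(B, y)
print(fit.coef_)
```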

other alternatives for basis function include:

  • use wavelets or Fourier series to construct basis functions
  • commonly used: regression splines to construct basis functions

regression splines

extension of polynomial regression and piecewise constant regression approaches

piecewise polynomials

instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression fits separate low-degree polynomials over different regions of X. for example, a piecewise cubic polynomial works by fitting a cubic regression model \[ y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i \] where the coefficients differ in different parts of the range of X. the points where the coefficients change are called knots.

for instance, a piecewise cubic with no knots is just a standard cubic polynomial with \(d=3\); a piecewise cubic polynomial with a single knot at a point c takes the form: \[ y_i=\begin{cases} \beta_{01}+\beta_{11}x_i+\beta_{21}x_i^2+\beta_{31}x_i^3+\epsilon_i & \text{if } x_i<c \\ \beta_{02}+\beta_{12}x_i+\beta_{22}x_i^2+\beta_{32}x_i^3+\epsilon_i & \text{if } x_i \ge c \end{cases} \] put another way, we fit two different polynomial functions to the data, one on the subset of observations with \(x_i<c\), and one on the subset with \(x_i \ge c\). the two functions have different regression coefficients, and each can be fit using least squares. using more knots leads to a more flexible piecewise polynomial. in general, with K knots we fit \(K+1\) different polynomial functions.

problem: when we plot the two fitted polynomial functions, the overall function is discontinuous at the knot.

solution: add a constraint

constraints and splines

to remedy this problem, we can fit a piecewise polynomial under the constraint that the fitted curve must be continuous. but it can still look unnatural at the knots. so we add further constraints: we want the piecewise polynomial not only to be continuous at the knots, but also smooth there, that is, the first and second derivatives of the polynomials must be continuous at the knots. this gives us a cubic spline. in general, a cubic spline with K knots uses a total of \(4+K\) degrees of freedom.

the general definition of a degree-d spline is that it is a piecewise degree-d polynomial, with continuity in derivatives up to degree \(d-1\) at each knot.

spline basis representation

how can we fit a piecewise degree-d polynomial under the constraint that it be continuous?

we can use the basis function model to represent a regression spline. a cubic spline with K knots can be modeled as \[ y_i=\beta_0+\beta_1b_1(x_i)+\beta_2b_2(x_i)+\dots+\beta_{K+3}b_{K+3}(x_i)+\epsilon_i \] for an appropriate choice of basis functions \(b_1, b_2, \dots, b_{K+3}\). the coefficients can then be fit using least squares.

to represent cubic splines, the most direct way is to start off with a basis for a cubic polynomial, namely \(x, x^2, x^3\), and then add one truncated power basis function per knot. a truncated power basis function is defined as \[ h(x,\xi)=(x-\xi)^3_+=\begin{cases}(x-\xi)^3 & \text{if } x>\xi\\ 0 & \text{otherwise} \end{cases} \] where \(\xi\) is the knot. adding a term of the form \(\beta_4h(x,\xi)\) to the cubic regression model leads to a discontinuity in only the third derivative at \(\xi\); the function itself, along with its first and second derivatives, remains continuous at each of the knots.

in other words, to fit a cubic spline to a data set with K knots, we perform least squares regression with an intercept and \(3+K\) predictors, of the form \(X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \dots, h(X, \xi_K)\). this amounts to estimating a total of \(K+4\) regression coefficients; fitting a cubic spline with K knots therefore uses \(K+4\) degrees of freedom.
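a sketch of this construction: the truncated power basis for a cubic spline with K = 3 knots, fit by ordinary least squares (the knot locations and data are illustrative).

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

knots = np.array([2.5, 5.0, 7.5])                    # K = 3 knots

def truncated_power(x, xi):
    return np.where(x > xi, (x - xi) ** 3, 0.0)      # h(x, xi) = (x - xi)^3_+

# design matrix: intercept, x, x^2, x^3, then one truncated power term per knot
A = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [truncated_power(x, xi) for xi in knots])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)          # K + 4 coefficients
yhat = A @ beta                                       # fitted cubic spline values
print(beta.round(3))
```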

choosing the number and locations of the knots

  • when we fit a spline, where should we place the knots?

a spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. hence, we could place more knots where we feel the function might vary most rapidly, and fewer knots where it seems more stable. in practice, however, it is common to place knots in a uniform fashion. one way to do this is to specify the desired degrees of freedom, and then have the software automatically place the corresponding number of knots at uniform quantiles of the data.

  • how many knots should we use, or equivalently how many degree of freedom should our spline contain?

one option is to try out different numbers of knots and see which produces the best-looking curve. a more objective approach is to use cross-validation: we hold out a portion of the data, fit a spline with a certain number of knots to the remaining data, and then use the spline to make predictions for the held-out data. we repeat this process until each observation has been left out once, and then compute the overall cross-validated residual sum of squares (RSS). this procedure can be repeated for different numbers of knots K, and the value of K giving the smallest RSS is chosen.

comparing spline to polynomial regression

  • polynomial regression produces undesired results at the boundaries, while natural spline still provides a reasonable fit to the data.
  • regression splines often give superior results compared with polynomial regression.
    • this is because instead of using a high degree, e.g., \(X^{15}\), to produce flexible fits, splines introduce flexibility by increasing the number of knots while keeping the degree fixed.
  • generally, splines produce more stable estimates.

in sum, to fit regression splines, we

  • specify a set of knots
  • produce a sequence of basis functions
  • estimate spline coefficients using least squares

smoothing splines

in fitting a smooth curve to a set of data, what we really want is to find some function, \(g(x)\), that fits the observed data well, meaning we want \(RSS=\sum_{i=1}^{n}(y_i-g(x_i))^2\) to be small.

however, there is a problem with this approach: we can always achieve 0 RSS simply by choosing g(x) so that it interpolates all of the \(y_i\). such a g(x) would woefully overfit the data, since it is far too flexible. what we really want is a function g(x) that makes RSS small but that is also smooth.

how might we ensure that g(x) is smooth? A natural approach is to find the function g(x) that minimizes \[ \sum_{i=1}^{n}(y_i-g(x_i))^2+\lambda \int g''(t)^2dt \] where \(\lambda\) is a non-negative tuning parameter. the function \(g(x)\) that minimizes the above objective function is known as a smoothing spline.

the objective function takes the "Loss + Penalty" form that arises in ridge regression and the lasso as well.

  • loss function: \(\sum_{i=1}^{n}(y_i-g(x_i))^2\) encourages \(g(x)\) to fit the data well
  • penalty term: encourages \(g(x)\) to be smooth; the larger the value of \(\lambda\), the smoother \(g(x)\) will be.

when \(\lambda=0\), the penalty term has no effect, so the function g(x) will be very jumpy and will exactly interpolate the training observations; when \(\lambda \rightarrow \infty\), g will be perfectly smooth: it will be just a straight line that passes as closely as possible to the training points. in this case, g is the linear least squares line, since the loss function amounts to minimizing the RSS. thus \(\lambda\) controls the bias-variance trade-off of the smoothing spline.

the g(x) that minimizes the smoothing spline criterion has some special properties: it is a piecewise cubic polynomial with knots at the unique values of \(x_1, x_2, \dots, x_n\), with continuous first and second derivatives at each knot; furthermore, it is linear in the regions outside of the extreme knots.

choosing the smoothing parameter

a smoothing spline is simply a natural cubic spline with knots at every unique value of \(x_i\).

in fitting a smoothing spline, we aim to find the value of \(\lambda\) that makes the cross-validated RSS as small as possible.

local regression

local regression is a different approach for fitting flexible non-linear functions, which involves computing the fit at a target point \(x_0\) using the nearby training observations.

algorithm:

  • gather the fraction \(s=k/n\) of training points whose \(x_i\) are closest to \(x_0\)

  • assign a weight \(K_{i0}=K(x_i, x_0)\) to each point in this neighborhood, so that the point furthest from \(x_0\) has weight 0, and the closest has the highest weight. all but these k nearest neighbors get weight 0.

  • fit a weighted least squares regression of the \(y_i\) on the \(x_i\) using the aforementioned weights, by finding \(\hat{\beta}_0, \hat{\beta}_1\) that minimize

    \[ \sum_{i=1}^{n}K_{i0}(y_i-\beta_0-\beta_1x_i)^2 \]

  • the fitted value at \(x_0\) is given by \(\hat{f}(x_0)=\hat{\beta}_0+\hat{\beta}_1x_0\)

in order to perform local regression, there are a number of choices to be made, such as how to define the weighting function K, and whether to fit a linear, constant, or quadratic regression in step 3. the most important choice is the span s, the proportion of points used to compute the local regression at \(x_0\), as defined in step 1. the span plays a role like that of the tuning parameter \(\lambda\) in smoothing splines: it controls the flexibility of the non-linear fit.
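a sketch of the algorithm above for local linear regression at a single target point; the tricube weight function and the span value are common choices but are assumptions here, not requirements.

```python
import numpy as np

def local_linear(x, y, x0, span=0.3):
    """Fit a weighted least squares line around x0 and return the fitted value."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))               # s = k/n nearest neighbors
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                          # the k closest training points
    d_max = d[idx].max()
    w = np.zeros(n)                                  # all other points get weight 0
    w[idx] = (1 - (d[idx] / d_max) ** 3) ** 3        # tricube weights; the furthest gets 0
    A = np.column_stack([np.ones(n), x])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y) # weighted least squares
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(local_linear(x, y, x0=5.0))
```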

generalized additive models

in section 7.1-7.6, we present a number of approaches for predicting a response Y on the basis of a single predictor X, which can be seen as extensions of simple linear regression.

next comes predicting Y on the basis of several predictors, \(X_1, \dots, X_p\). which amounts to an extension of multiple linear regression.

generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. like linear models, GAMs can be applied with both quantitative and qualitative responses.

GAMs for regression problems

a natural way to extend the multiple linear regression model

\[ y_i=\beta_0+\beta_1x_{i1}+\dots+\beta_px_{ip}+\epsilon_i \]

to allow for non-linear relationships between the predictors and the response is to replace each linear component \(\beta_j x_{ij}\) with a smooth non-linear function \(f_j(x_{ij})\): \[ y_i=\beta_0+f_1(x_{i1})+\dots+f_p(x_{ip})+\epsilon_i \]

it is called an additive model because we calculate a separate \(f_j\) for each \(X_j\), and then add together all of their contributions.
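one way to fit such an additive model with off-the-shelf tools is to give each predictor its own spline basis and run least squares on the concatenated bases; a sketch using scikit-learn's `SplineTransformer` (available in recent versions), with arbitrary basis settings. this only approximates a full GAM fit, but it illustrates the additive structure.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=300)

# one spline basis per predictor -> f_1(X_1) + f_2(X_2) + f_3(X_3), fit jointly by least squares
per_feature_splines = ColumnTransformer(
    [(f"s{j}", SplineTransformer(degree=3, n_knots=6), [j]) for j in range(3)]
)
gam_like = make_pipeline(per_feature_splines, LinearRegression())
gam_like.fit(X, y)
print(gam_like.predict(X[:5]))
```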

pros and cons of GAMs

  • GAMs allow us to fit a non-linear \(f_j\) to each predictor, so that we can automatically model non-linear relationships, meaning we don’t have to manually try out transformations on variables.

  • non-linear fit can potentially make more accurate predictions for Y.

  • able to examine the effect of each predictor while holding all others fixed.

  • smoothness of the function \(f_j\) can be summarized via \(df\).

  • no interactions are allowed, that is, important interactions can be missed. solution is to manually add interaction terms.

chap 8. tree-based methods

tree-based methods for regression and classification: these involve stratifying or segmenting the feature space into a number of simple regions. to make a prediction for a given observation, we often use the mean or the mode response of the training observations within the region. since the set of splitting rules used to segment the predictor space can be summarized in a tree, we call these types of approaches decision tree methods.

  • tree-based methods are simple and useful for interpretation
  • tree-based methods are, however, not competitive with the best supervised learning methods, e.g., ridge regression, lasso, pls (chap 6), splines and GAMs (chap 7), regarding prediction accuracy.

to overcome the disadvantage in prediction accuracy, bagging, random forests, boosting, and Bayesian additive regression trees provide solutions. these methods involve generating multiple trees which are then combined to yield a single consensus prediction, which normally results in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

basics of decision tree

  • applicability: decision tree methods can be applied to both regression and classification problems

regression trees

Example

  • rules: Years < 4.5, Hits < 117.5.
  • leaves: sample sub-spaces that are divided by the rules are called terminal nodes or leaves of the tree.

regression trees have an advantage over other types of regression models such as standard least squares regression, ridge regression, lasso: it’s easier to interpret with a graphical representation.

  • prediction via stratification of the feature space

    • we divide the feature space into J distinct and non-overlapping regions, \(R_1, R_2, \dots, R_J\).
    • we assign the same prediction value (mean/mode value of the training response in region \(R_j\)) to every observation that falls in the same region \(R_j\).
  • objective function: the goal of regression trees is to find a set of regions \(R_1, \dots, R_J\) that minimize \[ \rm RSS = \sum_{j=1}^{J}\sum_{i \in R_j}(y_i-\hat{y}_{R_j})^2 \] where \(\hat{y}_{R_j}\) is the mean response for the training observations in \(j\)th region.

  • recursive binary splitting: a top-down, greedy approach

    • top-down: it begins at the top of the tree (with all observations), and then successively splits the predictor space
    • greedy: at each step of the tree-building process, the best split is determined locally at that particular step, rather than globally and looking ahead for future steps
  • procedure

    • first select the predictor \(X_j\) and the cutpoint s such that splitting the predictor space into the regions {\(X|X_j<s\)} and {\(X|X_j \ge s\)} leads to the greatest reduction in RSS.
    • next repeat the process, and look for the best predictor and cutpoint to split the data further so as to minimize the RSS within each of the resulting regions.
    • after identifying regions, we predict the response for a given test observation by assigning the mean of the training observations within the region where the test observation belongs.
  • tree pruning

the procedure above may overfit the data and produce poor test set performance due to the high complexity of the resulting tree. in contrast, a smaller tree with fewer splits might lead to lower variance and better interpretation, at the price of slightly increased bias.

  • one alternative is to build the tree only so long as the decrease in RSS due to each split exceeds some threshold.
  • a better approach would be building a large tree, and then prune it back to obtain a sub-tree.

how do we decide the best way to prune the tree? intuitively, we could select the subtree that leads to the lowest test error rate using cross-validation or the validation set approach. but estimating the cross-validation error for every possible sub-tree is computationally infeasible, since there can be an extremely large number of sub-trees.

  • cost complexity pruning, aka weakest link pruning, provides a workaround.

  • algorithm

    • use recursive binary splitting to grow a large tree on training data.
    • apply cost complexity pruning to obtain a sequence of best sub-trees, as a function of \(\alpha\).
    • use k-fold cross-validation to choose the \(\alpha\) that minimizes the average error.
    • return the sub-tree from step 2 that corresponds to the chosen value of \(\alpha\).
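a sketch of this procedure with scikit-learn, whose trees expose cost-complexity pruning through `ccp_alpha`; the candidate \(\alpha\) values are taken from the pruning path of the large tree, and the data are simulated.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] > 0, 3.0, -1.0) + X[:, 1] + rng.normal(size=300)

# step 1: grow a large tree, and recover its cost-complexity pruning path (candidate alphas)
big_tree = DecisionTreeRegressor(random_state=0)
path = big_tree.cost_complexity_pruning_path(X, y)

# steps 2-4: cross-validate over alpha and keep the sub-tree for the chosen value
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {"ccp_alpha": path.ccp_alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_estimator_.get_n_leaves())
```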

classification trees

similar to regression trees, except that they are intended for qualitative responses. the prediction for an observation is made by assigning the most common class among the training observations in the region to which the observation belongs.

  • objective function
    • classification error rate: not sufficiently sensitive for tree-growing.
    • Gini index: a measure of node purity as it takes on a small value when \(\hat{p}_{mk}\)s are 0 or 1.

\[ \rm G= \sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk}) \]

    • entropy: another measure of node purity, since it approaches 0 when the nodes are pure.

\[ \rm D=-\sum_{k=1}^{K}\hat{p}_{mk}\log\hat{p}_{mk} \]

the Gini index and entropy are preferred for evaluating the quality of splits when growing the tree; if prediction accuracy of the final pruned tree is the goal, the classification error rate is preferred for pruning.

trees vs. linear models

linear regression assumes a model of the form \[ f(X)=\beta_0+\sum_{j=1}^{p}X_j\beta_j \] whilst regression trees assume a model of the form \[ f(X)=\sum_{m=1}^{M}c_m\cdot 1_{(X \in R_m)} \] where \(R_1, \dots,R_M\) represent a partition of the feature space. neither method dominates the other.

  • trees are easier to explain than linear regression

  • trees are believed to more closely mimic human decision-making than classical regression and classification methods

  • trees have graphical representations

  • no need to create dummy variables for trees

  • trees generally do not match the predictive accuracy of non-tree-based regression and classification approaches

  • trees can be sensitive to small changes in the data, i.e., they have high variance, meaning that if we split the training set into two subsets at random and fit a decision tree to each, the results could be quite different. a low-variance procedure, such as linear regression when the ratio of n to p is moderately large, will produce similar results when fit repeatedly on distinct data sets.

solution to build robust decision trees

  • ensemble method: an approach that combines many weak learners to build a single powerful predictor. ensemble methods include:
    • bagging: used to reduce variance of a statistical learning method.

methodology behind: recall that given a set of n independent observations, each with variance \(\sigma^2\), the variance of the mean of the observations is given by \(\sigma^2/n\), meaning averaging a set of observations reduces variance.

therefore, a natural way to reduce the variance and increase test set accuracy is to draw many training sets from the population, build a separate prediction model with each training set, and average the resulting predictions. \[ \hat{f}_{avg}(x)=\frac{1}{B}\sum_{b=1}^{B}\hat{f}^b(x) \]

in practice, however, we are unable to draw multiple training sets from the population. but we can bootstrap: we take repeated samples from the training set, train our method on each bootstrapped training set to get predictions, and finally average all the predictions. this is called bagging, which is particularly useful for decision trees and can improve predictions for many other regression methods as well.

to apply bagging to regression trees, we construct B regression trees with B bootstrapped training sets, and average the resulting predictions.

out-of-bag error estimation: on average, each bagged tree uses about two-thirds of the observations; the remaining observations not used to fit a given tree are called out-of-bag (OOB) samples. we can predict the response for the ith observation using each of the trees in which that observation is OOB, which yields around B/3 predictions for the ith observation. we can then average these predicted responses (if regression is the goal) or take a majority vote (if classification is the goal) to make a single OOB prediction for the ith observation. an OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE or classification error can be computed. the resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.

variable importance measure

although bagging improves accuracy, it makes the resulting model harder to interpret. when we bag a large number of trees, it is no longer possible to represent the resulting statistical learning procedure with a single tree, and it is no longer clear which variables are most important to the procedure. thus, bagging improves prediction accuracy at the expense of interpretability.

one way to obtain an overall summary of the importance of each predictor is to use the RSS (for bagging regression trees) or the Gini index (for bagging classification trees): recording the decrease in RSS or Gini index due to splits over a given predictor, averaged over all B trees, gives the variable importance for that predictor.

  • random forests:

random forests improves on bagged trees by way of de-correlating the trees.

what’s wrong with bagged trees? suppose there is one very strong predictor, along with other moderately strong predictors; then the collection of bagged trees will look similar to each other, with the strong predictor in the top split. averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities, so bagging will not give as substantial a reduction in variance over a single tree as we might hope.

random forests overcome this issue by forcing each split to consider only a subset (\(m < p\)) of the predictors. therefore, on average \((p-m)/p\) of the splits will not even consider the strong predictor, so the other predictors have more of a chance.
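a sketch contrasting bagged trees with a random forest that restricts each split to \(m \approx \sqrt{p}\) predictors, reporting the out-of-bag score described earlier; all settings and data are illustrative.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 9))
y = 5 * X[:, 0] + 0.2 * X[:, 1:].sum(axis=1) + rng.normal(size=500)   # one strong predictor

# bagging: B bootstrapped trees, every split may use all p predictors
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                       oob_score=True, random_state=0).fit(X, y)

# random forest: each split considers only m < p randomly chosen predictors (de-correlates trees)
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           oob_score=True, random_state=0).fit(X, y)

print("bagging OOB R^2:", round(bag.oob_score_, 3))
print("random forest OOB R^2:", round(rf.oob_score_, 3))
```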

  • boosting

bagging works by combining multiple decision trees trained on distinct bootstrapped training sets drawn from the original training set. boosting works in a similar fashion, except that the trees are grown sequentially: each tree is grown using information from previous trees.

that is, unlike fitting a single large tree to the data that amounts to fitting the data hard and potentially overfitting, boosting learns slowly: given the current model, boosting fits a decision tree to the residuals from the model, rather than the outcome Y.

boosting has three tuning parameters:

  • the number of trees B
  • the shrinkage parameter \(\lambda\)
  • the number \(d\) of splits in each tree
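these three parameters map naturally onto the arguments of scikit-learn's gradient boosting implementation; a minimal sketch with arbitrary values (`max_depth` plays the role of the tree size d).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(11)
X = rng.normal(size=(400, 6))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=400)

boost = GradientBoostingRegressor(
    n_estimators=500,     # B: the number of trees
    learning_rate=0.01,   # lambda: the shrinkage parameter (learn slowly)
    max_depth=2,          # controls the size d of each tree (small trees)
).fit(X, y)
print(boost.predict(X[:3]))
```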

  • Bayesian additive regression trees (BART)

BART is another ensemble method that uses decision trees as its building blocks.

separate trees: recall that bagging and random forests make predictions from an average of regression trees, each of which is built using a random sample of the data and/or predictors. each tree is built separately from the others.

sequential trees: in contrast, boosting uses a weighted sum of trees, each of which is constructed by fitting a tree to the residuals of the current fit. thus, each new tree attempts to capture signal that is not yet captured by the current set of trees.

BART combines elements of both: each tree is constructed in a random manner, as in bagging and random forests, and each tree tries to capture signal not yet accounted for by the current model, as in boosting. what differentiates BART is the way it generates new trees; the detailed algorithm is omitted here.

chap 9. support vector machines

  • maximal margin classifier: a simple, intuitive classifier which requires a linear boundary among classes.
  • support vector classifier: an extension
  • support vector machine: a further extension

maximal margin classifier

  • hyperplane: in a p-dimensional space, a hyperplane is a \((p-1)\)-dimensional flat affine subspace.

in two dimensions, a hyperplane is a flat one-dimensional subspace, that is, a line. in three dimensions, a hyperplane is a flat two-dimensional subspace, that is, a plane.

the mathematical definition of a hyperplane is given by \[ \beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p=0 \]

if a point \(X=(X_1, X_2, \dots,X_p)^T\) satisfies the equation above, then X lies on the hyperplane. we can think of the hyperplane as dividing the p-dimensional space into two halves. for example, if \[ \beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p>0 \] then X lies on one side of the hyperplane; if \[ \beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p<0 \] then X lies on the other side.

  • classification using a separating hyperplane

suppose we have an \(n \times p\) data matrix \(\mathbf{X}\) with n training observations in \(p\)-dimensional space: \[ x_1=\begin{pmatrix}x_{11}\\\vdots\\ x_{1p}\\\end{pmatrix},\dots,x_n=\begin{pmatrix}x_{n1}\\\vdots\\ x_{np}\end{pmatrix} \] and that these observations fall into two classes. we also have a test observation, a vector of p features \(x^*=(x^*_1, \dots, x^*_p)^T\). our goal is to develop a classifier, based on the training observations, that will correctly classify the test observation. possible approaches are:

  • linear discriminant analysis
  • logistic regression
  • classification trees
    • bagging
    • random forests
    • boosting

here we use a new approach based on a separating hyperplane: if a separating hyperplane exists, we can use it to construct a very natural classifier, in which a test observation is assigned a class depending on which side of the hyperplane it lies. that is, we classify \(x^*\) based on the sign of \(f(x^*)=\beta_0+\beta_1x^*_1+\beta_2x^*_2+\dots+\beta_px^*_p\).

not surprisingly, a classifier based on a separating hyperplane leads to a linear decision boundary.

  • maximal margin classifier

generally, if the data can be perfectly separated using a hyperplane, then there will be infinitely many such hyperplanes. we therefore need a reasonable way to decide which one to use. a natural choice is the maximal margin hyperplane, the separating hyperplane that is farthest from the training observations.

  • margin: the minimal distance from the observations to the hyperplane

the maximal margin hyperplane is the separating hyperplane that has the largest margin.

  • construction of the maximal margin classifier

  • the non-separable case

the maximal margin classifier is a natural way to perform classification, if a separating hyperplane exists. however, there are cases where no separating hyperplane exists, that is, the observations are not separable by a hyperplane.

support vector classifier

in fact, even if a separating hyperplane exists, it might not be desirable in many instances, because it can be overly sensitive to individual observations and may lie very close to some of them. we may instead be willing to consider a classifier based on a hyperplane that does not perfectly separate the classes, but possesses:

  • greater robustness to individual observations
  • better classification of most of the training observations: it could be worthwhile to mis-classify a few training observations in order to do a better job in classifying the rest of observations.

the support vector classifier, also known as a soft margin classifier, provides such a solution.

the support vector classifier classifies a test observation depending on which side of a hyperplane it lies; the hyperplane is chosen to correctly separate most of the training observations.

  • observations that lie on the margin or violate the margin will affect the hyperplane, and the classifier obtained.
  • observations that lie on the correct side of the margin do not affect the support vector classifier
    • changing the position of the observations would not affect the classifier, provided that their positions remain on the correct side of the margin.
    • observations that lie directly on the margin, or the wrong side of the margin for their classes, are known as support vectors.
    • support vector classifier is robust to the behavior of observations that are far away from the hyperplane.

support vector machines

from a linear to a non-linear classifier: the support vector classifier is a natural approach for classification in the two-class setting when the boundary between the classes is linear.

  • non-linear decision boundary

in practice, non-linear boundaries are more often seen. to fit a non-linear classifier, a natural way is to fit a support vector classifier in an enlarged feature space, using quadratic, cubic, and even higher-order polynomial functions of the features.

instead of fitting a support vector classifier using the p features \(X_1, X_2, \dots, X_p\), we could fit one using the 2p features \(X_1, X_1^2, X_2, X_2^2, \dots, X_p, X_p^2\). in the original feature space, the resulting decision boundary is generally non-linear. however, there are many possible ways to enlarge the feature space beyond quadratic terms, and doing so carelessly can lead to unmanageable computations.

  • support vector machine

the support vector machine (SVM) allows us to enlarge the feature space used by the support vector classifier in a computationally efficient way.

the SVM extends the support vector classifier in a specific way, using kernels. the kernel approach is simply an efficient computational technique for enlarging the feature space. a kernel is a function that quantifies the similarity of two observations.
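a sketch of fitting SVMs with polynomial and radial kernels via scikit-learn; the cost parameter C and the kernel settings are placeholders that would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(12)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # non-linear (circular) class boundary

svm_poly = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, C=1.0))
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=1.0, C=1.0))
svm_poly.fit(X, y)
svm_rbf.fit(X, y)
print(svm_poly.score(X, y), svm_rbf.score(X, y))     # training accuracy only, for illustration
```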

SVMs with more than 2 classes

there are many proposed ways to extend SVMs to k-class cases, and the most popular are the one-versus-one and one-versus-all approaches.

  • one-versus-one / all-pairs

it constructs \(\binom{K}{2}\) SVMs, each of which compares a pair of classes. we classify a test observation using each of the \(\binom{K}{2}\) classifiers, and total the number of times that it is assigned to each of the K classes. the final classification is performed by assigning the test observation to the class to which it is most frequently assigned in these \(\binom{K}{2}\) pairwise classifications.

  • one-versus-all: we fit K SVMs, each time comparing one of the K classes to the remaining K-1 classes. we assign the test observation to the class for which \(\beta_{0k}+\beta_{1k}x_1^*+\beta_{2k}x_2^*+\dots+\beta_{pk}x_p^*\) is largest, which amounts to the highest level of confidence that the test observation belongs to the kth class rather than to any of the others.

chap 12. unsupervised learning

in cases where we are not interested in prediction, is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?

unsupervised learning refers to a diverse set of techniques for answering questions such as these.

  • principal components analysis: a tool for data visualization or data pre-processing before supervised techniques
  • clustering: for discovering subgroups in data

unsupervised learning is often performed as part of an exploratory data analysis.

PCA

principal component analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data.

the first principal component is the normalized linear combination of the features \[ Z_1=\phi_{11}X_1+\phi_{21}X_2+\dots+\phi_{p1}X_p \] that has the largest variance, where normalized means \(\sum_{j=1}^{p}\phi^2_{j1}=1\). we refer to the elements \(\phi_{11}, \phi_{21},\dots, \phi_{p1}\) as the loadings of the first principal component; together, the loadings make up the loading vector \(\phi_1\). the loading vector \(\phi_1\) defines the direction in feature space along which the data vary the most. if we project the data onto this direction, the projected values are the principal component scores \(z_{11}, \dots, z_{n1}\).

  • proportion of variance explained (PVE)
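a sketch with scikit-learn showing the loading vector, the scores, and the proportion of variance explained for simulated data; standardizing the features first is a common convention, not a requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(13)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
print(pca.components_[0])              # loading vector phi_1 (unit length)
scores = pca.transform(Xs)             # principal component scores z_im
print(scores[:3, 0])                   # first few scores on the first component
print(pca.explained_variance_ratio_)   # PVE for each component
```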

principal components with missing values

missing value imputation and principal components computation could be completed simultaneously.

clustering methods

clustering refers to a broad set of techniques for finding subgroups, or clusters, in a data set. to identify concrete clusters, we must define what it means for observations to be similar or different.

both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different:

  • PCA seeks to find a low-dimensional representation of the observations that explains a good fraction of the variance
  • clustering looks to find homogeneous subgroups among the observations.

for example, in a marketing application, the goal of clustering might be to perform market segmentation by identifying subgroups of people who are more receptive to a particular form of advertising, or more likely to purchase a particular product.

some of the best known clustering approaches include:

  • K-means clustering: partition the observations into a pre-specified number of clusters.
  • hierarchical clustering: without pre-defined number of clusters, with tree-like representation called dendrogram.

k-means clustering

objective function: a good clustering is one for which the within-cluster variation is as small as possible.

\[ \min_{C_1,\dots,C_K} \Big\{\sum_{k=1}^{K}W(C_k)\Big\} \]

to solve this problem, we have to define the within-cluster variation, \(W(C_k)\). the most common choice involves squared Euclidean distance:

\[ W(C_k)=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2 \]

where \(|C_k|\) denotes the number of observations in the kth cluster. that is, the within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in that cluster. the optimization problem that defines K-means clustering is therefore

\[ \min_{C_1, \dots,C_K} \Big\{ \sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum_{j=1}^{p}(x_{ij}-x_{i'j})^2 \Big\} \]
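a minimal K-means sketch with scikit-learn; K and the simulated data are illustrative. note that scikit-learn reports the within-cluster sum of squared distances to the centroids (`inertia_`), which equals the objective above up to a constant factor.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(14)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 4])])   # three loose groups

km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignments C_1, ..., C_K
print(km.inertia_)           # total within-cluster sum of squares being minimized
```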

hierarchical clustering

one potential disadvantage of K-means clustering is that it requires pre-specification of the number of clusters K. hierarchical clustering, by contrast, does not have this requirement.

  • bottom-up/agglomerative clustering: starting from the leaves and combining clusters up to the trunk

  • interpretation of hierarchical clustering

    • the height of the fusion on the y-axis indicates how different the two observations are. observations that fuse at the bottom of the tree are quite similar to each other, whereas observations that fuse at the top of the tree will be quite different.
    • misinterpretation: it is tempting but incorrect to conclude that two observations are similar because they are close to each other along the horizontal axis; similarity should be judged by the height on the vertical axis at which the two observations first fuse.
  • assumption: the term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height.

however, this nested structure is not always a realistic assumption for a given data set.

  • algorithm of hierarchical clustering
    • dissimilarity: measurement of dissimilarity between each pair of observations are defined, such as Euclidean distance.
    • starting at the bottom, each observation is treated as its own cluster; the two clusters that are most similar are then fused, resulting in n-1 clusters.
    • next, the two clusters that are most similar to each other are fused again, resulting in n-2 clusters.
    • proceed until each observation belongs to one single cluster.
  • linkage: we need to determine dissimilarity between two clusters/groups
    • complete
    • average
    • single
    • centroid
  • choice of dissimilarity measure
    • Euclidean distance
    • correlation-based distance
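
a minimal sketch (not from the book) of agglomerative clustering with SciPy (assumed available) on hypothetical data, using the linkage and dissimilarity choices listed above:

```python
# Sketch: hierarchical (agglomerative) clustering with different linkages/dissimilarities.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                      # hypothetical: 30 observations, 5 features

# Euclidean dissimilarity with complete linkage ('average', 'single', 'centroid' also work).
Z = linkage(X, method="complete", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)

# Correlation-based dissimilarity: observations with similar feature profiles are treated
# as close, regardless of how far apart they are in Euclidean distance.
D = pdist(X, metric="correlation")
Z_corr = linkage(D, method="average")

# scipy.cluster.hierarchy.dendrogram(Z) draws the tree (requires matplotlib).
```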

practical issues in clustering

  • small decisions with big consequences
    • should the features be standardized?
    • what distance measure should be used? what type of linkage should be used? where should we cut the dendrogram to obtain clusters?
    • how many clusters should we look for in the data for K-means clustering?

chap 13. multiple testing

now we switch our focus from estimation and prediction to hypothesis testing, which is key to conducting inference.

classical statistics focuses on testing a single null hypothesis against the alternative hypothesis. in contemporary settings, we may wish to test a great many null hypotheses in the face of a huge amount of data. for instance, we might want to test \(m\) null hypotheses, \(H_{01}, \dots, H_{0m}\), where \(H_{0j}\): the mean value of the \(j\)th biomarker among mice in the control group equals that of the \(j\)th biomarker among mice in the treatment group.

  • contemporary ways to conduct multiple testing in a big-data setting
  • challenges associated with multiple testing and solutions
  • more contemporary solutions: false discovery rate, estimation of the null distribution using re-sampling

hypothesis testing

hypothesis testing is a rigorous statistical framework for answering simple yes vs. no questions.

  • define the null and alternative hypothesis
    • typically, the alternative hypothesis posits that the null hypothesis does not hold
    • the treatment of \(H_0\) and \(H_a\) is asymmetric.

\(H_0\) is treated as the default state, and we focus on using data to reject it. if we reject \(H_0\), then this provides evidence in favor of \(H_a\), and thus a discovery. if we fail to reject \(H_0\), then the finding is inconclusive: we do not know whether we failed to reject \(H_0\) because our sample size was too small or because \(H_0\) really holds.

  • construct a test statistic that summarizes the strength of evidence against the null hypothesis
    • test statistic is denoted as \(T\), which summarizes the extent to which our data are consistent with \(H_0\).
    • the way in which we construct \(T\) depends on the nature of the null hypothesis, e.g., two/multiple group means/variances
  • compute a p-value that quantifies the probability of having obtained a comparable or more extreme value of the test statistic under the null hypothesis
    • a large (absolute) value of a two-sample t-statistic provides evidence against \(H_0\).
    • how large is large?
    • how much evidence against \(H_0\) is provided by a given value of the test statistic?
    • p-value answers this question: the p-value is defined as the probability of observing a test statistic equal to or more extreme than the observed statistic, under the assumption that \(H_0\) is in fact true.
    • a small p-value indicates that such an extreme value of the test statistic would rarely be observed under \(H_0\), and thus provides evidence against \(H_0\).
  • draw conclusion
    • it is not correct to state that the p-value is the probability that \(H_0\) holds, i.e., that the null hypothesis is true.
    • the one and only interpretation of the p-value is as the fraction of the time that we would expect to see such an extreme value of the test statistic if we repeated the experiment many times, provided \(H_0\) holds.
    • what did we accomplish by converting the test statistic into a p-value? interpreting the test statistic directly would require knowing what values to expect under \(H_0\); the p-value puts the evidence on a common 0-to-1 scale that can be interpreted in the same way regardless of the type of test statistic.
    • a small p-value indicates that such a large value of the test statistic is unlikely to occur under \(H_0\), and thereby provides evidence against \(H_0\) and constitutes a discovery. but how small is small enough to reject \(H_0\)? it is typical to reject \(H_0\) if the p-value is below 0.05, meaning that if \(H_0\) holds, we would expect to see such a small p-value no more than 5% of the time. (a worked sketch follows this list.)
    • if we get a large p-value, we state that we failed to reject \(H_0\).
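
the worked sketch mentioned above (not from the book): a two-sample t-test on hypothetical control/treatment measurements using SciPy (assumed available), with the usual 0.05 threshold applied to the p-value.

```python
# Sketch: two-sample t-test, test statistic T, p-value, and the reject / fail-to-reject decision.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=10.0, scale=2.0, size=40)     # hypothetical control group
treatment = rng.normal(loc=11.0, scale=2.0, size=40)   # hypothetical treatment group

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch two-sample t-test
print(f"T = {t_stat:.2f}, p-value = {p_value:.4f}")

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```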

type I and type II errors

  • type I error: rejecting \(H_0\) when \(H_0\) is in fact true; the type I error rate is the probability of making a type I error given that \(H_0\) holds, i.e., the probability of incorrectly rejecting \(H_0\).
  • type II error: failing to reject \(H_0\) when \(H_0\) is in fact false.
  • power: 1 - type II error rate, i.e., the probability of not making a type II error given that \(H_a\) holds, or the probability of correctly rejecting \(H_0\).

challenge of multiple testing

rejecting \(H_0\) if the p-value is below (say) 0.01 provides us with a simple way to control the type I error for \(H_0\) at level 0.01: if \(H_0\) is true, then there is no more than 1% probability that we will reject it.

now suppose we wish to test m null hypotheses. if we reject all null hypotheses for which the p-value falls below 0.01, how many type I errors should we expect to make?

the main challenge of multiple testing: when testing a huge number of null hypotheses, we are bound to get some very small p-values by chance. if we draw a conclusion on whether to reject each null hypothesis without accounting for the fact that we have performed a great number of tests, then we may end up rejecting a large number of true null hypotheses, that is, making a large number of type I errors.

how severe is the problem? suppose we reject a single null hypothesis if its p-value is less than \(\alpha=0.01\); then there is a 1% chance of making a false rejection if \(H_0\) is true. now if we test m null hypotheses, all of which are true, we expect to falsely reject approximately \(0.01 \times m\) null hypotheses. if \(m=10,000\), then we expect to falsely reject 100 true null hypotheses by chance.
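
a quick simulation (not from the book) makes this concrete: under a true null hypothesis the p-value is (approximately) uniformly distributed on (0, 1), so thresholding \(m=10{,}000\) such p-values at 0.01 produces roughly 100 false rejections purely by chance.

```python
# Sketch: expected number of false rejections when all m null hypotheses are true.
import numpy as np

rng = np.random.default_rng(3)
m = 10_000
p_values = rng.uniform(size=m)            # p-values of m true null hypotheses
print((p_values < 0.01).sum())            # typically close to 0.01 * m = 100
```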

stated another way, the crux of the issue is:

rejecting a null hypothesis if the p-value is below \(\alpha\) controls the probability of falsely rejecting that null hypothesis at level \(\alpha\). if we do this for m null hypotheses, however, the chance of falsely rejecting at least one of the m null hypotheses is much higher.

family-wise error rate

the type I error rate is the probability of rejecting \(H_0\) if \(H_0\) is true. the family-wise error rate (FWER) generalizes this notion to the setting of m null hypotheses, and is defined as the probability of making at least one Type I error.

                     H_0 is True    H_0 is False    Total
  Reject H_0         V              S               R
  Do Not Reject H_0  U              W               m - R
  Total              m_0            m - m_0         m

here, V represents the number of Type I errors (also known as false positives or false discoveries), S the number of true positives, U the number of true negatives, and W the number of Type II errors (also known as false negatives). the family-wise error rate is given by

\[ \rm FWER = Pr(V \ge 1) \]

a strategy of rejecting any null hypothesis for which the p-value is below \(\alpha\) leads to a FWER of

\[ \begin{aligned} \mathrm{FWER}(\alpha) &= 1-\mathrm{Pr}(V=0) \\ &= 1-\mathrm{Pr}(\text{do not falsely reject any null hypotheses}) \\ &= 1-\mathrm{Pr}\Big(\bigcap_{j=1}^{m} \{ \text{do not falsely reject } H_{0j} \}\Big) \end{aligned} \]

if we assume the m tests are independent and all m null hypotheses are true, then \[ \rm FWER(\alpha)=1-\prod_{j=1}^{m}(1-\alpha)=1-(1-\alpha)^m \]

hence, if we test one null hypothesis, then \(\rm FWER(\alpha)=1-(1-\alpha)=\alpha\), so the Type I error rate and the FWER are equal. if we perform \(m=100\) independent tests, then \(\rm FWER(\alpha)=1-(1-\alpha)^{100}\); with \(\alpha=0.05\), \(\rm FWER(0.05)=1-(1-0.05)^{100}=0.994\). stated another way, we are virtually guaranteed to make at least one Type I error!
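
a tiny sketch (not from the book) of the formula above, showing how quickly the FWER approaches 1 as the number of independent tests grows:

```python
# Sketch: FWER(alpha) = 1 - (1 - alpha)^m for independent tests.
alpha = 0.05
for m in (1, 10, 100, 1000):
    print(f"m = {m:4d}   FWER = {1 - (1 - alpha) ** m:.3f}")   # m = 100 gives about 0.994
```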

control of family-wise error rate

  • Bonferroni method: let \(A_j\) denote the event that we make a Type I error for the jth null hypothesis, for \(j=1,\dots, m\); then \[ \begin{aligned} \mathrm{FWER}(\alpha) &= \mathrm{Pr}(\text{falsely reject at least one null hypothesis}) \\ &= \mathrm{Pr}\Big(\bigcup_{j=1}^{m}A_j\Big) \le \sum_{j=1}^{m}\mathrm{Pr}(A_j) \end{aligned} \]

the inequality results from the fact that for any two events A and B, \(\rm Pr(A \cup B) \le Pr(A)+Pr(B)\), regardless of whether A and B are independent.

the Bonferroni method sets the threshold for rejecting each hypothesis test to \(\alpha/m\), so that for each null hypothesis \(\rm Pr(A_j)\le \alpha/m\). together, \[ \rm FWER(\alpha) \le m \times \alpha/m=\alpha \] so this procedure controls the FWER at level \(\alpha\). for example, to control the family-wise error rate at level 0.1 while testing \(m=100\) null hypotheses, the Bonferroni method requires us to control the Type I error for each null hypothesis at level \(0.1/100=0.001\), rejecting all null hypotheses whose p-value is below 0.001 (a minimal sketch follows below).

    • advantages: easy to understand, easy to implement, and successfully controls the Type I error.
    • cost: it is conservative, so we reject few null hypotheses and thus will make quite a few Type II errors.
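
the sketch referenced above (not from the book), applying the Bonferroni threshold \(\alpha/m\) to a handful of hypothetical p-values:

```python
# Sketch: Bonferroni control of the FWER at level alpha (hypothetical p-values).
import numpy as np

alpha = 0.1
p_values = np.array([0.0004, 0.003, 0.02, 0.2, 0.6])   # m = 5 hypothetical p-values
reject = p_values < alpha / len(p_values)               # threshold alpha / m = 0.02
print(reject)                                           # [ True  True False False False]
```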

  • Holm’s step-down procedure/Holm-Bonferroni method

Holm’s method is less conservative than Bonferroni’s method while still controlling the FWER: it compares the ordered p-values \(p_{(1)} \le \dots \le p_{(m)}\) to the increasing thresholds \(\alpha/m, \alpha/(m-1), \dots, \alpha\), so it always rejects at least as many null hypotheses as Bonferroni.
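
as a rough illustration (not from the book), the sketch below compares the Bonferroni and Holm adjustments on the same hypothetical p-values, using statsmodels (assumed available); Holm rejects at least as many null hypotheses.

```python
# Sketch: Bonferroni vs. Holm via statsmodels' multipletests (hypothetical p-values).
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.0004, 0.003, 0.02, 0.2, 0.6])
for method in ("bonferroni", "holm"):
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.1, method=method)
    print(method, reject)
```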

  • special cases: achieving higher power while controlling for Type I error
    • Tukey’s method: when performing \(m=G(G-1)/2\) pairwise comparisons of G means, it allows us to control the FWER at level \(\alpha\) while rejecting all null hypotheses for which the p-value falls below \(\alpha_T\), for some \(\alpha_T>\alpha/m\) (see the sketch after this list).
    • Scheffe’s method: for testing null hypotheses involving arbitrary linear combinations of the means, such as \(H_0: \frac{1}{3}(\mu_1+\mu_2+\mu_3)=\frac{1}{2}(\mu_4+\mu_5)\).
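
the sketch referenced above (not from the book): Tukey’s HSD on hypothetical data with \(G=3\) groups, using statsmodels (assumed available) to control the FWER across all \(G(G-1)/2=3\) pairwise comparisons.

```python
# Sketch: Tukey's HSD for all pairwise comparisons of three group means (hypothetical data).
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
values = np.concatenate([rng.normal(0, 1, 30),
                         rng.normal(0, 1, 30),
                         rng.normal(1, 1, 30)])    # hypothetical measurements for groups A, B, C
groups = np.repeat(["A", "B", "C"], 30)

result = pairwise_tukeyhsd(values, groups, alpha=0.05)   # FWER controlled at 0.05
print(result.summary())
```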

trade-off between FWER and Power

power is defined as the number of false null hypotheses that we reject divided by the total number of false null hypotheses, i.e., \(S/(m-m_0)\). controlling the FWER at level \(\alpha\) guarantees that we are very unlikely to reject any true null hypotheses, i.e., to have any false positives.

in practice, when m is large, we may be willing to tolerate a few false positives in the interest of making more discoveries, i.e., more rejections of the null hypothesis.

false discovery rate

when m is large, simply preventing any false positives is too stringent. we instead might want to make sure that the ratio of false positives (V) to total positives \((V+S)=R\) is sufficiently low, so that most of the rejected null hypotheses are not false positives.

the ratio V/R is known as the false discovery proportion (FDP). however, in practice, it is impossible to control the FDP since we have no way to know which hypothesis is true and which is false. therefore, we instead control the false discovery rate (FDR), given by \[ \rm FDR=E(FDP)=E(V/R) \] for instance, a genomic researcher might sequence the genomes of individuals with and without some particular medical condition, and then for each of 20,000 genes, test whether sequence variants in that gene are associated with the medical condition of interest. this amounts to performing \(m=20,000\) hypothesis tests.

the nature of this analysis is exploratory, so we are willing to tolerate some number of false positives in the set of genes that we will investigate further; thus, the FWER is not an appropriate choice. however, some correction for multiple testing is required: it would not be a good idea for us to simply investigate all genes with p-values less than 0.05, since we would expect 1000 genes to have such small p-values simply by chance, even if no genes are associated with the disease (since \(0.05\times 20,000=1000\)). controlling the FDR for this exploratory analysis at 20% guarantees that on average no more than 20% of the genes that we will further investigate are false positives.

  • Benjamini-Hochberg procedure

as long as the m p-values are independent or only mildly dependent, then the Benjamini-Hochberg procedure guarantees that \[ \rm FDR \le q \]

stated another way, this procedure ensures that, on average, no more than a fraction q of the rejected null hypotheses are false positives.

compared with FWER control, FDR control is much milder and more powerful, in the sense that it allows us to reject many more null hypotheses, at the cost of substantially more false positives.
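
a minimal sketch (not from the book) of the Benjamini-Hochberg procedure at FDR level q, implemented directly on hypothetical p-values; statsmodels’ multipletests with method='fdr_bh' should produce the same set of rejections.

```python
# Sketch: Benjamini-Hochberg procedure controlling the FDR at level q.
import numpy as np

def benjamini_hochberg(p_values, q=0.2):
    """Return a boolean mask of rejected null hypotheses at FDR level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m    # is p_(k) <= q * k / m ?
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])                 # largest k satisfying the inequality
        reject[order[: k + 1]] = True                  # reject all hypotheses with p <= p_(k)
    return reject

p_values = [0.001, 0.008, 0.039, 0.041, 0.16, 0.28, 0.37, 0.54, 0.72, 0.91]
print(benjamini_hochberg(p_values, q=0.2))             # rejects the four smallest p-values here
```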

a re-sampling approach to p-values and false discovery rates

the classical approach to hypothesis testing requires a known distribution of the test statistic under \(H_0\), such as a normal distribution, a t-distribution, a \(\chi^2\)-distribution, or an F-distribution. this is referred to as the theoretical null distribution.

in cases where no theoretical null distribution is available, for example because the sample size is small, we can instead approximate the null distribution by re-sampling (permuting) the observations.
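
a minimal sketch (not from the book) of a permutation-based (re-sampling) p-value for a two-sample comparison with hypothetical small samples: the group labels are repeatedly shuffled to build the null distribution of the test statistic.

```python
# Sketch: permutation p-value for the difference in means between two small samples.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 12)                      # hypothetical group 1 (small sample)
y = rng.normal(0.8, 1.0, 12)                      # hypothetical group 2 (small sample)

observed = abs(x.mean() - y.mean())
pooled = np.concatenate([x, y])

B = 10_000
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)                # shuffle the group labels
    stat = abs(perm[:len(x)].mean() - perm[len(x):].mean())
    count += stat >= observed
print("re-sampling p-value:", count / B)
```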

Gareth James, Trevor Hastie, Daniela Witten. 2021. An Introduction to Statistical Learning: With Applications in R. USA: Springer.