Machine Learning Notes
Statistical Decision Theory
statistical inference
data \(Z=(Z_1, \dots, Z_n)\) follow the distribution \(f(z|\theta)\)
- \(\theta \in \Theta\) is the parameter of interest, but unknown. It represents uncertainties.
- \(\theta\) is a scalar, vector, or matrix
- \(\Theta\) is the set containing all possible values of \(\theta\)
the goal is to estimate \(\theta\) using the data, i.e., to construct \(\hat{\theta}(Z)\).
- it combines the sampling information (data) with a knowledge of the consequences of our decisions.
- major types of inference:
- point estimation
- interval estimation
- hypothesis testing
loss & risk
- loss function: \(\mathcal{L(\theta, \hat{\theta}): \Theta \times \Theta}\Rightarrow R.\)
- \[L(\theta, \hat{\theta})\ge 0\]
- quantifies the consequence for each decision \(\hat{\theta}\)
- \(L(\theta, \hat{\theta}(Z))\) is a function of Z, and it is a random quantity
- examples
- squared error loss: \(L(\theta, \hat{\theta})=(\theta-\hat{\theta})^2\)
-
absolute error loss: $$ L(\theta, \hat{\theta})= \theta-\hat{\theta} $$ - 0-1 loss: \(L(\theta, \hat{\theta})=\begin{cases}0, \quad \hat{\theta}=\theta \\ 1, \quad \hat{\theta}\neq \theta\end{cases}\)
- risk function: expected loss
- \(R(\theta, \hat{\theta}(Z))=E_Z[L(\theta, \hat{\theta})]=\int_{Z}L(\theta, \hat{\theta}(Z))f(z\theta)dz\).
- \(R(\theta, \hat{\theta})\) is a deterministic function of \(\theta\)
- \[R(\theta, \hat{\theta}) \ge 0\]
- used to evaluate the overall performance of one estimator; compare two estimators; find the best estimator
- example from regression
- \[L(\theta, \hat{\theta})=(\theta-\hat{\theta}(Z))^2\]
- \(R(\theta, \hat{\theta})=E_{Z}[\theta-\hat{\theta}]^2\), also known as the mean squared error (MSE)
- example from binary classification
- 0-1 loss \(L(\theta, \hat{\theta})=\mathcal{1}(\theta \ne \hat{\theta}(Z))\)
- \(R(\theta, \hat{\theta})=E\mathcal{1}[\theta \ne \hat{\theta}(Z)]=\mathcal{P}(\theta \ne \hat{\theta}(Z))\), also known as the misclassification error rate
MSE & bias-variance tradeoff
- \[bias(\hat{\theta}) = E(\hat{\theta})-\theta\]
- \[var(\hat{\theta})=E[\hat{\theta}-E(\hat{\theta})]^2\]
- decomposition of MSE
- \[MSE = Bias^2[\hat{\theta}(Z)] + var[\hat{\theta}(Z)]\]
- both bias and variance contribute to the risk
Risk comparison & best estimator
- best estimator
- \(\hat{\theta}^*(Z)\) is the estimator that minimizes the risk for all possible values of \(\theta\)
- \[\hat{\theta}^*(Z) = \arg\min_{\hat{\theta}} R(\theta, \hat{\theta}(Z))\]
in practice, such estimator is not always available and there exists no uniformly best estimator. To avoid this difficulty, we restrict the estimators to be in a class C, i.e.,
- C={unibased estimators}, i.e., \(E[\hat{\theta}(Z)]=\theta\)
- C={linear decision rules}
UMVUE from unbiased estimators; best linear unbiased estimator (BLUE) from linear decision rules.
in addition, the risk function is a function of the unknown parameter, not easy to use. alternative methods to compare risk include maximum risk and Bayes risk.
Bayes risk & minimax risk
- maximum risk & minimax rule
- a decision rule that minimizes the maximum risk
- MinMax rule focuses on the worst-case risk; thus very conservative
- Bayes risk
- Bayes risk is the risk of the Bayes estimator
- Bayes rule: Bayes estimators are the minimizers of the Bayes risk
Learning Theory for Supervised Learning
Supervised learning
- input vector \(X \in R^p\)
- output Y is either discrete- ot continuous-valued
- the pair (X, Y) follow a joint distribution \((X, Y) \sim P(X, Y)\)
training data set:
- \[\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \sim i.i.d P(X, Y)\]
- each data point carries some information about the population
the goal is to estimate the underlying relationship between X and Y,
\[f: X \Rightarrow Y\]for future prediction
- For regression, \(f: \mathcal{R^p \Rightarrow R}\)
- For K-class classification, \(f: \mathcal{R^p \Rightarrow \{1,\cdots, K\}}\)
we would like f(X) to be as close as possible to Y, i.e., \(f(X) \approx Y\).
optimal learner
similar to decision theory, we minimize a learning loss function L to find
- measure discrepancy between Y and f(X)
- minimize the error of predicting Y
risk R(f) is the expected loss, or Expected Prediction Error (EPE). we seek the best \(f\) by minimizing the risk R(f) over the entire population:
\[f^{*}=\arg \min R(f)\]optimal learner:
- the optinal solution \(f^{*}\) depends on the loss function L
- the optimal solution \(f^{*}\) depends on the joint distribution \((X, Y) \sim P(X, Y)\); if the joint distribution is unknown, the optimal solution is unachievable.
- Bayes risk is the smallest possible risk, which is the risk of the Bayes estimator/optimal learner.
Bayes rule for binary classification
- soft and hard classification rules
-
soft classifier: obtain $$\hat{P}(Y=1 X=x)\(, compare\)\hat{P}(Y=1 X=x)$$ with .5 - examples: logistic regression, trees, LDA
-
hard classifier: directly estimate $$1(P(Y=1 X=x)>.5)$$ - examples: SVM
-
empirical risk minimization
since P(X,Y) is unknown, the risk function R(f) is also unknown. we estimate the risk by the empirical risk:
\[R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)\]- ERM for regression
- loss function \(L(Y,f(X))=[Y-f(X)]^2\)
- empirical risk \(R_{emp}(f)=RSS(f)=\sum_{i=1}^{n}[y_i-f(x_i)]^2\), also known as residual sum of squares (RSS)
restricted estimators
- overfitting: RSS=0
- solution: restricted estimators to a class & penalization
- pre-specified class
- parametric model class: the form of f is known, only parameters are unknown
- nonparametric model class: the form of f is unspecified
- penalization:
- penalized RSS = RSS + $$\lambda J(f)$
EPE interpretation
EPE = estimation error + approximation error
- estimation error: the error of estimating the parameters of f, due to the finite training data
- contribute to variance
- approximation error: the error of approximating f, possible that the empirical function is not in the specified class
- contribute to bias
Data issues
- missing values
- SVM, regularized models, neural networks cannot tolerate any missing values
- CART and Naive Bayes can handle missing values
- be on vastly different scales
- follow a skewed distribution
- contain a small number of extreme values
- be censored on the low and/or high end of the range
- have a complex relationship with the response and be truly predictive but cannot be adequately represented with a simple function or extracted by sophisticated models
- contain relevant and overly redundant information
Feature Engineering
some machine learning algrithms require feature engineering, such as tree-based algorithms or the data to be in a specific form. Some algorithms perform better if the data is prepared in a specific way. Additionally, the raw data may not be in the best position to expose the underlying structure and relationships to the responses.
- Feature Engineering
- Feature Construction
- categorical encoding
- dummy variables: Creating dummy predictors may not be the most effective way of extracting predictive information from a categorical predictor.
- factors
- numerous variables
- tree-based methods are immune to skewness and outliers
- linear models, kNN, SVM are sensitive to skewness and outliers
- MLR, nueral networks are adversely impacted by high correlations while partial least squares are designed for it
- categorical encoding
- Feature Extraction
- Feature Selection
- Feature Transformation
- regression methods work better when the features are standarized or normalized
- Feature Scaling
- instance based methods are more effective if the features have the same scale
- Feature Interaction
- Feature Construction
- linear models
- linear regression
- logistic regression (LR)
- linear discriminant analysis (LDA)
- tree-based methods
- decision trees
- random forest
- gradient boosting
- instance-based methods
- kNN
- SVM
- neural networks
- feedforward neural networks
- convolutional neural networks
- recurrent neural networks
Binary classification
- input vector \(x \in \mathbb{R}^d\)
- output \(y \in \{0,1\}\)
- the goal is to construct a function \begin{equation}f: \mathcal{X} \rightarrow {0,1} \end{equation}
using 0-1 loss, the risk of a classifier \(f: \mathcal{X} \rightarrow Y\) is given by:
\begin{equation} R(f)=EPE(f)E_{X,Y} \mathcal{1}(Y \ne f(X)) = P(Y \ne f(X)) \end{equation}
the Bayes rule \(f^{*}\) relies on the posterior probabilities
\begin{equation} f^{*}=\arg \min R(f)=\begin{cases} 1 \quad if \quad P(Y=1|X=x) > P(Y=0|X=x)
0 \quad if \quad P(Y=1|X=x) <> P(Y=0|X=x) \end{cases} \end{equation}
the Bayes risk is defined as the risk of \(f^{*}\), which has the smallest possible risk among all possible classifiers
\begin{equation} R(f^{})=P(Y \ne f^{}(X))=P(Y=1)P(Y=0|X=x)+P(Y=0)P(Y=1|X=x) \end{equation}
example: rare disease
define class 1 = “disease”, 0 = “disease-free”. \(\pi_1=1\%, \pi_0=99\%\).
recall the Bayes rule $$f(x)=\mathcal{1}(P(Y=1 | X=x) > P(Y=0 | X=x))$$ |
-
posterioe class probability P(Y=j X=x) gives updated probabilities after observing x -
if $$P(Y=1 X=x) > 0.5$$, thenwe randomly assign data to one class.
Example: Assume a certain rare disease occurs among 1% of the population. There is a test for this disease: 99.5% of the disease will test positive, and only 0.5% of the disease-free group will test positive. (We assume the false positive and false negative rate are both 0.005.) Now a person comes with a positive test result. What is the prediction rule?
the conditional probability of X given Y is
\[P(X=+|Y=1)=0.995, P(X=-|Y=0)=0.005 \\ P(X=+|Y=0)=0.005, P(X=-|Y=0)=0.995\]using Bayes’ Theorem,
\[P(Y=1|X=+) = \frac{P(X=+|Y=1)P(Y=1)}{P(X=+)} = \frac{P(X=+|Y=1)P(Y=1)}{P(X=+|Y=1)P(Y=1)+P(X=+|Y=0)P(Y=0)} = \frac{0.995*0.01}{0.995*0.01+0.005*0.99} = 0.668 \\ P(Y=0|X=+) = \frac{P(X=+|Y=0)P(Y=0)}{P(X=+)} = \frac{0.005*0.99}{0.995*0.01+0.005*0.99} = 0.332\]Since P(Y = 0 | X = +) = 0.332 < 0.668, the Bayes rule assigns a person with the “+” test result to class “disease”. |
Similarly, P(Y = 0 | X = −) = 0.9999, P(Y = 1 | X = −) = 0.0001, so the Bayes rule assigns a person with the “-” test result to class “disease-free”. |
unequal cost
the Bayes rule under unequal cost is given by
\begin{equation} f^{*}(x)= \begin{cases} 1 \quad if \quad C(1,0)P(Y=1|X=x) > C(0,1)P(Y=0|X=x)
0 \quad if \quad C(1,0)P(Y=1|X=x) < C(0,1)P(Y=0|X=x) \end{cases} \end{equation}
Linear classification methods
two popular linear classifiers
- Linear Discriminant Analysis (LDA)
- Logistic Regression models (LR)
- both models rely on the linear-odd assumption, indirectly or directly.
- LDA and LR estimate the coefficients in a different ways.
linear-logit model (LDA and LR): assume that the logit is linear in x:
\begin{equation} \log \frac{P(Y=1|X=x)}{P(Y=0|X=x)}=w^Tx+b \end{equation}
posterior probability:
\begin{equation} P(Y=1|X=x)=\frac{e^{w^Tx+b}}{1+e^{w^Tx+b}}=\frac{1}{1+e^{-w^Tx-b}}
P(Y=0|X=x)=\frac{1}{1+e^{w^Tx+b}} \end{equation}
under equal-cost, the decision boundary is given by $${x | w^Tx+b=0}={x | P(Y=1 | X=x)=0}$$. |
LDA
LDA assumptions
-
each class density is multivariate Guassian: $$X Y_j \sim N(\mu_j, \sigma_j)$$ - Equal covariance matrices for each class: \(\sigma_j = \sigma\)
under mixture Guassian assumption, the log-odds is expressed as:
\begin{equation} \log \frac{P(Y=1|X=x)}{P(Y=0|X=x)}=\log \frac{\pi_1 \phi(x|\mu_1, \Sigma)/m(x)}{\pi_0 \phi(x|\mu_0, \Sigma)/m(x)}
= \log \frac{\pi_1}{\pi_0} + \log \phi(x|\mu_1, \Sigma)- \log \phi(x|\mu_0, \Sigma)
= \log \frac{\pi_1}{\pi_0} - \frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)
= \log \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0) + x^T\Sigma^{-1}(\mu_1-\mu_0)
\log \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1+\mu_0)^T\beta_1+x^T\beta_1 \end{equation}
under 0-1 loss, the Bayes rule is: assign 1 to x if and only if:
\[\log \frac{P(Y=1|X=x)}{Y=0|X=x}>0\]which is equivalent to: assign 1 to x if and only if:
\[\bigg [ \log \frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0) \bigg ] + x^T\Sigma^{-1}(\mu_1-\mu_0) > 0.\]LR
-
denote $$\mu=E(Y X)=P(Y=1 X)$$. - assuming \(g(\mu)=\log \frac{\mu}{1-\mu}=\beta_0+\beta_1^TX\).
high-dimensional classifiers
- LDA-type
- Naive Bayes, Nearest Shrunken Centroid (NSC)
- sparse LDA, regularized LDA
- penalized logistic regression
- large-margin methods
- support vector machines (SVM)
- classification tree
- decision tree
- random forest
- boosting
Nonlinear classifier
- KNN
- kernal SVM
- trees, random forests
- …
Nearest Neighbor (NN) classifiers
kernal SVM
Model Selection
- Cross Validation
- K-fold Cross Validation
- Stratified K-fold Cross Validation
- Leave-One-Out Cross Validation
- Repeated K-fold Cross Validation
- Nested Cross Validation
Assessing-Model-Performance
Regression
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R^2)
- Adjusted R-squared (R^2_adj)
Classification
- Confusion Matrix
- Accuracy
- ROC Curve: True Positive Rate (TPR) vs False Positive Rate (FPR)
- AUC: Area Under the ROC Curve
- Precision
- Recall
- Precision-recall Curve
- F1 Score
confusion matrix
Actual \ Predicted | Positive | Negative |
---|---|---|
Positive | TP | FN |
Negative | FP | TN |
- \[\mathcal{accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)}}\]
- balanced data
- sensitivity = TP / (TP + FN)
- specificity = TN / (TN + FP)
- true positive rate (TPR) = sensitivity = TP / (TP + FN)
- false positive rate (FPR) = 1- specificity = FP / (FP + TN)
- ROC curve
- True Positive Rate (TPR) vs. False Positive Rate (FPR)
- sensitivity vs. (1-specificity)
- AUC: Area Under the ROC Curve
- imbalanced data
- \[\mathcal{precision = sensitivity = \frac{TP}{(TP + FP)}}\]
- \[\mathcal{recall = \frac{TP}{(TP + FN)}}\]
- F1 Score = 2 * (precision * recall) / (precision + recall)
- precision-recall curve: True Positive Rate (TPR) vs False Positive Rate (FPR)