# | date | topic | description |
---|---|---|---|
1 | 22-Aug-2022 | Introduction | |
2 | 24-Aug-2022 | Foundations of learning | |
3 | 29-Aug-2022 | PAC learnability | |
4 | 31-Aug-2022 | Linear algebra (recap) | hw1 released |
| | 05-Sep-2022 | Holiday | |
5 | 07-Sep-2022 | Linear learning models | |
6 | 12-Sep-2022 | Principal Component Analysis | project ideas |
7 | 14-Sep-2022 | Curse of Dimensionality | hw1 due |
8 | 19-Sep-2022 | Bayesian Decision Theory | hw2 released |
9 | 21-Sep-2022 | Parameter estimation: MLE | |
10 | 26-Sep-2022 | Parameter estimation: MAP & NB | finalize teams |
11 | 28-Sep-2022 | Logistic Regression | |
12 | 03-Oct-2022 | Kernel Density Estimation | |
13 | 05-Oct-2022 | Support Vector Machines | hw3 released, hw2 due |
| | 10-Oct-2022 | *Mid-point projects checkpoint* | |
| | 12-Oct-2022 | *Midterm: Semester Midpoint* | exam |
14 | 17-Oct-2022 | Matrix Factorization | |
15 | 19-Oct-2022 | Stochastic Gradient Descent | |
# | date | topic | description |
---|---|---|---|
16 | 24-Oct-2022 | k-means clustering | |
17 | 26-Oct-2022 | Expectation Maximization | hw4 released, hw3 due |
18 | 31-Oct-2022 | Automatic Differentiation | |
19 | 02-Nov-2022 | Nonlinear embedding approaches | |
20 | 07-Nov-2022 | Model comparison I | |
21 | 09-Nov-2022 | Model comparison II | hw5 released, hw4 due |
22 | 14-Nov-2022 | Model Calibration | |
23 | 16-Nov-2022 | Convolutional Neural Networks | |
| | 21-Nov-2022 | Fall break | |
| | 23-Nov-2022 | Fall break | |
24 | 28-Nov-2022 | Word Embedding | hw5 due |
| | 30-Nov-2022 | Presentation and exam prep day | |
| | 02-Dec-2022 | *Project Final Presentations* | |
| | 07-Dec-2022 | *Project Final Presentations* | |
| | 12-Dec-2022 | *Final Exam* | |
| | 15-Dec-2022 | Grades due | |
Minimizing the Bayes error is the optimal strategy if we know the posterior exactly!
In the general case, we need to estimate the densities non-parametrically:
$\prob{P_{KDE}}{\vec{x}} = \frac{1}{Nh^d} \sum_{n=1}^N \prob{K}{\frac{\vec{x} - \vec{x}^n}{h}}$, where $\prob{K}{\vec{x}} = (2\pi)^{-d/2}e^{-\frac{1}{2}\vec{x}^T\vec{x}}$
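As a sketch of how this estimator looks in code (assuming numpy and a data matrix `X` of shape `(N, d)`; the function name, bandwidth, and toy data are illustrative):

```python
import numpy as np

def kde_gaussian(x, X, h):
    """Kernel density estimate at a query point x.

    x : (d,) query point
    X : (N, d) training samples
    h : bandwidth
    """
    N, d = X.shape
    # (x - x^n) / h for every training sample
    u = (x - X) / h                                                  # (N, d)
    # Gaussian kernel K(u) = (2*pi)^(-d/2) * exp(-u^T u / 2)
    K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u * u, axis=1))
    # P_KDE(x) = 1 / (N h^d) * sum_n K((x - x^n) / h)
    return K.sum() / (N * h ** d)

# toy usage: density at the origin for 200 samples from a 2-D standard normal
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(kde_gaussian(np.zeros(2), X, h=0.5))
```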
Decision boundary is sensitive to outliers!
Definition: Given a joint distribution $\prob{P_{KDE}}{\vec{x}|C}$ and a set of classifiers $\cal H$, we say that $h^*$ is a restricted Bayes optimal classifier with respect to $\cal H$ and ${\rm P}_{KDE}$ if $h^* \in {\cal H}$ and, for all $h\in {\cal H}$, $\prob{error}{h^*:{\rm P}_{KDE}} \le \prob{error}{h: {\rm P}_{KDE}}$
The distance to the decision boundary does not depend on the scale of $\vec{w}$ (rescaling $\vec{w}$ and $b$ leaves it unchanged)
To maximize the gap, need to minimize $\|\vec{w}\|$ or equivalently $\frac{1}{2}\|\vec{w}\|^2$ ($\frac{1}{2}\vec{w}^T\vec{w}$)
Want to minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w}^T\vec{x}_i + b) \ge 1$ for $i=1,\dots, N$
Classify new instance $\vec{x}$ as $f(\vec{x}) = \mbox{sign}(\vec{w}^T\vec{x} + b)$
Minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w}^T\vec{x}_i + b) - 1 \ge 0$ for $i=1,\dots, N$
\begin{align} \underset{\vec{w} \in \RR^m}{\argmin}\ & \vec{w}^T {\mathbf Q} \vec{w} + \vec{w}^T\vec{c} + \epsilon \\ \text{subject to}\quad {\mathbf A}\vec{w} & \le \vec{b}, \quad {\mathbf A} \in \RR^{n\times m},\ \vec{w} \in \RR^m,\ \vec{b}\in \RR^n \\ {\mathbf C}\vec{w} & = \vec{d}, \quad {\mathbf C} \in \RR^{s\times m},\ \vec{d} \in \RR^s \end{align}
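For concreteness, the hard-margin primal above is exactly such a QP and can be handed to a generic constrained solver. A minimal sketch using scipy's SLSQP on a tiny, linearly separable toy set (the data and the choice of solver are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# toy, linearly separable data with labels y in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):
    w = theta[:d]                        # theta = [w, b]
    return 0.5 * w @ w                   # (1/2) ||w||^2

def margin_constraints(theta):
    w, b = theta[:d], theta[d]
    return y * (X @ w + b) - 1.0         # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective,
               x0=np.zeros(d + 1),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
print("predictions:", np.sign(X @ w + b))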
Necessary optimality conditions
$\frac{\partial\Lambda}{\partial x} = 0$; $\frac{\partial\Lambda}{\partial \alpha} = 0$; $\frac{\partial\Lambda}{\partial \lambda} = 0$; $\frac{\partial\Lambda}{\partial \mu} = 0$;
$\min_x x^2$
s.t. $x \ge b$
$\min_x x^2$
s.t. $x \ge b$ or $x-b\ge 0$
Let's move the constraint into the objective via the Lagrangian:
$L(x, \alpha) = x^2 - \alpha(x-b)$
s.t. $\alpha \ge 0$
Solve: $\min_x \max_{\alpha} L(x, \alpha)$
s.t. $\alpha \ge 0$
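Working the toy problem through the optimality conditions (a short derivation for completeness): $\frac{\partial L}{\partial x} = 2x - \alpha = 0 \implies x = \frac{\alpha}{2}$, and complementary slackness requires $\alpha(x - b) = 0$.
- If $b \le 0$: the unconstrained minimizer $x = 0$ is already feasible, so $\alpha = 0$ and $x^* = 0$.
- If $b > 0$: the constraint is active, so $x^* = b$ and $\alpha = 2b \ge 0$.

Either way, the solution of $\min_x \max_{\alpha \ge 0} L(x, \alpha)$ matches that of the original constrained problem.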
Ellipse equation
$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$
Main point
Using Lagrange multipliers and the KKT conditions, we can convert a constrained optimization problem into an unconstrained one.
Minimize $\frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w}^T\vec{x}_i + b) - 1 \ge 0$ for $i=1,\dots, N$
Minimize $\Lambda(\vec{w}, b, \alpha) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^N \alpha_i (y_i(\vec{w}^T\vec{x}_i + b) - 1)$
$\frac{\partial\Lambda(\vec{w}, b, \alpha)}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^N \alpha_i y_i\vec{x}_i = 0 \implies \vec{w} = \sum_{i=1}^N \alpha_i y_i \vec{x}_i$, and $\frac{\partial\Lambda(\vec{w}, b, \alpha)}{\partial b} = -\sum_{i=1}^N \alpha_i y_i = 0 \implies \sum_{i=1}^N \alpha_i y_i = 0$
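Substituting $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ back into $\Lambda$ (and using $\sum_i \alpha_i y_i = 0$ to drop the $b$ term) yields the dual objective:
\begin{align} \Lambda &= \frac{1}{2}\vec{w}^T\vec{w} - \sum_{i=1}^N \alpha_i y_i \vec{w}^T\vec{x}_i - b\sum_{i=1}^N \alpha_i y_i + \sum_{i=1}^N \alpha_i \\ &= \frac{1}{2}\sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j - \sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j + \sum_{i=1}^N \alpha_i \\ &= \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j \end{align}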
Maximize $\sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j$ subject to $\alpha_i \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
The data may not be linearly separable in the original input space
One remedy is to map the data into a higher-dimensional feature space $\phi(\vec{x})$ where it becomes separable, but appropriate mappings are hard to construct explicitly. What to do?
Maximize $\sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j \vec{x}_i^T\vec{x}_j$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
The final classifier is $f(\vec{x}) = \mbox{sign}(\sum_{i=1}^N \alpha_i y_i \vec{x}_i^T\vec{x} + b)$
Maximize $\sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i\alpha_j y_i y_j K(\vec{x}_i,\vec{x}_j)$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
The final classifier is $f(\vec{x}) = \mbox{sign}(\sum_{i=1}^N \alpha_i y_i K(\vec{x}_i, \vec{x}) + b)$
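A compact numerical sketch of this kernelized dual (assuming numpy/scipy; the RBF kernel, toy data, and the use of SLSQP as a stand-in for a proper QP solver are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# XOR-like toy data (not linearly separable), labels in {-1, +1}
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
G = rbf_kernel(X, X)                         # Gram matrix G_ij = K(x_i, x_j)

# dual: maximize sum(alpha) - 1/2 alpha^T (y y^T ∘ G) alpha  <=>  minimize the negative
Q = (y[:, None] * y[None, :]) * G

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

cons = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * N                          # alpha_i >= 0
res = minimize(neg_dual, np.ones(N) / N, method="SLSQP",
               bounds=bnds, constraints=cons)
alpha = res.x

# recover b from a support vector (alpha_i > 0): y_i (sum_j alpha_j y_j K_ij + b) = 1
sv = np.argmax(alpha)
b = y[sv] - (alpha * y) @ G[:, sv]

def predict(Xnew):
    return np.sign(rbf_kernel(Xnew, X) @ (alpha * y) + b)

print(predict(X))   # should reproduce y on this toy set
```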
Maximize $\vec{\alpha}^T\vec{1} - \frac{1}{2} \vec{\alpha}^T{\bm Y}{\bm G}{\bm Y}\vec{\alpha}$, where ${\bm Y} = {\rm diag}(\vec{y})$ and ${\bm G}_{ij} = K(\vec{x}_i, \vec{x}_j)$
subject to $\vec{\alpha} \ge 0$ and $\vec{\alpha}^T \vec{y} = 0$
$K(\vec{x}, \vec{y})$ is a kernel function iff it is symmetric (i.e. $K(\vec{x}, \vec{y}) = K(\vec{y}, \vec{x})$) and positive semidefinite (i.e. every Gram matrix ${\bm G}$ with ${\bm G}_{ij} = K(\vec{x}_i, \vec{x}_j)$ satisfies $\vec{z}^T{\bm G}\vec{z} \ge 0$ for all $\vec{z} \in \RR^n$)
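A quick numerical sanity check of these two properties for the RBF kernel's Gram matrix (illustrative; the eigenvalue tolerance is arbitrary):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
G = rbf_kernel(X, X)

print(np.allclose(G, G.T))                      # symmetry: K(x, y) = K(y, x)
print(np.linalg.eigvalsh(G).min() >= -1e-10)    # PSD: all eigenvalues >= 0 (up to tolerance)
```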
Let's introduce slack variables $\xi_i \ge 0$ allowing mistakes
Maximize $\vec{\alpha}^T\vec{1} - \frac{1}{2} \vec{\alpha}^T{\bm Y}{\bm G}{\bm Y}\vec{\alpha}$
subject to $0 \le \alpha_i \le C$ for all $i$ and $\vec{\alpha}^T \vec{y} = 0$
Changing the constraints: $y_i(\vec{w}^T\vec{x}_i + b) \ge 1 - \xi_i$
Modifies the primal objective: minimize $\frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^N \xi_i$
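In practice this soft-margin, kernelized problem is handed to an off-the-shelf solver; a minimal scikit-learn sketch, assuming scikit-learn is available (the C, gamma, and toy data values are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

# toy data with one point inside the wrong cluster, so a hard margin would not exist
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.1, 0.0]])
y = np.array([-1, -1, 1, 1, 1])

# C controls the slack penalty: small C tolerates margin violations,
# large C approaches the hard-margin solution
clf = SVC(C=1.0, kernel="rbf", gamma=1.0)
clf.fit(X, y)
print(clf.predict(X))
print(clf.support_)    # indices of the support vectors
```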
Advantages
Disadvantages