# | date | topic | description |
---|---|---|---|
1 | 25-Aug-2025 | Introduction | |
2 | 27-Aug-2025 | Foundations of learning | Drop/Add |
3 | 01-Sep-2025 | Labor Day Holiday | Holiday |
4 | 03-Sep-2025 | Linear algebra (self-recap) | HW1 |
5 | 08-Sep-2025 | PAC learnability | |
6 | 10-Sep-2025 | Linear learning models | |
7 | 15-Sep-2025 | Principal Component Analysis | Project ideas |
8 | 17-Sep-2025 | Curse of Dimensionality | |
9 | 22-Sep-2025 | Bayesian Decision Theory | HW2, HW1 due |
10 | 24-Sep-2025 | Parameter estimation: MLE | |
11 | 29-Sep-2025 | Parameter estimation: MAP & NB | finalize teams |
12 | 01-Oct-2025 | Logistic Regression | |
13 | 06-Oct-2025 | Kernel Density Estimation | |
14 | 08-Oct-2025 | Support Vector Machines | HW3, HW2 due |
15 | 13-Oct-2025 | * Midterm | Exam |
16 | 15-Oct-2025 | Matrix Factorization | |
17 | 20-Oct-2025 | * Mid-point projects checkpoint | * |
18 | 22-Oct-2025 | k-means clustering | |
19 | 27-Oct-2025 | Expectation Maximization | |
20 | 29-Oct-2025 | Stochastic Gradient Descent | HW4, HW3 due |
21 | 03-Nov-2025 | Automatic Differentiation | |
22 | 05-Nov-2025 | Nonlinear embedding approaches | |
23 | 10-Nov-2025 | Model comparison I | |
24 | 12-Nov-2025 | Model comparison II | HW5, HW4 due |
25 | 17-Nov-2025 | Model Calibration | |
26 | 19-Nov-2025 | Convolutional Neural Networks | |
27 | 24-Nov-2025 | Thanksgiving Break | Holiday |
28 | 26-Nov-2025 | Thanksgiving Break | Holiday |
29 | 01-Dec-2025 | Word Embedding | |
30 | 03-Dec-2025 | * Project Final Presentations | HW5 due, P |
31 | 08-Dec-2025 | Extra prep day | Classes End |
32 | 10-Dec-2025 | * Final Exam | Exam |
34 | 17-Dec-2025 | Project Reports | Due |
35 | 19-Dec-2025 | Grades due | 5 p.m. |
Empirical Risk Minimization
The Realizability Assumption: $\exists h^* \in {\cal H} \text{ s.t. } L_{({\cal D}, f)}(h^*)=0$, which implies that $L_S(h^*)=0$ with probability 1.
The i.i.d. Assumption: Samples in the training set are independent and identically distributed.
$\delta$ - probability of a nonrepresentative sample
$\epsilon$ - the accuracy parameter
Failure is when $L_{({\cal D}, f)}(h_S)>\epsilon$
Success is when $L_{({\cal D}, f)}(h_S)\le \epsilon$
Given a finite $\cal H$, a fixed $\delta \in (0,1)$ and $\epsilon > 0$, and $m \ge \ln{(|{\cal H}|/\delta)}/\epsilon$: for any $f$ and $\cal D$ for which the realizability assumption holds, with probability at least $1-\delta$ over i.i.d. samples $S$ of size $m$, every ERM hypothesis $h_S$ satisfies
$L_{({\cal D}, f)}(h_S) \le \epsilon$
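This bound is easy to evaluate numerically. A minimal sketch (the values of $|{\cal H}|$, $\epsilon$, and $\delta$ below are purely illustrative):

```python
import math

def realizable_sample_complexity(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= ln(|H| / delta) / epsilon (finite H, realizable case)."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# Illustrative values: |H| = 10^6 hypotheses, 1% error, 95% confidence.
print(realizable_sample_complexity(10**6, epsilon=0.01, delta=0.05))  # -> 1682
```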
PAC learnability: ${\cal H}$ is PAC learnable if $\exists m_{\cal H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following properties:
- $\forall \epsilon, \delta \in (0,1)$, $\forall {\cal D}$ over $\cal X$, and $\forall f:{\cal X} \rightarrow \{0, 1\}$
- when realizability assumption holds
- when running on $m \ge m_{\cal H}(\epsilon, \delta)$ i.i.d. samples, the learning algorithm returns $h_S$ s.t.
- $\underset{S\sim{\cal D}}{\PP}[L_{({\cal D},f)}(h_S) \le \epsilon] \ge 1 - \delta$
$m_{\cal H}(\epsilon, \delta)$ is the minimal integer that satisfies the requirement of PAC learning with accuracy $\epsilon$ and confidence $\delta$
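To make the definition concrete, here is a minimal ERM sketch over a small finite hypothesis class of threshold classifiers; the class, the uniform data distribution, and the labeling function are illustrative assumptions, not part of the definition:

```python
import random

# Hypothetical finite class: threshold classifiers h_t(x) = 1[x >= t] on a fixed grid.
THRESHOLDS = [i / 20 for i in range(21)]

def h(t, x):
    return 1 if x >= t else 0

def f(x):
    # Assumed true labeling function; 0.4 lies on the grid, so realizability holds.
    return 1 if x >= 0.4 else 0

def erm(sample):
    """Return a threshold minimizing the empirical 0-1 risk L_S over the finite class."""
    return min(THRESHOLDS, key=lambda t: sum(h(t, x) != y for x, y in sample))

random.seed(0)
S = [(x, f(x)) for x in (random.random() for _ in range(200))]  # 200 i.i.d. uniform samples
print(erm(S))  # with high probability, a threshold whose true error is at most epsilon
```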
The Realizability Assumption: $\exists h^* \in {\cal H} \text{ s.t. } \PP_{x\sim{\cal D}}[h^*(x) = f(x)]=1$
\begin{align} L_{\CD}(h) & \def \underset{(x,y)\sim {\CD}}{\PP} [h(x) \ne y] = \CD[\{(x,y):h(x)\ne y\}] \end{align}
Find a predictor $h$ that minimizes $L_{\CD}(h)$.
The learner does not know the data-generating distribution $\CD$.
${\cal H}$ is agnostic PAC learnable if $\exists m_{\cal H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following properties:
- $\forall \epsilon, \delta \in (0,1), {\cal D}$ over ${\cal X}\times {\cal Y}$
- when running on $m \ge m_{\cal H}(\epsilon, \delta)$ i.i.d. samples, the algorithm returns $h_S$ s.t.
- $\underset{S\sim{\cal D}}{\PP}[L_{\CD}(h_S) \le \underset{h' \in {\cal H}}{\min} L_{\CD}(h') + \epsilon] \ge 1 - \delta$
The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean).
Expected square difference:
$L_{\CD}(h) \def \underset{(x,y)\sim\CD}{\mathbb{E}} (h(x) - y)^2$
Mean squared error:
$L_S(h) \def \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2$
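The two quantities differ only in what they average over: the distribution $\CD$ versus a finite sample. A small Monte Carlo sketch (the predictor and data-generating process below are assumed for illustration) shows the sample MSE approximating the population risk:

```python
import random

random.seed(1)

def h(x):                       # assumed toy predictor
    return 2.0 * x

def draw():                     # assumed data-generating process: y = 2x + Gaussian noise
    x = random.gauss(0, 1)
    return x, 2.0 * x + random.gauss(0, 0.5)

# Empirical MSE on a sample of m = 100 points.
sample = [draw() for _ in range(100)]
mse = sum((h(x) - y) ** 2 for x, y in sample) / len(sample)

# Monte Carlo approximation of the expected square difference (population risk).
population = [draw() for _ in range(100_000)]
risk = sum((h(x) - y) ** 2 for x, y in population) / len(population)

print(f"L_S(h) = {mse:.3f},  L_D(h) (Monte Carlo) = {risk:.3f}")  # both near the noise variance 0.25
```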
For any $\cal H$ and any domain $Z$, let $\loss: {\cal H}\times Z\rightarrow \mathbb{R}_{+}$ be any function. We call such functions loss functions.
Redefine the risk function using $\cal l$:
$L_{\CD}(h) \def \underset{z\sim \CD}{\EE}[\loss(h,z)]$
Redefine empirical risk as:
$L_S(h) \def \frac{1}{m}\sum_{i=1}^m \loss(h,z_i)$
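In code, the generalized definitions just say that a risk is an average of a loss over points $z$; the sketch below (function names are hypothetical) recovers the 0-1 loss and the squared loss as special cases:

```python
def empirical_risk(loss, h, sample):
    """L_S(h) = (1/m) * sum_i loss(h, z_i) for a generic loss function over points z."""
    return sum(loss(h, z) for z in sample) / len(sample)

# Two standard special cases, with z = (x, y):
def zero_one_loss(h, z):
    x, y = z
    return 0.0 if h(x) == y else 1.0

def squared_loss(h, z):
    x, y = z
    return (h(x) - y) ** 2
```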
${\cal H}$ is agnostic PAC learnable with respect to $Z$ and $\loss: \hclass\times Z \rightarrow \RR_+$ if $\exists m_{\cal H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following properties:
- $\forall \epsilon, \delta \in (0,1), {\cal D}$ over $Z$
- when running on $m \ge m_{\cal H}(\epsilon, \delta)$ i.i.d. samples, the algorithm returns $h_S$ s.t.
- $\underset{S\sim{\cal D}}{\PP}[L_{\CD}(h_S) \le \underset{h^\prime \in {\cal H}}{\min} L_{\CD}(h^\prime) + \epsilon] \ge 1 - \delta$,
- where $L_{\CD}(h) = \underset{z\sim \CD}{\EE}[\loss(h,z)]$
It suffices to show that the empirical risk of every $h\in \hclass$ is a good approximation of its true risk.
$\epsilon$-representative sample: A training set is called $\epsilon$-representative (w.r.t. $Z$, $\hclass$, $\loss$, and $\CD$) if
$\forall h\in \hclass, \mid L_S(h) - L_{\CD}(h)\mid \le \epsilon$
Assume $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. $Z$, $\hclass$, $\loss$, and $\CD$); then any output of ERM$_{\hclass}(S)$, i.e. any $h_S\in \argmin_{h\in\hclass}L_S(h)$, satisfies:
$L_{\CD}(h_S) \le \underset{h\in\hclass}{\min}L_{\CD}(h)+\epsilon$
Proof: for every $h\in\hclass$,
$L_{\CD}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\CD}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_{\CD}(h) + \epsilon$
The first and third inequalities use that $S$ is $\frac{\epsilon}{2}$-representative; the second uses that $h_S$ minimizes $L_S$. Since the bound holds for every $h\in\hclass$, it holds in particular for a minimizer of $L_{\CD}$ over $\hclass$.
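A quick numerical sanity check of the lemma (the finite class and noisy distribution below are assumed purely for illustration): estimate how often a sample of size 100 is $\frac{\epsilon}{2}$-representative, and how often the ERM output is within $\epsilon$ of the best hypothesis in the class.

```python
import random

random.seed(2)
THRESHOLDS = [i / 10 for i in range(11)]   # assumed finite class of threshold classifiers
EPS = 0.2

def h(t, x):
    return 1 if x >= t else 0

def draw():
    # Assumed noisy distribution: labels follow the threshold 0.5, flipped 10% of the time.
    x = random.random()
    y = h(0.5, x)
    return (x, 1 - y) if random.random() < 0.1 else (x, y)

def emp_risk(t, sample):
    return sum(h(t, x) != y for x, y in sample) / len(sample)

# Approximate the true risk L_D of each hypothesis with one large Monte Carlo sample.
big = [draw() for _ in range(200_000)]
true_risk = {t: emp_risk(t, big) for t in THRESHOLDS}
best = min(true_risk.values())

trials, representative, erm_ok = 500, 0, 0
for _ in range(trials):
    S = [draw() for _ in range(100)]
    representative += all(abs(emp_risk(t, S) - true_risk[t]) <= EPS / 2 for t in THRESHOLDS)
    t_S = min(THRESHOLDS, key=lambda t: emp_risk(t, S))
    erm_ok += true_risk[t_S] <= best + EPS

print(f"eps/2-representative samples: {representative}/{trials}, ERM within eps of best: {erm_ok}/{trials}")
```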
To ensure that the ERM rule is an agnostic PAC learner, it suffices to show that, with probability of at least $1-\delta$ over the random choice of a training set, it will be an $\epsilon$-representative training set.
Finite hypothesis classes are agnostic PAC learnable (assuming a bounded loss, e.g. $\loss(h,z)\in[0,1]$).
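As a reference point (a standard bound for finite classes, assuming the loss takes values in $[0,1]$; it follows from Hoeffding's inequality together with a union bound over $\hclass$), the sample complexity can be bounded by $m_{\hclass}(\epsilon, \delta) \le \left\lceil \frac{2\ln(2|\hclass|/\delta)}{\epsilon^2} \right\rceil$.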