Given a finite ${\cal H}$, fixed $\delta \in (0,1)$ and $\epsilon > 0$, and $m \ge \ln{(|{\cal H}|/\delta)}/\epsilon$: for any $f, {\cal D}$ for which the realizability assumption holds, with probability at least $1-\delta$ over i.i.d. samples $S$ of size $m$, every ERM hypothesis $h_S$ satisfies
$L_{({\cal D}, f)}(h_S) \le \epsilon$
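As a quick numerical illustration (not from the slides; the values of $|{\cal H}|$, $\epsilon$, $\delta$ below are made up), the bound can be evaluated directly:
\begin{verbatim}
import math

def realizable_sample_complexity(h_size, epsilon, delta):
    """Smallest integer m satisfying m >= ln(|H| / delta) / epsilon."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# Illustrative (made-up) values: |H| = 1000, epsilon = 0.1, delta = 0.05
print(realizable_sample_complexity(1000, 0.1, 0.05))  # -> 100 (ceil of ~99.03)
\end{verbatim}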
Leslie Valiant
PAC Learning
Valiant 1984
PAC learnability: ${\cal H}$ is PAC learnable if
$\exists m_{\cal H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following properties:
$\forall \epsilon, \delta \in (0,1), {\cal D}, f$
$\forall {\cal D}$ over ${\cal X}$ and $\forall f:{\cal X} \rightarrow \{0, 1\}$
when the realizability assumption holds
when running on $m \ge m_{\cal H}(\epsilon, \delta)$ i.i.d. samples, the learning algorithm returns $h_S$ s.t., with probability at least $1-\delta$ over the choice of $S$, $L_{({\cal D}, f)}(h_S) \le \epsilon$
$m_{\cal H}(\epsilon, \delta): (0,1)^2 \rightarrow \mathbb{N}$
How many samples are required to guarantee a probably approximately correct solution?
Many functions $m_{\cal H}$ satisfy the definition; to specify one:
$m_{\cal H}(\epsilon, \delta)$ is the minimal integer that satisfies the requirement of PAC learning with accuracy $\epsilon$ and confidence $\delta$
We can only hope to get close to the best error achievable by the hypothesis class.
Multiclass Classification
${\cal Y} = \{0,1,\dots,C\}$
document classification
Multiclass Classification: BCI
Multiclass Classification: next word
Regression
Regression: history
Carl Friedrich Gauss vs. Adrien-Marie Legendre
1809 vs. 1805
Francis Galton
Regression: why the name
The term "regression" was coined by Francis Galton in
the nineteenth century to describe a biological
phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down
towards a normal average (a phenomenon also known as
regression toward the mean).
${\cal H}$ is agnostic PAC learnable with respect to $Z$ and $\loss: \hclass \times Z \rightarrow \RR_+$ if
$\exists m_{\cal H}: (0, 1)^2 \rightarrow \mathbb{N}$ and a learning algorithm with the following properties:
$\forall \epsilon, \delta \in (0,1), {\cal D}$ over $Z$
when running on $m \ge m_{\cal H}(\epsilon, \delta)$ i.i.d. samples, the algorithm returns $h_S$ s.t., with probability at least $1-\delta$ over the choice of $S$, $L_{\CD}(h_S) \le \underset{h\in\hclass}{\min}\, L_{\CD}(h)+\epsilon$
where $L_{\CD}(h) = \underset{z\sim \CD}{\EE}[\loss(h,z)]$
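For concreteness, two standard choices of such a loss (textbook examples, not specific to these slides), with $z = (x, y)$, are the 0-1 loss for classification and the squared loss for regression:
\[
\loss_{0\text{-}1}(h, (x,y)) = 1[h(x) \ne y], \qquad \loss_{\rm sq}(h, (x,y)) = (h(x) - y)^2 .
\]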
Learning via Uniform Convergence
Learning Algorithm Function
Receive the training set $S$
$\forall h\in\hclass$ evaluate $L_S(h)$
Return $h$ for which $L_S(h)$ is minimized
We hope:
$h = \underset{h\in \hclass}{\argmin} L_S(h)$ is also $h = \underset{h\in \hclass}{\argmin} L_{\CD}(h)$
or is close to it ;)
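A minimal sketch of this ERM rule over a small finite class (not from the slides; the threshold classifiers, the toy data, and all parameter values below are made-up illustrations):
\begin{verbatim}
import numpy as np

# Made-up finite hypothesis class: threshold classifiers h_t(x) = 1 if x >= t else 0
thresholds = np.linspace(0.0, 1.0, 11)

def empirical_risk(t, X, y):
    """L_S(h_t): average 0-1 loss of the threshold classifier h_t on the sample S = (X, y)."""
    predictions = (X >= t).astype(int)
    return np.mean(predictions != y)

def erm(X, y):
    """Evaluate L_S(h) for every h in the finite class and return a minimizer."""
    risks = [empirical_risk(t, X, y) for t in thresholds]
    return thresholds[int(np.argmin(risks))]

# Toy i.i.d. sample (made up): labels indicate x >= 0.5, flipped with 10% noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=100)
y = ((X >= 0.5).astype(int) + (rng.uniform(size=100) < 0.1).astype(int)) % 2
print("ERM threshold:", erm(X, y))
\end{verbatim}
Uniform convergence (below) is exactly what justifies hoping that the returned threshold also has low true risk.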
Informally
It suffices to show that, for every $h\in \hclass$, the empirical risk $L_S(h)$ is a good approximation of the true risk $L_{\CD}(h)$.
formal definition
$\epsilon$-representative sample:
A training set is called $\epsilon$-representative (w.r.t. $Z$, $\hclass$, $\loss$, and $\CD$) if
$\forall h\in \hclass,\ |L_S(h) - L_{\CD}(h)| \le \epsilon$
Assume $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. $Z$, $\hclass$, $\loss$, and $\CD$); then the output of ERM$_{\hclass}(S)$ ($h_S\in \argmin_{h\in\hclass}L_S(h)$) satisfies:
$L_{\CD}(h_S) \le \underset{h\in\hclass}{\min}L_{\CD}(h)+\epsilon$
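A short proof sketch (the standard chain of inequalities, not spelled out in the slides): for any $h \in \hclass$,
\[
L_{\CD}(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_{\CD}(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_{\CD}(h) + \epsilon,
\]
where the first and last inequalities use $\frac{\epsilon}{2}$-representativeness and the middle one uses the definition of $h_S$ as an empirical risk minimizer; taking the minimum over $h\in\hclass$ gives the claim.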
To ensure that the ERM rule is an agnostic PAC learner, it
suffices to show that with probability of at least
$1-\delta$ over the random choice of a training set, it
will be an $\epsilon$-representative training set.
without a proof
Finite hypothesis classes are agnostic PAC learnable
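A sketch of the standard argument behind this claim (assuming the loss is bounded in $[0,1]$): Hoeffding's inequality bounds the probability that $|L_S(h) - L_{\CD}(h)| > \frac{\epsilon}{2}$ for a single $h$, and a union bound over the $|\hclass|$ hypotheses makes $S$ $\frac{\epsilon}{2}$-representative with probability at least $1-\delta$ once
\[
m \ge \left\lceil \frac{2\ln(2|\hclass|/\delta)}{\epsilon^2} \right\rceil .
\]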