Advanced Machine Learning

05: Linear models

Schedule (you are here)

| # | date | topic | description |
|---|------|-------|-------------|
| 1 | 25-Aug-2025 | Introduction | |
| 2 | 27-Aug-2025 | Foundations of learning | Drop/Add |
| 3 | 01-Sep-2025 | Labor Day Holiday | Holiday |
| 4 | 03-Sep-2025 | Linear algebra (self-recap) | HW1 |
| 5 | 08-Sep-2025 | PAC learnability | |
| 6 | 10-Sep-2025 | Linear learning models | |
| 7 | 15-Sep-2025 | Principal Component Analysis | Project ideas |
| 8 | 17-Sep-2025 | Curse of Dimensionality | |
| 9 | 22-Sep-2025 | Bayesian Decision Theory | HW2, HW1 due |
| 10 | 24-Sep-2025 | Parameter estimation: MLE | |
| 11 | 29-Sep-2025 | Parameter estimation: MAP & NB | finalize teams |
| 12 | 01-Oct-2025 | Logistic Regression | |
| 13 | 06-Oct-2025 | Kernel Density Estimation | |
| 14 | 08-Oct-2025 | Support Vector Machines | HW3, HW2 due |
| 15 | 13-Oct-2025 | * Midterm | Exam |
| 16 | 15-Oct-2025 | Matrix Factorization | |
| 17 | 20-Oct-2025 | * Mid-point projects checkpoint | |
| 18 | 22-Oct-2025 | k-means clustering | |
| 19 | 27-Oct-2025 | Expectation Maximization | |
| 20 | 29-Oct-2025 | Stochastic Gradient Descent | HW4, HW3 due |
| 21 | 03-Nov-2025 | Automatic Differentiation | |
| 22 | 05-Nov-2025 | Nonlinear embedding approaches | |
| 23 | 10-Nov-2025 | Model comparison I | |
| 24 | 12-Nov-2025 | Model comparison II | HW5, HW4 due |
| 25 | 17-Nov-2025 | Model Calibration | |
| 26 | 19-Nov-2025 | Convolutional Neural Networks | |
| 27 | 24-Nov-2025 | Thanksgiving Break | Holiday |
| 28 | 26-Nov-2025 | Thanksgiving Break | Holiday |
| 29 | 01-Dec-2025 | Word Embedding | |
| 30 | 03-Dec-2025 | * Project Final Presentations | HW5 due, P |
| 31 | 08-Dec-2025 | Extra prep day | Classes End |
| 32 | 10-Dec-2025 | * Final Exam | Exam |
| 34 | 17-Dec-2025 | Project Reports | due |
| 35 | 19-Dec-2025 | Grades due 5 p.m. | |

Outline for the lecture

  • Linear decision boundary
  • Perceptron
  • Perceptron extensions
  • Non-separable case

Linear Decision Boundary

A Hyperplane

[figure: perceptron]

A Hyperplane

[figure: hyperplane]
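A hyperplane is the set of points with $\vec{w}^T\vec{x} + w_0 = 0$; $\vec{w}$ is the normal vector, and $(\vec{w}^T\vec{x} + w_0)/\|\vec{w}\|$ is the signed distance to the plane. A minimal NumPy sketch (the particular numbers are illustrative):

```python
import numpy as np

# Hyperplane w^T x + w0 = 0 in R^3; w is the normal vector (illustrative values).
w = np.array([1.0, 2.0, 2.0])
w0 = -3.0

def signed_distance(x):
    """Signed Euclidean distance from x to the hyperplane."""
    return (w @ x + w0) / np.linalg.norm(w)

def side(x):
    """+1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if w @ x + w0 > 0 else -1

x = np.array([1.0, 1.0, 1.0])
# w @ x + w0 = 1 + 2 + 2 - 3 = 2 and ||w|| = 3, so the distance is 2/3
```

The sign of $\vec{w}^T\vec{x} + w_0$ is exactly the linear classifier's decision rule.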

An example!

[figure: iris]

Solution region

[figure: solution region]

Example: linear separability

[figure: separable and non-separable data]

Perceptron

A Hyperplane

[figure: perceptron]
Santiago Ramón y Cajal
[figure: Ramón y Cajal]

A Neuron

[figure: Cajal's neuron drawing]

A Perceptron

[figure: perceptron]

Criterion (objective)

$$ J(\vec{w}) = -\sum_{\text{incorrect } i} l_i\vec{w}^T\vec{x}_i$$
$\vec{w}$ - parameters of our model (the perceptron)
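The criterion is non-negative and hits zero exactly when no sample is misclassified. A sketch of evaluating it on a toy three-sample set (all values illustrative):

```python
import numpy as np

def perceptron_criterion(w, X, l):
    """J(w) = -sum over misclassified i of l_i * w^T x_i (>= 0; zero iff all correct)."""
    scores = (X @ w) * l       # l_i * w^T x_i, positive when sample i is correct
    wrong = scores <= 0        # misclassified (or on the boundary)
    return -scores[wrong].sum()

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
l = np.array([1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])
# sample scores l_i w^T x_i = [3, -1, 2]: only the second sample is
# misclassified, so J(w) = 1
```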

Batch Perceptron

[figure: batch perceptron algorithm]
${\cal Y}$ is the set of samples misclassified by $\vec{w}$
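The batch rule can be sketched in a few lines, assuming the usual conventions (labels $l_i = \pm 1$, inputs augmented with a constant 1 to absorb the bias; the function name is ours):

```python
import numpy as np

def batch_perceptron(X, l, eta=1.0, max_epochs=1000):
    """Batch update: w <- w + eta * sum over Y of l_i x_i,
    where Y is the set of samples misclassified by the current w."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        wrong = (X @ w) * l <= 0                       # misclassified set Y
        if not wrong.any():
            return w                                   # converged: Y is empty
        w = w + eta * (l[wrong, None] * X[wrong]).sum(axis=0)
    return w

# toy separable data, last column is the constant-1 bias feature
X = np.array([[1.0, 1.0, 1.0], [2.0, 0.0, 1.0],
              [-1.0, -1.0, 1.0], [0.0, -2.0, 1.0]])
l = np.array([1.0, 1.0, -1.0, -1.0])
w = batch_perceptron(X, l)        # separates all four points
```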

Stochastic Perceptron

[figure: stochastic perceptron algorithm]
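The single-sample (stochastic) variant updates on one misclassified sample at a time. A sketch; here we deterministically pick the first misclassified sample, which is one of several valid scheduling choices:

```python
import numpy as np

def stochastic_perceptron(X, l, max_updates=10000):
    """Fixed-increment single-sample rule: on a misclassified x_k,
    w <- w + l_k x_k (a sketch of the stochastic perceptron)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero((X @ w) * l <= 0)
        if wrong.size == 0:
            return w                 # no misclassified samples left
        k = wrong[0]                 # one misclassified sample at a time
        w = w + l[k] * X[k]
    return w

X = np.array([[1.0, 1.0, 1.0], [2.0, 0.0, 1.0],
              [-1.0, -1.0, 1.0], [0.0, -2.0, 1.0]])
l = np.array([1.0, 1.0, -1.0, -1.0])
w = stochastic_perceptron(X, l)
```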

Stochastic Perceptron: Convergence Theorem

If the training samples are linearly separable then the sequence of weight vectors in line 4 of Algorithm 2 will terminate at a solution vector.
[figure: stochastic perceptron algorithm]

Proof

Let us show that for any solution $\widetilde{\vec{w}}$, scaled by a suitable $\alpha > 0$, every correction brings $\vec{w}_k$ closer to it: $$\|\vec{w}_{k+1} - \alpha\widetilde{\vec{w}}\| \le \|\vec{w}_{k} - \alpha\widetilde{\vec{w}}\| $$
[figure: solution region]

Proof (1/3)

$\vec{w}_{k+1} = \vec{w}_k + l_k \vec{x}_k$
$l_k \widetilde{\vec{w}}^T\vec{x}_k > 0$
$\vec{w}_{k+1} - \alpha \widetilde{\vec{w}} = (\vec{w}_k - \alpha \widetilde{\vec{w}}) + l_k \vec{x}_k$
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 = \|(\vec{w}_k - \alpha \widetilde{\vec{w}}) + l_k \vec{x}_k\|^2$
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 = \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 + 2(\vec{w}_k - \alpha \widetilde{\vec{w}})^T\vec{x}_kl_k + \|l_k \vec{x}_k\|^2$
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 = \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 + 2\vec{w}_k^T\vec{x}_kl_k - 2\alpha \widetilde{\vec{w}}^T\vec{x}_kl_k + \|l_k \vec{x}_k\|^2$
$\vec{w}_{k}^Tl_k \vec{x}_k \le 0$ since $\vec{x}_k$ was misclassified
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 - 2 \alpha \widetilde{\vec{w}}^Tl_k \vec{x}_k + \|l_k \vec{x}_k\|^2$

Proof (2/3)

$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 - 2 \alpha \widetilde{\vec{w}}^Tl_k \vec{x}_k + \|l_k \vec{x}_k\|^2$
$\beta^2 = \underset{k}{\max}\|l_k \vec{x}_k\|^2 = \underset{k}{\max} \|\vec{x}_k\|^2$ (since $l_k = \pm 1$)
$\gamma = \underset{k}{\min}\left[ \widetilde{\vec{w}}^T \vec{x}_kl_k\right] > 0$
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 - 2 \alpha \gamma + \beta^2$
$\alpha = \frac{\beta^2}{\gamma}$
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 - \beta^2$
The squared distance to the scaled solution is reduced by at least $\beta^2$ at each correction.

Proof (3/3)

$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_k - \alpha \widetilde{\vec{w}}\|^2 - \beta^2$
The squared distance to the scaled solution is reduced by at least $\beta^2$ at each correction.
After $k$ iterations:
$\|\vec{w}_{k+1} - \alpha \widetilde{\vec{w}}\|^2 \le \|\vec{w}_1 - \alpha \widetilde{\vec{w}}\|^2 - k\beta^2$
The squared distance cannot become negative, so there can be at most $k_0$ corrections:
$k_0 = \frac{\|\vec{w}_1 - \alpha \widetilde{\vec{w}}\|^2}{\beta^2}$
Setting the initial parameters to zero, $\vec{w}_1 = \vec{0}$:
$k_0 = \frac{\alpha^2\|\widetilde{\vec{w}}\|^2}{\beta^2} = \frac{\beta^2\|\widetilde{\vec{w}}\|^2}{\gamma^2} = \frac{\underset{i}{\max} \|\vec{x}_i\|^2\,\|\widetilde{\vec{w}}\|^2}{\left(\underset{i}{\min}\left[l_i\vec{x}_i^T\widetilde{\vec{w}}\right]\right)^2}$
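The bound can be sanity-checked numerically on a toy separable set. A sketch; the solution vector below is hand-picked and all names are illustrative:

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
l = np.array([1.0, 1.0, -1.0, -1.0])
w_sol = np.array([1.0, 1.0])                 # a hand-picked separating vector

beta2 = max(np.sum(X**2, axis=1))            # beta^2 = max_i ||x_i||^2
gamma = min(l * (X @ w_sol))                 # gamma = min_i l_i x_i^T w_sol
assert gamma > 0                             # w_sol really separates the data
k0 = beta2 * np.sum(w_sol**2) / gamma**2     # bound on the number of corrections

# run the fixed-increment perceptron from w = 0 and count corrections
w = np.zeros(2)
corrections = 0
while True:
    wrong = np.flatnonzero((X @ w) * l <= 0)
    if wrong.size == 0:
        break
    w = w + l[wrong[0]] * X[wrong[0]]
    corrections += 1
# corrections never exceeds k0, as the theorem guarantees
```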

Perceptron Extensions

Margin

[figure: solution region]
[figure: margin]
[figure: stochastic perceptron]

Perceptron with Margin

[figure: perceptron with margin]
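A sketch of the margin variant: updates continue until every sample clears a margin $b > 0$, not merely the decision boundary (fixed-increment form assumed; names are illustrative):

```python
import numpy as np

def margin_perceptron(X, l, b=1.0, max_updates=10000):
    """Fixed-increment perceptron with margin: update on any x_k
    with l_k w^T x_k <= b, for a chosen margin b > 0."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero((X @ w) * l <= b)   # inside the margin
        if wrong.size == 0:
            return w
        k = wrong[0]
        w = w + l[k] * X[k]
    return w

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
l = np.array([1.0, 1.0, -1.0, -1.0])
w = margin_perceptron(X, l, b=1.0)   # every sample ends up beyond the margin
```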

Perceptron Relaxation

Define the loss as $$ J_r(\vec{w}) = \frac{1}{2} \underset{\text{incorrect}}{\sum} \frac{(\vec{w}^T\vec{x}l - b)^2}{\|l\vec{x}\|^2} $$
Then the gradient is $$ \nabla_{\vec{w}} J_r = \underset{\text{incorrect}}{\sum} \frac{\vec{w}^T\vec{x}l - b}{\|l\vec{x}\|^2}\vec{x}l $$
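With $\eta = 1$ a single-sample relaxation step lands exactly on the hyperplane $\vec{w}^T\vec{x}l = b$; under-relaxation ($\eta < 1$) stops short, over-relaxation ($1 < \eta < 2$) overshoots. A sketch of one step (names and values illustrative):

```python
import numpy as np

def relaxation_step(w, x, lk, b=1.0, eta=1.5):
    """Single-sample relaxation: move w toward the hyperplane w^T x l = b
    by the fraction eta of the remaining distance (0 < eta < 2)."""
    return w + eta * (b - (w @ x) * lk) / (x @ x) * (lk * x)

w = np.zeros(2)
x = np.array([1.0, 1.0])
lk = 1.0
w1 = relaxation_step(w, x, lk, b=1.0, eta=1.0)
# with eta = 1 the updated vector satisfies w1^T x l = b exactly
```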

Perceptron Relaxation

[figure: relaxed perceptron]

Perceptron Relaxation: interpretation

$$ r(k) = \frac{b - \vec{w}_k^T\vec{x}_kl_k}{\|\vec{x}_kl_k\|} $$
[figure: relaxation move]

Non-separable case

XOR
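That XOR is not linearly separable can be checked by brute force over candidate lines. A sketch; the search grid is arbitrary:

```python
import numpy as np
from itertools import product

# XOR: four points with labels +1 / -1; no line w1*x1 + w2*x2 + w0 = 0
# puts both classes strictly on opposite sides.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
l = np.array([-1.0, 1.0, 1.0, -1.0])

def separates(w, w0):
    return np.all(l * (X @ w + w0) > 0)

# brute-force search over a grid of candidate lines finds no separator
found = any(separates(np.array([w1, w2]), w0)
            for w1, w2, w0 in product(np.linspace(-2, 2, 21), repeat=3))
```

No grid, however fine, would help: linear inseparability of XOR is provable, which is exactly what stalled single-layer perceptrons.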

Separable example

[figure: separable example]

Multiple restarts

[figure: separable example]

Non-separable example

[figure: non-separable example]

AI winter

[figure: non-separable example]

What we usually encounter

[figure: non-separable data]

Criterion (objective/loss)

$$ J(\vec{w}) = -\sum_{\text{incorrect } i} l_i\vec{w}^T\vec{x}_i$$
$\vec{w}$ - parameters of our model (the perceptron)
$$ J_{\text{MSE}}(\vec{w}) = \frac{1}{2}\sum_{\forall i} (\vec{w}^T\vec{x}_i - b_i)^2$$
$$ \nabla_{\vec{w}} J_{\text{MSE}} = \sum_{\forall i} (\vec{w}^T\vec{x}_i - b_i)\vec{x}_i$$
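$J_{\text{MSE}}$ can be minimized in closed form via the pseudoinverse or iteratively by gradient descent on the gradient above. A sketch on synthetic, exactly realizable targets (all values, including the step size, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
b = X @ np.array([1.0, -2.0, 0.5])     # targets from a known linear map

# closed form: w = pinv(X) @ b minimizes J_MSE
w_closed = np.linalg.pinv(X) @ b

# gradient descent on J_MSE with a small fixed step
w = np.zeros(3)
for _ in range(2000):
    w -= 0.005 * X.T @ (X @ w - b)     # gradient of J_MSE
```

Both routes recover the same minimizer here; the iterative form is what scales to large data (and, sample by sample, becomes the LMS rule).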

Least Mean Squares

[figure: LMS]

Least Mean Squares

[figure: LMS, poor solution]