Math 565: Lecture Notes and Videos on Optimization for Machine Learning
Copyright: I (Bala Krishnamoorthy) hold the copyright for all lecture scribes/notes, documents, and other materials, including videos, posted on these course web pages. These materials may not be used for commercial purposes without my consent.
Scribes from all lectures so far (as a single big file)
| Lec | Date | Topic(s) | Scribe | Video |
| --- | --- | --- | --- | --- |
| 1 | Jan 13 | syllabus, logistics, ML problems: clustering, classification, regression; optimization in 1D, regression via minimization | Scb1 | Vid1 |
| 2 | Jan 15 | \(\nabla J = \mathbf{0} \Rightarrow D^T D \mathbf{w} = D^T \mathbf{y}\), optimization in graphs, using \(D = QR\) (sketch below the table), Tikhonov regularization, binary classification | Scb2 | Vid2 |
| 3 | Jan 20 | support vector machine (SVM), Taylor expansion, local optimality in 1D, gradient descent (python; sketch below the table), optimality in \(d\)-dim | Scb3 | Vid3 |
| 4 | Jan 22 | local optimality: second order conditions, convex (cvx) sets + functions, properties, \(f(g(\mathbf{w}))\) cvx when \(f\) cvx + \(g\) linear | Scb4 | Vid4 |
| 5 | Jan 27 | local min of cvx \(f \Rightarrow\) global min, first+second derivative conditions for convexity, strict convexity, computing \(\nabla J\), updating \(\alpha_t\) | Scb5 | Vid5 |
| 6 | Jan 29 | second order conditions example, line search for \(\alpha_t\), additively separable loss \(J = \sum_i J_i\), stochastic gradient descent (SGD; sketch below the table) | Scb6 | Vid6 |
| 7 | Feb 3 | optimization: general vs ML, hyperparameter tuning, cross validation, SGD for regression, hinge & \(L_2\)-SVM loss (for \(\pm 1\) labels) | Scb7 | Vid7 |
| 8 | Feb 5 | logistic regression loss \(J_{LR}\) smooth & strictly convex w/o regularizer, coordinate descent (CD), linear regression with CD | Scb8 | Vid8 |
| 9 | Feb 10 | block coordinate descent (BCD), k-means clustering as BCD, CD challenges, momentum-based learning \(\mathbf{v} \leftarrow \beta \mathbf{v} - \alpha \nabla J\) (sketch below the table) | Scb9 | Vid9 |
| 10 | Feb 12 | variants of GD: AdaGrad, RMSProp, AdaM (Adam), Newton method \(\mathbf{w} \leftarrow \mathbf{w} - H^{-1} \nabla J\) (sketch below the table), optimal for quadratic \(J\) | Scb10 | Vid10 |
| 11 | Feb 17 | Newton can increase \(J\) for nonquadratic \(J\), line search for Newton method, Newton method in regression and in \(L_2\)-SVM | Scb11 | Vid11 |
| 12 | Feb 19 | Newton for logistic regression & SVM, probability/uncertainty in misclassification, summary of Newton for ML problems | Scb12 | Vid12 |
| 13 | Feb 24 | Newton challenges: ill-conditioned Hessian, saddle points, convergence problems, trust region method, trust radius \(\delta_t\) | Scb13 | Vid13 |
| 14 | Feb 26 | conjugate gradient method (CGM; sketch below the table), \(H\)-orthogonality (conjugacy): \(\mathbf{q}_i^T H \mathbf{q}_j = 0\), projections of \(H\) using finite differences | Scb14 | Vid14 |
| 15 | Mar 3 | CGM for non-quadratic \(J\): linear & nonlinear CGM, quasi-Newton methods, secant condition, BFGS method, which preserves \(G_t \succeq 0\) | Scb15 | Vid15 |
| 16 | Mar 5 | L-BFGS: \(O(md)\) vs \(O(d^2)\) for BFGS, details for \(m=2\), subgradients for convex \(J\), regression with \(L_1\)-regularization | Scb16 | Vid16 |
| 17 | Mar 10 | subgradients with CD, proximal gradient method (PGM; sketch below the table), primal and dual approaches for constrained optimization | Scb17 | Vid17 |
| 18 | Mar 12 | projected gradient method for linear constraints, projection matrix, convex quadratic programs (CQPs) | Scb18 | Vid18 |
| 19 | Mar 24 | \(A\mathbf{w} \leq \mathbf{b}\) constraints, conditional gradient using LP, box constraints, \(A\) with linearly independent (LI) rows, Lagrangian relaxation, weak duality | Scb19 | Vid19 |
| 20 | Mar 26 | minimax inequality, \(\min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \geq \mathbf{0}} \{H(\mathbf{w},\boldsymbol{\alpha}) = F(\mathbf{w}) + \sum_i \alpha_i f_i(\mathbf{w})\} \equiv\) primal (P), Slater's condition, KKT conditions | Scb20 | Vid20 |
| 21 | Mar 31 | hinge-loss SVM dual (D) using KKT conditions, using stationarity conditions to eliminate \(\mathbf{w}, \xi_i\), solution of (D) using GD | Scb21 | Vid21 |
| 22 | Apr 2 | kernel methods using dual H-SVM, dual of unconstrained problems, linear regression, norm-constrained optimization | Scb22 | Vid22 |
| 23 | Apr 7 | VC dimension, shattering, types of SVM kernels, computational graphs (CGs), local function at node, global function | Scb23 | Vid23 |
| 24 | Apr 9 | video: regression as a CG, neural networks (NNs), activation functions, training using CGs, node-to-node derivatives | Scb24 | Vid24 |
| 25 | Apr 14 | video: path aggregation lemma, dynamic programming (DP) for node-to-node derivatives, backpropagation algorithm | Scb25 | Vid25 |
| 26 | Apr 16 | video: loss function-to-weights derivatives using backpropagation (DP), CGs with vector variables, Jacobian products | Scb26 | Vid26 |
| 27 | Apr 21 | video: details for NNs: edge derivatives, sensitivity recursion, and gradients w.r.t. edge weights; decoupled formulation | Scb27 | Vid27 |
| 28 | Apr 23 | video: NNs as vector ops, vector backpropagation, exponential decay of sensitivities (not with ReLU), skip connections | Scb28 | Vid28 |
| 29 | Apr 28 | video: transformer networks, attention weights/outputs, multi-head attention, Jacobian of softmax, residual connections | Scb29 | Vid29 |
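
The code sketches below illustrate a few of the algorithms listed above; all data, constants, and variable names in them are illustrative assumptions, not taken from the scribes. First, a minimal numpy sketch for Lecture 2: setting \(\nabla J = \mathbf{0}\) for a least-squares loss (the assumed form) gives the normal equations \(D^T D \mathbf{w} = D^T \mathbf{y}\), which are solved more stably via the factorization \(D = QR\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n points, d features (not from the course notes).
n, d = 100, 3
D = rng.standard_normal((n, d))          # data matrix
w_true = np.array([2.0, -1.0, 0.5])
y = D @ w_true + 0.1 * rng.standard_normal(n)

# Normal equations: grad J = 0  =>  D^T D w = D^T y.
w_normal = np.linalg.solve(D.T @ D, D.T @ y)

# More stable route via D = QR: solve R w = Q^T y.
Q, R = np.linalg.qr(D)
w_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(w_normal, w_qr))       # both recover roughly w_true
```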
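For the gradient descent (python) entry in Lecture 3, a minimal sketch of the update \(\mathbf{w} \leftarrow \mathbf{w} - \alpha_t \nabla J(\mathbf{w})\); the 1D objective and the fixed step size are assumptions for illustration.

```python
# Plain gradient descent w <- w - alpha * grad J(w), with a fixed
# step size on an illustrative 1D function J(w) = (w - 2)^2 + 1.
J = lambda w: (w - 2.0) ** 2 + 1.0
grad_J = lambda w: 2.0 * (w - 2.0)

w, alpha = 10.0, 0.1
for t in range(100):
    w -= alpha * grad_J(w)

print(w, J(w))   # w is close to the minimizer 2, J(w) close to 1
```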
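For SGD in Lecture 6: when the loss is additively separable, \(J = \sum_i J_i\), each step follows the gradient of a single randomly chosen term. The squared-error form of \(J_i\) and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative separable loss J(w) = sum_i (x_i^T w - y_i)^2 (assumed form).
n, d = 200, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_Ji(w, i):
    # Gradient of the single term J_i(w) = (x_i^T w - y_i)^2.
    return 2.0 * (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
alpha = 0.01                      # fixed step size (a hyperparameter)
for t in range(5000):
    i = rng.integers(n)           # pick one term uniformly at random
    w -= alpha * grad_Ji(w, i)

print(w)                          # approaches the least-squares solution
```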
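For the momentum update in Lecture 9, \(\mathbf{v} \leftarrow \beta \mathbf{v} - \alpha \nabla J\): a sketch on an assumed ill-conditioned quadratic. The companion weight update \(\mathbf{w} \leftarrow \mathbf{w} + \mathbf{v}\) is the standard pairing and is an assumption here, not quoted from the scribe.

```python
import numpy as np

# Momentum GD on an illustrative ill-conditioned quadratic
# J(w) = 0.5 w^T A w (A and all constants are assumptions).
A = np.diag([1.0, 50.0])
grad_J = lambda w: A @ w

w = np.array([5.0, 5.0])
v = np.zeros_like(w)
alpha, beta = 0.01, 0.9
for t in range(500):
    v = beta * v - alpha * grad_J(w)   # v <- beta v - alpha grad J
    w = w + v                          # companion weight update (assumed)

print(w)                               # near the minimizer (0, 0)
```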
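For the Newton update in Lecture 10, \(\mathbf{w} \leftarrow \mathbf{w} - H^{-1} \nabla J\): a sketch on an assumed strictly convex quadratic, where (as the row notes) a single step is optimal. Solving \(H\mathbf{p} = \nabla J\) avoids forming \(H^{-1}\) explicitly.

```python
import numpy as np

# Newton's method on an illustrative strictly convex quadratic
# J(w) = 0.5 w^T A w - b^T w, so grad J = A w - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_J = lambda w: A @ w - b
H = A                                   # Hessian is constant for a quadratic

w = np.array([10.0, -7.0])
for t in range(3):
    # Solve H p = grad J rather than inverting H.
    w = w - np.linalg.solve(H, grad_J(w))

print(w, np.linalg.solve(A, b))         # one Newton step already lands here
```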
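For the CGM in Lecture 14: a sketch of linear CG on a quadratic \(J(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^T H \mathbf{w} - \mathbf{b}^T \mathbf{w}\), whose search directions satisfy the conjugacy condition \(\mathbf{q}_i^T H \mathbf{q}_j = 0\). The matrix, right-hand side, and iteration count are assumptions.

```python
import numpy as np

# Linear CG for J(w) = 0.5 w^T H w - b^T w with H symmetric positive definite.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
H = M.T @ M + 5 * np.eye(5)              # SPD Hessian (illustrative)
b = rng.standard_normal(5)

w = np.zeros(5)
r = b - H @ w                            # residual = -grad J
q = r.copy()                             # first search direction
for t in range(5):                       # at most d steps in exact arithmetic
    Hq = H @ q
    alpha = (r @ r) / (q @ Hq)           # exact line search along q
    w += alpha * q
    r_new = r - alpha * Hq
    beta = (r_new @ r_new) / (r @ r)
    q = r_new + beta * q                 # next direction, H-conjugate to q
    r = r_new

print(np.allclose(H @ w, b))             # True: solved H w = b
```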
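For the proximal gradient method in Lecture 17, applied to the \(L_1\)-regularized regression from Lecture 16: a sketch of the standard soft-thresholding iteration (ISTA); the problem instance, \(\lambda\), and step-size rule are assumptions.

```python
import numpy as np

# Proximal gradient (ISTA) for L1-regularized least squares:
# min_w 0.5 ||D w - y||^2 + lam ||w||_1 (problem form assumed).
rng = np.random.default_rng(3)
n, d = 50, 10
D = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]              # sparse ground truth
y = D @ w_true + 0.1 * rng.standard_normal(n)
lam = 1.0

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

alpha = 1.0 / np.linalg.norm(D.T @ D, 2)   # step 1/L, L = ||D^T D||_2
w = np.zeros(d)
for t in range(500):
    g = D.T @ (D @ w - y)                  # gradient of the smooth part
    w = soft_threshold(w - alpha * g, alpha * lam)

print(np.round(w, 2))                      # roughly sparse like w_true
```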
Last modified: Wed Apr 29 12:27:44 PDT 2026