Math 565: Lecture Notes and Videos on Optimization for Machine Learning

Copyright: I (Bala Krishnamoorthy) hold the copyright for all lecture scribes/notes, documents, and other materials, including videos, posted on these course web pages. These materials may not be used for commercial purposes without my consent.

Scribes from all lectures so far (as a single big file)

Lec Date Topic(s) Scribe Video
1 Jan 13 syllabus, logistics, ML problems: clustering, classification, regression, optimization in 1D, regression via minimization Scb1 Vid1
2 Jan 15 \(\nabla J = \mathbf{0} \Rightarrow D^TD \mathbf{w} = D^T \mathbf{y}\), optimization in graphs, using \(D = QR\), Tikhonov regularization, binary classification (see the QR sketch below) Scb2 Vid2
3 Jan 20 support vector machine (SVM), Taylor expansion, local optimality in 1D, gradient descent (python; see the GD sketch below), optimality in \(d\)-dim Scb3 Vid3
4 Jan 22 local optimality: second order conditions, convex (cvx) sets + functions, properties, \(f(g(\mathbf{w}))\) cvx when \(f\) cvx + \(g\) linear Scb4 Vid4
5 Jan 27 local min of cvx \(f \Rightarrow\) global min, first+second derivative cndn of convexity, strict convexity, computing \(\nabla J\), updating \(\alpha_t\) Scb5 Vid5
6 Jan 29 second order cndtns example, line search for \(\alpha_t\), additively separable loss \(J = \sum_i J_i\), stochastic gradient descent (SGD) Scb6 Vid6
7 Feb  3 optimization: general vs ML, hyperparameter tuning, cross validation, SGD for regression, hinge & \(L_2\)-SVM loss (for \(\pm 1\)) Scb7 Vid7
8 Feb  5 logistic regression loss \(J_{LR}\) smooth & strictly convex w/o regularizer, coordinate descent (CD), linear regression with CD Scb8 Vid8
9 Feb 10 block coordinate descent (BCD), k-means clustering as BCD, CD challenges, momentum-based learning \(\mathbf{v} \leftarrow \beta\mathbf{v} - \alpha \nabla J\) (see the GD sketch below) Scb9 Vid9
10 Feb 12 variants of GD: AdaGrad, RMSProp, AdaM (Adam), Newton method \(\mathbf{w} \leftarrow \mathbf{w} - H^{-1} \nabla J\), optimal for quadratic \(J\) (see the Newton sketch below) Scb10 Vid10
11 Feb 17 Newton may increase \(J\) for nonquadratic \(J\), line search for Newton method, Newton method in regression and in \(L_2\)-SVM Scb11 Vid11
12 Feb 19 Newton for logistic regression & SVM, probability/uncertainty in misclassification, summary of Newton for ML problems Scb12 Vid12
13 Feb 24 Newton challenges: ill-conditioned Hessian, saddle points, convergence problems, trust region method, trust radius \(\delta_t\) Scb13 Vid13
14 Feb 26 conjugate gradient method (CGM), \(H\)-orthogonality (conjugacy): \(\mathbf{q}_i^T H \mathbf{q}_j = 0\), projections of \(H\) using finite differences Scb14 Vid14
15 Mar  3 CGM for non-quadratic \(J\): non/linear CGM, quasi-Newton method, secant condition, BFGS method, preserves \(G_t \succeq 0\) Scb15 Vid15
16 Mar  5 L-BFGS: \(O(md)\) vs \(O(d^2)\) for BFGS, details for \(m=2\), subgradients for convex \(J\), regression with \(L_1\)-regularization Scb16 Vid16
17 Mar 10 subgradients with CD, proximal gradient method (PGM), primal and dual approaches for constrained optimization Scb17 Vid17
18 Mar 12 projected gradient method (PGM) for linear constraints, projection matrix, convex quadratic programs (CQPs) Scb18 Vid18
19 Mar 24 \(A\mathbf{w} \leq \mathbf{b}\) constraints, conditional gradient using LP, box constraints, \(A\) w/ LI rows, Lagrangian relaxation, weak duality Scb19 Vid19
20 Mar 26 minimax inequality, \(\min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \geq \mathbf{0}} \{H(\mathbf{w},\boldsymbol{\alpha}) = F(\mathbf{w}) + \sum_i \alpha_i f_i(\mathbf{w})\} \equiv \) primal (P), Slater's condition, KKT conditions Scb20 Vid20
21 Mar 31 hinge-loss SVM dual (D) using KKT conditions, using stationarity conditions to eliminate \(\mathbf{w}, \xi_i\), solution of (D) using GD Scb21 Vid21
22 Apr  2 kernel methods using dual H-SVM, dual of unconstrained problems, linear regression, norm-constrained optimization Scb22 Vid22
23 Apr  7 VC dimension, shattering, types of SVM kernels, computational graphs (CGs), local function at node, global function Scb23 Vid23
24 Apr   9 video: regression as a CG, neural networks (NNs), activation functions, training using CGs, node-to-node derivatives Scb24 Vid24
25 Apr 14 video: path aggregation lemma, dynamic programming (DP) for node-to-node derivatives, backpropagation algorithm Scb25 Vid25
26 Apr 16 video: loss function-to-weights derivatives using backpropagation (DP), CGs with vector variables, Jacobian products Scb26 Vid26
27 Apr 21 video: details for NNs: edge derivative, sensitivity recursion, and gradient w.r.t. edge weights; decoupled formulation Scb27 Vid27
28 Apr 23 video: NNs as vector ops, vector backpropagation, exponential decay of sensitivities (not with ReLU), skip connection Scb28 Vid28
29 Apr 28 video: transformer networks, attention weight/output, multi-head attention, Jacobian of softmax, residual connections Scb29 Vid29
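
QR sketch: a minimal NumPy sketch of the Lecture 2 formulas, solving the normal equations \(D^TD \mathbf{w} = D^T \mathbf{y}\) directly, via the factorization \(D = QR\), and with Tikhonov regularization. The data matrix D, targets y, and the regularization weight lam are hypothetical stand-ins, not taken from the scribes.

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.standard_normal((100, 5))                  # hypothetical data matrix
    y = D @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) \
        + 0.1 * rng.standard_normal(100)               # hypothetical noisy targets

    # Setting grad J = 0 for J(w) = 0.5 ||D w - y||^2 gives D^T D w = D^T y.
    w_normal = np.linalg.solve(D.T @ D, D.T @ y)

    # Better-conditioned route via D = QR: solve R w = Q^T y.
    Q, R = np.linalg.qr(D)
    w_qr = np.linalg.solve(R, Q.T @ y)

    # Tikhonov regularization: (D^T D + lam I) w = D^T y.
    lam = 0.1
    w_tik = np.linalg.solve(D.T @ D + lam * np.eye(5), D.T @ y)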
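
GD sketch: a minimal sketch of plain gradient descent (Lecture 3) and the momentum update \(\mathbf{v} \leftarrow \beta\mathbf{v} - \alpha \nabla J\) (Lecture 9) on the least-squares loss. The data, fixed step size \(\alpha\), \(\beta = 0.9\), iteration counts, and the companion update \(\mathbf{w} \leftarrow \mathbf{w} + \mathbf{v}\) are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.standard_normal((100, 5))                  # hypothetical data
    y = D @ np.ones(5)                                 # hypothetical targets

    def grad_J(w):
        # gradient of J(w) = 0.5 ||D w - y||^2 is D^T (D w - y)
        return D.T @ (D @ w - y)

    alpha = 1.0 / np.linalg.eigvalsh(D.T @ D).max()    # safe fixed step (assumed)

    # Plain gradient descent: w <- w - alpha grad J(w)
    w = np.zeros(5)
    for _ in range(500):
        w = w - alpha * grad_J(w)

    # Momentum: v <- beta v - alpha grad J(w), then (assumed companion) w <- w + v
    w_m, v, beta = np.zeros(5), np.zeros(5), 0.9
    for _ in range(500):
        v = beta * v - alpha * grad_J(w_m)
        w_m = w_m + v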
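
Newton sketch: a minimal sketch of the Newton update \(\mathbf{w} \leftarrow \mathbf{w} - H^{-1} \nabla J\) (Lecture 10) on a randomly generated convex quadratic, illustrating the "optimal for quadratic \(J\)" entry: a single Newton step lands on the exact minimizer. The matrix A, vector b, and starting point are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5.0 * np.eye(5)           # random positive-definite Hessian (assumed)
    b = rng.standard_normal(5)

    # For quadratic J(w) = 0.5 w^T A w - b^T w:  grad J = A w - b  and  H = A.
    w = rng.standard_normal(5)              # arbitrary starting point
    w = w - np.linalg.solve(A, A @ w - b)   # Newton step: solve H s = grad J, avoid H^{-1}

    assert np.allclose(A @ w, b)            # one Newton step reaches the exact minimizer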


Last modified: Wed Apr 29 12:27:44 PDT 2026