OPTIMIZATION Complete Reference

WORLD'S MOST COMPREHENSIVE
OPTIMIZATION GUIDE
Unconstrained • Constrained • Convexity • LP • Gradient Methods • KKT • Duality
Problem Formulation
General Form
$\min_{x \in \mathbb{R}^n} f(x)$
subject to: $g_i(x) \leq 0$, $h_j(x) = 0$
Terminology
$f(x)$: Objective function
$g_i(x) \leq 0$: Inequality constraints
$h_j(x) = 0$: Equality constraints
Feasible set: All $x$ satisfying constraints
Types of Optima
Global min: $f(x^*) \leq f(x)$ for all feasible $x$
Local min: $f(x^*) \leq f(x)$ for all $x$ in a neighborhood of $x^*$
Strict: Inequality is strict for $x \neq x^*$
Unconstrained: First Order
Gradient
$\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$
First-Order Necessary Condition
If $x^*$ is local min: $\nabla f(x^*) = 0$
Points with $\nabla f = 0$ are stationary points
Could be min, max, or saddle point
Gradient Properties
Points in direction of steepest ascent
Perpendicular to level sets
$\|\nabla f\|$ = rate of maximum change
Unconstrained: Second Order
Hessian Matrix
$H_f(x) = \nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_i \partial x_j} \end{pmatrix}$
Second-Order Conditions
Condition Conclusion
$H \succ 0$ (pos. def.) Local minimum
$H \prec 0$ (neg. def.) Local maximum
$H$ indefinite Saddle point
$H \succeq 0$ (pos. semidef.) Inconclusive (higher-order test needed)
Checking Definiteness
All eigenvalues $> 0$ ⟹ pos. definite
All leading principal minors $> 0$
$x^T H x > 0$ for all $x \neq 0$
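The eigenvalue test above is easy to run numerically. A minimal NumPy sketch (the function $f(x,y) = x^2 - y^2$ is my own illustrative example; its origin $(0,0)$ is a stationary point):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix, ascending

if np.all(eig > 0):
    kind = "local minimum"       # H positive definite
elif np.all(eig < 0):
    kind = "local maximum"       # H negative definite
elif np.any(eig > 0) and np.any(eig < 0):
    kind = "saddle point"        # H indefinite
else:
    kind = "inconclusive (semidefinite)"

print(kind)  # saddle point
```

`eigvalsh` is used (rather than `eigvals`) because the Hessian is symmetric; it is cheaper and returns real eigenvalues directly.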
Gradient Descent
Update Rule
$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
Step Size Selection
Fixed: $\alpha_k = \alpha$ (may diverge)
Exact line search: $\alpha_k = \arg\min_\alpha f(x_k - \alpha \nabla f(x_k))$
Backtracking: Armijo condition
Convergence
Linear rate for strongly convex
Sublinear for convex
Condition number affects speed
Rule of thumb: $\alpha = \frac{1}{L}$ where $L$ = Lipschitz constant of $\nabla f$
Newton's Method
Update Rule
$x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k)$
Properties
Quadratic convergence near optimum
Expensive: requires Hessian inversion
May not converge if started far from optimum
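A minimal sketch of the update rule on a made-up smooth convex function, $f(x,y) = x^2 + y^2 + e^x$; in practice one solves the linear system $H d = \nabla f$ instead of inverting $H$:

```python
import numpy as np

# f(x, y) = x^2 + y^2 + e^x: gradient and Hessian in closed form
def grad(v):
    x, y = v
    return np.array([2*x + np.exp(x), 2*y])

def hess(v):
    x, y = v
    return np.array([[2 + np.exp(x), 0.0],
                     [0.0, 2.0]])

v = np.array([1.0, 1.0])
for _ in range(10):
    # Newton step: solve H d = grad(f) rather than forming H^{-1} explicitly
    d = np.linalg.solve(hess(v), grad(v))
    v = v - d

print(np.linalg.norm(grad(v)))  # near machine precision after a handful of steps
```

Solving the system via `np.linalg.solve` is both cheaper and numerically safer than `np.linalg.inv(H) @ g`.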
Quasi-Newton Methods
BFGS: Approximates $H^{-1}$ iteratively
L-BFGS: Limited memory version
SR1: Symmetric rank-1 update
Trade-off: Newton = fast but expensive, GD = cheap but slow.
Convexity
Convex Set
$\forall x, y \in C, \lambda \in [0,1]:$
$\lambda x + (1-\lambda)y \in C$
Convex Function
$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$
Epigraph is a convex set
First-order: $f(y) \geq f(x) + \nabla f(x)^T(y-x)$
Second-order: $H_f(x) \succeq 0$ everywhere
Why Convexity Matters
Local min = Global min
$\nabla f(x^*) = 0$ ⟹ $x^*$ is global min
Efficient algorithms exist
Key: Non-convex problems may have many local minima!
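The defining inequality can be sanity-checked numerically. A small sketch using $f(x) = e^x$ (the function, sample range, and tolerance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # e^x is convex on all of R

ok = True
for _ in range(1000):
    x, y = rng.uniform(-3, 3, size=2)
    lam = rng.uniform(0, 1)
    # Defining inequality: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y) + 1e-12:
        ok = False

print(ok)  # True: the chord always lies above the graph
```

A single violating triple $(x, y, \lambda)$ would prove non-convexity; passing this check is of course only evidence, not a proof.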
Strong Convexity
Definition
$f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2$
Equivalently: $H_f(x) \succeq \mu I$ for all $x$
Properties
Unique global minimum
GD converges linearly: $f(x_k) - f^* \leq (1-\frac{\mu}{L})^k(f(x_0) - f^*)$
Condition number: $\kappa = L/\mu$
Lipschitz Gradient
$\|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|$
Equivalently: $H_f(x) \preceq LI$
Lagrange Multipliers
Equality Constraints Only
$\min f(x)$ s.t. $h_j(x) = 0$
Lagrangian
$\mathcal{L}(x, \nu) = f(x) + \sum_j \nu_j h_j(x)$
Necessary Conditions
$\nabla_x \mathcal{L} = \nabla f(x) + \sum_j \nu_j \nabla h_j(x) = 0$
$h_j(x) = 0$ for all $j$
Geometric Interpretation
At optimum: $\nabla f$ is linear combo of $\nabla h_j$
$\nabla f$ perpendicular to constraint surface
$\nu_j$: Shadow price = rate of change of optimum w.r.t. constraint.
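For a quadratic objective with one linear equality constraint, the conditions above form a linear system. A worked example of my own: $\min x^2 + y^2$ s.t. $x + y = 1$, whose solution is $x = y = 1/2$, $\nu = -1$:

```python
import numpy as np

# min x^2 + y^2  s.t.  x + y = 1
# Stationarity: 2x + nu = 0, 2y + nu = 0; feasibility: x + y = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x, y, nu = np.linalg.solve(K, rhs)
print(x, y, nu)  # 0.5 0.5 -1.0
```

Geometrically, at $(1/2, 1/2)$ the gradient $\nabla f = (1, 1)$ is parallel to $\nabla h = (1, 1)$, exactly as the interpretation above says.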
KKT Conditions
Full Lagrangian
$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
KKT Conditions (Necessary)
1. Stationarity: $\nabla_x \mathcal{L} = 0$
2. Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
3. Dual feasibility: $\lambda_i \geq 0$
4. Complementary slackness: $\lambda_i g_i(x) = 0$
Sufficiency
KKT + convexity ⟹ global optimum
Constraint qualification needed
Slackness: For each $i$, either $\lambda_i = 0$ or $g_i(x) = 0$ (or both); inactive constraints get zero multipliers.
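All four conditions can be checked mechanically at a candidate point. An illustrative example of my own: $\min (x_1-2)^2 + (x_2-1)^2$ s.t. $x_1 + x_2 \leq 1$, with candidate $x = (1, 0)$, $\lambda = 2$:

```python
import numpy as np

# min (x1-2)^2 + (x2-1)^2  s.t.  g(x) = x1 + x2 - 1 <= 0
x = np.array([1.0, 0.0])   # candidate optimum (constraint active)
lam = 2.0                  # candidate multiplier

grad_f = np.array([2*(x[0] - 2), 2*(x[1] - 1)])
grad_g = np.array([1.0, 1.0])
g = x[0] + x[1] - 1

stationarity   = np.allclose(grad_f + lam * grad_g, 0)  # condition 1
primal_feasible = g <= 1e-12                            # condition 2
dual_feasible   = lam >= 0                              # condition 3
comp_slack      = abs(lam * g) < 1e-12                  # condition 4

print(stationarity and primal_feasible and dual_feasible and comp_slack)  # True
```

Since the objective and constraint are convex, KKT satisfaction here certifies a global optimum (per the sufficiency note above).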
Duality
Lagrange Dual Function
$g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$
Dual Problem
$\max_{\lambda \geq 0, \nu} g(\lambda, \nu)$
Duality Gap
Weak duality: $d^* \leq p^*$ (always)
Strong duality: $d^* = p^*$ (under conditions)
Slater's Condition
If $\exists x$ with $g_i(x) < 0$ for all $i$ and $h_j(x) = 0$ (strictly feasible), strong duality holds for convex problems.
Dual gives lower bound: $g(\lambda, \nu) \leq f(x^*)$ for any $\lambda \geq 0$.
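A worked example of my own where the dual can be computed in closed form: primal $\min x^2$ s.t. $1 - x \leq 0$ has $p^* = 1$, and $g(\lambda) = \inf_x [x^2 + \lambda(1-x)] = \lambda - \lambda^2/4$:

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0   (optimum p* = 1 at x = 1)
# Dual function: g(lam) = inf_x [x^2 + lam*(1 - x)] = lam - lam^2/4
lams = np.linspace(0, 5, 100001)
g = lams - lams**2 / 4

d_star = g.max()   # maximized at lam = 2
p_star = 1.0

print(d_star <= p_star + 1e-12)     # weak duality: always True
print(abs(d_star - p_star) < 1e-8)  # strong duality (Slater: x = 2 is strictly feasible)
```

Every grid value of $g(\lambda)$ is a valid lower bound on $p^*$, which is exactly the lower-bound property stated above.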
Linear Programming
Standard Form
$\min c^T x$ s.t. $Ax = b$, $x \geq 0$
Inequality Form
$\min c^T x$ s.t. $Ax \leq b$
Key Properties
Feasible region is a polyhedron
Optimum attained at a vertex (if a finite optimum exists)
Finite number of vertices
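Since the optimum sits at a vertex, a tiny 2D LP can be solved by brute-force vertex enumeration (intersect constraint pairs, keep feasible points). The example LP is my own; this is for intuition only, not a practical method:

```python
import numpy as np
from itertools import combinations

# min c^T x over the triangle {x1 + x2 <= 1, x1 >= 0, x2 >= 0}, written as A x <= b
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])

best_x, best_val = None, np.inf
for i, j in combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                       # parallel constraints: no intersection
    v = np.linalg.solve(M, b[[i, j]])  # candidate vertex
    if np.all(A @ v <= b + 1e-9):      # keep only feasible vertices
        if c @ v < best_val:
            best_x, best_val = v, c @ v

print(best_x, best_val)  # optimum at vertex (0, 1) with value -2
```

The three feasible vertices here are $(0,0)$, $(1,0)$, $(0,1)$; the simplex method walks between such vertices instead of enumerating them all.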
LP Dual
Primal: $\min c^T x$ s.t. $Ax \geq b$, $x \geq 0$
Dual: $\max b^T y$ s.t. $A^T y \leq c$, $y \geq 0$
Strong duality: Holds for every LP with a finite optimum (no gap).
Simplex Method
Basic Idea
Start at a vertex
Move to adjacent vertex with better objective
Repeat until optimal
Terminology
Basic variables: Solved from $Ax = b$; may be nonzero
Nonbasic variables: Fixed at zero
Pivot: Exchange basic/nonbasic
Complexity
Worst case: Exponential
Average case: Polynomial
Works well in practice
Interior Point Methods
Barrier Method
$\min f(x) - \frac{1}{t}\sum_i \log(-g_i(x))$
Process
Start with small $t$, solve barrier problem
Increase $t$, warm-start from previous
As $t \to \infty$, approach true solution
Properties
Polynomial time complexity
Path follows "central path"
Duality gap: $m/t$ ($m$ = # constraints)
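A one-constraint sketch (my own example): for $\min x$ s.t. $1 - x \leq 0$, the barrier problem is $B_t(x) = x - \frac{1}{t}\log(x-1)$, whose minimizer is $x_t = 1 + 1/t$, so the suboptimality equals the $m/t$ gap with $m = 1$. Newton's method on $B_t$:

```python
import numpy as np

# min x  s.t.  1 - x <= 0  (p* = 1).  Barrier: B_t(x) = x - (1/t)*log(x - 1)
def newton_barrier(t, iters=30):
    x = 1 + 0.5 / t                    # strictly feasible start
    for _ in range(iters):
        g = 1 - 1 / (t * (x - 1))      # B_t'(x)
        h = 1 / (t * (x - 1) ** 2)     # B_t''(x) > 0
        x -= g / h                     # Newton step
    return x

for t in [1.0, 10.0, 100.0]:
    x_t = newton_barrier(t)
    print(t, x_t - 1)  # suboptimality f(x_t) - p* = 1/t, i.e. the m/t gap with m = 1
```

As $t$ grows, $x_t$ traces the central path toward the constrained optimum $x^* = 1$.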
Quadratic Programming
Standard Form
$\min \frac{1}{2}x^T Q x + c^T x$
s.t. $Ax = b$, $Gx \leq h$
Properties
$Q \succeq 0$: Convex QP (efficient)
$Q \succ 0$: Strictly convex (unique solution)
$Q$ indefinite: Non-convex (NP-hard)
Applications
Least squares with constraints
Support Vector Machines
Portfolio optimization
Constrained Optimization Algs
Penalty Method
$\min f(x) + \mu \sum_i \max(0, g_i(x))^2$
Increase $\mu$ as iterations progress
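A minimal sketch of the penalty loop on a made-up one-dimensional problem, $\min (x-2)^2$ s.t. $x \leq 1$ (so $x^* = 1$); the penalized minimizer $1 + \frac{1}{1+\mu}$ approaches the optimum from outside the feasible set as $\mu$ grows:

```python
# min (x-2)^2  s.t.  x - 1 <= 0  (optimum x* = 1)
def penalized_min(mu, x0):
    x = x0
    alpha = 0.4 / (1 + mu)             # step size below 2/L; here L = 2 + 2*mu
    for _ in range(2000):
        # gradient of the penalized objective (x-2)^2 + mu*max(0, x-1)^2
        grad = 2 * (x - 2) + 2 * mu * max(0.0, x - 1)
        x -= alpha * grad
    return x

x = 0.0
for mu in [1.0, 10.0, 100.0, 1000.0]:
    x = penalized_min(mu, x)           # warm-start from the previous solution
    print(mu, x)  # approaches 1 from above as mu grows
```

Warm-starting each solve at the previous solution is the standard trick, since the penalized problems change only gradually.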
Augmented Lagrangian
$\mathcal{L}_\rho = f + \sum \nu_j h_j + \frac{\rho}{2}\sum h_j^2$
ADMM: Alternating Direction Method
Sequential QP (SQP)
Approximate problem by QP at each step
Superlinear convergence
Stochastic Gradient Descent
For $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$
$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k)$
Properties
$i_k$ sampled randomly
Cheap per iteration (one sample)
Noisy but unbiased gradient
Variants
Mini-batch: Average over $B$ samples
Momentum: $v_k = \beta v_{k-1} + \nabla f_{i_k}(x_k)$, then $x_{k+1} = x_k - \alpha v_k$
Adam: Adaptive learning rates
ML standard: SGD + momentum + learning rate schedule.
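A mini-batch SGD sketch on least squares; the problem sizes, step size, and batch size are illustrative choices, and the data is noiseless so SGD can recover the true parameters exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares: f(x) = (1/n) * sum_i (a_i^T x - b_i)^2
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                    # noiseless targets

x = np.zeros(d)
alpha, B = 0.01, 32
for k in range(2000):
    idx = rng.integers(0, n, size=B)               # sample a mini-batch
    g = 2 * A[idx].T @ (A[idx] @ x - b[idx]) / B   # unbiased gradient estimate
    x -= alpha * g

print(np.linalg.norm(x - x_true))  # small: SGD recovers x_true on noiseless data
```

Each iteration touches only $B = 32$ of the $n = 1000$ samples, which is the cheap-per-iteration property that makes SGD the default at ML scale.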
Common Pitfalls
Local minima: Non-convex problems may have many!
Saddle points: $\nabla f = 0$ but not min/max.
Ill-conditioning: High $\kappa = L/\mu$ → slow convergence.
Step size: Too large → diverge. Too small → slow.
Constraint qualification: KKT may fail without LICQ.