OPTIMIZATION Complete Reference

WORLD'S MOST COMPREHENSIVE
OPTIMIZATION GUIDE
Unconstrained • Constrained • Convexity • LP • Gradient Methods • KKT • Duality
Problem Formulation
General Form
$\min_{x \in \mathbb{R}^n} f(x)$
subject to: $g_i(x) \leq 0$, $h_j(x) = 0$
Terminology
$f(x)$: Objective function
$g_i(x) \leq 0$: Inequality constraints
$h_j(x) = 0$: Equality constraints
Feasible set: All $x$ satisfying constraints
Types of Optima
Global min: $f(x^*) \leq f(x)$ for all feasible $x$
Local min: $f(x^*) \leq f(x)$ for all $x$ in a neighborhood of $x^*$
Strict: Inequality is strict for $x \neq x^*$
Unconstrained: First Order
Gradient
$\nabla f(x) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$
First-Order Necessary Condition
If $x^*$ is local min: $\nabla f(x^*) = 0$
Points with $\nabla f = 0$ are stationary points
Could be min, max, or saddle point
Gradient Properties
Points in direction of steepest ascent
Perpendicular to level sets
$\|\nabla f\|$ = rate of maximum change
Unconstrained: Second Order
Hessian Matrix
$H_f(x) = \nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_i \partial x_j} \end{pmatrix}$
Second-Order Conditions
Condition Conclusion
$H \succ 0$ (pos. def.) Local minimum
$H \prec 0$ (neg. def.) Local maximum
$H$ indefinite Saddle point
$H \succeq 0$ (pos. semidef.) Inconclusive (higher-order test needed)
Checking Definiteness
All eigenvalues $> 0$ ⟹ pos. definite
All leading principal minors $> 0$
$x^T H x > 0$ for all $x \neq 0$
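The eigenvalue test above is easy to run numerically. A minimal NumPy sketch (the function $f(x,y) = x^2 - y^2$ is my own illustrative example; its origin $(0,0)$ is a stationary point):

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix, ascending

if np.all(eig > 0):
    kind = "local minimum"       # H positive definite
elif np.all(eig < 0):
    kind = "local maximum"       # H negative definite
elif np.any(eig > 0) and np.any(eig < 0):
    kind = "saddle point"        # H indefinite
else:
    kind = "inconclusive (semidefinite)"

print(kind)  # saddle point
```

`eigvalsh` is used (rather than `eigvals`) because the Hessian is symmetric; it is cheaper and returns real eigenvalues directly.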
Gradient Descent
Update Rule
$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
Step Size Selection
Fixed: $\alpha_k = \alpha$ (may diverge)
Exact line search: $\alpha_k = \arg\min_\alpha f(x_k - \alpha \nabla f(x_k))$
Backtracking: Armijo condition
Convergence
Linear rate for strongly convex
Sublinear for convex
Condition number affects speed
Rule of thumb: $\alpha = \frac{1}{L}$ where $L$ = Lipschitz constant of $\nabla f$
Newton's Method
Update Rule
$x_{k+1} = x_k - [H_f(x_k)]^{-1} \nabla f(x_k)$
Properties
Quadratic convergence near optimum
Expensive: requires Hessian inversion
May not converge if started far from optimum
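A minimal sketch of the update rule on a made-up smooth convex function, $f(x,y) = x^2 + y^2 + e^x$; in practice one solves the linear system $H d = \nabla f$ instead of inverting $H$:

```python
import numpy as np

# f(x, y) = x^2 + y^2 + e^x: gradient and Hessian in closed form
def grad(v):
    x, y = v
    return np.array([2*x + np.exp(x), 2*y])

def hess(v):
    x, y = v
    return np.array([[2 + np.exp(x), 0.0],
                     [0.0, 2.0]])

v = np.array([1.0, 1.0])
for _ in range(10):
    # Newton step: solve H d = grad(f) rather than forming H^{-1} explicitly
    d = np.linalg.solve(hess(v), grad(v))
    v = v - d

print(np.linalg.norm(grad(v)))  # near machine precision after a handful of steps
```

Solving the system via `np.linalg.solve` is both cheaper and numerically safer than `np.linalg.inv(H) @ g`.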
Quasi-Newton Methods
BFGS: Approximates $H^{-1}$ iteratively
L-BFGS: Limited memory version
SR1: Symmetric rank-1 update
Trade-off: Newton = fast but expensive, GD = cheap but slow.
Convexity
Convex Set
$\forall x, y \in C, \lambda \in [0,1]:$
$\lambda x + (1-\lambda)y \in C$
Convex Function
$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$
Epigraph is a convex set
First-order: $f(y) \geq f(x) + \nabla f(x)^T(y-x)$
Second-order: $H_f(x) \succeq 0$ everywhere
Why Convexity Matters
Local min = Global min
$\nabla f(x^*) = 0$ ⟹ $x^*$ is global min
Efficient algorithms exist
Key: Non-convex problems may have many local minima!
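The defining inequality can be sanity-checked numerically. A small sketch using $f(x) = e^x$ (the function, sample range, and tolerance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # e^x is convex on all of R

ok = True
for _ in range(1000):
    x, y = rng.uniform(-3, 3, size=2)
    lam = rng.uniform(0, 1)
    # Defining inequality: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
    if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y) + 1e-12:
        ok = False

print(ok)  # True: the chord always lies above the graph
```

A single violating triple $(x, y, \lambda)$ would prove non-convexity; passing this check is of course only evidence, not a proof.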
Strong Convexity
Definition
$f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2$
Equivalently: $H_f(x) \succeq \mu I$ for all $x$
Properties
Unique global minimum
GD converges linearly: $f(x_k) - f^* \leq (1-\frac{\mu}{L})^k(f(x_0) - f^*)$
Condition number: $\kappa = L/\mu$
Lipschitz Gradient
$\|\nabla f(x) - \nabla f(y)\| \leq L\|x-y\|$
Equivalently: $H_f(x) \preceq LI$
Lagrange Multipliers
Equality Constraints Only
$\min f(x)$ s.t. $h_j(x) = 0$
Lagrangian
$\mathcal{L}(x, \nu) = f(x) + \sum_j \nu_j h_j(x)$
Necessary Conditions
$\nabla_x \mathcal{L} = \nabla f(x) + \sum_j \nu_j \nabla h_j(x) = 0$
$h_j(x) = 0$ for all $j$
Geometric Interpretation
At optimum: $\nabla f$ is linear combo of $\nabla h_j$
$\nabla f$ perpendicular to constraint surface
$\nu_j$: Shadow price = rate of change of optimum w.r.t. constraint.
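For a quadratic objective with one linear equality constraint, the conditions above form a linear system. A worked example of my own: $\min x^2 + y^2$ s.t. $x + y = 1$, whose solution is $x = y = 1/2$, $\nu = -1$:

```python
import numpy as np

# min x^2 + y^2  s.t.  x + y = 1
# Stationarity: 2x + nu = 0, 2y + nu = 0; feasibility: x + y = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x, y, nu = np.linalg.solve(K, rhs)
print(x, y, nu)  # 0.5 0.5 -1.0
```

Geometrically, at $(1/2, 1/2)$ the gradient $\nabla f = (1, 1)$ is parallel to $\nabla h = (1, 1)$, exactly as the interpretation above says.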
KKT Conditions
Full Lagrangian
$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
KKT Conditions (Necessary)
1. Stationarity: $\nabla_x \mathcal{L} = 0$
2. Primal feasibility: $g_i(x) \leq 0$, $h_j(x) = 0$
3. Dual feasibility: $\lambda_i \geq 0$
4. Complementary slackness: $\lambda_i g_i(x) = 0$
Sufficiency
KKT + convexity ⟹ global optimum
Constraint qualification needed
Slackness: For each $i$, either $\lambda_i = 0$ or $g_i(x) = 0$ (or both); inactive constraints get zero multipliers.
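All four conditions can be checked mechanically at a candidate point. An illustrative example of my own: $\min (x_1-2)^2 + (x_2-1)^2$ s.t. $x_1 + x_2 \leq 1$, with candidate $x = (1, 0)$, $\lambda = 2$:

```python
import numpy as np

# min (x1-2)^2 + (x2-1)^2  s.t.  g(x) = x1 + x2 - 1 <= 0
x = np.array([1.0, 0.0])   # candidate optimum (constraint active)
lam = 2.0                  # candidate multiplier

grad_f = np.array([2*(x[0] - 2), 2*(x[1] - 1)])
grad_g = np.array([1.0, 1.0])
g = x[0] + x[1] - 1

stationarity   = np.allclose(grad_f + lam * grad_g, 0)  # condition 1
primal_feasible = g <= 1e-12                            # condition 2
dual_feasible   = lam >= 0                              # condition 3
comp_slack      = abs(lam * g) < 1e-12                  # condition 4

print(stationarity and primal_feasible and dual_feasible and comp_slack)  # True
```

Since the objective and constraint are convex, KKT satisfaction here certifies a global optimum (per the sufficiency note above).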
Duality
Lagrange Dual Function
$g(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$
Dual Problem
$\max_{\lambda \geq 0, \nu} g(\lambda, \nu)$
Duality Gap
Weak duality: $d^* \leq p^*$ (always)
Strong duality: $d^* = p^*$ (under conditions)
Slater's Condition
If $\exists x$ with $g_i(x) < 0$ for all $i$ and $h_j(x) = 0$ (strictly feasible), strong duality holds for convex problems.
Dual gives lower bound: $g(\lambda, \nu) \leq f(x^*)$ for any $\lambda \geq 0$.
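A worked example of my own where the dual can be computed in closed form: primal $\min x^2$ s.t. $1 - x \leq 0$ has $p^* = 1$, and $g(\lambda) = \inf_x [x^2 + \lambda(1-x)] = \lambda - \lambda^2/4$:

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0   (optimum p* = 1 at x = 1)
# Dual function: g(lam) = inf_x [x^2 + lam*(1 - x)] = lam - lam^2/4
lams = np.linspace(0, 5, 100001)
g = lams - lams**2 / 4

d_star = g.max()   # maximized at lam = 2
p_star = 1.0

print(d_star <= p_star + 1e-12)     # weak duality: always True
print(abs(d_star - p_star) < 1e-8)  # strong duality (Slater: x = 2 is strictly feasible)
```

Every grid value of $g(\lambda)$ is a valid lower bound on $p^*$, which is exactly the lower-bound property stated above.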
Linear Programming
Standard Form
$\min c^T x$ s.t. $Ax = b$, $x \geq 0$
Inequality Form
$\min c^T x$ s.t. $Ax \leq b$
Key Properties
Feasible region is a polyhedron
Optimum attained at a vertex (if a finite optimum exists)
Finite number of vertices
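Since the optimum sits at a vertex, a tiny 2D LP can be solved by brute-force vertex enumeration (intersect constraint pairs, keep feasible points). The example LP is my own; this is for intuition only, not a practical method:

```python
import numpy as np
from itertools import combinations

# min c^T x over the triangle {x1 + x2 <= 1, x1 >= 0, x2 >= 0}, written as A x <= b
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 0.0])

best_x, best_val = None, np.inf
for i, j in combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-12:
        continue                       # parallel constraints: no intersection
    v = np.linalg.solve(M, b[[i, j]])  # candidate vertex
    if np.all(A @ v <= b + 1e-9):      # keep only feasible vertices
        if c @ v < best_val:
            best_x, best_val = v, c @ v

print(best_x, best_val)  # optimum at vertex (0, 1) with value -2
```

The three feasible vertices here are $(0,0)$, $(1,0)$, $(0,1)$; the simplex method walks between such vertices instead of enumerating them all.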
LP Dual
Primal: $\min c^T x$ s.t. $Ax \geq b$, $x \geq 0$
Dual: $\max b^T y$ s.t. $A^T y \leq c$, $y \geq 0$
Strong duality: Holds for every LP with a finite optimum (no gap).
Simplex Method
Basic Idea
Start at a vertex
Move to adjacent vertex with better objective
Repeat until optimal
Terminology
Basic variables: Solved from $Ax = b$; may be nonzero
Nonbasic variables: Fixed at zero
Pivot: Exchange basic/nonbasic
Complexity
Worst case: Exponential
Average case: Polynomial
Works well in practice
Interior Point Methods
Barrier Method
$\min f(x) - \frac{1}{t}\sum_i \log(-g_i(x))$
Process
Start with small $t$, solve barrier problem
Increase $t$, warm-start from previous
As $t \to \infty$, approach true solution
Properties
Polynomial time complexity
Path follows "central path"
Duality gap: $m/t$ ($m$ = # constraints)
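A one-constraint sketch (my own example): for $\min x$ s.t. $1 - x \leq 0$, the barrier problem is $B_t(x) = x - \frac{1}{t}\log(x-1)$, whose minimizer is $x_t = 1 + 1/t$, so the suboptimality equals the $m/t$ gap with $m = 1$. Newton's method on $B_t$:

```python
import numpy as np

# min x  s.t.  1 - x <= 0  (p* = 1).  Barrier: B_t(x) = x - (1/t)*log(x - 1)
def newton_barrier(t, iters=30):
    x = 1 + 0.5 / t                    # strictly feasible start
    for _ in range(iters):
        g = 1 - 1 / (t * (x - 1))      # B_t'(x)
        h = 1 / (t * (x - 1) ** 2)     # B_t''(x) > 0
        x -= g / h                     # Newton step
    return x

for t in [1.0, 10.0, 100.0]:
    x_t = newton_barrier(t)
    print(t, x_t - 1)  # suboptimality f(x_t) - p* = 1/t, i.e. the m/t gap with m = 1
```

As $t$ grows, $x_t$ traces the central path toward the constrained optimum $x^* = 1$.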
Quadratic Programming
Standard Form
$\min \frac{1}{2}x^T Q x + c^T x$
s.t. $Ax = b$, $Gx \leq h$
Properties
$Q \succeq 0$: Convex QP (efficient)
$Q \succ 0$: Strictly convex (unique solution)
$Q$ indefinite: Non-convex (NP-hard)
Applications
Least squares with constraints
Support Vector Machines
Portfolio optimization
Constrained Optimization Algs
Penalty Method
$\min f(x) + \mu \sum_i \max(0, g_i(x))^2$
Increase $\mu$ as iterations progress
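A minimal sketch of the penalty loop on a made-up one-dimensional problem, $\min (x-2)^2$ s.t. $x \leq 1$ (so $x^* = 1$); the penalized minimizer $1 + \frac{1}{1+\mu}$ approaches the optimum from outside the feasible set as $\mu$ grows:

```python
# min (x-2)^2  s.t.  x - 1 <= 0  (optimum x* = 1)
def penalized_min(mu, x0):
    x = x0
    alpha = 0.4 / (1 + mu)             # step size below 2/L; here L = 2 + 2*mu
    for _ in range(2000):
        # gradient of the penalized objective (x-2)^2 + mu*max(0, x-1)^2
        grad = 2 * (x - 2) + 2 * mu * max(0.0, x - 1)
        x -= alpha * grad
    return x

x = 0.0
for mu in [1.0, 10.0, 100.0, 1000.0]:
    x = penalized_min(mu, x)           # warm-start from the previous solution
    print(mu, x)  # approaches 1 from above as mu grows
```

Warm-starting each solve at the previous solution is the standard trick, since the penalized problems change only gradually.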
Augmented Lagrangian
$\mathcal{L}_\rho = f + \sum \nu_j h_j + \frac{\rho}{2}\sum h_j^2$
ADMM: Alternating Direction Method
Sequential QP (SQP)
Approximate problem by QP at each step
Superlinear convergence
Stochastic Gradient Descent
For $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$
$x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k)$
Properties
$i_k$ sampled randomly
Cheap per iteration (one sample)
Noisy but unbiased gradient
Variants
Mini-batch: Average over $B$ samples
Momentum: $v_k = \beta v_{k-1} + \nabla f_{i_k}(x_k)$, then $x_{k+1} = x_k - \alpha v_k$
Adam: Adaptive learning rates
ML standard: SGD + momentum + learning rate schedule.
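A mini-batch SGD sketch on least squares; the problem sizes, step size, and batch size are illustrative choices, and the data is noiseless so SGD can recover the true parameters exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares: f(x) = (1/n) * sum_i (a_i^T x - b_i)^2
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true                    # noiseless targets

x = np.zeros(d)
alpha, B = 0.01, 32
for k in range(2000):
    idx = rng.integers(0, n, size=B)               # sample a mini-batch
    g = 2 * A[idx].T @ (A[idx] @ x - b[idx]) / B   # unbiased gradient estimate
    x -= alpha * g

print(np.linalg.norm(x - x_true))  # small: SGD recovers x_true on noiseless data
```

Each iteration touches only $B = 32$ of the $n = 1000$ samples, which is the cheap-per-iteration property that makes SGD the default at ML scale.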
Common Pitfalls
Local minima: Non-convex problems may have many!
Saddle points: $\nabla f = 0$ but not min/max.
Ill-conditioning: High $\kappa = L/\mu$ → slow convergence.
Step size: Too large → diverge. Too small → slow.
Constraint qualification: KKT may fail without LICQ.