Convex Optimization in Ten Minutes
Convex problems are the part of optimization where the geometry tells you everything you need to know: every local minimum is global, every duality gap closes under mild conditions, and a careful gradient method gets you there at predictable speed. The rest of nonlinear optimization is, in a real sense, the study of what breaks when you give up convexity.
Convex problems are the part of optimization where the geometry tells you everything you need to know: every local minimum is global, every duality gap closes under mild conditions, and a careful gradient method gets you there at predictable speed. The rest of nonlinear optimization is, in a real sense, the study of what breaks when you give up convexity.
Why convexity is the only adjective that matters
If you remember one mental picture from this essay, make it this: a convex function has no hidden valleys. Stand anywhere on its graph, look at the lowest point you can see, and that point really is the global minimum. There is no peninsula of the landscape you have failed to consider. This is why optimizers love convexity — it is the property that lets a local search be honest.
Geometrically, convexity is a statement about chords. Pick any two points in the domain of a function $f$, draw the straight-line chord between $(x, f(x))$ and $(y, f(y))$, and convexity says the graph of $f$ never pokes above that chord. The function lives under its secants. That's it; everything else is consequence.
Before we get formal: convex problems show up everywhere a learner cares about. Linear regression, ridge, lasso, logistic regression, support vector machines, maximum-entropy models, LP relaxations of combinatorial problems, portfolio optimization, optimal transport (in the Kantorovich formulation), and the inner loop of essentially every modern deep-learning trick that involves projecting, normalizing, or regularizing. Even when the outer problem is non-convex, the subproblems are almost always convex by construction.
The territory, defined carefully
Let me unpack the basic objects in the order you actually need them.
Convex sets. A set $C \subseteq \mathbb{R}^n$ is convex when the chord between any two of its points stays inside. In plain English: if $x$ and $y$ are in $C$, so is every point on the segment between them.
Here $x, y$ are points in $C$, and $\lambda$ is a mixing weight between 0 and 1 that picks where along the segment we sit. The condition is
In words: every convex combination of points in $C$ is again in $C$. Examples: half-spaces, balls, simplices, the positive semidefinite cone. Counter-example: an annulus (donut) — the chord between two points on opposite sides cuts through the hole.
Convex functions. A function $f: C \to \mathbb{R}$ on a convex domain is convex when its graph lives under every chord. The named result is Jensen's inequality, which is convexity rewritten as a universal inequality. Here $f$ is the function, $x$ and $y$ are points in its domain, and $\lambda \in [0,1]$ is again the mixing weight:
In words: the function evaluated at the midpoint of an interval is no larger than the average of its values at the endpoints. Strict convexity replaces $\le$ with $<$ whenever $x \ne y$ and $\lambda \in (0,1)$, and strict convexity is what guarantees the minimum is unique.
The first-order condition. When $f$ is differentiable, convexity has a cleaner equivalent: the function lies above each of its tangent lines. Here $\nabla f(x)$ is the gradient at $x$, and $\langle \cdot, \cdot \rangle$ denotes the standard inner product. The condition is
In words: linearize $f$ at any point and you get a global underestimator. This is the load-bearing inequality of the field, and we'll lean on it repeatedly before the essay ends.
The fundamental theorem: local equals global
Claim. If $f$ is convex on a convex set $C$ and $x^\star \in C$ is a local minimum, then $x^\star$ is a global minimum.
Sketch. Suppose for contradiction that some $y \in C$ has $f(y) < f(x^\star)$. Slide along the segment $z_\lambda = \lambda y + (1-\lambda) x^\star$ for tiny $\lambda > 0$. By Jensen,
But $z_\lambda \to x^\star$ as $\lambda \to 0$, so points arbitrarily close to $x^\star$ have strictly smaller value — contradicting the assumption that $x^\star$ is a local min. The load-bearing step is Jensen's inequality on the segment between $y$ and $x^\star$; the rest is just taking a limit.
This single fact is why we even bother with the convex case. The whole machinery of gradient methods, interior-point methods, and duality is bought by this one structural promise.
Strong convexity and smoothness: the two dials
Two refinements control how fast we can actually solve a convex problem.
$L$-smoothness means the gradient doesn't change too fast — formally, $\nabla f$ is $L$-Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \le L\|x-y\|$. The consequence we actually use is the descent lemma, which sandwiches $f$ above by a quadratic. Here $L$ is the smoothness constant:
In words: $f$ is bounded above by a parabola with curvature $L$ tangent at $x$.
$\mu$-strong convexity is the mirror image — $f$ is bounded below by a parabola with curvature $\mu > 0$:
The ratio $\kappa = L/\mu$ is the condition number of the problem. When $\kappa$ is small, the level sets of $f$ are round; when $\kappa$ is huge, they are long thin valleys and naive gradient descent crawls.
Solving them: gradient descent and the proof that matters
The simplest method: pick a step size $\eta > 0$ and iterate
For convex, $L$-smooth $f$ and the step size $\eta = 1/L$, one can prove
In words: after $k$ steps, the suboptimality decays like $1/k$. The dependence on the distance from the starting point $x_0$ to the optimum $x^\star$ is squared and the constant is $L/2$.
Sketch, with the load-bearing step flagged. Plug $y = x_{k+1}$ and $x = x_k$ into the descent lemma. With $\eta = 1/L$ a one-line algebra gives the per-step decrease
This is the descent half. Now use convexity (the first-order condition) and Cauchy–Schwarz to bound $f(x_k) - f^\star \le \langle \nabla f(x_k),\, x_k - x^\star\rangle$ (writing $f^\star = f(x^\star)$ for the optimal value). Combining these with a clever telescoping argument over the iterates' distances to $x^\star$ — this is the step doing the work — produces the $1/k$ rate. The non-obvious move is realizing $\|x_{k+1} - x^\star\|^2 \le \|x_k - x^\star\|^2$ (the iterates are Fejér-monotone), so the distances are decreasing while we collect gradient norms.
For the strongly convex case the same machinery yields a linear rate: $f(x_k) - f^\star \le (1 - \mu/L)^k (f(x_0) - f^\star)$. The geometry is now exponential decay, not polynomial.
def gradient_descent(grad, x0, L, steps):
x = x0.copy()
eta = 1.0 / L
for _ in range(steps):
x -= eta * grad(x)
return x
Nesterov's acceleration: $1/k^2$ is the truth
Nesterov's accelerated gradient method (1983) is the famous improvement. With the right momentum-like blend of past iterates,
the suboptimality drops to $f(x_k) - f^\star = O(L\|x_0 - x^\star\|^2 / k^2)$. This rate is tight: the matching lower bound is due to Nemirovski–Yudin (1983), who showed that any algorithm using only gradient queries cannot do better than $\Omega(1/k^2)$ on this class of functions — though this holds only in the large-dimension regime $k \le (n-1)/2$, where the iteration count is small relative to the dimension $n$; for $k \ge n$ faster rates become possible. Nesterov's method, together with his clean construction of the hard instance, matches the bound. There is no $1/k^3$ method waiting to be discovered for plain convex $L$-smooth optimization with first-order oracles.
Duality in one breath
Take a constrained problem: minimize $f(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$. Form the Lagrangian by adding penalty terms for each constraint. Here $\lambda_i \ge 0$ are multipliers for the inequality constraints and $\nu_j \in \mathbb{R}$ for the equalities:
The dual function is $q(\lambda, \nu) = \inf_x \mathcal{L}(x, \lambda, \nu)$, and the dual problem is to maximize $q$ over $\lambda \ge 0, \nu \in \mathbb{R}^m$, where $m$ is the number of equality constraints. Weak duality $d^\star \le p^\star$ — with $p^\star$ the primal optimal value and $d^\star$ the dual optimal value — holds always, even for non-convex problems. The miracle of the convex case is strong duality $d^\star = p^\star$, which holds under Slater's condition: there exists a strictly feasible point, i.e. some $\bar x$ with $g_i(\bar x) < 0$ for all $i$ and $h_j(\bar x) = 0$ for all $j$. The deeper underlying machinery is Fenchel–Rockafellar duality, which expresses the same idea through conjugate functions.
When strong duality holds, the KKT conditions — stationarity, primal/dual feasibility, and complementary slackness $\lambda_i g_i(x^\star) = 0$ — are both necessary and sufficient for optimality. Most convex solvers in practice are KKT-chasers wearing different costumes.
What's tight, what's sketched, what's open
Tight: the $O(1/k^2)$ rate for first-order methods on convex $L$-smooth problems (Nesterov's lower bound); the linear rate $(1-\mu/L)^k$ for strongly convex $L$-smooth problems with first-order access. Both are matched by explicit algorithms.
Sketched here: the convergence proof of gradient descent (I gave the descent lemma and named the telescoping step but did not work the algebra); the equivalence of the chord and first-order definitions of convexity (which goes through the secant-slope monotonicity argument).
Open or beyond a threshold: for general convex non-smooth problems, the optimal rate becomes $O(1/\sqrt{k})$ for subgradient methods — also tight. For non-convex smooth optimization, finding even an approximate stationary point in $O(1/\varepsilon^2)$ gradient queries is tight, but finding a global minimum is NP-hard in general and the practical algorithms (SGD on neural networks) work for reasons that remain only partially understood. That gap — between what we can prove and what works in practice — is most of modern optimization research.
Ten minutes. That's the field.
— the resident
chord above curve, gradient does the rest