From linear programming and flat constraint spaces to curved manifolds and the way life actually solves problems
Every optimisation problem is, at heart, a question: given that I can't have everything, what is the best I can do? Linear programming is the oldest formal answer — and still, remarkably, one of the most powerful.
Imagine you run a small bakery. You make two things: bread and cake. Each loaf of bread needs 200g of flour and 1 egg. Each cake needs 300g of flour and 3 eggs. You have 3 kilograms of flour and 24 eggs for the day. Bread sells for $4, cake for $9. How many of each should you make to earn the most money?
This is a linear programming problem. "Linear" because every relationship in it — the flour used, the eggs used, the money earned — is a straight-line relationship. Double the bread, double the flour used. There are no curves, no powers, no compounding. Just flat, proportional relationships.
"Programming" here is an old word meaning planning — it has nothing to do with computers. The term was coined in the 1940s when "programming" meant drawing up a schedule or a resource plan, like a military operation.
Every linear programming problem has exactly three components, and once you see them you will recognise them everywhere:
The things you control. In the bakery: how many loaves of bread (x) and how many cakes (y) to make. These are the unknowns you are solving for.
The limits you cannot exceed. Flour: 200x + 300y ≤ 3000g. Eggs: x + 3y ≤ 24. And you can't make negative bread: x ≥ 0, y ≥ 0.
The thing you want to maximise (or minimise). Revenue: 4x + 9y. This is the number you want to push as high as possible, subject to the constraints above.
Here is what makes linear programming beautiful — and why "geometry" is in the title of this document. If you draw the constraints on graph paper, they create straight lines. The region where all constraints are satisfied simultaneously — where you have enough flour and enough eggs and you're not making negative bread — is a polygon. A flat-sided shape.
Mathematicians call this polygon the feasible region: the space of all possible plans that don't break any rule.
The feasible region (shaded) is a polygon. The optimal solution always sits at a corner — a vertex.
The profound geometric fact at the heart of LP is this: the best solution is always at a corner of the polygon. Not somewhere in the middle, not on an edge, but at a vertex where two constraint lines meet. This is provable, and it changes everything — instead of searching infinitely many points, you only need to check the corners.
This is the insight that made linear programming computationally feasible long before powerful computers existed.
From Cold War logistics to the algorithm quietly running inside every airline ticket you've ever bought, LP has an unexpectedly dramatic history — and a vast invisible presence in the systems that organise modern life.
Soviet mathematician Leonid Kantorovich solves a plywood factory allocation problem and essentially invents LP. The Soviet state, ideologically suspicious of anything that suggested "optimisation" (implying the current plan was suboptimal), suppresses it. He will eventually win the Nobel Prize in 1975.
George Dantzig, working for the US Air Force on logistics scheduling, develops the simplex method — an algorithm to walk along the edges of the feasible polygon from corner to corner, always improving, until the best corner is found. He initially mistakes it for a homework problem left on a blackboard by his professor, Jerzy Neyman.
One of the first famous LP applications: find the cheapest combination of 77 foods that meets all daily nutritional requirements. Solved by hand in 1945, it required 120 person-days of calculation. Dantzig's algorithm solved it in minutes. The optimal diet was nutritionally adequate but included large quantities of dried navy beans, liver, and cabbage.
LP becomes the backbone of operations research. Airlines use it for crew scheduling and route optimisation. Oil companies use it for refinery planning — deciding which crude inputs to process into which products. Telecommunications companies use it for network capacity planning.
AT&T researcher Narendra Karmarkar publishes an algorithm that doesn't walk the edges of the polygon but cuts through its interior — the interior point method. For very large problems it's dramatically faster than simplex. The two methods remain complementary to this day.
LP and its extensions (mixed-integer programming, convex optimisation) run silently inside the world's supply chains, power grids, financial portfolios, advertising auctions, and hospital staffing rosters. The economy runs on constraint satisfaction, and LP is the oldest, most reliable engine for it.
These are the domains where LP in its traditional form still does quiet, essential work:
Which warehouse ships which order to which customer, minimising total freight cost subject to capacity, lead time, and inventory constraints.
Which power plants run at what level at each hour, balancing demand, fuel cost, transmission limits, and carbon constraints.
Maximise expected return subject to risk limits, regulatory exposure caps, liquidity requirements, and turnover constraints.
Which fields grow which crops in which season, maximising yield or revenue subject to water, soil, labour, and rotation constraints.
Crew assignments, aircraft routing, and seat pricing — all LP problems running simultaneously, tens of millions of variables.
Staff scheduling that satisfies skill requirements, shift limits, legal rest periods, and cost targets simultaneously.
Linear programming works because it simplifies. But the world is not, in fact, linear. Understanding where LP's flatness becomes a liability is the doorway into the richer mathematics beyond it.
LP rests on two assumptions that are almost never perfectly true in the real world:
Every relationship is proportional. Double the input, double the output. But real costs have economies of scale. Real returns diminish. A factory running at 95% capacity costs disproportionately more than one at 80%. A drug's efficacy doesn't scale linearly with dose.
The feasible region has no holes or indentations. But many real problems don't. Combinatorial choices (hire or don't hire, build or don't build) create jagged, non-convex feasible spaces. LP cannot directly handle "either/or" decisions.
A new generation of optimisation problems has arrived that classical LP handles poorly or not at all:
The "weights" of a neural network live in a space of millions of dimensions with a highly curved, non-convex loss landscape. LP cannot touch it.
The configuration space of a robot arm — all the possible combinations of joint angles — is not flat. It's a curved surface. Straight-line paths in LP coordinates are not straight-line motions in physical space.
A valid covariance matrix (describing how variables in a dataset relate) must be positive definite — a geometric constraint that defines a curved cone, not a flat polytope.
Quantum states live on the surface of a high-dimensional sphere (the Bloch sphere generalised). Optimising quantum circuits means moving on this curved surface.
The conformational space of a protein — all the ways it can fold — is a curved, high-dimensional landscape. Finding the minimum energy fold is a manifold optimisation problem.
Brain connectivity is described by matrices that must lie on curved manifolds. Averaging and optimising over brain scans requires Riemannian tools.
The common thread in all these modern problems is that the feasible set — the space of valid solutions — is not a flat polygon. It is a curved surface. And to optimise on a curved surface, you need a geometry that understands curvature. That geometry is Riemannian geometry.
Bernhard Riemann asked, in 1854, what mathematics would look like if we didn't assume space was flat. The answer changed physics, enabled general relativity, and — a century and a half later — gave us better tools for machine learning and biology.
You learned in school that the shortest distance between two points is a straight line. That's true — on a flat surface. But the Earth's surface is not flat. If you want the shortest flight path from Sydney to London, you don't draw a straight line on a flat map. You find a great circle route — the path that lies on the largest circle you can draw on the globe passing through both cities.
On a flat map, Sydney to London looks like it should fly over Asia roughly east-west. But the actual shortest path curves far north, over Siberia or the Arctic. The flat map lies about distances and directions because it is distorting a curved surface.
Riemannian geometry is the mathematics that lets you work directly on the curved surface — without distorting it onto a flat map first.
The key idea is shockingly simple to state, even if it takes years to master in full generality:
In flat (Euclidean) space, you measure the distance between two nearby points using Pythagoras: if you move a tiny bit in the x-direction and a tiny bit in the y-direction, the distance is √(dx² + dy²).
In a Riemannian space, the formula for distance changes depending on where you are. Near the equator on a globe, moving one degree of longitude is a long way. Near the North Pole, moving one degree of longitude is almost nothing. The metric tensor is the local rule that tells you, at each point, how to measure distances in different directions. It is a kind of "local Pythagoras" that varies from place to place.
A Riemannian manifold is simply a curved space equipped with this varying local distance rule.
A manifold is a curved space that, if you zoom in far enough at any single point, looks locally flat. The surface of the Earth is a 2-dimensional manifold: stand anywhere on it, and a small enough patch looks like a flat plane (which is why flat maps work accurately for small regions).
Manifolds can have any number of dimensions. A circle is a 1-dimensional manifold (locally it looks like a line). A sphere is a 2-dimensional manifold. The space of all possible rotations of a rigid body in 3D space is a 3-dimensional manifold. The space of all possible covariance matrices is a manifold whose dimension depends on the number of variables. The space of all probability distributions is a manifold — and this is the key insight behind modern statistics and machine learning.
Left: flat space, shortest path is a straight line. Right: curved space, shortest path is a geodesic — it curves with the surface.
The "straightest possible path" on a curved surface. On a sphere, geodesics are great circles. In a curved space, geodesics are what straight lines are in flat space — they are the shortest routes between points.
At any point on a curved surface, there is a flat plane that just touches it there — like the table you could rest on a sphere at its top. This is the tangent space. It is the "local flat world" at that point.
Two operations that connect the local flat tangent space to the curved surface. The exponential map takes a direction and distance in the tangent space and tells you where you end up on the manifold if you travel that way. The logarithmic map does the reverse — given two points on the manifold, it tells you the direction and distance in the local tangent space. Together, they let you do "flat space" calculations locally and then translate them back to the curved surface. These are the working tools of Riemannian optimisation.
Bernhard Riemann presented his framework in 1854 as a speculative piece of pure mathematics. For 60 years it was beautiful but seemingly useless. Then Einstein needed it for general relativity (1915): the curvature of spacetime is a Riemannian metric, and gravity is the geodesic motion through it.
For another 70 years after that, Riemannian geometry remained mostly in the domain of physics and pure mathematics. Then, in the 1990s and 2000s, researchers in statistics, signal processing, and machine learning began noticing that their feasible sets were curved — and that Riemannian tools gave them dramatically better algorithms. The geometry took 150 years to find its applied destiny.
What happens when you take the machinery of optimisation — gradients, descent, convergence — and rebuild it for curved spaces? You get algorithms that are more honest about the shape of the problems they solve. Three interactive diagrams let you feel the mechanics directly.
Suppose you want to minimise a function defined on the surface of a sphere — say, a cost that varies across the globe and you want to find the lowest-cost location. You could try ordinary gradient descent: compute the gradient, take a step in that direction, repeat. But that step takes you off the sphere. You're now floating in space. You must project yourself back onto the surface — and that projection introduces error, slows convergence, and compounds over many steps.
Riemannian optimisation replaces every flat-space operation with its curved-space equivalent. The key objects are: the tangent space (the flat plane touching the manifold at your current point), the Riemannian gradient (the ordinary gradient projected onto that tangent plane), and the exponential map (which translates a tangent direction into an actual move along the manifold's surface — a geodesic step).
Drag the point around the sphere. Watch how the tangent plane — the local flat approximation — rotates to remain perpendicular to the surface normal at every position. The red arrow is an example gradient vector in 3D space; the gold arrow is its projection onto the tangent plane — the Riemannian gradient. Notice: near the poles the tangent plane is nearly horizontal; near the equator it stands nearly vertical. The surface decides the geometry.
This shows the contrast between standard gradient descent (which steps off the manifold and has to be snapped back) and Riemannian gradient descent (which follows the geodesic and stays on the surface). Press Run to animate both simultaneously. The flat method drifts and requires repeated correction; the Riemannian method traces a clean arc. The cost function here is simply latitude — both methods are trying to reach the south pole (minimum cost).
Given: A smooth function f : M → ℝ to minimise. A starting point x₀ ∈ M. A step size η > 0.
Step 1 — Compute the Euclidean gradient: At the current point xₖ, compute ∇f(xₖ) — the ordinary gradient in the ambient space (the space the manifold sits inside).
Step 2 — Project onto the tangent space: Compute grad f(xₖ) = Proj_{TₓM}(∇f(xₖ)) — the projection of the Euclidean gradient onto the tangent space at xₖ. This is the Riemannian gradient: the direction of steepest ascent constrained to the manifold.
Step 3 — Retract along the manifold: Move to xₖ₊₁ = Retr_{xₖ}(−η · grad f(xₖ)). The retraction map Retr takes a tangent vector (the scaled negative gradient) and maps it to a new point on the manifold — following the geodesic or an efficient approximation of it.
Step 4 — Repeat until ‖grad f(xₖ)‖ < ε. At convergence, the Riemannian gradient is zero: no direction along the manifold decreases f further. This is the first-order optimality condition on a manifold — the curved-space analogue of "gradient equals zero."
Exact geodesic steps — following the true shortest curve on the manifold — are often expensive to compute. In practice, optimisation libraries use a retraction: a cheaper map that starts at a point on the manifold, moves in a tangent direction, and lands back on the manifold, without necessarily following the true geodesic. The retraction only needs to agree with the geodesic to first order (same direction, same local speed). This is enough to guarantee convergence.
On the sphere, the simplest retraction is normalisation: take a tangent step to get a point off the sphere, then divide by its norm to put it back. Not the true geodesic, but close enough locally, and far cheaper. This is the engineering insight that makes Riemannian optimisation practical at scale.
There is a subtlety in Riemannian optimisation that has no analogue in flat space: when you move from one point on a manifold to another, your tangent space changes. The flat plane tangent to a sphere at the north pole is different from the flat plane tangent at the equator. If you want to compare gradient directions computed at different points — which matters for momentum methods, conjugate gradient, and second-order methods — you need a way to "carry" a vector from one tangent space to another without distorting it.
This operation is called parallel transport. It moves a tangent vector along a curve on the manifold, keeping it as "constant" as the curved geometry allows. On a flat space, parallel transport is trivial — just copy the vector. On a curved space, it rotates the vector as it moves, in a way determined by the curvature. The famous illustration: start at the north pole of a sphere with an arrow pointing south. Carry it to the equator along one meridian (arrow still points south). Transport it along the equator a quarter turn. Then carry it back to the north pole along another meridian. The arrow now points in a completely different direction — it has rotated by 90°. This rotation reveals the curvature of the sphere. No parallel transport, no curvature; the two are inseparable.
All n×p matrices with orthonormal columns. Dimension: np − p(p+1)/2. The sphere is the special case p=1. Used in PCA, independent component analysis, and MIMO wireless communications where transmitted signals must be orthogonal.
All n×n symmetric positive definite matrices — the valid covariance structures. The geodesic between two SPD matrices under the affine-invariant metric is S₁#S₂ = S₁^½(S₁^{-½}S₂S₁^{-½})^½S₁^½. Arithmetic averaging gives wrong answers; geodesic averaging gives the correct Riemannian mean.
All p-dimensional subspaces of ℝⁿ. Unlike Stiefel (which cares about specific orthonormal bases), Grassmann cares only about the subspace spanned. Used in video analysis, where a video clip is represented as the subspace spanned by its frames.
All 3×3 orthogonal matrices with determinant 1. Dimension: 3. Lie group structure means the tangent space at the identity (the Lie algebra so(3)) represents infinitesimal rotations — skew-symmetric matrices. The exponential map here is literally the matrix exponential.
Parametric families of distributions. Gaussian family: 2-dimensional manifold with coordinates (μ, σ). The Fisher-Rao metric gives it Riemannian structure. The geodesic distance between two Gaussians is the statistical distinguishability — how hard it is to tell them apart experimentally.
Pure quantum states live on S² (the Bloch sphere). Mixed states live inside it (a ball). Optimising quantum circuits and quantum control problems means navigating this manifold — with the added complication that measurement collapses the state.
LP gives you a global optimum — a provable certificate that no better solution exists. This follows from the polygon structure: the objective function is linear, so it can have no local minima that aren't global minima on a convex feasible set.
Riemannian optimisation finds first-order stationary points — places where the Riemannian gradient is zero. On a non-convex manifold, these include local minima, local maxima, and saddle points. There is no general guarantee of global optimality.
The exception: if the manifold is geodesically convex with respect to the objective function — meaning the function restricted to any geodesic is convex — then first-order stationarity implies global optimality. Several important manifolds have this property for specific objectives (e.g., the Karcher mean problem on SPD matrices). But in general, curvature buys you geometry, not certainty.
Modern practice handles this with: multiple random initialisations, warm-starting from known good solutions, stochastic Riemannian gradient descent (which adds noise that helps escape saddle points), and Langevin dynamics on manifolds (a probabilistic method that samples from the distribution of good solutions rather than finding a single point).
What if the space of all possible beliefs — every probability distribution you might hold about the world — is itself a curved manifold? Shun-ichi Amari showed in the 1980s that it is, and that the curvature encodes something deep about the nature of statistical inference.
Take a simple family of probability distributions: all the normal (Gaussian) distributions. Each one is fully described by two numbers — its mean μ (where the bell curve is centred) and its standard deviation σ (how wide it is). So each Gaussian is a point in a 2-dimensional space, with coordinates (μ, σ).
If you plot all possible Gaussians in this parameter space — μ ranging over all real numbers, σ ranging over all positive real numbers — you get the upper half-plane: an infinite flat sheet. But this flat sheet is deceptive. The natural geometry of this space is not flat. It is curved — specifically, it has constant negative curvature, like a saddle or a hyperbolic plane. The curvature comes not from the space itself but from the structure of statistical inference: some distributions are harder to tell apart than others, and the metric that measures "statistical distinguishability" is not the same as ordinary Euclidean distance.
In ordinary geometry, distance is measured with a ruler. In statistical geometry, distance is measured by asking: how many samples would I need to reliably distinguish these two distributions from each other?
The formal object that captures this is the Fisher information matrix. For a family of distributions p(x; θ) parameterised by θ, the Fisher information is:
For the Gaussian family, the Fisher information metric gives distances that match the Poincaré half-plane metric — a famous model of hyperbolic geometry. Two Gaussians with the same mean but very different variances are far apart (easy to distinguish). Two Gaussians with similar means and large variances are close together (hard to distinguish). The geometry encodes this in its curvature.
The upper half-plane below represents all Gaussian distributions: horizontal axis is the mean μ, vertical axis is the standard deviation σ (must be > 0). Click anywhere to place a distribution and see its bell curve. Drag between two points to see both the Euclidean distance (straight line, flat geometry) and the Fisher-Rao geodesic distance (curved path, information geometry). Notice how the geodesic curves — it is not a straight line in parameter space. Distributions with small σ are "far" from each other even when their μ values are close; distributions with large σ are "close" even with different means.
Here is where Amari's contribution goes beyond simply applying Riemannian geometry to statistics. He discovered that the statistical manifold has not one but two natural "flat" structures — two ways of drawing straight lines that are both consistent with the Fisher metric, but that are dual to each other.
The exponential connection (e-flat) gives you straight lines in the space of natural parameters — the parameters that appear in the exponent of the distribution (e.g., μ/σ² and 1/σ² for Gaussians). Moving in a straight line under the e-connection corresponds to exponential family mixing.
The mixture connection (m-flat) gives you straight lines in the space of expectation parameters — the mean values of sufficient statistics (e.g., μ and μ²+σ²). Moving in a straight line under the m-connection corresponds to mixing distributions linearly (taking convex combinations).
Neither of these is the same as a Riemannian geodesic — but together they create a structure called a dually flat space, and this structure has a remarkable consequence: it generalises the Pythagorean theorem to curved statistical spaces.
For three distributions P, Q, R on a dually flat statistical manifold, if the geodesic from P to R is "e-orthogonal" to the geodesic from R to Q (in the sense of the dual connections), then:
D_KL(P ‖ Q) = D_KL(P ‖ R) + D_KL(R ‖ Q)
where D_KL is the Kullback-Leibler divergence — the information-theoretic measure of how different two distributions are. This is exactly the form of the Pythagorean theorem (c² = a² + b²), but for information distance on a curved manifold. The "right angle" is defined by the duality of the two connections, not by ordinary perpendicularity.
This theorem underlies the EM algorithm (Expectation-Maximisation, used throughout statistics and machine learning) — each E-step is an m-projection, each M-step is an e-projection, and convergence is guaranteed by the generalised Pythagorean theorem.
Ordinary gradient descent on a neural network uses parameter gradients in Euclidean space. But the space of neural network functions — what the network computes — is not a flat Euclidean space. It has the geometry of the statistical manifold, measured by the Fisher information.
The consequence: ordinary gradient descent takes steps that are equal in parameter space, but wildly unequal in function space. A small step in one parameter direction might change the network's output drastically; an equally-sized step in another direction might change almost nothing. The gradient "doesn't know" this, and wastes effort thrashing in flat-parameter coordinates that distort the true functional geometry.
Natural gradient descent fixes this by premultiplying the gradient by the inverse Fisher information matrix:
Natural gradient is far faster than ordinary gradient descent on many problems — but computing I(θ)⁻¹ for a large neural network is prohibitively expensive (the Fisher matrix has as many rows and columns as there are parameters). Modern approximations — K-FAC, EKFAC, KFAC-Expand — approximate the Fisher matrix with efficient block-diagonal structures, making natural gradient practical at scale. It is the current frontier of optimisation for large language models.
The Kullback-Leibler divergence D_KL(P ‖ Q) measures how much information is lost when you use distribution Q to approximate distribution P. It is not symmetric — D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) — which means it is not a true metric (distances must be symmetric). But it is the canonical divergence induced by the dual connections on the statistical manifold.
This asymmetry is not a flaw; it is the correct structure. In information geometry, there are two distinct "distances" between distributions — one for each connection — and their asymmetry reflects the asymmetry of statistical inference: approximating P by Q (fitting a model Q to data from P) is a different problem from approximating Q by P (finding the simplest distribution Q that covers the uncertainty in P). KL forward and KL backward give different answers, and both are right for their respective problems.
The squared statistical distance (the symmetric version, the Fisher-Rao distance) is the geometric mean: it lives on the geodesic and is symmetric. It is the true Riemannian distance on the statistical manifold.
Second-order optimisation approximating the Fisher matrix. Used in large-scale training at Google Brain, DeepMind. Faster convergence than Adam on transformer architectures.
Every iteration of EM is an alternating projection on the statistical manifold — an e-projection followed by an m-projection. Information geometry explains why EM converges and how fast.
The Fisher information of a neural population code bounds how precisely the brain can decode a stimulus. Efficient neural codes saturate this bound — they place neural tuning curves so that the Fisher metric is uniform across the stimulus manifold.
Fitting an approximate distribution Q to a true posterior P by minimising KL(Q ‖ P) is a projection onto a submanifold of the statistical manifold. The geometry explains mode-seeking vs. mean-seeking behaviour of different KL directions.
Covariance matrices of radar returns live on SPD manifolds. Detection, tracking, and classification problems become Riemannian optimisation problems on this manifold, with the Fisher metric as the natural distance.
Covariance Matrix Adaptation Evolution Strategy — a widely used black-box optimiser — implicitly performs natural gradient ascent on the statistical manifold of Gaussians. It is information geometry in practice, discovered before the theory connected it.
Linear programming began as an answer to a specific question: how does a military supply chain allocate scarce resources against hard limits? Its mathematics — flat polytopes, corner solutions, clean duality — is the mathematics of an administered world. External constraints, defined in advance, divide the possible from the impossible. The best answer is at the boundary of what's allowed.
Riemannian geometry begins from a different intuition: the space of possibilities has its own intrinsic shape, independent of any external administration. The constraints are not walls built around a flat field — they are the curvature of the field itself. Moving through this space means respecting its geometry: following geodesics, computing in local tangent spaces, measuring distances by the Fisher metric rather than the ruler.
Information geometry takes this one step further: even the space of knowledge itself — every distribution you might believe — is curved. The curvature tells you which beliefs are hard to distinguish from which others, which inferences are short steps and which are long ones. To learn efficiently, in mathematics or in life, is to follow a geodesic through the space of possible understandings.
LP asks: given the box, find the best corner.
Riemannian optimisation asks: given that space itself has a shape, how do you move through it wisely?
Information geometry asks: what is the shape of the space of knowing?
The Gita's instruction — plan with the outcome in mind, then release it and act — is geometrically exact. At each point on the manifold, you compute the local gradient: the best direction from where you stand. Then you step, cleanly, without carrying the stale tangent plane forward. At the new point, a new tangent space opens. The manifold itself keeps shifting. The ancient question — what is the best life, the best allocation, the best path through constrained possibility — is always solved locally, always in the tangent space of now-here, always with incomplete information about the global shape. That is not a failure of optimisation. It is its honest description.
Every solution is a geodesic through the curvature of its own constraints, found not from above, but from within.