Shika-B/optimal_transport.ipynb

## optimal_transport.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "babb01b6",
   "metadata": {},
   "source": [
    "A few sections of the notebook are also dedicated to the more general context of convex optimization and some theory related to it.\n",
    "\n",
    "# Some context on discrete optimal transport\n",
    "\n",
    "The goal of this notebook is to give some context as to what the discrete optimal transport problem is and present some algorithms to solve it efficiently. It does not assume prior knowledge of what optimal transport but does not give any specific motivation for it. One can read [the Wikipedia page](https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)) for logistics planning and economics applications and [this survey](https://perso.liris.cnrs.fr/nbonneel/eg_star_transport.pdf), this [blog article](https://www.devonstrawn.com/articles/The-many-many-applications-of-optimal-transport.html) and this [talk by Gabriel Peyré](https://www.youtube.com/watch?v=mITml5ZpqM8) for a tour of some applications to computer vision and machine learning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a910870a",
   "metadata": {},
   "source": [
    "## A formulation of discrete optimal transport\n",
    "\n",
    "### General optimal transport\n",
    "\n",
    "Recall that optimal transport aims to solve the following problem: What is the minimal cost to transport one probability measure to another. This problem doesn't make much sense as is, so let's give it some precise meaning:\n",
    "\n",
    "Let $\\mu, \\nu$ be two probability measures on $(X, \\mathcal{A}), (Y, \\mathcal{B})$ two measurables spaces. Choose a measurable cost map $c: X \\times Y \\to \\mathbb{R}^+$. Then we can formulation our objective in two ways:\n",
    "- Monge formulation: Find a map $T: X \\to Y$ such that $\\nu = T_\\# \\mu$ and realizing the infimum\n",
    " $$\\inf \\left\\{ \\int_X c(x, T(x)) \\mathrm{d}\\mu \\mid T_\\# \\mu = \\nu \\right\\}$$\n",
    "- Kantorovich formulation: Find a probability measure $\\gamma$ on $X \\times Y$ with marginals $\\mu$ and $\\nu$ and realizing the infimum\n",
    " $$\\inf \\left\\{ \\int_{X \\times Y} c(x, y) \\mathrm{d}\\gamma \\mid (\\pi_1)_\\# \\gamma = \\mu, (\\pi_2)_\\# \\gamma = \\nu \\right\\}$$\n",
    "\n",
    "### Discrete optimal transport\n",
    "\n",
    "Discrete optimal transport concerns itself with the above problem in the case where $X$ and $Y$ are finite sets, or atleast the supports of $\\mu$ and $\\nu$ in $X$ and $Y$ respectively are. A better description may be \"finite\" optimal transport, as we are really only working with finite (-dimensional) data everywhere.  \n",
    "\n",
    "It helps to think of discrete optimal in matrice form, as often with finite optimization problems:\n",
    "Let's write $X = [\\![1; n]\\!]$ and $Y = [\\![1; m]\\!]$. A probability on $X$ is then really just a vector $\\bm{a} \\in \\Delta_n := \\{ x \\in \\mathbb{R}_+^n | \\sum_{i = 1}^n x_i = 1 \\}$ and the same thing is true for $\\bm{b}$ in $Y$. A cost map is given by the data of $C = (c(i, j))_{(i, j) \\in X \\times Y}$, that is a matrix in $\\mathcal{M}_{n, m}(\\mathbb{R}_+)$.\n",
    "\n",
    "Monge optimal transport then tries to find a map $T: [\\![1; n]\\!] \\to [\\![1; m]\\!]$ such that:\n",
    "- $T_\\# \\bm{a} = \\bm{b}$, that is $\\forall j \\in Y, \\bm{b}_j = \\sum_{i = 1, T(i) = j}^{n}{a_i}$ \n",
    "- The quantity $\\sum_{i = 1}^{n} C_{i, T(i)}$ is minimized among all $T \\in [\\![1; m]\\!]^{[\\![1; n]\\!]}$\n",
    "\n",
    "Kantorovich optimal transport looks instead for a matrix $P \\in \\mathcal{M}_{n, m}(\\mathbb{R}_+)$ such that:\n",
    "- Equality of marginals, given by $P \\bm{1}_m = \\bm{a}$ and $P^T \\bm{1}_n = \\bm{b}$\n",
    "- The quantity $\\langle C, P \\rangle = \\sum_{i = 1}^{n} \\sum_{j = 1}^{m} C_{i,j} P_{i, j}$ is minimized among all $P \\in U(\\bm{a}, \\bm{b})$ where \n",
    "$$U(\\bm{a}, \\bm{b}) := \\{P \\in \\mathcal{M}_{n, m}(\\mathbb{R}_+) \\mid P \\bm{1}_m = \\bm{a}, P^T \\bm{1}_n = \\bm{b}\\}$$\n",
    "\n",
    "In the Monge formulation, even with $n = m = 15$, we have $15! = 1307674368000$ maps to check. Kantorovich formulation doesn't help it strictly contains the Monge formulation: let $P_{i, j} = \\delta_{i, T(i)}$. As we cannot bruteforce the problem, we need to find something clever algorithms and properties to prune a lot of cases. \n",
    "\n",
    "From now on, we focus on the Kantorovich formulation, as it has better properties and we can recover the Monge-solution from the Kantorovich-solution when it exists quite often.\n",
    "\n",
    "This is a good moment to explore what are the (elementary-ish) theories available to tackle these kinds of problems\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90fda7e8",
   "metadata": {},
   "source": [
    "\n",
    "## A hierarchy of optimization problems\n",
    "\n",
    "First our problem falls into the absurdly general category of optimization problems: trying to minimize a given function $f: \\Omega \\subset \\mathbb{R}^k \\to \\mathbb{R}$ over a given domain subset $K \\subset \\Omega$. Without any kind of regularity assumption over $f$ and $K$, it is hopeless to expect solving problems in such generality. \n",
    "\n",
    "Constrained optimization problems are those where $K$ is defined by (in)equality equations:\n",
    "$$K = \\{x \\in \\Omega \\mid g_1(x) \\leq 0, \\ldots, g_n(x) \\leq 0 \\land h_1(x) = \\ldots = h_m(x) = 0\\}$$\n",
    "where the $(g_1, \\ldots, g_n)$ and $(h_1, \\ldots, h_m)$ satisfy some (possibly different) regularity conditions.\\\n",
    "In contrast, unconstrained optimization problems are those where $K = \\Omega$, and this is a case we understand very well from early multivariable calculus: solving $\\nabla f(x) = 0$ and, if that's not enough on its own, sorting solutions by looking at the hessian.\n",
    "\n",
    "The examples of constrained optimization problems we will be concerned with all fall under a very specific category, those of smooth convex optimization problems: The function to optimize $f$ is smooth convex, all the $g_i$'s are also smooth convex maps, and all the $h_j$'s are **affine** maps. Hence a convex problem can be rewritten as\n",
    "$$\\begin{align*}\n",
    "\\min_{x \\in \\mathbb{R}^k} f(x) \\\\\n",
    "\\text{subject to } &g_1(x) \\leq 0, \\ldots, g_n(x) \\leq 0\\\\\n",
    "\\text{and } &Ax = b\n",
    "\\end{align*}$$\n",
    "\n",
    "with $A \\in \\mathcal{M}_{m, k}(\\mathbb{R})$ and $b \\in \\mathbb{R}^m$.\n",
    "\n",
    "Note: One can actually remove the smoothness condition completely, and work with subgradients instead. While sometimes useful, the generalization is confusing if the reader does not know about subgradients and straightforward otherwise, and hence I choose not to dive into that.\n",
    "\n",
    "### About unconstrained and equality-constrained convex problems\n",
    "One of the most important thing to know about convex problems is that unconstrained convex problems are among the easiest optimization problems to solve both mathematically and numerically.\n",
    "Indeed, the convexity inequality enforces the existance of a local minimum, which will always be a global minimum, which avoids the very common situation when an algorithm gets stuck at a local minima. Algorithmically, variants of gradient descent or of the Newton's method can be used. \n",
    "In some cases it can be proven that there is always a constant such that the constant step-size gradient descent converges to an optima (for instance those with Lipschitz-gradient, corresponding to strongly convex functions in the $\\mathcal{C}^1$ case). In general more robust methods are preferred, and an adaptative step-size is determined at each iteration. Standard wisdom (the statement may be a proven theorem but I failed to find any solid reference for the result) is that any choice of step-size $(\\alpha_k)_{k \\in \\mathbb{N}}$ satisfying the Robbins-Monro condition does the job:\n",
    "$$\\text{Robbins-Monro conditions: } \\sum_{k = 0}^{+\\infty}{\\alpha_k} = +\\infty \\text{ and } \\sum_{k = 0}^{+\\infty}{\\alpha_k^2} < +\\infty$$\n",
    " \n",
    "It is notable that equality-constrained convex problems, i.e those without any inequality constraints, can algorithmically be reduced to unconstrained convex problems, and the discussion aboves applies: \\\n",
    "Write the equality constraints as $Ax = b$ with $A \\in \\mathcal{M}_{k, k}(\\mathbb{R})$ and $p = \\text{rank} A$. We can find a matrix $B \\in \\mathcal{M}_{k-p, k}$ such that $\\ker(A) = \\text{Im}(B)$. \n",
    "For any particular solution $x_0$ of $Ax = b$, we then have, given $x \\in \\mathbb{R}^k$ satisfying $Ax = b$, a corresponding $z \\in \\mathbb{R}^{k - p}$ such that $Bz + x_0 = x$, and thus the initial problem is equivalent to\n",
    "$$\\min_{z \\in \\mathbb{R}^p} f(Bz + x_0)$$\n",
    "\n",
    "### Useful facts on convex problems\n",
    "\n",
    "Finally, we mention a few facts about convex problems that will be of use to us. \n",
    "Recall that an extremal point of a convex set $C$ is a point $z$ such that if $z = tu + (1-t)v$, then $z = u$ or $z = v$. We defined that the domain of a constrained optimization problem as \n",
    "$$K = \\{x \\in \\Omega \\mid g_1(x) \\leq 0, \\ldots, g_n(x) \\leq 0 \\land h_1(x) = \\ldots = h_m(x) = 0\\}$$\n",
    "\n",
    "For convex problems, the set $K$ is always convex (in fact, convex problems are defined so that this property is true). This implies that whenever the objective function is linear and a minimum exists, the minimum must be reached at an extremal point of $K$. Indeed, the Krein-Milman theorem ensures that $K$ is the convex hull of its extremal points, and for any point $x = \\sum_{i = 0}^{n} t_i x_i$ with $x_i$ extremals, we have \n",
    "$$f(x) = \\sum_{i = 0}^{n} t_i f(x_i) \\geq \\min_{0 \\leq i \\leq n} f(x_i)$$\n",
    "by linearity.\n",
    "\n",
    "This seemingly niche fact is very important for two reasons:\n",
    "- The theoretical reason is that any convex problem can be reformulated as a convex problem with linear objective, for instance with:\n",
    "$$\\begin{align*}\n",
    "\\min_{x \\in \\mathbb{R}^k, t \\in \\mathbb{R}} t \\\\\n",
    "\n",
    "\\text{subject to } &f(x) - t \\leq 0 \\\\\n",
    "\\text{ and } &g_1(x) \\leq 0, \\ldots, g_n(x) \\leq 0\\\\\n",
    "\\text{and } &Ax = b\n",
    "\\end{align*}$$\n",
    "\n",
    "- The algorithmic reason is that the set of extremal points is usually finite, and thus if we're able to identify it then we transformed our search space from a multidimensional continuous space to a discrete space, and reduced the problem to combinatorics. This is actually one of the main insight of the (network) simplex algorithm we will describe later, to solve discrete optimal transport problems."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fe80318",
   "metadata": {},
   "source": [
    "\n",
    "## Discrete optimal transport is a linear programming problem.\n",
    "\n",
    "### Linear programming\n",
    "Among convex problems, we will be specifically interested in linear programming problems. Despite the name, its definition has nothing to do with computer science and it is really a mathematical concept. Indeed, linear programming problems are convex problems with the additional regularity that the objective is linear and that the inequality constraints are affine. Moreover, as an equality is really just the combination of two opposite inequalities, we can always write a constraint $h_i(x) = 0$ as $h_i(x) \\leq 0 \\land -h_i(x) \\leq 0$ and thus reduce a linear problem to a problem of the form:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\min_{x \\in \\mathbb{R}^k} \\langle c, x \\rangle \\\\\n",
    "\\text{subject to } Ax \\succeq b\n",
    "\\end{align*}$$\n",
    "\n",
    "where $\\preceq$ means the inequality holds for each component of the vectors.\n",
    "\n",
    "A useful trick in LP theory is the usage of slack variables: any vector $x$ can be written as $x = x^+ - x^-$ with $x^+, x^- \\succeq 0$. For instance, any inequality constraint $g_i(x) \\leq b_i$ can be rewritten as $g_i(x) + x' = b_i$ with $x' \\geq 0$.\n",
    "\n",
    "It turns out it is a little more convenient to enforce $x \\preceq 0$, and to work with equalities instead of inequalities, i.e. problems in the so-called augmented standard form:\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\min_{x \\in \\mathbb{R}^k} \\langle c, x \\rangle \\\\\n",
    "\\text{subject to } Ax = b \\\\\n",
    "\\text{ and } x \\succeq 0\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "The two formulations are formally equivalent. We already discussed how to go from two inequalities to an equality, so the the first one is *a priori* more general. We show we can recover the first one from the second one: split the original vector $x \\in \\mathbb{R}^k$$ into a new vector $x' \\in \\mathbb{R}^{2k}$, such that $x'_{2i}$ is the positive part of $x$ and $x'_{2i + 1}$ the negative part. Replace each inequality of the form\n",
    "$$\\sum_{i = 1}^{n}{a_i x_i} \\leq b_i$$\n",
    "with the inequality \n",
    "$$\\sum_{i = 1}^{n}{a_i x'_{2i} - a_i x'_{2i+1}} \\leq b_i$$\n",
    "Do the same operation with the objective vector. Finally, add slack variables so that equalities become inequalities, and we recover an equivalent problem.\n",
    "\n",
    "### A formulation of discrete optimal transport in linear programming\n",
    "\n",
    "Recall that we formulated the Kantorovich discrete optimal transport problem as\n",
    "$$\\min_{P \\in U(\\bm{a}, \\bm{b})} \\langle C, P \\rangle $$\n",
    "where\n",
    "$$U(\\bm{a}, \\bm{b}) = \\{P \\in \\mathcal{M}_{n, m}(\\mathbb{R}_+) \\mid P \\bm{1}_m = \\bm{a}, P^T \\bm{1}_n = \\bm{b}\\}$$\n",
    "\n",
    "As the set $U(\\bm{a}, \\bm{b})$ is defined by linear constraints, this is pretty much already a linear program in augmented standard form. Let's make it explicit:\n",
    "\n",
    "Consider the matrix $(m + n) \\times mn$ matrix given by\n",
    "$$A := \\begin{pmatrix}\n",
    "I_n \\otimes \\mathbf{1}_m^{\\top} \\\\\n",
    "\\mathbf{1}_n^{\\top} \\otimes I_m\n",
    "\\end{pmatrix}$$\n",
    "where $\\otimes$ denotes the [kronecker product](https://en.wikipedia.org/wiki/Kronecker_product). Personally I struggle to visualize such a formula, and I need to actually see the matrix, so here it is in the case $m = 2, n = 3$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "405367d5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1., 1., 0., 0., 0., 0.],\n",
       "       [0., 0., 1., 1., 0., 0.],\n",
       "       [0., 0., 0., 0., 1., 1.],\n",
       "       [1., 0., 1., 0., 1., 0.],\n",
       "       [0., 1., 0., 1., 0., 1.]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "def get_OT_matrix(n, m):\n",
    "    # np.ones((p, q)) defines a p x q matrix full of ones\n",
    "    one_m, one_n = np.ones((m, 1)), np.ones((n, 1))\n",
    "\n",
    "    # np.kron is the kronecker product, np.eye(n) defines the n x n identity matrix\n",
    "    A_1 = np.kron(np.eye(n), one_m.T) \n",
    "    A_2 = np.kron(one_n.T, np.eye(m))\n",
    "\n",
    "    # np.concatenate concatenates a sequences of matrices by row\n",
    "    return np.concatenate([A_1, A_2]) \n",
    "\n",
    "get_OT_matrix(3,2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb1ae0fe",
   "metadata": {},
   "source": [
    "Now, view the matrix $P$ as a big $nm$ column vector $\\bm{p}$, with $\\bm{p}_{i + n(j-1)} = P_{i, j}$, i.e we stack $P$ column by column. Do the same thing with the cost matrix $C$, so that we get another $nm$ column vector $\\bm{c}$. Then we can reformulate the problem as\n",
    "$$\n",
    "\\min_{\\bm{p} \\in \\mathbb{R}_+^{nm}} \\bm{c}^T \\bm{p} \\\\\n",
    "\\text{subject to } A \\bm{p} = \\begin{pmatrix}\\bm{a} \\\\ \\bm{b}\\end{pmatrix}\n",
    "$$\n",
    "\n",
    "In terms of code, we already saw how to define $A$, and the function `np.hstack(M).reshape(-1, 1)` will allow us to transform a matrix $M$ into a column vector by stacking it column-wise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9644ce3f",
   "metadata": {},
   "source": [
    "## Duality in convex optimization\n",
    "Let $(P)$ be an optimization problem of the form\n",
    "$$\\begin{align*}\n",
    "\\inf_{x \\in \\mathbb{R}^k} f(x) \\\\\n",
    "\\text{subject to } &g_1(x) \\leq 0, \\ldots, g_n(x) \\leq 0\\\\\n",
    "\\text{and } &Ax = b\n",
    "\\end{align*}$$\n",
    "\n",
    "where $f, g_i: \\mathbb{R}^k \\to \\mathbb{R}$ $A \\in \\mathcal{M}_{m, k}(\\mathbb{R})$ and $b \\in \\mathbb{R}^k$. Let $K_P$ be the domain associated with the problem. We define the Lagrangian of $(P)$ as the map\n",
    "$$\\begin{align*}\n",
    "&\\mathbb{R}^k \\times \\mathbb{R}^n \\times \\mathbb{R}^m \\to \\mathbb{R}\\\\\n",
    "&\\mathcal{L}(x, \\mu, \\nu) = f(x) + \\sum_{i = 0}^{n}{\\mu_i g_i(x)} + \\nu^T (Ax - b)\n",
    "\\end{align*}$$\n",
    "\n",
    "The motivation for the Lagrangian is to transform a constrained optimization problem into an unconstrained one by adding *penalties* (in the form of the vectors $\\mu$ and $\\nu$). Indeed, let $x$ be arbitrary, there are two cases:\n",
    "- If $x \\in K_P$, $\\mathcal{L}(x, \\cdot, \\cdot)$ is constant equal to $f(x)$, and in particular \n",
    "$$\\sup_{\\mu \\succ 0, \\nu} \\mathcal{L}(x, \\mu, \\nu) = f(x)$$\n",
    "- Otherwise, one of the constrained is violated, and by taking its penalty coefficient go to $\\pm \\infty$ accordingly, we get \n",
    "$$\\sup_{\\mu \\succ 0, \\nu} \\mathcal{L}(x, \\mu, \\nu) = +\\infty$$\n",
    "\n",
    "We just showed that $(P)$ is equivalent to the following unconstrained minimization problem , with value in $\\overline{\\mathbb{R}}$: \n",
    "$$\\inf_{x \\in \\mathbb{R}^k} \\sup_{\\lambda \\succ 0, \\mu} \\mathcal{L}(x, \\mu, \\nu)$$\n",
    "\n",
    "If we imagine for a moment we *can* switch up the $\\inf$ and $\\sup$ here, we get a concave maximization problem we will call $(D)$:\n",
    "$$\\sup_{\\lambda \\succ 0, \\mu} \\inf_{x \\in \\mathbb{R}^k} \\mathcal{L}(x, \\mu, \\nu)$$\n",
    "\n",
    "We call this problem the (Lagrangian) dual of $(P)$. The weak Lagrangian duality states that the optimum of the dual problem is always smaller than the primal's, i.e $p^* - d^* \\geq 0$ where $p^*$ and $d^*$ represent the optimum values of the respective problems. Strong Lagrangian duality reinforces the statement by saying that this is an equality $p^* = d^*$, i.e we can actually switch up that $\\sup$ and that $\\inf$. While weak duality is always true, strong Lagrangian duality is not. \n",
    "\n",
    "The most common condition for strong Lagrangian duality concerns convex problems. It is called the (relaxed) Slater's condition and asks that there exists a point $x_{int}$ satisfying $f_i(x_{int}) < 0$ for any inequality constraint, with the strict inequality being relaxed to a large inequality whenever $f_i$ is affine. Note that it is always satisfied for linear programs with non-empty domains ! The only part we will need to understand and to prove correctness the network simplex algorithm is the weak duality: \n",
    "\n",
    "Weak duality really is a formal derivation, things just work out as you want them to: once again, denote by $p^*$ and $d^*$ the optimal value of $(P)$ and $(D)$ respectively. Since\n",
    "$$\\inf_{x} \\sup_{\\mu \\succeq 0, \\nu} \\mathcal{L}(x, \\mu, \\nu) = p^*$$\n",
    "by definition, we have for any $\\lambda \\succeq 0, \\mu$\n",
    "$$\\inf_x \\mathcal{L}(x, \\mu, \\nu) \\leq p^{*}$$\n",
    "and by taking the supremum over $\\mu, \\nu$, we finally get\n",
    "$$\\sup_{\\mu \\succeq 0, \\nu} \\inf_x \\mathcal{L}(x, \\mu, \\nu) = d^* \\leq p^*$$\n",
    "\n",
    "Slater's condition proof rely on the existence of a separating hyperplane for two convex disjoint bodies. A proof of the appropriate statement (and some variations) can be found on the [dedicated wikipedia page](https://en.wikipedia.org/wiki/Hyperplane_separation_theorem). A proof of the sufficiency of Slater's condition can be found in the book [Convex Optimization, by Boyd and Vandenberghe](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf), section 5.3.2. I also wrote a [blog post](https://shika-b.github.io/blog/lagrangian_duality/) about lagrangian duality that's a little more straight to the point if you don't feel about skimming a 715 pages book.\n",
    "\n",
    "\n",
    "### The dual of a linear program\n",
    "\n",
    "We showed we can always cast a linear program as\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\inf_{x \\in \\mathbb{R}^k} \\langle c, x \\rangle \\\\\n",
    "\\text{subject to } Ax = b \\\\\n",
    "\\text{ and } x \\succeq 0\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "it's Lagrangian is then \n",
    "$$\\mathcal{L}(x, \\mu, \\nu) = c^T x + \\mu^T x + \\nu^T (Ax - b)$$\n",
    "and, by grouping up terms, the dual problem becomes\n",
    "$$\\sup_{\\mu \\in \\mathbb{R}_+^n, \\nu \\in \\mathbb{R}^m} \\inf_{x \\in \\mathbb{R}^k} (c^T + \\mu^T + \\nu^TA) x - \\nu^Tb$$\n",
    "\n",
    "As $\\inf_{x \\in \\mathbb{R}^k} (c^T + \\mu^T + \\nu^TA) x - \\nu^Tb = -\\infty$ whenever $c^T + \\mu^T + \\nu^T A \\neq 0$, any solution must satisfy\n",
    "$c + \\mu = - A^t\\nu$ and $\\mu$ is entirely determined by the value of $\\nu$. Summarizing the constraints, we get\n",
    "\n",
    "$$\n",
    "\\begin{align*}\n",
    "\\sup_{\\nu \\in \\mathbb{R}^m} \\langle \\nu, b \\rangle \\\\\n",
    "\\text{subject to } A^t \\nu \\preceq c \\\\\n",
    "\\end{align*}\n",
    "$$\n",
    "\n",
    "which is exactly the dual program as defined in standard LP theory. Whenever any of the primal or dual program is feasible, Slater's condition is satisfied and the duality gap is zero. \n",
    "\n",
    "#### The case of discrete optimal transport\n",
    "\n",
    "Recall the linear program associated with a discrete optimal transport problem\n",
    "$$\\inf_{\\bm{p} \\in \\mathbb{R}_+^{nm}} \\langle \\bm{c}, \\bm{p} \\rangle \\\\\n",
    "\\text{subject to } A \\bm{p} = \\begin{pmatrix}\\bm{a} \\\\ \\bm{b}\\end{pmatrix}$$\n",
    "\n",
    "with\n",
    "$$A = \\begin{pmatrix}\n",
    "I_n \\otimes \\mathbf{1}_m^{\\top} \\\\\n",
    "\\mathbf{1}_n^{\\top} \\otimes I_m\n",
    "\\end{pmatrix}$$\n",
    "\n",
    "and by substiting we get the dual program\n",
    "$$\\sup_{y \\in \\mathbb{R}^(n+m)} \\langle \\begin{pmatrix}\\bm{a} \\\\ \\bm{b}\\end{pmatrix}, y \\rangle \\\\\n",
    "\\text{subject to } A^T y \\preceq \\bm{c}$$\n",
    "If we write $y = (\\mu, \\nu)$ with $(\\mu, \\nu) \\in \\mathbb{R}^n \\times \\mathbb{R}^m$, we can compute\n",
    "$$A^T \\begin{pmatrix}\\mu \\\\ \\nu \\end{pmatrix} =  \\mu \\oplus \\nu$$\n",
    "and then\n",
    "$$\\sup_{\\mu, \\nu \\in \\mathbb{R}^n \\times \\mathbb{R}^m} \\langle \\bm{a}, \\mu \\rangle + \\langle \\bm{b}, \\nu \\rangle \\\\\n",
    "\\text{subject to } \\mu \\oplus \\nu \\preceq \\bm{c}$$\n",
    "\n",
    "As we know that the optimal transport problem is always feasible, strong LP duality ensures that these problem are equivalent.\n",
    "\n",
    "### A formal duality for optimal transport\n",
    "\n",
    "The optimal transport problem can be seen as some kind of convex problem. Indeed, in the vector space of finite measures over $X \\times Y$ the subspace $\\mathcal{P}(X \\times Y)$ is convex (defined by a single affine constraint), and the marginal constraints are also affine. The main subtlety is that we're not over a finite-dimensional vector space anymore. Even though the framework we described above does not apply directly, we can try to follow similar ideas and define a *formal* dual to a given optimal transport problem. It turns out that this does leads to a strong duality result under reaosnable hypothesis on the spaces $X$ and $Y$ (compact is enough for instance, see below).\n",
    "\n",
    "Let $\\mu, \\nu$ be probability measures on measurable spaces $(X, \\mathcal{A}), (Y, \\mathcal{B})$ as above and $(OT)$ be the problem:\n",
    "$$\\inf \\left\\{ \\int_{X \\times Y} c(x, y) \\mathrm{d}\\gamma \\mid (\\pi_1)_\\# \\gamma = \\mu, (\\pi_2)_\\# \\gamma = \\nu \\right\\}$$\n",
    "\n",
    "We can rewrite the condition $(\\pi_1)_\\# \\gamma = \\mu \\land (\\pi_2)_\\# \\gamma = \\nu$ as a function of $\\gamma$ that's $0$ if the condition is satisfied and $+\\infty$ otherwise:\n",
    "\n",
    "$$\\sup_{(\\varphi, \\psi) \\in \\mathcal{C}(X) \\times \\mathcal{C}(Y)} \\int_X \\varphi(x) \\mathrm{d} \\mu(x) + \\int_Y \\psi(y) \\mathrm{d} \\nu(y) - \\int_{X \\times Y} \\varphi(x) + \\psi(y) \\mathrm{d} \\gamma(x, y) = \n",
    "\\begin{cases} 0 \\text{ if } (\\pi_1)_\\# \\gamma = \\mu \\land (\\pi_2)_\\# \\gamma = \\nu \\\\ \n",
    "+\\infty \\text{ otherwise}\n",
    "\\end{cases}$$\n",
    "\n",
    "Indeed, if one of the two constraints is violated, it is easy to construct \n",
    "\n",
    "The problem OT is then equivalent to\n",
    "$$\\inf_{\\gamma \\in \\mathcal{P}(X \\times Y)} \\sup_{(\\varphi, \\psi) \\in \\mathcal{C}(X) \\times \\mathcal{C}(Y)} \\int_{X \\times Y} c(x, y) \\mathrm{d}\\gamma(x, y) +  \\int_X \\varphi(x) \\mathrm{d} \\mu(x) + \\int_Y \\psi(y) \\mathrm{d} \\nu(y) - \\int_{X \\times Y} \\varphi(x) + \\psi(y) \\mathrm{d} \\gamma(x, y) $$\n",
    "\n",
    "By interchaning $\\inf$ and $\\sup$ and cleaning up the equality we get *formal* dual problem for $(OT)$, :\n",
    "$$(DOT): \\sup_{(\\varphi, \\psi) \\in \\mathcal{C}(X) \\times \\mathcal{C}(Y)} \\int_X \\varphi(x) \\mathrm{d} \\mu(x) + \\int_Y \\psi(y) \\mathrm{d} \\nu(y) \\inf_{\\gamma \\in \\mathcal{P}(X \\times Y)} \\int_{X \\times Y} (c(x, y)- \\varphi(x) - \\psi(y)) \\mathrm{d}\\gamma(x, y)\n",
    "$$\n",
    "\n",
    "Moreover, we have \n",
    "\n",
    "$$\\inf_{\\gamma \\in \\mathcal{P}(X \\times Y)} \\int_{X \\times Y} (c(x, y)- \\varphi(x) - \\psi(y)) \\mathrm{d}\\gamma(x, y) = \\begin{cases} 0 \\text{ if } \\varphi \\oplus \\psi \\leq c \\\\ -\\infty \\text{ otherwise}\n",
    "\\end{cases}$$\n",
    "\n",
    "which simplifies even further $(DOT)$ to:\n",
    "\n",
    "$$\\begin{align*}\n",
    "(DOT): &\\sup_{(\\varphi, \\psi) \\in \\mathcal{C}(X) \\times \\mathcal{C}(Y)} \\int_X \\varphi(x) \\mathrm{d} \\mu(x) + \\int_Y \\psi(y) \\mathrm{d} \\nu(y) \\\\\n",
    "&\\text{subject to } \\varphi + \\psi \\leq c \\\\\n",
    "\\end{align*}$$\n",
    "\n",
    "Strong duality for optimal transport can be derived in two steps from the work we've done:\n",
    "- Derive it in the case of discrete optimal transport, using the LP formulation and LP duality, as we did earlier.\n",
    "- Use the weak density of discrete measures in $\\mathcal{P}(X)$ when $X$ is compact.\n",
    "See [these lecture notes](https://lchizat.github.io/files2020ot/lecture2.pdf) for the details.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4544801",
   "metadata": {},
   "source": [
    "## Solving the discrete optimal transport\n",
    "\n",
    "Many algorithms for solving discrete optimal transport problems are special-cases of more general algorithms, either able to tackle general convex problems or atleast general linear programming problems. We describe the newtork simplex algorithm, a much quicker specialization of the simplex algorithm (that solves LP problems in general), designed specifically to solve LP given by an optimal transport problem.\n",
    "\n",
    "Some preliminary remarks that will be useful to think about the newtork simplex algorithm:\n",
    "\n",
    "Remark 1: We say a matrix $P$ and a pair of dual vectors $(\\mu, \\nu)$ are complementary if whenever $P_{i, j} > 0$, $C_{i, j} = \\mu_i + \\nu_j$. Weak duality enforces enforces that if the elements of complementary pair are respectively primal feasible and dual feasible, then they are also primal and dual optimal. Indeed we already know $p^* \\geq d^*$ and complementarity ensures\n",
    "$$p^* \\leq \\langle P, C \\rangle = \\langle P, \\mu \\oplus \\nu \\rangle = \\langle P, A^T \\begin{pmatrix} \\mu \\\\ \\nu \\end{pmatrix} \\rangle = \\langle \\begin{pmatrix} \\bm{a} \\\\ \\bm{b} \\end{pmatrix}, \\begin{pmatrix} \\mu \\\\ \\nu \\end{pmatrix} \\rangle = \\langle \\bm{a}, f \\rangle + \\langle \\bm{b}, g \\rangle \\leq d^*$$\n",
    "\n",
    "Remark 2: We know, by convexity of the problem and linearity of the objective, that we can look for a solution as an extremal point of $U(\\bm{a}, \\bm{b})$. It turns out the extremal points have some explicit structure we will partially describe later.\n",
    "\n",
    "Remark 3: We can think of a discrete optimal transport as a flow graph problem. Consider a complete bipartite graph $G(\\bm{a}, \\bm{b})$ with vertices $\\text{Supp}(\\bm{a}) \\sqcup \\text{Supp}(\\bm{b})$ \n",
    "\n",
    "![Bipartite flow graph](bipartite_flow_graph.png)\n",
    "\n",
    "For clarity, we will enumerate the vertices associated with the vector $\\bm{a}$ by letter $i$ and thos associated with $\\bm{b}$ by $j$. Label each vertex with its weight given by the respective probability vector, that is the $i$-th vertex gets weight $\\bm{a}_i$ and the $j$-th label gets weight $\\bm{b}_j$.\n",
    "\n",
    "The data of a feasible matrix $P$ is then a *flow* on the graph, i.e the information, for each pair $(a_i, b_j)$, of how much weight should translate from one to the other, with the constraint that the total flow out of vertex $i$ is equal to $\\bm{a}_i$ and the total flow into the vertex $j$ is equal to $\\bm{b}_j$. The flow going from $i$ to $j$ is given by $P_{i, j}$. We denote by $G(P)$ the subgraph of $G(\\bm{a}, \\bm{b})$ that only retains the edges with nonzero flow.\n",
    "\n",
    "### On extremal points of $U(\\bm{a}, \\bm{b})$\n",
    "\n",
    "Before getting to the actual algorithm, we need to cover one last point:\\\n",
    "We show in this section that if $P$ is extremal in $U(\\bm{a}, \\bm{b})$ then $G(P)$ has no cycle. In particular, $P$ must have has at most $n + m - 1$ nonzero entries.\n",
    "\n",
    "Assume that $G(P)$ has a cycle, we will create two feasible matrices $H$ and $Q$ such that $P = \\frac{H + Q}{2}$. That cycle is given by a sequence of edges $i_1 \\to j_1 \\to \\ldots \\to i_k \\to j_k \\to i_1$ where the vertices are all distinct (except for $i_1$ which occurs twice). Let $\\varepsilon > 0$ smaller than all nonzero entries of $P$, and consider the matrix $E$ given by  \n",
    "$$\n",
    "E_{i, j} := \\begin{cases}\n",
    "\\varepsilon \\text{ if } i \\to j \\text{ appears in the cycle}\\\\\n",
    "-\\varepsilon \\text{ if } j \\to i \\text{ appears in the cycle} \\\\\n",
    "0 \\text{ otherwise}\n",
    "\\end{cases}\n",
    "$$\n",
    "and define the two matrices $H = P - E$ and $Q = P + E$. Since $\\varepsilon$ smaller than all nonzero entries of $P$, we still have $H, Q \\succ 0$. Moreover, each line and each column of $E$ is either full of zeroes and containg exactly one $\\varepsilon$ entry and one $-\\varepsilon$ entry, the rest also being zeroes. It follows that $E \\bm{1}_n = 0$ and $E^T \\bm{1}_m = 0$. This shows that $H, Q \\in U(\\bm{a}, \\bm{b})$, and $P$ cannot be an extremal point.\n",
    "\n",
    "### A global description of the algorithm\n",
    "\n",
    "The algorithm will build a sequence of complementary pairs $(P^t, (\\mu^t, \\nu^t))_{t \\in \\mathbb{N}}$ such that:\n",
    "- $P^t$ is in $U(\\bm{a}, \\bm{b})$ and it has atmost $n + m - 1$ nonzero entries\n",
    "- $(\\mu^t, \\nu^t)$ ïs complementary to $P^t$.\n",
    "- Some kind of objective value increases as $t$ increases.\n",
    " \n",
    "Here's our to-do list:\n",
    "\n",
    "Step 1: Find a point $P^0 \\in U(\\bm{a}, \\bm{b})$ to initialize the algorithm.\\\n",
    "Step 2: Build vectors $(\\mu^0, \\nu^0)$ so that $P_0$ and $(\\mu^0, \\nu^0)$ are complementary.\\\n",
    "Step 3: Describe an update step to improve the situation.\\\n",
    "Final step: Why this actually work and why is it fast (implemented properly) ?\n",
    "\n",
    "### Step 1: Finding an initial point\n",
    "\n",
    "We will build a feasible matrix $P^0$ by saturating one of the row/column constraint at each step. Let's look at some code directly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "id": "69505ab2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def initial_feasible(a, b):\n",
    "    \"\"\"\n",
    "    Input:\n",
    "    a is a probability vector of size n\n",
    "    b is a probability vector of size m\n",
    "    \n",
    "    Output:\n",
    "    P a matrix of size (n x m) such that P € U(a, b)\n",
    "    and P has atmost n + m - 1 nonzero entries\n",
    "    \"\"\"\n",
    "\n",
    "    n, m = a.shape[0], b.shape[0]\n",
    "    P = np.zeros((n, m)) # Initialize a (n x m) shaped zero-matrix\n",
    "\n",
    "    i, j = 0, 0\n",
    "    # Respectively row and column margin until saturation\n",
    "    # At the start, it is simply the first coefficients.\n",
    "    r, c = a[0], b[0]\n",
    "\n",
    "    # Iterate until we saturated all the rows or all the columns \n",
    "    while i < n and j < m:\n",
    "        # We saturate the row/column that's the closest to be saturated \n",
    "        t = min(r, c)\n",
    "        P[i, j] = t\n",
    "        # Update how much margin we got until constraint is saturated\n",
    "        r, c = r - t, c - t\n",
    "        \n",
    "        # If the row constraint is saturated, switch to next row, and update the saturation coefficient.\n",
    "        if r == 0:\n",
    "            i += 1\n",
    "            if i < n:\n",
    "                r = a[i]\n",
    "        # Same thing. Note that both conditionals can be true at the same time: for instance if a[0] = b[0]\n",
    "        if c == 0:\n",
    "            j += 1\n",
    "            if j < m:\n",
    "                c = b[j]\n",
    "    return P\n",
    "\n",
    "# We define some arbitrary probability vectors to test our code\n",
    "n, m = 3, 3\n",
    "a = np.random.rand(3) \n",
    "b = np.random.rand(3)\n",
    "a /= a.sum()\n",
    "b /= b.sum()\n",
    "\n",
    "P_0 = initial_feasible(a, b)\n",
    "\n",
    "assert np.allclose(P_0 @ np.ones_like(b), a)\n",
    "assert np.allclose(P_0.T @ np.ones_like(a), b)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2243372",
   "metadata": {},
   "source": [
    "\n",
    "### Step 2: Finding a complementary dual pair for $P_0$\n",
    "Now, we need to find a complementary dual pair for $P_0$. Let's define the support of a matrix as:\n",
    "$$\\text{Supp}(P) = \\{(i, j) | P_{i, j} \\neq 0 \\}$$\n",
    "Using the notation, we are looking for vectors $(\\mu, \\nu) \\in \\mathbb{R}^n \\times \\mathbb{R}^m$ satisfying the linear system\n",
    "$$\\mu_i + \\nu_j = C_{i, j} \\text{ for every } (i, j) \\in \\text{Supp}(P_0)$$\n",
    "We have $n + m$ variables but at most $n + m - 1$ constraints, so a solution to this system is never unique. The idea is that we will traverse the forest given by $\\text{Supp}(P_0)$, for each tree pick a random root and set its associated dual variable to $0$ and then propagate constraints using a DFS/BFS. This ensures that we lift any undetermination.\n",
    "\n",
    "While doing that forest traversal, we build the set of trees that compose that forest. We could avoid it by working with trees implicitly but after trying it, the code was spaghetti so I switched to something more explicit. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "id": "9ed1775c",
   "metadata": {},
   "outputs": [],
   "source": [
    "class Tree:\n",
    "    def __init__(self, label, childs = None):\n",
    "        if childs is None:\n",
    "            childs = []\n",
    "        self.label = label\n",
    "        self.childs = childs\n",
    "    \n",
    "    def push_child(self, child):\n",
    "        self.childs.append(child)\n",
    "\n",
    "    def contains(self, v):\n",
    "        return self.label == v or any(c.contains(v) for c in self.childs)\n",
    "    \n",
    "    def path_from_root(self, v):\n",
    "        if self.label == v:\n",
    "            return [v]\n",
    "        else:\n",
    "            for c in self.childs:\n",
    "                path = c.path_from_root(v)  \n",
    "                if path is not None:\n",
    "                    return [self.label] + path\n",
    "\n",
    "    def lca(self, v, w):\n",
    "        \n",
    "        if self.label == v or self.label == w:\n",
    "            return self\n",
    "        \n",
    "        subtree = None\n",
    "        for child in self.childs:\n",
    "            r =  child.lca(v, w)\n",
    "            if r is not None:\n",
    "                if subtree is None:\n",
    "                    subtree = r\n",
    "                else:\n",
    "                    return self\n",
    "        return subtree\n",
    "        \n",
    "    def __repr__(self, depth=0):\n",
    "        s = depth*'\\t' + f\"Value: {self.label}\\n\"\n",
    "        for c in self.childs:\n",
    "            s += f\"{c.__repr__(depth+1)}\\n\"\n",
    "        return  s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "id": "4c7056a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found vectors mu: [ 0. -1. -2.], nu: [2. 5. 4.]\n",
      "The connected components are: \n",
      "[Value: (0, 'left')\n",
      "\tValue: (0, 'right')\n",
      "\t\tValue: (1, 'left')\n",
      "\t\t\tValue: (1, 'right')\n",
      "\n",
      "\t\t\tValue: (2, 'right')\n",
      "\t\t\t\tValue: (2, 'left')\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "def mat_support(P):\n",
    "    \"\"\"\n",
    "    Input:\n",
    "    P is a (n x m) matrix\n",
    "\n",
    "    Output:\n",
    "    A set s of all indices (i, j) such that P[i, j] != 0\n",
    "    \"\"\"\n",
    "    n, m = P.shape\n",
    "\n",
    "    support = set()\n",
    "    for i in range(n):\n",
    "        for j in range(m):\n",
    "            if P[i, j] > 0:\n",
    "                support.add((i, j))\n",
    "    return support\n",
    "\n",
    "def associated_dual_pair(n, m, edges_set, C):\n",
    "    \"\"\"\n",
    "    Input:\n",
    "    n, m are two integers\n",
    "\n",
    "    - edge_set is a set of edges of the form (v_idx, side) where:\n",
    "        - side is either 'left' or 'right'\n",
    "        - v_idx is an integer with 0 <= v_idx < n if side = 'left' and 0 <= v_idx < m otherwise\n",
    "    - C is a (n x m) matrix  \n",
    "\n",
    "    Output:\n",
    "    A triplet (mu, nu, trees) where:\n",
    "    - (mu, nu) is a pair of vectors of size n and m respectively satisfying:\n",
    "        If P[i, j] > 0 then mu[i] + nu[j] = C[i, j]\n",
    "    - trees is a list fo Tree objects are defined above, built by doing the DFS along the forest given by the edges set\n",
    "    \"\"\"\n",
    "      \n",
    "    # We store the ancestors as a dictionary associating to each vertex the root\n",
    "    # of the tree defined by the DFS, so that two vertex are in the same c.c iff they have the same\n",
    "    # associated root. If a vertex is root, its ancestor is None\n",
    "    \n",
    "    trees = []\n",
    "    mu = np.zeros(n)\n",
    "    nu = np.zeros(m)\n",
    "    \n",
    "    \n",
    "    seen = set() # vertices we already visited\n",
    "    def dfs(v_idx, side, node):\n",
    "        if (v_idx, side) in seen:\n",
    "            return\n",
    "\n",
    "        seen.add((v_idx, side))\n",
    "\n",
    "        if side == 'left':\n",
    "            for j in range(m):\n",
    "                if (v_idx, j) in edges_set and (j, 'right') not in seen:\n",
    "                    # Since P[v_idx, j] > 0, we must have nu[j] = C[i, j] - mu[v_idx], \n",
    "                    # and the value of mu[v_idx] is already determined at that point!\n",
    "                    nu[j] = C[v_idx, j] - mu[v_idx]\n",
    "                    \n",
    "                    curr_node = Tree((j, 'right'))\n",
    "                    node.push_child(curr_node)\n",
    "                    \n",
    "                    dfs(j, 'right', curr_node)\n",
    "        \n",
    "        if side == 'right':\n",
    "            for i in range(n):\n",
    "                if (i, v_idx) in edges_set and (i, 'left') not in seen:\n",
    "                    mu[i] = C[i, v_idx] - nu[v_idx]\n",
    "\n",
    "                    curr_node = Tree((i, 'left'))\n",
    "                    node.push_child(curr_node)\n",
    "\n",
    "                    dfs(i, 'left', curr_node)    \n",
    "     \n",
    "    for i in range(n):\n",
    "        if not (i, 'left') in seen:\n",
    "            root = Tree((i, 'left'))\n",
    "            mu[i] = 0 # initialize root of the tree at 0\n",
    "            dfs(i, 'left', root) # traverse the tree\n",
    "            trees.append(root)\n",
    "    for j in range(m):\n",
    "        if not (j, 'right') in seen:\n",
    "            root = Tree((j, 'right'))\n",
    "            nu[j] = 0\n",
    "            dfs(j, 'right', root)\n",
    "            trees.append(root)\n",
    "\n",
    "    return (mu, nu, trees)\n",
    "\n",
    "C = np.array([[2, 3, 2], [1, 4, 3], [1, 1, 2]])\n",
    "mu, nu, roots = associated_dual_pair(3, 3, mat_support(P_0), C)\n",
    "\n",
    "# We test our code on a random example\n",
    "for i in range(3):\n",
    "    for j in range(3):\n",
    "        if P_0[i, j] > 0:\n",
    "            assert abs(mu[i] + nu[j] - C[i, j]) < 1e-7\n",
    "\n",
    "print(f\"Found vectors mu: {mu}, nu: {nu}\")\n",
    "print(f\"The connected components are: \\n{roots}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eca3e0b8",
   "metadata": {},
   "source": [
    "### Step 3: Updating the complementary pair\n",
    "\n",
    "Now we need to update the matrix $P^t$. The idea is that either:\n",
    "- $(\\mu, \\nu)$ is dual feasible and we are done\n",
    "- There exists some $(i_0, j_0) \\in \\text{Supp}(P)$ such that $\\mu_{i_0} + \\nu_{j_0} > C_{i_0, j_0}$. \n",
    "\n",
    "In this case, we add the edge $(i_0, j_0)$ to the graph $G(P)$ and there are two cases:\n",
    "- The vertices $i_0$ and $j_0$ are in different connected components in $G(P)$. In this case, they merge two tree into a single one, and we get a new graph $G'$, on which we can simply reapply the dual pair routine to get a new dual pair.\n",
    "- The vertices $i_0$ and $j_0$ are in the same connected component in $G(P)$ and the addition of $(i_0, j_0)$ creates a cycle, oriented so that $(i_0, j_0)$ is positive. This cycle has an edge with minimal negative flow, which we will remove. We update the matrix $P$ accordingly to get a new matrix $\\overline{P}$ such that $\\langle \\overline{P}, C \\rangle < \\langle P, C \\rangle$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "id": "cc535b6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def network_simplex(a, b, C):\n",
    "    \"\"\"\n",
    "    Input:\n",
    "    a is a probability vector of size n\n",
    "    b is a probability vector of size m\n",
    "    C is a cost matrix of shape (n x m)\n",
    "    \n",
    "    Output:\n",
    "    A triplet (P, mu, nu) such that P and (mu, nu) are feasible complementary solutions.\n",
    "    They solve respectively the primal and dual discrete optimal transport problems:\n",
    "    - Primal: minimize <P, C> s.t P @ 1_m = a and P^T @ 1_n = b\n",
    "    - Dual: maximize <a, mu> + <b, nu> s.t mu[i] + nu[j] <= C[i, j]\n",
    "    \"\"\"\n",
    "\n",
    "    n, m = a.shape[0], b.shape[0]\n",
    "    P = initial_feasible(a, b)\n",
    "    edges_set = mat_support(P)\n",
    "\n",
    "    while True:\n",
    "        mu, nu, trees = associated_dual_pair(n, m, edges_set, C)\n",
    "\n",
    "        is_dual_feasible = True\n",
    "        i0, j0 = None, None\n",
    "\n",
    "        # check for dual feasibility. If (mu, nu) is not dual feasible, find the smallest\n",
    "        # pair (i0, j0) for which dual feasibility fails.\n",
    "\n",
    "        for i in range(n):\n",
    "            for j in range(m):\n",
    "               if 1e-7 < mu[i] + nu[j] - C[i, j]:\n",
    "                   is_dual_feasible = False\n",
    "                   i0, j0 = i, j\n",
    "                   break\n",
    "            if not is_dual_feasible:\n",
    "                break\n",
    "        \n",
    "        # We found a dual feasible solution, we are done\n",
    "        if is_dual_feasible:\n",
    "            break\n",
    "        \n",
    "        # If the update is degenerate, just add the edge in the graph and repeat the above operations\n",
    "        for tree in trees:\n",
    "            if tree.contains((i0, 'left')):\n",
    "                break\n",
    "\n",
    "        if not tree.contains((j0,'right')):\n",
    "            edges_set.add((i0, j0))\n",
    "            continue\n",
    "\n",
    "        \n",
    "        # Use the tree structure:\n",
    "        # The path from two vertex of a tree is the concatenation of \n",
    "        # the paths to their lowest common ancestor \n",
    "        lca = tree.lca((i0, 'left'), (j0, 'right'))\n",
    "        path_to_i = lca.path_from_root((i0, 'left'))\n",
    "        path_to_j = lca.path_from_root((j0, 'right'))\n",
    "\n",
    "        cycle = [(i0, 'left')] + path_to_j[::-1] + path_to_i[1:]\n",
    "\n",
    "        # find theta and leaving edge\n",
    "        theta = float('inf')\n",
    "        i_rmv, j_rmv = None, None\n",
    "        for e_idx in range(1, len(cycle) - 1, 2):  # negative edges\n",
    "            j, _ = cycle[e_idx]\n",
    "            i, _ = cycle[e_idx+1]\n",
    "            \n",
    "            if P[i, j] < theta:\n",
    "                theta = P[i, j]\n",
    "                i_rmv, j_rmv = i, j\n",
    "\n",
    "        # update P along the cycle\n",
    "        for e_idx in range(len(cycle) - 1):\n",
    "            direction = 1 if e_idx % 2 == 0 else -1\n",
    "            v1, _ = cycle[e_idx]\n",
    "            v2, _ = cycle[e_idx+1]\n",
    "            if direction == 1:\n",
    "                i, j = v1, v2\n",
    "            else:\n",
    "                i, j = v2, v1\n",
    "\n",
    "            P[i, j] += direction * theta\n",
    "                \n",
    "        edges_set.add((i0, j0))\n",
    "        edges_set.remove((i_rmv, j_rmv))\n",
    "    return (P, mu, nu)\n",
    "\n",
    "# We check for primal and dual feasibility + complementarity, to  our code works properly\n",
    "\n",
    "n,m = 3,4\n",
    "\n",
    "for _ in range(50):\n",
    "    a = np.random.rand(n)\n",
    "    b = np.random.rand(m)\n",
    "    a /= a.sum()\n",
    "    b /= b.sum()\n",
    "    C = np.random.rand(n, m)\n",
    "    P, mu, nu = network_simplex(a, b, C)\n",
    "    for i in range(n):\n",
    "        for j in range(m):\n",
    "            assert mu[i] + nu[j] <= C[i, j]\n",
    "            if P[i, j] > 0:\n",
    "                assert abs(mu[i] + nu[j] - C[i, j]) < 1e-7\n",
    "\n",
    "    assert np.allclose(P @ np.ones_like(b), a)\n",
    "    assert np.allclose(P.T @ np.ones_like(a), b)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a235377",
   "metadata": {},
   "source": [
    "\n",
    "### Correctness, efficiency and complexity\n",
    "\n",
    "TODO"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
No results found