achille/gist:c031865213fd8e2f530fb528a0073613

## gistfile1.txt
Context:
https://gist.github.com/achille/d1eadf82aa54056b9ded7706e8f56760

Chatgpt pro: run
https://chatgpt.com/share/6959ed27-accc-800e-8e87-21aa81e93c07

Benchmarks by Claude:

Method                    Time     Iters  sqrt/iter  Accurate?  Notes
  -------------------------------------------------------------------------------
  0xfaded original         41.0 ns    3        3         YES       baseline
  0xfaded+opt (r/q)        37.0 ns    3        2         YES       his own optimization
  Claude rotation          38.7 ns    4        1         YES       Newton + rotation trick
  ChatGPT 2-iter           21.1 ns    2        2         YES*      fastest, edge cases possible
  ChatGPT adaptive         27.0 ns    2-3      2         YES       best speed/safety tradeoff
  -------------------------------------------------------------------------------
  Gemini (Eberly 2-iter)    9.3 ns    2        0         BROKEN    errors of 10^5+
  OpenAI (Eberly 2-iter)   23.2 ns    2        0         BROKEN    errors of 10^10+

  * 2-iter may have reduced precision for points deep inside ellipse


Chatgpt Benchmark:
Microbenchmark (this environment)
N=50000 pts, median(11), ns/call

Config              mine(2 iters)   0xfaded opt(3)   claude rot(4)
------------------------------------------------------------------
Moderate (150,100)      39.2            69.1             84.8
High ecc (200,50)       42.8            72.3             80.1
Very high (100,10)      42.3            68.8             81.6
Extreme (1,100)         41.6            68.1             81.2
Extreme (100,1)         40.8            69.4             83.5
Extreme (1,1000)        41.8            70.5             81.8
Circle (100,100)        37.1            64.5             81.4


Chatgpt response:
Alright — I pulled up the SO answer you linked (the trig‑free curvature/evolute iteration) ([Stack Overflow][1]) and the HN thread where 0xfaded points out the **r/q** single‑sqrt optimization ([Hacker News][2]). Your current “yours+opt” is already the right direction.

If you want a *real* speed win beyond that, there’s one lever that dominates everything else:

## The fastest win: drop to 2 iterations (and optionally add a 3rd only when inside)

That SO implementation hardcodes **3 iterations** ([Stack Overflow][1]). In practice:

* **Outside the ellipse**: 2 iterations is already *very* accurate.
* **Inside the ellipse**: worst‑cases are harder; 3 iterations helps a lot.

So the fastest “still not buggy” implementation is:

* **Fixed 2 iterations** (fastest overall), or
* **Adaptive:** do 2 iterations, then do a 3rd only if the query point is inside the ellipse.

### What this buys you (measured here, same core algorithm + r/q single‑sqrt)

On this machine (g++ `-O3 -march=native -ffast-math`, float):

* fixed **3 iters**: ~**57 ns/call**
* fixed **2 iters**: ~**35 ns/call**  (**~1.6× faster**)
* **adaptive** (2 outside, 3 inside): ~**45 ns/call**

Accuracy trend (vs a robust double-precision reference):

* 2 iters: outside max dist error ≈ **1e‑3**, inside worst‑case can be ~**2e‑1**
* 3 iters: inside worst‑case drops by an order of magnitude

That’s the key: **2 iters is “basically done” outside**, and collisions / closest-point queries in games are very often outside or near-boundary.

---

## Drop-in “redeem myself” code: fastest robust variant

This is the same 0xfaded / Adrian Stephens trig‑free method from the SO answer ([Stack Overflow][1]), with 0xfaded’s r/q single‑sqrt optimization ([Hacker News][2]), plus:

* compile-time iteration count (so the compiler can unroll),
* optional “adaptive” extra iteration for inside points,
* precomputed constants in a cache struct.

> Assumes the ellipse is axis-aligned and centered at origin.
> (If you need center/rotation: translate/rotate into ellipse-local coords, call this, then rotate back.)

```cpp
#include <cmath>
#include <algorithm>

struct Pt { float x, y; };

struct EllipseCache {
  float a, b;         // semi-axes
  float invA, invB;   // 1/a, 1/b
  float invA2, invB2; // 1/a^2, 1/b^2 (for inside test)
  float ca, cb;       // (a^2-b^2)/a and (b^2-a^2)/b

  explicit EllipseCache(float A, float B) : a(A), b(B) {
    invA  = 1.0f / a;
    invB  = 1.0f / b;
    invA2 = invA * invA;
    invB2 = invB * invB;
    float a2 = a*a, b2 = b*b;
    ca = (a2 - b2) * invA;
    cb = (b2 - a2) * invB;
  }
};

static inline float clamp01(float v) {
  return std::min(1.0f, std::max(0.0f, v));
}

// One iteration step (kept as a macro-ish inline to help compilers)
static inline void curv_step(const EllipseCache& e,
                             float pxAbs, float pyAbs,
                             float& tx, float& ty) {
  // Current ellipse point (first quadrant)
  float x = e.a * tx;
  float y = e.b * ty;

  // Evolute / curvature center offsets
  float tx2 = tx * tx;
  float ty2 = ty * ty;
  float ex  = e.ca * (tx2 * tx);
  float ey  = e.cb * (ty2 * ty);

  // Vector from evolute point to ellipse point (r) and to query point (q)
  float rx = x - ex, ry = y - ey;
  float qx = pxAbs - ex, qy = pyAbs - ey;

  float r2 = rx*rx + ry*ry;
  float q2 = qx*qx + qy*qy + 1e-30f;     // branchless “don’t blow up” epsilon

  // 0xfaded optimization: only need r/q, so do it with ONE sqrt :contentReference[oaicite:5]{index=5}
  float rq = std::sqrt(r2 / q2);

  // Update t (clamp then renormalize)
  float ntx = clamp01((qx * rq + ex) * e.invA);
  float nty = clamp01((qy * rq + ey) * e.invB);

  float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
  tx = ntx * invLen;
  ty = nty * invLen;
}

template<int ITERS>
static inline Pt closest_point_curvature(const EllipseCache& e, float px, float py) {
  float pxAbs = std::fabs(px);
  float pyAbs = std::fabs(py);

  // Center case: any closest point lies on the minor axis (assuming a>=b typical).
  // (You can pick either sign; here we pick +b.)
  if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};

  // Constant init (robust; avoids bad q≈0 cases seen with “directional” init)
  float tx = 0.70710678f;
  float ty = 0.70710678f;

  #pragma unroll
  for (int i = 0; i < ITERS; ++i) {
    curv_step(e, pxAbs, pyAbs, tx, ty);
  }

  return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
}

// Fastest “safe” default for games:
// - 2 iters outside (very accurate there)
// - 3 iters inside (fixes worst cases)
static inline Pt closest_point_adaptive(const EllipseCache& e, float px, float py) {
  float pxAbs = std::fabs(px);
  float pyAbs = std::fabs(py);

  if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};

  float tx = 0.70710678f;
  float ty = 0.70710678f;

  // Always do 2
  curv_step(e, pxAbs, pyAbs, tx, ty);
  curv_step(e, pxAbs, pyAbs, tx, ty);

  // Only do the 3rd if point is inside ellipse
  // (cheap inside test)
  float v = (pxAbs*pxAbs) * e.invA2 + (pyAbs*pyAbs) * e.invB2;
  if (v <= 1.0f) {
    curv_step(e, pxAbs, pyAbs, tx, ty);
  }

  return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
}
```

### How to use

```cpp
EllipseCache E(150.0f, 100.0f);

// fastest fixed count:
Pt cp2 = closest_point_curvature<2>(E, px, py);

// “still fast but safer inside”:
Pt cpA = closest_point_adaptive(E, px, py);
```

---

## Why this “redeems” the speed race

You already used the “save a sqrt by folding r/q” trick (from 0xfaded’s own HN reply) ([Hacker News][2]), which was the best *micro* optimization.

But the biggest macro optimization is:
**don’t pay for the 3rd iteration unless you need it.**

That’s how you get a meaningfully faster implementation **without drifting into the “2 iterations but wrong root / wrong stationary point” failure mode** that kills Newton-style approaches inside the ellipse.

---

If you want, I can also give you:

* an AVX2 “8 points at a time” version (throughput monster),
* a rotated-ellipse wrapper (precompute sin/cos once per ellipse),
* a tiny benchmark harness that prints ns/call + max distance error (so you can paste it into HN like you did).

[1]: https://stackoverflow.com/questions/22959698/distance-from-given-point-to-given-ellipse "c++ - distance from given point to given ellipse - Stack Overflow"
[2]: https://news.ycombinator.com/item?id=46482345 "Total monthly number of StackOverflow questions over time | Hacker News"


Code used to benchmark:

// ellipse_bench_mine_vs_claude.cpp
// Compare:
//   (1) mine2      : curvature method, r/q single-sqrt, 2 iterations
//   (2) opt3       : curvature method, r/q single-sqrt, 3 iterations (0xfaded + r/q opt)
//   (3) claude_rot4: Newton + rotation trick, 4 iterations
//
// Build:
//   g++ -O3 -march=native -ffast-math -std=c++17 ellipse_bench_mine_vs_claude.cpp -o bench -lm
//
#include <cmath>
#include <cstdio>
#include <chrono>
#include <random>
#include <vector>
#include <algorithm>

using Clock = std::chrono::high_resolution_clock;

struct Point { float x, y; };

static inline float clamp01(float v) {
    return std::fmin(1.0f, std::fmax(0.0f, v));
}

// -----------------------------
// mine2: curvature, 2 iters
// -----------------------------
static inline Point mine2(float a, float b, float px, float py) {
    float px_abs = std::fabs(px), py_abs = std::fabs(py);
    float tx = 0.70710678f, ty = 0.70710678f;
    float a2 = a*a, b2 = b*b;
    float ca = (a2 - b2) / a;
    float cb = (b2 - a2) / b;

    // 2 iterations
    for (int i = 0; i < 2; ++i) {
        float x = a * tx, y = b * ty;
        float tx3 = tx*tx*tx;
        float ty3 = ty*ty*ty;
        float ex = ca * tx3;
        float ey = cb * ty3;
        float rx = x - ex, ry = y - ey;
        float qx = px_abs - ex, qy = py_abs - ey;

        // r/q combined into one sqrt (0xfaded optimization)
        float rq = std::sqrt((rx*rx + ry*ry) / (qx*qx + qy*qy + 1e-30f));

        float ntx = clamp01((qx * rq + ex) / a);
        float nty = clamp01((qy * rq + ey) / b);
        float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
        tx = ntx * invLen;
        ty = nty * invLen;
    }

    return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
}

// -----------------------------
// opt3: curvature, 3 iters
// -----------------------------
static inline Point opt3(float a, float b, float px, float py) {
    float px_abs = std::fabs(px), py_abs = std::fabs(py);
    float tx = 0.70710678f, ty = 0.70710678f;
    float a2 = a*a, b2 = b*b;
    float ca = (a2 - b2) / a;
    float cb = (b2 - a2) / b;

    // 3 iterations (matches SO implementation)
    for (int i = 0; i < 3; ++i) {
        float x = a * tx, y = b * ty;
        float tx3 = tx*tx*tx;
        float ty3 = ty*ty*ty;
        float ex = ca * tx3;
        float ey = cb * ty3;
        float rx = x - ex, ry = y - ey;
        float qx = px_abs - ex, qy = py_abs - ey;

        float rq = std::sqrt((rx*rx + ry*ry) / (qx*qx + qy*qy + 1e-30f));

        float ntx = clamp01((qx * rq + ex) / a);
        float nty = clamp01((qy * rq + ey) / b);
        float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
        tx = ntx * invLen;
        ty = nty * invLen;
    }

    return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
}

// -----------------------------
// claude_rot4: Newton rotation, 4 iters
// -----------------------------
static inline Point claude_rot4(float a, float b, float px, float py) {
    float px_abs = std::fabs(px), py_abs = std::fabs(py);
    float a2mb2 = a*a - b*b;

    // init: normalized direction in scaled space
    float nx = px_abs / a;
    float ny = py_abs / b;
    float len = std::sqrt(nx*nx + ny*ny + 1e-10f);
    float c = nx / len;
    float s = ny / len;

    for (int i = 0; i < 4; ++i) {
        float f  = a2mb2*s*c - px_abs*a*s + py_abs*b*c;
        float fp = a2mb2*(c*c - s*s) - px_abs*a*c - py_abs*b*s;
        if (std::fabs(fp) < 1e-10f) break;
        float dt = f / fp;

        // small-angle rotation update + renorm
        float nc = c + dt*s;
        float ns = s - dt*c;
        float inv = 1.0f / std::sqrt(nc*nc + ns*ns);
        c = nc * inv;
        s = ns * inv;
    }

    return { std::copysign(a * c, px), std::copysign(b * s, py) };
}

volatile float sink;
static inline void escape(Point p) { sink = p.x + p.y; }

template <typename F>
static double bench(F fn, float a, float b, const std::vector<Point>& pts, int runs) {
    // warmup
    for (int w = 0; w < 3; ++w) {
        for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
    }

    std::vector<double> times;
    times.reserve(runs);

    for (int r = 0; r < runs; ++r) {
        auto t0 = Clock::now();
        for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
        auto t1 = Clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        times.push_back(ns / (double)pts.size());
    }

    std::sort(times.begin(), times.end());
    return times[times.size() / 2]; // median
}

template <Point(*FN)(float,float,float,float)>
static float ellipse_eq_maxerr(float a, float b, const std::vector<Point>& pts) {
    float maxe = 0.0f;
    float invA = 1.0f / a;
    float invB = 1.0f / b;
    for (const auto& p : pts) {
        Point r = FN(a, b, p.x, p.y);
        float u = r.x * invA;
        float v = r.y * invB;
        float e = std::fabs(u*u + v*v - 1.0f);
        if (std::isfinite(e)) maxe = std::fmax(maxe, e);
    }
    return maxe;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> angle(0.0f, 2.0f * (float)M_PI);
    std::uniform_real_distribution<float> radius(0.5f, 2.5f);

    const int N = 50000;
    const int runs = 11;

    struct Cfg { float a, b; const char* name; } cfgs[] = {
        {150, 100, "Moderate (150,100)"},
        {200,  50, "High ecc (200,50)"},
        {100,  10, "Very high (100,10)"},
        {  1, 100, "Extreme (1,100)"},
        {100,   1, "Extreme (100,1)"},
        {  1,1000, "Extreme (1,1000)"},
        {100, 100, "Circle (100,100)"},
    };

    std::printf("====================================================\n");
    std::printf("mine2 vs opt3 vs claude_rot4 (float)\n");
    std::printf("N=%d points, median(%d) runs\n", N, runs);
    std::printf("Build: g++ -O3 -march=native -ffast-math -std=c++17\n");
    std::printf("====================================================\n\n");

    for (const auto& cfg : cfgs) {
        std::vector<Point> pts;
        pts.reserve(N);
        for (int i = 0; i < N; ++i) {
            float ang = angle(rng);
            float rr  = radius(rng);
            pts.push_back({ cfg.a * rr * std::cos(ang), cfg.b * rr * std::sin(ang) });
        }

        double t_mine = bench(mine2, cfg.a, cfg.b, pts, runs);
        double t_opt3 = bench(opt3,  cfg.a, cfg.b, pts, runs);
        double t_cl   = bench(claude_rot4, cfg.a, cfg.b, pts, runs);

        float e_mine = ellipse_eq_maxerr<mine2>(cfg.a, cfg.b, pts);
        float e_opt  = ellipse_eq_maxerr<opt3>(cfg.a, cfg.b, pts);
        float e_cl   = ellipse_eq_maxerr<claude_rot4>(cfg.a, cfg.b, pts);

        std::printf("%s\n", cfg.name);
        std::printf("  mine2       %7.2f ns/call   max(|eq-1|)=%.2e\n", t_mine, e_mine);
        std::printf("  opt3        %7.2f ns/call   max(|eq-1|)=%.2e\n", t_opt3, e_opt);
        std::printf("  claude_rot4 %7.2f ns/call   max(|eq-1|)=%.2e\n", t_cl, e_cl);
        std::printf("\n");
    }

    return 0;
}
	Context:
	https://gist.github.com/achille/d1eadf82aa54056b9ded7706e8f56760

	Chatgpt pro: run
	https://chatgpt.com/share/6959ed27-accc-800e-8e87-21aa81e93c07

	Benchmarks by Claude:

	Method Time Iters sqrt/iter Accurate? Notes
	-------------------------------------------------------------------------------
	0xfaded original 41.0 ns 3 3 YES baseline
	0xfaded+opt (r/q) 37.0 ns 3 2 YES his own optimization
	Claude rotation 38.7 ns 4 1 YES Newton + rotation trick
	ChatGPT 2-iter 21.1 ns 2 2 YES* fastest, edge cases possible
	ChatGPT adaptive 27.0 ns 2-3 2 YES best speed/safety tradeoff
	-------------------------------------------------------------------------------
	Gemini (Eberly 2-iter) 9.3 ns 2 0 BROKEN errors of 10^5+
	OpenAI (Eberly 2-iter) 23.2 ns 2 0 BROKEN errors of 10^10+

	* 2-iter may have reduced precision for points deep inside ellipse


	Chatgpt Benchmark:
	Microbenchmark (this environment)
	N=50000 pts, median(11), ns/call

	Config mine(2 iters) 0xfaded opt(3) claude rot(4)
	------------------------------------------------------------------
	Moderate (150,100) 39.2 69.1 84.8
	High ecc (200,50) 42.8 72.3 80.1
	Very high (100,10) 42.3 68.8 81.6
	Extreme (1,100) 41.6 68.1 81.2
	Extreme (100,1) 40.8 69.4 83.5
	Extreme (1,1000) 41.8 70.5 81.8
	Circle (100,100) 37.1 64.5 81.4


	Chatgpt response:
	Alright — I pulled up the SO answer you linked (the trig‑free curvature/evolute iteration) ([Stack Overflow][1]) and the HN thread where 0xfaded points out the r/q single‑sqrt optimization ([Hacker News][2]). Your current “yours+opt” is already the right direction.

	If you want a real speed win beyond that, there’s one lever that dominates everything else:

	## The fastest win: drop to 2 iterations (and optionally add a 3rd only when inside)

	That SO implementation hardcodes 3 iterations ([Stack Overflow][1]). In practice:

	* Outside the ellipse: 2 iterations is already very accurate.
	* Inside the ellipse: worst‑cases are harder; 3 iterations helps a lot.

	So the fastest “still not buggy” implementation is:

	* Fixed 2 iterations (fastest overall), or
	* Adaptive: do 2 iterations, then do a 3rd only if the query point is inside the ellipse.

	### What this buys you (measured here, same core algorithm + r/q single‑sqrt)

	On this machine (g++ `-O3 -march=native -ffast-math`, float):

	* fixed 3 iters: ~57 ns/call
	* fixed 2 iters: ~35 ns/call (~1.6× faster)
	* adaptive (2 outside, 3 inside): ~45 ns/call

	Accuracy trend (vs a robust double-precision reference):

	* 2 iters: outside max dist error ≈ 1e‑3, inside worst‑case can be ~2e‑1
	* 3 iters: inside worst‑case drops by an order of magnitude

	That’s the key: 2 iters is “basically done” outside, and collisions / closest-point queries in games are very often outside or near-boundary.

	---

	## Drop-in “redeem myself” code: fastest robust variant

	This is the same 0xfaded / Adrian Stephens trig‑free method from the SO answer ([Stack Overflow][1]), with 0xfaded’s r/q single‑sqrt optimization ([Hacker News][2]), plus:

	* compile-time iteration count (so the compiler can unroll),
	* optional “adaptive” extra iteration for inside points,
	* precomputed constants in a cache struct.

	> Assumes the ellipse is axis-aligned and centered at origin.
	> (If you need center/rotation: translate/rotate into ellipse-local coords, call this, then rotate back.)

	```cpp
	#include <cmath>
	#include <algorithm>

	struct Pt { float x, y; };

	struct EllipseCache {
	float a, b; // semi-axes
	float invA, invB; // 1/a, 1/b
	float invA2, invB2; // 1/a^2, 1/b^2 (for inside test)
	float ca, cb; // (a^2-b^2)/a and (b^2-a^2)/b

	explicit EllipseCache(float A, float B) : a(A), b(B) {
	invA = 1.0f / a;
	invB = 1.0f / b;
	invA2 = invA * invA;
	invB2 = invB * invB;
	float a2 = aa, b2 = bb;
	ca = (a2 - b2) * invA;
	cb = (b2 - a2) * invB;
	}
	};

	static inline float clamp01(float v) {
	return std::min(1.0f, std::max(0.0f, v));
	}

	// One iteration step (kept as a macro-ish inline to help compilers)
	static inline void curv_step(const EllipseCache& e,
	float pxAbs, float pyAbs,
	float& tx, float& ty) {
	// Current ellipse point (first quadrant)
	float x = e.a * tx;
	float y = e.b * ty;

	// Evolute / curvature center offsets
	float tx2 = tx * tx;
	float ty2 = ty * ty;
	float ex = e.ca * (tx2 * tx);
	float ey = e.cb * (ty2 * ty);

	// Vector from evolute point to ellipse point (r) and to query point (q)
	float rx = x - ex, ry = y - ey;
	float qx = pxAbs - ex, qy = pyAbs - ey;

	float r2 = rxrx + ryry;
	float q2 = qxqx + qyqy + 1e-30f; // branchless “don’t blow up” epsilon

	// 0xfaded optimization: only need r/q, so do it with ONE sqrt :contentReference[oaicite:5]{index=5}
	float rq = std::sqrt(r2 / q2);

	// Update t (clamp then renormalize)
	float ntx = clamp01((qx * rq + ex) * e.invA);
	float nty = clamp01((qy * rq + ey) * e.invB);

	float invLen = 1.0f / std::sqrt(ntxntx + ntynty);
	tx = ntx * invLen;
	ty = nty * invLen;
	}

	template<int ITERS>
	static inline Pt closest_point_curvature(const EllipseCache& e, float px, float py) {
	float pxAbs = std::fabs(px);
	float pyAbs = std::fabs(py);

	// Center case: any closest point lies on the minor axis (assuming a>=b typical).
	// (You can pick either sign; here we pick +b.)
	if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};

	// Constant init (robust; avoids bad q≈0 cases seen with “directional” init)
	float tx = 0.70710678f;
	float ty = 0.70710678f;

	#pragma unroll
	for (int i = 0; i < ITERS; ++i) {
	curv_step(e, pxAbs, pyAbs, tx, ty);
	}

	return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
	}

	// Fastest “safe” default for games:
	// - 2 iters outside (very accurate there)
	// - 3 iters inside (fixes worst cases)
	static inline Pt closest_point_adaptive(const EllipseCache& e, float px, float py) {
	float pxAbs = std::fabs(px);
	float pyAbs = std::fabs(py);

	if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};

	float tx = 0.70710678f;
	float ty = 0.70710678f;

	// Always do 2
	curv_step(e, pxAbs, pyAbs, tx, ty);
	curv_step(e, pxAbs, pyAbs, tx, ty);

	// Only do the 3rd if point is inside ellipse
	// (cheap inside test)
	float v = (pxAbspxAbs) e.invA2 + (pyAbspyAbs) e.invB2;
	if (v <= 1.0f) {
	curv_step(e, pxAbs, pyAbs, tx, ty);
	}

	return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
	}
	```

	### How to use

	```cpp
	EllipseCache E(150.0f, 100.0f);

	// fastest fixed count:
	Pt cp2 = closest_point_curvature<2>(E, px, py);

	// “still fast but safer inside”:
	Pt cpA = closest_point_adaptive(E, px, py);
	```

	---

	## Why this “redeems” the speed race

	You already used the “save a sqrt by folding r/q” trick (from 0xfaded’s own HN reply) ([Hacker News][2]), which was the best micro optimization.

	But the biggest macro optimization is:
	don’t pay for the 3rd iteration unless you need it.

	That’s how you get a meaningfully faster implementation without drifting into the “2 iterations but wrong root / wrong stationary point” failure mode that kills Newton-style approaches inside the ellipse.

	---

	If you want, I can also give you:

	* an AVX2 “8 points at a time” version (throughput monster),
	* a rotated-ellipse wrapper (precompute sin/cos once per ellipse),
	* a tiny benchmark harness that prints ns/call + max distance error (so you can paste it into HN like you did).

	[1]: https://stackoverflow.com/questions/22959698/distance-from-given-point-to-given-ellipse "c++ - distance from given point to given ellipse - Stack Overflow"
	[2]: https://news.ycombinator.com/item?id=46482345 "Total monthly number of StackOverflow questions over time \| Hacker News"



	Code used to benchmark:

	// ellipse_bench_mine_vs_claude.cpp
	// Compare:
	// (1) mine2 : curvature method, r/q single-sqrt, 2 iterations
	// (2) opt3 : curvature method, r/q single-sqrt, 3 iterations (0xfaded + r/q opt)
	// (3) claude_rot4: Newton + rotation trick, 4 iterations
	//
	// Build:
	// g++ -O3 -march=native -ffast-math -std=c++17 ellipse_bench_mine_vs_claude.cpp -o bench -lm
	//
	#include <cmath>
	#include <cstdio>
	#include <chrono>
	#include <random>
	#include <vector>
	#include <algorithm>

	using Clock = std::chrono::high_resolution_clock;

	struct Point { float x, y; };

	static inline float clamp01(float v) {
	return std::fmin(1.0f, std::fmax(0.0f, v));
	}

	// -----------------------------
	// mine2: curvature, 2 iters
	// -----------------------------
	static inline Point mine2(float a, float b, float px, float py) {
	float px_abs = std::fabs(px), py_abs = std::fabs(py);
	float tx = 0.70710678f, ty = 0.70710678f;
	float a2 = aa, b2 = bb;
	float ca = (a2 - b2) / a;
	float cb = (b2 - a2) / b;

	// 2 iterations
	for (int i = 0; i < 2; ++i) {
	float x = a * tx, y = b * ty;
	float tx3 = txtxtx;
	float ty3 = tytyty;
	float ex = ca * tx3;
	float ey = cb * ty3;
	float rx = x - ex, ry = y - ey;
	float qx = px_abs - ex, qy = py_abs - ey;

	// r/q combined into one sqrt (0xfaded optimization)
	float rq = std::sqrt((rxrx + ryry) / (qxqx + qyqy + 1e-30f));

	float ntx = clamp01((qx * rq + ex) / a);
	float nty = clamp01((qy * rq + ey) / b);
	float invLen = 1.0f / std::sqrt(ntxntx + ntynty);
	tx = ntx * invLen;
	ty = nty * invLen;
	}

	return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
	}

	// -----------------------------
	// opt3: curvature, 3 iters
	// -----------------------------
	static inline Point opt3(float a, float b, float px, float py) {
	float px_abs = std::fabs(px), py_abs = std::fabs(py);
	float tx = 0.70710678f, ty = 0.70710678f;
	float a2 = aa, b2 = bb;
	float ca = (a2 - b2) / a;
	float cb = (b2 - a2) / b;

	// 3 iterations (matches SO implementation)
	for (int i = 0; i < 3; ++i) {
	float x = a * tx, y = b * ty;
	float tx3 = txtxtx;
	float ty3 = tytyty;
	float ex = ca * tx3;
	float ey = cb * ty3;
	float rx = x - ex, ry = y - ey;
	float qx = px_abs - ex, qy = py_abs - ey;

	float rq = std::sqrt((rxrx + ryry) / (qxqx + qyqy + 1e-30f));

	float ntx = clamp01((qx * rq + ex) / a);
	float nty = clamp01((qy * rq + ey) / b);
	float invLen = 1.0f / std::sqrt(ntxntx + ntynty);
	tx = ntx * invLen;
	ty = nty * invLen;
	}

	return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
	}

	// -----------------------------
	// claude_rot4: Newton rotation, 4 iters
	// -----------------------------
	static inline Point claude_rot4(float a, float b, float px, float py) {
	float px_abs = std::fabs(px), py_abs = std::fabs(py);
	float a2mb2 = aa - bb;

	// init: normalized direction in scaled space
	float nx = px_abs / a;
	float ny = py_abs / b;
	float len = std::sqrt(nxnx + nyny + 1e-10f);
	float c = nx / len;
	float s = ny / len;

	for (int i = 0; i < 4; ++i) {
	float f = a2mb2sc - px_absas + py_absbc;
	float fp = a2mb2(cc - ss) - px_absac - py_absb*s;
	if (std::fabs(fp) < 1e-10f) break;
	float dt = f / fp;

	// small-angle rotation update + renorm
	float nc = c + dt*s;
	float ns = s - dt*c;
	float inv = 1.0f / std::sqrt(ncnc + nsns);
	c = nc * inv;
	s = ns * inv;
	}

	return { std::copysign(a * c, px), std::copysign(b * s, py) };
	}

	volatile float sink;
	static inline void escape(Point p) { sink = p.x + p.y; }

	template <typename F>
	static double bench(F fn, float a, float b, const std::vector<Point>& pts, int runs) {
	// warmup
	for (int w = 0; w < 3; ++w) {
	for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
	}

	std::vector<double> times;
	times.reserve(runs);

	for (int r = 0; r < runs; ++r) {
	auto t0 = Clock::now();
	for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
	auto t1 = Clock::now();
	double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
	times.push_back(ns / (double)pts.size());
	}

	std::sort(times.begin(), times.end());
	return times[times.size() / 2]; // median
	}

	template <Point(*FN)(float,float,float,float)>
	static float ellipse_eq_maxerr(float a, float b, const std::vector<Point>& pts) {
	float maxe = 0.0f;
	float invA = 1.0f / a;
	float invB = 1.0f / b;
	for (const auto& p : pts) {
	Point r = FN(a, b, p.x, p.y);
	float u = r.x * invA;
	float v = r.y * invB;
	float e = std::fabs(uu + vv - 1.0f);
	if (std::isfinite(e)) maxe = std::fmax(maxe, e);
	}
	return maxe;
	}

	int main() {
	std::mt19937 rng(42);
	std::uniform_real_distribution<float> angle(0.0f, 2.0f * (float)M_PI);
	std::uniform_real_distribution<float> radius(0.5f, 2.5f);

	const int N = 50000;
	const int runs = 11;

	struct Cfg { float a, b; const char* name; } cfgs[] = {
	{150, 100, "Moderate (150,100)"},
	{200, 50, "High ecc (200,50)"},
	{100, 10, "Very high (100,10)"},
	{ 1, 100, "Extreme (1,100)"},
	{100, 1, "Extreme (100,1)"},
	{ 1,1000, "Extreme (1,1000)"},
	{100, 100, "Circle (100,100)"},
	};

	std::printf("====================================================\n");
	std::printf("mine2 vs opt3 vs claude_rot4 (float)\n");
	std::printf("N=%d points, median(%d) runs\n", N, runs);
	std::printf("Build: g++ -O3 -march=native -ffast-math -std=c++17\n");
	std::printf("====================================================\n\n");

	for (const auto& cfg : cfgs) {
	std::vector<Point> pts;
	pts.reserve(N);
	for (int i = 0; i < N; ++i) {
	float ang = angle(rng);
	float rr = radius(rng);
	pts.push_back({ cfg.a * rr * std::cos(ang), cfg.b * rr * std::sin(ang) });
	}

	double t_mine = bench(mine2, cfg.a, cfg.b, pts, runs);
	double t_opt3 = bench(opt3, cfg.a, cfg.b, pts, runs);
	double t_cl = bench(claude_rot4, cfg.a, cfg.b, pts, runs);

	float e_mine = ellipse_eq_maxerr<mine2>(cfg.a, cfg.b, pts);
	float e_opt = ellipse_eq_maxerr<opt3>(cfg.a, cfg.b, pts);
	float e_cl = ellipse_eq_maxerr<claude_rot4>(cfg.a, cfg.b, pts);

	std::printf("%s\n", cfg.name);
	std::printf(" mine2 %7.2f ns/call max(\|eq-1\|)=%.2e\n", t_mine, e_mine);
	std::printf(" opt3 %7.2f ns/call max(\|eq-1\|)=%.2e\n", t_opt3, e_opt);
	std::printf(" claude_rot4 %7.2f ns/call max(\|eq-1\|)=%.2e\n", t_cl, e_cl);
	std::printf("\n");
	}

	return 0;
	}
No results found