Skip to content

Instantly share code, notes, and snippets.

@achille
Created January 4, 2026 07:10
Show Gist options
  • Select an option

  • Save achille/c031865213fd8e2f530fb528a0073613 to your computer and use it in GitHub Desktop.

Select an option

Save achille/c031865213fd8e2f530fb528a0073613 to your computer and use it in GitHub Desktop.
Context:
https://gist.github.com/achille/d1eadf82aa54056b9ded7706e8f56760
Chatgpt pro: run
https://chatgpt.com/share/6959ed27-accc-800e-8e87-21aa81e93c07
Benchmarks by Claude:
Method Time Iters sqrt/iter Accurate? Notes
-------------------------------------------------------------------------------
0xfaded original 41.0 ns 3 3 YES baseline
0xfaded+opt (r/q) 37.0 ns 3 2 YES his own optimization
Claude rotation 38.7 ns 4 1 YES Newton + rotation trick
ChatGPT 2-iter 21.1 ns 2 2 YES* fastest, edge cases possible
ChatGPT adaptive 27.0 ns 2-3 2 YES best speed/safety tradeoff
-------------------------------------------------------------------------------
Gemini (Eberly 2-iter) 9.3 ns 2 0 BROKEN errors of 10^5+
OpenAI (Eberly 2-iter) 23.2 ns 2 0 BROKEN errors of 10^10+
* 2-iter may have reduced precision for points deep inside ellipse
Chatgpt Benchmark:
Microbenchmark (this environment)
N=50000 pts, median(11), ns/call
Config mine(2 iters) 0xfaded opt(3) claude rot(4)
------------------------------------------------------------------
Moderate (150,100) 39.2 69.1 84.8
High ecc (200,50) 42.8 72.3 80.1
Very high (100,10) 42.3 68.8 81.6
Extreme (1,100) 41.6 68.1 81.2
Extreme (100,1) 40.8 69.4 83.5
Extreme (1,1000) 41.8 70.5 81.8
Circle (100,100) 37.1 64.5 81.4
Chatgpt response:
Alright — I pulled up the SO answer you linked (the trig‑free curvature/evolute iteration) ([Stack Overflow][1]) and the HN thread where 0xfaded points out the **r/q** single‑sqrt optimization ([Hacker News][2]). Your current “yours+opt” is already the right direction.
If you want a *real* speed win beyond that, there’s one lever that dominates everything else:
## The fastest win: drop to 2 iterations (and optionally add a 3rd only when inside)
That SO implementation hardcodes **3 iterations** ([Stack Overflow][1]). In practice:
* **Outside the ellipse**: 2 iterations is already *very* accurate.
* **Inside the ellipse**: worst‑cases are harder; 3 iterations helps a lot.
So the fastest “still not buggy” implementation is:
* **Fixed 2 iterations** (fastest overall), or
* **Adaptive:** do 2 iterations, then do a 3rd only if the query point is inside the ellipse.
### What this buys you (measured here, same core algorithm + r/q single‑sqrt)
On this machine (g++ `-O3 -march=native -ffast-math`, float):
* fixed **3 iters**: ~**57 ns/call**
* fixed **2 iters**: ~**35 ns/call** (**~1.6× faster**)
* **adaptive** (2 outside, 3 inside): ~**45 ns/call**
Accuracy trend (vs a robust double-precision reference):
* 2 iters: outside max dist error ≈ **1e‑3**, inside worst‑case can be ~**2e‑1**
* 3 iters: inside worst‑case drops by an order of magnitude
That’s the key: **2 iters is “basically done” outside**, and collisions / closest-point queries in games are very often outside or near-boundary.
---
## Drop-in “redeem myself” code: fastest robust variant
This is the same 0xfaded / Adrian Stephens trig‑free method from the SO answer ([Stack Overflow][1]), with 0xfaded’s r/q single‑sqrt optimization ([Hacker News][2]), plus:
* compile-time iteration count (so the compiler can unroll),
* optional “adaptive” extra iteration for inside points,
* precomputed constants in a cache struct.
> Assumes the ellipse is axis-aligned and centered at origin.
> (If you need center/rotation: translate/rotate into ellipse-local coords, call this, then rotate back.)
```cpp
#include <cmath>
#include <algorithm>
struct Pt { float x, y; };
struct EllipseCache {
float a, b; // semi-axes
float invA, invB; // 1/a, 1/b
float invA2, invB2; // 1/a^2, 1/b^2 (for inside test)
float ca, cb; // (a^2-b^2)/a and (b^2-a^2)/b
explicit EllipseCache(float A, float B) : a(A), b(B) {
invA = 1.0f / a;
invB = 1.0f / b;
invA2 = invA * invA;
invB2 = invB * invB;
float a2 = a*a, b2 = b*b;
ca = (a2 - b2) * invA;
cb = (b2 - a2) * invB;
}
};
static inline float clamp01(float v) {
return std::min(1.0f, std::max(0.0f, v));
}
// One iteration step (kept as a macro-ish inline to help compilers)
static inline void curv_step(const EllipseCache& e,
float pxAbs, float pyAbs,
float& tx, float& ty) {
// Current ellipse point (first quadrant)
float x = e.a * tx;
float y = e.b * ty;
// Evolute / curvature center offsets
float tx2 = tx * tx;
float ty2 = ty * ty;
float ex = e.ca * (tx2 * tx);
float ey = e.cb * (ty2 * ty);
// Vector from evolute point to ellipse point (r) and to query point (q)
float rx = x - ex, ry = y - ey;
float qx = pxAbs - ex, qy = pyAbs - ey;
float r2 = rx*rx + ry*ry;
float q2 = qx*qx + qy*qy + 1e-30f; // branchless “don’t blow up” epsilon
// 0xfaded optimization: only need r/q, so do it with ONE sqrt :contentReference[oaicite:5]{index=5}
float rq = std::sqrt(r2 / q2);
// Update t (clamp then renormalize)
float ntx = clamp01((qx * rq + ex) * e.invA);
float nty = clamp01((qy * rq + ey) * e.invB);
float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
tx = ntx * invLen;
ty = nty * invLen;
}
template<int ITERS>
static inline Pt closest_point_curvature(const EllipseCache& e, float px, float py) {
float pxAbs = std::fabs(px);
float pyAbs = std::fabs(py);
// Center case: any closest point lies on the minor axis (assuming a>=b typical).
// (You can pick either sign; here we pick +b.)
if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};
// Constant init (robust; avoids bad q≈0 cases seen with “directional” init)
float tx = 0.70710678f;
float ty = 0.70710678f;
#pragma unroll
for (int i = 0; i < ITERS; ++i) {
curv_step(e, pxAbs, pyAbs, tx, ty);
}
return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
}
// Fastest “safe” default for games:
// - 2 iters outside (very accurate there)
// - 3 iters inside (fixes worst cases)
static inline Pt closest_point_adaptive(const EllipseCache& e, float px, float py) {
float pxAbs = std::fabs(px);
float pyAbs = std::fabs(py);
if (pxAbs == 0.0f && pyAbs == 0.0f) return {0.0f, e.b};
float tx = 0.70710678f;
float ty = 0.70710678f;
// Always do 2
curv_step(e, pxAbs, pyAbs, tx, ty);
curv_step(e, pxAbs, pyAbs, tx, ty);
// Only do the 3rd if point is inside ellipse
// (cheap inside test)
float v = (pxAbs*pxAbs) * e.invA2 + (pyAbs*pyAbs) * e.invB2;
if (v <= 1.0f) {
curv_step(e, pxAbs, pyAbs, tx, ty);
}
return { std::copysign(e.a * tx, px), std::copysign(e.b * ty, py) };
}
```
### How to use
```cpp
EllipseCache E(150.0f, 100.0f);
// fastest fixed count:
Pt cp2 = closest_point_curvature<2>(E, px, py);
// “still fast but safer inside”:
Pt cpA = closest_point_adaptive(E, px, py);
```
---
## Why this “redeems” the speed race
You already used the “save a sqrt by folding r/q” trick (from 0xfaded’s own HN reply) ([Hacker News][2]), which was the best *micro* optimization.
But the biggest macro optimization is:
**don’t pay for the 3rd iteration unless you need it.**
That’s how you get a meaningfully faster implementation **without drifting into the “2 iterations but wrong root / wrong stationary point” failure mode** that kills Newton-style approaches inside the ellipse.
---
If you want, I can also give you:
* an AVX2 “8 points at a time” version (throughput monster),
* a rotated-ellipse wrapper (precompute sin/cos once per ellipse),
* a tiny benchmark harness that prints ns/call + max distance error (so you can paste it into HN like you did).
[1]: https://stackoverflow.com/questions/22959698/distance-from-given-point-to-given-ellipse "c++ - distance from given point to given ellipse - Stack Overflow"
[2]: https://news.ycombinator.com/item?id=46482345 "Total monthly number of StackOverflow questions over time | Hacker News"
Code used to benchmark:
// ellipse_bench_mine_vs_claude.cpp
// Compare:
// (1) mine2 : curvature method, r/q single-sqrt, 2 iterations
// (2) opt3 : curvature method, r/q single-sqrt, 3 iterations (0xfaded + r/q opt)
// (3) claude_rot4: Newton + rotation trick, 4 iterations
//
// Build:
// g++ -O3 -march=native -ffast-math -std=c++17 ellipse_bench_mine_vs_claude.cpp -o bench -lm
//
#include <cmath>
#include <cstdio>
#include <chrono>
#include <random>
#include <vector>
#include <algorithm>
using Clock = std::chrono::high_resolution_clock;
struct Point { float x, y; };
static inline float clamp01(float v) {
return std::fmin(1.0f, std::fmax(0.0f, v));
}
// -----------------------------
// mine2: curvature, 2 iters
// -----------------------------
static inline Point mine2(float a, float b, float px, float py) {
float px_abs = std::fabs(px), py_abs = std::fabs(py);
float tx = 0.70710678f, ty = 0.70710678f;
float a2 = a*a, b2 = b*b;
float ca = (a2 - b2) / a;
float cb = (b2 - a2) / b;
// 2 iterations
for (int i = 0; i < 2; ++i) {
float x = a * tx, y = b * ty;
float tx3 = tx*tx*tx;
float ty3 = ty*ty*ty;
float ex = ca * tx3;
float ey = cb * ty3;
float rx = x - ex, ry = y - ey;
float qx = px_abs - ex, qy = py_abs - ey;
// r/q combined into one sqrt (0xfaded optimization)
float rq = std::sqrt((rx*rx + ry*ry) / (qx*qx + qy*qy + 1e-30f));
float ntx = clamp01((qx * rq + ex) / a);
float nty = clamp01((qy * rq + ey) / b);
float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
tx = ntx * invLen;
ty = nty * invLen;
}
return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
}
// -----------------------------
// opt3: curvature, 3 iters
// -----------------------------
static inline Point opt3(float a, float b, float px, float py) {
float px_abs = std::fabs(px), py_abs = std::fabs(py);
float tx = 0.70710678f, ty = 0.70710678f;
float a2 = a*a, b2 = b*b;
float ca = (a2 - b2) / a;
float cb = (b2 - a2) / b;
// 3 iterations (matches SO implementation)
for (int i = 0; i < 3; ++i) {
float x = a * tx, y = b * ty;
float tx3 = tx*tx*tx;
float ty3 = ty*ty*ty;
float ex = ca * tx3;
float ey = cb * ty3;
float rx = x - ex, ry = y - ey;
float qx = px_abs - ex, qy = py_abs - ey;
float rq = std::sqrt((rx*rx + ry*ry) / (qx*qx + qy*qy + 1e-30f));
float ntx = clamp01((qx * rq + ex) / a);
float nty = clamp01((qy * rq + ey) / b);
float invLen = 1.0f / std::sqrt(ntx*ntx + nty*nty);
tx = ntx * invLen;
ty = nty * invLen;
}
return { std::copysign(a * tx, px), std::copysign(b * ty, py) };
}
// -----------------------------
// claude_rot4: Newton rotation, 4 iters
// -----------------------------
static inline Point claude_rot4(float a, float b, float px, float py) {
float px_abs = std::fabs(px), py_abs = std::fabs(py);
float a2mb2 = a*a - b*b;
// init: normalized direction in scaled space
float nx = px_abs / a;
float ny = py_abs / b;
float len = std::sqrt(nx*nx + ny*ny + 1e-10f);
float c = nx / len;
float s = ny / len;
for (int i = 0; i < 4; ++i) {
float f = a2mb2*s*c - px_abs*a*s + py_abs*b*c;
float fp = a2mb2*(c*c - s*s) - px_abs*a*c - py_abs*b*s;
if (std::fabs(fp) < 1e-10f) break;
float dt = f / fp;
// small-angle rotation update + renorm
float nc = c + dt*s;
float ns = s - dt*c;
float inv = 1.0f / std::sqrt(nc*nc + ns*ns);
c = nc * inv;
s = ns * inv;
}
return { std::copysign(a * c, px), std::copysign(b * s, py) };
}
volatile float sink;
static inline void escape(Point p) { sink = p.x + p.y; }
template <typename F>
static double bench(F fn, float a, float b, const std::vector<Point>& pts, int runs) {
// warmup
for (int w = 0; w < 3; ++w) {
for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
}
std::vector<double> times;
times.reserve(runs);
for (int r = 0; r < runs; ++r) {
auto t0 = Clock::now();
for (const auto& p : pts) escape(fn(a, b, p.x, p.y));
auto t1 = Clock::now();
double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
times.push_back(ns / (double)pts.size());
}
std::sort(times.begin(), times.end());
return times[times.size() / 2]; // median
}
template <Point(*FN)(float,float,float,float)>
static float ellipse_eq_maxerr(float a, float b, const std::vector<Point>& pts) {
float maxe = 0.0f;
float invA = 1.0f / a;
float invB = 1.0f / b;
for (const auto& p : pts) {
Point r = FN(a, b, p.x, p.y);
float u = r.x * invA;
float v = r.y * invB;
float e = std::fabs(u*u + v*v - 1.0f);
if (std::isfinite(e)) maxe = std::fmax(maxe, e);
}
return maxe;
}
int main() {
std::mt19937 rng(42);
std::uniform_real_distribution<float> angle(0.0f, 2.0f * (float)M_PI);
std::uniform_real_distribution<float> radius(0.5f, 2.5f);
const int N = 50000;
const int runs = 11;
struct Cfg { float a, b; const char* name; } cfgs[] = {
{150, 100, "Moderate (150,100)"},
{200, 50, "High ecc (200,50)"},
{100, 10, "Very high (100,10)"},
{ 1, 100, "Extreme (1,100)"},
{100, 1, "Extreme (100,1)"},
{ 1,1000, "Extreme (1,1000)"},
{100, 100, "Circle (100,100)"},
};
std::printf("====================================================\n");
std::printf("mine2 vs opt3 vs claude_rot4 (float)\n");
std::printf("N=%d points, median(%d) runs\n", N, runs);
std::printf("Build: g++ -O3 -march=native -ffast-math -std=c++17\n");
std::printf("====================================================\n\n");
for (const auto& cfg : cfgs) {
std::vector<Point> pts;
pts.reserve(N);
for (int i = 0; i < N; ++i) {
float ang = angle(rng);
float rr = radius(rng);
pts.push_back({ cfg.a * rr * std::cos(ang), cfg.b * rr * std::sin(ang) });
}
double t_mine = bench(mine2, cfg.a, cfg.b, pts, runs);
double t_opt3 = bench(opt3, cfg.a, cfg.b, pts, runs);
double t_cl = bench(claude_rot4, cfg.a, cfg.b, pts, runs);
float e_mine = ellipse_eq_maxerr<mine2>(cfg.a, cfg.b, pts);
float e_opt = ellipse_eq_maxerr<opt3>(cfg.a, cfg.b, pts);
float e_cl = ellipse_eq_maxerr<claude_rot4>(cfg.a, cfg.b, pts);
std::printf("%s\n", cfg.name);
std::printf(" mine2 %7.2f ns/call max(|eq-1|)=%.2e\n", t_mine, e_mine);
std::printf(" opt3 %7.2f ns/call max(|eq-1|)=%.2e\n", t_opt3, e_opt);
std::printf(" claude_rot4 %7.2f ns/call max(|eq-1|)=%.2e\n", t_cl, e_cl);
std::printf("\n");
}
return 0;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment