@statico
Last active February 26, 2026 04:40
claude-captcha-deepdive

How Modern CAPTCHAs Actually Work: A Deep Dive

The Big Picture

Your intuition is correct: all three systems fundamentally collect a mass of signals, send them to black-box server-side ML models, and get back a probabilistic score. None of them have a clean deterministic "if X then bot" rule. It's all heuristics and probability. But the depth of what they collect is staggering.


1. Google reCAPTCHA: The VM

Google's approach is the most opaque and arguably the most sophisticated. The core of reCAPTCHA is BotGuard -- a custom virtual machine implemented in JavaScript that runs a proprietary bytecode language inside your browser.

The BotGuard VM

This has been partially reverse-engineered (see neuroradiology/InsideReCaptcha and dsekz/botguard-reverse), and yes, it is a real VM:

  • Custom bytecode interpreter that emulates a register-based CPU mimicking x86 architecture
  • Bytecode is encrypted with XTEA (eXtended Tiny Encryption Algorithm), a block cipher with 64-bit blocks -- each 8-byte block is XORed with a keystream
  • The VM has self-modifying code: it changes its own decryption keys and its own opcode numbers at runtime
  • Register values are encrypted; opcodes include flow-changing instructions
  • The bytecode has direct access to the JavaScript variables of its own interpreter -- it can reach out and touch the JS environment
  • To even see the instructions, you must write a custom disassembler, decompiler, AND debugger
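
To make the keystream idea concrete, here is a minimal Python sketch of XTEA used as a keystream generator. The round function is standard XTEA; the counter-block layout, the `nonce` parameter, the big-endian packing, and the key are assumptions for illustration and are not taken from BotGuard itself.

```python
import struct

MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9  # XTEA's fixed round constant


def xtea_encrypt_block(block: bytes, key: tuple, rounds: int = 32) -> bytes:
    """Encrypt one 8-byte block with a 128-bit key given as four 32-bit words."""
    v0, v1 = struct.unpack(">2I", block)
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + key[total & 3]))) & MASK32
        total = (total + DELTA) & MASK32
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + key[(total >> 11) & 3]))) & MASK32
    return struct.pack(">2I", v0, v1)


def xtea_keystream_xor(data: bytes, key: tuple, nonce: int = 0) -> bytes:
    """XOR data with a keystream built by encrypting successive counter blocks.

    The counter layout is an assumption; the real VM's scheme is not documented here.
    """
    out = bytearray()
    for offset in range(0, len(data), 8):
        counter = struct.pack(">Q", (nonce + offset // 8) & 0xFFFFFFFFFFFFFFFF)
        stream = xtea_encrypt_block(counter, key)
        chunk = data[offset:offset + 8]
        out.extend(b ^ s for b, s in zip(chunk, stream))
    return bytes(out)
```

Because the data is only XORed with the keystream, calling `xtea_keystream_xor` a second time with the same key and nonce recovers the original bytes.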

The key insight from dsekz/botguard-reverse: BotGuard is not a fingerprinter per se. It's a browser attestation system. It generates a token that proves the code was executed by a real browser. The token can only be generated if the full VM executes correctly in a genuine browser environment. Google then correlates this token server-side with other signals (IP reputation, cookies, session history).

What reCAPTCHA Collects (from InsideReCaptcha reverse engineering)

Two byte arrays (xhr1 and xhr2) are assembled by the VM and sent to Google:

Signal | Details
Browser plugins | Full enumeration
User-Agent | String
Screen resolution | Dimensions
Execution time | How long the VM took to run
Timezone | Offset
Interaction counts | Clicks, keyboard events, touch actions within the captcha iframe
Browser function behavior | Tests specific JS functions and checks how they behave
CSS rule rendering | How the browser applies CSS (differs per engine)
Canvas rendering | Draws to canvas and hashes the output
Google cookies | Since reCAPTCHA runs on google.com, it has access to all Google cookies
Dynamic key hashes | Hashes of Function.toString() output, browser-specific APIs, hostname

xhr2 is additionally encrypted with XTEA before transmission to https://www.google.com/recaptcha/api2/frame.

The Google Account Advantage

This is reCAPTCHA's secret weapon that competitors can't match: because the script runs on the google.com domain, it can read Google cookies. If you're logged into a Google account with years of browsing history, you get a massive trust boost. Research has confirmed that the Google tracking cookie "plays a crucial role in determining the difficulty of challenge presented to the user." A fresh incognito window with no Google session almost always gets a harder challenge.

A 2023 study estimated that reCAPTCHA has consumed 819 million hours of human time and may have generated billions in Google profits through the cookie/tracking data it collects.

reCAPTCHA v3 Scoring

v3 runs invisibly and returns a score from 0.0 (bot) to 1.0 (human). Google doesn't publish the exact scoring formula, so the list below is what has been reported rather than officially confirmed:

  • All of the above fingerprinting data
  • Mouse movement patterns on the page (not just in the captcha widget)
  • Browsing behavior across the entire page
  • Full screenshots of the browser window (per GDPR analysis findings)
  • Historical behavior across all sites using reCAPTCHA
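
On the site owner's side, the score only becomes useful once the backend verifies the token and applies its own policy. The sketch below posts a token to Google's `siteverify` endpoint and maps the reply to a decision. The endpoint and the `success`, `score`, and `action` fields are part of the public API; the `decide` function, its 0.5 threshold, and the allow/challenge/block labels are hypothetical choices for illustration.

```python
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def fetch_assessment(secret: str, token: str) -> dict:
    """POST the client token to siteverify and return the parsed JSON reply."""
    body = urllib.parse.urlencode({"secret": secret, "response": token}).encode()
    with urllib.request.urlopen(SITEVERIFY_URL, data=body, timeout=5) as resp:
        return json.load(resp)


def decide(assessment: dict, expected_action: str, threshold: float = 0.5) -> str:
    """Map a siteverify reply to a decision (the policy itself is hypothetical)."""
    if not assessment.get("success"):
        return "block"  # token invalid, expired, or already used
    if assessment.get("action") != expected_action:
        return "block"  # token was minted for a different page action
    score = float(assessment.get("score", 0.0))
    return "allow" if score >= threshold else "challenge"
```

A typical call would be `decide(fetch_assessment(secret, token), "login")`, with the threshold tuned per action since the score is a probability and not a verdict.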

reCAPTCHA Enterprise

Adds risk reason codes: AUTOMATION (headless browsers), UNEXPECTED_ENVIRONMENT (emulated environments), TOO_MUCH_TRAFFIC, UNEXPECTED_USAGE_PATTERNS, LOW_CONFIDENCE_SCORE, plus fraud-specific signals for payment protection.


2. Cloudflare Turnstile

Cloudflare's approach is the most layered -- they have signals at every level of the stack, from TCP packets up to JavaScript behavior.

The Detection Stack (Bottom to Top)

Layer 1: TCP/IP Fingerprinting

Before your browser even sends HTTP, Cloudflare analyzes the TCP SYN packet. Different OS TCP/IP stacks set different initial window sizes, TTL values, and TCP options. A request claiming to be Chrome on Windows but with Linux TCP characteristics is immediately suspicious.
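
As a rough illustration of that mismatch check, the sketch below infers the sender's initial TTL from the TTL observed at the server and compares it with the OS claimed in the User-Agent. The default TTLs (64 for Linux and macOS, 128 for Windows) are the commonly seen defaults and can be reconfigured; the function names and the substring matching are assumptions, and a real system would also weigh window size and TCP options.

```python
from typing import Optional

COMMON_INITIAL_TTLS = (64, 128, 255)

# Typical default initial TTLs per OS family; administrators can change these.
EXPECTED_INITIAL_TTL = {"windows": 128, "linux": 64, "macos": 64}


def guess_initial_ttl(observed_ttl: int) -> int:
    """Round the observed TTL up to the nearest common initial value."""
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial
    raise ValueError("TTL must be between 0 and 255")


def claimed_os(user_agent: str) -> Optional[str]:
    """Very coarse OS guess from the User-Agent string."""
    ua = user_agent.lower()
    if "windows nt" in ua:
        return "windows"
    if "mac os x" in ua:
        return "macos"
    if "linux" in ua:
        return "linux"
    return None


def ttl_mismatch(user_agent: str, observed_ttl: int) -> bool:
    """True when the TCP-level TTL contradicts the OS the client claims to be."""
    os_name = claimed_os(user_agent)
    if os_name is None:
        return False  # nothing to compare against
    return guess_initial_ttl(observed_ttl) != EXPECTED_INITIAL_TTL[os_name]
```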

Layer 2: TLS Fingerprinting (JA3/JA4)

The TLS ClientHello message reveals the client's supported cipher suites, extensions, and elliptic curves -- each TLS library (Chrome, Firefox, curl, Python requests, Go net/http) produces a distinctive fingerprint.

  • JA3 (original): MD5 hash of ClientHello fields in the order they appear. Its usefulness degraded once Chrome 110 began randomizing extension order.
  • JA4 (successor, Sept 2023): Sorts extensions before hashing, defeating randomization. Has 8 components including protocol type, TLS version, SNI presence, cipher/extension counts, and truncated SHA256 hashes.
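
The difference between the two is easy to see in code. The sketch below computes a JA3 hash from already-parsed ClientHello values, plus a simplified order-insensitive digest in the spirit of JA4. The second function is not the full JA4 format (which has more components); it only shows why sorting neutralizes randomized extension order. The numeric values in the usage note are illustrative.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """JA3: MD5 over 'version,ciphers,extensions,curves,formats', lists dash-joined in wire order."""
    lists = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in values) for values in lists]
    return hashlib.md5(",".join(fields).encode()).hexdigest()


def sorted_extension_hash(extensions) -> str:
    """Simplified JA4-style digest: sort the extensions first, then hash and truncate."""
    joined = ",".join(f"{ext:04x}" for ext in sorted(extensions))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]
```

For example, with `exts = [0, 23, 65281, 10, 11, 35, 16, 5, 13]`, calling `ja3_hash(771, [4865, 4866, 4867], exts, [29, 23, 24], [0])` gives a different result once `exts` is shuffled, while `sorted_extension_hash(exts)` stays the same.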

Cloudflare analyzes 15+ million unique JA4 fingerprints from 500M+ user agents daily, computing aggregate stats like browser_ratio_1h (what % of traffic with this fingerprint is browser-based) and reqs_quantile_1h (volume anomaly detection).

Layer 3: HTTP/2 Fingerprinting

This is one of the hardest signals to spoof. The HTTP/2 SETTINGS frame sent at connection startup contains values like HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, and the order of pseudo-headers (:method, :authority, :scheme, :path). Most HTTP libraries don't let you customize these. Cloudflare has 50+ heuristics based on HTTP/2 fingerprints alone.
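
One way to picture such a fingerprint is as a compact string built from the SETTINGS values, the initial WINDOW_UPDATE increment, and the pseudo-header order. The sketch below is a simplified layout assumed for illustration, not Cloudflare's internal format; the setting identifiers come from the HTTP/2 specification, and the values in the usage note are made up.

```python
# SETTINGS identifiers as defined by the HTTP/2 specification.
SETTINGS_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def h2_fingerprint(settings, window_update, pseudo_headers) -> str:
    """Build 'id:value;...|window_update|m,a,s,p' from what the client sent.

    settings: list of (name, value) pairs in the order they appeared on the wire.
    pseudo_headers: pseudo-header names in the order they appeared, e.g. ':method'.
    """
    settings_part = ";".join(f"{SETTINGS_IDS[name]}:{value}" for name, value in settings)
    header_part = ",".join(name.lstrip(":")[0] for name in pseudo_headers)
    return f"{settings_part}|{window_update}|{header_part}"
```

Two clients that send identical headers at the HTTP level can still differ here: swapping `:scheme` and `:path`, or omitting one SETTINGS entry, yields a different string even though the request looks the same to the application.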

Layer 4: HTTP Header Analysis

Header order, presence/absence, and consistency. Typical red flags are a claimed Chrome browser missing sec-ch-ua headers, or headers in the wrong order for the claimed browser.

Layer 5: JavaScript Detection (JSD)

An invisible JS snippet injected into HTML responses that:

  1. Sets __CF$cv$params with a ray ID and timestamp
  2. Fetches /cdn-cgi/challenge-platform/scripts/jsd/main.js
  3. Collects fingerprint data (navigator.webdriver, plugins, chrome object, canvas, WebGL, AudioContext)
  4. POSTs results back, receives a cf_clearance cookie

Layer 6: Behavioral Analysis

Unsupervised ML models establish baselines of normal visitor behavior and detect anomalies in session traversal paths, request sequences, and interaction patterns. Because they look for deviations from a baseline instead of known signatures, they can detect previously unknown bot types.

Layer 7: Proof of Work

SHA256 puzzle solved by Web Workers (uses 75% of navigator.hardwareConcurrency cores). Difficulty is personalized -- low for trusted visitors, high for suspicious ones. The solve time itself is a signal (too fast = specialized hardware, too slow = resource-constrained bot).
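
The general shape of a SHA-256 puzzle with adjustable difficulty can be sketched in a few lines. This is a generic leading-zero-bits construction, assumed for illustration; the actual challenge format, nonce encoding, and difficulty scale used by Cloudflare are not described in the sources above.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the front of the digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce; expected work doubles with each extra difficulty bit."""
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking costs a single hash, no matter how hard the puzzle was to solve."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits
```

The asymmetry is the point: the server picks `difficulty_bits` per visitor, the client pays roughly 2^difficulty_bits hashes on average, and the server verifies with one.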

Turnstile's ML Pipeline

  • Heuristics engine: ~20 microseconds, hundreds of rules, catches ~15% of global traffic
  • ML engine (CatBoost): ~50 microseconds per model, processes 250+ request attributes, handles majority of detections
  • Total detection budget: Under 100 microseconds per request
  • Training data: Trillions of requests weekly across 26M+ properties

Challenge Obfuscation

The IUAM (I'm Under Attack Mode) challenge uses:

  • window._cf_chl_opt configuration with cRay as the decryption key for the second-stage script
  • Custom LZW compression with 6-bit binary packing
  • Different script variants per site
  • Scripts re-obfuscated regularly with new keys
  • cRay rotates with every request (no replay)

Private Access Tokens

On Apple devices (iOS 16+, macOS Ventura+), the device's secure enclave performs cryptographic attestation via the Privacy Pass protocol. This essentially lets Apple vouch "this is a real device" without identifying which device. When available, this can eliminate challenges entirely.


3. hCaptcha

hCaptcha is fascinating because of its unique business model and surprisingly open architecture (it's been more thoroughly reverse-engineered than the others).

Business Model Affects Everything

hCaptcha is run by Intuition Machines. Unlike Google (which monetizes user data for ads), hCaptcha pays websites to use their captcha. The challenges themselves are ML training tasks -- when you "select all images with bicycles," you're labeling data for paying ML customers. The service is also connected to the HUMAN Protocol blockchain marketplace (HMT token). This means:

  • Challenges must produce useful ML annotations, not just arbitrary puzzles
  • They can credibly market privacy because their revenue comes from data labeling, not user tracking
  • Volume = revenue, so the free tier is genuinely free

Multi-Layer Encryption

hCaptcha's fingerprinting script (introduced v1.10.7, June 2022) is encrypted and decrypted inside WebAssembly before execution. The reverse engineering effort (Implex-ltd/hcaptcha-reverse) revealed:

  • Three-layer encryption: AES-256-GCM keys derived in WASM + AES-128-CBC at the JS level
  • Key generation uses a Linear Congruential Generator (LCG) with multiplier 6364136223846793005
  • The "N data" payload contains encrypted fingerprint events + device motion data + timestamps
  • A CRC-32 checksum validates payload integrity
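
Two of these building blocks are simple enough to sketch. The multiplier below is the one quoted above; the increment (Knuth's MMIX constant), the choice to emit the top byte of each state, and the `seal`/`check` helper names are assumptions for illustration, since the reverse-engineering notes cited here do not spell them out.

```python
import zlib

LCG_MULTIPLIER = 6364136223846793005  # multiplier quoted in the text
LCG_INCREMENT = 1442695040888963407   # assumption: Knuth's MMIX increment
MASK64 = (1 << 64) - 1


def lcg_bytes(seed: int, n: int) -> bytes:
    """Derive n pseudo-random bytes from a 64-bit linear congruential generator."""
    state = seed & MASK64
    out = bytearray()
    for _ in range(n):
        state = (state * LCG_MULTIPLIER + LCG_INCREMENT) & MASK64
        out.append(state >> 56)  # the high bits of an LCG are the best mixed
    return bytes(out)


def seal(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(sealed: bytes) -> bytes:
    """Return the payload if its trailing CRC-32 matches, else raise."""
    payload, tag = sealed[:-4], sealed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != tag:
        raise ValueError("CRC-32 mismatch")
    return payload
```

Note that an LCG is fully predictable once the seed is known and CRC-32 only catches accidental or naive tampering, so in a design like this the real protection has to come from the AES layers and from hiding the seed inside WASM.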

What hCaptcha Fingerprints (~40+ signals)

From d4c5d1e0/hcaptcha reverse engineering, signals are categorized by ID:

Category | IDs | What
Screen/Display | 101-104 | Window dimensions, color depth, pixel ratio, touch points
CSS | 201-205, 301-304 | CSS rendering, getClientRects(), emoji rendering, color defaults
Device | 602-603 | Language, platform, CPU arch, device memory, core count
Browser | 401-402, 501, 701-703 | Window property enumeration, media sources, Client Hints
Canvas | 1101 | Text rendering fingerprint, hashed with xxHash3
WebGL | 2401-2402 | Vendor/renderer, WebGL property hash
Audio | - | Voice/codec data, hashed
WebRTC | - | Peer connection fingerprint, hashed
Math | 1901 | Browser-specific floating point behavior (hashed)
Automation | 2301-2304, 2601-2701 | Worker thread analysis, extension detection, webdriver, Playwright globals, CDP side effects, missing Chrome objects

~15+ parameters use xxHash3 with seed 5575352424011909552 for obfuscation.

Proof of Work on Every Request

Even in "passive" mode, hCaptcha runs Hashcash-based proof-of-work via WASM on every single request. It uses rust-hashcash v0.3.3, with the difficulty extracted from a JWT. The puzzle typically completes in <10ms on modern hardware, but it creates a computational cost floor for attackers at scale.

Motion Data (Since v1.80.0)

Mouse movement data is now encrypted and embedded in the N payload with a new obfuscated VM. The structure (detectivevoke/hcaptcha-motion-data):

  • st: start time
  • mm: mouse movements (x, y, timestamp)
  • mm-mp: movement metric
  • md/mu: mouse down/up events with metrics
  • topLevel: screen dimensions, navigator properties, webdriver status
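
A rough sketch of how such a record could be assembled from raw pointer events follows. The key names match the list above; treating each `-mp` value as the mean period (average gap in milliseconds) between events of that type is an assumption, and `build_motion` is a hypothetical helper, not hCaptcha's code. The `topLevel` block is omitted.

```python
def build_motion(events, start_ms):
    """Group raw events into an hCaptcha-style motion record.

    events: list of (kind, x, y, t_ms) tuples with kind in {"mm", "md", "mu"}.
    """
    record = {"st": start_ms, "mm": [], "md": [], "mu": []}
    for kind, x, y, t_ms in events:
        record[kind].append([x, y, t_ms])
    for kind in ("mm", "md", "mu"):
        times = [t for _, _, t in record[kind]]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        # Assumption: "-mp" is the mean period between consecutive events.
        record[f"{kind}-mp"] = sum(gaps) / len(gaps) if gaps else 0
    return record
```

Even this small summary statistic is informative: scripted input often produces a perfectly regular period, while real pointer events arrive at irregular intervals.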

Enterprise vs Free: Massive Gap

Free tier: basic challenges + fingerprinting + PoW. Enterprise adds:

  • Full passive mode (zero visual challenges)
  • Three risk scores: Bot Score, Fraud Score, Account Takeover Score (0.0 to 1.0)
  • Advanced Threat Signatures: ML clustering that groups attackers across thousands of IPs/devices into threat clusters
  • AI Agent Detection: Specifically identifies OpenAI Operator and similar
  • Private Learning: Custom ML models trained on customer data + hCaptcha platform data

4. The Signal Taxonomy (All Systems)

Here's the complete picture of what these systems look at, organized by difficulty to spoof:

Easy to Spoof

  • User-Agent string
  • HTTP headers (order, presence)
  • Screen resolution, timezone, language
  • navigator.webdriver (just set to false)

Medium Difficulty

  • IP reputation (residential proxies help, but behavioral analysis continues)
  • Canvas fingerprint (can be randomized but inconsistency is itself a signal)
  • WebGL vendor/renderer (must match claimed OS)
  • Plugins, fonts, feature enumeration
  • Google cookies (just log into a Google account... but then you're tracked)

Hard to Spoof

  • TLS fingerprint (JA4): Requires matching the exact TLS implementation of the claimed browser. Most HTTP libraries can't do this.
  • HTTP/2 SETTINGS frame: SETTINGS values and pseudo-header order are baked into the HTTP library. Most don't expose customization.
  • Canvas/WebGL cross-validation: Canvas output must be consistent with the GPU reported by WebGL, which must be consistent with the claimed OS. Faking one without the others creates inconsistencies.
  • Behavioral biometrics: Mouse velocity, acceleration, curvature, pause duration before clicks. Bots move in straight lines or perfect Bezier curves; humans are jittery and inconsistent.
  • CDP/Automation artifacts: Runtime.enable serialization side effects, cdc_ variables, missing Chrome objects in headless mode.
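
The cross-validation idea can be sketched as a set of pairwise consistency checks over collected signals. The signal names and the specific rules below are hypothetical and far cruder than what a production system would use; they only illustrate why faking one value in isolation tends to create a contradiction somewhere else.

```python
def inconsistencies(signals: dict) -> list:
    """Return human-readable contradictions between independently collected signals."""
    ua = signals.get("user_agent", "").lower()
    platform = signals.get("platform", "").lower()
    renderer = signals.get("webgl_renderer", "").lower()
    problems = []
    if "windows nt" in ua and not platform.startswith("win"):
        problems.append("User-Agent says Windows but navigator.platform disagrees")
    if "windows nt" in ua and "apple" in renderer:
        problems.append("User-Agent says Windows but WebGL reports an Apple GPU")
    if "mac os x" in ua and "direct3d" in renderer:
        problems.append("User-Agent says macOS but WebGL reports a Direct3D backend")
    if signals.get("webdriver"):
        problems.append("navigator.webdriver is set")
    return problems
```

Each rule alone is weak and easy to satisfy; the difficulty for an attacker comes from satisfying all of them at once, across every layer, for every request.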

Nearly Impossible to Spoof at Scale

  • BotGuard VM attestation: The self-modifying bytecode VM that changes its own opcodes. You'd need to decompile and recompile it for every new version Google pushes.
  • Cross-request ML models: Cloudflare's behavioral analysis across sessions, hCaptcha's threat signature clustering. Even with perfect per-request signals, anomalous patterns across thousands of requests are detectable.
  • Private Access Tokens: Hardware attestation from the device's secure enclave. Can't fake this without physical hardware.
  • Mobile sensor data: Accelerometer, gyroscope, magnetometer readings during touch events. When a real human touches a phone, the device physically moves; bots don't generate this IMU data.

5. About Your Mouse Wiggling

You're right to wiggle! Here's exactly why it works:

When you see the reCAPTCHA checkbox, the system is already collecting mouse movement data from the moment the page loaded. The checkbox click itself is almost irrelevant -- what matters is the trajectory of your mouse approaching the checkbox:

  • Human mouse movements have characteristic jitter, micro-corrections, and variable velocity. The time taken to reach a target follows Fitts's Law (larger/closer targets are reached faster).
  • Bot mouse movements tend to be either perfectly straight lines, perfect Bezier curves (the common fake), or absent entirely.
  • The pause before clicking matters -- humans have variable reaction times; bots are either instant or use fixed delays.
  • Post-click behavior matters too -- humans often move the mouse away or continue scrolling; bots stop dead.
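
A couple of these cues can be turned into numbers with very little code. The sketch below computes path straightness and the variability of speed for a recorded trajectory, plus the Fitts's Law index of difficulty in its Shannon form. The feature names are hypothetical, and real behavioral models use many more features than this.

```python
import math
from statistics import mean, pstdev


def fitts_index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of Fitts's Law: ID = log2(D / W + 1), in bits."""
    return math.log2(distance / width + 1)


def trajectory_features(points) -> dict:
    """points: list of (x, y, t_ms) samples with strictly increasing timestamps."""
    steps = list(zip(points, points[1:]))
    lengths = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in steps]
    speeds = [length / (b[2] - a[2]) for length, (a, b) in zip(lengths, steps)]
    path = sum(lengths)
    direct = math.hypot(points[-1][0] - points[0][0], points[-1][1] - points[0][1])
    avg_speed = mean(speeds)
    return {
        # 1.0 means a perfectly straight path; human paths fall below that.
        "straightness": direct / path if path else 1.0,
        # Coefficient of variation of speed; 0.0 means perfectly constant speed.
        "speed_cv": pstdev(speeds) / avg_speed if avg_speed else 0.0,
    }
```

A scripted move such as `[(0, 0, 0), (10, 0, 10), (20, 0, 20)]` scores a straightness of 1.0 and a speed variation of 0.0, which is exactly the kind of profile that stands out against jittery human input.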

By wiggling, you're generating a rich stream of movement data with natural human entropy -- random velocities, direction changes, micro-pauses. This feeds the behavioral model with strong "human" signals and likely pushes your score well above the challenge threshold.

hCaptcha's motion data format (mm array) records every mouse movement with coordinates and timestamps. Research projects like detectivevoke/hcaptcha-motion-data try to fake this with randomized Bezier curves, but the statistical distribution of real human movement is hard to replicate convincingly.


6. The Arms Race (Anti-Detection Tools)

The evasion ecosystem is active:

  • puppeteer-extra-plugin-stealth: JS-level API patching. Largely defeated post-2022 when Chrome unified headless/headful codebases.
  • undetected-chromedriver: Patches cdc_ variables in the chromedriver binary. Still evades basic detection but fails against protocol-level checks.
  • Nodriver (Python): Drops Selenium and the chromedriver binary and talks to Chrome directly over CDP. More effective but limited functionality.
  • Rebrowser: Drop-in Playwright/Puppeteer replacement that minimizes high-risk CDP domains (Runtime, Console).
  • Antidetect browsers (GoLogin, Multilogin): Full browser profiles with consistent fingerprints, residential proxies, real browser engines. The closest to undetectable, but expensive and don't scale.
  • FlareSolverr: Runs a full browser to solve Cloudflare challenges and return cookies. Works but is slow and resource-intensive.

The key inflection point was November 2022 when Google unified headless and headful Chrome, eliminating most API-level differences. Detection shifted from "is this headless?" to "is CDP active?" -- a much harder question to mask.


TL;DR

You're right: it's black boxes all the way down. These systems collect 40-250+ signals across every layer of the stack (TCP, TLS, HTTP/2, HTTP headers, JavaScript environment, rendering, behavior, hardware), encrypt and obfuscate the collection code, send it all to server-side ML models trained on trillions of requests, and get back a probability. No single signal is decisive. The genius is in the cross-validation -- each signal is easy to fake in isolation, but making them all consistent with each other at scale is extraordinarily difficult.

Google's moat is BotGuard's VM + Google cookies. Cloudflare's moat is protocol-level fingerprinting (TLS/HTTP/2) + massive traffic dataset. hCaptcha's moat is multi-layer WASM encryption + proof-of-work economics.

And your mouse wiggling? Statistically optimal human behavior.
