statico/claude-captcha-deepdive.md

## claude-captcha-deepdive.md

      
How Modern CAPTCHAs Actually Work: A Deep Dive

The Big Picture

Your intuition is correct: all three systems fundamentally collect a mass of signals, send them to black-box server-side ML models, and get back a probabilistic score. None of them has a clean, deterministic "if X then bot" rule; it's all heuristics and probability. But the depth of what they collect is staggering.

1. Google reCAPTCHA: The VM

Google's approach is the most opaque and arguably the most sophisticated. The core of reCAPTCHA is BotGuard -- a custom virtual machine implemented in JavaScript that runs a proprietary bytecode language inside your browser.

The BotGuard VM

This has been partially reverse-engineered (see neuroradiology/InsideReCaptcha and dsekz/botguard-reverse), and yes, it is a real VM:

Custom bytecode interpreter that emulates a register-based CPU mimicking x86 architecture
Bytecode is encrypted with XTEA (extended Tiny Encryption Algorithm) -- each 8-byte block XORed with a keystream
The VM has self-modifying code: it changes its own decryption keys and its own opcode numbers at runtime
Register values are encrypted; opcodes include flow-changing instructions
The bytecode has direct access to the JavaScript variables of its own interpreter -- it can reach out and touch the JS environment
To even see the instructions, you must write a custom disassembler, decompiler, AND debugger
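To make the "8-byte block XORed with a keystream" idea concrete, here is a minimal sketch of standard XTEA used as a keystream generator. The counter-based keystream, the key values, and the function names are assumptions for illustration; the source does not say how BotGuard derives its keys or block inputs.

```python
import struct

DELTA = 0x9E3779B9
MASK = 0xFFFFFFFF


def xtea_encrypt_block(v0: int, v1: int, key: tuple, rounds: int = 32) -> tuple:
    """Standard XTEA encipher of one 64-bit block held as two 32-bit words."""
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + key[total & 3]))) & MASK
        total = (total + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + key[(total >> 11) & 3]))) & MASK
    return v0, v1


def xor_with_keystream(data: bytes, key: tuple) -> bytes:
    """XOR each 8-byte block with XTEA(counter). Assumed mode, not BotGuard's."""
    out = bytearray()
    for offset in range(0, len(data), 8):
        k0, k1 = xtea_encrypt_block(0, offset // 8, key)
        keystream = struct.pack(">II", k0, k1)
        chunk = data[offset:offset + 8]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)


# Hypothetical key and payload, for demonstration only.
demo_key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
demo_cipher = xor_with_keystream(b"hypothetical bytecode", demo_key)
demo_plain = xor_with_keystream(demo_cipher, demo_key)
```

Because XOR is its own inverse, applying the same keystream twice restores the input, which is why a VM that rewrites its keys at runtime forces an analyst to track the key state at every step.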

The key insight from dsekz/botguard-reverse: BotGuard is not a fingerprinter per se. It's a browser attestation system. It generates a token that proves the code was executed by a real browser. The token can only be generated if the full VM executes correctly in a genuine browser environment. Google then correlates this token server-side with other signals (IP reputation, cookies, session history).

What reCAPTCHA Collects (from InsideReCaptcha reverse engineering)

Two byte arrays (xhr1 and xhr2) are assembled by the VM and sent to Google:


| Signal | Details |
| --- | --- |
| Browser plugins | Full enumeration |
| User-Agent | String |
| Screen resolution | Dimensions |
| Execution time | How long the VM took to run |
| Timezone | Offset |
| Interaction counts | Clicks, keyboard events, touch actions within the captcha iframe |
| Browser function behavior | Tests specific JS functions and checks how they behave |
| CSS rule rendering | How the browser applies CSS (differs per engine) |
| Canvas rendering | Draws to canvas and hashes the output |
| Google cookies | Since reCAPTCHA runs on google.com, it has access to all Google cookies |
| Dynamic key hashes | Hashes of `Function.toString()` output, browser-specific APIs, hostname |

xhr2 is additionally encrypted with XTEA before transmission to https://www.google.com/recaptcha/api2/frame.

The Google Account Advantage

This is reCAPTCHA's secret weapon that competitors can't match: because the script runs on the google.com domain, it can read Google cookies. If you're logged into a Google account with years of browsing history, you get a massive trust boost. Research has confirmed that the Google tracking cookie "plays a crucial role in determining the difficulty of challenge presented to the user." A fresh incognito window with no Google session almost always gets a harder challenge.
A 2023 study estimated that reCAPTCHA has cost roughly 819 million hours of human time, and that the cookie and tracking data it collects has been worth billions to Google.

reCAPTCHA v3 Scoring

v3 runs invisibly and returns a score from 0.0 (bot) to 1.0 (human). Google doesn't publish the exact scoring formula, but reported signals include:

All of the above fingerprinting data
Mouse movement patterns on the page (not just in the captcha widget)
Browsing behavior across the entire page
Full screenshots of the browser window (according to GDPR-focused analyses)
Historical behavior across all sites using reCAPTCHA
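On the site owner's side, the score only matters once it is compared against a threshold. A minimal sketch, assuming the parsed JSON body of a v3 verification response with its `success`, `score`, and `action` fields; the function name, the three outcome labels, and the 0.5 cutoff are illustrative choices, not anything Google prescribes for your site.

```python
def assess_v3(response: dict, expected_action: str, threshold: float = 0.5) -> str:
    """Turn a parsed v3 verification response into an allow/challenge/reject decision."""
    if not response.get("success"):
        return "reject"  # token invalid, expired, or already used
    if response.get("action") != expected_action:
        return "reject"  # token was minted for a different page action
    score = float(response.get("score", 0.0))
    return "allow" if score >= threshold else "challenge"


# Hypothetical response bodies, for demonstration only.
likely_human = {"success": True, "score": 0.9, "action": "login"}
likely_bot = {"success": True, "score": 0.1, "action": "login"}
```

The score is a probability, so the threshold is a policy decision: a login form might step up to a second factor at 0.5, while a comment box might tolerate 0.3.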

reCAPTCHA Enterprise

Adds risk reason codes: AUTOMATION (headless browsers), UNEXPECTED_ENVIRONMENT (emulated environments), TOO_MUCH_TRAFFIC, UNEXPECTED_USAGE_PATTERNS, LOW_CONFIDENCE_SCORE, plus fraud-specific signals for payment protection.

2. Cloudflare Turnstile

Cloudflare's approach is the most layered -- they have signals at every level of the stack, from TCP packets up to JavaScript behavior.

The Detection Stack (Bottom to Top)

Layer 1: TCP/IP Fingerprinting
Before your browser even sends HTTP, Cloudflare analyzes the TCP SYN packet. Different OS TCP/IP stacks set different initial window sizes, TTL values, and TCP options. A request claiming to be Chrome on Windows but with Linux TCP characteristics is immediately suspicious.
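The consistency check described here can be sketched in a few lines. The TTL defaults below are commonly cited values and are used purely for illustration; real stacks vary by version and configuration, and nothing here reflects Cloudflare's actual rules. All names are hypothetical.

```python
from typing import Optional

# Illustrative defaults only; real stacks vary by version and configuration.
INITIAL_TTL_BY_OS = {"windows": 128, "linux": 64, "macos": 64}


def estimate_initial_ttl(observed_ttl: int) -> int:
    """Round an observed TTL up to the nearest common initial value."""
    for initial in (64, 128, 255):
        if observed_ttl <= initial:
            return initial
    raise ValueError("TTL must be between 0 and 255")


def claimed_os(user_agent: str) -> Optional[str]:
    """Very rough OS extraction from a User-Agent string."""
    ua = user_agent.lower()
    if "windows nt" in ua:
        return "windows"
    if "mac os x" in ua:
        return "macos"
    if "linux" in ua or "x11" in ua:
        return "linux"
    return None


def ttl_matches_claim(observed_ttl: int, user_agent: str) -> Optional[bool]:
    """True/False when the OS claim can be checked, None when it cannot."""
    os_name = claimed_os(user_agent)
    if os_name is None:
        return None
    return estimate_initial_ttl(observed_ttl) == INITIAL_TTL_BY_OS[os_name]
```

A packet arriving with TTL 52 most plausibly started at 64, so a User-Agent claiming Windows would be flagged as inconsistent, while the same TTL with a Linux User-Agent would pass.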
Layer 2: TLS Fingerprinting (JA3/JA4)
The TLS ClientHello message reveals the client's supported cipher suites, extensions, and elliptic curves -- each TLS library (Chrome, Firefox, curl, Python requests, Go net/http) produces a distinctive fingerprint.

JA3 (original): MD5 hash of ClientHello fields. Degraded after Chrome 110+ started randomizing extension order.
JA4 (successor, Sept 2023): Sorts extensions before hashing, defeating randomization. Has 8 components including protocol type, TLS version, SNI presence, cipher/extension counts, and truncated SHA256 hashes.
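The effect of extension randomization is easy to demonstrate. The first function below follows the published JA3 recipe (five comma-separated fields, values joined by dashes, then MD5). The second is a simplified stand-in for the sorting idea only; it is not the full JA4 format, and the ClientHello values are made up for the example.

```python
import hashlib
import random


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """JA3: MD5 over the ClientHello fields in the order they were sent."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()


def sorted_extension_hash(ciphers, extensions) -> str:
    """Simplified JA4-style idea: sort before hashing, truncate SHA-256."""
    material = ",".join(map(str, sorted(ciphers))) + "_" + ",".join(map(str, sorted(extensions)))
    return hashlib.sha256(material.encode()).hexdigest()[:12]


# Hypothetical ClientHello contents.
ciphers = [4865, 4866, 4867]
extensions = [0, 23, 65281, 10, 11, 35, 16]
shuffled = extensions[:]
while shuffled == extensions:
    random.shuffle(shuffled)

ja3_original = ja3_hash(771, ciphers, extensions, [29, 23, 24], [0])
ja3_shuffled = ja3_hash(771, ciphers, shuffled, [29, 23, 24], [0])
stable_original = sorted_extension_hash(ciphers, extensions)
stable_shuffled = sorted_extension_hash(ciphers, shuffled)
```

The same client produces a different JA3 on every shuffled handshake, while the sorted hash stays fixed, which is the property that lets JA4 survive randomization.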

Cloudflare analyzes 15+ million unique JA4 fingerprints from 500M+ user agents daily, computing aggregate stats like browser_ratio_1h (what % of traffic with this fingerprint is browser-based) and reqs_quantile_1h (volume anomaly detection).
Layer 3: HTTP/2 Fingerprinting
This is one of the hardest signals to spoof. The HTTP/2 SETTINGS frame sent at connection startup contains values like HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, and the order of pseudo-headers (:method, :authority, :scheme, :path). Most HTTP libraries don't let you customize these. Cloudflare has 50+ heuristics based on HTTP/2 fingerprints alone.
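One way to picture an HTTP/2 fingerprint is as a string built from the SETTINGS entries, in the order sent, plus the pseudo-header order. The setting identifiers below are the ones defined in the HTTP/2 specification; the output format, function name, and example values are assumptions for illustration and are not Cloudflare's internal representation.

```python
# Setting identifiers as defined by the HTTP/2 specification.
SETTINGS_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def h2_fingerprint(settings: list, pseudo_headers: list) -> str:
    """Build 'id:value;...|m,a,s,p' from ordered settings and pseudo-headers."""
    settings_part = ";".join(f"{SETTINGS_IDS[name]}:{value}" for name, value in settings)
    # ":method" -> "m", ":authority" -> "a", and so on.
    order_part = ",".join(header[1] for header in pseudo_headers)
    return f"{settings_part}|{order_part}"


# Hypothetical client profiles.
client_a = h2_fingerprint(
    [("HEADER_TABLE_SIZE", 65536), ("INITIAL_WINDOW_SIZE", 6291456)],
    [":method", ":authority", ":scheme", ":path"],
)
client_b = h2_fingerprint(
    [("HEADER_TABLE_SIZE", 4096), ("MAX_CONCURRENT_STREAMS", 100)],
    [":method", ":path", ":scheme", ":authority"],
)
```

Because both the values and their ordering come from the HTTP library rather than from application code, two clients sending an identical User-Agent can still produce visibly different strings.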
Layer 4: HTTP Header Analysis
Header order, presence/absence, and consistency are all checked. Typical red flags are a client that claims to be Chrome but sends no sec-ch-ua headers, or one that sends headers in an order the claimed browser never uses.
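A toy version of such a header check might look like the following. The expected order list is a placeholder, not a real browser profile, and the function name is hypothetical.

```python
# Placeholder relative order; a real system would hold one profile per browser version.
ILLUSTRATIVE_CHROME_ORDER = ["sec-ch-ua", "user-agent", "accept"]


def header_anomalies(headers: list) -> list:
    """headers is an ordered list of (name, value) pairs as received."""
    names = [name.lower() for name, _ in headers]
    values = {name.lower(): value for name, value in headers}
    user_agent = values.get("user-agent", "")
    issues = []
    if not user_agent:
        issues.append("missing user-agent")
    if "Chrome/" in user_agent:
        if "sec-ch-ua" not in names:
            issues.append("claims Chrome but sends no sec-ch-ua")
        positions = [names.index(h) for h in ILLUSTRATIVE_CHROME_ORDER if h in names]
        if positions != sorted(positions):
            issues.append("header order does not match claimed browser")
    return issues


chrome_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36"
consistent = [("sec-ch-ua", '"Chromium";v="120"'), ("User-Agent", chrome_ua), ("Accept", "*/*")]
suspicious = [("Accept", "*/*"), ("User-Agent", chrome_ua)]
```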
Layer 5: JavaScript Detection (JSD)
An invisible JS snippet injected into HTML responses that:

Sets __CF$cv$params with a ray ID and timestamp
Fetches /cdn-cgi/challenge-platform/scripts/jsd/main.js
Collects fingerprint data (navigator.webdriver, plugins, chrome object, canvas, WebGL, AudioContext)
POSTs results back, receives a cf_clearance cookie

Layer 6: Behavioral Analysis
Unsupervised ML models establish baselines of normal visitor behavior and detect anomalies in session traversal paths, request sequences, and interaction patterns. Because they model what normal looks like rather than what known bots look like, they can flag previously unknown bot types.
Layer 7: Proof of Work
SHA256 puzzle solved by Web Workers (uses 75% of navigator.hardwareConcurrency cores). Difficulty is personalized -- low for trusted visitors, high for suspicious ones. The solve time itself is a signal (too fast = specialized hardware, too slow = resource-constrained bot).
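A SHA-256 puzzle of this kind can be sketched as a search for a nonce whose hash starts with a given number of zero bits. The challenge format, nonce encoding, and function names are assumptions, and `os.cpu_count()` stands in for `navigator.hardwareConcurrency`; Cloudflare's actual puzzle format is not described in the source.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Brute-force search; expected work doubles with every extra bit."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


def worker_count() -> int:
    """Use about 75% of the available cores, but at least one."""
    cores = os.cpu_count() or 1
    return max(1, int(cores * 0.75))


demo_nonce = solve(b"hypothetical-challenge", 12)
```

Verification costs one hash while solving costs about 2^difficulty hashes on average, which is what makes per-visitor difficulty a cheap lever for the server.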
Turnstile's ML Pipeline


Heuristics engine: ~20 microseconds, hundreds of rules, catches ~15% of global traffic
ML engine (CatBoost): ~50 microseconds per model, processes 250+ request attributes, handles majority of detections
Total detection budget: Under 100 microseconds per request
Training data: Trillions of requests weekly across 26M+ properties

Challenge Obfuscation

The IUAM (I'm Under Attack Mode) challenge uses:

window._cf_chl_opt configuration with cRay as the decryption key for the second-stage script
Custom LZW compression with 6-bit binary packing
Different script variants per site
Scripts re-obfuscated regularly with new keys
cRay rotates with every request (no replay)

Private Access Tokens

On Apple devices (iOS 16+, macOS Ventura+), the device's secure enclave performs cryptographic attestation via the Privacy Pass protocol. This essentially lets Apple vouch "this is a real device" without identifying which device. When available, this can eliminate challenges entirely.

3. hCaptcha

hCaptcha is fascinating because of its unique business model and because its architecture is unusually well understood (it's been more thoroughly reverse-engineered than the others).

Business Model Affects Everything

hCaptcha is run by Intuition Machines. Unlike Google (which monetizes user data for ads), hCaptcha pays websites to use their captcha. The challenges themselves are ML training tasks -- when you "select all images with bicycles," you're labeling data for paying ML customers. The labeling marketplace is connected to the HUMAN Protocol blockchain (HMT token). This means:

Challenges must produce useful ML annotations, not just arbitrary puzzles
They can credibly market privacy because their revenue comes from data labeling, not user tracking
Volume = revenue, so the free tier is genuinely free

Multi-Layer Encryption

hCaptcha's fingerprinting script (introduced v1.10.7, June 2022) is encrypted and decrypted inside WebAssembly before execution. The reverse engineering effort (Implex-ltd/hcaptcha-reverse) revealed:

Three-layer encryption: AES-256-GCM keys derived in WASM + AES-128-CBC at the JS level
Key generation uses a Linear Congruential Generator (LCG) with multiplier 6364136223846793005
The "N data" payload contains encrypted fingerprint events + device motion data + timestamps
A CRC-32 checksum validates payload integrity
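Two of these building blocks are simple enough to sketch directly. The multiplier is the one quoted above; the increment, the choice to emit the top byte of each state, and the checksum placement are assumptions for illustration, since the source does not give them.

```python
import zlib

LCG_MULTIPLIER = 6364136223846793005
LCG_INCREMENT = 1442695040888963407  # assumption: not given in the source
MASK64 = (1 << 64) - 1


def lcg_bytes(seed: int, count: int) -> bytes:
    """Generate bytes from a 64-bit LCG, taking the top byte of each state."""
    state = seed & MASK64
    out = bytearray()
    while len(out) < count:
        state = (state * LCG_MULTIPLIER + LCG_INCREMENT) & MASK64
        out.append(state >> 56)
    return bytes(out)


def with_checksum(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 of the payload (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def checksum_ok(blob: bytes) -> bool:
    payload, tail = blob[:-4], blob[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tail


framed = with_checksum(b"hypothetical payload")
tampered = bytes([framed[0] ^ 1]) + framed[1:]
```

An LCG is fully determined by its seed, so anyone who recovers the seed and constants can regenerate the key material; the protection comes from hiding those values inside the WASM, not from the generator itself.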

What hCaptcha Fingerprints (~40+ signals)

From d4c5d1e0/hcaptcha reverse engineering, signals are categorized by ID:


| Category | IDs | What |
| --- | --- | --- |
| Screen/Display | 101-104 | Window dimensions, color depth, pixel ratio, touch points |
| CSS | 201-205, 301-304 | CSS rendering, `getClientRects()`, emoji rendering, color defaults |
| Device | 602-603 | Language, platform, CPU arch, device memory, core count |
| Browser | 401-402, 501, 701-703 | Window property enumeration, media sources, Client Hints |
| Canvas | 1101 | Text rendering fingerprint, hashed with xxHash3 |
| WebGL | 2401-2402 | Vendor/renderer, WebGL property hash |
| Audio | - | Voice/codec data, hashed |
| WebRTC | - | Peer connection fingerprint, hashed |
| Math | 1901 | Browser-specific floating point behavior (hashed) |
| Automation | 2301-2304, 2601-2701 | Worker thread analysis, extension detection, webdriver, Playwright globals, CDP side effects, missing Chrome objects |


At least 15 parameters are hashed with xxHash3 (seed 5575352424011909552) for obfuscation.

Proof of Work on Every Request

Even in "passive" mode, hCaptcha runs a Hashcash-based proof of work via WASM on every single request. It uses rust-hashcash v0.3.3, with the difficulty extracted from a JWT. The puzzle typically completes in <10ms on modern hardware, but it creates a computational cost floor for attackers at scale.
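Reading a value out of a JWT only requires decoding its middle segment. A minimal sketch follows; the claim name `difficulty` is hypothetical (the source does not name the field), and the signature is deliberately not verified here, so this shows how a client could read the value, not how a server should trust one.

```python
import base64
import json


def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def encode_segment(obj: dict) -> str:
    """Helper used only to build a demo token."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token with a made-up claim name and a placeholder signature.
demo_token = ".".join([
    encode_segment({"alg": "HS256", "typ": "JWT"}),
    encode_segment({"difficulty": 2, "exp": 1700000000}),
    "signature-placeholder",
])
demo_difficulty = jwt_payload(demo_token)["difficulty"]
```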
Motion Data (Since v1.80.0)

Mouse movement data is now encrypted and embedded in the N payload with a new obfuscated VM. The structure (detectivevoke/hcaptcha-motion-data):

st: start time
mm: mouse movements (x, y, timestamp)
mm-mp: movement metric
md/mu: mouse down/up events with metrics
topLevel: screen dimensions, navigator properties, webdriver status
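From the receiving side, a record with this shape can be sanity-checked before any model sees it. The sketch below assumes each `mm` entry is `[x, y, timestamp_ms]` as described above; the nested `webdriver` key under `topLevel`, the output field names, and the sample values are assumptions.

```python
def summarize_motion(record: dict) -> dict:
    """Basic sanity features for a motion record with st/mm/md/mu/topLevel keys."""
    moves = record.get("mm", [])
    times = [t for _, _, t in moves]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "events": len(moves),
        # Timestamps that run backwards cannot come from a real event stream.
        "monotonic": all(step >= 0 for step in intervals),
        "mean_interval_ms": sum(intervals) / len(intervals) if intervals else None,
        "has_click": bool(record.get("md")) and bool(record.get("mu")),
        "webdriver": bool(record.get("topLevel", {}).get("webdriver", False)),
    }


# Hypothetical record, for demonstration only.
sample_record = {
    "st": 1000,
    "mm": [[10, 10, 1000], [12, 11, 1016], [15, 13, 1034]],
    "md": [[15, 13, 1100]],
    "mu": [[15, 13, 1180]],
    "topLevel": {"webdriver": False},
}
sample_summary = summarize_motion(sample_record)
```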

Enterprise vs Free: Massive Gap

Free tier: basic challenges + fingerprinting + PoW. Enterprise adds:

Full passive mode (zero visual challenges)
Three risk scores: Bot Score, Fraud Score, Account Takeover Score (0.0 to 1.0)
Advanced Threat Signatures: ML clustering that groups attackers across thousands of IPs/devices into threat clusters
AI Agent Detection: Specifically identifies OpenAI Operator and similar agents
Private Learning: Custom ML models trained on customer data + hCaptcha platform data


4. The Signal Taxonomy (All Systems)

Here are the signals covered above, grouped by how hard they are to spoof:

Easy to Spoof


User-Agent string
HTTP headers (order, presence)
Screen resolution, timezone, language
navigator.webdriver (just set to false)

Medium Difficulty


IP reputation (residential proxies help, but behavioral analysis continues)
Canvas fingerprint (can be randomized but inconsistency is itself a signal)
WebGL vendor/renderer (must match claimed OS)
Plugins, fonts, feature enumeration
Google cookies (just log into a Google account... but then you're tracked)

Hard to Spoof


TLS fingerprint (JA4): Requires matching the exact TLS implementation of the claimed browser. Most HTTP libraries can't do this.
HTTP/2 SETTINGS frame: SETTINGS values and pseudo-header order are baked into the HTTP library. Most don't expose customization.
Canvas/WebGL cross-validation: Canvas output must be consistent with the GPU reported by WebGL, which must be consistent with the claimed OS. Faking one without the others creates inconsistencies.
Behavioral biometrics: Mouse velocity, acceleration, curvature, pause duration before clicks. Bots move in straight lines or perfect Bezier curves; humans are jittery and inconsistent.
CDP/Automation artifacts: Runtime.enable serialization side effects, cdc_ variables, missing Chrome objects in headless mode.
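Cross-validation is what makes individually spoofable signals hard to fake together. A toy consistency checker is sketched below; the field names and the specific string rules are illustrative assumptions, far simpler than anything a production system would use.

```python
def fingerprint_inconsistencies(fp: dict) -> list:
    """Return human-readable contradictions between reported signals."""
    issues = []
    user_agent = fp.get("user_agent", "")
    renderer = fp.get("webgl_renderer", "").lower()

    if fp.get("webdriver"):
        issues.append("navigator.webdriver is true")
    if "Windows" in user_agent and "apple" in renderer:
        issues.append("Windows user agent with an Apple GPU")
    if "Mac OS X" in user_agent and "direct3d" in renderer:
        issues.append("macOS user agent with a Direct3D renderer")
    if "swiftshader" in renderer or "llvmpipe" in renderer:
        issues.append("software renderer, common in headless or virtualized setups")
    return issues


# Hypothetical fingerprints.
windows_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
plausible = {"user_agent": windows_ua, "webgl_renderer": "ANGLE (NVIDIA, Direct3D11)", "webdriver": False}
contradictory = {"user_agent": windows_ua, "webgl_renderer": "Apple M1", "webdriver": False}
```

Each rule is trivial on its own; the difficulty for an attacker is that every spoofed value has to survive every pairwise check at once.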

Nearly Impossible to Spoof at Scale


BotGuard VM attestation: The self-modifying bytecode VM that changes its own opcodes. You'd need to decompile and recompile it for every new version Google pushes.
Cross-request ML models: Cloudflare's behavioral analysis across sessions, hCaptcha's threat signature clustering. Even with perfect per-request signals, anomalous patterns across thousands of requests are detectable.
Private Access Tokens: Hardware attestation from the device's secure enclave. Can't fake this without physical hardware.
Mobile sensor data: Accelerometer, gyroscope, magnetometer readings during touch events. When a real human touches a phone, the device physically moves; bots don't generate this IMU data.


5. About Your Mouse Wiggling

You're right to wiggle! Here's the most plausible reason it works:
When you see the reCAPTCHA checkbox, the system is already collecting mouse movement data from the moment the page loaded. The checkbox click itself is almost irrelevant -- what matters is the trajectory of your mouse approaching the checkbox:

Human mouse movements have characteristic jitter, micro-corrections, and variable velocity. Movement time roughly follows Fitts's Law: larger or closer targets are reached faster.
Bot mouse movements tend to be either perfectly straight lines, perfect Bezier curves (the common fake), or absent entirely.
The pause before clicking matters -- humans have variable reaction times; bots are either instant or use fixed delays.
Post-click behavior matters too -- humans often move the mouse away or continue scrolling; bots stop dead.

By wiggling, you're generating a rich stream of movement data with natural human entropy -- random velocities, direction changes, micro-pauses. This feeds the behavioral model with strong "human" signals and likely pushes your score well above the challenge threshold.
hCaptcha's motion data format (mm array) records every mouse movement with coordinates and timestamps. Research projects like detectivevoke/hcaptcha-motion-data try to fake this with randomized Bezier curves, but the statistical distribution of real human movement is hard to replicate convincingly.
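Two of the simplest trajectory features mentioned above, straightness and speed variability, can be computed as follows. This is a sketch of the general idea only; the feature names are made up, and none of the vendors publish the features or thresholds their models use.

```python
import math
import statistics


def trajectory_features(points: list) -> dict:
    """points is a list of (x, y, t_seconds) samples in time order."""
    segments = list(zip(points, points[1:]))
    lengths = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in segments]
    durations = [b[2] - a[2] for a, b in segments]
    speeds = [length / dt for length, dt in zip(lengths, durations) if dt > 0]

    path_length = sum(lengths)
    chord = math.hypot(points[-1][0] - points[0][0], points[-1][1] - points[0][1])
    mean_speed = statistics.fmean(speeds) if speeds else 0.0
    return {
        # 1.0 means a perfectly straight path; wandering paths score lower.
        "straightness": chord / path_length if path_length else 1.0,
        # Coefficient of variation of speed; 0.0 means perfectly constant speed.
        "speed_cv": statistics.pstdev(speeds) / mean_speed if mean_speed else 0.0,
    }


# A constant-speed straight line (bot-like) versus an uneven zigzag.
straight_line = [(i * 10.0, i * 5.0, i * 0.01) for i in range(20)]
zigzag = [(0, 0, 0.0), (10, 10, 0.01), (20, 0, 0.03), (30, 10, 0.04)]
line_features = trajectory_features(straight_line)
zigzag_features = trajectory_features(zigzag)
```

The straight line scores a straightness near 1.0 with almost no speed variation, while the zigzag scores lower on straightness and higher on variability, which is the kind of separation a behavioral model can build on.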

6. The Arms Race (Anti-Detection Tools)

The evasion ecosystem is active:

puppeteer-extra-plugin-stealth: JS-level API patching. Largely defeated post-2022 when Chrome unified headless/headful codebases.
undetected-chromedriver: Patches cdc_ variables in the chromedriver binary. Still evades basic detection but fails against protocol-level checks.
Nodriver (Python): Drops Selenium and the chromedriver binary and drives the browser directly, which removes WebDriver-specific artifacts. More effective but limited functionality.
Rebrowser: Drop-in Playwright/Puppeteer replacement that minimizes high-risk CDP domains (Runtime, Console).
Antidetect browsers (GoLogin, Multilogin): Full browser profiles with consistent fingerprints, residential proxies, real browser engines. The closest to undetectable, but expensive and don't scale.
FlareSolverr: Runs a full browser to solve Cloudflare challenges and return cookies. Works but is slow and resource-intensive.

The key inflection point was November 2022 when Google unified headless and headful Chrome, eliminating most API-level differences. Detection shifted from "is this headless?" to "is CDP active?" -- a much harder question to mask.

TL;DR

You're right: it's black boxes all the way down. These systems collect 40-250+ signals across every layer of the stack (TCP, TLS, HTTP/2, HTTP headers, JavaScript environment, rendering, behavior, hardware), encrypt and obfuscate the collection code, send it all to server-side ML models trained on trillions of requests, and get back a probability. No single signal is decisive. The genius is in the cross-validation -- each signal is easy to fake in isolation, but making them all consistent with each other at scale is extraordinarily difficult.
Google's moat is BotGuard's VM + Google cookies. Cloudflare's moat is protocol-level fingerprinting (TLS/HTTP/2) + massive traffic dataset. hCaptcha's moat is multi-layer WASM encryption + proof-of-work economics.
And your mouse wiggling? Statistically optimal human behavior.