Author: Bruno
Date: September 2025
Ticket: TA-13099
PR: #976 (Blocked/Closed)
Status: Investigation Complete, Alternative Approach Explored
In September 2025, a critical race condition in Gr4vy's Secure Fields library was investigated following a production incident (August 27, 2025) that affected ~5,000 users across multiple merchants including Wikimedia and Grammarly. The issue: controller and input iframes could load in any order, causing FORM_CHANGE events to reference undefined fields when inputs loaded first.
Root Cause: BroadcastChannel messages from inputs sent before controller was listening were permanently lost, leaving the controller's fields object incomplete.
Solution Implemented (PR #976): Controller-driven sync mechanism where controller broadcasts a one-time sync request on boot, and inputs replay their current state. This "pull-based" approach ensured complete state without UX degradation.
Outcome: After thorough peer and senior architect reviews, the team decided (Sep 29-30) to explore alternative architectural approaches. PR #976 was moved to "Blocked" status, TA-13099 was closed, and TA-13399 was created to investigate a more comprehensive solution. The investigation produced 34+ documentation files and extensive technical analysis.
- August 27, 2025: Production incident (PR #950) breaks Wikimedia, Grammarly, PlayHQ (~5k users)
- August 27, 2025 19:04 UTC: Incident detected
- August 27, 2025 22:00 UTC: Incident resolved (revert deployed)
- September 8, 2025: TA-13099 created by Luca Allievi - "Make controller always load before inputs"
- September 15, 2025: Initial implementation complete, documentation created
- September 16, 2025: PR #976 opened, initial reviews from Gary Evans
- September 18, 2025: Feedback from Giordano Arman recommending senior architect review
- September 19-20, 2025: Senior architecture review completed
- September 22, 2025: Peer review analysis and comprehensive documentation
- September 25, 2025: New architecture proposals created
- September 29-30, 2025: Team decision to explore different approach
- September 30, 2025: TA-13099 closed, TA-13399 created
What Happened:
PR #950 introduced a postal code field but included an architectural refactoring that changed how the controller's fields object was initialized - from pre-initialized with all expected properties to starting as an empty object {}. This breaking change caused merchants' custom validation code to fail.
Impact Metrics:
- Affected Merchants: Wikimedia (~2.5k users), Grammarly (~2k users), PlayHQ (148 users), others
- Total Users: ~5,000 unable to complete transactions
- Detection Time: 07:04 PM UTC
- Resolution Time: 10:00 PM UTC (~3 hours)
- Priority: P1
Error Observed:
// Wikimedia's code
if (data.fields) {
  cardNumberFieldEmpty = data.fields.number.empty; // CRASH: Cannot read properties of undefined
  cardNumberFieldValid = data.fields.number.valid;
}
The Breaking Change:
// BEFORE PR #950 (Working)
export let fields: CardFields = {
  number: { ...initField }, // Always exists
  expiryDate: { ...initField }, // Always exists
  securityCode: { ...initField }, // Always exists
}
// AFTER PR #950 (Broken)
export let fields: CardFields & OtherFields = {} // Empty! Only populated via handleAdd()
Immediate Response:
- 07:04 PM: Wikimedia reported errors
- 07:39 PM: P2 incident created
- 08:05 PM: Revert PR created
- 08:39 PM: Revert merged
- 09:57 PM: Fix deployed (after CI build issues)
- 10:00 PM: Incident resolved
Created By: Luca Allievi Title: "Make controller always load before inputs" Objective: Ensure controller readiness before relying on field data
Acceptance Criteria:
- Ensure controller readiness before relying on field data
- Submissions must consistently include field values
- FORM_CHANGE events must be reliable
- No breaking changes to public SDK API
- No UX degradation (fields should be interactive immediately)
Assignment: Assigned to Bruno for investigation and implementation
Key Question from Luca:
"I'm not sure if there is another solution to the issue rather than controller loads first. If you have [one] and meet the constraints, I'm more than happy to know."
Primary Issue: Controller's fields object empty when inputs load first
Technical Flow:
- SDK creates controller iframe and input iframes in parallel
- Input iframes load and send add messages via BroadcastChannel
- If controller loads after inputs, it misses the add messages
- Controller's fields object remains empty/incomplete
- Merchants accessing fields.number.empty get undefined error
Why Messages Are Lost: BroadcastChannel is lossy for late subscribers - if a message is sent before a listener attaches, it's permanently dropped. No retry, no queue, no delivery guarantee.
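The delivery behaviour described above can be reproduced without a browser. The following is a minimal in-memory sketch, not the real BroadcastChannel API: LossyChannel, listen, and post are hypothetical names, and delivery is synchronous here whereas the real channel delivers asynchronously. It only illustrates that a message posted before any listener is attached is never seen.

```typescript
// Hypothetical stand-in for BroadcastChannel delivery semantics:
// a message reaches only the listeners attached at the moment it is posted.
type Message = { type: string; payload?: unknown }
type Listener = (message: Message) => void

class LossyChannel {
  private listeners: Listener[] = []

  listen(listener: Listener): void {
    this.listeners.push(listener)
  }

  post(message: Message): void {
    // No queue and no retry: with zero listeners the message is simply dropped.
    for (const listener of this.listeners) listener(message)
  }
}

const channel = new LossyChannel()
const receivedByController: Message[] = []

// The input iframe loads first and announces itself before anyone listens.
channel.post({ type: 'add', payload: { type: 'number' } })

// The controller attaches its listener late.
channel.listen((message) => receivedByController.push(message))

// Only messages posted after the listener attached are delivered.
channel.post({ type: 'add', payload: { type: 'expiryDate' } })
```

After this runs, receivedByController holds only the expiryDate message; the earlier number message is gone, which mirrors how the controller's fields object ended up incomplete.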
From the post-mortem "5 Whys" analysis:
- PR #950 removed default field values → Made race condition fatal
- Race condition in iframe loading → No guaranteed load order
- BroadcastChannel lossy → Early messages permanently lost
- No controller-first enforcement → SDK allowed parallel iframe creation
- Architectural assumption → System relied on "luck" (controller loading first)
Message Sequence When It Works:
sequenceDiagram
participant SDK
participant Controller
participant Input
SDK->>Controller: Create iframe
Controller->>Controller: Load & attach listeners
Controller->>SDK: postMessage('ready')
SDK->>Input: Create iframe
Input->>Input: Load
Input->>Controller: BroadcastChannel('add', {type: 'number'})
Controller->>Controller: fields[type]._added = true ✅
Controller->>SDK: postMessage('form-change', {fields: {...}})
Message Sequence When It Fails:
sequenceDiagram
participant SDK
participant Controller
participant Input
SDK->>Controller: Create iframe (slow)
SDK->>Input: Create iframe (fast)
Input->>Input: Load first
Input->>Controller: BroadcastChannel('add') → LOST! ❌
Controller->>Controller: Load late, listeners attach
Controller->>SDK: postMessage('ready')
Controller->>SDK: postMessage('form-change', {fields: {}}) ← EMPTY!
Current Controller Boot Sequence (Vulnerable):
// apps/secure-fields/src/controller.ts
broadcastChannel.listen() // Listeners attach here
parent.listen()
parent.message('ready') // Signals ready to SDK
// Problem: Input messages sent BEFORE listen() are lost!
The handleAdd Function:
// apps/secure-fields/src/controller.ts:91-93
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
}
The Flow:
- Input sends: channel.message('add', { type: 'number' })
- Controller receives: broadcastChannel.onMessage('add', handleAdd)
- Handler sets: fields['number']._added = true
- If controller not listening: Message lost, fields['number'] never created
From /Users/bruno/www/gr4vy/secure-fields/docs/handleAdd-undefined-type-bug-analysis.md:
The BroadcastChannel only delivers messages to listeners that are already connected when the message is sent. If the controller hasn't initialized its BroadcastChannel listener yet, it will never receive the 'add' message from inputs that loaded first.
During the investigation, 12+ distinct approaches were analyzed. Here are the primary ones:
How It Works:
SecureFields.addField() returns stubs that queue method calls (setPlaceholder, setStyles, etc.). Real inputs created only after controller ready; queued calls then flush.
Implementation:
- Gate input creation/DOM replacement until ready received
- Stubs collect calls and listeners
- Flush queue when controller signals ready
Pros:
- Enforces "controller first, then inputs"
- No internal protocol changes needed
- Literal reading of ticket requirements
Cons:
- ❌ Inputs not interactive until ready → UX delay/regression
- ❌ Higher complexity for proxying methods and listeners
- ❌ More edge cases to handle
Decision: ❌ Rejected - UX degradation unacceptable
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:
Fields are not interactive until ready → UX delay/regression. Higher complexity for proxying methods and listeners; more edge cases. Tradeoffs: most literal reading of AC, but least user-friendly and most invasive.
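To make the rejected stub-and-flush idea concrete, here is a minimal sketch of a stub that queues method calls until a real field exists. FieldApi, createQueuedField, and bind are hypothetical names used for illustration only; the actual SDK surface is larger and would also need to proxy listeners, which is where the extra complexity noted above comes from.

```typescript
// Hypothetical field surface; the real SDK exposes more methods than these two.
interface FieldApi {
  setPlaceholder(text: string): void
  setStyles(styles: Record<string, string>): void
}

type QueuedCall = (real: FieldApi) => void

function createQueuedField(): { stub: FieldApi; bind: (real: FieldApi) => void } {
  const queue: QueuedCall[] = []
  let target: FieldApi | null = null

  // Run immediately once a real field exists; otherwise remember the call.
  const run = (call: QueuedCall): void => {
    if (target) call(target)
    else queue.push(call)
  }

  const stub: FieldApi = {
    setPlaceholder: (text) => run((real) => real.setPlaceholder(text)),
    setStyles: (styles) => run((real) => real.setStyles(styles)),
  }

  // Called when the controller signals ready: flush queued calls in order.
  const bind = (real: FieldApi): void => {
    target = real
    for (const call of queue.splice(0)) call(real)
  }

  return { stub, bind }
}

// Usage: calls made before ready are replayed, later calls pass straight through.
const applied: string[] = []
const realField: FieldApi = {
  setPlaceholder: (text) => applied.push(`placeholder:${text}`),
  setStyles: (styles) => applied.push(`styles:${Object.keys(styles).join(',')}`),
}

const { stub, bind } = createQueuedField()
stub.setPlaceholder('Card number') // queued, nothing rendered yet
stub.setStyles({ color: 'black' }) // queued
const appliedBeforeReady = applied.length
bind(realField) // controller ready: flush the queue
stub.setPlaceholder('1234') // applied immediately
```

Nothing reaches the real field until bind runs, which is exactly the UX delay that led to this option being rejected.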
How It Works:
Inputs render immediately but buffer their BroadcastChannel messages behind a local controllerReady flag. When controller emits one-shot "ready" broadcast, inputs flush buffered messages.
Implementation:
- Inputs add local queue
- Controller emits one-shot broadcast
- Inputs flush on receiving "ready"
Pros:
- No visible UX delay
- Fields work immediately
Cons:
- ❌ If controller loads first (common case), inputs miss the one-shot "ready"
- ❌ Needs intervals or repeated signals → flakiness
- ❌ More moving parts at input side
- ❌ Persistent timers undesirable
Decision: ❌ Rejected - Complexity and potential flakiness
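The first drawback listed above, missing the one-shot "ready", can be shown with a small in-memory simulation. Bus and createBufferingInput are hypothetical names, and the bus is a simplified stand-in for BroadcastChannel (synchronous, no delivery to late listeners); the sent array models what actually reaches the controller.

```typescript
// Hypothetical in-memory bus: delivers only to handlers attached at post time.
type Handler = (type: string) => void

class Bus {
  private handlers: Handler[] = []
  listen(handler: Handler): void {
    this.handlers.push(handler)
  }
  post(type: string): void {
    for (const handler of this.handlers) handler(type)
  }
}

function createBufferingInput(bus: Bus, fieldType: string) {
  let controllerReady = false
  const buffered: string[] = []
  const sent: string[] = []

  // Messages are held back until the local controllerReady flag flips.
  const send = (message: string): void => {
    if (controllerReady) sent.push(message)
    else buffered.push(message)
  }

  bus.listen((type) => {
    if (type !== 'ready') return
    controllerReady = true
    sent.push(...buffered.splice(0)) // flush everything buffered so far
  })

  send(`add:${fieldType}`)
  return { sent, buffered }
}

// Case 1: input loads first, then the controller broadcasts "ready".
const busA = new Bus()
const inputFirst = createBufferingInput(busA, 'number')
busA.post('ready')

// Case 2: controller broadcasts "ready" before the input is listening.
const busB = new Bus()
busB.post('ready')
const controllerFirst = createBufferingInput(busB, 'number')
```

In case 1 the buffered add is flushed. In case 2 the input never sees "ready", so its add stays in the buffer indefinitely unless the controller repeats the signal, which is the source of the timers and flakiness concerns.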
How It Works:
After attaching listeners, controller broadcasts a one-time sync request. Each input responds by re-sending its current state: add and update. Number input also re-sends derived update-field for security code sizing/label.
Implementation Details:
Controller (apps/secure-fields/src/controller.ts):
// Reorder to:
broadcastChannel.listen() // 1. Attach listeners FIRST
parent.listen() // 2. Attach parent listener
broadcastChannel.message('sync') // 3. Request state replay
parent.message('ready') // 4. Signal ready to SDK
Inputs (apps/secure-fields/src/input.ts):
channel.onMessage('sync', () => {
  // Re-mark field as added
  channel.message('add', { type })
  // Re-send current value/validity
  fireFormUpdate({ target: input })
  // If number field, also re-emit CVV constraints
  if (input.id === 'number') {
    const { code, schema } = validate(...)
    const codeLabel = currentCodeLabel || code?.name || 'CVV'
    const size = code?.size
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})
Pros:
- ✅ Robust to either load order
- ✅ No UX delay - inputs interactive immediately
- ✅ No timers/intervals needed
- ✅ Minimal code changes (3 files)
- ✅ No public API changes
- ✅ Single one-shot broadcast at boot
- ✅ Best balance of simplicity, UX, and reliability
Cons:
- Introduces one internal message type (sync)
- One extra broadcast at boot (minimal overhead)
Decision: ✅ CHOSEN - Best balance of all factors
Sync Completion Detection:
// Controller tracks which fields have synced
let syncAddedTypes = new Set<string>()
let syncUpdatedTypes = new Set<string>()
const checkSyncCompletion = () => {
  // Complete once every added field has sent at least one update
  const allUpdated = [...syncAddedTypes].every((type) => syncUpdatedTypes.has(type))
  if (allUpdated) {
    parent.message('sync-complete', {
      bootStartedAt,
      syncCompletedAt
    })
  }
}
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:
Adopt Option C (controller-issued sync + early listener attach). It fixes the race with minimal code, no UX regression, and no public API changes. It is robust to either load order and aligns with the acceptance criteria's intent.
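The whole pull-based exchange can be exercised end to end with a small simulation. This is a sketch under simplifying assumptions: Bus, bootInput, and bootController are hypothetical names, the bus is synchronous and in-memory rather than a real BroadcastChannel, and field values are omitted so only the add/update bookkeeping is shown.

```typescript
// Hypothetical in-memory bus: delivers only to handlers attached at post time.
type Message = { type: string; field?: string }
type Handler = (message: Message) => void

class Bus {
  private handlers: Handler[] = []
  listen(handler: Handler): void {
    this.handlers.push(handler)
  }
  post(message: Message): void {
    for (const handler of [...this.handlers]) handler(message)
  }
}

function bootInput(bus: Bus, field: string): void {
  // On 'sync', replay this field's add and update.
  bus.listen((message) => {
    if (message.type !== 'sync') return
    bus.post({ type: 'add', field })
    bus.post({ type: 'update', field })
  })
  // Initial announcement; lost if the controller is not listening yet.
  bus.post({ type: 'add', field })
  bus.post({ type: 'update', field })
}

function bootController(bus: Bus) {
  const added = new Set<string>()
  const updated = new Set<string>()
  const state = { syncComplete: false }

  // Complete once every added field has also sent at least one update.
  const checkSyncCompletion = (): void => {
    let complete = added.size > 0
    added.forEach((field) => {
      if (!updated.has(field)) complete = false
    })
    state.syncComplete = complete
  }

  bus.listen((message) => {
    if (!message.field) return
    if (message.type === 'add') added.add(message.field)
    if (message.type === 'update') updated.add(message.field)
    checkSyncCompletion()
  })
  bus.post({ type: 'sync' }) // listeners are attached first, then replay is requested
  return { added, state }
}

// Worst case: both inputs load (and announce themselves) before the controller.
const bus = new Bus()
bootInput(bus, 'number')
bootInput(bus, 'expiryDate')
const controller = bootController(bus)
```

Even though both initial announcements were dropped, the controller ends up with both fields in added and state.syncComplete set to true, because the replay is requested only after its listener is attached.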
How It Works:
SDK caches last input event per field. On controller ready, sends a sync payload to controller via postMessage. Controller merges it as update.
Pros:
- Avoids adding BroadcastChannel message for sync
Cons:
- ❌ Wider change surface (SDK + controller)
- ❌ Duplicated logic
- ❌ Must handle number→securityCode derivation explicitly
- ❌ Not as clean as Option C
Decision: ❌ Rejected - Less elegant than Option C
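For comparison, a rough sketch of the SDK-side cache shows why the change surface is wider: both a cache in the SDK and a merge step in the controller are needed. SdkFieldCache, FieldSnapshot, and mergeSync are hypothetical names. As an assumption made for this sketch only, the cache holds just the valid/empty flags rather than any field value.

```typescript
// Hypothetical snapshot shape: flags only, no field values.
interface FieldSnapshot {
  valid: boolean
  empty: boolean
}

class SdkFieldCache {
  private latest: Record<string, FieldSnapshot> = {}

  // Called for every input event the SDK relays; only the latest snapshot is kept.
  record(fieldType: string, snapshot: FieldSnapshot): void {
    this.latest[fieldType] = snapshot
  }

  // Built once, when the controller posts 'ready'.
  buildSyncPayload(): { type: 'sync'; fields: Record<string, FieldSnapshot> } {
    return { type: 'sync', fields: { ...this.latest } }
  }
}

// Controller side: merge the payload as if each entry had arrived as an update.
function mergeSync(
  controllerFields: Record<string, FieldSnapshot & { _added: boolean }>,
  fields: Record<string, FieldSnapshot>,
): void {
  for (const fieldType of Object.keys(fields)) {
    controllerFields[fieldType] = { ...fields[fieldType], _added: true }
  }
}

// Usage: the second event for 'number' overwrites the first before the merge.
const cache = new SdkFieldCache()
cache.record('number', { valid: false, empty: true })
cache.record('number', { valid: true, empty: false })

const controllerFields: Record<string, FieldSnapshot & { _added: boolean }> = {}
mergeSync(controllerFields, cache.buildSyncPayload().fields)
```

Note that the number→securityCode derivation mentioned in the cons is not covered here and would need its own handling.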
Core Mechanism: Pull-based resync where controller asks inputs to replay their state
Key Components:
- Controller broadcasts sync on boot (after attaching listeners)
- Inputs respond by re-sending add and update
- Controller tracks completion via sets of added/updated types
- SDK receives sync-complete with minimal timing info
- Single 5s timeout if controller never loads
Files Changed: 3 files
- apps/secure-fields/src/controller.ts (Controller sync logic)
- apps/secure-fields/src/input.ts (Input sync handler)
- packages/secure-fields/src/index.ts (SDK timeout and diagnostics)
Message Flow:
sequenceDiagram
autonumber
participant SDK
participant Controller
participant Input1 as Input (number)
participant Input2 as Input (expiry)
Note over SDK: User creates SecureFields
SDK->>Controller: Create iframe
SDK->>Input1: Create iframe
SDK->>Input2: Create iframe
Note over Input1,Input2: Inputs may load first
Input1->>Input1: Load, attach listeners
Input2->>Input2: Load, attach listeners
Note over Controller: Controller loads
Controller->>Controller: Attach listeners FIRST
Controller-->>Input1: BroadcastChannel('sync')
Controller-->>Input2: BroadcastChannel('sync')
Controller->>SDK: postMessage('ready')
Note over Input1,Input2: Inputs replay state
Input1-->>Controller: add({type: 'number'})
Input1-->>Controller: update({number: {value, valid, empty}})
Input1-->>Controller: update-field({id: 'securityCode', size, label})
Input2-->>Controller: add({type: 'expiryDate'})
Input2-->>Controller: update({expiryDate: {value, valid, empty}})
Note over Controller: All fields synced
Controller->>Controller: checkSyncCompletion()
Controller->>SDK: postMessage('sync-complete', {timings})
Note over SDK: Normal operation
Controller->>SDK: postMessage('form-change', {fields, complete})
Code Snippets:
Controller Boot Sequence:
// apps/secure-fields/src/controller.ts
// CRITICAL ORDER:
broadcastChannel.listen() // 1. Listen first
parent.listen() // 2. Parent listener
broadcastChannel.message('sync') // 3. Request sync
parent.message('ready') // 4. Signal ready
Input Sync Handler:
// apps/secure-fields/src/input.ts
channel.onMessage('sync', () => {
  channel.message('add', { type })
  fireFormUpdate({ target: input })
  if (input.id === 'number') {
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})
SDK Timeout and Diagnostics:
// packages/secure-fields/src/index.ts
// 5-second hard timeout
const timeoutId = setTimeout(() => {
  if (!controllerReady) {
    error('Controller failed to load within timeout', {
      timeoutMs: 5000
    })
  }
}, 5000)
// Excerpt from the SDK's message-type switch
// Clear on ready
case 'ready':
  clearTimeout(timeoutId)
  processFieldQueue()
  break
// Diagnostic logging (debug only)
case 'sync-complete':
  if (_controllerReadyDelayed) {
    log('Controller sync completed after delay', data)
  }
  break
- No Public API Changes: Internal messaging only, merchants don't need updates
- Backward Compatible: Existing integrations continue working unchanged
- Minimal Code Changes: Only 3 files touched, ~100 lines added
- No Timers/Polling: Single one-shot sync broadcast
- Inputs Interactive Immediately: No UX delay or placeholder states
- Robust: Works regardless of load order
- PCI Safe: No storage of sensitive values, ephemeral messaging only
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-implementation-summary.md:
We adopted Option C from the design report: a controller-driven replay request (pull-based resync) that works regardless of load order. No public API changes. No timers/intervals for polling. Inputs remain interactive immediately.
E2E Test: Delayed controller scenario
// packages/example-cdn/index.e2e.test.ts
// Delay controller by ~2s
await page.route('**/controller.html*', route => {
  setTimeout(() => route.continue(), 2000)
})
// Should still submit complete payload
expect(submitPayload).toHaveProperty('payment_method.number')
expect(submitPayload).toHaveProperty('payment_method.expiration_date')
expect(submitPayload).toHaveProperty('payment_method.security_code')
Unit Tests:
- Controller resync completion detection
- Input sync handler replays state correctly
- Number input re-emits CVV update-field
- SDK timeout triggers after 5s if no ready
- sync-complete logged only when delayed
Test Plan from /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-test-plan.md:
- Delayed controller still submits values ✅
- Controller never ready → single timeout error ✅
- Sync completes only after updates for all added fields ✅
- Number input on sync re-emits CVV settings ✅
Gary Evans (September 16, 2025):
Q: What if controller fails to load?
A: SDK has a 5-second timeout. After 5s, if controller hasn't posted ready, SDK logs a single error. Inputs remain rendered but submit won't work (controller owns submit flow).
Q: Can users submit before sync completes?
A: If controller not ready: submit call won't succeed. If controller ready but syncing: processes whatever state it has. Best practice: disable submit until both ready event fired AND FORM_CHANGE.complete === true.
Q: Does custom validation work during slow controller load?
A: Yes for input-level validation (runs in each input iframe independently). Delayed for controller-aggregated state (FORM_CHANGE event). Recommendation: tie field UI/validation to input events for responsiveness; use FORM_CHANGE.complete only to enable submit.
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-faq.md:
If the controller is not ready yet: a submit call won't succeed — the controller won't process it and you won't receive a success event. No invalid data is sent; it simply doesn't complete.
Giordano Arman (September 18, 2025):
Feedback: Recommended waiting for Luca's review before proceeding. Suggested getting senior architect input on long-term architectural implications.
Action Taken: Senior architecture review requested and completed (see next section).
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-peer-review.md:
Design Decisions Explained:
- Controller-driven replay (pull-based)
  - Why: Ensures controller receives complete state from source of truth (inputs)
  - Avoided: SDK-side caching (duplicates logic), queuing events, timers/intervals
- Single sync broadcast
  - Why: Minimal overhead, no polling
  - Risk mitigation: Inputs attach listeners during boot; controller sends sync after its listeners attach
- Minimal diagnostics
  - sync-complete with {bootStartedAt, syncCompletedAt} only
  - Conditional logging: only when field added before ready
  - Avoided: Soft delay timers, verbose telemetry
What Was Intentionally Avoided:
- Polling loops or intervals
- SDK-side state caching
- Queuing field methods
- Adding public API changes
- Complex timing metrics
From peer review:
Net effect: minimal changes with maximal reliability and no UX regression. We fixed the race at its source (missed messages) with a small, explicit replay and preserved the public API.
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-senior-architecture-review.md:
Observations:
- Sync model is intentionally minimal and pull-based ✅
  - Fixes root race (missed BroadcastChannel messages)
  - No queuing, polling, or SDK logic duplication
- Message hygiene good in apps; weaker in SDK init ⚠️
  - Apps use MessageHandler.PostMessage with origin+channel checks
  - SDK constructor checks origin only; submit path checks origin+channel+type
  - Suggestion: Add channel check to SDK constructor for consistency
- Completeness semantics sound ✅
  - complete requires number, expiryDate, plus conditionally securityCode/postalCode
- Scoped replay and minimal state ✅
  - Two small sets of field types during resync
  - Clears on completion
  - Minimal diagnostic payload
Questions for Long-Term Resilience:
- Is one-shot sync enough across all browsers?
  - With storage-based BroadcastChannel polyfill, small ordering window exists
  - Suggestion: Consider re-issuing sync once on first add (microtask-deferred)
- Multiple instances on same page?
  - Two SDK instances would share channel
  - Suggestion: Per-instance channel suffix (nonce)
- SPA lifecycle - how to tear down?
  - No destroy() method
  - Suggestion: Add explicit cleanup for SPAs
- EventBus/ListenersManager: global scope concerns?
  - EventBus.unsubscribeAll() clears global subscribers
  - ListenersManager.remove() uses toString() matching (brittle)
  - Suggestion: ID-centric listener management
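As an illustration of the last suggestion, here is a minimal sketch of ID-centric listener management. ListenerRegistry is a hypothetical name and is not the existing ListenersManager; the point is only that removal by an opaque ID does not depend on comparing function source text.

```typescript
// Hypothetical registry: each listener gets an opaque numeric ID at registration.
type Callback = (data: unknown) => void

class ListenerRegistry {
  private nextId = 1
  private entries = new Map<number, { eventName: string; callback: Callback }>()

  // Returns the ID that the caller keeps for later removal.
  add(eventName: string, callback: Callback): number {
    const id = this.nextId++
    this.entries.set(id, { eventName, callback })
    return id
  }

  // Removal by ID is independent of function identity or source text.
  remove(id: number): boolean {
    return this.entries.delete(id)
  }

  emit(eventName: string, data: unknown): void {
    this.entries.forEach((entry) => {
      if (entry.eventName === eventName) entry.callback(data)
    })
  }
}

// Usage: two closures with identical source text, distinguishable only by ID.
const calls: string[] = []
const makeListener = (label: string): Callback => () => calls.push(label)

const registry = new ListenerRegistry()
const firstId = registry.add('form-change', makeListener('first'))
registry.add('form-change', makeListener('second'))

registry.remove(firstId)
registry.emit('form-change', {})
```

Both closures produced by makeListener have the same toString() output, so text matching could not tell them apart; with IDs, only the first one is removed and calls ends up as ['second'].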
Hardening Suggestions:
1. SDK Constructor Channel Check:
const listener = (message: MessageEvent) => {
  const isKnownOrigin = message.origin === this.frameUrl
  const isKnownChannel = message?.data?.channel === MESSAGE_CHANNEL
  if (!(isKnownOrigin && isKnownChannel)) return
  // ... handle message
}
2. Microtask-Deferred Re-Sync (for edge cases):
let resyncReissued = false
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
  if (syncInProgress && !resyncReissued) {
    resyncReissued = true
    queueMicrotask(() => broadcastChannel.message('sync'))
  }
}
3. SPA-Safe Teardown:
class SecureFields {
  destroy() {
    ListenersManager.removeAll()
    EventBus.unsubscribeAll()
    // Remove frames
    this.controller?.remove()
  }
}
Rationale and Trade-offs:
Keeping replay source of truth in inputs avoids duplicating validation/formatting in the SDK, minimizing risk and drift. A microtask-delayed "second sync" mitigates the only remaining known race without introducing timers/intervals.
Closing Statement:
The chosen fix for TA-13099 is the right long-term call: one-shot, controller-driven sync from the inputs (the real state owners). The suggestions above harden boundary conditions without increasing complexity for integrators or expanding the public API.
From /Users/bruno/www/gr4vy/secure-fields/docs/SENIOR_ARCHITECT_REVIEW.md:
This was a highly critical review identifying deeper architectural concerns:
🔴 Critical Architectural Issues:
- Real Problem: Temporal Dependencies in Distributed System
  - Race condition is symptom of fundamentally flawed architecture
  - 3+ independent processes (SDK + Controller + N inputs) with implicit ordering
  - No guaranteed message ordering
  - No error recovery if controller crashes
  - Silent failures when BroadcastChannel unsupported
- The Proposed Queue is a Band-Aid
  - Doesn't address lack of message ordering guarantees
  - No mechanism for detecting/handling stalled initialization
  - Creates new failure modes
- Violation of Core Design Principles
  - SDK now handles async state management (doesn't belong there)
  - Fails silently later instead of fail-fast
  - Single Responsibility Principle violated
From senior architect review:
The race condition isn't just a timing issue—it's a symptom of a fundamentally flawed architecture where we have 3+ independent processes with implicit ordering dependencies. The proposed queue is a band-aid that doesn't address: No guaranteed message ordering between iframes, No error recovery if controller crashes, No way to detect/handle stalled initialization.
Alternative: Event-Driven Coordination with Promises:
class SecureFields {
  private controllerInitialized: Promise<void>
  constructor(config: Config) {
    this.controllerInitialized = this.initController(config)
  }
  async addCardNumberField(element, options): Promise<SecureInput> {
    await this.controllerInitialized // Wait for controller
    return this._createCardNumberField(element, options)
  }
}
Benefits Over Current Solution:
- Explicit dependencies (controller must load first)
- Fail-fast (immediate error if controller fails)
- Promise-based (natural async/await patterns)
- Memory safe (no persistent queues)
- Testable (easy to mock controller init)
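The fail-fast benefit depends on the initialization promise rejecting when the controller stalls, which the class outline above does not show. A minimal sketch of one way to get that behaviour follows. withTimeout and SecureFieldsSketch are hypothetical names, the injected initController function stands in for real iframe loading, and the timeout values are arbitrary.

```typescript
// Hypothetical helper: reject if the wrapped promise does not settle in time.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} did not complete within ${timeoutMs}ms`)),
      timeoutMs,
    )
    work.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}

class SecureFieldsSketch {
  private controllerInitialized: Promise<void>

  constructor(initController: () => Promise<void>, timeoutMs: number) {
    this.controllerInitialized = withTimeout(initController(), timeoutMs, 'Controller init')
    // Mark the rejection as handled in case no field is ever added.
    this.controllerInitialized.catch(() => undefined)
  }

  async addField(type: string): Promise<string> {
    await this.controllerInitialized // rejects here if the controller never became ready
    return `field:${type}`
  }
}

// Usage: one controller that loads, one that never does.
const healthy = new SecureFieldsSketch(() => Promise.resolve(), 50)
const stalled = new SecureFieldsSketch(() => new Promise<void>(() => {}), 10)

healthy.addField('number').then((field) => console.log(field))
stalled.addField('number').catch((error: Error) => console.log(error.message))
```

The trade-off remains the one raised for the stub-based option: fields cannot be created until the controller is ready, so interactivity is delayed by design.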
Production Readiness Red Flags:
- Creates new race condition (queue processing vs controller ready)
- No fallback if BroadcastChannel fails
- Silent degradation (placeholder objects mask failures)
- No monitoring hooks for queue overflow
Final Verdict: 🔴 REQUEST CHANGES
From senior architect:
As the senior architect who has seen this codebase evolve over 5 years, I'm concerned about accumulating technical debt. Each "quick fix" makes the next problem harder to solve. The queueing approach is acceptable as an interim fix, but we should be planning the move to Promise-based initialization within the next 2 quarters.
What Happened: After extensive review discussions, the team decided to explore a different architectural approach rather than proceeding with PR #976.
Actions Taken:
- PR #976 moved to "Blocked" status
- TA-13099 closed (investigation complete)
- TA-13399 created for new investigation
Rationale:
- Solution Seen as "Band-Aid"
  - Fixes immediate symptom but doesn't address underlying architectural issues
  - Creates new edge cases and failure modes
  - Adds complexity to already complex iframe communication
- Desire for Architectural Simplification
  - Current solution adds more events between iframes
  - sync-complete mechanism adds new coordination layer
  - Long-term maintainability concerns
- Senior Architect Concerns
  - Temporal dependencies in distributed system
  - Silent failure modes
  - Memory management concerns
  - Violation of design principles
- Alternative Approaches Worth Exploring
  - Promise-based API (async/await patterns)
  - Service worker coordination
  - Explicit controller-first loading
  - Unified message handler architecture
From team discussions:
While PR #976 successfully solves the immediate race condition, the team consensus is to explore architectural patterns that eliminate the race condition by design rather than working around it. The investigation produced valuable insights that will inform the next approach.
Current System Architecture:
Components:
- SDK (Host Page) - packages/secure-fields/src/
  - Public API for integrators
  - Creates controller and field iframes
  - Relays events via EventBus
- Controller (Hidden Iframe) - apps/secure-fields/src/controller.ts
  - Central state for all fields
  - Validates form completeness
  - Updates Checkout Session on submit
- Input Fields (Iframes) - apps/secure-fields/src/input.ts
  - One iframe per field (number, expiry, CVV, postal code)
  - Handle formatting, validation, autofill
  - Emit field-level events
Communication Channels:
- postMessage: SDK ↔ Controller, SDK ↔ Inputs
- BroadcastChannel (secure-fields): Controller ↔ Inputs
- BroadcastChannel (secure-fields-card): Controller ↔ Click to Pay Encrypt
- MessageChannel: Click to Pay Controller ↔ Encrypt (port transfer)
Message Types:
- Controller ↔ SDK: ready, form-change, submit, success, error
- Controller ↔ Inputs: add, update, reset, sync (new)
- Input ↔ SDK: focus, blur, input, update
From /Users/bruno/www/gr4vy/secure-fields/docs/SECURE_FIELDS_ARCHITECTURE_REPORT.md:
sequenceDiagram
participant Host as Host Page (SDK)
participant Ctrl as Controller iframe
participant Num as Input iframe (number)
participant API as Checkout Sessions API
Host->>Ctrl: Create iframe controller.html
Ctrl-->>Host: postMessage ready
Host->>Num: Create iframe input.html?type=number
Num-->>Host: onload
Host->>Num: postMessage update{styles, label}
Num-->>Ctrl: BroadcastChannel add{type: 'number'}
Num->>Ctrl: BroadcastChannel update{number: {value, valid, empty}}
Ctrl-->>Host: postMessage form-change{fields, complete}
Host->>Ctrl: postMessage submit{method: 'card'}
Ctrl->>API: PUT /checkout/sessions/:id/fields
API-->>Ctrl: 200
Ctrl-->>Host: postMessage success{scheme}
Controller Field State:
export let fields: CardFields & OtherFields = {
  number: {
    value: '', // Sanitized (not PAN)
    valid: true,
    empty: true,
    autofilled: false,
    _added: false // Internal flag
  },
  expiryDate: { /* similar */ },
  securityCode: { /* similar */ },
  postalCode: { /* similar */ }
}
Form Complete Logic:
const complete = Object.entries(fields)
  .filter(([_, field]) => field._added)
  .every(([_, field]) => field.valid && !field.empty)
From Documentation Files:
- PR #950 Investigation (PR-950-investigation-summary.md)
  - Detailed analysis of the incident
  - Breaking change identification
  - Payload structure comparison
  - Root cause: architectural change from static to dynamic field management
- Wikimedia Breaking Change Analysis (wikimedia-breaking-change-analysis.md)
  - Merchant code assumptions documented
  - Defensive coding patterns analyzed
  - API contract violation explained
  - Shows how implicit contracts matter
- handleAdd Undefined Bug (handleAdd-undefined-type-bug-analysis.md)
  - Secondary bug discovered during investigation
  - SecureInput.update() inconsistency
  - Missing type field in dynamic updates
  - Fixed in PR #977
- setPlaceholder Bug Flow (setPlaceholder-bug-flow.md)
  - Detailed message flow diagrams
  - Working vs broken scenarios visualized
  - Code path analysis with mermaid diagrams
- Comprehensive Analysis Docs:
  - CURRENT_SYSTEM_ANALYSIS.md - Full system architecture
  - TA-13099_IMPLEMENTATION_DOCUMENTATION.md - Line-by-line code analysis
  - FINAL_IMPLEMENTATION_SUMMARY.md - Complete solution summary
  - PROPOSED_SOLUTION.md - Initial proposal
  - REFACTORED_SOLUTION.md - Iteration
  - REFACTORING_EFFORT_ANALYSIS.md - Bundle size, complexity analysis
Historical Context:
- PR #950: Added postal code but broke field initialization (August 27)
- PR #959: Safe reimplementation of postal code (retained field defaults)
- PR #976: Controller sync solution (September 16)
- PR #977: Fixed handleAdd undefined type bug (September)
From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-faq.md:
Q: What if controller fails to load?
A: The SDK never receives ready. After ~5 seconds, SDK logs a single timeout error (debug logging must be enabled). Inputs remain rendered and editable, but tokenization/submission cannot complete because controller owns the submit flow.
Recommended integration: Gate submit on SDK ready event and show user-friendly error if not received within your own timeout window.
Q: If controller takes a long time to load, do custom validations still run while inputs are editable?
A:
- Yes for input-level validation: Each input iframe validates and formats on every keystroke, independent of controller. SDK still receives input events, so container attributes/state update as usual (invalid, autofilled, CVV label).
- Delayed for controller-aggregated state: FORM_CHANGE event (with complete flag) comes from controller, so it starts flowing after controller is ready and finishes initial sync.
Recommendation: Keep field UI/validation tied to input events for responsiveness; use FORM_CHANGE.complete only to enable submit.
Q: Can users submit before sync completes?
A:
- If controller not ready: Submit call won't succeed - controller won't process it and no success event emitted. No invalid data sent; simply doesn't complete.
- If controller ready but still syncing: Processes whatever state it has so far. The complete flag in FORM_CHANGE indicates when form is fully ready.
Recommended integration:
- Disable submit until SDK has fired ready event
- AND latest FORM_CHANGE.complete is true
- This guarantees controller has all current input values before submission
A:
- Stored payment method (CVV only): sync replays whatever is present. Number/expiry not added; CVV-only flows behave as before.
- Autofill (Chrome): isAutofilled already computed; resync emits current update snapshot including autofilled state.
- Click to Pay: Controller mirrors updates to card channel; resync integrates seamlessly.
- Controller never loads: After 5s, SDK logs single error.
- Method change: Controller already sends reset; if desired, can broadcast another sync afterwards to rehydrate values.
- TA-13099-controller-init-report.md - Initial analysis of all solution options
- TA-13099-implementation-summary.md - High-level implementation summary
- TA-13099-final-implementation.md - Detailed final implementation report
- TA-13099_IMPLEMENTATION_DOCUMENTATION.md - Line-by-line code analysis (768 lines)
- TA-13099_IMPLEMENTATION_PLAN.md - Step-by-step implementation plan
- TA-13099_REVISED_IMPLEMENTATION.md - Iteration after initial feedback
- TA-13099_PR_DETAILS.md - PR #976 details and description

- TA-13099-peer-review.md - Detailed peer review notes
- TA-13099-senior-architecture-review.md - Senior architect suggestions
- PEER_REVIEW_ANALYSIS.md - Critical analysis from peer perspective
- SENIOR_ARCHITECT_REVIEW.md - Critical architectural concerns (5-year perspective)
- SENIOR_ARCHITECT_CRITICAL_ANALYSIS.md - Deep architectural critique

- TA-13099-test-plan.md - Comprehensive test plan
- TA-13099-test-checklist.md - Test execution checklist
- TA-13099-faq.md - Frequently asked questions

- PR-950-INCIDENT.md - Production incident summary
- PR-950-investigation-summary.md - Detailed incident investigation
- wikimedia-breaking-change-analysis.md - Merchant impact analysis

- handleAdd-undefined-type-bug-analysis.md - Secondary bug discovered
- setPlaceholder-bug-flow.md - Message flow diagrams for bug
- secure-input-type-bug-analysis.md - Input type handling bug
- empty-added-field-log-analysis.md - Log analysis of field addition

- SECURE_FIELDS_ARCHITECTURE_REPORT.md - Complete architecture overview
- SECURE_FIELDS_ARCHITECTURE_REPORT copy.md - Backup/alternate version
- SECURE_FIELDS_NEW_ARCH.md - Proposed new architecture
- SECURE_FIELDS_NEW_ARCH_CLAUDE.md - Alternative architecture proposal
- CURRENT_SYSTEM_ANALYSIS.md - Current system deep dive

- PROPOSED_SOLUTION.md - Initial solution proposal
- REFACTORED_SOLUTION.md - Refactored approach
- WAIT-FOR-CONTROLLER-IMPLEMENTATION.md - Alternative implementation
- FINAL_IMPLEMENTATION_SUMMARY.md - Summary of final approach
- REFACTORING_EFFORT_ANALYSIS.md - Bundle size and complexity analysis

- PR-959-REVIEW.md - Review of safe postal code implementation
- SCHEME_BASED_CVV_IMPLEMENTATION.md - Related CVV feature

- debug-required-fields.html - Debug page for testing
- test-loading-order.html - Load order test page
- wikimedia-repro.html - Wikimedia scenario reproduction
- wikimedia-repro-force-error.html - Force error scenarios
Total Documentation: 34+ markdown files, 3+ HTML test files
/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/controller.ts
Lines affected: ~15-30 (boot sequence reorder)
New functionality:
- Sync broadcast after listener attachment
- Sync completion tracking (syncAddedTypes, syncUpdatedTypes)
- checkSyncCompletion() function
- sync-complete message to SDK
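The controller-side changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #976: the message shapes, the `MemoryChannel` stand-in, the `SketchController` class, the `expectedTypes` list, and the `onSyncComplete` callback (standing in for the `sync-complete` message to the SDK) are all assumptions made to keep the example self-contained. Only the names `syncAddedTypes`, `syncUpdatedTypes`, and `checkSyncCompletion` come from the list above.

```typescript
// Hypothetical message shapes; the real Secure Fields payloads may differ.
type SyncMessage =
  | { type: 'sync' }
  | { type: 'add'; field: string }
  | { type: 'update'; field: string; value: string };

type SyncListener = (message: SyncMessage) => void;

// In-memory stand-in for the shared channel, used only to keep the sketch runnable.
class MemoryChannel {
  private listeners: SyncListener[] = [];

  subscribe(listener: SyncListener): void {
    this.listeners.push(listener);
  }

  post(message: SyncMessage): void {
    // Deliver only to listeners registered at post time; nothing is buffered.
    for (const listener of [...this.listeners]) listener(message);
  }
}

class SketchController {
  readonly fields: Record<string, { value: string }> = {};
  syncComplete = false;
  private readonly syncAddedTypes = new Set<string>();
  private readonly syncUpdatedTypes = new Set<string>();
  private readonly channel: MemoryChannel;
  private readonly expectedTypes: string[];
  private readonly onSyncComplete: () => void;

  constructor(channel: MemoryChannel, expectedTypes: string[], onSyncComplete: () => void) {
    this.channel = channel;
    this.expectedTypes = expectedTypes;
    this.onSyncComplete = onSyncComplete;
  }

  boot(): void {
    // Reordered boot sequence: attach listeners first...
    this.channel.subscribe((message) => this.handle(message));
    // ...then ask inputs that are already running to replay their state.
    this.channel.post({ type: 'sync' });
  }

  private handle(message: SyncMessage): void {
    if (message.type === 'add') {
      if (!(message.field in this.fields)) this.fields[message.field] = { value: '' };
      this.syncAddedTypes.add(message.field);
    } else if (message.type === 'update') {
      this.fields[message.field] = { value: message.value };
      this.syncUpdatedTypes.add(message.field);
    } else {
      return; // the sync request itself carries no field state
    }
    this.checkSyncCompletion();
  }

  private checkSyncCompletion(): void {
    if (this.syncComplete) return;
    const done = this.expectedTypes.every(
      (type) => this.syncAddedTypes.has(type) && this.syncUpdatedTypes.has(type),
    );
    if (done) {
      this.syncComplete = true;
      this.onSyncComplete(); // stands in for the sync-complete message to the SDK
    }
  }
}

// Usage: an input that booted before the controller answers the sync request.
const demoChannel = new MemoryChannel();
demoChannel.subscribe((message) => {
  if (message.type !== 'sync') return;
  demoChannel.post({ type: 'add', field: 'number' });
  demoChannel.post({ type: 'update', field: 'number', value: '4111' });
});
let demoCompletions = 0;
const demoController = new SketchController(demoChannel, ['number'], () => {
  demoCompletions += 1;
});
demoController.boot();
```

The point of the ordering in `boot()` is that the sync request is only sent once the controller can receive the replies, so a replayed `add` or `update` cannot be lost the way the original messages were.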
/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/input.ts
Lines affected: ~10-25 (new sync handler)
New functionality:
- channel.onMessage('sync', handler)
- Re-emit add and update on sync
- Number field re-emits update-field for CVV
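On the input side, the sync handler could look something like the sketch below. It assumes a channel wrapper with an `onMessage(type, handler)` method as named above; the `ReplayChannel` class, the `InputState` shape, the `registerSyncHandler` helper, and the `securityCode` field name in the `update-field` payload are hypothetical and chosen only for illustration.

```typescript
// Hypothetical message shapes; real payloads in input.ts may carry more data.
type ReplayMessage =
  | { type: 'sync' }
  | { type: 'add'; field: string }
  | { type: 'update'; field: string; value: string; isAutofilled: boolean }
  | { type: 'update-field'; field: string };

interface InputState {
  field: string;
  value: string;
  isAutofilled: boolean;
}

// Small in-memory channel that records what was sent (an assumption for this sketch).
class ReplayChannel {
  readonly sent: ReplayMessage[] = [];
  private handlers = new Map<string, Array<() => void>>();

  onMessage(type: ReplayMessage['type'], handler: () => void): void {
    const list = this.handlers.get(type) ?? [];
    list.push(handler);
    this.handlers.set(type, list);
  }

  postMessage(message: ReplayMessage): void {
    this.sent.push(message);
    for (const handler of this.handlers.get(message.type) ?? []) handler();
  }
}

function registerSyncHandler(channel: ReplayChannel, state: InputState): void {
  channel.onMessage('sync', () => {
    // Re-announce the field, then replay its current state (read at sync time).
    channel.postMessage({ type: 'add', field: state.field });
    channel.postMessage({
      type: 'update',
      field: state.field,
      value: state.value,
      isAutofilled: state.isAutofilled,
    });
    // The number field also re-emits update-field for the CVV input.
    // 'securityCode' is a hypothetical field name.
    if (state.field === 'number') {
      channel.postMessage({ type: 'update-field', field: 'securityCode' });
    }
  });
}

// Usage: the state object is shared, so the replay reflects the latest value.
const numberChannel = new ReplayChannel();
const numberState: InputState = { field: 'number', value: '', isAutofilled: false };
registerSyncHandler(numberChannel, numberState);
numberState.value = '4242';
numberChannel.postMessage({ type: 'sync' });
```

Because the handler reads `state` when the sync arrives rather than when it was registered, the controller receives whatever the user has typed so far, which is what makes the approach "pull-based".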
/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/index.ts
Lines affected: ~20-40 (timeout and diagnostics)
New functionality:
- 5-second timeout for controller ready
- _controllerReadyDelayed flag
- sync-complete handler with conditional logging
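One way the SDK-side diagnostics might fit together is sketched below. The `ControllerReadyMonitor` class, its method names, the log wording, and the injected `schedule` and `log` functions are assumptions for illustration; only the 5-second timeout, the `_controllerReadyDelayed` flag, and the idea of logging conditionally on `sync-complete` come from the list above.

```typescript
// Injected so the sketch can run without real timers; both types are hypothetical.
type ScheduleFn = (callback: () => void, delayMs: number) => void;
type LogFn = (message: string) => void;

const CONTROLLER_READY_TIMEOUT_MS = 5000;

class ControllerReadyMonitor {
  _controllerReadyDelayed = false;
  private controllerReady = false;
  private readonly schedule: ScheduleFn;
  private readonly log: LogFn;

  constructor(schedule: ScheduleFn, log: LogFn) {
    this.schedule = schedule;
    this.log = log;
  }

  start(): void {
    // One timer, so at most one error is logged if the controller never loads.
    this.schedule(() => {
      if (this.controllerReady) return;
      this._controllerReadyDelayed = true;
      this.log(`error: controller not ready after ${CONTROLLER_READY_TIMEOUT_MS}ms`);
    }, CONTROLLER_READY_TIMEOUT_MS);
  }

  handleControllerReady(): void {
    this.controllerReady = true;
  }

  handleSyncComplete(): void {
    // Conditional logging: only report recovery if the timeout fired earlier.
    if (this._controllerReadyDelayed) {
      this.log('info: controller became ready late; field state resynced');
    }
  }
}
```

In a browser, `schedule` would typically wrap `setTimeout` and `log` would wrap the SDK's logger; injecting them here simply makes the timeout path easy to exercise in a test.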
/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/constants.ts
/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/constants.ts
New: 'sync' message type
Modified: MESSAGE_CHANNEL usage
/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/types.ts
/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/types.ts
New: Sync-related message types
Modified: CardFields type definitions
- TA-13099 - Make controller always load before inputs
  - Created: September 8, 2025
  - Assigned: Bruno
  - Status: Closed (investigation complete)
  - Outcome: PR #976 blocked, team exploring alternatives
- TA-13399 - New investigation for alternative approach
  - Created: September 30, 2025
  - Status: Open
  - Objective: Explore architectural simplification
- TA-13380 - Blocked by TA-13099 resolution
  - Status: Waiting
  - Dependency: Requires stable controller initialization
- TA-13016 - Wikimedia production incident post-mortem
  - Date: August 27, 2025
  - Root cause: PR #950 breaking change
- TA-12842 - Add postal code field
  - Implemented: PR #950 (broke production)
  - Re-implemented: PR #959 (safe version)
- TA-12843 - Submit postal code in checkout session
  - Depends on: TA-12842
  - Status: Completed (via PR #959)
- Breaking Changes Are Dangerous
  - Implicit API contracts matter as much as explicit ones
  - Merchant-specific code will break in unexpected ways
  - Defensive coding patterns reveal expected guarantees
- Race Conditions Are NOT Edge Cases
  - Production environments have unpredictable timing
  - What works locally may fail in production
  - Network latency, CDN caching, and CPU load all affect timing
- Testing Gaps
  - Internal tools didn't catch merchant-specific patterns
  - Race conditions are hard to reproduce in controlled environments
  - Need simulation of real merchant implementations
- Architecture Assumptions
  - System relied on "luck" (controller loading first)
  - No explicit synchronization between distributed components
  - Silent failures are worse than loud failures
- Multiple Solutions Exist
  - Evaluated 12+ approaches with different trade-offs
  - Queue-based, Promise-based, and sync-based approaches are all viable
  - Best solution depends on priorities (UX vs simplicity vs maintainability)
- Trade-Offs Are Inevitable
  - No perfect solution exists
  - Every approach creates new edge cases
  - Must choose which trade-offs are acceptable
- Architecture Review Is Critical
  - Senior architect perspective identified deeper issues
  - Peer review caught implementation concerns
  - Multiple review rounds improved solution quality
- Documentation Matters
  - 34+ docs created during investigation
  - Comprehensive analysis aids decision-making
  - Future teams benefit from documented thought process
- BroadcastChannel Limitations
  - Lossy for late subscribers
  - No delivery guarantees
  - Not suitable for critical initialization messages
- Iframe Communication Complexity
  - Distributed system with temporal dependencies
  - No guaranteed message ordering
  - Requires explicit synchronization
- State Management Challenges
  - Multiple sources of truth (SDK, Controller, Inputs)
  - Synchronization requires careful design
  - Error recovery adds significant complexity
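The "lossy for late subscribers" point can be illustrated with a small model. The `LossyBus` class below is not `BroadcastChannel` itself; it is a simplified, synchronous stand-in that mimics one property of it, namely that a message reaches only the listeners that exist when it is posted and is never buffered for later ones. The `simulateLoadOrder` helper and the field name are hypothetical.

```typescript
type BusHandler = (fieldType: string) => void;

// Simplified model of a fire-and-forget channel: no buffering, no replay.
class LossyBus {
  private handlers: BusHandler[] = [];

  subscribe(handler: BusHandler): void {
    this.handlers.push(handler);
  }

  post(fieldType: string): void {
    for (const handler of [...this.handlers]) handler(fieldType);
  }
}

// Returns the field types the controller knows about after both sides have booted.
function simulateLoadOrder(controllerFirst: boolean): string[] {
  const bus = new LossyBus();
  const knownFields: string[] = [];

  const bootController = (): void => bus.subscribe((fieldType) => knownFields.push(fieldType));
  const bootInput = (): void => bus.post('number'); // the input announces itself once

  if (controllerFirst) {
    bootController();
    bootInput();
  } else {
    bootInput(); // nobody is listening yet, so the announcement is dropped
    bootController();
  }
  return knownFields;
}

const controllerFirstFields = simulateLoadOrder(true);
const inputFirstFields = simulateLoadOrder(false);
```

With the controller booting first, `controllerFirstFields` contains `'number'`; with the input booting first, `inputFirstFields` stays empty, which corresponds to the incomplete `fields` object described in the root cause.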
The TA-13099 investigation successfully identified the root cause of a critical production race condition and developed a working solution (PR #976) using controller-driven sync. The solution was technically sound, thoroughly documented, and extensively reviewed.
However, the investigation also revealed deeper architectural concerns. The team's decision to explore alternative approaches reflects a mature engineering culture that values long-term maintainability over quick fixes.
Key Outcomes:
- ✅ Root Cause Identified: BroadcastChannel message loss when controller loads after inputs
- ✅ Working Solution Developed: Controller sync with input state replay
- ✅ Comprehensive Documentation: 34+ documents covering all aspects
- ✅ Thorough Reviews: Peer and senior architect perspectives
- ✅ Informed Decision: Team chose to explore architectural alternatives
What Worked:
- Systematic evaluation of 12+ solution approaches
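- Minimal code changes (logic confined to 3 files: `controller.ts`, `input.ts`, and the SDK's `index.ts`, plus supporting constants and types)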
- No public API breaking changes
- No UX degradation
- Comprehensive testing strategy
What Could Be Better:
- Solution adds complexity to already complex iframe communication
- Creates new edge cases and failure modes
- Doesn't address underlying architectural issues
- Temporary fix rather than permanent solution
Next Steps:
- TA-13399 will explore architectural simplification
- Consider Promise-based API for explicit async handling
- Evaluate service worker coordination
- Investigate unified message handler patterns
Bruno's Contribution:
Bruno's investigation was comprehensive, methodical, and well-documented. The 34+ documentation files created during this investigation provide invaluable insights for future architectural decisions. The work demonstrates strong technical analysis, clear communication, and commitment to quality.
The decision to not merge PR #976 doesn't diminish the value of this investigation - it's a testament to the team's commitment to building the right solution, not just the fastest solution.
Document Version: 1.0 Created: September 2025 Last Updated: September 30, 2025 Status: Investigation Complete, Alternative Approach In Progress Author: Bruno Reviewers: Gary Evans, Giordano Arman, Luca Allievi, Senior Architect
This document consolidates 34+ individual documentation files, Jira ticket TA-13099, PR #976 discussions, Confluence post-mortem analysis, and git commit history into a single comprehensive reference for future architectural decisions.