@brunodesde1987
Last active November 12, 2025 14:06
Secure Fields Race Condition: Bruno's Investigation & Solution (September 2025)

Author: Bruno
Date: September 2025
Ticket: TA-13099
PR: #976 (Blocked/Closed)
Status: Investigation Complete, Alternative Approach Explored


Executive Summary

In September 2025, a critical race condition in Gr4vy's Secure Fields library was investigated following a production incident (August 27, 2025) that affected ~5,000 users across multiple merchants including Wikimedia and Grammarly. The issue: controller and input iframes could load in any order, causing FORM_CHANGE events to reference undefined fields when inputs loaded first.

Root Cause: BroadcastChannel messages from inputs sent before controller was listening were permanently lost, leaving the controller's fields object incomplete.

Solution Implemented (PR #976): Controller-driven sync mechanism where controller broadcasts a one-time sync request on boot, and inputs replay their current state. This "pull-based" approach ensured complete state without UX degradation.

Outcome: After thorough peer and senior architect reviews, the team decided (Sep 29-30) to explore alternative architectural approaches. PR #976 was moved to "Blocked" status, TA-13099 was closed, and TA-13399 was created to investigate a more comprehensive solution. The investigation produced 34+ documentation files and extensive technical analysis.


Timeline

  • August 27, 2025: Production incident (PR #950) breaks Wikimedia, Grammarly, PlayHQ (~5k users)
  • August 27, 2025 19:04 UTC: Incident detected
  • August 27, 2025 22:00 UTC: Incident resolved (revert deployed)
  • September 8, 2025: TA-13099 created by Luca Allievi - "Make controller always load before inputs"
  • September 15, 2025: Initial implementation complete, documentation created
  • September 16, 2025: PR #976 opened, initial reviews from Gary Evans
  • September 18, 2025: Feedback from Giordano Arman recommending senior architect review
  • September 19-20, 2025: Senior architecture review completed
  • September 22, 2025: Peer review analysis and comprehensive documentation
  • September 25, 2025: New architecture proposals created
  • September 29-30, 2025: Team decision to explore different approach
  • September 30, 2025: TA-13099 closed, TA-13399 created

1. Context & Background

The Incident (August 27, 2025)

What Happened: PR #950 introduced a postal code field but included an architectural refactoring that changed how the controller's fields object was initialized - from pre-initialized with all expected properties to starting as an empty object {}. This breaking change caused merchants' custom validation code to fail.

Impact Metrics:

  • Affected Merchants: Wikimedia (~2.5k users), Grammarly (~2k users), PlayHQ (148 users), others
  • Total Users: ~5,000 unable to complete transactions
  • Detection Time: 07:04 PM UTC
  • Resolution Time: 10:00 PM UTC (~3 hours)
  • Priority: P1

Error Observed:

// Wikimedia's code
if (data.fields) {
  cardNumberFieldEmpty = data.fields.number.empty;  // CRASH: Cannot read properties of undefined
  cardNumberFieldValid = data.fields.number.valid;
}
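On the integrator side, a defensive read avoids this crash no matter how the SDK populates `fields`. The following is a minimal sketch, not Wikimedia's actual code: the `FieldState` and `FormChangeData` shapes, the `readFieldState` helper, and the chosen defaults are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
interface FieldState {
  empty: boolean
  valid: boolean
}

interface FormChangeData {
  fields?: Record<string, FieldState | undefined>
}

// Returns the state for a field, or a conservative default when the
// controller has not registered that field yet.
function readFieldState(data: FormChangeData, fieldName: string): FieldState {
  const field = data.fields?.[fieldName]
  return {
    empty: field?.empty ?? true,
    valid: field?.valid ?? false,
  }
}

// An empty fields object no longer throws.
const early = readFieldState({ fields: {} }, 'number')
const later = readFieldState(
  { fields: { number: { empty: false, valid: true } } },
  'number'
)
```

Treating a missing field as empty and not valid is a design choice for this sketch; an integration could just as well skip validation until the field appears.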

The Breaking Change:

// BEFORE PR #950 (Working)
export let fields: CardFields = {
  number: { ...initField },      // Always exists
  expiryDate: { ...initField },  // Always exists
  securityCode: { ...initField }, // Always exists
}

// AFTER PR #950 (Broken)
export let fields: CardFields & OtherFields = {}  // Empty! Only populated via handleAdd()

Immediate Response:

  • 07:04 PM: Wikimedia reported errors
  • 07:39 PM: P2 incident created
  • 08:05 PM: Revert PR created
  • 08:39 PM: Revert merged
  • 09:57 PM: Fix deployed (after CI build issues)
  • 10:00 PM: Incident resolved

Investigation Ticket: TA-13099

Created By: Luca Allievi
Title: "Make controller always load before inputs"
Objective: Ensure controller readiness before relying on field data

Acceptance Criteria:

  1. Ensure controller readiness before relying on field data
  2. Submissions must consistently include field values
  3. FORM_CHANGE events must be reliable
  4. No breaking changes to public SDK API
  5. No UX degradation (fields should be interactive immediately)

Assignment: Assigned to Bruno for investigation and implementation

Key Question from Luca:

"I'm not sure if there is another solution to the issue rather than controller loads first. If you have [one] and meet the constraints, I'm more than happy to know."


2. Root Cause Analysis

The Bug

Primary Issue: Controller's fields object empty when inputs load first

Technical Flow:

  1. SDK creates controller iframe and input iframes in parallel
  2. Input iframes load and send add messages via BroadcastChannel
  3. If controller loads after inputs, it misses the add messages
  4. Controller's fields object remains empty/incomplete
  5. Merchants accessing fields.number.empty hit a TypeError because fields.number is undefined

Why Messages Are Lost: BroadcastChannel is lossy for late subscribers - if a message is sent before a listener attaches, it's permanently dropped. No retry, no queue, no delivery guarantee.
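The delivery rule can be illustrated with a small in-memory model. This sketch does not use the browser `BroadcastChannel` API; `LossyChannel` is a hypothetical stand-in that only reproduces the "no buffering for late listeners" behavior described above.

```typescript
// Minimal in-memory model of a channel with no buffering. This is NOT the
// browser BroadcastChannel API, only an illustration of its delivery rule.
type Listener = (message: string) => void

class LossyChannel {
  private listeners: Listener[] = []

  listen(listener: Listener): void {
    this.listeners.push(listener)
  }

  // Delivers only to listeners attached at the time of the call.
  post(message: string): void {
    for (const listener of this.listeners) listener(message)
  }
}

const channel = new LossyChannel()
const received: string[] = []

channel.post('add:number') // input loads first: nobody is listening yet
channel.listen((message) => received.push(message)) // controller attaches late
channel.post('add:expiryDate') // only this message is delivered
```

After this runs, `received` holds only `'add:expiryDate'`; the earlier `'add:number'` message is gone for good, which is the situation the controller ends up in when inputs load first.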

Why It Happened

From the post-mortem "5 Whys" analysis:

  1. PR #950 removed default field values → Made race condition fatal
  2. Race condition in iframe loading → No guaranteed load order
  3. BroadcastChannel lossy → Early messages permanently lost
  4. No controller-first enforcement → SDK allowed parallel iframe creation
  5. Architectural assumption → System relied on "luck" (controller loading first)

Technical Deep Dive

Message Sequence When It Works:

sequenceDiagram
    participant SDK
    participant Controller
    participant Input

    SDK->>Controller: Create iframe
    Controller->>Controller: Load & attach listeners
    Controller->>SDK: postMessage('ready')
    SDK->>Input: Create iframe
    Input->>Input: Load
    Input->>Controller: BroadcastChannel('add', {type: 'number'})
    Controller->>Controller: fields[type]._added = true ✅
    Controller->>SDK: postMessage('form-change', {fields: {...}})

Message Sequence When It Fails:

sequenceDiagram
    participant SDK
    participant Controller
    participant Input

    SDK->>Controller: Create iframe (slow)
    SDK->>Input: Create iframe (fast)
    Input->>Input: Load first
    Input->>Controller: BroadcastChannel('add') → LOST! ❌
    Controller->>Controller: Load late, listeners attach
    Controller->>SDK: postMessage('ready')
    Controller->>SDK: postMessage('form-change', {fields: {}}) ← EMPTY!

Current Controller Boot Sequence (Vulnerable):

// apps/secure-fields/src/controller.ts
broadcastChannel.listen()  // Listeners attach here
parent.listen()
parent.message('ready')    // Signals ready to SDK

// Problem: Input messages sent BEFORE listen() are lost!

Code-Level Analysis

The handleAdd Function:

// apps/secure-fields/src/controller.ts:91-93
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
}

The Flow:

  1. Input sends: channel.message('add', { type: 'number' })
  2. Controller receives: broadcastChannel.onMessage('add', handleAdd)
  3. Handler sets: fields['number']._added = true
  4. If controller not listening: Message lost, fields['number'] never created
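One way to make the handler tolerant of a missing entry is to create the field on first `add`. This is a sketch under assumptions: the `initField` defaults are modeled on the controller state shown in this document, the optional `type` is a guess at what a malformed message looks like, and nothing here is presented as the fix that shipped in PR #977.

```typescript
interface FieldState {
  value: string
  valid: boolean
  empty: boolean
  autofilled: boolean
  _added: boolean
}

// Assumed defaults, modeled on the controller state shown in this document.
const initField: FieldState = {
  value: '',
  valid: true,
  empty: true,
  autofilled: false,
  _added: false,
}

const fields: Record<string, FieldState | undefined> = {}

// Creates the entry when it is missing instead of dereferencing undefined.
const handleAdd = (data: { type?: string }): void => {
  if (!data.type) return // ignore malformed messages without a type
  const existing = fields[data.type] ?? { ...initField }
  fields[data.type] = { ...existing, _added: true }
}

handleAdd({ type: 'number' })
handleAdd({}) // no type: ignored, nothing is created
```

This guards the handler itself but does not recover messages that were never delivered, so it complements rather than replaces a fix for the load-order race.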

From /Users/bruno/www/gr4vy/secure-fields/docs/handleAdd-undefined-type-bug-analysis.md:

The BroadcastChannel only delivers messages to listeners that are already connected when the message is sent. If the controller hasn't initialized its BroadcastChannel listener yet, it will never receive the 'add' message from inputs that loaded first.


3. All Approaches Evaluated

During the investigation, 12+ distinct approaches were analyzed. Here are the primary ones:

Option A: SDK Queues Fields Until Controller Ready (Strict Ordering)

How It Works: SecureFields.addField() returns stubs that queue method calls (setPlaceholder, setStyles, etc.). Real inputs created only after controller ready; queued calls then flush.

Implementation:

  • Gate input creation/DOM replacement until ready received
  • Stubs collect calls and listeners
  • Flush queue when controller signals ready
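The stub-and-flush idea can be sketched as follows. `FieldApi` and `QueuedFieldStub` are hypothetical names, and only one method is modeled; the real SDK surface is larger, which is where the proxying complexity noted below comes from.

```typescript
// Hypothetical stand-in for a real input; only one method is modeled.
interface FieldApi {
  setPlaceholder(text: string): void
}

class QueuedFieldStub implements FieldApi {
  private pending: Array<(real: FieldApi) => void> = []
  private real: FieldApi | null = null

  setPlaceholder(text: string): void {
    this.run((real) => real.setPlaceholder(text))
  }

  // Called once the controller has signaled ready and the real input exists.
  attach(real: FieldApi): void {
    this.real = real
    for (const call of this.pending) call(real)
    this.pending = []
  }

  // Forwards immediately when attached, otherwise queues the call.
  private run(call: (real: FieldApi) => void): void {
    if (this.real) call(this.real)
    else this.pending.push(call)
  }
}

const applied: string[] = []
const realInput: FieldApi = {
  setPlaceholder: (text) => {
    applied.push(text)
  },
}

const stub = new QueuedFieldStub()
stub.setPlaceholder('Card number') // queued: controller not ready
const appliedBeforeReady = applied.length
stub.attach(realInput) // flushes the queue in call order
stub.setPlaceholder('MM/YY') // forwarded immediately
```

Until `attach` runs, nothing reaches a real input, which is exactly the UX delay that counts against this option.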

Pros:

  • Enforces "controller first, then inputs"
  • No internal protocol changes needed
  • Literal reading of ticket requirements

Cons:

  • ❌ Inputs not interactive until ready → UX delay/regression
  • ❌ Higher complexity for proxying methods and listeners
  • ❌ More edge cases to handle

Decision: ❌ Rejected - UX degradation unacceptable

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:

Fields are not interactive until ready → UX delay/regression. Higher complexity for proxying methods and listeners; more edge cases. Tradeoffs: most literal reading of AC, but least user-friendly and most invasive.


Option B: Inputs Buffer Messages Until Controller Ready (Push-Based)

How It Works: Inputs render immediately but buffer their BroadcastChannel messages behind a local controllerReady flag. When controller emits one-shot "ready" broadcast, inputs flush buffered messages.

Implementation:

  • Inputs add local queue
  • Controller emits one-shot broadcast
  • Inputs flush on receiving "ready"
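A rough sketch of the input-side buffer, assuming a hypothetical `BufferingInput` class and an injected `send` function in place of the real channel:

```typescript
type Message = { type: string; payload?: unknown }

class BufferingInput {
  private controllerReady = false
  private buffer: Message[] = []
  private send: (message: Message) => void

  constructor(send: (message: Message) => void) {
    this.send = send
  }

  // Outgoing messages are held back until the controller is known to be ready.
  emit(message: Message): void {
    if (this.controllerReady) this.send(message)
    else this.buffer.push(message)
  }

  // Invoked when the controller's one-shot "ready" broadcast arrives.
  onControllerReady(): void {
    this.controllerReady = true
    for (const message of this.buffer) this.send(message)
    this.buffer = []
  }

  get pendingCount(): number {
    return this.buffer.length
  }
}

const delivered: Message[] = []
const input = new BufferingInput((message) => delivered.push(message))

input.emit({ type: 'add', payload: { type: 'number' } })
const deliveredBeforeReady = delivered.length
input.onControllerReady() // flushes the buffered add
```

If the controller's one-shot broadcast fires before the input has attached its listener, `onControllerReady` never runs and the buffer never drains, which is the weakness listed in the cons below.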

Pros:

  • No visible UX delay
  • Fields work immediately

Cons:

  • ❌ If controller loads first (common case), inputs miss the one-shot "ready"
  • ❌ Needs intervals or repeated signals → flakiness
  • ❌ More moving parts at input side
  • ❌ Persistent timers undesirable

Decision: ❌ Rejected - Complexity and potential flakiness


Option C: Controller-Driven Sync (Pull-Based) ⭐ CHOSEN

How It Works: After attaching listeners, controller broadcasts a one-time sync request. Each input responds by re-sending its current state: add and update. Number input also re-sends derived update-field for security code sizing/label.

Implementation Details:

Controller (apps/secure-fields/src/controller.ts):

// Reorder to:
broadcastChannel.listen()      // 1. Attach listeners FIRST
parent.listen()                 // 2. Attach parent listener
broadcastChannel.message('sync') // 3. Request state replay
parent.message('ready')         // 4. Signal ready to SDK

Inputs (apps/secure-fields/src/input.ts):

channel.onMessage('sync', () => {
  // Re-mark field as added
  channel.message('add', { type })

  // Re-send current value/validity
  fireFormUpdate({ target: input })

  // If number field, also re-emit CVV constraints
  if (input.id === 'number') {
    const { code, schema } = validate(...)
    const codeLabel = currentCodeLabel || code?.name || 'CVV'
    const size = code?.size
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})

Pros:

  • ✅ Robust to either load order
  • ✅ No UX delay - inputs interactive immediately
  • ✅ No timers/intervals needed
  • ✅ Minimal code changes (3 files)
  • ✅ No public API changes
  • ✅ Single one-shot broadcast at boot
  • ✅ Best balance of simplicity, UX, and reliability

Cons:

  • Introduces one internal message type (sync)
  • One extra broadcast at boot (minimal overhead)

Decision: CHOSEN - Best balance of all factors

Sync Completion Detection:

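// Controller tracks which fields have synced
const syncAddedTypes = new Set<string>()
const syncUpdatedTypes = new Set<string>()

const checkSyncCompletion = () => {
  // Complete once every added field type has sent at least one update
  const allUpdated = [...syncAddedTypes].every((type) => syncUpdatedTypes.has(type))
  if (allUpdated) {
    parent.message('sync-complete', {
      bootStartedAt,
      syncCompletedAt
    })
  }
}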

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:

Adopt Option C (controller-issued sync + early listener attach). It fixes the race with minimal code, no UX regression, and no public API changes. It is robust to either load order and aligns with the acceptance criteria's intent.


Option D: SDK Bridge Fallback

How It Works: SDK caches last input event per field. On controller ready, sends a sync payload to controller via postMessage. Controller merges it as update.
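As a rough illustration of the SDK-side cache, the sketch below keeps the latest snapshot per field and builds a payload on demand. All names (`FieldSnapshot`, `onInputEvent`, `buildSyncPayload`) are hypothetical, and the snapshot deliberately carries only status flags, on the assumption that field values should not be held on the host page.

```typescript
// Status flags only; this sketch assumes values are never cached host-side.
interface FieldSnapshot {
  valid: boolean
  empty: boolean
}

// SDK-side cache keyed by field type; names are illustrative.
const lastEventByField = new Map<string, FieldSnapshot>()

function onInputEvent(fieldType: string, snapshot: FieldSnapshot): void {
  lastEventByField.set(fieldType, snapshot) // keep only the latest event per field
}

// Built when the controller reports ready; it would then be sent via postMessage.
function buildSyncPayload(): Record<string, FieldSnapshot> {
  const payload: Record<string, FieldSnapshot> = {}
  lastEventByField.forEach((snapshot, fieldType) => {
    payload[fieldType] = snapshot
  })
  return payload
}

onInputEvent('number', { valid: true, empty: true })
onInputEvent('number', { valid: false, empty: false }) // overwrites the first
onInputEvent('expiryDate', { valid: true, empty: false })

const syncPayload = buildSyncPayload()
```

Even in this reduced form the SDK has to mirror state that the inputs already own, which hints at the duplication concern raised below.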

Pros:

  • Avoids adding BroadcastChannel message for sync

Cons:

  • ❌ Wider change surface (SDK + controller)
  • ❌ Duplicated logic
  • ❌ Must handle number→securityCode derivation explicitly
  • ❌ Not as clean as Option C

Decision: ❌ Rejected - Less elegant than Option C


4. Chosen Solution: Controller Sync (PR #976)

Design

Core Mechanism: Pull-based resync where controller asks inputs to replay their state

Key Components:

  1. Controller broadcasts sync on boot (after attaching listeners)
  2. Inputs respond by re-sending add and update
  3. Controller tracks completion via sets of added/updated types
  4. SDK receives sync-complete with minimal timing info
  5. Single 5s timeout if controller never loads
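The core of the mechanism can be simulated in memory. This is a simplified, synchronous sketch: `Bus` is a hypothetical stand-in for BroadcastChannel (no buffering, senders do not receive their own messages), and only the `add` replay is modeled, not `update` or `sync-complete`.

```typescript
// In-memory stand-in for BroadcastChannel: late listeners miss earlier messages.
type Msg = { type: string; field?: string }
type Handler = (message: Msg) => void

class Bus {
  private handlers: Handler[] = []

  listen(handler: Handler): void {
    this.handlers.push(handler)
  }

  // A sender does not receive its own message.
  post(message: Msg, from: Handler): void {
    for (const handler of [...this.handlers]) {
      if (handler !== from) handler(message)
    }
  }
}

const bus = new Bus()
const controllerFields = new Set<string>()

function bootInput(field: string): void {
  const handler: Handler = (message) => {
    // Replay state whenever the controller asks for it.
    if (message.type === 'sync') bus.post({ type: 'add', field }, handler)
  }
  bus.listen(handler)
  bus.post({ type: 'add', field }, handler) // initial announcement
}

function bootController(): void {
  const handler: Handler = (message) => {
    if (message.type === 'add' && message.field) controllerFields.add(message.field)
  }
  bus.listen(handler) // 1. attach listeners first
  bus.post({ type: 'sync' }, handler) // 2. then request a replay
}

bootInput('number') // inputs load first: their initial add is lost
bootInput('expiryDate')
const knownBeforeController = controllerFields.size
bootController() // late controller still ends up with both fields
```

Reversing the boot order gives the same final state, since the controller then receives each input's initial `add` directly and the `sync` broadcast simply finds no inputs to answer it.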

Implementation Details

Files Changed: 3 files

  • apps/secure-fields/src/controller.ts (Controller sync logic)
  • apps/secure-fields/src/input.ts (Input sync handler)
  • packages/secure-fields/src/index.ts (SDK timeout and diagnostics)

Message Flow:

sequenceDiagram
    autonumber
    participant SDK
    participant Controller
    participant Input1 as Input (number)
    participant Input2 as Input (expiry)

    Note over SDK: User creates SecureFields
    SDK->>Controller: Create iframe
    SDK->>Input1: Create iframe
    SDK->>Input2: Create iframe

    Note over Input1,Input2: Inputs may load first
    Input1->>Input1: Load, attach listeners
    Input2->>Input2: Load, attach listeners

    Note over Controller: Controller loads
    Controller->>Controller: Attach listeners FIRST
    Controller-->>Input1: BroadcastChannel('sync')
    Controller-->>Input2: BroadcastChannel('sync')
    Controller->>SDK: postMessage('ready')

    Note over Input1,Input2: Inputs replay state
    Input1-->>Controller: add({type: 'number'})
    Input1-->>Controller: update({number: {value, valid, empty}})
    Input1-->>Controller: update-field({id: 'securityCode', size, label})

    Input2-->>Controller: add({type: 'expiryDate'})
    Input2-->>Controller: update({expiryDate: {value, valid, empty}})

    Note over Controller: All fields synced
    Controller->>Controller: checkSyncCompletion()
    Controller->>SDK: postMessage('sync-complete', {timings})

    Note over SDK: Normal operation
    Controller->>SDK: postMessage('form-change', {fields, complete})

Code Snippets:

Controller Boot Sequence:

// apps/secure-fields/src/controller.ts
// CRITICAL ORDER:
broadcastChannel.listen()        // 1. Listen first
parent.listen()                  // 2. Parent listener
broadcastChannel.message('sync') // 3. Request sync
parent.message('ready')          // 4. Signal ready

Input Sync Handler:

// apps/secure-fields/src/input.ts
channel.onMessage('sync', () => {
  channel.message('add', { type })
  fireFormUpdate({ target: input })

  if (input.id === 'number') {
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})

SDK Timeout and Diagnostics:

// packages/secure-fields/src/index.ts

// 5-second hard timeout
const timeoutId = setTimeout(() => {
  if (!controllerReady) {
    error('Controller failed to load within timeout', {
      timeoutMs: 5000
    })
  }
}, 5000)

// Inside the SDK's message handler switch:

// Clear on ready
case 'ready':
  clearTimeout(timeoutId)
  processFieldQueue()
  break

// Diagnostic logging (debug only)
case 'sync-complete':
  if (_controllerReadyDelayed) {
    log('Controller sync completed after delay', data)
  }
  break

Why This Approach

  1. No Public API Changes: Internal messaging only, merchants don't need updates
  2. Backward Compatible: Existing integrations continue working unchanged
  3. Minimal Code Changes: Only 3 files touched, ~100 lines added
  4. No Timers/Polling: Single one-shot sync broadcast
  5. Inputs Interactive Immediately: No UX delay or placeholder states
  6. Robust: Works regardless of load order
  7. PCI Safe: No storage of sensitive values, ephemeral messaging only

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-implementation-summary.md:

We adopted Option C from the design report: a controller-driven replay request (pull-based resync) that works regardless of load order. No public API changes. No timers/intervals for polling. Inputs remain interactive immediately.

Testing

E2E Test: Delayed controller scenario

// packages/example-cdn/index.e2e.test.ts
// Delay controller by ~2s
await page.route('**/controller.html*', async (route) => {
  await new Promise((resolve) => setTimeout(resolve, 2000))
  await route.continue()
})

// Should still submit complete payload
expect(submitPayload).toHaveProperty('payment_method.number')
expect(submitPayload).toHaveProperty('payment_method.expiration_date')
expect(submitPayload).toHaveProperty('payment_method.security_code')

Unit Tests:

  • Controller resync completion detection
  • Input sync handler replays state correctly
  • Number input re-emits CVV update-field
  • SDK timeout triggers after 5s if no ready
  • sync-complete logged only when delayed

Test Plan from /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-test-plan.md:

  1. Delayed controller still submits values ✅
  2. Controller never ready → single timeout error ✅
  3. Sync completes only after updates for all added fields ✅
  4. Number input on sync re-emits CVV settings ✅

5. Review Feedback & Discussion

PR #976 Reviews

Gary Evans (September 16, 2025):

Q: What if controller fails to load?

A: SDK has a 5-second timeout. After 5s, if controller hasn't posted ready, SDK logs a single error. Inputs remain rendered but submit won't work (controller owns submit flow).

Q: Can users submit before sync completes?

A: If controller is not ready, the submit call won't succeed. If controller is ready but still syncing, it processes whatever state it has. Best practice: disable submit until both the ready event has fired AND FORM_CHANGE.complete === true.

Q: Does custom validation work during slow controller load?

A: Yes for input-level validation (runs in each input iframe independently). Delayed for controller-aggregated state (FORM_CHANGE event). Recommendation: tie field UI/validation to input events for responsiveness; use FORM_CHANGE.complete only to enable submit.

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-faq.md:

If the controller is not ready yet: a submit call won't succeed — the controller won't process it and you won't receive a success event. No invalid data is sent; it simply doesn't complete.


Giordano Arman (September 18, 2025):

Feedback: Recommended waiting for Luca's review before proceeding. Suggested getting senior architect input on long-term architectural implications.

Action Taken: Senior architecture review requested and completed (see next section).


Peer Reviews

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-peer-review.md:

Design Decisions Explained:

  1. Controller-driven replay (pull-based)

    • Why: Ensures controller receives complete state from source of truth (inputs)
    • Avoided: SDK-side caching (duplicates logic), queuing events, timers/intervals
  2. Single sync broadcast

    • Why: Minimal overhead, no polling
    • Risk mitigation: Inputs attach listeners during boot; controller sends sync after its listeners attach
  3. Minimal diagnostics

    • sync-complete with {bootStartedAt, syncCompletedAt} only
    • Conditional logging: only when field added before ready
    • Avoided: Soft delay timers, verbose telemetry

What Was Intentionally Avoided:

  • Polling loops or intervals
  • SDK-side state caching
  • Queuing field methods
  • Adding public API changes
  • Complex timing metrics

From peer review:

Net effect: minimal changes with maximal reliability and no UX regression. We fixed the race at its source (missed messages) with a small, explicit replay and preserved the public API.


Senior Architect Review

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-senior-architecture-review.md:

Observations:

  1. Sync model is intentionally minimal and pull-based

    • Fixes root race (missed BroadcastChannel messages)
    • No queuing, polling, or SDK logic duplication
  2. Message hygiene good in apps; weaker in SDK init ⚠️

    • Apps use MessageHandler.PostMessage with origin+channel checks
    • SDK constructor checks origin only; submit path checks origin+channel+type
    • Suggestion: Add channel check to SDK constructor for consistency
  3. Completeness semantics sound

    • complete requires number, expiryDate, plus conditionally securityCode/postalCode
  4. Scoped replay and minimal state

    • Two small sets of field types during resync
    • Clears on completion
    • Minimal diagnostic payload

Questions for Long-Term Resilience:

  1. Is one-shot sync enough across all browsers?

    • With storage-based BroadcastChannel polyfill, small ordering window exists
    • Suggestion: Consider re-issuing sync once on first add (microtask-deferred)
  2. Multiple instances on same page?

    • Two SDK instances would share channel
    • Suggestion: Per-instance channel suffix (nonce)
  3. SPA lifecycle - how to tear down?

    • No destroy() method
    • Suggestion: Add explicit cleanup for SPAs
  4. EventBus/ListenersManager: global scope concerns?

    • EventBus.unsubscribeAll() clears global subscribers
    • ListenersManager.remove() uses toString() matching (brittle)
    • Suggestion: ID-centric listener management
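To make the last suggestion concrete, here is a minimal sketch of ID-centric listener management. `ListenerRegistry` is a hypothetical class written for this example and is not the codebase's `ListenersManager`.

```typescript
type Callback = (payload: unknown) => void

// Hypothetical registry: each subscription gets an opaque ID, so removal
// never depends on comparing function source text.
class ListenerRegistry {
  private nextId = 1
  private listeners = new Map<number, Callback>()

  add(callback: Callback): number {
    const id = this.nextId++
    this.listeners.set(id, callback)
    return id
  }

  remove(id: number): boolean {
    return this.listeners.delete(id)
  }

  emit(payload: unknown): void {
    this.listeners.forEach((callback) => callback(payload))
  }
}

const registry = new ListenerRegistry()
const calls: string[] = []

// Two listeners with identical source text, which toString() matching
// cannot tell apart.
const makeListener = (label: string): Callback => () => {
  calls.push(label)
}
const firstId = registry.add(makeListener('a'))
registry.add(makeListener('b'))

registry.remove(firstId) // removes exactly one subscription
registry.emit(null)
```

Because removal is keyed by ID, two closures with the same source text stay distinguishable, and a per-instance registry would also keep one SDK instance from clearing another's subscribers.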

Hardening Suggestions:

1. SDK Constructor Channel Check:

const listener = (message: MessageEvent) => {
  const isKnownOrigin = message.origin === this.frameUrl
  const isKnownChannel = message?.data?.channel === MESSAGE_CHANNEL
  if (!(isKnownOrigin && isKnownChannel)) return
  // ... handle message
}

2. Microtask-Deferred Re-Sync (for edge cases):

let resyncReissued = false
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
  if (syncInProgress && !resyncReissued) {
    resyncReissued = true
    queueMicrotask(() => broadcastChannel.message('sync'))
  }
}

3. SPA-Safe Teardown:

class SecureFields {
  destroy() {
    ListenersManager.removeAll()
    EventBus.unsubscribeAll()
    // Remove frames
    this.controller?.remove()
  }
}

Rationale and Trade-offs:

Keeping replay source of truth in inputs avoids duplicating validation/formatting in the SDK, minimizing risk and drift. A microtask-delayed "second sync" mitigates the only remaining known race without introducing timers/intervals.

Closing Statement:

The chosen fix for TA-13099 is the right long-term call: one-shot, controller-driven sync from the inputs (the real state owners). The suggestions above harden boundary conditions without increasing complexity for integrators or expanding the public API.


SENIOR_ARCHITECT_REVIEW.md (Critical Analysis)

From /Users/bruno/www/gr4vy/secure-fields/docs/SENIOR_ARCHITECT_REVIEW.md:

This was a highly critical review identifying deeper architectural concerns:

🔴 Critical Architectural Issues:

  1. Real Problem: Temporal Dependencies in Distributed System

    • Race condition is symptom of fundamentally flawed architecture
    • 3+ independent processes (SDK + Controller + N inputs) with implicit ordering
    • No guaranteed message ordering
    • No error recovery if controller crashes
    • Silent failures when BroadcastChannel unsupported
  2. The Proposed Queue is a Band-Aid

    • Doesn't address lack of message ordering guarantees
    • No mechanism for detecting/handling stalled initialization
    • Creates new failure modes
  3. Violation of Core Design Principles

    • SDK now handles async state management (doesn't belong there)
    • Fails silently later instead of fail-fast
    • Single Responsibility Principle violated

From senior architect review:

The race condition isn't just a timing issue—it's a symptom of a fundamentally flawed architecture where we have 3+ independent processes with implicit ordering dependencies. The proposed queue is a band-aid that doesn't address: No guaranteed message ordering between iframes, No error recovery if controller crashes, No way to detect/handle stalled initialization.

Alternative: Event-Driven Coordination with Promises:

class SecureFields {
  private controllerInitialized: Promise<void>

  constructor(config: Config) {
    this.controllerInitialized = this.initController(config)
  }

  async addCardNumberField(element, options): Promise<SecureInput> {
    await this.controllerInitialized  // Wait for controller
    return this._createCardNumberField(element, options)
  }
}

Benefits Over Current Solution:

  • Explicit dependencies (controller must load first)
  • Fail-fast (immediate error if controller fails)
  • Promise-based (natural async/await patterns)
  • Memory safe (no persistent queues)
  • Testable (easy to mock controller init)

Production Readiness Red Flags:

  1. Creates new race condition (queue processing vs controller ready)
  2. No fallback if BroadcastChannel fails
  3. Silent degradation (placeholder objects mask failures)
  4. No monitoring hooks for queue overflow

Final Verdict: 🔴 REQUEST CHANGES

From senior architect:

As the senior architect who has seen this codebase evolve over 5 years, I'm concerned about accumulating technical debt. Each "quick fix" makes the next problem harder to solve. The queueing approach is acceptable as an interim fix, but we should be planning the move to Promise-based initialization within the next 2 quarters.


6. Team Decision & Outcome

September 29-30, 2025: Decision to Explore Alternatives

What Happened: After extensive review discussions, the team decided to explore a different architectural approach rather than proceeding with PR #976.

Actions Taken:

  • PR #976 moved to "Blocked" status
  • TA-13099 closed (investigation complete)
  • TA-13399 created for new investigation

Rationale:

  1. Solution Seen as "Band-Aid"

    • Fixes immediate symptom but doesn't address underlying architectural issues
    • Creates new edge cases and failure modes
    • Adds complexity to already complex iframe communication
  2. Desire for Architectural Simplification

    • Current solution adds more events between iframes
    • sync-complete mechanism adds new coordination layer
    • Long-term maintainability concerns
  3. Senior Architect Concerns

    • Temporal dependencies in distributed system
    • Silent failure modes
    • Memory management concerns
    • Violation of design principles
  4. Alternative Approaches Worth Exploring

    • Promise-based API (async/await patterns)
    • Service worker coordination
    • Explicit controller-first loading
    • Unified message handler architecture

From team discussions:

While PR #976 successfully solves the immediate race condition, the team consensus is to explore architectural patterns that eliminate the race condition by design rather than working around it. The investigation produced valuable insights that will inform the next approach.


7. Technical Artifacts

Architecture Documentation

Current System Architecture:

Components:

  1. SDK (Host Page) - packages/secure-fields/src/

    • Public API for integrators
    • Creates controller and field iframes
    • Relays events via EventBus
  2. Controller (Hidden Iframe) - apps/secure-fields/src/controller.ts

    • Central state for all fields
    • Validates form completeness
    • Updates Checkout Session on submit
  3. Input Fields (Iframes) - apps/secure-fields/src/input.ts

    • One iframe per field (number, expiry, CVV, postal code)
    • Handle formatting, validation, autofill
    • Emit field-level events

Communication Channels:

  • postMessage: SDK ↔ Controller, SDK ↔ Inputs
  • BroadcastChannel (secure-fields): Controller ↔ Inputs
  • BroadcastChannel (secure-fields-card): Controller ↔ Click to Pay Encrypt
  • MessageChannel: Click to Pay Controller ↔ Encrypt (port transfer)

Message Types:

  • Controller ↔ SDK: ready, form-change, submit, success, error
  • Controller ↔ Inputs: add, update, reset, sync (new)
  • Input ↔ SDK: focus, blur, input, update

From /Users/bruno/www/gr4vy/secure-fields/docs/SECURE_FIELDS_ARCHITECTURE_REPORT.md:

sequenceDiagram
    participant Host as Host Page (SDK)
    participant Ctrl as Controller iframe
    participant Num as Input iframe (number)
    participant API as Checkout Sessions API

    Host->>Ctrl: Create iframe controller.html
    Ctrl-->>Host: postMessage ready

    Host->>Num: Create iframe input.html?type=number
    Num-->>Host: onload
    Host->>Num: postMessage update{styles, label}
    Num-->>Ctrl: BroadcastChannel add{type: 'number'}

    Num->>Ctrl: BroadcastChannel update{number: {value, valid, empty}}
    Ctrl-->>Host: postMessage form-change{fields, complete}

    Host->>Ctrl: postMessage submit{method: 'card'}
    Ctrl->>API: PUT /checkout/sessions/:id/fields
    API-->>Ctrl: 200
    Ctrl-->>Host: postMessage success{scheme}

State Management

Controller Field State:

export let fields: CardFields & OtherFields = {
  number: {
    value: '',         // Sanitized (not PAN)
    valid: true,
    empty: true,
    autofilled: false,
    _added: false      // Internal flag
  },
  expiryDate: { /* similar */ },
  securityCode: { /* similar */ },
  postalCode: { /* similar */ }
}

Form Complete Logic:

const complete = Object.entries(fields)
  .filter(([_, field]) => field._added)
  .every(([_, field]) => field.valid && !field.empty)
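A few worked cases help show how this rule behaves. The sketch below wraps the same filter-and-every logic in a hypothetical `isComplete` function with a reduced `FieldState` shape; the sample field objects are made up for illustration.

```typescript
interface FieldState {
  valid: boolean
  empty: boolean
  _added: boolean
}

// Same rule as above, wrapped in a function so it can be exercised in isolation.
function isComplete(fields: Record<string, FieldState>): boolean {
  return Object.values(fields)
    .filter((field) => field._added)
    .every((field) => field.valid && !field.empty)
}

const cardForm: Record<string, FieldState> = {
  number: { valid: true, empty: false, _added: true },
  expiryDate: { valid: true, empty: true, _added: true }, // still empty
}

const cvvOnly: Record<string, FieldState> = {
  number: { valid: true, empty: true, _added: false }, // not mounted: ignored
  securityCode: { valid: true, empty: false, _added: true },
}

const results = {
  cardForm: isComplete(cardForm), // false: expiryDate is empty
  cvvOnly: isComplete(cvvOnly), // true: only added fields count
  noFields: isComplete({}), // true: every() on an empty list is true
}
```

The last case is worth noting: if no `add` message has been recorded, this rule on its own reports the form as complete, so the result is only meaningful once the controller knows which fields exist.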

Related Investigations

From Documentation Files:

  1. PR #950 Investigation (PR-950-investigation-summary.md)

    • Detailed analysis of the incident
    • Breaking change identification
    • Payload structure comparison
    • Root cause: architectural change from static to dynamic field management
  2. Wikimedia Breaking Change Analysis (wikimedia-breaking-change-analysis.md)

    • Merchant code assumptions documented
    • Defensive coding patterns analyzed
    • API contract violation explained
    • Shows how implicit contracts matter
  3. handleAdd Undefined Bug (handleAdd-undefined-type-bug-analysis.md)

    • Secondary bug discovered during investigation
    • SecureInput.update() inconsistency
    • Missing type field in dynamic updates
    • Fixed in PR #977
  4. setPlaceholder Bug Flow (setPlaceholder-bug-flow.md)

    • Detailed message flow diagrams
    • Working vs broken scenarios visualized
    • Code path analysis with mermaid diagrams
  5. Comprehensive Analysis Docs:

    • CURRENT_SYSTEM_ANALYSIS.md - Full system architecture
    • TA-13099_IMPLEMENTATION_DOCUMENTATION.md - Line-by-line code analysis
    • FINAL_IMPLEMENTATION_SUMMARY.md - Complete solution summary
    • PROPOSED_SOLUTION.md - Initial proposal
    • REFACTORED_SOLUTION.md - Iteration
    • REFACTORING_EFFORT_ANALYSIS.md - Bundle size, complexity analysis

Historical Context:

  • PR #950: Added postal code but broke field initialization (August 27)
  • PR #959: Safe reimplementation of postal code (retained field defaults)
  • PR #976: Controller sync solution (September 16)
  • PR #977: Fixed handleAdd undefined type bug (September)

8. FAQ

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-faq.md:

Q: What happens if controller.js fails to load?

A: The SDK never receives ready. After ~5 seconds, SDK logs a single timeout error (debug logging must be enabled). Inputs remain rendered and editable, but tokenization/submission cannot complete because controller owns the submit flow.

Recommended integration: Gate submit on SDK ready event and show user-friendly error if not received within your own timeout window.


Q: If controller takes a long time to load, do custom validations still run while inputs are editable?

A:

  • Yes for input-level validation: Each input iframe validates and formats on every keystroke, independent of controller. SDK still receives input events, so container attributes/state update as usual (invalid, autofilled, CVV label).
  • Delayed for controller-aggregated state: FORM_CHANGE event (with complete flag) comes from controller, so it starts flowing after controller is ready and finishes initial sync.

Recommendation: Keep field UI/validation tied to input events for responsiveness; use FORM_CHANGE.complete only to enable submit.


Q: If controller takes a long time to load, can a user submit before values are synced?

A:

  • If controller not ready: The submit call does not succeed. The controller does not process it and no success event is emitted. No invalid data is sent; the submission simply does not complete.
  • If controller ready but still syncing: Processes whatever state it has so far. The complete flag in FORM_CHANGE indicates when form is fully ready.

Recommended integration:

  1. Disable submit until the SDK has fired the ready event
  2. Also require that the latest FORM_CHANGE.complete is true
  3. Together, these conditions guarantee the controller has all current input values before submission
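The two conditions above can be sketched as a small state holder. This is an illustration under assumptions: `SubmitGate` and `FormChangePayload` are hypothetical names, and the only field taken from the source is the `complete` flag carried by FORM_CHANGE.

```typescript
// Hypothetical integration helper; names are illustrative, not SDK API.
interface FormChangePayload {
  complete: boolean
}

class SubmitGate {
  private sdkReady = false
  private formComplete = false

  // Call when the SDK reports that the controller is ready
  onReady(): void {
    this.sdkReady = true
  }

  // Call on every FORM_CHANGE; only the latest complete flag counts
  onFormChange(payload: FormChangePayload): void {
    this.formComplete = payload.complete
  }

  canSubmit(): boolean {
    return this.sdkReady && this.formComplete
  }
}
```

A submit button's disabled state would then mirror `!gate.canSubmit()` after each event.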

Q: What are the edge cases?

A:

  • Stored payment method (CVV only): sync replays whatever is present. Number/expiry not added; CVV-only flows behave as before.
  • Autofill (Chrome): isAutofilled already computed; resync emits current update snapshot including autofilled state.
  • Click to Pay: Controller mirrors updates to card channel; resync integrates seamlessly.
  • Controller never loads: After 5s, SDK logs single error.
  • Method change: Controller already sends reset; if desired, can broadcast another sync afterwards to rehydrate values.

9. Appendices

A. All Documentation Files Created (34+)

Core Implementation Documentation

  1. TA-13099-controller-init-report.md - Initial analysis of all solution options
  2. TA-13099-implementation-summary.md - High-level implementation summary
  3. TA-13099-final-implementation.md - Detailed final implementation report
  4. TA-13099_IMPLEMENTATION_DOCUMENTATION.md - Line-by-line code analysis (768 lines)
  5. TA-13099_IMPLEMENTATION_PLAN.md - Step-by-step implementation plan
  6. TA-13099_REVISED_IMPLEMENTATION.md - Iteration after initial feedback
  7. TA-13099_PR_DETAILS.md - PR #976 details and description

Review Documentation

  1. TA-13099-peer-review.md - Detailed peer review notes
  2. TA-13099-senior-architecture-review.md - Senior architect suggestions
  3. PEER_REVIEW_ANALYSIS.md - Critical analysis from peer perspective
  4. SENIOR_ARCHITECT_REVIEW.md - Critical architectural concerns (5-year perspective)
  5. SENIOR_ARCHITECT_CRITICAL_ANALYSIS.md - Deep architectural critique

Testing Documentation

  1. TA-13099-test-plan.md - Comprehensive test plan
  2. TA-13099-test-checklist.md - Test execution checklist
  3. TA-13099-faq.md - Frequently asked questions

Incident Analysis

  1. PR-950-INCIDENT.md - Production incident summary
  2. PR-950-investigation-summary.md - Detailed incident investigation
  3. wikimedia-breaking-change-analysis.md - Merchant impact analysis

Bug Analysis

  1. handleAdd-undefined-type-bug-analysis.md - Secondary bug discovered
  2. setPlaceholder-bug-flow.md - Message flow diagrams for bug
  3. secure-input-type-bug-analysis.md - Input type handling bug
  4. empty-added-field-log-analysis.md - Log analysis of field addition

Architecture Documentation

  1. SECURE_FIELDS_ARCHITECTURE_REPORT.md - Complete architecture overview
  2. SECURE_FIELDS_ARCHITECTURE_REPORT copy.md - Backup/alternate version
  3. SECURE_FIELDS_NEW_ARCH.md - Proposed new architecture
  4. SECURE_FIELDS_NEW_ARCH_CLAUDE.md - Alternative architecture proposal
  5. CURRENT_SYSTEM_ANALYSIS.md - Current system deep dive

Solution Iterations

  1. PROPOSED_SOLUTION.md - Initial solution proposal
  2. REFACTORED_SOLUTION.md - Refactored approach
  3. WAIT-FOR-CONTROLLER-IMPLEMENTATION.md - Alternative implementation
  4. FINAL_IMPLEMENTATION_SUMMARY.md - Summary of final approach
  5. REFACTORING_EFFORT_ANALYSIS.md - Bundle size and complexity analysis

Related Work

  1. PR-959-REVIEW.md - Review of safe postal code implementation
  2. SCHEME_BASED_CVV_IMPLEMENTATION.md - Related CVV feature

Test Files

  1. debug-required-fields.html - Debug page for testing
  2. test-loading-order.html - Load order test page
  3. wikimedia-repro.html - Wikimedia scenario reproduction
  4. wikimedia-repro-force-error.html - Force error scenarios

Total Documentation: 34+ markdown files, 3+ HTML test files


B. Code References

Controller Changes

/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/controller.ts
Lines affected: ~15-30 (boot sequence reorder)
New functionality:
  - Sync broadcast after listener attachment
  - Sync completion tracking (syncAddedTypes, syncUpdatedTypes)
  - checkSyncCompletion() function
  - sync-complete message to SDK

Input Changes

/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/input.ts
Lines affected: ~10-25 (new sync handler)
New functionality:
  - channel.onMessage('sync', handler)
  - Re-emit add and update on sync
  - Number field re-emits update-field for CVV

SDK Changes

/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/index.ts
Lines affected: ~20-40 (timeout and diagnostics)
New functionality:
  - 5-second timeout for controller ready
  - _controllerReadyDelayed flag
  - sync-complete handler with conditional logging

Constants

/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/constants.ts
/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/constants.ts
New: 'sync' message type
Modified: MESSAGE_CHANNEL usage

Types

/Users/bruno/www/gr4vy/secure-fields/apps/secure-fields/src/types.ts
/Users/bruno/www/gr4vy/secure-fields/packages/secure-fields/src/types.ts
New: Sync-related message types
Modified: CardFields type definitions

C. Related Tickets

Primary Ticket

  • TA-13099 - Make controller always load before inputs
    • Created: September 8, 2025
    • Assigned: Bruno
    • Status: Closed (investigation complete)
    • Outcome: PR #976 blocked, team exploring alternatives

Follow-Up Tickets

  • TA-13399 - New investigation for alternative approach
    • Created: September 30, 2025
    • Status: Open
    • Objective: Explore architectural simplification

Related Tickets

  • TA-13380 - Blocked by TA-13099 resolution
    • Status: Waiting
    • Dependency: Requires stable controller initialization

Incident Tickets

  • TA-13016 - Wikimedia production incident post-mortem
    • Date: August 27, 2025
    • Root cause: PR #950 breaking change

Feature Tickets

  • TA-12842 - Add postal code field

    • Implemented: PR #950 (broke production)
    • Re-implemented: PR #959 (safe version)
  • TA-12843 - Submit postal code in checkout session

    • Depends on: TA-12842
    • Status: Completed (via PR #959)

Lessons Learned

From PR #950 Incident

  1. Breaking Changes Are Dangerous

    • Implicit API contracts matter as much as explicit ones
    • Merchant-specific code will break in unexpected ways
    • Defensive coding patterns reveal expected guarantees
  2. Race Conditions Are NOT Edge Cases

    • Production environments have unpredictable timing
    • What works locally may fail in production
    • Network latency, CDN caching, CPU load all affect timing
  3. Testing Gaps

    • Internal tools didn't catch merchant-specific patterns
    • Race conditions hard to reproduce in controlled environments
    • Need simulation of real merchant implementations
  4. Architecture Assumptions

    • System relied on "luck" (controller loading first)
    • No explicit synchronization between distributed components
    • Silent failures worse than loud failures

From TA-13099 Investigation

  1. Multiple Solutions Exist

    • Evaluated 12+ approaches with different trade-offs
    • Queue-based, Promise-based, sync-based all viable
    • Best solution depends on priorities (UX vs simplicity vs maintainability)
  2. Trade-Offs Are Inevitable

    • No perfect solution exists
    • Every approach creates new edge cases
    • Must choose which trade-offs are acceptable
  3. Architecture Review Is Critical

    • Senior architect perspective identified deeper issues
    • Peer review caught implementation concerns
    • Multiple review rounds improved solution quality
  4. Documentation Matters

    • 34+ docs created during investigation
    • Comprehensive analysis aids decision-making
    • Future teams benefit from documented thought process

Technical Insights

  1. BroadcastChannel Limitations

    • Lossy for late subscribers
    • No delivery guarantees
    • Not suitable for critical initialization messages
  2. Iframe Communication Complexity

    • Distributed system with temporal dependencies
    • No guaranteed message ordering
    • Requires explicit synchronization
  3. State Management Challenges

    • Multiple sources of truth (SDK, Controller, Inputs)
    • Synchronization requires careful design
    • Error recovery adds significant complexity

Conclusion

The TA-13099 investigation successfully identified the root cause of a critical production race condition and developed a working solution (PR #976) using controller-driven sync. The solution was technically sound, thoroughly documented, and extensively reviewed.

However, the investigation also revealed deeper architectural concerns. The team's decision to explore alternative approaches reflects a mature engineering culture that values long-term maintainability over quick fixes.

Key Outcomes:

  1. Root Cause Identified: BroadcastChannel message loss when controller loads after inputs
  2. Working Solution Developed: Controller sync with input state replay
  3. Comprehensive Documentation: 34+ documents covering all aspects
  4. Thorough Reviews: Peer and senior architect perspectives
  5. Informed Decision: Team chose to explore architectural alternatives

What Worked:

  • Systematic evaluation of 12+ solution approaches
  • Minimal code changes (3 files)
  • No public API breaking changes
  • No UX degradation
  • Comprehensive testing strategy

What Could Be Better:

  • Solution adds complexity to already complex iframe communication
  • Creates new edge cases and failure modes
  • Doesn't address underlying architectural issues
  • Temporary fix rather than permanent solution

Next Steps:

  • TA-13399 will explore architectural simplification
  • Consider Promise-based API for explicit async handling
  • Evaluate service worker coordination
  • Investigate unified message handler patterns

Bruno's Contribution:

Bruno's investigation was comprehensive, methodical, and well-documented. The 34+ documentation files created during this investigation provide invaluable insights for future architectural decisions. The work demonstrates strong technical analysis, clear communication, and commitment to quality.

The decision to not merge PR #976 doesn't diminish the value of this investigation - it's a testament to the team's commitment to building the right solution, not just the fastest solution.


Document Version: 1.0
Created: September 2025
Last Updated: September 30, 2025
Status: Investigation Complete, Alternative Approach In Progress
Author: Bruno
Reviewers: Gary Evans, Giordano Arman, Luca Allievi, Senior Architect


This document consolidates 34+ individual documentation files, Jira ticket TA-13099, PR #976 discussions, Confluence post-mortem analysis, and git commit history into a single comprehensive reference for future architectural decisions.

Secure Fields Race Condition: Bruno's Investigation & Solution (September 2025)

Author: Bruno
Date: September 2025
Ticket: TA-13099
PR: #976 (Blocked/Closed)
Status: Investigation Complete, Alternative Approach Explored


Executive Summary

In September 2025, a critical race condition in Gr4vy's Secure Fields library was investigated following a production incident (August 27, 2025) that affected ~5,000 users across multiple merchants including Wikimedia and Grammarly. The issue: controller and input iframes could load in any order, causing FORM_CHANGE events to reference undefined fields when inputs loaded first.

Root Cause: BroadcastChannel messages from inputs sent before controller was listening were permanently lost, leaving the controller's fields object incomplete.

Solution Implemented (PR #976): Controller-driven sync mechanism where controller broadcasts a one-time sync request on boot, and inputs replay their current state. This "pull-based" approach ensured complete state without UX degradation.

Outcome: After thorough peer and senior architect reviews, the team decided (Sep 29-30) to explore alternative architectural approaches. PR #976 was moved to "Blocked" status, TA-13099 was closed, and TA-13399 was created to investigate a more comprehensive solution. The investigation produced 34+ documentation files and extensive technical analysis.


Timeline

  • August 27, 2025: Production incident (PR #950) breaks Wikimedia, Grammarly, PlayHQ (~5k users)
  • August 27, 2025 19:04 UTC: Incident detected
  • August 27, 2025 22:00 UTC: Incident resolved (revert deployed)
  • September 8, 2025: TA-13099 created by Luca Allievi - "Make controller always load before inputs"
  • September 15, 2025: Initial implementation complete, documentation created
  • September 16, 2025: PR #976 opened, initial reviews from Gary Evans
  • September 18, 2025: Feedback from Giordano Arman recommending senior architect review
  • September 19-20, 2025: Senior architecture review completed
  • September 22, 2025: Peer review analysis and comprehensive documentation
  • September 25, 2025: New architecture proposals created
  • September 29-30, 2025: Team decision to explore different approach
  • September 30, 2025: TA-13099 closed, TA-13399 created

1. Context & Background

The Incident (August 27, 2025)

What Happened: PR #950 introduced a postal code field but included an architectural refactoring that changed how the controller's fields object was initialized - from pre-initialized with all expected properties to starting as an empty object {}. This breaking change caused merchants' custom validation code to fail.

Impact Metrics:

  • Affected Merchants: Wikimedia (~2.5k users), Grammarly (~2k users), PlayHQ (148 users), others
  • Total Users: ~5,000 unable to complete transactions
  • Detection Time: 07:04 PM UTC
  • Resolution Time: 10:00 PM UTC (~3 hours)
  • Priority: P1

Error Observed:

// Wikimedia's code
if (data.fields) {
  cardNumberFieldEmpty = data.fields.number.empty;  // CRASH: Cannot read properties of undefined
  cardNumberFieldValid = data.fields.number.valid;
}

The Breaking Change:

// BEFORE PR #950 (Working)
export let fields: CardFields = {
  number: { ...initField },      // Always exists
  expiryDate: { ...initField },  // Always exists
  securityCode: { ...initField }, // Always exists
}

// AFTER PR #950 (Broken)
export let fields: CardFields & OtherFields = {}  // Empty! Only populated via handleAdd()
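To make the implicit contract concrete, here is a small self-contained sketch comparing a merchant-style read against both shapes. The `FieldState` shape and the default values are assumptions for illustration (the source does not show the contents of `initField`); `strictRead` mirrors the Wikimedia access pattern and `safeRead` shows a defensive alternative using optional chaining.

```typescript
// Assumed field shape; the real initField contents are not shown in the source.
interface FieldState {
  empty: boolean
  valid: boolean
}
type CardFieldsShape = {
  number?: FieldState
  expiryDate?: FieldState
  securityCode?: FieldState
}

const initFieldState: FieldState = { empty: true, valid: false }

// Shape before PR #950: every card field exists from the start
const preInitialized: CardFieldsShape = {
  number: { ...initFieldState },
  expiryDate: { ...initFieldState },
  securityCode: { ...initFieldState },
}

// Shape after PR #950: nothing exists until an 'add' message arrives
const emptyAtBoot: CardFieldsShape = {}

// Merchant-style read that assumes the field always exists
function strictRead(state: CardFieldsShape): boolean {
  return (state.number as FieldState).empty
}

// Defensive read that treats a missing field as empty
function safeRead(state: CardFieldsShape): boolean {
  return state.number?.empty ?? true
}
```

`strictRead(emptyAtBoot)` throws a TypeError at runtime, which is the failure merchants saw, while `safeRead(emptyAtBoot)` returns `true`.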

Immediate Response:

  • 07:04 PM: Wikimedia reported errors
  • 07:39 PM: P2 incident created
  • 08:05 PM: Revert PR created
  • 08:39 PM: Revert merged
  • 09:57 PM: Fix deployed (after CI build issues)
  • 10:00 PM: Incident resolved

Investigation Ticket: TA-13099

Created By: Luca Allievi
Title: "Make controller always load before inputs"
Objective: Ensure controller readiness before relying on field data

Acceptance Criteria:

  1. Ensure controller readiness before relying on field data
  2. Submissions must consistently include field values
  3. FORM_CHANGE events must be reliable
  4. No breaking changes to public SDK API
  5. No UX degradation (fields should be interactive immediately)

Assignment: Assigned to Bruno for investigation and implementation

Key Question from Luca:

"I'm not sure if there is another solution to the issue rather than controller loads first. If you have [one] and meet the constraints, I'm more than happy to know."


2. Root Cause Analysis

The Bug

Primary Issue: Controller's fields object empty when inputs load first

Technical Flow:

  1. SDK creates controller iframe and input iframes in parallel
  2. Input iframes load and send add messages via BroadcastChannel
  3. If controller loads after inputs, it misses the add messages
  4. Controller's fields object remains empty/incomplete
  5. Merchants reading fields.number.empty hit a TypeError because fields.number is undefined

Why Messages Are Lost: BroadcastChannel is lossy for late subscribers - if a message is sent before a listener attaches, it's permanently dropped. No retry, no queue, no delivery guarantee.
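The late-subscriber behaviour can be reproduced with a tiny in-memory stand-in. `LossyBus` and `simulateLoadOrder` are hypothetical names, and delivery here is synchronous for simplicity, whereas a real BroadcastChannel delivers asynchronously; the point illustrated is only that nothing is stored for listeners that attach later.

```typescript
// In-memory stand-in (assumed) that mimics the late-subscriber behaviour.
type BusHandler = (payload: { type: string }) => void

class LossyBus {
  private handlers: BusHandler[] = []

  listen(handler: BusHandler): void {
    this.handlers.push(handler)
  }

  post(payload: { type: string }): void {
    // Delivered only to handlers attached right now; nothing is queued
    for (const handler of this.handlers) handler(payload)
  }
}

function simulateLoadOrder(inputLoadsFirst: boolean): string[] {
  const bus = new LossyBus()
  const controllerFields: string[] = []
  const attachController = () =>
    bus.listen((payload) => {
      controllerFields.push(payload.type)
    })
  const loadInput = () => bus.post({ type: 'number' })

  if (inputLoadsFirst) {
    loadInput() // 'add' is posted with nobody listening
    attachController()
  } else {
    attachController()
    loadInput()
  }
  return controllerFields
}
```

With the controller attached first, the simulated controller records `number`; with the input first, its list stays empty.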

Why It Happened

From the post-mortem "5 Whys" analysis:

  1. PR #950 removed default field values → Made race condition fatal
  2. Race condition in iframe loading → No guaranteed load order
  3. BroadcastChannel lossy → Early messages permanently lost
  4. No controller-first enforcement → SDK allowed parallel iframe creation
  5. Architectural assumption → System relied on "luck" (controller loading first)

Technical Deep Dive

Message Sequence When It Works:

sequenceDiagram
    participant SDK
    participant Controller
    participant Input

    SDK->>Controller: Create iframe
    Controller->>Controller: Load & attach listeners
    Controller->>SDK: postMessage('ready')
    SDK->>Input: Create iframe
    Input->>Input: Load
    Input->>Controller: BroadcastChannel('add', {type: 'number'})
    Controller->>Controller: fields[type]._added = true ✅
    Controller->>SDK: postMessage('form-change', {fields: {...}})

Message Sequence When It Fails:

sequenceDiagram
    participant SDK
    participant Controller
    participant Input

    SDK->>Controller: Create iframe (slow)
    SDK->>Input: Create iframe (fast)
    Input->>Input: Load first
    Input->>Controller: BroadcastChannel('add') → LOST! ❌
    Controller->>Controller: Load late, listeners attach
    Controller->>SDK: postMessage('ready')
    Controller->>SDK: postMessage('form-change', {fields: {}}) ← EMPTY!

Current Controller Boot Sequence (Vulnerable):

// apps/secure-fields/src/controller.ts
broadcastChannel.listen()  // Listeners attach here
parent.listen()
parent.message('ready')    // Signals ready to SDK

// Problem: Input messages sent BEFORE listen() are lost!

Code-Level Analysis

The handleAdd Function:

// apps/secure-fields/src/controller.ts:91-93
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
}

The Flow:

  1. Input sends: channel.message('add', { type: 'number' })
  2. Controller receives: broadcastChannel.onMessage('add', handleAdd)
  3. Handler sets: fields['number']._added = true
  4. If controller not listening: Message lost, fields['number'] never created

From /Users/bruno/www/gr4vy/secure-fields/docs/handleAdd-undefined-type-bug-analysis.md:

The BroadcastChannel only delivers messages to listeners that are already connected when the message is sent. If the controller hasn't initialized its BroadcastChannel listener yet, it will never receive the 'add' message from inputs that loaded first.
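One way to make the handler tolerant of a missing entry is to create the field lazily and ignore malformed messages. This is a hedged sketch, not the change made in PR #977; `handleAddSafely`, `trackedFields`, and the default field values are assumptions for illustration. It does not recover a lost message, it only avoids writing to an undefined entry.

```typescript
// Hypothetical defensive variant; names and defaults are illustrative.
interface TrackedField {
  empty: boolean
  valid: boolean
  _added?: boolean
}

const defaultField: TrackedField = { empty: true, valid: false }
const trackedFields: Record<string, TrackedField> = {}

function handleAddSafely(data: { type?: string }): boolean {
  // Ignore malformed messages instead of writing to fields['undefined']
  if (!data.type) return false
  // Create the entry if it is missing, keep existing state if present
  trackedFields[data.type] = {
    ...defaultField,
    ...trackedFields[data.type],
    _added: true,
  }
  return true
}
```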


3. All Approaches Evaluated

During the investigation, 12+ distinct approaches were analyzed. Here are the primary ones:

Option A: SDK Queues Fields Until Controller Ready (Strict Ordering)

How It Works: SecureFields.addField() returns stubs that queue method calls (setPlaceholder, setStyles, etc.). Real inputs created only after controller ready; queued calls then flush.

Implementation:

  • Gate input creation/DOM replacement until ready received
  • Stubs collect calls and listeners
  • Flush queue when controller signals ready

Pros:

  • Enforces "controller first, then inputs"
  • No internal protocol changes needed
  • Literal reading of ticket requirements

Cons:

  • ❌ Inputs not interactive until ready → UX delay/regression
  • ❌ Higher complexity for proxying methods and listeners
  • ❌ More edge cases to handle

Decision: ❌ Rejected - UX degradation unacceptable

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:

Fields are not interactive until ready → UX delay/regression. Higher complexity for proxying methods and listeners; more edge cases. Tradeoffs: most literal reading of AC, but least user-friendly and most invasive.


Option B: Inputs Buffer Messages Until Controller Ready (Push-Based)

How It Works: Inputs render immediately but buffer their BroadcastChannel messages behind a local controllerReady flag. When controller emits one-shot "ready" broadcast, inputs flush buffered messages.

Implementation:

  • Inputs add local queue
  • Controller emits one-shot broadcast
  • Inputs flush on receiving "ready"

Pros:

  • No visible UX delay
  • Fields work immediately

Cons:

  • ❌ If controller loads first (common case), inputs miss the one-shot "ready"
  • ❌ Needs intervals or repeated signals → flakiness
  • ❌ More moving parts at input side
  • ❌ Persistent timers undesirable

Decision: ❌ Rejected - Complexity and potential flakiness
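The first drawback can be shown with a short simulation: if the one-shot "ready" fires before the input subscribes, the buffer never flushes. `BufferingInput` and `runOptionB` are hypothetical names, and the broadcast is modelled as a plain listener array rather than a real BroadcastChannel.

```typescript
// Simulation of Option B under assumptions; not code from the repository.
class BufferingInput {
  private queue: string[] = []
  private controllerReady = false
  private deliver: (message: string) => void

  constructor(deliver: (message: string) => void) {
    this.deliver = deliver
  }

  send(message: string): void {
    if (this.controllerReady) this.deliver(message)
    else this.queue.push(message) // hold until "ready" is seen
  }

  onControllerReady(): void {
    this.controllerReady = true
    for (const message of this.queue) this.deliver(message)
    this.queue = []
  }

  pending(): number {
    return this.queue.length
  }
}

function runOptionB(controllerLoadsFirst: boolean): { delivered: string[]; stuck: number } {
  const readyListeners: Array<() => void> = []
  const broadcastReady = () => readyListeners.forEach((listener) => listener())
  const delivered: string[] = []
  const input = new BufferingInput((message) => {
    delivered.push(message)
  })
  const loadInput = () => {
    readyListeners.push(() => input.onControllerReady())
    input.send('add:number')
  }

  if (controllerLoadsFirst) {
    broadcastReady() // one-shot signal fires before the input subscribes
    loadInput()
  } else {
    loadInput()
    broadcastReady()
  }
  return { delivered, stuck: input.pending() }
}
```

In the controller-first order the message stays buffered indefinitely, which is why this option would need repeated signals or timers.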


Option C: Controller-Driven Sync (Pull-Based) ⭐ CHOSEN

How It Works: After attaching listeners, controller broadcasts a one-time sync request. Each input responds by re-sending its current state: add and update. Number input also re-sends derived update-field for security code sizing/label.

Implementation Details:

Controller (apps/secure-fields/src/controller.ts):

// Reorder to:
broadcastChannel.listen()      // 1. Attach listeners FIRST
parent.listen()                 // 2. Attach parent listener
broadcastChannel.message('sync') // 3. Request state replay
parent.message('ready')         // 4. Signal ready to SDK

Inputs (apps/secure-fields/src/input.ts):

channel.onMessage('sync', () => {
  // Re-mark field as added
  channel.message('add', { type })

  // Re-send current value/validity
  fireFormUpdate({ target: input })

  // If number field, also re-emit CVV constraints
  if (input.id === 'number') {
    const { code, schema } = validate(...)
    const codeLabel = currentCodeLabel || code?.name || 'CVV'
    const size = code?.size
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})

Pros:

  • ✅ Robust to either load order
  • ✅ No UX delay - inputs interactive immediately
  • ✅ No timers/intervals needed
  • ✅ Minimal code changes (3 files)
  • ✅ No public API changes
  • ✅ Single one-shot broadcast at boot
  • ✅ Best balance of simplicity, UX, and reliability

Cons:

  • Introduces one internal message type (sync)
  • One extra broadcast at boot (minimal overhead)

Decision: ✅ CHOSEN - Best balance of all factors
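The pull-based behaviour can be checked with an in-memory simulation that runs both load orders. Everything here (`SimBus`, `bootSimController`, `bootSimInput`, the message shapes, the sample value) is an assumption for illustration; delivery is synchronous, unlike a real BroadcastChannel.

```typescript
// In-memory simulation; all names below are illustrative assumptions.
type SimMessage =
  | { kind: 'sync' }
  | { kind: 'add'; type: string }
  | { kind: 'update'; type: string; value: string }
type SimListener = (message: SimMessage) => void
type SimField = { added: boolean; value: string }

class SimBus {
  private listeners: SimListener[] = []
  listen(listener: SimListener): void {
    this.listeners.push(listener)
  }
  post(message: SimMessage): void {
    // Only listeners attached at this moment receive the message
    for (const listener of [...this.listeners]) listener(message)
  }
}

function bootSimController(bus: SimBus): Record<string, SimField> {
  const state: Record<string, SimField> = {}
  bus.listen((message) => {
    if (message.kind === 'add') {
      state[message.type] = { added: true, value: state[message.type]?.value ?? '' }
    } else if (message.kind === 'update') {
      state[message.type] = { added: state[message.type]?.added ?? false, value: message.value }
    }
  })
  bus.post({ kind: 'sync' }) // listeners first, then the one-time sync request
  return state
}

function bootSimInput(bus: SimBus, type: string, value: string): void {
  const announce = () => {
    bus.post({ kind: 'add', type })
    bus.post({ kind: 'update', type, value })
  }
  bus.listen((message) => {
    if (message.kind === 'sync') announce() // replay current state on request
  })
  announce() // initial announcement; lost if the controller is not listening yet
}

function simulateSync(inputLoadsFirst: boolean): Record<string, SimField> {
  const bus = new SimBus()
  if (inputLoadsFirst) {
    bootSimInput(bus, 'number', '4111')
    return bootSimController(bus)
  }
  const state = bootSimController(bus)
  bootSimInput(bus, 'number', '4111')
  return state
}
```

Both `simulateSync(true)` and `simulateSync(false)` end with the same controller state, which is the property the sync request is meant to provide.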

Sync Completion Detection:

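// Controller tracks which fields have synced
const syncAddedTypes = new Set<string>()
const syncUpdatedTypes = new Set<string>()

const checkSyncCompletion = () => {
  // Complete once every added field has sent at least one update
  const allUpdated = Array.from(syncAddedTypes).every((type) =>
    syncUpdatedTypes.has(type)
  )
  if (allUpdated) {
    parent.message('sync-complete', {
      bootStartedAt,
      syncCompletedAt
    })
  }
}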

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-controller-init-report.md:

Adopt Option C (controller-issued sync + early listener attach). It fixes the race with minimal code, no UX regression, and no public API changes. It is robust to either load order and aligns with the acceptance criteria's intent.


Option D: SDK Bridge Fallback

How It Works: SDK caches last input event per field. On controller ready, sends a sync payload to controller via postMessage. Controller merges it as update.

Pros:

  • Avoids adding BroadcastChannel message for sync

Cons:

  • ❌ Wider change surface (SDK + controller)
  • ❌ Duplicated logic
  • ❌ Must handle number→securityCode derivation explicitly
  • ❌ Not as clean as Option C

Decision: ❌ Rejected - Less elegant than Option C
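For comparison, the SDK-side cache that Option D would need can be sketched as follows. `InputEventCache` and `InputSnapshot` are hypothetical names, and the snapshot fields are assumed; the sketch covers only the caching step, not the postMessage transport or the number→securityCode derivation listed among the drawbacks.

```typescript
// Hypothetical SDK-side cache for Option D; shapes are assumptions.
interface InputSnapshot {
  type: string
  empty: boolean
  valid: boolean
}

class InputEventCache {
  private latest = new Map<string, InputSnapshot>()

  // Keep only the most recent event per field type
  record(snapshot: InputSnapshot): void {
    this.latest.set(snapshot.type, snapshot)
  }

  // Payload the SDK would send to the controller once it reports ready
  buildSyncPayload(): InputSnapshot[] {
    return Array.from(this.latest.values())
  }
}
```

The payload would be merged by the controller as if each entry were an `update`, which is where the duplicated logic mentioned above comes from.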


3.5 POC Implementations

Bruno created two proof-of-concept branches to test different implementation approaches:

POC Branch: poc-controller-ready (Option A Implementation)

Branch: https://github.com/gr4vy/secure-fields/compare/main...poc-controller-ready

Approach: Promise-based controller readiness

Implementation Strategy:

  • Uses Promise<void> for controller ready state
  • SDK waits for promise to resolve before creating fields
  • No explicit queue - defers all field setup until controller signals ready

Key Code Changes:

class SecureFields {
  private controllerReady: Promise<void>
  private resolveControllerReady: () => void

  constructor(config: Config) {
    // Create promise that resolves when controller ready
    this.controllerReady = new Promise((resolve) => {
      this.resolveControllerReady = resolve
    })

    // ... controller iframe creation
  }

  // On controller 'ready' message:
  case 'ready': {
    EventBus.publish(Events.READY, data)
    this.resolveControllerReady() // ← Resolve promise
    break
  }

  private async _addField(
    element: string | HTMLElement,
    options: Omit<Field, 'element'>
  ) {
    // Wait for controller before creating input
    await this.controllerReady

    // Now safe to create input iframe
    const input = new SecureInput({
      frameUrl: this.frameUrl,
      parentOrigin: this.parentOrigin,
      font: this.font,
      options,
    })

    // Add to DOM
    setupInput(element, input)
  }
}

Pros:

  • Clean async/await pattern
  • No Proxy complexity
  • Explicit dependency on controller
  • Type-safe with TypeScript

Cons:

  • UX impact: Fields not interactive until controller loaded
  • Breaking change: All add*Field methods become async
  • Merchants would need to update integration code
  • Wait time = controller load + input load

Why Not Chosen:

  • Breaking API change unacceptable
  • UX degradation for users with slow connections
  • More invasive than Option C (sync approach)
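The breaking-change concern can be illustrated without the SDK: code written against a synchronous add-field call receives a Promise instead of a field once the method becomes async. `FieldHandle`, `addFieldSync`, `addFieldAsync`, and `legacyIntegration` are hypothetical stand-ins, not the real API.

```typescript
// Stand-ins for illustration; not the real SecureFields classes.
class FieldHandle {
  placeholder = ''
  setPlaceholder(text: string): void {
    this.placeholder = text
  }
}

// Current style: the field handle is returned immediately
function addFieldSync(): FieldHandle {
  return new FieldHandle()
}

// POC style: the caller receives a Promise and must await it
async function addFieldAsync(): Promise<FieldHandle> {
  return new FieldHandle()
}

// Existing merchant code written against the synchronous API
function legacyIntegration(addField: () => unknown): boolean {
  const field = addField() as { setPlaceholder?: (text: string) => void }
  // A Promise has no setPlaceholder, so unchanged merchant code would break here
  if (typeof field.setPlaceholder !== 'function') return false
  field.setPlaceholder('1234 5678')
  return true
}
```

Merchants would need to add `await` (or `.then`) at every call site, which is the integration change the POC was rejected for.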

POC Branch: poc-queue-fields (Option B/Approach 2 Implementation)

Branch: https://github.com/gr4vy/secure-fields/compare/main...poc-queue-fields

Approach: Proxy-based method queueing with explicit ControllerLoader utility

Implementation Strategy:

  • Create ControllerLoader singleton class
  • Queue field creation requests
  • Use Proxy to intercept method calls on placeholder inputs
  • Replay queued calls when controller ready

Key Code Changes:

New file: packages/secure-fields/src/controller-loader.ts

type FieldQueue = {
  element: string | HTMLElement
  input: SecureInput
  inputConfig: {
    frameUrl: string | null
    parentOrigin: string
    font?: string
    options: Field
    paymentMethodScheme?: string
  }
  methodCalls: Array<{ method: string; args: any[] }>
}

class ControllerLoader {
  ready: boolean
  timeoutId?: number | NodeJS.Timeout
  fieldQueue: Array<FieldQueue>

  constructor() {
    this.ready = false
    this.fieldQueue = []
  }

  // Create Proxy that queues method calls
  private _createQueueProxy(
    input: SecureInput,
    queueEntry: FieldQueue
  ): SecureInput {
    return new Proxy(input, {
      get(target, prop) {
        const queueableMethods = [
          'setPlaceholder',
          'update',
          'focus',
          'blur'
        ]

        if (typeof prop === 'string' && queueableMethods.includes(prop)) {
          // Return function that queues the call instead of executing
          return (...args: any[]) => {
            queueEntry.methodCalls.push({ method: prop, args })
          }
        }

        if (prop === 'addEventListener' || prop === 'removeEventListener') {
          return (...args: any[]) => {
            queueEntry.methodCalls.push({ method: prop as string, args })
          }
        }

        return target[prop]
      }
    })
  }

  addFieldToQueue(
    element: string | HTMLElement,
    input: SecureInput,
    inputConfig: FieldQueue['inputConfig']
  ) {
    const queueEntry: FieldQueue = {
      element,
      input,
      inputConfig,
      methodCalls: []
    }

    this.fieldQueue.push(queueEntry)

    // Return proxy that queues calls
    return this._createQueueProxy(input, queueEntry)
  }

  processFieldQueue(
    frameUrl: string,
    callback: ProcessFieldCallback
  ) {
    this.ready = true
    if (this.fieldQueue.length === 0) return

    log('Processing field queue', {
      queueLength: this.fieldQueue.length
    })

    const queueSnapshot = [...this.fieldQueue]
    this.fieldQueue = []

    for (const field of queueSnapshot) {
      try {
        // Create real input now that controller ready
        const readyInput = new SecureInput({
          ...field.inputConfig,
          frameUrl,
        })
        Object.assign(field.input, readyInput)

        callback(field.element, field.input)

        // Replay all queued method calls
        for (const call of field.methodCalls) {
          field.input[call.method].apply(field.input, call.args)
        }

        log('Field created from queue', {
          fieldType: field.input.type,
          replayedCalls: field.methodCalls.length
        })
      } catch (createError) {
        error('Failed to create field from queue', {
          fieldType: field.inputConfig.options.type,
          error: createError.message
        })
      }
    }
  }

  setTimeout() {
    this.timeoutId = setTimeout(() => {
      if (!this.ready && this.fieldQueue.length > 0) {
        error('Controller failed to load', {
          queueLength: this.fieldQueue.length,
          timeoutMs: 2000
        })
      }
    }, 2000)
  }

  cleanup() {
    this.ready = false
    this.fieldQueue = []
    if (this.timeoutId) clearTimeout(this.timeoutId)
    this.timeoutId = undefined
  }
}

export const global = new ControllerLoader()

Updated: packages/secure-fields/src/index.ts

import { global as ControllerLoader } from './controller-loader'

class SecureFields {
  constructor(config: Config) {
    this._cleanup()

    // ... setup

    ControllerLoader.setTimeout()

    // Listen for controller ready
    window.addEventListener('message', (message) => {
      if (message.origin === this.frameUrl) {
        switch (message.data.type) {
          case 'ready': {
            // Process queued fields when controller ready
            ControllerLoader.processFieldQueue(
              this.frameUrl,
              (element, readyInput) => this._addField(element, readyInput)
            )
            // ... rest of ready handler
            break
          }
        }
      }
    })
  }

  addCardNumberField(
    element: string | HTMLElement,
    options?: Omit<Field, 'element' | 'type'>
  ) {
    let inputConfig: InputConfig

    if (!this.cardNumber) {
      inputConfig = {
        frameUrl: '', // ← Empty until controller ready
        parentOrigin: this.parentOrigin,
        font: this.font,
        options: {
          label: 'Card number',
          ...options,
          type: 'number',
        },
      }
      this.cardNumber = new SecureInput(inputConfig)
    }

    // Return proxy that queues calls
    return ControllerLoader.addFieldToQueue(
      element,
      this.cardNumber,
      inputConfig
    )
  }

  // Similar for addSecurityCodeField, addExpiryDateField, addField
}

Flow:

  1. Merchant calls secureFields.addCardNumberField('#number')
  2. SDK creates placeholder SecureInput (no frameUrl yet)
  3. SDK returns Proxy of placeholder
  4. Merchant calls methods on proxy: field.setPlaceholder('1234 5678...')
  5. Proxy queues: { method: 'setPlaceholder', args: ['1234 5678...'] }
  6. Controller loads, sends 'ready' message
  7. SDK processes queue:
    • Creates real SecureInput with frameUrl
    • Adds to DOM
    • Replays all queued calls: field.setPlaceholder('1234 5678...')

Pros:

  • Fields can be added before controller ready
  • No UX delay (fields render immediately)
  • No breaking API changes
  • Queued calls replayed automatically

Cons:

  • High complexity: Proxy pattern + queue management
  • ~180 lines of new utility code
  • Cognitive overhead for debugging
  • Method calls execute out of order (queued then replayed)
  • Timeout mechanism needed (2s in POC)

Why Not Chosen:

  • More complex than Option C (sync approach)
  • Proxy pattern adds cognitive overhead
  • Replay logic fragile (order matters for some calls)
  • A timeout is still needed (the same concern raised about Option C)

Comparison of POC Approaches

| Aspect | poc-controller-ready (A) | poc-queue-fields (B) |
| --- | --- | --- |
| Pattern | Promise-based wait | Proxy-based queue |
| Code Added | ~50 lines | ~180 lines |
| Complexity | Low (async/await) | High (Proxy + replay) |
| API Changes | Breaking (async methods) | None (backward compatible) |
| UX Impact | High (delay) | Low (fields render) |
| Debugging | Easy (linear flow) | Hard (queued + replayed) |
| Risk | Medium (breaking) | High (complexity) |

Why Option C (Controller Sync) Was Chosen Over POCs

Neither POC was chosen because:

  1. poc-controller-ready (A):

    • Breaking API change
    • UX degradation unacceptable
    • Too invasive for the problem
  2. poc-queue-fields (B):

    • Too complex (Proxy + queue + replay)
    • Cognitive overhead too high
    • Similar timeout concerns as Option C
    • Replay logic fragile
  3. Option C (Controller Sync) was better because:

    • ✅ No breaking changes
    • ✅ No UX impact
    • ✅ Simpler than Proxy approach (~80 lines vs ~180 lines)
    • ✅ No queue replay complexity
    • ✅ Pull-based (inputs respond to sync) vs push-based (queue + replay)
    • ✅ More robust to load order variations

Key Insight: The POCs validated that Options A and B were viable but too costly. Option C provided the same benefits with less complexity and risk.


4. Chosen Solution: Controller Sync (PR #976)

Design

Core Mechanism: Pull-based resync where controller asks inputs to replay their state

Key Components:

  1. Controller broadcasts sync on boot (after attaching listeners)
  2. Inputs respond by re-sending add and update
  3. Controller tracks completion via sets of added/updated types
  4. SDK receives sync-complete with minimal timing info
  5. Single 5s timeout if controller never loads
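
To make the mechanism concrete, here is a small in-memory model of components 1 to 3. It is a sketch under stated assumptions, not the PR's code: `LossyBus`, `bootInput`, and `bootController` are hypothetical names, delivery is synchronous (unlike the real BroadcastChannel), and field values are placeholders. It compares a late controller with and without the sync request.

```typescript
// Hypothetical in-memory model; the real code uses BroadcastChannel.
type Message =
  | { type: 'sync' }
  | { type: 'add'; field: string }
  | { type: 'update'; field: string; value: string }

type Listener = (message: Message) => void

// Like BroadcastChannel, delivers only to listeners attached at post time.
class LossyBus {
  private listeners: Listener[] = []
  listen(listener: Listener) { this.listeners.push(listener) }
  post(message: Message) {
    for (const listener of this.listeners.slice()) listener(message)
  }
}

function bootInput(bus: LossyBus, field: string, value: string) {
  const announce = () => {
    bus.post({ type: 'add', field })
    bus.post({ type: 'update', field, value })
  }
  // Replay current state whenever the controller asks for it.
  bus.listen((message) => { if (message.type === 'sync') announce() })
  announce() // initial announcement; lost if no controller is listening yet
}

function bootController(bus: LossyBus, requestSync: boolean) {
  const added = new Set<string>()
  const updated = new Set<string>()
  // Attach listeners first, then request the replay.
  bus.listen((message) => {
    if (message.type === 'add') added.add(message.field)
    if (message.type === 'update') updated.add(message.field)
  })
  if (requestSync) bus.post({ type: 'sync' })
  // Synced once every added field type has also sent an update.
  const isSynced = () =>
    added.size > 0 && Array.from(added).every((f) => updated.has(f))
  return { added, isSynced }
}

// Inputs load first, controller last, no sync: announcements are lost.
const busWithoutSync = new LossyBus()
bootInput(busWithoutSync, 'number', 'masked')
bootInput(busWithoutSync, 'expiryDate', 'masked')
const withoutSync = bootController(busWithoutSync, false)

// Same load order, but the controller requests a replay on boot.
const busWithSync = new LossyBus()
bootInput(busWithSync, 'number', 'masked')
bootInput(busWithSync, 'expiryDate', 'masked')
const withSync = bootController(busWithSync, true)
```

One thing the model makes visible: the controller cannot know how many inputs exist, so "synced" can only mean that every field seen so far has reported its state.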

Implementation Details

Files Changed: 3 files

  • apps/secure-fields/src/controller.ts (Controller sync logic)
  • apps/secure-fields/src/input.ts (Input sync handler)
  • packages/secure-fields/src/index.ts (SDK timeout and diagnostics)

Message Flow:

sequenceDiagram
    autonumber
    participant SDK
    participant Controller
    participant Input1 as Input (number)
    participant Input2 as Input (expiry)

    Note over SDK: User creates SecureFields
    SDK->>Controller: Create iframe
    SDK->>Input1: Create iframe
    SDK->>Input2: Create iframe

    Note over Input1,Input2: Inputs may load first
    Input1->>Input1: Load, attach listeners
    Input2->>Input2: Load, attach listeners

    Note over Controller: Controller loads
    Controller->>Controller: Attach listeners FIRST
    Controller-->>Input1: BroadcastChannel('sync')
    Controller-->>Input2: BroadcastChannel('sync')
    Controller->>SDK: postMessage('ready')

    Note over Input1,Input2: Inputs replay state
    Input1-->>Controller: add({type: 'number'})
    Input1-->>Controller: update({number: {value, valid, empty}})
    Input1-->>Controller: update-field({id: 'securityCode', size, label})

    Input2-->>Controller: add({type: 'expiryDate'})
    Input2-->>Controller: update({expiryDate: {value, valid, empty}})

    Note over Controller: All fields synced
    Controller->>Controller: checkSyncCompletion()
    Controller->>SDK: postMessage('sync-complete', {timings})

    Note over SDK: Normal operation
    Controller->>SDK: postMessage('form-change', {fields, complete})

Code Snippets:

Controller Boot Sequence:

// apps/secure-fields/src/controller.ts
// CRITICAL ORDER:
broadcastChannel.listen()        // 1. Listen first
parent.listen()                  // 2. Parent listener
broadcastChannel.message('sync') // 3. Request sync
parent.message('ready')          // 4. Signal ready

Input Sync Handler:

// apps/secure-fields/src/input.ts
channel.onMessage('sync', () => {
  channel.message('add', { type })
  fireFormUpdate({ target: input })

  if (input.id === 'number') {
    channel.message('update-field', {
      id: 'securityCode',
      size,
      codeLabel
    })
  }
})

SDK Timeout and Diagnostics:

// packages/secure-fields/src/index.ts

// 5-second hard timeout
const timeoutId = setTimeout(() => {
  if (!controllerReady) {
    error('Controller failed to load within timeout', {
      timeoutMs: 5000
    })
  }
}, 5000)

// Clear on ready
case 'ready':
  clearTimeout(timeoutId)
  processFieldQueue()
  break

// Diagnostic logging (debug only)
case 'sync-complete':
  if (_controllerReadyDelayed) {
    log('Controller sync completed after delay', data)
  }
  break

Why This Approach

  1. No Public API Changes: Internal messaging only, merchants don't need updates
  2. Backward Compatible: Existing integrations continue working unchanged
  3. Minimal Code Changes: Only 3 files touched, ~100 lines added
  4. No Polling: A single one-shot sync broadcast; the only timer is the SDK's 5s controller-load timeout
  5. Inputs Interactive Immediately: No UX delay or placeholder states
  6. Robust: Works regardless of load order
  7. PCI Safe: No storage of sensitive values, ephemeral messaging only

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-implementation-summary.md:

We adopted Option C from the design report: a controller-driven replay request (pull-based resync) that works regardless of load order. No public API changes. No timers/intervals for polling. Inputs remain interactive immediately.

Testing

E2E Test: Delayed controller scenario

// packages/example-cdn/index.e2e.test.ts
// Delay controller by ~2s, then let the request through
await page.route('**/controller.html*', async route => {
  await new Promise(resolve => setTimeout(resolve, 2000))
  await route.continue()
})

// Should still submit complete payload
expect(submitPayload).toHaveProperty('payment_method.number')
expect(submitPayload).toHaveProperty('payment_method.expiration_date')
expect(submitPayload).toHaveProperty('payment_method.security_code')

Unit Tests:

  • Controller resync completion detection
  • Input sync handler replays state correctly
  • Number input re-emits CVV update-field
  • SDK timeout triggers after 5s if no ready
  • sync-complete logged only when delayed

Test Plan from /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-test-plan.md:

  1. Delayed controller still submits values ✅
  2. Controller never ready → single timeout error ✅
  3. Sync completes only after updates for all added fields ✅
  4. Number input on sync re-emits CVV settings ✅

5. Review Feedback & Discussion

PR #976 Reviews

Gary Evans (September 16, 2025):

Q: What if the controller fails to load?
A: The SDK has a 5-second timeout. If the controller hasn't posted ready after 5s, the SDK logs a single error. Inputs remain rendered, but submit won't work because the controller owns the submit flow.

Q: Can users submit before sync completes?
A: If the controller is not ready, the submit call won't succeed. If the controller is ready but still syncing, it processes whatever state it has. Best practice: disable submit until the ready event has fired AND FORM_CHANGE.complete === true.

Q: Does custom validation work during a slow controller load?
A: Yes for input-level validation, which runs independently in each input iframe. Controller-aggregated state (the FORM_CHANGE event) is delayed. Recommendation: tie field UI/validation to input events for responsiveness; use FORM_CHANGE.complete only to enable submit.
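
The submit best practice from the second answer can be sketched as a small gate on the merchant side. This is illustrative only: `createSubmitGate`, `onReady`, and `onFormChange` are hypothetical names, and wiring them to the SDK's actual ready and FORM_CHANGE events is left to the integration.

```typescript
// Hypothetical merchant-side helper; not part of the Secure Fields SDK.
interface FormChange {
  complete: boolean
}

function createSubmitGate(setEnabled: (enabled: boolean) => void) {
  let ready = false
  let complete = false

  // Submit is enabled only when both conditions hold.
  const apply = () => setEnabled(ready && complete)
  apply() // start disabled

  return {
    onReady() {
      ready = true
      apply()
    },
    onFormChange(event: FormChange) {
      complete = event.complete
      apply()
    },
  }
}

// Usage: record each enabled/disabled decision instead of touching the DOM.
const buttonStates: boolean[] = []
const gate = createSubmitGate((enabled) => { buttonStates.push(enabled) })
gate.onFormChange({ complete: true })  // form complete, controller not ready
gate.onReady()                         // both conditions now hold
gate.onFormChange({ complete: false }) // user cleared a field
```

In a real page `setEnabled` would toggle the submit button's `disabled` attribute.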

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-faq.md:

If the controller is not ready yet: a submit call won't succeed — the controller won't process it and you won't receive a success event. No invalid data is sent; it simply doesn't complete.


Giordano Arman (September 18, 2025):

Feedback: Recommended waiting for Luca's review before proceeding. Suggested getting senior architect input on long-term architectural implications.

Action Taken: Senior architecture review requested and completed (see next section).


Peer Reviews

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-peer-review.md:

Design Decisions Explained:

  1. Controller-driven replay (pull-based)

    • Why: Ensures controller receives complete state from source of truth (inputs)
    • Avoided: SDK-side caching (duplicates logic), queuing events, timers/intervals
  2. Single sync broadcast

    • Why: Minimal overhead, no polling
    • Risk mitigation: Inputs attach listeners during boot; controller sends sync after its listeners attach
  3. Minimal diagnostics

    • sync-complete with {bootStartedAt, syncCompletedAt} only
    • Conditional logging: only when field added before ready
    • Avoided: Soft delay timers, verbose telemetry

What Was Intentionally Avoided:

  • Polling loops or intervals
  • SDK-side state caching
  • Queuing field methods
  • Adding public API changes
  • Complex timing metrics

From peer review:

Net effect: minimal changes with maximal reliability and no UX regression. We fixed the race at its source (missed messages) with a small, explicit replay and preserved the public API.


Senior Architect Review

From /Users/bruno/www/gr4vy/secure-fields/docs/TA-13099-senior-architecture-review.md:

Observations:

  1. Sync model is intentionally minimal and pull-based

    • Fixes root race (missed BroadcastChannel messages)
    • No queuing, polling, or SDK logic duplication
  2. Message hygiene good in apps; weaker in SDK init ⚠️

    • Apps use MessageHandler.PostMessage with origin+channel checks
    • SDK constructor checks origin only; submit path checks origin+channel+type
    • Suggestion: Add channel check to SDK constructor for consistency
  3. Completeness semantics sound

    • complete requires number, expiryDate, plus conditionally securityCode/postalCode
  4. Scoped replay and minimal state

    • Two small sets of field types during resync
    • Clears on completion
    • Minimal diagnostic payload

Questions for Long-Term Resilience:

  1. Is one-shot sync enough across all browsers?

    • With storage-based BroadcastChannel polyfill, small ordering window exists
    • Suggestion: Consider re-issuing sync once on first add (microtask-deferred)
  2. Multiple instances on same page?

    • Two SDK instances would share channel
    • Suggestion: Per-instance channel suffix (nonce)
  3. SPA lifecycle - how to tear down?

    • No destroy() method
    • Suggestion: Add explicit cleanup for SPAs
  4. EventBus/ListenersManager: global scope concerns?

    • EventBus.unsubscribeAll() clears global subscribers
    • ListenersManager.remove() uses toString() matching (brittle)
    • Suggestion: ID-centric listener management
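
The ID-centric listener management suggested in question 4 might look roughly like the following. This is a sketch only: `ListenerRegistry` is a hypothetical class, not the existing `ListenersManager`, and it shows why matching on `toString()` cannot tell apart two handlers created from the same source text.

```typescript
// Hypothetical registry; the existing ListenersManager API may differ.
type Handler = (event: unknown) => void

class ListenerRegistry {
  private nextId = 0
  private handlers = new Map<number, Handler>()

  // Returns an opaque ID the caller keeps for later removal.
  add(handler: Handler): number {
    const id = this.nextId++
    this.handlers.set(id, handler)
    return id
  }

  // Removes exactly one registration; returns false if the ID is unknown.
  remove(id: number): boolean {
    return this.handlers.delete(id)
  }

  emit(event: unknown): void {
    for (const handler of Array.from(this.handlers.values())) handler(event)
  }

  get size(): number {
    return this.handlers.size
  }
}

// Two handlers with identical source text but different captured state.
const makeHandler = (log: unknown[]): Handler => (event) => { log.push(event) }
const logA: unknown[] = []
const logB: unknown[] = []
const handlerA = makeHandler(logA)
const handlerB = makeHandler(logB)
const sameSource = handlerA.toString() === handlerB.toString()

const registry = new ListenerRegistry()
const idA = registry.add(handlerA)
registry.add(handlerB)
registry.remove(idA) // removes only handlerA
registry.emit('form-change')
```

Because removal is keyed on the returned ID, a per-instance teardown can drop its own listeners without touching those of another SDK instance on the same page.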

Hardening Suggestions:

1. SDK Constructor Channel Check:

const listener = (message: MessageEvent) => {
  const isKnownOrigin = message.origin === this.frameUrl
  const isKnownChannel = message?.data?.channel === MESSAGE_CHANNEL
  if (!(isKnownOrigin && isKnownChannel)) return
  // ... handle message
}

2. Microtask-Deferred Re-Sync (for edge cases):

let resyncReissued = false
export const handleAdd = (data: { type: string }) => {
  fields[data.type]._added = true
  if (syncInProgress && !resyncReissued) {
    resyncReissued = true
    queueMicrotask(() => broadcastChannel.message('sync'))
  }
}

3. SPA-Safe Teardown:

class SecureFields {
  destroy() {
    ListenersManager.removeAll()
    EventBus.unsubscribeAll()
    // Remove frames
    this.controller?.remove()
  }
}

Rationale and Trade-offs:

Keeping replay source of truth in inputs avoids duplicating validation/formatting in the SDK, minimizing risk and drift. A microtask-delayed "second sync" mitigates the only remaining known race without introducing timers/intervals.

Closing Statement:

The chosen fix for TA-13099 is the right long-term call: one-shot, controller-driven sync from the inputs (the real state owners). The suggestions above harden boundary conditions without increasing complexity for integrators or expanding the public API.


SENIOR_ARCHITECT_REVIEW.md (Critical Analysis)

From /Users/bruno/www/gr4vy/secure-fields/docs/SENIOR_ARCHITECT_REVIEW.md:

This was a highly critical review identifying deeper architectural concerns:

🔴 Critical Architectural Issues:

  1. Real Problem: Temporal Dependencies in Distributed System

    • Race condition is symptom of fundamentally flawed architecture
    • 3+ independent processes (SDK + Controller + N inputs) with implicit ordering
    • No guaranteed message ordering
    • No error recovery if controller crashes
    • Silent failures when BroadcastChannel unsupported
  2. The Proposed Queue is a Band-Aid

    • Doesn't address lack of message ordering guarantees
    • No mechanism for detecting/handling stalled initialization
    • Creates new failure modes
  3. Violation of Core Design Principles

    • SDK now handles async state management (doesn't belong there)
    • Fails silently later instead of fail-fast
    • Single Responsibility Principle violated

From senior architect review:

The race condition isn't just a timing issue—it's a symptom of a fundamentally flawed architecture where we have 3+ independent processes with implicit ordering dependencies. The proposed queue is a band-aid that doesn't address: No guaranteed message ordering between iframes, No error recovery if controller crashes, No way to detect/handle stalled initialization.

Alternative: Event-Driven Coordination with Promises:

class SecureFields {
  private controllerInitialized: Promise<void>

  constructor(config: Config) {
    this.controllerInitialized = this.initController(config)
  }

  async addCardNumberField(element, options): Promise<SecureInput> {
    await this.controllerInitialized  // Wait for controller
    return this._createCardNumberField(element, options)
  }
}

Benefits Over Current Solution:

  • Explicit dependencies (controller must load first)
  • Fail-fast (immediate error if controller fails)
  • Promise-based (natural async/await patterns)
  • Memory safe (no persistent queues)
  • Testable (easy to mock controller init)
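
The fail-fast and testability claims depend on the initialization promise rejecting when the controller never reports ready. One possible shape is sketched below. `waitForReady` and its `subscribe` parameter are hypothetical, standing in for however the SDK listens for the controller's ready message; the review itself does not specify this.

```typescript
// Hypothetical helper; `subscribe` stands in for however the SDK
// listens for the controller's 'ready' postMessage.
type Unsubscribe = () => void

function waitForReady(
  subscribe: (onReady: () => void) => Unsubscribe,
  timeoutMs: number
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    let settled = false
    let unsubscribe: Unsubscribe = () => {}

    // Settle exactly once, always releasing the timer and the listener.
    const finish = (action: () => void) => {
      if (settled) return
      settled = true
      clearTimeout(timer)
      unsubscribe()
      action()
    }

    const timer = setTimeout(
      () => finish(() => reject(new Error(`Controller not ready after ${timeoutMs}ms`))),
      timeoutMs
    )

    unsubscribe = subscribe(() => finish(() => resolve()))
    // If 'ready' fired synchronously inside subscribe, release the listener now.
    if (settled) unsubscribe()
  })
}
```

A constructor could assign `this.controllerInitialized = waitForReady(...)`, so that every `await this.controllerInitialized` surfaces the same error if the controller never loads. Tests can pass a fake `subscribe` instead of loading an iframe.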

Production Readiness Red Flags:

  1. Creates new race condition (queue processing vs controller ready)
  2. No fallback if BroadcastChannel fails
  3. Silent degradation (placeholder objects mask failures)
  4. No monitoring hooks for queue overflow

Final Verdict: 🔴 REQUEST CHANGES

From senior architect:

As the senior architect who has seen this codebase evolve over 5 years, I'm concerned about accumulating technical debt. Each "quick fix" makes the next problem harder to solve. The queueing approach is acceptable as an interim fix, but we should be planning the move to Promise-based initialization within the next 2 quarters.


6. Team Decision & Outcome

September 29-30, 2025: Decision to Explore Alternatives

What Happened: After extensive review discussions, the team decided to explore a different architectural approach rather than proceeding with PR #976.

Actions Taken:

  • PR #976 moved to "Blocked" status
  • TA-13099 closed (investigation complete)
  • TA-13399 created for new investigation

Rationale:

  1. Solution Seen as "Band-Aid"

    • Fixes immediate symptom but doesn't address underlying architectural issues
    • Creates new edge cases and failure modes
    • Adds complexity to already complex iframe communication
  2. Desire for Architectural Simplification

    • Current solution adds more events between iframes
    • sync-complete mechanism adds new coordination layer
    • Long-term maintainability concerns
  3. Senior Architect Concerns

    • Temporal dependencies in distributed system
    • Silent failure modes
    • Memory management concerns
    • Violation of design principles
  4. Alternative Approaches Worth Exploring

    • Promise-based API (async/await patterns)
    • Service worker coordination
    • Explicit controller-first loading
    • Unified message handler architecture

From team discussions:

While PR #976 successfully solves the immediate race condition, the team consensus is to explore architectural patterns that eliminate the race condition by design rather than working around it. The investigation produced valuable insights that will inform the next approach.


7. Technical Artifacts

Architecture Documentation

Current System Architecture:

Components:

  1. SDK (Host Page) - packages/secure-fields/src/

    • Public API for integrators
    • Creates controller and field iframes
    • Relays events via EventBus
  2. Controller (Hidden Iframe) - apps/secure-fields/src/controller.ts

    • Central state for all fields
    • Validates form completeness
    • Updates Checkout Session on submit
  3. Input Fields (Iframes) - apps/secure-fields/src/input.ts

    • One iframe per field (number, expiry, CVV, postal code)
    • Handle formatting, validation, autofill
    • Emit field-level events

Communication Channels:

  • postMessage: SDK ↔ Controller, SDK ↔ Inputs
  • BroadcastChannel (secure-fields): Controller ↔ Inputs
  • BroadcastChannel (secure-fields-card): Controller ↔ Click to Pay Encrypt
  • MessageChannel: Click to Pay Controller ↔ Encrypt (port transfer)

Message Types:

  • Controller ↔ SDK: ready, form-change, submit, success, error
  • Controller ↔ Inputs: add, update, reset, sync (new)
  • Input ↔ SDK: focus, blur, input, update

From /Users/bruno/www/gr4vy/secure-fields/docs/SECURE_FIELDS_ARCHITECTURE_REPORT.md:

sequenceDiagram
    participant Host as Host Page (SDK)
    participant Ctrl as Controller iframe
    participant Num as Input iframe (number)
    participant API as Checkout Sessions API

    Host->>Ctrl: Create iframe controller.html
    Ctrl-->>Host: postMessage ready

    Host->>Num: Create iframe input.html?type=number
    Num-->>Host: onload
    Host->>Num: postMessage update{styles, label}
    Num-->>Ctrl: BroadcastChannel add{type: 'number'}

    Num->>Ctrl: BroadcastChannel update{number: {value, valid, empty}}
    Ctrl-->>Host: postMessage form-change{fields, complete}

    Host->>Ctrl: postMessage submit{method: 'card'}
    Ctrl->>API: PUT /checkout/sessions/:id/fields
    API-->>Ctrl: 200
    Ctrl-->>Host: postMessage success{scheme}

State Management

Controller Field State:

export let fields: CardFields & OtherFields = {
  number: {
    value: '',         // Sanitized (not PAN)
    valid: true,
    empty: true,
    autofilled: false,
    _added: false      // Set by handleAdd
  },
  expiryDate: { ... },
  securityCode: { ... },
  postalCode: { ... }
}

Completeness Logic:

export const isComplete = () => {
  const requiredFields = [fields.number, fields.expiryDate]
  if (fields.securityCode._added) requiredFields.push(fields.securityCode)
  if (fields.postalCode._added) requiredFields.push(fields.postalCode)

  return requiredFields.every(f => f.valid && !f.empty)
}

8. Documentation Created

As part of this investigation, 34+ technical documents were created:

Core Documentation:

  • TA-13099-controller-init-report.md - Main design report
  • TA-13099-implementation-summary.md - Implementation details
  • TA-13099-test-plan.md - Testing strategy
  • TA-13099-faq.md - Frequently asked questions
  • TA-13099-peer-review.md - Peer review feedback
  • TA-13099-senior-architecture-review.md - Architecture review
  • SENIOR_ARCHITECT_REVIEW.md - Critical architectural analysis

Architecture Documentation:

  • SECURE_FIELDS_ARCHITECTURE_REPORT.md - System architecture
  • handleAdd-undefined-type-bug-analysis.md - Bug analysis
  • secure-fields-race-condition-post-mortem.md - Incident post-mortem

Alternative Approaches:

  • TA-13099-controller-first-approach-analysis.md
  • TA-13099-unified-message-handler-approach.md
  • TA-13099-service-worker-approach.md

And 20+ additional documents covering:

  • Message flow diagrams
  • Code change analysis
  • Security considerations
  • PCI compliance review
  • Performance impact analysis
  • Backward compatibility testing
  • Edge case scenarios
  • Browser compatibility matrix
  • Deployment strategy
  • Rollback plan

9. Key Learnings

  1. Race Conditions in Distributed Systems Are Hard

    • BroadcastChannel is lossy for late subscribers
    • Iframe load order is unpredictable
    • Temporal dependencies create fragility
  2. Quick Fixes Have Long-Term Costs

    • Each workaround adds complexity
    • Technical debt compounds
    • Makes next problem harder to solve
  3. Architecture Matters More Than Implementation

    • Band-aids mask deeper issues
    • Silent failures are worse than loud failures
    • Explicit dependencies better than implicit
  4. Thorough Investigation Has Value

    • 34+ documents created
    • Multiple approaches evaluated
    • Team alignment on problem space
    • Foundation for future work
  5. Sometimes the Right Answer is "Not Yet"

    • PR #976 technically worked
    • But wasn't the right long-term solution
    • Better to explore alternatives than rush

10. Next Steps

TA-13399: New investigation ticket created to explore:

  • Promise-based API patterns
  • Service worker coordination
  • Explicit controller-first loading
  • Unified message handler architecture

Goals:

  • Eliminate race condition by design (not workaround)
  • Simplify architecture (fewer moving parts)
  • Explicit dependencies (no implicit ordering)
  • Fail-fast patterns (not silent degradation)

Timeline: To be determined


Appendix A: Links

  • Original Ticket: TA-13099
  • PR: #976 (Blocked)
  • New Ticket: TA-13399
  • Production Incident: August 27, 2025 (#950)
  • Revert PR: (deployed August 27, 2025 22:00 UTC)

Appendix B: Related Incidents

  • August 27, 2025: Production incident affecting ~5k users
  • Historical: Multiple iframe timing issues over 5 years
  • Pattern: Race conditions between controller and inputs

End of Report

This document represents the complete investigation of TA-13099 from September 2025, including all approaches evaluated, reviews received, and the decision to explore alternative architectural solutions.
