brunodesde1987/TEAM-INVESTIGATION-OCT-NOV-2025.md

## TEAM-INVESTIGATION-OCT-NOV-2025.md

      
Secure Fields Race Condition: Team's Alternative Investigation (October-November 2025)

Executive Summary

Following the September 2025 incident and discussion of PR #976 (Bruno's late sync solution), the team initiated an independent investigation to explore alternative approaches to the iframe loading race condition. This investigation, tracked as TA-13399, evaluated four distinct approaches ranging from SDK queuing to architectural redesign.
Key Finding: The team identified Approach 4 (Stateless Controller) as the optimal long-term solution due to its architectural simplification benefits. However, this approach is blocked by TA-5920, a years-old infrastructure issue regarding mixed version deployment risk between /apps and /packages folders.
Interim Decision: Until TA-5920 is resolved, the team selected Approach 1 (SDK waits for controller) as the temporary solution, despite acknowledging it degrades UX by forcing users to wait for both controller and input load times before fields become interactive.
Current Status: The investigation moved to "Review" status on November 10, 2025. PR #1011 demonstrates the stateless controller POC but remains unmerged because of the TA-5920 blocker.

Timeline


Sep 29, 2025 07:52 UTC: TA-13399 created by Luca Allievi
Sep 29, 2025 07:57 UTC: Moved to "Refinement Ready" status
Sep 29, 2025 10:15 UTC: Linked as blocker for TA-13380 (CVV-only mode bug)
Sep 29, 2025 12:33 UTC: Moved to "Dev Ready" status
Oct 7, 2025: Sprint assignment (T-Wolf O)
Oct 8, 2025: Sprint reassignment to T-Wolf N
Oct 22, 2025: Sprint expansion to T-Wolf O
Oct 27, 2025 10:26 UTC: Assigned to Giordano Arman, moved to "In Progress"
Nov 5, 2025: Summary updated by Cristiano Betta, sprint expansion to T-Wolf P
Nov 6, 2025 11:17 UTC: PR #1011 opened (stateless controller POC)
Nov 10, 2025 09:32 UTC: Moved to "Review" status

1. Context & Motivation

Why Seek Alternatives to PR #976

The investigation ticket (TA-13399) was opened after team discussions about PR #976, Bruno's late sync solution. The ticket description states:

"A PR was opened with an attempt to fix the issue by implementing a sort of data reconciliation logic but, after some discussion, we weren't completely on board with it. We further discussed potential solutions and we decided to open a new investigation to go deeper."

Team's concerns about PR #976:

Perceived as "data reconciliation logic" rather than addressing root cause
Desire to explore more architecturally pure solutions
Opportunity to "go deeper" into the underlying problem
Preference for solutions that simplify rather than add complexity

Maintained requirements:

Clean, logical and easily understandable (and testable)
Backward-compatible

Investigation Ticket: TA-13399

Created by: Luca Allievi
Created: September 29, 2025, 07:52 UTC
Issue Type: Investigation
Priority: Medium
Labels: Frontend
Objective: "Investigate controller / inputs loading logic for secure fields"
Requirements:

Clean, logical and easily understandable (and testable)
Backward-compatible

Expected Outcome:

Confluence doc with proposals, pros and cons
POCs for each approach

Blocks: TA-13380 (CVV-only mode complete flag bug)
Assignment History:

Initially unassigned through refinement and dev ready phases
Oct 27, 2025: Assigned to Giordano Arman
Oct 27, 2025: Moved from "Dev Ready" to "In Progress"
Nov 10, 2025: Moved to "Review" status

Resources Provided:

Braintree Hosted Fields - Industry reference implementation
Checkout.com Frames - Alternative industry approach

2. Four Approaches Evaluated

The team evaluated four main approaches to solve the race condition, documented in the Confluence page:

Approach 1: SDK Waits for Controller Before Adding Inputs

Description:

SDK queues field creation until controller signals ready
SecureFields.add*Field methods wait for controller load to complete
Prevents inputs from attempting to load before controller is ready
Serializes the loading process: controller first, then inputs
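
The queuing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `QueuedFieldLoader`, `addField`, `onControllerReady` and the `FieldKind` values are hypothetical names, and a real implementation would mount iframes rather than push to an array.

```typescript
// Hypothetical sketch of Approach 1: field creation is queued until the
// controller signals that it has loaded. None of these names come from the SDK.
type FieldKind = "number" | "expiryDate" | "securityCode";

class QueuedFieldLoader {
  private controllerReady = false;
  private pending: Array<() => void> = [];
  // Stand-in for "iframes that have been mounted".
  readonly mounted: FieldKind[] = [];

  // Stand-in for the add*Field methods: defer mounting until the controller is ready.
  addField(kind: FieldKind): void {
    this.runWhenReady(() => {
      this.mounted.push(kind);
    });
  }

  // Called once when the controller iframe reports that it has loaded.
  onControllerReady(): void {
    this.controllerReady = true;
    // Flush in FIFO order so fields mount in the order they were requested.
    const queued = this.pending;
    this.pending = [];
    for (const task of queued) task();
  }

  private runWhenReady(task: () => void): void {
    if (this.controllerReady) task();
    else this.pending.push(task);
  }
}

const loader = new QueuedFieldLoader();
loader.addField("number");
loader.addField("securityCode");
// Nothing is mounted yet: the controller has not reported ready.
loader.onControllerReady();
// Both queued fields are now mounted, in request order.
loader.addField("expiryDate"); // controller already ready, mounts immediately
```

The same `runWhenReady` gate would also have to wrap calls such as setPlaceholder, which is the queuing overhead listed under the cons.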

POC Reference: Described in PR #976 comment, point A
Pros:

Builds on top of existing architecture
Guarantees controller loads first (eliminates race condition)
No complex state synchronization needed
Straightforward implementation
Low technical risk

Cons:

Does not allow user input until controller loaded
UX degradation: fields not interactive immediately
Introduces queuing for calls like setPlaceholder
Wait time = controller load + input load (vs input load alone)
Particularly impacts users with slow connections

From Confluence - Comparative evaluation:

"builds on top of existing architecture"
"does not allow user input until the controller is loaded; introduces queuing for calls like setPlaceholder"

Decision: Selected as interim solution until TA-5920 resolved
Team's rationale for interim selection:
From Confluence page:

"Until we fix the mixed versions risk, we should opt for Approach 1 (SDK waits for controller before adding inputs), although it introduces complexity and it degrades UX, because the wait time for the user before being able to write into the inputs would be controller load time + input load time instead of input load time alone."


Approach 2: Controller Readiness + Input Buffering

Description:

Wait for controller to load and send ready event on BroadcastChannel
Inputs probe/poll until they see ready signal, then register
Methods like setPlaceholder queued while waiting using Proxy abstraction
Uses JavaScript Proxy to intercept calls before ready state
More sophisticated than Approach 1's simple queue
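
To make the Proxy idea concrete, here is a minimal sketch of buffering method calls until a ready signal arrives. It is not taken from the poc-queue-fields branch: `bufferUntilReady`, `markReady` and the `FieldApi` shape are assumptions for illustration, and the sketch ignores return values of buffered methods.

```typescript
// Hypothetical sketch of Approach 2: a Proxy records method calls made before
// the controller is ready and replays them afterwards. Names are illustrative.
interface FieldApi {
  setPlaceholder(text: string): void;
  focus(): void;
}

interface Buffered<T> {
  proxy: T;
  markReady(): void;
}

function bufferUntilReady<T extends object>(target: T): Buffered<T> {
  let ready = false;
  const queue: Array<() => void> = [];

  const proxy = new Proxy(target, {
    get(obj, prop, receiver) {
      const value = Reflect.get(obj, prop, receiver);
      if (typeof value !== "function") return value;
      // Wrap every method: run now if ready, otherwise queue the call.
      return (...args: unknown[]) => {
        const call = () => {
          value.apply(obj, args);
        };
        if (ready) call();
        else queue.push(call);
      };
    },
  });

  return {
    proxy,
    markReady() {
      ready = true;
      // Replay buffered calls in the order they were made.
      for (const call of queue.splice(0)) call();
    },
  };
}

const calls: string[] = [];
const realField: FieldApi = {
  setPlaceholder: (text) => {
    calls.push(`placeholder:${text}`);
  },
  focus: () => {
    calls.push("focus");
  },
};

const field = bufferUntilReady(realField);
field.proxy.setPlaceholder("Card number"); // queued, nothing recorded yet
field.markReady(); // replays the queued call
field.proxy.focus(); // runs immediately
```

The indirection is what the Confluence page refers to as "abstraction over the input with Proxy": callers see the normal field API, while the timing logic lives in the wrapper.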

POC Branch: poc-queue-fields
POC Code: GitHub comparison
Pros:

Builds on top of existing architecture
Fields can render immediately (better than Approach 1)
More sophisticated queuing than Approach 1
Inputs remain responsive in DOM

Cons:

Introduces complexity with Proxy abstraction
Queuing calls adds cognitive overhead
Timeout mechanism on controller load
More complex than Approach 1
Proxy pattern may be unfamiliar to maintainers

From Confluence - Comparative evaluation:

"builds on top of existing architecture"
"introduces complexity by queuing calls (abstraction over the input with Proxy); timeout on the controller load (although this is optional, not tied to the architecture of the solution)"

Note from Confluence: The timeout mechanism is described as "optional, not tied to the architecture of the solution," suggesting it is an implementation detail rather than a fundamental requirement.
Decision: Explored but not selected

Approach 3: Late Sync (Bruno's PR #976)

Description:

Controller and inputs load independently (no waiting)
When controller finishes loading after inputs, sends sync request
Inputs respond with their current state
Controller updates internal state with latest field values
Introduces new event protocol: sync request → sync response → sync complete
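
The handshake described above might look roughly like the sketch below. It is not the code from PR #976: the message names and payload fields are assumptions, and a small synchronous in-memory `Bus` stands in for BroadcastChannel so the flow can be followed step by step.

```typescript
// Hypothetical sketch of the late-sync handshake (Approach 3). Message names
// and shapes are assumptions; an in-memory bus stands in for BroadcastChannel.
type Message =
  | { type: "sync-request" }
  | { type: "sync-response"; fieldId: string; empty: boolean; valid: boolean }
  | { type: "sync-complete" };

class Bus {
  private listeners: Array<(msg: Message) => void> = [];
  subscribe(listener: (msg: Message) => void): void {
    this.listeners.push(listener);
  }
  publish(msg: Message): void {
    // Copy so listeners added during dispatch do not affect this round.
    for (const listener of this.listeners.slice()) listener(msg);
  }
}

// An input that loaded (and may have been typed into) before the controller.
function createInput(bus: Bus, fieldId: string, empty: boolean, valid: boolean): void {
  bus.subscribe((msg) => {
    if (msg.type === "sync-request") {
      bus.publish({ type: "sync-response", fieldId, empty, valid });
    }
  });
}

class LateController {
  readonly state = new Map<string, { empty: boolean; valid: boolean }>();
  synced = false;
  private readonly bus: Bus;
  private readonly expectedFields: string[];

  constructor(bus: Bus, expectedFields: string[]) {
    this.bus = bus;
    this.expectedFields = expectedFields;
    bus.subscribe((msg) => {
      if (msg.type !== "sync-response") return;
      this.state.set(msg.fieldId, { empty: msg.empty, valid: msg.valid });
      const allIn = this.expectedFields.every((id) => this.state.has(id));
      if (allIn && !this.synced) {
        this.synced = true;
        // Signals that a stable ready event can now be emitted.
        bus.publish({ type: "sync-complete" });
      }
    });
  }

  // Called when the controller finishes loading after the inputs.
  start(): void {
    this.bus.publish({ type: "sync-request" });
  }
}

const bus = new Bus();
createInput(bus, "number", false, true);
createInput(bus, "securityCode", true, false);
const controller = new LateController(bus, ["number", "securityCode"]);
controller.start();
```

After `start()` the controller holds the inputs' current state regardless of which side loaded first. The extra request, response and complete messages are the "more events back and forth" cost noted in the cons.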

POC: PR #976
Original Ticket: TA-13099 (closed September 30, 2025)
Pros:

Builds on top of existing architecture
No UX delay (fields interactive immediately)
Fields work correctly regardless of load order
Proven in testing
Minimal code changes
Protocol enhancement rather than structural change

Cons:

Adds more events back and forth between iframes
Requires 'sync-complete' mechanism for stable 'ready' event
Timeout on controller load (although this is optional, not tied to the architecture of the solution)
Team concern: seen as adding complexity
Perceived as "band-aid" solution

From Confluence - Comparative evaluation:

"builds on top of existing architecture"
"adds more events back and forth between the iframes; 'sync-complete' to have stable 'ready' event; timeout on the controller load (although this is optional, not tied to the architecture of the solution)"

Team's Perspective:
From Confluence "Proposal" section:

"Approaches 2 and 3 introduce significant amount of complexity to the codebase."

Decision: Explored but team preferred Approach 4
Key Difference from Team's Approach:

Bruno's approach: Pragmatic fix with minimal changes
Team's approach: Architectural redesign for long-term benefit


Approach 4: Stateless Controller ⭐ TEAM'S RECOMMENDED APPROACH

Original Name: "Controller removal" (later renamed to "Stateless controller")
Description:

Controller no longer maintains state as source of truth
Inputs become the source of truth for validation and state
Controller becomes orchestrator that gathers input data on-demand
Simplifies architecture by removing state management from controller
State still "passes through" controller but doesn't "live" there
Fundamental shift in responsibility: state ownership moves to inputs

POC: PR #1011
Ticket: TA-13399
Pros:

Simplifies the architecture significantly
Removes need to maintain state in controller
Relies on existing input state management as source of truth
Eliminates state synchronization issues entirely
Removes considerable amount of code
Desirable architectural improvement regardless of race condition
Long-term maintainability benefits

Cons:

Lower level updates needed across codebase
Needs more comprehensive testing
Structural change (not just protocol enhancement)
Higher implementation risk
Touches both /apps and /packages (deployment complexity)
Requires careful rollout strategy

From Confluence - Comparative evaluation:

"simplifies the architecture"
"lower level updates, so it needs more testing"

From Confluence - Full rationale:

"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal. It adds code but it removes a considerable amount of code too."

Decision: Best approach comparatively, but BLOCKED by TA-5920
Why it's better:
From Confluence:

"Approach 4 is better comparatively"

The team concluded this approach:

Simplifies architecture (removes state management burden)
Desirable regardless of race condition (pays down technical debt)
Removes considerable code (net benefit despite adding some code)
Relies on proven input state management


3. Comparative Evaluation

Summary Table


| Approach | Description | Pros | Cons | Complexity | UX Impact | Status |
|---|---|---|---|---|---|---|
| 1: SDK waits | Queue fields until controller ready | Builds on existing, guarantees order | Blocks user input, queuing overhead | Medium | High (negative) | Interim solution |
| 2: Input buffering | Inputs probe for ready, use Proxy | Builds on existing, fields render immediately | Proxy complexity, timeout mechanism | High | Medium | Evaluated |
| 3: Late sync (Bruno) | Sync after independent load | Builds on existing, no UX impact | More events, sync-complete mechanism | Medium | None | Evaluated |
| 4: Stateless controller | Remove state from controller | Simplifies architecture significantly | Lower level changes, more testing | High | None | Recommended (blocked) |


Team's Comparative Analysis

From Confluence "Proposal" section:
Approach 1 impact:

"Approach 1 would impact the UX of users with low speed internet connection (eg. african donors of Wikimedia) or simply users who use the widget while the iframe servers are having a bad day."

Approaches 2 and 3 complexity:

"Approaches 2 and 3 introduce significant amount of complexity to the codebase."

Approach 4 benefits:

"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal."

Issues Discovered (Not Directly Related to Race Condition)

The Confluence page documents two additional issues discovered during investigation:
1. READY Event Documentation Inaccuracy:
From Confluence "Notes" section:

"the docs say that the READY event is dispatched to the merchant when widget is ready to be used, but in fact it's dispatched when the controller is loaded (inputs could be still loading)"

Impact:

Misleading for merchants trying to gate submit button
Merchants may enable submit before all inputs are ready
Documentation doesn't match implementation
Potential for user confusion and bugs

2. No Submit Guardrails:
From Confluence "Notes" section:

"none of the approaches implement some mechanism that handles the case when the submit call happens before the resources have loaded"

Impact:

User could submit while iframes still loading
Silent failure possible
Poor user experience
No feedback mechanism

Important note from Confluence:

"Any solution to these note points is interchangeable with any of the approaches outlined above."

This means these issues can be addressed independently of which approach is chosen for the race condition.

4. Team's Proposal: Stateless Controller

Rationale

The Confluence page presents a detailed rationale for why Approach 4 is the best choice:
Why Approach 1 is inadequate:

"Approach 1 would impact the UX of users with low speed internet connection (eg. african donors of Wikimedia) or simply users who use the widget while the iframe servers are having a bad day."

Specific user impact:

African donors of Wikimedia (example of global reach)
Any user during server issues
Mobile users on poor networks
Users in regions with limited infrastructure

Why Approaches 2 and 3 add complexity:

"Approaches 2 and 3 introduce significant amount of complexity to the codebase."

Team's perspective:

More moving parts to maintain
More potential failure modes
Cognitive overhead for developers
Additional events and protocols to understand

Why Approach 4 simplifies:

"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal. It adds code but it removes a considerable amount of code too."

Key benefits:

State ownership moves to inputs (single source of truth)
Controller becomes simpler orchestrator
Net code reduction despite adding some code
Architectural improvement independent of race condition fix
Aligns with best practices (state close to where it's used)

Architecture Redesign

Current Architecture (with race condition):
SDK
 ↓
Controller iframe (maintains state)
 ↓ (BroadcastChannel)
Input iframes (report to controller)
 ↓
Controller aggregates state → FORM_CHANGE event to SDK

Problems with current:

Race condition when inputs load before controller
Controller must maintain synchronized state
Complex state management logic in controller
State can become out of sync

Stateless Architecture (no race condition):
SDK
 ↓
Controller iframe (stateless orchestrator)
 ↓ (BroadcastChannel)
Input iframes (maintain own state - source of truth)
 ↓
On submit: Controller gathers from inputs on-demand

Benefits of stateless:

No race condition (no state to synchronize)
Inputs already manage their own state
Controller simply orchestrates on-demand
State lives where it's used
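
As a rough illustration of the stateless design, the sketch below keeps all field state inside the inputs and lets the controller collect it only at submit time. It is not the code from PR #1011: the class names, the `snapshot` method and the validation rule are hypothetical, and a direct method call stands in for the message exchange over BroadcastChannel.

```typescript
// Hypothetical sketch of the stateless controller: inputs own their state and
// the controller collects it only when a submit is requested.
interface InputSnapshot {
  fieldId: string;
  value: string;
  valid: boolean;
}

class InputField {
  readonly fieldId: string;
  private value = "";
  private readonly validate: (value: string) => boolean;

  constructor(fieldId: string, validate: (value: string) => boolean) {
    this.fieldId = fieldId;
    this.validate = validate;
  }

  // State and validation live inside the input.
  type(value: string): void {
    this.value = value;
  }

  snapshot(): InputSnapshot {
    return { fieldId: this.fieldId, value: this.value, valid: this.validate(this.value) };
  }
}

type SubmitResult =
  | { ok: true; payload: Record<string, string> }
  | { ok: false; invalid: string[] };

class StatelessController {
  // The controller only knows which inputs exist, not what they contain.
  private inputs: InputField[] = [];

  register(input: InputField): void {
    this.inputs.push(input);
  }

  // Gather on demand; nothing is cached between calls.
  submit(): SubmitResult {
    const snapshots = this.inputs.map((input) => input.snapshot());
    const invalid = snapshots.filter((s) => !s.valid).map((s) => s.fieldId);
    if (invalid.length > 0) return { ok: false, invalid };
    const payload: Record<string, string> = {};
    for (const s of snapshots) payload[s.fieldId] = s.value;
    return { ok: true, payload };
  }
}

// The input is typed into before the controller exists; load order is irrelevant.
const numberField = new InputField("number", (v) => /^\d{13,19}$/.test(v));
numberField.type("4242424242424242");
const statelessController = new StatelessController();
statelessController.register(numberField);
const result = statelessController.submit();
```

Because the controller never stores field values, there is nothing to reconcile when it loads late; the trade-off is the extra gathering step at submit time.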

Additional Implementations Recommended

From Confluence "Proposal" section:
The team recommended implementing Approach 4 along with solutions to the discovered issues:
1. Reliable READY Event:
From Confluence:

"dispatch READY only when all iframes have loaded (not just the controller), so if some merchants rely on this event to be able to understand if the widget is submittable, we are not breaking their code"

Implementation:

Wait for controller AND all input iframes
Only dispatch when entire widget is ready
Matches documented behavior
Prevents merchant code breakage
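
One way to aggregate the load signals is sketched below. The `ReadyAggregator` class and the frame identifiers are hypothetical and only illustrate the "controller AND all input iframes" rule; how each iframe actually reports that it has loaded is left out.

```typescript
// Hypothetical sketch: emit READY once the controller and every expected
// input iframe have reported that they loaded. Names are illustrative.
class ReadyAggregator {
  private readonly pending: Set<string>;
  private fired = false;
  private readonly onReady: () => void;

  constructor(expectedFrames: string[], onReady: () => void) {
    this.pending = new Set(expectedFrames);
    this.onReady = onReady;
  }

  // Call whenever a frame (controller or input) reports that it has loaded.
  markLoaded(frameId: string): void {
    this.pending.delete(frameId);
    if (this.pending.size === 0 && !this.fired) {
      this.fired = true; // READY is dispatched at most once
      this.onReady();
    }
  }
}

let readyCount = 0;
const aggregator = new ReadyAggregator(
  ["controller", "number", "expiryDate", "securityCode"],
  () => {
    readyCount += 1;
  },
);
// Load order does not matter: inputs may finish before the controller.
aggregator.markLoaded("number");
aggregator.markLoaded("securityCode");
aggregator.markLoaded("controller"); // still waiting for expiryDate
aggregator.markLoaded("expiryDate"); // READY fires here
aggregator.markLoaded("expiryDate"); // duplicate signal, ignored
```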

Benefit:

Merchants can reliably gate submit button
Documentation matches implementation
Better developer experience

2. Submit Feedback Mechanism:
From Confluence:

"if the user tries to submit when READY has not been dispatched yet, we can show some feedback that goes like 'Cannot process right now, try again later'"

Covers two cases:
a. User submits while resources still loading:

Provide clear feedback
Prevent silent failure
Guide user to wait

b. Submit after one or more iframes failed to load:

Handle error gracefully
Inform user of issue
Prevent confusion
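
The two cases can be sketched as a small guard in front of the submit call. The `WidgetStatus` values, the `guardedSubmit` function and the wording of the failure message are assumptions; only the "Cannot process right now, try again later" text comes from the Confluence page.

```typescript
// Hypothetical sketch of a submit guard covering cases (a) and (b).
type WidgetStatus = "loading" | "ready" | "failed";

type SubmitOutcome =
  | { submitted: true }
  | { submitted: false; message: string };

function guardedSubmit(status: WidgetStatus, doSubmit: () => void): SubmitOutcome {
  switch (status) {
    case "ready":
      doSubmit();
      return { submitted: true };
    case "loading":
      // Case a: resources are still loading, READY has not been dispatched.
      return { submitted: false, message: "Cannot process right now, try again later" };
    case "failed":
      // Case b: one or more iframes failed to load. Message wording is hypothetical.
      return { submitted: false, message: "Payment form failed to load, please reload the page" };
  }
}

let submissions = 0;
const tooEarly = guardedSubmit("loading", () => {
  submissions += 1;
});
const onTime = guardedSubmit("ready", () => {
  submissions += 1;
});
```

Here `tooEarly` carries the feedback message and the submit callback is never invoked, while `onTime` triggers exactly one submission.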

Additional benefit from Confluence:

"This approach would also cover the case in which upon the submit click occurrence we need to open a popup/window."

This handles scenarios where submit triggers a popup (e.g., 3DS authentication), which browsers may block if it is not opened as a direct result of a user action.

Merchant Communication Plan

From Confluence "Review outcome":

"After these implementations the merchants should be informed about relying on the READY event to display the submit button on their UIs."

Communication requirements:

Inform merchants of READY event reliability improvements
Recommend gating submit button on READY event
Update documentation to reflect new behavior
Provide migration guide if needed

5. PR #1011: Stateless Controller Implementation

Overview

Created by: Giordano Arman (GiordanoArman)
Created: November 6, 2025, 11:17 UTC
Status: Open (as of November 10, 2025)
Title: "task(investigation): controller / inputs loading logic - TA-13399"
URL: https://github.com/gr4vy/secure-fields/pull/1011

Description

From PR description:

"This approach fixes the underlying iframe load race condition issue by making the existing controller stateless.
In short, the state input handling and validation would happen within the inputs, the controller becomes simply an element that gathers all the input and issues a call towards the API on the fly, upon submit.
In this PR there's code and pseudo-code that aims to explain how this approach would result in code changes. I've added comment throughout the code, if you want to pull changes locally you can find the comments easily by searching for 'TA-13399' in your IDE."

Diagrams

Before (Race Condition Architecture): [diagram image, 1034 x 564, not reproduced here]
After (Stateless Architecture): [detailed architectural diagram image, 6194 x 5787, not reproduced here]

Implementation Approach

From PR description:

Mix of real code and pseudo-code
Comments marked with "TA-13399" for easy searching
Exploratory PR, not production-ready
Demonstrates architectural direction
Shows code changes needed

Code search hint:

"if you want to pull changes locally you can find the comments easily by searching for 'TA-13399' in your IDE"

Code Structure

From PR description, the changes involve:

State/input handling moves to inputs
Validation happens within inputs
Controller becomes gatherer and API caller
On-demand data collection on submit

Review Status

As of November 6, 2025:

No review comments yet
CI status checks:

❌ tests (20.x): FAILURE
✅ scan (20.x): SUCCESS
⏭️ release: SKIPPED
✅ secure-fields (gr4vy-admin): SUCCESS


PR checklist (not completed):

[ ] Code follows style guidelines
[ ] Self-review performed
[ ] yarn lint passed
[ ] yarn test passed
[ ] Latest changes from main pulled
[ ] Tested React and CDN versions
[ ] Labels added for release

Test failures:
The PR shows test failures, which is expected for an exploratory investigation PR that mixes real code with pseudo-code.

Link to Investigation

From PR description:

"Part of https://gr4vy.atlassian.net/browse/TA-13399."

6. Critical Blocker: TA-5920 (Mixed Versions Risk)

The Problem

From Confluence "Review outcome" section:

"Since Approach 4 (stateless controller) involves changing files in /apps and /packages, and the contents of those two folders get deployed in different ways, there's a risk that the user might end up with the resources from /app from a version of secure-fields and the resources from /packages from a different one, breaking the widget."

Root cause:

/apps folder: Contains controller iframe code
/packages folder: Contains SDK code
Different deployment mechanisms for each folder
No guaranteed synchronization between deployments
Users may load mixed versions

Example failure scenario:
User's browser loads:
- /apps/controller.js from version 1.2.3 (old, stateful controller)
- /packages/sdk.js from version 1.2.4 (new, expects stateless controller)

Result: Widget completely broken
- SDK expects stateless protocol
- Controller uses stateful protocol
- Communication breakdown
- Silent failures or errors

Another failure scenario:
User's browser loads:
- /apps/controller.js from version 1.2.4 (new, stateless)
- /packages/sdk.js from version 1.2.3 (old, expects stateful)

Result: Widget broken
- SDK sends commands expecting state in controller
- Controller doesn't maintain state
- Data loss, validation failures

Why This Blocks Approach 4

From Confluence "Review outcome":

"Therefore implementing Approach 4 now would be quite complex, as we would need to first execute a release that comprises the current approach mixed with Approach 4, and then executing another release that removes the old approach."

Implementation complexity:
Step 1: Dual-mode release

Support BOTH stateful and stateless modes
Controller detects SDK version
SDK detects controller version
Version negotiation protocol
Maintain two code paths
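
A version negotiation step for such a dual-mode release could look like the sketch below. Nothing here comes from the Confluence page or the codebase: the `Hello` message, the `supports` list and the `negotiate` function are hypothetical, and the sketch only shows how both sides might agree on a protocol mode or detect that they cannot.

```typescript
// Hypothetical sketch of a handshake for a dual-mode release: each side
// announces which protocol modes it supports and both pick the same one.
type ProtocolMode = "stateful" | "stateless";

interface Hello {
  component: "sdk" | "controller";
  supports: ProtocolMode[];
}

// Prefer stateless when both sides support it; otherwise fall back to stateful.
// Returns null when the two sides share no mode (the mixed-version failure).
function negotiate(sdk: Hello, controller: Hello): ProtocolMode | null {
  const preference: ProtocolMode[] = ["stateless", "stateful"];
  for (const mode of preference) {
    const sdkOk = sdk.supports.indexOf(mode) !== -1;
    const controllerOk = controller.supports.indexOf(mode) !== -1;
    if (sdkOk && controllerOk) return mode;
  }
  return null;
}

const oldSdk: Hello = { component: "sdk", supports: ["stateful"] };
const dualSdk: Hello = { component: "sdk", supports: ["stateful", "stateless"] };
const dualController: Hello = { component: "controller", supports: ["stateful", "stateless"] };
const statelessOnlyController: Hello = { component: "controller", supports: ["stateless"] };

const cachedSdkCase = negotiate(oldSdk, dualController); // falls back to "stateful"
const upgradedCase = negotiate(dualSdk, dualController); // both agree on "stateless"
const brokenCase = negotiate(oldSdk, statelessOnlyController); // no common mode
```

The `brokenCase` result corresponds to the failure scenarios above: it only disappears once every cached old SDK is gone, which is why the second release cannot be scheduled safely.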

Step 2: Second release

Remove old stateful code
Only after ensuring all users upgraded
No clear timeline for "safe" removal
Risk of breaking users who cached old versions

Challenges:

Careful coordination required
High risk of breaking production
Difficult to test all version combinations
Rollback complexity
Extended timeline (two releases minimum)

Prerequisite: TA-5920 - "Solving this mixed versions risk"

TA-5920 Context

The blocker ticket:

Ticket: TA-5920
Issue: Mixed versions risk between /apps and /packages
Status: Unresolved
Age: Created years ago
Timeline: No timeline for resolution
Impact: Blocks multiple initiatives, not just this one

Fundamental issue:

Deployment architecture problem
Requires infrastructure changes
May need CDN/caching strategy changes
May need versioned API endpoints
Organizational priority question

7. Interim Solution: Approach 1 (SDK Waits)

Decision

From Confluence "Review outcome":

"Until we fix the mixed versions risk, we should opt for Approach 1 (SDK waits for controller before adding inputs), although it introduces complexity and it degrades UX, because the wait time for the user before being able to write into the inputs would be controller load time + input load time instead of input load time alone."

Rationale

Why Approach 1 despite drawbacks:


Safety over perfection:

Safer than risking mixed versions
Lower technical risk than Approach 4
Can deploy immediately without infrastructure changes


Deployability:

No architectural changes needed
Works with current deployment system
Single release, no coordination needed


Simplicity:

Builds on existing architecture
No version negotiation needed
Clear rollback path


Temporary:

Explicitly interim solution
Can replace with Approach 4 when blocker resolved
Not intended as permanent fix


UX Trade-off

Wait time impact:
Before (ideal):
User waits: Input load time only (~500ms-2s)
Controller loads independently in background

With Approach 1:
User waits: Controller load time + Input load time
Total: ~1s-4s or more depending on connection

Impact calculation:

Controller load: ~500ms-2s (depending on connection)
Input load: ~500ms-2s (depending on connection)
Total wait: 1s-4s+ (serialized, not parallel)
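
The arithmetic behind these figures can be written out as a short sketch. The function names are illustrative, and the 500 ms and 2000 ms inputs are simply the bounds of the estimates listed above, not measurements.

```typescript
// Sketch of the wait-time comparison, using the estimated bounds from the text.
interface LoadTimesMs {
  controller: number;
  input: number;
}

// Parallel loading: the user can type as soon as the input has loaded.
function waitParallel(t: LoadTimesMs): number {
  return t.input;
}

// Approach 1: input loading starts only after the controller has loaded.
function waitSerialized(t: LoadTimesMs): number {
  return t.controller + t.input;
}

const bestCase: LoadTimesMs = { controller: 500, input: 500 };
const worstCase: LoadTimesMs = { controller: 2000, input: 2000 };

const bestSerialized = waitSerialized(bestCase); // 1000 ms
const worstSerialized = waitSerialized(worstCase); // 4000 ms
const addedDelayWorst = worstSerialized - waitParallel(worstCase); // 2000 ms extra
```

The added delay always equals the controller load time, so the penalty grows exactly where connections are slowest.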

Affected users:


Users with slow connections:

Example from Confluence: "African donors of Wikimedia"
Mobile users on 3G/4G
Rural areas with limited infrastructure
Developing countries


Users during server issues:

"iframe servers are having a bad day" (from Confluence)
CDN issues
Network congestion
Geographic routing problems


All users to some degree:

Even fast connections see noticeable delay
Perception of "slow" form
May impact conversion rates


Mitigation

Why acceptable as interim:

Temporary solution (not permanent)
Better than broken widget from mixed versions
Can be replaced when TA-5920 resolved
Maintains backward compatibility

What's NOT mitigated:

UX degradation is real and measurable
No way to eliminate the wait
Users will experience slower forms
Potential business impact (conversion, satisfaction)

Monitoring recommendations:

Track load times in production
Monitor user feedback
Measure impact on form completion rates
A/B testing if possible

8. Path Forward After TA-5920

When Blocker Resolved

From Confluence "Proposal" and "Review outcome" sections:
Step 1: Implement Approach 4 (Stateless Controller)

Follow PR #1011 architectural design
Complete real implementation (beyond POC)
Comprehensive testing required:

Unit tests for all components
Integration tests for iframe communication
E2E tests covering all scenarios
Performance benchmarks
Load testing


Gradual rollout recommended:

Internal testing first
Beta merchants
Phased production rollout
Monitor error rates and performance


Step 2: Implement Point 1 (Reliable READY Event)
From Confluence:

"dispatch READY only when all iframes have loaded (not just the controller), so if some merchants rely on this event to be able to understand if the widget is submittable, we are not breaking their code"

Implementation requirements:

Wait for controller load
Wait for ALL input iframes load
Aggregate ready signals
Dispatch READY only when complete widget ready
Update event dispatch logic
Test with various load orders

Benefit:

Matches documented behavior
Merchants can trust READY for submit button gating
Prevents breaking existing merchant code
Better developer experience

Step 3: Implement Point 2 (Submit Feedback)
From Confluence:

"if the user tries to submit when READY has not been dispatched yet, we can show some feedback that goes like 'Cannot process right now, try again later'"

Implementation requirements:
Case a: Submit while resources loading

Detect submit before READY dispatched
Show user-friendly message
Suggested: "Cannot process right now, try again later"
Prevent silent failure

Case b: Submit after iframe load failure

Detect when one or more iframes failed to load
Show error message
Provide recovery options
Log error for debugging

Additional benefit:
This also covers popup/window scenarios (e.g., 3DS), where the window must be opened in response to a user-initiated action.

Step 4: Merchant Communication
From Confluence:

"After these implementations the merchants should be informed about relying on the READY event to display the submit button on their UIs."

Communication plan:


Documentation update:

Update event documentation
Add READY event reliability guarantees
Provide code examples
Show submit button gating pattern


Migration guide:

How to update integration
Best practices for submit button control
Testing recommendations
Backward compatibility notes


Announcement:

Email to merchant developers
Update release notes
Highlight reliability improvements
Encourage adoption of READY-based gating


Support preparation:

Train support team on changes
Prepare FAQ
Monitor for merchant questions
Proactive outreach to key merchants


Migration Strategy

Rollout plan:
Phase 1: Internal Testing

Deploy to internal test environments
Team validation
Load testing
Security review

Phase 2: Beta Testing

Select beta merchant cohorts
Close monitoring
Gather feedback
Iterate on issues

Phase 3: Gradual Rollout

10% of traffic
Monitor metrics closely
25% → 50% → 100% if metrics good
Rollback plan ready at each stage

Metrics to track:

Load times: Controller, inputs, total
Error rates: All iframe operations
Success rates: Form submissions
READY event dispatch timing
User-facing errors
Merchant feedback

Success criteria:

No increase in error rates
Load times acceptable
Zero critical bugs
Positive merchant feedback
Performance targets met

Rollback triggers:

Error rate spike (>1% increase)
Load time regression (>20% slower)
Critical bugs discovered
Merchant complaints
Performance degradation
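
The two quantified triggers and the staged percentages can be combined into a small decision helper, sketched below. The names are hypothetical, and the sketch assumes that ">1% increase" means more than one percentage point over the baseline error rate and that ">20% slower" is relative to the baseline load time; the qualitative triggers (critical bugs, merchant complaints) are left to human judgment.

```typescript
// Hypothetical sketch of the quantified rollback triggers and rollout stages.
interface Metrics {
  errorRatePct: number; // e.g. 0.4 means 0.4% of operations failed
  loadTimeMs: number;
}

function rollbackReasons(baseline: Metrics, current: Metrics): string[] {
  const reasons: string[] = [];
  // Assumption: ">1% increase" is read as more than one percentage point.
  if (current.errorRatePct - baseline.errorRatePct > 1) reasons.push("error-rate-spike");
  // ">20% slower" is read relative to the baseline load time.
  if (current.loadTimeMs > baseline.loadTimeMs * 1.2) reasons.push("load-time-regression");
  return reasons;
}

const stages = [10, 25, 50, 100];

// Returns the next traffic percentage, or "rollback" if any trigger fired.
function nextTrafficPct(currentPct: number, reasons: string[]): number | "rollback" {
  if (reasons.length > 0) return "rollback";
  const index = stages.indexOf(currentPct);
  const next = index === -1 ? undefined : stages[index + 1];
  // Unknown stage or already at 100%: stay where we are.
  return next === undefined ? currentPct : next;
}

const baseline: Metrics = { errorRatePct: 0.4, loadTimeMs: 1000 };
const healthy = rollbackReasons(baseline, { errorRatePct: 0.5, loadTimeMs: 1100 });
const regressed = rollbackReasons(baseline, { errorRatePct: 2.0, loadTimeMs: 1300 });

const afterHealthy = nextTrafficPct(10, healthy); // advance to the next stage
const afterRegressed = nextTrafficPct(25, regressed); // roll back
```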

9. Related Work & Context

TA-13380: CVV-only Mode Complete Flag Bug

Status: Blocked by TA-13399
Link: TA-13380
Issue:

complete flag returns false even when CVV-only form is valid
Related to same state management issues
Controller state doesn't accurately reflect CVV-only validity
Needs resolution of race condition first

Impact:

Merchants can't reliably gate submit button in CVV-only mode
False negative on form completeness
User experience degradation
Workarounds needed in merchant implementations

Why it's blocked:

Root cause is controller state management
Approach 4 (stateless) would resolve this
Approach 1 (interim) may not fully resolve
Must wait for proper solution

Link from TA-13399:
Created September 29, 2025 by Luca Allievi:

"This work item blocks TA-13380"

TA-13099: Bruno's Original Investigation

Status: Closed September 30, 2025
Link: TA-13099
Context:

Bruno's original investigation ticket
Resulted in PR #976 (late sync solution)
Closed when team decided to open TA-13399
Represents alternative approach path

Why closed:
From TA-13399 description:

"we weren't completely on board with it. We further discussed potential solutions and we decided to open a new investigation to go deeper."

Other Context

PRs #996 and #997: Click to Pay (merged Oct/Nov 2025)

Not directly related to race condition
Separate features that landed during investigation period
Shows parallel development continuing

Industry References:
Braintree Hosted Fields:

Link: https://github.com/braintree/braintree-web/tree/main/src/hosted-fields
Referenced in TA-13399 description
Industry standard implementation
Used for comparison and ideas

Checkout.com Frames:

Link: https://www.checkout.com/docs/developer-resources/sdks/frames-sdks/frames-reference
Referenced in TA-13399 description
Alternative industry approach
Different architectural patterns

10. Confluence Page Development

Metadata

Page ID: 1676607506
Title: "Secure Fields race condition solution"
Space: GB (Gr4vy Build)
Current Version: 14
URL: https://gr4vy.atlassian.net/wiki/spaces/GB/pages/1676607506

Evolution

Version History:

Multiple iterations through version 14
Team discussion and refinement
Evolved from 3 approaches to 4 approaches
Renamed "Controller removal" to "Stateless controller"

Key Changes:

Added comparative evaluation
Added "Notes" section with discovered issues
Expanded "Proposal" with detailed rationale
Added "Review outcome" with TA-5920 blocker

Structure

Sections:

Goal: Investigation objectives
Approaches: Four approaches with pros/cons
Notes: Additional issues discovered
Proposal: Recommendation for Approach 4
Review outcome: TA-5920 blocker and interim decision

Comments and Discussion

Note: Specific inline comments on highlighted text may not be available via API, but the page shows evidence of team collaboration:

Version 14 indicates multiple rounds of editing
Multiple author contributions (visible in changelog)
Refined over October-November period

Key contributors:

Luca Allievi (creator)
Giordano Arman (implementer)
Cristiano Betta (summary updates)
Paulo Ferrarini (sprint management)
Gary Evans (sprint management)

11. Technical Considerations

Testing Requirements

For Stateless Approach (Approach 4):
Unit tests:

Input state management in isolation
Controller orchestration logic
Data gathering on-demand
Validation in inputs
Error handling

Integration tests:

Controller-input communication
BroadcastChannel messaging
State consistency across inputs
Form validation aggregation
Submit flow end-to-end

E2E tests:

Delayed controller load scenarios
Delayed input load scenarios
Failed controller load cases
Failed input load cases
Mixed load order combinations
Network throttling scenarios
Slow connection simulation
Server error conditions
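
For the mixed load order combinations, a test matrix can be generated rather than written by hand. The sketch below is a generic permutation helper; the frame names are illustrative, and wiring each order into an actual E2E run (for example by delaying iframe responses) is not shown.

```typescript
// Sketch: enumerate every load order of the frames so each can become an E2E case.
function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items.slice()];
  const result: T[][] = [];
  items.forEach((item, index) => {
    // Fix one item in front, then permute the remaining ones.
    const rest = items.slice(0, index).concat(items.slice(index + 1));
    for (const tail of permutations(rest)) {
      result.push([item].concat(tail));
    }
  });
  return result;
}

// Hypothetical frame identifiers for a three-frame widget.
const loadOrders = permutations(["controller", "number", "securityCode"]);
// Three frames give 3! = 6 distinct load orders, including the
// "inputs before controller" orders that trigger the race condition today.
const labels = loadOrders.map((order) => order.join(" -> "));
```

The number of orders grows factorially with the number of frames, so for a full card form it may be more practical to sample orders or to cover only "controller first", "controller last" and "controller in between".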

Load time metrics:

Controller load time tracking
Input load time tracking
Total ready time tracking
Comparison with current architecture

Memory tests:

State in multiple inputs (memory implications)
Memory leak detection
Long-running form sessions
Multiple form instances

State coordination tests:

Multiple inputs with same data
State updates across inputs
Validation synchronization
Error state propagation

Error handling:

Partial load failures
Timeout scenarios
Network errors
Recovery mechanisms

For Interim Approach 1 (SDK Waits):
Queue tests:

Queue overflow handling
Queue order preservation
Queue timeout mechanisms
Queue memory management
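To make the queue properties above concrete, here is a minimal sketch of a bounded FIFO queue that an SDK could use to hold operations until the controller is ready. It is an assumption-laden illustration, not the actual SDK code: `PendingOperationQueue`, `maxSize`, and `maxAgeMs` are hypothetical names, and the clock is passed in explicitly so the timeout behavior can be tested without real timers.

```typescript
// Hypothetical sketch; not taken from the SDK or from any PR.
interface PendingOperation {
  name: string;
  enqueuedAt: number; // timestamp in milliseconds, supplied by the caller
}

class PendingOperationQueue {
  private items: PendingOperation[] = [];
  private readonly maxSize: number;
  private readonly maxAgeMs: number;

  constructor(maxSize: number, maxAgeMs: number) {
    this.maxSize = maxSize;
    this.maxAgeMs = maxAgeMs;
  }

  // Overflow handling: refuse new work instead of growing without bound.
  enqueue(name: string, now: number): boolean {
    if (this.items.length >= this.maxSize) return false;
    this.items.push({ name, enqueuedAt: now });
    return true;
  }

  // Called once the controller is ready. Operations are returned in
  // insertion order; those older than maxAgeMs are reported as expired.
  // The queue is emptied either way, so nothing is retained afterwards.
  flush(now: number): { delivered: string[]; expired: string[] } {
    const delivered: string[] = [];
    const expired: string[] = [];
    for (const op of this.items) {
      if (now - op.enqueuedAt > this.maxAgeMs) {
        expired.push(op.name);
      } else {
        delivered.push(op.name);
      }
    }
    this.items = [];
    return { delivered, expired };
  }

  get size(): number {
    return this.items.length;
  }
}
```

Injecting `now` rather than calling `Date.now()` inside the class keeps overflow, ordering, timeout, and cleanup assertions deterministic.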

Timing tests:

Total wait time measurement
Performance impact quantification
User experience metrics
Comparison benchmarks

Error feedback:

Timeout messages to users
Clear error communication
Recovery options
Retry mechanisms

Risk Assessment


| Aspect | Approach 1 (Interim) | Approach 4 (Desired) |
| --- | --- | --- |
| Technical Risk | Low | High |
| UX Impact | Negative (delay) | None |
| Code Complexity | Medium (queuing) | High (architectural) |
| Deployment Risk | Low | High (TA-5920) |
| Testing Effort | Medium | High |
| Rollback Complexity | Easy | Difficult |
| Maintenance | Short-term | Long-term benefit |
| Performance | Worse (serial loading) | Better (simpler) |
| Scalability | Same | Better |
| Developer Experience | Same | Better |
| Documentation | Minimal changes | Significant updates |


Performance Considerations

Approach 1 (Interim):
Latency:

Added latency: controller + input load time
Serial loading instead of parallel
User-visible delay before interaction
Potential for timeout scenarios
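The latency cost of serial loading can be illustrated with simple arithmetic. The sketch below assumes a simplified model in which the input only starts loading after the controller is ready. The function names and the sample timings are made up for illustration and are not measurements.

```typescript
interface LoadTimes {
  controllerMs: number;
  inputMs: number;
}

// Parallel loading: both iframes start together, so everything is ready
// when the slower of the two finishes.
function parallelReadyMs(t: LoadTimes): number {
  return Math.max(t.controllerMs, t.inputMs);
}

// Serial loading (the assumed Approach 1 model): the input load starts
// only after the controller is ready, so the two times add up.
function serialReadyMs(t: LoadTimes): number {
  return t.controllerMs + t.inputMs;
}

// Extra wait introduced by serializing; equals the faster of the two loads.
function addedLatencyMs(t: LoadTimes): number {
  return serialReadyMs(t) - parallelReadyMs(t);
}

// Hypothetical timings, chosen only to show the arithmetic.
const example: LoadTimes = { controllerMs: 800, inputMs: 300 };
console.log(parallelReadyMs(example)); // 800
console.log(serialReadyMs(example)); // 1100
console.log(addedLatencyMs(example)); // 300
```

Under this model the added delay grows with the faster load time, so slow connections that inflate both loads are penalized the most.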

Memory:

Queue overhead
Queued messages memory
Queue cleanup needed

Risk scenarios:

Queue overflow if many operations queued
Memory leak if queue not properly cleaned
Timeout handling complexity

Approach 4 (Stateless):
Latency:

On-demand gathering at submit time
Slight delay on submit (gather from inputs)
Overall better due to simpler architecture
Parallel loading maintained
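On-demand gathering can be sketched as follows. This is an illustration under stated assumptions, not the code in PR #1011: `InputField`, `StatelessController`, and the validators are hypothetical, and direct method calls stand in for the cross-iframe messaging (for example over BroadcastChannel), which would be asynchronous in a real implementation.

```typescript
// Hypothetical sketch of a stateless controller; all names are illustrative.
interface FieldSnapshot {
  field: string;
  value: string;
  valid: boolean;
}

// Each input owns its value and validation, acting as the source of truth.
class InputField {
  readonly field: string;
  private readonly validate: (value: string) => boolean;
  private value = "";

  constructor(field: string, validate: (value: string) => boolean) {
    this.field = field;
    this.validate = validate;
  }

  setValue(value: string): void {
    this.value = value;
  }

  snapshot(): FieldSnapshot {
    return { field: this.field, value: this.value, valid: this.validate(this.value) };
  }
}

// The controller keeps no copy of field state; it asks the inputs at submit time.
class StatelessController {
  private readonly inputs: InputField[];

  constructor(inputs: InputField[]) {
    this.inputs = inputs;
  }

  submit(): { ok: boolean; payload: Record<string, string>; invalid: string[] } {
    const snapshots = this.inputs.map((input) => input.snapshot());
    const invalid = snapshots.filter((s) => !s.valid).map((s) => s.field);
    const payload: Record<string, string> = {};
    for (const s of snapshots) {
      payload[s.field] = s.value;
    }
    return { ok: invalid.length === 0, payload, invalid };
  }
}
```

Because nothing is cached in the controller, there is no state to synchronize if an input loads before or after it. The trade-off is the gather step at submit time.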

Memory:

State distributed across inputs
Each input manages own state
Potential for higher memory if many inputs
Need to monitor in production

Performance gains:

Reduced controller complexity
Less state synchronization overhead
Fewer messages between iframes
Simpler code paths (faster execution)

12. Comparison with Bruno's Investigation

Approach Alignment

Intersection between investigations:
Bruno's Option A → Team's Approach 1

Both: SDK queues until controller ready
Bruno: Described as option in comment
Team: Selected as interim solution

Bruno's Option B → Team's Approach 2

Both: Input buffering with readiness checks
Bruno: Considered as alternative
Team: Evaluated with Proxy approach

Bruno's Option C → Team's Approach 3

Both: Late sync with state reconciliation
Bruno: Chosen and implemented in PR #976
Team: Evaluated but not recommended

Bruno's controller-less idea → Team's Approach 4

Both: Remove controller or make stateless
Bruno: Mentioned as theoretical option
Team: Fully explored with POC in PR #1011
Different implementation details

Technical Philosophy

Bruno's Approach (PR #976):
Philosophy:

Pragmatic: Fix the problem with minimal changes
Proven: Tested and working solution
Deployable: Ready to merge and release
Low-risk: Protocol enhancement, not structural change

Characteristics:

Backward compatible: Zero breaking changes
Minimal code: Small, focused changes
Protocol enhancement: New events for sync
Quick to production: Can deploy immediately

Trade-offs accepted:

Adds events (which the team calls "complexity")
Not architecturally "pure"
Seen as "band-aid" by some

Philosophy statement (implied):

Ship working solution now
Iterate later if needed
Value delivery over perfection
Pragmatism over purity

Team's Approach (Stateless):
Philosophy:

Architectural: Fix the underlying design
Pure: Remove state management complexity
Long-term: Pay down technical debt
Proper: "Clean, logical and easily understandable"

Characteristics:

Backward compatible: Required
Significant changes: Structural redesign
Architectural shift: State ownership changes
Blocked: Can't deploy until TA-5920

Trade-offs accepted:

Can't deploy immediately
Higher risk and testing needs
Blocked by infrastructure issue
May wait indefinitely

Philosophy statement (from Confluence):

"desirable regardless of the investigation's goal"

This shows a desire for architectural improvement that goes beyond the race condition fix.

Key Differences

Speed vs Architecture:
Bruno:

Speed to production
Working solution in hand
Band-aid that works
Incremental improvement

Team:

Architectural purity
Proper long-term solution
Foundation for future
Comprehensive redesign

Risk vs Benefit:
Bruno:

Low risk: Proven in tests
Immediate benefit: Fix production issue
Incremental: Small, focused change
Reversible: Easy rollback

Team:

High risk: Needs extensive testing
Long-term benefit: Better architecture
Comprehensive: Large structural change
Complex rollback: Difficult to reverse

Deployment:
Bruno:

Can deploy today
No blockers
Single release
Immediate resolution

Team:

Blocked by TA-5920
No timeline
Complex multi-release
Indefinite wait

Philosophical Tension

This investigation reveals a classic engineering tension:
Pragmatism vs Idealism:

Bruno: "Make it work, ship it"
Team: "Do it right, even if it takes time"

Short-term vs Long-term:

Bruno: Fix the urgent problem now
Team: Build the proper foundation

Incremental vs Revolutionary:

Bruno: Small, focused improvements
Team: Comprehensive redesign

Delivery vs Architecture:

Bruno: Value delivery to users/merchants
Team: Value architectural cleanliness

Impact of Decision

Choosing Approach 1 (interim) over PR #976:
Implications:

Race condition remains in production longer
Users experience degraded UX (Approach 1's delay)
More time spent on interim solution
PR #976 work effectively discarded
Waiting for TA-5920 with no timeline

Alternative if PR #976 chosen:

Race condition fixed immediately
No UX degradation
Can iterate to Approach 4 later
Users protected now
Business value delivered

Team's rationale for waiting:

Architectural purity worth the wait
Interim solution acceptable
Proper fix is Approach 4
Don't want "band-aid" solutions

13. Open Questions

Critical Questions

1. TA-5920 Timeline:

When will the mixed-versions risk be resolved?
Is there active work on this blocker?
What's the priority of this infrastructure work?
Should we re-prioritize this ticket given its blocking impact?
Who owns this ticket?
What's the estimated timeline (months? years?)?

2. Priority Decision:

Should we deploy PR #976 (Bruno's solution) as interim instead of Approach 1?
Is architectural purity worth indefinite wait?
What's the cost of UX degradation in Approach 1?
What's the business impact of continued race condition exposure?
Have we measured conversion impact?
What do merchants prefer: working solution now vs perfect solution later?

3. UX Trade-off:

Have we measured impact of Approach 1 delay on users?
What percentage of users affected by slow loading?
Geographic distribution of impacted users?
A/B testing planned for Approach 1?
Acceptable threshold for delay?
Impact on form completion rates?
Impact on merchant satisfaction?

4. Rollback Plan:

If the stateless approach has production issues after TA-5920 is resolved, what is the rollback strategy?
Can we safely revert architectural changes?
What's the rollback complexity?
Monitoring and alerting requirements?
Who's on-call for rollout?
What are the rollback triggers?

5. Testing Strategy:

Comprehensive test plan for stateless approach?
Coverage requirements?
Performance benchmarks?
Load testing strategy?
Beta testing cohorts identified?
Timeline for testing phases?

Strategic Questions

6. Interim Solution Choice:

Why not PR #976 as interim solution?
Is Approach 1's UX impact measured?
Have merchants been consulted?
What's the decision rationale beyond architecture?

7. Resource Allocation:

Why were there two separate investigations (Bruno's and the team's)?
Could resources have been combined?
What's the cost of parallel work?
Was there coordination between the investigations?

8. Decision Making:

Who makes final call: Approach 1 vs PR #976?
What are the decision criteria?
Business stakeholders involved?
Customer feedback considered?

9. Timeline Expectations:

When do we expect to deploy proper solution?
What if TA-5920 takes another year?
At what point do we reconsider interim choice?
Acceptable wait time for architectural purity?

10. Measurement:

How will we measure success?
What metrics define "better"?
User impact quantification?
Business impact quantification?

Appendices

A. All Approaches Summary


| # | Approach | Status | POC | Pros | Cons | UX Impact |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | SDK waits | Interim solution | PR #976 comment point A | Builds on existing, guarantees order | UX degradation, queuing overhead | High (negative) |
| 2 | Input buffering | Evaluated | poc-queue-fields | Builds on existing, immediate render | Proxy complexity, timeout | Medium |
| 3 | Late sync | Evaluated (Bruno) | PR #976 | No UX impact, proven in tests | "More events" concern | None |
| 4 | Stateless | Recommended (blocked) | PR #1011 | Simplifies architecture | Blocked by TA-5920 | None |


B. Key Documents

Jira Tickets:

TA-13399 - This investigation (Team's alternative)
TA-13099 - Bruno's investigation (closed Sep 30, 2025)
TA-13380 - CVV-only mode bug (blocked by TA-13399)
TA-5920 - Mixed versions blocker (blocks Approach 4)

Pull Requests:

PR #976 - Bruno's late sync solution (Open, unmerged)
PR #1011 - Stateless controller POC (Open, exploratory)

Confluence Pages:

Post-mortem - Original incident September 2025
Approaches doc - This investigation (version 14)

Documentation:

Secure Fields Events - Current docs with READY event inaccuracy

C. Timeline Summary

September 2025:

Sep 29, 07:52 UTC: TA-13399 created by Luca Allievi
Sep 29, 07:57 UTC: Moved to "Refinement Ready"
Sep 29, 10:15 UTC: Linked as blocker for TA-13380
Sep 29, 12:33 UTC: Moved to "Dev Ready"
Sep 30: TA-13099 (Bruno's ticket) closed

October 2025:

Oct 7: Sprint assignment (T-Wolf O)
Oct 8: Sprint reassignment (T-Wolf N)
Oct 22: Sprint expansion (back to T-Wolf O)
Oct 27, 10:26 UTC: Assigned to Giordano Arman
Oct 27, 10:26 UTC: Moved to "In Progress"

November 2025:

Nov 5: Summary updated by Cristiano Betta
Nov 5: Sprint expansion (T-Wolf P)
Nov 6, 11:17 UTC: PR #1011 opened (stateless POC)
Nov 10, 09:32 UTC: Moved to "Review" status

D. Resources Referenced

Industry Examples:

Braintree Hosted Fields - Reference implementation
Checkout.com Frames - Alternative approach

Internal:

Bruno's 35+ documentation files from original investigation (TA-13099)
PR #976 review discussions
Team conversations (Sep 29-30, 2025)
Confluence page iterations (14 versions)

Code References:

POC branch: poc-queue-fields
PR #976: Late sync implementation
PR #1011: Stateless controller POC
Search PR #1011 for "TA-13399" to find the related code comments

E. Team Members

Key Contributors:
Luca Allievi:

Created TA-13399
Defined investigation scope
Linked related tickets

Giordano Arman:

Assigned to investigation
Created PR #1011 (stateless POC)
Implemented exploratory code

Cristiano Betta:

Updated ticket summaries
Contributed to Confluence page
Sprint management

Paulo Ferrarini:

Sprint management
Refinement process

Gary Evans:

Sprint management
Prioritization

Cansın Güler:

Moved ticket to "Dev Ready"

Bruno:

Original investigation (TA-13099)
PR #976 implementation
35+ documentation files
Alternative approach advocate


Conclusion

The team's investigation into the Secure Fields race condition produced a thorough evaluation of four distinct approaches, with Approach 4 (Stateless Controller) identified as the optimal long-term solution. This approach offers significant architectural benefits by eliminating state management complexity in the controller and relying on inputs as the single source of truth.
However, the team discovered a critical blocker: TA-5920 (mixed versions risk). This years-old infrastructure issue prevents safe deployment of Approach 4 due to the risk of users loading mismatched versions of controller and SDK code. The complexity of supporting dual modes (stateful and stateless simultaneously) was deemed too high.
As a result, the team selected Approach 1 (SDK waits) as the interim solution, fully acknowledging its negative UX impact. Users will experience noticeably longer wait times before fields become interactive, particularly affecting those with slow connections or during server issues. The team explicitly noted users like "African donors of Wikimedia" as examples of affected populations.
Key Insights

1. Architecture vs Pragmatism:
The investigation reveals a fundamental tension between architectural purity and pragmatic delivery. The team chose to wait for the "proper" solution (Approach 4) rather than deploy the working solution (Bruno's PR #976), treating continued exposure to the race condition and degraded UX as acceptable trade-offs.
2. The Cost of Perfection:
By choosing architectural improvement over an immediate fix, the team accepted:

Continued race condition exposure in production
Degraded UX for all users (Approach 1's serialized loading)
Indefinite timeline (TA-5920 has no resolution date)
Additional development effort (two separate investigations)
Discarded working solution (PR #976 remains unmerged)

3. Blocker Impact:
TA-5920 demonstrates how infrastructure debt can block multiple product initiatives. This ticket has existed for years and blocks not just this race condition fix, but potentially other improvements. The lack of a timeline or of visible active work raises questions about prioritization.
4. Investigation Thoroughness:
The team produced excellent documentation:

Four approaches thoroughly evaluated
Confluence page with comparative analysis
POC implementations for verification
Clear pros/cons for each approach
Identified additional issues (READY event, submit guardrails)

5. The Unasked Question:
Neither investigation fully addresses: Why not deploy PR #976 now and migrate to Approach 4 later?
This would:

Fix race condition immediately
Provide zero UX degradation
Allow architectural improvement once TA-5920 is resolved
Deliver value to users and merchants now
Reduce risk exposure

Final Observation

Sometimes the "proper solution" isn't the right solution if it can't be deployed. The team chose architectural purity over pragmatic delivery, resulting in:

Race condition: Still exists in production
Interim solution: Degrades UX for all users
Proper solution: Blocked indefinitely
Working solution (PR #976): Unmerged

This investigation exemplifies the engineering principle: The best architecture in the world is worthless if it never ships.
The race condition remains unfixed, users experience slower forms, and the team waits for infrastructure improvements with no clear timeline. Meanwhile, a working solution sits in PR #976, proven in tests, ready to merge, but rejected for not being architecturally ideal.
Question for consideration: Is it better to have a working "band-aid" solution in production, or an ideal solution blocked indefinitely?