Following the September 2025 incident and discussion of PR #976 (Bruno's late sync solution), the team initiated an independent investigation to explore alternative approaches to the iframe loading race condition. This investigation, tracked as TA-13399, evaluated four distinct approaches ranging from SDK queuing to architectural redesign.
Key Finding: The team identified Approach 4 (Stateless Controller) as the optimal long-term solution due to its architectural simplification benefits. However, this approach is blocked by TA-5920, a years-old infrastructure issue regarding mixed version deployment risk between /apps and /packages folders.
Interim Decision: Until TA-5920 is resolved, the team selected Approach 1 (SDK waits for controller) as the temporary solution, despite acknowledging it degrades UX by forcing users to wait for both controller and input load times before fields become interactive.
Current Status: The investigation moved to "Review" status on November 10, 2025, with PR #1011 demonstrating the stateless controller POC but remaining unmerged due to the blocker.
- Sep 29, 2025 07:52 UTC: TA-13399 created by Luca Allievi
- Sep 29, 2025 07:57 UTC: Moved to "Refinement Ready" status
- Sep 29, 2025 10:15 UTC: Linked as blocker for TA-13380 (CVV-only mode bug)
- Sep 29, 2025 12:33 UTC: Moved to "Dev Ready" status
- Oct 7, 2025: Sprint assignment (T-Wolf O)
- Oct 8, 2025: Sprint reassignment to T-Wolf N
- Oct 22, 2025: Sprint expansion to T-Wolf O
- Oct 27, 2025 10:26 UTC: Assigned to Giordano Arman, moved to "In Progress"
- Nov 5, 2025: Summary updated by Cristiano Betta, sprint expansion to T-Wolf P
- Nov 6, 2025 11:17 UTC: PR #1011 opened (stateless controller POC)
- Nov 10, 2025 09:32 UTC: Moved to "Review" status
The investigation ticket (TA-13399) was opened after team discussions about PR #976, Bruno's late sync solution. The ticket description states:
"A PR was opened with an attempt to fix the issue by implementing a sort of data reconciliation logic but, after some discussion, we weren't completely on board with it. We further discussed potential solutions and we decided to open a new investigation to go deeper."
Team's concerns about PR #976:
- Perceived as "data reconciliation logic" rather than addressing root cause
- Desire to explore more architecturally pure solutions
- Opportunity to "go deeper" into the underlying problem
- Preference for solutions that simplify rather than add complexity
Maintained requirements:
- Clean, logical and easily understandable (and testable)
- Backward-compatible
Created by: Luca Allievi
Created: September 29, 2025, 07:52 UTC
Issue Type: Investigation
Priority: Medium
Labels: Frontend
Objective: "Investigate controller / inputs loading logic for secure fields"
Requirements:
- Clean, logical and easily understandable (and testable)
- Backward-compatible
Expected Outcome:
- Confluence doc with proposals, pros and cons
- POCs for each approach
Blocks: TA-13380 (CVV-only mode complete flag bug)
Assignment History:
- Initially unassigned through refinement and dev ready phases
- Oct 27, 2025: Assigned to Giordano Arman
- Oct 27, 2025: Moved from "Dev Ready" to "In Progress"
- Nov 10, 2025: Moved to "Review" status
Resources Provided:
- Braintree Hosted Fields - Industry reference implementation
- Checkout.com Frames - Alternative industry approach
The team evaluated four main approaches to solve the race condition, documented in the Confluence page:
Description:
- SDK queues field creation until controller signals ready
- `SecureFields.add*Field` methods wait for controller load to complete
- Prevents inputs from attempting to load before controller is ready
- Serializes the loading process: controller first, then inputs
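The queuing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's code: `QueuedFieldLoader`, `addField`, `onControllerReady`, and the field type names are hypothetical stand-ins for the `SecureFields.add*Field` methods and the controller load signal.

```typescript
// Hypothetical sketch of Approach 1: field creation is queued until the
// controller reports that it has loaded. All names are illustrative only.
type FieldType = "number" | "expiryDate" | "securityCode";

class QueuedFieldLoader {
  private controllerReady = false;
  private pending: FieldType[] = [];
  // Records the order in which input iframes would be created.
  readonly created: FieldType[] = [];

  // Stand-in for the SecureFields.add*Field family of methods.
  addField(type: FieldType): void {
    if (this.controllerReady) {
      this.create(type);
    } else {
      this.pending.push(type); // hold until the controller is ready
    }
  }

  // Called once when the controller iframe signals that it has loaded.
  onControllerReady(): void {
    this.controllerReady = true;
    for (const type of this.pending) this.create(type);
    this.pending = [];
  }

  private create(type: FieldType): void {
    // A real SDK would mount an input iframe here.
    this.created.push(type);
  }
}

const loader = new QueuedFieldLoader();
loader.addField("number");
loader.addField("securityCode");
const createdBeforeReady = loader.created.length; // nothing is mounted yet
loader.onControllerReady();
const createdAfterReady = loader.created.slice(); // queued fields, in call order
```

The sketch makes the trade-off visible: no input can start loading before `onControllerReady` runs, which is exactly the serialization that removes the race and adds the wait.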
POC Reference: Described in PR #976 comment, point A
Pros:
- Builds on top of existing architecture
- Guarantees controller loads first (eliminates race condition)
- No complex state synchronization needed
- Straightforward implementation
- Low technical risk
Cons:
- Does not allow user input until controller loaded
- UX degradation: fields not interactive immediately
- Introduces queuing for calls like `setPlaceholder`
- Wait time = controller load + input load (vs input load alone)
- Particularly impacts users with slow connections
From Confluence - Comparative evaluation:
"builds on top of existing architecture"
"does not allow user input until the controller is loaded; introduces queuing for calls like setPlaceholder"
Decision: Selected as interim solution until TA-5920 resolved
Team's rationale for interim selection: From Confluence page:
"Although it introduces complexity and it degrades UX, because the wait time for the user before being able to write into the inputs would be `controller load time + input load time` instead of `input load time` alone."
Description:
- Wait for controller to load and send `ready` event on BroadcastChannel
- Inputs probe/poll until they see `ready` signal, then register
- Methods like `setPlaceholder` queued while waiting using Proxy abstraction
- Uses JavaScript Proxy to intercept calls before ready state
- More sophisticated than Approach 1's simple queue
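A minimal sketch of the Proxy-based buffering idea, for illustration only. The `FieldApi` interface, `bufferUntilReady`, and `markReady` are hypothetical names, not taken from the `poc-queue-fields` branch, and the sketch leaves out the polling and the optional timeout.

```typescript
// Hypothetical sketch of Approach 2's call buffering with a JavaScript Proxy.
// FieldApi and its methods are illustrative, not the real SDK surface.
interface FieldApi {
  setPlaceholder(text: string): void;
  focus(): void;
}

function bufferUntilReady<T extends object>(target: T): { proxy: T; markReady: () => void } {
  let ready = false;
  const queued: Array<() => void> = [];

  const proxy = new Proxy(target, {
    get(obj, prop, receiver) {
      const value = Reflect.get(obj, prop, receiver);
      if (typeof value !== "function") return value;
      // Wrap methods: run immediately once ready, otherwise queue the call.
      return (...args: unknown[]) => {
        if (ready) return value.apply(obj, args);
        queued.push(() => value.apply(obj, args));
        return undefined;
      };
    },
  });

  const markReady = () => {
    ready = true;
    // Replay buffered calls in the order they were made.
    for (const call of queued.splice(0)) call();
  };

  return { proxy, markReady };
}

const calls: string[] = [];
const field: FieldApi = {
  setPlaceholder: (text) => { calls.push(`placeholder:${text}`); },
  focus: () => { calls.push("focus"); },
};
const { proxy: bufferedField, markReady } = bufferUntilReady(field);
bufferedField.setPlaceholder("Card number"); // queued, the ready signal has not arrived
const callsBeforeReady = calls.length;
markReady(); // replays the queued call
bufferedField.focus(); // forwarded immediately
```

Queued calls cannot return a meaningful value to the caller, which is one concrete form of the cognitive overhead listed in the cons below.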
POC Branch: poc-queue-fields
POC Code: GitHub comparison
Pros:
- Builds on top of existing architecture
- Fields can render immediately (better than Approach 1)
- More sophisticated queuing than Approach 1
- Inputs remain responsive in DOM
Cons:
- Introduces complexity with Proxy abstraction
- Queuing calls adds cognitive overhead
- Timeout mechanism on controller load
- More complex than Approach 1
- Proxy pattern may be unfamiliar to maintainers
From Confluence - Comparative evaluation:
"builds on top of existing architecture"
"introduces complexity by queuing calls (abstraction over the input with Proxy); timeout on the controller load (although this is optional, not tied to the architecture of the solution)"
Note from Confluence: The timeout mechanism is described as "optional, not tied to the architecture of the solution," suggesting it is an implementation detail rather than a fundamental requirement.
Decision: Explored but not selected
Description:
- Controller and inputs load independently (no waiting)
- When controller finishes loading after inputs, sends sync request
- Inputs respond with their current state
- Controller updates internal state with latest field values
- Introduces new event protocol: sync request → sync response → sync complete
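The sync request → sync response → sync complete exchange can be sketched with an in-memory bus standing in for BroadcastChannel. The message shapes, class names, and synchronous delivery are assumptions made for illustration; they are not taken from PR #976.

```typescript
// Hypothetical sketch of Approach 3's late sync. Message names and shapes are
// assumptions; an in-memory bus stands in for BroadcastChannel.
type Message =
  | { type: "sync-request" }
  | { type: "sync-response"; field: string; value: string }
  | { type: "sync-complete" };

class Bus {
  private listeners: Array<(msg: Message) => void> = [];
  subscribe(listener: (msg: Message) => void): void {
    this.listeners.push(listener);
  }
  post(msg: Message): void {
    // Synchronous delivery keeps the sketch easy to follow.
    for (const listener of this.listeners.slice()) listener(msg);
  }
}

class InputFrame {
  constructor(bus: Bus, readonly field: string, public value = "") {
    // Reply with the current state whenever the controller asks for a sync.
    bus.subscribe((msg) => {
      if (msg.type === "sync-request") {
        bus.post({ type: "sync-response", field: this.field, value: this.value });
      }
    });
  }
}

class LateController {
  readonly state = new Map<string, string>();
  synced = false;

  constructor(private bus: Bus) {
    bus.subscribe((msg) => {
      if (msg.type === "sync-response") this.state.set(msg.field, msg.value);
      if (msg.type === "sync-complete") this.synced = true;
    });
  }

  // Run once the controller finishes loading, possibly after the inputs.
  requestSync(): void {
    this.bus.post({ type: "sync-request" });
    // With synchronous delivery every response has arrived by now; a real
    // implementation would have to wait for responses or time out.
    this.bus.post({ type: "sync-complete" });
  }
}

const bus = new Bus();
const numberInput = new InputFrame(bus, "number");
const cvvInput = new InputFrame(bus, "securityCode");
numberInput.value = "4111"; // the user types before the controller exists
cvvInput.value = "123";
const lateController = new LateController(bus);
lateController.requestSync();
```

Across real iframes the responses arrive asynchronously, so the controller needs some rule for deciding when the sync is complete; that is where the extra events and the optional timeout listed in the cons come from.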
POC: PR #976
Original Ticket: TA-13099 (closed September 30, 2025)
Pros:
- Builds on top of existing architecture
- No UX delay (fields interactive immediately)
- Fields work correctly regardless of load order
- Proven in testing
- Minimal code changes
- Protocol enhancement rather than structural change
Cons:
- Adds more events back and forth between iframes
- Requires 'sync-complete' mechanism for stable 'ready' event
- Timeout on controller load (although this is optional, not tied to the architecture of the solution)
- Team concern: seen as adding complexity
- Perceived as "band-aid" solution
From Confluence - Comparative evaluation:
"builds on top of existing architecture"
"adds more events back and forth between the iframes; 'sync-complete' to have stable 'ready' event; timeout on the controller load (although this is optional, not tied to the architecture of the solution)"
Team's Perspective: From Confluence "Proposal" section:
"Approaches 2 and 3 introduce significant amount of complexity to the codebase."
Decision: Explored but team preferred Approach 4
Key Difference from Team's Approach:
- Bruno's approach: Pragmatic fix with minimal changes
- Team's approach: Architectural redesign for long-term benefit
Original Name: "Controller removal" (later renamed to "Stateless controller")
Description:
- Controller no longer maintains state as source of truth
- Inputs become the source of truth for validation and state
- Controller becomes orchestrator that gathers input data on-demand
- Simplifies architecture by removing state management from controller
- State still "passes through" controller but doesn't "live" there
- Fundamental shift in responsibility: state ownership moves to inputs
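The shift in state ownership can be illustrated with a short sketch. It is not the code from PR #1011: `StatefulInput`, `StatelessController`, and the validation rules are hypothetical, and the direct method calls stand in for messaging between iframes.

```typescript
// Hypothetical sketch of Approach 4: inputs own their state and validation;
// the controller keeps nothing and gathers values only when submit is called.
interface InputSnapshot {
  field: string;
  value: string;
  valid: boolean;
}

class StatefulInput {
  private value = "";
  constructor(readonly field: string, private validate: (v: string) => boolean) {}

  setValue(value: string): void {
    this.value = value;
  }

  // The input is the source of truth and reports its state on request.
  snapshot(): InputSnapshot {
    return { field: this.field, value: this.value, valid: this.validate(this.value) };
  }
}

class StatelessController {
  constructor(private inputs: StatefulInput[]) {}

  // No cached state: every submit gathers fresh snapshots from the inputs.
  submit(): { ok: boolean; payload: Record<string, string>; invalid: string[] } {
    const snapshots = this.inputs.map((input) => input.snapshot());
    const invalid = snapshots.filter((s) => !s.valid).map((s) => s.field);
    const payload: Record<string, string> = {};
    for (const s of snapshots) payload[s.field] = s.value;
    return { ok: invalid.length === 0, payload, invalid };
  }
}

const cardNumber = new StatefulInput("number", (v) => /^\d{13,19}$/.test(v));
const cvv = new StatefulInput("securityCode", (v) => /^\d{3,4}$/.test(v));
const statelessController = new StatelessController([cardNumber, cvv]);

cardNumber.setValue("4111111111111111");
const early = statelessController.submit(); // cvv is still empty, so not ok
cvv.setValue("123");
const later = statelessController.submit(); // gathers the latest values on demand
```

Because the controller holds no copy of the field values, the order in which the controller and the inputs finish loading cannot leave it with stale state.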
POC: PR #1011
Ticket: TA-13399
Pros:
- Simplifies the architecture significantly
- Removes need to maintain state in controller
- Relies on existing input state management as source of truth
- Eliminates state synchronization issues entirely
- Removes considerable amount of code
- Desirable architectural improvement regardless of race condition
- Long-term maintainability benefits
Cons:
- Lower level updates needed across codebase
- Needs more comprehensive testing
- Structural change (not just protocol enhancement)
- Higher implementation risk
- Touches both `/apps` and `/packages` (deployment complexity)
- Requires careful rollout strategy
From Confluence - Comparative evaluation:
"simplifies the architecture"
"lower level updates, so it needs more testing"
From Confluence - Full rationale:
"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal. It adds code but it removes a considerable amount of code too."
Decision: Best approach comparatively, but BLOCKED by TA-5920
Why it's better: From Confluence:
"Approach 4 is better comparatively"
The team concluded this approach:
- Simplifies architecture (removes state management burden)
- Desirable regardless of race condition (pays down technical debt)
- Removes considerable code (net benefit despite adding some code)
- Relies on proven input state management
| Approach | Description | Pros | Cons | Complexity | UX Impact | Status |
|---|---|---|---|---|---|---|
| 1: SDK waits | Queue fields until controller ready | Builds on existing, guarantees order | Blocks user input, queuing overhead | Medium | High (negative) | Interim solution |
| 2: Input buffering | Inputs probe for ready, use Proxy | Builds on existing, fields render immediately | Proxy complexity, timeout mechanism | High | Medium | Evaluated |
| 3: Late sync (Bruno) | Sync after independent load | Builds on existing, no UX impact | More events, sync-complete mechanism | Medium | None | Evaluated |
| 4: Stateless controller | Remove state from controller | Simplifies architecture significantly | Lower level changes, more testing | High | None | Recommended (blocked) |
From Confluence "Proposal" section:
Approach 1 impact:
"Approach 1 would impact the UX of users with low speed internet connection (eg. african donors of Wikimedia) or simply users who use the widget while the iframe servers are having a bad day."
Approaches 2 and 3 complexity:
"Approaches 2 and 3 introduce significant amount of complexity to the codebase."
Approach 4 benefits:
"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal."
The Confluence page documents two additional issues discovered during investigation:
1. READY Event Documentation Inaccuracy:
From Confluence "Notes" section:
"the docs say that the READY event is dispatched to the merchant when widget is ready to be used, but in fact it's dispatched when the controller is loaded (inputs could be still loading)"
Impact:
- Misleading for merchants trying to gate submit button
- Merchants may enable submit before all inputs are ready
- Documentation doesn't match implementation
- Potential for user confusion and bugs
2. No Submit Guardrails:
From Confluence "Notes" section:
"none of the approaches implement some mechanism that handles the case when the submit call happens before the resources have loaded"
Impact:
- User could submit while iframes still loading
- Silent failure possible
- Poor user experience
- No feedback mechanism
Important note from Confluence:
"Any solution to these note points is interchangeable with any of the approaches outlined above."
This means these issues can be addressed independently of which approach is chosen for the race condition.
The Confluence page presents a detailed rationale for why Approach 4 is the best choice:
Why Approach 1 is inadequate:
"Approach 1 would impact the UX of users with low speed internet connection (eg. african donors of Wikimedia) or simply users who use the widget while the iframe servers are having a bad day."
Specific user impact:
- African donors of Wikimedia (example of global reach)
- Any user during server issues
- Mobile users on poor networks
- Users in regions with limited infrastructure
Why Approaches 2 and 3 add complexity:
"Approaches 2 and 3 introduce significant amount of complexity to the codebase."
Team's perspective:
- More moving parts to maintain
- More potential failure modes
- Cognitive overhead for developers
- Additional events and protocols to understand
Why Approach 4 simplifies:
"Approach 4 (stateless controller) removes the need to maintain state in the controller (it still passes through, but it does not 'live' there), relies on the existing state management logic of the inputs as source of truth and generally simplifies the architecture, which is desirable regardless of the investigation's goal. It adds code but it removes a considerable amount of code too."
Key benefits:
- State ownership moves to inputs (single source of truth)
- Controller becomes simpler orchestrator
- Net code reduction despite adding some code
- Architectural improvement independent of race condition fix
- Aligns with best practices (state close to where it's used)
Current Architecture (with race condition):
SDK
↓
Controller iframe (maintains state)
↓ (BroadcastChannel)
Input iframes (report to controller)
↓
Controller aggregates state → FORM_CHANGE event to SDK
Problems with current:
- Race condition when inputs load before controller
- Controller must maintain synchronized state
- Complex state management logic in controller
- State can become out of sync
Stateless Architecture (no race condition):
SDK
↓
Controller iframe (stateless orchestrator)
↓ (BroadcastChannel)
Input iframes (maintain own state - source of truth)
↓
On submit: Controller gathers from inputs on-demand
Benefits of stateless:
- No race condition (no state to synchronize)
- Inputs already manage their own state
- Controller simply orchestrates on-demand
- State lives where it's used
From Confluence "Proposal" section:
The team recommended implementing Approach 4 along with solutions to the discovered issues:
1. Reliable READY Event:
From Confluence:
"dispatch READY only when all iframes have loaded (not just the controller), so if some merchants rely on this event to be able to understand if the widget is submittable, we are not breaking their code"
Implementation:
- Wait for controller AND all input iframes
- Only dispatch when entire widget is ready
- Matches documented behavior
- Prevents merchant code breakage
Benefit:
- Merchants can reliably gate submit button
- Documentation matches implementation
- Better developer experience
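One way to aggregate load signals so that READY is dispatched only once everything has loaded is sketched below. `ReadyAggregator`, `markLoaded`, and the frame identifiers are hypothetical; the source does not specify how load signals are collected.

```typescript
// Hypothetical sketch: dispatch READY once the controller and every expected
// input have reported that they loaded. Frame ids are illustrative.
class ReadyAggregator {
  private loaded = new Set<string>();
  private dispatched = false;

  constructor(private expected: string[], private onReady: () => void) {}

  markLoaded(frameId: string): void {
    if (this.expected.indexOf(frameId) === -1) return; // ignore unknown frames
    this.loaded.add(frameId);
    if (!this.dispatched && this.loaded.size === this.expected.length) {
      this.dispatched = true; // READY fires exactly once
      this.onReady();
    }
  }
}

let readyCount = 0;
const aggregator = new ReadyAggregator(["controller", "number", "securityCode"], () => {
  readyCount += 1;
});
aggregator.markLoaded("number"); // inputs may load before the controller
aggregator.markLoaded("controller");
const readyBeforeAll = readyCount; // one input is still missing
aggregator.markLoaded("securityCode");
aggregator.markLoaded("securityCode"); // duplicate signals do not re-dispatch
```

The aggregator is independent of load order, which matches the Confluence note that this fix is interchangeable with any of the four approaches.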
2. Submit Feedback Mechanism:
From Confluence:
"if the user tries to submit when READY has not been dispatched yet, we can show some feedback that goes like 'Cannot process right now, try again later'"
Covers two cases:
a. User submits while resources still loading:
- Provide clear feedback
- Prevent silent failure
- Guide user to wait
b. Submit after one or more iframes failed to load:
- Handle error gracefully
- Inform user of issue
- Prevent confusion
Additional benefit from Confluence:
"This approach would also cover the case in which upon the submit click occurrence we need to open a popup/window."
This handles scenarios where submit triggers a popup (e.g., 3DS authentication) that the browser might block if it is not opened by a user-initiated action.
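The two submit cases can be sketched as a small guard. The `WidgetState` values, `guardedSubmit`, and the load-failure message are assumptions for illustration; only the first feedback string comes from the Confluence page.

```typescript
// Hypothetical sketch of the submit guardrail. State names and the
// load-failure message are assumptions; the first message is the one
// suggested on the Confluence page.
type WidgetState = "loading" | "ready" | "load-failed";

type SubmitResult =
  | { submitted: true }
  | { submitted: false; feedback: string };

function guardedSubmit(state: WidgetState, doSubmit: () => void): SubmitResult {
  if (state === "ready") {
    doSubmit();
    return { submitted: true };
  }
  const feedback =
    state === "loading"
      ? "Cannot process right now, try again later" // case a: READY not dispatched yet
      : "The payment form failed to load, please reload the page"; // case b: an iframe failed
  return { submitted: false, feedback };
}

let submitCalls = 0;
const whileLoading = guardedSubmit("loading", () => { submitCalls += 1; });
const afterFailure = guardedSubmit("load-failed", () => { submitCalls += 1; });
const afterReady = guardedSubmit("ready", () => { submitCalls += 1; });
```

Returning a result instead of failing silently gives the caller something to render, which addresses the "no feedback mechanism" gap noted earlier.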
From Confluence "Review outcome":
"After these implementations the merchants should be informed about relying on the READY event to display the submit button on their UIs."
Communication requirements:
- Inform merchants of READY event reliability improvements
- Recommend gating submit button on READY event
- Update documentation to reflect new behavior
- Provide migration guide if needed
Created by: Giordano Arman (GiordanoArman)
Created: November 6, 2025, 11:17 UTC
Status: Open (as of November 10, 2025)
Title: "task(investigation): controller / inputs loading logic - TA-13399"
URL: https://github.com/gr4vy/secure-fields/pull/1011
From PR description:
"This approach fixes the underlying iframe load race condition issue by making the existing controller stateless. In short, the state input handling and validation would happen within the inputs, the controller becomes simply an element that gathers all the input and issues a call towards the API on the fly, upon submit.
In this PR there's code and pseudo-code that aims to explain how this approach would result in code changes. I've added comment throughout the code, if you want to pull changes locally you can find the comments easily by searching for 'TA-13399' in your IDE."
Before (Race Condition Architecture): [architecture diagram image]
After (Stateless Architecture): [detailed architecture diagram image]
From PR description:
- Mix of real code and pseudo-code
- Comments marked with "TA-13399" for easy searching
- Exploratory PR, not production-ready
- Demonstrates architectural direction
- Shows code changes needed
Code search hint:
"if you want to pull changes locally you can find the comments easily by searching for 'TA-13399' in your IDE"
From PR description, the changes involve:
- State/input handling moves to inputs
- Validation happens within inputs
- Controller becomes gatherer and API caller
- On-demand data collection on submit
As of November 6, 2025:
- No review comments yet
- CI status checks:
- ❌ tests (20.x): FAILURE
- ✅ scan (20.x): SUCCESS
- ⏭️ release: SKIPPED
- ✅ secure-fields (gr4vy-admin): SUCCESS
PR checklist (not completed):
- Code follows style guidelines
- Self-review performed
- `yarn lint` passed
- `yarn test` passed
- Latest changes from `main` pulled
- Tested React and CDN versions
- Labels added for release
Test failures: The PR shows test failures, expected for an exploratory/investigation PR with pseudo-code mixed in.
From PR description:
"Part of https://gr4vy.atlassian.net/browse/TA-13399."
From Confluence "Review outcome" section:
"Since Approach 4 (stateless controller) involves changing files in `/apps` and `/packages`, and the contents of those two folders get deployed in different ways, there's a risk that the user might end up with the resources from `/app` from a version of `secure-fields` and the resources from `/packages` from a different one, breaking the widget."
Root cause:
- `/apps` folder: Contains controller iframe code
- `/packages` folder: Contains SDK code
- Different deployment mechanisms for each folder
- No guaranteed synchronization between deployments
- Users may load mixed versions
Example failure scenario:
User's browser loads:
- /apps/controller.js from version 1.2.3 (old, stateful controller)
- /packages/sdk.js from version 1.2.4 (new, expects stateless controller)
Result: Widget completely broken
- SDK expects stateless protocol
- Controller uses stateful protocol
- Communication breakdown
- Silent failures or errors
Another failure scenario:
User's browser loads:
- /apps/controller.js from version 1.2.4 (new, stateless)
- /packages/sdk.js from version 1.2.3 (old, expects stateful)
Result: Widget broken
- SDK sends commands expecting state in controller
- Controller doesn't maintain state
- Data loss, validation failures
From Confluence "Review outcome":
"Therefore implementing Approach 4 now would be quite complex, as we would need to first execute a release that comprises the current approach mixed with Approach 4, and then executing another release that removes the old approach."
Implementation complexity:
Step 1: Dual-mode release
- Support BOTH stateful and stateless modes
- Controller detects SDK version
- SDK detects controller version
- Version negotiation protocol
- Maintain two code paths
Step 2: Second release
- Remove old stateful code
- Only after ensuring all users upgraded
- No clear timeline for "safe" removal
- Risk of breaking users who cached old versions
Challenges:
- Careful coordination required
- High risk of breaking production
- Difficult to test all version combinations
- Rollback complexity
- Extended timeline (two releases minimum)
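To make the dual-mode release concrete, here is a minimal sketch of how the two sides might agree on a protocol. Everything in it is an assumption: the source names version negotiation as a requirement but does not define a handshake, so `Handshake`, `negotiateMode`, and the mode names are hypothetical, and version `1.3.0` is invented for the example.

```typescript
// Hypothetical sketch of dual-mode negotiation for the transitional release.
// The handshake shape and mode names are assumptions, not an existing protocol.
type ProtocolMode = "stateful" | "stateless";

interface Handshake {
  version: string;
  supports: ProtocolMode[];
}

function negotiateMode(sdk: Handshake, controller: Handshake | undefined): ProtocolMode | null {
  // A controller that predates the handshake sends nothing; treat it as stateful-only.
  const controllerModes: ProtocolMode[] = controller ? controller.supports : ["stateful"];
  const common = sdk.supports.filter((mode) => controllerModes.indexOf(mode) !== -1);
  // Prefer the new protocol when both sides understand it.
  if (common.indexOf("stateless") !== -1) return "stateless";
  if (common.indexOf("stateful") !== -1) return "stateful";
  return null; // no shared protocol: the widget cannot work
}

const dualSdk: Handshake = { version: "1.2.4", supports: ["stateful", "stateless"] };
const dualController: Handshake = { version: "1.2.4", supports: ["stateful", "stateless"] };
const statelessOnlySdk: Handshake = { version: "1.3.0", supports: ["stateless"] };

const mixedVersions = negotiateMode(dualSdk, undefined); // old controller: falls back to stateful
const matchedVersions = negotiateMode(dualSdk, dualController); // both new: stateless
const tooEarlyRemoval = negotiateMode(statelessOnlySdk, undefined); // second release shipped too soon
```

The last case shows why the second release has no clear safe date: once the stateful path is removed, any user still served an old controller has no shared protocol left.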
Prerequisite: TA-5920 - "Solving this mixed versions risk"
The blocker ticket:
- Ticket: TA-5920
- Issue: Mixed versions risk between `/apps` and `/packages`
- Status: Unresolved
- Age: Created years ago
- Timeline: No timeline for resolution
- Impact: Blocks multiple initiatives, not just this one
Fundamental issue:
- Deployment architecture problem
- Requires infrastructure changes
- May need CDN/caching strategy changes
- May need versioned API endpoints
- Organizational priority question
From Confluence "Review outcome":
"Until we fix the mixed versions risk, we should opt for Approach 1 (SDK waits for controller before adding inputs), although it introduces complexity and it degrades UX, because the wait time for the user before being able to write into the inputs would be `controller load time + input load time` instead of `input load time` alone."
Why Approach 1 despite drawbacks:
- Safety over perfection:
  - Safer than risking mixed versions
  - Lower technical risk than Approach 4
  - Can deploy immediately without infrastructure changes
- Deployability:
  - No architectural changes needed
  - Works with current deployment system
  - Single release, no coordination needed
- Simplicity:
  - Builds on existing architecture
  - No version negotiation needed
  - Clear rollback path
- Temporary:
  - Explicitly interim solution
  - Can replace with Approach 4 when blocker resolved
  - Not intended as permanent fix
Wait time impact:
Before (ideal):
User waits: Input load time only (~500ms-2s)
Controller loads independently in background
With Approach 1:
User waits: Controller load time + Input load time
Total: ~1s-4s or more depending on connection
Impact calculation:
- Controller load: ~500ms-2s (depending on connection)
- Input load: ~500ms-2s (depending on connection)
- Total wait: 1s-4s+ (serialized, not parallel)
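The arithmetic above can be written out as a small worked example. The load times are the rough estimates from this section, not measurements, and the function names are illustrative.

```typescript
// Worked arithmetic for the estimates above (all times in milliseconds).
interface LoadTimes {
  controllerMs: number;
  inputMs: number;
}

// Current behavior: iframes load in parallel, so the user waits only for the input.
function waitParallel(t: LoadTimes): number {
  return t.inputMs;
}

// Approach 1: the input starts loading only after the controller has loaded.
function waitSerialized(t: LoadTimes): number {
  return t.controllerMs + t.inputMs;
}

const fastConnection: LoadTimes = { controllerMs: 500, inputMs: 500 };
const slowConnection: LoadTimes = { controllerMs: 2000, inputMs: 2000 };

const fastCase = { before: waitParallel(fastConnection), after: waitSerialized(fastConnection) };
const slowCase = { before: waitParallel(slowConnection), after: waitSerialized(slowConnection) };
// fastCase: 500 ms before, 1000 ms with Approach 1
// slowCase: 2000 ms before, 4000 ms with Approach 1
```

When the two load times are similar, serializing them roughly doubles the wait before a field becomes interactive.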
Affected users:
- Users with slow connections:
  - Example from Confluence: "African donors of Wikimedia"
  - Mobile users on 3G/4G
  - Rural areas with limited infrastructure
  - Developing countries
- Users during server issues:
  - "iframe servers are having a bad day" (from Confluence)
  - CDN issues
  - Network congestion
  - Geographic routing problems
- All users to some degree:
  - Even fast connections see noticeable delay
  - Perception of "slow" form
  - May impact conversion rates
Why acceptable as interim:
- Temporary solution (not permanent)
- Better than broken widget from mixed versions
- Can be replaced when TA-5920 resolved
- Maintains backward compatibility
What's NOT mitigated:
- UX degradation is real and measurable
- No way to eliminate the wait
- Users will experience slower forms
- Potential business impact (conversion, satisfaction)
Monitoring recommendations:
- Track load times in production
- Monitor user feedback
- Measure impact on form completion rates
- A/B testing if possible
From Confluence "Proposal" and "Review outcome" sections:
Step 1: Implement Approach 4 (Stateless Controller)
- Follow PR #1011 architectural design
- Complete real implementation (beyond POC)
- Comprehensive testing required:
- Unit tests for all components
- Integration tests for iframe communication
- E2E tests covering all scenarios
- Performance benchmarks
- Load testing
- Gradual rollout recommended:
- Internal testing first
- Beta merchants
- Phased production rollout
- Monitor error rates and performance
Step 2: Implement Point 1 (Reliable READY Event)
From Confluence:
"dispatch READY only when all iframes have loaded (not just the controller), so if some merchants rely on this event to be able to understand if the widget is submittable, we are not breaking their code"
Implementation requirements:
- Wait for controller load
- Wait for ALL input iframes to load
- Aggregate ready signals
- Dispatch READY only when complete widget ready
- Update event dispatch logic
- Test with various load orders
Benefit:
- Matches documented behavior
- Merchants can trust READY for submit button gating
- Prevents breaking existing merchant code
- Better developer experience
Step 3: Implement Point 2 (Submit Feedback)
From Confluence:
"if the user tries to submit when READY has not been dispatched yet, we can show some feedback that goes like 'Cannot process right now, try again later'"
Implementation requirements:
Case a: Submit while resources loading
- Detect submit before READY dispatched
- Show user-friendly message
- Suggested: "Cannot process right now, try again later"
- Prevent silent failure
Case b: Submit after iframe load failure
- Detect when one or more iframes failed to load
- Show error message
- Provide recovery options
- Log error for debugging
Additional benefit: Also covers popup/window scenarios (e.g., 3DS) where a user-initiated action is required
Step 4: Merchant Communication
From Confluence:
"After these implementations the merchants should be informed about relying on the READY event to display the submit button on their UIs."
Communication plan:
- Documentation update:
  - Update event documentation
  - Add READY event reliability guarantees
  - Provide code examples
  - Show submit button gating pattern
- Migration guide:
  - How to update integration
  - Best practices for submit button control
  - Testing recommendations
  - Backward compatibility notes
- Announcement:
  - Email to merchant developers
  - Update release notes
  - Highlight reliability improvements
  - Encourage adoption of READY-based gating
- Support preparation:
  - Train support team on changes
  - Prepare FAQ
  - Monitor for merchant questions
  - Proactive outreach to key merchants
Rollout plan:
Phase 1: Internal Testing
- Deploy to internal test environments
- Team validation
- Load testing
- Security review
Phase 2: Beta Testing
- Select beta merchant cohorts
- Close monitoring
- Gather feedback
- Iterate on issues
Phase 3: Gradual Rollout
- 10% of traffic
- Monitor metrics closely
- 25% → 50% → 100% if metrics good
- Rollback plan ready at each stage
Metrics to track:
- Load times: Controller, inputs, total
- Error rates: All iframe operations
- Success rates: Form submissions
- READY event dispatch timing
- User-facing errors
- Merchant feedback
Success criteria:
- No increase in error rates
- Load times acceptable
- Zero critical bugs
- Positive merchant feedback
- Performance targets met
Rollback triggers:
- Error rate spike (>1% increase)
- Load time regression (>20% slower)
- Critical bugs discovered
- Merchant complaints
- Performance degradation
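The two numeric triggers can be expressed as a simple check. This sketch makes two assumptions that the list above leaves open: ">1% increase" is read as more than one percentage point over the baseline error rate, and ">20% slower" is read as relative to the baseline load time. The `Metrics` shape and `rollbackReasons` are hypothetical.

```typescript
// Hypothetical sketch of the numeric rollback triggers.
interface Metrics {
  errorRate: number; // fraction of failed operations, 0..1
  loadTimeMs: number; // total ready time in milliseconds
}

function rollbackReasons(baseline: Metrics, current: Metrics): string[] {
  const reasons: string[] = [];
  // Assumption: ">1% increase" means more than one percentage point.
  if (current.errorRate - baseline.errorRate > 0.01) reasons.push("error-rate");
  // Assumption: ">20% slower" is measured relative to the baseline.
  if (current.loadTimeMs > baseline.loadTimeMs * 1.2) reasons.push("load-time");
  return reasons;
}

const baselineMetrics: Metrics = { errorRate: 0.02, loadTimeMs: 1000 };
const healthy = rollbackReasons(baselineMetrics, { errorRate: 0.025, loadTimeMs: 1100 });
const degraded = rollbackReasons(baselineMetrics, { errorRate: 0.035, loadTimeMs: 1250 });
```

The remaining triggers (critical bugs, merchant complaints) are judgment calls and are not modeled here.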
Status: Blocked by TA-13399
Link: TA-13380
Issue:
- `complete` flag returns false even when CVV-only form is valid
- Related to same state management issues
- Controller state doesn't accurately reflect CVV-only validity
- Needs resolution of race condition first
Impact:
- Merchants can't reliably gate submit button in CVV-only mode
- False negative on form completeness
- User experience degradation
- Workarounds needed in merchant implementations
Why it's blocked:
- Root cause is controller state management
- Approach 4 (stateless) would resolve this
- Approach 1 (interim) may not fully resolve
- Must wait for proper solution
Link from TA-13399: Created September 29, 2025 by Luca Allievi:
"This work item blocks TA-13380"
Status: Closed September 30, 2025
Link: TA-13099
Context:
- Bruno's original investigation ticket
- Resulted in PR #976 (late sync solution)
- Closed when team decided to open TA-13399
- Represents alternative approach path
Why closed: From TA-13399 description:
"we weren't completely on board with it. We further discussed potential solutions and we decided to open a new investigation to go deeper."
PR #996, #997: Click to Pay (merged Oct/Nov 2025)
- Not directly related to race condition
- Separate features that landed during investigation period
- Shows parallel development continuing
Industry References:
Braintree Hosted Fields:
- Link: https://github.com/braintree/braintree-web/tree/main/src/hosted-fields
- Referenced in TA-13399 description
- Industry standard implementation
- Used for comparison and ideas
Checkout.com Frames:
- Link: https://www.checkout.com/docs/developer-resources/sdks/frames-sdks/frames-reference
- Referenced in TA-13399 description
- Alternative industry approach
- Different architectural patterns
Page ID: 1676607506
Title: "Secure Fields race condition solution"
Space: GB (Gr4vy Build)
Current Version: 14
URL: https://gr4vy.atlassian.net/wiki/spaces/GB/pages/1676607506
Version History:
- Multiple iterations through version 14
- Team discussion and refinement
- Evolved from 3 approaches to 4 approaches
- Renamed "Controller removal" to "Stateless controller"
Key Changes:
- Added comparative evaluation
- Added "Notes" section with discovered issues
- Expanded "Proposal" with detailed rationale
- Added "Review outcome" with TA-5920 blocker
Sections:
- Goal: Investigation objectives
- Approaches: Four approaches with pros/cons
- Notes: Additional issues discovered
- Proposal: Recommendation for Approach 4
- Review outcome: TA-5920 blocker and interim decision
Note: Specific inline comments on highlighted text may not be available via API, but the page shows evidence of team collaboration:
- Version 14 indicates multiple rounds of editing
- Multiple author contributions (visible in changelog)
- Refined over October-November period
Key contributors:
- Luca Allievi (creator)
- Giordano Arman (implementer)
- Cristiano Betta (summary updates)
- Paulo Ferrarini (sprint management)
- Gary Evans (sprint management)
For Stateless Approach (Approach 4):
Unit tests:
- Input state management in isolation
- Controller orchestration logic
- Data gathering on-demand
- Validation in inputs
- Error handling
Integration tests:
- Controller-input communication
- BroadcastChannel messaging
- State consistency across inputs
- Form validation aggregation
- Submit flow end-to-end
E2E tests:
- Delayed controller load scenarios
- Delayed input load scenarios
- Failed controller load cases
- Failed input load cases
- Mixed load order combinations
- Network throttling scenarios
- Slow connection simulation
- Server error conditions
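For the mixed load order combinations, one possible way to build the test matrix is to enumerate every order in which the frames can finish loading. This is a generic sketch, not part of the existing test suite; the frame names are illustrative.

```typescript
// Enumerate every load order of a set of frames for use as E2E test cases.
function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items.slice()];
  const result: T[][] = [];
  items.forEach((item, index) => {
    // All items except the one chosen to come first.
    const rest = items.slice(0, index).concat(items.slice(index + 1));
    for (const tail of permutations(rest)) result.push([item].concat(tail));
  });
  return result;
}

const frames = ["controller", "number", "securityCode"];
const loadOrders = permutations(frames); // 3! = 6 orders
// The orders that reproduce the original race: the controller loads last.
const controllerLast = loadOrders.filter((order) => order[order.length - 1] === "controller");
```

The number of orders grows factorially with the number of frames, so a real suite would likely sample orders or focus on the cases where the controller loads after at least one input.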
Load time metrics:
- Controller load time tracking
- Input load time tracking
- Total ready time tracking
- Comparison with current architecture
Memory tests:
- State in multiple inputs (memory implications)
- Memory leak detection
- Long-running form sessions
- Multiple form instances
State coordination tests:
- Multiple inputs with same data
- State updates across inputs
- Validation synchronization
- Error state propagation
Error handling:
- Partial load failures
- Timeout scenarios
- Network errors
- Recovery mechanisms
For Interim Approach 1 (SDK Waits):
Queue tests:
- Queue overflow handling
- Queue order preservation
- Queue timeout mechanisms
- Queue memory management
Timing tests:
- Total wait time measurement
- Performance impact quantification
- User experience metrics
- Comparison benchmarks
Error feedback:
- Timeout messages to users
- Clear error communication
- Recovery options
- Retry mechanisms
| Aspect | Approach 1 (Interim) | Approach 4 (Desired) |
|---|---|---|
| Technical Risk | Low | High |
| UX Impact | Negative (delay) | None |
| Code Complexity | Medium (queuing) | High (architectural) |
| Deployment Risk | Low | High (TA-5920) |
| Testing Effort | Medium | High |
| Rollback Complexity | Easy | Difficult |
| Maintenance | Short-term | Long-term benefit |
| Performance | Worse (serial loading) | Better (simpler) |
| Scalability | Same | Better |
| Developer Experience | Same | Better |
| Documentation | Minimal changes | Significant updates |
Approach 1 (Interim):
Latency:
- Added latency: controller + input load time
- Serial loading instead of parallel
- User-visible delay before interaction
- Potential for timeout scenarios
Memory:
- Queue overhead
- Queued messages memory
- Queue cleanup needed
Risk scenarios:
- Queue overflow if many operations queued
- Memory leak if queue not properly cleaned
- Timeout handling complexity
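The latency cost of serial loading can be made concrete with simple arithmetic. The sketch below assumes fully serial loading for Approach 1 and fully parallel loading otherwise; the function names and any numbers fed into them are illustrative, not measurements.

```typescript
// Illustrative timing model only; names and inputs are assumptions.
type LoadTimes = { controllerMs: number; inputMs: number };

// Serial (Approach 1 as described above): the input is usable only after the
// controller and then the input have loaded, so the two delays add up.
function serialReadyMs(t: LoadTimes): number {
  return t.controllerMs + t.inputMs;
}

// Parallel: both frames load at the same time, so the slower one dominates.
function parallelReadyMs(t: LoadTimes): number {
  return Math.max(t.controllerMs, t.inputMs);
}

// Extra wait introduced by serializing; equals the faster of the two loads.
function addedLatencyMs(t: LoadTimes): number {
  return serialReadyMs(t) - parallelReadyMs(t);
}
```

For example, with a hypothetical 800 ms controller load and 500 ms input load, the serial model gives 1300 ms against 800 ms in parallel, an added 500 ms before the field is usable.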
Approach 4 (Stateless):
Latency:
- On-demand gathering at submit time
- Slight delay on submit (gather from inputs)
- Overall better due to simpler architecture
- Parallel loading maintained
Memory:
- State distributed across inputs
- Each input manages own state
- Potential for higher memory if many inputs
- Need to monitor in production
Performance gains:
- Reduced controller complexity
- Less state synchronization overhead
- Fewer messages between iframes
- Simpler code paths (faster execution)
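The stateless idea (inputs own their state, the controller gathers it on demand at submit time) can be sketched as follows. This is not the code in PR #1011: `FieldInput`, `createInput`, and `gatherForSubmit` are hypothetical names, and the synchronous `snapshot()` call stands in for what would be a request/response exchange between iframes.

```typescript
// Hypothetical shapes; a synchronous stand-in for cross-iframe messaging.
type FieldSnapshot = { name: string; value: string; valid: boolean };

interface FieldInput {
  snapshot(): FieldSnapshot;
}

// Each input owns its value and validation; nothing is mirrored elsewhere.
function createInput(name: string, validate: (value: string) => boolean) {
  let value = "";
  return {
    setValue(next: string): void {
      value = next;
    },
    snapshot(): FieldSnapshot {
      return { name, value, valid: validate(value) };
    },
  };
}

type SubmitResult =
  | { ok: true; data: Record<string, string> }
  | { ok: false; invalid: string[] };

// The controller keeps no field state: it asks every input at submit time
// and aggregates validation before building the payload.
function gatherForSubmit(inputs: FieldInput[]): SubmitResult {
  const snapshots = inputs.map((input) => input.snapshot());
  const invalid = snapshots.filter((s) => !s.valid).map((s) => s.name);
  if (invalid.length > 0) {
    return { ok: false, invalid };
  }
  const data: Record<string, string> = {};
  for (const s of snapshots) {
    data[s.name] = s.value;
  }
  return { ok: true, data };
}
```

Because the controller reads values only at submit time, the order in which the controller and inputs finish loading no longer affects the collected data, which is the property that removes the race in this model.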
Intersection between investigations:
Bruno's Option A → Team's Approach 1
- Both: SDK queues until controller ready
- Bruno: Described as option in comment
- Team: Selected as interim solution
Bruno's Option B → Team's Approach 2
- Both: Input buffering with readiness checks
- Bruno: Considered as alternative
- Team: Evaluated with Proxy approach
Bruno's Option C → Team's Approach 3
- Both: Late sync with state reconciliation
- Bruno: Chosen and implemented in PR #976
- Team: Evaluated but not recommended
Bruno's controller-less idea → Team's Approach 4
- Both: Remove controller or make stateless
- Bruno: Mentioned as theoretical option
- Team: Fully explored with POC in PR #1011
- Different implementation details
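For contrast with the stateless model, the late sync idea behind Option C / Approach 3 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the protocol implemented in PR #976: `ControllerState`, `InputFrame`, and `attach` are hypothetical, and direct method calls stand in for the sync events exchanged between iframes.

```typescript
// Hypothetical model of late sync; not the events or code from PR #976.
type FieldState = { name: string; value: string };

// The controller still mirrors field state, as in the current architecture.
class ControllerState {
  private fields = new Map<string, string>();

  apply(state: FieldState): void {
    this.fields.set(state.name, state.value);
  }

  read(name: string): string | undefined {
    return this.fields.get(name);
  }
}

class InputFrame {
  readonly name: string;
  private value = "";
  private controller: ControllerState | null = null;

  constructor(name: string) {
    this.name = name;
  }

  // User typing: forwarded only if a controller is already attached.
  type(value: string): void {
    this.value = value;
    if (this.controller) {
      this.controller.apply({ name: this.name, value });
    }
  }

  // Late sync: when the controller finishes loading after the input,
  // replay the current value once so both sides agree again.
  attach(controller: ControllerState): void {
    this.controller = controller;
    controller.apply({ name: this.name, value: this.value });
  }
}
```

The replay in `attach` is the reconciliation step: it is extra protocol surface compared with the stateless model, but it leaves the existing state ownership untouched.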
Bruno's Approach (PR #976):
Philosophy:
- Pragmatic: Fix the problem with minimal changes
- Proven: Tested and working solution
- Deployable: Ready to merge and release
- Low-risk: Protocol enhancement, not structural change
Characteristics:
- Backward compatible: Zero breaking changes
- Minimal code: Small, focused changes
- Protocol enhancement: New events for sync
- Quick to production: Can deploy immediately
Trade-offs accepted:
- Adds events (which the team calls "complexity")
- Not architecturally "pure"
- Seen as a "band-aid" by some
Philosophy statement (implied):
- Ship working solution now
- Iterate later if needed
- Value delivery over perfection
- Pragmatism over purity
Team's Approach (Stateless):
Philosophy:
- Architectural: Fix the underlying design
- Pure: Remove state management complexity
- Long-term: Pay down technical debt
- Proper: "Clean, logical and easily understandable"
Characteristics:
- Backward compatible: Required
- Significant changes: Structural redesign
- Architectural shift: State ownership changes
- Blocked: Can't deploy until TA-5920 is resolved
Trade-offs accepted:
- Can't deploy immediately
- Higher risk and testing needs
- Blocked by infrastructure issue
- May wait indefinitely
Philosophy statement (from Confluence):
"desirable regardless of the investigation's goal"
This shows a desire for architectural improvement that goes beyond the race condition fix.
Speed vs Architecture:
Bruno:
- Speed to production
- Working solution in hand
- Band-aid that works
- Incremental improvement
Team:
- Architectural purity
- Proper long-term solution
- Foundation for future
- Comprehensive redesign
Risk vs Benefit:
Bruno:
- Low risk: Proven in tests
- Immediate benefit: Fix production issue
- Incremental: Small, focused change
- Reversible: Easy rollback
Team:
- High risk: Needs extensive testing
- Long-term benefit: Better architecture
- Comprehensive: Large structural change
- Complex rollback: Difficult to reverse
Deployment:
Bruno:
- Can deploy today
- No blockers
- Single release
- Immediate resolution
Team:
- Blocked by TA-5920
- No timeline
- Complex multi-release
- Indefinite wait
This investigation reveals a classic engineering tension:
Pragmatism vs Idealism:
- Bruno: "Make it work, ship it"
- Team: "Do it right, even if it takes time"
Short-term vs Long-term:
- Bruno: Fix the urgent problem now
- Team: Build the proper foundation
Incremental vs Revolutionary:
- Bruno: Small, focused improvements
- Team: Comprehensive redesign
Delivery vs Architecture:
- Bruno: Value delivery to users/merchants
- Team: Value architectural cleanliness
Choosing Approach 1 (interim) over PR #976:
Implications:
- Race condition remains in production longer
- Users experience degraded UX (Approach 1's delay)
- More time spent on interim solution
- PR #976 work effectively discarded
- Waiting for TA-5920 with no timeline
Alternative if PR #976 chosen:
- Race condition fixed immediately
- No UX degradation
- Can iterate to Approach 4 later
- Users protected now
- Business value delivered
Team's rationale for waiting:
- Architectural purity worth the wait
- Interim solution acceptable
- Proper fix is Approach 4
- Don't want "band-aid" solutions
1. TA-5920 Timeline:
- When will the mixed-versions risk be resolved?
- Is there active work on this blocker?
- What's the priority of this infrastructure work?
- Should we re-prioritize this ticket given blocking impact?
- Who owns this ticket?
- What's the estimated timeline (months? years?)?
2. Priority Decision:
- Should we deploy PR #976 (Bruno's solution) as the interim solution instead of Approach 1?
- Is architectural purity worth an indefinite wait?
- What's the cost of UX degradation in Approach 1?
- What's the business impact of continued race condition exposure?
- Have we measured conversion impact?
- What do merchants prefer: working solution now vs perfect solution later?
3. UX Trade-off:
- Have we measured impact of Approach 1 delay on users?
- What percentage of users affected by slow loading?
- Geographic distribution of impacted users?
- A/B testing planned for Approach 1?
- Acceptable threshold for delay?
- Impact on form completion rates?
- Impact on merchant satisfaction?
4. Rollback Plan:
- If the stateless approach has production issues after TA-5920 is resolved, what is the rollback strategy?
- Can we safely revert architectural changes?
- What's the rollback complexity?
- Monitoring and alerting requirements?
- Who's on-call for rollout?
- What are the rollback triggers?
5. Testing Strategy:
- Comprehensive test plan for stateless approach?
- Coverage requirements?
- Performance benchmarks?
- Load testing strategy?
- Beta testing cohorts identified?
- Timeline for testing phases?
6. Interim Solution Choice:
- Why not PR #976 as interim solution?
- Is Approach 1's UX impact measured?
- Have merchants been consulted?
- What's the decision rationale beyond architecture?
7. Resource Allocation:
- Why were two separate investigations (Bruno's and the team's) run?
- Could resources have been combined?
- What's the cost of parallel work?
- Was there coordination between the investigations?
8. Decision Making:
- Who makes the final call: Approach 1 vs PR #976?
- What are the decision criteria?
- Business stakeholders involved?
- Customer feedback considered?
9. Timeline Expectations:
- When do we expect to deploy the proper solution?
- What if TA-5920 takes another year?
- At what point do we reconsider interim choice?
- Acceptable wait time for architectural purity?
10. Measurement:
- How will we measure success?
- What metrics define "better"?
- User impact quantification?
- Business impact quantification?
| # | Approach | Status | POC | Pros | Cons | UX Impact |
|---|---|---|---|---|---|---|
| 1 | SDK waits | Interim solution | Option A in PR #976 comment | Builds on existing, guarantees order | UX degradation, queuing overhead | High (negative) |
| 2 | Input buffering | Evaluated | poc-queue-fields | Builds on existing, immediate render | Proxy complexity, timeout | Medium |
| 3 | Late sync | Evaluated (Bruno) | PR #976 | No UX impact, proven in tests | "More events" concern | None |
| 4 | Stateless | Recommended (blocked) | PR #1011 | Simplifies architecture | Blocked by TA-5920 | None |
Jira Tickets:
- TA-13399 - This investigation (Team's alternative)
- TA-13099 - Bruno's investigation (closed Sep 30, 2025)
- TA-13380 - CVV-only mode bug (blocked by TA-13399)
- TA-5920 - Mixed versions blocker (blocks Approach 4)
Pull Requests:
- PR #976 - Bruno's late sync solution (Open, unmerged)
- PR #1011 - Stateless controller POC (Open, exploratory)
Confluence Pages:
- Post-mortem - Original incident September 2025
- Approaches doc - This investigation (version 14)
Documentation:
- Secure Fields Events - Current docs with READY event inaccuracy
September 2025:
- Sep 29, 07:52 UTC: TA-13399 created by Luca Allievi
- Sep 29, 07:57 UTC: Moved to "Refinement Ready"
- Sep 29, 10:15 UTC: Linked as blocker for TA-13380
- Sep 29, 12:33 UTC: Moved to "Dev Ready"
- Sep 30: TA-13099 (Bruno's ticket) closed
October 2025:
- Oct 7: Sprint assignment (T-Wolf O)
- Oct 8: Sprint reassignment (T-Wolf N)
- Oct 22: Sprint expansion (back to T-Wolf O)
- Oct 27, 10:26 UTC: Assigned to Giordano Arman
- Oct 27, 10:26 UTC: Moved to "In Progress"
November 2025:
- Nov 5: Summary updated by Cristiano Betta
- Nov 5: Sprint expansion (T-Wolf P)
- Nov 6, 11:17 UTC: PR #1011 opened (stateless POC)
- Nov 10, 09:32 UTC: Moved to "Review" status
Industry Examples:
- Braintree Hosted Fields - Reference implementation
- Checkout.com Frames - Alternative approach
Internal:
- Bruno's 35+ documentation files from original investigation (TA-13099)
- PR #976 review discussions
- Team conversations (Sep 29-30, 2025)
- Confluence page iterations (14 versions)
Code References:
- POC branch: poc-queue-fields
- PR #976: Late sync implementation
- PR #1011: Stateless controller POC
- Search term in PR #1011: "TA-13399" for code comments
Key Contributors:
Luca Allievi:
- Created TA-13399
- Defined investigation scope
- Linked related tickets
Giordano Arman:
- Assigned to investigation
- Created PR #1011 (stateless POC)
- Implemented exploratory code
Cristiano Betta:
- Updated ticket summaries
- Contributed to Confluence page
- Sprint management
Paulo Ferrarini:
- Sprint management
- Refinement process
Gary Evans:
- Sprint management
- Prioritization
Cansın Güler:
- Moved ticket to "Dev Ready"
Bruno:
- Original investigation (TA-13099)
- PR #976 implementation
- 35+ documentation files
- Alternative approach advocate
The team's investigation into the Secure Fields race condition produced a thorough evaluation of four distinct approaches, with Approach 4 (Stateless Controller) identified as the optimal long-term solution. This approach offers significant architectural benefits by eliminating state management complexity in the controller and relying on inputs as the single source of truth.
However, the team discovered a critical blocker: TA-5920 (mixed versions risk). This years-old infrastructure issue prevents safe deployment of Approach 4 due to the risk of users loading mismatched versions of controller and SDK code. The complexity of supporting dual modes (stateful and stateless simultaneously) was deemed too high.
As a result, the team selected Approach 1 (SDK waits) as the interim solution, fully acknowledging its negative UX impact. Users will wait noticeably longer before fields become interactive, particularly those on slow connections or during server issues. The team explicitly cited "African donors of Wikimedia" as an example of an affected population.
1. Architecture vs Pragmatism: The investigation reveals a fundamental tension between architectural purity and pragmatic delivery. The team chose to wait for the "proper" solution (Approach 4) rather than deploy the working solution (Bruno's PR #976), treating continued exposure to the race condition and degraded UX as acceptable trade-offs.
2. The Cost of Perfection: By choosing architectural improvement over an immediate fix, the team accepted:
- Continued race condition exposure in production
- Degraded UX for all users (Approach 1's serialized loading)
- Indefinite timeline (TA-5920 has no resolution date)
- Additional development effort (two separate investigations)
- Discarded working solution (PR #976 remains unmerged)
3. Blocker Impact: TA-5920 demonstrates how infrastructure debt can block multiple product initiatives. This ticket has existed for years and blocks not just this race condition fix, but potentially other improvements. The lack of timeline or active work raises questions about prioritization.
4. Investigation Thoroughness: The team produced excellent documentation:
- Four approaches thoroughly evaluated
- Confluence page with comparative analysis
- POC implementations for verification
- Clear pros/cons for each approach
- Identified additional issues (READY event, submit guardrails)
5. The Unasked Question: Neither investigation fully addresses one question: why not deploy PR #976 now and migrate to Approach 4 later?
This would:
- Fix race condition immediately
- Provide zero UX degradation
- Allow architectural improvement once TA-5920 is resolved
- Deliver value to users and merchants now
- Reduce risk exposure
Sometimes the "proper solution" isn't the right solution if it can't be deployed. The team chose architectural purity over pragmatic delivery, resulting in:
- Race condition: Still exists in production
- Interim solution: Degrades UX for all users
- Proper solution: Blocked indefinitely
- Working solution (PR #976): Unmerged
This investigation exemplifies the engineering principle: The best architecture in the world is worthless if it never ships.
The race condition remains unfixed, users experience slower forms, and the team waits for infrastructure improvements with no clear timeline. Meanwhile, a working solution sits in PR #976, proven in tests, ready to merge, but rejected for not being architecturally ideal.
Question for consideration: Is it better to have a working "band-aid" solution in production, or an ideal solution blocked indefinitely?