Rationale: currently, once the system has all data from the external system, it performs all updates in a single call. If anything fails, the entire call is executed again. If another relevant transaction happens in the interim, we have no way to handle it. The large task record that is created, with spurious properties, makes tracing and debugging difficult.
Problems
- All reads and updates are done in a single call.
- The large transaction code is only partially idempotent.
- There is no way to handle out-of-order events.
- There is no way to retry errors.
- Traceability is poor.
- There is no queue monitoring.
Goals
- Break up the large transaction into smaller ones.
- Improve error handling, retry logic, and out-of-order event detection and correction.
- Create better data traceability.
- Monitor system performance and reliability.
- Predict future performance and reliability via benchmarking.
Code patterns
- Abstraction layer for action and observer handlers (16hr); see the sketch after this list
- Do not save the action hierarchy instance in MongoDB (0hr)
- Keep the action hierarchy instance in BullMQ and, when done, save it to MongoDB (16hr)
- Keep the action hierarchy instance in MongoDB (10hr)
- Split read and write actions for legacy tables (24hr)
- Preprocess tenant customization and integration settings (16hr)
- Support logging and provenance within new handlers (16hr)
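A minimal sketch of what the handler abstraction with logging and provenance could look like. Every name below (ActionContext, ActionHandler, withProvenance, the Provenance fields) is illustrative, not existing code:

```typescript
// Illustrative sketch only; none of these types exist in the codebase yet.
interface Provenance {
  externalEventAt: Date; // timestamp of the originating external event
  readStartAt?: Date;    // when the handler started reading
  writeStartAt?: Date;   // when the handler started writing
  handler?: string;      // which handler produced the change
}

interface ActionContext {
  tenantId: string;
  payload: Record<string, unknown>; // relevant transaction data
  provenance: Provenance;
  log: (msg: string, meta?: Record<string, unknown>) => void;
}

interface ActionHandler {
  name: string;
  execute(ctx: ActionContext): Promise<void>;
}

// Observers react to completed actions (e.g. for reporting).
interface ObserverHandler {
  name: string;
  onCompleted(ctx: ActionContext): Promise<void>;
}

// Wrapping handlers in one place gives uniform logging and provenance.
function withProvenance(handler: ActionHandler): ActionHandler {
  return {
    name: handler.name,
    async execute(ctx: ActionContext): Promise<void> {
      ctx.provenance.handler = handler.name;
      ctx.provenance.readStartAt = new Date();
      ctx.log('handler.start', { handler: handler.name });
      await handler.execute(ctx);
      ctx.log('handler.done', { handler: handler.name });
    },
  };
}
```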
New Features
- Out-of-order event detection (10hr; issue an error); see the sketch after this list
  - Simple detection (query data after the update)
- Out-of-order event mitigation
  - Atomic actions where possible (16hr)
  - Locking by id (8hr)
- Out-of-order event correction or reporting
  - Simple recalculation by replaying events in order by id (40hr; refactor)
    - Report on errors (3hr)
    - Report on changes (12hr; needs output)
    - Update data (12hr)
  - Complex
    - Integrate with normal processing (32hr; needs testing)
    - Separate read and write actions (20hr)
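A minimal sketch of the simple detection and mitigation above, assuming each document carries a lastEventAt field; the field name and document shape are assumptions:

```typescript
import { Collection } from 'mongodb';

interface TrackedDoc {
  _id: string;
  lastEventAt: Date;
  [key: string]: unknown;
}

// Mitigation: only apply the update if no newer event has been applied.
// Detection: if the filter did not match, re-read and check whether a
// newer event already landed, i.e. this event arrived out of order.
async function applyEvent(
  col: Collection<TrackedDoc>,
  id: string,
  eventAt: Date,
  changes: Record<string, unknown>,
): Promise<void> {
  const res = await col.updateOne(
    { _id: id, lastEventAt: { $lt: eventAt } },
    { $set: { ...changes, lastEventAt: eventAt } },
  );
  if (res.matchedCount === 0) {
    const doc = await col.findOne({ _id: id });
    if (doc && doc.lastEventAt >= eventAt) {
      // Issue an error so the event can be reported and corrected later.
      throw new Error(`out-of-order event for ${id} at ${eventAt.toISOString()}`);
    }
  }
}
```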
Optimizations
- Cache tenant customization and integration settings (16hr + 2hr); see the sketch after this list
- Process actions as array of objects (32hr; needs testing)
- Save RSAppointment data to RSAppointmentChanges
  - Add action handler (8hr)
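One possible shape for the settings cache; the loader callback, the 5-minute TTL, and all names are assumptions:

```typescript
type TenantSettings = Record<string, unknown>;

const TTL_MS = 5 * 60 * 1000; // assumed refresh interval
const cache = new Map<string, { value: TenantSettings; expiresAt: number }>();

// Returns cached customization + integration settings per tenant,
// reloading them only after the TTL expires.
async function getTenantSettings(
  tenantId: string,
  load: (id: string) => Promise<TenantSettings>,
): Promise<TenantSettings> {
  const hit = cache.get(tenantId);
  if (hit && hit.expiresAt > Date.now()) return hit.value;
  const value = await load(tenantId);
  cache.set(tenantId, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```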
Traceability - Debugging
- Table data added (8hr); see the sketch after this list
  - External event timestamp
  - Read start timestamp
  - Write start timestamp
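The three timestamps above could be stamped onto every written row; the field names below are illustrative:

```typescript
// Illustrative trace fields mirroring the list above.
interface TraceFields {
  externalEventAt: Date; // when the event happened in the external system
  readStartAt: Date;     // when we began reading source data
  writeStartAt: Date;    // when we began writing the update
}

// Attach trace fields to any row before it is persisted, so a record
// can be traced from external event to read to write.
function withTrace<T extends object>(row: T, trace: TraceFields): T & TraceFields {
  return { ...row, ...trace };
}
```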
Performance
- Verify index usage for all tables (8hr)
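One way to verify index usage is to run explain("executionStats") on each hot query and flag collection scans; the collection and query below are placeholders:

```typescript
import { MongoClient, Document } from 'mongodb';

// Flags queries whose winning plan falls back to a collection scan.
async function checkIndexUsage(client: MongoClient): Promise<void> {
  const col = client.db('app').collection('RSAppointmentChanges');
  const plan: Document = await col
    .find({ tenantId: 'example-tenant' }) // placeholder hot query
    .explain('executionStats');
  const winning = JSON.stringify(plan.queryPlanner?.winningPlan ?? {});
  if (winning.includes('COLLSCAN')) {
    console.warn('query on RSAppointmentChanges is not using an index');
  }
}
```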
Monitoring
- Add BullMQ dashboard (8hr-12hr); see the sketch after this list
  - Monitor stalled, failed, and completed jobs
- Review Grafana dashboards (8hr-12hr)
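If we use bull-board, one common dashboard for BullMQ, the wiring is roughly as follows; the queue name, route, and Redis connection are assumptions:

```typescript
import express from 'express';
import { Queue } from 'bullmq';
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { ExpressAdapter } from '@bull-board/express';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis
const queue = new Queue('external-events', { connection }); // assumed queue name

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath('/admin/queues');

// Exposes a UI showing waiting, active, completed, failed, and delayed jobs.
createBullBoard({
  queues: [new BullMQAdapter(queue)],
  serverAdapter,
});

const app = express();
app.use('/admin/queues', serverAdapter.getRouter());
app.listen(3000);
```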
Benchmarking (32-48hr)
- Define metrics
  - Latency
  - Throughput
  - Reliability (number of errors, retries, and out-of-order events)
- Define scenarios
  - Tenant configuration
  - Number of events
  - Load (events per second)
  - Synthetic vs. real data
- Create benchmark code; see the sketch after this list
- Run benchmarks
- Review results
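A rough harness for the throughput and reliability metrics, assuming the queue name and Redis connection below; per-job latency could be derived from the sentAt field in the worker:

```typescript
import { Queue, QueueEvents } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis

// Enqueues n synthetic jobs and reports throughput and failure counts.
async function runBenchmark(n: number): Promise<void> {
  const queue = new Queue('external-events', { connection }); // assumed name
  const events = new QueueEvents('external-events', { connection });
  await events.waitUntilReady();

  let completed = 0;
  let failed = 0;
  events.on('completed', () => { completed += 1; });
  events.on('failed', () => { failed += 1; });

  const started = Date.now();
  for (let i = 0; i < n; i++) {
    await queue.add('benchmark-event', { seq: i, sentAt: Date.now() });
  }

  // Wait until every job has been observed, then report.
  while (completed + failed < n) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  const elapsedS = (Date.now() - started) / 1000;
  console.log(`throughput: ${(n / elapsedS).toFixed(1)} jobs/s, failed: ${failed}`);

  await events.close();
  await queue.close();
}
```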
Code patterns
- Abstraction layer for action and observer handlers (12hr)
- Work from a simple queue, keeping relevant transaction data in the payload; see the worker sketch after this list
- Support logging and provenance within new handlers (16hr)
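A minimal sketch of the queue-driven pattern: the BullMQ job payload carries everything a handler needs, so no action hierarchy has to be persisted mid-flight. The payload shape, queue name, and handleEvent are illustrative:

```typescript
import { Worker, Job } from 'bullmq';

// Illustrative payload: the job itself carries the transaction data.
interface EventPayload {
  tenantId: string;
  entityId: string;
  externalEventAt: string; // ISO timestamp from the external system
  changes: Record<string, unknown>;
}

// Placeholder for dispatching to the appropriate action handler.
async function handleEvent(payload: EventPayload): Promise<void> {
  // ...one small read + one small write per event...
}

const worker = new Worker<EventPayload>(
  'external-events', // assumed queue name
  async (job: Job<EventPayload>) => handleEvent(job.data),
  { connection: { host: 'localhost', port: 6379 }, concurrency: 10 },
);
```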
New Features
- Out-of-order event detection (10hr; issue an error)
  - Simple detection (query data after the update)
- Out-of-order event mitigation; see the sketch just below
  - Atomic actions where possible (8hr; $inc and possibly $setOnInsert)
  - Locking by id (8hr; based on runtime configuration)
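A sketch of the atomic-update mitigation using the $inc and $setOnInsert operators named above; the collection and field names are placeholders:

```typescript
import { Collection, Document } from 'mongodb';

// Counters move with $inc and one-time defaults are set with
// $setOnInsert, so concurrent or reordered events cannot clobber
// each other's increments.
async function recordConversion(col: Collection<Document>, leadId: string): Promise<void> {
  await col.updateOne(
    { leadId },
    {
      $inc: { conversions: 1 },
      $setOnInsert: { firstSeenAt: new Date() },
    },
    { upsert: true },
  );
}
```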
- Out-of-order event reporting (nice to have)
- Simple in-memory recalculation by replaying events in time order partitioned by id; see the sketch after this list
  - Derived tables first (20hr)
  - Temporal tables next (20hr)
  - Compare to current state (4hr)
  - Do a full pass by using a hash of id (20hr)
  - Only report on request by individual id (6hr)
  - Save RSAppointment data (4hr)
  - Save RSLeadGlobalConversions data (8hr)
  - Query RSLeadGlobalConversions data (8hr)
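A minimal sketch of the in-memory verification pass: group saved change events by id, replay them in time order, and diff the result against current state. The event shape and fold logic are illustrative:

```typescript
interface ChangeEvent {
  entityId: string;
  at: Date;                         // external event timestamp
  changes: Record<string, unknown>; // field-level changes
}

type State = Record<string, unknown>;

// Replay events per entity in time order and fold them into final state.
function replay(events: ChangeEvent[]): Map<string, State> {
  const byId = new Map<string, ChangeEvent[]>();
  for (const e of events) {
    const list = byId.get(e.entityId) ?? [];
    list.push(e);
    byId.set(e.entityId, list);
  }
  const result = new Map<string, State>();
  for (const [id, list] of byId) {
    list.sort((a, b) => a.at.getTime() - b.at.getTime());
    result.set(id, list.reduce<State>((state, e) => ({ ...state, ...e.changes }), {}));
  }
  return result;
}

// Report ids whose replayed state disagrees with the stored state.
function diff(replayed: Map<string, State>, current: Map<string, State>): string[] {
  const mismatched: string[] = [];
  for (const [id, state] of replayed) {
    if (JSON.stringify(state) !== JSON.stringify(current.get(id))) mismatched.push(id);
  }
  return mismatched;
}
```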
Performance
- Verify index usage for all tables (8hr)
Monitoring
- Add BullMQ dashboard (8hr-12hr)
  - Monitor stalled, failed, and completed jobs
- Review Grafana dashboards (8hr-12hr)
Benchmarking (20-32hr)
- Define metrics
  - Latency
  - Throughput
  - Reliability (number of errors, retries, and out-of-order events)
- Define scenarios
  - Tenant configuration
  - Number of events
  - Load (events per second)
  - Synthetic vs. real data
- Create benchmark code
- Run benchmarks
- Review results
Schedule
- Week 1: Core refactoring (abstraction layer + logging) and new data flows (Appointments & LeadGlobalConversions).
- Week 2: Basic out-of-order (OOO) handling (detection + mitigation), plus queries and index checks.
- Week 3: Partial OOO reporting (temporal tables) + monitoring dashboards + start benchmarking.
- Week 4: Full OOO reporting (derived tables + optional full pass) and finalize benchmarking.
Notes: by adding the new data flow early, we also provide an example of how to use the new abstraction layer. By implementing the entire data flow in-memory as a verification process, we gain a way to verify the current data flow as well as a way to test the new abstraction layer.