Rationale: currently, once the system has all data from the external system, it performs all updates in a single call. If anything fails, the entire call is executed again. If another relevant transaction happens in the interim, we have no way to handle it. The large task record that is created, with spurious properties, makes tracing and debugging difficult.
Problems
- All reads and updates are done in a single call.
- The large transaction code is only partially idempotent.
- There is no way to handle out-of-order events.
- There is no way to retry errors.
- Traceability is poor.
- There is no queue monitoring.
Goals
- Break up the large transaction into smaller ones.
- Improve error handling, retry logic, and out-of-order event detection and correction.
- Create better data traceability.
- Monitor system performance and reliability.
- Predict future performance and reliability via benchmarking.
Code patterns
- Abstraction layer for action and observer handlers (16hr); see the sketch after this list
- Do not save the action hierarchy instance in MongoDB (0hr)
- Keep the action hierarchy instance in BullMQ and, when done, save it to MongoDB (16hr)
- Keep the action hierarchy instance in MongoDB (10hr)
- Split read and write actions for legacy tables (24hr)
- Preprocess tenant customization and integration settings (16hr)
- Support logging and provenance within new handlers (16hr)
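A minimal sketch of what the handler abstraction with logging and provenance could look like. Every name below (ActionContext, ActionHandler, withProvenance, the Provenance fields) is illustrative, not existing code:

```typescript
// Illustrative sketch only; none of these types exist in the codebase yet.
interface Provenance {
  externalEventAt: Date; // timestamp of the originating external event
  readStartAt?: Date;    // when the handler started reading
  writeStartAt?: Date;   // when the handler started writing
  handler?: string;      // which handler produced the change
}

interface ActionContext {
  tenantId: string;
  payload: Record<string, unknown>; // relevant transaction data
  provenance: Provenance;
  log: (msg: string, meta?: Record<string, unknown>) => void;
}

interface ActionHandler {
  name: string;
  execute(ctx: ActionContext): Promise<void>;
}

// Observers react to completed actions (e.g. for reporting).
interface ObserverHandler {
  name: string;
  onCompleted(ctx: ActionContext): Promise<void>;
}

// Wrapping handlers in one place gives uniform logging and provenance.
function withProvenance(handler: ActionHandler): ActionHandler {
  return {
    name: handler.name,
    async execute(ctx: ActionContext): Promise<void> {
      ctx.provenance.handler = handler.name;
      ctx.provenance.readStartAt = new Date();
      ctx.log('handler.start', { handler: handler.name });
      await handler.execute(ctx);
      ctx.log('handler.done', { handler: handler.name });
    },
  };
}
```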
New Features
- Out-of-order event detection (10hr; issue an error); see the sketch after this list
  - Simple detection (query data after the update)
- Out-of-order event mitigation
  - Atomic actions where possible (16hr)
  - Locking by id (8hr)
- Out-of-order event correction or reporting
  - Simple recalculation by replaying events in order by id (40hr; refactor)
    - Report on errors (3hr)
    - Report on changes (12hr; needs output)
    - Update data (12hr)
  - Complex
    - Integrate with normal processing (32hr; needs testing)
    - Separate read and write actions (20hr)
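A minimal sketch of the simple detection and mitigation above, assuming each document carries a lastEventAt field; the field name and document shape are assumptions:

```typescript
import { Collection } from 'mongodb';

interface TrackedDoc {
  _id: string;
  lastEventAt: Date;
  [key: string]: unknown;
}

// Mitigation: only apply the update if no newer event has been applied.
// Detection: if the filter did not match, re-read and check whether a
// newer event already landed, i.e. this event arrived out of order.
async function applyEvent(
  col: Collection<TrackedDoc>,
  id: string,
  eventAt: Date,
  changes: Record<string, unknown>,
): Promise<void> {
  const res = await col.updateOne(
    { _id: id, lastEventAt: { $lt: eventAt } },
    { $set: { ...changes, lastEventAt: eventAt } },
  );
  if (res.matchedCount === 0) {
    const doc = await col.findOne({ _id: id });
    if (doc && doc.lastEventAt >= eventAt) {
      // Issue an error so the event can be reported and corrected later.
      throw new Error(`out-of-order event for ${id} at ${eventAt.toISOString()}`);
    }
  }
}
```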
Optimizations
- Cache tenant customization and integration settings (16hr + 2hr); see the sketch after this list
- Process actions as array of objects (32hr; needs testing)
- Save RSAppointment data to RSAppointmentChanges
  - Add action handler (8hr)
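One possible shape for the settings cache; the loader callback, the 5-minute TTL, and all names are assumptions:

```typescript
type TenantSettings = Record<string, unknown>;

const TTL_MS = 5 * 60 * 1000; // assumed refresh interval
const cache = new Map<string, { value: TenantSettings; expiresAt: number }>();

// Returns cached customization + integration settings per tenant,
// reloading them only after the TTL expires.
async function getTenantSettings(
  tenantId: string,
  load: (id: string) => Promise<TenantSettings>,
): Promise<TenantSettings> {
  const hit = cache.get(tenantId);
  if (hit && hit.expiresAt > Date.now()) return hit.value;
  const value = await load(tenantId);
  cache.set(tenantId, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```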
Traceability - Debugging
- Table data added (8hr); see the sketch after this list
  - External event timestamp
  - Read start timestamp
  - Write start timestamp
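The three timestamps above could be stamped onto every written row; the field names below are illustrative:

```typescript
// Illustrative trace fields mirroring the list above.
interface TraceFields {
  externalEventAt: Date; // when the event happened in the external system
  readStartAt: Date;     // when we began reading source data
  writeStartAt: Date;    // when we began writing the update
}

// Attach trace fields to any row before it is persisted, so a record
// can be traced from external event to read to write.
function withTrace<T extends object>(row: T, trace: TraceFields): T & TraceFields {
  return { ...row, ...trace };
}
```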
Performance
- Verify index usage for all tables (8hr)
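One way to verify index usage is to run explain("executionStats") on each hot query and flag collection scans; the collection and query below are placeholders:

```typescript
import { MongoClient, Document } from 'mongodb';

// Flags queries whose winning plan falls back to a collection scan.
async function checkIndexUsage(client: MongoClient): Promise<void> {
  const col = client.db('app').collection('RSAppointmentChanges');
  const plan: Document = await col
    .find({ tenantId: 'example-tenant' }) // placeholder hot query
    .explain('executionStats');
  const winning = JSON.stringify(plan.queryPlanner?.winningPlan ?? {});
  if (winning.includes('COLLSCAN')) {
    console.warn('query on RSAppointmentChanges is not using an index');
  }
}
```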
Monitoring
- Add BullMQ dashboard (8hr-12hr); see the sketch after this list
  - Monitor stalled, failed, and completed jobs
- Review Grafana dashboards (8hr-12hr)
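If we use bull-board, one common dashboard for BullMQ, the wiring is roughly as follows; the queue name, route, and Redis connection are assumptions:

```typescript
import express from 'express';
import { Queue } from 'bullmq';
import { createBullBoard } from '@bull-board/api';
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter';
import { ExpressAdapter } from '@bull-board/express';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis
const queue = new Queue('external-events', { connection }); // assumed queue name

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath('/admin/queues');

// Exposes a UI showing waiting, active, completed, failed, and delayed jobs.
createBullBoard({
  queues: [new BullMQAdapter(queue)],
  serverAdapter,
});

const app = express();
app.use('/admin/queues', serverAdapter.getRouter());
app.listen(3000);
```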
Benchmarking (32-48hr)
- Define metrics
  - Latency
  - Throughput
  - Reliability (number of errors, retries, and out-of-order events)
- Define scenarios
  - Tenant configuration
  - Number of events
  - Load (events per second)
  - Synthetic vs. real data
- Create benchmark code; see the sketch after this list
- Run benchmarks
- Review results
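A rough harness for the throughput and reliability metrics, assuming the queue name and Redis connection below; per-job latency could be derived from the sentAt field in the worker:

```typescript
import { Queue, QueueEvents } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis

// Enqueues n synthetic jobs and reports throughput and failure counts.
async function runBenchmark(n: number): Promise<void> {
  const queue = new Queue('external-events', { connection }); // assumed name
  const events = new QueueEvents('external-events', { connection });
  await events.waitUntilReady();

  let completed = 0;
  let failed = 0;
  events.on('completed', () => { completed += 1; });
  events.on('failed', () => { failed += 1; });

  const started = Date.now();
  for (let i = 0; i < n; i++) {
    await queue.add('benchmark-event', { seq: i, sentAt: Date.now() });
  }

  // Wait until every job has been observed, then report.
  while (completed + failed < n) {
    await new Promise((resolve) => setTimeout(resolve, 100));
  }
  const elapsedS = (Date.now() - started) / 1000;
  console.log(`throughput: ${(n / elapsedS).toFixed(1)} jobs/s, failed: ${failed}`);

  await events.close();
  await queue.close();
}
```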
Code patterns
- Abstraction layer for action and observer handlers (12hr)
- Work from a simple queue, keeping relevant transaction data in the payload; see the worker sketch after this list
- Support logging and provenance within new handlers (16hr)
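A minimal sketch of the queue-driven pattern: the BullMQ job payload carries everything a handler needs, so no action hierarchy has to be persisted mid-flight. The payload shape, queue name, and handleEvent are illustrative:

```typescript
import { Worker, Job } from 'bullmq';

// Illustrative payload: the job itself carries the transaction data.
interface EventPayload {
  tenantId: string;
  entityId: string;
  externalEventAt: string; // ISO timestamp from the external system
  changes: Record<string, unknown>;
}

// Placeholder for dispatching to the appropriate action handler.
async function handleEvent(payload: EventPayload): Promise<void> {
  // ...one small read + one small write per event...
}

const worker = new Worker<EventPayload>(
  'external-events', // assumed queue name
  async (job: Job<EventPayload>) => handleEvent(job.data),
  { connection: { host: 'localhost', port: 6379 }, concurrency: 10 },
);
```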
New Features
- Out-of-order event detection (10hr; issue an error)
  - Simple detection (query data after the update)
- Out-of-order event mitigation; see the sketch just below
  - Atomic actions where possible (8hr; $inc and possibly $setOnInsert)
  - Locking by id (8hr; based on runtime configuration)
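A sketch of the atomic-update mitigation using the $inc and $setOnInsert operators named above; the collection and field names are placeholders:

```typescript
import { Collection, Document } from 'mongodb';

// Counters move with $inc and one-time defaults are set with
// $setOnInsert, so concurrent or reordered events cannot clobber
// each other's increments.
async function recordConversion(col: Collection<Document>, leadId: string): Promise<void> {
  await col.updateOne(
    { leadId },
    {
      $inc: { conversions: 1 },
      $setOnInsert: { firstSeenAt: new Date() },
    },
    { upsert: true },
  );
}
```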
- Out-of-order event reporting (nice to have)
- Simple in-memory recalculation by replaying events in time order partitioned by id; see the sketch after this list
  - Derived tables first (20hr)
  - Temporal tables next (20hr)
  - Compare to current state (4hr)
  - Do a full pass by using a hash of id (20hr)
  - Only report on request by individual id (6hr)
  - Save RSAppointment data (4hr)
  - Save RSLeadGlobalConversions data (8hr)
  - Query RSLeadGlobalConversions data (8hr)
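A minimal sketch of the in-memory verification pass: group saved change events by id, replay them in time order, and diff the result against current state. The event shape and fold logic are illustrative:

```typescript
interface ChangeEvent {
  entityId: string;
  at: Date;                         // external event timestamp
  changes: Record<string, unknown>; // field-level changes
}

type State = Record<string, unknown>;

// Replay events per entity in time order and fold them into final state.
function replay(events: ChangeEvent[]): Map<string, State> {
  const byId = new Map<string, ChangeEvent[]>();
  for (const e of events) {
    const list = byId.get(e.entityId) ?? [];
    list.push(e);
    byId.set(e.entityId, list);
  }
  const result = new Map<string, State>();
  for (const [id, list] of byId) {
    list.sort((a, b) => a.at.getTime() - b.at.getTime());
    result.set(id, list.reduce<State>((state, e) => ({ ...state, ...e.changes }), {}));
  }
  return result;
}

// Report ids whose replayed state disagrees with the stored state.
function diff(replayed: Map<string, State>, current: Map<string, State>): string[] {
  const mismatched: string[] = [];
  for (const [id, state] of replayed) {
    if (JSON.stringify(state) !== JSON.stringify(current.get(id))) mismatched.push(id);
  }
  return mismatched;
}
```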
Performance
- Verify index usage for all tables (8hr)
Monitoring
- Add BullMQ dashboard (8hr-12hr)
  - Monitor stalled, failed, and completed jobs
- Review Grafana dashboards (8hr-12hr)
Benchmarking (20-32hr)
- Define metrics
  - Latency
  - Throughput
  - Reliability (number of errors, retries, and out-of-order events)
- Define scenarios
  - Tenant configuration
  - Number of events
  - Load (events per second)
  - Synthetic vs. real data
- Create benchmark code
- Run benchmarks
- Review results
Schedule
- Week 1: Core refactoring (abstraction layer + logging) and new data flows (Appointments & LeadGlobalConversions).
- Week 2: Basic out-of-order (OOO) handling (detection + mitigation), plus queries and index checks.
- Week 3: Partial OOO reporting (temporal tables) + monitoring dashboards + start benchmarking.
- Week 4: Full OOO reporting (derived tables + optional full pass) and finalize benchmarking.
Notes: by adding the new data flow early, we also provide an example of how to use the new abstraction layer. By implementing the entire data flow in-memory as a verification process, we gain a way to verify the current data flow as well as a way to test the new abstraction layer.