@arturohernandez10 · Last active January 21, 2025 15:47
Data processing improvements & problems

Rationale: currently, once the system has all the data from the external system, it performs all updates in a single call. If anything fails, the entire call is executed again. If another relevant transaction happens in the interim, we have no way to handle it. The large task record that is created, with its spurious properties, makes tracing and debugging difficult.

  • All reads and updates are done in a single call.
  • The large transaction code is partially idempotent, but not fully.
  • We don't have a way to handle out-of-order events.
  • We don't have a way to retry errors.
  • Traceability is poor.
  • There is no queue monitoring.

Main goals

  • Break up the large transaction into smaller ones.
  • Improve error handling, retry logic, and out-of-order event detection and correction.
  • Create better data traceability.
  • Monitor system performance and reliability.
  • Predict future performance and reliability via benchmarking.

Candidate Changes

Code patterns

  • Abstraction layer for action and observer handlers 16hr
    • Do not save instance of action hierarchy in mongodb 0hr
    • Keep instance of action hierarchy in bullmq, and when done, save to mongodb 16hr
    • Keep instance of action hierarchy in mongodb 10hr
  • Split read and write actions for legacy tables 24hr
  • Preprocess tenant customization and integration settings 16hr
  • Support logging and provenance within new handlers 16hr
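
As a sketch of what the handler abstraction could look like, with logging and provenance hooks built into the dispatch path. All names here (`ActionHandler`, `Dispatcher`, `HandlerContext`) are illustrative, not the actual codebase:

```typescript
// Hypothetical abstraction layer for action and observer handlers.
// A trace id flows through the context so every handler call is attributable.

interface HandlerContext {
  tenantId: string;
  traceId: string;                 // supports the logging/provenance goal
  log: (msg: string) => void;
}

interface ActionHandler<P, R> {
  name: string;
  handle(payload: P, ctx: HandlerContext): Promise<R>;
}

type Observer<R> = (result: R, ctx: HandlerContext) => void;

class Dispatcher {
  private handlers = new Map<string, ActionHandler<unknown, unknown>>();
  private observers = new Map<string, Observer<unknown>[]>();

  register<P, R>(h: ActionHandler<P, R>, obs: Observer<R>[] = []): void {
    this.handlers.set(h.name, h as ActionHandler<unknown, unknown>);
    this.observers.set(h.name, obs as Observer<unknown>[]);
  }

  async dispatch<P, R>(name: string, payload: P, ctx: HandlerContext): Promise<R> {
    const h = this.handlers.get(name);
    if (!h) throw new Error(`no handler registered for ${name}`);
    ctx.log(`[${ctx.traceId}] start ${name}`);
    const result = (await h.handle(payload, ctx)) as R;
    for (const o of this.observers.get(name) ?? []) o(result, ctx);
    ctx.log(`[${ctx.traceId}] done ${name}`);
    return result;
  }
}
```

In this shape, each small transaction becomes one registered handler, and observers (e.g. provenance writers) run after the action without the action knowing about them.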

New Features

  • Out-of-order event detection (10hr, issue an error)
    • Simple detection ( query data after the update )
  • Out-of-order event mitigation
    • Atomic actions where possible 16hr
    • Locking by id 8hr
  • Out-of-order event correction or reporting
    • Simple recalculation by replaying events in order by id (40hr, refactor)
      • Report on errors (3hr)
      • Report on changes (12hr needs output)
      • Update data (12hr)
    • Complex
      • Integrate with normal processing (32hr needs testing)
      • Separate read and write actions (20hr)
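
A minimal sketch of the detection idea, using an in-memory map in place of the real store; this variant rejects the stale event before writing, whereas the plan's simple detection queries the data after the update and issues an error. Names and shapes are assumptions:

```typescript
// Each record remembers the external timestamp of the last applied event.
// An incoming event older than that timestamp is out of order.

interface Doc { id: string; value: number; lastEventTs: number; }

const store = new Map<string, Doc>();

/** Apply an event; returns false (and skips the write) if it arrived out of order. */
function applyEvent(id: string, value: number, eventTs: number): boolean {
  const current = store.get(id);
  if (current && eventTs < current.lastEventTs) {
    // Out-of-order event detected: a newer event was already applied.
    return false;
  }
  store.set(id, { id, value, lastEventTs: eventTs });
  return true;
}
```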

Optimizations

  • Cache tenant customization and integration settings (16hr + 2hr)
  • Process actions as array of objects (32hr needs testing)
  • Save RSAppointment data to RSAppointmentChanges
    • Add action handler ( 8hr )

Traceability - Debugging

  • Table data added ( 8hr )
    • External event timestamp
    • Read start timestamp
    • Write start timestamp
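
The three timestamps above also yield two derived lags that are useful when debugging; a sketch, with illustrative field names:

```typescript
// Per-row trace columns proposed above, plus derived lags for debugging.

interface TraceData {
  externalEventTs: number; // when the event occurred in the external system
  readStartTs: number;     // when we started reading it
  writeStartTs: number;    // when we started writing the update
}

/** Time from external event to our read, and from read to write, in ms. */
function lags(t: TraceData): { ingestLagMs: number; processLagMs: number } {
  return {
    ingestLagMs: t.readStartTs - t.externalEventTs,
    processLagMs: t.writeStartTs - t.readStartTs,
  };
}
```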

Performance

  • Verify index usage for all tables ( 8hr )

Monitoring

  • Add BullMQ dashboard ( 8hr - 12hr )
    • Monitor stalled, failed, and completed jobs
  • Review Grafana dashboards ( 8hr - 12hr )

Benchmarking ( 32 - 48hr )

  • Define metrics
    • Latency
    • Throughput
    • Reliability (number of errors, number of retries, number of out of order events)
  • Define scenarios
    • Tenant configuration
    • Number of events
    • Load (number of events per second)
    • Synthetic vs real data
  • Create benchmark code
  • Run benchmarks
  • Review results
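
The latency and throughput metrics could be summarized from recorded samples roughly as follows (a sketch of the metric math, not the benchmark harness itself):

```typescript
// Summarize latency samples into percentiles and overall throughput.

function percentile(sorted: number[], p: number): number {
  // sorted ascending; p in [0, 1]; nearest-rank method
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function summarize(latenciesMs: number[], durationS: number) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 0.5),
    p95: percentile(sorted, 0.95),
    throughput: latenciesMs.length / durationS, // events per second
  };
}
```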

Minimum changes for release 2.2

Code patterns

  • Abstraction layer for action and observer handlers 12hr
    • Work from simple queue, keep relevant transaction data in payload
  • Support logging and provenance within new handlers 16hr

New Features

  • Out-of-order event detection (10hr, issue an error)
    • Simple detection ( query data after the update )
  • Out-of-order event mitigation
    • Atomic actions where possible 8hr ($inc and possibly $setOnInsert)
    • Locking by id 8hr, based on runtime configuration
  • Out-of-order event reporting (nice to have)
    • Simple in-memory recalculation by replaying events in time order, partitioned by id
      • Derived tables first ( 20hr )
      • Temporal tables next ( 20hr )
      • Compare to current state ( 4hr )
    • Do a full pass by using a hash of id. ( 20hr )
    • Only report on request by individual id. ( 6hr )
  • Save RSAppointment data ( 4hr )
  • Save RSLeadGlobalConversions data ( 8hr )
  • Query RSLeadGlobalConversions data ( 8hr )
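
The "replay in time order, partitioned by id" idea can be sketched in memory with a last-write-wins fold, then compared against current state to report drift. Event shape and names are assumptions:

```typescript
// Replay events per id in timestamp order and report ids whose
// recomputed value disagrees with the stored state.

interface Ev { id: string; ts: number; value: number; }

function replay(events: Ev[]): Map<string, number> {
  // Partition by id.
  const byId = new Map<string, Ev[]>();
  for (const e of events) {
    const list = byId.get(e.id) ?? [];
    list.push(e);
    byId.set(e.id, list);
  }
  // Within each partition, sort by time; the latest event wins.
  const state = new Map<string, number>();
  for (const [id, list] of byId) {
    list.sort((a, b) => a.ts - b.ts);
    state.set(id, list[list.length - 1].value);
  }
  return state;
}

/** Ids whose replayed value differs from the stored value. */
function drift(replayed: Map<string, number>, current: Map<string, number>): string[] {
  return [...replayed].filter(([id, v]) => current.get(id) !== v).map(([id]) => id);
}
```

For the full pass, `replay` would run over one hash bucket of ids at a time; for on-request reporting it would run over a single id's events.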

Performance

  • Verify index usage for all tables ( 8hr )

Monitoring

  • Add BullMQ dashboard ( 8hr - 12hr )
    • Monitor stalled, failed, and completed jobs
  • Review Grafana dashboards ( 8hr - 12hr )

Benchmarking ( 20 - 32hr )

  • Define metrics
    • Latency
    • Throughput
    • Reliability (number of errors, number of retries, number of out of order events)
  • Define scenarios
    • Tenant configuration
    • Number of events
    • Load (number of events per second)
    • Synthetic vs real data
  • Create benchmark code
  • Run benchmarks
  • Review results

Timeline

  • Week 1: Core refactoring (abstraction layer + logging) and new data flows (Appointments & LeadGlobalConversions).
  • Week 2: Basic OOO handling (detection + mitigation), plus queries and index checks.
  • Week 3: Partial OOO reporting (temporal tables) + monitoring dashboards + start benchmarking.
  • Week 4: Full OOO reporting (derived tables + optional full pass), and finalize benchmarking.

Notes: adding the new data flows early also provides an example of how to use the new abstraction layer. Implementing the entire data flow in-memory as a verification process gives us a way both to verify the current data flow and to test the new abstraction layer.
