@smith — Created March 4, 2026 16:58
Background Job Architecture for Elastic — Discussion Notes

2026-03-04 — Starting from @platformatic/job-queue, ending at first principles


1. @platformatic/job-queue — What Is It?

Matteo Collina (Node.js TSC, Fastify creator) and Platformatic shipped a new job queue library for Node.js with a clean pluggable storage interface:

  • MemoryStorage — in-process, ephemeral
  • FileStorage — filesystem-based, single-node persistence
  • RedisStorage — distributed, production-grade (Redis 7+ / Valkey 8+)

Key features: deduplication by job ID, enqueueAndWait() request/response pattern, configurable retries, stalled job recovery (Reaper), graceful shutdown, TypeScript-native typed payloads.

The Storage interface is 27 methods across 8 categories: lifecycle, queue operations, job state, results, workers, pub/sub notifications, events, and leader election.
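A slice of that interface can be sketched in TypeScript. The method names below come from the mapping table later in these notes; the exact signatures and the toy in-memory implementation are assumptions for illustration, not the published API:

```typescript
// Illustrative slice of the Storage interface. Method names follow the
// mapping table below; exact signatures are assumptions, not the real API.
interface JobQueueStorage {
  enqueue(jobId: string, message: Buffer): Promise<boolean>; // false when deduplicated
  getJobState(jobId: string): Promise<string | null>;
  setJobState(jobId: string, state: string): Promise<void>;
}

// Toy MemoryStorage-style implementation showing deduplication by job ID.
class ToyMemoryStorage implements JobQueueStorage {
  private jobs = new Map<string, { message: Buffer; state: string }>();

  async enqueue(jobId: string, message: Buffer): Promise<boolean> {
    if (this.jobs.has(jobId)) return false; // duplicate job ID: rejected
    this.jobs.set(jobId, { message, state: 'queued' });
    return true;
  }

  async getJobState(jobId: string): Promise<string | null> {
    return this.jobs.get(jobId)?.state ?? null;
  }

  async setJobState(jobId: string, state: string): Promise<void> {
    const job = this.jobs.get(jobId);
    if (job) job.state = state;
  }
}
```

The point of the interface is exactly this substitutability: the same queue semantics run against memory, a file, Redis, or (below) Elasticsearch.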

2. Could You Build an Elasticsearch Backend?

Yes. Here's how the mapping works.

Data Model

Single-index design — job ID becomes doc _id, all state in one document:

Index: jq-jobs

{
  "mappings": {
    "properties": {
      "state":             { "type": "keyword" },
      "state_timestamp":   { "type": "long" },
      "worker_id":         { "type": "keyword" },
      "message":           { "type": "binary" },
      "result":            { "type": "binary" },
      "error":             { "type": "binary" },
      "result_expires_at": { "type": "long" },
      "error_expires_at":  { "type": "long" },
      "created_at":        { "type": "long" },
      "queue_order":       { "type": "long" }
    }
  }
}

Method Mapping

| Storage method | Redis mechanism | ES mechanism |
| --- | --- | --- |
| `enqueue` | Lua: `HSET NX` + `LPUSH` | `create` (`op_type`) with `refresh=wait_for` |
| `dequeue` | `BLMOVE` (blocking) | Poll: `_search` for `state: "queued"` + `_update` with `if_seq_no`/`if_primary_term` |
| `requeue` | `LREM` + `LPUSH` | Painless `_update`: set `state: "queued"`, clear `worker_id` |
| `getJobState` | `HGET` | `_get` by doc ID |
| `setJobState` | `HSET` + `PUBLISH` | `_update` with `refresh=wait_for` |
| `getJobStates` | `HMGET` | `_mget` by doc IDs |
| `setResult` | `HSET` with TTL envelope | `_update`: set `result` + `result_expires_at` |
| `getResult` | `HGET` + decode | `_get`, check `result_expires_at > now` |
| `completeJob` | Lua: multi-key atomic | Single Painless `_update` (all state in one doc) |
| `failJob` | Lua: multi-key atomic | Single Painless `_update` |
| `subscribeToJob` | Redis pub/sub | Polling + in-process EventEmitter |
| `acquireLeaderLock` | `SET NX PX` | `_create` with `op_type=create` |
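The `completeJob` row is worth spelling out: because all job state lives in one document, the multi-key Lua script collapses to one scripted update. A hedged sketch of that request in Kibana console syntax — field names follow the mapping above, but the state name, parameter envelope, and values are illustrative assumptions:

```json
POST /jq-jobs/_update/{job_id}?refresh=wait_for
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.state = 'completed';
      ctx._source.state_timestamp = params.now;
      ctx._source.worker_id = null;
      ctx._source.result = params.result;
      ctx._source.result_expires_at = params.expires;
    """,
    "params": {
      "now": 1772000000000,
      "result": "<base64-encoded result>",
      "expires": 1772000600000
    }
  }
}
```

A single `_update` on one document is atomic in ES, which is what makes the "all state in one doc" design carry the weight the Lua scripts carry in Redis.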

The Two Hard Problems

Blocking dequeue: ES has no BLMOVE. Must poll with optimistic concurrency control:

async dequeue(workerId: string, timeout: number): Promise<Buffer | null> {
  const deadline = Date.now() + (timeout * 1000);
  while (Date.now() < deadline) {
    const hit = await this.client.search({
      index: 'jq-jobs',
      body: {
        query: { term: { state: 'queued' } },
        sort: [{ queue_order: 'asc' }],
        size: 1,
        // Without this flag the hits carry no _seq_no/_primary_term,
        // and the optimistic-concurrency update below cannot work.
        seq_no_primary_term: true
      }
    });
    if (hit.hits.hits.length > 0) {
      const doc = hit.hits.hits[0];
      try {
        await this.client.update({
          index: 'jq-jobs',
          id: doc._id,
          // Claim succeeds only if no one touched the doc since our search.
          if_seq_no: doc._seq_no,
          if_primary_term: doc._primary_term,
          refresh: 'wait_for',
          body: {
            doc: { state: 'processing', state_timestamp: Date.now(), worker_id: workerId }
          }
        });
        // `message` is a binary field, so _source returns it base64-encoded.
        return Buffer.from(doc._source.message, 'base64');
      } catch (e) {
        if (e.statusCode === 409) continue; // lost the race; try the next candidate
        throw e;
      }
    }
    await new Promise(r => setTimeout(r, 200)); // queue empty; back off before re-polling
  }
  return null;
}

Pub/sub notifications: No native pub/sub. Use in-process EventEmitter for same-process, polling for cross-process. Adds ~100-200ms latency to enqueueAndWait.
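The cross-process half of that workaround is a generic poller that resolves when a job reaches a terminal state. A sketch, with the `getState` parameter standing in for an ES `_get` on the job document (names and defaults are illustrative):

```typescript
// Poll a job's state until it reaches a terminal state or the deadline passes.
// `getState` stands in for an ES _get on the job document; same-process
// subscribers would listen on an EventEmitter instead and skip the poll.
type GetState = (jobId: string) => Promise<string | null>;

async function waitForJob(
  jobId: string,
  getState: GetState,
  { intervalMs = 200, timeoutMs = 30_000 } = {}
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const state = await getState(jobId);
    if (state === 'completed' || state === 'failed') return state;
    // This sleep is where the ~100-200ms enqueueAndWait latency comes from.
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`timed out waiting for job ${jobId}`);
}
```

The polling interval is the latency/load dial: shorter intervals cut `enqueueAndWait` latency but multiply search traffic per waiting caller.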

Tradeoffs

| | ES backend | Redis backend |
| --- | --- | --- |
| Dequeue latency | ~100-200ms (polling) | <1ms (`BLMOVE`) |
| Throughput ceiling | Low thousands/sec | ~100k/sec |
| Observability | Built-in (Kibana dashboards, ES\|QL) | External tooling required |
| Durability | Replicated by default | Needs AOF/RDB config |
| Additional infrastructure | None (if ES exists) | Redis cluster |

Verdict

ES as a queue backend makes sense when ES is already in the stack, throughput is <5k jobs/sec, and observability on queue state is valuable. Wrong choice for sub-millisecond latency or >10k jobs/sec.

3. Is @platformatic/job-queue State of the Art?

No. It's a clean v0 from a credible author.

What's missing for production scale:

  • No priority queues
  • No delayed/scheduled jobs
  • No job dependencies or flows
  • No rate limiting
  • No worker thread support
  • No metrics/telemetry hooks
  • No named queues for workload isolation
  • Zero production battle scars

The actual landscape:

  • BullMQ — Node.js standard if you have Redis. Years of production use, full feature set.
  • Temporal — Real state of the art for durable workflow orchestration. Netflix, Stripe, Snap scale.
  • Graphile Worker — Best Postgres-backed option.

The storage interface design is worth studying as a reference for clean separation of queue semantics from storage mechanics.

4. The Real Problem: Task Manager and Kibana's Architecture

Task Manager's issue isn't implementation quality — it's that task execution is an infrastructure primitive stuck inside an application server.

Kibana is simultaneously:

  • The UI serving HTTP requests
  • The task scheduler
  • The task worker executing background logic
  • The task claimer competing with other instances via polling

These concerns can't scale independently. Need more task throughput? Add Kibana instances — but now you've also scaled your HTTP frontend and added more pollers hitting ES.

The polling problem: Every Kibana instance queries ES for claimable tasks on an interval. Most polls return nothing. Under load, claim contention creates update conflicts. Thundering herd on a database.

Watcher's lesson: Watcher had the right instinct (task execution is a platform concern) but wrong coupling (execution engine embedded in ES internals). Task execution evolves faster than a database can.

The lesson: ES should understand tasks semantically without executing them.

Proposed Architecture: Three Layers

Layer 1: ES as task-aware state store

  • Task state machine enforced by ES: queued → claimed → running → completed|failed
  • Lease-based claiming as a first-class operation: POST /_tasks/_claim?lease=30s
  • Change notifications on task indices (eliminates polling)
  • Built-in TTL and ILM-based lifecycle

Layer 2: Generic worker protocol

POST /_tasks/_claim          → { task_id, payload, lease_ttl }
POST /_tasks/{id}/_heartbeat → extends lease
POST /_tasks/{id}/_complete  → { result }
POST /_tasks/{id}/_fail      → { error, retry: bool }

Any process that speaks this protocol is a worker — Go, Rust, Python, Node.js, WASM, serverless functions.
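A worker speaking that protocol could be as small as the loop below. The transport is injected so the sketch stays runnable without a server; the endpoint comments mirror the hypothetical API above, and all names here are assumptions:

```typescript
// Minimal worker loop against the hypothetical /_tasks protocol sketched above.
// `Transport` abstracts the HTTP calls so any runtime can implement it.
interface ClaimedTask { task_id: string; payload: unknown; lease_ttl: number }

interface Transport {
  claim(): Promise<ClaimedTask | null>;                      // POST /_tasks/_claim
  heartbeat(taskId: string): Promise<void>;                  // POST /_tasks/{id}/_heartbeat
  complete(taskId: string, result: unknown): Promise<void>;  // POST /_tasks/{id}/_complete
  fail(taskId: string, error: string, retry: boolean): Promise<void>; // POST /_tasks/{id}/_fail
}

async function runWorker(
  transport: Transport,
  handler: (payload: unknown) => Promise<unknown>,
  iterations = Infinity
): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    const task = await transport.claim();
    if (!task) continue; // nothing claimable; a real worker would back off here
    // Renew the lease at half its TTL so a live worker never loses its task.
    const keepAlive = setInterval(
      () => transport.heartbeat(task.task_id),
      (task.lease_ttl * 1000) / 2
    );
    try {
      const result = await handler(task.payload);
      await transport.complete(task.task_id, result);
    } catch (e) {
      await transport.fail(task.task_id, String(e), true);
    } finally {
      clearInterval(keepAlive);
    }
  }
}
```

Nothing in the loop is Node-specific: claim, heartbeat, complete, fail is the whole contract, which is why the same shape ports to Go, Rust, Python, or a serverless function.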

Layer 3: Task definitions and orchestration

PUT /_task_definitions/alerting-rule-eval
{
  "schedule": "interval:1m",
  "timeout": "30s",
  "max_retries": 3,
  "worker_affinity": { "capability": ["es-query"] }
}

Why This Matters for AI

AI workloads are fundamentally different from "evaluate an alerting rule every 60 seconds":

  • Long-running: Agent loops run for minutes, making multiple ES queries and LLM calls
  • Resource-heterogeneous: Some tasks need GPUs, some need memory for large context windows
  • Stateful workflows: Observe → think → act → observe. Each step should be a durable task with replay.
  • Observable: Agent execution should be searchable in Kibana
  • Multi-tenant: In serverless, tenant isolation requires the task system to understand tenancy

Migration Path

  1. Define the task protocol as an ES API
  2. Build a reference worker in Go or Rust
  3. Kibana becomes a worker client (Task Manager speaks the new protocol internally)
  4. Migrate task types one at a time
  5. New task types deploy as standalone workers
  6. Customers write their own workers against the public protocol

"Do You Really Need JavaScript?"

For existing Kibana tasks:

  • Alerting rule evaluation: Mostly ES queries + conditional logic. Could be a DSL.
  • Reporting: Headless browser rendering. Already awkward in Node.js.
  • Fleet actions: API calls. Language-agnostic.
  • ML jobs: Already run in ES's Java ML process.

Most Kibana tasks don't need JavaScript. The JS requirement is an artifact of the architecture. For AI workloads, Python is the lingua franca — routing through Node.js is a liability.

5. The Gearman Question

The worker protocol described above is essentially Gearman (2005): central coordinator holds state, generic workers connect and process jobs, any language.

What killed Gearman wasn't the pattern — it was the implementation. gearmand was a fragile C daemon with bolted-on persistence.

What succeeded Gearman split into two paths:

  1. Simple job queues — Sidekiq, BullMQ, Celery. Redis + workers for background tasks.
  2. Workflow orchestration — Temporal, Conductor, Step Functions. Durable execution with replay and state machines.

The more honest question: should ES own task coordination, or just the data?

Options:

  • ES as coordinator (Gearman-in-ES) — less infrastructure, but asks ES to be something it's not
  • ES as state store + external orchestrator (e.g., ES-backed Temporal persistence) — more honest separation, but another system to run
  • ES as data layer + k8s as scheduler — let the industry-standard container orchestrator handle worker lifecycle

The Gearman pattern (central coordinator + generic workers) is fundamentally correct as a separation of concerns. The question is whether ES is the right place to put it, or whether coordination belongs to something purpose-built while ES owns the durable state and observability.

Open Question

Erlang/OTP solved this decades ago: supervision trees, gen_server, process mailboxes — durable task execution with restart semantics built into the runtime. The BEAM VM is the task coordinator. Does ES want to be the BEAM, or does it want to be the database that the BEAM talks to?
