2026-03-04 — Starting from @platformatic/job-queue, ending at first principles
Matteo Collina (Node.js TSC, Fastify creator) and Platformatic shipped a new job queue library for Node.js with a clean pluggable storage interface:
- MemoryStorage — in-process, ephemeral
- FileStorage — filesystem-based, single-node persistence
- RedisStorage — distributed, production-grade (Redis 7+ / Valkey 8+)
Key features: deduplication by job ID, enqueueAndWait() request/response pattern, configurable retries, stalled job recovery (Reaper), graceful shutdown, TypeScript-native typed payloads.
The Storage interface is 27 methods across 8 categories: lifecycle, queue operations, job state, results, workers, pub/sub notifications, events, and leader election.
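The full 27-method surface isn't reproduced here, but the shape of the abstraction is worth seeing. A minimal in-memory sketch of three representative methods — the names and signatures are illustrative assumptions, not the library's actual API:

```typescript
// Illustrative subset of a pluggable job-storage interface. Method names and
// signatures are assumptions for this sketch, not @platformatic/job-queue's API.
interface JobStorageSubset {
  enqueue(jobId: string, message: Buffer): Promise<boolean>; // false = duplicate ID
  dequeue(workerId: string): Promise<{ jobId: string; message: Buffer } | null>;
  getJobState(jobId: string): Promise<string | null>;
}

// In-memory implementation, analogous in spirit to MemoryStorage.
class InMemoryStorage implements JobStorageSubset {
  private queue: string[] = [];
  private jobs = new Map<string, { state: string; message: Buffer; workerId?: string }>();

  async enqueue(jobId: string, message: Buffer): Promise<boolean> {
    if (this.jobs.has(jobId)) return false; // deduplication by job ID
    this.jobs.set(jobId, { state: 'queued', message });
    this.queue.push(jobId);
    return true;
  }

  async dequeue(workerId: string): Promise<{ jobId: string; message: Buffer } | null> {
    const jobId = this.queue.shift();
    if (jobId === undefined) return null;
    const job = this.jobs.get(jobId)!;
    job.state = 'processing';
    job.workerId = workerId;
    return { jobId, message: job.message };
  }

  async getJobState(jobId: string): Promise<string | null> {
    return this.jobs.get(jobId)?.state ?? null;
  }
}
```

The point of the interface is that queue semantics (dedup, state transitions) live above the storage mechanics, which is what makes swapping backends plausible.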
Could Elasticsearch implement this Storage interface? Yes. Here's how the mapping works.
Single-index design — job ID becomes doc _id, all state in one document:
Index: jq-jobs

```json
{
  "mappings": {
    "properties": {
      "state": { "type": "keyword" },
      "state_timestamp": { "type": "long" },
      "worker_id": { "type": "keyword" },
      "message": { "type": "binary" },
      "result": { "type": "binary" },
      "error": { "type": "binary" },
      "result_expires_at": { "type": "long" },
      "error_expires_at": { "type": "long" },
      "created_at": { "type": "long" },
      "queue_order": { "type": "long" }
    }
  }
}
```

| Storage method | Redis mechanism | ES mechanism |
|---|---|---|
| enqueue | Lua: HSET NX + LPUSH | create (op_type) with refresh=wait_for |
| dequeue | BLMOVE (blocking) | Poll: _search for state: "queued" + _update with if_seq_no/if_primary_term |
| requeue | LREM + LPUSH | Painless _update: set state: "queued", clear worker_id |
| getJobState | HGET | _get by doc ID |
| setJobState | HSET + PUBLISH | _update with refresh=wait_for |
| getJobStates | HMGET | _mget by doc IDs |
| setResult | HSET with TTL envelope | _update: set result + result_expires_at |
| getResult | HGET + decode | _get, check result_expires_at > now |
| completeJob | Lua: multi-key atomic | Single Painless _update (all state in one doc) |
| failJob | Lua: multi-key atomic | Single Painless _update |
| subscribeToJob | Redis pub/sub | Polling + in-process EventEmitter |
| acquireLeaderLock | SET NX PX | _create with op_type=create |
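The "single Painless _update" rows deserve a concrete shape. A sketch of the request body completeJob could send — field names follow the jq-jobs mapping above, but the helper function and TTL handling are assumptions, not the library's code:

```typescript
// Builds the _update request body for completing a job in one atomic doc
// update. Field names match the jq-jobs mapping; the helper itself is a sketch.
function buildCompleteJobUpdate(result: Buffer, resultTtlMs: number, now: number = Date.now()) {
  return {
    script: {
      lang: 'painless',
      source: [
        "ctx._source.state = 'completed';",
        'ctx._source.state_timestamp = params.now;',
        'ctx._source.worker_id = null;',
        'ctx._source.result = params.result;',
        'ctx._source.result_expires_at = params.expires;'
      ].join(' '),
      params: {
        now,
        result: result.toString('base64'), // binary fields travel as base64 in _source
        expires: now + resultTtlMs
      }
    }
  };
}
```

Because all job state lives in one document, what Redis needs a multi-key Lua script for collapses into a single scripted update here.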
Blocking dequeue: ES has no BLMOVE. Must poll with optimistic concurrency control:
```typescript
async dequeue(workerId: string, timeout: number): Promise<Buffer | null> {
  const deadline = Date.now() + timeout * 1000;
  while (Date.now() < deadline) {
    const hit = await this.client.search({
      index: 'jq-jobs',
      body: {
        query: { term: { state: 'queued' } },
        sort: [{ queue_order: 'asc' }],
        size: 1,
        seq_no_primary_term: true // required to get _seq_no/_primary_term on hits
      }
    });
    if (hit.hits.hits.length > 0) {
      const doc = hit.hits.hits[0];
      try {
        // Optimistic concurrency: the update fails with 409 if another
        // worker modified the doc after our search saw it.
        await this.client.update({
          index: 'jq-jobs',
          id: doc._id,
          if_seq_no: doc._seq_no,
          if_primary_term: doc._primary_term,
          refresh: 'wait_for',
          body: {
            doc: { state: 'processing', state_timestamp: Date.now(), worker_id: workerId }
          }
        });
        return Buffer.from(doc._source.message, 'base64');
      } catch (e) {
        if (e.statusCode === 409) continue; // lost the race; search again
        throw e;
      }
    }
    await new Promise(r => setTimeout(r, 200)); // nothing queued; back off before re-polling
  }
  return null;
}
```

Pub/sub notifications: ES has no native pub/sub. Use an in-process EventEmitter for same-process subscribers and polling for cross-process ones. This adds roughly 100-200ms of latency to enqueueAndWait.
| | ES backend | Redis backend |
|---|---|---|
| Dequeue latency | ~100-200ms (polling) | <1ms (BLMOVE) |
| Throughput ceiling | Low thousands/sec | ~100k/sec |
| Observability | Built-in (Kibana dashboards, ES\|QL) | External tooling |
| Durability | Replicated by default | Needs AOF/RDB config |
| Additional infrastructure | None (if ES exists) | Redis cluster |
ES as a queue backend makes sense when ES is already in the stack, throughput is <5k jobs/sec, and observability on queue state is valuable. Wrong choice for sub-millisecond latency or >10k jobs/sec.
Is it production-ready? No. It's a clean v0 from a credible author.
What's missing for production scale:
- No priority queues
- No delayed/scheduled jobs
- No job dependencies or flows
- No rate limiting
- No worker thread support
- No metrics/telemetry hooks
- No named queues for workload isolation
- Zero production battle scars
The actual landscape:
- BullMQ — Node.js standard if you have Redis. Years of production use, full feature set.
- Temporal — Real state of the art for durable workflow orchestration. Netflix, Stripe, Snap scale.
- Graphile Worker — Best Postgres-backed option.
The storage interface design is worth studying as a reference for clean separation of queue semantics from storage mechanics.
Task Manager's issue isn't implementation quality — it's that task execution is an infrastructure primitive stuck inside an application server.
Kibana is simultaneously:
- The UI serving HTTP requests
- The task scheduler
- The task worker executing background logic
- The task claimer competing with other instances via polling
These concerns can't scale independently. Need more task throughput? Add Kibana instances — but now you've also scaled your HTTP frontend and added more pollers hitting ES.
The polling problem: Every Kibana instance queries ES for claimable tasks on an interval. Most polls return nothing. Under load, claim contention creates update conflicts. Thundering herd on a database.
Watcher's lesson: Watcher had the right instinct (task execution is a platform concern) but wrong coupling (execution engine embedded in ES internals). Task execution evolves faster than a database can.
The lesson: ES should understand tasks semantically without executing them.
Layer 1: ES as task-aware state store
- Task state machine enforced by ES: queued → claimed → running → completed|failed
- Lease-based claiming as a first-class operation: POST /_tasks/_claim?lease=30s
- Change notifications on task indices (eliminates polling)
- Built-in TTL and ILM-based lifecycle
Layer 2: Generic worker protocol
POST /_tasks/_claim → { task_id, payload, lease_ttl }
POST /_tasks/{id}/_heartbeat → extends lease
POST /_tasks/{id}/_complete → { result }
POST /_tasks/{id}/_fail → { error, retry: bool }
Any process that speaks this protocol is a worker — Go, Rust, Python, Node.js, WASM, serverless functions.
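A worker speaking this protocol reduces to a claim/heartbeat/complete loop. A sketch with HTTP abstracted behind an injected call function — note the /_tasks endpoints are this document's proposal, not an existing ES API, and the lease_ttl-in-milliseconds assumption is mine:

```typescript
// Generic worker loop for the proposed protocol. `call` abstracts HTTP so the
// coordination logic is visible; endpoint shapes follow the sketch above.
type Call = (method: string, path: string, body?: unknown) => Promise<any>;

async function workOnce(call: Call, handler: (payload: unknown) => Promise<unknown>): Promise<boolean> {
  const claim = await call('POST', '/_tasks/_claim?lease=30s');
  if (!claim) return false; // nothing claimable right now

  // Heartbeat at half the lease interval so the lease never expires mid-task
  // (lease_ttl assumed to be milliseconds here).
  const beat = setInterval(
    () => call('POST', `/_tasks/${claim.task_id}/_heartbeat`),
    (claim.lease_ttl ?? 30000) / 2
  );

  try {
    const result = await handler(claim.payload);
    await call('POST', `/_tasks/${claim.task_id}/_complete`, { result });
    return true;
  } catch (err) {
    await call('POST', `/_tasks/${claim.task_id}/_fail`, { error: String(err), retry: true });
    return true;
  } finally {
    clearInterval(beat);
  }
}
```

Nothing in the loop is JavaScript-specific, which is the point: the same four endpoints could be driven from Go, Rust, Python, or a serverless function.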
Layer 3: Task definitions and orchestration
```json
PUT /_task_definitions/alerting-rule-eval
{
  "schedule": "interval:1m",
  "timeout": "30s",
  "max_retries": 3,
  "worker_affinity": { "capability": ["es-query"] }
}
```

AI workloads are fundamentally different from "evaluate an alerting rule every 60 seconds":
- Long-running: Agent loops run for minutes, making multiple ES queries and LLM calls
- Resource-heterogeneous: Some tasks need GPUs, some need memory for large context windows
- Stateful workflows: Observe → think → act → observe. Each step should be a durable task with replay.
- Observable: Agent execution should be searchable in Kibana
- Multi-tenant: In serverless, tenant isolation requires the task system to understand tenancy
Migration path:
- Define the task protocol as an ES API
- Build a reference worker in Go or Rust
- Kibana becomes a worker client (Task Manager speaks the new protocol internally)
- Migrate task types one at a time
- New task types deploy as standalone workers
- Customers write their own workers against the public protocol
For existing Kibana tasks:
- Alerting rule evaluation: Mostly ES queries + conditional logic. Could be a DSL.
- Reporting: Headless browser rendering. Already awkward in Node.js.
- Fleet actions: API calls. Language-agnostic.
- ML jobs: Already run in ES's Java ML process.
Most Kibana tasks don't need JavaScript. The JS requirement is an artifact of the architecture. For AI workloads, Python is the lingua franca — routing through Node.js is a liability.
The worker protocol described above is essentially Gearman (2005): central coordinator holds state, generic workers connect and process jobs, any language.
What killed Gearman wasn't the pattern — it was the implementation. gearmand was a fragile C daemon with bolted-on persistence.
What succeeded Gearman split into two paths:
- Simple job queues — Sidekiq, BullMQ, Celery. Redis + workers for background tasks.
- Workflow orchestration — Temporal, Conductor, Step Functions. Durable execution with replay and state machines.
The more honest question: should ES own task coordination, or just the data?
Options:
- ES as coordinator (Gearman-in-ES) — less infrastructure, but asks ES to be something it's not
- ES as state store + external orchestrator (e.g., ES-backed Temporal persistence) — more honest separation, but another system to run
- ES as data layer + k8s as scheduler — let the industry-standard container orchestrator handle worker lifecycle
The Gearman pattern (central coordinator + generic workers) is fundamentally correct as a separation of concerns. The question is whether ES is the right place to put it, or whether coordination belongs to something purpose-built while ES owns the durable state and observability.
Erlang/OTP solved this decades ago: supervision trees, gen_server, process mailboxes — durable task execution with restart semantics built into the runtime. The BEAM VM is the task coordinator. Does ES want to be the BEAM, or does it want to be the database that the BEAM talks to?