@smith — Created March 4, 2026 16:58
Background Job Architecture for Elastic — Discussion Notes

2026-03-04 — Starting from @platformatic/job-queue, ending at first principles


1. @platformatic/job-queue — What Is It?

Matteo Collina (Node.js TSC, Fastify creator) and Platformatic shipped a new job queue library for Node.js with a clean pluggable storage interface:

  • MemoryStorage — in-process, ephemeral
  • FileStorage — filesystem-based, single-node persistence
  • RedisStorage — distributed, production-grade (Redis 7+ / Valkey 8+)

Key features: deduplication by job ID, enqueueAndWait() request/response pattern, configurable retries, stalled job recovery (Reaper), graceful shutdown, TypeScript-native typed payloads.

The Storage interface is 27 methods across 8 categories: lifecycle, queue operations, job state, results, workers, pub/sub notifications, events, and leader election.
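A slice of that interface can be sketched in TypeScript. The method names below come from the mapping table later in these notes; the exact signatures and the toy in-memory implementation are assumptions for illustration, not the published API:

```typescript
// Illustrative slice of the Storage interface. Method names follow the
// mapping table below; exact signatures are assumptions, not the real API.
interface JobQueueStorage {
  enqueue(jobId: string, message: Buffer): Promise<boolean>; // false when deduplicated
  getJobState(jobId: string): Promise<string | null>;
  setJobState(jobId: string, state: string): Promise<void>;
}

// Toy MemoryStorage-style implementation showing deduplication by job ID.
class ToyMemoryStorage implements JobQueueStorage {
  private jobs = new Map<string, { message: Buffer; state: string }>();

  async enqueue(jobId: string, message: Buffer): Promise<boolean> {
    if (this.jobs.has(jobId)) return false; // duplicate job ID: rejected
    this.jobs.set(jobId, { message, state: 'queued' });
    return true;
  }

  async getJobState(jobId: string): Promise<string | null> {
    return this.jobs.get(jobId)?.state ?? null;
  }

  async setJobState(jobId: string, state: string): Promise<void> {
    const job = this.jobs.get(jobId);
    if (job) job.state = state;
  }
}
```

The point of the interface is exactly this substitutability: the same queue semantics run against memory, a file, Redis, or (below) Elasticsearch.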

2. Could You Build an Elasticsearch Backend?

Yes. Here's how the mapping works.

Data Model

Single-index design — job ID becomes doc _id, all state in one document:

Index: jq-jobs

{
  "mappings": {
    "properties": {
      "state":             { "type": "keyword" },
      "state_timestamp":   { "type": "long" },
      "worker_id":         { "type": "keyword" },
      "message":           { "type": "binary" },
      "result":            { "type": "binary" },
      "error":             { "type": "binary" },
      "result_expires_at": { "type": "long" },
      "error_expires_at":  { "type": "long" },
      "created_at":        { "type": "long" },
      "queue_order":       { "type": "long" }
    }
  }
}

Method Mapping

| Storage method | Redis mechanism | ES mechanism |
| --- | --- | --- |
| `enqueue` | Lua: `HSET NX` + `LPUSH` | `create` (`op_type`) with `refresh=wait_for` |
| `dequeue` | `BLMOVE` (blocking) | Poll: `_search` for `state: "queued"` + `_update` with `if_seq_no`/`if_primary_term` |
| `requeue` | `LREM` + `LPUSH` | Painless `_update`: set `state: "queued"`, clear `worker_id` |
| `getJobState` | `HGET` | `_get` by doc ID |
| `setJobState` | `HSET` + `PUBLISH` | `_update` with `refresh=wait_for` |
| `getJobStates` | `HMGET` | `_mget` by doc IDs |
| `setResult` | `HSET` with TTL envelope | `_update`: set `result` + `result_expires_at` |
| `getResult` | `HGET` + decode | `_get`, check `result_expires_at > now` |
| `completeJob` | Lua: multi-key atomic | Single Painless `_update` (all state in one doc) |
| `failJob` | Lua: multi-key atomic | Single Painless `_update` |
| `subscribeToJob` | Redis pub/sub | Polling + in-process EventEmitter |
| `acquireLeaderLock` | `SET NX PX` | `_create` with `op_type=create` |
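The `completeJob` row is worth spelling out: because all job state lives in one document, the multi-key Lua script collapses to one scripted update. A hedged sketch of that request in Kibana console syntax — field names follow the mapping above, but the state name, parameter envelope, and values are illustrative assumptions:

```json
POST /jq-jobs/_update/{job_id}?refresh=wait_for
{
  "script": {
    "lang": "painless",
    "source": """
      ctx._source.state = 'completed';
      ctx._source.state_timestamp = params.now;
      ctx._source.worker_id = null;
      ctx._source.result = params.result;
      ctx._source.result_expires_at = params.expires;
    """,
    "params": {
      "now": 1772000000000,
      "result": "<base64-encoded result>",
      "expires": 1772000600000
    }
  }
}
```

A single `_update` on one document is atomic in ES, which is what makes the "all state in one doc" design carry the weight the Lua scripts carry in Redis.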

The Two Hard Problems

Blocking dequeue: ES has no BLMOVE. Must poll with optimistic concurrency control:

async dequeue(workerId: string, timeout: number): Promise<Buffer | null> {
  const deadline = Date.now() + (timeout * 1000);
  while (Date.now() < deadline) {
    const hit = await this.client.search({
      index: 'jq-jobs',
      body: {
        query: { term: { state: 'queued' } },
        sort: [{ queue_order: 'asc' }],
        size: 1,
        // Without this flag the hits carry no _seq_no/_primary_term,
        // and the optimistic-concurrency update below cannot work.
        seq_no_primary_term: true
      }
    });
    if (hit.hits.hits.length > 0) {
      const doc = hit.hits.hits[0];
      try {
        await this.client.update({
          index: 'jq-jobs',
          id: doc._id,
          // Claim succeeds only if no one touched the doc since our search.
          if_seq_no: doc._seq_no,
          if_primary_term: doc._primary_term,
          refresh: 'wait_for',
          body: {
            doc: { state: 'processing', state_timestamp: Date.now(), worker_id: workerId }
          }
        });
        // `message` is a binary field, so _source returns it base64-encoded.
        return Buffer.from(doc._source.message, 'base64');
      } catch (e) {
        if (e.statusCode === 409) continue; // lost the race; try the next candidate
        throw e;
      }
    }
    await new Promise(r => setTimeout(r, 200)); // queue empty; back off before re-polling
  }
  return null;
}

Pub/sub notifications: No native pub/sub. Use in-process EventEmitter for same-process, polling for cross-process. Adds ~100-200ms latency to enqueueAndWait.
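The cross-process half of that workaround is a generic poller that resolves when a job reaches a terminal state. A sketch, with the `getState` parameter standing in for an ES `_get` on the job document (names and defaults are illustrative):

```typescript
// Poll a job's state until it reaches a terminal state or the deadline passes.
// `getState` stands in for an ES _get on the job document; same-process
// subscribers would listen on an EventEmitter instead and skip the poll.
type GetState = (jobId: string) => Promise<string | null>;

async function waitForJob(
  jobId: string,
  getState: GetState,
  { intervalMs = 200, timeoutMs = 30_000 } = {}
): Promise<string> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const state = await getState(jobId);
    if (state === 'completed' || state === 'failed') return state;
    // This sleep is where the ~100-200ms enqueueAndWait latency comes from.
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`timed out waiting for job ${jobId}`);
}
```

The polling interval is the latency/load dial: shorter intervals cut `enqueueAndWait` latency but multiply search traffic per waiting caller.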

Tradeoffs

| | ES backend | Redis backend |
| --- | --- | --- |
| Dequeue latency | ~100-200ms (polling) | <1ms (`BLMOVE`) |
| Throughput ceiling | Low thousands/sec | ~100k/sec |
| Observability | Built-in (Kibana dashboards, ES\|QL) | External tooling required |
| Durability | Replicated by default | Needs AOF/RDB config |
| Additional infrastructure | None (if ES exists) | Redis cluster |

Verdict

ES as a queue backend makes sense when ES is already in the stack, throughput is <5k jobs/sec, and observability on queue state is valuable. Wrong choice for sub-millisecond latency or >10k jobs/sec.

3. Is @platformatic/job-queue State of the Art?

No. It's a clean v0 from a credible author.

What's missing for production scale:

  • No priority queues
  • No delayed/scheduled jobs
  • No job dependencies or flows
  • No rate limiting
  • No worker thread support
  • No metrics/telemetry hooks
  • No named queues for workload isolation
  • Zero production battle scars

The actual landscape:

  • BullMQ — Node.js standard if you have Redis. Years of production use, full feature set.
  • Temporal — Real state of the art for durable workflow orchestration. Netflix, Stripe, Snap scale.
  • Graphile Worker — Best Postgres-backed option.

The storage interface design is worth studying as a reference for clean separation of queue semantics from storage mechanics.

4. The Real Problem: Task Manager and Kibana's Architecture

Task Manager's issue isn't implementation quality — it's that task execution is an infrastructure primitive stuck inside an application server.

Kibana is simultaneously:

  • The UI serving HTTP requests
  • The task scheduler
  • The task worker executing background logic
  • The task claimer competing with other instances via polling

These concerns can't scale independently. Need more task throughput? Add Kibana instances — but now you've also scaled your HTTP frontend and added more pollers hitting ES.

The polling problem: Every Kibana instance queries ES for claimable tasks on an interval. Most polls return nothing. Under load, claim contention creates update conflicts. Thundering herd on a database.

Watcher's lesson: Watcher had the right instinct (task execution is a platform concern) but wrong coupling (execution engine embedded in ES internals). Task execution evolves faster than a database can.

The lesson: ES should understand tasks semantically without executing them.

Proposed Architecture: Three Layers

Layer 1: ES as task-aware state store

  • Task state machine enforced by ES: queued → claimed → running → completed|failed
  • Lease-based claiming as a first-class operation: POST /_tasks/_claim?lease=30s
  • Change notifications on task indices (eliminates polling)
  • Built-in TTL and ILM-based lifecycle

Layer 2: Generic worker protocol

POST /_tasks/_claim          → { task_id, payload, lease_ttl }
POST /_tasks/{id}/_heartbeat → extends lease
POST /_tasks/{id}/_complete  → { result }
POST /_tasks/{id}/_fail      → { error, retry: bool }

Any process that speaks this protocol is a worker — Go, Rust, Python, Node.js, WASM, serverless functions.
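A worker speaking that protocol could be as small as the loop below. The transport is injected so the sketch stays runnable without a server; the endpoint comments mirror the hypothetical API above, and all names here are assumptions:

```typescript
// Minimal worker loop against the hypothetical /_tasks protocol sketched above.
// `Transport` abstracts the HTTP calls so any runtime can implement it.
interface ClaimedTask { task_id: string; payload: unknown; lease_ttl: number }

interface Transport {
  claim(): Promise<ClaimedTask | null>;                      // POST /_tasks/_claim
  heartbeat(taskId: string): Promise<void>;                  // POST /_tasks/{id}/_heartbeat
  complete(taskId: string, result: unknown): Promise<void>;  // POST /_tasks/{id}/_complete
  fail(taskId: string, error: string, retry: boolean): Promise<void>; // POST /_tasks/{id}/_fail
}

async function runWorker(
  transport: Transport,
  handler: (payload: unknown) => Promise<unknown>,
  iterations = Infinity
): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    const task = await transport.claim();
    if (!task) continue; // nothing claimable; a real worker would back off here
    // Renew the lease at half its TTL so a live worker never loses its task.
    const keepAlive = setInterval(
      () => transport.heartbeat(task.task_id),
      (task.lease_ttl * 1000) / 2
    );
    try {
      const result = await handler(task.payload);
      await transport.complete(task.task_id, result);
    } catch (e) {
      await transport.fail(task.task_id, String(e), true);
    } finally {
      clearInterval(keepAlive);
    }
  }
}
```

Nothing in the loop is Node-specific: claim, heartbeat, complete, fail is the whole contract, which is why the same shape ports to Go, Rust, Python, or a serverless function.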

Layer 3: Task definitions and orchestration

PUT /_task_definitions/alerting-rule-eval
{
  "schedule": "interval:1m",
  "timeout": "30s",
  "max_retries": 3,
  "worker_affinity": { "capability": ["es-query"] }
}

Why This Matters for AI

AI workloads are fundamentally different from "evaluate an alerting rule every 60 seconds":

  • Long-running: Agent loops run for minutes, making multiple ES queries and LLM calls
  • Resource-heterogeneous: Some tasks need GPUs, some need memory for large context windows
  • Stateful workflows: Observe → think → act → observe. Each step should be a durable task with replay.
  • Observable: Agent execution should be searchable in Kibana
  • Multi-tenant: In serverless, tenant isolation requires the task system to understand tenancy

Migration Path

  1. Define the task protocol as an ES API
  2. Build a reference worker in Go or Rust
  3. Kibana becomes a worker client (Task Manager speaks the new protocol internally)
  4. Migrate task types one at a time
  5. New task types deploy as standalone workers
  6. Customers write their own workers against the public protocol

"Do You Really Need JavaScript?"

For existing Kibana tasks:

  • Alerting rule evaluation: Mostly ES queries + conditional logic. Could be a DSL.
  • Reporting: Headless browser rendering. Already awkward in Node.js.
  • Fleet actions: API calls. Language-agnostic.
  • ML jobs: Already run in ES's Java ML process.

Most Kibana tasks don't need JavaScript. The JS requirement is an artifact of the architecture. For AI workloads, Python is the lingua franca — routing through Node.js is a liability.

5. The Gearman Question

The worker protocol described above is essentially Gearman (2005): central coordinator holds state, generic workers connect and process jobs, any language.

What killed Gearman wasn't the pattern — it was the implementation. gearmand was a fragile C daemon with bolted-on persistence.

What succeeded Gearman split into two paths:

  1. Simple job queues — Sidekiq, BullMQ, Celery. Redis + workers for background tasks.
  2. Workflow orchestration — Temporal, Conductor, Step Functions. Durable execution with replay and state machines.

The more honest question: should ES own task coordination, or just the data?

Options:

  • ES as coordinator (Gearman-in-ES) — less infrastructure, but asks ES to be something it's not
  • ES as state store + external orchestrator (e.g., ES-backed Temporal persistence) — more honest separation, but another system to run
  • ES as data layer + k8s as scheduler — let the industry-standard container orchestrator handle worker lifecycle

The Gearman pattern (central coordinator + generic workers) is fundamentally correct as a separation of concerns. The question is whether ES is the right place to put it, or whether coordination belongs to something purpose-built while ES owns the durable state and observability.

Open Question

Erlang/OTP solved this decades ago: supervision trees, gen_server, process mailboxes — durable task execution with restart semantics built into the runtime. The BEAM VM is the task coordinator. Does ES want to be the BEAM, or does it want to be the database that the BEAM talks to?
