Workflows, Not Tasks: Rethinking Elastic's Background Execution Architecture


2026-03-04 — Nathan Smith, Tech Lead, APM @ Elastic Observability


The Problem

Kibana Task Manager is an application-level job queue embedded in a UI server. It is simultaneously the scheduler, the claimer, and the executor — all inside a Node.js process that also serves HTTP requests. These concerns can't scale independently, and the polling-based claim model generates wasted load on Elasticsearch at scale.

Watcher (the last ES-native execution primitive) was deprecated. Nothing replaced it at the platform level. Task Manager filled the vacuum by necessity, not by design.

With Elastic moving toward serverless (where on-prem eventually means serverless on the customer's k8s), AI agents becoming a core product capability, and workflows already shipping as a feature, the architecture needs to change.

The Insight

Every Task Manager task is already a workflow. We just didn't model them that way.

| Task Manager task | Actual workflow |
| --- | --- |
| Alerting rule eval | Query ES → evaluate conditions → fire actions → store state |
| Report generation | Gather parameters → query data → render → store artifact → notify |
| Fleet action | Select agents → send command → poll results → timeout or collect → report |
| AI investigation | Observe → reason → act → observe → ... (loop with exit conditions) |

These are multi-step processes with ordering, conditional branching, error handling, and state. Modeling them as atomic "tasks" loses structure that matters for observability, retry granularity, and composition.

Workflows should not be a feature running on Task Manager. Task Manager should be replaced by a workflow engine.

Current Architecture

Kibana (Node.js)
  └── Task Manager
        ├── polls ES for claimable tasks (every instance, every interval)
        ├── claims via optimistic update (conflicts under contention)
        ├── executes task logic in-process
        ├── includes: alerting, reporting, fleet, workflows, ...
        └── scales only by adding Kibana instances

Problems:

  • Coupled scaling: can't add task capacity without adding HTTP frontends
  • Polling waste: N Kibana instances × polling interval = constant ES query load, mostly returning nothing
  • No tenant isolation: all tenants' tasks compete in the same claim loop
  • Node.js constraint: all task logic must be JavaScript, running on the event loop
  • No step-level observability: a task is a black box; if it fails at step 3 of 5, you start over
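The polling-and-claim problem in the list above can be sketched in a few lines. This is an illustrative model, not Kibana's actual code: a dict stands in for the task index, and an integer version for ES's optimistic concurrency check.

```python
# Minimal sketch of the optimistic-claim model. A dict stands in for the
# ES task index; `version` stands in for ES's seq_no/primary_term check.
# All names are illustrative.

class ConflictError(Exception):
    """Raised when two instances race to claim the same task."""

tasks = {"t1": {"status": "idle", "version": 1}}

def claim(task_id: str, seen_version: int) -> bool:
    """Optimistic update: succeeds only if nobody claimed it first."""
    task = tasks[task_id]
    if task["version"] != seen_version:
        raise ConflictError(task_id)   # another Kibana instance won the race
    task["status"] = "claimed"
    task["version"] += 1
    return True

# Instances A and B both poll and see version 1.
claim("t1", seen_version=1)            # A wins
try:
    claim("t1", seen_version=1)        # B loses: version is now 2
except ConflictError:
    print("claim conflict")            # wasted poll, wasted write attempt
```

Every instance runs this loop on a timer regardless of whether work exists, which is where the constant ES query load comes from.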

Proposed Architecture

Three layers, cleanly separated:

Layer 1: Workflow Engine (orchestration)

Manages workflow lifecycle: scheduling, step sequencing, state transitions, guardrail enforcement. Does not execute business logic.

Workflow state lives in ES as searchable documents — every workflow instance, every step execution, every decision point.

POST /_workflows/ai-security-investigation/_start
{
  "trigger": { "alert_id": "abc-123" },
  "params": { "severity": "critical" }
}

The engine:

  • Creates a workflow instance document in ES
  • Schedules the first step
  • As each step completes, evaluates the workflow definition and schedules the next
  • Enforces guardrails (timeouts, budgets, approval gates)
  • Records the full execution history
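The lifecycle above can be sketched as a cursor advancing over a definition. This is a hypothetical sketch: real state would live in ES documents and real definitions support branching; all names are illustrative.

```python
# Sketch of the engine's step-sequencing loop over a linear workflow
# definition. A dict stands in for the instance document in ES.

definition = {
    "name": "ai-security-investigation",
    "steps": ["gather-context", "analyze", "propose-remediation"],
}

def start_instance(defn: dict) -> dict:
    """Create the workflow instance document and schedule step 0."""
    return {"definition": defn["name"], "cursor": 0,
            "status": "running", "history": []}

def complete_step(instance: dict, result: dict) -> dict:
    """Record the finished step, then schedule the next or finish."""
    step = definition["steps"][instance["cursor"]]
    instance["history"].append({"step": step, "result": result})
    instance["cursor"] += 1
    if instance["cursor"] >= len(definition["steps"]):
        instance["status"] = "completed"
    return instance

wf = start_instance(definition)
for r in ({"docs": 42}, {"verdict": "suspicious"}, {"action": "open case"}):
    complete_step(wf, r)
```

The key property is that `history` is written by the engine on every transition, which is what makes the full execution record queryable later.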

Layer 2: Step Executors (workers)

Stateless processes that claim and execute individual workflow steps. Any language, any runtime.

POST /_workflows/_steps/_claim?capability=es-query&lease=30s
→ { step_id, workflow_id, step_name, payload, lease_ttl }

POST /_workflows/_steps/{id}/_complete
→ { result }

Worker types:

  • ES query executor — runs searches, aggregations (Go/Rust, lightweight)
  • Action executor — sends notifications, webhooks, API calls (any language)
  • AI executor — LLM calls, reasoning chains (Python, GPU-capable nodes)
  • Render executor — headless browser for reports (dedicated service)
  • Kibana executor — for steps that genuinely need Kibana context (backward compat)
  • Customer executor — user-deployed workers for custom automation

Workers scale independently via k8s HPA/KEDA based on step queue depth. AI steps route to GPU nodes; simple query steps route to lightweight pods. K8s manages the worker pool lifecycle; the workflow engine manages step-to-worker routing.
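A worker driving the claim/complete protocol reduces to a small loop: claim, execute, complete, repeat. The sketch below models the two endpoints as in-process functions over a queue; the endpoint shapes follow the example above and all names are assumptions.

```python
# Hedged sketch of a stateless step executor. In-process functions stand
# in for HTTP calls to the workflow engine; names are illustrative.

import queue

step_queue: "queue.Queue[dict]" = queue.Queue()
step_queue.put({"step_id": "s1", "step_name": "gather-context",
                "payload": {"query": "error AND service:checkout"}})

completed = []

def claim(capability: str):
    """Stands in for POST /_workflows/_steps/_claim?capability=...&lease=30s"""
    try:
        return step_queue.get_nowait()
    except queue.Empty:
        return None

def complete(step_id: str, result: dict) -> None:
    """Stands in for POST /_workflows/_steps/{id}/_complete"""
    completed.append({"step_id": step_id, "result": result})

def run_once(capability: str = "es-query") -> bool:
    """One iteration of the worker loop: claim a step, execute, report."""
    step = claim(capability)
    if step is None:
        return False                 # nothing to do; idle until next claim
    result = {"hits": 17}            # placeholder for the real step logic
    complete(step["step_id"], result)
    return True

run_once()
```

Because the worker holds no state between iterations, any number of replicas can run this loop concurrently, which is what makes HPA/KEDA scaling straightforward.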

Layer 3: Workflow Definitions

Declarative definitions stored in ES. Auditable, versionable, composable.

name: ai-security-investigation
version: 2
max_duration: 5m
max_llm_calls: 20

steps:
  - name: gather-context
    executor: es-query
    permissions: [read:logs-*, read:metrics-*]
    timeout: 30s
    retry: { max: 2, backoff: exponential }

  - name: analyze
    executor: ai-reasoning
    model: default
    input: "{{ steps.gather-context.result }}"
    timeout: 60s

  - name: propose-remediation
    executor: ai-reasoning
    permissions: [read:*]
    input: "{{ steps.analyze.result }}"

  - name: apply-remediation
    executor: action
    permissions: [write:cases-*]
    requires_approval: true
    approval_timeout: 24h

Built-in workflows ship with Elastic (alerting, reporting, fleet). Customers extend or replace them. The workflow definition is the behavioral contract — for agents especially, it's the guardrail.
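Before a definition can serve as a behavioral contract, the engine has to validate it. A minimal sketch of such a check, mirroring the field names in the YAML above; the rules themselves are illustrative, not a real schema.

```python
# Sketch of validating a workflow definition at registration time.
# Field names mirror the YAML example; the checks are illustrative.

definition = {
    "name": "ai-security-investigation",
    "version": 2,
    "max_duration": "5m",
    "max_llm_calls": 20,
    "steps": [
        {"name": "gather-context", "executor": "es-query",
         "permissions": ["read:logs-*"], "timeout": "30s"},
        {"name": "apply-remediation", "executor": "action",
         "permissions": ["write:cases-*"], "requires_approval": True},
    ],
}

def validate(defn: dict) -> list:
    """Return a list of guardrail violations; empty means acceptable."""
    errors = []
    if "max_duration" not in defn:
        errors.append("workflow must declare max_duration")
    for step in defn["steps"]:
        if "permissions" not in step:
            errors.append(f"step {step['name']} must declare permissions")
        if step["executor"] == "action" and not step.get("requires_approval"):
            errors.append(f"write step {step['name']} should gate on approval")
    return errors

assert validate(definition) == []
```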

Why Workflows Are the Right Primitive for AI Agents

An AI agent without a workflow is an LLM with access to your production cluster. The workflow is what makes it an agent instead of a liability.

Permissions per step

The agent doesn't decide its own permissions. Step 1 can read logs. Step 4 can write cases. Step 3 can't write anything. This is declared in the workflow definition and enforced by the engine.
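Enforcement can be sketched with simple pattern matching, assuming the `action:index-pattern` grant syntax from the definition above; `fnmatch` stands in for ES's real privilege model.

```python
# Sketch of per-step permission enforcement. Grants use the illustrative
# "read:logs-*" syntax from the workflow definition example.

from fnmatch import fnmatch

def allowed(step_permissions: list, action: str, index: str) -> bool:
    """True if any declared grant covers this action on this index."""
    return any(
        grant_action == action and fnmatch(index, pattern)
        for grant_action, pattern in (g.split(":", 1) for g in step_permissions)
    )

gather = ["read:logs-*", "read:metrics-*"]     # step 1's declared grants
assert allowed(gather, "read", "logs-nginx")
assert not allowed(gather, "write", "cases-default")  # step 1 cannot write
```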

Budget and timeout

"This investigation gets 60 seconds and 10 LLM calls max." The agent can't run forever or spend unbounded money on inference. The workflow engine tracks and enforces resource consumption.
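Engine-side budget tracking might look like the following sketch, with hypothetical names; the point is that the counter lives in the engine, not in the agent's code.

```python
# Sketch of engine-enforced budgets: the agent cannot exceed the LLM-call
# or wall-clock budget no matter what it decides. Names are illustrative.

import time

class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, max_llm_calls: int, max_seconds: float):
        self.max_llm_calls = max_llm_calls
        self.deadline = time.monotonic() + max_seconds
        self.llm_calls = 0

    def charge_llm_call(self) -> None:
        """Called by the engine before dispatching each reasoning step."""
        self.llm_calls += 1
        if self.llm_calls > self.max_llm_calls:
            raise BudgetExceeded("LLM call budget exhausted")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("investigation timed out")

budget = Budget(max_llm_calls=10, max_seconds=60)
for _ in range(10):
    budget.charge_llm_call()       # calls 1..10 are within budget
try:
    budget.charge_llm_call()       # call 11 is refused by the engine
except BudgetExceeded:
    print("budget enforced")
```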

Approval gates

"Apply remediation" requires human approval. The workflow pauses, creates an approval request (visible in Kibana), and resumes when approved or times out. The agent proposes; the human decides.
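The pause/resume mechanics reduce to two state transitions on the instance document, sketched here with illustrative names.

```python
# Sketch of an approval gate: the workflow pauses in a waiting state and
# resumes only on an explicit human decision (or a timeout, which the
# engine would treat as a rejection). Names are illustrative.

def reach_approval_gate(instance: dict, step: str) -> dict:
    """Engine pauses the workflow and records what is awaiting approval."""
    instance.update(status="waiting_approval", pending_step=step)
    return instance

def decide(instance: dict, approved: bool) -> dict:
    """Human decision resumes or cancels the paused workflow."""
    if instance["status"] != "waiting_approval":
        raise ValueError("no pending approval")
    instance["status"] = "running" if approved else "cancelled"
    instance.pop("pending_step")
    return instance

wf = {"id": "abc-123", "status": "running"}
reach_approval_gate(wf, "apply-remediation")
assert wf["status"] == "waiting_approval"   # agent is blocked here
decide(wf, approved=True)
assert wf["status"] == "running"            # remediation may now execute
```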

Deterministic replay

Every step's input and output is recorded in ES. When an agent does something unexpected, replay the execution step by step. Every decision, every tool call, every intermediate result is queryable:

FROM workflow-steps
| WHERE workflow.id == "abc-123"
| SORT step.started_at ASC
| KEEP step.name, step.input, step.output, step.duration_ms

Composition

Agent capabilities are composed from workflow steps and sub-workflows. "Can investigate" = a workflow. "Can remediate" = a different workflow. "Can investigate and remediate" = a workflow that chains them. Guardrails compose the same way capabilities do.
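Composition then reduces to concatenating (or nesting) step lists; a hypothetical sketch:

```python
# Sketch of capability composition: "investigate and remediate" is just a
# workflow built from two others. Names are illustrative.

investigate = {"name": "investigate",
               "steps": ["gather-context", "analyze"]}
remediate = {"name": "remediate",
             "steps": ["propose-remediation", "apply-remediation"]}

def chain(name: str, *subworkflows: dict) -> dict:
    """Compose sub-workflows into one; per-step guardrails travel with
    their steps, so guardrails compose the same way capabilities do."""
    return {"name": name,
            "steps": [s for wf in subworkflows for s in wf["steps"]]}

combined = chain("investigate-and-remediate", investigate, remediate)
assert combined["steps"] == ["gather-context", "analyze",
                             "propose-remediation", "apply-remediation"]
```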

Customer-defined behavior

Customers don't need to write agent code. They define workflows: "when this alert fires, have the agent run this investigation playbook, but require my approval before any write operation." The workflow is the product surface for agent customization.

Multi-Tenancy

The workflow engine enforces tenant isolation at the scheduling level:

  • Per-tenant quotas: tenant A gets N concurrent workflow executions
  • Priority tiers: paid tenants' workflows schedule before free tier
  • Resource attribution: every step execution is tagged with tenant ID; cost is trackable
  • Data isolation: step executors receive scoped credentials per tenant
  • Fairness: no tenant's runaway agent can starve another tenant's alerting rules

This is impossible to retrofit into Task Manager's "grab the next claimable task" model. It's natural in a workflow engine that understands tenant context.
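Per-tenant admission can be sketched as a quota check at scheduling time; the queue model and all names below are illustrative.

```python
# Sketch of tenant-aware scheduling: per-tenant concurrency quotas, so a
# runaway tenant's workflows queue up instead of starving everyone else.

from collections import deque

quotas = {"tenant-a": 2, "tenant-b": 1}      # max concurrent executions
running = {"tenant-a": 0, "tenant-b": 0}
pending = deque([
    {"tenant": "tenant-a", "wf": "alerting"},
    {"tenant": "tenant-a", "wf": "agent-investigation"},
    {"tenant": "tenant-a", "wf": "report"},  # over quota, must wait
    {"tenant": "tenant-b", "wf": "alerting"},
])

def schedule() -> list:
    """Admit pending workflows without letting any tenant exceed its quota."""
    admitted, deferred = [], deque()
    while pending:
        item = pending.popleft()
        tenant = item["tenant"]
        if running[tenant] < quotas[tenant]:
            running[tenant] += 1
            admitted.append(item)
        else:
            deferred.append(item)            # stays queued, not dropped
    pending.extend(deferred)
    return admitted

batch = schedule()                           # tenant-a's third wf is deferred
```

A priority tier would sort `pending` before admission; the quota check itself is unchanged.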

What ES Provides That Other Orchestrators Don't

Temporal, Conductor, and Step Functions are proven workflow engines. But they store workflow state in their own databases (Cassandra, MySQL, Postgres). Building on ES gives:

  1. Queryable execution history — ES|QL, aggregations, Kibana dashboards over all workflow executions. Not a separate monitoring system — the workflow data is the observability data.

  2. Full-text search on step payloads — "Find every AI investigation where the agent mentioned 'lateral movement'" is a search query, not a log grep.

  3. ILM for lifecycle — Completed workflow data ages through hot/warm/cold/frozen tiers automatically. No manual cleanup, no TTL application code.

  4. Unified platform — The workflow engine, the data it operates on, and the UI that displays it are all the same system. No integration glue.

  5. No additional infrastructure — ES is already running. No Temporal server, no Cassandra cluster, no additional operational burden.

Build vs. Adopt

| Option | Pros | Cons |
| --- | --- | --- |
| Build on ES | No new infra, queryable state, unified platform | Years of work, unproven at orchestration scale |
| Adopt Temporal | Proven, battle-tested, multi-language | Another database (Cassandra/MySQL), another system to operate |
| Temporal with ES persistence | Temporal's orchestration + ES's queryability | Requires building an ES persistence plugin for Temporal; Temporal's data model may not map cleanly to ES |
| Evolve Task Manager | Incremental, low risk | Preserves the fundamental architectural problems |

Migration Path

  1. Define workflow state as ES documents. Index mappings for workflow instances and step executions. This is the foundation — get the data model right.

  2. Build a workflow engine as a standalone service. Not in Kibana. Reads workflow definitions from ES, manages step scheduling, writes state to ES. Can be Go, Rust, or Java.

  3. Define the step executor protocol. HTTP-based: claim, heartbeat, complete, fail. Publish as a public API.

  4. Migrate one workflow. Pick something contained — report generation. Build it as a real workflow with step executors. Prove the model.

  5. Kibana becomes a step executor and UI. Task Manager internally becomes a "Kibana step executor" that speaks the new protocol. Existing task types run unchanged — they're just steps now.

  6. Migrate remaining task types. Alerting, fleet, ML — each becomes a workflow definition with appropriate step executors.

  7. Ship customer-facing workflow authoring. The workflow definition format becomes a product surface. Customers define, modify, and compose workflows through Kibana UI or API.

  8. AI agents use workflows natively. Agent capabilities are workflow definitions. The agent framework invokes workflows, not raw tool calls. Guardrails are enforced by the engine, not the agent code.

Summary

| | Task Manager (current) | Workflow Engine (proposed) |
| --- | --- | --- |
| Execution model | Atomic tasks, black box | Multi-step workflows, observable per step |
| Scheduling | Polling + optimistic claim | Event-driven or change-notification based |
| Runtime coupling | Node.js in Kibana | Any language, any runtime |
| Scaling | Coupled to Kibana instances | Workers scale independently |
| Tenant isolation | None | Per-tenant quotas, priority, data scoping |
| Agent guardrails | Application code | Declarative workflow constraints |
| Observability | Task status field | Full step-level execution history in ES |
| Customer extensibility | Kibana plugin development | Workflow definitions + custom step executors |

The transition from tasks to workflows is not a refactor of Task Manager. It's a recognition that background execution in Elastic is a platform concern, not a Kibana concern, and that the primitive should be the workflow — not the task.
