2026-03-04 — Nathan Smith, Tech Lead, APM @ Elastic Observability
Kibana Task Manager is an application-level job queue embedded in a UI server. It is simultaneously the scheduler, the claimer, and the executor — all inside a Node.js process that also serves HTTP requests. These concerns can't scale independently, and the polling-based claim model generates wasted load on Elasticsearch at scale.
Watcher (the last ES-native execution primitive) was deprecated. Nothing replaced it at the platform level. Task Manager filled the vacuum by necessity, not by design.
With Elastic moving toward serverless (on-prem eventually = serverless on customer k8s), AI agents becoming a core product capability, and workflows already shipping as a feature — the architecture needs to change.
Every Task Manager task is already a workflow. We just didn't model them that way.
| Task Manager task | Actual workflow |
|---|---|
| Alerting rule eval | Query ES → evaluate conditions → fire actions → store state |
| Report generation | Gather parameters → query data → render → store artifact → notify |
| Fleet action | Select agents → send command → poll results → timeout or collect → report |
| AI investigation | Observe → reason → act → observe → ... (loop with exit conditions) |
These are multi-step processes with ordering, conditional branching, error handling, and state. Modeling them as atomic "tasks" loses structure that matters for observability, retry granularity, and composition.
Workflows should not be a feature running on Task Manager. Task Manager should be replaced by a workflow engine.
Kibana (Node.js)
└── Task Manager
├── polls ES for claimable tasks (every instance, every interval)
├── claims via optimistic update (conflicts under contention)
├── executes task logic in-process
├── includes: alerting, reporting, fleet, workflows, ...
└── scales only by adding Kibana instances
Problems:
- Coupled scaling: can't add task capacity without adding HTTP frontends
- Polling waste: N Kibana instances × polling interval = constant ES query load, mostly returning nothing
- No tenant isolation: all tenants' tasks compete in the same claim loop
- Node.js constraint: all task logic must be JavaScript, running on the event loop
- No step-level observability: a task is a black box; if it fails at step 3 of 5, you start over
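The polling-waste point is easy to quantify. A back-of-envelope sketch with illustrative numbers (the instance count and interval here are assumptions, not measured defaults):

```python
# Back-of-envelope claim-poll load: every Kibana instance polls on a fixed
# interval whether or not any task is runnable. Numbers are illustrative.
instances = 16           # Kibana nodes behind a load balancer (assumption)
poll_interval_s = 3      # per-instance claim-poll interval (assumption)

claim_queries_per_day = instances * (24 * 60 * 60) // poll_interval_s
print(claim_queries_per_day)  # 460800 ES claim queries per day, even at zero task volume
```

The load is a function of fleet size alone; it does not go down when there is nothing to run.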
Three layers, cleanly separated:
Layer 1: the workflow engine

Manages workflow lifecycle: scheduling, step sequencing, state transitions, guardrail enforcement. Does not execute business logic.
Workflow state lives in ES as searchable documents — every workflow instance, every step execution, every decision point.
POST /_workflows/ai-security-investigation/_start
{
"trigger": { "alert_id": "abc-123" },
"params": { "severity": "critical" }
}
The engine:
- Creates a workflow instance document in ES
- Schedules the first step
- As each step completes, evaluates the workflow definition and schedules the next
- Enforces guardrails (timeouts, budgets, approval gates)
- Records the full execution history
Layer 2: step executors (workers)

Stateless processes that claim and execute individual workflow steps. Any language, any runtime.
POST /_workflows/_steps/_claim?capability=es-query&lease=30s
→ { step_id, workflow_id, step_name, payload, lease_ttl }
POST /_workflows/_steps/{id}/_complete
→ { result }
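A worker's lifetime is a loop over this protocol. A sketch of one iteration, with the HTTP transport injected so the loop stays testable; the endpoint paths and response fields follow the proposal above and are not a shipped API:

```python
# One iteration of a step-executor worker (protocol fields as proposed above).
# `post` is an injected transport: post(path, body=None) -> (status, json).
def work_once(post, capability: str, execute) -> bool:
    status, step = post(f"/_workflows/_steps/_claim?capability={capability}&lease=30s")
    if status != 200:
        return False  # queue empty: caller sleeps/backs off before retrying
    try:
        result = execute(step["payload"])
        post(f"/_workflows/_steps/{step['step_id']}/_complete", {"result": result})
    except Exception:
        pass  # report nothing: the 30s lease expires and the engine reschedules
    return True
```

On failure this sketch simply lets the lease lapse; a production executor would also call a fail endpoint with an error payload so the engine can apply the step's retry policy immediately instead of waiting out the lease.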
Worker types:
- ES query executor — runs searches, aggregations (Go/Rust, lightweight)
- Action executor — sends notifications, webhooks, API calls (any language)
- AI executor — LLM calls, reasoning chains (Python, GPU-capable nodes)
- Render executor — headless browser for reports (dedicated service)
- Kibana executor — for steps that genuinely need Kibana context (backward compat)
- Customer executor — user-deployed workers for custom automation
Workers scale independently via k8s HPA/KEDA based on step queue depth. AI steps route to GPU nodes. Simple query steps route to lightweight pods. K8s manages the worker pool lifecycle; the workflow engine manages task-to-worker routing.
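The scaling rule an HPA/KEDA-style policy applies is simple target-value math over queue depth. An illustrative sketch (the target steps-per-worker and replica bounds are assumptions):

```python
import math

# HPA/KEDA-style target-value scaling: replicas track step-queue depth.
# Target steps-per-worker and the replica bounds are illustrative.
def desired_replicas(queue_depth: int, target_per_worker: int = 50,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    raw = math.ceil(queue_depth / target_per_worker)
    return min(max_replicas, max(min_replicas, raw))

print(desired_replicas(0))     # 1  (floor: keep one warm worker)
print(desired_replicas(250))   # 5
print(desired_replicas(5000))  # 20 (ceiling: cost/quota protection)
```

Each worker type gets its own scaler over its own capability queue, which is exactly the independent scaling Task Manager cannot offer.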
Layer 3: workflow definitions

Declarative definitions stored in ES. Auditable, versionable, composable.
name: ai-security-investigation
version: 2
max_duration: 5m
max_llm_calls: 20
steps:
  - name: gather-context
    executor: es-query
    permissions: [read:logs-*, read:metrics-*]
    timeout: 30s
    retry: { max: 2, backoff: exponential }
  - name: analyze
    executor: ai-reasoning
    model: default
    input: "{{ steps.gather-context.result }}"
    timeout: 60s
  - name: propose-remediation
    executor: ai-reasoning
    permissions: [read:*]
    input: "{{ steps.analyze.result }}"
  - name: apply-remediation
    executor: action
    permissions: [write:cases-*]
    requires_approval: true
    approval_timeout: 24h

Built-in workflows ship with Elastic (alerting, reporting, fleet). Customers extend or replace them. The workflow definition is the behavioral contract — for agents especially, it's the guardrail.
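The `retry: { max: 2, backoff: exponential }` line in the definition above implies a delay schedule the engine computes per attempt. A sketch (the base delay is an assumption; the definition format does not pin it down):

```python
# Delay before each retry attempt under exponential backoff. The 1s base
# delay is assumed; the workflow definition format doesn't specify it.
def backoff_delays(max_retries: int, base_s: float = 1.0) -> list[float]:
    return [base_s * (2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays(2))  # [1.0, 2.0] for retry: { max: 2, backoff: exponential }
```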
An AI agent without a workflow is an LLM with access to your production cluster. The workflow is what makes it an agent instead of a liability.
The agent doesn't decide its own permissions. Step 1 can read logs. Step 4 can write cases. Step 3 can't write anything. This is declared in the workflow definition and enforced by the engine.
"This investigation gets 60 seconds and 10 LLM calls max." The agent can't run forever or spend unbounded money on inference. The workflow engine tracks and enforces resource consumption.
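Enforcement can live entirely in the engine's scheduling path: before dispatching an LLM step, it charges the instance's budget and refuses once it is spent. A sketch with hypothetical field names:

```python
# Engine-side budget check before scheduling an LLM step (fields hypothetical).
class BudgetExceeded(Exception):
    pass

def charge_llm_call(instance: dict, max_llm_calls: int) -> None:
    if instance["llm_calls_used"] >= max_llm_calls:
        raise BudgetExceeded(f"workflow {instance['id']}: max_llm_calls={max_llm_calls} spent")
    instance["llm_calls_used"] += 1  # persisted back to the instance document

run = {"id": "abc-123", "llm_calls_used": 9}
charge_llm_call(run, max_llm_calls=10)   # the 10th call is allowed...
# charge_llm_call(run, max_llm_calls=10) # ...an 11th would raise BudgetExceeded
```

Because the counter lives on the instance document rather than in agent code, the budget holds even if the agent's own logic is buggy or adversarial.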
"Apply remediation" requires human approval. The workflow pauses, creates an approval request (visible in Kibana), and resumes when approved or times out. The agent proposes; the human decides.
Every step's input and output is recorded in ES. When an agent does something unexpected, replay the execution step by step. Every decision, every tool call, every intermediate result is queryable:
FROM workflow-steps
| WHERE workflow.id == "abc-123"
| SORT step.started_at ASC
| KEEP step.name, step.input, step.output, step.duration_ms

Agent capabilities are composed from workflow steps and sub-workflows. "Can investigate" = a workflow. "Can remediate" = a different workflow. "Can investigate and remediate" = a workflow that chains them. Guardrails compose the same way capabilities do.
Customers don't need to write agent code. They define workflows: "when this alert fires, have the agent run this investigation playbook, but require my approval before any write operation." The workflow is the product surface for agent customization.
The workflow engine enforces tenant isolation at the scheduling level:
- Per-tenant quotas: tenant A gets N concurrent workflow executions
- Priority tiers: paid tenants' workflows schedule before free tier
- Resource attribution: every step execution is tagged with tenant ID; cost is trackable
- Data isolation: step executors receive scoped credentials per tenant
- Fairness: no tenant's runaway agent can starve another tenant's alerting rules
This is impossible to retrofit into Task Manager's "grab the next claimable task" model. It's natural in a workflow engine that understands tenant context.
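In scheduling terms, the quota check is a gate on dispatch rather than a race on claim. A sketch with hypothetical fields:

```python
# Quota-aware dispatch gate (fields hypothetical): a tenant's step is only
# schedulable while that tenant is under its concurrency quota.
def dispatchable(step: dict, running_by_tenant: dict, quotas: dict,
                 default_quota: int = 2) -> bool:
    tenant = step["tenant_id"]
    return running_by_tenant.get(tenant, 0) < quotas.get(tenant, default_quota)

quotas = {"tenant-paid": 10, "tenant-free": 2}
running = {"tenant-paid": 4, "tenant-free": 2}
print(dispatchable({"tenant_id": "tenant-paid"}, running, quotas))  # True
print(dispatchable({"tenant_id": "tenant-free"}, running, quotas))  # False (at quota)
```

Task Manager's claim loop cannot express this because the claimer sees tasks, not tenants; the gate needs tenant context at the point of scheduling.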
Temporal, Conductor, and Step Functions are proven workflow engines. But they store workflow state in their own databases (Cassandra, MySQL, Postgres). Building on ES gives:
- Queryable execution history — ES|QL, aggregations, Kibana dashboards over all workflow executions. Not a separate monitoring system — the workflow data is the observability data.
- Full-text search on step payloads — "Find every AI investigation where the agent mentioned 'lateral movement'" is a search query, not a log grep.
- ILM for lifecycle — Completed workflow data ages through hot/warm/cold/frozen tiers automatically. No manual cleanup, no TTL application code.
- Unified platform — The workflow engine, the data it operates on, and the UI that displays it are all the same system. No integration glue.
- No additional infrastructure — ES is already running. No Temporal server, no Cassandra cluster, no additional operational burden.
| Option | Pros | Cons |
|---|---|---|
| Build on ES | No new infra, queryable state, unified platform | Years of work, unproven at orchestration scale |
| Adopt Temporal | Proven, battle-tested, multi-language | Another database (Cassandra/MySQL), another system to operate |
| Temporal with ES persistence | Temporal's orchestration + ES's queryability | Requires building an ES persistence plugin for Temporal; Temporal's data model may not map cleanly to ES |
| Evolve Task Manager | Incremental, low risk | Preserves the fundamental architectural problems |
1. Define workflow state as ES documents. Index mappings for workflow instances and step executions. This is the foundation — get the data model right.
2. Build a workflow engine as a standalone service. Not in Kibana. Reads workflow definitions from ES, manages step scheduling, writes state to ES. Can be Go, Rust, or Java.
3. Define the step executor protocol. HTTP-based: claim, heartbeat, complete, fail. Publish as a public API.
4. Migrate one workflow. Pick something contained — report generation. Build it as a real workflow with step executors. Prove the model.
5. Kibana becomes a step executor and UI. Task Manager internally becomes a "Kibana step executor" that speaks the new protocol. Existing task types run unchanged — they're just steps now.
6. Migrate remaining task types. Alerting, fleet, ML — each becomes a workflow definition with appropriate step executors.
7. Ship customer-facing workflow authoring. The workflow definition format becomes a product surface. Customers define, modify, and compose workflows through Kibana UI or API.
8. AI agents use workflows natively. Agent capabilities are workflow definitions. The agent framework invokes workflows, not raw tool calls. Guardrails are enforced by the engine, not the agent code.
| | Task Manager (current) | Workflow Engine (proposed) |
|---|---|---|
| Execution model | Atomic tasks, black box | Multi-step workflows, observable per step |
| Scheduling | Polling + optimistic claim | Event-driven or change-notification based |
| Runtime coupling | Node.js in Kibana | Any language, any runtime |
| Scaling | Coupled to Kibana instances | Workers scale independently |
| Tenant isolation | None | Per-tenant quotas, priority, data scoping |
| Agent guardrails | Application code | Declarative workflow constraints |
| Observability | Task status field | Full step-level execution history in ES |
| Customer extensibility | Kibana plugin development | Workflow definitions + custom step executors |
The transition from tasks to workflows is not a refactor of Task Manager. It's a recognition that background execution in Elastic is a platform concern, not a Kibana concern, and that the primitive should be the workflow — not the task.