Workflows, Not Tasks: Rethinking Elastic's Background Execution Architecture


2026-03-04 — Nathan Smith, Tech Lead, APM @ Elastic Observability


The Problem

Kibana Task Manager is an application-level job queue embedded in a UI server. It is simultaneously the scheduler, the claimer, and the executor — all inside a Node.js process that also serves HTTP requests. These concerns can't scale independently, and the polling-based claim model generates wasted load on Elasticsearch at scale.

Watcher (the last ES-native execution primitive) was deprecated. Nothing replaced it at the platform level. Task Manager filled the vacuum by necessity, not by design.

With Elastic moving toward serverless (where on-prem eventually means serverless on the customer's k8s), AI agents becoming a core product capability, and workflows already shipping as a feature, the architecture needs to change.

The Insight

Every Task Manager task is already a workflow. We just didn't model them that way.

| Task Manager task | Actual workflow |
| --- | --- |
| Alerting rule eval | Query ES → evaluate conditions → fire actions → store state |
| Report generation | Gather parameters → query data → render → store artifact → notify |
| Fleet action | Select agents → send command → poll results → timeout or collect → report |
| AI investigation | Observe → reason → act → observe → ... (loop with exit conditions) |

These are multi-step processes with ordering, conditional branching, error handling, and state. Modeling them as atomic "tasks" loses structure that matters for observability, retry granularity, and composition.

Workflows should not be a feature running on Task Manager. Task Manager should be replaced by a workflow engine.

Current Architecture

Kibana (Node.js)
  └── Task Manager
        ├── polls ES for claimable tasks (every instance, every interval)
        ├── claims via optimistic update (conflicts under contention)
        ├── executes task logic in-process
        ├── includes: alerting, reporting, fleet, workflows, ...
        └── scales only by adding Kibana instances

Problems:

  • Coupled scaling: can't add task capacity without adding HTTP frontends
  • Polling waste: N Kibana instances × polling interval = constant ES query load, mostly returning nothing
  • No tenant isolation: all tenants' tasks compete in the same claim loop
  • Node.js constraint: all task logic must be JavaScript, running on the event loop
  • No step-level observability: a task is a black box; if it fails at step 3 of 5, you start over
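The polling-and-claim problem in the list above can be sketched in a few lines. This is an illustrative model, not Kibana's actual code: a dict stands in for the task index, and an integer version for ES's optimistic concurrency check.

```python
# Minimal sketch of the optimistic-claim model. A dict stands in for the
# ES task index; `version` stands in for ES's seq_no/primary_term check.
# All names are illustrative.

class ConflictError(Exception):
    """Raised when two instances race to claim the same task."""

tasks = {"t1": {"status": "idle", "version": 1}}

def claim(task_id: str, seen_version: int) -> bool:
    """Optimistic update: succeeds only if nobody claimed it first."""
    task = tasks[task_id]
    if task["version"] != seen_version:
        raise ConflictError(task_id)   # another Kibana instance won the race
    task["status"] = "claimed"
    task["version"] += 1
    return True

# Instances A and B both poll and see version 1.
claim("t1", seen_version=1)            # A wins
try:
    claim("t1", seen_version=1)        # B loses: version is now 2
except ConflictError:
    print("claim conflict")            # wasted poll, wasted write attempt
```

Every instance runs this loop on a timer regardless of whether work exists, which is where the constant ES query load comes from.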

Proposed Architecture

Three layers, cleanly separated:

Layer 1: Workflow Engine (orchestration)

Manages workflow lifecycle: scheduling, step sequencing, state transitions, guardrail enforcement. Does not execute business logic.

Workflow state lives in ES as searchable documents — every workflow instance, every step execution, every decision point.

POST /_workflows/ai-security-investigation/_start
{
  "trigger": { "alert_id": "abc-123" },
  "params": { "severity": "critical" }
}

The engine:

  • Creates a workflow instance document in ES
  • Schedules the first step
  • As each step completes, evaluates the workflow definition and schedules the next
  • Enforces guardrails (timeouts, budgets, approval gates)
  • Records the full execution history
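The lifecycle above can be sketched as a cursor advancing over a definition. This is a hypothetical sketch: real state would live in ES documents and real definitions support branching; all names are illustrative.

```python
# Sketch of the engine's step-sequencing loop over a linear workflow
# definition. A dict stands in for the instance document in ES.

definition = {
    "name": "ai-security-investigation",
    "steps": ["gather-context", "analyze", "propose-remediation"],
}

def start_instance(defn: dict) -> dict:
    """Create the workflow instance document and schedule step 0."""
    return {"definition": defn["name"], "cursor": 0,
            "status": "running", "history": []}

def complete_step(instance: dict, result: dict) -> dict:
    """Record the finished step, then schedule the next or finish."""
    step = definition["steps"][instance["cursor"]]
    instance["history"].append({"step": step, "result": result})
    instance["cursor"] += 1
    if instance["cursor"] >= len(definition["steps"]):
        instance["status"] = "completed"
    return instance

wf = start_instance(definition)
for r in ({"docs": 42}, {"verdict": "suspicious"}, {"action": "open case"}):
    complete_step(wf, r)
```

The key property is that `history` is written by the engine on every transition, which is what makes the full execution record queryable later.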

Layer 2: Step Executors (workers)

Stateless processes that claim and execute individual workflow steps. Any language, any runtime.

POST /_workflows/_steps/_claim?capability=es-query&lease=30s
→ { step_id, workflow_id, step_name, payload, lease_ttl }

POST /_workflows/_steps/{id}/_complete
→ { result }

Worker types:

  • ES query executor — runs searches, aggregations (Go/Rust, lightweight)
  • Action executor — sends notifications, webhooks, API calls (any language)
  • AI executor — LLM calls, reasoning chains (Python, GPU-capable nodes)
  • Render executor — headless browser for reports (dedicated service)
  • Kibana executor — for steps that genuinely need Kibana context (backward compat)
  • Customer executor — user-deployed workers for custom automation

Workers scale independently via k8s HPA/KEDA based on step queue depth. AI steps route to GPU nodes; simple query steps route to lightweight pods. K8s manages the worker pool lifecycle; the workflow engine manages step-to-worker routing.
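A worker driving the claim/complete protocol reduces to a small loop: claim, execute, complete, repeat. The sketch below models the two endpoints as in-process functions over a queue; the endpoint shapes follow the example above and all names are assumptions.

```python
# Hedged sketch of a stateless step executor. In-process functions stand
# in for HTTP calls to the workflow engine; names are illustrative.

import queue

step_queue: "queue.Queue[dict]" = queue.Queue()
step_queue.put({"step_id": "s1", "step_name": "gather-context",
                "payload": {"query": "error AND service:checkout"}})

completed = []

def claim(capability: str):
    """Stands in for POST /_workflows/_steps/_claim?capability=...&lease=30s"""
    try:
        return step_queue.get_nowait()
    except queue.Empty:
        return None

def complete(step_id: str, result: dict) -> None:
    """Stands in for POST /_workflows/_steps/{id}/_complete"""
    completed.append({"step_id": step_id, "result": result})

def run_once(capability: str = "es-query") -> bool:
    """One iteration of the worker loop: claim a step, execute, report."""
    step = claim(capability)
    if step is None:
        return False                 # nothing to do; idle until next claim
    result = {"hits": 17}            # placeholder for the real step logic
    complete(step["step_id"], result)
    return True

run_once()
```

Because the worker holds no state between iterations, any number of replicas can run this loop concurrently, which is what makes HPA/KEDA scaling straightforward.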

Layer 3: Workflow Definitions

Declarative definitions stored in ES. Auditable, versionable, composable.

name: ai-security-investigation
version: 2
max_duration: 5m
max_llm_calls: 20

steps:
  - name: gather-context
    executor: es-query
    permissions: [read:logs-*, read:metrics-*]
    timeout: 30s
    retry: { max: 2, backoff: exponential }

  - name: analyze
    executor: ai-reasoning
    model: default
    input: "{{ steps.gather-context.result }}"
    timeout: 60s

  - name: propose-remediation
    executor: ai-reasoning
    permissions: [read:*]
    input: "{{ steps.analyze.result }}"

  - name: apply-remediation
    executor: action
    permissions: [write:cases-*]
    requires_approval: true
    approval_timeout: 24h

Built-in workflows ship with Elastic (alerting, reporting, fleet). Customers extend or replace them. The workflow definition is the behavioral contract — for agents especially, it's the guardrail.
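Before a definition can serve as a behavioral contract, the engine has to validate it. A minimal sketch of such a check, mirroring the field names in the YAML above; the rules themselves are illustrative, not a real schema.

```python
# Sketch of validating a workflow definition at registration time.
# Field names mirror the YAML example; the checks are illustrative.

definition = {
    "name": "ai-security-investigation",
    "version": 2,
    "max_duration": "5m",
    "max_llm_calls": 20,
    "steps": [
        {"name": "gather-context", "executor": "es-query",
         "permissions": ["read:logs-*"], "timeout": "30s"},
        {"name": "apply-remediation", "executor": "action",
         "permissions": ["write:cases-*"], "requires_approval": True},
    ],
}

def validate(defn: dict) -> list:
    """Return a list of guardrail violations; empty means acceptable."""
    errors = []
    if "max_duration" not in defn:
        errors.append("workflow must declare max_duration")
    for step in defn["steps"]:
        if "permissions" not in step:
            errors.append(f"step {step['name']} must declare permissions")
        if step["executor"] == "action" and not step.get("requires_approval"):
            errors.append(f"write step {step['name']} should gate on approval")
    return errors

assert validate(definition) == []
```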

Why Workflows Are the Right Primitive for AI Agents

An AI agent without a workflow is an LLM with access to your production cluster. The workflow is what makes it an agent instead of a liability.

Permissions per step

The agent doesn't decide its own permissions. Step 1 can read logs. Step 4 can write cases. Step 3 can't write anything. This is declared in the workflow definition and enforced by the engine.
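Enforcement can be sketched with simple pattern matching, assuming the `action:index-pattern` grant syntax from the definition above; `fnmatch` stands in for ES's real privilege model.

```python
# Sketch of per-step permission enforcement. Grants use the illustrative
# "read:logs-*" syntax from the workflow definition example.

from fnmatch import fnmatch

def allowed(step_permissions: list, action: str, index: str) -> bool:
    """True if any declared grant covers this action on this index."""
    return any(
        grant_action == action and fnmatch(index, pattern)
        for grant_action, pattern in (g.split(":", 1) for g in step_permissions)
    )

gather = ["read:logs-*", "read:metrics-*"]     # step 1's declared grants
assert allowed(gather, "read", "logs-nginx")
assert not allowed(gather, "write", "cases-default")  # step 1 cannot write
```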

Budget and timeout

"This investigation gets 60 seconds and 10 LLM calls max." The agent can't run forever or spend unbounded money on inference. The workflow engine tracks and enforces resource consumption.
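Engine-side budget tracking might look like the following sketch, with hypothetical names; the point is that the counter lives in the engine, not in the agent's code.

```python
# Sketch of engine-enforced budgets: the agent cannot exceed the LLM-call
# or wall-clock budget no matter what it decides. Names are illustrative.

import time

class BudgetExceeded(Exception):
    pass

class Budget:
    def __init__(self, max_llm_calls: int, max_seconds: float):
        self.max_llm_calls = max_llm_calls
        self.deadline = time.monotonic() + max_seconds
        self.llm_calls = 0

    def charge_llm_call(self) -> None:
        """Called by the engine before dispatching each reasoning step."""
        self.llm_calls += 1
        if self.llm_calls > self.max_llm_calls:
            raise BudgetExceeded("LLM call budget exhausted")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("investigation timed out")

budget = Budget(max_llm_calls=10, max_seconds=60)
for _ in range(10):
    budget.charge_llm_call()       # calls 1..10 are within budget
try:
    budget.charge_llm_call()       # call 11 is refused by the engine
except BudgetExceeded:
    print("budget enforced")
```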

Approval gates

"Apply remediation" requires human approval. The workflow pauses, creates an approval request (visible in Kibana), and resumes when approved or times out. The agent proposes; the human decides.
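The pause/resume mechanics reduce to two state transitions on the instance document, sketched here with illustrative names.

```python
# Sketch of an approval gate: the workflow pauses in a waiting state and
# resumes only on an explicit human decision (or a timeout, which the
# engine would treat as a rejection). Names are illustrative.

def reach_approval_gate(instance: dict, step: str) -> dict:
    """Engine pauses the workflow and records what is awaiting approval."""
    instance.update(status="waiting_approval", pending_step=step)
    return instance

def decide(instance: dict, approved: bool) -> dict:
    """Human decision resumes or cancels the paused workflow."""
    if instance["status"] != "waiting_approval":
        raise ValueError("no pending approval")
    instance["status"] = "running" if approved else "cancelled"
    instance.pop("pending_step")
    return instance

wf = {"id": "abc-123", "status": "running"}
reach_approval_gate(wf, "apply-remediation")
assert wf["status"] == "waiting_approval"   # agent is blocked here
decide(wf, approved=True)
assert wf["status"] == "running"            # remediation may now execute
```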

Deterministic replay

Every step's input and output is recorded in ES. When an agent does something unexpected, replay the execution step by step. Every decision, every tool call, every intermediate result is queryable:

FROM workflow-steps
| WHERE workflow.id == "abc-123"
| SORT step.started_at ASC
| KEEP step.name, step.input, step.output, step.duration_ms

Composition

Agent capabilities are composed from workflow steps and sub-workflows. "Can investigate" = a workflow. "Can remediate" = a different workflow. "Can investigate and remediate" = a workflow that chains them. Guardrails compose the same way capabilities do.
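Composition then reduces to concatenating (or nesting) step lists; a hypothetical sketch:

```python
# Sketch of capability composition: "investigate and remediate" is just a
# workflow built from two others. Names are illustrative.

investigate = {"name": "investigate",
               "steps": ["gather-context", "analyze"]}
remediate = {"name": "remediate",
             "steps": ["propose-remediation", "apply-remediation"]}

def chain(name: str, *subworkflows: dict) -> dict:
    """Compose sub-workflows into one; per-step guardrails travel with
    their steps, so guardrails compose the same way capabilities do."""
    return {"name": name,
            "steps": [s for wf in subworkflows for s in wf["steps"]]}

combined = chain("investigate-and-remediate", investigate, remediate)
assert combined["steps"] == ["gather-context", "analyze",
                             "propose-remediation", "apply-remediation"]
```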

Customer-defined behavior

Customers don't need to write agent code. They define workflows: "when this alert fires, have the agent run this investigation playbook, but require my approval before any write operation." The workflow is the product surface for agent customization.

Multi-Tenancy

The workflow engine enforces tenant isolation at the scheduling level:

  • Per-tenant quotas: tenant A gets N concurrent workflow executions
  • Priority tiers: paid tenants' workflows schedule before free tier
  • Resource attribution: every step execution is tagged with tenant ID; cost is trackable
  • Data isolation: step executors receive scoped credentials per tenant
  • Fairness: no tenant's runaway agent can starve another tenant's alerting rules

This is impossible to retrofit into Task Manager's "grab the next claimable task" model. It's natural in a workflow engine that understands tenant context.
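Per-tenant admission can be sketched as a quota check at scheduling time; the queue model and all names below are illustrative.

```python
# Sketch of tenant-aware scheduling: per-tenant concurrency quotas, so a
# runaway tenant's workflows queue up instead of starving everyone else.

from collections import deque

quotas = {"tenant-a": 2, "tenant-b": 1}      # max concurrent executions
running = {"tenant-a": 0, "tenant-b": 0}
pending = deque([
    {"tenant": "tenant-a", "wf": "alerting"},
    {"tenant": "tenant-a", "wf": "agent-investigation"},
    {"tenant": "tenant-a", "wf": "report"},  # over quota, must wait
    {"tenant": "tenant-b", "wf": "alerting"},
])

def schedule() -> list:
    """Admit pending workflows without letting any tenant exceed its quota."""
    admitted, deferred = [], deque()
    while pending:
        item = pending.popleft()
        tenant = item["tenant"]
        if running[tenant] < quotas[tenant]:
            running[tenant] += 1
            admitted.append(item)
        else:
            deferred.append(item)            # stays queued, not dropped
    pending.extend(deferred)
    return admitted

batch = schedule()                           # tenant-a's third wf is deferred
```

A priority tier would sort `pending` before admission; the quota check itself is unchanged.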

What ES Provides That Other Orchestrators Don't

Temporal, Conductor, and Step Functions are proven workflow engines. But they store workflow state in their own databases (Cassandra, MySQL, Postgres). Building on ES gives:

  1. Queryable execution history — ES|QL, aggregations, Kibana dashboards over all workflow executions. Not a separate monitoring system — the workflow data is the observability data.

  2. Full-text search on step payloads — "Find every AI investigation where the agent mentioned 'lateral movement'" is a search query, not a log grep.

  3. ILM for lifecycle — Completed workflow data ages through hot/warm/cold/frozen tiers automatically. No manual cleanup, no TTL application code.

  4. Unified platform — The workflow engine, the data it operates on, and the UI that displays it are all the same system. No integration glue.

  5. No additional infrastructure — ES is already running. No Temporal server, no Cassandra cluster, no additional operational burden.

Build vs. Adopt

| Option | Pros | Cons |
| --- | --- | --- |
| Build on ES | No new infra, queryable state, unified platform | Years of work, unproven at orchestration scale |
| Adopt Temporal | Proven, battle-tested, multi-language | Another database (Cassandra/MySQL), another system to operate |
| Temporal with ES persistence | Temporal's orchestration + ES's queryability | Requires building an ES persistence plugin for Temporal; Temporal's data model may not map cleanly to ES |
| Evolve Task Manager | Incremental, low risk | Preserves the fundamental architectural problems |

Migration Path

  1. Define workflow state as ES documents. Index mappings for workflow instances and step executions. This is the foundation — get the data model right.

  2. Build a workflow engine as a standalone service. Not in Kibana. Reads workflow definitions from ES, manages step scheduling, writes state to ES. Can be Go, Rust, or Java.

  3. Define the step executor protocol. HTTP-based: claim, heartbeat, complete, fail. Publish as a public API.

  4. Migrate one workflow. Pick something contained — report generation. Build it as a real workflow with step executors. Prove the model.

  5. Kibana becomes a step executor and UI. Task Manager internally becomes a "Kibana step executor" that speaks the new protocol. Existing task types run unchanged — they're just steps now.

  6. Migrate remaining task types. Alerting, fleet, ML — each becomes a workflow definition with appropriate step executors.

  7. Ship customer-facing workflow authoring. The workflow definition format becomes a product surface. Customers define, modify, and compose workflows through Kibana UI or API.

  8. AI agents use workflows natively. Agent capabilities are workflow definitions. The agent framework invokes workflows, not raw tool calls. Guardrails are enforced by the engine, not the agent code.

Summary

| | Task Manager (current) | Workflow Engine (proposed) |
| --- | --- | --- |
| Execution model | Atomic tasks, black box | Multi-step workflows, observable per step |
| Scheduling | Polling + optimistic claim | Event-driven or change-notification based |
| Runtime coupling | Node.js in Kibana | Any language, any runtime |
| Scaling | Coupled to Kibana instances | Workers scale independently |
| Tenant isolation | None | Per-tenant quotas, priority, data scoping |
| Agent guardrails | Application code | Declarative workflow constraints |
| Observability | Task status field | Full step-level execution history in ES |
| Customer extensibility | Kibana plugin development | Workflow definitions + custom step executors |

The transition from tasks to workflows is not a refactor of Task Manager. It's a recognition that background execution in Elastic is a platform concern, not a Kibana concern, and that the primitive should be the workflow — not the task.
