Date: 2026-02-22
Build a production-grade Fly control plane inside jido_lib for all GitHub bots while preserving current synchronous bot APIs.
Core stack:
- Fly Machines API writes via req_fly
- Fly GraphQL API for read-side metadata/inventory
- Optional FLAME execution profile for burst workloads
- ETS-first queue/state/lease/idempotency storage
- Existing Jido/Runic bot workflows + jido_vfs artifact checkpoints
- jido_harness remains provider normalization only.
- jido_runic remains workflow/delegation runtime.
- jido_vfs remains artifact persistence boundary.
- jido_lib owns GitHub orchestration and control-plane behavior.
- No Postgres/Oban in v1 (ETS-first).
- GraphQL is read-side only; do not depend on GraphQL for critical write paths.
Add lib/jido_lib/github/control_plane.ex with:
@spec submit(atom(), map(), keyword()) :: {:ok, Jido.Lib.Github.ControlPlane.RunRef.t()} | {:error, term()}
@spec await(String.t(), keyword()) :: {:ok, map()} | {:error, term()}
@spec get(String.t()) :: {:ok, Jido.Lib.Github.ControlPlane.Run.t()} | {:error, term()}
@spec list(map()) :: [Jido.Lib.Github.ControlPlane.Run.t()]
@spec cancel(String.t(), keyword()) :: :ok | {:error, term()}
@spec retry(String.t(), keyword()) :: {:ok, Jido.Lib.Github.ControlPlane.RunRef.t()} | {:error, term()}
@spec reconcile(keyword()) :: {:ok, map()} | {:error, term()}
Behavior:
- submit/3: validates bot + intake, creates queued run, returns RunRef.
- await/2: blocks until terminal state or timeout.
- get/1: full run envelope.
- list/1: filter by status/bot/owner/repo.
- cancel/2: marks cancellation and propagates to worker/provider.
- retry/2: creates a new attempt from prior run envelope.
- reconcile/1: executes orphan/stale run reconciliation pass.
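The blocking behavior of await/2 can be pictured as a poll against the run store. The following is only a minimal sketch under stated assumptions: `AwaitSketch`, the `fetch` callback, and the `:timeout` / `:poll_interval` options are hypothetical names, and a real implementation might subscribe to terminal events instead of polling. The terminal statuses are the ones listed in this plan.

```elixir
defmodule AwaitSketch do
  @moduledoc false
  # Terminal statuses as listed in the plan's state set.
  @terminal [:succeeded, :failed, :canceled, :timed_out]

  # `fetch` is a hypothetical zero-arity function standing in for get/1.
  def await(fetch, opts \\ []) when is_function(fetch, 0) do
    timeout = Keyword.get(opts, :timeout, 5_000)
    interval = Keyword.get(opts, :poll_interval, 50)
    deadline = System.monotonic_time(:millisecond) + timeout
    loop(fetch, interval, deadline)
  end

  defp loop(fetch, interval, deadline) do
    case fetch.() do
      # Run reached a terminal status: return the envelope.
      {:ok, %{status: status} = run} when status in @terminal ->
        {:ok, run}

      # Still in flight: sleep and poll again unless the deadline passed.
      {:ok, _run} ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, :timeout}
        else
          Process.sleep(interval)
          loop(fetch, interval, deadline)
        end

      # Lookup failures are returned to the caller unchanged.
      {:error, _reason} = error ->
        error
    end
  end
end
```

Injecting the lookup as a function keeps the loop testable without a running store.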
For all bot agents under lib/jido_lib/github/agents/:
- IssueTriageBot
- IssueTriageCriticBot
- PrBot
- QualityBot
- ReleaseBot
- RoadmapBot
Add additive APIs:
- enqueue_* helper returning RunRef
- Existing sync methods remain available
- Add mode: :inline | :control_plane (default :inline for backward compatibility)
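One way the mode option could select between the existing synchronous path and the new queued path is sketched below. `ModeSketch`, `inline`, and `enqueue` are hypothetical stand-ins (for the current sync call and an enqueue_* helper); only the `:mode` option and its `:inline` default come from this plan.

```elixir
defmodule ModeSketch do
  @moduledoc false

  # `inline` and `enqueue` are hypothetical one-arity callbacks standing in
  # for the existing synchronous call and the new enqueue_* helper.
  def run(intake, opts, inline, enqueue)
      when is_function(inline, 1) and is_function(enqueue, 1) do
    case Keyword.get(opts, :mode, :inline) do
      # Default: behave exactly as before.
      :inline -> inline.(intake)
      # Opt-in: hand the intake to the control plane.
      :control_plane -> enqueue.(intake)
      # Anything else is rejected rather than silently defaulted.
      other -> {:error, {:invalid_mode, other}}
    end
  end
end
```

Because `:inline` is the default, callers that never pass `:mode` see no change in behavior.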
Suggested wrappers:
- IssueTriageBot.enqueue_issue/2
- IssueTriageCriticBot.enqueue_issue/2
- PrBot.enqueue_issue/2
- QualityBot.enqueue_target/2
- ReleaseBot.enqueue_repo/2
- RoadmapBot.enqueue_plan/2
Extend existing tasks with:
- --control-plane
- --async
- --wait
- --run-id
Add operator tasks:
- mix jido_lib.github.runs
- mix jido_lib.github.runs.cancel <run_id>
- mix jido_lib.github.runs.retry <run_id>
- mix jido_lib.github.runs.reconcile
CLI semantics:
- --async returns immediately with run ref.
- --wait blocks for terminal result.
- If both absent and --control-plane is present, default to wait.
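The flag semantics above can be sketched with the standard library's `OptionParser`. `CliModeSketch` and its return atoms are hypothetical. Two cases are not specified by this plan and are assumptions here: passing both --async and --wait is treated as an error, and runs without --control-plane stay on the inline path.

```elixir
defmodule CliModeSketch do
  @moduledoc false
  @switches [control_plane: :boolean, async: :boolean, wait: :boolean, run_id: :string]

  def resolve(argv) when is_list(argv) do
    # OptionParser maps --control-plane to the :control_plane key.
    {opts, _rest, _invalid} = OptionParser.parse(argv, strict: @switches)

    control_plane? = Keyword.get(opts, :control_plane, false)
    async? = Keyword.get(opts, :async, false)
    wait? = Keyword.get(opts, :wait, false)

    cond do
      # Assumption: the plan does not say what happens when both are given.
      async? and wait? -> {:error, :conflicting_flags}
      # Assumption: without --control-plane the task runs inline as today.
      not control_plane? -> :inline
      async? -> :async
      # --wait, or neither flag: default to wait.
      true -> :wait
    end
  end
end
```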
Create under lib/jido_lib/github/control_plane/:
- supervisor.ex
- state_machine.ex
- queue.ex
- scheduler.ex
- dispatcher.ex
- worker.ex
- run_store.ex
- lease_store.ex
- idempotency_store.ex
- reconciler.ex
- quota.ex
- policy.ex
- telemetry.ex
- run.ex
- run_ref.ex
- event.ex
Create under lib/jido_lib/github/platform/fly/:
- client.ex (behaviour)
- req_fly_client.ex (Machines write path)
- graphql_client.ex (read-side path)
- machine_spec.ex (deterministic machine payloads)
Create under lib/jido_lib/github/control_plane/executor/:
- direct.ex (default path)
- flame.ex (optional)
Keep using:
- lib/jido_lib/bots/foundation/artifact_store.ex
- lib/jido_lib/bots/foundation/role_runner.ex
- Existing bot result contracts (no forced schema unification)
Root supervisor: Jido.Lib.Github.ControlPlane.Supervisor
Children:
- RunStore (ETS owner)
- LeaseStore (ETS owner)
- IdempotencyStore (ETS owner)
- Queue (GenServer)
- Scheduler (GenServer with tick)
- WorkerSupervisor (DynamicSupervisor)
- Reconciler (periodic GenServer)
Execution flow:
- submit validates and enqueues run.
- Scheduler admits run by quota/policy.
- Dispatcher starts worker.
- Worker acquires lease.
- Worker executes bot via selected executor profile.
- Worker checkpoints manifest/artifacts.
- Worker publishes/comments idempotently.
- Worker emits terminal event and releases lease.
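The idempotent publish step in the flow above could rest on an atomic claim in ETS. This is a minimal sketch, not the planned IdempotencyStore: `IdempotencySketch`, the table name, and the key shape are hypothetical. It relies on `:ets.insert_new/2`, which inserts only when the key is absent.

```elixir
defmodule IdempotencySketch do
  @moduledoc false
  # Hypothetical table name; in the plan this role belongs to IdempotencyStore.
  @table :idempotency_sketch

  def init do
    :ets.new(@table, [:named_table, :set, :public])
    :ok
  end

  # Runs `effect` at most once per key. The claim is atomic, so concurrent
  # callers with the same key cannot both run the side effect.
  def once(key, effect) when is_function(effect, 0) do
    if :ets.insert_new(@table, {key, :claimed}) do
      result = effect.()
      # Record the outcome so later lookups can see what was published.
      :ets.insert(@table, {key, {:done, result}})
      {:ok, result}
    else
      {:skipped, :already_claimed}
    end
  end
end
```

Note the limitation of this sketch: if `effect` raises, the key stays claimed and the side effect is never retried. A fuller version would need a policy for releasing or expiring abandoned claims.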
States:
- :queued
- :admitted
- :provisioning
- :running
- :finalizing
- :publishing
- :succeeded
- :failed
- :canceled
- :timed_out
Rules:
- All transitions validated in state_machine.ex.
- Terminal states are immutable.
- retry creates a new run attempt rather than mutating terminal state.
- Every transition emits telemetry + control-plane event.
- manifest.json checkpoint updated at each major phase.
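A transition validator for these rules might look like the sketch below. The state names come from this plan, but the edges are an assumption: the plan lists states without a transition table, so this sketch assumes a linear happy path plus :failed, :canceled, and :timed_out being reachable from any non-terminal state. `StateMachineSketch` is a hypothetical module name.

```elixir
defmodule StateMachineSketch do
  @moduledoc false
  @terminal [:succeeded, :failed, :canceled, :timed_out]
  @abort [:failed, :canceled, :timed_out]

  # Assumed happy-path edges; the plan names the states but not the edges.
  @forward %{
    queued: :admitted,
    admitted: :provisioning,
    provisioning: :running,
    running: :finalizing,
    finalizing: :publishing,
    publishing: :succeeded
  }

  def terminal?(state), do: state in @terminal

  def transition(from, to) do
    cond do
      # Terminal states are immutable.
      terminal?(from) -> {:error, {:terminal_state, from}}
      not Map.has_key?(@forward, from) -> {:error, {:unknown_state, from}}
      # The next step on the happy path.
      Map.fetch!(@forward, from) == to -> {:ok, to}
      # Assumption: any in-flight run may fail, be canceled, or time out.
      to in @abort -> {:ok, to}
      true -> {:error, {:invalid_transition, from, to}}
    end
  end
end
```

Keeping the function pure leaves telemetry and event emission to the caller, which can wrap every `{:ok, to}` result.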
Authoritative path for:
- Machine create/start/stop/restart/destroy
- Metadata tags (run_id, bot, attempt, repo, owner)
- TTL/cleanup metadata
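As an illustration of deterministic metadata tags, the sketch below derives a string-keyed map from a run. `MachineMetadataSketch` and the shape of the `run` map are hypothetical; only the five tag names come from this plan, and how the map is embedded in a Machines API payload is left open here.

```elixir
defmodule MachineMetadataSketch do
  @moduledoc false
  # Tag names as listed in the plan.
  @keys [:run_id, :bot, :attempt, :repo, :owner]

  # Builds the same string-to-string map for the same run, ignoring any
  # other fields, and raises if a required tag is missing.
  def build(run) when is_map(run) do
    Map.new(@keys, fn key ->
      {Atom.to_string(key), to_string(Map.fetch!(run, key))}
    end)
  end
end
```

Stringifying every value keeps the output stable regardless of whether the caller passes atoms, integers, or binaries.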
Read-only usage for:
- Fleet inventory
- Region/capacity metadata
- Historical lookup/diagnostic enrichment
- execution_profile: :direct | :flame
- Default profile is :direct
- :flame enabled only when configured and available
- Clear fallback policy (fallback_to_direct?)
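The profile selection and fallback policy could reduce to a small pure function. In this sketch `ProfileSketch` and the `flame_available?` argument are hypothetical, and defaulting `fallback_to_direct?` to false is an assumption; the plan names the option but not its default.

```elixir
defmodule ProfileSketch do
  @moduledoc false

  # `flame_available?` is a hypothetical flag the caller computes from
  # configuration and runtime checks.
  def resolve(opts, flame_available?) when is_boolean(flame_available?) do
    profile = Keyword.get(opts, :execution_profile, :direct)
    # Assumption: no fallback unless explicitly requested.
    fallback? = Keyword.get(opts, :fallback_to_direct?, false)

    case {profile, flame_available?, fallback?} do
      {:direct, _, _} -> {:ok, :direct}
      {:flame, true, _} -> {:ok, :flame}
      # FLAME requested but unavailable: fall back only when allowed.
      {:flame, false, true} -> {:ok, :direct}
      {:flame, false, false} -> {:error, :flame_unavailable}
      {other, _, _} -> {:error, {:unknown_profile, other}}
    end
  end
end
```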
- Bounded retries with backoff + jitter
- Lease expiration and stale-worker takeover
- Cancellation propagation to provider + machine
- Reconciler for orphan machines and stale queued/running runs
- Idempotency keys for publish/comment side effects
- Fail-closed on invalid provider/runtime prerequisites
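Two of the items above, bounded backoff with jitter and lease expiration with stale-worker takeover, are sketched below. All module names, option names, and defaults are hypothetical. The backoff uses the "full jitter" variant (a uniform draw up to a capped exponential ceiling), which is one common choice rather than something this plan prescribes. The lease takes `now` as an argument so tests stay deterministic.

```elixir
defmodule BackoffSketch do
  @moduledoc false

  # Returns a delay in 1..min(cap, base * 2^(attempt - 1)) milliseconds.
  def delay_ms(attempt, opts \\ []) when is_integer(attempt) and attempt >= 1 do
    base = Keyword.get(opts, :base_ms, 500)
    cap = Keyword.get(opts, :cap_ms, 30_000)
    # Clamp the exponent so very large attempt counts stay cheap to compute.
    ceiling = min(cap, base * Integer.pow(2, min(attempt - 1, 20)))
    :rand.uniform(ceiling)
  end
end

defmodule LeaseSketch do
  @moduledoc false
  # Hypothetical table name; in the plan this role belongs to LeaseStore.
  @table :lease_sketch

  def init do
    :ets.new(@table, [:named_table, :set, :public])
    :ok
  end

  # `now` and `ttl_ms` are in milliseconds on whatever clock the caller uses.
  def acquire(run_id, owner, ttl_ms, now) do
    lease = {run_id, owner, now + ttl_ms}

    case :ets.lookup(@table, run_id) do
      [] ->
        claim(lease, :acquired)

      [{_id, _owner, expires_at} = stale] when expires_at <= now ->
        # Delete only the exact stale row; if another worker already replaced
        # it, this is a no-op and the insert_new below fails.
        :ets.delete_object(@table, stale)
        claim(lease, :taken_over)

      [_live] ->
        {:error, :held}
    end
  end

  defp claim(lease, tag) do
    if :ets.insert_new(@table, lease), do: {:ok, tag}, else: {:error, :held}
  end
end
```

Pairing `:ets.delete_object/2` with `:ets.insert_new/2` means at most one contender wins a takeover, even when several workers observe the same expired lease.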
- Commit pending quality fixes in:
  - jido_workspace
  - jido_runic
  - jido_codex
  - jido_gemini
- Verify mix quality passes in touched repos.
- Add stores + queue + scheduler + worker scaffolding.
- Add run/event structs and telemetry hooks.
- Add deterministic unit tests for transitions, retries, cancellation.
- Add Fly behaviour + req_fly Machines client.
- Add GraphQL read adapter.
- Add deterministic machine payload/spec tests.
- Add enqueue_* API and mode: :control_plane.
- Keep synchronous APIs unchanged by default.
- Add inline-vs-queued parity tests.
- Extend task flags.
- Add operator tasks (runs, cancel, retry, reconcile).
- Add deterministic task tests.
- Add orphan sweeps, dead-letter categorization.
- Harden race handling around cancellation and late completion.
- Add chaos-style deterministic tests.
Add docs:
- docs/fly_control_plane_architecture.md
- docs/fly_control_plane_ops.md
- docs/fly_control_plane_rollout.md
Update:
- README.md
- mix.exs docs.extras
- test/jido_lib/github/control_plane/state_machine_test.exs
- test/jido_lib/github/control_plane/queue_test.exs
- test/jido_lib/github/control_plane/scheduler_test.exs
- test/jido_lib/github/control_plane/run_store_test.exs
- test/jido_lib/github/control_plane/reconciler_test.exs
- test/jido_lib/github/platform/fly/machine_spec_test.exs
- test/jido_lib/github/platform/fly/req_fly_client_test.exs
- test/jido_lib/github/platform/fly/graphql_client_test.exs
- Add queued-mode tests for all bot run suites under test/jido_lib/github/agents/.
- Assert result-map compatibility with existing inline outputs.
- Extend task tests for --control-plane, --async, --wait, --run-id.
- Add run-operator task tests.
- Machines lifecycle write calls.
- GraphQL read queries.
- Optional FLAME profile smoke.
- Idempotent repost behavior with repeated run_id.
- mix test
- mix quality
- Ignore jido_workspace_scenarios for this workstream.
- All GitHub bots support control-plane mode and keep synchronous mode behavior.
- Fly Machines write path uses req_fly boundary.
- GraphQL is read-only in control plane.
- FLAME profile is optional and tested.
- ETS queue/state/lease/reconcile are covered by deterministic tests.
- Operator run-management Mix tasks are implemented.
- Documentation is publish-ready for internal/external handoff.
- jido_lib quality and test gates are green.
- req_fly is the canonical Fly write client.
- ETS durability is acceptable for v1.
- .jido/runs/<run_id>/manifest.json is the audit/recovery backbone.
- Existing bot result shapes are preserved.
- Feature flags gate rollout by bot and execution profile.