Created
March 8, 2026 01:24
freestyle migration
| Work on Linear issue JJH-101: | |
| <issue identifier="JJH-101"> | |
| <title>Migrate agent tasks + workspaces from GKE Sandbox to Freestyle VMs</title> | |
| <description> | |
| ## Summary | |
| Replace our GKE Sandbox (gVisor) runner pods and K8s workspace pods with [Freestyle](https://freestyle.sh) micro-VMs for agent tasks and workspaces. This eliminates ~1,500 lines of undifferentiated infrastructure (runner pool management, heartbeats, WebRTC signaling, PTY orchestration) in favor of Freestyle's managed VM lifecycle, built-in SSH/terminal access, and snapshot caching. | |
| **Scope**: Agent tasks + Workspaces only. CI/workflow steps stay on the existing runner infrastructure. | |
| --- | |
| ## Why | |
| * **Runner pool is undifferentiated infrastructure**: heartbeat polling, `FOR UPDATE SKIP LOCKED` task claiming, stale runner cleanup, pod dispatch — all replaced by a single `POST /vms` API call | |
| * **WebRTC is complex and fragile**: SDP exchange, ICE candidate polling, STUN server dependency — all replaced by Freestyle's built-in SSH access | |
| * **Snapshots enable instant boot**: Pre-built agent base images boot in <800ms vs multi-second K8s pod startup | |
| * **Suspend/resume in <100ms**: Workspaces can suspend on idle and resume instantly, vs cold K8s pod restarts | |
| --- | |
| ## What Changes | |
| ### New code | |
| * `internal/freestyle/` — Thin Go HTTP client for the Freestyle VM REST API (types, client, VM operations) | |
| * `scripts/create-agent-snapshot.ts` — Bun script to create a reusable agent base snapshot | |
| ### Modified code | |
| * `internal/services/agent.go`: `DispatchAgentRun()` creates a Freestyle VM instead of a workflow task. Steps 1-6 and 10 (workflow tracking, token generation, session linking) are infrastructure-agnostic and stay the same. Steps 7-9 (snapshot, payload, task queue) are replaced by VM creation with `gitRepos` + `additionalFiles` | |
| * `internal/services/workspace.go` — Major rewrite: replace K8s pod dispatch + WebRTC signaling with Freestyle VM creation + SSH access | |
| * `internal/routes/workspace.go` — Remove WebRTC endpoint, add SSH connection info endpoint | |
| * `internal/routes/workspace_internal.go` — Remove WebRTC signaling endpoints, simplify to status-only callbacks | |
| * `db/schema.sql`: Schema changes to the `workspaces`, `workspace_sessions`, and `workflow_tasks` tables (remove K8s/WebRTC columns; add `freestyle_vm_id` and `ssh_connection_info`; see "Database Tables Affected") | |
| * `db/queries/workspace.sql` — Update queries to match schema changes | |
| * `cmd/server/main.go` — Wire Freestyle client into services, add config vars | |
| * `internal/config/` — Add `JJHUB_FREESTYLE_API_KEY`, `JJHUB_FREESTYLE_API_URL`, `JJHUB_FREESTYLE_AGENT_SNAPSHOT_ID` | |
| ### Deleted code | |
| * `internal/runner/` — All files (pool.go, claim.go, heartbeat.go, cleanup.go, store.go, types.go, executor/) | |
| * `internal/wsrunner/` — All files (runner.go, client.go) | |
| * `cmd/runner/main.go`, `cmd/runner/factory.go`, `cmd/runner/Dockerfile` | |
| * `cmd/runner/workflow/workspace-pty.ts` | |
| * `infra/helm/jjhub/templates/runner-pool.yaml` | |
| * `infra/k8s/gvisor-runtimeclass.yaml` | |
| ### Kept (agent workflow scripts — now run inside Freestyle VMs) | |
| * `cmd/runner/workflow/agent.ts`, `agent-task.tsx`, `agent-tools.ts`, `agent_event_mapper.ts`, `smithers.ts`, `preload.ts`, `execute-step.ts` | |
| ### Go dependency removals | |
| * `github.com/creack/pty` (PTY) | |
| * `github.com/pion/webrtc/v4` + all pion/\* transitive deps (WebRTC) | |
| * `k8s.io/client-go`, `k8s.io/api`, `k8s.io/apimachinery` (if only used for workspaces — check first) | |
| --- | |
| ## Freestyle API Reference | |
| ### Documentation | |
| * **OpenAPI spec (Scalar UI)**: [https://vm-api.freestyle.sh/](<https://vm-api.freestyle.sh/>) | |
| * **Docs home**: [https://docs.freestyle.sh/v2](<https://docs.freestyle.sh/v2>) | |
| * **VM lifecycle**: [https://docs.freestyle.sh/v2/vms/lifecycle](<https://docs.freestyle.sh/v2/vms/lifecycle>) | |
| * **VM configuration**: [https://docs.freestyle.sh/v2/vms/configuration](<https://docs.freestyle.sh/v2/vms/configuration>) | |
| * **Files & repos**: [https://docs.freestyle.sh/v2/vms/configuration/files-and-repos](<https://docs.freestyle.sh/v2/vms/configuration/files-and-repos>) | |
| * **Systemd services**: [https://docs.freestyle.sh/v2/vms/configuration/systemd-services](<https://docs.freestyle.sh/v2/vms/configuration/systemd-services>) | |
| * **SSH access**: [https://docs.freestyle.sh/v2/vms/ssh](<https://docs.freestyle.sh/v2/vms/ssh>) | |
| * **Persistence**: [https://docs.freestyle.sh/vms/index/persistence](<https://docs.freestyle.sh/vms/index/persistence>) | |
| * **Dashboard (API keys)**: [https://dash.freestyle.sh](<https://dash.freestyle.sh>) | |
| * **npm SDK (TypeScript reference)**: [https://www.npmjs.com/package/freestyle-sandboxes](<https://www.npmjs.com/package/freestyle-sandboxes>) | |
| * **GitHub SDK source**: [https://github.com/freestyle-sh/sandbox_sdks](<https://github.com/freestyle-sh/sandbox_sdks>) | |
| ### Key API Details | |
| * **Auth**: `Authorization: Bearer <FREESTYLE_API_KEY>` | |
| * **Base URL**: `https://api.freestyle.sh` (or the VM API at `https://vm-api.freestyle.sh`) | |
| * **No Go SDK** — we write a thin HTTP client against their REST API | |
| ### Core Endpoints | |
| | Method | Path | Purpose | | |
| | -- | -- | -- | | |
| | POST | `/vms` | Create VM (with gitRepos, additionalFiles, systemd, persistence, snapshotId) | | |
| | GET | `/vms/{id}` | Get VM state | | |
| | DELETE | `/vms/{id}` | Delete VM | | |
| | POST | `/vms/{id}/start` | Start/resume VM | | |
| | POST | `/vms/{id}/stop` | Stop VM | | |
| | POST | `/vms/{id}/suspend` | Suspend VM (preserves memory + disk, <100ms resume) | | |
| | POST | `/vms/{id}/exec-await` | Execute command and wait | | |
| | POST | `/vms/{id}/snapshot` | Snapshot running VM | | |
| | PUT | `/vms/{id}/files/{path}` | Write file to VM | | |
| | POST | `/vms/{id}/systemd/services` | Create systemd service | | |
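The endpoint table above maps naturally onto a small Go client. The following is a minimal sketch, not the Freestyle SDK: the `Client` type, its field names, the `do` helper, and the `{"id": ...}` response shape are all assumptions made for illustration, and the real response schema should be taken from the OpenAPI spec. Only the Bearer header and the `POST /vms` and `DELETE /vms/{id}` routes come from this issue.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Client is a hypothetical thin wrapper around the Freestyle REST API.
type Client struct {
	BaseURL string       // e.g. the value of JJHUB_FREESTYLE_API_URL
	APIKey  string       // e.g. the value of JJHUB_FREESTYLE_API_KEY
	HTTP    *http.Client // injected so tests can point at a stub server
}

// do sends an optional JSON body with the Bearer header and decodes the
// JSON response into out when out is non-nil.
func (c *Client) do(ctx context.Context, method, path string, body, out any) error {
	var buf bytes.Buffer
	if body != nil {
		if err := json.NewEncoder(&buf).Encode(body); err != nil {
			return fmt.Errorf("encode request: %w", err)
		}
	}
	req, err := http.NewRequestWithContext(ctx, method, c.BaseURL+path, &buf)
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+c.APIKey)
	req.Header.Set("Content-Type", "application/json")

	resp, err := c.HTTP.Do(req)
	if err != nil {
		return fmt.Errorf("%s %s: %w", method, path, err)
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("%s %s: unexpected status %d", method, path, resp.StatusCode)
	}
	if out == nil {
		return nil
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// CreateVM issues POST /vms. The {"id": ...} response shape is an assumption.
func (c *Client) CreateVM(ctx context.Context, opts map[string]any) (string, error) {
	var out struct {
		ID string `json:"id"`
	}
	if err := c.do(ctx, http.MethodPost, "/vms", opts, &out); err != nil {
		return "", err
	}
	return out.ID, nil
}

// DeleteVM issues DELETE /vms/{id}.
func (c *Client) DeleteVM(ctx context.Context, id string) error {
	return c.do(ctx, http.MethodDelete, "/vms/"+id, nil, nil)
}

// stubRoundTrip calls CreateVM against a local stub server and reports what
// the server observed, so the sketch runs without a real API key.
func stubRoundTrip() (auth, route, vmID string, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		auth = r.Header.Get("Authorization")
		route = r.Method + " " + r.URL.Path
		fmt.Fprint(w, `{"id":"vm_123"}`)
	}))
	defer srv.Close()

	c := &Client{BaseURL: srv.URL, APIKey: "test-key", HTTP: srv.Client()}
	vmID, err = c.CreateVM(context.Background(), map[string]any{"snapshotId": "snap_1"})
	return auth, route, vmID, err
}

func main() {
	auth, route, vmID, err := stubRoundTrip()
	fmt.Println(auth, route, vmID, err)
}
```

Routing every call through one `do` helper keeps auth, status checks, and error wrapping in a single place, which is also the natural spot to record the API request metrics listed under Monitoring.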
| ### VM Creation Options | |
| * `snapshotId` — Boot from pre-built snapshot (fast boot) | |
| * `gitRepos` — `[{url, path, rev}]` — Clone repos at creation | |
| * `additionalFiles` — `{"/path": {content, encoding, executable}}` — Inject files | |
| * `systemd.services` — `[{name, ExecStart, Type, Restart, ...}]` — Create services | |
| * `persistence` — `{mode: "ephemeral"|"cache"|"persistent", priority: N}` | |
| * `idleTimeoutSeconds` — Auto-suspend after inactivity | |
| * `memSizeMb`, `vcpuCount`, `rootfsSizeMb` — Resource sizing | |
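The creation options above could be modeled as Go request types along these lines. This is a sketch under assumptions: the JSON keys are copied from the option names listed here and should be checked against the OpenAPI spec, the Go type names are invented, `systemd.services` is left out because its full field list is not given in this issue, and the `/workspace` clone path is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GitRepo mirrors one entry of gitRepos: [{url, path, rev}].
type GitRepo struct {
	URL  string `json:"url"`
	Path string `json:"path"`
	Rev  string `json:"rev,omitempty"`
}

// AdditionalFile mirrors one value of additionalFiles: {content, encoding, executable}.
type AdditionalFile struct {
	Content    string `json:"content"`
	Encoding   string `json:"encoding,omitempty"`
	Executable bool   `json:"executable,omitempty"`
}

// Persistence mirrors {mode: "ephemeral"|"cache"|"persistent", priority: N}.
type Persistence struct {
	Mode     string `json:"mode"`
	Priority int    `json:"priority,omitempty"`
}

// CreateVMRequest is a hypothetical body for POST /vms.
type CreateVMRequest struct {
	SnapshotID         string                    `json:"snapshotId,omitempty"`
	GitRepos           []GitRepo                 `json:"gitRepos,omitempty"`
	AdditionalFiles    map[string]AdditionalFile `json:"additionalFiles,omitempty"`
	Persistence        *Persistence              `json:"persistence,omitempty"`
	IdleTimeoutSeconds int                       `json:"idleTimeoutSeconds,omitempty"`
	MemSizeMb          int                       `json:"memSizeMb,omitempty"`
	VCPUCount          int                       `json:"vcpuCount,omitempty"`
	RootfsSizeMb       int                       `json:"rootfsSizeMb,omitempty"`
}

// agentVMRequest builds a request for a short-lived agent VM booted from a
// snapshot with one repo cloned. The "/workspace" path is a placeholder.
func agentVMRequest(snapshotID, repoURL, rev string) CreateVMRequest {
	return CreateVMRequest{
		SnapshotID:  snapshotID,
		GitRepos:    []GitRepo{{URL: repoURL, Path: "/workspace", Rev: rev}},
		Persistence: &Persistence{Mode: "ephemeral"},
	}
}

// encode marshals the request; it returns "" if marshaling fails.
func encode(req CreateVMRequest) string {
	b, err := json.Marshal(req)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(agentVMRequest("snap_1", "https://example.com/repo.git", "main")))
}
```

Using `omitempty` throughout means unset options are simply absent from the payload, so the API's own defaults apply rather than Go zero values.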
| ### SSH Access | |
| ``` | |
| ssh {vmId}:{accessToken}@vm-ssh.freestyle.sh | |
| ssh {vmId}+{username}:{accessToken}@vm-ssh.freestyle.sh | |
| ``` | |
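For the SSH connection info endpoint, the two address forms above can be produced by one small helper. This is an illustrative sketch: `sshTarget` is a hypothetical name, and how the access token is obtained is outside what this issue specifies.

```go
package main

import "fmt"

// sshHost is the SSH gateway host shown in the address forms above.
const sshHost = "vm-ssh.freestyle.sh"

// sshTarget formats the SSH destination for a VM. With an empty username it
// yields "{vmId}:{accessToken}@host"; otherwise "{vmId}+{username}:{accessToken}@host".
func sshTarget(vmID, username, accessToken string) string {
	user := vmID
	if username != "" {
		user += "+" + username
	}
	return fmt.Sprintf("%s:%s@%s", user, accessToken, sshHost)
}

func main() {
	fmt.Println("ssh " + sshTarget("vm_1", "", "tok"))
	fmt.Println("ssh " + sshTarget("vm_1", "dev", "tok"))
}
```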
| ### Performance | |
| | Operation | Latency | | |
| | -- | -- | | |
| | VM creation (from snapshot) | <800ms | | |
| | Suspend/resume | <100ms | | |
| | Fork | <50ms | | |
| --- | |
| ## Existing Codebase Context | |
| ### Current Agent Dispatch Flow (`DispatchAgentRun`) | |
| 1. Upsert per-repo agent workflow definition | |
| 2. Create workflow run (status="queued") | |
| 3. Create workflow step (name="agent", status="queued") | |
| 4. Generate agent token (`jjhub_agent_` prefix + 40 hex characters; only its SHA-256 hash is persisted) | |
| 5. Store token hash + 24h expiry in workflow_runs | |
| 6. Load message history (best-effort) | |
| 7. **\[INFRA\]** Call `snapshotter.CreateSnapshot()` — repo-host snapshot | |
| 8. **\[INFRA\]** Build task payload JSON with kind="agent" | |
| 9. **\[INFRA\]** Create workflow task (status="pending", `FOR UPDATE SKIP LOCKED` queue) | |
| 10. Link session to workflow run | |
| Steps 1-6 and 10 are infrastructure-agnostic. Steps 7-9 change to Freestyle VM creation. | |
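Steps 4 and 5 stay unchanged, and can be pictured roughly as follows. This is a sketch of the described format only, not the existing implementation: the function name is hypothetical, and it assumes the hash covers the full prefixed token and is stored hex-encoded.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// agentTokenPrefix is the prefix named in step 4.
const agentTokenPrefix = "jjhub_agent_"

// newAgentToken returns a token of the form jjhub_agent_<40 hex chars> and
// the hex-encoded SHA-256 hash that would be stored in workflow_runs.
// Hashing the full prefixed token (rather than only the hex part) is an assumption.
func newAgentToken() (token, hash string, err error) {
	raw := make([]byte, 20) // 20 random bytes encode to 40 hex characters
	if _, err := rand.Read(raw); err != nil {
		return "", "", fmt.Errorf("read random bytes: %w", err)
	}
	token = agentTokenPrefix + hex.EncodeToString(raw)
	sum := sha256.Sum256([]byte(token))
	return token, hex.EncodeToString(sum[:]), nil
}

func main() {
	token, hash, err := newAgentToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(token), len(hash))
}
```

Under the new flow the plaintext token would presumably reach the VM through `additionalFiles` or an environment file at creation time, while only the hash stays in the database.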
| ### Current Workspace Flow | |
| * `CreateSession()` → find or create workspace → dispatch K8s pod with PVC + gVisor | |
| * `dispatchWorkspacePod()` → creates PVC (10Gi RWO) + Pod with runner image, env vars, privileged security context | |
| * `ExchangeWebRTC()` → client/runner SDP + ICE candidate exchange via DB columns | |
| * `DestroyWorkspace()` → stop sessions, update DB, `k8sClient.Pods().Delete()` | |
| * Cleanup: `CleanupIdleSessions()` + `CleanupIdleWorkspaces()` periodic sweeps | |
| ### Key Service Interfaces | |
| * `RepoHostSnapshotter` — `CreateSnapshot(ctx, repoID) (string, error)` — replaced by VM gitRepos clone | |
| * `WorkspaceQuerier` — DB interface with WebRTC methods to remove | |
| * `AgentDispatchQuerier` — workflow task creation queries | |
| ### Database Tables Affected | |
| * `workspaces` — remove `pod_name`, `pvc_name`, add `freestyle_vm_id` | |
| * `workspace_sessions` — remove `client_sdp`, `runner_sdp`, `client_ice_candidates`, `runner_ice_candidates`, add `ssh_connection_info` | |
| * `workflow_tasks` — add `freestyle_vm_id` for tracking VM-backed tasks | |
| ### Existing Monitoring Patterns | |
| * Prometheus metrics via `JJHubMetrics` struct in `internal/routes/metrics.go` | |
| * Custom registry (not global default), `/metrics` endpoint | |
| * Existing metrics: `jjhub_runner_pool_available`, `jjhub_runner_pool_claimed`, `jjhub_active_agent_sessions` | |
| * Alerts in `infra/terraform/modules/monitoring/main.tf` (Cloud Monitoring) | |
| * Structured logging via `slog` with GCP JSON handler | |
| * OpenTelemetry tracing with Cloud Trace exporter | |
| ### Existing Testing Patterns | |
| * **Unit tests**: testify + stdlib, table-driven, hand-written interface mocks, `t.Parallel()` | |
| * **Integration tests**: `*_integration_test.go`, real DB via `JJHUB_TEST_DATABASE_URL`, `-p=1 -parallel=1` | |
| * **E2E tests**: Bun Test in `/e2e/api/`, docker-compose services, real API calls with tokens | |
| * **Test targets**: `make test-go`, `make test-db`, `make test-db-isolated`, `make e2e` | |
| --- | |
| ## Testing Requirements | |
| ### Unit Tests | |
| * Freestyle Go client: mock HTTP server (httptest), verify auth headers, request/response marshaling, error handling | |
| * Modified `DispatchAgentRun()`: mock Freestyle client interface, verify VM creation params | |
| * Modified workspace service: mock Freestyle client, verify VM lifecycle calls | |
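The hand-written mock approach for the service-level tests might look like the sketch below. Everything here is hypothetical scaffolding: `VMCreator`, `mockVMCreator`, and `dispatchToVM` stand in for whatever interface and function the real `DispatchAgentRun()` ends up using, and the table is run from a plain function instead of `testing.T` so the example is self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// VMCreator is a hypothetical narrow interface the service depends on.
type VMCreator interface {
	CreateVM(ctx context.Context, snapshotID, repoURL string) (string, error)
}

// mockVMCreator is a hand-written mock that records the arguments it received.
type mockVMCreator struct {
	vmID string // value to return
	err  error  // error to return

	calls         int
	gotSnapshotID string
	gotRepoURL    string
}

func (m *mockVMCreator) CreateVM(_ context.Context, snapshotID, repoURL string) (string, error) {
	m.calls++
	m.gotSnapshotID, m.gotRepoURL = snapshotID, repoURL
	return m.vmID, m.err
}

// dispatchToVM stands in for the infrastructure-specific part of the dispatch flow.
func dispatchToVM(ctx context.Context, vms VMCreator, snapshotID, repoURL string) (string, error) {
	vmID, err := vms.CreateVM(ctx, snapshotID, repoURL)
	if err != nil {
		return "", fmt.Errorf("create freestyle vm: %w", err)
	}
	return vmID, nil
}

var errAPI = errors.New("freestyle api unavailable")

// runCases is a table-driven check of both the success and error paths.
func runCases() error {
	const snap, repo = "snap_1", "https://example.com/repo.git"
	cases := []struct {
		name    string
		mock    *mockVMCreator
		wantID  string
		wantErr error
	}{
		{"success", &mockVMCreator{vmID: "vm_1"}, "vm_1", nil},
		{"api error is wrapped", &mockVMCreator{err: errAPI}, "", errAPI},
	}
	for _, tc := range cases {
		gotID, err := dispatchToVM(context.Background(), tc.mock, snap, repo)
		if gotID != tc.wantID || !errors.Is(err, tc.wantErr) {
			return fmt.Errorf("%s: got (%q, %v)", tc.name, gotID, err)
		}
		if tc.mock.calls != 1 || tc.mock.gotSnapshotID != snap || tc.mock.gotRepoURL != repo {
			return fmt.Errorf("%s: unexpected CreateVM call: %+v", tc.name, *tc.mock)
		}
	}
	return nil
}

func main() {
	fmt.Println("cases ok:", runCases() == nil)
}
```

In the real test files the same table would sit inside a `TestXxx(t *testing.T)` function with `t.Run` and `t.Parallel()`, matching the existing unit test patterns.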
| ### Integration Tests | |
| * Freestyle client against live API (gated by `JJHUB_FREESTYLE_API_KEY` env var — skip if unset) | |
| * Create VM → verify running state → exec command → verify output → delete VM | |
| * Snapshot creation → boot from snapshot → verify startup latency against the <800ms target in the Performance table | |
| ### E2E Tests (zero mocks) | |
| * Full agent conversation flow: create session → send message → dispatch → agent runs in Freestyle VM → events stream back via SSE → session completes | |
| * Full workspace flow: create workspace → verify Freestyle VM created → verify SSH access → suspend/resume → cleanup on idle | |
| * **These must run against real Freestyle VMs, not mocks.** Gate behind `JJHUB_FREESTYLE_API_KEY`. | |
| --- | |
| ## Monitoring & Alerting Requirements | |
| ### New Prometheus Metrics | |
| * `jjhub_freestyle_vm_create_duration_seconds` (Histogram, labels: `type=agent|workspace`) | |
| * `jjhub_freestyle_vm_create_total` (Counter, labels: `type`, `status=success|error`) | |
| * `jjhub_freestyle_active_vms` (Gauge, labels: `type=agent|workspace`) | |
| * `jjhub_freestyle_vm_suspend_duration_seconds` (Histogram) | |
| * `jjhub_freestyle_api_request_duration_seconds` (Histogram, labels: `method`, `endpoint`) | |
| * `jjhub_freestyle_api_errors_total` (Counter, labels: `endpoint`, `error_code`) | |
| ### New Alerts (add to `infra/terraform/modules/monitoring/main.tf`) | |
| * **CRITICAL**: Freestyle VM creation failure rate > 10% for 5 minutes | |
| * **CRITICAL**: Freestyle API unreachable for 2 minutes | |
| * **WARNING**: VM creation latency p95 > 5 seconds for 5 minutes | |
| * **WARNING**: Active VM count approaching Freestyle plan limits | |
| ### New Dashboard | |
| * Add "freestyle" dashboard JSON to `infra/terraform/modules/monitoring/` | |
| * Panels: VM creation rate, creation latency, active VMs, API error rate, suspend/resume latency | |
| ### Structured Logging | |
| * Log every VM creation with `slog.Info("freestyle vm created", "vm_id", id, "type", "agent|workspace", "duration_ms", dur)` | |
| * Log errors with `slog.Error("freestyle vm creation failed", "error", err, "type", "agent|workspace")` | |
| * Log VM lifecycle events (suspend, resume, delete) | |
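The logging calls above could be wrapped in small helpers so the key names stay consistent across agent and workspace code paths. This sketch uses the standard `slog.NewJSONHandler` as a stand-in for the existing GCP JSON handler; the helper names are hypothetical, and `sampleLine` drops the timestamp only to make the output deterministic.

```go
package main

import (
	"bytes"
	"errors"
	"log/slog"
	"os"
	"time"
)

// logVMCreated emits the success event with the keys listed above.
func logVMCreated(l *slog.Logger, vmID, vmType string, dur time.Duration) {
	l.Info("freestyle vm created", "vm_id", vmID, "type", vmType, "duration_ms", dur.Milliseconds())
}

// logVMCreateFailed emits the failure event with the keys listed above.
func logVMCreateFailed(l *slog.Logger, vmType string, err error) {
	l.Error("freestyle vm creation failed", "error", err, "type", vmType)
}

// sampleLine renders one success event to a buffer without the time field.
func sampleLine() string {
	var buf bytes.Buffer
	h := slog.NewJSONHandler(&buf, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{} // a zero Attr is discarded by the handler
			}
			return a
		},
	})
	logVMCreated(slog.New(h), "vm_1", "agent", 42*time.Millisecond)
	return buf.String()
}

func main() {
	l := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logVMCreated(l, "vm_1", "workspace", 640*time.Millisecond)
	logVMCreateFailed(l, "agent", errors.New("quota exceeded"))
}
```

Passing the logger in explicitly, instead of calling the package-level `slog.Info`, makes the helpers easy to exercise in unit tests with a buffer-backed handler.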
| --- | |
| ## Implementation Notes | |
| * Read specs before coding: `docs/specs/engineering.md`, `docs/specs/infra.md`, `docs/specs/design.md` | |
| * Follow existing service layer patterns: Routes → Services → DB | |
| * Use functional options pattern (`WithFreestyleClient()`) matching existing code | |
| * Scripts use Bun (TypeScript), not bash — per code hygiene rules | |
| * After schema changes: run `make sqlc` to regenerate | |
| * After dependency removals: run `go mod tidy` | |
| * Verify: `go build ./...` succeeds with no dead imports | |
| * Update specs after implementation: `docs/specs/engineering.md` (runner architecture), `docs/specs/infra.md` (remove GKE Sandbox runner section), `docs/specs/design.md` (workspace API changes) | |
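The functional options pattern mentioned in the notes above could be applied as in this sketch. Only the `WithFreestyleClient()` name comes from this issue; the `AgentService`, `Option`, and `FreestyleClient` shapes are assumptions and should be adapted to the constructors that already exist in the service layer.

```go
package main

import "fmt"

// FreestyleClient is a hypothetical interface over the Freestyle HTTP client.
type FreestyleClient interface {
	CreateVM(snapshotID string) (string, error)
}

// AgentService is a stand-in for the real service struct.
type AgentService struct {
	freestyle FreestyleClient
}

// Option configures an AgentService at construction time.
type Option func(*AgentService)

// WithFreestyleClient injects the Freestyle client dependency.
func WithFreestyleClient(c FreestyleClient) Option {
	return func(s *AgentService) { s.freestyle = c }
}

// NewAgentService applies the options in order to a zero-value service.
func NewAgentService(opts ...Option) *AgentService {
	s := &AgentService{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// fakeClient is a trivial implementation used for local runs and tests.
type fakeClient struct{}

func (fakeClient) CreateVM(string) (string, error) { return "vm_fake", nil }

func main() {
	svc := NewAgentService(WithFreestyleClient(fakeClient{}))
	id, err := svc.freestyle.CreateVM("snap_1")
	fmt.Println(id, err)
}
```

Because the option takes an interface, `cmd/server/main.go` can pass the real HTTP client while unit tests pass a mock without changing the constructor signature.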
| </description> | |
| <team name="JJHub"/> | |
| </issue> |