freestyle migration
Work on Linear issue JJH-101:
<issue identifier="JJH-101">
<title>Migrate agent tasks + workspaces from GKE Sandbox to Freestyle VMs</title>
<description>
## Summary
Replace our GKE Sandbox (gVisor) runner pods and K8s workspace pods with [Freestyle](<https://freestyle.sh>) micro-VMs for agent tasks and workspaces. This eliminates \~1,500 lines of undifferentiated infrastructure (runner pool management, heartbeats, WebRTC signaling, PTY orchestration) in favor of Freestyle's managed VM lifecycle, built-in SSH/terminal access, and snapshot caching.
**Scope**: Agent tasks + Workspaces only. CI/workflow steps stay on the existing runner infrastructure.
---
## Why
* **Runner pool is undifferentiated infrastructure**: heartbeat polling, `FOR UPDATE SKIP LOCKED` task claiming, stale runner cleanup, pod dispatch — all replaced by a single `POST /vms` API call
* **WebRTC is complex and fragile**: SDP exchange, ICE candidate polling, STUN server dependency — all replaced by Freestyle's built-in SSH access
* **Snapshots enable instant boot**: Pre-built agent base images boot in <800ms vs multi-second K8s pod startup
* **Suspend/resume in <100ms**: Workspaces can suspend on idle and resume instantly, vs cold K8s pod restarts
---
## What Changes
### New code
* `internal/freestyle/` — Thin Go HTTP client for the Freestyle VM REST API (types, client, VM operations)
* `scripts/create-agent-snapshot.ts` — Bun script to create a reusable agent base snapshot
### Modified code
* `internal/services/agent.go` — `DispatchAgentRun()` creates a Freestyle VM instead of a workflow task. Steps 1-6 (workflow tracking, token generation) are infrastructure-agnostic and stay the same. Steps 7-9 (snapshot, payload, task queue) change to VM creation with `gitRepos` + `additionalFiles`
* `internal/services/workspace.go` — Major rewrite: replace K8s pod dispatch + WebRTC signaling with Freestyle VM creation + SSH access
* `internal/routes/workspace.go` — Remove WebRTC endpoint, add SSH connection info endpoint
* `internal/routes/workspace_internal.go` — Remove WebRTC signaling endpoints, simplify to status-only callbacks
* `db/schema.sql` — Schema changes to `workspaces` and `workspace_sessions` tables (remove K8s/WebRTC columns, add `freestyle_vm_id`, `ssh_connection_info`)
* `db/queries/workspace.sql` — Update queries to match schema changes
* `cmd/server/main.go` — Wire Freestyle client into services, add config vars
* `internal/config/` — Add `JJHUB_FREESTYLE_API_KEY`, `JJHUB_FREESTYLE_API_URL`, `JJHUB_FREESTYLE_AGENT_SNAPSHOT_ID`
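A minimal sketch of loading the three new config vars. `FreestyleConfig`, `LoadFreestyleConfig`, and the fallback URL are hypothetical names and choices, not existing code; the real `internal/config/` package may load env vars differently.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// FreestyleConfig groups the three env vars named in this issue.
type FreestyleConfig struct {
	APIKey          string
	APIURL          string
	AgentSnapshotID string
}

// defaultFreestyleAPIURL is an assumed fallback; the issue lists two
// candidate hosts, so confirm which one serves the VM endpoints.
const defaultFreestyleAPIURL = "https://api.freestyle.sh"

// LoadFreestyleConfig reads config through getenv so tests can inject values.
func LoadFreestyleConfig(getenv func(string) string) (FreestyleConfig, error) {
	cfg := FreestyleConfig{
		APIKey:          getenv("JJHUB_FREESTYLE_API_KEY"),
		APIURL:          getenv("JJHUB_FREESTYLE_API_URL"),
		AgentSnapshotID: getenv("JJHUB_FREESTYLE_AGENT_SNAPSHOT_ID"),
	}
	if cfg.APIKey == "" {
		return FreestyleConfig{}, errors.New("JJHUB_FREESTYLE_API_KEY is required")
	}
	if cfg.APIURL == "" {
		cfg.APIURL = defaultFreestyleAPIURL
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadFreestyleConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("freestyle API URL:", cfg.APIURL)
}
```

Passing `getenv` as a parameter keeps the loader testable with `t.Parallel()`, since no test has to mutate the process environment.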
### Deleted code
* `internal/runner/` — All files (pool.go, claim.go, heartbeat.go, cleanup.go, store.go, types.go, executor/)
* `internal/wsrunner/` — All files (runner.go, client.go)
* `cmd/runner/main.go`, `cmd/runner/factory.go`, `cmd/runner/Dockerfile`
* `cmd/runner/workflow/workspace-pty.ts`
* `infra/helm/jjhub/templates/runner-pool.yaml`
* `infra/k8s/gvisor-runtimeclass.yaml`
### Kept (agent workflow scripts — now run inside Freestyle VMs)
* `cmd/runner/workflow/agent.ts`, `agent-task.tsx`, `agent-tools.ts`, `agent_event_mapper.ts`, `smithers.ts`, `preload.ts`, `execute-step.ts`
### Go dependency removals
* `github.com/creack/pty` (PTY)
* `github.com/pion/webrtc/v4` + all pion/\* transitive deps (WebRTC)
* `k8s.io/client-go`, `k8s.io/api`, `k8s.io/apimachinery` (if only used for workspaces — check first)
---
## Freestyle API Reference
### Documentation
* **OpenAPI spec (Scalar UI)**: [https://vm-api.freestyle.sh/](<https://vm-api.freestyle.sh/>)
* **Docs home**: [https://docs.freestyle.sh/v2](<https://docs.freestyle.sh/v2>)
* **VM lifecycle**: [https://docs.freestyle.sh/v2/vms/lifecycle](<https://docs.freestyle.sh/v2/vms/lifecycle>)
* **VM configuration**: [https://docs.freestyle.sh/v2/vms/configuration](<https://docs.freestyle.sh/v2/vms/configuration>)
* **Files & repos**: [https://docs.freestyle.sh/v2/vms/configuration/files-and-repos](<https://docs.freestyle.sh/v2/vms/configuration/files-and-repos>)
* **Systemd services**: [https://docs.freestyle.sh/v2/vms/configuration/systemd-services](<https://docs.freestyle.sh/v2/vms/configuration/systemd-services>)
* **SSH access**: [https://docs.freestyle.sh/v2/vms/ssh](<https://docs.freestyle.sh/v2/vms/ssh>)
* **Persistence**: [https://docs.freestyle.sh/vms/index/persistence](<https://docs.freestyle.sh/vms/index/persistence>)
* **Dashboard (API keys)**: [https://dash.freestyle.sh](<https://dash.freestyle.sh>)
* **npm SDK (TypeScript reference)**: [https://www.npmjs.com/package/freestyle-sandboxes](<https://www.npmjs.com/package/freestyle-sandboxes>)
* **GitHub SDK source**: [https://github.com/freestyle-sh/sandbox_sdks](<https://github.com/freestyle-sh/sandbox_sdks>)
### Key API Details
* **Auth**: `Authorization: Bearer <FREESTYLE_API_KEY>`
* **Base URL**: `https://api.freestyle.sh` (or the VM API at `https://vm-api.freestyle.sh`)
* **No Go SDK** — we write a thin HTTP client against their REST API
### Core Endpoints
| Method | Path | Purpose |
| -- | -- | -- |
| POST | `/vms` | Create VM (with gitRepos, additionalFiles, systemd, persistence, snapshotId) |
| GET | `/vms/{id}` | Get VM state |
| DELETE | `/vms/{id}` | Delete VM |
| POST | `/vms/{id}/start` | Start/resume VM |
| POST | `/vms/{id}/stop` | Stop VM |
| POST | `/vms/{id}/suspend` | Suspend VM (preserves memory + disk, <100ms resume) |
| POST | `/vms/{id}/exec-await` | Execute command and wait |
| POST | `/vms/{id}/snapshot` | Snapshot running VM |
| PUT | `/vms/{id}/files/{path}` | Write file to VM |
| POST | `/vms/{id}/systemd/services` | Create systemd service |
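A rough sketch of what the thin client in `internal/freestyle/` could look like for two of the endpoints above, exercised against an `httptest` fake rather than the live API. The `Client` type, its method names, and the `id` field in the create response are assumptions; check the OpenAPI spec for the real response shape.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// Client is a minimal Freestyle REST client (hypothetical shape).
type Client struct {
	BaseURL string // e.g. the value of JJHUB_FREESTYLE_API_URL
	APIKey  string // e.g. the value of JJHUB_FREESTYLE_API_KEY
	HTTP    *http.Client
}

// do sends an optional JSON body with the Bearer token and decodes the
// JSON reply into out when out is non-nil.
func (c *Client) do(ctx context.Context, method, path string, in, out any) error {
	var body io.Reader
	if in != nil {
		buf, err := json.Marshal(in)
		if err != nil {
			return fmt.Errorf("marshal request: %w", err)
		}
		body = bytes.NewReader(buf)
	}
	req, err := http.NewRequestWithContext(ctx, method, c.BaseURL+path, body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+c.APIKey)
	if in != nil {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := c.HTTP.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		msg, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
		return fmt.Errorf("freestyle %s %s: status %d: %s", method, path, resp.StatusCode, msg)
	}
	if out == nil {
		return nil
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// CreateVM calls POST /vms. The "id" response field is an assumption.
func (c *Client) CreateVM(ctx context.Context, req map[string]any) (string, error) {
	var out struct {
		ID string `json:"id"`
	}
	err := c.do(ctx, http.MethodPost, "/vms", req, &out)
	return out.ID, err
}

// DeleteVM calls DELETE /vms/{id}.
func (c *Client) DeleteVM(ctx context.Context, id string) error {
	return c.do(ctx, http.MethodDelete, "/vms/"+id, nil, nil)
}

// demo runs create + delete against a local fake that checks the auth header.
func demo(serverKey, clientKey string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.Header.Get("Authorization") != "Bearer "+serverKey:
			http.Error(w, "unauthorized", http.StatusUnauthorized)
		case r.Method == http.MethodPost && r.URL.Path == "/vms":
			fmt.Fprint(w, `{"id":"vm-123"}`)
		default:
			w.WriteHeader(http.StatusNoContent)
		}
	}))
	defer srv.Close()

	c := &Client{BaseURL: srv.URL, APIKey: clientKey, HTTP: srv.Client()}
	ctx := context.Background()
	id, err := c.CreateVM(ctx, map[string]any{"snapshotId": "snap-1"})
	if err != nil {
		return "", err
	}
	return id, c.DeleteVM(ctx, id)
}

func main() {
	id, err := demo("test-key", "test-key")
	fmt.Println(id, err)
}
```

The same fake-server pattern covers the unit-test requirements for auth headers and error handling without touching the network.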
### VM Creation Options
* `snapshotId` — Boot from pre-built snapshot (fast boot)
* `gitRepos` — `[{url, path, rev}]` — Clone repos at creation
* `additionalFiles` — `{"/path": {content, encoding, executable}}` — Inject files
* `systemd.services` — `[{name, ExecStart, Type, Restart, ...}]` — Create services
* `persistence` — `{mode: "ephemeral"|"cache"|"persistent", priority: N}`
* `idleTimeoutSeconds` — Auto-suspend after inactivity
* `memSizeMb`, `vcpuCount`, `rootfsSizeMb` — Resource sizing
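The options above could map onto Go request types roughly as follows. Top-level JSON keys are taken from this list; the nested field casing, the omission of `systemd.services`, the `agentRequest` helper, and the `/workspace/...` paths are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GitRepo is one entry of gitRepos: [{url, path, rev}].
type GitRepo struct {
	URL  string `json:"url"`
	Path string `json:"path"`
	Rev  string `json:"rev,omitempty"`
}

// File is one value of additionalFiles: {content, encoding, executable}.
type File struct {
	Content    string `json:"content"`
	Encoding   string `json:"encoding,omitempty"`
	Executable bool   `json:"executable,omitempty"`
}

// Persistence mirrors {mode, priority}.
type Persistence struct {
	Mode     string `json:"mode"` // "ephemeral", "cache", or "persistent"
	Priority int    `json:"priority,omitempty"`
}

// CreateVMRequest is the body for POST /vms (systemd services omitted here).
type CreateVMRequest struct {
	SnapshotID         string          `json:"snapshotId,omitempty"`
	GitRepos           []GitRepo       `json:"gitRepos,omitempty"`
	AdditionalFiles    map[string]File `json:"additionalFiles,omitempty"`
	Persistence        *Persistence    `json:"persistence,omitempty"`
	IdleTimeoutSeconds int             `json:"idleTimeoutSeconds,omitempty"`
	MemSizeMb          int             `json:"memSizeMb,omitempty"`
	VcpuCount          int             `json:"vcpuCount,omitempty"`
	RootfsSizeMb       int             `json:"rootfsSizeMb,omitempty"`
}

// Validate rejects persistence modes outside the three listed above.
func (r CreateVMRequest) Validate() error {
	if r.Persistence == nil {
		return nil
	}
	switch r.Persistence.Mode {
	case "ephemeral", "cache", "persistent":
		return nil
	}
	return fmt.Errorf("unknown persistence mode %q", r.Persistence.Mode)
}

// agentRequest builds an ephemeral agent VM request (hypothetical helper;
// the /workspace paths are placeholders).
func agentRequest(snapshotID, repoURL, payloadJSON string) CreateVMRequest {
	return CreateVMRequest{
		SnapshotID:      snapshotID,
		GitRepos:        []GitRepo{{URL: repoURL, Path: "/workspace/repo"}},
		AdditionalFiles: map[string]File{"/workspace/payload.json": {Content: payloadJSON}},
		Persistence:     &Persistence{Mode: "ephemeral"},
	}
}

// encodeRequest returns the JSON body as a string.
func encodeRequest(r CreateVMRequest) (string, error) {
	buf, err := json.Marshal(r)
	return string(buf), err
}

func main() {
	body, err := encodeRequest(agentRequest("snap-1", "https://example.com/repo.git", "{}"))
	fmt.Println(body, err)
}
```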
### SSH Access
```
ssh {vmId}:{accessToken}@vm-ssh.freestyle.sh
ssh {vmId}+{username}:{accessToken}@vm-ssh.freestyle.sh
```
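A small helper that builds the SSH target from the two formats above, for example to populate `ssh_connection_info`. `SSHTarget` is a hypothetical name. The result embeds the access token, so it should be treated as a credential.

```go
package main

import "fmt"

// sshHost is the gateway host from the SSH formats listed in this issue.
const sshHost = "vm-ssh.freestyle.sh"

// SSHTarget returns the user@host argument for ssh. An empty username
// selects the first format; a non-empty one selects the "+username" form.
func SSHTarget(vmID, username, accessToken string) string {
	user := vmID
	if username != "" {
		user = vmID + "+" + username
	}
	return fmt.Sprintf("%s:%s@%s", user, accessToken, sshHost)
}

func main() {
	fmt.Println(SSHTarget("vm-123", "", "tok"))
	fmt.Println(SSHTarget("vm-123", "dev", "tok"))
}
```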
### Performance
| Operation | Latency |
| -- | -- |
| VM creation (from snapshot) | <800ms |
| Suspend/resume | <100ms |
| Fork | <50ms |
---
## Existing Codebase Context
### Current Agent Dispatch Flow (`DispatchAgentRun`)
1. Upsert per-repo agent workflow definition
2. Create workflow run (status="queued")
3. Create workflow step (name="agent", status="queued")
4. Generate agent token (`jjhub_agent_` + 40 hex, SHA-256 hashed)
5. Store token hash + 24h expiry in workflow_runs
6. Load message history (best-effort)
7. **\[INFRA\]** Call `snapshotter.CreateSnapshot()` — repo-host snapshot
8. **\[INFRA\]** Build task payload JSON with kind="agent"
9. **\[INFRA\]** Create workflow task (status="pending", `FOR UPDATE SKIP LOCKED` queue)
10. Link session to workflow run
Steps 1-6 and 10 are infrastructure-agnostic. Steps 7-9 change to Freestyle VM creation.
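For reference, step 4 (token generation) can be sketched as below. This is not the existing implementation: it assumes the SHA-256 hash covers the full token including the prefix and is stored hex-encoded, and the function names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const agentTokenPrefix = "jjhub_agent_"

// hashToken returns the hex-encoded SHA-256 of the full token string.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newAgentToken returns a fresh token plus the hash to store in workflow_runs.
func newAgentToken() (token, hash string, err error) {
	raw := make([]byte, 20) // 20 random bytes encode to 40 hex characters
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	token = agentTokenPrefix + hex.EncodeToString(raw)
	return token, hashToken(token), nil
}

func main() {
	token, hash, err := newAgentToken()
	if err != nil {
		fmt.Println("token error:", err)
		return
	}
	fmt.Println(len(token), len(hash))
}
```

Because this step is infrastructure-agnostic, the plaintext token can be handed to the VM at creation time (for example via `additionalFiles`) while only the hash is persisted.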
### Current Workspace Flow
* `CreateSession()` → find or create workspace → dispatch K8s pod with PVC + gVisor
* `dispatchWorkspacePod()` → creates PVC (10Gi RWO) + Pod with runner image, env vars, privileged security context
* `ExchangeWebRTC()` → client/runner SDP + ICE candidate exchange via DB columns
* `DestroyWorkspace()` → stop sessions, update DB, `k8sClient.Pods().Delete()`
* Cleanup: `CleanupIdleSessions()` + `CleanupIdleWorkspaces()` periodic sweeps
### Key Service Interfaces
* `RepoHostSnapshotter` — `CreateSnapshot(ctx, repoID) (string, error)` — replaced by VM gitRepos clone
* `WorkspaceQuerier` — DB interface with WebRTC methods to remove
* `AgentDispatchQuerier` — workflow task creation queries
### Database Tables Affected
* `workspaces` — remove `pod_name`, `pvc_name`, add `freestyle_vm_id`
* `workspace_sessions` — remove `client_sdp`, `runner_sdp`, `client_ice_candidates`, `runner_ice_candidates`, add `ssh_connection_info`
* `workflow_tasks` — add `freestyle_vm_id` for tracking VM-backed tasks
### Existing Monitoring Patterns
* Prometheus metrics via `JJHubMetrics` struct in `internal/routes/metrics.go`
* Custom registry (not global default), `/metrics` endpoint
* Existing metrics: `jjhub_runner_pool_available`, `jjhub_runner_pool_claimed`, `jjhub_active_agent_sessions`
* Alerts in `infra/terraform/modules/monitoring/main.tf` (Cloud Monitoring)
* Structured logging via `slog` with GCP JSON handler
* OpenTelemetry tracing with Cloud Trace exporter
### Existing Testing Patterns
* **Unit tests**: testify + stdlib, table-driven, hand-written interface mocks, `t.Parallel()`
* **Integration tests**: `*_integration_test.go`, real DB via `JJHUB_TEST_DATABASE_URL`, `-p=1 -parallel=1`
* **E2E tests**: Bun Test in `/e2e/api/`, docker-compose services, real API calls with tokens
* **Test targets**: `make test-go`, `make test-db`, `make test-db-isolated`, `make e2e`
---
## Testing Requirements
### Unit Tests
* Freestyle Go client: mock HTTP server (httptest), verify auth headers, request/response marshaling, error handling
* Modified `DispatchAgentRun()`: mock Freestyle client interface, verify VM creation params
* Modified workspace service: mock Freestyle client, verify VM lifecycle calls
### Integration Tests
* Freestyle client against live API (gated by `JJHUB_FREESTYLE_API_KEY` env var — skip if unset)
* Create VM → verify running state → exec command → verify output → delete VM
* Snapshot creation → boot from snapshot → verify fast startup
### E2E Tests (zero mocks)
* Full agent conversation flow: create session → send message → dispatch → agent runs in Freestyle VM → events stream back via SSE → session completes
* Full workspace flow: create workspace → verify Freestyle VM created → verify SSH access → suspend/resume → cleanup on idle
* **These must run against real Freestyle VMs, not mocks.** Gate behind `JJHUB_FREESTYLE_API_KEY`.
---
## Monitoring & Alerting Requirements
### New Prometheus Metrics
* `jjhub_freestyle_vm_create_duration_seconds` (Histogram, labels: `type=agent|workspace`)
* `jjhub_freestyle_vm_create_total` (Counter, labels: `type`, `status=success|error`)
* `jjhub_freestyle_active_vms` (Gauge, labels: `type=agent|workspace`)
* `jjhub_freestyle_vm_suspend_duration_seconds` (Histogram)
* `jjhub_freestyle_api_request_duration_seconds` (Histogram, labels: `method`, `endpoint`)
* `jjhub_freestyle_api_errors_total` (Counter, labels: `endpoint`, `error_code`)
### New Alerts (add to `infra/terraform/modules/monitoring/main.tf`)
* **CRITICAL**: Freestyle VM creation failure rate > 10% for 5 minutes
* **CRITICAL**: Freestyle API unreachable for 2 minutes
* **WARNING**: VM creation latency p95 > 5 seconds for 5 minutes
* **WARNING**: Active VM count approaching Freestyle plan limits
### New Dashboard
* Add "freestyle" dashboard JSON to `infra/terraform/modules/monitoring/`
* Panels: VM creation rate, creation latency, active VMs, API error rate, suspend/resume latency
### Structured Logging
* Log every VM creation with `slog.Info("freestyle vm created", "vm_id", id, "type", "agent|workspace", "duration_ms", dur)`
* Log errors with `slog.Error("freestyle vm creation failed", "error", err, "type", "agent|workspace")`
* Log VM lifecycle events (suspend, resume, delete)
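The logging calls above could be wrapped in small helpers so the keys stay consistent. This sketch uses the stdlib JSON handler as a stand-in for the GCP JSON handler; the helper names, the lifecycle message, and the `event` key are assumptions.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

// logVMCreated records a successful VM creation; vmType is "agent" or "workspace".
func logVMCreated(l *slog.Logger, vmID, vmType string, dur time.Duration) {
	l.Info("freestyle vm created", "vm_id", vmID, "type", vmType, "duration_ms", dur.Milliseconds())
}

// logVMCreateFailed records a failed VM creation.
func logVMCreateFailed(l *slog.Logger, err error, vmType string) {
	l.Error("freestyle vm creation failed", "error", err, "type", vmType)
}

// logVMLifecycle records suspend, resume, and delete events.
// The message text and the "event" key are assumptions.
func logVMLifecycle(l *slog.Logger, event, vmID string) {
	l.Info("freestyle vm lifecycle", "event", event, "vm_id", vmID)
}

// demoLogs emits one line per helper into a buffer and returns the output.
func demoLogs() string {
	var buf bytes.Buffer
	l := slog.New(slog.NewJSONHandler(&buf, nil))
	logVMCreated(l, "vm-123", "agent", 640*time.Millisecond)
	logVMCreateFailed(l, errors.New("boom"), "workspace")
	logVMLifecycle(l, "suspend", "vm-123")
	return buf.String()
}

func main() {
	fmt.Print(demoLogs())
}
```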
---
## Implementation Notes
* Read specs before coding: `docs/specs/engineering.md`, `docs/specs/infra.md`, `docs/specs/design.md`
* Follow existing service layer patterns: Routes → Services → DB
* Use functional options pattern (`WithFreestyleClient()`) matching existing code
* Scripts use Bun (TypeScript), not bash — per code hygiene rules
* After schema changes: run `make sqlc` to regenerate
* After dependency removals: run `go mod tidy`
* Verify: `go build ./...` succeeds with no dead imports
* Update specs after implementation: `docs/specs/engineering.md` (runner architecture), `docs/specs/infra.md` (remove GKE Sandbox runner section), `docs/specs/design.md` (workspace API changes)
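A sketch of the functional options pattern mentioned above, paired with a hand-written mock in the style of the existing unit tests. Only `WithFreestyleClient()` is named in this issue; `VMClient`, `AgentService`, `WithAgentSnapshotID`, and `createAgentVM` are hypothetical and will differ from the real service types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// VMClient is the narrow interface the service depends on (assumed shape).
type VMClient interface {
	CreateVM(ctx context.Context, snapshotID string) (vmID string, err error)
}

// AgentService stands in for the real agent service.
type AgentService struct {
	freestyle  VMClient
	snapshotID string
}

// Option configures an AgentService.
type Option func(*AgentService)

// WithFreestyleClient injects the Freestyle client.
func WithFreestyleClient(c VMClient) Option {
	return func(s *AgentService) { s.freestyle = c }
}

// WithAgentSnapshotID sets the base snapshot to boot from (hypothetical option).
func WithAgentSnapshotID(id string) Option {
	return func(s *AgentService) { s.snapshotID = id }
}

// NewAgentService applies options in order.
func NewAgentService(opts ...Option) *AgentService {
	s := &AgentService{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// createAgentVM stands in for the infrastructure-specific part of dispatch.
func (s *AgentService) createAgentVM(ctx context.Context) (string, error) {
	if s.freestyle == nil {
		return "", errors.New("freestyle client not configured")
	}
	return s.freestyle.CreateVM(ctx, s.snapshotID)
}

// fakeVMClient is a hand-written mock that records the snapshot it was asked for.
type fakeVMClient struct{ gotSnapshot string }

func (f *fakeVMClient) CreateVM(_ context.Context, snapshotID string) (string, error) {
	f.gotSnapshot = snapshotID
	return "vm-fake", nil
}

// demoDispatch wires the mock in through options and runs one VM creation.
func demoDispatch() (vmID, seenSnapshot string, err error) {
	fake := &fakeVMClient{}
	svc := NewAgentService(WithFreestyleClient(fake), WithAgentSnapshotID("snap-1"))
	vmID, err = svc.createAgentVM(context.Background())
	return vmID, fake.gotSnapshot, err
}

func main() {
	vmID, snapshot, err := demoDispatch()
	fmt.Println(vmID, snapshot, err)
}
```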
</description>
<team name="JJHub"/>
</issue>