
Collaborative Parallel Execution in Dagger

Summary

This document explores design options for a new parallel execution feature in Dagger that lets a function call be automatically split into smaller calls and executed across multiple workers.


Problem Statement

Dagger users want:

  1. Test splitting across workers for faster CI (like Buildkite/CircleCI test splitting)
  2. Unified filtering across toolchains (run "this frontend test + this backend test" without knowing each runner's CLI)
  3. Telemetry-driven scheduling - historical timing data used to optimize splits
  4. Local-first execution - works on a laptop with parallelism=1, scales out to cloud workers

Design Evolution

Approach 1: +parallel on Function Arguments (Rejected)

// +parallel
func (g *Go) Test(ctx context.Context, pkgs []string) error
  • Annotation on function, list arguments are split dimensions
  • Problem: Requires a "self-call" for a wrapper like TestAll() to invoke TestSpecific() through the Dagger API (see the sketch after this list)
  • Self-call is bleeding-edge, creates dependency risk
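
To make the self-call problem concrete, here is a hypothetical wrapper in the shape the bullet above describes; the dag.Go() binding and the listPackages helper are illustrative assumptions, not part of the proposal:

// Hypothetical wrapper: because the engine only sees calls made through
// the Dagger API, TestAll would have to invoke the annotated function as
// a self-call (dag.Go().Test(...)) rather than as a plain method call.
func (g *Go) TestAll(ctx context.Context) error {
    pkgs, err := g.listPackages(ctx) // illustrative helper
    if err != nil {
        return err
    }
    // self-call back into this same module - bleeding-edge, risky
    return dag.Go().Test(ctx, pkgs)
}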

Approach 2: +parallel on Argument Only (Explored)

func (g *Go) Test(ctx context.Context,
    // +parallel
    pkgs []string,
) error
  • Cleaner: annotation lives exactly where the split dimension is declared
  • Multiple +parallel args = Cartesian product (explicit opt-in)
  • Same problem: Still requires self-call for wrapper functions

Approach 3: Batch-Aware Methods on Types (Explored)

type TestCase struct { ... }

func (t *TestCase) Run(ctx context.Context) error { ... }
func (t *TestCase) RunBatch(ctx context.Context, all []*TestCase) error { ... }
  • Leverage GraphQL fan-out: [TestCase].Run()
  • Engine routes to batch method when available
  • Problem: Overloads TestCase with batch awareness, unclear interface

Approach 4: Smart Wrapper Type (Current Proposal)

type TestFiles struct {
    Workspace *dagger.Directory
    Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    // Execute tests for tf.Files
}
  • +parallel on field: "This field can safely be split into buckets"
  • +parallel on function: "This function might benefit from parallel execution"
  • Both required for engine to parallelize
  • No self-call needed - wrapper type is explicit

Current Proposal Details

Pragma Semantics

  • +parallel on field: "Safe to split into buckets" (capability)
  • +parallel on function: "Would benefit from parallel execution" (intent)

Engine Behavior

When TestFiles.Test() is called:

  1. Function has +parallel → check receiver type
  2. TestFiles.Files has +parallel → splittable dimension found
  3. Partition Files into N subsets (scheduler decides N)
  4. Create N TestFiles instances with same Workspace, different Files slices
  5. Dispatch Test() on each in parallel (local or scale-out)
  6. Aggregate results (for void/error: collect all errors)
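
A minimal sketch of steps 3-6 in plain Go, assuming local goroutine dispatch; runSplit and partitionFiles are illustrative names, not real engine APIs, and the real scheduler chooses N and may dispatch buckets to remote workers instead:

// Illustrative only: partition the +parallel field, fan out Test() calls,
// then aggregate errors. Assumes "context", "errors", and "sync" imports.
func runSplit(ctx context.Context, tf *TestFiles, n int) error {
    buckets := partitionFiles(tf.Files, n) // step 3: scheduler picks n
    var (
        wg   sync.WaitGroup
        mu   sync.Mutex
        errs []error
    )
    for _, files := range buckets {
        // step 4: same Workspace, different Files slice
        clone := &TestFiles{Workspace: tf.Workspace, Files: files}
        wg.Add(1)
        go func(c *TestFiles) { // step 5: dispatch in parallel
            defer wg.Done()
            if err := c.Test(ctx); err != nil {
                mu.Lock()
                errs = append(errs, err)
                mu.Unlock()
            }
        }(clone)
    }
    wg.Wait()
    return errors.Join(errs...) // step 6: collect all errors
}

// partitionFiles splits files round-robin into at most n buckets.
func partitionFiles(files []string, n int) [][]string {
    if len(files) == 0 {
        return nil
    }
    if n < 1 {
        n = 1
    }
    if n > len(files) {
        n = len(files)
    }
    buckets := make([][]string, n)
    for i, f := range files {
        buckets[i%n] = append(buckets[i%n], f)
    }
    return buckets
}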

Example Usage

// Toolchain implementation
type TestFiles struct {
    Workspace *dagger.Directory
    Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    _, err := dag.Container().
        From("node:alpine").
        WithDirectory("/app", tf.Workspace).
        WithExec(append([]string{"npx", "jest"}, tf.Files...)).
        Sync(ctx)
    return err
}

// Factory function
func (j *Jest) TestFiles() *TestFiles {
    return &TestFiles{
        Workspace: j.Source,
        Files:     discoverTestFiles(j.Source),
    }
}
# CLI usage
dagger call jest test-files test
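
discoverTestFiles is left undefined above; one plausible sketch uses Directory.Glob from the Dagger API (the glob pattern is an assumption, and the factory as written would need a ctx and an error return to call it):

// Hypothetical helper behind the factory above; Directory.Glob is an
// existing Dagger API call, the pattern is illustrative.
func discoverTestFiles(ctx context.Context, src *dagger.Directory) ([]string, error) {
    return src.Glob(ctx, "**/*.test.js")
}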

Return Type Restrictions

For MVP, +parallel functions can only return:

  • void / error-only (tests, checks)
  • Directory (mergeable via overlay)

Future: Changeset, Service (composite proxy), others TBD.
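
For Directory results, "mergeable via overlay" could amount to layering each worker's output onto an empty directory, with later layers winning on conflicting paths. A sketch using existing Dagger API calls (dag.Directory, WithDirectory), not the engine's actual merge path:

// Sketch of overlay-style merging for Directory-returning +parallel functions.
func mergeResults(results []*dagger.Directory) *dagger.Directory {
    merged := dag.Directory()
    for _, dir := range results {
        merged = merged.WithDirectory("/", dir)
    }
    return merged
}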


Telemetry Integration

Dagger is fully instrumented with OpenTelemetry. The parallel feature hooks into this:

Cost Accounting via Spans

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    var errs []error
    for _, file := range tf.Files {
        // One task per item: its attributes map back to the +parallel field value.
        task := dag.Parallel.Task(file).
            WithField("Files", []string{file})

        ctx := task.Start(ctx)
        errs = append(errs, runTest(ctx, file))
        task.End(ctx)
    }
    return errors.Join(errs...)
}
  • Each span carries attributes mapping to +parallel field values
  • Scheduler uses historical span durations for intelligent splitting
  • Works even when items are batched - per-item timing is captured

Scheduler Loop

Run 1: TestFiles.Test() with Files=[a,b,c,d,e]
  → Spans emitted: a=10s, b=5s, c=30s, d=8s, e=12s
  → Telemetry stored

Run 2: Same call, scheduler has history
  → Split to balance load: [c] (30s) vs [a,d,e] (30s) vs [b] (5s)
  → Wall time: ~30s instead of 65s sequential
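
One way to turn that history into balanced buckets is a greedy longest-first (LPT) assignment; this is a sketch of the idea, not the cloud scheduler's actual algorithm:

// Greedy longest-processing-time split: sort items by historical duration,
// then always assign the next item to the least-loaded bucket.
// Assumes "sort" and "time" imports; purely illustrative.
func balance(durations map[string]time.Duration, n int) [][]string {
    items := make([]string, 0, len(durations))
    for item := range durations {
        items = append(items, item)
    }
    sort.Slice(items, func(i, j int) bool {
        return durations[items[i]] > durations[items[j]]
    })
    buckets := make([][]string, n)
    load := make([]time.Duration, n)
    for _, item := range items {
        // pick the bucket with the smallest accumulated load
        min := 0
        for b := 1; b < n; b++ {
            if load[b] < load[min] {
                min = b
            }
        }
        buckets[min] = append(buckets[min], item)
        load[min] += durations[item]
    }
    return buckets
}

On the Run 1 data above with n=3, this yields [c], [e,b], [a,d] - a ~30s wall time, matching the bound in the example.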

Configuration

No Per-Function Configuration

  • Parallelism degree is NOT configurable per-function
  • Avoids hardcoding environment-specific values in code
  • Dagger values extreme portability (laptop to mega-cluster)

Local Mode

  • Simple heuristic: parallelism=1 or 2, or "max 10 items per slice"
  • Engine already runs functions concurrently on available cores
  • Splitting mainly for testing the feature locally

Scale-Out Mode

  • Existing dagger check --scale-out infrastructure
  • Cloud scheduler allocates remote engines
  • Parallel functions dispatch slices to workers via CloudEngineClient()
  • Telemetry-driven scheduling for optimal splits

Open Questions

1. Multiple +parallel Fields

type BuildMatrix struct {
    Platforms []string // +parallel
    Targets   []string // +parallel
}

Options:

  • Cartesian product (N×M splits) - natural for matrix builds
  • Error - only one +parallel field allowed per type
  • Explicit grouping - +parallel(group="matrix") for Cartesian

Current leaning: Start with one +parallel field per type, add Cartesian later.
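
If Cartesian expansion is added later, conceptually it would turn one BuildMatrix into N×M single-pair instances that the scheduler can bucket like any other split; a hypothetical sketch:

// Hypothetical Cartesian expansion for two +parallel fields:
// N platforms × M targets become N×M single-element instances.
func expandMatrix(m *BuildMatrix) []*BuildMatrix {
    var out []*BuildMatrix
    for _, p := range m.Platforms {
        for _, t := range m.Targets {
            out = append(out, &BuildMatrix{
                Platforms: []string{p},
                Targets:   []string{t},
            })
        }
    }
    return out
}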

2. Filtering via dagger check

  • +check functions run via dagger check without arguments
  • How does filtering fit? Options:
    • Filtering lives in dagger call, not dagger check
    • Extend dagger check with --filter applying to +parallel fields
    • Separate dagger test command

Current leaning: Filtering in dagger call, dagger check stays simple.

3. Nested Parallelism ([][]string)

type TestFiles struct {
    // Tests as paths: [[pkg1, testA], [pkg1, testB], [pkg2]]
    Tests [][]string // +parallel
}
  • Inner list = selector/path (atomic, not split further)
  • Outer list = parallelism dimension
  • Enables mixed granularity (package-level + test-level)

Current leaning: Support [][]T where inner list is opaque selector.
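
Under this leaning, a toolchain consumes each inner list as one opaque selector. For instance, a Go toolchain might map [pkg, testName] onto a single go test invocation; the mapping below is illustrative, not a committed design:

// Illustrative interpretation of one selector from the outer list:
// the inner list is consumed whole, never split further.
func runSelector(ctx context.Context, workspace *dagger.Directory, sel []string) error {
    args := []string{"go", "test", sel[0]} // sel[0] = package path
    if len(sel) > 1 {
        args = append(args, "-run", sel[1]) // sel[1] = test name
    }
    _, err := dag.Container().
        From("golang:alpine").
        WithDirectory("/src", workspace).
        WithWorkdir("/src").
        WithExec(args).
        Sync(ctx)
    return err
}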

4. dag.Parallel API Location

Telemetry sugar for emitting parallel-aware spans. Considerations:

  • Core type (dag.Parallel)
  • Inspired by PR #9327 (Status API) + util/parallel patterns
  • Engine handles OTel boilerplate, SDK provides sugar
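
For contrast, the raw OpenTelemetry boilerplate such sugar would hide looks roughly like the following (standard otel Go API; the tracer name and attribute key are assumptions):

// Roughly the OTel code the proposed sugar could replace: one span per item,
// with an attribute mapping back to the +parallel field. Assumes the
// go.opentelemetry.io/otel, .../otel/trace, and .../otel/attribute imports.
func runWithSpan(ctx context.Context, file string) error {
    ctx, span := otel.Tracer("parallel").Start(ctx, file,
        trace.WithAttributes(attribute.StringSlice("Files", []string{file})))
    defer span.End()
    return runTest(ctx, file)
}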

Implementation Phases

Phase 1: Go Toolchain + Local Only

  • Add +parallel pragma parsing to Go codegen
  • Implement engine-side splitting logic
  • Update toolchains/go as reference implementation
  • Local parallelism with simple heuristic

Phase 2: Validate with Real Toolchains

  • Experimental PRs for github.com/dagger/jest and github.com/dagger/pytest
  • Refine based on real-world usage

Phase 3: Cross-SDK Support

  • Create Claude skill for adding directives (reduces per-SDK boilerplate)
  • Implement in TypeScript, Python SDKs

Phase 4: Scale-Out Integration

  • Hook into existing CloudEngineClient() infrastructure
  • Telemetry-driven scheduling via cloud service
  • Fill in the FIXME stubs left in Phase 1 for a smooth transition

Related Work

  • Scale-out: dagger check --scale-out (existing, hidden flag)
  • Checks: +check pragma for argument-free validation functions
  • Telemetry: Full OTel instrumentation, including container-executed tools
  • PR #9327: Status API for custom spans (pattern inspiration)
  • util/parallel: Internal parallel job execution with span support