
Collaborative Parallel Execution in Dagger

Summary

This document explores design options for a new parallel execution feature in Dagger that lets a function call be automatically split into smaller calls and executed across multiple workers.


Problem Statement

Dagger users want:

  1. Test splitting across workers for faster CI (like Buildkite/CircleCI test splitting)
  2. Unified filtering across toolchains (run "this frontend test + this backend test" without knowing each runner's CLI)
  3. Telemetry-driven scheduling - historical timing data used to optimize splits
  4. Local-first execution - works on a laptop with parallelism=1, scales out to cloud workers

Design Evolution

Approach 1: +parallel on Function Arguments (Rejected)

// +parallel
func (g *Go) Test(ctx context.Context, pkgs []string) error
  • Annotation on function, list arguments are split dimensions
  • Problem: Requires a "self-call" for a wrapper like TestAll() to invoke TestSpecific() through the Dagger API (see the sketch after this list)
  • Self-call is bleeding-edge, creates dependency risk
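
To make the self-call problem concrete, here is a hypothetical wrapper in the shape the bullet above describes; the dag.Go() binding and the listPackages helper are illustrative assumptions, not part of the proposal:

// Hypothetical wrapper: because the engine only sees calls made through
// the Dagger API, TestAll would have to invoke the annotated function as
// a self-call (dag.Go().Test(...)) rather than as a plain method call.
func (g *Go) TestAll(ctx context.Context) error {
    pkgs, err := g.listPackages(ctx) // illustrative helper
    if err != nil {
        return err
    }
    // self-call back into this same module - bleeding-edge, risky
    return dag.Go().Test(ctx, pkgs)
}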

Approach 2: +parallel on Argument Only (Explored)

func (g *Go) Test(ctx context.Context,
    // +parallel
    pkgs []string,
) error
  • Cleaner: annotation lives exactly where the split dimension is declared
  • Multiple +parallel args = Cartesian product (explicit opt-in)
  • Same problem: Still requires self-call for wrapper functions

Approach 3: Batch-Aware Methods on Types (Explored)

type TestCase struct { ... }

func (t *TestCase) Run(ctx context.Context) error { ... }
func (t *TestCase) RunBatch(ctx context.Context, all []*TestCase) error { ... }
  • Leverage GraphQL fan-out: [TestCase].Run()
  • Engine routes to batch method when available
  • Problem: Overloads TestCase with batch awareness, unclear interface

Approach 4: Smart Wrapper Type (Current Proposal)

type TestFiles struct {
    Workspace *dagger.Directory
    Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    // Execute tests for tf.Files
}
  • +parallel on field: "This field can safely be split into buckets"
  • +parallel on function: "This function might benefit from parallel execution"
  • Both required for engine to parallelize
  • No self-call needed - wrapper type is explicit

Current Proposal Details

Pragma Semantics

  • +parallel on field: "Safe to split into buckets" (capability)
  • +parallel on function: "Would benefit from parallel execution" (intent)

Engine Behavior

When TestFiles.Test() is called:

  1. Function has +parallel → check receiver type
  2. TestFiles.Files has +parallel → splittable dimension found
  3. Partition Files into N subsets (scheduler decides N)
  4. Create N TestFiles instances with same Workspace, different Files slices
  5. Dispatch Test() on each in parallel (local or scale-out)
  6. Aggregate results (for void/error: collect all errors)
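
A minimal sketch of steps 3-6 in plain Go, assuming local goroutine dispatch; runSplit and partitionFiles are illustrative names, not real engine APIs, and the real scheduler chooses N and may dispatch buckets to remote workers instead:

// Illustrative only: partition the +parallel field, fan out Test() calls,
// then aggregate errors. Assumes "context", "errors", and "sync" imports.
func runSplit(ctx context.Context, tf *TestFiles, n int) error {
    buckets := partitionFiles(tf.Files, n) // step 3: scheduler picks n
    var (
        wg   sync.WaitGroup
        mu   sync.Mutex
        errs []error
    )
    for _, files := range buckets {
        // step 4: same Workspace, different Files slice
        clone := &TestFiles{Workspace: tf.Workspace, Files: files}
        wg.Add(1)
        go func(c *TestFiles) { // step 5: dispatch in parallel
            defer wg.Done()
            if err := c.Test(ctx); err != nil {
                mu.Lock()
                errs = append(errs, err)
                mu.Unlock()
            }
        }(clone)
    }
    wg.Wait()
    return errors.Join(errs...) // step 6: collect all errors
}

// partitionFiles splits files round-robin into at most n buckets.
func partitionFiles(files []string, n int) [][]string {
    if len(files) == 0 {
        return nil
    }
    if n < 1 {
        n = 1
    }
    if n > len(files) {
        n = len(files)
    }
    buckets := make([][]string, n)
    for i, f := range files {
        buckets[i%n] = append(buckets[i%n], f)
    }
    return buckets
}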

Example Usage

// Toolchain implementation
type TestFiles struct {
    Workspace *dagger.Directory
    Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    _, err := dag.Container().
        From("node:alpine").
        WithDirectory("/app", tf.Workspace).
        WithExec(append([]string{"npx", "jest"}, tf.Files...)).
        Sync(ctx)
    return err
}

// Factory function
func (j *Jest) TestFiles() *TestFiles {
    return &TestFiles{
        Workspace: j.Source,
        Files:     discoverTestFiles(j.Source),
    }
}
# CLI usage
dagger call jest test-files test
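
discoverTestFiles is left undefined above; one plausible sketch uses Directory.Glob from the Dagger API (the glob pattern is an assumption, and the factory as written would need a ctx and an error return to call it):

// Hypothetical helper behind the factory above; Directory.Glob is an
// existing Dagger API call, the pattern is illustrative.
func discoverTestFiles(ctx context.Context, src *dagger.Directory) ([]string, error) {
    return src.Glob(ctx, "**/*.test.js")
}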

Return Type Restrictions

For MVP, +parallel functions can only return:

  • void / error-only (tests, checks)
  • Directory (mergeable via overlay)

Future: Changeset, Service (composite proxy), others TBD.
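
For Directory results, "mergeable via overlay" could amount to layering each worker's output onto an empty directory, with later layers winning on conflicting paths. A sketch using existing Dagger API calls (dag.Directory, WithDirectory), not the engine's actual merge path:

// Sketch of overlay-style merging for Directory-returning +parallel functions.
func mergeResults(results []*dagger.Directory) *dagger.Directory {
    merged := dag.Directory()
    for _, dir := range results {
        merged = merged.WithDirectory("/", dir)
    }
    return merged
}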


Telemetry Integration

Dagger is fully instrumented with OpenTelemetry. The parallel feature hooks into this:

Cost Accounting via Spans

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
    var errs []error
    for _, file := range tf.Files {
        // One task per item: its attributes map back to the +parallel field value.
        task := dag.Parallel.Task(file).
            WithField("Files", []string{file})

        ctx := task.Start(ctx)
        errs = append(errs, runTest(ctx, file))
        task.End(ctx)
    }
    return errors.Join(errs...)
}
  • Each span carries attributes mapping to +parallel field values
  • Scheduler uses historical span durations for intelligent splitting
  • Works even when items are batched - per-item timing is captured

Scheduler Loop

Run 1: TestFiles.Test() with Files=[a,b,c,d,e]
  → Spans emitted: a=10s, b=5s, c=30s, d=8s, e=12s
  → Telemetry stored

Run 2: Same call, scheduler has history
  → Split to balance load: [c] (30s) vs [a,d,e] (30s) vs [b] (5s)
  → Wall time: ~30s instead of 65s sequential
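
One way to turn that history into balanced buckets is a greedy longest-first (LPT) assignment; this is a sketch of the idea, not the cloud scheduler's actual algorithm:

// Greedy longest-processing-time split: sort items by historical duration,
// then always assign the next item to the least-loaded bucket.
// Assumes "sort" and "time" imports; purely illustrative.
func balance(durations map[string]time.Duration, n int) [][]string {
    items := make([]string, 0, len(durations))
    for item := range durations {
        items = append(items, item)
    }
    sort.Slice(items, func(i, j int) bool {
        return durations[items[i]] > durations[items[j]]
    })
    buckets := make([][]string, n)
    load := make([]time.Duration, n)
    for _, item := range items {
        // pick the bucket with the smallest accumulated load
        min := 0
        for b := 1; b < n; b++ {
            if load[b] < load[min] {
                min = b
            }
        }
        buckets[min] = append(buckets[min], item)
        load[min] += durations[item]
    }
    return buckets
}

On the Run 1 data above with n=3, this yields [c], [e,b], [a,d] - a ~30s wall time, matching the bound in the example.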

Configuration

No Per-Function Configuration

  • Parallelism degree is NOT configurable per-function
  • Avoids hardcoding environment-specific values in code
  • Dagger values extreme portability (laptop to mega-cluster)

Local Mode

  • Simple heuristic: parallelism=1 or 2, or "max 10 items per slice"
  • Engine already runs functions concurrently on available cores
  • Splitting mainly for testing the feature locally

Scale-Out Mode

  • Existing dagger check --scale-out infrastructure
  • Cloud scheduler allocates remote engines
  • Parallel functions dispatch slices to workers via CloudEngineClient()
  • Telemetry-driven scheduling for optimal splits

Open Questions

1. Multiple +parallel Fields

type BuildMatrix struct {
    Platforms []string // +parallel
    Targets   []string // +parallel
}

Options:

  • Cartesian product (N×M splits) - natural for matrix builds
  • Error - only one +parallel field allowed per type
  • Explicit grouping - +parallel(group="matrix") for Cartesian

Current leaning: Start with one +parallel field per type, add Cartesian later.
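
If Cartesian expansion is added later, conceptually it would turn one BuildMatrix into N×M single-pair instances that the scheduler can bucket like any other split; a hypothetical sketch:

// Hypothetical Cartesian expansion for two +parallel fields:
// N platforms × M targets become N×M single-element instances.
func expandMatrix(m *BuildMatrix) []*BuildMatrix {
    var out []*BuildMatrix
    for _, p := range m.Platforms {
        for _, t := range m.Targets {
            out = append(out, &BuildMatrix{
                Platforms: []string{p},
                Targets:   []string{t},
            })
        }
    }
    return out
}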

2. Filtering via dagger check

  • +check functions run via dagger check without arguments
  • How does filtering fit? Options:
    • Filtering lives in dagger call, not dagger check
    • Extend dagger check with --filter applying to +parallel fields
    • Separate dagger test command

Current leaning: Filtering in dagger call, dagger check stays simple.

3. Nested Parallelism ([][]string)

type TestFiles struct {
    // Tests as paths: [[pkg1, testA], [pkg1, testB], [pkg2]]
    Tests [][]string // +parallel
}
  • Inner list = selector/path (atomic, not split further)
  • Outer list = parallelism dimension
  • Enables mixed granularity (package-level + test-level)

Current leaning: Support [][]T where inner list is opaque selector.
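
Under this leaning, a toolchain consumes each inner list as one opaque selector. For instance, a Go toolchain might map [pkg, testName] onto a single go test invocation; the mapping below is illustrative, not a committed design:

// Illustrative interpretation of one selector from the outer list:
// the inner list is consumed whole, never split further.
func runSelector(ctx context.Context, workspace *dagger.Directory, sel []string) error {
    args := []string{"go", "test", sel[0]} // sel[0] = package path
    if len(sel) > 1 {
        args = append(args, "-run", sel[1]) // sel[1] = test name
    }
    _, err := dag.Container().
        From("golang:alpine").
        WithDirectory("/src", workspace).
        WithWorkdir("/src").
        WithExec(args).
        Sync(ctx)
    return err
}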

4. dag.Parallel API Location

Telemetry sugar for emitting parallel-aware spans. Considerations:

  • Core type (dag.Parallel)
  • Inspired by PR #9327 (Status API) + util/parallel patterns
  • Engine handles OTel boilerplate, SDK provides sugar
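
For contrast, the raw OpenTelemetry boilerplate such sugar would hide looks roughly like the following (standard otel Go API; the tracer name and attribute key are assumptions):

// Roughly the OTel code the proposed sugar could replace: one span per item,
// with an attribute mapping back to the +parallel field. Assumes the
// go.opentelemetry.io/otel, .../otel/trace, and .../otel/attribute imports.
func runWithSpan(ctx context.Context, file string) error {
    ctx, span := otel.Tracer("parallel").Start(ctx, file,
        trace.WithAttributes(attribute.StringSlice("Files", []string{file})))
    defer span.End()
    return runTest(ctx, file)
}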

Implementation Phases

Phase 1: Go Toolchain + Local Only

  • Add +parallel pragma parsing to Go codegen
  • Implement engine-side splitting logic
  • Update toolchains/go as reference implementation
  • Local parallelism with simple heuristic

Phase 2: Validate with Real Toolchains

  • Experimental PRs for github.com/dagger/jest and github.com/dagger/pytest
  • Refine based on real-world usage

Phase 3: Cross-SDK Support

  • Create Claude skill for adding directives (reduces per-SDK boilerplate)
  • Implement in TypeScript, Python SDKs

Phase 4: Scale-Out Integration

  • Hook into existing CloudEngineClient() infrastructure
  • Telemetry-driven scheduling via cloud service
  • Fill in the FIXME stubs left in Phase 1 for a smooth transition

Related Work

  • Scale-out: dagger check --scale-out (existing, hidden flag)
  • Checks: +check pragma for argument-free validation functions
  • Telemetry: Full OTel instrumentation, including container-executed tools
  • PR #9327: Status API for custom spans (pattern inspiration)
  • util/parallel: Internal parallel job execution with span support