This document explores design options for a new parallel execution feature in Dagger, enabling functions to be automatically split and executed across multiple workers.
Dagger users want to:
- Split test suites across workers for faster CI (like Buildkite/CircleCI test splitting)
- Unify filtering across toolchains (run "this frontend test + this backend test" without knowing each runner's CLI)
- Schedule with telemetry, using historical timing data to optimize splits
- Run local-first: parallelism=1 on a laptop, scaling to cloud workers
```go
// +parallel
func (g *Go) Test(ctx context.Context, pkgs []string) error
```

- Annotation on function; list arguments are split dimensions
- Problem: Requires "self-call" for `TestAll()` to invoke `TestSpecific()` through the Dagger API
- Self-call is bleeding-edge, creates dependency risk
```go
func (g *Go) Test(ctx context.Context,
	// +parallel
	pkgs []string,
) error
```

- Cleaner: annotation lives exactly where the split dimension is declared
- Multiple `+parallel` args = Cartesian product (explicit opt-in)
- Same problem: Still requires self-call for wrapper functions
```go
type TestCase struct { ... }

func (t *TestCase) Run(ctx context.Context) error { ... }

func (t *TestCase) RunBatch(ctx context.Context, all []*TestCase) error { ... }
```

- Leverage GraphQL fan-out: `[TestCase].Run()`
- Engine routes to batch method when available
- Problem: Overloads `TestCase` with batch awareness, unclear interface
```go
type TestFiles struct {
	Workspace *dagger.Directory
	Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
	// Execute tests for tf.Files
}
```

- `+parallel` on field: "This field can safely be split into buckets"
- `+parallel` on function: "This function might benefit from parallel execution"
- Both required for the engine to parallelize
- No self-call needed; wrapper type is explicit
| Pragma Location | Meaning |
|---|---|
| `+parallel` on field | "Safe to split into buckets" (capability) |
| `+parallel` on function | "Would benefit from parallel execution" (intent) |
When `TestFiles.Test()` is called:

- Function has `+parallel` → check receiver type
- `TestFiles.Files` has `+parallel` → splittable dimension found
- Partition `Files` into N subsets (scheduler decides N)
- Create N `TestFiles` instances with the same `Workspace`, different `Files` slices
- Dispatch `Test()` on each in parallel (local or scale-out)
- Aggregate results (for void/error: collect all errors)
```go
// Toolchain implementation
type TestFiles struct {
	Workspace *dagger.Directory
	Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
	_, err := dag.Container().
		From("node:alpine").
		WithDirectory("/app", tf.Workspace).
		WithExec(append([]string{"npx", "jest"}, tf.Files...)).
		Sync(ctx)
	return err
}

// Factory function
func (j *Jest) TestFiles() *TestFiles {
	return &TestFiles{
		Workspace: j.Source,
		Files:     discoverTestFiles(j.Source),
	}
}
```

```shell
# CLI usage
dagger call jest test-files test
```

For MVP, `+parallel` functions can only return:

- `void` / error-only (tests, checks)
- `Directory` (mergeable via overlay)
Future: Changeset, Service (composite proxy), others TBD.
Dagger is fully instrumented with OpenTelemetry. The parallel feature hooks into this:
```go
// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
	for _, file := range tf.Files {
		task := dag.Parallel.Task(file).
			WithField("Files", []string{file})
		ctx := task.Start(ctx)
		runTest(ctx, file)
		task.End(ctx)
	}
	return nil
}
```

- Each span carries attributes mapping to `+parallel` field values
- Scheduler uses historical span durations for intelligent splitting
- Works even when items are batched; per-item timing is captured
```
Run 1: TestFiles.Test() with Files=[a,b,c,d,e]
→ Spans emitted: a=10s, b=5s, c=30s, d=8s, e=12s
→ Telemetry stored

Run 2: Same call, scheduler has history
→ Split to balance load: [c] (30s) vs [a,d,e] (30s) vs [b] (5s)
→ Wall time: ~30s instead of 65s sequential
```
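One simple way a scheduler could turn that history into balanced buckets is the classic longest-processing-time (LPT) greedy heuristic: sort items by historical duration, then place each into the currently lightest bucket. This is a sketch of the idea, not the actual scheduler; `balance` and the integer-second durations are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// balance assigns items to n buckets using the LPT greedy heuristic.
// In the real feature, durations would come from stored OTel span data.
func balance(durations map[string]int, n int) [][]string {
	items := make([]string, 0, len(durations))
	for k := range durations {
		items = append(items, k)
	}
	// Sort by duration descending; break ties by name for determinism.
	sort.Slice(items, func(i, j int) bool {
		if durations[items[i]] != durations[items[j]] {
			return durations[items[i]] > durations[items[j]]
		}
		return items[i] < items[j]
	})
	buckets := make([][]string, n)
	load := make([]int, n)
	for _, it := range items {
		// Find the currently lightest bucket.
		min := 0
		for b := 1; b < n; b++ {
			if load[b] < load[min] {
				min = b
			}
		}
		buckets[min] = append(buckets[min], it)
		load[min] += durations[it]
	}
	return buckets
}

func main() {
	hist := map[string]int{"a": 10, "b": 5, "c": 30, "d": 8, "e": 12}
	for i, b := range balance(hist, 3) {
		fmt.Printf("bucket %d: %v\n", i, b)
	}
}
```

With the example history, the 30-second item ends up alone in one bucket and the rest pack into the others, giving the same ~30s wall time as the split shown above (LPT may choose a different but equally balanced grouping).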
- Parallelism degree is NOT configurable per-function
- Avoids hardcoding environment-specific values in code
- Dagger values extreme portability (laptop to mega-cluster)
- Simple heuristic: parallelism=1 or 2, or "max 10 items per slice"
- Engine already runs functions concurrently on available cores
- Splitting mainly for testing the feature locally
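The "max 10 items per slice" fallback amounts to a ceiling division over the item count. A minimal sketch, with `sliceCount` as a hypothetical helper name:

```go
package main

import "fmt"

// sliceCount implements the "max N items per slice" local heuristic:
// just enough slices that none exceeds maxPerSlice.
func sliceCount(numItems, maxPerSlice int) int {
	if numItems <= 0 {
		return 1
	}
	return (numItems + maxPerSlice - 1) / maxPerSlice // ceiling division
}

func main() {
	fmt.Println(sliceCount(25, 10)) // 25 items at max 10 per slice → 3 slices
}
```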
- Existing `dagger check --scale-out` infrastructure
- Cloud scheduler allocates remote engines
- Parallel functions dispatch slices to workers via `CloudEngineClient()`
- Telemetry-driven scheduling for optimal splits
```go
type BuildMatrix struct {
	Platforms []string // +parallel
	Targets   []string // +parallel
}
```

Options:

- Cartesian product (N×M splits): natural for matrix builds
- Error: only one `+parallel` field allowed per type
- Explicit grouping: `+parallel(group="matrix")` for Cartesian

Current leaning: Start with one `+parallel` field per type, add Cartesian later.
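For reference, the Cartesian option over the example type is a straightforward nested loop producing one single-element matrix per (platform, target) pair. `split` is a hypothetical name and the platform/target values are illustrative.

```go
package main

import "fmt"

// BuildMatrix mirrors the open-question example above.
type BuildMatrix struct {
	Platforms []string // +parallel
	Targets   []string // +parallel
}

// split enumerates the Cartesian product: N×M single-pair dispatches.
func split(m BuildMatrix) []BuildMatrix {
	out := make([]BuildMatrix, 0, len(m.Platforms)*len(m.Targets))
	for _, p := range m.Platforms {
		for _, t := range m.Targets {
			out = append(out, BuildMatrix{
				Platforms: []string{p},
				Targets:   []string{t},
			})
		}
	}
	return out
}

func main() {
	m := BuildMatrix{
		Platforms: []string{"linux/amd64", "linux/arm64"},
		Targets:   []string{"cli", "engine"},
	}
	fmt.Println(len(split(m))) // 2 platforms × 2 targets = 4 dispatches
}
```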
- `+check` functions run via `dagger check` without arguments
- How does filtering fit? Options:
  - Filtering lives in `dagger call`, not `dagger check`
  - Extend `dagger check` with `--filter` applying to `+parallel` fields
  - Separate `dagger test` command
Current leaning: Filtering in `dagger call`, `dagger check` stays simple.
```go
type TestFiles struct {
	// Tests as paths: [[pkg1, testA], [pkg1, testB], [pkg2]]
	Tests [][]string // +parallel
}
```

- Inner list = selector/path (atomic, not split further)
- Outer list = parallelism dimension
- Enables mixed granularity (package-level + test-level)

Current leaning: Support `[][]T` where the inner list is an opaque selector.
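A sketch of that outer-only split: only the outer dimension is partitioned, and each inner selector travels as an atomic unit. `partitionSelectors` is a hypothetical helper, using round-robin assignment for illustration.

```go
package main

import "fmt"

// partitionSelectors splits the outer dimension of tests into n buckets;
// each inner selector (e.g. [pkg1, testA]) is never split further.
func partitionSelectors(tests [][]string, n int) [][][]string {
	if n > len(tests) {
		n = len(tests)
	}
	if n < 1 {
		n = 1
	}
	buckets := make([][][]string, n)
	for i, sel := range tests {
		buckets[i%n] = append(buckets[i%n], sel)
	}
	return buckets
}

func main() {
	tests := [][]string{{"pkg1", "testA"}, {"pkg1", "testB"}, {"pkg2"}}
	for i, b := range partitionSelectors(tests, 2) {
		fmt.Printf("bucket %d: %v\n", i, b)
	}
}
```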
Telemetry sugar for emitting parallel-aware spans. Options:

- Core type (`dag.Parallel`)
- Inspired by PR #9327 (Status API) + `util/parallel` patterns
- Engine handles OTel boilerplate, SDK provides sugar
- Add `+parallel` pragma parsing to Go codegen
- Implement engine-side splitting logic
- Update `toolchains/go` as reference implementation
- Local parallelism with simple heuristic
- Experimental PRs for `github.com/dagger/jest` and `github.com/dagger/pytest`
- Refine based on real-world usage
- Create Claude skill for adding directives (reduces per-SDK boilerplate)
- Implement in TypeScript, Python SDKs
- Hook into existing `CloudEngineClient()` infrastructure
- Telemetry-driven scheduling via cloud service
- FIXME stubs in Phase 1 for smooth transition
- Scale-out: `dagger check --scale-out` (existing, hidden flag)
- Checks: `+check` pragma for argument-free validation functions
- Telemetry: Full OTel instrumentation, including container-executed tools
- PR #9327: Status API for custom spans (pattern inspiration)
- `util/parallel`: Internal parallel job execution with span support