Help me plan a new "collaborative parallel execution" feature in dagger. The general idea:
- Dagger functions can be annotated with a new directive @parallel (+parallel in Go etc) which indicates to the engine that this function can be executed in parallel
- The dimensions of parallelism are controlled by arguments. Any argument to a parallel function that is a list is interpreted by the engine as a dimension of parallelism. The contract is that when a client invokes a parallel function, the engine may split that "frontend" invocation into N "backend" invocations, with each invocation receiving a slice of the frontend list argument. The function implementor is responsible for guaranteeing correct behavior of the function regardless of the splitting.
- Only return types that can be reliably merged in any order (I don't know the fancy terminology) are allowed. Example: Directory is allowed because you can always merge N directories together. Changeset is allowed. Service could conceivably be merged into a composite proxy service (let's worry about that later). etc.
Your plan:
- Research the dagger codebase so you understand what I'm talking about. Start with docs/current_docs. Then maybe skim core/ and engine/ (server-side) then cmd/dagger (client side)
- Look at toolchains/go . Understand that my primary test case is adding support for test filtering & splitting for go and other test runners.
Ask lots of questions. This is exploratory
[Explored codebase - docs, core/engine architecture, Go toolchain, cmd/dagger client]
Summary of Key Architecture:
- Function Execution Flow:
  - CLI parses arguments → builds GraphQL query → sends to engine
  - `ModuleFunction.Call()` in `core/modfunc.go` handles execution
  - Arguments go through `ConvertToSDKInput` → SDK executes → `ConvertFromSDKResult`
  - List arguments are passed as-is - no splitting currently exists
- Directives System:
  - Go uses comment directives: `+optional`, `+default="value"`, `+cache="session"`, `+check`
  - These are parsed during codegen and become GraphQL schema metadata
  - Adding `+parallel` would follow this pattern (see the sketch after this summary)
- Types with Merge Potential:
  - `Directory` - can be merged (overlay semantics)
  - `File` - potentially mergeable depending on semantics
  - `Service` - you mentioned proxy composition
  - `Container` - unclear how to merge
  - Primitives - depends (strings could concat, numbers could sum?)
- Go Toolchain Test Pattern:
  - `Test(ctx, run, skip, failfast, parallel, timeout, count, pkgs []string)`
  - `pkgs` is already a list argument - perfect candidate for splitting
  - Internal parallelism exists via `util/parallel` package
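For reference, a minimal sketch of how these comment directives read in a Go module function (illustrative code, not taken from the codebase; `dag` is the generated module client):

```go
// Build is a hypothetical module function. The comment lines above the
// argument are directives, parsed at codegen into GraphQL schema metadata.
func (m *MyModule) Build(
	ctx context.Context,
	// +optional
	// +default="linux/amd64"
	platform string,
) (*dagger.Container, error) {
	return dag.Container().From("golang:latest"), nil
}
```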
Clarifying Questions:
- Where does splitting happen?
- How are multiple list arguments handled?
- Who defines the merge operation?
- Error semantics?
- Caching implications?
- Idempotency contract?
- Changeset type?
- Practical test splitting scenario
- Changeset type already exists: `core/changeset.go`. Just one example.
- For test splitting, the convention is for the function to return void, and we rely on errors to signal "check failed". So merging is a no-op. Let's discard the issue of merging altogether for now.
- Do you want to know how I plan to introduce a telemetry-driven loop, so that a scheduler can instrument test splitting based on actual test run times, similar to how buildkite and circleci implement test splitting, but more generalizable & programmable (and with better local-first support)? That is a major goal of this.
Yes, please tell me about the telemetry-driven scheduling loop.
Dagger is fully instrumented by otel, out of the box. Each dagger function call emits a span. When a dagger function calls the dagger system API, those calls are functions too, and are traced accordingly. Even better, when using the system API to execute a container (`Container.withExec`, `Container.asService`), if the executed tool emits otel telemetry, dagger will intercept it and emit it in the context of dagger's trace. This includes test telemetry :-)
So, the plan is to hook into this otel infrastructure, so that the developer of a parallel dagger function can easily emit custom spans (or arrange for its test runner to emit custom spans - same result) in such a way that the dagger engine CAN MAP THESE SPANS BACK TO VALUES IN THE LIST ARGUMENT USED FOR FILTERING / SPLITTING. This allows for the loop we talked about. The mapping could be as simple as a standardized otel attribute. Example:
```go
// +parallel
func (p *Go) Test(ctx context.Context, pkgs []string, ...) error {
	for _, pkg := range pkgs {
		// Tag this unit of work so the engine can map its span back to
		// the corresponding value of the "pkgs" list argument.
		task := dag.Parallel.NewTask().WithArg("pkgs", []string{pkg})
		taskCtx := task.Start(ctx) // per-task context, so spans don't nest across iterations
		_, err := dag.Container().
			From("golang:latest").
			WithDirectory("/app", p.Source).
			WithWorkdir(pkg).
			WithExec([]string{"go", "test"}).
			Sync(taskCtx)
		task.End(taskCtx) // always close the span, even on failure
		if err != nil {
			return err
		}
	}
	return nil
}
```
[Understood the telemetry architecture and feedback loop]
Follow-up Questions:
- `dag.Parallel.NewTask()` API - Is this a new core API?
- Telemetry Storage - Where does historical timing persist?
- Scheduler Location - Client-side, engine-side, or hybrid?
- First Run (Cold Start) - What happens with no historical data?
- Worker Distribution - Intra-engine or multi-engine?
- Nested Parallelism handling?
Also, in your example, the function still iterates sequentially inside. Is the idea that the engine can optionally split the call itself into N parallel calls?
- don't worry about it. It's a hybrid, for otel-specific reasons.
- We have a cloud service that collects otel + schedules and runs jobs. Locally, the engine will do something simpler.
- Engine-side. CLI stays dumb.
- split evenly / parallelism number is the unknown
- engine decides based on resources. Local-first is a pillar
- Irrelevant. A job is a job
[Discussed configuration options for parallelism number - CLI flag, config file, env var, etc.]
I prefer to avoid a function-level default. Once something is in code, you're stuck with it forever. And we value extreme portability.
For now let's make it not configurable at all. There is a more general "scale-out" feature already in development (see 'dagger check --scale-out') which optionally hooks up to a cloud-hosted scheduler. My thinking is that once +parallel is implemented, this scale-out system would seamlessly handle parallelism of these functions.
And, when running locally without this scale-out feature, I suppose parallelism would simply be set to 1, since there is no need to ramp it up further - the engine already knows how to run dagger functions concurrently on all available cores. We could simply do parallelism=2, or perhaps another heuristic such as: maximum of 10 "items" per slice.
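To make the "maximum of 10 items per slice" heuristic concrete, a minimal sketch (hypothetical helper, not existing engine code):

```go
// partition caps each bucket at maxItemsPerSlice items, so effective
// parallelism grows with workload size (uses the built-in min, Go 1.21+).
func partition[T any](items []T, maxItemsPerSlice int) [][]T {
	var buckets [][]T
	for start := 0; start < len(items); start += maxItemsPerSlice {
		end := min(start+maxItemsPerSlice, len(items))
		buckets = append(buckets, items[start:end])
	}
	return buckets
}
```

For example, 25 packages with a cap of 10 yields 3 backend invocations of sizes 10, 10, and 5.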
[Explored scale-out infrastructure - --scale-out flag, CloudEngineClient(), tryScaleOut() in checks.go]
Integration point clear:
- `+parallel` functions would use the same `CloudEngineClient()` pattern
- Local mode: simple heuristic
- Scale-out mode: scheduler decides
Implementation Areas:
- Directive parsing
- Schema/metadata
- Return type validation
- Engine splitting logic
- Result merging
- Telemetry API
- Scale-out integration
Questions on return types and multiple list arguments...
- void + directory enough for now
- good question. let's discuss.
- Go toolchain for now. Plus experimental PRs for jest and pytest. Create a Claude skill for adding directives (reduces per-SDK boilerplate); local-only first + stubs for scale-out
- Take inspiration from PR #9327 and util/parallel/
[Explored PR #9327 Status API and util/parallel patterns]
Multiple list arguments discussion - Options: Cartesian, Zipped, First list wins, Explicit +split
Recommendation: Explicit +split annotation
what if +parallel moved to the argument, and no annotation on the function?
That's cleaner. The annotation lives exactly where the semantics are:
```go
func Test(
	// +parallel
	pkgs []string,
	platforms []string,
) error
```
Implications:
- No function-level annotation
- Self-documenting
- Multiple `+parallel` args = Cartesian (opt-in, sketched below)
- Return type restriction still applies
- Schema representation on `FunctionArg`
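To illustrate the opt-in Cartesian semantics, a sketch assuming both `pkgs` and `platforms` carry `+parallel` (the `invocation` type is invented for illustration):

```go
// invocation is a hypothetical backend work unit: each (pkg, platform)
// pair in the Cartesian product becomes one backend invocation, shown
// here with singleton slices.
type invocation struct {
	Pkgs      []string
	Platforms []string
}

func cartesianSplit(pkgs, platforms []string) []invocation {
	var out []invocation
	for _, pkg := range pkgs {
		for _, plat := range platforms {
			out = append(out, invocation{
				Pkgs:      []string{pkg},
				Platforms: []string{plat},
			})
		}
	}
	return out
}
```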
- +parallel on non-list -> error
- +parallel on nested list -> maybe a solution to the "nested tasks" problem? eg. testing packages + individual tests within each package
- +parallel combined with +optional -> easy, that's an empty list
- I have one more consideration after you've processed the above
[Discussed nested list [][]string semantics - each inner list is a selector/path]
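For example, the selector/path reading of `[][]string` could look like this (values invented for illustration):

```go
// Each inner list is a selector path of increasing specificity: a
// one-element path selects a whole package, a two-element path selects
// a single test within it.
tests := [][]string{
	{"./core"},                // every test in ./core
	{"./engine", "TestSplit"}, // only TestSplit in ./engine
}
```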
What's the additional consideration?
Additional consideration: test filtering. Can this parallel facility be also used as a foundation for filtering?
Yes! The +parallel arg is essentially declaring "this is the workload dimension." Once you have that:
- Splitting
- Telemetry
- Filtering
- Listing
Options for listing available values:
- Convention-based companion function
- Explicit annotation
- Single function with mode flag
I don't love any of these options. One other option: no magic facility for the engine to call the companion list function. Instead, the function developer is responsible for calling it:
```go
func (g *Go) TestSpecific(
	ctx context.Context,
	// +parallel
	tests [][]string,
) error {
	// all the logic
}

func (g *Go) TestAll(ctx context.Context) error {
	allTests := listAllTests(g.Source)
	return g.TestSpecific(ctx, allTests)
}
```
This is explicit and composable. The pattern is:
- `Tests()` - List available
- `TestSpecific(tests)` with `+parallel` - Run specific (parallelizable)
- `TestAll()` - Convenience: discover + run all
But there are 2 problems:
- `TestAll()` calls `TestSpecific()` as a native Go call, not the Dagger API - parallel won't work. Needs "self-call" feature.
- Which should be a `+check`? Implications for `dagger check` filtering?
Let's focus on problem 1 first
Options:
- A: Two separate entry points, accept the gap
- B: Single function with optional arg
- C: Engine auto-populates from companion function
- D: Wait for self-call
What if we took a completely different approach? The Dagger API supports "fan-out calls" on a list of objects. What if dagger function devs could customize that fan-out behavior, with an implementation that receives an array instead of a single receiver?
[Explored batch-aware methods on types]
Options for batch interface:
- Convention: separate batch function
- Directive + context accessor
- Type-level directive
Comparison table of +parallel on arg vs batch method on type
Option C: how does a "batch-eligible method" receive the list of TestCases?
Options:
- Via context accessor
- Optional batch argument (engine-injected)
- Receiver carries batch info
- Two methods, engine picks
What if instead of overloading TestCase to handle a list of itself, we defined a new type that wrapped [TestCase], a sort of "smart list" type which could implement a special interface?
[Explored smart wrapper type with Splittable interface]
```go
type TestSuite struct {
	Cases []*TestCase
}

func (s *TestSuite) Items() []*TestCase { return s.Cases }
func (s *TestSuite) WithItems(cases []*TestCase) *TestSuite { ... }
```
Why this is nice:
- TestCase stays simple
- Explicit collection type
- Interface-based
- No self-call needed
The "interface" would be an implicit contract with the dagger engine, not an actual Go interface (because of the nature of the Dagger API). Its only purpose is to wrap a collection type, with the possibility of "intercepting" fan-out calls. The basic tools we have are pragmas.
[Explored pragma-based options]
- Option A: Pragma on collection field
- Option B: Pragma linking wrapper to item type
- Option C: Pragma on methods
The Interception Question: Explicit wrapper construction vs engine auto-wrapping?
Proposal:
```go
type TestFiles struct {
	Workspace *dagger.Directory
	Files     []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
	_, err := dag.Container().
		From("jest").
		WithDirectory(".", tf.Workspace).
		WithExec([]string{"sh", "-c", strings.Join(tf.Files, " ")}).
		Sync(ctx)
	return err
}
```
- Parallel function means: "this function might benefit from parallel execution"
- Parallel field means: "this field can safely be split in buckets for parallel execution"
This is clean. The two pragmas work together:
Engine Behavior:
- Function has `+parallel` → check receiver type
- `TestFiles.Files` has `+parallel` → splittable
- Partition `Files` into N subsets
- Create N `TestFiles` instances
- Dispatch `Test()` on each in parallel
- Aggregate results
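Expressed as plain Go for illustration (hypothetical engine-side logic, not real engine internals; uses `golang.org/x/sync/errgroup`, and `buckets` could come from a heuristic like the partition sketch earlier):

```go
// runSplit performs the partition/create/dispatch/aggregate steps above:
// one TestFiles clone per bucket, Test() dispatched concurrently,
// errors aggregated.
func runSplit(ctx context.Context, tf *TestFiles, buckets [][]string) error {
	eg, ctx := errgroup.WithContext(ctx)
	for _, files := range buckets {
		clone := &TestFiles{Workspace: tf.Workspace, Files: files}
		eg.Go(func() error { return clone.Test(ctx) })
	}
	return eg.Wait()
}
```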
Questions:
- Multiple `+parallel` fields?
- Non-parallel methods on same type?
- Telemetry?
- Empty result after split?
- good question. let's explore
- correct. thoughts?
- General idea: any number of otel spans may be emitted with parallel-related attributes for "cost accounting"
- This won't happen because all function calls are scheduled dynamically
1. Multiple +parallel fields:
- Cartesian product?
- Error?
- Explicit grouping?
Current leaning: Start with one +parallel field per type.
2. +parallel field without +parallel function:
Field = capability ("safe to split")
Function = intent ("would benefit")
Both required for engine to parallelize.
3. Telemetry: Fine-grained cost accounting - even when batched, per-item timing captured via span attributes.
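For instance, per-item cost accounting could look like this (the attribute key `dagger.parallel.item` is invented for illustration; uses the standard OpenTelemetry Go API via `go.opentelemetry.io/otel`, `.../otel/attribute`, and `.../otel/trace`):

```go
// Emit one span per work item, even when items run batched inside a
// single function call, so per-item timing maps back to the +parallel value.
tracer := otel.Tracer("go-toolchain")
for _, file := range tf.Files {
	_, span := tracer.Start(ctx, "test "+file,
		trace.WithAttributes(attribute.String("dagger.parallel.item", file)))
	// ... run the per-file work ...
	span.End()
}
```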
Summarize the current state in a markdown, so I can catch up my colleagues on different options we explored
[Generated comprehensive summary document]
Can you paste this to a gist so I can continue on mobile? Also, can I access any of this from the Claude web app?
Gist: https://gist.github.com/shykes/5d70658b90cdfb00e426d354f6ab6d8d
Re: Claude web app - This conversation is in Claude Code (CLI), which has separate history from claude.ai. Options:
- Use the gist
- Claude Code on mobile via SSH
- Export full conversation
3
[End of conversation]