Full conversation: Collaborative Parallel Execution in Dagger - Design Discussion

Dagger Parallel Execution - Full Design Conversation

User

Help me plan a new "collaborative parallel execution" feature in dagger. The general idea:

  • Dagger functions can be annotated with a new directive @parallel (+parallel in Go etc) which indicates to the engine that this function can be executed in parallel
  • The dimensions of parallelism are controlled by arguments. Any argument to a parallel function that is a list is interpreted by the engine as a dimension of parallelism. The contract is that when a client invokes a parallel function, the engine may split that "frontend" invocation into N "backend" invocations, with each invocation receiving a slice of the frontend list argument. The function implementor is responsible for guaranteeing correct behavior of the function regardless of the splitting.
  • Only return types that can be reliably merged in any order (I don't know the fancy terminology) are allowed. Example: Directory is allowed because you can always merge N directories together. Changeset is allowed. Service could conceivably be merged into a composite proxy service (let's worry about that later). etc.

Your plan:

  1. Research the dagger codebase so you understand what I'm talking about. Start with docs/current_docs. Then maybe skim core/ and engine/ (server-side) then cmd/dagger (client side)
  2. Look at toolchains/go . Understand that my primary test case is adding support for test filtering & splitting for go and other test runners.

Ask lots of questions. This is exploratory


Assistant

[Explored codebase - docs, core/engine architecture, Go toolchain, cmd/dagger client]

Summary of Key Architecture:

  1. Function Execution Flow:

    • CLI parses arguments → builds GraphQL query → sends to engine
    • ModuleFunction.Call() in core/modfunc.go handles execution
    • Arguments go through ConvertToSDKInput → SDK executes → ConvertFromSDKResult
    • List arguments are passed as-is - no splitting currently exists
  2. Directives System:

    • Go uses comment directives: +optional, +default="value", +cache="session", +check
    • These are parsed during codegen and become GraphQL schema metadata
    • Adding +parallel would follow this pattern (a sketch follows this summary)
  3. Types with Merge Potential:

    • Directory - can be merged (overlay semantics)
    • File - potentially mergeable depending on semantics
    • Service - you mentioned proxy composition
    • Container - unclear how to merge
    • Primitives - depends (strings could concat, numbers could sum?)
  4. Go Toolchain Test Pattern:

    • Test(ctx, run, skip, failfast, parallel, timeout, count, pkgs []string)
    • pkgs is already a list argument - perfect candidate for splitting
    • Internal parallelism exists via util/parallel package
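
To ground this: today's Go SDK pragmas such as +optional sit in comments directly above the function or argument they describe, and the proposed +parallel directive (which does not exist yet) would attach the same way. A minimal editorial sketch against a Test-style function like the one in point 4:

// Test runs `go test` for the given packages.
// NOTE: +parallel is hypothetical at this stage; +optional is an existing Dagger pragma.
//
// +parallel
func (g *Go) Test(
  ctx context.Context,
  // +optional
  run string,
  // +optional
  pkgs []string,
) error {
  // ... invoke `go test` for pkgs ...
  return nil
}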

Clarifying Questions:

  1. Where does splitting happen?
  2. How are multiple list arguments handled?
  3. Who defines the merge operation?
  4. Error semantics?
  5. Caching implications?
  6. Idempotency contract?
  7. Changeset type?
  8. Practical test splitting scenario

User

  • Changeset type already exists. core/changeset.go. Just one example

  • For test splitting, the convention is for the function to return void, and we rely on errors to signal "check failed". So merging is a no-op. Let's discard the issue of merging altogether for now

  • Do you want to know how I plan to introduce a telemetry-driven loop, so that a scheduler can instrument test splitting based on actual test run times, similar to how buildkite and circleci implement test splitting, but more generalizable & programmable (and with better local-first support)? That is a major goal of this


Assistant

Yes, please tell me about the telemetry-driven scheduling loop.


User

Dagger is fully instrumented by otel, out of the box. Each dagger function call emits a span. When a dagger function calls the dagger system api, those are functions too, and traced accordingly. Even better, when using the system API to execute a container (Container.withExec, Container.asService), if the executed tool emits otel telemetry, dagger will intercept it and emit it in the context of dagger's trace. This includes test telemetry :-)

So, the plan is to hook into this otel infrastructure, so that the developer of a parallel dagger function can easily emit custom spans (or, arrange for its test runner to emit custom spans - same result) in such a way that the dagger engine CAN MAP THESE SPANS BACK TO VALUES IN THE LIST ARGUMENT USED FOR FILTERING / SPLITTING. This allows for the loop we talked about. The mapping could be as simple as a standardized otel attribute. Example:

// +parallel
func (p *Go) Test(ctx context.Context, pkgs []string, ...) error {
  for _, pkg := range pkgs {
    task := dag.Parallel.NewTask().WithArg("pkgs", []string{pkg})
    ctx = task.Start(ctx)
    _, err := dag.Container().
      From("golang:latest").
      WithDirectory("/app", p.Source).
      WithWorkdir(pkg).
      WithExec([]string{"go", "test"}).
      Sync(ctx)
    if err != nil {
      return err
    }
    task.End(ctx)
  }
  return nil
}
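
To make the span-to-argument mapping concrete, here is a minimal sketch using the raw OpenTelemetry Go API rather than the proposed dag.Parallel helper. The attribute keys (dagger.parallel.arg, dagger.parallel.values) are invented placeholders, not an agreed convention; only the OpenTelemetry calls themselves are real.

import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/trace"
)

// runPkg emits one span per package, tagged so an engine-side scheduler
// could map the span back to a value of the pkgs list argument.
// The attribute keys below are illustrative placeholders.
func runPkg(ctx context.Context, pkg string) error {
  ctx, span := otel.Tracer("go-toolchain").Start(ctx, "go test "+pkg,
    trace.WithAttributes(
      attribute.String("dagger.parallel.arg", "pkgs"),
      attribute.StringSlice("dagger.parallel.values", []string{pkg}),
    ))
  defer span.End()

  // ... run the actual test container here, passing ctx so child spans nest ...
  _ = ctx
  return nil
}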

Assistant

[Understood the telemetry architecture and feedback loop]

Follow-up Questions:

  1. dag.Parallel.NewTask() API - Is this a new core API?
  2. Telemetry Storage - Where does historical timing persist?
  3. Scheduler Location - Client-side, engine-side, or hybrid?
  4. First Run (Cold Start) - What happens with no historical data?
  5. Worker Distribution - Intra-engine or multi-engine?
  6. Nested Parallelism handling?

Also, in your example, the function still iterates sequentially inside. Is the idea that the engine can optionally split the call itself into N parallel calls?


User

  1. don't worry about it. It's a hybrid, for otel-specific reasons.
  2. We have a cloud service that collects otel + schedules and runs jobs. Locally, the engine will do something simpler.
  3. Engine-side. CLI stays dumb.
  4. split evenly / parallelism number is the unknown
  5. engine decides based on resources. Local-first is a pillar
  6. Irrelevant. A job is a job

Assistant

[Discussed configuration options for parallelism number - CLI flag, config file, env var, etc.]


User

I prefer to avoid a function default. Once something is in code, you're stuck with it forever. And we value extreme portability.

For now let's make it not configurable at all. There is a more general "scale-out" feature already in development (see 'dagger check --scale-out') which optionally hooks up to a cloud-hosted scheduler. My thinking is that once +parallel is implemented, this scale-out system would seamlessly handle parallelism of these functions.

And, when running locally without this scale-out feature, I suppose parallelism would simply be set to 1, since there is no need to ramp it up further - the engine already knows how to run dagger functions concurrently on all available cores. We could simply do parallelism=2, or perhaps another heuristic, such as a maximum of 10 "items" per slice.
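
As an illustration only (not engine code), the "maximum of 10 items per slice" heuristic amounts to a simple chunking function:

// chunk splits items into buckets of at most maxPerBucket elements,
// e.g. maxPerBucket=10 for the local heuristic mentioned above.
func chunk(items []string, maxPerBucket int) [][]string {
  if maxPerBucket <= 0 {
    maxPerBucket = 1
  }
  var buckets [][]string
  for len(items) > 0 {
    n := maxPerBucket
    if len(items) < n {
      n = len(items)
    }
    buckets = append(buckets, items[:n])
    items = items[n:]
  }
  return buckets
}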


Assistant

[Explored scale-out infrastructure - --scale-out flag, CloudEngineClient(), tryScaleOut() in checks.go]

Integration point clear:

  • +parallel functions would use same CloudEngineClient() pattern
  • Local mode: simple heuristic
  • Scale-out mode: scheduler decides

Implementation Areas:

  1. Directive parsing
  2. Schema/metadata
  3. Return type validation
  4. Engine splitting logic
  5. Result merging
  6. Telemetry API
  7. Scale-out integration

Questions on return types and multiple list arguments...


User

  1. void + directory enough for now
  2. good question. let's discuss.
  3. Go toolchain for now. Plus experimental PRs for jest and pytest. Create a Claude skill for adding directives (reduces per-SDK boilerplate); local-only first + stubs for scale-out
  4. Take inspiration from PR #9327 and util/parallel/

Assistant

[Explored PR #9327 Status API and util/parallel patterns]

Multiple list arguments discussion - Options: Cartesian, Zipped, First list wins, Explicit +split

Recommendation: Explicit +split annotation


User

what if +parallel moved to the argument, and no annotation on the function?


Assistant

That's cleaner. The annotation lives exactly where the semantics are:

func Test(
    // +parallel
    pkgs []string,
    platforms []string,
) error

Implications:

  1. No function-level annotation
  2. Self-documenting
  3. Multiple +parallel args = Cartesian (opt-in; see the sketch after this list)
  4. Return type restriction still applies
  5. Schema representation on FunctionArg
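
For illustration, the Cartesian interpretation in point 3 would mean the engine expands the two lists into one backend invocation per combination, along these lines (a sketch, not engine code):

// cartesian pairs every pkg with every platform; each pair would become
// one backend invocation under the "multiple +parallel args" rule.
func cartesian(pkgs, platforms []string) [][2]string {
  var out [][2]string
  for _, pkg := range pkgs {
    for _, platform := range platforms {
      out = append(out, [2]string{pkg, platform})
    }
  }
  return out
}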

User

  • +parallel on non-list -> error
  • +parallel on nested list -> maybe a solution to the "nested tasks" problem? e.g. testing packages + individual tests within each package
  • +parallel combined with +optional -> easy, that's an empty list

  • I have one more consideration after you've processed the above

Assistant

[Discussed nested list [][]string semantics - each inner list is a selector/path]
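
For illustration (not a value agreed in the discussion), a nested-list argument where each inner list is a selector path might look like:

// Hypothetical value for a +parallel [][]string argument: each inner slice
// is a selector path, here [package, test name].
tests := [][]string{
  {"./pkg/auth", "TestLogin"},
  {"./pkg/auth", "TestLogout"},
  {"./pkg/store"}, // a bare package selects all of its tests
}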

What's the additional consideration?


User

Additional consideration: test filtering. Can this parallel facility be also used as a foundation for filtering?


Assistant

Yes! The +parallel arg is essentially declaring "this is the workload dimension." Once you have that:

  1. Splitting
  2. Telemetry
  3. Filtering
  4. Listing

Options for listing available values:

  • Convention-based companion function
  • Explicit annotation
  • Single function with mode flag

User

I don't love any of these options. One other option: no magic facility for the engine to call the companion list function. Instead, the function developer is responsible for calling it:

func (g *Go) TestSpecific(
  ctx context.Context,
  // +parallel
  tests [][]string,
) error {
  // all the logic
}

func (g *Go) TestAll(ctx context.Context) error {
  allTests := listAllTests(g.Source)
  return g.TestSpecific(ctx, allTests)
}
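
One possible shape for the listAllTests helper above (an illustrative sketch, not part of the proposal): it assumes the usual generated Dagger module scaffolding (the global dag client and the dagger package), uses go test -list for discovery, and adds a ctx parameter plus an error return, so its signature differs slightly from the call above.

// listAllTests discovers test names by running `go test -list` in a
// container and parsing stdout. Parsing is simplified: real output also
// contains "ok <pkg> <time>" lines that would let you attribute each test
// to its package and build proper [package, test] selectors.
func listAllTests(ctx context.Context, src *dagger.Directory) ([][]string, error) {
  out, err := dag.Container().
    From("golang:latest").
    WithDirectory("/src", src).
    WithWorkdir("/src").
    WithExec([]string{"go", "test", "-list", ".*", "./..."}).
    Stdout(ctx)
  if err != nil {
    return nil, err
  }
  var tests [][]string
  for _, line := range strings.Split(out, "\n") {
    if strings.HasPrefix(line, "Test") {
      tests = append(tests, []string{line})
    }
  }
  return tests, nil
}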

Assistant

This is explicit and composable. The pattern is:

  • Tests() - List available
  • TestSpecific(tests // +parallel) - Run specific (parallelizable)
  • TestAll() - Convenience: discover + run all

But there are 2 problems:

  1. TestAll() calls TestSpecific() as native Go call, not Dagger API - parallel won't work. Needs "self-call" feature.
  2. Which should be a +check? Implications for dagger check filtering?

User

Let's focus on problem 1 first


Assistant

Options:

  • A: Two separate entry points, accept the gap
  • B: Single function with optional arg
  • C: Engine auto-populates from companion function
  • D: Wait for self-call

User

What if we took a completely different approach? Dagger API supports "fan-out calls" on a list of objects. What if dagger function devs could customize that fan-out behavior, with an implementation that receives an array instead of a single receiver?


Assistant

[Explored batch-aware methods on types]

Options for batch interface:

  • Convention: separate batch function
  • Directive + context accessor
  • Type-level directive

Comparison table of +parallel on arg vs batch method on type


User

Option C: how does a "batch-eligible method" receive the list of TestCases?


Assistant

Options:

  1. Via context accessor
  2. Optional batch argument (engine-injected)
  3. Receiver carries batch info
  4. Two methods, engine picks

User

What if instead of overloading TestCase to handle a list of itself, we defined a new type that wrapped [TestCase], a sort of "smart list" type which could implement a special interface?


Assistant

[Explored smart wrapper type with Splittable interface]

type TestSuite struct {
    Cases []*TestCase
}

func (s *TestSuite) Items() []*TestCase { return s.Cases }
func (s *TestSuite) WithItems(cases []*TestCase) *TestSuite { ... }

Why this is nice:

  • TestCase stays simple
  • Explicit collection type
  • Interface-based
  • No self-call needed

User

The "interface" would be an implicit contract with the dagger engine, not an actual Go interface (because of the nature of the Dagger API). Its only purpose is to wrap a collection type, with the possibility of "intercepting" fan-out calls. The basic tools we have are pragmas.


Assistant

[Explored pragma-based options]

  • Option A: Pragma on collection field
  • Option B: Pragma linking wrapper to item type
  • Option C: Pragma on methods

The Interception Question: Explicit wrapper construction vs engine auto-wrapping?


User

Proposal:

type TestFiles struct {
  Workspace *dagger.Directory
  Files []string // +parallel
}

// +parallel
func (tf *TestFiles) Test(ctx context.Context) error {
  _, err := dag.Container().
    From("jest").
    WithDirectory(".", tf.Workspace).
    WithExec([]string{"sh", "-c", strings.Join(tf.Files, " ")}).
    Sync(ctx)
  return err
}
  • Parallel function means: "this function might benefit from parallel execution"
  • Parallel field means: "this field can safely be split in buckets for parallel execution"

Assistant

This is clean. The two pragmas work together:

Engine Behavior:

  1. Function has +parallel → check receiver type
  2. TestFiles.Files has +parallel → splittable
  3. Partition Files into N subsets (see the sketch after this list)
  4. Create N TestFiles instances
  5. Dispatch Test() on each in parallel
  6. Aggregate results
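
A rough sketch of steps 3-5 (partition, instantiate, dispatch), reusing the TestFiles type from the proposal above. callTest stands in for however the engine would invoke Test() on a sub-receiver (via the API, not a native Go call), and errgroup is golang.org/x/sync/errgroup; none of this is actual engine code.

// dispatchTest partitions the +parallel field into `buckets` slices,
// builds one TestFiles receiver per bucket, and runs them concurrently.
func dispatchTest(ctx context.Context, tf *TestFiles, buckets int,
  callTest func(context.Context, *TestFiles) error) error {
  if buckets < 1 {
    buckets = 1
  }
  g, ctx := errgroup.WithContext(ctx)
  for i := 0; i < buckets; i++ {
    // Round-robin partition: bucket i gets items i, i+buckets, i+2*buckets, ...
    var files []string
    for j := i; j < len(tf.Files); j += buckets {
      files = append(files, tf.Files[j])
    }
    if len(files) == 0 {
      continue
    }
    sub := &TestFiles{Workspace: tf.Workspace, Files: files}
    g.Go(func() error { return callTest(ctx, sub) })
  }
  return g.Wait()
}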

Questions:

  1. Multiple +parallel fields?
  2. Non-parallel methods on same type?
  3. Telemetry?
  4. Empty result after split?

User

  1. good question. let's explore
  2. correct. thoughts?
  3. General idea: any number of otel spans may be emitted with parallel-related attributes for "cost accounting"
  4. This won't happen because all function calls are scheduled dynamically

Assistant

1. Multiple +parallel fields:

  • Cartesian product?
  • Error?
  • Explicit grouping?

Current leaning: Start with one +parallel field per type.

  2. +parallel field without +parallel function:

    • Field = capability ("safe to split")
    • Function = intent ("would benefit")
    • Both required for engine to parallelize.

3. Telemetry: Fine-grained cost accounting - even when batched, per-item timing captured via span attributes.


User

Summarize the current state in a markdown, so I can catch up my colleagues on different options we explored


Assistant

[Generated comprehensive summary document]


User

Can you paste this to a gist so I can continue on mobile? Also, can I access any of this from the Claude web app?


Assistant

Gist: https://gist.github.com/shykes/5d70658b90cdfb00e426d354f6ab6d8d

Re: Claude web app - This conversation is in Claude Code (CLI), which has separate history from claude.ai. Options:

  1. Use the gist
  2. Claude Code on mobile via SSH
  3. Export full conversation

User

3


[End of conversation]
