@joshgord · Created February 14, 2026 21:49
DataLoader Chaining: The loadAndDispatch Bottleneck — Analysis and Migration Plan

Problem Summary

Gusto's GraphQL resolvers make extensive use of a pattern called loadAndDispatch() that defeats DataLoader batching, causing every chained DataLoader call to dispatch with a batch size of 1. This pattern was identified as the primary reason DataLoaderHelper is the #1 allocation source in production (48,910 samples, 11.6% of all allocations).


How the Bottleneck Was Identified

Allocation Profiling (2026-02-13)

An async-profiler allocation capture (-e alloc, 30 seconds, 463K samples) on a production gusto instance revealed that DataLoaderHelper dominated all allocation sources:

| Source | Samples | % Total |
|---|---|---|
| DataLoaderHelper | 48,910 | 11.6% |
| PacsAuthzInstrumentation | 36,108 | 8.6% |
| ContextDataFetcherDecorator | 28,550 | 6.8% |
| HashMap.resize | 25,021 | 5.4% |

Root Cause Analysis

A codebase-wide search found 119 call sites across 42 files using the loadAndDispatch pattern:

```java
// From ImageDataLoader.java — the pattern used everywhere
public static CompletableFuture<List<ImageResult>> loadAndDispatch(
    DataFetchingEnvironment env, ImageDataLoaderKey key) {
  var loader = get(env);
  var future = loader.load(key);
  loader.dispatch();  // <-- immediately dispatches with batch size of 1
  return future;
}
```

Every DataLoader that supports batching (VideoCore, Images, ECL, Collections, Persons, etc.) is being called with batch size 1 when invoked through this pattern — meaning the downstream gRPC services receive N individual requests instead of 1 batched request with N keys.
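The effect is easy to demonstrate with a minimal stand-in for a DataLoader (a hypothetical MiniLoader sketch, not the org.dataloader API): dispatching after every load() produces batches of 1, while queueing all loads before a single dispatch() produces one batch of N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Minimal stand-in for a DataLoader (hypothetical sketch, not the org.dataloader API)
// that records the size of each dispatched batch.
class MiniLoader<K, V> {
    private final Function<List<K>, List<V>> batchFn;
    private final List<K> pendingKeys = new ArrayList<>();
    private final List<CompletableFuture<V>> pendingFutures = new ArrayList<>();
    final List<Integer> dispatchedBatchSizes = new ArrayList<>();

    MiniLoader(Function<List<K>, List<V>> batchFn) { this.batchFn = batchFn; }

    CompletableFuture<V> load(K key) {
        pendingKeys.add(key);
        CompletableFuture<V> future = new CompletableFuture<>();
        pendingFutures.add(future);
        return future;
    }

    void dispatch() {
        if (pendingKeys.isEmpty()) return;
        dispatchedBatchSizes.add(pendingKeys.size());  // each dispatch = one "gRPC call"
        List<V> results = batchFn.apply(new ArrayList<>(pendingKeys));
        for (int i = 0; i < results.size(); i++) pendingFutures.get(i).complete(results.get(i));
        pendingKeys.clear();
        pendingFutures.clear();
    }
}

public class BatchDemo {
    public static void main(String[] args) {
        Function<List<Integer>, List<String>> fetch = keys -> keys.stream().map(k -> "v" + k).toList();

        // loadAndDispatch pattern: dispatch after every load -> three batches of 1
        MiniLoader<Integer, String> eager = new MiniLoader<>(fetch);
        for (int k = 1; k <= 3; k++) { eager.load(k); eager.dispatch(); }
        System.out.println(eager.dispatchedBatchSizes);   // [1, 1, 1]

        // Framework-style: queue all loads, then dispatch once -> one batch of 3
        MiniLoader<Integer, String> deferred = new MiniLoader<>(fetch);
        for (int k = 1; k <= 3; k++) deferred.load(k);
        deferred.dispatch();
        System.out.println(deferred.dispatchedBatchSizes); // [3]
    }
}
```

Both variants produce the same values for callers; only the number of downstream calls differs.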


Why loadAndDispatch Exists

In graphql-java (pre-v25), DataLoader dispatch happens at the end of each field resolution level. When a DataFetcher chains into another DataLoader via .thenCompose(), the second load() call happens after the framework has already dispatched that level. Without manual dispatch, the request hangs forever:

```java
// ShowDataFetcher.currentEpisode — real example from gusto
return EvidenceLoader.loadCurrentEpisode(env, unifiedEntityId)
    // This runs AFTER framework dispatch — framework won't dispatch again
    .thenCompose(episodeId -> VideoCoreDataLoader.loadAndDispatch(env, episodeId));
    //                                           ^^^^^^^^^^^^^^^^
    //                     manual dispatch forces batch-of-1 to avoid hanging
```

The comment in the codebase explains this directly:

> "Unfortunately, we currently must manually dispatch a dataloader when it is composed after another Future. Otherwise, the thread will hang indefinitely waiting for something to manually dispatch it."


Performance Overhead

Allocation Cost

Each loadAndDispatch call triggers a full DataLoaderHelper.dispatch() cycle including:

  • Creating new CompletableFuture chains per dispatch
  • Invoking the MappedBatchLoader.load() with a single key
  • gRPC call setup, serialization, and response handling per individual request

With 119 call sites and many executing multiple times per GraphQL request (e.g., EvidenceLoader has 26 references), this multiplies to dozens of dispatch cycles per request, most with batch size 1.

Downstream RPS Amplification

DataLoaders that support batching (e.g., ECLFetchEvidenceDataLoader, VideoCoreDataLoader, ImageDataLoader) are designed to batch N keys into a single gRPC call. The loadAndDispatch pattern defeats this — a page with 40 videos generates 40 individual gRPC calls instead of 1 batched call.

Canary Validation (PR #4023)

A dev canary using graphql-java 25 beta with chained DataLoaders (applied only to live event prefetching) showed:

  • Statistically significant p50 latency improvement — even with chaining applied to just one small pocket of the app
  • Clear reduction in downstream gRPC RPS to EvidenceControlLayerService due to better batching
  • Flat or slightly improved CPU — fewer, larger batch calls are more efficient than many individual calls

Proposed Solutions

Option A: graphql-java 25 Chained DataLoaders (Recommended)

graphql-java 25 introduces native support for chained DataLoaders that eliminates the hanging problem entirely.

How it works: The engine automatically detects when a DataLoader load() is called inside a .thenCompose() chain and schedules dispatch appropriately, allowing multiple chained loads to batch together.

```java
// With graphql-java 25 — no manual dispatch needed, batching works
return EvidenceLoader.loadCurrentEpisode(env, unifiedEntityId)
    .thenCompose(episodeId -> VideoCoreDataLoader.load(env, episodeId));
    //                                           ^^^^
    //                     plain load() — framework handles dispatch + batching
```

Enabling:

```java
// Configures chaining on the per-request GraphQLContext. This call chain
// configures the context; it does not return a GraphQL instance to assign.
GraphQL.unusualConfiguration(graphqlContext)
    .dataloaderConfig()
    .enableDataLoaderChaining(true);
```

Status: PR #4023 validated this approach on a dev canary with positive results. Currently applied only to live event prefetching.

Reference: graphql-java Chained DataLoaders docs

Dependency: graphql-java 25 is obtained through DGS framework, which is tied to SBN. The official path is SBN4 (targeting early 2026). Forcing the dependency independently via resolutionStrategy.force is possible but unsupported by the platform team until SBN4 GA.
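For reference, forcing the dependency ahead of SBN4 would look roughly like the following Gradle fragment (a hedged sketch; the beta version string is a placeholder, and this path is unsupported by the platform team until SBN4 GA):

```groovy
configurations.all {
    resolutionStrategy.force("com.graphql-java:graphql-java:<25.x-beta>")
}
```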

Slack thread: #dna-gusto-dev discussion

Option B: DGS Ticker Mode

DGS provides a ScheduledDataLoaderRegistry that polls on a configurable interval and dispatches any DataLoaders with pending keys.

```yaml
dgs:
  graphql:
    dataloader:
      ticker-mode-enabled: true
      schedule-duration: 10ms  # default
```

Pros:

  • Simple 1-line config change
  • Works with current graphql-java version (no dependency upgrade)
  • Chained loads get auto-dispatched within ~10ms
  • Multiple loads queued around the same time batch together

Cons:

  • Adds up to 10ms latency per chained load (the polling interval)
  • Less optimal batching than graphql-java 25's native chaining (timer-based vs graph-aware)
  • Existing loadAndDispatch calls still fire manual dispatches before the ticker

Option C: Selective Restructuring (Incremental)

Restructure individual hot-path resolvers to avoid chaining where possible — e.g., pre-fetch all needed data at the same field level so loads batch naturally.
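As a hedged illustration of the restructuring idea (hypothetical loaders, plain CompletableFuture): when the second key genuinely depends on the first result, chaining is unavoidable, but when both keys are known up front the loads can be issued together at the same level so they fall into the same dispatch window.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical loaders standing in for DataLoader-backed calls; the point is
// the shape of the composition, not the loading mechanics.
public class RestructureSketch {
    static CompletableFuture<String> loadA(String key) {
        return CompletableFuture.completedFuture("A:" + key);
    }

    static CompletableFuture<String> loadB(String key) {
        return CompletableFuture.completedFuture("B:" + key);
    }

    public static void main(String[] args) {
        // Chained: loadB's key depends on loadA's result, so the second load
        // runs after the first completes (the case that forces manual dispatch).
        CompletableFuture<String> chained =
            loadA("x").thenCompose(RestructureSketch::loadB);

        // Restructured: both keys are known up front, so both loads are issued
        // at the same level and can share one dispatch/batch window.
        CompletableFuture<String> together =
            loadA("x").thenCombine(loadB("y"), (a, b) -> a + "+" + b);

        System.out.println(chained.join());   // B:A:x
        System.out.println(together.join());  // A:x+B:y
    }
}
```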

Pros: No dependency changes, targeted fixes

Cons: Labor-intensive, doesn't fix the systemic problem, may not be possible for all call patterns


Recommended Plan

| Phase | Action | Dependency | Risk |
|---|---|---|---|
| 1 | Sign up as early tester for SBN4 | Platform team | None |
| 2 | Expand PR #4023 to chain DataLoaders across the entire app (not just live prefetching) | graphql-java 25 beta | Medium — unsupported until SBN4 GA |
| 3 | Dev canary the full change; validate batching improvements via Atlas gRPC metrics | Canary infrastructure | Low |
| 4 | Remove loadAndDispatch methods from all DataLoader classes | Phase 3 validation | Low — mechanical refactor |
| 5 | Adopt graphql-java 25 GA when SBN4 ships | SBN4 GA | None |

Key Call Sites to Migrate (highest impact)

| DataLoader | loadAndDispatch References | Downstream Service |
|---|---|---|
| EvidenceLoader | 26 | ECL (evidence control layer) |
| VideoCoreDataLoader | 10+ | oasis VideoCore gRPC |
| ImageDataLoader | 4 | Images gRPC |
| ECLHydratePageDataLoader | 3 | ECL hydration gRPC |
| GameDataLoader | 5 | Games gRPC |
| CollectionDataLoader | 4 | Collections gRPC |
| LiveEventsDataLoader | 3 | Live events gRPC |
| PersonDataLoader | 2 | Person gRPC |
| All others | ~60 combined | Various |