eersnington/poll-until-rfc.md

## poll-until-rfc.md

      
    RFC For Deterministic pollUntil() function

Allow me a moment for my pessimistic side to take control of this gist to say that this RFC might be more of a Reasonably Futile Concept rather than being a Request For Comments. But I've been looking into the what, hows, and whys of this amalgamation of thinking tokens for quite a while with the current mental model of WDK, and maybe what can be done about it.
Ground Zero

Polling isn't ideal (side-note: I think WDK has the potential to cheat it), but I keep seeing this pattern in WDK code (and even from my best friend when I showed them the WDK primitives), which made me go down this rabbit hole. I wrote this gist to spell out why the "call status -> sleep -> repeat" loop feels natural, how it actually plays with determinism, and what a first-class helper could do to make it easier for everyone.
1. The useReflex: I'll just loop and slap in an await sleep("15s")

I will be setting some ground truth with a fictional service called FleetwoodSac that exposes these endpoints:

POST /albums → { catalogId }
GET /albums/:catalogId → { status: "cutting" | "mixing" | "mastered" | "scrapped" }

Albums take anywhere from 20 seconds to 3 minutes, and there's no webhook.
When you have a provider that won't push events, the thought noggin reaches for the following three steps:

You call an API you are given (checkSomething)
Wait with sleep()
Repeat until the API says stop

export async function waitForFleetwoodAlbumWorkflow(catalogId: string) {
  "use workflow";

  // Poll once a minute, giving up after 30 attempts.
  for (let attempt = 0; attempt < 30; attempt++) {
    const status = await checkFleetwoodAlbum(catalogId); // 'use step' which calls a fetch() inside
    if (status.state === "mastered") return { status: "success", data: status.data };
    if (status.state === "scrapped") return { status: "error", error: "album production failed" };
    await sleep("1m");
  }

  return { status: "timeout", error: "Job did not complete within 30 attempts" };
}
This kinda mirrors how every developer thinks about time: fetch the latest state, and if it isn’t final, sleep. The snippet seems quite harmless but it's egregiously wrong and the developer is blissfully unaware of what the workflow runtime records behind the scenes.
2. What actually gets recorded

Workflows resume by replaying their event log. The loop above generates a trace like:
step_started(checkFleetwoodAlbum#1)
step_completed(checkFleetwoodAlbum#1 -> { state: "cutting" })
wait_created(1m)
wait_completed
...
step_started(checkFleetwoodAlbum#5)
step_completed(checkFleetwoodAlbum#5 -> { state: "mastered", assetUrl })

Understanding that trace matters because the next section's notion of determinism hinges entirely on replaying these events. Without the full log, the workflow would have to contact FleetwoodSac again during replay, breaking determinism.
3. What "deterministic" really means here


Steps ("use step") run once in the Node runtime and persist their return value.
sleep() schedules a timer, records wait_created, then “wakes up” via wait_completed.
Workflow code is rerun with the cached step/timer results; it must not depend on anything else.

Polling only works because each iteration goes through those primitives. If we skip them or reach outside the sandbox, replay breaks.
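To make the "rerun with cached results" rule concrete, here is a toy model of record-then-replay. It is a sketch and not how WDK is implemented: ToyLog, runStep, wait, fakeUpstream, and pollAlbum are all made-up names, and everything is synchronous to keep it short. The point is only that the second run walks the same control flow while reading from the log instead of calling upstream.

```typescript
// Toy model of record-then-replay. NOT WDK's implementation; every name
// here (ToyLog, runStep, wait, fakeUpstream, pollAlbum) is hypothetical.

type LoggedEvent = { kind: "step_completed" | "wait_completed"; value?: unknown };

class ToyLog {
  private cursor = 0;
  constructor(public readonly events: LoggedEvent[] = []) {}

  // Return the recorded value if this position was already logged;
  // otherwise run the side effect once and append its result.
  runStep<T>(effect: () => T): T {
    const recorded = this.events[this.cursor];
    if (recorded !== undefined) {
      this.cursor++;
      return recorded.value as T;
    }
    const value = effect();
    this.events.push({ kind: "step_completed", value });
    this.cursor++;
    return value;
  }

  // A wait carries no value; on replay we just move past the recorded marker.
  wait(): void {
    if (this.events[this.cursor] === undefined) {
      this.events.push({ kind: "wait_completed" });
    }
    this.cursor++;
  }
}

// Fake upstream: reports "mastered" on the third call and counts every hit.
let upstreamCalls = 0;
function fakeUpstream(): string {
  upstreamCalls++;
  return upstreamCalls >= 3 ? "mastered" : "cutting";
}

// The "workflow" body: plain control flow over logged primitives only.
function pollAlbum(log: ToyLog): string {
  for (let attempt = 0; attempt < 10; attempt++) {
    const state = log.runStep(fakeUpstream);
    if (state === "mastered") return state;
    log.wait();
  }
  return "timeout";
}

// First run: hits the fake upstream and records every step and wait.
const firstRun = new ToyLog();
const liveResult = pollAlbum(firstRun);
const callsAfterFirstRun = upstreamCalls;

// Replay against the same events: same answer, no new upstream calls.
const replayResult = pollAlbum(new ToyLog(firstRun.events));
const callsAfterReplay = upstreamCalls;
```

If pollAlbum read anything outside the log (a wall clock, a random number, a direct fetch), the replay could take a different branch than the first run, which is exactly the failure mode the sandbox rules exist to prevent.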
4. (a)Wait? What about Webhooks?

If steps + timers are the allowed building blocks, webhooks are the third. They look like they break determinism, since an external HTTP request hits our workflow, but in reality the runtime records the payload the moment it arrives:
export async function waitForWebhook() {
  "use workflow";

  const webhook = createWebhook();
  await notifyPartner(webhook.url); // step function

  const request = await webhook; // sleeps
  return request.json();
}
Event log:
hook_created(token = "hook_01H...")
hook_received(token = "hook_01H...", payload = Request { body: {...} })

On replay the workflow simply rehydrates request from the stored payload. So webhooks don't solve our FleetwoodSac-like problem, but they show another example of the "record, then replay" pattern.
5. When webhooks aren't an option

If FleetwoodSac refuses to push events, we can "self-webhook" by spinning up our own external worker and having it resume the workflow:
export async function waitViaPollingAgent(jobId: string) {
  "use workflow";

  const webhook = createWebhook();
  await enqueuePollingAgent(jobId, webhook.url); // 'use step'

  const request = await webhook; // waits for our own worker
  return request.json();
}

async function enqueuePollingAgent(jobId: string, callbackUrl: string) {
  "use step";
  await backgroundQueue.push({ jobId, callbackUrl });
}

// runs outside the workflow runtime.
export async function pollingWorker({ jobId, callbackUrl }: JobPayload) {
  while (true) {
    const status = await checkFleetwoodAlbum(jobId); // same FleetwoodSac status lookup as before
    if (status.state === "mastered" || status.state === "scrapped") {
      await fetch(callbackUrl, { method: "POST", body: JSON.stringify(status) });
      break;
    }
    await new Promise((resolve) => setTimeout(resolve, 60_000));
  }
}
This keeps the workflow deterministic, but the logic is now scattered across three services: workflow, webhook, and worker. That's why people keep defaulting back to the inline loop: it keeps the control flow visibly in one place.
6. Deal with the devil of non-determinism to bring forth a deterministic polling solution?

Given those constraints, a pollUntil helper doesn't need new runtime features. It just needs to compose the existing ones (steps for the status checks, sleep for the waits) and present a nicer surface.
Instead of returning only the final value, we can surface the entire attempt history.
The workflow sandbox already knows how to:

Run steps and cache their outputs.
Schedule sleeps deterministically.

A pollUntil helper can lean on those building blocks and simply package the attempt history for you.
7. Why returning all attempts helps

Because each poll attempt already lives in the event log, a helper can gather the step results directly instead of hitting the external service or doing any other I/O again.
type PollAttempt<T> = {
  attempt: number;
  status: "fulfilled" | "rejected";
  value?: T;
  reason?: unknown;
  startedAt: Date;
  completedAt: Date;
};

type PollResult<T> = {
  attempts: PollAttempt<T>[];
  resolved: boolean;
  final?: T;
};
On replay, the workflow rehydrates the attempts array from previously recorded results, so no new network calls or other I/O occur. This is kinda like Promise.allSettled, but spread over time.
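As a rough illustration of what that history buys you, here is a sketch of a consumer that boils a PollResult down to a few numbers worth logging. The summarize helper, the at() timestamp helper, and the sample data are all hypothetical; the sample also assumes a variant of the helper that keeps going after a rejected attempt, which is a design choice rather than a given.

```typescript
// Types repeated from above so the sketch stands on its own.
type PollAttempt<T> = {
  attempt: number;
  status: "fulfilled" | "rejected";
  value?: T;
  reason?: unknown;
  startedAt: Date;
  completedAt: Date;
};

type PollResult<T> = {
  attempts: PollAttempt<T>[];
  resolved: boolean;
  final?: T;
};

// Hypothetical helper: condense the attempt history into a small summary.
function summarize<T>(result: PollResult<T>) {
  const rejected = result.attempts.filter((a) => a.status === "rejected").length;
  const first = result.attempts[0];
  const last = result.attempts[result.attempts.length - 1];
  // Wall time from the start of the first attempt to the end of the last one.
  const elapsedMs =
    first && last ? last.completedAt.getTime() - first.startedAt.getTime() : 0;
  return { total: result.attempts.length, rejected, elapsedMs, resolved: result.resolved };
}

// Made-up history: two successful polls with one failed poll in between.
const at = (seconds: number) => new Date(Date.UTC(2025, 0, 1, 0, 0, seconds));
const sample: PollResult<{ state: string }> = {
  attempts: [
    { attempt: 1, status: "fulfilled", value: { state: "cutting" }, startedAt: at(0), completedAt: at(1) },
    { attempt: 2, status: "rejected", reason: "503", startedAt: at(31), completedAt: at(32) },
    { attempt: 3, status: "fulfilled", value: { state: "mastered" }, startedAt: at(62), completedAt: at(63) },
  ],
  resolved: true,
  final: { state: "mastered" },
};

const summary = summarize(sample);
```

Since summarize is a pure function over data that was already recorded, calling it inside a workflow adds nothing new to the event log.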
8. How to hack a deterministic Polling Function

// The following Egyptian Hieroglyphics are currently being deciphered.
export async function pollUntil<TArgs extends Serializable[], TResult>(
  stepFn: (...args: TArgs) => Promise<TResult>,
  args: TArgs,
  predicate: (result: TResult) => boolean,
  {
    interval,
    maxAttempts = Infinity,
  }: { interval: StringValue | number; maxAttempts?: number }
): Promise<PollResult<TResult>> {
  "use workflow";

  const attempts: PollAttempt<TResult>[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const startedAt = new Date();
    try {
      const value = await stepFn(...args); // logged step
      const completedAt = new Date();
      attempts.push({ attempt, status: "fulfilled", value, startedAt, completedAt });

      if (predicate(value)) {
        return { attempts, resolved: true, final: value };
      }
    } catch (reason) {
      const completedAt = new Date();
      attempts.push({ attempt, status: "rejected", reason, startedAt, completedAt });
      throw reason; // or continue if retries should happen here
    }

    if (attempt < maxAttempts) await sleep(interval); // logged wait, skipped after the final attempt
  }

  return { attempts, resolved: false };
}
The idea is that every loop iteration is still "step -> wait", so determinism continues to hold.
9. How to save the initial code from shame with pollUntil()

The original workflow can now defer to pollUntil and gain structured history without reimplementing the boilerplate:
export async function waitForAlbum(catalogId: string) {
  "use workflow";

  const { attempts, resolved, final } = await pollUntil(
    checkFleetwoodAlbum,
    [catalogId],
    // Stop polling once the album reaches a terminal state.
    (status) => status.state === "mastered" || status.state === "scrapped",
    { interval: "30s", maxAttempts: 30 }
  );

  if (!resolved) return { status: "timeout", error: "Job did not complete within 30 attempts" };
  if (final?.state === "scrapped") return { status: "error", error: "album production failed" };
  return { status: "success", data: final };
}
Every attempt is still a step + timer. Determinism holds because replay reuses the logged results, and DX improves because the workflow receives a structured history instead of manually tracking counters.
10. Does the Workflow Kit already make this doable?

Yeah, WDK already exposes everything pollUntil needs:

The workflow VM injects WORKFLOW_USE_STEP, WORKFLOW_SLEEP, and WORKFLOW_CREATE_HOOK into globalThis whenever a 'use workflow' function runs, so a helper can stay entirely in userland. Each 'use step' invocation writes step_started/step_completed events to the log and rehydrates from those events on replay, meaning a polling loop never re-hits the upstream API during replays.
sleep() is just another deterministic primitive (wait_created/wait_completed) and takes string | number | Date, so the classic “step → wait → repeat” loop is already fully deterministic. No runtime changes are required to schedule the waits a helper would need.
The VM locks Date, Math.random, and crypto to deterministic implementations and bumps the clock whenever an event arrives. That lets pollUntil stamp each attempt (started/completed) without reintroducing nondeterminism.
Hooks follow the exact same record/replay pattern (hook_created/hook_received), so even the "external polling agent POSTs back into a webhook" workaround in #5 is already supported.

Constraints worth calling out


Errors still behave exactly like normal steps: fatal errors should be rethrown (ending the workflow), while retryable errors keep the promise pending until a successful attempt is logged. A helper needs explicit knobs (maxAttempts, retryOn) to clarify what happens when predicates never pass or the step keeps failing.
Attempt history only exists inside the workflow process. If users want to inspect it later they must return it, stream it, or emit it via another step—there’s no API today to query raw event logs from workflow code.
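One way to pin down those knobs is to write the per-attempt decision as a pure function. This is only a sketch of one possible semantics: PollPolicy, AttemptOutcome, NextAction, and decideNext are hypothetical names, and treating "no retryOn" as "rethrow every error" is my assumption, not WDK behaviour.

```typescript
// Hypothetical option shape. maxAttempts and retryOn are the knobs named
// above; the semantics chosen here are an assumption, not a WDK API.
type PollPolicy = {
  maxAttempts: number;
  retryOn?: (reason: unknown) => boolean;
};

type AttemptOutcome<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

type NextAction = "resolve" | "sleep-and-retry" | "give-up" | "rethrow";

// Decide what the polling loop should do after one attempt (1-based).
function decideNext<T>(
  attempt: number,
  outcome: AttemptOutcome<T>,
  predicate: (value: T) => boolean,
  policy: PollPolicy
): NextAction {
  // Predicate passed: we are done.
  if (outcome.status === "fulfilled" && predicate(outcome.value)) return "resolve";

  // Errors are fatal unless retryOn explicitly says otherwise.
  if (outcome.status === "rejected") {
    const retryable = policy.retryOn?.(outcome.reason) ?? false;
    if (!retryable) return "rethrow";
  }

  // Not done yet: retry only while the attempt budget lasts.
  return attempt >= policy.maxAttempts ? "give-up" : "sleep-and-retry";
}

// Example policy: three attempts, and only a "503" reason counts as retryable.
const policy: PollPolicy = { maxAttempts: 3, retryOn: (reason) => reason === "503" };
const isMastered = (status: { state: string }) => status.state === "mastered";
```

Keeping the decision separate from the loop means "predicate never passes" and "step keeps failing" each have one obvious place to be configured, and the function can be unit tested without a workflow runtime.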


The helper sketched above is intentionally modest: it wraps the primitives we already trust and returns a richer record of what happened. If that shape resonates, the next conversation is whether we keep it as a pattern (docs, snippets, lint rules) or elevate it into the WDK API. If we do the latter, it might also be worth considering a "use poll" directive so workflows can declare—up front—that they expect to poll rather than wait for hooks. That's the decision space I'm interested in exploring with the rest of the team.
And to sign off, eersnington (I'm barely awake at this point in time now)