Anyone operating an LLM inference API with API key authentication -- whether a direct provider or an aggregator -- should consider supporting scoped JWT tokens. DeepInfra already does this well. The pattern is general and solves real problems that the rest of the industry is working around with proxies and key management sprawl.
Organizations that distribute LLM API access to their users (universities, SaaS platforms, dev teams) currently have two options:
- Give each user a real API key via a management API. This works, but the organization loses control the moment the key leaves their hands. Keys can be shared, leaked, or used in ways the organization didn't intend. Revoking a key often destroys its analytics history. And provisioning keys is a heavyweight operation -- there's no cheap way to issue thousands of ephemeral credentials.
- Run a proxy. The organization stands up a server that accepts its own tokens, validates them, and forwards requests to the inference API with a real key. This works but adds latency, operational burden, and a single point of failure in front of every inference request. For organizations that just want to hand out constrained access, running infrastructure in the hot path of every LLM call is a disproportionate requirement.
Neither option is great for the common case: "I want to let someone use my API credits, constrained to certain models and a spending cap, for a limited time."
An API key holder signs a JWT that encodes constraints (allowed models, spending limit, expiration). The end user presents this JWT directly to the provider's API as a bearer token. The provider verifies the signature against the issuing API key and enforces the embedded constraints. No proxy needed.
POST https://api.example.com/v1/scoped-jwt
Authorization: Bearer <api-key>
{
  "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
  "spending_limit": 2.00,
  "expires_delta": 86400
}
Returns a signed token like jwt:eyJhbG... usable as a bearer token against all inference endpoints.
Crucially, issuance can also happen offline. The JWT format is open and the signing key is the API key the issuer already holds. A server-side endpoint is convenient but not required -- organizations can sign tokens locally, in a Lambda, in a CI pipeline, in a browser, without calling any API.
import jwt, time
API_KEY = "sk-..." # your real API key (the HMAC signing secret)
KEY_ID = "acct_123:a2V5XzE=" # your account id + base64(key name), used as kid
token = "jwt:" + jwt.encode(
    {
        "sub": "acct_123",
        "models": ["anthropic/claude-sonnet-4", "google/gemini-2.5-flash"],
        "spending_limit": 2.00,
        "exp": int(time.time()) + 86400,  # 24 hours
    },
    API_KEY,
    algorithm="HS256",
    headers={"kid": KEY_ID},
)
# token is now a bearer credential you can hand to someone else
# they use it directly against the provider's API -- no proxy, no callback
print(token)

No network call. No database. One HMAC signature. The recipient uses this as Authorization: Bearer jwt:eyJhbG... and the provider does the rest.
POST https://api.example.com/v1/chat/completions
Authorization: Bearer jwt:eyJhbG...
{ "model": "anthropic/claude-sonnet-4", "messages": [...] }
The provider verifies the signature, checks the model allowlist, checks the spending limit hasn't been exceeded, checks expiry, and processes the request. Usage is billed to the issuing key.
Standard JWT with HMAC-SHA256. The header's kid field identifies the issuing key. The payload encodes constraints:
{
  "sub": "<issuer-account-id>",
  "models": ["anthropic/claude-sonnet-4"],
  "spending_limit": 2.00,
  "exp": 1739000000
}

The signature uses the API key as the HMAC secret. The provider already knows every API key's value, so verification is a single HMAC check with no additional state.
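To make the verification path concrete, here is a minimal stdlib-only sketch of the provider's check: look up the API key by the kid in the header, recompute the HMAC, then validate the embedded claims. A real implementation would use a vetted JWT library; the function and store names here are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url_decode(part: str) -> bytes:
    # JWT segments are base64url without padding; restore it before decoding
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_scoped_token(bearer: str, api_keys: dict, requested_model: str) -> dict:
    """Verify a scoped token and return its claims, or raise."""
    token = bearer.removeprefix("jwt:")
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")   # pin the algorithm
    secret = api_keys[header["kid"]].encode()      # KeyError for unknown kid
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    if requested_model not in claims["models"]:
        raise PermissionError("model not in token allowlist")
    return claims
```

Note the algorithm pin: accepting whatever alg the token declares is the classic JWT foot-gun, and a provider implementing this pattern would hardcode HS256.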
Scoped JWTs should have a maximum lifetime -- hours to days, not months. Short-lived tokens are better than long-lived ones for this use case:
- Revocation becomes unnecessary. If a token expires in 24 hours, a leaked token is a bounded problem. The issuer doesn't need to maintain a revocation list. The provider doesn't need to check one. There is no state to manage on either side.
- Spending limits stay meaningful. A $2 token that expires in a day is a crisp commitment. A $2 token that expires in a year accumulates ambiguity. Short expiry forces periodic re-issuance, which is a natural checkpoint for the issuing organization to re-evaluate access.
- Issuance is cheap. Generating a JWT is a single HMAC operation. Issuing a fresh token every day (or every session) costs nothing.
- It matches how people actually distribute access. Workshop tomorrow? Issue 24-hour tokens. Semester-long course? Issue weekly tokens from an automated system. Contractor engagement? Issue a token scoped to the duration. The expiry is the access policy.
- University courses can hand out tokens to students at the start of each class or assignment without running any infrastructure.
- Hackathon organizers can generate time-limited tokens for participants.
- SaaS platforms can issue per-session tokens for their users without proxying every request.
- Dev teams can give contractors scoped access without creating (and remembering to delete) long-lived keys.
- CLI tool authors can build auth flows where their server issues a scoped JWT after login and the CLI talks directly to the provider.
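For the batch-distribution cases above (a workshop, a class roster), issuance is just a loop around one signing call. A sketch using PyJWT, with a hypothetical roster and illustrative key values:

```python
import time

import jwt  # PyJWT, same library as the issuance example above

def issue_scoped_token(api_key, kid, account_id, models, limit_usd, ttl_s):
    """One HMAC signature per token: no network call, no database row."""
    return "jwt:" + jwt.encode(
        {"sub": account_id, "models": models,
         "spending_limit": limit_usd, "exp": int(time.time()) + ttl_s},
        api_key, algorithm="HS256", headers={"kid": kid},
    )

# Hypothetical workshop roster: one 24-hour, $2-capped token per participant.
roster = ["alice", "bob", "carol"]
tokens = {
    name: issue_scoped_token("sk-...", "acct_123:a2V5XzE=", "acct_123",
                             ["google/gemini-2.5-flash"], 2.00, 86400)
    for name in roster
}
```

Because the issuer keeps no state, losing the mapping from participant to token costs nothing: just sign fresh ones.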
This is where the pattern gets interesting. LLMs increasingly generate interactive artifacts -- HTML pages, web apps, visualizations, tools -- that the user can open and run immediately. Some of these artifacts would benefit from being able to call an LLM themselves: a generated tutoring app that can answer follow-up questions, a data explorer that can explain charts, a coding sandbox that can lint and suggest fixes.
Today this is impractical. You can't embed a real API key in a generated artifact -- it would be visible in the source and trivially extractable. You could stand up an authenticated proxy, but then the artifact isn't self-contained anymore; it depends on your infrastructure being up.
With scoped JWTs, the generating system can embed a short-lived, spending-capped token directly in the artifact's source. The artifact calls the inference API directly. The token might be good for 4 hours and $0.50. If someone extracts it from the source, the blast radius is bounded: limited models, limited spend, and it expires soon. When it does expire, the artifact gracefully degrades to its static content.
This turns "generate an interactive artifact" from a deployment problem into a signing operation. The LLM orchestrator (Artifacts, Canvas, tool-use agents, or anything that produces runnable output) can mint a scoped JWT at generation time and bake it into the output. No proxy, no backend, no API key management. The artifact is born self-contained and dies gracefully.
User: "Make me a flashcard app for these vocab words that quizzes me and explains my mistakes"
System: [generates HTML artifact]
[signs a scoped JWT: models=["google/gemini-2.5-flash"], spending_limit=0.25, expires_delta=14400]
[embeds token in artifact source]
Artifact: [runs in browser, calls inference API directly with scoped JWT]
[works for 4 hours, then stops making API calls and shows static content]
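The flow above reduces to a string substitution at generation time. A sketch of the orchestrator side, assuming PyJWT for signing; the endpoint URL, template, and key values are illustrative:

```python
import time

import jwt  # PyJWT

# Hypothetical artifact skeleton. The embedded script calls the inference
# API directly; when the token expires, requests fail and the artifact
# falls back to its static content.
ARTIFACT_TEMPLATE = """<script>
const TOKEN = "__TOKEN__"; // 4 hours, $0.25 cap: bounded blast radius
async function ask(question) {
  const r = await fetch("https://api.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "Authorization": "Bearer " + TOKEN },
    body: JSON.stringify({
      model: "google/gemini-2.5-flash",
      messages: [{ role: "user", content: question }],
    }),
  });
  if (!r.ok) return null; // expired token: degrade to static content
  return (await r.json()).choices[0].message.content;
}
</script>"""

def render_artifact(api_key: str, kid: str, account_id: str) -> str:
    """Mint a tightly scoped token and bake it into the artifact source."""
    token = "jwt:" + jwt.encode(
        {"sub": account_id,
         "models": ["google/gemini-2.5-flash"],
         "spending_limit": 0.25,
         "exp": int(time.time()) + 14400},  # 4 hours
        api_key, algorithm="HS256", headers={"kid": kid},
    )
    return ARTIFACT_TEMPLATE.replace("__TOKEN__", token)
```

The token is deliberately visible in the source; the constraints, not secrecy, are what bound the risk.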
- Verification cost: One HMAC-SHA256 check per request plus a key lookup by kid. Comparable to the existing API key auth path.
- Spending limit tracking: Requires associating cumulative spend with each token (or a hash of it). Analogous to how per-key usage is already tracked, but scoped to a shorter-lived entity. Could be tracked in-memory with periodic flush given the short lifetimes.
- Model allowlist: A simple set membership check before routing.
- Backwards compatible: A new auth method, not a change to existing ones. Existing API keys work unchanged. The jwt: prefix on the token distinguishes it from regular keys.
- No new state for the issuer: The whole point is that the issuer signs locally and the provider validates. The issuer doesn't need a database of issued tokens. The provider doesn't need to store tokens either -- it verifies the signature and checks the embedded claims.
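The spending-limit bookkeeping can stay small precisely because tokens are short-lived. A sketch of the in-memory approach suggested above, keyed by a hash of the token; the class and method names are illustrative:

```python
import hashlib
import threading
import time

class SpendTracker:
    """Cumulative spend per token, evicted once the token expires."""

    def __init__(self):
        self._lock = threading.Lock()
        self._spend = {}  # sha256(token) -> [usd_spent, exp]

    def charge(self, token: str, cost_usd: float,
               limit_usd: float, exp: int) -> bool:
        """Atomically record cost; refuse if it would exceed the cap."""
        key = hashlib.sha256(token.encode()).hexdigest()
        with self._lock:
            entry = self._spend.setdefault(key, [0.0, exp])
            if entry[0] + cost_usd > limit_usd:
                return False
            entry[0] += cost_usd
            return True

    def evict_expired(self) -> None:
        """Drop entries whose token has expired; run periodically."""
        now = time.time()
        with self._lock:
            self._spend = {k: v for k, v in self._spend.items() if v[1] > now}
```

A production system would likely flush this to shared storage so multiple frontends agree on spend, but expiry still bounds how long any entry must live.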
- DeepInfra scoped JWTs: Production implementation of this exact pattern. API key holders sign JWTs with model restrictions, spending limits, and expiry. Tokens are used as bearer tokens against inference endpoints.
- Kerberos tickets: Short-lived, cryptographically signed credentials issued by a trusted authority, presented directly to services. Same delegation pattern, different era.
- OAuth2 access tokens: Short-lived tokens issued by an authorization server. Similar concept but much heavier protocol; scoped JWTs skip the token exchange ceremony because the API key is the signing key.
- Macaroons (Google): Bearer credentials built from caveats, where each holder can further attenuate but never escalate permissions. Scoped JWTs are a simpler version of this idea without the chaining.