Docs

The Modelith reference

Runtime cost-governance plane · OpenAI-compatible

Everything you need to send your first routed request, govern a session to its budget, and read the resulting trace. Designed for engineers shipping with Kilo, Continue, Cursor, and custom agent frameworks.

Five minutes to your first halt

1. Quickstart

Modelith is an OpenAI-compatible HTTP proxy with a runtime governor on the hot path. You point your existing SDK at our base URL, set a session ID and a budget, and the next token runs through scoring, tier routing, and a budget check before it ever reaches an upstream model. No new client library, no new framework.

The fastest way to feel the value: run the same prompt four times with a $0.05 budget and watch the loop halt fire on the fourth call. The sandbox in your dashboard does this with one click.

Step 1 — Sign in and create a key

From the dashboard sidebar, open API keys and generate a new key. It will be prefixed with mlth_ — that prefix is non-secret and used for routing on the wire. The secret portion of the key is shown once. Store it like any other bearer credential.

Step 2 — Point your SDK at Modelith

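Change your OpenAI base URL to https://api.modelith.cloud/api/v1/routing and pass the mlth_ key as the bearer token. The SDK appends /chat/completions itself, so the full request URL becomes https://api.modelith.cloud/api/v1/routing/chat/completions; use that full URL only with raw HTTP clients such as curl. Every OpenAI-compatible client works: Kilo, Continue, Cursor, the official OpenAI SDKs, the Anthropic SDK with a shim, the Vercel AI SDK, and curl.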

Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.modelith.cloud/api/v1/routing",
    api_key="mlth_your_key_here",
)

response = client.chat.completions.create(
    model="auto",  # tier routing
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

Step 3 — Add a session ID and a budget

Two headers turn a stateless call into a governed session. They are the entire surface area of the governor. If you omit them, the request still routes — it just has no per-session budget or loop detection.

  • X-Modelith-Session-Id — any stable string per agent run (UUID, task ID, or your own run id).
  • X-Modelith-Budget-Limit — the maximum USD spend for this session, as a decimal string. The governor halts the moment the next request would exceed it.
cURL (minimum viable request)
curl -X POST "https://api.modelith.cloud/api/v1/routing/chat/completions" \
  -H "Authorization: Bearer mlth_your_key_here" \
  -H "Content-Type: application/json" \
  -H "X-Modelith-Session-Id: run-7a3f" \
  -H "X-Modelith-Budget-Limit: 0.05" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Summarize this PRD."}]}'
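
For clients that do not go through an OpenAI SDK, the same governed request can be sketched with the Python standard library. This is a minimal illustration, not an official client: the helper name build_governed_request is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

ENDPOINT = "https://api.modelith.cloud/api/v1/routing/chat/completions"


def build_governed_request(api_key, session_id, budget_usd, messages, model="auto"):
    """Build (but do not send) a governed chat completion request.

    Hypothetical helper: only the URL, headers, and body shape come from the docs.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "X-Modelith-Session-Id": session_id,
        # The budget travels as a decimal string, e.g. "0.05".
        "X-Modelith-Budget-Limit": str(budget_usd),
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")


req = build_governed_request(
    api_key="mlth_your_key_here",
    session_id="run-7a3f",
    budget_usd="0.05",
    messages=[{"role": "user", "content": "Summarize this PRD."}],
)
# To send it: urllib.request.urlopen(req). Note that urlopen raises
# urllib.error.HTTPError for 4xx/5xx responses, including governor halts.
```

Passing the budget as a string avoids float formatting surprises such as "5e-05" for very small limits.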

Step 4 — See the halt

Replay the same call four times within 10 seconds. The fourth call returns 429 LOOP DETECTED and the dashboard row turns red. The session is durable in Redis for the configured TTL, so any client that reuses the same X-Modelith-Session-Id, even minutes later, gets a 429 immediately, before any token is spent.

You can also exercise the budget halt by setting X-Modelith-Budget-Limit: 0.0001 on a single longer call. The first request will likely already exceed the limit, and the governor will return 402 with a structured body explaining the spend.

Need a hosted playground? Open Session Sandbox in the dashboard and run a loop with a real mlth_ key.

One endpoint, two headers, one shape

2. API reference

The API surface is intentionally small. There is one endpoint, one auth scheme, two governance headers, and an OpenAI-compatible response body enriched with routing metadata. Everything else is configuration in the dashboard.

Endpoints

Modelith exposes a single OpenAI-compatible chat completions endpoint. Streaming is supported (SSE, stream: true). All other surfaces (Command Center, Sandbox, Billing) are dashboard routes served by the same auth layer.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/v1/routing/chat/completions | Routed chat completion. OpenAI-compatible body; Modelith headers attach governance. |
| GET | /api/v1/public/status | Public status payload powering /status. No auth. |
| POST | /api/v1/public/design-partner-applications | Public intake for design partner applications. No auth. |
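
Since streaming uses SSE with stream: true, a client reads the response line by line. The sketch below assumes Modelith emits the same chunk format as the OpenAI streaming API ("data: {json}" lines ending with "data: [DONE]"); that is an inference from "OpenAI-compatible", not something this reference spells out. The function name iter_stream_deltas and the sample lines are illustrative.

```python
import json


def iter_stream_deltas(lines):
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines standing in for a real response stream.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_deltas(sample)))  # Hello
```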

Authentication

All authenticated requests must send an Authorization: Bearer mlth_… header. The bearer token is the full API key, including the mlth_ prefix. The key is rate-limited per key, revocable from the dashboard, and never returned in any API response after creation. Dashboard sessions additionally use a separate JWT cookie scoped to the dashboard domain.

A leaked key should be revoked immediately from Dashboard → API keys. Revocation is effective on the next request — the gateway checks key state on every call.

Session & budget headers

Two headers attach budget and loop-detection semantics to a request: X-Modelith-Session-Id and X-Modelith-Budget-Limit. Two further optional headers, listed in the same table, steer routing for a request.

| Header | Required | Value | Effect |
| --- | --- | --- | --- |
| X-Modelith-Session-Id | no | Stable string, ≤ 128 chars, per agent run | Groups subsequent requests into one session for the governor. |
| X-Modelith-Budget-Limit | no | Decimal USD, e.g. '0.10' | Session halts (402) once cumulative spend would exceed this limit. |
| X-Modelith-Mode | no | cheap, balanced, or powerful | Constrains tier selection. Default is balanced. |
| X-Modelith-Force-Model | no | Any OpenRouter model id | Bypasses tier routing for a single request. Used for debugging or A/B tests. |

Response shape

Responses are OpenAI-compatible. The Modelith-specific metadata is attached under x_modelith on the root object. This is the block your dashboard and your own logging will read.

Response (truncated)
{
  "id": "chatcmpl-...",
  "model": "anthropic/claude-sonnet-4",
  "choices": [{ "index": 0, "message": { "role": "assistant", "content": "..." }, "finish_reason": "stop" }],
  "usage": { "prompt_tokens": 312, "completion_tokens": 144, "total_tokens": 456 },
  "x_modelith": {
    "session_id": "run-7a3f",
    "step": 3,
    "routing_mode": "balanced",
    "complexity_score": 6.4,
    "initial_tier": 2,
    "final_tier": 2,
    "escalated": false,
    "provider": "openrouter",
    "estimated_cost_usd": 0.00213,
    "latency_ms": 287,
    "spent_usd": 0.0071,
    "budget_limit_usd": 0.05,
    "budget_remaining_pct": 85.8
  }
}

On a halt, the response is an HTTP error with a structured body. The same x_modelith keys appear so the caller can reconcile spend and state in one place.
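
To show how the x_modelith block can feed your own logging, here is a small sketch that builds a one-line trace from a parsed response. The helper names are hypothetical, and the remaining-percentage formula (1 − spent / limit) is inferred from the sample values above rather than documented.

```python
def remaining_pct(meta):
    """Percent of the session budget still unspent (formula inferred from the sample)."""
    return round((1 - meta["spent_usd"] / meta["budget_limit_usd"]) * 100, 1)


def trace_line(response):
    """Format one log line from a parsed chat completion response."""
    meta = response["x_modelith"]
    return (
        f"{meta['session_id']} step={meta['step']} tier={meta['final_tier']} "
        f"model={response['model']} spent=${meta['spent_usd']:.4f} "
        f"remaining={remaining_pct(meta)}%"
    )


# Subset of the sample response shown above.
response = {
    "model": "anthropic/claude-sonnet-4",
    "x_modelith": {
        "session_id": "run-7a3f",
        "step": 3,
        "final_tier": 2,
        "spent_usd": 0.0071,
        "budget_limit_usd": 0.05,
        "budget_remaining_pct": 85.8,
    },
}
print(trace_line(response))
```

Recomputing the percentage locally and comparing it with budget_remaining_pct is a cheap consistency check for reconciliation jobs.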

The primitive that makes a loop into a budget

3. Sessions

A session is a named, bounded unit of agent work. It exists in Redis for the duration of the run, carries a budget, a step counter, a tier history, and a fingerprint cache. The governor is the only thing that reads it on the hot path — Postgres is touched asynchronously, after the response, to record the request log.

A session is created lazily the first time Modelith sees a new X-Modelith-Session-Id. There is no pre-registration call. Sessions expire after 24 hours of inactivity by default; an active session extends the TTL on every request.

Session lifecycle

  1. Open. First request with a new session id and a budget header creates the session. The first tier is selected from complexity scoring and your allowed tiers.
  2. Run. Each request increments the step counter, records the prompt fingerprint (UUIDs and numbers stripped), and adds estimated cost to the running total.
  3. Halt. The governor returns 402 (budget) or 429 (loop or steps) the moment a guardrail trips. The response carries the halt reason and the spent/limit pair.
  4. Close. A session can be closed explicitly by sending the close header, or it will TTL after 24 hours of inactivity. The trace is preserved in Postgres either way.

Budget halt (402)

402 — budget

Spent $0.0942 + this request's hold $0.012 would exceed the session limit $0.10. The governor halts before the next token is sent to the upstream provider. The session remains durable in Redis; further calls with the same session id will continue to return 402 until the session is closed or the limit is raised.

The hold is the estimated cost of the current request, computed from input length and the initial-tier price. If the actual response cost is lower, the difference is credited back to the session on settlement.
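
The hold-then-settle arithmetic can be sketched as follows. This is an in-memory illustration under stated assumptions, not the governor's code: the function names are hypothetical, Decimal is used to keep the sums exact, and the case where the actual cost exceeds the hold is left as a no-op because the text does not describe it.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised where the governor would answer 402."""


def place_hold(spent, hold, limit):
    """Reserve the estimated cost, or halt if it would exceed the limit."""
    if spent + hold > limit:
        raise BudgetExceeded(f"spent {spent} + hold {hold} exceeds limit {limit}")
    return spent + hold


def settle(spent_with_hold, hold, actual_cost):
    """Credit back the unused part of the hold once the real cost is known."""
    credit = max(hold - actual_cost, Decimal("0"))
    return spent_with_hold - credit


limit = Decimal("0.10")

# A request that fits: hold 0.012 on top of 0.05 spent, actual cost 0.009.
running = place_hold(Decimal("0.05"), Decimal("0.012"), limit)
running = settle(running, Decimal("0.012"), Decimal("0.009"))
print(running)  # 0.059

# The example from the text: 0.0942 + 0.012 = 0.1062, which exceeds 0.10.
try:
    place_hold(Decimal("0.0942"), Decimal("0.012"), limit)
    halted = False
except BudgetExceeded:
    halted = True
print(halted)  # True
```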

Loop halt (429)

429 — loop

Same fingerprint detected 4 times within 10 seconds. The fingerprint is the prompt text with UUIDs, numbers, and whitespace normalized, so the comparison is structural rather than literal. The governor halts on the 4th identical fingerprint: that 4th request receives the 429 and is never sent to the upstream.
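
A rough sketch of the fingerprint and the sliding-window check described above. The regular expressions, placeholder tokens, and class name are assumptions made for illustration; the real normalization rules are not published here, and the real state lives in Redis rather than in process memory.

```python
import re
from collections import deque

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
SPACE_RE = re.compile(r"\s+")


def fingerprint(prompt):
    """Normalize UUIDs, numbers, and whitespace so structurally equal prompts match."""
    text = UUID_RE.sub("<uuid>", prompt)  # UUIDs first, before their digits are touched
    text = NUMBER_RE.sub("<num>", text)
    return SPACE_RE.sub(" ", text).strip()


class LoopDetector:
    """Per-session detector: halt on the Nth identical fingerprint inside the window."""

    def __init__(self, max_repeats=4, window_seconds=10.0):
        self.max_repeats = max_repeats
        self.window = window_seconds
        self.seen = {}  # fingerprint -> deque of timestamps

    def should_halt(self, prompt, now):
        times = self.seen.setdefault(fingerprint(prompt), deque())
        times.append(now)
        while times and now - times[0] > self.window:
            times.popleft()  # drop sightings older than the window
        return len(times) >= self.max_repeats


detector = LoopDetector()
prompts = [
    "Retry job 3f2b8c1e-9a4d-4c1b-8e2f-0a1b2c3d4e5f attempt 1",
    "Retry job 7d9e0f1a-2b3c-4d5e-8f6a-7b8c9d0e1f2a attempt 2",
    "Retry job 3f2b8c1e-9a4d-4c1b-8e2f-0a1b2c3d4e5f   attempt 3",
    "Retry job 7d9e0f1a-2b3c-4d5e-8f6a-7b8c9d0e1f2a attempt 17",
]
# One call per second; only the fourth one trips the detector.
results = [detector.should_halt(p, now=float(i)) for i, p in enumerate(prompts)]
print(results)  # [False, False, False, True]
```

The timestamp is passed in explicitly so the behavior is deterministic and easy to test.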

The loop detector is the second-most-common reason customers cite for choosing Modelith. A runaway agent that re-prompts itself with the same context will burn through a monthly budget in minutes; the loop halt turns that into a 200-millisecond structured 429.

Max-steps halt (429)

429 — max steps

Session exceeded 30 steps. The agent is probably stuck — even if each prompt is different, the agent is not making progress. The governor halts and asks for a human. The threshold is configurable per account from the dashboard.

The max-steps halt is the polite refusal. Where the loop halt says "you repeated yourself", the max-steps halt says "this is taking too long; ask for a human". Both are server-side, both are non-overridable by the client.

Complexity scoring, tier selection, escalation

4. Routing

Routing is how the governor decides which upstream model handles a request. Modelith uses a six-analyzer complexity engine, three fixed tiers, and a mode setting that constrains the search space. The whole selection happens in Redis on the hot path — never a synchronous database call.

You can disable routing entirely and pin a single model. The governor still runs; you just give it a fixed answer for the "which tier?" question.

The 6-analyzer complexity engine

Each prompt is scored 1–10 by six independent analyzers. The scores are weighted, summed, and floored to a tier. A confidence score is computed as 1 − (stddev / 10). If the analyzers disagree, the final score is bumped by +1 — we'd rather over-pay than under-pay for a hard prompt.

  • Length (15%): Token count proxy; long prompts need more context window.
  • Code (20%): Code-fence density, identifier casing, and language signals.
  • Reasoning (25%): Markers like 'step by step', 'prove', 'evaluate'.
  • Vocabulary (15%): Type-token ratio and rare-token density.
  • Intent (15%): Verb classes: summarize, generate, transform, refactor.
  • Domain (10%): Domain signals: legal, medical, financial, casual chat.

The engine is frozen. Weights sum to 1.0. There is no machine learning, no embeddings, no semantic cache. The decision to keep it simple is a product decision — it means the routing is predictable, the bills are reproducible, and a compliance team can audit it line by line.
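
The weighting and confidence arithmetic can be sketched as below. The weights and the confidence formula come from this section; everything else is an assumption for illustration: the function name, the use of population standard deviation, and the disagreement threshold of 2.0 (the text does not say when analyzers count as disagreeing). The mapping from score to tier is omitted because its cutoffs are not stated.

```python
from statistics import pstdev

WEIGHTS = {
    "length": 0.15,
    "code": 0.20,
    "reasoning": 0.25,
    "vocabulary": 0.15,
    "intent": 0.15,
    "domain": 0.10,
}

# Hypothetical threshold: the docs do not define "disagree".
DISAGREEMENT_STDDEV = 2.0


def score_prompt(analyzer_scores):
    """Return (final_score, confidence) from six analyzer scores in the range 1-10."""
    weighted = sum(WEIGHTS[name] * analyzer_scores[name] for name in WEIGHTS)
    spread = pstdev([analyzer_scores[name] for name in WEIGHTS])
    confidence = 1 - spread / 10
    if spread > DISAGREEMENT_STDDEV:
        weighted = min(weighted + 1, 10)  # bump by +1 on disagreement, capped at 10
    return round(weighted, 2), round(confidence, 3)


agree = {name: 6 for name in WEIGHTS}
split = {"length": 2, "code": 9, "reasoning": 9, "vocabulary": 3, "intent": 8, "domain": 2}

print(score_prompt(agree))  # (6.0, 1.0)
print(score_prompt(split))
```

In the second call the analyzers are far apart, so the weighted score of 6.2 is bumped to 7.2 and the confidence drops well below 1.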

Tiers T1 / T2 / T3

Three fixed tiers, mapped to specific upstream models on the platform. Tiers are a constraint for the governor, not a guarantee: the tier actually used is recorded in x_modelith.final_tier, and the model actually dispatched in the top-level model field, on every response.

| Tier | When | Examples |
| --- | --- | --- |
| T1 (Fast) | Simple intents, short context, low-stakes output (classify, extract, format) | Llama 3.1 8B, GPT-4o-mini, Gemini 1.5 Flash |
| T2 (Balanced) | Default tier. Most production code, summarization, structured generation | Claude Sonnet 4, GPT-4o, Gemini 1.5 Pro |
| T3 (Powerful) | Hard reasoning, long-horizon planning, multi-step agent work | Claude Opus 4, GPT-4.1, o3-mini |

Allowed tiers are configured per account at Settings → Routing. If a customer has licensed only T1 and T2, the governor will never select T3, regardless of the complexity score.

Modes: cheap / balanced / powerful

Modes are a hint to the governor about how to constrain the search space. The mode you send in X-Modelith-Mode (or the per-request x_modelith_mode JSON field) limits the maximum tier the governor will pick.

  • cheap — caps at T1. Use for background jobs, batch classification, anything where latency budget is loose and cost is tight.
  • balanced (default) — caps at T2. The right choice for most production traffic.
  • powerful — caps at T3. Use when you explicitly need the best answer; pair with a tight session budget to keep spend predictable.
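
How the mode cap and the account's allowed tiers combine can be sketched like this. The caps come from the list above; the function name and the fallback used when no allowed tier fits under the cap are assumptions, since the text does not cover that case.

```python
MODE_CAP = {"cheap": 1, "balanced": 2, "powerful": 3}


def pick_tier(scored_tier, mode="balanced", allowed_tiers=(1, 2, 3)):
    """Highest allowed tier that exceeds neither the scored tier nor the mode cap."""
    ceiling = min(scored_tier, MODE_CAP[mode])
    candidates = [tier for tier in allowed_tiers if tier <= ceiling]
    if not candidates:
        # Hypothetical fallback: use the lowest tier the account is allowed.
        return min(allowed_tiers)
    return max(candidates)


print(pick_tier(3, mode="balanced"))                       # 2
print(pick_tier(3, mode="powerful", allowed_tiers=(1, 2)))  # 2
print(pick_tier(1, mode="powerful"))                       # 1
```

The second call shows the account setting winning over the mode: powerful permits T3, but an account licensed only for T1 and T2 never gets it.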

Escalation & trace

The governor can escalate a request to a higher tier when the complexity score warrants it, as long as the higher tier is in your allowed set. Escalation never moves down — once a session is on T3, it stays on T3 for the rest of the run. This is by design: a flaky T1 → T2 → T3 → T2 oscillation is what runaway loops look like.
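
The one-way escalation rule amounts to keeping the session tier monotonically non-decreasing. A minimal sketch, with a hypothetical function name and without the mode cap for brevity:

```python
def next_session_tier(current_tier, requested_tier, allowed_tiers=(1, 2, 3)):
    """Move up to the requested tier if allowed; never move down."""
    target = max(current_tier, requested_tier)
    reachable = [t for t in allowed_tiers if current_tier <= t <= target]
    return max(reachable) if reachable else current_tier


tier = 1
history = []
for requested in [1, 2, 3, 2, 1]:
    tier = next_session_tier(tier, requested)
    history.append(tier)
print(history)  # [1, 2, 3, 3, 3]
```

Later requests that score lower do not pull the session back down, which is what rules out the T1 → T2 → T3 → T2 oscillation.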

The complete tier and escalation history is in the dashboard session detail view, including the per-step complexity score, the chosen model, the latency, and the cost. This is the "execution trace" the landing page refers to — not a black box.

Paddle plans, prepaid wallet, transparent ledger

5. Billing

Paid plans and wallet top-ups are processed by Paddle, our Merchant of Record. We chose Paddle over Stripe because it is available globally, handles sales tax and VAT for us, and issues the receipts your finance team needs. The dashboard billing view shows your plan, your quota usage, and your wallet balance in one place.

Plans

Plans are billed monthly and renew automatically. You can cancel from the Paddle customer portal at any time; cancellation stops the next renewal. Existing usage within the current period is not retroactively refunded.

  • Free — $0. 1,000 proxy requests per month, no platform spend. Governor on. For evaluation and pre-production.
  • Starter — $19/mo. 25,000 requests OR $12 platform spend, with a wallet.
  • Pro — $49/mo. 150,000 requests OR $60 platform spend, with a wallet and team-wide aggregation.
  • Enterprise — from $299/mo. Custom quotas, SSO, invoice billing, SOC 2 / HIPAA BAA where applicable.

PAYG wallet

The wallet is a prepaid balance applied to platform-key traffic. Min top-up $10. Each platform request is held against the wallet at the initial-tier price, then settled at the actual upstream cost on response. Unused balance rolls forward indefinitely and is refundable on account closure.

BYOK traffic is billed by your upstream provider and does not draw from the wallet. The wallet exists to make platform spend predictable for your finance team.

Full economics are in Terms §payments. The full margin equation is in the deck.

Refunds

Refunds are handled by Paddle for subscriptions and by Modelith for wallet balance. We do not charge a refund fee. The full terms — including the 14-day window where applicable, the PAYG non-refundable rule, and how to file a request — are on the Refund policy page.

Keep your upstream contracts, use our governor

6. BYOK — bring your own key

You can store provider keys in your account and route traffic through your own OpenAI, Anthropic, Groq, Gemini, Mistral, or OpenRouter account. Modelith becomes the governor on top of your existing upstream relationship — same session halts, same complexity scoring, same trace — billed by your provider, not by us.

Provider keys are encrypted at rest with Fernet (AES-128-CBC) and never returned in API responses. They are not written to application logs. You can rotate or revoke a key at any time from the dashboard; revocation is effective on the next request.

Supported providers and the key formats they accept:

  • OpenAI: sk-… keys. Routes direct to OpenAI when present.
  • Anthropic: sk-ant-… keys. Routes direct to Anthropic.
  • Groq: gsk_… keys.
  • Google Gemini: Google AI Studio keys.
  • Mistral: console keys.
  • OpenRouter: gives you the full OpenRouter catalog behind a single key.

Add keys at Settings → Provider keys. The presence of a key switches routing to your provider on a per-request basis.

Hosted only, for now

7. Self-hosting

Modelith is currently available as a hosted service at api.modelith.cloud. The dashboard, the routing layer, and the governor are operated as a single product — the session lifecycle depends on shared Redis state, and the complexity engine is shipped as part of the binary, not as a separate library.

Self-hosting is not currently supported. We are evaluating a private deployment tier for regulated customers with strict data-residency requirements. If that maps to you, email founders@modelith.cloud with a description of your environment and we will follow up about the waitlist.

For data-residency questions in the meantime, see the Trust center and the Security page.

The errors you will actually see

8. Troubleshooting

Most production traffic returns 200. The handful of errors that fire are the ones the governor is designed to surface — and every one of them has a structured body you can act on programmatically.

402 — budget or plan limit

The session budget was exceeded (per-session), or the plan limit was hit (per-account). The response body includes x_modelith.spent_usd, x_modelith.budget_limit_usd, and x_modelith.step. To recover: either increase the limit for the next call, close the session and start a new one, or upgrade the plan. The session is not auto-closed on a 402; further calls with the same id will continue to return 402 until you choose what to do.

429 — loop, steps, or rate limit

Three causes share the same status code, distinguished by x_modelith.halt_reason in the body:

  • loop_detected — same fingerprint 4× in 10s. The fingerprint cache is per session. Closing the session clears the cache.
  • max_steps — the configured step cap was reached. The cap is per account; raise it from Settings → Routing if you have a long-running agent.
  • rate_limited — per-key rate limit. Wait or rotate keys.
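
Client code can branch on the status code and halt_reason as sketched below. The halt_reason values and the suggested next steps are taken from this section; the function name, the extra fallback labels, and the assumption that x_modelith sits at the top level of the error body are illustrative.

```python
NEXT_STEP = {
    "budget_or_plan": "raise the limit, start a new session, or upgrade the plan",
    "loop_detected": "fix the repeating prompt; closing the session clears the cache",
    "max_steps": "raise the step cap in Settings → Routing or hand off to a human",
    "rate_limited": "wait or rotate keys",
}


def classify_halt(status, body):
    """Map an HTTP status and parsed error body to a halt category."""
    meta = body.get("x_modelith", {})  # assumed location of the metadata block
    if status == 402:
        return "budget_or_plan"
    if status == 429:
        reason = meta.get("halt_reason")
        return reason if reason in NEXT_STEP else "unknown_429"
    return "not_a_halt"


# Hand-written example body, not a captured response.
body = {"x_modelith": {"halt_reason": "loop_detected", "step": 4}}
kind = classify_halt(429, body)
print(kind, "->", NEXT_STEP[kind])
```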

502 — upstream failure

The platform upstream (OpenRouter, or your BYOK provider) returned an error. The governor does not retry by default; the response is the upstream's status. Check the Status page, and inspect the upstream request id in the response body for support. If you have a BYOK key for the failing provider, set it to bypass the platform pool.
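
Because the governor does not retry by default, a client may want its own bounded retry for 502 only. The sketch below is transport-agnostic: send is any callable you supply that returns (status, body), and all names and defaults are assumptions. Keep the attempt count low; identical retries in the same session would presumably count toward the step counter and the loop fingerprint.

```python
import time


def call_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-502 status or attempts run out.

    send is any zero-argument callable returning (status, body).
    Halts (402, 429) are returned immediately and never retried.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 502 or attempt == max_attempts:
            return status, body
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff: 0.5s, 1s, ...


# Simulated upstream: two failures, then success. No network involved.
replies = iter([(502, {}), (502, {}), (200, {"ok": True})])
delays = []
result = call_with_retry(lambda: next(replies), sleep=delays.append)
print(result, delays)  # (200, {'ok': True}) [0.5, 1.0]
```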

Reading the dashboard

The Command Center is the only place to see aggregate truth. Three views matter when you debug a session:

  • Session detail. One row per session, with the spent/budget pair, the step counter, the tier history, and a per-step trace. Open any session to see exactly which calls escalated and which halted.
  • Logs. Every request, structured, searchable, filterable by status, model, session id, and time range.
  • Costs. The ledger. Every cent, by team, by project, by model. Exports to CSV for finance.

Still stuck? Email founders@modelith.cloud with the session id and we will follow up.

Copy-paste minimum viable request

9. API reference card

Everything you need to make a governed, session-aware routed request, in three lines of configuration and one body. Save this; you will reuse it.

cURL (minimum viable governed request)
# Endpoint
ENDPOINT="https://api.modelith.cloud/api/v1/routing/chat/completions"

# Auth
KEY="mlth_your_key_here"

# Governance headers
SESSION_ID="run-$(uuidgen)"
BUDGET="0.10"

curl -X POST "$ENDPOINT" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -H "X-Modelith-Session-Id: $SESSION_ID" \
  -H "X-Modelith-Budget-Limit: $BUDGET" \
  -d '{
    "model": "auto",
    "x_modelith_mode": "balanced",
    "messages": [
      { "role": "system", "content": "You are a careful code reviewer." },
      { "role": "user", "content": "Review this PR diff for security issues." }
    ]
  }'

Expected response on success (200)

An OpenAI-compatible body with x_modelith metadata. The metadata block is what your logs and the Modelith dashboard read. The same block appears on a halt, in a structured error envelope, so your reconciliation logic can read one shape.

Expected response on budget halt (402)

An HTTP 402 with a JSON body whose detail.message reads something like "session budget exceeded" and whose x_modelith block carries spent_usd, budget_limit_usd, and step. The governor never returns 200 with an empty response on a halt; the failure is loud.

Want a hosted playground?

The Session Sandbox in the dashboard runs a real loop with a real mlth_ key. Set a budget, run a loop, watch the halt. Most teams ship to production within an hour of opening it.

Need help? Email founders@modelith.cloud — a real human reads it.