Benchmarks

We benchmark what we control — the gateway, not the model.

A managed LLM gateway adds a hop between your code and the provider. The honest question is: how much does that hop cost in latency and reliability? This page is our methodology for measuring it — and a commitment to publish real numbers, not a fabricated leaderboard.


benchmark · gateway-overhead · targets

What this page measures

Added overhead p50: 95 ms

Added overhead p95: 190 ms

Uptime SLA: 99.9%

Model quality leaderboard: not claimed

Failover: measured per hop

Numbers shown: SLA targets

No fabricated scores · Reproducible · Tail-first

Read this first. The latency figures below are SLA targets we hold ourselves to — not the results of a published study. A full reproducible gateway-latency report is on the roadmap. We would rather state that plainly than print numbers we cannot defend. For live, real-time health see the status page.

Added overhead — p50 target: 95 ms
Added overhead — p95 target: 190 ms
Uptime SLA: 99.9%
Provider lock-in cost: 0 ms

Targets, not measurements. p50 mirrors SLA_LATENCY_MS from the live stats endpoint; p95 is the tail bound we commit to.

Scope

What a gateway can — and cannot — benchmark

Routing a request does not change the model's answer. So NemoRouter benchmarks the things routing actually affects: added latency, failover, and throughput. Model quality stays where it belongs — with the model provider and independent evaluators.

Added gateway overhead

The single number that is genuinely ours: how many milliseconds NemoRouter adds on top of the provider round-trip. Auth, credit reserve, guardrail evaluation, routing, and the cost-header read all happen here.

Measured as wall-clock time inside the gateway, excluding provider latency
Reported at p50, p95, and p99 — the tail is where gateways hide cost
Broken down by stage: auth, guardrails, routing, settle
A pass-through request (no guardrails enabled) is the floor case
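
As a rough sketch of the measurement described above: assuming each request records its total wall-clock time and the time spent waiting on the provider, overhead is the difference, and the percentiles are taken over that difference. The function names and the `(total_ms, provider_ms)` tuple shape are assumptions made for this illustration, not NemoRouter's actual schema, and nearest-rank is only one of several common percentile definitions.

```python
import math


def gateway_overhead_ms(total_ms: float, provider_ms: float) -> float:
    """Wall-clock time spent inside the gateway, excluding the provider round-trip."""
    return total_ms - provider_ms


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: list[tuple[float, float]]) -> dict[str, float]:
    """Report p50, p95, and p99 of gateway overhead for (total_ms, provider_ms) pairs."""
    overheads = [gateway_overhead_ms(total, provider) for total, provider in requests]
    return {f"p{q}": percentile(overheads, q) for q in (50, 95, 99)}
```

Reporting all three percentiles from the same sample makes it visible when a comfortable p50 sits next to a slow p99.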

Failover behavior

When a provider returns a 5xx, times out, or rate-limits us, the routing engine retries with the next model in the fallback chain. We measure how often that happens and how much latency the retry adds.

Fallback trigger rate — share of requests that needed a retry
Added latency of a single failover hop
Success rate after failover vs. a naked single-provider call
Circuit-breaker open/close transitions per provider
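
The retry behavior and the first two metrics above can be sketched as follows. This is a minimal illustration under stated assumptions, not NemoRouter's routing engine: `ProviderError`, `CallResult`, `call_with_fallback`, and the one-callable-per-model shape are hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


class ProviderError(Exception):
    """Hypothetical stand-in for a 5xx, a timeout, or a rate-limit response."""


@dataclass
class CallResult:
    model: str
    response: str
    attempts: int       # 1 means no failover was needed
    failover_ms: float  # time spent on failed attempts before the successful one


def call_with_fallback(
    chain: list[tuple[str, Callable[[str], str]]],
    prompt: str,
    max_attempts: int = 3,
) -> CallResult:
    """Try each model in order, stopping at the first success or at max_attempts."""
    started = time.perf_counter()
    last_error = None
    for attempt, (model, call) in enumerate(chain[:max_attempts], start=1):
        attempt_started = time.perf_counter()
        try:
            response = call(prompt)
        except ProviderError as exc:
            last_error = exc
            continue
        failover_ms = (attempt_started - started) * 1000
        return CallResult(model, response, attempt, failover_ms)
    raise ProviderError(f"all attempts failed: {last_error}")


def fallback_trigger_rate(results: list[CallResult]) -> float:
    """Share of successful requests that needed at least one retry."""
    if not results:
        return 0.0
    return sum(r.attempts > 1 for r in results) / len(results)
```

Keeping `failover_ms` as its own field is what lets failover latency be reported separately from the happy-path percentiles.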

Throughput under load

The sustained requests per minute (RPM) and tokens per minute (TPM) the gateway holds before it starts queueing. This is the number behind the per-tier RPM/TPM guarantees on the pricing page.

Sustained RPM/TPM at each plan tier
Latency degradation curve as concurrency climbs
Queue depth and 429 behavior at the ceiling
Run with the zero-cost mock provider to isolate gateway cost

What we do NOT benchmark

Model quality — MMLU, HumanEval, GPQA, reasoning scores — is the model provider’s to measure and publish. NemoRouter does not run a model leaderboard, because routing a request does not change the model’s answer.

No invented quality scores — we link to provider + independent evals
Token pricing comes straight from the provider, surfaced 1:1
A faster route never trades away response quality
Honest gap: a published latency study is on the roadmap, not done

Gateway overhead

The hop costs milliseconds — here is where they go

Every request flows Client → Frontend → Nemo Backend → in-process routing engine → provider. Everything before the provider is overhead we own. We break it into stages so the number is auditable, not a black box.

Overhead budget

A per-stage latency budget, not a single mystery number

A pass-through request — no guardrails enabled — is the floor. Each capability you switch on adds a known, measured increment. Nothing is hidden inside an average.

Auth + key resolution — virtual-key lookup, cached in-process
Credit reserve — a single atomic Postgres mutation under an advisory lock
Guardrail evaluation — only the guardrails you enabled, skipped entirely when none are
Routing + settle — fallback selection and the cost-header read on the way back
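
One way to make a per-stage budget auditable is to time each stage explicitly and keep the results keyed by stage name. The sketch below assumes a hypothetical `StageTimer` helper; the class and its injectable clock are inventions for this illustration, and only the stage names come from the list above.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage of a single request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer can be tested deterministically
        self.stages_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the stage raised, so failures stay visible.
            elapsed_ms = (self._clock() - start) * 1000
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())


timer = StageTimer()
with timer.stage("auth"):
    pass  # virtual-key lookup would run here
with timer.stage("credit_reserve"):
    pass  # the atomic reserve would run here
# Guardrails are timed only when enabled, so a pass-through request has no such entry.
print(sorted(timer.stages_ms))
```

Because a skipped stage never appears in `stages_ms`, the pass-through floor and the cost of each enabled capability can be read directly from the recorded keys.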

trace · request-path · stage breakdown

Where the overhead budget goes (illustrative target split)

Auth + key resolution: cached lookup

Credit reserve: one atomic write

Guardrails (when enabled): per-guardrail

Routing + fallback selection: in-process

Settle (cost header read): post-response

Pass-through floor: minimal

p50 / p95 / p99 · per-stage · auditable

The routing engine runs in-process inside Nemo Backend — there is no separate network service to add a second hop. Browse the live catalog on the models page.

Reliability

Failover is a feature you can measure

A single provider has bad minutes — a 503, a timeout, a regional blip. The value of a gateway is that one bad provider does not become your bad request. We measure how often failover fires and what it costs.

Failover behavior

One bad provider should not be your outage

When a model call fails, the routing engine moves to the next model in the fallback chain. Circuit breakers keep a struggling provider out of rotation until it recovers. The benchmark question is how much latency a failover hop adds — and we report it separately, never blended into the happy-path p50.

Ordered fallback chains — when a provider fails, the next model is tried automatically
Per-provider circuit breakers open after sustained errors and probe before re-closing
Retries are bounded and budgeted — a failover never silently doubles your spend
Reserve+settle credit safety means a failed attempt costs zero credits
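
A per-provider circuit breaker of the kind described above can be sketched as a small state machine. The thresholds, the `CircuitBreaker` class, and the "consecutive failures" rule are assumptions for this illustration; the source only says breakers open after sustained errors and probe before re-closing.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive failures; half-open probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_s = cooldown_s                # hypothetical default
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if the provider may be tried right now."""
        if self.state == "open":
            if self._clock() - self._opened_at >= self.cooldown_s:
                # Simplification: requests pass while half-open until one
                # success or failure is recorded.
                self.state = "half_open"
                return True
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self._failures += 1
        # A failed probe re-opens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A router would call `allow_request()` before each provider in the chain and skip any provider whose breaker is open, which is what keeps a struggling provider out of rotation.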

router · fallback-chain · reliability

Failover invariants

Fallback chain: ordered

Circuit breaker: per provider

Retry budget: bounded

Failed-attempt cost: 0 credits

Failover latency: reported separately

Ordered chains · Circuit breakers · Reserve + settle

Methodology

How a NemoRouter benchmark is run

A benchmark is only worth printing if someone else can reproduce it. Every study we publish follows the same four rules — and ships the config so you can run it yourself.

Isolate the gateway

Run against the zero-cost mock provider so provider latency is constant. Whatever moves is gateway overhead — nothing else.

Warm, then sample

Discard cold-start requests, then collect a large fixed sample at steady state. Cold starts are reported separately, never blended into p50.
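
A minimal sketch of that rule, assuming latency samples arrive in request order. The function name and both parameters are hypothetical; how many requests count as warm-up is a choice each run would have to state in its config.

```python
def split_warmup(samples_ms: list[float], warmup_count: int,
                 sample_size: int) -> tuple[list[float], list[float]]:
    """Split ordered samples into (cold_start, steady_state).

    The steady-state window has a fixed size so runs stay comparable.
    """
    if warmup_count < 0 or sample_size <= 0:
        raise ValueError("warmup_count must be >= 0 and sample_size must be > 0")
    if len(samples_ms) < warmup_count + sample_size:
        raise ValueError("not enough samples for a fixed-size steady-state window")
    cold_start = samples_ms[:warmup_count]
    steady_state = samples_ms[warmup_count:warmup_count + sample_size]
    return cold_start, steady_state
```

Returning the cold-start samples instead of dropping them keeps them available for their own report.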

Report the tail

Publish p50, p95, and p99 — never just the average. A gateway that looks fast at p50 and stalls at p99 is a slow gateway.

Show the config

Every run ships its config: region, instance size, guardrails enabled, concurrency, sample size, and date. A benchmark you cannot reproduce is marketing.
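
A run configuration like this could be captured as a small immutable record and serialized next to the results. The `RunConfig` class and the example values below are placeholders for illustration only; the field list follows the six items named above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    region: str
    instance_size: str
    guardrails_enabled: tuple[str, ...]  # empty tuple means a pass-through run
    concurrency: int
    sample_size: int
    run_date: str  # ISO 8601 date, e.g. "2025-01-31"


def to_json(config: RunConfig) -> str:
    """Serialize with sorted keys so two configs can be compared line by line."""
    return json.dumps(asdict(config), sort_keys=True, indent=2)


# Placeholder values, not a real published run.
example = RunConfig(
    region="example-region",
    instance_size="example-size",
    guardrails_enabled=(),
    concurrency=64,
    sample_size=10_000,
    run_date="2025-01-31",
)
print(to_json(example))
```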

The honest version

Many gateways publish a single average latency number with no config, no tail percentiles, and no way to reproduce it. We treat that as marketing, not measurement. Until our reproducible study is published, the figures on this page are stated as SLA targets — the bar we commit to — and the live status page shows real component health right now.

Uptime

Availability is a number we sign

The 99.9% uptime SLA is a contractual commitment in the legal SLA — not a figure we picked because it looked good on a slide.

Uptime SLA

99.9% — committed, monitored, and public

The gateway runs on managed Cloud Run autoscaling and managed Supabase Postgres. Component health — gateway, database, settlement queue — is published live, and incident history is public rather than buried behind a support login.

99.9% uptime is a contractual commitment, not a marketing rounding
Backed by managed Cloud Run autoscaling and managed Supabase Postgres
Live component health is published — gateway, database, and settlement queue
Incident history and current status are public, not hidden behind a login

status · components · live

Published component health

Gateway (Nemo Backend): healthy

Database (Postgres): connected

Settlement queue: draining

Uptime commitment: 99.9%

Incident history: public

Cloud Run · Supabase Postgres · Public status

See real-time component health on the status page, and the full availability terms in the legal SLA.

What's next

How we will publish results

When the first reproducible study lands, here is exactly what ships with it — so there is no ambiguity about what was tested or how.

A dated report, not a moving number

Each study is timestamped and versioned. Older reports stay live so you can see the trend, not just today’s headline.

Full run configuration

Region, instance size, concurrency, guardrails enabled, sample size, and provider mix — everything needed to reproduce the run.

p50, p95, and p99 — always the tail

Every latency claim ships all three percentiles. A single average is not a benchmark; it is a hope.

Failover and error rates alongside latency

Reliability and speed are reported together. A gateway that is fast only when nothing fails is not actually fast.

Want the methodology in detail, or to run a benchmark against your own workload before committing? Email sales@nemorouter.ai.

Measure it yourself

The fastest benchmark is your own traffic

Point a real workload at NemoRouter and watch the overhead in your own dashboard. No fabricated leaderboard to trust — just the gateway, your requests, and the numbers.


Live component health is always on the status page.