Benchmarks

We benchmark what we control — the gateway, not the model.

A managed LLM gateway adds a hop between your code and the provider. The honest question is: how much does that hop cost in latency and reliability? This page is our methodology for measuring it — and a commitment to publish real numbers, not a fabricated leaderboard.

benchmark · gateway-overhead · targets

What this page measures

Added overhead p50: 95 ms
Added overhead p95: 190 ms
Uptime SLA: 99.9%
Model quality leaderboard: not claimed
Failover: measured per hop
Numbers shown: SLA targets
No fabricated scores · Reproducible · Tail-first

Read this first. The latency figures below are SLA targets we hold ourselves to, not the results of a published study. A full reproducible gateway-latency report is on the roadmap. We would rather state that plainly than print numbers we cannot defend. For real-time health, see the status page.

Added overhead — p50 target
95 ms

Gateway time on top of the provider call

Added overhead — p95 target
190 ms

The slow tail we hold ourselves to

Uptime SLA
99.9%

Contractual — see the legal SLA

Provider lock-in cost
0 ms

Switch models with a string, no re-integration

Targets, not measurements. p50 mirrors SLA_LATENCY_MS from the live stats endpoint; p95 is the tail bound we commit to.

Scope

What a gateway can — and cannot — benchmark

Routing a request does not change the model's answer. So NemoRouter benchmarks the things routing actually affects: added latency, failover, and throughput. Model quality stays where it belongs — with the model provider and independent evaluators.

Added gateway overhead

The single number that is genuinely ours: how many milliseconds NemoRouter adds on top of the provider round-trip. Auth, credit reserve, guardrail evaluation, routing, and the cost-header read all happen here.

  • Measured as wall-clock time inside the gateway, excluding provider latency
  • Reported at p50, p95, and p99 — the tail is where gateways hide cost
  • Broken down by stage: auth, guardrails, routing, settle
  • A pass-through request (no guardrails enabled) is the floor case
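As a rough illustration of the measurement described above, the sketch below subtracts provider time from total wall-clock time and reports nearest-rank percentiles. The function names and the sample data are hypothetical, chosen only for this example; they are not NemoRouter's actual instrumentation.

```python
def gateway_overhead_ms(total_ms, provider_ms):
    """Overhead = wall-clock time for the request minus time spent waiting on the provider."""
    return total_ms - provider_ms


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def summarize(requests):
    """requests: iterable of (total_ms, provider_ms) pairs."""
    overheads = [gateway_overhead_ms(total, provider) for total, provider in requests]
    return {f"p{pct}": percentile(overheads, pct) for pct in (50, 95, 99)}


# Hypothetical data: provider time fixed at 420 ms, overhead ranging from 81 to 180 ms.
requests = [(500 + i, 420) for i in range(1, 101)]
summary = summarize(requests)
```

Reporting all three percentiles from the same sample keeps the tail visible next to the median instead of hiding it inside an average.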

Failover behavior

When a provider returns a 5xx, times out, or rate-limits us, the routing engine retries with the next model in the fallback chain. We measure how often that happens and how much latency the retry adds.

  • Fallback trigger rate — share of requests that needed a retry
  • Added latency of a single failover hop
  • Success rate after failover vs. a direct single-provider call
  • Circuit-breaker open/close transitions per provider
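The first three metrics in the list above could be derived from per-request records along these lines. This is a minimal sketch: the `RequestRecord` shape and field names are assumptions made for illustration, and circuit-breaker transitions are left out.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    attempts: int              # provider calls made; 1 means no failover was needed
    succeeded: bool            # final outcome after any failover
    failover_ms: float = 0.0   # extra latency spent on failover hops


def failover_report(records):
    if not records:
        raise ValueError("no records")
    total = len(records)
    retried = [r for r in records if r.attempts > 1]
    hops = sum(r.attempts - 1 for r in retried)
    # A request that succeeded on its first attempt would also have
    # succeeded as a direct single-provider call.
    first_try_ok = sum(1 for r in records if r.attempts == 1 and r.succeeded)
    return {
        "fallback_trigger_rate": len(retried) / total,
        "success_rate_with_failover": sum(1 for r in records if r.succeeded) / total,
        "success_rate_single_provider": first_try_ok / total,
        "mean_ms_per_failover_hop": (
            sum(r.failover_ms for r in retried) / hops if hops else 0.0
        ),
    }


# Hypothetical sample: 8 clean requests, 1 rescued by failover, 1 that exhausted the chain.
records = (
    [RequestRecord(attempts=1, succeeded=True) for _ in range(8)]
    + [RequestRecord(attempts=2, succeeded=True, failover_ms=120.0)]
    + [RequestRecord(attempts=3, succeeded=False, failover_ms=300.0)]
)
report = failover_report(records)
```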

Throughput under load

Sustained requests-per-minute and tokens-per-minute the gateway holds before queueing. This is the number behind the per-tier RPM/TPM guarantees on the pricing page.

  • Sustained RPM/TPM at each plan tier
  • Latency degradation curve as concurrency climbs
  • Queue depth and 429 behavior at the ceiling
  • Run with the zero-cost mock provider to isolate gateway cost

What we do NOT benchmark

Model quality — MMLU, HumanEval, GPQA, reasoning scores — is the model provider’s to measure and publish. NemoRouter does not run a model leaderboard, because routing a request does not change the model’s answer.

  • No invented quality scores — we link to provider + independent evals
  • Token pricing comes straight from the provider, surfaced 1:1
  • A faster route never trades away response quality
  • Honest gap: a published latency study is on the roadmap, not done

Gateway overhead

The hop costs milliseconds — here is where they go

Every request flows Client → Frontend → Nemo Backend → in-process routing engine → provider. Everything before the provider is overhead we own. We break it into stages so the number is auditable, not a black box.

Overhead budget

A per-stage latency budget, not a single mystery number

A pass-through request — no guardrails enabled — is the floor. Each capability you switch on adds a known, measured increment. Nothing is hidden inside an average.

  • Auth + key resolution — virtual-key lookup, cached in-process
  • Credit reserve — a single atomic Postgres mutation under an advisory lock
  • Guardrail evaluation — only the guardrails you enabled, skipped entirely when none are
  • Routing + settle — fallback selection and the cost-header read on the way back
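One way to make a per-stage budget auditable is to time each stage separately and sum everything except the provider call. The sketch below assumes a hypothetical `StageTimer` helper and placeholder stage bodies; it shows the bookkeeping only, not the real request path.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.stages[name] = self.stages.get(name, 0.0) + elapsed_ms

    def overhead_ms(self, exclude=("provider",)):
        # Gateway overhead is every stage except the provider round-trip.
        return sum(ms for name, ms in self.stages.items() if name not in exclude)


def handle_request(timer, guardrails_enabled):
    with timer.stage("auth"):
        pass  # placeholder: virtual-key lookup
    with timer.stage("credit_reserve"):
        pass  # placeholder: reserve credits
    if guardrails_enabled:  # skipped entirely when no guardrails are enabled
        with timer.stage("guardrails"):
            pass
    with timer.stage("routing"):
        pass  # placeholder: pick a model from the fallback chain
    with timer.stage("provider"):
        time.sleep(0.01)  # stand-in for the provider round-trip
    with timer.stage("settle"):
        pass  # placeholder: read the cost header and settle


timer = StageTimer()
handle_request(timer, guardrails_enabled=False)
```

With guardrails disabled, no `guardrails` entry appears at all, which matches the pass-through floor case.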

trace · request-path · stage breakdown

Where the overhead budget goes (illustrative target split)

Auth + key resolution: cached lookup
Credit reserve: one atomic write
Guardrails (when enabled): per-guardrail
Routing + fallback select: in-process
Settle (cost header read): post-response
Pass-through floor: minimal
p50 / p95 / p99 · per-stage · auditable

The routing engine runs in-process inside Nemo Backend — there is no separate network service to add a second hop. Browse the live catalog on the models page.

Reliability

Failover is a feature you can measure

A single provider has bad minutes — a 503, a timeout, a regional blip. The value of a gateway is that one bad provider does not become your bad request. We measure how often failover fires and what it costs.

Failover behavior

One bad provider should not be your outage

When a model call fails, the routing engine moves to the next model in the fallback chain. Circuit breakers keep a struggling provider out of rotation until it recovers. The benchmark question is how much latency a failover hop adds — and we report it separately, never blended into the happy-path p50.

  • Ordered fallback chains — when a provider fails, the next model is tried automatically
  • Per-provider circuit breakers open after sustained errors and probe before re-closing
  • Retries are bounded and budgeted — a failover never silently doubles your spend
  • Reserve+settle credit safety means a failed attempt costs zero credits
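The invariants above can be sketched in a few lines. Everything here is a simplified, hypothetical model (the `Ledger`, `CircuitBreaker`, and `route` names, the integer credits, the threshold of three failures), and the breaker omits the probe-before-re-closing step for brevity.

```python
class ProviderError(Exception):
    pass


class Ledger:
    """Integer credit balance with reserve, release, and settle."""

    def __init__(self, balance):
        self.balance = balance

    def reserve(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= amount

    def release(self, amount):
        self.balance += amount  # failed attempt: refund the whole reservation

    def settle(self, reserved, actual_cost):
        self.balance += reserved - actual_cost  # refund the unused part


class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def route(chain, breakers, call, ledger, reserve_amount, max_attempts=3):
    """Try models in order; return (response, model). call(model) -> (response, cost)."""
    attempts = 0
    for model in chain:
        if breakers[model].is_open:
            continue  # provider is out of rotation
        if attempts >= max_attempts:
            break     # retry budget exhausted
        attempts += 1
        ledger.reserve(reserve_amount)
        try:
            response, cost = call(model)
        except ProviderError:
            breakers[model].record(False)
            ledger.release(reserve_amount)
            continue
        breakers[model].record(True)
        ledger.settle(reserve_amount, cost)
        return response, model
    raise ProviderError("all models in the fallback chain failed")


def fake_call(model):
    # Hypothetical providers: the first always fails, the second charges 2 credits.
    if model == "model-a":
        raise ProviderError("503")
    return "ok", 2


chain = ["model-a", "model-b"]
breakers = {m: CircuitBreaker() for m in chain}
ledger = Ledger(100)
result = route(chain, breakers, fake_call, ledger, reserve_amount=10)
```

After the run, only the successful attempt has been charged: the failed call to the first model released its full reservation.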

router · fallback-chain · reliability

Failover invariants

Fallback chain: ordered
Circuit breaker: per provider
Retry budget: bounded
Failed-attempt cost: 0 credits
Failover latency: reported separately
Ordered chains · Circuit breakers · Reserve + settle

Methodology

How a NemoRouter benchmark is run

A benchmark is only worth printing if someone else can reproduce it. Every study we publish follows the same four rules — and ships the config so you can run it yourself.

1. Isolate the gateway

Run against the zero-cost mock provider so provider latency is constant. Whatever moves is gateway overhead — nothing else.

2. Warm, then sample

Discard cold-start requests, then collect a large fixed sample at steady state. Cold starts are reported separately, never blended into p50.

3. Report the tail

Publish p50, p95, and p99 — never just the average. A gateway that looks fast at p50 and stalls at p99 is a slow gateway.

4. Show the config

Every run ships its config: region, instance size, guardrails enabled, concurrency, sample size, and date. A benchmark you cannot reproduce is marketing.
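The four rules can be sketched as a small harness. This is an illustrative outline only: `RunConfig`, `mock_provider`, `gateway_handle`, and the config values are hypothetical stand-ins, and the loop runs sequentially rather than at the recorded concurrency.

```python
import statistics
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    region: str
    instance_size: str
    guardrails_enabled: tuple
    concurrency: int
    sample_size: int
    warmup_requests: int
    date: str


def mock_provider(prompt):
    """Zero-cost stand-in: returns immediately, so provider latency is constant."""
    return "ok"


def gateway_handle(prompt, provider):
    """Placeholder for the gateway code path under test."""
    return provider(prompt)


def run_benchmark(config, handler=gateway_handle):
    timings_ms = []
    for _ in range(config.warmup_requests + config.sample_size):
        start = time.perf_counter()
        handler("ping", mock_provider)
        timings_ms.append((time.perf_counter() - start) * 1000)

    # Rule 2: cold-start requests are kept, but reported separately.
    cold = timings_ms[: config.warmup_requests]
    steady = timings_ms[config.warmup_requests:]

    # Rule 3: statistics.quantiles(n=100) returns 99 cut points; index k-1 is pk.
    cuts = statistics.quantiles(steady, n=100)
    return {
        "config": asdict(config),  # Rule 4: the config ships with the numbers
        "cold_start_ms": cold,
        "samples": len(steady),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }


config = RunConfig(
    region="example-region",
    instance_size="example-size",
    guardrails_enabled=(),
    concurrency=1,
    sample_size=200,
    warmup_requests=5,
    date="1970-01-01",
)
report = run_benchmark(config)
```

Bundling the config into the same report object as the percentiles makes it hard to publish one without the other.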

The honest version

Many gateways publish a single average latency number with no config, no tail percentiles, and no way to reproduce it. We treat that as marketing, not measurement. Until our reproducible study is published, the figures on this page are stated as SLA targets — the bar we commit to — and the live status page shows real component health right now.

Uptime

Availability is a number we sign

The 99.9% uptime SLA is a contractual commitment in the legal SLA — not a figure we picked because it looked good on a slide.
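To put 99.9% in concrete terms, the arithmetic below converts an availability percentage into a downtime budget. The 30-day and 365-day windows are assumptions for illustration; the measurement window that actually applies is whatever the legal SLA defines.

```python
from fractions import Fraction


def allowed_downtime_minutes(sla_percent, window_days):
    """Downtime budget in minutes for an availability target over a window.

    sla_percent is passed as a string (for example "99.9") so the
    arithmetic stays exact instead of accumulating float error.
    """
    window_minutes = window_days * 24 * 60
    unavailable_share = (100 - Fraction(sla_percent)) / 100
    return float(window_minutes * unavailable_share)


monthly = allowed_downtime_minutes("99.9", 30)    # 30-day window (assumed)
yearly = allowed_downtime_minutes("99.9", 365)    # 365-day window (assumed)
```

Under those assumed windows, 99.9% works out to 43.2 minutes per 30 days, or 525.6 minutes (about 8.76 hours) per year.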

Uptime SLA

99.9% — committed, monitored, and public

The gateway runs on managed Cloud Run autoscaling and managed Supabase Postgres. Component health — gateway, database, settlement queue — is published live, and incident history is public rather than buried behind a support login.

  • 99.9% uptime is a contractual commitment, not a marketing rounding
  • Backed by managed Cloud Run autoscaling and managed Supabase Postgres
  • Live component health is published — gateway, database, and settlement queue
  • Incident history and current status are public, not hidden behind a login

status · components · live

Published component health

Gateway (Nemo Backend): healthy
Database (Postgres): connected
Settlement queue: draining
Uptime commitment: 99.9%
Incident history: public
Cloud Run · Supabase Postgres · Public status

See real-time component health on the status page, and the full availability terms in the legal SLA.

What's next

How we will publish results

When the first reproducible study lands, here is exactly what ships with it — so there is no ambiguity about what was tested or how.

A dated report, not a moving number

Each study is timestamped and versioned. Older reports stay live so you can see the trend, not just today’s headline.

Full run configuration

Region, instance size, concurrency, guardrails enabled, sample size, and provider mix — everything needed to reproduce the run.

p50, p95, and p99 — always the tail

Every latency claim ships all three percentiles. A single average is not a benchmark; it is a hope.

Failover and error rates alongside latency

Reliability and speed are reported together. A gateway that is fast only when nothing fails is not actually fast.

Want the methodology in detail, or to run a benchmark against your own workload before committing? Email sales@nemorouter.ai.

Measure it yourself

The fastest benchmark is your own traffic

Point a real workload at NemoRouter and watch the overhead in your own dashboard. No fabricated leaderboard to trust — just the gateway, your requests, and the numbers.

Live component health is always on the status page.