
The Shadow API Market Gets an Audit Layer: GatewayBench Launches via Check4U.ai

May 29, 2026


AI relay services have expanded model access, but they have also created a new trust gap. GatewayBench is an AI gateway evaluation framework now open for submissions through Check4U.ai. It turns model authenticity, billing transparency, cache isolation, and real cost into verifiable metrics, giving API relay providers, routing gateways, aggregators, and model service providers a public leaderboard to prove what they deliver.

AI model APIs are moving into production, powering customer support, code generation, agent workflows, risk systems, and research pipelines. But behind that shift, an old access problem is getting harder to ignore: not every developer can reach the same models on the same terms.

In some markets, teams can put a card on file and start testing the latest models through official APIs with little friction. Elsewhere, developers still run into regional limits, payment hurdles, account reviews, and compliance checks. Demand does not disappear just because access is uneven, which is why a third-party API relay market — often described as Shadow API — has grown around the gaps.

For many teams, the appeal is obvious. Shadow API gives developers a lower-friction way to use models like GPT and Claude, offering something close to the official API experience while filling gaps the official channels do not always cover.

The problem is what happens behind the gateway.

Providers may advertise original models, lower prices, faster speeds, and cache support. But users rarely get a clear view of the backend: which model actually handled the request, how tokens were counted, whether cache savings were passed through, how account data was isolated, and how failed requests were billed.

That risk is starting to move beyond procurement. In March 2026, an audit paper from Germany’s CISPA Helmholtz Center for Information Security found that at least 187 academic papers had used Shadow API, with 62% accepted by top conferences including ACL, CVPR, and ICLR. If researchers cannot be sure which model actually ran the experiment, or whether it was quantized or downgraded along the way, the results become harder to reproduce and harder to trust.

Shadow API solves an access problem. It also creates a trust problem.

That is the gap GatewayBench is trying to address. Available through Check4U.ai, the open-source audit framework aims to give the Shadow API market a more transparent way to compare gateways.

GatewayBench is not meant to be another speed test. It treats the gateway itself as something that needs to be audited, taking the opaque backend behavior behind an API call and turning it into metrics that can be checked, rerun, and compared.

That is also where it breaks from most horizontal benchmarks. Existing evaluations usually look at speed, price, model coverage, and availability, assuming the data returned by the gateway can be trusted. In the Shadow API market, that assumption is shaky. A provider can look fast and cheap on the surface while routing requests to a downgraded model, padding token counts, mixing cache pools, or billing for failures in ways users cannot easily see.

GatewayBench adds a layer that many benchmarks leave out: technical integrity. Traditional tests tend to stop at the edge of the interface, measuring response time, listed pricing, and model availability. GatewayBench looks one step further, asking whether the path behind the request can still be checked, explained, and compared.

Put simply, it is less interested in whether a gateway looks fast and cheap, and more interested in whether it delivered what it promised. Once a developer sends a request through a relay, GatewayBench asks what actually came back: which model handled the request, how the cost was calculated, whether backend behavior stayed within the promised boundaries, and whether the service can be trusted beyond the marketing page.

GatewayBench breaks gateway evaluation into three layers: L1 trust, L2 performance, and L3 economics.

L1 looks at whether the gateway is delivering the model it claims to deliver, counting tokens transparently, and handling cache in a trustworthy way. L2 looks at latency, Goodput, and whether the service can keep delivering under long-context workloads. L3 looks past the listed price, asking what the customer actually pays once input, output, cache, failed requests, payment rails, and other costs are included.

The weighting is deliberate:

Final score = 0.40 × L1 trust + 0.40 × L3 economics + 0.20 × L2 performance

That mix says a lot about the Shadow API problem. Speed matters, but it does not come first. A gateway has to show that the model is real, the bill makes sense, and the cost can be explained before it gets to compete on raw performance.
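As a rough illustration of how that weighting plays out, the sketch below combines three layer scores with the published weights. The 0 to 100 scale for each layer and the two example gateways are assumptions made for illustration; the article does not specify how layer scores are normalized.

```python
# Weights come from the formula above; the 0-100 layer scale is an assumption.
WEIGHTS = {"l1_trust": 0.40, "l3_economics": 0.40, "l2_performance": 0.20}


def final_score(l1_trust: float, l2_performance: float, l3_economics: float) -> float:
    """Combine the three layer scores using the published weights."""
    return (
        WEIGHTS["l1_trust"] * l1_trust
        + WEIGHTS["l3_economics"] * l3_economics
        + WEIGHTS["l2_performance"] * l2_performance
    )


# Hypothetical gateways: one fast but opaque, one slower but verifiable.
fast_but_opaque = final_score(l1_trust=40, l2_performance=95, l3_economics=50)
slow_but_honest = final_score(l1_trust=90, l2_performance=60, l3_economics=85)
print(fast_but_opaque, slow_but_honest)
```

With these made-up inputs, the slower gateway ends up well ahead, because a 35-point lead on performance is only worth 7 points in the final score.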

The biggest risk in a model gateway is often what users cannot see. A request may return normally, but the backend path can still be messy: which model actually handled the call, how thinking tokens were counted, whether cache savings were passed through, and how account data was kept apart.

GatewayBench’s L1 layer is built around that visibility gap. It takes the kinds of problems users usually notice only indirectly — a model feeling weaker, a bill looking higher, a cache discount not quite adding up — and turns them into audit questions. Instead of relying on provider logs, it uses statistical tests, cryptographic structures, and latency fingerprints to check whether a gateway is delivering the model, cost, and isolation it promised.

1. Model authenticity: catching dynamic model switching 

One of the harder tricks to catch in the Shadow API market is dynamic model switching. A gateway can route requests to different backends depending on cost, load, customer type, or the shape of the request. In a more careful version of the playbook, obvious test calls or high-value customers may still hit the official model, while ordinary traffic, long-tail prompts, or expensive workloads get sent to a quantized model, a distilled model, or a cheaper substitute.

That makes one-off testing weak. The response can look normal, even when the model behind it is no longer the one the provider promised.

If you only look at the final answer, this kind of substitution is hard to spot. LLMs are probability systems underneath. Given the same prompt, the official model should produce a relatively stable ranking of likely next tokens. A cheaper or downgraded model may produce a similar answer, but its token ranks will drift, especially in the middle and long tail of the distribution.

GatewayBench approaches this through RUT, or Rank-based Uniformity Test, a method used to check whether a black-box LLM API behaves consistently with a reference model. It does not simply compare the semantic meaning of the final text. Instead, GatewayBench samples outputs over time and places the generated tokens back into the reference model’s probability ranking, checking whether their positions still look statistically consistent.

If a gateway has swapped the model, downgraded it through quantization, or applied harmful fine-tuning, the text may still read well. But the token positions inside the reference model’s probability distribution can start to shift in a systematic way. That shift becomes a statistical trace of model substitution or degradation.
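The statistical idea can be sketched in simplified form. Assume each sampled output has already been scored against the reference model and reduced to a rank percentile between 0 and 1, and that percentiles from a matching model should look roughly uniform. The code below uses a one-sample Kolmogorov-Smirnov distance as a stand-in test; this is an illustrative substitute, not necessarily the statistic RUT itself uses, and the synthetic percentiles are invented.

```python
import math


def ks_uniform_statistic(percentiles):
    """One-sample Kolmogorov-Smirnov distance between the sample and Uniform(0, 1)."""
    xs = sorted(percentiles)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Compare the empirical CDF just after and just before each point.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


def looks_substituted(percentiles, critical=1.36):
    """Flag when the distance exceeds the asymptotic 5% KS critical value."""
    return ks_uniform_statistic(percentiles) > critical / math.sqrt(len(percentiles))


# Synthetic data: evenly spread percentiles versus percentiles skewed toward 0.
n = 500
matching = [(i + 0.5) / n for i in range(n)]
skewed = [p ** 2 for p in matching]

matching_flag = looks_substituted(matching)
skewed_flag = looks_substituted(skewed)
print(matching_flag, skewed_flag)
```

The expensive part in practice is producing the percentiles, which requires access to the reference model; the uniformity check itself is cheap.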


For continuous monitoring, GatewayBench can also use Logprob Tracking. This method does not require generating a full answer every time. Instead, it asks the endpoint to produce a single token under a fixed prompt, then tracks whether that token’s log probability stays stable or drifts over time. If the same endpoint shows a clear logprob shift across different time windows, it may indicate a backend model update, fine-tuning change, quantization adjustment, or routing switch.
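A minimal sketch of the window comparison, assuming the single-token log probabilities have already been collected for a baseline window and a later window. The Welch-style z-score, the threshold of 3, and the sample values are illustrative assumptions, not GatewayBench's documented procedure.

```python
import math
import statistics


def logprob_shift_z(baseline, current):
    """Welch-style z-score for the difference in mean logprob between two windows."""
    mean_diff = statistics.fmean(current) - statistics.fmean(baseline)
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(current) / len(current)
    )
    if se == 0:
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / se


def drift_detected(baseline, current, threshold=3.0):
    """Flag a shift when the z-score magnitude exceeds an assumed threshold."""
    return abs(logprob_shift_z(baseline, current)) > threshold


# Invented logprobs for the same token under the same fixed prompt.
baseline = [-0.21, -0.19, -0.20, -0.22, -0.18, -0.20]
stable = [-0.20, -0.21, -0.19, -0.20, -0.22, -0.19]
shifted = [-0.35, -0.33, -0.36, -0.34, -0.37, -0.35]

stable_flag = drift_detected(baseline, stable)
shifted_flag = drift_detected(baseline, shifted)
print(stable_flag, shifted_flag)
```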

These signals are hard to explain away with marketing claims. RUT is mainly used to check whether the target endpoint still behaves like the reference model. Logprob Tracking adds a lower-cost way to watch whether the same endpoint keeps behaving consistently over time. Together, they move model authenticity away from a gut feeling — “this model seems worse” — and closer to something that can be statistically tested and continuously monitored.

For enterprise users, the real question is simple: is the model being delivered still the same one promised in the contract, the API documentation, and the pricing table?

2. Billing transparency: auditing hidden thinking tokens

Reasoning models have made API billing harder to check.

With these models, part of the work happens inside a hidden reasoning process, often counted as thinking tokens. Those tokens can affect the bill, but they are not always fully visible to the customer. That creates a new blind spot: users can see the prompt, the final answer, and the invoice, but not always the reasoning cost sitting between them.

GatewayBench treats that blind spot as part of the trust problem. The basic idea is simple: the visible answer should have some relationship to the amount of reasoning claimed behind it. A long, careful, multi-step answer is unlikely to come from almost no thinking cost. A short, simple answer should not quietly carry an unusually large hidden token charge.

To test that, GatewayBench uses PALACE, a reverse-reasoning mechanism that looks at the user prompt and the final answer, then estimates a reasonable range for thinking-token usage. If the billed reasoning tokens sit far outside that range, the request can be flagged for abnormal billing.
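The flagging step can be sketched as a simple range check. The estimator itself is not implemented here: `ReasoningEstimate` is a hypothetical container for whatever range a PALACE-style estimator would return, and the 25% tolerance and the verdict labels are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningEstimate:
    """Hypothetical output of a thinking-token estimator for one request."""
    low: int   # lower bound of plausible thinking tokens
    high: int  # upper bound of plausible thinking tokens


def audit_thinking_tokens(billed: int, estimate: ReasoningEstimate,
                          tolerance: float = 0.25) -> str:
    """Compare billed thinking tokens with the estimated range, allowing some slack."""
    if billed > estimate.high * (1 + tolerance):
        return "over_billed"
    if billed < estimate.low * (1 - tolerance):
        return "under_reported"
    return "plausible"


estimate = ReasoningEstimate(low=200, high=900)
verdicts = [audit_thinking_tokens(b, estimate) for b in (650, 4000, 20)]
print(verdicts)
```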

CoIn adds another layer. It organizes token-level vector fingerprints into a verifiable cryptographic structure, such as a Merkle Tree, making billing records harder to rewrite after the fact.
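To show why a Merkle tree makes records harder to rewrite, here is a generic Merkle root over a list of byte strings. It is a textbook construction, not CoIn's actual scheme: the record format is invented, and the leaf and node prefixes and the duplicate-last-node rule for odd levels are choices made for this sketch.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Return the hex Merkle root of the given leaf records."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Prefix leaves and inner nodes differently so they cannot be confused.
    level = [_sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Hypothetical per-token fingerprint records for one request.
records = [b"tok-0|fp:9f2c", b"tok-1|fp:a41b", b"tok-2|fp:07de"]
tampered = [b"tok-0|fp:9f2c", b"tok-1|fp:ffff", b"tok-2|fp:07de"]

root = merkle_root(records)
tampered_root = merkle_root(tampered)
print(root != tampered_root)
```

If the root is committed at billing time, changing any single record later produces a different root, which is what makes after-the-fact edits detectable.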

Together, these methods try to answer a practical question: does the bill match the way the model appears to have generated the answer?

That matters because reasoning-model costs no longer live only in the text users can see. Some of the real cost may sit inside invisible thinking tokens. GatewayBench’s goal is to make that hidden layer open to challenge, audit, and review.

3. Cache integrity: checking the discount and the boundary

Prompt caching is supposed to make AI gateways cheaper and faster. When a request hits cache, the upstream cost can drop sharply, giving gateways a real way to improve margins and pass savings back to customers.

It also creates room for games around the edges.

A gateway may receive an upstream cache discount but still bill the customer as if the request were fully processed. Or it may push different customers’ prompts into a shared cache pool, improving hit rates at the cost of weaker account isolation. In that case, caching stops being a simple performance feature and becomes a question of billing trust and data boundaries.

GatewayBench tests cache behavior through latency fingerprints. A real cache hit should usually show up in the timing, especially in a lower time to first token, or TTFT. If the invoice reports a cache hit but the latency does not move, the “discount” may exist only on the bill.
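A minimal sketch of that timing check, assuming TTFT samples have been collected separately for requests billed as cache misses and requests billed as cache hits. The median comparison, the 1.3x minimum speedup, and the sample values are illustrative assumptions.

```python
import statistics


def cache_hit_is_plausible(miss_ttft_ms, claimed_hit_ttft_ms, min_speedup=1.3):
    """True when requests billed as cache hits are meaningfully faster to first token."""
    miss_median = statistics.median(miss_ttft_ms)
    hit_median = statistics.median(claimed_hit_ttft_ms)
    return miss_median / hit_median >= min_speedup


# Invented TTFT samples in milliseconds.
misses = [820, 790, 850, 805, 830]
real_hits = [310, 295, 330, 300, 320]
paper_hits = [800, 815, 790, 825, 810]  # billed as hits, timed like misses

real_ok = cache_hit_is_plausible(misses, real_hits)
paper_ok = cache_hit_is_plausible(misses, paper_hits)
print(real_ok, paper_ok)
```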

It also tests account isolation. One account can send a rare long text to establish cache state, while another independent account sends the same text later. If the second account gets an unusually fast response, the result may point to cross-account cache reuse, raising questions about how cleanly the gateway separates tenants.
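The two-account probe described above can be sketched as follows. The `send_as_a` and `send_as_b` callables are hypothetical clients that send a prompt and return TTFT in milliseconds; the 0.6 ratio is an assumed threshold, and the fake gateway exists only to exercise the probe without a network.

```python
import secrets


def cross_account_leak_suspected(send_as_a, send_as_b, cold_ttft_ms,
                                 suspicious_ratio=0.6, words=2000):
    """Warm the cache from account A, then time the same rare prompt from account B."""
    rare_prompt = " ".join(secrets.token_hex(4) for _ in range(words))
    send_as_a(rare_prompt)           # establishes cache state under account A
    ttft_b = send_as_b(rare_prompt)  # independent account, identical text
    return ttft_b < suspicious_ratio * cold_ttft_ms


def make_fake_gateway(shared_pool: bool):
    """Toy in-memory gateway; returns a factory of per-account send functions."""
    cache = set()

    def client_for(account):
        def send(prompt):
            key = prompt if shared_pool else (account, prompt)
            hit = key in cache
            cache.add(key)
            return 250.0 if hit else 800.0  # simulated TTFT in ms
        return send

    return client_for


leaky = make_fake_gateway(shared_pool=True)
isolated = make_fake_gateway(shared_pool=False)
leaky_result = cross_account_leak_suspected(leaky("acct-a"), leaky("acct-b"), 800.0)
isolated_result = cross_account_leak_suspected(isolated("acct-a"), isolated("acct-b"), 800.0)
print(leaky_result, isolated_result)
```

A real probe would repeat the measurement several times, since a single fast response can also come from ordinary latency noise.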

That is why cache belongs in the trust layer, not just the performance layer. A faster, cheaper gateway is only useful if the savings are real and the data boundary still holds. For GatewayBench, supply-chain trust is about the process behind the answer: which model ran it, how the cost was counted, and whether the response was generated inside the boundary the customer expected.

A gateway should clear the trust layer before performance becomes worth comparing.

Speed is still the easiest number to sell in the AI gateway market. Providers can point to lower latency, faster routing, or headline numbers like “150 tokens per second.” Those claims help filter out services that are obviously too slow, unstable, or unable to handle basic traffic. GatewayBench still measures them. It just gives L2 performance a more restrained weight: 20% of the final score.

The reason is practical. A fast gateway can still be a bad gateway if the model behind it has been swapped, the bill has been padded, or cache behavior cannot be explained. In a Shadow API setting, performance only means much after the service has shown that its delivery is trustworthy and its costs are clear.

So GatewayBench does not chase a single peak number. It looks at performance the way production teams experience it: how long users wait, how much useful work arrives within a latency budget, and how the system behaves when requests become heavier.

1. Latency and Goodput

In commercial AI systems, throughput means little without timing. A delayed chat response can lose a user. A delayed agent step can break a workflow. A delayed response in trading, automation, or risk systems can miss the decision window entirely.

GatewayBench starts with latency, using it to draw the line between usable and unusable service. Goodput is measured inside that line. Raw tokens per second can look strong in a clean test; Goodput asks how much useful work still arrives on time when requests queue up, streams jitter, and long-context jobs compete for capacity.

Latency is split into several signals. TTFT measures time to first token, shaping the first feel of an interactive response. TPOT / ITL tracks the gap between tokens, showing whether streaming stays smooth. TTFA measures when the first useful answer appears in a reasoning-model flow, where hidden thinking tokens can distort the experience. E2E latency captures the full request time, which matters more for batch and non-streaming workloads.
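Most of these signals can be derived from one streamed response, given the time the request was sent and the arrival time of each token. The sketch below covers TTFT, the inter-token gap, and E2E latency; the function name, the dictionary keys, and the sample timestamps are assumptions.

```python
def latency_signals(request_sent_s: float, token_times_s: list[float]) -> dict:
    """Derive TTFT, inter-token gaps, and end-to-end latency from arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens were received")
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    return {
        "ttft_s": token_times_s[0] - request_sent_s,
        "mean_itl_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "max_itl_s": max(gaps, default=0.0),
        "e2e_s": token_times_s[-1] - request_sent_s,
    }


# Invented stream: a fast first token, then one long stall mid-stream.
signals = latency_signals(0.0, [0.40, 0.45, 0.50, 1.30, 1.35])
print(signals)
```

In this example the first token arrives quickly, but the largest gap is sixteen times the typical one, the kind of stutter an average would hide.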

That split matters because slow service has more than one shape. A gateway can return the first token quickly, then stutter through the rest of the stream. It can look fine when idle, then lose control of P95 and P99 latency once concurrency rises. Averages hide those failures; layered latency metrics make them easier to see.

GatewayBench then brings in SLOs, or service-level objectives, as the business line. A team might require P95 TTFA below 1.5 seconds, P95 E2E below 8 seconds, and streaming output without obvious jitter. Throughput only counts when it stays inside those bounds.
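One way to make "throughput only counts inside the bounds" concrete is to count output tokens only from requests that succeeded and met the latency limits. The sketch below applies the limits per request, which is a simplification of a P95 objective measured across requests; the record fields, the nearest-rank percentile, and the sample data are assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def goodput_tokens_per_s(requests, window_s, max_ttfa_s=1.5, max_e2e_s=8.0):
    """Output tokens per second, counting only successful requests within the limits."""
    good_tokens = sum(
        r["output_tokens"]
        for r in requests
        if r["ok"] and r["ttfa_s"] <= max_ttfa_s and r["e2e_s"] <= max_e2e_s
    )
    return good_tokens / window_s


# Invented results from a 10-second measurement window.
requests = [
    {"ok": True, "ttfa_s": 0.8, "e2e_s": 5.0, "output_tokens": 400},
    {"ok": True, "ttfa_s": 1.2, "e2e_s": 7.5, "output_tokens": 600},
    {"ok": True, "ttfa_s": 2.4, "e2e_s": 9.0, "output_tokens": 700},   # misses both limits
    {"ok": False, "ttfa_s": 0.9, "e2e_s": 3.0, "output_tokens": 100},  # broken stream
]

raw = sum(r["output_tokens"] for r in requests) / 10.0
goodput = goodput_tokens_per_s(requests, window_s=10.0)
p95_e2e = percentile([r["e2e_s"] for r in requests], 95)
print(raw, goodput, p95_e2e)
```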

Speed still matters, but the clean-room peak is only the opening bid. For production teams, the real test is what happens under load: how much work gets delivered before first-token latency spikes, streaming turns choppy, or the request misses its SLO.

2. Long context under pressure

Short prompts are a low bar. Many gateways can handle everyday chat well, then behave very differently when the workload shifts to RAG, long-document analysis, or complex agent runs. Moving from 1k tokens to 10k or 100k changes the resource profile, putting more pressure on attention compute, KV cache, queueing, concurrency isolation, and routing.

GatewayBench tests long-context behavior at 1k, 10k, and 100k input lengths. The goal is to see how the system slows down, and whether that slowdown is stable enough for teams to plan around.

A healthy service should degrade in a way that is visible and explainable. An unhealthy one may fall off a cliff at a certain context length, lose its P95 and P99 tail, or keep returning “successful” responses while the product experience has already broken.

The long-context curve can also expose softer forms of degradation. A gateway might avoid rejecting heavy requests and instead move them into a slower queue. It might price large-context jobs differently, place them on weaker resource pools, or route them through opaque scheduling rules. By comparing latency, throughput, and cost across 1k, 10k, and 100k inputs, GatewayBench can see whether performance degradation and pricing still line up.
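A minimal sketch of such a comparison: normalize latency and cost at each input length against the smallest size, so the two growth curves can be read side by side. The measurement fields and the numbers are invented for illustration.

```python
def degradation_curve(measurements: dict) -> list:
    """Express P95 latency and cost at each input size as a multiple of the smallest size."""
    sizes = sorted(measurements)
    base = measurements[sizes[0]]
    return [
        {
            "input_tokens": size,
            "latency_x": measurements[size]["p95_e2e_s"] / base["p95_e2e_s"],
            "cost_x": measurements[size]["cost_usd"] / base["cost_usd"],
        }
        for size in sizes
    ]


# Invented measurements at the three input lengths named above.
curve = degradation_curve({
    1_000: {"p95_e2e_s": 2.0, "cost_usd": 0.004},
    10_000: {"p95_e2e_s": 4.0, "cost_usd": 0.031},
    100_000: {"p95_e2e_s": 30.0, "cost_usd": 0.30},
})
for point in curve:
    print(point)
```

A sudden jump in one multiple that the other does not share, such as latency growing far faster than cost between two sizes, is the kind of mismatch worth investigating.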

For long-context workloads, the headline token limit is not the whole story. The more useful signal is the degradation curve: how the gateway behaves when the request starts to look like real enterprise work.

Price is usually the first number buyers look at when they evaluate an AI gateway. It is also one of the easiest numbers to repackage.

That is why GatewayBench gives L3 economics a 40% weight, putting it alongside L1 trust as a core part of the score. The goal is to get closer to True Cost per 1M tokens: what a customer actually pays after input, output, cache, failed requests, exchange rates, payment rails, and refund terms are all counted.

Pricing transparency matters because it is part of trust. A gateway that cannot clearly break down its pricing leaves more room for confusion: cost shifting, weaker service, hidden fees, or billing rules that only show up after production traffic starts flowing.

1. Unpacking the bundled price

One common move in the gateway market is to advertise a single blended price. It is simple, easy to compare, and often looks good on a table. It can also hide where the cost really sits.

AI workloads do not have a fixed input-output ratio. RAG, long-document analysis, and long-context agent runs often produce huge input loads with relatively modest output. Code generation and content workflows may lean the other way, with smaller prompts and heavier output.

A blended price flattens those differences. A provider can make the overall number look competitive by lowering one line item and raising another, shifting cost into the workloads that are less obvious in a headline comparison. A gateway may look cheap in a generic benchmark and still become expensive for a team running input-heavy RAG.

GatewayBench responds by asking for four prices separately: input, output, cache hit, and cache write. For providers that price cache writes by TTL, such as short-lived versus longer-lived cache windows, those costs need to be split out as well.

The point is to move pricing back into the customer’s own workload. A team doing long-document retrieval needs to know whether input and cache costs will eat the budget. A team doing code generation will care more about output. Four-track pricing turns a marketing number back into something buyers can actually model.
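To see how four-track pricing changes a comparison, the sketch below prices two workloads against two gateways. Every price and token volume here is hypothetical; only the four price categories come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePer1M:
    """Hypothetical USD prices per one million tokens, split into four tracks."""
    input: float
    output: float
    cache_hit: float
    cache_write: float


def workload_cost(price: PricePer1M, input_tokens: int, output_tokens: int,
                  cache_hit_tokens: int = 0, cache_write_tokens: int = 0) -> float:
    """Total USD cost of a workload under four-track pricing."""
    total = (
        input_tokens * price.input
        + output_tokens * price.output
        + cache_hit_tokens * price.cache_hit
        + cache_write_tokens * price.cache_write
    )
    return total / 1_000_000


gateway_a = PricePer1M(input=3.0, output=10.0, cache_hit=0.3, cache_write=3.75)
gateway_b = PricePer1M(input=4.0, output=6.0, cache_hit=1.0, cache_write=5.0)

# Input-heavy RAG versus output-heavy code generation (invented volumes).
rag = dict(input_tokens=50_000_000, output_tokens=2_000_000,
           cache_hit_tokens=30_000_000, cache_write_tokens=5_000_000)
codegen = dict(input_tokens=2_000_000, output_tokens=20_000_000)

rag_costs = (workload_cost(gateway_a, **rag), workload_cost(gateway_b, **rag))
codegen_costs = (workload_cost(gateway_a, **codegen), workload_cost(gateway_b, **codegen))
print(rag_costs, codegen_costs)
```

With these numbers, gateway A is cheaper for the RAG workload and gateway B is cheaper for code generation, so neither can be called cheaper without knowing the workload.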

2. Relative to official pricing

Cheaper is not always cleaner.

GatewayBench compares platform pricing with the official vendor price, using a simple ratio: platform price divided by official price. But it avoids treating every gateway the same. A router that provides multi-model routing, failover, unified billing, or scheduling may have a reason to charge a service premium. A thin pass-through proxy has less room to justify one.

That is the role-based logic behind the benchmark. Aggregators that add routing, redundancy, and billing infrastructure may be allowed a modest markup, while pure proxy services are expected to stay much closer to official pricing. A price that sits above the benchmark without a clear explanation is penalized.

The same logic works in the other direction. A price far below official cost can be a warning sign, not just a discount. The gap has to come from somewhere: model downgrades, hidden charges, throttling, weaker resource pools, or billing rules that are hard to see upfront.

So the question is not simply which gateway is cheapest. It is whether the price matches the role the gateway claims to play.
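The role-based check can be sketched as a ratio compared against a band per role. The ratio definition comes from the text, but the band values below are placeholders chosen for illustration; the article does not publish the actual thresholds.

```python
# Hypothetical allowed ratio bands (platform price / official price) per role.
ALLOWED_RATIO = {
    "aggregator": (0.80, 1.15),  # assumed room for a modest service premium
    "proxy": (0.80, 1.05),       # assumed to stay close to official pricing
}


def price_ratio_verdict(platform_price: float, official_price: float, role: str):
    """Return the price ratio and a verdict relative to the role's assumed band."""
    ratio = platform_price / official_price
    low, high = ALLOWED_RATIO[role]
    if ratio > high:
        return ratio, "unexplained_premium"
    if ratio < low:
        return ratio, "suspiciously_cheap"
    return ratio, "within_band"


results = [
    price_ratio_verdict(3.3, 3.0, "aggregator"),
    price_ratio_verdict(3.3, 3.0, "proxy"),
    price_ratio_verdict(1.2, 3.0, "proxy"),
]
print([verdict for _, verdict in results])
```

The same 10% markup passes for an aggregator and fails for a pass-through proxy, which is the point of scoring by role.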

3. Hidden costs

Some of the most important costs never appear in the headline price.

Failed requests are one example. Under high concurrency, a gateway may return timeouts, broken streams, or 5xx errors. If those requests still get billed, instability turns into extra cost: the less reliable the service, the higher the effective unit price.

Account terms are another. Low per-token pricing can come with high top-up minimums, non-refundable balances, minimum spend rules, or expiring credits. The listed price may look attractive, while the money locked inside the account becomes harder to control.

Payment costs matter too. For cross-border buyers, a dollar price on the page may not be the final price in the ledger. FX spreads, card fees, payment-channel markups, and opaque settlement terms can all push the real cost higher.

GatewayBench’s L3 layer treats economics less like a pricing table and more like cost reconstruction. It breaks apart the headline price, checks the markup against the gateway’s actual role, and brings hidden costs into the same frame. The result is a more useful question for buyers: not which gateway looks cheapest, but which one gives them a cost structure they can explain, forecast, and defend.
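A rough sketch of that reconstruction: take what was actually paid, including payment overhead and balance that can no longer be used, and divide by the tokens that did useful work. The function, its parameters, and the example figures are assumptions; GatewayBench's actual True Cost formula is not given in the article.

```python
def true_cost_per_1m(billed_usd: float, useful_tokens: int,
                     fx_and_fee_rate: float = 0.0,
                     stranded_balance_usd: float = 0.0) -> float:
    """Effective USD per 1M useful tokens after payment overhead and stranded balance."""
    paid = billed_usd * (1 + fx_and_fee_rate) + stranded_balance_usd
    return paid / useful_tokens * 1_000_000


# Invented example: listed price of $2.00 per 1M tokens, 100M tokens billed.
listed_per_1m = 2.00
billed_tokens = 100_000_000
billed_usd = listed_per_1m * billed_tokens / 1_000_000  # $200 on the invoice

effective = true_cost_per_1m(
    billed_usd=billed_usd,
    useful_tokens=92_000_000,   # 8% of billed tokens came from failed requests
    fx_and_fee_rate=0.03,       # assumed FX spread plus card fees
    stranded_balance_usd=10.0,  # assumed expired, non-refundable credit
)
print(round(effective, 4))
```

Under these assumptions the effective price lands roughly 17% above the listed one, with no change to the headline rate.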

Shadow API is not going away because people argue about it. As long as access to major models remains uneven across regions, payment rails, risk controls, and compliance processes, third-party API gateways will keep serving real demand. The better question is how this market moves from black-box competition to transparent competition.

That is the starting point for GatewayBench. It accepts the demand, then tries to make the invisible parts of gateway behavior easier to check. Developers should be able to know which model actually handled a request, how the cost was calculated, and what risks came with the service they used.

A one-time pre-integration test is useful, but it is not enough to improve the market on its own. The longer-term goal is to turn each call into part of a track record. Model routing, billing records, cache hits, latency behavior, failed requests, and incident handling can all become signals that buyers and developers can look back on.

Gateways that deliver consistently, price clearly, and handle traffic honestly should earn more trust and better distribution. Providers that repeatedly show model drift, performance issues, or billing disputes should face a higher trust cost. That is not about punishing a category of vendors. It is about moving the market away from low-price noise and toward verifiable delivery.

GatewayBench, an AI model gateway audit and evaluation framework, has officially launched through Check4U.ai. The framework aims to help rebuild trust in the AI gateway market and is now open to API relay providers, gateway operators, aggregators, and model service providers worldwide, who can participate in evaluations and appear on the GatewayBench public leaderboard.

With a shared, rerunnable, and comparable audit framework, GatewayBench gives providers a way to show what they actually deliver: model authenticity, billing clarity, stable performance, and real cost advantages.

For developers and enterprise buyers, the value is a cleaner way to compare gateway services before sending production traffic through them. For strong providers, it is a way to turn technical integrity into market proof.

A healthier AI gateway market will need more than slogans and pricing tables. It needs delivery records that can be checked over time. GatewayBench aims to become part of that trust layer, helping honest providers earn more visibility, giving buyers better signals, and moving the market toward services that can be tested, compared, and held accountable.
