The Disappearing Window — AI Logprob Access in Procurement Contracts

When a language model writes a sentence, it writes it one word at a time. Before each word, it considers many. It scores them. It picks one. Then it moves on and does it again.

The numbers it computes during that consideration — how plausible was this word, how plausible was the runner-up, the one after that — are real numbers in real computer memory for a fraction of a second on every word of every response your company has ever received from a frontier model. They exist whether or not anyone asks for them. They are the residue of the model’s decision.

For most of the history of these APIs, you could ask for them. The top twenty, in order, with their scores. Not the model’s weights. Not its training data. Not its reasoning. Twenty small numbers per word — a thin numerical receipt for what just happened. Developers used it to build classifiers. Researchers used it to study models. Auditors could have used it to verify which model was actually on the other end of the wire.

Most never did.

Now the receipt is being withdrawn. Quietly. Across every frontier lab. Without announcement.

This brief is for the executive whose company is about to sign — or has already signed — a multimillion-dollar contract with a frontier AI vendor. The case it makes is short: the receipt costs the vendor nothing to provide, it costs you everything to lose, and once your contract is signed without it, you will not get it back.

What we call the receipt in this brief, an engineer would call logprobs. It is a parameter in the API request. When set to true, the response comes back with a numerical companion: at every word the model produced, a short list of the words it almost produced, ranked by how close it came. Each entry has a score. The score is the logarithm of the probability the model assigned to that word at that moment.
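As a concrete illustration of that score, the sketch below converts a set of log-probabilities at one word position back into plain probabilities. The tokens and values are invented for illustration and do not come from any real model; the only assumption about the API is that the scores are natural logarithms.

```python
import math

# Hypothetical top-5 alternatives at one word position.
# Illustrative values only, not output from any real model.
top_logprobs = {
    "Paris": -0.05,
    "the": -3.40,
    "Lyon": -4.90,
    "France": -5.60,
    "a": -6.20,
}


def to_probabilities(logprobs):
    """Convert natural-log probabilities to plain probabilities."""
    return {token: math.exp(lp) for token, lp in logprobs.items()}


probs = to_probabilities(top_logprobs)

# Rank candidates from most to least plausible.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# Share of the model's total probability covered by this short list.
covered_mass = sum(probs.values())
```

A log-probability near zero means the model was nearly certain; strongly negative values mark distant runners-up.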

A short paragraph from a model produces a few thousand such scores when twenty are returned per word. A long document, tens of thousands. They are not the model’s internal reasoning. They do not reveal how the model works. They do not let anyone reconstruct the model. What they do is something simpler and more specific:

They let you, the customer, prove that the model answering your requests today is the same model that was answering them last week.

Not by trusting the vendor’s word. Not by reading the endpoint URL. Not by checking the API key. By measuring the numbers directly and comparing them to the numbers from before. The mathematics of doing this — establishing model identity from these short numerical lists — is the subject of two of our research papers, linked at the end of this brief. The mathematics is not what is in question. What is in question is whether the numbers will continue to be available at all.
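The idea of measuring and comparing can be shown with a deliberately naive sketch. This is not the method in the research papers mentioned above; it is a simple comparison of the top-K lists recorded for the same prompt in two sessions. The function name, the thresholds, and the sample values are all hypothetical, and a real check would need calibrated thresholds and many prompts.

```python
def compare_position(baseline, current, tolerance=0.05, min_overlap=0.8):
    """Compare the top-K logprobs recorded at one token position in two sessions.

    baseline, current: dicts mapping token -> natural-log probability.
    tolerance and min_overlap are hypothetical thresholds, not calibrated values.
    """
    if not baseline or not current:
        return {"overlap": 0.0, "max_gap": None, "consistent": False}

    # Tokens that appear in both top-K lists.
    shared = set(baseline) & set(current)
    overlap = len(shared) / max(len(baseline), len(current))

    # Largest score difference among the shared tokens.
    max_gap = max((abs(baseline[t] - current[t]) for t in shared), default=None)

    consistent = (
        max_gap is not None
        and overlap >= min_overlap
        and max_gap <= tolerance
    )
    return {"overlap": overlap, "max_gap": max_gap, "consistent": consistent}


# Illustrative values only: the same prompt, measured last week and today.
last_week = {"Paris": -0.05, "the": -3.40, "Lyon": -4.90}
today_same = {"Paris": -0.06, "the": -3.38, "Lyon": -4.93}
today_swapped = {"Paris": -0.40, "France": -2.10, "the": -2.90}

same_result = compare_position(last_week, today_same)
swapped_result = compare_position(last_week, today_swapped)
```

In this toy data the first comparison stays within tolerance and the second does not, because the candidate list and the scores have both shifted.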

What has happened, by provider

xAI. From the xAI developer documentation, current as of May 2026: “logprobs and top_logprobs are not supported by models grok-4.20 and newer. These fields will be silently ignored if set.” Silently. No error. The request succeeds. The field comes back empty. Nothing in the response tells the developer the parameter was discarded.

Google. On March 14, 2026, Vertex AI began returning 400 INVALID_ARGUMENT for Gemini 3 Pro and Gemini 3.1 Pro requests with response_logprobs=True. The change was not announced. Developers identified it from production failures. The functionality had worked the day before.

OpenAI. Logprobs are not available for any reasoning model, beginning with o1-preview and including the entire o-series, GPT-5, GPT-5-mini, gpt-5-chat-latest, and continuing through the GPT-5.x line released across early 2026 — including GPT-5.5, OpenAI’s flagship model as of late April 2026. When the Responses API launched in early 2025, it shipped without logprob support entirely; the parameter was restored for non-reasoning models in June 2025 after sustained developer pressure. For reasoning models, it has not returned.

The withdrawal is happening live. On March 30, 2026, a developer reported on the OpenAI developer forum: “I don’t run into this issue with GPT-5.1, and until a few days ago never ran into this issue with GPT-5.2.” Independent reproduction in the same thread confirmed that GPT-5.2, GPT-5.3-codex, and GPT-5.4 return an HTTP 500 server error, with no useful body and no policy explanation, whenever top_logprobs is set to a value greater than one. The developer cannot tell whether they hit a removed feature, a quota, or a bug. The same developer, two weeks earlier, had documented the identical pattern on Google Vertex’s Gemini 3 Pro. Two vendors, two frontier models, two unannounced withdrawals, identified by the same person running into the same wall.

Anthropic. The Anthropic Messages API does not expose token-level logprobs. There is no parameter to request them. The field does not exist.

The pattern is consistent. The direction is one way. Models that used to expose logprobs lose access; new models ship without it; reasoning models — the class procurement officers most care about for high-stakes decisions — never have it.

The withdrawal is not universal, which is the part that matters. OpenAI’s GPT-4.1 family continues to return logprobs reliably. Google Vertex’s Gemini 2.5 Flash still supports them. xAI’s pre-4.20 Grok models still do. The capability is not technically infeasible. It works today, on production hardware, at scale, in the same data centers that serve the models where it has been removed. The labs have demonstrated, by continuing to serve it on yesterday’s models, that nothing about logprob exposure is a serving-cost problem or an engineering problem. It is a decision. The decision is being made one model release at a time, in the direction of less.

No frontier lab has issued a press release on this. There is no public policy stating that logprobs are being phased out. The change is happening through silently ignored fields, unannounced 400 and 500 errors, and feature pages that quietly drop a row from a comparison table.
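Because a silently ignored field still produces a successful response, the only way to notice is to inspect the response itself. A minimal sketch of such a check follows. It assumes an OpenAI-style chat completion payload already parsed into a dictionary; other providers use different shapes, and the function name and status labels are hypothetical.

```python
def logprob_status(response, requested_top_k):
    """Classify whether a response actually carried the requested logprobs.

    Assumes an OpenAI-style chat completion payload parsed into a dict:
    response["choices"][0]["logprobs"]["content"] is a list with one entry
    per generated token, each holding a "top_logprobs" list.
    Returns "honored", "partial", or "ignored".
    """
    choices = response.get("choices") or []
    if not choices:
        return "ignored"

    logprobs = choices[0].get("logprobs") or {}
    positions = logprobs.get("content") or []
    if not positions:
        # The request succeeded but the field came back empty or missing.
        return "ignored"

    counts = [len(p.get("top_logprobs") or []) for p in positions]
    if min(counts) >= requested_top_k:
        return "honored"
    return "partial"


# Illustrative payloads only.
with_logprobs = {
    "choices": [{
        "logprobs": {"content": [{
            "token": "Hi",
            "logprob": -0.1,
            "top_logprobs": [
                {"token": "Hi", "logprob": -0.1},
                {"token": "Hello", "logprob": -2.5},
            ],
        }]},
    }],
}
silently_dropped = {"choices": [{"message": {"content": "Hi"}, "logprobs": None}]}
```

Running a check like this on every response turns a silent discard into a logged event with a date attached.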

What logprobs are, in language a contract can use

A language model, before it chooses a word, computes a number for every word in its vocabulary. The number is roughly how plausible this word is, given everything that came before. Then it picks. Then it picks the next one. Then the next.

logprobs is the request: tell me the numbers you computed before you chose. The top twenty, in order. Their log-probabilities, that is, the natural logarithms of their probabilities. Twenty numbers per token. A short paragraph of about a hundred tokens generates about two thousand. A long document, tens of thousands.

This is not the model’s reasoning. It is not the model’s training data. It is not the model’s weights. It is a thin numerical exhaust, a few thousand decimal numbers for a typical response, that tells you nothing about what the model is and almost nothing about how it works. What it tells you is whether the model answering is the one you contracted for or some cheaper substitute.

That is the only thing it tells you. And that is the only thing your security stack cannot verify by any other means.

What top-20 logprob access enables — and what it does not

It enables:

Cross-session verification that a specific named endpoint is still serving the model it was serving when you signed the contract.
Drift detection when a provider silently swaps a checkpoint, rotates to a cheaper variant, or routes traffic to a different model behind the same URL.
Forensic detection of distillation provenance — measuring, from API output alone, whether a model was trained on another model’s outputs.
Confidence scoring for RAG systems and classifiers, the original use case, which has now been removed from the entire reasoning-model class.
Independent audit of model identity claims for regulators under the EU AI Act and NIST AI 600-1 — without disclosing weights, training data, or architecture.
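The confidence-scoring use in the list above can be sketched in a few lines. This assumes a classifier prompt where the model answers with a single label token, and that the top-K list at that position is available as a token-to-logprob mapping; the function name, labels, and values are illustrative.

```python
import math


def label_confidence(top_logprobs, labels):
    """Turn the logprobs at the answer position into per-label confidence.

    top_logprobs: dict mapping token -> natural-log probability.
    labels: the label tokens the classifier is allowed to answer with.
    Labels missing from the top-K list are left out of the result.
    """
    masses = {
        label: math.exp(top_logprobs[label])
        for label in labels
        if label in top_logprobs
    }
    total = sum(masses.values())
    if total == 0:
        return {}
    # Renormalize so the confidences over the allowed labels sum to 1.
    return {label: mass / total for label, mass in masses.items()}


# Illustrative values only.
answer_position = {"yes": -0.2, "no": -1.8, "maybe": -4.0}
confidence = label_confidence(answer_position, ["yes", "no"])
```

Without logprob access, the same classifier returns only the word "yes", with no indication of how close the call was.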

It does not enable:

Model extraction. Twenty log-probabilities per token is far below the information-theoretic threshold to reconstruct weights.
Capability stealing. The numbers tell you which model is running, not how to build it.
Prompt extraction or training data leakage. Logprobs are computed from your input; they cannot reveal someone else’s.

There is no public-interest argument for removing this access. There is no safety case. There is no security case. The asymmetry is the case: a model whose outputs cannot be measured cannot be audited, and a provider whose model cannot be audited cannot be held accountable to a contract.

What the labs say, and what they do not say

No frontier lab has publicly stated why logprobs are disappearing.

OpenAI has not commented on the absence for reasoning models. xAI’s documentation states the limitation as a fact, without explanation, in a single line of small print. Google has not announced the Gemini 3 change at all. Anthropic has never offered logprobs and has never explained the omission.

Plausible technical reasons exist: speculative decoding produces less well-defined logprobs at intermediate tokens; reasoning models use multi-stage inference where “the” logprob is ambiguous; quantization for serving efficiency truncates the distribution at the source. These are real engineering considerations. They are not reasons to remove the parameter. They are reasons to document its semantics carefully, return what is available, and flag when coverage is incomplete.

Removing the parameter entirely is a different decision. It eliminates the only enterprise-side mechanism for verifying which model is running. It happens to also eliminate the only mechanism by which an external auditor could detect that a model has been swapped, silently distilled from a competitor, or routed to a cheaper variant behind a stable endpoint.

We will not speculate on motive. We will note that the effect is identical regardless of motive: the enterprise loses the ability to verify what it is paying for.

What this means for the procurement officer

The terms of your contract specify a model. The endpoint URL points at a name. The API key authenticates your access. The OAuth flow authenticates your agent. The SPIFFE identity authenticates your workload. The TLS certificate authenticates the connection. The model itself is authenticated by none of these.

If the provider rotates the model behind the endpoint, every credential in the stack remains valid. The endpoint URL does not change. The API key does not need to be reissued. The audit log records every request as successful. The only signal that anything has happened is the model’s output behavior — and behavioral testing cannot distinguish a planned silent upgrade from a degradation from a substitution from a distilled copy. The Cursor / Moonshot incident in March 2026 was discovered by a developer who intercepted an outbound HTTP request by hand. That is the current state of the art for detecting silent model substitution at a frontier vendor: the customer found out by accident.

The removal of logprob access is happening the same way. xAI’s documentation states the field will be silently ignored — the request succeeds, the response returns, the parameter is discarded, and no error reaches the developer. Google’s Gemini 3 Pro disablement was identified through production failures the day it took effect. OpenAI’s exclusion of reasoning models from logprob support was never announced as a policy; it was inferred by developers from the absence of the parameter in the API reference. The same operational silence that hides model substitution also hides the removal of the tools that would have detected it.

You can solve this in one of two ways.

You can require the vendor to disclose every change. This is what most enterprise contracts attempt today. It does not work. The vendor will disclose what they choose to disclose; you have no way to verify the disclosure; the gap between what changed and what was reported is exactly the gap that occurred in Cursor / Moonshot, in the Anthropic distillation disclosure of February 2026, and in every published incident in the model-substitution category.

Or you can require the vendor to expose enough numerical telemetry that you can verify model identity yourself.

That is the top-20 logprob field. It is twenty numbers per token. It is not a security risk to the vendor. It is the difference between a contract that you can enforce and a contract that depends on the vendor’s honor.

What to put in the contract

Below is the substantive content of a logprob access clause, in plain language. It is not legal advice. It is the operational specification your counsel will need to translate.

1. Access. The vendor shall, for every model offered under this agreement, expose top-K log-probabilities at API output, where K ≥ 20 or the full computed distribution if smaller, with no additional fee, rate limit, or access tier beyond what applies to the underlying inference call.
2. Semantic stability. The numerical interpretation of the returned values (natural log of post-softmax probability, computed at sampling temperature) shall not change without thirty (30) days’ written notice and a documented migration path. Provider quantization, if applied, shall be disclosed in API documentation with bit precision and rounding method.
3. Coverage disclosure. For models employing speculative decoding, multi-stage inference, or other pipelines where logprobs may be unavailable at some token positions, the response shall flag which positions are affected and why. Silent omission is non-compliance.
4. Non-degradation. The vendor shall not silently remove logprob support for any model class (reasoning, chat, completion, or otherwise) after the model has been made available under this agreement. New models added during the contract term are subject to clauses 1–3.
5. Audit access. The customer’s authorized verification provider (which may be the customer, an internal audit function, or a third party) shall have the same logprob access as the customer’s production traffic, at the same endpoints, for the purpose of model identity verification. The vendor shall not condition this access on disclosure of the verification methodology.
6. Reporting. If logprob behavior changes for any reason (quantization update, model rotation, infrastructure migration), the vendor shall notify the customer within seven (7) days, including the date of change, the affected endpoints, and any user-visible numerical differences.
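The coverage disclosure clause is the one most easily checked by machine. The sketch below shows one way a customer might test for silent omission, assuming a hypothetical response format in which each token position carries either a "top_logprobs" list or a "coverage" object with a "reason". No vendor is known to use these field names; they exist only to illustrate what the clause asks for.

```python
def silent_omissions(positions):
    """Return indices of token positions with neither logprobs nor a coverage flag.

    positions: list of dicts, one per generated token. The "top_logprobs"
    and "coverage" field names are hypothetical.
    """
    missing = []
    for index, position in enumerate(positions):
        has_logprobs = bool(position.get("top_logprobs"))
        has_flag = bool((position.get("coverage") or {}).get("reason"))
        if not has_logprobs and not has_flag:
            missing.append(index)
    return missing


# Illustrative response: position 1 is flagged, position 2 is silently empty.
example_positions = [
    {"token": "The", "top_logprobs": [{"token": "The", "logprob": -0.3}]},
    {"token": " answer", "top_logprobs": [],
     "coverage": {"reason": "speculative_decoding"}},
    {"token": " is", "top_logprobs": []},
]
omitted = silent_omissions(example_positions)
```

A non-empty result is the condition the clause calls non-compliance: a position where the numbers are missing and nothing says why.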

These six clauses require nothing of the vendor that is not already implemented. They cost the customer everything when omitted.

On Fall Risk’s position

Fall Risk AI’s measurement methodology — the PPP-residualized template framework described in our endpoint verification work — operates on top-K logprob output. With access, we can verify endpoint continuity at K=7 with measurable confidence and at K=20 with substantial margin. Without access, we cannot.

This is not a sales argument. The vendors whose endpoints we verify today — OpenAI’s non-reasoning models, Google Vertex’s non-reasoning Gemini models, xAI’s pre-4.20 Grok variants — continue to expose logprobs. Our work continues. Our customers’ contracts continue. Our verification protocol continues to return zero identity errors across 14 API-served models from three providers.

What we are losing is the ability to extend that verification to the next generation of models. GPT-5 and Claude 4.7 and Gemini 3 Pro and Grok 4.20 are the models procurement officers are signing contracts for in 2026. They are the models running in production at the enterprises this brief is addressed to. And they are the models for which independent identity verification is now structurally unavailable through standard API access.

Fall Risk’s framework works fine without these particular endpoints. The science of the measurement does not depend on which models we apply it to. But the enterprise governance argument, the case for cryptographically verifiable model identity as a procurement requirement, depends on logprob access remaining available. When it disappears at the frontier, the gap between “this is the model we contracted for” and “this is the model that is computing” becomes formally unverifiable.

We are writing this brief now, while the option to contractually preserve that access still exists.

Bottom line

Twenty small numbers per word. That is the entire ask. No weights, no architecture, no training data, no reasoning traces, no internal state. Twenty numbers that every reasoning model already computes internally on every forward pass, and discards before the response leaves the building.

The frontier labs have decided that you do not need them. They have not asked you. They have not announced it. They have removed the field, returned the error, or silently discarded the parameter. The largest enterprises buying their products have not noticed, because the products still work and the credentials still validate and the audit logs still close clean.

The point is this: when a multi-billion-dollar enterprise signs a frontier-model contract, that contract is the leverage. The vendor will not voluntarily restore what it has removed. It will respond to procurement that requires it.

You do not need our methodology to know this is the right ask. You need to ask for it. We are providing the language above so that asking is easier than not asking.

If your organization has a frontier-model contract under negotiation today, the clauses above belong in the next redline. If your organization has a frontier-model contract already in force, the next renewal is the only opportunity to add them without paying for the privilege. There will not be a third opportunity. By the time the absence becomes a board-level question, every model worth verifying will already be in production, and the contract that would have required the access will have been signed without it.

The procurement frame is the mechanism. The stake is larger. A senior technology executive — thirty-two years at one of the largest hardware companies in the world — was recently briefed on this issue. He asked a single question, and then asked it again, slower: “Wait, I could be paying a dollar and getting twenty-five cents?” The answer is yes. Recent published research, including our own, documents that the model behind a frontier API endpoint can be silently rotated to a cheaper variant — same URL, same API key, same audit log, different model — and that the customer has no mechanism by which to detect the rotation other than the numerical receipt this brief is about. When the receipt is withdrawn, the substitution becomes structurally invisible. Not hard to detect. Not expensive to detect. Invisible. The enterprise contract is the only place where the right to verify that substitution still exists in writing. Once it is signed without the requirement, the question of whether the dollar bought a dollar’s worth of model becomes, structurally, no one’s right to verify.

The receipt is being withdrawn. The question is whether enterprise procurement lets it go quietly, in exchange for nothing — or requires its return while the contract is still being written.