AI reach-scans
A reasoning model can reach a state where the right answer is still true, but no longer easy to continue toward from the path already written. ReachScan freezes that committed prefix, samples fresh continuations, and measures the answer futures that remain viable.
The published RC10 paper and RC3 evidence archive carry the full measurement cycle: morphology, source-conditioned diagnosticity, prospective replication, token-level localization, and controlled reopening by corrective splice.
The current paper is a measurement cycle, not a single wrong-answer example.
Case A is the structural prelude: the single-case morphology the cycle starts from. The published RC10 paper builds it into a full cycle by adding source-conditioned diagnosticity, prospective same-family replication, token-coordinate localization, and a controlled reopening test.
Public stack: paper, archive, and ReachScan v0.3.5
The reusable measurement core is ReachabilityLabs/ReachScan, archived at 10.5281/zenodo.20837723. The core paper is deposited at 10.5281/zenodo.20872623, and the evidence archive is deposited at 10.5281/zenodo.20872305.
Case A: structural morphology
A prompt-root and committed-trace scan shows upstream bias toward a target-excluding arithmetic family. Target-compatible mass drains before a particular wrong atom dominates.
Case B: source-conditioned diagnosticity
Correct-source and wrong-source prefixes overlap at mid-depth, then separate late. On D9, the deepest-cut correct-minus-wrong reach gap is +0.732, with disjoint-seed reconfirmation.
D7 prospective replication
The same-family replication is settled in the current paper: late-band separation is +0.802 with a 95% trace-bootstrap interval of [+0.695, +0.894].
Token localization
Eighteen wrong-source traces are localized in token coordinates. The median foreclosure bracket is 4.5 tokens, and the wrong answer is emitted a median 170 tokens after target reach dies.
Reach-entropy morphology
Low target reach is not one state. Most localized wrong traces shatter into diffuse wrong mass; a smaller number capture onto a concentrated wrong atom.
Controlled reopening
In five localized traces, a corrective splice reopens exact-target reach in four. Matched and scrambled controls generally do not, and replaying the original downstream path re-closes all five.
Possible is not reachable.
“All the possibles are not compatible together in one and the same world-sequence.”
Leibniz, Theodicy §201
Leibniz drew a distinction we have mostly lost: a thing can be possible on its own yet not compossible, not able to coexist with what has already been settled. Each possibility can make sense in isolation; the question is which ones can still hold together from where a process now stands.
This work moves that distinction out of metaphysics and into a running computation. When a model reasons toward an answer, the correct answer never stops existing; it remains a valid solution throughout. What changes is whether it remains reachable from the path the model has committed to.
Existence holds fixed. Reachability is what collapses. The contribution is to make that collapse observable: locate the point where an answer falls out of reach, watch reasoning continue after the loss, and test whether a corrected commitment can bring the answer back into reach.
What is borrowed, and what is new
Leibniz gives an old name to the relation. ReachScan gives it an instrument: finite rollouts, declared projections, receipts, localization, intervention, and a bounded claim.
Endpoint evaluation collapses a richer object.
A reasoning trace is not only an answer-producing sequence. It is a path of commitments. Each emitted token fixes part of the future: some continuations remain easy, some become narrow, some become budget-censored, and some become systematically wrong.
Standard evaluation usually sees only the endpoint. Process supervision sees steps. Self-consistency aggregates over endpoints. None of these, by itself, measures the distribution of futures that the committed state leaves open.
The reasoning lane measures that distribution. Given a committed prefix, a continuation policy, and a resource budget, we repeatedly continue from the prefix and examine where the continuations land. Correct-answer mass is one coordinate of that distribution. It is not the whole object.
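The loop described above can be sketched in a few lines. This is a minimal illustration, not the ReachScan API: `sample_future_field`, `target_reach`, and `toy_continue` are hypothetical names, and the toy sampler stands in for a real model call under a declared policy and budget.

```python
import random
from collections import Counter

def sample_future_field(prefix, continue_fn, n_samples, budget, seed=0):
    """Continue n_samples times from one frozen prefix and tally the outcomes."""
    rng = random.Random(seed)
    field = Counter()
    for _ in range(n_samples):
        # continue_fn returns a parsed answer, or None when the budget runs out
        field[continue_fn(prefix, budget, rng)] += 1
    return field

def target_reach(field, target):
    """Fraction of sampled futures that land exactly on the target."""
    total = sum(field.values())
    return field[target] / total if total else 0.0

def toy_continue(prefix, budget, rng):
    """Hypothetical stand-in for a model call; it ignores the budget."""
    if "wrong step" in prefix:
        return rng.choice([56, 56, 56, 112])
    return rng.choice([532, 532, 532, 56])

open_field = sample_future_field("prompt", toy_continue, n_samples=200, budget=64)
closed_field = sample_future_field("prompt + wrong step", toy_continue, n_samples=200, budget=64)

print(target_reach(open_field, 532), target_reach(closed_field, 532))
print(closed_field.most_common())  # the whole field, not only the target coordinate
```

The second print is the point of the sketch: target reach is one number read off the field, while the field itself still records where the non-target mass went.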
The sentence the framework makes pronounceable
Self-consistency calls a state uncertain when the answers vary. This measurement calls it family-locked when the answers vary inside the same wrong family.
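A small sketch of that contrast, under stated assumptions: the answer list is invented for illustration, the family rule is residue mod 8 as in the case study, and the 0.9 lock threshold is a hypothetical choice, not a value from the paper.

```python
import math
from collections import Counter

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of a Counter of outcomes."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def describe_state(answers, target, family_of, lock_threshold=0.9):
    """Compare answer-level spread with family-level concentration."""
    atoms = Counter(answers)
    families = Counter(family_of(a) for a in answers)
    top_family, top_count = families.most_common(1)[0]
    top_mass = top_count / len(answers)
    return {
        "answer_entropy_bits": shannon_entropy_bits(atoms),
        "top_family": top_family,
        "top_family_mass": top_mass,
        # locked: mass concentrates in one family, and that family excludes the target
        "family_locked": top_mass >= lock_threshold and top_family != family_of(target),
    }

# Illustrative answers: varied at the atom level, all multiples of 8
answers = [56, 112, 168, 224, 56, 112, 280, 336]
state = describe_state(answers, target=532, family_of=lambda a: a % 8)
print(state)
```

An endpoint-variance reading sees 2.5 bits of answer entropy and reports uncertainty. The family-level reading sees all mass in residue 0 while the target sits in residue 4.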
Committed prefix, future field, diagnostic projection.
Three primitives, each of which has a plain-language version. The portable part is the measurement grammar: committed state, sampled futures, declared lens, bounded claim. The task-specific part is the projection lens.
Committed prefix
The state a process has already produced. In a reasoning system this is the token sequence written so far. In a solver it is a partial assignment. In a scheduler it is a partial schedule. The surface is different; the conceptual object is the same: an irreversible commitment.
Future field
The distribution of outcomes the system can still produce from that exact committed state under a declared policy and budget. We measure it by sampling many continuations and reading where they land. The field has structure beyond a single most likely answer.
Diagnostic projection
A way of looking at the future field through one labeling rule at a time. Are most futures multiples of 8? Are they all in the same operation family? Do they share a residue the target lacks? Each projection is a different question about the same field.
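As a rough sketch, a projection is just a labeling function pushed over the field. The names `project` and `fiber_mass` and the sample counts below are illustrative assumptions, not part of the published instrument.

```python
from collections import Counter

def project(field, label):
    """Push a future field (answer -> count) through one labeling rule."""
    fibers = Counter()
    for answer, count in field.items():
        fibers[label(answer)] += count
    return fibers

def fiber_mass(field, label, target):
    """Share of the field that carries the same label as the target."""
    fibers = project(field, label)
    return fibers[label(target)] / sum(fibers.values())

# Illustrative field: ten sampled answers
field = Counter({56: 5, 112: 3, 532: 1, 540: 1})

mod8 = lambda a: a % 8    # residue lens
parity = lambda a: a % 2  # a coarser lens over the same field

print(project(field, mod8))             # Counter({0: 8, 4: 2})
print(fiber_mass(field, mod8, 532))     # 0.2
print(fiber_mass(field, parity, 532))   # 1.0
```

The two lenses read the same samples differently. Parity cannot separate the wrong answers from the target, while residue mod 8 shows that most of the mass sits in a bucket the target is not in.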
The intuitive picture: tunnels and buckets.
Imagine the system writing a reasoning trace as opening a tunnel into the space of possible future answers. Once the tunnel is open, all subsequent continuations have to come out somewhere downstream of where the system is now standing. The tunnel can stay wide or narrow sharply. It can point at the right region or at a wrong one.
A diagnostic projection sorts the answers into buckets by some property — like sorting laundry by color. If the target is a blue shirt and the system keeps producing answers from the white-clothes pile, it isn't merely wrong. It's searching in a bucket that doesn't contain the target.
The reasoning lane measures both the shape of the tunnel and which bucket its outputs are landing in.
Why this is different from existing approaches
Self-consistency picks the majority answer. Process supervision rewards good steps. Tree-of-thought searches branches. Each treats the trace as a means to an answer. The reasoning lane treats the trace as a state and asks what futures remain attached to it.
What travels, and what does not.
The grammar travels across substrates: freeze a committed state, sample the futures still opened by that state, read them through a declared projection, and keep the claim tied to the contract. Modulo 8 does not travel automatically. Exact-answer parsing does not travel automatically. Every new task needs its own lens.
Supporting figures stay available as deeper dives, not as the entry point.
The central page keeps the argument in prose: committed state, sampled futures, declared lens, bounded claim. Readers who want the visual aids can open them separately without making the page itself depend on heavyweight embeds.
Upstream wrongness and staged tightening on an arithmetic reasoning trace.
Case A is the target-532 floor-sum problem. It shows why the full answer field matters: the prompt-root field is already biased toward a target-excluding arithmetic family, commitment depletes target-compatible mass, and the answer field later resolves onto a dominant wrong atom. The interactive views remain available as supporting figures; the page-level frame here carries the full RC10 measurement cycle.
What Case A shows.
At the prompt-only state, 169 of 245 valid numeric continuations lie in the target-excluding family 8ℤ, 45 lie in the target residue class, and one lands exactly on 532. The field is already arithmetically organized against the target before any generated reasoning token has been retained.
Commitment sharpens that structure. Wrong-family mass reaches 0.984 at 18.9% of trace depth and 1.000 at 27%, while target-fiber mass falls to 0.016 and then to zero. Yet the winning wrong atom is not resolved at the same time: 112 dominates the early and middle parts of the sweep, and 56 dominates only by 50%.
At 50%, the terminal answer field is nearly a point mass on 56, but surface reasoning remains diverse. The computation has largely converged while prose still varies. Endpoint accuracy and visible reasoning text miss that state-indexed reachability fact.
Case A is therefore a morphology result: target reach drains, wrong-family support closes, and atom-level consolidation follows later. Case B then asks whether correct-source and wrong-source prefixes separate under the same reach-scan logic.
The Case A receipts
169 of 245 prompt-root numeric continuations lie in 8ℤ; 45 of 245 sit in the target residue class; 1 of the 256 prompt-root samples lands exactly on 532.
Wrong-family mass reaches 0.984 at 18.9% and 1.000 at 27%; target-fiber mass falls to 0.016 and then zero.
251 of 256 answers equal 56 at 50%, while trajectory signatures remain mostly unique. Answer concentration outruns surface-text collapse.
Zero correct or target-compatible continuations appear across 1,280 samples from 50% through 99.9% under the declared instrument.
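The receipt counts convert to proportions directly. The last step below is an addition for illustration, not a number from the paper: it asks what per-sample success rate would still be consistent with zero hits in 1,280 draws, under the simplifying assumption that the draws are independent and share one rate. Samples pooled across several depths do not strictly satisfy that assumption.

```python
# Counts quoted in the receipts above
receipts = {
    "8Z at prompt root": (169, 245),
    "target residue class at prompt root": (45, 245),
    "answers equal to 56 at 50%": (251, 256),
}
for name, (k, n) in receipts.items():
    print(f"{name}: {k}/{n} = {k / n:.3f}")

def zero_hit_upper_bound(n, confidence=0.95):
    """Largest success rate p with (1 - p)**n >= 1 - confidence, given 0 hits in n draws."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

bound = zero_hit_upper_bound(1280)
print(f"0 of 1280 -> one-sided 95% upper bound on reach of about {bound:.4f}")
```

Zero observed hits bounds the reach below roughly a quarter of a percent under those assumptions. It does not show that the reach is exactly zero, which is why the receipt is stated under the declared instrument.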
The same conceptual object, on a substrate without an oracle.
The SAT calibration measures a partial assignment and asks whether satisfying completions remain reachable from it. The reasoning lane measures a token prefix and asks where the model's continuations land. The objects are different in surface and identical in structure: a committed history, and the futures it leaves open.
The substrates differ in one crucial way. SAT has an exact bridge oracle: a solver can certify that no satisfying completion remains from a given prefix. Reasoning systems do not. A model might recover under a larger budget, a different sampler, a different prompt, or a verifier we have not yet attached. The framework adapts by weakening the claim it makes, not by borrowing certainty it does not have.
What the SAT side calls certified non-extendability, the reasoning side calls persistence under declared instrument. Same conceptual program, different epistemic status. The vocabulary tracks the distinction precisely so that a careful reader can never confuse the two.
The cross-substrate sentence
A committed process can enter a region of future space that no longer intersects, or no longer substantially reaches, the target set. In SAT, the reachable future from the prefix collapses. In reasoning, the field's mass leaves the target fiber. The substrate differs; the structural pattern is shared.
In the SAT work, this appears as a walking-dead interval: a path has already lost its satisfying future before failure becomes visible. In the AI work, the analogous quantity is the commitment-expression interval: the target answer falls out of reach before the model writes the answer.
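On the SAT side the interval can be computed exactly for a toy instance. The sketch below is a brute-force illustration, not the calibration code: the four-variable formula and decision path are invented, the oracle enumerates completions where a real solver would be used, and "visible" is simplified to mean that some clause is fully assigned and false (no unit propagation).

```python
from itertools import product

def falsified(clauses, assignment):
    """True if some clause has every literal assigned and false."""
    return any(
        all(abs(l) in assignment and assignment[abs(l)] != (l > 0) for l in clause)
        for clause in clauses
    )

def extendable(clauses, assignment, n_vars):
    """Exact oracle by enumeration: does any completion satisfy every clause?"""
    free = [v for v in range(1, n_vars + 1) if v not in assignment]
    for values in product([False, True], repeat=len(free)):
        full = {**assignment, **dict(zip(free, values))}
        if all(any(full[abs(l)] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False

def walking_dead_interval(clauses, path, n_vars):
    """Return (depth where the prefix dies, depth where failure becomes visible)."""
    assignment, dead_at, visible_at = {}, None, None
    for depth, (var, value) in enumerate(path, start=1):
        assignment[var] = value
        if dead_at is None and not extendable(clauses, assignment, n_vars):
            dead_at = depth
        if visible_at is None and falsified(clauses, assignment):
            visible_at = depth
    return dead_at, visible_at

# Setting x1 = True forces x2 both True and False; x1 = False satisfies everything.
clauses = [[-1, 2, 3], [-1, 2, -3], [-1, -2, 4], [-1, -2, -4]]
path = [(1, True), (2, True), (3, True), (4, True)]
print(walking_dead_interval(clauses, path, n_vars=4))  # (1, 4)
```

The prefix is dead after the first decision, but no clause is violated until the fourth. The reasoning-side analogue replaces the exact oracle with sampled reach, which is why its label is weaker.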
Black-box measurement is the point, not a compromise.
White-box interpretability asks what is happening inside the model. ReachScan asks a different question: from this committed state, under this policy and budget, what outcomes remain reachable? That question still works when the system is closed, tool-using, multi-agent, or too complicated to make internals the first object.
The two approaches can cooperate. A reach-scan can show where to look: the transition window, the contrast between states, the field before and after intervention, and the downstream path that closes the target again.
The careful sentence
ReachScan is not a probe of model internals. It is a reference measurement of prefix-conditioned outcome viability under a declared continuation instrument.
What the published paper measures.
The work is a bounded evidence chain, not a single scan: Case A supplies distributional structure; Case B supplies source-conditioned diagnosticity; localization and reopening make the comparison temporally and functionally meaningful.
Source-conditioned diagnosticity · measured
On D9, correct-source and wrong-source fields are similar at 50%, then separate late: the correct-minus-wrong gap reaches +0.732 at 94%. A disjoint-seed reconfirmation reproduces the late-band separation.
Prospective D7 replication · measured
The same-family D7 replication is prospectively selected and build-pinned. Its late-band trace-bootstrap estimate is +0.802, with a 95% interval of [+0.695, +0.894] and an adversarial lower bound above +0.74.
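For readers unfamiliar with the estimator, a trace-level percentile bootstrap can be sketched as follows. This is a generic version under stated assumptions, not the paper's build-pinned procedure: the per-trace reach values are synthetic, and the function name and defaults are hypothetical.

```python
import random

def trace_bootstrap_gap(correct_reach, wrong_reach, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for mean(correct) - mean(wrong), resampling whole traces."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        c = [rng.choice(correct_reach) for _ in correct_reach]
        w = [rng.choice(wrong_reach) for _ in wrong_reach]
        gaps.append(sum(c) / len(c) - sum(w) / len(w))
    gaps.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    point = sum(correct_reach) / len(correct_reach) - sum(wrong_reach) / len(wrong_reach)
    return point, (gaps[lo_idx], gaps[hi_idx])

# Synthetic late-band target reach per trace, for illustration only
correct_reach = [0.9, 0.6, 1.0, 0.7, 0.8, 0.5]
wrong_reach = [0.0, 0.1, 0.3, 0.0, 0.2, 0.0]

point, (lo, hi) = trace_bootstrap_gap(correct_reach, wrong_reach)
print(f"gap = {point:+.3f}, 95% interval = [{lo:+.3f}, {hi:+.3f}]")
```

Resampling traces, not individual continuations, keeps the interval honest about the unit that was actually sampled independently.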
Foreclosure before expression · measured
Eighteen D7 wrong-source traces are localized. The median onset bracket is 4.5 tokens, and the wrong answer is emitted a median 170 tokens after target reach crosses below the dead threshold.
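One way to picture localization is a bisection over token positions. The sketch below is an assumption-laden illustration, not the paper's localization procedure: it assumes a single crossing of the dead threshold, treats `reach_at` as a noise-free measurement, and uses an invented trace with a hypothetical threshold of 0.05.

```python
def localize_foreclosure(reach_at, n_tokens, dead_threshold, max_width=1):
    """Bisect for a bracket (last open token, first dead token)."""
    lo, hi = 0, n_tokens
    # The bracket only exists if reach is open at the start and dead at the end
    if reach_at(lo) < dead_threshold or reach_at(hi) >= dead_threshold:
        return None
    while hi - lo > max_width:
        mid = (lo + hi) // 2
        if reach_at(mid) >= dead_threshold:
            lo = mid   # still open at mid: the crossing is later
        else:
            hi = mid   # already dead at mid: the crossing is earlier
    return lo, hi

# Toy trace: target reach is 0.6 until token 137, then 0.0
toy_reach = lambda t: 0.6 if t < 137 else 0.0
answer_token = 310  # position where the toy trace writes its wrong answer

bracket = localize_foreclosure(toy_reach, n_tokens=400, dead_threshold=0.05)
lag = answer_token - bracket[1]
print(bracket, lag)  # (136, 137) 173
```

In practice each `reach_at` call is a batch of sampled continuations, so a bracket is typically left several tokens wide, not narrowed to one. The lag between the bracket and the emitted answer is the commitment-expression interval in token coordinates.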
Controlled reopening · proof of principle
In five localized traces, a corrective splice reopens exact-target reach in four. Matched-site and digit-scramble controls generally do not. Under fresh continuation, the reopened reach persists in four traces through 32 tokens, and replaying the original downstream path re-closes all five.
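The design of the reopening test can be laid out as a small condition table. Everything in this sketch is hypothetical: the token strings, the toy model whose answer depends only on its most recent commitment token, and the helper names. It shows the shape of the comparison, not the paper's traces or results.

```python
import random

COMMITMENTS = {"GOOD", "BAD", "ALT"}

def toy_model(tokens, rng):
    """Hypothetical model: the last commitment token decides the answer family."""
    last = [t for t in tokens if t in COMMITMENTS][-1]
    if last == "GOOD":
        return 532 if rng.random() < 0.8 else 56
    return 56

def reach(tokens, target, n=100, seed=0):
    """Fraction of n sampled continuations that land on the target."""
    rng = random.Random(seed)
    return sum(toy_model(tokens, rng) == target for _ in range(n)) / n

def splice(tokens, site, replacement):
    """Freeze the prefix before the site and put a replacement token at the site."""
    return tokens[:site] + [replacement]

original = ["setup", "BAD", "derive", "BAD", "conclude"]
site = 1  # localized commitment site in the toy trace

conditions = {
    "original": original[: site + 1],
    "corrective": splice(original, site, "GOOD"),
    "matched_control": splice(original, site, "ALT"),
    # replay: corrected site, then the original downstream path appended again
    "replay": splice(original, site, "GOOD") + original[site + 1 :],
}
results = {name: reach(tokens, 532) for name, tokens in conditions.items()}
print(results)
```

The controls matter as much as the corrective arm: a reopening that also appears under a matched or scrambled edit would point to disruption, not correction, and the replay arm checks whether the downstream path carries its own closing commitment.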
What the measurements close, and what stays open.
The paper closes the gap between a one-case morphology and a diagnostic measurement cycle. It does not close portability. Cross-model behavior, less structured task families, deployable monitoring, and internal mechanism identification remain explicit boundary questions.
Claim boundary.
Every claim on this page is indexed to a declared model, prefix, sampler, budget, and extraction rule. The indexing is part of the contribution.
State-indexed answer foreclosure under declared contracts
The paper supports finite-budget, model/sampler/budget-indexed claims about prefix-conditioned future fields. A prompt or correct-source prefix can leave the target reachable; a wrong-committed prefix can deplete it; a corrected prefix can reopen it.
Portability, mechanism, and deployable monitoring
The work does not establish cross-model portability, cross-task generality, universal diagnosticity, an internal mechanism account, a deployable monitor, a general causal-intervention law, or indefinite repaired-path stability.
Foreclosure, not mathematical impossibility
The reasoning substrate has no exact oracle analogous to the SAT bridge check. A reasoning-side dead or foreclosed label is an operational claim under a declared model, prefix, sampler, budget, extraction rule, validity gate, and target set. It is not a proof that recovery is impossible.
The vocabulary.
Each term refers to a measurable object. The arithmetic is simple. The contribution is in choosing what to measure.
Committed prefix
The state already produced. In a reasoning trace, the token sequence written so far. The conceptual primitive is irreversible commitment, not the surface form.
Future field
The empirical distribution of outcomes reachable from a committed prefix under a declared policy and budget. The central observable.
Fiber
A bucket of outcomes sharing a hidden label under some projection — for example, all answers with the same residue mod 8. A wrong fiber is one that excludes the target.
Scaffold
The reasoning pattern that produces a particular fiber of answers. In the case study, the scaffold is "eight repeated blocks times a base-block value," which produces multiples of 8.
Residue
The remainder label that determines fiber membership. The target 532 has residue 4 mod 8; the wrong family has residue 0. The framework asks not just whether answers are wrong but whether they share the wrong residue.
Target-excluding fiber
The deepest version of the case-study finding. The system's continuations concentrate in a fiber whose label is incompatible with the target. The system isn't just searching badly. It's searching in the wrong bucket.