Reachability Labs: measuring reachable future under commitment
Published AI reasoning lane

AI reach-scans

A reasoning model can reach a state where the right answer is still true, but no longer easy to continue toward from the path already written. ReachScan freezes that committed prefix, samples fresh continuations, and measures the answer futures that remain viable.

The published RC10 paper and RC3 evidence archive carry the full measurement cycle: morphology, source-conditioned diagnosticity, prospective replication, token-level localization, and controlled reopening by corrective splice.

Measure, discriminate, localize, perturb, remeasure.

The current paper is a measurement cycle, not a single wrong-answer example.

Case A is the structural prelude: the single-case morphology the cycle starts from. The published RC10 paper extends it into a full cycle with source-conditioned diagnosticity, prospective same-family replication, token-coordinate localization, and a controlled reopening test.

Public stack: paper, archive, and ReachScan v0.3.5

The reusable measurement core is ReachabilityLabs/ReachScan, archived at 10.5281/zenodo.20837723. The core paper is deposited at 10.5281/zenodo.20872623, and the evidence archive is deposited at 10.5281/zenodo.20872305.

Possible is not reachable.

“All the possibles are not compatible together in one and the same world-sequence.”

Leibniz, Theodicy §201

Leibniz drew a distinction we have mostly lost: a thing can be possible on its own yet not compossible, not able to coexist with what has already been settled. Each possibility can make sense in isolation; the question is which ones can still hold together from where a process now stands.

This work moves that distinction out of metaphysics and into a running computation. When a model reasons toward an answer, the correct answer never stops existing; it remains a valid solution throughout. What changes is whether it remains reachable from the path the model has committed to.

Existence holds fixed. Reachability is what collapses. The contribution is to make that collapse observable: locate the point where an answer falls out of reach, watch reasoning continue after the loss, and test whether a corrected commitment can bring the answer back into reach.

What is borrowed, and what is new

Leibniz gives an old name to the relation. ReachScan gives it an instrument: finite rollouts, declared projections, receipts, localization, intervention, and a bounded claim.

Endpoint evaluation collapses a richer object.

A reasoning trace is not only an answer-producing sequence. It is a path of commitments. Each emitted token fixes part of the future: some continuations remain easy, some become narrow, some become budget-censored, and some become systematically wrong.

Standard evaluation usually sees only the endpoint. Process supervision sees steps. Self-consistency aggregates over endpoints. None of these, by itself, measures the distribution of futures that the committed state leaves open.

The reasoning lane measures that distribution. Given a committed prefix, a continuation policy, and a resource budget, we repeatedly continue from the prefix and examine where the continuations land. Correct-answer mass is one coordinate of that distribution. It is not the whole object.
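The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the ReachScan implementation: `sample_future_field`, `toy_policy`, and `mass` are hypothetical names, and the toy policy is a made-up stand-in for a model plus sampler.

```python
import random
from collections import Counter

def sample_future_field(prefix, continue_fn, n_samples, budget, seed=0):
    """Continue n_samples times from the same frozen prefix and tally the answers."""
    rng = random.Random(seed)
    field = Counter()
    for _ in range(n_samples):
        # continue_fn stands in for the model plus sampler. It returns a final
        # answer, or None when the budget runs out before an answer appears.
        field[continue_fn(prefix, budget, rng)] += 1
    return field

def toy_policy(prefix, budget, rng):
    """Hypothetical continuation policy: a fixed answer distribution per prefix."""
    if budget < 1:
        return None  # budget-censored continuation
    if "commit-to-8k" in prefix:
        return rng.choice([56, 112, 112, 112])  # the target is no longer produced
    return rng.choice([56, 112, 532, 536])

def mass(field, predicate):
    """Fraction of all samples whose answer satisfies the predicate."""
    total = sum(field.values())
    hits = sum(c for a, c in field.items() if a is not None and predicate(a))
    return hits / total

root = sample_future_field("prompt", toy_policy, 256, budget=1)
committed = sample_future_field("prompt commit-to-8k", toy_policy, 256, budget=1)
target_mass_root = mass(root, lambda a: a == 532)
target_mass_committed = mass(committed, lambda a: a == 532)
print(target_mass_root, target_mass_committed)
```

The correct-answer mass is only one number read off each `Counter`. The full tally, including censored continuations under the `None` key, is the object being measured.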

The sentence the framework makes pronounceable

Self-consistency calls a state uncertain when the answers vary. This measurement calls it family-locked when the answers vary inside the same wrong family.
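A small sketch makes the contrast concrete. The answer list below is hypothetical, and `answer_entropy` and `family_mass` are illustrative helpers rather than ReachScan functions. The only facts taken from the page are the target 532 and the multiples-of-8 family.

```python
import math
from collections import Counter

TARGET = 532

def answer_entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in Counter(answers).values())

def family_mass(answers, in_family):
    """Fraction of sampled answers that fall inside a declared family."""
    return sum(1 for a in answers if in_family(a)) / len(answers)

def is_multiple_of_8(a):
    return a % 8 == 0

# Hypothetical sampled answers: they disagree with each other,
# yet every one of them is a multiple of 8.
answers = [56, 112, 112, 24, 56, 168, 112, 56]

entropy_bits = answer_entropy(answers)               # high: "uncertain" by answer spread
wrong_family = family_mass(answers, is_multiple_of_8)
family_locked = wrong_family == 1.0 and not is_multiple_of_8(TARGET)
print(round(entropy_bits, 3), wrong_family, family_locked)
```

Read through answer spread alone, this state looks undecided. Read through the family projection, all of its mass sits in a family that excludes 532, since 532 mod 8 is 4.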

A committed prefix induces a field of futures. The grammar travels; the projection changes.

Committed prefix, future field, diagnostic projection.

There are three primitives, each with a plain-language version. The portable part is the measurement grammar: committed state, sampled futures, declared lens, bounded claim. The task-specific part is the projection lens.

Primitive 01

Committed prefix

The state a process has already produced. In a reasoning system this is the token sequence written so far. In a solver it is a partial assignment. In a scheduler it is a partial schedule. The surface is different; the conceptual object is the same: an irreversible commitment.

Primitive 02

Future field

The distribution of outcomes the system can still produce from that exact committed state under a declared policy and budget. We measure it by sampling many continuations and reading where they land. The field has structure beyond a single most likely answer.

Primitive 03

Diagnostic projection

A way of looking at the future field through one labeling rule at a time. Are most futures multiples of 8? Are they all in the same operation family? Do they share a residue the target lacks? Each projection is a different question about the same field.
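As a rough sketch, a projection can be written as a labeling function applied to every sampled answer. The sample list and the `project` helper are hypothetical. The three lenses shown are illustrative choices for a target of 532, not a list taken from the paper.

```python
from collections import Counter

TARGET = 532

def project(answers, label):
    """Push a sampled answer field through one labeling rule."""
    return Counter(label(a) for a in answers)

# Each projection asks a different question about the same field.
projections = {
    "multiple_of_8": lambda a: a % 8 == 0,
    "residue_mod_8": lambda a: a % 8,
    "exact_target": lambda a: a == TARGET,
}

answers = [56, 112, 112, 532, 60, 56, 112, 524]  # hypothetical samples

for name, label in projections.items():
    buckets = project(answers, label)
    target_bucket = label(TARGET)  # the bucket the target itself falls in
    share = buckets[target_bucket] / len(answers)
    print(f"{name}: {dict(buckets)} target bucket={target_bucket!r} mass={share:.3f}")
```

The same eight answers give a target-bucket mass of 3/8 under the residue lens but only 1/8 under exact match, which is why the lens has to be declared alongside the result.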

The intuitive picture: tunnels and buckets.

Imagine the system writing a reasoning trace as opening a tunnel into the space of possible future answers. Once the tunnel is open, all subsequent continuations have to come out somewhere downstream of where the system is now standing. The tunnel can stay wide or narrow sharply. It can point at the right region or at a wrong one.

A diagnostic projection sorts the answers into buckets by some property — like sorting laundry by color. If the target is a blue shirt and the system keeps producing answers from the white-clothes pile, it isn't merely wrong. It's searching in a bucket that doesn't contain the target.

The reasoning lane measures both the shape of the tunnel and which bucket its outputs are landing in.

Why this is different from existing approaches

Self-consistency picks the majority answer. Process supervision rewards good steps. Tree-of-thought searches branches. Each treats the trace as a means to an answer. The reasoning lane treats the trace as a state and asks what futures remain attached to it.

What travels, and what does not.

The grammar travels across substrates: freeze a committed state, sample the futures still opened by that state, read them through a declared projection, and keep the claim tied to the contract. Modulo 8 does not travel automatically. Exact-answer parsing does not travel automatically. Every new task needs its own lens.

Supporting figures stay available as deeper dives, not as the entry point.

The central page keeps the argument in prose: committed state, sampled futures, declared lens, bounded claim. Readers who want the visual aids can open them separately without making the page itself depend on heavyweight embeds.

Case A, the structural prelude.

Upstream wrongness and staged tightening on an arithmetic reasoning trace.

Case A is the target-532 floor-sum problem. It shows why the full answer field matters: the prompt-root field is already biased toward a target-excluding arithmetic family, commitment depletes target-compatible mass, and the answer field later resolves onto a dominant wrong atom. The interactive views remain available as supporting figures; the page-level frame here carries the full RC10 measurement cycle.

What Case A shows.

At the prompt-only state, 169 of 245 valid numeric continuations lie in the target-excluding family 8ℤ, 45 lie in the target residue class, and one lands exactly on 532. The field is already arithmetically organized against the target before any generated reasoning token has been retained.

Commitment sharpens that structure. Wrong-family mass reaches 0.984 at 18.9% and 1.000 at 27%, while target-fiber mass falls to 0.016 and then zero. Yet the winning wrong atom is not resolved at the same time: 112 dominates the early/middle sweep, while 56 dominates only by 50%.

At 50%, the terminal answer field is nearly a point mass on 56, but surface reasoning remains diverse. The computation has largely converged while prose still varies. Endpoint accuracy and visible reasoning text miss that state-indexed reachability fact.

Case A is therefore a morphology result: target reach drains, wrong-family support closes, and atom-level consolidation follows later. Case B then asks whether correct-source and wrong-source prefixes separate under the same reach-scan logic.

The Case A receipts

169 of 245 prompt-root numeric continuations lie in 8ℤ; 45 of 245 sit in the target residue class; 1 of 256 lands exactly on 532.

Wrong-family mass reaches 0.984 at 18.9% and 1.000 at 27%; target-fiber mass falls to 0.016 and then zero.

251 of 256 answers equal 56 at 50%, while trajectory signatures remain mostly unique. Answer concentration outruns surface-text collapse.

Zero correct or target-compatible continuations appear across 1,280 samples from 50% through 99.9% under the declared instrument.
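The receipts convert to fractions with simple arithmetic, shown here with the denominators exactly as stated above. The last step is an added illustration, not a result from the paper. It assumes the 1,280 samples could be treated as independent draws with one common success rate, which the source does not claim. Under that assumption, a zero-hit run bounds the rate from above without ruling it out.

```python
# Counts copied from the receipts above: (hits, total).
receipts = {
    "prompt root: multiples of 8": (169, 245),
    "prompt root: target residue class": (45, 245),
    "prompt root: exactly 532": (1, 256),
    "at 50%: answers equal to 56": (251, 256),
}
fractions = {name: hits / total for name, (hits, total) in receipts.items()}

def zero_hit_upper_bound(n_samples, confidence=0.95):
    """One-sided upper bound on a success rate after 0 hits in n_samples.

    Assumes independent, identically distributed samples (an assumption
    made for this illustration only).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)

bound_1280 = zero_hit_upper_bound(1280)

for name, value in fractions.items():
    print(f"{name}: {value:.4f}")
print(f"0 of 1280 hits: rate below about {bound_1280:.4f} at 95% confidence")
```

Even under this idealized reading, zero hits in 1,280 samples is a small bounded rate under the declared instrument, not a demonstration that the target cannot be reached.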

The same conceptual object, on a substrate without an oracle.

The SAT calibration measures a partial assignment and asks whether satisfying completions remain reachable from it. The reasoning lane measures a token prefix and asks where the model's continuations land. The objects are different in surface and identical in structure: a committed history, and the futures it leaves open.

The substrates differ in one crucial way. SAT has an exact bridge oracle: a solver can certify that no satisfying completion remains from a given prefix. Reasoning systems do not. A model might recover under a larger budget, a different sampler, a different prompt, or a verifier we have not yet attached. The framework adapts by claiming less on the reasoning side: a finite-sample observation, not a certificate.

What the SAT side calls certified non-extendability, the reasoning side calls persistence under declared instrument. Same conceptual program, different epistemic status. The vocabulary tracks the distinction precisely so that a careful reader can never confuse the two.
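For intuition, the SAT-side check can be sketched as exhaustive enumeration on a tiny formula. This is a toy stand-in for a real solver, not the calibration code: the formula, the function names, and the encoding (positive integers for variables, negative for negations) are assumptions made for the example.

```python
from itertools import product

def satisfies(clauses, assignment):
    """True if every clause has at least one true literal under a full assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def extendable(clauses, n_vars, partial):
    """Exact check by enumeration: can `partial` be completed to a satisfying assignment?"""
    free = [v for v in range(1, n_vars + 1) if v not in partial]
    for values in product([False, True], repeat=len(free)):
        full = {**partial, **dict(zip(free, values))}
        if satisfies(clauses, full):
            return True
    return False

# Toy formula: (x1 or x2) and (not x1 or x3) and (not x3 or x2) and (not x1 or not x2)
clauses = [[1, 2], [-1, 3], [-3, 2], [-1, -2]]

print(extendable(clauses, 3, {}))          # the formula is satisfiable
print(extendable(clauses, 3, {1: False}))  # this prefix still has a future
print(extendable(clauses, 3, {1: True}))   # this prefix has none
```

A `False` here is a certificate because every completion was checked. The reasoning lane has no equivalent of that exhaustive loop, which is why its label is persistence under a declared instrument.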

The cross-substrate sentence

A committed process can enter a region of future space that no longer intersects, or no longer substantially reaches, the target set. In SAT, the reachable future from the prefix collapses. In reasoning, the field's mass leaves the target fiber. The substrate differs; the structural pattern is shared.

In the SAT work, this appears as a walking-dead interval: a path has already lost its satisfying future before failure becomes visible. In the AI work, the analogous quantity is the commitment-expression interval: the target answer falls out of reach before the model writes the answer.
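The walking-dead interval can be illustrated on a toy formula. This sketch rests on two assumptions: extendability is decided by brute-force enumeration, and failure counts as "visible" only once some clause has every literal assigned false. A real solver with propagation would notice sooner, and all names here are hypothetical.

```python
from itertools import product

def extendable(clauses, n_vars, partial):
    """Brute-force check that some completion of `partial` satisfies every clause."""
    free = [v for v in range(1, n_vars + 1) if v not in partial]
    for values in product([False, True], repeat=len(free)):
        full = {**partial, **dict(zip(free, values))}
        if all(any(full[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def falsified(clause, partial):
    """A clause is visibly violated once all of its literals are assigned false."""
    return all(abs(l) in partial and partial[abs(l)] != (l > 0) for l in clause)

def walking_dead_interval(clauses, n_vars, path):
    """Return (step where the future is lost, step where failure becomes visible)."""
    partial, dead_at, visible_at = {}, None, None
    for step, (var, value) in enumerate(path, start=1):
        partial[var] = value
        if dead_at is None and not extendable(clauses, n_vars, partial):
            dead_at = step
        if visible_at is None and any(falsified(c, partial) for c in clauses):
            visible_at = step
    return dead_at, visible_at

clauses = [[1, 2], [-1, 3], [-3, 2], [-1, -2]]
path = [(1, True), (3, True), (2, True)]  # the order in which variables are committed

dead_at, visible_at = walking_dead_interval(clauses, 3, path)
print(dead_at, visible_at)  # the path is dead from step 1, visibly failed at step 3
```

On this path the satisfying future is gone after the first commitment, while no clause is violated until the third. The two steps in between are the interval.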


Black-box measurement is the point, not a compromise.

White-box interpretability asks what is happening inside the model. ReachScan asks a different question: from this committed state, under this policy and budget, what outcomes remain reachable? That question still works when the system is closed, tool-using, multi-agent, or too complicated to make internals the first object.

The two approaches can cooperate. A reach-scan can show where to look: the transition window, the contrast between states, the field before and after intervention, and the downstream path that closes the target again.

The careful sentence

ReachScan is not a probe of model internals. It is a reference measurement of prefix-conditioned outcome viability under a declared continuation instrument.

What the published paper measures.

The work is a bounded evidence chain, not a single scan: Case A supplies distributional structure; Case B supplies source-conditioned diagnosticity; localization and reopening make the comparison temporally and functionally meaningful.

What the measurements close, and what stays open.

The paper closes the gap between a one-case morphology and a diagnostic measurement cycle. It does not close portability. Cross-model behavior, less structured task families, deployable monitoring, and internal mechanism identification remain explicit boundary questions.

Claim boundary.

Every claim on this page is indexed to a declared model, sampler, and budget. The indexing is part of the contribution.

What is supported

State-indexed answer foreclosure under declared contracts

The paper supports finite-budget, model/sampler/budget-indexed claims about prefix-conditioned future fields. A prompt or correct-source prefix can leave the target reachable; a wrong-committed prefix can deplete it; a corrected prefix can reopen it.

What is not yet supported

Portability, mechanism, and deployable monitoring

The work does not establish cross-model portability, cross-task generality, universal diagnosticity, an internal mechanism account, a deployable monitor, a general causal-intervention law, or indefinite repaired-path stability.

The epistemic distinction

Foreclosure, not mathematical impossibility

The reasoning substrate has no exact oracle analogous to the SAT bridge check. A reasoning-side dead or foreclosed label is an operational claim under a declared model, prefix, sampler, budget, extraction rule, validity gate, and target set. It is not a proof that recovery is impossible.
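One way to keep a label tied to its contract is to store the two together. The sketch below is illustrative only: `Contract` and `ForeclosureClaim` are hypothetical types, the field names mirror the list in the paragraph above, and every field value is a placeholder rather than a setting from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    """Everything a reasoning-side label is indexed to."""
    model: str
    prefix_id: str
    sampler: str
    budget_tokens: int
    extraction_rule: str
    validity_gate: str
    target_set: frozenset

@dataclass(frozen=True)
class ForeclosureClaim:
    contract: Contract
    n_samples: int
    n_target_hits: int

    def label(self):
        if self.n_target_hits == 0:
            return "foreclosed under declared instrument"
        return "target still reached"

    def statement(self):
        c = self.contract
        return (
            f"{self.n_target_hits} of {self.n_samples} continuations reached "
            f"{sorted(c.target_set)} under model={c.model}, sampler={c.sampler}, "
            f"budget={c.budget_tokens} tokens: {self.label()}. "
            "This is an operational claim, not a proof that recovery is impossible."
        )

# All values below are placeholders for illustration.
contract = Contract(
    model="hypothetical-model",
    prefix_id="example-prefix",
    sampler="temperature=1.0",
    budget_tokens=2048,
    extraction_rule="last integer in the output",
    validity_gate="output parses as an integer",
    target_set=frozenset({532}),
)
claim = ForeclosureClaim(contract, n_samples=1280, n_target_hits=0)
print(claim.statement())
```

Because both dataclasses are frozen, a label cannot be separated from the model, sampler, and budget it was measured under without constructing a new claim.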

The vocabulary.

Each term refers to a measurable object. The arithmetic is simple. The contribution is in choosing what to measure.