Measurement & Determinism

Measurement is what the rest of the loop stands on. Statuses, profiles, bundles, and alerts are only as honest as the numbers underneath them. This page covers the two quantities the measurement core works with, the rules that keep those numbers trustworthy, and why the same input always replays to the same output.

The two quantities: θ and δ

Symbol	Name	What it is	Who sees it
θ (theta)	ability estimate	A student’s statistical reading ability on the latent scale, re-estimated after each answered item.	Internal only. Never shown to a student or family, ever.
δ (delta) / delta_prior	item difficulty	How hard an item is. `delta_prior` is the deterministic, formula-derived estimate computed at import, before real calibration.	Developer-facing.

The diagnostic chains θ from one item to the next: the server grades each answer, updates the estimate, and then picks the next item near the student’s current level. The student never sees θ or a score. They see tasks and supportive feedback.
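As an illustration of "picks the next item near the student’s current level", here is a minimal sketch. The selection rule (smallest distance between δ and the current θ, ties broken by item id) and the names `pick_next_item`, `candidates`, and `answered` are assumptions made for this example; the page does not specify the actual rule.

```python
from typing import Dict, Optional, Set


def pick_next_item(
    theta: float,
    candidates: Dict[str, float],  # hypothetical shape: item_id -> difficulty (delta)
    answered: Set[str],
) -> Optional[str]:
    """Return the unanswered item whose difficulty is closest to theta.

    Ties are broken by item_id so the choice is deterministic.
    Returns None when every candidate has already been answered.
    """
    remaining = [
        (abs(delta - theta), item_id)
        for item_id, delta in candidates.items()
        if item_id not in answered
    ]
    if not remaining:
        return None
    # min() on (distance, item_id) tuples compares distance first, then id.
    return min(remaining)[1]
```

Breaking ties on a stable key rather than on dictionary order is what keeps a sketch like this replayable: the same θ and the same item pool always yield the same next item.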

Rasch and the Wave 1 engine

Amal calibrates item difficulty with a Rasch measurement model. Real calibration refits δ from observed responses (it needs a large response volume per item) and freezes the result into a versioned snapshot.
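For reference, the textbook Rasch (1PL) model expresses the probability of a correct response as a function of the gap between ability θ and item difficulty δ. This is the standard form of the model; any scaling or parameterization specific to Amal is not stated on this page.

```latex
P(X = 1 \mid \theta, \delta) = \frac{e^{\theta - \delta}}{1 + e^{\theta - \delta}}
```

When θ equals δ the probability is exactly 0.5, which is why an item "near the student’s current level" is the most informative one to ask next.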

Wave 1 ships the real in-process 1PL Rasch EAP estimator (`rasch_1pl_v1`) as the default scoring implementation. The deterministic mock formula remains available via the `ENGINE_IMPLEMENTATION=mock` environment variable and is used exclusively to replay `DiagnosticSession` rows that were originally scored with that implementation.
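To make "1PL Rasch EAP" concrete, the sketch below computes an expected a posteriori (EAP) estimate on a fixed grid. It is not the `rasch_1pl_v1` source: the standard-normal prior, the grid from -4 to 4 with 81 points, and the function names are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def rasch_p(theta: float, delta: float) -> float:
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def eap_estimate(
    responses: List[Tuple[float, bool]],  # (item difficulty, answered correctly)
    grid_min: float = -4.0,   # assumed quadrature range
    grid_max: float = 4.0,
    n_points: int = 81,       # assumed grid resolution
    prior_sd: float = 1.0,    # assumed N(0, prior_sd^2) prior
) -> float:
    """Posterior mean of theta over a fixed, evenly spaced grid."""
    step = (grid_max - grid_min) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for i in range(n_points):
        theta = grid_min + i * step
        # Unnormalized prior density; the constant cancels in the ratio.
        weight = math.exp(-0.5 * (theta / prior_sd) ** 2)
        # Multiply in the likelihood of each observed response.
        for delta, correct in responses:
            p = rasch_p(theta, delta)
            weight *= p if correct else (1.0 - p)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator
```

A fixed grid with a fixed iteration order involves no randomness and no iterative convergence threshold, which fits the determinism requirement described later on this page.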

The engine operates on a two-track difficulty system. As real Rasch calibration data accumulates, eligible items are automatically promoted to their measured difficulty (Track A: calibrated). Until then, every item runs on `delta_prior`, the deterministic formula-derived estimate computed at import time (Track B: prior). Every live session reads one pinned `calibration_version`, so historical decisions stay reproducible regardless of future recalibration runs.
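The two tracks can be pictured as a lookup with a fallback. In this sketch the `Item` class, the `resolve_delta` function, and the snapshot shape (a mapping from item id to calibrated δ for one `calibration_version`) are hypothetical; only `delta_prior` and the calibrated/prior distinction come from the text above.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Item:
    item_id: str
    delta_prior: float  # formula-derived estimate computed at import


def resolve_delta(item: Item, snapshot: Mapping[str, float]) -> Tuple[float, str]:
    """Resolve an item's difficulty against one pinned calibration snapshot.

    Returns (difficulty, track): Track A ("calibrated") if the snapshot has
    a measured value for the item, otherwise Track B ("prior").
    """
    calibrated = snapshot.get(item.item_id)
    if calibrated is not None:
        return calibrated, "calibrated"
    return item.delta_prior, "prior"
```

Because the snapshot is passed in explicitly, a session that pins one version keeps resolving the same difficulties even after a newer snapshot exists.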

Determinism is a feature (V-6)

The measurement core is deterministic by design (V-6): the same input plus the same rule and calibration version produces byte-identical output, every time. This is what makes the whole loop auditable, because any past decision can be replayed against the exact snapshot that produced it. θ is persisted at full precision and the resolved item difficulties are frozen per session, so a replay months later yields the identical estimate even after a future recalibration.
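One way to check a byte-identical replay is to serialize the decision record canonically and compare hashes. This is a sketch under assumptions: the record fields, the use of JSON, and the helper names are illustrative and are not taken from Amal’s codebase.

```python
import hashlib
import json
from typing import Any, Dict


def canonical_bytes(record: Dict[str, Any]) -> bytes:
    """Serialize a record so equal content always yields equal bytes.

    sort_keys removes dependence on insertion order, fixed separators remove
    whitespace variation, and allow_nan=False rejects NaN and infinity.
    """
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), allow_nan=False)
    return text.encode("utf-8")


def fingerprint(record: Dict[str, Any]) -> str:
    """SHA-256 hex digest of the canonical form."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def replay_matches(stored_fingerprint: str, replayed: Dict[str, Any]) -> bool:
    """True when a replayed record reproduces the stored decision exactly."""
    return fingerprint(replayed) == stored_fingerprint
```

The sketch also shows why θ has to be persisted at full precision: Python writes floats in a form that round-trips exactly, so a stored `0.30000000000000004` replays identically, whereas a value rounded to `0.3` before storage would produce different bytes.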

No LLM in the measurement loop (V-5)

No language model ever participates in measurement (V-5). No LLM computes θ, picks an item, scores a probe, or runs alongside a student session. Statistics measure, humans decide, and LLMs only draft teacher-side artifacts that a teacher reviews. There is a single enforced chokepoint for LLM calls, and it sits nowhere near the scoring path.

No single global percentage (V-3)

There is no overall reading score anywhere (V-3). Status is always reported per-measure or per-macro-domain, never as one rolled-up number. Each measure resolves to one of five states:

MeasureStatus	Meaning
`Meets`	At or above the benchmark for this measure.
`Approaching`	Near the benchmark; watch.
`Below`	Under the benchmark; needs support.
`Severe`	Well under; priority need.
`Not_Assessed`	No comparable evidence yet.

`Not_Assessed` is never treated as zero and is excluded from every rollup. A missing measure is missing, and it never silently drags a status down. This is the honest-coverage rule that keeps the loop from punishing students for gaps in data.
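The exclusion rule can be sketched as a filter that runs before any aggregation. The actual rollup rule belongs to the Decision Engine and is not described here, so this example takes it as a parameter (`combine`); the `rollup` function and the enum member names are assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Iterable, List


class MeasureStatus(Enum):
    MEETS = "Meets"
    APPROACHING = "Approaching"
    BELOW = "Below"
    SEVERE = "Severe"
    NOT_ASSESSED = "Not_Assessed"


def rollup(
    statuses: Iterable[MeasureStatus],
    combine: Callable[[List[MeasureStatus]], MeasureStatus],
) -> MeasureStatus:
    """Aggregate statuses while ignoring measures with no evidence.

    NOT_ASSESSED entries are dropped before `combine` runs, so a missing
    measure can never lower the result. If nothing was assessed, the
    rollup itself is NOT_ASSESSED rather than a default or a zero.
    """
    assessed = [s for s in statuses if s is not MeasureStatus.NOT_ASSESSED]
    if not assessed:
        return MeasureStatus.NOT_ASSESSED
    return combine(assessed)
```

Whatever `combine` does, it only ever receives assessed measures, which is the property the honest-coverage rule asks for.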

Append-only audit

Measurement-relevant tables are append-only. Evidence rows, scored responses, and θ estimates are never edited or deleted in place; corrections are new rows. This guarantees the record the engine decided from is exactly the record you can later inspect, which is what makes V-6 replay meaningful.
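A small in-memory model shows what "corrections are new rows" means in practice. The `EvidenceRow` fields, the `supersedes` pointer, and the `AppendOnlyLog` class are hypothetical; the real tables and their columns are not described on this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a row cannot be mutated after creation
class EvidenceRow:
    row_id: int
    student_id: str
    payload: str
    supersedes: Optional[int] = None  # row_id this row corrects, if any


class AppendOnlyLog:
    """Rows can be added and read, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[EvidenceRow] = []

    def append(self, student_id: str, payload: str,
               supersedes: Optional[int] = None) -> EvidenceRow:
        row = EvidenceRow(len(self._rows) + 1, student_id, payload, supersedes)
        self._rows.append(row)
        return row

    def history(self) -> Tuple[EvidenceRow, ...]:
        """Every row ever written, in insertion order."""
        return tuple(self._rows)

    def current(self) -> Tuple[EvidenceRow, ...]:
        """Rows that have not been superseded by a later correction."""
        superseded = {r.supersedes for r in self._rows if r.supersedes is not None}
        return tuple(r for r in self._rows if r.row_id not in superseded)
```

A correction leaves the original row in `history()`, so a replay can still read exactly what the engine saw at decision time, while `current()` reflects the corrected view.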

What happens when a diagnostic finishes

A finished diagnostic emits a `screening_score` evidence row alongside θ. That row carries the per-sub-skill correct and total counts from the server-graded answers, not θ itself, and it is the input the decision engine recomputes from. θ stays internal and is stored in parallel on its own table. The Measurement Bridge is the projection that writes the evidence row and the macro status tiles when the diagnostic finishes.
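The shape of such a row can be sketched as a tally over graded answers. The field names (`kind`, `sub_skills`, `correct`, `total`), the function name, and the sub-skill labels in the usage note are assumptions; only the idea of per-sub-skill correct and total counts, with no θ in the row, comes from the text.

```python
from typing import Dict, Iterable, Tuple


def build_screening_score(
    graded_answers: Iterable[Tuple[str, bool]],  # (sub_skill, server-graded correctness)
) -> Dict[str, object]:
    """Tally correct and total counts per sub-skill.

    The row deliberately contains counts only; theta is not part of it.
    Sub-skills are emitted in sorted order so the output is stable.
    """
    counts: Dict[str, Dict[str, int]] = {}
    for sub_skill, is_correct in graded_answers:
        entry = counts.setdefault(sub_skill, {"correct": 0, "total": 0})
        entry["total"] += 1
        if is_correct:
            entry["correct"] += 1
    return {
        "kind": "screening_score",
        "sub_skills": {name: counts[name] for name in sorted(counts)},
    }
```

For example, answers of `("decoding", True)`, `("decoding", False)`, and `("vocabulary", True)` would yield 1 of 2 for the hypothetical `decoding` sub-skill and 1 of 1 for `vocabulary`. A sub-skill with no answers simply does not appear, which lines up with treating missing evidence as missing rather than as zero.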

Where to go next

The Decision Engine: how measured evidence becomes a skill, domain, and macro status.
Diagnostic & Practice: the session APIs that chain θ.
Standards & Benchmarks: where the per-measure cut points come from.
Glossary: θ, δ, Rasch, calibration, and MeasureStatus defined.