Run the Numbers — Bayesian Forecasting as a Mandate
An AI operator kept picking the wrong recovery path under pressure — rebooting the Mac when killing a sprawl of browser workers would have saved 9 minutes at zero blast radius. The fix was not more rules. The fix was a forecaster persona whose only job is to price every fork with a base rate, an explicit prior, a posterior with a range, and named update triggers — before the system acts.
kill Playwright + retry git
reboot the Mac, same disk issue returns
FAVORED / EDGE / COIN FLIP / UNFAVORED / DOG / UNKNOWABLE
base rate, prior, posterior + range, update triggers
Vibes beat Bayes under time pressure
The Failure
Symptom misread
Wrong recovery path picked
A git stall got diagnosed as a kernel-level mmap failure. The session offered two recoveries: reboot the Mac, or clone the repo into /tmp. Both had low posteriors on actually solving it. Disk was at 0.2 GiB free.
Detection existed
Real cause was visible
Session-start had already printed 0.2 GiB free disk and hundreds of live Playwright processes. Any reasonable Bayesian prior on "what blocks git under this shape of system load" would have scored killing Playwright FAVORED instantly.
Detection, no action
No decision layer
Jenn OS was converting detection into action for worktree sprawl and process collisions. But decisions between detected options were still vibes-based. The operator layer knew. The decision layer did not.
A named persona whose only job is to price options
The Fix: Forecast Before Act
Reference class
Base rate first
Every forecast starts with a reference class and its historical success rate. "How often does this kind of decision succeed?" before "how does this one feel?" Outside view beats inside view on the first pass.
P10 / P50 / P90
Explicit prior → posterior with a range
Never hide the starting probability inside a recommendation. State the prior, list the inside-view evidence with likelihood ratios, and give the posterior with a 90% confidence interval. Single-point estimates masquerade as knowledge.
What would move this 10+ points?
Named update triggers
Every forecast ends with: "I move this up by ≥10 if X, down by ≥10 if Y." A forecast without a falsifier is not a forecast, it is a guess. This is the Tetlock move — superforecasters name the next thing that would change their mind.
When an agent or human says "run the numbers" — the forecaster is in effect
How the Mandate Works
Portable across sessions
Loaded as a skill, not a plugin
The persona lives at skills/agents/bayesian-forecaster-agent.md — the same shape as the existing Daubert Verification Agent. Any session can load it. It sits adjacent to the methodology skills (boil-the-ocean, multi-agent-sidecar) that govern execution discipline.
Data vs decisions
Paired with Daubert
Daubert Verification Agent prices whether data can survive cross-examination. Bayesian Forecaster prices whether decisions can survive hindsight. Together they are the "just trust us" firewall across every high-stakes output in Jenn OS.
Preserve optionality
Close calls default to sidecar
If two options are within 10 points of each other on posterior, the cheapest move is usually the one that preserves optionality — a sidecar branch, a reversible change, a learning bet — rather than picking under noise. Codified in the mandate.
No cop-outs
Decisive under uncertainty
"I don't know" is only acceptable when the question is genuinely unknowable. A wide range is still a forecast. 50% ± 30% with a named reference class is more honest than silence and more useful than a hunch.
An AI coding agent hit a git stall. Session-start had already printed: "Free space on the workspace volume: 0.2 GiB" and "557 Playwright processes." The agent did not treat those as blockers. It treated them as passive information. When git hung, the diagnosis was "kernel-level mmap stall" and the recovery options offered to the user were "reboot the Mac" or "clone the repo into /tmp." Both were wrong.
Under any reasonable Bayesian prior, the right move was visible. macOS aggressively swaps to disk under memory pressure. When free disk drops below the swap headroom, mmap operations on memory-mapped files (like git pack files) stall waiting on paging. 105 live Playwright workers were each holding RAM and temporary files on the same volume. Killing them should have freed enough disk for git to complete. Reboot would have freed RAM but not disk — the same stall returns as soon as git pages anything in. The /tmp bypass fails outright because /tmp lives on the same APFS volume.
The Bayesian forecaster would have priced this instantly. Reference class: "git stall under low disk and high browser-worker count." Historical base rate of "kill workers, retry" succeeding: above 80%. Base rate of "reboot" succeeding with the same disk: around 40%. Base rate of "clone to /tmp" succeeding: zero, because same volume. The EV of reaping Playwright dominates. FAVORED.
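That pricing can be sketched numerically. This is an illustration of the odds-form update and the EV comparison, not code from the mandate file; the helper names and the exact payoff units are assumptions, and the probabilities are the incident's rough figures.

```javascript
// Odds-form Bayes: posterior odds = prior odds * product of likelihood ratios.
const toOdds = (p) => p / (1 - p);
const toProb = (o) => o / (1 + o);

function posterior(prior, likelihoodRatios) {
  return toProb(likelihoodRatios.reduce((o, lr) => o * lr, toOdds(prior)));
}

// Expected value per option: P(win) * upside - P(lose) * downside.
const ev = (p, upside, downside) => p * upside - (1 - p) * downside;

// "Kill workers, retry": base rate ~0.8, sharpened by the disk + worker evidence.
const pKill = posterior(0.8, [5.0, 3.0]); // ≈ 0.98

const options = [
  { name: "kill Playwright", ev: ev(0.9, 1, 1) }, // ≈ +0.80
  { name: "reboot the Mac", ev: ev(0.4, 1, 3) },  // ≈ -1.40, reboot cost weighted 3x
  { name: "clone to /tmp", ev: ev(0.0, 1, 1) },   // -1.00, same volume
];
options.sort((a, b) => b.ev - a.ev); // highest EV first: kill Playwright
```

The sort makes the dominance explicit: reaping Playwright wins before any inside-view storytelling gets a vote.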
This is not a new problem in forecasting literature. Tetlock and the superforecaster research showed that the thing that distinguishes good forecasters from bad ones is not access to more information. It is the discipline of naming priors, demanding base rates, and updating numerically. Most humans make predictions in words — "probably," "could," "unlikely." Those are not forecasts. They are unfalsifiable stories. A forecaster that says "75% ± 15%, reference class X, would move down if Y" can be calibrated over time. A forecaster that says "I think it will work" cannot.
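Calibration is only possible because the forecast is a number. A minimal scoring sketch, using the standard Brier score (the helper name is illustrative):

```javascript
// Brier score: mean squared error between stated probability and outcome (0/1).
// Lower is better. "Probably" cannot be scored; 0.75 can.
function brier(records) {
  return records.reduce((s, r) => s + (r.p - r.outcome) ** 2, 0) / records.length;
}

// Two forecasts at 75%, one resolved yes, one no:
brier([{ p: 0.75, outcome: 1 }, { p: 0.75, outcome: 0 }]); // 0.3125
```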
The mandate has four required elements, in order. Base rate. Explicit prior. Posterior with a range. Named update triggers. Any "forecast" missing one of these four is noise, and the mandate says to treat it as raw material to re-price, not as a conclusion.
The persona lives in skills/agents/bayesian-forecaster-agent.md. It mirrors the structure of the existing Daubert Verification Agent — Persona, Standard, Protocol, Verdicts, What This Agent Does NOT Do, When To Invoke, North Star, Origin. The two agents are a deliberate pair. Daubert prices whether data can survive cross-examination. Forecaster prices whether decisions can survive hindsight. Together they form the "just trust us" firewall across Jenn OS.
Six verdict tiers. FAVORED (P > 70%, bounded downside, proceed). EDGE (P 55-70%, positive, proceed with hedge — smaller bet, early checkpoint). COIN FLIP (P 45-55%, recommend smallest bet that yields signal — a learning bet). UNFAVORED (P 30-45% or dominated, prefer alternative). DOG (P < 30% or unbounded downside, do not proceed on this path). UNKNOWABLE (confidence interval too wide to distinguish — get more evidence first).
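The tiers read as thresholds on the posterior plus two qualitative flags. The function below is one possible encoding of that table, not the mandate's own code:

```javascript
// Map a posterior probability (plus flags) to a verdict tier.
function verdict(p, { unboundedDownside = false, intervalTooWide = false } = {}) {
  if (intervalTooWide) return "UNKNOWABLE";  // get more evidence first
  if (unboundedDownside || p < 0.30) return "DOG";
  if (p < 0.45) return "UNFAVORED";          // prefer alternative
  if (p <= 0.55) return "COIN FLIP";         // smallest bet that yields signal
  if (p <= 0.70) return "EDGE";              // proceed with hedge
  return "FAVORED";                           // proceed, downside already bounded
}

verdict(0.9);                              // "FAVORED"
verdict(0.5);                              // "COIN FLIP"
verdict(0.9, { unboundedDownside: true }); // "DOG": unbounded downside trumps P
```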
There is a subtle rule embedded in the mandate that matters more than the verdicts: close calls default to sidecar. If two options are within 10 points of each other on posterior, the cheapest move is the one that preserves optionality. A sidecar branch instead of a rebase. A reversible change instead of an irreversible one. A learning bet instead of a production bet. Under noise, the right move is almost always the one that buys information without spending commitment.
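The close-call rule in code form, with an illustrative helper:

```javascript
// Within 10 points on posterior, don't pick under noise: preserve optionality.
function choose(a, b) {
  if (Math.abs(a.posterior - b.posterior) < 0.10) {
    return { pick: "sidecar", why: "close call: buy information, not commitment" };
  }
  const winner = a.posterior > b.posterior ? a : b;
  return { pick: winner.name, why: "clear posterior gap" };
}

choose({ name: "rebase", posterior: 0.62 },
       { name: "merge", posterior: 0.58 }).pick;  // "sidecar"
choose({ name: "kill workers", posterior: 0.90 },
       { name: "reboot", posterior: 0.40 }).pick; // "kill workers"
```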
The mandate pairs with the System Control Mandate. That mandate keeps root-cause visibility before we act — by flagging breach states like worktree sprawl, disk pressure, multi-agent coordination risk. This mandate keeps decision quality once we are ready to act. One governs what we see. The other governs what we choose.
This is a north-star belief about how Jenn OS makes choices, not a workflow tweak. The alternative — following intuition alone — is how 9 minutes get lost to a wrong reboot. The alternative is how a $10,000 month of wrong framework choices happens. The alternative is how a session rebuilds the same broken thing six times because "it seemed easier." We forecast. We price. Then we move.
// The forecaster protocol, condensed
// Step 1: Frame
// - Options (including do-nothing)
// - Define what "goes well" means, concretely
// - Time horizon, reversibility
// Step 2: Reference class
// - "git stall under low disk + high browser-worker count"
// - historical success rate of the class
// Step 3: Base rate -> prior
const prior = 0.80; // base rate: "kill workers, retry" succeeds in > 80% of this class
// Step 4: Inside-view evidence with likelihood ratios
const evidence = [
{ note: "disk at 0.2 GiB", lr: 5.0 }, // favors low-disk cause
{ note: "105 Playwright workers live", lr: 3.0 },
{ note: "session-start already flagged both", lr: 1.5 },
];
// Step 5: Posterior (informal Bayes)
// posterior odds = prior odds * product(LRs)
// Give a range: P10 / P50 / P90
// Step 6: Magnitudes
// upside if works: 90 seconds saved, no reboot
// downside if fails: same 10-minute reboot path, bounded
// Step 7: Expected value per option
// Kill Playwright: 0.90 * 1 - 0.10 * 1 = +0.80
// Reboot Mac: 0.40 * 1 - 0.60 * 3 = -1.40 (reboot cost)
// Clone to /tmp: 0.00 * 1 - 1.00 * 1 = -1.00
// Step 8: Recommend (highest EV)
// -> Kill Playwright. Verdict: FAVORED.
// Step 9: Update triggers
// Move to UNFAVORED if: kill succeeds but git still stalls after 3 retries
// (would indicate a real kernel issue, not disk pressure)
// Move to DOG if: pkill reports zero live workers
// (was not the cause, different hypothesis needed)