PATTERN · Apr 2026

Run the Numbers — Bayesian Forecasting as a Mandate

An AI operator kept picking the wrong recovery path under pressure — rebooting the Mac when killing a sprawl of browser workers would have saved 9 minutes at zero blast radius. The fix was not more rules. The fix was a forecaster persona whose only job is to price every fork with a base rate, an explicit prior, a posterior with a range, and named update triggers — before the system acts.

bayesian-forecasting · decision-quality · mandates · jenn-os · agents · superforecasting · tetlock

90%+ · posterior on correct fix · kill Playwright + retry git

40% · posterior on wrong fix · reboot the Mac, same disk issue returns

6 · verdict tiers · FAVORED / EDGE / COIN FLIP / UNFAVORED / DOG / UNKNOWABLE

4 · pieces every forecast must carry · base rate, prior, posterior + range, update triggers

Vibes beat Bayes under time pressure

The Failure

Symptom misread

Wrong recovery path picked

A git stall got diagnosed as a kernel-level mmap failure. The session offered two recoveries: reboot the Mac, or clone the repo into /tmp. Both had low posteriors on actually solving it. Disk was at 0.2 GiB free.

Detection existed

Real cause was visible

Session-start had already printed 0.2 GiB free disk and hundreds of live Playwright workers. Any reasonable Bayesian prior on "what blocks git under this shape of system load" would have scored killing Playwright FAVORED instantly.

Detection, no action

No decision layer

Jenn OS was converting detection into action for worktree sprawl and process collisions. But decisions between detected options were still vibes-based. The operator layer knew. The decision layer did not.

A named persona whose only job is to price options

The Fix: Forecast Before Act

Reference class

Base rate first

Every forecast starts with a reference class and its historical success rate. "How often does this kind of decision succeed?" before "how does this one feel?" Outside view beats inside view on the first pass.

P10 / P50 / P90

Explicit prior → posterior with a range

Never hide the starting probability inside a recommendation. State the prior, list the inside-view evidence with likelihood ratios, and give the posterior with a 90% confidence interval. Single-point estimates masquerade as knowledge.

What would move this 10+ points?

Named update triggers

Every forecast ends with: "I move this up by ≥10 if X, down by ≥10 if Y." A forecast without a falsifier is not a forecast, it is a guess. This is the Tetlock move — superforecasters name the next thing that would change their mind.
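The prior-to-posterior step can be sketched in odds form. This is a minimal illustration, not the mandate's implementation; the numbers are made up for the example.

```javascript
// Odds-form Bayes update:
//   posterior odds = prior odds * product of likelihood ratios.
function update(prior, likelihoodRatios) {
  const priorOdds = prior / (1 - prior);
  const posteriorOdds = likelihoodRatios.reduce((odds, lr) => odds * lr, priorOdds);
  return posteriorOdds / (1 + posteriorOdds);
}

// Prior 0.55, two pieces of evidence for the hypothesis (LR > 1),
// one mildly against (LR < 1). Values are illustrative.
const p = update(0.55, [5.0, 3.0, 0.8]);

// Report a range, not a point — shading for model error is a judgment
// call; the widths below are arbitrary placeholders.
const range = { p10: p - 0.15, p50: p, p90: Math.min(p + 0.05, 0.99) };
```

The point of the odds form is that each new piece of evidence is a single multiplication, which makes the update auditable: anyone can re-check which likelihood ratio moved the number.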

When an agent or human says "run the numbers" — the forecaster is in effect

How the Mandate Works

Portable across sessions

Loaded as a skill, not a plugin

The persona lives at skills/agents/bayesian-forecaster-agent.md — the same shape as the existing Daubert Verification Agent. Any session can load it. It sits adjacent to the methodology skills (boil-the-ocean, multi-agent-sidecar) that govern execution discipline.

Data vs decisions

Paired with Daubert

Daubert Verification Agent prices whether data can survive cross-examination. Bayesian Forecaster prices whether decisions can survive hindsight. Together they are the "just trust us" firewall across every high-stakes output in Jenn OS.

Preserve optionality

Close calls default to sidecar

If two options are within 10 points of each other on posterior, the cheapest move is usually the one that preserves optionality — a sidecar branch, a reversible change, a learning bet — rather than picking under noise. Codified in the mandate.

No cop-outs

Decisive under uncertainty

"I don't know" is only acceptable when the question is genuinely unknowable. A wide range is still a forecast. 50% ± 30% with a named reference class is more honest than silence and more useful than a hunch.

1

An AI coding agent hit a git stall. Session-start had already printed: "Free space on the workspace volume: 0.2 GiB" and "557 Playwright processes." The agent did not treat those as blockers. It treated them as passive information. When git hung, the diagnosis was "kernel-level mmap stall" and the recovery options offered to the user were "reboot the Mac" or "clone the repo into /tmp." Both were wrong.

2

Under any reasonable Bayesian prior, the right move was visible. macOS aggressively swaps to disk under memory pressure. When free disk drops below the swap headroom, mmap operations on memory-mapped files (like git pack files) stall waiting on paging. Hundreds of live Playwright worker processes were each holding RAM and temp files. Killing them should have freed enough disk for git to complete. A reboot would have freed RAM but not disk — the same stall returns as soon as git pages anything in. The /tmp bypass option literally fails because /tmp is on the same APFS volume.

3

The Bayesian forecaster would have priced this instantly. Reference class: "git stall under low disk and high browser-worker count." Historical base rate of "kill workers, retry" succeeding: above 80%. Base rate of "reboot" succeeding with the same disk: around 40%. Base rate of "clone to /tmp" succeeding: zero, because same volume. The EV of reaping Playwright dominates. FAVORED.

4

This is not a new problem in forecasting literature. Tetlock and the superforecaster research showed that the thing that distinguishes good forecasters from bad ones is not access to more information. It is the discipline of naming priors, demanding base rates, and updating numerically. Most humans make predictions in words — "probably," "could," "unlikely." Those are not forecasts. They are unfalsifiable stories. A forecaster that says "75% ± 15%, reference class X, would move down if Y" can be calibrated over time. A forecaster that says "I think it will work" cannot.

5

The mandate has four required elements, in order. Base rate. Explicit prior. Posterior with a range. Named update triggers. Any "forecast" missing one of these four is noise, and the mandate says to treat it as raw material to re-price, not as a conclusion.
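The four-element check can be enforced mechanically. A hypothetical shape — the field names are mine, not from the mandate file:

```javascript
// A forecast is admissible only if all four mandated pieces are present:
// base rate (with a reference class), explicit prior, posterior with a
// range, and at least one named update trigger. Field names are illustrative.
function isAdmissible(f) {
  return Boolean(
    f.referenceClass && typeof f.baseRate === "number" &&  // 1. base rate
    typeof f.prior === "number" &&                          // 2. explicit prior
    f.posterior &&                                          // 3. posterior + range
    typeof f.posterior.p10 === "number" &&
    typeof f.posterior.p50 === "number" &&
    typeof f.posterior.p90 === "number" &&
    Array.isArray(f.updateTriggers) && f.updateTriggers.length > 0 // 4. triggers
  );
}
```

Anything that fails this check is treated as raw material to re-price, not as a conclusion.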

6

The persona lives in skills/agents/bayesian-forecaster-agent.md. It mirrors the structure of the existing Daubert Verification Agent — Persona, Standard, Protocol, Verdicts, What This Agent Does NOT Do, When To Invoke, North Star, Origin. The two agents are a deliberate pair. Daubert prices whether data can survive cross-examination. Forecaster prices whether decisions can survive hindsight. Together they form the "just trust us" firewall across Jenn OS.

7

Six verdict tiers. FAVORED (P > 70%, bounded downside: proceed). EDGE (P 55-70%, positive EV: proceed with a hedge — smaller bet, early checkpoint). COIN FLIP (P 45-55%: recommend the smallest bet that yields signal — a learning bet). UNFAVORED (P 30-45%, or dominated by another option: prefer the alternative). DOG (P < 30%, or unbounded downside: do not proceed on this path). UNKNOWABLE (confidence interval too wide to distinguish the options — get more evidence first).
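The tiers map mechanically from a posterior range plus a downside flag. A sketch under stated assumptions — the interval-width cutoff for UNKNOWABLE is my own placeholder, not a number from the mandate:

```javascript
// Map a posterior (P50, with P10/P90 range) and a bounded-downside flag
// onto the six verdict tiers. The 0.5 width cutoff for UNKNOWABLE is an
// assumed threshold for "too wide to distinguish".
function verdict({ p50, p10, p90, downsideBounded }) {
  if (p90 - p10 > 0.5) return "UNKNOWABLE"; // get more evidence first
  if (p50 < 0.30 || !downsideBounded) return "DOG";
  if (p50 < 0.45) return "UNFAVORED";
  if (p50 < 0.55) return "COIN FLIP";      // smallest bet that yields signal
  if (p50 <= 0.70) return "EDGE";          // proceed with a hedge
  return "FAVORED";                        // P > 70%, bounded downside
}
```

Note the order: an unbounded downside forces DOG even at a high posterior, because FAVORED requires both P > 70% and a bounded downside.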

8

There is a subtle rule embedded in the mandate that matters more than the verdicts: close calls default to sidecar. If two options are within 10 points of each other on posterior, the cheapest move is the one that preserves optionality. A sidecar branch instead of a rebase. A reversible change instead of an irreversible one. A learning bet instead of a production bet. Under noise, the right move is almost always the one that buys information without spending commitment.
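The close-call rule is easy to state in code. A minimal sketch — the `preservesOptionality` flag (sidecar branch, reversible change, learning bet) is an illustrative field name, not from the mandate:

```javascript
// Close-call rule: if the top two options are within 10 points of each
// other on posterior, prefer the one that preserves optionality; otherwise
// take the highest posterior.
function pick(options) {
  const sorted = [...options].sort((a, b) => b.p50 - a.p50);
  const [top, second] = sorted;
  if (second && top.p50 - second.p50 < 0.10) {
    return sorted.find((o) => o.preservesOptionality) ?? top;
  }
  return top;
}
```

Under noise, the function deliberately trades a few points of posterior for reversibility — buying information without spending commitment.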

9

The mandate pairs with the System Control Mandate. That mandate keeps root-cause visibility before we act — by flagging breach states like worktree sprawl, disk pressure, multi-agent coordination risk. This mandate keeps decision quality once we are ready to act. One governs what we see. The other governs what we choose.

10

This is a north-star belief about how Jenn OS makes choices, not a workflow tweak. The alternative — following intuition alone — is how 9 minutes get lost to a wrong reboot. The alternative is how a $10,000 month of wrong framework choices happens. The alternative is how a session rebuilds the same broken thing six times because "it seemed easier." We forecast. We price. Then we move.

// The forecaster protocol, condensed and runnable

// Step 1: Frame
//   - Options (including do-nothing)
//   - Define what "goes well" means, concretely
//   - Time horizon, reversibility

// Step 2: Reference class
//   - "git stall under low disk + high browser-worker count"
//   - historical success rate of the class

// Step 3: Base rate -> prior ("kill workers, retry git" succeeds)
const prior = 0.80;

// Step 4: Inside-view evidence with likelihood ratios
const evidence = [
  { note: "disk at 0.2 GiB", lr: 5.0 },              // favors low-disk cause
  { note: "557 Playwright workers live", lr: 3.0 },
  { note: "session-start already flagged both", lr: 1.5 },
];

// Step 5: Posterior (informal Bayes)
//   posterior odds = prior odds * product(LRs)
const priorOdds = prior / (1 - prior);
const posteriorOdds = evidence.reduce((odds, e) => odds * e.lr, priorOdds);
const posterior = posteriorOdds / (1 + posteriorOdds); // ~0.99 raw
// Shade back toward the base rate for unmodeled failure modes and
// report a range, not a point: P10 / P50 / P90
const p50 = Math.min(posterior, 0.90);

// Step 6: Magnitudes
//   upside if works: 90 seconds saved, no reboot
//   downside if fails: same 10-minute reboot path, bounded

// Step 7: Expected value per option: EV = p * upside - (1 - p) * cost
const ev = (p, upside, cost) => p * upside - (1 - p) * cost;
const options = [
  { name: "Kill Playwright", ev: ev(p50, 1, 1) },   // 0.90 - 0.10 = +0.80
  { name: "Reboot Mac",      ev: ev(0.40, 1, 3) },  // 0.40 - 1.80 = -1.40 (reboot cost)
  { name: "Clone to /tmp",   ev: ev(0.00, 1, 1) },  // same volume  = -1.00
];

// Step 8: Recommend (highest EV)
const best = options.reduce((a, b) => (b.ev > a.ev ? b : a));
//   -> Kill Playwright. Verdict: FAVORED.

// Step 9: Update triggers
//   Move to UNFAVORED if: kill succeeds but git still stalls after 3 retries
//     (would indicate a real kernel issue, not disk pressure)
//   Move to DOG if: pkill reports zero live workers
//     (was not the cause; different hypothesis needed)