Validating AI Output
Claude produces confident, clean output. That's the problem. You can't tell by looking whether it's right. I built a validation methodology that catches errors before they ship.
I was classifying 400 job titles by AI automation exposure. Claude returned a beautiful spreadsheet — every title had a score, a rationale, a confidence level. It looked perfect. A colleague spot-checked 20 entries against Bureau of Labor Statistics task data and found 6 were wrong. Not slightly wrong — categorically wrong. A "Financial Analyst" was classified as low-exposure because Claude focused on the relationship-building part of the role and ignored that 70% of the actual tasks are data manipulation.
The problem isn't that AI makes mistakes. It's that AI mistakes look exactly like correct output. There's no red squiggly line. No compiler error. The confidence is the same whether it's right or wrong.
So I built a validation methodology, modeled on open-source AI testing pipelines that run 100+ tests to prove specific claims in policy models. Seven principles:
1. Proof chains, not spot checks. Every claim traces back to a source. "Financial Analyst is high-exposure" needs to link to the specific O*NET tasks that are automatable, with the automation score for each one. If the chain breaks, the claim is suspect.
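A proof chain can be as simple as a claim that carries its own evidence, plus a check that refuses any claim with a broken link. This is a minimal sketch; the field names (`sources`, `task_id`, `automation_score`) are illustrative, not from any real schema:

```python
def validate_proof_chain(claim):
    """Return a list of problems; an empty list means the chain holds."""
    problems = []
    if not claim.get("sources"):
        problems.append(f"{claim['text']}: no sources attached")
    for source in claim.get("sources", []):
        if "task_id" not in source:
            problems.append(f"{claim['text']}: source missing task_id")
        elif source.get("automation_score") is None:
            problems.append(f"{source['task_id']}: no automation score")
    return problems

claim = {
    "text": "Financial Analyst is high-exposure",
    "sources": [
        {"task_id": "43-3031.01", "automation_score": 0.82},
        {"task_id": "43-3031.02", "automation_score": None},  # broken link
    ],
}
print(validate_proof_chain(claim))
```

The point is that "suspect" is computed, not eyeballed: a claim with any problem in its list never reaches the deliverable.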
2. Forbidden strings. Define words that must never appear in output: "consider," "potentially," "it may be worth exploring." These are hedge words that mean the model isn't confident but is producing output anyway. Scan exhaustively — if any forbidden string appears, the pipeline stops.
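The forbidden-string scan is a few lines of code. A sketch, using the hedge words listed above; in practice the list grows as new hedges slip through:

```python
FORBIDDEN = ["consider", "potentially", "it may be worth",
             "arguably", "in some cases", "further research"]

def scan_forbidden(text):
    """Return every forbidden phrase found; any hit halts the pipeline."""
    lowered = text.lower()
    return [phrase for phrase in FORBIDDEN if phrase in lowered]

output = "This role is potentially automatable; further research is needed."
hits = scan_forbidden(output)
print(hits)  # non-empty → stop and rewrite without hedging
```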
3. Oracle validation. Compare AI output against an independent source of truth. For job classifications, that's BLS task data. For tariff codes, that's the Harmonized Tariff Schedule. The oracle doesn't need to be perfect — it needs to be independent.
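Oracle validation reduces to a match rate plus a named list of mismatches to investigate. A sketch with made-up labels; the real comparison would key on BLS occupation codes:

```python
def oracle_match_rate(ai_labels, oracle_labels):
    """Compare AI output to an independent oracle; return rate and mismatches."""
    mismatches = {k: (ai_labels.get(k), v)
                  for k, v in oracle_labels.items() if ai_labels.get(k) != v}
    rate = 1 - len(mismatches) / len(oracle_labels)
    return rate, mismatches

ai = {"Financial Analyst": "low", "Paralegal": "high", "Welder": "low"}
oracle = {"Financial Analyst": "high", "Paralegal": "high", "Welder": "low"}
rate, mismatches = oracle_match_rate(ai, oracle)
print(round(rate, 2), mismatches)
```

Note that the mismatch dict names each disagreement, so "investigate mismatches" starts from a concrete list rather than a percentage.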
4. Count reconciliation. If 412 suppliers go into a pipeline, 412 must come out. Not "about 400." Not 410 with 2 silently dropped. Exact counts at every stage, with named gaps when they don't match.
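Count reconciliation is an exact set comparison at every stage boundary, with the gap named rather than summarized. A sketch with hypothetical supplier names:

```python
def reconcile(stage_name, inputs, outputs):
    """Exact-count check: name every record that went missing or appeared."""
    dropped = sorted(set(inputs) - set(outputs))
    added = sorted(set(outputs) - set(inputs))
    if dropped or added:
        raise ValueError(
            f"{stage_name}: {len(inputs)} in, {len(outputs)} out; "
            f"dropped={dropped} added={added}"
        )

suppliers_in = ["Acme Corp", "Borealis Ltd", "Cask & Co"]
suppliers_out = ["Acme Corp", "Cask & Co"]  # one silently dropped
try:
    reconcile("classification", suppliers_in, suppliers_out)
except ValueError as e:
    print(e)  # the gap is named, not rounded away to "about 400"
```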
5. Priority-based test flagging. Not all errors are equal. Tag tests as @critical (blocks delivery), @important (should fix), or @suggestion (nice to have). This prevents the team from drowning in 200 test results when only 8 matter.
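The tag filter that keeps the team out of the noise is a one-liner over tagged results. Test names here are invented for illustration:

```python
CRITICAL, IMPORTANT, SUGGESTION = "@critical", "@important", "@suggestion"

results = [
    {"test": "analyst_exposure_score", "tag": CRITICAL, "passed": False},
    {"test": "rationale_has_context", "tag": IMPORTANT, "passed": False},
    {"test": "score_column_rounded", "tag": SUGGESTION, "passed": False},
]

blockers = [r["test"] for r in results
            if r["tag"] == CRITICAL and not r["passed"]]
print(f"{len(blockers)} of {len(results)} failures block delivery: {blockers}")
```

Everything else still gets fixed eventually, but only the blockers gate the ship decision.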
6. CI-fixer loops. When a test fails, the AI fixes it and reruns. Not a human. The human reviews the fix, not the failure. This is the difference between "tests as gatekeepers" and "tests as improvement engines."
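The loop itself is simple; the interesting part is what you hand it. A sketch where `run_tests` and `ai_fix` are stand-ins for a real test suite and a real model call:

```python
def ci_fixer_loop(run_tests, ai_fix, max_attempts=3):
    """Run tests; on failure, let the AI propose a fix and rerun.
    A human reviews the final fix, not each intermediate failure."""
    for attempt in range(max_attempts):
        failures = run_tests()
        if not failures:
            return attempt  # number of fix rounds it took
        ai_fix(failures)
    raise RuntimeError(f"still failing after {max_attempts} fix attempts")

state = {"bug": True}
def run_tests():
    return ["count_mismatch"] if state["bug"] else []
def ai_fix(failures):
    state["bug"] = False  # stand-in for the model rewriting the failing stage

print(ci_fixer_loop(run_tests, ai_fix))
```

The `max_attempts` cap matters: a fixer loop without one can oscillate forever on a test the model cannot satisfy.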
7. Seam testing. Every handoff between pipeline stages is a potential failure point. Test the seams — where data passes from extraction to classification, from classification to scoring, from scoring to output formatting. Most errors live at boundaries, not inside stages.
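A seam test asserts the contract at each handoff: every record leaving one stage must carry the fields the next stage depends on. A minimal sketch; the stage names and fields are illustrative:

```python
def check_seam(stage_from, stage_to, records, required_fields):
    """Fail loudly if a handoff passes records missing the fields
    the next stage depends on."""
    for i, rec in enumerate(records):
        missing = [f for f in required_fields if f not in rec]
        if missing:
            raise AssertionError(
                f"seam {stage_from}->{stage_to}, record {i}: missing {missing}"
            )

extracted = [{"title": "Welder", "tasks": ["cut", "join"]},
             {"title": "Paralegal"}]  # tasks lost during extraction
try:
    check_seam("extraction", "classification", extracted, ["title", "tasks"])
except AssertionError as e:
    print(e)  # the broken seam is named before classification runs on bad input
```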
The honest assessment: this is overkill for a blog post or a website feature. But for anything going into a client deliverable — data classifications, financial models, policy analysis — the cost of shipping a confident wrong answer is high enough that the validation overhead pays for itself on the first catch.
# Validation checklist — run before any deliverable ships:

1. PROOF CHAINS
   Every claim → trace to source → verify link isn't broken
   "Analyst is high-exposure" → O*NET tasks 43-3031 → automation scores [0.82, 0.91, 0.67, 0.88]

2. FORBIDDEN STRINGS (scan exhaustively)
   ❌ "consider"  ❌ "potentially"  ❌ "it may be worth"
   ❌ "arguably"  ❌ "in some cases"  ❌ "further research"
   If found → pipeline stops → rewrite without hedging

3. ORACLE VALIDATION
   AI output vs independent source (BLS, HTS, Census)
   Match rate > 90% → proceed
   Match rate 70-90% → investigate mismatches
   Match rate < 70% → methodology is wrong

4. COUNT RECONCILIATION
   inputs_in == outputs_out at every stage
   412 suppliers in → 412 accounted for
   Gaps listed BY NAME, not summarized

5. PRIORITY TAGS
   @critical — blocks delivery (wrong classification)
   @important — should fix (missing context)
   @suggestion — nice to have (formatting)