Technology & Intelligence · March 2026

GPTs are GPTs

Eloundou, Manning, Mishkin & Rock (Science, June 2024). A dissection of the rubric, the findings, and the chain from exposure scores to dollar impact.

Key numbers:

80%: workers with exposed tasks (E1 + E2 combined, GPT-4 estimate)
1,016: O*NET occupations scored (all O*NET-listed occupations)
65.6%: human-GPT agreement (E1 + E2 combined measure)
27.2%: O*NET vs company match (company-specific validation)
01

The Paper

“GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models” was published in Science in June 2024.[1] Two of the four authors work at OpenAI (Eloundou, Mishkin); Manning is at OpenResearch, and Rock is at the University of Pennsylvania’s Wharton School. The title is a double pun: GPTs (Generative Pre-trained Transformers) are GPTs (General-Purpose Technologies), following Bresnahan & Trajtenberg’s 1995 framework for technologies that reshape entire economies.

The paper is foundational. Every consulting firm doing AI exposure analysis cites it. Every PE firm commissioning “AI impact assessments” on portfolio companies imports its rubric. McKinsey, BCG, Deloitte — all derive their frameworks from the E0/E1/E2 classification Eloundou et al. introduced. The Goldman Sachs report projecting 300 million jobs affected[12] leans on the same methodology.

The question is: how much of this edifice is load-bearing? I built a company-specific pipeline that extends this paper’s approach to actual job descriptions and individual employees. The rubric is rigorous. The chain from exposure to dollar impact is not. Understanding where the science ends and the consulting begins matters for anyone using these numbers to make investment decisions.

Disclosure

I’ve implemented a company-specific version of this methodology for a PE portfolio company. This musing reflects both the paper’s published findings and the gap I observed between national exposure scores and what people actually do at their desks.

02

The Method: Rubric, Annotators, 50% Threshold

The method is straightforward. Take O*NET’s database of 1,016 occupations, each decomposed into component tasks (19.3 tasks per occupation on average).[11] For each task, ask: can an LLM, or LLM-powered software, reduce the time to complete this task by at least 50%, while maintaining equivalent quality?

Three categories emerge. E0: No — the LLM doesn’t meaningfully help. E1: Yes, the LLM alone reduces time by half. E2: Yes, but only through LLM-powered software (code generation tools, retrieval pipelines, image classifiers), not the raw model.
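The three categories are simple to encode. A minimal sketch in Python (the enum, record, and example labels are my illustration, not the paper's released data):

```python
from dataclasses import dataclass
from enum import Enum

class Exposure(Enum):
    E0 = "no exposure"            # LLM does not cut task time by 50%+
    E1 = "direct exposure"        # raw LLM alone cuts task time by 50%+
    E2 = "exposure via software"  # LLM-powered software cuts task time by 50%+

@dataclass
class TaskRating:
    occupation: str
    task: str
    label: Exposure

# Hypothetical example ratings
ratings = [
    TaskRating("Accountant", "Prepare tax returns", Exposure.E1),
    TaskRating("Roofer", "Install shingles", Exposure.E0),
    TaskRating("Developer", "Write boilerplate code", Exposure.E2),
]

# Share of rated tasks with any exposure (E1 or E2)
exposed = sum(r.label is not Exposure.E0 for r in ratings) / len(ratings)
print(f"{exposed:.0%} of sampled tasks exposed")
```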

The 50% threshold is arbitrary. The authors acknowledge this but don’t sensitivity-test it.[1] A 30% threshold would capture far more tasks; a 70% threshold far fewer. The paper reports no robustness checks across threshold values. This matters because the difference between “AI saves you 40% of time” and “AI saves you 60% of time” on a single task can flip an occupation from E0 to E1.
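A toy sensitivity sweep makes the point. The per-task time-savings estimates below are invented for illustration; the paper publishes no such continuous estimates:

```python
# Hypothetical per-task time-savings fractions for one occupation's tasks
# (share of task time an LLM could save; illustrative only).
savings = [0.10, 0.25, 0.40, 0.45, 0.55, 0.60, 0.75, 0.90]

# The exposed share of tasks moves sharply with the cutoff choice.
for threshold in (0.30, 0.50, 0.70):
    exposed = sum(s >= threshold for s in savings) / len(savings)
    print(f"threshold {threshold:.0%}: {exposed:.0%} of tasks exposed")
```

With these numbers the exposed share swings from 75% at a 30% cutoff to 25% at a 70% cutoff, which is exactly the robustness check the paper omits.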

Annotators were OpenAI contractors — not occupational experts, not the workers who perform these tasks. Five annotators rated each occupation. GPT-4 also rated every task independently, providing a machine comparison. The paper acknowledges the annotator limitation but proceeds with the ratings as ground truth. No inter-annotator reliability statistics (Krippendorff’s alpha, Cohen’s kappa) among the human raters are reported.[1]

Figure 1

The Exposure Rubric: E0 / E1 / E2

E0 (No Exposure): LLM does not reduce time by 50%+. Examples: roofing, truck driving, surgery.
E1 (Direct Exposure): LLM alone reduces time by 50%+. Examples: drafting emails, summarizing reports, translating text.
E2 (Exposure via Software): LLM-powered software reduces time by 50%+. Examples: code generation, data analysis pipelines, image classification.
Source: Eloundou et al. (2024), Table 1.

The rubric asks a single question: can an LLM (or LLM-powered software) reduce the time to complete this task by at least 50%, maintaining quality? E1 is the chatbot layer. E2 is the software layer — code tools, retrieval systems, classification pipelines. E2 is where most economic value sits.

03

The Findings: Who Is Exposed

The paper’s central finding inverts the automation narrative. Previous waves of technology — robots, factory automation, ATMs — displaced low-skill, low-wage, routine manual work. LLMs hit differently. Exposure peaks at Job Zone 4: workers with a bachelor’s degree or higher, $77,000 median wage, in roles requiring “considerable preparation.”[1]

The wage coefficient is positive: a one-standard-deviation increase in log wages corresponds to higher exposure (β = 0.017 at Zone 4). High-wage knowledge workers face more LLM exposure, not less. This is the opposite of what Autor, Levy & Murnane (2003) documented for computerization, where routine cognitive and manual tasks were automated first.[9]

Three exposure measures tell different stories:

1. E1 only (direct): ~15% of workers. Tasks where the raw LLM alone cuts time by 50%+. Writing, translation, summarization.

2. E1 + 0.5×E2 (weighted): ~33% of workers. The paper’s preferred composite. Weights software-mediated exposure at 50%.

3. E1 + E2 (full): ~52–80% of workers. All exposed tasks. 52% per human raters, 80% per GPT-4. The widely cited ‘80%’ headline number.

The E2 category is where the real economic action is. The gap between E1-only (15%) and full E1+E2 (52–80%) shows that the transformative impact comes from LLM-powered software — code copilots, automated analysis pipelines, retrieval-augmented generation — not from people typing prompts into ChatGPT.[1] This distinction is underappreciated in the consulting versions of this paper.
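The three measures are simple functions of the per-task labels. A sketch with invented labels for one hypothetical occupation:

```python
# Per-task exposure labels for a hypothetical occupation (illustrative only).
labels = ["E1", "E2", "E2", "E0", "E1", "E2", "E0", "E0", "E2", "E0"]
n = len(labels)

e1 = labels.count("E1") / n  # share of tasks directly exposed
e2 = labels.count("E2") / n  # share exposed via LLM-powered software

print(f"E1 only:     {e1:.0%}")             # direct exposure
print(f"E1 + 0.5*E2: {e1 + 0.5 * e2:.0%}")  # the paper's preferred composite
print(f"E1 + E2:     {e1 + e2:.0%}")        # full exposure (headline measure)
```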

Figure 2

Exposure by Job Zone (O*NET classification)

Share of tasks with ≥50% time reduction (E1 + E2), by Job Zone: Zone 1 (little preparation, $30K median wage): 8%. Zone 2 (some preparation, $38K): 12%. Zone 3 (medium preparation, $47K): 28%. Zone 4 (considerable preparation, $77K): 48%. Zone 5 (extensive preparation, $106K): 35%. Peak exposure at Zone 4 (bachelor’s+, $77K median), not low-wage work.

Exposure peaks at Job Zone 4 (bachelor’s degree, considerable preparation, $77K median wage) and declines at Zone 5. This inverts the standard automation narrative: LLMs disproportionately affect knowledge workers, not manual laborers. Low-wage physical work (Zone 1) is barely exposed. The β coefficient on wages is positive — higher-paid workers face more exposure, not less.

Figure 3

Three Exposure Measures: Human vs GPT-4 Annotations (%)

E1 only (direct LLM exposure): 15% human vs 14.7% GPT-4. E1 + 0.5×E2 (weighted composite): 33% human vs 47.4% GPT-4. E1 + E2 (full exposure): 52% human vs 80.2% GPT-4.

The E2 multiplier is the paper’s most important finding. When you include software-mediated exposure (E1 + E2), GPT-4 estimates 80% of workers face exposed tasks. Humans are more conservative (52%). The gap between E1-only and E1+E2 shows that the real economic impact comes from LLM-powered software, not chatbots alone.

04

The Validation Gap

The paper reports two validation metrics: percent agreement between human and GPT-4 annotators, and Pearson correlation. The tension between them is revealing.

For E1 (direct exposure), agreement is high: 80.8%. But Pearson r is just 0.223. This means humans and GPT-4 agree on the easy cases — truck driving is E0, email drafting is E1 — but disagree randomly on the ambiguous tasks that actually matter for economic impact. High agreement with low correlation is the signature of a rubric that sorts obvious cases well but provides no consistent signal on hard ones.[1]

The combined E1+E2 measure improves: 65.6% agreement, 0.652 correlation. Better, but still noisy. And the paper reports no inter-annotator reliability among humans. We know how well humans agree with GPT-4, but not how well they agree with each other. If five annotators frequently disagree, the “human label” is itself uncertain.[1]

The authors are transparent about this fragility: “We view these labels as rough attempts at measuring exposure rather than definitive judgments.”[1] That honesty is admirable. But it rarely survives the translation into consulting decks. By the time a PE firm sees “80% of tasks are AI-exposed,” the uncertainty has been stripped clean.

Figure 4

Human-GPT Agreement Rates vs Correlation

E1 (direct): 80.8% agreement, Pearson r 0.223. E1 + 0.5×E2: 72.1% agreement, r 0.591. E1 + E2 (full): 65.6% agreement, r 0.652. High agreement with low Pearson r: raters agree on the easy cases and disagree near-randomly on the hard ones.

E1 direct exposure shows 80.8% agreement between human and GPT-4 annotators but only 0.223 Pearson correlation. This means they agree on the obvious cases (truck driving is E0, email drafting is E1) but diverge unpredictably on ambiguous tasks. The combined E1+E2 measure has better correlation (0.652) but lower raw agreement (65.6%). No inter-annotator reliability among humans is reported.

05

From Exposure to Dollars

This is the critical section. The paper measures exposure. Consulting firms sell dollar impact. Between the two lies a causal chain with four links, each weaker than the last.

Exposure → Adoption

Just because a task can be done faster with an LLM doesn’t mean workers will use one. Organizational inertia, regulatory constraints, trust deficits, and IT procurement all intervene.

Evidence: Moderate. Bick et al. (2025) find 41% of US workers use AI, but only 5.7% of work hours involve it.

Adoption → Productivity

Using the tool doesn’t guarantee productivity gains. Workers may use AI for low-value tasks, produce output that requires heavy editing, or substitute AI time for thinking time.

Evidence: Strong at micro level. Brynjolfsson et al. (2023) find 14% productivity gain in customer service. Noy & Zhang (2023) find 40% for writing tasks. But these are controlled settings with selected populations.

Productivity → Cost Savings

Productivity gains don’t automatically reduce costs. Svanberg et al. (2024) find only 23% of AI-exposed tasks are cost-effective to automate when accounting for wages, error rates, and integration costs.

Evidence: Weak. The Svanberg finding is devastating for back-of-envelope calculations.

Cost Savings → Dollar Impact

Firm-level savings don’t aggregate linearly to GDP. Acemoglu (2024) estimates 0.06% annual TFP growth from AI over the next decade — far below the implied 2–3% in Goldman Sachs and McKinsey projections.

Evidence: Very weak. The macro evidence is barely distinguishable from noise.

Each link loses roughly an order of magnitude.[2][3] Start with 80% of workers having at least one exposed task.[1] Roughly 40% actually adopt AI tools. Of those, perhaps 15–30% see measurable productivity gains in controlled settings.[5][6] Of those gains, maybe 23% translate to cost savings at the firm level.[3] The aggregate GDP effect? Acemoglu estimates 0.06% per year.[2]

The Svanberg finding deserves emphasis. “Beyond AI Exposure”[3] showed that most AI-exposed tasks fail a basic cost-effectiveness test. When you account for the worker’s wage (if AI saves time on a $15/hr task, the savings are small), the error rate of the AI system, and the integration cost of deploying it, the majority of “exposed” tasks don’t pencil out. Only 23% cross the threshold.[3] This single paper deflates most PE-style AI impact estimates by 4×.
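The attenuation arithmetic is worth doing explicitly. The adoption and gains shares below are rough point estimates drawn from the sources cited above; the end-to-end product is illustrative, not a forecast:

```python
# Back-of-envelope attenuation along the chain. The 80% and 23% figures are
# from Eloundou et al. and Svanberg et al.; the other shares are assumptions.
exposed = 0.80  # workers with >=1 exposed task (GPT-4 estimate, E1+E2)
adopt   = 0.40  # share who actually use AI tools (approx., Bick et al.)
gains   = 0.25  # share of adopters with measurable productivity gains (assumed)
pencils = 0.23  # share of exposed tasks that are cost-effective (Svanberg)

end_to_end = exposed * adopt * gains * pencils
print(f"end-to-end: {end_to_end:.1%} of workers")  # roughly 1.8%
```

Two multiplications in, the 80% headline is already down to 32%; by the end it is under 2%, two orders of magnitude below where it started.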

Figure 5

From Exposure to Dollars: The Causal Chain

Exposure (task-level LLM applicability; strong evidence; Eloundou 2024: 80% of workers exposed, E1+E2) → Adoption (workers actually use the tool; moderate) → Productivity (output per hour increases; strong at the micro level; Brynjolfsson 2023: 14% gain in customer service) → Cost Savings (firm-level costs actually fall; weak; Svanberg 2024: only 23% cost-effective) → Dollar Impact (aggregate GDP effect; very weak; Acemoglu 2024: 0.06%/yr TFP over the next decade). Each link loses roughly an order of magnitude: 80% exposed → ~40% adopt → ~15% see gains → ~5% cost savings → ~0.5% GDP.

The paper establishes the first link (exposure) rigorously. But the chain from exposure to dollar impact requires four additional assumptions, each reducing the signal by roughly an order of magnitude. Svanberg et al. (2024) found only 23% of AI-exposed tasks are cost-effective to automate when you account for wages, error rates, and integration costs. Acemoglu (2024) estimates 0.06% annual TFP growth over the next decade — far below the 2–3% implied by consulting estimates.

06

What a Company-Specific Approach Changes

When I built a company-specific pipeline for a mid-size media company (~80 employees, ~2,370 tasks), three things changed immediately.[8]

First, the task source. O*NET describes what “Marketing Managers” do nationally. A specific company’s marketing manager might spend 60% of their time on tasks O*NET doesn’t list — internal presentations, cross-functional coordination, ad-hoc data pulls. When I compared O*NET task descriptions to actual job descriptions and employee interviews, only 27.2% of O*NET tasks matched what people actually did. The national taxonomy describes the occupation. It doesn’t describe the job.
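A minimal sketch of the matching step, using plain string similarity (a production pipeline would more likely use embeddings; every task string and the 0.6 cutoff below are hypothetical):

```python
from difflib import SequenceMatcher

# Hypothetical O*NET task statements vs. what an employee actually does.
onet_tasks = [
    "Develop pricing strategies, balancing firm objectives and customer satisfaction",
    "Formulate, direct, or coordinate marketing activities or policies",
]
actual_tasks = [
    "Build weekly internal slide decks for the leadership sync",
    "Coordinate marketing activities across product and sales",
    "Pull ad-hoc campaign data for finance",
]

def best_match(task: str, candidates: list[str]) -> tuple[str, float]:
    """Return the most similar candidate and its similarity ratio (0-1)."""
    scored = [(c, SequenceMatcher(None, task.lower(), c.lower()).ratio())
              for c in candidates]
    return max(scored, key=lambda pair: pair[1])

THRESHOLD = 0.6  # below this, the real task has no O*NET counterpart
for t in actual_tasks:
    match, score = best_match(t, onet_tasks)
    status = "MATCH   " if score >= THRESHOLD else "NO MATCH"
    print(f"{score:.2f}  {status}  {t}")
```

The unmatched residue is the point: tasks like internal slide decks and ad-hoc data pulls dominate real jobs but never appear in the national taxonomy.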

Second, the granularity. The paper assigns one exposure score per occupation. A company-specific approach scores individual tasks for individual employees — ~30 tasks per person. Two people with the same title can have completely different exposure profiles because they do different work. The paper’s occupation-level approach hides this variance entirely.

Figure 7

Same Data, Different Story: Task vs Occupation Exposure

Exposure score (share of time reducible by 50%+; scores above the 50% threshold in each list are the actionable ones):

Accountant (avg 68%): Prepare tax returns 85% · Reconcile financial discrepancies 72% · Advise on financial decisions 55% · Audit organizational operations 38% · Maintain client relationships 15%.
Marketing Manager (avg 58%): Draft campaign copy 92% · Analyze market research data 78% · Develop pricing strategies 55% · Negotiate media contracts 22% · Coordinate cross-functional teams 10%.
Software Developer (avg 72%): Write boilerplate code 95% · Debug error logs 80% · Conduct code reviews 60% · Design system architecture 45% · Resolve production incidents 30%.
HR Specialist (avg 48%): Screen resumes 88% · Draft job descriptions 85% · Administer benefits enrollment 42% · Conduct interviews 18% · Mediate workplace disputes 12%.

The same data tells different stories depending on the unit of analysis. Task-level exposure reveals which specific activities AI can perform: an HR Specialist who mostly screens resumes (88% exposed) looks very different from one who mostly mediates disputes (12%). Occupation-level aggregation collapses that variance into a single number that determines policy responses. A 48% occupation score hides the fact that half the tasks are barely exposed and half are almost fully automatable.

Third, the rating system. The 3-tier E0/E1/E2 rubric is too coarse for operational decisions. A 5-tier scale (0 = no automation potential through 4 = full automation with monitoring) with confidence scores lets you distinguish between “this task could be augmented but requires heavy oversight” and “this task can be fully automated tomorrow.” The paper’s rubric answers “is this exposed?” A company-specific rubric answers “what do I do about it?”
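A sketch of what such a rating record might look like (the tier names and the actionability rule are my illustration of the idea, not a published standard):

```python
from dataclasses import dataclass

# Hypothetical 5-tier automation scale, paraphrasing the text above.
TIERS = {
    0: "no automation potential",
    1: "light augmentation",
    2: "augmentation with heavy oversight",
    3: "automation with human review",
    4: "full automation with monitoring",
}

@dataclass
class TaskAssessment:
    employee: str
    task: str
    tier: int          # 0-4 on the scale above
    confidence: float  # 0.0-1.0, annotator's confidence in the tier

    def actionable(self) -> bool:
        # Only high-confidence tier-3/4 ratings justify an automation project.
        return self.tier >= 3 and self.confidence >= 0.7

a = TaskAssessment("analyst_12", "Reconcile financial discrepancies",
                   tier=3, confidence=0.8)
print(a.actionable())  # True
```

The confidence field is what the 3-tier rubric cannot express: two tasks can share a tier while one rating is near-certain and the other is a coin flip.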

Every company-specific pipeline I’ve seen produces different exposure distributions than the national averages predict. That’s not a flaw in the paper. It’s the paper working as designed — it measured national potential, not firm-level reality. The flaw is in treating national potential as a company-specific prediction.

Figure 6

National Taxonomy vs Company-Specific Pipeline

Dimension | Eloundou et al. (National) | Company-Specific
Task Source | O*NET (1,016 occupations) | Actual job descriptions + interviews
Granularity | Occupation-level averages | Individual employee tasks (~30/person)
Scale | 5 annotators per task | 2,370 tasks across 79 employees
Rating System | 3-tier (E0/E1/E2) | 5-tier (0–4 with confidence)
Validation | 65.6% human-GPT agreement | 27.2% O*NET match rate
Output | Occupation exposure scores | Individual task-level automation plans

When I ran a company-specific pipeline on a mid-size media company (~80 employees), only 27.2% of O*NET task descriptions matched what people actually did. National taxonomies describe the occupation. Company-specific analysis describes the job.

07

What Is Defensible

After dissecting the paper, what can you actually claim — and what can’t you?

You CAN claim

Knowledge work is disproportionately exposed to LLMs. The Job Zone 4 finding is robust and inverts the automation narrative.

E2 (software-mediated) exposure is larger and more economically significant than E1 (direct chatbot use).

Micro-level productivity gains are real: 14–40% in controlled studies of specific tasks.

Higher-wage workers face more exposure, not less. The wage-exposure coefficient is positive.

The rubric itself is a useful classification framework for screening tasks, even if the scores are noisy.

You CANNOT claim

That 80% exposure implies 80% automation or 80% job displacement. Exposure ≠ automation ≠ displacement.

Specific dollar savings from AI adoption based on exposure scores. The causal chain is too weak.

That national occupation-level scores predict exposure at a specific company. The 27.2% match rate says they don’t.

Headcount reduction targets derived from exposure analysis. The paper explicitly does not claim this.

That the 50% threshold is the right threshold. It was never sensitivity-tested.

The paper is excellent science. It identified a real phenomenon (LLM exposure is concentrated in knowledge work), introduced a workable classification framework, and was honest about its limitations. It becomes bad consulting when the uncertainty is stripped away, the 50% threshold is treated as natural law, occupation-level averages are applied to specific companies, and exposure scores are multiplied by wage bills to produce dollar figures.

If you’re a PE firm commissioning an AI impact assessment on a portfolio company, start with the Eloundou rubric — it’s the best task-level framework available. But run it on actual job descriptions, not O*NET averages. Use a finer-grained scale. And do not, under any circumstances, multiply an exposure percentage by a wage bill and call it “savings.” The Svanberg finding alone should make that math embarrassing.[3]
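To see why that math is embarrassing, put numbers on it. The payroll figure is hypothetical; the 23% haircut is Svanberg's finding, and the ~40% adoption haircut approximates Bick et al.:

```python
# Naive "exposure x wage bill" savings vs. the same number after adoption
# and cost-effectiveness haircuts. Payroll is a hypothetical example.
wage_bill = 10_000_000  # annual payroll of a hypothetical portfolio company
exposure  = 0.80        # headline E1+E2 exposure share (GPT-4 estimate)

naive_savings = wage_bill * exposure
adjusted      = naive_savings * 0.40 * 0.23  # adoption x cost-effectiveness

print(f"naive:    ${naive_savings:,.0f}")
print(f"adjusted: ${adjusted:,.0f}")
```

The naive figure is $8M; two defensible haircuts later it is under $750K, before even asking whether productivity gains materialize or survive integration costs.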

08

Sources

[1] Eloundou, Manning, Mishkin & Rock. “GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models.” Science, 2024. The foundational paper. Introduces the E0/E1/E2 rubric, scores 1,016 O*NET occupations with human and GPT-4 annotators, establishes that knowledge workers face disproportionate exposure.
[2] Acemoglu. “The Simple Macroeconomics of AI.” NBER Working Paper 32487, 2024. Estimates 0.06% annual TFP growth from AI over the next decade. The most rigorous macro-level counter to optimistic GDP projections from Goldman Sachs and McKinsey.
[3] Svanberg, Li, Fleming & Goehring. “Beyond AI Exposure: Which Tasks are Cost-Effective to Automate with Computer Vision?” MIT, 2024. Only 23% of AI-exposed tasks pass a cost-effectiveness test when accounting for wages, error rates, and integration costs. The paper that breaks the exposure-to-savings chain.
[4] Hartley, Lam & Llull. “Labor Market Effects of Generative AI: Evidence from OECD Countries.” Forthcoming, 2026. Cross-country evidence on GenAI labor market effects. Updates the Eloundou framework with international data from OECD economies.
[5] Brynjolfsson, Li & Raymond. “Generative AI at Work.” NBER Working Paper 31161, 2023. 14% productivity gain in customer service. The gold-standard micro-level study of AI productivity effects. Gains concentrated among lower-skill workers.
[6] Noy & Zhang. “Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence.” Science, 2023. 40% time reduction on professional writing tasks. MIT randomized experiment showing large effects on mid-skill workers specifically.
[7] ILO-NASK. “A Refined Global Index of Occupational Exposure to Generative AI.” 2025. International Labour Organization refinement of exposure measurement. Extends the Eloundou framework globally with adjustments for developing economies.
[8] Labaschin, Halpern & Taska. “Extending GPTs to Firms: A Framework for Company-Specific AI Exposure Analysis.” 2025. Formalizes the company-specific approach: job descriptions over O*NET, task-level granularity, multi-tier rating scales. The methodological bridge from national taxonomy to firm-level reality.
[9] Autor, Mindell & Reynolds. The Work of the Future: Building Better Jobs in an Age of Intelligent Machines. MIT Press, 2022. Historical context for technology and labor market disruption. Argues for complementarity over substitution and shows how previous automation waves displaced routine work.
[10] Webb. “The Impact of Artificial Intelligence on the Labor Market.” Stanford, 2020. Pre-LLM AI exposure analysis using patent text to measure which occupations face AI disruption. Showed AI targets a different skill distribution than robots or software.
[11] O*NET 27.2, US Department of Labor. Occupational Information Network. The national occupational taxonomy underlying the paper. 1,016 occupations, ~19.3 tasks per occupation. Source of all task descriptions in Eloundou et al.
[12] Briggs & Kodnani. “The Potentially Large Effects of Artificial Intelligence on Economic Growth.” Goldman Sachs, 2023. 300 million jobs exposed globally, 7% GDP boost projected. The report that launched a thousand consulting decks. Relies heavily on Eloundou methodology without the uncertainty caveats.

Jenn Musings · jennumanzor.com