The Paper
“GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models” was published in Science in June 2024.[1] Three of the four authors work at OpenAI (Eloundou, Mishkin, Manning); the fourth (Rock) is at the University of Pennsylvania’s Wharton School. The title is a double pun: GPTs (Generative Pre-trained Transformers) are GPTs (General-Purpose Technologies), following Bresnahan & Trajtenberg’s 1995 framework for technologies that reshape entire economies.
The paper is foundational. Every consulting firm doing AI exposure analysis cites it. Every PE firm commissioning “AI impact assessments” on portfolio companies imports its rubric. McKinsey, BCG, Deloitte — all derive their frameworks from the E0/E1/E2 classification Eloundou et al. introduced. The Goldman Sachs report projecting 300 million jobs affected[12] leans on the same methodology.
The question is: how much of this edifice is load-bearing? I built a company-specific pipeline that extends this paper’s approach to actual job descriptions and individual employees. The rubric is rigorous. The chain from exposure to dollar impact is not. Understanding where the science ends and the consulting begins matters for anyone using these numbers to make investment decisions.
Disclosure
I’ve implemented a company-specific version of this methodology for a PE portfolio company. This musing reflects both the paper’s published findings and the gap I observed between national exposure scores and what people actually do at their desks.
The Method: Rubric, Annotators, 50% Threshold
The method is straightforward. Take O*NET’s database of 1,016 occupations, each decomposed into component tasks (19.3 tasks per occupation on average).[11] For each task, ask: can an LLM, or LLM-powered software, reduce the time to complete this task by at least 50%, while maintaining equivalent quality?
Three categories emerge. E0: No — the LLM doesn’t meaningfully help. E1: Yes, the LLM alone reduces time by half. E2: Yes, but only through LLM-powered software (code generation tools, retrieval pipelines, image classifiers), not the raw model.
The 50% threshold is arbitrary. The authors acknowledge this but don’t sensitivity-test it.[1] A 30% threshold would capture far more tasks; a 70% threshold far fewer. The paper reports no robustness checks across threshold values. This matters because the difference between “AI saves you 40% of time” and “AI saves you 60% of time” on a single task can flip an occupation from E0 to E1.
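To see how much rides on the cutoff, hold per-task time-savings estimates fixed and vary the threshold. The savings fractions below are invented for illustration (the paper's annotators assigned categorical labels, not percentages):

```python
# Sketch: how the exposure picture flips with the threshold choice.
# The time-savings estimates are hypothetical, purely for illustration.
marketing_manager_tasks = {
    "Draft campaign briefs": 0.60,
    "Summarize performance reports": 0.55,
    "Coordinate with agencies": 0.15,
    "Approve creative assets": 0.40,
    "Plan quarterly budgets": 0.45,
}

for threshold in (0.30, 0.50, 0.70):
    exposed = [t for t, saving in marketing_manager_tasks.items() if saving >= threshold]
    share = len(exposed) / len(marketing_manager_tasks)
    print(f"threshold {threshold:.0%}: {share:.0%} of tasks clear it -> {exposed}")

# threshold 30%: 80% of tasks clear it
# threshold 50%: 40% of tasks clear it
# threshold 70%: 0% of tasks clear it
```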
Annotators were OpenAI contractors — not occupational experts, not the workers who perform these tasks. Five annotators rated each occupation. GPT-4 also rated every task independently, providing a machine comparison. The paper acknowledges the annotator limitation but proceeds with the ratings as ground truth. No inter-annotator reliability statistics (Krippendorff’s alpha, Cohen’s kappa) among the human raters are reported.[1]
Figure 1
The Exposure Rubric: E0 / E1 / E2
The rubric asks a single question: can an LLM (or LLM-powered software) reduce the time to complete this task by at least 50%, maintaining quality? E1 is the chatbot layer. E2 is the software layer — code tools, retrieval systems, classification pipelines. E2 is where most economic value sits.
The Findings: Who Is Exposed
The paper’s central finding inverts the automation narrative. Previous waves of technology — robots, factory automation, ATMs — displaced low-skill, low-wage, routine manual work. LLMs hit differently. Exposure peaks at Job Zone 4: workers with a bachelor’s degree or higher, $77,000 median wage, in roles requiring “considerable preparation.”[1]
The wage coefficient is positive: a one-standard-deviation increase in log wages corresponds to higher exposure (β = 0.017 at Zone 4). High-wage knowledge workers face more LLM exposure, not less. This is the opposite of what Autor, Levy & Murnane (2003) documented for computerization, where routine cognitive and manual tasks were automated first.[9]
Three exposure measures tell different stories:
E1 only (direct): ~15% of workers. Tasks where the raw LLM alone cuts time by 50%+. Writing, translation, summarization.
E1 + 0.5×E2 (weighted): ~33% of workers. The paper’s preferred composite. Weights software-mediated exposure at 50%.
E1 + E2 (full): ~52–80% of workers. All exposed tasks. 52% per human raters, 80% per GPT-4. The widely cited ‘80%’ headline number.
The E2 category is where the real economic action is. The gap between E1-only (15%) and full E1+E2 (52–80%) shows that the transformative impact comes from LLM-powered software — code copilots, automated analysis pipelines, retrieval-augmented generation — not from people typing prompts into ChatGPT.[1] This distinction is underappreciated in the consulting versions of this paper.
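The three composites are simple functions of the task-level labels. A minimal sketch of the aggregation, assuming you already have E0/E1/E2 labels per task; the weighting follows the composites described above, and the example labels are invented:

```python
# Sketch: the paper's three exposure composites, computed from task-level labels.
# E0 = not exposed, E1 = LLM alone, E2 = LLM-powered software. Labels below are invented.

def exposure_scores(task_labels):
    n = len(task_labels)
    e1 = sum(lbl == "E1" for lbl in task_labels)
    e2 = sum(lbl == "E2" for lbl in task_labels)
    return {
        "E1 only (direct)":       e1 / n,
        "E1 + 0.5*E2 (weighted)": (e1 + 0.5 * e2) / n,
        "E1 + E2 (full)":         (e1 + e2) / n,
    }

# A hypothetical paralegal: drafting/summarization (E1), some retrieval-pipeline
# work (E2), some courtroom and client-facing tasks (E0).
labels = ["E1", "E1", "E2", "E2", "E2", "E0", "E0", "E0"]
for measure, score in exposure_scores(labels).items():
    print(f"{measure}: {score:.1%}")
# E1 only: 25.0%, weighted: 43.8%, full: 62.5% -- same tasks, three different headlines.
```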
Figure 2
Exposure by Job Zone (O*NET classification)
Exposure peaks at Job Zone 4 (bachelor’s degree, considerable preparation, $77K median wage) and declines at Zone 5. This inverts the standard automation narrative: LLMs disproportionately affect knowledge workers, not manual laborers. Low-wage physical work (Zone 1) is barely exposed. The β coefficient on wages is positive — higher-paid workers face more exposure, not less.
Figure 3
Three Exposure Measures: Human vs GPT-4 Annotations (%)
The E2 multiplier is the paper’s most important finding. When you include software-mediated exposure (E1 + E2), GPT-4 estimates 80% of workers face exposed tasks. Humans are more conservative (52%). The gap between E1-only and E1+E2 shows that the real economic impact comes from LLM-powered software, not chatbots alone.
The Validation Gap
The paper reports two validation metrics: percent agreement between human and GPT-4 annotators, and Pearson correlation. The tension between them is revealing.
For E1 (direct exposure), agreement is high: 80.8%. But Pearson r is just 0.223. This means humans and GPT-4 agree on the easy cases — truck driving is E0, email drafting is E1 — but disagree randomly on the ambiguous tasks that actually matter for economic impact. High agreement with low correlation is the signature of a rubric that sorts obvious cases well but provides no consistent signal on hard ones.[1]
The combined E1+E2 measure trades raw agreement for better correlation: 65.6% agreement, 0.652 Pearson r. A stronger signal, but still noisy. And the paper reports no inter-annotator reliability among humans. We know how well humans agree with GPT-4, but not how well they agree with each other. If five annotators frequently disagree, the “human label” is itself uncertain.[1]
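The agreement/correlation split is easy to reproduce. A fabricated 2×2 table of binary labels, chosen only to show the mechanism (not the paper's actual data), gives exactly this pattern:

```python
# Sketch: why 80.8% agreement can coexist with r ~ 0.22.
# Fabricated 2x2 table over 1,000 binary task labels (1 = exposed).
import numpy as np
from scipy.stats import pearsonr

#                  gpt4=0  gpt4=1
counts = {(0, 0): 758, (0, 1): 96,   # human says not exposed
          (1, 0): 96,  (1, 1): 50}   # human says exposed

human = np.concatenate([np.full(n, h) for (h, g), n in counts.items()])
gpt4  = np.concatenate([np.full(n, g) for (h, g), n in counts.items()])

agreement = (human == gpt4).mean()
r, _ = pearsonr(human, gpt4)
print(f"agreement: {agreement:.1%}, Pearson r: {r:.2f}")  # agreement: 80.8%, Pearson r: 0.23

# Both raters call the bulk of tasks "not exposed" (758/1000), which inflates agreement;
# on the tasks that would drive the exposure estimate they split nearly at random,
# which is what the correlation picks up.
```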
The authors are transparent about this fragility: “We view these labels as rough attempts at measuring exposure rather than definitive judgments.”[1] That honesty is admirable. But it rarely survives the translation into consulting decks. By the time a PE firm sees “80% of tasks are AI-exposed,” the uncertainty has been stripped clean.
Figure 4
Human-GPT Agreement Rates vs Correlation
E1 direct exposure shows 80.8% agreement between human and GPT-4 annotators but only 0.223 Pearson correlation. This means they agree on the obvious cases (truck driving is E0, email drafting is E1) but diverge unpredictably on ambiguous tasks. The combined E1+E2 measure has better correlation (0.652) but lower raw agreement (65.6%). No inter-annotator reliability among humans is reported.
From Exposure to Dollars
This is the critical section. The paper measures exposure. Consulting firms sell dollar impact. Between the two lies a causal chain with four links, each weaker than the last.
Exposure → Adoption
Just because a task can be done faster with an LLM doesn’t mean workers will use one. Organizational inertia, regulatory constraints, trust deficits, and IT procurement all intervene.
Evidence: Moderate. Bick et al. (2025) find 41% of US workers use AI, but only 5.7% of work hours involve it.
Adoption → Productivity
Using the tool doesn’t guarantee productivity gains. Workers may use AI for low-value tasks, produce output that requires heavy editing, or substitute AI time for thinking time.
Evidence: Strong at the micro level. Brynjolfsson et al. (2023) find a 14% productivity gain in customer service. Noy & Zhang (2023) find a 40% reduction in time spent on writing tasks. But these are controlled settings with selected populations.
Productivity → Cost Savings
Productivity gains don’t automatically reduce costs. Svanberg et al. (2024) find only 23% of AI-exposed tasks are cost-effective to automate when accounting for wages, error rates, and integration costs.
Evidence: Weak. The Svanberg finding is devastating for back-of-envelope calculations.
Cost Savings → Dollar Impact
Firm-level savings don’t aggregate linearly to GDP. Acemoglu (2024) estimates 0.06% annual TFP growth from AI over the next decade — far below the implied 2–3% in Goldman Sachs and McKinsey projections.
Evidence: Very weak. The macro evidence is barely distinguishable from noise.
Each link loses roughly an order of magnitude.[2][3] Start with 80% of workers having at least one exposed task.[1] Roughly 40% actually adopt AI tools. Of those, perhaps 15–30% see measurable productivity gains in controlled settings.[5][6] Of those gains, maybe 23% translate to cost savings at the firm level.[3] The aggregate GDP effect? Acemoglu estimates 0.06% per year.[2]
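Making that arithmetic explicit, with the link values cited above plus an assumed midpoint for the productivity link (illustrative of the attenuation, not a forecast):

```python
# Sketch: how headline exposure attenuates through the causal chain.
# Link values come from the studies cited above, except the productivity link,
# which is an assumed midpoint of the 15-30% range -- illustrative only.
links = [
    ("Exposure (>=1 exposed task, E1+E2, GPT-4 rating)", 0.80),  # Eloundou et al.
    ("Adoption (workers actually using AI tools)",       0.40),  # Bick et al., rounded
    ("Productivity (measurable gains, assumed midpoint)", 0.22),
    ("Cost-effective to automate (Svanberg et al.)",      0.23),
]

running = 1.0
for name, factor in links:
    running *= factor
    print(f"{name:<52} x{factor:.2f} -> {running:.1%}")

# The product lands around 1.6% of workers seeing cost-effective automation of at
# least one task -- a long way from the 80% headline. (Acemoglu's 0.06%/yr is a
# different quantity: aggregate TFP growth, not a worker share.)
```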
The Svanberg finding deserves emphasis. “Beyond AI Exposure”[3] showed that most AI-exposed tasks (the study examined computer-vision tasks specifically) fail a basic cost-effectiveness test. When you account for the worker’s wage (if AI saves time on a $15/hr task, the savings are small), the error rate of the AI system, and the integration cost of deploying it, the majority of “exposed” tasks don’t pencil out. Only 23% cross the threshold.[3] This single paper deflates most PE-style AI impact estimates by 4×.
Figure 5
From Exposure to Dollars: The Causal Chain
The paper establishes the first link (exposure) rigorously. But the chain from exposure to dollar impact requires four additional assumptions, each reducing the signal by roughly an order of magnitude. Svanberg et al. (2024) found only 23% of AI-exposed tasks are cost-effective to automate when you account for wages, error rates, and integration costs. Acemoglu (2024) estimates 0.06% annual TFP growth over the next decade — far below the 2–3% implied by consulting estimates.
What a Company-Specific Approach Changes
When I built a company-specific pipeline for a mid-size media company (~80 employees, ~2,370 tasks), three things changed immediately.[8]
First, the task source. O*NET describes what “Marketing Managers” do nationally. A specific company’s marketing manager might spend 60% of their time on tasks O*NET doesn’t list — internal presentations, cross-functional coordination, ad-hoc data pulls. When I compared O*NET task descriptions to actual job descriptions and employee interviews, only 27.2% of O*NET tasks matched what people actually did. The national taxonomy describes the occupation. It doesn’t describe the job.
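The matching exercise itself is mundane: for each O*NET task statement, check whether anything in the real job description or interview notes is semantically close. A minimal sketch using TF-IDF cosine similarity; the strings, the 0.35 cutoff, and the choice of TF-IDF over embeddings are all illustrative, not the pipeline I actually ran:

```python
# Sketch: estimating how many O*NET task statements have a counterpart in actual work.
# TF-IDF + cosine similarity stands in for whatever matcher you prefer
# (embeddings work better); strings and cutoff are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

onet_tasks = [
    "Formulate, direct and coordinate marketing activities and policies",
    "Negotiate contracts with vendors and distributors",
    "Use sales forecasting and strategic planning to ensure profitability",
]
actual_tasks = [
    "Prepare weekly performance deck for the executive team",
    "Coordinate campaign launches with product and sales teams",
    "Pull ad-hoc numbers from the analytics dashboard on request",
]

vec = TfidfVectorizer().fit(onet_tasks + actual_tasks)
sims = cosine_similarity(vec.transform(onet_tasks), vec.transform(actual_tasks))

CUTOFF = 0.35  # illustrative threshold for "this O*NET task matches real work"
matched = (sims.max(axis=1) >= CUTOFF).sum()
print(f"{matched}/{len(onet_tasks)} O*NET tasks matched actual work")
```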
Second, the granularity. The paper assigns one exposure score per occupation. A company-specific approach scores individual tasks for individual employees — ~30 tasks per person. Two people with the same title can have completely different exposure profiles because they do different work. The paper’s occupation-level approach hides this variance entirely.
Figure 7 — Interactive
Same Data, Different Story: Task vs Occupation Exposure
The same data tells different stories depending on the unit of analysis. Task-level exposure reveals which specific activities AI can perform — an HR Specialist who mostly screens resumes (88% exposed) looks very different from one who mostly mediates disputes (12%). Occupation-level aggregation collapses that variance into a single number that determines policy responses. A 48% occupation score hides the fact that half the tasks are barely exposed and half are almost fully automatable. Toggle between views to see what averaging erases.
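Underneath Figure 7, the aggregation problem is just a weighted average. A toy version with two hypothetical HR Specialists (hours and exposure flags invented):

```python
# Sketch: occupation-level averaging vs task-level exposure, invented numbers.
# Each person's tasks: (hours per week, exposed?).
hr_specialist_a = [(20, True), (10, True), (5, False), (5, False)]  # mostly screening resumes
hr_specialist_b = [(25, False), (10, False), (5, True)]             # mostly mediating disputes

def exposed_share(tasks):
    total = sum(hours for hours, _ in tasks)
    return sum(hours for hours, exposed in tasks if exposed) / total

a, b = exposed_share(hr_specialist_a), exposed_share(hr_specialist_b)
print(f"A: {a:.0%} of hours exposed, B: {b:.0%}, occupation 'average': {(a + b) / 2:.0%}")
# A: 75%, B: 12%, average: 44% -- the single occupation score describes neither person.
```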
Third, the rating system. The 3-tier E0/E1/E2 rubric is too coarse for operational decisions. A 5-tier scale (0 = no automation potential through 4 = full automation with monitoring) with confidence scores lets you distinguish between “this task could be augmented but requires heavy oversight” and “this task can be fully automated tomorrow.” The paper’s rubric answers “is this exposed?” A company-specific rubric answers “what do I do about it?”
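Concretely, the finer-grained rating can be as small as an enum plus a confidence field per task. A sketch of what such a record might look like; the tier wording paraphrases the 5-tier scale described above, and the class and field names are mine, not a standard:

```python
# Sketch: a finer-grained, per-task rating record. Tier wording paraphrases the
# 5-tier scale described above; names and the example values are illustrative.
from dataclasses import dataclass
from enum import IntEnum

class AutomationTier(IntEnum):
    NONE = 0        # no automation potential
    ASSIST = 1      # AI drafts, human does most of the work
    AUGMENT = 2     # AI does the work, human reviews heavily
    SUPERVISED = 3  # AI does the work, human spot-checks
    FULL = 4        # full automation with monitoring

@dataclass
class TaskRating:
    employee: str
    task: str
    hours_per_week: float
    tier: AutomationTier
    confidence: float  # 0-1, how sure the rater (or model) is
    rationale: str

rating = TaskRating(
    employee="Marketing Manager (person A)",
    task="Summarize weekly campaign performance for leadership",
    hours_per_week=3.0,
    tier=AutomationTier.AUGMENT,
    confidence=0.7,
    rationale="Draftable by an LLM from dashboard exports; numbers need checking.",
)
print(f"{rating.task}: tier {int(rating.tier)} at {rating.confidence:.0%} confidence")
```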
Every company-specific pipeline I’ve seen produces different exposure distributions than the national averages predict. That’s not a flaw in the paper. It’s the paper working as designed — it measured national potential, not firm-level reality. The flaw is in treating national potential as a company-specific prediction.
Figure 6
National Taxonomy vs Company-Specific Pipeline
When I ran a company-specific pipeline on a mid-size media company (~80 employees), only 27.2% of O*NET task descriptions matched what people actually did. National taxonomies describe the occupation. Company-specific analysis describes the job.
What Is Defensible
After dissecting the paper, what can you actually claim — and what can’t you?
You CAN claim
Knowledge work is disproportionately exposed to LLMs. The Job Zone 4 finding is robust and inverts the automation narrative.
E2 (software-mediated) exposure is larger and more economically significant than E1 (direct chatbot use).
Micro-level productivity gains are real: 14–40% in controlled studies of specific tasks.
Higher-wage workers face more exposure, not less. The wage-exposure coefficient is positive.
The rubric itself is a useful classification framework for screening tasks, even if the scores are noisy.
You CANNOT claim
That 80% exposure implies 80% automation or 80% job displacement. Exposure ≠ automation ≠ displacement.
Specific dollar savings from AI adoption based on exposure scores. The causal chain is too weak.
That national occupation-level scores predict exposure at a specific company. The 27.2% match rate says they don’t.
Headcount reduction targets derived from exposure analysis. The paper explicitly does not claim this.
That the 50% threshold is the right threshold. It was never sensitivity-tested.
The paper is excellent science. It identified a real phenomenon (LLM exposure is concentrated in knowledge work), introduced a workable classification framework, and was honest about its limitations. It becomes bad consulting when the uncertainty is stripped away, the 50% threshold is treated as natural law, occupation-level averages are applied to specific companies, and exposure scores are multiplied by wage bills to produce dollar figures.
If you’re a PE firm commissioning an AI impact assessment on a portfolio company, start with the Eloundou rubric — it’s the best task-level framework available. But run it on actual job descriptions, not O*NET averages. Use a finer-grained scale. And do not, under any circumstances, multiply an exposure percentage by a wage bill and call it “savings.” The Svanberg finding alone should make that math embarrassing.[3]
Sources