The Revenge of the Data Scientist
Hamel's PyAI Conf talk on how data science fundamentals remain essential for AI evaluation, even in the age of LLMs.
🔑 Key Insights
- 5 Eval Pitfalls: Generic metrics, Unverified judges, Bad experimental design, Bad data/labels, Automating too much
- "Look at the data" is the highest-ROI activity, yet AI engineers often skip it
- An LLM judge should be treated like a classifier: measure precision and recall, not just accuracy
- Criteria drift: The labeling process itself surfaces what matters, because people don't know what they want until they see LLM outputs
- The harness (tests + metrics + observability) is largely data science
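To make the classifier framing concrete, here is a minimal sketch of scoring an LLM judge against human labels. The label lists are made-up illustrative data and `judge_metrics` is a hypothetical helper, not something shown in the talk. It assumes "fail" is the class of interest, since missed failures are usually what an eval is meant to catch.

```python
def judge_metrics(human, judge, positive="fail"):
    """Compare judge verdicts with human labels, treating `positive` as the class of interest."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical data: 10 traces, 2 of which humans marked as failures.
human = ["fail", "fail"] + ["pass"] * 8
# The judge catches only one of the two failures.
judge = ["pass", "fail"] + ["pass"] * 8

metrics = judge_metrics(human, judge)
print(metrics)
```

On this toy data the judge scores 0.9 accuracy while its recall on failures is only 0.5, which is the kind of gap accuracy alone hides when failures are rare.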
The Core Argument
Training models was never most of the job. The bulk of the work is setting up experiments to test how well an AI system generalizes, debugging stochastic systems, and designing good metrics. Calling an LLM over an API does not make this work go away.
Five Pitfalls
- Generic Metrics: Teams use off-the-shelf metrics like "helpfulness scores" that are too generic to diagnose actual failures. They need application-specific metrics like "Calendar Scheduling Failure."
- Unverified Judges: Using an LLM as a judge without validating it like a classifier. Validation needs human labels, a train/dev/test split, and measured precision/recall.
- Bad Experimental Design: Generating synthetic test data without grounding it in real production data. Teams should start from real logs and traces.
- Bad Data and Labels: Delegating labeling to the dev team instead of domain experts. "Data scientists don't trust anything by training."
- Automating Too Much: LLMs can help wire things up but cannot look at the data for you, because you don't know what you want until you see the outputs.
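As an illustration of the first pitfall, an application-specific metric can be as simple as the share of reviewed traces that carry a given failure tag. The sketch below assumes a human reviewer has already tagged each trace. The trace records, the `wrong_timezone` tag, and the `failure_rates` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical reviewed traces: each carries the failure tags a human reviewer assigned.
traces = [
    {"id": 1, "failures": ["calendar_scheduling_failure"]},
    {"id": 2, "failures": []},
    {"id": 3, "failures": ["calendar_scheduling_failure", "wrong_timezone"]},
    {"id": 4, "failures": []},
]

def failure_rates(traces):
    """Return the fraction of traces that exhibit each failure tag."""
    # set() ensures a tag repeated within one trace is counted once for that trace.
    counts = Counter(tag for t in traces for tag in set(t["failures"]))
    return {tag: n / len(traces) for tag, n in counts.items()}

rates = failure_rates(traces)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.0%}")
```

Unlike a generic helpfulness score, a per-tag rate points at a specific behavior to fix and can be tracked from one release to the next.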
Explored: 2026-04-02 | Rating: 4.5/5