An FAQ on Reinforcement Learning Environments

AI Reinforcement Learning Epoch AI ⭐⭐⭐⭐⭐

This post is a collaboration between guest author Chris Barber and JS Denain from Epoch AI. They interviewed 18 people across RL environment startups, neolabs, and frontier labs to understand how AI labs train models using reinforcement learning environments.

Main Takeaways

  • Enterprise workflows are a major growth area - Math and coding tasks came first, but now seeing significant growth in enterprise workflows: navigating Salesforce, filing reports, manipulating spreadsheets
  • Reward hacking is a top concern - Models find ways to game graders; preventing this requires extensive iteration
  • Scaling without sacrificing quality is hard - Managing growing number of task builders and maintaining quality assessment processes
💰 Big Number: In September 2025, The Information reported that Anthropic had discussed spending over $1 billion on RL environments over the following year.

What are RL Environments?

In modern reinforcement learning for language models, the model is given a task and a set of actions. The model attempts the task, and a grader assigns a score. These scored attempts update the model's weights.

Key Components:

  • Environment: Docker container (typically) defining actions and context
  • Task: Prompt + Grader that determines success
  • Actions: Running code, thinking out loud, clicking buttons, searching documents

Example Environments:

  • Git repository: Fix bugs so unit tests pass (SWE-bench style)
  • Airbnb clone: Find cheapest listings for given criteria
  • Bloomberg terminal: Financial data analysis tasks
  • Excel clone: Create pivot tables from raw data

How Labs Use RL Environments

Three main use cases:

  1. Reinforcement Learning - Primary use case (10-20x more than benchmarking)
  2. Benchmarking - Single-turn evaluation
  3. Supervised Fine-tuning - Using successful RL trajectories as training examples
📈 Growth Trend: SFT (Supervised Fine-tuning) especially growing for interleaved thinking and tool calling. More practical when it's easy to produce good trajectories but hard to get a reliable grader.

Key Challenges

  • Reward Hacking: Models game the grader system - requires extensive iteration on environments and tasks
  • Scaling Quality: Hard to scale quantity without sacrificing quality
  • Environment Complexity: Each task needs carefully designed prompts, graders, and environment states

Industry Landscape

Creating RL environments has become a key bottleneck for scaling capabilities and a growing market that remains largely behind closed doors. The field is moving from:

  • Math & Coding → Enterprise workflows
  • Single-turn → Multi-turn interactions
  • Code execution → Complex agent behaviors

Quote

"By training LLMs on a wide range of verifiable tasks across different environments, the LLMs spontaneously develop strategies that look like 'reasoning' to humans." — Andrej Karpathy, 2025 Year-in-Review