RL Environments are the new data labeling
The scarce resource in AI has changed four times in six years. Most people are still optimizing for the last one.
In 2020, data was the bottleneck. Whoever had the most text from the internet won. Google had the web indexed. In 2022, it moved to compute. The recipe was known. Data was available. Running thousands of GPUs for months cost hundreds of millions. In 2024, it moved to algorithms. Compute you could buy. Data you could find. But the recipe for turning a base model into a reasoning model was held by a handful of teams.
Now the algorithm is public. GRPO is in every open-source library. Models are open weights. Compute is available through distributed platforms where someone in Brazil and someone in Germany contribute GPUs to the same training run.
What remains scarce is the environment.
An RL environment is deceptively simple. A task. A verifier. Together they create the conditions for learning.
Math has an answer key. 2+3 is 5. A program checks that in microseconds. Code has test cases. Write a function, run it, see if all tests pass. Solved environments.
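Both verifiers fit in a few lines. A minimal sketch, assuming a `solution` entry point for code submissions; the names are illustrative, and a real harness would sandbox the `exec`:

```python
def verify_math(answer: str, key: str) -> float:
    """Answer-key check: one bit, computed in microseconds."""
    return 1.0 if answer.strip() == key.strip() else 0.0

def verify_code(source: str, tests: list[tuple[tuple, object]]) -> float:
    """Test-case check: the submitted function must pass every test."""
    namespace: dict = {}
    try:
        exec(source, namespace)        # unsandboxed here; real harnesses isolate this
        func = namespace["solution"]   # assumed entry-point name
        return 1.0 if all(func(*args) == want for args, want in tests) else 0.0
    except Exception:
        return 0.0

print(verify_math("5", "5"))  # 1.0
print(verify_code("def solution(a, b): return a + b",
                  [((2, 3), 5), ((10, -4), 6)]))  # 1.0
```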
Now try everything else. Writing a good email has no compiler. Diagnosing a patient has no answer in the back of the book. Negotiating a contract has no test case that returns pass or fail.
Right now, every frontier model on earth is enrolled in three courses. Math, code, and logic. Everything else is a course without a textbook. The model sits in an empty classroom.
January 2025. DeepSeek published R1, trained with GRPO. Take a base model. Give it math problems. Check answers. Right or wrong. One bit of feedback. No human raters. No preference models. No demonstrations of step-by-step reasoning.
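The heart of GRPO is group-relative scoring: sample a group of answers per problem, reward each one, and normalize within the group, so no learned value model is needed. A minimal sketch of that step, omitting the clipped policy-gradient update around it:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: score each completion relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: identical rewards give std 0
    return [(r - mean) / std for r in rewards]

# Eight sampled solutions to one math problem; three checked out correct.
print(group_advantages([1, 0, 0, 1, 0, 0, 1, 0]))
```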
What emerged looked like thinking. The model started breaking problems into steps. Writing “let me first calculate this, then substitute.” Nobody taught it that. Self-verification appeared. “Wait, let me reconsider.” The model caught its own arithmetic errors mid-solution. All from a reward function that was six lines of Python returning 1 or 0.
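That function is easy to picture. A hedged reconstruction of the shape, not DeepSeek's actual code, assuming the final answer arrives in a `\boxed{}` tag:

```python
import re

def reward(completion: str, key: str) -> int:
    """Binary outcome reward: 1 if the final boxed answer matches the key."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0
    return int(match.group(1).strip() == key.strip())
```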
Everyone celebrated. RL taught the model to reason.
Then a team looked closer. NeurIPS 2025 Best Paper Runner-Up.
They tested the RL-trained model against the base model that never went through RL. At pass@1, RL wins. Gets correct answers more often on the first attempt. But give both models k attempts per problem and let k grow into the hundreds. At large k, the base model wins. Solves more distinct problems given enough tries.
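pass@k is typically computed with the unbiased estimator from the Codex paper: draw n samples, count the c correct ones, and estimate the probability that at least one of k draws succeeds. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per problem, RL looks better at k=1 (6/10 samples correct vs 3/10)...
print(pass_at_k(10, 6, 1), pass_at_k(10, 3, 1))  # 0.6 0.3
# ...but the paper's crossover is about coverage: summed over many problems,
# the base model solves more distinct ones once k grows large.
```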
Capabilities were already there. Scattered across the base model’s probability distribution. Seeds in a field. RL didn’t plant new seeds. It watered the ones already growing in the right direction. Made them bloom more reliably. The field didn’t get bigger.
Selection, not creation.
Karpathy described it precisely. RL is “sucking supervision through a straw.” A model generates a 17-step solution. The reward function says one word. Correct. Which of the 17 steps made it correct? Was it the substitution at step 3? The creative reframing at step 11? Maybe step 8 was terrible but step 14 saved it. You’ll never find out. One bit wide. 17 steps deep.
The gap between outcome supervision (your final answer is correct, full marks) and process supervision (step 3 had a sign error, step 5 was creative) is the gap between the straw and the pipe. Building the pipe requires environments that can evaluate intermediate steps. Not just final answers. That requires deep domain expertise in every field where we want models to improve.
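In code, the straw and the pipe differ by one dimension. A minimal sketch, where the hypothetical `check_step` is exactly the expensive part, since someone with domain expertise has to write it:

```python
from typing import Callable

def outcome_rewards(steps: list[str], final_ok: bool) -> list[float]:
    """The straw: one bit, smeared identically across every step."""
    return [1.0 if final_ok else 0.0] * len(steps)

def process_rewards(steps: list[str],
                    check_step: Callable[[str], bool]) -> list[float]:
    """The pipe: a verdict per step, so credit lands where it belongs."""
    return [1.0 if check_step(step) else 0.0 for step in steps]
```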
Same pattern happened with data. In 2018, having a large dataset was a competitive advantage. By 2023, datasets were commoditized. Common Crawl was free. The Pile was open. RedPajama was open. Advantage shifted from having data to knowing what to do with it.
Now it's starting with environments. Building a good RL environment for a new domain is rare, valuable work. The person who defines what "correct" means for medical diagnosis, or legal reasoning, or chip design, opens a new course for every model on earth to enroll in. The job didn't exist two years ago.
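What does that job look like concretely? One common pattern is decomposing "correct" into an expert-written rubric and scoring against it. A hypothetical sketch for a diagnosis task; the criteria and keyword checks are illustrative placeholders, not medical guidance:

```python
from typing import Callable

# Each criterion: a description and a (crude, keyword-based) check.
RUBRIC: list[tuple[str, Callable[[str], bool]]] = [
    ("orders the discriminating test", lambda t: "d-dimer" in t.lower()),
    ("rules out the dangerous differential", lambda t: "embolism" in t.lower()),
    ("revisits the initial hypothesis", lambda t: "reconsider" in t.lower()),
]

def rubric_reward(transcript: str) -> float:
    """Partial credit: the fraction of rubric criteria the transcript satisfies."""
    return sum(check(transcript) for _, check in RUBRIC) / len(RUBRIC)
```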
Anthropic is spending tens of millions a year on environments. Not models. Not algorithms. Environments. Prime Intellect built an open platform where anyone can create and share them. Environments Hub sits above GPU Instances in their sidebar.
Most AI investment right now is aimed at the wrong layer. Capital is flooding into model training, inference infrastructure, application wrappers. The actual bottleneck is a six-line Python function that returns 1 or 0.
Data labeling was a $15 billion industry built on one insight: supervised learning needs labeled examples. Environment design is the next version. Same land grab. Just a different artifact.
The labelers of 2020 become the environment builders of 2026. Most of them don’t know it yet.

What about world models?