Good QC for RL Data
In January, I proposed a new definition for Type 1 and Type 2 data, prompted by the data industry's drastic need for a way to evaluate data quality. A conscious side-effect of the shift to longer-horizon training regimes is an increased need for model-based QA, far beyond the body-shop capabilities of current-day data companies.
The order in which we entered data markets corresponded directly to how verifiable we could make each one. We filtered the hard domains out at the infrastructure layer: first by choosing verifiable ones, then by building environments that strip away the attention and irreversibility that made real decisions actually hard, then by avoiding reward functions that require taking a contested position. The artifacts of this selection effect are operationalized in pipeline design. Even in the supposedly easy domains we kept, the QC discipline that distinguishes a useful Type 1 dataset from a depreciating one is not yet a shared language across the data markets. Most of the data shipped to frontier labs in 2026 fails the bar set by the labs' own internal QC frameworks.
Many data companies fall by the wayside in two ways. We pick the easier domains because the evaluation problem is already solved there, and we fail to actually solve the QC problem on the data we ship in those domains.
The shape of good QC for off-the-shelf RL data has come into focus over the past eighteen months. There is a defensible bar for what good looks like, and it is not aspirational: it is implemented and shipped by the labs themselves, and any vendor selling into a frontier lab in 2026 is being measured against it implicitly during the purchase decision. Most are failing multiple gates at once.
The vocabulary here is worth walking through because it has not yet propagated outside the labs that use it. As we move toward data and tasks that measure how much something costs and how fast it can be done, rather than whether it can be done at all, standardized QC for evaluating how well a dataset probes the performance-cost-latency Pareto curve will become critically important.
Intake review
Before any post-training run touches the data, you ask whether the dataset is even eval-able.
This is the cheapest gate in the QC stack and it is the one most data companies skip. A frontier lab spending a six-figure trial contract on a dataset that fails intake review is paying twice, once for the data itself, and once for the GPU hours and researcher attention burned on a training run that was uninterpretable from the start. The market for OTS RL data in 2026 is large enough that the second-order cost of skipping intake now exceeds the first-order cost of running it. As mentioned in my previous piece, Anthropic and other labs disclosed RL data spend at $1B+ for 2025 and in practice overshot it.
There are major intake categories that, at a minimum, every company professing to collect data for the frontier ought to display.
Verification spectrum classification asks where the task sits on the spectrum from deterministic code grading (SWE-bench Verified is the cleanest version of this category), through LLM-judge rubrics (the published reference pattern across HealthBench, FLASK, BiGGen Bench, and Prometheus 2 is atomic, binary, axis-tagged criteria), to unverifiable-by-automation tasks that should ship as SFT demonstrations rather than reward-based RL. Skipping this classification is how labs end up plugging fundamentally unaudited LLM judges into reward functions.
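As a concrete illustration (the record shape and routing below are my own sketch, not any lab's schema), the intake decision can be as simple as tagging each task with its verification regime and routing the unverifiable ones out of the RL reward lane:

```python
from dataclasses import dataclass
from enum import Enum

class VerificationRegime(Enum):
    DETERMINISTIC = "deterministic"  # code-graded, e.g. hidden unit tests
    LLM_JUDGE = "llm_judge"          # atomic, binary, axis-tagged rubric criteria
    UNVERIFIABLE = "unverifiable"    # not automatable; ship as SFT demonstrations

@dataclass
class IntakeTask:
    task_id: str
    regime: VerificationRegime
    grader_ref: str  # pointer to the test harness or rubric, whichever applies

def route(task: IntakeTask) -> str:
    """Route a task out of the RL reward lane if automation cannot verify it."""
    if task.regime is VerificationRegime.UNVERIFIABLE:
        return "sft_demonstrations"
    return "rl_reward"
```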
Contamination resistance and variant generation ask whether the dataset's value as a hill-climbing target survives the next model generation. GPQA, AIME, and FrontierMath are static sets whose discriminative power decayed within a year as problems leaked into pretraining, and the vendors had no canary, no rotation cadence, no recovery story.
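A minimal sketch of the kind of overlap check this implies; the n-gram length and whitespace tokenization are assumptions for illustration, not anyone's published recipe:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams; a real pipeline would normalize and use a proper tokenizer."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(candidate_tasks: list[str], public_eval_items: list[str],
                       n: int = 13) -> float:
    """Fraction of candidate tasks sharing any n-gram with a named public eval suite."""
    public: set[tuple[str, ...]] = set()
    for item in public_eval_items:
        public |= ngrams(item, n)
    flagged = sum(1 for t in candidate_tasks if ngrams(t, n) & public)
    return flagged / max(len(candidate_tasks), 1)
```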
Pass@k and distributional analysis set the productive training band, because a dataset whose pass@1 sits at zero on the target model or whose difficulty distribution is bimodal produces no gradient to climb.
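The standard tool here is the unbiased pass@k estimator from Chen et al. (2021); a minimal version, with the productive-band interpretation in the comments:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples per task, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_profile(samples: list[tuple[int, int]], k: int = 1) -> list[float]:
    """Per-task pass@k on the target model. Tasks pinned at 0.0 or 1.0 contribute
    no learning signal; the productive training band is the mass in between."""
    return [pass_at_k(n, c, k) for n, c in samples]
```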
Rubric construction patterns determine whether the grader is atomic and binary or compound and reward-hackable, per the rubric anchoring research. Each category is a question with a published cautionary tale behind it, and the cost of getting it wrong is paid downstream by the lab, not by the vendor.
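On the rubric point specifically, a sketch of the kind of linting that catches compound or non-binary criteria before they reach a judge; the heuristics are illustrative, not a published linter:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    axis: str    # e.g. "accuracy", "completeness", "communication"
    text: str    # the single claim the judge scores
    binary: bool # must resolve to pass/fail, not a 1-5 scale

def lint_criterion(c: RubricCriterion) -> list[str]:
    """Flag criteria that are compound or non-binary; both invite reward hacking."""
    issues = []
    if not c.binary:
        issues.append("non-binary: scaled judge scores drift and anchor")
    lowered = c.text.lower()
    if " and " in lowered or " or " in lowered or ";" in c.text:
        issues.append("compound: split into one checkable claim per criterion")
    if not c.axis:
        issues.append("missing axis tag")
    return issues
```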
There are a few more checks that ought to be treated as pre-flights, but even this set, packaged in a clean format, means the vendors that have figured it out are using intake review as a structured pitch to procurement teams ("here is our slice on each category, here is the artifact for each gate, here is the audit pass we ran") and clearing informal onboarding cycles in weeks instead of months. The vendors that haven't are losing contracts they think they're winning, or have researchers who call their data "good" on paper while quietly looking for alternatives behind their backs. When a lab overseeing a million-line delivery discovers a single discrepancy or failure in one of those lines, it will start to wonder whether there is any QC process at all.
Active Testing
After intake passes, small-scale ablations and short post-training runs can be employed to stress-test the data and catch the problems intake review cannot see. Reward hacking shows up in training, amid the complexity of different models running on different harnesses. Sycophancy shows up under reward pressure, not in static evaluation. Forgetting shows up after the training run, by which point the lab has already paid for the data and the compute, and we generally want assurance that catastrophic forgetting is not an immediate consequence of a dataset. Active testing is more expensive to run than intake (the cost is a small post-training run on a probe model plus the GPU hours for the diagnostic battery), but the cost of skipping it is higher still, because the failure modes it catches are the ones that quietly degrade frontier model releases and trigger the contract non-renewals I'm hearing about across the labs.
Most data vendors in 2026 are running zero categories of active testing on the data they ship.
Reward hacking comes up in every single lab conversation, still. METR put numbers on it with 1-2% of o3 attempts containing exploits inside their sandboxes, AISI caught OpenClaw reverse-engineering its own evaluation proxy from inside an isolated environment, and ImpossibleBench finds GPT-5 exploiting test cases 76% of the time on the impossible-SWEbench variant. Modern frontier models are routinely cheating their evaluations under reward pressure, and I still find many vendors have never run a single probe to check whether their own data trains for exactly this. The bias-probe battery is the parallel story for any LLM-judge in a reward function. Sycophancy, reward-tampering, and alignment-faking are the three published probes vendors should be running, with the alignment-faking baseline at 12%, and almost none are.
For verifier-graded data, the SWE-bench Verified Pro pattern of 200 PASS plus 200 FAIL human re-judging, with FP and FN rates reported separately, is now table stakes. OpenAI's 2026 retirement post for the original SWE-bench found 59.4% of audited problems had flawed test cases; past that point, "deterministic verifier" stops meaning anything.

Forgetting checks need to be per-skill, not aggregate, the way Tulu 3 published the floor. The gap between SFT continual post-training (around -10.4% average) and on-policy RL (around -2.3%) is what should inform the training method choice, and Qi et al. is the reason aggregate numbers are misleading on safety-relevant data: small benign fine-tunes can strip RLHF safety guardrails while aggregate scores stay flat.

Frontier shape analysis uses the Pareto curve to detect reward-hackable task sets, with the reward-hacking signature work as the published reference, and most vendors don't run it because it requires GPU infrastructure they don't own.

Failure triage is the cheapest of these and the most useful. Each failed rollout labeled as capability, prompt, scaffolding, rubric, training-data, orchestration, or triangulation gives the vendor a concrete edit list and the lab a way to tell whether the dataset is broken at the data layer or upstream.
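A sketch of the FP/FN bookkeeping for that verifier audit, assuming the sample is stratified by verifier verdict and then re-judged by humans; field names here are placeholders:

```python
def verifier_audit(records: list[dict]) -> dict:
    """Each record: {"verifier": "PASS"|"FAIL", "human_gold": "PASS"|"FAIL"}.
    FP rate: share of verifier-PASS items the human graded FAIL.
    FN rate: share of verifier-FAIL items the human graded PASS."""
    v_pass = [r for r in records if r["verifier"] == "PASS"]
    v_fail = [r for r in records if r["verifier"] == "FAIL"]
    fp = sum(1 for r in v_pass if r["human_gold"] == "FAIL")
    fn = sum(1 for r in v_fail if r["human_gold"] == "PASS")
    return {
        "fp_rate": fp / max(len(v_pass), 1),
        "fn_rate": fn / max(len(v_fail), 1),
        "n_verifier_pass": len(v_pass),
        "n_verifier_fail": len(v_fail),
    }
```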
The procurement read is the same shape as intake. Active testing is the work the labs are already running internally on every dataset they accept, and they are increasingly asking vendors to ship the audit results alongside the data so the lab does not have to repeat the work. Vendors who show up with the bias-probe battery results, the per-skill forgetting numbers, the verifier FP/FN audit, and the failure-triage distribution are clearing onboarding in weeks. Vendors who show up with "we ran a few small training experiments and the loss went down" aren't getting past the first technical review. That gap is the difference between a serious data company and a commodity competitor in 2026.
It should be noted that because many labs are compute-rich in some capacities, they are more forgiving on data quality (quality data being the binding bottleneck, they take what they can get) and keep working with certain vendors anyway. But, as I argued in previous writing on how data will be the cause of the next AI bubble if there is one, how long can we expect an inefficient data market to persist, one where researchers throw out 50%+ of the data they procure?
Where we need to improve in the wild
Let's take some deeper dives into 2024-2026 benchmark releases and where they fall short of these standards:
FrontierSWE (Proximal) sits in the strongest possible verification regime (deterministic code-based grader with hidden test signal), yet fails precisely on surface stratification: each model is locked to its own native production harness, conflating model and scaffolding contributions to the headline number.
ProgramBench fails on realism. Competitive programming with clean specs and known answers is not the deployment context for any production coding agent in 2026, and the model that tops a ProgramBench leaderboard is not necessarily the one any engineering team should be deploying. Though I applaud the creative restrictions they place on models and the cataloging of cost as a hill-climbing objective category, these tasks are still quite contrived and represent a class of benchmarks that confuse contest difficulty for production utility.
Tau-Bench measures end-state correctness on multi-turn customer service interactions and skips the process evaluation that is load-bearing on multi-turn rollouts: did the agent ask the right clarifying question at turn three, recover from a tool failure at turn five, explain the resolution coherently at turn seven?
GDPval tries to anchor frontier capability to economic productivity and fails on realism for the same reason ProgramBench does: productivity tasks reconstructed in a controlled environment are not the productivity tasks that exist in real organizational contexts.
MMMLU carries the standard MMLU contamination posture across forty languages, with no canary, no rotation, and a known leakage profile from the moment it shipped.
DSBench put GPT-4o-as-judge on 86% of its tasks with a single hand-wave validation claim and saturated from 34% to 89% in ten months, which is the load-bearing example of what happens when verifier soundness is skipped on a static set.
Terminal-Bench 2.0 handles task verification well but stays inside short shell-task horizons that hide both the irreversibility and process-evaluation failures that longer-horizon work surfaces, the way coding-and-math hid them in 2024.
The benchmarks that pass more of the categories tend to do so on a single axis at a time. BankerToolBench (Handshake) is the cleanest realism story I have seen on financial tool use, because the tasks are derived from actual investment banking workflows and the verifier is built around the working products bankers use. LiveCodeBench Pro handles contamination defense by drawing fresh problems on a rolling basis from competitive programming sites and retiring them as they age into pretraining, which is the published reference for a refresh cadence done correctly. SciCode handles verifier soundness on partial-credit scientific coding by hand-writing per-problem deterministic checkers with expert review, at the cost of scale (a tradeoff I welcome, if Mercor's human QA here was run well). None of them clear all categories at once.
I deeply respect the work that all of these companies have done. All of them shipped artifacts that move the field forward. All of them also illustrate why the QC bar is now load-bearing. The question of "does the measurement instrument actually inform a research decision the lab can make" is extremely difficult to answer because QA processes vary vendor by vendor, and the answer depends on which categories the benchmark cleared and which it skipped.
The vendor distinction worth drawing is between table stakes and differentiation because the floor is relatively automatable. I see the floor as a smorgasbord of dataset documentation manifest, atomic rubric construction with linter, verifier soundness audit, n-gram contamination report, cross-model evaluation with unbiased pass@k, multi-seed bootstrap CIs, eval harness declaration, trace artifacts, surface stratification across at least two scaffolding configs, and probe model selection from a versioned shortlist.
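On the multi-seed bootstrap CI item from that floor, a minimal sketch (percentile bootstrap over per-task scores pooled across seeds; the resample count and interval convention are my assumptions, not a standard):

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```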
Further up, in the differentiation work that may not be cost-effective for every vendor, the work looks more like a researcher's: bias probe batteries on verifiers; sycophancy, reward-tampering, and alignment-faking probes; CoT faithfulness probes with counterfactual perturbations; IRT-based ability audits via tinyBenchmarks or Fluid Benchmarking; online RL lane diagnostics for PPO and GRPO. Vendors without research staff who can read the cited papers directly will not implement these, but vendors who do ought to be adequately rewarded.
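For readers who haven't touched the IRT work: the underlying object is just an item response curve, for example the two-parameter logistic model that tinyBenchmarks-style methods fit per item. This sketch is the textbook form, not either paper's exact code:

```python
from math import exp

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a model with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

# Items with near-zero discrimination (flat curves) tell you nothing about
# ability; an IRT-based audit flags them so the dataset's headline score
# is not dominated by uninformative items.
```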
The market implication
Vendors who haven't internalized this QC bar will find their contracts on the chopping block in 2026, and the rumors I've already heard from top labs about RL contracts being non-renewed reflect exactly this dynamic. Labs are buying less data in the abstract sense of "we need more tasks in this shape." Sellers to Chinese labs may still find that the old motion works; the rest of the market has shifted. Most vendors who keep overoptimizing on unrealistic synthetic data will be selling against the current rather than with it. It is not always said out loud, but the frontier labs are buying outcomes (model improvement on a target capability), and the QC bar is the floor under whether the data can actually produce that outcome.
To overoptimize for selling data in its current form without thinking about scalability is to choose death by a thousand cuts. Frontier labs in 2026 have learned to discount black boxes heavily, especially black boxes attached to vendors who do not appear to care about their own data quality. The few vendors who have built this infrastructure internally already (a small set, mostly the ones with research-dense teams) are seeing pricing power on the order of 3-5x what their commodity peers can charge for nominally similar tasks, and the premium is built on continued trust as reliable, quality-first partners at scale. I expect the gap will only widen as the labs' procurement teams get more sophisticated and as more data teams come to market with these standards.
This is the companion observation to the long-horizon non-verifiable point. Before we even get to the harder domains where the reward function is contested and the environment has to model irreversibility, we need to be doing the QC work in the domains where the reward function is uncontested. The execution gap is what's left to close, and it is smaller than the selection effect but larger than most people running data companies want to admit. A world where we have more codified QC standards is also a world where more models like Andons' proliferate. Theoretically, if you are running a data company in 2027 and you cannot tell me your pass@k distribution across at least three models, your verifier FP/FN rates against human gold, your contamination check against the named eval suites your dataset is positioned against, and your frontier-shape diagnostic on a probe model, you are not selling Type 1 data. You are selling Type 2 data with Type 1 marketing. The labs will figure that out within one purchase cycle, and the rumors I'm hearing suggest several already have.