The Bitter Lesson - RL Environments Version
Let’s address the biggest qualms with data and RL env companies today:
While realistic RL environments are needed to push performance today, building and maintaining them is costly: realistic, high-fidelity simulations or replicas are expensive to develop and keep up to date. Successful RL environment startups will move toward building systems that reduce cost and increase reusability, and, through building hundreds of RL environments, figure out how to automate environment creation or uncover some new insight into building verifiers more easily for strictly non-verifiable outputs. Look for startups with unique takes and breakthroughs here, and whose COGS are clearly decreasing over time.
Alternatively, take care to differentiate the startups building RL environments as a GTM tool from the ones doing it as a business. Because there is plenty of virgin soil within enterprises for end-to-end AI infrastructure given white-space opportunities (hence the success of startups like Forge), many startups view RL environments and data sales as a way to establish vendor relationships, given how readily available data contracts are for strong technical teams nowadays.
Overhead of proprietary/closed environments: customers may prefer open or interoperable environments, and if labs build their own, third-party environment providers may lose business. To my knowledge, no RL env company sells its products on annual SaaS-like contracts yet. This is expected: revenue here can't be contracted on an annualized basis when demand for RL environments is itself less than a year old, and some other paradigm around model improvement may emerge in an even shorter timeframe.
That's no excuse, however, for not developing a business with reasonable recurring metrics. Look for startups actively driving toward productized AI infra, with the understanding that RL environments and data packs may not always be around. This may be a startup that envisions building a "CrowdStrike"-type suite for mid- and post-training needs, and that is actively pushing the boundaries of what labs are and aren't willing to outsource today.
Commoditization
A race to the bottom should be expected as more miners flock to the data mines. Currently, one can reasonably expect to sell an RL environment and plenty of in-line tasks to a lab for roughly $100-400k, depending on complexity. At face value, what researchers look for in procurement are environments testing frequently evaluated real-world domains on which current frontier models score poorly. Better still if the environments require more reasoning steps, have well-defined deterministic evaluation tasks built around them, and sit in high-value white-collar domains.
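To make "deterministic evaluation task" concrete: it means the verifier can score a rollout by checking final state exactly, with no LLM judge in the loop. The sketch below is my own illustration; the task shape, function name, and state format are assumptions, not any vendor's API.

```python
# Minimal sketch of a deterministic verifier for an RL environment task.
# The spreadsheet-style task and state layout are illustrative assumptions.

def verify_spreadsheet_task(final_state: dict, expected: dict) -> float:
    """Return 1.0 only if every required cell matches exactly.

    Deterministic: the same rollout always yields the same reward,
    with no judge model involved.
    """
    required = expected["cells"]          # e.g. {"B2": 1200, "B3": 1450}
    actual = final_state.get("cells", {})  # agent's final spreadsheet state
    return 1.0 if all(actual.get(k) == v for k, v in required.items()) else 0.0

# Extra cells the agent filled in don't matter; required ones must match.
reward = verify_spreadsheet_task(
    {"cells": {"B2": 1200, "B3": 1450, "C1": "total"}},
    {"cells": {"B2": 1200, "B3": 1450}},
)
```

The appeal to labs is exactly this property: reward is reproducible across training runs, which judge-model scoring can't guarantee.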
When Anthropic churned some contracts from Surge due to cost (and players like Mercor moved up the stack to fulfill them, albeit at lower cost), some took this as thesis reinforcement. Look for startups that treat commoditization as an opportunity to build other parts of the stack by capturing spend volume and interaction time with procurement teams.
A conclusion one might draw from observing Scale's general dissolution is that it was running an inherently unsustainable model. My counter-conclusion is that Scale's decline wasn't industry-specific; it was more an inability to adapt to shifting data needs as a result of size and culture. The data Scale was built to collect in the late 2010s is vastly more "hot dog or not hot dog"-like than the datasets Mercor and Surge produce today. At one time, around Mercor's Series B days, Scale was public enemy #1 at Mercor.
An always-needed note:
A reasonable north star, as always: a fast-growing team with enigmatic, well-networked founders, who hires extremely well and is close to innovator customers, agnostic of industry, can find the right pivots to win. AKA, this is "betting on the team," which in the world of AI today is one of the only reliable constants for investors. It's always good, of course, to actually understand and hold unique opinions about your startup investments' markets, even as a people-indexed investor.
- We should expect RL environment startups to be very relevant through at least 2025-2028, especially in the mid-training and post-training phases, because they supply crucial infrastructure that labs need to push capabilities and alignment.
- But many such startups will fail or get consolidated, especially if they focus on single, narrow environments that are expensive to maintain or quickly obsolete. The ones that survive are likely those that offer reusable, modular, upgradeable, or domain-agnostic environments; and/or those with strong partnerships with labs or open source communities.
- RL environments will not completely replace other training stages. Pretraining (foundation models) remains necessary. But the marginal gains from further pretraining decline. So the growth will be more in mid- and post-training work, with RL environments, reward modeling, reasoning, etc., forming the battlegrounds for new advances.
- Eventually, there might be a shift toward more “environment standards” or “reward environment hubs” (open or semi-open) analogous to how datasets / evaluation benchmarks once standardized parts of NLP/vision.
- Startups that can position themselves as platforms (environment + compute + evaluation + easy integration into labs' pipelines) will have better longevity than those offering one-off environments.
- Also be aware that RL environment sellers have other low-hanging fruit for pivots: those that sell to enterprises can easily sell other forward-deployed automations leveraging the MLE expertise disparity, or other parts of the stack that come up soon.
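On the "environment standards" point above: the precedent in RL is a Gym/Gymnasium-style `reset`/`step` contract, which is what makes environments reusable across training pipelines. A toy sketch, with the task, class name, and reward logic as my own illustrative assumptions:

```python
# Sketch of a standardized environment interface (Gymnasium-style contract:
# reset() -> observation; step(action) -> (obs, reward, done, info)).
# The ticket-triage task and exact-match reward are illustrative assumptions.

class TicketTriageEnv:
    """Toy text environment: the agent must route a ticket to a queue."""

    def __init__(self):
        self._answer = "billing"
        self._done = False

    def reset(self) -> str:
        self._done = False
        return "Ticket: 'I was charged twice this month.' Route to which queue?"

    def step(self, action: str):
        # Deterministic reward: exact match on the expected queue label.
        reward = 1.0 if action.strip().lower() == self._answer else 0.0
        self._done = True
        return "episode over", reward, self._done, {}

env = TicketTriageEnv()
obs = env.reset()
_, reward, done, _ = env.step("billing")
```

Any lab's harness that speaks this contract can plug the environment in unchanged, which is precisely why a shared standard would erode the moat of one-off proprietary environments.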
A bitter lesson on having theses is that you can usually never invalidate a startup purely thematically.
People often forget how much of an advantage an incredibly customer-centric, AI-native team that sells to hyperspender logos actually has. Even if you're selling data (consulting-type, one-time arrangements), the knowledge asymmetry that comes from being a trusted vendor to all of the Magnificent 7 labs, with a large war chest of capital to hire with, is an inherent right to win whatever AI infra category comes next.
OpenAI and Mercor, for example, are coming out very soon with a general-purpose eval that OAI has specifically asked Mercor to delay releasing so it can first improve its own performance on it. No amount of "customer interview" research, short of being an ex-researcher, could have yielded you this insight, or the depth of relationship to act on it, if you weren't in Mercor's position.
The conclusions I espouse here have been reached by other founders, and may already be considered old and discarded along their journeys. At the end of the day, I remain a people-focused investor, and these thoughts are an olive branch to founders to see if we think alike.