
Stealth Co

Category: Data
Invested: 2025

A new paradigm for scaling complex, realistic data and ensuring its quality

In 2023, as OpenAI researchers rode the buzz around GPT-3 and GPT-3.5, models trained on the transformer architecture (introduced in 2017) that finally seemed to work, the appetite for training data outgrew what the open internet could supply.

In our impatient quest for data, we asked companies sitting on huge stores of baseline human expertise (Mercor, Handshake, Turing, Scale) to create GPQA-like datasets to augment next-token prediction with context from professional, data-scarce domains that were barely represented on the open internet (which served as our training corpus).

As models hillclimb on objective tasks with simplistic verification mechanisms, the next rungs of the climb become harder and harder to verify. Domains that resist verification, it seems, are also the ones that correlate with white-collar salaries. We need better, higher-quality, genuinely real-world datasets to make sure models actually learn economically valuable work. Quality matters more than quantity, now more than ever.
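To make the verification gap concrete, here is a minimal Python sketch contrasting a simplistic, objective verifier with the kind of rubric-based proxy you fall back on in harder-to-verify white-collar domains. Every function name and rubric here is illustrative, not from any real library.

```python
# Sketch only: why "easy to verify" and "economically valuable" diverge.

def verify_math_answer(model_output: str, gold: str) -> float:
    """Simplistic, objective verifier: reward 1.0 on exact match, else 0.0."""
    return 1.0 if model_output.strip() == gold.strip() else 0.0

def score_analyst_memo(model_output: str, rubric: list[str]) -> float:
    """Harder-to-verify domain: no single gold answer, so we fall back on a
    rubric scored by an expert or a reward model. Crude proxy here: the
    fraction of rubric criteria the memo mentions at all."""
    hits = sum(1 for criterion in rubric if criterion.lower() in model_output.lower())
    return hits / len(rubric) if rubric else 0.0

# The first reward is cheap and unambiguous; the second is noisy, gameable,
# and only as good as its rubric, which is exactly where scarce expert data
# becomes the binding constraint.
```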

The results are more apparent than ever. App-layer products trained on shaky, unrealistic-data foundations increasingly stray from actual analyst work. RL environment and data companies buy dying YC companies like Jawas picking over Star Destroyer wreckage, scavenging for the valuable datasets that emulate the best work of desperate founders. Giants like Mercor fail to deliver quality data at scale, convincing many labs to diversify their human data vendors because they no longer believe any single vendor can scale quality.

We are taking the lazy way out, producing flashy, “easy” benchmarks to show steady hillclimbing on increasingly unverifiable domains. LMArena, as Surge has pointed out, is a terrible benchmark because it fails to capture nuanced RLHF signal. The evals in the tech-culture zeitgeist increasingly don’t generalize or map to real-world tasks. We are wasting RL compute on evals that represent the real world too simplistically.

“Data is the world’s most depreciable asset, especially as RL paradigms evolve and attempts at continuous learning proliferate.” Real-time data matters more than ever. Infrastructure that converts enterprise signals into RLHF data in real time, combined with anonymization to reduce enterprise footprints, will be necessary to create realistic north-star datasets that actually model economically valuable work. Converting human behavior and natural language into robust evals will be more necessary than ever. This will increasingly be a joint effort between the many layers of the industrializing human data markets that will emerge and the engineering work demanded by ever more complex RL environments, reward-model shaping, and robust real-time enterprise data pipelines.
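As a minimal sketch of what that conversion infrastructure might look like, consider the pipeline below: raw enterprise work signals are scrubbed of identifying information and reshaped into eval examples that a grader or reward model can consume downstream. Every class, field, and regex here is a hypothetical illustration, not any vendor's actual API.

```python
# Hypothetical enterprise-signal-to-eval conversion step.
import hashlib
import re
from dataclasses import dataclass

@dataclass
class EnterpriseSignal:
    author: str        # employee identifier from the source system
    task_prompt: str   # what the human was asked to do
    work_product: str  # what they actually produced

@dataclass
class EvalExample:
    author_hash: str   # anonymized footprint, not reversible to the employee
    prompt: str
    reference: str     # consumed downstream by a grader or reward model

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str) -> str:
    """Crude PII scrub (emails only); a real pipeline would go much further."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def to_eval_example(signal: EnterpriseSignal) -> EvalExample:
    """Convert one unit of real work into an anonymized, gradeable example."""
    return EvalExample(
        author_hash=hashlib.sha256(signal.author.encode()).hexdigest()[:12],
        prompt=anonymize(signal.task_prompt),
        reference=anonymize(signal.work_product),
    )
```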

I wrote before about the game of telephone we played to source data for impatient labs. Back then, models were simplistic and had zero grounding in the domains we tried to teach them with non-domain experts. Now they’ve graduated with bachelor’s degrees, and in our quest to get them master’s degrees and PhDs, only real-world human data, converted into model-actionable formats, will suffice. Artificial data creation companies with growth-forward capital incentives and rotting cultures will not drive this change.

The first step toward servicing the data needs of evolving RL paradigms is enterprise SLA proliferation (Build AI, unnamed co). The second is specialization in frontier expert formation (Phinity, Hillclimb). The third is robust RLHF engineering and generalized RL tooling that makes RL available to everyone (Tinker, CGFT). This is what AGI, defined as automating every job in order of complexity, will require.

The challenge with building a robust eval that accurately simulates the real world is that RL requires verifiable rewards, and we would much rather collapse a problem into one verifiable end reward, or at most a handful of smaller verifiable rewards. If that is the limit of today’s learning paradigms, then I argue the most robust eval, the one most representative of real-world work, is one that updates weekly and is continually supplied with examples as close to pure real-world work as possible. Perhaps the end goal, given SOTA learning paradigms in Dec 2025, is an amalgamation of these weekly evals, refreshed every week with new enterprise data, in a 365-day trailing megaset (owing to how quickly data depreciates under RL).
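A minimal sketch of that trailing megaset, assuming weekly batches of converted enterprise examples and a hard one-year expiry; the class and method names are my assumptions, not an existing system.

```python
# Illustrative 365-day trailing eval megaset built from weekly batches.
from datetime import date, timedelta

class TrailingEvalSet:
    def __init__(self, window_days: int = 365):
        self.window = timedelta(days=window_days)
        self.weekly_batches: list[tuple[date, list[dict]]] = []

    def add_weekly_batch(self, batch_date: date, examples: list[dict]) -> None:
        """Ingest this week's freshly converted examples, then expire stale ones."""
        self.weekly_batches.append((batch_date, examples))
        self._expire(as_of=batch_date)

    def _expire(self, as_of: date) -> None:
        """Drop batches older than the trailing window, since data depreciates."""
        cutoff = as_of - self.window
        self.weekly_batches = [(d, b) for d, b in self.weekly_batches if d >= cutoff]

    def current_megaset(self) -> list[dict]:
        """Flatten every in-window weekly batch into one eval set."""
        return [example for _, batch in self.weekly_batches for example in batch]
```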

The scaling limits we’re hitting now are half driven by the need for learning-paradigm innovation, but they are increasingly compounded by outdated data collection modalities, data companies with rotting internal cultures and practices, and unrealistic, gameable benchmarks. We want the problems of the constant hillclimb to remain research problems rather than the problems that plague today’s human data markets. Current data companies have entrenched, throughput-optimizing processes; we want to flip that paradigm.

From an economist’s perspective, if today’s human data companies remain as rotted as they are and continue to profess to be both the labs’ main data providers and the revolutionaries of the new labor economy, then it falls on us to prevent the negative externalities of these outdated human data approaches from wrongly defining the jobs that employ so many through the AI shift.