Everyone is training models (soon tm)
Today there are four or five new startup vector DBs besides Pinecone and the larger players, each claiming it takes only twenty-four hours of migration time to move an entire enterprise system over.
There are also countless startups working on RL-as-a-service (a fancified way to describe forward-deployed MLEs), and on the early innings of outsourcing various functions of the MLE tech stack.
Not to mention the dozens of startups who've aggregated some sort of userbase and are now thinking about selling structured data to labs for training frontier models (Handshake is the biggest culprit here). A recent story I heard: 17 YC startups approached a single group partner for introductions to OpenAI to do exactly this.
The infra layer is getting built out, whether your favorite VC market map said it would or not. There is appetite to train smaller, lower-latency, equally performant models to unlock use cases in most industries, and the notion of a one-size-fits-all model has been definitively disproven.
For further reading, a researcher from Pleias did an excellent job predicting this six months ago, anticipating both (A) the steady advance of labs into the product layer and (B) the democratization of the ability to create strong RL + reasoning models.
The only reason OpenAI and Gemini are the ones successfully building out product-layer components today is that, for a head-start period of six to nine months, they were the first to breach the performance thresholds that made models good enough for certain white-collar tasks, and they held that advantage for the duration.
That window let them acquire data from enterprises, as well as net-new data from labeling startups, to get where they are.
But why shouldn't it be the domain experts who are training models in their domains? One might argue that frontier labs are training foundation models on any data they can find precisely so they can provide the blueprint for smaller, more performant models that enterprise providers can train easily. In actuality, it's more likely because the toolkit for converting data into actionable models just hasn't existed without substantial bespoke infrastructure and dozens of researchers.
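To make the "toolkit" point concrete: a supervised fine-tune that once demanded bespoke infrastructure now fits in a dozen lines. A minimal sketch, assuming recent versions of Hugging Face's trl and datasets libraries; the base model name and the data file are placeholders, not anyone's actual pipeline:

```python
# Minimal domain-specific supervised fine-tune (sketch, not a production recipe).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A JSONL file of {"text": ...} records drawn from the domain corpus (hypothetical path).
dataset = load_dataset("json", data_files="domain_corpus.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # small open-weights base model; swap for your own
    train_dataset=dataset,
    args=SFTConfig(output_dir="domain-model", max_steps=1000),
)
trainer.train()
```

The point isn't this exact script; it's that the scaffolding (tokenization, batching, checkpointing) that used to require a research team is now the library's problem.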
This is changing rapidly, though, on several fronts. Companies like ceramic.ai, the proliferation of vector DBs, and the abstraction of many parts of model training (data curation, for one) into SaaS-like services are making enterprise teams with some degree of MLE talent far more versatile (e.g. Endex, Rogo, and other newer startups whose business models revolve around domain-specific agents).
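As a rough illustration of what's being abstracted away: the retrieval primitive underneath a vector DB is a few lines of linear algebra, which is exactly why it has been so easy to package as a hosted service. A toy in-memory sketch, with made-up document texts and random stand-in embeddings where a real pipeline would call an embedding model:

```python
# Toy version of the core vector-DB operation: index embeddings, retrieve nearest neighbors.
import numpy as np

rng = np.random.default_rng(0)
docs = ["quarterly earnings call", "loan covenant summary", "merger filing"]
# Stand-in embeddings; a real pipeline would embed the documents with a model.
doc_vecs = rng.normal(size=(len(docs), 384)).astype(np.float32)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the k documents closest to the query by cosine similarity."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(search(rng.normal(size=384).astype(np.float32)))
```

Managed services wrap this primitive in sharding, persistence, and an API; the math itself was never the moat.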
The democratization of model training tools is enabling domain experts to build specialized models for their specific use cases, rather than relying solely on general-purpose foundation models.