New Age Commodities
Introduction
This is a smorgasbord of thoughts on compute, inference, human data, enterprise RL, and empire building behavior in venture.
I considered splitting these into separate posts, but realized that all of them are inextricably linked when examining AI app layer and infra companies today holistically. Additionally, a glance at venture markets, even in a thematic exploration, is increasingly necessary to provide context for some of the decisions made by leading players in the space, as private capital deployment increasingly distorts free markets.
From Cloud Back to Bare Metal?
Perhaps one of the most interesting pendulum effects in maturing AI land is the shift back to bare metal buildouts.
When Amazon Web Services (AWS) popularized cloud computing in the late 2000s, many companies abandoned managing their own physical servers (“bare metal”) in favor of renting cloud instances. The cloud promised flexibility and lower upfront costs, and the question “why build a data center when AWS can do it for you?” proliferated in build vs. buy conversations. Over the years, startups and even enterprises stopped expanding on-premises hardware, choosing to be “cloud-first.” However, the rise of massive AI models and inference workloads is swinging the pendulum back toward bare metal infrastructure. In 2025, demand for dedicated servers is surging again. AI training and deployment require specialized, high-performance hardware (GPUs, high-memory machines) and consistent performance, which bare metal can provide.
Bare metal servers give companies full control over powerful chips (like NVIDIA H100s) with no virtualization overhead, plus the ability to implement advanced cooling (even liquid cooling) and ultra-fast networking (e.g. InfiniBand) for AI clusters. With the recent advent of training runs under new RL paradigms that demand this much compute, companies like CentralAxis (now Aravolta) have emerged to service a new class of hyper-efficient data center design companies.
Companies running steady workloads are finding bare metal can save 30–70% versus cloud, especially with no egress or API request fees and more predictable monthly costs. High-profile examples include 37signals (makers of Basecamp), which undertook a well-publicized “cloud repatriation.” After migrating off AWS, 37signals slashed its cloud bill from $3.2M to $1.3M per year – about $2M saved annually – by running on its own servers. Over a five-year period, they project over $10 million in savings after fully exiting AWS (even after buying ~$700k of new hardware). Many new app layer startups raising Series A rounds, taking advantage of frothy equity markets, are also citing bare metal buildouts as a way to reach net positive margins (see Vibecode).
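To make the break-even intuition concrete, here is a minimal sketch of the cloud-vs.-bare-metal math, using purely illustrative numbers loosely inspired by the 37signals figures above (the real accounting includes many more line items):

```python
# Hypothetical break-even sketch for cloud vs. bare metal over a multi-year horizon.
# All figures are illustrative, not vendor quotes or 37signals' actual books.

def cumulative_cost(upfront: float, annual: float, years: int) -> float:
    """Total spend after `years`, assuming a flat annual operating cost."""
    return upfront + annual * years

cloud_annual = 3_200_000          # e.g. a 37signals-scale cloud bill
bare_metal_upfront = 700_000      # one-time hardware purchase
bare_metal_annual = 1_300_000     # colocation, power, ops headcount

for year in range(1, 6):
    cloud = cumulative_cost(0, cloud_annual, year)
    metal = cumulative_cost(bare_metal_upfront, bare_metal_annual, year)
    print(f"year {year}: cloud ${cloud/1e6:.1f}M vs bare metal ${metal/1e6:.1f}M "
          f"(delta ${(cloud - metal)/1e6:.1f}M)")
```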
Managing bare metal requires skills that many cloud-era teams haven’t needed. Running on-premises hardware means hiring or developing expertise in data center operations, network engineering, and hardware maintenance; 37signals noted that they had to account for “operations roles” and data center costs in their move. Moves like Vibecode App’s, mentioned earlier, were only seriously considered because the founders, despite being full stack SWEs, had built out similar data center rollouts in previous roles at Bluesky.
Industrialization of human data supply chains
With the surge of AI model training and fine-tuning, data has become a critical commodity, giving rise to marketplaces that provide human-generated data and annotations at scale. I’ve written extensively about human data markets, RL environments, and their interplay with app layer RL-native products in these prior posts:
A World of Automatable Domains
A World of Serviceable Domains
Interestingly, RL env companies have recently made the bet that enterprise maturity for the same data products being sold to labs has arrived, or they are simply putting the cart before the horse to make use of frothy fundraising environments. That human data companies are putting datasets in public enterprise view today suggests they think enterprise ACVs, beyond the labs, are worth courting heavily, instead of just soaking up labs’ contracts.
But early envs purpose-built for lab sales still don’t reliably measure real world tasking, as various engineers have pointed out repeatedly on Twitter. A large part of this is the disconnect between domain experts, data contributors, and researchers, especially when researchers remain unsophisticated buyers. The shift to make data markets more “off the shelf” rather than consulting agreements could bring more attention to this.
As I work more in the human data/RL space, the value chain for data keeps industrializing. Specialists in various parts and modalities of data procurement are outcompeting incumbents minted only a couple of years ago. End users of data could clearly achieve better quality and more realistic data if four specialized players handled a data production process rather than one larger one. The usual coordination-difficulty argument against splitting a production process across multiple vendors rather than vertically integrating is growing less persuasive in light of increasingly complex data structures that require increasingly specialized touches to get right.
Agnostically, lab sales have also become a lot more systematized. As recently as a year ago, researchers were both the economic buyer and the user. Today, human data teams are common in large enterprise data buyers, and standardization around data as a vendor project, even in its most sophisticated post training forms, is common. Newer modalities and shapes of data still retain some novelty, but buyers are generally growing more sophisticated where that was clearly not the case before. That lab sales are consequently getting harder makes Micro1’s decision to focus some attention on newer enterprises - buyers who are increasing ACVs while remaining unsophisticated and whose economic buyers are easier to reach - make more sense.
My verifiable domains piece traces how an early player like Mercor shifted focus as human data markets matured - first human labor sourcing, then datasets as a service, then sophisticated post training datasets, and now some sort of RLaaS-adjacent eval infra. All of these build on each other - human labor sourcing is a prerequisite to sophisticated RL env creation - but Mercor continually loses researcher favor to smaller, vertically focused competitors inside large labs (anecdotally, from folks at Deepmind). No player today (Mercor included) has the vertical integration depth of a Standard Oil, and we’ll see fragmentation into subcategories within human data markets.
More here soon on what I describe as the “industrialization” of human data markets.
Slope of Enlightenment for Computer Use Agents
Anyone who has tried to automate a web browser (for web scraping, testing, or using web apps programmatically) knows it can be finicky. Scripts break when sites change, and anti-bot measures often block automated agents. There is a strong case for a platform akin to RapidAPI (an API marketplace), but for these browser automation agents, as well as a new class of MLE infrastructure to help applied AI teams systematically build, maintain, and standardize eval creation on their own product environments.
See Rehearsal AI’s product for the above specifically related to environments, but one can also argue that any AI post-prod engineer like Keystone is usable for this as well.
RapidAPI, for those unfamiliar, was a casualty of ZIRP-era pricing, quickly rising to a unicorn valuation in early 2022 but then being sold in a fire sale to Nokia, essentially for its API infrastructure, a couple of years later.
But RapidAPI created a popular marketplace for traditional APIs, letting developers discover and integrate thousands of services. However, many tasks still require interacting with websites that don’t offer APIs (think of booking tickets on a site, scraping a directory, etc.). An early prototype of what this may look like is Apify, which provides a specialized platform and marketplace for web scrapers and “actors.” Apify’s approach hints at what a “RapidAPI for browser bots” could look like. Instead of just listing raw APIs, Apify allows developers to publish scrapers/automation scripts which Apify runs in the cloud with standardized inputs/outputs. As with out-of-the-box browser infrastructure providers today like Browserbase and Steel.dev, scaling, proxy IPs, and headless browsers are necessary tools for LLM-based scrapers to stay reliable under load and resistant to blocking.
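To make the “RapidAPI for browser bots” idea concrete, here is a minimal sketch of what a standardized actor contract could look like. The names and schema are hypothetical - this is not Apify’s, Browserbase’s, or Steel.dev’s actual API - but it shows the standardized inputs/outputs that make a marketplace possible:

```python
# Hypothetical "browser agent actor" contract for a RapidAPI-style marketplace.
# Names and schema are illustrative, not any real platform's API.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ActorManifest:
    """What a publisher declares so buyers (humans or agents) can discover the actor."""
    name: str                      # e.g. "amazon-price-monitor"
    input_schema: dict[str, str]   # field name -> type, standardized across the marketplace
    output_schema: dict[str, str]
    pricing_per_run_usd: float

@dataclass
class RunResult:
    ok: bool
    output: dict[str, Any] = field(default_factory=dict)
    error: str | None = None

def run_actor(manifest: ActorManifest,
              handler: Callable[[dict[str, Any]], dict[str, Any]],
              payload: dict[str, Any]) -> RunResult:
    """Validate input against the manifest, execute the actor, wrap the result."""
    missing = [k for k in manifest.input_schema if k not in payload]
    if missing:
        return RunResult(ok=False, error=f"missing inputs: {missing}")
    try:
        return RunResult(ok=True, output=handler(payload))
    except Exception as exc:  # a real platform would also retry and rotate proxies here
        return RunResult(ok=False, error=str(exc))

# Toy actor; a real one would drive a headless browser behind rotating proxies.
monitor = ActorManifest(
    name="amazon-price-monitor",
    input_schema={"asin": "str"},
    output_schema={"price_usd": "float"},
    pricing_per_run_usd=0.002,
)
print(run_actor(monitor, lambda p: {"price_usd": 19.99}, {"asin": "B000TEST"}))
```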
The bull case sees something like developers publishing specialized agents (for example, an “Amazon price monitor bot” or an “auto-fill this government form bot”), and users subscribing to these agents via a hub, just like they do for APIs on RapidAPI. As computer use performance, like finance/healthcare/legal task performance, is now commonly benchmarked on new model releases (see Anthropic’s Opus 4.5 release post), it’s not unimaginable that browser agent reliability and costs will see agnostic improvements every 6 months.
Computer use agents are another class of agents that labs are agnostically expected to get to 99% accuracy on simple tasks across most web interfaces. Rumor has it that this is one of xAI’s flagship products, which would be a great win for a lab that, as of Dec 2025, nobody particularly expects frontier models from. While computer use agent adoption remains relatively muted, with consensus use cases for Yutori and Composite still not particularly common, we are quickly approaching the “critical adoption barrier” for computer use agents, where they beat or approach traditional web APIs.
Is this out-of-the-box RLaaS for developers to make long-trajectory browser agent automation reliable? The biggest use case for agent capabilities is search (Parallel Web Systems, Exa), but also see use cases like Agentmail. If a RapidAPI for browser agent infra were to exist, agents could transact and find reliable tools there themselves, eval them for their own use cases, and use them (extending the MoE concept to a wider shopping-marketplace concept). Could this be the precursor to infra for always-on reasoning trace generation?
New problem areas for RL-enterprise adoption
If you take the base assumption that sophisticated applied MLE knowledge will trickle down to regular enterprises (see my previous writing on enterprise adoption of RL), then a separate class of MLE tooling becomes necessary for the 4 core functions of RL. These 4 core functions are:
- Reward Model Scaling
- Data (for RL envs)
- RL env maintenance
- RL env task creation
Most of the problems I will discuss are, at the top level, in pursuit of making deployed models more aligned with business KPIs in an increasingly cost effective manner. This is necessary for enabling enterprise adoption of ML solutions that deliver ROI in both mindshare and cost.
Before you read this, you need to agree with me on the base assumption that RL-based ML, combined with enterprise data, produces extremely valuable ROI for enterprises, given that enterprise context is ingested properly. If you’re still not convinced of that, read my previous writing on the shape of a business here. The TAM is undeniable, and it is there.
Complex Reward Model Scaling and Shaping
Of the four core functions mentioned before, very few people are talking about the issues associated with long horizon agents whose arsenals include ever more tool calls and whose semantically defined reward rubrics are harder to craft. Most complex business processes worth employing expensive MLE talent with RL familiarity on are also ones with sparse rewards (also why we pay white collar workers so much to address them).
These are generally PRM- and LRM-based reward models (Process Reward and Learned Reward Models), where reward depends heavily on complicated intermediate step rewards in CoT/long horizon settings, or where rewards are generated from human judgement. Indeed, shifting environment reward rubrics toward more PRM-based models is a necessity for longer horizon white collar tasks - this is why reasoning traces at different stages of modeling task rewards are so important. Flavors of current enterprise-deployed approaches to reward models include:
- PRM + hybrid reward modeling pipelines like OPRL
- Large-scale RLHF + RL finetuning for real tasks via PPO
- Test-time search in a planning + reward model + RL hybrid method
TLDR: PRMs + PPO + some sort of planning/human oversight as of Dec 2025 is the common stack for building RL-based agents in high level reasoning workflows.
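As a toy illustration of the reward-shaping piece of that stack, here is a minimal sketch that blends a sparse outcome reward with dense per-step scores from a process reward model. The `score_step` function is a stub standing in for a trained PRM and the weighting is arbitrary - real pipelines (OPRL-style hybrids, PPO fine-tuning) are far more involved:

```python
# Minimal sketch of process-reward shaping for a long-horizon trajectory.
# `score_step` stands in for a trained PRM; everything here is illustrative.
from typing import Callable

def shaped_reward(steps: list[str],
                  outcome_reward: float,
                  score_step: Callable[[str], float],
                  process_weight: float = 0.5) -> float:
    """Blend a sparse outcome reward (did the task meet the business KPI?)
    with dense per-step scores from a process reward model."""
    if not steps:
        return outcome_reward
    process_reward = sum(score_step(s) for s in steps) / len(steps)
    return (1 - process_weight) * outcome_reward + process_weight * process_reward

# Toy PRM: rewards steps grounded in a tool call, penalizes unsupported guesses.
toy_prm = lambda step: 1.0 if "tool:" in step else 0.2

trajectory = [
    "tool: pull_invoice(id=123)",
    "reconcile totals against ledger",
    "tool: post_journal_entry(...)",
]
print(shaped_reward(trajectory, outcome_reward=1.0, score_step=toy_prm))
```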
For enterprise deployment, less sophisticated customers no doubt need infrastructure/services to convert business KPIs into rewards for models. The problems around scaling reward models have existed in academic circles for a while - take early work on synthesizing dense reward functions from natural language with LLMs, or on automating live reward function shaping with human feedback pipelines (something that looks a bit more complicated than DPO today).
Accurate and robust reward model shaping for actual business KPIs is crucial because it is the prerequisite for task verification. Once you can verify something, you can automate it and generalize across associated real world tasks given data variation. The TAM is all of labor and more.
There are a few companies whose framings and approaches tackle the issues enterprise RL faces in the reward model space:
Watertight AI is a new venture founded by ex-Anthropic researchers specifically tackling reward hacking in enterprise deployments of long-horizon tasks, starting with the most mature markets like coding and search. The landscape of local-minima landmines expands exponentially the longer the horizon of the white collar task.
Haize Labs is an earlier-founded, research-first company whose products currently center around helping legacy enterprises with existing model deployments avoid business KPI misalignment. How do you prevent your airline agent from hallucinating refund policies in a quest to give customers the “best possible customer service?”
Trajectory is a Google Deepmind spinout building paradigms for automatic reward model scaling with RL, as well as automated RL env and task creation.
Data (Realism and Real time)
Realistic data is one of the five procurement tenets that labs have for current RL environments.
Researchers have also long complained that current commercial env providers’ tasks and evals still do not adequately track real world, economically valuable tasks. This shows up empirically in real world products - accounts of products like Endex and FRL being scarcely used by analysts due to low reliability on actual tasks, despite being procured top down.

From Bilal Zuberi's blog
I link to Bilal’s observations about recent AI app layer products that show weak business fundamentals. Though I hate the now-cliché statistic that “90% of AI pilots currently fail,” this is because the last mile in AI integration and adoption - adaptation to enterprise-specific stakeholder workflows - is the shirked responsibility of AI procurers. But if FRL/Endex’s product doesn’t actually translate to the real world tasks that procurers do - is it valuable?
Current methods for procuring realistic data go through too many rounds of telephone. Human data companies procure datasets from experts that they source, often without being domain experts themselves. These datasets get sold to labs/enterprises touting the expertise of the labelers simply by pedigree. They then get held up as north-star economically valuable tasks by those building app layer products. Even setting aside how lossy this game of telephone is, enterprise-specific quirks complicate AI adoption accuracy even further.
In a perfect world, we’d have built pipelines for realtime enterprise data ingestion (offline user logs & trajectories & reasoning traces) for models that post train/store correct memory in real time. This should extend into RL env creation for enterprise environments, and the tasks/evals for those environments. This is a common project theme in current RLaaS engagements with legacy companies with some level of MLE sophistication today and I expect platformizations of this use case to be common soon.
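A minimal sketch of what such an ingestion pipeline might look like, with hypothetical names throughout: completed user trajectories flow in, get scrubbed at the enterprise boundary, and successful ones become candidate env tasks. Real deployments add consent, schema registries, and far more than this:

```python
# Hypothetical sketch of a trajectory-ingestion pipeline feeding RL env creation.
# Function and field names are illustrative, not any vendor's actual product.
from dataclasses import dataclass

@dataclass
class Trajectory:
    user_role: str            # e.g. "credit analyst"
    steps: list[dict]         # tool calls, UI events, reasoning traces
    outcome: str              # "accepted", "escalated", "rejected"

def redact(step: dict) -> dict:
    """Placeholder for PII/secret scrubbing before anything leaves the enterprise boundary."""
    return {k: v for k, v in step.items() if k not in {"customer_name", "account_id"}}

def to_env_task(traj: Trajectory) -> dict | None:
    """Convert a completed trajectory into a candidate RL env task plus reference steps."""
    if traj.outcome != "accepted" or len(traj.steps) < 2:
        return None  # only successful, non-trivial workflows become golden tasks
    return {
        "role": traj.user_role,
        "initial_state": redact(traj.steps[0]),
        "reference_steps": [redact(s) for s in traj.steps[1:]],
    }

log = Trajectory(
    user_role="credit analyst",
    steps=[{"tool": "pull_filing", "customer_name": "ACME"},
           {"tool": "compute_ratios"},
           {"tool": "draft_memo"}],
    outcome="accepted",
)
print(to_env_task(log))
```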
We just need to solve this one infrastructure problem to solve this entire problem category. Anecdotally, my friend at a P72-like firm already tells me of a huge productivity delta between those using the AI financial analyst tool procured top down and those who refuse to learn it. For our white collar industries that demand 99.9%+ accuracy on routine analyst tasks, we need better paradigms for realistic data that map actual enterprise workflows. This is the last step in making Rogo/Samaya/Endex products an absolute no-brainer for enterprise use (on-prem deployment with small local models and data privacy concerns aside).
Scalable Realistic RL Env Creation (QA and creation velocity)
It’s long been known that the likes of Plato, Hud, Halluminate, etc. are focused on building “automatic” RL env creation engines, especially for computer use agents. In the human data markets business context, this is simply scaling product output and makes sense from a data sales perspective. These engines, though, are also likely to underpin an important part of enterprise RL adoption, and to be among the first platformized RL infra products to hit the market.
We have production and development environments in web 2.0 that have demanded a whole host of tooling built around them to support developer workflows (Git, etc.). As products, as well as internal tooling, turn increasingly agentic, we can imagine environments as part of the dev environments/repos for agents. Naturally, devtool categories worth building in ensue.
If we expect RL to be part of permanent best practices for post-training, continuing into continuous learning paradigms, then we need QA and other developer infrastructure for enterprises to maintain their most emblematic environments. This is one part of the equation for aligning deployed models with actual business KPIs more frequently and easily.
This is an easy problem to visualize given how enterprises are building out their AI/research teams in tandem with product today (especially legacy companies with outdated org structures). When the product team pushes a product update, do all the research teams’ evals/tasks for the agents break? If we’ve built internal agents for internal tooling, starting with Glean-type use cases, how do we systematically make it easier for environments to eval and benchmark agent interactions with new data lakes, product lines, and the ontology/business KPIs associated with those?
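Concretely, this starts to look like ordinary CI: a regression gate that re-runs the env eval suite whenever the product team ships, and blocks the release if the agent’s pass rate drops. A minimal sketch with hypothetical names:

```python
# Sketch of a CI-style regression gate for RL env evals, run when the product team ships.
# Names are hypothetical; the point is that env/eval maintenance looks like ordinary test infra.

def run_eval_suite(agent, tasks: list[dict]) -> float:
    """Fraction of env tasks the agent still completes after a product update."""
    passed = sum(1 for task in tasks if agent(task) == task["expected"])
    return passed / len(tasks)

def gate_release(agent, tasks: list[dict], baseline: float, tolerance: float = 0.02) -> bool:
    """Block the release if the agent's pass rate regresses beyond tolerance."""
    score = run_eval_suite(agent, tasks)
    print(f"pass rate {score:.2%} vs baseline {baseline:.2%}")
    return score >= baseline - tolerance

# Toy example: an "agent" answering lookup tasks against the new product schema.
tasks = [{"query": "refund_policy", "expected": "30 days"},
         {"query": "tier_limit", "expected": "5 seats"}]
toy_agent = lambda task: {"refund_policy": "30 days", "tier_limit": "5 seats"}[task["query"]]
print("release allowed:", gate_release(toy_agent, tasks, baseline=1.0))
```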
Scalable RL Env Task Creation (Synthetic Data)
The first important thing to mention is that synthetic data by itself shouldn’t ever be able to fully push frontier model performance in some non-human-in-the-loop, unassisted way. If this ever happens, I submit that as the true definition of AGI, since it would mean we have built systems that can create truly out-of-distribution products from in-distribution data.
Today, synthetic data plays a huge part in unlocking human data to support frontier model development, a huge part in providing direct training data for cost effective N-1 models, and in easier eval creation for existing model implementations. Talent with the skillset to produce diverse, high quality synthetic data, however, remains even more elusive and academic than MLE talent with production RL knowledge.
As environments increasingly emulate harder-to-simulate real world white collar domains, the pool of qualified contributors for them shrinks, and we’d rather maximize contributors’ time generating reasoning traces to solve tasks and issues within the environments rather than creating tasks. A number of RL env companies already use synthetic data approaches to augment qualified contributors’ output by automating things like task creation so as to maximize reasoning trace/trajectory creation; the challenge lies in predicting which tasks will teach the model most effectively.
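A minimal sketch of that augmentation idea, with a stubbed generator standing in for the LLM call; note that the hard part flagged above - predicting which tasks teach the model most effectively - is exactly what this naive version does not solve:

```python
# Sketch of synthetic task augmentation from a small pool of expert-written seed tasks.
# `propose_variant` is a stub where an LLM call would go; all names are illustrative.
import random

def propose_variant(seed_task: dict, rng: random.Random) -> dict:
    """Stand-in for an LLM that perturbs entities/constraints while keeping the rubric."""
    variant = dict(seed_task)
    variant["deadline_days"] = rng.choice([3, 7, 14, 30])
    variant["currency"] = rng.choice(["USD", "EUR", "JPY"])
    return variant

def augment(seed_tasks: list[dict], n_per_seed: int = 3, seed: int = 0) -> list[dict]:
    """Expand expert seeds so contributor time goes to solving tasks, not writing them."""
    rng = random.Random(seed)
    return [propose_variant(t, rng) for t in seed_tasks for _ in range(n_per_seed)]

seeds = [{"domain": "accounts payable", "goal": "match invoice to PO", "rubric": "exact match"}]
for task in augment(seeds):
    print(task)
```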
Additionally, “synthetic data” has long been the only way to collect datasets for some video/LiDAR use cases, like those in AVs and initial robotics applications. As cost-for-performance optimized models proliferate in app layer products, synthetic data experiments for small model deployment will proliferate too, especially as human data production costs remain prohibitively tied to frontier lab exclusivity and WIP.
If you can crack truly diverse synthetic data generation from real world datasets, there is no reason why localized small-scale eval production shouldn’t massively increase. If this commoditizes, it is one part of the equation for aligning deployed models with actual business KPIs more frequently and easily. Perhaps this is some abstracted version of what the “alignment” teams in all of the large labs are actually doing, hopefully with fancier architecture.
Bandwidth Marketplaces in the 2000s vs. Compute Marketplaces Today
In the early 2000s, during the dot-com boom, some envisioned that telecom bandwidth would be traded like a commodity, with exchanges where ISPs and businesses could buy/sell network capacity (analogous to how electricity or oil are traded). Enron famously tried this, creating a platform to trade fiber optic bandwidth capacity. Back then it seemed logical: internet traffic demand was surging, and an exchange that matched those with excess network capacity to those who needed it seemed sound.
But several things became clear by 2001:
- Fungibility was a pipe dream: Fungibility and standardized units of exchange, or at least some graded version of them (see differential grading in the oil markets), are necessary for any commodities market to be built. The owners of fiber networks (telcos and ISPs) disliked the idea of bandwidth becoming a standardized commodity. They continued to offer/build their infrastructure with anti-fungible components – latency, reliability, etc. differ – so bandwidth from one provider isn’t perfectly fungible with another. Carriers preferred private contracts and variable pricing. They “want customers to be confused about pricing,” an analyst quipped, rather than a transparent, one-size-fits-all market. Simply put, bandwidth owners (like the compute owners today) had product pricing characteristics intrinsically dependent on non-fungibility.
- Oversupply of Bandwidth: The dot-com bust left a glut of fiber capacity (“dark fiber”). Suddenly there was far more supply than demand, and prices plummeted ~80% on major routes. For example, a standard New York–London data contract dropped from ~$30k to $5k in under a year. This price collapse made trading unappealing – why bother when bandwidth was so cheap and plentiful? Enron’s bandwidth trading business was tiny compared to its energy trading, and with prices in freefall, the market never gained liquidity. We can see some parallels to compute today, where oversupply is quickly being approached as inference could plausibly move majority-local and large training job spend appears confined to frontier labs.
- Fear of Undercutting & Price Arbitrage: If carriers dumped their excess bandwidth on an open exchange at bargain rates, they risked angering existing customers who paid higher prices. Those customers might demand to renegotiate if they see market prices far lower. So, big providers avoided participating to not upset the apple cart. Instead of an open marketplace, deals remained closed and brokered quietly (often by bandwidth brokers who took a fee).
Fast forward to today: Could a marketplace for computing power succeed where bandwidth trading failed? There are signs it might. The context is very different:
- Demand Volatility: High-end compute (like GPU compute for AI) is currently in high demand and often more supply-constrained than not. Top GPUs (NVIDIA H100s, etc.) are expensive and sometimes back-ordered. Unlike the bandwidth glut of the 2000s, we have a compute scarcity (and very high demand) in the AI boom.
- Multiple Willing Suppliers: Compute capacity is owned by many parties – big cloud providers, smaller cloud/colocation firms, even individuals with GPU rigs. Many of them want to monetize spare capacity. Moreover, while the argument could be made that large hyperscalers make it their business to supply and to be as much of a monopoly as possible, there is a growing long tail of data center customers who build out substantial compute for their own reasons and seek to monetize/offload excess capacity. This alone justifies compute exchanges and APIs for routing disparate compute for these companies’ buildouts.
- Standardization via Virtualization: In 2000, buying bandwidth from carrier A vs carrier B might involve different contracts, technical interfaces, etc. Today, containerization and cloud software stacks have made compute far more fungible. A GPU is a GPU; if it meets certain specs (GPU model, VRAM, etc.), you can run your Dockerized AI workload on it regardless of who owns it. This standardization (Docker, Kubernetes, etc.) means a job can be shipped to any provider with minimal fuss, enabling a true marketplace (see the sketch after this list). For instance, Vast.ai’s platform lets users filter by GPU type, RAM, and price, and launch on any matching machine across providers. Even SLAs can be ported over and integrated bespoke, at scale, for each customer-to-customer interaction, easing adoption.
- Closed market incentives negated: I mentioned before how, when “carriers dumped their excess bandwidth on exchanges at bargain rates,” they angered existing customers who paid higher prices. While we generally like fungibility in commoditized marketplaces, the lack of absolute fungibility in compute marketplaces - due to uptime and the physicality of data centers (and the associated risk metrics) - means that enterprise-to-enterprise SLAs still serve a purpose, and infra in early compute marketplaces already supports this.
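Here is the sketch referenced above: a toy illustration of spec-based matching on a compute marketplace. The offer fields loosely echo the Vast.ai-style filters mentioned earlier, but the schema and numbers are made up, and the uptime field hints at the residual non-fungibility that keeps SLAs relevant:

```python
# Toy sketch of spec-based matching on a compute marketplace. Schema and numbers are made up.
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str
    gpu: str
    vram_gb: int
    usd_per_hour: float
    uptime_pct: float   # the non-fungible part: reliability still differs by operator

def match(offers: list[Offer], gpu: str, min_vram: int,
          max_price: float, min_uptime: float) -> list[Offer]:
    """Because the workload is Dockerized, any offer meeting the specs is substitutable."""
    eligible = [o for o in offers
                if o.gpu == gpu and o.vram_gb >= min_vram
                and o.usd_per_hour <= max_price and o.uptime_pct >= min_uptime]
    return sorted(eligible, key=lambda o: o.usd_per_hour)

offers = [
    Offer("hyperscaler-a", "H100", 80, 6.50, 99.9),
    Offer("regional-colo", "H100", 80, 3.20, 99.0),
    Offer("gpu-rig-owner", "H100", 80, 2.10, 95.0),
]
for o in match(offers, gpu="H100", min_vram=80, max_price=5.0, min_uptime=98.0):
    print(o)
```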
One metric of concentration: currently NVIDIA’s revenues are heavily concentrated in a few big cloud customers (two unnamed customers made up ~39% of its revenue in Q2 2026).
Aravolta’s blog points out a big shift that caused bandwidth markets to collapse - the rollout of DWDM, which represented a massive, near-instant 100x increase in capacity. Their argument is that such an instant and massive rollout is not possible given how physical the constraints in today’s compute markets are.
I tend to agree - even if lab results demonstrate massive energy efficiency gains from deep tech innovations like neuromorphic computing or alternatives to silicon-based chips, rollout still depends on answering production deployment questions like longevity in physical environments. There is no one-click, easily distributable software solution for massive compute gains, even if you think the deeptech moonshot bets will work. Even alternative model architectures that promise massively lower latency and lower costs still require months of model training (see Deepgrove AI).
Why Data Centers Won’t All Become Fungible Commodities (or: More Than 2–3 Players Will Persist)
It’s tempting to assume that in the future, all computing will be done in a few giant “utility” clouds (like an oligopoly of 2–3 hyperscalers). To some extent, AWS, Microsoft Azure, and Google Cloud do dominate. Yet, it’s very unlikely that data centers themselves become fully fungible or consolidated to only those players. There are several reasons rooted in locality and physical realities:
- Latency and Edge Computing Needs: Not all workloads can tolerate the latency of a far-away cloud region. As the world embraces real-time applications (IoT, AR/VR, autonomous vehicles, remote surgery, etc.), we need computing physically near the end-users or devices. Edge data centers - smaller facilities in regional hubs or even on the “last mile” - are the solution. Physical compute infrastructure increasingly sits in the hands of operators, and the long tail’s ownership of compute is growing. Even if you don’t believe that new age startups are really building out their own bare metal, 5G networks often pair with local edge servers so that data from your phone or car is processed in a nearby city, not across the country. Likewise, new-age smart factories or hospitals might have micro-data centers on-site to do AI inference or data processing instantly, especially the kind that companies like Mobius and Optica will trend toward if successful. The big three cloud providers alone cannot practically cover all these micro-locations with their own massive data centers, nor do they want to reliably operate geographically distributed small data centers, so regional players and on-prem deployments will coexist, ensuring data center infrastructure remains distributed.
- Data Sovereignty and Local Regulations: Geopolitical data issues endure and are even blown out of proportion, as evidenced by many Americans’ enduring wariness about using local DeepSeek/Kimi models even with slight biases trained out of them. Many countries, regions, and even states have strict laws about where data can reside and who controls it. The EU, for instance, has been championing “European data sovereignty.” Even if AWS/Azure build regions in Europe, there’s a legal nuance: if the company is US-headquartered, European regulators (and customers) worry about the U.S. CLOUD Act and other foreign access. While local physical setup matters for sovereignty, the closer-to-home issue is that data laws differ vastly across states. Luddite reactions also vary substantially across states, making seamless rollout of new data center construction by hyperscalers hard.
- Geographic and Infrastructure Constraints: Data centers are physical buildings that need land, power, and connectivity. In some locales, the hyperscalers may not invest due to limited market size or lack of infrastructure. That opens opportunities for regional data center companies. For instance, in parts of Africa or SE Asia, local telcos or entrepreneurs have built data centers where Amazon/Google have no region (or only a very limited presence). Similarly, even within countries, there are secondary cities where edge colocation providers operate because the big clouds only have zones in major metros. Power and space are also limiting factors, as certain high-density areas (like NYC, London) struggle with power capacity for new mega data centers. Smaller, distributed facilities can collectively add capacity where one giant couldn’t. Properly geographically distributed data centers have cost advantages orthogonal to concentration in urban areas, also leading to interesting societal shifts that I explore in a different article.
- Tailored Solutions and Specialized Hardware: Not every computing workload is best served by a generic cloud server in a faraway region. Some require specialized hardware or setup - for example, ultra-low latency trading systems might colocate in a specific exchange’s data center, or high-security government systems might be in a bunker with certain specs. These unique needs mean many organizations maintain their own data centers or use niche colo providers. The big clouds try (with Outposts, Azure Stack, etc.) to cover on-prem needs, but that still counts as more distributed infrastructure, not just a few central sites.
- Resilience and Multi-Source Strategy: Relying on only 2–3 mega-providers poses systemic risks. A failure or cyber-attack on one could knock out a huge portion of services. Many enterprises thus use multi-cloud or hybrid strategies to avoid single points of failure. This inherently supports multiple platforms. Specialized inference that is too compute intensive for local devices but best run on local data center networks also plays into a physical version of cybersecurity’s “microsegmentation” strategy as data centers become integral to firm-level operations.
In more non-PC terms - data centers are run by boomers. The delay in rolling out skilled blue collar labor seen recently at Coreweave reflects enduring blue collar labor issues around running high tech operations in non-tech-dense areas. Data centers are not being consolidated because there are advantages to building them in non-tech-adjacent areas: lower physical maintenance costs mean more contracting of people outside of tech to run them.
In practical terms, data centers are not fungible commodities like oil barrels. Each data center has unique aspects: location, latency to certain user bases, local regulations, even differences in cost structure (e.g., hydro power in one region vs expensive electricity in another). Because of these local characteristics, it’s unlikely we end up with just 2–3 colossal providers owning everything. Instead, we’re seeing a hybrid cloud and edge world: big players for general-purpose compute and global scale, regional players for proximity and sovereignty, and on-premise deployments for specialized needs.
If interested in diving more into the technicalities, Aravolta (CentralAxis) does a great job of illustrating this in a recent blog post, pointing out more differences between the overbuildout of bandwidth in the early 2000s and the physical realities of compute deployment today.
On a personal note, I love this because it brings us closer to a world of Simon Stålenhag’s art.
Inference will move to the edge
AI models will get so efficient and compact that many tasks currently done in cloud data centers will be handled locally on our personal devices (phones, laptops, IoT devices). Chris Paik had a great initial writeup on this. Large scale training runs will still occur in data centers, but general consumer hardware will be able to run ever more capable local models that handle current SOTA workflows - see the gains that deepgrove.ai has made in producing SOTA models with 10x less compute and latency with its bitnet-based Bonsai series of models.
Bare metal investments can better be thought of as replacing employee headcount by abstracting away low level reasoning - CapEx = employees with local models. Bundling reasoning into software creates exponential value the longer the end-to-end automation is.
We’re already seeing the first steps: “On-device AI” is a major trend. Smartphone chips now include NPUs (Neural Processing Units) to run AI tasks. Apple’s latest iPhones, for example, can do speech recognition and image analysis on-device. Google has moved some of its Assistant processing on-device for speed and privacy. Running AI locally reduces latency, preserves privacy, and can even save bandwidth (no need to send data to cloud). Qualcomm found that on-device inference is more energy-efficient when you factor in the whole round trip to cloud.
If you take any of the deep-tech innovations to produce a 10x effect on inference (Cerebras Chips, neuromorphic compute, drastically scaled up bitnet model architecture, etc.), the immediate effect would be a reduced load on central data centers for inference.
This is the setup, and the nuances and 2nd order effects to explore are numerous and exciting because they make new classes of infrastructure problems addressable.
Even if small models can handle a lot of tasks, the largest, most advanced models might always be several steps ahead, requiring cloud infrastructure. So data centers likely still host the frontier models and high-end tasks. It could evolve into a tiered system: your device uses a local model for quick, simple queries, but for a very complex query it forwards the request to a cloud super-model. This is analogous to how some computing moved to the edge (think of caching or CDNs in web content), yet central servers still exist for heavy lifting. We expect a gradual process here, where the interim will probably see planner/higher-level models running in cloud or localized data centers orchestrating on-prem models.
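A minimal sketch of that tiered routing, with illustrative model stubs and thresholds: try the local model first, and escalate to a cloud frontier model only when confidence or complexity demands it:

```python
# Sketch of tiered local/cloud routing. Model names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float
    served_by: str

def local_model(query: str) -> Answer:
    """Stand-in for an on-device NPU model: cheap, fast, private."""
    conf = 0.9 if len(query.split()) < 12 else 0.4
    return Answer(text=f"[local] {query}", confidence=conf, served_by="on-device")

def cloud_model(query: str) -> Answer:
    """Stand-in for a frontier model behind an API: slower and metered."""
    return Answer(text=f"[cloud] {query}", confidence=0.97, served_by="cloud")

def route(query: str, min_confidence: float = 0.7) -> Answer:
    draft = local_model(query)
    return draft if draft.confidence >= min_confidence else cloud_model(query)

print(route("summarize this email").served_by)                        # on-device
print(route("draft a 10-K risk-factor comparison across five years "
            "of filings with citations").served_by)                    # cloud
```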
Data center rollout will probably shift toward supporting edge. Instead of only building giant regional centers, we might see more mini data centers co-located with 5G towers or in neighborhoods, acting as intermediates. If personal devices do more, they might also collaborate with nearby micro-servers for tasks that are a bit too heavy or for aggregating learning (like federated learning updates). The overall compute demand might still grow, but distributed more among devices and smaller data centers rather than exclusively hyperscale ones. The energy to run computation will exist in many distributed 5-watt phone chips instead of one big megawatt data center.
From an economic view, on-device AI could save cloud providers costs (they don’t have to handle every single request) but also reduce their direct revenue (less reliance on their APIs). Consumers and enterprises might prefer on-device for cost and privacy, so cloud companies will adapt by offering tools to sync and update those edge models (and of course, still selling the hardware or software that powers on-device capabilities). A small note - there is currently some free market interference here, with kingmaking players forcing top app layer companies in some areas to use foundation labs’ APIs (their portcos) rather than push for small model development.
The 2nd and 3rd order effects of this gradual rollout potentially include tailwinds for solutions assisting the long tail of AI capex investments, and meaningful tooling to seamlessly designate models as on-prem vs. cloud, especially for cybersecurity and cost optimization use cases. Edge inference is also a massive boon for the consumer and robotics worlds and may be key to adoption shifts there.
Hyper-Concentration of Customers in the AI Industry
Hyperconcentration in oligopolistic buyers is a defining characteristic of human data markets as well as emerging RLaaS use cases. This “hyperconcentration” can be risky, but it’s a common pattern in the current AI landscape, where a few big spenders dominate. Fireworks at one point had nearly 70% of its customer usage/revenue tied up in just two clients: the AI startups Cursor and Perplexity. In other words, Fireworks was hugely dependent on Cursor and Perplexity as anchor customers. If one of them left or shrank usage, it would have caused a major hit to Fireworks’ business.
Fireworks is not alone. Many B2B AI companies today have maybe a handful of “whale” customers rather than a broad base. Why is this?
- Power-law of AI spend: The organizations training the largest models or deploying at scale are relatively few (think: OpenAI, Google, Meta, Anthropic, Microsoft, a couple of well-funded startups like Inflection, plus perhaps government or large enterprise initiatives). These few have outsized budgets and needs. If you sell a service into one of them (say, you’re providing a data annotation platform or a model optimization service), that one contract can be enormous – potentially eight or nine figures. Landing two such customers can vault a startup’s revenue to impressive levels (Fireworks “immediately bootstrapped” to high ARR by serving just Cursor and Perplexity, who themselves had significant AI workloads). The flip side is, outside those top players, there may not be a long tail of tens of thousands of smaller customers (at least not yet). So, naturally, AI B2B providers become very concentrated on a few big fish.
- AI startups and SaaS hyperconcentration: If we look at AI software startups selling to enterprises, we often see a similar pattern in early stages – e.g., an AI enterprise SaaS might get one Fortune 500 pilot that accounts for the majority of its initial revenue. It’s somewhat typical in B2B that early on you have revenue concentration (one big logo can outweigh a dozen small ones). But in AI it’s pronounced because the contracts can be huge and the pool of adopters is still small (not many companies are ready to spend millions on AI tooling yet, outside of tech). As another example, some of the new AI research cloud startups (offering access to GPUs) might have most of their usage from one or two hyper-scale customers who use them as overflow, rather than thousands of equal-sized customers (see customer concentration, not by choice, in base10 as well).
- Risks: The obvious risk of hyperconcentration is dependency. If 70% of your revenue comes from two customers, your fate is tied to theirs. If their budget is cut or they build an in-house solution or a competitor underbids you, you’re in trouble. We saw a mini-example with Scale AI: losing OpenAI/Google as clients (even if for non-business reasons like conflict of interest) opened the door for competitors and forced Scale to find new revenue sources. For startups, losing a whale early can be fatal if you can’t replace that income. All human data companies’ revenues, for example, are levered to the success of OAI/Anthropic/Deepmind.
In a world where enterprise maturity is still coming to light, early hyperscaling startups are the ones with the pedigree and relationships with all the other hyperspender startups. It is in these environments that oligopolistic capital providers can actually provide the most value.
It’s all one big club and you’re not a part of it if you don’t know this (are you connected at least at a 2nd degree to the founders of AC? Or Mercor?). In the aforementioned linked article, the fact that only hyperspenders are mature enough to buy frontier ML solutions (and represent a handy chunk of the TAM), combined with the fact that most of these are blue chip, T1-backed logos, creates an interesting side effect of kingmaking in venture markets.
Agnostic Product Improvement
One striking aspect of many AI products today is how much their quality depends on the underlying model (often from a third party) versus their own software engineering. For example, an AI copywriting app built on GPT-3 saw a huge performance jump when GPT-4 became available, not because the company innovated, but because OpenAI did. This has led to a scenario where product improvements are largely driven by new model “drops” (releases of more powerful models) rather than the product team’s iterative engineering. It’s almost akin to your app suddenly getting better because the “engine” it runs on got an upgrade from an outside supplier.
Let’s examine the dynamics. Firstly, it means many AI startups are “thin wrappers” over the same handful of foundational models. Ethan Ding has a more creative name for GPT-wrappers (which people have come around on since the early doubts about their economic viability at GPT-4’s release) - token streams. The value is accruing to the model providers (and their GPU suppliers), resulting in their active encouragement of apps in their ecosystem. Indeed, encouragement from kingmakers may even be the reason why app layer companies like Rogo and Endex continue to use OpenAI’s API instead of building out their own local models to drive towards positive margins.
So where does value accrue for app layer companies today? The industries whose enterprise behaviors somewhat resist this phenomenon emphasize the following:
- Infrastructure and Cost Efficiency: Companies can invest in engineering solutions to serve AI faster or cheaper. For instance, developing a more optimized inference server, or using bare-metal GPUs you manage to reduce cloud costs (as discussed in the first section). If one team figures out how to run a model with half the compute, they can outperform rivals on speed/cost. This might involve low-level model optimizations, better caching of results, distilling models, etc. Engineering talent that traditionally worked on distributed systems, compilers, or GPU optimization becomes very relevant.
- Distilled small models: Relying strictly on a generic base model means you’re waiting on others for improvements. Small model development and training for one off workflows and domains is becoming increasingly common. For example, an AI startup might maintain its own fine-tuned version of Llama-2 or another open-source model, and put engineering effort into improving its performance on their domain through data and training tricks. This is hard but creates an internal IP and improvement path not shared by everyone. It’s an engineering and ML challenge (collecting the right data, training efficiently, etc.), but yields control - a strategy that leading app layer companies like Cursor have taken as ground truth. As architectures like bitnet and other distillation techniques become common practice in the zeitgeist, we can expect this to become part of regular engineering practices.
- Memory management and increasingly complex reward function modeling: Many current AI apps basically provide an interface to a model. Commonplace MLE infra today includes memory modules that remember user context beyond what the base model does, tool integration (retrieving info via web searches, etc.), or multi-model orchestration (using the best model for each sub-task). As agents are increasingly deployed in long form contexts, the four issues regarding RL that I’ve mentioned before come increasingly into play. Reward function and agent definitions will grow more complex than system prompts in OpenAI calls. Memory management will increasingly be tackled via knowledge graphs and heavier-weight solutions that look more and more like human neural networks (a minimal sketch of the memory-module idea follows below).
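As promised above, a minimal sketch of the memory-module idea: a tiny (subject, relation, object) store that persists user context across sessions and gets injected alongside the base-model call. Names and structure are illustrative; production systems use real knowledge graphs, embeddings, and retention policies:

```python
# Minimal sketch of an app-layer memory module queried before each model call.
# Graph structure and names are illustrative only.
from collections import defaultdict

class MemoryGraph:
    """Tiny (subject, relation, object) store persisted across sessions."""
    def __init__(self) -> None:
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def remember(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject].append((relation, obj))

    def recall(self, subject: str) -> list[tuple[str, str]]:
        return self.edges.get(subject, [])

def build_prompt(memory: MemoryGraph, user: str, query: str) -> str:
    """Inject recalled facts alongside the user's query before the base-model call."""
    facts = "; ".join(f"{r} {o}" for r, o in memory.recall(user))
    return f"Known about {user}: {facts or 'nothing yet'}\nUser asks: {query}"

mem = MemoryGraph()
mem.remember("analyst_42", "covers", "EU industrials")
mem.remember("analyst_42", "prefers", "tables over prose")
print(build_prompt(mem, "analyst_42", "summarize Q3 earnings"))
```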
New Devtooling for agent-human interactions
The cliché that dead internet theory has been true for a while (internet bots, whether brittle RPA or LLM-powered) is commonly repeated in popular culture but not internalized in infrastructure practices (besides the humble Captcha’s widespread adoption). If agents are increasingly transacting, does it make sense to have them do it via our APIs or via UI-level automation? And moreover, why haven’t popular human-human interaction channels (SMS/WhatsApp/other social messaging) adopted common agentic communication protocols yet?
The Interaction Company of California’s Poke product is probably the first widely adopted consumer agent that lives on SMS/WhatsApp. The infrastructure the team built to support context retrieval, management, and API integrations with SMS/WhatsApp/other messaging services was completely bespoke and serves as a strong open source framework for other products that need common human-agent interaction vectors. Photon.code is another entry into what will likely become a growing set of infrastructure platforms supporting opinionated agent-human interaction.
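To make the shape of that infrastructure concrete, here is a hypothetical sketch of an agent-to-messaging bridge: an inbound message handler, per-conversation context, and an outbound reply. All provider calls are stubbed, and none of this reflects Interaction Co’s or Photon.code’s actual stack:

```python
# Hypothetical sketch of an agent-to-messaging bridge (inbound webhook -> agent -> reply).
# All channel/provider calls are stubbed; names are illustrative.
from collections import defaultdict

conversation_context: dict[str, list[str]] = defaultdict(list)

def agent_respond(history: list[str], message: str) -> str:
    """Stand-in for the actual agent (LLM + tools); here it just acknowledges context."""
    return f"(seen {len(history)} prior msgs) got it: {message}"

def send_sms(phone_number: str, text: str) -> None:
    """Stub for a messaging provider API (SMS, WhatsApp Business, etc.)."""
    print(f"-> {phone_number}: {text}")

def handle_inbound(phone_number: str, text: str) -> None:
    history = conversation_context[phone_number]
    reply = agent_respond(history, text)
    history.extend([text, reply])          # persist both sides for future turns
    send_sms(phone_number, reply)

handle_inbound("+15550100", "book me a table for two on friday")
handle_inbound("+15550100", "actually make it saturday")
```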
For those with early access to Sesame’s new consumer product, it blows anything else I’ve seen out of the water. Having a speech-to-speech, GPT-level companion that has context over all my prior conversations and that I can ask to view and critique my work (as well as web search) is the most frictionless agent-human interaction layer I’ve seen so far. I hate, yet am incredibly impressed by, how often I slip into human-human mannerisms when interacting with it. It makes sense why they didn’t open source their largest models earlier this year given how much their research has bridged the uncanny valley.
With no empirical evidence, a breakthrough in voice to voice naturalness, uncanny valley, and full duplex architecture at scale is likely to be the real catalyst for voice agent adoption in consumer/enterprise settings. The “agnostic product improvement by new model release” will likely hold here as well - but the difference is that enterprises are already adopting vertical-specific voice agents in their banal unnatural forms as evidenced by startups like Krew.
Talent, Empire Building, and Overcapitalization
Founder friendly markets amid a Cambrian explosion of app layer use cases, as well as the trickling of M&A into private markets, have also amplified talent wars.
Acqui-hiring smaller folks is increasingly common. Mercor has sent 5+ of my friends’ startups acqui-hire offers (mostly stock) which haven’t been taken. When a blue-chip company like Kalshi sees a new vendor that presents actual value that they couldn’t easily hire for, their immediate reaction is to acqui-hire. Heated M&A in private markets is both a combination of capital abundance and easily identified acqui-hire targets given few identified lighthouse customers (to quote Bilal Zuberi).
- This is a double-edged addendum to my earlier statement that selling to blue chip logos is a great early path to 9-figure ARR. Simply put, some of these logos are shitty customers who will never view you as an equal partner - many teams I’ve made intros to for blue chip logos get churned because they turn down acqui-hire offers
- Indigestion is rampant - integration time for acqui-hired teams is deprioritized currently in favor of talent aggregation
For those who come from public markets backgrounds, M&A as a means of synergistic expansion may not be surprising. But in private markets, M&A for talent acquisition is becoming increasingly common for the following reasons:
- Capital Gluts from Fundraising Environment
- Founders are encouraged to raise/are pre-empted —> most of the round proceeds go into expensive acqui-hires to deploy capital
- Investors need decacorn possibilities from a static pool of high talent pedigree founders —> push founders to acqui-hire as one of the most direct ways to empire build into a possible decacorn
- Empire building tendencies in AI as most everything is virgin white spend territory
- If not for the glut of human data and RL env startups that followed Mercor, why couldn’t Mercor have, in Brenden Foody’s words, been the “eval provider for everything?”
- Increasing proliferation of teams that have raised small amounts from YC-type pre seed investors, become profitable (but not venture-scalable) and are now being scooped up as expensive engineering hires that come with a few built out internal tools (and also allowing the acquirers to boast that most of their team are “ex-founders”)
- Saves face for founders, gives them nice payouts, and lends acquirers more credence to be kingmade based on perceived “talent density” in their teams
When faced with the dearth of MLE talent in the bay as AI has produced a “skillset reset,” this strategy seems even more obvious. I previously wrote about interesting observations (besides wanton M&A offers) about how companies were attempting to win talent hires today here.
- The best talent for building the MLE infra associated with the RL problems I identified earlier are the people who’ve had enterprise experience building deployed AI in the most mature AI app markets. Today that is coding and search (look at ex-Windsurf and, soon, ex-PWS and Exa engineers). (See Trajectory)
These are not normal phenomena, and they are indicative of high beta cyclism. We shouldn’t forgive enduring poor business fundamentals, even if you believe kingmaking works. These are the features of today’s market that we should especially internalize as cyclical:
- When the ACs and Mercors of the world make it normal to annualize 3-month pilot revenue into front-facing business metrics, whilst also normalizing services-based revenues that don’t care about margins/productization, then we similarly need a clear line of sight to productization, as I write about in “A world of serviceable domains.” While we accept hyper customer concentration as a result of sparse talent concentration today, we shouldn’t make it a new normal, and should always be on the lookout for things like “enterprise maturation,” which is the single best leading indicator of whether RL-based MLE infra companies will find larger TAMs.
- The MLE talent that these companies want to acquire today are either highly mercenary, waiting out the war in quant trading firms, or making use of frothy private capital markets.
- Much of the contract spend today can only be accessed by playing to the conditions wrought by high beta cyclism. That is, many of the sophisticated companies today with inordinate budgets for experimentation are the ones playing by today’s perverse incentives, and sometimes being kingmade. As Liz Wiesel expressed - recognizable logos are now valued more than ever in fundraising and procurement, and legacy enterprises are often discriminated against in attempts to move to the new gold rush.
A Small Note
I’ve written and explored too much about the human data and RL space at this point not to have identified opportunities that are too high-EV not to capitalize on. My time at Hummingbird and experiences before that have taught me not to let opportunities slip by, and how to find Euclidean ways to bound Manhattan-based distances.
I’m working on something unique in the human data space that unlocks real world enterprise data at scale and invite any at scale end dataset users/buyers to chat more at cr4sean@gmail.com. The only additional note I’ll make here is that the human data supply chain will fracture further into specialists at stages most people don’t even know exist, and that’s what I’m exploring.
The Standard Oils of today are titans with infinitely more weaknesses that can be chipped away at without antitrust regulation.