Proprietary Data Won’t Save You From AI Disruption
Speed to scale matters more, and if your data is truly valuable, it will be valuable enough for others to get in other ways
I often hear executives argue that their company’s proprietary data will protect them from disruption in the age of AI. Their logic usually goes like this: AI labs may have the best foundation models, but they don’t have our data. That exclusivity, the thinking goes, will give the company an edge in its domain.
I understand the impulse. For years, the business mantra has been that “data is the new oil.” Whoever controls it can refine it into power. I’ve lived through this first-hand, watching companies spend years building elaborate pipelines, annotation processes, and specialized models around their own datasets. And then, almost overnight, those carefully tuned systems were leapfrogged by large, general-purpose models trained on petabytes of unstructured data.
Most people I talk to have never heard of Rich Sutton’s “Bitter Lesson,” which argues that general methods exploiting scale will, over time, beat specialized approaches. Nor have they thought much about Julian Simon’s observations on resources—that scarcity rarely holds, because when something grows valuable enough, humans find new ways to produce it. Taken together, the implication is clear: if data really is the new oil, it won’t protect you. Specialized data is no defense in the long run.
Companies that hope to defend themselves with proprietary data should pay attention to where the real AI moats are forming. The only one that seems durable today is speed to scale.
Data may be the new oil, but that shouldn’t be reassuring
The belief that data is the new oil comes from the sense that it is both scarce and proprietary. If you control a unique dataset, you can turn it into defensible value. It seems obvious that data should serve as a fortress wall against competition.
But Sutton’s Bitter Lesson tells a different story. Over decades of AI research, clever algorithms and narrow optimizations have consistently been eclipsed by simpler, general methods scaled up with more compute and more data. We’ve seen this play out with deep learning, first in image recognition, and now with the large language models behind today’s chatbots and agents.
This changes the meaning of data as the new oil. We shouldn’t think of it as a scarce resource but as a commodity, with familiar commodity characteristics like substitution. If data really behaved like oil, then Simon’s rule would apply: scarcity won’t last. When a resource grows valuable enough, people innovate to find more of it, produce more of it, or create substitutes.
Evidence contradicts proprietary data’s value
History supports Sutton and Simon.
AlphaFold, which cracked one of biology’s most complex challenges, didn’t come from a pharmaceutical giant with access to mountains of biological data. Getty and Adobe, sitting on immense libraries of images, didn’t create the world’s most advanced image models. GitHub, sitting on the world’s largest repository of code, doesn’t have the best coding model (GitHub Copilot actually uses models from other companies, like OpenAI and Anthropic).
One famous example in recent years: Bloomberg invested $10 million to train a finance-specific large language model on its own data, “perhaps the largest domain-specific dataset yet.” Then GPT-4, a general-purpose model, outperformed it.
Again and again, the same pattern emerges: the winners aren’t the companies with proprietary data. They’re the ones with the compute, infrastructure, and research talent to put scale to work.
This even applies to data that seems unique. If its value is high enough, others will happily pay to license, distill, collect, annotate, synthesize, and otherwise create it or reasonable substitutes. They might even just steal it and pay the billion-dollar fines that result.
Yes, if your domain is so niche that no one else will bother, you may retain some defensibility. But then, by definition, the opportunity is small. When the stakes are high, competitors will find a way in.
What really matters: speed to scale
So what can help you retain and grow business value in the face of increasingly capable AI models?
The real moat in AI isn’t the data you hold but how fast you can move and scale.
OpenAI, for example, reached hundreds of millions of users in record time, creating a consumer moat that’s hard to challenge. It and its competitors (primarily Google and Anthropic, and to a lesser extent xAI, Meta, and some Chinese labs) race each other to ship new model capabilities and product features that secure scarce users and user attention. They are also locking up land, energy, and GPUs as quickly as possible to ensure future scale.
Proprietary data, by contrast, has rarely been decisive. It doesn’t hold up to theoretical scrutiny (Simon was right; just look at the history of other previously scarce resources, even diamonds) or to empirical evidence (ChatGPT is a direct demonstration of the Bitter Lesson). Data alone won’t save you. Speed to scale can.