
Nvidia-backed Startup Bets on Synthetic Data for AI

This week, SandboxAQ, an artificial intelligence startup spun out of Alphabet and backed by Nvidia, released a trove of 5.2 million “synthetic” molecules. The data, generated by computers rather than discovered in a lab, is designed to train other AI models to drastically speed up the discovery of new medicines. The move highlights a critical, looming problem for the entire AI industry: after years of scraping the web, it is starting to run out of the high-quality human-generated data needed to make its models smarter.

For an industry built on the premise that more data and more computing power will inevitably lead to more intelligence, this data bottleneck represents an existential threat. The solution, as demonstrated by SandboxAQ, is increasingly to have AI create its own data. This shift is not just changing how models are trained; it's reshaping the competitive landscape and raising new questions about the future of AI development.

Why is the AI industry facing a data shortage?

For years, the development of large language models (LLMs) has been guided by “scaling laws,” a concept formalized by researchers at OpenAI, including Dario Amodei, who later co-founded Anthropic. The principle is simple: making models bigger and feeding them more data and computing power yields predictable improvements in capability. This drove tech giants to train their models on vast repositories of text and images scraped from the open internet.
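
To make the intuition concrete, the sketch below shows the kind of parametric relationship scaling-law papers fit: predicted loss falls as a power law in model parameters and training tokens. The functional form follows the published work, but the constants here are invented purely for illustration.

```python
# Hedged sketch of the scaling-law intuition: pretraining loss is commonly
# modeled as a power law that falls as parameter count (N) and training
# tokens (D) grow. The constants below are invented for illustration only.
def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric form: L(N, D) = E + A / N**alpha + B / D**beta."""
    return e + a / n_params ** alpha + b / n_tokens ** beta

# More parameters and more tokens lower the predicted loss, but each 10x
# buys a smaller absolute improvement -- the "diminishing returns" pattern.
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {predicted_loss(n, d):.3f}")
```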

But that strategy is hitting a wall. In late 2024, reports emerged that leading labs like OpenAI, Google, and Anthropic were seeing diminishing returns, with new models struggling to significantly surpass their predecessors. A key reason, according to a Bloomberg report, was the increasing difficulty in finding new, high-quality training data. OpenAI’s Orion model, for instance, reportedly fell short on coding tasks due to a lack of sufficient new coding data to learn from.

This scarcity is compounded by legal and ethical pushback. Creators and publishers are increasingly fighting the unlicensed use of their work. High-profile lawsuits, like the one filed by The New York Times against OpenAI, challenge the core practice of web scraping, and a growing number of websites now use technical means to block AI crawlers, effectively closing off the open web that AI companies have relied on. The Data Provenance Initiative found that in a single year, the share of high-value websites restricting crawler access rose from 3% to more than 20%.

What is synthetic data?

Synthetic data is information generated by computer simulations or algorithms rather than being collected from real-world events. In the context of AI, it means using one AI model to create new data—text, images, or, in SandboxAQ's case, molecular structures—to train another model. This allows developers to create massive, tailored datasets without scraping the web or running costly physical experiments.
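
In miniature, the pipeline looks like the sketch below: one process generates labeled examples from a hidden rule (standing in for a large “teacher” model or a physics simulation), and a second, simpler model is trained only on that output. The names and the toy task are hypothetical; SandboxAQ's actual molecular pipeline is far more sophisticated.

```python
# Minimal sketch of the synthetic-data idea: training examples are *generated*
# rather than collected, and a second model learns from them. Everything here
# is a toy stand-in (the "teacher" is a noisy rule, the "student" is a
# one-variable linear fit), not any company's actual pipeline.
import random

def generate_synthetic_pair(rng: random.Random) -> tuple[float, float]:
    """Stand-in 'teacher': emits (x, y) pairs from a hidden rule plus noise."""
    x = rng.uniform(-5.0, 5.0)
    y = 3.0 * x + 1.0 + rng.gauss(0.0, 0.5)   # hidden rule: y ~ 3x + 1
    return x, y

def fit_student(pairs: list[tuple[float, float]]) -> tuple[float, float]:
    """Stand-in 'student': ordinary least squares on the synthetic pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)
synthetic_dataset = [generate_synthetic_pair(rng) for _ in range(10_000)]
slope, intercept = fit_student(synthetic_dataset)
print(f"student recovered: y ~ {slope:.2f}x + {intercept:.2f}")  # close to 3x + 1
```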

This AI-builds-AI approach is becoming central to the industry’s future. George Lee, co-head of the Goldman Sachs Global Institute, noted that with human-generated data largely exhausted, “machines are now being used to generate synthetic data that advances their pre- and post-training.” He pointed to Google’s Co-Scientist, an agentic framework designed to accelerate scientific invention, as a prime example.

The models themselves are being used to refine their successors by generating candidate answers and filtering out the bad ones, a generate-and-verify loop that improves reasoning. This is a core part of how OpenAI developed its advanced “o-series” reasoning models.
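
Stripped to its essentials, the pattern is generate-then-filter: sample many candidate answers, keep only those a checker accepts, and recycle the survivors as training data. The toy below uses a trivial arithmetic verifier as the filter; it illustrates the general technique, not any lab's actual training recipe.

```python
# Hedged sketch of a generate-then-filter loop: sample several candidate
# answers per question, keep only those that pass a cheap verifier, and treat
# the verified pairs as new training data. The arithmetic task and all names
# here are toy stand-ins for illustration.
import random

def propose_answer(question: tuple[int, int], rng: random.Random) -> int:
    """Stand-in 'model': guesses a sum, sometimes wrongly."""
    a, b = question
    return a + b + rng.choice([0, 0, 0, -1, 1])   # correct most of the time

def verifier(question: tuple[int, int], answer: int) -> bool:
    """Cheap check that filters out bad generations."""
    a, b = question
    return answer == a + b

rng = random.Random(1)
curated = []
for _ in range(1_000):
    q = (rng.randint(0, 99), rng.randint(0, 99))
    candidates = [propose_answer(q, rng) for _ in range(8)]   # sample widely
    keepers = [c for c in candidates if verifier(q, c)]       # filter hard
    if keepers:
        curated.append((q, keepers[0]))   # a verified pair becomes training data

print(f"kept {len(curated)} verified examples out of 1000 questions")
```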

Is creating fake data a real solution?

Generating data offers clear advantages. It provides a potentially infinite source of training material, sidesteps many copyright battles, and allows researchers to create datasets tailored for specific, complex tasks, like discovering drugs or finding security vulnerabilities in code. For enterprises, it offers a way to train models on business processes without exposing sensitive customer information.

However, the approach is not without risk. The biggest concern is a phenomenon sometimes called “model collapse” or “inbreeding,” where an AI trained on its own output can begin to amplify its own errors, biases, and hallucinations. In a widely read essay, Anthropic CEO Dario Amodei warned that without proper understanding, AI models can become “an incoherent pastiche of many different words and concepts.” If a model learns from a flawed version of reality generated by its predecessor, it may drift further and further from the truth.
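
A crude way to see why researchers worry: have each “generation” fit a simple statistical model only to samples drawn from the previous generation's model, with no fresh real data. The toy below fits just a mean and standard deviation; the estimates drift, and over many rounds the distribution tends to lose the tails of the original. It is an analogy, not a simulation of real training dynamics.

```python
# Toy analogy for "model collapse": each generation fits a model (here just a
# mean and standard deviation) to samples drawn from the previous generation's
# model rather than from the real data. Estimation errors compound, so the
# fitted distribution drifts and tends to shed the tails of the original.
import random
import statistics

rng = random.Random(42)
real_data = [rng.gauss(0.0, 1.0) for _ in range(50)]      # the "human" data
mu, sigma = statistics.mean(real_data), statistics.stdev(real_data)
print(f"gen 0 (real data): mean={mu:+.3f}  stdev={sigma:.3f}")

for generation in range(1, 9):
    # Train the next generation only on the previous generation's output.
    synthetic = [rng.gauss(mu, sigma) for _ in range(50)]
    mu, sigma = statistics.mean(synthetic), statistics.stdev(synthetic)
    print(f"gen {generation}: mean={mu:+.3f}  stdev={sigma:.3f}")
```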

This compounding of errors is a core reliability concern. OpenAI itself acknowledged that its o3 reasoning model hallucinated at a significantly higher rate than earlier versions. If models are trained on their own flawed outputs, that problem could get worse, not better.

How does this change the AI race?

The pivot to synthetic data reinforces the centrality of raw computing power. Generating vast, high-quality synthetic datasets requires immense computational resources, fueling the “unquenchable need” for more data centers. A McKinsey report projected that data centers would require a staggering $7 trillion in global investment by 2030, with the majority driven by AI workloads.

This entrenches the lead of companies with the deepest pockets and the most advanced hardware: hyperscalers such as Microsoft, Google, and Amazon, and chipmakers such as Nvidia. As Amazon CEO Andy Jassy recently wrote to shareholders, companies must invest “aggressively” now in this infrastructure to reap future rewards. The ability to efficiently generate high-quality synthetic data may become the new competitive moat.

While the industry was previously defined by a race for data, it is now entering a race to master the creation of it. The companies that can build the most effective “AI factories”—not just for training models but for generating the very data they learn from—are the ones most likely to break through the current performance plateaus and unlock the next wave of innovation.


Reference Shelf

Nvidia-backed AI startup SandboxAQ creates new data to speed up drug discovery (Reuters)

The cost of compute: A $7 trillion race to scale data centers (McKinsey)

OpenAI, Google and Anthropic Are Struggling to Build More Advanced AI (Bloomberg)

The outlook for AI adoption as advancements in the technology accelerate (Goldman Sachs)