When it comes to training artificial intelligence, experienced hands can help steer the programs toward greater accuracy. That's the takeaway from a collaboration between OpenAI and Babylon Biosciences aimed at improving the ability of large language models to predict which nascent drugs are most likely to succeed in clinical trials.
The company behind ChatGPT and the biotech startup have been using a process called reinforcement fine-tuning, which OpenAI began rolling out last December, to tailor models specifically to identify the pitfalls that may cause a promising molecule to miss its primary endpoint.
Similar to supervised learning, where AI models are nudged closer to the correct answer with pre-labeled data sets, reinforcement fine-tuning employs a programmable grading system, in this case one developed internally by Babylon's team of drug hunters, that provides numerical feedback for every answer the model gives.
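How such a grader works can be sketched in a few lines of Python. The function below is purely illustrative: the field names (predicted_success, confidence, succeeded) and the scoring scheme are assumptions for demonstration, not the grader Babylon actually built.

```python
# Hypothetical sketch of a programmable grader for reinforcement
# fine-tuning. Field names and scoring scheme are illustrative
# assumptions, not Babylon's or OpenAI's actual code.

def grade_prediction(model_answer: dict, trial_record: dict) -> float:
    """Return a reward between 0 and 1 for one model response."""
    correct = model_answer["predicted_success"] == trial_record["succeeded"]
    confidence = model_answer.get("confidence", 0.5)

    if correct:
        # Right answers earn at least half credit, scaled up
        # for higher stated confidence.
        return 0.5 + 0.5 * confidence
    # Wrong answers earn more partial credit the more hedged they were,
    # so a confidently wrong prediction is penalized the hardest.
    return 0.5 * (1.0 - confidence)


print(grade_prediction(
    {"predicted_success": True, "confidence": 0.8},
    {"succeeded": False},
))  # 0.1: confidently wrong, so the reward is near zero
```

During training, a numerical signal of this kind is computed for every model answer and used to reinforce the reasoning patterns that score well.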
With biomedical data collected by Sleuth Insights, researchers helped sculpt a version of OpenAI’s o3-mini model using scientific literature and findings from 430 clinical trials spanning cancers, neurology, metabolic diseases and rare disorders.
The companies said the monthslong process raised the prediction accuracy of the base model, which already beat a coin flip with an area-under-the-curve measurement of 0.65, to an AUC of 0.84 when presented with a blinded set of studies.
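For context on the metric: AUC measures how reliably a model ranks eventual successes above eventual failures, where 0.5 is a coin flip and 1.0 is a perfect ranking. The toy calculation below, using scikit-learn on invented labels rather than any real trial data, shows how such a figure is computed.

```python
# Toy illustration of the AUC metric. The outcomes and scores below
# are invented for demonstration and are unrelated to the real study.
from sklearn.metrics import roc_auc_score

outcomes = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = trial met its primary endpoint
scores = [0.9, 0.4, 0.7, 0.3, 0.5, 0.2, 0.8, 0.6]  # model's success probabilities

print(roc_auc_score(outcomes, scores))  # 0.8125 on this toy set
```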
OpenAI and Babylon estimate that clinical development failures cost the biopharma industry about $45 billion annually. Last year’s most high-profile disappointments included mid- and late-stage trial misses for drugs that had been the subject of massive acquisition deals by Big Pharmas as well as clinical readouts that triggered deep layoffs at smaller biotechs.
The researchers did not break down their prediction accuracy by development stage, such as decisions to move a preclinical candidate into human testing or to promote a prospect from phase 2 to a pivotal phase 3 trial. But they said the AI models produced results that could help drug developers make better use of limited resources when weighing multiple opportunities.
“The beauty of where the technology is today is that it gives a small company like Babylon these juggernaut capabilities,” the biotech’s CEO and founder, Sacha Schermerhorn, said in an interview. “The unlock and throughput that we've had is just tremendous, just from implementing this tooling internally.”
Babylon, founded in 2023, aims to license potential molecules for development, with a focus on Alzheimer’s disease and neurological conditions.
Schermerhorn said that when Babylon's tailored AI model was tested with molecules proposed for a validated biological target, it was able to take into account preclinical information and the mechanism of action, as well as previously unsuccessful attempts to develop drugs in that particular space.
“We saw some really clever examples of things that were totally non-obvious,” he said. “For us there was an ‘aha moment,’ where we saw these models stitch together very incongruent results and come up with a unified reason for why that asset failed, and why a new one would or would not.”
That binary result of clinical trials—sink or swim—offers a unique opportunity to gauge improvements as AI models receive more training, Schermerhorn said. And, though he’s worked in the machine learning field for years and described himself as something of a holdout on adopting the technology, Schermerhorn said he now thinks that one day every company will have its own personalized AI model.
“It'll be some kind of approximation of, let's call it a weighted sum of the intelligence and expertise of their teams,” he said.
“The beauty of clinical trials is that you're able to benchmark against a relatively objective, unambiguous outcome … Then again, I think baking in the idiosyncrasies of your team seems like a very good idea, if you're looking for a particular flavor of molecule on the licensing side, for example, as a subjective approach. Threading that needle is an interesting one.”
“Having some kind of a blend—of objective fine-tuning, plus subjective input—will be very helpful, as long as you have unambiguous ways of assessing how those models are performing,” he added.
But, as with all AI models, there’s a question of garbage in, garbage out: whether training data may reinforce previous biases that tilt the outcome. In this case, that includes a clinical trial landscape that has not historically been diverse and that largely adopted criteria excluding many women from participating in research studies.
“It's not going to be totally perfect, because trial recruitment really matters,” Schermerhorn said. “I think it's very hard to offset those biases until the world starts running more clinical trials that include the diversity that should be in there.”
At the same time—when it comes to analyzing study results that actually make it into print—researchers have been more likely to tout their successes than their failures.
“If a model is initialized to that, it will initially be overly sanguine, right?” he added. “It will say everything will work—because, miraculously, every molecule ever published has cured some rodent model.”
“I think where we saw the capability—and my vision for [reinforcement fine-tuning], in terms of my excitement—came from basically being able to bake in a level of skepticism into those models … like the skepticism of the drug hunters on the Babylon team.”