Beyond Legacy Tools: Defining Modern AI Drug Discovery for 2025 and Beyond
In this report:
- Intro: The New Framework
- “AI Drug Discovery” is About Holism
- "AI Drug Discovery" is About Building Software
- Access to Data is King
- Validation is Critical for AI Drug Discovery Platforms
- Grand Vision of AI Drug Discovery for 2025 and Beyond
Disclaimer
This report aims to provide an educational, balanced, and pragmatic perspective on AI-driven drug discovery (AIDD). No part of this report should be construed as promotional content or marketing communication.
Some companies featured are past or current clients, and certain organizations provided factual input during the research process. All analysis and conclusions were developed independently to ensure objectivity.
This report does not constitute investment advice or an endorsement. While we strive for accuracy and neutrality, we accept no liability for decisions made based on this content. Readers are encouraged to conduct their own due diligence.
As of 2025, there is still no robust definition of the emerging category of artificial intelligence-driven drug discovery (hereinafter, AIDD) companies.
The purpose of this report is to suggest a qualitative framework for classification of AIDD companies, combining the four key attributes that define the leading players in this area:
- Focus on holism vs reductionism in biology
- Creating robust AI platforms (software)
- Priority of data acquisition
- Technology validation (via a demonstrable ability to discover novel targets and to discover and develop clinical-grade drug candidates rapidly, a track record of platform partnerships, scientific publications, patents, and so on)
We will delve deeper into framework discussion below, but in a nutshell, it boils down to this:

Indeed, abstracting from the specific characteristics of a tech stack and platform design, an AI platform’s value for business outcomes comes down to three key questions:
- Is the computational platform scalable and robust enough to impact the R&D workflow, collaboration patterns, and daily decision-making of a wide range of specialists in a given organization, and thereby make a productivity difference?
- Can it represent biology in silico with sufficient depth and breadth to grasp relevant and useful dependencies, patterns, and network biology effects, and so impact scientific decision-making beyond mainstream research workflows?
- Is the AI platform capable of addressing the above two questions in a repeatable, stable, standardized way across all levels of R&D workflows in the organization? Would a third-party collaborator be able to get sustainable value out of using the AI software if they had access?
In our opinion, AIDD is about being able to answer “yes” to all three questions. This is what makes the AIDD platform a tangible business asset.
“AI Drug Discovery” is About Holism
As we explore the newly suggested framework, one key distinction emerges: the difference in what we attempt to model and represent computationally in today’s AI-driven landscape versus what was typically addressed using earlier generations of computational tools.
A helpful starting point is to consider the conceptual gap between traditional software—developed decades ago and still widely used in drug discovery for specific tasks—and modern AI-enabled platforms that are increasingly positioned as end-to-end solutions. While both types of tools play valuable roles, their underlying philosophies differ significantly.
In simple terms, “traditional” or “legacy” cheminformatics and bioinformatics rely on human-driven approaches: cheminformatics uses predefined chemical descriptors (like molecular weight or logP), statistical methods and some machine learning approaches for tasks like QSAR modeling and docking, while bioinformatics applies statistical methods, including dimensionality reduction techniques, to analyze complex biological datasets (e.g., genomics, proteomics) and uncover potential drug targets. These methods are hypothesis-driven, modular, and work with smaller, well-structured datasets.
Conceptually, legacy computational systems and simpler machine learning methods are useful in the paradigm of “biological reductionism.” And they do a great job there, even today.
A classic example of the reductionist approach is structure-based drug discovery, where modulating a specific protein is assumed to be the answer to a drug discovery problem (it sometimes is). The computational part, therefore, is mostly focused on narrow-scope tasks like fitting a ligand into a protein pocket (docking) or computationally identifying a new type of chemistry for a given target (ligand-based virtual screening).

In stark contrast, cutting-edge AI-driven drug discovery companies attempt to shift to a systems biology level and a hypothesis-agnostic approach, using deep learning-based systems to integrate largely multimodal data (phenotype, omics, patient data, chemical structures, texts, images, etc.) and construct complex and comprehensive representations of biology (e.g. “knowledge graphs”).

For example, the scientific underpinnings of the Pharma.AI computational platform by Hong Kong-based Insilico Medicine are rooted in a novel combination of policy-gradient-based reinforcement learning (RL) and generative models, enabling multi-objective optimization to balance parameters such as potency, toxicity, and novelty.
According to the company, the PandaOmics target identification module leverages 1.9 trillion data points from over 10 million biological samples (including RNA sequencing and proteomics) and 40 million documents (such as patents and clinical trials), using NLP and machine learning to uncover and prioritize novel therapeutic targets.
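To make the idea of multi-objective optimization more concrete, here is a minimal sketch of how several objectives could be collapsed into a single reward for a generative or RL loop. It is an illustration only, not Insilico's actual reward function: the `MoleculeScores` fields, the weights, the toxicity cap, and the three candidate molecules are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MoleculeScores:
    """Hypothetical per-molecule predictions, each scaled to [0, 1]."""
    potency: float   # higher is better
    toxicity: float  # higher is worse
    novelty: float   # higher is better


def shaped_reward(s: MoleculeScores,
                  w_potency: float = 0.5,
                  w_toxicity: float = 0.3,
                  w_novelty: float = 0.2,
                  toxicity_cap: float = 0.8) -> float:
    """Weighted-sum scalarization with a hard penalty above a toxicity cap."""
    if s.toxicity > toxicity_cap:
        # Shaping term: strongly discourage clearly toxic designs.
        return -1.0
    return (w_potency * s.potency
            + w_toxicity * (1.0 - s.toxicity)
            + w_novelty * s.novelty)


# Illustrative candidates with made-up scores (not real data).
candidates = {
    "mol_A": MoleculeScores(potency=0.9, toxicity=0.9, novelty=0.7),
    "mol_B": MoleculeScores(potency=0.7, toxicity=0.2, novelty=0.5),
    "mol_C": MoleculeScores(potency=0.4, toxicity=0.1, novelty=0.9),
}

ranked = sorted(candidates,
                key=lambda name: shaped_reward(candidates[name]),
                reverse=True)
print(ranked)
```

In this toy setup the most potent molecule (`mol_A`) ranks last because it exceeds the toxicity cap, which is the kind of trade-off a single-objective potency score would miss. A weighted sum is only one possible scalarization; Pareto-based ranking is a common alternative.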
The Chemistry42 module applies deep learning, including generative adversarial networks (GANs) and reinforcement learning, to design novel drug-like molecules optimized for binding affinity, metabolic stability, and bioavailability.
In the context of clinical development, inClinico predicts trial outcomes using historical and ongoing trial data, offering insights into patient selection and endpoint optimization.
On the algorithmic side, Pharma.AI incorporates advanced reward shaping, allowing it to fine-tune generated molecules to specific target profiles or polypharmacological goals. Additionally, Insilico emphasizes the use of knowledge graph embeddings, which encode biological relationships (such as gene–disease, gene–compound, and compound–target interactions) into vector spaces.
These embeddings are augmented by attention-based neural architectures, inspired by transformer models, to focus on biologically relevant subgraphs, refining hypotheses for target identification and biomarker discovery.
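As a rough illustration of what "encoding biological relationships into vector spaces" can mean, the sketch below scores gene–disease triples with a TransE-style translation model, in which a relation is a vector that should carry the head entity close to the tail entity. TransE is used here only because it is a simple, well-known embedding scheme; the source does not say which embedding method Pharma.AI uses. The entity names and the 3-dimensional vectors are hypothetical and hand-picked, whereas a real system would learn them from data.

```python
import math

# Hypothetical 3-dimensional embeddings (hand-picked for illustration).
entity = {
    "GENE_X": [0.9, 0.1, 0.0],
    "DISEASE_Y": [1.0, 0.9, 0.1],
    "DISEASE_Z": [-0.8, 0.2, 0.9],
}
relation = {"associated_with": [0.1, 0.8, 0.1]}


def transe_score(head: str, rel: str, tail: str) -> float:
    """Return -||h + r - t||; scores closer to 0 mean a more plausible triple."""
    h, r, t = entity[head], relation[rel], entity[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Rank candidate diseases for a gene by triple plausibility.
diseases = ["DISEASE_Y", "DISEASE_Z"]
best = max(diseases, key=lambda d: transe_score("GENE_X", "associated_with", d))
print(best)
```

Here `GENE_X + associated_with` lands almost exactly on `DISEASE_Y`, so that triple scores highest. Attention-based architectures of the kind described above would operate on top of such vectors rather than replace them.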
The platform employs a continuous active learning and iterative feedback process, retraining models on new experimental data, including biochemical assays, phenotypic screens, and in vivo validations, to accelerate the design–make–test–analyze (DMTA) cycle by rapidly eliminating suboptimal candidates and enhancing lead generation.
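The feedback loop described above can be sketched as a small active learning cycle. This is a toy under stated assumptions, not the company's pipeline: `run_assay` is a hypothetical stand-in for a wet-lab measurement, candidates are reduced to a single number, the "model" is a bootstrap ensemble of 1-nearest-neighbour predictors, and uncertainty sampling (testing whatever the ensemble disagrees on most) is just one common selection strategy.

```python
import random
import statistics

random.seed(0)  # make the toy run reproducible


def run_assay(x: float) -> float:
    """Hypothetical stand-in for a wet-lab measurement (the 'test' step)."""
    return -(x - 0.6) ** 2


def predict(sample, x):
    """1-nearest-neighbour prediction from a list of (x, y) pairs."""
    return min(sample, key=lambda pair: abs(pair[0] - x))[1]


def ensemble_uncertainty(labeled, x, n_models=10):
    """Disagreement (std. dev.) across models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        boot = random.choices(labeled, k=len(labeled))
        preds.append(predict(boot, x))
    return statistics.pstdev(preds)


seeds = (0.0, 1.0)
labeled = [(x, run_assay(x)) for x in seeds]          # initial experiments
pool = [i / 50 for i in range(51) if i / 50 not in seeds]  # untested candidates

for round_no in range(3):
    # Design/analyze: rank untested candidates by model disagreement.
    pool.sort(key=lambda x: ensemble_uncertainty(labeled, x), reverse=True)
    batch, pool = pool[:5], pool[5:]
    # Make/test: "measure" the selected batch and feed the results back.
    labeled += [(x, run_assay(x)) for x in batch]
    print(round_no, len(labeled), len(pool))
```

Each round retrains on everything measured so far, so later selections depend on earlier results. A production loop would typically blend uncertainty with predicted quality so that clearly suboptimal candidates are dropped rather than tested.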
Furthermore, the platform’s multi-modal data fusion integrates textual information from published literature, patents, and clinical trial data with omics-level insights and chemical libraries. To this end, Natural Language Processing (NLP) models are used to extract relevant biological context and side-effect annotations from these textual sources, which are then enriched with phenotypic screening data, enabling a holistic view of the drug discovery process.
You can familiarize yourself with some of the aspects of the Pharma.AI platform by reading a recent paper “A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models” (image below is from the paper).

Another relevant example of what can be classified as an AI drug discovery approach is Recursion’s OS platform.
The Recursion OS is a vertical platform of diverse technologies that enables the company to map and navigate trillions of biological, chemical, and patient-centric relationships utilizing approximately 65 petabytes of proprietary data.
According to a commentary by Recursion, the OS integrates ‘Real World’ data generated in the company’s own wet laboratories or by select partners with a ‘World Model’, a collection of AI computational models that are also built in-house. Today, the company’s scaled ‘wet-lab’ biology, chemistry, and patient-centric experimental data feeds its ‘dry-lab’ computational tools to identify, validate, and translate therapeutic insights, which can then be validated in the wet lab. The Recursion OS is powered by BioHive-2, which the company claims is the fastest supercomputer wholly owned and operated by a biopharma company.
While it differs from Insilico Medicine in model architectures and workflows, Recursion is focused on the same key objective: to create a comprehensive representation of biology from which to mine crucial insights for drug discovery.

Key models of Recursion OS include Phenom-2, a 1.9 billion-parameter ViT-G/8 MAE trained on 8 billion microscopy images, achieving a 60% improvement in genetic perturbation separability, according to company claims.
MolPhenix, winner of NeurIPS 2024 Best Paper, predicts molecule-phenotype effects with a considerable improvement over baselines. MolGPS, a 3-billion-parameter model, excels in molecular property prediction and integrates proprietary phenomics data, outperforming benchmarks in 12 of 22 ADMET tasks. MolE, trained on 842 million molecular graphs, leads in 10 of 22 ADMET tasks.
An interesting component of Recursion OS is a knowledge graph tool that evaluates promising signals found by the Recursion OS through a complex lens of topics of interest in biology and drug discovery, including global trend scores, protein pockets and structure, competitive landscape, and clinical trials. The knowledge graph allows researchers to perform “target deconvolution” (identifying and validating the molecular targets behind a small molecule’s phenotypic responses) in order to narrow hundreds of possibilities down to the best target opportunity.
A more recent example comes from California-based Iambic Therapeutics, founded in 2019. The team at Iambic developed a drug discovery platform that integrates three specialized AI systems (Magnet, NeuralPLexer, and Enchant) into a unified pipeline that computationally spans molecular design, structure prediction, and clinical property inference.

Magnet generates synthetically accessible small molecules by leveraging reaction-aware generative models constrained by Iambic’s automated chemistry infrastructure. These molecules are passed to NeuralPLexer, a multi-scale diffusion-based generative model that directly predicts atom-level, ligand-induced conformational changes in protein-ligand complexes using only protein sequence and ligand graph as input. The resulting structural complexes inform both target engagement and binding specificity.
Finally, Enchant uses a multi-modal transformer architecture trained across diverse, noisy preclinical datasets to predict human pharmacokinetics and other clinical outcomes via transfer learning, achieving high predictive accuracy even with minimal clinical data. This architecture enables an iterative, model-driven workflow where molecular candidates are designed, structurally evaluated, and clinically prioritized entirely in silico before synthesis.
Finally, there is a notable example from the area of neurodegenerative diseases, Verge Genomics. The CONVERGE® platform developed by Verge is an end-to-end, closed-loop machine learning system that integrates large-scale human-derived biological data with predictive modeling.
At its core, CONVERGE® leverages high-dimensional, multi-modal datasets—including over 60 terabytes of human gene expression and inferred gene relationships, thousands of gene perturbation and ChIP-seq studies, millions of protein-protein interactions, and direct-from-human clinical samples across diseases such as ALS, Parkinson’s, and FTD.

These data are used to train machine learning models that identify and prioritize drug targets with increased translational relevance, avoiding reliance on animal or artificial cell models that poorly mimic human biology. Predictions from these models are experimentally validated in-house using Verge’s wet lab infrastructure, forming a feedback loop that continuously refines both biological hypotheses and model performance.
This integration of patient-derived tissue data, mechanistic genomics, and computational target prioritization is aimed at the identification of clinically viable drug candidates without brute-force screening. Verge’s internally developed clinical compound was derived entirely through CONVERGE® in under four years, including the target discovery stage.
Conceptually, “AI drug discovery”, in contrast to “legacy” computational systems, refers to a modern computational tech stack, usually a multimodal ensemble, that is capable of modeling biology holistically, drawing on molecular, phenotypic, and clinical data of all types and sizes (chemical, omics, text, images such as cell staining, EHR, etc.), either all at once or across a substantial part of that variety.
Generative AI
Another crucial aspect distinguishing modern AIDD from earlier computational tools is its generative capabilities.
While companies like Insilico Medicine pioneered the use of Generative Adversarial Networks (GANs) for generative chemistry back in 2016, leveraging their ability to model complex molecular distributions and propose novel chemical structures, it is the introduction of the attention-based transformer architecture in 2017, and the subsequent advent of models like BERT and GPT, that in our opinion brought about a paradigm shift in generative modeling across domains.
We consider 2017 a pivotal year for generative AI, including in chemistry and biology, following the landmark paper “Attention Is All You Need”.

These architectures, pioneered by Google, and later developed by OpenAI, Anthropic, Mistral AI, and others, demonstrated unparalleled scalability and capacity for capturing long-range dependencies in sequential data.
By pretraining on vast corpora of text (hundreds of billions and even trillions of tokens) and employing self-attention to dynamically weight input relationships, transformers enabled large-scale generative models such as GPT-3 and GPT-4 to generate highly coherent and contextually accurate outputs.
Yes, “hallucinations” are still a major issue. But the shift is paramount, nonetheless. The pioneering commercial products in this regard are ChatGPT for primarily text-to-text generation, Midjourney for text-to-image generation, and many others for text-to-video, text-to-music, etc.
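For readers who want to see what "employing self-attention to dynamically weight input relationships" looks like computationally, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ/√d)·V, as defined in “Attention Is All You Need”. It is deliberately simplified: real transformers first apply learned projection matrices to obtain the queries, keys, and values, and use multiple heads. In this sketch the same hypothetical toy vectors are used for all three.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of equal-length vectors."""
    d = len(k[0])
    out, weights = [], []
    for qi in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out, weights


# Three toy "token" vectors (hypothetical values), used as Q, K and V alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every other token, which is what lets the model weight distant but relevant inputs more heavily than nearby irrelevant ones.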
The emergence of practically feasible transformers and large language models catalyzed something of a race in computational chemistry and biology towards so-called foundation models. The article “19 Companies Pioneering AI Foundation Models in Pharma and Biotech” summarizes some of the initiatives in this domain.
To summarize, here is a simple, generalizable framework for drawing a line between legacy computer-aided drug design (CADD) and modern AIDD:
Table 1
Dimension | Traditional Chem(Bio)informatics | AI Drug Discovery |
---|---|---|
Primary Focus | Methodical QSAR, structure-based design, library searches | Automated, data-intensive predictions and/or generative output, end-to-end optimization, novel hypothesis generation, biology scoring, etc. |
Core Techniques | QSAR (linear/non-linear models); docking & virtual screening; descriptor-driven modeling | Deep learning (CNNs, GNNs); generative models (VAEs, GANs); transformers, attention algorithm; active learning, reinforcement learning |
Feature Engineering | Heavily reliant on manually crafted descriptors; traditional molecular fingerprints | Automated feature extraction from raw data (e.g., molecular graphs); learns non-obvious patterns |
Data Sources | Limited to known chemical and structural data; smaller curated databases | Integration of large-scale multi-modal data (omics, real-world evidence); massive virtual libraries; synthetic data |
Generative Capability | Rule-based or library-based enumeration; similarity-driven searches | Machine learning–based de novo molecule generation; novel chemistry exploration |
Scalability | Often constrained by computational cost of docking or QSAR on moderate-sized libraries | Designed to handle billions of compounds or biological data points in silico; cloud-based, high-throughput pipelines |
Human Involvement | Significant expert intervention needed (e.g., choosing descriptors, scoring functions) | Reduced manual involvement through automation; AI suggests experiments and molecules for validation |
Integration Across Stages | Typically used as isolated tools (e.g., for docking or property prediction) | Can form an end-to-end platform (target ID to lead optimization, to clinical trial optimization ideas or predicting clinical trial success); real-time feedback loops |
Scope of Insights | Narrowly focused on chemical structures and known SAR rules | Deeper pattern recognition across complex, high-dimensional datasets; potential for discovering novel biology and chemistry, novel hypotheses |
Value Proposition | Proven track record for well-known targets and chemical series | Potential for identifying breakthrough hypotheses, targets, biomarkers, and molecules, as well as diagnostic solutions; accelerated and more efficient R&D cycles |
Now that we have reviewed what “AI drug discovery” attempts to model (holistic biology vs mainstream “reductionism”) and what kinds of models are generally capable of doing so, let’s discuss another crucial aspect: AI platform “maturity” as a software product.
"AI Drug Discovery" is Also About Building Software
References
1. BenevolentAI, pipeline, February 2025 https://web.archive.org/web/20250210224107/https://www.benevolent.com/pipeline/
2. BenevolentAI, pipeline, December 2023 https://web.archive.org/web/20231205114116/https://www.benevolent.com/pipeline/
3. BenevolentAI, annual report (PDF), 2022 https://www.benevolent.com/application/files/9816/7939/1282/BenevolentAI_Annual_Report_2022.pdf
4. Healx, pipeline, April 2025 https://web.archive.org/web/20250423114523/https://healx.ai/pipeline/
5. Healx, pipeline, April 2024 https://web.archive.org/web/20240417123453/https://healx.ai/pipeline/
6. Healx, pipeline, April 2023 https://web.archive.org/web/20230329164007/https://healx.ai/pipeline/
7. Healx, pipeline, December 2022 https://web.archive.org/web/20221203025122/https://healx.ai/pipeline/
8. Insilico, pipeline, April 2025 https://insilico.com/pipeline
9. Insilico, pipeline, December 2023 https://web.archive.org/web/20231204133620/https://insilico.com/pipeline
10. Insilico, pipeline, October 2022 https://web.archive.org/web/20221007131323/https://insilico.com/pipeline
11. Insilico, pipeline, February 2022 https://web.archive.org/web/20220213125657/https://insilico.com/pipeline
12. Exscientia, pipeline, November 2023 https://web.archive.org/web/20231130165922/https://www.exscientia.ai/pipeline
13. Exscientia, PR, August 2022 https://www.businesswire.com/news/home/20220817005681/en/Exscientia-Business-Update-for-Second-Quarter-and-First-Half-2022
14. Exscientia, article, July 2022 https://www.nanalyze.com/2022/07/exscientia-stock-ai-drug-discovery/
15. Exscientia, annual report, 2021 https://s28.q4cdn.com/460399462/files/doc_financials/2021/ar/2021-UK-Annual-Report.pdf
16. Recursion, pipeline, April 2025 https://web.archive.org/web/20250425190557/https://www.recursion.com/pipeline
17. Recursion, pipeline, April 2024 https://web.archive.org/web/20240414085309/https://www.recursion.com/pipeline
18. Recursion, pipeline, March 2023 https://web.archive.org/web/20230324234118/https://www.recursion.com/pipeline
19. Recursion, pipeline, January 2022 https://web.archive.org/web/20220131104947/https://www.recursion.com/pipeline
20. Recursion, pipeline, February 2021 https://web.archive.org/web/20210225041638/https://www.recursion.com/pipeline
21. Recursion, pipeline, January 2021 https://web.archive.org/web/20210129043831/https://www.recursion.com/pipeline
22. Relay, pipeline, March 2025 https://web.archive.org/web/20250321133438/https://relaytx.com/pipeline/
23. Relay, pipeline, February 2024 https://web.archive.org/web/20240227231146/https://relaytx.com/pipeline/
24. Relay, pipeline, November 2023 https://web.archive.org/web/20231111223956/https://relaytx.com/pipeline/
25. Relay, annual report (PDF), 2022 https://ir.relaytx.com/static-files/1b13dc48-4fb1-4ec3-b639-69636bc3ace1
26. Relay, annual report (PDF), 2021 https://ir.relaytx.com/static-files/65cffc5e-e6e3-42a3-9b87-cc44b93c2856
27. Relay, annual report (PDF), 2020 https://ir.relaytx.com/static-files/08d959ca-abd2-4a9c-bd25-be8eef73d732
28. Schrodinger, pipeline, April 2025 https://web.archive.org/web/20250421111538/https://www.schrodinger.com/pipeline
29. Schrodinger, pipeline, April 2024 https://web.archive.org/web/20240427094807/https://www.schrodinger.com/pipeline
30. Schrodinger, pipeline, November 2022 https://web.archive.org/web/20221124124721/https://www.schrodinger.com/pipeline
31. Schrodinger, pipeline, June 2021 https://web.archive.org/web/20210620183431/https://www.schrodinger.com/pipeline
32. Schrodinger, pipeline, June 2020 https://web.archive.org/web/20200606152921/https://www.schrodinger.com/pipeline
33. Schrodinger, pipeline, July 2019 https://web.archive.org/web/20190717045358/https://www.schrodinger.com/pipeline
34. Verge Genomics, pipeline, April 2025 https://www.vergegenomics.com/pipeline
35. Verge Genomics, pipeline, February 2024 https://web.archive.org/web/20240306224636/https://www.vergegenomics.com/pipeline
36. Verge Genomics, pipeline, November 2022 https://web.archive.org/web/20221104085232/https://www.vergegenomics.com/pipeline
37. BenevolentAI, report release, 2021 and 2022 https://www.benevolent.com/news-and-media/press-releases-and-in-media/benevolentai-unaudited-preliminary-results-year-ended-31-december-2022/
38. BenevolentAI, report release, 2023 https://www.benevolent.com/application/files/2417/1136/4663/BenevolentAI_Annual_Report_2023.pdf
39. BenevolentAI, accounting statements, 2024 https://www.benevolent.com/application/files/7717/3916/6608/Benevolent_AI__OSAKA_Holdings_Pro_forma_BS_-_Final.pdf
40. Insilico, annual report (PDF), 2024 https://www1.hkexnews.hk/app/sehk/2025/107348/documents/sehk25050802048.pdf
41. Recursion, annual report (HTML), 2021 https://ir.recursion.com/node/6926/html
42. Recursion, annual report (HTML), 2022 https://ir.recursion.com/node/8131/html
43. Recursion, annual report (HTML), 2023 https://ir.recursion.com/node/9691/html
44. Recursion, annual report (HTML), 2024 https://ir.recursion.com/node/11351/html
45. Relay, annual report (HTML), 2021 https://ir.relaytx.com/node/7691/html
46. Relay, annual report (HTML), 2022 https://ir.relaytx.com/node/8531/html
47. Relay, annual report (HTML), 2023 https://ir.relaytx.com/node/9196/html
48. Relay, annual report (HTML), 2024 https://ir.relaytx.com/node/10066/html
49. Schrodinger, annual report (PDF), 2021 https://d18rn0p25nwr6d.cloudfront.net/CIK-0001490978/7a72e457-9a9e-4efc-b9b3-5ead018c904d.pdf
50. Schrodinger, annual report (PDF), 2022 https://d18rn0p25nwr6d.cloudfront.net/CIK-0001490978/6835c32b-f977-482f-82c5-254066f66d06.pdf
51. Schrodinger, annual report (PDF), 2023 https://d18rn0p25nwr6d.cloudfront.net/CIK-0001490978/b3224b2d-5cc5-4081-ba8b-d89a31181139.pdf
52. Schrodinger, annual report (PDF), 2024 https://d18rn0p25nwr6d.cloudfront.net/CIK-0001490978/2ad2903d-0825-4d27-b42a-2e6966d88206.pdf
Edits
- Edit 1 (2025-04-17): Following a clarification from Iambic representatives, we have updated the Iambic timeline in Table 3, replacing 24 months with 8 months. The company explains that 24 months is the time it took to get to the clinic, while it took only 8 months to get to IND studies.
- Edit 2 (2025-04-29): Insilico Medicine headquarters location updated
- Edit 3 (2025-05-12): Recursion pipeline updated in Table 2 (source)
- Edit 4 (2025-06-06): Financial summary tables and a related bullet point were added to provide contextual insight into AIDD company performance
Report methodology
An analysis of historical therapeutic pipeline data (Table 2) was carried out using archived snapshots from the Web Archive, allowing us to review how pipeline diagrams appeared at earlier points in time. In some instances, annual financial reports were also consulted to retrieve pipeline details for previous years.
Efforts were made to track each molecule or program within a given pipeline across successive years, and if a particular program did not appear in the following year’s records, it was generally assumed that it had been put on hold for various reasons.
Target novelty analysis for Diagram 3 was performed based on the methodology and mathematical formula outlined in this file.
Correction policy
If you come across any factual inaccuracies or outdated information, please don’t hesitate to contact us promptly. We will address these issues by issuing corrections in a dedicated section of our report, pending editorial review.
This correction policy covers company profiles, technology evaluations, and all comparative analyses included in our report. Stakeholders are encouraged to report potential errors to our editorial team using this form.
All corrections will be clearly dated and thoroughly detailed to uphold the integrity of our comparative report and ensure our readers have access to the most accurate and up-to-date information.