Insilico Medicine Unveils nach0: A Comprehensive LLM for Chemical and Biomedical Applications
Insilico Medicine, a clinical-stage AI-driven drug discovery company, in collaboration with NVIDIA, has introduced nach0, a novel transformer-based large language model (LLM) designed for biological and chemical tasks. Detailed in a recent publication in the journal Chemical Science, nach0 is distinguished by its ability to handle multi-domain and multi-task applications, such as natural language understanding, synthetic route prediction, and molecular generation.
Existing biomedical LLMs like BioBERT and SciFive primarily focus on biomedical text mining without incorporating chemical structure descriptions. Although models such as Galactica integrate both text and chemical structures, they lack training for diverse chemical tasks. Nach0 addresses this gap by utilizing a comprehensive dataset that includes abstract texts from PubMed and patent descriptions from the U.S. Patent and Trademark Office, totaling 100 million documents. These were converted into 355 million tokens from abstracts, 2.9 billion tokens from patents, and 4.7 billion tokens representing molecular structures in the simplified molecular-input line-entry system (SMILES).
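To illustrate what it means to represent a molecular structure as tokens, the short Python sketch below splits a SMILES string into atom, bond, branch, and ring-closure symbols. It is an illustrative assumption only: the regular expression and the function name tokenize_smiles are hypothetical, cover only common SMILES symbols, and are not taken from nach0, whose actual tokenizer is not described here.

```python
import re

# Illustrative SMILES tokenizer (an assumption, NOT nach0's actual tokenizer).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"               # two-letter halogens
    r"|%\d{2}"              # two-digit ring-closure labels
    r"|[A-Za-z]"            # single-letter atoms (aromatic atoms are lowercase)
    r"|\d"                  # single-digit ring-closure labels
    r"|[=#\-+\\/().:~@*$]"  # bonds, branches, and other symbols
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not reproduce the input, something was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin, written as a SMILES string.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

A real model vocabulary would typically merge or extend these symbols, so token counts from this sketch are not comparable to the figures quoted above.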
The training of nach0 focused on three primary areas: natural language processing (NLP) tasks, such as document classification and question answering; chemistry-related tasks, including molecular property prediction, molecular generation, and reagent prediction; and cross-domain tasks, such as description-guided molecule design and molecular description generation.
"Nach0 represents a step forward in automating drug discovery through natural language prompts," stated Alex Zhavoronkov, PhD, founder and CEO of Insilico Medicine. Future enhancements may include the integration of protein sequences with specific tokens and further model fine-tuning to accommodate new modalities and the fusion of text and knowledge graph information.
nach0 is built on NVIDIA's BioNeMo generative AI platform, and its training and scalability were enhanced using NVIDIA NeMo, an end-to-end platform for custom generative AI development. NVIDIA’s memory-mapped data loader modules facilitated efficient handling of large datasets with minimal memory usage and optimal reading speed.
Rory Kelleher, Global Head of Business Development for Life Sciences at NVIDIA, comments:
"Generative AI and LLMs are transforming the landscape of scientific discovery in biology and chemistry. [...] Insilico’s domain-specific nach0 model, powered by NVIDIA BioNeMo, is a significant step toward unlocking the full potential of LLMs for drug discovery."
Compared to other LLMs for biomedical understanding, such as FLAN, SciFive, and MolT5, nach0 exhibited superior performance on tasks involving molecular data and significantly outperformed ChatGPT.
In practical applications, nach0 was tested in two case studies. The first involved generating molecules potentially effective against diabetes mellitus. Researchers used prompts to discover biological targets, analyze mechanisms of action, generate molecular structures, propose synthesis steps, and predict molecular properties. From 200 generated SMILES entries, one promising structure was selected based on chemical expertise. In the second case study, nach0 was used within Insilico’s Chemistry42 generative AI drug design platform, generating 8 molecules in 45 minutes that met specific criteria.
"We anticipate that as nach0 evolves, it will require less supervision and will be able to generate and validate promising therapeutic options for medicinal chemists," notes Maksim Kuznetsov, a senior research scientist at Insilico and one of the lead authors of the paper.
Insilico Medicine has been at the forefront of integrating generative AI in drug discovery and development since 2016. The company’s AI platform, Pharma.AI, has been instrumental in creating a robust pipeline of therapeutic assets across various disease areas, including fibrosis, cancer, immunology, and aging-related diseases. Since 2021, Insilico has nominated 18 preclinical candidates and advanced six to clinical stages. Notably, in March 2024, Insilico published data in Nature Biotechnology on its lead drug, a TNIK inhibitor for idiopathic pulmonary fibrosis that is currently in Phase II trials.