As sequencing technology becomes more widely available and can generate large datasets on demand, AI/ML-based sequence learning, which combines artificial intelligence with biological data analysis, is emerging as a game-changer in the search for new and more effective drugs. This blog reviews a growing number of sequence-based AI/ML models reported in the literature and in industry, and their impact on drug discovery.
The Foundation: Biological Sequence Data
At the core of this transformation lies the vast amount of biological sequence data made available by advances in sequencing technology. These sequences, the codes of life encoded in DNA, RNA, and proteins, derive from large-scale human genetic studies, functional genomic experiments, and single-cell analyses, and they provide a wealth of information about the molecular underpinnings of health and disease.
The Power of AI/ML in Sequence Analysis
Traditional methods of analyzing this data have often been time-consuming and limited in their ability to uncover complex patterns. This is where AI and machine learning are making contributions. By building analytical, predictive, or generative models on RNA, DNA, or protein sequence data, AI/ML tools empower researchers to:
1. Accelerate Target Identification: Quickly identify promising drug targets by analyzing vast amounts of genetic and genomic data.
2. Predict Drug Efficacy: Conduct virtual screening to identify potential drug candidates before or alongside laboratory testing.
3. Personalize Medicine: Tailor treatments based on an individual’s genetic profile.
4. Repurpose Existing Drugs: Identify new uses for existing medications by understanding their molecular effects.
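Models built on sequence data usually start from a numeric encoding of the sequence itself. The sketch below shows one common choice, one-hot encoding of DNA. The function name, the A, C, G, T column order, and the handling of ambiguous bases are illustrative assumptions, not the input format of any specific model reviewed here.

```python
from __future__ import annotations

# Column order is an assumption made for this sketch; real models define their own.
ALPHABET = "ACGT"


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Return one row per base with a 1 in the column of that base.

    Bases outside ALPHABET (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

The resulting matrix has one row per position and one column per base, which is the kind of input a convolutional or transformer sequence model typically consumes.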
Sequence-based AI/ML Models for Target Identification
These models focus on identifying druggable targets by analyzing genomic, transcriptomic, or proteomic data to pinpoint disease-related proteins, genes, or pathways.
- AlphaFold (DeepMind) – protein and coding DNA sequences
While primarily known for structure prediction, AlphaFold has also been adapted to help identify disease-related proteins and variants from coding sequences. AlphaMissense, for example, builds on AlphaFold to predict the pathogenicity of missense variants across the human proteome. Understanding the structure of these proteins, and which variants are likely to disrupt them, helps in identifying and prioritizing targets for drug discovery.
Reference: Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., … & Avsec, Ž. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381(6664), eadg7492.
Website based on the publication: https://alphamissense.hegelab.org/
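AlphaMissense reports a pathogenicity score between 0 and 1 for each missense variant. As a minimal sketch of how such scores might be binned when prioritizing variants, the code below uses the cutoffs reported with AlphaMissense as far as they are known here (below 0.34 likely benign, above 0.564 likely pathogenic). Treat the cutoffs as an assumption to verify against the publication, and note that the function name and example scores are hypothetical.

```python
# Cutoffs as understood from the AlphaMissense publication; verify before relying on them.
LIKELY_BENIGN_MAX = 0.34
LIKELY_PATHOGENIC_MIN = 0.564


def classify_missense(score: float) -> str:
    """Map a pathogenicity score in [0, 1] to a coarse class label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score < LIKELY_BENIGN_MAX:
        return "likely_benign"
    if score > LIKELY_PATHOGENIC_MIN:
        return "likely_pathogenic"
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical scores, not taken from the published predictions.
    for score in (0.05, 0.45, 0.97):
        print(score, classify_missense(score))
```

Variants in the ambiguous band are usually the ones that need additional evidence, such as functional assays, before they inform target selection.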
- ExPectoSC (Princeton University) – noncoding sequences
ExPectoSC extends ExPecto, a deep learning model that predicts the effects of noncoding genetic variants on gene expression, to primary cell types. It is particularly useful for understanding how genetic variations influence gene regulatory networks and for identifying potential therapeutic targets.
Reference: Sokolova, K., Theesfeld, C. L., Wong, A. K., Zhang, Z., Dolinski, K., & Troyanskaya, O. G. (2023). Atlas of primary cell-type-specific sequence models of gene expression and variant effects. Cell Reports Methods, 3(9).
Interactive website: https://humanbase.io/expectosc
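Variant effect prediction of this kind is often done by in silico mutagenesis: score the reference sequence, score the sequence carrying the variant, and take the difference. The sketch below illustrates only that comparison. The function toy_expression_model is a hypothetical stand-in (motif count plus GC fraction), not the ExPecto or ExPectoSC model, and all names and sequences are invented for illustration.

```python
def apply_variant(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with a single-base substitution at a 0-based position."""
    if sequence[position] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {sequence[position]}")
    return sequence[:position] + alt + sequence[position + 1:]


def toy_expression_model(sequence: str) -> float:
    """Hypothetical scoring function standing in for a trained expression model."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    gc_fraction = sum(base in "GC" for base in sequence) / len(sequence)
    tata_count = sequence.count("TATAAA")
    return 2.0 * tata_count + gc_fraction


def variant_effect(sequence, position, ref, alt, model=toy_expression_model):
    """Predicted effect = model(alternate sequence) - model(reference sequence)."""
    return model(apply_variant(sequence, position, ref, alt)) - model(sequence)


if __name__ == "__main__":
    reference = "GGCTATAAAGGC"  # invented sequence containing one TATAAA motif
    effect = variant_effect(reference, position=5, ref="T", alt="G")
    print(f"predicted effect of T>G at position 5: {effect:.3f}")
```

In this toy case the substitution destroys the motif, so the predicted effect is negative. With a real model, the same difference would be computed per tissue or cell type.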
- GENESIS (GSK)
GSK’s GENESIS platform integrates large-scale genomic data with machine learning to predict and prioritize drug targets. It enables the identification of genetic variants linked to diseases, accelerating the process of target discovery in drug development.
Source: GSK Press Release (2021): GSK’s AI-powered platform GENESIS accelerates drug discovery.
- Big RNA Model (Deep Genomics)
Deep Genomics has developed an AI-driven platform that uses a “big RNA” foundation model to analyze RNA splicing and its impact on gene expression. The model predicts how genetic mutations will alter splicing and can identify new drug targets associated with RNA mis-splicing, such as those involved in neurodegenerative diseases and cancers.
Source: Celaj, A., Gao, A. J., Lau, T. T., Holgersen, E. M., Lo, A., Lodaya, V., … & Frey, B. J. (2023). An RNA foundation model enables discovery of disease mechanisms and candidate therapeutics. bioRxiv, 2023-09.
Deep Genomics (https://www.deepgenomics.com/)
AI/ML Models for Sequence/Structure-Based Drug Design
These models focus on designing or optimizing molecules, proteins, or peptides/oligonucleotides based on their sequence or structure to enhance drug efficacy, specificity, and binding properties.
- AlphaFold (DeepMind)
Beyond target identification, AlphaFold’s accurate protein structure predictions are now being used to assist in rational drug design. It provides 3D models that help researchers design drugs, such as inhibitors or activators, that bind to specific proteins.
Reference: Jumper, J., et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature.
- ProteinMPNN (Baker Lab at University of Washington)
ProteinMPNN is a generative model that designs novel protein sequences based on desired 3D structures. This tool is particularly important for engineering therapeutic proteins with high stability, specificity, and functionality.
Reference: Wang, J., Lisanza, S., Juergens, D., Tischer, D., Watson, J. L., Castro, K. M., … & Baker, D. (2022). Scaffolding protein functional sites using deep learning. Science, 377(6604), 387-394.
AI/ML Models Beyond Sequencing Data
Though sequencing data has emerged as an important data source for developing AI/ML models to accelerate drug discovery, other data modalities such as cell imaging, medical text, and small-molecule structures also play important roles in AI-aided drug discovery and development. For example, Insitro has published a pooled cell painting CRISPR screening platform that enables de novo inference of gene function by self-supervised deep learning. The platform is based on optical images of cultured cells perturbed by CRISPR gene knockouts. (Reference: Sivanandan, S., Leitmann, B., Lubeck, E., Sultan, M. M., Stanitsas, P., Ranu, N., … & Chu, C. (2023). A pooled cell painting CRISPR screening platform enables de novo inference of gene function by self-supervised deep learning. bioRxiv, 2023-08.)
Another example is small-molecule drug design. GENTRL, developed by Insilico Medicine, is a generative reinforcement learning model for de novo drug design. It designs novel small molecules from scratch based on predefined criteria, such as binding affinity and bioavailability, and has reportedly been used to discover drug candidates in as little as 46 days. (Reference: Zhavoronkov, A., et al. (2019). “Deep learning enables rapid identification of potent DDR1 kinase inhibitors.” Nature Biotechnology.)
Clinical Applications and Future Prospects
Looking ahead, the integration of sequence-based AI/ML models into clinical settings holds significant potential. These models can streamline the drug development pipeline, from target identification through to clinical trials, reducing time and cost while increasing the likelihood of successful outcomes. Furthermore, the clinical application of these models promises to tailor treatments more precisely to individual genetic profiles, potentially revolutionizing how we approach patient care and manage disease. Here are a few sequence-based or multimodal models for clinical applications:
- Clinical Trial Optimization – digital twins
Unlearn.AI uses AI to create “digital twins” of patients to optimize clinical trial design and reduce the number of patients needed for placebo groups.
Source: Unlearn.AI website: https://www.unlearn.ai/technology
- Drug Repurposing
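DRKG (Drug Repurposing Knowledge Graph) is an open-source knowledge graph that integrates various biomedical data sources, relating entities such as genes, compounds, and diseases. Machine learning models trained on it can be used to facilitate drug repurposing.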
Source: DRKG: Drug Repositioning Knowledge Graph, Ning Lab (osu.edu)
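One common way to use a knowledge graph like this for repurposing is link prediction over (head, relation, tail) triples with learned embeddings. The sketch below illustrates TransE-style scoring, where a triple is plausible when head + relation lands close to tail. The two-dimensional vectors, entity names, and the "treats" relation are hypothetical toy values, not DRKG data, and the helper names are assumptions made for illustration.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style score: negative Euclidean distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_candidates(compounds, relation, disease):
    """Rank compounds by how plausible (compound, relation, disease) is; best first."""
    scored = [(name, transe_score(vector, relation, disease)) for name, vector in compounds.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy embeddings; real embeddings would be learned from the graph.
    compounds = {"compound_a": [0.0, 1.0], "compound_b": [2.0, -1.0]}
    treats = [1.0, 0.0]
    disease_x = [1.0, 1.0]
    for name, score in rank_candidates(compounds, treats, disease_x):
        print(name, round(score, 3))
```

High-ranking compounds that are not already linked to the disease in the graph are the repurposing hypotheses, and they still need experimental and clinical validation.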
- Biomarker Discovery
Insilico Medicine’s PandaOmics is an AI-powered platform that integrates multi-omics data to identify novel targets and biomarkers.
Reference: Kamya, P., Ozerov, I. V., Pun, F. W., Tretina, K., Fokina, T., Chen, S., … & Zhavoronkov, A. (2024). PandaOmics: an AI-driven platform for therapeutic target and biomarker discovery. Journal of Chemical Information and Modeling, 64(10), 3961-3969.
In conclusion, the AI/ML models reviewed in this blog post demonstrate the diverse applications of AI across drug discovery and clinical development. The field is evolving rapidly, and new models and applications are continuously being developed. We can expect more sophisticated applications that integrate multi-modal data, including imaging and medical text, with sequence data. Such integration should lead to more predictive and accurate models that advance the development of treatments to benefit human health.