Machine Learning for Drug Discovery

Noah Flynn

MEAP began February 2024
Last updated January 2026
Publication in Summer 2026 (estimated)

ISBN 9781633437661
400 pages (estimated)

Included with a Manning Online subscription

printed in black & white

available in Simplified Chinese

resources: Source code, Book forum, Source code on GitHub

table of contents

PART 1: FUNDAMENTALS OF CHEMINFORMATICS & MACHINE LEARNING

1 The Drug Discovery Process

1.1 Deep Learning’s Value Proposition

1.1.1 Needle in a Haystack

1.1.2 Virtual Screening & Property Prediction

1.1.3 Generative Chemistry

1.1.4 Chemical Reaction Prediction & Retrosynthesis

1.1.5 Protein Folding & Simulations

1.2 What is ML? What is a Molecule?

1.2.1 What is ML?

1.2.2 What is a Molecule? The Joy of SMILES

1.2.3 An Example Application with USAN Stems & RDKit

1.3 Introducing Drug Discovery

1.3.1 Target Identification & Hit Discovery

1.3.2 Lead Identification & Lead Optimization

1.3.3 Drug Development

1.4 Summary

1.5 References

2 Ligand-based Screening: Filtering & Similarity Searching

2.1 What is Virtual Screening?

2.1.1 Virtual Screening Taxonomy

2.1.2 Scenario: Hit Identification of Antimalarial Compounds

2.1.3 Strategy: Similarity Searching

2.2 Loading a Virtual Screening Library

2.2.1 Understanding the Dataset as a Structure Data File

2.2.2 Molecule Sanitization

2.2.3 Molecular Descriptors

2.3 Compound Filters

2.3.1 Property-based Filters

2.3.2 Structure-based Filters

2.4 Fingerprints: Representing Molecules as Numbers

2.4.1 Structural Keys

2.4.2 Hashed Fingerprints

2.4.3 Fingerprinting our Library

2.5 Similarity Searching

2.5.1 Defining "Similarity"

2.5.2 Searching against a Query

2.6 Summary

2.7 Exercises

2.8 References

3 Ligand-based Screening: Machine Learning

3.1 Problem Understanding

3.1.1 Your Machine Learning Task

3.2 Data Acquisition, Exploration, & Curation

3.2.1 Loading and Exploring the hERG Blockers Dataset

3.2.2 Validating & Standardizing SMILES

3.2.3 Feature Generation & Exploration

3.3 Application of Linear Models

3.3.1 Learning from Data

3.3.2 Training our Linear Model

3.3.3 Evaluating our Model

3.4 Improving our Model

3.4.1 Regularization

3.4.2 Non-linear Transformation

3.4.3 Hyperparameter Tuning

3.4.4 Evaluating the Best Model

3.4.5 Saving and Applying our Model

3.5 Summary

3.6 Exercises

3.7 References

4 Solubility Deep Dive with Linear Models

4.1 Solubility with Linear Regression

4.1.1 Load the Data

4.1.2 Target Variable Distribution

4.1.3 Feature Computation & Correlation

4.1.4 Linear Regression

4.2 The Learning Algorithm

4.2.1 Linear Models

4.2.2 Ordinary Least Squares (OLS)

4.2.3 Gradient Descent

4.3 Touring Scikit-Learn Linear Models

4.3.1 Defining a Benchmark

4.3.2 Ridge Regression & Feature Selection

4.3.3 Robust Estimation with RANSAC

4.3.4 Support Vector Regression

4.4 Bias-Variance Decomposition

4.4.1 A Case Study in Polynomials

4.4.2 Learning Curves

4.4.3 Validation Curves

4.5 Summary

4.6 Exercises

4.7 References

5 Classification: Cytochrome P450 Inhibition

5.1 Binary Classification of CYP3A4 Inhibition

5.1.1 Logistic Regression in Theory

5.1.2 Logistic Regression in Practice

5.1.3 (Mis)calibration: Questioning Probabilistic Output Assumptions

5.2 Tree-based Models

5.2.1 Decision Trees

5.2.2 Dealing with Data Set Imbalance

5.3 Ensemble Learning: A Preview

5.3.1 Decision Tree Bias-Variance Trade Offs

5.3.2 Random Forests: A Strong Collective of Weak Decision Trees

5.3.3 Multi-Instance Learning: Model-agnostic Ensemble Learning

5.4 Multiclass & Multilabel Classification

5.4.1 Multiclass Classification

5.4.2 Multilabel Classification

5.5 Summary

5.6 Exercises

5.7 References

6 Case Study: Small Molecule Binding to an RNA Target

6.1 Small Molecule Binding to an RNA Target

6.1.1 The HIV-1 Trans-Activation Response (TAR) RNA Model System

6.1.2 Structure: Computing Descriptors

6.1.3 Activity: Experimentally Measuring Binding Profiles

6.2 Representative Data Splitting & Dimensionality Reduction

6.2.1 Data Refinement

6.2.2 Representative Data Splitting: Kennard-Stone Algorithm

6.2.3 Dimensionality Reduction: Principal Component Analysis

6.3 QSAR Modeling: Mapping Descriptors to Measurements

6.3.1 Exemplary QSAR Modeling Workflow

6.3.2 QSAR Model Interpretation

6.4 Gradient Boosting Machines

6.4.1 Informing the RNA-Binding Chemical Space

6.4.2 The XGBoost Magic Trick

6.4.3 Model-agnostic Interpretation

6.5 Summary

6.6 Exercises

6.7 References

7 Unsupervised Learning: Repurposing Drugs, Curating Compounds, & Screening Fragments

7.1 Dimensionality Reduction: Drug Repurposing

7.1.1 High Throughput Screening Data

7.1.2 Self-Organizing Maps

7.1.3 Drug Repurposing

7.1.4 Uniform Manifold Approximation & Projection (UMAP)

7.2 Clustering: Curating Diverse Compound Libraries

7.2.1 Diversity and Focus

7.2.2 Combinatorial Libraries

7.2.3 Cluster-based Compound Selection

7.2.4 Dissimilarity-based Compound Selection

7.3 Density Estimation: Fragment-based Drug Discovery

7.3.1 Fragment-based Drug Discovery

7.3.2 Pharmacophore Modeling

7.3.3 Density Estimation

7.4 Summary

7.5 Exercises

7.6 References

PART 2: DEEP LEARNING FOR MOLECULES & STRUCTURAL BIOLOGY

8 Introduction to Deep Learning

8.1 Ligand-based VS with PyTorch

8.1.1 Protein Kinases

8.1.2 Our First PyTorch Model

8.1.3 Enrichment Factors in Virtual Screening

8.2 Neural Networks & PyTorch Mechanics

8.2.1 Models as Computational Graphs

8.2.2 Implementing a Neural Network

8.2.3 Training & Applying our Neural Network

8.3 Summary

8.4 References

9 Structure-based Drug Design with Active Learning

9.1 Docking: A Core SBDD Technique

9.1.1 Protein-Ligand Docking

9.1.2 Minimal Protein-Ligand Docking Workflow

9.1.3 Prepare the Protein & Ligand Structures

9.1.4 Run A Docking Experiment

9.1.5 Interaction Fingerprints

9.2 Active Learning for Hit Identification: Deep Docking

9.2.1 Active Learning: Smart Choices with Limited Resources

9.2.2 The Deep Learning Surrogate Model

9.2.3 Training the Surrogate Model

9.2.4 Initial Sampling

9.2.5 Acquisition Functions for Active Learning

9.2.6 The Oracle

9.2.7 The Active Learning Loop

9.3 Active Learning for Lead Optimization: Free Energy Perturbation Experiments

9.3.1 The Role of Free Energy Calculations

9.4 Summary

9.5 References

10 Generative Models for De Novo Design

10.1 The Quest for Designer Molecules

10.1.1 The Challenge of Chemical Space

10.1.2 Generative Models: A New Paradigm for Molecular Design

10.1.3 Reinforcement Learning for Targeted Generation

10.2 Building the World: Generative Models for Molecules

10.2.1 Essential Properties of a Good Molecular Latent Space

10.2.2 Learning to Compress and Recreate: The Autoencoder

10.2.3 The Autoencoder Architecture

10.2.4 Experiment on the MOSES Benchmark

10.3 Creating a Continuous Chemical Universe: Variational Autoencoders

10.3.1 The Variational Autoencoder

10.3.2 Posterior Collapse and Cyclic VAE

10.3.3 Monitoring Metrics

10.3.4 Training & Evaluating VAE-CYC

10.4 Understanding Sequential Molecular Structure: Recurrent Neural Networks

10.4.1 How RNNs Process Sequences

10.4.2 Resolving Vanishing Gradients with Gated Recurrent Units

10.4.3 Sequence-to-Sequence Architecture: Encoding and Decoding Molecules

10.4.4 Revisiting Tokenization: Byte-Pair Encoding for Molecules

10.4.5 Putting It All Together: VAE-CYC

10.5 Summary

10.6 References

11 Graph Neural Networks for Predicting Drug-Target Affinity

11.1 Challenges of Drug-Target Affinity Prediction

11.1.1 Benchmarking Drug-Target Affinity

11.1.2 Necessity for Better Molecular and Protein Representations

11.2 Molecular Graph Construction

11.2.1 Small Molecules as Molecular Graphs

11.2.2 From SMILES to Molecular Graphs

11.3 Proteins as Residue Interaction Graphs

11.3.1 Contact Maps as Structure Representation

11.3.2 Amino Acid Features and Position-Specific Scoring Matrices

11.4 Graph Neural Network Foundations

11.4.1 The Engine of GNNs: Message Passing

11.4.2 Graph Convolutional Networks

11.4.3 Graph Attention Networks (GATs)

11.4.4 Graph Isomorphism Networks (GINs)

11.4.5 Graph-level Pooling for Molecular Representations

11.4.6 Challenges in Deep GNNs

11.5 DualGraphDTA’s Dual-Stream Architecture

11.5.1 The Overall Architecture

11.5.2 Architectural Components and Information Flow

11.5.3 Embedding Fusion and Prediction

11.5.4 Alternative Architectures: GAT and GIN

11.5.5 Graph Data Preparation & Abstraction with PyTorch Geometric

11.5.6 Loss Function and Optimization

11.6 Evaluating Drug-Target Interaction Models

11.6.1 Performance Comparison of GCN, GAT, and GIN

11.6.2 Evaluating Pretrained Checkpoints

11.7 Real-World Applications & Moving Beyond Topology

11.8 Summary

11.9 References

12 Transformers for Protein Structure Prediction

12.1 Conjoined Problems: Structure Prediction & Protein Design

12.1.1 Protein Structure Essentials

12.1.2 Why Protein Folding is a Hard Modeling Problem

12.2 SimpleFold: An End-to-End Example

12.2.1 Stage 1: Configuration

12.2.2 Stage 2: Loading Pretrained Models

12.2.3 Stage 3: Inference Pipeline

12.2.4 Stage 4: Flow Matching

12.2.5 Stage 5: From Sequence to Structure

12.2.6 Comparing Approaches: PLM-Based vs. MSA-Based

12.2.7 Retrospective

12.3 Protein Language Models

12.3.1 Why Transformers? From Local to Global Attention

12.3.2 The Encoder-Only Transformer Architecture

12.3.3 Training a Small Protein Language Model

12.3.4 Downstream Applications: Antibody Classification

12.3.5 The Evolutionary Scale Modeling (ESM) Family of Models

12.4 Summary

12.5 References

13 Multimodal AI Systems for End-to-End Drug Discovery

13.1 From Components to Systems: The Integration Challenge

13.1.1 The Evolution of AI-Driven Drug Discovery

13.1.2 Defining Multimodal AI Systems

13.2 Platform Architecture & Components

13.2.1 Platform Overview

13.2.2 Integration Patterns Across Platforms

13.2.3 Integrating Multimodal Representations

13.3 End-to-End Workflows in Practice

13.3.1 Case Study 1: Rentosertib for Idiopathic Pulmonary Fibrosis

13.3.2 Case Study 2: Virtual Screening at Billion-Compound Scale

13.3.3 Case Study 3: Protein Therapeutic Design - A Multi-Model Workflow

13.4 Clinical Translation: Hype vs. Reality

13.4.1 What AI Has Improved

13.4.2 What AI Has Not Solved

13.4.3 Current Limitations and Challenges

13.4.4 Future Directions (2023-2030)

13.4.5 Final Comments

13.5 Summary

13.6 References

Appendix

Appendix A: Glossary

Appendix B: Chemical Data Repositories

B.1 Data Sources

B.2 Notes on Usage

B.3 Garbage In, Garbage Out

B.4 References

Appendix C: Knowledge Distillation: Shrinking Models for Efficient, Hierarchical Molecular Generation

C.1 Generative Chemistry as a Motivating Use Case

C.1.1 The Evolution of Molecular Generation

C.1.2 Hierarchical Molecular Generation

C.1.3 HierVAE Architecture for Hierarchical Molecular Generation

C.1.4 Bridge to Knowledge Distillation

C.2 Core Knowledge Distillation Concepts

C.2.1 The Knowledge Distillation Paradigm

C.2.2 Tapping into Dark Knowledge

C.2.3 Controlling Information with Temperature

C.2.4 Online vs. Offline Distillation

C.2.5 Expanding the Dataset with Pseudo-Labeling

C.3 Assembly: Putting it All Together

C.3.1 Multi-component Distillation Loss

C.3.2 Training Strategy: KL Annealing and the Dual Forward Pass

C.3.3 Student Model Design: Balancing Compression and Capability

C.3.4 End-to-end Knowledge Distillation

C.3.5 Future Directions

C.4 Summary

C.5 References

Appendix D: Technical Deep Dive into Protein Structure Prediction

D.1 Protein Biophysics and Chemistry

D.1.1 Amino Acid Chemistry

D.1.2 Hydrogen Bonding Mechanics

D.1.3 Torsion Angles and the Ramachandran Plot

D.1.4 The Physics of Folding

D.2 SimpleFold End-to-End Example Details

D.2.1 Protein Representation Formats

D.2.2 Stage 1: Configuration

D.2.3 Stage 2: Loading Pretrained Models

D.2.4 Stage 3: Inference Pipeline

D.2.5 Stage 5: Executing Protein Structure Prediction

D.2.6 Alignment-based Validation

D.2.7 Metrics, Metrics, Metrics

D.2.8 Understanding CAMEO and CASP Benchmarks

D.3 Training Protein Language Models

D.3.1 Tokenization: Converting Proteins to Numbers

D.3.2 Scaled Dot-Product Attention

D.3.3 Encoding Position Information

D.3.4 Encoder Layer: Attention + Feed-Forward

D.3.5 Training our Protein Language Model

D.3.6 When to Train Your Own Model

D.4 ESM-2 Case Study: Predicting Mutation Effects

D.5 Advanced SimpleFold Architecture

D.5.1 Invoking the Bitter Lesson in Structural Biology

D.5.2 Flow Matching: From Noise to Native Structure

D.6 Timestep Conditioning with Diffusion Transformer Blocks

D.6.1 SwiGLU Activation Functions

D.6.2 Rotary Position Embeddings

D.6.3 QK Normalization

D.6.4 Diffusion Transformer (DiT) Blocks

D.7 Deconstructing SimpleFold: Encoder-Trunk-Decoder

D.7.1 Atom Encoder: Input Representations and Embeddings

D.7.2 Residue Trunk: Global Attention

D.7.3 Atom Decoder: Structure Prediction

D.7.4 Architecture Summary

D.8 Additional Details on Training, Sampling, & Inference

D.8.1 Training Objective

D.8.2 Timestep Resampling & Training Procedure

D.9 Sampling and Inference

Overview

1 The Drug Discovery Process

This chapter introduces the modern drug discovery landscape—its humanitarian stakes, long timelines, high costs, and steep attrition—and motivates computation as a force multiplier. It frames discovery as an immense search problem at the intersection of astronomical chemical space and a vast set of biological targets, far beyond the reach of brute-force experimentation. The narrative positions AI and machine learning, particularly deep learning, as practical tools to accelerate and de-risk this search through rapid virtual screening and property prediction, de novo generative design of molecules, data-driven synthesis planning, and advances in protein structure prediction that inform target understanding and drug design.

The chapter surveys where deep learning is already delivering value: predicting molecular properties to triage libraries at scale; replacing or complementing expensive docking with learned models; generating novel chemical entities that meet specified property profiles; forecasting reactions and automating retrosynthetic routes to ensure designs are synthesizable and manufacturable; and narrowing the sequence-to-structure gap in proteins. It weighs novelty versus incremental “me-too” efforts (in the context of Eroom’s Law), discusses the promise and pitfalls of privileged scaffolds, and argues that learned, task-specific representations reduce bias relative to hand-crafted descriptors—supporting discoveries like structurally novel antibiotics. To ground these methods, the chapter builds fundamentals in ML (supervised vs. unsupervised learning, generalization and overfitting), molecular representation (SMILES and stereochemistry), and practical tooling (RDKit, fingerprints, PCA, and simple classifiers on FDA-approved drugs grouped by therapeutic stems).

Finally, the chapter walks through the end-to-end pipeline: target identification and validation; hit discovery via computational and high-throughput screening; hit-to-lead refinement; lead optimization focused on potency, selectivity, and ADMET alongside PK/PD considerations; and preclinical testing before clinical development. It distinguishes discovery from development, summarizing Phases I–III—progressing from initial human safety to large-scale efficacy and risk–benefit confirmation—and notes expedited pathways for urgent or first-in-class therapies. Throughout, it highlights where AI/ML can broaden candidate funnels, prioritize experiments, predict safety and exposure, propose syntheses, and increasingly generate and optimize candidates, compressing cycle times and improving the likelihood of clinical success.

Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10⁶³ drug-like compounds and the biological search space of 10⁵ targets.

Using AI to guide early prediction and optimization of drug-like molecules, we can broaden the number of considered candidate molecules, identify failures earlier when they are relatively inexpensive, and accelerate delivery of novel therapeutics to the clinic for patient benefit.

In virtual screening, we start with a large, diverse library of compounds that we can filter using a predictive model that has learned to predict what properties each compound has. Our predictive model has learned how to map the chemical space to the functional space. If the compound is predicted to have optimal properties, we carry it over for further experiments. In de novo design, we start with a defined set of property criteria that we can use along with a generative model to generate the structure of our ideal drug candidate. Our generative model knows how to map the functional space to the chemical space.

The number of new drugs per billion USD of R&D spending follows a downward trajectory. You may have heard of Moore’s Law, which is the observation that the number of transistors on an integrated circuit doubles approximately every two years. Moore’s Law implies that computing power doubles every couple of years while cost decreases. Eroom’s Law (Moore spelled backwards) is the observation that the inflation-adjusted R&D cost of developing new drugs doubles roughly every nine years. Eroom’s Law reflects diminishing returns in developing new drugs, driven by factors such as lower risk tolerance by regulatory agencies (the “cautious regulator” problem), the “throw money at it” tendency, and the need to show more than a modest incremental benefit over current successful drugs (the “better than the Beatles” problem). The plot was constructed with data from Scannell et al., which discusses the trend in greater detail [6].
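The two doubling rules above can be compared with a little arithmetic. The following is a minimal sketch, assuming clean exponential growth with the doubling times quoted in the caption (nine years for Eroom’s Law, two years for Moore’s Law); the function names are illustrative and do not come from the book’s code.

```python
def eroom_cost_multiplier(years: float, doubling_time_years: float = 9.0) -> float:
    """Growth factor of inflation-adjusted R&D cost per new drug after `years`.

    Assumes idealized exponential growth: cost doubles every `doubling_time_years`.
    """
    return 2.0 ** (years / doubling_time_years)


def moore_transistor_multiplier(years: float, doubling_time_years: float = 2.0) -> float:
    """Growth factor of transistor count after `years` under the same idealization."""
    return 2.0 ** (years / doubling_time_years)


if __name__ == "__main__":
    # Compare both trends over a few multiples of the nine-year doubling time.
    for span in (9, 18, 27, 36):
        cost = eroom_cost_multiplier(span)
        transistors = moore_transistor_multiplier(span)
        print(f"{span:>2} years: R&D cost per drug x{cost:,.0f}, transistors x{transistors:,.0f}")
```

Under these assumptions the two curves diverge quickly: each nine-year step doubles the cost per new drug while the transistor count grows by a factor of 2^(9/2).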

If we know both the structure of our ligand or compound and the target, we can use structure-based design methods. If we only know the ligand structure, we are restricted to ligand-based design methods. Alternatively, if we only know the target structure, we can use de novo design to guide generation of a suitable drug candidate.

Artificial intelligence, ML, and deep learning are nested fields: deep learning is a subfield of ML, which is itself a subfield of artificial intelligence.

Example pairs of isomeric SMILES.

Example drug molecules for each USAN stem classification within our data set.

Chemical space exploration in a reduced, 4-dimensional space.

Decision boundary of our logistic regressor for classifying “-cillin” (left) and “-olol” (right) USAN stems. For each plot, colored samples belong to the positive class and uncolored samples belong to the negative class.

We can break down drug design into target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical development. Once a drug candidate has progressed to the drug development stage, it will need to pass multiple phases of clinical trials testing safety and efficacy prior to submission to and review by the FDA and launch to market.

We can break down the ADMET properties into the following broad descriptions. Absorption refers to the process by which a drug enters the bloodstream from its administration site, such as the gastrointestinal tract for oral drugs or the respiratory system for inhalation drugs. Distribution pertains to the movement of a drug within the body once it has entered the bloodstream. Metabolism refers to the biochemical transformation of a drug within the body, primarily carried out by enzymes. Metabolic processes aim to convert drugs into more polar and water-soluble metabolites, facilitating their elimination from the body. Excretion involves the removal of drugs and their metabolites from the body. Toxicity assessment aims to evaluate the potential adverse effects of a drug candidate on various organs, tissues, or systems.

We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.

In virtual screening, we conducted our search across a chemical space consisting of an enormous set of molecules. In de novo design, we are still conducting an (informal) search, just not across the chemical space. We are now searching across the functional space of potential molecular properties. If our model “learns” which section of the functional space maps to molecules that have ideal binding affinity and safety, then perhaps it can reverse-engineer novel molecule structures in the chemical space that match our functional criteria.

Preclinical trials evaluate drug candidate safety and efficacy on model organisms. Phase I clinical trials evaluate drug candidate safety in its first exposure to humans. Phase II and Phase III clinical trials continue to collect data on safety while measuring drug candidate efficacy on larger groups of patients. The pass rate of our lead compounds decreases drastically as they progress beyond preclinical stages, along with an increase in the associated time to test them.

Summary

Developing therapeutics entails a long, arduous process. Traditional development from ideation to market is costly (on the order of billions of dollars), lengthy (10 to 15 years), and risky (attrition of over 90%). Through advances in AI, we can discover cures that have better safety profiles, address medical conditions or diseases with low coverage, and reach patients more quickly.
Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10⁶³ medicinal compounds and the biological search space of 10⁵ targets.
Applications of AI to drug design include molecule property prediction for virtual screening, creation of compound libraries with de novo molecule generation, synthesis pathway prediction, and protein folding simulation.
ML is a subfield of AI that enables computers to learn from and make decisions based on data, automatically and without explicit programming or rules on how to behave. Example ML algorithms include logistic regression and random forests. Deep learning is a subfield of ML that uses deep neural networks to extract complex patterns and representations from data.
We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
Popular, well-maintained chemical data repositories include ChEMBL, ChEBI, PubChem, Protein Data Bank (PDB), AlphaFoldDB, and ZINC. When using a new data source, learn how it was assembled and how quality is maintained. Garbage data in, garbage model out. See “Appendix B: Chemical Data Repositories” for more information.

FAQ

What are the main stages of drug discovery, and how do they differ from drug development?

Drug discovery comprises target identification and validation, hit discovery, hit-to-lead (lead identification), and lead optimization, followed by preclinical studies in model organisms. Drug development begins after preclinical success and covers human clinical trials (Phase I safety; Phase II preliminary efficacy and safety; Phase III confirmatory efficacy, broader safety), culminating in regulatory review (e.g., FDA) and market launch.

Why is searching chemical and biological space so challenging?

The chemical space of drug-like molecules is astronomically large (~10^63), far beyond what brute-force experiments can cover. Even with high-throughput assays (10^5–10^7 compounds/day), full coverage would take longer than the age of the universe. The biological space is also vast (~10^5 human proteins and variants). ML narrows this intractable search by prioritizing likely successes computationally.
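The "longer than the age of the universe" claim can be checked with back-of-the-envelope arithmetic. This is a rough sketch using the figures quoted above (about 10^63 molecules and 10^5 to 10^7 compounds per day); the age of the universe (about 13.8 billion years) is an outside assumption, and the names are illustrative.

```python
CHEMICAL_SPACE_SIZE = 10**63          # drug-like molecules, estimate quoted above
ASSAY_RATES_PER_DAY = (10**5, 10**7)  # high-throughput screening range quoted above
AGE_OF_UNIVERSE_YEARS = 1.38e10       # assumption: roughly 13.8 billion years
DAYS_PER_YEAR = 365.25


def years_to_screen(n_compounds: int, compounds_per_day: int) -> float:
    """Years needed to test every compound once at a constant daily rate."""
    return n_compounds / compounds_per_day / DAYS_PER_YEAR


if __name__ == "__main__":
    for rate in ASSAY_RATES_PER_DAY:
        years = years_to_screen(CHEMICAL_SPACE_SIZE, rate)
        ratio = years / AGE_OF_UNIVERSE_YEARS
        print(f"{rate:.0e} compounds/day: {years:.1e} years "
              f"({ratio:.1e} times the age of the universe)")
```

Even at the fastest rate the gap spans dozens of orders of magnitude, which is why the search has to be prioritized rather than exhaustive.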

How does virtual screening work, and how do ML-based approaches compare to docking and simulations?

Virtual screening prioritizes compounds by predicting properties like binding affinity and toxicity before lab testing. Physics-based methods (docking, molecular dynamics) simulate interactions but are computationally expensive. ML-based screening learns from data to directly predict properties, enabling throughput on the order of 10^9–10^12 compounds per day and reducing costly experimental or docking steps.

What is de novo design (generative chemistry), and how does it help address Eroom’s Law?

De novo design uses AI to generate novel chemical structures that satisfy desired property profiles by searching the functional property space and mapping back to structures. This promotes innovation beyond “me-too” drugs, countering Eroom’s Law (rising R&D cost per new drug) by proposing pre-optimized, potentially patentable candidates while reducing downstream attrition.

What is chemical reaction prediction and retrosynthesis, and why does it matter?

Reaction prediction forecasts products of given reactants; retrosynthesis starts from the target molecule and proposes simpler precursors and reaction steps. Because each step can branch into ~10^4 transformations, deep learning helps efficiently navigate route planning. This accelerates lab synthesis of novel AI-designed molecules and optimizes manufacturing (process chemistry) for cost, speed, and safety.
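To see why route planning needs guidance, consider how quickly the search tree grows. The sketch below assumes a uniform branching factor of about 10^4 candidate transformations per step (the figure quoted above) and no pruning. Real planners do prune, so this is an upper-bound illustration with hypothetical names.

```python
BRANCHING_FACTOR = 10**4  # candidate transformations per retrosynthetic step (quoted above)


def routes_at_depth(depth: int, branching: int = BRANCHING_FACTOR) -> int:
    """Number of distinct step sequences of length `depth` in an unpruned search tree."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return branching ** depth


if __name__ == "__main__":
    # Route length here means the number of reaction steps back from the target.
    for depth in range(1, 6):
        print(f"{depth} step(s): {routes_at_depth(depth):.0e} candidate routes")
```

A learned model that ranks transformations lets a planner expand only a handful of branches per step instead of all of them.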

How has deep learning advanced protein structure prediction, and why is structure important?

Deep learning (e.g., AlphaFold2) predicts 3D protein structures from sequences with high accuracy, closing the gap between abundant sequence data and scarce structural data. Knowing structure clarifies function, disease mechanisms, and ligand-binding sites, informing target selection, binding affinity modeling, and rational design.

What do AI, ML, and deep learning mean here, and what are supervised vs unsupervised learning?

AI is the broad field of intelligent systems. ML is a subset that learns patterns from data. Deep learning is a subset of ML using neural networks. Supervised learning uses labeled data for tasks like classification (e.g., toxic vs non-toxic) and regression (e.g., solubility). Unsupervised learning uses unlabeled data for clustering, dimensionality reduction/representation learning, compression, and generative modeling. Generalization (avoiding overfitting) is critical for predicting properties of novel compounds.

How are molecules represented for ML models?

Common textual encodings include SMILES, which record atoms, bonds, and connectivity. Canonical SMILES provide a consistent, unique string per structure for a given canonicalization algorithm; isomeric SMILES additionally encode stereochemistry and isotopes. Models often use numerical features like molecular descriptors and fingerprints (e.g., ECFP) derived from these representations to learn structure–property relationships.
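A short RDKit sketch makes the canonical and isomeric distinction concrete. It assumes RDKit is installed; the `canonical` helper and the example molecules (ethanol and the two enantiomers of alanine) are illustrative choices, not taken from the book.

```python
from rdkit import Chem


def canonical(smiles: str, isomeric: bool = True) -> str:
    """Return RDKit's canonical SMILES, optionally keeping stereochemistry."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when a SMILES string cannot be parsed
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, isomericSmiles=isomeric)


if __name__ == "__main__":
    # The same molecule (ethanol) written two ways maps to one canonical string.
    print(canonical("OCC"), canonical("CCO"))

    # Two enantiomers of alanine differ only in the chirality tag (@ vs @@).
    enantiomer_a = "C[C@H](N)C(=O)O"
    enantiomer_b = "C[C@@H](N)C(=O)O"
    print(canonical(enantiomer_a), canonical(enantiomer_b))

    # Dropping stereo information collapses both onto the same string.
    print(canonical(enantiomer_a, isomeric=False),
          canonical(enantiomer_b, isomeric=False))
```

Canonical forms are specific to the toolkit that produced them, so strings from different tools should be re-canonicalized with a single tool before they are compared or deduplicated.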

What are ADMET and PK/PD, and which properties are optimized in lead optimization?

ADMET covers Absorption, Distribution, Metabolism, Excretion, and Toxicity—key drivers of clinical success. PK (“what the body does to the drug”) includes ADME; PD (“what the drug does to the body”) covers efficacy and mechanism. Lead optimization tunes: efficacy (max effect), potency (dose required), selectivity/safety (on-target vs off-target effects, therapeutic index), and bioavailability (fraction reaching circulation and the target).

How does RDKit support ML workflows in drug discovery?

RDKit parses and writes chemical formats (e.g., SMILES/SDF), builds molecule objects, computes descriptors and fingerprints (e.g., ECFP), performs substructure searches, and renders structures. It integrates with Python ML stacks (e.g., scikit-learn) to enable feature generation, visualization, dimensionality reduction (PCA), and modeling (e.g., logistic regression) directly from chemical data.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$47.99 $23.99

you save $24.00 (50%)
