Overview

1 The Drug Discovery Process

Drug discovery is presented as a long, expensive, and failure-prone endeavor that must navigate an enormous chemical and biological search space. The chapter positions computational methods—especially machine learning and deep learning—as critical accelerators that triage risk early and expand the funnel of viable candidates. It highlights where AI already delivers value: virtual screening and molecular property prediction, generative chemistry for designing novel structures with desired properties, chemical reaction prediction and automated retrosynthesis to make compounds synthetically accessible, and protein structure prediction to guide target-centric design. Together, these advances reduce experimental burden, improve safety profiling earlier, and help counteract declining R&D productivity by prioritizing better molecules sooner.

To ground these applications, the chapter introduces core ML concepts—training versus inference, the importance of generalization, and supervised (classification, regression) and unsupervised (clustering, dimensionality reduction, generative modeling) learning—and explains how molecules are represented for computation. It describes SMILES as a compact structural notation, expands on canonical and isomeric SMILES to capture stereochemistry, and emphasizes why these details matter for biological activity. Practical tooling such as RDKit is used to convert structures into features (e.g., fingerprints), explore chemical space with techniques like PCA, and build simple predictive models (e.g., logistic regression) that relate structure to pharmacologic classes, illustrating a foundational workflow that scales to more advanced AI pipelines.
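To make the idea of a fixed-length fingerprint concrete, here is a deliberately simplified sketch in pure Python. It is not ECFP and not what RDKit computes: it hashes overlapping character n-grams of a SMILES string into a bit vector, which only mimics the general shape of a substructure fingerprint. The function name `toy_fingerprint`, the n-gram size, and the vector length are illustrative assumptions.

```python
import zlib
from typing import List


def toy_fingerprint(smiles: str, n_bits: int = 64, ngram: int = 3) -> List[int]:
    """Hash overlapping character n-grams of a SMILES string into a bit vector.

    Illustrative only: real fingerprints (e.g., ECFP) hash atom environments
    of the molecular graph, not characters of the SMILES text.
    """
    bits = [0] * n_bits
    for i in range(max(len(smiles) - ngram + 1, 1)):
        fragment = smiles[i:i + ngram]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
print(len(ethanol), len(propanol))  # both vectors have the same fixed length
```

Because every molecule maps to a vector of the same length, the output can be fed directly to standard models, which is the property that matters for the workflow described above.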

The chapter then maps the end-to-end pipeline: target identification and validation; hit discovery via computational and high-throughput screening; hit-to-lead and lead optimization focused on potency, selectivity, ADMET, and PK/PD; and preclinical evaluation before Phase I–III clinical trials, with recognition of expedited pathways for high-need therapies. At each stage, AI can prioritize targets, rapidly rank and filter large libraries, forecast efficacy and toxicity, generate candidates that meet multi-property objectives, plan feasible syntheses, and leverage protein structure predictions to inform design. The overarching message is that integrating data-driven models throughout the workflow broadens exploration, surfaces failures earlier when they are cheaper, and increases the chances that the most promising, safe, and effective drug candidates reach patients faster.

Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 drug-like compounds and the biological search space of 10^5 targets.
Using AI to guide early prediction and optimization of drug-like molecules, we can broaden the number of considered candidate molecules, identify failures earlier when they are relatively inexpensive, and accelerate delivery of novel therapeutics to the clinic for patient benefit.
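A back-of-the-envelope calculation shows why exhaustive screening is out of reach. The sketch below combines the chemical-space estimate of 10^63 compounds with the upper-end throughput figures quoted in the FAQ (about 10^7 compounds/day experimentally and about 10^12 compounds/day in silico). The helper name `days_to_screen` is an illustrative assumption.

```python
CHEMICAL_SPACE = 10**63          # estimated drug-like compounds (from the text)
EXPERIMENTAL_PER_DAY = 10**7     # upper end of experimental throughput (FAQ)
IN_SILICO_PER_DAY = 10**12       # upper end of ML-based virtual screening (FAQ)


def days_to_screen(n_compounds: int, per_day: int) -> int:
    """Days needed to evaluate every compound once at a fixed daily rate."""
    return -(-n_compounds // per_day)  # ceiling division on integers


for label, rate in [("experimental", EXPERIMENTAL_PER_DAY),
                    ("in silico", IN_SILICO_PER_DAY)]:
    days = days_to_screen(CHEMICAL_SPACE, rate)
    # For an exact power of ten, len(str(days)) - 1 is the exponent.
    print(f"{label}: about 10^{len(str(days)) - 1} days")
```

Even at the faster rate the full enumeration would take on the order of 10^51 days, so models are used to prioritize regions of the space instead of covering it.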
In virtual screening, we start with a large, diverse library of compounds that we can filter using a predictive model that has learned to predict what properties each compound has. Our predictive model has learned how to map the chemical space to the functional space. If the compound is predicted to have optimal properties, we carry it over for further experiments. In de novo design, we start with a defined set of property criteria that we can use along with a generative model to generate the structure of our ideal drug candidate. Our generative model knows how to map the functional space to the chemical space.
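The virtual screening half of this picture can be sketched as a filter over a library. The example below is a minimal illustration, assuming a made-up lookup table in place of a trained model; `HYPOTHETICAL_SCORES`, `predict_score`, and `virtual_screen` are hypothetical names, and the scores are invented for demonstration only.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained property predictor. A real model would
# be learned from assay data; these scores are made up for illustration.
HYPOTHETICAL_SCORES: Dict[str, float] = {
    "CCO": 0.12,
    "c1ccccc1O": 0.81,
    "CC(=O)Oc1ccccc1C(=O)O": 0.67,
    "CCN(CC)CC": 0.35,
}


def predict_score(smiles: str) -> float:
    """Map a compound (chemical space) to a predicted property (functional space)."""
    return HYPOTHETICAL_SCORES.get(smiles, 0.0)


def virtual_screen(library: List[str],
                   model: Callable[[str], float],
                   threshold: float) -> List[str]:
    """Keep compounds whose predicted score meets the threshold, best first."""
    scored = [(model(s), s) for s in library]
    hits = [(score, s) for score, s in scored if score >= threshold]
    return [s for score, s in sorted(hits, reverse=True)]


library = list(HYPOTHETICAL_SCORES)
print(virtual_screen(library, predict_score, threshold=0.5))
# → ['c1ccccc1O', 'CC(=O)Oc1ccccc1C(=O)O']
```

Passing the model in as a callable keeps the filtering logic independent of how predictions are produced, so the lookup table could be swapped for any trained predictor.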
The number of new drugs per billion USD of R&D spending follows a downward trajectory. You may have heard of Moore’s Law, which is the observation that the number of transistors on an integrated circuit doubles approximately every two years. Moore’s Law implies that computing power doubles every couple of years while cost decreases. Eroom’s Law (Moore spelled backwards) is the observation that the inflation-adjusted R&D cost of developing new drugs doubles roughly every nine years. Eroom’s Law reflects diminishing returns in developing new drugs, driven by factors such as lower risk tolerance by regulatory agencies (the “cautious regulator” problem), the “throw money at it” tendency, and the need to show more than a modest incremental benefit over current successful drugs (the “better than the Beatles” problem). The plot was constructed with data from Scannell et al., which discusses the trend in greater detail [6].
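The doubling times quoted in this caption translate into simple exponential growth. As a rough sketch (the function name `cost_multiplier` is an assumption, and the nine-year figure is the approximate value stated above), the relative cost after a given number of years can be computed as follows.

```python
def cost_multiplier(years: float, doubling_time: float = 9.0) -> float:
    """Relative cost after `years` if cost doubles every `doubling_time` years."""
    return 2.0 ** (years / doubling_time)


for years in (9, 18, 27):
    print(years, cost_multiplier(years))
# → 9 2.0
# → 18 4.0
# → 27 8.0
```

The same formula with a two-year doubling time describes the transistor-count growth in Moore’s Law, which makes the contrast between the two trends easy to see.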
If we know both the structure of our ligand or compound and the target, we can use structure-based design methods. If we only know the ligand structure, we are restricted to ligand-based design methods. Alternatively, if we only know the target structure, we can use de novo design to guide generation of a suitable drug candidate.
Artificial intelligence, ML, and deep learning are all related to each other.
Example pairs of isomeric SMILES.
Example drug molecules for each USAN stem classification within our data set.
Chemical space exploration in a reduced, 4-dimensional space.
Decision boundary of our logistic regressor for classifying “-cillin” (left) and “-olol” (right) USAN stems. For each plot, colored samples belong to the positive class and uncolored samples belong to the negative class.
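A decision boundary like the one in this figure is the set of points where a logistic regressor predicts a probability of exactly 0.5. The sketch below is not the chapter's code: it is a minimal pure-Python logistic regression trained by gradient descent on made-up 2D points that stand in for reduced fingerprint coordinates. All names (`fit_logistic`, `predict`), the data, and the hyperparameters are illustrative assumptions.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: List[Tuple[float, float]], y: List[int],
                 lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b = [0.0, 0.0], 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), label in zip(X, y):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - label
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict(w: List[float], b: float, point: Tuple[float, float]) -> int:
    """Class 1 if the predicted probability is at least 0.5, i.e. w.x + b >= 0."""
    return int(sigmoid(w[0] * point[0] + w[1] * point[1] + b) >= 0.5)


# Made-up 2D points: positives cluster at high x1, negatives at low x1.
X = [(2.0, 1.0), (2.5, -0.5), (3.0, 0.5), (-2.0, 0.0), (-2.5, 1.0), (-3.0, -1.0)]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
print([predict(w, b, p) for p in X])
```

In two dimensions the boundary is the straight line `w[0]*x1 + w[1]*x2 + b = 0`; samples on one side are assigned to the positive class and samples on the other side to the negative class.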
We can break down drug design into target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical development. Once a drug candidate has progressed to the drug development stage, it will need to pass multiple phases of clinical trials testing safety and efficacy prior to submission to and review by the FDA and launch to market.
We can break down the ADMET properties into the following broad descriptions. Absorption refers to the process by which a drug enters the bloodstream from its administration site, such as the gastrointestinal tract for oral drugs or the respiratory system for inhalation drugs. Distribution pertains to the movement of a drug within the body once it has entered the bloodstream. Metabolism refers to the biochemical transformation of a drug within the body, primarily carried out by enzymes. Metabolic processes aim to convert drugs into more polar and water-soluble metabolites, facilitating their elimination from the body. Excretion involves the removal of drugs and their metabolites from the body. Toxicity assessment aims to evaluate the potential adverse effects of a drug candidate on various organs, tissues, or systems.
We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
In virtual screening, we conducted our search across a chemical space consisting of an enormous set of molecules. In de novo design, we are still conducting an (informal) search, just not across the chemical space. We are now searching across the functional space of potential molecular properties. If our model “learns” which section of the functional space maps to molecules that have ideal binding affinity and safety, then perhaps it can reverse-engineer novel molecule structures in the chemical space that match our functional criteria.
Preclinical trials evaluate drug candidate safety and efficacy on model organisms. Phase I clinical trials evaluate drug candidate safety in its first exposure to humans. Phase II and Phase III clinical trials continue to collect data on safety while measuring drug candidate efficacy on larger groups of patients. The pass rate of our lead compounds decreases drastically as they progress beyond preclinical stages, along with an increase in the associated time to test them.

Summary

  • Developing therapeutics entails a long, arduous process. Traditional development from ideation to market is costly (on the order of billions of dollars), lengthy (10 to 15 years), and risky (attrition of over 90%). Through advances in AI, we can discover cures that have better safety profiles, address medical conditions or diseases with low coverage, and reach patients more quickly.
  • Drug discovery can be thought of as a difficult search problem that exists at the intersection of the chemical search space of 10^63 medicinal compounds and the biological search space of 10^5 targets.
  • Applications of AI to drug design include molecule property prediction for virtual screening, creation of compound libraries with de novo molecule generation, synthesis pathway prediction, and protein folding simulation.
  • ML is a subfield of AI that enables computers to learn from and make decisions based on data, automatically and without explicit programming or rules on how to behave. Example ML algorithms include logistic regression and random forests. Deep learning is a subfield of ML that uses deep neural networks to extract complex patterns and representations from data.
  • We can segment the early drug discovery pipeline into four main phases: target identification, hit discovery, hit-to-lead or lead identification, and lead optimization. Target identification designates a valid target whose activity is worth modulating to address some disease or disorder. Hit discovery uncovers chemical compounds with activity against the target. Lead identification selects the most promising hits and lead optimization improves their potency, selectivity, and ADMET properties to be suitable for preclinical study.
  • Popular, well-maintained chemical data repositories include ChEMBL, ChEBI, PubChem, Protein Data Bank (PDB), AlphaFoldDB, and ZINC. When using a new data source, learn how it was assembled and how quality is maintained. Garbage data in, garbage model out. See “Appendix B: Chemical Data Repositories” for more information.

FAQ

How does drug discovery differ from drug development? Drug discovery covers the preclinical R&D pipeline: target identification and validation, hit discovery, hit-to-lead (lead identification), lead optimization, and preclinical testing. Drug development begins after a drug candidate is selected, progressing through human clinical trials (Phase I: safety and dosing; Phase II: preliminary efficacy and safety; Phase III: confirmatory efficacy, broader safety), then regulatory review and market launch. In certain cases (e.g., first-in-class, orphan, breakthrough, or substantially superior therapies), accelerated pathways can shorten timelines.
Why is drug discovery often called a “needle in a haystack” problem? The chemical search space is astronomical (on the order of 10^63 drug-like molecules), while there are ~10^5 potential human protein targets and variants. Experimental screening throughput (about 10^5–10^7 compounds/day) cannot realistically cover this space. Machine learning and AI help narrow and prioritize candidates, making the search tractable.
What is virtual screening and how does ML improve it? Virtual screening prioritizes compounds in silico by estimating properties such as binding affinity, safety, or solubility before lab testing. Traditional approaches (e.g., docking, molecular dynamics) simulate physics but can be slow at scale. ML-based screening learns directly from data to predict properties rapidly, enabling testing of 10^9–10^12 compounds/day in silico and focusing experiments on the most promising candidates. Virtual screening can be structure-based (uses target structure) or ligand-based (uses known active ligands).
What is generative chemistry and how does it differ from virtual screening? Generative (de novo) design asks models to propose new chemical structures that satisfy desired property criteria, effectively searching in functional/property space and mapping back to chemical space. Virtual screening filters large existing libraries; generative models create novel candidates, helping overcome reliance on known chemotypes and potentially counteracting trends like Eroom’s Law. Care is needed around novelty, patentability, and off-target risks.
What are hits, leads, and lead optimization? Hits are compounds that show activity against the target (e.g., measurable binding or functional effect) found via computational or experimental screening. Leads are the most promising hits after confirmatory assays and early profiling (including ADMET). Lead optimization iteratively modifies lead structures to improve potency, selectivity, pharmacokinetics, safety, and developability while retaining desired activity.
What do PK and PD mean, and why do they matter? Pharmacokinetics (PK) describes what the body does to a drug: absorption, distribution, metabolism, and excretion. Pharmacodynamics (PD) describes what the drug does to the body: mechanism of action, efficacy, and dose–response. Balanced PK/PD and favorable ADMET profiles are crucial to clinical success and minimizing toxicity.
How do inhibitors, agonists, and antagonists differ, and what are off-target effects? Inhibitors reduce or block enzyme activity. Agonists bind receptors to activate a response; antagonists bind without activating and block agonists. Off-target effects occur when a compound binds unintended biomolecules, potentially causing side effects or efficacy issues. They are common in families like kinases, where targets are highly similar.
What is retrosynthesis and why is it important for AI-designed molecules? Retrosynthesis plans a route from a target molecule back to simpler, available precursors by applying reaction transformations in reverse. AI/ML assist by predicting feasible reactions and ranking multistep routes within a vast search space. This is critical for synthesizing novel AI-generated compounds and for optimizing scalable, cost-effective manufacturing (process chemistry).
How are molecules represented for machine learning? Common textual encodings include SMILES (with canonical and isomeric variants to fix uniqueness and encode stereochemistry). For modeling, molecules are often converted to descriptors or fingerprints such as ECFP (e.g., ECFP6) that capture substructural features as fixed-length vectors. Toolkits like RDKit parse structures, compute features, and integrate with ML workflows.
What ML approaches are used, and what is generalization? Supervised learning uses labeled data for tasks like classification (e.g., toxic vs non-toxic) and regression (e.g., binding affinity). Unsupervised learning finds structure in unlabeled data (e.g., clustering, dimensionality reduction for visualization or representation learning). Generalization is a model’s ability to perform well on new, unseen compounds; avoiding overfitting is essential in drug discovery, where models must extrapolate to novel chemical space.
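The retrosynthesis answer in this FAQ describes a backward search from a target molecule to available precursors. A toy version of that search can be sketched as follows, assuming a hand-written table of reverse reactions; the molecule labels, the `REVERSE_REACTIONS` table, and the `plan_route` function are hypothetical placeholders, and the sketch returns the first route it finds without any of the reaction prediction or route ranking that real systems perform.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical reverse-reaction table: product -> alternative precursor sets.
# The labels are placeholders, not real molecules or real reactions.
REVERSE_REACTIONS: Dict[str, List[Tuple[str, ...]]] = {
    "TARGET": [("X", "Y"), ("A", "B")],
    "A": [("C", "D")],
    "X": [("Z",)],
}
AVAILABLE: Set[str] = {"B", "C", "D", "Y"}  # purchasable starting materials


def plan_route(molecule: str, depth: int = 5) -> Optional[List[str]]:
    """Depth-first search for a route from `molecule` back to available precursors.

    Returns the reverse steps as strings, or None if no route is found
    within `depth` steps.
    """
    if molecule in AVAILABLE:
        return []
    if depth == 0:
        return None
    for precursors in REVERSE_REACTIONS.get(molecule, []):
        steps = [f"{molecule} <= {' + '.join(precursors)}"]
        for p in precursors:
            sub = plan_route(p, depth - 1)
            if sub is None:
                break  # this precursor is unreachable, try the next option
            steps.extend(sub)
        else:
            return steps  # every precursor resolved to available materials
    return None


print(plan_route("TARGET"))
# → ['TARGET <= A + B', 'A <= C + D']
```

The first option fails because `Z` is neither available nor producible, so the search backtracks to the second option. The number of such branches grows quickly with route length, which is the search-space problem that learned models help to prune.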
