Overview

Chapter 1. The data science process

Data science is presented as a practical, cross‑disciplinary craft focused on making and managing data‑driven decisions. Success depends less on exotic tools and more on clear, measurable goals, sound methodology, collaboration across roles, and a repeatable workflow. To ground the ideas, the chapter uses a real banking example—reducing losses from bad loans—to show how a project unfolds from framing the problem to delivering actionable results.

The work is collaborative, with distinct but complementary roles: the project sponsor (who owns success and business value), the client or domain expert (who represents end users), the data scientist (who sets analytic strategy and executes the science), the data architect (who stewards data assets), and operations (who deploys and runs solutions). Projects iterate through fluid stages: defining a specific, quantitative goal and acceptance criteria; collecting, exploring, and cleaning data while watching for quality issues and bias (for example, training only on already‑approved loans); and modeling to extract insight or predictions. Common tasks include classification, scoring, ranking, clustering, finding relations, and characterization, with method choice guided by business constraints such as interpretability and how results will be used.

Models are evaluated against business goals and sensible baselines, emphasizing not just accuracy but precision, recall, and false‑positive rates, and ensuring performance is meaningfully better than a null or existing approach. Communication is tailored to audience: executives care about business impact (such as potential reduction in charge‑offs), end users need guidance on interpretation and appropriate overrides, and operations needs clarity on runtime, data, and maintenance. Deployment is a beginning, not an end: pilot carefully, monitor behavior, accommodate stakeholder feedback, and plan for updates as conditions change. Throughout, setting expectations and determining lower bounds on acceptable performance keep projects realistic, aligned, and poised for impact.

Figure: The lifecycle of a data science project: loops within loops
Figure: The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.
Figure: A decision tree model for finding bad loan applications. The outcome nodes show confidence scores.
Figure: Example slide from an executive presentation

Summary

The data science process involves a lot of back-and-forth—between the data scientist and other project stakeholders, and between the different stages of the process. Along the way, you’ll encounter surprises and stumbling blocks; this book will teach you procedures for overcoming some of these hurdles. It’s important to keep all the stakeholders informed and involved; when the project ends, no one connected with it should be surprised by the final results.

In the next chapters, we’ll look at the stages that follow project design: loading, exploring, and managing the data. Chapter 2 covers a few basic ways to load the data into R, in a format that’s convenient for analysis.

In this chapter you have learned

  • A successful data science project involves more than just statistics. It also requires a variety of roles to represent business and client interests, as well as operational concerns.
  • Every project needs a clear, verifiable, quantifiable goal.
  • You need to set realistic expectations for all stakeholders.

FAQ

What does this chapter mean by “data science”?
Data science is a cross-disciplinary practice that applies methods from statistics, machine learning, data engineering, and analytics to drive decisions and manage their consequences. The emphasis is on clear, quantifiable goals, sound methodology, collaboration across roles, and a repeatable workflow, rather than on any single tool.

What real-world example is used to illustrate the data science process?
The chapter uses a bank’s loan-default problem: build a tool to help loan officers identify risky loan applications and reduce losses from bad loans. It demonstrates concepts with the German Credit data (GoodLoan vs BadLoan) from the UCI Machine Learning Repository.

Which roles are involved in a data science project, and what do they do?
  • Project sponsor: Owns business outcomes and defines success; champions the project.
  • Client (end-user representative/domain expert): Connects the model to real workflows and needs.
  • Data scientist: Sets analytic strategy, builds and evaluates models, and coordinates with stakeholders; cultivates domain empathy.
  • Data architect: Manages data sources, storage, and access.
  • Operations: Deploys and runs the solution within technical constraints.

Why is the project sponsor critical, and how should you engage them?
The sponsor defines success; their sign-off is the organizing goal. Use directed interviews to convert goals into quantitative acceptance criteria, keep them informed with understandable progress and findings, and avoid “black box” surprises.

What are the main stages of the data science lifecycle, and why is it iterative?
  • Define the goal
  • Data collection and management
  • Modeling (analysis)
  • Model evaluation and critique
  • Presentation and documentation
  • Deployment and maintenance
The process is iterative: discoveries at any stage can send you back to refine goals, data, or models (a loops-within-loops dynamic similar to CRISP-DM).

How do you define a measurable, business-relevant goal?
Translate the business need into specific, testable targets and stopping conditions (for example, reduce charge-offs by X% with defined error tolerances). For exploratory work, time-box efforts to generate hypotheses that can later be formalized as concrete goals.
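
One way to make "testable targets" concrete is to write the acceptance criteria down as data and check measured results against them. The sketch below uses Python purely for illustration (the book itself works in R). The thresholds and the names `ACCEPTANCE` and `unmet_criteria` are hypothetical placeholders; in practice the criteria would come from the sponsor interviews.

```python
# Hypothetical acceptance criteria. The numbers are placeholders for
# illustration, not values taken from the chapter.
ACCEPTANCE = {
    "min_recall": 0.5,                # catch at least half of the bad loans
    "max_false_positive_rate": 0.2,   # flag at most 20% of the good loans
}


def unmet_criteria(measured, criteria=ACCEPTANCE):
    """Return a list of human-readable failures; an empty list means 'accepted'."""
    failures = []
    if measured["recall"] < criteria["min_recall"]:
        failures.append(
            f"recall {measured['recall']:.2f} is below {criteria['min_recall']:.2f}"
        )
    if measured["false_positive_rate"] > criteria["max_false_positive_rate"]:
        failures.append(
            f"false positive rate {measured['false_positive_rate']:.2f} "
            f"is above {criteria['max_false_positive_rate']:.2f}"
        )
    return failures


# Made-up measurements for two candidate models.
passing = unmet_criteria({"recall": 0.67, "false_positive_rate": 0.14})
failing = unmet_criteria({"recall": 0.40, "false_positive_rate": 0.30})
print(passing)
print(failing)
```

Keeping the criteria separate from the model code gives the project an explicit stopping condition that stakeholders can read and sign off on.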

What should you consider during data collection and preparation?
  • Availability, relevance, sufficiency, and quality of data
  • Prefer direct measures over proxies (e.g., payment burden vs income alone)
  • Explore/visualize to spot issues
  • Watch for sample bias (e.g., training only on accepted loans can distort conclusions about “risky” vs “safe” applicants)
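
The exploration step can be sketched in a few lines of Python (used here only for illustration; the book itself works in R). The records below are made-up toy values, not rows from the German Credit data, and the field names `credit_history` and `status` and the category names are hypothetical. The sketch computes the fraction of bad loans per category, which is the quantity the chapter's bar chart of defaults by credit history displays.

```python
from collections import defaultdict

# Made-up toy records for illustration only; not taken from the German Credit data.
loans = [
    {"credit_history": "no_credits", "status": "BadLoan"},
    {"credit_history": "no_credits", "status": "GoodLoan"},
    {"credit_history": "paid_duly", "status": "GoodLoan"},
    {"credit_history": "paid_duly", "status": "GoodLoan"},
    {"credit_history": "paid_duly", "status": "BadLoan"},
    {"credit_history": "paid_duly", "status": "GoodLoan"},
    {"credit_history": "critical", "status": "GoodLoan"},
    {"credit_history": "critical", "status": "GoodLoan"},
]


def default_rate_by_category(records, category_key="credit_history"):
    """Return {category: fraction of records in that category labeled BadLoan}."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for rec in records:
        cat = rec[category_key]
        totals[cat] += 1
        if rec["status"] == "BadLoan":
            bad[cat] += 1
    return {cat: bad[cat] / totals[cat] for cat in totals}


rates = default_rate_by_category(loans)
for cat, rate in sorted(rates.items()):
    print(f"{cat}: {rate:.2f} of loans defaulted")
```

A surprising pattern in such a table, such as a seemingly risky category showing few defaults, may be a cue to check for sample bias before modeling.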

How do you choose a modeling approach for the loan-default task?
The task is classification (identify likely defaulters). Suitable options include logistic regression and tree-based methods. When end users need interpretability and confidence scores, a decision tree can be a good fit, showing clear rules and probabilities.
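
To show why a decision tree is easy to interpret, here is a minimal Python sketch (Python is used only for illustration; the book works in R). The tree is written by hand: its split rules, thresholds, field names, and leaf scores are hypothetical and are not the tree shown in the chapter. The `classify` helper returns the predicted label, the leaf's confidence score, and the list of rules followed to reach it.

```python
# A hand-written toy tree. Split rules and leaf scores are hypothetical,
# chosen for illustration; this is not a fitted model.
TREE = {
    "test": ("duration_months", 34),       # follow "yes" when value >= 34
    "yes": {
        "test": ("credit_amount", 10000),  # follow "yes" when value >= 10000
        "yes": {"label": "BadLoan", "confidence": 0.9},
        "no": {"label": "BadLoan", "confidence": 0.6},
    },
    "no": {"label": "GoodLoan", "confidence": 0.75},
}


def classify(node, applicant, trace=None):
    """Walk the tree; return (label, confidence, list of rules followed)."""
    trace = [] if trace is None else trace
    if "label" in node:  # reached a leaf
        return node["label"], node["confidence"], trace
    field, threshold = node["test"]
    passed = applicant[field] >= threshold
    trace.append(f"{field} {'>=' if passed else '<'} {threshold}")
    return classify(node["yes" if passed else "no"], applicant, trace)


# A made-up application.
applicant = {"duration_months": 36, "credit_amount": 4000}
label, confidence, rules = classify(TREE, applicant)
print(label, confidence, rules)
```

In a fitted tree, a leaf's score would typically be the fraction of training examples at that leaf that belong to the predicted class. The rule trace is what lets a loan officer see why an application was flagged and decide whether to override it.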

How do you evaluate a classification model in this context?
  • Use a confusion matrix to summarize predictions vs actuals
  • Consider accuracy but compare against a null model/base rate (e.g., always predicting GoodLoan)
  • Prioritize precision, recall, and false positive rate based on business trade-offs
  • Seek performance that’s significantly better than the null and that generalizes beyond training data
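
The metrics in this list all derive from the four cells of a confusion matrix. The following Python sketch (illustrative only; the book works in R) computes them for a candidate model and for a null model that always predicts GoodLoan. The label vectors are made up, and the helper names `confusion_counts` and `metrics` are hypothetical. BadLoan is treated as the positive class, since that is what the tool is meant to detect.

```python
def confusion_counts(actual, predicted, positive="BadLoan"):
    """Count true/false positives and negatives for the given positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive summary metrics; undefined ratios are reported as 0.0."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up labels: 3 bad loans and 7 good loans.
actual = ["BadLoan"] * 3 + ["GoodLoan"] * 7
predicted = ["BadLoan", "BadLoan", "GoodLoan", "BadLoan"] + ["GoodLoan"] * 6
null_predicted = ["GoodLoan"] * len(actual)  # null model: always GoodLoan

model = metrics(*confusion_counts(actual, predicted))
null = metrics(*confusion_counts(actual, null_predicted))
print("model:", model)
print("null: ", null)
```

On this toy data the null model reaches 70% accuracy while catching no bad loans at all (recall 0), which is why accuracy alone is not enough and why the candidate model's numbers should be read against the base rate.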

How should you present results and plan for deployment and maintenance?
  • Business leaders: Emphasize impact on key metrics (e.g., potential reduction in charge-offs)
  • End users: Explain outputs, confidence scores, example rule traces, and when to override
  • Operations: Cover runtime, data flows, constraints, and update plans
Deploy via a pilot when possible, monitor overrides and drift, and plan for model updates as conditions change.
