Overview

Chapter 1. The data science process

This chapter introduces data science as a pragmatic, cross‑disciplinary practice focused on making and managing data‑driven decisions for business and scientific problems. Success depends less on exotic tools and more on clear, quantifiable goals, sound methodology, close collaboration, and a repeatable workflow. The chapter identifies five roles that must collaborate on a project: project sponsor, client (end‑user representative), data scientist, data architect, and operations. It also stresses continual engagement with sponsors and clients, and the data scientist’s need for domain empathy to frame the right problems and solutions.

The chapter presents an iterative lifecycle rather than a rigid sequence. It begins with defining specific, measurable objectives and understanding how results will be deployed. Data collection and management then dominate the effort: identifying what data is available, assessing quality and sufficiency, exploring, cleaning, and guarding against biases (such as sampling only previously accepted cases). Modeling follows—spanning tasks like classification, scoring, ranking, clustering, relation finding, and characterization—with frequent feedback loops to earlier stages as assumptions, representations, and data needs are refined. A running example on loan risk illustrates how practical constraints, interpretability needs, and stakeholder input shape modeling choices.

Evaluation focuses on whether models meet agreed acceptance criteria and outperform obvious baselines, using measures such as precision, recall, false‑positive rates, and overall accuracy, and ensuring results make sense in context. Communication is tailored to audiences: business stakeholders care about impact on key metrics, end users need guidance on interpretation and use (including when to override), and operations needs clarity on deployment constraints and maintenance. The chapter closes by setting expectations: define lower bounds via a null model or existing process, consider base error rates and significance, and confirm that available data and resources can realistically achieve the desired performance before full execution and deployment.
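
The measures named above can be computed directly from the four cells of a confusion matrix. The following is a minimal sketch in Python rather than the chapter's own code; the function names and the ten loan labels are hypothetical and exist only for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # defaulted, flagged
    fp = sum(1 for a, p in pairs if not a and p)      # repaid, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # defaulted, missed
    tn = sum(1 for a, p in pairs if not a and not p)  # repaid, not flagged
    return tp, fp, fn, tn


def evaluation_metrics(actual, predicted):
    """Return accuracy, precision, recall, and false-positive rate."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels (not from the chapter): True means the loan defaulted.
actual    = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]

metrics = evaluation_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all four numbers together matters because a model can score well on overall accuracy while still missing most of the rare cases the sponsor cares about.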

Figures in this chapter

  • The lifecycle of a data science project: loops within loops
  • The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.
  • A decision tree model for finding bad loan applications. The outcome nodes show confidence scores.
  • Example slide from an executive presentation

Summary

The data science process involves a lot of back-and-forth—between the data scientist and other project stakeholders, and between the different stages of the process. Along the way, you’ll encounter surprises and stumbling blocks; this book will teach you procedures for overcoming some of these hurdles. It’s important to keep all the stakeholders informed and involved; when the project ends, no one connected with it should be surprised by the final results.

In the next chapters, we’ll look at the stages that follow project design: loading, exploring, and managing the data. Chapter 2 covers a few basic ways to load the data into R, in a format that’s convenient for analysis.

In this chapter you have learned

  • A successful data science project involves more than just statistics. It also requires a variety of roles to represent business and client interests, as well as operational concerns.
  • You should make sure you have a clear, verifiable, quantifiable goal.
  • Make sure you’ve set realistic expectations for all stakeholders.

FAQ

What does “data science” mean in this chapter?
It’s a cross‑disciplinary practice that applies methods from data engineering, statistics, data mining, machine learning, and predictive analytics to make data‑driven decisions for business and scientific problems.

Which roles are essential on a data science project?
Five recurring roles appear: sponsor (owns business outcome and success criteria), client (represents end users and domain knowledge), data scientist (leads analytic strategy and modeling), data architect (manages data and storage), and operations (deploys and runs the solution). Roles can overlap and some may sit outside the core team.

Why is the project sponsor so critical, and how should we work with them?
The sponsor defines success and provides sign‑off. Keep them informed with clear plans, progress, and results in business terms. Collaborate to turn needs into measurable goals and acceptance criteria so everyone knows when the project succeeds.

What are the main stages of a data science project?
Typical stages are goal definition; data collection and management; modeling; model evaluation and critique; presentation and documentation; and deployment and maintenance. The process is iterative, with feedback loops between stages and follow‑on projects after deployment.

How do we define a good project goal?
Make it specific, measurable, and tied to business impact, with clear acceptance and stopping conditions. Even exploratory work should be time‑boxed and aimed at generating hypotheses that can be turned into concrete goals.

What should we watch for during data collection and preparation?
Assess availability, relevance, quantity, and quality. Prefer directly measured variables over proxies. Be alert to sampling bias (for example, only having data on accepted cases) and revisit goals or features when biases or data gaps are discovered.

Which modeling tasks are common, and how do we choose methods?
Common tasks include classification, scoring (regression/probabilities), ranking, clustering, relation discovery, and characterization. Choose approaches that fit the goal, data, and stakeholder needs (for example, favor interpretable models when users require understandable decisions).

How should we evaluate models beyond overall accuracy?
Use confusion matrices and metrics such as precision, recall, and false positive rate, and compare performance to a null (baseline) model or current process. Check generalization and whether model behaviors make sense in the domain; iterate if criteria aren’t met.

How do we present results to different audiences?
For business sponsors, emphasize impact on key metrics and recommendations. For end users, show how to interpret outputs, decision traces, and confidence scores, and when overrides make sense. For operations, focus on deployment constraints, runtime, data flows, and maintenance needs.

What does deployment and maintenance involve, and how do we set realistic expectations?
Pilot, monitor, and update the model as conditions change; watch for systematic overrides or drift. Set expectations using the null model as a lower bound (base error rate) and require performance that’s meaningfully better, adjusting goals or data resources when needed.
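
The null-model lower bound from the last answer can be illustrated with a short sketch. It is written in Python rather than taken from the chapter, and it assumes the null model always predicts the most common outcome. The function names, the 0.05 improvement margin, and the sample labels are all hypothetical. The sketch compares raw accuracy only and does not test statistical significance.

```python
from collections import Counter


def null_model_accuracy(actual):
    """Accuracy of a null model that always predicts the most common outcome."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual outcomes."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def beats_baseline(actual, predicted, min_improvement=0.05):
    """True when the model beats the null model by at least min_improvement.

    min_improvement is an assumed margin; a real project would agree
    on it with the sponsor as part of the acceptance criteria.
    """
    gain = accuracy(actual, predicted) - null_model_accuracy(actual)
    return gain >= min_improvement


# Hypothetical outcomes (not from the chapter): 3 of 10 loans defaulted.
actual = ["default"] * 3 + ["repaid"] * 7
# A candidate model that misses one default and wrongly flags one repaid loan.
predicted = ["default", "default", "repaid", "default"] + ["repaid"] * 6

baseline = null_model_accuracy(actual)   # always predicts "repaid"
candidate = accuracy(actual, predicted)
print(f"null model accuracy: {baseline:.2f}")
print(f"candidate accuracy:  {candidate:.2f}")
print(f"meets the margin:    {beats_baseline(actual, predicted)}")
```

When most loans are repaid, the null model already looks accurate, so a candidate's accuracy only means something relative to that floor.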
