1 Introduction to Bayesian statistics: Representing our knowledge and uncertainty with probabilities
Bayesian statistics offers a practical language for reasoning under uncertainty by representing what we know—and don’t know—with probabilities. Instead of giving single, definitive answers, it models unknowns as random variables and expresses beliefs as full probability distributions that quantify confidence. Through intuitive examples like weather forecasting, the text shows how probabilistic outputs empower better decisions at different levels of granularity and how expectations, uncertainty, and risk tolerance shape actions in real-world contexts such as diagnostics, recommendations, and AI.
At the core of the Bayesian approach is the cycle of belief formation and refinement: a prior distribution encodes initial knowledge, data provide evidence, and the posterior distribution updates beliefs conditionally on that evidence. The chapter illustrates this with simple Bernoulli and categorical models, highlighting how probability axioms, expected values, and conditional probabilities translate beliefs into measurable statements. It contrasts this belief-centric view with frequentism, where probabilities arise from long-run frequencies under repeatable trials; it also discusses subjectivity versus objectivity, the role of priors, and why the approaches often converge with abundant data while differing in interpretation and workflow.
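To make the prior → data → posterior cycle concrete, here is a minimal Python sketch (assuming SciPy is available) of a Bernoulli model for "will it rain?" with a conjugate Beta prior; the prior parameters and observed counts are invented for illustration, not taken from the chapter.

```python
# A minimal sketch of the prior -> data -> posterior cycle for a Bernoulli
# unknown ("will it rain?"), using a conjugate Beta prior so the update has
# a closed form. The prior parameters and counts below are illustrative only.
from scipy import stats

# Prior belief about the rain probability theta: Beta(2, 2) leans gently
# toward 0.5, i.e., rain and no rain initially seem about equally likely.
prior_a, prior_b = 2, 2

# Data: suppose we observed rain on 3 of the last 10 days.
rainy, total = 3, 10

# Posterior: with a Beta prior and Bernoulli data, the update is simply
# Beta(prior_a + successes, prior_b + failures).
posterior = stats.beta(prior_a + rainy, prior_b + (total - rainy))

print(f"Posterior mean of theta: {posterior.mean():.3f}")   # about 0.357
print(f"95% credible interval: {posterior.interval(0.95)}")
```

The posterior mean sits between the prior guess (0.5) and the observed frequency (0.3), which is exactly the pull between prior knowledge and evidence that the chapter describes.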
The chapter closes by connecting Bayesian thinking to modern AI, especially large language models that score and select likely next words using conditional probabilities. While LLMs aren’t fully Bayesian due to computational limits, their training and refinement (including user feedback) reflect Bayesian ideas about uncertainty, multiple plausible outcomes, and updating. Readers are positioned to see when Bayesian methods shine—limited data, need for prior knowledge, and decision analysis—versus when frequentist tools are convenient, and are primed for the book’s journey from foundational concepts to scalable inference, specialized models, and principled decision-making under uncertainty.
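As a loose illustration of the next-word idea, the toy sketch below treats next-word prediction as drawing from a conditional distribution over candidate words. The context, vocabulary, and probabilities are invented; a real LLM computes such scores with a neural network over a vast vocabulary.

```python
# A toy illustration of next-word prediction as a conditional distribution
# P(next word | context). The context, vocabulary, and probabilities are
# invented; a real LLM scores a huge vocabulary with a neural network.
import random

context = "the weather today is"
next_word_probs = {"sunny": 0.45, "rainy": 0.30, "cloudy": 0.20, "purple": 0.05}

# Greedy decoding: pick the single most likely continuation.
greedy = max(next_word_probs, key=next_word_probs.get)

# Sampling: draw a word in proportion to its probability, so less likely
# but still plausible continuations occasionally appear.
words, probs = zip(*next_word_probs.items())
sampled = random.choices(words, weights=probs, k=1)[0]

print(f"{context} -> greedy: {greedy}; sampled: {sampled}")
```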
Figure: An illustration of machine learning models without probabilistic reasoning capabilities being susceptible to noise and overconfidently making the wrong predictions.
Figure: An example categorical distribution for rainfall rate.
Summary
We need probability to model phenomena in the real world whose outcomes we haven’t observed.
With Bayesian probability, we use probability to represent our personal belief about an unknown quantity, which we model using a random variable.
From a Bayesian belief about a quantity of interest, we can compute summaries, such as expected values, that represent our knowledge of and uncertainty about it (see the sketch after this summary).
There are three main components to a Bayesian model: the prior distribution, the data, and the posterior distribution. The posterior results from combining the prior with the data and is what we ultimately want from a Bayesian model.
Bayesian probability is useful when we want to incorporate prior knowledge into a model, when data is limited, and for decision-making under uncertainty.
A different interpretation of probability, frequentism, views probability as the frequency of an event under infinitely many repetitions, which limits its use for one-off events that cannot be meaningfully repeated.
Large language models, which power popular chat artificial intelligence applications, use conditional probabilities to predict the next word in a sentence, echoing Bayesian ideas even though they are not fully Bayesian.
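Here is the sketch referred to in the summary: a small, self-contained example of computing an expected value and an event probability from a categorical belief over rainfall amounts. The bins and probabilities are made-up illustrative values, not figures from the chapter.

```python
# Computing summaries from a belief expressed as a categorical distribution
# over rainfall amounts (in mm). The bins and probabilities are illustrative.
rainfall_mm = [0, 1, 5, 10]          # possible rainfall amounts
probs       = [0.5, 0.2, 0.2, 0.1]   # our belief in each amount

assert abs(sum(probs) - 1.0) < 1e-9  # probabilities must sum to 1

# Expected value: a probability-weighted average, so likelier outcomes
# contribute more than unlikely ones.
expected_rainfall = sum(x * p for x, p in zip(rainfall_mm, probs))

# An uncertainty-aware statement: the probability of seeing any rain at all.
prob_rain = sum(p for x, p in zip(rainfall_mm, probs) if x > 0)

print(f"Expected rainfall: {expected_rainfall:.1f} mm")  # 2.2 mm
print(f"P(any rain): {prob_rain:.1f}")                   # 0.5
```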
FAQ
Why do we need probability instead of simple yes/no predictions?
Because real-world predictions are uncertain. Probabilities communicate how likely outcomes are, let users make decisions that match their risk tolerance, and avoid throwing away useful information that a hard yes/no answer hides.

What is a random variable, and how does it apply to weather forecasting?
A random variable assigns numbers to uncertain outcomes. For weather: a binary variable (rain = 1, no rain = 0) models "Will it rain?", while a categorical or continuous variable can model "How much will it rain?" by assigning probabilities to amounts.

What are the trade-offs in choosing binary, categorical, or continuous models?
Granularity versus practicality. Binary is simplest but least informative; continuous is most detailed but can be complex; categorical (discrete bins) is a pragmatic middle ground that keeps enough detail to be useful and trustworthy without excessive complexity.

What is a probability distribution, and what do its parameters mean?
A probability distribution quantifies how likely different values of a random variable are. It is governed by parameters that shape those likelihoods (e.g., the Bernoulli parameter p is the chance of "success"). All probabilities are nonnegative and sum to 1.

What is expected value, and why is it a weighted average?
The expected value summarizes the central tendency by averaging possible outcomes, weighted by their probabilities. More likely outcomes contribute more than unlikely ones, unlike in a simple (unweighted) average.

How does Bayesian updating work (prior → data → posterior)?
Start with a prior belief about an unknown, observe data, and combine them to produce a posterior: your updated belief. This is expressed as a conditional probability (the probability of the unknown given the data).

How do Bayesian and frequentist viewpoints differ?
Bayesian: probability represents degrees of belief about unknowns and updates with data. Frequentist: probability is the long-run frequency of outcomes under repeated trials; the parameter is fixed, and randomness comes from data collection.

Is Bayesian "subjectivity" a problem?
Different sensible priors can yield different posteriors from the same data, which some call subjective. This can be a feature: priors encode domain knowledge, improve learning with limited data, and make assumptions explicit and auditable.

Are neural network confidence scores true probabilities?
Not necessarily. Classifier outputs are often normalized (e.g., via softmax) but can be miscalibrated: a high score does not guarantee the outcome is actually that likely. This can lead to overconfident wrong predictions if calibration or uncertainty modeling is missing.

How do large language models relate to Bayesian ideas?
LLMs perform next-word prediction using conditional probabilities based on context and training data. They are Bayesian in spirit but not fully Bayesian due to computational limits; they approximate by focusing on likely candidates and using multiple samples to support learning from feedback.
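To illustrate the calibration point above, the sketch below shows that softmax turns arbitrary scores into values that are nonnegative and sum to 1, yet scaling the scores up makes the output look far more "confident" without any new evidence. The logits are invented for a hypothetical classifier.

```python
# Softmax outputs sum to 1 but are not guaranteed to be calibrated
# probabilities: scaling the raw scores changes the stated confidence
# without any new information. The logits below are invented.
import math

def softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # raw scores from some hypothetical classifier
scaled = [4.0, 2.0, 1.0]   # same ranking, more extreme scores

print(softmax(logits))  # ~[0.63, 0.23, 0.14] -> sums to 1
print(softmax(scaled))  # ~[0.84, 0.11, 0.04] -> much more "confident"
```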