Overview

1 Understanding foundation models

Foundation models mark a shift from building task- or dataset-specific models toward training a single, very large model on vast and diverse data that can be adapted to many tasks. Such models are characterized by four pillars: massive and varied training data, large parameter counts, broad task applicability, and the ability to be fine-tuned. In time series, this means one model can forecast across frequencies and temporal structures (trends, seasonality, holidays) and often support related tasks like anomaly detection or classification. The chapter also clarifies the distinction between a model and an algorithm, motivates why reuse reduces development overhead, and sets expectations that fine-tuning can tailor a general model to a particular domain while acknowledging that no single approach is best for every scenario.

The chapter introduces the Transformer as the core architecture behind most foundation models and reframes it for time series forecasting. Inputs are tokenized and mapped through embeddings, then augmented with positional encoding to preserve temporal order. An encoder stack with multi-head self-attention learns rich dependencies across the sequence, while a decoder stack uses masked attention to prevent peeking into the future and cross-attention to leverage the encoder’s learned representation. Forecasts are produced autoregressively—each predicted step is fed back to inform subsequent steps—followed by a projection to output space. Understanding these components and their hyperparameters enables effective fine-tuning, capability assessment, and troubleshooting for different time series use cases.
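The autoregressive decoding loop described above can be sketched in a few lines. Here, `predict_next` is a hypothetical stand-in for the trained Transformer's one-step forward pass (a naive last-value forecaster keeps the example runnable); the point is that each prediction is appended to the context and fed back in for the next step.

```python
# Sketch of autoregressive forecasting over a horizon.
# `predict_next` is a hypothetical stand-in for a trained model's
# one-step forward pass; here it simply repeats the last value.

def predict_next(context):
    # Naive one-step forecaster used only to make the loop runnable.
    return context[-1]

def forecast(history, horizon):
    context = list(history)
    predictions = []
    for _ in range(horizon):
        y_hat = predict_next(context)  # predict one step ahead
        predictions.append(y_hat)
        context.append(y_hat)          # feed the prediction back as input
    return predictions

print(forecast([1.0, 2.0, 3.0], horizon=4))
```

A real foundation model replaces `predict_next` with the full embedding, attention, and projection pipeline, but the feedback structure of the loop is the same.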

Finally, the chapter weighs benefits and drawbacks of foundation forecasting models. Advantages include simple, out-of-the-box pipelines, usefulness with limited data, lower expertise requirements to get started, and broad reusability across tasks and datasets. Drawbacks include privacy and governance concerns (especially with hosted proprietary models), limited control over capabilities and horizons, potential mismatch to niche scenarios, and significant compute and storage demands. The book proceeds with hands-on exploration: building a small model to surface practical challenges; working with purpose-built time series models such as TimeGPT, Lag-LLaMA, Chronos, Moirai, and TimesFM; adapting language models (e.g., PromptCast, Time-LLM) for forecasting; and evaluating these approaches on real tasks like weekly sales forecasting and anomaly detection, culminating in a capstone comparison against statistical methods.

Result of performing linear regression on two different datasets. While the algorithm used to build the linear model stays the same, the resulting models differ substantially depending on the dataset used.
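The algorithm-versus-model distinction in the caption above can be made concrete: running the same ordinary-least-squares procedure on two small made-up datasets yields two different fitted models (different slope and intercept).

```python
# Same algorithm (ordinary least squares for y = a*x + b), two datasets,
# two different models: the learned coefficients depend on the data.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

model_1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data following y = 2x + 1
model_2 = fit_line([0, 1, 2, 3], [5, 4, 3, 2])  # data following y = 5 - x
print(model_1, model_2)
```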
Simplified Transformer architecture from a time series perspective. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before going into the decoder. The decoder then emits one output value at a time until the entire horizon is predicted.
Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is an abstract representation of the input produced by the model.
Visualizing positional encoding. Note that the positional encoding matrix must be the same size as the embedding. Also note that sine is used in even positions, while cosine is used in odd positions. The length of the input sequence runs vertically in this figure.
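A minimal sketch of the standard sinusoidal positional encoding matches the caption above: sine in even columns, cosine in odd columns, with one row per time step so the matrix has the same shape as the embedding.

```python
import numpy as np

# Standard sinusoidal positional encoding: sine in even columns,
# cosine in odd columns, one row per time step (same shape as the embedding).

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # time-step index (vertical axis)
    i = np.arange(d_model)[None, :]     # embedding-dimension index
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even positions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd positions: cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8): same size as a 6-step, 8-dimensional embedding
```

The encoding is simply added element-wise to the embedding, which is why the two matrices must share a shape.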
We can see that the encoder is actually a stack of encoders that all share the same architecture. Each encoder is made of a self-attention mechanism and a feed-forward layer.
Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same embedding. In this case, the model assigns more importance (depicted by thicker connecting lines) to closer data points than those farther away.
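The importance weights in the caption above come from scaled dot-product attention. This is a minimal single-head sketch, assuming Q, K, and V are all the raw embeddings themselves (in a real Transformer they come from learned projection matrices); each row of the resulting weight matrix sums to 1 and says how much each token attends to every other token.

```python
import numpy as np

# Minimal single-head scaled dot-product self-attention.
# In a real model, Q, K, V come from learned projections of the embeddings;
# here the embeddings are used directly to keep the sketch short.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # pairwise token similarities
    weights = softmax(scores, axis=-1)   # importance of each token to each token
    return weights @ v, weights          # weighted mix of values + attention map

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))              # 5 tokens, 4-dimensional embedding
out, weights = self_attention(x, x, x)   # self-attention: Q = K = V = x
print(weights.shape)                     # (5, 5); each row sums to 1
```

Multi-head attention runs several such computations in parallel on different learned projections and concatenates the results.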
Visualizing the decoder. Like the encoder, the decoder is actually a stack of many decoders. Each is composed of a masked multi-headed attention layer, followed by a normalization layer, a multi-headed attention layer, another normalization layer, a feed forward layer, and a final normalization layer. The normalization layers are there to keep the model stable during training.
Visualizing the decoder in detail. We see that the output of the encoder is fed to the second attention layer inside the decoder. This is how the decoder can generate predictions using information learned by the encoder.
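The "masked" part of the decoder's first attention layer can be sketched as a causal (look-ahead) mask: scores above the diagonal are set to negative infinity, so after the softmax each position can only attend to itself and earlier positions, preventing future values from leaking into the forecast.

```python
import numpy as np

# Causal ("look-ahead") mask for the decoder's masked attention:
# position t may attend only to positions <= t. The -inf entries
# above the diagonal become zero weights after the softmax.

def causal_mask(seq_len):
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

mask = causal_mask(4)
print(mask)  # zeros on and below the diagonal, -inf above it
```

The mask is added to the attention scores before the softmax, which is why blocked positions end up with exactly zero attention weight.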

Summary

  • A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
  • Derivatives of the Transformer architecture power most foundation models.
  • Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the possibility to forecast even when few data points are available.
  • Drawbacks of using foundation models include privacy concerns, limited control over the model’s capabilities, and the possibility that it is not the best solution to our problem.
  • Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What is a foundation model in the context of time series forecasting?
A foundation model is a very large machine learning model trained on massive, diverse datasets so it can generalize to many tasks. In time series, a single foundation model can forecast across different frequencies and patterns, and it can often be adapted to a specific use case through fine-tuning.

How do foundation models differ from traditional task-specific models?
Traditional approaches train a separate model for each scenario or dataset. Foundation models remove that need: the same pre-trained model can be reused across many time series with varying trends and seasonality, and even for related tasks like anomaly detection or classification.

What is the difference between an algorithm and a model?
An algorithm is the procedure or recipe for learning (for example, linear regression). A model is the outcome of applying that algorithm to data: the learned parameters that make predictions. The same algorithm trained on different data produces different models.

Why are foundation models so large?
To learn patterns that transfer across many scenarios, they are trained on extremely large datasets. Capturing this breadth requires many parameters (often millions or billions), which must be stored as weights, making the models big both in memory and on disk.

Can foundation models be fine-tuned, and what does that involve?
Yes. Fine-tuning adapts a pre-trained foundation model to your data by updating a subset of its parameters rather than training from scratch. This typically improves accuracy for your specific series or domain while keeping training time manageable.

How does the Transformer architecture apply to time series forecasting?
Inputs are tokenized and passed through an embedding layer, then augmented with positional encoding to preserve order. The encoder uses self-attention to learn dependencies; the decoder uses masked attention (to avoid future leakage) plus the encoder’s output to generate forecasts step by step, projecting its internal representation back to actual values.

What are embeddings and positional encoding, and why are they important?
Embeddings are learned vector representations of the input tokens that capture salient features. Positional encoding injects information about the order of time steps, commonly via sine and cosine patterns, so the model knows where each value occurs in the sequence.

What role does attention (self-attention and multi-head attention) play?
Self-attention lets the model weigh how much each past time step matters for predicting the next one. Multi-head attention runs this in parallel so different heads can specialize (for example, one focusing on trend, another on seasonality), yielding a richer representation.

What are the main advantages and drawbacks of using foundation models for forecasting?
Advantages: quick, out-of-the-box pipelines; lower expertise required; usefulness with limited data; and reusability across tasks. Drawbacks: potential privacy concerns (especially with proprietary services), limited control over capabilities and horizons, possible suboptimal performance for niche cases, and high compute and storage needs, though API access can reduce the infrastructure burden.

When should I choose a foundation model over a custom or statistical model?
Run experiments and weigh accuracy gains against costs and constraints. Use a foundation model when it outperforms alternatives and the ROI justifies the privacy, capability, and resource trade-offs. Prefer a custom approach when a tailored model performs better, supports required features (for example, exogenous variables or multivariate links), and is cheaper to deploy.
