Table of contents

Part 1 The rise of foundation machine learning models

1 Understanding foundation models

1.1 Defining a foundation model

1.2 Exploring the transformer architecture

1.2.1 Feeding the encoder

1.2.2 Inside the encoder

1.2.3 Making predictions

1.3 Advantages and disadvantages of foundation models

1.3.1 Benefits of foundation forecasting models

1.3.2 Drawbacks of foundation forecasting models

1.4 Next steps

2 Building a foundation model

2.1 Exploring the architecture of N-BEATS

2.1.1 Basis expansion

2.2 Architecture of N-BEATS

2.2.1 A block in N-BEATS

2.2.2 A stack in N-BEATS

2.2.3 Assembling N-BEATS

2.3 Pretraining our model

2.4 Pretraining N-BEATS

2.5 Transfer learning with our pretrained model

2.6 Fine-tuning our pretrained model

2.7 Evaluating each approach

2.8 Forecasting at another frequency

2.9 Understanding the challenges of building a foundation model

2.10 Next steps

Part 2 Foundation models developed for forecasting

3 Forecasting with TimeGPT

3.1 Defining generative pretrained transformers

3.2 Exploring TimeGPT

3.2.1 Training TimeGPT

3.2.2 Quantifying uncertainty in TimeGPT

3.3 Forecasting with TimeGPT

3.3.1 Initial setup

3.3.2 Zero-shot forecasting

3.3.3 Performance evaluation

3.4 Fine-tuning with TimeGPT

3.4.1 Fine-tuning TimeGPT

3.4.2 Evaluating the fine-tuned model

3.4.3 Controlling the depth of fine-tuning

3.5 Forecasting with exogenous variables

3.5.1 Preparing the exogenous features

3.5.2 Forecasting with exogenous variables

3.5.3 Explaining the effect of exogenous features with Shapley values

3.5.4 Evaluating forecasts with exogenous features

3.6 Cross-validating with TimeGPT

3.7 Forecasting on a long horizon with TimeGPT

3.8 Detecting anomalies with TimeGPT

3.8.1 Detecting anomalies

3.8.2 Evaluating anomaly detection

3.9 Next steps

4 Zero-shot probabilistic forecasting with Lag-Llama

4.1 Exploring Lag-Llama

4.1.1 Viewing the architecture of Lag-Llama

4.1.2 Pretraining Lag-Llama

4.2 Forecasting with Lag-Llama

4.2.1 Setting up Lag-Llama

4.2.2 Zero-shot forecasting with Lag-Llama

4.2.3 Changing the context length in Lag-Llama

4.3 Fine-tuning Lag-Llama

4.3.1 Handling initial setup

4.3.2 Reading and splitting the data in Colab

4.3.3 Launching the fine-tuning procedure

4.3.4 Forecasting with a fine-tuned model

4.3.5 Evaluating the fine-tuned model

4.4 Model comparison table

4.5 Next steps

5 Learning the language of time with Chronos

5.1 Discovering the T5 family

5.2 Exploring Chronos

5.3 Using tokenization in Chronos

5.4 Training a model with Chronos

5.4.1 Tackling data scarcity with augmentation techniques

5.4.2 Examining the pretrained Chronos models

5.4.3 Selecting the appropriate Chronos model

5.5 Forecasting with Chronos

5.5.1 Initial setup

5.5.2 Predictions

5.6 Cross-validating with Chronos

5.6.1 Running cross-validation

5.6.2 Evaluating Chronos

5.7 Fine-tuning Chronos

5.7.1 Performing initial setup

5.7.2 Configuring the fine-tuning parameters

5.7.3 Launching the fine-tuning procedure

5.7.4 Forecasting with a fine-tuned model

5.7.5 Evaluating the fine-tuned model

5.8 Detecting anomalies with Chronos

5.9 Next steps

6 Moirai: A universal forecasting transformer

6.1 Exploring Moirai

6.1.1 Viewing the architecture of Moirai

6.1.2 Pretraining Moirai

6.1.3 Selecting the appropriate model

6.2 Discovering Moirai-MoE

6.2.1 Patching and embedding

6.2.2 Studying the decoder-only transformer

6.2.3 Pretraining Moirai-MoE

6.3 Forecasting with Moirai

6.3.1 Zero-shot forecasting with Moirai

6.3.2 Cross-validation with Moirai

6.3.3 Forecasting with exogenous features

6.4 Detecting anomalies with Moirai

6.5 Next steps

7 Deterministic forecasting with TimesFM

7.1 Examining TimesFM

7.1.1 Architecture of TimesFM

7.1.2 Pretraining TimesFM

7.2 Forecasting with TimesFM

7.2.1 Zero-shot forecasting with TimesFM

7.2.2 Cross-validation with TimesFM

7.2.3 Forecasting with exogenous features

7.3 Fine-tuning TimesFM and anomaly detection

7.4 Next steps

Part 3 Using LLMs for time-series forecasting

8 Forecasting as a language task

8.1 Overview of LLMs and prompting techniques

8.1.1 Exploring Flan-T5 and Llama-3.2

8.1.2 Understanding the basics of prompting

8.2 Forecasting with Flan-T5

8.2.1 Function to forecast with Flan-T5

8.2.2 Forecast with Flan-T5

8.3 Cross-validation with Flan-T5

8.3.1 Running cross-validation

8.3.2 Evaluating Flan-T5

8.4 Forecasting with exogenous features with Flan-T5

8.4.1 Including exogenous features with Flan-T5

8.4.2 Extracting future values of exogenous variables

8.4.3 Cross-validating with external features

8.4.4 Evaluating Flan-T5 forecasts with exogenous features

8.5 Detecting anomalies with Flan-T5

8.5.1 Defining a function for anomaly detection with Flan-T5

8.5.2 Running anomaly detection

8.6 Forecasting with Llama-3.2

8.6.1 Performing initial setup

8.6.2 Creating a function to forecast via API call

8.6.3 Making predictions

8.7 Cross-validating with Llama-3.2

8.8 Detecting anomalies with Llama-3.2

8.8.1 Modifying the system prompt

8.8.2 Defining a function for anomaly detection

8.8.3 Running anomaly detection with Llama-3.2

8.8.4 Evaluating anomaly detection

8.9 Next steps

9 Reprogramming an LLM for forecasting

9.1 Discovering Time-LLM

9.1.1 Patch reprogramming

9.1.2 Discovering Prompt-as-Prefix

9.1.3 Making predictions

9.2 Forecasting with Time-LLM

9.2.1 Performing initial setup

9.2.2 Generating forecasts

9.3 Cross-validating with Time-LLM

9.4 Evaluating Time-LLM

9.5 Detecting anomalies with Time-LLM

9.5.1 Detecting anomalies

9.5.2 Evaluating anomaly detection

9.6 Next steps

Part 4 Capstone project

10 Capstone project: Forecasting daily visits to a blog

10.1 Introducing the use case

10.2 Walking through the project

10.2.1 Setting the constants

10.2.2 Forecasting with a seasonal naïve model

10.2.3 Forecasting with ARIMA

10.2.4 Forecasting with TimeGPT

10.2.5 Forecasting with Chronos

10.2.6 Forecasting with Moirai

10.2.7 Forecasting with TimesFM

10.2.8 Forecasting with Time-LLM

10.2.9 Evaluating all models

10.3 Staying up to date

Appendixes

Appendix A: references

Overview

1 Understanding foundation models

Foundation models mark a shift from building task-specific predictors to training a single, very large model on vast and diverse data so it can generalize across many tasks. They are characterized by four pillars: massive and heterogeneous training data, large parameter counts, broad task applicability, and the ability to be adapted via fine-tuning. In time series, this means one model can forecast across frequencies and temporal patterns (trends, seasonality, holiday effects) and often perform related tasks such as anomaly detection or classification. Practitioners can use them in zero-shot mode for quick, out-of-the-box forecasts, or fine-tune selected parameters to align the model to a specific domain and improve accuracy.

Most foundation models build on the Transformer architecture, whose mechanics are adapted for sequences like time series. Raw signals are transformed into embeddings and augmented with positional encoding to preserve ordering, then passed through an encoder that uses multi-head self-attention to learn dependencies over time. A decoder with masked attention generates forecasts autoregressively, leveraging both prior outputs and the encoder’s representation, before a final projection yields predictions for the full horizon. Understanding this pipeline helps in selecting and tuning hyperparameters, deciding when to fine-tune, and diagnosing limitations (for example, whether a given variant supports exogenous covariates or multivariate targets).

These models simplify forecasting pipelines, lower the expertise barrier, and can perform well even with limited local data, while being reusable across datasets and tasks. Trade-offs include data privacy considerations, limited control over a model’s built-in capabilities and horizons, potential underperformance versus specialized approaches in some settings, and significant compute and storage needs. The chapter emphasizes careful evaluation to decide when foundation models are the best fit and previews the book’s hands-on journey: defining model boundaries, experimenting with multiple time-series foundation models, applying zero-shot and fine-tuning workflows to real datasets (including sales forecasting and anomaly detection), and ultimately benchmarking them against traditional statistical baselines.

Result of performing linear regression on two different datasets. The algorithm used to build the linear model stays the same, but the resulting model differs markedly depending on the dataset used.
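To make the algorithm-versus-model distinction concrete, here is a minimal sketch in plain Python. The function `fit_line` and both datasets are illustrative assumptions, not taken from the book: the same least-squares procedure is applied twice and yields two different models (slope and intercept pairs).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (the algorithm)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Two made-up datasets sharing the same inputs (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0]
rising = [1.0, 3.0, 5.0, 7.0]     # generated from y = 2x + 1
falling = [10.0, 9.0, 8.0, 7.0]   # generated from y = -x + 10

# Same algorithm, different data, different models.
model_a = fit_line(xs, rising)
model_b = fit_line(xs, falling)
print(model_a)  # (2.0, 1.0)
print(model_b)  # (-1.0, 10.0)
```

The fitted parameters are the model; `fit_line` itself never changes.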

Simplified Transformer architecture from the perspective of time series. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before going into the encoder. The decoder then produces the output one value at a time until the entire horizon is predicted.

Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is the abstract representation of the input made by the model.
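The tokenize-then-embed step can be sketched as follows. This is one possible scheme under stated assumptions: tokens are non-overlapping windows of the series, and the embedding is a linear projection whose weights are randomly initialized here, whereas a real model would learn them during training. The names `make_patches`, `init_projection`, and `embed` are hypothetical.

```python
import random


def make_patches(series, patch_len):
    """Tokenize: split the series into non-overlapping windows.

    Trailing values that do not fill a whole window are dropped.
    """
    usable = len(series) - len(series) % patch_len
    return [series[i:i + patch_len] for i in range(0, usable, patch_len)]


def init_projection(patch_len, d_model, seed=0):
    """Random stand-in for a learned (patch_len x d_model) weight matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
            for _ in range(patch_len)]


def embed(patches, weights):
    """Project each patch to a dense vector of size d_model."""
    d_model = len(weights[0])
    return [
        [sum(patch[i] * weights[i][j] for i in range(len(patch)))
         for j in range(d_model)]
        for patch in patches
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
patches = make_patches(series, patch_len=3)
embeddings = embed(patches, init_projection(patch_len=3, d_model=4))
print(patches)                             # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(len(embeddings), len(embeddings[0]))  # 2 4
```

Each token ends up as a vector of `d_model` numbers, which is the abstract representation the caption refers to.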

Visualizing positional encoding. Note that the positional encoding matrix must be the same size as the embedding. Also note that sine is used at even positions, while cosine is used at odd positions. In this figure, the length of the input sequence runs along the vertical axis.
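A minimal sketch of sinusoidal positional encoding, assuming the formulation from the original Transformer paper (sine at even embedding indices, cosine at odd ones, with a base of 10000). The function names are illustrative. The result has one row per position and one column per embedding dimension, so it can be added element-wise to the embeddings.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model matrix of sinusoidal position signals."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model):
            # Each (sine, cosine) pair of columns shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe


def add_positions(embeddings, pe):
    """Element-wise sum; both matrices must have the same shape."""
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]


pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Position 0 yields sine of 0 and cosine of 0, which is why the first row alternates between 0.0 and 1.0.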

The encoder is actually a stack of many encoder layers that all share the same architecture. Each layer is made of a self-attention mechanism and a feed-forward layer.

Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same embedding. In this case, the model assigns more importance (depicted by thicker connecting lines) to closer data points than those farther away.
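The weighting idea in this caption can be sketched with scaled dot-product attention. This is a simplified single-head version under a loud assumption: the token vectors are used directly as queries, keys, and values, whereas a real Transformer first applies learned projections to produce each of them. The token values are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of the current token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        # Output is the weighted average of all token vectors.
        outputs.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                        for j in range(d)])
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # illustrative embeddings
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` corresponds to the thickness of the connecting lines in the figure: larger values mean the model gives that token more importance.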

Visualizing the decoder. Like the encoder, the decoder is actually a stack of many decoders. Each is composed of a masked multi-headed attention layer, followed by a normalization layer, a multi-headed attention layer, another normalization layer, a feed forward layer, and a final normalization layer. The normalization layers are there to keep the model stable during training.

Visualizing the decoder in detail. We see that the output of the encoder is fed to the second attention layer inside the decoder. This is how the decoder can generate predictions using information learned by the encoder.

Summary

A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
Derivatives of the Transformer architecture power most foundation models.
Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the ability to forecast even when few data points are available.
Drawbacks of using foundation models include privacy concerns and a lack of control over the model’s capabilities. A foundation model also might not be the best solution to a given problem.
Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What is a foundation model?

A foundation model is a large machine learning model trained on very large and diverse datasets so it can be reused across many tasks. It typically has millions or billions of parameters, supports multiple downstream tasks, and can be adapted (fine-tuned) to specific use cases.

How is a model different from an algorithm?

An algorithm is the procedure or recipe for solving a problem, while a model is the outcome of applying that algorithm to data. Using the same algorithm on different datasets produces different models—like using the same recipe with different ingredients yields different cakes.

What are the key characteristics of a foundation model?

Trained on very large, diverse datasets
Contains a large number of parameters
Applicable to multiple tasks
Adaptable via fine-tuning

How do foundation models apply to time series forecasting?

A single foundation model can forecast series with different frequencies and temporal patterns (trends, seasonality, holidays) and may also handle related tasks like anomaly detection and classification. It can produce useful forecasts even without task-specific training and can be fine-tuned for better performance.

What is the Transformer architecture and why is it used here?

The Transformer is a deep learning architecture built around attention mechanisms that capture complex dependencies efficiently. For time series, it uses an encoder to learn relationships in historical data and a decoder to generate forecasts, making it a common backbone for foundation forecasting models.

What are embeddings and positional encoding in time series Transformers?

Embeddings convert raw time series tokens (values or windows) into dense vectors the model can work with. Positional encoding (often sinusoidal) is added so the model knows the order of observations, which is critical because attention by itself does not keep track of the sequence order.

How does self-attention help with time series?

Self-attention learns how much each past token should influence the current step, capturing dependencies across time. Multi-headed attention lets the model focus on different patterns simultaneously (e.g., trends in one head, seasonality in another).

How does the decoder generate forecasts, and what does masking do?

The decoder uses masked multi-head attention so it cannot access future information, generating one step at a time and feeding each prediction back to produce the full horizon. It also attends to the encoder’s output to leverage the learned representation of the history.
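The two ideas in this answer, masking and step-by-step generation, can be sketched as follows. The function `predict_next` is a hypothetical stand-in for a trained decoder (here, just the mean of the last three values); only the causal mask and the feedback loop are the point of the example.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Position i only sees itself and earlier positions (j <= i),
    so no future information leaks into the prediction.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def predict_next(context):
    """Hypothetical one-step predictor: mean of the last 3 values."""
    window = context[-3:]
    return sum(window) / len(window)


def autoregressive_forecast(history, horizon):
    """Generate one value at a time, feeding each prediction back in."""
    context = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(context)
        forecasts.append(next_value)
        context.append(next_value)  # the prediction becomes new input
    return forecasts


print(causal_mask(3))
# [[True, False, False], [True, True, False], [True, True, True]]
forecasts = autoregressive_forecast([1.0, 2.0, 3.0], horizon=2)
print(forecasts[0])  # 2.0
```

Because each prediction is appended to the context, errors can compound over long horizons, which is one reason horizon limits matter when choosing a model.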

What are the advantages of using foundation models for forecasting?

Out-of-the-box pipelines with minimal setup
Lower expertise barrier to get started
Works even with few data points
Reusable across tasks and datasets

What are the limitations and when might a custom model be better?

Privacy concerns with hosted/proprietary models
Limited control over capabilities (e.g., horizon limits, lack of exogenous or multivariate support)
May not be optimal for a specific niche use case
High storage and compute requirements (though APIs can reduce overhead)

Evaluate accuracy, costs, and constraints; if a tailored model outperforms and is cheaper to operate, prefer a custom solution.
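One way to ground the accuracy part of that evaluation is to compare a candidate model's error against a simple baseline on held-out data. The sketch below assumes mean absolute error as the metric and a seasonal naive baseline; the data and the `candidate` forecasts are made-up placeholders standing in for the output of whichever model is being assessed.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def seasonal_naive(history, horizon, season_length):
    """Baseline: repeat the last observed season over the horizon."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


# Illustrative data only: two seasons of history, one held-out season.
history = [10.0, 20.0, 30.0, 12.0, 22.0, 32.0]
actual = [13.0, 21.0, 35.0]
candidate = [13.0, 22.0, 34.0]  # placeholder forecasts from some model

baseline = seasonal_naive(history, horizon=3, season_length=3)
baseline_mae = mae(actual, baseline)
candidate_mae = mae(actual, candidate)

print(baseline)                      # [12.0, 22.0, 32.0]
print(candidate_mae < baseline_mae)  # True
```

A model that cannot beat this kind of baseline on held-out data is unlikely to justify its compute and operating costs.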
