Overview

1 Understanding foundation models

Foundation models mark a shift from building many task-specific models to reusing a single, large model across diverse problems. Distinguished from the algorithms that create them, these models are trained on massive, varied datasets, contain large numbers of parameters, are adaptable to multiple tasks, and can be fine-tuned for particular scenarios. In time series forecasting, one model can handle different frequencies, trends, and seasonalities and even support related tasks such as anomaly detection and classification, streamlining workflows and enabling zero-shot predictions that work reasonably well before any specialization.
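
As a taste of what zero-shot forecasting looks like in practice, here is a minimal sketch using the open-source chronos-forecasting package and the amazon/chronos-t5-small checkpoint (both are assumptions chosen for illustration; this chapter does not prescribe a specific model). The pretrained model produces sample forecast paths for a series it has never seen, with no training step.

```python
import torch
from chronos import ChronosPipeline  # assumption: the chronos-forecasting package is installed

# Load a small pretrained checkpoint (downloaded on first use).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Any univariate history works; the model is not trained or fine-tuned on it.
history = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0, 136.0, 119.0])
samples = pipeline.predict(history, prediction_length=6)  # shape: (series, samples, horizon)
forecast = samples.median(dim=1).values                   # point forecast from the sample paths
print(forecast)
```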

The backbone of most foundation models is the Transformer architecture. For time series, raw values are tokenized and passed through an embedding layer, then enriched with positional encoding to preserve temporal order. The encoder’s self-attention (often with multiple heads) learns dependencies such as trends and seasonal patterns, producing a deep representation of the series. The decoder, using masked attention to avoid peeking into the future and attending to encoder outputs, generates forecasts one step at a time, projecting internal representations back to the target space. Variants may use the full encoder–decoder, decoder-only setups, or time-series–specific tweaks.
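
To make that pipeline concrete, here is a heavily simplified PyTorch sketch of an encoder–decoder Transformer for a univariate series. Everything in it is an assumption made for illustration: real foundation models are far larger and use their own tokenization and positional encoding, while this toy uses a linear layer as the "embedding" and a learned positional encoding for brevity.

```python
import torch
import torch.nn as nn

class TinyTimeSeriesTransformer(nn.Module):
    """Toy encoder-decoder Transformer for univariate forecasting (illustration only)."""

    def __init__(self, d_model: int = 64, nhead: int = 4, num_layers: int = 2, max_len: int = 512):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                      # each raw value -> an embedding vector
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # learned positional encoding (sinusoidal is also common)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, 1)                       # project back to the target space

    def forward(self, context: torch.Tensor, decoder_input: torch.Tensor) -> torch.Tensor:
        # context: (batch, context_len, 1); decoder_input: (batch, steps_so_far, 1)
        src = self.embed(context) + self.pos[: context.size(1)]
        tgt = self.embed(decoder_input) + self.pos[: decoder_input.size(1)]
        # Causal mask: the decoder may not peek at future positions.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(decoder_input.size(1))
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.head(out)                                   # (batch, steps_so_far, 1)
```

A rough autoregressive inference loop for this toy class appears later, after the FAQ entry on how the Transformer processes time series.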

These models offer practical advantages: out-of-the-box pipelines, lower expertise barriers, strong performance with limited data, and broad reusability across datasets and tasks. Trade-offs include privacy considerations, constrained control over capabilities (e.g., horizon limits, multivariate or exogenous handling, fine-tuning support), potentially suboptimal fit for niche use cases, and significant compute and storage requirements. Effective adoption demands careful evaluation of accuracy, cost, and risk. Subsequent chapters apply and compare purpose-built time-series foundation models and LLM-based adaptations, guiding model selection and culminating in a hands-on capstone assessment against statistical baselines.

Result of performing linear regression on two different datasets. While the algorithm to build the linear model stays the same, the resulting model is very different depending on the dataset used.
Simplified Transformer architecture from the perspective of time series. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before going into the encoder. Then, the output comes from the decoder one value at a time until the entire horizon is predicted.
Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is the abstract representation of the input made by the model.
Visualizing positional encoding. Note that the positional encoding matrix must be the same size as the embedding. Also note that sine is used at even positions, while cosine is used at odd positions. The length of the input sequence runs along the vertical axis in this figure.
We can see that the encoder is actually made of many encoders that all share the same architecture. An encoder is made of a self-attention mechanism and a feed-forward layer.
Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same embedding. In this case, the model assigns more importance (depicted by thicker connecting lines) to closer data points than to those farther away.
Visualizing the decoder. Like the encoder, the decoder is actually a stack of many decoders. Each is composed of a masked multi-headed attention layer, followed by a normalization layer, a multi-headed attention layer, another normalization layer, a feed-forward layer, and a final normalization layer. The normalization layers are there to keep the model stable during training.
Visualizing the decoder in detail. We see that the output of the encoder is fed to the second attention layer inside the decoder. This is how the decoder can generate predictions using information learned by the encoder.

Summary

  • A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
  • Derivatives of the Transformer architecture power most foundation models.
  • Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the ability to forecast even when few data points are available.
  • Drawbacks of using foundation models include privacy concerns and the fact that we do not control the model’s capabilities; a foundation model might also not be the best solution to our problem.
  • Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What’s the difference between an algorithm and a model?
An algorithm is the procedure or recipe for solving a task; a model is the trained result you get after applying that algorithm to data. Same algorithm + different data = different model.
What is a foundation model, in simple terms?
A foundation model is a very large machine learning model trained on massive, diverse datasets so it can generalize across many tasks. Typical traits: it has lots of parameters, works out of the box for multiple use cases, and can be fine-tuned to a specific scenario.
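
A minimal scikit-learn sketch of that point, with two synthetic datasets (the figure's actual data are not reproduced here): the algorithm, ordinary least squares, never changes, yet the fitted models end up with very different coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(100, 1))

# Same algorithm, two different (synthetic) datasets.
y_a = 2.0 * x[:, 0] + rng.normal(0.0, 1.0, 100)    # dataset A: upward trend
y_b = -0.5 * x[:, 0] + rng.normal(0.0, 1.0, 100)   # dataset B: downward trend

model_a = LinearRegression().fit(x, y_a)
model_b = LinearRegression().fit(x, y_b)

# Two different models: the learned slopes are roughly 2 and -0.5.
print(model_a.coef_, model_b.coef_)
```
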
What are the four key components that define a foundation model?
  • Training data is very large and diverse.
  • The model itself is large (millions to billions of parameters).
  • It can be used for multiple tasks.
  • It can be adapted via fine-tuning.
How do foundation models change time series forecasting?
One model can handle different frequencies, trends, and seasonalities, and often supports related tasks like anomaly detection or classification. Foundation models enable zero- or few-shot forecasting, reducing the need to train a bespoke model for each new series.
Why are foundation models so big, and what does that imply for practice?
Training on huge datasets requires many parameters (weights) to capture varied patterns, resulting in models that can be gigabytes in size. Practically, this means higher storage and compute needs, and often GPUs or API-based access for reasonable inference speed.
How does the Transformer architecture process time series?
Pipeline overview: the raw series is tokenized and passed through an embedding layer, positional encoding is added, the encoder learns dependencies via self-attention, the decoder generates forecasts using masked attention and information from the encoder, and a linear layer outputs values autoregressively until the horizon is covered.
What are tokens and embeddings in time series Transformers?
Tokens are pieces of the series (e.g., single values or windows). An embedding is a learned vector representation of those tokens that captures their salient features in a way the model can use downstream.
What is positional encoding and why is it important?
Positional encoding injects information about the order of observations so the model knows “when” each token occurs. A common approach uses sine and cosine functions at multiple frequencies, which helps the model distinguish identical values appearing at different times and preserves the temporal order of the series.
What do self-attention and multi-head attention do in the encoder?
Self-attention learns how each token relates to others in the sequence, assigning importance weights to capture dependencies. Multi-head attention runs several attention mechanisms in parallel so different heads can focus on different patterns (e.g., trends vs. seasonality).
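
A back-of-the-envelope calculation of the storage side (the parameter count and precision below are illustrative assumptions, not figures from the chapter):

```python
# Model size ~ number of parameters x bytes per parameter (weights only, before activations).
params = 200_000_000          # e.g., a 200M-parameter model (illustrative)
bytes_per_param = 4           # float32; float16/bfloat16 would halve this
size_gb = params * bytes_per_param / 1e9
print(f"~{size_gb:.1f} GB of weights")   # ~0.8 GB; billion-parameter models reach several GB
```
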
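
The step-at-a-time generation can be sketched as a loop. This reuses the TinyTimeSeriesTransformer toy class from the architecture sketch earlier in this overview (again, an illustration rather than how any particular foundation model implements inference):

```python
import torch

model = TinyTimeSeriesTransformer()        # toy class defined in the earlier sketch
model.eval()

context = torch.randn(1, 24, 1)            # 24 observed steps (random stand-in data)
decoder_input = context[:, -1:, :]         # seed the decoder with the last observation
horizon = 6

with torch.no_grad():
    for _ in range(horizon):
        out = model(context, decoder_input)          # (1, steps_so_far, 1)
        next_step = out[:, -1:, :]                   # keep only the newest prediction
        decoder_input = torch.cat([decoder_input, next_step], dim=1)

forecast = decoder_input[:, 1:, :]          # drop the seed -> (1, horizon, 1)
```
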
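
One simple tokenization-plus-embedding scheme, sketched in PyTorch (the scaling and binning choices are assumptions; real models use their own schemes):

```python
import torch
import torch.nn as nn

series = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])

# Tokenize: scale the series, then bucket each value into one of `vocab_size` bins.
scaled = (series - series.mean()) / series.std()
vocab_size, d_model = 32, 8
boundaries = torch.linspace(-3.0, 3.0, vocab_size - 1)
tokens = torch.bucketize(scaled, boundaries)       # one integer token id per time step

# Embed: each token id becomes a learned d_model-dimensional vector --
# the abstract representation the rest of the model works with.
embedding = nn.Embedding(vocab_size, d_model)
vectors = embedding(tokens)
print(tokens.shape, vectors.shape)                 # torch.Size([6]) torch.Size([6, 8])
```
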
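
The classic sine/cosine scheme, as a small NumPy sketch (one common formulation; the exact encoding varies by model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sine on even embedding dimensions, cosine on odd ones, at multiple frequencies."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))                    # same shape as the embedding
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# Each row is added to the embedding of the token at that position.
print(sinusoidal_positional_encoding(seq_len=8, d_model=6).round(2))
```
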
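
A one-layer illustration with PyTorch's built-in multi-head attention (the sizes are arbitrary): each of the four heads produces its own set of importance weights over the twelve time steps.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 16, 4, 12
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)     # an embedded, position-encoded series (random stand-in)
out, weights = attention(x, x, x, average_attn_weights=False)

print(out.shape)       # torch.Size([1, 12, 16])    -- contextualized representation
print(weights.shape)   # torch.Size([1, 4, 12, 12]) -- one 12x12 weight matrix per head
```
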
What are the main advantages and disadvantages of foundation models for forecasting?
  • Advantages: out-of-the-box pipelines; works with few data points; requires less forecasting expertise; reusable across tasks.
  • Disadvantages: privacy concerns (especially with hosted/proprietary models); limited control over capabilities; may underperform a tailored model for some cases; high compute and storage requirements.
