The arrival of foundation models in time series forecasting

The new Machine Learning forecasting systems

In the early phase of Deep Learning for time series forecasting, models were developed for specific datasets or contexts and trained to capture only the dynamics of their target domain (e.g., DeepAR, N-BEATS). These models often outperformed traditional statistical approaches such as ARIMA, but they adapt poorly to other datasets and require retraining for each individual series.
Starting from 2024, research has progressively shifted toward the development of foundation models — general-purpose models capable of making accurate forecasts on any time series without the need for task-specific training.
In this article, before exploring how these models perform on the specific case of the hourly Italian electricity price (PUN), we provide a general overview and a reference taxonomy.

Foundation Models

In the field of Machine Learning, it is useful to distinguish between two main categories of models:

  • Single-task models: these are designed to perform one specific task. For example, a model may be trained to generate product descriptions for an e-commerce site or to classify emails as spam or not spam. Such models tend to achieve excellent performance in the domain they were developed for, but they cannot be easily reused in other contexts.
    Similarly, in the case of time series, a model trained exclusively on the electricity consumption of a city can learn its typical daily and seasonal fluctuations, but it would hardly be able to forecast daily supermarket sales with the same accuracy, since the underlying temporal dynamics are completely different.
  • Foundation models: these are Machine Learning models pre-trained on large and heterogeneous datasets, capable of being adapted to multiple tasks without requiring retraining from scratch. These models are generally large (in terms of parameters) and may demand significant computational resources, but they deliver accurate performance and strong generalization capabilities.

For several years now, in the field of Natural Language Processing (NLP)[1], large foundation models — known as Large Language Models (LLMs) such as GPT or LLaMA — have become well established. These models are trained on vast amounts of text and learn the implicit rules of language, enabling them to generate text, translate languages, or answer questions without retraining.
The success of LLMs has subsequently inspired the development of foundation models in other domains such as computer vision, audio, and video, extending the paradigm well beyond natural language. Starting from 2024, research has begun applying this approach to numerical data, particularly time series, with the goal of developing models capable of learning general trends and patterns and producing accurate forecasts on previously unseen data in zero-shot mode.[2]

🎯 Single-Task Model
  • Trained for a single task
  • High performance within its domain
  • Retraining required for new data
  • Limited generalization capability

🌐 Foundation Model
  • Pre-trained on heterogeneous data
  • Adaptable to multiple tasks
  • Zero-shot and fine-tuning
  • High generalization capability

The Taxonomy of Foundation Models

Foundation models are generally large in size, as they are designed to ensure broad applicability and, at the same time, high performance. Being conceived to compete with specialized (single-task) models, they typically require a higher number of parameters and a more complex architecture. For this reason, foundation models are often referred to as Large X Models (LxM), where the letter X represents the application domain. For example, in the language domain we speak of Large Language Models (LLMs), while in computer vision we refer to Large Vision Models (LVMs), and so on. However, it is important to note that not all foundation models are necessarily “large.” In several fields, more compact variants have emerged that still retain a foundational nature.
To fully understand the structure of foundation models, it is useful to propose a classification based on the nature of their inputs, that is, on the type of data provided to the model (text, numbers, images, audio, video, etc.).

Category | General objective | Main examples
Text | Understand/generate natural language | LLMs (GPT-3.5, LLaMA)
Images/Video | Recognize/classify/understand visual content | LVMs (ViT, CLIP)
Audio | Interpret/generate sound signals | Whisper, AudioLM
Time Series | Analyze/forecast trends or future values | TimeGPT, TimesFM, TTM
Decision and Control | Learn optimal strategies and actions | AlphaZero, Decision Transformer
Multimodal | Integrate information from multiple modalities | GPT-4V, Gemini

This taxonomy allows us to observe how, starting from a shared paradigm, different families of foundation models have emerged — each developed for a specific type of input. Each of these families addresses challenges unique to its domain, such as understanding language, processing images, or forecasting temporal dynamics, yet all share the same underlying logic: training on large amounts of heterogeneous data and the ability to generalize acquired knowledge to new tasks. In this way, the evolution of foundation models does not follow a single direction but branches into multiple parallel paths, reflecting the diversity of data types and application objectives.

Towards Foundation Models for Time Series

A foundation model for time series is a Machine Learning model pre-trained on a large amount of numerical data, in particular historical time series. During training, the model is exposed to billions of time points from thousands of different series. The goal is to learn general patterns of phenomena that evolve over time (seasonality, trends, correlations, economic cycles, etc.), so that these patterns can be reused on new, previously unseen series.
After training, the model can be used in a "zero-shot" setting or with "fine-tuning" — an additional training phase on a specific dataset selected by the user, which updates the model parameters and enables more accurate performance in the desired context.
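To make the two modes concrete, here is a minimal sketch built around a hypothetical FoundationForecaster class (it is not the API of any real library, and the "pre-trained knowledge" is reduced to a simple seasonal rule): in zero-shot mode the model forecasts with its parameters untouched, while fine-tuning performs a short additional fit on the user's own series before forecasting.

```python
import numpy as np

class FoundationForecaster:
    """Hypothetical stand-in for a pre-trained foundation forecaster.

    The 'pre-trained knowledge' is reduced to a seasonal-naive rule; a real
    foundation model would put a large pre-trained network in its place.
    """

    def __init__(self, season_length: int = 24):
        self.season_length = season_length
        self.scale, self.bias = 1.0, 0.0   # the only parameters fine-tuning may update

    def predict(self, history: np.ndarray, horizon: int) -> np.ndarray:
        """Zero-shot forecast: repeat the last seasonal cycle, no training on `history`."""
        last_cycle = history[-self.season_length:]
        reps = int(np.ceil(horizon / self.season_length))
        base = np.tile(last_cycle, reps)[:horizon]
        return self.scale * base + self.bias

    def finetune(self, history: np.ndarray) -> "FoundationForecaster":
        """Fine-tuning: a short extra fit of the adjustable parameters on the user's series."""
        base = history[:-self.season_length]    # seasonal-naive prediction ...
        target = history[self.season_length:]   # ... of each later observation
        A = np.column_stack([base, np.ones_like(base)])
        self.scale, self.bias = np.linalg.lstsq(A, target, rcond=None)[0]
        return self

# Toy hourly series with a daily cycle plus noise
rng = np.random.default_rng(0)
t = np.arange(24 * 30)
y = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)

model = FoundationForecaster(season_length=24)
zero_shot = model.predict(y, horizon=24)               # pre-trained knowledge only
fine_tuned = model.finetune(y).predict(y, horizon=24)  # after adaptation to this series
```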

Why did foundation models for time series emerge only in 2024, when in domains such as language solutions like GPT have been available for years?
The answer lies in the intrinsic complexity of time series data, both from a theoretical and practical perspective. Unlike natural language, which is based on a defined vocabulary and shared grammatical rules, time series consist of continuous, often noisy numerical data. Each series follows its own dynamics, with varying shapes, magnitudes, and rhythms, and there is no discrete structure comparable to linguistic syntax. The model must therefore learn to interpret these signals without relying on a universal “language.”

Making the task even more challenging is the vast heterogeneity of time series: they vary in context length (how much historical data is available), forecasting horizon (how many future steps must be predicted), and temporal frequency, which can range from minutes to years. A truly effective foundation model must therefore be extremely flexible, capable of adapting to these variables while maintaining accuracy across very different scenarios.
Another major obstacle is the limited availability of data. In the language domain, enormous amounts of text are freely accessible online, whereas time series are often fragmented, constrained by privacy restrictions, or limited to specific industrial domains. Building a dataset that is both large and diverse enough to train a generalist model is therefore much more difficult.
All these challenges explain why developing a universal generalist system for time series was far from straightforward. However, by 2024, several of these barriers were finally overcome, paving the way for the emergence of foundation models.

The First Foundation Model for Forecasting: TimeGPT

TimeGPT is the first foundation model for time series forecasting, developed in 2024. Like GPT for language or ViT for images, TimeGPT is also based on a Transformer[3] architecture, but adapted to handle numerical data that evolve over time. The Transformer architecture, introduced by Vaswani et al. in 2017[7], is now the backbone of the most advanced artificial intelligence models. TimeGPT employs an encoder–decoder structure with self-attention: each time series is divided into windows (blocks of historical values), transformed into vector embeddings, and enriched with positional encoding to represent temporal order. The model processes all points in the sequence globally, capturing both local relationships (between nearby values) and long-term dependencies.[4]
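As an illustration of this input pipeline, the sketch below, a simplified numpy version rather than TimeGPT's actual code, splits a series into fixed-length windows, projects each window into an embedding vector with a randomly initialized linear map (a placeholder for the learned projection), and adds sinusoidal positional encodings to represent temporal order.

```python
import numpy as np

def make_windows(series: np.ndarray, window: int) -> np.ndarray:
    """Split a 1-D series into consecutive, non-overlapping windows."""
    n = (len(series) // window) * window
    return series[:n].reshape(-1, window)

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encoding in the style of Vaswani et al. (2017)."""
    positions = np.arange(n_positions)[:, None]
    dims = np.arange(d_model)[None, :]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

rng = np.random.default_rng(0)
series = rng.normal(size=512)                  # toy input series
window, d_model = 32, 64

windows = make_windows(series, window)         # (16, 32) blocks of historical values
W_embed = rng.normal(size=(window, d_model))   # placeholder for a learned linear projection
embeddings = windows @ W_embed                 # (16, 64) vector embeddings
embeddings += sinusoidal_positional_encoding(len(windows), d_model)  # encode temporal order
# `embeddings` is the kind of input the self-attention layers of the encoder then process
```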

The model was trained on over 100 billion data points from numerous sectors. This dataset is extremely heterogeneous, including series with different frequencies (daily, hourly, annual), noise levels, trends, and seasonality patterns. The goal was to make the model robust and generalizable, capable of delivering accurate performance in both zero-shot forecasting and fine-tuning settings.
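In practice, TimeGPT is accessed through Nixtla's hosted API. The sketch below assumes the nixtla Python client and a valid API key; the file name is illustrative and the argument names follow the client as documented at the time of writing, so treat it as a sketch rather than a definitive recipe.

```python
import pandas as pd
from nixtla import NixtlaClient  # assumes `pip install nixtla` and a valid API key

client = NixtlaClient(api_key="YOUR_API_KEY")

# Illustrative input: a timestamp column `ds` and a value column `y` (e.g. hourly prices)
df = pd.read_csv("hourly_prices.csv", parse_dates=["ds"])

# Zero-shot: forecast the next 24 hours using only pre-trained knowledge
zero_shot = client.forecast(df=df, h=24, time_col="ds", target_col="y")

# Fine-tuned: run a few extra training steps on this series before forecasting
fine_tuned = client.forecast(df=df, h=24, time_col="ds", target_col="y",
                             finetune_steps=50)
```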

A "Decoder-Only" Model: TimesFM

TimesFM (Time Series Foundation Model) is an open-source time series forecasting model developed by Google Research. This model uses a decoder-only Transformer architecture — a type of neural network that processes sequences causally, meaning each element can “see” only those that come before it. In the context of time series, this approach is particularly suitable because it respects the natural flow of time: the future depends on the past.
Instead of analyzing each individual point in a time series (e.g., each daily or hourly value), TimesFM divides the sequence into patches, i.e., contiguous blocks of data. Each patch is then transformed into a numerical vector called a temporal token, which is subsequently passed to the model. In natural language, a token can be a word or a character; in time series, a token is a block of data representing a segment of the series. The model processes these tokens sequentially and, based only on previous tokens, predicts the future values of the series.
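A highly simplified illustration of how patching and causal decoding fit together (not TimesFM's actual code): the series is cut into fixed-length patches, the temporal tokens, and future patches are produced autoregressively, each step conditioned only on the tokens that precede it. Here a trivial "repeat the last patch" rule stands in for the Transformer decoder.

```python
import numpy as np

PATCH_LEN = 32  # length of each temporal token (a patch of contiguous values)

def to_patches(series: np.ndarray, patch_len: int = PATCH_LEN) -> list[np.ndarray]:
    """Cut a 1-D series into contiguous patches (the 'temporal tokens')."""
    n = (len(series) // patch_len) * patch_len
    return list(series[:n].reshape(-1, patch_len))

def predict_next_patch(previous_tokens: list[np.ndarray]) -> np.ndarray:
    """Stand-in for the decoder-only Transformer: it may look only at past tokens.
    Here it simply repeats the last observed patch (a naive persistence rule)."""
    return previous_tokens[-1].copy()

def autoregressive_forecast(series: np.ndarray, horizon: int) -> np.ndarray:
    tokens = to_patches(series)
    forecast: list[float] = []
    while len(forecast) < horizon:
        next_patch = predict_next_patch(tokens)  # conditioned only on earlier tokens
        tokens.append(next_patch)                # the new token joins the context
        forecast.extend(next_patch)
    return np.array(forecast[:horizon])

rng = np.random.default_rng(0)
history = np.sin(np.arange(512) * 2 * np.pi / 32) + rng.normal(0, 0.1, 512)
print(autoregressive_forecast(history, horizon=96).shape)  # (96,)
```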

Despite the complexity of the task, TimesFM is relatively compact compared to large language models: it contains about 200 million parameters and was trained on around 100 billion time points, a much smaller scale than typical LLMs such as GPT-3 or LLaMA. This demonstrates that it is possible to build a practical and efficient foundation model for time series forecasting, capable of achieving performance close to that of specialized supervised models without the high computational costs of large, generic models.

An Alternative Foundation Model: Tiny Time Mixers

In designing the Tiny Time Mixers (TTM), researchers at IBM Research aimed to develop a forecasting system capable of achieving performance comparable to large foundation models (e.g., TimeGPT, TimesFM), but with a much lighter and more efficient architecture based on TSMixer (Time Series Mixer), which replaces the Transformer's self-attention layers with simple MLP mixing blocks.[5][6]
While large foundation models rely on a single, high-capacity network designed to adapt to a wide range of forecasting contexts, the TTM approach is modular — developing multiple compact variants (around 1 million parameters each), each pre-trained and optimized for a specific forecasting scenario.
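To give an idea of the TSMixer building block that TTM relies on, here is a plain numpy sketch, much simplified relative to the real architecture: one mixing block alternates an MLP applied across time steps (time mixing) with an MLP applied across channels (feature mixing), with residual connections, instead of using self-attention. The weights shown are random placeholders for learned parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mixer_block(x: np.ndarray, w_time: np.ndarray, w_feat: np.ndarray) -> np.ndarray:
    """One simplified TSMixer-style block on an input of shape (time, channels)."""
    x = x + relu(w_time @ x)   # time mixing: w_time (time, time) mixes values across time steps
    x = x + relu(x @ w_feat)   # feature mixing: w_feat (channels, channels) mixes the channels
    return x

rng = np.random.default_rng(0)
context_len, n_channels = 96, 3       # e.g. 96 past values of 3 related series
x = rng.normal(size=(context_len, n_channels))

w_time = rng.normal(size=(context_len, context_len)) * 0.05   # placeholder learned weights
w_feat = rng.normal(size=(n_channels, n_channels)) * 0.05

out = mixer_block(x, w_time, w_feat)  # same shape as the input: (96, 3)
```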

The Tiny Time Mixers system does not fully fit the classical definition of a foundation forecasting model due to its innovative structure. However, it can still be classified as one, since it is pre-trained on a vast collection of heterogeneous time series varying by domain, frequency, and forecasting horizon. Each of its variants can generate accurate zero-shot forecasts, further refined through fine-tuning — a key characteristic of foundation models.
At the same time, the TTM system incorporates features typical of "single-task" models, since each variant is calibrated for a specific forecasting context, with different configurations for context length and prediction horizon. Despite this specialization, its overall nature remains fundamentally foundational.

In summary, on one hand, we have TimesFM and TimeGPT — large-scale foundation models (Large Time Series Models or LTSM) capable of producing accurate zero-shot forecasts on their own; on the other hand, we have TTM, a system of compact models structured into multiple specialized variants that deliver highly accurate predictions within their respective training contexts.

🧠 TimeGPT
  • Transformer architecture
  • LTSM (over 200M parameters)
  • Zero-shot & fine-tuning
🔁 TimesFM
  • Transformer architecture
  • LTSM (200M parameters)
  • Zero-shot & fine-tuning
🧩 Tiny Time Mixers
  • TSMixer architecture
  • Compact (1M parameters)
  • Zero-shot & fine-tuning

Conclusion

The development of Machine Learning models across all fields is advancing rapidly, and even experts find it difficult to predict their limits. The time series models discussed in this article demonstrate that building foundation architectures capable of delivering competitive results, despite the challenges outlined, is not only possible but also extremely promising.
The evolution of these systems in the coming years is uncertain, but continuous improvements in accuracy and generalization capabilities are likely. Already today, these models achieve performance comparable to — and in some cases superior to — the best traditional econometric approaches and specialized Deep Learning models (single-task), especially after targeted fine-tuning. Considering that the first results of these technologies have been made available to users only recently (less than two years ago), the growth potential appears significant.

Looking ahead, there seem to be two possible development directions. The first aims at creating increasingly large, foundational, and multimodal models, capable of tackling a wide range of tasks from any type of input, from text and image generation to time series forecasting. These systems strive toward a form of intelligence that is increasingly general and adaptable.
The second direction prioritizes computational efficiency and sustainability, seeking to maintain high performance with compact and optimized architectures. Models like the Tiny Time Mixers demonstrate that it is possible to drastically reduce computing costs without sacrificing accuracy, highlighting how lightweight models can become a strategic advantage in a context where hardware and energy resources are increasingly limited.
It is difficult to predict which direction will dominate; the outcome will depend heavily on the results of future research, and both paths currently have limitations. Most likely, both directions will coexist: on one side, universal and multimodal models; on the other, lightweight and specialized solutions, each addressing different needs. The future of artificial intelligence may therefore not be dominated by a single philosophy, but by the dynamic balance between power and efficiency.


[1] The branch of AI that focuses on enabling computers to understand, interpret, and generate human language.
[2] In this mode, forecasts are generated using only the knowledge acquired during the pre-training phase conducted by developers, without modifying or updating the model parameters through further training.
[3] The internal structure of a model — that is, how its components are organized and connected to process data and learn from patterns. In other words, it represents how the model “thinks” and transforms inputs into outputs.
[4] For more on how the Transformer works: Token & Transformer: the heart of modern Machine Learning models
[5] The TTM structure is described here: Machine Learning in Time Series Forecasting: Introducing Tiny Time Mixers
[6] For a comparison between TTM and traditional forecasting models: From Econometrics to Machine Learning: The Challenge of Forecasting
[7] Definition of Transformer (Vaswani et al., 2017): Attention Is All You Need