Machine Learning in Time Series Forecasting: Introducing Tiny Time Mixers
The Cyclical Forecasting of the Hourly National Single Price (PUN)
Published by Riccardo Gentilucci.
Machine Learning is a branch of Artificial Intelligence (AI) dedicated to the development of algorithms and models capable of learning from data and improving their performance on a specific task, without being explicitly programmed for every single case. One of the most well-known examples is certainly GPT, a model trained on large amounts of text to learn the rules of language, in order to predict the next word in a text sequence while progressively improving its performance.
While models like GPT have revolutionized the way humans interact with AI in the field of language, models built on similar principles exist for econometric analysis and time series forecasting.
Among these, one of the most interesting is the Tiny Time Mixers (TTM) family.
In this article, we analyze the development and application of Tiny Time Mixers, Machine Learning models specifically designed for the multivariate forecasting of time series.
TTMs are pre-trained and compact models, developed to combine high computational efficiency with competitive predictive accuracy, making them suitable for a wide range of practical applications.
Like other ML models, the TTM system learns from historical data to progressively improve its predictive performance, autonomously adapting to different application contexts without the need to be reprogrammed each time.
Pre-Training: Deep Learning
The TTM models have been “pre-trained” to recognize trends and seasonality in time series using Deep Learning, an advanced Machine Learning technique based on deep neural networks that imitate, in a simplified way, the functioning of the human brain. The goal of pre-training is to enable the model to recognize general patterns in time series, such as trends, seasonality, sudden spikes, and noise, regardless of the specific application domain. During this phase, known as “channel-independent” pre-training, each variable is treated as an individual series, without considering its interactions with the others. In this way, the model learns the fundamental temporal dynamics of each channel (variable).
The pre-training was carried out on a large collection of public datasets, totalling roughly one billion data points and characterized by great heterogeneity in terms of domain, temporal frequency, series length, and number of variables.
This diversity allows TTMs to generalize effectively to very different application contexts.
Technically, this process relies on the same principles that made pre-training so powerful in language models such as GPT.
Instead of learning to predict words, the TTM learns to forecast future values within time sequences.
This knowledge is then refined during Fine-Tuning on smaller and more specific datasets, in order to adapt the network weights to the target context.
Model Versions
In 2024, the attention of the scientific community progressively shifted toward the development of “foundational” pre-trained models for time series. These models, characterized by hundreds of millions or even billions of parameters, are capable of producing accurate forecasts even on datasets they have never seen during training. However, this approach entails high computational costs, considerable hardware resources, long training times, and significant financial investments.
When designing the TTM, researchers at IBM Research set out to develop a forecasting system capable of offering performance comparable to that of large foundational models, but with a much lighter architecture. The results obtained demonstrate that it is possible to build compact models, characterized by a reduced number of parameters and low computational impact, while still guaranteeing high predictive accuracy and remarkable efficiency in terms of resources.
Instead of developing a single large model capable of adapting to all forecasting contexts, the developers chose an alternative approach: to create several smaller pre-trained models, each optimized for a specific forecasting scenario. Each variant differs from the others based on the following parameters:
- Length of the context: indicates how many past time points the model considers for forecasting.
- Length of the forecast: indicates how many future time points the model is able to predict.
Choosing the model variant best suited to the case of interest allows one to achieve more accurate results while maintaining compact dimensions and extremely fast execution times, with minimal impact on computational resources.
Through the get_model() function, it is possible to automatically select the model variant according to the parameters defined by the user.
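As an illustration, the sketch below shows how such a selection could look in code, assuming the tsfm_public toolkit from IBM's granite-tsfm repository and the publicly released Granite TTM checkpoint on Hugging Face (the checkpoint path and the exact argument names may differ across library versions):

```python
# Minimal sketch of automatic variant selection via get_model().
# The checkpoint path is an assumption; values are purely illustrative.
from tsfm_public.toolkit.get_model import get_model

model = get_model(
    "ibm-granite/granite-timeseries-ttm-r2",  # assumed checkpoint path
    context_length=512,       # past hourly points available to the model
    prediction_length=96,     # future hourly points to forecast
)
print(model.config)           # inspect the selected pre-trained variant
```

The function returns the pre-trained variant whose context and forecast lengths best match the requested configuration.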
Hourly PUN Forecast Cycle
To complete the analysis, we examine how the TTM system performs in forecasting the Hourly National Single Price (PUN). The forecast is carried out using a rolling time window, applied to three different forecasting horizons.[1]
In summary, three forecasts are generated up to Friday of each week, each with a different time horizon.
The first reference week is the one ending on Friday, May 30, 2025, from which we obtain a 3-day forecast (P3) starting Wednesday, May 28, a 5-day forecast (P5) from Monday, May 26, and a 7-day forecast (P7) from Saturday, May 24.
This process is repeated every Friday for 16 consecutive weeks.
For each time horizon, the mean RMSE (Root Mean Square Error) values obtained over the 16 weeks are calculated to evaluate the model’s accuracy.
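A minimal sketch of this evaluation cycle is given below; forecast_pun() is a hypothetical placeholder for the TTM forecast call, and pun is assumed to be the hourly PUN series indexed by timestamp:

```python
# Sketch of the weekly rolling evaluation: 16 Friday cut-offs, three
# horizons per cut-off, RMSE averaged over the 16 weeks.
import numpy as np
import pandas as pd

HORIZONS = {"P7": 168, "P5": 120, "P3": 72}  # forecast lengths in hours

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

fridays = pd.date_range("2025-05-30", periods=16, freq="7D")  # weekly cut-offs
scores = {name: [] for name in HORIZONS}

for friday in fridays:
    end = friday + pd.Timedelta(hours=23)            # forecasts run up to Friday 23:00
    for name, steps in HORIZONS.items():
        start = end - pd.Timedelta(hours=steps - 1)  # P3/P5/P7 start 3/5/7 days earlier
        y_pred = forecast_pun(history_end=start - pd.Timedelta(hours=1), steps=steps)
        y_true = pun.loc[start:end]
        scores[name].append(rmse(y_true, y_pred))

for name, values in scores.items():
    print(name, "average RMSE:", np.mean(values), "max RMSE:", np.max(values))
```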
Compared to the forecast presented in the previous article, we introduced some modifications to improve accuracy.
The time horizon for the 7-day forecast (P7) was reduced by one day, in order to fit exactly within the weekly period.
In addition, we defined a specific forecasting context for each time horizon (P3, P5, P7), configuring the get_model() function to load a model variant dedicated to each case.
Finally, the data provided to the model were normalized to ensure greater robustness and stability during the forecasting phase.
For this case, we imposed the following parameters:
- CONTEXT_LENGTH = 512 (≈ 21 days of hourly data)
- PREDICTION_LENGTH = 168 (7 days) for P7, 120 (5 days) for P5, and 72 (3 days) for P3
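The following sketch summarizes this configuration, loading one variant per horizon and applying a simple z-score normalization; the checkpoint path is an assumption, and the actual pipeline may rely on the toolkit's own preprocessing utilities for scaling:

```python
# One TTM variant per horizon, plus a plain z-score normalization helper.
from tsfm_public.toolkit.get_model import get_model

CONTEXT_LENGTH = 512                                  # ≈ 21 days of hourly data
PREDICTION_LENGTHS = {"P7": 168, "P5": 120, "P3": 72}

models = {
    name: get_model(
        "ibm-granite/granite-timeseries-ttm-r2",      # assumed checkpoint path
        context_length=CONTEXT_LENGTH,
        prediction_length=horizon,
    )
    for name, horizon in PREDICTION_LENGTHS.items()
}

def zscore(df, stats=None):
    """Scale each column to zero mean and unit variance; return the scaled
    frame plus the statistics needed to invert the transform later."""
    if stats is None:
        stats = (df.mean(), df.std())
    mean, std = stats
    return (df - mean) / std, stats
```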
Zero-Shot Forecast
In this mode, the model was used without any specific training on the Italian electricity market data.
Only the knowledge acquired during the pre-training phase conducted by the developers was utilized, without any modification or parameter updates through additional training.
Zero-Shot means exactly this: the pre-trained model is called as-is, without any updates on the target domain data, and it is used to generate forecasts on the new dataset.
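In code, a Zero-Shot call might look like the hedged sketch below, where scaled_history is a hypothetical NumPy array holding the last 512 normalized hourly observations, and the past_values / prediction_outputs names follow the granite-tsfm examples (they should be verified against the installed version):

```python
# Zero-Shot inference: the pre-trained variant is called as-is, with no
# additional training on the Italian electricity market data.
import torch
from tsfm_public.toolkit.get_model import get_model

model = get_model(
    "ibm-granite/granite-timeseries-ttm-r2",  # assumed checkpoint path
    context_length=512,
    prediction_length=72,                     # 3-day horizon (P3)
)
model.eval()

# scaled_history: hypothetical array of shape (512, n_channels)
past_values = torch.tensor(scaled_history, dtype=torch.float32).unsqueeze(0)

with torch.no_grad():
    output = model(past_values=past_values)

# Shape (1, 72, n_channels): the PUN column is the forecast of interest,
# still on the normalized scale (de-normalize before computing the RMSE).
forecast = output.prediction_outputs.squeeze(0).numpy()
```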
The hourly PUN forecast in Zero-Shot mode produced the following results:
RMSE Results: ZERO-SHOT Forecast
| HORIZON | Average RMSE (€/MWh) | Maximum RMSE (€/MWh) | Average RMSE% | Maximum RMSE% |
|---|---|---|---|---|
| 7-day Forecast (P7) | 16.748 | 25.442 | 15.455% | 27.273% |
| 5-day Forecast (P5) | 14.095 | 25.326 | 12.284% | 20.109% |
| 3-day Forecast (P3) | 11.466 | 25.383 | 10.110% | 22.180% |
The results obtained in Zero-Shot mode show a surprisingly solid performance, considering that the model was never trained on Italian electricity market data. The average RMSE for the 3-day forecast is around 11.5 €/MWh, with an average percentage error of 10%, indicating a good ability of the model to capture the general trend of the PUN. As the forecast horizon increases, the error progressively rises, as expected, reaching about 15% at a one-week horizon.
What makes this result particularly relevant is the complete absence of retraining. Thanks to the pre-training phase, the TTM is able to recognize temporal patterns common to many energy series (daily seasonality, etc.) and successfully apply them to a completely new context. This confirms the remarkable generalization ability of the model, which, despite not knowing the specific domain, is able to provide good forecasts.
Fine-Tuning Forecast
Fine-Tuning is a technique that involves retraining an already pre-trained model on a new dataset to adapt it to a specific task.
For example, it is like a child who has already learned the basic rules of mathematics and is now being taught only how to apply them to a new game.
From a technical perspective, Fine-Tuning updates only a portion of the model’s parameters, starting from the weights already learned during pre-training.
In practice, the user calls the pre-trained model and performs Fine-Tuning on a targeted dataset, specifically selected to allow the model to learn particular correlations.
During Fine-Tuning, TTM activates so-called "channel mixing", the ability to combine and make different channels (variables) interact with each other. This phase represents the second level of the “multi-level” strategy adopted by TTM: if in pre-training it learned to recognize temporal patterns of each series individually, such as trends, seasonality, or sudden variations (channel-independent), now the model learns to capture the relationships between the series.
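In the granite-tsfm implementation this behaviour is typically switched on when the pre-trained model is reloaded for Fine-Tuning; the sketch below assumes that the decoder_mode="mix_channel" override used in the repository's fine-tuning examples is forwarded by get_model() (an assumption to verify against your library version):

```python
# Hedged sketch: reload the pre-trained TTM with channel mixing enabled in
# the decoder, so that the variables can interact during Fine-Tuning.
from tsfm_public.toolkit.get_model import get_model

finetune_model = get_model(
    "ibm-granite/granite-timeseries-ttm-r2",  # assumed checkpoint path
    context_length=512,
    prediction_length=168,
    decoder_mode="mix_channel",               # assumed config override
)
```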
Following a structural model built for PUN forecasting[2], we selected the following variables to build a multivariate dataset: PUN, GAS, CO2, SFER.
The model learns to map this context to the PUN target.
Performing Fine-Tuning on this dataset allows the model to learn correlations between the variables and electricity prices.
For example, it can understand that an increase in gas prices may anticipate an increase in PUN, or that CO₂ levels may have delayed effects on the energy market.
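Continuing the sketch above, Fine-Tuning on this multivariate dataset could be organized as follows; df is assumed to be an hourly DataFrame with the columns PUN, GAS, CO2 and SFER, build_forecast_dataset() is a hypothetical helper standing in for the toolkit's windowing utilities, and the hyperparameters are purely illustrative:

```python
# Hedged Fine-Tuning sketch using the standard Hugging Face Trainer.
from transformers import Trainer, TrainingArguments

# Optionally freeze the pre-trained backbone so that mainly the decoder and
# forecasting head are updated (attribute name "backbone" assumed here).
for param in finetune_model.backbone.parameters():
    param.requires_grad = False

# Hypothetical helper turning the frame into (past context, future target) samples.
train_dataset = build_forecast_dataset(
    df[["PUN", "GAS", "CO2", "SFER"]],
    context_length=512,
    prediction_length=168,
    target_column="PUN",
)

training_args = TrainingArguments(
    output_dir="ttm_finetuned_pun",
    num_train_epochs=20,
    per_device_train_batch_size=64,
    learning_rate=1e-4,
    report_to="none",
)

trainer = Trainer(model=finetune_model, args=training_args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("ttm_finetuned_pun")
```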
After Fine-Tuning the pre-trained model on the target dataset, we carried out the forecast following the same predictive cycle used in Zero-Shot mode.
The results obtained are shown in the table below.
RMSE Results: FINE-TUNING Forecast
| HORIZON | Average RMSE (€/MWh) | Maximum RMSE (€/MWh) | Average RMSE% | Maximum RMSE% |
|---|---|---|---|---|
| 7-day Forecast (P7) | 14.963 | 24.013 | 13.864% | 26.200% |
| 5-day Forecast (P5) | 12.144 | 18.436 | 10.675% | 18.112% |
| 3-day Forecast (P3) | 10.574 | 21.677 | 9.394% | 22.475% |
The analysis of the results confirms the effectiveness of the Fine-Tuning process in improving the predictive capability of the model. After adapting to the specific Italian electricity market data, the average error (RMSE) significantly decreases compared to the Zero-Shot mode: for the 3-day horizon, the absolute error drops from about 11.5 €/MWh to just over 10, while the weekly forecast improves by almost 2 percentage points. This increase in precision demonstrates the model’s ability to learn structural correlations between the considered energy variables (GAS, CO₂, and SFER) and to leverage them to refine the PUN forecast.
Thanks to the channel mixing activated during Fine-Tuning, the model was able to capture cause-and-effect relationships within the energy system, such as the impact of gas prices on the PUN or the delayed influence of CO₂ costs in subsequent days.
Conclusions
The experiment conducted with Tiny Time Mixer (TTM) clearly showed the potential of pre-trained models in time series forecasting, even in complex contexts such as the Italian electricity market. In Zero-Shot mode, the model was able to provide reasonably accurate forecasts without ever being trained on PUN data, demonstrating the robustness of multi-level pre-training and the model’s generalization ability.
However, subsequent Fine-Tuning highlighted a significant improvement in performance: the average error was consistently reduced, and the model’s ability to capture relationships between exogenous variables and the target was strengthened. This result confirms that, although pre-training provides a solid and “universal” foundation, local adaptation is essential to obtain reliable forecasts tailored to the reality of a specific domain.
The TTM system demonstrates that efficiency and accuracy can be reconciled: a small model, properly pre-trained and then adapted, can provide reliable forecasts while reducing development time and costs compared to large traditional Deep Learning models.
[1] For further details on the predictive cycle structure, see the section "Forecast Cycle" in the article:
From Econometrics to Machine Learning: The Challenge of Forecasting
[2] The structural model for hourly PUN forecasting is described in detail here:
Hourly PUN price: is there a specific “hour” effect?
[3] The Tiny Time Mixers structure is described in the official document (IBM Research):
Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series.