Best practices for backtesting trading strategies in power markets

How historical forecasts can help developers and independent power producers evaluate different trading strategies for their merchant assets

Programmatic trading in power markets has two layers: the prediction layer and the decision layer. The prediction layer is where forecasting companies operate, training machine learning models to forecast what may happen in economic dispatch over the following hours and days. Operators use these predictions to make decisions: whether to offer into the ancillary services or energy markets, how much exposure to take to real-time price risk, or whether to pay a premium and hedge on ICE.

The decision layer is where forecasts turn into an optimization. Given probabilistic ranges for energy and ancillary prices, how should a battery operator allocate their BESS capacity and energy tomorrow? Armed with the same set of forecasts, different traders can and will make different decisions according to their own risk tolerance. For batteries, the goal is to maximize ‘percentage of perfect foresight’ (the revenue a strategy captures as a share of what perfect knowledge of prices would have earned) subject to your appetite for risk. How does your trading strategy perform during Winter Storm Uri? Or in gas-constrained regimes like those in ISO-NE and CAISO in 2022? Robust trading strategies are backtested on past months or years of data to ensure they perform well under different market conditions.
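To make the ‘percentage of perfect foresight’ idea concrete, here is a minimal sketch. It assumes a heavily simplified battery that charges for one hour and discharges for one later hour each day, with no efficiency losses, no ancillary revenue, and 1 MWh of energy. The function names and the prices are hypothetical, chosen only for illustration.

```python
def perfect_foresight_revenue(prices):
    """Best spread from charging in one hour and discharging in a later hour."""
    best = 0.0
    lowest_so_far = prices[0]
    for price in prices[1:]:
        best = max(best, price - lowest_so_far)
        lowest_so_far = min(lowest_so_far, price)
    return best


def strategy_revenue(prices, charge_hour, discharge_hour):
    """Revenue from the hours the strategy picked, settled at realized prices."""
    return prices[discharge_hour] - prices[charge_hour]


def pct_perfect_foresight(realized, perfect):
    """Realized revenue as a percentage of the perfect-foresight benchmark."""
    if perfect <= 0:
        raise ValueError("perfect-foresight revenue must be positive")
    return 100.0 * realized / perfect


# Made-up realized prices in $/MWh for six consecutive hours.
realized_prices = [30.0, 20.0, 25.0, 80.0, 60.0, 40.0]

# Benchmark: charge at 20, discharge at 80.
perfect = perfect_foresight_revenue(realized_prices)

# The strategy, acting on forecasts, chose hour 0 to charge and hour 4 to discharge.
realized = strategy_revenue(realized_prices, charge_hour=0, discharge_hour=4)

score = pct_perfect_foresight(realized, perfect)
print(f"{score:.0f}% of perfect foresight")
```

A real benchmark would come from the same optimizer the strategy uses, run against realized prices with the asset's actual constraints.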

A backtest is usually performed by owners interested in selecting a trading strategy for their asset. Many companies offering tolling agreements will analyze trading strategies before purchasing those rights. Solar developers may use one to determine how they would trade their asset if it were co-located with a battery.

A key input into backtests is the set of forecasts that were available at the time bids and offers had to be made. These backward-looking forecasts are called backcasts. Best practices for generating backcasts include making them reproducible, avoiding data leakage, ensuring model lineage, and evaluating the right metrics, which we’ll cover in the following sections.

Key Takeaways

A backtest assesses how a trading strategy (the decision layer) performs over the course of a historical period.
For any morning before day-ahead bid close in the period, a trading strategy will have forecasts (the prediction layer) available to it before offering into the market.
These forecasts are called backcasts: the short-term forecasts that would have been generated given the information available at the time.

Backtests with Backcasts

Backtests are the method by which trading strategies can be evaluated over a given period, and usually require historical short-term forecasts, also known as backcasts.

Reproducible Backcast Data

Reproducible means the methodology that creates the backcasts can also be used operationally after the Commercial Operation Date (COD) of the asset. Generally, a trading strategy being backtested will require energy and ancillary price forecasts and asset generation forecasts, in addition to a given set of constraints (such as state of charge, heat rate, or financial position under any associated PPAs). Every input used in the trading strategy needs to come from a data source that will still exist in the future. In practice, this means saving machine learning models to a model store so they can be loaded for inference, and writing feature pipelines so that they can generate point-in-time vintages and handle missing or obviously wrong inputs.

Avoiding Data Leakage

Data leakage occurs when backcasts for a historical period use information from that period that would not have been known when the backcast models were run. There are two main ways it manifests in power market forecasting. The first is temporal leakage, where models use actual outcomes, such as realized demand or generation figures, instead of the forecasts that were available at model runtime.

Vintaged Inputs

Inputs to the model should be vintaged to the time the model runs.
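One way to vintage inputs is to store every upstream forecast together with the time it was issued, then select only the latest vintage issued at or before the model's run time. The sketch below illustrates that idea with the standard library. The `ForecastRecord` type, the `as_of` helper, and the load values are hypothetical, not part of any particular feature store.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ForecastRecord:
    issued_at: datetime    # when the upstream forecast was published
    target_time: datetime  # the hour the forecast describes
    value: float           # e.g. forecast system load in MW


def as_of(records, run_time):
    """Return {target_time: value} using the latest vintage issued at or before run_time."""
    latest = {}
    for rec in records:
        if rec.issued_at > run_time:
            continue  # not yet published when the model ran, so using it would leak
        current = latest.get(rec.target_time)
        if current is None or rec.issued_at > current.issued_at:
            latest[rec.target_time] = rec
    return {target: rec.value for target, rec in latest.items()}


target = datetime(2023, 7, 1, 17)
records = [
    ForecastRecord(datetime(2023, 6, 29, 6), target, 68000.0),
    ForecastRecord(datetime(2023, 6, 30, 6), target, 70000.0),
    ForecastRecord(datetime(2023, 6, 30, 18), target, 72000.0),  # issued after the run
]

# The model runs on the morning of June 30, before day-ahead bid close.
inputs = as_of(records, run_time=datetime(2023, 6, 30, 9))
print(inputs[target])
```

The same filter applies to realized values: an actual observation only becomes a valid input once its publication time is at or before the run time.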

The second common form of data leakage in backtests is including dates from the backcast period in training. The phrase ‘knowledge cutoff’ refers to the last day of data used in a model's training dataset. If the knowledge cutoff is after the beginning of your backtest period, you will have data leakage.

Knowledge Cutoff Leakage

A model trained up to Oct 2024 has already seen data from 2023, so its forecasts for a 2023 backtest period will likely be more accurate than what you can expect after your COD.
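The knowledge cutoff rule is simple enough to enforce as a guard before any backtest runs. This is a minimal sketch with a hypothetical helper name. It treats a cutoff equal to the backtest start as leakage too, which is a conservative assumption, since that day would appear in both training and evaluation.

```python
from datetime import date


def has_cutoff_leakage(knowledge_cutoff, backtest_start):
    """True if the training data reaches into the backtest period."""
    # Assumption: the cutoff is the last day included in training, so equality also leaks.
    return knowledge_cutoff >= backtest_start


# The example above: trained through October 2024, backtested on 2023.
leaky = has_cutoff_leakage(date(2024, 10, 31), backtest_start=date(2023, 1, 1))

# A clean setup: training ends the day before the backtest begins.
clean = has_cutoff_leakage(date(2022, 12, 31), backtest_start=date(2023, 1, 1))

print(leaky, clean)
```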

Data leakage can be okay but should come with considerable caveats. It will almost always produce inflated results relative to a clean evaluation, which can lead owners to adopt suboptimal trading strategies or take on a risk profile that could otherwise be avoided.

Backtests should incorporate periods of volatility, extreme demand, and renewable integration to ensure that models are stress-tested under various conditions. Unfortunately, a long backcast period after the knowledge cutoff of the training dataset means your model was trained on data from a grid that looks significantly different from today's. An ERCOT model trained up to summer 2023 will include volatile summer months and winter cold snaps, but its knowledge cutoff falls at a time when there was 2 GW less battery capacity and 7 GW less solar. Techniques like rolling-window validation are essential to test models across different market environments without risking data leakage.

Rolling Window Backtest with Model Lineage

Model lineage is the process through which model weights are updated while the inputs and architecture are held constant. An updated model, trained on the most recent activity on the grid, can also be used live.

Rolling windows allow a long backtest period while using the same model architecture that will run in operation. Modellers hold inputs and architecture constant while updating the weights of the neural network with more recent data. More recent models can learn new price shapes stemming from a rapidly evolving grid, and the backtest still shows how the architecture and inputs perform under different regimes.
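A rolling-window schedule can be sketched as a list of (knowledge cutoff, test start, test end) tuples, where each model in the lineage is retrained with a cutoff the day before its test window begins. The function name and the 30-day retraining step are assumptions for illustration. The right retraining cadence depends on the market and the model.

```python
from datetime import date, timedelta


def rolling_windows(backtest_start, backtest_end, step_days):
    """Split the backtest into consecutive windows, each with its own knowledge cutoff."""
    windows = []
    test_start = backtest_start
    while test_start <= backtest_end:
        test_end = min(test_start + timedelta(days=step_days - 1), backtest_end)
        knowledge_cutoff = test_start - timedelta(days=1)  # last training day
        windows.append((knowledge_cutoff, test_start, test_end))
        test_start = test_end + timedelta(days=1)
    return windows


windows = rolling_windows(date(2023, 1, 1), date(2023, 3, 31), step_days=30)
for cutoff, start, end in windows:
    # Same architecture and inputs every time; only the weights are refit.
    print(f"train through {cutoff}, backcast {start} to {end}")
```

Each window's backcasts come from a model that has seen nothing from that window, so the stitched-together backtest covers the full period without knowledge cutoff leakage.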

Key Takeaways

Data leakage occurs when historical backcasts use information unavailable at the time decisions were made.
Leakage inflates backtest results, leading to non-optimal strategies and risk exposure.
Techniques like rolling-window validation help avoid leakage and ensure models perform across various market conditions.

Evaluating Metrics

There is a clear demarcation between the prediction layer and the decision layer of programmatic power trading. In the decision layer, the goal is similar for everyone: maximize profit-and-loss (P&L) subject to your appetite for risk. Some trading strategies are superior because they yield lower risk for the same P&L.

In the prediction layer, the metrics for evaluation depend on how the predictions are used downstream. The best metric is how those forecasts support your asset’s P&L. Other mathematical metrics include Root Mean Squared Error (RMSE) for point forecasts and Continuous Ranked Probability Score (CRPS) for probabilistic forecasts. Situational metrics can include the calibration of the percentiles (how often the observed value falls within the predicted interval), the predicted top-bottom spreads (the hours best used for charging and discharging a battery), or the sensitivity and precision for predicting real-time price spikes. The best partnerships between the prediction and decision layers clearly identify the metrics that are needed from the prediction layer to improve the P&L of the decision layer.
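A few of these metrics can be sketched in plain Python. The example computes RMSE for a point forecast, interval coverage as a calibration check, and pinball (quantile) loss for a single percentile. Pinball loss stands in for CRPS here: averaging it over many quantile levels approximates CRPS up to a constant factor, but this is a simplification, not a full CRPS implementation. All prices are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of a point forecast."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def interval_coverage(actual, lower, upper):
    """Share of observations that fall inside the predicted interval."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


def pinball_loss(actual, quantile_pred, q):
    """Mean quantile loss for forecasts of the q-th quantile (0 < q < 1)."""
    total = 0.0
    for a, p in zip(actual, quantile_pred):
        diff = a - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(actual)


# Made-up realized prices and forecasts in $/MWh.
actual = [20.0, 50.0, 100.0, 30.0]
point = [25.0, 45.0, 90.0, 30.0]
p10 = [10.0, 30.0, 60.0, 20.0]
p90 = [40.0, 70.0, 95.0, 50.0]

point_error = rmse(actual, point)
coverage = interval_coverage(actual, p10, p90)  # a calibrated P10-P90 band targets 0.80
p90_loss = pinball_loss(actual, p90, q=0.9)

print(round(point_error, 2), coverage, round(p90_loss, 3))
```

On a real backtest, coverage would be computed over thousands of hours; four observations are only enough to show the mechanics.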

Summary

Programmatic power trading strategies hinge on the balance between forecast accuracy and decision optimization. As asset owners and developers seek to maximize profits in the volatile energy markets, tools like backcasts, rolling windows, and clear metric evaluations ensure trading strategies are resilient. By avoiding common pitfalls like data leakage and aligning predictive performance with business goals, these strategies can drive better results under real-world conditions, fostering a more informed and proactive approach to energy trading.

Enertel AI provides short-term energy and ancillary price forecasts for utilities, independent power producers (IPPs), and asset developers to inform trading strategies. We recently released an offering for developers that provides them with two years of historical forecasts from custom models to use as backcast data for their backtests. You can view our data catalog, view a clickthrough demo of our product for operators, or contact us to see a sample of backcast data for your assets.
