Strategy Evaluation Fundamentals

June 16, 2026

The overfitting problem

The most common backtesting mistake is overfitting: tuning a strategy's parameters until it performs well on historical data, then assuming it will perform well in the future. With enough parameters and enough tuning, you can make almost any strategy look profitable on past data, because the tuning fits noise as readily as it fits signal. That says nothing about how the strategy will perform going forward.

Signs of overfitting:
The strategy only works with very specific parameter values
Small changes to parameters dramatically change results
The strategy performs well on the training period but poorly on out-of-sample data
The strategy has many rules or conditions relative to the number of trades it generates
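
The second sign can be checked mechanically by nudging a parameter and measuring how far the result moves. The Python sketch below is only an illustration: `backtest` stands in for whatever function returns a performance number for a given parameter value, the 10% step is an arbitrary assumption, and `smooth` and `spiky` are toy functions rather than real strategies.

```python
from typing import Callable


def sensitivity(backtest: Callable[[float], float], value: float, step: float = 0.10) -> float:
    """Return the worst relative change in the result when `value` is nudged by +/- step."""
    base = backtest(value)
    if base == 0:
        raise ValueError("baseline result is zero; relative change is undefined")
    neighbors = [backtest(value * (1 - step)), backtest(value * (1 + step))]
    return max(abs(n - base) / abs(base) for n in neighbors)


# Toy stand-ins for a real backtest (hypothetical, for illustration only).
def smooth(p: float) -> float:
    return 100.0 - 0.1 * (p - 20.0) ** 2  # broad plateau around p = 20


def spiky(p: float) -> float:
    return 100.0 if p == 20.0 else 5.0  # profitable at exactly one value


smooth_change = sensitivity(smooth, 20.0)
spiky_change = sensitivity(spiky, 20.0)
```

A result that barely moves (as with `smooth`) suggests a plateau; one that collapses (as with `spiky`) suggests the parameter was fitted to noise.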

Out-of-sample testing

The most reliable defense against overfitting is out-of-sample testing. Split your historical data into two periods: use the first period (in-sample) to develop and tune the strategy, then test it on the second period (out-of-sample) without any further adjustments.

If the strategy performs well on both periods, it's more likely to have captured a genuine pattern. If it performs well in-sample but poorly out-of-sample, it's probably overfit.
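
A chronological split like the one described above might look like this in Python. It is a minimal sketch: the 70/30 ratio is an assumption rather than a recommendation from this article, and `prices` is placeholder data. The important detail is that the split follows time order and is never shuffled.

```python
from typing import Sequence, Tuple


def split_in_out(data: Sequence, in_sample_fraction: float = 0.7) -> Tuple[Sequence, Sequence]:
    """Split chronologically ordered data into in-sample and out-of-sample parts."""
    if not 0 < in_sample_fraction < 1:
        raise ValueError("in_sample_fraction must be between 0 and 1")
    cut = int(len(data) * in_sample_fraction)
    return data[:cut], data[cut:]


prices = list(range(100))  # placeholder for a chronologically ordered price series
in_sample, out_of_sample = split_in_out(prices)
```

Tune only on `in_sample`, then run the finished strategy on `out_of_sample` once.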

Statistical significance

A strategy that produces 10 trades over 5 years hasn't been tested enough to draw conclusions. Random chance alone could produce 10 winning trades in a row: at a 50% win rate the odds are about 1 in 1,000, which is not rare once you have tried hundreds of parameter combinations. You need enough trades to establish statistical confidence that the results aren't due to luck.

Rules of thumb:
Minimum 30 trades for basic statistical validity
100+ trades for higher confidence
The more parameters your strategy has, the more trades you need
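
To see why trade count matters, it helps to ask how often a coin-flip strategy would match a given win count. The Python sketch below uses a plain binomial calculation. It assumes trades are independent and that a no-skill strategy wins 50% of the time, and it looks only at win rate, not at the size of wins and losses.

```python
from math import comb


def prob_at_least(wins: int, trades: int, p: float = 0.5) -> float:
    """Probability of `wins` or more winners in `trades` trades if each wins with probability p."""
    return sum(comb(trades, k) * p**k * (1 - p) ** (trades - k) for k in range(wins, trades + 1))


# Same 60% win rate, different sample sizes.
p_small = prob_at_least(6, 10)    # 6 wins out of 10
p_large = prob_at_least(60, 100)  # 60 wins out of 100
```

Under these assumptions, 6 wins in 10 trades would happen by chance about 38% of the time, while 60 wins in 100 trades would happen roughly 3% of the time. The win rate is the same, but the evidence is much stronger.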

Regime awareness

Markets change. A strategy that worked in a trending market may fail in a range-bound market. A strategy developed during low-volatility conditions may blow up during a crisis. The best strategies are either robust across multiple market regimes or explicitly designed for a specific regime (with rules for when to stop trading in other regimes).

SDB's extended historical data (Pro) lets you test across multiple market environments. If a strategy only worked during one specific period, be skeptical.
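
One simple way to look for regime dependence is to group trade results by regime and compare them. The Python sketch below assumes each trade has already been tagged with a regime label; how to assign those labels is a separate question, and the trades shown are made up for illustration. It does not use any SDB feature or API.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_return_by_regime(trades: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average trade return for each regime label."""
    grouped = defaultdict(list)
    for regime, ret in trades:
        grouped[regime].append(ret)
    return {regime: mean(rets) for regime, rets in grouped.items()}


# Hypothetical trades: (regime label, return as a fraction).
trades = [
    ("trending", 0.04), ("trending", 0.02), ("trending", 0.03),
    ("range-bound", -0.01), ("range-bound", -0.03),
]
by_regime = mean_return_by_regime(trades)
```

If the average is positive in one regime and negative in another, as in this toy data, the strategy needs a rule for when to stand aside.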

The checklist

Before considering any backtested strategy viable:

Does it pass out-of-sample testing?
Does it have enough trades for statistical significance?
Is it robust to small parameter changes?
Does it work across different market regimes?
Are the execution assumptions realistic (slippage, commissions)?
Is the maximum drawdown tolerable in real trading?
Does the logic make intuitive sense, or is it just data mining?

A strategy that passes all seven is worth investigating further. Most strategies fail at least two.
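
For the drawdown item on the checklist, maximum drawdown can be computed from an equity curve as the largest peak-to-trough decline. The Python sketch below assumes a list of positive account values in time order; the numbers are hypothetical.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst


equity_curve = [100.0, 120.0, 90.0, 110.0, 130.0, 117.0]  # hypothetical account values
dd = max_drawdown(equity_curve)
```

Here the worst decline is from 120 to 90, a 25% drawdown. Whether that is tolerable depends on the trader, not the backtest.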

Matthew J. Goss, Jr.

Retired COMEX/NYMEX floor trader, Goldman Sachs and FlexTrade Systems alumnus, multi-instrumentalist, published author, and independent mathematics researcher. Founder of Quantiterate.