The Problem With Backtests
Why Backtests Lie About the Future
Backtesting, running a trading strategy against historical price data to measure how it would have performed, feels like the ultimate proof of concept. You test a rule on 10 years of real data, the numbers look good, and you believe you've found an edge. In reality, backtests often tell a seductive fiction. They obscure fatal flaws that only emerge when real money is at stake and live market conditions deviate from the past. A backtest is not a prediction; it is a report card on history that the strategy's designer had already seen, or could have seen, while building the rules.
A backtest measures how a strategy performed against known history—not how it will perform against unknown future price action. The gap between these two truths is where most backtested strategies fail in live trading.
Key takeaways
- Backtests measure past performance only; they provide no guarantee about future returns and cannot account for market regimes that have not yet occurred.
- Overfitting is the cardinal sin of backtesting: a strategy tweaked to fit historical patterns will collapse when market conditions change.
- Transaction costs, slippage, and commissions are often omitted or underestimated in backtests, which can overstate real-world returns by 5–40%.
- Backtests ignore liquidity constraints, gap risk, and the speed at which you can actually enter or exit positions.
- A backtest's success on past data is not, by itself, evidence of robustness; true validation requires out-of-sample testing and live trading under real constraints.
The Illusion of Certainty
When you backtest a strategy, you already know how the story ends. You have price data from 2010 to 2024. Your moving average crossover signals "buy" on March 16, 2015, and you can instantly verify that the price went up the next day. You have hindsight. A backtest feels scientific because it produces numbers (percent returns, win rates, Sharpe ratios), but those numbers describe a world that no longer exists. They describe a world in which the future was already written.
Real trading occurs in the present, facing an open future. Your live trading system receives today's price, makes a decision, and executes an order without knowing whether tomorrow's price will go up or down. A backtest reverses this: it knows the answer and constructs a narrative in which the strategy predicted it. This is why backtests are deceptively good at finding false signals.
Consider a simple example: a strategy that buys the S&P 500 on the first trading day of each month. From January 2010 to December 2023, this naive rule delivered cumulative returns of approximately 420%, beating many active managers. A backtest would declare it successful. But this rule worked mainly because those years contained a strong bull market. Backtest the same rule from 2000 to 2002 (the dot-com crash), and it loses money. Backtests cannot reveal which historical periods are relevant to the future.
Overfitting: The Backtest's Fatal Flaw
Overfitting occurs when a strategy is tweaked and refined until it fits historical data perfectly but has no genuine predictive power. The more parameters you adjust—moving average length, entry thresholds, stop-loss levels—the more opportunities you create for the backtest to find spurious patterns.
Imagine you test 100 different moving average lengths on the same historical data. By chance alone, a handful will outperform the others. If you select the best performer and declare it "optimal," you've likely selected noise, not signal. This is known as the Texas sharpshooter fallacy: draw a target around the bullet holes after firing, and every shot is a bull's-eye. Backtests invite this error because they reward you for retrofitting rules to historical data.
Pardo (2008) and López de Prado (2018) both describe this pattern across quant strategies: strategies that showed strong backtest results frequently failed in live trading because they were overfit. The typical decay in performance was 30–50% when moving from backtest to real trading. A strategy showing 25% annualized returns in a backtest often realized 12–15% in live trading, if it avoided catastrophic failure at all.
The math behind overfitting is unforgiving. If you test N independent hypotheses at significance level α, the probability that at least one appears to work by pure chance is 1 − (1 − α)^N. With 100 parameter combinations and a 5% threshold, that probability is about 99.4%, and you expect 5 false positives. With 1,000 combinations, you expect 50. Backtests almost always involve multiple tests.
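As a rough illustration of the arithmetic above, the sketch below computes the expected number of false positives and the chance of at least one. It assumes the tests are independent, which real parameter sweeps rarely are, and the function names are illustrative only.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected count of tests that pass the threshold by chance alone."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(
            f"{n:>5} tests: expect {expected_false_positives(n):.1f} false positives, "
            f"P(at least one) = {prob_at_least_one_false_positive(n):.4f}"
        )
```

Correlated tests (for example, neighbouring moving average lengths) reduce the effective number of independent trials, so these figures are best read as an order-of-magnitude guide.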
Missing Transaction Costs, Slippage, and Commissions
Backtests typically assume you can buy at the exact closing price and sell at the exact closing price. In reality, you face:
- Slippage: the difference between the price you expected and the price you actually filled. On a liquid stock, this might be 1–2 cents per share; on less liquid assets or in volatile conditions, it can be 10–50 cents or more.
- Commissions: brokerage fees, typically $0–$10 per round-trip trade depending on your broker.
- Bid-ask spread: the gap between the buy and sell prices. A stock quoted at 100.50 bid, 100.51 ask costs 0.01 per share just to cross, and the spread widens during low-volume periods.
- Market impact: on very large orders, your own trading can move the market against you.
A study by Arnott, Beck, Kalesnik, and West (2016) found that transaction costs reduced backtest returns by 5–40% depending on the strategy's turnover and asset class. A high-turnover strategy that backtests at 15% annual returns might deliver 6–8% after costs. A lower-turnover strategy that backtests at 12% might drop to 8–10%.
Most backtesting software allows you to specify commission rates, but few traders use realistic ones. It's easy to assume $1 per trade when your broker charges variable rates tied to order size and venue. It's easy to ignore the bid-ask spread on thinly traded ETFs or options. Backtests can be tuned to accommodate these costs, but only if you measure them honestly.
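To make the cost drag concrete, here is a minimal first-order sketch. It assumes every round trip turns over the full account and that costs simply subtract from gross return; the trade counts and per-trade costs are hypothetical inputs, not measured figures.

```python
def net_annual_return(
    gross_return: float,
    round_trips_per_year: int,
    cost_per_round_trip_bps: float,
) -> float:
    """Gross return minus a linear estimate of annual trading costs.

    cost_per_round_trip_bps bundles commission, spread, and slippage,
    expressed in basis points of the traded notional (an assumption).
    """
    annual_cost = round_trips_per_year * cost_per_round_trip_bps / 10_000
    return gross_return - annual_cost


if __name__ == "__main__":
    # Hypothetical high-turnover strategy: 200 round trips at 4 bps each.
    print(f"high turnover: {net_annual_return(0.15, 200, 4):.2%}")
    # Hypothetical low-turnover strategy: 12 round trips at 20 bps each.
    print(f"low turnover:  {net_annual_return(0.12, 12, 20):.2%}")
```

The point of the sketch is sensitivity: small per-trade costs multiplied by a large number of trades can remove most of a backtested return.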
Liquidity and Gap Risk
A backtest assumes you can always execute at the quoted price and at the size you need. This is rarely true. Many strategies fail because they ignore liquidity.
If your strategy trades 100,000 shares of a micro-cap stock, your order might be too large to fill at the posted bid-ask spread. The market impact could move the stock against you by 0.5–2%, erasing a week's profits on a single trade. Backtests do not typically account for this.
Gap risk, the risk that a security gaps overnight or during a halt, is also invisible in many backtests. If you hold a position overnight with a stop-loss just below yesterday's close and a negative news event causes a 10% gap down at the open, a naive backtest may assume you exited at the stop price. In reality, the stop fills at or near the gapped open, and you take roughly a 10% loss instead of the small loss the backtest recorded.
The February 2018 volatility spike provided a memorable example: inverse VIX-linked products such as XIV and SVXY lost more than 80% of their value in a single session. Backtests of strategies that held these products overnight would show far smaller losses than traders actually suffered. Similarly, many bonds issued by Lehman Brothers gapped from 95 cents on the dollar to 10 cents during the 2008 crisis, a gap a backtest could not have predicted.
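One way a backtest can model this is to fill a stop at the opening price whenever the bar opens beyond the stop level. The sketch below does this for a long position using daily open and low prices; the function name is illustrative, and it ignores any further slippage on top of the gap.

```python
from typing import Optional


def long_stop_fill(stop_price: float, open_price: float, low_price: float) -> Optional[float]:
    """Return the exit price for a long position's stop on one daily bar.

    Returns None if the stop was not triggered during the bar.
    """
    if open_price <= stop_price:
        # The market gapped through the stop: the best available fill is the open.
        return open_price
    if low_price <= stop_price:
        # Price traded down through the stop intraday: assume a fill at the stop.
        return stop_price
    return None


if __name__ == "__main__":
    # Stop at 95 after a close of 100; the next day opens at 90 (a 10% gap down).
    print(long_stop_fill(stop_price=95.0, open_price=90.0, low_price=88.0))  # 90.0
```

A backtest that always returns the stop price in the first case records a 5% loss where the account would actually have lost 10%.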
The Multiple-Testing Problem
When you run a backtest once, you get one number. But traders typically run many backtests. You test 5 entry rules, 3 exit rules, and 4 risk parameters. That's 5 × 3 × 4 = 60 backtests. If you're looking for any rule that beats a benchmark by a small margin, chance alone makes it very likely you'll find one, even if no true edge exists.
Statisticians call this p-hacking: testing hypotheses until one appears statistically significant by chance. A well-known paper by Ioannidis (2005) argued that in fields with many tested hypotheses, small effects, and flexible analysis, most published positive findings are likely to be false. Backtesting, with its thousands of possible parameter combinations, is a perfect environment for p-hacking.
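A small simulation can show how this plays out. The sketch below enumerates a 5 × 3 × 4 grid like the one described above, then gives every combination a stream of pure-noise daily returns with zero true edge. The volatility, day count, and seed are arbitrary assumptions chosen for illustration.

```python
import itertools
import random
import statistics


def best_of_noise(n_strategies: int, n_days: int = 252, daily_vol: float = 0.01, seed: int = 0):
    """Simulate strategies with no edge; return (best mean, average mean) daily return."""
    rng = random.Random(seed)
    mean_returns = []
    for _ in range(n_strategies):
        daily = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        mean_returns.append(statistics.fmean(daily))
    return max(mean_returns), statistics.fmean(mean_returns)


# Every combination of 5 entry rules, 3 exit rules, and 4 risk settings.
grid = list(itertools.product(range(5), range(3), range(4)))

if __name__ == "__main__":
    best, average = best_of_noise(n_strategies=len(grid))
    print(f"{len(grid)} combinations tested")
    print(f"best mean daily return:    {best:.5f}")
    print(f"average mean daily return: {average:.5f}")
```

The best combination looks like a winner only because it is the maximum of many random draws; the average across all combinations stays close to zero.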
The only defense is to test your strategy on out-of-sample data—data the strategy has never seen. If your strategy was fit on 2010–2017 data, test it on 2018–2024. If it still works, you have some evidence of robustness. If it collapses, you've caught an overfit.
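A minimal sketch of that check: split each candidate's return series chronologically, pick the winner using only the earlier part, then score it on the later part. The candidate names and return figures below are made up for illustration.

```python
from statistics import fmean
from typing import Dict, List, Tuple


def chronological_split(returns: List[float], train_fraction: float = 0.7) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into earlier (in-sample) and later (out-of-sample) parts."""
    cut = round(len(returns) * train_fraction)
    return returns[:cut], returns[cut:]


def select_and_validate(candidates: Dict[str, List[float]], train_fraction: float = 0.7):
    """Choose the best candidate in-sample, then report its out-of-sample mean return."""
    def in_sample_mean(name: str) -> float:
        train, _ = chronological_split(candidates[name], train_fraction)
        return fmean(train)

    best = max(candidates, key=in_sample_mean)
    train, test = chronological_split(candidates[best], train_fraction)
    return best, fmean(train), fmean(test)


if __name__ == "__main__":
    # Hypothetical per-period returns for two parameter settings.
    candidates = {
        "fast_ma": [0.02] * 7 + [-0.01] * 3,
        "slow_ma": [0.005] * 10,
    }
    name, in_sample, out_of_sample = select_and_validate(candidates)
    print(name, round(in_sample, 4), round(out_of_sample, 4))
```

The split is never shuffled: shuffling would leak later data into the fitting period and defeat the purpose of the test.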
Regime Shifts and Structural Changes
Markets are not stationary. The bull market of 2009–2020 looked nothing like the 2000–2002 bear market. Treasury yields, volatility, and correlations between asset classes shift. A backtest assumes that the future will resemble the past—a dangerous assumption.
Consider a pair-trading strategy that exploited the stable correlation between XLE (energy ETF) and crude oil from 2010–2019. That correlation broke down sharply in 2020 when energy collapsed but oil companies diversified. A backtest on pre-2020 data would have shown strong returns; live trading after 2020 would have shown losses.
The Federal Reserve's shift from zero interest rates (2009–2015) to quantitative tightening (2022–2023) created an entirely different market regime. Correlations that held for years inverted. A backtest of a strategy designed during the QE era was nearly useless in the tightening era.
The Walk-Forward Test: A Partial Solution
One method to partially address overfitting is walk-forward testing. Instead of optimizing a strategy on all historical data, you:
- Optimize on the first 1 year of data.
- Test on the next 3 months of new data (out-of-sample).
- Roll forward by 3 months: re-optimize on the most recent 1 year of data, then test on the next 3 months.
- Repeat for the entire historical period.
Walk-forward testing is more realistic because it simulates how a real trader would operate—optimize, trade, re-optimize. But it is computationally expensive and many backtesting platforms do not support it well. Even walk-forward tests can overfit if you re-optimize too frequently or adjust rules based on recent performance.
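The window schedule for a walk-forward test can be sketched as follows. This version assumes a rolling, fixed-length training window measured in months and advances by the length of the test window each step; the function name and the 24-month example are illustrative.

```python
from typing import Iterator, Tuple


def walk_forward_windows(n_periods: int, train_len: int, test_len: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that roll forward through the data."""
    start = 0
    while start + train_len + test_len <= n_periods:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len  # advance by one test window


windows = list(walk_forward_windows(n_periods=24, train_len=12, test_len=3))

if __name__ == "__main__":
    for train, test in windows:
        print(
            f"optimize on months {train.start}-{train.stop - 1}, "
            f"test on months {test.start}-{test.stop - 1}"
        )
```

Each test window uses only parameters fitted on data that precedes it, and the test windows do not overlap, so their results can be chained into a single out-of-sample record.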
Real-world examples
The 2007 Quant Crisis: In August 2007, a wave of quantitative equity strategies failed simultaneously across multiple firms, notably Renaissance Technologies and Goldman Sachs' Global Alpha fund. Strategies that had shown strong backtest performance for years collapsed. The simultaneous failure suggested that many quants had discovered similar patterns in historical data and that these patterns were artifacts of the specific market conditions of 2003–2007, not robust edges. After de-leveraging and rewriting strategies, many firms recovered, but the losses were in the hundreds of millions.
Long-Term Capital Management (1998): LTCM counted Nobel laureates among its partners and relied heavily on models validated against historical data. Those models showed that certain bond spreads were historically mispriced. Backtests suggested they could return 40%+ annually with moderate risk. In 1998, when Russia defaulted on its debt and credit spreads spiked globally, LTCM's "optimal" positions turned into catastrophic losses. The firm lost $4.6 billion in less than four months and required a $3.6 billion recapitalization by a consortium of private banks, organized by the Federal Reserve Bank of New York. The flaw: LTCM's backtests did not include a scenario in which correlation between supposedly uncorrelated assets spiked to 0.95 during a crisis.
Two Sigma's Model Updates (2015–2017): The quant firm Two Sigma published research showing that many machine learning models trained on market data degrade when deployed to live trading. A model that achieved 60% accuracy on historical data often achieved 52–54% accuracy on new data—still profitable, but far less than the backtest suggested. The gap was due to overfitting and subtle differences between historical data quality and live data.
Common mistakes
- Ignoring slippage: Assuming execution at closing prices instead of estimating realistic slippage of 1–5 basis points.
- Using hindsight to set parameters: Choosing a moving average length because it worked best in a specific historical period, not because it is robust across regimes.
- Testing too many parameters without penalizing complexity: More parameters almost always improve in-sample fit, but they tend to reduce out-of-sample performance.
- Not accounting for liquidity: Testing strategies on assets that are too illiquid to actually trade in the size your strategy requires.
- Optimizing on the same period you test: Using all available data to both fit and validate a strategy, making overfitting invisible.
FAQ
Can I ever trust a backtest?
A backtest is useful as one piece of evidence, not as proof. If your strategy fails a backtest, it is almost certainly a bad strategy. If it passes a backtest, you've filtered out the worst rules, but you haven't proven the strategy will work in live trading. A passing backtest is a necessary condition for a good strategy, not a sufficient one.
How much out-of-sample data do I need?
A common rule of thumb is to hold out at least 20–30% of your total data for testing. If you have 10 years of data, fit on 7 years and test on 3. If your out-of-sample performance is at least 80–90% of in-sample performance, you have some confidence in the strategy. If it drops to 50% or less, the strategy is likely overfit.
Should I backtest my strategy before trading?
Yes, absolutely. Backtesting can screen out strategies that are obviously broken or dependent on errors in your code. But treat the backtest result as an upper bound on returns, not an estimate of future returns. Subtract 5–25% from your backtest return to account for overfitting, costs, and slippage.
Why do professionals still use backtests if they're so flawed?
Backtests are flawed, but the alternative—trading without testing—is worse. A backtest that filters out bad ideas is valuable. The key is to backtest honestly, account for real-world costs, and treat the results with skepticism.
What if my backtest shows the strategy works in multiple different time periods?
That's a good sign. If a strategy was profitable in 2010–2014, 2015–2019, and 2020–2024 (in different market conditions), it has some chance of robustness. But still test it out-of-sample, trade it with smaller position sizes initially, and monitor live results against backtest predictions.
How do I know if my backtest is overfit?
Compare in-sample returns to out-of-sample returns. If in-sample returns are much higher, your strategy is likely overfit. Also test your strategy on data from different asset classes or time periods. A strategy that only works on one stock or one decade is probably overfit.
Should I avoid backtesting altogether?
No. Backtesting is a valuable tool for screening ideas quickly. The mistake is treating a backtest result as a guarantee. Use backtests to identify promising strategies, then validate them with walk-forward testing, out-of-sample testing, and eventually live trading with small position sizes.
Related concepts
- The Honest Evidence on Technical Analysis
- Curve Fitting and Overfitting
- Data Mining Bias
- Look-Ahead Bias
- Why Patterns Look Better in Hindsight
Summary
Backtests are deceptively persuasive because they reduce complex market dynamics to a single number: past return. But that number tells you how a strategy would have performed in a world the strategy's designer could have known was coming. It says nothing about how the strategy will perform when facing an unknown future. The cardinal flaws of backtests—overfitting, missing costs, ignored liquidity, regime shifts—are not bugs in backtesting software; they are features of any analysis that tries to predict the future using the past. A backtest that passes rigorous out-of-sample validation is a better candidate for live trading than one that does not, but even a robust backtest is only a necessary condition, not a sufficient one. The only true test is live money, real slippage, actual costs, and genuine uncertainty.