FAQs on Backtest Overfitting

Our recent papers [1,2] on backtest overfitting have attracted significant interest, including several press releases [American Mathematical Society, Science Daily, University of Newcastle] and news articles [Financial Times, Wall Street Journal, Bloomberg, Barron's, Pacific Standard, Morningstar, Seeking Alpha]. The feedback so far has been encouraging, and numerous colleagues have approached us with interesting questions and requests for clarification. This blog lists and responds to a number of these items.

1. Why do so many quantitative investments fail? In the 21st century, we are surrounded by math and algorithms designed to separate the signal from the noise. Why is it that the same science and math that allow us to fly safely through unpredictable atmospheric conditions seem to fail us in the comparatively simpler task of investing?

The reality is, some of the most successful investment funds in history apply rigorous mathematical models (RenTec, TGS, DE Shaw, Two Sigma, PEAK6, Hutchin Hill, Winton, Citadel, Teza Tech, Virtu, Tower Research, Jump Trading, Tradelink, Chopper, Sun Trading, … just to cite a few). Many of them are closed to outside investors, and the public rarely hears about them. This void is often filled by pseudo-mathematical investments, which apply mathematical tools improperly as a marketing strategy. One of the most widely misunderstood experimental techniques is historical simulation, or backtesting.

2. Is it true that every backtest is intrinsically flawed? Not at all. Backtesting is a powerful and necessary research tool. The purpose of our research is to highlight how easily backtest results can be manipulated, and to educate investors about some minimum features that should be demanded of those studies.

3. How can I determine if a backtest is flawed? Backtests can be manipulated in many ways. The most important piece of information missing from virtually all backtest reports is the number of trials attempted. Without this information, it is impossible to determine the probability of backtest overfitting. Here is an intuitive and visual explanation of why it is essential to report the number of trials carried out.
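To see why the number of trials matters, here is a minimal, hypothetical Python sketch (ours, not from the papers; all names and parameters are our own): among N strategies with zero true skill, the best backtested Sharpe ratio grows steadily with N, so a single impressive backtest number is meaningless without knowing how many trials produced it.

```python
import random
import statistics

def sharpe(returns):
    """Annualized Sharpe ratio of daily returns (risk-free rate assumed 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * 252 ** 0.5

def best_backtest_sharpe(n_trials, n_days=252, seed=42):
    """Run n_trials backtests of pure noise and keep the best Sharpe found."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_trials):
        daily = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, sharpe(daily))
    return best

# The more trials, the better the best (still skill-free) backtest looks.
for n in (1, 10, 100, 1000):
    print(n, round(best_backtest_sharpe(n), 2))
```

With a thousand trials of pure noise, the best annualized Sharpe ratio typically exceeds 3, which is why a backtest report that omits the number of trials cannot be evaluated.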

4. Can the “hold-out” method prevent overfitting? This is perhaps the most frequently asked question. Unfortunately, the “hold-out” method (i.e., reserving a testing set to validate the model discovered in the training set) cannot prevent overfitting. There are multiple reasons, listed in our Notices of the AMS paper. Perhaps the most important reason for hold-out’s failure is that this method does not control for the number of trials attempted. If we apply the hold-out method enough times (say 20 times for a 95% confidence level), we can expect a false negative (i.e., the test fails to discard an overfit strategy). In contrast, the PBO (probability of backtest overfitting) method is one of several that take into account the number of trials attempted and penalize backtested performance accordingly.
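A small, hypothetical simulation of the point above (ours, not from the paper): each zero-skill strategy passes a one-sided 95% hold-out test only about 5% of the time, but with 20 trials the chance that at least one passes is roughly 1 − 0.95²⁰ ≈ 64%.

```python
import random

def passes_holdout(rng, n_days=100, threshold=1.645):
    """One zero-skill strategy: it 'passes' if its hold-out t-statistic clears
    the one-sided 95% threshold. True skill is zero by construction."""
    test = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    mu = sum(test) / n_days
    sd = (sum((x - mu) ** 2 for x in test) / (n_days - 1)) ** 0.5
    return mu / (sd / n_days ** 0.5) > threshold

def prob_any_pass(n_trials, n_experiments=500, seed=7):
    """Fraction of experiments in which at least one of n_trials
    zero-skill strategies passes the hold-out test."""
    rng = random.Random(seed)
    return sum(any(passes_holdout(rng) for _ in range(n_trials))
               for _ in range(n_experiments)) / n_experiments

print(prob_any_pass(1))    # roughly 0.05: a single trial rarely passes by luck
print(prob_any_pass(20))   # roughly 1 - 0.95**20, i.e. about 0.64
```

The hold-out test works as advertised for a single trial; it is the unreported repetition that defeats it.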

5. I do not understand the meaning of backtest overfitting. If a strategy worked in the past, why shouldn’t it work in the future? Could you please provide a simple example? Any random sample extracted from a population incorporates patterns. For example, after tossing a fair coin ten times we could obtain by chance a sequence such as {+,+,+,+,+,-,-,-,-,-}, where “+” means heads and “-” means tails. A researcher could conclude that the best strategy for betting on the outcomes of this coin is to expect “+” on the first five tosses and “-” on the last five tosses (a typical “seasonal” argument in the investment community). When we toss that coin ten more times, we may obtain a sequence such as {-,-,+,-,+,+,-,-,+,-}, where we win 5 times and lose 5 times. That researcher’s betting rule was overfit, because it was designed to profit from a random pattern observed in the past. The rule has absolutely no predictive power over the future, regardless of how well it appears to have worked in the past.
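The coin-toss example can be sketched in a few lines of Python (a hypothetical illustration; the function names are ours): a rule that memorizes the observed sequence is perfect in sample, yet wins only about half the time on fresh tosses.

```python
import random

def fit_seasonal_rule(sample):
    """'Learn' a betting rule by memorizing each toss's past outcome,
    so the in-sample hit rate is 100% by construction."""
    return list(sample)

def hit_rate(rule, outcomes):
    return sum(r == o for r, o in zip(rule, outcomes)) / len(outcomes)

rng = random.Random(0)
past = [rng.choice("+-") for _ in range(10)]   # the observed sequence
rule = fit_seasonal_rule(past)

print(hit_rate(rule, past))                    # 1.0: perfect in sample

# Over many fresh sequences of the same fair coin, the rule wins ~50%.
future = [hit_rate(rule, [rng.choice("+-") for _ in range(10)])
          for _ in range(10_000)]
print(round(sum(future) / len(future), 3))     # close to 0.5
```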

6. When is a backtest not overfit? While many historical patterns are random, some patterns correspond to a recurring natural phenomenon. For example, suppose a biased coin that comes up heads in 60% of tosses. Given a sufficiently long sample, we may be able to estimate the heads-to-tails ratio and, as a result, bet consistently that heads will outnumber tails. That betting strategy will not be overfit. The problem is, such natural patterns are relatively weak and, in the case of the financial markets, they tend to weaken further over time as traders exploit them. So although fitting a backtest is certainly possible, overfitting it is all too easy (and tempting, as a marketing pitch to uneducated investors).
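A minimal sketch of the biased-coin case (hypothetical code, ours): the bias estimated in sample carries over to fresh tosses, so the strategy is fit rather than overfit.

```python
import random

rng = random.Random(1)
P_HEADS = 0.60   # the coin's true, persistent bias

def toss(n):
    return ["H" if rng.random() < P_HEADS else "T" for _ in range(n)]

# Estimate the heads ratio from a sufficiently long sample...
train = toss(10_000)
p_hat = train.count("H") / len(train)

# ...then bet heads on every future toss: the edge persists out of sample.
test = toss(10_000)
win_rate = test.count("H") / len(test)
print(round(p_hat, 3), round(win_rate, 3))   # both close to 0.60
```

Unlike the seasonal rule fit to a fair coin, this edge survives out of sample because it reflects a real property of the coin, not a transient pattern in one sample.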

7. What are memory effects in the context of random processes? An important distinction we make in our research is between financial series with and without memory.

  • No memory: A coin, whether fair or biased, does not have memory. The 50% heads ratio does not arise as a result of the coin “remembering” previous tosses. Patterns emerge, and they are “diluted” as additional sequences of tosses are produced, each incorporating patterns that eventually cancel each other.
  • Memory: Now suppose that we add a memory chip to that coin, such that it somehow remembers the previous tosses. This memory actively “undoes” recent historical patterns, such that the 50% heads ratio is quickly recovered. Just as a spring memorizes its equilibrium position, financial variables that have acquired a high tension will return to equilibrium violently, undoing previous patterns.

8. Why does overfitting, combined with memory effects, lead to losses? The difference between “diluting” and “undoing” a historical pattern is enormous. Diluting does not contradict your bet, but undoing generates outcomes that systematically go against your bet!

Unfortunately, backtest overfitting tends to identify the trading rules that would have profited from the most extreme random patterns in sample. But when the underlying variables have memory, future realizations must undo those same patterns! In other words, backtest overfitting in the presence of memory effects leads to loss maximization.
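The “diluting” versus “undoing” distinction can be illustrated with a hypothetical sketch (ours; an AR(1) series is used as a stand-in for a variable with memory): the same pattern-following rule roughly breaks even on a memoryless series, but loses systematically when memory undoes recent moves.

```python
import random

def ar1_returns(n, phi, rng):
    """AR(1) series r_t = phi * r_{t-1} + noise.
    phi = 0: no memory; phi < 0: memory that 'undoes' recent patterns."""
    out, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        out.append(prev)
    return out

def streak_rule_pnl(r):
    """Pattern-following rule: after an up move, bet the move continues."""
    return sum((1 if r[t - 1] > 0 else -1) * r[t] for t in range(1, len(r)))

rng = random.Random(3)
avg = lambda xs: sum(xs) / len(xs)
no_memory = avg([streak_rule_pnl(ar1_returns(1000, 0.0, rng)) for _ in range(200)])
memory    = avg([streak_rule_pnl(ar1_returns(1000, -0.4, rng)) for _ in range(200)])
print(round(no_memory, 1))   # pattern merely diluted: average P&L near zero
print(round(memory, 1))      # pattern actively undone: systematically negative
```

On the memoryless series the overfit rule's mistakes wash out; on the series with memory, every bet on pattern continuation runs into the reversal, so the losses compound.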

9. Are you saying that Technical Analysis is a form of charlatanism? No. Technical analysis tools do rely on a variety of filters that make them prone to overfitting, so we are simply stating that technical analysts and their investors should be particularly aware of those risks. When the probability of backtest overfitting is correctly monitored, technical analysis may provide valuable insights to investors.

10. What can be done to avoid this fraud? The most important advice is to understand the concepts mentioned earlier. Ask your researcher how he has avoided overfitting, how many trials were attempted, for detailed logs of every experiment run, for the theoretical model that would justify the uncovered behavior, etc. You may want to avoid researchers unfamiliar with these notions, or who cannot give you straight answers regarding their past research activity or scientific output. When confronted, a typical charlatan’s response is “trust me, I have plenty of experience, and I know this works”. Appeals to authority are a common form of logical fallacy. No mathematical proof relies on them; hence the motto of the Royal Society: “Nullius in Verba”.

The reality is, overfitting is pervasive among finance practitioners and academics, whether as an unintended consequence of sloppiness or as outright malpractice. While pharmaceutical companies must abide by clear research procedures, there is no equivalent to the FDA in the financial industry. Quantitative funds are not required to obtain an approval or an ISO-9000 certification. No certification can ensure success, but it could at least prevent basic flaws in investment proposals, such as overfitting.

11. But isn’t financial modeling more complex than quantum physics? We disagree with those who believe that “financial modelling is too complex and that no procedures could be devised to guarantee quality.” Pharma research, genomics, astrophysics, quantum computing and weather forecasting are just a few scientific fields that model incredibly complex random systems, where research protocols have been successfully implemented.

While it is true that some of the brightest scientists are attracted to work in finance, this industry often rewards working in isolation rather than cooperatively. Unless investors demand it, those scientists may not have the incentive to abide by the same research protocols they were asked to comply with in a laboratory. Without those protocols, quality suffers and performance disappoints.
