Thursday 2 September 2021

The three kinds of (over) fitting

This post is something that I've banged on about in many presentations at several conferences* (most complete slides are here), and in various interviews, but never actually formally described in a blog post. In fact this post has existed in draft form since 2015 (!).

* you know, when you leave your house and listen to someone else speaking. Something that in late 2021 is a distant memory, although I will actually be speaking at an event later this year.

So there won't be new information here if you've been following my work closely, but it's still nice to write it down in one place.

(I'm trying to stick to my self-imposed target of one blog post per month, but you will appreciate that I don't always have time for the research involved in producing them - unless it's a by-product of something I'm already working on.)

Trivially, it's about the fitting of trading systems and the different ways you can screw this up:

  • Explicit (over)fitting
  • Implicit (over)fitting
  • Tacit (over)fitting


What is fitting

I find it hard to believe that anyone reading this doesn't already know this, unless you've accidentally landed here after googling some unrelated search term, but let me define my terms.

The act of fitting a trading system can formally be defined as the process of discovering which combination of trading rule and parameter set(s) will produce the optimal trading system when tested on historic data: a combination I call the trading rule variation. The unspoken assumption of all quant finance is that this variation will also be the optimal system to use in the future.

A trading rule is a specific set of instructions which tells you how to trade; for example something like 'Buy if the N day return is negative, otherwise sell'. In this case the parameter set would consist only of a specific value of N.

Optimality can mean many things, but for the purposes of this post let's assume it's maximising Sharpe Ratio (it isn't that important which measure we choose in the context of our discussion here).

So for this particular example fitting could involve considering alternative values of N, and finding the value which had the highest Sharpe Ratio in an historic backtest. Alternatively, it could also involve trying out different rules - for example 'Sell if the N day return is negative, otherwise buy'. But note that these approaches are equivalent; we could parameterize this pair of rules as 'Buy X units if the N day return is negative, otherwise sell X units', where X is either +1 (so we buy on a negative return: the original rule) or -1 (so we sell: the alternative rule). Now we have two parameters, N and X, and our fitting process will try and find the optimal joint parameter values.

Of course there are still numerous rules that we haven't considered here, such as selling if the N hour return is negative, or if the most recent non-farm payroll was greater than N, or if there was a vomiting camel chart pattern on the Nth Wednesday of the month. So when fitting we will do so over a given parameter space, which includes the range of possible values for all our parameters. Here the parameter space will be X = {-1, +1} and N = {1, 2, 3, ...} (assuming we have daily closing data). The product of the possible values of X and N can loosely be thought of as the 'degrees of freedom' of the fitting process.
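To make this toy example concrete, here's a minimal sketch of an automated grid search in Python. The function names are made up for illustration, and nothing here comes from any particular backtesting library:

```python
import numpy as np
import pandas as pd

def rule_returns(prices: pd.Series, N: int, X: int) -> pd.Series:
    # daily p&l of: buy X units if the N day return is negative, else sell X units
    past_return = prices.diff(N)
    position = pd.Series(np.where(past_return < 0, X, -X), index=prices.index)
    # lag the position by one day so we only trade on information we actually had
    return position.shift(1) * prices.diff()

def sharpe_ratio(daily_pandl: pd.Series) -> float:
    # annualised, assuming roughly 256 business days in a year
    return (daily_pandl.mean() / daily_pandl.std()) * np.sqrt(256)

def naive_grid_search(prices: pd.Series, max_N: int = 256):
    # try every variation in the parameter space X = {-1, +1}, N = 1...max_N
    results = {(X, N): sharpe_ratio(rule_returns(prices, N, X).dropna())
               for X in (-1, 1) for N in range(1, max_N + 1)}
    return max(results, key=results.get)  # the *in sample* optimum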

All fitting thus involves choosing a trading strategy from what is inevitably a tiny subset of all possible strategies.

The number of units to buy or sell is another question entirely, which I discuss in this series of posts.

Fitting can be done in an automated fashion, purely manually, or using some combination of the two. For example, we could get some backtesting software and ask it to find the optimal values of X and N. Or we could manually test each possible variation. Or we could run the backtesting software once for X=1 (buy if N day return is negative), and then again for X=-1, each time finding the best value of N. The third option is the most common amongst quant traders.


What is overfitting and why it is bad

Consider the following:

[Figure: Hastie, Tibshirani and Friedman (2009), "The Elements of Statistical Learning", Springer; Figure 2.11 - prediction error versus model complexity, for training and test samples]


How does this relate to the fitting of trading systems? Well, we can think of 'prediction error' as 'Sharpe Ratio on an inverted scale' such that a low value is good. And 'model complexity' is effectively the degrees of freedom of the trading strategy.

What is the graph telling us? Well, first consider the 'training sample' - the set of data we used to do the fitting on - the dirty red line. As we add complexity we will get a better performing trading strategy (in expectation). In fact it's possible to create a trading strategy with zero prediction error, and thus infinite Sharpe Ratio, if the degrees of freedom are sufficiently large (in a hand-waving way, if the complexity of the strategy is equal to the amount of entropy in the data).

How? Well consider a trading strategy which has the form 'Buy X units if it's January', 'Buy X units if it's February'.... If we fit this on past data it's going to do pretty well. Now let's make it even more complex: 'Buy X units if it's January 3rd 2015', 'Buy X units if it's January 4th 2015'.... (where January 3rd 2015 is the first day of our price history). This will perfectly predict every single day in the backtest, and thus have infinite Sharpe Ratio.

(More mathematically, if we fit a sufficiently high degree polynomial to the price data, we can get a perfect fit)
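Here's a sketch of that polynomial point, using numpy and ten entirely made-up closing prices:

```python
import numpy as np

prices = np.array([100.0, 101.5, 99.8, 102.3, 103.1,
                   102.0, 104.7, 105.2, 103.9, 106.4])
days = np.arange(len(prices))

# a polynomial of degree n-1 can pass through all n data points exactly
# (numpy may grumble about conditioning; the fit is still exact)
coeffs = np.polyfit(days, prices, deg=len(prices) - 1)
fitted = np.polyval(coeffs, days)

print(np.allclose(fitted, prices))  # True: zero in sample prediction error
```

Naturally, this tells us nothing whatsoever about day eleven.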

On the out of sample (dirty green) line notice that we always do worse (in expectation) than the red line. That's because we'll never do as well predicting a data set different from the one we trained / fitted our model on. Also notice that the gap between the red and the green line grows as the model gets more complex. The more closely our model fits the backtest period, the less likely it is that it will be able to predict a novel future.

This means that the green line has a minimum error (~maximum Sharpe Ratio) where we have the optimal amount of complexity (~degrees of freedom). Anything to the right of this point is overfitting (also known as curve fitting).

Sadly, we don't get paid based on how well we predict the in sample data. We get paid for predicting out of sample performance: for predicting the future. And this is much harder! And the Sharpe Ratios will be lower! 

At least in theory! In practice, if you're an academic then you get paid for publishing papers with nice results: papers that predict the past. If you're working for a quant hedge fund then you may be getting paid for coming up with nice backtests that also predict the past. And even as humble independent traders, we get a kick out of a nice backtest. So for this reason it's very easy to be drawn towards trying to make the in sample line look as good as possible: which we'll do by making the model more complicated.

Basically: our incentives make us prone to overfitting, and to conflating the red and the green lines.



Explicit fitting


We're now ready to discuss the three kinds of (over)fitting.

The first is explicit fitting. It's what most people think of as fitting. The basic idea being that you get some kind of automated algo to select the best possible set of parameters. This could be very easy: a grid search for example that just tries every possible strategy variation. Or it could be much more complex: some kind of fancy AI technique like a neural network. 

The good news about explicit fitting is that it's possible to do it properly. By which I mean we can:
 
  • Restrict ourselves to fewer degrees of freedom
  • Enforce a realistic separation between in and out of sample data in the backtest (the 'no time machine' rule)
  • Use robust fitting techniques to avoid wandering into the overly complex overfitting end of the figure above.

Of course it's also possible to do explicit fitting badly (and plenty of people do!), but at least it's possible to avoid overfitting if you're careful enough.


Fewer degrees of freedom


Consider a more realistic example of a moving average crossover trading rule (MAC), which can be defined using two parameters A and B: signal = MA_A - MA_B, where MA_x is a moving average with a lookback of x days, and A ≠ B. Note that if A<B then this will be a momentum rule, whereas if A>B it will be a mean reversion rule. We assume that A and B can take any values in the range 1 to 256 (where 256 is roughly the number of business days in a year); anything longer than this would be an 'investment' rather than a 'trading' strategy.
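As a sketch, the rule itself is a couple of lines of Python (with pandas rolling means standing in for whatever flavour of moving average you prefer):

```python
import pandas as pd

def mac_signal(prices: pd.Series, A: int, B: int) -> pd.Series:
    # positive signal means go long; with A < B this is a momentum rule
    return prices.rolling(A).mean() - prices.rolling(B).mean()
```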

If we try and fit all 65,280 possible values of A and B individually for each instrument we trade then we're very likely to overfit. We can reduce our degrees of freedom in various ways:

  • Restrict A<B [so just momentum]
  • Set B = k.A; fit k first, then fit A  [I do this!]
  • Restrict A and B to be in the set {1,2,4,8,16,32, ... 256}  [I do this!]
  • Use the same A, B for all instruments in a given asset class [discussed here]
  • Use the same A,B for all instruments [perhaps after accounting for costs]

Notice that this effectively involves making fitting decisions outside of the explicit fitting... I discuss this some more later. But for now you can note that it's possible to make these kinds of decisions without using real data at all.
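Here's the shrinkage in numbers - a sketch using k=4 for the 'B = k.A' bullet (the same value that appears in the example towards the end of this post):

```python
# full space: every ordered pair with A != B, both in 1..256
full_space = [(A, B) for A in range(1, 257)
                     for B in range(1, 257) if A != B]

# restricted: A on a geometric ladder, with B = 4 * A (k having been fitted first)
ladder = [1, 2, 4, 8, 16, 32, 64]
restricted_space = [(A, 4 * A) for A in ladder]

print(len(full_space), "->", len(restricted_space))  # 65280 -> 7
```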


No time machine


By 'no time machine', I mean that a parameter set should only be tested on a period of data if it was fitted using only data that was available before that testing period.

So for example, if we fit from 2000-2020 and then test on the same period, we're cheating - we couldn't have done this without a time machine. If we fit from 2000-2010 and then test from 2011-2020, that's okay. But if we then do a classic ML technique (cross validation) and subsequently fit on 2011-2020 to test on 2000-2010, then we've cheated.

There are two honest options:

  • An expanding window; first we fit using data for 2000 (assuming a year gives us enough data to fit with; if we're doing a robust fit that would be fine) and test that model in the year 2001; then we fit using 2000 and 2001, and test that second model in 2002..... then we fit using 2000 - 2019, and then test in the year 2020.
  • A rolling window. Say we want to use a maximum of 10 years to fit our data, then we would proceed initially as for an expanding window until we get to .... we fit using 2000 - 2009 and test in the year 2010, then we fit using 2001 - 2010 and test in the year 2011.... then finally we fit using 2010-2019 and then test in the year 2020. 
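Both schemes are easy to express as a sketch in Python, as generators yielding (years to fit on, year to test on) pairs:

```python
def expanding_windows(years):
    # fit on everything before year t, then test on year t
    for i in range(1, len(years)):
        yield years[:i], years[i]

def rolling_windows(years, max_years: int = 10):
    # as above, but never fit on more than max_years of history
    for i in range(1, len(years)):
        yield years[max(0, i - max_years):i], years[i]

years = list(range(2000, 2021))
for fit_years, test_year in expanding_windows(years):
    ...  # fit the model on fit_years; record out of sample p&l for test_year
```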

In practice the choice between expanding and rolling windows is a tension between using as much data as possible (to reduce the chances that we overfit to a small sample), and the fact that markets change over time. A medium speed trend follower that needs decades' worth of data to fit will probably want to use an expanding window: they are exploiting market effects that are relatively low Sharpe Ratio (high entropy in the data) but will also hopefully not go away. An HFT shop will want to use a rolling window, with a duration of the order of a few months: they are looking for high SR effects that will quickly be degraded once the competition finds out about them.


A robust fitting technique 


A robust fitting technique is one which accounts for the amount of entropy in the data; basically it will not overreach itself based on limited evidence that one parameter set is better than another.

Consider for example the following:

A and B are the parameters for a MAC model trading Eurodollar futures. The best possible combination sits neatly in the centre of this plot: A=10, B=20 (a trend following model of medium speed). The Z-axis compares this optimum with all other values shown in the plot; a high value (yellow) indicates the optimum is significantly better than the relevant point.

I have removed all values below 2.0, which roughly corresponds to statistical significance. The large white area covers all possible values of A and B that can't be distinguished from the optimum. Even though we have over 30 years of data here, there is enough entropy that we can only rule out all the mean reversion systems (top triangle of the plot), and the faster momentum models (wedge at top left).

Contrast this with the picture for Mexican Peso:


Here I only have a few years of data. There is almost no evidence to suggest that the optimum parameter set (which lies at the bottom right of the plot) is any better than almost any other set of parameters. 
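One simple way of constructing this sort of comparison - a sketch, and not necessarily the exact statistic behind the plots above - is a paired t-test on the daily returns of the optimum versus each candidate, which automatically accounts for the high correlation between two variations of the same rule:

```python
import numpy as np
from scipy.stats import ttest_rel

def t_stat_vs_optimum(optimum_returns: np.ndarray,
                      candidate_returns: np.ndarray) -> float:
    # paired test on matched daily returns; because the two streams are
    # highly correlated, their difference is measured quite precisely
    t_stat, _p_value = ttest_rel(optimum_returns, candidate_returns)
    return t_stat
```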

A simple example of robust fitting is the method I use myself: I construct a number of different parameter variations and then allocate weights to them. 

This is now a portfolio optimisation problem, a domain where there are plenty of available techniques for robust fitting (my favourite is discussed at length, in the posts that begin here). We can do this in a purely backward looking fashion (not breaking the 'no time machine' rule). A robust fitting technique will allocate equally to all considered variations when there is too much entropy and insufficient evidence (in the form of heterogeneous correlation matrices, differing cost levels, or differing pre-cost Sharpe Ratios) that any variation is worth allocating more to.

But when there is compelling evidence available it will tilt its allocation towards more diversifying, cheaper, and higher performing rule variations. It is usually a tilt rather than a wholesale reallocation, since there is rarely enough information to prove that one trading rule variation is better than all the others.
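To give a flavour of how that behaviour can arise, here's a sketch of one well known robust approach - bootstrapping the optimisation (which is not necessarily the exact method in the posts linked above). Averaging weights over many noisy resamples pulls the answer towards equal weights, unless the evidence for a tilt is strong and stable:

```python
import numpy as np

def bootstrap_weights(returns: np.ndarray, n_draws: int = 100,
                      seed: int = 0) -> np.ndarray:
    # returns: T x N matrix of daily returns, one column per rule variation
    rng = np.random.default_rng(seed)
    T, N = returns.shape
    weights = np.zeros(N)
    for _ in range(n_draws):
        sample = returns[rng.integers(0, T, size=T)]
        mu = sample.mean(axis=0)
        sigma = np.cov(sample, rowvar=False)
        # naive mean-variance weights, floored at zero and normalised
        w = np.maximum(np.linalg.solve(sigma, mu), 0.0)
        weights += w / w.sum() if w.sum() > 0 else np.ones(N) / N
    # the average across noisy resamples shrinks towards equal weights
    return weights / n_draws
```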



Implicit fitting


We can now think about the second form of fitting: implicit fitting. Implicit fitting occurs when you make any decision having seen the results of testing with both in and out of sample data.

Implicit fitting comes in degrees of badness. From worst to least bad, examples of implicit fitting could include:

  • Run a few different backtests with different parameter values. Pick the one you like the best. Basically this is explicit in sample fitting, done manually. As an example, consider what I wrote earlier:  "Or we could run the backtesting software once for X=1 (buy if N day return is negative), and then again for X=-1, each time finding the best value of N." This is implicit fitting.
  • Run an explicitly fitted backtest, then modify the parameter space (e.g. restricting A<50) before running it again
  • Run a proper backtest, then modify the trading rule in some way before running it again (again, with explicit fitting, so you can pat yourself on the back). If this improves things, keep the modified rule.
  • Run a series of backtests, changing the fitting hyper parameters until you get a result you like. Examples of hyper parameters include expanding window lookbacks, shrinkage on robust Bayesian fitting, deciding whether to fit on a per instrument or per asset basis, and all kinds of wonderful things if you're doing fancy AI.
  • Run a series of backtests, changing some 'non core' parameters until you get a result you like. Examples include the volatility estimation lookback on your risk scaling, or the buffer window.
  • Run a single backtest to try out an idea. The idea doesn't work, so you forget about it completely.

You can probably see why these are all 'cheating': we're basically making use of a time machine that we wouldn't have had in real life. So for the last example, what we really ought to do is have a 'fund level' backtest in which every single idea we've ever considered is stored, and gets a risk allocation at the start of our testing period (which is then modified as the backtest fitting learns more about the historic performance of the model). Poor ideas will not appear in our 'live' model (assuming there is sufficient evidence by the end of the backtest to weed them out), but it will mean that our historic 'fund level' account curve won't be inflated by only ever having good ideas within it.
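Here's a sketch of that discipline, with entirely made-up idea names and random return series, reusing the bootstrap_weights sketch from earlier:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2500  # roughly ten years of daily returns

# every idea we ever tested goes into the registry, including the duds
idea_registry = {
    "momentum_mac": rng.normal(0.0004, 0.01, T),
    "fast_mean_reversion": rng.normal(0.0001, 0.01, T),
    "vomiting_camel": rng.normal(-0.0002, 0.01, T),  # didn't work; stays in anyway
}

# the fund level backtest allocates across all of them, out of sample;
# bad ideas drift towards (near) zero weight instead of being quietly deleted
all_returns = np.column_stack(list(idea_registry.values()))
weights = bootstrap_weights(all_returns)
```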

Other ways to deal with this also rely on knowing how many backtests you have run for a given idea; they include correcting your significance level for the number of trials you have done (which I don't like, since it treats a major case of parameter cheating the same as a tiny hyper parameter tweak), and testing on multiple paths to catch especially egregious overfitting (something like CPCV, combinatorial purged cross validation).
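As an illustration of the trials correction, here's the arithmetic for the Šidák version (a sketch of the idea, not an endorsement):

```python
# with n independent trials, the per-trial significance level needed to keep
# a 5% family-wise chance of at least one false positive
n_trials = 50
alpha_overall = 0.05
alpha_per_trial = 1 - (1 - alpha_overall) ** (1 / n_trials)
print(f"{alpha_per_trial:.5f}")  # ~0.00103: a much stricter hurdle per backtest
```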

But ultimately, you should know when you are doing implicit fitting. Try not to do it! As much as possible, if something needs fitting (and most things don't) fit in a proper explicit robust out of sample fashion. 



Tacit fitting


Barbara is a quant trader. She's read all about explicit and implicit fitting. She decides to fit a MAC model to capture momentum. First she restricts the parameter space using artificial data (as I discuss here):

  • Restrict A<B [so just momentum]
  • Set B = 4A [using artificial data]
  • Restrict A to be in the set {1,2,4,8,16,32,64}  [using artificial data]
  • Drop values of A that are too expensive for a given instrument [using artificial data]

Then she fits a series of risk weights using a robust out of sample expanding window with real data, pooling data across all instruments. Barbara is pleased with her results and goes ahead to trade the strategy.

The question is this: has Barbara used a time machine? Surely not!

In fact she has. Consider the first decision that she made:

  • Restrict A<B [so just momentum]

Could Barbara have made this decision without a time machine? Had she really been at the start of her backtest data (which we'll assume goes back to the beginning of financial market data; for the sake of argument let's say that's 1900), would she have known that momentum is more likely to be profitable than mean reversion (at least for the sort of assets and time scales that I tend to focus on, as does Barbara)? Strictly speaking the answer is no. Barbara only knows that momentum is better because of one or more pieces of tacit knowledge. Most likely:

  • She's done this backtest before  (perhaps at another shop where they were less strict about overfitting)
  • And/ or her boss has done this backtest before, and told her to fit a momentum model
  • And/ or she saw a conference presentation where someone said that momentum works 
  • ... She read a classic academic paper on the subject
  • ... Her Uber driver to the airport was an ex pit trader who favoured momentum
  • She is one of my students
  • She's read all of my books

None of this information would have been available to Barbara in 1900. By restricting A<B she's massively inflating her backtested performance over what would really have been possible had the backtest software realistically discovered over time that momentum was better. It's also possible that she will miss out on some profitable trading strategies just because she isn't looking for them (for example, some mean reverting models with A>B seem to be profitable for small A).

Solving the problem of tacit fitting is very hard. Here are some possible ideas:

  • Widen the parameter space and fit in the wider space (so don't restrict A<B in this simple example). Of course that will result in more degrees of freedom, so you will need to be far more careful about using a robust fitting technique.
  • Use some kind of fancy neural network or similar to fit a highly general model. Even with modern computational power it is unrealistic to fit a model that would be sufficiently general to avoid any possibility of tacit fitting (for example, if you only feed such a model daily price data, then you've arguably made a tacit decision that daily prices can predict future returns).
  • Hire people who know nothing about finance (and once they've learned, kill or brainwash them. You can't just fire them - they'll tell people your secrets!). This is surprisingly common amongst top quant funds (the hiring of ignorant people, not the killing and brainwashing).


And finally....




And if you want to get fancy, read this book.

Now go away, and overfit no more.

9 comments:

  1. Hi Rob,
    This is the best thing i've read all month!
    I really like how you codified the "cardinal sins" of strategy development and backtesting.
    Thanks for this.

    1. Well it's only the 3rd September, so that's maybe not an especially high bar. But thanks for your kind comment anyway!

  2. just fyi you have a small typo -> "lokoing"
    and thanks for your monthly blogposts, always a pleasure to read!

  3. Dear Rob, thanks for this insightful article. Love it!
    I think most people are not taking overfitting seriously enough.

    If I want to learn more about this topic, which one of your books can you recommend? I think you should write a whole book about Overfitting in Backtesting. :-)

    1. "Systematic Trading" has some stuff, but yes it could fill an entire book. I've only just started another book, so you will have to wait a few years before the overfitting book comes out :-)

    2. Genuinely hope the new book is a rom-com charting the love life of an overzealous backfitter.

  4. I found this pretty insightful, and has given plenty of food for thought. Thanks for writing this and sharing. Bought your book on the back of this.

  5. If you ever do a fourth book on trading systems, would love to see further discussion on the intricacies of curve fitting. (personally read all three of your books:))

    Is there an online video that goes with the attached slides? https://drive.google.com/file/d/1_V598h9Y8ldL04VksHA01oWyztjlMLjh/view

    Thank you.

    1. Ah I've started book four! It's not about curve fitting. That will probably be book five!

      There isn't a video of that talk. The closest thing would be a youtube interview on the same topic https://youtu.be/vGhzvJbJzEc

