Tuesday, 23 June 2026

Breaking Badly: finding the structural breaks in parameter estimates

 Here's a nice picture from a lovely book written by a top bloke:

It shows the cumulative p&l from different speeds of momentum over time (for portfolios containing 102 instruments) over 50 years of data. Notice how the two fastest speeds (2&4) get worse in the second half of the sample. I've called the line #2 here the 'second most famous hockey stick graph in history'. It certainly looks like something changed in 1990. 

This is important. If we're optimising portfolios of such things we only want to consider data that is relevant, but we also want as much data as possible for statistical significance. Now if I were a simpleton I'd do this by looking at graphs like that and going 'aha i only need to use data after 1990'. As a simpleton I don't use capital letters. But I am a big fan of not doing in sample fitting, even of meta parameters like this; and I am an even bigger fan of doing things automatically which means not wading through thousands of graphs like that (since there are thousands of SR estimates in my forecast p&l space, plus a good chunk of correlations).

So we need an automatic way of identifying such breaks. Fortunately this is not a new problem, as you will know if, like me, you did undergraduate econometrics. Finding structural breaks is an entire industry. We need two things: a test for how likely it is that a break has occurred between two sub-samples A and B, and an algorithm for going through all the options of A and B.

And in case you haven't realised this is the seventh post in my summer 2026 series on portfolio optimisation.


What parameters

The first question to think about is what parameters we're going to apply this process to. I do two kinds of optimisation:

  • Forecast weights
  • Instrument weights
And in both cases I have estimates of SR (one per asset) and correlations ([N^2-N]/2 for N assets). I haven't really looked at instrument weight optimisation yet in this series, and there are some wrinkles there so I'm going to park that for now. That just leaves the SR for a forecast (which remember is a pairing of a trading rule and an instrument), and the correlation of such forecasts within an instrument.
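
To put a number on that, here is the correlation count as a one-liner. The function name is just for illustration; the formula is the one quoted above.

```python
def number_of_correlations(n_assets: int) -> int:
    # Distinct off-diagonal pairs in an N x N correlation matrix: (N^2 - N) / 2
    return (n_assets ** 2 - n_assets) // 2


# With N=9 forecasting rules there are 9 SR estimates but 36 correlations
pairs_for_nine_rules = number_of_correlations(9)
```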

Now I am going to ignore correlations in this post. As I discussed in an earlier post, although correlations are relatively unstable in the short term, they are unlikely to have secular trends like SR. And it's quite easy to deal with this by just using a relatively long lookback to estimate them, probably with an ewma on the correlation estimate.
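
As a rough sketch of what an exponentially weighted correlation estimate might look like, here is a pandas version for a pair of forecast return series. The 10 year span, the one year minimum and the function name are assumptions for illustration, not values taken from my own system, and the data is synthetic.

```python
import numpy as np
import pandas as pd

BUS_DAYS_IN_YEAR = 256


def ewma_correlation(returns_a: pd.Series, returns_b: pd.Series, span_years: int = 10) -> pd.Series:
    # Exponentially weighted correlation with a long span (assumed: 10 years)
    span_days = span_years * BUS_DAYS_IN_YEAR
    return returns_a.ewm(span=span_days, min_periods=BUS_DAYS_IN_YEAR).corr(returns_b)


# Synthetic example: two series sharing a common component (true correlation 0.5)
rng = np.random.default_rng(0)
common = rng.normal(size=5000)
series_a = pd.Series(common + rng.normal(size=5000))
series_b = pd.Series(common + rng.normal(size=5000))

latest_correlation = ewma_correlation(series_a, series_b).iloc[-1]
```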


What test

This is quite an easy one, compared to the world of econometrics and linear regression where we have to do such nonsense as a Chow test. Given two sub-samples A and B, to find out if they have different sample means we just do an independent t-test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

An important characteristic of such tests is that they are only likely to be significant if there is sufficient data. If sample A or sample B is too small, it's unlikely we'll find a significant difference in the means.
Typical critical values in t-tests are 1% and 5%; I will also check out 10% later.
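
To illustrate the sample size point, here is a small sketch using summary statistics rather than real returns. It assumes both samples have been normalised to a daily standard deviation of one, and that the true gap in annualised SR is 1.0 (a big gap, picked purely for illustration). The function name and the numbers are hypothetical.

```python
from scipy.stats import ttest_ind_from_stats

BUS_DAYS_IN_YEAR = 256


def pvalue_for_sr_gap(sr_gap: float, years_a: int, years_b: int) -> float:
    # Convert an annualised SR gap into a gap in daily means (daily std is 1)
    daily_gap = sr_gap / BUS_DAYS_IN_YEAR ** 0.5
    result = ttest_ind_from_stats(
        mean1=daily_gap, std1=1.0, nobs1=years_a * BUS_DAYS_IN_YEAR,
        mean2=0.0, std2=1.0, nobs2=years_b * BUS_DAYS_IN_YEAR,
    )
    return result.pvalue


# Same SR gap, same 50 years of data, different split points
p_one_year_vs_49 = pvalue_for_sr_gap(1.0, years_a=1, years_b=49)
p_25_years_vs_25 = pvalue_for_sr_gap(1.0, years_a=25, years_b=25)
```

With a one year sample on one side the gap is nowhere near significant, whilst the balanced 25/25 split comfortably clears a 1% critical value.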

Since all the p&l streams in a risk targeted trading system will have the same expected standard deviation in the long run, I can legitimately treat this as a test to see if the SR are different, as long as I adjust the samples so they have an identical standard deviation.

(If any ex-students of mine are reading this, you should remember this from week 3 of the course!)

What algo / procedure

OK let's make this concrete. Suppose we have 50 years of data as in the original graph above, and we're currently wondering if there were one or more structural breaks. We don't want to have fewer than five years of data to do our estimation. Note these numbers are appropriate for my context, but not in a domain with faster trading, higher SR, and faster alpha decay. We proceed as follows:
  • We compare year 1 with years 2 to 50. Since year 1 is very small it's unlikely we'll find a break; but if we do then we do a split (see below).
  • If no split has occurred, we compare years 1-2 with years 3-50. Again if we find a break, we split.
  • If no split has occurred, we compare years 1-3 with years 4-50...
  • ...
  • If no split has occurred, we compare years 1-45 with years 46-50. If we still don't find a break, then we use the entire period for estimation (since the period 47-50 will only have four years, we terminate here).
Now what if a split occurs at some point? Then we restart the process, but this time without the pre-split data. Suppose for example a split had occurred in year 20 (which is 1992 in the original graph). Then:
  • We compare year 21 with years 22 to 50. If we find a break, then we split again.
  • If no split has occurred, we compare years 21-22 with years 23-50. Again if we find a break, we split.
  • If no split has occurred, we compare years 21-23 with years 24-50...
  • ...
  • If no split has occurred, we compare years 21-45 with years 46-50. If we still don't find a break, then we use the entire period from years 21-50 for estimation.
You get the idea. This procedure is quite quick and easy to run; and notice we're identifying any break that exceeds a certain threshold rather than finding the most likely break as we would do with a QLR type test. 
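
The schedule of comparisons described above can be written down without any returns data at all. This is only a sketch of the ordering of candidate splits; the function name is hypothetical and years are numbered as in the bullet points.

```python
def candidate_splits(first_year: int, last_year: int, min_years: int = 5):
    # Each candidate is ((start, end) of sample A, (start, end) of sample B),
    # in the order they would be tested; stop once B has under min_years left
    splits = []
    for end_of_first in range(first_year, last_year):
        years_in_second = last_year - end_of_first
        if years_in_second < min_years:
            break
        splits.append(((first_year, end_of_first), (end_of_first + 1, last_year)))
    return splits


full_history = candidate_splits(1, 50)
after_break_in_year_20 = candidate_splits(21, 50)
```

So with the full 50 years there are at most 45 tests to run, and after a break in year 20 there are at most 25 more.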

The main downside of it is that it won't identify breaks that reverse. For example if the world is in regime A, then regime B, then regime A again then ideally we'd estimate our parameters using both regime A's. But the above test will either use only the final regime A, or the last two regimes, or possibly all three regimes depending on whether there is a significant difference at the appropriate point(s). There are fancy things we could do to deal with this, but I feel these are corner cases and life is too short.


An example


Let's use a concrete example. This is the performance of the momentum4 rule on CORN. I've chosen it because we already know momentum4 has a structural break, and CORN has plenty of history. Just for fun, before reading on, see if you can identify where the algo finds the break here (there is exactly one break - this isn't a trick question).

Some code

This is probably the first post in this series where it's been practical to actually include the code, since there isn't much of it, nor are there any dependencies apart from getting the returns:


from copy import copy
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5


def identify_and_plot_breaks(all_returns: pd.Series, CV: float = 0.01):
    breaks_as_dict = identify_all_breaks(all_returns, CV=CV)
    breaks_as_df = pd.DataFrame(breaks_as_dict)
    ## Back-fill across columns so each column holds its own segment plus all earlier ones
    breaks_as_df = breaks_as_df.bfill(axis=1)
    breaks_as_df.cumsum().plot()
    plt.show(block=True)


def identify_all_breaks(all_returns: pd.Series, CV: float):
    ## returns a dict, turn into a dataframe and you can plot
    returns_to_consider = copy(all_returns)
    returns_to_consider = returns_to_consider.dropna()
    broken_list = identify_all_breaks_recursively(
        returns_to_consider=returns_to_consider,
        list_of_returns_broken_off=[],
        CV=CV,
    )
    ## Most recent segment first, so key 0 is the data after the last break
    broken_list.reverse()
    broken_dict = dict(enumerate(broken_list))

    return broken_dict


def identify_all_breaks_recursively(
    returns_to_consider: pd.Series, list_of_returns_broken_off: List, CV: float
) -> List:
    years_in_returns = how_many_years_approx(returns_to_consider)

    for i in range(years_in_returns):
        year_idx = i + 1
        first_sample, second_sample = split_sample_after_n_years(returns_to_consider, year_idx)
        if len(second_sample) < (MIN_NUMBER_OF_YEARS * BUS_DAYS_IN_YEAR):
            ## Not enough data left after the split point to estimate with
            break
        is_broken_here = test_a_break(first_sample, second_sample, CV=CV)
        if is_broken_here:
            ## Set aside the pre-break data and start again on what is left
            list_of_returns_broken_off.append(first_sample)
            return identify_all_breaks_recursively(
                second_sample,
                list_of_returns_broken_off=list_of_returns_broken_off,
                CV=CV,
            )

    ## No breaks identified or sample size too short
    list_of_returns_broken_off.append(returns_to_consider)
    return list_of_returns_broken_off


def how_many_years_approx(returns: pd.Series):
    return int(np.floor(len(returns) / BUS_DAYS_IN_YEAR))


def split_sample_after_n_years(all_returns: pd.Series, n_years: int):
    idx = n_years * BUS_DAYS_IN_YEAR
    return all_returns.iloc[:idx], all_returns.iloc[idx:]


def test_a_break(first_sample: pd.Series, second_sample: pd.Series, CV: float):
    ## Normalise by standard deviation before considering means
    norm_first_sample = first_sample / first_sample.std()
    norm_second_sample = second_sample / second_sample.std()
    return ttest_ind(norm_first_sample, norm_second_sample).pvalue < CV
On Corn this produces the following:

The break occurs on the 26th August 1982. So to calculate that particular SR estimate we'd only use data from that date onwards.
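
As a sketch of how the output might then be used: the dict returned by identify_all_breaks has the most recent segment under key 0 (the list is reversed before it is turned into a dict), so the SR estimate only needs that entry. The function names and the toy numbers here are purely illustrative, and the SR is annualised assuming 256 business days.

```python
import numpy as np
import pandas as pd

BUS_DAYS_IN_YEAR = 256


def annualised_sr(daily_returns: pd.Series) -> float:
    # Daily mean over daily standard deviation, scaled by sqrt(256) = 16
    return float(daily_returns.mean() / daily_returns.std() * np.sqrt(BUS_DAYS_IN_YEAR))


def sr_after_last_break(broken_dict: dict) -> float:
    # Key 0 holds the returns after the most recent break
    return annualised_sr(broken_dict[0])


# Toy input in the same shape as the output of identify_all_breaks
toy_breaks = {
    0: pd.Series([0.01, -0.01, 0.02, 0.0]),   # after the last break
    1: pd.Series([-0.02, 0.01, -0.01, 0.0]),  # before the last break (ignored)
}
toy_sr = sr_after_last_break(toy_breaks)
```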

Is that what you would have guessed? Personally, I would probably have gone for a later break point if identifying it by eye. As humans we are drawn to the sharp upward move in 1989 and would probably have gone for a break just after that. It's possible a search for the likeliest break would have found that point, but remember we are looking for the first break that exceeds the threshold; and once that break happens no further breaks are identified.
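
For contrast, here is a minimal sketch of the alternative: searching for the likeliest break rather than the first one over the threshold. This is not what the code above does, and it isn't a proper QLR test; it simply picks the yearly split with the largest absolute t-statistic, using the same normalisation and the same five year minimum for the second sample. The function name is hypothetical and the synthetic data has a deliberately exaggerated break after year 10.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5


def most_likely_break_year(returns: pd.Series):
    # Returns (year after which the split is most significant, |t-statistic|)
    n_years = len(returns) // BUS_DAYS_IN_YEAR
    best_year, best_abs_t = None, 0.0
    for year in range(1, n_years):
        idx = year * BUS_DAYS_IN_YEAR
        first, second = returns.iloc[:idx], returns.iloc[idx:]
        if len(second) < MIN_NUMBER_OF_YEARS * BUS_DAYS_IN_YEAR:
            break
        t_stat = ttest_ind(first / first.std(), second / second.std()).statistic
        if abs(t_stat) > best_abs_t:
            best_year, best_abs_t = year, abs(t_stat)
    return best_year, best_abs_t


# Synthetic returns: 10 good years followed by 10 bad years
rng = np.random.default_rng(42)
good_years = rng.normal(0.5, 1.0, 10 * BUS_DAYS_IN_YEAR)
bad_years = rng.normal(-0.5, 1.0, 10 * BUS_DAYS_IN_YEAR)
synthetic_returns = pd.Series(np.concatenate([good_years, bad_years]))

break_year, abs_t = most_likely_break_year(synthetic_returns)
```

The cost of doing it this way is that every candidate split has to be tested before a decision is made, rather than stopping at the first significant one.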

A summary of results

Here is a summary of the results for each instrument/forecast pairing with the default 1% critical value:


You can see that breaks are quite rare with only 13% or so of instrument/rules having at least one break. This also suggests that the Sharpe Ratios for trading rule performance are actually quite stable over time; or at least stable enough that they won't fail any statistical tests at a 1% critical value. 

Multiple breaks are even rarer. Just 1.8% have two breaks; 0.4% or 39 pairings have three breaks, ten have four breaks and only two have five breaks. They are:

skewrv365 forecasting EURIBOR-ICE    (yes, there is still a EURIBOR future!)
normmom2 forecasting FTSE250         

Here is Euribor, relative value skew with a 365 day window:


Although five is pushing it, there are certainly three regimes there (pre 2000, 2000 - 2010, and 2010 onwards), and using post 2010 data seems to make some kind of sense.


And FTSE 250 momentum8 (this is pre-cost):


There are certainly at least two regimes there and I wouldn't argue with the automated decision to use only data after 2006 or so, when it looks like, in the words of Pulp in one of my favourite songs, "Something changed".

Of course we will get a different picture with a slacker test. Here is the picture with a 10% critical value:

Now just over half the pairings have at least one break in them.

The decision whether to use a 0% (equivalent to no breaks at all), 1%, 5% or 10% CV is one we will now address.

An optimisation test

Now the big question is does this actually improve performance? On a pure out of sample test? And is this changed much by using a different critical value?

I follow the same procedure roughly as in previous posts:

  • Select 10, 20, 30 or 40 years of in sample data (I need at least 10 years because, with a minimum of five years required for estimation, with anything shorter I either certainly won't find any breaks, or I will risk finding a break and not having five years of data left over)
  • Select 1 or 5 years of out of sample data
  • Pick a random instrument, ensuring there is enough history available (between 11 and 45 years). We will only choose from instruments with sufficient history for the time required.
  • Randomly pick N=9 forecasting rules from those available (the same number as in posts #2 and #3)
Then for each of those subsamples:
  • Cycle through using no breaks (0% CV), 1% CV, 5% CV and 10% CV
  • Estimate SR on the in sample data using either all the data (0% CV), or the data after the last break given some critical value.
  • Estimate correlation using all the in sample data
  • Use fixed shrinkage levels (estimated here): SR shrinkage 0.5, correlation 0.75 (since we'll always have at least five years of in sample data we don't need to worry about the higher levels of shrinkage required when we have insufficient data). The results won't be much different with any vaguely similar shrinkage.
  • Run in sample optimisation and out of sample optimisation on all the options above
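
For readers who haven't seen the earlier posts, this is roughly what applying fixed shrinkage looks like. It is a sketch under loud assumptions: the shrinkage targets are not restated in this post, so here the SR estimates are shrunk towards their own average and the correlations towards the average off-diagonal correlation, with the shrinkage level being the weight on that prior. The function names are hypothetical.

```python
import numpy as np


def shrink_sharpe_ratios(sr_estimates, shrinkage: float = 0.5) -> np.ndarray:
    # ASSUMPTION: the prior is the cross-sectional average SR
    sr = np.asarray(sr_estimates, dtype=float)
    prior = sr.mean()
    return shrinkage * prior + (1 - shrinkage) * sr


def shrink_correlation(corr, shrinkage: float = 0.75) -> np.ndarray:
    # ASSUMPTION: the prior is a matrix of the average off-diagonal correlation
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    average_off_diagonal = corr[~np.eye(n, dtype=bool)].mean()
    prior = np.full((n, n), average_off_diagonal)
    np.fill_diagonal(prior, 1.0)
    return shrinkage * prior + (1 - shrinkage) * corr


shrunk_sr = shrink_sharpe_ratios([0.0, 1.0])
shrunk_corr = shrink_correlation([[1.0, 0.8, 0.2],
                                  [0.8, 1.0, 0.5],
                                  [0.2, 0.5, 1.0]])
```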
Finally once we have all our subsamples:
  • Get the median SR from the distribution of subsamples
  • Find the optimal CV with the highest SR
  • Test to see if that median is significantly higher than the others
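
The final three steps can be sketched as follows, given a DataFrame with one row per subsample and one column per critical value. The use of an independent two-sided t-test here is an assumption, as are the function name and the synthetic numbers; they are only there to show the shape of the calculation.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def summarise_by_cv(oos_sr: pd.DataFrame) -> pd.DataFrame:
    # Median out of sample SR per critical value, then test the best against the rest
    median_sr = oos_sr.median()
    best_cv = median_sr.idxmax()
    pvalues = {}
    for cv in oos_sr.columns:
        if cv == best_cv:
            pvalues[cv] = np.nan  # the optimal option is not tested against itself
        else:
            pvalues[cv] = ttest_ind(oos_sr[best_cv], oos_sr[cv]).pvalue
    return pd.DataFrame({"SR": median_sr, "pvalue": pd.Series(pvalues)})


# Synthetic subsample results in which the 10% CV is clearly the best option
rng = np.random.default_rng(1)
synthetic_sr = pd.DataFrame({
    0.00: rng.normal(0.0, 1.0, 500),
    0.01: rng.normal(0.0, 1.0, 500),
    0.05: rng.normal(0.0, 1.0, 500),
    0.10: rng.normal(1.0, 1.0, 500),
})
summary = summarise_by_cv(synthetic_sr)
```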

10 years in sample, one year out of sample

We only have four options to consider so no need for the huge tables and fancy heatmaps of previous posts:
         SR  pvalue all  pvalue distinct
0.00 -0.021       0.247            0.247
0.01 -0.018       0.295            0.295
0.05 -0.019       0.204            0.204
0.10 -0.014         NaN              NaN
Each row is a different critical value used for breakpoint finding. Zero means the entire in sample period was used. The first column is the out of sample Sharpe Ratio for each option. The second column is the p-value for a test of the optimal option against the relevant option. NaN marks the optimal option, and lower values (say below 0.05) mean the optimal option is significantly better than that alternative. In the final column I've rerun the statistical t-test but this time I have excluded instances where no breaks were found (so the test is only done comparing the out of sample SR when breaks were found, versus when they were not). This shouldn't affect the p-values, but it's nice to check it doesn't.

You can see here that it looks like a very loose breakpoint policy is the best, but it's not significantly better.

10 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.166       0.168            0.168
0.01  0.160       0.003            0.003
0.05  0.167       0.007            0.007
0.10  0.170         NaN              NaN

Again the loosest breakpoint works best; but it's hardly logical since it's indistinguishable from no breakpoints at all.

20 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.160       0.542            0.542
0.01 -0.160       0.323            0.323
0.05 -0.160       0.050            0.050
0.10 -0.155         NaN              NaN
Loose is better.

20 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.123         NaN              NaN
0.01  0.116         0.0              0.0
0.05  0.105         0.0              0.0
0.10  0.098         0.0              0.0
No breakpoints are the best. There isn't much consistency here.

30 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00  0.092         NaN              NaN
0.01 -0.019         0.0              0.0
0.05 -0.078         0.0              0.0
0.10 -0.066         0.0              0.0
Another massive win for not using breaks.

30 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.005       0.001            0.001
0.01 -0.011       0.000            0.000
0.05 -0.008       0.014            0.014
0.10 -0.003         NaN              NaN
Or perhaps we should go for the loosest break....

40 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.142         NaN              NaN
0.01 -0.249       0.000            0.000
0.05 -0.189       0.000            0.000
0.10 -0.178       0.002            0.002
or no breaks...

40 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.060       0.177            0.177
0.01 -0.051         NaN              NaN
0.05 -0.055       0.000            0.000
0.10 -0.055       0.000            0.000
A clear vote for strict breaks, but no breaks at all are also good.

Conclusion

Well that was as clear as a bowl full of mud that has been made even more unclear by painting the bowl a very dark colour and then adding some darker mud. Not very clear at all, in other words.

You think up a nice neat simple way of finding structural breaks, and then it doesn't actually work when used for optimisation. In seven out of the eight cases we examined, not using breaks is either optimal, or statistically indistinguishable from the optimal. Only in one case was it inferior. The loosest possible break (CV 10%) was optimal in four cases. In most cases apart from 30 years/1 year the difference in performance was small between CV alternatives. This partly reflects the fact that breaks aren't that common, especially with a 1% CV.

Of course findings like this are very context dependent. I'm using rules that should probably work over long time periods; indeed they have been in sample selected for such a purpose. For the 30 year and 40 year periods there just aren't that many instruments with that much history; so even repeated random sampling is likely to turn up the same suspects repeatedly.

One issue here might be that we are considering the breaks at a forecast/instrument level. We might get different results if we pool estimates for trading rule performance across instruments. And indeed that is the subject of the next post. So I will return to this topic when I've looked at pooling.


