Tuesday, 23 June 2026

Breaking Badly: finding the structural breaks in parameter estimates

 Here's a nice picture from a lovely book written by a top bloke:

It shows the cumulative p&l from different speeds of momentum (for portfolios containing 102 instruments) over 50 years of data. Notice how the two fastest speeds (2&4) get worse in the second half of the sample. I've called line #2 here the 'second most famous hockey stick graph in history'. It certainly looks like something changed in 1990.

This is important. If we're optimising portfolios of such things we only want to consider data that is relevant, but we also want as much data as possible for statistical significance. Now if I were a simpleton I'd do this by looking at graphs like that and going 'aha i only need to use data after 1990'. As a simpleton I don't use capital letters. But I am a big fan of not doing in sample fitting, even of meta parameters like this; and I am an even bigger fan of doing things automatically which means not wading through thousands of graphs like that (since there are thousands of SR estimates in my forecast p&l space, plus a good chunk of correlations).

So we need an automatic way of identifying such breaks. Fortunately this is not a new problem as you will know if, like me, you did undergraduate econometrics. Finding structural breaks is an entire industry. We need two things: a test for how likely it is that a break has occurred between two sub-samples A and B, and an algorithm for going through all the options of A and B.

And in case you haven't realised this is the seventh post in my summer 2026 series on portfolio optimisation.


What parameters

The first question to think about is what parameters we're going to apply this process to. I do two kinds of optimisation:

  • Forecast weights
  • Instrument weights
And in both cases I have estimates of SR (one per asset) and correlations ([N^2-N]/2 for N assets). I haven't really looked at instrument weight optimisation yet in this series, and there are some wrinkles there so I'm going to park that for now. That just leaves the SR for a forecast (which remember is a pairing of a trading rule and an instrument), and the correlation of such forecasts within an instrument.
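To give a feel for how quickly the correlation count grows, here is a tiny sketch of the [N^2-N]/2 arithmetic. The function name is hypothetical and the asset counts are just illustrative.

```python
def n_correlations(n_assets: int) -> int:
    """Number of distinct pairwise correlations among n_assets."""
    return (n_assets * n_assets - n_assets) // 2


for n in (9, 40, 200):
    # one SR estimate per asset, plus one correlation per distinct pair
    print(n, "SR estimates and", n_correlations(n), "correlations")
```

For 9, 40 and 200 assets that is 36, 780 and 19,900 correlations respectively.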

Now I am going to ignore correlations in this post. As I discussed in an earlier post, although correlations are relatively unstable in the short term, they are unlikely to have secular trends like SR. And it's quite easy to deal with this by just using a relatively long lookback to estimate them, probably with an ewma on the correlation estimate.
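One possible reading of "a long lookback with an ewma" is an exponentially weighted correlation matrix. The sketch below uses pandas for this; the span, the helper name and the toy data are all assumptions rather than the settings used for the results in this post.

```python
import numpy as np
import pandas as pd


def ewma_correlation(returns: pd.DataFrame, span_days: int) -> pd.DataFrame:
    """Latest exponentially weighted correlation matrix of the columns."""
    # ewm().corr() on a DataFrame returns one matrix per date (MultiIndex rows)
    all_dates = returns.ewm(span=span_days, min_periods=span_days // 10).corr()
    last_date = returns.index[-1]
    return all_dates.loc[last_date]


# Toy data: rule_a and rule_b share a common component, rule_c is independent
rng = np.random.default_rng(0)
common = rng.standard_normal(2560)
toy_returns = pd.DataFrame(
    {
        "rule_a": common + rng.standard_normal(2560),
        "rule_b": common + rng.standard_normal(2560),
        "rule_c": rng.standard_normal(2560),
    },
    index=pd.bdate_range("2010-01-01", periods=2560),
)
corr = ewma_correlation(toy_returns, span_days=1280)
print(corr.round(2))
```

A longer span gives a smoother, slower moving estimate, which is the point of using a long lookback here.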


What test

This is quite an easy one, compared to the world of econometrics and linear regression where we have to do such nonsense as a Chow test. Given two sub-samples A and B, to find out if they have different sample means we just do an independent t-test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

An important characteristic of such tests is they are only likely to be significant if there is sufficient data. If sample A or sample B is too small in size, it's unlikely we'll find a significant difference in the means.
Typical critical values in t-tests are 1% and 5%; I will also check out 10% later.

Since all the p&l streams in a risk targeted trading system will have the same expected standard deviation in the long run, I can legitimately treat this as a test to see if the SR are different if I adjust the samples so they have an identical standard deviation.
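To put rough numbers on the sample size point, here is a minimal sketch (not part of the code used for the results) of the textbook two-sample t-statistic once both samples are scaled to a daily standard deviation of one. The function name and the example Sharpe Ratios are hypothetical, and it assumes 256 business days a year and independent daily returns.

```python
import numpy as np
from scipy.stats import t as student_t

BUS_DAYS_IN_YEAR = 256


def approx_break_test(sr_a: float, sr_b: float, years_a: float, years_b: float):
    """Approximate t-statistic and two-sided p-value for a gap in annual SR.

    With a daily standard deviation of one, the daily mean is the annual SR
    divided by sqrt(BUS_DAYS_IN_YEAR).
    """
    n_a = years_a * BUS_DAYS_IN_YEAR
    n_b = years_b * BUS_DAYS_IN_YEAR
    daily_mean_gap = (sr_a - sr_b) / np.sqrt(BUS_DAYS_IN_YEAR)
    standard_error = np.sqrt(1.0 / n_a + 1.0 / n_b)
    t_stat = daily_mean_gap / standard_error
    p_value = 2 * student_t.sf(abs(t_stat), df=n_a + n_b - 2)
    return t_stat, p_value


# Same SR gap, very different sample splits
t_short, p_short = approx_break_test(1.0, 0.0, years_a=1, years_b=49)
t_even, p_even = approx_break_test(1.0, 0.0, years_a=25, years_b=25)
print(round(t_short, 2), round(t_even, 2))
```

With a full point of annual SR between the two regimes, one year against 49 gives a t-statistic of about 1, whilst 25 years against 25 gives about 3.5. The same gap is invisible in the first case and highly significant in the second.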

(If any ex-students of mine are reading this, you should remember this from week 3 of the course!)

What algo / procedure

OK let's make this concrete. Suppose we have 50 years of data as in the original graph above, and we're currently wondering if there were one or more structural breaks. We don't want to have fewer than five years of data to do our estimation. Note these numbers are appropriate for my context, but not in a domain with faster trading, higher SR, and faster alpha decay. We proceed as follows:
  • We compare year 1 with years 2 to 50. Since year 1 is very small it's unlikely we'll find a break; but if we do then we do a split (see below).
  • If no split has occurred, we compare years 1-2 with years 3-50. Again if we find a break, we split.
  • If no split has occurred, we compare years 1-3 with years 4-50...
  • ...
  • If no split has occurred, we compare years 1-45 with years 46-50. If we still don't find a break, then we use the entire period for estimation (since the period 47-50 will only have four years, we terminate here).
Now what if a split occurs at some point? Then we restart the process, but this time without the pre-split data. Suppose for example a split had occurred in year 20 (which is 1992 in the original graph). Then:
  • We compare year 21 with years 22 to 50. If we find a break, then we split again.
  • If no split has occurred, we compare years 21-22 with years 23-50. Again if we find a break, we split.
  • If no split has occurred, we compare years 21-23 with years 24-50...
  • ...
  • If no split has occurred, we compare years 21-45 with years 46-50. If we still don't find a break, then we use the entire period from years 21-50 for estimation.
You get the idea. This procedure is quite quick and easy to run; and notice we're identifying any break that exceeds a certain threshold rather than finding the most likely break as we would do with a QLR type test. 
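The sequence of comparisons described above can be listed with a few lines of code. This is only a sketch of the window bookkeeping, with a hypothetical helper name; it doesn't run any tests.

```python
from typing import List, Tuple

MIN_NUMBER_OF_YEARS = 5

YearRange = Tuple[int, int]


def comparison_windows(first_year: int, last_year: int) -> List[Tuple[YearRange, YearRange]]:
    """All (sample A, sample B) year ranges tried if no break is ever found."""
    windows = []
    for split_year in range(first_year, last_year):
        years_in_second = last_year - split_year
        if years_in_second < MIN_NUMBER_OF_YEARS:
            # sample B would be too short to estimate anything from
            break
        windows.append(((first_year, split_year), (split_year + 1, last_year)))
    return windows


from_scratch = comparison_windows(1, 50)
after_split = comparison_windows(21, 50)
print(from_scratch[0], from_scratch[-1], len(from_scratch))
print(after_split[0], after_split[-1], len(after_split))
```

Starting from year 1 there are 45 comparisons, ending with years 1-45 against years 46-50. Restarting after a split in year 20 leaves 25.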

The main downside of it is that it won't identify breaks that reverse. For example if the world is in regime A, then regime B, then regime A again then ideally we'd estimate our parameters using both regime A's. But the above test will either use only the final regime A, or the last two regimes, or possibly all three regimes depending on whether there is a significant difference at the appropriate point(s). There are fancy things we could do to deal with this, but I feel these are corner cases and life is too short.


An example


Let's use a concrete example. This is the performance of the momentum4 rule on CORN. I've chosen it because we already know momentum4 has a structural break, and CORN has plenty of history. Just for fun, before reading on, see if you can identify where the algo finds the break here (there is exactly one break - this isn't a trick question).

Some code

This is probably the first post in this series where it's been practical to actually include the code, since there isn't much of it, nor are there any dependencies apart from getting the returns:


from copy import copy
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5


def identify_and_plot_breaks(all_returns: pd.Series, CV: float = 0.01):
    breaks_as_dict = identify_all_breaks(all_returns, CV=CV)
    breaks_as_df = pd.DataFrame(breaks_as_dict)
    breaks_as_df = breaks_as_df.bfill(axis=1)
    breaks_as_df.cumsum().plot()
    plt.show(block=True)


def identify_all_breaks(all_returns: pd.Series, CV: float):
    ## returns a dict, turn into a dataframe and you can plot
    returns_to_consider = copy(all_returns)
    returns_to_consider = returns_to_consider.dropna()
    broken_list = identify_all_breaks_recursively(
        returns_to_consider=returns_to_consider,
        list_of_returns_broken_off=[],
        CV=CV,
    )
    ## most recent regime first
    broken_list.reverse()
    broken_dict = dict(enumerate(broken_list))

    return broken_dict


def identify_all_breaks_recursively(
    returns_to_consider: pd.Series, list_of_returns_broken_off: List, CV: float
) -> List:
    years_in_returns = how_many_years_approx(returns_to_consider)

    for i in range(years_in_returns):
        year_idx = i + 1
        first_sample, second_sample = split_sample_after_n_years(returns_to_consider, year_idx)
        if len(second_sample) < (MIN_NUMBER_OF_YEARS * BUS_DAYS_IN_YEAR):
            break
        is_broken_here = test_a_break(first_sample, second_sample, CV=CV)
        if is_broken_here:
            ## split, then restart the search on the post-break data only
            list_of_returns_broken_off.append(first_sample)
            return identify_all_breaks_recursively(
                second_sample,
                list_of_returns_broken_off=list_of_returns_broken_off,
                CV=CV,
            )
        else:
            continue

    ## No breaks identified or sample size too short
    list_of_returns_broken_off.append(returns_to_consider)
    return list_of_returns_broken_off


def how_many_years_approx(returns: pd.Series):
    return int(np.floor(len(returns) / BUS_DAYS_IN_YEAR))


def split_sample_after_n_years(all_returns: pd.Series, n_years: int):
    idx = n_years * BUS_DAYS_IN_YEAR
    return all_returns[:idx], all_returns[idx:]


def test_a_break(first_sample: pd.Series, second_sample: pd.Series, CV: float):
    ## Normalise by standard deviation before considering means
    norm_first_sample = first_sample / first_sample.std()
    norm_second_sample = second_sample / second_sample.std()
    return ttest_ind(norm_first_sample, norm_second_sample).pvalue < CV

On Corn this produces the following:

The break occurs on the 26th August 1982. So to calculate that particular SR estimate we'd only use data from that date onwards.

Is that what you would have guessed? Personally, I would probably have gone for a later break point if identifying it by eye. As humans we are drawn to the sharp upward move in 1989 and would probably have gone for a break just after that. It's possible a search for the likeliest break would have found that point, but remember we are looking for the first break that exceeds the threshold; and once that break happens no further breaks are identified.
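The "first break that exceeds the threshold" behaviour is easy to see on made-up data. The sketch below is a cut-down, self-contained variant of the search loop (hypothetical function name, deterministic toy returns rather than real p&l) in which the mean flips sign after exactly ten years.

```python
import numpy as np
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5


def first_break_year(returns: np.ndarray, cv: float = 0.01):
    """Return the first split point (in years) where the t-test rejects, else None."""
    n_years = len(returns) // BUS_DAYS_IN_YEAR
    for year in range(1, n_years + 1):
        idx = year * BUS_DAYS_IN_YEAR
        first, second = returns[:idx], returns[idx:]
        if len(second) < MIN_NUMBER_OF_YEARS * BUS_DAYS_IN_YEAR:
            return None
        # normalise each sample to unit standard deviation before comparing means
        p_value = ttest_ind(first / first.std(ddof=1), second / second.std(ddof=1)).pvalue
        if p_value < cv:
            return year
    return None


# Toy data: alternating +1/-1 "noise" around a mean that flips sign after year 10
n_days = 20 * BUS_DAYS_IN_YEAR
day_number = np.arange(n_days)
noise = np.where(day_number % 2 == 0, 1.0, -1.0)
drift = np.where(day_number < 10 * BUS_DAYS_IN_YEAR, 0.1, -0.1)
toy_returns = noise + drift

print("first rejection after year:", first_break_year(toy_returns))
print("no change in mean:", first_break_year(noise + 0.1))
```

On this toy series the loop should stop well before the true break at year ten: once sample A is long enough for significance, the later regime already drags the mean of sample B far enough away. A search for the likeliest break would behave differently.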

A summary of results

Here is a summary of the results for each instrument/forecast pairing with the default 1% critical value:


You can see that breaks are quite rare with only 13% or so of instrument/rules having at least one break. This also suggests that the Sharpe Ratios for trading rule performance are actually quite stable over time; or at least stable enough that they won't fail any statistical tests at a 1% critical value. 

Multiple breaks are even rarer. Just 1.8% have two breaks; 0.4% or 39 pairings have three breaks, ten have four breaks and only two have five breaks. They are:

skewrv365 forecasting EURIBOR-ICE    (yes, there is still a EURIBOR future!)
normmom2 forecasting FTSE250         

Here is Euribor, relative value skew with a 365 day window:


Although five is pushing it, there are certainly three regimes there (pre 2000, 2000 - 2010, and 2010 onwards), and using post 2010 data seems to make some kind of sense.


And FTSE 250 momentum8 (this is pre-cost):


There are certainly at least two regimes there and I wouldn't argue with the automated decision to use only data after 2006 or so, when it looks like, in the words of Pulp in one of my favourite songs, "Something changed".

Of course we will get a different picture with a slacker test. Here is the picture with a 10% critical value:

Now just over half the pairings have at least one break in them.

The decision as to whether to use a 0% (equivalent to no breaks at all), 1%, 5% or 10% CV is one we will now address.

An optimisation test

Now the big question is does this actually improve performance? On a pure out of sample test? And is this changed much by using a different critical value?

I follow the same procedure roughly as in previous posts:

  • Select 10,20,30 or 40 years of in sample data (I need at least 10 years: with a minimum of five years required for estimation, a shorter period means I either certainly won't find any breaks, or I risk finding a break and not having five years of data left over)
  • Select 1 or 5 years of out of sample data
  • Pick a random instrument, ensuring there is enough history available (between 11 and 45 years). We will only choose from instruments with sufficient history for the time required.
  • Randomly pick N=9 forecasting rules from those available (the same number as in posts #2 and #3)
Then for each of those subsamples:
  • Cycle through using no breaks (0% CV), 1% CV, 5% CV and 10% CV
  • Estimate SR on the in sample data using either all the data (0% CV), or the data after the last break given some critical value.
  • Estimate correlation using all the in sample data
  • Use fixed shrinkage levels (estimated here): SR shrinkage 0.5, correlation 0.75 (since we'll always have at least five years of in sample data we don't need to worry about the higher levels of shrinkage required when we have insufficient data). The results won't be much different with any vaguely similar shrinkage.
  • Run in sample optimisation and out of sample optimisation on all the options above
Finally once we have all our subsamples:
  • Get the median SR from the distribution of subsamples
  • Find the optimal CV with the highest SR
  • Test to see if that median is significantly higher than the others
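The final three steps could be sketched roughly as below. The function name and the toy numbers are hypothetical, and the use of a paired t-test (scipy's ttest_rel) is an assumption; the post doesn't say which flavour of t-test produced the tables that follow.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel


def summarise_by_cv(oos_sr: pd.DataFrame) -> pd.DataFrame:
    """Rows of oos_sr are subsamples; columns are critical values (0.0 = no breaks)."""
    median_sr = oos_sr.median()
    best_cv = median_sr.idxmax()
    p_values = {}
    for cv in oos_sr.columns:
        if cv == best_cv:
            p_values[cv] = np.nan  # the optimum is not tested against itself
        else:
            p_values[cv] = ttest_rel(oos_sr[best_cv], oos_sr[cv]).pvalue
    return pd.DataFrame({"SR": median_sr, "pvalue": pd.Series(p_values)})


# Made-up out of sample SR for five subsamples
toy_oos_sr = pd.DataFrame(
    {
        0.00: [0.10, 0.20, 0.05, 0.15, 0.30],
        0.01: [0.09, 0.21, 0.05, 0.13, 0.28],
        0.05: [0.11, 0.19, 0.07, 0.16, 0.29],
        0.10: [0.12, 0.25, 0.04, 0.19, 0.33],
    }
)
summary = summarise_by_cv(toy_oos_sr)
print(summary)
```

The row with NaN in the pvalue column is the option with the highest median SR.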

10 years in sample, one year out of sample

We only have four options to consider so no need for the huge tables and fancy heatmaps of previous posts:
         SR  pvalue all  pvalue distinct
0.00 -0.021       0.247            0.247
0.01 -0.018       0.295            0.295
0.05 -0.019       0.204            0.204
0.10 -0.014         NaN              NaN
Each row is a different critical value used for breakpoint finding. Zero means the entire in sample period was used. The first column is the median out of sample Sharpe Ratio for each option. The second column is the p-value for a test of the optimal option against the relevant option. NaN marks the optimal option, and lower values (say below 0.05) mean the optimal option is significantly better than that alternative. In the final column I've rerun the statistical t-test but this time I have excluded instances where no breaks were found (so the test is only done comparing the out of sample SR when breaks were found, versus when they were not). This shouldn't affect the p-values, but it's nice to check it doesn't.

You can see here that it looks like a very loose breakpoint policy is the best, but it's not significantly better.

10 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.166       0.168            0.168
0.01  0.160       0.003            0.003
0.05  0.167       0.007            0.007
0.10  0.170         NaN              NaN

Again the loosest breakpoint works best; but it's hardly logical since it's indistinguishable from no breakpoints at all.

20 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.160       0.542            0.542
0.01 -0.160       0.323            0.323
0.05 -0.160       0.050            0.050
0.10 -0.155         NaN              NaN
Loose is better.

20 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.123         NaN              NaN
0.01  0.116         0.0              0.0
0.05  0.105         0.0              0.0
0.10  0.098         0.0              0.0
No breakpoints are the best. There isn't much consistency here.

30 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00  0.092         NaN              NaN
0.01 -0.019         0.0              0.0
0.05 -0.078         0.0              0.0
0.10 -0.066         0.0              0.0
Another massive win for not using breaks.

30 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.005       0.001            0.001
0.01 -0.011       0.000            0.000
0.05 -0.008       0.014            0.014
0.10 -0.003         NaN              NaN
Or perhaps we should go for the loosest break....

40 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.142         NaN              NaN
0.01 -0.249       0.000            0.000
0.05 -0.189       0.000            0.000
0.10 -0.178       0.002            0.002
or no breaks...

40 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.060       0.177            0.177
0.01 -0.051         NaN              NaN
0.05 -0.055       0.000            0.000
0.10 -0.055       0.000            0.000
A clear vote for strict breaks, but no breaks at all are also good.

Conclusion

Well that was as clear as a bowl full of mud that has been made even more unclear by painting the bowl a very dark colour and then adding some darker mud. Not very clear at all, in other words.

You think up a nice neat simple way of finding structural breaks, and then it doesn't actually work when used for optimisation. In seven out of the eight cases we examined, not using breaks is either optimal, or statistically indistinguishable from the optimal. Only in one case was it inferior. The loosest possible break (CV 10%) was optimal in four cases. In most cases apart from 30 years/1 year the difference in performance was small between CV alternatives. This partly reflects the fact that breaks aren't that common, especially with a 1% CV.

Of course findings like this are very context dependent. I'm using rules that should probably work over long time periods; indeed they have been in sample selected for such a purpose. For the 30 year and 40 year periods there just aren't that many instruments with that much history; so even repeated random sampling is likely to turn up the same suspects repeatedly.

One issue here might be that we are considering the breaks at a forecast/instrument level. We might get different results if we pool estimates for trading rule performance across instruments. And indeed that is the subject of the next post. So I will return to this topic when I've looked at pooling.



Thursday, 18 June 2026

To Cluster Or Not To Cluster That is the Question...

This is the sixth (!) post in a series I'm writing on portfolio optimisation. A quick reminder of the story so far:

  • In the first post I showed that if you are optimising across forecasts from different trading rules and instruments, that the rules within an instrument cluster naturally together, suggesting you should first fit within; and then across, instruments. Luckily, this is what I've always done.
  • In my second post I ran some experiments with optimising with random data. The results showed a supreme indifference between joint winners: monte carlo and bootstrapping, and a shrinkage methodology with a tiny bit of SR shrinkage. Using a more conservative viewpoint didn't affect the results much.
  • ... before moving into the world of real data for post three where I showed that the predictability of sampling distribution of parameter estimates was much worse with real data than with random data.
  • Then in post number four, I reran my experiments, this time with real data. Unsurprisingly I found that the previous winners did badly. Instead a middle ground of shrinkage was the winner across most time periods; unless the in sample period was too short.
  • Post number five attempted a spin on #4 by averaging out the weights produced by different shrinkage levels. It was an abject failure. We shall never speak of it again.
Importantly, all these experiments were done with a relatively small number of assets: nine. Bear in mind that when I fit forecast weights within instruments I have 40 assets; and when fitting instrument weights I have over 200.

Now you will remember that my previous favourite fitting method involves hierarchical clustering. I gave this the grand name of handcrafting, and in its simplest form it doesn't use SR at all and just clusters by correlations or by some user assigned labels (eg asset class).
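For anyone who hasn't seen clustering by correlation before, here is a minimal sketch using scipy. It is not the handcrafting code itself: the distance measure (one minus correlation), the average linkage method and the function name are all assumptions picked for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_by_correlation(corr: np.ndarray, n_clusters: int) -> np.ndarray:
    """Return a cluster label for each asset, using at most n_clusters groups."""
    distance = 1.0 - corr  # highly correlated assets are "close"
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy correlation matrix: two obvious groups of two assets each
toy_corr = np.array(
    [
        [1.0, 0.8, 0.1, 0.1],
        [0.8, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
)
labels = cluster_by_correlation(toy_corr, n_clusters=2)
print(labels)
```

The first two assets should share one label and the last two another. Weights would then be fitted within each cluster, and again across the clusters.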

There is plenty of empirical and theoretical evidence elsewhere as to why this makes sense for larger portfolios, but as I have the code and framework to do so, I thought why not battle test for myself; as part of this (long) exercise in checking that all my portfolio optimisation assumptions are correct. The question we want to answer then is what is the optimal number of clusters as a proportion of the total portfolio size? And if it's 1, then we don't cluster.
 
As before I can also vary the amount of available in sample data, and data used for out of sample evaluation. And this should be a joint test with the amount of shrinkage required. With clusters, we might expect less shrinkage to be required, since that is building in some robustness already. However in the interests of time, I won't be using Monte Carlo or Bootstrapping (doing that for each cluster in turn would be .... very..... slow..... indeed).

As before I will be doing this exercise using forecast weights (for which I generally have 40 assets). I can do some random selection, alternating between 20, 30 and 40 assets; randomly choosing instruments to do this with. 

Listing all the permutations:
  • Select 1,5 or 10 years of in sample data
  • Select 1 or 5 years of out of sample data
  • Pick a random instrument, ensuring there is enough history available (between 2 and 15 years). We will only choose from instruments with sufficient history for the time required.
  • Randomly pick N=20, 30 or 40 assets from those available (if 40, just use them all)
Then for a given dataset of in and out of sample:
  • Select correlation (from this list of options: 0, 0.7, 0.75, 0.8) and SR shrinkage (from this list: 0.0, 0.25, 0.5, 0.75, 1.0)
  • Select optimal number of clusters from 1 (no clustering), 2,3,4,6,8,10. 
  • Run in sample optimisation and out of sample optimisation on all the options above.
  • Repeat a few hundred times (it's quite slow!).
As I have done before I will also be checking speed - obviously adding clusters will increase optimisation time; firstly because the clustering and aggregation process itself adds time, but mainly because we will end up doing more optimisations. For example, for N=20 doing ten clusters would require 11 optimisations, one for each cluster, and one across clusters. Whilst the individual optimisations might be slightly faster than for larger portfolios due to faster convergence, this is still probably going to take longer than a single optimisation.
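To make the size of the experiment concrete, here is a small sketch that enumerates the settings tried for each dataset and counts the optimisations each one needs. The constant and function names are hypothetical; the values are the ones listed above.

```python
from itertools import product

CORRELATION_SHRINKAGE = [0.0, 0.7, 0.75, 0.8]
SR_SHRINKAGE = [0.0, 0.25, 0.5, 0.75, 1.0]
CLUSTER_COUNTS = [1, 2, 3, 4, 6, 8, 10]


def optimisations_required(n_clusters: int) -> int:
    """One optimisation per cluster plus one across clusters; just one if not clustering."""
    return 1 if n_clusters == 1 else n_clusters + 1


settings = list(product(CORRELATION_SHRINKAGE, SR_SHRINKAGE, CLUSTER_COUNTS))
total_optimisations = sum(optimisations_required(n) for _, _, n in settings)
print(len(settings), "settings,", total_optimisations, "optimisations per dataset")
```

That is 140 combinations of shrinkage and cluster count for every randomly drawn dataset, before multiplying by the in sample lengths, out of sample lengths and asset counts.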

The speed of the optimisations varied from 28 seconds up to 294 seconds. Longer in sample time means slightly slower estimation time, more assets means slower convergence, less shrinkage also means slower convergence and as discussed just now smaller clusters also slow things down.

Since we're trying to establish here how much clustering to do, the only thing to note is that clustering imposes a speed penalty.

TLDR: This is a very long post due to the exhaustive number of combinations. There are numerous pretty plots to look at. But if you get bored and skip to the end to see the results, I won't judge you. Honest.

Results - forecast weights

20 assets - one year in sample, one year out of sample

In the previous couple of posts I showed the results as a matrix with different levels of shrinkage, for different time periods. Here it's a bit trickier, since we're considering the performance on three axes: two shrinkage, plus cluster size (where 1 is no clustering). But my screen only has two axes. Hence - in the words of many people's relationship status on Facebook circa 2010 - it's complicated.




This is a heatmap and the right hand rectangle is the key. The left hand is the data. The x-axis is the amount of correlation shrinkage. Apologies for the bunched up numbers: the values are 0,.7,.75,.8. Zero is there for fun; the other values are approximately. On the y-axis are the SR shrinkage and the number of clusters. So 1.0,8 is full shrinkage and 8 clusters. The colours are the median SR for each out of sample optimisation except that I've done what I did before in post four. I found the optimum value (zero correlation shrinkage, 0.75 SR shrinkage, cluster size of 5). I then did a t-test to compare that optimum SR value against all the other values. Where that test failed at a critical value of 0.1; in other words when the relevant value isn't significantly different from the optimum; I coloured in the square with the same colour SR value as the optimum.
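The colouring rule can be sketched in a few lines of pandas. This assumes the median SR and the p-values against the optimum have already been calculated and laid out in the same grid; the function name and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_insignificant(median_sr: pd.DataFrame, p_values: pd.DataFrame, cv: float = 0.1) -> pd.DataFrame:
    """Show the optimum's SR wherever a cell is not significantly worse than the optimum."""
    optimum = median_sr.max().max()
    significantly_worse = p_values < cv  # NaN (the optimum itself) compares as False
    return median_sr.where(significantly_worse, optimum)


# Toy grid: rows are (SR shrinkage, clusters), columns are correlation shrinkage
index = ["0.5,1", "0.5,4"]
toy_median_sr = pd.DataFrame([[0.10, 0.12], [0.02, 0.11]], index=index, columns=[0.0, 0.75])
toy_p_values = pd.DataFrame([[0.40, np.nan], [0.01, 0.30]], index=index, columns=[0.0, 0.75])

to_plot = mask_insignificant(toy_median_sr, toy_p_values)
print(to_plot)
```

Only the bottom left cell keeps its own value in this toy grid; everything else takes the optimum's colour, which is why a plot that is mostly one colour means nothing much is significant.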

If that explanation doesn't make sense you should probably reread post #4.

The interpretation here is that with a few exceptions pretty much any value is fine as almost everything is one colour.

30 assets - one year in sample, one year out of sample

This is similar to 20 assets - nothing significant.

40 assets - one year in sample, one year out of sample

Here anything works, except:
  • Too much SR shrinkage (the coloured area at the bottom of the plot)
  • Not clustering

20 assets - one year in sample, five years out of sample

Again it's easier to say what doesn't work:
  • No correlation shrinkage
  • Too much SR shrinkage
The amount of clustering doesn't really influence the results.

30 assets - one year in sample, five years out of sample


Similar; perhaps a hint that smaller clusters underperform but just that.

40 assets - one year in sample, five years out of sample


Whilst there is more significance here, and a clear dislike for zero or full SR shrinkage, plus zero correlation shrinkage; again it does look like applying any degree of clustering is equally valid.

20 assets - five years in sample, one year out of sample

At this stage some people will be regretting their decision to print out my blogpost. In colour. For their sakes, I will be only reporting results where there is significance.

And for this combo, nothing is really significantly bad. 

30 assets - five years in sample, one year out of sample

Nothing is really significantly bad. 

40 assets - five years in sample, one year out of sample


 Certainly a preference here for lower amounts of SR shrinkage; with weaker evidence that middling amounts of clustering work well.

20 assets - five years in sample, five years out of sample

Focusing purely on clusters; it looks like 2 clusters is the one to avoid here.

30 assets - five years in sample, five years out of sample

Nothing is really significantly bad. 

40 assets - five years in sample, five years out of sample



20 assets - ten years in sample, one year out of sample

The whole plot is the same colour. Literally nothing to see here.

30 assets - ten years in sample, one year out of sample




40 assets - ten years in sample, one year out of sample

Again it looks like a modest amount of clustering; with N between 4 and 6 might be best.

20 assets - ten years in sample, five years out of sample

Looks like a case for larger cluster size...I think?

30 assets - ten years in sample, five years out of sample

Shrink the correlation and do some kind of clustering and you'll be fine mate.

40 assets - ten years in sample, five years out of sample

The final plot - and the one with the most significance. Shrinkage of about 0.75 on both and big clusters is the way to go here. More generally again it looks like middling cluster sizes are about right.

Summary

I think it is fair to say there aren't many definitive conclusions one can draw from that... experience. I would say however that there is some weak evidence that some level of clustering is better than none at all. And I would say there is even weaker evidence that you don't want your clusters to be too small, i.e. have a larger number of clusters. 

As in the previous post I could test the effect of combining portfolio weights derived with different cluster sizes. But given that we have struggled to find much statistical significance here it seems unlikely we'd get much satisfaction.

When choosing a parameter in the face of no evidence there are two things I like - heuristic rules and powers of 2. Therefore let's say you should use 6 clusters when you're doing your thing. That's roughly equivalent to the number of distinct trading rules I have, and the number of asset classes when optimising for instrument weights. That isn't a power of 2, but 4 clusters seems a bit on the low side, and 8 a little high. When choosing a parameter in the face of no evidence there are three things I like - heuristic rules, powers of 2, and taking an average of potential values.



Wednesday, 17 June 2026

Honey I shrunk the weights (instead of the inputs!)

TLDR: This is a post about something that doesn't work. So don't read if you only care about cherry picked delightful backtests.

This is my fifth post in a rapid fire intense series on portfolio optimisation. In my last post I looked at the optimal amount of shrinkage to use with real data, when running a bayesian methodology for mean variance optimisation. I found two things. Firstly, the optimal shrinkage was different for different sizes of in and out of sample periods. Secondly, that there was mostly a great deal of uncertainty about what the optimum was, with fairly flat surfaces and insignificant t-statistics abounding. I also found that random based methods (monte carlo and bootstrapping) don't work as well as the best shrinkage methods (and in some cases, do worse than the poorest methods). That's three things, but the latter point isn't relevant to this post.

Hence, shrinkage of 0.5 on SR and 0.75 for correlations seemed reasonable; but the truth is we don't really know for sure.

Now I am a big fan of the work of Resolve asset management. And one thing they are fond of doing is if two or more things seem to work equally well, just taking an average of them (for example they do this here with CTA replication). And I also know intuitively that taking an average of portfolio weights is better than taking an average of inputs. Therefore might we not do better by taking an average of the weights produced by different shrinkage methods?

For example, if we averaged the weights produced by naive mean variance (NMV - zero shrinkage) and equal weights (full shrinkage on both inputs), then we're basically shrinking the weights.
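A small numerical sketch of that idea follows. It assumes equal volatility across assets and uses unconstrained weights proportional to the inverse correlation matrix times the SR vector as a stand-in for naive mean variance; the real optimiser may well be constrained, and the inputs here are made up.

```python
import numpy as np


def naive_mean_variance_weights(sr: np.ndarray, corr: np.ndarray) -> np.ndarray:
    """Unconstrained weights proportional to inverse(corr) @ sr, scaled to sum to one."""
    raw = np.linalg.solve(corr, sr)
    return raw / raw.sum()


# Made-up inputs for three assets
toy_sr = np.array([0.5, 0.3, 0.1])
toy_corr = np.array(
    [
        [1.0, 0.5, 0.0],
        [0.5, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
)

nmv = naive_mean_variance_weights(toy_sr, toy_corr)
equal = np.full(3, 1.0 / 3.0)
averaged = 0.5 * nmv + 0.5 * equal

print("NMV:     ", nmv.round(3))
print("Equal:   ", equal.round(3))
print("Averaged:", averaged.round(3))
```

The averaged weights are exactly the NMV weights pulled halfway towards equal weights, which is shrinkage applied to the output rather than to the inputs.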

This leaves us with two open questions (apart from the obvious question, which is how long I will continue flogging this subject to death):

  • What are we averaging?
  • What averaging weights should we use?
For the second part I'm going to keep things simple and just use equal weights. For the first part, consider this grid of shrinkage options. This is a subset of what we have seen before:

     0.00  0.50  1.00
0.0     A     B     C
0.5     D     E     F
1.0     G     H     J


Each row is a different SR shrinkage. Each column is another level of correlation shrinkage. There are 9 options of shrinkage. Some have special names. A is no shrinkage; naive mean variance. B is closest to the optimal shrinkage from the EPO paper I have referenced before. E is not that different from the empirical option I selected in the previous post. J is full shrinkage; equal weights. 

Now if I said to me "Rob, you can only choose two options from this list", I would select:
  • A and J
  • or perhaps, C and G
If allowed three options, I'd throw in E, so:

  • A, E,J
  • C, E, G
With four options I would hit the corners:
  • A,C,G,J
Finally with five options I would hit the corners and the centre:
  • A,C,G,J, E
With everything equally weighted in all of the above. This gives me six different permutations. These in turn can be compared to each of the individual shrinkage options (since we have to calculate them anyway...), so we're comparing 15 possibles.
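The bookkeeping for the 15 candidates might look something like this. The dictionary layout and function name are hypothetical, and the weights passed in at the end are toy values rather than optimisation output.

```python
import numpy as np

# (SR shrinkage, correlation shrinkage) for each letter in the grid
GRID = {
    "A": (0.0, 0.0), "B": (0.0, 0.5), "C": (0.0, 1.0),
    "D": (0.5, 0.0), "E": (0.5, 0.5), "F": (0.5, 1.0),
    "G": (1.0, 0.0), "H": (1.0, 0.5), "J": (1.0, 1.0),
}
COMBOS = ["AJ", "CG", "AEJ", "CEG", "ACGJ", "ACEGJ"]


def combine(weights_by_option: dict, combo: str) -> np.ndarray:
    """Equally weighted average of the weights from each letter in the combo."""
    return np.mean([weights_by_option[letter] for letter in combo], axis=0)


all_options = list(GRID) + COMBOS
print(len(all_options), "options:", all_options)

# Toy two-asset weights for just the options needed by the 'AJ' combo
toy_weights = {"A": np.array([1.0, 0.0]), "J": np.array([0.5, 0.5])}
print(combine(toy_weights, "AJ"))
```

Since the nine individual sets of weights are needed anyway, each combo only costs an extra average.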

I'm going to use exactly the same set up as the previous post; randomly chosen portfolios of nine trading rules for a random instrument; varying the size of the in sample and out of sample periods.

Note: yes the title is an allusion to this paper.


One year in sample, one year out of sample

       SR median  SR 0.05  T statistic
A          0.029   -1.531        0.104
B          0.014   -1.549        0.745
C          0.008   -1.598        0.025
D          0.031   -1.527        0.507
E          0.038   -1.573          NaN
F          0.014   -1.572        0.028
G          0.008   -1.729        0.004
H          0.023   -1.704        0.036
J          0.012   -1.694        0.019
AJ         0.013   -1.584        0.031
CG         0.003   -1.668        0.001
AEJ        0.006   -1.603        0.020
CEG        0.002   -1.591        0.001
ACGJ       0.001   -1.610        0.001
ACEGJ      0.005   -1.595        0.001

     0.00  0.50  1.00
0.0     A     B     C
0.5     D     E     F
1.0     G     H     J


Hopefully the format of this table makes sense. The first two columns are median SR across all the random portfolios, and the 5% point of the distribution of random portfolios. You can see that option E is best at the median point (0.5 shrinkage on both), but D is slightly better at the 5% point. The final column is the result of a paired t-statistic comparing the optimal choice (which has NaN in this column) and the choice on the appropriate line. A number below 0.05 means the optimum is significantly better at a 5% critical value, i.e. there is a 95% or more chance it isn't just pure luck.  One benefit of doing this optimisation is that there are fewer options, plus no random methods; so it's very quick. Hence I can get more reasonable t-statistics here (I have 4,000 values in my sample). 

But you can still see that E isn't significantly better than D, nor than A or B. C, F, H and J are significantly worse at a 5% level, but not at a 1% level. All the values in the top left quadrant are fine.
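The three columns of the table can be sketched in a few lines. This is an illustration under assumptions rather than the code behind the numbers: the function names are hypothetical, the test is taken to be two sided, and a normal approximation stands in for the t distribution (reasonable with thousands of pairs; scipy.stats.ttest_rel would be the usual tool). The data in the usage lines is randomly generated, not real.

```python
import math
import random
from statistics import NormalDist, mean, median, quantiles, stdev


def summarise(sr_values):
    """Median and 5% point of out of sample SR across random portfolios."""
    # quantiles(n=20) returns 19 cut points; the first one is the 5% point
    five_pct = quantiles(sr_values, n=20)[0]
    return median(sr_values), five_pct


def paired_p_value(sr_best, sr_other):
    """Two sided p-value for a paired test on matched portfolios.

    ASSUMPTION: normal approximation to the t distribution, to stay within
    the standard library; fine for large samples only.
    """
    diffs = [b - o for b, o in zip(sr_best, sr_other)]
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


# Made-up example: 4,000 paired outcomes where 'other' is worse on average
random.seed(0)
best = [random.gauss(0.04, 1.0) for _ in range(4000)]
other = [b - 0.05 + random.gauss(0, 0.5) for b in best]
print(summarise(best), paired_p_value(best, other))
```

The test is paired because both options are run on the same random portfolio and period, so most of the noise in SR cancels out in the differences.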

You will remember from the last post that the optimum was shrinkage of 0.25 SR, 0.6 correlations, but also that there was almost no statistical difference between sensible shrinkage results. Option E is closest to that previous optimum.

Sadly none of the new 'combo' options are any good.

One year in sample, five years out of sample

       SR median  SR 0.05  T statistic
A 0.148 -0.628 0.0
B 0.157 -0.603 0.0
C 0.164 -0.578 0.0
D 0.140 -0.623 0.0
E 0.156 -0.600 0.0
F 0.161 -0.582 0.0
G 0.155 -0.594 0.0
H 0.170 -0.567 0.0
J 0.193 -0.491 NaN
AJ 0.173 -0.549 0.0
CG 0.176 -0.539 0.0
AEJ 0.170 -0.571 0.0
CEG 0.175 -0.561 0.0
ACGJ 0.178 -0.541 0.0
ACEGJ 0.179 -0.542 0.0
      0.00  0.50  1.00
0     A       B    C 
0.5   D       E    F 
1.0   G       H    J

Some amazing significance there - basically equal weights is better than everything by some margin. This is exactly the result from before. And the combos don't perform as well.
 

Five years in sample, one year out of sample


       SR median  SR 0.05  T statistic
A 0.051 -1.742 0.276
B 0.057 -1.769 NaN
C 0.052 -1.763 0.147
D 0.048 -1.738 0.825
E 0.049 -1.757 0.883
F 0.043 -1.754 0.062
G 0.010 -1.842 0.005
H 0.026 -1.767 0.011
J 0.004 -1.803 0.008
AJ 0.025 -1.759 0.031
CG 0.019 -1.789 0.013
AEJ 0.042 -1.757 0.355
CEG 0.034 -1.749 0.043
ACGJ 0.016 -1.747 0.081
ACEGJ 0.037 -1.745 0.047
      0.00  0.50  1.00
0     A       B    C 
0.5   D       E    F 
1.0   G       H    J

Again, the newer combo methods aren't much cop although AEJ is a little better than J.

Five years in sample, five years out of sample


       SR median  SR 0.05  T statistic
A 0.159 -0.666 0.000
B 0.166 -0.648 0.365
C 0.165 -0.642 0.233
D 0.157 -0.666 0.000
E 0.173 -0.660 NaN
F 0.168 -0.647 0.259
G 0.128 -0.676 0.000
H 0.141 -0.667 0.000
J 0.144 -0.660 0.000
AJ 0.162 -0.659 0.051
CG 0.157 -0.653 0.001
AEJ 0.172 -0.667 0.086
CEG 0.162 -0.654 0.013
ACGJ 0.161 -0.655 0.030
ACEGJ 0.163 -0.658 0.054
      0.00  0.50  1.00
0     A       B    C 
0.5   D       E    F 
1.0   G       H    J

Again the middle ground of E is the best; we're also seeing more extreme shrinkage (the bottom row) do very badly as does mean variance. None of the combos do as well.

Ten years in sample, one year out of sample

       SR median  SR 0.05  T statistic
A -0.016 -1.850 0.166
B -0.025 -1.827 0.269
C -0.015 -1.775 0.223
D -0.007 -1.853 NaN
E -0.028 -1.815 0.991
F -0.017 -1.795 0.654
G -0.076 -1.894 0.000
H -0.082 -1.883 0.268
J -0.097 -1.856 0.000
AJ -0.074 -1.846 0.095
CG -0.069 -1.858 0.000
AEJ -0.057 -1.843 0.622
CEG -0.058 -1.844 0.001
ACGJ -0.069 -1.857 0.006
ACEGJ -0.058 -1.840 0.004
      0.00  0.50  1.00
0     A       B    C 
0.5   D       E    F 
1.0   G       H    J

I struggled to get statistical significance for this set before; I have some now, but basically again somewhere in the region of D and E is best. Combo methods do not win though again AEJ isn't significantly worse.

Ten years in sample, five years out of sample

       SR median  SR 0.05  T statistic
A 0.089 -0.825 0.962
B 0.095 -0.865 0.578
C 0.099 -0.848 0.083
D 0.086 -0.842 0.581
E 0.095 -0.852 0.431
F 0.101 -0.848 NaN
G 0.072 -0.923 0.000
H 0.076 -0.930 0.000
J 0.089 -0.939 0.000
AJ 0.082 -0.889 0.000
CG 0.082 -0.908 0.000
AEJ 0.089 -0.874 0.000
CEG 0.091 -0.907 0.000
ACGJ 0.085 -0.896 0.000
ACEGJ 0.088 -0.892 0.000
      0.00  0.50  1.00
0     A       B    C 
0.5   D       E    F 
1.0   G       H    J

'Somewhere in the middle row' isn't a song from Wizard of OverFitting; but roughly where you want to be once again. The combo results are a dismal failure.

Summary

Someone once told me "I love your blog and your books because you talk about failures as well as successes". Well whoever that was - you'll have loved this one! 

Monday, 15 June 2026

FIFA* World Cup (*Fitting and Forecasting Actual data) Portfolio Optimisation competition with real returns

This is my fourth post in my summer 2026 mini series on portfolio optimisation. 

It will very much follow the format of (also with a sports alluding title) blog post number two, so it might be worth rereading that. A reminder if you can't be bothered, I used random data to compare some optimisation methods:

  • Monte Carlo (random, parametric)
  • bootstrapping (random, non parametric)
  • double shrinkage (shrinking SR towards average SR, and correlations to zero). This encompasses some other methods including:
    • NMV naive mean variance (no shrinkage on anything)
    • EW equal weights (both full shrinkage)
    • MD maximum diversification (no shrinkage correlation, full shrinkage on SR)
    •  EPO (we just shrink the correlation matrix to some degree)
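As a reminder of what double shrinkage does to the inputs, here is a minimal sketch. The function names are hypothetical, and I'm assuming the correlation shrinkage is linear towards zero in the same way the SR shrinkage is a linear blend towards the average. It only shows the shrinkage step, not the optimisation that follows.

```python
def shrink_sr(sr_estimates, shrinkage):
    """Shrink each estimated SR towards the average SR across assets."""
    avg = sum(sr_estimates) / len(sr_estimates)
    return [(1 - shrinkage) * sr + shrinkage * avg for sr in sr_estimates]


def shrink_correlation(corr, shrinkage):
    """Shrink off diagonal correlations towards zero; diagonals stay at 1.

    ASSUMPTION: linear shrinkage, so 1.0 gives an identity matrix.
    """
    n = len(corr)
    return [
        [1.0 if i == j else (1 - shrinkage) * corr[i][j] for j in range(n)]
        for i in range(n)
    ]


# (SR shrinkage, correlation shrinkage) for the named special cases.
# The EPO entry uses the 0.75 correlation shrinkage quoted later in this post.
SPECIAL_CASES = {
    "NMV": (0.0, 0.0),   # naive mean variance
    "EW": (1.0, 1.0),    # equal weights
    "MD": (1.0, 0.0),    # maximum diversification
    "EPO": (0.0, 0.75),  # correlation shrinkage only
}

print(shrink_sr([0.2, 0.4, 0.6], 0.5))
print(shrink_correlation([[1.0, 0.5], [0.5, 1.0]], 0.5))
```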

I found that MC/Bootstrap were the best, and didn't require any pesky estimation of the shrinkage meta-parameter. But they are SLOW. I worked out you'd need quite a few iterations to get the weights to converge, so each optimisation took quite a while. Should you wish to estimate that meta-parameter, I found that for random data with a nice stable distribution you didn't need much shrinkage. A little bit on the Sharpe Ratio was optimal; a little more wouldn't harm things much, but a lot was bad.

However as we know from post three, real data is not as nice as random data, and is much harder to forecast. It has a habit of doing annoying things, like changing its distribution when you're not looking. So we're expecting that we will need, for example, more shrinkage to reflect this.

The real data we will be using will be many different runs, each consisting of 9 randomly selected trading rules, chosen for a single randomly chosen instrument. Because we know from post one that fitting within instruments is the way to go. Although I currently have 40 trading rules in my actual portfolio, I am sticking with nine now for speed and intuition. Plus the results shouldn't be too different with more components - that is something I will be looking at later in the series. I'm sampling with replacement so it's feasible - but very unlikely - I'll get the same instrument/rule set more than once.

As per my previous posts I'm also going to compare the results for different lengths of data. In the random data post I could generate as much data as I wanted; that's tricky here when the absolute longest history I have for any instrument is just over 50 years and many are much less than that. So I'm going to use in sample lengths of 1 year, 5 years and 10 years; and out of sample lengths of 1 year and 5 years. If an instrument doesn't have sufficient data for a given pairing I won't use it; e.g. for 10 years/5 years I would need 15 years, which will be tricky for many instruments, whilst for 1 year/1 year I would just need 2 years obviously. If it has more data than required, then on a given random run I'll randomly select the required 2 to 15 year long period.
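The window selection can be sketched as follows. This is an illustration, not the actual code: pick_window is a hypothetical name, time is measured in years from the start of an instrument's history, and I'm assuming the start point is drawn uniformly from every position that leaves room for the full in sample plus out of sample period.

```python
import random


def pick_window(years_available, in_sample_years, out_sample_years, rng=random):
    """Return (in_start, split, out_end) in years from the start of history,
    or None if the instrument is too short for this pairing."""
    needed = in_sample_years + out_sample_years
    if years_available < needed:
        return None  # instrument is skipped for this in/out pairing
    # ASSUMPTION: uniform random start among all windows that fit
    start = rng.uniform(0, years_available - needed)
    return start, start + in_sample_years, start + needed


print(pick_window(12, 10, 5))  # too short for 10 years/5 years
print(pick_window(50, 10, 5, rng=random.Random(1)))
```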

First some speed statistics. We already know that shrinkage will be darn quick, but as I'm using different data lengths from the prior post it's probably worth repeating the stats for montecarlo and bootstrap:

              1 year in sample       5 years in sample      10 years in sample

BS          9.2                     20.6                       33.3

MC          5.1                      6.6                        8.0

Remember from the previous post that convergence is quicker with Monte Carlo than with Bootstrap, hence the substantially longer time taken to do BS, which needs twice as many iterations. There is also a slight difference in implementation per iteration, which explains the even worse performance of BS with longer in sample periods.

Results

One year in sample, One year out of sample

Let's begin with the median results. For the moment I'm going to present two data frames. The first is just Sharpe Ratios. Here is the one for an insample and out of sample period of just one year:

      0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0     0.056  0.057  0.039  0.054  0.047  0.049  0.046  0.055  0.032
0.25  0.061  0.045  0.054  0.063  0.057  0.049  0.044  0.042  0.037
0.5   0.059  0.057  0.048  0.047  0.044  0.046  0.041  0.046  0.044
0.75  0.049  0.041  0.062  0.058  0.041  0.026  0.029  0.047  0.033
0.8   0.030  0.041  0.061  0.054  0.041  0.025  0.026  0.026  0.032
0.85  0.016  0.038  0.050  0.035  0.030  0.029  0.023  0.025  0.036
0.9  -0.002  0.022  0.043  0.030  0.041  0.038  0.029  0.024  0.052
0.95  0.015  0.022  0.045  0.043  0.049  0.043  0.049  0.034  0.056
1.0   0.014  0.003  0.038  0.060  0.056  0.058  0.043  0.032  0.004
MC   -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000
BS    0.018  0.018  0.018  0.018  0.018  0.018  0.018  0.018  0.018

This will look very familiar if you looked at the previous post on random data, but there are a couple of extra rows. From the top then each column shows a different degree of correlation shrinkage. On the left 0.0 is no shrinkage where we used the estimated data. 1.0 is full shrinkage, where all correlations are set to zero. Apart from the diagonals. Obviously. Each row then is a different degree of SR shrinkage, from the top row where we use no shrinkage, down to the row labelled 1.0 where we fully shrink all SR to the average SR across assets. 

The bottom two rows are the results for Monte Carlo and Bootstrapping. There is no shrinkage here, so for consistency I've just copied the single value for each across all columns. 

Some elements of interest in the main part of the table: the top left corner (0.0, 0.0) is naive mean variance with no shrinkage, the top right (0.0, 1.0) is full correlation shrinkage, the bottom left (1.0, 0.0) is full SR shrinkage, and the bottom right (1.0, 1.0) is full shrinkage on both which leads to equal weights. The EPO empirical optimum is (0, 0.75).

The optimum value here has some shrinkage: 0.25 on SR and 0.60 on correlations. 
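If you want to check that from the table yourself, a few lines of plain Python will find the best cell. The numbers are copied from the data frame above (shrinkage grid only, without the MC and BS rows); the variable names are just for illustration.

```python
corr_shrinkage = [0.00, 0.20, 0.40, 0.60, 0.70, 0.75, 0.80, 0.90, 1.00]
median_sr = {  # SR shrinkage -> row of median out of sample SR
    0.00: [0.056, 0.057, 0.039, 0.054, 0.047, 0.049, 0.046, 0.055, 0.032],
    0.25: [0.061, 0.045, 0.054, 0.063, 0.057, 0.049, 0.044, 0.042, 0.037],
    0.50: [0.059, 0.057, 0.048, 0.047, 0.044, 0.046, 0.041, 0.046, 0.044],
    0.75: [0.049, 0.041, 0.062, 0.058, 0.041, 0.026, 0.029, 0.047, 0.033],
    0.80: [0.030, 0.041, 0.061, 0.054, 0.041, 0.025, 0.026, 0.026, 0.032],
    0.85: [0.016, 0.038, 0.050, 0.035, 0.030, 0.029, 0.023, 0.025, 0.036],
    0.90: [-0.002, 0.022, 0.043, 0.030, 0.041, 0.038, 0.029, 0.024, 0.052],
    0.95: [0.015, 0.022, 0.045, 0.043, 0.049, 0.043, 0.049, 0.034, 0.056],
    1.00: [0.014, 0.003, 0.038, 0.060, 0.056, 0.058, 0.043, 0.032, 0.004],
}

# Scan every cell and keep the one with the highest median SR
best_sr, best_sr_shrink, best_corr_shrink = max(
    (value, sr_shrink, corr_shrinkage[col])
    for sr_shrink, row in median_sr.items()
    for col, value in enumerate(row)
)
print(best_sr_shrink, best_corr_shrink, best_sr)  # 0.25 0.6 0.063
```

Note how close the runners up are (0.062 and 0.061 appear elsewhere in the grid), which is a hint of how flat this surface is.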

Compare and contrast that with the results for random data. The optimal shrinkage was barely anything: 0.25 SR, 0 correlations or thereabouts. It isn't surprising we need more shrinkage in general. Remember from the previous post in this series on random data:

Essentially random data sets a lower bound on robustness calibration. For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1.

However the amount of optimal SR versus correlation shrinkage might seem surprising. Quoting now from post three in this series, on forecasting statistical parameters with real data:

In simple terms, we are a little bit worse than forecasting Sharpe Ratios in real data one year ahead than we would be with random data, but a LOT worse with correlations. Partly this is because we are pretty terrible at forecasting SR one year ahead anyway even with a stable underlying distribution; we don't do much worse with real data. However it does seem that correlations are far more unstable in reality than in randomly generated data.... If we recall from the prior post that the optimal shrinkage is zero on correlations with random data; we can now see why with actual data we'd probably want to opt for some correlation shrinkage; purely because the sampling error is much larger in practice. That is the empirical finding of the EPO paper. It does feel a bit weird since up to now my gut feeling has been that we have to shrink means a lot because they are much harder to forecast and because they have an outsized effect on portfolio weights compared to differences in correlation. Whilst the latter is still true it seems the former is not.

There are two different effects here, remember: the predictability of each estimate, both compared to random data (where correlation is worse) and outright (where SR is worse); and the different effect each has on MV optimisation (small differences in SR affect the outcome more).

Another surprise might be the relatively poor performance of MC and BS. Remember that the only difference between them is the assumption of joint Gaussian returns in one case and not in the other. In the random data round these two methods were the best performing. Both however are making an implicit assumption that there is a stable distribution (parametric in one case, not in the other), and that any variance in outcome over the out of sample period will be the same as would be expected from the sampling distribution of each parameter. Which is exactly what happens with random data. But we know from post three that the parameter estimates we're making have a wider distribution with real data; and this is especially true for correlations. Hence, the MC/BS methods are too optimistic about predictability and their weights are suboptimal compared to those produced by high shrinkage optimisations.

Note: I have ideas to fix that, which may or may not appear in a subsequent blog post. Briefly they involve playing with the MC parameter inputs to reflect the higher RMSE of real versus random data.

Now let's run a paired t-test comparison of that optimum median value against all other values. Here are the p-values from doing those tests:


      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0     0.91  0.64  0.39  0.63  0.86  0.83  0.90  0.56  0.22
0.25  0.84  0.83  0.99   NaN  0.27  0.24  0.64  0.75  0.37
0.5   0.62  0.37  0.34  0.20  0.07  0.19  0.17  0.54  0.79
0.75  0.76  0.81  0.70  0.70  0.79  0.93  0.85  0.99  0.93
0.8   0.81  0.95  0.81  0.84  0.95  1.00  0.99  0.67  0.97
0.85  0.99  0.97  0.72  0.67  0.76  0.69  0.84  0.60  0.99
0.9   0.92  0.68  0.84  0.85  0.66  0.64  0.74  0.71  0.91
0.95  0.73  0.64  0.91  0.73  0.72  0.60  0.56  0.66  0.94
1.0   0.71  0.68  0.87  0.46  0.41  0.42  0.34  0.42  0.47
MC    0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05
BS    0.06  0.06  0.06  0.06  0.06  0.06  0.06  0.06  0.06

We can see that the optimum value itself is NaN since the p-value is undefined. We can also see that statistically, there isn't much difference in how shrinkage is used. MC and BS are definitely worse.

Now, how do things change if we are more pessimistic? As before I'm going to look at the 5% distributional point of outcomes from my multiple random results. If I do this, the optimal shrinkage is 0.8 on correlations, but a massive 1.0 on SR. At the 25% point it's 0.75 on SR but 1.0 on correlations. We want more shrinkage for sure!

Now let's think about a nice graphical way of showing these values. I'll start with a heatmap of the median SR:



Now I'm going to do something familiar to the students on my course. I'm going to replace every value that is statistically indistinguishable from the optimal median with the optimal median value. Here I will use a 90% critical value:
Here the result looks like a really shit piece of modern art. Since almost all shrinkage values are not significantly different from the optimal; except weirdly correlation shrinkage 0.7 and SR shrinkage 0.5 (which is adjacent to the optimum); it's just a sea of blue. But we can see the MC/BS methods are inferior.
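The masking step is simple enough to sketch. The function name is hypothetical and I'm treating a 90% critical value as a p-value threshold of 0.10. The usage lines apply it to the SR shrinkage 0.5 row of the two tables above, with the optimum median of 0.063.

```python
def mask_insignificant(sr_values, p_values, optimum_sr, alpha=0.10):
    """Replace every SR that can't be told apart from the optimum with the
    optimum itself, so only significantly worse cells keep their own value."""
    return [
        # p == p is False for NaN, so the optimum cell maps to itself
        sr if (p == p and p < alpha) else optimum_sr
        for sr, p in zip(sr_values, p_values)
    ]


# SR shrinkage 0.5 row: median SR and p-values, correlation shrinkage 0.0 to 1.0
sr_row = [0.059, 0.057, 0.048, 0.047, 0.044, 0.046, 0.041, 0.046, 0.044]
p_row = [0.62, 0.37, 0.34, 0.20, 0.07, 0.19, 0.17, 0.54, 0.79]
masked = mask_insignificant(sr_row, p_row, optimum_sr=0.063)
print(masked)
```

Only the fifth entry (correlation shrinkage 0.7, p-value 0.07) keeps its own SR; everything else in the row takes the colour of the optimum.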


One year in sample, Five years out of sample


It isn't obvious but I used the same procedure for this plot which shows SR, with all values that can't be distinguished from the optimum in the same colour as that optimum. But every single value other than the optimum, which is full shrinkage or equal weights, is inferior to that optimum.


Five years in sample, One year out of sample

A very interesting picture here. There's clearly a shrinkage area that doesn't work. Note that the results overall are quite poor.

Five years in sample, Five years out of sample

A little clearer here. Modest shrinkage would work well, but then so would random data. Just don't shrink the SR too much.


Ten years in sample, One year out of sample

Importantly here the critical value is 80%, not 90%. With 90% the whole plot goes one colour. Pretty much any amount of shrinkage works. Again the SR results are very poor.

Ten years in sample, Five years out of sample



Summary of results

Well that was messy. I'd conclude that shrinkage of SR 0.5 and correlation 0.75 (the EPO value) is in the optimum region in almost all time periods. That's a reversal of what my original intuition suggested and I've used before, with more shrinkage on the SR. I've explained at length why my intuition was wrong. The random methods (MC/BS) are also inferior in many cases, as well as being slow.

The exception is one year / five years where you need full shrinkage (equal weights). Using 0.5/0.75 isn't so bad however. Although it's significantly worse, the actual loss in SR is small. Still it does seem logical to use more shrinkage with less data; and we can see from the one year/five year plot that we're better off shrinking SR more. So here is my heuristic rule of thumb:

Five or more years of data: SR shrinkage 0.5, correlation 0.75

Four to five years of data: SR shrinkage 0.6, correlation 0.75

Three to four years of data: SR shrinkage 0.7, correlation 0.80

Two to three years of data: SR shrinkage 0.8, correlation 0.85

One to two years of data: SR shrinkage 0.9, correlation 0.90

One or less than one year of data: SR shrinkage 1.0, correlation 1.0 (equal weights)
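The rule of thumb translates directly into a lookup. In this sketch the function name is hypothetical, and where two bands share a boundary (exactly two, three or four years) I've assumed the longer-data band applies.

```python
def heuristic_shrinkage(years_of_data):
    """Return (SR shrinkage, correlation shrinkage) from the rule of thumb.

    ASSUMPTION: a value sitting exactly on a shared boundary (2, 3 or 4 years)
    falls into the band with more data.
    """
    if years_of_data >= 5:
        return 0.5, 0.75
    if years_of_data >= 4:
        return 0.6, 0.75
    if years_of_data >= 3:
        return 0.7, 0.80
    if years_of_data >= 2:
        return 0.8, 0.85
    if years_of_data > 1:
        return 0.9, 0.90
    return 1.0, 1.0  # one year or less: equal weights


for years in (0.5, 1.5, 2.5, 3.5, 4.5, 10):
    print(years, heuristic_shrinkage(years))
```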

These results are very domain specific. In particular, I'm mostly dealing with holding periods in the weeks and months. A faster trading system would be able to compress the periods above. But the main lesson is that it's very hard to state categorically what the exact amount of shrinkage should be. The surface is mostly too noisy. So don't sweat it. Use a vaguely okay value and you'll do vaguely ok.