Monday, 15 June 2026

FIFA* World Cup (*Fitting and Forecasting Actual data) Portfolio Optimisation competition with real returns

This is my fourth post in my summer 2026 mini series on portfolio optimisation. 

It will very much follow the format of blog post number two (which also had a sports-alluding title), so it might be worth rereading that. A reminder if you can't be bothered: I used random data to compare some optimisation methods:

  •  monte carlo (random, parametric)
  •  bootstrapping (random, non parametric)
  • double shrinkage (shrinking SR towards average SR, and correlations to zero). This encompasses some other methods including:
    • NMV naive mean variance (no shrinkage on anything)
    • EW equal weights (both full shrinkage)
    • MD maximum diversification (no shrinkage correlation, full shrinkage on SR)
    •  EPO (we just shrink the correlation matrix to some degree)

I found that MC/Bootstrap were the best, and didn't require any pesky estimation of the shrinkage meta-parameter. But they are SLOW. I worked out you'd need quite a few iterations to get the weights to converge, so each optimisation took quite a while. Should you wish to estimate that meta-parameter I found that for random data with a nice stable distribution that you didn't need much shrinkage. A little bit on the Sharpe Ratio was the most optimal; a little more wouldn't harm things much, but a lot was bad.  

However as we know from post three, real data is not as nice as random data, and is much harder to forecast. It has a habit of doing annoying things, like changing its distribution when you're not looking. So we're expecting that we will need, for example, more shrinkage to reflect this.

The real data we will be using will be many different runs, each consisting of 9 randomly selected trading rules, chosen for a single randomly chosen instrument. Because we know from post one that fitting within instruments is the way to go. Although I currently have 40 trading rules in my actual portfolio, I am sticking with nine now for speed and intuition. Plus the results shouldn't be too different with more components - that is something I will be looking at later in the series. I'm sampling with replacement so it's feasible - but very unlikely - I'll get the same instrument/rule set more than once.

As per my previous posts I'm also going to compare the results for different lengths of data. In the random data post I could generate as much data as I want; that's tricky here when the absolute longest history I have for any instrument is just over 50 years and many are much less than that. So I'm going to use in sample lengths of 1 year, 5 years and 10 years; and out of sample lengths of 1 year and 5 years. If an instrument doesn't have sufficient data for a given pairing I won't use it; eg for 10 years/5 years I would need 15 years which will be tricky for many instruments whilst for 1 year/1 year I would just need 2 years obviously. If it has more data than required, then on a given random run I'll randomly select the required 2 to 15 year long period.
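
The window selection described above could be sketched roughly as follows. This is only an illustration: the function name `pick_window`, the use of 256 business days per year (the convention from the random data post) and the uniform choice of start point are all assumptions rather than the code actually used.

```python
import random

DAYS_PER_YEAR = 256  # assumption: same 'year' length as in the random data post


def pick_window(n_days_available, in_years, out_years, rng=random):
    """Return (in_sample_slice, out_of_sample_slice), or None if history is too short."""
    n_in = in_years * DAYS_PER_YEAR
    n_out = out_years * DAYS_PER_YEAR
    needed = n_in + n_out
    if n_days_available < needed:
        return None  # instrument is not used for this in/out of sample pairing
    # randint is inclusive at both ends, so the window can finish on the last day
    start = rng.randint(0, n_days_available - needed)
    return slice(start, start + n_in), slice(start + n_in, start + needed)


# Hypothetical usage: a 20 year history, 10 years in sample, 5 years out of sample
window = pick_window(20 * DAYS_PER_YEAR, in_years=10, out_years=5)
```

The out of sample slice always starts exactly where the in sample slice ends, so the two never overlap.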

First some speed statistics. We already know that shrinkage will be darn quick, but as I'm using different data lengths from the prior post it's probably worth repeating the stats for montecarlo and bootstrap:

              1 year in sample       5 years in sample      10 years in sample
BS            9.2                    20.6                   33.3
MC            5.1                    6.6                    8.0

Remember from the previous post that convergence is quicker with Monte Carlo than with Bootstrap, hence the substantially longer time taken to do BS, which needs twice as many iterations. There is also a slight difference in implementation per iteration, which explains the even worse performance of BS with longer in sample periods.

Results

One year in sample, One year out of sample

Let's begin with the median results. For the moment I'm going to present two data frames. The first is just Sharpe Ratios. Here is the one for an insample and out of sample period of just one year:

      0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0     0.056  0.057  0.039  0.054  0.047  0.049  0.046  0.055  0.032
0.25  0.061  0.045  0.054  0.063  0.057  0.049  0.044  0.042  0.037
0.5   0.059  0.057  0.048  0.047  0.044  0.046  0.041  0.046  0.044
0.75  0.049  0.041  0.062  0.058  0.041  0.026  0.029  0.047  0.033
0.8   0.030  0.041  0.061  0.054  0.041  0.025  0.026  0.026  0.032
0.85  0.016  0.038  0.050  0.035  0.030  0.029  0.023  0.025  0.036
0.9  -0.002  0.022  0.043  0.030  0.041  0.038  0.029  0.024  0.052
0.95  0.015  0.022  0.045  0.043  0.049  0.043  0.049  0.034  0.056
1.0   0.014  0.003  0.038  0.060  0.056  0.058  0.043  0.032  0.004
MC   -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000 -0.000
BS    0.018  0.018  0.018  0.018  0.018  0.018  0.018  0.018  0.018

This will look very familiar if you looked at the previous post on random data, but there are a couple of extra rows. From the top then each column shows a different degree of correlation shrinkage. On the left 0.0 is no shrinkage where we used the estimated data. 1.0 is full shrinkage, where all correlations are set to zero. Apart from the diagonals. Obviously. Each row then is a different degree of SR shrinkage, from the top row where we use no shrinkage, down to the row labelled 1.0 where we fully shrink all SR to the average SR across assets. 

The bottom two rows are the results for Monte Carlo and Bootstrapping. There is no shrinkage here, so for consistency I've just copied the single value for each across all columns. 

Some elements of interest in the main part of the table, the top left corner (0.0, 0.0) is naive mean variance with no shrinkage, the top right (0.0, 1.0) is full correlation shrinkage, the bottom left (1.0, 0.0) is full SR shrinkage, and the bottom right (1.0, 1.0) is full shrinkage on both which leads to equal weights. The EPO empirical optimal is (0, 0.75).

The optimum value here has some shrinkage: 0.25 on SR and 0.60 on correlations. 

Compare and contrast that with the results for random data. The optimal shrinkage was barely anything: 0.25 SR, 0 correlations or thereabouts. It isn't surprising we need more shrinkage in general. Remember from the previous post in this series on random data:

Essentially random data sets a lower bound on robustness calibration. For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1.

However the amount of optimal SR versus correlation shrinkage might seem surprising. Quoting now from post three in this series, on forecasting statistical parameters with real data:

In simple terms, we are a little bit worse at forecasting Sharpe Ratios in real data one year ahead than we would be with random data, but a LOT worse with correlations. Partly this is because we are pretty terrible at forecasting SR one year ahead anyway even with a stable underlying distribution; we don't do much worse with real data. However it does seem that correlations are far more unstable in reality than in randomly generated data.... If we recall from the prior post that the optimal shrinkage is zero on correlations with random data; we can now see why with actual data we'd probably want to opt for some correlation shrinkage; purely because the sampling error is much larger in practice. That is the empirical finding of the EPO paper. It does feel a bit weird since up to now my gut feeling has been that we have to shrink means a lot because they are much harder to forecast and because they have an outsized effect on portfolio weights compared to differences in correlation. Whilst the latter is still true it seems the former is not.

Remember there are two different effects here. The first is predictability: relative to random data correlations come off worse, whilst in outright terms SR are harder to predict. The second is the different effect each estimate has on MV optimisation (small differences in SR affect the outcome more).

Another surprise might be the relatively poor performance of MC and BS. Remember that the only difference between them is the assumption of joint Gaussian returns in one case and not in the other. In the random data round these two methods were the best performing. Both however are making an implicit assumption that there is a stable distribution (parametric in one case, not in the other), and that any variance in outcome over the out of sample period will be the same as would be expected from the sampling distribution of each parameter. Which is exactly what happens with random data. But we know from post three that the parameter estimates we're making have a wider distribution with real data; and this is especially true for correlations. Hence, the MC/BS methods are too optimistic about predictability and their weights are suboptimal compared to those produced by high shrinkage optimisations.

Note: I have ideas to fix that, which may or may not appear in a subsequent blog post. Briefly they involve playing with the MC parameter inputs to reflect the higher RMSE of real versus random data.

Now let's run a paired t-test comparison of that optimum median value against all other values. Here are the p-values from doing those tests:


      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0     0.91  0.64  0.39  0.63  0.86  0.83  0.90  0.56  0.22
0.25  0.84  0.83  0.99  NaN   0.27  0.24  0.64  0.75  0.37
0.5   0.62  0.37  0.34  0.20  0.07  0.19  0.17  0.54  0.79
0.75  0.76  0.81  0.70  0.70  0.79  0.93  0.85  0.99  0.93
0.8   0.81  0.95  0.81  0.84  0.95  1.00  0.99  0.67  0.97
0.85  0.99  0.97  0.72  0.67  0.76  0.69  0.84  0.60  0.99
0.9   0.92  0.68  0.84  0.85  0.66  0.64  0.74  0.71  0.91
0.95  0.73  0.64  0.91  0.73  0.72  0.60  0.56  0.66  0.94
1.0   0.71  0.68  0.87  0.46  0.41  0.42  0.34  0.42  0.47
MC    0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05
BS    0.06  0.06  0.06  0.06  0.06  0.06  0.06  0.06  0.06

We can see that the optimum value itself is NaN since the p-value is undefined. We can also see that statistically, there isn't much difference in how shrinkage is used. MC and BS are definitely worse.

Now, how do things change if we are more pessimistic? As before I'm going to look at the 5% distributional point of outcomes from my multiple random results. If I do this, the optimal shrinkage is 0.8 on correlations, but a massive 1.0 on SR. At the 25% point it's 0.75 on SR but 1.0 on correlations. We want more shrinkage for sure!

Now let's think about a nice graphical way of showing these values. I'll start with a heatmap of the median SR:



Now I'm going to do something familiar to the students on my course. I'm going to replace every value that is not significantly different from the optimal median with the optimal median value. Here I will use a 90% critical value:
Here the result looks like a really shit piece of modern art. Since almost all shrinkage values are not significantly different from the optimal; except weirdly correlation shrinkage 0.7 and SR shrinkage 0.5 (which is adjacent to the optimum); it's just a sea of blue. But we can see the MC/BS methods are inferior.
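
As a rough illustration of that masking step, here is a sketch using the SR shrinkage 0.5 row from the two tables above. The function name `mask_insignificant` and the use of numpy are assumptions; this is not necessarily how the plots were produced.

```python
import numpy as np

# Row for SR shrinkage 0.5 (columns are correlation shrinkage 0.00 ... 1.00)
median_sr = np.array([0.059, 0.057, 0.048, 0.047, 0.044, 0.046, 0.041, 0.046, 0.044])
p_values = np.array([0.62, 0.37, 0.34, 0.20, 0.07, 0.19, 0.17, 0.54, 0.79])
optimum_sr = 0.063  # median SR at SR shrinkage 0.25, correlation shrinkage 0.60


def mask_insignificant(sr, p, optimum, critical=0.90):
    """Replace every SR that can't be distinguished from the optimum with the optimum."""
    # A NaN p-value compares as False, so the optimum cell itself is treated as 'same'
    significant = p < (1.0 - critical)
    return np.where(significant, sr, optimum)


masked = mask_insignificant(median_sr, p_values, optimum_sr)
```

In this row only the correlation shrinkage 0.7 cell (p-value 0.07) keeps its own value; everything else takes the colour of the optimum.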


One year in sample, Five years out of sample


It isn't obvious but I used the same procedure for this plot which shows SR, with all values that can't be distinguished from the optimum in the same colour as that optimum. But every single value other than the optimum, which is full shrinkage or equal weights, is inferior to that optimum.


Five years in sample, One year out of sample

A very interesting picture here. There's clearly a shrinkage area that doesn't work.

Five years in sample, Five years out of sample

A little clearer here. Modest shrinkage would work well, but then so would random data. Just don't shrink the SR too much.


Ten years in sample, One year out of sample

Importantly here the critical value is 80%, not 90%. With 90% the whole plot goes one colour. Pretty much any amount of shrinkage works.

Ten years in sample, Five years out of sample



Summary of results

Well that was messy. I'd conclude that shrinkage of SR 0.5 and correlation 0.75 (the EPO value) is in the optimum region in almost all time periods. That's a reversal of what my original intuition suggested, and of what I've used before, with more shrinkage on the SR. I've explained at length why my intuition was wrong. The random methods (MC/BS) are also inferior in many cases, as well as being slow.

The exception is one year / five years where you need full shrinkage (equal weights). Using 0.5/0.75 isn't so bad however. Although it's significantly worse, the actual loss in SR is small. Still it does seem logical to use more shrinkage with less data; and we can see from the one year/five year plot that we're better off shrinking SR more. So here is my heuristic rule of thumb:

Five or more years of data: SR shrinkage 0.5, correlation 0.75

Four to five years of data: SR shrinkage 0.6, correlation 0.75

Three to four years of data: SR shrinkage 0.7, correlation 0.80

Two to three years of data: SR shrinkage 0.8, correlation 0.85

One to two years of data: SR shrinkage 0.9, correlation 0.90

One or less than one year of data: SR shrinkage 1.0, correlation 1.0 (equal weights)
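
The rule of thumb above can be written as a small lookup. This is just a sketch: the function name is made up, and the handling of the exact boundaries (for example whether exactly four years counts as 'three to four' or 'four to five') is an assumption, since the list overlaps at the edges.

```python
def shrinkage_rule_of_thumb(years_of_data):
    """Return (SR shrinkage, correlation shrinkage) for a given length of history."""
    if years_of_data <= 1:
        return 1.0, 1.0   # equal weights
    if years_of_data < 2:
        return 0.9, 0.90
    if years_of_data < 3:
        return 0.8, 0.85
    if years_of_data < 4:
        return 0.7, 0.80
    if years_of_data < 5:
        return 0.6, 0.75
    return 0.5, 0.75      # five or more years


sr_shrinkage, corr_shrinkage = shrinkage_rule_of_thumb(3.5)
```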

These results are very domain specific. In particular, I'm mostly dealing with holding periods in the weeks and months. A faster trading system would be able to compress the periods above. But the main lesson is that it's very hard to state categorically what the exact amount of shrinkage should be. The surface is mostly too noisy. So don't sweat it. Use a vaguely okay value and you'll do vaguely ok.


Monday, 8 June 2026

Forecasting statistical estimates when data gets real

 This is my third post in a series about optimisation and fitting. In my previous post I used random data to calibrate and evaluate many portfolio optimisation techniques. It's worth quoting in full from that post:

Random data is not real data: Well duh. But why is this important? Because random data is drawn from a fixed and well behaved distribution. This means the optimiser only has to discover / estimate the parameters of that distribution as more data is revealed to it. But real data doesn't have a fixed and known distribution. It doesn't actually have any distribution at all. We just model it hoping it does.

To summarise then, random data from a fixed distribution differs from real data in three important ways:

  • There is no distribution! We just assume there is one.
  • The distribution (which doesn't exist) is not known, and thus it's likely the distribution we assume is the wrong one. This is especially true for modelling underlying financial price returns with joint Gaussian models.
  • The distribution (which again, doesn't exist) isn't fixed, but can change over time.

And it has one thing in common:

  • The unknown parameters of the distribution are unknown and have to be learned over time.

In this post I'm going to explore this learning process for two key statistical estimates: correlations and Sharpe Ratios. What I am interested in is how much wider of the mark our estimates for these two things are likely to be for real data vs random data. This obviously has important implications for optimisation.


Let's look at a plot.



This is for random data generated by a process with a true SR of 1. It shows the evolution of the SR and its statistical distribution as it is re-estimated each year. There is a burn in year which is missing, and then in the first year we can see our estimate of the SR using all available data so far (in orange), and the SR for the current year (in blue). You can see that the orange line is lagged by a year as it is purely out of sample and always a year behind. I've then used the orange line to estimate the theoretical sampling distribution of the Sharpe Ratio for a one year period, and constructed a 1.96x confidence interval (so about 95%) around the orange line which are the green and red lines.

Note: The theoretical standard deviation of the sampling distribution of the Sharpe Ratio, assuming i.i.d. returns, is sqrt[(1+0.5SR^2)/N] where N is the number of periods.

Broadly speaking if our estimates are correct then we'd hope to see around 1/20 of the blue points outside the red and green lines, and around 19/20 on the inside. There are 40 years of data here and we go outside the range twice, which is roughly what we'd expect.
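
A minimal sketch of that confidence band, using the formula in the note above. The function names are made up, and it assumes the SR and the number of periods are in the same units (an annual SR with N measured in years).

```python
import math


def sr_sampling_std(sr, n_periods):
    """Std dev of the SR sampling distribution under i.i.d. returns."""
    return math.sqrt((1.0 + 0.5 * sr ** 2) / n_periods)


def sr_confidence_band(sr_estimate, n_periods, z=1.96):
    """Lower and upper lines of a roughly 95% band around an SR estimate."""
    half_width = z * sr_sampling_std(sr_estimate, n_periods)
    return sr_estimate - half_width, sr_estimate + half_width


# Band for a single year's SR when our estimate so far is an annual SR of 1.0
lower, upper = sr_confidence_band(1.0, n_periods=1)
```

With an estimate of 1.0 and a one year horizon the standard deviation is sqrt(1.5), or about 1.22 SR units, which is why the band in the plot is so wide.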

Another way of measuring this is to look at our error term, normalised by our standard deviation. This will be equal to:

[(SR estimate this year N) - (SR estimate years 0...N-1)]/(SR sampling std dev error 0... N-1)

If I take the square of this, average over all years and then take the square root, I get the normalised root mean squared error. This comes out at 0.998 for all the data above.
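
Here is one way that calculation could be sketched for a list of annual SR figures. Two things here are assumptions rather than the original method: the estimate from earlier years is taken as the simple mean of the earlier annual SRs, and the sampling standard deviation is computed for a one year period from that estimate.

```python
import math


def normalised_errors(yearly_sr):
    """Error of each year's SR versus the estimate from all earlier years, in std dev units."""
    errors = []
    for n in range(1, len(yearly_sr)):  # the first year is burn in
        prior_estimate = sum(yearly_sr[:n]) / n  # assumption: mean of earlier annual SRs
        std_one_year = math.sqrt(1.0 + 0.5 * prior_estimate ** 2)  # N = 1 year
        errors.append((yearly_sr[n] - prior_estimate) / std_one_year)
    return errors


def normalised_rmse(yearly_sr):
    """Square the errors, average over all years, then take the square root."""
    errors = normalised_errors(yearly_sr)
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

If the assumed sampling distribution is about right, this number should sit somewhere near 1.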



The blue line in this plot shows the absolute value of the error term for each year. The orange line shows the RMSE. You can see this gradually declining over time and settling in at around 0.85.

Here are the same two plots for a correlation pair estimate:





Again the RMSE tends to end up around 0.86.

Incidentally, we can also do these plots for longer periods. Here is the RMSE evolution for a SR estimate looking ahead over the next 5 years:

The RMSE here is a little higher - around 1.0.


Now, let's look at some real data. I'm going to use the p&l from trading the US10 year bond with a 16,64 day EWMAC. Let's begin by trying to forecast the SR one year ahead:


Even without calculating the error we can see that there are more boundary breakages than before with random data. Here is the error:

Notice that it is higher than before (around 1.25; or about sqrt(2) times bigger than the random data RMSE) and doesn't slowly converge as it did with random data, instead it stays roughly constant (ignoring the initial period of luck at the start). 

We get a similar picture for 5 years:

What about correlations? Let's look at the correlation between this slow momentum on 10 year US bonds, and the carry rule on the same instrument:


Wow, that's noisy. The RMSE will be off the charts. What about over 5 years?

Ouch. If we look at the correlation between two variations of the same trading rule, EWMAC64,256 and EWMAC32,128 - which are naturally highly correlated - then it's not much better:

Again the RMSE would be in double digits.

Those might be flukes, so let's look at lots of random results. I'm going to pick an instrument and trading rule randomly, and measure its final RMSE number. I will then generate some random returns of the same length from the same SR distribution (by measuring the full sample SR for the relevant instrument/rule pairing); and measure the RMSE of those. I will then select another rule from the same instrument, get the correlation of the two p&l streams, and generate some more random returns with the given expected correlation. Next and finally I will measure the correlation RMSE for the two sets of real returns, and the two sets of random returns.

If I consider the ratio [RMSE real data / RMSE random data] (both for next one year); then the median of this over a few thousand randomly selected trading strategy components is 1.06 for Sharpe Ratios, and for correlations around 5.6. 

In simple terms, we are a little bit worse at forecasting Sharpe Ratios in real data one year ahead than we would be with random data, but a LOT worse with correlations. 

Partly this is because we are pretty terrible at forecasting SR one year ahead anyway even with a stable underlying distribution; we don't do much worse with real data. However it does seem that correlations are far more unstable in reality than in randomly generated data. Note that these are correlations for trading strategy component returns. In some cases they are mathematically related (eg EWMAC of different speeds) and could be derived with some assumptions, a pencil, and a napkin. They are certainly more stable than the returns of the underlying instruments themselves (think about the changing correlation of stocks and bonds in different inflation environments). 

(Note: These numbers are about the same for five years ahead and also ten years ahead)

If we recall from the prior post that the optimal shrinkage is zero on correlations with random data; we can now see why with actual data we'd probably want to opt for some correlation shrinkage; purely because the sampling error is much larger in practice. That is the empirical finding of the EPO paper. It does feel a bit weird since up to now my gut feeling has been that we have to shrink means a lot because they are much harder to forecast and because they have an outsized effect on portfolio weights compared to differences in correlation. Whilst the latter is still true it seems the former is not.

Food for thought. Anyway the next step is to repeat the 'Ultimate Fitting Championships' battle, but this time with real data.

 

















UFC - Ultimate Fitting Championships (Evaluating and calibrating portfolio optimisation methods with random data)

As I said in my last post I'm currently in the process of a mega-sized research project on fitting. In the first post I examined the correct way to cluster combinations of trading rules and instruments. 

This next post is rather meatier, and is about evaluating and calibrating some portfolio optimisation techniques. We might call this 'meta optimisation', since we want to find the best way to do optimisation, which itself is effectively a form of optimisation - we are choosing between alternatives based on some utility function.

And because it's optimisation, it can be done in a bad in sample way. And often is. People do have a habit of using a particular data set, working out which optimisation will work best, and then using that. They think they are good people because the optimisation is running in a nice robust out of sample fashion -but they are not good people. Because the choice of optimisation itself has been made having seen all the data.

To avoid this I'm initially going to use random data to evaluate and calibrate the various optimisation techniques. Then no real data will be harmed. A subsequent post will use some real data.

Note: I've sort of had a go at this before, here. However this is a much more thorough look at the problem, whereas the previous post was very limited in scope both of data and also of methodologies. There is also a link here to my multiple posts about probabilistic evaluation of outcomes.

Note 2: whilst researching this post I found a 'new' shrinkage based method, EPO, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3530390 developed by one of my favourite authors (Pedersen of AQR) with co-authors. The main reason I like this is because of the allusion to the more famous EPO, which as someone who cycles and follows cycling is obviously quite ironically entertaining (I have described it as 'new' but it's several years old, and I can now see that it was highlighted by younggotti in a comment on my earlier post, which I didn't follow up).

Note 3: I also came across this relatively new book: https://portfoliooptimizationbook.com/ which is quite a nice survey of the field.

Note 4: "Rob why don't you use AI in developing trading systems like everyone else on LinkedIn". In fact AI has one - exactly one - use case as far as I am concerned:

Thank you chatgpt



The data

For the data I'm going to keep things real simple. I will be doing an optimisation with nine assets. That might seem an odd number*, but it's because in a subsequent post I will be repeating some of this with real assets, and I have nine specific ones in mind (spoiler alert: three instruments trading three different trading rules/methods: two speeds of trend, plus carry). The true SR of the assets will be drawn with equal probability from this list: [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]. The true correlation of the assets will be drawn with equal probability from this list [-0.25, 0, 0.25, 0.5, 0.75, 0.9]. These lists are not fully symmetric, since trading strategy returns of the type I am optimising tend not to have substantial negative correlations or very high/low SR.

* obviously it is indeed an odd number, but it might also appear to be an arbitrary choice.

I will then randomly draw returns from those multivariate Gaussian distributions, generating 2000 outcomes, each with 35 years of history (each 'year' is 256 business days). Why 35 years? Well, I will be varying the length of the in sample period. To be precise, I will use in sample periods between 1 year and 30 years; and evaluate out of sample on a five year basis. Using a (shorter) longer out of sample period would just (increase) reduce the variance between different outcomes; it won't affect their relative efficacy. It seems unlikely anyone will go more than five years without refitting (I do it every year in backtest), so this seems about right.

I will generate a certain number of histories and then evaluate the relative performance of each optimiser on each history; thus avoiding the role of luck if one optimiser happens to get a lucky break.

Note if it isn't obvious I'm assuming I am a SR maximiser (equivalent to a CAGR maximiser for a leveraged investor with Gaussian returns), and I'm assuming all assets have the same expected standard deviation. As a futures trader this is fine. I'm also assuming weights will be positive. These are my standard boilerplate assumptions for optimising trading strategy returns. 
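
To make the data generation concrete, here is a rough sketch of how one random history could be built with numpy. Several things are assumptions not stated in the post: the function names, the 20% annual standard deviation, and in particular the repair step. A correlation matrix whose entries are drawn independently need not be a valid (positive semi-definite) matrix, so this sketch clips negative eigenvalues and rescales, which means the final correlations can drift slightly from the values drawn.

```python
import numpy as np

DAYS_PER_YEAR = 256
SR_CHOICES = [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
CORR_CHOICES = [-0.25, 0, 0.25, 0.5, 0.75, 0.9]


def random_correlation(n_assets, rng):
    """Draw pairwise correlations from CORR_CHOICES and return a valid correlation matrix."""
    corr = np.eye(n_assets)
    upper = np.triu_indices(n_assets, k=1)
    corr[upper] = rng.choice(CORR_CHOICES, size=len(upper[0]))
    corr = corr + corr.T - np.eye(n_assets)  # mirror the upper triangle, keep ones on diagonal
    # Assumption: repair invalid draws by clipping eigenvalues, then restore a unit diagonal
    eigval, eigvec = np.linalg.eigh(corr)
    fixed = (eigvec * np.clip(eigval, 1e-8, None)) @ eigvec.T
    scale = np.sqrt(np.diag(fixed))
    return fixed / np.outer(scale, scale)


def random_history(n_assets=9, years=35, annual_vol=0.2, seed=None):
    """Return (true SR vector, true correlation matrix, daily returns array)."""
    rng = np.random.default_rng(seed)
    true_sr = rng.choice(SR_CHOICES, size=n_assets)
    corr = random_correlation(n_assets, rng)
    daily_std = annual_vol / np.sqrt(DAYS_PER_YEAR)       # same vol for every asset
    daily_mean = true_sr * annual_vol / DAYS_PER_YEAR     # annual SR * annual vol, per day
    cov = corr * daily_std ** 2
    returns = rng.multivariate_normal(daily_mean, cov, size=years * DAYS_PER_YEAR)
    return true_sr, corr, returns
```

Calling `random_history` 2000 times with different seeds would give the set of outcomes described above.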


Random data is not real data

Well duh. But why is this important? Because random data is drawn from a fixed and well behaved distribution. This means the optimiser only has to discover / estimate the parameters of that distribution as more data is revealed to it. But real data doesn't have a fixed and known distribution. It doesn't actually have any distribution at all. We just model it hoping it does.

Essentially random data sets a lower bound on robustness calibration. For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1.

This also means that less robust methods will be flattered compared to more robust methods when using random rather than real data. For this reason we need to treat the results with some caution; and in a future post I will be sense checking them against some real data.


The criteria

On what basis should we evaluate an optimiser? Clearly we are interested in the out of sample performance - Sharpe Ratio in this case. But it's the probabilistic performance that interests me. Using random data means we can look at a distribution of outcomes. And I'm not just interested in the central, median point of that distribution. I'm concerned with optimisers that produce extreme, sparse, weights. On average these might look fine, but their downside will be worse than a more robust optimiser which produces more reasonable weights. So I am also going to evaluate performance at a more cautious 5% percentile point (what we use for statistical significance). 

Of course, there are other criteria for optimising. Speed is an important one that will be bad for gridsearch, bootstrap and monte carlo type methods. Related to that is convergence - how quickly does a bootstrap or monte carlo converge on weights that are 'good enough'. If convergence is quite quick then the penalty of running multiple optimisations won't be as large.


The optimisers

Let's quickly run through the competitors in this little olympics:

  •  monte carlo (random, parametric)
  •  bootstrapping (random, non parametric)
  • double shrinkage (shrinking SR towards average SR, and correlations to zero). Shrinkage can range from zero (no shrinkage) to one (full shrinkage). Note that with the right parameters this encompasses some other methods including:
    • NMV naive mean variance (no shrinkage on anything)
    • EW equal weights (both full shrinkage)
    • MD maximum diversification (no shrinkage correlation, full shrinkage on SR)
    •  EPO (we just shrink the correlation matrix to some degree)
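
The shrinkage step of the double shrinkage method can be sketched as below. This only produces the shrunk inputs; the mean variance optimiser they are fed into is not shown. The function name is made up, and the simple linear blend towards the prior is an assumption for the correlation matrix (for SR it matches the '90% estimate, 10% prior' description used in this post).

```python
import numpy as np


def double_shrinkage(sr_estimates, corr_estimate, sr_shrinkage, corr_shrinkage):
    """Shrink SR towards the average SR, and correlations towards zero."""
    sr = np.asarray(sr_estimates, dtype=float)
    corr = np.asarray(corr_estimate, dtype=float)
    shrunk_sr = (1 - sr_shrinkage) * sr + sr_shrinkage * sr.mean()
    # The prior is the identity matrix: zero correlations, ones on the diagonal
    shrunk_corr = (1 - corr_shrinkage) * corr + corr_shrinkage * np.eye(len(sr))
    return shrunk_sr, shrunk_corr


# Hypothetical two asset example with the EPO style setting (no SR shrinkage)
sr_out, corr_out = double_shrinkage([0.2, 0.8], [[1.0, 0.6], [0.6, 1.0]], 0.0, 0.75)
```

Setting both shrinkage values to 1.0 gives identical SR and zero correlations, which is why that corner leads to equal weights; setting both to 0.0 passes the estimates through untouched (naive mean variance).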

Notice I am not at this stage using any kind of clustering or hierarchical method, such as my own 'handcrafting'. My intention is to first, in this post, establish the best way to optimise relatively small portfolios. Then in a subsequent post I will properly evaluate the performance / speed tradeoff of using this small portfolio optimisation inside a top down clustering method.

There are a whole bunch of other methods we could use, but I have a good understanding of the methods above and I don't feel the need to go very fancy. 

Note that within the shrinkage team we have a number of competitors as we can vary the shrinkage in a range of let's say 0 (no shrinkage, use empirical results), to 1.0 (full shrinkage). For correlation shrinkage I'm going to use these nine steps: 0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0 (the optimal EPO shrinkage is 0.75 hence the extra granularity around there). For SR shrinkage I'm going to use these nine values [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0] because I know that minimal amounts of mean shrinkage don't achieve much. That gives me 81 possible shrinkage methods; but one is actually equal weights (shrinkage of 1 on SR and 1 on correlations); another is maximum diversification (1,0), a third is naive mean variance (0,0), and there are seven that are EPO (0, 0.2... 0.9). So there are actually 71 shrinkage methods of various strengths.
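
The counting above can be checked by enumerating the grid. This is just a sketch with a made up `label` function; pairs are written as (SR shrinkage, correlation shrinkage).

```python
from collections import Counter
from itertools import product

CORR_STEPS = [0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]
SR_STEPS = [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]


def label(sr_shrink, corr_shrink):
    """Name the special cases that sit inside the double shrinkage grid."""
    if sr_shrink == 1 and corr_shrink == 1:
        return "EW"
    if sr_shrink == 1 and corr_shrink == 0:
        return "MD"
    if sr_shrink == 0 and corr_shrink == 0:
        return "NMV"
    if sr_shrink == 0 and 0 < corr_shrink < 1:
        return "EPO"
    return "shrinkage"


counts = Counter(label(sr, corr) for sr, corr in product(SR_STEPS, CORR_STEPS))
```

The counter holds 81 combinations in total, of which 71 fall under the plain "shrinkage" label.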


Establishing convergence speed

Before we begin we need to establish the number of runs required to establish convergence for the bootstrapping and monte carlo methods. 

There are a couple of ways we can establish convergence speed:

  • How quickly the weights 'settle down', i.e. do not change very much
  • How quickly the probabilistic out of sample SR 'settles down'

Note also that this convergence will take longer with larger portfolios, since there is a bigger possible space to cover.

The following shows, for bootstrapping on one random one year long in-sample period, how the average* error narrows between the 'correct' weights (run with the maximum of 2000 iterations) and the weights we get with fewer iterations:

* sqrt(sum(error^2))

We can see that after about 500 iterations (x axis) the average error (y axis) is down to 2% or less. In other words given the average weight would be 11.1%; we're looking at having a weight that has a good chance of being between 9% and 13%.

What about the convergence in SR?

Again after 500 iterations we're down to a SR difference of less than 1bp (0.01 SR units). These are single point SR though, the probabilistic result would be different with worse results for lower iterations where we are more likely to get sparse portfolios with poor OOS performance at conservative points of the distribution.

Here are the same results, but now averaged across 50 different samples (it's real slow running this code, and the results aren't that different across samples). Because we're being conservative I'm using the 80% percentile at each point rather than the median or average. Obviously that would be the 10th worst result in each bracket.



Note that the SR shown is the absolute difference between the SR for 2000 iterations and for a smaller number of iterations. So SR that are higher will be penalised as much as those that are lower.

That's for one year of in sample returns. I would say roughly 1000 to 1500 iterations is enough for decent convergence. How about ten years? Incidentally don't try this at home, it takes a long time.



It looks like convergence happens faster with longer periods of data which makes sense. With longer periods of data there is less chance of rogue samples causing extreme weights in the first few iterations. With 10 years of data we can probably risk reducing our iteration count down to 500 or so. 

Now, how do those results vary for the other random optimisation method, monte carlo? Here our resampling is parametric. First for one year, weights:

That looks a little quicker than the same one year plot for bootstrapping. After 250 iterations we have our average weight difference down to around 3.2%; before it was more like 4.8%. We can probably get away with about half the iterations we had before.

How about Sharpe Ratios?

SR is noisier so there isn't as much evidence here of faster convergence.

You've probably seen enough of these plots now, so I will jump to a summary. These aren't hard numbers, but based on eyeballing the graphs to the rough point where convergence is down to about 0.01SR points or the equivalent in weights:

               1 year   10 years
Bootstrap        1200        600
Monte Carlo       600        300

We can turn these into a simple heuristic rule: use 1200*N^(-0.3) iterations for N years of data with bootstrapping, and halve that for monte carlo.
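As a quick sanity check on that rule of thumb, here is a small sketch of it as a function. The function name, the method labels and the choice to round to the nearest whole iteration are my own assumptions; only the 1200*N^(-0.3) formula and the halving come from the eyeballed table.

```python
def heuristic_iterations(years, method="bootstrap"):
    """Rule-of-thumb iteration count: 1200 * N^(-0.3), halved for monte carlo."""
    if years <= 0:
        raise ValueError("years must be positive")
    iterations = 1200 * years ** -0.3
    if method == "monte_carlo":
        iterations /= 2
    elif method != "bootstrap":
        raise ValueError("method must be 'bootstrap' or 'monte_carlo'")
    # Rounding to the nearest integer is an arbitrary choice for this sketch.
    return round(iterations)

for n_years in (1, 10):
    print(
        n_years,
        heuristic_iterations(n_years),
        heuristic_iterations(n_years, "monte_carlo"),
    )
```

For 1 year this reproduces the table exactly, and for 10 years it lands within a couple of iterations of the rounded 600 and 300 figures.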

(My gut feeling is that convergence will be a little slower with real data)


The giant test

OK so now we have established our <checks notes> 83 different candidate optimisers. 71 of these are of the shrinkage family, seven are EPO options, there are our three special cases EW, MD and NMV; and then we have our two randomly based methods: bootstrapping and monte carlo, for which we've established appropriate numbers of iterations for reasonable convergence. 

We're going to run each of those on each of the 2000 samples of fake history we have, and then look at the distribution of out of sample SR across those samples. We'll then focus on the 50% median point and the more conservative 5% point. I will also measure how long it takes to do the 2000 samples for each method to get an indication of time per optimisation. This will be repeated for different lengths of in sample history, from 1 year up to 30 years. Remember that we'll always be using the last five years of our sample for OOS evaluation.
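To make the two distribution points concrete, here is a minimal sketch of reading the 50% and 5% points off a set of out of sample SRs. The linear-interpolation percentile helper is an assumption on my part (other percentile conventions give slightly different answers), and the 2000 SR values are random placeholders, not results from this post.

```python
import random
import statistics

def distribution_point(values, percentile):
    """Percentile (0 to 100) of a list of values, using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * percentile / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Placeholder data only: 2000 made-up out of sample SRs, one per sample.
rng = random.Random(42)
oos_sr = [rng.gauss(0.8, 0.5) for _ in range(2000)]

median_point = distribution_point(oos_sr, 50)
conservative_point = distribution_point(oos_sr, 5)
```

The 5% point answers the question 'how bad is the outcome in the unluckiest one in twenty samples', which is why it sits below the median.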


Qualifying round, shrinkage

To avoid too much work I will begin with repeating the exercise in the EPO literature where we try and find the optimum levels of shrinkage for correlation and SR (note - for simplified EPO the SR shrinkage is just zero, but I'm going to explore the whole surface). I will do this on the basis of both the 50% median point of the distribution, and also the 5% conservative point. This will give me two candidate shrinkage models to put up against the random competitors of bootstrap and monte carlo in the two event finals. In fact, since optimal shrinkage will certainly vary by the amount of data, there will be a finalist for each length of in sample period in years. At this point I'm not concerned with speed, since all the shrinkage variants will take about the same amount of time to optimise.
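As a reminder of what the two shrinkage knobs do to the optimiser's inputs, here is a minimal sketch. It assumes simple linear blending: the SR vector is pulled towards its cross-sectional average and the off-diagonal correlations are pulled towards zero. The function names and the three-asset inputs are made up, and the optimisation step that would consume these inputs is deliberately left out.

```python
def shrink_sharpe_ratios(estimated_sr, sr_shrinkage):
    """Blend each estimated SR with the average SR across assets."""
    average_sr = sum(estimated_sr) / len(estimated_sr)
    return [(1 - sr_shrinkage) * sr + sr_shrinkage * average_sr for sr in estimated_sr]

def shrink_correlations(corr, corr_shrinkage):
    """Blend off-diagonal correlations with zero; the diagonal stays at 1."""
    size = len(corr)
    return [
        [1.0 if i == j else (1 - corr_shrinkage) * corr[i][j] for j in range(size)]
        for i in range(size)
    ]

# Made-up inputs for three assets, purely for illustration.
estimated_sr = [0.2, 0.5, 0.8]
estimated_corr = [
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]

# One (SR shrinkage, correlation shrinkage) pair from the grid.
shrunk_sr = shrink_sharpe_ratios(estimated_sr, 0.5)
shrunk_corr = shrink_correlations(estimated_corr, 0.4)
```

Under this blending, zero shrinkage on both leaves the inputs untouched (naive mean variance), while full shrinkage on both gives identical SRs and an identity correlation matrix, which is what leads to equal weights.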

Starting with 1 year, 50% median point. Rows are SR shrinkage, columns are correlation shrinkage:

       0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   0.87   0.86   0.84   0.81   0.79   0.79   0.77   0.75   0.72
0.25   0.88   0.86   0.83   0.81   0.79   0.77   0.77   0.74   0.70
0.50   0.87   0.85   0.83   0.79   0.77   0.75   0.74   0.72   0.67
0.75   0.83   0.82   0.78   0.74   0.72   0.70   0.69   0.67   0.62
0.80   0.81   0.80   0.77   0.73   0.71   0.69   0.67   0.65   0.60
0.85   0.77   0.77   0.75   0.70   0.67   0.67   0.65   0.62   0.59
0.90   0.73   0.73   0.70   0.65   0.65   0.63   0.62   0.61   0.57
0.95   0.65   0.64   0.61   0.58   0.57   0.56   0.56   0.57   0.55
1.00   0.41   0.40   0.40   0.39   0.38   0.38   0.38   0.37   0.38

Note: The coloured values are the 'special' pairs of shrinkage values. Top left in yellow is naive mean variance (no shrinkage). The rest of that row in purple is simple EPO with no mean shrinkage and varying correlation shrinkage. Bottom left in red is maximum diversification (full mean shrinkage, no correlation shrinkage). Bottom right in green is equal weights (full shrinkage on both). I won't be colouring these values in again, so keep them in mind.

There isn't much in it, but it looks like we want minimal shrinkage here with the optimal at (0.25,0). Shrinkage of more than 0.50 on means, and more than 0.40 or so on correlations leads to a dropoff in performance; though with zero correlation shrinkage we can push up to 0.75 on means without too much damage. Given the underlying data process is stable artificial data with a given distribution, perhaps that's not so surprising.

Now let's look at the 5% point. Obviously these numbers are negative, but we want the least negative:

       0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00  -0.12  -0.13  -0.13  -0.15  -0.16  -0.15  -0.16  -0.18  -0.20
0.25  -0.12  -0.13  -0.13  -0.13  -0.14  -0.15  -0.17  -0.19  -0.22
0.50  -0.13  -0.13  -0.14  -0.17  -0.18  -0.18  -0.19  -0.22  -0.25
0.75  -0.21  -0.18  -0.19  -0.21  -0.23  -0.24  -0.25  -0.26  -0.29
0.80  -0.21  -0.21  -0.21  -0.22  -0.24  -0.26  -0.28  -0.28  -0.31
0.85  -0.27  -0.25  -0.23  -0.27  -0.29  -0.28  -0.30  -0.30  -0.34
0.90  -0.35  -0.34  -0.33  -0.33  -0.32  -0.34  -0.33  -0.34  -0.36
0.95  -0.50  -0.48  -0.45  -0.44  -0.43  -0.43  -0.40  -0.37  -0.39
1.00  -0.72  -0.69  -0.68  -0.66  -0.66  -0.66  -0.66  -0.66  -0.47

Here having more shrinkage isn't as problematic. Correlation shrinkage can be as high as 0.75 without losing more than 3bp (0.75 is the EPO optimal remember); whilst mean shrinkage can be up to 0.50. The optimal shrinkage is still about zero but the surface is quite flat beyond that. 

To reiterate; this is random data. Remember what I said earlier: "For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1." This is still true!

One useful thing, as well as looking at points on the distribution, is to calculate paired t-tests, since the distributions can be directly compared (each particular point in a given distribution relates to the same random sample). After all the surface looks quite flat for most of the top left corner, but just how flat? It turns out that all the pairings are significantly worse than the optimal at a 1% critical value, except for the optimal itself (0.25,0), plus (0.5,0) and (0,0).
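Here is a minimal sketch of that kind of paired comparison. It computes the paired t statistic by hand from the per-sample SR differences, and uses a normal approximation for the one-sided p-value, which is only reasonable because there are thousands of paired samples. The function names and the SR data are placeholders I've made up; none of the numbers are results from this post.

```python
import math
import random
import statistics

def paired_t_statistic(sr_a, sr_b):
    """t statistic for the mean of the paired differences (a minus b)."""
    differences = [a - b for a, b in zip(sr_a, sr_b)]
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return statistics.mean(differences) / standard_error

def one_sided_p_value(t_statistic):
    """Normal approximation to P(T <= t); only sensible for large samples."""
    return statistics.NormalDist().cdf(t_statistic)

# Placeholder data: both methods see the same 2000 random samples, so their
# OOS SRs share a common component; the candidate is made 0.02 SR worse.
rng = random.Random(0)
common = [rng.gauss(0.8, 0.5) for _ in range(2000)]
optimal_sr = [x + rng.gauss(0.0, 0.05) for x in common]
candidate_sr = [x - 0.02 + rng.gauss(0.0, 0.05) for x in common]

t_stat = paired_t_statistic(candidate_sr, optimal_sr)
p_value = one_sided_p_value(t_stat)
significantly_worse = p_value < 0.01  # 1% critical value
```

Pairing matters here: the sample to sample noise in SR is large, but most of it is common to both methods and cancels in the differences, which is how a gap of a couple of bp can still come out as significant.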

Let's jump ahead to 30 years to get a feel for any differences. Again, first the results from the median:

       0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   1.19   1.18   1.16   1.15   1.13   1.12   1.11   1.08   1.04
0.25   1.19   1.18   1.16   1.14   1.12   1.11   1.09   1.06   1.01
0.50   1.19   1.18   1.16   1.13   1.10   1.08   1.06   1.02   0.95
0.75   1.13   1.13   1.10   1.04   1.01   0.98   0.95   0.89   0.81
0.80   1.09   1.09   1.06   1.01   0.96   0.93   0.90   0.84   0.77
0.85   1.05   1.04   1.01   0.95   0.90   0.87   0.84   0.78   0.72
0.90   0.97   0.95   0.93   0.86   0.82   0.80   0.76   0.71   0.66
0.95   0.84   0.82   0.79   0.73   0.69   0.67   0.65   0.61   0.58
1.00   0.47   0.47   0.45   0.43   0.42   0.41   0.41   0.40   0.38

And for the 5% point:

       0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   0.32   0.31   0.31   0.30   0.29   0.28   0.27   0.25   0.21
0.25   0.32   0.31   0.31   0.29   0.28   0.27   0.26   0.23   0.19
0.50   0.31   0.31   0.29   0.28   0.25   0.24   0.22   0.19   0.14
0.75   0.24   0.24   0.21   0.19   0.17   0.15   0.13   0.08   0.02
0.80   0.19   0.20   0.19   0.15   0.13   0.11   0.09   0.05  -0.02
0.85   0.14   0.16   0.15   0.11   0.07   0.06   0.04  -0.01  -0.08
0.90   0.04   0.09   0.08   0.03  -0.00  -0.03  -0.04  -0.09  -0.15
0.95  -0.15  -0.09  -0.10  -0.12  -0.13  -0.16  -0.17  -0.21  -0.24
1.00  -0.60  -0.59  -0.57  -0.56  -0.55  -0.55  -0.55  -0.54  -0.47

Notice the numbers here are higher. With 30 years of data, and a fixed distribution, we have plenty of time to get a good handle on the parameters of the optimisation resulting in better OOS performance.

Again as before the optimum pair is identical (0.25,0) but with a more conservative distributional point we can have more shrinkage with less damage to SR. 

Right, so with all that in mind what shrinkage should we take forward to the next round? It's a toughie; (0.25,0) is the winner in every round but using so little shrinkage makes me nervous (we already know it's likely to be a poor choice on real data). Let's make an arbitrary change to the format and allow one of the close runners-up to get in as well; say (0.5, 0.4) which is within a few bp of the winner. And just because I like the name, we'll also run the EPO optimal of (0, 0.75); not quite as good but not terrible.

Let's remind ourselves who is in the final five of the random final, competing at distances from 1 year up to 30 years for both the Median Memorial Trophy, and the 5% Conservative Cup.

  • MC monte carlo (random, parametric) with iterations depending on in sample length
  • BS bootstrapping (random, non parametric) with iterations depending on in sample length
  • OS Optimal shrinkage (SR shrinkage = 0.25, nothing on correlations)
  • EPO shrinkage (correlation shrinkage = 0.75, nothing on SR)
  • CS Cautious shrinkage (SR shrinkage = 0.5, correlation shrinkage = 0.4)
  • NMV Naive mean variance (no shrinkage on either)
  • EW Equal weights (full shrinkage on both)

That's not a final five, I hear you cry, it's a final seven! Because by law any consideration of portfolio optimisation has to include these two extreme options.

I promised you some speed figures, and here they are: all the various shrinkage methods take about 5 milliseconds per optimisation; bootstrapping takes around 17.5 seconds per optimisation and monte carlo between 5.3 and 29 seconds per optimisation (for 1 year and 30 years respectively). It took several days to complete the monte carlo for 30 years!

The results for 1 year and 30 years are pretty similar, so for brevity here are those for 30 years, median first:

BS  1.188
OS  1.188
MC  1.185
CS  1.158
EPO 1.122
NMV 1.190
EW  0.379    

30 years with 5% point now:


BS  0.318
MC  0.317
OS  0.315
NMV 0.315
CS  0.292
EPO 0.278
EW -0.4

I'd say that you can take your pick from pretty much any of these methods apart from equal weights, which with the structure we have imposed is always going to be suboptimal.

Summary and what's next

This is optimisation in highly controlled laboratory conditions with nice distributions that don't move around when you're not looking. Still we did find some interesting results:
  • We got some ballpark figures for MC/bootstrap convergence rates
  • Even in this context some shrinkage on the mean is optimal
  • The gold standard for weights is bootstrap/MC, which also don't require any shrinkage meta-parameters, but they are bloody slow. 
We'll take these with us on the next stage of our journey when we use some real data.

Footnote: the perfect optimiser that we can't use

Incidentally, there is another optimiser I haven't considered here which precisely suits my goal of finding the best expected probabilistic outcome. It's a grid search that considers all possible weights, and for each set of weights bootstraps a distribution of SR outcomes given those weights and the in sample data; and then takes the 50% or 5% point of that SR distribution. Unfortunately it's very slow; and even using a coarse to fine approach I was unable to get it to run in a reasonable time.
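To give a feel for why this is so slow, here is a toy sketch of the idea on three assets with a very coarse grid. Everything in it is an assumption made for illustration: the function names, the long-only grid in steps of 25%, the per-period (unannualised) SR, the reuse of the same resamples for every candidate, and the made-up return data. It is a plain coarse grid, not the coarse to fine version, and not the code I actually tried.

```python
import math
import random

def sharpe_ratio(returns):
    """Per-period SR (not annualised): mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(variance) if variance > 0 else 0.0

def weight_grid(n_assets, steps):
    """All long-only weight vectors in increments of 1/steps that sum to one."""
    def build(remaining_assets, remaining_steps):
        if remaining_assets == 1:
            yield [remaining_steps]
            return
        for k in range(remaining_steps + 1):
            for rest in build(remaining_assets - 1, remaining_steps - k):
                yield [k] + rest
    return [[k / steps for k in combo] for combo in build(n_assets, steps)]

def bootstrapped_sr_point(returns_by_asset, weights, percentile, n_bootstraps, rng):
    """Approximate percentile of the bootstrapped SR distribution for given weights."""
    n_periods = len(returns_by_asset[0])
    portfolio = [
        sum(w * asset[t] for w, asset in zip(weights, returns_by_asset))
        for t in range(n_periods)
    ]
    # Resample time periods with replacement and recompute the SR each time.
    sr_draws = sorted(
        sharpe_ratio(rng.choices(portfolio, k=n_periods)) for _ in range(n_bootstraps)
    )
    index = min(int(percentile / 100 * n_bootstraps), n_bootstraps - 1)
    return sr_draws[index]

def grid_search_optimiser(returns_by_asset, percentile=5, steps=4, n_bootstraps=200, seed=0):
    """Pick the grid point whose bootstrapped SR percentile is highest."""
    best_weights, best_point = None, float("-inf")
    for weights in weight_grid(len(returns_by_asset), steps):
        rng = random.Random(seed)  # same resampled periods for every candidate
        point = bootstrapped_sr_point(returns_by_asset, weights, percentile, n_bootstraps, rng)
        if point > best_point:
            best_weights, best_point = weights, point
    return best_weights, best_point

# Made-up daily returns for three assets over 250 periods.
data_rng = random.Random(1)
returns_by_asset = [
    [data_rng.gauss(mu, 0.01) for _ in range(250)] for mu in (0.0002, 0.0005, 0.0008)
]
best_weights, best_point = grid_search_optimiser(returns_by_asset)
```

Even this toy version does 15 grid points times 200 bootstraps; with nine assets and a 5% grid the number of grid points runs into the millions, which is where it stops being fun.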