Monday, 8 June 2026

UFC - Ultimate Fitting Championships (Evaluating and calibrating portfolio optimisation methods with random data)

As I said in my last post I'm currently in the process of a mega-sized research project on fitting. In the first post I examined the correct way to cluster combinations of trading rules and instruments. 

This next post is rather meatier, and is about evaluating and calibrating some portfolio optimisation techniques. We might call this 'meta optimisation', since we want to find the best way to do optimisation, which itself is effectively a form of optimisation - we are choosing between alternatives based on some utility function.

And because it's optimisation, it can be done in a bad, in sample, way. And often is. People do have a habit of using a particular data set, working out which optimisation will work best, and then using that. They think they are good people because the optimisation is running in a nice robust out of sample fashion, but they are not good people. Because the choice of optimisation itself has been made having seen all the data.

To avoid this I'm initially going to use random data to evaluate and calibrate the various optimisation techniques. Then no real data will be harmed. A subsequent post will use some real data.

Note: I've sort of had a go at this before, here. However this is a much more thorough look at the problem, whereas the previous post was very limited in scope, both of data and of methodologies. There is also a link here to my multiple posts about probabilistic evaluation of outcomes.

Note 2: whilst researching this post I found a 'new' shrinkage based method, EPO, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3530390 developed by one of my favourite authors (Pedersen of AQR) with co-authors. The main reason I like this is the allusion to the more famous EPO, which as someone who cycles and follows cycling I obviously find quite ironically entertaining. (I have described it as 'new' but it's several years old, and I can now see that it was highlighted by younggotti in a comment on my earlier post, which I didn't follow up.)

Note 3: I also came across this relatively new book: https://portfoliooptimizationbook.com/ which is quite a nice survey of the field.

Note 4: "Rob why don't you use AI in developing trading systems like everyone else on LinkedIn". In fact AI has one - exactly one - use case as far as I am concerned:

Thank you chatgpt



The data

For the data I'm going to keep things real simple. I will be doing an optimisation with nine assets. That might seem an odd number*, but it's because in a subsequent post I will be repeating some of this with real assets, and I have nine specific ones in mind (spoiler alert: three instruments trading three different trading rules/methods: two speeds of trend, plus carry). The true SR of the assets will be drawn with equal probability from this list: [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]. The true correlation of the assets will be drawn with equal probability from this list: [-0.25, 0, 0.25, 0.5, 0.75, 0.9]. These lists are not fully symmetric, since trading strategy returns of the type I am optimising tend not to have substantial negative correlations or very high/low SR.
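As a rough sketch of that parameter draw: the code below picks one 'true' SR per asset and, as an assumption on my part, one 'true' correlation per pair of assets. The function name `draw_true_parameters` is made up for illustration. Note that a matrix filled in pair by pair like this isn't guaranteed to be a valid (positive semi-definite) correlation matrix, so it may need repairing before returns can be sampled from it; that step isn't shown here.

```python
import random

SR_CHOICES = [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
CORR_CHOICES = [-0.25, 0, 0.25, 0.5, 0.75, 0.9]
N_ASSETS = 9


def draw_true_parameters(rng):
    """Draw one 'true' SR per asset and one 'true' correlation per pair.

    Drawing a separate correlation for each pair is an assumption;
    the resulting matrix is not guaranteed to be positive semi-definite.
    """
    true_sr = [rng.choice(SR_CHOICES) for _ in range(N_ASSETS)]

    # Start from the identity matrix, then fill each pair symmetrically
    corr = [[1.0 if i == j else 0.0 for j in range(N_ASSETS)]
            for i in range(N_ASSETS)]
    for i in range(N_ASSETS):
        for j in range(i + 1, N_ASSETS):
            rho = rng.choice(CORR_CHOICES)
            corr[i][j] = rho
            corr[j][i] = rho
    return true_sr, corr


true_sr, true_corr = draw_true_parameters(random.Random(42))
```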

* obviously it is indeed an odd number, but it might also appear to be an arbitrary choice.

I will then randomly draw returns from those multivariate Gaussian distributions, generating 2000 outcomes, each with 35 years of history (each 'year' is 256 business days). Why 35 years? Well, I will be varying the length of the in sample period. To be precise, I will use in sample periods between 1 year and 30 years; and evaluate out of sample on a five year basis. Using a (shorter) longer out of sample period would just (increase) reduce the variance between different outcomes; it won't affect their relative efficacy. It seems unlikely anyone will go more than five years without refitting (I do it every year in backtest), so this seems about right.
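Here is a minimal sketch of how one 35 year history could be carved up. The out of sample block is the final five years; I'm assuming the in sample window is the block of years immediately before that, which is a guess rather than something stated above. `split_history` is a hypothetical helper name.

```python
DAYS_PER_YEAR = 256
TOTAL_YEARS = 35
OOS_YEARS = 5


def split_history(returns, in_sample_years):
    """Split a history (oldest row first) into in sample and out of sample.

    Assumption: the in sample window sits immediately before the final
    OOS_YEARS of the history.
    """
    if not 1 <= in_sample_years <= TOTAL_YEARS - OOS_YEARS:
        raise ValueError("in sample length must be between 1 and 30 years")
    oos_start = len(returns) - OOS_YEARS * DAYS_PER_YEAR
    is_start = oos_start - in_sample_years * DAYS_PER_YEAR
    return returns[is_start:oos_start], returns[oos_start:]


# Stand-in for real rows of returns: just the day index
history = list(range(TOTAL_YEARS * DAYS_PER_YEAR))
in_sample, out_of_sample = split_history(history, 10)
```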

I will generate a certain number of histories and then evaluate the relative performance of each optimiser on each history; thus avoiding the role of luck if one optimiser happens to get a lucky break.

Note if it isn't obvious I'm assuming I am a SR maximiser (equivalent to a CAGR maximiser for a leveraged investor with Gaussian returns), and I'm assuming all assets have the same expected standard deviation. As a futures trader this is fine. I'm also assuming weights will be positive. These are my standard boilerplate assumptions for optimising trading strategy returns. 
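With equal standard deviations, the volatility cancels out and the portfolio SR depends only on the weights, the vector of asset SR and the correlation matrix. A small sketch of that calculation (the function name `portfolio_sr` is mine, purely for illustration):

```python
import math


def portfolio_sr(weights, sr, corr):
    """Portfolio Sharpe Ratio when every asset has the same standard deviation.

    The numerator is the weighted sum of asset SR, the denominator is
    sqrt(w' C w) where C is the correlation matrix.
    """
    n = len(weights)
    expected = sum(w * s for w, s in zip(weights, sr))
    variance = sum(weights[i] * weights[j] * corr[i][j]
                   for i in range(n) for j in range(n))
    return expected / math.sqrt(variance)


# Two uncorrelated assets, each with SR 0.5, held in equal weights
example_sr = portfolio_sr([0.5, 0.5], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

In that example diversification lifts the SR above the 0.5 of either asset alone; with a correlation of 1 there would be no lift at all.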


Random data is not real data

Well duh. But why is this important? Because random data is drawn from a fixed and well behaved distribution. This means the optimiser only has to discover / estimate the parameters of that distribution as more data is revealed to it. But real data doesn't have a fixed and known distribution. It doesn't actually have any distribution at all. We just model it hoping it does.

Essentially random data sets a lower bound on robustness calibration. For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1.

This also means that less robust methods will be flattered compared to more robust methods when using random rather than real data. For this reason we need to treat the results with some caution; and in a future post I will be sense checking them against some real data.


The criteria

On what basis should we evaluate an optimiser? Clearly we are interested in the out of sample performance - Sharpe Ratio in this case. But it's the probabilistic performance that interests me. Using random data means we can look at a distribution of outcomes. And I'm not just interested in the central, median point of that distribution. I'm concerned with optimisers that produce extreme, sparse, weights. On average these might look fine, but their downside will be worse than a more robust optimiser which produces more reasonable weights. So I am also going to evaluate performance at a more cautious 5% percentile point (what we use for statistical significance).
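To make the two criteria concrete, here is a sketch of pulling the median and the 5% point out of a list of out of sample SR, one per random history. The SR values are made up purely for illustration, and `distribution_points` is a hypothetical name.

```python
import random
import statistics


def distribution_points(oos_sr):
    """Return (median, 5% point) of a list of out of sample Sharpe Ratios."""
    median = statistics.median(oos_sr)
    # quantiles(n=20) returns 19 cut points; the first one is the 5% point
    five_pct = statistics.quantiles(oos_sr, n=20)[0]
    return median, five_pct


# Made up SR outcomes across 2000 histories, for illustration only
rng = random.Random(0)
fake_oos_sr = [rng.gauss(0.8, 0.6) for _ in range(2000)]
median_sr, conservative_sr = distribution_points(fake_oos_sr)
```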

Of course, there are other criteria for optimising. Speed is an important one that will be bad for gridsearch, bootstrap and monte carlo type methods. Related to that is convergence - how quickly does a bootstrap or monte carlo converge on weights that are 'good enough'? If convergence is quite quick then the penalty of running multiple optimisations won't be as large.


The optimisers

Let's quickly run through the competitors in this little olympics:

  • monte carlo (random, parametric)
  • bootstrapping (random, non parametric)
  • double shrinkage (shrinking SR towards average SR, and correlations to zero). Shrinkage can range from zero (no shrinkage) to one (full shrinkage). Note with the right parameters this encompasses some other methods including:
    • NMV naive mean variance (no shrinkage on anything)
    • EW equal weights (both full shrinkage)
    • MD maximum diversification (no shrinkage on correlation, full shrinkage on SR)
    • EPO (we just shrink the correlation matrix to some degree)
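The double shrinkage in that list can be sketched as two small functions: one pulls each estimated SR towards the average SR, the other pulls the off-diagonal correlations towards zero. The names `shrink_sr` and `shrink_corr` are just for illustration; the shrunk inputs would then be handed to whatever optimiser you are using.

```python
def shrink_sr(sr_estimates, shrinkage):
    """Shrink each estimated SR towards the average SR.

    shrinkage=0 leaves the estimates alone, shrinkage=1 makes them all equal.
    """
    average = sum(sr_estimates) / len(sr_estimates)
    return [(1 - shrinkage) * s + shrinkage * average for s in sr_estimates]


def shrink_corr(corr, shrinkage):
    """Shrink off-diagonal correlations towards zero; the diagonal stays at 1."""
    n = len(corr)
    return [[1.0 if i == j else (1 - shrinkage) * corr[i][j]
             for j in range(n)]
            for i in range(n)]


estimated_sr = [0.0, 0.4, 1.1]
estimated_corr = [[1.0, 0.8, 0.2],
                  [0.8, 1.0, 0.4],
                  [0.2, 0.4, 1.0]]

# EPO style: shrink only the correlations
epo_sr = shrink_sr(estimated_sr, 0.0)
epo_corr = shrink_corr(estimated_corr, 0.75)

# Full shrinkage on both: every asset looks identical and uncorrelated
ew_sr = shrink_sr(estimated_sr, 1.0)
ew_corr = shrink_corr(estimated_corr, 1.0)
```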

Notice I am not at this stage using any kind of clustering or hierarchical method, such as my own 'handcrafting'. My intention is to first, in this post, establish the best way to optimise relatively small portfolios. Then in a subsequent post I will properly evaluate the performance / speed tradeoff of using this small portfolio optimisation inside a top down clustering method.

There are a whole bunch of other methods we could use, but I have a good understanding of the methods above and I don't feel the need to go very fancy. 

Note that within the shrinkage team we have a number of competitors as we can vary the shrinkage in a range of let's say 0 (no shrinkage, use empirical results), to 1.0 (full shrinkage). For correlation shrinkage I'm going to use these nine steps: 0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0 (the optimal EPO shrinkage is 0.75 hence the extra granularity around there). For SR shrinkage I'm going to use these nine values [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0] because I know that minimal amounts of mean shrinkage don't achieve much. That gives me 81 possible shrinkage methods; but one is actually equal weights (shrinkage of 1 on SR and 1 on correlations); another is maximum diversification (1,0), a third is naive mean variance (0,0), and there are seven that are EPO (0, 0.2... 0.9). So there are actually 71 shrinkage methods of various strengths.
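If you want to check my counting, here is a quick sketch that enumerates the grid and labels each (SR shrinkage, correlation shrinkage) pair. The `label` function is only for illustration.

```python
from collections import Counter
from itertools import product

CORR_SHRINKAGE = [0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]
SR_SHRINKAGE = [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]


def label(sr_shrink, corr_shrink):
    """Name the special cases; everything else is plain 'shrinkage'."""
    if sr_shrink == 0 and corr_shrink == 0:
        return "NMV"
    if sr_shrink == 1 and corr_shrink == 1:
        return "EW"
    if sr_shrink == 1 and corr_shrink == 0:
        return "MD"
    if sr_shrink == 0 and 0 < corr_shrink < 1:
        return "EPO"
    return "shrinkage"


counts = Counter(label(s, c) for s, c in product(SR_SHRINKAGE, CORR_SHRINKAGE))
```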


Establishing convergence speed

Before we begin we need to establish the number of runs required to establish convergence for the bootstrapping and monte carlo methods. 

There are a couple of ways we can establish convergence speed:

  • How quickly the weights 'settle down', i.e. do not change very much
  • How quickly the probabilistic out of sample SR 'settles down'

Note also that this convergence will take longer with larger portfolios, since there is a bigger possible space to cover.

The following shows, for bootstrapping on one random one year long in sample period, how the average* error narrows between the 'correct' weights (run with the maximum of 2000 iterations) and the weights we get with fewer iterations:

* sqrt(sum(error^2))

We can see that after about 500 iterations (x axis) the average error (y axis) is down to 2% or less. In other words given the average weight would be 11.1%; we're looking at having a weight that has a good chance of being between 9% and 13%.
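For reference, the error measure in the footnote above, sqrt(sum(error^2)), is just the Euclidean distance between two weight vectors. A minimal sketch, with made up weights and a hypothetical function name:

```python
import math


def weight_error(weights, reference_weights):
    """sqrt(sum(error^2)) between a weight vector and the reference weights."""
    return math.sqrt(sum((w - r) ** 2
                         for w, r in zip(weights, reference_weights)))


# Made up example: 'correct' weights versus weights from fewer iterations
reference = [0.20, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
fewer_iterations = [0.22, 0.09, 0.10, 0.10, 0.09, 0.10, 0.10, 0.10, 0.10]
error = weight_error(fewer_iterations, reference)
```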

What about the convergence in SR?

Again after 500 iterations we're down to a SR difference of less than 1bp (0.01 SR units). These are single point SR though, the probabilistic result would be different with worse results for lower iterations where we are more likely to get sparse portfolios with poor OOS performance at conservative points of the distribution.

Here are the same results, but now averaged across 50 different samples (it's real slow running this code, and the results aren't that different across samples). Because we're being conservative I'm using the 80% percentile at each point rather than the median or average. Obviously that would be the 10th worst result in each bracket.



Note that the SR shown is the absolute difference between the SR for 2000 iterations and for a smaller number of iterations. So SRs that are higher will be penalised as much as those that are lower.

That's for one year of in sample returns. I would say roughly 1000 to 1500 iterations is enough for decent convergence. How about ten years? Incidentally don't try this at home, it takes a long time.



It looks like convergence happens faster with longer periods of data which makes sense. With longer periods of data there is less chance of rogue samples causing extreme weights in the first few iterations. With 10 years of data we can probably risk reducing our iteration count down to 500 or so. 

Now, how do those results vary for the other random optimisation method, monte carlo? Here our resampling is parametric rather than non parametric. First for one year, weights:

That looks a little quicker than the same one year plot for bootstrapping. After 250 iterations we have our average weight difference down to around 3.2%, before it was more like 4.8%. We can probably get away with about half the iterations we had before.

How about Sharpe Ratios?
SR is noisier so there isn't as much evidence here of faster convergence.

You've probably seen enough of these plots now, so I will jump to a summary. These aren't hard numbers, but based on eyeballing the graphs to the rough point where convergence is down to about 0.01SR points or the equivalent in weights:

              1 year    10 years
Bootstrap       1200         600
Monte Carlo      600         300

We can turn these into a simple heuristic rule: use 1200*N^(-0.3) iterations for N years of data with bootstrapping, and halve that for monte carlo.
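The heuristic is easy enough to write down as code. This is only the eyeballed rule of thumb, not a fitted model, and the function names are made up:

```python
def bootstrap_iterations(years):
    """Rule of thumb: 1200 * N^(-0.3) iterations for N years of in sample data."""
    return round(1200 * years ** -0.3)


def monte_carlo_iterations(years):
    """Monte carlo gets half the bootstrap iteration count."""
    return round(0.5 * 1200 * years ** -0.3)


for years in (1, 10, 30):
    print(years, bootstrap_iterations(years), monte_carlo_iterations(years))
```

For 1 and 10 years this gives back roughly the figures in the table above.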

(My gut feeling is that convergence will be a little slower with real data)


The giant test

OK so now we have established our <checks notes> 83 different candidate optimisers. 71 of these are of the shrinkage family, seven are EPO options, there are our three special cases EW, MD and NMV; and then we have our two randomly based methods: bootstrapping and monte carlo, for which we've established appropriate numbers of iterations for reasonable convergence. 

We're going to run each of those on each of the 2000 samples of fake history we have, and then look at the distribution of out of sample SR across those samples. We'll then focus on the 50% median point and the more conservative 5% point. I will also measure how long it takes to do the 2000 samples for each method to get an indication of time per optimisation. This will be repeated for different lengths of in sample history, from 1 year up to 30 years. Remember that we'll always be using the last five years of our sample for OOS evaluation.


Qualifying round, shrinkage

To avoid too much work I will begin with repeating the exercise in the EPO literature where we try and find the optimum levels of shrinkage for correlation and SR (note - for simplified EPO the SR shrinkage is just zero, but I'm going to explore the whole surface). I will do this on the basis of both 50% median points of the distribution, and also the 5% conservative point. This will give me two candidate shrinkage models to put up against the random competitors of bootstrap and monte carlo in the two event finals. In fact, since optimal shrinkage will certainly vary by the amount of data, there will be a finalist for each length of in sample period in years. At this point I'm not concerned with speed, since all the shrinkage will take about the same amount of time to optimise.

Starting with 1 year, 50% median point. Rows are SR shrinkage, columns are correlation shrinkage:

          0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   0.87   0.86   0.84   0.81   0.79   0.79   0.77   0.75   0.72
0.25   0.88   0.86   0.83   0.81   0.79   0.77   0.77   0.74   0.70
0.50   0.87   0.85   0.83   0.79   0.77   0.75   0.74   0.72   0.67
0.75   0.83   0.82   0.78   0.74   0.72   0.70   0.69   0.67   0.62
0.80   0.81   0.80   0.77   0.73   0.71   0.69   0.67   0.65   0.60
0.85   0.77   0.77   0.75   0.70   0.67   0.67   0.65   0.62   0.59
0.90   0.73   0.73   0.70   0.65   0.65   0.63   0.62   0.61   0.57
0.95   0.65   0.64   0.61   0.58   0.57   0.56   0.56   0.57   0.55
1.00   0.41   0.40   0.40   0.39   0.38   0.38   0.38   0.37   0.38

Note: The coloured values are the 'special' pairs of shrinkage values. Top left in yellow is naive mean variance (no shrinkage). The rest of that row in purple is simple EPO with no mean shrinkage and varying correlation shrinkage. Bottom left in red is maximum diversification (full mean shrinkage, no correlation shrinkage). Bottom right in green is equal weights (full shrinkage on both). I won't be colouring these values in again, so keep them in mind.

There isn't much in it, but it looks like we want minimal shrinkage here with the optimal at (0.25,0). Shrinkage of more than 0.50 on means, and more than 0.40 or so on correlations leads to a dropoff in performance; though with zero correlation shrinkage we can push up to 0.75 on means without too much damage. Given the underlying data process is stable artificial data with a given distribution, perhaps that's not so surprising.

Now let's look at the 5% point. Obviously these numbers are negative, but we want the least negative:

          0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00  -0.12  -0.13  -0.13  -0.15  -0.16  -0.15  -0.16  -0.18  -0.20
0.25  -0.12  -0.13  -0.13  -0.13  -0.14  -0.15  -0.17  -0.19  -0.22
0.50  -0.13  -0.13  -0.14  -0.17  -0.18  -0.18  -0.19  -0.22  -0.25
0.75  -0.21  -0.18  -0.19  -0.21  -0.23  -0.24  -0.25  -0.26  -0.29
0.80  -0.21  -0.21  -0.21  -0.22  -0.24  -0.26  -0.28  -0.28  -0.31
0.85  -0.27  -0.25  -0.23  -0.27  -0.29  -0.28  -0.30  -0.30  -0.34
0.90  -0.35  -0.34  -0.33  -0.33  -0.32  -0.34  -0.33  -0.34  -0.36
0.95  -0.50  -0.48  -0.45  -0.44  -0.43  -0.43  -0.40  -0.37  -0.39
1.00  -0.72  -0.69  -0.68  -0.66  -0.66  -0.66  -0.66  -0.66  -0.47

Here having more shrinkage isn't as problematic. Correlation shrinkage can be as high as 0.75 without losing more than 3bp (0.75 is the EPO optimal remember); whilst mean shrinkage can be up to 0.50. The optimal shrinkage is still about zero but the surface is quite flat beyond that. 

To reiterate; this is random data. Remember what I said earlier: "For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1." This is still true!

One useful thing, as well as looking at points on the distribution, is to calculate paired t-tests, since the distributions can be directly compared (each particular point in a given distribution relates to the same random sample). After all the surface looks quite flat for most of the top left corner, but just how flat? It turns out that all the pairings are significantly worse than the optimal at a 1% critical value, except for (0.5,0), (0,0) and of course the optimal itself (0.25,0).
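A sketch of the paired test, for anyone who wants to see the mechanics. It works on the per-history differences in SR between two methods. The data below is made up, the function names are mine, and the p-value uses a normal approximation to the t distribution (reasonable with 2000 pairs, but an approximation nonetheless).

```python
import math
import random
import statistics


def paired_t_stat(sr_a, sr_b):
    """t statistic for the mean of the paired differences sr_a - sr_b."""
    diffs = [a - b for a, b in zip(sr_a, sr_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / std_err


def two_sided_p_value(t_stat):
    """Two sided p-value using a normal approximation (large samples only)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))


# Made up OOS SR for two methods over the same 2000 random histories
rng = random.Random(0)
sr_best = [rng.gauss(0.88, 0.5) for _ in range(2000)]
sr_other = [a - 0.02 + rng.gauss(0, 0.1) for a in sr_best]

t_stat = paired_t_stat(sr_best, sr_other)
p_value = two_sided_p_value(t_stat)
```

Because the two methods see the same histories, most of the noise cancels in the differences, which is why tiny gaps in SR can still come out as significant.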

Let's jump ahead to 30 years to get a feel for any differences. Again, first the results from the median:

          0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   1.19   1.18   1.16   1.15   1.13   1.12   1.11   1.08   1.04
0.25   1.19   1.18   1.16   1.14   1.12   1.11   1.09   1.06   1.01
0.50   1.19   1.18   1.16   1.13   1.10   1.08   1.06   1.02   0.95
0.75   1.13   1.13   1.10   1.04   1.01   0.98   0.95   0.89   0.81
0.80   1.09   1.09   1.06   1.01   0.96   0.93   0.90   0.84   0.77
0.85   1.05   1.04   1.01   0.95   0.90   0.87   0.84   0.78   0.72
0.90   0.97   0.95   0.93   0.86   0.82   0.80   0.76   0.71   0.66
0.95   0.84   0.82   0.79   0.73   0.69   0.67   0.65   0.61   0.58
1.00   0.47   0.47   0.45   0.43   0.42   0.41   0.41   0.40   0.38

And for the 5% point:

          0.00   0.20   0.40   0.60   0.70   0.75   0.80   0.90   1.00
0.00   0.32   0.31   0.31   0.30   0.29   0.28   0.27   0.25   0.21
0.25   0.32   0.31   0.31   0.29   0.28   0.27   0.26   0.23   0.19
0.50   0.31   0.31   0.29   0.28   0.25   0.24   0.22   0.19   0.14
0.75   0.24   0.24   0.21   0.19   0.17   0.15   0.13   0.08   0.02
0.80   0.19   0.20   0.19   0.15   0.13   0.11   0.09   0.05  -0.02
0.85   0.14   0.16   0.15   0.11   0.07   0.06   0.04  -0.01  -0.08
0.90   0.04   0.09   0.08   0.03  -0.00  -0.03  -0.04  -0.09  -0.15
0.95  -0.15  -0.09  -0.10  -0.12  -0.13  -0.16  -0.17  -0.21  -0.24
1.00  -0.60  -0.59  -0.57  -0.56  -0.55  -0.55  -0.55  -0.54  -0.47

Notice the numbers here are higher. With 30 years of data, and a fixed distribution, we have plenty of time to get a good handle on the parameters of the optimisation resulting in better OOS performance. 

Again as before the optimum pair is identical (0.25,0) but with a more conservative distributional point we can have more shrinkage with less damage to SR. 

Right, so with all that in mind what shrinkage should we take forward to the next round? It's a toughie; (0.25,0) is the winner in every round but using so little shrinkage makes me nervous (we already know it's likely to be a poor choice on real data). Let's make an arbitrary change to the format and allow one of the close runners up to get in as well; say (0.5, 0.4) which is within a few bp of the winner. And just because I like the name, we'll also run the EPO optimal of (0, 0.75); not quite as good but not terrible.

Let's remind ourselves who is in the final five of the random final, competing at distances from 1 year up to 30 years for both the Median Memorial Trophy, and the 5% Conservative Cup.

  • MC monte carlo (random, parametric) with instances depending on in sample length
  • BS bootstrapping (random, non parametric) with instances depending on in sample length
  • OS Optimal shrinkage (SR shrinkage = 0.25, nothing on correlations)
  • EPO shrinkage (correlation shrinkage = 0.75, nothing on SR)
  • CS Cautious shrinkage (SR shrinkage = 0.5, correlation shrinkage = 0.4)
  • NMV Naive mean variance (no shrinkage on either)
  • EW Equal weights (full shrinkage on both)
That's not a final five, I hear you cry, it's a final seven! Because by law any consideration of portfolio optimisation has to include these two extreme options.

I promised you some speed figures, and here they are: all the various shrinkage methods take about 5 microseconds per optimisation; bootstrapping takes around 17.5 seconds per optimisation, and monte carlo between 5.3 and 29 seconds per optimisation (for 1 year and 30 years respectively). It took several days to complete the monte carlo for 30 years!

The results for 1 year and 30 years are pretty similar, so for brevity here are those for 30 years, median first:

NMV 1.190
BS  1.188
OS  1.188
MC  1.185
CS  1.158
EPO 1.122
EW  0.379

30 years with 5% point now:


BS  0.318
MC  0.317
OS  0.315
NMV 0.315
CS  0.292
EPO 0.278
EW -0.4

I'd say that you can take your pick from pretty much any of these methods apart from equal weights, which with the structure we have imposed is always going to be suboptimal.

Summary and what's next

This is optimisation in highly controlled laboratory conditions with nice distributions that don't move around when you're not looking. Still we did find some interesting results:
  • We got some ballpark for MC/bootstrap convergence rates
  • Even in this context some shrinkage on the mean is optimal
  • The gold standard for weights is bootstrap/MC, which also don't require any shrinkage meta-parameters, but they are bloody slow. 
We'll take these with us on the next stage of our journey when we use some real data.

Footnote: the perfect optimiser that we can't use

Incidentally, there is another optimiser I haven't considered here which precisely suits my goal of finding the best expected probabilistic outcome. It's a grid search that considers all possible weights, and for each weight bootstraps a distribution of SR outcomes given those weights and the in sample data; and then takes the 50% or 5% point of that SR distribution. Unfortunately it's very slow; and even using a coarse to fine approach I was unable to get it to run in a reasonable time.
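To give a flavour of why it's slow, here is a toy sketch of the idea on three made up assets with a very coarse grid. This is not the code I ran; all the names, the grid size, the number of bootstraps and the choice to reuse the same resampled days for every candidate are assumptions for illustration. The cost grows with (grid points) x (bootstraps) x (days), and the number of grid points explodes with nine assets.

```python
import itertools
import math
import random
import statistics

DAYS_PER_YEAR = 256


def annualised_sr(daily_returns):
    """Annualised Sharpe Ratio of a series of daily returns."""
    return (math.sqrt(DAYS_PER_YEAR) * statistics.fmean(daily_returns)
            / statistics.stdev(daily_returns))


def weight_grid(n_assets, steps):
    """All long only weight vectors in increments of 1/steps that sum to one."""
    for combo in itertools.product(range(steps + 1), repeat=n_assets):
        if sum(combo) == steps:
            yield [c / steps for c in combo]


def bootstrapped_sr_point(weights, returns, n_boot, percentile, seed):
    """Chosen percentile of the bootstrapped SR distribution for fixed weights."""
    # Re-seeding means every candidate is scored on the same resampled days
    rng = random.Random(seed)
    portfolio = [sum(w * r for w, r in zip(weights, row)) for row in returns]
    srs = sorted(annualised_sr(rng.choices(portfolio, k=len(portfolio)))
                 for _ in range(n_boot))
    return srs[min(n_boot - 1, int(percentile * n_boot))]


def grid_search(returns, steps=4, n_boot=50, percentile=0.05, seed=0):
    """Pick the grid weights with the best SR at the chosen percentile."""
    n_assets = len(returns[0])
    return max(weight_grid(n_assets, steps),
               key=lambda w: bootstrapped_sr_point(w, returns, n_boot,
                                                   percentile, seed))


# Made up data: one year of daily returns for three independent assets
data_rng = random.Random(1)
fake_returns = [[data_rng.gauss(0.0005, 0.01) for _ in range(3)]
                for _ in range(DAYS_PER_YEAR)]
best_weights = grid_search(fake_returns)
```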

