Monday, 8 June 2026

Forecasting statistical estimates when data gets real

 This is my third post in a series about optimisation and fitting. In my previous post I used random data to calibrate and evaluate many portfolio optimisation techniques. It's worth quoting in full from that post:

Random data is not real data: Well duh. But why is this important? Because random data is drawn from a fixed and well behaved distribution. This means the optimiser only has to discover / estimate the parameters of that distribution as more data is revealed to it. But real data doesn't have a fixed and known distribution. It doesn't actually have any distribution at all. We just model it hoping it does.

To summarise then, random data from a fixed distribution differs from real data in three important ways:

  • There is no distribution! We just assume there is one.
  • The distribution (which doesn't exist) is not known, and thus it's likely the distribution we assume is the wrong one. This is especially true for modelling underlying financial price returns with joint Gaussian models.
  • The distribution (which again, doesn't exist) isn't fixed, but can change over time.

And it has one thing in common:

  • The parameters of the distribution are unknown and have to be learned over time.
In this post I'm going to explore this learning process for two key statistical estimates: correlations and Sharpe Ratios. What I am interested in is how much wider of the mark our estimates for these two things are likely to be for real data vs random data. This obviously has important implications for optimisation.


Let's look at a plot.



This is for random data generated by a process with a true SR of 1. It shows the evolution of the SR and its statistical distribution as it is re-estimated each year. There is a burn in year which is missing, and then in the first year we can see our estimate of the SR using all available data so far (in orange), and the SR for the current year (in blue). You can see that the orange line is lagged by a year as it is purely out of sample and always a year behind. I've then used the orange line to estimate the theoretical sampling distribution of the Sharpe Ratio for a one year period, and constructed a 1.96x confidence interval (so about 95%) around the orange line, shown by the green and red lines.

Note: The theoretical standard deviation of the sampling distribution of the Sharpe Ratio, assuming i.i.d. returns, is sqrt[(1+0.5SR^2)/N] where N is the number of periods.

Broadly speaking if our estimates are correct then we'd hope to see around 1/20 of the blue points outside the red and green lines, and around 19/20 on the inside. There are 40 years of data here and we go outside the range twice, which is roughly what we'd expect.
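As a rough sketch of the calculation behind the green and red lines, assuming the SR is annualised and N is counted in years (the function names are mine, purely for illustration):

```python
import math


def sharpe_ratio_std_error(sharpe_ratio, n_periods):
    """Theoretical std dev of the SR sampling distribution, assuming i.i.d. returns.

    Assumes sharpe_ratio is measured per period, in the same units as n_periods
    (for example an annual SR and a number of years).
    """
    return math.sqrt((1 + 0.5 * sharpe_ratio**2) / n_periods)


def confidence_interval(sharpe_ratio, n_periods, multiplier=1.96):
    """Lower and upper bounds of the (roughly 95%) interval around an SR estimate."""
    half_width = multiplier * sharpe_ratio_std_error(sharpe_ratio, n_periods)
    return sharpe_ratio - half_width, sharpe_ratio + half_width


# Interval for next year's SR if our estimate so far is an annual SR of 1.0
lower, upper = confidence_interval(1.0, n_periods=1)
print(round(lower, 2), round(upper, 2))
```

Even with a perfectly known SR of 1, the one year interval is several SR units wide, which is why single year results tell us so little.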

Another way of measuring this is to look at our error term, normalised by our standard deviation. This will be equal to:

[(SR estimate this year N) - (SR estimate years 0...N-1)]/(SR sampling std dev error 0... N-1)

If I take the square of this, average over all years, and then take the square root, I get the normalised root mean squared error. This comes out at 0.998 for all the data above.
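Here is a minimal sketch of that calculation on some freshly generated random data. It assumes 256 business days per year, an expanding window for the 'so far' estimate, and Gaussian daily returns; the function names are hypothetical and this is not the code used to produce the plots.

```python
import math
import random
import statistics

DAYS_PER_YEAR = 256  # assumed number of business days in a 'year'


def annual_sharpe(daily_returns):
    """Annualised Sharpe Ratio from a list of daily returns."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return math.sqrt(DAYS_PER_YEAR) * mean / stdev


def normalised_errors(returns_by_year):
    """One normalised error per year, skipping the first (burn in) year."""
    errors = []
    for year in range(1, len(returns_by_year)):
        history = [r for one_year in returns_by_year[:year] for r in one_year]
        sr_so_far = annual_sharpe(history)
        sr_this_year = annual_sharpe(returns_by_year[year])
        # Sampling std dev for a one year period, using the SR estimated so far
        std_error = math.sqrt(1 + 0.5 * sr_so_far**2)
        errors.append((sr_this_year - sr_so_far) / std_error)
    return errors


def normalised_rmse(errors):
    """Square each error, average over all years, then take the square root."""
    return math.sqrt(statistics.fmean([e**2 for e in errors]))


# Random data with a true annual SR of 1 (daily mean of 1/16 with unit daily std dev)
rng = random.Random(42)
fake_history = [
    [rng.gauss(1 / 16, 1.0) for _ in range(DAYS_PER_YEAR)] for _ in range(40)
]
print(round(normalised_rmse(normalised_errors(fake_history)), 2))
```

A value close to 1 means the errors are about as large as the theoretical sampling distribution says they should be.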



The blue line in this plot shows the absolute value of the error term for each year. The orange line shows the RMSE. You can see this gradually declining over time and settling in at around 0.85.

Here are the same two plots for a correlation pair estimate:





Again the RMSE tends to end up around 0.86

Incidentally, we can also do these plots for longer periods. Here is the RMSE evolution for a SR estimate looking ahead over the next 5 years:

The RMSE here is a little higher - around 1.0


Now, let's look at some real data. I'm going to use the p&l from trading the US10 year bond with a 16,64 day EWMAC. Let's begin by trying to forecast the SR one year ahead:


Even without calculating the error we can see that there are more boundary breakages than before with random data. Here is the error:

Notice that it is higher than before (around 1.25; or about sqrt(2) times bigger than the random data RMSE) and doesn't slowly converge as it did with random data; instead it stays roughly constant (ignoring the initial period of luck at the start).

We get a similar picture for 5 years:

What about correlations? Let's look at the correlation between this slow momentum on 10 year US bonds, and the carry rule on the same instrument:


Wow, that's noisy. The RMSE will be off the charts. What about over 5 years?

Ouch. If we look at the correlation between two variations of the same trading rule, EWMAC64,256 and EWMAC32,128 - which are naturally highly correlated - then it's not much better:

Again the RMSE would be in double digits.

Those might be flukes, so let's look at lots of random results. I'm going to pick an instrument and trading rule randomly, and measure its final RMSE number. I will then generate some random returns of the same length from the same SR distribution (by measuring the full sample SR for the relevant instrument/rule pairing), and measure the RMSE of those. I will then select another rule from the same instrument, get the correlation of the two p&l streams, and generate some more random returns with the given expected correlation. Finally, I will measure the correlation RMSE for the two sets of real returns, and the two sets of random returns.
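The random benchmark returns could be generated along these lines. This is a sketch under my own assumptions (Gaussian daily returns, unit daily standard deviation, 256 days per year), with hypothetical function names:

```python
import math
import random


def correlated_returns(sr_a, sr_b, correlation, n_days, rng, days_per_year=256):
    """Two Gaussian daily return series with given annual SRs and correlation.

    Both series have unit daily standard deviation; names and defaults are illustrative.
    """
    mean_a = sr_a / math.sqrt(days_per_year)
    mean_b = sr_b / math.sqrt(days_per_year)
    returns_a, returns_b = [], []
    for _ in range(n_days):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        returns_a.append(mean_a + z1)
        # Mixing z1 and z2 gives unit variance and the requested correlation
        returns_b.append(mean_b + correlation * z1 + math.sqrt(1 - correlation**2) * z2)
    return returns_a, returns_b


def sample_correlation(xs, ys):
    """Plain Pearson correlation of two equal length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
a, b = correlated_returns(0.5, 0.75, 0.6, n_days=256 * 30, rng=rng)
print(round(sample_correlation(a, b), 2))
```

The same RMSE calculation is then run on both the real and the random streams, and the ratio of the two tells us how much harder real data is to forecast.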

If I consider the ratio [RMSE real data / RMSE random data] (both for next one year); then the median of this over a few thousand randomly selected trading strategy components is 1.06 for Sharpe Ratios, and for correlations around 5.6. 

In simple terms, we are a little bit worse at forecasting Sharpe Ratios in real data one year ahead than we would be with random data, but a LOT worse with correlations.

Partly this is because we are pretty terrible at forecasting SR one year ahead anyway even with a stable underlying distribution; we don't do much worse with real data. However it does seem that correlations are far more unstable in reality than in randomly generated data. Note that these are correlations for trading strategy component returns. In some cases they are mathematically related (eg EWMAC of different speeds) and could be derived with some assumptions, a pencil, and a napkin. They are certainly more stable than the returns of the underlying instruments themselves (think about the changing correlation of stocks and bonds in different inflation environments). 

(Note: These numbers are about the same for five years ahead and also ten years ahead)

If we recall from the prior post that the optimal shrinkage is zero on correlations with random data; we can now see why with actual data we'd probably want to opt for some correlation shrinkage; purely because the sampling error is much larger in practice. That is the empirical finding of the EPO paper. It does feel a bit weird since up to now my gut feeling has been that we have to shrink means a lot because they are much harder to forecast and because they have an outsized effect on portfolio weights compared to differences in correlation. Whilst the latter is still true it seems the former is not.

Food for thought. Anyway the next step is to repeat the 'Ultimate Fitting Championships' battle, but this time with real data.

UFC - Ultimate Fitting Championships (Evaluating and calibrating portfolio optimisation methods with random data)

As I said in my last post I'm currently in the process of a mega-sized research project on fitting. In the first post I examined the correct way to cluster combinations of trading rules and instruments. 

This next post is rather meatier, and is about evaluating and calibrating some portfolio optimisation techniques. We might call this 'meta optimisation', since we want to find the best way to do optimisation, which itself is effectively a form of optimisation - we are choosing between alternatives based on some utility function.

And because it's optimisation, it can be done in a bad in sample way. And often is. People do have a habit of using a particular data set, working out which optimisation will work best, and then using that. They think they are good people because the optimisation is running in a nice robust out of sample fashion - but they are not good people. Because the choice of optimisation itself has been made having seen all the data.

To avoid this I'm initially going to use random data to evaluate and calibrate the various optimisation techniques. Then no real data will be harmed. A subsequent post will use some real data.

Note: I've sort of had a go at this before, here. However this is a much more thorough look at the problem, whereas the previous post was very limited in scope both of data and also of methodologies. There is also a link here to my multiple posts about probabilistic evaluation of outcomes.

Note 2: whilst researching this post I found a 'new' shrinkage based method, EPO (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3530390), developed by one of my favourite authors (Pedersen of AQR) with co-authors. The main reason I like this is because of the allusion to the more famous EPO, which as someone who cycles and follows cycling is obviously quite ironically entertaining. (I have described it as 'new' but it's several years old, and I can now see that it was highlighted by younggotti in a comment on my earlier post, which I didn't follow up.)

Note 3: I also came across this relatively new book: https://portfoliooptimizationbook.com/ which is quite a nice survey of the field.

Note 4: "Rob why don't you use AI in developing trading systems like everyone else on LinkedIn". In fact AI has one - exactly one - use case as far as I am concerned:

Thank you chatgpt



The data

For the data I'm going to keep things real simple. I will be doing an optimisation with nine assets. That might seem an odd number*, but it's because in a subsequent post I will be repeating some of this with real assets, and I have nine specific ones in mind (spoiler alert: three instruments trading three different trading rules/methods: two speeds of trend, plus carry). The true SR of the assets will be drawn with equal probability from this list: [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]. The true correlation of the assets will be drawn with equal probability from this list [-0.25, 0, 0.25, 0.5, 0.75, 0.9]. These lists are not fully symmetric, since trading strategy returns of the type I am optimising tend not to have substantial negative correlations or very high/low SR.

* obviously it is indeed an odd number, but it might also appear to be an arbitrary choice.

I will then randomly draw returns from those multivariate Gaussian distributions, generating 2000 outcomes, each with 35 years of history (each 'year' is 256 business days). Why 35 years? Well, I will be varying the length of the in sample period. To be precise, I will use in sample periods between 1 year and 30 years; and evaluate out of sample on a five year basis. Using a (shorter) longer out of sample period would just (increase) reduce the variance between different outcomes; it won't affect their relative efficacy. Since it seems unlikely anyone will go more than five years without refitting (I do it every year in backtest), this seems about right.
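The parameter draw described above might look something like this sketch. The function name is hypothetical, and note one thing the sketch deliberately stops short of: a matrix built from independent pairwise draws is not guaranteed to be a valid (positive semi-definite) correlation matrix, and how that is handled before drawing the Gaussian returns is not covered here.

```python
import random

SR_CHOICES = [-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
CORRELATION_CHOICES = [-0.25, 0, 0.25, 0.5, 0.75, 0.9]


def draw_true_parameters(n_assets, rng):
    """Draw a true SR per asset and a true correlation per pair of assets."""
    true_sr = [rng.choice(SR_CHOICES) for _ in range(n_assets)]
    correlation = [[1.0] * n_assets for _ in range(n_assets)]
    for i in range(n_assets):
        for j in range(i + 1, n_assets):
            value = rng.choice(CORRELATION_CHOICES)
            correlation[i][j] = value
            correlation[j][i] = value  # keep the matrix symmetric
    return true_sr, correlation


rng = random.Random(1)
true_sr, true_correlation = draw_true_parameters(9, rng)
print(true_sr)
```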

I will generate a certain number of histories and then evaluate the relative performance of each optimiser on each history; thus avoiding the role of luck if one optimiser happens to get a lucky break.

Note if it isn't obvious I'm assuming I am a SR maximiser (equivalent to a CAGR maximiser for a leveraged investor with Gaussian returns), and I'm assuming all assets have the same expected standard deviation. As a futures trader this is fine. I'm also assuming weights will be positive. These are my standard boilerplate assumptions for optimising trading strategy returns. 


Random data is not real data

Well duh. But why is this important? Because random data is drawn from a fixed and well behaved distribution. This means the optimiser only has to discover / estimate the parameters of that distribution as more data is revealed to it. But real data doesn't have a fixed and known distribution. It doesn't actually have any distribution at all. We just model it hoping it does.

Essentially random data sets a lower bound on robustness calibration. For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1.
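To make the arithmetic concrete, here is a tiny sketch of shrinkage as a weighted average. The SR estimates and the prior are made up, and the function name is mine:

```python
def shrink_towards_prior(estimates, prior, shrinkage):
    """Weighted average of estimates and a prior: 0 = no shrinkage, 1 = full shrinkage."""
    return [(1 - shrinkage) * estimate + shrinkage * prior for estimate in estimates]


estimated_sr = [1.2, 0.4, -0.1]  # made up in sample SR estimates
prior_sr = 0.5  # made up prior
print(shrink_towards_prior(estimated_sr, prior_sr, shrinkage=0.1))
```

With a shrinkage of 0.1 every estimate moves a tenth of the way towards the prior; with a shrinkage of 1.0 they would all be replaced by the prior.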

This also means that less robust methods will be flattered compared to more robust methods when using random rather than real data. For this reason we need to treat the results with some caution; and in a future post I will be sense checking them against some real data.


The criteria

On what basis should we evaluate an optimiser? Clearly we are interested in the out of sample performance - Sharpe Ratio in this case. But it's the probabilistic performance that interests me. Using random data means we can look at a distribution of outcomes. And I'm not just interested in the central, median point of that distribution. I'm concerned with optimisers that produce extreme, sparse, weights. On average these might look fine, but their downside will be worse than a more robust optimiser which produces more reasonable weights. So I am also going to evaluate performance at a more cautious 5% percentile point (what we use for statistical significance).
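A minimal sketch of reading off those two points from a list of out of sample SR, one per random history. The linear interpolation between neighbouring outcomes is my assumption, as are the names:

```python
import math


def distribution_point(outcomes, percentile):
    """Value at a given percentile (0 to 100), using linear interpolation."""
    ordered = sorted(outcomes)
    position = (len(ordered) - 1) * percentile / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def evaluate_optimiser(out_of_sample_sr):
    """Median and 5% point of the out of sample SR across random histories."""
    return {
        "median": distribution_point(out_of_sample_sr, 50),
        "conservative": distribution_point(out_of_sample_sr, 5),
    }
```

Two optimisers with the same median can have very different conservative points, which is exactly the difference this criterion is designed to pick up.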

Of course, there are other criteria for optimising. Speed is an important one that will be bad for gridsearch, bootstrap and monte carlo type methods. Related to that is convergence - how quickly does a bootstrap or monte carlo converge on weights that are 'good enough'? If convergence is quite quick then the penalty of running multiple optimisations won't be as large.


The optimisers

Let's quickly run through the competitors in this little olympics:

  •  monte carlo (random, parametric)
  •  bootstrapping (random, non parametric)
  • double shrinkage (shrinking SR towards average SR, and correlations to zero). Shrinkage can range from zero (no shrinkage) to one (full shrinkage). Note that with the right parameters this encompasses some other methods including:
    • NMV naive mean variance (no shrinkage on anything)
    • EW equal weights (both full shrinkage)
    • MD maximum diversification (no shrinkage correlation, full shrinkage on SR)
    •  EPO (we just shrink the correlation matrix to some degree)
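The double shrinkage family can be sketched as two small transformations applied to the inputs before optimising. This only shows the shrinking step, not the optimiser itself, and the function names are hypothetical:

```python
def shrink_correlation(correlation, shrinkage):
    """Shrink off diagonal correlations towards zero; the diagonal stays at one."""
    n = len(correlation)
    return [
        [1.0 if i == j else (1 - shrinkage) * correlation[i][j] for j in range(n)]
        for i in range(n)
    ]


def shrink_sr(sr_estimates, shrinkage):
    """Shrink each SR estimate towards the average SR across assets."""
    average = sum(sr_estimates) / len(sr_estimates)
    return [(1 - shrinkage) * sr + shrinkage * average for sr in sr_estimates]


def double_shrinkage_inputs(sr_estimates, correlation, sr_shrinkage, correlation_shrinkage):
    """Shrunk inputs that would then be handed to a mean variance optimiser."""
    return shrink_sr(sr_estimates, sr_shrinkage), shrink_correlation(
        correlation, correlation_shrinkage
    )
```

With full shrinkage on both, every asset has the same SR and zero correlation with everything else, so with equal standard deviations the best the optimiser can do is equal weights.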

Notice I am not at this stage using any kind of clustering or hierarchical method, such as my own 'handcrafting'. My intention is to first, in this post, establish the best way to optimise relatively small portfolios. Then in a subsequent post I will properly evaluate the performance / speed tradeoff of using this small portfolio optimisation inside a top down clustering method.

There are a whole bunch of other methods we could use, but I have a good understanding of the methods above and I don't feel the need to go very fancy. 

Note that within the shrinkage team we have a number of competitors as we can vary the shrinkage in a range of let's say 0 (no shrinkage, use empirical results), to 1.0 (full shrinkage). For correlation shrinkage I'm going to use these nine steps: 0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0 (the optimal EPO shrinkage is 0.75 hence the extra granularity around there). For SR shrinkage I'm going to use these nine values [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0] because I know that minimal amounts of mean shrinkage don't achieve much. That gives me 81 possible shrinkage methods; but one is actually equal weights (shrinkage of 1 on SR and 1 on correlations); another is maximum diversification (1,0), a third is naive mean variance (0,0), and there are seven that are EPO (0, 0.2... 0.9). So there are actually 71 shrinkage methods of various strengths.
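For anyone who wants to check the counting, this short sketch enumerates the grid and labels each (SR shrinkage, correlation shrinkage) pair. The labels follow the special cases named in the text; the `classify` function itself is just an illustration:

```python
from itertools import product

CORRELATION_SHRINKAGE = [0, 0.2, 0.4, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]
SR_SHRINKAGE = [0, 0.25, 0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]


def classify(sr_shrinkage, correlation_shrinkage):
    """Label a (SR shrinkage, correlation shrinkage) pair using the post's special cases."""
    if sr_shrinkage == 1 and correlation_shrinkage == 1:
        return "EW"
    if sr_shrinkage == 1 and correlation_shrinkage == 0:
        return "MD"
    if sr_shrinkage == 0 and correlation_shrinkage == 0:
        return "NMV"
    if sr_shrinkage == 0 and 0 < correlation_shrinkage < 1:
        return "EPO"
    return "shrinkage"


counts = {}
for sr_shrinkage, correlation_shrinkage in product(SR_SHRINKAGE, CORRELATION_SHRINKAGE):
    label = classify(sr_shrinkage, correlation_shrinkage)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```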


Establishing convergence speed

Before we begin we need to establish the number of runs required to establish convergence for the bootstrapping and monte carlo methods. 

There are a couple of ways we can establish convergence speed:

  • How quickly the weights 'settle down', i.e. do not change very much
  • How quickly the probabilistic out of sample SR 'settles down'

Note also that this convergence will take longer with larger portfolios, since there is a bigger possible space to cover.
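For readers unfamiliar with the two random methods, here is a very rough sketch of how the resampled histories differ. It is simplified to a single asset (with several assets you would resample whole dates, or draw from a multivariate Gaussian, so correlations are preserved), and averaging the weights across iterations is my assumption about how the runs are combined. All names are hypothetical.

```python
import random
import statistics


def bootstrap_sample(returns, rng):
    """Non parametric: resample the observed returns with replacement."""
    return [rng.choice(returns) for _ in range(len(returns))]


def monte_carlo_sample(returns, rng):
    """Parametric: draw new returns from a Gaussian fitted to the observed returns."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return [rng.gauss(mean, stdev) for _ in range(len(returns))]


def average_weights(weights_per_iteration):
    """Average the weights found on each resampled history, asset by asset."""
    n_iterations = len(weights_per_iteration)
    return [sum(column) / n_iterations for column in zip(*weights_per_iteration)]
```

Each iteration optimises on one resampled history, so the number of iterations directly drives both the run time and how settled the averaged weights are.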

The following shows, for bootstrapping on one random one year long in-sample period, how the average* error narrows between the 'correct' weights (run with the maximum of 2000 iterations) and the weights we get with fewer iterations:

* sqrt(sum(error^2))

We can see that after about 500 iterations (x axis) the average error (y axis) is down to 2% or less. In other words given the average weight would be 11.1%; we're looking at having a weight that has a good chance of being between 9% and 13%.
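The error measure in the footnote is simple enough to sketch directly. The weights below are made up, and the function name is mine:

```python
import math


def weight_error(weights, reference_weights):
    """sqrt(sum(error^2)) between a set of weights and the reference weights."""
    return math.sqrt(
        sum((w - ref) ** 2 for w, ref in zip(weights, reference_weights))
    )


reference = [0.5, 0.3, 0.2]  # made up 'correct' weights from the full 2000 iterations
fewer_iterations = [0.45, 0.33, 0.22]  # made up weights from a shorter run
print(round(weight_error(fewer_iterations, reference), 3))
```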

What about the convergence in SR?

Again after 500 iterations we're down to a SR difference of less than 1bp (0.01 SR units). These are single point SR though, the probabilistic result would be different with worse results for lower iterations where we are more likely to get sparse portfolios with poor OOS performance at conservative points of the distribution.

Here are the same results, but now averaged across 50 different samples (it's real slow running this code, and the results aren't that different across samples). Because we're being conservative I'm using the 80% percentile at each point rather than the median or average. Obviously that would be the 10th worst result in each bracket.



Note that the SR shown is the absolute difference between the SR for 2000 iterations and for a smaller number of iterations. So SR that are higher will be penalised as much as those that are lower.

That's for one year of in sample returns. I would say roughly 1000 to 1500 iterations is enough for decent convergence. How about ten years? Incidentally don't try this at home, it takes a long time.



It looks like convergence happens faster with longer periods of data which makes sense. With longer periods of data there is less chance of rogue samples causing extreme weights in the first few iterations. With 10 years of data we can probably risk reducing our iteration count down to 500 or so. 

Now, how do those results vary for the other random optimisation method, monte carlo? Here our resampling is parametric. First for one year, weights:

That looks a little quicker than the same one year plot for bootstrapping. After 250 iterations we have our average weight difference down to around 3.2%, before it was more like 4.8%. We can probably get away with about half the iterations we had before.

How about Sharpe Ratios?
SR is noisier so there isn't as much evidence here of faster convergence.

You've probably seen enough of these plots now, so I will jump to a summary. These aren't hard numbers, but based on eyeballing the graphs to the rough point where convergence is down to about 0.01SR points or the equivalent in weights:

               1 year    10 years
Bootstrap      1200      600
Monte Carlo    600       300

We can turn these into a simple heuristic rule: use 1200*N^(-0.3) iterations for N years of data with bootstrapping, and halve that for monte carlo.
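As a quick sanity check that the heuristic reproduces the eyeballed table, here it is as a sketch (the function name and the method labels are mine):

```python
def iterations_required(n_years, method="bootstrap"):
    """Rule of thumb iteration count: 1200 * N^(-0.3), halved for monte carlo."""
    iterations = 1200 * n_years ** (-0.3)
    if method == "monte_carlo":
        iterations = iterations / 2
    return round(iterations)


for years in (1, 10, 30):
    print(years, iterations_required(years), iterations_required(years, "monte_carlo"))
```

For ten years this gives roughly 600 and 300, in line with the numbers above.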

(My gut feeling is that convergence will be a little slower with real data)


The giant test

OK so now we have established our <checks notes> 83 different candidate optimisers. 71 of these are of the shrinkage family, seven are EPO options, there are our three special cases EW, MD and NMV; and then we have our two randomly based methods: bootstrapping and monte carlo, for which we've established appropriate numbers of iterations for reasonable convergence. 

We're going to run each of those on each of the 2000 samples of fake history we have, and then look at the distribution of out of sample SR across those samples. We'll then focus on the 50% median point and the more conservative 5% point. I will also measure how long it takes to do the 2000 samples for each method to get an indication of time per optimisation. This will be repeated for different lengths of in sample history, from 1 year up to 30 years. Remember that we'll always be using the last five years of our sample for OOS evaluation.


Qualifying round, shrinkage

To avoid too much work I will begin with repeating the exercise in the EPO literature where we try and find the optimum levels of shrinkage for correlation and SR (note - for simplified EPO the SR shrinkage is just zero, but I'm going to explore the whole surface). I will do this on the basis of both 50% median points of the distribution, and also the 5% conservative point. This will give me two candidate shrinkage models to put up against the random competitors of bootstrap and monte carlo in the two event finals. In fact, since optimal shrinkage will certainly vary by the amount of data, there will be a finalist for each length of in sample period in years. At this point I'm not concerned with speed, since all the shrinkage will take about the same amount of time to optimise.

Starting with 1 year, 50% median point. Rows are SR shrinkage, columns are correlation shrinkage:

      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0.00 0.87 0.86 0.84 0.81 0.79 0.79 0.77 0.75 0.72
0.25 0.88 0.86 0.83 0.81 0.79 0.77 0.77 0.74 0.70
0.50 0.87 0.85 0.83 0.79 0.77 0.75 0.74 0.72 0.67
0.75 0.83 0.82 0.78 0.74 0.72 0.70 0.69 0.67 0.62
0.80 0.81 0.80 0.77 0.73 0.71 0.69 0.67 0.65 0.60
0.85 0.77 0.77 0.75 0.70 0.67 0.67 0.65 0.62 0.59
0.90 0.73 0.73 0.70 0.65 0.65 0.63 0.62 0.61 0.57
0.95 0.65 0.64 0.61 0.58 0.57 0.56 0.56 0.57 0.55
1.00 0.41 0.40 0.40 0.39 0.38 0.38 0.38 0.37 0.38
Note: The coloured values are the 'special' pairs of shrinkage values. Top left in yellow is naive mean variance (no shrinkage). The rest of that row in purple is simple EPO with no mean shrinkage and varying correlation shrinkage. Bottom left in red is maximum diversification (full mean shrinkage, no correlation shrinkage). Bottom right in green is equal weights (full shrinkage on both). I won't be colouring these values in again, so keep them in mind.

There isn't much in it, but it looks like we want minimal shrinkage here with the optimal at (0.25,0). Shrinkage of more than 0.50 on means, and more than 0.40 or so on correlations leads to a dropoff in performance; though with zero correlation shrinkage we can push up to 0.75 on means without too much damage. Given the underlying data process is stable artificial data with a given distribution, perhaps that's not so surprising.

Now let's look at the 5% point. Obviously these numbers are negative, but we want the least negative:

      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0.00 -0.12 -0.13 -0.13 -0.15 -0.16 -0.15 -0.16 -0.18 -0.20
0.25 -0.12 -0.13 -0.13 -0.13 -0.14 -0.15 -0.17 -0.19 -0.22
0.50 -0.13 -0.13 -0.14 -0.17 -0.18 -0.18 -0.19 -0.22 -0.25
0.75 -0.21 -0.18 -0.19 -0.21 -0.23 -0.24 -0.25 -0.26 -0.29
0.80 -0.21 -0.21 -0.21 -0.22 -0.24 -0.26 -0.28 -0.28 -0.31
0.85 -0.27 -0.25 -0.23 -0.27 -0.29 -0.28 -0.30 -0.30 -0.34
0.90 -0.35 -0.34 -0.33 -0.33 -0.32 -0.34 -0.33 -0.34 -0.36
0.95 -0.50 -0.48 -0.45 -0.44 -0.43 -0.43 -0.40 -0.37 -0.39
1.00 -0.72 -0.69 -0.68 -0.66 -0.66 -0.66 -0.66 -0.66 -0.47

Here having more shrinkage isn't as problematic. Correlation shrinkage can be as high as 0.75 without losing more than 3bp (0.75 is the EPO optimal remember); whilst mean shrinkage can be up to 0.50. The optimal shrinkage is still about zero but the surface is quite flat beyond that. 

To reiterate; this is random data. Remember what I said earlier: "For example, suppose we determine that the correct shrinkage for the vector of expected SR on a Bayesian portfolio optimisation using random data is 0.1 (which means we average using 90% of the estimated SR, and 10% of the prior SR). Then it's likely the correct shrinkage on real data will be higher than 0.1." This is still true!

One useful thing to do, as well as looking at points on the distribution, is to calculate paired t-tests, since the distributions can be directly compared (each particular point in a given distribution relates to the same random sample). After all the surface looks quite flat for most of the top left corner, but just how flat? It turns out that all the pairings are significantly worse than the optimal at a 1% critical value except for the optimal itself (0.25,0), (0.5,0) and (0,0).
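A paired test works on the differences between two optimisers on the same random histories, rather than on the two distributions separately. Here is a minimal sketch of the test statistic only (turning it into a p-value needs the t distribution, which I have left out); the function name is hypothetical:

```python
import math
import statistics


def paired_t_statistic(sr_a, sr_b):
    """t statistic for the mean of the paired differences sr_a - sr_b."""
    differences = [a - b for a, b in zip(sr_a, sr_b)]
    mean_difference = statistics.fmean(differences)
    std_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_difference / std_error


# Made up out of sample SR for two optimisers on the same four random histories
optimiser_a = [1.0, 2.0, 3.0, 4.0]
optimiser_b = [0.0, 2.0, 1.0, 3.0]
print(round(paired_t_statistic(optimiser_a, optimiser_b), 3))
```

Because a lot of the variation between histories is common to both optimisers, the differences are much less noisy than the raw SR, which is why small gaps on a flat looking surface can still be significant.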

Let's jump ahead to 30 years to get a feel for any differences. Again, first the results from the median:

      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0.00 1.19 1.18 1.16 1.15 1.13 1.12 1.11 1.08 1.04
0.25 1.19 1.18 1.16 1.14 1.12 1.11 1.09 1.06 1.01
0.50 1.19 1.18 1.16 1.13 1.10 1.08 1.06 1.02 0.95
0.75 1.13 1.13 1.10 1.04 1.01 0.98 0.95 0.89 0.81
0.80 1.09 1.09 1.06 1.01 0.96 0.93 0.90 0.84 0.77
0.85 1.05 1.04 1.01 0.95 0.90 0.87 0.84 0.78 0.72
0.90 0.97 0.95 0.93 0.86 0.82 0.80 0.76 0.71 0.66
0.95 0.84 0.82 0.79 0.73 0.69 0.67 0.65 0.61 0.58
1.00 0.47 0.47 0.45 0.43 0.42 0.41 0.41 0.40 0.38
And for the 5% point:
      0.00  0.20  0.40  0.60  0.70  0.75  0.80  0.90  1.00
0.00 0.32 0.31 0.31 0.30 0.29 0.28 0.27 0.25 0.21
0.25 0.32 0.31 0.31 0.29 0.28 0.27 0.26 0.23 0.19
0.50 0.31 0.31 0.29 0.28 0.25 0.24 0.22 0.19 0.14
0.75 0.24 0.24 0.21 0.19 0.17 0.15 0.13 0.08 0.02
0.80 0.19 0.20 0.19 0.15 0.13 0.11 0.09 0.05 -0.02
0.85 0.14 0.16 0.15 0.11 0.07 0.06 0.04 -0.01 -0.08
0.90 0.04 0.09 0.08 0.03 -0.00 -0.03 -0.04 -0.09 -0.15
0.95 -0.15 -0.09 -0.10 -0.12 -0.13 -0.16 -0.17 -0.21 -0.24
1.00 -0.60 -0.59 -0.57 -0.56 -0.55 -0.55 -0.55 -0.54 -0.47
Notice the numbers here are higher. With 30 years of data, and a fixed distribution, we have plenty of time to get a good handle on the parameters of the optimisation resulting in better OOS performance. 

Again as before the optimum pair is identical (0.25,0) but with a more conservative distributional point we can have more shrinkage with less damage to SR. 

Right, so with all that in mind what shrinkage should we take forward to the next round? It's a toughie; (0.25,0) is the winner in every round but using so little shrinkage makes me nervous (we already know it's likely to be a poor choice on real data). Let's make an arbitrary change to the format and allow one of the close runners-up to get in as well; say (0.5, 0.4) which is within a few bp of the winner. And just because I like the name, we'll also run the EPO optimal of (0, 0.75); not quite as good but not terrible.

Let's remind ourselves who is in the final five of the random final, competing at distances from 1 year up to 30 years for both the Median Memorial Trophy, and the 5% Conservative Cup.

  • MC monte carlo (random, parametric) with instances depending on in sample length
  • BS bootstrapping (random, non parametric) with instances depending on in sample length
  • OS Optimal shrinkage (SR shrinkage=0.25, nothing on correlations)
  • EPO shrinkage (correlation shrinkage = 0.75, nothing on SR)
  • CS Cautious shrinkage (SR shrinkage = 0.5, correlation shrinkage = 0.4)
  • NMV Naive mean variance (no shrinkage on either)
  • EW Equal weights (full shrinkage on both)
That's not a final five, I hear you cry, it's a final seven! Because by law any consideration of portfolio optimisation has to include these two extreme options.

I promised you some speed figures, and here they are: all the various shrinkage methods take about 5 microseconds per optimisation; bootstrapping takes around 17.5 seconds per optimisation and monte carlo  between 5.3 and 29 seconds per optimisation (for 1 year and 30 year respectively). It took several days to complete the monte carlo for 30 years!

The results for 1 year and 30 years are pretty similar, so for brevity here are those for 30 years, median first:

BS  1.188
OS  1.188
MC  1.185
CS  1.158
EPO 1.122
NMV 1.190
EW  0.379    

30 years with 5% point now:


BS  0.318
MC  0.317
OS  0.315
NMV 0.315
CS  0.292
EPO 0.278
EW -0.4

I'd say that you can take your pick from pretty much any of these methods apart from equal weights, which with the structure we have imposed is always going to be suboptimal.

Summary and what's next

This is optimisation in highly controlled laboratory conditions with nice distributions that don't move around when you're not looking. Still we did find some interesting results:
  • We got some ballpark for MC/bootstrap convergence rates
  • Even in this context some shrinkage on the mean is optimal
  • The gold standard for weights is bootstrap/MC, which also don't require any shrinkage meta-parameters, but they are bloody slow. 
We'll take these with us on the next stage of our journey when we use some real data.

Footnote: the perfect optimiser that we can't use

Incidentally, there is another optimiser I haven't considered here which precisely suits my goal of finding the best expected probabilistic outcome. It's a grid search that considers all possible weights, and for each weight bootstraps a distribution of SR outcomes given those weights and the in sample data; and then takes the 50% or 5% of that SR distribution. Unfortunately it's very slow; and even using a coarse to fine approach I was unable to get it to run in reasonable enough time. 


Friday, 5 June 2026

The crossword puzzle of fitting - why across and then down?

This will be the first in a series of posts about portfolio optimisation. Main reason being I'm planning to write a book about backtesting, and that will include a big chunk of material on optimisation. Yes, I know, my latest book isn't out yet (it's out in December - in time for Christmas). But this backtesting book is going to be quite deep (and probably long!) so I need to start researching now if it's going to be written any time soon. Today's post is not that deep, and is quite short. It has literally been written whilst waiting for the rather extensive testing of the second post to finish. Anyway, let's begin.

One of the issues when fitting is to decide how to structure and order the process. In very abstract terms, a component of a trading strategy will consist of a forecast to predict the price of an instrument. A forecast might be something like momentum16,64 - that's the exponentially weighted moving average crossover with spans of 16 and 64 days to you my good sir or madam. An instrument is something like the US 10 year bond future. We can represent all these options in a grid like so:

             momentum16,64                momentum4,16                 carry10

US10             X                            X                           X

SP500            X                            X                           X       

US5              X                            X                           X      


.... where each 'X' is a place on the grid. If those were white squares, and if you can imagine that some forecasts were missing from certain instruments and those were black squares, then we'd have a crossword grid. Yes that's all I've got. Quite a weak link. Apologies. You think it's easy coming up with catchy blog titles?

And note that this is a tiny subset of the full grid. Instead of these 9 possibilities my full trading system currently has 10,373 options. That's 40 trading rules across 260 different instruments. Some of those instruments are duplicated (eg SP500 mini and micro), some are no longer traded; but that still leaves 204 instruments and over 8,100 options.

Anyway, it should be obvious that in doing our fitting we have a few options:

  • A joint fit where we fit everything in one go. 8,100 options. In one go. Let that sink in.
  • A natural clustering where we cluster together things that are correlated.
  • A down and then across structured clustering where we first fit within rules - so for example working out what the best blend of US10, SP500 and US5 is within the carry10 rule - and then across rules - so estimating the best blend of carry10, momentum16,64 and momentum4,16.
  • An across and then down structured clustering where we first fit within instruments - so for example working out what the best blend of carry10, momentum16,64 and momentum4,16 is within US10 - and then across instruments - so estimating the best blend of US10, SP500 and US5.

Now I've generally done the final one of these four options: across and down. And it's quite a natural way of doing things; because we separate out the idea of predicting the price of a given instrument, and then put together a portfolio of trading substrategies, one per instrument. But I've never actually tested the assumption that this is the right thing to do. In particular, if we were to do a natural clustering, would it come out quite like across and down; or would it be something weird? Or would, for example, all the momentum rules be more correlated with each other irrespective of instrument, in which case down and across would make more sense?

Programming note: I've done this type of clustering before; here for underlying instrument returns, and here for forecasts.

Incidentally, I've got three different approaches in the post; as I iteratively updated trying different things each time. TLDR: one of these approaches was inconclusive, the other two were firmly in favour of sticking to 'across then down'.


FIRST ATTEMPT: ONE GIANT CLUSTER

(This was the original blogpost)

Before I begin, I did decide to limit the analysis to the last 10 years. It's kind of slow just calculating a correlation matrix from 52 years of data and many instruments don't have data except for the last 10 years anyway. I also resampled the returns to a weekly frequency. This is what I do when optimising anyway. Unless your returns are very quick, this won't affect correlations much. That leaves us with 520 weeks (I know that isn't exactly 10 years. Let me check to see if I give a toss. No, I don't) or rows in a dataframe, with over 8,100 columns remember. That's about the limit as to what my laptop can calculate a correlation for; and it's quite a painful process to cluster these bad boys as well.
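For anyone who wants to try this at home, here is a minimal sketch of the pipeline described above: resample to weekly, keep the last 520 rows, calculate correlations, and then cut a hierarchical clustering into N groups. The function names are hypothetical, and the choice of 1 - correlation as the distance, with average linkage, is an assumption for illustration rather than a description of my production code.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def weekly_correlation(daily_returns: pd.DataFrame, weeks: int = 520) -> pd.DataFrame:
    """Resample daily returns to weekly and correlate the last `weeks` rows."""
    # Sum daily returns within each week; min_count=1 keeps weeks with no
    # data at all as NaN instead of turning them into zero returns.
    weekly = daily_returns.resample("W").sum(min_count=1)
    return weekly.tail(weeks).corr()


def cluster_labels(corr: pd.DataFrame, n_clusters: int) -> pd.Series:
    """Assign each column to one of at most `n_clusters` clusters."""
    # Assumption: distance is 1 - correlation; missing correlations count as 0.
    distance = 1.0 - corr.fillna(0.0).to_numpy()
    distance = (distance + distance.T) / 2.0  # enforce exact symmetry
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=corr.columns)
```

With one column per rule/instrument pairing, `cluster_labels(weekly_correlation(returns), 2)` would give the membership for a two cluster plot.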

Anyway let's begin. I've come up with quite a fun way to visualise these clusters which you can see here for the first two cluster plot:

OK as you can see each cluster has a subplot. Each subplot has two stacked bars. The lefthand side shows the composition by asset class. The righthand side shows the composition by trading rule. You can just about make out from the legend what the various colours mean. Note these aren't portfolio weights, and just reflect the number of instrument/rule combos in each category. 

If your eyes are very good you will see the tops of the left hand bars don't quite reach 1. This is because I've removed anything with less than 2% of the total from the plots for clarity. Mainly this is a few odd instruments in a few very small asset classes (like volatility). It doesn't affect trading rules, since even the mrinasset rule has 1/40 = 2.5% of the total count.
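The numbers behind each stacked bar are just counts. As a rough sketch, assuming each cluster member is a (rule, instrument) tuple and that asset classes come from a lookup dict (both of which are hypothetical names for illustration), the composition with the 2% cut off could be calculated like this:

```python
from collections import Counter


def composition(members, key, min_share=0.02):
    """Share of cluster members in each category, dropping small categories.

    members: list of (rule, instrument) tuples in one cluster
    key: function mapping a member to its category (rule or asset class)
    min_share: categories with a share below this are left out of the plot
    """
    counts = Counter(key(member) for member in members)
    total = len(members)
    return {
        category: count / total
        for category, count in counts.items()
        if count / total >= min_share
    }
```

Because small categories are dropped, the shares for a bar can add up to slightly less than 1, which is why the left hand bars fall a little short.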

The important thing here is we aren't yet seeing any evidence of clustering either by instrument or by trading rule. If we were, there would be a preponderance of colours on one side or the other. For example, if correlations were higher amongst trading rules even from different instruments, then one cluster would have a lot of blue and purple in its right hand bar (colours I assign to more divergent rules like momentum), and the other would have yellow and orange in its right hand bar (colours that are reserved for convergent rules).

At this stage then there is no evidence that rows or columns makes more sense. 
Here is the three cluster plot. All three clusters look very similar for both instruments and rules, again suggesting there isn't much going on here yet on either axis.

I'm going to skip through the next few plots, since none of them show anything interesting. Let's check in at N=16:

Still not much going on here! We aren't seeing clumps of colours developing for either bar.

N=36:

N=49:

Never have so many coloured pixels been wasted for so little result!

Let's close out with N=64, a nice power of 2 to finish on, as we're reaching the point where the plots are getting so small they're impossible to see:


There you have it folks. No clear evidence for either across -> down, or for down -> across. 

Now, there are a few alternative conclusions we could draw here. One is that there is some weird deep correlation pattern that the simple analysis by asset class and trading rule doesn't pick up. I don't buy that. If for example we look at the final cluster from N=64, it looks like this:

[assettrend4 forecasting AEX, breakout80 forecasting AEX, momentum32 forecasting ALUMINIUM, breakout160 forecasting AUD_micro, assettrend16 forecasting BITCOIN, normmom64 forecasting BITCOIN, accel64 forecasting BONO, carry125 forecasting BOVESPA, momentum64 forecasting BOVESPA, breakout80 forecasting BRE, normmom4 forecasting BRENT_W, accel32 forecasting BTP, momentum8 forecasting BTP3, accel16 forecasting BUTTER, assettrend16 forecasting BUTTER, carry125 forecasting CAD, assettrend32 forecasting CAD10, relcarry forecasting CAD2, breakout20 forecasting COAL, breakout40 forecasting COPPER_LME, relmomentum40 forecasting COPPER_LME, relmomentum20 forecasting COTTON, breakout320 forecasting DOW, assettrend8 forecasting EU-AUTO, relmomentum40 forecasting EU-CHEM, assettrend2 forecasting EU-CHEM, normmom2 forecasting EU-CHEM, relmomentum80 forecasting EU-DIV30, relmomentum10 forecasting EU-DJ-TELECOM, relmomentum20 forecasting EU-HOUSE, relmomentum20 forecasting EU-INSURE, carry10 forecasting EU-MEDIA, accel32 forecasting EU-OIL, assettrend2 forecasting EU-TECH, carry10 forecasting EU-TECH, breakout20 forecasting EURAUD, momentum8 forecasting EURCAD, carry30 forecasting EURCHF, carry60 forecasting EURCHF, normmom8 forecasting EURIBOR-ICE, mrinasset1000 forecasting EUROSTX, momentum4 forecasting EUROSTX-SMALL, normmom16 forecasting EUR_micro, assettrend32 forecasting FANG, mrinasset1000 forecasting FED, assettrend4 forecasting FED, normmom32 forecasting FEEDCOW, carry60 forecasting FEEDCOW, accel16 forecasting FTSECHINAH, breakout20 forecasting FTSEINDO, relmomentum20 forecasting FTSEINDO, breakout10 forecasting FTSETAIWAN, normmom8 forecasting FTSEVIET, breakout80 forecasting FTSEVIET, breakout40 forecasting GASOIL, relmomentum10 forecasting GAS_US_mini, momentum8 forecasting GAS_US_mini, mrinasset1000 forecasting GICS, accel32 forecasting HANGENT_mini, skewabs365 forecasting HANGENT_mini, carry30 forecasting HOUSE-US, momentum8 forecasting HOUSE-US, breakout320 
forecasting HOUSE-US, assettrend2 forecasting IBEX_mini, relmomentum20 forecasting INR, normmom2 forecasting INR, normmom64 forecasting IRON, accel64 forecasting JGB, carry10 forecasting JGB-SGX-mini, normmom16 forecasting JGB-SGX-mini, assettrend4 forecasting KOSPI_mini, mrinasset1000 forecasting KR10, normmom32 forecasting KR3, momentum4 forecasting KR3, relmomentum40 forecasting KRWUSD_mini, normmom64 forecasting LEAD_LME, carry125 forecasting LEAD_LME, breakout10 forecasting LEANHOG, relmomentum40 forecasting LIVECOW, relcarry forecasting LIVECOW, breakout160 forecasting MILKWET, assettrend2 forecasting MSCIEAFA, skewabs180 forecasting MSCIEMASIA, relmomentum20 forecasting MSCIEMASIA, skewrv365 forecasting MSCIEMASIA, breakout160 forecasting MUMMY, normmom16 forecasting OAT, breakout80 forecasting OJ, skewrv180 forecasting PALLAD, normmom2 forecasting PALLAD, normmom64 forecasting PLAT, normmom16 forecasting RUSSELL, normmom2 forecasting SEK, skewrv365 forecasting SHATZ, normmom64 forecasting SHATZ, accel16 forecasting SHATZ, carry30 forecasting SHATZ, breakout10 forecasting SILVER, assettrend8 forecasting SMI, breakout40 forecasting SMI-MID, momentum16 forecasting SMI-MID, momentum64 forecasting SOFR, carry30 forecasting SOYBEAN_mini, normmom2 forecasting SOYBEAN_mini, skewabs365 forecasting SOYBEAN_mini, skewrv365 forecasting SP500_micro, skewabs365 forecasting STEEL, skewrv365 forecasting SUGAR16, relmomentum80 forecasting TIN_LME, carry10 forecasting TIN_LME, relmomentum40 forecasting TWD, normmom8 forecasting US-ENERGY, carry30 forecasting US-MATERIAL, mrinasset1000 forecasting US10U, carry10 forecasting US2, momentum32 forecasting US30, breakout40 forecasting US30, relcarry forecasting US30, skewrv180 forecasting US5, momentum64 forecasting V2X, carry125 forecasting VIX_mini, breakout10 forecasting VIX_mini, skewabs180 forecasting VNKI, normmom4 forecasting WHEAT, relmomentum10 forecasting YENEUR, assettrend4 forecasting ZAR, accel64 forecasting ZAR]

I look forward to anyone who can give me a coherent story as to why those things are lumped together. 

Personally I'm taking the absence of any contradictory evidence as evidence that I should continue to do what I've done before: fit across and down. Doing some kind of all group clustering or all in one fit, or doing down and then across; none of these seem to offer clear advantages. So why not stick with a simple thing that works?

Note - there is no reason why in theory you could not do an 'all in one' fit or weird clustering, and still use the procedure of generating a combined forecast for an instrument, then a subsystem position for an instrument, and then forming a portfolio. You just take the weights for each rule/instrument pairing from your all in one or weird cluster fit, and then take them across each instrument for forecast weights, and then add up the weights for a given instrument to find instrument weights.
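A minimal sketch of that conversion, assuming the joint fit hands back a dict of weights keyed by (rule, instrument), and assuming the forecast weights within each instrument are renormalised to add up to 1 (the function name and the input format are hypothetical):

```python
from collections import defaultdict


def split_joint_weights(joint_weights):
    """Turn weights per (rule, instrument) pair into forecast and instrument weights.

    joint_weights: dict mapping (rule, instrument) -> weight from a joint fit
    Returns (forecast_weights, instrument_weights), where forecast_weights is
    a dict of dicts: instrument -> rule -> weight, summing to 1 per instrument.
    """
    # Instrument weight: add up every pair weight for that instrument.
    instrument_weights = defaultdict(float)
    for (rule, instrument), weight in joint_weights.items():
        instrument_weights[instrument] += weight

    # Forecast weight: each pair's share of its instrument's total.
    forecast_weights = defaultdict(dict)
    for (rule, instrument), weight in joint_weights.items():
        total = instrument_weights[instrument]
        if total > 0:
            forecast_weights[instrument][rule] = weight / total

    return dict(forecast_weights), dict(instrument_weights)
```

So a joint fit that gives US10 carry10 a weight of 0.1 and US10 momentum16 a weight of 0.3 would translate into an instrument weight of 0.4 for US10, with forecast weights of 25% and 75%.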

Arguably this has been a waste of time, but the good news is I can recycle this code to visualise forecast weights across a strategy so that's something....


SECOND ATTEMPT: USE AVERAGE CORRELATIONS

As I was posting this up I thought of a much simpler test: if I measure the average correlation of forecasts within instruments (rule/instrument pairings for the same instrument) then it is 0.29. However the average correlation of forecasts within rules (eg rule/instrument pairings with the same rule) is much lower: 0.05. That reinforces that across and down is indeed the way to go.
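Here is a rough sketch of that calculation, assuming the columns are labelled with (rule, instrument) tuples and the correlation matrix is a plain square nested list in the same order. The function name is hypothetical, and a pure Python loop like this would be slow on the full 8,100 column matrix; it's only meant to make the definition precise.

```python
from itertools import combinations


def average_correlation_within(names, corr, position):
    """Average pairwise correlation between columns that share a label.

    names: list of (rule, instrument) tuples, one per column
    corr: square correlation matrix as a nested list, same order as names
    position: 0 to group by rule, 1 to group by instrument
    """
    values = [
        corr[i][j]
        for i, j in combinations(range(len(names)), 2)
        if names[i][position] == names[j][position]
    ]
    return sum(values) / len(values) if values else float("nan")
```

Calling it with `position=1` gives the within instrument figure, and `position=0` gives the within rule figure.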


THIRD ATTEMPT: TRY WITH SMALLER UNIVERSE

Arguably we might not see the pattern we required until we had hundreds of clusters; equal to the number of instruments in the universe. 

So I decided to do a 'small sample' approach. If I were to pick say 5 arbitrary instruments, and 5 arbitrary trading rules, what is the likelihood that these 25 components would cluster into instrument groups rather than rule groups; or nothing at all?

To make this a proper test I need to repeat the random choice of instruments and trading rules many times over. That means I can't just eyeball the charts each time, looking for a preponderance of colours on one side or the other. This is both timeconsuming and subjective.
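The repeated random draw might look something like the sketch below. Both `cluster_fn` (which groups the components into clusters) and `classify_fn` (which labels a set of clusters without any eyeballing) are hypothetical callables that you would need to supply; this only shows the sampling and tallying.

```python
import random
from collections import Counter


def repeat_small_universe_test(
    instruments, rules, cluster_fn, classify_fn, n_trials=1000, n_pick=5, seed=None
):
    """Proportion of random small universes that fall into each category.

    instruments, rules: lists of names to sample from
    cluster_fn(components, n_clusters) -> list of clusters (lists of components)
    classify_fn(clusters) -> a label for how the clusters have grouped
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_trials):
        chosen_instruments = rng.sample(instruments, n_pick)
        chosen_rules = rng.sample(rules, n_pick)
        # Every rule on every chosen instrument: n_pick * n_pick components.
        components = [
            (rule, instrument)
            for instrument in chosen_instruments
            for rule in chosen_rules
        ]
        clusters = cluster_fn(components, n_pick)
        tally[classify_fn(clusters)] += 1
    return {label: count / n_trials for label, count in tally.items()}
```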

Let's come up with a systematic rule. Seems on brand for this blog, doesn't it? Given 5 clusters, we consider them to have formed into instrument groups if 50% or more of the weight in a cluster is for a single instrument, and this has happened in 3 or more of the clusters. Alternatively, we consider them to be rule groups if 50% or more of the weight in a cluster is for a single rule, and this has happened in 3 or more of the clusters. Otherwise we consider them to be ungrouped (and that would be the case for all the charts above). Also any cluster with only one or two components is excluded from the calculation.
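Written out as code, the rule could look like the sketch below. It assumes each cluster is a list of (rule, instrument) tuples and that 'weight' just means an equal count per component; the function names and the four output labels are made up for illustration.

```python
from collections import Counter


def dominated(cluster, position, threshold=0.5):
    """True if a single rule (position 0) or instrument (position 1)
    makes up at least `threshold` of the cluster's components."""
    counts = Counter(member[position] for member in cluster)
    return max(counts.values()) / len(cluster) >= threshold


def classify(clusters, min_size=3, min_clusters=3):
    """Label a set of clusters as 'instrument', 'rule', 'both' or 'neither'."""
    # Clusters with only one or two components are excluded.
    valid = [cluster for cluster in clusters if len(cluster) >= min_size]
    rule_count = sum(dominated(cluster, 0) for cluster in valid)
    instrument_count = sum(dominated(cluster, 1) for cluster in valid)
    is_rule = rule_count >= min_clusters
    is_instrument = instrument_count >= min_clusters
    if is_rule and is_instrument:
        return "both"
    if is_rule:
        return "rule"
    if is_instrument:
        return "instrument"
    return "neither"
```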

I do want to show you an example, since I don't want to waste my nice pictures. Here is what might be a particularly extreme example since it includes only US bonds (that are somewhat correlated): US2, US5, US10, US20, US30; and a smattering of trading rules (less likely to be correlated):

Cluster#1: [breakout40 forecasting US2, breakout40 forecasting US20, breakout40 forecasting US30, breakout40 forecasting US10, breakout40 forecasting US5] 

100% in single rule: meets rule grouping criteria

Cluster#2: [momentum64 forecasting US30, momentum64 forecasting US5, momentum64 forecasting US20, momentum64 forecasting US10, momentum64 forecasting US2]

100% in single rule: meets rule grouping criteria

Cluster#3: [skewrv365 forecasting US2, skewrv365 forecasting US5, skewrv365 forecasting US20, skewrv365 forecasting US10, skewrv365 forecasting US30]

100% in single rule: meets rule grouping criteria

Cluster#4: [carry10 forecasting US10, carry10 forecasting US20, relcarry forecasting US20]

66.7% in single rule: meets rule grouping criteria; 66.7% in single instrument: meets instrument grouping criteria

Cluster#5: [carry10 forecasting US2, relcarry forecasting US2, relcarry forecasting US5, relcarry forecasting US30, relcarry forecasting US10, carry10 forecasting US30, carry10 forecasting US5]

57% in single rule: meets rule grouping criteria

Since 5/5 meet the rule group criteria (more than half), and only 1/5 meets the instrument group criteria, this is a rule group clustering.

Here is a more random selection:

['HEATOIL', 'SMI-MID', 'RUSSELL', 'BOBL', 'EUA']
['skewrv365', 'momentum64', 'carry10', 'relcarry', 'breakout40']

Which clusters as follows:

Cluster#1 [carry10 forecasting RUSSELL, relcarry forecasting RUSSELL, carry10 forecasting SMI-MID, breakout40 forecasting SMI-MID, relcarry forecasting SMI-MID, momentum64 forecasting SMI-MID, breakout40 forecasting RUSSELL, momentum64 forecasting RUSSELL] - this is actually two instrument clusters with exactly 50% in each

Cluster#2 [skewrv365 forecasting SMI-MID, carry10 forecasting EUA, relcarry forecasting EUA, skewrv365 forecasting EUA] - meets both criteria

Cluster#3 [carry10 forecasting HEATOIL, breakout40 forecasting HEATOIL, relcarry forecasting HEATOIL, momentum64 forecasting HEATOIL] - Heating oil cluster

Cluster#4 [breakout40 forecasting EUA, momentum64 forecasting EUA] - ignored, only 2 components.

Cluster#5 [carry10 forecasting BOBL, breakout40 forecasting BOBL, momentum64 forecasting BOBL, skewrv365 forecasting BOBL, relcarry forecasting BOBL, skewrv365 forecasting HEATOIL, skewrv365 forecasting RUSSELL] mostly BOBL


Since 3/4 valid clusters meet the 50% instrument threshold, and only one meets the 50% rule threshold, this would be a case where we would cluster by instruments most logically.

Anyway I repeated this exercise a few thousand times, and here are the results as a proportion of the total:

Meets neither criteria: 3.8%
Meets both criteria: 0.81%
Meets rule grouping criteria: 0.11%
Meets instrument grouping criteria: 95.3%

That seems pretty conclusive.