Friday, 3 July 2026

Jumping back in the pool(ing): testing pooling by asset class and portfolio weight distance

This is post #10 in my 2026 series on portfolio optimisation. Time for a quick recap. I'm not going to revisit every post but instead summarise what I now think one should be doing when optimising forecast weights before costs (I haven't yet incorporated costs, nor thought about instrument weights).

(I also confirmed in my very first post it was better to estimate forecast then instrument weights, rather than doing them jointly).

That doesn't seem like much value for the thousands of words I've written, and it's also not a million miles from what I would have done without all this research. A few things haven't worked out: random based methods (Bayesian and Monte Carlo), which don't account for the reduced predictability of real returns compared to synthetic data; formal structural breaks on estimates; grouping and pooling instruments according to forecast SR; shorter EWM windows for SR estimates; and shrinking weights rather than inputs.

I should probably move on now to looking at costs, and instrument returns, but like a dog with a particularly tasty bone or a cat pulling on an especially interesting piece of string, I can't quite let go of the idea that we should be able to improve on pooling everything.


Prior art

Let's run through the options we potentially have for pooling:

  1. We could cluster things together that have similar characteristics, such as by asset class. 
  2. We could do it on an estimate by estimate basis. We could compare the distribution of returns for say carry10 on US10 year bonds, and on US2 year bonds; and say "Well these distributions aren't significantly different. Let's pool the returns together". 
  3. Or we could look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments.  I covered this in my previous post in this series. 
  4. Or we could do it on portfolio weights. We could for example fit the weights for 2 year bonds, and for 10 year bonds, and then see if they were significantly different. We could then pool the returns if they were not that different. I have also looked at this before.
  5. We don't pool at all, and fit each instrument individually. That sounds terrifying, but remember we're shrinking with our fitting.
  6. We pool everything. So far that seems to be the best option, and the one I've used in the past.

Note that we also have the option of:

  • A pooling the returns before estimating the statistics and then the weights
  • B not pooling the returns, and then pooling the weights
I'm not keen on B because it produces 'over robustness' when combined with a shrinkage methodology. Basically we throw away too much information and end up too close to equal weights.

So returning to the numbered list:

  1. By asset classes - is untried, although it resembles what we used to do at AHL when we were organised into asset class teams, each of which fitted their own strategies.
  2. Grouping per estimate: I have objections to this in terms of computational time and statistical unpleasantness, discussed in the previous post.
  3. Grouping per vector of estimates: I tried this in the previous post. It wasn't effective, and also produced weird undesirable groups.
  4. Grouping by weights: I have tried before in a limited test with some success.

So that leaves us with 1 and 4 as candidates, along with the standard options of #6 full pooling and #5 no pooling at all - fitting each instrument's forecast weights purely on its own data:

  • Unpooled
  • All instruments pooled
  • Asset class pooled
  • Grouping by portfolio weights


By asset classes - method

This is pretty trivial; the eight asset classes in my system are:

  • Stock indices (58 instruments in my dataset including duplicates and expired instruments like Eurodollar to avoid survivorship bias)
  • Sector stocks (eg 'EU oil companies') 36 instruments
  • Vol 4
  • FX 43
  • Bonds and STIR 39
  • Energies 20
  • Agricultural 39
  • Metals 21 (includes two crypto futures)

So for a given portfolio we fit those instruments in the same asset class together.
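
Here is a minimal sketch of what "fit those instruments in the same asset class together" could look like in pandas. The function name, the tiny `asset_class` mapping and the random returns are all made up for illustration, and stacking the rows while ignoring dates is a simplifying assumption rather than a description of the code behind the results in this post.

```python
import numpy as np
import pandas as pd

def pool_returns_by_asset_class(returns_by_instrument, asset_class):
    """Stack the per-rule returns of all instruments in each asset class."""
    grouped = {}
    for instrument, returns in returns_by_instrument.items():
        grouped.setdefault(asset_class[instrument], []).append(returns)
    # Stack rows: each instrument contributes extra observations for every rule
    return {
        name: pd.concat(frames, axis=0, ignore_index=True)
        for name, frames in grouped.items()
    }

# Hypothetical data: 100 days of returns for two rules on three instruments
rng = np.random.default_rng(0)
rules = ["carry10", "momentum16"]
returns_by_instrument = {
    name: pd.DataFrame(rng.normal(size=(100, 2)), columns=rules)
    for name in ["US10", "US2", "SP500_micro"]
}
asset_class = {"US10": "Bond", "US2": "Bond", "SP500_micro": "Equity"}

pooled = pool_returns_by_asset_class(returns_by_instrument, asset_class)
print({name: frame.shape for name, frame in pooled.items()})
```

Each instrument would then be optimised using the pooled returns for its own asset class.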


Grouping by weights

Well this is easy, as I already did this here and under the heading "get instrument groupings" it tells us we can use k-means clustering and there is even some code there for me to copy and paste. An important difference between this and the grouping by SR vector is that correlations will also be taken into account, at least in an implicit way.

One open question that remains is whether the grouping is done on portfolio weights that have been derived using a shrinkage method, or on weights that haven't (just using naive mean variance). I felt it was better to use 'purer' weights which hadn't used shrinkage so we don't end up discarding useful differences.
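
To make the idea concrete, here is a bare bones sketch of grouping instruments by their forecast weights with k-means. It is not the code from the linked post: it is a plain numpy version of Lloyd's algorithm, the function name `kmeans` is my own label, and the weights are invented (in practice they would be the unshrunk mean variance weights for each instrument).

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each row of points."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen rows as the initial centres
    centres = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, then pick the nearest
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):  # leave the centre of an empty cluster alone
                centres[k] = points[labels == k].mean(axis=0)
    return labels

# Hypothetical forecast weights: one row per instrument, one column per rule
weights = np.array([
    [0.60, 0.30, 0.10],
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.30, 0.60],
    [0.15, 0.25, 0.60],
])
labels = kmeans(weights, n_clusters=2)
print(labels)
```

Instruments sharing a label would have their returns pooled before the final (shrunk) optimisation.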

As I did in my prior post in this series, let's run the grouping exercise on my entire portfolio. Partly for laughs, and partly to see if the grouping makes sense. How many groups/clusters should we use? Well there are 7 substantive asset classes, excluding vol, so let's use 7 clusters:

Cluster 0, length 16

BTP3, CANOLA, EU-FOOD, FED, GBPCHF, GOLD_micro, HIGHYIELD, LIVECOW, OMX, R1000, SILVER, SP500_micro, US-INDUSTRY, WHEAT, YENEUR, ZAR

Ags: 3, Metals: 2, Equity: 3, FX: 3, Sector: 2, Bond: 3


Cluster 1, length 29

BRENT_W, CNH, COAL-GEORDIE, COCOA, COFFEE, COTTON, ETHER-micro, EU-AUTO, EU-DJ-TELECOM, EU-DJ-UTIL, EU-MEDIA, EU-REALESTATE, EU-TECH, FTSETAIWAN, GBP, HEATOIL, KOSPI_mini, MILK, MSCIEAFA, MSCISING, OATIES, OJ, RUBBER, SEK, SMI, SONIA3, US-DISCRETE, US2, VNKI

OilGas: 3, Ags: 7, Metals: 1, Equity: 5, FX: 3, Sector: 7, Vol: 1, Bond: 2


Cluster 2, length 20

CAD10, CH10, CHF, CHFJPY, COPPER-micro, CZK, FTSECHINAA, FTSEINDO, IRON, JGB, JGB-SGX-mini, JP-REALESTATE, MUMMY, NIKKEI, SGX, SOYBEAN_mini, SOYOIL, TOPIX, US-ENERGY, US-HEALTH

Ags: 2, Metals: 2, Equity: 6, FX: 3, Sector: 3, Bond: 4


Cluster 3, length 12

AUD_micro, EU-INSURE, FTSE100, FTSECHINAH, GASOIL, HANG_mini, HOUSE-US, NASDAQ_micro, NOK, NZD, US-STAPLES, US-TECH

Sector: 4, Equity: 4, FX: 3, OilGas: 1


Cluster 4, length 16

ALUMINIUM, AUDJPY, BITCOIN, BOBL, BONO, BUND, BUXL, CORN, DOW, GBPJPY, MILKDRY, OAT, RICE, ROBUSTA, SHATZ, STEEL

Ags: 4, Metals: 3, Equity: 1, FX: 2, Bond: 6


Cluster 5, length 79

BB3M, BBCOMM, BRE, BTP, BUTTER, CAD, CHEESE, CHINAA-CON, COAL, COCOA_LDN, COPPER_LME, COTTON2, CRUDE_ICE, CRUDE_W_micro, DJSTX-SMALL, DX, EU-BANKS, EU-CHEM, EU-CONSTRUCTION, EU-MID, EU-OIL, EU-TRAVEL, EURCAD, EURCHF, EURIBOR-ICE, EUROSTX, EUROSTX-SMALL, EUR_micro, FANG, FEEDCOW, FTSE250, GAS-PEN, GASOILINE, GAS_US_mini, GICS, GILT, HANGENT_mini, IBEX_mini, IG, INR, IRS, JPY, KOSDAQ, KR10, KR3, LEAD_LME, LEANHOG, LUMBER-new, MIB, MILKWET, MILLWHEAT, MSCIEMASIA, MSCITAIWAN, MSCIWORLD, MXP, NICKEL_LME, PALLAD, PLAT, REDWHEAT, SARONA, SGD, SMI-MID, SOFR, SOYMEAL, SP400, SPI200, SUGAR11, SUGAR16, SUGAR_WHITE, TIN_LME, TWD, US10, US20, US5, V2X, VIX_mini, WHEAT_ICE, WHEY, ZINC_LME

OilGas: 6, Ags: 18, Metals: 7, Equity: 16, FX: 12, Sector: 6, Vol: 2, Bond: 12


Cluster 6, length 32

AEX, BOVESPA, BRENT-LAST, CAC, CAD2, CAD5, CLP, DAX, EU-BASIC, EU-DIV30, EU-HEALTH, EU-HOUSE, EU-RETAIL, EUA, EURAUD, EURO600, FTSEVIET, GBPEUR, HANGTECH, KRWUSD_mini, MSCIASIA, PLN, RUSSELL, SWISSLEAD, US-FINANCE, US-MATERIAL, US-PROPERTY, US-REALESTATE, US-UTILS, US10U, US3, US30

OilGas: 2, Equity: 11, FX: 5, Sector: 9, Bond: 5

There doesn't seem much congruency with asset classes there. Is this "the data speaking to us", or are we just data mining with a very sharp spade? Let's find out.


Testing

In my older post, here, I did a rather simplistic 'one shot' test on a subset of my available instruments and forecast rules (albeit on a rolling out of sample basis). But I have a rather more exhaustive way of doing things I've been using in this series. 

I cycle through different lengths of in sample (5 years, 10 years, 20 years) and out of sample (1 year and 5 years) periods. For shorter time periods that will allow me to subsample different historic periods. For speed, and to get some alternative paths, I'm not going to consider all the instruments. Instead I will randomly subsample 50 instruments out of the 214 available.

Then for a given set of returns I will either use fully pooled, asset class pooled, portfolio weight pooled, or unpooled returns. Then I will optimise for each instrument based on the relevant returns, using the shrinkage method with SR shrinkage of 0.5 and correlation shrinkage of 0.75. Finally I will take the out of sample SR of the portfolio that equally weights the 50 instruments.

In previous posts I've discussed a more honest way of backtesting, where we include the opposite of a given trading rule to avoid implicit fitting; and then only bring positive SR rules into the optimisation. All the results here will use that methodology exclusively.
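
As a rough illustration of that methodology: for each rule we look at the in sample SR, and if it is negative we trade the opposite of the rule instead. The sketch below uses made up returns and a hypothetical function name; the key point is that the signs are chosen in sample and would then be applied unchanged out of sample.

```python
import numpy as np
import pandas as pd

def keep_positive_sr_versions(in_sample_returns):
    """Return +1 or -1 per rule; -1 means trade the opposite of the rule."""
    in_sample_sr = in_sample_returns.mean() / in_sample_returns.std()
    # A rule with exactly zero SR is left as it is
    return np.sign(in_sample_sr).replace(0.0, 1.0)

# Hypothetical in sample returns: one rule that works, one that doesn't
rng = np.random.default_rng(1)
in_sample = pd.DataFrame({
    "good_rule": rng.normal(0.2, 1.0, size=2000),
    "bad_rule": rng.normal(-0.2, 1.0, size=2000),
})

signs = keep_positive_sr_versions(in_sample)
flipped = in_sample * signs  # every column now has a positive in sample SR
print(signs)
```

Without costs the flipped rule has the same SR as the original with the sign reversed, so nothing else needs recalculating.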


5 years in sample, 1 year out of sample

You should hopefully recognise this format from before. Each row is a fitting option. The first column shows the median SR across the many, many runs of random resampling. The second column shows the t-test p-value from comparing the best option with the others. NaN means this is the best option. A low number in this column, say below 0.01 or 0.05, indicates that the best option is statistically significantly better than the other option.

                            SR  pvalue
unpooled                 0.046     0.0
all pooled               0.425     0.0
asset class pooled       0.557     NaN
weight distanced pooled  0.184     0.0

That is ... pleasing. The least robust method is the worst. More robust methods do better. And we get a significant improvement from pooling within asset classes. OK the portfolio weight distancing isn't so good, but we haven't got huge amounts of data to form our portfolio weights with so maybe they are a little unstable.


5 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.418     0.0
all pooled               0.391     0.0
asset class pooled       0.731     NaN
weight distanced pooled  0.376     0.0

Unpooled does a little better here, but asset classes are still the way to go.

10 years in sample, 1 year out of sample

                            SR  pvalue
unpooled                 0.664     NaN
all pooled               0.417   0.000
asset class pooled       0.635   0.127
weight distanced pooled  0.415   0.000

OK interestingly unpooled is making a comeback, but it still isn't significantly better than asset class pooled.

10 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.855     0.0
all pooled               0.575     0.0
asset class pooled       1.028     NaN
weight distanced pooled  0.559     0.0

Asset class is again asserting its dominance with unpooled a close second.

20 years in sample, 1 year out of sample

                            SR  pvalue
unpooled                 0.571     0.0
all pooled               0.469     0.0
asset class pooled       1.127     NaN
weight distanced pooled  0.410     0.0

OK this is getting a bit silly. I feel like the dad whose kid at sports day is winning everything, proud but also getting a little embarrassed. "Now come on Jonny, let one of the other kids win the next one".

Interestingly it does seem with more data that unpooled is the way to go for a second option.


20 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.251     0.0
all pooled               0.569     NaN
asset class pooled       0.551     0.0
weight distanced pooled  0.553     0.0

"Well done Jonny. Everyone knows you could have won it if you wanted but it's good to show good sportsmanship."

So all pooled finally gets its day in the sun, albeit with a slim advantage over the other two pooled methods. Bear in mind only 65 instruments have sufficient history here, with only 18 having two distinct blocks of 25 years, so there won't be much genuine variation if we choose 50. So this could be a fluke. Jonny will tell you that it is.


But Rob, What about averaging?

At this point, given the choice between the complexity of weight distancing, and the simplicity and efficiency of asset class pooling; I'm inclined to go with the latter. And it's what we were doing at AHL all those years ago (not because of empirical evidence but because it suited the organisational structure...).

However there is another option which I talked about in the original asset class pooling post, using a blend. Here we take an average of the portfolio weights selected with different methodologies. So that would be an average of:

  • Unpooled
  • All pooled
  • Asset class pooled
Blending weights in this way is a way to improve robustness. It's arguably the correct thing to do, since otherwise we'd be making an in sample choice of methodology - 'meta implicit fitting' if you will. One of my favourite research shops, ReSolve Asset Management, are very keen on doing this. One potential downside is it might be producing 'over robustness' given we're using weights that have already had shrinkage. But let's find out.
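
Mechanically the blend is nothing more than a simple average of the forecast weights produced by each method. A toy sketch for one instrument, with invented rule names and weights:

```python
import pandas as pd

# Hypothetical forecast weights for one instrument under each pooling method
weights_by_method = pd.DataFrame({
    "unpooled": {"carry10": 0.50, "momentum16": 0.30, "breakout20": 0.20},
    "all pooled": {"carry10": 0.30, "momentum16": 0.40, "breakout20": 0.30},
    "asset class pooled": {"carry10": 0.40, "momentum16": 0.35, "breakout20": 0.25},
})

# Equal weighted average across methods, renormalised so the weights sum to one
average_weights = weights_by_method.mean(axis=1)
average_weights = average_weights / average_weights.sum()
print(average_weights)
```

Because each set of weights has already been shrunk towards equal weights, the average can only be closer to equal weights still, which is where the 'over robustness' worry comes from.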


5 years in sample, 1 year out of sample

Note these numbers won't be exactly the same as those above, since they're a different set of random experiments. They would eventually converge but it would take millions of runs.

And also, just for fun, I've added an extra column. I started off this series of posts talking about the importance of considering other points of the distribution but I've quietly dropped that and only been quoting the median. For balance then, I've added the 25% SR point as well as the median. The pvalue is as before.

                    SR median  SR 25%  pvalue
unpooled                0.025  -0.569     0.0
all pooled              0.475  -0.147     0.0
asset class pooled      0.620   0.045     0.0
average                 0.650  -0.067     NaN

Averaging is the winner - just - but asset class is better at the more conservative point.

5 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.428   0.164     0.0
all pooled              0.430   0.243     0.0
asset class pooled      0.771   0.537     NaN
average                 0.664   0.430     0.0

A clear win for asset class pooled here. Averaging suffers from its association with the weaker unpooled / all pooled.

10 years in sample, 1 year out of sample

                    SR median  SR 25%  pvalue
unpooled                0.681   0.034     0.0
all pooled              0.452   0.067     0.0
asset class pooled      0.664   0.167     0.0
average                 0.935   0.256     NaN

This time averaging takes the win, helped by the good performance of unpooled.

10 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.887   0.641     0.0
all pooled              0.570   0.406     0.0
asset class pooled      1.066   0.833     NaN
average                 1.005   0.780     0.0

Asset class pooled is still the winner, but averaging does a good job.

20 years in sample, 1 year out of sample

                    SR median  SR 25%  pvalue
unpooled                0.538  -0.209     0.0
all pooled              0.515   0.154     0.0
asset class pooled      1.167   0.539     NaN
average                 0.809   0.252     0.0

Asset class pooled wins by more of a margin now.

20 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.243   0.084     0.0
all pooled              0.531   0.414     NaN
asset class pooled      0.513   0.391     0.0
average                 0.461   0.355     0.0

As before 'all pooled' is the winner, whilst average is dragged down by the poor performance of unpooled. But as I said above, with these longer periods it's hard to know if it's just down to flukey instrument selection.

What to do...

There is enough evidence above to justify asset class pooling as the dominant choice. But equally, I don't think there is enough to discard averaging. And there is something so neat about averaging. We combine three quite disparate sources of data together, so we're protected if one of them doesn't work out. It's robustness writ large! It can be justified without any in sample fitting - whereas one could argue that the selection of asset class pooling is an implicit in sample 'meta parameter' choice.

I think we're now (finally) ready to fit our forecast weights, and with costs. This is exciting for me, as whatever comes out I will be using as my new weights. This will be more of a 'literature review' since I've talked about optimising with costs in some detail and at some length before.

Monday, 29 June 2026

Rolling, rolling, rolling.... updating statistical estimates yes or no

 The mega blog post series on portfolio optimisation continues!

A couple of posts ago, here, I looked at using the idea of formal testing for structural breaks in parameter estimates. Important parameters like Sharpe Ratio (SR). Because stuff like this happens:


This is the pre-cost performance of the momentum4 rule on CORN. The formal test found a structural break in 1989.

It's fair to say the structural break stuff didn't work that well. But there may be a much easier way of dealing with the non stationarity of these estimates, and that's to use rolling estimates. For example, if you were to use a 10 year rolling estimate of SR then by the mid 1990s we would conclude that this was a money losing rule. We could also use a rolling estimate for correlation, though as these are stable enough over periods of five years or more this wouldn't affect things much.

Of course I wouldn't be so crude as to use a mere rolling window, instead I'd use an exponential window. As usual I'm going to specify this using the span parameter of the pandas ewm function. A 10 year span has a 3.5 year half life; i.e. the 10 year EWM is roughly equivalent (same halflife) to a 7 year simple moving average.
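
The conversion from span to half life follows from the pandas definition of span, alpha = 2 / (span + 1), with the half life being the lag at which the weight has decayed to a half. A quick sketch, assuming daily data with 256 business days in a year, and using the convention from the paragraph above that the 'equivalent' simple moving average is twice the half life:

```python
import math

BUSINESS_DAYS_IN_YEAR = 256  # assumption: spans are applied to daily data

def halflife_in_years(span_years):
    """Half life implied by a pandas ewm span, both measured in years."""
    span_days = span_years * BUSINESS_DAYS_IN_YEAR
    alpha = 2.0 / (span_days + 1.0)  # pandas definition of span
    halflife_days = math.log(0.5) / math.log(1.0 - alpha)
    return halflife_days / BUSINESS_DAYS_IN_YEAR

for span in (5, 10, 20, 30, 40):
    halflife = halflife_in_years(span)
    print(f"span {span:>2}y: half life {halflife:.1f}y, similar SMA about {2 * halflife:.0f}y")
```

For long spans this is very close to the rule of thumb that the half life is about 0.35 times the span.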


The test

Regular readers will know exactly what to expect here, but for those that aren't regular here is how I test this procedure to be as sure as possible there is no luck involved.

  • Select 10, 20, 30 or 40 years of in sample data (it doesn't make sense to apply an exponentially weighted [EW] estimate to shorter periods)
  • Select 1 or 5 years of out of sample data
  • Pick a random instrument, ensuring there is enough history available (between 11 and 45 years). We will only choose from instruments with sufficient history for the time required. 
  • Randomly pick N=9 forecasting rules from those available (the same as in previous posts)

Then for each of those subsamples:

  • Cycle through using an exponentially weighted [EW] span of 5, 10, 20 or 30 years; and no span (use all available history). For shorter in sample periods the EW results using longer spans will be very similar to those without EW estimation.
  • Estimate SR using the EW span.
  • Estimate correlation using all the in sample data (we could use an EW span here, but correlations are sufficiently stable that it won't unduly affect results).
  • Use fixed shrinkage levels (estimated here): SR shrinkage 0.5, correlation 0.75 (since we'll always have at least five years of in sample data we don't need to worry about the higher levels of shrinkage required when we have insufficient data). The results won't be much different with any vaguely similar shrinkage; you could argue we'd need more shrinkage with shorter EW spans but I am not going to test this.
  • Run in sample optimisation and out of sample optimisation on all the options above
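
The 'estimate SR using the EW span' step can be sketched with the pandas ewm function. The assumptions here are mine: daily returns, 256 business days in a year, and both the mean and the standard deviation being exponentially weighted. The function name and the synthetic data (a rule that works for 20 years and then stops working, exaggerated so the effect is obvious) are purely illustrative.

```python
import numpy as np
import pandas as pd

BUSINESS_DAYS_IN_YEAR = 256  # assumption: daily returns

def ew_sharpe_ratio(daily_returns, span_years=None):
    """Annualised SR at the end of the sample; span_years=None weights all history equally."""
    if span_years is None:
        mean, std = daily_returns.mean(), daily_returns.std()
    else:
        ewm = daily_returns.ewm(span=span_years * BUSINESS_DAYS_IN_YEAR)
        mean, std = ewm.mean().iloc[-1], ewm.std().iloc[-1]
    return np.sqrt(BUSINESS_DAYS_IN_YEAR) * mean / std

# Synthetic rule that makes money for 20 years and then loses money for 10
rng = np.random.default_rng(7)
good_years = rng.normal(0.002, 0.01, size=20 * BUSINESS_DAYS_IN_YEAR)
bad_years = rng.normal(-0.002, 0.01, size=10 * BUSINESS_DAYS_IN_YEAR)
returns = pd.Series(np.concatenate([good_years, bad_years]))

for span in (5, 10, 20, 30, None):
    print(span, round(ew_sharpe_ratio(returns, span), 2))
```

The shorter the span, the more the estimate reflects the recent losing period rather than the long run average.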

Finally once we have all our subsamples:

  • Get the median SR from the distribution of subsamples
  • Find the optimal EWM span with the highest median SR
  • Test to see if that optimum is significantly higher than the others
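
Those three steps could be sketched as follows. Several things here are assumptions rather than a description of my actual test: the test is paired by subsample and one sided, and the p-value uses a normal approximation to the t distribution (reasonable with thousands of subsamples, and it avoids needing anything beyond numpy and pandas). The option names and numbers are random, not results.

```python
import math
import numpy as np
import pandas as pd

def compare_to_best(oos_sr):
    """oos_sr has one row per subsample and one column per option."""
    medians = oos_sr.median()
    best = medians.idxmax()
    rows = {}
    for option in oos_sr.columns:
        if option == best:
            rows[option] = (medians[option], np.nan)  # NaN marks the best option
            continue
        diff = oos_sr[best] - oos_sr[option]  # paired by subsample
        t_stat = diff.mean() / (diff.std() / math.sqrt(len(diff)))
        # One sided p-value that the best option is not really better
        p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))
        rows[option] = (medians[option], p_value)
    return pd.DataFrame.from_dict(rows, orient="index", columns=["SR", "pvalue"])

# Hypothetical out of sample SR for two options across 2000 subsamples
rng = np.random.default_rng(3)
oos_sr = pd.DataFrame({
    "option_a": rng.normal(0.6, 0.5, size=2000),
    "option_b": rng.normal(0.3, 0.5, size=2000),
})
table = compare_to_best(oos_sr)
print(table.round(3))
```

Note that the best option is picked on the median while the test compares means, so a p-value above 0.5 is possible when the two disagree.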

As I also did in my last post I'm going to see if the figures are different without any implicit fitting. To achieve this I include the opposite of a given trading rule as a candidate; and then when I come to do optimisation I pick the version that has a positive SR (there are no costs, so the SR will be identical with a negative sign).


10 years in sample, one year out of sample

We only have five options to consider so we can do this in a simple table.

            SR  pvalue
5        0.026   0.065
10       0.021   0.026
20       0.033     NaN
30       0.030   0.042
999999   0.017   0.131

Each row is a different EW span; '999999' means the entire in sample period was used. The first column is the out of sample Sharpe Ratio for each option. The second column is the p-value for a test of the optimal option against the relevant option. NaN marks the optimal option, and lower values (say below 0.05) mean the optimal option is significantly better than that alternative. We can see that a 20 year span is optimal, and it's a little better than the other alternatives but not significantly better than using the entire in sample period.

Do the results differ when we don't preselect only the 'correct' rules?


            SR  pvalue
5        0.076   0.188
10       0.093     NaN
20       0.085   0.186
30       0.087   0.169
999999   0.083   0.236

Nothing is really significant there.


10 years in sample, five years out of sample


            SR  pvalue
5        0.201   0.009
10       0.215     NaN
20       0.214   0.078
30       0.211   0.277
999999   0.199   0.271

            SR  pvalue
5        0.093   0.000
10       0.112     NaN
20       0.104   0.339
30       0.104   0.466
999999   0.110   0.533

Here we do get better performance with anything more than 5 years.

20 years in sample, one year out of sample

            SR  pvalue
5       -0.154   0.892
10      -0.170   0.687
20      -0.163   0.610
30      -0.164   0.647
999999  -0.138     NaN

            SR  pvalue
5       -0.202   0.006
10      -0.135   0.011
20      -0.110   0.029
30      -0.097   0.059
999999  -0.055     NaN

Longer estimates are better.

20 years in sample, five years out of sample


            SR  pvalue
5        0.093   0.000
10       0.119   0.000
20       0.134   0.026
30       0.132   0.043
999999   0.149     NaN

            SR  pvalue
5        0.061   0.000
10       0.111   0.003
20       0.129   0.070
30       0.131   0.071
999999   0.140     NaN

Yes, longer estimates are better.


30 years in sample, one year out of sample


            SR  pvalue
5       -0.073     0.0
10      -0.078     0.0
20      -0.031     0.0
30      -0.005     0.0
999999   0.042     NaN

            SR  pvalue
5       -0.132   0.007
10      -0.152   0.000
20      -0.156   0.000
30      -0.104   0.000
999999   0.000     NaN


Same story, slightly different numbers.

30 years in sample, five years out of sample

            SR  pvalue
5       -0.038     0.0
10      -0.031     0.0
20      -0.017     0.0
30      -0.006     0.0
999999   0.008     NaN

            SR  pvalue
5       -0.050   0.001
10      -0.046   0.003
20      -0.038   0.003
30      -0.029   0.003
999999  -0.017     NaN


40 years in sample, one year out of sample


            SR  pvalue
5       -0.135   0.145
10      -0.113   0.023
20      -0.110   0.001
30      -0.090     NaN
999999  -0.150   0.947

Again, we basically want a very long estimate.

            SR  pvalue
5       -0.159     NaN
10      -0.224   0.044
20      -0.236   0.028
30      -0.238   0.072
999999  -0.254   0.061

That was a little unexpected.

40 years in sample, five years out of sample


            SR  pvalue
5       -0.049     0.0
10      -0.040     0.0
20      -0.029     0.0
30      -0.021     0.0
999999  -0.015     NaN

            SR  pvalue
5       -0.097   0.000
10      -0.064   0.000
20      -0.016   0.006
30      -0.006     NaN
999999  -0.029   0.703


Conclusion

I'm a big believer in publishing (well blogging) research even if it doesn't result in a positive result. And certainly it looks like you don't really gain anything from using exponentially weighted estimates of Sharpe Ratios for optimisation, versus the simpler alternative of using all the data. Still there is that nagging feeling that we should at least have the option of dropping something that hasn't worked for a while which implies a very slow EWM. A 30 year EWM span has a 10.3 year halflife, the same as a 20 year or so SMA; whilst a 40 year EWM span is equivalent to a 28 year SMA. 



One of These Things (Is Not Like the Others). Or is it? Pooling rule p&l estimates across instruments.

 This is the eighth post in a series I'm writing on portfolio optimisation. I haven't done one of these for a few posts, so here is the story so far:

  • In the first post I showed that if you are optimising across forecasts from different trading rules and instruments, then you should first fit within; and then across, instruments. As I do anyway.
  • In my second post I ran some experiments with optimising with random data. The results showed a supreme indifference between joint winners: monte carlo and bootstrapping, and a shrinkage methodology with a tiny bit of SR shrinkage. 
  • For post three I showed that the predictability of sampling distribution of parameter estimates was much worse with real data than with random data.
  • Number four saw me rerun post #2 with real data. A middle ground of some shrinkage was the winner across most time periods; unless the in sample period was too short.
  • Post number five was a twist on Bayesian shrinkage and an abject failure.
  • Number six was about clustering. It turns out clustering is good, though not staggeringly so. A cluster size of six was arbitrarily chosen as being reasonable enough.
  • The post whose number is seven was about structural breaks in estimates of SR. For forecast/instrument pairings these occur between 13% and half the time depending on the critical value used. However out of sample optimisation did not show a significant SR improvement if only history after a break was used for estimation.
This post has something in common with number seven. We know that we want more data for estimation. We can get more data with more frequent data (not helpful if you are trading slowly) or more history of data. For more history we only want history that is relevant. If the history occurred before a structural break then we should throw it away (at least in theory - that didn't work so well in practice).

There is another way to get more data and that is in the cross section. For example, we could combine the p&l for the trading rule momentum16 across all the US government bond futures, or the entire fixed income sector.  Alternatively we could go the whole hog and combine all the p&l across every future (all this is ignoring costs, which are different for each instrument, and would have to be deducted ex-post this exercise). 

A key question however when doing this is 'are these things similar enough?'. For example, should we use the same SR estimate for momentum16 across all instruments? Or can we say that a particular instrument 'just doesn't trend well'? Consider Cocoa, the stand out star of many trend following portfolios in 2024. Here is its performance using the momentum16 trading rule:

Is it wise to assume that a particular instrument is poor at trending, just because it has been historically (or at least from 1980 to 2020)?

Note: this isn't the same as assuming a particular instrument is poor - that decision will come when we look at instrument weights.

Some of you might remember this plot from my third book, Leveraged Trading:

That shows the SR for that same trading rule, momentum16, with error bars across the data set used in the book. The error bars suggest that there aren't any SR that are statistically different from each other, so we should pool everything.

If I recalculate these numbers using the 214 instruments in my current data set, excluding duplicates, then the estimated SR trading momentum16 varies across instruments between -1.85 and 1.95. That range is, to be fair, larger than in the plot above.
 
If we test the SR difference between the samples of the worst and best instrument (Lead-LME and House-US FWIW) we do get a significant t-statistic (around 4 to 5, depending on whether we only test on the overlapping period or do an independent test of all the data from both instruments). So they are significantly different. It's even more impressive when you consider there are only 3 or 5 years of data used to make this evaluation (again either the overlapping or the entire time period). The p-value is around 0.003%. Much less than the p-value we would crudely expect to see by fluke with 214 instruments (which would be a little below 0.5%).

It's also worth noting that this is quite a wide range of p&l for a trading rule: from -1.85 to 1.94 is a range of 3.8. For momentum64 it's from -1.0 to +1.35; and for carry10 it's from -0.86 to 1.46. The wider ranges of SR seem to be only present for the faster trend style rules; a range of 3.1 SR units for momentum4 (from -1.0 to +2.1 pre costs), 2.8 for breakout10, 3.3 for relmomentum10 and so on. Some of those more extreme figures are because some instruments don't have as much data (in the box plot above, they'd have wider boxes).

But could this just be luck? What kind of SR range would you expect across 214 random Gaussian p&l streams with an underlying expected SR of 0.19 and say 4 years of data? The answer is about 2.68, with a 90% range of 2.44 to 3.11. Another way of looking at a distribution is the cross sectional standard deviation: across 214 random instruments the standard deviation of SR estimates with 4 years of data is around 0.5 (and that is much more stable than the range).
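
That random benchmark is easy to reproduce approximately. The sketch below assumes daily returns with 256 business days in a year, independent Gaussian streams, and 100 repeats of the experiment; the exact figures will wobble with the seed and with those choices, so treat the output as roughly comparable to the numbers quoted above rather than identical.

```python
import numpy as np

DAYS_PER_YEAR = 256  # assumption: daily p&l
N_INSTRUMENTS, YEARS, TRUE_SR = 214, 4, 0.19

def simulate_sr_estimates(rng):
    """Annualised SR estimates for N_INSTRUMENTS independent Gaussian p&l streams."""
    daily_mean = TRUE_SR / DAYS_PER_YEAR ** 0.5  # daily standard deviation set to 1
    returns = rng.normal(daily_mean, 1.0, size=(N_INSTRUMENTS, YEARS * DAYS_PER_YEAR))
    daily_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
    return daily_sr * DAYS_PER_YEAR ** 0.5

rng = np.random.default_rng(42)
trials = [simulate_sr_estimates(rng) for _ in range(100)]
ranges = [sr.max() - sr.min() for sr in trials]
stdevs = [sr.std(ddof=1) for sr in trials]

print(f"median SR range {np.median(ranges):.2f}")
print(f"median cross sectional std {np.median(stdevs):.2f}")
```

The standard deviation also agrees with the usual back of the envelope figure for the sampling error of a Sharpe Ratio, which is about one over the square root of the number of years.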

Now, if we subsample random 4 year periods for each instrument across the actual performance of momentum16 we get much bigger numbers. The SR range is around 3.88, and the standard deviation comes in around 0.59. So we can see that the real data has around a fifth more variability than what we'd expect purely from randomness.

Those four year cross sectional standard deviations come in at similar values for all trading rules; the lowest is 0.53 (momentum64) and almost all are between 0.5 and 0.6 with one standout exception: a value of 0.866 for mrinasset1000 (a slow cross sectional mean reversion within asset classes). 

All this arseing around is just to confirm that there is indeed more cross sectional variability in SR estimates than luck would suggest. Which means it is possible that the full pooling approach (use all data to estimate everything) might be beatable by another more selective pooling approach.


Some selective pooling approaches

There are a number of ways we could approach this exercise:

  1. We could cluster things together that have similarities, such as by asset class. I have done this in the past but it is slightly unsatisfactory to use a method which could have subjectivity and can't 'learn' from the data. Let's ignore that for now.
  2. We could do it on an estimate by estimate basis. We could compare the distribution of returns for say carry10 on US10 year bonds, and on US2 year bonds; and say "Well these distributions aren't significantly different. Let's pool the returns together". 
  3. Or we could look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments.  
  4. Or we could do it on portfolio weights. We could for example fit the weights for 2 year bonds, and for 10 year bonds, and then see if they were significantly different. We could then pool the returns if they were not that different. In fact I have looked at this before*.
* Note: we could also take an average of the weights, but that risks 'over robustness' if the weights were already produced using some kind of shrinking technology; i.e. you would end up averaging the weights too much.

An advantage of the second approach is we don't need to worry if instruments have different sets of trading rules available to them. Another is that it's obvious what significantly different means - we can just run plain vanilla t-tests like we did with the previous post on structural breaks at some critical value (CV). A disadvantage is that we need to do a lot more computation (one set for every rule separately); yet another is that because we are testing so many things the danger of finding false positives is greatly increased. The standard response to this is to reduce the CV but then we're likely to miss some real differences. Yes, the classic dilemma of statistical testing.

For the third and fourth approaches we're basically saying "is this instrument like this other instrument". That radically reduces the set of comparisons we have to do (just comparing instruments with instruments). The disadvantage is that it's harder to compare instruments with different rules; though not impossible. If the rule sets are similar enough then we can do our comparisons with a zero weight to the missing rules; and then just add back the rules that aren't in the shared dataset. Then we have the calibration of significance. We can use random data, like I did above, to work out the likelihood of a particular distance between estimates or weights happening by pure fluke.

It does get difficult however to incorporate missing or additional forecasts when we are comparing weights. But that's straightforward with the third method. If some rules are missing from the pooled returns, then we can just calculate the additional statistics we need using returns only for that instrument.

Note that all four approaches can incorporate different cost levels. We just do everything in pre-cost world, and then as a final step deduct costs from the gross returns before optimising a given instrument. Note we couldn't do that as easily if we were averaging weights for the fourth approach.

As I have looked at using weights before, I'm going to look at SR based grouping in this post - approach #3. So to be clear what I am doing is:

"Look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments."


Calibrating the threshold

This then is the sort of thing we are looking at:

               US10   US5  SP500
breakout10    -0.02  0.10  -0.43
breakout20     0.23  0.28  -0.08
breakout40     0.36  0.39   0.07
breakout80     0.35  0.38   0.25
breakout160    0.28  0.30   0.34
breakout320    0.30  0.41   0.37
relmomentum10  0.09  0.23  -0.15
relmomentum20 -0.02  0.11  -0.10
relmomentum40  0.12  0.13  -0.14
relmomentum80  0.17  0.26  -0.13
mrinasset1000 -0.06 -0.15  -0.49
carry10        0.51  0.50   0.10
carry30        0.53  0.50   0.10
carry60        0.53  0.48   0.12
carry125       0.54  0.49   0.18
assettrend2    0.02  0.03  -0.17
assettrend4    0.25  0.25  -0.17
assettrend8    0.41  0.40  -0.05
assettrend16   0.43  0.44   0.17
assettrend32   0.39  0.39   0.25
assettrend64   0.40  0.40   0.26
normmom2       0.05  0.14  -0.35
normmom4       0.25  0.30  -0.27
normmom8       0.39  0.38  -0.10
normmom16      0.42  0.42   0.08
normmom32      0.40  0.42   0.25
normmom64      0.37  0.43   0.32
momentum4      0.18  0.27  -0.23
momentum8      0.35  0.42  -0.02
momentum16     0.39  0.46   0.16
momentum32     0.35  0.42   0.30
momentum64     0.32  0.41   0.35
relcarry       0.04  0.20   0.14
skewabs365    -0.02 -0.12   0.24
skewabs180     0.17  0.02   0.32
skewrv365      0.29  0.22  -0.06
skewrv180      0.26  0.16   0.19
accel16        0.31  0.31  -0.11
accel32        0.22  0.28  -0.16
accel64        0.03  0.14  -0.04

That shows you the vector of Sharpe Ratios for each trading rule for three different instruments. The question then is, should we pool the returns of US5 and US10? And what about SP500? Or are these vectors of SR distinctively different and we should not pool at all?

Just by eye the two bonds do look very similar. The S&P 500 not so much. A simple Euclidean distance metric gives a distance of 0.07 between the two bonds; and around 0.31 between the equity and the bonds. Now this distance measure is crude. It doesn't take into account the length of data each asset has. A proper statistical test if the time periods were matched would also look at the correlation of the two matched return distributions. But regular followers of this blog will know that I love crude. So let's run with this.

Note: The distance metric is just sqrt(average((SR_i_1 - SR_i_2)^2)) for SR estimates on rules i=1...N and for instruments 1 and 2.
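Here is a minimal sketch of that distance calculation, assuming the metric is the root mean square of the rule by rule SR differences (that is the reading which reproduces the quoted figures). The numbers are copied from the table above; the function and variable names are mine.

```python
import numpy as np

# SR vectors copied from the table above, in the same rule order (40 rules each)
SR = {
    "US10": [-0.02, 0.23, 0.36, 0.35, 0.28, 0.30, 0.09, -0.02, 0.12, 0.17,
             -0.06, 0.51, 0.53, 0.53, 0.54, 0.02, 0.25, 0.41, 0.43, 0.39,
             0.40, 0.05, 0.25, 0.39, 0.42, 0.40, 0.37, 0.18, 0.35, 0.39,
             0.35, 0.32, 0.04, -0.02, 0.17, 0.29, 0.26, 0.31, 0.22, 0.03],
    "US5": [0.10, 0.28, 0.39, 0.38, 0.30, 0.41, 0.23, 0.11, 0.13, 0.26,
            -0.15, 0.50, 0.50, 0.48, 0.49, 0.03, 0.25, 0.40, 0.44, 0.39,
            0.40, 0.14, 0.30, 0.38, 0.42, 0.42, 0.43, 0.27, 0.42, 0.46,
            0.42, 0.41, 0.20, -0.12, 0.02, 0.22, 0.16, 0.31, 0.28, 0.14],
    "SP500": [-0.43, -0.08, 0.07, 0.25, 0.34, 0.37, -0.15, -0.10, -0.14, -0.13,
              -0.49, 0.10, 0.10, 0.12, 0.18, -0.17, -0.17, -0.05, 0.17, 0.25,
              0.26, -0.35, -0.27, -0.10, 0.08, 0.25, 0.32, -0.23, -0.02, 0.16,
              0.30, 0.35, 0.14, 0.24, 0.32, -0.06, 0.19, -0.11, -0.16, -0.04],
}


def sr_vector_distance(sr_a, sr_b):
    """Root mean square difference between two equal length SR vectors."""
    diff = np.asarray(sr_a, dtype=float) - np.asarray(sr_b, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


for pair in [("US10", "US5"), ("US10", "SP500"), ("US5", "SP500")]:
    print(pair, round(sr_vector_distance(SR[pair[0]], SR[pair[1]]), 2))
```

Run on the table, this should give a bond to bond distance close to the 0.07 quoted, and bond to equity distances of roughly 0.3.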

How can we calibrate this distance measure? To put it another way, is 0.07 very low, and is 0.31 very high? Should we pool everything? Just the bonds?

Well, my bias is towards pooling. Before writing this blog post I've generally pooled everything without giving it a moment's thought. So I would only be not pooling if there is a high chance that two instruments are distinctly different. That suggests my rule is to pool two instruments, unless their vector distance is above some critical value. In the simple example above, any critical value in the range 0.08 to 0.29 would imply pooling the two bonds, but would say not to pool the S&P. A critical value below 0.07 would result in no pooling. A value of 0.32 or above would imply pooling everything. 

How to find these critical values? Easy! I can set the critical value using random data at a level where we would pool unless there is a (say) 95% chance that the instruments are actually significantly different. We know that we typically have about 20 years of data. We know that the average trading rule on the typical instrument has a SR of around 0.15. If we generate 40 lots of random returns with that, and measure the SR, we'll get something like one of the columns above. Repeat that for another non existent instrument and we have two random 'instruments', each with its own SR vector. Then we calculate the distance between those two random SR vectors. Finally we rinse and repeat many times. We then get a distribution of distances. Voila:




We can certainly say that a distance of 0.31 is what we'd easily expect through random chance, whereas 0.07 is very unlikely to happen. You may be wondering how different that would be for say an instrument with just 5 years of data. Wonder no longer:


With less data and more variability in SR larger distances are more common. And what about the upper end with say 40 years of data (roughly what we have for the two bonds and S&P above):


There is less variability in outcomes with longer periods, so the distances are smaller. We're still not seeing a distance of 0.07 here though. It's incredibly unlikely to be a coincidence that the two bonds have roughly the same SR vectors. At the other end of the spectrum, it's also pretty unlikely that the S&P and the two bonds are actually drawn from the same distribution; since 0.31 is at the right edge of the distribution of distances.
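The random data calibration can be sketched roughly as follows. This is not the exact code behind the plots: it assumes independent Gaussian daily returns, 256 business days a year, uncorrelated rules, and the same true SR of 0.15 for every rule. All names are hypothetical.

```python
import numpy as np

DAYS_PER_YEAR = 256  # assumed number of business days in a year


def sr_vector_distance(sr_a, sr_b):
    """Root mean square difference between two SR vectors."""
    diff = np.asarray(sr_a) - np.asarray(sr_b)
    return float(np.sqrt(np.mean(diff ** 2)))


def random_sr_vector(rng, n_years, n_rules=40, true_sr=0.15):
    """Estimated annualised SR for each of n_rules random 'trading rules'."""
    n_days = int(n_years * DAYS_PER_YEAR)
    daily_mean = true_sr / np.sqrt(DAYS_PER_YEAR)  # daily vol fixed at 1.0
    returns = rng.normal(daily_mean, 1.0, size=(n_rules, n_days))
    return np.sqrt(DAYS_PER_YEAR) * returns.mean(axis=1) / returns.std(axis=1, ddof=1)


def simulated_distances(n_years, n_trials=500, seed=0):
    """Distances between pairs of random 'instruments' with identical true SR."""
    rng = np.random.default_rng(seed)
    return np.array([
        sr_vector_distance(random_sr_vector(rng, n_years),
                           random_sr_vector(rng, n_years))
        for _ in range(n_trials)
    ])


def critical_value(n_years, quantile=0.95, **kwargs):
    """Distance exceeded by chance only (1 - quantile) of the time."""
    return float(np.quantile(simulated_distances(n_years, **kwargs), quantile))
```

Calling critical_value(20) and critical_value(5) should show the pattern described above: shorter histories give noisier SR estimates and therefore larger distances by pure chance.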

Now we should apply a pinch of salt correction to numbers from random data as we know from post two that real data doesn't behave like random data with a fixed SR, and shows more variability. So we would expect larger distances in real data than we see here, particularly for longer periods. For a given critical value calibrated using random data, it's likely that with actual data we will see more apparent significant differences, and therefore slightly less pooling going on.

The other thing to point out is that there are (N^2-N)/2 possible pairwise comparisons of SR vectors; or for 214 instruments 23,000 or so. Were we to set our threshold for pooling at (say) 99% we'd expect to see over 200 apparent significant differences just by chance even if the instruments had no significant difference in true SR. 
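The arithmetic behind those two numbers, as a quick check (variable names are mine):

```python
n_instruments = 214
n_pairs = n_instruments * (n_instruments - 1) // 2  # same as (N^2 - N) / 2

# At a 99% threshold, 1% of truly identical pairs still look 'different'
expected_false_positives = n_pairs * (1 - 0.99)

print(n_pairs, round(expected_false_positives))
```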

Fortunately there isn't that much difference in the tails for the distribution of distances over different time periods from random data:
    
         1 year   5 years  10 years  20 years  30 years  40 years
95%        1.65      0.74      0.53      0.37      0.30      0.26
99%        1.81      0.80      0.57      0.40      0.33      0.28
99.9%      1.91      0.85      0.61      0.43      0.35      0.30

Note that using the bottom row would just barely still result in S&P 500 and the bonds getting separated (distance 0.31 with 40 years or so of data); and with a little less history they would be pooled together. Despite the higher risk of false positives, my bias to pooling isn't so absurd that I think that should happen. Anyway, I'm going to use the top row, which implies:

We will pool instruments unless there is a 95% chance or greater that their SR vectors are significantly different. That will be the case if their SR vectors have distances greater than a critical value that runs from 1.65 (one year or less) down to 0.26 (40 years or less).

What if we have instruments with different amounts of history? Given our bias is towards pooling, I would always use the higher critical value for distance. For example, if we had one instrument with 5 years of returns and another with 40 years, I'd use the critical value for 5 years. That also means that instruments with less history are more likely to be pooled when they first enter the data set. Which feels like the correct approach.
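A minimal sketch of that lookup, using the 95% row of the table and the "N years or less" bracketing. Reusing the 40 year value for anything longer than 40 years is my assumption, as are the names.

```python
import bisect

YEAR_BRACKETS = [1, 5, 10, 20, 30, 40]
CRITICAL_VALUES_95 = [1.65, 0.74, 0.53, 0.37, 0.30, 0.26]  # 95% row of the table


def critical_value_for_pair(years_a, years_b):
    """Critical distance for two instruments, driven by the shorter history."""
    shortest = min(years_a, years_b)
    # First bracket that is >= the shortest history ("N years or less")
    idx = bisect.bisect_left(YEAR_BRACKETS, shortest)
    # Assumption: beyond 40 years keep using the 40 year value
    idx = min(idx, len(CRITICAL_VALUES_95) - 1)
    return CRITICAL_VALUES_95[idx]


print(critical_value_for_pair(5, 40))   # uses the 5 year value
print(critical_value_for_pair(40, 40))  # uses the 40 year value
```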


The pooling algo in full

OK so what we do is (for a given point in time as this will be on an in and out of sample basis):

  • estimate the SR for each rule on each instrument
  • for that vector of SR, work out the distance between that instrument and all other instruments
  • calculate the critical value for each pairing, using the lowest number of years available for one of the instruments.
  • assuming there is at least one pair where the distance is less than the critical value, find the pair with the smallest distance
  • combine the returns of those instruments together into a new pseudo pooled instrument
  • calculate the SR vector, and distances between this new pseudo instrument and all existing instruments (all the previously calculated distances between the remaining instruments will remain the same and don't need recalculating). Note that the number of years available for a pseudo instrument will be equal to the sum of the years on the individual instruments.
  • repeat until there are no distances less than the critical value (which means there is a less than 95% chance that their SR are significantly different).
We now have a mixture of pseudo instruments and possibly instruments. We use the pre-cost returns for each pool to optimise the appropriate portfolio weights. There is some additional logic around handling distinct and missing forecasts, and also costs, but for now let's keep things simple. 
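The steps above can be sketched as a greedy merging loop. This is a simplified illustration rather than my production code: it assumes every instrument has the same set of rules, takes a hypothetical critical_value callable that maps years of history to a threshold, and lazily recalculates every distance on each pass instead of caching the ones that haven't changed.

```python
import numpy as np

DAYS_PER_YEAR = 256  # assumed number of business days in a year


def sr_vector(returns):
    """Annualised SR per rule; returns has shape (n_days, n_rules)."""
    return np.sqrt(DAYS_PER_YEAR) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def distance(returns_a, returns_b):
    """Root mean square difference between the two SR vectors."""
    diff = sr_vector(returns_a) - sr_vector(returns_b)
    return float(np.sqrt(np.mean(diff ** 2)))


def years_of(returns):
    return len(returns) / DAYS_PER_YEAR


def pool_instruments(returns_by_instrument, critical_value):
    """Merge the closest pair repeatedly until no pair is inside its critical value.

    Returns a dict mapping a tuple of instrument names to their stacked returns.
    """
    pools = {(name,): np.asarray(r, dtype=float)
             for name, r in returns_by_instrument.items()}
    while len(pools) > 1:
        best = None
        keys = list(pools)
        for i, key_a in enumerate(keys):
            for key_b in keys[i + 1:]:
                a, b = pools[key_a], pools[key_b]
                d = distance(a, b)
                # Critical value uses the shorter of the two histories
                cv = critical_value(min(years_of(a), years_of(b)))
                if d < cv and (best is None or d < best[0]):
                    best = (d, key_a, key_b)
        if best is None:
            break  # nothing left that is close enough to pool
        _, key_a, key_b = best
        # Stacking rows means the pseudo instrument's history is the sum of both
        pools[key_a + key_b] = np.vstack([pools.pop(key_a), pools.pop(key_b)])
    return pools
```

Each key in the result is either a single instrument or a pseudo instrument made of several, and the stacked returns are what would then go into the optimiser.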


Just for fun

Just for fun, let's run a single in sample test and see what gets pooled together in instrument space. This is akin to clustering exercises I have done before but there I just used underlying instrument returns.

This takes a while, and you may be surprised by some of the first pairs of instruments that are pooled together:

Pooling DistanceKeys(key1='SOFR', key2='EDOLLAR')   # both STIR
Pooling DistanceKeys(key1='US5', key2='US10')       # the example we have been using
Pooling DistanceKeys(key1='PLAT', key2='REDWHEAT')  # WTF!
Pooling DistanceKeys(key1='HEATOIL', key2='CHF')    # WATF?!
Pooling DistanceKeys(key1='GILT', key2='CAD10')     # both 10 year bonds fine
Pooling DistanceKeys(key1='ZAR', key2='DAX')        # WATFF?!?!?
...

Anyway once finished we end up with 54 pooled returns rather than 214 of our original distinct instruments. 

There are two huge groups that take in a big chunk of the instruments. Here is the first with 103 instruments:
['AUD_micro', 'BBCOMM', 'BOBL', 'BONO', 'BRE', 'BRENT_W', 'BTP', 'BTP3', 'BUND', 'BUXL', 'CAD', 'CAD10', 'CANOLA', 'CH10', 'CHEESE', 'CHF', 'COCOA', 'COCOA_LDN', 'COFFEE', 'COPPER-micro', 'CORN', 'COTTON', 'COTTON2', 'CRUDE_ICE', 'CRUDE_W_micro', 'DAX', 'DOW', 'DX', 'EDOLLAR', 'EU-BANKS', 'EU-DJ-UTIL', 'EURCHF', 'EUR_micro', 'FANG', 'FEEDCOW', 'FTSE250', 'FTSECHINAA', 'GAS-PEN', 'GASOIL', 'GASOILINE', 'GAS_US_mini', 'GBP', 'GICS', 'GILT', 'GOLD_micro', 'HANG_mini', 'HEATOIL', 'HIGHYIELD', 'IBEX_mini', 'IRS', 'JGB', 'JGB-SGX-mini', 'JPY', 'KOSPI_mini', 'KR3', 'LEANHOG', 'LIVECOW', 'LUMBER', 'MILLWHEAT', 'MSCISING', 'MXP', 'NASDAQ_micro', 'NIFTY', 'NIKKEI', 'OAT', 'OATIES', 'OJ', 'OMX', 'PALLAD', 'PLAT', 'PLN', 'R1000', 'RAPESEED', 'REDWHEAT', 'RICE', 'ROBUSTA', 'RUR', 'SGX', 'SHATZ', 'SILVER', 'SMI-MID', 'SOFR', 'SOYBEAN_mini', 'SOYMEAL', 'SOYOIL', 'SP400', 'SP500_micro', 'SUGAR11', 'SUGAR_WHITE', 'TOPIX', 'US-DISCRETE', 'US-HEALTH', 'US-TECH', 'US10', 'US10U', 'US2', 'US20', 'US30', 'US5', 'VIX_mini', 'WHEAT', 'YENEUR', 'ZAR']

Note that this does include both the S&P 500 and the US bond markets!

The second group of fifty instruments is mostly stock sectors, but not entirely:

['AEX', 'BOVESPA', 'CAC', 'CAD2', 'CAD5', 'CLP', 'CZK', 'DJSTX-SMALL', 'EU-AUTO', 'EU-BASIC', 'EU-CHEM', 'EU-CONSTRUCTION', 'EU-DIV30', 'EU-DJ-TELECOM', 'EU-FOOD', 'EU-HEALTH', 'EU-MID', 'EU-OIL', 'EU-REALESTATE', 'EU-RETAIL', 'EU-TECH', 'EU-TRAVEL', 'EURCAD', 'EURO600', 'EUROSTX', 'EUROSTX-SMALL', 'FTSE100', 'FTSECHINAH', 'GBPCHF', 'GBPEUR', 'IG', 'INR', 'MSCIASIA', 'MSCIEAFA', 'NICKEL_LME', 'NOK', 'NZD', 'RUSSELL', 'SEK', 'SMI', 'SPI200', 'US-ENERGY', 'US-FINANCE', 'US-INDUSTRY', 'US-MATERIAL', 'US-PROPERTY', 'US-REALESTATE', 'US-STAPLES', 'US-UTILS', 'V2X']

Again in terms of weirdness, the Canadian 10 year bond is in group 1 whilst the other Canadian bonds (CAD2 and CAD5) are in group 2. VIX is in group 1, and V2X in group 2. 

Next there are a few small groups, which mostly don't have any internal logic:

BRENT-LAST, USIRS5, USIRS10 (two out of three make sense)
MILK, WHEY, KR10, WHEAT_ICE (two out of four make sense)
FED, COAL-GEORDIE (nope, I got nothing)
IRON, ETHANOL
EURIBOR-ICE, COAL
SUGAR16, MILKDRY (the "what you shouldn't put in your coffee" group)

This leaves 46 instruments which can't be pooled with anything else:

['ALUMINIUM', 'AUDJPY', 'BB3M', 'BITCOIN', 'BUTTER', 'CHFJPY', 'CHINAA-CON', 'CNH', 'COPPER_LME', 'ETHER-micro', 'EU-HOUSE', 'EU-INSURE', 'EU-MEDIA', 'EUA', 'EURAUD', 'FTSEINDO', 'FTSETAIWAN', 'FTSEVIET', 'GBPJPY', 'HANGENT_mini', 'HANGTECH', 'HOUSE-US', 'JP-REALESTATE', 'KOSDAQ', 'KRWUSD_mini', 'LEAD_LME', 'LUMBER-new', 'MIB', 'MILKWET', 'MSCIEMASIA', 'MSCITAIWAN', 'MSCIWORLD', 'MUMMY', 'RUBBER', 'SARONA', 'SGD', 'SONIA3', 'STEEL', 'SWISSLEAD', 'TIN_LME', 'TWD', 'US3', 'USIRS2ERIS', 'USIRS5ERIS', 'VNKI', 'ZINC_LME']

Oh yes crypto people, Bitcoin and Ethereum are 'special'. As special as Tin and Rubber anyway. The furthest distance remaining after all that pooling is just under 0.31 which just exceeds the relevant critical value.

Evaluating the results

You should be used to the procedure by now if you've been following the blog posts. I will do the usual thing of cycling through different lengths of in sample (5 years, 10 years) and out of sample (1 year and 5 years) lengths of time. For shorter time periods that will allow me to subsample different historic periods. For speed and to get some alternative paths I'm not going to consider all the instruments. Instead I will randomly subsample 50 instruments out of the 214 available. Note that the pool of available instruments will be smaller when I am using e.g. 15 years of in and out of sample data, which is why I'm not going to a 40 year in sample period as I've done before when investigating structural breaks.

Then for a given set of returns I will either use fully pooled, distance weight pooled, or unpooled returns. Then I will optimise for each instrument based on the relevant returns, using the shrinkage method with SR shrinkage of 0.5 and correlation of 0.75. Finally I will take the SR of the portfolio that is equally weighted across the 50 instruments, out of sample. Note this should be better than the average SR for each instrument. I could do better with some kind of instrument weight allocation, but that is for another day. I should still be able to pick up whether I am losing out through lower diversification when pooling.
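To illustrate the last two points, here is a small sketch of the random subsampling and of why the equally weighted portfolio SR should beat the average instrument SR. It assumes every instrument's out of sample returns cover the same days, and all function names are hypothetical.

```python
import numpy as np

DAYS_PER_YEAR = 256  # assumed number of business days in a year


def annualised_sr(returns):
    return float(np.sqrt(DAYS_PER_YEAR) * np.mean(returns) / np.std(returns, ddof=1))


def subsample_instruments(instrument_names, n_sample, rng):
    """Pick n_sample distinct instruments at random."""
    return list(rng.choice(instrument_names, size=n_sample, replace=False))


def equal_weight_portfolio_sr(returns_by_instrument):
    """SR of the portfolio with equal weight on each instrument's returns."""
    stacked = np.column_stack(list(returns_by_instrument.values()))
    return annualised_sr(stacked.mean(axis=1))


def average_instrument_sr(returns_by_instrument):
    """Simple average of the stand-alone SR of each instrument."""
    return float(np.mean([annualised_sr(r) for r in returns_by_instrument.values()]))
```

The gap between the two numbers is the diversification benefit, which is what could shrink if pooling pushes every instrument towards the same forecast weights.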


... with a twist

Basically everything I have done up until now includes implicit in sample fitting, because I'm only selecting from trading rules that actually work. This will inflate the backtest results, but until now at least won't have a serious effect on the calibrations I have been running. But with this step of looking at pooling I am worried that there will be rules that just don't work on some instruments. To try and alleviate that in sample fitting problem, I'm going to include the opposite of each trading rule as well as the original rule. Then at each optimisation we only choose the positive SR option. Note there are no trading costs at this stage of my research so the p&l of the opposite rule is exactly equal to -1* the 'correct' rule. 
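A minimal sketch of that idea, assuming returns are held as an array with one column per trading rule and that there are no costs. The sign is chosen in sample and then applied unchanged out of sample; the function names are mine.

```python
import numpy as np


def choose_rule_signs(in_sample_returns):
    """+1 to keep a rule, -1 to use its opposite, based on in sample performance.

    The sign of the SR is the sign of the mean return, so only the mean is needed.
    """
    mean_returns = np.asarray(in_sample_returns, dtype=float).mean(axis=0)
    return np.where(mean_returns < 0, -1.0, 1.0)


def apply_rule_signs(returns, signs):
    """With no costs the opposite rule's p&l is exactly -1 times the original."""
    return np.asarray(returns, dtype=float) * signs


in_sample = np.array([[0.1, -0.2, 0.3],
                      [0.2, -0.1, 0.1]])
out_of_sample = np.array([[0.05, 0.10, -0.02]])

signs = choose_rule_signs(in_sample)
print(signs)
print(apply_rule_signs(out_of_sample, signs))
```

A rule that only looked good because its sign was picked with hindsight will tend to disappoint out of sample, which is the effect this setup is meant to expose.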

Anyway on with the results.

5 years in sample, 1 year out of sample

              SR      pvalue
unpooled      0.320   0.0
all pooled    0.447   0.0
algo pooled   0.514   NaN

              SR      pvalue
unpooled     -0.104   0.0
all pooled    0.391   NaN
algo pooled   0.016   0.0

The first table shows the results as I've been analysing up to now, with implicit fitting and only the 'correct' version of the trading rules included. In the second table I've allowed the possibility of the opposite rule to be included. Note the much lower performance that results; and a difference of opinion on whether we are better pooling everything or using the algo.


5 years in sample, 5 year out of sample

              SR      pvalue
unpooled      0.624   0.0
all pooled    0.431   0.0
algo pooled   0.657   NaN

              SR      pvalue
unpooled      0.339   0.0
all pooled    0.346   0.0
algo pooled   0.404   NaN

Once again the SR are reduced by being more honest, but with longer out of sample the algo pooling method is now superior. 

10 years in sample, 1 year out of sample


              SR      pvalue
unpooled      0.862   NaN
all pooled    0.436   0.0
algo pooled   0.626   0.0

              SR      pvalue
unpooled      0.460   NaN
all pooled    0.304   0.0
algo pooled   0.248   0.0

The one thing we haven't got here is consistency... now not pooling at all is the correct thing to do!

10 years in sample, 5 years out of sample

              SR      pvalue
unpooled      0.683   0.0
all pooled    0.584   0.0
algo pooled   0.735   NaN

              SR      pvalue
unpooled      0.781   NaN
all pooled    0.491   0.0
algo pooled   0.650   0.0

A bit of an unusual case here since we do better on unpooled when including opposite rules, but it can happen just by luck. Anyway things really are inconsistent here...


Summary

Although the results above do seem quite messy, if we focus on the more honest figures that include opposite rules we can see a pattern if we look at the best method in each case:

5 year 1 year:  All pooled
5 year 5 year:  Algo pooled
10 year 1 year: Unpooled
10 year 5 year: Unpooled

Hence the more data we have, the more it seems we can allow each instrument to have its own parameter estimates rather than sharing with other instruments.

Anyway, what to do? I am struggling here. I like more SR as much as the next guy, but I also have biases towards simplicity (Occam's razor), robustness and not changing things if I can avoid them. Sticking with what I currently do - pooling everything - is very tempting. It's simple, and it is also likely to be very robust. Not pooling at all is possibly even simpler; and with enough data history does seem to perform better. But it also worries me! Although we're ensuring robustness by using shrinkage, so maybe it's okay.

The Algo method is cool and fun, but definitely massively complicates matters. The method also doesn't produce 'nice' results. When I ran the original 'all instruments' grouping exercise, the long tail of instruments that don't fit elsewhere was slightly concerning. I had hoped to get groups that were congruent with asset classes or at least had some obvious logic, and I certainly didn't. This does suggest that the pooling by weights I have attempted before is worth a second look.

Alternatively I could use some simple heuristic like:
  • If an instrument has less than 5 years of data history, use pooled returns
  • If it has more than 25 years of history, use individual returns
  • With between 5 and 25 years of history, use weights that are an average of these; where the weight on pooled returns for N years of returns is (25-N)*0.05 and obviously the weight on individual returns is one minus that.
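As a sketch, the heuristic above boils down to a linear ramp. This assumes the weight on individual returns is one minus the weight on pooled returns, and that the two sets of fitted forecast weights are blended rule by rule; the names are hypothetical.

```python
import numpy as np


def pooled_blend_weight(years_of_history):
    """Weight on the pooled estimate: 1 at 5 years or less, 0 at 25 years or more."""
    # (25 - N) / 20 is the same as (25 - N) * 0.05
    return float(np.clip((25.0 - years_of_history) / 20.0, 0.0, 1.0))


def blended_forecast_weights(pooled_weights, individual_weights, years_of_history):
    """Average the two sets of forecast weights according to data history."""
    w = pooled_blend_weight(years_of_history)
    pooled = np.asarray(pooled_weights, dtype=float)
    individual = np.asarray(individual_weights, dtype=float)
    return w * pooled + (1.0 - w) * individual


for years in (3, 5, 15, 25, 40):
    print(years, pooled_blend_weight(years))
```

An instrument with 15 years of history would sit exactly half way between its pooled and individual weights.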
So there is another blog post to come at some point where I revisit the issue of pooling.

But for now we can put pooling by SR vector in the bin, the concept permanently damaged by the sharp edge of Occam's razor (topical political reference there!).