Friday, 3 July 2026

Jumping back in the pool(ing): testing pooling by asset class and portfolio weight distance

This is post #10 in my 2026 series on portfolio optimisation. Time for a quick recap. I'm not going to revisit every post but instead summarise what I now think one should be doing when optimising forecast weights before costs (I haven't yet incorporated costs, nor thought about instrument weights).

(I also confirmed in my very first post it was better to estimate forecast then instrument weights, rather than doing them jointly).

That doesn't seem like much value for the thousands of words I've written, and it's also not a million miles from what I would have done without all this research. A few things haven't worked out: random based methods (Bayesian and Monte Carlo), which don't account for the reduced predictability of real returns compared to synthetic data; formal structural breaks on estimates; grouping and pooling instruments according to forecast SR; shorter EWM windows for SR estimates; and shrinking weights rather than inputs.

I should probably move on now to looking at costs, and instrument returns, but like a dog with a particularly tasty bone or a cat pulling on an especially interesting piece of string, I can't quite let go of the idea that we should be able to improve on pooling everything.


Prior art

Let's run through the options we potentially have for pooling:

  1. We could cluster things together that have similar characteristics, such as by asset class. 
  2. We could do it on an estimate by estimate basis. We could compare the distribution of returns for say carry10 on US10 year bonds, and on US2 year bonds; and say "Well these distributions aren't significantly different. Let's pool the returns together". 
  3. Or we could look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments.  I covered this in my previous post in this series. 
  4. Or we could do it on portfolio weights. We could for example fit the weights for 2 year bonds, and for 10 year bonds, and then see if they were significantly different. We could then pool the returns if they were not that different. I have also looked at this before.
  5. We don't pool at all, and fit each instrument individually. That sounds terrifying, but remember we're shrinking with our fitting.
  6. We pool everything. So far that seems to be the best option, and the one I've used in the past.

Note that we also have the option of:

  • A pooling the returns before estimating the statistics and then the weights
  • B not pooling the returns, and then pooling the weights
I'm not keen on B because it produces 'over robustness' when combined with a shrinkage methodology. Basically we throw away too much information and end up too close to equal weights.

So returning to the numbered list:

  1. By asset classes - is untried, although it resembles what we used to do at AHL when we were organised into asset class teams, each of which fitted their own strategies.
  2. Grouping per estimate: I have objections to this in terms of computational time and statistical unpleasantness, discussed in the previous post.
  3. Grouping per vector of estimates: I tried this in the previous post. It wasn't effective, and also produced weird undesirable groups.
  4. Grouping by weights: I have tried before in a limited test with some success.

So that leaves us with 1 and 4 as candidates, along with the standard options of #6 full pooling and #5 no pooling at all - fitting each instrument's forecast weights purely on its own data:

  • Unpooled
  • All instruments pooled
  • Asset class pooled
  • Grouping by portfolio weights


By asset classes - method

This is pretty trivial; the eight asset classes in my system are:

  • Stock indices (58 instruments in my dataset including duplicates and expired instruments like Eurodollar to avoid survivorship bias)
  • Sector stocks (eg 'EU oil companies') 36 instruments
  • Vol 4
  • FX 43
  • Bonds and STIR 39
  • Energies 20
  • Agricultural 39
  • Metals 21 (includes two crypto futures)

So for a given portfolio we fit those instruments in the same asset class together.
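
As a rough illustration of the grouping step (not the code used for these results), here is a minimal Python sketch. The `ASSET_CLASS` mapping and the function name `pooling_groups` are hypothetical; the real mapping lives in my own configuration. For each instrument in a portfolio it lists the instruments whose returns would be pooled with it.

```python
from collections import defaultdict

# Hypothetical mapping from instrument code to asset class label.
ASSET_CLASS = {
    "US10": "Bond",
    "US2": "Bond",
    "BUND": "Bond",
    "SP500_micro": "Equity",
    "DAX": "Equity",
    "CORN": "Ags",
}


def pooling_groups(instruments, asset_class=ASSET_CLASS):
    """Return {instrument: instruments in the same asset class, itself included}."""
    members = defaultdict(list)
    for code in instruments:
        members[asset_class[code]].append(code)
    # Only instruments present in this portfolio end up in a pool.
    return {code: sorted(members[asset_class[code]]) for code in instruments}


groups = pooling_groups(["US10", "DAX", "US2", "CORN", "SP500_micro"])
for code, pool in groups.items():
    print(code, "is fitted on returns from", pool)
```

Note that BUND is in the mapping but not in the example portfolio, so it does not appear in any pool.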


Grouping by weights

Well this is easy, as I already did this here and under the heading "get instrument groupings" it tells us we can use k-means clustering and there is even some code there for me to copy and paste. An important difference between this and the grouping by SR vector is that correlations will also be taken into account, at least in an implicit way.

One open question that remains is whether the grouping is done on portfolio weights that have been derived using a shrinkage method, or on weights that haven't (just using naive mean variance). I felt it was better to use 'purer' weights which hadn't used shrinkage so we don't end up discarding useful differences.
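
To make the idea concrete, here is a dependency-free sketch of k-means (plain Lloyd's algorithm) applied to vectors of forecast weights. This is not the code from the earlier post, which I'm not reproducing here; the weight numbers, the three-rule setup and the function names are all made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)  # start from k of the observed vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centre for each vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centres[c]))
            for v in vectors
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical unshrunk forecast weights over three rules, per instrument.
weights = {
    "US2": [0.70, 0.20, 0.10],
    "US10": [0.65, 0.25, 0.10],
    "CORN": [0.10, 0.30, 0.60],
    "WHEAT": [0.15, 0.25, 0.60],
}
labels = kmeans(list(weights.values()), k=2)
clusters = dict(zip(weights, labels))
print(clusters)
```

Instruments sharing a label would then have their returns pooled before the (shrunk) weights are fitted.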

As I did in my prior post in this series, let's run the grouping exercise on my entire portfolio. Partly for laughs, and partly to see if the grouping makes sense. How many groups/clusters should we use? Well there are 7 substantive asset classes, excluding vol, so let's ask for 7 clusters:

Cluster 0, length 16

BTP3, CANOLA, EU-FOOD, FED, GBPCHF, GOLD_micro, HIGHYIELD, LIVECOW, OMX, R1000, SILVER, SP500_micro, US-INDUSTRY, WHEAT, YENEUR, ZAR

Ags: 3, Metals: 2, Equity: 3, FX: 3, Sector: 2, Bond: 3


Cluster 1, length 29

BRENT_W, CNH, COAL-GEORDIE, COCOA, COFFEE, COTTON, ETHER-micro, EU-AUTO, EU-DJ-TELECOM, EU-DJ-UTIL, EU-MEDIA, EU-REALESTATE, EU-TECH, FTSETAIWAN, GBP, HEATOIL, KOSPI_mini, MILK, MSCIEAFA, MSCISING, OATIES, OJ, RUBBER, SEK, SMI, SONIA3, US-DISCRETE, US2, VNKI

OilGas: 3, Ags: 7, Metals: 1, Equity: 5, FX: 3, Sector: 7, Vol: 1, Bond: 2


Cluster 2, length 20

CAD10, CH10, CHF, CHFJPY, COPPER-micro, CZK, FTSECHINAA, FTSEINDO, IRON, JGB, JGB-SGX-mini, JP-REALESTATE, MUMMY, NIKKEI, SGX, SOYBEAN_mini, SOYOIL, TOPIX, US-ENERGY, US-HEALTH

Ags: 2, Metals: 2, Equity: 6, FX: 3, Sector: 3, Bond: 4


Cluster 3, length 12

AUD_micro, EU-INSURE, FTSE100, FTSECHINAH, GASOIL, HANG_mini, HOUSE-US, NASDAQ_micro, NOK, NZD, US-STAPLES, US-TECH

Sector: 4, Equity: 4, FX: 3, OilGas: 1


Cluster 4, length 16

ALUMINIUM, AUDJPY, BITCOIN, BOBL, BONO, BUND, BUXL, CORN, DOW, GBPJPY, MILKDRY, OAT, RICE, ROBUSTA, SHATZ, STEEL

Ags: 4, Metals: 3, Equity: 1, FX: 2, Bond: 6


Cluster 5, length 79

BB3M, BBCOMM, BRE, BTP, BUTTER, CAD, CHEESE, CHINAA-CON, COAL, COCOA_LDN, COPPER_LME, COTTON2, CRUDE_ICE, CRUDE_W_micro, DJSTX-SMALL, DX, EU-BANKS, EU-CHEM, EU-CONSTRUCTION, EU-MID, EU-OIL, EU-TRAVEL, EURCAD, EURCHF, EURIBOR-ICE, EUROSTX, EUROSTX-SMALL, EUR_micro, FANG, FEEDCOW, FTSE250, GAS-PEN, GASOILINE, GAS_US_mini, GICS, GILT, HANGENT_mini, IBEX_mini, IG, INR, IRS, JPY, KOSDAQ, KR10, KR3, LEAD_LME, LEANHOG, LUMBER-new, MIB, MILKWET, MILLWHEAT, MSCIEMASIA, MSCITAIWAN, MSCIWORLD, MXP, NICKEL_LME, PALLAD, PLAT, REDWHEAT, SARONA, SGD, SMI-MID, SOFR, SOYMEAL, SP400, SPI200, SUGAR11, SUGAR16, SUGAR_WHITE, TIN_LME, TWD, US10, US20, US5, V2X, VIX_mini, WHEAT_ICE, WHEY, ZINC_LME

OilGas: 6, Ags: 18, Metals: 7, Equity: 16, FX: 12, Sector: 6, Vol: 2, Bond: 12


Cluster 6, length 32

AEX, BOVESPA, BRENT-LAST, CAC, CAD2, CAD5, CLP, DAX, EU-BASIC, EU-DIV30, EU-HEALTH, EU-HOUSE, EU-RETAIL, EUA, EURAUD, EURO600, FTSEVIET, GBPEUR, HANGTECH, KRWUSD_mini, MSCIASIA, PLN, RUSSELL, SWISSLEAD, US-FINANCE, US-MATERIAL, US-PROPERTY, US-REALESTATE, US-UTILS, US10U, US3, US30

OilGas: 2, Equity: 11, FX: 5, Sector: 9, Bond: 5

There doesn't seem much congruency with asset classes there. Is this "the data speaking to us", or are we just data mining with a very sharp spade? Let's find out.


Testing

In my older post, here, I did a rather simplistic 'one shot' test on a subset of my available instruments and forecast rules (albeit on a rolling out of sample basis). But I have a rather more exhaustive way of doing things I've been using in this series. 

I cycle through different in sample lengths (5 years, 10 years, 20 years) and out of sample lengths (1 year and 5 years). For shorter time periods that will allow me to subsample different historic periods. For speed and to get some alternative paths I'm not going to consider all the instruments. Instead I will randomly subsample 50 instruments out of the 214 available.
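
The experiment grid can be sketched in a few lines of Python. The instrument codes below are placeholders and the function names are hypothetical; the sketch only shows the six in sample / out of sample combinations and the sampling without replacement, not the choice of historic period.

```python
import itertools
import random

IN_SAMPLE_YEARS = (5, 10, 20)
OUT_OF_SAMPLE_YEARS = (1, 5)
SUBSAMPLE_SIZE = 50


def experiment_configs():
    """All (in sample years, out of sample years) combinations."""
    return list(itertools.product(IN_SAMPLE_YEARS, OUT_OF_SAMPLE_YEARS))


def draw_instruments(universe, rng, size=SUBSAMPLE_SIZE):
    """Random subsample of instruments, without replacement."""
    return rng.sample(universe, size)


universe = [f"INSTRUMENT_{i}" for i in range(214)]  # placeholder codes
rng = random.Random(42)
for in_years, out_years in experiment_configs():
    chosen = draw_instruments(universe, rng)
    print(in_years, out_years, chosen[:3])
```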

Then for a given set of returns I will either use fully pooled, asset class pooled, portfolio weight pooled, or unpooled returns. Then I will optimise for each instrument based on the relevant returns, using the shrinkage method with SR shrinkage of 0.5 and correlation shrinkage of 0.75. Finally I will take the out of sample SR of a portfolio that equally weights the 50 instruments.
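
For readers who haven't seen the earlier posts, here is a hedged sketch of what shrinking the inputs can look like. It assumes that SR estimates are shrunk towards their cross-sectional average and off-diagonal correlations towards their average, with a shrinkage of 0 leaving the estimate alone and 1 replacing it with the target. Those targets are assumptions made for this illustration, and the function names are hypothetical.

```python
def shrink_towards(value, target, shrinkage):
    """shrinkage=0 keeps the estimate, shrinkage=1 returns the target."""
    return (1 - shrinkage) * value + shrinkage * target


def shrink_sharpe_ratios(sr_estimates, shrinkage=0.5):
    """Shrink each SR towards the average SR (assumed target)."""
    target = sum(sr_estimates) / len(sr_estimates)
    return [shrink_towards(sr, target, shrinkage) for sr in sr_estimates]


def shrink_correlations(corr, shrinkage=0.75):
    """Shrink off-diagonal entries towards their average (assumed target)."""
    n = len(corr)
    off_diag = [corr[i][j] for i in range(n) for j in range(n) if i != j]
    target = sum(off_diag) / len(off_diag)
    return [
        [1.0 if i == j else shrink_towards(corr[i][j], target, shrinkage)
         for j in range(n)]
        for i in range(n)
    ]


sr = shrink_sharpe_ratios([0.2, 0.6, 1.0])
corr = shrink_correlations([[1.0, 0.9, 0.1], [0.9, 1.0, 0.5], [0.1, 0.5, 1.0]])
print(sr)
print(corr)
```

The shrunk SR vector and correlation matrix would then be fed into the optimiser in place of the raw estimates.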

In previous posts I've discussed a more honest way of backtesting, where we include the opposite of a given trading rule to avoid implicit fitting; and then only bring positive SR rules into the optimisation. All the results here will use that methodology exclusively.
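
A minimal sketch of that selection step, assuming that before costs the opposite of a rule simply earns the negative of the rule's returns. The rule name `mr5`, the return numbers and the function names are made up for illustration.

```python
import statistics


def sharpe_ratio(returns):
    """Per-period SR (not annualised): mean over standard deviation."""
    return statistics.mean(returns) / statistics.stdev(returns)


def add_mirrored_rules(rule_returns):
    """For every rule add its opposite, whose pre-cost returns are negated."""
    expanded = dict(rule_returns)
    for name, returns in rule_returns.items():
        expanded[f"{name}_opposite"] = [-r for r in returns]
    return expanded


def positive_sr_rules(in_sample_returns):
    """Keep only rules (original or mirrored) with positive in sample SR."""
    candidates = add_mirrored_rules(in_sample_returns)
    return {n: r for n, r in candidates.items() if sharpe_ratio(r) > 0}


in_sample = {
    "carry10": [0.02, -0.01, 0.03, 0.01],
    "mr5": [-0.02, 0.01, -0.03, -0.01],  # hypothetical losing rule
}
selected = positive_sr_rules(in_sample)
print(sorted(selected))
```

The point is that the sign of each rule is chosen using in sample data only, rather than being quietly fixed in advance with the benefit of hindsight.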


5 years in sample, 1 year out of sample

You should hopefully recognise this format from before. Each row is a fitting option. The first column shows the median SR across the many, many runs of random resampling. The second column shows the t-test p-value from comparing the best option with the others. NaN means this is the best option. A low number in this column, say below 0.01 or 0.05, indicates that the best option is statistically significantly better than the other option.

                            SR  pvalue
unpooled                 0.046     0.0
all pooled               0.425     0.0
asset class pooled       0.557     NaN
weight distanced pooled  0.184     0.0

That is ... pleasing. The least robust method is worst. More robust methods do better. And we get a significant improvement from pooling within asset classes. OK the portfolio weight distancing isn't so good, but we haven't got huge amounts of data to form our portfolio weights with so maybe they are a little unstable.
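
For reference, the two columns in tables like the one above could be computed roughly as follows. This is a sketch under stated assumptions: it uses Welch's unequal-variance t statistic with a normal approximation for the p-value (reasonable only with many runs) and a two-sided test, none of which is necessarily what produced the numbers in this post. The method names and SR values are made up.

```python
import math
import statistics


def welch_p_value(best, other):
    """Two-sided p-value for a difference in means, normal approximation."""
    se = math.sqrt(
        statistics.variance(best) / len(best)
        + statistics.variance(other) / len(other)
    )
    t_stat = (statistics.mean(best) - statistics.mean(other)) / se
    return 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))


def summarise(results):
    """results: {method: out of sample SRs}. Returns {method: (median, pvalue)}."""
    medians = {m: statistics.median(srs) for m, srs in results.items()}
    best = max(medians, key=medians.get)  # best option = highest median SR
    summary = {}
    for method, srs in results.items():
        if method == best:
            p_value = float("nan")  # nothing to compare the best against
        else:
            p_value = welch_p_value(results[best], srs)
        summary[method] = (medians[method], p_value)
    return summary


results = {
    "method_a": [0.5, 0.6, 0.7, 0.55, 0.65],  # hypothetical run results
    "method_b": [0.1, 0.0, 0.2, 0.05, 0.15],
}
summary = summarise(results)
print(summary)
```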


5 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.418     0.0
all pooled               0.391     0.0
asset class pooled       0.731     NaN
weight distanced pooled  0.376     0.0

Unpooled does a little better here, but asset classes are still the way to go.

10 years in sample, 1 year out of sample

                            SR      pvalue
unpooled                 0.664     NaN
all pooled               0.417   0.000
asset class pooled       0.635   0.127
weight distanced pooled  0.415   0.000

OK interestingly unpooled is making a comeback, but it still isn't significantly better than asset class pooled.

10 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.855     0.0
all pooled               0.575     0.0
asset class pooled       1.028     NaN
weight distanced pooled  0.559     0.0

Asset class is again asserting its dominance with unpooled a close second.

20 years in sample, 1 year out of sample

                            SR  pvalue
unpooled                 0.571     0.0
all pooled               0.469     0.0
asset class pooled       1.127     NaN
weight distanced pooled  0.410     0.0

OK this is getting a bit silly. I feel like the dad whose kid at sports day is winning everything, proud but also getting a little embarrassed. "Now come on Jonny, let one of the other kids win the next one".

Interestingly it does seem with more data that unpooled is the way to go for a second option.


20 years in sample, 5 years out of sample

                            SR  pvalue
unpooled                 0.251     0.0
all pooled               0.569     NaN
asset class pooled       0.551     0.0
weight distanced pooled  0.553     0.0

"Well done Jonny. Everyone knows you could have won it if you wanted but it's good to show good sportsmanship."

So all pooled finally gets its day in the sun, albeit with a slim advantage over the other two pooled methods. Bear in mind only 65 instruments have sufficient history here, with only 18 having two distinct blocks of 25 years, so there won't be much genuine variation if we choose 50. So this could be a fluke. Jonny will tell you that it is.


But Rob, what about averaging?

At this point, given the choice between the complexity of weight distancing, and the simplicity and efficiency of asset class pooling, I'm inclined to go with the latter. And it's what we were doing at AHL all those years ago (not because of empirical evidence but because it suited the organisational structure...).

However there is another option which I talked about in the original asset class pooling post, using a blend. Here we take an average of the portfolio weights selected with different methodologies. So that would be an average of:

  • Unpooled
  • All pooled
  • Asset class pooled
Blending weights in this way is a way to improve robustness. It's arguably the correct thing to do, since otherwise we'd be making an in sample choice of methodology - 'meta implicit fitting' if you will. One of my favourite research shops, ReSolve Asset Management, are very keen on doing this. One potential downside is it might be producing 'over robustness' given we're using weights that have already had shrinkage. But let's find out.
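
Mechanically the blend is just an equal-weighted average of the forecast weights from each method, for each instrument. A minimal sketch, with hypothetical weights and a hypothetical function name; a rule missing from one method is treated as having zero weight there, which is an assumption of this sketch.

```python
def blend_weights(weight_sets):
    """Equal-weighted average of several {rule: weight} dicts for one instrument."""
    rules = sorted(set().union(*weight_sets))
    blended = {
        rule: sum(w.get(rule, 0.0) for w in weight_sets) / len(weight_sets)
        for rule in rules
    }
    # Renormalise so the blended weights still sum to one.
    total = sum(blended.values())
    return {rule: weight / total for rule, weight in blended.items()}


# Hypothetical forecast weights for one instrument under each method.
unpooled = {"carry10": 0.6, "carry20": 0.4}
all_pooled = {"carry10": 0.3, "carry20": 0.7}
asset_class_pooled = {"carry10": 0.45, "carry20": 0.55}

blended = blend_weights([unpooled, all_pooled, asset_class_pooled])
print(blended)
```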


5 years in sample, 1 year out of sample

Note these numbers won't be exactly the same as those above, since they're a different set of random experiments. They would eventually converge but it would take millions of runs.

And also, just for fun, I've added an extra column. I started off this series of posts talking about the importance of considering other points of the distribution but I've quietly dropped that and only been quoting the median. For balance then, I've added the 25% SR point as well as the median. The pvalue is as before.

                    SR median  SR 25%  pvalue
unpooled                0.025  -0.569     0.0
all pooled              0.475  -0.147     0.0
asset class pooled      0.620   0.045     0.0
average                 0.650  -0.067     NaN

Averaging is the winner - just - but asset class is better at the more conservative point.
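
The extra column is simply the lower quartile of the out of sample SR across runs. As a sketch, using Python's standard library and assuming the 'inclusive' interpolation convention (other conventions give slightly different numbers on small samples); the function name and data are made up.

```python
import statistics


def median_and_lower_quartile(sharpe_ratios):
    """Return (median, 25% point) of the SR distribution across runs."""
    quartiles = statistics.quantiles(sharpe_ratios, n=4, method="inclusive")
    return quartiles[1], quartiles[0]


run_results = [3.0, 0.0, 4.0, 1.0, 2.0]  # hypothetical SRs from five runs
median_sr, lower_quartile_sr = median_and_lower_quartile(run_results)
print(median_sr, lower_quartile_sr)
```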

5 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.428   0.164     0.0
all pooled              0.430   0.243     0.0
asset class pooled      0.771   0.537     NaN
average                 0.664   0.430     0.0

A clear win for asset class pooled here. Averaging suffers from its association with the worse performing unpooled / all pooled.

10 years in sample, 1 year out of sample

                    SR median  SR 25%  pvalue
unpooled                0.681   0.034     0.0
all pooled              0.452   0.067     0.0
asset class pooled      0.664   0.167     0.0
average                 0.935   0.256     NaN

This time averaging takes the win, helped by the good performance of unpooled.

10 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.887   0.641     0.0
all pooled              0.570   0.406     0.0
asset class pooled      1.066   0.833     NaN
average                 1.005   0.780     0.0

Asset class pooled is still the winner, but averaging does a good job.

20 years in sample, 1 year out of sample

                    SR median  SR 25%  pvalue
unpooled                0.538  -0.209     0.0
all pooled              0.515   0.154     0.0
asset class pooled      1.167   0.539     NaN
average                 0.809   0.252     0.0

Asset class pooled wins by more of a margin now.

20 years in sample, 5 years out of sample

                    SR median  SR 25%  pvalue
unpooled                0.243   0.084     0.0
all pooled              0.531   0.414     NaN
asset class pooled      0.513   0.391     0.0
average                 0.461   0.355     0.0

As before 'all pooled' is the winner, whilst average is dragged down by the poor performance of unpooled. But as I said above, with these longer periods it's hard to know if it's just down to flukey instrument selection.

What to do...

There is enough evidence above to justify asset class pooling as the dominant choice. But equally, I don't think there is enough to discard averaging. And there is something so neat about averaging. We combine three quite disparate sources of data together, so we're protected if one of them doesn't work out. It's robustness writ large! It can be justified without any in sample fitting - whereas one could argue that the selection of asset class pooling is an implicit in sample 'meta parameter' choice.

I think we're now (finally) ready to fit our forecast weights, and with costs. This is exciting for me, as whatever comes out I will be using as my new weights. This will be more of a 'literature review' since I've talked about optimising with costs in some detail and at some length before.
