This Blog is Systematic: June 2026

Monday, 29 June 2026

Rolling, rolling, rolling.... updating statistical estimates yes or no

The mega blog post series on portfolio optimisation continues!

A couple of posts ago, here, I looked at using the idea of formal testing for structural breaks in parameter estimates. Important parameters like Sharpe Ratio (SR). Because stuff like this happens:

This is the pre-cost performance of the momentum4 rule on CORN. The formal test found a structural break in 1989.

It's fair to say the structural break stuff didn't work that well. But there may be a much easier way of dealing with the non stationarity of these estimates, and that's to use rolling estimates. For example, if you were to use a 10 year rolling estimate of SR then by the mid 1990s we would conclude that this was a money losing rule. We could also use a rolling estimate for correlation, though as these are stable enough over periods of five years or more this wouldn't affect things much.

Of course I wouldn't be so crude as to use a mere rolling window, instead I'd use an exponential window. As usual I'm going to specify this using the span parameter of the pandas ewm function. A 10 year span has a 3.5 year half life; i.e. the 10 year EWM is roughly equivalent (same halflife) to a 7 year simple moving average.
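The span to half-life conversion is easy to check. This is a minimal sketch which assumes the usual pandas convention that alpha = 2 / (span + 1), with the span measured in years; the function name is made up for illustration.

```python
import math

def ewm_halflife(span):
    """Half-life implied by an EWM span, in the same units as the span."""
    alpha = 2.0 / (span + 1.0)  # pandas convention for the span parameter
    return math.log(0.5) / math.log(1.0 - alpha)

for span_years in (5, 10, 20, 30):
    print(span_years, round(ewm_halflife(span_years), 2))
```

Doubling the half-life gives the length of the simple moving average with roughly the same half-life, which is where the 7 year figure comes from.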

The test

Regular readers will know exactly what to expect here, but for those that aren't regular here is how I test this procedure to be as sure as possible there is no luck involved.

Select 10,20,30 or 40 years of in sample data (shorter periods won't make sense to apply an exponentially weighted [EW] estimation)
Select 1 or 5 years of out of sample data
Pick a random instrument, ensuring there is enough history available (between 11 and 45 years). We will only choose from instruments with sufficient history for the time required.
Randomly pick N=9 forecasting rules from those available (the same as in previous posts)

Then for each of those subsamples:

Cycle through using an exponentially weighted [EW] span of 5,10,20,30 years; and no span (use all available history). For shorter in sample periods the EW results using longer spans will be very similar to those without EW estimation.
Estimate SR using the EW span.
Estimate correlation using all the in sample data (we could use an EW span here, but correlations are sufficiently stable that it won't unduly affect results).
Use fixed shrinkage levels (estimated here): SR shrinkage 0.5, correlation 0.75 (since we'll always have at least five years of in sample data we don't need to worry about the higher levels of shrinkage required when we have insufficient data). The results won't be much different with any vaguely similar shrinkage; you could argue we'd need more shrinkage with shorter EW spans but I am not going to test this.
Run in sample optimisation and out of sample optimisation on all the options above
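As a rough illustration of the SR estimation step, here is a sketch of an exponentially weighted SR estimate using the pandas ewm function. It is an assumption on my part that the estimate is the last EW mean divided by the last EW standard deviation, annualised with 256 business days a year; the function name and the random returns are made up, and no shrinkage is applied here.

```python
import numpy as np
import pandas as pd

PERIODS_PER_YEAR = 256  # assumed number of business days in a year

def ew_sharpe_ratio(daily_returns, span_years=None):
    """Annualised SR from the last EW mean and EW standard deviation.
    span_years=None uses all the history with equal weights."""
    if span_years is None:
        mean, std = daily_returns.mean(), daily_returns.std()
    else:
        ewm = daily_returns.ewm(span=span_years * PERIODS_PER_YEAR)
        mean, std = ewm.mean().iloc[-1], ewm.std().iloc[-1]
    return np.sqrt(PERIODS_PER_YEAR) * mean / std

rng = np.random.default_rng(0)
# 20 years of made-up daily returns with a true annualised SR of 0.5
fake_returns = pd.Series(rng.normal(0.5 / 16, 1.0, 20 * PERIODS_PER_YEAR))
for span in (5, 10, 20, 30, None):
    print(span, round(float(ew_sharpe_ratio(fake_returns, span)), 3))
```

With only 20 years of history the 20 and 30 year spans should give numbers close to the full history estimate, which is the point made in the first step of the list above.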

Finally once we have all our subsamples:

Get the median SR from the distribution of subsamples
Find the optimal EWM span with the highest median SR
Test to see if that optimum is significantly higher than the others

As I also did in my last post I'm going to see if the figures are different without any implicit fitting. To achieve this I include the opposite of a given trading rule as a candidate; and then when I come to do optimisation I pick the version that has a positive SR (there are no costs, so the SR will be identical with a negative sign).
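A minimal sketch of that selection step, assuming we simply flip the sign of a rule's returns whenever its in sample mean return (and hence SR) is negative. The function name and the tiny return series are hypothetical.

```python
import numpy as np

def pick_positive_version(in_sample, out_of_sample):
    """Use the opposite rule if the original has a negative in sample SR."""
    sign = -1.0 if np.mean(in_sample) < 0 else 1.0
    # With no costs, the opposite rule's p&l is just -1 times the original
    return sign * np.asarray(in_sample), sign * np.asarray(out_of_sample)

# A made-up rule that loses money in sample, so the opposite rule is chosen
in_sample = np.array([-0.02, 0.01, -0.03, -0.01])
out_of_sample = np.array([0.01, -0.02, 0.015])
chosen_in, chosen_out = pick_positive_version(in_sample, out_of_sample)
print(chosen_in.mean() > 0, chosen_out)
```

The important detail is that the sign is chosen using in sample data only, and then applied unchanged to the out of sample period.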

10 years in sample, one year out of sample

We only have five options to consider so we can do this in a simple table.

         SR  pvalue
5       0.026   0.065
10      0.021   0.026
20      0.033     NaN
30      0.030   0.042
999999  0.017   0.131

Each row is a different EW span. '999999' means the entire in sample period was used. The first column is the out of sample Sharpe Ratio for each option. The second column is the p-value for a test of the optimal option against the relevant option. NaN marks the optimal option, and lower values (say below 0.05) mean the optimal option is significantly better than that alternative. We can see that a 20 year span is the optimal, and it's a little better than the other alternatives but not significantly better than the entire in sample period.

Do the results differ when we don't preselect only the 'correct' rules?


           SR  pvalue
5       0.076   0.188
10      0.093     NaN
20      0.085   0.186
30      0.087   0.169
999999  0.083   0.236

Nothing is really significant there.

10 years in sample, five years out of sample


           SR  pvalue
5       0.201   0.009
10      0.215     NaN
20      0.214   0.078
30      0.211   0.277
999999  0.199   0.271

          SR  pvalue
5       0.093   0.000
10      0.112     NaN
20      0.104   0.339
30      0.104   0.466
999999  0.110   0.533

Here we do get better performance with anything more than 5 years.

20 years in sample, one year out of sample

          SR  pvalue
5      -0.154   0.892
10     -0.170   0.687
20     -0.163   0.610
30     -0.164   0.647
999999 -0.138     NaN

          SR  pvalue
5      -0.202   0.006
10     -0.135   0.011
20     -0.110   0.029
30     -0.097   0.059
999999 -0.055     NaN

Longer estimates are better.

20 years in sample, five years out of sample


           SR  pvalue
5       0.093   0.000
10      0.119   0.000
20      0.134   0.026
30      0.132   0.043
999999  0.149     NaN

          SR  pvalue
5       0.061   0.000
10      0.111   0.003
20      0.129   0.070
30      0.131   0.071
999999  0.140     NaN


Yes, longer estimates are better.

30 years in sample, one year out of sample


           SR  pvalue
5      -0.073     0.0
10     -0.078     0.0
20     -0.031     0.0
30     -0.005     0.0
999999  0.042     NaN

           SR  pvalue
5      -0.132   0.007
10     -0.152   0.000
20     -0.156   0.000
30     -0.104   0.000
999999  0.000     NaN

Same story, slightly different numbers.

30 years in sample, five years out of sample

          SR  pvalue
5      -0.038     0.0
10     -0.031     0.0
20     -0.017     0.0
30     -0.006     0.0
999999  0.008     NaN


           SR  pvalue
5      -0.050   0.001
10     -0.046   0.003
20     -0.038   0.003
30     -0.029   0.003
999999 -0.017     NaN

40 years in sample, one year out of sample


          SR  pvalue
5      -0.135   0.145
10     -0.113   0.023
20     -0.110   0.001
30     -0.090     NaN
999999 -0.150   0.947

Again, we basically want a very long estimate.

          SR  pvalue
5      -0.159     NaN
10     -0.224   0.044
20     -0.236   0.028
30     -0.238   0.072
999999 -0.254   0.061

That was a little unexpected.

40 years in sample, five years out of sample


           SR  pvalue
5      -0.049     0.0
10     -0.040     0.0
20     -0.029     0.0
30     -0.021     0.0
999999 -0.015     NaN


           SR  pvalue
5      -0.097   0.000
10     -0.064   0.000
20     -0.016   0.006
30     -0.006     NaN
999999 -0.029   0.703

Conclusion

I'm a big believer in publishing (well blogging) research even if it doesn't result in a positive result. And certainly it looks like you don't really gain anything from using exponentially weighted estimates of Sharpe Ratios for optimisation, versus the simpler alternative of using all the data. Still there is that nagging feeling that we should at least have the option of dropping something that hasn't worked for a while which implies a very slow EWM. A 30 year EWM span has a 10.3 year halflife, the same as a 20 year or so SMA; whilst a 40 year EWM span is equivalent to a 28 year SMA.

One of These Things (Is Not Like the Others). Or is it? Pooling rule p&l estimates across instruments.

This is the eighth post in a series I'm writing on portfolio optimisation. I haven't done one of these for a few posts, so here is the story so far:

In the first post I showed that if you are optimising across forecasts from different trading rules and instruments, then you should first fit within; and then across, instruments. As I do anyway.
In my second post I ran some experiments with optimising with random data. The results showed a supreme indifference between joint winners: Monte Carlo and bootstrapping, and a shrinkage methodology with a tiny bit of SR shrinkage.
For post three I showed that the predictability of sampling distribution of parameter estimates was much worse with real data than with random data.
Number four, saw me rerun post #2 with real data. A middle ground of some shrinkage was the winner across most time periods; unless the in sample period was too short.
Post number five was a twist on Bayesian shrinkage and an abject failure.
Number six was about clustering. It turns out clustering is good, though not staggeringly so. A cluster size of six was arbitrarily chosen as being reasonable enough.
The post whose number is seven was about structural breaks in estimates of SR. For forecast/instrument pairings these occur between 13% and half the time depending on the critical value used. However out of sample optimisation did not show a significant SR improvement if only history after a break was used for estimation.

This post has something in common with number seven. We know that we want more data for estimation. We can get more data with more frequent data (not helpful if you are trading slowly) or more history of data. For more history we only want history that is relevant. If the history occurred before a structural break then we should throw it away (at least in theory - that didn't work so well in practice).

There is another way to get more data and that is in the cross section. For example, we could combine the p&l for the trading rule momentum16 across all the US government bond futures, or the entire fixed income sector. Alternatively we could go the whole hog and combine all the p&l across every future (all this is ignoring costs, which are different for each instrument, and would have to be deducted ex-post this exercise).

A key question however when doing this is 'are these things similar enough'. For example, should we use the same SR estimate for momentum16 across all instruments? Or can we say that a particular instrument 'just doesn't trend well'? Consider Cocoa, the stand out star of many trend following portfolios in 2024. Here is its performance using the momentum16 trading rule:

Is it wise to assume that a particular instrument is poor at trending, just because it has been historically (or at least from 1980 to 2020)?

Note: this isn't the same as assuming a particular instrument is poor - that decision will come when we look at instrument weights.

Some of you might remember this plot from my third book, Leveraged Trading:

That shows the SR for that same trading rule, momentum16, with error bars across the data set used in the book. The error bars suggest that there aren't any SR that are statistically different from each other, so we should pool everything.

If I recalculate these numbers using the 214 instruments in my current data set, excluding duplicates, then the estimated SR trading momentum16 varies across instruments between -1.85 and 1.95. That range is, to be fair, larger than in the plot above.

If we test the SR difference between the samples of the worst and best instrument (Lead-LME and House-US FWIW) we do get a significant t-statistic (around 4 to 5, depending on whether we only test on the overlapping period or do an independent test of all the data from both instruments). So they are significantly different. It's even more impressive when you consider there are only 3 or 5 years of data used to make this evaluation (again either the overlapping or the entire time period). The p-value is around 0.003%. Much less than the p-value we would crudely expect to see by fluke with 214 instruments (which would be a little below 0.5%).

It's also worth noting that this is quite a wide range of p&l for a trading rule: from -1.85 to 1.94 is a range of 3.8. For momentum64 it's from -1.0 to +1.35; and for carry10 it's from -0.86 to 1.46. The wider ranges of SR seem to be only present for the faster trend style rules; a range of 3.1 SR units for momentum4 (from -1.0 to +2.1 pre costs), 2.8 for breakout10, 3.3 for relmomentum10 and so on. Some of those more extreme figures are because some instruments don't have as much data (in the box plot above, they'd have wider boxes).

But could this just be luck? What kind of SR range would you expect across 214 random Gaussian p&l streams with an underlying expected SR of 0.19 and say 4 years of data? The answer is about 2.68; with a 90% range of 2.44 to 3.11. Another way of looking at a distribution is the cross sectional standard deviation: across 214 random instruments the standard deviation of SR estimates with 4 years of data is around 0.5 (and that is much more stable than the range).
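Here is a sketch of that random data experiment. It assumes independent Gaussian daily returns with unit volatility and 256 business days a year, so the exact figures will differ a little from those quoted depending on the seed and the number of trials; the function name and defaults are made up.

```python
import numpy as np

def random_sr_spread(n_instruments=214, true_sr=0.19, years=4,
                     n_trials=200, periods_per_year=256, seed=0):
    """Range and cross sectional std of SR estimates from pure noise."""
    rng = np.random.default_rng(seed)
    n_periods = years * periods_per_year
    daily_mean = true_sr / np.sqrt(periods_per_year)  # unit daily volatility
    ranges, stdevs = [], []
    for _ in range(n_trials):
        returns = rng.normal(daily_mean, 1.0, size=(n_periods, n_instruments))
        sr = returns.mean(axis=0) / returns.std(axis=0) * np.sqrt(periods_per_year)
        ranges.append(sr.max() - sr.min())
        stdevs.append(sr.std())
    return np.array(ranges), np.array(stdevs)

ranges, stdevs = random_sr_spread()
print("median range:", np.median(ranges))
print("90% interval for range:", np.percentile(ranges, [5, 95]))
print("average cross sectional std:", stdevs.mean())
```

The standard deviation is the easier one to sanity check: the sampling error of an SR estimate is roughly the square root of one over the number of years, which is 0.5 for 4 years of data.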

Now, if we subsample random 4 year periods for each instrument across the actual performance of momentum16 we get much bigger numbers. The SR range is around 3.88, and the standard deviation comes in around 0.59. So we can see that the real data has around a fifth more variability than what we'd expect purely from randomness.

Those four year cross sectional standard deviations come in at similar values for all trading rules; the lowest is 0.53 (momentum64) and almost all are between 0.5 and 0.6 with one standout exception: a value of 0.866 for mrinasset1000 (a slow cross sectional mean reversion within asset classes).

All this arseing around is just to confirm that there is indeed more cross sectional variability in SR estimates than luck would suggest. Which means it is possible that a full pooling approach (use all data to estimate everything) might be beatable by another more selective pooling approach.

Some selective pooling approaches

There are a number of ways we could approach this exercise:

We could cluster things together that have similarities, such as by asset class. I have done this in the past but it is slightly unsatisfactory to use a method which could have subjectivity and can't 'learn' from the data. Let's ignore that for now.
We could do it on an estimate by estimate basis. We could compare the distribution of returns for say carry10 on US10 year bonds, and on US2 year bonds; and say "Well these distributions aren't significantly different. Let's pool the returns together".
Or we could look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments.
Or we could do it on portfolio weights. We could for example fit the weights for 2 year bonds, and for 10 year bonds, and then see if they were significantly different. We could then pool the returns if they were not that different. In fact I have looked at this before*.

* Note: we could also take an average of the weights, but that risks 'over robustness' if the weights were already produced using some kind of shrinking technology; i.e. you would end up averaging the weights too much.

An advantage of the second approach is we don't need to worry if instruments have different sets of trading rules available to them. Another is that it's obvious what significantly different means - we can just run plain vanilla t-tests like we did with the previous post on structural breaks at some critical value (CV). A disadvantage is that we need to do a lot more computation (one set for every rule separately); yet another is that because we are testing so many things the danger of finding false positives is greatly increased. The standard response to this is to reduce the CV but then we're likely to miss some real differences. Yes, the classic dilemma of statistical testing.

For the third and fourth approaches we're basically saying "is this instrument like this other instrument". That radically reduces the set of comparisons we have to do (just comparing instruments with instruments). The disadvantage is that it's harder to compare instruments with different rules; though not impossible. If the rule sets are similar enough then we can do our comparisons with a zero weight to the missing rules; and then just add back the rules that aren't in the shared dataset. Then we have the calibration of significance. We can use random data, like I did above, to work out the likelihood of a particular distance between estimates or weights happening by pure fluke.

It does get difficult however to incorporate missing or additional forecasts when we are comparing weights. But that's straightforward with the third method. If some rules are missing from the pooled returns, then we can just calculate the additional statistics we need using returns only for that instrument.

Note that all four approaches can incorporate different cost levels. We just do everything in pre-cost world, and then as a final step deduct costs from the gross returns before optimising a given instrument. Note we couldn't do that as easily if we were averaging weights for the fourth approach.

As I have looked at using weights before, I'm going to look at SR based grouping in this post - approach #3. So to be clear what I am doing is:

"Look at the estimates of SR across different rules. You could have a vector of the SR for carry10, carry20 and so on. And you'd look at that vector of estimates, and calculate <some measure of> distance between them, and if the distance is low enough, then you'd pool the returns for all the rules for those two instruments."

Calibrating the threshold

This then is the sort of thing we are looking at:

                 US10    US5  SP500
breakout10      -0.02   0.10  -0.43
breakout20       0.23   0.28  -0.08
breakout40       0.36   0.39   0.07
breakout80       0.35   0.38   0.25
breakout160      0.28   0.30   0.34
breakout320      0.30   0.41   0.37
relmomentum10    0.09   0.23  -0.15
relmomentum20   -0.02   0.11  -0.10
relmomentum40    0.12   0.13  -0.14
relmomentum80    0.17   0.26  -0.13
mrinasset1000   -0.06  -0.15  -0.49
carry10          0.51   0.50   0.10
carry30          0.53   0.50   0.10
carry60          0.53   0.48   0.12
carry125         0.54   0.49   0.18
assettrend2      0.02   0.03  -0.17
assettrend4      0.25   0.25  -0.17
assettrend8      0.41   0.40  -0.05
assettrend16     0.43   0.44   0.17
assettrend32     0.39   0.39   0.25
assettrend64     0.40   0.40   0.26
normmom2         0.05   0.14  -0.35
normmom4         0.25   0.30  -0.27
normmom8         0.39   0.38  -0.10
normmom16        0.42   0.42   0.08
normmom32        0.40   0.42   0.25
normmom64        0.37   0.43   0.32
momentum4        0.18   0.27  -0.23
momentum8        0.35   0.42  -0.02
momentum16       0.39   0.46   0.16
momentum32       0.35   0.42   0.30
momentum64       0.32   0.41   0.35
relcarry         0.04   0.20   0.14
skewabs365      -0.02  -0.12   0.24
skewabs180       0.17   0.02   0.32
skewrv365        0.29   0.22  -0.06
skewrv180        0.26   0.16   0.19
accel16          0.31   0.31  -0.11
accel32          0.22   0.28  -0.16
accel64          0.03   0.14  -0.04

That shows you the vector of Sharpe Ratios for each trading rule for three different instruments. The question then is, should we pool the returns of US5 and US10? And what about SP500? Or are these vectors of SR distinctively different and we should not pool at all?

Just by eye the two bonds do look very similar. The S&P 500 not so much. A simple Euclidean distance metric gives a distance of 0.07 between the two bonds; and around 0.31 between the equity and the bonds. Now this distance measure is crude. It doesn't take into account the length of data each asset has. A proper statistical test if the time periods were matched would also look at the correlation of the two matched return distributions. But regular followers of this blog will know that I love crude. So let's run with this.

Note: The distance metric is just the root mean squared difference, sqrt(average((SR_i_1 - SR_i_2)^2)), taken over the SR of rules i=1...N for instruments 1 and 2.
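To make the distance concrete, here is a short sketch that applies a root mean squared difference to the SR figures in the table above (40 rules, in the same order as the table). The function name is made up, and treating the metric as a root mean square is my reading of it; on these rounded figures it lands close to the quoted 0.07 for the two bonds and roughly 0.3 for US10 against the S&P 500.

```python
import numpy as np

# SR per rule copied from the table above, in table order
us10 = np.array([-0.02, 0.23, 0.36, 0.35, 0.28, 0.30, 0.09, -0.02, 0.12, 0.17,
                 -0.06, 0.51, 0.53, 0.53, 0.54, 0.02, 0.25, 0.41, 0.43, 0.39,
                 0.40, 0.05, 0.25, 0.39, 0.42, 0.40, 0.37, 0.18, 0.35, 0.39,
                 0.35, 0.32, 0.04, -0.02, 0.17, 0.29, 0.26, 0.31, 0.22, 0.03])
us5 = np.array([0.10, 0.28, 0.39, 0.38, 0.30, 0.41, 0.23, 0.11, 0.13, 0.26,
                -0.15, 0.50, 0.50, 0.48, 0.49, 0.03, 0.25, 0.40, 0.44, 0.39,
                0.40, 0.14, 0.30, 0.38, 0.42, 0.42, 0.43, 0.27, 0.42, 0.46,
                0.42, 0.41, 0.20, -0.12, 0.02, 0.22, 0.16, 0.31, 0.28, 0.14])
sp500 = np.array([-0.43, -0.08, 0.07, 0.25, 0.34, 0.37, -0.15, -0.10, -0.14, -0.13,
                  -0.49, 0.10, 0.10, 0.12, 0.18, -0.17, -0.17, -0.05, 0.17, 0.25,
                  0.26, -0.35, -0.27, -0.10, 0.08, 0.25, 0.32, -0.23, -0.02, 0.16,
                  0.30, 0.35, 0.14, 0.24, 0.32, -0.06, 0.19, -0.11, -0.16, -0.04])

def sr_distance(sr_a, sr_b):
    """Root mean squared difference between two vectors of SR estimates."""
    return float(np.sqrt(np.mean((sr_a - sr_b) ** 2)))

print("US10 vs US5:  ", round(sr_distance(us10, us5), 3))
print("US10 vs SP500:", round(sr_distance(us10, sp500), 3))
print("US5 vs SP500: ", round(sr_distance(us5, sp500), 3))
```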

How can we calibrate this distance measure? To put it another way, is 0.07 very low, and is 0.31 very high? Should we pool everything? Just the bonds?

Well, my bias is towards pooling. Before writing this blog post I've generally pooled everything without giving it a moment's thought. So I would only be not pooling if there is a high chance that two instruments are distinctly different. That suggests my rule is to pool two instruments, unless their vector distance is above some critical value. In the simple example above, any critical value in the range 0.08 to 0.29 would imply pooling the two bonds, but would say not to pool the S&P. A critical value below 0.07 would result in no pooling. A value of 0.32 or above would imply pooling everything.

How to find these critical values? Easy! I can set the critical value using random data at a level where we would pool unless there is a (say) 95% chance that the instruments are actually significantly different. We know that we typically have about 20 years of data. We know that the average trading rule on the typical instrument has a SR of around 0.15. If we generate 40 lots of random returns with that, and measure the SR, we'll get something like one of the columns above. Repeat that for another non existent instrument and we have two random 'instruments' we can calculate an SR vector for. Then we calculate the distance between those two random SR vectors. Finally we rinse and repeat many times. We then get a distribution of distances. Voila:

We can certainly say that a distance of 0.31 is what we'd easily expect through random chance, whereas 0.07 is very unlikely to happen. You may be wondering how different that would be for say an instrument with just 5 years of data. Wonder no longer:

With less data and more variability in SR larger distances are more common. And what about the upper end with say 40 years of data (roughly what we have for the two bonds and S&P above):

There is less variability in outcomes with longer periods, so the distances are smaller. We're still not seeing a distance of 0.07 here though. It's incredibly unlikely to be a coincidence that the two bonds have roughly the same SR vectors. At the other end of the spectrum, it's also pretty unlikely that the S&P and the two bonds are actually drawn from the same distribution; since 0.31 is at the right edge of the distribution of distances.
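The random data calibration can be sketched along these lines. I'm assuming here that the 40 rules are independent Gaussian return streams with unit volatility and a true SR of 0.15, and 256 business days a year; the function name and the number of trials are arbitrary choices made for speed.

```python
import numpy as np

def random_distances(years=20, n_rules=40, true_sr=0.15, n_trials=300,
                     periods_per_year=256, seed=0):
    """Distances between the SR vectors of pairs of random 'instruments'."""
    rng = np.random.default_rng(seed)
    n_periods = years * periods_per_year
    period_mean = true_sr / np.sqrt(periods_per_year)  # unit volatility per period

    def random_sr_vector():
        returns = rng.normal(period_mean, 1.0, size=(n_periods, n_rules))
        return returns.mean(axis=0) / returns.std(axis=0) * np.sqrt(periods_per_year)

    distances = []
    for _ in range(n_trials):
        diff = random_sr_vector() - random_sr_vector()
        distances.append(np.sqrt(np.mean(diff ** 2)))
    return np.array(distances)

distances = random_distances(years=20)
print("95th percentile of distance:", np.percentile(distances, 95))
```

Changing the years argument to 5 or 40 shifts the whole distribution right or left, which is the effect described above. More trials are needed to pin down the far tail (99.9%) with any precision.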

Now we should apply a pinch of salt correction to numbers from random data as we know from post three that real data doesn't behave like random data with a fixed SR, and shows more variability. So we would expect larger distances in real data than we see here, particularly for longer periods. For a given critical value calibrated using random data, it's likely that with actual data we will see more apparent significant differences, and therefore slightly less pooling going on.

The other thing to point out is that there are (N^2-N)/2 possible pairwise comparisons of SR vectors; or for 214 instruments 23,000 or so. Were we to set our threshold for pooling at (say) 99% we'd expect to see over 200 apparent significant differences just by chance even if the instruments had no significant difference in true SR.
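The arithmetic behind those figures, as a quick check (the variable names are just for illustration):

```python
n_instruments = 214
n_pairs = (n_instruments ** 2 - n_instruments) // 2  # distinct pairs of instruments
false_positive_rate = 0.01  # implied by a 99% threshold
expected_false_positives = n_pairs * false_positive_rate
print(n_pairs, round(expected_false_positives))
```

The pairwise comparisons aren't independent of each other, so this is only the expected number of chance 'differences', not a statement about how they are distributed.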

Fortunately there isn't that much difference in the tails for the distribution of distances over different time periods from random data:

       1 year  5 years  10 years  20 years  30 years  40 years
95%      1.65     0.74      0.53      0.37      0.30      0.26
99%      1.81     0.80      0.57      0.40      0.33      0.28
99.9%    1.91     0.85      0.61      0.43      0.35      0.30

Note that using the bottom row would just barely still result in S&P 500 and the bonds getting separated (distance 0.31 with 40 years or so of data); and with a little less history they would be pooled together. Despite the higher risk of false positives, my bias to pooling isn't so absurd that I think that should happen. Anyway, I'm going to use the top row, which implies:

We will pool instruments unless there is a 95% chance or greater that their SR vectors are significantly different. That will be the case if their SR vectors have distances greater than 1.65 (one year or less) up to 0.26 (40 years or less).

What if we have instruments with different amounts of history? Given our bias is towards pooling, I would always use the higher critical value for distance. For example, if we had one instrument with 5 years of returns and another with 40 years, I'd use the critical value for 5 years. That also means that instruments with fewer returns are more likely to be pooled when they first enter the data set. Which feels like the correct approach.

The pooling algo in full

OK so what we do is (for a given point in time as this will be on an in and out of sample basis):

estimate the SR for each rule on each instrument
for that vector of SR, work out the distance between that instrument and all other instruments
calculate the critical value for each pairing, using the lowest number of years available for one of the instruments.
assuming there is at least one pair where the distance is less than the critical value, find the pair with the smallest distance
combine the returns of those instruments together into a new pseudo pooled instrument
calculate the SR vector, and distances between this new pseudo instrument and all existing instruments (all the previously calculated distances between the remaining instruments will remain the same and don't need recalculating). Note that the number of years available for a pseudo instrument will be equal to the sum of the years on the individual instruments.
repeat until no remaining pair has a distance less than the critical value (a distance below the critical value means there is a less than 95% chance that their SR are significantly different).
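The steps above can be sketched as follows. This is a simplified illustration rather than the actual implementation: it assumes every instrument has the same set of rules, holds returns as plain numpy arrays (periods by rules), reuses the 40 year critical value for anything longer, and lazily recalculates every distance on each pass instead of caching them. All the names are made up.

```python
import numpy as np

PERIODS_PER_YEAR = 256  # assumed business days per year
# 95% critical distances from the random data table: (years or less, value)
CRITICAL_VALUES = [(1, 1.65), (5, 0.74), (10, 0.53), (20, 0.37), (30, 0.30), (40, 0.26)]

def sr_vector(returns):
    return returns.mean(axis=0) / returns.std(axis=0) * np.sqrt(PERIODS_PER_YEAR)

def distance(sr_a, sr_b):
    return float(np.sqrt(np.mean((sr_a - sr_b) ** 2)))

def critical_value(years):
    for max_years, value in CRITICAL_VALUES:
        if years <= max_years:
            return value
    return CRITICAL_VALUES[-1][1]  # assumption: beyond 40 years reuse the 40 year value

def pool_instruments(returns_by_instrument):
    """Repeatedly merge the closest pair whose distance is below its critical value."""
    pools = {(name,): returns for name, returns in returns_by_instrument.items()}
    while len(pools) > 1:
        best = None
        keys = list(pools)
        for i, key_a in enumerate(keys):
            for key_b in keys[i + 1:]:
                # use the shorter history of the pair, which gives the higher critical value
                years = min(len(pools[key_a]), len(pools[key_b])) / PERIODS_PER_YEAR
                d = distance(sr_vector(pools[key_a]), sr_vector(pools[key_b]))
                if d < critical_value(years) and (best is None or d < best[0]):
                    best = (d, key_a, key_b)
        if best is None:
            break  # nothing left that is close enough to pool
        _, key_a, key_b = best
        # stacking the returns means the pseudo instrument's years are the sum of both
        pooled = np.vstack([pools.pop(key_a), pools.pop(key_b)])
        pools[key_a + key_b] = pooled
    return pools

rng = np.random.default_rng(0)
n_days, n_rules = 40 * PERIODS_PER_YEAR, 10
bond_a = rng.normal(0.5 / 16, 1.0, size=(n_days, n_rules))        # made-up returns
bond_b = bond_a + rng.normal(0.0, 0.01, size=(n_days, n_rules))   # near twin of bond_a
oddball = rng.normal(-1.0 / 16, 1.0, size=(n_days, n_rules))      # very different SR
pools = pool_instruments({"BOND_A": bond_a, "BOND_B": bond_b, "ODDBALL": oddball})
print(list(pools))
```

In this toy example the two near identical 'bonds' end up in one pool and the oddball stays on its own.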

We now have a mixture of pseudo instruments and possibly instruments. We use the pre-cost returns for each pool to optimise the appropriate portfolio weights. There is some additional logic around handling distinct and missing forecasts, and also costs, but for now let's keep things simple.

Just for fun

Just for fun, let's run a single in sample test and see what gets pooled together in instrument space. This is akin to clustering exercises I have done before but there I just used underlying instrument returns.

This takes a while, and you may be surprised by some of the first pairs of instruments that are pooled together:

Pooling DistanceKeys(key1='SOFR', key2='EDOLLAR') # both STIR

Pooling DistanceKeys(key1='US5', key2='US10') # the example we have been using

Pooling DistanceKeys(key1='PLAT', key2='REDWHEAT') # WTF!

Pooling DistanceKeys(key1='HEATOIL', key2='CHF') # WATF?!

Pooling DistanceKeys(key1='GILT', key2='CAD10') # both 10 year bonds fine

Pooling DistanceKeys(key1='ZAR', key2='DAX') # WATFF?!?!?

...

Anyway once finished we end up with 54 pooled returns rather than 214 of our original distinct instruments.

There are two huge groups that take in a big chunk of the instruments. Here is the first with 103 instruments:

['AUD_micro', 'BBCOMM', 'BOBL', 'BONO', 'BRE', 'BRENT_W', 'BTP', 'BTP3', 'BUND', 'BUXL', 'CAD', 'CAD10', 'CANOLA', 'CH10', 'CHEESE', 'CHF', 'COCOA', 'COCOA_LDN', 'COFFEE', 'COPPER-micro', 'CORN', 'COTTON', 'COTTON2', 'CRUDE_ICE', 'CRUDE_W_micro', 'DAX', 'DOW', 'DX', 'EDOLLAR', 'EU-BANKS', 'EU-DJ-UTIL', 'EURCHF', 'EUR_micro', 'FANG', 'FEEDCOW', 'FTSE250', 'FTSECHINAA', 'GAS-PEN', 'GASOIL', 'GASOILINE', 'GAS_US_mini', 'GBP', 'GICS', 'GILT', 'GOLD_micro', 'HANG_mini', 'HEATOIL', 'HIGHYIELD', 'IBEX_mini', 'IRS', 'JGB', 'JGB-SGX-mini', 'JPY', 'KOSPI_mini', 'KR3', 'LEANHOG', 'LIVECOW', 'LUMBER', 'MILLWHEAT', 'MSCISING', 'MXP', 'NASDAQ_micro', 'NIFTY', 'NIKKEI', 'OAT', 'OATIES', 'OJ', 'OMX', 'PALLAD', 'PLAT', 'PLN', 'R1000', 'RAPESEED', 'REDWHEAT', 'RICE', 'ROBUSTA', 'RUR', 'SGX', 'SHATZ', 'SILVER', 'SMI-MID', 'SOFR', 'SOYBEAN_mini', 'SOYMEAL', 'SOYOIL', 'SP400', 'SP500_micro', 'SUGAR11', 'SUGAR_WHITE', 'TOPIX', 'US-DISCRETE', 'US-HEALTH', 'US-TECH', 'US10', 'US10U', 'US2', 'US20', 'US30', 'US5', 'VIX_mini', 'WHEAT', 'YENEUR', 'ZAR']

Note that this does include both the S&P 500 and the US bond markets!

The second group of fifty instruments is mostly stock sectors, but not entirely:

['AEX', 'BOVESPA', 'CAC', 'CAD2', 'CAD5', 'CLP', 'CZK', 'DJSTX-SMALL', 'EU-AUTO', 'EU-BASIC', 'EU-CHEM', 'EU-CONSTRUCTION', 'EU-DIV30', 'EU-DJ-TELECOM', 'EU-FOOD', 'EU-HEALTH', 'EU-MID', 'EU-OIL', 'EU-REALESTATE', 'EU-RETAIL', 'EU-TECH', 'EU-TRAVEL', 'EURCAD', 'EURO600', 'EUROSTX', 'EUROSTX-SMALL', 'FTSE100', 'FTSECHINAH', 'GBPCHF', 'GBPEUR', 'IG', 'INR', 'MSCIASIA', 'MSCIEAFA', 'NICKEL_LME', 'NOK', 'NZD', 'RUSSELL', 'SEK', 'SMI', 'SPI200', 'US-ENERGY', 'US-FINANCE', 'US-INDUSTRY', 'US-MATERIAL', 'US-PROPERTY', 'US-REALESTATE', 'US-STAPLES', 'US-UTILS', 'V2X']

Again in terms of weirdness, the Canadian 10 year bond is in group 1 whilst the other Canadian bonds are in group 2. VIX is in group 1, and V2X in group 2.

Next there are a few small groups, which mostly don't have any internal logic:

BRENT-LAST, USIRS5, USIRS10 (two out of three make sense)

MILK, WHEY, KR10, WHEAT_ICE (two out of four make sense)

FED, COAL-GEORDIE (nope, I got nothing)

IRON, ETHANOL

EURIBOR-ICE, COAL

SUGAR16/MILKDRY (the "what you shouldn't put in your coffee" group)

This leaves 46 instruments which can't be pooled with anything else:

['ALUMINIUM', 'AUDJPY', 'BB3M', 'BITCOIN', 'BUTTER', 'CHFJPY', 'CHINAA-CON', 'CNH', 'COPPER_LME', 'ETHER-micro', 'EU-HOUSE', 'EU-INSURE', 'EU-MEDIA', 'EUA', 'EURAUD', 'FTSEINDO', 'FTSETAIWAN', 'FTSEVIET', 'GBPJPY', 'HANGENT_mini', 'HANGTECH', 'HOUSE-US', 'JP-REALESTATE', 'KOSDAQ', 'KRWUSD_mini', 'LEAD_LME', 'LUMBER-new', 'MIB', 'MILKWET', 'MSCIEMASIA', 'MSCITAIWAN', 'MSCIWORLD', 'MUMMY', 'RUBBER', 'SARONA', 'SGD', 'SONIA3', 'STEEL', 'SWISSLEAD', 'TIN_LME', 'TWD', 'US3', 'USIRS2ERIS', 'USIRS5ERIS', 'VNKI', 'ZINC_LME']

Oh yes crypto people, Bitcoin and Ethereum are 'special'. As special as Tin and Rubber anyway. The furthest distance remaining after all that pooling is just under 0.31 which just exceeds the relevant critical value.

Evaluating the results

You should be used to the procedure by now if you've been following the blog posts. I will do the usual thing of cycling through different lengths of in sample (5 years, 10 years) and out of sample (1 year and 5 years) lengths of time. For shorter time periods that will allow me to subsample different historic periods. For speed and to get some alternative paths I'm not going to consider all the instruments. Instead I will randomly subsample 50 instruments out of the 214 available. Note that the pool of available instruments will be smaller when I am using e.g. 15 years of in and out of sample data, which is why I'm not going to a 40 year in sample period as I've done before when investigating structural breaks.

Then for a given set of returns I will either use fully pooled, distance weight pooled, or unpooled returns. Then I will optimise for each instrument based on the relevant returns, using the shrinkage method with SR shrinkage of 0.5 and correlation of 0.75. Finally I will take the out of sample SR of an equally weighted portfolio of the 50 instruments. Note this should be better than the average SR across the individual instruments. I could do better with some kind of instrument weight allocation, but that is for another day. I should still be able to pick up whether I am losing out through lower diversification when pooling.
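To see why the portfolio SR should beat the average instrument SR, here is a toy sketch. It uses 50 made-up return streams that are completely uncorrelated, which is a far stronger assumption than anything true of real instruments, so the size of the gap is hugely exaggerated; only the direction matters.

```python
import numpy as np

PERIODS_PER_YEAR = 256  # assumed business days per year

def annualised_sr(returns):
    return returns.mean(axis=0) / returns.std(axis=0) * np.sqrt(PERIODS_PER_YEAR)

rng = np.random.default_rng(0)
# 5 years of uncorrelated daily returns for 50 'instruments', each with a true SR of 0.5
instrument_returns = rng.normal(0.5 / 16, 1.0, size=(5 * PERIODS_PER_YEAR, 50))

average_sr = annualised_sr(instrument_returns).mean()
# equal weights across instruments: just the average return on each day
portfolio_sr = annualised_sr(instrument_returns.mean(axis=1))
print("average instrument SR:", average_sr)
print("equal weighted portfolio SR:", portfolio_sr)
```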

... with a twist

Basically everything I have done up until now includes implicit in sample fitting, because I'm only selecting from trading rules that actually work. This will inflate the backtest results, but until now at least won't have a serious effect on the calibrations I have been running. But with this step of looking at pooling I am worried that there will be rules that just don't work on some instruments. To try and alleviate that in sample fitting problem, I'm going to include the opposite of each trading rule as well as the original rule. Then at each optimisation we only choose the positive SR option. Note there are no trading costs at this stage of my research so the p&l of the opposite rule is exactly equal to -1* the 'correct' rule.

Anyway on with the results.

5 years in sample, 1 year out of sample

               SR    pvalue
unpooled     0.320     0.0
all pooled   0.447     0.0
algo pooled  0.514     NaN

               SR    pvalue
unpooled    -0.104     0.0
all pooled   0.391     NaN
algo pooled  0.016     0.0

The first table shows the results as I've been analysing up to now, with implicit fitting and only the 'correct' version of the trading rules included. In the second table I've allowed the possibility of the opposite rule to be included. Note the much lower performance that results; and a difference of opinion on whether we are better pooling everything or using the algo.

5 years in sample, 5 year out of sample

               SR    pvalue
unpooled     0.624     0.0
all pooled   0.431     0.0
algo pooled  0.657     NaN

               SR    pvalue
unpooled     0.339     0.0
all pooled   0.346     0.0
algo pooled  0.404     NaN

Once again the SRs are reduced by being more honest, but with a longer out of sample period the algo pooling method is now superior.

10 years in sample, 1 year out of sample


               SR    pvalue
unpooled     0.862     NaN
all pooled   0.436     0.0
algo pooled  0.626     0.0

               SR    pvalue
unpooled     0.460     NaN
all pooled   0.304     0.0
algo pooled  0.248     0.0

The one thing we haven't got here is consistency... now not pooling at all is the correct thing to do!

10 years in sample, 5 years out of sample

               SR    pvalue
unpooled     0.683     0.0
all pooled   0.584     0.0
algo pooled  0.735     NaN


               SR    pvalue
unpooled     0.781     NaN
all pooled   0.491     0.0
algo pooled  0.650     0.0

A bit of an unusual case here since we do better on unpooled when including opposite rules, but it can happen just by luck. Anyway things really are inconsistent here...

Summary

Although the results above do seem quite messy, if we focus on the more honest figures that include opposite rules we can see a pattern if we look at the best method in each case:

5 year 1 year: All pooled

5 year 5 year: Algo pooled

10 year 1 year: Unpooled

10 year 5 year: Unpooled

Hence the more data we have, the more it seems we can allow each instrument to have its own parameter estimates rather than sharing with other instruments.

Anyway, what to do? I am struggling here. I like more SR as much as the next guy, but I also have biases towards simplicity (Occam's razor), robustness and not changing things if I can avoid them. Sticking with what I currently do - pooling everything - is very tempting. It's simple, and it is also likely to be very robust. Not pooling at all is possibly even simpler; and with enough data history does seem to perform better. But it also worries me! Although we're ensuring robustness by using shrinkage, so maybe it's okay.

The Algo method is cool and fun, but definitely massively complicates matters. The method also doesn't produce 'nice' results. When I ran the original 'all instruments' grouping exercise, the long tail of instruments that don't fit elsewhere was slightly concerning. I had hoped to get groups that were congruent with asset classes or at least had some obvious logic, and I certainly didn't. This does suggest that the pooling by weights I have attempted before is worth a second look.

Alternatively I could use some simple heuristic like:

If an instrument has less than 5 years of data history, use pooled returns
If it has more than 25 years of history, use individual returns
With between 5 and 25 years of history, use weights that are an average of these; where the weight on pooled returns for N years of returns is (25-N)*0.05 and obviously the weight on individual returns is one minus that.
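The heuristic above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, I'm assuming the weight on individual returns is one minus the weight on pooled returns, and I'm assuming the thing being averaged is a pair of forecast weight vectors.

```python
import pandas as pd


def weight_on_pooled_returns(years_of_history: float) -> float:
    # Less than 5 years: fully pooled. More than 25 years: fully individual.
    if years_of_history < 5:
        return 1.0
    if years_of_history > 25:
        return 0.0
    # Linear in between: 1.0 at 5 years, falling to 0.0 at 25 years
    return (25 - years_of_history) * 0.05


def blend_forecast_weights(
    pooled_weights: pd.Series, individual_weights: pd.Series, years_of_history: float
) -> pd.Series:
    # Assumption: the weight on individual returns is one minus the pooled weight
    pooled_share = weight_on_pooled_returns(years_of_history)
    return pooled_share * pooled_weights + (1 - pooled_share) * individual_weights


# Made-up forecast weights for two hypothetical rules
pooled = pd.Series({"rule_a": 0.5, "rule_b": 0.5})
individual = pd.Series({"rule_a": 0.8, "rule_b": 0.2})
blended = blend_forecast_weights(pooled, individual, years_of_history=15)
print(blended)
```

Because both sets of weights add up to one, the blend does as well, so no renormalisation is needed.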

So there is another blog post to come at some point where I revisit the issue of pooling.

But for now we can put pooling by SR vector in the bin, the concept permanently damaged by the sharp edge of Occam's razor (topical political reference there!).

Tuesday, 23 June 2026

Breaking Badly: finding the structural breaks in parameter estimates

Here's a nice picture from a lovely book written by a top bloke:

It shows the cumulative p&l from different speeds of momentum over time (for portfolios containing 102 instruments) over 50 years of data. Notice how the two fastest speeds (2&4) get worse in the second half of the sample. I've called the line #2 here the 'second most famous hockey stick graph in history'. It certainly looks like something changed in 1990.

This is important. If we're optimising portfolios of such things we only want to consider data that is relevant, but we also want as much data as possible for statistical significance. Now if I were a simpleton I'd do this by looking at graphs like that and going 'aha i only need to use data after 1990'. As a simpleton I don't use capital letters. But I am a big fan of not doing in sample fitting, even of meta parameters like this; and I am an even bigger fan of doing things automatically which means not wading through thousands of graphs like that (since there are thousands of SR estimates in my forecast p&l space, plus a good chunk of correlations).

So we need an automatic way of identifying such breaks. Fortunately this is not a new problem as you will know if, like me, you did undergraduate econometrics. Finding structural breaks is an entire industry. We need two things: a test for how likely it is that a break has occurred between two sub-samples A and B, and an algorithm for going through all the options of A and B.

And in case you haven't realised this is the seventh post in my summer 2026 series on portfolio optimisation.

What parameters

The first question to think about is what parameters we're going to apply this process to. I do two kinds of optimisation:

Forecast weights
Instrument weights

And in both cases I have estimates of SR (one per asset) and correlations ([N^2-N]/2 for N assets). I haven't really looked at instrument weight optimisation yet in this series, and there are some wrinkles there so I'm going to park that for now. That just leaves the SR for a forecast (which remember is a pairing of a trading rule and an instrument), and the correlation of such forecasts within an instrument.

Now I am going to ignore correlations in this post. As I discussed in an earlier post, although correlations are relatively unstable in the short term, they are unlikely to have secular trends like SR. And it's quite easy to deal with this by just using a relatively long lookback to estimate them, probably with an ewma on the correlation estimate.
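As a rough sketch of what a long lookback with exponential weighting might look like, here is one way of doing it with the pandas ewm function. The 10 year span is just an assumption for illustration, as are the function and column names, and the returns are made up.

```python
import numpy as np
import pandas as pd

BUS_DAYS_IN_YEAR = 256


def latest_ew_correlation(forecast_returns: pd.DataFrame, span_years: float = 10.0) -> pd.DataFrame:
    # Exponentially weighted pairwise correlations; the result has a (date, column) MultiIndex
    span_days = int(span_years * BUS_DAYS_IN_YEAR)
    all_corr = forecast_returns.ewm(span=span_days, min_periods=BUS_DAYS_IN_YEAR).corr()
    # Keep only the correlation matrix for the final date
    return all_corr.loc[forecast_returns.index[-1]]


# Made-up daily returns for three hypothetical forecasts on one instrument
rng = np.random.default_rng(1)
dates = pd.bdate_range("2000-01-03", periods=10 * BUS_DAYS_IN_YEAR)
base = rng.normal(0.0, 1.0, len(dates))
fake_returns = pd.DataFrame(
    {
        "fast_rule": base,
        "similar_rule": base + rng.normal(0.0, 1.0, len(dates)),
        "unrelated_rule": rng.normal(0.0, 1.0, len(dates)),
    },
    index=dates,
)

corr_matrix = latest_ew_correlation(fake_returns)
print(corr_matrix.round(2))
```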

What test

This is quite an easy one, compared to the world of econometrics and linear regression where we have to do such nonsense as a Chow test. Given two sub-samples A and B, to find out if they have different sample means we just do an independent t-test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

An important characteristic of such tests is they are only likely to be significant if there is sufficient data. If sample A or sample B is too small in size, it's unlikely we'll find a significant difference in the means.
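To put some numbers on that, here is an idealised calculation using scipy's ttest_ind_from_stats. It assumes both samples have exactly the stated SR and a daily standard deviation of one, which real data obviously won't, and the SR values and function name are purely hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind_from_stats

BUS_DAYS_IN_YEAR = 256


def pvalue_for_sr_difference(sr_a: float, sr_b: float, years_in_each_sample: float) -> float:
    # Convert annual SR into a daily mean, given a daily standard deviation of one
    n_obs = int(years_in_each_sample * BUS_DAYS_IN_YEAR)
    daily_mean_a = sr_a / np.sqrt(BUS_DAYS_IN_YEAR)
    daily_mean_b = sr_b / np.sqrt(BUS_DAYS_IN_YEAR)
    result = ttest_ind_from_stats(
        mean1=daily_mean_a, std1=1.0, nobs1=n_obs,
        mean2=daily_mean_b, std2=1.0, nobs2=n_obs,
    )
    return float(result.pvalue)


# The same SR difference (1.0 versus 0.0) with different amounts of data
for years in [1, 5, 20]:
    print("%d years in each sample: p-value %.4f" % (years, pvalue_for_sr_difference(1.0, 0.0, years)))
```

With a year of data in each sample even a huge difference in SR is nowhere near significant; with 20 years it comfortably passes a 1% test.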

Typical critical values for t-tests are 1% and 5%; I will also check out 10% later.

Since all the p&l streams in a risk targeted trading system will have the same expected standard deviation in the long run, I can legitimately treat this as a test to see if the SR are different if I adjust the samples so they have an identical standard deviation.

(If any ex-students of mine are reading this, you should remember this from week 3 of the course!)

What algo / procedure

OK let's make this concrete. Suppose we have 50 years of data as in the original graph above, and we're currently wondering if there were one or more structural breaks. We don't want to have less than five years data to do our estimation. Note these numbers are appropriate for my context, but not in a domain with faster trading, higher SR, and faster alpha decay. We proceed as follows:

We compare year 1 with years 2 to 50. Since year 1 is very small it's unlikely we'll find a break; but if we do then we do a split (see below).
If no split has occurred, we compare years 1-2 with years 3-50. Again if we find a break, we split.
If no split has occurred, we compare years 1-3 with years 4-50...
...
If no split has occurred, we compare years 1-45 with years 46-50. If we still don't find a break, then we use the entire period for estimation (we terminate here, since the next candidate second sample, years 47-50, would only have four years of data).

Now what if a split occurs at some point? Then we restart the process, but this time without the pre-split data. Suppose for example a split had occurred in year 20 (which is 1992 in the original graph). Then:

We compare year 21 with years 22 to 50. If we find a break, then we split again.
If no split has occurred, we compare years 21-22 with years 23-50. Again if we find a break, we split.
If no split has occurred, we compare years 21-23 with years 24-50...
...
If no split has occurred, we compare years 21-45 with years 46-50. If we still don't find a break, then we use the entire period from years 21-50 for estimation.

You get the idea. This procedure is quite quick and easy to run; and notice we're identifying any break that exceeds a certain threshold rather than finding the most likely break as we would do with a QLR type test.
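For contrast, here is a rough sketch of the alternative: searching every candidate split and keeping the one with the lowest p-value. To be clear this is not a proper QLR test (which would need adjusted critical values for having looked at many splits), and it isn't what I use; the function name is hypothetical and the returns are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5


def most_likely_break(returns: pd.Series):
    # Returns (years before the break, p-value) for the split with the lowest p-value,
    # or None if there isn't enough data to consider any split
    returns = returns.dropna()
    total_years = len(returns) // BUS_DAYS_IN_YEAR
    best = None
    for n_years in range(1, total_years):
        idx = n_years * BUS_DAYS_IN_YEAR
        first, second = returns.iloc[:idx], returns.iloc[idx:]
        if len(second) < MIN_NUMBER_OF_YEARS * BUS_DAYS_IN_YEAR:
            break
        # Normalise by standard deviation so this is effectively a test on SR
        pvalue = ttest_ind(first / first.std(), second / second.std()).pvalue
        if best is None or pvalue < best[1]:
            best = (n_years, pvalue)
    return best


# Made-up data: ten good years followed by ten bad years
rng = np.random.default_rng(7)
good_years = rng.normal(0.1, 1.0, 10 * BUS_DAYS_IN_YEAR)
bad_years = rng.normal(-0.1, 1.0, 10 * BUS_DAYS_IN_YEAR)
fake_returns = pd.Series(np.concatenate([good_years, bad_years]))

print(most_likely_break(fake_returns))
```

The first-break-over-threshold procedure will typically stop earlier than this, as soon as any split clears the critical value, which is exactly the trade off described above.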

The main downside of it is that it won't identify breaks that reverse. For example if the world is in regime A, then regime B, then regime A again then ideally we'd estimate our parameters using both regime A's. But the above test will either use only the final regime A, or the last two regimes, or possibly all three regimes depending on whether there is a significant difference at the appropriate point(s). There are fancy things we could do to deal with this, but I feel these are corner cases and life is too short.

An example

Let's use a concrete example. This is the performance of the momentum4 rule on CORN. I've chosen it because we already know momentum4 has a structural break, and CORN has plenty of history. Just for fun, before reading on, see if you can identify where the algo finds the break here (there is exactly one break - this isn't a trick question).

Some code

This is probably the first post in this series where it's been practical to actually include the code, since there isn't much of it, nor are there any dependencies apart from getting the returns:


from copy import copy
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

BUS_DAYS_IN_YEAR = 256
MIN_NUMBER_OF_YEARS = 5

def identify_and_plot_breaks(all_returns: pd.Series, CV: float = 0.01):
    ## CV is the critical value: a t-test p-value below this counts as a break
    breaks_as_dict = identify_all_breaks(all_returns, CV=CV)
    breaks_as_df = pd.DataFrame(breaks_as_dict)
    breaks_as_df = breaks_as_df.bfill(axis=1)
    breaks_as_df.cumsum().plot()
    plt.show(block=True)


def identify_all_breaks(all_returns: pd.Series, CV: float):
    ## returns a dict, turn into a dataframe and you can plot
    returns_to_consider= copy(all_returns)
    returns_to_consider = returns_to_consider.dropna()
    broken_list = identify_all_breaks_recursively(returns_to_consider=returns_to_consider, list_of_returns_broken_off=[],
                                                   CV=CV)
    broken_list.reverse()
    broken_dict = dict([
        (idx, value) for idx, value in enumerate(broken_list)
    ])

    return broken_dict

def identify_all_breaks_recursively(returns_to_consider: pd.Series, list_of_returns_broken_off: List, CV: float) -> List:
    years_in_returns = how_many_years_approx(returns_to_consider)

    for i in range(years_in_returns):
        year_idx=i+1
        first_sample, second_sample = split_sample_after_n_years(returns_to_consider, year_idx)
        if len(second_sample)<(MIN_NUMBER_OF_YEARS*BUS_DAYS_IN_YEAR):
            break
        is_broken_here = test_a_break(first_sample, second_sample, CV=CV)
        if is_broken_here:
            list_of_returns_broken_off.append(first_sample)
            return identify_all_breaks_recursively(
                second_sample, list_of_returns_broken_off=list_of_returns_broken_off,
                CV=CV
            )
        else:
            continue

    ## No breaks identified or sample size too short
    list_of_returns_broken_off.append(returns_to_consider)
    return list_of_returns_broken_off

def how_many_years_approx(returns: pd.Series):
    return int(np.floor(len(returns)/BUS_DAYS_IN_YEAR))

def split_sample_after_n_years(all_returns: pd.Series, n_years: int):
    idx = n_years*BUS_DAYS_IN_YEAR
    return all_returns[:idx], all_returns[idx:]

def test_a_break(first_sample: pd.Series, second_sample: pd.Series, CV: float):
    ## Normalise by standard deviation before considering means
    norm_first_sample =first_sample/first_sample.std()
    norm_second_sample=second_sample/second_sample.std()
    return ttest_ind(norm_first_sample, norm_second_sample).pvalue<CV

On Corn this produces the following:

The break occurs on the 26th August 1982. So to calculate that particular SR estimate we'd only use data from that date onwards.

Is that what you would have guessed? Personally, I would probably have gone for a later break point if identifying it by eye. As humans we are drawn to the sharp upward move in 1989 and would probably have gone for a break just after that. It's possible a search for the likeliest break would have found that point, but remember we are looking for the first break that exceeds the threshold; and once that break happens no further breaks are identified.

A summary of results

Here is a summary of the results for each instrument/forecast pairing with the default 1% critical value:

You can see that breaks are quite rare with only 13% or so of instrument/rules having at least one break. This also suggests that the Sharpe Ratios for trading rule performance are actually quite stable over time; or at least stable enough that they won't fail any statistical tests at a 1% critical value.

Multiple breaks are even rarer. Just 1.8% have two breaks; 0.4% or 39 pairings have three breaks, ten have four breaks and only two have five breaks. They are:

skewrv365 forecasting EURIBOR-ICE (yes, there is still a EURIBOR future!)

normmom2 forecasting FTSE250

Here is Euribor, relative value skew with a 365 day window:

Although five is pushing it, there are certainly three regimes there (pre 2000, 2000 - 2010, and 2010 onwards), and using post 2010 data seems to make some kind of sense.

And FTSE 250 momentum8 (this is pre-cost):

There are certainly at least two regimes there and I wouldn't argue with the automated decision to use only data after 2006 or so, when it looks like, in the words of Pulp in one of my favourite songs, "Something changed".

Of course we will get a different picture with a slacker test. Here is the picture with a 10% critical value:

Now just over half the pairings have at least one break in them.

The decision as to whether to use a 0% (equivalent to no breaks at all), 1%, 5% or 10% CV is one we will now address.

An optimisation test

Now the big question is does this actually improve performance? On a pure out of sample test? And is this changed much by using a different critical value?

I follow the same procedure roughly as in previous posts:

Select 10,20,30 or 40 years of in sample data (I need at least 10 years: with a minimum of five years required for estimation, any less and I either won't find any breaks, or I will risk finding a break and not having five years of data left over)
Select 1 or 5 years of out of sample data
Pick a random instrument, ensuring there is enough history available (between 11 and 45 years). We will only choose from instruments with sufficient history for the time required.
Randomly pick N=9 forecasting rules from those available (the same number as in posts #2 and #3)

Then for each of those subsamples:

Cycle through using no breaks (0% CV), 1% CV, 5% CV and 10% CV
Estimate SR on the in sample data using either all the data (0% CV), or the data after the last break given some critical value.
Estimate correlation using all the in sample data
Use fixed shrinkage levels (estimated here): SR shrinkage 0.5, correlation 0.75 (since we'll always have at least five years of in sample data we don't need to worry about the higher levels of shrinkage required when we have insufficient data). The results won't be much different with any vaguely similar shrinkage.
Run in sample optimisation and out of sample optimisation on all the options above

Finally once we have all our subsamples:

Get the median SR from the distribution of subsamples
Find the optimal CV with the highest SR
Test to see if that median is significantly higher than the others
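Those final steps can be sketched as follows. This is only a guess at the shape of the calculation: the function name is hypothetical, the out of sample SRs are random numbers, and I'm assuming an independent two sample t-test between the best option and each alternative.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def summarise_options(oos_sr: pd.DataFrame) -> pd.DataFrame:
    # oos_sr has one row per subsample and one column per option (here, critical values)
    median_sr = oos_sr.median()
    best_option = median_sr.idxmax()
    pvalues = {}
    for option in oos_sr.columns:
        if option == best_option:
            # NaN marks the optimal option
            pvalues[option] = np.nan
        else:
            # Assumption: independent t-test of the best option against this one
            pvalues[option] = ttest_ind(oos_sr[best_option], oos_sr[option]).pvalue
    return pd.DataFrame({"SR": median_sr, "pvalue": pd.Series(pvalues)})


# Made-up out of sample SRs for 1000 subsamples and four critical values
rng = np.random.default_rng(3)
fake_oos_sr = pd.DataFrame(
    rng.normal(0.1, 1.0, size=(1000, 4)), columns=[0.0, 0.01, 0.05, 0.10]
)
print(summarise_options(fake_oos_sr))
```

Low p-values mean the best option is significantly better than that alternative; with the pure noise used here you would expect most of them to be large.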

10 years in sample, one year out of sample

We only have four options to consider so no need for the huge tables and fancy heatmaps of previous posts:

         SR  pvalue all  pvalue distinct
0.00 -0.021       0.247            0.247
0.01 -0.018       0.295            0.295
0.05 -0.019       0.204            0.204
0.10 -0.014         NaN              NaN

Each row is a different critical value used for breakpoint finding. Zero means the entire in sample period was used. The first column is the out of sample Sharpe Ratio for each option. The second column is the p-value for a test of the optimal option against the relevant option. NaN marks the optimal option, and lower values (say below 0.05) mean the optimal option is significantly better than that alternative. In the final column I've rerun the statistical t-test but this time I have excluded instances where no breaks were found (so the test is only done comparing the out of sample SR when breaks were found, versus when they were not). This shouldn't affect the p-values, but it's nice to check it doesn't.

You can see here that it looks like a very loose breakpoint policy is the best, but it's not significantly better.

10 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.166       0.168            0.168
0.01  0.160       0.003            0.003
0.05  0.167       0.007            0.007
0.10  0.170         NaN              NaN

Again the loosest breakpoint works best; but it's hardly logical since it's indistinguishable from no breakpoints at all.

20 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.160       0.542            0.542
0.01 -0.160       0.323            0.323
0.05 -0.160       0.050            0.050
0.10 -0.155         NaN              NaN

Loose is better.

20 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00  0.123         NaN              NaN
0.01  0.116         0.0              0.0
0.05  0.105         0.0              0.0
0.10  0.098         0.0              0.0

No breakpoints are the best. There isn't much consistency here.

30 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00  0.092         NaN              NaN
0.01 -0.019         0.0              0.0
0.05 -0.078         0.0              0.0
0.10 -0.066         0.0              0.0

Another massive win for not using breaks.

30 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.005       0.001            0.001
0.01 -0.011       0.000            0.000
0.05 -0.008       0.014            0.014
0.10 -0.003         NaN              NaN

Or perhaps we should go for the loosest break....

40 years in sample, one year out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.142         NaN              NaN
0.01 -0.249       0.000            0.000
0.05 -0.189       0.000            0.000
0.10 -0.178       0.002            0.002

or no breaks...

40 years in sample, five years out of sample

         SR  pvalue all  pvalue distinct
0.00 -0.060       0.177            0.177
0.01 -0.051         NaN              NaN
0.05 -0.055       0.000            0.000
0.10 -0.055       0.000            0.000

A clear vote for strict breaks, but no breaks at all are also good.

Conclusion

Well that was as clear as a bowl full of mud that has been made even more unclear by painting the bowl a very dark colour and then adding some darker mud. Not very clear at all, in other words.

You think up a nice neat simple way of finding structural breaks, and then it doesn't actually work when used for optimisation. In seven out of the eight cases we examined, not using breaks is either optimal or statistically indistinguishable from the optimal. Only in one case (30 years/5 years) was it inferior. The loosest possible break (CV 10%) was optimal in four cases. In most cases apart from 30 years/1 year the difference in performance was small between CV alternatives. This partly reflects the fact that breaks aren't that common, especially with a 1% CV.

Of course findings like this are very context dependent. I'm using rules that should probably work over long time periods; indeed they have been in sample selected for such a purpose. For the 30 year and 40 year periods there just aren't that many instruments with that much history; so even repeated random sampling is likely to turn up the same suspects repeatedly.

One issue here might be that we are considering the breaks at a forecast/instrument level. We might get different results if we pool estimates for trading rule performance across instruments. And indeed that is the subject of the next post. So I will return to this topic when I've looked at pooling.
