Monday, 13 January 2020

Skew and Kurtosis as trading rules

This is part X of my series of blog posts on skew and kurtosis, where 2<X<5. Part X, because it depends on how you number them! If you were to read them in a logical order then the series looks something like this:

  • A post on skew: measuring, and it's impact on future returns
  • A post on kurtosis: measuring, it's impact on future returns, and it's interaction with skew.
  • A post on trend following and skew (which I actually wrote first, hence the confusion!)
  • This post: on using skew and kurtosis as trading rules
This series acts as a little demonstration as to how we can take an idea and run with it to the extremes, possibly even taking things too far (this post will reveal whether this is the case).

This post will also demonstrate how we should test an ensemble of trading rules without committing the sins of implicit fitting (basically dropping variations that don't work from our backtest). And it will use pysystemtrade. There will be some pretty geeky stuff showing you how to implement novel trading rules in the aforementioned python library.

The trading rules

In the last couple of posts I explained that if we know what skew and kurtosis have been recently (which we do) we can use that as conditioning information on what returns will be in the future (which we don't normally know). The obvious thing to do with this is turn it into a trading rule, in fact there will be 12 trading rules. This is because I have 3 kinds of rules:

  • a pure skew rule ('skew')
  • a skew conditioned on kurtosis rule ('skewK')
  • a kurtosis conditioned on skew rule ('kurtS')
And each of these rules can be applied in 4 different ways (essentially 4 kinds of demeaning):
  • Absolute: versus the average across all assets and time periods [an alternative for pure skew is to use zero as the average here, but let's be consistent] ('_abs')
  • Relative to this particular assets history (where history was the last 10 years) ('_ts' for time series)
  • Relative to the current cross sectional average across all assets ('_cs')
  • Relative to the current cross sectional average within the relevant asset class ('_rv' i.e. relative value)

Finally each of these rules will have 6 variations, for the six periods over which skew/kurtosis will be measured:

  • 7 days ('_7D')
  • 14 days ('_14D')
  • 1 month ('_30D')
  • 3 months ('_90D')
  • 6 months ('_180D)
  • 12 months ('_365D')

Thus an absolute skew with kurtosis conditioning over 3 months will be known as 'skewK_abs_90'. Catchy. That's a total of 72 different possibilities to consider!

Some of these will probably be too expensive to trade on one or more instruments, but pysystemtrade will take care of that for us. Still we will need to winnow this down to a more sensible figure.

A brief geeky diversion

A precursor to using pysystemtrade to test new trading rules is to add any raw data they will access in the relevant python code, or specifically for futures here. If you're doing your own experiments you should do this by inheriting from the base object in the relevant file and adding bells and whistles in the form of additional methods, but since 'my gaff (code) -my rules' I've updated the actual code. So for example, if we calculate the skew here then we can re-use it many times across the various rules.

However there is a weakness with this code, which is that we can't pass arguments into the raw data function. So we couldn't for example pass in the length of time used.

This isn't a problem in most cases since we can do the relevant work inside the actual trading rule, pulling in raw percentage returns as the input into our function. This is slower, but it works. It is however a problem for anything that needs access to the skew (or kurtosis) for other instruments (cs and rv rules), since trading rule functions work on the forecast for a single instrument at a time.

There are two options here; one is to modify the pysystemtrade code so it can deal with this, and the second is to fix the horizon length used in the cs and rv rules. Indeed this is the approach I use in the relative carry rule, discussed here.

I'm usually against making things more complex, but I think that changing the code is the right thing to do here. See here to see how this was done.

Coding up the rules

Here then is the raw data code. As you can see I've coded up a couple of methods to calculate skew and kurtosis, and then kept things generic with a whole bunch of 'factor' methods that can be used for any predictive factor (at some point I'll replace the relative carry code so it uses this pattern).

Notice that we have a positive skew, and a negative skew method. The latter will be used for the standalone skew rule, and the former as a conditioning method.

Here are the trading rules:

from syscore.algos import robust_vol_calc

def factor_trading_rule(demean_factor_value, smooth=90):
    vol =robust_vol_calc(demean_factor_value)
    normalised_factor_value = demean_factor_value / vol
    smoothed_normalised_factor_value = normalised_factor_value.ewm(span=smooth).mean()

    return smoothed_normalised_factor_value

def conditioned_factor_trading_rule(demean_factor_value, condition_demean_factor_value, smooth=90):
    vol = robust_vol_calc(demean_factor_value)
    normalised_factor_value = demean_factor_value / vol

    sign_condition = condition_demean_factor_value.apply(np.sign)
    sign_condition_resample = sign_condition.reindex(normalised_factor_value.index).ffill()

    conditioned_factor = normalised_factor_value *sign_condition_resample
    smoothed_conditioned_factor = conditioned_factor.ewm(span=smooth).mean()

    return smoothed_conditioned_factor

As you can see these are actually quite generic trading rules, which is a consequence of how I've written the raw data methods. This also means we can do much of our work in the configuration stage, rather than by writing many different rules.

Notice all that I've added a smoothing function that wasn't in the original code. When I examined the output originally it was quite jumpy; this is because the skew and kurtosis estimators aren't exponentially weighted, and when one exceptionally bad or good return drops in or out of the window it can cause a big change. This meant that even the very long windows had high turnover, something that is undesirable.  I've set the smooth at a tenth of the length of the lookback (not tested, but seems sensible).

Here is a gist showing how to set up the 72 rules (in code, but could easily be done as a configuration). A snippet is below:

smooth = int(np.ceil(lookback_days/10.0))
kurtS_rv = TradingRule(conditioned_factor_trading_rule, 
                    other_args=dict(smooth = smooth, _factor_name="kurtosis",
                                       _lookback_days = lookback_days,
                                       __lookback_days = lookback_days

You can really see here how the very generic functions are being configured. For the conditioned rule we pass two types of data; both are factors which have been demeaned hence the identical names. In the other args the smooth is passed to the trading rule itself, the single underscore prefixes (_factor_name, _demean_method, _lookback_days) are passed to the first method in the data list 'rawdata.demeanded_factor_value'; and the double underscores are passed to the second method (which happens to be the same here). On the second call the lookback and demeaning method are identical, but the factor names are different - we use skew as the main factor and kurtosis as the conditioning factor.

Checking behaviour, correlation and costs

Before piling into seeing whether any of these 72 (!) putative strategies makes sense from a behaviour, cost and correlation perspective. Hopefully we can drop some of the numerous variations. Now, I've been very vocal in the past about the use of fake data to do this part of fitting trading strategies.

However in this case we'd need to generate data that had interesting skew and kurtosis properies that were time varying. To avoid this I decided to use a single market, S&P 500. I chose the S&P because it has a reasonable length of history, and it's the second cheapest market I trade (the NASDAQ is slightly cheaper but doesn't have the same history). So if the S&P can't trade a particular rule, we can definitely ignore it.

This is slightly cheating, but I won't use any performance data to make in sample decisions.

First let's set up the backtest (assuming we've already got the trading rules using the gist code above):

ordered_rule_names = list(all_trading_rules.keys())
config = temp_config
config.use_forecast_div_mult_estimates = True
config.use_forecast_scale_estimates = True
config.use_instrument_div_mult_estimates = True
config.use_instrument_weight_estimates = False
config.use_forecast_weight_estimates = True
system = futures_system(trading_rules=all_trading_rules, config=config)

Now let's check the costs:

for rule in ordered_rule_names:
          system.accounts.get_SR_cost_for_instrument_forecast("SP500", rule)))

SR_costs_for_rules.sort(key=lambda x: x[1])

Looking at the last few observations, all the rules with a 7 day lookback have costs greater than my normal cuttoff (0.13 SR units, see "Systematic Trading" to understand why). So we can drop this from our consideration.

Now for correlations:

rule_returns = rule_returns[ordered_rule_names]
corr_matrix = rule_returns.corr()

First let's look at the 'internal' correlations within each rule. For example:

select_rules = ['skew_abs_14', 'skew_abs_30', 'skew_abs_90', 'skew_abs_180', 'skew_abs_365']
corr_matrix.loc[select_rules, select_rules]
              skew_abs_14  skew_abs_30  skew_abs_90  skew_abs_180  skew_abs_365
skew_abs_14      1.000000     0.530610     0.158682      0.104764      0.022758
skew_abs_30      0.530610     1.000000     0.445712      0.218372      0.039874
skew_abs_90      0.158682     0.445712     1.000000      0.619104      0.305271
skew_abs_180     0.104764     0.218372     0.619104      1.000000      0.580179
skew_abs_365     0.022758     0.039874     0.305271      0.580179      1.000000

It looks like there are pleasingly low correlations between adjacent trading rules. I checked this for all the rules, with similar results.

Now let's check for variations of the skew rule, eg:

             skew_abs_14  skew_rv_14  skew_ts_14  skew_cs_14
skew_abs_14     1.000000    0.542259    0.996992    0.952949
skew_rv_14      0.542259    1.000000    0.543386    0.562397
skew_ts_14      0.996992    0.543386    1.000000    0.948784
skew_cs_14      0.952949    0.562397    0.948784    1.000000

Wow! Looks like the absolute, time series and cross sectional variations are basically doing the same thing. Checking the other rules I see similarly high correlations, although they tend to be a bit lower for longer lookbacks.

Whipping out Occams razor, it seems to make most sense to drop the time series and cross sectional rules completely since they are more complex implementations of the basic 'abs' rule but add little diversification. We'll keep the cross asset class relative value for now, since that does something quite different.

Now let's check across styles:

                carry  ewmac4_16  skew_abs_14  skewK_abs_14  kurtS_abs_14
carry         1.000000   0.079025    -0.020398      0.018712      0.053978
ewmac4_16     0.079025   1.000000     0.129336      0.077702      0.080301
skew_abs_14  -0.020398   0.129336     1.000000      0.184635      0.120404
skewK_abs_14  0.018712   0.077702     0.184635      1.000000      0.821673
kurtS_abs_14  0.053978   0.080301     0.120404      0.821673      1.000000

Skew conditioned on Kurtosis, and kurtosis conditioned on skew, seem to have a highish correlation. That's also true for the cross sectional variants:

                carry  ewmac4_16  skew_cs_30  skewK_cs_30  kurtS_cs_30
carry        1.000000   0.079025    0.039870     0.032401     0.053643
ewmac4_16    0.079025   1.000000    0.118919     0.012837     0.044516
skew_cs_30   0.039870   0.118919    1.000000     0.151807     0.000230
skewK_cs_30  0.032401   0.012837    0.151807     1.000000     0.843337
kurtS_cs_30  0.053643   0.044516    0.000230     0.843337     1.000000

That pattern holds true all the way up the longest lookbacks. It probably doesn't make sense to have two skew rules, so let's drop the skew conditioned on Kurtosis - again this is the more complex rule.

This leaves us with the following rules:
  • a pure skew rule ('skew')
  • a kurtosis conditioned on skew rule ('kurtS')
And each of these rules can be applied in two different ways (essentially two kinds of demeaning):
  • Absolute: versus the average across all assets and time periods [an alternative for pure skew is to use zero as the average here, but let's be consistent] ('_abs')
  • Relative to the current cross sectional average within the relevant asset class ('_rv' i.e. relative value)

Finally each of these rules will have 5 variations, for the five periods over which skew/kurtosis will be measured:
  • 14 days ('_14D')
  • 1 month ('_30D')
  • 3 months ('_90D')
  • 6 months ('_180D)
  • 12 months ('_365D')
So we now have 'just' 5*2*2 = 20 rules. Much more managable.

Trading rule allocation

Proceeding with S&P 500 for now, let's see how my handcrafting method allocates weights:

portfolio = system.combForecast.calculation_of_raw_estimated_forecast_weights("SP500").results[-1].diag['hc_portfolio']

[' Contains 3 sub portfolios', 
 ['[0] Contains 3 sub portfolios', (Skew and RV kurtosis)
  ['[0][0] Contains 3 sub portfolios', (Slower skew rules)
   ["[0][0][0] Contains ['skew_abs_180', 'skew_abs_365', 'skew_abs_90']"], 
   ["[0][0][1] Contains ['skew_rv_180', 'skew_rv_90']"], 
   ["[0][0][2] Contains ['skew_rv_365']"]], 
  ['[0][1] Contains 2 sub portfolios', (Faster skew rules)
   ["[0][1][0] Contains ['skew_abs_14', 'skew_rv_14']"], (very fast skew)
   ["[0][1][1] Contains ['skew_abs_30', 'skew_rv_30']"]], (fastish skew)
  ['[0][2] Contains 3 sub portfolios', (Mostly RV kurtosis)
   ["[0][2][0] Contains ['kurtS_rv_180', 'kurtS_rv_365']"], 
   ["[0][2][1] Contains ['kurtS_abs_14', 'kurtS_rv_14']"], 
   ["[0][2][2] Contains ['kurtS_rv_30', 'kurtS_rv_90']"]]], 
 ['[1] Contains 3 sub portfolios',  (Carry and most absolute kurtosis)
  ["[1][0] Contains ['carry', 'kurtS_abs_180', 'kurtS_abs_365']"], 
  ["[1][1] Contains ['kurtS_abs_30']"], 
  ["[1][2] Contains ['kurtS_abs_90']"]], 
 ['[2] Contains 3 sub portfolios', (Momentum)
  ["[2][0] Contains ['ewmac2_8', 'ewmac4_16']"], (Fast mom) 
  ["[2][1] Contains ['ewmac32_128', 'ewmac64_256']"], (Slow mom)
  ["[2][2] Contains ['ewmac16_64', 'ewmac8_32']"]]] (medium mom)

I've added some notes manually, the algo doesn't do this labelling for us.

Summary of weights:

[(rule, weight) for rule,weight in zip(list(portfolio.all_instruments), portfolio.cash_weights)]

Carry 9.1%
EMWAC 12.9%
skew_abs 19.9%
skew_rv 17.0%
kurtS_abs 10.8%
kurtS_rv 29.2%


Okay, it's time for the moment of truth. How well do these trading rules actually perform?

First let's check out the skew rules:

select_rules = ['skew_abs_14', 'skew_abs_30', 'skew_abs_90', 'skew_abs_180', 'skew_abs_365']

The best performing 'vanilla' skew rule is the one with a 365 day lookback. A one year lookback is also what was used in the canonical paper on skew / futures (more on this later). It has a SR of 0.33. Not up there with the EWMAC and carry rules with SR of 0.9 plus (excluding the fastest EWMAC that comes in at just 0.5), but positive at least. Thereafter there is a very clear pattern with faster skew rules doing worse.

Incidentally the 'flat spot' on the blue line is because it can only be traded by the cheaper markets, none of which have data before the year 2000.

What about RV skew?

A similar(ish) pattern here with the slowest skew rules coming in at SR of around 0.35, and the faster rules being rather unhelpful.

Now for kurtosis (conditioned on skew):

Hmmm. Nothing to shoot the lights out there eithier.

Rule selection part N

The holy grail is a trading rule that is negatively correlated to something we've already got, and has a positive Sharpe Ratio. In my original post on trend following and skew I noted that skew for interesting reasons was likely to be negatively correlated with momentum at certain speeds, and seems to have positive performance.

In this post the negative correlation seems to have been borne out (or at least the correlation is basically zero), but the positive performance is patchy. Nevertheless, in my 'ideas first' paradigm (described here), I will sometimes use rules that don't have statistically significant performance if their original motivation is well founded. So it might be worth chucking some skew and kurtosis into the mix.

The slower skew rules (of both flavours) do a reasonable job, and they are logical and straightforward rules with a well motivated reason as to why they should work. Thanks to my prior work, I also have a good understanding of how they interact with momentum.

I'm a little less comfortable with the kurtosis rules; the conditioning makes it a little more complex than something I'd normally contemplate using. I think here I got a little carried away with demonstrating how clever I could be (okay K - K_mu * sign(S - S_mu) isn't exactly the general theory of relativity, but it's much more complex than EWMA_f - EWMA_s). On balance I would prefer not to use the kurtosis rules, even though their cumulative SR is similar to skew.

Some thoughts on fitting

It's worth noting that the 365 day skew rule, which did the best here, is the same lookback used by this paper Here is an opportunity for me to quickly (remind / tell) you about my framework for three kinds of fitting.

Tacit fitting would have happened if I had used the 365 day rule having read it in that paper. We know that academic papers which don't have useful results are rarely published. Therefore there is a chance that the academics in question tried different formulations before deciding on 365 days. Of course this might not be true, and they could have just used 365 days; realised it worked, and moved on*. The fact this is a 365 day lookback, and not 275.4, makes this more plausible. Still the risk is there.

* And they could also have got the 365 days from another paper, whose authors tried different variations. Same problem.

Implicit fitting would be if I had run these backtests and chosen the best performing rule variations to use in my trading system (which as it happens were skew_abs_365 and skew_rv_180 if you're interested). Then when I ran my backtest again it would have looked pretty dammn good.

Explicit fitting is what I've actually done; used a mechanical rule to decide which rules are good, and should get more capital; and which are poor. This is the best kind, as long as you do it in a robust way that understands signal:noise ratios, and in a backward looking rolling out of sample fashion.

Having stated I will, going forward, only use the two skew rules am I guilty of implicit fitting? After all I have modified the configuration of my backtest after peeking at all the data. To a degree this is true. But I offer two defenses. Firstly, I'm still using all the different variations of the rules from 14 day to 365 day lookbacks and allowing the system to weight them appropriately. Secondly, removing the kurtosis rules doesn't really affect the performance of the system one way or another. So it's not like I'm biasing my backtest SR upwards by 50 basis points.

Portfolio level results

Having done all this, what effect does adding the two types of skew rule to my standard backtest have?

The orange line is the original backtest, and the blue line is the new one. Looks decent enough, but the improvement is only 7bp of SR. Still I've always been a fan of systems that use lots of simple rules, each adding a little extra to the mix, and even 7bp is better than a punch in the face with a sharp stick.


This exercise has been a little dissapointing, as I had hoped the skew rules would be a little more exciting performance-wise, but I've demonstrated some important practices. I've also had some fun adding extra flexibility to pysystemtrade.

Tuesday, 10 December 2019

New and Improved Sharpe Ratio adjustment in the handcrafting method

In my recent posts on skew and kurtosis I've put together a large number of ideas for possible trading strategies. The next step will be to create and test these ideas out. However I already know from my initial analysis that many of these ideas will probably have poor performance. This leaves me in something of a conundrum. In fact this is just one example of the conundrum that strategy developers face all the time.

Do I ignore these strategies completely, which will result in my historic back test being overstated (a sin I have named implicit fitting)? Or do I include them in a portfolio of strategies, and hope that the allocation methodology will downweight them because of their poor performance?

The problem is that methods for portfolio allocation tend to fall into one of two camps. There are robust methods that completely ignore performance, from the simple 'one over n' equal weighting all the way up to more complex methods like Hierarchical Risk Parity (HRP). Those that do use the performance of strategies, like the basic naive Markowitz optimisation, tend to be very un-robust. Small differences in mean result in big differences in allocation.

There is no easy answer here. On the one hand, the mean (or equivalently Sharpe Ratio) is the moment of the distribution with the largest sampling error, and which is the hardest to predict. On the other hand, there really ought to be some way to downweight and ultimately drop consistently money losing strategies.

In the handcrafting method, which I wrote about in some detail this time last year (the first post is here), I use a simple heuristic technique to adjust weights for different Sharpe Ratios. Basically it's the yellow line on this graph:

The x-axis shows the relative SR of an asset compared to the average of a portfolio (or sub-portfolio). The y-axis is the weighting that is applied. So for an asset with the average SR, the weighting factor is exactly 1.

The method is then simple: we multiply all the weights in a (sub)portfolio by the weighting factor given by their relative SR. We then re-normalise the weights so they add up to 100%. Assets with higher SR have higher weights, and vice-versa. But extreme weights don't occur, since the weight multiplier is never below 0.6 or above 1.4.

(As an aside, the handcrafting method also accounts for uncertainty in correlations, but not in standard deviations. The latter is relatively small and has minimal effect on portfolio weights)

The heuristic has the advantage of being robust, fast and intuitive. The disadvantage is that this technique does not account for important factors. In particular, we can become more confident about our weighting when (a) we have more data and (b) the correlation between assets is higher. In the right circumstances a weight of almost zero would be justified - effectively entirely removing an asset or trading rule variation from the portfolio.

In this post I modify the simple heuristic to account for these factors. Although this slightly increases the calculation time for the handcrafted optimisation, it is worth doing. There will be python code.

Where does the heuristic come from?

The heuristic essentially comes from doing a portfolio optimisation on a very simple arbitrary portfolio, which has two assets with:

  • Asset one: some arbitrary Sharpe Ratio (I use SR=0.5 which makes sense for the sorts of trading rules I use)
  • Asset two: A higher or lower Sharpe Ratio
  • Both assets: Some arbitrary identical standard deviation (I use 15% though in fact this doesn't affect the results)
  • Some correlation
If both assets had the same Sharpe Ratio then the optimal allocation would be 50% each, irrespective of correlation. We can check this:

### Following code is boilerplate for optimising
from scipy.optimize import minimize
import numpy as np
import scipy.stats as stats
import pandas as pd

def optimise(mean_list, avg_correlation, std): corr_matrix = boring_corr_matrix(len(mean_list), offdiag=avg_correlation) stdev_list = [std]*len(mean_list) sigma = sigma_from_corr_and_std(stdev_list, corr_matrix) return optimise_with_sigma(sigma, mean_list) def optimise_with_sigma(sigma, mean_list): mus = np.array(mean_list, ndmin=2).transpose() number_assets = sigma.shape[1] start_weights = [1.0 / number_assets] * number_assets # Constraints - positive weights, adding to 1.0
    bounds = [(0.0, 1.0)] * number_assets
    cdict = [{'type': 'eq', 'fun': addem}]
    ans = minimize(
        start_weights, (sigma, mus),
    weights = ans['x']

    return weights

def neg_SR(weights, sigma, mus):
    # Returns minus the Sharpe Ratio (as we're minimising)    
    weights = np.matrix(weights)
    estreturn = (weights * mus)[0, 0]
    std_dev = (variance(weights, sigma)**.5)

    return -estreturn / std_dev

def variance(weights, sigma):
    # returns the variance (NOT standard deviation) given weights and sigma    
    return (weights * sigma * weights.transpose())[0, 0]

def sigma_from_corr_and_std(stdev_list, corrmatrix):
    stdev = np.array(stdev_list, ndmin=2).transpose()
    sigma = stdev * corrmatrix * stdev

    return sigma

def addem(weights):
    # Used for constraints    
    return 1.0 - sum(weights)

def boring_corr_matrix(size, offdiag=0.99, diag=1.0):
    """    Create a boring correlation matrix
    :param size: dimensions    
    :param offdiag: value to put in off diagonal    
    :param diag: value to put in diagonal    
    :return: np.array 2 dimensions, size    
    size_index = range(size)

    def _od(i, j, offdiag, diag):
        if i == j:
            return diag
            return offdiag

    m = [[_od(i, j, offdiag, diag) for i in size_index] for j in size_index]
    m = np.array(m)
    return m

# Both have same mean, and hence SR, just vary the correlation
>>> optimise([0.05, 0.05], 0.0, 0.15)
array([0.5, 0.5])
>>> optimise([0.05, 0.05], 0.9, 0.15)
array([0.5, 0.5])
>>> optimise([0.05, 0.05], -0.9, 0.15)
array([0.5, 0.5])

A small difference in means will result in crazy weights, especially if we have highish correlation:

optimise([0.05, 0.06], 0.9, 0.15)
array([1.11022302e-16, 1.00000000e+00])

There is more of this kind of analysis in Smart Portfolios if you're curious

We need some way of accounting for the uncertainty in our mean estimate (in principal we should do the same for correlation and vol; but the handcrafting method already does this for correlation, and vol is relatively predictable).

Let's begin with the textbook formula for sampling error for a difference in means:

Line 1: Standard deviation of the mean estimate, asset i, given sigma_i (annualised standard deviation) and N is the number of years of data
Line 2: Ditto for asset j
Line 3: Standard deviation of the difference in means
Line 4: If both assets have the same standard deviation this simplifies to the final line.

This gives us the standard deviation of the distribution of the differences in mean estimate. The mean of that distribution is just the difference between the two sample means. Thanks to central limit theorem, the distribution is Gaussian .So we can work out the cumulative probability of any particular difference in mean estimate.

Here is an implementation of this in python:

def calculate_omega_difference(std, years_of_data, avg_correlation):
    omega_one_asset = std / (years_of_data)**.5    
    omega_variance_difference = 2*(omega_one_asset**2)*(1- avg_correlation)
    omega_difference = omega_variance_difference**.5
    return omega_difference

def calculate_confident_mean_difference(std, years_of_data, mean_difference, confidence_interval, avg_correlation):
    omega_difference = calculate_omega_difference(std, years_of_data, avg_correlation)
    confident_mean_difference = stats.norm(mean_difference, omega_difference).ppf(confidence_interval)

    return confident_mean_difference

# The 0.5 point will just be the mean difference
>>> calculate_confident_mean_difference(0.15, 10, 0.3, 0.5, 0.0)

# It's plausible (5% chance) the difference is quite a bit lower than 0.3 with enough data
>>> calculate_confident_mean_difference(0.15, 10, 0.3, 0.05, 0.0)

# With insufficient date we can't be confident the mean is even positive
>>> calculate_confident_mean_difference(0.15, 1, 0.3, 0.05, 0.0)

# ... unless the correlation is really high 
>>> calculate_confident_mean_difference(0.15, 1, 0.3, 0.05, 0.95)

# Trivially if we increase the mean difference the distribution is just shifted
>>> calculate_confident_mean_difference(0.15, 1, 0.6, 0.05, 0.95)

It's worth playing around with this code to get a feel for how the mean distribution is affected by (a) the correlation and (b) the amount of data available (it's less interesting to change the standard deviation or average mean).

We can now see how different levels of confidence in our mean estimate affect our portfolio weights. This function optimises, fixing the Sharpe Ratio of the second asset, and applying an adjustment to the first Sharpe Ratio.

def weights_given_SR_diff(SR_diff,  avg_correlation,  confidence_interval, years_of_data,
                          avg_SR=0.5, std=0.15, how_many_assets=2):
    """    Return the ratio of weight to 1/N weight for an asset with unusual SR
    :param SR_diff: Difference between the SR and the average SR. 0.0 indicates same as average    
    :param avg_correlation: Average correlation amongst assets    
    :param years_of_data: How long has this been going one    
    :param avg_SR: Average SR to use for other asset    
    :param confidence_interval: How confident are we about our mean estimate (i.e. cdf point)    
    :param how_many_assets: .... are we optimising over (I only consider 2, but let's keep it general)    
    :param std: Standard deviation to use
    :return: Ratio of weight, where 1.0 means no difference    
    # From SR to means
    average_mean = avg_SR * std
    asset1_mean = (SR_diff + avg_SR)*std

    mean_difference = asset1_mean - average_mean

    ## Work out what the mean is with appropriate confidence    
    confident_mean_difference = calculate_confident_mean_difference(std, years_of_data, mean_difference, confidence_interval, avg_correlation)

    confident_asset1_mean = confident_mean_difference + average_mean

    mean_list = [confident_asset1_mean]+[average_mean]*(how_many_assets-1)

    weights = optimise(mean_list, avg_correlation, std)

    return list(weights)

# Median estimate (50%, 0.5 point) for a modest SR improvement (0.2) with lowish
# correlation (0.3) and a decade of data (10). 
# Equivalent to optimise([0.7*.15, 0.5*.15], 0.3, 0.15)
>>> weights_given_SR_diff(0.2, 0.3, 0.5, 10)
[0.6529536546603577, 0.3470463453396423]

# It's plausible the weight to the first asset could be lower (10% point, 0.1)...
>>> weights_given_SR_diff(0.2, 0.3, 0.1, 10)
[0.14015358673119246, 0.8598464132688076]

# or higher (90% point, 0.9)...
>>> weights_given_SR_diff(0.2, 0.3, 0.9, 10)
[0.8741474140876935, 0.12585258591230658]

I'm going to do a sort of monte-carlo exercise, but not really. If we consider a number of fixed points on a distribution, equally spaced, then each are equally likely. For example the nine points, 10%, 20%, .... 90% are all equally likely to occur. If we optimise the portfolio weights for a 10% degree of confidence, then 20%, and so on; then take an average of the weights, we basically get the same effect as if we repeatedly sampled the return distribution in the normal way (the non-parametric bootstrapping I discussed in Systematic Trading). But the advantage of using this method is that we only need to run N optimisations (where N is hopefully not too big - to be determined shortly), we lose , and more interestingly we don't get the randomness of bootstrapping - the same parameters should always produce the same result.

def mini_bootstrap_ratio_given_SR_diff(SR_diff,  avg_correlation, years_of_data, avg_SR=0.5, std=0.15, how_many_assets=2,
    Do a parametric bootstrap of portfolio weights to tell you what the ratio should be between an asset which       
    has a higher backtested SR (by SR_diff) versus another asset(s) with average Sharpe Ratio (avg_SR)
    All assets are assumed to have same standard deviation and correlation
    :param SR_diff: Difference in performance in Sharpe Ratio (SR) units between one asset and the rest    
    :param avg_correlation: Average correlation across portfolio    
    :param years_of_data: How many years of data do you have (can be float for partial years)    
    :param avg_SR: Should be realistic for your type of trading    
    :param std: Standard deviation (doesn't affect results, just a scaling parameter)    
    :param how_many_assets: How many assets in the imaginary portfolio    
    :param p_step: Step size to go through in the CDF of the mean estimate    
    :return: float, ratio of weight of asset with different SR to 1/n weight    
    dist_points = np.arange(p_step, stop=(1-pstep)+0.000001, step=p_step)
    list_of_weights = [weights_given_SR_diff(SR_diff,  avg_correlation,
                                          confidence_interval, years_of_data, avg_SR=avg_SR, std=std,
                     for confidence_interval in dist_points]

    array_of_weights = np.array(list_of_weights)
    average_weights = np.nanmean(array_of_weights, axis=0)
    ratio_of_weights = weight_ratio(average_weights)

    if np.sign(ratio_of_weights-1.0)!=np.sign(SR_diff):
        # This shouldn't happen, and only occurs because weight distributions get curtailed at zero        
        return 1.0

    return ratio_of_weights

def weight_ratio(weights):
    Return the ratio of weight of first asset to 1/n weight
    :param weights:    
    :return: float    
    one_over_N_weight = 1.0/len(weights)
    weight_first_asset = weights[0]

    return weight_first_asset/one_over_N_weight

Notice that the function returns the 'weight ratio': the ratio of the optimised  weight to a 1/N weight, since that is what is used in the heuristic method.

There are several open questions here, relating to the parameters we have to specify. Average correlation and the amount of available data are key variables which I will explore in a moment. The choice of average Sharpe Ratio will influence our results to a lesser degree, but to keep things simple I will be sticking to avg_SR=0.5. Changing the standard deviation will have absolutely no effect on the results as it is just a scaling factor.

Finally there is a straightforward methodological issue - the choice of 'p-step' i.e.  how many of these optimisations do we need to do for an accurate answer? Is 0.1,0.2,... 0.9 sufficient (in the function, p-step=0.1) or do we need to be more granular?

import matplotlib
how_many_assets = 2
avg_SR = 0.5
avg_correlation = 0.5
years_of_data = 10
SR_diff_list = list(np.arange(-0.5, stop=0.5, step=0.05))
list_of_psteps = [0.2, 0.1, 0.05, 0.1, 0.01, 0.001]

results_over_psteps = dict()
for pstep in list_of_psteps:
    results_over_SR_diff = []
    for SR_diff in SR_diff_list:
        print("Pstep %f SR_diff %f Years of data %d" % (pstep, SR_diff, years_of_data))
        ratio = mini_bootstrap_ratio_given_SR_diff(SR_diff, how_many_assets,
                                                   avg_correlation, years_of_data, avg_SR=avg_SR,
                                                   p_step = pstep)
    results_over_psteps[pstep] = results_over_SR_diff

df_results = pd.DataFrame(results_over_psteps, index=SR_diff_list)
plot_title = "(Avg correlation %f Number of assets %d, Average SR %f Years of data %d; Lines show pstes %f:%f)" \
             % (avg_correlation, how_many_assets, avg_SR, years_of_data, list_of_psteps[0], list_of_psteps[-1])
matplotlib.pyplot.xlabel("SR relative to portfolio average SR")

I would say that a p-step of 0.01 (i.e. distributional points at 0.01, 0.02, 0.03...0.99) is more than sufficient to give an accurate result (very close to the much more finely grained 0.001).

A minor point: in this and all the following plots the lines do not (as you might expect) go through the point (0.0, 1.0). This is an artifact of the portfolio optimisation process, and not a bug. If I was using these plots in a book, I'd shift them to get them to look nice. As it is, because weights are always renormalised to 100%, it won't make any difference.

How is the heuristic weighting affected by having more data?

how_many_assets = 2
SR_diff_list = list(np.arange(-0.5, stop=0.5, step=0.05))
various_lengths_of_history = [2,5,10,20,30,40]
p_step = 0.01avg_correlation=0.5results_over_years = dict()
for years_of_data in various_lengths_of_history:
    results_over_SR_diff = []
    for SR_diff in SR_diff_list:
        print("Correlation %f SR_diff %f Years of data %d" % (avg_correlation, SR_diff, years_of_data))

        ratio = mini_bootstrap_ratio_given_SR_diff(SR_diff, avg_correlation, years_of_data, avg_SR=avg_SR,
                                                   std=std, how_many_assets=how_many_assets, p_step = p_step)
    results_over_years[years_of_data] = results_over_SR_diff

df_results = pd.DataFrame(results_over_years, index = SR_diff_list)
plot_title = "Avg correlation %f (Number of assets %d, Average SR %f; Lines show years of data %d:%d)" \
                        % (avg_correlation,     how_many_assets,   avg_SR,  various_lengths_of_history[0], various_lengths_of_history[-1])
df_results.plot(title = plot_title)
matplotlib.pyplot.xlabel("SR relative to portfolio average SR")
What we expect to see here is more data -> increase in confidence -> more extreme weights, when we have more history. And that is exactly what we see; with 2 years of data the lowest weight we can get to is around 0.5, whilst with 30 years plus it's almost zero.

How is the weighting affected by average correlations?

how_many_assets = 2
SR_diff_list = list(np.arange(-0.5, stop=0.5, step=0.05))
years_of_data = 10
p_step = 0.01
various_correlations=[0.0, 0.3, 0.5, 0.7, 0.9]
results_over_correl = dict()
for avg_correlation in various_correlations:
    results_over_SR_diff = []
    for SR_diff in SR_diff_list:
        print("Correlation %f SR_diff %f Years of data %d" % (avg_correlation, SR_diff, years_of_data))

        ratio = mini_bootstrap_ratio_given_SR_diff(SR_diff, avg_correlation, years_of_data, avg_SR=avg_SR,
                                                   std=std, how_many_assets=how_many_assets, p_step = p_step)
    results_over_correl[avg_correlation] = results_over_SR_diff

df_results = pd.DataFrame(results_over_correl, index = SR_diff_list)
plot_title = "Years %d (Number of assets %d, Average SR %f; Lines show correlations %d:%d)" \
                        % (years_of_data,     how_many_assets,   avg_SR,  various_correlations[0], various_correlations[-1])
df_results.plot(title = plot_title)
matplotlib.pyplot.xlabel("SR relative to portfolio average SR")

The higher the correlation between two strategies, the more confident we can be that a small difference in Sharpe Ratios is meaningful, and the more extreme weights we can use.

Plugging into pysystemtrade

Now seems a good opportunity to test how effective this method is. So let's plug in the above code into the handcrafting implementation in my open source python backtesting engine, pysystemtrade.

The optimisation code is a bit of a mess, but as a starting point towards refactoring I've moved the core optimisation routines which are needed here into another file to avoid double coding or messing up the heirarchy. I've also updated to include the new method.

Let's check things are working as we expect with a simple example. Imagine we're trading the VIX with some vanilla trading rules, plus a dumb one: Always go long the VIX:

from systems.provided.futures_chapter15.estimatedsystem import futures_system
from systems.provided.moretradingrules.morerules import long_bias
from systems.forecasting import TradingRule

system = futures_system(log_level="on")
config = system.config
new_rule = TradingRule(long_bias) config.trading_rules['long'] = new_rule

# Following line means we won't be adjusting
system = futures_system(log_level="on", config=config)

First we'll check what happens without any Sharpe Ratio adjustment.

instrument_returns = system.combForecast.get_returns_for_optimisation("VIX").to_frame()
instrument_returns = instrument_returns.resample("W").sum()

p=Portfolio(instrument_returns, use_SR_estimates=False)

>>> p.sharpe_ratio
array([ 0.6591312 ,  0.50752412,  0.5209934 ,  0.48374054, -0.52792834])

>>> p.corr_matrix
                carry  ewmac16_64  ewmac32_128  ewmac64_256      long
carry        1.000000    0.681041     0.844301     0.913307 -0.872324
ewmac16_64   0.681041    1.000000     0.914988     0.782395 -0.528207
ewmac32_128  0.844301    0.914988     1.000000     0.946146 -0.659741
ewmac64_256  0.913307    0.782395     0.946146     1.000000 -0.726869
long        -0.872324   -0.528207    -0.659741    -0.726869  1.000000

>>> p.show_subportfolio_tree()
Natural top level grouping used
[' Contains 3 sub portfolios', 
["[0] Contains ['carry', 'ewmac32_128', 'ewmac64_256']"], 
["[1] Contains ['ewmac16_64']"], ["[2] Contains ['long']"]]

>>> p.cash_weights
Natural top level grouping used
[0.06658, 0.3790, 0.08106, 0.0653, 0.4079]

It's very diversifying, so get's it's own group and a chunky weight.

Now with the adjustment applied:

system = futures_system(log_level="on", config=config)

instrument_returns = system.combForecast.get_returns_for_optimisation("VIX").to_frame()
instrument_returns = instrument_returns.resample("W").sum()

>>> p.cash_weights
Natural top level grouping used
[0.1092, 0.5553, 0.1329, 0.09569, 0.1067]

Let's look at the time series of weights to see how they change as we get more data:


The "always long" trading rule weight starts off pretty high after 2008 when it performed very well. Subsequently however it's performance fades, and so does it's weight (albeit there is some jumping around as the group weights change).


I'm happy that the heuristic for Sharpe Ratio now deals properly with assets that have different correlations and lengths of available data. In the new year I'll be using the upgraded handcrafting method to test the trading rules for skew and kurtosis that I've been playing with for the last couple of posts.

Tuesday, 12 November 2019

Kurtosis and expected returns

In my last post, I stated my intention to write a series of posts about skew.

Slight change of plan, since one loyal reader suggested that I write about kurtosis. I thought that might be fun, since I haven't thought about kurtosis much, and the literature on kurtosis isn't as well developed. It turns out that considering both together leads to some very interesting results.

The plan is to basically repeat my previous analysis of skew for kurtosis. Then my subsequent posts on this subject will discuss both skew AND kurtosis. Hope that makes some kind of sense.

Not "everything you always wanted to know about kurtosis, but we're afraid to ask", but enough to understand this post

The first four moments of a distribution are:

- mean
- standard deviation
- skew
- kurtosis

In laymans terms, these define:

- how positive or negative the "middle"* of a distribution is
- how wide the distribution is
- how symettric the distribution is, or is not
- whether the distribution is mostly lumped into the middle, or spreads out to the edges

* notice I've used the comically imprecise term 'middle' here to avoid any mean vs median arguments

High kurtosis then means extreme events are more common than a vanilla Gaussian distribution would suggest. High kurtosis means fat tails. High kurtosis means every financial time series, ever.

Interpreting the first 3 moments is pretty easy, kurtosis less so. In this post I'm going to be using the standard kurtosis measure used in pandas. For reasons to boring to go into, the kurtosis of a normal distribution is 3. The pandas measure looks at excess kurtosis, so a figure of 0 means 'the normal amount of kurtosis'*, and any positive number means more than that. What does a kurtosis of 1.0 mean? Or 10.0? No idea really, since I don't have any intuition for the figure - one of the reasons to do this post is to get some.

* It's worth checking this yourself by generating random Gaussian data and measuring the kurtosis. I leave this as an exercise for the reader.

I won't be repeating the code since it's identical to the previous post, but with the string 'skew' replaced with 'kurtosis'.

Niggle: Not quite true, thanks to the bizzare and ever changing pandas API. These will all give averages over the whole Series:

percentage_returns[code].kurtosis(0) # omitting the zero will give a rolling figure(!)
percentage_returns[code].kurt() # same as kurtosis(0)

Variance in kurtosis estimates

So how accurately can we measure kurtosis? Here are some bootstrapped distributions of sample variance. Firstly, here it is for  EURUSD, which has a relatively low kurtosis:

Now here are US 2 year bonds, which has a relatively high kurtosis:

Somewhat unsurprisingly, the higher the Kurtosis the wider the estimate. Big kurtosis means outliers; resampling means we'll sometimes catch the outliers and sometimes will get more than our fair share of them: so a big variation in potential kurtosis.

Let's do a boxplot for everything:

Unlike skew there aren't any obvious patterns here; assets we'd expect to have similar kurtosis (like US bond futures) are all over the shop: 20 years are low, 5 years have a bit more, 10 years a bit more again, and 2 years have absolutely loads of the stuff.

Some of this is due to time varying regimes, a problem which will go away later in the post. For example here are US 2 year daily returns:

Pretty crazy in the financial crisis, and then they settle down somewhat. Measuring kurtosis post 2009 is likely to give a very different answer.

Do assets with generally higher kurtosis have higher returns?

It was easy to tell a story about skew preference; negative skew is clearly a bad thing and we should be rewared for holding it.

Should we get paid for high kurtosis? There are two possibilities. If the kurtosis is coming from a positive fat tail we might expect people to overpay for the chance to 'win a lottery': they will prefer high kurtosis, and there will be higher expected returns for low kurtosis assets. But if the kurtosis is coming from a negative fat tail then people will dislike it a lot.

Anyway, on average we're not paid for kurtosis:

X-axis, kurtosis over whole sample. Y-axis: average daily return

Ignoring the two vol markets, whose kurtosis is nowhere near as bad as their skew, the relationship looks ... slightly negative?

Here's a boxplot showing the distribution of resampled daily returns for high kurtosis (over 5.5) versus low kurtosis (under 5.5) instruments:

It does looks like low kurtosis is better than high, suggesting the 'lottery ticket preference' is holding up here: people overpay for high kurtosis. But we need to condition on skew to determine whether that hypothesis is correct or not.

Incidentally if you are worried about the vol markets, VIX comes in as low kurtosis and V2X as high.

Does the about finding still hold for risk adjusted returns, i.e. Sharpe Ratios?

Looks like something is still there, although perhaps not as significant as for outright returns. Lower kurtosis assets seem to have about 0.2 SR points extra per year.

Does an asset with currently higher kurtosis outperform one which has lower current kurtosis? (time series forecasting)

As with skew I'm going to measure kurtosis over different time periods; from a week up to a year of historic returns. Then I will do a t-test to see if assets that currently have higher kurtosis than the global median (about ) outperform those with lower kurtosis.

First the Sharpe Ratios, conditional on recent kurtosis:

It looks like, unlike skew, the preference for high kurtosis is something that appears most strongly at short horizons.

Now the t-statistics, comparing low and high kurtosis:

Basically noise.

How do current skew and kurtosis forecast future returns? (time series forecasting)

As I mentioned above there is a big difference between high kurtosis coming from positive returns, and the same from negative returns. Perhaps we will see something more interesting if we look at the combination of skew and kurtosis.

Same as before, different frequencies, but this time we look at both skew and kurtosis preceeding the date when we estimate a forward looking SR. First Sharpe Ratios:

This is without doubt the most interesting graph so far.

(Sure, but quite a low bar to beat...)

Remember we had two hypothesis about kurtosis:
  • People dislike lumpy returns, and want to avoid them. High kurtosis should always pay more than low kurtosis.
  • People only dislike lumpy returns if they're negative. They're happy to pay more for lottery tickets. High kurtosis should outperform low kurtosis for positive skew assets. For negative skew assets the relationship should be reversed

Let's summarise the findings of this graph and the previous post:

  • Negative skew* assets outperform, most strongly at longer horizons (from my previous blog post).
  • High kurtosis assets underperform, most strongly at shorter horizons (discussed earlier)
  • Within assets that have high kurtosis, at short horizons positive skew is rewarded. At longer horizons there is nothing meaningful.
  • Within assets that have low kurtosis, at longer horizons negative skew is rewarded. At shorter horizons there is nothing meaningful.
  • Within assets that have positive skew, at short horizons high kurtosis is rewarded 
  • Within assets that have negative skew, kurtosis is irrelevant

* incidentally I'm using zero as my skew cutoff here for simplicity, as in the previous post I determined it didn't make much difference. For kurtosis there is no 'natural' cutoff, so I'm sticking to the historic sample median of around 5.5

Or to put it another way:

  • The dominance of negative skew assets at longer horizons is only relevant for assets with low kurtosis.
  • The outperformance of high kurtosis assets at shorter horizons is only relevant for assets with positive skew.
Unlike a simple high skew / low skew strategy it's difficult to pick apart what relationships we should focus on here, especially if we want to put things into a continous forecast framework without implicit fitting (it would be very easy to create a set of binary rules which embodied the above findings, and which would embody loads of forward looking information). Implicit overfitting is particularly likely here as we don't have the simple intuitive results we got from outright skew.

I decided to boil the above down to two simple trading rules:

The skew rule

{(skew - Average skew) / Sigma [Skew]} 
* sign (Kurtosis - average kurtosis)

The kurtosis rule

{(Kurtosis - Average kurtosis) / Sigma [Kurtosis]} 
* sign (Skew - average skew)

... where for this specification the average is across all instruments and all past history (currently done entirely in sample, but in a trading rule will be based on an expanding window), and sigma is a standard deviation based on the past history of all instruments (the sigma is not important now, but will be when we come to design trading rules to ensure we have properly normalised forecasts).

This have the advantage of being relatively parsimonious and symettrical, albeit a bit non linear. There is still potentially an issue with implicit fitting, but we can deal with that in later posts.

Under these conditions we'd have the following positions in both rules:

  • High kurtosis, high skew: Both Long (profitable at short horizons)
  • High kurtosis, low skew: Short (Does relatively badly at short horizons)
  • Low kurtosis, high skew: Short (Does relatively badly especially at long horizons)
  • Low kurtosis, low skew: Long (does relatively well at long horizons)

(It's possible to combine these into a single rule, however I like the idea of having a skew and a kurtosis rule and the effects work differently at varying horizons)

Let's look at the t-statistics:

(For example 'pos skew rule' is the kurtosis rule applied when skew is positive and so on; really this ought to be 'pos skew, kurtosis rule' but you get the idea).

Here positive t-statistic means a rule is working. It looks like all the rules work pretty well at a one month frequency, with the skew rule working especially well for longer periods when kurtosis is low.

Does an asset with different skew / kurtosis than normal perform better than average (normalised time series)?

We can modify the rules above so that instead of using the average across all assets we will actually use the average for a given instrument (we can also modify the standard deviation once we get to producing actual forecasts).

Interestingly the rules seem to be bad at the original sweet spot, although skew conditioned on low kurtosis still does very well at longer horizons.

Now let's demean on the current average across all assets:

Ouch. A pretty poor performance. The skew rules (red and green) in particular are very sensitive to frequency.

Finally let's do the same thing, but this time demean by the current median skew and kurtosis for a given asset class. 

Again, not really the best result.


  • It's hard to estimate kurtosis with any certainty, even harder when kurtosis is large (outliers)
  • Unexpectedly we don't get paid for owning assets with high kurtosis
  • .... and then it gets complicated

Yes, there's an awful lot of results in this post! 

The key finding is that, as you may expect, skew and kurtosis have more forecasting power when they are conditioned on each other. Generally we want to own instruments that have had high kurtosis and relatively positive skew: these are lottery tickets which for some reason the market undervalues. We also want to own instruments that have low kurtosis and relatively negative skew; here we get rewarded for negative skew without suffering too many outliers. Instruments where skew and kurtosis are in opposite directions are less attractive.

These effects don't persist that well when we use different demeaning techniques, unlike in skew world where they hold up quite well.

It's worth reflecting on what I have done so far. In the last post I considered 4 different skew trading rules (outright, time series demean, cross sectional demean, asset class cross sectional demean). In this one I've effectively come up with another 8: 4 for skew conditioned on kurtosis, 4 for kurtosis conditioned on skew. That's a total of 12 different trading rules, each of which potentially has 6 different variations for different lookbacks.

Though it would be tempting to select a few of these for further testing that would be implicit fitting; I would be doing so based on the analysis I have done so far having looked at all the data. Instead the right and proper thing will be to take forward all 12 rules into an analysis where their risk weights are fitted systematically in a backward looking framework. So that's the next post.