In continuing my futile quest to raise the level of debate in the quantitative investment community, I thought I'd have a go at another clever and very wealthy guy: Cliff Asness, founder of giant fund AQR.
|Cliff Asness. There is nothing wrong with his statistical intuition. Or his suit. Nice suit Cliff. (www.bloomberg.com)|
And, to disappoint you further, I'm not really going to 'have a go' at the authors of the paper.
I actually agree completely and wholeheartedly with the sentiments and the conclusions of the paper. I also know that the authors have done and also published other, very good, research which supports the results.
What I have a tiny little problem with is the way that the analysis is presented. More specifically it is about the interaction of a particular human weakness with a particular way of presenting data.
I should also say upfront that this is a problem by no means limited to this paper. It is endemic in the investment industry, and much worse beyond it*. It just so happens that I was sent this paper by an ex-colleague of mine very recently, and it got me a bit riled.
* Here's a great book on the subject of misrepresenting scientific statistical evidence which is worth reading.
So this post is not a criticism of AQR in particular; and I should reiterate that I have a huge amount of respect for their research and what they do generally.
What does the paper say?
If you are too idle to read the paper: it looks at the quarterly returns of a bunch of assets and style factors, conditioned on there being a terrible quarter in either the bond or equity markets (they also do some stuff with overlapping 12 month returns). For example, if you skip to exhibit 1, it shows that in the 10 worst quarters for equities since 1972, fixed income made money in 8 out of 10.
The ex-colleague of mine who sent me the paper made the following comment:
"Have you seen the paper attached yet? It is interesting that global bonds had a decent performance in the 10 worst quarters for global stocks in 1990-2014, but not the other way around... Trend following seems to have good performance when either stocks or bonds suffer."*
* I've decided to leave my friend anonymous, although he has kindly given permission for me to use this quote.
My first thought was "hmm... yes that is interesting".
Then after a few minutes I had a second thought:
"Hang on. There are only 10 observations. Is this even statistically significant? "
This was a serious problem. More so because the authors of the paper had also highlighted a key finding, which relates to something I talked about in my last post, trend following. From the paper:
"Trend was on average profitable in all asset classes returns during these equity tail events... As noted, Trend has often performed well in the worst equity quarters... Trend has been a surprisingly good equity tail hedge for more than a century"
I stared at the numbers, but I still couldn't decide whether they were meaningful or not. The underlying problem here is that humans are rubbish at intuitively judging statistical significance - even ones like my friend and I who actually understand the concept.
A bit of small sample statistics
Before proceeding let me briefly explain my outburst on statistical significance. Those who, unlike me, saw immediately whether the results were statistically significant or not can smugly skip ahead.
If we abstract away from the specifics we can reword my friend's statement as follows:
"Hypothesis 1: The average return for Bonds, conditional on poor returns for Equities, is positive."
"Hypothesis 2: The average return for Equities, conditional on poor returns for Bonds, is negative"
"Hypothesis 3: The average return for Trend, conditional on poor returns for Equities, is positive"
"Hypothesis 4: The average return for Trend, conditional on poor returns for Bonds, is positive"
The third hypothesis is also one of the main points the authors flagged up.
(By the way, and this is just a small criticism, it might have been more intuitive if the graphs in the paper had used Sharpe Ratio units rather than mean returns on the y axis; although the authors do quote the Sharpe Ratios in the headings. Specifically, it would probably have made sense to normalise the quoted returns by the volatility of the full sample.
However I guess what is more intuitive for me might not be to many other people; so I can live with this.)
Notice that in each of the four hypotheses we have an asset we're trying to predict returns for, and another asset that we are conditioning on. We can abstract further to avoid having to model the joint distribution in an explicit way (you can do this of course, but it would take longer to explain):
"Hypothesis 1: The average return for Bonds, in scenario X, is positive."
"Hypothesis 2: The average return for Equities, in scenario Y, is negative"
"Hypothesis 3: The average return for Trend, in scenario X, is positive"
"Hypothesis 4: The average return for Trend, in scenario Y, is positive"
Obviously Scenario X is bad equity returns, and scenario Y is poor bond returns. The next thing we need to think about is what econometricians would call the data generating process (DGP). This isn't so much where the data is coming from, but where we pretend it's coming from.
We'll treat scenario X and scenario Y individually. Scenario X then consists of a sample of 10 returns drawn from a much larger population which we can't see. Scenario Y is another 10 returns drawn from a different population. The sample mean return for bonds in X is +3.9%; and for equities in Y is -2.7% (from exhibits 1 and 3 respectively). For Trend it's 6.4% for X, and 3% for Y.
I'm also going to assume that the underlying population is Gaussian, with some unknown mean; but with a standard deviation equal to that of the sample standard deviation*; which for bonds X is about 3% a quarter; for equities Y around 5.2% a quarter, and for Trend 7% (X) and 5.3% (Y). This is all a little unrealistic, but again it would be more complicated to do it another way, and it doesn't change the core message.
* Interestingly the full period standard deviation for bonds is 2.6%** a quarter, equities 6.95%, and trend 5%. Risk seems to be a little higher than normal in an equity crisis, but not so much when bonds are selling off.
** derived from annualised figures assuming no autocorrelation between quarterly returns
Looking at my hypotheses, the null I'm trying to disprove in each case is that the true population mean return is zero (I could do it other ways, but this is simpler). So let me generate, by Monte Carlo simulation, the distribution of the sample mean statistic for 10 observations, given the estimated standard deviation:
import numpy as np
import random as rnd
import matplotlib.pyplot as plt

monte_carlo=100000   # number of Monte Carlo draws
sample_size=10       # ten conditional quarters in each scenario
stdev_dict=dict(BOND=3.0, EQUITY=5.2, TRENDX=7.0, TRENDY=5.3)
sample_mean_dict=dict(EQUITY=-2.7, BOND=3.9, TRENDX=6.4, TRENDY=3.0)

for assetname in stdev_dict:
    stdev=stdev_dict[assetname]
    estimate_mean=sample_mean_dict[assetname]
    # distribution of the 10-observation sample mean under the null of a zero true mean
    ans=[np.mean([rnd.gauss(0.0, stdev) for x in range(sample_size)]) for unused_idx in range(monte_carlo)]
    # one-sided p-value in the direction of the observed effect
    if assetname in ["TRENDX", "TRENDY", "BOND"]:
        p_value=float(len([x for x in ans if x>estimate_mean]))/monte_carlo
    elif assetname=="EQUITY":
        p_value=float(len([x for x in ans if x<estimate_mean]))/monte_carlo
    else:
        raise Exception("unknown assetname %s" % assetname)
    fig, ax2=plt.subplots()
    thing=ax2.hist(ans, bins=100)[0]
    ax2.annotate("%.4f" % p_value, xy=(estimate_mean, 0), xytext=(estimate_mean, max(thing)), arrowprops=dict(facecolor='black', shrink=0.05))
First for equities:
This result is just shy of significance, using the normal 5% criterion. We can't reject the null hypothesis.
Then for bonds:
Now for Trend, conditioned on poor equity returns:
Not quite as good, but just creeps into being significant at the 5% level. In summary:
"Hypothesis 1: The average return for Bonds, conditional on poor returns for Equities, is positive." - we can say this is very likely to be true.
"Hypothesis 2: The average return for Equities, conditional on poor returns for Bonds, is negative" - we cannot say if this is true or not.
"Hypothesis 3: The average return for Trend, conditional on poor returns for Equities, is positive" - we can say this is quite likely to be true
"Hypothesis 4: The average return for Trend, conditional on poor returns for Bonds, is positive" - we can say this is probably true
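Incidentally, given the Gaussian assumption above, the same one-sided p-values can also be recovered in closed form, without any simulation. This is only a sanity-check sketch (it uses scipy, which the code above doesn't, and the means and standard deviations are the ones quoted earlier): under the null the sample mean of 10 Gaussian returns has standard error stdev/sqrt(10), so the p-value drops straight out of the normal distribution:

```python
import numpy as np
from scipy.stats import norm

# conditional sample means and standard deviations quoted earlier, % per quarter
stdevs = dict(BOND=3.0, EQUITY=5.2, TRENDX=7.0, TRENDY=5.3)
means = dict(BOND=3.9, EQUITY=-2.7, TRENDX=6.4, TRENDY=3.0)
n = 10  # ten conditional quarters in each scenario

p_values = {}
for asset in stdevs:
    se = stdevs[asset] / np.sqrt(n)    # standard error of the sample mean under the null
    z = means[asset] / se              # distance from a true mean of zero, in standard errors
    p_values[asset] = norm.sf(abs(z))  # one-sided p-value in the direction of the observed effect
    print("%s: z=%+.2f, p=%.4f" % (asset, z, p_values[asset]))
```

Reassuringly this agrees with the story above: equities come out at roughly p=0.05, just shy of significance, while the other three scenarios clear the 5% hurdle.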
So my friend was mostly right; 3 out of 4 is pretty good. AQR were spot on; in fact the key findings they highlighted were hypotheses 1 and 3, the most highly significant ones. What's more, my own personal feelings about allocating to trend following are still justified. However it has taken a fair bit of work to prove this!
Why we have to tell stories to explain stuff
The crux of the problem is that it's really, really hard to judge what is significant or not in small samples. Most people don't carry around an intuition about these distributions in their heads. But using small samples is quite common in papers like these. The reason is a flaw in the human brain, a cognitive bias: the narrative fallacy. Or, to put it another way, we like to hear stories.
If I show you a mass of data points you will probably be thinking 'yeah fascinating. Now what Rob?'. But if I show you a nice graph as in exhibit 1 of the AQR report you'll be thinking '4Q 1987. Black Monday! Ah I remember that / I've read about that (delete depending on age)...'.
|1987 crash. Yes children it's true. In the olden days traders wore suits and ties; monitors were really, really big; and the only colour they could display was green (usnews.com)|
The information becomes more interesting. Clever researchers know this, and so present information in a way which makes it easier to hang a narrative off.
Why this is bad
This is bad because a story can be both unrepresentative and also statistically meaningless. If I show you a story about an aircraft crashing you are more likely to avoid flying, even if I subsequently show you some dry statistics on the relative safety of different kinds of transport.
|A sample of one. (www.cnn.com)|
Stories, or if you prefer small samples, can lead us to the wrong judgement*.
* I'm aware that 'you can prove anything with statistics'. However it's true to say that a rigorous analysis of a large sample, properly presented, is always going to be more meaningful than the inferences drawn from a small one.
Sometimes this is deliberate, as with most tabloid newspaper reporting on medical research. Sometimes it's accidental.
Of course it might be that the small sample is statistically significant, in which case we can draw a conclusion about the general population, as in the case of three out of the four hypotheses we've tested.
However if I see a paper with some small sample results in it, but no indication of significance, I don't know if:
- The authors have deliberately shown an unrepresentative and insignificant sample, and the results are wrong
- The authors have got an unrepresentative and insignificant sample by accident, haven't realised it and the conclusions are wrong
- The authors have got a representative sample, but not a significant one. We can't prove the conclusion either way.
- The authors have got a significant and representative sample (the authors may, or may not, realise this; I expect the AQR authors did know, these guys aren't sloppy). The authors are correct, but I have no way of knowing this.
It's for this reason that academic papers are littered with p-values and other statistics (though that doesn't mean you can trust them entirely). I'm not saying that a 'popular finance' paper like this should be festooned with statistical confetti. But a footnote would have been nice.
Don't be afraid of explaining the uncertainty in estimates. Talk about it. Explain it. Let people visualise it. And if you have got significant results, shout about it.
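For instance, rather than just quoting a conditional mean, you could quote a confidence interval around it. Here's a minimal sketch of what that might look like, using the bond figures from earlier and a t-distribution to reflect the 10-observation sample (the AQR paper doesn't do this; it's just an illustration):

```python
import numpy as np
from scipy.stats import t

n = 10                      # ten conditional quarters
mean, stdev = 3.9, 3.0      # conditional bond mean and standard deviation, % per quarter
se = stdev / np.sqrt(n)     # standard error of the sample mean

# 95% confidence interval for the true conditional mean
lo, hi = t.interval(0.95, df=n - 1, loc=mean, scale=se)
print("mean %.1f%%, 95%% CI [%.2f%%, %.2f%%]" % (mean, lo, hi))
```

The interval comes out well clear of zero, which is exactly the 'shout about it' case: one line of output that tells the reader both the estimate and how much to trust it.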
If you're worried that this blog is going to continue in this vein (criticising the research findings of hedge fund billionaires), don't worry. Next time I'll talk about something dull and worthy, like estimating transaction costs; or I'll give you some thrilling python code to read.
But if you're only now following me in the expectation that I'll be writing a post next week about David Shaw's inability to do stochastic calculus, or Ray Dalio's insistence on assuming returns are Gaussian, then I'm sorry you will be disappointed (and if their lawyers are reading, neither of those things are true).