Whilst I hope this was helpful, it was just a starting point. There are at least two major projects to undertake before one could actually trade. The first is the design of such a system. This is the subject of a book I am writing, which I hope will be published (although if I am being honest, writing this blog post is displacement activity to avoid proofreading duties). The second is the implementation; the nuts and bolts if you like. Apart from a single post on execution I've not given you many hints in this direction.
Thanks to overwhelming demand* I've decided to write a series of posts on the issues around implementing such a system, the key decisions you'd have to make, and some options for solving the problems involved.
* I'd like to thank both the people who requested this.
First some ground rules. I won't be exposing any more of my awful code to the general public; at best you'll get pseudo code and at worst waffle. Secondly implementation and design are rather closely linked. I am trading futures, in a fully automated system, which is relatively slow and where latency is not an issue, using only price data. Whilst these posts might be of some interest to a high frequency equities trader using earnings announcements (I assume such people exist), they will need to make appropriate adjustments to take account of the difference between their system and mine.
Finally although I will tell you throughout what I am doing in my own system, this is by no means the only or the correct way of doing things; though I will try and justify my opinion.
This first post, of an unspecified number, will cover data capture.
What sort of data
As I said briefly above I run a purely technical system which only uses prices, rather than a fundamental system using other kinds of data (here's some more on this subject).
I don't use candlesticks or bar charts, instead all my analysis is done on a series of price points.
On an intraday basis I collect the mid price of the current priced contract I'm trading. At the same time I get the size and width of the inside spread to measure liquidity. Since I don't subscribe to Level 2 data this is all I can see. Anyway deep order book data just means you're more likely to be spoofed.
I also get the closing price of a nearby contract to measure contango / rolldown / carry (pick your favorite name). Similarly I collect closing prices from other parts of the futures strip, again to make decisions about rolling. More about intraday and closing prices later.
I also collect volume data, although this is used to decide when to roll from one futures delivery to the next, and not for signals.
Quotes or trades
As I said above I collect the mid price, which is the average of two quotes (the bid and ask). That doesn't mean the market will trade there. Depending on your system you might prefer to ignore quotes and only collect trades (or indeed use both). If you're trading relatively slowly like I am there is little difference; at higher speeds looking at trades means you will get extra volatility from the "bid-ask bounce" of actual trades. Also there can be a delay in reporting trades to the market.
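To make the quote arithmetic concrete, here is a minimal Python sketch of the top-of-book quantities mentioned above. The class and field names are made up for illustration, and treating the "inside size" as the smaller of the bid and ask sizes is an assumption, not a definition taken from the live system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopOfBook:
    """Best bid and ask, with the size available at each (hypothetical structure)."""
    bid: float
    ask: float
    bid_size: int
    ask_size: int

    @property
    def mid(self) -> float:
        # Average of the two quotes; the market need not trade here
        return (self.bid + self.ask) / 2.0

    @property
    def spread(self) -> float:
        # Width of the inside spread
        return self.ask - self.bid

    @property
    def inside_size(self) -> int:
        # Assumption: liquidity is limited by the thinner side of the book
        return min(self.bid_size, self.ask_size)


quote = TopOfBook(bid=97.85, ask=97.855, bid_size=1200, ask_size=800)
print(quote.mid, quote.spread, quote.inside_size)
```

A trade-based collector would store last traded prices instead, which alternate between bid and ask and so pick up the bounce described above.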
High frequency traders and those with more complex execution algorithms will probably want to look deeper into the order book (level 2). Of course this will dramatically expand the amount of data that is collected.
Proactive or passive tick response
There are two ways to get a taxi. You can stand in the street and wait for one to pass, at which point you frantically wave. Alternatively you can ring up a mini cab firm (forgive the UK centric reference, but I'm sure you know what I mean) and summon one.
Will this analogy even make sense in a year's time when Uber has taken over the world? No idea. I for one have never used it, nor do I intend to.
Similarly there are two ways to get prices. You can either passively wait for “ticks” to arrive (updates of the current bid, ask or mid price as trades happen or people update their quotes; or new trades if that is what interests you). Or you can proactively go and ask your broker or data provider what the current price is. Depending on the broker and type of data one or both options will be possible.
The former approach is particularly suited to higher frequency traders. Each tick on arrival is stored, but usually also used to trigger the post price snapshot process. So our code is concurrent rather than sequential. This makes life a little more complex.
For closing prices (see below) I use a proactive request since the closing price is captured as part of a historical series, for which proactive requests are the norm. For intraday prices I use a “pseudo” proactive request. I ask my broker to start sending me ticks, which I store passively as they arrive, and then once I've built up a picture of the order book I ask them to stop (otherwise I'd waste processing time dealing with ticks I didn't need). When I am actually executing trades I use a passive response to modify or cancel my order (see my blog post on execution).
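The "pseudo" proactive pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tick feed is faked with an iterator, the function name and tick format are hypothetical, and a cap on the number of ticks stands in for the time limit a real collector would use.

```python
import math
from typing import Iterable, Optional, Tuple

Tick = Tuple[str, float]  # (field, value), for example ("bid", 97.85)


def snapshot_from_ticks(ticks: Iterable[Tick], max_ticks: int = 100) -> Optional[dict]:
    """Consume ticks until both bid and ask have been seen, then stop listening."""
    book = {"bid": math.nan, "ask": math.nan}
    for count, (field, value) in enumerate(ticks, start=1):
        if field in book:
            book[field] = value  # later ticks overwrite earlier ones
        if not any(math.isnan(v) for v in book.values()):
            return book  # full picture: a real system would now cancel the subscription
        if count >= max_ticks:
            break  # give up rather than wait forever on a quiet market
    return None


# Fake feed standing in for the broker's tick stream
feed = iter([("last", 97.85), ("bid", 97.85), ("bid", 97.8525),
             ("ask", 97.8575), ("ask", 97.86)])
snap = snapshot_from_ticks(feed)
remaining = list(feed)  # ticks we never had to process
print(snap, remaining)
```

The point of stopping early is visible in `remaining`: once the snapshot is complete, the final tick is never handled.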
Open, Close and intraday
My system is fairly slow. I simulate it using daily data, partly for reasons of speed and partly because I only have closing data up to about 18 months ago when I began collecting intraday data. In simulation I assume that I get filled at the next day's price (and also conservatively assume I'll have to pay the ask or the bid, rather than getting the mid price).
In fact even if I delayed my trades for two weeks, I'd only lose about a third of my Sharpe Ratio. If you're not sure if you can get away with delaying your systems this much then you probably can't use daily data.
In actual trading there are a number of options.
The first option is to collect closing prices, as you would do in simulation. This has the advantage that simulation will match reality precisely. The disadvantage is you are deliberately using stale information. Whilst in expectation this will only have a small effect on your overall p&l this is of little comfort when the price of your biggest long has plummeted all morning and your trading system is twiddling its thumbs waiting for the close to confirm this so it can trade the following day.
If there has been a big movement between the price you used to calculate your trade, and what you are trading at, then you might want to do a different trade. Again in simulation we might know it makes only a small difference, but even so it's annoying to see your system buying when some big news has come out, meaning it will probably reverse this trade tomorrow. There are ways to avoid this by creating explicit or implicit conditional orders, but that is complicated (and not for this post to discuss).
Also closing prices can sometimes be a little untrustworthy. They may not, particularly for less liquid contracts, represent levels you could actually have traded at.
So another option is to collect prices on an intra-day basis. This is also the only option for those whose signals have been simulated intra-day, and are so fast that they need to trade this quickly.
The extreme case is where you collect every tick, although you might not store it (explained below).
You can also collect both closing and intra-day prices, which is what I do. Be careful to correctly timestamp closing prices (see below for a full discussion of this), flag them in some way, or save them to a separate table.
I don't collect "open/high/low" data. If I did, I would again mark it as different to closing data or store it in a separate table.
As I said above I use single point prices not candlesticks or bars. If you do use these you can opt to receive them from your broker if available, or construct them yourself.
If going down the DIY route you'll need to be receiving tick data (or you'll potentially miss the highest and lowest points). You can then process these to produce bars that cover the appropriate length of time (second, minute, hour or day). To save on space you could then discard the tick data and just save down the open/high/low/close data for each bar.
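For anyone taking the DIY route, the following Python sketch shows one way of turning ticks into bars. It is not taken from any real system: the function name, the tuple layout and the tick values are invented for illustration, and bars are aligned to midnight for simplicity.

```python
from datetime import datetime, timedelta


def build_bars(ticks, bar_seconds=60):
    """Group (timestamp, price) ticks into open/high/low/close bars keyed by bar start."""
    bars = {}
    for stamp, price in sorted(ticks, key=lambda t: t[0]):
        # Floor the timestamp to the start of its bar, measured from midnight
        midnight = stamp.replace(hour=0, minute=0, second=0, microsecond=0)
        elapsed = int((stamp - midnight).total_seconds())
        bar_start = midnight + timedelta(seconds=elapsed - elapsed % bar_seconds)
        if bar_start not in bars:
            bars[bar_start] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = bars[bar_start]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # last tick seen so far in this bar
    return bars


# Invented ticks, purely to exercise the function
ticks = [
    (datetime(2015, 4, 23, 14, 0, 5), 97.85),
    (datetime(2015, 4, 23, 14, 0, 20), 97.86),
    (datetime(2015, 4, 23, 14, 0, 50), 97.84),
    (datetime(2015, 4, 23, 14, 1, 10), 97.855),
    (datetime(2015, 4, 23, 14, 1, 30), 97.87),
    (datetime(2015, 4, 23, 14, 2, 15), 97.865),
]
bars = build_bars(ticks)
for start, bar in bars.items():
    print(start, bar)
```

Once the bars are saved the ticks can be thrown away, at the cost of never being able to rebuild bars of a different length.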
When and how often?
How often should you check prices intraday? The speed and complexity of your process, and the number of instruments you are trading, or “throttling” by your data provider, might put an upper limit on this. There is also little point collecting the price every few milliseconds on a system which holds positions for several weeks. More data also means more disk storage, and with larger files slower access. We are now into the realms of "big data". As we'll see in later posts it will also have a downstream effect.
Should you use fixed snapshot times, or variable? So should we get prices at 12:00, 13:00 or randomly? Fixed times have the disadvantage, especially for large traders, that everyone knows you are coming. I am reminded of the scene in the book “The Predictors” (a great book about a systematic hedge fund which first got me into this industry).
The team find they are getting really poor fills on crude oil, which they trade once a day at a specific time. They send along an analyst to the trading floor to check. The floor is really busy, then suddenly about 10 seconds before their computer sends its order in it goes quiet. Everyone looks at the broker working for the fund. He gets the order off the phones, and the moment he flashes his request he is swamped by locals who push the price away from him. After he's filled, the price returns to its previous level.
After that the predictors learned some valuable lessons about using multiple brokers, splitting up orders and most relevant for this discussion, varying order times.
However if you're using highly variable snapshot times, resulting in an “irregular” timeseries, there are implications for how you then handle this data subsequently. See the next section.
The way I do things is slightly weird (see the end of the post where I've put the pseudo code for my process), but effectively results in collection about once an hour, with a varying snapshot time. Hourly collection is slightly more frequent than is really necessary.
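If you wanted to vary snapshot times deliberately, one simple approach is to add a random offset to each fixed slot. The sketch below is an assumption about how that could look, not a description of the process in this post (where the variation simply falls out of how long each sampling pass takes); the function name and parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta


def jittered_schedule(start, end, interval_minutes=60, jitter_minutes=10, seed=None):
    """Return snapshot times: one per interval, each delayed by a random offset."""
    rng = random.Random(seed)  # seed only so the example is repeatable
    times = []
    slot = start
    while slot < end:
        offset = rng.uniform(0, jitter_minutes * 60)
        snap = slot + timedelta(seconds=offset)
        if snap < end:
            times.append(snap)
        slot += timedelta(minutes=interval_minutes)
    return times


schedule = jittered_schedule(datetime(2015, 4, 24, 12, 0),
                             datetime(2015, 4, 24, 20, 0), seed=42)
for snap in schedule:
    print(snap.strftime("%H:%M:%S"))
```

Keeping the jitter small relative to the interval means the resulting series is still close to regular, which limits the downstream complications discussed next.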
(This is a somewhat technical section for which some familiarity with timeseries analysis would be helpful)
Consider the prices I have for the August 2018 Eurodollar contract in my database.
2015-04-23 14:18:54 97.8525 184534
2015-04-23 15:19:41 97.8375 184535
2015-04-23 16:42:34 97.8575 184536
2015-04-23 17:42:56 97.8600 184537
2015-04-23 18:43:21 97.8675 184538
2015-04-23 19:43:42 97.8825 184539
2015-04-23 23:00:00 97.8750 184546
2015-04-24 12:14:24 97.8675 184550
2015-04-24 13:15:26 97.8575 184551
2015-04-24 14:17:33 97.9075 184552
2015-04-24 15:18:31 97.9125 184553
2015-04-24 16:41:23 97.9125 184554
2015-04-24 17:41:42 97.9075 184555
2015-04-24 18:42:01 97.9075 184556
2015-04-24 19:42:22 97.9025 184557
2015-04-24 23:00:00 97.9250 184564
2015-04-27 12:00:05 97.9125 184568
2015-04-27 13:00:29 97.8975 184569
2015-04-27 14:00:55 97.8875 184570
Notice the final number in each row, which is the record ID from the database table. Closing prices are marked here with the timestamp 23:00 (see the later section on “Timestamps”). Finally observe the small variation in intraday snapshot timings, although it's roughly hourly.
Ignoring those fascinating observations the main issue to address here is how to use this mixture of daily and intraday data for a daily system. Whatever solution we choose also has to work with the older historical, only daily, data.
Let's assume that we're going to end up doing some kind of time series analysis on the data that assumes it is being snapshotted at regular intervals. This could be something like a moving average, or an estimate of volatility.
What we have here is a specific example of a more general problem, dealing with irregular time series. Instead of snapshots that occur at regular intervals, we have varying distances in time between each snapshot. This problem can also apply to those using intraday data where they have captured the price at irregular intervals.
The two main routes to take are to regularise the series by resampling it to a regular timestamp, or to use code which handles irregular data.
Here are some things I WOULD NOT DO:
- Discard the intraday prices and just use the closing prices. That would work but makes collecting the intraday prices a complete waste of time.
- Resample the intraday prices to a daily frequency, by taking the average price over the day. This will bias downwards your estimate of daily volatility.
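The downward bias from averaging is easy to see in a simulation. The Python sketch below assumes a driftless random walk with eight snapshots a day (both assumptions chosen purely for illustration) and compares the standard deviation of daily changes using the last price of each day against the daily average price.

```python
import random
import statistics


def simulate_daily_vols(n_days=2000, points_per_day=8, seed=1):
    """Return (vol of daily changes in last price, vol of daily changes in mean price)."""
    rng = random.Random(seed)
    price = 100.0
    last_prices, mean_prices = [], []
    for _ in range(n_days):
        day = []
        for _ in range(points_per_day):
            price += rng.gauss(0.0, 0.01)  # one intraday step of the random walk
            day.append(price)
        last_prices.append(day[-1])
        mean_prices.append(statistics.fmean(day))

    def vol(series):
        changes = [b - a for a, b in zip(series, series[1:])]
        return statistics.stdev(changes)

    return vol(last_prices), vol(mean_prices)


vol_last, vol_mean = simulate_daily_vols()
print(f"last price vol {vol_last:.4f}, averaged price vol {vol_mean:.4f}")
```

Under these assumptions the averaged series shows noticeably lower volatility than the last-price series, because each daily average smooths away part of the move that happened within the day.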
Here are some better options:
- Resample prices to an hourly frequency (covering market opening times only), without forward filling. Code which handles missing data properly will then do your time series analysis properly for you.
- Use irregular timestamp analysis functions. For example for an estimate of return volatility rather than equally weighting squared returns, weight them by the amount of time each return happened over. This is roughly equivalent to the previous option but more computationally efficient. However there are some complexities around periods when the markets are closed, which need to be considered and carefully handled.
- Resample the intraday prices to a daily frequency, by taking the last price each day. This means that during the day we'll be using the most recently sampled price, and after the close we'd only see the closing price. This is what I do. The disadvantage is we throw away potentially valuable information buried in the discarded earlier intraday prices.
So for example at 6pm on the 24th April my code would see these two daily prices, the last price on the 23rd which is a closing price (marked 23:00), and the last price on the 24th which was the 17:41 snapshot:
2015-04-23 23:00:00 97.8750 184546
2015-04-24 17:41:42 97.9075 184555
Then just after lunch of the 27th April it would see these three (the intraday price on the 24th replaced with the close):
2015-04-23 23:00:00 97.8750 184546
2015-04-24 23:00:00 97.9250 184564
2015-04-27 12:00:05 97.9125 184568
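The "last price each day" rule can be sketched in Python as follows. The function name is hypothetical and the rows are a subset of the Eurodollar prices shown earlier; the `as_of` argument mimics what the code could have seen at a given moment, so nothing from the future leaks in.

```python
from datetime import datetime

rows = [
    ("2015-04-23 19:43:42", 97.8825),
    ("2015-04-23 23:00:00", 97.8750),
    ("2015-04-24 12:14:24", 97.8675),
    ("2015-04-24 17:41:42", 97.9075),
    ("2015-04-24 18:42:01", 97.9075),
    ("2015-04-24 19:42:22", 97.9025),
    ("2015-04-24 23:00:00", 97.9250),
    ("2015-04-27 12:00:05", 97.9125),
]
prices = [(datetime.strptime(s, "%Y-%m-%d %H:%M:%S"), p) for s, p in rows]


def daily_last(prices, as_of):
    """Keep the last price seen on each calendar day, using only data up to as_of."""
    last = {}
    for stamp, price in sorted(prices, key=lambda r: r[0]):
        if stamp <= as_of:
            last[stamp.date()] = (stamp, price)  # later rows overwrite earlier ones
    return [last[day] for day in sorted(last)]


view_24th = daily_last(prices, datetime(2015, 4, 24, 18, 0))
view_27th = daily_last(prices, datetime(2015, 4, 27, 13, 0))
for stamp, price in view_27th:
    print(stamp, price)
```

Because closing prices carry the 23:00 stamp they are always the last row of their day, so they replace the final intraday snapshot automatically.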
There are some other options available, some of which are proprietary. My lips are sealed.
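For the irregular timestamp option listed above, one possible reading is to divide the sum of squared price changes by the total time elapsed, which is the same as weighting each time-scaled squared change by the time it covers. The Python sketch below takes that interpretation as an assumption; the function name is hypothetical, and it does nothing clever about periods when the market is closed, which the caller would need to handle.

```python
import math
from datetime import datetime


def time_weighted_vol(observations, per_seconds=3600.0):
    """Volatility per `per_seconds` from irregularly spaced (timestamp, price) points."""
    obs = sorted(observations, key=lambda r: r[0])
    sum_sq, total_time = 0.0, 0.0
    for (t0, p0), (t1, p1) in zip(obs, obs[1:]):
        dt = (t1 - t0).total_seconds()
        if dt <= 0:
            continue  # skip duplicate timestamps
        sum_sq += (p1 - p0) ** 2
        total_time += dt
    if total_time == 0:
        return math.nan
    # Variance per second, scaled up to the requested horizon
    return math.sqrt(sum_sq / total_time * per_seconds)


# Invented example: a move of 1 over one hour, then a move of 2 over four hours
obs = [
    (datetime(2015, 4, 24, 12, 0), 100.0),
    (datetime(2015, 4, 24, 13, 0), 101.0),
    (datetime(2015, 4, 24, 17, 0), 103.0),
]
hourly_vol = time_weighted_vol(obs)
print(hourly_vol)
```

In the example both moves are consistent with the same hourly volatility, since a move twice as large took four times as long; equal weighting of squared changes would not recognise that.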
There are a number of ways that we can store timeseries data like prices. This is not the place to discuss them, and such a discussion would be highly specific to your technology stack. However the usual points around efficiency, speed and file size should be borne in mind. Bear in mind that your price database will be write once (from a single process) and read many times (from multiple processes).
One question that arises is whether you should store or ignore “nans” (empty values, not multiple grandmothers). So if you try and get a price, and the market is closed or not liquid enough, do you store a nan, or ignore it and do nothing. This has implications for how your data storage and analysis works. For example ignoring nans and not storing them means you will have stored an irregular timeseries which needs to be converted back to regular; if you store nans then your “native” timeseries will be regular out of the box. I don't store nans myself.
Timestamps
Prices must be marked with the correct timestamp. If you get this wrong you risk committing the horrific crime of “forward looking” data, where if you backtested you'd get prices earlier than you really could.
You can use local time in the market or “home” (wherever your computer is sitting). If you use local time you need to store the time zone, and any relevant daylight savings offset, with the data. This is especially important if you're trading across multiple time zones and using cross market information.
If you use your “home” time then you need to make sure that this is comparable to other times you might be using, for example the time of economic announcements and earnings releases.
You can also use the timestamp that arrives with the price (if relevant), or stamp it with your own on arrival. Both methods can occasionally give the wrong results. Timestamps aren't always correct on prices from data vendors, and price ticks don't have an equal amount of latency so might not always arrive in the right order. Very fast traders backtesting their systems need to be aware of this problem and employ solutions.
For slower traders the difference is minimal, so I use the simplest solution of using “home” system time and stamping prices on arrival.
Beware that a closing (or open, high, low...) price won't normally have a time attached. The default behaviour for a lot of systems would be to give it a midnight timestamp. This is incorrect as it's probably many hours before the market closed on that day. You can either use an artificial timestamp or the actual closing time. Even if you use the real close it might be better to use a fixed artificial time (perhaps five minutes after the official close), because it's important to be able to distinguish intraday and closing prices.
To keep things simple as we've already seen I mark all close prices with the artificial timestamp 23:00. Since I use local system time on price arrival, and I'm never collecting prices then*, it isn't possible for a price to have this timestamp naturally.
* A later post will discuss scheduling in more detail.
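As a small illustration of the artificial closing timestamp, here is a Python sketch. The helper names are hypothetical; the 23:00 marker is the one described above, and all times are naive "home" system times.

```python
from datetime import date, datetime, time

CLOSE_STAMP = time(23, 0, 0)  # artificial marker that intraday sampling never produces


def stamp_close(trade_date: date) -> datetime:
    """Timestamp to attach to a closing price for the given trading day."""
    return datetime.combine(trade_date, CLOSE_STAMP)


def is_closing_price(stamp: datetime) -> bool:
    """True if a stored timestamp marks a closing rather than an intraday price."""
    return stamp.time() == CLOSE_STAMP


# What many systems do by default: midnight at the start of the day
midnight_default = datetime.combine(date(2015, 4, 24), time(0, 0))
close_stamp = stamp_close(date(2015, 4, 24))
intraday_stamp = datetime(2015, 4, 24, 19, 42, 22)

# The midnight stamp would sort the close before that day's intraday prices
print(midnight_default < intraday_stamp < close_stamp)
```

The ordering check at the end is the whole point: with a midnight stamp a backtest would see the close before the intraday prices that preceded it.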
Beware of markets where there isn't really an open or close (I'm thinking here of OTC FX markets in particular which trade 24 hours). Be aware of when and how the “open” or “close” price was calculated. Also beware of Asian markets if you're trading in US or European time. Close prices will probably be naturally datestamped for the following day. That's not the correct approach if you're using “home” system time.
Other contracts and synchronisation
I didn't really clarify above why I collected closing prices, as well as intra-day. After all I could probably just collect intraday prices right up to the close. Given the weirdness of closing prices, and the hassle of marking timestamps, why bother?
The reason is that it's sometimes useful to have synchronised data. In my case I like to know the price of the current future relative to a nearby contract, to work out carry / rolldown / contango (so what I want is the spread between the two). Other examples would be if you're running a cointegrating mean reversion system for a bunch of equities, or trading on the run versus off the run bonds.
Getting synchronised cross market prices is difficult. If you're capturing tick data and you line up the time series of ticks so they're as close as possible, you'll get a very noisy measure as bid-offer "bounce" affects one market then the other. Sometimes an explicit market in the spread is quoted (as for calendar spreads in certain markets, like Eurodollar).
But closing prices are automatically synchronised, and for a slow trader much easier than collecting multiple synchronised tick prices.
(Though they aren't always tradeable. In many cases only one futures contract actively trades, and the rest are given closing prices that are “marked to model” with fixed offsets from that. )
So I collect closing prices for both the currently traded, and a nearby contract.
(I also get closing prices for other contracts, to make decisions about rolling. Again this will be a subject of a later post).
I don't bother collecting intraday data for the non traded contract since it would probably give me quite a noisy estimate of the spread; and for the purposes of what I am doing daily data is sufficiently quick.
Here is some more data for Eurodollar. I've removed some additional rows from the second and subsequent days.
2015-04-21 23:00:00 97.9050 97.985
2015-04-22 12:00:06 97.9175 NaN
2015-04-22 13:00:33 97.9025 NaN
2015-04-22 14:00:59 97.9075 NaN
2015-04-22 15:01:20 97.8675 NaN
2015-04-22 16:07:57 97.8475 NaN
2015-04-22 17:08:22 97.8425 NaN
2015-04-22 18:08:43 97.8275 NaN
2015-04-22 19:09:02 97.8325 NaN
2015-04-22 23:00:00 97.8250 97.905
2015-04-23 12:15:44 97.8575 NaN
…. snip ….
2015-04-23 19:43:42 97.8825 NaN
2015-04-23 23:00:00 97.8750 97.955
2015-04-24 12:14:24 97.8675 NaN
… snip …..
2015-04-24 19:42:22 97.9025 NaN
2015-04-24 23:00:00 97.9250 98.005
2015-04-27 12:00:05 97.9125 NaN
You can see that I'm only collecting closing prices for the non priced “Carry” contract, but intraday and closing prices for the “Price” contract.
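Using a few of the rows above, the synchronised spread can be computed only where both contracts have a price, which in practice means the closing rows. This Python sketch uses a hypothetical function name, and the sign convention (price contract minus carry contract) is an assumption.

```python
import math

# (timestamp, price contract, carry contract); NaN where no carry price was collected
rows = [
    ("2015-04-21 23:00:00", 97.9050, 97.985),
    ("2015-04-22 12:00:06", 97.9175, math.nan),
    ("2015-04-22 19:09:02", 97.8325, math.nan),
    ("2015-04-22 23:00:00", 97.8250, 97.905),
    ("2015-04-23 23:00:00", 97.8750, 97.955),
    ("2015-04-24 23:00:00", 97.9250, 98.005),
    ("2015-04-27 12:00:05", 97.9125, math.nan),
]


def closing_spreads(rows):
    """Spread between the two contracts, keyed by date, for rows where both prices exist."""
    spreads = {}
    for stamp, price, carry in rows:
        if math.isnan(price) or math.isnan(carry):
            continue  # intraday rows have no synchronised carry price
        spreads[stamp[:10]] = round(price - carry, 4)
    return spreads


spreads = closing_spreads(rows)
for day, spread in spreads.items():
    print(day, spread)
```

In this excerpt the spread comes out at -0.08 on every close, which is what you would expect if one of the two closing prices is being marked at a fixed offset from the other.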
Should you collect market prices when you know the markets are liquid, or check markets are liquid when you collect prices?
Huh?! I hear you collectively say...
I'll be clearer, or try to be. The first approach is to limit our exposure to bad prices by only getting them when we expect markets to be open and liquid. We don't collect prices during holidays or during quiet sessions. We need to keep track of when quiet sessions are (with daylight savings adjustments), and exchange holidays. This is a lot of work.
The second approach is to keep trying to get prices all the time, but to check that the price quote meets some minimum liquidity requirement (maximum spread and/or minimum inside size). Obviously if a market is closed then this requirement will be automatically failed! This involves less manual work and setting up of market characteristics, but does waste system time trying to access prices in closed or very quiet markets.
I use a mixture of these two approaches. So I limit my sampling to active opening times in markets, although I don't account for changes in daylight savings. I will try and get prices on holidays (unless I've turned my system off on a “universal” holiday to do some maintenance). I don't impose a minimum liquidity requirement (although I do store liquidity data for monitoring purposes), but an updated bid and an ask must be available within a specified time period or I won't return a price to my system (the pseudo code later will make that clear).
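The second approach, checking liquidity at the moment of collection, might look something like the Python sketch below. The function name and thresholds are hypothetical, and this is stricter than the check described in the previous paragraph, which only requires a fresh bid and ask.

```python
import math
from typing import Optional


def acceptable_mid(bid: float, ask: float, bid_size: int, ask_size: int,
                   max_spread: float, min_size: int) -> Optional[float]:
    """Return the mid price if the quote passes the liquidity checks, else None."""
    if math.isnan(bid) or math.isnan(ask):
        return None  # market closed or one side of the book missing
    if ask < bid:
        return None  # crossed quote, almost certainly bad data
    if ask - bid > max_spread:
        return None  # spread too wide to trust the mid
    if min(bid_size, ask_size) < min_size:
        return None  # not enough size on the inside
    return (bid + ask) / 2.0


good = acceptable_mid(97.85, 97.855, 500, 300, max_spread=0.01, min_size=100)
wide = acceptable_mid(97.80, 97.90, 500, 300, max_spread=0.01, min_size=100)
closed = acceptable_mid(math.nan, math.nan, 0, 0, max_spread=0.01, min_size=100)
print(good, wide, closed)
```

A closed market fails automatically because no quote arrives, so no holiday calendar is needed, at the cost of wasted requests.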
Spikes and cleaning
Even with a minimum liquidity requirement (and definitely without one) bad prices will sometimes creep through. In my career I've seen prices of zero, prices out by a factor of 10, 100 or more; prices for the wrong instrument or contract coming through... you name it. To be safe all prices should undergo a series of automated checks before being added to the database.
There are four approaches for potentially bad prices (sometimes called “spikes” because that is what they look like on a chart).
The first is to just exclude it. Not advised except for obviously wrong values. We can usually exclude prices of zero (unless we're getting a value for a spread market or something else where such a price would be valid). It's more work but you could also set up market specific boundaries; interest rate futures like Eurodollar are unlikely to go below 50 (a 50% interest rate) or much above 100 (100 representing interest rates of zero, so higher values are no longer “impossible”).
Secondly you exclude it, but tell the user about it. This is correct for the grey area where a large price move is conceivably correct, but it's more likely to be a result of bad data. This is the approach I use. There is then a process that the user (me) can run and manually check and accept or reject the price, bypassing the normal automated checks.
Thirdly you add it to your database, but warn the user. This is correct for the grey area where a large price move is probably correct, but could also be a result of bad data. As I'll discuss in later posts you also want to have systems which aren't too vulnerable to one bad price.
A difficult dilemma is whether to allow users to correct existing prices in the database that turned out to be wrong. I am a purist – since this means we're adding forward looking information I don't allow myself to do it.
Whilst I don't generally use the third approach (adding the price and warning the user), large price movements are flagged in other diagnostic reports I run (see later blog post on this subject).
Finally you add it, and don't tell the user. Not advised unless a price is within wholly conceivable bounds.
The big unanswered question is how we define “conceivable”. A common technique which I also use is to look at the size of the price change relative to recent volatility in that price, and set trigger points if the change is a given multiple of recent volatility. If the change is unlikely to have occurred naturally you might want to add it and warn the user (third approach). If it's very unlikely to have occurred you probably want to reject it and warn the user (second approach).
You need to calibrate the levels here depending on how many prices you are collecting, and how many warnings you want to check. For example my system uses an 8 daily standard deviation threshold before rejecting a price. Because I'm collecting for 45 markets I usually get about one warning a week.
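A volatility-based spike check along these lines can be sketched in Python as below. The 8 standard deviation rejection level is the one quoted above; the warning level of 4, the optional hard bounds, the function name and the use of a handful of recent daily closes are all assumptions made for illustration.

```python
import statistics


def classify_price(new_price, recent_closes, warn_multiple=4.0,
                   reject_multiple=8.0, bounds=None):
    """Return 'accept', 'warn' or 'reject' for a candidate price."""
    if bounds is not None and not (bounds[0] <= new_price <= bounds[1]):
        return "reject"  # outside market specific limits
    changes = [b - a for a, b in zip(recent_closes, recent_closes[1:])]
    if len(changes) < 2:
        return "accept"  # no history to check against
    vol = statistics.stdev(changes)  # recent daily volatility
    if vol == 0:
        return "warn"  # cannot scale the move, so ask a human
    move = abs(new_price - recent_closes[-1]) / vol
    if move >= reject_multiple:
        return "reject"
    if move >= warn_multiple:
        return "warn"
    return "accept"


closes = [97.905, 97.825, 97.875, 97.925, 97.9125]
print(classify_price(97.9075, closes))                      # small move
print(classify_price(98.2125, closes))                      # large move
print(classify_price(9.79, closes, bounds=(50.0, 101.0)))   # out by a factor of 10
```

In a real system "reject" would mean holding the price back for a manual check rather than silently dropping it, and the volatility estimate would use a longer window than five closes.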
Very occasionally you will get a real very large move like the flash crash, and my approach will mean I won't react to it until I've manually checked the price, which could be many hours later. For fully automated systems trading relatively slowly it's worth having the additional robustness versus the penalty of occasionally missing a big move (which if it's a flash crash type scenario will be quickly reversed anyway).
Volumes – beware
As I noted above I collect volume data to inform my roll decisions. I don't currently use volume data as a signal, although it's very popular with a lot of technical systems.
Specifically for futures it is hard to pick out the real trend in volumes when you have multiple delivery months trading, and people rolling from one month to another which shows up in volume statistics but isn't “real” volume.
Equities traders, and anyone using an instrument with multiple trading venues, may also find it hard to get a complete picture of volumes.
In conclusion I'd be extremely wary of using volumes to make trading decisions unless you are confident you can get reliable and meaningful data.
Non traded markets - should I bother
On the face of it you should only collect prices for markets you intend to trade, or already have positions on. However there are some exceptions to this. There might be a market you are thinking about trading in the future. Similarly you might have stopped trading a market (like I did with the two year German Schatz bond future) temporarily. Since “backfilling” prices is often a slow and complex exercise (if you do it right and are careful) it may be easier to keep on collecting prices.
Price stitching - where and how
The final point (before I give you some pseudo code) is about price adjustment. This is a very futures specific issue, although it affects any instrument where you need to “roll”. So for example if you're using quarterly spreadbets to gamble on FX movements, then unless you want to close your June 2015 position you'll need to roll it into September.
The problem we have is we can't just use the June price until we roll, and then the September price. This creates a discontinuity in the price at the roll point. There are some good discussions on various methods of dealing with this, for example here.
I use the simple “panama” method. This has the advantage of being simple and requiring historical prices to be adjusted only when we roll, and means that the current level of the traded contract is identical to the back adjusted price. The two disadvantages of this method cited in the web link (introducing a trend bias, and losing relative price differential) are not important in my opinion.
From an implementation point of view there are two possible times you can do this adjustment. You can either store non continuous prices and then do the stitching on the fly each time you need to. Alternatively you can store both non continuous, and stitched prices (meaning you need a process that regularly creates the latter from the former). This is the approach I take. This is faster, but means you have to stick to the same set of historical switching dates.
With the panama method changing your switching dates will change the history of prices, and so the patterns they form. The difference is subtle, and won't worry most people.
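A bare-bones version of panama back-adjustment is sketched below in Python. It assumes each contract's prices are supplied oldest first and that consecutive contracts both have a price on the roll date; the function name, dates and prices are invented for illustration and this is not the stitching code used in the live system.

```python
def panama_stitch(segments):
    """Back-adjust a list of per-contract (date, price) series into one continuous series.

    Each segment must end on the roll date, and the next segment must start on it.
    """
    stitched = list(segments[-1])  # the current contract is left unadjusted
    offset = 0.0
    pairs = zip(reversed(segments[:-1]), reversed(segments[1:]))
    for older, newer in pairs:
        (old_date, old_price), (new_date, new_price) = older[-1], newer[0]
        if old_date != new_date:
            raise ValueError("segments must overlap on the roll date")
        offset += new_price - old_price  # gap between contracts at the roll
        adjusted = [(d, round(p + offset, 6)) for d, p in older[:-1]]
        stitched = adjusted + stitched
    return stitched


# Invented prices: roll from June to September on 3rd June
june = [("2015-06-01", 98.00), ("2015-06-02", 98.10), ("2015-06-03", 98.05)]
sept = [("2015-06-03", 97.85), ("2015-06-04", 97.90)]
stitched = panama_stitch([june, sept])
for day, price in stitched:
    print(day, price)
```

Day to day changes within each contract are preserved and the latest stitched price equals the traded contract's price; choosing a different roll date would change the gap, and with it every earlier price.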
Some data suppliers also provide pre-calculated continuous prices. Make sure you know how they are doing the adjustment. I prefer to do my own.
There is much more to the mechanics of rolling so I will return to this in another post.
Pseudo code for my price collection code
Although I'm calling this pseudo code, it's actually a heavily edited version of my real Python code, where I've removed everything except bits that illustrate the underlying logic. But if you're not a Python fan you can ignore that and just pretend it's in the higher level language of your choice.
I haven't included the end of day pricing since it's just a less interesting version of the intraday (spikes are still checked but we don't need to pretend we're proactively getting prices).
The main pricing process is kicked off just after midnight local time and consists of a massive while loop:
## Big bad ass price collection loop
## Is it after 8pm, or whenever we stop? Then autostop
log.info("Normally stop pricing process as all markets closed")
for code in all_codes:
    ## do not sample as market is closed.
    if check_process(dbtype, "SAMPLING", code) == "STOP":
        ## do not sample as process is prevented from running
    if last_run is not None:
        if (now() - last_run).total_seconds() < (60*60):
            ## Too soon to run again, we want hourly prices at best
    ## Do price stitching. This returns the number of new prices added
    ## have new data – run signals code
    ## Because there is some variation in how long this takes we end up collecting
    ## prices at different frequencies
## end of for loop
## End of while loop
def raw_sample_instrument(code, entrymode="AUTO"):
    ## If we run manually would put entrymode="MANUAL"
    ## Get top of order book data – see next function
    log.warning("Found no new and or acceptable prices")
    ## Use local price for sampletime
    resolve_and_add_pricing_data(code, pricing_data, entrymode)
def get_market_data(code, snapshot=True, maxstaleseconds=60, maxwaitseconds=30):
    """
    Return market data, with maximum staleness. If snapshot=False or if stale initiates sampler to get it
    If doesn't get all fields filled within maxwait then returns nans

    - I am a price sampler. I want the latest tick (with some definition of staleness)
      (Checks for latest tick. If doesn't exist, starts ticking. Waits until a full tick list is
      available. Then returns it. If nothing after some period of time, return nans)
      [also used for general get a price diagnostic eg when rolling instruments]
    - I am checking for liquidity prior to . I want the latest marketdata and you need to carry on sampling afterwards if you get one.
    """
    ## Get market data
    if mymarketdata is None:
        ## not yet cached
    ## Try and get current data
    stored_data = mymarketdata.get_contract(code, maxdelay=maxstaleseconds)
    ## We can use these prices as there are no nans
    ## We can't
    if not snapshot or not useable_prices:
        ## Need to start a tick server
        ## Note this will create a tickid if one is required
        start_ticker_for_contract(dbtype, tws, contract)
    ## note this loop will die immediately if we already have useable prices
    while not useable_prices and timespent < maxwaitseconds:
        timespent = (now() - start_time).total_seconds()
    if started_tick_server and snapshot:
        ## need to stop the server because we are taking a snapshot
        ## don't want to continue ticking, reduce the number of prices we're getting
        ## note ticks may still arrive...
        ## note the tickid will still live on!
        ## note we won't turn off if active order
def resolve_and_add_pricing_data(code, pricing_data, entrymode):
    ## resolve the mode, returns None if AUTO and prices fail checks
    pricing_data = resolveprice(current_price_matrix, pricing_data, entrymode)
    ## just leave prices as they are
    log.warning("No existing prices can't do any checks")
def resolveprice(current_price_matrix, pricing_data, entrymode):
    """
    Function that resolves prices depending on entrymode AUTO or MANUAL
    Returns new pricing_data - if this is None then we don't have any valid prices (happens only in AUTO mode)
    """
    assert entrymode in ["AUTO", "MANUAL"]
    ## Potentially bad price
    ## If running in manual mode then allow user to override, else flag
    ## Allows you to manually check prices, and weed out anything thats bad
    pricing_data = manual_pricing_data(current_price_matrix, pricing_data, describe)
    ## no user interaction possible
    log.critical("Sample failed spike check price move")
The end
This is the first post in a series about the nuts and bolts of creating systematic trading systems. The second post is here: