Tuesday 28 April 2015

System building - Data capture

A while ago I ran a series of posts on how you would write some Python code to trade systematically using the Interactive Brokers C++ API.

Whilst I hope this was helpful, it was just a starting point. There are at least two major projects to undertake before one could actually trade. The first is the design of such a system. This is the subject of a book I am writing, which I hope will be published (although if I am being honest writing this blog post is displacement activity to avoid proofreading duties). The second is the implementation; the nuts and bolts if you like. Apart from a single post on execution I've not given you many hints in this direction.

Thanks to overwhelming demand* I've decided to write a series of posts on the issues around implementing such a system, the key decisions you'd have to make, and some options for solving the problems involved.

* I'd like to thank both the people who requested this.

First some ground rules. I won't be exposing any more of my awful code to the general public; at best you'll get pseudo code and at worst waffle. Secondly, implementation and design are rather closely linked. I am trading futures, in a fully automated system, which is relatively slow and where latency is not an issue, using only price data. Whilst these posts might be of some interest to a high frequency equities trader using earnings announcements (I assume such people exist), they will need to make appropriate adjustments to take account of the difference between their system and mine.


Finally although I will tell you throughout what I am doing in my own system, this is by no means the only or the correct way of doing things; though I will try and justify my opinion.

This first post, of an unspecified number, will cover data capture.


What sort of data


As I said briefly above I run a purely technical system which only uses prices, rather than a fundamental system using other kinds of data (here's some more on this subject).

I don't use candlesticks or bar charts, instead all my analysis is done on a series of price points.

On an intraday basis I collect the mid price of the current priced contract I'm trading. At the same time I get the size and width of the inside spread to measure liquidity. Since I don't subscribe to Level 2 data this is all I can see. Anyway deep order book data just means you're more likely to be spoofed.

I also get the closing price of a nearby contract to measure contango / rolldown / carry (pick your favorite name). Similarly I collect closing prices from other parts of the futures strip, again to make decisions about rolling. More about intraday and closing prices later.


I also collect volume data, although this is used to decide when to roll from one futures delivery to the next, and not for signals.


Quotes or trades


As I said above I collect the mid price, which is the average of two quotes (the bid and ask). That doesn't mean the market will trade there. Depending on your system you might prefer to ignore quotes and only collect trades (or indeed use both). If you're trading relatively slowly like I am there is little difference; at higher speeds looking at trades means you will get extra volatility from the "bid-ask bounce" of actual trades. Also there can be a delay in reporting trades to the market.
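As a minimal sketch of the quantities mentioned above, assuming a top of book quote arrives as a best bid, a best ask and their sizes. The `Quote` class, its field names and the example numbers are all hypothetical and not part of any broker API; taking the smaller of the two sizes as the "inside size" is just one possible definition.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """Hypothetical top of book quote; field names are illustrative only."""
    bid: float
    ask: float
    bid_size: int
    ask_size: int


def mid_price(q: Quote) -> float:
    # Average of the best bid and the best ask
    return (q.bid + q.ask) / 2.0


def inside_spread(q: Quote) -> float:
    # Width of the inside spread: best ask minus best bid
    return q.ask - q.bid


def inside_size(q: Quote) -> int:
    # One possible definition: the smaller of the two sizes at the touch
    return min(q.bid_size, q.ask_size)


# Made-up quote, purely for illustration
quote = Quote(bid=97.85, ask=97.855, bid_size=1200, ask_size=800)
print(mid_price(quote), inside_spread(quote), inside_size(quote))
```

Nothing here says the market will actually trade at the mid; it is only a convenient single number summarising the two quotes.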

High frequency traders and those with more complex execution algorithms will probably want to look deeper into the order book (level 2). Of course this will dramatically expand the amount of data that is collected.

Proactive or passive tick response


There are two ways to get a taxi. You can stand in the street and wait for one to pass, at which point you wave frantically. Alternatively you can ring up a mini cab firm (forgive the UK centric reference, but I'm sure you know what I mean) and summon one.

Will this analogy even make sense in a year's time when Uber has taken over the world? No idea. I for one have never used it, nor do I intend to.

Similarly there are two ways to get prices. You can either passively wait for “ticks” to arrive (updates of the current bid, ask or mid price as trades happen or people update their quotes; or new trades if that is what interests you). Or you can proactively go and ask your broker or data provider what the current price is. Depending on the broker and type of data one or both options will be possible.

The former approach is particularly suited to higher frequency traders. Each tick on arrival is stored, but usually also used to trigger the post price snapshot process. So our code is concurrent rather than sequential. This makes life a little more complex.

For closing prices (see below) I use a proactive request since the closing price is captured as part of a historical series, for which proactive requests are the norm. For intraday prices I use a “pseudo” proactive request. I ask my broker to start sending me ticks, which I store passively as they arrive, and then once I've built up a picture of the order book I ask them to stop (otherwise I'd waste processing time dealing with ticks I didn't need). When I am actually executing trades I use a passive response to modify or cancel my order (see my blog post on execution).




Open, Close and intraday


My system is fairly slow. I simulate it using daily data, partly for reasons of speed and partly because I only have closing data up to about 18 months ago when I began collecting intraday data. In simulation I assume that I get filled at the next day's price (and also conservatively assume I'll have to pay the ask or the bid, rather than getting the mid price).

In fact even if I delayed my trades for two weeks, I'd only lose about a third of my Sharpe Ratio. If you're not sure if you can get away with delaying your systems this much then you probably can't use daily data.

In actual trading there are a number of options.

The first option is to collect closing prices, as you would do in simulation. This has the advantage that simulation will match reality precisely. The disadvantage is you are deliberately using stale information. Whilst in expectation this will only have a small effect on your overall p&l, this is of little comfort when the price of your biggest long has plummeted all morning and your trading system is twiddling its thumbs waiting for the close to confirm this so it can trade the following day.


If there has been a big movement between the price you used to calculate your trade, and what you are trading at, then you might want to do a different trade. Again in simulation we might know it makes only a small difference, but even so it's annoying to see your system buying when some big news has come out, meaning it will probably reverse this trade tomorrow. There are ways to avoid this by creating explicit or implicit conditional orders, but that is complicated (and not for this post to discuss).


Also closing prices can sometimes be a little untrustworthy. They may not, particularly for less liquid contracts, represent levels you could actually have traded at.


So another option is to collect prices on an intra-day basis. This is also the only option for those whose signals have been simulated intra-day, and are so fast that they need to trade this quickly.

The extreme case is where you collect every tick, although you might not store it (explained below).
 
You can also collect both closing and intra-day prices, which is what I do. Be careful to correctly timestamp closing prices (see below for a full discussion of this), flag them in some way, or save them to a separate table.


I don't collect "open/high/low" data. If I did, however, I would again mark it as different to closing data or store it in a separate table.

As I said above I use single point prices not candlesticks or bars. If you do use these you can opt to receive them from your broker if available, or construct them yourselves.

If going down the DIY route you'll need to be receiving tick data (or you'll potentially miss the highest and lowest points). You can then process these to produce bars that cover the appropriate length of time (second, minute, hour or day). To save on space you could then discard the tick data and just save down the open/high/low/close data for each bar.
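The DIY route could be sketched roughly as follows, assuming ticks arrive as time ordered (timestamp, price) pairs. The function name `ticks_to_bars`, the one minute default and the tick values are all made up for illustration.

```python
from datetime import datetime, timedelta


def ticks_to_bars(ticks, bar_length=timedelta(minutes=1)):
    """Group (timestamp, price) ticks into open/high/low/close bars.

    `ticks` must be sorted by timestamp. Returns a dict keyed by bar start time.
    """
    epoch = datetime(1970, 1, 1)
    bars = {}
    for stamp, price in ticks:
        # Floor the timestamp to the start of the bar it falls in
        whole_bars = (stamp - epoch) // bar_length
        bar_start = epoch + whole_bars * bar_length
        if bar_start not in bars:
            # First tick in this bar sets all four values
            bars[bar_start] = {"open": price, "high": price,
                               "low": price, "close": price}
        else:
            bar = bars[bar_start]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price  # the latest tick seen is the close so far
    return bars


# Made-up ticks, purely for illustration
ticks = [
    (datetime(2015, 4, 27, 12, 0, 5), 97.9125),
    (datetime(2015, 4, 27, 12, 0, 20), 97.9150),
    (datetime(2015, 4, 27, 12, 0, 50), 97.9100),
    (datetime(2015, 4, 27, 12, 1, 10), 97.9125),
]
bars = ticks_to_bars(ticks)
```

Once the bars have been saved the raw ticks can be thrown away if space is a concern, at the cost of never being able to rebuild bars of a different length.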

When and how often?


How often should you check prices intraday? The speed and complexity of your process, and the number of instruments you are trading, or “throttling” by your data provider, might put an upper limit on this. There is also little point collecting the price every few milliseconds on a system which holds positions for several weeks. More data also means more disk storage, and with larger files slower access. We are now into the realms of "big data". As we'll see in later posts it will also have a downstream effect.

Should you use fixed snapshot times, or variable? So should we get prices at 12:00, 13:00 or randomly? Fixed times have the disadvantage, especially for large traders, that everyone knows you are coming. I am reminded of the scene in the book “The Predictors” (a great book about a systematic hedge fund which first got me into this industry).


The team find they are getting really poor fills on crude oil, which they trade once a day at a specific time. They send along an analyst to the trading floor to check. The floor is really busy, then suddenly about 10 seconds before their computer sends its order in it goes quiet. Everyone looks at the broker working for the fund. He gets the order off the phones, and the moment he flashes his request he is swamped by locals who push the price away from him. After he's filled, the price returns to its previous level.


After that the predictors learned some valuable lessons about using multiple brokers, splitting up orders and most relevant for this discussion, varying order times.

However if you're using highly variable snapshot times, resulting in an “irregular” timeseries, there are implications for how you then handle this data subsequently. See the next section.

The way I do things is slightly weird (see the end of the post where I've put the pseudo code for my process), but effectively results in collection about each hour, with a varying snapshot time. Every hour is slightly quicker than is really necessary.



Irregular timeseries


(This is a somewhat technical section for which some familiarity with timeseries analysis would be helpful)
Consider the prices I have for the August 2018 Eurodollar contract in my database.



2015-04-23 14:18:54  97.8525  184534
2015-04-23 15:19:41  97.8375  184535
2015-04-23 16:42:34  97.8575  184536
2015-04-23 17:42:56  97.8600  184537
2015-04-23 18:43:21  97.8675  184538
2015-04-23 19:43:42  97.8825  184539
2015-04-23 23:00:00  97.8750  184546
2015-04-24 12:14:24  97.8675  184550
2015-04-24 13:15:26  97.8575  184551
2015-04-24 14:17:33  97.9075  184552
2015-04-24 15:18:31  97.9125  184553
2015-04-24 16:41:23  97.9125  184554
2015-04-24 17:41:42  97.9075  184555
2015-04-24 18:42:01  97.9075  184556
2015-04-24 19:42:22  97.9025  184557
2015-04-24 23:00:00  97.9250  184564
2015-04-27 12:00:05  97.9125  184568
2015-04-27 13:00:29  97.8975  184569
2015-04-27 14:00:55  97.8875  184570




Notice the final number in each row, which is the record ID from the database table. Closing prices are marked here with the timestamp 23:00 (see the later section on “Timestamps”). Finally observe the small variation in intraday snapshot timings, although they are roughly hourly.

Ignoring those fascinating observations the main issue to address here is how to use this mixture of daily and intraday data for a daily system. Whatever solution we choose also has to work with the older historical, only daily, data.

Let's assume that we're going to end up doing some kind of time series analysis on the data that assumes it is being snapshotted at regular intervals. This could be something like a moving average, or an estimate of volatility.

What we have here is a specific example of a more general problem, dealing with irregular time series. Instead of snapshots that occur at regular intervals, we have varying distances in time between each snapshot. This problem can also apply to those using intraday data where they have captured the price at irregular intervals.

The two main routes to take are to regularise the series by resampling it to a regular timestamp, or to use code which handles irregular data.



Here are some things I WOULD NOT DO:

- Discard the intraday prices and just use the closing prices. That would work but makes collecting the intraday prices a complete waste of time.

- Resample the intraday prices to a daily frequency, by taking the average price over the day. This will bias downwards your estimate of daily volatility.
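The downward bias from averaging can be seen in a small simulation. This is only a sketch under stated assumptions: prices follow a driftless random walk, there are 8 snapshots a day, and the step size and seed are arbitrary. For such a random walk the ratio of the two volatility estimates tends towards the square root of 2/3 (about 0.82) as the number of snapshots per day grows.

```python
import random
import statistics

random.seed(0)          # arbitrary seed so the run is repeatable
SAMPLES_PER_DAY = 8     # assumed number of intraday snapshots
N_DAYS = 2000

price = 100.0
closes, averages = [], []
for _ in range(N_DAYS):
    day = []
    for _ in range(SAMPLES_PER_DAY):
        price += random.gauss(0.0, 0.01)   # random walk step
        day.append(price)
    closes.append(day[-1])                 # last price of the day
    averages.append(sum(day) / len(day))   # average price over the day


def daily_vol(series):
    # Standard deviation of day to day price changes
    changes = [b - a for a, b in zip(series, series[1:])]
    return statistics.stdev(changes)


vol_from_closes = daily_vol(closes)
vol_from_averages = daily_vol(averages)
print(vol_from_averages / vol_from_closes)
```

The averaged series understates daily volatility because averaging smooths away part of each day's movement.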



Here are some better options:

- Resample prices to an hourly frequency (covering market opening times only), without forward filling. Code which handles missing data properly will then do your time series analysis properly for you.

- Use irregular timestamp analysis functions. For example for an estimate of return volatility rather than equally weighting squared returns, weight them by the amount of time each return happened over. This is roughly equivalent to the previous option but more computationally efficient. However there are some complexities around periods when the markets are closed, which need to be considered and carefully handled.

- Resample the intraday prices to a daily frequency, by taking the last price each day. This means that during the day we'll be using the most recently sampled price, and after the close we'd only see the closing price. This is what I do. The disadvantage is we throw away potentially valuable information buried in the discarded earlier intraday prices.



So for example at 6pm on the 24th April my code would see these two daily prices, the last price on the 23rd which is a closing price (marked 23:00), and the last price on the 24th which was the 17:41 snapshot:

2015-04-23 23:00:00  97.8750  184546
2015-04-24 17:41:42  97.9075  184555

Then just after lunch of the 27th April it would see these three (the intraday price on the 24th replaced with the close):


2015-04-23 23:00:00  97.8750  184546
2015-04-24 23:00:00  97.9250  184564
2015-04-27 12:00:05  97.9125  184568
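The two views above can be reproduced with a few lines of plain Python. This is a sketch rather than my production code: the function name `daily_last_prices` is made up, and the rows are a subset of the Eurodollar prices shown earlier.

```python
from datetime import datetime

# (timestamp, price) rows taken from the Eurodollar table above
rows = [
    (datetime(2015, 4, 23, 19, 43, 42), 97.8825),
    (datetime(2015, 4, 23, 23, 0, 0), 97.8750),
    (datetime(2015, 4, 24, 12, 14, 24), 97.8675),
    (datetime(2015, 4, 24, 17, 41, 42), 97.9075),
    (datetime(2015, 4, 24, 18, 42, 1), 97.9075),
    (datetime(2015, 4, 24, 19, 42, 22), 97.9025),
    (datetime(2015, 4, 24, 23, 0, 0), 97.9250),
    (datetime(2015, 4, 27, 12, 0, 5), 97.9125),
    (datetime(2015, 4, 27, 13, 0, 29), 97.8975),
]


def daily_last_prices(rows, as_of):
    """Last price seen on each day, using only rows stamped at or before as_of."""
    latest = {}
    for stamp, price in sorted(rows):
        if stamp <= as_of:
            # Later rows on the same day overwrite earlier ones
            latest[stamp.date()] = (stamp, price)
    return [latest[day] for day in sorted(latest)]


view_6pm_24th = daily_last_prices(rows, datetime(2015, 4, 24, 18, 0))
view_lunch_27th = daily_last_prices(rows, datetime(2015, 4, 27, 12, 30))
```

Because closing prices carry the 23:00 stamp they automatically replace the last intraday snapshot once the day is over.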


There are some other options available, some of which are proprietary. My lips are sealed.
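Returning to the irregular timestamp option in the list above, here is one possible reading of it as a sketch, not a definitive method. Each squared price change is converted to a per hour figure and then weighted by the time it covers, which simplifies to the sum of squared changes divided by the total elapsed time. It uses price changes rather than percentage returns, measures time in elapsed clock hours, and deliberately covers a single session so that the awkward question of market closures does not arise.

```python
import math
from datetime import datetime

# Intraday snapshots for 23rd April from the Eurodollar table above
snapshots = [
    (datetime(2015, 4, 23, 14, 18, 54), 97.8525),
    (datetime(2015, 4, 23, 15, 19, 41), 97.8375),
    (datetime(2015, 4, 23, 16, 42, 34), 97.8575),
    (datetime(2015, 4, 23, 17, 42, 56), 97.8600),
    (datetime(2015, 4, 23, 18, 43, 21), 97.8675),
    (datetime(2015, 4, 23, 19, 43, 42), 97.8825),
]


def time_weighted_vol(points):
    """Volatility per square root hour from irregularly spaced prices."""
    sum_sq_change = 0.0
    total_hours = 0.0
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        total_hours += (t1 - t0).total_seconds() / 3600.0
        sum_sq_change += (p1 - p0) ** 2
    # Weighting each (change squared / hours) by its hours gives this ratio
    return math.sqrt(sum_sq_change / total_hours)


hourly_vol = time_weighted_vol(snapshots)
print(hourly_vol)
```

Extending this across overnight gaps and weekends needs a decision about how much "time" a closed market represents, which is exactly the complexity warned about above.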



Storage


There are a number of ways that we can store timeseries data like prices. This is not the place to discuss them, and such a discussion would be highly specific to your technology stack. However the usual points around efficiency, speed and file size should be borne in mind. Bear in mind that your price database will be write once (from a single process) and read many times (from multiple processes).

One question that arises is whether you should store or ignore “nans” (empty values, not multiple grandmothers). So if you try and get a price, and the market is closed or not liquid enough, do you store a nan, or ignore it and do nothing. This has implications for how your data storage and analysis works. For example ignoring nans and not storing them means you will have stored an irregular timeseries which needs to be converted back to regular; if you store nans then your “native” timeseries will be regular out of the box. I don't store nans myself.


Timestamps


Prices must be marked with the correct timestamp. If you get this wrong you risk committing the horrific crime of “forward looking” data, where if you backtested you'd get prices earlier than you really could.

You can use local time in the market or “home” (wherever your computer is sitting). If you use local time you need to store the time zone, and any relevant daylight savings offset, with the data. This is especially important if you're trading across multiple time zones and using cross market information.
 
If you use your “home” time then you need to make sure that this is comparable to other times you might be using, for example the time of economic announcements and earnings releases.

You can also use the timestamp that arrives with the price (if relevant), or stamp it with your own on arrival. Both methods can occasionally give the wrong results. Timestamps aren't always correct on prices from data vendors, and price ticks don't have an equal amount of latency so might not always arrive in the right order. Very fast traders backtesting their systems need to be aware of this problem and employ solutions.

For slower traders the difference is minimal, so I use the simplest solution of using “home” system time and stamping prices on arrival.

Beware that a closing (or open, high, low...) price won't normally have a time attached. The default behaviour for a lot of systems would be to give it a midnight timestamp. This is incorrect as it's probably many hours before the market closed on that day. You can either use an artificial timestamp or the actual closing time. Even if you use the real close it might be better to use a fixed artificial time (perhaps five minutes after the official close), because it's important to be able to distinguish intraday and closing prices.

To keep things simple as we've already seen I mark all close prices with the artificial timestamp 23:00. Since I use local system time on price arrival, and I'm never collecting prices then*, it isn't possible for a price to have this timestamp naturally.

* A later post will discuss scheduling in more detail.
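Marking closing prices with the artificial 23:00 timestamp, and recognising them later, might look something like this. The names `CLOSE_STAMP`, `stamp_close` and `is_close` are made up for this sketch, and it assumes naive datetimes in "home" system time.

```python
from datetime import date, datetime, time

# Artificial time of day used to mark closing prices
CLOSE_STAMP = time(23, 0)


def stamp_close(trade_date: date) -> datetime:
    # Give a dateless closing price a timestamp on its own trading day,
    # rather than the misleading midnight default
    return datetime.combine(trade_date, CLOSE_STAMP)


def is_close(stamp: datetime) -> bool:
    # Only works if intraday prices can never be collected at exactly 23:00
    return stamp.time() == CLOSE_STAMP


close_stamp = stamp_close(date(2015, 4, 24))
print(close_stamp, is_close(close_stamp))
```

The check relies entirely on the scheduling guarantee that nothing is sampled at 23:00; if that can't be promised, an explicit flag column is the safer choice.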

Beware of markets where there isn't really an open or close (I'm thinking here of OTC FX markets in particular which trade 24 hours). Be aware of when and how the “open” or “close” price was calculated. Also beware of Asian markets if you're trading in US or European time. Close prices will probably be naturally datestamped for the following day. That's not the correct approach if you're using “home” system time.




Other contracts and synchronisation


I didn't really clarify above why I collected closing prices, as well as intra-day. After all I could probably just collect intraday prices right up to the close. Given the weirdness of closing prices, and the hassle of marking timestamps, why bother?

The reason is that it's sometimes useful to have synchronised data. In my case I like to know the price of the current future relative to a nearby contract, to work out carry / rolldown / contango (so what I want is the spread between the two). Other examples would be if you're running a cointegrating mean reversion system for a bunch of equities, or trading on the run versus off the run bonds.

Getting synchronised cross market prices is difficult. If you're capturing tick data and you line up the time series of ticks so they're as close as possible, you'll get a very noisy measure as bid-offer "bounce" affects one market then the other. Sometimes an explicit market in the spread is quoted (as for calendar spreads in certain markets, like Eurodollar).

But closing prices are automatically synchronised, and for a slow trader much easier than collecting multiple synchronised tick prices.

(Though they aren't always tradeable. In many cases only one futures contract actively trades, and the rest are given closing prices that are “marked to model” with fixed offsets from that. )

So I collect closing prices for both the currently traded, and a nearby contract.

(I also get closing prices for other contracts, to make decisions about rolling. Again this will be a subject of a later post).


I don't bother collecting intraday data for the non traded contract since it would probably give me quite a noisy estimate of the spread; and for the purposes of what I am doing daily data is sufficiently quick.

Here is some more data for Eurodollar. I've removed some additional rows from the second and subsequent days.


                     PRICE    CARRY
2015-04-21 23:00:00  97.9050  97.985
2015-04-22 12:00:06  97.9175  NaN
2015-04-22 13:00:33  97.9025  NaN
2015-04-22 14:00:59  97.9075  NaN
2015-04-22 15:01:20  97.8675  NaN
2015-04-22 16:07:57  97.8475  NaN
2015-04-22 17:08:22  97.8425  NaN
2015-04-22 18:08:43  97.8275  NaN
2015-04-22 19:09:02  97.8325  NaN
2015-04-22 23:00:00  97.8250  97.905
2015-04-23 12:15:44  97.8575  NaN
... snip ...
2015-04-23 19:43:42  97.8825  NaN
2015-04-23 23:00:00  97.8750  97.955
2015-04-24 12:14:24  97.8675  NaN
... snip ...
2015-04-24 19:42:22  97.9025  NaN
2015-04-24 23:00:00  97.9250  98.005
2015-04-27 12:00:05  97.9125  NaN



You can see that I'm only collecting closing prices for the non priced “Carry” contract, but intraday and closing prices for the “Price” contract.
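Pulling the synchronised spread out of data like this is then straightforward. In this sketch the function name `closing_spreads` and the sign convention (price contract minus carry contract) are assumptions for illustration, and only a few rows from the table are used.

```python
import math

nan = float("nan")

# (timestamp, price contract, carry contract) rows from the table above
rows = [
    ("2015-04-21 23:00:00", 97.9050, 97.985),
    ("2015-04-22 12:00:06", 97.9175, nan),
    ("2015-04-22 19:09:02", 97.8325, nan),
    ("2015-04-22 23:00:00", 97.8250, 97.905),
    ("2015-04-23 12:15:44", 97.8575, nan),
    ("2015-04-23 23:00:00", 97.8750, 97.955),
]


def closing_spreads(rows):
    """Price minus carry, only where both were captured at the same time."""
    return [
        (stamp, price - carry)
        for stamp, price, carry in rows
        if not math.isnan(price) and not math.isnan(carry)
    ]


spreads = closing_spreads(rows)
for stamp, spread in spreads:
    print(stamp, round(spread, 4))
```

Only the 23:00 rows survive, since those are the only points where both contracts have a price.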

This approach of using closing prices is good for slower traders for whom the spread is a secondary input, like me. But if you're trading more quickly, and you actually trade spreads, butterflies or other baskets of multiple instruments, then you really need to work harder at getting a synchronised price. This could be done by getting an explicit price for the spread or basket, or trying to snapshot at identical times. The latter will not always be tradeable, but that is for another post.

Minimum liquidity


Should you collect market prices when you know the markets are liquid, or check markets are liquid when you collect prices?

Huh?! I hear you collectively say...

I'll be clearer, or try to be. The first approach is to limit our exposure to bad prices by only getting them when we expect markets to be open and liquid. We don't collect prices during holidays or during quiet sessions. We need to keep track of when quiet sessions are (with daylight savings adjustments), and exchange holidays. This is a lot of work.

The second approach is to keep trying to get prices all the time, but to check that the price quote meets some minimum liquidity requirement (maximum spread and/or minimum inside size). Obviously if a market is closed then this requirement will be automatically failed! This involves less manual work and setting up of market characteristics, but does waste system time trying to access prices in closed or very quiet markets.

I use a mixture of these two approaches. So I limit my sampling to active opening times in markets, although I don't account for changes in daylight savings. I will try and get prices on holidays (unless I've turned my system off on a “universal” holiday to do some maintenance). I don't impose a minimum liquidity requirement (although I do store liquidity data for monitoring purposes), but an updated bid and an ask must be available within a specified time period or I won't return a price to my system (the pseudo code later will make that clear).
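For those who do want the second approach, a minimum liquidity check could be as simple as the sketch below. The function name `quote_is_usable`, its parameters and the thresholds in the example calls are hypothetical; sensible values will differ for every market.

```python
import math


def quote_is_usable(bid, ask, bid_size, ask_size, max_spread, min_size):
    """True if the quote meets a maximum spread and minimum inside size."""
    # A closed market usually gives missing values, which fail automatically
    if any(math.isnan(x) for x in (bid, ask, bid_size, ask_size)):
        return False
    spread_ok = (ask - bid) <= max_spread
    size_ok = min(bid_size, ask_size) >= min_size
    return spread_ok and size_ok


# Made-up quotes and thresholds, purely for illustration
nan = float("nan")
print(quote_is_usable(97.85, 97.855, 1200, 800, max_spread=0.01, min_size=100))
print(quote_is_usable(nan, 97.855, 0, 800, max_spread=0.01, min_size=100))
```

The thresholds become one more set of market characteristics to maintain, although far fewer than a full calendar of holidays and quiet sessions.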



Spikes and cleaning


Even with a minimum liquidity requirement (and definitely without one) bad prices will sometimes creep through. In my career I've seen prices of zero, prices out by a factor of 10, 100 or more; prices for the wrong instrument or contract coming through... you name it. To be safe all prices should undergo a series of automated checks before being added to the database.

There are four approaches for potentially bad prices (sometimes called “spikes” because that is what they look like on a chart).


The first is to just exclude it. Not advised except for obviously wrong values. We can usually exclude prices of zero (unless we're getting a value for a spread market or something else where such a price would be valid). It's more work but you could also set up market specific boundaries; interest rate futures like Eurodollar are unlikely to go below 50 (a 50% interest rate) or much above 100 (100 representing interest rates of zero, so higher values are no longer “impossible”).

Secondly you exclude it, but tell the user about it. This is correct for the grey area where a large price move is conceivably correct, but it's more likely to be a result of bad data. This is the approach I use. There is then a process that the user (me) can run and manually check and accept or reject the price, bypassing the normal automated checks.


Thirdly you add it to your database, but warn the user. This is correct for the grey area where a large price move is probably correct, but could also be a result of bad data. As I'll discuss in later posts you also want to have systems which aren't too vulnerable to one bad price.

A difficult dilemma is whether to allow users to correct existing prices in the database that turned out to be wrong. I am a purist – since this means we're adding forward looking information I don't allow myself to do it.

Whilst I don't generally use this approach, large price movements are flagged in other diagnostic reports I run (see later blog post on this subject).


Finally you add it, and don't tell the user. Not advised unless a price is within wholly conceivable bounds.

The big unanswered question is how we define “conceivable”. A common technique, which I also use, is to look at the size of the price change relative to recent volatility in that price, and set trigger points if the change is a given multiple of recent volatility. If the change is unlikely to have occurred naturally you might want to store it and warn the user (third approach). If it's very unlikely to have occurred you probably want to reject it and warn the user (second approach).

You need to calibrate the levels here depending on how many prices you are collecting, and how many warnings you want to check. For example my system rejects a price if the change is more than 8 daily standard deviations. Because I'm collecting for 45 markets I usually get about one warning a week.
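The volatility based check might be sketched like this. The reject level of 8 comes from the text; the warn level of 4, the function name `classify_price_change` and the example numbers are assumptions for illustration. `daily_vol` is taken to be a recent standard deviation of daily price changes, estimated elsewhere.

```python
def classify_price_change(new_price, last_price, daily_vol,
                          warn_multiple=4.0, reject_multiple=8.0):
    """Return 'accept', 'warn' or 'reject' for a newly sampled price."""
    if daily_vol <= 0:
        raise ValueError("daily_vol must be positive")
    # Size of the move in units of recent daily standard deviation
    size = abs(new_price - last_price) / daily_vol
    if size >= reject_multiple:
        return "reject"   # second approach: don't store, tell the user
    if size >= warn_multiple:
        return "warn"     # third approach: store, but tell the user
    return "accept"       # store silently


# Made-up prices and volatility, purely for illustration
print(classify_price_change(97.91, 97.90, daily_vol=0.03))
print(classify_price_change(98.05, 97.90, daily_vol=0.03))
print(classify_price_change(97.60, 97.90, daily_vol=0.03))
```

Tightening either multiple produces more warnings to check by hand, so the levels are really a statement about how much manual work you are prepared to do.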

Very occasionally you will get a real very large move like the flash crash, and my approach will mean I won't react to it until I've manually checked the price, which could be many hours later. For fully automated systems trading relatively slowly it's worth having the additional robustness versus the penalty of occasionally missing a big move (which if it's a flash crash type scenario will be quickly reversed anyway).


Volumes – beware


As I noted above I collect volume data to inform my roll decisions. I don't currently use volume data as a signal, although it's very popular with a lot of technical systems.

Specifically for futures it is hard to pick out the real trend in volumes when you have multiple delivery months trading, and people rolling from one month to another which shows up in volume statistics but isn't “real” volume.

Equities traders, and anyone using an instrument with multiple trading venues, may also find it hard to get a complete picture of volumes.

In conclusion I'd be extremely wary of using volumes to make trading decisions unless you are confident you can get reliable and meaningful data.


Non traded markets - should I bother


On the face of it you should only collect prices for markets you intend to trade, or already have positions on. However there are some exceptions to this. There might be a market you are thinking about trading in the future. Similarly you might have stopped trading a market temporarily (like I did with the two year German Schatz bond future). Since “backfilling” prices is often a slow and complex exercise (if you do it right and are careful) it may be easier to keep on collecting prices.



Price stitching - where and how


The final point (before I give you some pseudo code) is about price adjustment. This is a very futures specific issue, although it affects any instrument where you need to “roll”. So for example if you're using quarterly spreadbets to gamble on FX movements, then unless you want to close your June 2015 position you'll need to roll it into September.

The problem we have is we can't just use the June price until we roll, and then the September price. This creates a discontinuity in the price at the roll point. There are some good discussions on various methods of dealing with this, for example here.

I use the simple “panama” method. This has the advantage of being simple and requiring historical prices to be adjusted only when we roll, and means that the current level of the traded contract is identical to the back adjusted price. The two disadvantages of this method cited in the web link (introducing a trend bias, and losing relative price differential) are not important in my opinion.

From an implementation point of view there are two possible times you can do this adjustment. You can either store non continuous prices and then do the stitching on the fly each time you need to. Alternatively you can store both non continuous and stitched prices (meaning you need a process that regularly creates the latter from the former). This is the approach I take. It is faster, but means you have to stick to the same set of historical switching dates.

With the panama method changing your switching dates will change the history of prices, and so the patterns they form. The difference is subtle, and won't worry most people.
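A bare bones sketch of panama style back adjustment follows. It assumes each contract's prices are supplied as a list of (date, price) pairs covering the period it was held, with consecutive contracts overlapping on the roll date. The function name `panama_stitch` and the prices are invented for illustration.

```python
def panama_stitch(segments):
    """Back adjust earlier contracts so the series has no jump at each roll.

    `segments` is a list of [(date, price), ...] lists in roll order. The last
    entry of one segment and the first entry of the next share the roll date.
    """
    stitched = []
    for segment in segments:
        if stitched:
            roll_date, new_price = segment[0]
            old_date, old_price = stitched[-1]
            if old_date != roll_date:
                raise ValueError("segments must overlap on the roll date")
            gap = new_price - old_price
            # Shift the whole earlier history by the gap, and drop the old
            # contract's roll date price in favour of the new contract's
            stitched = [(d, p + gap) for d, p in stitched[:-1]]
        stitched.extend(segment)
    return stitched


# Made-up prices, purely for illustration
june = [("2015-06-10", 100.0), ("2015-06-11", 101.0), ("2015-06-12", 102.0)]
sept = [("2015-06-12", 104.5), ("2015-06-15", 105.0)]
stitched = panama_stitch([june, sept])
print(stitched)
```

The most recent contract is never adjusted, so the last stitched price is always the price you can actually trade at, and day to day changes in the earlier history are preserved.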

Some data suppliers also provide pre-calculated continuous prices. Make sure you know how they are doing the adjustment. I prefer to do my own.

There is much more to the mechanics of rolling so I will return to this in another post.


Pseudo code for my price collection code


Although I'm calling this pseudo code, it's actually a heavily edited version of my real Python code, where I've removed everything except bits that illustrate the underlying logic. But if you're not a Python fan you can ignore that and just pretend it's in the higher level language of your choice.

I haven't included the end of day pricing since it's just a less interesting version of the intraday (spikes are still checked but we don't need to pretend we're proactively getting prices).

The main pricing process is kicked off just after midnight local time and consists of a massive while loop:



while okay_to_run:

    ## Big bad ass price collection loop

    ## Is it after 8pm, or whenever we stop? Then autostop
    if now()>last_sample_at:
        log.info("Normally stop pricing process as all markets closed")
        okay_to_run=False
        break

    for code in all_codes:

        market_closed=check_market_is_closed(code)

        if market_closed:
            ## do not sample as market is closed
            continue

        if check_process(dbtype, "SAMPLING", code)=="STOP":
            ## do not sample as process is prevented from running
            continue

        last_run=get_last_run(code)

        if last_run is not None:
            if (now() - last_run).total_seconds()<(60*60):
                ## Too soon to run again, we want hourly prices at best
                continue

        raw_sample_instrument(code)

        ## Do price stitching. This returns the number of new prices added
        new_prices_added=sample_adj_instrument(code)

        if new_prices_added>0:
            ## have new data - run signals code
            ## Because there is some variation in how long this takes we end up
            ## collecting prices at different frequencies
            signals_runner(code)

    ## end of for loop

## End of while loop







def raw_sample_instrument(code, entrymode="AUTO"):

    ## If we run manually would put entrymode="MANUAL"

    ## Get top of order book data - see next function
    bookdata=get_market_data(code, snapshot=True)

    mid_price_value=midprice(bookdata)

    if isnan(mid_price_value):
        log.warning("Found no new and/or acceptable prices")

    else:
        ## Use local time for sampletime
        sampletime=now()
        pricing_data=TimeSeries([mid_price_value], index=[sampletime])

        resolve_and_add_pricing_data(code, pricing_data, entrymode)

        ## Liquidity measures (the code that stores them is not shown)
        size=inside_size(bookdata)
        spread=inside_spread(bookdata)







def get_market_data(code, snapshot=True, maxstaleseconds=60, maxwaitseconds=30):

    """

    Return market data, with maximum staleness. If snapshot=False, or if the data is stale, initiates a sampler to get it

    If it doesn't get all fields filled within maxwaitseconds then returns nans

    User stories

    - I am a price sampler. I want the latest tick (with some definition of staleness)

    (Checks for latest tick. If doesn't exist, starts ticking. Waits until a full tick list is available. Then returns it. If nothing after some period of time, return nans)

    [also used for general get a price diagnostic eg when rolling instruments]

    - I am checking for liquidity prior to . I want the latest marketdata and you need to carry on sampling afterwards if you get one.

    """

    global mymarketdata





    ## Get market data

    if mymarketdata is None:

        ## not yet cached

        mymarketdata=simple_market_data()

    ## Try and get current data (whether or not the cache already existed)

    stored_data=mymarketdata.get_contract(code, maxdelay=maxstaleseconds)

    if _no_nans(stored_data):

        ## We can use these prices as there are no nans

        useable_prices=True

    else:

        ## We can't

        useable_prices=False


    if not snapshot or not useable_prices:

         ## Need to start a tick server

         ## Note this will create a tickid if one is required

         start_ticker_for_contract(dbtype, tws, contract)

         started_tick_server=True

    else:

         started_tick_server=False



    start_time=now()

    ## note this loop will die immediately if we already have useable prices

    timespent=0

    while not useable_prices and timespent<maxwaitseconds:

        stored_data=mymarketdata.get_contract(code, maxdelay=maxstaleseconds)

        useable_prices=_no_nans(stored_data)

        timespent=(now() - start_time).total_seconds()

    if started_tick_server and snapshot:

        ## need to stop the server because we are taking a snapshot

        ## don't want to continue ticking, reduce the number of prices we're getting

        ## note ticks may still arrive...

        ## note the tickid will still live on!

        ## note we won't turn off if active order

        stop_ticker_for_contract(code)



    return stored_data
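
The wait loop in get_market_data is a wait-with-timeout pattern: keep asking for data until it is usable or the time allowance runs out, then return whatever you have. A minimal standalone sketch of that pattern follows. The names `wait_for_usable`, `fetch`, `is_usable` and `poll_seconds` are invented for this illustration, and the short sleep between polls is an addition so the loop doesn't spin flat out.

```python
import math
import time


def no_nans(data):
    """True if every field in the tick dictionary has a real value."""
    return all(not math.isnan(value) for value in data.values())


def wait_for_usable(fetch, is_usable, max_wait_seconds=30.0, poll_seconds=0.01):
    """Call fetch() until is_usable(result) or the time allowance is used up.

    Always returns the last result fetched, usable or not, so the caller
    still has to check for nans (as the sampling code does).
    """
    deadline = time.monotonic() + max_wait_seconds
    data = fetch()
    while not is_usable(data) and time.monotonic() < deadline:
        time.sleep(poll_seconds)
        data = fetch()
    return data


# Fake data source: the first two polls return nans, the third a full tick
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"bid": math.nan, "ask": math.nan}
    return {"bid": 99.0, "ask": 101.0}


result = wait_for_usable(fake_fetch, no_nans, max_wait_seconds=1.0, poll_seconds=0.001)
```

Using a monotonic clock for the deadline means the wait isn't affected if the system clock is adjusted part way through.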







def resolve_and_add_pricing_data(code, pricing_data, entrymode):



    MIN_OBSERVATIONS=10

    current_price_matrix=read_prices_for_contract(code)

    if current_price_matrix.shape[0]>MIN_OBSERVATIONS:

        ## resolve the mode, returns None if AUTO and prices fail checks

        pricing_data=resolveprice(current_price_matrix, pricing_data, entrymode)

    else:

        ## just leave prices as they are

        log.warning("Not enough existing prices, can't do any checks")

    if pricing_data is not None:

        ## only store prices that passed the checks, or that couldn't be checked

        add_price_matrix(code, pricing_data)

    return pricing_data





def resolveprice(current_price_matrix, pricing_data, entrymode):

   """

   Function that resolves prices depending on entrymode AUTO or MANUAL

   Returns new pricing_data - if this is None then we don't have any valid prices (happens only in AUTO mode)

   """



   assert entrymode in ["AUTO", "MANUAL"]


   spike=checkspike(current_price_matrix, pricing_data)

   if spike:

      """

      Potentially bad price


      If running in manual mode then allow user to override, else flag

      """


      if entrymode=="MANUAL":


         ## Allows you to manually check prices, and weed out anything that's bad

         pricing_data=manual_pricing_data(current_price_matrix, pricing_data, describe)


      elif entrymode=="AUTO":

         ## no user interaction possible

         log.critical("Sample failed spike check price move")

         pricing_data=None


   return pricing_data
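
The body of `checkspike` isn't shown above. One simple way such a check could work, purely as an illustration and not necessarily the rule used in my system, is to compare the new price move with the typical size of recent moves. The function name `is_spike`, the plain list of prices and the multiple of 8 are all assumptions made for this sketch.

```python
import statistics


def is_spike(previous_prices, new_price, max_multiple=8.0):
    """Flag new_price if it moves more than max_multiple typical moves.

    previous_prices: list of historical prices, oldest first (needs at least 2).
    The typical move is the mean absolute change between consecutive prices.
    """
    if len(previous_prices) < 2:
        raise ValueError("Need at least two previous prices to check for spikes")

    changes = [
        abs(later - earlier)
        for earlier, later in zip(previous_prices[:-1], previous_prices[1:])
    ]
    typical_move = statistics.mean(changes)
    latest_move = abs(new_price - previous_prices[-1])

    if typical_move == 0:
        # Completely flat history: treat any move at all as suspicious
        return latest_move > 0

    return latest_move > max_multiple * typical_move


history = [100, 101, 100, 101, 100, 101, 100, 101, 100, 101, 100]
small_move = is_spike(history, 101.5)   # 1.5 against a typical move of 1
large_move = is_spike(history, 120.0)   # 20 against a typical move of 1
```

A check like this needs some history to work with, which is why the calling function skips it when there are too few observations.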






The end

This is the first post in a series about the nuts and bolts of creating systematic trading systems. The second post is here:

http://qoppac.blogspot.co.uk/2015/05/systems-building-futures-rolling.html



23 comments:

  1. Hi again,

    While looking again at your example of stored data, I noticed that you index the time series data by an incrementally numbered column. Any specific reason for this? Why not use the timestamp in the database as the index instead? Thanks

    Replies
    1. No particular reason. Probably about 25 years ago when I did my first SQL database I read a book which said you should always have a separate index key for certain kinds of data. It's a hard habit to get out of, although it's costing me a few extra K of storage space. Assuming you had a unique timeseries index (and you could make this a condition of no duplicates when you add data, as indeed I do) there is no reason why this shouldn't be the index.

  2. Rob,
    There was a previous comment of mine, which apparently was not published.

    Anyhow, the idea was to congratulate you for the blog and time you put on it.
    For people out there like me, who like to put hands and head to work through learning by doing, this is of great value.
    Hope this gives you more motivation to keep going on!

    Helio S.

  3. This comment has been removed by the author.

  4. You are talking about recording intra-day data yourself.. why not just download historical data from Interactive Brokers using their API?

  5. Hi Oleg

    Essentially this is what I discuss in this section of the blog:

    "The first option to collect closing prices, as you would do in simulation. This has the advantage that simulation will match reality precisely. The disadvantage is you are deliberately using stale information.

    ... snip....

    Also closing prices can sometimes be a little untrustworthy. They may not, particularly for less liquid contracts, represent levels you could actually have traded at. "

  6. Hi it would be great if you could write a post of using python to create continuous contracts. Additionally, how would one go about constructing the second month continuous futures contract?

  7. Hi it would be great if you could do a post on how to create continuous future contract in python. Like how would you create the second month continuous contract?

    Replies
    1. Sure, at some point. I will add it to the list.

  8. Hello Rob,
    Regarding the choice of local/exchange time or 'home' PC time under the 'Timestamps' section of your article - would there be a significant issue with just using GMT/UTC, in an attempt to avoid the need for DST adjustment?

    Replies
    1. You could do that. But since the DST adjustment happens over a weekend it doesn't cause any problems that I can think of.

  9. Hi, Rob. I think that quandl provides prestitched prices by default for futures. I was planning to rely on this data for my implementation of your system. What am I losing by not manually stitching the prices as you do? Thanks.

    Replies
    1. Mainly flexibility in which contract you trade. But this only applies to a small number of instruments.

  10. Hi, I've got a question regarding the combination of intraday and closing price data. Just to check I'm understanding it correctly, is the method you use:
    1. snap an intraday mid price
    2. use this most recent snap and the previous days' closes to calculate new signals and trade if appropriate
    3. repeat the above two steps each hour

    I've included a commented example below in case it's unclear. Data format is [Date @ Time: Price]
    2017-08-01 @ 23:00: $100.5 (close price of 1st)
    2017-08-02 @ 23:00: $102.2 (close price of 2nd)
    2017-08-03 @ 15:00: $101.8 (first snap, use this and closes from 1st and 2nd to calculate signals)
    ...
    2017-08-03 @ 16:00: $103.1 (second snap an hour later, use this and closes from 1st and 2nd to calculate a new set of signals)

    If my take on it is correct, if one uses just daily data in the backtest could both the volatility and turnover estimated be too low? I guess it's possible you could end up trading each hour if there's a lot of intraday volatility which might not be caught by the backtest on daily close data. Or is the idea that a fairly slow system shouldn't be too adversely affected by intraday volatility to cause significant diversions from a backtest using daily close prices? Apologies if I've missed something obvious!

    And thank you for the blog and book, they make for fascinating reading!

    Replies
    1. You have it exactly right.

      "If my take on it is correct, if one uses just daily data in the backtest could both the volatility and turnover estimated be too low?"

      It shouldn't affect volatility estimates. But yes in theory turnover will be higher. But as you say the effect is tiny for slow trading.

      Eventually I will switch from this overcomplicated system to a simple daily system based on the prior days close.

      GAT

  11. Hi Rob, I hope all is well. I have read and reread this excellent post. My question is when you append an instrument's intraday price to its daily close timeseries to generate forecasts, do you treat this intraday price like just another closing price? Or do you resample everything to your sampling frequency before calculating ewma values? I see a possible issue with the latter approach as the generic 2300 timestamp will not generate the correct timedelta between closing and sampling. Or maybe it's neither of the above and some other proprietary technique which I appreciate you would not be able to share?

    Replies
    1. "do you treat this intraday price like just another closing price?"

      Yes

      To reiterate what I've said before I wouldn't replicate this approach which I've concluded is overly complex.

  12. Thanks Rob - appreciate the feedback as always.

    Replies
    1. Quick follow up. I presume a simpler approach would be to generate trades based on the closing price, in the traditional way, and simply treat the difference between the closing price and the live price when the execution is done as execution slippage which can be thought of as random noise? I suppose if you take into account any effort put into signal smoothing and possibly adding position inertia/buffering, which introduce meaningful lags into the system, then worrying about a few hours between the forecast generation and trade execution shouldn't actually cause too much concern. Am I broadly on the right track?

    2. Yes you're on the right track. In fact you could break down your execution cost to:

      A- "delay" - difference between mid close, and the mid price when you actually come to execute
      B- "bid/ask spread" - difference between mid when you start to execute, and best bid or offer
      C- "slippage" - difference between best bid or offer, and fill price

      And I'd expect A to have mean zero, unless you're trading relatively quickly. It's also possible to simulate the effect of a one day delay in your orders (by using the closing price on the following day- pysystemtrade does this by default) and checking to see if you get any improvement in live trading from trading before the close.
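
The A/B/C breakdown above can be written out as a small calculation. This is a sketch with made-up numbers. The function name and argument names are invented for illustration, and costs are signed so that a positive number always means you did worse than the reference price.

```python
def execution_cost_breakdown(side, mid_close, mid_at_execution,
                             best_bid, best_ask, fill_price):
    """Split total execution cost into delay, bid/ask spread and slippage.

    side: "BUY" or "SELL". All outputs are in price units, positive = cost.
    """
    if side not in ("BUY", "SELL"):
        raise ValueError("side must be 'BUY' or 'SELL'")

    sign = 1 if side == "BUY" else -1
    # The price you would pay (buy) or receive (sell) at the top of the book
    touch = best_ask if side == "BUY" else best_bid

    delay = sign * (mid_at_execution - mid_close)      # A
    bid_ask_spread = sign * (touch - mid_at_execution)  # B
    slippage = sign * (fill_price - touch)              # C

    return {
        "delay": delay,
        "bid_ask_spread": bid_ask_spread,
        "slippage": slippage,
        "total": sign * (fill_price - mid_close),
    }


buy_costs = execution_cost_breakdown(
    "BUY", mid_close=100.0, mid_at_execution=100.5,
    best_bid=100.25, best_ask=100.75, fill_price=101.0,
)
```

The three components always sum to the total, which is the difference between the fill price and the closing mid. Logging them separately for each trade is what would let you check over time whether A really does average out to zero.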

  13. Thanks, that makes perfect sense. I suppose I should at least simulate live forecasts so I can test the 0 mean hypothesis in a few years from now.

  14. I realize it's a big ask but I would find it very useful to see a write-up expanding on this:

    Use irregular timestamp analysis functions. For example for an estimate of return volatility rather than equally weighting squared returns, weight them by the amount of time each return happened over. This is roughly equivalent to the previous option but more computationally efficient. However there are some complexities around periods when the markets are closed, which need to be considered and carefully handled.

    Replies
    1. There are, this is indeed a big complicated area. For example one approach is to use a Brownian bridge to interpolate missing data and produce a continuous series (or as near as damn it, say minute by minute). But you need to have an estimate for the difference between vol when the market is closed vs when it is open.


      Frankly it's too big an area for me to write a blog post about and I'm not an expert.
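
For what it's worth, the simplest reading of the time-weighting idea quoted in the comment above can be sketched in a few lines. It assumes price changes behave like a random walk with constant volatility per unit of time, and it ignores the open versus closed market problem entirely, so treat it as a toy rather than a recommendation. The function name and the use of days as the time unit are choices made for this illustration.

```python
import math


def time_weighted_vol(prices, times_in_days):
    """Volatility per square root day from irregularly spaced prices.

    Sums squared price changes and divides by the total time covered, which
    is the same as averaging (change squared / elapsed time) with weights
    equal to the elapsed time of each change.
    """
    if len(prices) != len(times_in_days) or len(prices) < 2:
        raise ValueError("Need at least two prices with matching timestamps")

    sum_squared_changes = 0.0
    total_time = 0.0
    for i in range(1, len(prices)):
        elapsed = times_in_days[i] - times_in_days[i - 1]
        if elapsed <= 0:
            raise ValueError("Timestamps must be strictly increasing")
        sum_squared_changes += (prices[i] - prices[i - 1]) ** 2
        total_time += elapsed

    return math.sqrt(sum_squared_changes / total_time)


# Daily spacing: changes of +1, -2, +1 over three days
regular_vol = time_weighted_vol([100.0, 101.0, 99.0, 100.0], [0.0, 1.0, 2.0, 3.0])

# One change of 2 spread over four days counts for less than the same move in a day
irregular_vol = time_weighted_vol([100.0, 102.0], [0.0, 4.0])
```

With equally spaced data this collapses to the usual root mean square of price changes.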

