
Archived Discussions


The Algorithmic Traders' Association prides itself on providing a forum for the publication and dissemination of its members' white papers, research, reflections, works in progress, and other contributions. Please note that archive searches and some of our members' publications are reserved for members only, so please log in or sign up to get the most from our members' contributions.

Clean Data is crucial


 Fred Quatro, Programmer/Analyst at ATS

 Wednesday, July 29, 2015

I have been the lead Java programmer at a small ATS firm for six years. We have an algorithm that cleans our data before our trading algorithm sees the ticks; I would think most other systems do the same thing. We recently discovered it needs work. We use quotes only, but may switch to trades, or use trades to validate the quotes. Our latest strategy is a trend strategy that uses various SMAs and EMAs, and these bad ticks are giving us invalid averages. Any suggestions would be helpful. We subscribe to a live stream from NxCore.
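
For illustration, one common shape such a cleaning step can take is a rolling-median spike filter: reject any quote whose price strays too far from the recent median before it reaches the SMA/EMA calculations. The sketch below is in Java (Fred's language); the window size, the threshold, and the class itself are illustrative assumptions, not Fred's actual algorithm or NxCore's API.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Rolling-median spike filter: reject a tick whose price deviates from the
// median of the last N accepted ticks by more than a threshold. Window size
// and threshold are illustrative and would need per-symbol tuning.
public class SpikeFilter {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double maxDeviation; // e.g. 0.01 = 1%

    public SpikeFilter(int windowSize, double maxDeviation) {
        this.windowSize = windowSize;
        this.maxDeviation = maxDeviation;
    }

    // Returns true if the tick passes; accepted ticks update the window.
    public boolean accept(double price) {
        if (window.size() == windowSize) {
            double median = median();
            if (Math.abs(price - median) / median > maxDeviation) {
                return false; // suspected bad tick: keep it away from the SMA/EMA
            }
        }
        window.addLast(price);
        if (window.size() > windowSize) {
            window.removeFirst();
        }
        return true;
    }

    private double median() {
        List<Double> sorted = new ArrayList<>(window);
        sorted.sort(null);
        int n = sorted.size();
        return (n % 2 == 1) ? sorted.get(n / 2)
                            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }
}

Suspect ticks could be quarantined and cross-checked against the trade stream rather than dropped outright, in line with Fred's idea of using trades to validate the quotes.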



27 comments on article "Clean Data is crucial"

 Volker Knapp, Consultant at WealthLab

 Saturday, August 1, 2015



IMO you hit the jackpot. Take a look at BCR from last week. It was just the high that was out of bounds. Probably all data providers report $201.73 as the high.


https://www.dropbox.com/s/f5q37dzvzj8qbha/BCR%202015-07-28.png?dl=0


This is just one of many examples of bad data on an EOD basis. I am not sure how you can clean data at the tick level.

We actually cleaned our WealthData at the tick level to create clean EOD data for ourselves and soon for our customers. Doing it for a few hundred symbols took over a year... and I am sure it still contains errors.

There are so many reasons for data to go bad, and at the tick level in live trading it seems even harder. Why do you all want to trade in such a competitive environment? What is your annual expected return and largest drawdown?




 Adrian Pitt, Independent Investment Management Professional

 Sunday, August 2, 2015



I don't have any real experience dealing with US stocks on a tick-level time frame, but I do know MANY of those stocks can be extremely erratic. Here's the thing. Let's assume you acquire a database, eliminate all errors, and create your model. How do you plan, going forward, to eliminate errors in REAL TIME? How will your model hold up when errors start occurring? I would suggest there are more problems simply from the regular volatility of short-term fluctuations than from any data errors. Virtually every real-time data provider has filters that will block the most obvious errors. If you find you worry excessively about data errors, then you are directing your stress in the wrong direction and need to rethink your major concerns. Something INFINITELY more important than worrying about the occasional bad tick slipping through the filters is the robustness of your model in handling it over time. If your model is so finely tuned that it cannot handle getting punched in the head now and again, then frankly your model sucks : )



 Fred Quatro, Programmer/Analyst at ATS

 Tuesday, August 4, 2015



Charles: Like Sam said, I think these ticks are out of order. If we "cut and pasted" a cluster of ticks just one second later (or earlier), they would fit right in. We track the exchange code with each tick, but haven't found a pattern yet (see the sketch after this comment).

Adrian: We use the same cleaning algorithm for our live feed as we do for our backtesting. We are always aware of latency. We also 'measure' the condition of the live feed and handle our trades accordingly.
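
On the out-of-order clusters Fred describes, a minimal sketch of a first diagnostic: flag any tick whose exchange timestamp runs backwards relative to the previous tick from the same venue. The Tick record here is an illustrative assumption, not NxCore's actual message layout.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Flags ticks whose exchange timestamp runs backwards for their venue,
// a cheap way to surface the out-of-order clusters described above.
public class SequenceCheck {
    record Tick(String exchange, long exchangeTimeMillis, double price) {}

    public static List<Tick> outOfOrder(List<Tick> ticks) {
        Map<String, Long> lastSeen = new HashMap<>();
        List<Tick> flagged = new ArrayList<>();
        for (Tick t : ticks) {
            Long prev = lastSeen.get(t.exchange());
            if (prev != null && t.exchangeTimeMillis() < prev) {
                flagged.add(t); // arrived with an earlier stamp than its predecessor
            }
            lastSeen.merge(t.exchange(), t.exchangeTimeMillis(), Math::max);
        }
        return flagged;
    }
}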



 Fred Quatro, Programmer/Analyst at ATS

 Tuesday, August 4, 2015



Volker: It's no surprise it took you more than a year to clean over one hundred symbols. We will have to tweak our cleaning algorithm slightly for each symbol. Tedious, but worth it for safety.



 Richard Barden, Senior Sales, Real-Time Business at Morningstar

 Tuesday, August 4, 2015



Interesting thread - as someone who works for a vendor with a large historical tick data service, we are well aware of a lot of these issues. Suffice it to say that data, like our world, is imperfect, and as Adrian suggests, your model needs to be able to cope with that. Exchange-traded data feeds will have outages, fat-finger errors, 'snake-in-the-grass' trades, and all sorts of other strange but legitimate (real) events. And exchange data feeds are generally much cleaner and more predictable than OTC ones.


 Adil Reghai, Head of Quantitative Research, Equity Derivatives and Commodities at Natixis

 Tuesday, August 4, 2015



Data is not and will never be perfect or clean. I believe one should see the data as one possible path and build a stream of paths around it. Then one can understand deeply how sensitive any further treatment is. If the impact is low, there is no need to do more cleaning...
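
Adil's suggestion lends itself to a simple experiment; here is a hedged sketch: perturb the observed series with small random noise to generate a stream of alternative paths, then measure how much a downstream statistic (a simple SMA in this example) moves. The series, the noise level, and the class names are illustrative.

import java.util.Random;

// Sensitivity sketch: treat the observed series as one possible path,
// generate perturbed paths around it, and see how far a downstream
// statistic moves. A small spread means further cleaning buys little.
public class PathSensitivity {
    public static void main(String[] args) {
        double[] closes = {191.4, 192.0, 190.8, 193.2, 192.9, 191.7, 192.5};
        Random rng = new Random(42);
        double base = sma(closes);
        double maxShift = 0.0;
        for (int path = 0; path < 1000; path++) {
            double[] noisy = closes.clone();
            for (int i = 0; i < noisy.length; i++) {
                noisy[i] *= 1.0 + 0.001 * rng.nextGaussian(); // ~0.1% noise
            }
            maxShift = Math.max(maxShift, Math.abs(sma(noisy) - base));
        }
        System.out.printf("SMA %.4f, worst shift over 1000 paths: %.4f%n", base, maxShift);
    }

    static double sma(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }
}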


 Volker Knapp, Consultant at WealthLab

 Tuesday, August 4, 2015



There is so much to say about data. The first thing would be: what market is it? Forex, futures, stocks...

Then you have to look at the time frame you want the data in. The shorter the time frame, the more you have to deal with issues like delayed data (I am talking milliseconds), Internet connection, pings, and filtering or correcting data (in real time). Now you have to ask yourself whether the corrections you apply in real time are the ones you applied to your test data... it goes on forever.

Let me tell you, after correcting years of data, recovering historical data from discontinued symbols, correcting it, and using all this information in my end-of-day strategies, I found out that my systems make about 20% less average profit per trade (using clean data including discontinued symbols). Using other data from expensive providers makes another 30% difference. Still good enough for me to be profitable with a decent drawdown. Some strategies turn from good winners into losers!



 Adrian Pitt, Independent Investment Management Professional

 Tuesday, August 4, 2015



@Volker May I ask what you consider to be clean data versus unclean, in percentage terms? If we compare, say, each bar's high and low from each different provider, just how different are they? From best to worst, are we talking a 10% difference? 5%? 1%? 0.5%? 0.1%? I'd be surprised, to say the least, if the difference even managed to reach the smallest increment. So while clearly we want the best data we can get, and the smaller the time frame you trade, the greater the requirement for accurate data becomes, I would suggest that unless you are a TRUE HFT trader, if your model's profitability changes by 20-30% because of data differences, then there is a serious flaw in the model somewhere. The problem ISN'T the data, unless you are talking HFT. But for 99%+ of us, to even attempt to enter the HFT realm is sheer stupidity, and not required to be highly profitable. By HFT I mean doing trades where milliseconds count. I think a far bigger and more likely problem is simply having your data feed working 99.9% of the time, and working smoothly in real time (i.e. minimal latency issues).


 Volker Knapp, Consultant at WealthLab

 Wednesday, August 5, 2015



Adrian, do I understand you correctly that you think the difference between any one of the OHLC values for a given stock across different data providers is less than 0.1%? If so, I would appreciate it if you could give me your data provider's OHLC of BCR from 28.07.2015.

Mine is: 191.38 - 201.73 - 189.73 - 192.95



 Kaustabh Ray, Equity Research Technologist and Systems Architect

 Wednesday, August 5, 2015



I would have thought that every tick is important, since it is transactional data.

If you do not want bad tick data, what you can do is take only the values within the 95th percentile and discard the outliers when building the OHLC data.
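
Kaustabh's suggestion, as a sketch: build a bar's high and low from trimmed percentiles of the day's ticks instead of the raw min/max. The cut-offs and sample prices are illustrative; note Volker's caveat in the next comment that this also discards genuine extreme trades.

import java.util.Arrays;

// Builds a bar's high/low from trimmed percentiles of the tick prices
// rather than the raw min/max, dropping a fraction from each tail.
public class TrimmedBar {
    public static double[] trimmedHighLow(double[] ticks, double tailFraction) {
        double[] sorted = ticks.clone();
        Arrays.sort(sorted);
        int lo = (int) Math.floor(sorted.length * tailFraction);
        int hi = (int) Math.ceil(sorted.length * (1.0 - tailFraction)) - 1;
        return new double[] { sorted[hi], sorted[lo] }; // {high, low}
    }

    public static void main(String[] args) {
        double[] ticks = {192.1, 192.3, 192.0, 201.73 /* spike */, 192.2, 191.9};
        // Aggressive 20% trim only because this demo sample is tiny.
        double[] hl = trimmedHighLow(ticks, 0.2);
        System.out.printf("trimmed high %.2f, trimmed low %.2f%n", hl[0], hl[1]);
    }
}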


 Volker Knapp, Consultant at WealthLab

 Wednesday, August 5, 2015



That would be too easy and would eliminate the true trades that you actually want. We all want real "outliers"!



 Adrian Pitt, Independent Investment Management Professional

 Wednesday, August 5, 2015



@Volker I've no idea, as I have never done the test myself. That is why I asked, since I assumed you would have done a similar test when comparing data sources; you mentioned you spent so much time cleaning data that clearly you felt it was of critical importance. I'm sure there would also be many other ways to compare data.

My BCR data source shows BCR for 28.07.2015 as: (OHLC)

* - 200.09 - 189.73 - 192.95

(15:57:42 contains a 1 tick data spike, clearly an error IF no trade took place above 193.18, so the REAL high isn't even 200.09)

Sometimes, though, these data spikes are REAL, in the sense that someone accidentally pays up at a brief moment when most of the offers vanish. My vendor shows 100 shares traded at 201.73. For reasons unknown to me, the data vendor adjusted the high to 200.09, which bears no resemblance to anything.

As a trader I would of course just ignore the spike. As a model tester the answer is more complex. Whether the spike high is real or not at the time, what would one's model do? I think you have to take it in your stride. To eliminate ALL noise from data, while technically correct, is actually incorrect when it comes to running a real-time model. A real-time model MUST be able to survive such spikes. BUT, I also feel a model must be backtested on as real a data set as possible, so that it is based on real market characteristics and not the rare outlier spike. So yes, cleaning data is important to get rid of these obvious and ridiculous spikes, but what about the much smaller ones that presumably occur several magnitudes more often? And that are much more difficult to detect. Most of my time is spent looking at FX or futures data, and these sorts of outlier spikes just don't occur there.



 Adrian Pitt, Independent Investment Management Professional

 Wednesday, August 5, 2015



open 191.38



 Adrian Pitt, Independent Investment Management Professional

 Wednesday, August 5, 2015



Agree 110% with Volker on that one. That 1 data spike 'error' on BCR for example represents around 0.0002% of tick data for the day.


 Volker Knapp, Consultant at WealthLab

 Wednesday, August 5, 2015



Well, it is not important what percentage of the tick data it represents; more important is what it does to your data and your backtesting results. I know because in one system that I follow on WealthSignals I should have had a short position at the high, but it did not show up in my account. After doing the WealthData correction it was clear that the high is just wrong. All providers will display the "untradable price" until eternity, except WealthData.



 Adrian Pitt, Independent Investment Management Professional

 Wednesday, August 5, 2015



Your point, Volker, just reinforces what I have been saying: a model must be able to take the 20-30% hit you refer to and still meet your minimum qualifications for a viable model. I will often just build this into the testing by increasing the slippage number to something greater than reality. This also helps build in future capacity for size. The importance of the 0.0002% is to show how little it takes to impact a model on a far grander scale. At the same time, it is important not to get carried away, as most people will find that a large % of the errant ticks don't impact the signal at the time anyway on most models. You need a lot of things aligning, in many cases, for it to impact a trade. When testing a model, always try to eliminate the X% best trades and see if you would still want to trade the model. What should X be? Well, that is really a discussion for a completely different thread.
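
Adrian's robustness check, as a sketch: recompute total profit with the best X% of trades removed and see whether the model still qualifies. The PnL figures and the 10% cut are illustrative.

import java.util.Arrays;

// Stress test: drop the best X% of trades and recompute total profit.
// A model that only works with its luckiest trades included is fragile.
public class BestTradeStress {
    public static double profitWithoutBest(double[] tradePnls, double dropFraction) {
        double[] sorted = tradePnls.clone();
        Arrays.sort(sorted); // ascending: best trades land at the end
        int keep = sorted.length - (int) Math.round(sorted.length * dropFraction);
        double total = 0;
        for (int i = 0; i < keep; i++) total += sorted[i];
        return total;
    }

    public static void main(String[] args) {
        double[] pnls = {120, -40, 15, 300, -25, 60, -10, 85, 40, -55};
        System.out.printf("all trades: %.0f, without best 10%%: %.0f%n",
                sum(pnls), profitWithoutBest(pnls, 0.10));
    }

    static double sum(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s;
    }
}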


 Volker Knapp, Consultant at WealthLab

 Wednesday, August 5, 2015



True, there are a lot of things to think of, but using good clean data eliminates some worries. If you use good clean data you will see the real trades you would have made. If you then trade the system, the worst thing that can happen is that your order is still in but your data shows something else.

That is better than being in a position that you shouldn't be in.



 Adrian Pitt, Independent Investment Management Professional

 Wednesday, August 5, 2015



OK... so what % of their time does a person allocate to cleaning data versus creating, coding, testing and implementing their trading model(s)? The answer will be different for everyone of course, depending upon the size of their team, which could be one person to hundreds. Unlucky if you trade stocks, since there are thousands of them. In the Forex world, there are only 28 pairs worth bothering with, and even that is pushing it. Increasing the time frame of one's model will eliminate the vast majority of error spikes as relevant at all, I would suggest. But at what cost to ROE? It's never easy :)


 Volker Knapp, Consultant at WealthLab

 Wednesday, August 5, 2015



We are concentrating on just the NASDAQ 100, S&P 100 and DOW 30 stocks. We only correct end-of-day data, and that is enough work. Even though we have an algorithm for it, it still requires daily manual intervention.


 Volker Knapp, Consultant at WealthLab

 Friday, August 7, 2015



@Adrian and all

This is what your data provider reported and what you found out (after investigating).

+++++++++++++++++++++++++++++++++++++++++++++++

My BCR data source shows BCR for 28.07.2015 as: (OHLC)

* - 200.09 - 189.73 - 192.95

(15:57:42 contains a 1 tick data spike, clearly an error IF no trade took place above 193.18, so the REAL high isn't even 200.09)

+++++++++++++++++++++++++++++++++++++++++++++++

This is what WealthData has to say:

WD: 191.92 193.18 189.73 192.95

This is what Yahoo reports (and probably all the others):

Y!: 191.38 201.73 189.73 192.95

There is a huge difference in the opening price and the high price.

The high was because of a spike at the end of the day.

The opening is a result of all data providers reporting the first trade, often a one-lot trade from an exotic exchange. We report the one with the highest volume (a sketch follows this comment), because we call our data "tradeable data" and not "if you are lucky you get the first one-lot price" data.

To all end-of-day backtesters out there: you should read this and consider it in your backtesting!
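
A sketch of the "tradeable open" rule Volker describes, under illustrative assumptions (the Trade record and the opening window are hypothetical, and WealthData's exact rule is not spelled out in this thread): within an opening window, take the price of the highest-volume trade rather than the chronologically first print.

import java.util.List;

// "Tradeable open" sketch: within an opening window, use the price of the
// largest trade rather than the first print (often a one-lot on a minor venue).
public class TradeableOpen {
    record Trade(long timeMillis, double price, long size) {}

    public static double open(List<Trade> openingWindow) {
        Trade best = openingWindow.get(0);
        for (Trade t : openingWindow) {
            if (t.size() > best.size()) best = t; // highest volume wins
        }
        return best.price();
    }

    public static void main(String[] args) {
        List<Trade> window = List.of(
                new Trade(0, 191.38, 100),     // first print, tiny odd lot
                new Trade(350, 191.92, 25_000) // primary-exchange opening cross
        );
        System.out.println("tradeable open: " + open(window));
    }
}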



 Adrian Pitt, Independent Investment Management Professional

 Friday, August 7, 2015



You have discovered just how backward America is with so many exchanges (for the same stocks), and why it's so important to have one OFFICIAL (NON-PROFIT) exchange where all business is done, and where there is only ONE official opening price, determined electronically by balancing all the opening bids and offers. When all this takes place, it minimizes the chance of bogus opening values and bogus spikes.

Volker, I do believe (though I may be wrong) that the BCR spike is the exception to the rule. And even when such spikes do occur, in most cases they won't impact the vast majority of trading models, as you need various things to align at that exact moment. You mentioned it impacting your models' profit by 20-30%. Those sorts of numbers seem extremely high to me, and point more to a flaw in the type of model you are running than to excessive data spikes in the historical data.

In the BCR example, for instance, it would not have impacted day traders, as most have gone home by the time the close comes. Swing traders would not have been impacted in reality: while the spike was not real, the price went right back up there in the next few days anyway, allowing real positions to exit. And of course it would have had no impact on medium/long-term trades either. So it leaves me wondering what model you run that was adversely impacted here. A short-term mean-reversion model? Those sorts of models generally have high trade frequency, so the occasional spike would never impact profits by 20-30%.

In summary, I'm happy to stand by what I stated right in my very first post, and have no reason to change it, subsequent to all the discussion that has followed.



 Sam Birnbaum, Founder at Quadra Analytics, Inc.

 Friday, August 7, 2015



BCR is a listed stock (NYSE). The official open and close prices are established by the NYSE. For any stock that is not listed, the official open/close is established by NASDAQ. The high, low and VWAP are based on composite prices for trades that occur between the official open and close. It does not matter which data source you are using: you should always use the official open/close from either NYSE or NASDAQ, and the data providers should tag those messages as the official open/close prices.
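
A sketch of Sam's point, with illustrative placeholders: trade messages carry condition codes that, among other things, flag the official opening and closing prints, so a bar can be built from properly tagged trades. The field names and code strings below are placeholders, not the real CTA/UTP condition-code tables; consult your vendor's specification for the actual values.

import java.util.List;

// Builds OHLC using condition-tagged official open/close prints.
// Condition strings are placeholders for the vendor's real code tables.
public class OfficialOhlc {
    record Trade(double price, String saleCondition) {}

    public static double[] ohlc(List<Trade> trades) {
        double open = Double.NaN, close = Double.NaN;
        double high = Double.NEGATIVE_INFINITY, low = Double.POSITIVE_INFINITY;
        for (Trade t : trades) {
            if ("OFFICIAL_OPEN".equals(t.saleCondition()))  open  = t.price();
            if ("OFFICIAL_CLOSE".equals(t.saleCondition())) close = t.price();
            high = Math.max(high, t.price());
            low  = Math.min(low, t.price());
        }
        return new double[] { open, high, low, close };
    }
}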



 Adrian Pitt, Independent Investment Management Professional

 Friday, August 7, 2015



You are quite right, Sam. It perhaps highlights some flaws in the data vendors' collection or sourcing process. It is my understanding that Yahoo is sourced from CSI, which has a long history of providing accurate data. I have always used them for EOD commodity data for as long as I can remember.

Volker, what is WealthData? Is that just a term used as part of the Wealth-Lab program? Where is it sourced from?


 Volker Knapp, Consultant at WealthLab

 Friday, August 7, 2015



@Adrian

These data spikes have no impact on live trading, but they do on backtesting for all traders using end-of-day data, including swing traders.

You are right about the exchange; I am amazed at how the opening price gets settled! I am sure the EUREX does it differently.

The * (the missing open from your source) is the difference between using clean data and bad data. I use clean data!

WealthData is a product that we plan to launch soon for all WealthLab users.

@Sam

You are right. I am not sure how all data providers handle the data, but I am sure there is hardly any tick filtering going on.



 Sam Birnbaum, Founder at Quadra Analytics, Inc.

 Friday, August 7, 2015



@Volker: Every message has a code or set of codes identifying its contents: is it a trade, a quote, a correction, etc. It also identifies the originating venue. In the same manner, it identifies the opening/closing trades, their price and size. If your data source does not pass along those items/codes, then, if possible, you should switch to another data vendor.


 Volker Knapp, Consultant at WealthLab

 Friday, August 7, 2015



Sam, I am talking about end-of-day data. I know the things you are telling me; after all, we corrected the bad end-of-day data for close to 400 symbols (both existing and no longer existing).

In the end-of-day world there are no codes, tags or sets. But since you wear a suit I am sure you have Bloomberg data; just check the OHLC of BCR on that date, I am curious.



 Sam Birnbaum, Founder at Quadra Analytics, Inc.

 Friday, August 7, 2015



@Volker, I used to have Bloomberg and used it for real-time and EOD analysis, monitoring symbols in 5 major indices. I can't help you with your request, as I no longer have access to Bloomberg.

