Thursday, November 15, 2012

Must Read - How to Use Real Time Tick Data in HFT

Tick data have been gaining importance steadily in recent years. Traders and researchers are no longer satisfied with low-frequency financial market data such as monthly and weekly series.

The demand for high-quality high-frequency data (tick data), that is intra-day, intra-hour and intra-minute data, is soaring. Traders analyze high-frequency data to make decisions and build trading strategies. Researchers use such data to study market microstructure, test hypotheses and develop new models. Thanks to the rapid growth of information technology, we are now able to efficiently capture, transfer, store and process huge volumes of high-frequency data.

Sources of High Frequency Tick Data

Financial markets are the core source of high-frequency data. The global financial markets produce millions to trillions of data points per day. These data include quotes, transaction prices, transaction volumes, transaction times and other information such as the counterparties of a transaction. Each logical unit of high-frequency data is called a “tick”. In most centralized markets such as stock exchanges, especially the electronically traded ones (like NSE and MCX), transaction data are recorded electronically by the exchange and provided directly to interested parties. These data contain detailed information about each transaction and are relatively consistent. In decentralized, over-the-counter (OTC) markets such as the foreign exchange market, which trades globally around the clock, transaction data are not recorded by any central institution and there is no comprehensive data source. Data vendors like Reuters capture transaction data from the global market and feed it to their customers in real time. Data vendors also provide tools to build customized data sets.

Data Characteristics

By nature, high-frequency tick data means huge volumes. As a consequence, there is also considerable potential for errors in the data. The “bad ticks” need to be cleaned, or filtered out, before further analysis or before executing a quantitative strategy.

The most important characteristic of high-frequency tick data is that, because market transactions take place irregularly, the data are irregularly spaced in time. In time series analysis, such series are called inhomogeneous time series. One may need interpolation methods to create homogeneous time series for further analysis. In financial markets, changes in the transaction price are discrete and fall only on a fixed set of values, because exchanges impose rules restricting price changes in order to retain stability and functionality. The smallest allowable price change is called a tick, and price changes must fall on multiples of the tick. On certain exchanges there are also other limits on the intra-day price change. On the other hand, in active markets, extreme price changes are uncommon among rational traders. As a result, changes in the transaction price fall on only a small number of values and often exhibit a very high degree of kurtosis.
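The tick-grid constraint is easy to illustrate. Below is a minimal sketch, assuming a hypothetical tick size of 0.05 purely for illustration, of snapping raw prices onto the exchange tick grid:

```python
# Minimal sketch: snapping raw prices onto the exchange tick grid.
# TICK_SIZE is an assumed value for illustration; use the instrument's actual tick size.
TICK_SIZE = 0.05

def snap_to_tick(price, tick=TICK_SIZE):
    """Round a price to the nearest multiple of the tick size."""
    return round(round(price / tick) * tick, 10)

raw_prices = [1250.031, 1250.074, 1250.126]
print([snap_to_tick(p) for p in raw_prices])   # [1250.05, 1250.05, 1250.15]
```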

High-frequency tick data often contain strong periodic patterns, or seasonality. One of the well-known diurnal patterns is the U-curve: transaction volume and volatility are significantly higher just after the open and shortly before the close. In the foreign exchange markets, where there is no open and close, transaction volume and volatility are systematically higher in the active periods of the day, where the active periods of the global markets overlap. For example, late afternoon in Europe is an active time for both the European and US markets, and the Euro/Dollar exchange rate often shows its highest volatility during this period.

Temporal dependence is another important characteristic of high-frequency tick data. Like low-frequency data, high-frequency tick data exhibit volatility clustering. But unlike lower-frequency returns, high-frequency returns show significant autocorrelation. Microstructure effects (e.g. the price formation process and the bid-ask spread) play an important role here.

First Step - High Frequency Tick Data Analysis Models

Different models have been developed that are suitable for high-frequency tick data analysis. For example, the Generalized Autoregressive Conditional Heteroskedastic (GARCH) model of Bollerslev (1986) is one of the most widely used models for analyzing intraday volatility. The model is a generalized version of the Autoregressive Conditional Heteroskedastic (ARCH) model of Engle (1982), and a full family of variations has appeared since its introduction. Generally, ARCH/GARCH models express the current error variance as a function of the previous period's error variance. The models are especially useful with irregularly spaced transaction data and have become one of the standard tools for volatility modeling.
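As an illustration, here is a minimal sketch of the GARCH(1,1) variance recursion with assumed, purely illustrative parameters and simulated placeholder returns; in practice the parameters would be estimated by maximum likelihood:

```python
import numpy as np

# Minimal GARCH(1,1) sketch with illustrative (assumed) parameters:
# sigma2_t = omega + alpha * eps_{t-1}^2 + beta * sigma2_{t-1}
omega, alpha, beta = 0.00001, 0.08, 0.90   # hypothetical values, alpha + beta < 1

def garch11_variance(returns, omega, alpha, beta):
    """Filter a return series through the GARCH(1,1) variance recursion."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = np.var(returns)            # initialise with the sample variance
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

rng = np.random.default_rng(0)
intraday_returns = rng.normal(0, 0.001, size=1000)   # placeholder return series
cond_var = garch11_variance(intraday_returns, omega, alpha, beta)
print("last conditional volatility:", np.sqrt(cond_var[-1]))
```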

The standard ARCH/GARCH models are good at modeling historical data, but their forecasting performance is somewhat disappointing. The availability of ultra-high-frequency data offers econometricians refined measurements of volatility. One of them is realized volatility, which measures the actual volatility over a past period. The realized volatility series can be modeled as a Fractionally Integrated Autoregressive Moving Average (ARFIMA) process. The ARFIMA model's forecasting performance is significantly better, but it is also more complex and data-intensive than ARCH/GARCH models.
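A minimal sketch of computing one day's realized volatility as the sum of squared intraday returns (the prices below are hypothetical placeholders); the resulting daily series is what would then be modeled, e.g. as an ARFIMA process:

```python
import numpy as np

# Minimal sketch: daily realized variance/volatility from intraday returns.
# 'prices' is a hypothetical array of regularly sampled (e.g. 5-minute) prices for one day.
prices = np.array([100.0, 100.2, 99.9, 100.1, 100.4, 100.3])

log_returns = np.diff(np.log(prices))
realized_variance = np.sum(log_returns ** 2)      # sum of squared intraday returns
realized_volatility = np.sqrt(realized_variance)

print(f"realized volatility: {realized_volatility:.5f}")
```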

For general analysis of irregularly spaced transaction data, Engle and Russell (1998) proposed the Autoregressive Conditional Duration (ACD) model. The ACD model treats the waiting time between events (the duration) as a stochastic process and proposes a new class of point processes with dependent arrival rates (thinning point processes). The model is well suited to modeling transaction volume and the arrival of other events such as price changes. The ACD model shares many features with the ARCH/GARCH models. Since its introduction, various extensions have been developed, and the ACD model has become a leading tool for modeling the behavior of irregularly spaced financial data.
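As a rough illustration, here is a minimal sketch of an exponential ACD(1,1) process for trade durations, with assumed, purely illustrative parameters:

```python
import numpy as np

# Minimal sketch of an exponential ACD(1,1) process for trade durations:
#   psi_i = omega + alpha * x_{i-1} + beta * psi_{i-1},   x_i = psi_i * eps_i,  eps ~ Exp(1)
omega, alpha, beta = 0.1, 0.1, 0.8     # hypothetical values
n = 1000

rng = np.random.default_rng(1)
x = np.empty(n)                        # observed durations between trades (seconds)
psi = np.empty(n)                      # conditional expected durations
psi[0] = omega / (1 - alpha - beta)    # unconditional mean duration
x[0] = psi[0] * rng.exponential(1.0)
for i in range(1, n):
    psi[i] = omega + alpha * x[i - 1] + beta * psi[i - 1]
    x[i] = psi[i] * rng.exponential(1.0)

print("mean simulated duration:", x.mean())
```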

If the irregular spacing of the data is ignored when modeling the marks (e.g. the transaction price), the modeling problem can also be reduced to standard econometric procedures such as Vector Autoregression (VAR) and simple linear regression.
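For instance, here is a minimal sketch of a first-order autoregression of tick-by-tick returns fitted by ordinary least squares, ignoring the irregular spacing (the returns are simulated placeholders):

```python
import numpy as np

# Minimal sketch: lag-1 autoregression of tick returns by OLS, ignoring irregular spacing.
rng = np.random.default_rng(2)
returns = rng.normal(0, 0.0005, size=500)              # placeholder tick returns

y = returns[1:]                                        # r_t
X = np.column_stack([np.ones(len(y)), returns[:-1]])   # intercept and r_{t-1}
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, lag-1 coefficient:", coef)
```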


High Frequency Tick Data Cleaning and Transformation

Today, advanced information technologies make the huge volume of high-frequency tick data much easier to manage. Nevertheless, as data volume and frequency increase, the amount of erroneous data also grows. There are different types and sources of bad data, and they can affect analysis results considerably. Thus the importance of clean data keeps rising: dirty data are not usable until they have been filtered. Moreover, because financial market information is irregularly spaced in time, data transformation may be another necessary step before further analysis.

Dirty data

Before cleaning dirty data, we must first define what “bad data” means. Bad data contain erroneous prices, amounts, transaction times and/or other information.

There are many different types and sources of errors. The cause of an error is rarely known and is often difficult to identify. Generally, we can classify the errors into three groups:

1. Human errors:

Unintentional errors: these errors arise from the high volume of data and the high frequency of transactions, e.g. decimal errors, typing errors, transposition errors, etc.

Intentional errors: These are man-made bad data. Some data contributors transmit dummy ticks to test their network connection. Some data contributors may copy and re-send data from other sources.


2. System errors: technical errors caused by computer systems, software and network connections, e.g. erroneous conversion between data formats, data damaged by system failures, etc.

3. Market bad data:

Bad data caused by processes inherent to trading, for example trades that are cancelled, replaced or corrected. These data are not relevant for analysis.

Bad data that come from multiple markets where the same security is traded simultaneously.

An exchange specialist can point out bad ticks directly from a chart, but an academic researcher might not be able to see them so easily. Moreover, different specialists may define a bad tick differently. There are obvious errors, like decimal errors, that everyone can identify, but there are also errors that lie on the borderline. Handling those marginal errors is the main problem in the data filtering process.


Tick Data Filtering using Algorithms

At Algo Trading India we develop a CEP engine (with latency in the 400 ns range) that filters real-time tick data to clean out dirty data. A data filter is a computer program that uses certain algorithms to detect and eliminate bad ticks. A data filter can also generate reports for advanced users, explaining why each bad tick was rejected.

The data filter receives raw data from external data sources (data providers or exchanges) and performs the filtering using its algorithms. As output, the filter delivers the same time series as the input, plus the filtering result, which contains a credibility value for each tick, the value of the tick and the reason for filtering. The credibility value is the central concept of a filter algorithm and serves as a criterion for the validity of the data. For example, we define a credibility value of 0 as absolutely invalid and 1 as absolutely valid. For a given data set we set a credibility value as the filtering threshold: all ticks with credibility below the threshold are rejected by the filter, while ticks with credibility equal to or greater than the threshold are accepted as clean data. The filter will also modify the value of certain bad ticks whose errors have known causes and can be automatically identified and corrected.

The main problem in developing the filter algorithm is the handling of marginal errors, which involves a tradeoff between overscrubbing and underscrubbing of the data. If we set the criteria too loosely (underscrubbing), we are still left with unusable dirty data. On the other hand, if we filter the data too tightly we may overscrub it and lose relevant information, thereby changing the statistical properties of the raw data. Moreover, different filter users may have different individual criteria for data quality; for example, users who employ different base data units (tick, 1-min, 5-min, etc.) in their analysis may weigh a single bad tick differently. Therefore, there is no absolute definition of “correct filtering”. Generally, a good data filter should efficiently remove all the false ticks that are relevant for the user, without changing the statistical properties of the raw data.

A Simple Example

We use the transaction price data of the TCS stock on 8th May 2006 as an example of the data filtering process. As a simple way to clean the dirty data, we define a score for each tick as the absolute distance between the tick and the moving average (the black trend line in the chart); note that this score runs opposite to the credibility described above, since a larger distance means a less credible tick. We set a filtering threshold of 0.8 for all the raw data: ticks whose score exceeds 0.8 are detected as invalid bad data and rejected, while ticks whose score is below 0.8 are accepted as good data.

There are a lot of marginal ticks that lie near the threshold. We can set different filtering thresholds for different analysis purposes. A lower threshold will reject more marginal ticks, and an overly strict threshold can easily lead to overscrubbing of the data. For example, with a threshold of 0.5 we already lose a lot of relevant information contained in the raw data.

Ticks that exceed the threshold are classified as bad data and rejected.
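A minimal sketch of such a filter is shown below. The window length, threshold and tick values are hypothetical; as a design choice of this sketch, the moving average is computed over previously accepted ticks only, so that a bad tick does not contaminate the trend line:

```python
# Minimal sketch of the moving-average distance filter described above.
def filter_ticks(prices, window=20, threshold=0.8):
    """Return, for each tick, an accept/reject flag and its distance score."""
    accepted_values = []          # recent accepted ticks, used for the moving average
    flags, scores = [], []
    for price in prices:
        recent = accepted_values[-window:]
        moving_average = sum(recent) / len(recent) if recent else price
        score = abs(price - moving_average)          # higher score = less credible tick
        is_good = score <= threshold
        if is_good:
            accepted_values.append(price)
        flags.append(is_good)
        scores.append(score)
    return flags, scores

# Hypothetical tick series with one obvious bad tick (a decimal error at index 4).
ticks = [1210.10, 1210.20, 1210.15, 1210.30, 121.00, 1210.25, 1210.40]
flags, scores = filter_ticks(ticks, window=3, threshold=0.8)
print(flags)   # the decimal-error tick is rejected, the rest are accepted
```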



High Frequency Tick Data Transformation

By nature, high-frequency tick data are irregularly spaced in time. An exchange has no transaction data during weekends and public holidays, and intraday transactions also show seasonality: there may be 80 transactions between 10:00 and 10:30 while no transaction happens at all between 12:00 and 12:30. In time series analysis, such series are called inhomogeneous time series. However, most time series analysis methods are based on homogeneous time series, where the data are regularly spaced in time. With the help of different interpolation methods, we can transform the raw data from an inhomogeneous time series into a homogeneous time series for our analysis. Generally, interpolation means the creation of an artificial tick between two existing ticks. The two most important interpolation methods are linear interpolation and previous-tick interpolation. Both methods have their own merits.
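A minimal sketch of both methods using pandas, transforming a hypothetical irregular tick series onto a regular 1-minute grid:

```python
import pandas as pd

# Minimal sketch: transforming an irregular (inhomogeneous) tick series into a
# homogeneous 1-minute series with previous-tick and linear interpolation.
# The timestamps and prices are hypothetical.
ticks = pd.Series(
    [100.00, 100.05, 100.10, 100.00],
    index=pd.to_datetime([
        "2012-11-15 10:00:07", "2012-11-15 10:00:41",
        "2012-11-15 10:03:12", "2012-11-15 10:04:55",
    ]),
)

grid = pd.date_range("2012-11-15 10:01", "2012-11-15 10:05", freq="1min")

# Previous-tick interpolation: carry the last observed price forward.
previous_tick = ticks.reindex(ticks.index.union(grid)).ffill().reindex(grid)

# Linear interpolation: weight the neighbouring ticks by their distance in time.
linear = ticks.reindex(ticks.index.union(grid)).interpolate(method="time").reindex(grid)

print(pd.DataFrame({"previous_tick": previous_tick, "linear": linear}))
```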

Stylized Facts

In empirical studies, seemingly random high-frequency data share some interesting statistical properties. Those properties are commonly observed across a wide range of financial instruments, regional markets and time periods. We call such properties stylized facts. Generally, econometricians categorize the stylized facts of high-frequency data into four main groups:


Autocorrelation of returns, seasonality, distributional properties and scaling properties.


Autocorrelation of Return

Generally, autocorrelation, or serial correlation, is the correlation between observations of a time series and lagged observations of the same series. In performance analysis, a positive first-order autocorrelation of period returns means that a positive (negative) return in one period tends to be followed by a positive (negative) return in the next period. A negative first-order autocorrelation means that a positive (negative) return in one period tends to be followed by a negative (positive) return in the next period.

In liquid markets, price movements at longer horizons do not exhibit significant autocorrelation. This is easy to explain: autocorrelation in price changes would enable statistical arbitrage and is therefore eliminated in an efficient market. In high-frequency tick data, however, a strong negative first-order autocorrelation can be observed, especially in foreign exchange markets. The negative first-order autocorrelation of returns is significantly stronger at small time horizons (up to about 3 minutes, i.e. between only a few trades) and disappears at longer horizons such as 30 minutes or more. It can be seen as a by-product of the price formation process in the market, resulting from discrete transaction prices and the existence of the bid-ask spread. Detailed explanations of the negative autocorrelation of returns can be found in the market microstructure literature. A traditional explanation is bid-ask bounce: transactions between sellers and buyers take place either close to the ask price or close to the bid price and tend to bounce between these two limits. Other explanations include order imbalance, diverging market opinions about the price impact of news, and trader behavior such as breaking large orders into small ones.
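As a quick illustration, here is a minimal sketch with simulated, hypothetical data that reproduces the effect: a trade price bouncing randomly between the bid and the ask produces strongly negative lag-1 autocorrelation in tick returns:

```python
import numpy as np

# Minimal sketch: lag-1 autocorrelation of tick returns under simulated bid-ask bounce.
def first_order_autocorrelation(returns):
    r = np.asarray(returns, dtype=float)
    r = r - r.mean()
    return np.dot(r[:-1], r[1:]) / np.dot(r, r)

rng = np.random.default_rng(3)
# The trade price alternates randomly around a fixed mid price (hypothetical values),
# which induces negative first-order autocorrelation in the tick returns.
mid, half_spread = 100.0, 0.01
trade_prices = mid + half_spread * rng.choice([-1, 1], size=5000)
tick_returns = np.diff(np.log(trade_prices))

print("lag-1 autocorrelation:", first_order_autocorrelation(tick_returns))
```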

The negative first-order autocorrelation of returns in the data can be seen as unwanted noise. Such temporal dependence may lead to serious errors in further analysis and should therefore be detected and removed. At Algo Trading India, we have different tools to identify the autocorrelation and several approaches to deal with it.


Seasonality

High-frequency financial tick data typically exhibit very strong periodic patterns, or seasonality. The seasonality can be found in volatility, trade frequency, volume and spreads across different markets. For most exchange-traded instruments, a U-shaped intra-day curve can be observed. Over-the-counter (OTC) instruments like foreign exchange also exhibit strong seasonality in intra-day and intra-week data.


U-Shaped Intra-day Pattern

The U-shaped pattern over the course of a trading day is commonly observed in different exchange-traded instruments, from stocks to commodities. In most exchanges, volatility is significantly high after the open, decreases during the day, and then increases again shortly before the close.

In most cases, the minimum is observed between 11:30 am and 1:30 pm.

The reason for the U-shaped curve is simple to explain. At the beginning of a trading day, the information that arrived overnight has been absorbed by the market. The market participants have analyzed the information and prepared their trading strategies for the coming day. At the open, they adjust their positions, submitting orders to build new positions or unwind old ones, so more transactions take place and volatility is driven high. During the rest of the day they prefer to wait and see, especially around the lunch break. Just before the close, all market participants make their final adjustments based on the information received during the trading day, pushing volatility to another high.
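A minimal sketch of how such a profile can be measured follows, bucketing absolute one-minute returns by time of day; the prices and trading hours below are hypothetical placeholders, and in practice the profile would be averaged over many trading days:

```python
import numpy as np
import pandas as pd

# Minimal sketch: an intraday volatility profile (the U-shaped curve) from
# absolute 1-minute returns bucketed by time of day. Prices here are hypothetical.
rng = np.random.default_rng(4)
index = pd.date_range("2012-11-12 09:15", "2012-11-12 15:30", freq="1min")
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.0005, len(index)))), index=index)

abs_returns = prices.pct_change().abs()
profile = abs_returns.groupby(abs_returns.index.time).mean()  # average over many days in practice
print(profile.head())
```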

Intra-week Pattern

Intra-week high-frequency data also show significant seasonality. Almost all exchanges are closed on weekends and holidays, so no trading takes place then. Across weekdays, the level of activity differs considerably. A day-of-the-week effect can be observed across different exchanges: in general there is a minimum of activity on Monday and a maximum on the last two working days of the week; more precisely, market activity increases gradually from Monday to Friday, and the weekly pattern resembles a W-shaped curve. Similar periodic patterns can also be found in returns, volatility and spreads.

In practical analysis, the day-of-the-week effect should always be taken into consideration. However, different markets may show differences in weekly seasonality. The situation is more complex when there are one or more holidays in the week, especially for highly interdependent markets in different countries with different holidays.

Seasonality in FX market

Unlike exchange-traded instruments, over-the-counter (OTC) instruments are not restricted by opening and closing hours. OTC instruments such as foreign exchange (FX) are traded electronically 24 hours a day around the globe. Strong seasonality can be found in OTC markets, although not the typical U-shaped curve seen on an exchange. Here we take the foreign exchange market, the largest and most liquid financial market, as an example. Before we look at the empirical findings, we should first understand how the global FX market works. Due to its OTC nature, there is no central marketplace for currency transactions and therefore no single set of quotes. Today there are three main trading centers: London, New York and Tokyo. Currency trading happens continuously throughout the day around the world: as the trading session in Asia ends, the European session begins, followed by the North American session, and finally back to the Asian session.

Analyzing the global FX market is a difficult task. Apart from the different time zones, different countries also have different public holidays. Another problem is daylight saving time: many countries, including most Asian countries, do not use daylight saving time at all. These differences must be taken into account in cross-market comparisons. Furthermore, there is no central data source for foreign exchange, and the traded volume is unknown. We have to use quotes submitted by representative market makers and collected by third-party data vendors like Reuters. These data contain less information than exchange data and are often unclean and biased by the data provider.

Once all the problems above are handled, empirical studies confirm the existence of significant seasonality in the global FX market. Take the USD/EUR quote as an example: the main daily maximum of volatility occurs between 14:00 GMT and 16:00 GMT, when both the European and the American markets are active. The main daily minimum lies between 3:00 GMT and 4:00 GMT, when both the European and American markets are closed and it is lunch break in the Asian markets. Other currencies show similar periodic patterns. Market activity is high when the relevant markets are open and actively traded, especially when the active periods of two or more relevant markets overlap.

The intra-week volatility in the FX market shows no significant day-of-the-week pattern on working days: volatility is quite similar across working days and, by nature, very low on weekends. By contrast, the bid-ask spread is low on working days and high on weekends.

Distributional Properties

Unlike their lower-frequency counterparts, high-frequency tick data show different distributional properties. High-frequency data are highly discrete, and a fat-tailed distribution of returns can be observed across different financial instruments. Empirical study of the fat tails has been gaining more and more importance, especially in the risk management of financial assets. Much research has been done in this field and many new models have been developed.
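A minimal sketch of a simple fat-tail check, comparing the excess kurtosis of a hypothetical heavy-tailed return sample against a Gaussian one (both simulated for illustration):

```python
import numpy as np

# Minimal sketch: excess kurtosis of (hypothetical) return samples.
# A normal distribution has excess kurtosis 0; fat-tailed returns give values well above 0.
def excess_kurtosis(x):
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(5)
fat_tailed = rng.standard_t(df=3, size=10000) * 1e-4   # heavy-tailed placeholder returns
gaussian = rng.normal(0, 1e-4, size=10000)

print("fat-tailed sample:", excess_kurtosis(fat_tailed))
print("gaussian sample:  ", excess_kurtosis(gaussian))
```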

Scaling Law

Different time intervals are used for different analysis purposes. A trading manager may take 5-minute data for his analysis, while a portfolio manager needs data on a 24-hour scale for the same stock. There is no privileged time interval at which the data should be investigated. In practice, a common way of scaling is to convert data measured at a short time horizon directly and linearly to a longer horizon. This approach is questionable because it ignores the interdependence of the data, which is significant in high-frequency analysis. Can we transfer results from one time scale to another without distortion? Since the early work of Mandelbrot on cotton prices, many empirical studies have confirmed the existence of a scaling law in a wide range of financial data: a direct relation between the time interval and the volatility. The scaling law takes the form of a power law.
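A minimal sketch of how the scaling exponent can be estimated, assuming the common power-law form E[|r(dt)|] ≈ c * dt^H and regressing log mean absolute return on log interval; the returns below are simulated placeholders, and for i.i.d. Gaussian returns H comes out close to 0.5:

```python
import numpy as np

# Minimal sketch: estimating the scaling-law exponent H in  E[|r_dt|] ~ c * dt**H
# by regressing log mean absolute return on log time interval.
rng = np.random.default_rng(6)
one_minute_returns = rng.normal(0, 0.0005, size=20000)    # placeholder 1-minute returns

intervals = np.array([1, 5, 15, 30, 60, 120])              # aggregation horizons in minutes
mean_abs = []
for dt in intervals:
    n = len(one_minute_returns) // dt
    aggregated = one_minute_returns[: n * dt].reshape(n, dt).sum(axis=1)  # dt-minute returns
    mean_abs.append(np.abs(aggregated).mean())

# Fit log E[|r|] = log(c) + H * log(dt).
H, log_c = np.polyfit(np.log(intervals), np.log(mean_abs), 1)
print("estimated scaling exponent H:", H)
```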

Though applicable over a wide range of time intervals, the scaling law has its limitations when applied to very long time intervals.



More study to be done…..



Lokesh Madan
