Quant Basics 1: Data Sources


August 17, 2017

 

Introduction

Welcome to the Quant Basics series. This mini series came from the observation that most people starting out in quantitative trading focus almost entirely on the generation of trading signals. While this is important, several other areas of quantitative trading strategy development are even more crucial, such as:

  • Data
  • Vectorised Backtesting
  • Performance analysis and strategy optimisation
  • Position management
  • Execution

In this series we will mostly focus on the first few of those areas. We start with an introduction on how to download and condition market data from free sources. This matters because vectorised backtests are usually fast, which makes it possible to run through a wide range of possible configurations to find tradable sweet spots. In subsequent articles we cover:

  • Parameter optimisation with Monte Carlo, grid sweep and (possibly) simulated annealing/genetic algos
  • Out-of-sample shortfall estimation with bootstrapping, BRAC, train-test correlation
  • Ways to avoid overfitting and find regions of tradable parameter sets using unsupervised machine learning
  • Manually training a machine learning classifier to produce a compound metric from PnL, drawdown, Sharpe and other measures, rather than using ranking
  • Building a machine learning model of our strategy that helps us speed up walk-forward optimisation

 

Preparing Market Data

Let's look at some simple code we can use to download data for free. In the example below we have three sources: Google Finance, Quantopian and a random number generator. We also show how to save the data in a cPickle file for repeated use, so that we don't have to download them every time. You might ask why we would use a random number generator. The short answer is that it gives us a more controllable time series, which produces results that are easier to understand and is useful for code testing.

import pandas as pd
import numpy as np
import pandas_datareader.data as web
from dateutil.parser import parse
import cPickle

# Choose the data backend: 'quantopian', 'google', 'random' or 'file'
# BACKEND = 'google'
BACKEND = 'file'


tickers = ['AAPL','MSFT','CSCO','XOM']
start = '2003-01-01'
end = '2017-06-01'

def prices(tickers, start, end, backend='google'):
    if backend == 'quantopian':
        # Only available inside the Quantopian research environment.
        p = get_pricing(tickers, start, end)
        field = 'price'

    elif backend == 'google':
        # Download daily prices from Google Finance and forward-fill gaps.
        p = web.DataReader(tickers, 'google', parse(start), parse(end)).ffill()
        field = 'Close'
        # Cache the result on disk so we don't have to call the API again.
        cPickle.dump(p, open('prices.pick', 'wb'))

    elif backend == 'random':
        # Keep the Google Finance dates but replace the prices with a random walk.
        field = 'Close'
        p = web.DataReader(tickers, 'google', parse(start), parse(end)).ffill()
        for ticker in tickers:
            p[field][ticker] = np.cumsum(np.random.randn(len(p[field][ticker])) - 0.0) + 500

    elif backend == 'file':
        # Load previously cached prices from disk.
        p = cPickle.load(open('prices.pick', 'rb'))
        field = 'Close'

    pp = pd.DataFrame(p[field], index=p[field].index, columns=tickers)
    return pp

p = prices(tickers, start, end, backend=BACKEND)

What is happening here? In the first lines we import a few Python modules; you should be reasonably familiar with these already. The dateutil.parser.parse function is incredibly handy, as it can turn almost any time string into a datetime object, which can then be used for datetime arithmetic. The only drawback of this function is its speed: every time it is called it has to infer a suitable format from the string, which takes time. The fastest way to parse time strings is to write your own function that is specifically tailored to a pre-defined format. You can try this as an exercise and compare the speed of your function with the dateutil parser.
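
As a minimal sketch of that exercise, a parser hard-wired to the 'YYYY-MM-DD' format we use for the start and end dates could look like the following (the fast_parse name is ours, purely for illustration):

from datetime import datetime
from dateutil.parser import parse

def fast_parse(s):
    # Tailored to the fixed 'YYYY-MM-DD' format, so no format inference is needed.
    y, m, d = s.split('-')
    return datetime(int(y), int(m), int(d))

print(parse('2003-01-01'))       # dateutil infers the format on every call
print(fast_parse('2003-01-01'))  # same result from the hand-rolled parser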

Next, we define a global variable BACKEND. Global variables should be used sparingly and should never change while the code is running. This variable defines where the data come from. In our case we have a choice between the Quantopian backend (which can only be used within the Quantopian research environment), Google Finance (which we access through the incredibly useful pandas-datareader module), a random data generator, and finally a cPickle file. cPickle is a 'serializer', which means that it can turn almost any Python object into a file that can be stored on disk. In our case we store the data we've downloaded from Google Finance in order to avoid calling their API every time. This often saves a lot of processing time, particularly when the amount of data we download is large.
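
As a minimal sketch of that idea (using the same cPickle module as above; in Python 3 the module is simply called pickle, and the prices_demo.pick filename is just for this example), caching a DataFrame on disk and reading it back looks like this:

import cPickle
import pandas as pd

df = pd.DataFrame({'AAPL': [100.0, 101.5], 'MSFT': [50.0, 50.5]})

# Serialise the DataFrame to disk once ...
cPickle.dump(df, open('prices_demo.pick', 'wb'))

# ... and on later runs load it back instead of calling the data API again.
cached = cPickle.load(open('prices_demo.pick', 'rb'))
print(cached.equals(df))  # True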

When we request data from one of the backends we usually have to specify a start and an end time, which we define as strings. Note that for the Google web data reader we have to parse the dates into datetime objects, while Quantopian accepts a plain string format. Also take note of the ffill() call at the end of the data reader line: it performs a forward fill. Sometimes we end up with missing data points due to differences in exchange opening times, public holidays or emergencies where a particular exchange has to close. In those cases we carry the last price forward and assume that nothing has changed on that date. This is important because many functions cannot handle NaNs, Infs or other non-numerical values.
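
A minimal sketch of what ffill() does to a series with gaps (the dates and values here are made up for illustration):

import numpy as np
import pandas as pd

dates = pd.date_range('2017-01-02', periods=5)
prices_with_gaps = pd.Series([100.0, np.nan, 102.0, np.nan, 103.0], index=dates)

# Forward fill: every NaN is replaced with the last observed price.
print(prices_with_gaps.ffill())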

 

Preparing Synthetic Price Data

Let’s now focus on the random number generator which is contained in the following line:

 np.cumsum(np.random.randn(len(p[field][ticker])) - 0.0) + 500 

This nifty piece of code singlehandedly produces something that looks like a real price series. How does it do that? Let's run through the statement bit by bit. First we see np.random.randn(N), which produces N normally distributed random numbers. These numbers are the price differences from one period to the next; here we work with absolute price differences as opposed to percentage returns. The length of the vector matches the number of trading days returned by Google Finance, so we can compare the two series more easily if we wish to do so.

Next we subtract (or add) a constant value from the random numbers. In our case this is zero, but it could be some arbitrary value. This shifts the mean of our normally distributed returns so that the expected value is nonzero. The bias creates an artificial trend, which can be helpful when we test our strategy. In most cases a trending series would be simulated using autocorrelation, but here we mainly want a series that we understand very well and that gives us consistent results when we test our strategy.

Once we have our biased or unbiased price differences we apply the cumsum() function in order to turn them into "actual" price data. This process of cumulative summation is the discrete analogue of stochastic integration.

The image above shows a series without bias on the left and with bias on the right. Note that the curves are NOT the actual prices of the stocks in the legend. For comparison, let's look at the price differences before we apply the cumulative sum:

We can see that our price differences generally hover around the zero point, and it would be impossible to guess from this image where their cumulative sum will end up.
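
Putting the pieces together, here is a short standalone sketch of an unbiased and a biased synthetic series; the 0.05 drift and the fixed seed are arbitrary choices for illustration, and the +500 shift is explained below:

import numpy as np

np.random.seed(42)   # fix the seed so the synthetic series is reproducible
n = 1000             # number of simulated trading days

# Unbiased walk: zero-mean price differences, cumulatively summed.
unbiased = np.cumsum(np.random.randn(n) - 0.0) + 500

# Biased walk: subtracting a small constant shifts the mean of the
# differences and creates an artificial (downward) trend.
biased = np.cumsum(np.random.randn(n) - 0.05) + 500

print(unbiased[-1], biased[-1])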

Finally, we notice that we add a value of 500 to the cumulative series. This shifts the whole data set deep into positive territory and avoids negative prices. Arguably, a better way to do this is to use geometric Brownian motion (GBM). With GBM we generate percentage returns, add 1 to them and calculate the cumulative product. That would look similar to the following example:


 

 gbm = np.cumprod(1 + np.random.randn(1000) * 0.01) * 500 

Note that we multiply the normally distributed random numbers by a volatility value of 0.01. If we make this value much larger, the price ends up drifting towards zero very quickly and not moving back up again, as shown in the next figure:

The above figure shows one of the great properties of GBM: the size of the price movements is a function of the absolute price level, since we are working with relative differences. The underlying price distribution is now log-normal.
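
As a small sketch of that behaviour, here is the same cumulative-product construction at two volatility levels; the 0.05 value is just an arbitrary 'large' volatility for comparison:

import numpy as np

np.random.seed(0)
n = 1000

# GBM-style series: percentage returns, plus 1, cumulative product, scaled to 500.
calm = np.cumprod(1 + np.random.randn(n) * 0.01) * 500
wild = np.cumprod(1 + np.random.randn(n) * 0.05) * 500

# The series with the larger volatility typically drifts towards a much
# lower level, as described above.
print(calm[-1], wild[-1])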

Both of these data sets have their pros and cons, and we have to decide which one fits our purpose. Personally, I tend to use cumulative sums for testing as they are simpler and easier to handle.

This concludes the first part of the quantitative mini course. If this was too basic for you, please stay with us. In the next part we look at how to set up a vectorised backtest.

The code base for this section can be found on Github.