Quant Basics 10: Performance Prediction With Machine Learning
In the previous post we plotted a response surface of our strategy parameters against their PnL to assess whether our choice of parameters is rational and not just a local maximum that rapidly drops off as we move away from it. In this section we investigate how we can use machine learning to predict the performance of our parameter sets in order to save time in our parameter sweep.
You might recall that we did the parameter sweep using Monte-Carlo parameter selection in order to quickly get uniform coverage of our parameter space. If we used a systematic sweep instead, the coverage of the parameter space would increase gradually, but we would need to wait for the entire sweep to finish to reach 100% coverage.
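To make the idea concrete, Monte-Carlo selection simply draws each parameter set at random from its allowed range. Here is a minimal sketch of that idea; the parameter names and bounds below are hypothetical, not the actual ones from our sweep:

```python
import numpy as np

# Hypothetical Monte-Carlo parameter selection: draw each parameter
# uniformly at random from its allowed range.
rng = np.random.default_rng(42)
n_samples = 5000
lookback = rng.integers(10, 200, size=n_samples)   # e.g. a moving-average window
threshold = rng.uniform(0.5, 3.0, size=n_samples)  # e.g. an entry threshold
params = np.column_stack([lookback, threshold])
# Each row is one random parameter set; coverage of the space is roughly
# uniform from the very first samples, unlike a systematic grid scan.
```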
So, let’s assume we just backtested a small subset of our parameter space, like so:
```python
pnls1, sharpes1, ddwns1, pnls2, sharpes2, ddwns2, params = \
    run_parameter_sweep(tickers, start, end, BACKEND, 5000)
```
This covers roughly 17% of the total number of parameter sets we specified in this previous post (5,000 out of 30,000). We can now feed these parameters into a machine learning algorithm and very quickly predict the performance of the remaining parameter sets. Luckily, this is easily done with sklearn, and on top of that we can try different machine learning tools to see which one is the most appropriate. First, let us import some machine learning algorithms:
```python
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
```
As you can see, we use linear regression, support vector machines and decision trees. For the latter two we use their regressor variants, since we are looking for a continuous output: our PnL. Next, we fit our data using the different models:
```python
import numpy as np

def predict_pnl(params, pnls1):
    # model = SVR()
    # model = LinearRegression()
    model = DecisionTreeRegressor()
    M = 25000  # len(params) - M is the number of parameter sets used for training
    model.fit(np.array(params)[:-M, :], pnls1[:-M])
    pred_pnls = model.predict(np.array(params)[:, :])
    return pred_pnls
```
All we need to do here is uncomment the model that we intend to use. Let's start with the linear regression. Remember that in our parameter sweep we ran 30,000 strategy parameter permutations. We now train our model on 5,000 of them and then see how well it is able to predict the whole set of 30,000. Ideally, we should see a very good correlation between predicted and actual PnL. We can plot this with the code below.
```python
import matplotlib.pyplot as plt

M = 25000  # the held-out parameter sets we did not train on
pred_pnls = predict_pnl(params, pnls1)
plt.plot(pnls1[-M:], pred_pnls[-M:], 'o')
m = plot_linreg(pnls1[-M:], pred_pnls[-M:])  # regression-line helper from the code base
plt.title('slope: %s' % m)
plt.xlabel('real PnL')
plt.ylabel('predicted PnL')
plt.show()
```
Clearly, the graph below shows that a linear model is not well suited for this task. Using a non-linear model such as a support vector machine (SVM) should hopefully improve things a bit. We can do this simply by uncommenting the SVR() line in the code above and commenting out the linear regression line.
As you can see below, the performance is somewhat better but far from satisfactory.
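One possible reason for the mediocre SVM fit, which the original experiment does not explore, is that SVR is sensitive to the scale of its inputs. A hedged sketch of what standardising the parameters via a sklearn pipeline would look like; whether it actually helps here would need to be verified on the real data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Standardise each parameter to zero mean and unit variance before
# handing it to the SVR; this often improves SVR fits considerably.
model = make_pipeline(StandardScaler(), SVR())
```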
The SVM prediction clearly improves if we use half of our data set (15,000 points) for training, as shown below, but we still have an issue: our regression slope is far from the value of one that we would expect from a decent prediction.
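Before moving on, a quick aside: instead of commenting and uncommenting model lines, we could also compare all three models in a single pass. A minimal sketch of that idea, reusing the variables from the snippets above:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

models = {
    'linear regression': LinearRegression(),
    'SVR': SVR(),
    'decision tree': DecisionTreeRegressor(),
}
X = np.array(params)
M = 25000  # held-out test sets, as before
for name, model in models.items():
    model.fit(X[:-M, :], pnls1[:-M])
    pred = model.predict(X[-M:, :])
    # correlation between predicted and real PnL as a quick quality measure
    corr = np.corrcoef(pnls1[-M:], pred)[0, 1]
    print('%s: correlation %.3f' % (name, corr))
```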
Decision trees can often fit data very well, but they also have a tendency to overfit. Going back to our set with 5,000 points of training data, we get the following result:
This is far more along the lines of what we would expect; note that the slope is also close to one. Since we trained on only 5,000 points and still predict the remaining 25,000 well, we can be quite confident that this is not overfitting.
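If we want to back that confidence with numbers, k-fold cross-validation on the 5,000 training points gives a quantitative check. A minimal sketch using sklearn's cross_val_score (which reports R² for regressors by default):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X_train = np.array(params)[:-M, :]  # the 5,000 training sets from above
y_train = pnls1[:-M]
scores = cross_val_score(DecisionTreeRegressor(), X_train, y_train, cv=5)
# Fold scores close to each other (and to the full-set fit) suggest the
# tree is generalising rather than memorising the training data.
print('CV R^2 scores:', scores)
```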
Decision trees are quite good when the parameter sets we try to predict lie within the bounds of our parameter space. Since we used a Monte-Carlo sweep, we can be reasonably sure that this is the case. However, decision trees are not very good at extrapolating to parameters outside the bounds of the parameter space; linear models, where applicable at all, will usually do a better job at this.
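A toy example makes the extrapolation point concrete: outside its training range a decision tree just repeats the value of its nearest leaf, while a linear model keeps extending its trend. This sketch uses synthetic 1-D data, not our strategy parameters:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0          # a simple linear relationship
tree = DecisionTreeRegressor().fit(X, y)
lin = LinearRegression().fit(X, y)
X_out = np.array([[15.0]])         # well outside the training range
print(tree.predict(X_out))         # ~21: stuck at the last leaf value (y at X=10)
print(lin.predict(X_out))          # ~31: the linear trend extrapolates correctly
```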
Now we have an efficient method to scan the parameter space quickly and find its interesting regions, as the sketch below illustrates.
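In practice, scanning could be as simple as predicting the PnL over the full parameter set and then backtesting only the most promising candidates. A sketch of that workflow; note that backtest_subset is a hypothetical helper, not part of the original code base:

```python
import numpy as np

pred_pnls = predict_pnl(params, pnls1)
# pick the 100 parameter sets with the highest predicted PnL ...
top_idx = np.argsort(pred_pnls)[-100:]
candidates = np.array(params)[top_idx, :]
# ... and confirm only those with real backtests (backtest_subset is a
# hypothetical helper illustrating the idea)
# real_pnls = backtest_subset(candidates, tickers, start, end, BACKEND)
```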
Finally, let's plot the relation between the predicted train and test PnL, as we've previously seen here. The left graph shows the original values from the actual backtests and the right graph shows the predicted behaviour. As we can see, the shapes of the two graphs look very similar, even though we only used a small subset of our backtests for training the model.
This article has shown an efficient way of reducing the number of backtests in a parameter sweep by using a decision tree regressor to predict PnLs. This can save an enormous amount of time in our analysis.
Remember, the code base for this section can be found on GitHub.