How I made an Open-Source AI Hedge Fund
In this post I explain how I created an artificial intelligence that makes automated trades for me daily.
With modern advancements in machine learning and easy access to data online, it’s never been easier to get involved in quantitative trading. To make things even better, cloud tools like AWS make it a breeze to turn trading ideas into real, functional trading bots. For the past couple of years I’ve been messing around with financial data, and I recently decided to take the plunge and build a fully automated strategy.
The Strategy
Since this was my first automated strategy, I decided to keep it fairly simple: I would swing trade SPY (an S&P 500 index ETF) using informed decisions from an ML model. I chose SPY as my asset since it carries relatively low risk in case something went wrong with my strategy.
Goal: Swing trade SPY using informed buy/sell points determined by a machine learning model.
The Data
To make informed decisions about when to trade SPY, I trained an ML model to predict trades based on daily historical data from various financial sectors and US treasuries. I pulled around 20 years of daily data for these assets using the open-source yfinance library.
import yfinance as yf
import pandas as pd
# SPY itself, plus the nine major S&P 500 sector ETFs
SPY_daily = yf.download('SPY')
energy_daily = yf.download('XLE')
materials_daily = yf.download('XLB')
industrial_daily = yf.download('XLI')
utilities_daily = yf.download('XLU')
health_daily = yf.download('XLV')
financial_daily = yf.download('XLF')
consumer_discretionary_daily = yf.download('XLY')
consumer_staples_daily = yf.download('XLP')
technology_daily = yf.download('XLK')
# Real estate, the 10-year treasury yield, and the VIX volatility index
real_estate_daily = yf.download('VGSIX')
TYBonds_daily = yf.download('^TNX')
VIX_daily = yf.download('^VIX')
Here is all the data I pulled for my model, plotted over the past 20 years.
In a Jupyter notebook the data looked like this, with each row representing an individual date.
Feature Engineering
Using the raw historical data I pulled, I then produced additional features for my model. I derived various technical analysis indicators, including simple moving averages, volatility, and the Relative Strength Index (RSI), just to name a few. To make the features more diverse, I calculated these technical indicators with 7, 20, 50, and 200 day windows.
import numpy as np

def SMA(df, feature, window_size):
    # Simple moving average over the given window
    new_col = 'MA' + feature + str(window_size)
    df[new_col] = df[feature].rolling(window=window_size).mean()
    return df

def Volatility(df, feature, window_size):
    # Rolling standard deviation of log returns, scaled to the window length
    new_col = 'VOLATILITY' + feature + str(window_size)
    returns = np.log(df[feature] / df[feature].shift())
    returns.fillna(0, inplace=True)
    df[new_col] = returns.rolling(window=window_size).std() * np.sqrt(window_size)
    return df

def RSI(df, feature, window_size):
    # Relative Strength Index: average gains relative to average losses
    new_col = 'RSI' + feature + str(window_size)
    delta = df[feature].diff()
    delta = delta[1:]
    up, down = delta.clip(lower=0), delta.clip(upper=0)
    roll_up = up.rolling(window_size).mean()
    roll_down = down.abs().rolling(window_size).mean()
    RS = roll_up / roll_down
    df[new_col] = 100.0 - (100.0 / (1.0 + RS))
    return df
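To show how these helpers fit together, here is a minimal sketch applying each indicator to SPY's close price across all four window sizes (full_df and the SPY_Close column name are stand-ins for my actual merged dataframe):

# apply each indicator over every window size
for window in [7, 20, 50, 200]:
    full_df = SMA(full_df, 'SPY_Close', window)
    full_df = Volatility(full_df, 'SPY_Close', window)
    full_df = RSI(full_df, 'SPY_Close', window)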
Data Labelling
Once all my features were derived came the hard part: creating a way to label the data that an ML model could use in my swing trading strategy.
It is pretty well known in finance that it is nearly impossible to predict future prices using historical data alone. Most securities don’t follow any clear statistical distribution, and trying to build a regression model to predict future prices is almost always completely useless.
Instead, it’s best to create categorical labels and use a classification model to predict the probability of certain “events”.
The Triple Barrier Method
The triple barrier method is an intuitive way to label financial data for an ML model to predict a trade outcome. This method is taken from Marcos Lopez de Prado’s book “Advances in Financial Machine Learning” (which I highly recommend).
The triple barrier method works like this: first, decide on a time frame you would be willing to hold your trades; let’s make this 100 trading days. Then, decide an ideal take-profit threshold for an arbitrary trade; let’s make that 1x current market volatility. Finally, set a theoretical stop loss for an arbitrary trade; for demonstration purposes, let’s also make that 1x current market volatility. These three conditions are our “three barriers”.
Now that we have set our barriers, we create labels for SPY trades using the following steps:
1. Take a date in the financial time series that we want to label.
2. Go 100 trading days ahead of our date and create a vertical barrier there. This barrier represents a trade being held too long without stopping out or hitting our take profit.
3. Create a horizontal barrier 1x market volatility above our date's price. This barrier represents a trade hitting our take-profit threshold.
4. Create another horizontal barrier 1x market volatility below our date's price. This barrier represents a trade stopping out.
5. Label our date categorically based on which barrier SPY hits first: the take-profit barrier, the stop-loss barrier, or the vertical barrier (held too long).
The end result for a single date would look something like this:
The above example shows the triple barrier method for a trade opened in February 2017. This trade hit neither the profit threshold nor the stop threshold, and instead hit the vertical barrier first.
For my strategy, I tweaked the triple barrier method to use two labels instead of three. If a theoretical trade at a certain date hit my take-profit threshold, I gave it a label of “profit”, and if it didn’t, I gave it a label of “no-profit”. This made my data suitable for training a binary classification model, which is a lot easier to fine-tune. I also raised my take-profit threshold to 2x current market volatility and shortened my max hold period to only 10 days so my strategy would trade more quickly.
Labeling my data in this way would train my model to only go for trades with more immediate, large payouts.
def get_Daily_Volatility(close, span0=20):
    # simple percentage returns
    df0 = close.pct_change()
    # std of an exponentially weighted window (span of 20 days, about a trading month)
    df0 = df0.ewm(span=span0).std()
    df0.dropna(inplace=True)
    return df0
def get_3_barriers(daily_volatility, prices):
    # container with one row per date
    barriers = pd.DataFrame(columns=['days_passed', 'price', 'vert_barrier',
                                     'top_barrier', 'bottom_barrier'],
                            index=daily_volatility.index)
    # t_final and upper_lower_multipliers are module-level settings defined below
    for day, vol in daily_volatility.items():
        days_passed = len(daily_volatility.loc[daily_volatility.index[0]:day])
        # set the vertical (max holding period) barrier
        if days_passed + t_final < len(daily_volatility.index) and t_final != 0:
            vert_barrier = daily_volatility.index[days_passed + t_final]
        else:
            vert_barrier = np.nan
        # set the top (take-profit) barrier
        if upper_lower_multipliers[0] > 0:
            top_barrier = prices.loc[day] + prices.loc[day] * upper_lower_multipliers[0] * vol
        else:
            top_barrier = np.nan
        # set the bottom (stop-loss) barrier
        if upper_lower_multipliers[1] > 0:
            bottom_barrier = prices.loc[day] - prices.loc[day] * upper_lower_multipliers[1] * vol
        else:
            bottom_barrier = np.nan
        barriers.loc[day, ['days_passed', 'price', 'vert_barrier',
                           'top_barrier', 'bottom_barrier']] = \
            days_passed, prices.loc[day], vert_barrier, top_barrier, bottom_barrier
    return barriers
def get_labels(barriers):
    labels = []
    size = []  # percent gained or lost
    for i in range(len(barriers.index)):
        start = barriers.index[i]
        end = barriers.vert_barrier.iloc[i]
        if pd.notna(end):
            # the initial and final price of the would-be trade
            price_initial = barriers.price[start]
            price_final = barriers.price[end]
            # the take-profit and stop-loss levels for this date
            top_barrier = barriers.top_barrier.iloc[i]
            bottom_barrier = barriers.bottom_barrier.iloc[i]
            # set the profit-taking and stop-loss conditions
            condition_pt = (barriers.price[start:end] >= top_barrier).any()
            condition_sl = (barriers.price[start:end] <= bottom_barrier).any()
            # assign a binary label: 1 if the take-profit barrier was hit, 0 otherwise
            # (condition_sl is unused here because of the two-label tweak described above)
            if condition_pt:
                labels.append(1)
            else:
                labels.append(0)
            size.append((price_final - price_initial) / price_initial)
        else:
            labels.append(np.nan)
            size.append(np.nan)
    return labels, size
# maximum holding period in trading days (sets the vertical barrier)
t_final = 10
# take-profit and stop-loss multipliers of daily volatility
upper_lower_multipliers = [2, 2]
# align the volatility and price indices
vol_df = get_Daily_Volatility(full_df.SPY_Close)
prices = full_df.SPY_Close[vol_df.index]
barriers = get_3_barriers(vol_df, prices)
barriers.index = pd.to_datetime(barriers.index)
labs, size = get_labels(barriers)
full_df = full_df[full_df.index.isin(barriers.index)]
full_df['label'] = labs  # attach the binary labels to the feature dataframe
Modeling
I attempted a lot of different models to make an informed trade prediction. I started by experimenting with LSTM deep neural networks, which I had found very useful in the past for understanding complex time series. However, after running multiple experiments, I could not get a result that did not overfit, with very poor out-of-sample performance.
I eventually decided to go with a model more suited for tabular data: CatBoost. This open source gradient boosting model has a Python library that is super intuitive to use and performs very well.
Data Preparation
In order to use my data in a CatBoost model, it needed to be flattened to contain not only the data for an individual date, but also historical data, so the model could learn which historical trends influence a good trade. I wrote an expand_features function, which expanded each feature out into its percent change against each of the previous 100 data points in the time series.
def percentage_change(initial, final):
    return (final - initial) / initial

def expand_features(full_df):
    window = 100
    new_df = pd.DataFrame()
    for col in full_df.columns:
        print(col)  # progress indicator
        if not col.startswith('label'):
            # add the percent change against each of the previous `window` rows
            column = full_df[col]
            for i in range(1, window):
                shifted = column.shift(i)
                new_df['Shifted' + str(i) + col] = percentage_change(shifted, column)
        else:
            # carry label columns through unchanged
            new_df[col] = full_df[col]
    return new_df
full_df = expand_features(full_df)
The data being fed into my CatBoost model now looked like this, where each column is expanded out to contain the percent change from x days in the past. 20,301 unique features were now being fed into the model, representing a wide range of historical market indicators.
CatBoost Model
The data was then fit into the following model to predict if buying SPY on each date would have produced a profitable trade.
from catboost import CatBoostClassifier

SEED = 42  # any fixed seed, for reproducibility

classification_params = {'loss_function': 'Logloss',
                         'eval_metric': 'AUC',
                         'early_stopping_rounds': 2,
                         'verbose': 200,
                         'random_seed': SEED}

model = CatBoostClassifier(**classification_params)
model.fit(X_train, Y_train,
          eval_set=(X_test, Y_test),
          use_best_model=True)
Model Analysis
Cross Validation
To validate my model, I split my dataset into a train set and an “out of sample” test set. The train set consisted of data from March 2000–August 2020, and the test set consisted of data from August 2020–November 2022.
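A chronological split like this is a one-liner with a DatetimeIndex. As a minimal sketch (the exact cutoff date here is an assumption):

# split chronologically so the test set stays strictly out of sample
cutoff = '2020-08-01'  # hypothetical cutoff within August 2020
train_df, test_df = full_df.loc[:cutoff], full_df.loc[cutoff:]
X_train, Y_train = train_df.drop(columns=['label']), train_df['label']
X_test, Y_test = test_df.drop(columns=['label']), test_df['label']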
KFold cross-validation was used to gauge model performance on my training data. The training data was split into 5 segments; the model was then trained on 4 of the 5 segments while the remaining segment was used to validate and fine-tune performance, rotating through all 5 combinations (sketched below).
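Here is a minimal sketch of that loop using scikit-learn's KFold and the classification_params defined in the modeling section (my exact setup may have differed slightly):

from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)  # no shuffling, to respect time ordering
cv_scores = []
for train_idx, val_idx in kf.split(X_train):
    # train a fresh model on 4 segments, validate on the held-out segment
    fold_model = CatBoostClassifier(**classification_params)
    fold_model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx],
                   eval_set=(X_train.iloc[val_idx], Y_train.iloc[val_idx]),
                   use_best_model=True)
    val_probs = fold_model.predict_proba(X_train.iloc[val_idx])[:, 1]
    cv_scores.append(roc_auc_score(Y_train.iloc[val_idx], val_probs))
print('Mean CV ROC AUC:', sum(cv_scores) / len(cv_scores))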
Model Performance
Once I was confident in my model architecture, I trained a final model on the entire training dataset and tested out-of-sample performance on my test data.
The results were much better than I expected, with a test ROC AUC of 0.69. While this is not an amazing ROC score, I was very impressed to see this kind of result with financial data. Financial data is incredibly stochastic, and outcomes are difficult to predict; anything better than a coin flip can give a trading strategy an edge.
The model did appear incredibly overfit, with a training ROC AUC of 0.98, far above the test performance of 0.69. To be honest, I was not too concerned with this overfitting, since the model clearly produced an edge and appeared far better than guessing. However, I think there are definitely more opportunities to generalize this model and prevent overfitting. This could be done by incorporating more data, or by using a bagging ensemble such as a random forest to fit my data.
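For reference, these ROC AUC scores take only a few lines to compute with scikit-learn:

from sklearn.metrics import roc_auc_score

probabilities_train = model.predict_proba(X_train)[:, 1]
probabilities_test = model.predict_proba(X_test)[:, 1]
print('Train ROC AUC:', roc_auc_score(Y_train, probabilities_train))
print('Test ROC AUC:', roc_auc_score(Y_test, probabilities_test))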
Feature Analysis
The feature importance of my model was incredibly interesting to look at. The model mainly prioritized volatilities from various financial sectors to predict whether buying SPY on any given date would produce a winning trade. For whatever reason, features based on the energy sector consistently contributed heavily to my model.
The fact that my model prioritized volatility made perfect sense: winning trades are characterized by large upward movements in our asset, and market volatility is a great way to anticipate large movements.
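Pulling these importances out of CatBoost is straightforward; a minimal sketch:

# rank the expanded features by CatBoost's built-in importance scores
importances = model.get_feature_importance(prettified=True)
print(importances.head(20))  # in my case, dominated by sector volatility features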
Summary
I had now created a model that takes in historical financial data for a given date and predicts the probability that buying SPY will produce a significantly profitable trade within the next 10 days.
Input: Historical financial information given a date in time
Output: Probability that buying SPY will produce a significantly profitable trade within the next 10 days
Backtesting
Now that I had a functional model to act as the predictive engine of our strategy, I could get to the really fun part: figuring out how well the swing trading strategy would perform.
Setting an Optimal Threshold
Before I tested the model in action, I first needed to determine what probability output by the model should trigger a trade.
I could keep it simple and just say “buy SPY if the model produces any probability larger than 50%”. However, this is usually not optimal for a binary classification model, since we have to weigh the risk of both false positives and false negatives. Instead, I determined the optimal threshold by finding which output probability produced the largest F1 score on our training data. Luckily, this was incredibly easy in Python! The optimal trading threshold ended up being 0.46.
from sklearn.metrics import precision_recall_curve
import numpy as np

# precision/recall curve over the training data
# (probabilities_train holds the model's predicted profit probabilities)
precision_train, recall_train, pr_thresholds_train = \
    precision_recall_curve(Y_train, probabilities_train)
# F1 score at every candidate threshold
fscore_train = 2 * (precision_train * recall_train) / (precision_train + recall_train)
# find the optimal threshold on the training PR curve
ix = np.argmax(fscore_train)
optimal_threshold = pr_thresholds_train[ix]
Performing a Backtest
I then wrote a function in Python to determine how profitably our strategy would perform on out-of-sample data.
The strategy involved swing trading SPY, and looked like this:
- Pull down financial data for a specific date, produce necessary features, and feed them into our model.
- Buy SPY if our model produced a probability larger than 46%
- Set up our trade to have a take profit of 2x current market volatility, and set a reasonable stop loss to limit risk.
To determine how many shares to buy for a given trade, I simply took our current portfolio balance and divided it by 5 times the current market price of SPY. This would allow us to have around 5 trades open at any given time without running out of cash, as sketched below.
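Here is a rough sketch of the entry and sizing rules together (these helpers are illustrative, not my actual backtest code):

def position_size(balance, spy_price, max_positions=5):
    # split the account balance across roughly 5 concurrent trades
    return int(balance / (max_positions * spy_price))

def should_enter(model, features, threshold=0.46):
    # enter only when the predicted profit probability clears the optimal threshold
    return model.predict_proba(features)[0, 1] > threshold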
The backtest performed incredibly well on out-of-sample data, producing a return of 50% within our time frame, compared to an overall market return of only 12%. Our strategy also produced consistent returns with little drawdown.
I also calculated some common statistics to evaluate our strategy performance over our test time frame using a theoretical starting balance of $10,000.
Total Returns: 50.97%
Total Trades: 284
Total Net Profit: $5107.70
Profit Factor: 2.67
Percent of Profitable Trades: 46.47%
Average Trade Net Profit: $17.98
Max Drawdown: $-554.87
Surprisingly, only 46.47% of our strategy’s trades were profitable. However, since the model is trying to predict large upward movements in the stock, the winning trades significantly make up for the losing trades. Even with a large share of losing trades, an average net profit of $17.98 was produced per trade.
Infrastructure
To put my strategy into use, I used the AWS Cloud Development Kit (CDK) to create the infrastructure to host it. The strategy was then set up to interact with Alpaca (https://alpaca.markets/), an API for stock and crypto trading.
To begin, I stored my model in Amazon S3, AWS's simple cloud storage service. Then, I created 2 separate Lambda functions: a buying function and a selling function. The buying function runs at the end of each trading day, generates a model prediction of whether or not to make a trade, and sets up trades using the Alpaca API. The selling function constantly scans open trades to see if they have hit a threshold to close them: the stop loss, the profit threshold, or a holding period longer than 10 days. Finally, I created two DynamoDB tables: a trades table, which tracks currently open trades, and a historical trades table to track closed trades.
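For illustration, here is a trimmed-down sketch of what the buying Lambda's handler could look like with the alpaca-trade-api package (the prediction helper is hypothetical, and the real function also records the trade in DynamoDB):

import os
import alpaca_trade_api as tradeapi

OPTIMAL_THRESHOLD = 0.46

def lambda_handler(event, context):
    # connect to Alpaca's paper trading endpoint with credentials from the environment
    api = tradeapi.REST(os.environ['ALPACA_KEY'],
                        os.environ['ALPACA_SECRET'],
                        base_url='https://paper-api.alpaca.markets')
    # get_model_prediction is hypothetical: load the model from S3,
    # build today's features, and return the profit probability
    probability = get_model_prediction()
    if probability > OPTIMAL_THRESHOLD:
        api.submit_order(symbol='SPY', qty=1, side='buy',
                         type='market', time_in_force='day')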
Tying it all Together
Fast forward to the present day: I have been paper trading my strategy in the Alpaca brokerage for about 6 months. It has been performing quite similarly to the backtest results, generating consistent profits with little drawdown, and is up 6% since inception. It will be interesting to see how it continues to perform over time.
There are a multitude of ways I’d like to improve my strategy in the future. For one, I would love to set up a fully functional ML pipeline that could update my model daily. I would also like to create more monitoring tools to understand daily model predictions. Finally, I would love to incorporate more models, particularly one that could dynamically size my trades. Eventually, I would also like to expand into different assets.
For now, I have a fully functional AI trading fund on my hands, which I think is pretty exciting. And the best part is that it runs completely free: I am using open-source data and completely free cloud tools, and the Alpaca API is also free to use at the time of writing.
I hope you enjoyed this article, and please feel free to steal my ideas!