There are two ways to predict a stock. The first is a regression model that predicts the actual price some amount of time into the future; this is what you usually see graphed against the “actual value” (mostly the test set vs. the prediction), and the prediction curve typically tracks the actual curve closely, which is the beauty of deep neural networks. A regression model is supposed to tell you the price of the stock on day X, and it will likely be off by some amount, above or below; it’s hard to get it exactly right unless you intentionally or unintentionally overfit the model to get high accuracy. The other is a classifier, which could for example predict from the Open value, the stock’s opening price, whether it’s going to go up or down by the end of the day.
The stock data you can retrieve from a site like Yahoo Finance usually contains the following fields: Date, Open, High, Low, Close and Volume. There can also be Adj Open, Adj High, Adj Low and Adj Close. These are useful because of stock splits: when a company feels its price is considered too high and wants to attract more investors, it can split the stock so that every share becomes 2 shares, or even 4, dropping the price of each share to 1/2 or 1/4 respectively, while each shareholder’s total investment remains unchanged. A split increases the total volume of shares but drops the Close value, which on its own might look like a drop in the stock’s worth, so Adj Close adjusts the price for these changes in the number of shares.
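A quick worked example of a 2-for-1 split, with made-up numbers:

```python
# Hypothetical 2-for-1 split: the shareholder's total value is unchanged
shares = 100
price = 50.0                      # pre-split share price
post_shares = shares * 2          # every share becomes two
post_price = price / 2            # so the price halves
assert shares * price == post_shares * post_price  # total investment unchanged

# Adj Close back-adjusts pre-split history by the split factor
# so the price series stays comparable across the split
historical_close = 40.0
adj_close = historical_close / 2
print(post_price, adj_close)  # 25.0 20.0
```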
One way to get a model to understand time-series data like this, and keep some memory of the trend, is an LSTM (Long Short-Term Memory) model. I haven’t used that model; I started with a simple CNN (Convolutional Neural Network) and it works just fine with high prediction accuracy. I’ll highlight its issues though, along with a comparison against scikit-learn models.
Starting with the bitcoin data
import pandas as pd

df = pd.read_csv('data/BTC-USD.csv')
df.head()
What you feed a model are the data and the target to predict. My intended model will only predict whether the price goes up by the end of the day, i.e. the Open value vs. the Adj Close value. Some of the fields aren’t useful to the model, like the Date field. I also can’t give it the Close or Adj Close value, because that’s basically cheating, nor the Volume, High and Low, since it’s only logical to assume those values are only known by the end of the day.
But that leaves only the Open value, and you can’t feed the model just that plus the target (was Open > Close or vice versa, 0 or 1) and expect it to learn easily. So with some tinkering I create additional fields that look into the past, to feed the model some extra information on top of the Open value.
This creates, from the top:

Open 2: the Open column shifted by 1, so you can compare yesterday’s open price to today’s.
Max 7: the maximum of the Open column over the past 7 days.
Min 7: the minimum over the same window.
Change: the difference Open - Open 2.
Mean Change 7: the mean of the Change column over the past 7 days. All the rolling fields use a 7-day window because of the 7 passed to rolling; it could be more or less, but my assumption is that if you know the stock’s activity over the past 7 days well enough, you can tell whether things are going OK or not.
Drop 7: the number of times the Change field was negative over the past 7 days, i.e. days with a dip compared to the previous day’s Open value.
Up 7: the opposite.
Actual: 1 if Adj Close increased over Open, otherwise 0. The 0 case includes the rare scenario where the open and close price are exactly the same; the point is to be blunt and only flag opportunities for profit, or their absence.
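A sketch of how these columns could be built with pandas; the column names match the ones above, but the toy values and the exact construction are assumptions, not the original code:

```python
import pandas as pd

# Toy frame standing in for the downloaded BTC-USD data (made-up values)
df = pd.DataFrame({
    'Open':      [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 15.0, 16.0, 14.0, 15.0],
    'Adj Close': [11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 15.0, 14.0],
})

df['Open 2'] = df['Open'].shift(1)                    # yesterday's open
df['Max 7'] = df['Open'].rolling(7).max()             # 7-day high of Open
df['Min 7'] = df['Open'].rolling(7).min()             # 7-day low of Open
df['Change'] = df['Open'] - df['Open 2']              # day-over-day change
df['Mean Change 7'] = df['Change'].rolling(7).mean()
df['Drop 7'] = (df['Change'] < 0).astype(int).rolling(7).sum()  # down days
df['Up 7'] = (df['Change'] > 0).astype(int).rolling(7).sum()    # up days
df['Actual'] = (df['Adj Close'] > df['Open']).astype(int)       # target

df.dropna(inplace=True)  # the first rows lack a full 7-day history
```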
Training the model
First dropping the unnecessary fields
df.drop(['Date', 'High', 'Low', 'Close', 'Volume', 'Open 2', 'Predict', 'Prediction', 'Adj Close'], axis=1, inplace=True)
df.head()
From the top, this imports Keras and some useful scikit-learn libraries. Next, x_y splits the dataframe into X and y, and then scikit-learn’s train_test_split splits those into training and test sets. As an FYI, when I used shuffle=False, so the data isn’t shuffled, I got an accuracy of 1.0, i.e. 100%, which looks like serious overfitting; with shuffling it gets 97% accuracy. The model also uses EarlyStopping, which helps avoid overfitting: when there’s no improvement for 25 epochs after the last best val_loss, it stops training and keeps the weights from that best epoch.
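A minimal sketch of this setup; since the Keras code isn’t reproduced here, this stand-in uses scikit-learn’s MLPClassifier, whose built-in early stopping (early_stopping, n_iter_no_change) plays the role of the EarlyStopping callback, and synthetic data in place of the real feature columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the engineered features (7 columns, like the real frame)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # a learnable toy signal

# Shuffled split, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# early_stopping=True holds out a validation slice and stops after
# n_iter_no_change epochs without improvement (patience 25, as above)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), early_stopping=True,
                    n_iter_no_change=25, max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```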
This is a really nice, high accuracy which, if it holds true, would imply you could use this to predict whether you should expect the stock to perform well or not.
I also did some additional testing using Apple and Amazon stock records: after applying the same preprocessing and adding the same fields as for the bitcoin data, the bitcoin predictor would rarely score above 50% accuracy on them, getting 45% on the first try. I then trained another predictor on the Apple data, which gets around 99% accuracy on its own test data, but trying it on the Amazon stock returns a similarly low score. So the predictor isn’t universal; it specializes in the stock data it was trained on. I don’t think this is proof of overfitting but rather how current neural network models work: they’re great at specializing, not at generalizing. As an example, a model trained to classify pictures that always face upwards would have an issue when the pictures start coming in upside down.
I also tried this with the scikit-learn classifier I’ve found to have the highest accuracy, RandomForestClassifier, below. It gets an accuracy of 93%, which is not as high as the CNN model but still pretty close, and trying it on data from a different stock’s dataframe likewise performs poorly.
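The classifier setup would look something like this; the hyperparameters and the synthetic data are assumptions, with the real X and y coming from the engineered dataframe:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))
y = (X[:, 0] > 0).astype(int)  # toy target the forest can pick up

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```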
Proof of Concept
The first part of proving this is not an overfit model that mainly got lucky is looking at the distribution of the expected answers, 1 or 0:
df['Actual'].value_counts()
>>> 1    1612
    0    1341
    Name: Actual, dtype: int64
These are almost equal, so a model making blind guesses would likely average around 50% accuracy. On the other hand, if the distribution were 80%-20%, then always guessing the majority category alone would assure a high score. So the 93% accuracy of the RandomForestClassifier must be good, provided it hasn’t overfit.
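The baseline claim checks out directly from the counts above:

```python
# Class counts from the value_counts output above
counts = {1: 1612, 0: 1341}
total = sum(counts.values())  # 2953 samples

# A blind majority-class guesser only scores its class's share
majority_baseline = max(counts.values()) / total
print(round(majority_baseline, 3))  # 0.546
```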
The stock market, in my opinion, is a semi-chaotic system. It’s well known that there are chaotic systems with some order inside them, where small variations in initial conditions lead to totally different results, so that running the same model with even minutely different starting conditions can produce vastly different outcomes; the underlying mechanics may be well understood, yet accurate prediction remains impossible. This is why I’m wary of any model that supposedly predicts well into the future: even with a good understanding of the relationship between the data and the outcome, and given all the necessary variables, a slight change in the real-world data not captured in the prediction model can make the graphs diverge easily.
The main thing a good stock predictor, like a good human advisor, should tell you is whether to buy or sell, and more importantly to decide between the two from only a small window into the past. To prove this model didn’t simply memorize the data or a pattern, I created a script to query the latest bitcoin value every 30 seconds and save the data, so I can use this fresh data to confirm the final accuracy.
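A minimal sketch of such a polling script; the price API endpoint and the response shape here are assumptions, not necessarily what was actually used:

```python
import csv
import json
import time
from datetime import datetime
from urllib.request import urlopen

# Assumed public price endpoint; swap in whichever quote API you prefer
PRICE_URL = ('https://api.coingecko.com/api/v3/simple/price'
             '?ids=bitcoin&vs_currencies=usd')

def make_row(timestamp, price):
    """One CSV row: a formatted timestamp plus the quoted price."""
    return [timestamp.strftime('%Y-%m-%d %H:%M:%S'), price]

def poll(path='data/bitcoin-live.csv', interval=30):
    """Append the latest BTC price to a CSV every `interval` seconds."""
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        while True:
            with urlopen(PRICE_URL) as resp:
                price = json.load(resp)['bitcoin']['usd']
            writer.writerow(make_row(datetime.now(), price))
            f.flush()  # persist each row as it arrives
            time.sleep(interval)

if __name__ == '__main__':
    poll()  # Ctrl+C to stop once enough rows are collected
```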
I let it run for a while and after it got enough info terminated the script.
df.shape
>>> (261, 2)

df.head()
Applying the same preprocessing creates the necessary columns:
df.dropna(inplace=True)
df['Actual'].value_counts()
>>> 0    134
    1    120
    Name: Actual, dtype: int64
This also has an almost equal distribution of the values. Instead of retraining, I had already saved the last classifier, so I simply loaded it again and made a prediction on the whole set of rows.
from sklearn.externals import joblib  # in newer scikit-learn, just: import joblib
joblib.dump(clf, 'data/bitcoin-predictor.pkl')
X = df[['Value', 'Max 7', 'Min 7', 'Change', 'Mean Change 7', 'Drop 7', 'Up 7']].values
y = df['Actual'].values
from sklearn.externals import joblib  # in newer scikit-learn, just: import joblib
clf = joblib.load('data/bitcoin-predictor.pkl')
clf.score(X, y)
>>> 0.9448818897637795
With an accuracy of 94% on fresh, previously unseen data, this shows that this model, and machine learning models in general, can pick up on underlying patterns in almost-random behaviour.
Getting back some returns
Someone commented suggesting that returns are the best way to prove it works. I had created a stock_game where you pick from a list of stocks like bitcoin, Apple or Amazon and, with a player, see how well you perform; you can start from day 1 and run to the end of the stock’s history, or pass a start_date so you don’t begin when the value was too low, which would guarantee success regardless of your stock logic. Using this predictor, I created player_3 to play the stock_game for bitcoin. Mind you, the best I’ve got with my hard-coded player_2 is $20.5 million, not bad from day 1 in 2010. As I stated in the linked article, you could say bitcoin matured somewhere in 2017, and since then the market has become volatile; if you started in 2018, making a profit is practically impossible. If the value takes a huge dive, nothing you do will recover your initial investment; the best you can hope for is to withdraw some, keep some, and hope it climbs back past your initial investment value. The more it drops through these cycles, the more you lose. It’s simple economics.
The player uses the predictor to get a prediction of an expected rise or drop, along with the signal I used previously to predict dips and rises, converted to an absolute positive number. Since the higher that number, the stronger the expected good or bad trend, its magnitude determines how much to withdraw or invest, as a fraction of what’s already invested or of the cash on hand. The game starts with an initial amount of $1000.
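A sketch of that withdraw/invest rule; the scaling of the signal into a fraction is an assumption, since the actual game logic isn’t shown:

```python
def decide(prediction, signal, invested, cash):
    """Hypothetical sketch of the game's move logic.

    prediction: 1 = expect a rise, 0 = expect a drop
    signal: absolute strength of the expected move (bigger = stronger)
    Returns a positive amount to invest or a negative amount to withdraw.
    """
    fraction = min(signal / 10.0, 1.0)  # assumed scaling, capped at 100%
    if prediction == 1:
        return fraction * cash          # invest part of the cash on hand
    return -fraction * invested         # withdraw part of the position

# The game starts with $1000 on hand and nothing invested
cash, invested = 1000.0, 0.0
move = decide(prediction=1, signal=5.0, invested=invested, cash=cash)
cash, invested = cash - move, invested + move
print(cash, invested)  # 500.0 500.0
```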
Just to note: some of this data was part of the training data, but not all of it; the data was shuffled and 20% was reserved as test data, and I also showed the model did well on completely new data. In any case, that’s what you hope it does anyway: learn.
First, from day 1:
player_3(output=False, plot=True)
>>> You have 727.2 invested, you want to withdraw 1027.948007299304 which exceeds the amount available, try again with a smaller amount
    Reached the final day, game ending day: 8/6/2018
    profit: 0
    loss: 258375.57425815985
    amount: 232721919.09876847
    invested: 7519133.58425816
This makes a total net of about $240 million from the initial $1000: $232 million withdrawn and $7 million still invested in bitcoin. You can see the graph below with the investment curve, the amount curve, and the bitcoin value in USD, which looks almost like a straight line compared to the large values of the other two lines.
As usual, I also started it from far later, at the start of 2017.
player_3(output=False, plot=True, start_date='1/1/2017')
>>> Reached the final day, game ending day: 8/6/2018
    profit: 0
    loss: 4.989760121705757
    amount: 4496.844152819358
    invested: 145.28976012170577
This makes far less: from the same initial $1000 to almost $4500. As you can see in the graph below, the bitcoin value, which starts higher than the initial investment, is now never overtaken by the amount the player makes, and as should be expected there’s a high correlation between making wise investments and the state of the market.
The last run starts from 2018 and, to no surprise for me, ends with less: $660 in the amount and only $20 invested in bitcoin.
player_3(output=False, plot=True, start_date='1/1/2018')
>>> Reached the final day, game ending day: 8/6/2018
    profit: 0
    loss: 0.734349262121448
    amount: 660.2833935348347
    invested: 21.33434926212145
If someone tells you bitcoin is doing well, show them this chart from the beginning of the year.