STOCK PRICE PREDICTION AND FORECASTING VIA OBJECT-ORIENTED PROGRAMMING
Time series analysis has a wide range of application areas, among which the finance, health, and insurance sectors stand out. In finance, predicting and forecasting stock prices has been at the center of time series applications, because a properly modeled stock price can be used as an input to risk management, pricing, and so on. To accomplish this task, deep learning algorithms have been developed alongside the traditional time series models to further improve modeling accuracy.
Thus, the aim of this post is twofold:
- Predicting and forecasting stock prices via selected time series models: After training each model, I run predictions to see how well it works. Finally, multi-step stock price forecasting is done to see the future pattern of the stock price.
- Using object-oriented programming to select the best time series model based on a selected performance metric. This approach enables a layman to run a time series application smoothly. More specifically, without any prior programming skill, one can run the algorithm simply by deciding on the inputs (parameters) and interpret the results accordingly.
Models that I employed here are:
* ARMA
* SARIMA
* LSTM
Let’s briefly talk about the time series models without going deep into the math. The first model is the Autoregressive Moving Average, known as ARMA.
ARMA has two parts: Autoregressive and Moving Average. As its name suggests, in the autoregressive part, the time series is regressed on its own lagged values:
$y_t = a + \sum_{i=1}^{p} \theta_i y_{t-i} + \epsilon_t$
where a, θ, and 𝜖 represent the constant term, slope coefficients, and error term, respectively. The implicit assumption here is that lagged values have an impact on the current value of the time series.
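To make the autoregressive idea concrete, here is a minimal sketch in plain Python that simulates a series with one lag and then recovers the constant and slope by regressing each value on the previous one. The helper names `simulate_ar1` and `fit_ar1` and the chosen coefficients are assumptions made for illustration; in practice a library such as statsmodels does the estimation.

```python
import random


def simulate_ar1(a, theta, n, noise_sd=1.0, seed=0):
    """Simulate y_t = a + theta * y_{t-1} + eps_t, starting from y_0 = 0."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(a + theta * y[-1] + rng.gauss(0.0, noise_sd))
    return y


def fit_ar1(y):
    """Ordinary least squares of y_t on y_{t-1}; returns (constant, slope)."""
    x, target = y[:-1], y[1:]
    mean_x = sum(x) / len(x)
    mean_t = sum(target) / len(target)
    sxy = sum((xi - mean_x) * (ti - mean_t) for xi, ti in zip(x, target))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_t - slope * mean_x, slope


series = simulate_ar1(a=2.0, theta=0.6, n=5000)
a_hat, theta_hat = fit_ar1(series)
print("estimated constant: %.3f, estimated slope: %.3f" % (a_hat, theta_hat))
```

With a long enough series the estimates should land close to the values used in the simulation, which is what "regressed on its own lagged values" means in practice.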
The Moving Average part is basically a weighted average of the past error terms of the time series. So, this part captures how earlier errors (shocks) carry over into the current value.
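Written out, a moving-average part of order q and the combined ARMA(p, q) model might look as follows. The moving-average coefficients are written here as \phi_j, which is an assumed symbol since the original formula is not available; a, \theta, and \epsilon follow the definitions given for the autoregressive part.

```latex
% MA(q): the current error plus a weighted sum of past errors
% (\phi_j is an assumed symbol for the moving-average weights)
y_t = a + \epsilon_t + \sum_{j=1}^{q} \phi_j \epsilon_{t-j}

% ARMA(p, q): autoregressive and moving-average parts combined
y_t = a + \sum_{i=1}^{p} \theta_i y_{t-i} + \epsilon_t + \sum_{j=1}^{q} \phi_j \epsilon_{t-j}
```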
A Seasonal Autoregressive Integrated Moving Average model, SARIMA for short, additionally tries to capture the seasonal component of the time series. So, you can think of it as an extension of the ARMA model.
In the SARIMA equation, 𝜇 denotes the drift term.
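To make the "seasonal" and "integrated" parts more concrete, the sketch below applies ordinary and seasonal differencing to a toy series. Differencing is what the integrated part of SARIMA does, at lag 1 and at the seasonal lag. The helper name `difference` and the toy data are illustrative assumptions.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series: a linear trend plus a pattern that repeats every 4 steps.
season = [0.0, 2.0, -1.0, 1.0]
series = [0.5 * t + season[t % 4] for t in range(12)]

# Differencing at the seasonal lag removes the repeating pattern;
# what remains is the trend gained over one full season.
seasonal_diff = difference(series, lag=4)

# One more ordinary difference removes what is left of the trend.
stationary = difference(seasonal_diff, lag=1)

print(seasonal_diff)
print(stationary)
```

After both steps the toy series is flat, which is the kind of input the autoregressive and moving-average terms are designed for.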
The tricky part in these models is finding the hyperparameters that give the best-fitting model. By convention, the hyperparameters of ARMA are p (the autoregressive order) and q (the moving-average order); SARIMA adds the differencing order d, plus a seasonal counterpart of each order and the length of the season. Thus, the generalized versions take the following forms: ARMA(p,q) and SARIMA(p,d,q)(P,D,Q)s.
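The search over these hyperparameters can be sketched as a plain grid. The ranges below mirror the ones used in the function calls later in this post, and the season length of 12 matches the value hard-coded in the SARIMA function; the variable name `sarima_grid` is an illustrative assumption.

```python
import itertools

p = range(0, 2)  # autoregressive orders to try: 0, 1
d = range(0, 2)  # differencing orders to try: 0, 1
q = range(0, 1)  # moving-average orders to try: 0

# ARMA candidates: every (p, q) pair.
pq = list(itertools.product(p, q))

# SARIMA candidates: every (p, d, q) triplet, paired with every seasonal
# (P, D, Q, s) quadruple built from the same ranges and a season length of 12.
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(P, D, Q, 12) for (P, D, Q) in pdq]
sarima_grid = list(itertools.product(pdq, seasonal_pdq))

print(pq)                # [(0, 0), (1, 0)]
print(len(sarima_grid))  # 16 candidate models per stock
```

The grid grows multiplicatively, so even slightly wider ranges quickly mean many more model fits per stock.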
Long Short-Term Memory, abbreviated as LSTM, is a deep learning method based on recurrent neural networks. Like any recurrent network, it processes a sequence step by step, passing information forward as it goes. The difference lies in the operations within the LSTM’s cells, whose gates decide what to keep and what to forget. It is often regarded as a non-parametric method.
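To give a feel for the operations inside a cell, here is a minimal single-unit LSTM step written in plain Python. It is only a sketch of the standard gate equations with scalar weights; the function name `lstm_step` and the weight layout are assumptions made for illustration, and real libraries such as Keras work with weight matrices instead.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to let in
    g = math.tanh(pre("candidate"))  # the new information itself
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c = f * c_prev + i * g           # updated cell state (long-term memory)
    h = o * math.tanh(c)             # updated hidden state (short-term output)
    return h, c


# Run the cell over a short sequence with arbitrary illustrative weights.
gates = ("forget", "input", "candidate", "output")
weights = {gate: (0.5, 0.1, 0.0) for gate in gates}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The cell state `c` is what lets the network carry information across many time steps, while the gates control how that state is updated at each step.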
Combining these methods gives us a chance to compare the performance of time series modeling with machine learning and with deep learning.
ARMA, SARIMA, and LSTM Implementation for Stock Price Prediction and Forecasting
Let’s start running estimation, prediction, and forecasting with each of these models separately. The libraries that we are going to use throughout this post are as follows:
import math
import itertools
import datetime
from datetime import timedelta
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
import statsmodels.api as sm
# ARMA lives in the legacy arima_model module, which statsmodels 0.13 dropped in
# favor of statsmodels.tsa.arima.model.ARIMA, so this post needs an older release.
from statsmodels.tsa.arima_model import ARIMA, ARMA
from statsmodels.tsa.stattools import adfuller
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Flatten

warnings.filterwarnings("ignore")
I am going to use the Yahoo Finance API to extract stock prices. To do that, we need to find the ticker of the company of interest and determine the time period. The ten stocks with the highest market value listed in the S&P 500 are selected for this analysis. These stocks are:
- Apple
- Microsoft
- Alphabet (Google)
- Amazon
- Facebook
- JP Morgan
- Visa
- Johnson & Johnson
- Wal-Mart
- Bank of America
To be able to extract the stock prices, I need the tickers, which are provided below.
stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'FB', 'JPM', 'V', 'JNJ', 'WMT', 'BAC']
start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2019, 1, 9)
stock_prices = yf.download(stocks, start=start, end=end, interval='1d')
The closing price is selected; the first 70% of the data is stored as the training set and the remaining 30% is kept as the test set.
stock_prices = stock_prices['Close']
df = stock_prices
split = int(df.shape[0] * 0.7)
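As a small illustration of what `split` is used for, the sketch below cuts a toy price list at the 70% mark without shuffling, so that the test set always lies after the training set in time. The helper `chronological_split` is a hypothetical name introduced here, not part of the original code, and the prices are made up.

```python
def chronological_split(values, train_fraction=0.7):
    """Split a sequence into (train, test) at a fixed position, keeping time order."""
    split = int(len(values) * train_fraction)
    return values[:split], values[split:]


prices = [100.0, 101.5, 99.8, 102.3, 103.1, 104.0, 102.9, 105.2, 106.0, 107.4]
train, test = chronological_split(prices)

print(len(train), len(test))
print(test)  # the most recent prices
```

Keeping the order matters for time series: a random split would let the model see prices from after the dates it is asked to predict.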
To start with, the ARMA model is employed. You will find the graph including prediction and forecasting as an output. The parameters defined are:
- number of forecast steps (nstep) is 30
- p is in between 0 and 2
- q is in between 0 and 1
- plot_result is 1 if you want to have the plot as an output.
def arma(df, nstep, p, q, plot_result):
    """Autoregressive Moving Average"""
    name = df.name  # ticker of the series, used in the plot labels
    df_truth = df[split:]
    pq = list(itertools.product(p, q))
    rows = []
    for param in pq:
        mod = ARMA(df, order=param)
        results = mod.fit()
        pred_arma = results.predict(start=split, dynamic=False)
        rmse_arma = math.sqrt(((pred_arma - df_truth) ** 2).mean())
        forecast = results.forecast(steps=nstep)[0]
        rows.append([param, results.aic, np.array(pred_arma), np.array(df_truth), rmse_arma, forecast])
    AIC_list_arma = pd.DataFrame(rows, columns=['pram', 'AIC', 'Pred', 'df_truth', 'rmse', 'forecast'])
    # Keep the (p, q) order with the lowest RMSE on the test period
    best = AIC_list_arma.loc[AIC_list_arma['rmse'].astype(float).idxmin()]
    date_rng = pd.date_range(start=df_truth.index[-1], end=df_truth.index[-1] + timedelta(nstep - 1),
                             freq='D')
    if plot_result:
        plt.figure(figsize=(5, 6))
        plt.plot(date_rng, best['forecast'], label='%s ARMA Forecast' % name)
        plt.plot(df_truth.index, best['Pred'], label='%s ARMA Prediction' % name)
        plt.plot(df_truth.index, best['df_truth'], label='%s ARMA Actual' % name)
        plt.legend()
        plt.show()
    print("ARMA RMSE: %.4f" % best['rmse'])
To call the function, we run the following code:
for i in df.columns:
    arma(df[str(i)], 30, p=range(0, 2), q=range(0, 1), plot_result=1)
ARMA: Wal-Mart Stock Price Prediction/Forecasting
To save space, I include just one visualization, the one for Wal-Mart. The blue line denotes the forecast and the orange line the prediction. Both eyeballing the plot and the RMSE suggest that ARMA works well for predicting/forecasting stock prices.
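RMSE is easier to judge next to a reference point. The sketch below computes it in plain Python and compares a set of made-up predictions against a naive forecast that simply repeats the previous day's price. The numbers and the helper names `rmse` and `naive_forecast` are assumptions for illustration only, not results from this post.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def naive_forecast(series):
    """Predict each value as the one observed just before it."""
    return series[:-1]


# Made-up closing prices and made-up model predictions for days 2 to 6.
prices = [100.0, 102.0, 101.0, 103.0, 104.0, 106.0]
actual = prices[1:]
model_pred = [101.0, 101.5, 102.0, 104.5, 105.0]

print("model RMSE: %.4f" % rmse(actual, model_pred))
print("naive RMSE: %.4f" % rmse(actual, naive_forecast(prices)))
```

If a model's RMSE is not clearly below the naive one, a good-looking plot may say more about how slowly prices move from day to day than about the model itself.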
SARIMA is the second model that I use as a time series forecasting tool. The parameters defined are:
- number of forecast steps (nstep) is 30
- p is in between 0 and 2
- q is in between 0 and 1
- d is in between 0 and 2
- plot_result is 1 if you want to have the plot as an output.
def sarima(df, nstep, p, d, q, plot_result):
    """Seasonal Autoregressive Integrated Moving Average"""
    name = df.name  # ticker of the series, used in the plot labels
    # All (p, d, q) triplets, and the seasonal triplets with a season length of 12
    pdq = list(itertools.product(p, d, q))
    seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    df_truth = df[split:]
    rows = []
    for param in pdq:
        for param_seasonal in seasonal_pdq:
            mod = sm.tsa.statespace.SARIMAX(df,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            pred = results.get_prediction(start=split, dynamic=False)
            prediction = pred.predicted_mean
            fore = results.get_forecast(steps=nstep)
            forecast = fore.predicted_mean
            rmse_sarima = math.sqrt(((prediction - df_truth) ** 2).mean())
            rows.append([param, results.aic, np.array(prediction), np.array(df_truth), rmse_sarima,
                         np.array(forecast)])
    AIC_list_sarima = pd.DataFrame(rows, columns=['pram', 'AIC', 'Pred', 'df_truth', 'rmse', 'forecast'])
    # Keep the parameter combination with the lowest RMSE on the test period
    best = AIC_list_sarima.loc[AIC_list_sarima['rmse'].astype(float).idxmin()]
    date_rng = pd.date_range(start=df_truth.index[-1], end=df_truth.index[-1] + timedelta(nstep - 1),
                             freq='D')
    if plot_result:
        plt.figure(figsize=(5, 6))
        plt.plot(date_rng, best['forecast'], label='%s SARIMA Forecast' % name)
        plt.plot(df_truth.index, best['Pred'], label='%s SARIMA Prediction' % name)
        plt.plot(df_truth.index, best['df_truth'], label='%s SARIMA Actual' % name)
        plt.xticks(fontsize=7)
        plt.legend()
        plt.savefig('sarima.png')
        plt.show()
    print("SARIMA RMSE: %.4f" % best['rmse'])
To run the function above, use the following code:
for i in df.columns:
    sarima(df[str(i)], 30, p=range(0, 2), q=range(0, 1), d=range(0, 2), plot_result=1)
SARIMA: Wal-Mart Stock Price Prediction/Forecasting
LSTM is the last model in this post. Being a more complex model, LSTM has many more parameters than ARMA and SARIMA. The parameters are:
- hidden_neurons=64
- dropout_parameter=0.20
- epoch=400
- batch_size=100
- plot_result=1
def my_LSTM(yt, hidden_neurons, dropout_parameter, epoch, batch_size, plot_result):
    """Long Short Term Memory"""
    name, dates = yt.name, yt.index  # keep the ticker and dates before converting to an array
    split = int(yt.shape[0] * 0.7)
    yt = np.array(yt.astype('float32'))
    yt = np.reshape(yt, (-1, 1))
    train = yt[:split]
    test = yt[split:]

    def prior_steps(df, look_back=30):
        # Each sample holds look_back consecutive prices; the target is the next price
        X, Y = [], []
        for i in range(len(df) - look_back - 1):
            a = df[i:(i + look_back), 0]
            X.append(a)
            Y.append(df[i + look_back, 0])
        return np.array(X), np.array(Y)

    look_back = 30  # number of past prices in each input window
    X_train, Y_train = prior_steps(train, look_back)
    X_test, Y_test = prior_steps(test, look_back)
    X_train = np.reshape(X_train, (X_train.shape[0], 1, look_back))
    X_test = np.reshape(X_test, (X_test.shape[0], 1, look_back))
    model = Sequential()
    model.add(LSTM(hidden_neurons, input_shape=(X_train.shape[1], look_back), activation='relu',
                   return_sequences=True))
    model.add(Dropout(dropout_parameter))
    model.add(Flatten())
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')
    history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size,
                        validation_data=(X_test, Y_test), verbose=0, shuffle=False)
    test_pred = model.predict(X_test)
    train_pred = model.predict(X_train)
    # Start the multi-step forecast from the last test window and feed
    # every prediction back in as the newest input
    x_input = X_test[X_test.shape[0] - 1]
    tempList = list()
    for item in range(30):
        x_input = x_input.reshape((1, 1, look_back))
        yhat = model.predict(x_input, verbose=0)
        x_input = np.append(x_input, yhat)
        x_input = x_input[1:]
        tempList.append(yhat)
    date_rng = pd.date_range(start=dates[-1], end=dates[-1] + timedelta(29), freq='D')
    if plot_result:
        plt.figure(figsize=(5, 6))
        plt.plot(dates[split:], test, label='%s test' % name)
        # The first prediction belongs to the price right after the first window
        plt.plot(dates[split + look_back:-1], test_pred, label='%s Prediction' % name)
        plt.plot(date_rng, np.array(tempList).flatten(), label='%s Forecast' % name)
        plt.xticks(fontsize=7)
        plt.legend()
        plt.savefig('lstm.png')
        plt.show()
    print("LSTM RMSE: %.4f" % math.sqrt(mean_squared_error(Y_test, test_pred)))
After finding the optimum parameters, or in order to find them, use the following code to call the LSTM:
for i in df.columns:
    my_LSTM(df[str(i)], hidden_neurons=64, dropout_parameter=0.20, epoch=400, batch_size=100, plot_result=1)
LSTM: Wal-Mart Stock Price Prediction/Forecasting
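Two steps in the LSTM function are easy to miss: how the price series is cut into input windows, and how the multi-step forecast is produced by feeding each prediction back in as an input. The sketch below replays both steps on a toy list, with a stand-in `predict_next` (the average of the window) in place of the trained network. The function names other than `prior_steps` and the stand-in model are assumptions made for illustration.

```python
def prior_steps(series, look_back):
    """Build (window, next value) pairs, mirroring the loop bounds used above."""
    X, Y = [], []
    for i in range(len(series) - look_back - 1):
        X.append(series[i:i + look_back])
        Y.append(series[i + look_back])
    return X, Y


def predict_next(window):
    """Stand-in for model.predict: the average of the window."""
    return sum(window) / len(window)


def recursive_forecast(last_window, nstep):
    """Forecast nstep values by appending each prediction to the window."""
    window = list(last_window)
    forecasts = []
    for _ in range(nstep):
        yhat = predict_next(window)
        forecasts.append(yhat)
        window = window[1:] + [yhat]  # drop the oldest value, add the newest
    return forecasts


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X, Y = prior_steps(series, look_back=3)
print(X)  # [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(Y)  # [4.0, 5.0]

print(recursive_forecast(X[-1], nstep=3))
```

Note that each target in Y sits look_back steps after the start of its window, so predictions have to be compared with the values from position look_back onward. Because every forecast step builds on earlier forecasts, errors can accumulate as the horizon grows.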
Object Oriented Programming for Stock Price Prediction and Forecasting
Now, it is time to combine all the models we have gone over so far. Python enables us to run all three models at once using Object-Oriented Programming (OOP). OOP means building applications out of objects: we write classes and derive objects from those classes. This is what it looks like:
class Model_Selection(object):
    def __init__(self, df, p, d, q, split, nstep, hidden_neurons, dropout_parameters, epoch, batch_size,
                 plot_result):
        self.p = p
        self.d = d
        self.q = q
        self.df = df  # a single price column (pandas Series)
        self.split = split
        self.nstep = nstep
        self.hidden_neurons = hidden_neurons
        self.dropout_parameters = dropout_parameters
        self.epoch = epoch
        self.batch_size = batch_size
        self.plot_result = plot_result

    def _arma(self, df, nstep, p, q, plot_result):
        """Autoregressive Moving Average"""
        split = self.split
        name = df.name  # ticker of the series, used in the plot labels
        df_truth = df[split:]
        pq = list(itertools.product(p, q))
        rows = []
        for param in pq:
            mod = ARMA(df, order=param)
            results = mod.fit()
            pred_arma = results.predict(start=split, dynamic=False)
            rmse_arma = math.sqrt(((pred_arma - df_truth) ** 2).mean())
            forecast = results.forecast(steps=nstep)[0]
            rows.append([param, results.aic, np.array(pred_arma), np.array(df_truth), rmse_arma, forecast])
        AIC_list_arma = pd.DataFrame(rows, columns=['pram', 'AIC', 'Pred', 'df_truth', 'rmse', 'forecast'])
        # Keep the (p, q) order with the lowest RMSE on the test period
        best = AIC_list_arma.loc[AIC_list_arma['rmse'].astype(float).idxmin()]
        date_rng = pd.date_range(start=df_truth.index[-1], end=df_truth.index[-1] + timedelta(nstep - 1),
                                 freq='D')
        if plot_result:
            plt.plot(date_rng, best['forecast'], label='%s ARMA Forecast' % name)
            plt.plot(df_truth.index, best['Pred'], label='%s ARMA Prediction' % name)
            plt.plot(df_truth.index, best['df_truth'], label='%s ARMA Actual' % name)
            plt.legend()
            plt.show()
        print("ARMA RMSE: %.4f" % best['rmse'])
    def _sarima(self, df, nstep, p, d, q, plot_result):
        """Seasonal Autoregressive Integrated Moving Average"""
        split = self.split
        name = df.name  # ticker of the series, used in the plot labels
        # Generate all different combinations of p, d and q triplets
        pdq = list(itertools.product(p, d, q))
        # Generate all different combinations of seasonal p, d and q triplets
        seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
        df_truth = df[split:]
        rows = []
        for param in pdq:
            for param_seasonal in seasonal_pdq:
                mod = sm.tsa.statespace.SARIMAX(df,
                                                order=param,
                                                seasonal_order=param_seasonal,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                results = mod.fit()
                pred = results.get_prediction(start=split, dynamic=False)
                prediction = pred.predicted_mean
                fore = results.get_forecast(steps=nstep)
                forecast = fore.predicted_mean
                rmse_sarima = math.sqrt(((prediction - df_truth) ** 2).mean())
                rows.append([param, results.aic, np.array(prediction), np.array(df_truth), rmse_sarima,
                             np.array(forecast)])
        AIC_list_sarima = pd.DataFrame(rows, columns=['pram', 'AIC', 'Pred', 'df_truth', 'rmse', 'forecast'])
        # Keep the parameter combination with the lowest RMSE on the test period
        best = AIC_list_sarima.loc[AIC_list_sarima['rmse'].astype(float).idxmin()]
        date_rng = pd.date_range(start=df_truth.index[-1], end=df_truth.index[-1] + timedelta(nstep - 1),
                                 freq='D')
        if plot_result:
            plt.plot(date_rng, best['forecast'], label='%s SARIMA Forecast' % name)
            plt.plot(df_truth.index, best['Pred'], label='%s SARIMA Prediction' % name)
            plt.plot(df_truth.index, best['df_truth'], label='%s SARIMA Actual' % name)
            plt.legend()
            plt.show()
        print("SARIMA RMSE: %.4f" % best['rmse'])

    def my_LSTM(self, yt, nstep, hidden_neurons, dropout_parameter, epoch, batch_size, plot_result):
        """Long Short Term Memory"""
        name, dates = yt.name, yt.index  # keep the ticker and dates before converting to an array
        split = int(yt.shape[0] * 0.7)
        yt = np.array(yt.astype('float32'))
        yt = np.reshape(yt, (-1, 1))
        train = yt[:split]
        test = yt[split:]

        def prior_steps(df, look_back=30):
            # Each sample holds look_back consecutive prices; the target is the next price
            X, Y = [], []
            for i in range(len(df) - look_back - 1):
                a = df[i:(i + look_back), 0]
                X.append(a)
                Y.append(df[i + look_back, 0])
            return np.array(X), np.array(Y)

        look_back = 30  # number of past prices in each input window
        X_train, Y_train = prior_steps(train, look_back)
        X_test, Y_test = prior_steps(test, look_back)
        X_train = np.reshape(X_train, (X_train.shape[0], 1, look_back))
        X_test = np.reshape(X_test, (X_test.shape[0], 1, look_back))
        model = Sequential()
        model.add(LSTM(hidden_neurons, input_shape=(X_train.shape[1], look_back), activation='relu',
                       return_sequences=True))
        model.add(Dropout(dropout_parameter))
        model.add(Flatten())
        model.add(Dense(1))
        model.compile(optimizer='adam', loss='mse')
        history = model.fit(X_train, Y_train, epochs=epoch, batch_size=batch_size,
                            validation_data=(X_test, Y_test), verbose=0, shuffle=False)
        test_pred = model.predict(X_test)
        train_pred = model.predict(X_train)
        # Start the multi-step forecast from the last test window and feed
        # every prediction back in as the newest input
        x_input = X_test[X_test.shape[0] - 1]
        tempList = list()
        for item in range(nstep):
            x_input = x_input.reshape((1, 1, look_back))
            yhat = model.predict(x_input, verbose=0)
            x_input = np.append(x_input, yhat)
            x_input = x_input[1:]
            tempList.append(yhat)
        date_rng = pd.date_range(start=dates[-1], end=dates[-1] + timedelta(nstep - 1), freq='D')
        if plot_result:
            plt.plot(dates[split:], test, label='%s test' % name)
            # The first prediction belongs to the price right after the first window
            plt.plot(dates[split + look_back:-1], test_pred, label='%s Prediction' % name)
            plt.plot(date_rng, np.array(tempList).flatten(), label='%s Forecast' % name)
            plt.legend()
            plt.show()
        print("LSTM RMSE: %.4f" % math.sqrt(mean_squared_error(Y_test, test_pred)))
    def arma(self):
        return self._arma(self.df, self.nstep, self.p, self.q, self.plot_result)

    def sarima(self):
        return self._sarima(self.df, self.nstep, self.p, self.d, self.q, self.plot_result)

    def LSTM(self):
        return self.my_LSTM(self.df, self.nstep, self.hidden_neurons,
                            self.dropout_parameters,
                            self.epoch,
                            self.batch_size,
                            self.plot_result)

    def testAllModels(self):
        arma = self.arma()
        sarima = self.sarima()
        lstm = self.LSTM()
This code runs all three models within the class and, when applied to each of the ten stocks, produces 10x3 stock price prediction and forecasting plots.
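The class above runs every model but leaves the comparison to the reader. As a dependency-free sketch of how selecting the best model by a performance metric could look, the toy class below scores a few simple forecasters by RMSE on a held-out tail and returns the winner. The class name `ToyModelSelection`, the three forecasters, and the data are all hypothetical stand-ins for ARMA, SARIMA, and LSTM.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def last_value(train, nstep):
    """Repeat the last observed value."""
    return [train[-1]] * nstep


def mean_value(train, nstep):
    """Repeat the training mean."""
    return [sum(train) / len(train)] * nstep


def drift(train, nstep):
    """Extend the average step between the first and last observation."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * (k + 1) for k in range(nstep)]


class ToyModelSelection:
    def __init__(self, series, split, models):
        self.train = series[:split]
        self.test = series[split:]
        self.models = models  # mapping: name -> forecasting function

    def scores(self):
        """RMSE of every model on the held-out test values."""
        nstep = len(self.test)
        return {name: rmse(self.test, f(self.train, nstep)) for name, f in self.models.items()}

    def best(self):
        """Name and RMSE of the model with the lowest test error."""
        scores = self.scores()
        name = min(scores, key=scores.get)
        return name, scores[name]


series = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
candidates = {"last_value": last_value, "mean_value": mean_value, "drift": drift}
selector = ToyModelSelection(series, split=7, models=candidates)
print(selector.scores())
print(selector.best())  # ('drift', 0.0) on this perfectly linear toy series
```

The same pattern would carry over to the real models if each of them returned its test RMSE instead of only printing it.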
Wrap-Up
In this post, I introduced ARMA, SARIMA, and LSTM along with their Python applications. This post addresses both experts in the field and a lay audience: if you do not want to know how the time series models work, you can simply focus on the results. So, even a non-expert can run the code above and replicate the results. As for data scientists, it shows how time series models are applied and which model performs best.
As a final word, those who have little or no prior programming skill can simply change the following set of parameters along with the dataset, which is labeled as df here, and reproduce the results:
- p
- d
- q
- split
- nstep
- hidden_neurons
- dropout_parameters
- epoch
- batch_size
- plot_result
Follow our publication MagniData for more!