In this research project I would like to explore data related to the stock market and examine which factors have the greatest influence on whether a company's stock price goes up or down. In recent years there has been fervent debate over whether stock prices are grounded in company financials, especially with the rise of high-ceiling growth stocks such as the new tech companies. According to economics, the concepts of price equilibrium and long-run self-adjustment imply that stock prices should reflect the value of the company itself. Does this statement still hold true?
In this project, I will be using multiple machine learning techniques, ranging from linear and logistic regression to decision trees and ensemble methods, to find out which factors are the most important and whether there is even a correlation between financial success and stock price.
Import Libraries
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing
Exploring Dataset
pd.set_option('display.max_columns', None)
df1 = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
df1.dataframeName = '2014_Financial_Data.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')
print(df1.dtypes)
df1.info()
df1.head(5)
As can be seen from the output above, the dataset has 3808 rows (3808 companies tracked) and 225 columns of factors, with the last two columns being the YOY price change and the Class (which represents whether the stock went up or down, 1 meaning up and 0 meaning down). These two columns are the targets we want to predict, and we can use them to find out which factors have the greatest correlation with PRICE VAR or Class. Almost all of the columns are float datatypes except the name (object), the sector (object) and the Class (int). Therefore, to preprocess this data we must first make some adjustments to the Unnamed and Sector columns.
#remove data with NAN
#change unnamed
#change sector
df1.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
pd.set_option('display.max_columns', None)
data_info = pd.DataFrame(df1.dtypes).T.rename(index={0:'column type'})
data_info = data_info.append(pd.DataFrame(df1.isnull().sum()).T.rename(index={0:'null values (nb)'}))
data_info = data_info.append(pd.DataFrame(df1.isnull().sum()/df1.shape[0]*100).T.rename(index={0:'null values (%)'}))
display(data_info)
#explore sector, pricevar and class
df1.describe()
df1.describe(include=object)
Linear Regression: Our first objective will be to find out whether there is a linear correlation between any of the factors and PRICE VAR, or between the factors and Class. My hypothesis was that, out of all the variables, revenue growth would have the greatest effect on the final YOY change in stock price. However, as can be seen below, there does not seem to be much of a correlation.
dfSector = pd.get_dummies(df1["Sector"])
# build a cleaned frame with sector dummies, keeping df1 itself intact
dfClean = df1.drop('Sector', axis='columns')
dfClean = pd.concat([dfClean, dfSector], axis=1)
dfClean = dfClean.dropna(subset=['Class', '2015 PRICE VAR [%]'])
dfVAR = dfClean['2015 PRICE VAR [%]']
dfClass = dfClean['Class']
dfClean.drop(['2015 PRICE VAR [%]', 'Class'],
axis='columns', inplace=True)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
model = LinearRegression()
x = dfClean['Revenue Growth']
x= x.fillna(x.mean())
X = x.to_frame()  # single predictor: Revenue Growth
y = dfVAR.to_numpy()
plt.scatter(X, y)
model.fit(X, y)
y_predicted = model.predict(X)
plt.plot(X, y_predicted, color='red')
plt.show()
# print(X)
from yellowbrick.regressor import ResidualsPlot
from sklearn.metrics import mean_squared_error
# Instantiate the linear model and the visualizer
#ridge = Ridge()
visualizer = ResidualsPlot(model)
visualizer.fit(X, y)
print("R^2: {}".format(model.score(X, y)))
rmse = np.sqrt(mean_squared_error(y, y_predicted))
print("Root Mean Squared Error: {}".format(rmse))
visualizer.poof();
plt.show()
print(y_predicted)
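To put a number on the visual impression above, the short check below (reusing the x and y already defined in this cell) computes the Pearson correlation between Revenue Growth and the 2015 price change; a value near zero supports the "not much correlation" reading.
# Pearson correlation between Revenue Growth (NaNs filled with the mean above)
# and the 2015 YOY price change.
corr = np.corrcoef(x.to_numpy(), y)[0, 1]
print(f"Pearson correlation (Revenue Growth vs 2015 PRICE VAR): {corr:.4f}")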
As can be seen from the graph above and the lasso model below, linear regression is too simple to examine a dataset with over 200 variables, so we must pursue a more sophisticated model to evaluate the dataset. After fitting the lasso, nearly all of the 200-odd variables had extremely low coefficients, meaning they had very little effect on the final outcome. This is expected, as we are certainly oversimplifying here.
X2 = dfClean.fillna(dfClean.mean(numeric_only=True))
X2 = X2.drop('Stock Name',axis=1)
y2 = dfVAR.values
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.33, random_state=42)
#Let's build a Lasso model
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
alpha_user = 0.95420530945486497 #play with alpha and you will get different results
lasso_model = Lasso(alpha=alpha_user,normalize=True)
#ridge_model = Ridge(alpha=alpha_user, normalize=True)
lasso_model.fit(X_train,y_train)
lasso_coef = lasso_model.coef_
# print(lasso_coef)
# map each coefficient back to its feature name so the top variables are readable
coef_df = pd.DataFrame({'feature': X2.columns, 'coef': lasso_coef})
coef_df = coef_df.sort_values(by=['coef'], ascending=False)
print(coef_df.head(5))
Interestingly enough, the top 3 variables were Real Estate, Energy and R&D to Revenue. Traditionally, both Real Estate and Energy are regarded as traditional stocks, so their profits have more to do with their stock price than is the case for the new tech stocks. On the other hand, the high coefficient for R&D to Revenue also makes sense, as it is a good variable for judging whether a company will grow its profit in the long term.
Logistic Regression
Since the linear regression showed some positive signs regarding the correlation between Sector and change in stock price, I decided to do a logistic regression study on just Sector and Class (we can do this because Sector is categorical data). Using a confusion matrix, it is visible that although the model reaches an accuracy of about 0.6, this is not significant enough to establish a relationship in the data.
from sklearn.linear_model import LogisticRegression
X3 = dfSector
y3 = dfClass.values
X_train2, X_test2, y_train2, y_test2 = train_test_split(X3, y3, test_size=0.33, random_state=42)
model = LogisticRegression()
model.fit(X_train2, y_train2)
y_pred2 = model.predict(X_test2)
from sklearn import metrics
print(metrics.accuracy_score(y_test2, y_pred2))
from sklearn.metrics import plot_confusion_matrix
cnf_matrix = metrics.confusion_matrix(y_test2, y_pred2)
print(cnf_matrix)
metrics.plot_confusion_matrix(model, X_test2, y_test2)
plt.show()
print(metrics.classification_report(y_test2, y_pred2))
Let's take a look at what the Sector data really is and how the companies are spread out among different sectors.
import seaborn as sns
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
count = df['Sector'].value_counts()
plt.figure(figsize=(15,10))
ax = sns.countplot(x='Sector', data=df, palette="Set2", order=count.index[0:10])
ax.set(xlabel='Sectors', ylabel='Number of Companies')
plt.title("Bar Graph of Sectors")
As can be seen from the bar chart above, most of the companies are in the Financial Services sector. The chart shows the ten largest sectors; the data contains 11 sectors in total.
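For reference, printing the full value counts (reusing the count Series computed above) lists every sector, since the bar chart is limited to the ten largest.
# Companies per sector across all 11 sectors (the chart above shows only the top 10).
print(count)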
The table below shows the mean, median and variance of the YOY stock price change for each sector.
# print(result)
import statistics
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
rows = []
sectors = df['Sector'].unique()
for sector in sectors:
    dfx = df[df['Sector'] == sector]
    x = dfx['2015 PRICE VAR [%]'].mean()
    y = dfx['2015 PRICE VAR [%]'].median()
    z = statistics.variance(dfx['2015 PRICE VAR [%]'])
    rows.append([sector, x, y, z])
dfx = pd.DataFrame(rows, columns=["Sector","Mean","Median","Variance"])
print(dfx)
# plt.figure(figsize=(15,25))
# ax = sns.barplot(x='Sector', y='Mean', data=dfx, palette="Set2")
# ax.set(xlabel='Sectors', ylabel='Number of Companies')
# plt.title("Bar Graph of Sectors")
The table below shows the mean, median and variance of the Class data for each sector.
# print(result)
import statistics
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
rows = []
sectors = df['Sector'].unique()
for sector in sectors:
    dfx = df[df['Sector'] == sector]
    x = dfx['Class'].mean()
    y = dfx['Class'].median()
    z = statistics.variance(dfx['Class'])
    rows.append([sector, x, y, z])
dfx = pd.DataFrame(rows, columns=["Sector","Mean","Median","Variance"])
print(dfx)
# plt.figure(figsize=(15,25))
# ax = sns.barplot(x='Sector', y='Mean', data=dfx, palette="Set2")
# ax.set(xlabel='Sectors', ylabel='Number of Companies')
# plt.title("Bar Graph of Sectors")
The graphs below show the relationship between the mean stock price change of each sector and the means of several growth indicators for those sectors.
# print(result)
import statistics
from sklearn import datasets, linear_model
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
# df = pd.DataFrame(columns=["Sector","VAR Mean","Gross Profit Mean","Net Income Mean","Net Profit Margin Mean"])
df = df.dropna(subset=['Sector','2015 PRICE VAR [%]','priceEarningsToGrowthRatio','Gross Profit Growth','Net Income Growth','Revenue Growth'])
rows = []
sectors = df['Sector'].unique()
for sector in sectors:
    dfx = df[df['Sector'] == sector]
    a = dfx['2015 PRICE VAR [%]'].mean()
    b = dfx['priceEarningsToGrowthRatio'].median()
    c = dfx['Gross Profit Growth'].mean()
    d = dfx['Net Income Growth'].mean()
    e = dfx['Revenue Growth'].mean()
    rows.append([sector, a, b, c, d, e])
dfx = pd.DataFrame(rows, columns=["Sector","VAR Mean","priceEarningsToGrowthRatio","Gross Profit Growth","Net Income Growth","Revenue Growth"])
fig, axs = plt.subplots(2, 2)
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Net Income Growth'])
axs[0, 0].set_title('Net Income Growth')
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[0, 0].plot(x2, abline_values, 'b')
axs[0, 0].set_xlim([-50, 300])
axs[0, 0].scatter(x=x2, y=y)
axs[1, 0].set_title('priceEarningsToGrowthRatio')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['priceEarningsToGrowthRatio'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[1, 0].plot(x2, abline_values, 'b')
axs[1,0].scatter(x=dfx['VAR Mean'], y =dfx['priceEarningsToGrowthRatio'])
axs[1, 1].set_title('Gross Profit Growth')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Gross Profit Growth'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[1, 1].plot(x2, abline_values, 'b')
axs[1,1].scatter(x=dfx['VAR Mean'], y =dfx['Gross Profit Growth'])
axs[0, 1].set_title('Revenue Growth')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Revenue Growth'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[0, 1].plot(x2, abline_values, 'b')
axs[0,1].scatter(x=dfx['VAR Mean'], y =dfx['Revenue Growth'])
fig.tight_layout(pad=3.0)
Above are some graphs of the different growth indicators for the different sectors and how they vary with each sector's mean YOY stock price change. There doesn't seem to be much of a relationship between the points; however, the growth indicators all show a similar spread, suggesting that the growth indicators may be related to each other rather than to the YOY price change (a quick check of this follows below). It is also interesting to note that each sector's data is fairly consistent, meaning that companies are affected by the sector they belong to.
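To check the hunch that the growth indicators relate more to each other than to the price change, the sketch below prints the correlation matrix of the per-sector means, assuming the dfx table built in the previous cell.
# Correlation matrix of the per-sector means; strong values among the growth
# indicators but weak values against "VAR Mean" would support the reading above.
print(dfx[["VAR Mean", "Revenue Growth", "Gross Profit Growth",
           "Net Income Growth", "priceEarningsToGrowthRatio"]].corr())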
Baseline Dummy Classifier
With this baseline dummy classifier, we know that every model we develop should have an accuracy greater than 0.5879, since anything lower would be worse than a dummy model.
from sklearn.dummy import DummyClassifier
dfClass2 = dfClass.replace(1, "1")
dfClass2 = dfClass2.replace(0, "0")
dfClass2 = pd.DataFrame(dfClass2)
X2 = df1.dropna(subset=['Class','2015 PRICE VAR [%]','Stock Name'])
xTrain, xTest, yTrain, yTest = train_test_split(X2, dfClass2, test_size = 0.3, random_state = 2020)
dummy_classifier = DummyClassifier(strategy='most_frequent')
# print(X_train.head(5))
# print(df)
dummy_classifier.fit(xTrain,yTrain)
baseline_acc = dummy_classifier.score(xTest,yTest)
### For verifying answer:
print("Baseline Accuracy (using Class) = ", baseline_acc)
Decision Tree
X3 = dfClean.fillna(dfClean.mean(numeric_only=True))
X3.drop('Stock Name',
axis='columns', inplace=True)
xTrain, xTest, yTrain, yTest = train_test_split(X3, dfClass2, test_size = 0.3, random_state = 2020)
df = pd.DataFrame(columns=["A", "B", "C"])
rows = []
# print(xTrain.head(5))
# print(yTrain.head(5))
for i in range(10):
if i!=0:
dt = DecisionTreeClassifier(max_depth=i, criterion="entropy", random_state=2020)
dt.fit(xTrain,yTrain)
ypred = dt.predict((xTest))
ytrainpred = dt.predict((xTrain))
a1 = metrics.accuracy_score(yTrain, ytrainpred)
a2 = metrics.accuracy_score(yTest, ypred)
rows.append([i, a1,a2])
# print(rows)
df = pd.DataFrame(rows, columns=["A", "B","C"])
print(df)
As we can see above, at a depth of 5 the decision tree starts to overfit the data, with test accuracy significantly lower than training accuracy. Hence we should only analyse a decision tree with a depth of at most 5 layers.
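A quick plot of the train and test accuracies from the table above (assuming the Depth/Train Accuracy/Test Accuracy columns produced by the previous cell) makes that divergence easier to see:
# Visualize the gap between training and test accuracy as tree depth grows;
# the curves diverging is the overfitting described above.
plt.plot(df["Depth"], df["Train Accuracy"], marker='o', label="Train Accuracy")
plt.plot(df["Depth"], df["Test Accuracy"], marker='o', label="Test Accuracy")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()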
In the confusion matrix below we can see that the model is reasonably good at predicting a company's class (stock price change 0 or 1). However, it is not that great at labeling companies whose stock price went up, with a significant number of stocks classified as false negatives.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=2020)
dt.fit(xTrain,yTrain)
ypred = dt.predict((xTest))
ytrainpred = dt.predict((xTrain))
metrics.plot_confusion_matrix(dt, xTest, yTest)
plt.show()
cf = metrics.confusion_matrix(yTest, ypred)
print(cf)
Accuracy of the train and test data for this model:
print('Train Accuracy =', metrics.accuracy_score(yTrain, ytrainpred))
print('Test Accuracy =', metrics.accuracy_score(yTest, ypred))
The decision tree below shows that deposit liabilities is the most important factor, followed by capitalExpenditureCoverageRatios and Invested Capital. This observation is also illustrated in the feature importance table shown below the decision tree.
from sklearn import tree
fn = X3.columns
dfClass3 = dfClass.replace(1, "1")
dfClass3 = dfClass3.replace(0, "0")
cn = dfClass3.unique()
plt.figure(figsize=(100, 500))
Tree = tree.plot_tree(dt,feature_names=fn,class_names=cn,fontsize=10)
plt.savefig('stockmarket.png')
plt.show()
print('Feature Importance:', dt.feature_importances_)
imp = pd.DataFrame(zip(xTrain.columns, dt.feature_importances_))
imp.columns = ['Feature', 'Importance']
imp = imp.sort_values(by=['Importance'], ascending=False)
pd.set_option('display.max_rows', None)
print(imp.head(5))
Interestingly enough, the results we get here are not bad in comparison to the logistic and linear regression results. However, because we have so many different variables, it is quite difficult for a single decision tree to give us highly accurate predictions. Hence we must look at different methods to increase this accuracy.
Ensemble
Ensemble Learning is building a prediction model by combining the strengths of a collection of simpler base models. My ensemble algorithm here will aggregate the predictions made by each individual base model and produce a single output prediction. Hopefully by using ensemble learning we will get better models to find out what factors have the greatest influence on stock price yoy change.
Ensemble Method 1: Bagging
What is Bagging? Bagging (Bootstrap Aggregation) is an averaging method that trains the same model on different bootstrap samples of the training data. Doing this addresses a shortcoming of trees (which are heavily influenced by their training data) and helps ensure we don't overfit the way we did in the decision tree model above.
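Before the full models below, a minimal sketch of the variance-reduction idea: cross-validate a single tree against a bagged ensemble of the same trees (reusing X3 and dfClass from earlier cells; cv=3 and 25 estimators are illustrative choices, not tuned values).
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
# A single deep tree versus an average over 25 bootstrap-trained trees;
# bagging should smooth out the variance that hurts the single tree.
single_tree = DecisionTreeClassifier(random_state=2020)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=2020), n_estimators=25, random_state=2020)
print("Single tree CV accuracy: ", cross_val_score(single_tree, X3, dfClass, cv=3).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X3, dfClass, cv=3).mean())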
Method 1a: Generic Bagging:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
dataframe = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
dataframe.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
dataframe2 = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
dataframe = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
ensemble= dataframe.fillna(dataframe.mean())
X_train, X_test, y_train, y_test = train_test_split(ensemble, dfClass,
test_size=0.3,
random_state=42
)
model_bagging = BaggingClassifier(random_state = 42)
model_bagging.fit(X_train, y_train)
pred_bagging = model_bagging.predict(X_test)
scores = []
from sklearn.metrics import recall_score
for model_idx, model in enumerate(model_bagging.estimators_):
    if model_idx == 0:
        print('=' * 40)
    preds = model.predict(X_test)
    scores.append(recall_score(y_test, preds))
    model_recall = np.round(recall_score(y_test, preds), 5)
    print(f'Recall for Base Model {model_idx+1}:\t', model_recall)
    if model_idx < (len(model_bagging.estimators_) - 1):
        print('-' * 40)
    else:
        print('=' * 40)
ensemble_preds = model_bagging.predict(X_test)
print("Mean Recall Score:\t\t", np.round(np.array(scores).mean(), 5))
randomforest = np.round(np.array(scores).mean(), 5)
print("Std Deviation:\t\t\t", np.round(np.array(scores).std(), 5))
print("Range:\t\t\t\t", np.round(np.array(scores).ptp(), 5))
print(f'Overall Recall for model:\t {np.round(recall_score(y_test, ensemble_preds), 5)}')
# acc_bagging = accuracy_score(y_test, pred_bagging)
# print(' Accuracy = ', acc_bagging)
Below is a classification report that gives us some data on how accurately we are classifying the data in terms of true negatives, true positives, false negatives and false positives.
from sklearn.metrics import (
    classification_report,
    recall_score,
    precision_score,
    accuracy_score
)
print('Classification Report:\n')
print(classification_report(y_test, pred_bagging))
Method 1b: Random Forest Bagging:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
model_rf = RandomForestClassifier(n_estimators=10, max_features=200, random_state=42)
model_rf.fit(X_train, y_train)
predict_rf = model_rf.predict(X_test)
# recall_rf = recall_score(y_test, predict_rf, average=None)
# precision_rf = precision_score(y_test, predict_rf)
scores = []
for model_idx, model in enumerate(model_rf.estimators_):
    if model_idx == 0:
        print('=' * 40)
    preds = model.predict(X_test)
    scores.append(recall_score(y_test, preds))
    model_recall = np.round(recall_score(y_test, preds), 5)
    print(f'Recall for Base Model {model_idx+1}:\t', model_recall)
    if model_idx < (len(model_rf.estimators_) - 1):
        print('-' * 40)
    else:
        print('=' * 40)
ensemble_preds = model_rf.predict(X_test)
print("Mean Recall Score:\t\t", np.round(np.array(scores).mean(), 5))
randomforest = np.round(np.array(scores).mean(), 5)
print("Std Deviation:\t\t\t", np.round(np.array(scores).std(), 5))
print("Range:\t\t\t\t", np.round(np.array(scores).ptp(), 5))
print(f'Overall Recall for model:\t {np.round(recall_score(y_test, ensemble_preds), 5)}')
As can be seen above, both the generic and the random-forest-based bagging models were quite inaccurate. This makes sense, as we already know that the training data is not what is giving us a low-accuracy model; rather, we have too many variables that do not have a significant effect on the final output. Let's now look at other types of ensemble methods.
To further examine the data, I used out-of-bag (OOB) evaluation to see how much error we get from bagging, which trains each estimator on a bootstrap sample while leaving other observations unused. As can be seen below, the OOB score differs from the test accuracy by around 2%. However, this is relatively unimportant, as the bagging model is not accurate enough to be considered further in the rest of this study.
model_rf_oob = RandomForestClassifier(n_estimators=100, max_features=7, oob_score=True, random_state=42).fit(X_train, y_train)
oob_score = round(model_rf_oob.oob_score_,4)
acc_oob = round(accuracy_score(y_test, model_rf_oob.predict(X_test)),4)
diff_oob = round(abs(oob_score - acc_oob),4)
print('OOB Score:\t\t\t', oob_score)
print('Testing Accuracy:\t\t', acc_oob)
print('Acc. Difference:\t\t', diff_oob)
Ensemble Method 2: AdaBoost
AdaBoost is a boosting algorithm that begins by assigning equal weights to all observations; based on the first set of predictions, it adjusts the weights, increasing them for misclassified observations and reducing them for correctly classified ones. This update keeps going, round after round, until we have a relatively accurate model.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
base_est = DecisionTreeClassifier (max_depth =2)
ada_boost1 = AdaBoostClassifier(base_est, n_estimators=500, random_state=42, learning_rate=.15)
ada_boost1.fit(X_train, y_train)
ada_boost2 = AdaBoostClassifier(base_est, n_estimators=20, random_state=42, learning_rate=.5)
ada_boost2.fit(X_train, y_train)
res1 = round(recall_score(y_test, ada_boost1.predict(X_test)),4)
res2 = round(recall_score(y_test, ada_boost2.predict(X_test)),4)
print('Winning Model:\t MODEL 1')
print('MODEL 1 Recall:\t {}'.format(res1))
print('MODEL 2 Recall:\t {}'.format(res2))
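To visualize the round-by-round re-weighting described above, a small sketch plots test accuracy after each boosting stage of MODEL 1 using staged_predict; this is purely illustrative.
# Test accuracy after each boosting round of ada_boost1 (500 rounds).
staged_acc = [accuracy_score(y_test, pred) for pred in ada_boost1.staged_predict(X_test)]
plt.plot(range(1, len(staged_acc) + 1), staged_acc)
plt.xlabel("Number of boosting rounds")
plt.ylabel("Test accuracy")
plt.show()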
Ensemble Method 3: Gradient Boosted Trees (GBT)
Gradient Boosted Trees is another boosting algorithm, but unlike AdaBoost it does not re-weight observations; instead, each new model is fit to the residual errors left by the preceding models.
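Before the actual GBT classifier below, a toy sketch of the residual-fitting idea (reusing X_train/y_train from the bagging split; the two-stage setup and max_depth=2 are illustrative choices, not the model used in this study): a second tree is trained on the errors left over by the first one, and their predictions are added together.
from sklearn.tree import DecisionTreeRegressor
# Stage 1: fit a small tree to the 0/1 class treated as a number.
y_num = y_train.astype(float)
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_num)
residuals = y_num - tree1.predict(X_train)   # what the first tree got wrong
# Stage 2: fit another small tree to those residuals and add the predictions.
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, residuals)
combined = tree1.predict(X_train) + tree2.predict(X_train)
print("Training MSE after stage 1:", np.mean((y_num - tree1.predict(X_train)) ** 2))
print("Training MSE after stage 2:", np.mean((y_num - combined) ** 2))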
from sklearn.ensemble import GradientBoostingClassifier
dataframe2 = dataframe.dropna()
xTrain, xTest, yTrain, yTest = train_test_split(X3, dfClass2, test_size = 0.3, random_state = 2020)
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
gbc_pred = gbc.predict(X_test)
recall_gbc = round(recall_score(y_test, gbc_pred),4)
precision_gbc = round(precision_score(y_test, gbc_pred),4)
print('Recall :\t {}'.format(recall_gbc))
print('Precision :\t {}'.format(precision_gbc))
As shown above, both boosting algorithms have relatively high precision, but in terms of recall they don't do quite as well. Below I will use a hard voting method to combine logistic regression, random forest and support vector machine classifiers into a single combined algorithm.
Ensemble Method 4: Voting Classifier
In hard voting, we use majority voting to combine the results from all the models and choose a single output (0 or 1) for each observation. In this model we use an ensemble of RandomForestClassifier, Support Vector Machine and Logistic Regression. At the end we get a combined accuracy score of 0.65, which is quite good compared to the other models we have used before.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
rfClf = RandomForestClassifier(n_estimators=500, random_state=0) # 500 trees.
svmClf = SVC(probability=True, random_state=0) # probability calculation
logClf = LogisticRegression(random_state=0)
# constructing the ensemble classifier by mentioning the individual classifiers.
clf2 = VotingClassifier(estimators = [('rf',rfClf), ('svm',svmClf), ('log', logClf)], voting='hard')
# train the ensemble classifier
clf2.fit(X_train, y_train)
clf2_pred = clf2.predict(X_test)
recall_voting = recall_score(y_test, clf2_pred)
precision_voting = precision_score(y_test, clf2_pred)
print('Accuracy score', accuracy_score(y_test, clf2_pred))
# You can use the individual classifiers to get the accuracy in the beginning and see if our ensemble performs
# better when compared to individual classifiers.
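Following the note above, the quick check below fits each base classifier on its own and prints its accuracy, so the voting ensemble's score can be compared against its individual members (this simply refits the rfClf, svmClf and logClf defined above).
# Accuracy of each base classifier on its own, for comparison with the ensemble.
for name, clf in [('Random Forest', rfClf), ('SVM', svmClf), ('Logistic Regression', logClf)]:
    clf.fit(X_train, y_train)
    print(name, 'accuracy:', accuracy_score(y_test, clf.predict(X_test)))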
Ensemble Method 5: XGBoost
XGBoost is a boosting method that uses the gradient boosting (GBM) framework and is well known for often providing better solutions than other machine learning algorithms.
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
dataframe = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
dataframe.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
dataframe = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
ensemble= dataframe.fillna(dataframe.mean())
X = ensemble
y = dfClass
data_dmatrix = xgb.DMatrix(data=X,label=y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
max_depth = 5, alpha = 10, n_estimators = 100)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
from sklearn.linear_model import LinearRegression
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train,y_train)
pred_train = linear_reg_model.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train,pred_train))
print("Train RMSE: %f" % (train_rmse))
pred_test = linear_reg_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test,pred_test))
print("Test RMSE: %f" % (test_rmse))
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
cv_results.head()
print((cv_results["test-rmse-mean"]).tail(1))
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
import matplotlib.pyplot as plt
xgb.plot_tree(xg_reg,num_trees=1)
plt.rcParams['figure.figsize'] = [500, 500]
plt.show()
As seen from the calculations above, XGBoost allows us to create a tree based on the most important factors. From the test rmse mean, it is clear that the amount of error is quite low especially in comparison to the RMSE of a linear regression. Now that we have a decent model, what does this model tell us about the data? Below I will be looking through features that are important in the model and how they play a role in affecting YOY stock price change.
from xgboost import XGBRegressor
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
# plt.barh(dataframe.columns, xgb.feature_importances_)
feature_names = dataframe.columns
sorted_idx = xgb.feature_importances_.argsort()
# print(sorted_idx)
# print(feature_names)
# print(xgb.feature_importances_)
rows=[]
for i in sorted_idx:
    x = feature_names[i]
    y = xgb.feature_importances_[i]
    rows.append([i, x, y])
dfx = pd.DataFrame(rows, columns=["id","Variable","Value"])
pd.set_option('display.max_columns', None)
print(dfx.tail(10))
eid = dfx['Variable'].tail(15).to_numpy()
# plt.barh(feature_names[sorted_idx][:-5], xgb.feature_importances_[sorted_idx][:-5])
# plt.xlabel("Xgboost Feature Importance")
As can be seen above from the feature importance variables, Deposit Liabilities is the most important factor and most of the other variables have a significantly lower influence (with many having a value of 0.0). This is an interesting point because in an earlier study regarding the random forest model (including Sectors), Deposit liabilities was also the top variable when we looked at the feature importance for that model.
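To back up the point that many variables carry essentially no weight, the one-line check below counts the features whose importance is exactly zero in the fitted xgb model above.
# How many features contribute nothing at all to the fitted XGBoost model?
print("Features with zero importance:", int((xgb.feature_importances_ == 0).sum()), "out of", len(xgb.feature_importances_))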
import seaborn as sns
def correlation_heatmap(train):
    correlations = train.corr()
    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, vmax=1.0, center=0, fmt='.2f', cmap="YlGnBu",
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
    plt.show()
correlation_heatmap(X_train[eid])
The heatmap above shows the top 15 most important factors and how they relate to one another. It is interesting that most of them have little relation to one another. However, Deposit Liabilities does have a 0.92 correlation with Tangible Asset Value. This is interesting because assets and liabilities are the two most fundamental statistics a company has, as they show what the company is currently worth. Perhaps there is a reason why both deposit liabilities and tangible asset value have something to do with the stock price change: a company with a good balance sheet may seem safer to investors and become more desirable, leading to an increase in its stock price.
Conclusion
Although this study doesn't give us any concrete evidence of a relationship between a company's financial factors and its stock price change, it does give us a better understanding of which factors have a higher-than-normal correlation with stock price change.
After using data science models ranging from linear and logistic regression to decision trees and ensemble methods such as XGBoost, my study has found that Sector and Deposit Liabilities have the greatest influence on stock price change. Many of the other factors have close to zero relation with stock price change. Most of the stocks that belong to one sector seem to have similar financials and similar growth, which makes sense, as this is why P/E comparisons are often made between companies in the same industry. As for deposit liabilities, my theory is that because investors are loss averse, they pay a lot of attention to a company's liabilities to the bank, as these represent the company's independence from giant financial investors such as banks.
The most interesting question this project has helped me understand is why stock prices are not correlated with the respective companies' financials. Through my experience as an avid investor and through doing this study, I have realised that a company's price on the stock market is not determined by financials and number crunching; it is instead determined by the judgements of investors. And as long-term thinkers, investors don't invest based on current financial numbers but on future ones. Investors may instead focus their attention on politics, macroeconomics and long-term trends that have longstanding effects on the future of specific industries and sectors.
The reason the finance industry is so challenging is that there is no formula that guarantees profit in the stock market. There is no guarantee of success and there is no investment without some sort of risk. This project has given me a better understanding of this concept, and I hope to continue doing this sort of data science project on the stock market to see whether my hypothesis about the effects of behavioral psychology and macro global trends on stock price change is correct.