Introduction to my Data Analysis Final Project

In this Research Project I would like to explore data related to the stock market and what factors have the greatest influence on wether a company's stock price goes up or down. In recent years, it has been fervently debated regarding wether stocks are based on their financials (especially with the rise of growth stocks with high ceilings such as the new tech companies). According to economics, the concept of price equilibrium and long run self-adjustment means that stock prices should reflect the value of the company itself. Does this statement still remain true?

In this project, I will be using multiple types of Machine Learning techniques such as blah to find out which factors are the most important and wether there even is a correlation between financial success and stock price.

Import Libraries

In [1]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing

Exploring Dataset

In [2]:
df1 = pd.set_option('display.max_columns', None)
df1 = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
df1.dataframeName = '2014_Financial_Data.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')
print(df1.dtypes)
df1.info()
df1.head(5)
There are 3808 rows and 225 columns
Unnamed: 0               object
Revenue                 float64
Revenue Growth          float64
Cost of Revenue         float64
Gross Profit            float64
                         ...   
R&D Expense Growth      float64
SG&A Expenses Growth    float64
Sector                   object
2015 PRICE VAR [%]      float64
Class                     int64
Length: 225, dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3808 entries, 0 to 3807
Columns: 225 entries, Unnamed: 0 to Class
dtypes: float64(222), int64(1), object(2)
memory usage: 6.5+ MB
Out[2]:
Unnamed: 0 Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax Income Tax Expense Net Income - Non-Controlling int Net Income - Discontinued ops Net Income Preferred Dividends Net Income Com EPS EPS Diluted Weighted Average Shs Out Weighted Average Shs Out (Dil) Dividend per Share Gross Margin EBITDA Margin EBIT Margin Profit Margin Free Cash Flow margin EBITDA EBIT Consolidated Income Earnings Before Tax Margin Net Profit Margin Cash and cash equivalents Short-term investments Cash and short-term investments Receivables Inventories Total current assets Property, Plant & Equipment Net Goodwill and Intangible Assets Long-term investments Tax assets Total non-current assets Total assets Payables Short-term debt Total current liabilities Long-term debt Total debt Deferred revenue Tax Liabilities Deposit Liabilities Total non-current liabilities Total liabilities Other comprehensive income Retained earnings (deficit) Total shareholders equity Investments Net Debt Other Assets Other Liabilities Depreciation & Amortization Stock-based compensation Operating Cash Flow Capital Expenditure Acquisitions and disposals Investment purchases and sales Investing Cash flow Issuance (repayment) of debt Issuance (buybacks) of shares Dividend payments Financing Cash Flow Effect of forex changes on cash Net cash flow / Change in cash Free Cash Flow Net Cash/Marketcap priceBookValueRatio priceToBookRatio priceToSalesRatio priceEarningsRatio priceToFreeCashFlowsRatio priceToOperatingCashFlowsRatio priceCashFlowRatio priceEarningsToGrowthRatio priceSalesRatio dividendYield enterpriseValueMultiple priceFairValue ebitperRevenue ebtperEBIT niperEBT grossProfitMargin operatingProfitMargin pretaxProfitMargin netProfitMargin effectiveTaxRate returnOnAssets returnOnEquity returnOnCapitalEmployed nIperEBT eBTperEBIT eBITperRevenue payablesTurnover inventoryTurnover fixedAssetTurnover assetTurnover currentRatio quickRatio cashRatio daysOfSalesOutstanding daysOfInventoryOutstanding operatingCycle daysOfPayablesOutstanding cashConversionCycle debtRatio debtEquityRatio longtermDebtToCapitalization totalDebtToCapitalization interestCoverage cashFlowToDebtRatio companyEquityMultiplier operatingCashFlowPerShare freeCashFlowPerShare cashPerShare payoutRatio operatingCashFlowSalesRatio freeCashFlowOperatingCashFlowRatio cashFlowCoverageRatios shortTermCoverageRatios capitalExpenditureCoverageRatios dividendpaidAndCapexCoverageRatios dividendPayoutRatio Revenue per Share Net Income per Share Operating Cash Flow per Share Free Cash Flow per Share Cash per Share Book Value per Share Tangible Book Value per Share Shareholders Equity per Share Interest Debt per Share Market Cap Enterprise Value PE ratio Price to Sales Ratio POCF ratio PFCF ratio PB ratio PTB ratio EV to Sales Enterprise Value over EBITDA EV to Operating cash flow EV to Free cash flow Earnings Yield Free Cash Flow Yield Debt to Equity Debt to Assets Net Debt to EBITDA Current ratio Interest Coverage Income Quality Dividend Yield Payout Ratio SG&A to Revenue R&D to Revenue Intangibles to Total Assets Capex to Operating Cash Flow Capex to Revenue Capex to Depreciation Stock-based compensation to Revenue Graham Number ROIC Return on Tangible Assets Graham Net-Net Working Capital Tangible Asset Value Net Current Asset Value Invested Capital Average Receivables Average Payables Average Inventory Days Sales Outstanding Days Payables Outstanding Days of Inventory on Hand Receivables Turnover Payables Turnover Inventory Turnover ROE Capex per Share Gross Profit Growth EBIT Growth Operating Income Growth Net Income Growth EPS Growth EPS Diluted Growth Weighted Average Shares Growth Weighted Average Shares Diluted Growth Dividends per Share Growth Operating Cash Flow growth Free Cash Flow growth 10Y Revenue Growth (per Share) 5Y Revenue Growth (per Share) 3Y Revenue Growth (per Share) 10Y Operating CF Growth (per Share) 5Y Operating CF Growth (per Share) 3Y Operating CF Growth (per Share) 10Y Net Income Growth (per Share) 5Y Net Income Growth (per Share) 3Y Net Income Growth (per Share) 10Y Shareholders Equity Growth (per Share) 5Y Shareholders Equity Growth (per Share) 3Y Shareholders Equity Growth (per Share) 10Y Dividend per Share Growth (per Share) 5Y Dividend per Share Growth (per Share) 3Y Dividend per Share Growth (per Share) Receivables growth Inventory Growth Asset Growth Book Value per Share Growth Debt Growth R&D Expense Growth SG&A Expenses Growth Sector 2015 PRICE VAR [%] Class
0 PG 7.440100e+10 -0.0713 3.903000e+10 3.537100e+10 0.000000e+00 2.146100e+10 2.146100e+10 1.391000e+10 7.090000e+08 1.449400e+10 2.851000e+09 1.420000e+08 -1.127000e+09 1.164300e+10 0.0 1.164300e+10 4.1900 4.010 2.705960e+09 2.705960e+09 2.448 0.4754 0.2470 0.2043 0.1560 0.1359 1.834400e+10 1.520300e+10 1.178500e+10 0.1948 0.1565 8.558000e+09 2.128000e+09 1.068600e+10 6.386000e+09 6.759000e+09 3.161700e+10 2.230400e+10 8.454700e+10 0.000000e+00 1.092000e+09 1.126490e+11 1.442660e+11 8.461000e+09 1.560600e+10 3.372600e+10 1.981100e+10 3.541700e+10 0.000000e+00 1.021800e+10 0.0 4.056400e+10 7.429000e+10 -7.662000e+09 8.499000e+10 6.997600e+10 2.128000e+09 2.473100e+10 7.786000e+09 9.659000e+09 3.141000e+09 3.600000e+08 1.395800e+10 -3.848000e+09 -2.400000e+07 -8.050000e+08 -4.100000e+09 3.543000e+09 -3.911000e+09 -6.911000e+09 -7.279000e+09 3.900000e+07 2.618000e+09 1.011000e+10 -0.1163 3.0391 0.0000 2.8583 18.7566 21.0348 15.2358 12.977501 15.557842 2.434644 0.038153 11.421116 2.588601 0.204339 0.953364 0.803298 0.475410 1.0 0.186960 0.156490 0.196702 0.5765 0.1664 0.0753 0.803298 0.953364 0.204339 1.5648 10.8869 3.335769 0.515721 0.9370 0.506197 0.253751 -63.2087 31.3287 NaN 41.5084 NaN 0.2455 0.5061 0.220644 0.336047 21.4429 0.394105 2.061650 5.1582 3.964 3.1626 0.584 0.187605 0.724316 0.394105 0.894400 3.627339 1.297332 0.593576 27.4952 4.3027 5.1582 3.964 3.1626 25.8600 22.0690 25.8600 13.0885 2.126614e+11 2.373924e+11 18.7566 2.8583 15.2358 21.0348 3.0391 0.0000 3.1907 12.9411 17.0076 23.4810 0.0533 0.0475 0.5061 0.2455 1.3482 0.9370 21.4429 1.1988 0.0311 0.584 0.2885 0.000 0.5860 0.2757 0.0517 -1.2251 0.0048 49.3756 0.0753 0.5765 -0.2007 -2.109000e+09 5.971900e+10 -4.267300e+10 5.285200e+10 6.447000e+09 8.619000e+09 6.834000e+09 31.3287 41.5084 -63.2087 11.5404 1.5648 10.8869 0.1664 -1.4220 -0.1185 -0.0001 0.0067 0.0293 0.0371 0.0389 -0.0127 NaN 0.0699 -0.0615 -0.0695 0.0323 0.0088 -0.0182 0.0355 0.0015 0.0260 0.0603 -0.0137 0.0060 0.1442 0.0353 0.0201 0.1013 0.0834 0.0751 -0.0187 -0.0217 0.0359 0.0316 0.1228 0.0000 -0.1746 Consumer Defensive -9.323276 0
1 VIPS 3.734148e+09 1.1737 2.805625e+09 9.285226e+08 1.083303e+08 3.441414e+08 7.939267e+08 1.345959e+08 1.214869e+07 1.753823e+08 3.955957e+07 -1.431918e+07 0.000000e+00 1.358227e+08 0.0 1.358227e+08 0.2396 0.226 NaN 1.829362e+07 0.000 0.2487 0.0107 0.0502 0.0058 0.0704 2.456858e+08 1.875310e+08 1.215036e+08 0.0470 0.0364 7.735149e+08 6.083852e+08 1.381900e+09 1.169487e+07 5.793193e+08 2.134397e+09 3.085975e+08 1.906587e+08 6.299354e+07 3.764110e+07 6.022904e+08 2.736687e+09 1.000491e+09 0.000000e+00 1.639537e+09 6.223741e+08 6.223741e+08 3.141104e+07 0.000000e+00 0.0 6.615567e+08 2.301093e+09 -1.729254e+06 4.285438e+06 4.123545e+08 6.713788e+08 NaN NaN 6.390460e+08 5.815483e+07 3.640523e+07 5.267456e+08 -2.637046e+08 -1.922552e+08 -2.126447e+08 -6.866936e+08 6.193268e+08 1.767840e+06 0.000000e+00 6.219136e+08 -1.564869e+07 4.463169e+08 2.630410e+08 0.0956 NaN NaN 0.0443 81.5526 1.3589 NaN 0.678615 2.631793 0.095727 -0.000000 7.136538 0.866869 0.050221 0.935218 0.774438 0.248657 1.0 0.036045 0.036373 0.225562 0.0403 0.3294 0.0000 0.774438 0.935218 0.050221 NaN NaN 12.100383 1.364477 0.2102 0.849993 0.471789 -75.3670 1.1431 NaN 97.7945 NaN 0.2274 1.5093 0.601485 0.601485 15.4363 0.846349 6.636734 35.6714 22.976 52.3827 0.000 0.141062 0.499370 0.846349 NaN 1.997484 1.997484 -0.000000 252.8778 9.1980 35.6714 22.976 52.3827 0.7278 4.4939 27.9248 42.1474 6.253200e+09 NaN 81.5526 0.0443 NaN 1.3589 NaN NaN NaN NaN NaN NaN 0.0123 NaN 1.5093 0.2274 NaN 0.2102 15.4363 3.8782 0.0000 0.000 0.0922 0.029 0.0697 0.5006 0.0706 -4.5345 0.0097 NaN 0.0000 0.0403 0.0534 3.065163e+09 2.546029e+09 -1.666966e+08 7.553510e+08 NaN NaN NaN 1.1431 97.7945 -75.3670 NaN NaN NaN 0.3294 -17.8582 1.2489 1.6142 1.9463 1.5649 1.4625 1.4646 0.0184 0.0195 0.0000 0.2134 -0.3626 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.6484 1.7313 Consumer Defensive -25.512193 0
2 KR 9.837500e+10 0.0182 7.813800e+10 2.023700e+10 0.000000e+00 1.519600e+10 1.751200e+10 2.725000e+09 4.430000e+08 2.270000e+09 7.510000e+08 1.200000e+07 0.000000e+00 1.519000e+09 0.0 1.519000e+09 1.4700 1.450 1.033053e+09 1.028000e+09 0.307 0.2057 0.0450 0.0276 0.0150 0.0126 4.416000e+09 2.713000e+09 1.531000e+09 0.0231 0.0154 4.010000e+08 0.000000e+00 4.010000e+08 1.116000e+09 5.651000e+09 8.830000e+09 1.689300e+10 2.837000e+09 0.000000e+00 0.000000e+00 2.045100e+10 2.928100e+10 4.881000e+09 1.657000e+09 1.070500e+10 9.653000e+09 1.131000e+10 0.000000e+00 1.629000e+09 0.0 1.318100e+10 2.388600e+10 -4.640000e+08 1.098100e+10 5.384000e+09 0.000000e+00 1.090900e+10 1.662000e+09 4.167000e+09 1.703000e+09 1.070000e+08 3.573000e+09 -2.330000e+09 -2.344000e+09 0.000000e+00 -4.771000e+09 2.093000e+09 -4.130000e+08 -3.190000e+08 1.361000e+09 0.000000e+00 1.630000e+08 1.243000e+09 -0.5969 3.3579 7.0982 0.1858 12.0340 14.6302 5.1147 0.000000 0.000000 0.000000 NaN 2.276721 0.000000 0.027578 0.836712 0.669163 0.205713 1.0 0.027700 0.015441 0.330837 0.1011 0.2821 0.0859 0.669163 0.836712 0.027578 4.9593 18.2227 5.823418 3.359687 0.8250 0.141709 0.037459 -26.3971 4.1407 NaN 18.1099 NaN 0.3863 2.1007 0.641950 0.677489 6.1242 0.315915 5.438522 3.4757 0.837 0.3901 0.209 0.036320 0.347887 0.315915 2.156307 1.533476 1.348811 0.210007 95.6955 1.4776 3.4757 0.837 0.3901 5.2370 25.7240 5.2374 11.0019 1.827470e+10 2.918370e+10 12.0340 0.1858 5.1147 14.6302 3.3579 7.0982 0.2967 6.6086 8.1678 23.4784 0.0831 0.0680 2.1007 0.3863 2.4703 0.8250 6.1242 2.3522 0.0174 0.209 0.1545 0.000 0.0969 0.6521 0.0237 -1.3682 0.0011 13.2000 0.0859 0.1011 -0.8239 -1.875000e+09 2.644400e+10 -1.505600e+10 2.664800e+10 1.083500e+09 4.682500e+09 5.398500e+09 4.1407 18.1099 -26.3971 90.7937 4.9593 18.2227 0.2821 -2.2665 0.0173 -0.0145 -0.0141 0.0147 0.0576 0.0469 -0.0356 -0.0317 0.2429 0.2095 0.3935 0.1027 0.1038 0.1399 0.0889 0.0937 0.0946 0.2272 0.0906 0.1892 0.0676 0.0558 0.0789 0.0000 0.1215 0.1633 0.0618 0.0981 0.1886 0.3268 0.2738 0.0000 0.0234 Consumer Defensive 33.118297 1
3 RAD 2.552641e+10 0.0053 1.820268e+10 7.323734e+09 0.000000e+00 6.561162e+09 6.586482e+09 7.372520e+08 4.245910e+08 2.502180e+08 8.040000e+05 0.000000e+00 0.000000e+00 2.494140e+08 33998000.0 2.154160e+08 4.6000 4.600 4.832628e+07 4.832628e+07 0.000 0.2869 0.0420 0.0264 0.0080 0.0144 1.078550e+09 6.748090e+08 2.494140e+08 0.0098 0.0098 1.464060e+08 0.000000e+00 1.464060e+08 9.490620e+08 2.993948e+09 4.285125e+09 1.957329e+09 4.312270e+08 0.000000e+00 0.000000e+00 2.659746e+09 6.944871e+09 1.292419e+09 4.917400e+07 2.507452e+09 5.707969e+09 5.757143e+09 0.000000e+00 0.000000e+00 0.0 6.551121e+09 9.058573e+09 -3.733400e+07 -7.515848e+09 -2.113702e+09 0.000000e+00 5.610737e+09 1.957090e+08 1.165859e+09 4.037410e+08 -1.047100e+07 7.020460e+08 -3.338700e+08 0.000000e+00 4.115500e+07 -3.649240e+08 -2.954350e+08 3.321700e+07 0.000000e+00 -3.201680e+08 0.000000e+00 1.695400e+07 3.681760e+08 -0.8822 0.0000 0.0000 0.2491 28.6087 17.2736 9.0589 0.000000 0.000000 0.000000 NaN 5.428005 NaN 0.026436 0.370798 0.996787 0.286908 1.0 0.028882 0.009771 0.003213 0.0668 -0.1180 0.1062 0.996787 0.370798 0.026436 4.9289 8.3030 13.041452 3.675578 1.7090 0.436885 0.058388 -60.0346 13.5706 NaN 18.4802 NaN 0.8290 -2.7237 1.588076 1.580139 1.5893 0.121943 NaN 14.5272 0.557 3.0295 0.000 0.027503 0.524433 0.121943 14.276772 2.102753 2.102753 -0.000000 528.2098 5.1610 14.5272 0.557 3.0295 -43.7380 134.7850 -43.7381 119.1307 6.359738e+09 1.197048e+10 28.6087 0.2491 9.0589 17.2736 0.0000 0.0000 0.4689 11.0987 17.0508 32.5129 0.0350 0.0579 -2.7237 0.8290 5.2021 1.7090 1.5893 3.2590 0.0000 0.000 0.2570 0.000 0.0621 0.4756 0.0131 -0.8269 -0.0004 0.0000 0.1062 0.0668 -0.7506 1.777673e+09 6.513644e+09 -4.773448e+09 9.616929e+09 9.392690e+08 1.338532e+09 3.074345e+09 13.5706 18.4802 -60.0346 27.1769 4.9289 8.3030 -0.1180 -6.9087 0.0007 0.2904 0.1113 1.0043 0.9167 0.9167 0.0694 NaN 0.0000 -0.1434 -0.2691 -0.0195 -0.0290 -0.0230 0.0513 0.1164 0.1777 0.0480 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0211 -0.0510 -0.0189 0.1963 -0.0458 0.0000 -0.0060 Consumer Defensive 2.752291 1
4 GIS 1.790960e+10 0.0076 1.153980e+10 6.369800e+09 0.000000e+00 3.474300e+09 3.412400e+09 2.957400e+09 3.024000e+08 2.707700e+09 8.833000e+08 3.690000e+07 0.000000e+00 1.824400e+09 0.0 1.824400e+09 2.9000 2.830 6.144922e+08 6.144922e+08 1.550 0.3557 0.2010 0.1681 0.1020 0.1052 3.595500e+09 3.010100e+09 1.861300e+09 0.1512 0.1019 8.673000e+08 0.000000e+00 8.673000e+08 1.483600e+09 1.559400e+09 4.393500e+09 3.941900e+09 1.366480e+10 0.000000e+00 7.410000e+07 1.875220e+10 2.314570e+10 1.611300e+09 2.362300e+09 5.423500e+09 6.423500e+09 8.785800e+09 0.000000e+00 1.666000e+09 0.0 9.732700e+09 1.515620e+10 -1.340300e+09 1.178720e+10 6.534800e+09 0.000000e+00 7.918500e+09 4.832000e+08 1.449900e+09 5.854000e+08 3.920000e+07 2.541000e+09 -6.569000e+08 6.670000e+07 2.930000e+07 -5.618000e+08 8.011000e+08 -1.637200e+09 -9.833000e+08 -1.824100e+09 -2.920000e+07 1.259000e+08 1.884100e+09 -0.2376 5.1004 0.0000 1.8610 18.7034 17.6902 13.1169 10.952350 15.254287 1.553911 0.035332 9.767966 4.258726 0.168072 0.899538 0.673782 0.355664 1.0 0.165129 0.101867 0.326218 0.6265 0.2792 0.1041 0.673782 0.899538 0.168072 2.8234 11.5363 4.543393 0.773777 0.8100 0.433465 0.159915 -49.3233 30.2360 NaN 32.8385 NaN 0.3796 1.3445 0.495705 0.573463 9.9540 0.289217 3.541914 4.1351 3.624 1.4114 0.534 0.141879 0.741480 0.289217 1.075647 3.868169 1.549201 0.538972 29.1454 2.9690 4.1351 3.624 1.4114 10.6340 15.4290 10.6345 14.2977 3.333006e+10 4.124856e+10 18.7034 1.8610 13.1169 17.6902 5.1004 0.0000 2.3032 11.4723 16.2332 21.8930 0.0535 0.0565 1.3445 0.3796 2.2023 0.8100 9.9540 1.3928 0.0286 0.534 0.1940 0.000 0.5904 0.2585 0.0367 -1.1221 0.0022 26.3414 0.1041 0.6265 -0.3229 -1.030000e+09 9.480900e+09 -1.076270e+10 1.197590e+10 1.465000e+09 1.517250e+09 1.552450e+09 30.2360 32.8385 -49.3233 12.2250 2.8234 11.5363 0.2792 -1.0690 -0.0084 0.0332 0.0370 -0.0166 0.0140 0.0143 -0.0468 NaN 0.1742 -0.1316 -0.1936 0.0704 0.0567 0.0773 0.0782 0.0828 0.1991 0.0776 0.0841 0.0177 0.0427 0.0623 0.0217 0.1092 0.1250 0.1144 0.0257 0.0090 0.0215 0.0274 0.1025 0.0000 -0.0220 Consumer Defensive 12.897715 1

As can be seen from the data shown above the dataset has 3808 rows (3808 companies tracked) and we have 225 columns of factors with the last two columns being the price YOY change and the Class (which represents wether the stock went up or down with 1 meaning up and 0 meaning down). Both these factors are predictive factors and we can use them to find out which factors have the greatest correlation with PRICE VAR or CLASS. Almost all of the columns show float datatypes except the name (object), the sector (object) and the Class (int). Therefore to preprocess this data we must first make some adjustments to columns: Unnamed and Sector.

In [3]:
#remove data with NAN 
#change unnamed
#change sector
df1.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
data_info = pd.set_option('display.max_columns', None)
data_info=pd.DataFrame(df1.dtypes).T.rename(index={0:'column type'})
data_info=data_info.append(pd.DataFrame(df1.isnull().sum()).T.rename(index={0:'null values (nb)'}))
data_info=data_info.append(pd.DataFrame(df1.isnull().sum()/df1.shape[0]*100).T.
                         rename(index={0:'null values (%)'}))
display(data_info)
Stock Name Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax Income Tax Expense Net Income - Non-Controlling int Net Income - Discontinued ops Net Income Preferred Dividends Net Income Com EPS EPS Diluted Weighted Average Shs Out Weighted Average Shs Out (Dil) Dividend per Share Gross Margin EBITDA Margin EBIT Margin Profit Margin Free Cash Flow margin EBITDA EBIT Consolidated Income Earnings Before Tax Margin Net Profit Margin Cash and cash equivalents Short-term investments Cash and short-term investments Receivables Inventories Total current assets Property, Plant & Equipment Net Goodwill and Intangible Assets Long-term investments Tax assets Total non-current assets Total assets Payables Short-term debt Total current liabilities Long-term debt Total debt Deferred revenue Tax Liabilities Deposit Liabilities Total non-current liabilities Total liabilities Other comprehensive income Retained earnings (deficit) Total shareholders equity Investments Net Debt Other Assets Other Liabilities Depreciation & Amortization Stock-based compensation Operating Cash Flow Capital Expenditure Acquisitions and disposals Investment purchases and sales Investing Cash flow Issuance (repayment) of debt Issuance (buybacks) of shares Dividend payments Financing Cash Flow Effect of forex changes on cash Net cash flow / Change in cash Free Cash Flow Net Cash/Marketcap priceBookValueRatio priceToBookRatio priceToSalesRatio priceEarningsRatio priceToFreeCashFlowsRatio priceToOperatingCashFlowsRatio priceCashFlowRatio priceEarningsToGrowthRatio priceSalesRatio dividendYield enterpriseValueMultiple priceFairValue ebitperRevenue ebtperEBIT niperEBT grossProfitMargin operatingProfitMargin pretaxProfitMargin netProfitMargin effectiveTaxRate returnOnAssets returnOnEquity returnOnCapitalEmployed nIperEBT eBTperEBIT eBITperRevenue payablesTurnover inventoryTurnover fixedAssetTurnover assetTurnover currentRatio quickRatio cashRatio daysOfSalesOutstanding daysOfInventoryOutstanding operatingCycle daysOfPayablesOutstanding cashConversionCycle debtRatio debtEquityRatio longtermDebtToCapitalization totalDebtToCapitalization interestCoverage cashFlowToDebtRatio companyEquityMultiplier operatingCashFlowPerShare freeCashFlowPerShare cashPerShare payoutRatio operatingCashFlowSalesRatio freeCashFlowOperatingCashFlowRatio cashFlowCoverageRatios shortTermCoverageRatios capitalExpenditureCoverageRatios dividendpaidAndCapexCoverageRatios dividendPayoutRatio Revenue per Share Net Income per Share Operating Cash Flow per Share Free Cash Flow per Share Cash per Share Book Value per Share Tangible Book Value per Share Shareholders Equity per Share Interest Debt per Share Market Cap Enterprise Value PE ratio Price to Sales Ratio POCF ratio PFCF ratio PB ratio PTB ratio EV to Sales Enterprise Value over EBITDA EV to Operating cash flow EV to Free cash flow Earnings Yield Free Cash Flow Yield Debt to Equity Debt to Assets Net Debt to EBITDA Current ratio Interest Coverage Income Quality Dividend Yield Payout Ratio SG&A to Revenue R&D to Revenue Intangibles to Total Assets Capex to Operating Cash Flow Capex to Revenue Capex to Depreciation Stock-based compensation to Revenue Graham Number ROIC Return on Tangible Assets Graham Net-Net Working Capital Tangible Asset Value Net Current Asset Value Invested Capital Average Receivables Average Payables Average Inventory Days Sales Outstanding Days Payables Outstanding Days of Inventory on Hand Receivables Turnover Payables Turnover Inventory Turnover ROE Capex per Share Gross Profit Growth EBIT Growth Operating Income Growth Net Income Growth EPS Growth EPS Diluted Growth Weighted Average Shares Growth Weighted Average Shares Diluted Growth Dividends per Share Growth Operating Cash Flow growth Free Cash Flow growth 10Y Revenue Growth (per Share) 5Y Revenue Growth (per Share) 3Y Revenue Growth (per Share) 10Y Operating CF Growth (per Share) 5Y Operating CF Growth (per Share) 3Y Operating CF Growth (per Share) 10Y Net Income Growth (per Share) 5Y Net Income Growth (per Share) 3Y Net Income Growth (per Share) 10Y Shareholders Equity Growth (per Share) 5Y Shareholders Equity Growth (per Share) 3Y Shareholders Equity Growth (per Share) 10Y Dividend per Share Growth (per Share) 5Y Dividend per Share Growth (per Share) 3Y Dividend per Share Growth (per Share) Receivables growth Inventory Growth Asset Growth Book Value per Share Growth Debt Growth R&D Expense Growth SG&A Expenses Growth Sector 2015 PRICE VAR [%] Class
column type object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object float64 int64
null values (nb) 0 44 236 74 52 136 59 63 55 63 80 66 149 149 23 149 15 72 84 236 110 149 47 202 50 201 159 64 4 23 47 57 58 933 930 59 115 871 88 35 912 133 908 33 90 898 879 854 78 136 131 183 911 39 73 164 33 124 1277 1274 830 43 68 40 52 124 22 25 61 108 142 11 83 15 47 349 659 659 181 181 181 338 1053 1296 396 688 944 428 346 1118 1243 346 346 346 346 1243 995 192 987 1243 1118 346 781 306 345 173 967 972 972 182 181 3807 181 3807 185 185 282 269 172 1596 376 230 229 232 207 346 1016 1596 1956 421 293 1296 227 227 218 217 232 231 231 232 231 298 654 154 154 315 154 644 644 654 654 654 654 154 315 185 185 1424 988 144 156 154 186 156 162 173 144 156 158 165 644 987 995 315 987 173 958 185 288 288 288 182 182 183 306 781 306 193 218 231 231 231 243 282 293 336 1049 237 232 238 1547 1037 733 1536 1031 726 1547 1037 733 1585 1064 796 1465 952 600 290 290 290 369 302 247 243 0 0 0
null values (%) 0 1.15546 6.19748 1.94328 1.36555 3.57143 1.54937 1.65441 1.44433 1.65441 2.10084 1.73319 3.91282 3.91282 0.603992 3.91282 0.393908 1.89076 2.20588 6.19748 2.88866 3.91282 1.23424 5.30462 1.31303 5.27836 4.17542 1.68067 0.105042 0.603992 1.23424 1.49685 1.52311 24.5011 24.4223 1.54937 3.01996 22.8729 2.31092 0.919118 23.9496 3.49265 23.8445 0.866597 2.36345 23.5819 23.083 22.4265 2.04832 3.57143 3.44013 4.80567 23.9233 1.02416 1.91702 4.30672 0.866597 3.2563 33.5347 33.4559 21.7962 1.1292 1.78571 1.05042 1.36555 3.2563 0.577731 0.656513 1.60189 2.83613 3.72899 0.288866 2.17962 0.393908 1.23424 9.16492 17.3057 17.3057 4.75315 4.75315 4.75315 8.87605 27.6523 34.0336 10.3992 18.0672 24.7899 11.2395 9.08613 29.3592 32.6418 9.08613 9.08613 9.08613 9.08613 32.6418 26.1292 5.04202 25.9191 32.6418 29.3592 9.08613 20.5095 8.03571 9.05987 4.54307 25.3939 25.5252 25.5252 4.77941 4.75315 99.9737 4.75315 99.9737 4.85819 4.85819 7.40546 7.06408 4.51681 41.9118 9.87395 6.03992 6.01366 6.09244 5.43592 9.08613 26.6807 41.9118 51.3655 11.0557 7.69433 34.0336 5.96113 5.96113 5.72479 5.69853 6.09244 6.06618 6.06618 6.09244 6.06618 7.82563 17.1744 4.04412 4.04412 8.27206 4.04412 16.9118 16.9118 17.1744 17.1744 17.1744 17.1744 4.04412 8.27206 4.85819 4.85819 37.395 25.9454 3.78151 4.09664 4.04412 4.88445 4.09664 4.2542 4.54307 3.78151 4.09664 4.14916 4.33298 16.9118 25.9191 26.1292 8.27206 25.9191 4.54307 25.1576 4.85819 7.56303 7.56303 7.56303 4.77941 4.77941 4.80567 8.03571 20.5095 8.03571 5.06828 5.72479 6.06618 6.06618 6.06618 6.3813 7.40546 7.69433 8.82353 27.5473 6.22374 6.09244 6.25 40.625 27.2321 19.2489 40.3361 27.0746 19.0651 40.625 27.2321 19.2489 41.6229 27.9412 20.9034 38.4716 25 15.7563 7.61555 7.61555 7.61555 9.69013 7.93067 6.48634 6.3813 0 0 0
In [4]:
#explore sector, pricevar and class
df1.describe()
Out[4]:
Revenue Revenue Growth Cost of Revenue Gross Profit R&D Expenses SG&A Expense Operating Expenses Operating Income Interest Expense Earnings before Tax Income Tax Expense Net Income - Non-Controlling int Net Income - Discontinued ops Net Income Preferred Dividends Net Income Com EPS EPS Diluted Weighted Average Shs Out Weighted Average Shs Out (Dil) Dividend per Share Gross Margin EBITDA Margin EBIT Margin Profit Margin Free Cash Flow margin EBITDA EBIT Consolidated Income Earnings Before Tax Margin Net Profit Margin Cash and cash equivalents Short-term investments Cash and short-term investments Receivables Inventories Total current assets Property, Plant & Equipment Net Goodwill and Intangible Assets Long-term investments Tax assets Total non-current assets Total assets Payables Short-term debt Total current liabilities Long-term debt Total debt Deferred revenue Tax Liabilities Deposit Liabilities Total non-current liabilities Total liabilities Other comprehensive income Retained earnings (deficit) Total shareholders equity Investments Net Debt Other Assets Other Liabilities Depreciation & Amortization Stock-based compensation Operating Cash Flow Capital Expenditure Acquisitions and disposals Investment purchases and sales Investing Cash flow Issuance (repayment) of debt Issuance (buybacks) of shares Dividend payments Financing Cash Flow Effect of forex changes on cash Net cash flow / Change in cash Free Cash Flow Net Cash/Marketcap priceBookValueRatio priceToBookRatio priceToSalesRatio priceEarningsRatio priceToFreeCashFlowsRatio priceToOperatingCashFlowsRatio priceCashFlowRatio priceEarningsToGrowthRatio priceSalesRatio dividendYield enterpriseValueMultiple priceFairValue ebitperRevenue ebtperEBIT niperEBT grossProfitMargin operatingProfitMargin pretaxProfitMargin netProfitMargin effectiveTaxRate returnOnAssets returnOnEquity returnOnCapitalEmployed nIperEBT eBTperEBIT eBITperRevenue payablesTurnover inventoryTurnover fixedAssetTurnover assetTurnover currentRatio quickRatio cashRatio daysOfSalesOutstanding daysOfInventoryOutstanding operatingCycle daysOfPayablesOutstanding cashConversionCycle debtRatio debtEquityRatio longtermDebtToCapitalization totalDebtToCapitalization interestCoverage cashFlowToDebtRatio companyEquityMultiplier operatingCashFlowPerShare freeCashFlowPerShare cashPerShare payoutRatio operatingCashFlowSalesRatio freeCashFlowOperatingCashFlowRatio cashFlowCoverageRatios shortTermCoverageRatios capitalExpenditureCoverageRatios dividendpaidAndCapexCoverageRatios dividendPayoutRatio Revenue per Share Net Income per Share Operating Cash Flow per Share Free Cash Flow per Share Cash per Share Book Value per Share Tangible Book Value per Share Shareholders Equity per Share Interest Debt per Share Market Cap Enterprise Value PE ratio Price to Sales Ratio POCF ratio PFCF ratio PB ratio PTB ratio EV to Sales Enterprise Value over EBITDA EV to Operating cash flow EV to Free cash flow Earnings Yield Free Cash Flow Yield Debt to Equity Debt to Assets Net Debt to EBITDA Current ratio Interest Coverage Income Quality Dividend Yield Payout Ratio SG&A to Revenue R&D to Revenue Intangibles to Total Assets Capex to Operating Cash Flow Capex to Revenue Capex to Depreciation Stock-based compensation to Revenue Graham Number ROIC Return on Tangible Assets Graham Net-Net Working Capital Tangible Asset Value Net Current Asset Value Invested Capital Average Receivables Average Payables Average Inventory Days Sales Outstanding Days Payables Outstanding Days of Inventory on Hand Receivables Turnover Payables Turnover Inventory Turnover ROE Capex per Share Gross Profit Growth EBIT Growth Operating Income Growth Net Income Growth EPS Growth EPS Diluted Growth Weighted Average Shares Growth Weighted Average Shares Diluted Growth Dividends per Share Growth Operating Cash Flow growth Free Cash Flow growth 10Y Revenue Growth (per Share) 5Y Revenue Growth (per Share) 3Y Revenue Growth (per Share) 10Y Operating CF Growth (per Share) 5Y Operating CF Growth (per Share) 3Y Operating CF Growth (per Share) 10Y Net Income Growth (per Share) 5Y Net Income Growth (per Share) 3Y Net Income Growth (per Share) 10Y Shareholders Equity Growth (per Share) 5Y Shareholders Equity Growth (per Share) 3Y Shareholders Equity Growth (per Share) 10Y Dividend per Share Growth (per Share) 5Y Dividend per Share Growth (per Share) 3Y Dividend per Share Growth (per Share) Receivables growth Inventory Growth Asset Growth Book Value per Share Growth Debt Growth R&D Expense Growth SG&A Expenses Growth 2015 PRICE VAR [%] Class
count 3.764000e+03 3572.000000 3.734000e+03 3.756000e+03 3.672000e+03 3.749000e+03 3.745000e+03 3.753000e+03 3.745000e+03 3.728000e+03 3.742000e+03 3.659000e+03 3.659000e+03 3.785000e+03 3.659000e+03 3.793000e+03 3.736000e+03 3.724000e+03 3.572000e+03 3.698000e+03 3659.000000 3761.000000 3606.000000 3758.000000 3607.000000 3649.000000 3.744000e+03 3.804000e+03 3.785000e+03 3761.000000 3751.000000 3.750000e+03 2.875000e+03 2.878000e+03 3.749000e+03 3.693000e+03 2.937000e+03 3.720000e+03 3.773000e+03 2.896000e+03 3.675000e+03 2.900000e+03 3.775000e+03 3.718000e+03 2.910000e+03 2.929000e+03 2.954000e+03 3.730000e+03 3.672000e+03 3.677000e+03 3.625000e+03 2.897000e+03 3.769000e+03 3.735000e+03 3.644000e+03 3.775000e+03 3.684000e+03 2.531000e+03 2.534000e+03 2.978000e+03 3.765000e+03 3.740000e+03 3.768000e+03 3.756000e+03 3.684000e+03 3.786000e+03 3.783000e+03 3.747000e+03 3.700000e+03 3.666000e+03 3.797000e+03 3.725000e+03 3.793000e+03 3.761000e+03 3459.000000 3.149000e+03 3.149000e+03 3627.000000 3627.000000 3627.000000 3470.000000 2755.000000 2512.000000 3412.000000 3120.000000 2864.000000 3380.000000 3462.000000 2690.000000 2565.000000 3462.000000 3462.0 3462.000000 3462.000000 2565.000000 2813.000000 3.616000e+03 2821.000000 2565.000000 2690.000000 3462.000000 3027.000000 3502.000000 3463.000000 3635.000000 2841.000000 2836.000000 2836.000000 3626.000000 3627.000000 1.0000 3627.000000 1.0000 3623.000000 3623.000000 3526.000000 3539.000000 3636.000000 2212.000000 3.432000e+03 3.578000e+03 3.579000e+03 3.576000e+03 3601.000000 3462.000000 2792.000000 2212.000000 1852.000000 3387.000000 3515.000000 2512.000000 3.581000e+03 3.581000e+03 3.590000e+03 3.591000e+03 3.576000e+03 3.577000e+03 3.577000e+03 3.576000e+03 3.577000e+03 3.510000e+03 3.154000e+03 3654.000000 3654.000000 3493.000000 3654.000000 3.164000e+03 3.164000e+03 3154.000000 3154.000000 3154.000000 3154.000000 3654.000000 3493.000000 3623.000000 3623.000000 2384.000000 2820.000000 3664.000000 3652.000000 3654.000000 3622.000000 3652.000000 3646.000000 3635.000000 3664.000000 3652.000000 3650.000000 3643.000000 3164.000000 2821.000000 2813.000000 3493.000000 2.821000e+03 3.635000e+03 2.850000e+03 3.623000e+03 3.520000e+03 3.520000e+03 3.520000e+03 3626.000000 3626.000000 3625.000000 3502.000000 3027.000000 3502.000000 3.615000e+03 3.590000e+03 3577.000000 3577.000000 3577.000000 3565.000000 3526.000000 3515.000000 3472.000000 2759.000000 3571.000000 3576.000000 3570.000000 2261.000000 2771.000000 3075.000000 2272.000000 2777.000000 3082.000000 2261.000000 2771.000000 3075.000000 2223.000000 2744.000000 3012.000000 2343.000000 2856.000000 3208.000000 3518.000000 3518.000000 3518.000000 3439.000000 3506.000000 3561.000000 3565.000000 3.808000e+03 3808.000000
mean 5.879050e+09 12.954244 3.700973e+09 2.188214e+09 9.401830e+07 9.307406e+08 1.438144e+09 6.748248e+08 1.002018e+08 5.725866e+08 1.762750e+08 1.557750e+07 -3.453915e+06 4.894335e+08 5.343177e+06 4.839166e+08 -2.607365e+04 -2.653476e+04 2.865321e+08 2.904988e+08 3.413530 0.469036 -4.612215 -3.736484 -5.102894 -3.229727 1.041692e+09 7.782098e+08 5.051208e+08 -4.039696 -3.784724 1.615347e+09 2.612956e+08 8.496238e+08 9.615113e+08 4.866494e+08 3.111922e+09 2.941115e+09 1.664754e+09 4.245599e+08 1.849196e+08 5.816330e+09 2.642981e+10 1.024153e+09 4.672686e+08 2.071357e+09 3.857223e+09 5.994973e+09 1.228135e+08 3.984022e+08 5.179785e+09 3.449642e+09 2.208640e+10 -9.030215e+07 2.206764e+09 4.124379e+09 1.001118e+10 3.440958e+09 5.282998e+08 7.502376e+08 3.290080e+08 2.479993e+07 1.191508e+09 -4.886058e+08 -8.583320e+07 -2.183940e+08 -8.250133e+08 6.200769e+07 -8.792079e+07 -1.763699e+08 4.896244e+07 -4.174801e+06 9.939070e+07 4.453619e+08 -2.314196 2.790536e+04 2.790766e+04 45.129427 28.882535 32.866710 38.854715 22.758312 33.880458 38.046475 0.247325 37.229890 5.203304 -4.149450 0.679354 0.743828 0.428221 1.0 -4.982938 -4.333599 0.258234 0.073683 1.941046e+03 -0.430209 0.743828 0.679354 -4.149450 5.164910 27.891355 149.363526 0.728869 6.840022 6.304907 1.823634 -56.050637 87.649135 134.5455 232.431425 93.0392 0.246147 1.247731 0.268564 0.289505 -20.656176 391.603619 8.358932e+02 1.377534e+04 7.113342e+02 4.169637e+04 0.361147 13.367974 -0.345192 391.603619 566.729529 -2.855130 -6.278449 0.591802 1.342362e+05 5.323582e+03 1.372931e+04 7.089520e+02 4.169637e+04 2.271467e+05 3.941552e+05 1.484680e+05 2.090793e+05 3.503639e+10 7.994689e+09 29.168906 44.972676 38.660934 32.749071 2.777308e+04 2.777536e+04 42.642911 27.151004 27.509114 14.089356 -6.636770 -7.574803 1.247731 0.246147 -3.957762 6.872239 -20.130356 2.436330 0.019972 0.359600 2.003619 2.086115 0.142232 1.062227 0.224709 -0.452700 0.700113 396.354959 -0.430209 0.073683 -2.955065 5.915030e+09 1.756294e+10 -2.266889e+09 1.934120e+10 9.426405e+08 8.949436e+08 3.726066e+08 87.654608 232.491967 -56.066099 48.114703 5.164910 27.891355 1.941584e+03 -1.516578e+04 95.342800 3.631871 3.855939 -0.426786 0.164305 0.140394 1.824807 2.192586 0.160613 -1.341505 -23.892636 0.031811 0.043491 0.046010 0.044151 0.057456 0.067409 0.038966 0.069918 0.088356 0.045218 0.057705 0.065999 -0.019197 0.006783 0.036206 4.445793 0.164692 2.279999 0.813931 1.361780 0.596529 0.382284 9.751560e+02 0.429097
std 3.901741e+10 705.605473 3.040688e+10 1.159028e+10 6.408912e+08 4.820928e+09 7.305171e+09 3.786291e+09 6.991091e+08 2.793740e+09 1.011342e+09 1.620427e+08 2.360399e+08 4.396692e+09 6.760107e+07 4.392279e+09 1.672636e+06 1.673673e+06 2.304823e+09 2.403028e+09 167.005657 1.061432 82.473116 63.214487 90.142740 54.458512 6.057032e+09 5.329003e+09 4.415837e+09 64.600341 63.300580 1.494661e+10 2.125871e+09 3.855421e+09 6.066000e+09 6.848526e+09 3.033776e+10 2.119840e+10 9.983609e+09 3.811519e+09 1.564932e+09 2.702205e+10 3.570764e+11 9.989482e+09 5.767364e+09 1.901577e+10 8.263929e+10 7.958489e+10 9.394575e+08 2.126550e+09 6.039611e+10 2.168603e+10 3.254540e+11 1.335849e+09 1.419948e+10 2.972280e+10 9.501288e+10 8.948312e+10 1.433403e+10 5.502288e+09 2.401693e+09 1.535355e+08 1.683247e+10 3.741773e+09 7.662339e+08 5.212931e+09 6.572521e+09 3.697576e+09 2.686317e+09 7.754103e+08 9.871214e+09 9.388622e+08 4.426819e+09 4.060265e+09 91.010023 1.565693e+06 1.565693e+06 634.798219 169.220834 277.566797 1072.187989 122.713659 193.948036 660.011982 7.912518 883.203941 48.955775 66.153186 3.659063 3.232350 3.961384 0.0 85.478456 66.736850 3.232208 9.590453 1.167203e+05 17.743103 3.232350 3.659063 66.153186 20.830615 271.747359 7590.193566 0.936109 122.604424 122.705345 6.159653 197.928846 816.891148 NaN 3655.903472 NaN 0.527067 36.500177 1.195338 1.048213 1727.077299 18353.502069 4.867024e+04 3.995943e+05 1.921795e+05 1.199459e+06 4.741215 984.117469 12.929755 18353.502069 20101.549721 1163.836896 1137.993942 3.142069 4.129140e+06 1.825286e+05 3.989265e+05 1.918581e+05 1.199459e+06 1.072112e+07 1.892736e+07 3.411571e+06 5.622659e+06 1.623730e+12 2.808666e+10 169.445219 632.513736 1068.654944 276.548915 1.561977e+06 1.561977e+06 628.930733 630.194602 627.245999 955.723634 354.498211 438.287961 36.500177 0.527067 202.169311 123.059397 1720.612738 55.011342 0.118780 4.727528 34.124975 39.510150 0.197968 11.326515 1.449700 658.360878 21.442018 12755.983156 17.743103 9.590453 98.712298 2.440567e+11 1.213901e+11 9.859086e+09 1.376875e+11 6.025057e+09 6.990105e+09 1.861781e+09 817.003748 3656.405881 197.953962 1833.381818 20.830615 271.747359 1.167364e+05 3.855947e+05 5630.833201 228.011155 227.116093 29.049961 5.920824 5.934814 67.636280 73.075979 1.799747 74.436352 1398.408015 0.158958 0.205229 0.420155 0.117730 0.224307 0.292005 0.115010 0.208890 0.470741 0.127331 0.243623 0.343162 0.229890 0.229982 0.259765 161.135098 2.195115 92.428920 25.630214 18.734301 25.894049 4.696975 4.098461e+04 0.495012
min -6.276160e+08 -1.773200 -5.455740e+08 -1.105000e+09 -1.500000e+05 0.000000e+00 -1.088448e+09 -6.786000e+09 -2.250000e+08 -8.878000e+09 -2.081114e+09 -1.587227e+09 -7.015198e+09 -8.360000e+09 -1.262800e+07 -8.360000e+09 -1.018709e+08 -1.018709e+08 0.000000e+00 0.000000e+00 0.000000 -53.176500 -3053.200000 -3053.200000 -3216.080000 -2831.181800 -6.629500e+09 -8.465000e+09 -8.311600e+09 -3053.200000 -3053.200000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -1.932850e+10 -1.017445e+09 -2.030685e+10 -6.440654e+09 -7.458100e+09 0.000000e+00 -1.770249e+09 0.000000e+00 -8.970561e+09 -2.736842e+09 -2.787500e+10 -8.557072e+10 -1.262900e+10 -7.697817e+06 -6.306400e+10 -1.345100e+10 -4.880630e+10 -8.327900e+07 -1.370000e+08 -3.175545e+10 -1.455850e+11 -1.932000e+10 -1.678170e+11 -1.656360e+11 -1.963530e+11 -4.427000e+10 -1.156800e+10 -1.385010e+11 -4.300000e+10 -5.672351e+10 -1.042394e+11 -4943.552900 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 -4967.751300 0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 -3053.200000 -150.163636 -84.906977 -223.482759 1.0 -3399.545455 -3053.200000 -98.750000 -278.863200 -7.101360e+01 -616.644600 -84.906977 -150.163636 -3053.200000 -41.095800 0.000000 0.000000 -0.008685 -1.439300 0.000000 0.000000 -8547.090800 -21650.054300 134.5455 -3741.250000 93.0392 -0.382600 -263.458300 -1.228723 -1.765524 -75776.000000 -1801.548711 6.086275e-01 -4.875219e+06 -3.619346e+06 0.000000e+00 -86.000000 -2825.636364 -595.823529 -1801.548711 -1556.100000 -16221.787529 -16221.787529 -0.000000 -3.167700e+00 -4.687614e+06 -4.875219e+06 -3.619346e+06 0.000000e+00 -2.397800e+05 0.000000e+00 -2.397800e+05 -1.139670e+01 0.000000e+00 -1.251747e+11 0.000000 0.000000 -4967.751300 0.000000 0.000000e+00 0.000000e+00 -271.564000 0.000000 -3356.751300 -18641.285600 -21375.000000 -25896.708100 -263.458300 -0.382600 -9616.000000 -1.439300 -75776.000000 -724.812500 0.000000 -86.000000 -1.293500 -0.731700 0.000000 0.000000 -52.684100 -16522.069800 -1.446200 0.000000 -616.644600 -278.863200 -5474.440600 -4.780213e+12 -2.101300e+09 -1.894410e+11 -1.240000e+09 0.000000e+00 -1.890953e+10 0.000000e+00 -21650.054300 -3741.250000 -8547.090800 -1.085600 -41.095800 0.000000 -7.101360e+01 -1.402752e+07 -48.793100 -1119.284200 -487.340900 -1110.006500 -85.007900 -85.007900 -0.998900 -0.849400 -1.000000 -4028.397700 -83431.428600 -1.000000 -1.000000 -1.000000 -0.761800 -0.738700 -0.965700 -0.774000 -0.726700 -0.968000 -0.530300 -0.722100 -0.846400 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -0.979600 -230.000000 -1.051000 -1.043700 -1.000000 -1.003972e+02 0.000000
25% 5.788880e+07 -0.002350 3.135714e+06 3.092900e+07 0.000000e+00 1.548900e+07 3.106600e+07 -1.308000e+06 0.000000e+00 -3.732902e+06 0.000000e+00 0.000000e+00 0.000000e+00 -3.693000e+06 0.000000e+00 -4.369000e+06 -2.300000e-01 -2.400000e-01 1.554453e+07 1.379943e+07 0.000000 0.248700 0.036024 0.000000 -0.010000 -0.046700 3.646812e+06 -5.915000e+05 -3.728599e+06 0.000000 -0.001700 1.812575e+07 0.000000e+00 1.961225e+07 2.096328e+06 0.000000e+00 6.216000e+07 8.009750e+06 0.000000e+00 0.000000e+00 0.000000e+00 3.524285e+07 1.907368e+08 2.547500e+06 0.000000e+00 1.963000e+07 0.000000e+00 4.339782e+06 0.000000e+00 0.000000e+00 0.000000e+00 8.666000e+06 5.913900e+07 -1.128200e+07 -8.273925e+07 7.353440e+07 0.000000e+00 -4.946685e+07 1.549000e+06 5.764758e+06 1.836000e+06 2.815000e+05 1.281011e+06 -1.370000e+08 -1.200000e+07 -1.307400e+07 -3.257935e+08 -3.354167e+06 -5.642000e+06 -4.788225e+07 -5.000000e+07 -6.330000e+05 -1.721703e+07 -1.141200e+07 -0.374200 1.122400e+00 8.095000e-01 0.807750 0.000000 0.000000 2.096275 4.859322 9.356526 0.559122 -0.000000 6.846566 0.851143 0.018165 0.728538 0.632823 0.280132 1.0 0.012158 -0.006901 0.149406 -0.099300 -2.085000e-02 0.000000 0.632823 0.728538 0.018165 0.853150 0.000000 1.136849 0.123647 1.221000 0.819859 0.212494 -68.622050 10.176250 134.5455 10.526700 93.0392 0.022550 0.004700 0.000000 0.000000 0.000000 0.073443 1.535180e+00 7.842500e-02 -5.538500e-01 7.598500e-01 0.000000 0.026483 0.192590 0.073443 0.466218 0.300805 0.249213 -0.000000 3.805900e+00 -2.330000e-01 7.690000e-02 -5.600000e-01 7.598500e-01 4.001000e+00 7.861000e+00 4.083250e+00 2.708000e-01 1.993537e+08 2.255034e+08 0.000000 0.807675 2.060900 0.000000 1.120175e+00 8.035000e-01 0.963650 4.224900 4.093225 -7.007325 -0.021500 -0.042700 0.004700 0.022550 -0.547400 1.223500 0.000000 0.484625 0.000000 0.000000 0.092750 0.000000 0.000000 0.000000 0.012600 -1.672250 0.000800 0.000000 0.000000 -0.099300 -0.799800 1.387900e+07 1.636885e+08 -7.747000e+08 8.476944e+07 2.382503e+06 2.591102e+06 0.000000e+00 10.162075 10.524750 -68.627300 2.965125 0.853150 0.000000 -2.075000e-02 -1.807475e+00 -0.010200 -0.269800 -0.254500 -0.377500 -0.290775 -0.287900 -0.004425 -0.001300 0.000000 -0.328700 -0.640250 -0.006300 -0.010950 -0.023350 0.000000 0.000000 -0.014800 0.000000 0.000000 0.000000 0.000000 -0.006700 -0.008425 0.000000 0.000000 0.000000 -0.045800 0.000000 -0.015400 -0.084550 -0.048200 0.000000 -0.003900 -2.857143e+01 0.000000
50% 4.349010e+08 0.061850 1.414420e+08 1.908760e+08 0.000000e+00 7.381800e+07 1.386610e+08 4.104400e+07 2.563000e+06 2.842700e+07 5.335000e+06 0.000000e+00 0.000000e+00 2.223100e+07 0.000000e+00 2.101400e+07 7.700000e-01 7.500000e-01 4.268193e+07 3.976000e+07 0.000000 0.449900 0.137000 0.093100 0.051000 0.034200 7.404300e+07 4.256200e+07 2.361500e+07 0.068800 0.051200 7.255750e+07 0.000000e+00 8.438850e+07 4.240000e+07 1.758930e+06 2.753700e+08 9.104600e+07 2.829200e+07 0.000000e+00 4.600000e+05 3.484420e+08 1.089263e+09 2.510567e+07 2.798000e+06 1.218020e+08 7.396000e+07 1.729705e+08 0.000000e+00 7.740000e+05 0.000000e+00 1.516530e+08 5.886190e+08 -7.780000e+03 2.981150e+07 3.454000e+08 1.573300e+07 4.745816e+06 1.427000e+07 4.296150e+07 1.800000e+07 2.782500e+06 5.287800e+07 -1.705250e+07 0.000000e+00 0.000000e+00 -5.437000e+07 0.000000e+00 0.000000e+00 0.000000e+00 1.309673e+06 0.000000e+00 4.000000e+05 1.226500e+07 -0.048400 1.964600e+00 1.842500e+00 1.879500 14.684200 8.993800 9.710550 9.824162 16.209362 1.544053 0.002560 10.553933 1.593240 0.100009 0.921088 0.702852 0.475392 1.0 0.095568 0.056318 0.297302 0.055800 7.795000e-02 0.042500 0.702852 0.921088 0.100009 2.706000 3.376100 4.109010 0.501582 2.005000 1.427704 0.562938 -4.873550 45.165800 134.5455 26.460500 93.0392 0.171100 0.389300 0.051922 0.125816 1.320500 0.259684 2.227554e+00 1.625200e+00 5.510000e-01 2.259450e+00 0.000000 0.113410 0.662218 0.259684 2.618434 1.862784 1.299883 0.224380 1.149060e+01 8.300000e-01 1.622650e+00 5.500000e-01 2.259450e+00 1.041900e+01 2.150700e+01 1.100090e+01 5.733600e+00 9.149070e+08 1.148674e+09 14.682450 1.879550 9.706400 8.993750 1.964250e+00 1.836600e+00 2.277600 10.082300 12.616150 13.193850 0.036200 0.029800 0.389300 0.171100 0.956500 2.008500 1.302800 1.257550 0.000000 0.000000 0.222850 0.000000 0.036500 0.194600 0.033550 -0.917450 0.006000 13.585800 0.042500 0.055800 -0.125600 9.741000e+07 8.500508e+08 -6.846379e+06 7.837860e+08 4.316150e+07 2.508725e+07 1.808224e+06 45.133100 26.462950 -4.877000 6.171250 2.706000 3.376100 7.800000e-02 -4.848500e-01 0.072200 0.065500 0.061000 0.059100 0.096300 0.097800 0.008300 0.012600 0.000000 0.007350 -0.088200 0.042900 0.052700 0.043800 0.016000 0.002000 0.000000 0.000000 0.000000 0.000000 0.046200 0.043200 0.042950 0.000000 0.000000 0.000000 0.014250 0.000000 0.062200 0.037100 0.000000 0.000000 0.068800 -5.281842e+00 0.000000
75% 2.393625e+09 0.188875 1.199844e+09 8.922534e+08 9.911000e+06 3.510000e+08 5.885100e+08 2.712980e+08 4.300000e+07 2.193835e+08 5.752550e+07 0.000000e+00 0.000000e+00 1.658440e+08 0.000000e+00 1.630900e+08 2.030000e+00 2.000000e+00 1.171910e+08 1.130076e+08 0.654500 0.782500 0.304000 0.219300 0.138000 0.135900 4.049525e+08 2.817180e+08 1.755630e+08 0.177800 0.140350 2.858825e+08 1.691150e+07 3.300598e+08 2.958000e+08 1.075810e+08 1.198922e+09 8.360218e+08 4.100000e+08 2.258025e+07 2.551300e+07 2.392991e+09 4.924418e+09 1.814340e+08 4.149516e+07 6.301590e+08 1.007628e+09 1.358073e+09 4.876605e+06 6.119400e+07 0.000000e+00 1.455163e+09 3.161640e+09 0.000000e+00 5.604522e+08 1.535963e+09 4.053261e+08 5.141640e+08 8.318125e+07 2.550547e+08 1.127670e+08 1.149100e+07 3.145500e+08 -1.378638e+06 0.000000e+00 0.000000e+00 -3.586548e+06 7.470400e+07 8.425000e+06 0.000000e+00 7.006200e+07 0.000000e+00 3.000000e+07 1.429070e+08 0.113000 3.785000e+00 4.527100e+00 4.198300 23.979100 21.474300 16.406175 16.230272 26.079822 3.319632 0.027486 15.507845 3.255962 0.226624 1.000000 0.850666 0.800911 1.0 0.218920 0.145664 0.366562 0.169700 1.531500e-01 0.101400 0.850666 1.000000 0.226624 5.164250 10.844675 10.389072 1.000714 3.475000 2.758397 1.456602 0.000000 72.019750 134.5455 54.000100 93.0392 0.369450 1.019850 0.387503 0.432501 7.933625 0.606874 3.927946e+00 3.957725e+00 2.163000e+00 5.986325e+00 0.357000 0.264997 0.874133 0.606874 11.023445 5.212105 2.761710 0.535769 3.238150e+01 2.267300e+00 3.954550e+00 2.161600e+00 5.986325e+00 2.011940e+01 5.666600e+01 2.136085e+01 1.763870e+01 3.926217e+09 4.832950e+09 23.996825 4.195425 16.410100 21.476150 3.784175e+00 4.525150e+00 5.710050 15.400700 20.253600 25.989700 0.060675 0.067700 1.019850 0.369450 2.779175 3.476250 7.896275 2.108600 0.021200 0.357000 0.461800 0.036200 0.223750 0.599075 0.092300 -0.449300 0.019600 28.514800 0.101400 0.169700 0.103800 4.185280e+08 3.896543e+09 5.106348e+07 4.159895e+09 2.908209e+08 1.750891e+08 9.941438e+07 72.028075 54.006150 0.000000 10.110675 5.164250 10.844675 1.532000e-01 -9.665000e-02 0.225700 0.406100 0.353700 0.452900 0.519900 0.528400 0.086800 0.122150 0.055050 0.341450 0.435700 0.094100 0.117700 0.115800 0.093050 0.115100 0.129775 0.088600 0.120950 0.125450 0.099100 0.109750 0.111075 0.037950 0.031300 0.044000 0.205475 0.063850 0.217100 0.141900 0.210300 0.004700 0.220500 1.331484e+01 1.000000
max 1.824698e+12 42138.663900 1.537249e+12 4.621600e+11 1.153700e+10 1.856830e+11 3.056050e+11 1.565540e+11 3.152300e+10 8.720500e+10 3.970800e+10 4.917409e+09 8.368000e+09 2.339970e+11 2.741588e+09 2.339970e+11 8.028004e+06 6.624003e+06 1.112919e+11 1.112919e+11 10100.664000 2.659900 83.985400 97.590200 83.817600 45.292000 2.337210e+11 2.339970e+11 2.339970e+11 97.587700 97.396100 5.123080e+11 4.604800e+10 8.570900e+10 1.571052e+11 4.007370e+11 1.221035e+12 9.130160e+11 3.861800e+11 1.301620e+11 6.650100e+10 7.283810e+11 2.035703e+13 4.108950e+11 2.758350e+11 8.912570e+11 4.412529e+12 4.412529e+12 2.515800e+10 4.416800e+10 1.447804e+12 7.283810e+11 1.856536e+13 2.966300e+10 4.083840e+11 1.465725e+12 2.043074e+12 4.412529e+12 7.147500e+11 2.268290e+11 1.149940e+11 4.279000e+09 9.600000e+11 5.389408e+08 1.065800e+10 1.498722e+11 1.445908e+11 3.436000e+10 1.444008e+11 0.000000e+00 5.159800e+11 3.230100e+10 2.087463e+11 9.887000e+10 296.856700 8.786033e+07 8.786033e+07 26397.558500 4972.650400 11043.963600 48212.903200 3933.316576 7363.425186 25730.200000 410.898697 47022.618000 1846.587123 97.590197 74.922705 99.750000 2.659945 1.0 40.720398 97.396123 85.906977 345.631000 7.018763e+06 606.105500 99.750000 74.922705 97.590197 640.898700 14112.567800 446511.857143 19.595573 5852.228000 5850.927773 225.801591 89.207300 28768.636400 134.5455 192421.363600 93.0392 26.238900 2131.512800 53.618462 53.618462 38988.418600 863187.771511 2.851266e+06 1.155862e+07 1.030207e+07 4.823294e+07 180.900000 57805.094665 8.308966 863187.771511 863187.771511 64384.913180 64384.913180 113.453608 2.193725e+08 5.153974e+06 1.155862e+07 1.030207e+07 4.823294e+07 6.163623e+08 1.095290e+09 1.407759e+08 2.456253e+08 9.619200e+13 6.096634e+11 4972.650400 26397.558500 48212.903200 11043.963600 8.786033e+07 8.786033e+07 23710.175200 35296.196000 33522.612500 44509.072600 85.400000 293.817600 2131.512800 26.238900 914.406300 5852.228000 38988.418600 2086.250000 6.242800 180.900000 1540.600000 2156.000000 0.927400 596.823500 18.451400 35298.333300 1267.600000 540174.926600 606.105500 345.631000 283.038900 1.083289e+13 2.515999e+12 5.332900e+10 3.052798e+12 1.576925e+11 2.091685e+11 4.483400e+10 28768.636400 192421.363600 89.207300 108419.771000 640.898700 14112.567800 7.018763e+06 1.255873e+06 336767.800000 13562.153200 13545.338700 475.714300 136.000000 134.000000 3935.709000 3788.339100 81.278200 829.415100 2481.083300 1.747000 3.508500 11.836500 1.732700 4.186200 4.443100 0.984100 2.592300 17.028200 2.407000 6.031600 12.439600 0.574600 2.981100 3.605800 9328.200000 76.625000 5468.426400 1360.125000 729.576600 1542.611000 225.690000 2.418601e+06 1.000000
In [5]:
df1.describe(include=np.object)
Out[5]:
Stock Name Sector
count 3808 3808
unique 3808 11
top TCBK Financial Services
freq 1 660

Linear Regression: Our first objective will be to find out if there is a linear correlation between any of the factors and PRICEVAR or the factors and Class. I had a hypothesis that out of all the variables, revenue growth would have the greatest effect on the final yoy change in stock price. However, as can be seen below, there seemed to not be much of a correlation.

In [6]:
dfSector = pd.get_dummies(df1["Sector"])
dfClean = df1
dfClean.drop('Sector',
  axis='columns', inplace=True)
dfClean = df1.dropna(subset=['Class','2015 PRICE VAR [%]'])
dfClean = pd.concat([dfClean,dfSector], axis=1)
dfVAR = dfClean['2015 PRICE VAR [%]']
dfClass = dfClean['Class']
dfClean.drop(['2015 PRICE VAR [%]', 'Class'],
  axis='columns', inplace=True)
In [229]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
model = LinearRegression()
x = dfClean['Revenue Growth']
x= x.fillna(x.mean())
X = pd.concat([x,dfVAR], axis=1)
y = dfVAR.to_numpy()
plt.plot(X, y)
model.fit(X, y)
y_predicted = model.predict(X)
plt.plot(X, y_predicted)
plt.show()
# print(X)
from yellowbrick.regressor import ResidualsPlot
from sklearn.metrics import mean_squared_error
# Instantiate the linear model and the visualizer
#ridge = Ridge()
visualizer = ResidualsPlot(model)
visualizer.fit(X, y_predicted)
print("R^2: {}".format(model.score(X, y_predicted)))
rmse = np.sqrt(mean_squared_error(x,y))
print("Root Mean Squared Error: {}".format(rmse))
visualizer.poof();
plt.show()
print(y_predicted)
R^2: 1.0
Root Mean Squared Error: 40996.17652829138
[ -9.323276   -25.51219289  33.11829672 ... -23.55890017   7.77957917
 -34.09961311]

As can be seen from the above graph and the lasso model below, linear regression is too simple to examine a dataset with over 200 variables. Therefore we must pursue some more difficult model to evaluate the dataset. As can be seen below, after the lasso came out, for the 200 something variables all of them had a supremely low coefficient meaning that they had a very low effect on the final outcome. This is expected as we are certainly oversimplifying here.

In [228]:
X2 = dfClean.fillna(dfClean.mean())
X2 = X2.drop('Stock Name',axis=1)
y2 = dfVAR.values
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.33, random_state=42)
#Let's build a Lasso model
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge


alpha_user = 0.95420530945486497 #play with alpha and you will get different results (between 0 and 1)
lasso_model = Lasso(alpha=alpha_user,normalize=True)
#ridge_mode = Ridget(alpha= alpha_user, normalize=True)
lasso_model.fit(X_train,y_train)
lasso_coef = lasso_model.coef_
# print(lasso_coef)
e_dataframe=pd.DataFrame({'a':lasso_coef}) 
e_dataframe = e_dataframe.sort_values(by=['a'],ascending=False)
print(e_dataframe.head(5))
dx = dfClean.columns.values.tolist()
print(dx[230])
print(dx[226])
print(dx[164])
              a
230  232.710423
226  151.960117
164   71.314640
21    69.307591
116    2.472822
Real Estate
Energy
R&D to Revenue

Interestingly enough the top 3 variables were Real Estate, Energy and R&D to Revenue. Traditionally, both Real Estate and Energy are regarded as traditional stocks and therefore their profits have more to do with their stock price unlike the new tech stocks. On the other hand R&D to Revenue's high correlation makes sense as well as this is a good variable to see if a company will gain in profit in the long term.

Logistic Regression

Since, the data from the linear regression showed some positive data regarding the correlation between Sector and change in stock price. I decided to do a logisitic regression study on just sector and class (we can do this because sector is classification data). After doing this, using a confusion matrix, it is visible that although there is a 0.6 accuracy for the model, it is not significant enough to consider a relation betweeen the data.

In [13]:
from sklearn.linear_model import LogisticRegression
X3 = dfSector
y3 = dfClass.values
X_train2, X_test2, y_train2, y_test2 = train_test_split(X3, y3, test_size=0.33, random_state=42)
model = LogisticRegression()

model.fit(X_train2, y_train2)
y_pred2 = model.predict(X_test2)

from sklearn import metrics
print(metrics.accuracy_score(y_test2, y_pred2))


from sklearn.metrics import plot_confusion_matrix 
cnf_matrix = metrics.confusion_matrix(y_test2, y_pred2)
print(cnf_matrix)
metrics.plot_confusion_matrix(model, X_test2, y_test2)
plt.show()
print(metrics.classification_report(y_test2, y_pred2))
0.5974542561654733
[[601 104]
 [402 150]]
              precision    recall  f1-score   support

           0       0.60      0.85      0.70       705
           1       0.59      0.27      0.37       552

    accuracy                           0.60      1257
   macro avg       0.59      0.56      0.54      1257
weighted avg       0.60      0.60      0.56      1257

Let's take a look at what the Sector data really is and how the companies are spread out among different sectors.

In [124]:
import seaborn as sns
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
count = df['Sector'].value_counts()
plt.figure(figsize=(15,10))
ax = sns.countplot(x='Sector', data=df, palette="Set2", order=count.index[0:10])
ax.set(xlabel='Sectors', ylabel='Number of Companies')
plt.title("Bar Graph of Sectors")
Out[124]:
Text(0.5, 1.0, 'Bar Graph of Sectors')

As can be seen from the bar chart above, most of the companies are related to the financial services industry and we have 11 industries in the data.

The table below shows the yoy stock price change mean, median and variance for all the sectors

In [136]:
# print(result)
import statistics
df = pd.DataFrame(columns=["Sector","Mean","Median","Variance"])
rows = []
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
list = df['Sector'].unique()
for i in list:
    dfx = df[df['Sector']==i]
    x = dfx['2015 PRICE VAR [%]'].mean()
    y = dfx['2015 PRICE VAR [%]'].median()
    z = statistics.variance(dfx['2015 PRICE VAR [%]'])
    rows.append([i, x, y, z])    
dfx = pd.DataFrame(rows, columns=["Sector","Mean","Median","Variance"])
print(dfx)
# plt.figure(figsize=(15,25))
# ax = sns.barplot(x='Sector', y='Mean', data=dfx, palette="Set2")
# ax.set(xlabel='Sectors', ylabel='Number of Companies')
# plt.title("Bar Graph of Sectors")
                    Sector          Mean     Median      Variance
0       Consumer Defensive     17.956651   2.761595  3.351567e+04
1          Basic Materials    153.635483 -22.704545  6.715706e+06
2               Healthcare    354.012343  -4.069741  5.234683e+07
3        Consumer Cyclical     47.633625  -6.094964  1.279881e+06
4              Industrials    -12.509181 -13.773614  1.307211e+03
5              Real Estate  10807.848832  -2.276476  2.611423e+10
6   Communication Services     -5.434395  -5.316752  9.448733e+02
7                   Energy    -27.376600 -33.670540  5.846270e+03
8       Financial Services    280.354020   5.464307  3.263070e+07
9                Utilities     -5.473441  -2.485156  9.778282e+02
10              Technology   1480.477975  -4.823197  8.577811e+08

The table below shows the mean, median and variance for class data for the different sectors

In [137]:
# print(result)
import statistics
df = pd.DataFrame(columns=["Sector","Mean","Median","Variance"])
rows = []
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
list = df['Sector'].unique()
for i in list:
    dfx = df[df['Sector']==i]
    x = dfx['Class'].mean()
    y = dfx['Class'].median()
    z = statistics.variance(dfx['Class'])
    rows.append([i, x, y, z])    
dfx = pd.DataFrame(rows, columns=["Sector","Mean","Median","Variance"])
print(dfx)
# plt.figure(figsize=(15,25))
# ax = sns.barplot(x='Sector', y='Mean', data=dfx, palette="Set2")
# ax.set(xlabel='Sectors', ylabel='Number of Companies')
# plt.title("Bar Graph of Sectors")
                    Sector      Mean  Median  Variance
0       Consumer Defensive  0.539394     1.0  0.249963
1          Basic Materials  0.219008     0.0  0.171753
2               Healthcare  0.448454     0.0  0.247769
3        Consumer Cyclical  0.404814     0.0  0.241468
4              Industrials  0.329341     0.0  0.221317
5              Real Estate  0.455357     0.0  0.249119
6   Communication Services  0.433735     0.0  0.248604
7                   Energy  0.131222     0.0  0.114521
8       Financial Services  0.624242     1.0  0.234920
9                Utilities  0.432990     0.0  0.248067
10              Technology  0.451389     0.0  0.248068

The graph below shows the relationship between the mean of the stock price change of different industries and the mean of different growth indicators for those industries.

In [214]:
# print(result)
import statistics
from sklearn import datasets, linear_model
df = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
# df = pd.DataFrame(columns=["Sector","VAR Mean","Gross Profit Mean","Net Income Mean","Net Profit Margin Mean"])
df = df.dropna(subset=['Sector','2015 PRICE VAR [%]','priceEarningsToGrowthRatio','Gross Profit Growth','Net Income Growth','Revenue Growth'])
rows = []
list = df['Sector'].unique()
for i in list:
    dfx = df[df['Sector']==i]
    a = dfx['2015 PRICE VAR [%]'].mean()
    b = dfx['priceEarningsToGrowthRatio'].median()
    c = dfx['Gross Profit Growth'].mean()
    d= dfx['Net Income Growth'].mean()
    e= dfx['Revenue Growth'].mean()
    rows.append([i, a, b, c, d, e ])    
dfx = pd.DataFrame(rows, columns=["Sector","VAR Mean","priceEarningsToGrowthRatio","Gross Profit Growth","Net Income Growth","Revenue Growth",])

fig, axs = plt.subplots(2, 2)
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Net Income Growth'])
axs[0, 0].set_title('Net Income Growth')
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [slope * i + intercept for i in x2]
axs[0, 0].plot(x2, abline_values, 'b')
plt.xlim([-50, 300])
axs[0,0].scatter(x=x2, y =y)

axs[1, 0].set_title('priceEarningsToGrowthRatio')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['priceEarningsToGrowthRatio'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[1, 0].plot(x2, abline_values, 'b')
axs[1,0].scatter(x=dfx['VAR Mean'], y =dfx['priceEarningsToGrowthRatio'])


axs[1, 1].set_title('Gross Profit Growth')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Gross Profit Growth'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[1, 1].plot(x2, abline_values, 'b')
axs[1,1].scatter(x=dfx['VAR Mean'], y =dfx['Gross Profit Growth'])

axs[0, 1].set_title('Revenue Growth')
x = pd.concat([dfx['VAR Mean']], axis=1)
x2 = np.array(dfx['VAR Mean'])
y = np.array(dfx['Revenue Growth'])
regr = linear_model.LinearRegression()
regr.fit(x, y)
y_pred = regr.predict(x)
# print(regr.intercept_)
# print(regr.coef_)
abline_values = [regr.coef_ * i + regr.intercept_ for i in x2]
axs[0, 1].plot(x2, abline_values, 'b')
axs[0,1].scatter(x=dfx['VAR Mean'], y =dfx['Revenue Growth'])
fig. tight_layout(pad=3.0)

Above are some graphs regarding different growth indicators for the different industries and how they differ based on their YOY stock price change. As can be seen above, there doesn't seem to be much of a relation between the points however it does seem that the growth indicators all show similar spread in regards to the points meaning that the growth indicators themselves might have a relation to each other but not in regards to the price YOY change. However, it is interesting to note that each industry seems to have consistent data in regards to its sector, meaning that companies in each indivudal sectors are affected by the sector it is in.

Baseline Dummy Classifier

With this baseline dummy classifier, we know that every model we develop should have an accuracy greater than 0.5879 as that means its worse than a dummy model

In [292]:
from sklearn.dummy import DummyClassifier
dfClass2 = dfClass.replace(1, "1")
dfClass2 = dfClass2.replace(0, "0")
dfClass2 = pd.DataFrame(dfClass2)
X2 = df1.dropna(subset=['Class','2015 PRICE VAR [%]','Stock Name'])
xTrain, xTest, yTrain, yTest = train_test_split(X2, dfClass2, test_size = 0.3, random_state = 2020)
dummy_classifier = DummyClassifier(strategy='most_frequent')
# print(X_train.head(5))
# print(df)
dummy_classifier.fit(xTrain,yTrain)
baseline_acc = dummy_classifier.score(xTest,yTest)


### For verifying answer:
print("Baseline Accuracy (using Class) = ", baseline_acc)
Baseline Accuracy (using Class) =  0.5879265091863517

Decision Tree

In [343]:
X3 = dfClean.fillna(dfClean.mean())
X3.drop('Stock Name',
  axis='columns', inplace=True)
xTrain, xTest, yTrain, yTest = train_test_split(X3, dfClass2, test_size = 0.3, random_state = 2020)
df = pd.DataFrame(columns=["A", "B", "C"])
rows = []
# print(xTrain.head(5))
# print(yTrain.head(5))
for i in range(10):
    if i!=0:
        dt = DecisionTreeClassifier(max_depth=i, criterion="entropy", random_state=2020)
        dt.fit(xTrain,yTrain)
        ypred = dt.predict((xTest))
        ytrainpred = dt.predict((xTrain))
        a1 = metrics.accuracy_score(yTrain, ytrainpred)
        a2 = metrics.accuracy_score(yTest, ypred)
        rows.append([i, a1,a2])
# print(rows)

df = pd.DataFrame(rows, columns=["A", "B","C"])
print(df)
   A         B         C
0  1  0.615009  0.628171
1  2  0.619512  0.608924
2  3  0.643527  0.629921
3  4  0.653659  0.627297
4  5  0.663039  0.628171
5  6  0.689306  0.622922
6  7  0.713321  0.609799
7  8  0.751220  0.606299
8  9  0.774859  0.606299

As we can see from above, at the 5th iteration (depth of 5), the decision tree starts to overfitting the data and create test data with significantly lower accuracy when compared to the training data. Hence we should only analyse the decision tree with a depth of 5 layers.

In the confusion matrix below we can see that the model is good at predicting a company's class (stock price change 0 or 1). However it is not that great at labeling companies with stock price that went up with a significant number of stocks that were classified as false negative.

In [344]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=2020)
dt.fit(xTrain,yTrain)
ypred = dt.predict((xTest))
ytrainpred = dt.predict((xTrain))
metrics.plot_confusion_matrix(dt, xTest, yTest)
plt.show()
cf = metrics.confusion_matrix(yTest, ypred)
print(cf)
[[501 171]
 [252 219]]

Accuracy of the train and test data for this model:

In [345]:
print('Accuracy =', metrics.accuracy_score(yTrain, ytrainpred))
print('Accuracy =', metrics.accuracy_score(yTest, ypred))
Accuracy = 0.6435272045028143
Accuracy = 0.6299212598425197

The decision tree below shows that deposit liabilities is the most important factor followed by capitalExpenditureCoverageRatios and Invested Cpaital. This observation is also illustrated in the feature importance table shown below the decision tree.

In [348]:
from sklearn import tree
fn = X3.columns
dfClass3 = dfClass.replace(1, "1")
dfClass3 = dfClass3.replace(0, "0")
cn = dfClass3.unique()
plt.figure(2)
Tree = tree.plot_tree(dt,feature_names=fn,class_names=cn,fontsize=10)
plt.figure(figsize=(100, 500)) 
plt.savefig('stockmarket.png')
plt.show()
<Figure size 7200x36000 with 0 Axes>
In [91]:
print('Feature Importance:', dt.feature_importances_)
imp=pd.DataFrame(zip(xTrain.columns, dt.feature_importances_))
imp.columns =['X','Y']
imp = imp.sort_values(by=['Y'], ascending=False)
pd.set_option('display.max_rows', None)
print(imp.head(5))
Feature Importance: [0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.07005993 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.32760565 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.07064822 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.15979544 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.07854426
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.20283653 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.09050997 0.         0.
 0.         0.         0.         0.        ]
                                    X         Y
50                Deposit Liabilities  0.327606
176                  Invested Capital  0.202837
128  capitalExpenditureCoverageRatios  0.159795
225                            Energy  0.090510
149      Enterprise Value over EBITDA  0.078544

Interestingly enough, the data we have got is not bad in comparison to the logistic and linear regression data. However because we have so many different variables it is quite difficult to use a decision tree to give us supremely accurate data. Hence we must look at different methods to increase this accuracy.

Ensemble

Ensemble Learning is building a prediction model by combining the strengths of a collection of simpler base models. My ensemble algorithm here will aggregate the predictions made by each individual base model and produce a single output prediction. Hopefully by using ensemble learning we will get better models to find out what factors have the greatest influence on stock price yoy change.

Ensemble Method 1: Bagging
     What is Bagging? Bagging (Bootstrap Aggregation) is an averaging method doing different tests on different training samples. Doing this will take advantage of shortcoming of trees (which are heavily influenced by training data) we can ensure we don't overfit like we did in the previous decision tree model we used above.

Method 1a: Generic Bagging:

In [372]:
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
dataframe = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
dataframe.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
dataframe2 = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
dataframe = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
ensemble= dataframe.fillna(dataframe.mean())
X_train, X_test, y_train, y_test = train_test_split(ensemble, dfClass,
                                                    test_size=0.3,
                                                    random_state=42
                                                    )
model_bagging = BaggingClassifier(random_state = 42)
model_bagging.fit(X_train, y_train)
pred_bagging = model_bagging.predict(X_test)
scores = []
for model_idx, model in enumerate(model_bagging.estimators_):
    if model_idx == 0:
        print('='*40)
    preds = model.predict(X_test)
    scores.append(recall_score(y_test, preds))
    model_recall = np.round(recall_score(y_test, preds), 5)
    print(f'Recall for Base Model {model_idx+1}:\t', model_recall)
    if model_idx < (len(model_bagging.estimators_) - 1):
        print('-'*40)
    else:
        print('='*40)
ensemble_preds = model_bagging.predict(X_test)
print("Mean Recall Score:\t\t", np.round(np.array(scores).mean(), 5))
randomforest = np.round(np.array(scores).mean(), 5)
print("Std Deviation:\t\t\t", np.round(np.array(scores).std(), 5))
print("Range:\t\t\t\t", np.round(np.array(scores).ptp(), 5))
print(f'Overall Recall for model:\t {np.round(recall_score(y_test, ensemble_preds), 5)}')
# acc_bagging = accuracy_score(y_test, pred_bagging)

# print(' Accuracy = ', acc_bagging)
========================================
Recall for Base Model 1:	 0.52227
----------------------------------------
Recall for Base Model 2:	 0.52632
----------------------------------------
Recall for Base Model 3:	 0.53036
----------------------------------------
Recall for Base Model 4:	 0.51822
----------------------------------------
Recall for Base Model 5:	 0.47773
----------------------------------------
Recall for Base Model 6:	 0.52227
----------------------------------------
Recall for Base Model 7:	 0.51822
----------------------------------------
Recall for Base Model 8:	 0.53239
----------------------------------------
Recall for Base Model 9:	 0.50405
----------------------------------------
Recall for Base Model 10:	 0.48178
========================================
Mean Recall Score:		 0.51336
Std Deviation:			 0.01838
Range:				 0.05466
Overall Recall for model:	 0.41498

Below is a classification report that will give us some data regarding how accurate we are classifying the data in regards to true negatives, true positives, false negatives, and false positives.

In [325]:
from sklearn.metrics import (
    classification_report,
    recall_score,
    precision_score,
    accuracy_score
)
print('Classification Report:\n')
print(classification_report(y_test, pred_bagging))
Classification Report:

              precision    recall  f1-score   support

           0       0.64      0.79      0.71       649
           1       0.60      0.41      0.49       494

    accuracy                           0.63      1143
   macro avg       0.62      0.60      0.60      1143
weighted avg       0.62      0.63      0.61      1143

Method 1b: Random Forest Bagging:

In [333]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
model_rf = RandomForestClassifier(n_estimators=10, max_features=200, random_state=42)
model_rf.fit(X_train, y_train)
predict_rf = model_rf.predict(X_test)
# recall_rf = recall_score(y_test, predict_rf, average=None)
# precision_rf = precision_score(y_test, predict_rf)
scores = []
for model_idx, model in enumerate(model_rf.estimators_):
    if model_idx == 0:
        print('='*40)
    preds = model.predict(X_test)
    scores.append(recall_score(y_test, preds))
    model_recall = np.round(recall_score(y_test, preds), 5)
    print(f'Recall for Base Model {model_idx+1}:\t', model_recall)
    if model_idx < (len(model_rf.estimators_) - 1):
        print('-'*40)
    else:
        print('='*40)
ensemble_preds = model_rf.predict(X_test)
print("Mean Recall Score:\t\t", np.round(np.array(scores).mean(), 5))
randomforest = np.round(np.array(scores).mean(), 5)
print("Std Deviation:\t\t\t", np.round(np.array(scores).std(), 5))
print("Range:\t\t\t\t", np.round(np.array(scores).ptp(), 5))
print(f'Overall Recall for model:\t {np.round(recall_score(y_test, ensemble_preds), 5)}')
========================================
Recall for Base Model 1:	 0.52024
----------------------------------------
Recall for Base Model 2:	 0.44737
----------------------------------------
Recall for Base Model 3:	 0.53846
----------------------------------------
Recall for Base Model 4:	 0.48381
----------------------------------------
Recall for Base Model 5:	 0.49798
----------------------------------------
Recall for Base Model 6:	 0.50405
----------------------------------------
Recall for Base Model 7:	 0.4919
----------------------------------------
Recall for Base Model 8:	 0.49595
----------------------------------------
Recall for Base Model 9:	 0.46964
----------------------------------------
Recall for Base Model 10:	 0.50202
========================================
Mean Recall Score:		 0.49514
Std Deviation:			 0.0239
Range:				 0.09109
Overall Recall for model:	 0.40081

As can be seen from above, both the generic and decision tree based bagging models were both quite inaccurate. This make sense as we already know that the training data is not the problem that is giving us a low accuracy model. It is that we have too many variables that do not have a significant effect on the final output at the end. Let's now look at other type of ensemble methods.

To clean up the data, I have used an out of bag evaluation to see what error are we getting from using bagging and using certain data multiple times while leaving other data unused. As can be seen below the OOB technique cleans up around 2% of the data. However this is relatively unimportant as the bagging model is not accurate enough to be continued to be considered in the rest of this study.

In [334]:
model_rf_oob = RandomForestClassifier(n_estimators=100, max_features=7, oob_score=True, random_state=42).fit(X_train, y_train)
oob_score = round(model_rf_oob.oob_score_,4)
acc_oob = round(accuracy_score(y_test, model_rf_oob.predict(X_test)),4)
diff_oob = round(abs(oob_score - acc_oob),4)

print('OOB Score:\t\t\t', oob_score)
print('Testing Accuracy:\t\t', acc_oob)
print('Acc. Difference:\t\t', diff_oob)
OOB Score:			 0.6379
Testing Accuracy:		 0.6553
Acc. Difference:		 0.0174

Ensemble Method 2: AdaBoost
    AdaBoost is a boosting algorithm that begins by assigning equal weights to all observations but based on the first set of predictions it starts to adjust the weights to increase for misclassified observations and reduced for correctly classified observations. This update keeps on going until we have a relatively accurate model.

In [369]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
base_est = DecisionTreeClassifier (max_depth =2)
ada_boost1 = AdaBoostClassifier(base_est, n_estimators=500, random_state=42, learning_rate=.15)
ada_boost1.fit(X_train, y_train)
ada_boost2 = AdaBoostClassifier(base_est, n_estimators=20, random_state=42, learning_rate=.5)
ada_boost2.fit(X_train, y_train)
res1 = round(recall_score(y_test, ada_boost1.predict(X_test)),4)
res2 = round(recall_score(y_test, ada_boost2.predict(X_test)),4)

print('Winning Model:\t {}, MODEL 1')
print('MODEL 1 Recall:\t {}'.format(res1))
print('MODEL 2 Recall:\t {}:'.format(res2))
Winning Model:	 {}, MODEL 1
MODEL 1 Recall:	 0.4352
MODEL 2 Recall:	 0.4352:

Ensemble Method 3: Gradient Boosted Trees (GBT)
    Gradient Booosted Trees is another boosting algorithm but unlike Adaboost it re-weighs observations based on the prediction performance through residuals errors of the preceding models.

In [353]:
from sklearn.ensemble import GradientBoostingClassifier
dataframe2 = dataframe.dropna().
xTrain, xTest, yTrain, yTest = train_test_split(X3, dfClass2, test_size = 0.3, random_state = 2020)
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
gbc_pred = gbc.predict(X_test)
recall_gbc = round(recall_score(y_test, gbc_pred),4)
precision_gbc = round(precision_score(y_test, gbc_pred),4)

print('Recall :\t {}'.format(recall_gbc))
print('Precision :\t {}:'.format(precision_gbc))
Recall :	 0.4291
Precision :	 0.6405:
Recall :	 0.4291
Precision :	 0.6405:

As shown above, both the boosting algorithms have relatively high precision but in terms of recall it doesn't do quite well. Below I will be using a hard voting method to combine the logistic regression and the randomforest methods to create a combined algorithm.

Ensemble Method 4: Voting Classifier
    In hard voting, we use majority voting to combine results from all the models and choose the outputs (0,1) for all values. In this model we use an ensemble of RandomForestClassifier, Support Vector Machine and Logistic Regression. At the end we get a combined accuracy score of 0.65 which is quite good compared to the other models we have used before.

In [370]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rfClf = RandomForestClassifier(n_estimators=500, random_state=0) # 500 trees. 
svmClf = SVC(probability=True, random_state=0) # probability calculation
logClf = LogisticRegression(random_state=0)

# constructing the ensemble classifier by mentioning the individual classifiers.
clf2 = VotingClassifier(estimators = [('rf',rfClf), ('svm',svmClf), ('log', logClf)], voting='hard') 

# train the ensemble classifier
clf2.fit(X_train, y_train)

clf2_pred = clf2.predict(X_test)
recall_voting = recall_score(y_test, clf2_pred)
precision_voting = precision_score(y_test, clf2_pred)
print('Accuracy score', accuracy_score(y_test, clf2_pred))

# You can use the individual classifiers to get the accuracy in the beginning and see if our ensemble performs
# better when compared to individual classifiers.
/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Accuracy score 0.6535433070866141

Ensemble Method 5: XGBoost
    XGBoost is a boosting method that uses the gradient boosting (GBM) framework and is well known to provide better solutions than other machine learning algorithms.

In [4]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
In [7]:
dataframe = pd.read_csv('2014_Financial_Data.csv', delimiter=',')
dataframe.rename(columns={'Unnamed: 0':'Stock Name'}, inplace=True)
dataframe = dataframe.drop(columns=['Class','2015 PRICE VAR [%]','Stock Name','Sector'])
ensemble= dataframe.fillna(dataframe.mean())
X = ensemble
y = dfClass
data_dmatrix = xgb.DMatrix(data=X,label=y)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100)
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 0.469583
/opt/anaconda3/lib/python3.8/site-packages/xgboost/data.py:112: UserWarning: Use subset (sliced data) of np.ndarray is not recommended because it will generate extra copies and increase memory consumption
  warnings.warn(
In [8]:
from sklearn.linear_model import LinearRegression

linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train,y_train)

pred_train = linear_reg_model.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train,pred_train))
print("Train RMSE: %f" % (train_rmse))

pred_test = linear_reg_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test,pred_test))
print("Test RMSE: %f" % (test_rmse))
Train RMSE: 0.467616
Test RMSE: 8.410056
In [9]:
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
In [12]:
cv_results.head()
Out[12]:
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
0 0.493484 0.000296 0.495126 0.000110
1 0.488064 0.000293 0.491800 0.000202
2 0.483503 0.000773 0.488921 0.000016
3 0.479276 0.000832 0.486460 0.000160
4 0.475218 0.001051 0.484481 0.000092
In [25]:
print((cv_results["test-rmse-mean"]).tail(1))
49    0.468364
Name: test-rmse-mean, dtype: float64
In [10]:
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
In [15]:
import matplotlib.pyplot as plt

xgb.plot_tree(xg_reg,num_trees=1)
plt.rcParams['figure.figsize'] = [500, 500]
plt.show()

As seen from the calculations above, XGBoost allows us to create a tree based on the most important factors. From the test rmse mean, it is clear that the amount of error is quite low especially in comparison to the RMSE of a linear regression. Now that we have a decent model, what does this model tell us about the data? Below I will be looking through features that are important in the model and how they play a role in affecting YOY stock price change.

In [52]:
from xgboost import XGBRegressor
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
# plt.barh(dataframe.columns, xgb.feature_importances_)
feature_names = dataframe.columns
sorted_idx = xgb.feature_importances_.argsort()
# print(sorted_idx)
# print(feature_names)
# print(xgb.feature_importances_)  

rows=[]
for i in sorted_idx:
    x = feature_names[i]
    y = xgb.feature_importances_[i]
    rows.append([i, x, y])  
dfx = pd.DataFrame(rows, columns=["id","Variable","Value"])
pd.set_option('display.max_columns', None)
print(dfx.tail(10))

eid = dfx['Variable'].tail(15).to_numpy()

# plt.barh(feature_names[sorted_idx][:-5], xgb.feature_importances_[sorted_idx][:-5])
# plt.xlabel("Xgboost Feature Importance")
      id                          Variable     Value
211  193                EPS Diluted Growth  0.009422
212  158                 Interest Coverage  0.010057
213   76                  priceToBookRatio  0.010354
214   74                Net Cash/Marketcap  0.011450
215  174              Tangible Asset Value  0.017536
216  142                          PE ratio  0.017872
217  157                     Current ratio  0.017959
218  128  capitalExpenditureCoverageRatios  0.018837
219  169                     Graham Number  0.019915
220   50               Deposit Liabilities  0.065098

As can be seen above from the feature importance variables, Deposit Liabilities is the most important factor and most of the other variables have a significantly lower influence (with many having a value of 0.0). This is an interesting point because in an earlier study regarding the random forest model (including Sectors), Deposit liabilities was also the top variable when we looked at the feature importance for that model.

In [53]:
import seaborn as sns
def correlation_heatmap(train):
    correlations = train.corr()

    fig, ax = plt.subplots(figsize=(10,10))
    sns.heatmap(correlations, vmax=1.0, center=0, fmt='.2f', cmap="YlGnBu",
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70}
                )
    plt.show();
    
correlation_heatmap(X_train[eid])

The heatmap above shows the top 15 most important factors and how they relate to one another. It is intersting that most of them have no relation to one another. However, Deposit Liabilities does have a 0.92 correlation with Tangible Asset Value. This is interesting as asset and liabilities are the two most fundamental statistics that a company has as it shows what the company is currently worth. Maybe there is some reasoning why both deposit liabilities and tangible asset value have something to do with the stock price change, as a company with a good balance sheet may seem safer to investors and become more desirable leading to a increase in the stock price of a company.

Conclusion
     Although, this study doesn't really give us any concrete evidence of any relation between a company's financial factors and its stock price change, it gives us a better understanding of what factors have a higher than normal correlation with stock price change.
    After using data science models ranging from linear regression and logisitic regression to decision trees and ensemble methods such as XGBoost, my study has found that Sector and Deposit liabilities have the greatest influence on stock price change. Many of the other factors have a close to 0 relation with stock price change. Most of the stocks that belong in one sector, seem to have similar financials and similar growth. This makes sense as this is why PE comparisons are often made betweeen companies in the same industry. In regards to deposit liabilities my theory is that because investors are loss aversive, they focus a lot of attention on the companies' liabilities to the bank as it represents the company's independence from giant financial investors such as banks.
    The most interesting question that this project has allowed me to understand is why stock prices are not correlated with the respective companies financials. Through my experience is an avid investor and through doing this study, I have realised that a company's price on the stock market is not determined by financials and crunched by numbers, it is instead determined by the judgements of investors. And as long term thinkers, investors don't invest based on current financial numbers but instead on future numbers. Investors may instead focus their attention on politics, macroeconomics and long term trends that have longstanding effects on the future of specific industries/sectors.
    The reason why the finance industry is so challenging is that there is no formula to allow for profit in the stock market. There is no guarantee for success and there is no investment without some sort of risk. This project has allowed me to have a better understanding of this concept and I hope to be able to continue doing this sort of data science project on the stock market and see if my hypothesis regarding the effects behavioral psychology and macro global trends have on stock price change is correct.

In [ ]: