18 minute read

[Notice] Journey to becoming an academic researcher: this is the story of how I became a more insightful researcher.

The S&P 500 is the most famous and influential stock index, tracking the performance of 500 large companies selected by S&P. When S&P adds a stock to the index, the inclusion sends a positive signal to the market, and index-tracking funds must buy the stock, so its price tends to rise. When a stock is removed from the index, the exclusion sends a negative signal, the funds must sell, and the price tends to fall; a stock's price generally declines by about 15% over the long term after it is dropped from the index. Because inclusion in and exclusion from the S&P 500 move stock prices, predicting which companies will be delisted can help investors avoid losses.

Research purposes

a) What characteristics affect delisting from the S&P 500? Using Random Forest and Neural Network models, I investigate which characteristics of a company influence its removal from the index.

b) Which stocks will be delisted from the index? Among the current constituents, I identify the stocks most exposed to the risk of delisting.

from IPython.core.display import display, HTML
import warnings
warnings.filterwarnings(action = 'ignore')
display(HTML("<style>.container { width:100% !important; }</style>"))
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib
from matplotlib.colors import ListedColormap
matplotlib.rcParams['axes.unicode_minus'] = False

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; use ConfusionMatrixDisplay on newer versions
from tqdm import tqdm

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
snp=pd.read_csv('2019final.csv')
snp.head()
Name R OI NI Asset Equity Debt FCFF g Tech ... TEV d PER PSR PBR EV/EBITDA CAPEX emp EBITDA Survival
0 News Corp 10074.0 599.0 155.0 15711.0 10311.0 1489.0 561.500 11.2539 0 ... 10944.99989 0.0000 54.154838 0.833234 0.534275 10.524038 572.0 28000 1040.0 1
1 LyondellBasell Industries NV 34727.0 4232.0 3390.0 30435.0 8179.0 13782.0 1644.750 0.3517 0 ... 45028.65465 4.4453 9.292229 0.907094 1.035014 8.209417 2694.0 19100 5485.0 1
2 Verizon Communications Inc 131868.0 31521.0 19265.0 291727.0 62835.0 133920.0 12563.375 2.2883 0 ... 384178.15040 4.0065 13.181269 1.925692 0.870462 7.970005 17939.0 135000 48203.0 1
3 Broadcom Inc 22597.0 4331.0 2724.0 67493.0 24970.0 32798.0 11167.375 13.1945 1 ... 153482.31920 4.1136 46.149163 5.563142 1.862568 15.137816 432.0 19000 10139.0 1
4 Boeing Co/The 76559.0 -2102.0 -636.0 133625.0 -8300.0 28532.0 -2851.000 -9.7551 0 ... 198609.87210 0.0000 -288.262378 2.394687 1.372010 1175.206344 1834.0 161100 169.0 1

5 rows × 23 columns

snp = snp.replace({'na':np.nan})
snp = snp.dropna()
print(snp.shape)
(441, 23)
X = snp.iloc[:,1:-1]
y = snp.iloc[:,-1]
features = X.columns.tolist()
print(features)
['R', 'OI', 'NI', 'Asset', 'Equity', 'Debt', 'FCFF', 'g', 'Tech', 'old', 'US', 'Mkt', 'TEV', 'd', 'PER', 'PSR', 'PBR', 'EV/EBITDA', 'CAPEX', 'emp', 'EBITDA']
y.value_counts()
1    419
0     22
Name: Survival, dtype: int64
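Note the imbalance: only 22 of the 441 stocks were delisted. A classifier that always predicts "survive" is already about 95% accurate, so every accuracy figure below should be read against that baseline. A quick sanity check (my addition, not in the original run):

# majority-class baseline: always predict "survive" (label 1)
baseline = y.value_counts().max() / len(y)
print(f"majority-class baseline accuracy: {baseline:0.4f}")  # 419/441, about 0.9501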
# keep the fitted scaler so the 2021 data can be transformed consistently later
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
parameters = {'max_depth':range(3,500)}  # with 441 samples the tree saturates long before depth 500
print(parameters)
Mdls = GridSearchCV(tree.DecisionTreeClassifier(),parameters, n_jobs=4)
Mdls = Mdls.fit(X, y)
Mdl = Mdls.best_estimator_
print(Mdl)
imp = Mdl.feature_importances_

I = np.arange(imp.shape[0])
plt.subplots(1, figsize=(20, 5))
plt.bar(I,imp)
plt.xticks(I,features);
best_param = Mdl.get_params()
best_param
print(imp)
{'max_depth': range(3, 500)}
DecisionTreeClassifier(max_depth=158)
[0.         0.         0.03189412 0.         0.04784118 0.03189412
 0.         0.01410831 0.         0.0448063  0.         0.37289104
 0.         0.00439886 0.09459125 0.04100673 0.03827294 0.
 0.0977327  0.18056246 0.        ]

[Figure: Feature importances of the tuned decision tree]
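The bare array above is hard to map to features at a glance; a tiny helper (my addition, not in the original notebook) pairs names with importances:

# top-5 features by decision-tree (Gini) importance
for name, val in sorted(zip(features, imp), key=lambda t: -t[1])[:5]:
    print(f"{name:>10}: {val:0.4f}")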

# in-sample accuracy: a deep tree can memorize its training set, so 1.0 here says little about generalization
pred = Mdl.predict(X)
accuracy = np.mean(pred==y)
accuracy
1.0
plot_confusion_matrix(Mdl, X, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the basic DT model: Total data set')
plt.show()

[Figure: Confusion matrix of the basic DT model, total data set]
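The perfect score above is measured on the very data the tree was fit to. A quick cross-validated estimate (my addition, reusing the already-imported cross_val_score) gives a more honest read on generalization:

# 10-fold CV accuracy of the tuned tree; cross_val_score refits a clone per fold
cv_acc = cross_val_score(Mdl, X, y, cv=10, scoring='accuracy').mean()
print(f"10-fold CV accuracy of the tuned tree: {cv_acc:0.4f}")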

# note: with only 22 delisted stocks, stratify=y would keep the class ratio stable across the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfc = RandomForestClassifier(n_estimators = 10, random_state = 2021)
rfc.fit(X_train, y_train)
RandomForestClassifier(n_estimators=10, random_state=2021)
y_pred = rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of basic model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of basic model to test set")
print(f"accuracy: {acc:0.4f}")
The performance of basic model to train set
accuracy: 1.0000
The performance of basic model to test set
accuracy: 0.9474
cv_scores = []
estimator_list = [i + 1 for i in range(100)]  # actual tree counts, 1..100
for n in tqdm(estimator_list):
    rfc = RandomForestClassifier(n_estimators = n,
                                 n_jobs = -1, random_state = 2021)
    score = cross_val_score(rfc, X_train, y_train, cv=10, scoring = 'accuracy').mean()
    cv_scores.append(score)

best_e = [estimator_list[i] for i in range(len(cv_scores)) if cv_scores[i] == np.max(cv_scores)]
plt.figure(figsize = (20,8))
plt.plot(estimator_list, cv_scores, marker = 'o', linestyle = 'dashed', label = "Cross validation scores")
plt.legend(fontsize = 20)
plt.xlabel("The number of trees", fontsize = 20)
plt.ylabel("Accuracy", fontsize = 20)
plt.title("Accuracy Scores", fontsize = 25)
plt.axvline(best_e[0], color='r', linestyle = '--', linewidth=2)
plt.show()
100%|██████████| 100/100 [01:45<00:00,  1.05s/it]

[Figure: Cross-validation accuracy by number of trees]

print(f"the performance is the best when the number of tree is {(cv_scores.index(max(cv_scores)))+1}")
print("The performance(10 fold cross validation)")
print(f"Accuracy: {max(cv_scores): 0.4f}")
the performance is best when the number of trees is 12
The performance (10-fold cross validation)
Accuracy:  0.9675
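As an aside, the 10-fold sweep above refits 1,000 forests; random forests also come with a nearly free out-of-bag estimate, which would be a cheaper way to pick n_estimators (a sketch, not part of the original run):

# out-of-bag accuracy: each tree is scored on the bootstrap samples it never saw
rfc_oob = RandomForestClassifier(n_estimators = 100, oob_score = True, random_state = 2021)
rfc_oob.fit(X_train, y_train)
print(f"OOB accuracy: {rfc_oob.oob_score_:0.4f}")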
rfc = RandomForestClassifier(n_estimators = 90, random_state = 2021)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of adjusted model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of adjusted model to test set")
print(f"accuracy: {acc:0.4f}")
The performance of adjusted model to train set
accuracy: 1.0000
The performance of adjusted model to test set
accuracy: 0.9398
rfc_score = pd.DataFrame(cv_scores, columns = ['accuracy'])
rfc_score['accu_rank'] = rfc_score['accuracy'].rank(ascending = 0)
rfc_score = rfc_score.sort_values(by=['accu_rank'])
rfc_score.head()
accuracy accu_rank
40 0.967527 3.5
43 0.967527 3.5
42 0.967527 3.5
11 0.967527 3.5
29 0.967527 3.5
rfc = RandomForestClassifier()
param_grid = {
    'n_estimators' : [41, 44, 43, 12, 30],
    'max_depth' : [10, 15, 20, 25],
    'max_leaf_nodes' : [25, 30, 35],
    'criterion' : ['gini', 'entropy']
    # 'max_features' : ['auto', 'sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator = rfc, param_grid = param_grid, cv=10, verbose = 1, n_jobs=-1)
CV_rfc.fit(X_train, y_train)
Fitting 10 folds for each of 120 candidates, totalling 1200 fits





GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [10, 15, 20, 25],
                         'max_leaf_nodes': [25, 30, 35],
                         'n_estimators': [41, 44, 43, 12, 30]},
             verbose=1)
CV_rfc.best_params_
{'criterion': 'gini',
 'max_depth': 10,
 'max_leaf_nodes': 25,
 'n_estimators': 12}
result_table = pd.DataFrame(CV_rfc.cv_results_)
result_table = result_table.sort_values(by = 'mean_test_score', ascending = False)
print(result_table[['params', 'mean_test_score']])
                                               params  mean_test_score
58  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964409
3   {'criterion': 'gini', 'max_depth': 10, 'max_le...         0.964409
54  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964409
27  {'criterion': 'gini', 'max_depth': 15, 'max_le...         0.964409
46  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964301
..                                                ...              ...
62  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.951398
93  {'criterion': 'entropy', 'max_depth': 20, 'max...         0.951398
56  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.951398
66  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.951398
73  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.944946

[120 rows x 2 columns]
best_rfc = CV_rfc.best_estimator_
best_rfc.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, max_leaf_nodes=25, n_estimators=12)
y_pred = best_rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of the best rfc model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = best_rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of the best rfc model to test set")
print(f"accuracy: {acc:0.4f}")

# Accuracy based on total dataset
y_pred = best_rfc.predict(X)
acc = accuracy_score(y_true = y, y_pred = y_pred)
print("The performance of the best rfc model to total data set")
print(f"accuracy: {acc:0.4f}")
The performance of the best rfc model to train set
accuracy: 0.9968
The performance of the best rfc model to test set
accuracy: 0.9248
The performance of the best rfc model to total data set
accuracy: 0.9751
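Since always predicting "survive" already scores about 95%, accuracy alone says little about the 22 delisted stocks. The metrics imported at the top (recall_score, precision_score) make the minority class easy to check (my addition):

# recall and precision for the minority "delisted" class (label 0) on the test set
y_pred = best_rfc.predict(X_test)
print(f"recall (delisted):    {recall_score(y_test, y_pred, pos_label = 0):0.4f}")
print(f"precision (delisted): {precision_score(y_test, y_pred, pos_label = 0):0.4f}")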
plot_confusion_matrix(best_rfc, X, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the best RF model: Total data set')
plt.show()

[Figure: Confusion matrix of the best RF model, total data set]

best_imp = best_rfc.feature_importances_

I = np.arange(best_imp.shape[0])
plt.subplots(1, figsize=(21, 5))
plt.bar(I,best_imp)
plt.xticks(I,features)
print(best_imp)
[0.07057966 0.00777979 0.03881357 0.02295456 0.01006312 0.01311264
 0.05430833 0.03474899 0.00752369 0.02462194 0.04721084 0.11345549
 0.08311333 0.01507277 0.06567141 0.16225725 0.04314205 0.08343326
 0.0517976  0.01191432 0.0384254 ]

[Figure: Feature importances of the best RF model]

S&P 500 Data: Neural Network Prediction

# X was already min-max scaled above, so this second scaling is a no-op; kept to mirror the original run
X_n = MinMaxScaler().fit_transform(X)
clf = MLPClassifier(random_state=1, hidden_layer_sizes=(7,), max_iter=1000)
clf.fit(X_n, y)

pred = clf.predict(X_n) # note, 1 is the positive ("survive") class
accuracy = accuracy_score(y,pred)

CM = confusion_matrix(y, pred, normalize='true')
true_positive = CM[1,1]
true_negative = CM[0,0]
false_positive = CM[0,1]
false_negative = CM[1,0]

accuracy  # equals the 419/441 majority baseline, so the base NN almost never predicts "delisted"
0.9501133786848073
plot_confusion_matrix(clf, X_n, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the base NN model: Total data set')
plt.show()

[Figure: Confusion matrix of the base NN model, total data set]

X_median = np.median(X_n, axis=0).reshape((1,-1)) # (1,D)
proba_median = clf.predict_proba(X_median)
proba_median.shape
(1, 2)
proba_median
array([[0.0480666, 0.9519334]])
D = X.shape[1]
importance = []
for i in range(D):
    # nudge feature i at the median point and measure the change in P(survive):
    # a one-sided finite-difference sensitivity |dP/dx_i|
    x_median_i_perturbed = X_median.copy()
    x_median_i_perturbed[0,i] += 0.00001

    proba_median_perturbed = clf.predict_proba(x_median_i_perturbed)
    imp = abs(proba_median_perturbed[:,1] - proba_median[:,1])/0.00001
    importance.append(imp)
importance = np.array(importance).reshape(-1)
    
feature_indices = np.arange(len(importance))

plt.figure(figsize=(10, 5))
plt.bar(feature_indices,importance)
plt.xticks(feature_indices, snp.columns[1:-1], rotation='vertical')
plt.ylabel('Feature Importance')
plt.grid(True)

print(importance)
[0.05773369 0.05707531 0.03642151 0.00627693 0.09655891 0.10264398
 0.0603424  0.02820982 0.03058961 0.02390792 0.01566889 0.14800129
 0.16421304 0.07374916 0.06396335 0.10162396 0.11901878 0.00881586
 0.13850265 0.10046528 0.04221742]

[Figure: Perturbation-based feature importances of the base NN model]
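The loop above takes a one-sided finite difference at the median point, which can be noisy if the network's output has a kink nearby. A central-difference variant (my sketch, not part of the original analysis) is a little more stable for the same cost:

def central_diff_importance(model, x0, eps = 1e-5):
    # |P(x0 + eps*e_i) - P(x0 - eps*e_i)| / (2*eps) for each feature i
    grads = []
    for i in range(x0.shape[1]):
        up, down = x0.copy(), x0.copy()
        up[0, i] += eps
        down[0, i] -= eps
        diff = model.predict_proba(up)[0, 1] - model.predict_proba(down)[0, 1]
        grads.append(abs(diff) / (2 * eps))
    return np.array(grads)

print(central_diff_importance(clf, X_median))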

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_n, y, test_size=0.3)
nn_clf = MLPClassifier(random_state=1, hidden_layer_sizes=(7), max_iter=1000)
nn_clf.fit(X_train_n, y_train_n)
MLPClassifier(hidden_layer_sizes=7, max_iter=1000, random_state=1)
hl_size = list(range(200, 251))
max_it = [1000, 2000, 5000, 10000]
cv_nn_scores = []

for j in range(len(max_it)):
    for i in tqdm(range(len(hl_size))):
        nn_clf = MLPClassifier(random_state=1, hidden_layer_sizes=hl_size[i], max_iter=max_it[j])
        score = cross_val_score(nn_clf, X_train_n, y_train_n, cv=10, scoring = 'accuracy').mean()
        cv_nn_scores.append(score)
    best_e = [hl_size[k] for k in range(len(cv_nn_scores)) if cv_nn_scores[k] == np.max(cv_nn_scores)]

    # Show the results
    plt.figure(figsize = (20,5))
    plt.plot(hl_size, cv_nn_scores, marker = 'o', linestyle = 'dashed', label = "Cross validation scores")
    plt.legend(fontsize = 10)
    plt.xlabel("The number of nodes", fontsize = 10)
    plt.ylabel("Accuracy", fontsize = 10)
    plt.title("Accuracy Scores", fontsize = 15)
    plt.axvline(best_e[0], color='r', linestyle = '--', linewidth=2)
    plt.show()

    # print the best result for this max_iter, then reset for the next sweep
    print(f"the performance is best with {hl_size[cv_nn_scores.index(max(cv_nn_scores))]} hidden nodes")
    print("The performance (10-fold cross validation)")
    print(f"Accuracy with Max_it {max_it[j]}: {max(cv_nn_scores): 0.4f}")
    cv_nn_scores = []
100%|██████████| 51/51 [05:48<00:00,  6.83s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 1000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 1000:  0.9547


100%|██████████| 51/51 [05:50<00:00,  6.87s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 2000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 2000:  0.9547


100%|██████████| 51/51 [05:59<00:00,  7.04s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 5000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 5000:  0.9547


100%|██████████| 51/51 [05:55<00:00,  6.97s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 10000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 10000:  0.9547
best_nn = MLPClassifier(random_state=1, hidden_layer_sizes=(202), max_iter=10000)
best_nn.fit(X_train_n, y_train_n)
MLPClassifier(hidden_layer_sizes=202, max_iter=10000, random_state=1)
# Accuracy based on Training set
y_pred_n = best_nn.predict(X_train_n)
acc_n = accuracy_score(y_true = y_train_n, y_pred = y_pred_n)
print("The performance of the best NN model to train set")
print(f"accuracy: {acc_n:0.4f}")

# Accuracy based on Test set
y_pred_n = best_nn.predict(X_test_n)
acc_n = accuracy_score(y_true = y_test_n, y_pred = y_pred_n)
print("The performance of the best NN model to test set")
print(f"accuracy: {acc_n:0.4f}")

# Accuracy based on total dataset
y_pred_n = best_nn.predict(X_n)
acc_n = accuracy_score(y_true = y, y_pred = y_pred_n)
print("The performance of the best NN model to total data set")
print(f"accuracy: {acc_n:0.4f}")
The performance of the best NN model to train set
accuracy: 0.9870
The performance of the best NN model to test set
accuracy: 0.9398
The performance of the best NN model to total data set
accuracy: 0.9728
plot_confusion_matrix(best_nn, X_n, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the best NN model: Total data set')
plt.show()

[Figure: Confusion matrix of the best NN model, total data set]

X_median = np.median(X_n, axis=0).reshape((1,-1)) # (1,D)
proba_median = best_nn.predict_proba(X_median)
proba_median.shape
(1, 2)
proba_median
array([[0.00660987, 0.99339013]])
D = X.shape[1]
importance = []
for i in range(D):
    # same one-sided sensitivity as before, now for the tuned network
    x_median_i_perturbed = X_median.copy()
    x_median_i_perturbed[0,i] += 0.00001

    # query best_nn here, not the base clf: mixing the two models measures the constant
    # probability offset between them rather than local sensitivity, which is what
    # produced the near-constant values printed below
    proba_median_perturbed = best_nn.predict_proba(x_median_i_perturbed)
    imp = abs(proba_median_perturbed[:,1] - proba_median[:,1])/0.00001
    importance.append(imp)
importance = np.array(importance).reshape(-1)
    
feature_indices = np.arange(len(importance))

plt.figure(figsize=(10, 5))
plt.bar(feature_indices,importance)
plt.xticks(feature_indices, snp.columns[1:-1], rotation='vertical')
plt.ylabel('Feature Importance')
plt.grid(True)

print(importance)
[4145.61522665 4145.61588503 4145.63653883 4145.66668341 4145.57640143
 4145.57031636 4145.61261794 4145.64475052 4145.70354995 4145.69686826
 4145.65729145 4145.52495906 4145.5087473  4145.7467095  4145.60899699
 4145.57133638 4145.55394156 4145.66414448 4145.53445769 4145.57249507
 4145.63074292]

[Figure: Perturbation-based feature importances of the best NN model]

snp2021=pd.read_csv('2021final.csv')
snp2021.head()
Name R OI NI Asset Equity Debt FCFF g Tech ... TEV d PER PSR PBR EV/EBITDA CAPEX emp EBITDA Survival
0 LyondellBasell Industries NV 27753.0 2181.0 1420.0 35403.0 8104.0 17832.0 -61.000 -15.6471 0 ... 43379.60441 5.0272 21.070848 1.078103 0.845143 12.472572 1947.0 19200 3478.0 1
1 Verizon Communications Inc 134238.0 33943.0 22040.0 353457.0 78489.0 178985.0 -24466.875 -0.9872 0 ... 380963.27000 5.1006 9.559268 1.569498 0.596073 7.917107 18192.0 132200 48119.0 1
2 Broadcom Inc 26510.0 7646.0 6071.0 75880.0 24367.0 40457.0 11894.375 7.0428 1 ... 289388.77680 2.5962 42.828163 9.807989 3.426592 25.942517 463.0 21000 11155.0 1
3 Boeing Co/The 62797.0 -3517.0 -8479.0 146846.0 -14266.0 62419.0 -2700.500 -24.1648 0 ... 163111.60290 0.0000 -14.213186 1.919098 0.820680 -25.434524 1303.0 141000 -6413.0 1
4 Caterpillar Inc 48408.0 6984.0 5149.0 80784.0 16695.0 36792.0 3257.750 -12.6553 0 ... 138327.97790 2.1822 21.375020 2.273591 1.362398 18.964625 2115.0 97300 7294.0 1

5 rows × 23 columns

# apply the scaler fitted on the 2019 features: both models were trained on
# min-max scaled inputs, so the raw 2021 values must be transformed the same way
X21 = scaler.transform(snp2021.iloc[:,1:-1])
y_pred21 = best_rfc.predict(X21)
print(y_pred21)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
y_pred_n21 = best_nn.predict(X21)
print(y_pred_n21)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
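Both models label every 2021 constituent a survivor, so the hard predictions alone cannot answer research purpose b). The predicted probabilities, however, still rank the constituents by delisting risk; a sketch of that ranking (my addition, reusing the fitted models):

# rank 2021 constituents by predicted probability of the "delisted" class (label 0)
risk = pd.DataFrame({
    'Name': snp2021['Name'],
    'rf_p_delist': best_rfc.predict_proba(X21)[:, 0],
    'nn_p_delist': best_nn.predict_proba(X21)[:, 0],
})
print(risk.sort_values('rf_p_delist', ascending = False).head(10))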

Conclusion and inference

A) Prediction conclusions

i) The RF and NN models trained on the 2001 data predict that 8 companies in the 2021 list cannot survive the next 20 years. [Figure 13]

ii) The two RF models trained on the 2001 and 2009 data and the NN model trained on the 2001 data anticipate that 6 stocks will be excluded from the S&P 500 within 20 years. [Figure 30]

iii) For an investor seeking long-term profit from the stock market, the models recommend avoiding the stocks in [Figure 13] and [Figure 30]. By not investing in them, the investor can avoid the losses the models anticipate.

iv) The models trained on the 2019 data predict that every stock in the 2021 list will survive the next two years. Since some companies are delisted every year, this result is clearly wrong, but it suggests that the current public data on the companies comprising the S&P 500 does not reveal any critical problems.

B) Importance [Figure 31]

i) In the long term, the dividend rate and net income are the most crucial features for survival. Since investors expect dividends from long-term holdings, and a large dividend rate and net income signal a company's stable operation, it is reasonable that these features matter most.

ii) From a relatively short-term perspective, current market capitalization is the most important factor for survival. This can be read as the S&P 500 treating market capitalization as the best summary of business results, in line with efficient market theory.

iii) Among the balance-sheet factors, total enterprise value (TEV) outperformed total equity, total assets, and total debt. TEV represents the consolidated value of the business, reflecting asset, debt, and equity value together, which is why valuation experts regard it as the best criterion for evaluating a target company. The result that TEV dominates therefore matches valuation practice.

iv) Alongside EV, EV/EBITDA is commonly treated as an important factor by valuation experts. Moreover, while not as precise as FCFF, EBITDA approximates cash flow by adding back non-cash expenses such as depreciation and amortization, and cash flow matters more than accounting profit when valuing a company. The high ranks of EBITDA and EV/EBITDA therefore also reflect valuation reality.

v) Capital expenditure (CAPEX) is an investment in future growth. Despite the common belief that professional investors do not care about future growth, CAPEX played a significant role in the classification, so we can infer that professional investors also weigh future expansion.

vi) Some researchers claim that the S&P 500 deliberately concentrates on the Tech industry. However, whether a company is in the Tech industry contributed nothing to the classification. I therefore infer that financial data matters more for survival than a company's industry, and that Tech companies are included for their performance, not their sector. Likewise, whether the company originated in the US is never a crucial factor for survival.

Limitations and Further Research

A) Limitations

i) Insufficient delisting data in the 2019 list: only 5% of the companies were delisted by 2021, so the classification models cannot sufficiently learn which characteristics drive delisting. Although the RF and NN models trained on the 2019 data exceed 97% accuracy, their prediction that every company in the 2021 list will survive is clearly wrong. (One cheap mitigation is sketched after this list.)

ii) The number of hidden nodes in the neural network models: with 100-150 hidden nodes the model's performance was 73.6%, while with 200-250 hidden nodes it was 74.4%. Based on this, I expect performance to keep rising as the number of hidden nodes grows. However, training a model with more than 200 nodes took over 20 minutes per run, so I could not try larger sizes.

iii) Limited data access: several other factors could affect survival, such as finer industry categories, the CEO's background, the gap between financial analysts' forecasts and the current price, and text data.
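For limitation i), one cheap mitigation would be to reweight the minority class during training, so the 22 delisted stocks carry as much total weight as the 419 survivors; a sketch under the assumption that the 2019 train/test split above is reused (not something run in this study):

# class_weight='balanced' scales sample weights inversely to class frequency
rfc_balanced = RandomForestClassifier(n_estimators = 12, class_weight = 'balanced',
                                      random_state = 2021)
rfc_balanced.fit(X_train, y_train)
print(f"balanced RF test accuracy: {rfc_balanced.score(X_test, y_test):0.4f}")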

B) Further Research

i) Which companies will be included in the S&P 500: forecasting delisting is a passive strategy that avoids expected losses, whereas predicting inclusion is an active strategy, since a stock's price soars once the news that S&P will add it to the index is released. An investor who can predict which stocks will be added can profit from that surge.

ii) Collecting the restricted data described above and retraining the models with it.