18 minute read

[Notice] Journey to becoming an academic researcher: this is the story of how I became a more insightful researcher.

The S&P 500 is the most famous and influential stock index, tracking the performance of 500 large companies selected by S&P. When S&P adds a stock to the index, the inclusion sends a positive signal to the market, and index-tracking funds must buy the stock, so its price tends to rise. When a stock is removed from the index, the exclusion sends a negative signal, the funds must sell, and the price tends to fall; a stock's price generally declines by about 15% over the long term after it is dropped from the index. Because inclusion in and exclusion from the S&P 500 move stock prices, predicting which companies will be delisted can help investors avoid losses.

Research purposes

a) What characteristics affect delisting from the S&P 500? Using Random Forest and Neural Network models, I investigate which characteristics of a company influence its removal from the index.

b) Which stocks will be delisted from the index? Among the current constituents, I identify the stocks most exposed to the risk of delisting.

from IPython.core.display import display, HTML
import warnings
warnings.filterwarnings(action = 'ignore')
display(HTML("<style>.container { width:100% !important; }</style>"))
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import matplotlib
from matplotlib.colors import ListedColormap
matplotlib.rcParams['axes.unicode_minus'] = False

from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; use ConfusionMatrixDisplay on newer versions
from tqdm import tqdm

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
snp=pd.read_csv('2019final.csv')
snp.head()
Name R OI NI Asset Equity Debt FCFF g Tech ... TEV d PER PSR PBR EV/EBITDA CAPEX emp EBITDA Survival
0 News Corp 10074.0 599.0 155.0 15711.0 10311.0 1489.0 561.500 11.2539 0 ... 10944.99989 0.0000 54.154838 0.833234 0.534275 10.524038 572.0 28000 1040.0 1
1 LyondellBasell Industries NV 34727.0 4232.0 3390.0 30435.0 8179.0 13782.0 1644.750 0.3517 0 ... 45028.65465 4.4453 9.292229 0.907094 1.035014 8.209417 2694.0 19100 5485.0 1
2 Verizon Communications Inc 131868.0 31521.0 19265.0 291727.0 62835.0 133920.0 12563.375 2.2883 0 ... 384178.15040 4.0065 13.181269 1.925692 0.870462 7.970005 17939.0 135000 48203.0 1
3 Broadcom Inc 22597.0 4331.0 2724.0 67493.0 24970.0 32798.0 11167.375 13.1945 1 ... 153482.31920 4.1136 46.149163 5.563142 1.862568 15.137816 432.0 19000 10139.0 1
4 Boeing Co/The 76559.0 -2102.0 -636.0 133625.0 -8300.0 28532.0 -2851.000 -9.7551 0 ... 198609.87210 0.0000 -288.262378 2.394687 1.372010 1175.206344 1834.0 161100 169.0 1

5 rows × 23 columns

snp = snp.replace({'na':np.nan})
snp = snp.dropna()
print(snp.shape)
(441, 23)
X = snp.iloc[:,1:-1]
y = snp.iloc[:,-1]
features = X.columns.tolist()
print(features)
['R', 'OI', 'NI', 'Asset', 'Equity', 'Debt', 'FCFF', 'g', 'Tech', 'old', 'US', 'Mkt', 'TEV', 'd', 'PER', 'PSR', 'PBR', 'EV/EBITDA', 'CAPEX', 'emp', 'EBITDA']
y.value_counts()
1    419
0     22
Name: Survival, dtype: int64
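Note the imbalance: only 22 of the 441 stocks were delisted. A classifier that always predicts "survive" is already about 95% accurate, so every accuracy figure below should be read against that baseline. A quick sanity check (my addition, not in the original run):

# majority-class baseline: always predict "survive" (label 1)
baseline = y.value_counts().max() / len(y)
print(f"majority-class baseline accuracy: {baseline:0.4f}")  # 419/441, about 0.9501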
# keep the fitted scaler so the 2021 data can be transformed consistently later
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
parameters = {'max_depth':range(3,500)}  # with 441 samples the tree saturates long before depth 500
print(parameters)
Mdls = GridSearchCV(tree.DecisionTreeClassifier(),parameters, n_jobs=4)
Mdls = Mdls.fit(X, y)
Mdl = Mdls.best_estimator_
print(Mdl)
imp = Mdl.feature_importances_

I = np.arange(imp.shape[0])
plt.subplots(1, figsize=(20, 5))
plt.bar(I,imp)
plt.xticks(I,features);
best_param = Mdl.get_params()
best_param
print(imp)
{'max_depth': range(3, 500)}
DecisionTreeClassifier(max_depth=158)
[0.         0.         0.03189412 0.         0.04784118 0.03189412
 0.         0.01410831 0.         0.0448063  0.         0.37289104
 0.         0.00439886 0.09459125 0.04100673 0.03827294 0.
 0.0977327  0.18056246 0.        ]

[Figure: Feature importances of the tuned decision tree]
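The bare array above is hard to map to features at a glance; a tiny helper (my addition, not in the original notebook) pairs names with importances:

# top-5 features by decision-tree (Gini) importance
for name, val in sorted(zip(features, imp), key=lambda t: -t[1])[:5]:
    print(f"{name:>10}: {val:0.4f}")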

# in-sample accuracy: a deep tree can memorize its training set, so 1.0 here says little about generalization
pred = Mdl.predict(X)
accuracy = np.mean(pred==y)
accuracy
1.0
plot_confusion_matrix(Mdl, X, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the basic DT model: Total data set')
plt.show()

[Figure: Confusion matrix of the basic DT model, total data set]
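The perfect score above is measured on the very data the tree was fit to. A quick cross-validated estimate (my addition, reusing the already-imported cross_val_score) gives a more honest read on generalization:

# 10-fold CV accuracy of the tuned tree; cross_val_score refits a clone per fold
cv_acc = cross_val_score(Mdl, X, y, cv=10, scoring='accuracy').mean()
print(f"10-fold CV accuracy of the tuned tree: {cv_acc:0.4f}")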

# note: with only 22 delisted stocks, stratify=y would keep the class ratio stable across the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfc = RandomForestClassifier(n_estimators = 10, random_state = 2021)
rfc.fit(X_train, y_train)
RandomForestClassifier(n_estimators=10, random_state=2021)
y_pred = rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of basic model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of basic model to test set")
print(f"accuracy: {acc:0.4f}")
The performance of basic model to train set
accuracy: 1.0000
The performance of basic model to test set
accuracy: 0.9474
cv_scores = []
estimator_list = [i + 1 for i in range(100)]  # actual tree counts, 1..100
for n in tqdm(estimator_list):
    rfc = RandomForestClassifier(n_estimators = n,
                                 n_jobs = -1, random_state = 2021)
    score = cross_val_score(rfc, X_train, y_train, cv=10, scoring = 'accuracy').mean()
    cv_scores.append(score)

best_e = [estimator_list[i] for i in range(len(cv_scores)) if cv_scores[i] == np.max(cv_scores)]
plt.figure(figsize = (20,8))
plt.plot(estimator_list, cv_scores, marker = 'o', linestyle = 'dashed', label = "Cross validation scores")
plt.legend(fontsize = 20)
plt.xlabel("The number of trees", fontsize = 20)
plt.ylabel("Accuracy", fontsize = 20)
plt.title("Accuracy Scores", fontsize = 25)
plt.axvline(best_e[0], color='r', linestyle = '--', linewidth=2)
plt.show()
100%|██████████| 100/100 [01:45<00:00,  1.05s/it]

[Figure: Cross-validation accuracy by number of trees]

print(f"the performance is the best when the number of tree is {(cv_scores.index(max(cv_scores)))+1}")
print("The performance(10 fold cross validation)")
print(f"Accuracy: {max(cv_scores): 0.4f}")
the performance is best when the number of trees is 12
The performance (10-fold cross validation)
Accuracy:  0.9675
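As an aside, the 10-fold sweep above refits 1,000 forests; random forests also come with a nearly free out-of-bag estimate, which would be a cheaper way to pick n_estimators (a sketch, not part of the original run):

# out-of-bag accuracy: each tree is scored on the bootstrap samples it never saw
rfc_oob = RandomForestClassifier(n_estimators = 100, oob_score = True, random_state = 2021)
rfc_oob.fit(X_train, y_train)
print(f"OOB accuracy: {rfc_oob.oob_score_:0.4f}")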
rfc = RandomForestClassifier(n_estimators = 90, random_state = 2021)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of adjusted model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of adjusted model to test set")
print(f"accuracy: {acc:0.4f}")
The performance of adjusted model to train set
accuracy: 1.0000
The performance of adjusted model to test set
accuracy: 0.9398
rfc_score = pd.DataFrame(cv_scores, columns = ['accuracy'])
rfc_score['accu_rank'] = rfc_score['accuracy'].rank(ascending = 0)
rfc_score = rfc_score.sort_values(by=['accu_rank'])
rfc_score.head()
accuracy accu_rank
40 0.967527 3.5
43 0.967527 3.5
42 0.967527 3.5
11 0.967527 3.5
29 0.967527 3.5
rfc = RandomForestClassifier()
param_grid = {
    'n_estimators' : [41, 44, 43, 12, 30],
    'max_depth' : [10, 15, 20, 25],
    'max_leaf_nodes' : [25, 30, 35],
    'criterion' : ['gini', 'entropy']
    # 'max_features' : ['auto', 'sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator = rfc, param_grid = param_grid, cv=10, verbose = 1, n_jobs=-1)
CV_rfc.fit(X_train, y_train)
Fitting 10 folds for each of 120 candidates, totalling 1200 fits





GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [10, 15, 20, 25],
                         'max_leaf_nodes': [25, 30, 35],
                         'n_estimators': [41, 44, 43, 12, 30]},
             verbose=1)
CV_rfc.best_params_
{'criterion': 'gini',
 'max_depth': 10,
 'max_leaf_nodes': 25,
 'n_estimators': 12}
result_table = pd.DataFrame(CV_rfc.cv_results_)
result_table = result_table.sort_values(by = 'mean_test_score', ascending = False)
print(result_table[['params', 'mean_test_score']])
                                               params  mean_test_score
58  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964409
3   {'criterion': 'gini', 'max_depth': 10, 'max_le...         0.964409
54  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964409
27  {'criterion': 'gini', 'max_depth': 15, 'max_le...         0.964409
46  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.964301
..                                                ...              ...
62  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.951398
93  {'criterion': 'entropy', 'max_depth': 20, 'max...         0.951398
56  {'criterion': 'gini', 'max_depth': 25, 'max_le...         0.951398
66  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.951398
73  {'criterion': 'entropy', 'max_depth': 10, 'max...         0.944946

[120 rows x 2 columns]
best_rfc = CV_rfc.best_estimator_
best_rfc.fit(X_train, y_train)
RandomForestClassifier(max_depth=10, max_leaf_nodes=25, n_estimators=12)
y_pred = best_rfc.predict(X_train)
acc = accuracy_score(y_true = y_train, y_pred = y_pred)
print("The performance of the best rfc model to train set")
print(f"accuracy: {acc:0.4f}")

y_pred = best_rfc.predict(X_test)
acc = accuracy_score(y_true = y_test, y_pred = y_pred)
print("The performance of the best rfc model to test set")
print(f"accuracy: {acc:0.4f}")

# Accuracy based on total dataset
y_pred = best_rfc.predict(X)
acc = accuracy_score(y_true = y, y_pred = y_pred)
print("The performance of the best rfc model to total data set")
print(f"accuracy: {acc:0.4f}")
The performance of the best rfc model to train set
accuracy: 0.9968
The performance of the best rfc model to test set
accuracy: 0.9248
The performance of the best rfc model to total data set
accuracy: 0.9751
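Since always predicting "survive" already scores about 95%, accuracy alone says little about the 22 delisted stocks. The metrics imported at the top (recall_score, precision_score) make the minority class easy to check (my addition):

# recall and precision for the minority "delisted" class (label 0) on the test set
y_pred = best_rfc.predict(X_test)
print(f"recall (delisted):    {recall_score(y_test, y_pred, pos_label = 0):0.4f}")
print(f"precision (delisted): {precision_score(y_test, y_pred, pos_label = 0):0.4f}")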
plot_confusion_matrix(best_rfc, X, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the best RF model: Total data set')
plt.show()

[Figure: Confusion matrix of the best RF model, total data set]

best_imp = best_rfc.feature_importances_

I = np.arange(best_imp.shape[0])
plt.subplots(1, figsize=(21, 5))
plt.bar(I,best_imp)
plt.xticks(I,features)
print(best_imp)
[0.07057966 0.00777979 0.03881357 0.02295456 0.01006312 0.01311264
 0.05430833 0.03474899 0.00752369 0.02462194 0.04721084 0.11345549
 0.08311333 0.01507277 0.06567141 0.16225725 0.04314205 0.08343326
 0.0517976  0.01191432 0.0384254 ]

[Figure: Feature importances of the best RF model]

S&P 500 Data: Neural Network Prediction

# X was already min-max scaled above, so this second scaling is a no-op; kept to mirror the original run
X_n = MinMaxScaler().fit_transform(X)
clf = MLPClassifier(random_state=1, hidden_layer_sizes=(7,), max_iter=1000)
clf.fit(X_n, y)

pred = clf.predict(X_n) # note, 1 is the positive ("survive") class
accuracy = accuracy_score(y,pred)

CM = confusion_matrix(y, pred, normalize='true')
true_positive = CM[1,1]
true_negative = CM[0,0]
false_positive = CM[0,1]
false_negative = CM[1,0]

accuracy  # equals the 419/441 majority baseline, so the base NN almost never predicts "delisted"
0.9501133786848073
plot_confusion_matrix(clf, X_n, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the base NN model: Total data set')
plt.show()

[Figure: Confusion matrix of the base NN model, total data set]

X_median = np.median(X_n, axis=0).reshape((1,-1)) # (1,D)
proba_median = clf.predict_proba(X_median)
proba_median.shape
(1, 2)
proba_median
array([[0.0480666, 0.9519334]])
D = X.shape[1]
importance = []
for i in range(D):
    # nudge feature i at the median point and measure the change in P(survive):
    # a one-sided finite-difference sensitivity |dP/dx_i|
    x_median_i_perturbed = X_median.copy()
    x_median_i_perturbed[0,i] += 0.00001

    proba_median_perturbed = clf.predict_proba(x_median_i_perturbed)
    imp = abs(proba_median_perturbed[:,1] - proba_median[:,1])/0.00001
    importance.append(imp)
importance = np.array(importance).reshape(-1)
    
feature_indices = np.arange(len(importance))

plt.figure(figsize=(10, 5))
plt.bar(feature_indices,importance)
plt.xticks(feature_indices, snp.columns[1:-1], rotation='vertical')
plt.ylabel('Feature Importance')
plt.grid(True)

print(importance)
[0.05773369 0.05707531 0.03642151 0.00627693 0.09655891 0.10264398
 0.0603424  0.02820982 0.03058961 0.02390792 0.01566889 0.14800129
 0.16421304 0.07374916 0.06396335 0.10162396 0.11901878 0.00881586
 0.13850265 0.10046528 0.04221742]

[Figure: Perturbation-based feature importances of the base NN model]
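The loop above takes a one-sided finite difference at the median point, which can be noisy if the network's output has a kink nearby. A central-difference variant (my sketch, not part of the original analysis) is a little more stable for the same cost:

def central_diff_importance(model, x0, eps = 1e-5):
    # |P(x0 + eps*e_i) - P(x0 - eps*e_i)| / (2*eps) for each feature i
    grads = []
    for i in range(x0.shape[1]):
        up, down = x0.copy(), x0.copy()
        up[0, i] += eps
        down[0, i] -= eps
        diff = model.predict_proba(up)[0, 1] - model.predict_proba(down)[0, 1]
        grads.append(abs(diff) / (2 * eps))
    return np.array(grads)

print(central_diff_importance(clf, X_median))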

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_n, y, test_size=0.3)
nn_clf = MLPClassifier(random_state=1, hidden_layer_sizes=(7), max_iter=1000)
nn_clf.fit(X_train_n, y_train_n)
MLPClassifier(hidden_layer_sizes=7, max_iter=1000, random_state=1)
hl_size = list(range(200, 251))
max_it = [1000, 2000, 5000, 10000]
cv_nn_scores = []

for j in range(len(max_it)):
    for i in tqdm(range(len(hl_size))):
        nn_clf = MLPClassifier(random_state=1, hidden_layer_sizes=hl_size[i], max_iter=max_it[j])
        score = cross_val_score(nn_clf, X_train_n, y_train_n, cv=10, scoring = 'accuracy').mean()
        cv_nn_scores.append(score)
    best_e = [hl_size[k] for k in range(len(cv_nn_scores)) if cv_nn_scores[k] == np.max(cv_nn_scores)]

    # Show the results
    plt.figure(figsize = (20,5))
    plt.plot(hl_size, cv_nn_scores, marker = 'o', linestyle = 'dashed', label = "Cross validation scores")
    plt.legend(fontsize = 10)
    plt.xlabel("The number of nodes", fontsize = 10)
    plt.ylabel("Accuracy", fontsize = 10)
    plt.title("Accuracy Scores", fontsize = 15)
    plt.axvline(best_e[0], color='r', linestyle = '--', linewidth=2)
    plt.show()

    # print the best result for this max_iter, then reset for the next sweep
    print(f"the performance is best with {hl_size[cv_nn_scores.index(max(cv_nn_scores))]} hidden nodes")
    print("The performance (10-fold cross validation)")
    print(f"Accuracy with Max_it {max_it[j]}: {max(cv_nn_scores): 0.4f}")
    cv_nn_scores = []
100%|██████████| 51/51 [05:48<00:00,  6.83s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 1000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 1000:  0.9547


100%|██████████| 51/51 [05:50<00:00,  6.87s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 2000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 2000:  0.9547


100%|██████████| 51/51 [05:59<00:00,  7.04s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 5000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 5000:  0.9547


100%|██████████| 51/51 [05:55<00:00,  6.97s/it]

[Figure: Cross-validation accuracy by number of hidden nodes, Max_it 10000]

the performance is best with 201 hidden nodes
The performance (10-fold cross validation)
Accuracy with Max_it 10000:  0.9547
best_nn = MLPClassifier(random_state=1, hidden_layer_sizes=(202), max_iter=10000)
best_nn.fit(X_train_n, y_train_n)
MLPClassifier(hidden_layer_sizes=202, max_iter=10000, random_state=1)
# Accuracy based on Training set
y_pred_n = best_nn.predict(X_train_n)
acc_n = accuracy_score(y_true = y_train_n, y_pred = y_pred_n)
print("The performance of the best NN model to train set")
print(f"accuracy: {acc_n:0.4f}")

# Accuracy based on Test set
y_pred_n = best_nn.predict(X_test_n)
acc_n = accuracy_score(y_true = y_test_n, y_pred = y_pred_n)
print("The performance of the best NN model to test set")
print(f"accuracy: {acc_n:0.4f}")

# Accuracy based on total dataset
y_pred_n = best_nn.predict(X_n)
acc_n = accuracy_score(y_true = y, y_pred = y_pred_n)
print("The performance of the best NN model to total data set")
print(f"accuracy: {acc_n:0.4f}")
The performance of the best NN model to train set
accuracy: 0.9870
The performance of the best NN model to test set
accuracy: 0.9398
The performance of the best NN model to total data set
accuracy: 0.9728
plot_confusion_matrix(best_nn, X_n, y, normalize = 'all', cmap=plt.cm.Blues)
plt.title('Confusion Matrix of the best NN model: Total data set')
plt.show()

[Figure: Confusion matrix of the best NN model, total data set]

X_median = np.median(X_n, axis=0).reshape((1,-1)) # (1,D)
proba_median = best_nn.predict_proba(X_median)
proba_median.shape
(1, 2)
proba_median
array([[0.00660987, 0.99339013]])
D = X.shape[1]
importance = []
for i in range(D):
    # same one-sided sensitivity as before, now for the tuned network
    x_median_i_perturbed = X_median.copy()
    x_median_i_perturbed[0,i] += 0.00001

    # query best_nn here, not the base clf: mixing the two models measures the constant
    # probability offset between them rather than local sensitivity, which is what
    # produced the near-constant values printed below
    proba_median_perturbed = best_nn.predict_proba(x_median_i_perturbed)
    imp = abs(proba_median_perturbed[:,1] - proba_median[:,1])/0.00001
    importance.append(imp)
importance = np.array(importance).reshape(-1)
    
feature_indices = np.arange(len(importance))

plt.figure(figsize=(10, 5))
plt.bar(feature_indices,importance)
plt.xticks(feature_indices, snp.columns[1:-1], rotation='vertical')
plt.ylabel('Feature Importance')
plt.grid(True)

print(importance)
[4145.61522665 4145.61588503 4145.63653883 4145.66668341 4145.57640143
 4145.57031636 4145.61261794 4145.64475052 4145.70354995 4145.69686826
 4145.65729145 4145.52495906 4145.5087473  4145.7467095  4145.60899699
 4145.57133638 4145.55394156 4145.66414448 4145.53445769 4145.57249507
 4145.63074292]

[Figure: Perturbation-based feature importances of the best NN model]

snp2021=pd.read_csv('2021final.csv')
snp2021.head()
Name R OI NI Asset Equity Debt FCFF g Tech ... TEV d PER PSR PBR EV/EBITDA CAPEX emp EBITDA Survival
0 LyondellBasell Industries NV 27753.0 2181.0 1420.0 35403.0 8104.0 17832.0 -61.000 -15.6471 0 ... 43379.60441 5.0272 21.070848 1.078103 0.845143 12.472572 1947.0 19200 3478.0 1
1 Verizon Communications Inc 134238.0 33943.0 22040.0 353457.0 78489.0 178985.0 -24466.875 -0.9872 0 ... 380963.27000 5.1006 9.559268 1.569498 0.596073 7.917107 18192.0 132200 48119.0 1
2 Broadcom Inc 26510.0 7646.0 6071.0 75880.0 24367.0 40457.0 11894.375 7.0428 1 ... 289388.77680 2.5962 42.828163 9.807989 3.426592 25.942517 463.0 21000 11155.0 1
3 Boeing Co/The 62797.0 -3517.0 -8479.0 146846.0 -14266.0 62419.0 -2700.500 -24.1648 0 ... 163111.60290 0.0000 -14.213186 1.919098 0.820680 -25.434524 1303.0 141000 -6413.0 1
4 Caterpillar Inc 48408.0 6984.0 5149.0 80784.0 16695.0 36792.0 3257.750 -12.6553 0 ... 138327.97790 2.1822 21.375020 2.273591 1.362398 18.964625 2115.0 97300 7294.0 1

5 rows × 23 columns

# apply the scaler fitted on the 2019 features: both models were trained on
# min-max scaled inputs, so the raw 2021 values must be transformed the same way
X21 = scaler.transform(snp2021.iloc[:,1:-1])
y_pred21 = best_rfc.predict(X21)
print(y_pred21)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
y_pred_n21 = best_nn.predict(X21)
print(y_pred_n21)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
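Both models label every 2021 constituent a survivor, so the hard predictions alone cannot answer research purpose b). The predicted probabilities, however, still rank the constituents by delisting risk; a sketch of that ranking (my addition, reusing the fitted models):

# rank 2021 constituents by predicted probability of the "delisted" class (label 0)
risk = pd.DataFrame({
    'Name': snp2021['Name'],
    'rf_p_delist': best_rfc.predict_proba(X21)[:, 0],
    'nn_p_delist': best_nn.predict_proba(X21)[:, 0],
})
print(risk.sort_values('rf_p_delist', ascending = False).head(10))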

Conclusion and inference

A) Prediction conclusions

i) The RF and NN models trained on the 2001 data predict that 8 companies in the 2021 list cannot survive the next 20 years. [Figure 13]

ii) The two RF models trained on the 2001 and 2009 data and the NN model trained on the 2001 data anticipate that 6 stocks will be excluded from the S&P 500 within 20 years. [Figure 30]

iii) For an investor seeking long-term profit from the stock market, the models recommend avoiding the stocks in [Figure 13] and [Figure 30]. By not investing in them, the investor can avoid the losses the models anticipate.

iv) The models trained on the 2019 data predict that every stock in the 2021 list will survive the next two years. Since some companies are delisted every year, this result is clearly wrong, but it suggests that the current public data on the companies comprising the S&P 500 does not reveal any critical problems.

B) Importance [Figure 31]

i) In the long term, the dividend rate and net income are the most crucial features for survival. Since investors expect dividends from long-term holdings, and a large dividend rate and net income signal a company's stable operation, it is reasonable that these features matter most.

ii) From a relatively short-term perspective, current market capitalization is the most important factor for survival. This can be read as the S&P 500 treating market capitalization as the best summary of business results, in line with efficient market theory.

iii) Among the balance-sheet factors, total enterprise value (TEV) outperformed total equity, total assets, and total debt. TEV represents the consolidated value of the business, reflecting asset, debt, and equity value together, which is why valuation experts regard it as the best criterion for evaluating a target company. The result that TEV dominates therefore matches valuation practice.

iv) Alongside EV, EV/EBITDA is commonly treated as an important factor by valuation experts. Moreover, while not as precise as FCFF, EBITDA approximates cash flow by adding back non-cash expenses such as depreciation and amortization, and cash flow matters more than accounting profit when valuing a company. The high ranks of EBITDA and EV/EBITDA therefore also reflect valuation reality.

v) Capital expenditure (CAPEX) is an investment in future growth. Despite the common belief that professional investors do not care about future growth, CAPEX played a significant role in the classification, so we can infer that professional investors also weigh future expansion.

vi) Some researchers claim that the S&P 500 deliberately concentrates on the Tech industry. However, whether a company is in the Tech industry contributed nothing to the classification. I therefore infer that financial data matters more for survival than a company's industry, and that Tech companies are included for their performance, not their sector. Likewise, whether the company originated in the US is never a crucial factor for survival.

Limitations and Further Research

A) Limitations

i) Insufficient delisting data in the 2019 list: only 5% of the companies were delisted by 2021, so the classification models cannot sufficiently learn which characteristics drive delisting. Although the RF and NN models trained on the 2019 data exceed 97% accuracy, their prediction that every company in the 2021 list will survive is clearly wrong. (One cheap mitigation is sketched after this list.)

ii) The number of hidden nodes in the neural network models: with 100-150 hidden nodes the model's performance was 73.6%, while with 200-250 hidden nodes it was 74.4%. Based on this, I expect performance to keep rising as the number of hidden nodes grows. However, training a model with more than 200 nodes took over 20 minutes per run, so I could not try larger sizes.

iii) Limited data access: several other factors could affect survival, such as finer industry categories, the CEO's background, the gap between financial analysts' forecasts and the current price, and text data.
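For limitation i), one cheap mitigation would be to reweight the minority class during training, so the 22 delisted stocks carry as much total weight as the 419 survivors; a sketch under the assumption that the 2019 train/test split above is reused (not something run in this study):

# class_weight='balanced' scales sample weights inversely to class frequency
rfc_balanced = RandomForestClassifier(n_estimators = 12, class_weight = 'balanced',
                                      random_state = 2021)
rfc_balanced.fit(X_train, y_train)
print(f"balanced RF test accuracy: {rfc_balanced.score(X_test, y_test):0.4f}")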

B) Further Research

i) Which companies will be included in the S&P 500: forecasting delisting is a passive strategy that avoids expected losses, whereas predicting inclusion is an active strategy, since a stock's price soars once the news that S&P will add it to the index is released. An investor who can predict which stocks will be added can profit from that surge.

ii) Collecting the restricted data described above and retraining the models with it.