Let's Try Kaggle! Predicting Titanic Survivors

I tried the Kaggle Titanic survival prediction competition. The goal here is to walk through the whole process, from data visualization through modeling, rather than to chase prediction accuracy.

I also visualized the data in Tableau, which you can use to get an overview of the data beforehand.

Data source

Loading the data

In [1]:
# modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_theme()
In [2]:
# data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
In [4]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Age, Cabin, and Embarked contain null values.
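A quick way to confirm this is to count the nulls per column (a minimal sketch, using the same train DataFrame):

# number of missing values in each column
train.isnull().sum()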

In [5]:
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
In [6]:
train.describe()
Out[6]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Data visualization

Who were the passengers?

Sex and age

In [7]:
# sex
sns.catplot(x="Sex", kind="count", data=train)
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x7fb1067a16d0>
In [8]:
train['Sex'].value_counts(normalize=True)
Out[8]:
male      0.647587
female    0.352413
Name: Sex, dtype: float64

Roughly 65% of the passengers are male.

In [9]:
# age distribution by sex
sns.displot(train, x="Age", hue="Sex")
Out[9]:
<seaborn.axisgrid.FacetGrid at 0x7fb1067cdb80>
In [10]:
train["Age2"] = pd.cut(train["Age"], bins=np.arange(0,80,10),right=False)
In [11]:
pd.crosstab(train["Age2"], train["Sex"], normalize=True)
Out[11]:
Sex         female      male
Age2                        
[0, 10)   0.042433  0.045262
[10, 20)  0.063649  0.080622
[20, 30)  0.101839  0.209335
[30, 40)  0.084866  0.151344
[40, 50)  0.045262  0.080622
[50, 60)  0.025460  0.042433
[60, 70)  0.005658  0.021216
In [12]:
pd.crosstab(train["Age2"], train["Sex"], normalize="index")
Out[12]:
Sex         female      male
Age2                        
[0, 10)   0.483871  0.516129
[10, 20)  0.441176  0.558824
[20, 30)  0.327273  0.672727
[30, 40)  0.359281  0.640719
[40, 50)  0.359551  0.640449
[50, 60)  0.375000  0.625000
[60, 70)  0.210526  0.789474

Passengers in their 20s are the largest group at 31% (67% of them male), followed by those in their 30s at 24% (64% male). For the analysis ahead, it seems better to separate out children, so I define passengers under 16 as children.

In [13]:
# define passengers under 16 as children
def male_female_child(passenger):
    age, sex = passenger
    # NaN ages fail the comparison and keep the sex label
    if age < 16:
        return 'child'
    else:
        return sex

train['person'] = train[['Age','Sex']].apply(male_female_child, axis=1)
In [14]:
train["person"].value_counts()
Out[14]:
male      537
female    271
child      83
Name: person, dtype: int64
In [15]:
train["person"].value_counts(normalize=True)
Out[15]:
male      0.602694
female    0.304153
child     0.093154
Name: person, dtype: float64
In [16]:
sns.displot(train, x="Age", hue="person", kind="kde", fill=True)
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x7fb102d35b80>

Alone or with family

SibSp is the number of siblings/spouses aboard; Parch is the number of parents/children aboard.

In [18]:
train['SibSp'].value_counts()
Out[18]:
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
In [19]:
train['Parch'].value_counts()
Out[19]:
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
In [20]:
# total family members aboard, and an Alone / With Family flag
train['Family'] = train['SibSp'] + train['Parch']
train['Alone'] = train['Family'].apply(lambda x: 'With Family' if x > 0 else 'Alone')
In [22]:
sns.catplot(x='Alone', kind="count", data=train)
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7fb102cbabb0>
In [23]:
train["Alone"].value_counts(normalize=True)
Out[23]:
Alone          0.602694
With Family    0.397306
Name: Alone, dtype: float64
In [24]:
sns.catplot(x="Family", kind="count", data=train)
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x7fb102c91fd0>
In [25]:
train["Family"].value_counts(normalize=True)
Out[25]:
0     0.602694
1     0.180696
2     0.114478
3     0.032548
5     0.024691
4     0.016835
6     0.013468
10    0.007856
7     0.006734
Name: Family, dtype: float64

60% of the passengers traveled alone; 18% were with one family member and 11% with two.

Ticket class

In [26]:
# sex by ticket class
sns.catplot(x="Pclass", hue="Sex", kind="count", data=train)
Out[26]:
<seaborn.axisgrid.FacetGrid at 0x7fb105f8ca30>
In [27]:
pd.crosstab(train['Pclass'],train['Sex'])
Out[27]:
Sex     female  male
Pclass              
1           94   122
2           76   108
3          144   347
In [28]:
pd.crosstab(train['Pclass'],train['Sex'], normalize=True)
Out[28]:
Sex       female      male
Pclass                    
1       0.105499  0.136925
2       0.085297  0.121212
3       0.161616  0.389450
In [29]:
pd.crosstab(train['Pclass'],train['Sex'], normalize='index')
Out[29]:
Sex       female      male
Pclass                    
1       0.435185  0.564815
2       0.413043  0.586957
3       0.293279  0.706721

Class 3 has the most passengers (55% of the total), with class 3 males alone accounting for 39% of everyone aboard. Within class 3, 71% of passengers are male, more than 10 points higher than in the other classes.

In [30]:
sns.catplot(x="Pclass", hue="person", kind="count", data=train)
Out[30]:
<seaborn.axisgrid.FacetGrid at 0x7fb102b7c0a0>
In [31]:
pd.crosstab(train["Pclass"], train["person"], normalize="index")
Out[31]:
person     child    female      male
Pclass                              
1       0.027778  0.421296  0.550926
2       0.103261  0.358696  0.538043
3       0.118126  0.232179  0.649695

The lower the class (the higher the Pclass number), the higher the share of children and the lower the share of women.

In [32]:
sns.displot(train, x="Age", hue="Pclass", kind="kde", fill=True)
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7fb102ca16a0>

Which port did they board from?

In [34]:
sns.catplot(x="Embarked", kind="count", data=train)
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x7fb102aad8e0>

C = Cherbourg, Q = Queenstown, S = Southampton
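For more readable plots, the codes could first be mapped to the port names; a sketch (the Port column is a hypothetical convenience, not used below):

# map embarkation codes to full port names (illustration only)
train['Port'] = train['Embarked'].map({'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'})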

In [35]:
train["Embarked"].value_counts(normalize=True)
Out[35]:
S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64
In [36]:
sns.catplot(x="Embarked", hue="Pclass", kind="count", data=train)
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x7fb102a29490>
In [37]:
pd.crosstab(train["Embarked"], train["Pclass"], normalize='index')
Out[37]:
Pclass           1         2         3
Embarked                              
C         0.505952  0.101190  0.392857
Q         0.025974  0.038961  0.935065
S         0.197205  0.254658  0.548137

Southampton accounts for 72% of embarkations and is the main source of passengers. Judging from the ticket classes, Cherbourg passengers skew relatively wealthy, while Queenstown looks like a more working-class area.

Which cabins were they in?

In [39]:
# keep only the deck letter of the cabin; leave missing values as-is
train['Cabin'] = train['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
In [40]:
# the single "T" cabin is an outlier; treat it as missing
train['Cabin'] = train['Cabin'].apply(lambda x: np.NaN if x == "T" else x)
In [41]:
sns.catplot(x="Cabin", kind="count", data=train, 
            order=["A","B","C","D","E","F","G"], palette="Blues_r")
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7fb102b39790>
In [42]:
train['Cabin'].value_counts(normalize=True)
Out[42]:
C    0.290640
B    0.231527
D    0.162562
E    0.157635
A    0.073892
F    0.064039
G    0.019704
Name: Cabin, dtype: float64

Modeling

Passengers with a high probability of survival

In [43]:
# reload the data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
In [44]:
# split off the target variable
train_x = train.drop(["Survived"], axis=1)
train_y = train["Survived"]
In [45]:
test_x = test.copy()

Features

In [46]:
from sklearn.preprocessing import LabelEncoder
In [48]:
# drop columns we won't use as features
train_x = train_x.drop(["PassengerId","Name","Ticket"], axis=1)
test_x = test_x.drop(["PassengerId","Name","Ticket"], axis=1)

# reduce Cabin to its deck letter, treating the lone "T" cabin as missing
train_x['Cabin'] = train_x['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
train_x['Cabin'] = train_x['Cabin'].apply(lambda x: np.NaN if x == "T" else x)
test_x['Cabin'] = test_x['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
test_x['Cabin'] = test_x['Cabin'].apply(lambda x: np.NaN if x == "T" else x)
In [49]:
# label encoding
for c in ['Sex', 'Embarked', 'Cabin']:
    # fit the encoder on the training data to fix the mapping
    le = LabelEncoder()
    le.fit(train_x[c].fillna('NA'))

    # transform both train and test data
    # (transform raises on labels unseen during fit; none occur here)
    train_x[c] = le.transform(train_x[c].fillna('NA'))
    test_x[c] = le.transform(test_x[c].fillna('NA'))
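To see which integer each category was mapped to, you can inspect a fitted encoder; a small sketch (here `le` is the Cabin encoder left over from the last loop iteration, which explains the 7s for missing Cabin in the table below):

# mapping learned by the last fitted encoder (Cabin); 'NA' sorts after 'G', so it encodes to 7
dict(zip(le.classes_, le.transform(le.classes_)))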
In [50]:
train_x.head()
Out[50]:
   Pclass  Sex   Age  SibSp  Parch     Fare  Cabin  Embarked
0       3    1  22.0      1      0   7.2500      7         3
1       1    0  38.0      1      0  71.2833      2         0
2       3    0  26.0      0      0   7.9250      7         3
3       1    0  35.0      1      0  53.1000      2         3
4       3    1  35.0      0      0   8.0500      7         3

Building a model

In [51]:
from xgboost import XGBClassifier
In [52]:
# build the model
model = XGBClassifier(n_estimators=20, random_state=0)
model.fit(train_x, train_y)
Out[52]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=20, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [53]:
# predicted probability of survival, thresholded at 0.5 for the label
pred = model.predict_proba(test_x)[:, 1]
pred_label = np.where(pred > 0.5, 1, 0)
In [54]:
# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_first.csv', index=False)

Score on Kaggle: 0.77511

Validation

In [55]:
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import KFold
In [56]:
# containers for per-fold scores, and the CV splitter
scores_accuracy = []
scores_logloss = []
kf = KFold(n_splits=4, shuffle=True, random_state=0)
In [57]:
for tr_idx, va_idx in kf.split(train_x):
    # split into training and validation folds
    tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

    model = XGBClassifier(n_estimators=20, random_state=0)
    model.fit(tr_x, tr_y)
    va_pred = model.predict_proba(va_x)[:, 1]

    # score each fold on log loss and accuracy
    logloss = log_loss(va_y, va_pred)
    accuracy = accuracy_score(va_y, va_pred > 0.5)

    scores_logloss.append(logloss)
    scores_accuracy.append(accuracy)
In [58]:
logloss = np.mean(scores_logloss)
accuracy = np.mean(scores_accuracy)
print('logloss:{:.4f}, accuracy:{:.4f}'.format(logloss,accuracy))
logloss:0.4388, accuracy:0.8272
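For reference, log loss is the negative mean log-likelihood of the predicted probabilities; a minimal sketch of the binary case, equivalent to sklearn's log_loss:

import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    # clip probabilities away from 0 and 1 to avoid log(0),
    # then average the negative log-likelihood over all samples
    y = np.asarray(y_true)
    p = np.clip(np.asarray(p_pred), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))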

Model tuning

In [59]:
import itertools
In [60]:
# hyperparameter candidates
param_space = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1.0, 2.0, 4.0]
}

# all combinations of the hyperparameters
param_combinations = itertools.product(
    param_space['max_depth'],
    param_space['min_child_weight']
)
In [61]:
params = []
scores = []
In [62]:
for max_depth, min_child_weight in param_combinations:
    score_folds = []

    # 4-fold cross-validation for each parameter combination
    kf = KFold(n_splits=4, shuffle=True, random_state=0)
    for tr_idx, va_idx in kf.split(train_x):
        tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
        tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

        model = XGBClassifier(n_estimators=20, random_state=0,
                              max_depth=max_depth, min_child_weight=min_child_weight)
        model.fit(tr_x, tr_y)

        va_pred = model.predict_proba(va_x)[:, 1]
        logloss = log_loss(va_y, va_pred)
        score_folds.append(logloss)

    # mean log loss across the folds
    score_mean = np.mean(score_folds)

    params.append((max_depth, min_child_weight))
    scores.append(score_mean)
In [63]:
scores
Out[63]:
[0.4203921794270626,
 0.4152452225412223,
 0.42597380259439005,
 0.4332257781268632,
 0.4326991807009023,
 0.43098039436766405,
 0.4458128212407607,
 0.44579364216805906,
 0.4303601269663239]
In [64]:
# pick the parameter combination with the lowest mean log loss
best_idx = np.argsort(scores)[0]
best_param = params[best_idx]
In [65]:
print('max_depth: {}, min_child_weight:{}'
      .format(best_param[0],best_param[1]))
max_depth: 3, min_child_weight:2.0
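The same search could also be written with scikit-learn's GridSearchCV; a sketch under the same setup (scoring is negative log loss because sklearn maximizes scores):

from sklearn.model_selection import GridSearchCV, KFold
from xgboost import XGBClassifier

# equivalent grid search over the same parameter space
gs = GridSearchCV(
    XGBClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [3, 5, 7], 'min_child_weight': [1.0, 2.0, 4.0]},
    scoring='neg_log_loss',
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
)
gs.fit(train_x, train_y)
print(gs.best_params_)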
In [66]:
# build the model with the tuned parameters
model = XGBClassifier(n_estimators=20, random_state=0,
                     max_depth=3, min_child_weight=2.0)
model.fit(train_x, train_y)

# predictions
pred = model.predict_proba(test_x)[:, 1]
pred_label = np.where(pred > 0.5, 1, 0)

# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_second.csv', index=False)

Score on Kaggle: 0.76076 (slightly below the first submission, despite the better CV log loss)

Ensembling

Ensemble gradient-boosted trees (xgboost) with logistic regression.

Preparing the data for logistic regression

In [67]:
from sklearn.preprocessing import OneHotEncoder

train_x2 = train.drop(['Survived'], axis=1)
test_x2 = test.copy()
In [68]:
# drop columns we won't use as features
train_x2 = train_x2.drop(["PassengerId","Name","Ticket"], axis=1)
test_x2 = test_x2.drop(["PassengerId","Name","Ticket"], axis=1)
In [69]:
# reduce Cabin to its deck letter, treating the lone "T" cabin as missing
train_x2['Cabin'] = train_x2['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
train_x2['Cabin'] = train_x2['Cabin'].apply(lambda x: np.NaN if x == "T" else x)
test_x2['Cabin'] = test_x2['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
test_x2['Cabin'] = test_x2['Cabin'].apply(lambda x: np.NaN if x == "T" else x)

# one-hot encoding, fitted on the training data only
cat_cols = ['Sex', 'Embarked', 'Pclass', 'Cabin']
ohe = OneHotEncoder(categories='auto', sparse=False)
ohe.fit(train_x2[cat_cols].fillna('NA'))
Out[69]:
OneHotEncoder(sparse=False)
In [70]:
# column names for the one-hot encoded dummy variables
ohe_columns = []
for i, c in enumerate(cat_cols):
    ohe_columns += [f'{c}_{v}' for v in ohe.categories_[i]]

# apply the one-hot encoding
ohe_train_x2 = pd.DataFrame(ohe.transform(train_x2[cat_cols].fillna('NA')), columns=ohe_columns)
ohe_test_x2 = pd.DataFrame(ohe.transform(test_x2[cat_cols].fillna('NA')), columns=ohe_columns)

# drop the original categorical columns now encoded
train_x2 = train_x2.drop(cat_cols, axis=1)
test_x2 = test_x2.drop(cat_cols, axis=1)

# join the one-hot encoded columns back in
train_x2 = pd.concat([train_x2, ohe_train_x2], axis=1)
test_x2 = pd.concat([test_x2, ohe_test_x2], axis=1)

# fill numeric missing values with the training-data mean
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
for col in num_cols:
    train_x2[col].fillna(train_x2[col].mean(), inplace=True)
    test_x2[col].fillna(train_x2[col].mean(), inplace=True)

# log-transform Fare (log1p maps zero fares to 0 rather than -inf)
train_x2['Fare'] = np.log1p(train_x2['Fare'])
test_x2['Fare'] = np.log1p(test_x2['Fare'])

Ensemble learning

In [71]:
from sklearn.linear_model import LogisticRegression

# xgboost model with the tuned parameters
model_xgb = XGBClassifier(n_estimators=20, random_state=0,
                          max_depth=3, min_child_weight=2.0)
model_xgb.fit(train_x, train_y)
pred_xgb = model_xgb.predict_proba(test_x)[:, 1]

# logistic regression model
model_lr = LogisticRegression(solver='lbfgs', max_iter=300)
model_lr.fit(train_x2, train_y)
pred_lr = model_lr.predict_proba(test_x2)[:, 1]

# weighted average of the two models' predicted probabilities
pred = pred_xgb * 0.8 + pred_lr * 0.2
pred_label = np.where(pred > 0.5, 1, 0)
In [72]:
# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_third.csv', index=False)

Score on Kaggle: 0.76555