I tried the Kaggle Titanic survival prediction. The goal was less prediction accuracy than walking through the whole process, from data visualization to modeling.
I also visualized the data in Tableau, which may be useful for getting an overview of the data beforehand.
In [1]:
# module
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series,DataFrame
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_theme()
In [2]:
# data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
In [4]:
train.info()
Age and Cabin contain null values (Embarked is also missing two entries).
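A quick way to see the missing counts per column directly (a supplementary check, not one of the original cells):
train.isnull().sum()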
In [5]:
test.info()
In [6]:
train.describe()
Out[6]:
Who were the passengers?
In [7]:
# sex distribution
sns.catplot(x="Sex", kind="count", data=train)
Out[7]:
In [8]:
train['Sex'].value_counts(normalize=True)
Out[8]:
64% of passengers are male.
In [9]:
# age distribution by sex
sns.displot(train, x="Age", hue="Sex")
Out[9]:
In [10]:
train["Age2"] = pd.cut(train["Age"], bins=np.arange(0,80,10),right=False)
In [11]:
pd.crosstab(train["Age2"], train["Sex"], normalize=True)
Out[11]:
In [12]:
pd.crosstab(train["Age2"], train["Sex"], normalize="index")
Out[12]:
Passengers in their 20s are the largest group at 31% (of whom 67% are male), followed by those in their 30s at 23% (64% male). For the analysis it seems better to treat children separately; here, passengers under 16 are defined as children.
In [13]:
# define passengers under 16 as children
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex

train['person'] = train[['Age','Sex']].apply(male_female_child, axis=1)
In [14]:
train["person"].value_counts()
Out[14]:
In [15]:
train["person"].value_counts(normalize=True)
Out[15]:
In [16]:
sns.displot(train, x="Age", hue="person", kind="kde", fill=True)
Out[16]:
SibSp is the number of siblings/spouses aboard; Parch is the number of parents/children aboard.
In [18]:
train['SibSp'].value_counts()
Out[18]:
In [19]:
train['Parch'].value_counts()
Out[19]:
In [20]:
train['Family'] = train['SibSp'] + train['Parch']
train['Alone'] = train['Family'].apply(lambda x : 'With Family' if x > 0 else 'Alone')
In [22]:
sns.catplot(x='Alone', kind="count", data=train)
Out[22]:
In [23]:
train["Alone"].value_counts(normalize=True)
Out[23]:
In [24]:
sns.catplot(x="Family", kind="count", data=train)
Out[24]:
In [25]:
train["Family"].value_counts(normalize=True)
Out[25]:
60% of passengers traveled alone; 18% traveled with one family member and 11% with two.
In [26]:
# sex by ticket class
sns.catplot(x="Pclass", hue="Sex", kind="count", data=train)
Out[26]:
In [27]:
pd.crosstab(train['Pclass'],train['Sex'])
Out[27]:
In [28]:
pd.crosstab(train['Pclass'],train['Sex'], normalize=True)
Out[28]:
In [29]:
pd.crosstab(train['Pclass'],train['Sex'], normalize='index')
Out[29]:
Class 3 is the largest group (55% of passengers; class-3 males alone account for 38% of the total). The share of men within class 3 is about 70%, more than 10 points higher than in the other classes.
In [30]:
sns.catplot(x="Pclass", hue="person", kind="count", data=train)
Out[30]:
In [31]:
pd.crosstab(train["Pclass"], train["person"], normalize="index")
Out[31]:
The lower the class (i.e., the higher the Pclass value), the higher the share of children and the lower the share of women.
In [32]:
sns.displot(train, x="Age", hue="Pclass", kind="kde", fill=True)
Out[32]:
In [34]:
sns.catplot(x="Embarked", kind="count", data=train)
Out[34]:
C = Cherbourg, Q = Queenstown, S = Southampton
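For more readable labels one could map the codes to port names before plotting (illustrative only; the single-letter codes are kept below):
train['Embarked'].map({'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}).value_counts()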
In [35]:
train["Embarked"].value_counts(normalize=True)
Out[35]:
In [36]:
sns.catplot(x="Embarked", hue="Pclass", kind="count", data=train)
Out[36]:
In [37]:
pd.crosstab(train["Embarked"], train["Pclass"], normalize='index')
Out[37]:
Southampton accounts for 72% of embarkations and is the main customer base. Judging from the ticket-class mix, Cherbourg seems relatively affluent while Queenstown looks more working-class.
In [39]:
train['Cabin'] = train['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)  # keep only the deck letter
In [40]:
train['Cabin'] = train['Cabin'].apply(lambda x: np.nan if x == "T" else x)  # treat the single "T" entry as missing
In [41]:
sns.catplot(x="Cabin", kind="count", data=train,
order=["A","B","C","D","E","F","G"], palette="Blues_r")
Out[41]:
In [42]:
train['Cabin'].value_counts(normalize=True)
Out[42]:
Which passengers were likely to survive?
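Before modeling, a quick sanity check of survival rates by sex and ticket class (a supplementary sketch, not one of the original cells) shows where most of the signal lies:
# mean of Survived per group = survival rate
print(train.groupby('Sex')['Survived'].mean())
print(train.groupby('Pclass')['Survived'].mean())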
In [43]:
# reload the data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
In [44]:
# separate the target variable
train_x = train.drop(["Survived"], axis=1)
train_y = train["Survived"]
In [45]:
test_x = test.copy()
In [46]:
from sklearn.preprocessing import LabelEncoder
In [48]:
# drop unused columns
train_x = train_x.drop(["PassengerId","Name","Ticket"], axis=1)
test_x = test_x.drop(["PassengerId","Name","Ticket"], axis=1)
# reduce Cabin to its deck letter; treat the single "T" entry as missing
train_x['Cabin'] = train_x['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
train_x['Cabin'] = train_x['Cabin'].apply(lambda x: np.nan if x == "T" else x)
test_x['Cabin'] = test_x['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
test_x['Cabin'] = test_x['Cabin'].apply(lambda x: np.nan if x == "T" else x)
In [49]:
# label encoding
for c in ['Sex', 'Embarked', 'Cabin']:
    # fit the encoder on the training data
    le = LabelEncoder()
    le.fit(train_x[c].fillna('NA'))
    # transform both training and test data
    train_x[c] = le.transform(train_x[c].fillna('NA'))
    test_x[c] = le.transform(test_x[c].fillna('NA'))
In [50]:
train_x.head()
Out[50]:
In [51]:
from xgboost import XGBClassifier
In [52]:
# build the model
model = XGBClassifier(n_estimators=20, random_state=0)
model.fit(train_x, train_y)
Out[52]:
In [53]:
# predictions
pred = model.predict_proba(test_x)[:, 1]
pred_label = np.where(pred > 0.5, 1, 0)
In [54]:
# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_first.csv', index=False)
Score on the Kaggle leaderboard: 0.77511
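To get a local estimate of generalization performance without uploading each time, the next cells evaluate the same model with 4-fold cross-validation.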
In [55]:
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import KFold
In [56]:
scores_accuracy = []
scores_logloss = []
kf = KFold(n_splits=4, shuffle=True, random_state=0)
In [57]:
for tr_idx, va_idx in kf.split(train_x):
    tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
    tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]
    model = XGBClassifier(n_estimators=20, random_state=0)
    model.fit(tr_x, tr_y)
    va_pred = model.predict_proba(va_x)[:, 1]
    logloss = log_loss(va_y, va_pred)
    accuracy = accuracy_score(va_y, va_pred > 0.5)
    scores_logloss.append(logloss)
    scores_accuracy.append(accuracy)
In [58]:
logloss = np.mean(scores_logloss)
accuracy = np.mean(scores_accuracy)
print('logloss:{:.4f}, accuracy:{:.4f}'.format(logloss,accuracy))
In [59]:
import itertools
In [60]:
# hyperparameter candidates
param_space = {
    'max_depth': [3, 5, 7],
    'min_child_weight': [1.0, 2.0, 4.0]
}
# all combinations of the hyperparameters
param_combinations = itertools.product(
    param_space['max_depth'],
    param_space['min_child_weight']
)
In [61]:
params = []
scores = []
In [62]:
for max_depth, min_child_weight in param_combinations:
    score_folds = []
    # evaluate each combination with 4-fold cross-validation
    kf = KFold(n_splits=4, shuffle=True, random_state=0)
    for tr_idx, va_idx in kf.split(train_x):
        tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
        tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]
        model = XGBClassifier(n_estimators=20, random_state=0,
                              max_depth=max_depth, min_child_weight=min_child_weight)
        model.fit(tr_x, tr_y)
        va_pred = model.predict_proba(va_x)[:, 1]
        logloss = log_loss(va_y, va_pred)
        score_folds.append(logloss)
    score_mean = np.mean(score_folds)
    params.append((max_depth, min_child_weight))
    scores.append(score_mean)
In [63]:
scores
Out[63]:
In [64]:
best_idx = np.argsort(scores)[0]  # index of the lowest mean logloss
best_param = params[best_idx]
In [65]:
print('max_depth: {}, min_child_weight:{}'
.format(best_param[0],best_param[1]))
In [66]:
# build the model with the tuned parameters
model = XGBClassifier(n_estimators=20, random_state=0,
                      max_depth=3, min_child_weight=2.0)
model.fit(train_x, train_y)
# predictions
pred = model.predict_proba(test_x)[:, 1]
pred_label = np.where(pred > 0.5, 1, 0)
# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_second.csv', index=False)
Score on the Kaggle leaderboard: 0.76076 (slightly below the first, untuned submission).
Ensembling gradient-boosted trees (xgboost) with logistic regression. Logistic regression needs one-hot encoded features and imputed numeric values, so the data is prepared again below.
In [67]:
from sklearn.preprocessing import OneHotEncoder
train_x2 = train.drop(['Survived'], axis=1)
test_x2 = test.copy()
In [68]:
# drop unused columns
train_x2 = train_x2.drop(["PassengerId","Name","Ticket"], axis=1)
test_x2 = test_x2.drop(["PassengerId","Name","Ticket"], axis=1)
In [69]:
# reduce Cabin to its deck letter; treat the single "T" entry as missing
train_x2['Cabin'] = train_x2['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
train_x2['Cabin'] = train_x2['Cabin'].apply(lambda x: np.nan if x == "T" else x)
test_x2['Cabin'] = test_x2['Cabin'].apply(lambda x: x[0] if pd.notna(x) else x)
test_x2['Cabin'] = test_x2['Cabin'].apply(lambda x: np.nan if x == "T" else x)
# one-hot encoding
cat_cols = ['Sex', 'Embarked', 'Pclass', 'Cabin']
ohe = OneHotEncoder(categories='auto', sparse=False)  # sparse= was renamed sparse_output= in scikit-learn 1.2+
ohe.fit(train_x2[cat_cols].fillna('NA'))
Out[69]:
In [70]:
# column names for the one-hot encoded dummy variables
ohe_columns = []
for i, c in enumerate(cat_cols):
    ohe_columns += [f'{c}_{v}' for v in ohe.categories_[i]]
# apply the one-hot encoding
ohe_train_x2 = pd.DataFrame(ohe.transform(train_x2[cat_cols].fillna('NA')), columns=ohe_columns)
ohe_test_x2 = pd.DataFrame(ohe.transform(test_x2[cat_cols].fillna('NA')), columns=ohe_columns)
# drop the original categorical columns
train_x2 = train_x2.drop(cat_cols, axis=1)
test_x2 = test_x2.drop(cat_cols, axis=1)
# join the one-hot encoded columns
train_x2 = pd.concat([train_x2, ohe_train_x2], axis=1)
test_x2 = pd.concat([test_x2, ohe_test_x2], axis=1)
# fill missing numeric values with the training-data mean
num_cols = ['Age', 'SibSp', 'Parch', 'Fare']
for col in num_cols:
    train_x2[col].fillna(train_x2[col].mean(), inplace=True)
    test_x2[col].fillna(train_x2[col].mean(), inplace=True)
# log-transform Fare
train_x2['Fare'] = np.log1p(train_x2['Fare'])
test_x2['Fare'] = np.log1p(test_x2['Fare'])
In [71]:
from sklearn.linear_model import LogisticRegression
# xgboost model (label-encoded features)
model_xgb = XGBClassifier(n_estimators=20, random_state=0,
                          max_depth=3, min_child_weight=2.0)
model_xgb.fit(train_x, train_y)
pred_xgb = model_xgb.predict_proba(test_x)[:, 1]
# logistic regression model (one-hot encoded features)
model_lr = LogisticRegression(solver='lbfgs', max_iter=300)
model_lr.fit(train_x2, train_y)
pred_lr = model_lr.predict_proba(test_x2)[:, 1]
# weighted average of the two models' predictions
pred = pred_xgb * 0.8 + pred_lr * 0.2
pred_label = np.where(pred > 0.5, 1, 0)
In [72]:
# submission file
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred_label})
submission.to_csv('submission_third.csv', index=False)
Score on the Kaggle leaderboard: 0.76555