[Kaggle Getting Started] Titanic Data Science Solutions
Published: 2019-06-14


Titanic Data Science Solutions


A data science competition workflow has seven parts:

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.

The workflow pursues seven kinds of goals:

  1. Classifying: classify or categorize our samples; we may also want to understand the implications or correlation of different classes with our solution goal.
  2. Correlating: correlating certain features may help in creating, completing, or correcting features.
  3. Converting: for instance, converting text categorical values to numeric values.
  4. Completing: estimate any missing values within a feature.
  5. Correcting: detect any outliers among our samples or features, and discard a feature if it does not contribute to the analysis or may significantly skew the results.
  6. Creating: create new features based on an existing feature or a set of features (correlation, conversion, completeness...).
  7. Charting: select the right visualization plots and charts.

1. Problem Definition

Question or problem definition

  1. The question or problem definition for Titanic Survival competition
    Given a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, for a test dataset that does not contain the survival information, whether those passengers survived or not?
  2. Some early understanding about the domain of our problem.
    On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. That translates to a 32% survival rate.
    One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
    Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning models
from sklearn.linear_model import LogisticRegression   # logistic regression
from sklearn.svm import SVC, LinearSVC                # support vector machines
from sklearn.ensemble import RandomForestClassifier   # random forest
from sklearn.neighbors import KNeighborsClassifier    # k-nearest neighbors
from sklearn.naive_bayes import GaussianNB            # Gaussian naive Bayes
from sklearn.linear_model import Perceptron           # perceptron
from sklearn.linear_model import SGDClassifier        # stochastic gradient descent classifier
from sklearn.tree import DecisionTreeClassifier       # decision tree

2. Acquire Data

Acquire training and testing data

train_df = pd.read_csv('data/train.csv')  # read the CSV into a DataFrame with pandas' read_csv
test_df = pd.read_csv('data/test.csv')
combine = [train_df, test_df]  # keep both frames in one list so the same cleaning steps can be applied to train and test

3. Analyze Data

Analyze and explore the data

Analyze by describing data

print(train_df.columns.values)  # print the feature (column) names
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']
# preview the data
train_df.head()  # first 5 rows by default
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train_df.tail()  # last 5 rows by default
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
train_df.info()
print('_'*40)
test_df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
  1. Which features are categorical?
    Categorical: Survived, Sex, and Embarked.
    Ordinal: Pclass.
  2. Which features are numerical?
    Continuous: Age, Fare.
    Discrete: SibSp, Parch.
  3. Which features are mixed data types?
    Ticket is a mix of numeric and alphanumeric data types.
    Cabin is alphanumeric.
  4. Which features may contain errors or typos?
    Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
  5. Which features contain blank, null or empty values?
    Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
    Cabin > Age are incomplete in case of test dataset.
  6. What are the data types for various features?
    Seven features are integer or floats. Six in case of test dataset.
    Five features are strings (object).
train_df.describe()  # summary statistics: count, mean, std, min, max, 25%, 50%, 75%
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train_df.describe(include=['O'])  # describe the object (string) features: unique values and most frequent value
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Beane, Mrs. Edward (Ethel Clarke) male 1601 B96 B98 S
freq 1 577 7 4 644
  1. What is the distribution of numerical feature values across the samples?
    Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
    Survived is a categorical feature with 0 or 1 values.
    Around 38% of samples survived, compared with the actual survival rate of 32%.
    Most passengers (> 75%) did not travel with parents or children.
    Nearly 30% of the passengers had siblings and/or spouse aboard.
    Fares varied significantly with few passengers (<1%) paying as high as 512.
    Few elderly passengers (<1%) within age range 65-80.
  2. What is the distribution of categorical features?
    Names are unique across the dataset (count=unique=891).
    Sex variable has two possible values, with 65% male (top=male, freq=577/count=891).
    Cabin values have several duplicates across samples. Alternatively, several passengers shared a cabin.
    Embarked takes three possible values. S port used by most passengers (top=S).
    Ticket feature has high ratio (22%) of duplicate values (unique=681).

Assumptions based on data analysis

Correlating:

We want to know how well each feature correlates with Survival. We want to do this early in our project, and match these quick correlations with modelled correlations later in the project.
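A cheap way to get such quick correlations for the numeric columns is pandas' corr(). The snippet below is an illustrative sketch, not part of the original kernel; it only covers numeric features at this stage (the pivot tables and plots later handle categorical ones like Sex and Embarked).

# Quick Pearson correlations of numeric features with Survived, before any feature engineering.
# Non-numeric columns (Name, Sex, Ticket, Cabin, Embarked) are excluded here.
quick_corr = train_df.select_dtypes(include='number').corr()['Survived']
print(quick_corr.sort_values(ascending=False))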

Completing:

  1. We may want to complete Age feature as it is definitely correlated to survival.
  2. We may want to complete the Embarked feature as it may also correlate with survival or another important feature.

Correcting:

  1. Ticket feature may be dropped from our analysis as it contains high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.
  2. Cabin feature may be dropped as it is highly incomplete or contains many null values both in training and test dataset.
  3. PassengerId may be dropped from training dataset as it does not contribute to survival.
  4. Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

Creating:

  1. We may want to create a new feature called Family based on Parch and SibSp to get total count of family members on board.
  2. We may want to engineer the Name feature to extract Title as a new feature.
  3. We may want to create a new feature for Age bands. This turns a continuous numerical feature into an ordinal categorical feature.
  4. We may also want to create a Fare range feature if it helps our analysis.

Classifying:

We may also add to our assumptions based on the problem description noted earlier.

  1. Women (Sex=female) were more likely to have survived.
  2. Children (Age<?) were more likely to have survived.
  3. The upper-class passengers (Pclass=1) were more likely to have survived.

Analyze by pivoting features

We can only do so at this stage for features which do not have any empty values.

It also makes sense to do so only for features which are categorical (Sex), ordinal (Pclass) or discrete (SibSp, Parch).

# use groupby to see how a feature relates to the target
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363

Pclass: We observe a significant correlation (>0.5) between Pclass=1 and Survived (classifying #3). We decide to include this feature in our model.

train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Sex Survived
0 female 0.742038
1 male 0.188908

Sex: We confirm the observation during problem definition that Sex=female had very high survival rate at 74% (classifying #1).

train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000

SibSp and Parch: These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features (creating #1).

Analyze by visualizing data

Correlating numerical features

g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

[Figure: Age histograms, faceted by Survived]

Correlating numerical and ordinal features

# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

[Figure: Age histograms, faceted by Pclass (rows) and Survived (columns)]

Correlating categorical features

# grid = sns.FacetGrid(train_df, col='Embarked')
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()

[Figure: point plots of survival rate vs Pclass, split by Sex, faceted by Embarked]

Correlating categorical and numerical features

# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()

[Figure: bar plots of Fare by Sex, faceted by Embarked (rows) and Survived (columns)]

4. Wrangle Data

Wrangle, prepare, cleanse the data.

Correcting by dropping features

Drop the Cabin (correcting #2) and Ticket (correcting #1) features.

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)combine = [train_df, test_df]print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
Before (891, 12) (418, 11) (891, 12) (418, 11)
After (891, 10) (418, 9) (891, 10) (418, 9)

Creating new feature extracting from existing

for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
Title Survived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Fare Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 3
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 3
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
((891, 9), (418, 9))

Converting a categorical feature

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 22.0 1 0 7.2500 S 1
1 1 1 1 38.0 1 0 71.2833 C 3
2 1 3 1 26.0 0 0 7.9250 S 2
3 1 1 1 35.0 1 0 53.1000 S 3
4 0 3 0 35.0 0 0 8.0500 S 1
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

[Figure: Age histograms, faceted by Pclass (rows) and Sex (columns)]

guess_ages = np.zeros((2,3))
guess_ages
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                               (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5

    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 22 1 0 7.2500 S 1
1 1 1 1 38 1 0 71.2833 C 3
2 1 3 1 26 0 0 7.9250 S 2
3 1 1 1 35 1 0 53.1000 S 3
4 0 3 0 35 0 0 8.0500 S 1
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
AgeBand Survived
0 (-0.08, 16.0] 0.550000
1 (16.0, 32.0] 0.337374
2 (32.0, 48.0] 0.412037
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4  # top band

train_df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Title AgeBand
0 0 3 0 1 1 0 7.2500 S 1 (16.0, 32.0]
1 1 1 1 2 1 0 71.2833 C 3 (32.0, 48.0]
2 1 3 1 1 0 0 7.9250 S 2 (16.0, 32.0]
3 1 1 1 2 1 0 53.1000 S 3 (32.0, 48.0]
4 0 3 0 2 0 0 8.0500 S 1 (32.0, 48.0]
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Title
0 0 3 0 1 1 0 7.2500 S 1
1 1 1 1 2 1 0 71.2833 C 3
2 1 3 1 1 0 0 7.9250 S 2
3 1 1 1 2 1 0 53.1000 S 3
4 0 3 0 2 0 0 8.0500 S 1

Create new feature combining existing features

for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
FamilySize Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
IsAlone Survived
0 0 0.505650
1 1 0.303538
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]

train_df.head()
Survived Pclass Sex Age Fare Embarked Title IsAlone
0 0 3 0 1 7.2500 S 1 0
1 1 1 1 2 71.2833 C 3 0
2 1 3 1 1 7.9250 S 2 1
3 1 1 1 2 53.1000 S 3 0
4 0 3 0 2 8.0500 S 1 1
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
Age*Class Age Pclass
0 3 1 3
1 2 2 1
2 3 1 3
3 2 2 1
4 6 2 3
5 3 1 3
6 3 3 1
7 0 0 3
8 3 1 3
9 0 0 2

Completing a categorical feature

freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
'S'
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009

Converting categorical feature to numeric

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()
Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 0 3 0 1 7.2500 0 1 0 3
1 1 1 1 2 71.2833 1 3 0 2
2 1 3 1 1 7.9250 0 2 1 3
3 1 1 1 2 53.1000 0 3 0 2
4 0 3 0 2 8.0500 0 1 1 6

Quick completing and converting a numeric feature

test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 892 3 0 2 7.8292 2 1 1 6
1 893 3 1 2 7.0000 0 3 0 6
2 894 2 0 3 9.6875 2 1 1 6
3 895 3 0 1 8.6625 0 1 1 3
4 896 3 1 1 12.2875 0 3 0 3
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
FareBand Survived
0 (-0.001, 7.91] 0.197309
1 (7.91, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]

train_df.head(10)
Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 0 3 0 1 0 0 1 0 3
1 1 1 1 2 3 1 3 0 2
2 1 3 1 1 1 0 2 1 3
3 1 1 1 2 3 0 3 0 2
4 0 3 0 2 1 0 1 1 6
5 0 3 0 1 1 2 1 1 3
6 0 1 0 3 3 0 1 1 3
7 0 3 0 0 2 0 4 0 0
8 1 3 1 1 1 0 3 0 3
9 1 2 1 0 2 1 3 0 0
test_df.head(10)
PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
0 892 3 0 2 0 2 1 1 6
1 893 3 1 2 0 0 3 0 6
2 894 2 0 3 1 2 1 1 6
3 895 3 0 1 1 0 1 1 3
4 896 3 1 1 1 0 3 0 3
5 897 3 0 0 1 0 1 1 0
6 898 3 1 1 0 2 2 1 3
7 899 2 0 1 2 0 1 0 2
8 900 3 1 1 0 1 3 1 3
9 901 3 0 1 2 0 1 0 3

5. Model and Predict

Model, predict and solve

With these two criteria - Supervised Learning plus Classification and Regression, we can narrow down our choice of models to a few. These include:

  • Logistic Regression
  • KNN or k-Nearest Neighbors
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision Tree
  • Random Forest
  • Perceptron
  • Artificial neural network
  • RVM or Relevance Vector Machine
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
((891, 8), (891,), (418, 8))

Logistic Regression

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
80.359999999999999

Logistic Regression is a useful model to run early in the workflow.

coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Correlation', ascending=False)
Feature Correlation
1 Sex 2.201528
5 Title 0.398234
2 Age 0.287164
4 Embarked 0.261762
6 IsAlone 0.129142
3 Fare -0.085150
7 Age*Class -0.311201
0 Pclass -0.749006

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), while negative coefficients decrease the log-odds (and thus decrease the probability); a short sketch after the list below makes this concrete.

  • Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
  • Inversely as Pclass increases, probability of Survived=1 decreases the most.
  • This way Age*Class is a good artificial feature to model as it has second highest negative correlation with Survived.
  • So is Title, which has the second highest positive correlation.
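As a sanity check, the fitted coefficients can be turned into a probability by hand through the sigmoid function. This is a minimal sketch, not part of the original kernel; it assumes the logreg model and X_train defined above.

# Compute the log-odds for one passenger from the fitted coefficients, then map to a probability.
x = X_train.iloc[0].values                    # feature vector of the first passenger
log_odds = logreg.intercept_[0] + np.dot(logreg.coef_[0], x)
prob = 1.0 / (1.0 + np.exp(-log_odds))        # sigmoid: log-odds -> probability of Survived=1
print(prob)                                   # should match logreg.predict_proba(X_train.iloc[[0]])[0, 1]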

Support Vector Machines

# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
83.840000000000003

Support Vector Machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

KNN

# k-Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
84.739999999999995

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression.

Naive Bayes classifiers

# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
72.280000000000001

Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem.

Perceptron

# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
78.0

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not).
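For intuition, the decision rule behind the fitted perceptron is simply a thresholded linear combination of the features. The snippet below is a minimal sketch, not part of the original kernel; it assumes the perceptron model and X_train defined above.

# Perceptron decision rule: predict class 1 if w . x + b > 0, else class 0.
x = X_train.iloc[0].values
score = np.dot(perceptron.coef_[0], x) + perceptron.intercept_[0]
pred = int(score > 0)
print(pred)  # agrees with perceptron.predict(X_train.iloc[[0]])[0]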

# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
79.010000000000005
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
75.079999999999998

Decision Tree

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
86.760000000000005

Random Forest

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
86.760000000000005

Random Forests is one of the most popular models.

6. Model Evaluation

Model evaluation

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
Model Score
3 Random Forest 86.76
8 Decision Tree 86.76
1 KNN 84.74
0 Support Vector Machines 83.84
2 Logistic Regression 80.36
7 Linear SVC 79.01
5 Perceptron 78.00
6 Stochastic Gradient Descent 75.08
4 Naive Bayes 72.28

While both Decision Tree and Random Forest score the same, we choose Random Forest because it corrects for decision trees' habit of overfitting to their training set.
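Since all of the scores above are training-set accuracy, a held-out estimate gives a better sense of that overfitting. The snippet below is a minimal sketch, not part of the original kernel; it uses scikit-learn's cross_val_score on the same X_train / Y_train.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy; expect it to be noticeably lower than the
# ~86.8% training accuracy reported above if the model is overfitting.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(rf, X_train, Y_train, cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())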

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
# submission.to_csv('../output/submission.csv', index=False)

Reposted from: https://www.cnblogs.com/daigz1224/p/7028485.html

查看>>