Titanic Data Challenge

Introduction

Kaggle Description:

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Goal: To predict whether or not a passenger will survive the sinking of the Titanic based on provided information

Let's begin! First, let's do some boilerplate setup.

Imports

%reload_ext autoreload
%autoreload 2

# custom helpers
from helpers.helper import get_splits
# data handling
import numpy as np
import pandas as pd
# output
from termcolor import cprint
import matplotlib.pyplot as plt
import seaborn as sns

cprint('All Modules Imported!', 'green')
All Modules Imported!

Data Import

import os
os.listdir('./data/')
['gender_submission.csv', 'test.csv', 'train.csv']
train_data = pd.read_csv('./data/train.csv', index_col='PassengerId')
test_data = pd.read_csv('./data/test.csv', index_col='PassengerId')

cprint('Data Imported!', 'green')
cprint('Training Data Example:', 'cyan')
display(train_data)
Data Imported!
Training Data Example:

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q

891 rows × 11 columns

Process

  1. Figure out which features we can safely drop/keep.
  2. Encode features that need encoding (label encoding, categorical encoding).
  3. Start feature engineering some new columns so we have a wider prediction set.
  4. Do feature selection to determine which features are not needed and find the best combination of features to use.
  5. Research and test what models would be best for our situation and train/test different models.
  6. Train and predict on the train/test sets.
  7. Finally, output everything to a new CSV.

Tools

Our current model options are LightGBM, RandomForestRegressor, and ExtraTreesRegressor (though since Survived is a binary target, the classifier variants are the better fit — the pipeline below ends up using LGBMClassifier).

I also want to use a pipeline to keep everything organized into various steps.

Getting Started

Feature Engineering

So the columns we have are:

Variable | Definition | Key
survival | Survived or not | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
age | Age in years |
sibsp | Num of siblings / spouses aboard |
parch | Num of parents / children aboard |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

Looking at these descriptions, we can probably disregard

  • Name
  • Ticket number

Ticket number is a … maybe, as we aren't entirely sure how ticket numbers are handed out.
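
If we did want to drop them outright, it's a one-liner; a quick sketch (the _slim names are just illustrative — the cells below simply leave these columns out of the feature lists instead):

# Hypothetical: drop the columns we don't plan to use
# (the cells below leave them out of numerical_cols / categorical_cols instead)
train_data_slim = train_data.drop(columns=['Name', 'Ticket'])
test_data_slim = test_data.drop(columns=['Name', 'Ticket'])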

Let's get to engineering.

Model Creation Steps

  1. Choose feature cols based on feature table and relevant data
  2. Split into train/valid/test sets
  3. Generate features
    1. Interactions
  4. Set up the pipeline
    1. Imputation to fill in N/A values
    2. Categorical encoding, CatBoost (see the encoder sketch after this list)
    3. Standardize values
    4. Feature Selection
  5. Train
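
Before wiring everything into the pipeline, here's a quick standalone look at what the CatBoost encoding step (4.2) does. This is only an illustrative sketch with an arbitrary column choice, not part of the final pipeline:

# Sketch only: what CatBoost target encoding does, on a couple of columns
from category_encoders import CatBoostEncoder

cb_enc = CatBoostEncoder(cols=['Sex', 'Embarked'])
# the encoder needs the target, since each category is replaced with a
# running estimate of the target mean for that category
encoded = cb_enc.fit_transform(train_data[['Sex', 'Embarked']], train_data['Survived'])
display(encoded.head())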

Feature Generation / Engineering

# 1. Choose our feature cols based on feature table above
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_cols = ['Pclass', 'Sex', 'Cabin', 'Embarked']
target_col = 'Survived'
# Let's make some features
from itertools import combinations

# Build pairwise interaction features from the categorical columns,
# e.g. Pclass_Sex with values like "3_male"
interactions = pd.DataFrame(index=train_data.index)
for comb in combinations(categorical_cols, 2):
    new_feat = comb[0] + "_" + comb[1]
    interactions[new_feat] = train_data[comb[0]].astype(str) + "_" + train_data[comb[1]].astype(str)
    # track the new column so the pipeline encodes it with the other categoricals
    categorical_cols.append(new_feat)
train_data = train_data.join(interactions)
display(train_data)
# 2. Split sets
train, valid, _ = get_splits(train_data)
X_train = train.drop([target_col], axis=1)
y_train = train[target_col]
X_valid = valid.drop([target_col], axis=1)
y_valid = valid[target_col]
display(X_train)

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Pclass_Sex | Pclass_Cabin | Pclass_Embarked | Sex_Cabin | Sex_Embarked | Cabin_Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 3_male | 3_nan | 3_S | male_nan | male_S | nan_S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1_female | 1_C85 | 1_C | female_C85 | female_C | C85_C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3_female | 3_nan | 3_S | female_nan | female_S | nan_S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1_female | 1_C123 | 1_S | female_C123 | female_S | C123_S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3_male | 3_nan | 3_S | male_nan | male_S | nan_S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S | 2_male | 2_nan | 2_S | male_nan | male_S | nan_S
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 1_female | 1_B42 | 1_S | female_B42 | female_S | B42_S
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S | 3_female | 3_nan | 3_S | female_nan | female_S | nan_S
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 1_male | 1_C148 | 1_C | male_C148 | male_C | C148_C
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q | 3_male | 3_nan | 3_Q | male_nan | male_Q | nan_Q

891 rows × 17 columns

PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Pclass_Sex | Pclass_Cabin | Pclass_Embarked | Sex_Cabin | Sex_Embarked | Cabin_Embarked
1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 3_male | 3_nan | 3_S | male_nan | male_S | nan_S
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1_female | 1_C85 | 1_C | female_C85 | female_C | C85_C
3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3_female | 3_nan | 3_S | female_nan | female_S | nan_S
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1_female | 1_C123 | 1_S | female_C123 | female_S | C123_S
5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3_male | 3_nan | 3_S | male_nan | male_S | nan_S
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
709 | 1 | Cleaver, Miss. Alice | female | 22.0 | 0 | 0 | 113781 | 151.5500 | NaN | S | 1_female | 1_nan | 1_S | female_nan | female_S | nan_S
710 | 3 | Moubarek, Master. Halim Gonios ("William George") | male | NaN | 1 | 1 | 2661 | 15.2458 | NaN | C | 3_male | 3_nan | 3_C | male_nan | male_C | nan_C
711 | 1 | Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") | female | 24.0 | 0 | 0 | PC 17482 | 49.5042 | C90 | C | 1_female | 1_C90 | 1_C | female_C90 | female_C | C90_C
712 | 1 | Klaber, Mr. Herman | male | NaN | 0 | 0 | 113028 | 26.5500 | C124 | S | 1_male | 1_C124 | 1_S | male_C124 | male_S | C124_S
713 | 1 | Taylor, Mr. Elmer Zebley | male | 48.0 | 1 | 0 | 19996 | 52.0000 | C126 | S | 1_male | 1_C126 | 1_S | male_C126 | male_S | C126_S

713 rows × 16 columns
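
A note on get_splits: it comes from the local helpers module and isn't shown in this notebook. Judging by the output above (X_train keeps 713 of the 891 rows), it behaves like a roughly 80/10/10 split; a minimal stand-in might look like the sketch below. This is an assumption about the helper, not its actual code — the real version may shuffle or stratify.

# Hypothetical stand-in for helpers.helper.get_splits -- NOT the real helper.
# A plain 80/10/10 row-order split into train/valid/test frames.
def get_splits_sketch(df, train_frac=0.8, valid_frac=0.1):
    train_end = round(len(df) * train_frac)
    valid_end = round(len(df) * (train_frac + valid_frac))
    return df.iloc[:train_end], df.iloc[train_end:valid_end], df.iloc[valid_end:]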

Pipeline

# machine learning
from sklearn import feature_selection
from sklearn import preprocessing
# Pipeline code originally copied from 3_intermediate_training_summary
from helpers.helper import PipelineFS
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
# conda install -c conda-forge category_encoders
from category_encoders import CatBoostEncoder
# conda install -c conda-forge lightgbm
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def get_lgb_pipeline_score(X, y, params={'n_estimators':10,'num_leaves':64,'rate':0.1,'early_stopping_rounds':10}):
    """
    Run the LightGBM pipeline with the provided parameters.
    Scores via 5-fold cross-validation.

    params: dict of parameters with the following keys
        n_estimators: number of estimators for the lgb model, Default: 10
        num_leaves: num_leaves in the lgb model, Default: 64
        rate: learning rate, Default: 0.1
        early_stopping_rounds: rounds without improvement before stopping early, Default: 10 (not currently passed to the model)
    """
    # Preprocessing for numerical data (fill in NA)
    numerical_transformer = PipelineFS(
        steps=[
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]
    )

    # Preprocessing for categorical data
    categorical_transformer = PipelineFS(
        steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('catboost', CatBoostEncoder())
        ]
    )

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ]
    )

    model = lgb.LGBMClassifier(n_estimators=params['n_estimators'], num_leaves=params['num_leaves'], learning_rate=params['rate'])

    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = PipelineFS(
        steps=[
            ('preprocessor', preprocessor),
            ('model', model)
        ],
        verbose=False
    )
    # Preprocessing of training data, fit model 
    # my_pipeline.fit(X_train, y_train)
    # cprint('Fit!', 'green')
    # Preprocessing of validation data, get predictions
    scores = cross_val_score(my_pipeline, X, y, cv=5)
    return scores.mean()
# get_lgb_pipeline_score(X_train, y_train)

# results = {}
# params={'n_estimators':10,'num_leaves':64,'rate':0.1,'early_stopping_rounds':10}
# for i in range(50, 1001, 50):
#     params['n_estimators'] = i
#     results[i] = get_lgb_pipeline_score(X_train, y_train, params=params)
# print(results)
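
For reference, here is how the scoring function and the sweep above would get used. This is a usage sketch, not output from an actual run:

# Usage sketch: score the pipeline once, then pick the winner from a sweep
score = get_lgb_pipeline_score(X_train, y_train)
cprint(f'Baseline 5-fold CV score: {score:.4f}', 'cyan')

# assuming the loop above has been run and `results` maps n_estimators -> mean CV score
# best_n = max(results, key=results.get)
# cprint(f'Best n_estimators: {best_n} ({results[best_n]:.4f})', 'green')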

Basic steps for the future

  • Deal with missing values
  • Deal with zeroes
  • Encoding categorical variables
  • Creating new features
    • Interactions
  • Transforming features
    • Normalization/Outliers
  • Feature Analysis
    • Feature types
    • How each feature correlates to the target (see the quick sketch after this list)
    • Graphs Graphs Graphs
  • Model fitting
    • outlier removal
    • optimization
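
As a quick example of the feature-vs-target point above, seaborn (already imported) makes this easy. A small sketch using the columns we already have:

# Quick feature-vs-target graphs: survival rate by Sex and by Pclass
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.barplot(x='Sex', y='Survived', data=train_data, ax=axes[0])
sns.barplot(x='Pclass', y='Survived', data=train_data, ax=axes[1])
plt.show()

# and a quick correlation check of the numerical columns against the target
print(train_data[numerical_cols + ['Survived']].corr()['Survived'])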

Also, should we use Pipelines from now on?

  • Yes
  • No

Notes from example kernel

So far so good, but compared to others…not great. Below are some notes after looking at some example kernels.

  • Combine the train and test data sets into a single dataframe to make feature transformation easier (see the sketch after these notes).
    • Get the ids (indexes) of both the test and train sets so that they can be extracted again later.
  • When filling in N/A values, be sure to figure out what N/A ACTUALLY means. Do we need to persist that value as something else? E.g., in the housing data, if Garage is N/A, that just means the house doesn't have a garage, which is information we want to keep.
  • When creating features, it takes a little brain/common sense to figure out interactions. One way is to just create a ton, but to find INTERESTING interactions you really gotta sit and think.
    • For example, with the housing data, instead of just comparing raw area values (large properties will have large areas and high sale prices by default), what may be more interesting is the fraction of the total area taken up by each room type.
  • GRAPHS
    • USE SCATTERPLOTS. They are amazing for visualizing the correlation of a feature vs. the target.
    • Same as above, GRAPH GRAPH GRAPH. Another useful graph is plotting the different means of the target. This can help to find outliers.
    • Correlation. How do the current features correlate to the target?
  • Optimization tricks:
    • Use a model to predict values for filling in NaN.
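
To make the first note concrete, here's a sketch of combining the two sets while keeping their ids, plus one example of a "thought-out" feature in the spirit of the housing example. The FamilySize/FarePerPerson names are illustrative, not something used earlier in this notebook:

# Sketch: combine train and test so feature transformations only happen once
# (in practice this would go before the interaction step above, on the raw frames)
train_ids = train_data.index
test_ids = test_data.index
combined = pd.concat([train_data.drop(columns=['Survived']), test_data], sort=False)

# one example of a more "interesting" engineered feature:
# fare per person travelling on the same booking
combined['FamilySize'] = combined['SibSp'] + combined['Parch'] + 1
combined['FarePerPerson'] = combined['Fare'] / combined['FamilySize']

# pull the two sets back apart when it's time to train / predict
train_part = combined.loc[train_ids]
test_part = combined.loc[test_ids]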
