@ghego
Last active February 21, 2018 18:52
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(1)
# train.csv comes from the Titanic dataset provided by
# Kaggle (https://www.kaggle.com/c/titanic/data)
data = pd.read_csv('./train.csv')
# Preprocess data
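# One-hot encode the categorical fields. Pclass is stored as an integer,
# so it must be named in `columns`; otherwise get_dummies passes it through
# unencoded and it is lost when 'Pclass' is dropped below.
dummy_fields = ['Pclass', 'Embarked', 'Sex']
dummies = pd.get_dummies(data[dummy_fields], columns=dummy_fields, dtype=int)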
data = pd.concat([data, dummies], axis=1)
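# Drop unused fields, the original categorical columns (now encoded),
# and Sex_male, which is redundant given Sex_female
fields_to_drop = ['PassengerId', 'Ticket', 'Parch',
                  'Name', 'Cabin', 'Fare', 'Pclass',
                  'Embarked', 'Sex', 'Sex_male']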
data = data.drop(fields_to_drop, axis=1)
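# Standardize Age to zero mean and unit variance (mean/std skip NaNs)
mean, std = data['Age'].mean(), data['Age'].std()
data['Age'] = (data['Age'] - mean) / std
# Fill missing ages with 0, which is the mean after standardization
data = data.fillna(0)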
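# Shuffle the rows (reproducible because of the NumPy seed set above)
data = data.sample(frac=1).reset_index(drop=True)
# Feature matrix and label column vector of shape (n_samples, 1)
X = data.drop('Survived', axis=1).values
y = data[['Survived']].values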
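

# A minimal sanity check, not part of the original gist. The helper name
# `check_design_matrix` is hypothetical. It assumes X should be a 2-D numeric
# array with no missing values and y a column vector holding one 0/1 label
# per row of X, which is what the preprocessing above appears to aim for.
import numpy as np  # repeated here so the snippet also runs on its own


def check_design_matrix(X, y):
    # Casting to float raises ValueError if a non-numeric column slipped through
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    assert X.ndim == 2, "X should have shape (n_samples, n_features)"
    assert y.shape == (X.shape[0], 1), "y should have shape (n_samples, 1)"
    assert not np.isnan(X).any(), "X still contains missing values"
    assert set(np.unique(y)) <= {0, 1}, "labels should be binary"
    return X.shape


# Possible use once X and y exist:
# n_samples, n_features = check_design_matrix(X, y)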