Created
June 7, 2018 03:59
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we will import pandas to read our data from a CSV file and manipulate it for further use. We will also use numpy to convert our data into a format suitable for feeding our classification model. We'll use seaborn and matplotlib for visualizations. We will then import the logistic regression algorithm from sklearn, which will help us build our classification model. Lastly, we will use joblib to save our model for future use." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import pandas as pd\n", | |
| "import numpy as np\n", | |
| "import seaborn as sns\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "%matplotlib inline\n", | |
| " \n", | |
| "from sklearn.linear_model import LogisticRegression\n", | |
| "# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package\n", | |
| "import joblib" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We have our data saved in a CSV file called insurance3r2.csv. We first read our dataset into a pandas DataFrame called insuranceDF, and then use the head() function to show the first five records from our dataset." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " age sex bmi steps children smoker region charges \\\n", | |
| "0 19 0 27.900 3009 0 1 3 16884.92400 \n", | |
| "1 18 1 33.770 3008 1 0 2 1725.55230 \n", | |
| "2 28 1 33.000 3009 3 0 2 4449.46200 \n", | |
| "3 33 1 22.705 10009 0 0 1 21984.47061 \n", | |
| "4 32 1 28.880 8010 0 0 1 3866.85520 \n", | |
| "\n", | |
| " insuranceclaim \n", | |
| "0 1 \n", | |
| "1 1 \n", | |
| "2 0 \n", | |
| "3 0 \n", | |
| "4 1 \n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "insuranceDF = pd.read_csv('insurance3r2.csv')\n", | |
| "print(insuranceDF.head())" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The following features have been provided to help us predict whether a policyholder makes an insurance claim (the insuranceclaim column):\n", | |
| "\n", | |
| "- age: age of the policyholder\n", | |
| "- sex: gender of the policyholder (female=0, male=1)\n", | |
| "- bmi: body mass index (kg/m^2), an objective measure of body weight relative to height (weight divided by height squared); ideally 18.5 to 25\n", | |
| "- steps: average number of walking steps per day\n", | |
| "- children: number of children / dependents of the policyholder\n", | |
| "- smoker: smoking status of the policyholder (non-smoker=0, smoker=1)\n", | |
| "- region: residential area of the policyholder in the US (northeast=0, northwest=1, southeast=2, southwest=3)\n", | |
| "- charges: individual medical costs billed by health insurance" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's also make sure that our data is clean (has no null values, etc.)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": { | |
| "scrolled": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "<class 'pandas.core.frame.DataFrame'>\n", | |
| "RangeIndex: 1338 entries, 0 to 1337\n", | |
| "Data columns (total 9 columns):\n", | |
| "age 1338 non-null int64\n", | |
| "sex 1338 non-null int64\n", | |
| "bmi 1338 non-null float64\n", | |
| "steps 1338 non-null int64\n", | |
| "children 1338 non-null int64\n", | |
| "smoker 1338 non-null int64\n", | |
| "region 1338 non-null int64\n", | |
| "charges 1338 non-null float64\n", | |
| "insuranceclaim 1338 non-null int64\n", | |
| "dtypes: float64(2), int64(7)\n", | |
| "memory usage: 94.2 KB\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "insuranceDF.info()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Let's start by finding the correlation between every pair of features (and the outcome variable), and visualize the correlations using a heatmap." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " age sex bmi steps children smoker \\\n", | |
| "age 1.000000 -0.020856 0.109272 -0.167957 0.042469 -0.025019 \n", | |
| "sex -0.020856 1.000000 0.046371 -0.039470 0.017163 0.076185 \n", | |
| "bmi 0.109272 0.046371 1.000000 -0.681149 0.012759 0.003750 \n", | |
| "steps -0.167957 -0.039470 -0.681149 1.000000 0.055346 -0.267845 \n", | |
| "children 0.042469 0.017163 0.012759 0.055346 1.000000 0.007673 \n", | |
| "smoker -0.025019 0.076185 0.003750 -0.267845 0.007673 1.000000 \n", | |
| "region 0.002127 0.004588 0.157566 -0.076483 0.016569 -0.002181 \n", | |
| "charges 0.299008 0.057292 0.198341 -0.305570 0.067998 0.787251 \n", | |
| "insuranceclaim 0.113723 0.031565 0.384198 -0.419514 -0.409526 0.333261 \n", | |
| "\n", | |
| " region charges insuranceclaim \n", | |
| "age 0.002127 0.299008 0.113723 \n", | |
| "sex 0.004588 0.057292 0.031565 \n", | |
| "bmi 0.157566 0.198341 0.384198 \n", | |
| "steps -0.076483 -0.305570 -0.419514 \n", | |
| "children 0.016569 0.067998 -0.409526 \n", | |
| "smoker -0.002181 0.787251 0.333261 \n", | |
| "region 1.000000 -0.006208 0.020891 \n", | |
| "charges -0.006208 1.000000 0.309418 \n", | |
| "insuranceclaim 0.020891 0.309418 1.000000 \n" | |
| ] | |
| }, | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "<matplotlib.axes._subplots.AxesSubplot at 0x139a9fe0400>" | |
| ] | |
| }, | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| }, | |
| { | |
| "data": { | |
| "image/png": "\n", | |
| "text/plain": [ | |
| "<matplotlib.figure.Figure at 0x139a9f32be0>" | |
| ] | |
| }, | |
| "metadata": {}, | |
| "output_type": "display_data" | |
| } | |
| ], | |
| "source": [ | |
| "corr = insuranceDF.corr()\n", | |
| "print(corr)\n", | |
| "sns.heatmap(corr, \n", | |
| " xticklabels=corr.columns,\n", | |
| " yticklabels=corr.columns)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In the above heatmap, brighter colors indicate higher correlation values; strongly negative correlations, such as the one between bmi and steps, appear darkest." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "When using machine learning algorithms, we should always split our data into a training set and a test set. (If we are running a large number of experiments, we should divide our data into three parts: a training set, a development set, and a test set.) In our case, we will also set aside some data for manual cross-checking.\n", | |
| "\n", | |
| "The data set consists of the records of 1338 policyholders in total. We will use the first 1000 records to train our model, the next 300 records for testing, and the last 38 records to cross-check our model." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dfTrain = insuranceDF[:1000]\n", | |
| "dfTest = insuranceDF[1000:1300]\n", | |
| "dfCheck = insuranceDF[1300:] " | |
| ] | |
| }, | |
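| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The slices above take rows in file order, which is only safe if the file is not sorted in some meaningful way. As a hedged alternative, the sketch below shows a shuffled, stratified split with scikit-learn's train_test_split. The small DataFrame demoDF and its values are made up for illustration and are not part of the insurance data; the 70/30 ratio and random_state=0 are likewise arbitrary choices." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import numpy as np\n", | |
| "import pandas as pd\n", | |
| "from sklearn.model_selection import train_test_split\n", | |
| "\n", | |
| "# Hypothetical toy data: 20 rows, one feature and a binary label.\n", | |
| "demoDF = pd.DataFrame({\n", | |
| "    'bmi': np.linspace(18.0, 37.0, 20),\n", | |
| "    'insuranceclaim': [0, 1] * 10,\n", | |
| "})\n", | |
| "\n", | |
| "# Shuffle before splitting and keep the label ratio equal in both parts.\n", | |
| "demoTrain, demoTest = train_test_split(\n", | |
| "    demoDF, test_size=0.3, shuffle=True,\n", | |
| "    stratify=demoDF['insuranceclaim'], random_state=0)\n", | |
| "\n", | |
| "print(len(demoTrain), len(demoTest))" | |
| ] | |
| }, | |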
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Next, we separate the label and the features (for both the training and test datasets). We also convert them into NumPy arrays, as our machine learning algorithm processes data in NumPy array format." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "trainLabel = np.asarray(dfTrain['insuranceclaim'])\n", | |
| "trainData = np.asarray(dfTrain.drop('insuranceclaim', axis=1))\n", | |
| "testLabel = np.asarray(dfTest['insuranceclaim'])\n", | |
| "testData = np.asarray(dfTest.drop('insuranceclaim', axis=1))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As the final step before using machine learning, we will normalize our inputs. Machine learning models often benefit substantially from input normalization. It also makes it easier for us to understand the importance of each feature later, when we'll be looking at the model weights. We'll normalize the data such that each variable has a mean of 0 and a standard deviation of 1 on the training set; the test set is scaled with the same training-set statistics." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "means = np.mean(trainData, axis=0)\n", | |
| "stds = np.std(trainData, axis=0)\n", | |
| " \n", | |
| "trainData = (trainData - means)/stds\n", | |
| "testData = (testData - means)/stds" | |
| ] | |
| }, | |
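| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The manual normalization above can also be written with scikit-learn's StandardScaler, which learns the means and standard deviations from the training data and reuses them on new data. This is only a sketch on a made-up array (demoTrain and demoTest are hypothetical names and values), not a change to the steps used in this notebook." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import numpy as np\n", | |
| "from sklearn.preprocessing import StandardScaler\n", | |
| "\n", | |
| "# Hypothetical toy features (rows = samples, columns = features).\n", | |
| "demoTrain = np.array([[19.0, 27.9], [18.0, 33.8], [28.0, 33.0], [33.0, 22.7]])\n", | |
| "demoTest = np.array([[32.0, 28.9]])\n", | |
| "\n", | |
| "# Manual z-score, using training statistics only.\n", | |
| "demoMeans = np.mean(demoTrain, axis=0)\n", | |
| "demoStds = np.std(demoTrain, axis=0)\n", | |
| "manualTest = (demoTest - demoMeans) / demoStds\n", | |
| "\n", | |
| "# StandardScaler also uses the population standard deviation (ddof=0).\n", | |
| "scaler = StandardScaler().fit(demoTrain)\n", | |
| "scaledTest = scaler.transform(demoTest)\n", | |
| "\n", | |
| "# Check that both approaches give the same scaled values.\n", | |
| "print(np.allclose(manualTest, scaledTest))" | |
| ] | |
| }, | |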
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "We can now train our classification model. We'll be using a simple machine learning model called logistic regression. Since the model is readily available in sklearn, the training process is quite easy and we can do it in a few lines of code. First, we create an instance called insuranceCheck and then use the fit function to train the model." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", | |
| " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", | |
| " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", | |
| " verbose=0, warm_start=False)" | |
| ] | |
| }, | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "insuranceCheck = LogisticRegression()\n", | |
| "insuranceCheck.fit(trainData, trainLabel)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we use our test data to find the accuracy of the model." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "accuracy = 86.0 %\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "accuracy = insuranceCheck.score(testData, testLabel)\n", | |
| "print(\"accuracy = \", accuracy * 100, \"%\")" | |
| ] | |
| }, | |
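| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Accuracy alone hides which kinds of mistakes the model makes. As an illustration, the sketch below computes a confusion matrix, precision, and recall with sklearn.metrics. The label arrays demoTrue and demoPred are invented for the example, and the coding 1 = claim, 0 = no claim is an assumption; in this notebook they would correspond to testLabel and insuranceCheck.predict(testData)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import numpy as np\n", | |
| "from sklearn.metrics import confusion_matrix, precision_score, recall_score\n", | |
| "\n", | |
| "# Hypothetical true labels and predictions (assumed coding: 1 = claim, 0 = no claim).\n", | |
| "demoTrue = np.array([1, 1, 0, 0, 1, 0, 1, 1])\n", | |
| "demoPred = np.array([1, 0, 0, 0, 1, 1, 1, 1])\n", | |
| "\n", | |
| "# Rows are true classes, columns are predicted classes.\n", | |
| "print(confusion_matrix(demoTrue, demoPred))\n", | |
| "print('precision =', precision_score(demoTrue, demoPred))\n", | |
| "print('recall =', recall_score(demoTrue, demoPred))" | |
| ] | |
| }, | |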
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have greater effect." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "Text(0.5,0,'Importance')" | |
| ] | |
| }, | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| }, | |
| { | |
| "data": { | |
| "image/png": "\n", | |
| "text/plain": [ | |
| "<matplotlib.figure.Figure at 0x139ac114e48>" | |
| ] | |
| }, | |
| "metadata": {}, | |
| "output_type": "display_data" | |
| } | |
| ], | |
| "source": [ | |
| "coeff = list(insuranceCheck.coef_[0])\n", | |
| "labels = list(dfTrain.drop('insuranceclaim', axis=1).columns)\n", | |
| "features = pd.DataFrame()\n", | |
| "features['Features'] = labels\n", | |
| "features['importance'] = coeff\n", | |
| "features.sort_values(by=['importance'], ascending=True, inplace=True)\n", | |
| "features['positive'] = features['importance'] > 0\n", | |
| "features.set_index('Features', inplace=True)\n", | |
| "features.importance.plot(kind='barh', figsize=(11, 6),color = features.positive.map({True: 'blue', False: 'red'}))\n", | |
| "plt.xlabel('Importance')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "From the above figure, we can draw the following conclusions.\n", | |
| "\n", | |
| "1. BMI and smoker have a significant influence on the model, especially BMI. It is good to see our machine learning model match what we have been hearing from doctors our entire lives!\n", | |
| "\n", | |
| "2. Children has a negative influence on the prediction, i.e. a higher number of children / dependents is associated with a policyholder not making an insurance claim.\n", | |
| "\n", | |
| "3. Age was only weakly correlated with the outcome variable (0.11, versus 0.38 for BMI, as we saw during data exploration), and the model also relies more on BMI than on age. More generally, a feature's correlation with the outcome and its weight in the model need not agree: the information carried by one variable may also be captured by some other variable, whereas a variable whose information is not captured elsewhere keeps a larger weight.\n", | |
| "\n", | |
| "Note that the above interpretations require that our input data be normalized. Without that, we can't claim that importance is proportional to the weights." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we save our trained model, together with the normalization means and standard deviations, for future use with joblib." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "['insurance01Model.pkl']" | |
| ] | |
| }, | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "joblib.dump([insuranceCheck, means, stds], 'insurance01Model.pkl')" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "To check whether we have saved the model properly or not, we will use our test data to check the accuracy of our saved model (we should observe no change in accuracy if we have saved it properly)." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "accuracy = 86.0 %\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "insuranceLoadedModel, means, stds = joblib.load('insurance01Model.pkl')\n", | |
| "accuracyModel = insuranceLoadedModel.score(testData, testLabel)\n", | |
| "print(\"accuracy = \",accuracyModel * 100,\"%\")" | |
| ] | |
| }, | |
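| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Saving the model together with means and stds works, but the three objects must always be kept in sync. One possible alternative, sketched here under the assumption that a scikit-learn Pipeline and the standalone joblib package are acceptable, bundles scaling and classification into a single object. The toy arrays and the file name demoPipeline.pkl are hypothetical." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import os\n", | |
| "import tempfile\n", | |
| "\n", | |
| "import joblib\n", | |
| "import numpy as np\n", | |
| "from sklearn.linear_model import LogisticRegression\n", | |
| "from sklearn.pipeline import make_pipeline\n", | |
| "from sklearn.preprocessing import StandardScaler\n", | |
| "\n", | |
| "# Hypothetical toy data: two raw (unscaled) features and a binary label.\n", | |
| "demoX = np.array([[20.0, 3000.0], [25.0, 9000.0], [35.0, 3500.0],\n", | |
| "                  [22.0, 10000.0], [40.0, 4000.0], [24.0, 8000.0]])\n", | |
| "demoY = np.array([0, 0, 1, 0, 1, 0])\n", | |
| "\n", | |
| "# The pipeline scales the raw features, then fits the classifier.\n", | |
| "demoPipeline = make_pipeline(StandardScaler(), LogisticRegression())\n", | |
| "demoPipeline.fit(demoX, demoY)\n", | |
| "\n", | |
| "# Save and reload in a temporary directory.\n", | |
| "with tempfile.TemporaryDirectory() as tmpDir:\n", | |
| "    path = os.path.join(tmpDir, 'demoPipeline.pkl')\n", | |
| "    joblib.dump(demoPipeline, path)\n", | |
| "    loadedPipeline = joblib.load(path)\n", | |
| "\n", | |
| "# The reloaded pipeline should reproduce the original predictions.\n", | |
| "print(np.array_equal(demoPipeline.predict(demoX), loadedPipeline.predict(demoX)))" | |
| ] | |
| }, | |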
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we use the 38 unused records to see how predictions can be made. This unused data is stored in dfCheck." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 14, | |
| "metadata": { | |
| "scrolled": true | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| " age sex bmi steps children smoker region charges \\\n", | |
| "1300 45 1 30.360 4002 0 1 2 62592.87309 \n", | |
| "1301 62 1 30.875 4001 3 1 1 46718.16325 \n", | |
| "1302 25 0 20.800 10005 1 0 3 3208.78700 \n", | |
| "1303 43 1 27.800 4009 0 1 3 37829.72420 \n", | |
| "1304 42 1 24.605 8009 2 1 0 21259.37795 \n", | |
| "1305 24 0 27.720 8008 0 0 2 2464.61880 \n", | |
| "1306 29 0 21.850 3007 0 1 0 16115.30450 \n", | |
| "1307 32 1 28.120 4001 4 1 1 21472.47880 \n", | |
| "1308 25 0 30.200 3008 0 1 3 33900.65300 \n", | |
| "1309 41 1 32.200 3001 2 0 3 6875.96100 \n", | |
| "1310 42 1 26.315 8006 1 0 1 6940.90985 \n", | |
| "1311 33 0 26.695 8005 0 0 1 4571.41305 \n", | |
| "1312 34 1 42.900 4001 1 0 3 4536.25900 \n", | |
| "1313 19 0 34.700 4000 2 1 3 36397.57600 \n", | |
| "1314 30 0 23.655 4005 3 1 1 18765.87545 \n", | |
| "1315 18 1 28.310 8009 1 0 0 11272.33139 \n", | |
| "1316 19 0 20.600 10009 0 0 3 1731.67700 \n", | |
| "1317 18 1 53.130 3005 0 0 2 1163.46270 \n", | |
| "1318 35 1 39.710 3000 4 0 0 19496.71917 \n", | |
| "1319 39 0 26.315 8003 2 0 1 7201.70085 \n", | |
| "1320 31 1 31.065 4008 3 0 1 5425.02335 \n", | |
| "1321 62 1 26.695 5006 0 1 0 28101.33305 \n", | |
| "1322 62 1 38.830 3010 0 0 2 12981.34570 \n", | |
| "1323 42 0 40.370 4006 2 1 2 43896.37630 \n", | |
| "1324 31 1 25.935 8010 1 0 1 4239.89265 \n", | |
| "1325 61 1 33.535 3006 0 0 0 13143.33665 \n", | |
| "1326 42 0 32.870 4004 0 0 0 7050.02130 \n", | |
| "1327 51 1 30.030 4001 1 0 2 9377.90470 \n", | |
| "1328 23 0 24.225 10001 2 0 0 22395.74424 \n", | |
| "1329 52 1 38.600 3009 2 0 3 10325.20600 \n", | |
| "1330 57 0 25.740 8003 2 0 2 12629.16560 \n", | |
| "1331 23 0 33.400 3004 0 0 3 10795.93733 \n", | |
| "1332 52 0 44.700 4009 3 0 3 11411.68500 \n", | |
| "1333 50 1 30.970 4008 3 0 1 10600.54830 \n", | |
| "1334 18 0 31.920 3003 0 0 0 2205.98080 \n", | |
| "1335 18 0 36.850 3008 0 0 2 1629.83350 \n", | |
| "1336 21 0 25.800 8009 0 0 3 2007.94500 \n", | |
| "1337 61 0 29.070 8008 0 1 1 29141.36030 \n", | |
| "\n", | |
| " insuranceclaim \n", | |
| "1300 1 \n", | |
| "1301 1 \n", | |
| "1302 0 \n", | |
| "1303 1 \n", | |
| "1304 1 \n", | |
| "1305 1 \n", | |
| "1306 1 \n", | |
| "1307 0 \n", | |
| "1308 1 \n", | |
| "1309 0 \n", | |
| "1310 0 \n", | |
| "1311 1 \n", | |
| "1312 1 \n", | |
| "1313 1 \n", | |
| "1314 0 \n", | |
| "1315 0 \n", | |
| "1316 0 \n", | |
| "1317 1 \n", | |
| "1318 0 \n", | |
| "1319 0 \n", | |
| "1320 0 \n", | |
| "1321 1 \n", | |
| "1322 1 \n", | |
| "1323 1 \n", | |
| "1324 0 \n", | |
| "1325 1 \n", | |
| "1326 1 \n", | |
| "1327 1 \n", | |
| "1328 0 \n", | |
| "1329 1 \n", | |
| "1330 1 \n", | |
| "1331 1 \n", | |
| "1332 0 \n", | |
| "1333 0 \n", | |
| "1334 1 \n", | |
| "1335 1 \n", | |
| "1336 0 \n", | |
| "1337 1 \n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "print(dfCheck.head(38))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we use the third record (index 1302) to make our insurance claim prediction." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 17, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Insurance Claim Probability: [[0.95741143 0.04258857]]\n", | |
| "Insurance Claim Prediction: [0]\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "sampleData = dfCheck[2:3]\n", | |
| " \n", | |
| "# prepare sample \n", | |
| "sampleDataFeatures = np.asarray(sampleData.drop('insuranceclaim', axis=1))\n", | |
| "sampleDataFeatures = (sampleDataFeatures - means)/stds\n", | |
| " \n", | |
| "# predict \n", | |
| "predictionProbability = insuranceLoadedModel.predict_proba(sampleDataFeatures)\n", | |
| "prediction = insuranceLoadedModel.predict(sampleDataFeatures)\n", | |
| "print('Insurance Claim Probability:', predictionProbability)\n", | |
| "print('Insurance Claim Prediction:', prediction)" | |
| ] | |
| }, | |
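| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "For a binary logistic regression, the probability that predict_proba reports for class 1 is the logistic (sigmoid) function applied to the weighted sum of the normalized features plus the intercept. The sketch below checks this relationship on made-up data; demoX, demoY and demoModel are hypothetical names, not objects from this notebook." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import numpy as np\n", | |
| "from sklearn.linear_model import LogisticRegression\n", | |
| "\n", | |
| "# Hypothetical, already normalized toy data.\n", | |
| "demoX = np.array([[-1.2, 0.5], [-0.7, -0.3], [0.1, 1.1],\n", | |
| "                  [0.4, -0.9], [0.9, 0.2], [1.5, -0.6]])\n", | |
| "demoY = np.array([0, 0, 1, 0, 1, 1])\n", | |
| "\n", | |
| "demoModel = LogisticRegression().fit(demoX, demoY)\n", | |
| "\n", | |
| "# z = w . x + b, then p(class 1) = 1 / (1 + exp(-z)).\n", | |
| "z = demoX @ demoModel.coef_[0] + demoModel.intercept_[0]\n", | |
| "manualProba = 1.0 / (1.0 + np.exp(-z))\n", | |
| "\n", | |
| "# Compare with the second column of predict_proba (probability of class 1).\n", | |
| "print(np.allclose(manualProba, demoModel.predict_proba(demoX)[:, 1]))" | |
| ] | |
| }, | |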
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.4" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |