Machine Learning Nanodegree Notes
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# General Notes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 1\n",
"\n",
"[Learn Pandas](https://bitbucket.org/hrojas/learn-pandas)\n",
"\n",
"$$ Accuracy = \\frac{Correctly Identified}{All} = \\frac{TP+TN}{TP+TN+FP+FN}$$\n",
"\n",
"$$ Precision = \\frac{TP}{TP + FP}$$\n",
"\n",
"$$ Recall = \\frac{TP}{TP+FN} $$\n",
"\n",
"$$ F1 = \\frac{2*(Precision * Recall)}{Precision + Recall}$$\n",
"\n",
"For regression: \n",
"\n",
"* Mean Squared Error and Mean Absolute Error.\n",
"* From 0 to 1 where 1 is better, R^2 score and explained variance.\n",
"\n",
"\n",
"## sklearn \n",
"\n",
"* cross-validation.train_test_split\n",
"* confusion matrix, recall, precision functions built in\n",
"\n",
"## Algorithms\n",
"\n",
"Decision trees compared to Naiive Bayes:\n",
"\n",
"* Decision trees are better at precision.\n",
"* Naiive Bayes is better at recall.\n",
" \n",
" \n",
"## Causes of Errors\n",
"\n",
"* Bias due to assumptions or inability to accurately model underlying data -> low accuracy and underfitting.\n",
"* Variance due to model being overly sensitive to training dataset -> overfitting.\n",
"\n",
"\n",
"\n",
"http://scott.fortmann-roe.com/docs/BiasVariance.html\n",
"\n",
"To balance between underfitting and overfitting, use few features getting a large R^2 or a low sum of squared error.\n",
"\n",
"\n",
"## Curse of Dimensionality\n",
"\n",
"As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.\n",
"\n",
"\n",
"## Learning Curve\n",
"\n",
"A learning curve in machine learning is a graph that compares the performance of a model on training and testing data over a varying number of training instances.\n",
"\n",
"* When the training and testing errors converge and are quite high this usually means the model is biased (underfit).\n",
"* When there is a large gap between the training and testing error this generally means the model suffers from high variance (overfit).\n",
"* Ideal is good performance and generalizes well to unseen data.\n",
"\n",
"\n",
"Validation curves can also be helpful with detecting the ideal balance of model complexity.\n",
"\n",
"## Cross Validation\n",
"\n",
"* k sets, comparing n groupings as training set\n",
"* k sets, comparing different variations\n",
"* grid search, use multiple parameters and creating different combinations of them\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Kaggle Section\n",
"\n",
"Tips:\n",
"\n",
"* Computer vision - deep learning, keras\n",
"* Others - Gradient Boosted Machines? Extree boost"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 2 - Supervised Learning\n",
"\n",
"## Regression\n",
"\n",
"Originated due to values naturally regressing to the mean. Slope is less than 1.\n",
"\n",
"Current meaning: Using a functional form to approximate data points.\n",
"\n",
"Used for continuous outputs.\n",
"\n",
"LinearRegression in sklearn. $$ y = mx + b $$\n",
"\n",
"Evaluating regressions:\n",
"\n",
"* sum of squared errors\n",
"* R^2\n",
"\n",
"\n",
"numpy [polyfit](https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html)\n",
"\n",
"scikit learn [polynomial features](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)\n",
"\n",
"\n",
"## Decision Trees\n",
"\n",
"Classification takes input and mapping it to a discrete label. Regression takes input and maps it to a continuous value. Decision trees are a form of classification.\n",
"\n",
"Some definitions:\n",
"\n",
"* Instances - Input\n",
"* Concept - function to map inputs to membership in a set (returns True or False)\n",
"* Target Concept - function that determines which set input is a member of\n",
"* Hypothesis - All functions\n",
"* Sample - training set (input, label)\n",
"* Candidate - concept that you think might be the target concept\n",
"* Testing Set - another set of (input,label) that is not the training set\n",
"\n",
"The best attribute/feature to pick for the decision tree is the one that splits the data into the largest two subsets.\n",
"\n",
"\n",
"Dealing with overfitting: pruning.\n",
"\n",
"sklearn [DecisionTreeClassifier](http://scikit-learn.org/0.17/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)\n",
"\n",
"Entropy controls how a decision tree decides where to split the data. Entropy is a measure of impurity in a bunch of examples.\n",
"\n",
"Information gain = entropy(parent) - weighted average(entropy(children))\n",
"\n",
"Decision tree will attempt to maximize information gain.\n",
"\n",
"Decision Trees are prone to overfitting but are really simple and easy to use.\n",
"\n",
"\n",
"## Neural Networks\n",
"\n",
"Perceptron is a unit of the network that has an activation and a firing threshold. Sum of inputs * weight product is the activation. Then we check if it is above the firing threshold to determine whether to return 0 or 1.\n",
"\n",
"To figure out the weights and threshold from examples we can use the Perceptron Rule or Gradient Descent.\n",
"\n",
"To select initial weights, use small random values.\n",
"\n",
"\n",
"#### Perceptron Rule\n",
"\n",
"Uses a learning rate to determine weight based on starting weights and how they impact the results.\n",
"\n",
"Works well for linearly separable data.\n",
"\n",
"$$ \\Delta w_i = n(y - \\hat{y}) $$\n",
"\n",
"\n",
"#### Gradient Descent\n",
"\n",
"Works well even if data is not linearly separable.\n",
"\n",
"$$ \\Delta w_i = n(y - a), a= gradient $$\n",
"\n",
"\n",
"#### Sigmoid\n",
"\n",
"$$ \\sigma(a) = \\frac{1}{1+e^{-a}} $$\n",
"\n",
"$$ D\\sigma(a) = \\sigma(a)(1-\\sigma(a)) $$\n",
"\n",
"\n",
"### Chaining Perceptrons\n",
"\n",
"We then connect multiple perceptrons together to build a network.\n",
"\n",
"### Biases\n",
"\n",
"Restriction bias: representational power & set of hypotheses to consider.\n",
"\n",
"Simple perceptron -> linear\n",
"networks -> more than linear\n",
"optimizing weights -> potential overfitting\n",
"Can build boolean, continuous (1 hidden layer) and arbitrary functions (two hidden layers).\n",
"\n",
"Danger of overfitting.\n",
"\n",
"Preference bias: selection of one representation over another\n",
"\n",
"NNs prefer simpler explanations.\n",
"\n",
"\n",
"## Support Vector Machines\n",
"\n",
"The idea is to find a separating line between classes of data. It does this by maximizing the margin (distance between the line and the nearest point).\n",
"\n",
"SVM can handle outliers so that it gives a more reasonable result.\n",
"\n",
"SVM can also work in non-linear fashion using different kernels. The kernels work with more features but a combination that is linearly separable. The separating line can then be used with the initial featureset but will be a nonlinear solution.\n",
"\n",
"SVMs have three main parameters for customization, the kernel, C and gamma.\n",
"\n",
"C controls the tradeoff between a smooth decision boundary and correct classification. A higher value of C means more correct classifications in favor of a less smooth boundary.\n",
"\n",
"\n",
"\n",
"## K Nearest Neighbors\n",
"\n",
"Given a distance metric, group data into sets of k points that are nearest to each other.\n",
"\n",
"Emphasizes:\n",
"\n",
"* locality\n",
"* smoothness\n",
"* all features matter equally unless distance metric weights the features differently\n",
"\n",
"\n",
"## Naive Bayes\n",
"\n",
"Bayes theorem incorporates some result of a test with a prior probability to arrive at a posterior probability.\n",
"\n",
"$$ P(A\\ |\\ B) = \\frac{P(B\\ |\\ A)*P(A)}{P(B)} $$\n",
"\n",
"Sensitivity: P(Pos|C) = P(True Positive)\n",
"Specitivity: P(Neg|not C) = P(True Negative)\n",
"\n",
"We can use data to calculate probabilities and conditional probabilities of features on resulting in a specific class. Then we can use these probabilities to classify new inputs. This is called Bayesian Classification.\n",
"\n",
"\n",
"## Ensemble\n",
"\n",
"Learn over a subset of the data to develop rules and then combine them to create the model. \n",
"\n",
"Bagging/bootstrap aggregation: random subsets, combining by averaging.\n",
"\n",
"Boosting: hardest examples and combining with weighted average.\n",
"\n",
"Boosting doesn't often overfit but will overfit if a weak learner uses Artificial Neural Network with many layers of nodes.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 3 - Unsupervised Learning\n",
"\n",
"Learning with no labels in the input data.\n",
"\n",
"\n",
"## Clustering\n",
"\n",
"Grouping data to extract useful information without labels.\n",
"\n",
"K-Means Clustering:\n",
"\n",
"* Creates k clusters from k centroids\n",
"* Updates position of centroid after clustering its points\n",
"* Updates point's cluster based on closest centroid\n",
"* Initial placement of points significantly affects the result.\n",
"* Should be run multiple times\n",
"\n",
"Single Linkage Clustering:\n",
"\n",
"* Connect the closest two points together until we have k clusters\n",
"* Connected points denote a cluster\n",
"\n",
"Soft Clustering:\n",
"\n",
"* Allows for a point to be in multiple possible clusters\n",
"* Gives probability of a point in each cluster\n",
"\n",
"\n",
"## Feature Scaling\n",
"\n",
"Each feature may have a different scale. They should be normalized for comparison/combination.\n",
"\n",
"To do so we can use [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) from sklearn.\n",
"\n",
"SVM and K-Means clustering are significantly affected by feature scaling.\n",
"\n",
"Decision trees and linear regression are not affected significantly by feature scaling.\n",
"\n",
"### Filtering\n",
"\n",
"Features -> search -> feature_subset -> learner\n",
"\n",
"No feedback loop, ignored bias, faster than wrapping.\n",
"\n",
"### Wrapping\n",
"\n",
"features -> (loop of learner <-> search) -> model\n",
"\n",
"Takes into account model bias but very slow.\n",
"\n",
"\n",
"## Dimensionality Reduction\n",
"\n",
"\n",
"### Principal Component Analysis\n",
"\n",
"Determining the dimensionality of data.\n",
"\n",
"PCA finds a new coordinate system by translation or rotation only from the center of the data.\n",
"\n",
"Principal component is the feature that has the highest variance.\n",
"\n",
"Use principal components as new features.\n",
"\n",
"Principal components are directions in data that maximize varaince (minimize information loss) when compressed.\n",
"\n",
"More variance of data along a principal component means the higher it will be ranked.\n",
"\n",
"Further principal components do not overlap with previous ones since they would be on perpindicular.\n",
"\n",
"Max number of principal components is the number of input features.\n",
"\n",
"Use PCA when you want to:\n",
" \n",
"* find latent features driving the patterns in data\n",
"* for dimensionality reduction\n",
" * visualize high-dimensional data\n",
" * reduce noice\n",
" * make other algorithms work better with fewer inputs\n",
"\n",
"mutually orthogonal, maximal variance, ordered features\n",
"\n",
"### Independent Component Analysis\n",
"\n",
"ICA uses independence to determine the minimized feature subset that affects the result the most. \n",
"\n",
"[Paper](http://mlsp.cs.cmu.edu/courses/fall2012/lectures/ICA_Hyvarinen.pdf)\n",
"\n",
"[Demo](http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi)\n",
" \n",
"Mutually independent, maximal mutual information, bag of features.\n",
"\n",
"### Others\n",
"\n",
"RCA - Random Components Analysis ( random directions )\n",
"\n",
"LDA - Linear Discriminant Analysis (finds a projection that discriminates based on the label)\n",
"\n",
"[Paper](http://computation.llnl.gov/casc/sapphire/pubs/148494.pdf)\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Project 4 - Reinforcement Learning\n",
"\n",
"\n",
"## Markov Decision Processes\n",
"\n",
"Markovian property - only the present matters.\n",
"\n",
"Markov Decision Processes have the following components:\n",
"\n",
"1. States\n",
"1. Actions\n",
"1. Models (transition function = T(state_1, action, state_2) or Probability(state_2 | state_1 and action))\n",
"1. Rewards - value/usefulness of a state\n",
"\n",
"These components define the problem and from that we can determine a solution (policy). We are generally interested in the optimal policy, that is the solution that maximizes the reward.\n",
"\n",
"A policy defines the action you should take given the state you are in.\n",
"\n",
"You can have an infinite horizon or a finite one. What finite means is there are n steps left before the total reward is calculated. Infinite means n is infinity.\n",
"\n",
"The utility of a sequence is the sum of all the the rewards for each state in the sequence.\n",
"\n",
"1. Planner takes model and returns a policy\n",
"1. Learner takes transitions and returns a policy\n",
"1. Modeler takes transitions and returns a model\n",
"1. Simulator takes model and returns transitions\n",
"1. Reinforcement Learning-based Planner takes a model passes it through a simulator then a learner to generate a policy\n",
"1. Model-based Reinforcement Learning takes transitions passes them through a modeler then a planner to generate a policy\n",
"\n",
"The core concept in reinforcement learning is trying to find a balance between exploration and exploitation.\n",
"\n",
"\n",
"## Game Theory\n",
"\n",
"Game theory can be defined as the mathematics of conflict.\n",
"\n",
"A strategy is a mapping of all possible states to actions.\n",
"\n",
"Nash equlibrium is a stable state of a system involving the interaction of different participants, in which no participant can gain by a unilateral change of strategy if the strategies of the others remain unchanged.\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project 5 - Deep Learning"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:mlnd]",
"language": "python",
"name": "conda-env-mlnd-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}