anantmalhotra/Network Cost Model.ipynb

## Network Cost Model.ipynb
{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Network Cost Model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Carrying traffic in an IP network incurs many costs, including transit, port costs, backhaul costs, and other capital costs. Traffic routing decisions can significantly affect the overall costs of the network. For example, the cost of carrying traffic over trans-oceanic or satellite links is more than carrying over under-utilized backhaul commodity links. \n",
      "\n",
      "Here we map traffic flows using real traffic matrices from a Tier1 IP Network to the costs of carrying that traffic, the inability to attribute costs to traffic flows can result in missed opportunities for cost savings and ad hoc decisions about routing and interconnection. Each link in the network has been assigned an OPEX cost based on its capacity (Mbps) and the associated vertex router pair locations.\n",
      "\n",
      "Two cost models have been considered while attributing costs covering the two extremes :\n",
      "\n",
      "* Complete customer liability: where each customer is attributed the cost of the links it uses in proportion of its percentage of total utilization of the links.\n",
      "* No customer liability: where each customer is attributed the cost of the links it uses based on a fixed cost per Mbps on the links regardless of link utilization.\n",
      "\n",
      "Interaction data for 1 day (18.06.2014) is created by annotating netflow records and stored in the form of a BizReflex master cube, we annotate further on entity type (customer/peer), PIB (path, hops) and Costs from the two distinct cost models described above:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "master = pd.read_csv(\"../../new_pib_data_1d\",names = [\"direction\",\"inIF\",\"inRTR\",\"egIF\",\"egRTR\",\"inASN\",\"egASN\",\"inPOP\",\"inRegion\",\"inCountry\",\"egPOP\",\"egRegion\",\"egCountry\",\"hops\",\"onOff\",\"bytes\",\"cost1\",\"cost\"])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "*\"in_\"* and *\"out_\"* prefixes describe the source and destination respectively, with the suffixes \"IF\", \"RTR\", \"ASN\", \"POP\", \"Region\" and \"Country\" describing the various levels in the Network Hierarchy.\n",
      "\n",
      "Any price/cost models are built over customers at a router-interface level, i.e. each customer (denoted by an AS number) is connected at various routers via an interface, hence all aggregations are done at these levels while minimizing information loss. Some imputations have been done based on knowledge of the annotation process."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pandas.tools.pivot import pivot_table\n",
      "\n",
      "# ignore peer to peer traffic\n",
      "master = master[master['onOff'] <> 2]\n",
      "\n",
      "# treat OnOff=3 as Onnet\n",
      "master.loc[master['onOff'] == 3,\"onOff\"] = 1\n",
      "\n",
      "# convert bytes to speed (bps)\n",
      "master['speed'] = master['bytes']/24*3600\n",
      "\n",
      "# treat hops as categorical\n",
      "master['c(hops)'] = master['hops']\n",
      "\n",
      "# hopspeed = hops*speed\n",
      "master['hopspeed'] = master['hops']*master['speed']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Certain additional metrics have also been computed based on domain knowledge. \"Hops\" is treated as categorical as well as continuous in this process. Here are some summary statistics of the attributes considered."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "master[['speed','hops','cost','hopspeed']].describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>speed</th>\n",
        "      <th>hops</th>\n",
        "      <th>cost</th>\n",
        "      <th>hopspeed</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>count</th>\n",
        "      <td> 6.366340e+05</td>\n",
        "      <td> 636634.000000</td>\n",
        "      <td>  636634.000000</td>\n",
        "      <td> 6.366340e+05</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>mean</th>\n",
        "      <td> 2.251118e+11</td>\n",
        "      <td>      5.837318</td>\n",
        "      <td>     686.677007</td>\n",
        "      <td> 8.958334e+11</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>std</th>\n",
        "      <td> 3.241939e+12</td>\n",
        "      <td>      2.052375</td>\n",
        "      <td>   11434.580481</td>\n",
        "      <td> 1.076409e+13</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>min</th>\n",
        "      <td> 9.375000e+04</td>\n",
        "      <td>      2.000000</td>\n",
        "      <td>       0.000000</td>\n",
        "      <td> 3.750000e+05</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>25%</th>\n",
        "      <td> 5.940000e+07</td>\n",
        "      <td>      4.000000</td>\n",
        "      <td>       0.000000</td>\n",
        "      <td> 3.348000e+08</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>50%</th>\n",
        "      <td> 6.496500e+08</td>\n",
        "      <td>      6.000000</td>\n",
        "      <td>       0.552006</td>\n",
        "      <td> 3.600000e+09</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>75%</th>\n",
        "      <td> 7.638750e+09</td>\n",
        "      <td>      7.000000</td>\n",
        "      <td>      11.220987</td>\n",
        "      <td> 4.151970e+10</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>max</th>\n",
        "      <td> 4.548026e+14</td>\n",
        "      <td>     15.000000</td>\n",
        "      <td> 1665038.500000</td>\n",
        "      <td> 1.263103e+15</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>8 rows \u00d7 4 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "              speed           hops            cost      hopspeed\n",
        "count  6.366340e+05  636634.000000   636634.000000  6.366340e+05\n",
        "mean   2.251118e+11       5.837318      686.677007  8.958334e+11\n",
        "std    3.241939e+12       2.052375    11434.580481  1.076409e+13\n",
        "min    9.375000e+04       2.000000        0.000000  3.750000e+05\n",
        "25%    5.940000e+07       4.000000        0.000000  3.348000e+08\n",
        "50%    6.496500e+08       6.000000        0.552006  3.600000e+09\n",
        "75%    7.638750e+09       7.000000       11.220987  4.151970e+10\n",
        "max    4.548026e+14      15.000000  1665038.500000  1.263103e+15\n",
        "\n",
        "[8 rows x 4 columns]"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Cost Attribution"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A top level segmentation is used for dividing on-net and off-net flows and analysing each seperately.\n",
      "\n",
      "Directions have been used to assign costs based on which entity is nearest to the sampling router. So for Direction=1 cost is attributed to the ingress entity, and similarly Direction=2 represents costs attributed to the egress entity. The data has been validated on both directions with sampling bias corrected for an error of upto 5%.\n",
      "\n",
      "Note: Direction has no relationship with sending/receiving traffic, rather it tells us where each flow was sampled (ingress/egress). Thus, each flow should be captured in both directions in case of bi-directional sampling, which is often the case."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# break data into disjoint sets\n",
      "\n",
      "ondir1 = master[master['onOff'] == 1][master['direction'] == 1]\n",
      "ondir2 = master[master['onOff'] == 1][master['direction'] == 2]\n",
      "offdir1 = master[master['onOff'] == 0][master['direction'] == 1]\n",
      "offdir2 = master[master['onOff'] == 0][master['direction'] == 2]\n",
      "\n",
      "# build Traffic, Hops, and Cost matrices\n",
      "\n",
      "on1_matrix = pivot_table(ondir1, rows=['inCountry','inRegion','inPOP','inRTR','inASN'], cols=['egCountry','egRegion','egPOP','c(hops)','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
      "on2_matrix = pivot_table(ondir2, cols=['inCountry','inRegion','inPOP','c(hops)','inRTR','inASN'], rows=['egCountry','egRegion','egPOP','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
      "off1_matrix = pivot_table(offdir1, rows=['inCountry','inRegion','inPOP','inRTR','inASN'], cols=['egCountry','egRegion','egPOP','c(hops)','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
      "off2_matrix = pivot_table(offdir2, cols=['inCountry','inRegion','inPOP','c(hops)','inRTR','inASN'], rows=['egCountry','egRegion','egPOP','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/Users/anant.malhotra/anaconda/envs/scientific0.1/lib/python2.7/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.\n",
        "  \"DataFrame index.\", UserWarning)\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A final aggregation step on the various metrics considered above."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# combine directions for Onnet, Offnet separately\n",
      "\n",
      "on_matrix = on1_matrix[['cost','speed','hopspeed']].add(on2_matrix[['cost','speed','hopspeed']], fill_value=0)\n",
      "off_matrix = off1_matrix[['cost','speed','hopspeed']].add(off2_matrix[['cost','speed','hopspeed']], fill_value=0)\n",
      "\n",
      "# adding Hops\n",
      "\n",
      "on_matrix = pd.concat({\"hops\":on1_matrix['hops'].combine_first(on2_matrix['hops'])},axis=1).join(on_matrix,how=\"inner\")\n",
      "off_matrix = pd.concat({\"hops\":off1_matrix['hops'].combine_first(off2_matrix['hops'])},axis=1).join(off_matrix,how=\"inner\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "on_matrix.to_pickle(\"on_matrix\")\n",
      "off_matrix.to_pickle(\"off_matrix\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Variable Construction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After preprocessing, an essential requirement is to gather enough features and minimize information loss caused upon aggregation. Here is a hierarchy of objects that we would be iterating over to compute the relevant traffic attributes."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# The hierarchy\n",
      "\n",
      "levels_y = ['country','region','pop','rtr','asn']\n",
      "levels_x = ['country','region','pop','c(hops)','rtr','asn']\n",
      "measures = ['speed','hopspeed','cost','hops']\n",
      "attributes = ['sum','mean','count','median']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here we compute some aggregates (attributes) at various levels described in the above hierarchy:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# loop through levels in hierarchy\n",
      "\n",
      "feature_dict = {}\n",
      "aggregate_dict = {}\n",
      "\n",
      "for measure in measures:\n",
      "    feature_dict[measure] = {}\n",
      "    aggregate_dict[measure] = {}\n",
      "    \n",
      "    for level in levels_x:\n",
      "        feature_dict[measure].update({level:{}})\n",
      "        \n",
      "        for attribute in attributes:\n",
      "            # overall aggregates\n",
      "            aggregate_dict[measure].update({attribute:eval('on_matrix[measure].%s(axis=1)'%attribute)})\n",
      "            # level aggregates\n",
      "            feature_dict[measure][level].update({attribute:eval('on_matrix[measure].%s(axis=1,level=levels_x.index(level))'%attribute)})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This function creates a dataframe from dictionary while preserving the level hierarchy :"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def createframe(dicts,levels=2):\n",
      "    \n",
      "    l1_dict = {}\n",
      "    l2_dict = {}\n",
      "    \n",
      "    for key,value in dicts.iteritems():\n",
      "        \n",
      "        if levels == 1:\n",
      "            l1_dict.update({key:pd.concat(value.values(),axis=1,keys=value.keys())})\n",
      "\n",
      "        elif levels == 2:\n",
      "    \n",
      "            for key2,value2 in value.iteritems():\n",
      "                l2_dict.update({key2:pd.concat(value2.values(),axis=1,keys=value2.keys())})\n",
      "        \n",
      "            l1_dict.update({key:pd.concat(l2_dict.values(),axis=1,keys=l2_dict.keys())})\n",
      "        \n",
      "    frame = pd.concat(l1_dict.values(),axis=1,keys=l1_dict.keys())        \n",
      "    \n",
      "    return frame"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Keep all aggregates in a separate dataframe. Create another sparse set of features that keeps all information."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "aggregate_frame = createframe(aggregate_dict,levels=1)\n",
      "sparse_frame = createframe(feature_dict).reorder_levels([0,2,1,3],axis=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is sparsely populated and we should also consider making some non-sparse data using this frames' aggregates. Here is an attempt where levels are sorted based on attributes and the top 5 attributes' values and labels are reported, this should reduce the sparsity :"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# top n Country, Region, PoP, Hops, Router and ASN (ID/Name, Attribute)\n",
      "\n",
      "top = {}\n",
      "\n",
      "def sortvalue(a, max):\n",
      "    tmp = pd.Series(data = np.array(a.order(ascending=False)[:max]), index=[\"value\"+str(s) for s in range(1,max+1)])\n",
      "    return tmp\n",
      "\n",
      "def sortindex(a, max):\n",
      "    tmp = pd.Series(data = a.order(ascending=False).index[:max], index=[\"label\"+str(s) for s in range(1,max+1)])\n",
      "    return tmp\n",
      "\n",
      "for measure in measures:\n",
      "    top[measure] = {}\n",
      "    \n",
      "    for attribute in attributes:\n",
      "        top[measure].update({attribute:{}})\n",
      "        \n",
      "        for level in levels_x:\n",
      "            top[measure][attribute].update({level: sparse_frame[measure][attribute][level].div(aggregate_frame[measure][attribute],axis=0).apply(lambda x: sortvalue(x, 5), axis=1).join(sparse_frame[measure][attribute][level].apply(lambda x: sortindex(x, 5), axis=1))})\n",
      "            \n",
      "top_frame = createframe(top)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# save the datasets\n",
      "\n",
      "sparse_frame.to_pickle(\"sparse_frame\")\n",
      "top_frame.to_pickle(\"top_frame\")\n",
      "aggregate_frame.to_pickle(\"aggregate_frame\")\n",
      "master.to_pickle(\"master\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    }
   ],
   "metadata": {}
  }
 ]
}
	{
	"metadata": {
	"name": ""
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [
	{
	"cell_type": "heading",
	"level": 1,
	"metadata": {},
	"source": [
	"Network Cost Model"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Carrying traffic in an IP network incurs many costs, including transit, port costs, backhaul costs, and other capital costs. Traffic routing decisions can significantly affect the overall costs of the network. For example, the cost of carrying traffic over trans-oceanic or satellite links is more than carrying over under-utilized backhaul commodity links. \n",
	"\n",
	"Here we map traffic flows using real traffic matrices from a Tier1 IP Network to the costs of carrying that traffic, the inability to attribute costs to traffic flows can result in missed opportunities for cost savings and ad hoc decisions about routing and interconnection. Each link in the network has been assigned an OPEX cost based on its capacity (Mbps) and the associated vertex router pair locations.\n",
	"\n",
	"Two cost models have been considered while attributing costs covering the two extremes :\n",
	"\n",
	"* Complete customer liability: where each customer is attributed the cost of the links it uses in proportion of its percentage of total utilization of the links.\n",
	"* No customer liability: where each customer is attributed the cost of the links it uses based on a fixed cost per Mbps on the links regardless of link utilization.\n",
	"\n",
	"Interaction data for 1 day (18.06.2014) is created by annotating netflow records and stored in the form of a BizReflex master cube, we annotate further on entity type (customer/peer), PIB (path, hops) and Costs from the two distinct cost models described above:"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"import pandas as pd\n",
	"%matplotlib inline"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 1
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"master = pd.read_csv(\"../../new_pib_data_1d\",names = [\"direction\",\"inIF\",\"inRTR\",\"egIF\",\"egRTR\",\"inASN\",\"egASN\",\"inPOP\",\"inRegion\",\"inCountry\",\"egPOP\",\"egRegion\",\"egCountry\",\"hops\",\"onOff\",\"bytes\",\"cost1\",\"cost\"])"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 2
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"\"in_\" and \"out_\" prefixes describe the source and destination respectively, with the suffixes \"IF\", \"RTR\", \"ASN\", \"POP\", \"Region\" and \"Country\" describing the various levels in the Network Hierarchy.\n",
	"\n",
	"Any price/cost models are built over customers at a router-interface level, i.e. each customer (denoted by an AS number) is connected at various routers via an interface, hence all aggregations are done at these levels while minimizing information loss. Some imputations have been done based on knowledge of the annotation process."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"from pandas.tools.pivot import pivot_table\n",
	"\n",
	"# ignore peer to peer traffic\n",
	"master = master[master['onOff'] <> 2]\n",
	"\n",
	"# treat OnOff=3 as Onnet\n",
	"master.loc[master['onOff'] == 3,\"onOff\"] = 1\n",
	"\n",
	"# convert bytes to speed (bps)\n",
	"master['speed'] = master['bytes']/24*3600\n",
	"\n",
	"# treat hops as categorical\n",
	"master['c(hops)'] = master['hops']\n",
	"\n",
	"# hopspeed = hops*speed\n",
	"master['hopspeed'] = master['hops']*master['speed']"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 3
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Certain additional metrics have also been computed based on domain knowledge. \"Hops\" is treated as categorical as well as continuous in this process. Here are some summary statistics of the attributes considered."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"master[['speed','hops','cost','hopspeed']].describe()"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"html": [
	"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>speed</th>\n",
	" <th>hops</th>\n",
	" <th>cost</th>\n",
	" <th>hopspeed</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>count</th>\n",
	" <td> 6.366340e+05</td>\n",
	" <td> 636634.000000</td>\n",
	" <td> 636634.000000</td>\n",
	" <td> 6.366340e+05</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>mean</th>\n",
	" <td> 2.251118e+11</td>\n",
	" <td> 5.837318</td>\n",
	" <td> 686.677007</td>\n",
	" <td> 8.958334e+11</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>std</th>\n",
	" <td> 3.241939e+12</td>\n",
	" <td> 2.052375</td>\n",
	" <td> 11434.580481</td>\n",
	" <td> 1.076409e+13</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>min</th>\n",
	" <td> 9.375000e+04</td>\n",
	" <td> 2.000000</td>\n",
	" <td> 0.000000</td>\n",
	" <td> 3.750000e+05</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>25%</th>\n",
	" <td> 5.940000e+07</td>\n",
	" <td> 4.000000</td>\n",
	" <td> 0.000000</td>\n",
	" <td> 3.348000e+08</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>50%</th>\n",
	" <td> 6.496500e+08</td>\n",
	" <td> 6.000000</td>\n",
	" <td> 0.552006</td>\n",
	" <td> 3.600000e+09</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>75%</th>\n",
	" <td> 7.638750e+09</td>\n",
	" <td> 7.000000</td>\n",
	" <td> 11.220987</td>\n",
	" <td> 4.151970e+10</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>max</th>\n",
	" <td> 4.548026e+14</td>\n",
	" <td> 15.000000</td>\n",
	" <td> 1665038.500000</td>\n",
	" <td> 1.263103e+15</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"<p>8 rows \u00d7 4 columns</p>\n",
	"</div>"
	],
	"metadata": {},
	"output_type": "pyout",
	"prompt_number": 4,
	"text": [
	" speed hops cost hopspeed\n",
	"count 6.366340e+05 636634.000000 636634.000000 6.366340e+05\n",
	"mean 2.251118e+11 5.837318 686.677007 8.958334e+11\n",
	"std 3.241939e+12 2.052375 11434.580481 1.076409e+13\n",
	"min 9.375000e+04 2.000000 0.000000 3.750000e+05\n",
	"25% 5.940000e+07 4.000000 0.000000 3.348000e+08\n",
	"50% 6.496500e+08 6.000000 0.552006 3.600000e+09\n",
	"75% 7.638750e+09 7.000000 11.220987 4.151970e+10\n",
	"max 4.548026e+14 15.000000 1665038.500000 1.263103e+15\n",
	"\n",
	"[8 rows x 4 columns]"
	]
	}
	],
	"prompt_number": 4
	},
	{
	"cell_type": "heading",
	"level": 2,
	"metadata": {},
	"source": [
	"Cost Attribution"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A top level segmentation is used for dividing on-net and off-net flows and analysing each seperately.\n",
	"\n",
	"Directions have been used to assign costs based on which entity is nearest to the sampling router. So for Direction=1 cost is attributed to the ingress entity, and similarly Direction=2 represents costs attributed to the egress entity. The data has been validated on both directions with sampling bias corrected for an error of upto 5%.\n",
	"\n",
	"Note: Direction has no relationship with sending/receiving traffic, rather it tells us where each flow was sampled (ingress/egress). Thus, each flow should be captured in both directions in case of bi-directional sampling, which is often the case."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# break data into disjoint sets\n",
	"\n",
	"ondir1 = master[master['onOff'] == 1][master['direction'] == 1]\n",
	"ondir2 = master[master['onOff'] == 1][master['direction'] == 2]\n",
	"offdir1 = master[master['onOff'] == 0][master['direction'] == 1]\n",
	"offdir2 = master[master['onOff'] == 0][master['direction'] == 2]\n",
	"\n",
	"# build Traffic, Hops, and Cost matrices\n",
	"\n",
	"on1_matrix = pivot_table(ondir1, rows=['inCountry','inRegion','inPOP','inRTR','inASN'], cols=['egCountry','egRegion','egPOP','c(hops)','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
	"on2_matrix = pivot_table(ondir2, cols=['inCountry','inRegion','inPOP','c(hops)','inRTR','inASN'], rows=['egCountry','egRegion','egPOP','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
	"off1_matrix = pivot_table(offdir1, rows=['inCountry','inRegion','inPOP','inRTR','inASN'], cols=['egCountry','egRegion','egPOP','c(hops)','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])\n",
	"off2_matrix = pivot_table(offdir2, cols=['inCountry','inRegion','inPOP','c(hops)','inRTR','inASN'], rows=['egCountry','egRegion','egPOP','egRTR','egASN'],values = ['hops','cost','speed','hopspeed','cost/hop','cost/bps'])"
	],
	"language": "python",
	"metadata": {},
	"outputs": [
	{
	"output_type": "stream",
	"stream": "stderr",
	"text": [
	"/Users/anant.malhotra/anaconda/envs/scientific0.1/lib/python2.7/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.\n",
	" \"DataFrame index.\", UserWarning)\n"
	]
	}
	],
	"prompt_number": 5
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A final aggregation step on the various metrics considered above."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# combine directions for Onnet, Offnet separately\n",
	"\n",
	"on_matrix = on1_matrix[['cost','speed','hopspeed']].add(on2_matrix[['cost','speed','hopspeed']], fill_value=0)\n",
	"off_matrix = off1_matrix[['cost','speed','hopspeed']].add(off2_matrix[['cost','speed','hopspeed']], fill_value=0)\n",
	"\n",
	"# adding Hops\n",
	"\n",
	"on_matrix = pd.concat({\"hops\":on1_matrix['hops'].combine_first(on2_matrix['hops'])},axis=1).join(on_matrix,how=\"inner\")\n",
	"off_matrix = pd.concat({\"hops\":off1_matrix['hops'].combine_first(off2_matrix['hops'])},axis=1).join(off_matrix,how=\"inner\")"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 6
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"on_matrix.to_pickle(\"on_matrix\")\n",
	"off_matrix.to_pickle(\"off_matrix\")"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 7
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Variable Construction"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"After preprocessing, an essential requirement is to gather enough features and minimize information loss caused upon aggregation. Here is a hierarchy of objects that we would be iterating over to compute the relevant traffic attributes."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# The hierarchy\n",
	"\n",
	"levels_y = ['country','region','pop','rtr','asn']\n",
	"levels_x = ['country','region','pop','c(hops)','rtr','asn']\n",
	"measures = ['speed','hopspeed','cost','hops']\n",
	"attributes = ['sum','mean','count','median']"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 8
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Here we compute some aggregates (attributes) at various levels described in the above hierarchy:"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# loop through levels in hierarchy\n",
	"\n",
	"feature_dict = {}\n",
	"aggregate_dict = {}\n",
	"\n",
	"for measure in measures:\n",
	" feature_dict[measure] = {}\n",
	" aggregate_dict[measure] = {}\n",
	" \n",
	" for level in levels_x:\n",
	" feature_dict[measure].update({level:{}})\n",
	" \n",
	" for attribute in attributes:\n",
	" # overall aggregates\n",
	" aggregate_dict[measure].update({attribute:eval('on_matrix[measure].%s(axis=1)'%attribute)})\n",
	" # level aggregates\n",
	" feature_dict[measure][level].update({attribute:eval('on_matrix[measure].%s(axis=1,level=levels_x.index(level))'%attribute)})"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 9
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This function creates a dataframe from dictionary while preserving the level hierarchy :"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"def createframe(dicts,levels=2):\n",
	" \n",
	" l1_dict = {}\n",
	" l2_dict = {}\n",
	" \n",
	" for key,value in dicts.iteritems():\n",
	" \n",
	" if levels == 1:\n",
	" l1_dict.update({key:pd.concat(value.values(),axis=1,keys=value.keys())})\n",
	"\n",
	" elif levels == 2:\n",
	" \n",
	" for key2,value2 in value.iteritems():\n",
	" l2_dict.update({key2:pd.concat(value2.values(),axis=1,keys=value2.keys())})\n",
	" \n",
	" l1_dict.update({key:pd.concat(l2_dict.values(),axis=1,keys=l2_dict.keys())})\n",
	" \n",
	" frame = pd.concat(l1_dict.values(),axis=1,keys=l1_dict.keys()) \n",
	" \n",
	" return frame"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 10
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Keep all aggregates in a separate dataframe. Create another sparse set of features that keeps all information."
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"aggregate_frame = createframe(aggregate_dict,levels=1)\n",
	"sparse_frame = createframe(feature_dict).reorder_levels([0,2,1,3],axis=1)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 11
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This is sparsely populated and we should also consider making some non-sparse data using this frames' aggregates. Here is an attempt where levels are sorted based on attributes and the top 5 attributes' values and labels are reported, this should reduce the sparsity :"
	]
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# top n Country, Region, PoP, Hops, Router and ASN (ID/Name, Attribute)\n",
	"\n",
	"top = {}\n",
	"\n",
	"def sortvalue(a, max):\n",
	" tmp = pd.Series(data = np.array(a.order(ascending=False)[:max]), index=[\"value\"+str(s) for s in range(1,max+1)])\n",
	" return tmp\n",
	"\n",
	"def sortindex(a, max):\n",
	" tmp = pd.Series(data = a.order(ascending=False).index[:max], index=[\"label\"+str(s) for s in range(1,max+1)])\n",
	" return tmp\n",
	"\n",
	"for measure in measures:\n",
	" top[measure] = {}\n",
	" \n",
	" for attribute in attributes:\n",
	" top[measure].update({attribute:{}})\n",
	" \n",
	" for level in levels_x:\n",
	" top[measure][attribute].update({level: sparse_frame[measure][attribute][level].div(aggregate_frame[measure][attribute],axis=0).apply(lambda x: sortvalue(x, 5), axis=1).join(sparse_frame[measure][attribute][level].apply(lambda x: sortindex(x, 5), axis=1))})\n",
	" \n",
	"top_frame = createframe(top)"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 12
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [
	"# save the datasets\n",
	"\n",
	"sparse_frame.to_pickle(\"sparse_frame\")\n",
	"top_frame.to_pickle(\"top_frame\")\n",
	"aggregate_frame.to_pickle(\"aggregate_frame\")\n",
	"master.to_pickle(\"master\")"
	],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 13
	},
	{
	"cell_type": "code",
	"collapsed": false,
	"input": [],
	"language": "python",
	"metadata": {},
	"outputs": [],
	"prompt_number": 13
	}
	],
	"metadata": {}
	}
	]
	}
No results found