@veena-LINE
Created February 4, 2020 21:25
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"# <span style=\"font-size:3em; color:orangered; font-weight:bold;\">Hacker News </span><span style=\"font-size:1.5em; color:gray; font-style:italic;\"> a case study </span> #\n",
"\n",
"<br>\n",
"\n",
"# <a name=\"whatishn\"> What is Hacker News ? </a> #\n",
"[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator [_Y Combinator,_](https://www.ycombinator.com/) where user-submitted stories (known as \"posts\") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.\n",
"More about it, [_here_](https://en.wikipedia.org/wiki/Hacker_News).\n",
"\n",
"<br>\n",
"\n",
"# <a name=\"source\"> Source </a> #\n",
"When you visit [Hacker News](https://news.ycombinator.com/), you will find submissions categorized into __new__, __past__, __comments__, __Ask HN__, __Show HN__, and, who knows, you may find your next job in the __jobs__ category !<br>\n",
"A dataset of ~300,000 submissions (with timestamps recorded in ET, the Eastern Time Zone), covering all the above categories, is sourced from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts).\n",
"\n",
"<br>\n",
"\n",
"# <a name=\"journey\"> About this journey </a> #\n",
"\n",
"Here, I will focus on `Ask HN` and `Show HN` submissions only, a subset of everything HN receives.<br>\n",
"<span style=\"font-weight:bold;\">What's in it for you ?</span>\n",
"_You will track my journey (yes, I am on [dataquest.io](https://www.dataquest.io)), and hopefully I will learn from your insights !_\n",
"\n",
"<br>\n",
"\n",
"# <a name=\"endgoal\"> End GOAL </a> #\n",
"\n",
"__Ask HN__ submissions : Users submit Ask HN posts to ask the Hacker News community a specific question.<br>\n",
"__Show HN__ submissions : Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.\n",
"\n",
"I'll be comparing these two types of posts to determine :\n",
"1. [Do __Ask HN__ or __Show HN__ posts receive more comments on average, and why?](#askvsshow)\n",
"2. [Do posts created at a certain time receive more comments on average?](#favorabletimes)\n",
"\n",
"<br>\n",
"\n",
"# Journey begins #\n",
"Here's a description of the columns from the dataset :\n",
"\n",
"|INDEX|Column|Description|\n",
"|:-:|:----|:----|\n",
"|0|<b>id</b>| The unique identifier from Hacker News for the post|\n",
"|1|<b>title</b>| The title of the post|\n",
"|2|<b>url</b>| The URL that the post links to, if the post has one|\n",
"|3|<b>num_points</b>| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|\n",
"|4|<b>num_comments</b>| The number of comments that were made on the post|\n",
"|5|<b>author</b>| The username of the person who submitted the post|\n",
"|6|<b>created_at</b>| The date and time at which the post was submitted|\n",
"\n",
"<br>\n",
"Let's start by reading the sourced CSV file and displaying a few posts / submissions :<br>\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']\n",
"\n",
"['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']\n",
"\n",
"['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']\n",
"\n",
"['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']\n",
"\n"
]
}
],
"source": [
"from csv import reader\n",
"\n",
"file_open = open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf-8') # Open the file for reading with UTF encoding.\n",
"file_read = reader(file_open) # Read the opened file.\n",
"\n",
"hn = list(file_read) # Convert the read file into a list of lists.\n",
"\n",
"\n",
"header = hn[0] # Store the column headers separately.\n",
"hn = hn[1:] # hn[] is just data points now (excludes column header). Makes it easier to work with without the header.\n",
"\n",
"\n",
"print(f'{header}\\n')  # Print the column header.\n",
"\n",
"for row in hn[:3]:\n",
"    print(f'{row}\\n')  # Display the first few posts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"Dataset is available in the list `hn`.\n",
"I will walk you through the steps I followed.\n",
"<br>\n",
"\n",
"----\n",
"<br>\n",
"\n",
"## |1| Isolate _`Ask HN`_ and _`Show HN`_\n",
"I am looking for posts with titles starting with __Ask HN__ and __Show HN__ only.<br>\n",
"These posts should first be isolated from the source dataset `hn` before data cleaning.\n",
"<br><br>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[9139] Ask HN submissions.\n",
"[10158] Show HN submissions.\n",
"[273822] Other submissions\n"
]
}
],
"source": [
"ask_hn = [] # List of 'Ask HN' posts\n",
"show_hn = [] # List of 'Show HN' posts\n",
"other_hn = [] # List of all the other non-qualifying posts\n",
"\n",
"for row in hn:  # Scan the `title` column (index 1) for the matching criteria.\n",
" title = row[1]\n",
" \n",
" if title.lower().startswith('ask hn'): # Let's be cautious about mixed-case text that still matches 'ask hn'\n",
" ask_hn.append(row)\n",
" elif title.lower().startswith('show hn'): # Let's be cautious about mixed-case text that still matches 'show hn'\n",
" show_hn.append(row)\n",
" else:\n",
" other_hn.append(row)\n",
"\n",
"# Print to make sure we have sufficient posts for analysis.\n",
"print(f'[{len(ask_hn)}] Ask HN submissions.')\n",
"print(f'[{len(show_hn)}] Show HN submissions.')\n",
"print(f'[{len(other_hn)}] Other submissions')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"<span style=\"color:tomato; font-style:italic; font-weight:light; font-size:1.3em;\"> There are sufficient posts in both Ask and Show HN categories. </span>\n",
"<br>\n",
"\n",
"----\n",
"<br>\n",
"\n",
"## |2| Cleaning data #\n",
"Now that we have two separate lists for Ask and Show HN posts, I need to make sure the data is complete before proceeding with the intended analysis.<br>\n",
"In other words, if any posts have missing data points, we can simply skip them, provided they are just a handful.\n",
"\n",
"> ### |2.1| Any missing data points ? #\n",
"\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Missing [0] / Total [9139] : Ask HN posts.\n",
"Missing [0] / Total [10158] : Show HN posts.\n"
]
}
],
"source": [
"length_hn = len(header)  # Length of the header row (to check for missing data points)\n",
"ask_missing = []   # Ask HN posts with missing column values will be collected here.\n",
"show_missing = []  # Show HN posts with missing column values will be collected here.\n",
"\n",
"for hnpost in ask_hn:\n",
" hnpost_length = len(hnpost)\n",
" if hnpost_length != length_hn:\n",
" ask_missing.append(hnpost)\n",
"\n",
"for hnpost in show_hn:\n",
" hnpost_length = len(hnpost)\n",
" if hnpost_length != length_hn:\n",
" show_missing.append(hnpost)\n",
"\n",
"\n",
"print(f'Missing [{len(ask_missing)}] / Total [{len(ask_hn)}] : Ask HN posts.')\n",
"print(f'Missing [{len(show_missing)}] / Total [{len(show_hn)}] : Show HN posts.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"<span style=\"color:tomato; font-style:italic; font-weight:light; font-size:1.3em;\"> There are no missing data points in both Ask and Show HN posts. </span>\n",
"\n",
"<br>\n",
"\n",
"> ### |2.2| Ensure data points in a column are of the same data type #\n",
"> While browsing the CSV, I could see that there are columns with numeric and time data. However, I will check the stored data type, which will give me a heads-up for data type conversion.<br>\n",
"> I will perform this check for the columns at indexes 3, 4, and -1 (the last column), as I will be using them for my analysis.\n",
"\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ask HN : num_points is of type {<class 'str'>: 9139}\n",
"Ask HN : num_comments is of type {<class 'str'>: 9139}\n",
"Ask HN : created_at is of type {<class 'str'>: 9139}\n",
"--------------------------------------------------------\n",
"Show HN : num_points is of type {<class 'str'>: 10158}\n",
"Show HN : num_comments is of type {<class 'str'>: 10158}\n",
"Show HN : created_at is of type {<class 'str'>: 10158}\n"
]
}
],
"source": [
"# FUNCTION : builds a frequency table of the data types found at column `index` across the rows of `posts`.\n",
"def datatype_ft(posts, index):\n",
" type_ft = {}\n",
" \n",
" for post in posts:\n",
" post_col = type(post[index])\n",
" if post_col in type_ft:\n",
" type_ft[post_col] += 1\n",
" else:\n",
" type_ft[post_col] = 1\n",
" \n",
" return type_ft\n",
"\n",
"indexes = [3, 4, -1]\n",
"for index in indexes:\n",
" ask_type_ft = datatype_ft(ask_hn, index)\n",
" print(f'Ask HN : {header[index]} is of type {ask_type_ft}')\n",
"\n",
"print('--------------------------------------------------------')\n",
"\n",
"for index in indexes:\n",
" show_type_ft = datatype_ft(show_hn, index)\n",
" print(f'Show HN : {header[index]} is of type {show_type_ft}')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"<span style=\"color:tomato; font-style:italic; font-weight:light; font-size:1.3em;\"> Column-wise, they are of the same data type. </span> <br>\n",
"I'll convert with `int()` or `datetime.strptime()` as the analysis requires.\n",
"<br>\n",
"\n",
"----\n",
"<br>\n",
"\n",
"## |3| <a name=\"askvsshow\"> _`Ask HN`_ vs _`Show HN`_ </a> #\n",
"\n",
"The lists have been isolated and the data is in a clean state.<br>\n",
"It's time to dig in.<br>\n",
"To meet my end goal, I will first gather stats on __total comments__ by post type :\n",
"\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Ask HN posts | 9139 posts | 94986 total comments (65.68%) | 10.39 comments average per post.\n",
"| Show HN posts | 10158 posts | 49633 total comments (34.32%) | 4.89 comments average per post.\n"
]
}
],
"source": [
"ask_comments_total = 0 # Total comments from `Ask HN` submissions.\n",
"show_comments_total = 0 # Total comments from `Show HN` submissions.\n",
"\n",
"for askhn in ask_hn: # Compute total comments from `Ask HN` submissions.\n",
" ask_comments_total += int(askhn[4])\n",
"\n",
"\n",
"for showhn in show_hn: # Compute total comments from `Show HN` submissions.\n",
" show_comments_total += int(showhn[4])\n",
"\n",
"\n",
"ask_show_comments_total = ask_comments_total + show_comments_total\n",
"\n",
"\n",
"\n",
"# Ask HN stats\n",
"print(f'| Ask HN posts | {len(ask_hn)} posts | {ask_comments_total} total comments '\n",
" f'({(ask_comments_total / ask_show_comments_total)*100:.2f}%) | '\n",
" f'{(ask_comments_total / len(ask_hn)):.2f} comments average per post.');\n",
"\n",
"# Show HN stats\n",
"print(f'| Show HN posts | {len(show_hn)} posts | {show_comments_total} total comments '\n",
" f'({(show_comments_total / ask_show_comments_total)*100:.2f}%) | '\n",
" f'{(show_comments_total / len(show_hn)):.2f} comments average per post.');\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"Simple stats show that :\n",
"* Total ___Ask HN_ comments__ > Total ___Show HN_ comments__.\n",
"* __Ask HN__ submissions average `10.39` comments/post, while __Show HN__'s average is `4.89`.\n",
"\n",
"<br>\n",
"I wish it were this straightforward, so that I could conclude my analysis here. But the distribution of comments per post is far from uniform, as you will see in a bit.<br>\n",
"Some posts fail to garner any user response, while a few strike such a chord that they hit the top spot with the maximum response.<br>\n",
"These extremes could be skewing the averages.<br>\n",
"To verify this, I will check on posts that have no comments at all :<br>\n",
" \n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Ask HN posts | Total posts/posts with 0 comments = 9139/2228 | 24.38%\n",
"| Show HN posts | Total posts/posts with 0 comments = 10158/5099 | 50.20%\n"
]
}
],
"source": [
"ask_comments_0 = 0\n",
"for askhn in ask_hn: # 'Ask HN' posts without any comments.\n",
" if askhn[4] == '0':\n",
" ask_comments_0 += 1\n",
" \n",
"show_comments_0 = 0\n",
"for showhn in show_hn: # 'Show HN' posts without any comments.\n",
" if showhn[4] == '0':\n",
" show_comments_0 += 1\n",
" \n",
"\n",
"\n",
"# Ask HN stats (% of posts with 0 comments)\n",
"print(f'| Ask HN posts | Total posts/posts with 0 comments = '\n",
" f'{len(ask_hn)}/{ask_comments_0} | {(ask_comments_0/len(ask_hn))*100:.2f}%')\n",
"\n",
"# Show HN stats (% of posts with 0 comments)\n",
"print(f'| Show HN posts | Total posts/posts with 0 comments = '\n",
" f'{len(show_hn)}/{show_comments_0} | {(show_comments_0/len(show_hn))*100:.2f}%')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
" \n",
"Observations about posts with 0 comments :\n",
"* About ___a quarter of Ask HN (24.38%)___ and ___half of Show HN submissions (50.20%)___ received no comments at all.<br>\n",
"Put differently, about 75% of Ask HN posts drew at least one comment, versus only about 50% of Show HN posts, for Jan to Sep 2016.\n",
"\n",
"<br><br>\n",
"\n",
"To be fair, if I exclude these zero-comment posts from the mean calculation, I get :\n",
"<br>"
]
},
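{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"In other words, the adjusted mean keeps every comment but divides only by the posts that received at least one :\n",
"\n",
"$$\\bar{x}_{>0} = \\frac{\\text{total comments}}{\\text{total posts} - \\text{posts with 0 comments}}$$\n"
]
},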
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Ask HN posts | 9139 / 94986 comments | 13.74 comments average per post.\n",
"| Show HN posts | 10158 / 49633 comments | 9.81 comments average per post.\n"
]
}
],
"source": [
"# Ask HN stats (new average excluding submissions with 0 comments)\n",
"print(f'| Ask HN posts | {len(ask_hn)} / {ask_comments_total} comments | '\n",
" f'{(ask_comments_total / (len(ask_hn) - ask_comments_0)):.2f} comments average per post.');\n",
"\n",
"\n",
"# Show HN stats (new average excluding submissions with 0 comments)\n",
"print(f'| Show HN posts | {len(show_hn)} / {show_comments_total} comments | '\n",
" f'{(show_comments_total / (len(show_hn) - show_comments_0)):.2f} comments average per post.');\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"I now have improved averages, which I will tabulate for easy reference later on :\n",
"<br>\n",
"\n",
"|Submission Type|Mean comments <br> (all submissions)|Mean comments <br> (excluding submissions with 0 comments)|\n",
"|:---|:---|:---|\n",
"|Ask HN|10.39|<a name=\"askhnmean\">13.74</a>|\n",
"|Show HN|4.89|<a name=\"showhnmean\">9.81</a>|\n",
"\n",
"<br>\n",
"\n",
"How does the new _'average comments per post'_ compare with the actual dataset values ? <br>\n",
"Are they fairly distributed ? Standard deviation gives a measure of how the values are spread. Before calculating it, I will display the first few and the last few rows to get an idea of the outliers.<br>\n",
"\n",
"<br>\n",
"\n",
"* _The table below displays the #submissions grouped by #comments for `Ask HN`_ :\n",
"\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"1007 comments / [1 posts] = [1007] Total comments\n",
"947 comments / [1 posts] = [947] Total comments\n",
"937 comments / [1 posts] = [937] Total comments\n",
"910 comments / [1 posts] = [910] Total comments\n",
"898 comments / [1 posts] = [898] Total comments\n",
"896 comments / [1 posts] = [896] Total comments\n",
"868 comments / [1 posts] = [868] Total comments\n",
"825 comments / [1 posts] = [825] Total comments\n",
"778 comments / [1 posts] = [778] Total comments\n",
"767 comments / [1 posts] = [767] Total comments\n",
"--------------------------------------------------\n",
"9 comments / [181 posts] = [1629] Total comments\n",
"8 comments / [149 posts] = [1192] Total comments\n",
"7 comments / [260 posts] = [1820] Total comments\n",
"6 comments / [343 posts] = [2058] Total comments\n",
"5 comments / [375 posts] = [1875] Total comments\n",
"4 comments / [595 posts] = [2380] Total comments\n",
"3 comments / [766 posts] = [2298] Total comments\n",
"2 comments / [1243 posts] = [2486] Total comments\n",
"1 comments / [1388 posts] = [1388] Total comments\n",
"0 comments / [2228 posts] = [0] Total comments\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"# Build a frequency table for Ask HN (#submissions grouped by #comments)\n",
"ask_comments_ft = {}\n",
"for askhn in ask_hn:\n",
" comments = int(askhn[4])\n",
"\n",
" if comments in ask_comments_ft:\n",
" ask_comments_ft[comments] += 1\n",
" else:\n",
" ask_comments_ft[comments] = 1\n",
"\n",
"\n",
"\n",
"# Sort frequency table for Ask HN (#submissions grouped by #comments)\n",
"ask_comments_ft_sorted = []\n",
"for key, value in ask_comments_ft.items():\n",
" askhn_tuple = (key, value)\n",
" ask_comments_ft_sorted.append(askhn_tuple)\n",
"\n",
"ask_comments_ft_sorted = sorted(ask_comments_ft_sorted, reverse=True)\n",
"\n",
"\n",
"# Print the sorted table (Top 10 + Bottom 10 rows)\n",
"print('--------------------------------------------------')\n",
"for ask_sorted in ask_comments_ft_sorted[:10]: # Top 10\n",
" print(f'{ask_sorted[0]} comments / [{ask_sorted[1]} posts] = [{ask_sorted[0] * ask_sorted[1]}] Total comments')\n",
"print('--------------------------------------------------')\n",
"for ask_sorted in ask_comments_ft_sorted[-10:]: # Bottom 10\n",
" print(f'{ask_sorted[0]} comments / [{ask_sorted[1]} posts] = [{ask_sorted[0] * ask_sorted[1]}] Total comments')\n",
"print('--------------------------------------------------')"
]
},
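{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"As a side note, the tuple-building-and-sorting step above can be written more compactly with `sorted()` over `dict.items()`. A minimal sketch, using a hypothetical toy frequency table rather than the real data :\n",
"\n",
"```python\n",
"# Toy frequency table : comments -> number of posts (hypothetical values).\n",
"ft = {3: 2, 10: 1, 0: 5}\n",
"\n",
"# dict.items() yields (comments, posts) pairs; sorted(..., reverse=True)\n",
"# orders them by comment count, descending, just like the loop above.\n",
"ft_sorted = sorted(ft.items(), reverse=True)\n",
"print(ft_sorted)  # [(10, 1), (3, 2), (0, 5)]\n",
"```\n"
]
},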
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"___The calculated mean was [13.74](#askhnmean) comments per submission for Ask HN submissions that have comments___, so there should be around 13 comments per submission.<br>\n",
"From the list above, you can see that the actual comments per post vary drastically. Results are skewed at both extremes : one post drew 1007 comments, while 1388 posts drew just 1 comment each.\n",
"\n",
"___Standard deviation___ tells us how far the values typically lie from the mean (here, how the comments per post are spread around the mean of 13.74).<br>\n",
"But before that, I will check the #submissions grouped by #comments for _`Show HN`_ as well.<br>\n",
"\n",
"<br>\n",
"\n",
"* _The table below displays the #submissions grouped by #comments for `Show HN`_ :\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--------------------------------------------------\n",
"306 comments / [1 posts] = [306] Total comments\n",
"298 comments / [1 posts] = [298] Total comments\n",
"280 comments / [1 posts] = [280] Total comments\n",
"257 comments / [1 posts] = [257] Total comments\n",
"250 comments / [1 posts] = [250] Total comments\n",
"233 comments / [1 posts] = [233] Total comments\n",
"211 comments / [1 posts] = [211] Total comments\n",
"206 comments / [1 posts] = [206] Total comments\n",
"199 comments / [1 posts] = [199] Total comments\n",
"197 comments / [1 posts] = [197] Total comments\n",
"--------------------------------------------------\n",
"9 comments / [84 posts] = [756] Total comments\n",
"8 comments / [102 posts] = [816] Total comments\n",
"7 comments / [108 posts] = [756] Total comments\n",
"6 comments / [172 posts] = [1032] Total comments\n",
"5 comments / [196 posts] = [980] Total comments\n",
"4 comments / [301 posts] = [1204] Total comments\n",
"3 comments / [506 posts] = [1518] Total comments\n",
"2 comments / [816 posts] = [1632] Total comments\n",
"1 comments / [1748 posts] = [1748] Total comments\n",
"0 comments / [5099 posts] = [0] Total comments\n",
"--------------------------------------------------\n"
]
}
],
"source": [
"# Build a frequency table for Show HN (#submissions grouped by #comments)\n",
"show_comments_ft = {}\n",
"for showhn in show_hn:\n",
" comments = int(showhn[4])\n",
"\n",
" if comments in show_comments_ft:\n",
" show_comments_ft[comments] += 1\n",
" else:\n",
" show_comments_ft[comments] = 1\n",
"\n",
"\n",
"\n",
"# Sort frequency table for Show HN (#submissions grouped by #comments)\n",
"show_comments_ft_sorted = []\n",
"for key, value in show_comments_ft.items():\n",
" showhn_tuple = (key, value)\n",
" show_comments_ft_sorted.append(showhn_tuple)\n",
"\n",
"show_comments_ft_sorted = sorted(show_comments_ft_sorted, reverse=True)\n",
"\n",
"\n",
"\n",
"# Print the sorted table (Top 10 + Bottom 10 rows)\n",
"print('--------------------------------------------------')\n",
"for show_sorted in show_comments_ft_sorted[:10]: # Top 10\n",
" print(f'{show_sorted[0]} comments / [{show_sorted[1]} posts] = [{show_sorted[0] * show_sorted[1]}] Total comments')\n",
"print('--------------------------------------------------')\n",
"for show_sorted in show_comments_ft_sorted[-10:]: # Bottom 10\n",
" print(f'{show_sorted[0]} comments / [{show_sorted[1]} posts] = [{show_sorted[0] * show_sorted[1]}] Total comments')\n",
"print('--------------------------------------------------')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"___The calculated mean was [9.81](#showhnmean) comments per submission for Show HN submissions that have comments___, so there should be around 9 comments per submission.<br>\n",
"From the list above, you can see that the actual comments per post vary drastically. Results are skewed at both extremes : the top post drew 306 comments, while 1748 posts drew just 1 comment each.\n",
"\n",
"<br>\n",
"\n",
"* Below, I will calculate the `standard deviation` of both ___Ask HN___ and ___Show HN___ posts :\n",
"<br>"
]
},
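{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"For reference, `numpy.std` computes the population standard deviation by default (`ddof=0`) :\n",
"\n",
"$$\\sigma = \\sqrt{\\frac{1}{N}\\sum_{i=1}^{N}(x_i - \\mu)^2}$$\n",
"\n",
"where $\\mu$ is the mean of the $N$ non-zero comment counts.\n"
]
},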
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| Ask HN | Std.Deviation = 49.57\n",
"| Show HN | Std.Deviation = 21.81\n"
]
}
],
"source": [
"askhn_non0 = []   # A list to hold non-zero `Ask HN` comment counts.\n",
"showhn_non0 = []  # A list to hold non-zero `Show HN` comment counts.\n",
"\n",
"for askhn in ask_hn: # Build the non-0 `Ask HN` comments array.\n",
" comments = int(askhn[4])\n",
" \n",
" if comments != 0:\n",
" askhn_non0.append(comments)\n",
" \n",
" \n",
"for showhn in show_hn: # Build the non-0 `Show HN` comments array.\n",
" comments = int(showhn[4])\n",
" \n",
" if comments != 0:\n",
" showhn_non0.append(comments)\n",
"\n",
" \n",
"\n",
"# Import numpy for its standard deviation function std()\n",
"import numpy as np\n",
"\n",
"print(f'| Ask HN | Std.Deviation = {np.std(askhn_non0):.2f}\\n'\n",
"      f'| Show HN | Std.Deviation = {np.std(showhn_non0):.2f}')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"<span style=\"color:tomato; font-style:italic; font-weight:light; font-size:1.3em;\"> This shows that <b>Show HN</b> submissions have a much fairer distribution of comments per post around the calculated mean. </span><br>\n",
"_The smaller the standard deviation, the closer the values cluster around the mean._\n",
"<br><br>\n",
"\n",
"___Could the type of posts provide a clue to why Show HN has a fairer distribution ?___ <br>\n",
">_If you recall [how the posts are categorized](#endgoal) on the Hacker News portal :<br>\n",
"<u>`ask` submissions raise questions</u> that one hopes, with luck, to get answered. For this year, about a quarter of Ask submissions weren't answered yet. The remaining three quarters showed great diversity in the number of comments, since that depends on the question asked, the knowledge of the respondents, how many respond, and so on._ <br><br>\n",
"_<u>`show` submissions showcase</u> new talent, be it a project, a product, an interesting idea or an interesting discussion, many in the hope of becoming the next big thing !<br>\n",
"An interesting showcased submission attracts the attention of HN and its community, with Q&A flowing both ways.<br>\n",
"This could explain why the comment counts on show submissions are more evenly spread._\n",
"<br>\n",
"\n",
"----\n",
"<br>\n",
"\n",
"## |4| <a name=\"favorabletimes\"> Is there a favorable time to post <i>`ask`</i> and <i>`show`</i> HN submissions ? </a> #\n",
"\n",
"As mentioned earlier, HN timestamps are recorded in Eastern Time. I will convert them to PST (Pacific Time) for my analysis.<br>\n",
"To measure favorable times, I will average the comments received per post, grouped by the hour of creation.<br>\n",
"\n",
"<br>\n",
"\n",
"* Peak and non-peak ___ask___ submission times are listed below :\n",
"<br>"
]
},
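{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"For each hour $h$, the favorability measure computed below is simply the mean comments per post for that hour :\n",
"\n",
"$$\\text{avg}_h = \\frac{\\text{comments on posts created in hour } h}{\\text{posts created in hour } h}$$\n"
]
},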
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" -------------------------------------\n",
"| Ask HN | #posts & #comments by hour |\n",
" -------------------------------------\n",
"23 Hour(s) : 269 posts / 2996 comments\n",
"22 Hour(s) : 282 posts / 2089 comments\n",
"19 Hour(s) : 383 posts / 3372 comments\n",
"18 Hour(s) : 518 posts / 4500 comments\n",
"16 Hour(s) : 552 posts / 3954 comments\n",
"14 Hour(s) : 587 posts / 5547 comments\n",
"12 Hour(s) : 646 posts / 18525 comments\n",
"11 Hour(s) : 513 posts / 4972 comments\n",
"10 Hour(s) : 444 posts / 7245 comments\n",
"08 Hour(s) : 312 posts / 2797 comments\n",
"07 Hour(s) : 282 posts / 3013 comments\n",
"06 Hour(s) : 222 posts / 1477 comments\n",
"04 Hour(s) : 226 posts / 1585 comments\n",
"00 Hour(s) : 271 posts / 2154 comments\n",
"20 Hour(s) : 343 posts / 2297 comments\n",
"17 Hour(s) : 510 posts / 4462 comments\n",
"13 Hour(s) : 579 posts / 4466 comments\n",
"05 Hour(s) : 257 posts / 2362 comments\n",
"21 Hour(s) : 301 posts / 2277 comments\n",
"15 Hour(s) : 614 posts / 4877 comments\n",
"09 Hour(s) : 342 posts / 4234 comments\n",
"01 Hour(s) : 243 posts / 2360 comments\n",
"03 Hour(s) : 234 posts / 1587 comments\n",
"02 Hour(s) : 209 posts / 1838 comments\n",
" ----------------------\n",
"| Ask HN Averages/Hour |\n",
" ----------------------\n",
"23 : 11.14\n",
"22 : 7.41\n",
"19 : 8.80\n",
"18 : 8.69\n",
"16 : 7.16\n",
"14 : 9.45\n",
"12 : 28.68\n",
"11 : 9.69\n",
"10 : 16.32\n",
"08 : 8.96\n",
"07 : 10.68\n",
"06 : 6.65\n",
"04 : 7.01\n",
"00 : 7.95\n",
"20 : 6.70\n",
"17 : 8.75\n",
"13 : 7.71\n",
"05 : 9.19\n",
"21 : 7.56\n",
"15 : 7.94\n",
"09 : 12.38\n",
"01 : 9.71\n",
"03 : 6.78\n",
"02 : 8.79\n"
]
}
],
"source": [
"import datetime as dt\n",
"\n",
"ask_posts_by_hour = {}\n",
"ask_comments_by_hour = {}\n",
"\n",
"for row in ask_hn:\n",
" created_dt_str = row[-1]\n",
" created_dt_obj = dt.datetime.strptime(created_dt_str, '%m/%d/%Y %H:%M') - dt.timedelta(hours=3) # Adjust for ET -> PST\n",
" created_dt_hour = created_dt_obj.strftime('%H')\n",
"\n",
" ask_posts = 1\n",
" ask_comments = int(row[4])\n",
"\n",
" if created_dt_hour in ask_posts_by_hour:\n",
" ask_comments_by_hour[created_dt_hour] += ask_comments\n",
" ask_posts_by_hour[created_dt_hour] += ask_posts\n",
" else:\n",
" ask_comments_by_hour[created_dt_hour] = ask_comments\n",
" ask_posts_by_hour[created_dt_hour] = ask_posts\n",
"\n",
"\n",
" \n",
"print(f' -------------------------------------\\n'\n",
"f'| Ask HN | #posts & #comments by hour |\\n'\n",
"f' -------------------------------------')\n",
"for hour, posts in ask_posts_by_hour.items():\n",
" print(f'{hour} Hour(s) : {posts} posts / {ask_comments_by_hour[hour]} comments')\n",
"\n",
"\n",
"\n",
"\n",
"ask_averages_by_hour = []\n",
"for hour, comments in ask_comments_by_hour.items():\n",
" average = comments / ask_posts_by_hour[hour]\n",
" ask_averages_by_hour.append([hour, average])\n",
" \n",
"print(f' ----------------------\\n'\n",
"f'| Ask HN Averages/Hour |\\n'\n",
"f' ----------------------')\n",
"for hour_average in ask_averages_by_hour:\n",
" print(f'{hour_average[0]} : {hour_average[1]:.2f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"I will replicate the same for ___show___ submissions as well, and then list the sorted averages for both together.<br>\n",
"\n",
"* Peak and non-peak ___show___ submission times are listed below :\n",
"<br>"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" --------------------------------------\n",
"| Show HN | #posts & #comments by hour |\n",
" --------------------------------------\n",
"21 Hour(s) : 276 posts / 1283 comments\n",
"20 Hour(s) : 319 posts / 1444 comments\n",
"17 Hour(s) : 525 posts / 2183 comments\n",
"16 Hour(s) : 556 posts / 2791 comments\n",
"15 Hour(s) : 656 posts / 3242 comments\n",
"13 Hour(s) : 801 posts / 3769 comments\n",
"11 Hour(s) : 696 posts / 3839 comments\n",
"07 Hour(s) : 323 posts / 1228 comments\n",
"06 Hour(s) : 302 posts / 1411 comments\n",
"05 Hour(s) : 316 posts / 1771 comments\n",
"03 Hour(s) : 192 posts / 904 comments\n",
"00 Hour(s) : 206 posts / 934 comments\n",
"18 Hour(s) : 430 posts / 1759 comments\n",
"14 Hour(s) : 761 posts / 3236 comments\n",
"12 Hour(s) : 836 posts / 3824 comments\n",
"08 Hour(s) : 402 posts / 2413 comments\n",
"04 Hour(s) : 236 posts / 1577 comments\n",
"01 Hour(s) : 194 posts / 978 comments\n",
"10 Hour(s) : 610 posts / 3314 comments\n",
"09 Hour(s) : 516 posts / 3609 comments\n",
"22 Hour(s) : 247 posts / 1006 comments\n",
"19 Hour(s) : 377 posts / 1450 comments\n",
"23 Hour(s) : 209 posts / 1076 comments\n",
"02 Hour(s) : 172 posts / 592 comments\n",
" -----------------------\n",
"| Show HN Averages/Hour |\n",
" -----------------------\n",
"21 : 4.65\n",
"20 : 4.53\n",
"17 : 4.16\n",
"16 : 5.02\n",
"15 : 4.94\n",
"13 : 4.71\n",
"11 : 5.52\n",
"07 : 3.80\n",
"06 : 4.67\n",
"05 : 5.60\n",
"03 : 4.71\n",
"00 : 4.53\n",
"18 : 4.09\n",
"14 : 4.25\n",
"12 : 4.57\n",
"08 : 6.00\n",
"04 : 6.68\n",
"01 : 5.04\n",
"10 : 5.43\n",
"09 : 6.99\n",
"22 : 4.07\n",
"19 : 3.85\n",
"23 : 5.15\n",
"02 : 3.44\n"
]
}
],
"source": [
"show_posts_by_hour = {}\n",
"show_comments_by_hour = {}\n",
"\n",
"for row in show_hn:\n",
" created_dt_str = row[-1]\n",
" created_dt_obj = dt.datetime.strptime(created_dt_str, '%m/%d/%Y %H:%M') - dt.timedelta(hours=3) # Adjust for ET -> PST\n",
" created_dt_hour = created_dt_obj.strftime('%H')\n",
"\n",
" show_posts = 1\n",
" show_comments = int(row[4])\n",
"\n",
" if created_dt_hour in show_posts_by_hour:\n",
" show_comments_by_hour[created_dt_hour] += show_comments\n",
" show_posts_by_hour[created_dt_hour] += show_posts\n",
" else:\n",
" show_comments_by_hour[created_dt_hour] = show_comments\n",
" show_posts_by_hour[created_dt_hour] = show_posts\n",
"\n",
"\n",
" \n",
"print(f' --------------------------------------\\n'\n",
"f'| Show HN | #posts & #comments by hour |\\n'\n",
"f' --------------------------------------')\n",
"for hour, posts in show_posts_by_hour.items():\n",
" print(f'{hour} Hour(s) : {posts} posts / {show_comments_by_hour[hour]} comments')\n",
"\n",
"\n",
"\n",
"\n",
"show_averages_by_hour = []\n",
"for hour, comments in show_comments_by_hour.items():\n",
" average = comments / show_posts_by_hour[hour]\n",
" show_averages_by_hour.append([hour, average])\n",
" \n",
"print(f' -----------------------\\n'\n",
"f'| Show HN Averages/Hour |\\n'\n",
"f' -----------------------')\n",
"for hour_average in show_averages_by_hour:\n",
" print(f'{hour_average[0]} : {hour_average[1]:.2f}')"
]
},
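{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"As an aside, the per-hour tally above is a classic dictionary-counting pattern. A minimal sketch with `collections.Counter` (the toy rows below are made up for illustration, not taken from the dataset) expresses the same idea without the if/else branch, because a `Counter` treats missing keys as zero :\n",
"\n",
"```python\n",
"import datetime as dt\n",
"from collections import Counter\n",
"\n",
"# Toy rows shaped like the dataset: index 4 = #comments, index -1 = created_at (ET)\n",
"rows = [\n",
"    ['id1', 'title', 'url', '5', '4', 'user', '8/4/2016 11:52'],\n",
"    ['id2', 'title', 'url', '3', '6', 'user', '8/4/2016 11:05'],\n",
"    ['id3', 'title', 'url', '7', '1', 'user', '9/1/2016 02:30'],\n",
"]\n",
"\n",
"posts_by_hour = Counter()\n",
"comments_by_hour = Counter()\n",
"for row in rows:\n",
"    created = dt.datetime.strptime(row[-1], '%m/%d/%Y %H:%M') - dt.timedelta(hours=3)  # ET -> PT\n",
"    hour = created.strftime('%H')\n",
"    posts_by_hour[hour] += 1              # missing keys default to 0\n",
"    comments_by_hour[hour] += int(row[4])\n",
"\n",
"print(posts_by_hour['08'], comments_by_hour['08'])  # → 2 10\n",
"```"
]
},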
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"It's hard to pick out the best and worst hours from the unsorted averages above.<br>\n",
"One shouldn't be bogged down by an unreadable display.<br>\n",
"\n",
"<br><br>\n",
"I will make it easy to read by sorting the averages ___in descending order___ :<br>"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" --------------------------------\n",
"| Best PST time for Ask HN Posts |\n",
" --------------------------------\n",
"\n",
"Top 1 : 12 : 28.68\n",
"Top 2 : 10 : 16.32\n",
"Top 3 : 09 : 12.38\n",
"Top 4 : 23 : 11.14\n",
"Top 5 : 07 : 10.68\n",
" 01 : 9.71\n",
" 11 : 9.69\n",
" 14 : 9.45\n",
" 05 : 9.19\n",
" 08 : 8.96\n",
" 19 : 8.80\n",
" 02 : 8.79\n",
" 17 : 8.75\n",
" 18 : 8.69\n",
" 00 : 7.95\n",
" 15 : 7.94\n",
" 13 : 7.71\n",
" 21 : 7.56\n",
" 22 : 7.41\n",
" 16 : 7.16\n",
" 04 : 7.01\n",
" 03 : 6.78\n",
" 20 : 6.70\n",
" 06 : 6.65\n",
"\n",
" ---------------------------------\n",
"| Best PST time for Show HN Posts |\n",
" ---------------------------------\n",
"\n",
"Top 1 : 09 : 6.99\n",
"Top 2 : 04 : 6.68\n",
"Top 3 : 08 : 6.00\n",
"Top 4 : 05 : 5.60\n",
"Top 5 : 11 : 5.52\n",
" 10 : 5.43\n",
" 23 : 5.15\n",
" 01 : 5.04\n",
" 16 : 5.02\n",
" 15 : 4.94\n",
" 03 : 4.71\n",
" 13 : 4.71\n",
" 06 : 4.67\n",
" 21 : 4.65\n",
" 12 : 4.57\n",
" 00 : 4.53\n",
" 20 : 4.53\n",
" 14 : 4.25\n",
" 17 : 4.16\n",
" 18 : 4.09\n",
" 22 : 4.07\n",
" 19 : 3.85\n",
" 07 : 3.80\n",
" 02 : 3.44\n"
]
}
],
"source": [
"'''\n",
"Ask HN\n",
"'''\n",
"ask_swap_averages_by_hour = [] # List to hold swapped list items\n",
"\n",
"for row in ask_averages_by_hour:  # Swap the columns so the list can be sorted by the averages (at index 0)\n",
" ask_swap_averages_by_hour.append([row[1], row[0]])\n",
" \n",
"ask_sorted_swap = sorted(ask_swap_averages_by_hour, reverse=True) # Sort in DESCENDING order\n",
"\n",
"\n",
"\n",
"text_template = '''\n",
" --------------------------------\n",
"| Best PST time for Ask HN Posts |\n",
" --------------------------------\n",
"'''\n",
"print(text_template)\n",
"\n",
"top5 = 1\n",
"for average_hour in ask_sorted_swap:\n",
" if top5 <=5:\n",
" print(f'Top {top5} : {average_hour[1]} : {average_hour[0]:.2f}')\n",
" top5 += 1\n",
" else:\n",
" print(f' {average_hour[1]} : {average_hour[0]:.2f}')\n",
"\n",
" \n",
" \n",
"'''\n",
"Show HN\n",
"'''\n",
"show_swap_averages_by_hour = [] # List to hold swapped list items\n",
"\n",
"for row in show_averages_by_hour:  # Swap the columns so the list can be sorted by the averages (at index 0)\n",
" show_swap_averages_by_hour.append([row[1], row[0]])\n",
" \n",
"show_sorted_swap = sorted(show_swap_averages_by_hour, reverse=True) # Sort in DESCENDING order\n",
"\n",
"\n",
"\n",
"text_template = '''\n",
" ---------------------------------\n",
"| Best PST time for Show HN Posts |\n",
" ---------------------------------\n",
"'''\n",
"print(text_template)\n",
"\n",
"top5 = 1\n",
"for average_hour in show_sorted_swap:\n",
" if top5 <=5:\n",
" print(f'Top {top5} : {average_hour[1]} : {average_hour[0]:.2f}')\n",
" top5 += 1\n",
" else:\n",
" print(f' {average_hour[1]} : {average_hour[0]:.2f}')\n"
]
},
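{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"As a side note, the swap-then-sort step above can also be written with `sorted()`'s `key` argument, which orders the `[hour, average]` pairs by the average directly, so no column swap is needed. A minimal sketch using three of the ___ask___ averages from the output above :\n",
"\n",
"```python\n",
"# [hour, average] pairs (three of the ask HN averages printed above)\n",
"averages_by_hour = [['09', 12.38], ['12', 28.68], ['10', 16.32]]\n",
"\n",
"# key= picks the field to sort on; reverse=True gives descending order\n",
"ranked = sorted(averages_by_hour, key=lambda pair: pair[1], reverse=True)\n",
"\n",
"for rank, (hour, average) in enumerate(ranked, start=1):\n",
"    print(f'Top {rank} : {hour} : {average:.2f}')\n",
"```"
]
},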
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-family:candara, roboto, helvetica, courier, times;\">\n",
"\n",
"Top 5 PST times favoring the submissions :\n",
"\n",
" \n",
"| RANK | ___ask HN___<br>at Hrs | ___show HN___<br>at Hrs |\n",
"|:----:|:----:|:----:|\n",
"|1|__12__|__09__|\n",
"|2|__10__|__04__|\n",
"|3|__09__|__08__|\n",
"|4|__23__|__05__|\n",
"|5|__07__|__11__|\n",
"\n",
"\n",
"__Favorable times observation__ :<br>\n",
"* With ___ask___ submissions, the gap between the top two averages (at ___12___ and ___10___ Hrs) is wide, which makes ___12___ Hrs the clear favorite.<br>\n",
"Looking at the list, while there is activity throughout the day, <span style=\"color:tomato\">the majority of the activity happens from the break of dawn (around <b>07</b> Hrs) through noon (<b>09</b> Hrs, <b>10</b> Hrs, peaking at <b>12</b> Hrs). And before the day ends, there is quite a sizable participation from the community (<b>23</b> Hrs)</span>.<br>\n",
"Most of the time, it's before noon.\n",
"<br><br>\n",
"* With ___show___ submissions, as we saw with the distribution of comments, activity is spread more evenly throughout the day.<br>\n",
"<span style=\"color:tomato\">With nearly twice the activity at <b>09</b> Hrs as at the lowest point in the day (<b>02</b> Hrs), peak times for this type of submission are around <b>04</b> & <b>05</b> Hrs (before daybreak), <b>08</b> & <b>09</b> Hrs (when the work day begins), and just before noon at <b>11</b> Hrs</span>.<br>\n",
"So, it's all before noon.\n",
"\n",
"----\n",
"<br>\n",
"\n",
"# CONCLUSION #\n",
"\n",
"Based on my analysis of the 2016 dataset (January to September), I conclude :\n",
"1. ___ask___ submissions > ___show___ submissions (general tendency)\n",
"2. Distribution of comments : ___show___ > ___ask___ (more comments are exchanged for ___show___)\n",
"3. Favorable times for both ___ask___ & ___show___ seem to be in the morning, up to noon (in the PST time zone) :<br>\n",
"    a. ___ask___ submissions span the early part of the day (from daybreak up to noon) and a stretch closer to midnight.<br>\n",
"    b. ___show___ submissions span a few hours around daybreak and the start of the work day, up to just before noon.<br>\n",
"\n",
"<br>\n",
"\n",
"### _IMPORTANT Notes & Pointers before you leave this page_ ###\n",
"* Could the geolocation of the users have had an impact on the favorable times (due to time zones) ? If location were part of the dataset, I could assess its contribution. As more and more developers showcase in ___Show HN___ from different time zones, favorable times will vary.<br>\n",
"* The above analysis covers only January to September of 2016, so this is only a small picture.\n",
"Will my assessment hold true for other years ? To find out, analyzing a dataset spanning a few more years is recommended for the bigger picture.\n",
"* The above analyses can be extended to find the day of the week with peak activity.\n",
"<br>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@veena-LINE
Author

NBViewer provides a better rendering.
Give this "https://nbviewer.jupyter.org/gist/aneev-eveejnas/a93e5d149f4a61887cf978cc0c7744db" a try !
