dfolch/twitter_geo_search.ipynb

## twitter_geo_search.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Testing Twitter's Spatial Search\n",
    "David C. Folch\n",
    "\n",
    "October 19, 2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Context\n",
    "Twitter has a free API that allows for searching tweets. One of the search parameters is `geocode`, which takes a latitude, longitude and radius and ostensibly returns Tweets within the radius of the location. In around November 2014, Twitter made two changes: 1) how they collect spatial data from users and 2) how their entire API query works. Together these changes make spatial search somewhat frustrating. \n",
    "\n",
    "My understanding is that Twitter use three levels of spatial data when locating a Tweet:\n",
    "- Exact latitude/longitude of the Tweet\n",
    "- A \"place\" location of the Tweet; the Tweet falls within some bounding box that Twitter will provide\n",
    "- The location associated with the user's profile; they might have Chicago, IL associated with their profile\n",
    "\n",
    "The Twitter search API uses (at a minimum) all three of these criteria when determining if a Tweet meets the criteria passed via the `geocode` parameter. \n",
    "\n",
    "The third criteria seems to be causing the most consternation. The user's account location might not be related to the location of the Tweet. For example, when the user is away from their regular location (e.g., on vacation); they have forgotten to update their profile after moving; etc.\n",
    "\n",
    "#### The Problem \n",
    "A search using the `geocode` parameter tends to not return very many geolocated Tweets. Why is that?\n",
    "\n",
    "#### Some Discussions on the Topic\n",
    "- [This (very long) thread](https://twittercommunity.com/t/search-api-returning-very-sparse-geocode-results/27998) in the Twitter developer community provides some information on the history.\n",
    "- [This link](https://twittercommunity.com/search?q=geocode) will get you a bunch more posts on the topic:\n",
    "\n",
    "#### My Explanation\n",
    "I have not found a clear explanation of the _cause_ of the \"problem\", so here is my __hypothesis__ (just speculation):\n",
    "\n",
    "- Fact: Twitter search returns a maximum of 100 Tweets per query. \n",
    "- Twitter has some algorithm to prioritize the Tweets it returns since most queries would result in more than 100 Tweets. This algorithm could be a random draw or chronological, but is probably more sophisticated.\n",
    "- Empirical finding: very small search radii (say 1 mile) tend to return a high percentage of tweets with lat/lon.  \n",
    "- Empirical finding: as the search radius increases, the percentage of tweets with lat/lon decreases dramatically.\n",
    "- Empirical finding: as the percentage of tweets with lat/lon decreases, the percentage of Tweets with \"place\" information initially increases, but then decreases.\n",
    "- Hypothesis: the number of people putting lat/lon on their Tweets is _less than_ the number of people putting places on their Tweets _which is less than_ the number of people putting a location in their profile.\n",
    "- Hypothesis: the algorithm for picking the _universe_ of Tweets that meet the query includes spatial attributes, but the algorithm for winnowing that universe down to the subset to be returned does not include spatial data.\n",
    "- Hypothesis: the algorithm that chooses the universe of Tweets is based on centroids of _places_. This means that all the people or Tweets tagged only as \"Chicago, IL\" will be located at the exact same lat/lon (i.e., the \"center\" of Chicago). There are many _places_, so many of these concentration points.\n",
    "- A smaller radius might not capture many centroids. Therefore, the _universe_ for a smaller radius would be based mostly on Tweets with a lat/lon. This explains very small radii have large numbers of lat/lon Tweets.\n",
    "- Hypothesis: Tweets tend to use a much larger set of \"places\" than users do in their profiles. Users are more likely to put their location as the broad label \"Chicago, IL\", but Tweets might have the more refined location \"downtown Chicago\".\n",
    "- A small to medium size radius would encompass more centroids. If the number of people using places is much greater than the number using lat/lon, then the place Tweets would quickly swamp the lat/lon Tweets. If the radius doesn't hit one of the big centroids tending to be associated with many user profiles, then the universe would be dominated by Tweets with place data. This explains the finding that place based Tweets are initially low for very small radii (where lat/lon dominates), grow with the radius increase (place dominated), but then fall off (profile location dominated).\n",
    "- A medium to large radius would have a universe dominated by a large number of users who only provide spatial information through their profile location.\n",
    "\n",
    "One empirical finding that I cannot explain is why the _total_ number Tweets returned sometimes decreases when the radius increases.\n",
    "\n",
    "#### Testing with Python\n",
    "\n",
    "Below is some code that explores these ideas.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import tweepy\n",
    "\n",
    "consumer_key = 'replace with your info'\n",
    "consumer_secret = 'replace with your info'\n",
    "access_token = 'replace with your info'\n",
    "access_token_secret = 'replace with your info'\n",
    "\n",
    "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n",
    "auth.set_access_token(access_token, access_token_secret)\n",
    "api = tweepy.API(auth)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_geotweets(dist, lat=40.758899, lon=-73.9873197):\n",
    "    # default is Times Square in New York City\n",
    "    results = api.search(geocode=str(lat)+\",\"+str(lon)+\",\"+str(dist)+\"mi\", count=100)\n",
    "    geo_results = []\n",
    "    lat_lon = []\n",
    "    place = []\n",
    "    author_loc = []\n",
    "    no_geo = []\n",
    "    for i in results:\n",
    "        if i.coordinates:\n",
    "            #print \"coordinates:\", i.coordinates['coordinates']\n",
    "            geo_results.append(i)\n",
    "            lat_lon.append(i)\n",
    "        elif i.geo:\n",
    "            #print \"geo:\", i.geo\n",
    "            geo_results.append(i)\n",
    "            lat_lon.append(i)\n",
    "        elif i.place:\n",
    "            #print \"place:\", i.place.name\n",
    "            geo_results.append(i)\n",
    "            place.append(i)\n",
    "        elif i.author.location:\n",
    "            #print \"author location:\", i.author.location\n",
    "            author_loc.append(i)\n",
    "        else:\n",
    "            #print \">>>> no-geo info\"\n",
    "            no_geo.append(i)\n",
    "    print \"\\n\"+str(dist)+\" miles; total tweets:\", len(results)\n",
    "    print \"number of tweets with at least lat/lon:\", len(lat_lon)\n",
    "    print \"number of tweets with at least place:\", len(place)\n",
    "    print \"number of tweets with at least author location:\", len(author_loc)\n",
    "    print \"number of tweets with no geo information:\", len(no_geo)\n",
    "    return geo_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "1 miles; total tweets: 100\n",
      "number of tweets with at least lat/lon: 33\n",
      "number of tweets with at least place: 5\n",
      "number of tweets with at least author location: 55\n",
      "number of tweets with no geo information: 7\n",
      "\n",
      "2 miles; total tweets: 37\n",
      "number of tweets with at least lat/lon: 6\n",
      "number of tweets with at least place: 27\n",
      "number of tweets with at least author location: 4\n",
      "number of tweets with no geo information: 0\n",
      "\n",
      "5 miles; total tweets: 100\n",
      "number of tweets with at least lat/lon: 0\n",
      "number of tweets with at least place: 2\n",
      "number of tweets with at least author location: 78\n",
      "number of tweets with no geo information: 20\n",
      "\n",
      "10 miles; total tweets: 100\n",
      "number of tweets with at least lat/lon: 2\n",
      "number of tweets with at least place: 1\n",
      "number of tweets with at least author location: 79\n",
      "number of tweets with no geo information: 18\n"
     ]
    }
   ],
   "source": [
    "# Times Square (very dense place)\n",
    "geo_tweets = get_geotweets(1)\n",
    "geo_tweets = get_geotweets(2)\n",
    "geo_tweets = get_geotweets(5)\n",
    "geo_tweets = get_geotweets(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "1 miles; total tweets: 97\n",
      "number of tweets with at least lat/lon: 86\n",
      "number of tweets with at least place: 10\n",
      "number of tweets with at least author location: 1\n",
      "number of tweets with no geo information: 0\n",
      "\n",
      "2 miles; total tweets: 62\n",
      "number of tweets with at least lat/lon: 50\n",
      "number of tweets with at least place: 1\n",
      "number of tweets with at least author location: 11\n",
      "number of tweets with no geo information: 0\n",
      "\n",
      "5 miles; total tweets: 100\n",
      "number of tweets with at least lat/lon: 0\n",
      "number of tweets with at least place: 17\n",
      "number of tweets with at least author location: 80\n",
      "number of tweets with no geo information: 3\n",
      "\n",
      "10 miles; total tweets: 100\n",
      "number of tweets with at least lat/lon: 0\n",
      "number of tweets with at least place: 17\n",
      "number of tweets with at least author location: 80\n",
      "number of tweets with no geo information: 3\n"
     ]
    }
   ],
   "source": [
    "# Florida State University (mid-size city)\n",
    "geo_tweets = get_geotweets(1, 30.4418778, -84.3006776)\n",
    "geo_tweets = get_geotweets(2, 30.4418778, -84.3006776)\n",
    "geo_tweets = get_geotweets(5, 30.4418778, -84.3006776)\n",
    "geo_tweets = get_geotweets(10, 30.4418778, -84.3006776)"
   ]
  },
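  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw counts in the two runs above are easier to compare as shares of the total returned. The sketch below simply re-keys the printed numbers by hand; the `shares` helper and the variable names are my own for this illustration and are not part of `get_geotweets`.\n",
    "\n",
    "```python\n",
    "from __future__ import division, print_function\n",
    "\n",
    "# (radius in miles, total, lat/lon, place, author location, no geo), copied from the output above\n",
    "times_square = [(1, 100, 33, 5, 55, 7), (2, 37, 6, 27, 4, 0),\n",
    "                (5, 100, 0, 2, 78, 20), (10, 100, 2, 1, 79, 18)]\n",
    "fsu = [(1, 97, 86, 10, 1, 0), (2, 62, 50, 1, 11, 0),\n",
    "       (5, 100, 0, 17, 80, 3), (10, 100, 0, 17, 80, 3)]\n",
    "\n",
    "def shares(rows):\n",
    "    # convert each row of counts into percentages of the total returned\n",
    "    out = []\n",
    "    for row in rows:\n",
    "        radius, total = row[0], row[1]\n",
    "        out.append((radius,) + tuple(round(100 * c / total, 1) for c in row[2:]))\n",
    "    return out\n",
    "\n",
    "for label, rows in (('Times Square', times_square), ('Florida State University', fsu)):\n",
    "    print(label)\n",
    "    for radius, latlon, place, author, no_geo in shares(rows):\n",
    "        print('  {:>2} mi: lat/lon {:5.1f}%  place {:5.1f}%  profile {:5.1f}%  none {:5.1f}%'.format(\n",
    "            radius, latlon, place, author, no_geo))\n",
    "```\n",
    "\n",
    "In the Times Square run the place share goes from 5% at 1 mile to about 73% at 2 miles and back to 2% at 5 miles, which is the rise and fall described in the explanation. In the Florida State University run the lat/lon share falls from about 89% at 1 mile to 0% at 5 miles, but the place share does not show the same clean pattern."
   ]
  },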
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}