Grabbing Twitter Data

Posted on August 15, 2016 in misc

Introduction:

Twitter is one of the ten most visited websites in the world. It is a microblogging platform that lets you share messages of no more than 140 characters and discover messages on the topics you are interested in. Because it is so widely used, a lot of data flows through its network, and that data can be useful for various research purposes. Fortunately, Twitter shares this data with the public, with some limitations, through its API (Application Programming Interface). In this tutorial we will explore the Twitter API, what it has to offer, and how we can use it to obtain the data we want.

Understanding the Limitations:

Twitter offers several ways to access its data through its API, and each has its pros and cons. Some of the limitations you will come across are:

  1. The amount of data you can get
  2. The number of requests you can make (API rate limits)
  3. Access to historical data, i.e. tweets from the past

We need to understand thoroughly both the limitations of the API and what it has to offer before we go ahead and build applications.

Twitter Access keys:

In order to build programs that use the Twitter API we need Twitter access keys (also known as OAuth access tokens). The access keys can be generated on the Twitter developer portal. Follow these steps to generate them:

  1. First, you must have a Twitter account. If you don't have one, please create one. The account should have a mobile number associated with it.

  2. Once you have your Twitter account, log into the Twitter developer site http://apps.twitter.com/ using the same credentials as your Twitter account.


  3. Create a new app by clicking the "Create New App" button.

  4. Fill in the necessary details on the form page. If all the fields are filled in properly, a new page will load.



  5. Now open the “Keys and Access Tokens” tab. Scroll down and click “Create my access token”. This will generate the access token and access token secret. Copy your “Access token” and “Access token secret”, along with the API key and API secret shown at the top of the same page.

  6. Congratulations! Now that we have the access tokens, we can start using the API.

Twitter libraries:

Since we will be using Python throughout this tutorial, we will discuss packages for Python. If you prefer another language, feel free to explore the alternatives: the Twitter Libraries page lists libraries for accessing the Twitter API in various programming languages. Libraries make life easy when it comes to programming. For this tutorial we will use the python-twitter library.


Installing:

$ pip install python-twitter

You will get the following message at the end if it worked: Successfully installed future-0.15.2 python-twitter-3.1

Usage:

To start using the python-twitter library you need the access keys that you generated earlier:

  1. API key
  2. API secret
  3. Access token
  4. Access token secret

Try the following in the Python interpreter or IPython:

In [1]:
# Real keys are not displayed here because they must not be shared.
# Replace the values inside the quotes with your own keys.
import twitter
api = twitter.Api(consumer_key='consumer_key',
                  consumer_secret='consumer_secret',
                  access_token_key='access_token',
                  access_token_secret='access_token_secret')
In [36]:
# Check that the access keys were entered correctly:
print(api.VerifyCredentials())
{"created_at": "Wed Sep 30 01:36:35 +0000 2009", "favourites_count": 1, "followers_count": 54, "friends_count": 89, "id": 78477561, "lang": "en", "listed_count": 1, "location": "India", "name": "Surya Teja", "profile_background_color": "C0DEED", "profile_background_image_url": "http://pbs.twimg.com/profile_background_images/198736057/Game_Scenes__14_.jpg", "profile_background_tile": true, "profile_image_url": "http://pbs.twimg.com/profile_images/700970727084666880/wSrJ6NCz_normal.jpg", "profile_link_color": "0084B4", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "screen_name": "SuryaTeja1991", "status": {"created_at": "Mon Jul 14 17:07:38 +0000 2014", "favorite_count": 1, "hashtags": [{"text": "Hitwicket"}], "id": 488731558652956672, "id_str": "488731558652956672", "lang": "en", "source": "Hitwicket", "text": "I'm now a Magnificent manager in Hitwicket! super\n http://t.co/BxUMEsymBe #Hitwicket via @HitwicketGame", "urls": [{"expanded_url": "http://hitwicket.com/team/show/696?utm_source=twitterFeed&utm_medium=userFeedback&utm_campaign=team%3A696", "url": "http://t.co/BxUMEsymBe"}], "user_mentions": [{"id": 580912774, "name": "Hitwicket", "screen_name": "HitwicketGame"}]}, "statuses_count": 106, "time_zone": "New Delhi", "utc_offset": 19800}

The response is a JSON object summarizing the details of your account, something like {"id": 16133, "location": "Philadelphia", "name": "bear"}.
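Since the keys must not be shared, one option is to keep them out of the code altogether and read them from environment variables. The following is only a sketch: the variable names (TWITTER_API_KEY and so on) and the helper load_keys are made up for illustration and are not part of python-twitter.

```python
import os

# Hypothetical variable names; pick whatever naming suits your setup.
KEY_NAMES = {
    "consumer_key": "TWITTER_API_KEY",
    "consumer_secret": "TWITTER_API_SECRET",
    "access_token_key": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_keys(environ=os.environ):
    """Return the four keys as keyword arguments for twitter.Api.

    Raises KeyError naming every missing variable, so a typo is caught
    before any request is sent.
    """
    missing = [var for var in KEY_NAMES.values() if not environ.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(sorted(missing)))
    return {arg: environ[var] for arg, var in KEY_NAMES.items()}


# Usage, assuming python-twitter is installed and the variables are set:
#   import twitter
#   api = twitter.Api(**load_keys())
```

With this in place the notebook can be published without any secrets in it.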

Twitter Streaming API

The Twitter Streaming API gives you access to the live stream of messages, i.e. the tweets that are currently being pushed to Twitter. You may not be able to access the whole stream, but you will get up to 1 percent of the global Twitter stream, and that is a substantial amount of data. In contrast to the REST API offered by Twitter, the Streaming API has no rate limits on requests. The only limit is the cap on the number of messages it delivers.

When the cap is hit, the Streaming API reports the number of messages it could not deliver. These reports are called limit notices.
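To make limit notices concrete, here is a small sketch that separates them from ordinary tweets. It assumes a limit notice arrives as a message of the form {"limit": {"track": N}}, where N is the number of undelivered tweets since the connection was opened; check Twitter's streaming documentation for the exact shape. The function name and the sample messages are made up for illustration.

```python
def split_limit_notices(messages):
    """Separate tweets from limit notices in an iterable of stream messages.

    Returns (tweets, undelivered), where undelivered is the highest
    'track' count seen, or 0 when no limit notice arrived.
    """
    tweets = []
    undelivered = 0
    for message in messages:
        limit = message.get("limit")
        if isinstance(limit, dict) and "text" not in message:
            # Assumed shape of a limit notice: {"limit": {"track": N}}
            undelivered = max(undelivered, limit.get("track", 0))
        else:
            tweets.append(message)
    return tweets, undelivered


# A made-up sample of what a stream might deliver.
sample = [
    {"id": 1, "text": "life is good"},
    {"limit": {"track": 12}},
    {"id": 2, "text": "that's life"},
    {"limit": {"track": 30}},
]
tweets, undelivered = split_limit_notices(sample)
print(len(tweets), undelivered)
```

Keeping track of the undelivered count gives a rough idea of how much of the matching stream your collection is missing.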

Usage:

The Twitter API lets us specify parameters for filtering the data. The parameters can be keywords that we are looking for in tweets, geolocations, usernames, or user IDs.

Parameters :

These are the various parameters that you can specify to get the data you want from Twitter:

  1. Keywords
  2. Geolocation
  3. Usernames
  4. User IDs (the unique IDs assigned by Twitter to each user)

For example, let's say we want to grab all tweets that contain the word "life". We pass this keyword as a parameter when querying the Twitter Streaming API, and the Streaming API returns the tweets containing the word "life" from the messages currently being pushed.

With the python-twitter library, this is how we do it:

In [2]:
# track takes a list of keywords to filter on
# break exits the loop after printing a single tweet
for tweet in api.GetStreamFilter(track=['life']):
    print(tweet)
    break
{u'contributors': None, u'truncated': False, u'text': u'RT @DailyMendesLife: Pls dm us your rants and confessions about Shawn and the Mendes Army so I can start posting!  https://t.co/Usor286c2y', u'is_quote_status': True, u'in_reply_to_status_id': None, u'id': 760565175132958721, u'favorite_count': 0, u'entities': {u'user_mentions': [{u'id': 2820952439, u'indices': [3, 19], u'id_str': u'2820952439', u'screen_name': u'DailyMendesLife', u'name': u'shawn mendes'}], u'symbols': [], u'hashtags': [], u'urls': [{u'url': u'https://t.co/Usor286c2y', u'indices': [115, 138], u'expanded_url': u'https://twitter.com/dailymendeslife/status/760564739600748549', u'display_url': u'twitter.com/dailymendeslif\u2026'}]}, u'quoted_status_id': 760564739600748549, u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1470167843026', u'quoted_status': {u'contributors': None, u'truncated': False, u'text': u"Guys we made another Insta it's @/rantingshawn where you guys can dm us your Shawn rants and confessions and we will post it anonymously!", u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 760564739600748549, u'favorite_count': 21, u'source': u'Twitter for iPhone', u'retweeted': False, u'coordinates': None, u'entities': {u'user_mentions': [], u'symbols': [], u'hashtags': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'760564739600748549', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 2820952439, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 16466, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'2820952439', u'profile_background_color': u'C0DEED', u'listed_count': 33, u'profile_background_image_url_https': 
u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -25200, u'statuses_count': 7228, u'description': u'Account owned by @MendesFeatures & @seIuminated', u'friends_count': 776, u'location': u'Instagram: @GoalWithMendes', u'profile_link_color': u'0084B4', u'profile_image_url': u'http://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2820952439/1469148330', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'shawn mendes', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 32917, u'screen_name': u'DailyMendesLife', u'notifications': None, u'url': u'https://twitter.com/goalwithmendes/status/721478509927858176', u'created_at': u'Fri Oct 10 01:03:54 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Pacific Time (US & Canada)', u'protected': False, u'default_profile': True, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'created_at': u'Tue Aug 02 19:55:39 +0000 2016', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None}, u'source': u'Twitter for Android', u'in_reply_to_screen_name': None, u'id_str': u'760565175132958721', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Pls dm us your rants and confessions about Shawn and the Mendes Army so I can start posting!  
https://t.co/Usor286c2y', u'is_quote_status': True, u'in_reply_to_status_id': None, u'id': 760565123937366016, u'favorite_count': 7, u'entities': {u'user_mentions': [], u'symbols': [], u'hashtags': [], u'urls': [{u'url': u'https://t.co/Usor286c2y', u'indices': [94, 117], u'expanded_url': u'https://twitter.com/dailymendeslife/status/760564739600748549', u'display_url': u'twitter.com/dailymendeslif\u2026'}]}, u'quoted_status_id': 760564739600748549, u'retweeted': False, u'coordinates': None, u'quoted_status': {u'contributors': None, u'truncated': False, u'text': u"Guys we made another Insta it's @/rantingshawn where you guys can dm us your Shawn rants and confessions and we will post it anonymously!", u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 760564739600748549, u'favorite_count': 21, u'source': u'Twitter for iPhone', u'retweeted': False, u'coordinates': None, u'entities': {u'user_mentions': [], u'symbols': [], u'hashtags': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'760564739600748549', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 2820952439, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 16466, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'2820952439', u'profile_background_color': u'C0DEED', u'listed_count': 33, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -25200, u'statuses_count': 7228, u'description': u'Account owned by @MendesFeatures & @seIuminated', u'friends_count': 776, u'location': u'Instagram: @GoalWithMendes', u'profile_link_color': u'0084B4', u'profile_image_url': 
u'http://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2820952439/1469148330', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'shawn mendes', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 32917, u'screen_name': u'DailyMendesLife', u'notifications': None, u'url': u'https://twitter.com/goalwithmendes/status/721478509927858176', u'created_at': u'Fri Oct 10 01:03:54 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Pacific Time (US & Canada)', u'protected': False, u'default_profile': True, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'created_at': u'Tue Aug 02 19:55:39 +0000 2016', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None}, u'source': u'Twitter for iPhone', u'in_reply_to_screen_name': None, u'id_str': u'760565123937366016', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 2820952439, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'followers_count': 16466, u'profile_sidebar_border_color': u'C0DEED', u'id_str': u'2820952439', u'profile_background_color': u'C0DEED', u'listed_count': 33, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -25200, u'statuses_count': 7228, u'description': u'Account owned by @MendesFeatures & @seIuminated', u'friends_count': 776, u'location': u'Instagram: @GoalWithMendes', u'profile_link_color': u'0084B4', u'profile_image_url': 
u'http://pbs.twimg.com/profile_images/752622035126267904/G-K22LhL_normal.jpg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2820952439/1469148330', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'shawn mendes', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 32917, u'screen_name': u'DailyMendesLife', u'notifications': None, u'url': u'https://twitter.com/goalwithmendes/status/721478509927858176', u'created_at': u'Fri Oct 10 01:03:54 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Pacific Time (US & Canada)', u'protected': False, u'default_profile': True, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Tue Aug 02 19:57:10 +0000 2016', u'quoted_status_id_str': u'760564739600748549', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 357243270, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/756939005858500608/GYWbvXIM_normal.jpg', u'profile_sidebar_fill_color': u'18E9F0', u'profile_text_color': u'3C3940', u'followers_count': 3344, u'profile_sidebar_border_color': u'FFFFFF', u'id_str': u'357243270', u'profile_background_color': u'FFFFFF', u'listed_count': 15, u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/699755534853877760/KavcTPpB.jpg', u'utc_offset': -18000, u'statuses_count': 89240, u'description': u'raul mendes is my favourite person in the world', u'friends_count': 1735, u'location': None, u'profile_link_color': u'000000', u'profile_image_url': u'http://pbs.twimg.com/profile_images/756939005858500608/GYWbvXIM_normal.jpg', u'following': None, u'geo_enabled': True, u'profile_banner_url': 
u'https://pbs.twimg.com/profile_banners/357243270/1469393622', u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/699755534853877760/KavcTPpB.jpg', u'name': u'bel', u'lang': u'es', u'profile_background_tile': True, u'favourites_count': 31529, u'screen_name': u'shawnisourangel', u'notifications': None, u'url': u'http://smarturl.it/IlluminateSM', u'created_at': u'Thu Aug 18 02:12:53 +0000 2011', u'contributors_enabled': False, u'time_zone': u'Mexico City', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Tue Aug 02 19:57:23 +0000 2016', u'quoted_status_id_str': u'760564739600748549', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None}

That is what the output looks like. The data is in JSON format. Surprisingly, all that information relates to just one tweet. We will cover the JSON format and the structure of the JSON data we get from Twitter later; for now let's focus on collecting the data.

Since we are grabbing streaming data, which is a continuous process (unless we stop it), it is better to save the data to a file. Make sure the file is located on a disk with plenty of free space, because the file keeps growing for as long as the stream runs.

In [ ]:
# Save the JSON data to a file: streamingData.json
# "\n" is a newline, so each line of the file holds
# all the information related to one tweet
import json

with open('./streamingData.json', 'w') as f:
    for tweet in api.GetStreamFilter(track=['life']):
        f.write(json.dumps(tweet))
        f.write('\n')
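Because the stream never ends on its own, it can help to cap the number of tweets written so the file cannot grow without bound. The sketch below is one way to do that; save_stream is a hypothetical helper, and it works with any iterable of dictionaries, so it can be tried without a live connection.

```python
import itertools
import json


def save_stream(stream, path, max_tweets):
    """Write at most max_tweets items from stream to path, one JSON object per line.

    Returns the number of tweets written.
    """
    written = 0
    with open(path, "w") as f:
        for tweet in itertools.islice(stream, max_tweets):
            f.write(json.dumps(tweet))
            f.write("\n")
            written += 1
    return written


# Usage with the live stream, where `stream` is the iterator returned
# by api.GetStreamFilter(...):
#   save_stream(stream, './streamingData.json', 1000)
```

The with statement also makes sure the file is closed properly when the cap is reached or the loop is interrupted.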
    

Twitter Search API:

The Twitter REST API is another way to get data from Twitter. Unlike the Streaming API, the REST API has rate limits, i.e. the number of calls or requests you can make is limited to a certain number within a given window of time (most endpoints use a 15-minute window). These rate limits vary depending on the kind of query you are making. In return, with the REST API your search parameters are applied against about one week of historical data.
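To illustrate how a windowed rate limit behaves, here is a minimal sketch of a client-side counter that tells you how long to pause before the next request. It is not part of python-twitter; the class name is invented, it uses simple fixed windows, and the number of calls allowed per window is something you have to look up for the endpoint you are using.

```python
import time


class WindowLimiter:
    """Allow at most max_calls per window_seconds, using fixed windows."""

    def __init__(self, max_calls, window_seconds=15 * 60, clock=time.time):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = None
        self.calls = 0

    def seconds_to_wait(self):
        """Record one call and return how long to sleep before making it."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            # A new window begins: reset the counter.
            self.window_start = now
            self.calls = 0
        if self.calls < self.max_calls:
            self.calls += 1
            return 0.0
        # Budget exhausted: wait for the window to end, then count this
        # call as the first one of the next window.
        wait = self.window_start + self.window_seconds - now
        self.window_start += self.window_seconds
        self.calls = 1
        return wait


# Usage sketch (the limit of 180 is a placeholder, not a documented value here):
#   limiter = WindowLimiter(max_calls=180)
#   time.sleep(limiter.seconds_to_wait())
#   ...make one REST API request...
```

The clock is passed in as a parameter only so the behaviour can be checked without actually waiting fifteen minutes.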

Using Search API with python-twitter:

The search API expects a URL-encoded query string. You supply a hardcoded query string that lists the parameters. This is how you do it with python-twitter:

In [38]:
# search for the keyword "life" with q=life
# limit the number of tweets to 100 with count=100
# limit tweets to the most recent ones with result_type=recent
# The parameters "q", "result_type" and "count" are separated by '&'.
# %20 is a URL-encoded space.

results = api.GetSearch(raw_query="q=life%20&result_type=recent&count=100")
print(len(results))       # count the number of records returned by the query
print(results[0])         # print one tweet as an example
100
{"created_at": "Sun Jul 03 00:54:25 +0000 2016", "hashtags": [], "id": 749405904525754369, "id_str": "749405904525754369", "lang": "en", "retweet_count": 10, "retweeted_status": {"created_at": "Sun Jul 03 00:34:55 +0000 2016", "favorite_count": 17, "hashtags": [], "id": 749400995726106624, "id_str": "749400995726106624", "lang": "en", "retweet_count": 10, "source": "Twitter for iPhone", "text": "don't let the things of this world rob you of your spiritual life", "urls": [], "user": {"created_at": "Wed May 07 03:28:55 +0000 2014", "default_profile": true, "description": "child of God, heaven is my home", "favourites_count": 34060, "followers_count": 6599, "friends_count": 641, "id": 2481242240, "lang": "en", "listed_count": 29, "location": "pittsburgh", "name": "dante lee \u271e", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2481242240/1464542980", "profile_image_url": "http://pbs.twimg.com/profile_images/737134592503783424/7yQi_q7O_normal.jpg", "profile_link_color": "0084B4", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "screen_name": "whoknowsdante", "statuses_count": 3582, "time_zone": "Atlantic Time (Canada)", "utc_offset": -10800}, "user_mentions": []}, "source": "Twitter for iPhone", "text": "RT @whoknowsdante: don't let the things of this world rob you of your spiritual life", "urls": [], "user": {"created_at": "Tue Nov 19 01:35:17 +0000 2013", "default_profile": true, "description": "\u2022beyond blessed \u2022Joshua 21:45 \u2022CCH\u2764\ufe0f\u2022Snapchat: tatianaa_reneee", "favourites_count": 64297, "followers_count": 1514, "friends_count": 966, "geo_enabled": true, "id": 2186766317, "lang": "en", "listed_count": 5, "location": "Florida ", "name": "\u271datiana", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", 
"profile_banner_url": "https://pbs.twimg.com/profile_banners/2186766317/1465481330", "profile_image_url": "http://pbs.twimg.com/profile_images/740910215881707520/wxI7_Ylz_normal.jpg", "profile_link_color": "0084B4", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "screen_name": "Totttt_35", "statuses_count": 27255}, "user_mentions": [{"id": 2481242240, "name": "dante lee \u271e", "screen_name": "whoknowsdante"}]}

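Writing the encoded query string by hand is error-prone, so as a sketch, the standard library can build it from plain values. This example uses Python 3's urllib.parse; build_raw_query is a hypothetical helper name, and it only covers the three parameters used in the search example.

```python
from urllib.parse import quote, urlencode


def build_raw_query(keyword, count=100, result_type="recent"):
    """Build a query string for the search API from plain Python values."""
    params = [("q", keyword), ("result_type", result_type), ("count", count)]
    # quote_via=quote encodes a space as %20 rather than '+'.
    return urlencode(params, quote_via=quote)


print(build_raw_query("life"))
print(build_raw_query("life is good", count=10))
```

The resulting string can then be passed as the raw_query argument of api.GetSearch in place of the hardcoded one.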







Try and Explore:

  1. Try setting up a stream from Twitter using geolocations as the query parameter.

  2. Try the same with the search API as well.
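As a possible starting point for the first exercise, the sketch below builds a bounding box in the "longitude,latitude" form, south-west corner first and north-east corner second. It assumes that python-twitter's GetStreamFilter accepts a locations argument as a list of such strings; please confirm this against the library's documentation. The helper name and the coordinates are illustrative only.

```python
def bounding_box(west, south, east, north):
    """Return a bounding box as two 'longitude,latitude' strings:
    the south-west corner first, then the north-east corner.
    """
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitude must be between -180 and 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitude must be between -90 and 90, with south <= north")
    return ["%s,%s" % (west, south), "%s,%s" % (east, north)]


# Rough, illustrative box around New York City.
nyc = bounding_box(-74.26, 40.48, -73.70, 40.92)
print(",".join(nyc))

# Assumed usage with python-twitter:
#   for tweet in api.GetStreamFilter(locations=nyc):
#       ...
```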

References:

  1. Twitter has excellent documentation of the Twitter API.
  2. The python-twitter library documentation. You can use any other library to try this as well.

Now that we have all this JSON data from Twitter stored in a file, let's take a quick peek at what JSON is and how we can parse this data using Python.