Cleaning the tweets
Posted on August 15, 2016 in misc
The noise in the data:¶
Once you have obtained the data you want, the next major step is to pre-process it. Most of the textual data that we get from social media has a lot of noise, so we cannot use it straight away. We need to remove this noise before we try to get any meaningful insights from the data and do some magic with it. The noise can come from colloquial language, encoding issues, and so on.
The day-to-day language that we use on social media does not exactly go by the book. It often ignores grammar, and words are spelt differently from their dictionary forms. Humans can understand such text easily, but it is not so easy to make a computer understand it. This is because computers are built to process formal, logical languages with strict rules, and colloquial, spoken language does not follow such rules. If you plan to get insights from this kind of data with machines, you have to give it structure first.
How do you approach the problem:¶
The number of steps involved in data cleaning depends on the kind of data you have at hand, the domain you obtained it from and, most importantly, what you plan to do with it. It's a decision-making process in which you choose what to include in and what to exclude from your data. Sometimes even the parts of the data that we consider noise may contain something insightful or meaningful that goes unnoticed. The decisions you make while preprocessing the data affect the kind of insights that you will be able to derive from the dataset.
Tweet¶
In the last chapter we learned how to extract data from Twitter. Now it's time to play with this data. Twitter data, i.e. the tweets, has a lot of noise. People use colloquial language heavily on Twitter. Tweets also contain parts that belong to the Twitter vocabulary but that we may not want, such as usernames and retweet markers.
Consider the following tweet:
"@user_@34 Life is great &amp; I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com"
HTML parsing:¶
Sometimes tweets contain HTML entities such as &lt;, &gt; and &amp;. To convert these entities into readable characters we can use an HTML parser:
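# html.unescape exists in Python 3.4+; fall back to HTMLParser on Python 2
try:
    from html import unescape
except ImportError:
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape
tweet = "@user_@34 Life is great &amp; I like it sooooooooo much. It's whatis life. #life #great#like http://lifeisgreat.com ."
parsedTweet = unescape(tweet)  # "&amp;" becomes "&"
print(parsedTweet)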
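To see what unescaping does on its own, here is a minimal sketch. It assumes Python 3, where the standard library exposes `html.unescape`, and the sample strings are made up for illustration.

```python
import html

# Hypothetical sample strings containing common HTML entities.
samples = [
    "Life is great &amp; I like it",
    "3 &lt; 5 &gt; 2",
    "She said &quot;hi&quot;",
]

# html.unescape replaces each entity with the character it stands for.
unescaped = [html.unescape(s) for s in samples]

for before, after in zip(samples, unescaped):
    print(before, "->", after)
```

Text that contains no entities passes through unchanged, so it should be safe to run this step on every tweet.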
URLs:¶
URLs can be removed using regular expressions:
import re
# Raw string so that \S reaches the regex engine unchanged
url_pattern = re.compile(r"http\S+")
tweet_v1 = re.sub(url_pattern, "", parsedTweet)  # version 1 of the tweet
print(tweet_v1)
Here re.sub finds everything that matches the URL pattern ("http\S+", i.e. "http" followed by any run of non-whitespace characters) and replaces it with the given string, in this case the empty string "".
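The same idea can be wrapped in a small reusable function. This is only a sketch for Python 3: the name `remove_urls` is hypothetical, and collapsing the leftover whitespace is an extra step that the text above does not perform.

```python
import re

# Matches "http" followed by any run of non-whitespace characters,
# which covers both http:// and https:// links.
URL_PATTERN = re.compile(r"http\S+")

def remove_urls(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls("Life is great http://lifeisgreat.com #life"))
```

Keep in mind that the pattern is deliberately loose: it also strips any ordinary word that happens to start with "http", and it misses links written without the scheme, such as "lifeisgreat.com".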
Usernames¶
Tweets often contain usernames, which I feel are not required for our purposes. Like I said earlier, this entire process is a sequence of decision-making steps in which you choose what to include and what to leave out.
We can remove the usernames using regular expressions as shown below:
# "@" followed by any run of non-whitespace characters
username_pattern = re.compile(r"@\S+")
tweet_v2 = re.sub(username_pattern, "", tweet_v1)  # version 2 of the tweet
print(tweet_v2)
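The "@\S+" pattern removes everything up to the next space, including punctuation attached to a username. As a possible alternative, here is a sketch for Python 3 that uses a narrower pattern. It assumes a username consists only of letters, digits and underscores, and the name `remove_mentions` is hypothetical.

```python
import re

# Assumption: a username is "@" followed by letters, digits or underscores.
MENTION_PATTERN = re.compile(r"@\w+")

def remove_mentions(text):
    """Remove @usernames and tidy up the surrounding whitespace."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_mentions).strip()

print(remove_mentions("@user_@34 Life is great"))
```

With this pattern, punctuation that follows a username stays in the text. Whether that is desirable is again one of the decisions you have to make for your own data.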
Word Formatting:¶
Sometimes words on social media are spelt differently, though they still have the same meaning as their original form. For example, in our sentence "I like it sooooooooo much" the original form of the word sooooooooo is "so".
How do we format such words? The easy way to solve this is to use regular expressions. The code snippet below shows how to do this:
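# \b word boundaries keep the pattern from touching words such as "some" or "also"
word_pattern = re.compile(r"\bso+\b")
tweet_v3 = re.sub(word_pattern, "so", tweet_v2)  # version 3 of the tweet
print(tweet_v3)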
Similarly, for other words we can create a list of regular expression patterns and apply it to the tweet (you can try this). The downside is that creating such a list is a tedious process. Day-to-day colloquial language gives birth to new word forms every day, and it is hard to keep track of them and maintain the vocabulary. But we have to do whatever we can. Of course there are other ways, but for now we climb one step at a time.
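A list of patterns like the one described above could be organised as follows. This is a minimal sketch for Python 3: `WORD_RULES`, `normalize_words` and the individual rules are hypothetical examples, not a complete vocabulary.

```python
import re

# Hypothetical rule list: each entry is (compiled pattern, replacement).
# The entries are examples only; extend the list for your own data.
WORD_RULES = [
    (re.compile(r"\bso+\b", re.IGNORECASE), "so"),
    (re.compile(r"\blu+v+\b", re.IGNORECASE), "love"),
    (re.compile(r"\bpl[sz]+\b", re.IGNORECASE), "please"),
]

def normalize_words(text):
    """Apply every rule in order and return the rewritten text."""
    for pattern, replacement in WORD_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize_words("I luv it sooooooooo much, plz stay"))
```

Keeping the rules in one list means that adding a new word form is a one-line change. Note that the replacements are always lowercase, so a capitalised "Sooo" becomes "so".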
Things You Should Try:¶
Try cleaning the tweets that you extracted in the previous chapter. Apply the rules above and, in addition, the rules below:
Remove punctuation. Punctuation sometimes doesn't carry any weight, so you can remove it. Try writing a regular expression to remove commas from sentences. Don't remove question marks "?" or exclamation marks "!", as they affect the meaning of a sentence.
Remove apostrophes and expand the words. For example, in the sentence "It's a great time to code!" the first word, It's, can be expanded to It is. You can do this with regular expressions.
Create a list of word patterns for word formatting. For example, gud should be substituted with good.
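As a starting point for these exercises, the steps from this chapter can be chained into a single function. This is a sketch for Python 3, not a finished cleaner: the name `clean_tweet` is hypothetical, and the final whitespace clean-up is an addition that the chapter itself does not cover.

```python
import html
import re

URL_PATTERN = re.compile(r"http\S+")
USERNAME_PATTERN = re.compile(r"@\S+")
SO_PATTERN = re.compile(r"\bso+\b")

def clean_tweet(tweet):
    """Run the cleaning steps from this chapter in order."""
    text = html.unescape(tweet)             # 1. decode HTML entities
    text = URL_PATTERN.sub("", text)        # 2. drop URLs
    text = USERNAME_PATTERN.sub("", text)   # 3. drop usernames
    text = SO_PATTERN.sub("so", text)       # 4. normalize "sooo" to "so"
    # Extra step (an addition, not from the chapter): tidy whitespace.
    return re.sub(r"\s+", " ", text).strip()

raw = ("@user_@34 Life is great &amp; I like it sooooooooo much. "
       "It's whatis life. #life #great#like http://lifeisgreat.com")
print(clean_tweet(raw))
```

You can add your own rules for punctuation, apostrophes and word patterns as further steps inside `clean_tweet`.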