Text Mining Tutorial

Introduction:

This tutorial on text mining uses Python for implementation, along with some Linux command line tools. You need not know either of them beforehand, as they are explained as part of the tutorial. If you already do, things will be even easier for you.

In this tutorial you will learn what text mining is about through an example built around Twitter data. Twitter is a widely used social media platform based on short text messages. We will extract data from this platform and work with it throughout the chapter.

Though we will be working with Twitter data in this tutorial, you can apply the same concepts to any other textual dataset out there.

Text mining:

At this very moment, as you read this sentence, huge amounts of textual data are being posted on the web. Some of this content consists of short textual messages, sometimes combined with other media such as images or videos; there are blogs written by people on different topics expressing their views or opinions; there are reviews and comments on movies, books, restaurants, products and so on; and there are the books themselves (either fiction or non-fiction). All this data is lying out there, sometimes easily accessible and sometimes not. Left untouched it is simply wasted, but it might give you something insightful if you start working with it.

The goal of text mining is to obtain quality information from text data. In the paragraph above I have listed some text resources present on the world wide web, but there are certain challenges that lie in the way of obtaining this information.

Text mining generally starts with information retrieval: we need to identify a source of data and then collect from it. Common sources on the web include blogs, social media platforms, reviews and comments. Once we collect the data, we need to clean the noise in it, for example by removing duplicate entries and unwanted information such as URLs and image links. There are a number of steps involved in denoising the data, and they depend on the kind of data you have at hand. Once the text is clean, we can apply natural language processing techniques such as parsing and part-of-speech (POS) tagging.
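
To give a first taste of what such cleaning can look like in Python, here is a minimal sketch that removes URLs and exact duplicate entries from a list of raw messages. The sample messages and the regular expression are illustrative assumptions, not part of any particular dataset; we will build a proper cleaning step for real tweets later in the tutorial.

    import re

    # A few made-up raw messages standing in for collected tweets (illustrative only).
    raw_messages = [
        "Loved the new phone! http://example.com/review",
        "Loved the new phone! http://example.com/review",   # duplicate entry
        "Terrible battery life :( pic.twitter.com/abc123",
    ]

    def clean_message(text):
        # Strip URLs and image links with a simple regular expression.
        text = re.sub(r"(https?://\S+|pic\.twitter\.com/\S+)", "", text)
        # Collapse the extra whitespace left behind.
        return re.sub(r"\s+", " ", text).strip()

    # Drop exact duplicates while preserving order, then clean each message.
    seen = set()
    cleaned = []
    for msg in raw_messages:
        if msg not in seen:
            seen.add(msg)
            cleaned.append(clean_message(msg))

    print(cleaned)
    # ['Loved the new phone!', 'Terrible battery life :(']

Real data will need more (and different) denoising steps than this, but the shape is the same: decide what counts as noise, then strip it out programmatically.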

The whole idea is to convert something largely unstructured into something meaningful and structured. Once we have such structured output, we can perform various tasks such as:

  • Sentiment analysis
  • Topic detection
  • Document summarization
  • Entity relationship modelling
  • Pattern recognition
  • Predictive analytics
  • Text categorization, etc.

In this tutorial we will cover some primary concepts in sentiment analysis. The tasks mentioned above are extremely useful for gaining insights into textual data.
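
To give a flavour of the sentiment analysis we build up to later, here is a toy lexicon-based sketch. The word lists and example sentences are assumptions made purely for illustration; the sentiment analysis chapter takes a more complete approach.

    # Toy lexicons standing in for a real sentiment dictionary (illustrative only).
    positive_words = {"love", "great", "awesome", "good", "happy"}
    negative_words = {"hate", "terrible", "awful", "bad", "sad"}

    def toy_sentiment(text):
        # Lowercase and split on whitespace; a real pipeline would tokenize properly.
        words = text.lower().split()
        score = sum(w in positive_words for w in words) - sum(w in negative_words for w in words)
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    print(toy_sentiment("I love this awesome phone"))   # positive
    print(toy_sentiment("The battery is terrible"))     # negative

Counting words from hand-made lists is obviously crude, but it shows the core idea: once text is cleaned and structured, even simple rules can start producing useful signals.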

Contents:

  1. Regular expressions
  2. Command line text processing
  3. Python: installation and pip
  4. Playing with Python
  5. Extracting data from Twitter
  6. A quickie on JSON
  7. Cleaning the tweets
  8. Sentiment analysis
  9. Topic detection
  10. A project
