Command Line Utilities for Text Processing

Posted on August 15, 2016 in misc

Unix Command Line Utilities:

You might be wondering why we need command line utilities for text processing. The traditional Unix command line tools are extremely helpful when it comes to text processing: most of the preprocessing and cleaning of text can be done with them. It helps to learn regular expressions before learning these tools, because some of these utilities rely on regular expressions.

Since there are a lot of command line utilities, we will only discuss the most used ones:

  1. wc
  2. head
  3. sort
  4. uniq
  5. paste & join
  6. split
  7. cut
  8. grep
  9. sed

wc or word count

wc is used for obtaining the number of lines, words and characters in a file.

  1. An example to get the number of lines:

      $ wc -l <filename>
      20

  2. An example to get the number of words:

      $ wc -w <filename>
      60

  3. An example to get the number of characters:

      $ wc -c <filename>
      233
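
If you run wc with no option at all, it prints all three counts on one line, in the order lines, words, characters. A minimal sketch, assuming the counts above came from a (hypothetical) file named tutorial.txt:

      $ wc tutorial.txt
      20  60 233 tutorial.txt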

head:

head is used to output the first "n" lines of a file. If the number is not specified, it outputs the first ten lines. head can be used to peek at a sample of the entire data set, or to produce a partial data set that can be used for further processing.

$ head -n <number> <filename>
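
For example, the following sketches print the first five lines of a file, and save the first 100 lines of a large file into a smaller sample for further processing (the filenames here are just placeholders):

      $ head -n 5 tutorial.txt
      $ head -n 100 big_data.txt > sample.txt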

sort

As the name implies, sort sorts the lines of a text file and prints them to standard output. You can also sort in reverse order, or sort based on a character position or field within each line.

$ sort [options] <filename>
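
A few commonly used options, shown as a sketch with a placeholder filename: -r reverses the sort order and -k picks the field to sort on (both are standard in GNU sort):

      $ sort <filename>          # sort lines in ascending order
      $ sort -r <filename>       # sort lines in descending (reverse) order
      $ sort -k 2 <filename>     # sort on the second whitespace-separated field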

uniq

The uniq command is used to remove duplicate lines from a sorted text file, or to count the occurrences of duplicates.

NOTE: The file should be sorted before applying the uniq command, because uniq only collapses duplicate lines that are adjacent to each other. That's why uniq is almost always used in combination with the sort command; in Unix, commands can be pipelined.

Suppose you have a file named "foo.txt" with the following lines:

  $ cat foo.txt
   This line occurs only once.
   This line occurs twice.
   This line occurs twice.
   This line occurs three times.
   This line occurs three times.
   This line occurs three times.

When you use the uniq command on this file, this is what the output looks like:

  $ uniq -c foo.txt
    1 This line occurs only once.
    2 This line occurs twice.
    3 This line occurs three times.

NOTE: The -c option used above tells uniq to also print the number of occurrences. The number at the start of each output line is the count of how many times that line occurs in the entire file. If you don't specify -c, uniq just outputs the unique lines.
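
Since the input has to be sorted first, sort and uniq are usually chained with a pipe. A minimal sketch on the same foo.txt, which works even if its lines are in arbitrary order:

  $ sort foo.txt | uniq -c
    1 This line occurs only once.
    3 This line occurs three times.
    2 This line occurs twice.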

paste & join

paste is used to merge two files line by line, as shown below. items and prices are two text files, and we merge them using the paste command:

$ cat items
 alphabet blocks
 building blocks
 cables

$ cat prices
 $1.00/dozen
 $2.50 ea.
 $3.75

The paste command output:

$ paste items prices
 alphabet blocks $1.00/dozen
 building blocks $2.50 ea.
 cables  $3.75

If you observe the output, the files are merged at the line level, i.e. each line of the first file is merged with the corresponding line of the second file.

The join command is left as homework for you to explore.

split

Imagine you have a very large data file and you want to split it into equal chunks, each stored in a separate file. split is the command you are looking for.

  $ split --number=<number of files or chunks> <filename>

split can be used in many other ways. You can split based on the number of lines per output file, and the coolest feature is that you can split by specifying the size of each piece (in bytes, kilobytes, megabytes and so on). In every case the output is stored in files with the prefix 'x', such as xaa, xab etc. You can change the prefix by passing it as an optional argument.
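
A couple of hedged sketches of these variants (the filename is just an example): -l sets the number of lines per piece, -b sets the size of each piece, and a trailing argument changes the output prefix:

  $ split -l 1000 big_data.txt           # 1000 lines per output file: xaa, xab, ...
  $ split -b 10M big_data.txt            # pieces of 10 megabytes each
  $ split -l 1000 big_data.txt chunk_    # output files named chunk_aa, chunk_ab, ...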

grep

grep is a utility that searches any number of input files for a given pattern or patterns and outputs only those lines which match. It searches line by line, and if a line matches any one of the patterns it prints it out. What makes grep even more powerful is that you can use basic regular expressions along with simple patterns.

Here is how you use grep:

  1. Let's say you want to search for all the occurrences of the word "mining" in a file named tutorial.txt:

      $ grep "mining" tutorial.txt

    The above command outputs all the lines that contain the pattern "mining" in the file tutorial.txt. But this includes anything that matches "mining", such as "text-mining", "coal-mining", "sdfsminingsdfd", as well as the word "mining" itself.

  2. If you just want the pattern to match whole words, then -w is the option to use:

      $ grep -w "mining" tutorial.txt
  3. The same can be applied to phrases as well. After all, a phrase is just a sequence of characters.

      $ grep -w "is a good boy" tutorial.txt

    It will output all the lines that have the phrase "is a good boy".

Regular expressions

You can also use regular expressions with grep, which makes it much more powerful for pattern searching. It's like Tony Stark wearing the Iron Man suit.

  $ grep '^hello' tutorial.txt

This outputs all the lines that start with the characters "hello". Notice that we have used '^', a metacharacter that represents the start of a line in the regular expression language. For more on regular expressions, please refer to the regular expressions chapter.
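
Similarly, the metacharacter '$' anchors a pattern to the end of a line. A small sketch on the same (hypothetical) file:

  $ grep 'hello$' tutorial.txt

This outputs all the lines that end with the characters "hello".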

Inverse search:

You will not always be searching for lines that match a pattern; there are also times when you need the lines that do not match a pattern. You can do this with grep as well.

  $ grep -v "^hello" tutorial.txt

The '-v' option supplies this functionality.


Things to try:

  1. Create a file and save the following text in it exactly as shown (one way to create the file from the shell is sketched after this list):
Ironman is cool.
Ironman is smart
Ironman is intelligent.
Ironman is good.
Ironman is strong.
Everyday is a new day
Ironman is goood
Ironman is smarter
  2. Find the number of lines in the file (use wc).

  3. Check the first line of the file using head.

  4. Check lines 1 to 5 using head.

  5. Sort the lines using the sort command.

  6. Remove duplicates using the uniq command.

  7. Use grep to search for lines that have the word "smart" (it should not match "smarter").

  8. Use grep to get the lines that do not have the word "Ironman".
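
One possible way to create the exercise file straight from the shell (the filename ironman.txt is only a suggestion; any text editor works just as well) is to redirect cat into the file, paste the eight lines above, and press Ctrl-D on an empty line to finish:

  $ cat > ironman.txt
  ... type or paste the eight lines above ...
  (press Ctrl-D on a new line to finish)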