How To Handle Massive Output - Miscellaneous Commands

This is the final article in the tutorial series "How To Handle Massive Output".

Part 4: sort, uniq, seq, xargs

The four commands above, when joined with grep/awk/sed, become very powerful: together they let us turn massive, mostly irrelevant output into actionable data.

 

The "sort" Command

It does exactly what its name says: it sorts the lines of its input.

Let's say we have a file that looks like this

$ cat random.txt
November
Delta
Foxtrot
Tango
Charlie
Romeo

The sort command will reorganize the list above as below

$ cat random.txt | sort
Charlie
Delta
Foxtrot
November
Romeo
Tango


The sort command can also sort on a specific column, and extra options control whether that column is compared as text or as numbers. A new example shows a multi-column file such as below

$ cat names_ages_emails.txt
Kevin    27    kevin@sentiblue.net
Kelly    19    kelly@sentiblue.net
Robert   14    bob@sentiblue.net
Randall  32    randall@sentiblue.net
Michael  37    mike@blogspot.com


To sort ascending for age we do this

$ cat names_ages_emails.txt | sort -k 2n
Robert   14    bob@sentiblue.net
Kelly    19    kelly@sentiblue.net
Kevin    27    kevin@sentiblue.net
Randall  32    randall@sentiblue.net
Michael  37    mike@blogspot.com


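The "-k 2" tells sort to start its sort key at column 2, and the "n" tells it to compare numerically. Strictly speaking, "-k 2" makes the key run from column 2 to the end of the line; write "-k 2,2n" to limit it to column 2 alone. For this file both give the same order, because a numeric comparison only reads the number at the start of the key.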
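
Here is a small sketch of why the "n" matters and how to pin the key to a single column. The names and ages are made up for the illustration (they are not the file above), and LC_ALL=C is only there to make the text comparison predictable from one system to the next.

```shell
# Made-up sample data: column 1 is a name, column 2 is an age.
data='Kevin 27
Kelly 9
Robert 14
Randall 102'

# Text sort on column 2 compares character by character,
# so 102 comes before 14, and 9 ends up last.
by_text=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2)

# Numeric sort on column 2 only ("2,2" ends the key at column 2).
by_number=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2n)

# Adding "r" reverses the order: highest age first.
oldest_first=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2nr)

printf '%s\n\n' "$by_text" "$by_number" "$oldest_first"
```

The text sort puts Randall (102) first and Kelly (9) last, while the numeric sort runs 9, 14, 27, 102.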

 

The "uniq" Command

The uniq command normally works together with sort. It only collapses duplicate lines that sit next to each other, so the data has to be sorted first for it to remove every duplicate.

Let's say we have a list that has multiple duplicate items like this

$ cat names.txt
Kevin
Kelly
Robert
Kevin
Kevin
Robert
Kelly
Tom
Amber
Kelly
Amber
Robert
Tom

The uniq command is used to remove duplicates, but it only works correctly if the list is sorted. Notice above that there are two "Kevin" entries right next to each other. The uniq command will remove one of them, but it will reprint all of the rest because the other duplicates are not adjacent:

$ cat names.txt | uniq
Kevin
Kelly
Robert
Kevin
Robert
Kelly
Tom
Amber
Kelly
Amber
Robert
Tom


Now, if we REALLY want to remove all duplicates, apply the uniq command *AFTER* sorting the list:

$ cat names.txt | sort | uniq
Amber
Kelly
Kevin
Robert
Tom


See how that works?
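
As a quick check of the behaviour above, the sketch below feeds the same 13 names through uniq with and without sort. It also tries "sort -u", a standard sort option that sorts and drops duplicates in one step. The variable names are only for the illustration.

```shell
# The same 13 names as in names.txt above.
names='Kevin
Kelly
Robert
Kevin
Kevin
Robert
Kelly
Tom
Amber
Kelly
Amber
Robert
Tom'

# uniq alone only collapses the adjacent "Kevin" pair: 13 lines become 12.
adjacent_only=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# Sorting first makes all duplicates adjacent, so uniq leaves 5 names.
deduped=$(printf '%s\n' "$names" | LC_ALL=C sort | uniq)

# sort -u sorts and removes duplicates in a single command.
deduped_short=$(printf '%s\n' "$names" | LC_ALL=C sort -u)

echo "lines after uniq alone: $adjacent_only"
printf '%s\n' "$deduped"
```

Plain "sort | uniq" is still the form to reach for when you need uniq's own options, such as the "-c" counter used further down.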

Note that this is especially useful for analyzing IP addresses in a web server access log. Using grep/awk/sed together with sort/uniq, you can extract each visitor's IP address, the time of the request and the page they pulled up. This is very handy when doing forensic analysis on a web application.

Real Life Example:

Apache log file looks like this
192.168.247.33 - - [01/Apr/2014:11:20:49 -0700] "GET /vars/images/dt_images/topnav/header/flash.jpg HTTP/1.1" 302 268 "https://blog.sentiblue.com/blogs/technical/promodetailtechnical.dt?arg_promoid=80878&parentpage=technical_technicalmanager&pagename=technical_technicalmanager&idPageIndex=1&advfiltervalues=" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.33 - - [01/Apr/2014:11:20:49 -0700] "GET /vars/images/dt_images/topnav/header/menuselectorBK2.jpg HTTP/1.1" 302 270 "https://blog.sentiblue.com/blogs/technical/promodetailtechnical.dt?arg_promoid=80878&parentpage=technical_technicalmanager&pagename=technical_technicalmanager&idPageIndex=1&advfiltervalues=" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.33 - - [01/Apr/2014:11:20:49 -0700] "GET /vars/images/dt_images/topnav/header/banner_bg.jpg HTTP/1.1" 302 272 "https://blog.sentiblue.com/blogs/technical/promodetailtechnical.dt?arg_promoid=80878&parentpage=technical_technicalmanager&pagename=technical_technicalmanager&idPageIndex=1&advfiltervalues=" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.33 - - [01/Apr/2014:11:20:49 -0700] "GET /vars/images/dt_images/topnav/header/toolbarBkSelected.jpg HTTP/1.1" 302 272 "https://blog.sentiblue.com/blogs/technical/promodetailtechnical.dt?arg_promoid=80878&parentpage=technical_technicalmanager&pagename=technical_technicalmanager&idPageIndex=1&advfiltervalues=" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.33 - - [01/Apr/2014:11:20:49 -0700] "GET /vars/images/dt_images/k.png HTTP/1.1" 302 242 "https://blog.sentiblue.com/blogs/technical/promodetailtechnical.dt?arg_promoid=80878&parentpage=technical_technicalmanager&pagename=technical_technicalmanager&idPageIndex=1&advfiltervalues=" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.118 - - [01/Apr/2014:11:20:49 -0700] "GET / HTTP/1.1" 302 226 "-" "Echoping/6.0.2"
192.168.247.24 - - [01/Apr/2014:11:20:49 -0700] "GET / HTTP/1.1" 302 225 "-" "-"
192.168.247.142 - - [01/Apr/2014:11:20:49 -0700] "POST /container27/queues/amfpollingsecure HTTP/1.1" 200 65 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2; MS-RTC LM 8; .NET4.0E)"
192.168.247.231 - - [01/Apr/2014:11:20:50 -0700] "POST /container29/queues/amfpollingsecure HTTP/1.1" 200 66 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.142 - - [01/Apr/2014:11:20:50 -0700] "POST /container27/queues/amfpollingsecure HTTP/1.1" 200 65 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2; MS-RTC LM 8; .NET4.0E)"
192.168.247.118 - - [01/Apr/2014:11:20:51 -0700] "GET / HTTP/1.1" 302 226 "-" "Echoping/6.0.2"
192.168.247.142 - - [01/Apr/2014:11:20:52 -0700] "POST /container27/queues/amfpollingsecure HTTP/1.1" 200 65 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2; MS-RTC LM 8; .NET4.0E)"
192.168.247.231 - - [01/Apr/2014:11:20:53 -0700] "POST /container29/queues/amfpollingsecure HTTP/1.1" 200 66 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
192.168.247.142 - - [01/Apr/2014:11:20:53 -0700] "POST /container27/queues/amfpollingsecure HTTP/1.1" 200 65 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2; MS-RTC LM 8; .NET4.0E)"
192.168.247.142 - - [01/Apr/2014:11:20:52 -0700] "POST /container27/queues/amfsecure HTTP/1.1" 200 10411 "https://blog.sentiblue.com/visitors/listing/demo.swf/[[DYNAMIC]]/5" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.2; MS-RTC LM 8; .NET4.0E)"

Urghhh!!! That output looks tough!!! I just want to get a count of all the IP Addresses in this log... here's how

$ cat apache.log | awk '{ print $1 }' | sort | uniq -c
   2 192.168.247.118
   5 192.168.247.142
   2 192.168.247.231
   1 192.168.247.24
   5 192.168.247.33


See how easy it is? In this command example, we tell the bash shell to do this:

View the apache.log file, print only column 1 (the IP address), sort the result, then collapse the duplicates and count each of them. When uniq is used with "-c", it prints an additional column in front showing how many times each unique item occurred.
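
A common next step is to rank the addresses by how busy they are. The sketch below assumes the IP column has already been extracted with awk as above; the six addresses are a cut-down sample, not the full log, and the second sort (-rn for reverse numeric) is my addition to the pipeline.

```shell
# A cut-down sample of what awk '{ print $1 }' would produce.
ips='192.168.247.33
192.168.247.142
192.168.247.24
192.168.247.142
192.168.247.33
192.168.247.142'

# Count each address, then sort the counts numerically, largest first.
ranked=$(printf '%s\n' "$ips" | LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn)

# The first line now holds the busiest client; column 2 is its address.
top_ip=$(printf '%s\n' "$ranked" | head -n 1 | awk '{ print $2 }')

printf '%s\n' "$ranked"
echo "busiest client: $top_ip"
```

On a real log you would likely add "| head" at the end to keep only the top few entries.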

 

The "xargs" Command

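This is a simple command. It reads items from its input and passes them as arguments to another command. With no command given it runs echo, so the visible effect is that a column of items becomes a single row, separated by spaces.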

From this example above
$ cat names.txt | sort | uniq
Amber
Kelly
Kevin
Robert
Tom


We can extend it with the xargs command to turn the output into a single string of names:

$ cat names.txt | sort | uniq | xargs
Amber Kelly Kevin Robert Tom


This usage is particularly handy when the output has to feed another command that expects its data as a list of arguments, as above.
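
To see what xargs is doing with those items, here is a minimal sketch using the five names from above. The "-n" option (a standard xargs option) limits how many items go to each run of the command; the word "Hello" is just a stand-in for a real command's own arguments.

```shell
names='Amber
Kelly
Kevin
Robert
Tom'

# No command given: xargs runs echo once with all five names.
one_line=$(printf '%s\n' "$names" | xargs)

# -n 2 passes at most two names to each echo, giving three lines.
pairs=$(printf '%s\n' "$names" | xargs -n 2)

# With an explicit command, the names are appended after its own arguments.
greeting=$(printf '%s\n' "$names" | xargs echo Hello)

printf '%s\n' "$one_line" "$pairs" "$greeting"
```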

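For example, the command "ps -eaf | grep [a]pache" lists every process whose ps line contains "apache", which here means the processes owned by the account name "apache". Writing the pattern as "[a]pache" keeps the grep process itself out of the results.

The output looks like this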
$ ps -eaf | grep [a]pache
apache 21962  6852  0 11:41 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 21963  6852  0 11:41 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 21976  6852  0 11:41 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 21980  6852  0 11:41 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22021  6852  0 11:41 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22242  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22247  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22280  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22281  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22309  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22395  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22404  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22415  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22416  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL
apache 22420  6852  0 11:42 ?        00:00:00 /usr/sbin/httpd -d /opt/sentiblue/blog -D SSL


We want to write a single command line to kill all these processes. We know that the kill command takes one or more PIDs as arguments, like this;

$ kill ..

We also know that we can extract the second column to get the PID list like this
$ ps -eaf | grep [a]pache | awk '{ print $2 }' | xargs
21962 21963 21976 21980 22021 22242 22247 22280 22281 22309 22395 22404 22415 22416 22420

If we have xargs hand this list to the kill command, they will all die;

$ ps -eaf | grep [a]pache | awk '{ print $2 }' | xargs kill
$ ps -eaf | grep [a]pache
< Nothing shows here because all processes died. >
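
Before pointing this at real processes, it can help to rehearse it. The sketch below is a stand-in, not the Apache case itself: it starts three throwaway "sleep" jobs, previews the kill command by putting echo in front of it, then runs it for real. The variable names are hypothetical.

```shell
# Start three throwaway background jobs to stand in for the httpd processes.
pids=''
for i in 1 2 3; do
    sleep 60 &
    pids="$pids $!"
done

# Dry run: with echo in front, xargs only prints the command it would build.
preview=$(printf '%s\n' $pids | xargs echo kill)
echo "would run: $preview"

# Real run: xargs appends every PID to a single kill invocation.
printf '%s\n' $pids | xargs kill

# Reap the jobs; "|| true" ignores the non-zero status of a killed job.
wait $pids 2>/dev/null || true

# kill -0 sends no signal, it only checks whether the PID still exists.
survivors=0
for pid in $pids; do
    if kill -0 "$pid" 2>/dev/null; then
        survivors=$((survivors + 1))
    fi
done
echo "still running: $survivors"
```

One thing to watch: if the pipeline finds no PIDs, kill may still be run with no arguments and complain. With GNU xargs, the "-r" option skips the run entirely when the input is empty.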

Where xargs really earns its keep is in loops when writing a script;

The "for" loop uses syntax like this;

for NAME in Kevin Robert Kelly
do
   echo $NAME
done

Notice that the for loop takes its $NAME items as a list of words separated by whitespace. Xargs turns a column of data into exactly that format, so wrapping the pipeline in backticks (command substitution) lets the for loop process each item in turn.
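
Put together, that looks roughly like the sketch below. It recreates a short names.txt in a temporary directory so it can run anywhere; the file contents and the "greeted" variable are made up for the illustration. I use $( ) here, which does the same job as backticks.

```shell
# Recreate a small names.txt in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' Kevin Kelly Robert Kevin Tom > "$workdir/names.txt"

# The command substitution runs the pipeline first; the for loop then
# walks over the resulting row of unique names one at a time.
greeted=''
for NAME in $(LC_ALL=C sort "$workdir/names.txt" | uniq | xargs); do
    greeted="$greeted[$NAME]"
done

echo "$greeted"
rm -rf "$workdir"
```

This approach assumes the items contain no spaces, since the loop splits on any whitespace.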

 

The "seq" Command

The command "seq" is used to generate a sequence of numbers. The simplest example is to generate a list of numbers from 1 to 5 as below

$ seq 1 5
1
2
3
4
5


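Going back to the for loop example in the section above, we want the list 1-5 as a row of text instead of a column; we do this

$ for NUMBER in `seq 1 5 | xargs`
do
   # file.txt is a placeholder; use the file you want to search
   grep "$NUMBER" file.txt
done

The loop above generates a column of numbers, converts it to a single row, hands that row to the for loop through the backticks, then iterates over each number and searches for it in a file. (The shell would also split seq's output on newlines by itself, so xargs here mainly makes the intent explicit.)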

Let's make the example more practical with a real-life case;

I want to check the date on my web server farm (5 machines); The server names are www2001 www2002 www2003 www2004 www2005. I can do something like this

$ for WS in www2001 www2002 www2003 www2004 www2005; do echo -n "$WS: "; ssh $WS date; done
www2001: Tue Apr  1 12:00:53 PDT 2014
www2002: Tue Apr  1 12:00:53 PDT 2014
www2003: Tue Apr  1 12:00:53 PDT 2014
www2004: Tue Apr  1 12:00:53 PDT 2014
www2005: Tue Apr  1 12:00:53 PDT 2014

But if the number of servers in my farm is a few thousand, will I be willing to type out the whole list in the for loop? Definitely NOT!!! Here's how to get away from that with a list of 500 servers:

$ for WS in `seq 2001 2500 | xargs`; do echo -n "www$WS: "; ssh www$WS date; done
www2001: Tue Apr  1 12:00:53 PDT 2014
www2002: Tue Apr  1 12:00:53 PDT 2014
www2003: Tue Apr  1 12:00:53 PDT 2014
....
....
www2499: Tue Apr  1 12:00:53 PDT 2014
www2500: Tue Apr  1 12:00:53 PDT 2014
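
With hundreds of machines on the line, a dry run is cheap insurance. The sketch below builds the host names first (using sed from Part 3 to add the "www" prefix) and only echoes the ssh commands instead of running them. The range 2001-2005 and the variable names are my own, chosen to keep the demo short.

```shell
# Build the host names up front: 2001..2005 becomes www2001..www2005.
hosts=$(seq 2001 2005 | sed 's/^/www/' | xargs)

# Dry run: print the ssh command for each host instead of executing it.
plan=$(for WS in $hosts; do echo "ssh $WS date"; done)

echo "hosts: $hosts"
printf '%s\n' "$plan"
```

Once the printed commands look right, swap the echo for the real ssh call.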

Say we only want the odd numbers out of the server list; we can do this
$ for WS in `seq 2001 2 2500 | xargs`; do echo $WS; done
2001
2003
....
2497
2499

Note that seq has no notion of odd or even. The command above only tells it to start at 2001 and increment by 2. You can increment by any number to achieve purposes other than odd/even.
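
A few more increments, as a small sketch. The three-argument form is "seq FIRST INCREMENT LAST"; the ranges here are shortened and the variable names are only for the illustration.

```shell
# Start on an odd number and step by 2 to get odd numbers.
odd=$(seq 2001 2 2009 | xargs)

# Start on an even number and the same step gives even numbers.
even=$(seq 2002 2 2010 | xargs)

# Any step works, for example every fifth number.
every_fifth=$(seq 0 5 20 | xargs)

# A negative increment counts down.
countdown=$(seq 5 -1 1 | xargs)

printf '%s\n' "$odd" "$even" "$every_fifth" "$countdown"
```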

If we are dealing with leading zeroes, we may need to format the numbers so that they all have the same width. Just add the "-w" switch to "seq" and it pads every number with zeroes to the same number of digits, which is derived from the widest number in the output sequence.

If we have a server list like "s01 s02 s03.... s10" then we may need to do this

$ seq -w 1 10
01
02
03
04
05
06
07
08
09
10


For 100-999 servers, the same syntax automatically generates 3 digits. Again, the number of digits comes from the highest number.

$ seq -w 1 250
001
002
...
250
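
To turn those padded numbers into server names like the "s01 ... s10" list, something along these lines should do. It is a sketch: the second variant uses printf's "%02d" format as an alternative way to pad, which is my own addition rather than something from the article.

```shell
# seq -w pads the numbers; the loop adds the "s" prefix.
servers=$(for N in $(seq -w 1 10); do echo "s$N"; done | xargs)

# Alternative: let printf do the zero padding from plain seq output.
# printf repeats its format for every argument it is given.
same_with_printf=$(printf 's%02d\n' $(seq 1 10) | xargs)

echo "$servers"
echo "$same_with_printf"
```

The printf form lets you fix the width yourself (for example "%03d") instead of deriving it from the largest number.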

Articles in the Series

Part 1: The grep Command
Part 2: The awk Command 
Part 3: The sed Command
Part 4: Miscellaneous Commands
