A short article of mine promoting shell scripting has appeared in the excellent ACCU publication, Overload. Since Overload is available online, you can read the original version there. Alternatively, I’ve republished it here, and added a couple of important revisions based on reader comments, so just keep reading …
The Unix command shell contains a lot of what I like in a programming environment: it’s dynamic, high-level, interpreted, flexible, succinct. It’s even reasonably portable now that bash seems to have become the shell of choice. Although there’s much about shell scripting I don’t like, on many occasions it turns out to be the best tool for the job.
In this article we shall demonstrate how simple shell scripts can be used to implement sets, providing one line recipes for set creation, set union, set intersection and more. Having explored the power of the Unix shell we’ll consider its limitations, before finally discussing the more general lessons we can learn from the Unix tools.
As an example, let’s suppose we want to analyse sets of IP addresses contained in a couple of Apache HTTP Server access logs,
access_log2. Each log file contains many thousands of lines which look something like this:
18.104.22.168 - - [25/Jun/2007:00:03:21 +0000] ... 22.214.171.124 - - [25/Jun/2007:00:03:24 +0000] ... 126.96.36.199 - - [25/Jun/2007:00:03:24 +0000] ... 188.8.131.52 - - [25/Jun/2007:00:03:36 +0000] ... ...
cut this file down to leave just the IP address at the start of each line.
Cut is a simple tool which we’ll be using again later, and here we pass it options
-f1 to select the first field from each line and
-d" " to use the space character as a field separator.
$ cut -f1 -d" " access_log1 184.108.40.206 220.127.116.11 18.104.22.168 22.214.171.124 ...
The output from this command is likely to be full of duplicates. Regular site visitors typically hit the web server a few times; web spiders and robots are much more hungry. To obtain the sets of unique IP addresses contained in each log file, we could do this:
$ cut -f1 -d" " access_log1 | sort | uniq > IP1 $ cut -f1 -d" " access_log2 | sort | uniq > IP2
cut picks out the IP addresses,
sort orders the results,
uniq eliminates duplicates, and we’ve redirected the output into files IP1 and IP2. By the way, we could have eliminated a link from the pipeline using the
-u option to sort. The Unix shell tools aren’t entirely orthogonal!
$ cut -f1 -d" " access_log1 | sort -u > IP1
The resulting sets are ordered — a set implementation which should be familiar to C++ programmers. The IP addresses will be lexicographically rather than numerically ordered, since we went with the
sort defaults. This means that, for example,
126.96.36.199 appears before
1 alphabetically precedes
5. With a little more effort, we could probably persuade
sort to yield a numeric ordering (no,
sort -n isn’t good enough).
If instead we wanted a multiset — that is, a set in which elements may appear more than once, we could count the number of times items are repeated in the sorted output using the
-c option to
$ cut -f1 -d" " access_log1 | sort | uniq -c 8 188.8.131.52 2 184.108.40.206 14 220.127.116.11 1 18.104.22.168 ...
Here, each IP address is prefixed by the number of times it occurred in the log file, so our multiset contains
22.214.171.124 8 times, etc. This will be useful later when we come to intersection operations.
Let’s assume we’ve followed the steps above and IP1 and IP1 contain the set of IP addresses in the two access logs. Forming the union of these sets is simple.
$ sort -m IP1 IP2 | uniq > IP1_union_IP2
-m merge option to
sort is purely for efficiency and the result would be equally correct without it. Since the inputs are already sorted, we can just merge them together, line by line. For C++ users, it’s the difference between the
Finding the intersection of IP1 and IP2 can be done in a similar fashion. We merge them together then use the
-d option to
uniq to pick out duplicates. Since the original sets contained no duplicates, the elements output by this command are those common to both inputs; the set intersection, that is.
$ sort -m IP1 IP2 | uniq -d > IP1_intersection_IP2
Brief though this command is, we needn’t type it all in. Exploiting its similarity to the previous command and using the magic of shell history, we just hit the up arrow key ↑ and edit the previous line.
A more succinct alternative would be to use
comm, a specialised tool for selecting or rejecting lines common to two files.
$ comm -12 IP1 IP2 > IP1_intersection_IP2
Comm requires the input text files to be lexically sorted, and by default outputs three columns: lines only in the first file; lines only in the second file; and lines common to both files. By supplying the
-12 option we choose to select just the third column, again generating the desired intersection.
We can tweak the first intersection recipe to find the set symmetric difference between IP1 and IP2 (the IP addresses in just one of IP1 and IP2 that is). Again, the up arrow key ↑ recalls the command, and this time we use
-u option to suppress repeated elements.
$ sort -m IP1 IP2 | uniq -u > IP1_symmetric_diff_IP2
We could also have used
comm -3 | tr -d "\t" (note the use of
tr to delete unwanted tab characters from
What about the elements in IP1 but not IP2? Again,
comm does the job. This time, we suppress columns 2 and 3.
$ comm -23 IP1 IP2 > IP1_subtract_IP2
Uniting a set of sets is easy.
Sort handles as many files as you pass it.
$ sort -mu IP1 IP2 IP3 .... IPN > IP_unite_all
To intersect N sets we could iterate through them, maintaining their cumulative intersection so far by using pair-wise intersection at each step. An alternative approach does away with the explicit iteration by forming their multiset union, then extracting elements which appear N times. Here’s an example when N is 3.
$ sort -m IP1 IP2 IP3 | uniq -c | grep "^ *3" \ | tr -s " " | cut -f3 -d" "
Let’s unpick this pipeline. First,
sort -m IP1 IP2 IP3 | uniq -c generates the multiset of IP addresses in IP1, IP2 and IP2. Since IP1, IP2 and IP3 are sets and therefore individually contain no repeats, the resulting multiset looks something like this:
$ sort -m IP1 IP2 IP3 | uniq -c 1 126.96.36.199 3 188.8.131.52 2 184.108.40.206 1 220.127.116.11 ...
Each line in the output starts with a count which must be either 1, 2 or 3. Lines starting with 3 correspond to IP addresses common to all three files — and these are the IP addresses which form the intersection of IP1, IP2 and IP3. Now we can use some standard pattern matching and extraction techniques to pick out the desired fields.
grep picks out lines starting with any number of spaces followed by a
tr -s " " squeezes repeated spaces from each line, making the output suitable for use with
cut using the space character as a field delimiter. Finally
cut itself extracts the column we want (the one with the IP address).
This approach generalises to the following shell script.
#! /bin/sh # Intersect a collection of lexicographically sorted input sets sort -m $@ | uniq -c | grep "^ *$# " | tr -s " " | cut -f3 -d" "
The rather cryptic looking
$# which appear in this script are special shell parameters: the first expands to the parameters passed to
intersect, the second to the number of these parameters. This function generates output on
stdout, and is ready for use in yet bigger shell scripts.
If you call this function with no inputs, it appears to hang — that’s because
sort, given no input files, processes
stdin. This breaks
intersect. We can fix the problem in a couple of ways.
We could add a conditional check that callers have supplied at least one file, printing usage information and returning an error code if not.
#! /bin/sh if [ $# -eq 0 ] then echo 1>&2 "Usage: $0 SET1 SET2..." exit 127 fi sort -m $@ | uniq -c | grep "^ *$# " | tr -s " " | cut -f3 -d" "
Alternatively, we could take the view that intersecting the empty set of sets is fine and should yield the empty set. We can avoid a conditional check by using the shell’s own version of the Null Object pattern.
sort -m /dev/null $@ | ....
One of the nice things about set operations is there aren’t many of them. We’ve already covered the important ones, and these can easily be extended. Try and work out what set operations are going on in the the command history shown below.
$ comm -13 S1 S2 $ comm -23 S1 S2 $ diff S1 S2 $ head -1 S1 $ sort -m S1 S2 S3 | uniq -c | grep -c "^ *3" $ tail -1 S2 $ wc -l S1
As a hint, the answers in lexicographical order are:
- are two sets disjoint?
- are two sets the same?
- how big is the intersection of three sets?
- how many elements in a set?
- is a subset of?
- largest element of a set
- smallest element of a set
The command shell is a powerful, dynamic and extensible programming environment. Even these simple one-line scripts can be stored as functions which can be sourced when a new shell is started; you can add command-line help to them, you can find them using tab-completion, you can keep them in your source control system. In this way you can create your own customised shell working environment and port it from platform to platform just by checking it out.
Apache server logs are no more and no less than line oriented text. Each record in the log is terminated by a newline character, and each field within each record is delimited in an obvious way: by brackets, spaces, quotation marks, whatever — who needs XML? This is the kind of format shell scripts handle well. Conversely, anything more complicated, XML for example, or records which span multiple lines, is likely to push the shell tools too far. Maybe
awk could cope, but I don’t think many people bother learning
awk these days: it’s better to use one of the popular high-level languages when basic shell commands won’t do.
Shell scripts tend not to fail safely. For example, the following command is meant to clear out files in a temporary directory:
# Don't try this at home! $ rm -rf $TEMP_WORK_DIR/*
You can imagine what happens if TEMP_WORK_DIR has not been set. In general, the Unix commands build on a couple of dangerous assumptions: that programmers know what they are doing; and that the show must go on — by which I mean that, given malformed input, a shell script will not throw an exception. The IP filters we discussed in this article work quite happily with any old text file as input — if it wasn’t an Apache http server log, the only indication of failure may well be smaller sets than expected.
I’ll admit that I personally avoid writing any shell scripts much longer than the ones shown here. As with Makefiles, I admire and respect the technology but I’d rather have someone else deal with the details. The
bash manual may be brief to a fault, but I’ve yet to get to grips with its finer details. Sometimes it’s just too subtle.
On the subject of details, earlier in this article I said that by default
sort uses lexicographical ordering, which isn’t perhaps the ordering we’d prefer for IP addresses; and I also said that a numeric
sort -n wouldn’t do the job either: IP addresses aren’t really numbers, they’re dot separated number quartets. You can use
sort to place IP addresses in a more natural order, but the command you’ll need is anything but natural.
# "Natural" ordering of IP addresses $ sort -t. +0n -1n +1n -2n +2n -3n +3n IP
If you want to know how this works you’ll have to read the manual. The code, on its own, is unreadable. If you don’t know where the manual is, just open a shell window and type
man. If the output from this command doesn’t help, try
man man, and if you don’t know how to open a shell window, I’m surprised you’re even reading this sentence!
Modern graphical development environments tend to hide the shell and the command line, probably with good reason, and I don’t suppose this article will persuade anyone they’re worth hunting out. And yet the Unix shell embodies so much which is modern and, I suspect, future, best practice.
For me, it’s not just what the shell tools can do, it’s the example they set. Look again at some of the recipes presented in this article and you’ll see container operations without explicit loops. You’ll see flexible and generic algorithms. You’ll see functional programming. You’ll see programs which can parallel-process data without a thread or a mutex in sight; no chance of shared memory corruption or race conditions here. The original design of the shell tools may have become somewhat polluted — we’ve already seen that
sort does some of what
uniq can do — but I think the intent shines through as clearly as ever: we have a compact suite of tools, each with its own responsibility, which cooperate using simple interfaces. We would do well to emulate this model in our own software designs.
I would like to thank the Overload editorial team for their help with this article. I would also like to thank David Jones and Don for their suggestions.
I think this is a really good article on the spirit of shell programming.
Intersection is «comm -12 a b», subtraction is «comm -23 a b», symmetric difference is (almost) «comm -2 a b».
«| sort | uniq | sort -nr» is one of my favourites. sort in popularity order.
I think this is a really good article on the spirit of shell programming.
Thanks David, much appreciated.
Intersection is «comm -12 a b» ...
I wish I'd discovered
commbefore I wrote the article! It's much neater than the combination of
cut. One of the problems I have with the shell tools, though, is that it takes time to discover them. Thus
man commpoints me to related functions
cmpetc, but (e.g.)
man sortdoesn't have anything to say about
Totally agree about time to discover shell tools (and the traditionally bad organisation of the man pages). In the good old days a Unix system was simple enough that you could just «ls /usr/bin» and go through each one to see what it did. That's not so feasible any more.
There's a certain learning value to seeing how to build up the tools from other parts though.
I've realised the other thing about the
cutpipeline is that it's a bit more flexible -- so you can easily tweak the recipe to (e.g.) intersect N sets. Unlike most of the shell tools,
commhas a very limited role (but thanks again for pointing it out).
Here is how i did it with cat, sort, and uniq:
(shell script)take 2 input files and output 2 files: unique.txt and duplicate.txt
if [ $# -ne 2 ]; then echo 1>&2 Usage: $0 file1 file2 exit 127 fi
cat $1 $2>tmp.tmp sort tmp.tmp>sort.tmp uniq -u sort.tmp unique.txt uniq -d sort.tmp duplicate.txt rm *.tmp 2>/dev/null wc *.txt exit 0
if [ $# -ne 2 ] then echo 1>&2 Usage: $0 file1 file2 exit 127 fi cat $1 $2 > tmp.tmp sort tmp.tmp > sort.tmp uniq -u sort.tmp > unique.txt uniq -d sort.tmp > duplicate.txt rm *.tmp 2>/dev/null wc *.txt exit 0
Don, I've tried to reformat your comment -- I think linebreaks got mangled. It assumes each input file contains no duplicates, right? I think
rm *.tmpis a little risky!