He Sells Shell Scripts to Intersect Sets

2007-08-18

A short article of mine promoting shell scripting has appeared in the excellent ACCU publication, Overload. Since Overload is available online, you can read the original version there. Alternatively, I’ve republished it here, and added a couple of important revisions based on reader comments, so just keep reading …

Introduction

The Unix command shell contains a lot of what I like in a programming environment: it’s dynamic, high-level, interpreted, flexible, succinct. It’s even reasonably portable now that bash seems to have become the shell of choice. Although there’s much about shell scripting I don’t like, on many occasions it turns out to be the best tool for the job.

In this article we shall demonstrate how simple shell scripts can be used to implement sets, providing one-line recipes for set creation, set union, set intersection and more. Having explored the power of the Unix shell we’ll consider its limitations, before finally discussing the more general lessons we can learn from the Unix tools.

An Example: Apache Server Logs

As an example, let’s suppose we want to analyse sets of IP addresses contained in a couple of Apache HTTP Server access logs, access_log1 and access_log2. Each log file contains many thousands of lines which look something like this:

65.214.44.29 - - [25/Jun/2007:00:03:21 +0000] ...
74.6.87.40 - - [25/Jun/2007:00:03:24 +0000] ...
65.214.44.29 - - [25/Jun/2007:00:03:24 +0000] ...
74.6.86.212 - - [25/Jun/2007:00:03:36 +0000] ...
...

We can cut this file down to leave just the IP address at the start of each line. Cut is a simple tool which we’ll be using again later, and here we pass it options -f1 to select the first field from each line and -d" " to use the space character as a field separator.

$ cut -f1 -d" " access_log1
65.214.44.29
74.6.87.40
65.214.44.29
74.6.86.212
...

Set Creation

The output from this command is likely to be full of duplicates. Regular site visitors typically hit the web server a few times; web spiders and robots are much more hungry. To obtain the sets of unique IP addresses contained in each log file, we could do this:

$ cut -f1 -d" " access_log1 | sort | uniq > IP1
$ cut -f1 -d" " access_log2 | sort | uniq > IP2

Here cut picks out the IP addresses, sort orders the results, uniq eliminates duplicates, and we’ve redirected the output into files IP1 and IP2. By the way, we could have eliminated a link from the pipeline using the -u option to sort. The Unix shell tools aren’t entirely orthogonal!

$ cut -f1 -d" " access_log1 | sort -u > IP1

The resulting sets are ordered — a set implementation which should be familiar to C++ programmers. The IP addresses will be lexicographically rather than numerically ordered, since we went with the sort defaults. This means that, for example, 122.152.128.10 appears before 58.167.213.128 because 1 alphabetically precedes 5. With a little more effort, we could probably persuade sort to yield a numeric ordering (no, sort -n isn’t good enough).
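
A quick check at the prompt confirms this behaviour (here printf feeds sort the two addresses, one per line):

$ printf "58.167.213.128\n122.152.128.10\n" | sort
122.152.128.10
58.167.213.128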

Multiset Creation

If instead we wanted a multiset (that is, a set in which elements may appear more than once), we could count the number of times items are repeated in the sorted output using the -c option to uniq.

$ cut -f1 -d" " access_log1 | sort | uniq -c
   8 12.153.20.132
   2 12.217.178.11
  14 12.30.66.226
   1 122.152.128.49
  ...

Here, each IP address is prefixed by the number of times it occurred in the log file, so our multiset contains 12.153.20.132 8 times, etc. This will be useful later when we come to intersection operations.

Set Union

Let’s assume we’ve followed the steps above and IP1 and IP2 contain the sets of IP addresses in the two access logs. Forming the union of these sets is simple.

$ sort -m IP1 IP2 | uniq > IP1_union_IP2

The -m merge option to sort is purely for efficiency and the result would be equally correct without it. Since the inputs are already sorted, we can just merge them together, line by line. For C++ users, it’s the difference between the std::sort and std::merge algorithms.

Set Intersection

Finding the intersection of IP1 and IP2 can be done in a similar fashion. We merge them together then use the -d option to uniq to pick out duplicates. Since the original sets contained no duplicates, the elements output by this command are those common to both inputs; the set intersection, that is.

$ sort -m IP1 IP2 | uniq -d > IP1_intersection_IP2

Brief though this command is, we needn’t type it all in. Exploiting its similarity to the previous command and using the magic of shell history, we just hit the up arrow key ↑ and edit the previous line.

A more succinct alternative would be to use comm, a specialised tool for selecting or rejecting lines common to two files.

$ comm -12 IP1 IP2 > IP1_intersection_IP2

Comm requires the input text files to be lexically sorted, and by default outputs three columns: lines only in the first file; lines only in the second file; and lines common to both files. By supplying the -12 option we choose to select just the third column, again generating the desired intersection.
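
A toy run makes the three columns concrete (the single-letter sets are invented for illustration; the second and third columns are tab-indented):

$ printf "a\nb\nc\n" > S1
$ printf "b\nc\nd\n" > S2
$ comm S1 S2
a
		b
		c
	d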

Set Symmetric Difference

We can tweak the first intersection recipe to find the set symmetric difference between IP1 and IP2 (that is, the IP addresses appearing in exactly one of IP1 and IP2). Again, the up arrow key ↑ recalls the command, and this time we use uniq’s -u option to suppress repeated elements.

$ sort -m IP1 IP2 | uniq -u > IP1_symmetric_diff_IP2

We could also have used comm -3 IP1 IP2 | tr -d "\t" (note the use of tr to delete unwanted tab characters from comm’s output).

Set Subtraction

What about the elements in IP1 but not IP2? Again, comm does the job. This time, we suppress columns 2 and 3.

$ comm -23 IP1 IP2 > IP1_subtract_IP2

Sets of Sets

Uniting a set of sets is easy. Sort handles as many files as you pass it.

$ sort -mu IP1 IP2 IP3 .... IPN > IP_unite_all

To intersect N sets we could iterate through them, maintaining their cumulative intersection so far by using pair-wise intersection at each step. An alternative approach does away with the explicit iteration by forming their multiset union, then extracting elements which appear N times. Here’s an example when N is 3.

$ sort -m IP1 IP2 IP3 | uniq -c | grep "^ *3" \
    | tr -s " " | cut -f3 -d" "

Let’s unpick this pipeline. First, sort -m IP1 IP2 IP3 | uniq -c generates the multiset of IP addresses in IP1, IP2 and IP3. Since IP1, IP2 and IP3 are sets and therefore individually contain no repeats, the resulting multiset looks something like this:

$ sort -m IP1 IP2 IP3 | uniq -c
   1 12.30.66.226
   3 122.152.128.10
   2 122.152.128.49
   1 122.152.129.54
   ...

Each line in the output starts with a count which must be either 1, 2 or 3. Lines starting with 3 correspond to IP addresses common to all three files — and these are the IP addresses which form the intersection of IP1, IP2 and IP3. Now we can use some standard pattern matching and extraction techniques to pick out the desired fields.

First grep picks out lines starting with any number of spaces followed by a 3. Next tr -s " " squeezes repeated spaces from each line, making the output suitable for use with cut using the space character as a field delimiter. Finally cut itself extracts the column we want (the one with the IP address).
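
To see why the final cut asks for field three, take the line for 122.152.128.10 from the sample output above and watch it pass through the last two stages. Even after tr -s " " squeezes each run of spaces to one, the line still begins with a single space, so the first space-delimited field is empty, the count is field two, and the IP address is field three:

$ sort -m IP1 IP2 IP3 | uniq -c | grep "^ *3"
   3 122.152.128.10
$ sort -m IP1 IP2 IP3 | uniq -c | grep "^ *3" | tr -s " "
 3 122.152.128.10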

This approach generalises to the following shell script.

intersect
#! /bin/sh
# Intersect a collection of lexicographically sorted input sets
sort -m "$@" | uniq -c | grep "^ *$# " | tr -s " " | cut -f3 -d" "

The rather cryptic looking "$@" and $# which appear in this script are special shell parameters: the first expands to the list of parameters passed to intersect (the quotes keep filenames containing spaces intact), the second to the number of these parameters. The script writes its results to stdout, ready for use in yet bigger shell scripts.
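
Assuming the script has been saved as intersect in the current directory and made executable with chmod +x intersect, usage looks like this:

$ ./intersect IP1 IP2 IP3 > IP_intersect_all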

If you call this script with no inputs, it appears to hang — that’s because sort, given no input files, processes stdin. This breaks intersect. We can fix the problem in a couple of ways.

We could add a conditional check that callers have supplied at least one file, printing usage information and returning an error code if not.

intersect
#! /bin/sh
if [ $# -eq 0 ]
then
    echo 1>&2 "Usage: $0 SET1 SET2..."
    exit 127
fi
sort -m "$@" | uniq -c | grep "^ *$# " | tr -s " " | cut -f3 -d" "

Alternatively, we could take the view that intersecting the empty set of sets is fine and should yield the empty set. We can avoid a conditional check by using the shell’s own version of the Null Object pattern.

sort -m /dev/null $@ | ....
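
Here /dev/null acts as an always-empty input set: it contributes no lines to the merge, but it stops sort falling back to stdin. The full script under this scheme might read:

intersect
#! /bin/sh
# Intersect zero or more lexicographically sorted input sets.
# /dev/null keeps sort off stdin when no sets are supplied.
sort -m /dev/null "$@" | uniq -c | grep "^ *$# " | tr -s " " | cut -f3 -d" "

With no arguments it now terminates immediately, yielding the empty set.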

More Set Operations

One of the nice things about set operations is there aren’t many of them. We’ve already covered the important ones, and these can easily be extended. Try to work out what set operations are going on in the command history shown below.

$ comm -13 S1 S2
$ comm -23 S1 S2
$ diff S1 S2
$ head -1 S1
$ sort -m S1 S2 S3 | uniq -c | grep -c "^ *3"
$ tail -1 S2
$ wc -l S1

As a hint, the answers in lexicographical order are:

  • are two sets disjoint?
  • are two sets the same?
  • how big is the intersection of three sets?
  • how many elements in a set?
  • is a subset of?
  • largest element of a set
  • smallest element of a set

Extending the Toolset

The command shell is a powerful, dynamic and extensible programming environment. Even these simple one-line scripts can be stored as functions which can be sourced when a new shell is started; you can add command-line help to them, you can find them using tab-completion, you can keep them in your source control system. In this way you can create your own customised shell working environment and port it from platform to platform just by checking it out.
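
As a sketch of the idea (the function name is my invention), the pairwise intersection recipe could be wrapped as a function in your shell startup file, making it available at every prompt:

# In ~/.bashrc: intersect two lexicographically sorted set files
intersect2() {
    sort -m "$1" "$2" | uniq -d
}

$ intersect2 IP1 IP2 > IP1_intersection_IP2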

A Script’s Got to Know its Limitations

Apache server logs are no more and no less than line oriented text. Each record in the log is terminated by a newline character, and each field within each record is delimited in an obvious way: by brackets, spaces, quotation marks, whatever — who needs XML? This is the kind of format shell scripts handle well. Conversely, anything more complicated, XML for example, or records which span multiple lines, is likely to push the shell tools too far. Maybe awk could cope, but I don’t think many people bother learning awk these days: it’s better to use one of the popular high-level languages when basic shell commands won’t do.

Shell scripts tend not to fail safely. For example, the following command is meant to clear out files in a temporary directory:

# Don't try this at home!
$ rm -rf $TEMP_WORK_DIR/*

You can imagine what happens if TEMP_WORK_DIR has not been set. In general, the Unix commands build on a couple of dangerous assumptions: that programmers know what they are doing; and that the show must go on — by which I mean that, given malformed input, a shell script will not throw an exception. The IP filters we discussed in this article work quite happily with any old text file as input — if it wasn’t an Apache HTTP Server access log, the only indication of failure may well be smaller sets than expected.
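
One standard defence (a sketch rather than a cure-all) is to make the expansion itself fail when the variable is missing, using the shell’s ${var:?} construct:

# Aborts with an error message if TEMP_WORK_DIR is unset or empty
$ rm -rf "${TEMP_WORK_DIR:?is not set}"/*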

I’ll admit that I personally avoid writing any shell scripts much longer than the ones shown here. As with Makefiles, I admire and respect the technology but I’d rather have someone else deal with the details. The bash manual may be brief to a fault, but I’ve yet to get to grips with its finer details. Sometimes it’s just too subtle.

On the subject of details, earlier in this article I said that by default sort uses lexicographical ordering, which isn’t perhaps the ordering we’d prefer for IP addresses; and I also said that a numeric sort -n wouldn’t do the job either: IP addresses aren’t really numbers, they’re dot-separated number quartets. You can use sort to place IP addresses in a more natural order, but the command you’ll need is anything but natural.

# "Natural" ordering of IP addresses
$ sort -t. -k1,1n -k2,2n -k3,3n -k4,4n IP

If you want to know how this works you’ll have to read the manual. The code, on its own, is unreadable. If you don’t know where the manual is, just open a shell window and type man. If the output from this command doesn’t help, try man man, and if you don’t know how to open a shell window, I’m surprised you’re even reading this sentence!

Conclusion

Modern graphical development environments tend to hide the shell and the command line, probably with good reason, and I don’t suppose this article will persuade anyone they’re worth hunting out. And yet the Unix shell embodies so much that is modern and, I suspect, future best practice.

For me, it’s not just what the shell tools can do, it’s the example they set. Look again at some of the recipes presented in this article and you’ll see container operations without explicit loops. You’ll see flexible and generic algorithms. You’ll see functional programming. You’ll see programs which can parallel-process data without a thread or a mutex in sight; no chance of shared memory corruption or race conditions here. The original design of the shell tools may have become somewhat polluted — we’ve already seen that sort does some of what uniq can do — but I think the intent shines through as clearly as ever: we have a compact suite of tools, each with its own responsibility, which cooperate using simple interfaces. We would do well to emulate this model in our own software designs.

Credits

I would like to thank the Overload editorial team for their help with this article. I would also like to thank David Jones and Don for their suggestions.
