Blackmail made easy using Python counters

2009-07-27, , Comments

The Obsessive Blackmailer

An obsessive blackmailer writes anonymous messages by by cut-and-pasting letters from newspapers. Being obsessive, the blackmailer only writes messages which can be composed entirely from a single newspaper.

word aligned

Devise an algorithm which determines whether a given message can be written using a given newspaper.

Modeling the Problem

This is a nice little problem but I’m about to spoil it since I’m using it here as a study in Python’s evolution. So if you’d like to try it yourself, look away now.

We can represent both inputs to the algorithm as sequences of characters: a message string, length M, and a newspaper string, length N. We could process the message string one character at a time, at each step scanning through the newspaper and noting the first occurrence of that character we haven’t used before; but this is inefficient since we potentially read the whole paper M times.

It’s better to think of this problem in terms of multisets, sometimes known as bags. A multiset is a set which can have repeated elements. Our blackmailer can proceed if the multiset of letters used in the message is contained entirely within the multiset of letters used in the newspaper.

The evolution of multisets in Python

A dictionary provides a compact and efficient way to represent a multiset in Python: each dictionary key represents an item in the multiset, and the value associated with that key is the number of times the key appears in the multiset. Python dictionaries are implemented as hashed arrays, meaning that member insertion and access take constant time, on average.

It’s not hard to create such a multiset from a sequence but it’s interesting to see how advances in the Python language have simplified the code.

The complete documentation for Python 1.4, released in 1996, is still available on the Python website. In version 1.4 you could write:

def multiset_14(xs):
    multiset = {}
    for x in xs:
        if multiset.has_key(x):
            multiset[x] = multiset[x] + 1
        else:
            multiset[x] = 1
    return multiset

This code works unchanged in the current Python release, 2.6 (though note dict.has_key() doesn’t exist in Python 3.*). Alternatively, you might catch the KeyError raised when trying to access the dict with an invalid key:

def multiset_14(xs):
    multiset = {}        
    for x in xs:
        try:
            multiset[x] = multiset[x] + 1
        except KeyError:
            multiset[x] = 1
    return multiset

Python 1.5 introduces an exception-free dictionary access method, dict.get(), which returns a user supplied default (defaulting to None) for missing keys.

def multiset_15(xs):
    multiset = {}        
    for x in xs:
        multiset[x] = multiset.get(x, 0) + 1
    return multiset

It’s certainly shorter, a little cleaner maybe, but perhaps it takes more effort for readers to see what exactly is going on.

At Python 2.2, x in multiset improves on the equivalent multiset.has_key(x) and we can use augmented arithmetic operators (+=, -=, *=, /=, %=, **=, <<=, >>=, &=, =, |=), allowing:

def multiset_22(xs):
    multiset = {}
    for x in xs:
        if x in multiset:
            multiset[x] += 1
        else:
            multiset[x] = 1
    return multiset

I think I prefer the dict.get() version, though.

The collections module makes its first appearance in Python 2.4 offering a deque and a promise of more high performance container types to come. Python 2.5 makes good on this promise, adding defaultdict to the module. A defaultdict is a specialised dictionary which calls a client supplied factory function for missing keys. Setting this factory function to int turns the defaultdict into a multiset. No need for dict.get() any more.

from collections import defaultdict

def multiset_25(xs):
    multiset = defaultdict(int)
    for x in xs:
        multiset[x] += 1
    return multiset

Wait, there’s more!

The final improvement is available in Python 3.1 right now (or in Python 2.7, coming soon), courtesy once again of the collections module. Collections.Counter is exactly what we’ve been waiting for.

from collections import Counter

def multiset_31(xs):
    return Counter(xs)

Back to Blackmail

So our blackmailer should first generate a multiset representation of the letters in the message. Then it’s a matter of iterating through the newspaper and reducing the multiset each time a letter matches up. We keep a tally of the number of letters we still need to match, and stop when this tally is zero or when we get to the end of the newspaper. Here’s a sketch of an implementation.

def blackmailable(message, newspaper):
    """Return True if newspaper can be used to write the blackmail 
    message, False otherwise.
    """
    m = len(message)
    if m == 0:
        return True
    counts = multiset(message)
    for ch in newspaper:
        if counts[ch] > 0:
            counts[ch] -= 1
            m -= 1
            if m == 0:
                return True
    return False

This code assumes the multiset is represented as a Counter or a defaultdict, since it depends on counts[ch] returning 0 for any character not in the message. If we’d used a plain dict, we’d need to employ dict.get(ch, 0).

I’m not entirely happy with the code shown. It’s what I first came up with. Here’s an alternative, which I also find a bit clunky. I’d welcome any improvements. It’s also worth noting that the algorithm locates the matching characters in the newspaper, so we might want to cache some indices for later use.

def blackmailable(message, newspaper):
    """Return True if newspaper can be used to write the blackmail 
    message, False otherwise.
    """
    counts = multiset(message)
    m = len(message)
    n = len(newspaper)
    i = 0
    while m != 0 and i != n:
        ch = newspaper[i]
        if counts[ch] > 0:
            counts[ch] -= 1
            m -= 1
        i += 1
    return m == 0

We can avoid the ugly code by persuading the obssessive blackmailer to generate and maintain multiset representations of the entire newspaper library. Then blackmailable() can be implemented as multiset containment, something which the Counter class handles nicely using the subtraction operator. Note here that multiset subtraction never results in any negative counts, even though a Counter instance could itself have negative counts.

>>> from collections import Counter
>>> missing_letters = Counter(message) - Counter(newspaper)
>>> blackmailable = len(missing_letters) == 0

Alternatively:

>>> blackmailable = not missing_letters

Generic Code

Suppose the blackmailer prefers to compose a message from words, rather than letters? (For an example, see the threat to stay away from Grimpen Moor delivered to Sir Henry Baskerville discussed later in this article.) The code works as is — just pass in message and newspaper as word sequences, rather than character sequences. Anything we can hash can be counted.

End of Message

In the age of the interweb anonymous cowardice is far easier and blackmailers don’t need to resort to manual cut and paste techniques unless they’re after a retro threatening effect.

Never Mind the Bollocks

Sherlock Holmes

What’s more, a detective can figure out plenty from these messages: so when Sir Henry Baskerville receives a threatening letter during his stay at the Northumberland Hotel, he shows it promptly to Sherlock Holmes:

Across the middle of it a single sentence had been formed by the expedient of pasting printed words upon it. It ran: “As you value your life or your reason keep away from the moor.” The word “moor” only was printed in ink.

In a virtuso display of deductive reasoning, Holmes shows the author of the message was in a hurry, afraid of being interrupted, and working in a hotel room using nail-scissors. (He also deduces something else, which he does not reveal at the time.) Identifying the source of the words to be yesterday’s Times leader is elementary.

The detection of types is one of the most elementary branches of knowledge to the special expert in crime, though I confess that once when I was very young I confused the Leeds Mercury with the Western Morning News. But a Times leader is entirely distinctive, and these words could have been taken from nothing else.

— Sherlock Holmes, The Hound of the Baskervilles

Can anyone identify the newspaper I used to create the image at the start of this article?


My thanks to jay for a correction to the original version of this article.