Spam, Typo, Subversion Logs

2006-11-27

This blog got hit by some spam promoting online gambling sites, even though I’d cranked up most of the built-in anti-spam settings. I went across to the Typo website looking for advice and discovered:

The Typo Trac is currently offline because of a high level of spam. It shall return as soon as we’ve got some more protection added to it.

Unfunnily enough, I experimented with an open Trac project a while back and it too received spam contributions cunningly hidden where a regular reader wouldn’t notice.

Examining the Typo Subversion Logs

Fortunately the Typo Subversion server remains up and running. I took a look at the logs to see whether there had been any spam-related improvements between the revision I’d originally installed and the latest release (4.0.3 at the time of writing). There were a few hits.

$ svn log -r1133:1231 svn:// | grep spam
Add spam setting for Akismet key.  I still need to write the Akismet glue code, but it won't work without a key.
Big spam filtering upgrade.
Comments (and trackbacks) that fail the spam check are marked as unpublished
registering of spam/ham classification with akismet). JustPresumedHam
of articles whose classification you have confirmed as well as a simple spam/ham
Use published_at for comment spam checks.  Closes #1089

A Closer Examination

This superficial inspection suggests that, out of 99 changes, 4 relate to spam, which would mean the Typo developers spent less than 5% of their effort making anti-spam changes in the period concerned.

A more useful statistic would be the number of files which were modified for anti-spam purposes. It’s rather harder to extract this number using simple shell programs such as grep, so I wrote a Python program to analyse the svn log output. I used the --xml option to the svn log command to get more structured output, and the Python minidom XML module proved more than up to the task of parsing it.
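
As a rough illustration of that approach, the sketch below parses a tiny hand-written sample shaped like 'svn log --xml --verbose' output and pulls out each revision's message and changed paths. The revision numbers, paths, and second log message are invented for illustration and do not come from the Typo repository; `summarise` is a made-up helper name.

```python
from xml.dom.minidom import parseString

# Hypothetical sample in the shape of 'svn log --xml --verbose' output.
SAMPLE = """<?xml version="1.0"?>
<log>
<logentry revision="1">
<paths>
<path action="M">/trunk/app/models/comment.rb</path>
<path action="A">/trunk/lib/spam_check.rb</path>
</paths>
<msg>Big spam filtering upgrade.</msg>
</logentry>
<logentry revision="2">
<paths>
<path action="M">/trunk/README</path>
</paths>
<msg>Fix a typo in the README</msg>
</logentry>
</log>
"""

def summarise(xml_text):
    """Return a list of (revision, message, [paths]) tuples."""
    dom = parseString(xml_text)
    summary = []
    for entry in dom.getElementsByTagName("logentry"):
        msg = entry.getElementsByTagName("msg")[0]
        # An empty <msg/> element has no text node to read.
        text = msg.firstChild.data if msg.firstChild else ""
        paths = [p.firstChild.data
                 for p in entry.getElementsByTagName("path")]
        summary.append((entry.getAttribute("revision"), text, paths))
    return summary

for revision, message, paths in summarise(SAMPLE):
    print("r%s: %s (%d paths)" % (revision, message, len(paths)))
```

The structured output keeps each message together with the paths it touched, which is the association that a line-oriented tool like grep loses.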

Here’s what this program told me.

$ svn log -r1133:1231 svn:// --xml --verbose | \ spam akismet
Found /spam|akismet/i in 9/99 changes affecting 72/270 files.

Note that I included Akismet in my pattern match. As I understand it, Akismet is a service specifically designed to protect blogs against spam.
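
To put the counts in proportion, here is the arithmetic as a small sketch. The figures are the ones reported in this post; `share` is just a helper name made up for the example.

```python
def share(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# grep on the plain log: 4 of 99 changes mention spam.
print("grep:   %.1f%% of changes" % share(4, 99))
# The script, matching spam or akismet: 9 of 99 changes.
print("script: %.1f%% of changes" % share(9, 99))
# Those 9 changes account for 72 of the 270 logged path changes.
print("script: %.1f%% of path changes" % share(72, 270))
```

This works out to roughly 4%, 9%, and 27% respectively.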

I could dig even deeper and find out how many lines of code were changed, but I don’t think it’s worth it. This is a pretty blunt tool, but it does tell us that some smart programmers are having to spend almost as much time fighting dumb spammers as they are writing more useful code.

The svn log processor

For the record, here’s my program. It’s best suited to the job it actually did but it’s simple enough that I’ll be able to adapt it for use elsewhere.
""" This program filters 'svn log --xml --verbose' output
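    for log entries which match patterns.

This output has the form:
<?xml version="1.0"?>
<log>
<logentry revision="...">
<paths>
<path action="...">...</path>
</paths>
<msg>Make search+pagination work right</msg>
</logentry>
...
</log>
"""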

def usage(program):
    # Parenthesised so the call works as a print statement or a print function.
    print("""\
Usage: %s PATTERN ...
Searches the output from 'svn log --xml --verbose' for log entries whose
message matches the supplied PATTERN(s) and yields summary statistics.
svn log -r1133:1231 svn:// --xml --verbose | %s spam""" % (
    program, program))

def elements(node, tagname):
    " Return named descendant elements of a DOM node. "
    return node.getElementsByTagName(tagname)

def count_paths(logentries):
    " Count repository path changes logged. "
    return sum(1
               for logentry in logentries
               for paths in elements(logentry, "paths")
               for path in elements(paths, "path"))

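def log_msg_matches(logentry, matcher):
    " Return true if the logentry message matches, false otherwise. "
    msgs = elements(logentry, "msg")
    assert len(msgs) == 1, "Require a single log message per log entry."
    # An empty log message has no text node, so treat it as an empty string.
    text_node = msgs[0].firstChild
    text = text_node.data if text_node is not None else ""
    return matcher(text) is not None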

def process(log, patterns):
    " Process the input svn log, looking for messages matching the input patterns. "
    import re
    pattern = "|".join(patterns)
    matcher = re.compile(pattern, re.IGNORECASE).search
    entries = elements(log, "logentry")
    matches = [entry for entry in entries
               if log_msg_matches(entry, matcher)]
    paths = count_paths(entries)
    matching_paths = count_paths(matches)

    print("Found /%s/i in %d/%d changes affecting %d/%d files." % (
        pattern, len(matches), len(entries), matching_paths, paths))

def main(argv):
    if len(argv) == 1:
        # No patterns supplied: explain how to run the program.
        usage(argv[0])
    else:
        from xml.dom.minidom import parse
        process(parse(sys.stdin), argv[1:])

if __name__ == "__main__":
    import sys
    main(sys.argv)
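
One possible adaptation: count_paths counts every path entry in the log, so a file touched in several revisions is counted once per revision. The sketch below counts distinct paths instead. It is an assumption about what a follow-up might look like rather than part of the program above; `count_distinct_paths` is a hypothetical name and the sample log is made up.

```python
from xml.dom.minidom import parseString

def count_distinct_paths(logentries):
    """Count distinct repository paths across the given log entries."""
    distinct = set()
    for logentry in logentries:
        for path in logentry.getElementsByTagName("path"):
            if path.firstChild is not None:
                distinct.add(path.firstChild.data.strip())
    return len(distinct)

# Invented sample: /trunk/a.rb is modified in both revisions.
SAMPLE = """<log>
<logentry revision="1"><paths>
<path action="M">/trunk/a.rb</path>
<path action="M">/trunk/b.rb</path>
</paths><msg>first</msg></logentry>
<logentry revision="2"><paths>
<path action="M">/trunk/a.rb</path>
</paths><msg>second</msg></logentry>
</log>"""

entries = parseString(SAMPLE).getElementsByTagName("logentry")
print(count_distinct_paths(entries))  # 2 distinct paths from 3 path entries
```

Which count is more meaningful depends on the question: path entries measure how much churn the changes caused, while distinct paths measure how much of the codebase they touched.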