Spam, Typo, Subversion Logs
This blog got hit by some spam promoting online gambling sites, even though I’d cranked up most of the built-in anti-spam settings. I went across to the Typo website looking for advice and discovered:
The Typo Trac is currently offline because of a high level of spam. It shall return as soon as we’ve got some more protection added to it.
Unfunnily enough, I experimented with an open Trac project a while back and it too received spam contributions cunningly hidden where a regular reader wouldn’t notice.
Examining the Typo Subversion Logs
Fortunately the Typo Subversion server remains up and running. I took a look at the log to see whether there had been any spam-related improvements between the revision I’d originally installed and the latest release (4.0.3 at the time of writing). There were a few hits.
    $ svn log -r1133:1231 svn://typosphere.org/typo | grep spam
    Add spam setting for Akismet key. I still need to write the Akismet glue code, but it won't work without a key.
    Big spam filtering upgrade. Comments (and trackbacks) that fail the spam check are marked as unpublished
    registering of spam/ham classification with akismet). JustPresumedHam
    of articles whose classification you have confirmed as well as a simple spam/ham
    Use published_at for comment spam checks. Closes #1089
A Closer Examination
This superficial inspection suggests that 4 of the 99 changes relate to spam, which would mean the Typo developers spent less than 5% of their effort on anti-spam changes in the period concerned.
A more useful statistic would be the number of files which were modified for anti-spam purposes. That number is rather harder to extract using simple shell programs such as grep, so I wrote a Python program to analyse the svn log output. I used the --xml option to the svn log command to get more structured output, together with --verbose, which lists the paths changed in each revision. The Python minidom XML module proved more than up to the task of parsing this output.
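To give a flavour of the approach before the full listing at the end of this post, here is a minimal sketch of reading svn log --xml output with minidom. The two log entries in SAMPLE_LOG are invented for illustration (they are not real Typo revisions), and the helper name log_messages is likewise hypothetical.

```python
from xml.dom.minidom import parseString

# Invented sample in the shape 'svn log --xml' produces; not real Typo history.
SAMPLE_LOG = """<?xml version="1.0"?>
<log>
<logentry revision="1">
<author>alice</author>
<msg>Tighten comment spam checks</msg>
</logentry>
<logentry revision="2">
<author>bob</author>
<msg>Fix sidebar layout</msg>
</logentry>
</log>
"""

def log_messages(xml_text):
    """Return a list of (revision, message) pairs, one per logentry."""
    doc = parseString(xml_text)
    pairs = []
    for entry in doc.getElementsByTagName("logentry"):
        msg = entry.getElementsByTagName("msg")[0]
        # Join the text nodes so that an empty message yields "".
        text = "".join(node.data for node in msg.childNodes)
        pairs.append((entry.getAttribute("revision"), text))
    return pairs

if __name__ == "__main__":
    for revision, message in log_messages(SAMPLE_LOG):
        print("r%s: %s" % (revision, message))
```

Once the messages are available as plain strings, matching them against a pattern is an ordinary regular expression search.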
Here’s what this program told me.
    $ svn log -r1133:1231 svn://typosphere.org/typo --xml --verbose | \
        process_svn_log.py spam akismet
    Found /spam|akismet/i in 9/99 changes affecting 72/270 files.
Note that I included Akismet in my pattern match. As I understand it, Akismet is a service specifically designed to protect blogs against spam.
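To put those counts in proportion, the arithmetic is simple enough to check in a couple of lines of Python. The numbers below are copied from the two results above; the helper name percentage is just for illustration.

```python
def percentage(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Counts taken from the grep run and from the process_svn_log.py run.
grep_changes = percentage(4, 99)      # changes counted from the grep output
pattern_changes = percentage(9, 99)   # changes matching /spam|akismet/i
pattern_files = percentage(72, 270)   # file changes within those matches

if __name__ == "__main__":
    print("grep: %.1f%% of changes" % grep_changes)
    print("pattern: %.1f%% of changes" % pattern_changes)
    print("pattern: %.1f%% of file changes" % pattern_files)
```

That works out at roughly 4% and 9% of changes, and about 27% of file changes.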
I could dig even deeper and find out how many lines of code were changed, but I don’t think it’s worth it. This is a pretty blunt tool, but it does tell us that some smart programmers are having to spend almost as much time fighting dumb spammers as they are writing more useful code.
The svn log processor
For the record, here’s my program. It’s best suited to the job it actually did but it’s simple enough that I’ll be able to adapt it for use elsewhere.
""" This program filters 'svn log --xml --verbose' output for log entries which match patterns. This output has the form: <?xml version="1.0"?> <log> <logentry revision="1133"> <author>scott</author> <date>2006-07-13T17:26:26.186291Z</date> <paths> <path action="M">/trunk/app/views/admin/feedback/list.rhtml</path> </paths> <msg>Make search+pagination work right</msg> </logentry> </log> """ def usage(program): print """\ Usage: %s PATTERN ... Searches the output from 'svn log --xml --verbose' for log entries whose message matches the supplied PATTERN(s) and yields summary statistics. Example: svn log -r1133:1231 svn://typosphere.org/typo --xml --verbose | %s spam""" % ( program, program) def elements(node, tagname): " Return named child elements of a DOM node. " return node.getElementsByTagName(tagname) def count_paths(logentries): " Count repository path changes logged. " return sum(1 for logentry in logentries for paths in elements(logentry, "paths") for path in elements(paths, "path")) def log_msg_matches(logentry, matcher): " Return true if the logentry message matches, false otherwise. " msgs = elements(logentry, "msg") assert len(msgs) == 1, "Require a single log message per log entry." return matcher(msgs[0].childNodes[0].data) is not None def process(log, patterns): " Process the input svn log, looking for messages matching the input patterns. " import re pattern = "|".join(patterns) matcher = re.compile(pattern, re.IGNORECASE).search entries = elements(log, "logentry") matches = [entry for entry in entries if log_msg_matches(entry, matcher)] paths = count_paths(entries) matching_paths = count_paths(matches) print "Found /%s/i in %d/%d changes affecting %d/%d files." % ( pattern, len(matches), len(entries), matching_paths, paths) def main(argv): if len(argv) == 1: usage(argv[0]) else: from xml.dom.minidom import parse process(parse(sys.stdin), argv[1:]) if __name__ == "__main__": import sys main(sys.argv)