A Python syntax highlighter

2006-07-29

In a recent post I described my first ever Ruby program: a syntax highlighter for Python, written in Ruby and ready to be used in a Typo web log. Since that post was rather long, I decided to post the code itself separately. Here it is, then.

The Test Code

As you can see, currently only comments, single- and triple-quoted strings, keywords and identifiers are recognised. That’s really all I wanted, for now. For completeness, I may well add support for numeric literals. Watch this space!
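A rule for numeric literals might look something like the sketch below. It is not part of the tokenizer in this post: the PYTHON_NUMBER pattern and the scan_number helper are made-up names for illustration, and the pattern assumes Python 2 literal forms (hex, long suffix, floats, imaginary numbers) without claiming to cover every corner of the grammar.

```ruby
require 'strscan'

# Hypothetical pattern for Python 2 numeric literals (an assumption for
# illustration, not taken from the tokenizer in this post).
PYTHON_NUMBER = /
    0x[0-9a-f]+l?                        # hexadecimal, optional long suffix
  | (?:\d+\.\d*|\.\d+)(?:e[+-]?\d+)?j?   # float with a decimal point
  | \d+e[+-]?\d+j?                       # float with an exponent only
  | \d+[lj]?                             # plain integer, long or imaginary
/xi

# Hypothetical helper: returns the numeric literal at the start of +text+,
# or nil if the text does not begin with one.
def scan_number(text)
  StringScanner.new(text).scan(PYTHON_NUMBER)
end

p scan_number("0x1F + 2")  # → "0x1F"
p scan_number("3.14")      # → "3.14"
p scan_number("foo")       # → nil
```

In a tokenizer of this kind, such a rule would presumably sit after the identifier rule, since an identifier like x1 is consumed whole by \w+ before the digits are ever looked at on their own.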

typo/vendor/syntax/test/syntax/tc_python.rb
require File.dirname(__FILE__) + "/tokenizer_testcase"

class TC_Syntax_Python < TokenizerTestCase

  syntax "python"

  def test_empty
    tokenize ""
    assert_no_next_token
  end
  def test_comment_eol
    tokenize "# a comment\nfoo"
    assert_next_token :comment, "# a comment"
    assert_next_token :normal, "\n"
    assert_next_token :ident, "foo"
  end
  def test_two_comments
    tokenize "# first comment\n# second comment"
    assert_next_token :comment, "# first comment"
    assert_next_token :normal, "\n"
    assert_next_token :comment, "# second comment"
  end
  def test_string
    tokenize "'' 'aa' r'raw' u'unicode' UR''"
    assert_next_token :string, "''"
    skip_token
    assert_next_token :string, "'aa'"
    skip_token
    assert_next_token :string, "r'raw'"
    skip_token
    assert_next_token :string, "u'unicode'"
    skip_token
    assert_next_token :string, "UR''"
    tokenize '"aa\"bb"'
    assert_next_token :string, '"aa\"bb"'
  end
  def test_triple_quoted_string
    tokenize "'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "'''\nfoo\n'''"
    tokenize '"""\nfoo\n"""'
    assert_next_token :triple_quoted_string, '"""\nfoo\n"""'
    tokenize "uR'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "uR'''\nfoo\n'''"
    tokenize '"""\'a\'"b"c"""'
    assert_next_token :triple_quoted_string, '"""\'a\'"b"c"""'
  end
  def test_keyword
    Syntax::Python::KEYWORDS.each do |word|
      tokenize word
      assert_next_token :keyword, word
    end
    Syntax::Python::KEYWORDS.each do |word|
      tokenize "x#{word}"
      assert_next_token :ident, "x#{word}"
      tokenize "#{word}x"
      assert_next_token :ident, "#{word}x"
    end
  end
end

The Python Tokenizer

typo/vendor/syntax/python.rb
require 'syntax'

module Syntax

  # A basic tokenizer for the Python language. It recognises
  # comments, keywords and strings.
  class Python < Tokenizer
    # The list of all identifiers recognized as keywords.
    # http://docs.python.org/ref/keywords.html
    # Strictly speaking, "as" isn't yet a keyword -- but for syntax
    # highlighting, we'll treat it as such.
    KEYWORDS =
      %w{as and del for is raise assert elif from lambda return break
         else global not try class except if or while continue exec
         import pass yield def finally in print}
    # Step through a single iteration of the tokenization process.
    def step
      if scan(/#.*$/)
        start_group :comment, matched
      # The alternation is grouped so the u/r prefix applies to both quote styles.
      elsif scan(/u?r?(?:'''.*?'''|""".*?""")/im)
        start_group :triple_quoted_string, matched
      elsif scan(/u?r?'([^\\']|\\.)*'/i)
        start_group :string, matched
      elsif scan(/u?r?"([^\\"]|\\.)*"/i)
        start_group :string, matched
      elsif check(/[_a-zA-Z]/)
        word = scan(/\w+/)
        if KEYWORDS.include?(word)
          start_group :keyword, word
        else
          start_group :ident, word
        end
      else
        start_group :normal, scan(/./m)
      end
    end
  end
  SYNTAX["python"] = Python

end
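
To try the rules outside Typo, here is a minimal standalone sketch built on Ruby's StringScanner. It applies the same patterns in the same order as step, with the triple-quote alternation grouped so that the u/r prefix applies to both quote styles. The names python_tokens and PYTHON_KEYWORDS are made up for illustration. The merging of adjacent tokens of the same type is an assumption about how start_group accumulates text, not something taken from the library's source.

```ruby
require 'strscan'

# Hypothetical standalone copy of the keyword list, so that this sketch
# does not depend on the syntax library.
PYTHON_KEYWORDS =
  %w{as and del for is raise assert elif from lambda return break
     else global not try class except if or while continue exec
     import pass yield def finally in print}

# Hypothetical helper: returns an array of [type, text] pairs.
def python_tokens(source)
  scanner = StringScanner.new(source)
  tokens = []

  # Assumption: adjacent tokens of the same type are merged into one.
  emit = lambda do |type, text|
    if tokens.last && tokens.last[0] == type
      tokens.last[1] << text
    else
      tokens << [type, text.dup]
    end
  end

  until scanner.eos?
    if scanner.scan(/#.*$/)
      emit.call(:comment, scanner.matched)
    elsif scanner.scan(/u?r?(?:'''.*?'''|""".*?""")/im)
      emit.call(:triple_quoted_string, scanner.matched)
    elsif scanner.scan(/u?r?'([^\\']|\\.)*'/i)
      emit.call(:string, scanner.matched)
    elsif scanner.scan(/u?r?"([^\\"]|\\.)*"/i)
      emit.call(:string, scanner.matched)
    elsif scanner.check(/[_a-zA-Z]/)
      word = scanner.scan(/\w+/)
      emit.call(PYTHON_KEYWORDS.include?(word) ? :keyword : :ident, word)
    else
      emit.call(:normal, scanner.scan(/./m))
    end
  end
  tokens
end

p python_tokens("x = r'raw' # note")
# → [[:ident, "x"], [:normal, " = "], [:string, "r'raw'"], [:normal, " "], [:comment, "# note"]]
```

The order of the branches matters. The string rules run before the identifier rule, so the r in r'raw' is taken as a string prefix, not as an identifier. The triple-quoted rule runs before the single-quoted ones, so ''' is not read as an empty string followed by a stray quote.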