A Python syntax highlighter
In a recent post I described my first ever Ruby program: a syntax highlighter for Python, ready to be used in a Typo web log. Since that post was already rather long, I decided to post the code itself separately. Here it is, then.
The Test Code
As you can see, currently only comments, single- and triple-quoted strings, keywords and identifiers are recognised. That’s really all I wanted for now. For completeness, I may well add support for numeric literals. Watch this space!
typo/vendor/syntax/test/syntax/tc_python.rb
require File.dirname(__FILE__) + "/tokenizer_testcase"

class TC_Syntax_Python < TokenizerTestCase
  syntax "python"

  def test_empty
    tokenize ""
    assert_no_next_token
  end

  def test_comment_eol
    tokenize "# a comment\nfoo"
    assert_next_token :comment, "# a comment"
    assert_next_token :normal, "\n"
    assert_next_token :ident, "foo"
  end

  def test_two_comments
    tokenize "# first comment\n# second comment"
    assert_next_token :comment, "# first comment"
    assert_next_token :normal, "\n"
    assert_next_token :comment, "# second comment"
  end

  def test_string
    tokenize "'' 'aa' r'raw' u'unicode' UR''"
    assert_next_token :string, "''"
    skip_token
    assert_next_token :string, "'aa'"
    skip_token
    assert_next_token :string, "r'raw'"
    skip_token
    assert_next_token :string, "u'unicode'"
    skip_token
    assert_next_token :string, "UR''"
    tokenize '"aa\"bb"'
    assert_next_token :string, '"aa\"bb"'
  end

  def test_triple_quoted_string
    tokenize "'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "'''\nfoo\n'''"
    tokenize '"""\nfoo\n"""'
    assert_next_token :triple_quoted_string, '"""\nfoo\n"""'
    tokenize "uR'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "uR'''\nfoo\n'''"
    tokenize %q{"""\\'a\\'"b"c"""}
    assert_next_token :triple_quoted_string, %q{"""\\'a\\'"b"c"""}
  end

  def test_keyword
    Syntax::Python::KEYWORDS.each do |word|
      tokenize word
      assert_next_token :keyword, word
    end
    Syntax::Python::KEYWORDS.each do |word|
      tokenize "x#{word}"
      assert_next_token :ident, "x#{word}"
      tokenize "#{word}x"
      assert_next_token :ident, "#{word}x"
    end
  end
end
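The `x#{word}` and `#{word}x` cases in the last test rely on the tokenizer consuming a maximal run of word characters before consulting the keyword list, so a keyword embedded in a longer identifier never matches. A quick standalone sketch of that idea with the standard library's StringScanner (the `classify` helper and its shortened keyword list are invented for illustration only):

```ruby
require 'strscan'

# Maximal munch: scan the whole \w+ run first, then classify it.
def classify(input)
  scanner = StringScanner.new(input)
  word = scanner.scan(/\w+/)
  %w{for if while}.include?(word) ? :keyword : :ident
end

classify("for")   # the bare keyword is recognised as :keyword
classify("forx")  # "forx" scans as one identifier, not "for" + "x"
```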
The Python Tokenizer
typo/vendor/syntax/python.rb
require 'syntax'

module Syntax

  # A basic tokenizer for the Python language. It recognises
  # comments, keywords and strings.
  class Python < Tokenizer

    # The list of all identifiers recognized as keywords.
    # http://docs.python.org/ref/keywords.html
    # Strictly speaking, "as" isn't yet a keyword -- but for syntax
    # highlighting, we'll treat it as such.
    KEYWORDS =
      %w{as and del for is raise assert elif from lambda return break
         else global not try class except if or while continue exec
         import pass yield def finally in print}

    # Step through a single iteration of the tokenization process.
    def step
      if scan(/#.*$/)
        start_group :comment, matched
      elsif scan(/u?r?'''.*?'''|""".*?"""/im)
        start_group :triple_quoted_string, matched
      elsif scan(/u?r?'([^\\']|\\.)*'/i)
        start_group :string, matched
      elsif scan(/u?r?"([^\\"]|\\.)*"/i)
        start_group :string, matched
      elsif check(/[_a-zA-Z]/)
        word = scan(/\w+/)
        if KEYWORDS.include?(word)
          start_group :keyword, word
        else
          start_group :ident, word
        end
      else
        start_group :normal, scan(/./m)
      end
    end

  end

  SYNTAX["python"] = Python

end
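If you want to try the scanning rules outside of the Syntax framework, here is a self-contained sketch that drives the same regular expressions with the standard library's StringScanner. The `python_tokens` name is invented for this example; the token symbols mirror the groups used above:

```ruby
require 'strscan'

# Keyword list copied from the tokenizer above.
KEYWORDS =
  %w{as and del for is raise assert elif from lambda return break
     else global not try class except if or while continue exec
     import pass yield def finally in print}

# Run the same patterns as Python#step and collect [group, text] pairs.
def python_tokens(code)
  s = StringScanner.new(code)
  tokens = []
  until s.eos?
    if s.scan(/#.*$/)
      tokens << [:comment, s.matched]
    elsif s.scan(/u?r?'''.*?'''|""".*?"""/im)
      tokens << [:triple_quoted_string, s.matched]
    elsif s.scan(/u?r?'([^\\']|\\.)*'/i)
      tokens << [:string, s.matched]
    elsif s.scan(/u?r?"([^\\"]|\\.)*"/i)
      tokens << [:string, s.matched]
    elsif s.check(/[_a-zA-Z]/)
      word = s.scan(/\w+/)
      tokens << [KEYWORDS.include?(word) ? :keyword : :ident, word]
    else
      tokens << [:normal, s.scan(/./m)]
    end
  end
  tokens
end

p python_tokens("def greet():  # say hello")
# => [[:keyword, "def"], [:normal, " "], [:ident, "greet"], ...]
```

Each character that belongs to no other group falls through to the final branch and becomes a one-character :normal token, just as in the real tokenizer.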