A Python syntax highlighter
In a recent post I described my first ever Ruby program: a syntax highlighter for Python, written in Ruby and ready to be used in a Typo web log. Since that post was rather long, I decided to publish the code itself separately. Here it is, then.
The Test Code
As you can see, currently only comments, single- and triple-quoted strings, keywords and identifiers are recognised. That's really all I wanted for now. For completeness, I may well add support for numeric literals later. Watch this space!
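If I do, a rule along these lines might do the job. This is only a rough sketch: the pattern, its name, and where it would slot into the tokenizer's `step` method are all assumptions on my part, and it ignores corner cases such as long-integer suffixes and octal literals.

```ruby
# Hypothetical pattern for Python numeric literals (not part of the
# tokenizer yet): hex integers, floats with optional exponents, and
# plain/imaginary integers. It could sit in step as
#   elsif scan(NUMBER) then start_group :number, matched
# just before the identifier rule.
NUMBER = /0[xX]\h+|\d+\.\d*(?:[eE][+-]?\d+)?|\.\d+|\d+(?:[eE][+-]?\d+)?[jJ]?/

samples = %w{42 3.14 0xFF 1e-3 2j}
results = samples.map { |s| s.match?(/\A(?:#{NUMBER})\z/) }
```

Each of the sample literals above matches the pattern in full; anything the pattern does not cover would simply fall through to the existing rules.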
typo/vendor/syntax/test/syntax/tc_python.rb
require File.dirname(__FILE__) + "/tokenizer_testcase"

class TC_Syntax_Python < TokenizerTestCase
  syntax "python"

  def test_empty
    tokenize ""
    assert_no_next_token
  end

  def test_comment_eol
    tokenize "# a comment\nfoo"
    assert_next_token :comment, "# a comment"
    assert_next_token :normal, "\n"
    assert_next_token :ident, "foo"
  end

  def test_two_comments
    tokenize "# first comment\n# second comment"
    assert_next_token :comment, "# first comment"
    assert_next_token :normal, "\n"
    assert_next_token :comment, "# second comment"
  end

  def test_string
    tokenize "'' 'aa' r'raw' u'unicode' UR''"
    assert_next_token :string, "''"
    skip_token
    assert_next_token :string, "'aa'"
    skip_token
    assert_next_token :string, "r'raw'"
    skip_token
    assert_next_token :string, "u'unicode'"
    skip_token
    assert_next_token :string, "UR''"

    tokenize '"aa\"bb"'
    assert_next_token :string, '"aa\"bb"'
  end

  def test_triple_quoted_string
    tokenize "'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "'''\nfoo\n'''"

    tokenize '"""\nfoo\n"""'
    assert_next_token :triple_quoted_string, '"""\nfoo\n"""'

    tokenize "uR'''\nfoo\n'''"
    assert_next_token :triple_quoted_string, "uR'''\nfoo\n'''"

    tokenize %q{"""\'a\'"b"c"""}
    assert_next_token :triple_quoted_string, %q{"""\'a\'"b"c"""}
  end

  def test_keyword
    Syntax::Python::KEYWORDS.each do |word|
      tokenize word
      assert_next_token :keyword, word
    end

    Syntax::Python::KEYWORDS.each do |word|
      tokenize "x#{word}"
      assert_next_token :ident, "x#{word}"
      tokenize "#{word}x"
      assert_next_token :ident, "#{word}x"
    end
  end
end
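For readers wondering where `tokenize`, `assert_next_token` and friends come from: they are helpers provided by the `TokenizerTestCase` base class that ships alongside the Syntax library. A stripped-down re-imagining of those helpers, purely as a guess at their behaviour rather than their actual implementation, might look like this:

```ruby
# Toy stand-in for the TokenizerTestCase helpers. This is an assumption
# about how the real base class behaves, not its actual code.
class ToyTokenizerCase
  # Run a tokenizer (here, any callable) and queue up its
  # [group, text] pairs for the assertions below.
  def tokenize(text, &tokenizer)
    @tokens = tokenizer.call(text)
  end

  # Pop the next token and check both its group and its text.
  def assert_next_token(group, text)
    tok = @tokens.shift
    raise "expected #{[group, text].inspect}, got #{tok.inspect}" unless tok == [group, text]
  end

  # Discard one token (used for the spaces between test strings).
  def skip_token
    @tokens.shift
  end

  # Verify the tokenizer produced nothing further.
  def assert_no_next_token
    raise "unexpected token #{@tokens.first.inspect}" unless @tokens.empty?
  end
end
```

With a trivial one-token tokenizer, `tokenize("foo") { |t| [[:ident, t]] }` followed by `assert_next_token :ident, "foo"` passes silently, and a mismatch raises with both the expected and actual token in the message.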
The Python Tokenizer
typo/vendor/syntax/python.rb
require 'syntax'

module Syntax

  # A basic tokenizer for the Python language. It recognises
  # comments, keywords and strings.
  class Python < Tokenizer

    # The list of all identifiers recognised as keywords.
    # http://docs.python.org/ref/keywords.html
    # Strictly speaking, "as" isn't yet a keyword -- but for syntax
    # highlighting, we'll treat it as such.
    KEYWORDS =
      %w{as and del for is raise assert elif from lambda return
         break else global not try class except if or while
         continue exec import pass yield def finally in print}

    # Step through a single iteration of the tokenization process.
    def step
      if scan(/#.*$/)
        start_group :comment, matched
      elsif scan(/u?r?('''.*?'''|""".*?""")/im)
        start_group :triple_quoted_string, matched
      elsif scan(/u?r?'([^\\']|\\.)*'/i)
        start_group :string, matched
      elsif scan(/u?r?"([^\\"]|\\.)*"/i)
        start_group :string, matched
      elsif check(/[_a-zA-Z]/)
        word = scan(/\w+/)
        if KEYWORDS.include?(word)
          start_group :keyword, word
        else
          start_group :ident, word
        end
      else
        start_group :normal, scan(/./m)
      end
    end

  end

  SYNTAX["python"] = Python

end
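The `scan`, `check`, `matched` and `start_group` calls above come from the Syntax gem's `Tokenizer` base class, which is essentially a thin layer over Ruby's standard `StringScanner`. To see the rule ordering in action without installing the gem, here is a self-contained approximation. The method name is mine, the keyword list is deliberately truncated to three words, and tokens are returned as plain arrays rather than grouped the way the gem does it.

```ruby
require 'strscan'

# Self-contained approximation of the step method's rule order,
# built directly on StringScanner. Not the gem's actual code.
def scan_tokens(src)
  s = StringScanner.new(src)
  tokens = []
  until s.eos?
    if s.scan(/#.*$/)                                     # comments first
      tokens << [:comment, s.matched]
    elsif s.scan(/u?r?('''.*?'''|""".*?""")/im)           # then triple-quoted strings
      tokens << [:triple_quoted_string, s.matched]
    elsif s.scan(/u?r?'([^\\']|\\.)*'|u?r?"([^\\"]|\\.)*"/i)
      tokens << [:string, s.matched]
    elsif s.scan(/[_a-zA-Z]\w*/)                          # identifiers vs keywords
      word = s.matched
      tokens << [%w{def return if}.include?(word) ? :keyword : :ident, word]
    else
      tokens << [:normal, s.scan(/./m)]                   # everything else, char by char
    end
  end
  tokens
end

toks = scan_tokens("def f():\n    return '''doc''' # note")
```

Note how the comment rule never fires inside `'''doc'''`: the triple-quoted string is consumed as a single token before the scanner ever reaches the `#`.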