My (Test) First Ruby Program

2006-07-19

One of my reasons for starting this blog was to find out more about web application frameworks based on dynamic languages in general, and about Ruby on Rails in particular. The only problem being, I’d never actually written any Ruby before.

Now, back when I started out as a programmer I never took a huge interest in learning computer languages — I just figured out what existing code was doing then fiddled around with it until it seemed to do what I wanted. Some of the time I got away with it.

These days I’m more interested in computer languages, but I still think that reading and tweaking existing code is a good way to learn. Ruby, being a dynamic, interpreted language, is perfect for such experimentation. The Ruby on Rails framework turns out to be equally dynamic; by running the development environment, I could see my code changes instantly reflected in my Typo application. Even better, the exact same code that I tested at home on my Windows machine could be deployed on my live shared UNIX server. Best of all, I soon discovered the test framework for the module I needed to alter. By developing the tests and code in parallel, I deployed my first ever Ruby code with reasonable confidence that it worked.

The Requirement

I wanted to be able to post code snippets to this blog, and I wanted the code to be nicely syntax-highlighted. Digging through the Typo admin pages revealed that this was already supported for Ruby (of course!), XML and YAML. Furthermore, the syntax highlighting scheme was open to extension, which was good, since I intended to highlight Python and C++ snippets, and possibly others too. All you had to do was extend Syntax::Tokenizer, implementing the #step method.

A few minutes of googling didn’t turn up any existing solutions to this particular problem, so I decided to have a crack at it myself.

Emacs ruby mode

Before I could even contemplate working with Ruby code, I needed to get my editor to recognise it. This was straightforward.

Locating the code to change

Grepping the Typo code for syntax yielded several hits:

config/environment.rb     # Adds vendor/syntax/lib to the load path
components/plugins/textfilters/code_controller.rb # Does the syntax highlighting
vendor/syntax             # The syntax module itself

Fiddling around with the code

So, the first thing I did was start hacking at code_controller.rb, adding a new class and registering it, just like this:

class PythonTokenizer < Syntax::Tokenizer
  def step
    if digits = scan(/\d+/)
      start_group :digits, digits
    elsif words = scan(/\w+/)
      start_group :words, words
    else
      start_group :normal, scan(/./)
    end
  end
end

Syntax::SYNTAX['python'] = PythonTokenizer

This being my first ever attempt at Ruby code, I didn’t even write it myself: I simply cut-and-pasted it direct from the Ruby syntax highlight manual. As you can see, I made no attempt to implement a real Python tokenizer; I just wanted to see if I could get any syntax highlighter working. I started up my Typo development environment and posted a code snippet:

<code lang="python">
abc 123
</code>

I then examined the resultant HTML (CTRL-U in Firefox). Sure enough, it read:

<div class="typocode"><pre>
<code class="typocode_python ">
<span class="words">abc</span> <span class="digits">123</span>
</code></pre></div>

Perfect!

Portability

Incidentally, my home development environment is on the Windows platform; my live blog runs on a shared server running FreeBSD. Identical Typo code runs on both — the only difference being that I use WEBrick as my development webserver and lighttpd on the live blog.

Hot updates

Wouldn’t it be nice if you could edit code_controller.rb, hit F5 in the web browser and see your changes take immediate effect? I gave it a go, switching words for worms for a bit of fun.

class PythonTokenizer < Syntax::Tokenizer
    ....
      start_group :worms, words
end

Sure enough, the updated HTML page read:

<span class="worms">abc</span>

which is how things should be. I was pleased to see that the syntax highlight module created the new CSS class "worms" without complaining. I didn’t even have to enter the string literal "worms" anywhere in the code — some sort of reflection must have figured out how to process the :worms symbol correctly.
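
As it turns out, no reflection is needed for this: a Ruby symbol converts to its name as a string whenever it is interpolated. The sketch below shows the idea. The helper span_for is hypothetical and is not the Syntax library's HTML convertor, which I have not inspected here; it also skips HTML escaping for brevity.

```ruby
# Hypothetical helper (an assumption for illustration only).
# Interpolating a symbol into a string calls to_s on it,
# so :worms becomes "worms" with no lookup table involved.
def span_for(group, text)
  %(<span class="#{group}">#{text}</span>)
end

puts span_for(:worms, "abc")   # <span class="worms">abc</span>
puts :worms.to_s               # worms
```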

Overenthusiasm

Enthused by this early success, I tried editing my PythonTokenizer class to do what it was really meant to do: namely, identify comments, strings, keywords. Typo reported back the inevitable syntax errors through the web interface in a friendly enough way, but I soon realised that this was not the correct way to develop code.

What I really ought to be doing was developing my new PythonTokenizer class in isolation, then integrating it into the Rails application.

Running the Syntax Unit Tests

So, I went looking in the vendor/syntax directory.

+---api
|   +---classes
|   |   \---Syntax
|   |       \---Convertors
|   \---files
|       \---lib
|           \---syntax
|               +---convertors
|               \---lang
+---doc
|   +---manual
|   |   +---parts
|   |   \---stylesheets
|   \---manual-html
|       \---stylesheets
+---lib
|   \---syntax
|       +---convertors
|       \---lang
\---test
    \---syntax

I found the Ruby, XML and YAML tokenizers in lib/syntax/lang/ruby.rb, lib/syntax/lang/xml.rb and lib/syntax/lang/yaml.rb respectively. I found accompanying unit tests in test/syntax/tc_ruby.rb, test/syntax/tc_xml.rb and test/syntax/tc_yaml.rb. Running test/ALL-TESTS.rb gave:

c:\thomas\typo\vendor\syntax\test>ALL-TESTS.rb
ALL-TESTS.rb
Loaded suite c:/thomas/typo/vendor/syntax/test/ALL-TESTS
Started
............................................................
Finished in 0.359 seconds.

122 tests, 761 assertions, 0 failures, 0 errors

My new strategy was clear: develop lib/syntax/lang/python.rb and test/syntax/tc_python.rb in parallel until my new syntax highlighter passed all the tests, then integrate my new Python highlighter into Typo. I reverted my changes to code_controller.rb and started again.

Adding a testcase

So, I created tc_python.rb, using tc_ruby.rb as an example. Here’s what my first test looked like:

tc_python.rb
require File.dirname(__FILE__) + "/tokenizer_testcase"

class TC_Syntax_Python < TokenizerTestCase

  syntax "python"

  def test_empty
    tokenize ""
    assert_no_next_token
  end
end

Running ALL-TESTS.rb again gave me:

Started
...F........................................................
Finished in 0.282 seconds.

  1) Failure:
test_empty(TC_Syntax_Python)
    [./syntax/tokenizer_testcase.rb:34:in `assert_no_next_token'
     ./syntax/tc_python.rb:9:in `test_empty']:
<false> is not true.

123 tests, 762 assertions, 1 failures, 0 errors

This at least confirmed my test was being run. Actually, I was a little surprised to get a failure and not an error, since I hadn’t even registered a Python syntax highlighter.
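
One plausible explanation, which is an assumption and not something checked against the Syntax source, is that the registry of tokenizers is a Hash created with a default value. Looking up an unregistered language would then quietly return a fallback tokenizer instead of raising. The class and constant names below are hypothetical stand-ins.

```ruby
# Hypothetical stand-ins for the real tokenizer classes.
class FallbackTokenizer; end
class RubyTokenizer; end

# Hash.new(value) returns that value for any missing key.
REGISTRY = Hash.new(FallbackTokenizer)
REGISTRY['ruby'] = RubyTokenizer

p REGISTRY['ruby']     # RubyTokenizer
p REGISTRY['python']   # FallbackTokenizer (no exception raised)
```

With a registry like this, a test against an unregistered language runs against the fallback and fails on its assertions, where a plain Hash would have returned nil and caused an error.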

Getting started on python.rb

My first cut at python.rb simply reproduced the simple tokenizer I’d put into code_controller.rb.

python.rb
require 'syntax'

module Syntax
  class Python < Tokenizer

    # Step through a single iteration of the tokenization process.
    def step
      if digits = scan(/\d+/)
        start_group :digits, digits
      elsif words = scan(/\w+/)
        start_group :words, words
      else
        start_group :normal, scan(/./)
      end
    end
  end

  SYNTAX["python"] = Python
end

With this implementation, all the tests passed. Now I wrote a test case for finding comments — about the simplest syntactic element of a Python program. Perhaps “wrote” overstates things. Actually, I just cut-and-pasted a testcase from tc_ruby.rb.

  def test_comment_eol
    tokenize "# a comment\nfoo"
    assert_next_token :comment, "# a comment"
    assert_next_token :normal, "\n"
    assert_next_token :ident, "foo"
  end

This caused the tests to hang. By playing with the code, I soon figured out the problem. My tokenizer wasn’t getting past the newline. I’d seen enough Perl in my time to figure out what to do. Clearly the scan function accepted a regular expression, and the else case used the regex special character . to eat any single character except an end-of-line. I modified the regex so the code read start_group :normal, scan(/./m) (notice the m), and now my test failed instead of hanging:
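
The difference is easy to reproduce outside the tokenizer with Ruby's standard StringScanner class. This sketch assumes the tokenizer's scan method behaves like StringScanner#scan, which is how it appeared to behave in my tests.

```ruby
require 'strscan'

scanner = StringScanner.new("\nfoo")

p scanner.scan(/./)    # nil: "." does not match a newline by default
p scanner.pos          # 0: nothing was consumed, so a loop would spin forever
p scanner.scan(/./m)   # "\n": with the m flag, "." matches a newline too
p scanner.pos          # 1
```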

  1) Failure:
test_comment_eol(TC_Syntax_Python)
    [./syntax/tokenizer_testcase.rb:29:in `assert_next_token'
     ./syntax/tc_python.rb:13:in `test_comment_eol']:
<[:comment, "# a comment", :none]> expected but was
<[:normal, "# ", :none]>.

It was time to start making my Python tokenizer look like it really wanted to tokenize Python.

python.rb
  class Python < Tokenizer
    def step
      if comment = scan(/#.*$/)
        start_group :comment, comment
      else
        start_group :normal, scan(/./m)
      end
    end
  end

With this change, my failure moved on a line:

  1) Failure:
test_comment_eol(TC_Syntax_Python)
    [./syntax/tokenizer_testcase.rb:29:in `assert_next_token'
     ./syntax/tc_python.rb:14:in `test_comment_eol']:
<[:normal, "\n", :none]> expected but was
<[:normal, "\nfoo", :none]>.

Good! My tokenizer had at least recognised the comment. Hardly surprisingly, it then treated the rest of the string as normal, which is what the test failure indicates.
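
The merging of the newline and foo into one token can be imitated with a toy loop. This is a minimal sketch, assuming that consecutive chunks in the same group are joined into a single token; toy_tokenize is a hypothetical stand-in and not the Syntax library's code, and its tokens are simple pairs without the third element seen in the test output.

```ruby
require 'strscan'

# Hypothetical stand-in for the tokenizer's main loop.
def toy_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if (comment = scanner.scan(/#.*$/))
      group, chunk = :comment, comment
    else
      group, chunk = :normal, scanner.scan(/./m)
    end
    # Join adjacent chunks of the same group into one token.
    if tokens.last && tokens.last[0] == group
      tokens.last[1] << chunk
    else
      tokens << [group, chunk.dup]
    end
  end
  tokens
end

p toy_tokenize("# a comment\nfoo")
# [[:comment, "# a comment"], [:normal, "\nfoo"]]
```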

Rinse and Repeat

You can probably work out the rest. I added code and test cases until my Python syntax highlighter did all I wanted it to do: namely, pick out comments, strings, triple quoted strings. This post is far too long already — I’ll post my code and the accompanying tests in another post.
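
To give a flavour of the rules involved, here is a rough sketch of how comments, strings and triple quoted strings might be picked out with regular expressions. It is not the python.rb I ended up with: the patterns, the scan_python helper and the token shape are all assumptions made for illustration, and string prefixes such as r"..." are ignored.

```ruby
require 'strscan'

# Hypothetical patterns, for illustration only.
TRIPLE_QUOTED = /"""[\s\S]*?"""|'''[\s\S]*?'''/           # may span lines
SINGLE_LINE   = /"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*'/ # allows \" escapes
COMMENT       = /#.*$/

def scan_python(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    # Try triple quoted strings first, otherwise the opening """
    # would be read as an empty "" string followed by a stray quote.
    if (s = scanner.scan(TRIPLE_QUOTED))
      tokens << [:string, s]
    elsif (s = scanner.scan(SINGLE_LINE))
      tokens << [:string, s]
    elsif (s = scanner.scan(COMMENT))
      tokens << [:comment, s]
    else
      scanner.getch   # skip everything else, one character at a time
    end
  end
  tokens
end

p scan_python(%q{x = "a # b"  # real comment})
# [[:string, "\"a # b\""], [:comment, "# real comment"]]
```

Note that the # inside the string is not mistaken for a comment, because the string rule consumes it before the comment rule gets a chance.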

Deploying the Python Highlighter

I didn’t need to do anything to deploy the code in my development environment. It was already there, since I’d developed it in place. I ran some system level tests to convince myself all was indeed OK, then copied it across to my shared server.

Just to show it all works, here’s a simple Python program to generate all the m-element subsets of a set.

def generate_subsets(the_set, m):
   """ Generate all m element subsets of the input set.

   If the input set is empty or m is 0, yield the empty set.
   Otherwise, use a recursive solution. Pick any element from
   the set, and yield the subsets which contain this element,
   followed by those which don't.
   """
   if m > len(the_set):
       pass
   elif len(the_set) == 0 or m == 0:
       yield set()
   else:
       e = the_set.pop()
       for subset in generate_subsets(the_set, m - 1):
           subset.add(e)
           yield subset
       for subset in generate_subsets(the_set, m):
           yield subset
       the_set.add(e)