OCR. Wrong characters, right meaning! (chuckles)

2009-03-19

Chuckles graphic

Run this image through the tesseract OCR engine and it gets the characters wrong but the meaning right.

$ curl wordaligned.org/images/chuckles.tif > chuckles.tif
$ tesseract chuckles.tif ocr-chuckles && cat ocr-chuckles.txt

At first I assumed I’d chanced on an easter egg but now I’m not so sure. Crop to the region of interest and all is well.
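If you want to try the crop yourself, here's a minimal sketch. It assumes ImageMagick's convert is installed, which isn't necessarily how the cropped image on this page was made. The helpers crop_geometry and crop_tif are hypothetical names invented for illustration.

```shell
# crop_geometry is a hypothetical helper: it formats ImageMagick's
# WIDTHxHEIGHT+X+Y geometry string from four pixel values.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# crop_tif is hypothetical too, and assumes ImageMagick's convert is on the PATH.
# usage: crop_tif IN.tif OUT.tif WIDTH HEIGHT X Y
crop_tif() {
    # +repage resets the canvas so the output is just the cropped region
    convert "$1" -crop "$(crop_geometry "$3" "$4" "$5" "$6")" +repage "$2"
}
```

Something like `crop_tif chuckles.tif cropped.tif 300 40 20 180` would then write a cropped file. Those numbers are placeholders, not the region actually used for this post.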

Cropped chuckles graphic

$ curl wordaligned.org/images/chuckles-cropped.tif > cropped.tif
$ tesseract cropped.tif ocr-cropped && cat ocr-cropped.txt

Just in case you were wondering … the graphic appears in the subtitles of a TV advert featuring Rolf Harris and the Churchill dog. Rolf is the one who’s chuckling.

Rolf Harris and Churchill

This oddity happened using an untrained tesseract 2.03 built on OS X. The grayscale images shown on this page are PNGs rather than TIFFs because, much to my surprise, browser support for TIFFs is limited. Tesseract 2.03 only accepts TIFF images, and the file extension has to be .tif.
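Since the extension matters, here's a rough sketch of getting from one of the PNGs on this page to something tesseract will take. It assumes ImageMagick's convert is available. The helpers tif_name and to_tif are hypothetical names, not part of tesseract.

```shell
# tif_name is a hypothetical helper: it swaps a file's extension for .tif,
# or appends .tif if the file name has no extension.
tif_name() {
    case ${1##*/} in
        ?*.*) printf '%s.tif\n' "${1%.*}" ;;  # name.ext -> name.tif
        *)    printf '%s.tif\n' "$1" ;;       # no extension -> append .tif
    esac
}

# to_tif is hypothetical too, and assumes ImageMagick's convert is on the PATH.
# It prints the name of the .tif file so the caller can pass it to tesseract.
to_tif() {
    src=$1
    dst=$(tif_name "$src")
    # Skip the conversion if the file already has the right extension
    [ "$src" = "$dst" ] || convert "$src" "$dst"
    printf '%s\n' "$dst"
}
```

With these defined, `tesseract "$(to_tif chuckles.png)" ocr-chuckles` should run the same OCR step as above, starting from a PNG.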