OCR. Wrong characters, right meaning! (chuckles)

2009-03-19

Chuckles graphic

Run this image through the tesseract OCR engine and it gets the characters wrong but the meaning right.

$ curl wordaligned.org/images/chuckles.tif > chuckles.tif
$ tesseract chuckles.tif ocr-chuckles && cat ocr-chuckles.txt
[HEHE]

At first I assumed I’d chanced on an Easter egg, but now I’m not so sure. Crop to the region of interest and all is well.

Cropped chuckles graphic

$ curl wordaligned.org/images/chuckles-cropped.tif > cropped.tif
$ tesseract cropped.tif ocr-cropped && cat ocr-cropped.txt
(chuckles)

Just in case you were wondering … the graphic appears in the subtitles of a TV advert featuring Rolf Harris and the Churchill dog. Rolf is the one who’s chuckling.

Rolf Harris and Churchill


I saw this oddity with an untrained tesseract 2.03, built on OS X. The grayscale images shown on this page are PNGs, not TIFFs, because (much to my surprise) browser support for TIFFs is limited. Tesseract 2.03 only accepts TIFF images, and the file extension has to be .tif.
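
If you only have one of the PNGs from this page to hand, something like the following might get it past the .tif restriction. It is a rough sketch rather than anything I ran for this post: `tif_name` and `ocr_image` are hypothetical helper names, and the conversion step assumes ImageMagick's `convert` is installed, which is not part of tesseract.

```shell
#!/bin/sh

# tif_name: hypothetical helper, not part of tesseract.
# Prints the given path with its final extension replaced by .tif,
# or with .tif appended when the file name has no extension.
tif_name() {
    case ${1##*/} in
        *.*) printf '%s.tif\n' "${1%.*}" ;;
        *)   printf '%s.tif\n' "$1" ;;
    esac
}

# ocr_image: hypothetical wrapper around the two commands used above.
# Converts the input to a .tif file if it is not one already
# (assumption: ImageMagick's `convert` is on the PATH), then runs
# tesseract and prints the recognised text.
ocr_image() {
    src=$1
    tif=$(tif_name "$src")
    if [ "$src" != "$tif" ]; then
        convert "$src" "$tif" || return 1
    fi
    base=${tif%.tif}
    tesseract "$tif" "$base" && cat "$base.txt"
}
```

With these definitions loaded, `ocr_image chuckles.png` would write chuckles.tif next to the PNG, then leave the recognised text in chuckles.txt. Whether the converted image reproduces the [HEHE] result is something I have not checked.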