Chinese OCR

Tesseract on Mac OS X

To install and run Tesseract-OCR 3.0 on Mac OS X, first install tesseract via Homebrew, then download the Chinese language training files:

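brew install tesseract
mkdir -p ~/Downloads/tessdata
cd ~/Downloads/tessdata
# Download chi_sim.traineddata.gz and chi_tra.traineddata.gz into this
# directory first, then unpack them:
gunzip chi_sim.traineddata.gz chi_tra.traineddata.gz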

(With the newer Homebrew formula you can simply run brew install tesseract --all-languages, so you don't need to fetch the language files yourself.)

The recognition process for a picture (here inputfile.jpg) is then as simple as:

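# Convert the picture to a grayscale TIFF (convert is part of ImageMagick)
convert inputfile.jpg -type Grayscale inputfile.tif
# Point Tesseract at the directory that contains tessdata/
export TESSDATA_PREFIX=~/Downloads/
# Recognize Simplified Chinese; the text is written to output.txt
tesseract inputfile.tif output -l chi_sim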
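
The recognition step fails if the language file is not where Tesseract expects it, so a quick check beforehand can save some confusion. The following is only a minimal sketch and not part of the original instructions: the function name have_traineddata is hypothetical, and it assumes, as the setup above implies, that Tesseract 3.0 looks for <lang>.traineddata in the tessdata directory directly below TESSDATA_PREFIX.

```shell
# Hypothetical helper (not part of Tesseract): succeeds only if
# <lang>.traineddata exists in $TESSDATA_PREFIX/tessdata/.
# Assumption: Tesseract 3.0 reads language files from that directory.
have_traineddata() {
    lang="$1"
    # Fail early when TESSDATA_PREFIX is unset or empty.
    [ -n "${TESSDATA_PREFIX:-}" ] || return 1
    # Strip one trailing slash so "~/Downloads/" and "~/Downloads" both work.
    [ -f "${TESSDATA_PREFIX%/}/tessdata/${lang}.traineddata" ]
}
```

With the function defined, have_traineddata chi_sim && tesseract inputfile.tif output -l chi_sim runs the recognition only when the Simplified Chinese file is in place.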

The Traditional Chinese training data file (chi_tra) does not work at the moment, though; see bugs 381 and 336.

