Friday, April 12, 2013

How to convert jpg to tiff for OCR with tesseract

1)
Install PIL (its maintained fork is published on PyPI as Pillow and still provides the PIL module)
#pip install Pillow

2)
Install tesseract-ocr
#sudo apt-get install tesseract-ocr

3)
Install pytesser
http://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip&can=2&q=
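
Before going further it can help to confirm that the command-line tools are actually on your PATH. Here is a minimal sketch of such a check; the helper name missing_tools is made up for illustration, and it assumes Python 3.3 or newer for shutil.which.

```python
import shutil


def missing_tools(names):
    """Return the names from `names` that are not found on PATH."""
    # missing_tools is a hypothetical helper, not part of pytesser or PIL
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # "tesseract" and "convert" are the two executables used in this post
    missing = missing_tools(["tesseract", "convert"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("tesseract and convert are both available")
```
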
4)
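Convert your image to TIFF (the convert command comes from ImageMagick; install it with sudo apt-get install imagemagick if it is missing)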
#convert myimage.jpeg -auto-level -compress none myimage.tif
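
If you would rather run the conversion from a script, the same command can be called through subprocess. This is only a sketch: the function names build_convert_command and convert_to_tiff are my own, and it assumes ImageMagick's convert is on your PATH.

```python
import subprocess


def build_convert_command(src, dst):
    """Build the ImageMagick command from step 4 as an argument list."""
    # Same options as the shell command above: -auto-level -compress none
    return ["convert", src, "-auto-level", "-compress", "none", dst]


def convert_to_tiff(src, dst):
    """Run the conversion; raises CalledProcessError if convert fails."""
    subprocess.check_call(build_convert_command(src, dst))


# Usage (needs ImageMagick installed and an existing myimage.jpeg):
# convert_to_tiff("myimage.jpeg", "myimage.tif")
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.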

5)
Python code to read data from myimage.tif

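from PIL import Image
from pytesser.pytesser import *

image_file = 'myimage.tif'

# Option 1: OCR an image that is already loaded with PIL
im = Image.open(image_file)
text = image_to_string(im)

# Option 2: OCR straight from the file path
text = image_file_to_string(image_file)

# Option 3: as option 2; graceful_errors=True asks pytesser to convert the
# image to a compatible format and retry if tesseract cannot read it directly
text = image_file_to_string(image_file, graceful_errors=True)

# Each option overwrites text, so only the last result is printed
print("=====output=======\n")
print(text)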
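
If pytesser gives you trouble, the tesseract command can also be called directly: it takes the image path and an output base name, and writes the recognized text to that base name plus .txt. A minimal sketch, assuming tesseract is on your PATH; build_tesseract_command and ocr_with_cli are hypothetical helper names, not part of any library.

```python
import os
import shutil
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base):
    """Build the command; tesseract writes its result to <output_base>.txt."""
    return ["tesseract", image_path, output_base]


def ocr_with_cli(image_path):
    """Run tesseract on image_path and return the recognized text."""
    workdir = tempfile.mkdtemp()
    try:
        output_base = os.path.join(workdir, "ocr_result")
        subprocess.check_call(build_tesseract_command(image_path, output_base))
        with open(output_base + ".txt") as handle:
            return handle.read()
    finally:
        # Remove the scratch directory whether or not tesseract succeeded
        shutil.rmtree(workdir, ignore_errors=True)


# Usage (needs tesseract installed and an existing myimage.tif):
# print(ocr_with_cli("myimage.tif"))
```
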