Wednesday, April 16

Convert PDF to TIFF then OCR as TXT, in Ubuntu, of course

Today I needed to OCR a PDF file. That is, I had a file in PDF format and wanted to get the text out of it.

The default PDF reader in Ubuntu is evince, so I searched for "ocr evince", and the first page I got was a GNOME roadmap. It tells me that in a future version, evince WILL support OCR. Thanks for letting me know.

Tesseract is the software that does the OCR, but it can only process TIFF files. So the first step is to install imagemagick and convert the PDF:
sudo apt-get install imagemagick
convert aaa.pdf aaa.tif
The 100 KB PDF file produced a 4 MB TIFF file.

I installed tesseract 1.2 with apt-get, and it generated messy output:
pmorvxu qo6 jnwbeq oAeL we gas?` ;ox~
]F1LUbGq OAGL QJG {SEA {OX` j_}.IG dF1!C}(
OAGL [{16 {SEA J`OX~ j_}JG ClI'1!C}( pLOMU qo6
gas?` ;ox~ ipe dngcg pkorvxu qod jnuabeq
j_}JG ClI'1!C}( pLOMU qo6 ]f1!JJbGq OAGL HJG
0% HIS J=OHiJ9I~
OCL COqG *3Uq 266 QJG ![ MOLK2 OU *3}} []xbG2
J.!J!e !e 9 lot 0% JS bO!U{ IGXI to [Gel {IJG
So I had to uninstall it, download the tesseract 2.0 source code and training data, extract both, put the training data into the tessdata folder of the tesseract 2.0 source tree, and then run:
sudo make install
tesseract aaa.tif aaa
and got a 4 KB text file, aaa.txt (tesseract adds the .txt extension to the output name itself).
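
For anyone who wants to repeat the two steps on several files, here is a minimal Python sketch that wraps the same commands. It assumes that convert and tesseract are already installed and on the PATH, and that tesseract appends .txt to the output base name it is given. The function names build_commands and pdf_to_text are my own, made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Return the two command lines: PDF -> TIFF, then TIFF -> text."""
    pdf = Path(pdf_path)
    tif = pdf.with_suffix(".tif")
    # Assumption: tesseract takes an output *base* name and adds ".txt" itself.
    out_base = pdf.with_suffix("")
    return [
        ["convert", str(pdf), str(tif)],
        ["tesseract", str(tif), str(out_base)],
    ]


def pdf_to_text(pdf_path):
    """Run both steps in order and stop at the first one that fails."""
    for cmd in build_commands(pdf_path):
        subprocess.run(cmd, check=True)
    return Path(pdf_path).with_suffix(".txt")
```

Calling pdf_to_text("aaa.pdf") should run the same two commands as above and return the path aaa.txt, provided both tools are installed.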

The OCR result is acceptable. I wish there were a button in evince to do the conversion automatically, so that I wouldn't need to use the command line.

If OCR software integrated a dictionary, the recognition rate would be much higher. For example, there wouldn't be a "palanquln" in the text, but "palanquin". After that, add grammar checking. A feedback system would be of great benefit as well. The Keyboard sniffer talks about a related technology.
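
To make the dictionary idea concrete, here is a rough sketch of a post-processing step in Python, using the standard difflib module. This is not how tesseract works internally; it only illustrates the idea on the finished text. The tiny word list and the names correct_word and correct_text are hypothetical, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib
import re

# Hypothetical tiny dictionary; a real one would be loaded from a word list.
WORDS = ["palanquin", "carried", "the", "was", "in", "a"]


def correct_word(word, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the word itself if none is close."""
    lower = word.lower()
    if lower in dictionary:
        return word
    matches = difflib.get_close_matches(lower, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, dictionary, cutoff=0.8):
    """Apply correct_word to every run of letters, leaving everything else alone."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: correct_word(m.group(0), dictionary, cutoff),
        text,
    )
```

With this word list, correct_text("carried in a palanquln", WORDS) gives "carried in a palanquin", while a word with no close match is left as it is. A corrected word comes back in lower case, which a real tool would have to handle better.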



At March 25, 2009 6:23 AM, Blogger Geordy said...

Hi Ben,

Old post, but I wanted to thank you anyway for the small but very useful tutorial.


