gImageReader (runs on Linux and Windows) is a GUI for tesseract-ocr, a free software optical character recognition (OCR) engine which you can use to
extract text from PDF documents or images.
gImageReader lets you select columns or parts of a document, spell check the output and more, but until now it could not recognize a whole document at once. The latest release, gImageReader 0.9, adds multipage recognition for multipage PDF files. You can also set gImageReader to extract the text from a page range if you don't need it to recognize a whole document.
Besides this very useful (and much needed!) new feature,
gImageReader 0.9 also comes with:
- new language profiles: Chinese, Korean, Japanese, Hebrew, Arabic, Croatian
- all formats supported by gdk_pixbuf added to the file filter of the open dialog
- option to cancel the recognition
- fixed auto-installing new dictionaries (new dictionaries would not appear in the main language selector until the program was restarted)
- many other minor improvements and bug fixes
How about the speed, you may ask? Well, in my test, gImageReader recognized a 36-page PDF document in 1,10 minutes (on the rather slow computer I have at work).
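If you'd rather script the page range idea than click through the GUI, here is a rough Python sketch. To be clear, this is not how gImageReader works internally; it is just my assumption of how you could get a similar result with the `pdftoppm` tool from poppler-utils and the `tesseract` command line. The function names, the `page-N` file names and the "1-3,7" range syntax are all made up for illustration. The script only builds and prints the commands, it does not run them.

```python
import shlex


def parse_page_range(spec, page_count):
    """Expand a spec such as "1-3,7" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def build_commands(pdf_path, pages, lang="eng", dpi=300):
    """Return the command lines that would render and OCR each page."""
    commands = []
    for page in pages:
        base = f"page-{page}"  # hypothetical output name
        # Render one page to <base>.png; -singlefile keeps the name free of page digits.
        commands.append(["pdftoppm", "-f", str(page), "-l", str(page),
                         "-r", str(dpi), "-png", "-singlefile", pdf_path, base])
        # OCR the image; tesseract writes the text to <base>.txt.
        commands.append(["tesseract", f"{base}.png", base, "-l", lang])
    return commands


if __name__ == "__main__":
    # Example: pages 1 to 3 and page 7 of a 36-page document.
    for cmd in build_commands("document.pdf", parse_page_range("1-3,7", 36)):
        print(shlex.join(cmd))
```

Paste the printed commands into a terminal (or pass each list to `subprocess.run`) once you have checked that both tools are installed along with the language data you need.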
Download gImageReader (.deb, .rpm and .exe files available)