Several years back, I was working on an imaging project in Java which was going to require some Optical Character Recognition (OCR) functionality. After an exhaustive search, I could find nothing to fit the bill. My requirements were:
I never found anything that met my requirements, so I set about developing my own. What I ended up with is a generic, trainable OCR package that does a fairly decent job of decoding printed text, as long as it has been trained for the font(s) it is expected to recognize.
This OCR engine is implemented as a Java library, along with a demo application which shows the library in action. The core concept, at the character level, is image matching with automatic position and aspect ratio correction, using a least-square-error matching algorithm. It is a very simple yet reasonably effective implementation.
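To make the character-level idea concrete, here is a minimal sketch of least-square-error matching. It is not the library's actual code: the class name, method names, the 8x8 grid and the ink threshold of 128 are all assumptions made for illustration. It interprets position correction as cropping each glyph to the bounding box of its ink, and aspect ratio correction as stretching the crop onto a fixed grid, which may differ from what the engine really does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical, not the library's API.
class LeastSquaresMatchSketch {
    static final int GRID = 8;        // assumed size of the normalized comparison grid
    static final int INK_BELOW = 128; // assumed threshold: grey values below this count as ink

    /** Builds a greyscale image from equal-length text rows: '#' is ink (0), anything else white (255). */
    public static int[][] fromRows(String... rows) {
        int[][] img = new int[rows.length][rows[0].length()];
        for (int y = 0; y < rows.length; y++)
            for (int x = 0; x < rows[y].length(); x++)
                img[y][x] = rows[y].charAt(x) == '#' ? 0 : 255;
        return img;
    }

    /** Position correction: crop to the bounding box of the ink pixels. */
    public static int[][] cropToInk(int[][] img) {
        int h = img.length, w = img[0].length;
        int top = h, bottom = -1, left = w, right = -1;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (img[y][x] < INK_BELOW) {
                    top = Math.min(top, y);
                    bottom = Math.max(bottom, y);
                    left = Math.min(left, x);
                    right = Math.max(right, x);
                }
        if (bottom < 0) return img; // no ink found: leave the image unchanged
        int[][] out = new int[bottom - top + 1][right - left + 1];
        for (int y = top; y <= bottom; y++)
            System.arraycopy(img[y], left, out[y - top], 0, right - left + 1);
        return out;
    }

    /** Size/aspect correction: nearest-neighbour stretch onto a newW x newH grid. */
    public static int[][] resize(int[][] img, int newW, int newH) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[newH][newW];
        for (int y = 0; y < newH; y++)
            for (int x = 0; x < newW; x++)
                out[y][x] = img[y * h / newH][x * w / newW];
        return out;
    }

    public static int[][] normalize(int[][] img) {
        return resize(cropToInk(img), GRID, GRID);
    }

    /** Mean of squared pixel differences between two images of equal size. */
    public static double meanSquaredError(int[][] a, int[][] b) {
        double sum = 0;
        for (int y = 0; y < a.length; y++)
            for (int x = 0; x < a[0].length; x++) {
                double d = a[y][x] - b[y][x];
                sum += d * d;
            }
        return sum / (a.length * a[0].length);
    }

    /** Returns the template character with the smallest error, or '?' if there are no templates. */
    public static char bestMatch(int[][] glyph, Map<Character, int[][]> templates) {
        int[][] g = normalize(glyph);
        char best = '?';
        double bestErr = Double.MAX_VALUE;
        for (Map.Entry<Character, int[][]> e : templates.entrySet()) {
            double err = meanSquaredError(g, normalize(e.getValue()));
            if (err < bestErr) {
                bestErr = err;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Character, int[][]> templates = new LinkedHashMap<>();
        templates.put('L', fromRows("#..", "#..", "###"));
        templates.put('T', fromRows("###", ".#.", ".#."));
        // The same L shape, drawn twice as large and offset inside a bigger canvas.
        int[][] glyph = fromRows("........", ".##.....", ".##.....", ".##.....",
                                 ".##.....", ".######.", ".######.", "........");
        System.out.println(bestMatch(glyph, templates)); // prints L
    }
}
```

One limitation of stretching everything onto a square grid is that glyphs made of a single solid stroke, such as a vertical bar and a hyphen, end up looking identical after normalization. A practical matcher would need to take the original aspect ratio into account as well.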
Training consists of the following steps:
The general steps used by this OCR engine for converting a scanned document to text are:
This is a generic, trainable OCR engine. By default, it knows nothing except how to (attempt to) filter/clean up dust, convert to greyscale, break the document into lines, break the lines into characters, compare each character against known characters in user-supplied training images, and output the closest matches as text.
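The greyscale and segmentation steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine's own code: the class and method names are hypothetical, the greyscale weights are the common luma approximation, and lines and characters are assumed to be separated by fully blank rows and columns.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not the library's API.
class SegmentationSketch {

    /** Converts a packed 0xRRGGBB pixel to a grey value in 0..255 using luma weights. */
    public static int toGrey(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r * 299 + g * 587 + b * 114) / 1000;
    }

    /** Returns {start, endExclusive} pairs for each run of consecutive true values. */
    static List<int[]> runs(boolean[] ink) {
        List<int[]> out = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= ink.length; i++) {
            boolean on = i < ink.length && ink[i];
            if (on && start < 0) {
                start = i;                        // a run begins
            } else if (!on && start >= 0) {
                out.add(new int[] {start, i});    // a run ends
                start = -1;
            }
        }
        return out;
    }

    /** Finds text lines as runs of rows that contain at least one ink pixel. */
    public static List<int[]> findLines(int[][] grey, int threshold) {
        boolean[] rowHasInk = new boolean[grey.length];
        for (int y = 0; y < grey.length; y++)
            for (int v : grey[y])
                if (v < threshold) { rowHasInk[y] = true; break; }
        return runs(rowHasInk);
    }

    /** Finds characters within rows [top, bottom) as runs of columns that contain ink. */
    public static List<int[]> findChars(int[][] grey, int top, int bottom, int threshold) {
        boolean[] colHasInk = new boolean[grey[0].length];
        for (int y = top; y < bottom; y++)
            for (int x = 0; x < colHasInk.length; x++)
                if (grey[y][x] < threshold) colHasInk[x] = true;
        return runs(colHasInk);
    }

    public static void main(String[] args) {
        int[][] page = {
            {255, 255, 255, 255, 255, 255},
            {255,   0, 255,   0,   0, 255},
            {255,   0, 255, 255, 255, 255},
            {255, 255, 255, 255, 255, 255},
            {  0, 255, 255, 255, 255, 255}
        };
        for (int[] line : findLines(page, 128)) {
            int count = findChars(page, line[0], line[1], 128).size();
            System.out.println("rows " + line[0] + ".." + (line[1] - 1) + ": " + count + " character(s)");
        }
    }
}
```

Splitting on blank rows and columns is the simplest possible approach. It will merge characters that touch or overlap horizontally, so tightly kerned or italic fonts would need something smarter.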
The engine was originally written to digitize documents (or specific sections of documents) which were printed with a handful of known fonts for which it could be trained, in order to minimize the error. Digitization was not intended to be 100 percent accurate, since the digitized text was to be used mainly for searching the documents by keywords. It was intended to be used in a document imaging system.
On the simple documents I tested it with, this OCR engine compared favorably with the open-source OCR package GOCR. It translated images to text with at least comparable accuracy, and its speed was in the same ballpark as GOCR's, if not somewhat faster. Extensive comparisons were not performed.
The following instructions assume you're running on a Linux box, with a reasonably recent version of Sun's JDK installed. You can get the JDK at java.sun.com. Be sure to remove any "fake" java packages that come with your Linux distribution. If you install OpenOffice, chances are you'll get a counterfeit GNU Java implementation which does not conform to Sun's Java specification, and is actually quite outdated as well. Unfortunately, OpenOffice has dependencies on this package. To get rid of it, you'll need to do something like this before installing Sun's JDK:
rpm -e --nodeps java-1.4.2-gcj-compat
NOTE: This may BREAK your OpenOffice installation, at least until you install the Sun JDK to replace the missing Java functionality. But hey, the OpenOffice guys should know better than to force someone to install an illegitimate Java knock-off, especially since OpenOffice is operated by Sun, who created the real Java in the first place. There's just no excuse.
As a potential "alternative", if you're more skilled than I am with the Linux alternatives package, you could use it to fix up the symlinks under /etc/alternatives to point to the real JDK without uninstalling the GNU Java knock-off. However, you'd have to be careful about software updates to the GNU Java knock-off "accidentally" resetting these symlinks, thereby breaking the real JDK. What a mess. Sun should really go after these guys for creating executables with the same names as Sun's, and purposely interfering with the distribution of Sun's legitimate Java implementation. After all, isn't that what Microsoft did with their fake Java implementation? Bad actions are bad, no matter who's doing them. But I digress.
So, back to the OCR engine. When you download and unpack the tarball, you'll have an "ocr" directory. Under it you'll find these scripts:
The source code *should* already be compiled, and there should be an ocr.jar file in the top-level "ocr" directory. If so, you can proceed. If not, or if you need to rebuild after making a change to the source code, just do the following:
./compile && ./createJars
Assuming there are no errors, you'll get freshly compiled classes and a new ocr.jar with your changes.
If you look under the ocrTests directory, there are several png and jpg files. Each of these is an image which contains text, and can be used to demonstrate the functionality of the OCR engine. To test the OCR engine on an image, do something like this:
./ocrscannerdemo ocrTests/asciiSentence.png
Notice that there is also a directory named ocrTests/trainingImages. This contains the font samples that are used to train the OCR engine in the demo application, so that it can recognize the fonts that were used to create the test images in the ocrTests directory. If you look at the loadTrainingImages() method in the src/com/roncemer/ocr/OCRScannerDemo.java source file, you'll see that the demo app loads each of these training images and tells the OCR engine which character ranges each image contains. The OCR engine then matches each character in the source image against these training images in order to convert the source image into text.
To use the code in your own program, put ocr.jar into your classpath and follow the usage pattern which is used in the src/com/roncemer/ocr/OCRScannerDemo.java source file.
Feel free to look at the other source files, if you're interested in the inner workings of the OCR engine. The concepts are fairly simple, yet reasonably effective.
I originally released this engine under the GPL, version 2. However, I felt it would be more commercially friendly if it were re-released under the BSD license. As of May 6, 2010, I've created a project page on SourceForge, changed the license to BSD, and uploaded the whole thing to the SourceForge Subversion repository.
The new JavaOCR SourceForge project is located here: http://javaocr.sourceforge.net
As always, I'm interested in your feedback, suggestions for improvement, use cases, success stories, or whatever.
Enjoy!