Installation and configuration of TESSERACT-OCR
- General
Installation and configuration of TESSERACT-OCR
Optical character recognition (OCR)
Optical character recognition is also referred as text recognition. An OCR program extracts and re-purposes data from camera images, scanned document and image-only pdfs.
What is tesseract?
Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and documents without a text layer and outputs the document into a new searchable text file, PDF, or most other popular formats.
Installation Of Tesseract
1 |
apt-get install tesseract-ocr |
1 |
apt-get install tess |
By default, Tesseract will install the English language pack.
To install additional languages pack:
1 |
sudo apt install tesseract-ocr-heb |
To install all other available languages, run:
1 |
sudo apt install tesseract-ocr-all -y |
For Tesseract to work properly, we need to use the “convert” command. This is useful to convert between image formats and resize an image, crop, blur, dither, join, flip,blur, crop, despeckle, draw on,re-sample, etc. This tool is provided by Imagemagick:
1 |
sudo apt install imagemagick |
1 |
tesseract /filepath/file_name.jpg stdout |
Tess4j
Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats like JPEG, GIF, PNG, and BMP. Then, we can use the Tesseract class provided by tess4j to process the image.
We start with adding the Tess4J maven dependency to our project:
1 2 3 4 5 |
<dependency> <groupId>net.sourceforge.tess4j</groupId> <artifactId>tess4j</artifactId> <version>4.5.2</version> </dependency> |
Tesseract is a popular open source project for OCR.
With Tess4J we can access the Tesseract API in Java.
after this, here we need to create tesseract object:
1 |
Tesseract instance = new Tesseract(); |
It is very important for setting a path for training data, as Tesseract can provide highly inaccurate results. Fortunately, training data for Tesseract comes with its installation so all you need to do is look at the right place.
Here is how we set the training data path:
1 2 |
instance.setDatapath("/usr/local/Cellar/tesseract/4.0.0/share/tessdata"); instance.setLanguage("eng"); |
Next we will be telling Tesseract that the output we need is in the format something called as the HOCR format. Basically, HOCR format is a simple XML based format which contains two things:
- The text PDF document will contain
- The x and y coordinates of that text on each page. This means that a {DF document can be exactly drawn in the same manner back from an HOCR output
here how we enable hocr:
1 |
instance.setHocr(true); |
Here the whole code:-
1 2 3 4 5 6 7 8 9 10 |
public static void main(String[] args) throws TesseractException { File file = new File("//home//auriga//Downloads//image//sample_pan.jpg"); Tesseract it = new Tesseract(); it.setDatapath("/usr/local/Cellar/tesseract/4.0.0/share/tessdata"); it.setLanguage("eng"); it.setHocr(true); String text = it.doOCR(file); System.out.println(text); } |
INPUT:-
OUTPUT: –
There
are a variety of reasons you might not get good quality output from
Tesseract if the image has noise on thebackground.
Noise removal from image comes in the part of image processing.
In most of the cases, we get a noisy image and thus a very nosy output.
To deal with it we need to perform some processing on the image called Image processing.
Related content
Auriga: Leveling Up for Enterprise Growth!
Auriga’s journey began in 2010 crafting products for India’s