Installation and configuration of TESSERACT-OCR

Published on: 9 June 2022. By Yash Agarwal.

Optical character recognition (OCR)

Optical character recognition is also referred to as text recognition. An OCR program extracts and repurposes data from camera images, scanned documents, and image-only PDFs.

What is Tesseract?

Tesseract is an open source optical character recognition (OCR) engine. It extracts text from images and from documents that have no text layer, and writes the result to a new searchable text file, a PDF, or another common output format.

Installation of Tesseract

apt-get install tesseract-ocr

apt-get install tess

By default, Tesseract will install the English language pack.

To install an additional language pack, for example Hebrew:

sudo apt install tesseract-ocr-heb

To install all other available languages, run:

sudo apt install tesseract-ocr-all -y

Tesseract works best on clean, well-prepared images, so we also install the “convert” command. It converts between image formats and can resize, crop, blur, dither, join, flip, despeckle, draw on, or re-sample an image, among other operations. This tool is provided by ImageMagick:

sudo apt install imagemagick
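Similar preprocessing can also be done inside the JVM instead of with ImageMagick. The following is a minimal sketch, using only the standard java.awt.image classes, that converts an image to grayscale and scales it; the class name Preprocess, the method name toGrayscale, and the scale factor are illustrative assumptions and are not part of Tesseract or ImageMagick.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: not part of Tesseract, Tess4J, or ImageMagick.
class Preprocess {

    // Returns a grayscale copy of src, scaled by the given factor.
    static BufferedImage toGrayscale(BufferedImage src, double scale) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother when scaling.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Drawing into a TYPE_BYTE_GRAY image performs the color conversion.
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        // Blank stand-in image; in practice, load one with ImageIO.read(...).
        BufferedImage rgb = new BufferedImage(100, 40, BufferedImage.TYPE_INT_RGB);
        BufferedImage gray = toGrayscale(rgb, 2.0);
        System.out.println(gray.getWidth() + "x" + gray.getHeight());
    }
}
```

The result can be written back to disk with javax.imageio.ImageIO.write(gray, "png", file) and then passed to Tesseract like any other image.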

To run Tesseract from the terminal and print the recognized text to standard output:

tesseract /filepath/file_name.jpg stdout
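The same command can be driven from a program. As a minimal sketch, the Java snippet below builds the argument list shown above and runs it with ProcessBuilder, assuming the tesseract binary is on the PATH and Java 9 or later is in use. The class and method names (TesseractCli, buildCommand, run) are hypothetical; the optional -l flag selects the language pack.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the tesseract command-line tool.
class TesseractCli {

    // Builds: tesseract <image> stdout [-l <lang>]
    static List<String> buildCommand(String imagePath, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add("stdout");
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        return cmd;
    }

    // Runs the command and returns whatever tesseract wrote to standard output.
    static String run(String imagePath, String lang) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(imagePath, lang))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show tesseract warnings on our stderr
                .start();
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TesseractCli /filepath/file_name.jpg [lang]
        System.out.println(run(args[0], args.length > 1 ? args[1] : null));
    }
}
```

Shelling out like this keeps the project free of native bindings, at the cost of starting a new process for every image.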

Tess4J

Tess4J is a Java wrapper for the Tesseract APIs that provides OCR support for various image formats such as JPEG, GIF, PNG, and BMP. We can use the Tesseract class provided by Tess4J to process an image.

We start by adding the Tess4J Maven dependency to our project:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.2</version>
</dependency>

With Tess4J we can access the Tesseract API from Java. After adding the dependency, we need to create a Tesseract object:

Tesseract instance = new Tesseract();

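It is very important to set the path to the training data correctly, because Tesseract can produce highly inaccurate results without the right data. Fortunately, training data for Tesseract comes with its installation, so all you need to do is look in the right place.

Here is how we set the training data path. The path below comes from a Homebrew installation on macOS; with the apt packages installed earlier, the tessdata directory is usually somewhere under /usr/share/tesseract-ocr/ instead: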

instance.setDatapath("/usr/local/Cellar/tesseract/4.0.0/share/tessdata");
instance.setLanguage("eng");
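Because the tessdata location depends on how Tesseract was installed, it can help to check a few candidate directories before calling setDatapath. This is a minimal sketch using only java.nio.file; the class TessdataLocator and the candidate list are assumptions for illustration (only the first path comes from this article), so adjust them to your system.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: not part of Tess4J.
class TessdataLocator {

    // Returns the first candidate directory that contains <lang>.traineddata.
    static Optional<Path> find(List<Path> candidates, String lang) {
        for (Path dir : candidates) {
            if (Files.isRegularFile(dir.resolve(lang + ".traineddata"))) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Path> candidates = List.of(
                // Path used in this article (Homebrew on macOS).
                Paths.get("/usr/local/Cellar/tesseract/4.0.0/share/tessdata"),
                // Assumed apt location; the version directory varies by release.
                Paths.get("/usr/share/tesseract-ocr/4.00/tessdata"),
                // Assumed location on some other distributions.
                Paths.get("/usr/share/tessdata"));
        System.out.println(find(candidates, "eng")
                .map(Path::toString)
                .orElse("tessdata not found; set the path manually"));
    }
}
```

When a directory is found, its string form can be passed straight to instance.setDatapath(...).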

Next, we tell Tesseract that we need the output in hOCR format. hOCR is a simple XHTML-based format that contains two things:

The text that the document contains
The x and y coordinates of that text on each page. This means that a PDF document can be redrawn with the same layout from hOCR output.

Here is how we enable hOCR:

instance.setHocr(true);

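Here is the complete code:

import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrExample {
    public static void main(String[] args) throws TesseractException {
        // Image to run OCR on
        File file = new File("/home/auriga/Downloads/image/sample_pan.jpg");
        Tesseract it = new Tesseract();
        // Directory that holds the *.traineddata files
        it.setDatapath("/usr/local/Cellar/tesseract/4.0.0/share/tessdata");
        it.setLanguage("eng");
        // Return hOCR markup instead of plain text
        it.setHocr(true);
        String text = it.doOCR(file);
        System.out.println(text);
    }
}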
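
The hOCR string returned by doOCR can then be post-processed to recover each word and its coordinates. The following is a rough sketch using only java.util.regex. It assumes the word elements look like the ones Tesseract usually emits (a span with class ocrx_word whose title attribute starts with "bbox x0 y0 x1 y1", with class appearing before title); the class HocrWords and its Word type are hypothetical names, and a real XML or HTML parser would be more robust than a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Tess4J.
class HocrWords {

    // One recognized word and its bounding box in pixel coordinates.
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        @Override
        public String toString() {
            return text + " @ (" + x0 + "," + y0 + ")-(" + x1 + "," + y1 + ")";
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*"
                    + "title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>"
                    + "(.*?)</span>",
            Pattern.DOTALL);

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop nested markup such as <strong>; HTML entities are left untouched.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed shape, not real Tesseract output.
        String sample =
                "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
              + "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 185 116; x_wconf 91'>World</span>";
        for (Word w : parse(sample)) {
            System.out.println(w);
        }
    }
}
```

In the program above, the same method could be called as HocrWords.parse(text) right after doOCR returns.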