Use of Regex in OCR
What is Regex in Java?
Regex stands for “regular expression”.
Regex in Java is an API used to define patterns for searching or manipulating strings. A regular expression is a search pattern that describes what you are looking for, so that the required information can be located in the data. A simple example of a regex looks like

    String regex = "(.*)(\\d+)(.*)";

which can be used to find digits in an alphanumeric string.
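A minimal sketch of how this pattern behaves may help here. The class and method names (`GreedyGroupsDemo`, `groups`, `firstDigitRun`) are made up for illustration. Because the first `(.*)` is greedy, it swallows as many characters as possible, so the `(\\d+)` group ends up with only the last digit; if the whole run of digits is wanted, searching for `\\d+` on its own with `find()` is usually simpler.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyGroupsDemo {

    // Returns the three captured groups of "(.*)(\\d+)(.*)" for the input,
    // or an empty array if the input contains no digit at all.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("(.*)(\\d+)(.*)").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Returns the first complete run of digits, or "" if there is none.
    static String firstDigitRun(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", groups("abc123def"))); // abc12 | 3 | def
        System.out.println(firstDigitRun("abc123def"));              // 123
    }
}
```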
Regex API in Java
We can use the following statement to import the regex package in Java:

    import java.util.regex.*;

This package includes the following classes:
1. Pattern Class
It is the compiled representation of a regular expression. It is used to define a pattern for the regex engine.
Methods defined in the Pattern class
- static Pattern compile(String regex) : It compiles the given regular expression and returns a Pattern object.

    Pattern pattern1 = Pattern.compile("REGEX_EXPRESSION");

- Matcher matcher(CharSequence input) : It creates a matcher that matches the given input with the pattern.
- static boolean matches(String regex, CharSequence input) : It compiles the regex and matches the entire input against it. It returns true if the input matches, otherwise false.

    System.out.println(Pattern.matches(".s", "as")); // true (2nd char is s)
    System.out.println(Pattern.matches(".s", "mk")); // false (2nd char is not s)

- String[] split(CharSequence input): It splits the given input string around matches of the given pattern.
- String pattern() : It returns the regular expression from which this pattern was compiled.
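The `split` and `pattern` methods listed above can be sketched as follows. The class name `PatternSplitDemo`, the delimiter regex and the sample line are assumptions chosen for illustration only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternSplitDemo {

    // A comma with optional surrounding whitespace acts as the delimiter.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Splits a comma-separated line into its trimmed fields.
    static String[] splitFields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFields("john , 1990,  pune"))); // [john, 1990, pune]
        System.out.println(COMMA.pattern());                                     // \s*,\s*
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which matters when many strings are processed.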
2. Matcher Class
It is the regex engine that performs match operations on a character sequence by interpreting a Pattern.
Methods defined in the Matcher class
- boolean matches() : It checks whether the entire input sequence matches the pattern.
- boolean find() : It searches for the next sub-sequence of the input that matches the pattern; calling it repeatedly finds multiple occurrences in the given string.
- int start() : It returns the starting index of the sub-sequence that was matched.
- int end() : It returns the ending index of the matched sub-sequence, that is, the index just after its last character.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            String regex = "\\bjava\\b";
            String input = "java c python ruby java";
            Pattern pattern = Pattern.compile(regex);
            // to get a matcher object
            Matcher match = pattern.matcher(input);
            // initialize a count variable to count the matches
            int cnt = 0;
            // loop as long as another match is found
            while (match.find()) {
                cnt++;
                System.out.println("Matching number : " + cnt);
                System.out.println("start position : " + match.start());
                System.out.println("end position : " + match.end());
            }
        }
    }

Output:

    Matching number : 1
    start position : 0
    end position : 4
    Matching number : 2
    start position : 19
    end position : 23

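The difference between `matches()` and `find()` is easy to miss, so here is a small sketch of it. The class and method names are hypothetical. `matches()` succeeds only when the whole input fits the pattern, while `find()` succeeds when any part of the input does.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {

    static final Pattern DIGITS = Pattern.compile("\\d+");

    // True only when the whole input consists of digits.
    static boolean wholeInputIsDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // True when at least one run of digits occurs anywhere in the input.
    static boolean containsDigits(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputIsDigits("abc123")); // false
        System.out.println(containsDigits("abc123"));     // true
        System.out.println(wholeInputIsDigits("123"));    // true
    }
}
```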
3. PatternSyntaxException Class
It indicates a syntax error in a regex pattern.
Methods defined in the PatternSyntaxException class
- String getDescription() : It returns the description of the error.
- int getIndex() : It returns the error index.
- String getMessage() : It returns a multi-line string containing:
i) the description of the syntax error and its index,
ii) the erroneous regular-expression pattern,
iii) a visual indication of the error index within the pattern.
- String getPattern() : It returns the erroneous regex pattern.
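As a rough sketch of how these methods might be used, the snippet below tries to compile a pattern and reports the error details when compilation fails. The helper name `describeError` and the malformed sample pattern are assumptions for illustration; the exact wording of the description depends on the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {

    // Returns null when the regex compiles, otherwise a short error summary.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " at index " + e.getIndex()
                    + " in pattern " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        // "[a-z" is missing its closing bracket, so compilation fails.
        System.out.println(describeError("[a-z"));
        // A well-formed pattern compiles, so null is printed.
        System.out.println(describeError("[a-z]+"));
    }
}
```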
What are Regular Expressions Used for?
A regular expression is used when you need to find and replace a pattern in a string, or when you need to validate a form (a form may include data like date of birth, Aadhaar number, PAN card number, etc.). Depending on the circumstances, you can test your regex pattern in a number of different ways. Regex is widely used to define constraints on strings, for example in password and email validation.
Different Use Cases of Regex
1. Search and Replace
The first use case for regular expressions is searching for a particular pattern and then replacing it with something else.
Let's suppose we have an employee database.
If you take a quick look at the database, you can see there are some typos in the email and DOB fields: the DOB of john is not correct, and the email id of anuj is wrong. Now imagine having thousands of those records! It would be a hectic task to go over each record by hand and check its correctness.
If we are unsure whether all of the addresses are valid email addresses, we can use regex methods to make sure each one has the correct format.
If it does not, we can replace it with something else, either a null value or something of your choosing to indicate that the email is incorrect.
The following code checks the correctness of an email id:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class EmailValidator {
        public static void main(String[] args) {
            // allowed characters, then "@", then allowed domain characters
            Pattern emailValidatePattern = Pattern.compile(
                    "^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$", Pattern.CASE_INSENSITIVE);
            Matcher emailId = emailValidatePattern.matcher("anuj123@gmail.com");
            boolean matchFound = emailId.find();
            if (matchFound) {
                System.out.println("Email Id is correct");
            } else {
                System.out.println("Email Id is incorrect");
            }
        }
    }

Output:

    Email Id is correct

The regex for the email contains:
- ^ matches the start of the input.
- [a-zA-Z0-9+_.-] matches one character from the English alphabet (both cases), digits, “+”, “_”, “.” and “-” before the @ symbol.
- + indicates the repetition of the above-mentioned set of characters one or more times.
- @ matches itself.
- [a-zA-Z0-9.-] matches one character from the English alphabet (both cases), digits, “.” and “-” after the @ symbol.
- $ indicates the end of the input.
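The replace step described in this use case could look roughly like the sketch below, which swaps every address that fails the format check for a marker value. The class name `EmailCleaner`, the marker `INVALID_EMAIL` and the sample addresses are all hypothetical, and the pattern is the simple one explained above, not a complete email grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class EmailCleaner {

    // Allowed characters, then "@", then allowed domain characters.
    static final Pattern EMAIL = Pattern.compile("^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$");

    // Hypothetical placeholder written in place of a malformed address.
    static final String INVALID_MARKER = "INVALID_EMAIL";

    // Returns a copy of the list in which malformed addresses are replaced.
    static List<String> clean(List<String> emails) {
        List<String> cleaned = new ArrayList<>();
        for (String email : emails) {
            boolean valid = email != null && EMAIL.matcher(email).matches();
            cleaned.add(valid ? email : INVALID_MARKER);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> emails = Arrays.asList(
                "anuj123@gmail.com", "anuj123gmail.com", "john@@example.com");
        // prints [anuj123@gmail.com, INVALID_EMAIL, INVALID_EMAIL]
        System.out.println(clean(emails));
    }
}
```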
2. Validation
The other way to use regular expressions is to validate something. When we validate, we want to make sure the input follows the correct format. Form input is a good place to do this, so that a user gives you the proper format for each field.
Take, for instance, a user entering a phone number into a form.
You can use regex to write a function that makes certain that the input from the user is in the format we want. When working with databases, it’s important to have the same format for all the fields. It makes working with the data much easier.
For example
The following regular expression matches a simple phone number. A phone number in this example consists either of 7 digits in a row, or of 3 digits, a (white)space or a dash, and then 4 digits.

    public class Main {
        public static void main(String[] args) {
            // 3 digits, an optional whitespace or dash, then 4 digits
            String pattern = "\\d\\d\\d([-\\s])?\\d\\d\\d\\d";
            String s = "1233323322";
            System.out.println(s.matches(pattern)); // output: false
            s = "1233323";
            System.out.println(s.matches(pattern)); // output: true
            s = "123 3323";
            System.out.println(s.matches(pattern)); // output: true
        }
    }

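To keep every stored number in the same format, the validation can be combined with capture groups. The sketch below is one possible approach under the phone format described above; the class name `PhoneNormalizer` and the choice of `ddd-dddd` as the stored format are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneNormalizer {

    // Three digits, an optional whitespace or dash, then four digits.
    static final Pattern PHONE = Pattern.compile("(\\d{3})[-\\s]?(\\d{4})");

    // Returns the number as "ddd-dddd", or null if the input is not valid.
    static String normalize(String input) {
        Matcher m = PHONE.matcher(input);
        return m.matches() ? m.group(1) + "-" + m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("1233323"));    // 123-3323
        System.out.println(normalize("123 3323"));   // 123-3323
        System.out.println(normalize("1233323322")); // null
    }
}
```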
Use of Regex in OCR (Optical Character Recognition)
Regex can be used with OCR to fetch particular data by finding a pattern in the string obtained from the OCR engine and extracting the matching part from that string.
The text obtained from an image processed by an OCR engine (like Tesseract) contains unwanted characters along with the actual data, so it is not possible to obtain the data at fixed index positions. To overcome this, the information can be obtained from the text by using regular expressions.
Let's take the following PAN card image
The text obtained from the OCR engine is as follows
As you can see, in the case of a PAN card the OCR engine provides the whole text in a single string. Along with the correct details, other unwanted characters are also present. In order to fetch a particular field such as the PAN number or date of birth, regex can be applied so that wherever the pattern of the PAN number or DOB is found, we can extract that substring from the original string obtained from the OCR engine.
Extracting PAN card number using regex
A PAN card number has exactly 10 characters, containing only digits 0-9 and upper case letters A-Z. Any PAN number has the following pattern:
- Five upper case letters [A-Z] occupying the first five positions, 1-5
- Four digits [0-9] occupying the next four positions, 6-9
- An upper case letter [A-Z] in the last position, 10
Using this pattern, a regular expression can be formed to check whether a PAN number is valid, or to locate the PAN number inside the OCR text.
The following method applies a regular expression for the above pattern:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public String getPanNumber(String str) {
        // 5 upper case letters, 4 digits, 1 upper case letter
        String regex = "[A-Z]{5}[0-9]{4}[A-Z]{1}";
        Pattern pattern = Pattern.compile(regex);
        Matcher match = pattern.matcher(str);
        String panNumber = "";
        if (match.find()) {
            // copy the matched sub-sequence out of the OCR text
            panNumber = str.substring(match.start(), match.end());
        }
        return panNumber;
    }

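A self-contained sketch of the same idea is shown below. The OCR string is invented for illustration and is not the output of a real card, and the class name `PanExtractor` is hypothetical. One variation from the method above is the added word boundaries (`\b`), an assumption that the PAN number appears as a separate token; they stop the pattern from matching inside a longer run of letters and digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PanExtractor {

    // Word boundaries keep the match from starting or ending inside a longer token.
    static final Pattern PAN = Pattern.compile("\\b[A-Z]{5}[0-9]{4}[A-Z]\\b");

    // Returns the first PAN-like token in the text, or "" if none is found.
    static String extractPan(String ocrText) {
        Matcher m = PAN.matcher(ocrText);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Hypothetical OCR output, invented for illustration only.
        String ocrText = "INCOME TAX DEPARTMENT ~ GOVT. OF INDIA | "
                + "Permanent Account Number ABCDE1234F .; 23/11/1985 Signature";
        System.out.println(extractPan(ocrText)); // ABCDE1234F
    }
}
```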
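Extracting DOB using regex
A PAN card shows the date of birth (DOB) in a format like dd/mm/yyyy.
In order to extract the DOB we apply the same procedure as above, with a regular expression for the date format. The following code checks whether a string is a date in this format: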

    import java.util.regex.*;

    public class Main {
        public static boolean dateValidator(String date) {
            // day 1-31 / month 1-12 / four-digit year starting with 19 or 20
            String regex = "(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)";
            Pattern pattern = Pattern.compile(regex);
            Matcher matcher = pattern.matcher(date);
            return matcher.matches();
        }

        public static void main(String[] args) {
            System.out.println(dateValidator("10/12/2016"));
            System.out.println(dateValidator("10/02/18"));
            System.out.println(dateValidator("34/02/2018"));
        }
    }

Output:

    true
    false
    false

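The validator above checks a string that already contains only the date. To pull the DOB out of a longer OCR string, the same date pattern can be used with `find()` instead of `matches()`. The sketch below is one way to do that. The OCR text and the class name `DobExtractor` are invented, and the two lookarounds `(?<!\\d)` and `(?!\\d)` are an addition of this sketch so that part of an invalid date such as `34/02/2018` is not picked up as `4/02/2018`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DobExtractor {

    // Same date pattern as the validator, guarded so that it cannot start
    // or end in the middle of a longer run of digits.
    static final Pattern DOB = Pattern.compile(
            "(?<!\\d)(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)(?!\\d)");

    // Returns the first dd/mm/yyyy date in the text, or "" if none is found.
    static String extractDob(String ocrText) {
        Matcher m = DOB.matcher(ocrText);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Hypothetical OCR output, invented for illustration only.
        String ocrText = "Name RAHUL KUMAR DOB 23/11/1985 ABCDE1234F";
        System.out.println(extractDob(ocrText));             // 23/11/1985
        System.out.println(extractDob("34/02/2018").isEmpty()); // true
    }
}
```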