OCR using Tesseract in C#

Tesseract is one of the most accurate open source OCR engines. It converts the text in a given image into plain text. Before going to the code, we need to download the Tesseract assembly and its tessdata language files. Both can be downloaded from GitHub or NuGet.

After downloading the assembly, add a reference to it in your project.

After importing the tessnet2 API, we can see the files that have been added to the project, as in the following image.


Then add the following using directives and code.

  using System;
  using System.Drawing;
  using System.Drawing.Drawing2D;
  using System.Drawing.Imaging;
  using tessnet2;

  // Now add the following C# lines in the code page (inside a method such as Main)
  var image = new Bitmap(@"Z:\NewProject\demo\image.bmp");
  var ocr = new Tesseract();
  // Load the English language data from the tessdata folder
  ocr.Init(@"Z:\NewProject\How to use Tessnet2 library\C#\tessdata", "eng", false);
  var result = ocr.DoOCR(image, Rectangle.Empty);
  foreach (tessnet2.Word word in result)
  {
      Console.WriteLine(word.Text);
  }
Let me explain the code line by line.

First, we create a variable for the bitmap image. Here I have hard-coded the location of the bitmap image, but we could also use an open file dialog box to get the image location.
  var image = new Bitmap(@"Z:\NewProject\demo\image.bmp");
After loading the image, I create a new Tesseract object. This lets us use the methods available in the tessnet2 DLL file.
  var ocr = new Tesseract();
The Init call then loads the language data from the tessdata folder; "eng" selects English.
  ocr.Init(@"Z:\NewProject\How to use Tessnet2 library\C#\tessdata", "eng", false);
Then I create another variable, result, which runs OCR on the given image and holds the recognized text.
  var result = ocr.DoOCR(image, Rectangle.Empty);
Now the result variable holds all the words recognized in the image. I then use a foreach loop to extract the values one by one.
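The steps explained above can be combined into one small console program. This is only a sketch: it assumes the project references the tessnet2 assembly, reuses the paths from this article (change them for your machine), and assumes that tessnet2.Word exposes a Confidence value alongside Text, as the tessnet2 sample code suggests.

```csharp
using System;
using System.Drawing;
using tessnet2;

class OcrDemo
{
    static void Main()
    {
        // Paths taken from this article; adjust them for your own machine.
        const string imagePath = @"Z:\NewProject\demo\image.bmp";
        const string tessdataPath = @"Z:\NewProject\How to use Tessnet2 library\C#\tessdata";

        // Bitmap is disposable, so release it when OCR is finished.
        using (var image = new Bitmap(imagePath))
        {
            var ocr = new Tesseract();
            ocr.Init(tessdataPath, "eng", false);

            // Rectangle.Empty asks the engine to process the whole image.
            var result = ocr.DoOCR(image, Rectangle.Empty);

            foreach (tessnet2.Word word in result)
            {
                // Confidence is assumed to be available on tessnet2.Word.
                Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
            }
        }
    }
}
```

Wrapping the Bitmap in a using block is a design choice of this sketch and not something the library requires; it simply frees the image file once the words have been printed.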

We can also extract only the digits by using the following code.
  ocr.SetVariable("tessedit_char_whitelist", "0123456789");
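Where the whitelist call goes can matter. The sketch below follows the order used in the tessnet2 sample code, where SetVariable is called before Init. Treat that ordering as an assumption to verify against your version of the library. The class name DigitOcr and the method ReadDigits are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using tessnet2;

static class DigitOcr
{
    // Hypothetical helper: runs OCR restricted to the characters 0-9.
    public static List<tessnet2.Word> ReadDigits(Bitmap image, string tessdataPath)
    {
        var ocr = new Tesseract();

        // Restrict recognition to digits. Called before Init here,
        // mirroring the tessnet2 sample (an assumption, not a documented rule).
        ocr.SetVariable("tessedit_char_whitelist", "0123456789");
        ocr.Init(tessdataPath, "eng", false);

        return ocr.DoOCR(image, Rectangle.Empty);
    }

    static void Main()
    {
        // Paths taken from this article; adjust them for your own machine.
        using (var image = new Bitmap(@"Z:\NewProject\demo\image.bmp"))
        {
            var words = ReadDigits(image, @"Z:\NewProject\How to use Tessnet2 library\C#\tessdata");
            foreach (tessnet2.Word word in words)
            {
                Console.WriteLine(word.Text);
            }
        }
    }
}
```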
The following image is the sample input for OCR:

This is the OCR output for that image:


In the OCR output, each sentence in the image has been split into words. We can also store the values in an array or list.
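As a rough illustration of storing the values in a list, the sketch below copies the text of each recognized word into a List<string> and joins the words back into one string. It assumes DoOCR returns a collection of tessnet2.Word objects, as in the code earlier in this article. The names WordCollector and ToTextList are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using tessnet2;

static class WordCollector
{
    // Hypothetical helper: extracts just the text of each recognized word.
    public static List<string> ToTextList(IEnumerable<tessnet2.Word> words)
    {
        return words.Select(w => w.Text).ToList();
    }

    static void Main()
    {
        // Paths taken from this article; adjust them for your own machine.
        using (var image = new Bitmap(@"Z:\NewProject\demo\image.bmp"))
        {
            var ocr = new Tesseract();
            ocr.Init(@"Z:\NewProject\How to use Tessnet2 library\C#\tessdata", "eng", false);

            List<string> texts = ToTextList(ocr.DoOCR(image, Rectangle.Empty));

            // Rebuild a single string from the individual words.
            Console.WriteLine(string.Join(" ", texts));
        }
    }
}
```

Use texts.ToArray() instead if an array is preferred over a list.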
