OCR Using Tesseract In C#

Introduction

OCR stands for optical character recognition. It converts images of text, such as scanned documents, printed pages, and handwritten documents, into machine-encoded text. Tesseract is one of the most accurate OCR engines, and it allows us to extract the text from a given image.

1. Creating a New Project

Open Visual Studio, go to the File menu, select "New Project", and then select Console Application, Windows Forms, or WPF Application. The library can be used in all of these application types, as well as in Web Forms, MVC, and MVC Core applications, on both .NET Framework and .NET Core.


Enter a project name and select the file path in the appropriate text box in Visual Studio. Also select the required .NET Framework version, then click the "Create" button. Visual Studio will now generate the structure for the selected application. If you selected the console application, it will open the Program.cs file, where you can enter the code and build and run the application.

2. Install from NuGet Package Manager

The NuGet Package Manager allows us to search for, download, and install packages from the NuGet server. We can also update a package or switch between its versions.

Next, go to the Solution Explorer in Visual Studio and right-click the project. A menu will pop up. Select the NuGet package manager option from the menu and search for IronOcr as a keyword. Select the first result in the NuGet package manager dialog and click the install option.


Alternatively, in Visual Studio go to Tools -> NuGet Package Manager -> Package Manager Console.

Enter the following command in the Package Manager Console tab to install the NuGet package:

Install-Package IronOcr

Press the Enter key and it will install the NuGet package in the Visual Studio project.

Visit the NuGet package page (below) to learn more about the IronOcr Tesseract library and its versions.

https://www.nuget.org/packages/IronOcr/2022.1.0

Next, the NuGet Package Manager will download all the DLL files and add references to them in the current (default) project in Visual Studio. The project is now ready to use the library in code. The library supports all the .NET frameworks.

3. Optical Character Recognition Using the Tesseract Engine

The Tesseract optical character recognition engine is written in C++, so a C++ runtime environment is required to run it. Tesseract uses the Leptonica library to open input images, relying on Leptonica and its built-in support for zlib and the PNG and TIFF image formats. Tesseract OCR can support more than 15 types of image format.

Before we write the code for the OCR process, we need to import the IronOCR library with a "using" statement. The Tesseract 5 example below shows how to convert an image into text. We can set a primary language and secondary languages with Iron Tesseract. Once we set the language, the engine reads only the specified languages. Tesseract OCR also provides multiple model options for a single language, which allows us to choose between normal, best, and fast. The best models favor accuracy at the cost of speed, the fast models favor speed at some cost in accuracy, and the normal models sit between the two. We can choose whichever suits the requirement.

Text in any other language in the image will be treated as unknown characters. With Iron Tesseract we can also specify the Tesseract version, such as Tesseract 5, Tesseract 4, or Tesseract 3.
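The language and version settings described here can be sketched as follows. This is a minimal example rather than a definitive configuration. It assumes the IronOcr package is installed, that the German language pack (an additional NuGet package) is available for the secondary language, and that OCR_Test.png is a placeholder for your own image.

```csharp
using System;
using IronOcr;

class LanguageExample
{
    static void Main()
    {
        var Ocr = new IronTesseract();

        // Primary language used for recognition
        Ocr.Language = OcrLanguage.English;

        // Assumption: the German language pack is installed; any other
        // installed language could be used as the secondary language instead
        Ocr.AddSecondaryLanguage(OcrLanguage.German);

        // Select the Tesseract engine version
        Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;

        using (var Input = new OcrInput())
        {
            Input.AddImage(@"OCR_Test.png"); // placeholder image path
            var Result = Ocr.Read(Input);
            Console.WriteLine(Result.Text);
        }
    }
}
```

To try a fast or best model, a value such as OcrLanguage.EnglishFast or OcrLanguage.EnglishBest could be assigned to Ocr.Language instead, assuming the matching language pack is installed.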

We can also adjust settings to improve the accuracy of the Tesseract OCR. For example, we can add blacklist characters, which can increase speed and accuracy. A blacklist is a set of characters, typically special characters, that the Tesseract engine is not allowed to output. When the engine meets an unclear character, it will never resolve it to one of the blacklisted characters.

These settings are optional and are not required to use the OCR. However, using them can improve the results of the OCR process with Tesseract.
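A blacklist could be configured roughly as shown here. This is a sketch under stated assumptions: the set of characters is only an illustration and should be adapted to the documents being read, and OCR_Test.png is a placeholder path.

```csharp
using System;
using IronOcr;

class BlacklistExample
{
    static void Main()
    {
        var Ocr = new IronTesseract();
        Ocr.Language = OcrLanguage.English;

        // Illustrative blacklist: the engine will not emit these characters.
        // Choose the set to match your own documents.
        Ocr.Configuration.BlackListCharacters = "~`$#^*_{}[]|\\";

        using (var Input = new OcrInput())
        {
            Input.AddImage(@"OCR_Test.png"); // placeholder image path
            var Result = Ocr.Read(Input);
            Console.WriteLine(Result.Text);
        }
    }
}
```

A blacklist like this is most useful when the source documents are known not to contain the listed symbols, for example plain prose scanned from a book.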

4. Using the Tesseract Engine for Images

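            // Requires "using System;" and "using IronOcr;" at the top of the file
            var Ocr = new IronTesseract();
            // Read English text with the Tesseract 5 engine
            Ocr.Language = OcrLanguage.English;
            Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
            using (var Input = new OcrInput())
            {
                // Add the image to read; more images can be added the same way
                Input.AddImage(@"OCR_Test.png");
                var Result = Ocr.Read(Input);
                // Print the recognized text
                Console.WriteLine(Result.Text);
            }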

The example above uses the Tesseract 5 API to convert an image file into text. First, we create an IronTesseract object. We then create an OcrInput object, which can hold one or more image files. The OcrInput method "AddImage" takes the path of the image to add, and we can call it for any number of images. Finally, the "Read" function of the IronTesseract object parses the input and returns an OCR result, whose Text property contains the extracted text as a string.

Iron Tesseract also allows us to read multi-frame images. There is a separate method for this operation: "AddMultiFrameTiff". The library reads the frames in order, one after another, and treats each frame as a single page. This method supports only the TIFF image format.
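Reading a multi-frame TIFF might look like the following sketch. The file name multi-frame.tiff is a placeholder, and the per-page loop assumes that the result object in the installed IronOcr version exposes a Pages collection with PageNumber and Text members.

```csharp
using System;
using IronOcr;

class MultiFrameTiffExample
{
    static void Main()
    {
        var Ocr = new IronTesseract();

        using (var Input = new OcrInput())
        {
            // Each frame of the TIFF is added as a separate page
            Input.AddMultiFrameTiff("multi-frame.tiff"); // placeholder path
            var Result = Ocr.Read(Input);

            // Assumption: Result.Pages holds one entry per frame
            foreach (var Page in Result.Pages)
            {
                Console.WriteLine("Page " + Page.PageNumber + ":");
                Console.WriteLine(Page.Text);
            }
        }
    }
}
```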

5. Using the Tesseract Engine for PDF

We can also read PDF files using OcrInput. The Iron Tesseract class reads every page of the document and extracts the text from each page. PDFs are added with a separate method called "AddPdf", which also accepts a password if the document is protected. The following code example shows how to open a protected PDF file:

            var Ocr = new IronTesseract(); // nothing to configure
            using (var Input = new OcrInput())
            {
                Input.AddPdf("example.pdf", "password");   
                var Result = Ocr.Read(Input);
                Console.WriteLine(Result.Text);
            }

Iron Tesseract also provides two methods for reading selected PDF pages. They are:

  • AddPdfPage
  • AddPdfPages

"AddPdfPage" allows us to read and extract text from a single page of a PDF document. We just need to specify the page number from which we wish to extract text. "AddPdfPages" allows us to extract text from multiple specified pages, which we pass as an IEnumerable<int>. In both cases we also need to pass the file path, including the file extension. The following code example shows how to do this:

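            // Requires "using System;", "using System.Collections.Generic;" and "using IronOcr;"
            IEnumerable<int> numbers = new List<int> { 2, 8, 10 };
            var Ocr = new IronTesseract();
            using (var Input = new OcrInput())
            {
                // Single page
                Input.AddPdfPage("example.pdf", 10);
                // Multiple pages
                Input.AddPdfPages("example.pdf", numbers);
                // Read everything that was added to the input
                var Result = Ocr.Read(Input);
                Console.WriteLine(Result.Text);
                // Save the recognized text to a file
                Result.SaveAsTextFile("ocrtext.txt");
            }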

We can save the result in a text file using the function called SaveAsTextFile, which writes the file to the given output path.

6. Using Tesseract OCR in a Single Line

If we want to perform OCR without declaring separate objects for the engine and the input, Iron Tesseract lets us do this in one statement. We can use the following line of code to perform the OCR.

var Result = new IronOcr.IronTesseract().Read("OCR_Test.png").Text;

The above code creates the IronTesseract object inline and passes the image path directly to "Read", so no OcrInput object is needed. It is very simple and easy to use. The following is the sample image that we used for this sample code:

Output

Conclusion

Iron Tesseract is simple and easy to use in the .NET environment. It supports both images and PDF documents. It also offers various options for improving the performance of the Tesseract OCR library. It supports many languages, including multiple languages in a single operation. To learn more about the Tesseract OCR,

IronOcr extends the Google Tesseract OCR engine with IronTesseract. It is one of the most powerful OCR libraries with a high degree of stability and accuracy. The Tesseract engine supports 125 languages, and custom language packages can also be created. IronOCR provides Tesseract OCR for Mac, Windows, Linux, and other platforms. It reads PDF and image files and extracts their text. The library supports both 32-bit and 64-bit systems, and Iron Tesseract supports .NET 5, .NET Standard, .NET Core, and .NET Framework.