How to Use Tesseract with C# for Image Text Extraction

Originally Posted On: https://www.linkedin.com/pulse/how-use-tesseract-c-image-text-extraction-dharshini-k-oaadc/

Introduction

Optical character recognition (OCR) software converts documents of various kinds, such as scanned images, PDFs, and digital camera photos, into editable and searchable data. It is widely used to digitize printed documents for editing, searching, and archiving, which also reduces the physical space that paper records consume. OCR plays a major part in automating data entry, saving businesses and organizations time and reducing manual effort. Modern OCR engines rely on machine learning and pattern recognition to read text accurately from images.

Recent developments have made OCR more accurate and extended its support to additional languages and complex scripts such as Arabic. OCR is an essential tool in finance, healthcare, law, and education, where large volumes of printed documents must be processed quickly. In this article, Tesseract will be used to OCR images in multiple languages.

How to Use Tesseract OCR?

1. Install the IronOCR/Tesseract package.
2. Create an instance of the IronTesseract class to initialize the OCR engine.
3. Specify the location of the image file you wish to process and build an OcrInput object.
4. Run OCR on the input image using the IronTesseract instance’s Read method.
5. Display the result on the console.

What is Tesseract?

Tesseract is an open-source optical character recognition engine originally developed by Hewlett-Packard and later developed further by Google. It is known for its accuracy and adaptability, which makes it one of the most popular OCR engines. Tesseract offers script detection, supports text recognition in many languages, and can recognize more than one language in the same image, so it is often used for multilingual document projects. Each language requires its own trained data file.

To turn characters, words, and sentences into machine-readable text, the Tesseract OCR engine analyzes the pixels of the image and matches them against the patterns it has learned. Tesseract can read many bitmap image formats, including TIFF, JPEG, and PNG, and can produce output as plain text, HTML (hOCR), and searchable PDF.

One of Tesseract’s most important strengths is that it can be trained on new fonts or new languages. It is also used in a wide range of programs, from simple text extraction to more complex tasks such as digitizing historical documents, processing invoices, and accessibility tools that help blind and visually impaired users read.

Creating a New Project in Visual Studio

Launch Visual Studio. When the application opens, open the “File” menu, select “New Project,” and then choose “Console Application.” In this article, we’ll use a console application to extract text from images.

Enter the project name and location in the text boxes. Then choose the .NET Framework version you require and click the Create button.

Once the version has been selected, Visual Studio generates the project structure and opens the Program.cs file, where you can add code and then build and run the application.

Install Tesseract OCR For .NET

The first step is to install the Tesseract OCR software on your machine. The Tesseract installation files and the language data for the tessdata folder can be found through the official Tesseract GitHub repository: https://github.com/tesseract-ocr/tesseract.

To install the .NET wrapper through NuGet, choose Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution in your Visual Studio project to launch the NuGet Package Manager. Search for “Tesseract” to locate the “Tesseract” or “Tesseract.NET” package, select it, and click the Install button to add it to your project.

OCR Image using Tesseract

After you have installed the Tesseract.NET wrapper, you must configure Tesseract in your C# application by specifying the path to the language data files and the languages to load. The example below walks through each step:

Importing Required Namespaces

using Tesseract;

This directive imports the Tesseract OCR library into the C# application, making its classes and methods accessible. The developer can then load images, initialize the OCR engine, and read text from images within the application.

Setting the Tesseract Data Path

string tessDataPath = @"./tessdata";

This line specifies the path to the Tesseract language data directory. The value @"./tessdata" is a relative path to a directory named tessdata, which should contain the traineddata files (e.g., eng.traineddata) that the OCR engine needs for each language it recognizes.

Loading the Input Image

string imagePath = @"path_to_your_image.png";
using (var img = Pix.LoadFromFile(imagePath))

These lines set the path to the image file and load the image for processing by Tesseract’s OCR engine. The image is held in the img variable, and the using statement guarantees that it is disposed of after use, so that memory and other resources are released promptly.

Initializing the Tesseract Engine with Multiple Languages

using (var engine = new TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default))

This line creates the Tesseract OCR engine from the given data path and loads English, Spanish, and French, so the tessdata folder must contain eng.traineddata, spa.traineddata, and fra.traineddata. It uses the default engine mode and sits inside a using statement so that the engine’s resources are released automatically once processing is finished.
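
The language argument is a list of language codes joined with "+", and each code needs a matching traineddata file. The following is a minimal sketch of a pre-flight check that can be run before creating the engine. It is not part of the Tesseract API: the class and method names (TessDataCheck, BuildLanguageString, FindMissingLanguageFiles) are hypothetical, and the sketch assumes each language code maps to a file named <code>.traineddata directly inside the tessdata folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TessDataCheck
{
    // Hypothetical helper: joins language codes into the "eng+spa+fra" form.
    public static string BuildLanguageString(IEnumerable<string> codes)
    {
        return string.Join("+", codes);
    }

    // Hypothetical helper: returns the traineddata file names that are
    // absent from tessDataPath for a language string such as "eng+spa+fra".
    public static List<string> FindMissingLanguageFiles(string tessDataPath, string languages)
    {
        var missing = new List<string>();
        foreach (string code in languages.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string fileName = code.Trim() + ".traineddata";
            if (!File.Exists(Path.Combine(tessDataPath, fileName)))
            {
                missing.Add(fileName);
            }
        }
        return missing;
    }

    static void Main()
    {
        string languages = BuildLanguageString(new[] { "eng", "spa", "fra" });
        List<string> missing = FindMissingLanguageFiles(@"./tessdata", languages);

        if (missing.Count == 0)
        {
            Console.WriteLine("All language files found for " + languages);
        }
        else
        {
            Console.WriteLine("Missing language files: " + string.Join(", ", missing));
        }
    }
}
```

Running a check like this first tends to give a clearer message than an engine initialization failure when a language file has been forgotten.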

Processing the Image and Extracting Text

using (var page = engine.Process(img))
{
    string text = page.GetText();
    Console.WriteLine("Recognized Text:");
    Console.WriteLine(text);
}

This block runs Tesseract OCR on the loaded image, retrieves the recognized text as a string, and writes it to the console. The using block guarantees that the page object is disposed of after use, which keeps resource management correct while running OCR.
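
Beyond the text itself, the page object can report how confident the engine was, which is useful for flagging scans that need a manual check. The sketch below uses the wrapper's GetMeanConfidence method, which returns a value between 0 and 1. The NeedsReview helper and its 0.6 threshold are assumptions made for illustration, not recommendations from the library, and the paths are the same placeholders used in the snippets above.

```csharp
using System;
using Tesseract;

class ConfidenceExample
{
    // Hypothetical helper: flags results whose mean confidence falls below
    // a threshold. The 0.6 default is an arbitrary, illustrative value.
    public static bool NeedsReview(float meanConfidence, float threshold = 0.6f)
    {
        return meanConfidence < threshold;
    }

    static void Main()
    {
        // Placeholder paths, as in the snippets above
        string tessDataPath = @"./tessdata";
        string imagePath = @"path_to_your_image.png";

        using (var engine = new TesseractEngine(tessDataPath, "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            // Mean confidence over the whole page, in the range 0 to 1
            float confidence = page.GetMeanConfidence();
            Console.WriteLine("Mean confidence: {0:P0}", confidence);

            if (NeedsReview(confidence))
            {
                Console.WriteLine("Low confidence: consider reviewing this result manually.");
            }

            Console.WriteLine(page.GetText());
        }
    }
}
```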

Full code:

using System;
using Tesseract;

class Program
{
    static void Main()
    {
        // Set the path to the Tesseract data files (traineddata files)
        string tessDataPath = @"./tessdata"; // Ensure this directory contains the language data files

        // Load the image
        string imagePath = @"path_to_your_image.png";
        using (var img = Pix.LoadFromFile(imagePath))
        {
            // Create the engine with English, Spanish, and French language data
            using (var engine = new TesseractEngine(tessDataPath, "eng+spa+fra", EngineMode.Default))
            {
                using (var page = engine.Process(img))
                {
                    // Extract the text
                    string text = page.GetText();
                    Console.WriteLine("Recognized Text:");
                    Console.WriteLine(text);
                }
            }
        }
    }
}

What is IronOCR?

IronOCR is a .NET OCR library. It extracts text from pictures, scanned documents, PDFs, and other visual media, and adds OCR functionality to .NET applications. In addition to building on the Tesseract engine for text recognition, IronOCR has a number of other features that make it suitable for use in commercial applications.

IronOCR focuses on usability and integration. Its intuitive API, thorough documentation, and sample applications help developers get up and running quickly. A variety of image formats and PDFs are supported, and OCR speed and accuracy are improved through built-in image preprocessing, noise removal, and error correction.

Install IronOCR

Enter the following commands in the Package Manager Console to install IronOCR along with the French and Spanish language packs.

Install-Package IronOcr
Install-Package IronOcr.Languages.French
Install-Package IronOcr.Languages.Spanish

Extracting Text from Images Using IronOCR

The following example shows how to use C# and IronOCR, which is built on the Tesseract engine, to recognize text in several languages from an image.

using System;
using IronOcr;

class Program
{
    static void Main(string[] args)
    {
        // Initialize IronTesseract engine
        var Ocr = new IronTesseract();
        
        // Set the primary language, then add the secondary languages
        Ocr.Language = OcrLanguage.English;
        Ocr.AddSecondaryLanguage(OcrLanguage.Spanish);
        Ocr.AddSecondaryLanguage(OcrLanguage.French);
        
        // Path to the image
        var inputFile = @"pathtoyourimage.png";
        
        // Read the image and perform OCR
        using (var input = new OcrInput(inputFile))
        {
            // Perform OCR
            var result = Ocr.Read(input);
            
            // Display the result
            Console.WriteLine("Text:");
            Console.WriteLine(result.Text);
        }
    }
}

This code shows how to perform OCR with IronOCR’s IronTesseract class. It first initializes the OCR engine and sets it up to recognize English, Spanish, and French. It then declares the path of the image file to be read, wraps the file in an OcrInput object, and performs OCR with the Read method.

The recognized text is contained in the result object and is written to the console. The using block disposes of the input once it has been processed. The process provides a quick and easy way to extract multilingual text from images.
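
The result object carries more than the plain text. The following is a minimal sketch, assuming an IronOCR version in which the result exposes a Confidence property and a SaveAsSearchablePdf method, and in which OcrInput accepts a file path as in the example above. The output file name searchable.pdf is arbitrary, and the image path is the same placeholder.

```csharp
using System;
using IronOcr;

class IronOcrResultExample
{
    static void Main()
    {
        var ocr = new IronTesseract();

        // Placeholder path, as in the example above
        var inputFile = @"pathtoyourimage.png";

        using (var input = new OcrInput(inputFile))
        {
            var result = ocr.Read(input);

            // Overall confidence score reported by IronOCR for this result
            Console.WriteLine("Confidence: " + result.Confidence);

            // Save a PDF with a searchable text layer; the file name is arbitrary
            result.SaveAsSearchablePdf("searchable.pdf");
        }
    }
}
```

Checking the confidence score before saving is one way to decide whether a scan should be preprocessed or rescanned first.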

Conclusion

Both IronOCR and Tesseract are excellent OCR libraries, but they serve different kinds of users. Tesseract is open source, multilingual, and freely available, which makes it a suitable solution for experienced developers who are comfortable with manual configuration and coding. It does, however, need more setup for functionality such as multi-language processing, reading from PDFs, or preprocessing images. IronOCR, by contrast, is easier to use and offers more capabilities out of the box.

It has native support for multiple languages, PDF and image OCR, low-code integration, and more accurate recognition of noisy or scanned documents. For ease of use, solid features, and commercial-grade reliability, IronOCR is the better option. For more information, see the IronOCR licensing page and the listing of other Iron Software products.