Extract information from documents using Azure Form Recognizer with Python

Introduction

In today's digital era, businesses deal with an enormous amount of paperwork, ranging from invoices and receipts to forms and surveys. Manual data entry can be time-consuming, error-prone, and resource-intensive. However, Azure Form Recognizer, a powerful AI-based service by Microsoft, offers a solution to streamline and automate this process. In this article, we'll explore how you can leverage the capabilities of Azure Form Recognizer using Python, enabling you to extract valuable information from forms effortlessly.

What is Azure Form Recognizer?

Azure Form Recognizer is a cloud-based service that utilizes machine learning algorithms to automatically extract key-value pairs, tables, and text from documents. It employs optical character recognition (OCR) technology, allowing businesses to digitize and process large volumes of forms efficiently. The service can handle various document types, including invoices, receipts, business cards, and more, making it a versatile tool for document processing.

Setting up Azure Form Recognizer resource

Go to the Azure portal, search for Form Recognizer, and click Create.

Choose the subscription, resource group, region, and pricing tier, and enter a resource name. Then click Review + create.

Once the resource is created, go to Keys and Endpoint to copy your credentials.
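Rather than pasting the key directly into a notebook, one option is to read both values from environment variables. The sketch below assumes two hypothetical variable names, FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY; Azure does not define them, so you can pick any names you like.

```python
import os


def load_credentials(env=os.environ):
    """Return (endpoint, key) read from environment variables.

    The variable names are illustrative assumptions, not Azure conventions.
    """
    endpoint = env.get("FORM_RECOGNIZER_ENDPOINT", "").strip()
    key = env.get("FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError(
            "Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY before running."
        )
    return endpoint, key
```

The returned pair can then be used wherever the endpoint and API key are needed, which keeps the secret out of the source file.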

Getting Started with Azure Form Recognizer on Python

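You need to install the Azure AI Form Recognizer SDK. The examples in this article also use NumPy and pandas to display the results. You can install everything by running the following command in your Python environment:

pip install azure-ai-formrecognizer numpy pandas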

Next, import the required libraries and create a client that authenticates with your endpoint and API key.

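from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from IPython.display import display  # built in to notebooks; imported explicitly for use elsewhere
import numpy as np
import pandas as pd

# Replace the placeholders with the values from Keys and Endpoint.
ENDPOINT = "<YOUR_ENDPOINT>"
APIKEY = "<YOUR_API_KEY>"

document_analysis_client = DocumentAnalysisClient(ENDPOINT, credential=AzureKeyCredential(APIKEY))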

We'll use the document_analysis_client to extract information from different types of documents using the following prebuilt models:

Invoices
Receipts
Business cards
Identity documents

Visit this page to learn about all the models that Azure Form Recognizer offers.

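We'll create the following utility functions. The first checks whether a value is an object with attributes rather than a simple value, and the second converts a confidence score between 0 and 1 into a percentage rounded to two decimals: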

def is_class(o):
  return hasattr(o, '__dict__')


def get_valid_rounded_value(val):
  return round(val * 100, 2) if val else None
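One detail worth knowing: a truthiness test such as `if val` treats a confidence of exactly 0.0 the same as a missing one. If that distinction matters to you, a variant that checks for None explicitly could look like the sketch below (to_percent is a hypothetical name, not part of the SDK).

```python
def to_percent(confidence):
    """Convert a 0-1 confidence score to a percentage, keeping 0.0 as 0.0."""
    if confidence is None:
        return None
    return round(confidence * 100, 2)


print(to_percent(0.935))  # 93.5
print(to_percent(0.0))    # 0.0
print(to_percent(None))   # None
```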

Let's start testing the prebuilt models. We'll create two more functions: one to analyze a document and another to print a table with the extracted information:

def get_poller_result(path: str, model_id: str):
  with open(path, "rb") as f:
    poller = document_analysis_client.begin_analyze_document(
        model_id, document=f, locale="en-US",
      )

    return poller.result()
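If you want to exercise this kind of function without calling the service (for example, in a unit test), one approach is to pass the client in as a parameter so that a stand-in object can replace it. This is only a sketch under that assumption: analyze_file, FakePoller, and FakeClient are hypothetical names, and only begin_analyze_document and result() mirror the calls used above.

```python
import os
import tempfile


def analyze_file(client, path, model_id, locale="en-US"):
    """Send a local file to the given client and wait for the result."""
    with open(path, "rb") as f:
        poller = client.begin_analyze_document(model_id, document=f, locale=locale)
        # result() blocks until the long-running operation finishes.
        return poller.result()


class FakePoller:
    """Stand-in for the SDK poller; returns a canned result."""
    def __init__(self, result):
        self._result = result

    def result(self):
        return self._result


class FakeClient:
    """Stand-in client that records what it was asked to analyze."""
    def __init__(self):
        self.calls = []

    def begin_analyze_document(self, model_id, document, locale=None):
        self.calls.append((model_id, len(document.read()), locale))
        return FakePoller({"model_id": model_id})


# Exercise the function against the stand-in with a temporary file.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"fake image bytes")

fake = FakeClient()
outcome = analyze_file(fake, tmp.name, "prebuilt-invoice")
os.remove(tmp.name)
print(outcome, fake.calls)
```

With the real client, the same function would be called as analyze_file(document_analysis_client, path, model_id).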


def print_generic_table(items, is_business_card: bool = False):
  if not is_business_card:
    array: list = []
    for name, field in items:
      if name == 'MachineReadableZone':
        continue

      if field.value is not None and not is_class(field.value) and not type(field.value) is list:
        array.append([name, field.value, get_valid_rounded_value(field.confidence)])

    if len(array) > 0:
      np_array = np.array(array)
      df = pd.DataFrame(np_array, columns = ['Field', 'Value', '% Confidence'])
      display(df)
  else:
    for name, field in items:
      # Business card fields are lists of sub-items (one per detected value).
      if field.value is not None and isinstance(field.value, list):
        array: list = []
        for idx, sub_item in enumerate(field.value):
          if sub_item.value is not None and not is_class(sub_item.value) and not isinstance(sub_item.value, list):
            if name == 'ContactNames':
              # Each contact name is a dictionary with FirstName and LastName sub-fields.
              for sub_field in ['FirstName', 'LastName']:
                sub_item_details = sub_item.value.get(sub_field)
                if sub_item_details:
                  array.append(['{} {}'.format(sub_field, idx + 1), sub_item_details.value, get_valid_rounded_value(sub_item_details.confidence)])
            else:
              array.append(['{} {}'.format(name, idx + 1), sub_item.value, get_valid_rounded_value(sub_item.confidence)])
          elif name == 'Addresses':
            # Address values are objects, so show the raw text content instead.
            array.append(['{} {}'.format(name, idx + 1), sub_item.content, get_valid_rounded_value(sub_item.confidence)])

        if len(array) > 0:
          display(name)
          np_array = np.array(array)
          df = pd.DataFrame(np_array, columns = ['Field', 'Value', '% Confidence'])
          display(df)

Invoices

The prebuilt invoice model analyzes and extracts key fields and line items from sales invoices, utility bills, and purchase orders. Invoices can come in various formats and quality levels, including phone-captured images, scanned documents, and digital PDFs. The API analyzes the invoice text, extracts key information such as customer name, billing address, due date, and amount due, and returns a structured JSON representation.

To learn about the supported languages, extracted fields, and more, visit this page.

Let's test the Invoice model. We can pass an image or PDF with one or more invoices.

invoices = get_poller_result("invoices/invoice_sample.png", "prebuilt-invoice")

def print_products_table(items, document_type: str):
  # Column names differ between the invoice and receipt models.
  if document_type == 'invoice':
    fields = ["ProductCode", "Description", "Quantity", "Unit", "UnitPrice", "Tax", "Amount"]
  elif document_type == 'receipt':
    fields = ["ProductCode", "Description", "Quantity", "QuantityUnit", "Price", "TotalPrice"]
  else:
    raise ValueError("Unsupported document type: {}".format(document_type))

  array: list = []
  for item in items:
    # Each line item is a dictionary of fields; missing fields become None.
    current_row = []
    for field in fields:
      current_item = item.value.get(field)
      current_row.append(current_item.value if current_item else None)

    array.append(current_row)

  # Passing the list directly also works when there are no rows.
  df = pd.DataFrame(array, columns = fields)
  display(df)


def print_invoices_details(invoices):
  for idx, invoice in enumerate(invoices.documents):
    display("-------- Recognizing invoice #{} --------".format(idx + 1))
    items = invoice.fields.items()
    print_generic_table(items)

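    # Not every invoice has line items, so guard against a missing field.
    invoice_items = invoice.fields.get("Items")
    if invoice_items is not None and invoice_items.value:
      display("Invoice products:")
      print_products_table(invoice_items.value, 'invoice')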

We created the print_products_table method to print the products for invoices and receipts.

Call the print_invoices_details method and pass the invoices.

print_invoices_details(invoices)

-------- Recognizing invoice #1 --------

Field	Value	% Confidence
BillingAddressRecipient	Microsoft Finance	93.5
CustomerAddressRecipient	Microsoft Corp	93.2
CustomerId	CID-12345	94.3
CustomerName	MICROSOFT CORPORATION	89.6
DueDate	2019-12-15	97.1
InvoiceDate	2019-11-15	97.1
InvoiceId	INV-100	96.4
PurchaseOrder	PO-3333	94.3
RemittanceAddressRecipient	Contoso Billing	93.4
ServiceAddressRecipient	Microsoft Services	93.2
ServiceEndDate	2019-11-14	95.4
ServiceStartDate	2019-10-14	95.8
ShippingAddressRecipient	Microsoft Delivery	93.2
VendorAddressRecipient	Contoso Headquarters	93.2
VendorName	CONTOSO LTD.	93.0

Invoice products:

ProductCode	Description	Quantity	Unit	UnitPrice	Tax	Amount
A123	Consulting Services	2.0	hours	$30.0	$6.0	$60.0
B456	Document Fee	3.0	None	$10.0	$3.0	$30.0
C789	Printing Fee	10.0	pages	$1.0	$1.0	$10.0
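Since the result is structured data, you may prefer the extracted fields as a plain dictionary that can be serialized to JSON rather than a displayed table. The sketch below assumes only that each field exposes value and confidence attributes, as the code above relies on. fields_to_dict is a hypothetical helper, and SimpleNamespace objects stand in for real SDK fields.

```python
import json
from datetime import date
from types import SimpleNamespace


def fields_to_dict(fields):
    """Flatten simple fields into {name: {"value": ..., "confidence": ...}}."""
    out = {}
    for name, field in fields.items():
        value = field.value
        # Skip nested structures (lists, dicts, objects); keep scalars only.
        if value is None or isinstance(value, (list, dict)) or hasattr(value, "__dict__"):
            continue
        out[name] = {"value": value, "confidence": field.confidence}
    return out


# Stand-in data shaped like the invoice output shown above.
sample = {
    "InvoiceId": SimpleNamespace(value="INV-100", confidence=0.964),
    "DueDate": SimpleNamespace(value=date(2019, 12, 15), confidence=0.971),
    "Items": SimpleNamespace(value=[], confidence=None),
}

# default=str turns dates into ISO strings such as "2019-12-15".
print(json.dumps(fields_to_dict(sample), indent=2, default=str))
```

With a real result, the same helper could be applied to invoice.fields for each document in invoices.documents.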

Receipts

The prebuilt receipt model analyzes and extracts key information from sales receipts. Receipts can come in various formats and quality levels, including printed and handwritten receipts. The API extracts key information such as merchant name, merchant phone number, transaction date, tax, and transaction total, and returns structured JSON data.

To learn about the supported languages, extracted fields, and more, visit this page.

Let's test the Receipt model.

receipts = get_poller_result("receipts/receipt_sample.png", "prebuilt-receipt")

def print_receipts_details(receipts):
  for idx, receipt in enumerate(receipts.documents):
    print("-------- Recognizing receipt #{} --------".format(idx + 1))
    items = receipt.fields.items()
    print_generic_table(items)

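    # Not every receipt has line items, so guard against a missing field.
    receipt_items = receipt.fields.get("Items")
    if receipt_items is not None and receipt_items.value:
      display("Receipt products:")
      print_products_table(receipt_items.value, 'receipt')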

Call the print_receipts_details method and pass the receipts.

print_receipts_details(receipts)

-------- Recognizing receipt #1 --------

Field	Value	% Confidence
MerchantName	Contoso	98.5
MerchantPhoneNumber	+11234567890	98.9
Subtotal	1098.99	99.0
Total	1203.39	95.9
TotalTax	104.4	99.0
TransactionDate	2019-06-10	98.9
TransactionTime	13:59:00	99.5

Receipt products:

ProductCode	Description	Quantity	QuantityUnit	Price	TotalPrice
None	Surface Pro 6	1.0	None	None	999.0
None	SurfacePen	1.0	None	None	99.99
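A quick sanity check on extracted amounts is to confirm that the subtotal plus tax matches the total. With the values from the receipt above, 1098.99 + 104.4 = 1203.39, so the fields agree. A minimal sketch of that check might look like this; totals_match is a hypothetical helper, and the one-cent tolerance is an assumption to absorb floating-point rounding.

```python
import math


def totals_match(subtotal, tax, total, tolerance=0.01):
    """Return True when subtotal + tax equals total within the tolerance."""
    return math.isclose(subtotal + tax, total, abs_tol=tolerance)


print(totals_match(1098.99, 104.4, 1203.39))  # True
print(totals_match(1098.99, 104.4, 1300.00))  # False
```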

Business cards

The prebuilt business card model analyzes and extracts data from business card images. The API analyzes printed business cards, extracts key information such as first name, last name, company name, email address, and phone number, and returns a structured JSON representation.

To learn about the supported languages, extracted fields, and more, visit this page.

Let's test the Business card model.

business_cards = get_poller_result("business_cards/bizcard.jpg", "prebuilt-businessCard")

def print_business_cards_details(business_cards):
  for idx, business_card in enumerate(business_cards.documents):
    print("-------- Analyzing business card #{} --------".format(idx + 1))
    items = business_card.fields.items()
    print_generic_table(items, True)

Call the print_business_cards_details method and pass the business cards.

print_business_cards_details(business_cards)

-------- Analyzing business card #1 --------
Addresses

Field	Value	% Confidence
Addresses 1	4001 1st Ave NE Redmond, WA 98052	96.9

CompanyNames

Field	Value	% Confidence
CompanyNames 1	CONTOSO	40.0

ContactNames

Field	Value	% Confidence
FirstName 1	Chris	98.9
LastName 1	Smith	99.0

Departments

Field	Value	% Confidence
Departments 1	Cloud & AI Department	97.3

Emails

Field	Value	% Confidence
Emails 1	[email protected]	98.9

Faxes

Field	Value	% Confidence
Faxes 1	+19873126745	98.8

JobTitles

Field	Value	% Confidence
JobTitles 1	Senior Researcher	98.8

MobilePhones

Field	Value	% Confidence
MobilePhones 1	+19871234567	98.8

Websites

Field	Value	% Confidence
Websites 1	https://www.contoso.com/	98.9

WorkPhones

Field	Value	% Confidence
WorkPhones 1	+19872135674	98.5
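The confidence column is useful for deciding which values need human review: in the output above, CompanyNames scored 40.0 while all other fields scored above 96. A minimal sketch of such a filter follows; the 80% threshold and the needs_review name are assumptions, so tune them to your own tolerance for errors.

```python
def needs_review(rows, threshold=80.0):
    """Return rows whose confidence is missing or below the threshold.

    Each row is (field_name, value, percent_confidence).
    """
    return [
        row for row in rows
        if row[2] is None or row[2] < threshold
    ]


rows = [
    ("CompanyNames 1", "CONTOSO", 40.0),
    ("FirstName 1", "Chris", 98.9),
    ("LastName 1", "Smith", 99.0),
]
print(needs_review(rows))  # [('CompanyNames 1', 'CONTOSO', 40.0)]
```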

Identity documents

The prebuilt ID document model analyzes and extracts key information from identity documents. The API analyzes identity documents, including the following, and returns a structured JSON representation:

US Drivers Licenses (all 50 states and District of Columbia)
International passport biographical pages
US state IDs
Social Security cards
Permanent resident cards

To learn about the supported languages, extracted fields, and more, visit this page.

Let's test the ID document model.

id_documents = get_poller_result("identity_documents/various_id_cards.pdf", "prebuilt-idDocument")

def print_id_documents_details(id_documents):
  for idx, id_document in enumerate(id_documents.documents):
    print("-------- Recognizing ID document #{} --------".format(idx + 1))
    items = id_document.fields.items()
    print_generic_table(items)

Call the print_id_documents_details method and pass the id documents.

print_id_documents_details(id_documents)

-------- Recognizing ID document #1 --------

Field	Value	% Confidence
DateOfExpiration	2031-08-01	98.2
FirstName	Willeke Liselotte	None

-------- Recognizing ID document #2 --------

Field	Value	% Confidence
DateOfExpiration	2023-06-11	99.0
DocumentNumber	GDC000001	99.0
FirstName	ÅSAMUND SPECIMEN	None
LastName	ØSTENBYEN	None

-------- Recognizing ID document #3 --------

Field	Value	% Confidence
DateOfBirth	1981-01-01	99.0
DateOfExpiration	2019-11-29	99.0
DateOfIssue	2009-11-30	99.0
DocumentNumber	C03005988	99.0
FirstName	HAPPY	99.5
LastName	TRAVELER	99.5
Nationality	USA	99.0
PlaceOfBirth	NEW YORK. U.S.A.	99.0
Sex	M	99.0

-------- Recognizing ID document #4 --------

Field	Value	% Confidence
DateOfBirth	2023-05-18	80.6
DateOfExpiration	2023-03-24	85.2
DocumentNumber	0018-5978	86.6
FirstName	LATIKA YASMIN	81.2
LastName	SPECIMEN	88.0

-------- Recognizing ID document #5 --------

Field	Value	% Confidence
CountryRegion	USA	49.2
DateOfBirth	1961-02-15	99.0
DateOfExpiration	2027-05-20	99.0
DateOfIssue	2017-05-21	99.0
DocumentNumber	685471230	99.0
DocumentType	P	99.0
FirstName	JHON	99.5
IssuingAuthority	United States\nDepartment of State	99.0
LastName	DOE	99.5
PlaceOfBirth	Florida	99.0
Sex	M	99.0

-------- Recognizing ID document #6 --------

Field	Value	% Confidence
CountryRegion	AUS	99.0
DateOfBirth	1984-06-07	99.0
DateOfExpiration	2019-03-21	99.0
DateOfIssue	2014-03-01	99.0
DocumentNumber	PA0940443	99.0
DocumentType	P	99.0
FirstName	JANE	99.5
IssuingAuthority	AUSTRALIA	99.0
LastName	CITIZEN	99.5
Nationality	AUS	99.0
PlaceOfBirth	CANBERRA	99.0
Sex	F	99.0
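Confidence is not the only signal worth checking. ID document #4 above reports a date of birth (2023-05-18) later than its expiration date (2023-03-24), and the two cannot both be right. A simple cross-field check can catch this kind of inconsistency. The sketch below uses a hypothetical dates_are_plausible helper and assumes the date fields arrive as datetime.date values.

```python
from datetime import date


def dates_are_plausible(date_of_birth, date_of_expiration, date_of_issue=None):
    """Return True when the dates are in a sensible chronological order."""
    if date_of_birth is None or date_of_expiration is None:
        return True  # Nothing to compare, so do not flag the document.
    if date_of_birth >= date_of_expiration:
        return False
    if date_of_issue is not None:
        return date_of_birth <= date_of_issue <= date_of_expiration
    return True


print(dates_are_plausible(date(2023, 5, 18), date(2023, 3, 24)))  # False
print(dates_are_plausible(date(1981, 1, 1), date(2019, 11, 29), date(2009, 11, 30)))  # True
```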

You can find the full source code and images used here.

Conclusion

Azure Form Recognizer, combined with the versatility of Python, empowers businesses to streamline their document processing workflows. With its powerful OCR capabilities and the ability to extract key data elements, Azure Form Recognizer simplifies the extraction of valuable information from various forms. By harnessing the potential of this cloud-based service and the flexibility of Python, you can significantly improve efficiency, reduce errors, and unlock new opportunities for automation in your organization.

Thanks for reading

Thank you very much for reading. I hope you found this article interesting and that it proves useful in the future. If you have any questions or ideas you would like to discuss, it will be a pleasure to collaborate and exchange knowledge.