
Classify Contract Types with LlamaIndex Cloud Classify API (Python SDK Guide)

Abstract / Overview

LlamaIndex’s LlamaClassify enables developers to automatically categorize unstructured documents (PDFs, scans, contracts, invoices, receipts, etc.) into custom-defined types using natural-language rules.

This developer-focused tutorial explains how to classify contract types (e.g., affiliate vs. co-branding agreements) using the LlamaCloud Python SDK. It covers authentication, rule definition, asynchronous classification, and result interpretation.

You’ll learn how to:

  • Define flexible, natural-language classification rules

  • Upload and classify files through LlamaCloud

  • Retrieve model-generated predictions with confidence and reasoning

  • Integrate classification into document pipelines for parsing, extraction, or indexing

Conceptual Background


What Is LlamaClassify?

LlamaClassify is part of the LlamaCloud Services SDK, providing automatic document categorization with natural-language reasoning. You define types and textual rules that describe what distinguishes each type. The model then classifies each document accordingly.

Each classification job returns:

  • type: predicted document type

  • confidence: numeric score between 0.0 and 1.0

  • reasoning: textual explanation of the model’s decision
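To make the result shape concrete, here is a minimal sketch of how downstream code might hold these three fields and flag weak predictions. ClassificationOutcome and needs_review are hypothetical names used only for illustration, not SDK classes, and the 0.8 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationOutcome:
    """Hypothetical stand-in mirroring the three fields listed above."""
    type: str
    confidence: float
    reasoning: str


def needs_review(outcome: ClassificationOutcome, threshold: float = 0.8) -> bool:
    """Flag predictions whose confidence falls below the (assumed) threshold."""
    return outcome.confidence < threshold


example = ClassificationOutcome(
    type="affiliate_agreements",
    confidence=0.97,
    reasoning="Title and commission terms match the affiliate rule.",
)
print(needs_review(example))  # False
```

A threshold like this is something to tune on your own documents rather than a recommended value.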

When to Use LlamaClassify

  • Pre-Extraction Step: Route documents to schema-specific extraction agents (e.g., invoices vs. receipts).

  • Pre-Parsing Step: Apply tailored LlamaParse configurations based on type (e.g., parse settings differ for contracts vs. receipts).

  • Pre-Indexing Step: Send labeled documents to the right LlamaCloud indices for improved search and retrieval.

  • Back-office automation: Route incoming files (e.g., receipts, POs, bank statements) automatically to workflow queues.

  • Dataset curation: Generate labeled subsets for training specialized document understanding models.

Step-by-Step Developer Walkthrough

1. Environment Setup

Install dependencies and configure your API key.

pip install llama-cloud-services python-dotenv

Create a .env file to store your LlamaCloud API key securely:

LLAMA_CLOUD_API_KEY=llx-your-api-key

Alternatively, you can export the variable in your shell or prompt for the key interactively with getpass().
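One way to combine the two options is to read the environment first and prompt only when the variable is missing. The helper name get_api_key is an assumption for this sketch; only os.environ and getpass from the standard library are used.

```python
import os
from getpass import getpass


def get_api_key(env_var: str = "LLAMA_CLOUD_API_KEY") -> str:
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Interactive fallback; input is not echoed to the terminal.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty")
    return key
```

Calling get_api_key() after load_dotenv() picks up the value from .env without any prompt.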

2. Initialize the Client

import os
from dotenv import load_dotenv
from llama_cloud.client import AsyncLlamaCloud
from llama_cloud_services.beta.classifier.client import ClassifyClient

load_dotenv()

client = AsyncLlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])
project_id = "your-project-id"
organization_id = "your-organization-id"

classifier = ClassifyClient(client, project_id=project_id, organization_id=organization_id)

This initializes an asynchronous LlamaCloud client and a ClassifyClient wrapper that handles uploads, job creation, polling, and retrieval.
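The wrapper hides the polling step, so you normally never write it yourself. Purely to illustrate the general pattern, the sketch below polls a status callable until it reports a terminal state. The fetch_status callable and the "SUCCESS" and "ERROR" status names are assumptions for this example and do not describe the SDK's internals.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    interval: float = 1.0,
    timeout: float = 60.0,
) -> str:
    """Call fetch_status repeatedly until it reports a terminal state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        if status in {"SUCCESS", "ERROR"}:  # assumed terminal states
            return status
        if loop.time() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        await asyncio.sleep(interval)
```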

3. Define Classification Rules

Classification in LlamaClassify relies on rules, each describing a type using natural language. The model uses these rules to evaluate which type best matches the document content.

Example rules for two contract categories:

from llama_cloud.types import ClassifierRule

rules = [
    ClassifierRule(
        type="affiliate_agreements",
        description="Contracts that outline an affiliate partnership where one party markets or sells another’s products in exchange for commission."
    ),
    ClassifierRule(
        type="co_branding",
        description="Contracts that define a co-branding arrangement between two organizations sharing brand assets or marketing campaigns under both logos."
    ),
]

Best Practices for Rule Writing

  • Use descriptive phrases about distinctive content.

  • Mention distinctive terms or structural elements (e.g., “Affiliate Partner,” “royalty,” “joint marketing”).

  • Avoid vague phrasing like “looks like a contract.”

  • Refine rules iteratively by testing them on sample files.
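Some of these practices can be checked mechanically before any file is sent for classification. The sketch below works on plain dictionaries rather than ClassifierRule objects, and the lint_rules helper, the phrase list, and the eight-word minimum are all illustrative assumptions.

```python
from typing import Dict, List

# Illustrative list of phrases that tend to signal a vague description.
VAGUE_PHRASES = ("looks like", "some kind of", "any document")


def lint_rules(rules: List[Dict[str, str]], min_words: int = 8) -> List[str]:
    """Return human-readable warnings for rules that are likely too weak."""
    warnings = []
    seen_types = set()
    for rule in rules:
        rule_type, description = rule["type"], rule["description"]
        if rule_type in seen_types:
            warnings.append(f"{rule_type}: duplicate type")
        seen_types.add(rule_type)
        if len(description.split()) < min_words:
            warnings.append(f"{rule_type}: description shorter than {min_words} words")
        lowered = description.lower()
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                warnings.append(f"{rule_type}: vague phrase '{phrase}'")
    return warnings
```

A check like this catches only surface problems; testing rules on sample files remains the real measure of quality.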

4. Optional Parsing Configuration

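You can optionally restrict how each document is parsed before classification, for example by setting the language or limiting parsing to the first few pages, which speeds up jobs on long contracts.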

from llama_cloud.types import ClassifyParsingConfiguration, ParserLanguages

parsing = ClassifyParsingConfiguration(
    lang=ParserLanguages.EN,
    max_pages=5  # Limit to first 5 pages for speed
)

5. Upload and Classify Contracts

Classify a single PDF contract asynchronously:

result = await classifier.aclassify_file_path(
    rules=rules,
    file_input_path="CybergyHoldingsInc_Affiliate_Agreement.pdf",
)
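A bare await like the one above works in notebooks, but a plain Python script needs an event loop. The sketch below shows the usual asyncio.run wrapper. classify_stub is a hypothetical stand-in that returns a canned dictionary so the example runs without network access; in real code the awaited call would be classifier.aclassify_file_path.

```python
import asyncio


async def classify_stub(path: str) -> dict:
    """Hypothetical stand-in for the awaited SDK call; returns a canned result."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"file": path, "type": "affiliate_agreements"}


async def main() -> dict:
    # In real code, await classifier.aclassify_file_path(...) here instead.
    return await classify_stub("CybergyHoldingsInc_Affiliate_Agreement.pdf")


if __name__ == "__main__":
    print(asyncio.run(main()))
```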

You can also classify multiple files:

results = await classifier.aclassify_file_paths(
    rules=rules,
    file_input_paths=["/contracts/Affiliate1.pdf", "/contracts/Cobrand1.pdf"],
    parsing_configuration=parsing
)
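When the files live in one folder, the path list can be built rather than typed out. This is a small standard-library sketch; collect_pdf_paths is a hypothetical helper name, and it deliberately looks only at files directly inside the given directory.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(directory: str) -> List[str]:
    """Return sorted paths of PDF files directly inside the directory."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"  # matches .pdf and .PDF
    )
```

The returned list can be passed as file_input_paths in the call above.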

6. Retrieve and Inspect Results

Each result item includes the predicted type, confidence score, and reasoning string.

classification = result.items[0].result

print("Classification Result:", classification.type)
print("Confidence:", classification.confidence)
print("Reasoning:", classification.reasoning)

Example Output:

Classification Result: affiliate_agreements
Confidence: 0.97
Reasoning: The document is titled 'MARKETING AFFILIATE AGREEMENT' and repeatedly refers to one party as the 'Marketing Affiliate.' The content describes affiliate rights, commissions, and promotional obligations, with no shared branding terms—indicating an affiliate agreement.

7. Handling Partial Failures

If classification partially fails (e.g., due to upload or parse errors), the SDK returns result=None for affected items.

for item in results.items:
    if item.result is None:
        print(f"Error in job {item.classify_job_id} for file {item.file_id}")
    else:
        print(f"{item.file_id} → {item.result.type} ({item.result.confidence:.2f})")
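For larger batches it can help to separate failed items up front so they can be retried or logged together. The sketch below uses SimpleNamespace objects as stand-ins for result items; only the file_id and result attributes shown above are assumed, and split_results is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def split_results(items: Iterable) -> Tuple[List, List]:
    """Partition items into (succeeded, failed) based on result being None."""
    succeeded, failed = [], []
    for item in items:
        (failed if item.result is None else succeeded).append(item)
    return succeeded, failed


# Stand-in items shaped like the ones in the loop above.
items = [
    SimpleNamespace(file_id="file-1", result=SimpleNamespace(type="co_branding", confidence=0.91)),
    SimpleNamespace(file_id="file-2", result=None),
]
succeeded, failed = split_results(items)
print([item.file_id for item in failed])  # ['file-2']
```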

Integration Patterns

A. Classification Before Extraction

Use classification to route files to schema-specific LlamaExtract agents:

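# extract_invoice_fields and extract_contract_clauses are placeholders
# for your own extraction functions.
if classification.type == "invoice":
    extract_invoice_fields(file_path)
elif classification.type == "contract":
    extract_contract_clauses(file_path)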
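As the number of types grows, a lookup table with a fallback tends to be easier to maintain than a chain of if/elif branches. The handler functions below are hypothetical placeholders that only return strings; in practice each would call the matching extraction agent.

```python
from typing import Callable, Dict


def extract_affiliate_terms(path: str) -> str:
    return f"affiliate extraction for {path}"  # placeholder behavior


def extract_cobranding_terms(path: str) -> str:
    return f"co-branding extraction for {path}"  # placeholder behavior


def send_to_manual_review(path: str) -> str:
    return f"manual review for {path}"  # fallback for unknown types


# Map each classification type to its handler.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "affiliate_agreements": extract_affiliate_terms,
    "co_branding": extract_cobranding_terms,
}


def route(doc_type: str, path: str) -> str:
    """Dispatch to the handler for doc_type, or to manual review if none exists."""
    handler = HANDLERS.get(doc_type, send_to_manual_review)
    return handler(path)
```

The explicit fallback means an unexpected type is surfaced for review instead of being silently skipped.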

B. Classification Before Parsing

Adjust LlamaParse parameters dynamically per type:

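# Option names here are illustrative; check the LlamaParse documentation
# for the parameters it actually accepts.
if classification.type == "receipt":
    parse_options = {"table_detection": False, "ocr": True}
else:
    parse_options = {"table_detection": True, "ocr": False}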

C. Classification Before Indexing

Tag and route content into appropriate LlamaCloud indices with specific chunking rules:

# Pseudocode: llama_index.add_document stands in for your index client's ingestion call.
index_name = f"{classification.type}_index"
llama_index.add_document(file_path, index=index_name, metadata={"type": classification.type})
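The naming and tagging part of this pattern can be isolated from the ingestion call itself. The sketch below derives an index name and metadata from a classification; index_target is a hypothetical helper, and the slug rule (lowercase letters, digits, and underscores) is an assumption rather than a LlamaCloud naming requirement.

```python
import re


def index_target(doc_type: str, confidence: float) -> dict:
    """Build an index name and metadata dict from a classification result."""
    # Collapse anything that is not a lowercase letter or digit into "_".
    slug = re.sub(r"[^a-z0-9]+", "_", doc_type.lower()).strip("_")
    return {
        "index_name": f"{slug}_index",
        "metadata": {"type": doc_type, "confidence": round(confidence, 2)},
    }
```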

Mermaid Diagram: Classification Workflow

[Figure: classification workflow diagram]

Common Pitfalls and Fixes

Issue | Possible Cause | Fix
None results | Network or parse failure | Retry or reduce max_pages
Low confidence | Ambiguous rules | Refine descriptions; add unique indicators
Slow performance | Large files | Use parsing limits or page targets
Misclassification | Overlapping semantics | Split into narrower types; add more examples
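The "retry" fix for None results can be sketched as a small helper with exponential backoff. The attempt callable is a hypothetical wrapper around a single classification call that returns None on failure; the attempt count and delays are arbitrary assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_until_result(
    attempt: Callable[[], Awaitable[Optional[T]]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[T]:
    """Re-run attempt while it returns None, doubling the delay each time."""
    for n in range(max_attempts):
        result = await attempt()
        if result is not None:
            return result
        if n < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            await asyncio.sleep(base_delay * 2 ** n)
    return None  # every attempt failed
```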

Future Enhancements

  • Rule weight calibration: Prioritize certain rules when overlap occurs.

  • Hybrid mode: Combine classification with semantic retrieval for unseen classes.

  • Batch pipelines: Integrate classification with ETL tools like Airflow or Prefect.

  • Model fine-tuning: Train LlamaClassify on domain-specific contracts.

FAQs

Q1. Is classification supervised or rule-based? Hybrid. You provide natural-language rules; the LLM interprets them for semantic alignment.

Q2. Can I use this offline? No. Classification runs in LlamaCloud, requiring an API key.

Q3. How accurate is it? It depends on rule quality and dataset complexity. Confidence scores above 0.9 are typical for clearly distinguishable document types.

Q4. What’s the cost? Billing depends on file volume, page count, and model runtime. Refer to LlamaIndex pricing documentation.

Q5. Can I build this workflow in the UI? Yes. The same rule definitions can be created in LlamaCloud UI under the Classify tab.
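On the accuracy question (Q3), one practical approach is to measure it yourself on a small hand-labeled sample. The sketch below assumes predictions have already been collected as (type, confidence) pairs; evaluate is a hypothetical helper and the sample data is made up for illustration.

```python
from typing import Dict, List, Tuple


def evaluate(predictions: List[Tuple[str, float]], labels: List[str]) -> Dict[str, float]:
    """Compute accuracy and mean confidence against hand-assigned labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and the same length")
    correct = sum(1 for (pred, _), label in zip(predictions, labels) if pred == label)
    mean_confidence = sum(conf for _, conf in predictions) / len(predictions)
    return {"accuracy": correct / len(labels), "mean_confidence": mean_confidence}


# Made-up sample: three of four predictions match the labels.
sample_predictions = [
    ("affiliate_agreements", 0.9),
    ("co_branding", 0.7),
    ("affiliate_agreements", 0.8),
    ("co_branding", 0.6),
]
sample_labels = ["affiliate_agreements", "co_branding", "co_branding", "co_branding"]
report = evaluate(sample_predictions, sample_labels)
```

Comparing accuracy with mean confidence shows whether high confidence scores actually correspond to correct labels on your documents.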
