Abstract / Overview
LlamaIndex’s LlamaClassify enables developers to automatically categorize unstructured documents (PDFs, scans, contracts, invoices, receipts, etc.) into custom-defined types using natural-language rules.
This developer-focused tutorial explains how to classify contract types (e.g., affiliate vs. co-branding agreements) using the LlamaCloud Python SDK. It covers authentication, rule definition, asynchronous classification, and result interpretation.
You’ll learn how to:
Define flexible, natural-language classification rules
Upload and classify files through LlamaCloud
Retrieve model-generated predictions with confidence and reasoning
Integrate classification into document pipelines for parsing, extraction, or indexing
Conceptual Background
![llamaindex-contract-classification-hero]()
What Is LlamaClassify?
LlamaClassify is part of the LlamaCloud Services SDK, providing automatic document categorization with natural-language reasoning. You define types and textual rules that describe what distinguishes each type. The model then classifies each document accordingly.
Each classification job returns:
type: predicted document type
confidence: numeric probability (0.0–1.0)
reasoning: textual explanation of the model’s decision
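As a rough sketch of how these three fields might be consumed downstream, the snippet below models a result with a hypothetical `ClassificationResult` dataclass (a stand-in for illustration, not the SDK's own type) and flags low-confidence predictions for manual review. The 0.8 threshold is an arbitrary assumption and should be tuned on your own documents.

```python
from dataclasses import dataclass


@dataclass
class ClassificationResult:
    """Hypothetical stand-in mirroring the three fields described above."""
    type: str
    confidence: float  # expected to lie between 0.0 and 1.0
    reasoning: str


def needs_review(result: ClassificationResult, threshold: float = 0.8) -> bool:
    """Return True when the prediction is too uncertain to act on automatically."""
    return result.confidence < threshold


sample = ClassificationResult(
    type="affiliate_agreements",
    confidence=0.97,
    reasoning="Title and body refer to a 'Marketing Affiliate'.",
)
print(needs_review(sample))  # False: 0.97 is above the assumed 0.8 threshold
```

Keeping the reasoning string alongside the label makes it easier to audit any document that ends up in the review queue.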
When to Use LlamaClassify
Pre-extraction step: Route documents to schema-specific extraction agents (e.g., invoices vs. receipts).
Pre-parsing step: Apply tailored LlamaParse configurations based on type (e.g., parse settings differ for contracts vs. receipts).
Pre-indexing step: Send labeled documents to the right LlamaCloud indices for improved search and retrieval.
Back-office automation: Route incoming files (e.g., receipts, POs, bank statements) automatically to workflow queues.
Dataset curation: Generate labeled subsets for training specialized document understanding models.
Step-by-Step Developer Walkthrough
1. Environment Setup
Install dependencies and configure your API key.
pip install llama-cloud-services python-dotenv
Create a .env file to store your LlamaCloud API key securely:
LLAMA_CLOUD_API_KEY=llx-your-api-key
Alternatively, you can export the variable in your shell or use getpass() for interactive input.
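One way to combine the two options is sketched below: read the key from the environment and fall back to an interactive prompt only when it is missing. The helper name `get_api_key` and its parameters are illustrative assumptions, not part of the SDK; only the standard library is used.

```python
import os
from getpass import getpass

ENV_VAR = "LLAMA_CLOUD_API_KEY"


def get_api_key(env=os.environ, prompt=getpass) -> str:
    """Read the key from the environment, prompting only if it is missing."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        # Interactive fallback; getpass does not echo the typed value.
        key = prompt(f"Enter {ENV_VAR}: ").strip()
    if not key:
        raise ValueError(f"{ENV_VAR} is required")
    return key
```

Calling `get_api_key()` after `load_dotenv()` would pick up a value from the .env file first, since `load_dotenv()` copies it into the process environment.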
2. Initialize the Client
import os
from dotenv import load_dotenv
from llama_cloud.client import AsyncLlamaCloud
from llama_cloud_services.beta.classifier.client import ClassifyClient
load_dotenv()  # copy LLAMA_CLOUD_API_KEY from .env into the environment
client = AsyncLlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])
project_id = "your-project-id"  # replace with your project ID
organization_id = "your-organization-id"  # replace with your organization ID
classifier = ClassifyClient(client, project_id=project_id, organization_id=organization_id)
This initializes an asynchronous LlamaCloud client and a ClassifyClient wrapper that handles uploads, job creation, polling, and retrieval.
3. Define Classification Rules
Classification in LlamaClassify relies on rules, each describing a type using natural language. The model uses these rules to evaluate which type best matches the document content.
Example rules for two contract categories:
from llama_cloud.types import ClassifierRule
rules = [
    ClassifierRule(
        type="affiliate_agreements",
        description="Contracts that outline an affiliate partnership where one party markets or sells another’s products in exchange for commission.",
    ),
    ClassifierRule(
        type="co_branding",
        description="Contracts that define a co-branding arrangement between two organizations sharing brand assets or marketing campaigns under both logos.",
    ),
]
Best Practices for Rule Writing
Use descriptive phrases about distinctive content.
Mention distinctive terms or structural elements (e.g., “Affiliate Partner,” “royalty,” “joint marketing”).
Avoid vague phrasing like “looks like a contract.”
Refine rules iteratively by testing them on sample files.
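To make the iterative refinement concrete, here is a minimal sketch that scores a set of rules against a few hand-labeled sample files. It assumes you have already collected (expected type, predicted type) pairs by running the classifier yourself; the `evaluate` helper and the sample data are hypothetical.

```python
from collections import Counter


def evaluate(pairs):
    """Summarize (expected_type, predicted_type) pairs.

    Returns overall accuracy and a Counter of (expected, predicted) mistakes.
    """
    if not pairs:
        raise ValueError("need at least one labeled sample")
    mistakes = Counter((exp, pred) for exp, pred in pairs if exp != pred)
    accuracy = 1 - sum(mistakes.values()) / len(pairs)
    return accuracy, mistakes


# Hypothetical hand-labeled samples: (expected type, type the classifier returned).
labeled = [
    ("affiliate_agreements", "affiliate_agreements"),
    ("affiliate_agreements", "co_branding"),
    ("co_branding", "co_branding"),
    ("co_branding", "co_branding"),
]
accuracy, mistakes = evaluate(labeled)
print(f"accuracy={accuracy:.2f}")  # accuracy=0.75
for (expected, predicted), count in mistakes.most_common():
    print(f"{expected} misread as {predicted}: {count}")
```

The most frequent mistake pairs point to the rule descriptions that overlap and are worth rewording first.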
4. Optional Parsing Configuration
You can instruct the classifier to limit parsing for efficiency or accuracy.
from llama_cloud.types import ClassifyParsingConfiguration, ParserLanguages
parsing = ClassifyParsingConfiguration(
    lang=ParserLanguages.EN,
    max_pages=5,  # Limit to first 5 pages for speed
)
5. Upload and Classify Contracts
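Classify a single PDF contract asynchronously:
# Run this inside an async function or a notebook cell; top-level await is not valid in a plain script.
result = await classifier.aclassify_file_path(
    rules=rules,
    file_input_path="CybergyHoldingsInc_Affiliate_Agreement.pdf",
)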
You can also classify multiple files:
results = await classifier.aclassify_file_paths(
    rules=rules,
    file_input_paths=["/contracts/Affiliate1.pdf", "/contracts/Cobrand1.pdf"],
    parsing_configuration=parsing,
)
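The `await` calls above assume an event loop is already running, as in a Jupyter notebook. In a plain Python script, one common approach is to wrap them in a coroutine and start it with `asyncio.run`. The sketch below shows only that pattern; `classify_stub` is a hypothetical placeholder standing in for the awaited classifier call, so the example runs without network access.

```python
import asyncio


async def classify_stub(paths):
    """Hypothetical placeholder for the awaited classifier call shown above."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [f"classified:{p}" for p in paths]


async def main(paths):
    # In real code, replace classify_stub(...) with the awaited SDK call.
    return await classify_stub(paths)


if __name__ == "__main__":
    print(asyncio.run(main(["/contracts/Affiliate1.pdf", "/contracts/Cobrand1.pdf"])))
```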
6. Retrieve and Inspect Results
Each result item includes the predicted type, confidence score, and reasoning string.
classification = result.items[0].result
print("Classification Result:", classification.type)
print("Confidence:", classification.confidence)
print("Reasoning:", classification.reasoning)
Example Output:
Classification Result: affiliate_agreements
Confidence: 0.97
Reasoning: The document is titled 'MARKETING AFFILIATE AGREEMENT' and repeatedly refers to one party as the 'Marketing Affiliate.' The content describes affiliate rights, commissions, and promotional obligations, with no shared branding terms—indicating an affiliate agreement.
7. Handling Partial Failures
If classification partially fails (e.g., due to upload or parse errors), the SDK returns result=None for affected items.
for item in results.items:
    if item.result is None:
        print(f"Error in job {item.classify_job_id} for file {item.file_id}")
    else:
        print(f"{item.file_id} → {item.result.type} ({item.result.confidence:.2f})")
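Since some failures are transient, a retry loop with exponential backoff is one way to recover from them. The sketch below is an assumption-laden illustration: `retry_failed` and the `flaky` stub are hypothetical, and `classify_one` stands for whatever coroutine you write to re-submit a single file and return its result (or None on failure).

```python
import asyncio
from types import SimpleNamespace


async def retry_failed(classify_one, file_ids, attempts=3, base_delay=1.0):
    """Re-run classify_one(file_id) until it returns a non-None result.

    Returns (results_by_file_id, ids_that_still_failed).
    """
    results, pending = {}, list(file_ids)
    for attempt in range(attempts):
        still_failed = []
        for file_id in pending:
            result = await classify_one(file_id)
            if result is None:
                still_failed.append(file_id)
            else:
                results[file_id] = result
        pending = still_failed
        if not pending:
            break
        if attempt < attempts - 1:
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return results, pending


calls = {}


async def flaky(file_id):
    """Hypothetical stub: 'file-b' fails on its first attempt only."""
    calls[file_id] = calls.get(file_id, 0) + 1
    if file_id == "file-b" and calls[file_id] == 1:
        return None
    return SimpleNamespace(type="co_branding", confidence=0.9)


results, failed = asyncio.run(retry_failed(flaky, ["file-a", "file-b"], base_delay=0))
print(sorted(results), failed)  # ['file-a', 'file-b'] []
```

Files that remain in the failed list after all attempts are better handled by changing the input (for example, a lower page limit) than by retrying further.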
Integration Patterns
A. Classification Before Extraction
Use classification to route files to schema-specific LlamaExtract agents:
if classification.type == "invoice":
    extract_invoice_fields(file_path)
elif classification.type == "contract":
    extract_contract_clauses(file_path)
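As the number of types grows, an if/elif chain can be swapped for a dispatch table. The following is a minimal sketch under stated assumptions: the extractor functions are placeholders that return strings, `send_to_review` is a hypothetical fallback, and the 0.8 confidence floor is arbitrary.

```python
def extract_invoice_fields(path):
    return f"invoice fields from {path}"  # placeholder extractor


def extract_contract_clauses(path):
    return f"contract clauses from {path}"  # placeholder extractor


def send_to_review(path):
    return f"queued {path} for manual review"  # hypothetical fallback


# Map each predicted type to the function that should process it.
HANDLERS = {
    "invoice": extract_invoice_fields,
    "contract": extract_contract_clauses,
}


def route(doc_type, confidence, path, min_confidence=0.8):
    """Send the file to its handler, or to review if unknown or uncertain."""
    handler = HANDLERS.get(doc_type)
    if handler is None or confidence < min_confidence:
        return send_to_review(path)
    return handler(path)


print(route("invoice", 0.95, "a.pdf"))  # invoice fields from a.pdf
print(route("receipt", 0.99, "b.pdf"))  # queued b.pdf for manual review
```

Adding a new document type then only requires a new entry in `HANDLERS`, and unknown or low-confidence predictions never reach an extractor by accident.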
B. Classification Before Parsing
Adjust LlamaParse parameters dynamically per type:
# Option names here are illustrative; map them to the LlamaParse settings you actually use.
if classification.type == "receipt":
    parse_options = {"table_detection": False, "ocr": True}
else:
    parse_options = {"table_detection": True, "ocr": False}
C. Classification Before Indexing
Tag and route content into appropriate LlamaCloud indices with specific chunking rules:
index_name = f"{classification.type}_index"
# Pseudocode: replace add_document with the indexing call your pipeline uses.
llama_index.add_document(file_path, index=index_name, metadata={"type": classification.type})
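When many files are classified at once, it can help to group them by target index before uploading. The sketch below uses only the standard library; the helper names, the sanitizing regex, and the `Affiliate2.pdf` path are assumptions for illustration, while the `<type>_index` naming follows the snippet above.

```python
import re
from collections import defaultdict


def index_name_for(doc_type: str) -> str:
    """Build a predictable index name using the '<type>_index' pattern."""
    slug = re.sub(r"[^a-z0-9]+", "_", doc_type.lower()).strip("_")
    return f"{slug}_index"


def group_by_index(predictions):
    """Map index name -> list of file paths, given (path, predicted_type) pairs."""
    groups = defaultdict(list)
    for path, doc_type in predictions:
        groups[index_name_for(doc_type)].append(path)
    return dict(groups)


batches = group_by_index([
    ("/contracts/Affiliate1.pdf", "affiliate_agreements"),
    ("/contracts/Cobrand1.pdf", "co_branding"),
    ("/contracts/Affiliate2.pdf", "affiliate_agreements"),
])
print(batches["affiliate_agreements_index"])
# ['/contracts/Affiliate1.pdf', '/contracts/Affiliate2.pdf']
```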
Mermaid Diagram: Classification Workflow
![llamaindex-contract-classification-workflow]()
Common Pitfalls and Fixes
| Issue | Possible Cause | Fix |
|---|---|---|
| None results | Network or parse failure | Retry or reduce max_pages |
| Low confidence | Ambiguous rules | Refine descriptions; add unique indicators |
| Slow performance | Large files | Use parsing limits or page targets |
| Misclassification | Overlapping semantics | Split into narrower types; add more examples |
Future Enhancements
Rule weight calibration: Prioritize certain rules when overlap occurs.
Hybrid mode: Combine classification with semantic retrieval for unseen classes.
Batch pipelines: Integrate classification with ETL tools like Airflow or Prefect.
Model fine-tuning: Train LlamaClassify on domain-specific contracts.
FAQs
Q1. Is classification supervised or rule-based? Hybrid. You provide natural-language rules; the LLM interprets them for semantic alignment.
Q2. Can I use this offline? No. Classification runs in LlamaCloud, requiring an API key.
Q3. How accurate is it? Accuracy depends on rule quality and dataset complexity. Confidence is typically above 0.9 for clearly distinguishable document types.
Q4. What’s the cost? Billing depends on file volume, page count, and model runtime. Refer to the LlamaIndex pricing documentation.
Q5. Can I build this workflow in the UI? Yes. The same rule definitions can be created in the LlamaCloud UI under the Classify tab.