Performance Tips
Some scenarios require careful schema and extraction workflow design. When your use case falls into one of these scenarios, the tips below will help you work within LLM limitations and achieve good extraction results.
Entity Enumeration on Long Documents
The situation: When working with lengthy documents (50+ pages) that contain repetitive information or many instances of similar entities, you want to ensure comprehensive and accurate extraction.
Best practices for optimal performance:
- Use targeted extraction: Design schemas that extract specific, well-defined fields rather than exhaustive lists
- Implement document chunking: Break large documents into logical sections (by chapter, section, or topic) and run separate extractions (see the sketch after this list)
- Leverage page-based extraction: If each page contains an entity to be extracted, use the PER_PAGE extraction target to process each page individually for comprehensive coverage
- Apply page range filtering: Use the page_range option to focus on specific sections where your target data is located
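For example, here is a minimal chunking sketch: it splits a long PDF into fixed-size page ranges with pypdf and runs a separate extraction on each chunk. The chunk size, the file naming, and the use of an already-configured LlamaExtract agent are illustrative assumptions, not requirements of the API.
# Sketch: page-based chunking with pypdf (an assumption; any PDF splitter works).
# "agent" is an already-configured LlamaExtract extraction agent.
from pypdf import PdfReader, PdfWriter

def extract_in_page_chunks(agent, pdf_path, pages_per_chunk=10):
    reader = PdfReader(pdf_path)
    results = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        # Write the current page range out as a small chunk file
        writer = PdfWriter()
        for page_index in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[page_index])
        chunk_path = f"{pdf_path}.pages_{start + 1}_{start + pages_per_chunk}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        # Run a separate, targeted extraction per chunk
        results.append(agent.extract(chunk_path))
    return results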
Anti-patterns to avoid:
- Asking for exhaustive enumeration across very long documents (e.g., "extract all product names from this 100-page catalog")
- Using overly broad, undefined extraction targets without structure or limits
- Processing entire long documents as single extraction jobs without consideration of context limits
Schema design examples:
# ❌ Problematic: Asking for exhaustive enumeration on a large document
problematic_schema = {
    "all_product_names": {
        "type": "array",
        "description": "Every single product mentioned anywhere in this document"
    }
}

# ✅ Better: Targeted, structured extraction
better_schema = {
    "primary_products": {
        "type": "array",
        "description": "The main 3-5 products featured in this section",
    },
    "product_category": {
        "type": "string",
        "description": "The category this section focuses on"
    }
}
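As a usage sketch, the targeted schema can then be attached to an extraction agent and run against a single pre-split section rather than the whole catalog. The import path, the create_agent call with a wrapped object schema, and the file name below follow the LlamaExtract Python SDK conventions, but treat them as assumptions and check the current SDK reference.
# Sketch: attach the targeted schema to an extraction agent and run it on one
# pre-split section rather than the full 100-page catalog. Wrapping the fields
# into a top-level object schema and all names here are illustrative.
from llama_cloud_services import LlamaExtract

extractor = LlamaExtract()
agent = extractor.create_agent(
    name="catalog-section-extractor",
    data_schema={"type": "object", "properties": better_schema},
)

result = agent.extract("catalog_section_3.pdf")  # one chunked section
print(result.data)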
Tabular Data Transformation (CSV/Excel)
The situation: When working with spreadsheets or CSV files, you need to extract and transform structured data while maintaining accuracy and exhaustiveness.
Performance guidelines by table size:
- Small tables (< 50 rows): Process directly with excellent results
- Medium tables (50-100 rows): Process directly but validate thoroughly
- Large tables (> 100 rows): Use batch processing strategies for optimal results
Best practices for optimal performance:
- Implement batch processing: Process large tables in smaller, manageable chunks (20-50 rows at a time)
- Focus on header/metadata extraction: Extract table structure, column definitions, and summary information for comprehensive understanding (see the schema sketch after this list)
- Use sample-based processing: Process representative samples to validate your approach before scaling
- Combine with traditional tools: Use LlamaExtract for intelligent extraction and pandas/SQL for large-scale transformations
- Validate incrementally: Test your extraction logic on small batches before processing entire datasets
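For the header/metadata guideline, a small schema along these lines (the field names are illustrative) captures the table's structure and summary rather than every row:
# Sketch: extract table structure and summary information instead of all rows
table_overview_schema = {
    "column_names": {
        "type": "array",
        "description": "Names of the columns in the table, in order"
    },
    "row_count_estimate": {
        "type": "number",
        "description": "Approximate number of data rows in the table"
    },
    "table_summary": {
        "type": "string",
        "description": "One-sentence summary of what the table contains"
    }
}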
Anti-patterns to avoid:
- Processing hundreds or thousands of rows in a single extraction job
- Attempting to transform every row of massive datasets without batching
- Using LlamaExtract for purely computational tasks that don't require language understanding
Optimal workflow for large tables:
# ✅ Process a large CSV in batches
import csv

def process_large_csv(csv_file, batch_size=50):
    with open(csv_file, newline="") as f:
        header, *rows = list(csv.reader(f))
    results = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Convert the batch to document format: write it (with the header)
        # out as a small CSV file the agent can ingest
        batch_file = f"{csv_file}.batch_{i}.csv"
        with open(batch_file, "w", newline="") as f:
            csv.writer(f).writerows([header] + batch)
        results.append(agent.extract(batch_file))  # agent: your LlamaExtract agent
    return combine_results(results)  # combine_results: your own merge step
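To combine LlamaExtract with traditional tools, you can hand the per-batch results to pandas for the purely computational work. The sketch below assumes each batch result exposes a list of row dicts under result.data["rows"]; adapt the key and column names to your own schema. It is also one possible implementation of the combine_results step above.
# Sketch: LlamaExtract for understanding, pandas for bulk transformation.
# Assumes each batch result holds a list of row dicts under result.data["rows"].
import pandas as pd

def results_to_dataframe(results):
    rows = []
    for result in results:
        rows.extend(result.data.get("rows", []))
    return pd.DataFrame(rows)

# If results_to_dataframe is used as combine_results, downstream aggregation
# stays in pandas/SQL, e.g.:
# df = process_large_csv("products.csv")
# totals = df.groupby("category")["price"].sum()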
Complex Field Transformations
The situation: When creating extraction schemas, you want to capture complex business logic and data transformations while ensuring reliable, consistent results.
Best practices for optimal performance:
- Keep extraction simple and focused: Design fields that extract clean, structured data without complex computations
- Separate extraction from transformation: Handle business logic, calculations, and complex rules in your application code after extraction
- Use clear, single-purpose descriptions: Each field should have one clear, well-defined extraction target
- Provide concrete examples: Include specific examples in field descriptions when the expected format might be ambiguous (as shown below)
- Test iteratively: Start with simple schemas and add complexity gradually while validating results
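For instance, a single-purpose field whose description includes a concrete example of the expected format (the field name and wording are illustrative):
# Sketch: an unambiguous, single-purpose field with a concrete format example
date_field = {
    "invoice_date": {
        "type": "string",
        "description": "Invoice date in YYYY-MM-DD format, e.g. 2024-03-15"
    }
}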
Anti-patterns to avoid:
- Embedding complex conditional logic directly in field descriptions
- Asking for computed values that require multiple decision points
- Creating fields that try to do multiple things at once
- Using overly complex business rules within the extraction schema
Schema design examples:
# ❌ Problematic: Too much logic in the field description
problematic_field = {
    "calculated_score": {
        "type": "number",
        "description": "If revenue > 1M, multiply by 0.8, else if revenue < 500K multiply by 1.2, otherwise use the base score from table 3, but only if the date is after 2020 and the category is not 'exempt'"
    }
}
What works better:
# ✅ Better: Simple extraction, handle logic separately
better_schema = {
    "revenue": {
        "type": "number",
        "description": "Total revenue in dollars"
    },
    "base_score": {
        "type": "number",
        "description": "Base score value from the scoring table"
    },
    "date": {
        "type": "string",
        "description": "Date in YYYY-MM-DD format"
    },
    "category": {
        "type": "string",
        "description": "Business category"
    }
}
# Then handle calculations in your application code:
def calculate_final_score(extracted_data):
    revenue = extracted_data["revenue"]
    if revenue > 1000000:
        return extracted_data["base_score"] * 0.8
    elif revenue < 500000:
        return extracted_data["base_score"] * 1.2
    else:
        return extracted_data["base_score"]
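A brief usage sketch connecting the two steps (the file name is illustrative, and result.data assumes the LlamaExtract SDK's result object):
# Extract clean fields first, then apply the business rules in code
result = agent.extract("quarterly_report.pdf")  # agent configured with better_schema
final_score = calculate_final_score(result.data)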
Recommended approach: Design extraction schemas to capture clean, structured data first, then implement business logic and complex transformations in your application code for maximum reliability and maintainability.
Overall Performance Best Practices
For maximum extraction success:
- Start small and iterate: Begin with a subset of your data to validate your extraction approach before scaling
- Design clear, focused schemas: Define exactly what you want to extract rather than trying to capture "everything". See the section on schema design.
- Leverage document structure: Use page ranges, sections, and chunking strategies to optimize processing
- Test thoroughly: Validate your schemas across different document types and edge cases
- Combine tools strategically: Use LlamaExtract for intelligent content understanding and traditional tools for large-scale data processing
- Monitor and adjust: Continuously evaluate extraction quality and adjust your approach based on real-world performance
Remember: LlamaExtract excels at understanding and structuring document content intelligently. Focus on leveraging this strength while using complementary tools for computational tasks and large-scale processing.