Performance Tips
Some scenarios require careful schema and extraction workflow design. When your use case falls into one of these scenarios, the tips below will help you work within LLM limitations and achieve good extraction results.
Entity Enumeration on Long Documents
The situation: When working with lengthy documents (50+ pages) that contain repetitive information or many instances of similar entities, you want to ensure comprehensive and accurate extraction.
Best practices for optimal performance:
- Use targeted extraction: Design schemas that extract specific, well-defined fields rather than exhaustive lists
- Implement document chunking: Break large documents into logical sections (by chapter, section, or topic) and run separate extractions (see the sketch after this list)
- Leverage page-based extraction: If each page contains an entity to be extracted, use the PER_PAGE extraction target to process each page individually for comprehensive coverage
- Apply page range filtering: Use the page_range option to focus on specific sections where your target data is located
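For example, here is a minimal chunking sketch: it splits a long PDF into fixed-size page ranges with pypdf and runs a separate extraction on each chunk. The chunk size, the file naming, and the use of an already-configured LlamaExtract agent are illustrative assumptions, not requirements of the API.
# Sketch: page-based chunking with pypdf (an assumption; any PDF splitter works).
# "agent" is an already-configured LlamaExtract extraction agent.
from pypdf import PdfReader, PdfWriter

def extract_in_page_chunks(agent, pdf_path, pages_per_chunk=10):
    reader = PdfReader(pdf_path)
    results = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        # Write the current page range out as a small chunk file
        writer = PdfWriter()
        for page_index in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[page_index])
        chunk_path = f"{pdf_path}.pages_{start + 1}_{start + pages_per_chunk}.pdf"
        with open(chunk_path, "wb") as f:
            writer.write(f)
        # Run a separate, targeted extraction per chunk
        results.append(agent.extract(chunk_path))
    return results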
Anti-patterns to avoid:
- Asking for exhaustive enumeration across very long documents (e.g., "extract all product names from this 100-page catalog")
- Using overly broad, undefined extraction targets without structure or limits
- Processing entire long documents as single extraction jobs without consideration of context limits
Schema design examples:
# ❌ Problematic: Asking for exhaustive enumeration on a large document
problematic_schema = {
    "all_product_names": {
        "type": "array",
        "description": "Every single product mentioned anywhere in this document"
    }
}

# ✅ Better: Targeted, structured extraction
better_schema = {
    "primary_products": {
        "type": "array",
        "description": "The main 3-5 products featured in this section",
    },
    "product_category": {
        "type": "string",
        "description": "The category this section focuses on"
    }
}
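As a usage sketch, the targeted schema can then be attached to an extraction agent and run against a single pre-split section rather than the whole catalog. The import path, the create_agent call with a wrapped object schema, and the file name below follow the LlamaExtract Python SDK conventions, but treat them as assumptions and check the current SDK reference.
# Sketch: attach the targeted schema to an extraction agent and run it on one
# pre-split section rather than the full 100-page catalog. Wrapping the fields
# into a top-level object schema and all names here are illustrative.
from llama_cloud_services import LlamaExtract

extractor = LlamaExtract()
agent = extractor.create_agent(
    name="catalog-section-extractor",
    data_schema={"type": "object", "properties": better_schema},
)

result = agent.extract("catalog_section_3.pdf")  # one chunked section
print(result.data)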
Tabular Data Transformation (CSV/Excel)
The situation: When working with spreadsheets or CSV files, you need to extract and transform structured data while maintaining accuracy and exhaustiveness.
Performance guidelines by table size:
- Small tables (< 50 rows): Process directly with excellent results
- Medium tables (50-100 rows): Process directly but validate thoroughly
- Large tables (> 100 rows): Use batch processing strategies for optimal results
Best practices for optimal performance:
- Implement batch processing: Process large tables in smaller, manageable chunks (20-50 rows at a time)
- Focus on header/metadata extraction: Extract table structure, column definitions, and summary information for comprehensive understanding (see the schema sketch after this list)
- Use sample-based processing: Process representative samples to validate your approach before scaling
- Combine with traditional tools: Use LlamaExtract for intelligent extraction and pandas/SQL for large-scale transformations
- Validate incrementally: Test your extraction logic on small batches before processing entire datasets
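For the header/metadata guideline, a small schema along these lines (the field names are illustrative) captures the table's structure and summary rather than every row:
# Sketch: extract table structure and summary information instead of all rows
table_overview_schema = {
    "column_names": {
        "type": "array",
        "description": "Names of the columns in the table, in order"
    },
    "row_count_estimate": {
        "type": "number",
        "description": "Approximate number of data rows in the table"
    },
    "table_summary": {
        "type": "string",
        "description": "One-sentence summary of what the table contains"
    }
}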
Anti-patterns to avoid:
- Processing hundreds or thousands of rows in a single extraction job
- Attempting to transform every row of massive datasets without batching
- Using LlamaExtract for purely computational tasks that don't require language understanding
Optimal workflow for large tables:
# ✅ Process a large CSV in batches
import csv

def process_large_csv(csv_file, batch_size=50):
    with open(csv_file, newline="") as f:
        header, *rows = list(csv.reader(f))
    results = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        # Convert the batch to document format: write it (with the header)
        # out as a small CSV file the agent can ingest
        batch_file = f"{csv_file}.batch_{i}.csv"
        with open(batch_file, "w", newline="") as f:
            csv.writer(f).writerows([header] + batch)
        results.append(agent.extract(batch_file))  # agent: your LlamaExtract agent
    return combine_results(results)  # combine_results: your own merge step
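To combine LlamaExtract with traditional tools, you can hand the per-batch results to pandas for the purely computational work. The sketch below assumes each batch result exposes a list of row dicts under result.data["rows"]; adapt the key and column names to your own schema. It is also one possible implementation of the combine_results step above.
# Sketch: LlamaExtract for understanding, pandas for bulk transformation.
# Assumes each batch result holds a list of row dicts under result.data["rows"].
import pandas as pd

def results_to_dataframe(results):
    rows = []
    for result in results:
        rows.extend(result.data.get("rows", []))
    return pd.DataFrame(rows)

# If results_to_dataframe is used as combine_results, downstream aggregation
# stays in pandas/SQL, e.g.:
# df = process_large_csv("products.csv")
# totals = df.groupby("category")["price"].sum()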
Complex Field Transformations
The situation: When creating extraction schemas, you want to capture complex business logic and data transformations while ensuring reliable, consistent results.
Best practices for optimal performance:
- Keep extraction simple and focused: Design fields that extract clean, structured data without complex computations
- Separate extraction from transformation: Handle business logic, calculations, and complex rules in your application code after extraction
- Use clear, single-purpose descriptions: Each field should have one clear, well-defined extraction target
- Provide concrete examples: Include specific examples in field descriptions when the expected format might be ambiguous (as shown below)
- Test iteratively: Start with simple schemas and add complexity gradually while validating results
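For instance, a single-purpose field whose description includes a concrete example of the expected format (the field name and wording are illustrative):
# Sketch: an unambiguous, single-purpose field with a concrete format example
date_field = {
    "invoice_date": {
        "type": "string",
        "description": "Invoice date in YYYY-MM-DD format, e.g. 2024-03-15"
    }
}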
Anti-patterns to avoid:
- Embedding complex conditional logic directly in field descriptions
- Asking for computed values that require multiple decision points
- Creating fields that try to do multiple things at once
- Using overly complex business rules within the extraction schema
Schema design examples:
# ❌ Problematic: Too much logic in the field description
problematic_field = {
    "calculated_score": {
        "type": "number",
        "description": "If revenue > 1M, multiply by 0.8, else if revenue < 500K multiply by 1.2, otherwise use the base score from table 3, but only if the date is after 2020 and the category is not 'exempt'"
    }
}
What works better:
# ✅ Better: Simple extraction, handle logic separately
better_schema = {
    "revenue": {
        "type": "number",
        "description": "Total revenue in dollars"
    },
    "base_score": {
        "type": "number",
        "description": "Base score value from the scoring table"
    },
    "date": {
        "type": "string",
        "description": "Date in YYYY-MM-DD format"
    },
    "category": {
        "type": "string",
        "description": "Business category"
    }
}
# Then handle calculations in your application code:
def calculate_final_score(extracted_data):
    revenue = extracted_data["revenue"]
    if revenue > 1000000:
        return extracted_data["base_score"] * 0.8
    elif revenue < 500000:
        return extracted_data["base_score"] * 1.2
    else:
        return extracted_data["base_score"]
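A brief usage sketch connecting the two steps (the file name is illustrative, and result.data assumes the LlamaExtract SDK's result object):
# Extract clean fields first, then apply the business rules in code
result = agent.extract("quarterly_report.pdf")  # agent configured with better_schema
final_score = calculate_final_score(result.data)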
Recommended approach: Design extraction schemas to capture clean, structured data first, then implement business logic and complex transformations in your application code for maximum reliability and maintainability.
Overall Performance Best Practices
For maximum extraction success:
- Start small and iterate: Begin with a subset of your data to validate your extraction approach before scaling
- Design clear, focused schemas: Define exactly what you want to extract rather than trying to capture "everything". See the section on schema design.
- Leverage document structure: Use page ranges, sections, and chunking strategies to optimize processing
- Test thoroughly: Validate your schemas across different document types and edge cases
- Combine tools strategically: Use LlamaExtract for intelligent content understanding and traditional tools for large-scale data processing
- Monitor and adjust: Continuously evaluate extraction quality and adjust your approach based on real-world performance
Remember: LlamaExtract excels at understanding and structuring document content intelligently. Focus on leveraging this strength while using complementary tools for computational tasks and large-scale processing.