Schemas
At the core of LlamaExtract is the schema, which defines the structure of the data you want to extract from your documents.
Schema Design: Tips & Best Practices
- Try to limit schema nesting to 3-4 levels.
- Make fields optional when data might not always be present. Marking such fields as required may force the model to hallucinate values when they are absent from the document.
- When you want to extract a variable number of entities, use an `array` type. Note, however, that the root node cannot be an `array` type.
- Use descriptive field names and detailed descriptions. Descriptions are a good place to pass formatting instructions or few-shot examples (see the sketch after this list).
- Above all, start simple and iteratively build your schema to incorporate requirements.
- If you are hitting token limits, rethink your schema design to enable a more efficient extraction; e.g., you can use multiple schemas to extract different subsets of fields from the same document. Likewise, if you are hitting document size limits, consider breaking the document into smaller chunks.
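As an illustration of packing formatting instructions and few-shot examples into descriptions, here is a minimal sketch. The `Invoice` model and its fields are illustrative, not part of the SDK:

```python
from typing import Optional

from pydantic import BaseModel, Field

# Illustrative model: the descriptions carry a formatting rule and
# few-shot examples that guide the extraction model.
class Invoice(BaseModel):
    invoice_date: Optional[str] = Field(
        None,
        description="Invoice date formatted as YYYY-MM-DD, e.g. 2024-03-15",
    )
    currency: Optional[str] = Field(
        None,
        description="ISO 4217 currency code, e.g. 'USD', 'EUR', 'JPY'",
    )
```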
Defining Schemas (Python SDK)
Schemas can be defined using either Pydantic models or JSON Schema:
Using Pydantic (Recommended)
```python
from pydantic import BaseModel, Field
from typing import List, Optional

class Experience(BaseModel):
    company: str = Field(description="Company name")
    title: str = Field(description="Job title")
    # Optional fields need an explicit default to be truly optional
    start_date: Optional[str] = Field(None, description="Start date of employment")
    end_date: Optional[str] = Field(None, description="End date of employment")

class Resume(BaseModel):
    name: str = Field(description="Candidate name")
    experience: List[Experience] = Field(description="Work history")
```
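Assuming a LlamaExtract client named `extractor` (as in the JSON Schema example below), the Pydantic model class can then be passed as the schema when creating an agent:

```python
# Pass the Pydantic model class as the extraction schema
agent = extractor.create_agent(name="resume-parser", data_schema=Resume)
```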
Using JSON Schema
```python
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Candidate name"},
        "experience": {
            "type": "array",
            "description": "Work history",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string", "description": "Company name"},
                    "title": {"type": "string", "description": "Job title"},
                    "start_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "Start date of employment",
                    },
                    "end_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "End date of employment",
                    },
                },
            },
        },
    },
}

agent = extractor.create_agent(name="resume-parser", data_schema=schema)
```
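With the agent in place, extraction is a single call. A minimal sketch, assuming a local file path (`resume.pdf` is a placeholder) and the SDK's `extract` method, whose result exposes the structured output on `.data`:

```python
# Extract structured data from a document and read the result
result = agent.extract("resume.pdf")
print(result.data)  # dict matching the schema, e.g. {"name": ..., "experience": [...]}
```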
Important restrictions on JSON/Pydantic Schema
LlamaExtract supports only a subset of the JSON Schema specification. While limited, it should be sufficient for a wide variety of use cases.
- All fields are required by default. Nullable fields must be explicitly marked as such, using `anyOf` with a `null` type. See the `"start_date"` field in the example above.
- The root node must be of type `object`.
- Schema nesting is limited to 5 levels.
- The fields that matter are key names/titles, `type`, and `description`. Keywords for formatting, default values, etc. are not supported.
- There are further restrictions on the number of keys, the size of the schema, etc. that you may hit for complex extraction use cases. In such cases, it is worth restructuring your extraction workflow to fit within these constraints, e.g. by extracting subsets of fields separately and merging the results, as sketched below.
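A minimal sketch of that split-and-merge pattern, assuming the client and `extract` usage shown earlier. The sub-schemas here are illustrative:

```python
from typing import List, Optional

from pydantic import BaseModel, Field

# Illustrative sub-schemas: each extracts a different subset of fields
class ContactInfo(BaseModel):
    name: str = Field(description="Candidate name")
    email: Optional[str] = Field(None, description="Email address")

class WorkHistory(BaseModel):
    companies: List[str] = Field(description="Companies the candidate worked at")

# One agent per sub-schema (`extractor` is a LlamaExtract client, as above)
contact_agent = extractor.create_agent(name="resume-contact", data_schema=ContactInfo)
history_agent = extractor.create_agent(name="resume-history", data_schema=WorkHistory)

# Run both extractions over the same document, then merge the results
contact = contact_agent.extract("resume.pdf")
history = history_agent.extract("resume.pdf")
merged = {**contact.data, **history.data}
```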