Schemas
At the core of LlamaExtract is the schema, which defines the structure of the data you want to extract from your documents.
See existing schemas
You can see all the schemas you've created by calling extractor.list_schemas():
extractor = LlamaExtract()
schemas = extractor.list_schemas()
for schema in schemas:
    print(f"{schema.name}: {schema.id}")
By default, this lists all the schemas in the default project; pass a project_id argument to list_schemas to list the schemas in a specific project instead.
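If you know a schema's name but not its ID, you can search the listing for it. The following is a minimal sketch, assuming only that each listed schema exposes the name and id attributes used above. SchemaInfo is a hypothetical stand-in so the example runs without an API call, and find_schema_id is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SchemaInfo:
    """Hypothetical stand-in for the objects returned by list_schemas()."""
    name: str
    id: str


def find_schema_id(schemas: Iterable, name: str) -> Optional[str]:
    """Return the id of the first schema whose name matches, or None."""
    for schema in schemas:
        if schema.name == name:
            return schema.id
    return None


# With the real client this would be: schemas = extractor.list_schemas()
schemas = [
    SchemaInfo(name="Invoices", id="id-1"),
    SchemaInfo(name="Test Schema", id="id-2"),
]
print(find_schema_id(schemas, "Test Schema"))
```

The returned ID can then be passed to calls that take a schema ID.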
Retrieve an existing schema
If you already have a schema ID, you can retrieve the schema object by calling extractor.get_schema():
extractor = LlamaExtract()
schema = extractor.get_schema("616c354a-dd4e-44b0-a830-89e0f52a2169")
print(schema)
Schema IDs are unique across projects, so you don't need to pass a project_id.
Creating a schema
There are 3 ways to create a schema:
1. Infer from existing documents
The easiest way to create a schema is to infer it from existing documents. This is done by uploading a set of documents to LlamaExtract, which will analyze the documents and create a schema based on the data it finds. This can be done in the web UI or using the Python library:
extractor = LlamaExtract()
extractor.infer_schema("A Schema Name", ["data/file1.pdf", "data/file2.pdf"])
The second argument is the set of files to infer from. Each file can be given as a string path, a Path object, the contents of the file as raw bytes, or a BufferedIOBase object. See limitations for how many files you can pass to inference at a time, and how many pages can be in those files.
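The accepted input forms can be sketched with the standard library alone. This example only prepares the inputs; the infer_schema call is left as a comment because it needs a live client. The placeholder file contents are made up for illustration, and whether different forms may be mixed in one list is an assumption, not something stated above.

```python
import io
import tempfile
from pathlib import Path

# Write a small placeholder file so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
pdf_path = tmp_dir / "file1.pdf"
pdf_path.write_bytes(b"%PDF-1.4 placeholder content")

as_str_path = str(pdf_path)        # string path to the file
as_path = pdf_path                 # pathlib.Path object
as_bytes = pdf_path.read_bytes()   # contents of the file as raw bytes
as_buffer = io.BytesIO(as_bytes)   # io.BytesIO is a BufferedIOBase subclass

# Assumption: the forms can be mixed in a single list.
files = [as_str_path, as_path, as_bytes, as_buffer]
# With the real client: extractor.infer_schema("A Schema Name", files)
```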
You can also optionally pass a third argument, project_id, which is the string ID of the project in which to create the schema. If you don't pass this argument, the schema will be created in the default project.
The schema inferred from documents may not be perfect, so you may want to modify it. See below for how to do that.
2. Create a schema from a Pydantic model
If you know in advance what data structure you want to create, one of the easiest ways to do that is to create a Pydantic model of that structure:
from pydantic import BaseModel, Field

extractor = LlamaExtract()

class ResumeMetadata(BaseModel):
    """Resume metadata."""
    years_of_experience: int = Field(..., description="Number of years of work experience.")
    highest_degree: str = Field(..., description="Highest degree earned (options: High School, Bachelor's, Master's, Doctoral, Professional)")
    professional_summary: str = Field(..., description="A general summary of the candidate's experience")

extraction_schema = extractor.create_schema("Test Schema", ResumeMetadata)
As with inferred schemas, you can pass a project_id as the third argument to create_schema.
3. Create a schema from a JSON schema
If you aren't working in Python it can be more convenient to create a schema directly from JSON. Here's an equivalent schema to the Pydantic model above:
{
  "type": "object",
  "title": "ResumeMetadata",
  "required": [
    "years_of_experience",
    "highest_degree",
    "professional_summary"
  ],
  "properties": {
    "highest_degree": {
      "type": "string",
      "title": "Highest Degree",
      "description": "Highest degree earned (options: High School, Bachelor's, Master's, Doctoral, Professional)"
    },
    "years_of_experience": {
      "type": "integer",
      "title": "Years Of Experience",
      "description": "Number of years of work experience."
    },
    "professional_summary": {
      "type": "string",
      "title": "Professional Summary",
      "description": "A general summary of the candidate's experience"
    }
  },
  "description": "Resume metadata."
}
In Python, you can pass in the schema object just as you would a Pydantic model:
extractor = LlamaExtract()
resume_metadata = ...object above...
extraction_schema = extractor.create_schema("Test Schema", resume_metadata)
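If the JSON schema lives in a file or a string, one way to obtain the object to pass in is to parse it with the standard json module. The sketch below uses an abbreviated version of the schema above and assumes that a plain Python dict is an acceptable form of the "schema object". The missing_required helper is hypothetical and only catches one common mistake before the schema is sent.

```python
import json

# Abbreviated version of the resume schema, as it might appear in a file.
schema_json = """
{
  "type": "object",
  "title": "ResumeMetadata",
  "required": ["years_of_experience"],
  "properties": {
    "years_of_experience": {
      "type": "integer",
      "description": "Number of years of work experience."
    }
  }
}
"""


def missing_required(schema: dict) -> list:
    """Return names listed in "required" that have no entry in "properties"."""
    properties = schema.get("properties", {})
    return [name for name in schema.get("required", []) if name not in properties]


resume_metadata = json.loads(schema_json)  # parsed into a plain dict
print(missing_required(resume_metadata))   # prints []
# With the real client: extractor.create_schema("Test Schema", resume_metadata)
```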
Modifying a schema
If you have inferred a schema, the data structure may not be perfect. You can modify the schema by calling extractor.update_schema():
extractor = LlamaExtract()
schema = extractor.update_schema("ed9bba8a-0e0c-4d70-981a-71b86f78cc6e", resume_metadata)
Just like when creating schemas, resume_metadata can be a Pydantic model or a JSON schema object.
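When the schema is a JSON schema object, a modification is an ordinary change to that object before it is passed to update_schema. The following sketch adds one field to a copy of an abbreviated resume schema. The add_field helper and the email field are hypothetical, chosen only for illustration, and the update_schema call is left as a comment because it needs a live client and a real schema ID.

```python
import copy


def add_field(schema: dict, name: str, json_type: str, description: str,
              required: bool = True) -> dict:
    """Return a copy of a JSON schema with one extra top-level property."""
    updated = copy.deepcopy(schema)  # leave the caller's schema untouched
    updated.setdefault("properties", {})[name] = {
        "type": json_type,
        "description": description,
    }
    if required and name not in updated.setdefault("required", []):
        updated["required"].append(name)
    return updated


resume_metadata = {
    "type": "object",
    "title": "ResumeMetadata",
    "required": ["years_of_experience"],
    "properties": {
        "years_of_experience": {
            "type": "integer",
            "description": "Number of years of work experience.",
        }
    },
}

updated = add_field(resume_metadata, "email", "string", "Candidate's email address.")
# With the real client: extractor.update_schema("<schema id>", updated)
```

Working on a copy keeps the original schema available for comparison if the updated one turns out to extract worse results.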