Parsing options
Set language
LlamaParse uses OCR to extract text from images. Our OCR supports a long list of languages, and you can tell LlamaParse which language(s) to parse for by setting this option. To specify multiple languages, separate them with commas. This option only affects text extracted from images.
In Python:
parser = LlamaParse(
    language="fr"
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'language="fr"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
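As a minimal sketch of the comma-separated format described above, the helper below joins a list of language codes into a single string for the language option. The name `language_option` is hypothetical and not part of LlamaParse, and "en" is used only as an illustrative second code; check the supported-language list for the codes you need.

```python
def language_option(codes):
    """Join language codes into the comma-separated string form."""
    # Drop surrounding whitespace and empty entries before joining.
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return ",".join(cleaned)


print(language_option(["fr", "en"]))
```

The resulting string can be passed as the language value in either the Python or the API form shown above.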
Parsing instructions
LlamaParse can use LLMs under the hood, so you can give it natural-language instructions describing the document it is parsing and how to parse it. This is an incredibly powerful feature!
In Python:
parser = LlamaParse(
    parsing_instruction="You are parsing a receipt from a restaurant. Please extract the total amount paid and the tip."
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'parsing_instruction="string"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Is formatting instruction
Allow the parsing instruction to also format the output. Default: true.
If turned off, our custom formatting instructions are used to output the best markdown possible. You can still add a parsing instruction, for instance to translate the text.
In Python:
parser = LlamaParse(
    is_formatting_instruction=False
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'is_formatting_instruction="false"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
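The curl examples pass boolean options as the lowercase strings "true" and "false", while the Python client takes True and False. Below is a small sketch of that mapping for anyone building the multipart form by hand. `to_form_fields` is a hypothetical helper, and the assumption that every option can be sent as a plain string field is inferred from the curl examples on this page, not from an API specification.

```python
def to_form_fields(options):
    """Convert Python option values into string form fields."""
    fields = {}
    for name, value in options.items():
        if value is None:
            continue  # leave unset options out so the service default applies
        if isinstance(value, bool):
            # Assumption: booleans are sent as "true"/"false", as in the curl examples.
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields({"is_formatting_instruction": False, "language": "fr"})
print(fields)
```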
Disable OCR
By default, LlamaParse runs OCR on images embedded in the document. You can disable it by setting disable_ocr to True.
In Python:
parser = LlamaParse(
    disable_ocr=True
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'disable_ocr="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Skip diagonal text
By default, LlamaParse will attempt to parse text that is diagonal on the page. This can be useful for some documents, but it can also lead to errors. If you're seeing strange results, try setting skip_diagonal_text to True.
In Python:
parser = LlamaParse(
    skip_diagonal_text=True
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'skip_diagonal_text="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Do not unroll columns
By default, LlamaParse will attempt to unroll columns (putting them after each other in reading order). Setting do_not_unroll_columns to True will prevent LlamaParse from doing so.
In Python:
parser = LlamaParse(
    do_not_unroll_columns=True
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'do_not_unroll_columns="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
Target pages
A comma-separated string listing the pages to extract. By default, all pages are extracted. Pages are numbered starting at 0.
In Python:
parser = LlamaParse(
    target_pages="0,2,7"
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'target_pages="0,2,7"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
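Because pages are numbered from 0, it is easy to be off by one when starting from the page numbers shown in a PDF viewer. The sketch below converts 1-based page numbers into the 0-based, comma-separated string described above. `target_pages_option` is a hypothetical helper, not part of LlamaParse.

```python
def target_pages_option(page_numbers):
    """Build a target_pages string from 1-based page numbers."""
    indices = []
    # Deduplicate and sort so the string lists each page once, in order.
    for number in sorted(set(page_numbers)):
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        indices.append(str(number - 1))  # shift to 0-based numbering
    return ",".join(indices)


print(target_pages_option([1, 3, 8]))  # prints 0,2,7
```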
Page separator
By default, LlamaParse separates pages in the markdown and text output with \n---\n. You can change this separator by setting page_separator to the desired string.
In Python:
parser = LlamaParse(
    page_separator="\n=================\n"
)
It's also possible to include the page number within the separator using {pageNumber} in the string. It will be replaced by the page number of the next page.
In Python:
parser = LlamaParse(
    page_separator="\n== {pageNumber} ==\n"  # Becomes "\n== 4 ==\n" between pages 3 and 4.
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'page_separator="\n== {pageNumber} ==\n"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
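A custom separator is mostly useful when you want to split the parsed text back into pages afterwards. The sketch below does that post-processing for a separator template containing {pageNumber}. `split_pages` is a hypothetical helper, not part of LlamaParse, and it assumes the separator never occurs inside the page content itself.

```python
import re


def split_pages(text, separator_template):
    """Split parsed text into pages using a separator template."""
    # Escape the template so it matches literally, then let the
    # {pageNumber} placeholder match any run of digits.
    pattern = re.escape(separator_template).replace(
        re.escape("{pageNumber}"), r"\d+"
    )
    return re.split(pattern, text)


parsed = "page one\n== 2 ==\npage two\n== 3 ==\npage three"
print(split_pages(parsed, "\n== {pageNumber} ==\n"))
```

The same function works for a separator without a placeholder, since the template is then matched as a literal string.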
Page prefix and suffix
It's possible to specify a prefix or a suffix to be added to each page. These strings can also contain {pageNumber}, which will be replaced by the current page number. Both parameters are optional and empty by default.
In Python:
parser = LlamaParse(
    page_prefix="START OF PAGE: {pageNumber}\n",
    page_suffix="\nEND OF PAGE: {pageNumber}"
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'page_prefix="START OF PAGE: {pageNumber}\n"' \
--form 'page_suffix="\nEND OF PAGE: {pageNumber}"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
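With a numbered prefix and suffix, each page of the output can be recovered by number. The sketch below shows one way to do this with a regular expression. `extract_pages` is a hypothetical helper, not part of LlamaParse; it assumes the prefix contains {pageNumber} and that the markers do not appear inside the page content. The sample text uses the default page separator only for illustration.

```python
import re

PLACEHOLDER = re.escape("{pageNumber}")


def extract_pages(text, prefix, suffix):
    """Return {page_number: page_text} from prefix/suffix-wrapped output."""
    if "{pageNumber}" not in prefix:
        raise ValueError("prefix must contain {pageNumber}")
    # The first placeholder captures the page number; any later one
    # must repeat the same number (a backreference).
    start = re.escape(prefix).replace(PLACEHOLDER, r"(?P<num>\d+)", 1)
    start = start.replace(PLACEHOLDER, r"(?P=num)")
    end = re.escape(suffix).replace(PLACEHOLDER, r"(?P=num)")
    pattern = re.compile(start + r"(?P<body>.*?)" + end, re.DOTALL)
    return {int(m.group("num")): m.group("body") for m in pattern.finditer(text)}


parsed = (
    "START OF PAGE: 1\nhello\nEND OF PAGE: 1"
    "\n---\n"
    "START OF PAGE: 2\nworld\nEND OF PAGE: 2"
)
print(extract_pages(parsed, "START OF PAGE: {pageNumber}\n", "\nEND OF PAGE: {pageNumber}"))
```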
Bounding box
Specify an area of a document that you want to parse. This can be helpful to remove headers and footers. To do so, provide the bounding box margins as a comma-separated string in clockwise order starting from the top (top, right, bottom, left). Each margin is expressed as a fraction of the page size, a number between 0 and 1.
Examples:
- To exclude the top 10% of a document: bounding_box="0.1,0,0,0"
- To exclude the top 10% and bottom 20% of a document: bounding_box="0.1,0,0.2,0"
In Python:
parser = LlamaParse(
    bounding_box="0.1,0,0.2,0"
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'bounding_box="0.1,0,0.2,0"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
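Since the order of the four margins is easy to get wrong, a small sketch with named arguments may help. `bounding_box_option` is a hypothetical helper, not part of LlamaParse. The range check follows the 0 to 1 rule above; the check that the margins leave some area to parse is an extra sanity check added here, not documented behaviour.

```python
def bounding_box_option(top=0.0, right=0.0, bottom=0.0, left=0.0):
    """Build a bounding_box string: margins clockwise from the top."""
    margins = (top, right, bottom, left)
    for margin in margins:
        if not 0 <= margin <= 1:
            raise ValueError(f"margin must be between 0 and 1, got {margin}")
    # Extra sanity check (an assumption): opposite margins should not
    # cover the whole page.
    if top + bottom >= 1 or left + right >= 1:
        raise ValueError("margins leave no area to parse")
    # The "g" format drops trailing zeros, so 0.0 becomes "0".
    return ",".join(f"{margin:g}" for margin in margins)


print(bounding_box_option(top=0.1, bottom=0.2))  # prints 0.1,0,0.2,0
```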
Take screenshot
Take a screenshot of each page and add it to the JSON output in the following format:
{
  "images": [
    {
      "name": "page_1.jpg",
      "height": 792,
      "width": 612,
      "x": 0,
      "y": 0,
      "type": "full_page_screenshot"
    }
  ]
}
In Python:
parser = LlamaParse(
    take_screenshot=True
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'take_screenshot="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'
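To pick the screenshots out of the JSON output, you can filter the images list on its type field. The sketch below runs against the sample shown above. `full_page_screenshots` is a hypothetical helper, and where the "images" key sits within the full JSON result is an assumption; adjust the lookup to the structure you actually receive.

```python
import json

# Sample taken from the format shown above.
sample = """
{ "images": [ { "name": "page_1.jpg", "height": 792, "width": 612,
                "x": 0, "y": 0, "type": "full_page_screenshot" } ] }
"""


def full_page_screenshots(result):
    """Return the names of all full-page screenshots in a result dict."""
    return [
        image["name"]
        for image in result.get("images", [])
        if image.get("type") == "full_page_screenshot"
    ]


print(full_page_screenshots(json.loads(sample)))  # prints ['page_1.jpg']
```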
Disable image extraction
It is possible to disable the extraction of images for better performance by setting disable_image_extraction to True.
In Python:
parser = LlamaParse(
    disable_image_extraction=True
)
Using the API:
curl -X 'POST' \
'https://api.cloud.llamaindex.ai/api/parsing/upload' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
--form 'disable_image_extraction="true"' \
-F 'file=@/path/to/your/file.pdf;type=application/pdf'