Parsing options

Result type

By default, LlamaParse returns your results as parsed text. The other options are markdown, which formats the output as clean Markdown, and json, which returns a data structure representing the parsed object.

In Python:
from llama_parse import LlamaParse

parser = LlamaParse(
  result_type="markdown"
)
Using the API:
curl -X 'GET' \
  'https://api.cloud.llamaindex.ai/api/parsing/job/<job_id>/result/markdown'  \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY"

Set language

LlamaParse uses OCR to extract text from images. Our OCR supports a long list of languages, and you can tell LlamaParse which language(s) to parse for by setting this option. To specify multiple languages, separate them with commas. This option only affects text extracted from images.

In Python:
parser = LlamaParse(
  language="fr"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'language="fr"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
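
When the languages come from a list in your own code, the comma-separated value described above can be built with a small helper. This is a minimal sketch: language_option is a hypothetical helper, not part of LlamaParse, and "en" is assumed here to be a supported code alongside the "fr" used above.

```python
def language_option(codes):
    """Join language codes into a single comma-separated string."""
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return ",".join(cleaned)


# Hypothetical usage: the result is the value to pass as `language`.
print(language_option(["fr", "en"]))
```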

Parsing instructions

LlamaParse can use LLMs under the hood, allowing you to give it natural-language instructions about what it's parsing and how to parse. This is an incredibly powerful feature!

In Python:
parser = LlamaParse(
  parsing_instruction = "You are parsing a receipt from a restaurant. Please extract the total amount paid and the tip."
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'parsing_instruction="string"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Is formatting instruction

Allows the parsing instruction to also format the output. Default: true.

If set to false, LlamaParse applies its own formatting instructions to produce the best Markdown it can. You can still add a parsing instruction for other purposes, for instance to translate the text.

In Python:
parser = LlamaParse(
  is_formatting_instruction=False
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'is_formatting_instruction="false"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Disable OCR

By default, LlamaParse runs OCR on images embedded in the document. You can disable it by setting disable_ocr to True.

In Python:
parser = LlamaParse(
  disable_ocr=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'disable_ocr="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Skip diagonal text

By default, LlamaParse will attempt to parse text that is diagonal on the page. This can be useful for some documents, but it can also lead to errors. If you're seeing strange results, try setting skip_diagonal_text to True.

In Python:
parser = LlamaParse(
  skip_diagonal_text=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'skip_diagonal_text="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Do not unroll columns

By default, LlamaParse will attempt to unroll columns (putting them after each other in reading order). Setting do_not_unroll_columns to True will prevent LlamaParse from doing so.

In Python:
parser = LlamaParse(
  do_not_unroll_columns=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'do_not_unroll_columns="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Target pages

A comma-separated string listing the pages to be extracted. By default, all pages are extracted. Pages are numbered starting at 0.

In Python:
parser = LlamaParse(
  target_pages="0,2,7"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'target_pages="0,2,7"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
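
Because pages are numbered from 0, page numbers read off a PDF viewer (which typically start at 1) need to be shifted before they are passed in. The following is a minimal sketch of that conversion; to_target_pages is a hypothetical helper, not part of LlamaParse.

```python
def to_target_pages(page_numbers):
    """Convert 1-based page numbers into a 0-based, comma-separated string."""
    if any(n < 1 for n in page_numbers):
        raise ValueError("page numbers are 1-based and must be >= 1")
    # Shift to 0-based numbering, drop duplicates, and sort.
    zero_based = sorted({n - 1 for n in page_numbers})
    return ",".join(str(n) for n in zero_based)


# Pages 1, 3 and 8 in a viewer correspond to the "0,2,7" used above.
print(to_target_pages([1, 3, 8]))
```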

Page separator

By default, LlamaParse will separate pages in the markdown and text output by \n---\n. You can change this separator by setting page_separator to the desired string.

In Python:
parser = LlamaParse(
  page_separator="\n=================\n"
)

It's also possible to include the page number within the separator using {pageNumber} in the string. It will be replaced by the page number of the next page.

In Python:
parser = LlamaParse(
  page_separator="\n== {pageNumber} ==\n" # Will transform to "\n== 4 ==\n" to separate page 3 and 4.
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'page_separator="\n== {pageNumber} ==\n"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
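
A separator that carries the page number makes it straightforward to split the returned text back into pages. The sketch below assumes the "\n== {pageNumber} ==\n" separator from the example above; split_pages is a hypothetical helper, the sample text is made up for illustration, and the number given to the first page (which has no separator before it) is an assumption passed in as first_page.

```python
import re

# Matches the separator once {pageNumber} has been filled in.
SEPARATOR = re.compile(r"\n== (\d+) ==\n")


def split_pages(text, first_page=1):
    """Return a dict mapping page number to page text."""
    # With one capture group, split() yields:
    # [first_text, number, text, number, text, ...]
    parts = SEPARATOR.split(text)
    pages = {first_page: parts[0]}
    for i in range(1, len(parts), 2):
        # Each separator holds the number of the page that follows it.
        pages[int(parts[i])] = parts[i + 1]
    return pages


# Made-up output for a three-page document.
sample = "Intro\n== 2 ==\nMethods\n== 3 ==\nResults"
print(split_pages(sample))
```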

Page prefix and suffix

It's possible to specify a prefix or a suffix to be added to each page. These strings can also contain {pageNumber}, which is replaced by the current page number. Both parameters are optional and empty by default.

In Python:
parser = LlamaParse(
  page_prefix="START OF PAGE: {pageNumber}\n",
  page_suffix="\nEND OF PAGE: {pageNumber}"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'page_prefix="START OF PAGE: {pageNumber}\n"' \
  --form 'page_suffix="\nEND OF PAGE: {pageNumber}"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Bounding box

Specify the area of a document that you want to parse. This can be helpful for removing headers and footers. To do so, provide the bounding box margins as a comma-separated string, in clockwise order from the top (top, right, bottom, left). Each margin is expressed as a fraction of the page size, a number between 0 and 1.

Examples:

  • To exclude the top 10% of a document: bounding_box="0.1,0,0,0"
  • To exclude the top 10% and bottom 20% of a document: bounding_box="0.1,0,0.2,0"
In Python:
parser = LlamaParse(
  bounding_box="0.1,0,0.2,0"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'bounding_box="0.1,0,0.2,0"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
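
Naming the margins can make the clockwise order harder to get wrong. The following is a minimal sketch, assuming the order top, right, bottom, left implied by the examples above; bounding_box_option is a hypothetical helper, and the check that the margins leave some area to parse is an added assumption, not documented LlamaParse behavior.

```python
def bounding_box_option(top=0.0, right=0.0, bottom=0.0, left=0.0):
    """Build a bounding_box string from margins given as page fractions."""
    margins = (top, right, bottom, left)  # clockwise, starting from the top
    if any(not 0 <= m <= 1 for m in margins):
        raise ValueError("each margin must be between 0 and 1")
    # Assumption: opposite margins must not cover the whole page.
    if top + bottom >= 1 or left + right >= 1:
        raise ValueError("margins leave no area to parse")
    # The "g" format drops trailing zeros, so 0.0 becomes "0".
    return ",".join(f"{m:g}" for m in margins)


# Exclude the top 10% and the bottom 20% of each page.
print(bounding_box_option(top=0.1, bottom=0.2))
```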

Take screenshot

Take a screenshot of each page and add it to the JSON output in the following format:

{
  "images": [
    {
      "name": "page_1.jpg",
      "height": 792,
      "width": 612,
      "x": 0,
      "y": 0,
      "type": "full_page_screenshot"
    }
  ]
}
In Python:
parser = LlamaParse(
  take_screenshot=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'take_screenshot="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
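
Once the JSON output is available, the screenshot entries can be picked out by their type field. This is a minimal sketch that works only on the fragment shown above; screenshot_names is a hypothetical helper, and where the "images" list sits inside the full JSON result is not covered here.

```python
import json


def screenshot_names(fragment):
    """Return the names of full-page screenshots in an "images" list."""
    return [
        image["name"]
        for image in fragment.get("images", [])
        if image.get("type") == "full_page_screenshot"
    ]


# The fragment from the format description above.
sample = json.loads("""
{
  "images": [
    {
      "name": "page_1.jpg",
      "height": 792,
      "width": 612,
      "x": 0,
      "y": 0,
      "type": "full_page_screenshot"
    }
  ]
}
""")
print(screenshot_names(sample))
```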