Parsing options

Set language

LlamaParse uses OCR to extract text from images. Our OCR supports a long list of languages, and you can tell LlamaParse which language(s) to parse for by setting this option. To specify multiple languages, separate them with commas. This option only affects text extracted from images.

In Python:
from llama_parse import LlamaParse

parser = LlamaParse(
  language="fr"  # the language code must be passed as a string
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'language="fr"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
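Because the option takes a single comma-separated string, it can be convenient to build that string from a list of codes. The following is a minimal sketch; `join_languages` is a hypothetical helper (not part of LlamaParse), and the codes used are only examples, so check them against the list of supported languages.

```python
def join_languages(codes):
    """Build a comma-separated language string from an iterable of codes.

    Hypothetical helper: strips whitespace and drops empty or duplicate
    entries while preserving the original order.
    """
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    return ",".join(seen)


language_option = join_languages(["fr", " en", "fr"])
print(language_option)  # fr,en
```

The resulting string could then be passed as the `language` value shown above.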

Disable OCR

By default, LlamaParse runs OCR on images embedded in the document. You can disable it by setting disable_ocr to True.

In Python:
parser = LlamaParse(
  disable_ocr=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'disable_ocr="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
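Note that the Python examples use the boolean True while the API examples send the string "true" as a form field. If you build API requests from Python, a small conversion step can keep the two in sync. This is a sketch only; `to_form_fields` is a hypothetical helper, and it assumes the API accepts "true"/"false" strings for boolean options, as the curl examples suggest.

```python
def to_form_fields(options):
    """Convert Python option values into form-field strings.

    Hypothetical helper: booleans become "true"/"false",
    everything else is converted with str().
    """
    fields = {}
    for name, value in options.items():
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields({"disable_ocr": True, "language": "fr"})
print(fields)  # {'disable_ocr': 'true', 'language': 'fr'}
```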

Skip diagonal text

By default, LlamaParse will attempt to parse text that is diagonal on the page. This can be useful for some documents, but it can also lead to errors. If you're seeing strange results, try setting skip_diagonal_text to True.

In Python:
parser = LlamaParse(
  skip_diagonal_text=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'skip_diagonal_text="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Do not unroll columns

By default, LlamaParse will attempt to unroll columns (putting them after each other in reading order). Setting do_not_unroll_columns to True will prevent LlamaParse from doing so.

In Python:
parser = LlamaParse(
  do_not_unroll_columns=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'do_not_unroll_columns="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Target pages

A comma-separated string listing the pages to extract. By default, all pages are extracted. Pages are numbered starting at 0.

In Python:
parser = LlamaParse(
  target_pages="0,2,7"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'target_pages="0,2,7"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
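Since pages are numbered from 0, page numbers as shown in a PDF viewer (which usually start at 1) need to be shifted by one. A minimal sketch of that conversion follows; `to_target_pages` is a hypothetical helper, not part of LlamaParse.

```python
def to_target_pages(pages_one_based):
    """Convert 1-based page numbers into a 0-based, comma-separated string."""
    # Shift to 0-based, drop duplicates, and sort for a stable result.
    zero_based = sorted({page - 1 for page in pages_one_based})
    if zero_based and zero_based[0] < 0:
        raise ValueError("page numbers must be >= 1")
    return ",".join(str(page) for page in zero_based)


print(to_target_pages([1, 3, 8]))  # 0,2,7
```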

Page separator

By default, LlamaParse will separate pages in the markdown and text output by \n---\n. You can change this separator by setting page_separator to the desired string.

In Python:
parser = LlamaParse(
  page_separator="\n=================\n"
)

It's also possible to include the page number within the separator using {pageNumber} in the string. It will be replaced by the page number of the next page.

In Python:
parser = LlamaParse(
  page_separator="\n== {pageNumber} ==\n" # Will transform to "\n== 4 ==\n" to separate page 3 and 4.
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'page_separator="\n== {pageNumber} ==\n"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
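Once a separator containing {pageNumber} is in place, the output can be split back into pages on the client side. The sketch below assumes the separator template contains exactly one {pageNumber} placeholder; `split_pages` is a hypothetical helper, and the sample text is made up for illustration. Because the separator carries the number of the next page, the first chunk has no number attached and is returned with None.

```python
import re


def split_pages(text, template="\n== {pageNumber} ==\n"):
    """Split parsed output into (page_number, page_text) pairs."""
    # Escape the literal parts of the template and capture the number.
    before, after = template.split("{pageNumber}")
    pattern = re.escape(before) + r"(\d+)" + re.escape(after)
    # re.split with one capture group yields:
    # [first_text, number, text, number, text, ...]
    parts = re.split(pattern, text)
    pages = [(None, parts[0])]
    for i in range(1, len(parts), 2):
        pages.append((int(parts[i]), parts[i + 1]))
    return pages


sample = "first page\n== 2 ==\nsecond page\n== 3 ==\nthird page"
for number, body in split_pages(sample):
    print(number, repr(body))
```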

Page prefix and suffix

It's possible to specify a prefix or a suffix to be added to each page. These strings can contain {pageNumber} as well and will be replaced by the current page number. Both parameters are optional and empty by default.

In Python:
parser = LlamaParse(
  page_prefix="START OF PAGE: {pageNumber}\n",
  page_suffix="\nEND OF PAGE: {pageNumber}"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'page_prefix="START OF PAGE: {pageNumber}\n"' \
  --form 'page_suffix="\nEND OF PAGE: {pageNumber}"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
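To preview what a page might look like with a given prefix and suffix before sending a document, the substitution can be approximated locally. This is only a sketch of the behavior described above, not the service's implementation; `preview_page` is a hypothetical helper, and the real output may differ (for example in how it interacts with the page separator).

```python
def preview_page(body, page_number, page_prefix="", page_suffix=""):
    """Approximate how one page would be wrapped by prefix and suffix."""
    # Replace the placeholder with the current page number in both strings.
    prefix = page_prefix.replace("{pageNumber}", str(page_number))
    suffix = page_suffix.replace("{pageNumber}", str(page_number))
    return prefix + body + suffix


print(preview_page(
    "Some page text",
    3,
    page_prefix="START OF PAGE: {pageNumber}\n",
    page_suffix="\nEND OF PAGE: {pageNumber}",
))
```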

Bounding box

Specify the area of a document that you want to parse. This can be helpful for removing headers and footers. To do so, provide the bounding box margins in clockwise order from the top (top, right, bottom, left) as a comma-separated string. Each margin is expressed as a fraction of the page size, a number between 0 and 1.

Examples:

  • To exclude the top 10% of a document: bounding_box="0.1,0,0,0"
  • To exclude the top 10% and bottom 20% of a document: bounding_box="0.1,0,0.2,0"
In Python:
parser = LlamaParse(
  bounding_box="0.1,0,0.2,0"
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'bounding_box="0.1,0,0.2,0"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
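Because the order of the four margins is easy to mix up, it can help to build the string from named values. The following is a minimal sketch; `bounding_box_string` is a hypothetical helper (not part of LlamaParse), and the check that opposite margins sum to less than 1 is an assumption made here to catch empty regions, not a documented rule.

```python
def bounding_box_string(top=0.0, right=0.0, bottom=0.0, left=0.0):
    """Build a bounding_box value: margins clockwise from the top."""
    margins = (top, right, bottom, left)
    for margin in margins:
        if not 0 <= margin <= 1:
            raise ValueError("each margin must be between 0 and 1")
    # Assumption: opposite margins that cover the whole page leave nothing to parse.
    if top + bottom >= 1 or left + right >= 1:
        raise ValueError("margins leave no area to parse")
    # The 'g' format drops trailing zeros: 0.0 -> "0", 0.1 -> "0.1".
    return ",".join(f"{margin:g}" for margin in margins)


print(bounding_box_string(top=0.1, bottom=0.2))  # 0.1,0,0.2,0
```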

Take screenshot

Take a screenshot of each page and add it to the JSON output in the following format:

{
  "images": [
    {
      "name": "page_1.jpg",
      "height": 792,
      "width": 612,
      "x": 0,
      "y": 0,
      "type": "full_page_screenshot"
    }
  ]
}
In Python:
parser = LlamaParse(
  take_screenshot=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'take_screenshot="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
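When reading the JSON output, screenshots can be told apart from other images by their "type" field. The sketch below works on a dictionary shaped like the snippet above; `full_page_screenshots` is a hypothetical helper, and where exactly the "images" list sits in the full JSON output is not covered here.

```python
def full_page_screenshots(result):
    """Return the names of all full-page screenshots in an 'images' list."""
    return [
        image["name"]
        for image in result.get("images", [])
        if image.get("type") == "full_page_screenshot"
    ]


# Same shape as the JSON snippet above.
sample = {
    "images": [
        {
            "name": "page_1.jpg",
            "height": 792,
            "width": 612,
            "x": 0,
            "y": 0,
            "type": "full_page_screenshot",
        }
    ]
}
print(full_page_screenshots(sample))  # ['page_1.jpg']
```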

Disable image extraction

Image extraction can be disabled for better performance by setting disable_image_extraction=true.

In Python:
parser = LlamaParse(
  disable_image_extraction=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'disable_image_extraction="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Extract multiple tables per sheet in spreadsheets

By default, LlamaParse extracts each sheet of a spreadsheet as one table. With spreadsheet_extract_sub_tables=true, LlamaParse will try to identify sheets that contain multiple tables and return them as separate tables.

In Python:
parser = LlamaParse(
  spreadsheet_extract_sub_tables=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'spreadsheet_extract_sub_tables="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'

Output tables as HTML in markdown

A common issue with markdown tables is that they do not handle merged cells well. You can ask LlamaParse to return tables as HTML with colspan and rowspan attributes to get a better representation of the table. When output_tables_as_HTML=true, tables in the markdown output are emitted as HTML tables.

In Python:
parser = LlamaParse(
  output_tables_as_HTML=True
)
Using the API:
curl -X 'POST' \
  'https://api.cloud.llamaindex.ai/api/parsing/upload'  \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: Bearer $LLAMA_CLOUD_API_KEY" \
  --form 'output_tables_as_HTML="true"' \
  -F 'file=@/path/to/your/file.pdf;type=application/pdf'
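HTML tables in the markdown output can be read back with Python's standard library. The sketch below collects rows and repeats a merged cell's text across its colspan; it ignores rowspan for brevity, and the `TableRows` class and the sample table are hypothetical, made up for illustration rather than taken from real LlamaParse output.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table rows, repeating a cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._text = None  # text pieces of the cell being read, or None
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            cell = "".join(self._text).strip()
            if self._row is not None:
                self._row.extend([cell] * self._span)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample_html = (
    '<table><tr><th>Region</th><th colspan="2">Sales</th></tr>'
    "<tr><td>North</td><td>10</td><td>12</td></tr></table>"
)
collector = TableRows()
collector.feed(sample_html)
print(collector.rows)
```

Each row comes back with the same number of columns, which makes the result easier to load into a tabular structure than the raw HTML.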