Skip to content

Commit

Permalink
Documentation: reorders parameter listing for PyMuPDF4LLM API.
Browse files Browse the repository at this point in the history
  • Loading branch information
jamie-lemon committed Aug 26, 2024
1 parent 1add0be commit cec5119
Showing 1 changed file with 10 additions and 8 deletions.
18 changes: 10 additions & 8 deletions docs/pymupdf4llm/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,14 +42,6 @@ The |PyMuPDF4LLM| API
* `(top, bottom)` yields `(0, top, 0, bottom)`.
* To always read full pages, use `margins=0`.

:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.

:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.

:arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.

:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Typically, scientific documents or pages simulating text using graphics commands may contain tens of thousands of these objects. As vector graphics are used for table detection mainly, analyzing pages of this kind may result in excessive runtimes. You can exclude problematic pages via `graphics_limit=5000`. The respective pages will then be ignored and be represented by one message line in the output text.

:arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure:

- **"metadata"** - a dictionary consisting of the document's metadata :attr:`Document.metadata`, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number).
Expand All @@ -64,6 +56,16 @@ The |PyMuPDF4LLM| API

- **"text"** - page content as |Markdown| text.

:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.

:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.

:arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.

:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Typically, scientific documents or pages simulating text using graphics commands may contain tens of thousands of these objects. As vector graphics are used for table detection mainly, analyzing pages of this kind may result in excessive runtimes. You can exclude problematic pages via `graphics_limit=5000`. The respective pages will then be ignored and be represented by one message line in the output text.



:returns: Either a string of the combined text of all selected document pages or a list of dictionaries.

.. method:: LlamaMarkdownReader(*args, **kwargs)
Expand Down

0 comments on commit cec5119

Please sign in to comment.