Performance issues when rendering large PDFs #1691

jesusgp22 · 2017-10-26T20:30:08Z

jesusgp22
Oct 26, 2017

This might be a better question for the pdf.js community itself, but how can rendering large PDFs be handled better with react-pdf?

pdf.js suggests not rendering more than 25 pages at a time:
https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#allthepages

I even had to add this to my component to keep React from trying to re-create the virtual DOM of the Document:

    shouldComponentUpdate(nextProps, nextState) {
        if(nextProps.file !== this.props.file
            || nextState.numPages !== this.state.numPages
            || nextState.width !== this.state.width){
            return true
        }
        return false
    }

The problem is that I also need to set the width of the document dynamically on user interaction, so I can't avoid re-creating the virtual DOM after width changes. Is there any way I can achieve this with your lib?

michaeldzjap · 2017-10-27T15:32:25Z

michaeldzjap
Oct 27, 2017

@jesusgp22 You probably want to use some kind of virtualization library for displaying PDFs with a lot of pages, like react-virtualized for instance. Maybe this is useful to you.

jesusgp22 · 2017-10-27T15:38:33Z

jesusgp22
Oct 27, 2017
Author

Hey, thank you so much for your answer, I'll definitely check this out.

You might want to add a note about this to the react-pdf documentation to help others with the same performance issues, or even add this as a core feature for large docs in the future.

jesusgp22 · 2017-10-31T02:11:51Z

jesusgp22
Oct 31, 2017
Author

Following up on this, @michaeldzjap: I have been watching some presentations on react-virtualized, and it looks like it will break the text search feature. Is this a trade-off that I can't get around?

michaeldzjap · 2017-10-31T06:47:27Z

michaeldzjap
Oct 31, 2017

I am not familiar with the text search feature I have to admit. But I suspect that it relies on the text layer for each page to be rendered in order to be able to find all the relevant results for a specific search query (e.g. a word that could be located anywhere in the document). The whole point of virtualizing a collection of elements (Page components in the case of react-pdf) is to not render them all at the same time.

I don't think there is an easy way around this unfortunately. A solution could be to keep a virtual representation of a text layer of each page in memory (like how React does this for HTML elements) and search through that instead. Might be possible.

jesusgp22 · 2017-11-01T13:55:22Z

jesusgp22
Nov 1, 2017
Author

That's an interesting approach. I am guessing this will most likely break the browser's text search feature anyway; in any case, I think it is OK to implement this using just a regular search box element. Now the questions are:

1. How can I extract the text from the PDF to keep a "virtual" copy of the whole text layer that I can search?
2. After getting a list of results from the text, how can I implement a feature to seek these results in the document? (I am guessing I will need to map results to scrollbar coordinates accordingly.)

michaeldzjap · 2017-11-01T19:12:20Z

michaeldzjap
Nov 1, 2017

> How can I extract the text from the pdf to keep a "virtual" copy of the whole text layer I can search from

I think you would need to dig into pdf.js for this, relying on the react-pdf api probably is not enough. You can get the text content for a page using this apparently:

page.getTextContent().then(function(textContent) { ... });

> After getting a list of results from the text how can I implement a feature to seek these features in the document (guessing I will need to map results to scrollbar coordinates accordingly)

Yes, that is a tricky one... You'd know the page number. Maybe it should be a two-step operation or something:

1. Search through the virtual text layers for a query. Keep a result of all pages that match.
2. For each page in the result of step 1, see if it is rendered. If it is, you can probably find the word rather easily, because I think each word is rendered as a separate HTML element in a text layer. If the page is not rendered yet, scroll to it with react-virtualized so that it will be rendered, and then again find the HTML element that contains the first occurrence of the word/query in the text layer element tree.

Something like the above. I might be thinking about this too simplistically; I haven't actually tried it myself. But this is how I would approach things initially, I think.
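
The first step (searching an in-memory copy of the text) could look roughly like the sketch below. It assumes the text of each page has already been extracted into an array of strings, with index 0 holding page 1. `findMatches` and the shape of its result are made up for illustration and are not part of react-pdf or pdf.js.

```javascript
// Hypothetical helper: case-insensitive search over pre-extracted page text.
// pageTexts[i] holds the plain text of page i + 1.
function findMatches(pageTexts, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return [];

  const matches = [];
  pageTexts.forEach((text, index) => {
    const haystack = text.toLowerCase();
    // Offsets refer to positions in the lowercased page text.
    let offset = haystack.indexOf(needle);
    while (offset !== -1) {
      matches.push({ pageNumber: index + 1, offset });
      offset = haystack.indexOf(needle, offset + needle.length);
    }
  });
  return matches;
}

// Pages that contain at least one hit, in document order.
const matches = findMatches(['Hello world', 'Nothing here', 'WORLD of worlds'], 'world');
const pagesToVisit = [...new Set(matches.map((match) => match.pageNumber))];
// pagesToVisit is [1, 3]
```

The page numbers are what a virtualized list would need for scrolling; locating the exact element inside a rendered text layer is a separate problem that this sketch does not cover.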

jesusgp22 · 2017-11-01T21:37:57Z

jesusgp22
Nov 1, 2017
Author

I was wondering whether the biggest performance issue is rendering the text layer or the canvas. If rendering the canvas is the issue, might it be possible to ask pdf.js to render only the text layer?
I know this is not possible with the current react-pdf API.

wojtekmaj · 2017-11-04T14:57:17Z

wojtekmaj
Nov 4, 2017
Maintainer

@jesusgp22 Nope, you can toggle text content and annotations on/off, but canvas is currently not behind a flag. I don't see a good reason against having it toggleable, though :)

wojtekmaj · 2017-11-04T15:07:24Z

wojtekmaj
Nov 4, 2017
Maintainer

> I think you would need to dig into pdf.js for this, relying on the react-pdf api probably is not enough.

@michaeldzjap Any reason for this? Document's onLoadSuccess should return all pdf properties and methods, and Page's onLoadSuccess should return all page properties and methods.

If you use Document, you can get the number of pages, iterate through all of them with pdf.getPage(pageNumber) and run the code you pasted on getPage()'s results.
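
A minimal sketch of that loop, assuming `pdf` is the document object received in Document's onLoadSuccess. `extractPageTexts` is a made-up name; `numPages`, `getPage()` and `getTextContent()` are the pdf.js calls already mentioned in this thread.

```javascript
// Collect the plain text of every page through the pdf.js document object.
async function extractPageTexts(pdf) {
  const texts = [];
  // Pages are processed one at a time rather than all in parallel,
  // which keeps the amount of in-flight work small for large files.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber += 1) {
    const page = await pdf.getPage(pageNumber);
    const textContent = await page.getTextContent();
    // Each item is a run of text; items without a string `str` are skipped.
    texts.push(
      textContent.items
        .map((item) => item.str)
        .filter((str) => typeof str === 'string')
        .join(' '),
    );
  }
  return texts;
}
```

The resulting array (index 0 is page 1) can be kept in component state and searched without any page being rendered.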

michaeldzjap · 2017-11-04T15:22:39Z

michaeldzjap
Nov 4, 2017

@wojtekmaj Yes, my wording was rather poor. What I meant is that pdf.getPage(), page.getTextContent() etc. all are pdf.js related rather than react-pdf specific. So although of course all those methods are perfectly well accessible through the react-pdf API, they really belong to the underlying pdf.js API.

> If you use Document, you can get the number of pages, iterate through all of them with pdf.getPage(pageNumber) ...

Yes. This is exactly what I do to cache all document page widths and heights on initial load when using react-pdf together with react-virtualized.
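
Caching the dimensions could be sketched as below. This is not the code actually used; `collectPageSizes` is a hypothetical name, and the object form of `getViewport` is the one used by recent pdf.js versions (older versions took the scale as a bare number).

```javascript
// Hypothetical helper: read every page's size once so a virtualized list
// can compute row heights without rendering the pages.
async function collectPageSizes(pdf, scale = 1) {
  const sizes = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber += 1) {
    const page = await pdf.getPage(pageNumber);
    // Recent pdf.js: getViewport({ scale }); older releases: getViewport(scale).
    const { width, height } = page.getViewport({ scale });
    sizes.push({ pageNumber, width, height });
  }
  return sizes;
}
```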

jesusgp22 · 2017-11-06T15:54:45Z

jesusgp22
Nov 6, 2017
Author

Thank you both for this amazing discussion 👍

MarcoNicolodi · 2017-11-16T20:08:30Z

MarcoNicolodi
Nov 16, 2017

We are also having trouble loading long PDFs. We are loading a 17 MB PDF and the application crashes, and since we have customers with 100 MB+ PDFs, crashing is not an option.

This example, which is also a React wrapper for PDF.js, seems to work for us. It tricks PDF.js into loading only the currently visible page and the ten previous pages. It looks like it has something to do with the wrapper div's styles, because when you change some of the styles it loses its lazy-loading behaviour.

I couldn't reproduce this trick with your lib. But we like react-pdf so much that we are still trying to adapt this lazy-load trick to it.

We like the fact that your lib has no default toolbox and that it has mapped its props to pdf.js handlers/configs, so we can develop our customized toolbox.

So we would be glad to see it working better with long PDFs, maybe using this trick that yurydelendik/pdfjs-react uses (it's a shame that I couldn't reproduce it with your lib!)

jesusgp22 · 2017-11-17T17:07:40Z

jesusgp22
Nov 17, 2017
Author

@MarcoNicolodi I found that react-virtualized worked really badly with react-pdf. I implemented the approach of rendering only a few pages, but to make things work you have to render a div that has the dimensions of each page you don't render.

You can 100% integrate this with react-pdf using the document object that is returned by react-pdf, and use the getPage and page.getViewport methods to get the page dimensions.

I built my own algorithm to detect which pages are visible, and I run it every time the user scrolls or a resize event happens.
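
The actual algorithm was not shared in this thread. One possible version, assuming the page heights are already known in pixels and the pages are stacked vertically with no gaps, is sketched below; `getVisiblePages` and its `overscan` parameter are hypothetical.

```javascript
// Hypothetical helper: which pages intersect the scroll viewport?
// pageHeights[i] is the rendered height of page i + 1 in pixels.
function getVisiblePages(pageHeights, scrollTop, viewportHeight, overscan = 1) {
  const viewportBottom = scrollTop + viewportHeight;
  let top = 0;
  let first = -1;
  let last = -1;

  pageHeights.forEach((height, index) => {
    const bottom = top + height;
    // A page is visible when its vertical extent overlaps the viewport.
    if (bottom > scrollTop && top < viewportBottom) {
      if (first === -1) first = index;
      last = index;
    }
    top = bottom;
  });

  if (first === -1) return [];

  // Also include a few neighbours so scrolling does not reveal blank pages.
  const start = Math.max(0, first - overscan);
  const end = Math.min(pageHeights.length - 1, last + overscan);
  const visible = [];
  for (let index = start; index <= end; index += 1) visible.push(index + 1);
  return visible;
}
```

Pages in the returned list would be rendered as Page components; every other page would be rendered as an empty div with the cached width and height, so the scrollbar keeps its size.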

wojtekmaj · 2017-11-17T21:01:26Z

wojtekmaj
Nov 17, 2017
Maintainer

Hey everyone,
I'd like to remind you that it was never React-PDF's intention to provide users with a fully-fledged PDF reader. Instead, it is only a tool to build one. While I do plan to create a React-PDF-based PDF reader, I'm far from it. Mozilla has been working on theirs for years and they never seem to be done. I think mine would go a similar way ;)

There is some good news too, though. If I can suggest something, the onRenderSuccess callback that you can define for <Page> components can be a powerful friend. You can use it to, for example, force pages to be rendered one by one:

import React, { Component } from 'react';
import { Document, Page } from 'react-pdf/build/entry.webpack';

import './Sample.less';

export default class Sample extends Component {
  state = {
    file: './test.pdf',
    numPages: null,
    pagesRendered: null,
  }

  onDocumentLoadSuccess = ({ numPages }) =>
    this.setState({
      numPages,
      pagesRendered: 0,
    });

  onRenderSuccess = () =>
    this.setState(prevState => ({
      pagesRendered: prevState.pagesRendered + 1,
    }));

  render() {
    const { file, numPages, pagesRendered } = this.state;

    /**
     * The amount of pages we want to render now. Always 1 more than already rendered,
     * no more than total amount of pages in the document.
     */
    const pagesRenderedPlusOne = Math.min(pagesRendered + 1, numPages);

    return (
      <div className="Example">
        <header>
          <h1>react-pdf sample page</h1>
        </header>
        <div className="Example__container">
          <div className="Example__container__document">
            <Document
              file={file}
              onLoadSuccess={this.onDocumentLoadSuccess}
            >
              {
                Array.from(
                  new Array(pagesRenderedPlusOne),
                  (el, index) => {
                    const isCurrentlyRendering = pagesRenderedPlusOne === index + 1;
                    const isLastPage = numPages === index + 1;
                    const needsCallbackToRenderNextPage = isCurrentlyRendering && !isLastPage;

                    return (
                      <Page
                        key={`page_${index + 1}`}
                        onRenderSuccess={
                          needsCallbackToRenderNextPage ? this.onRenderSuccess : null
                        }
                        pageNumber={index + 1}
                      />
                    );
                  },
                )
              }
            </Document>
          </div>
        </div>
      </div>
    );
  }
}

Of course you can do much more - add placeholders, check on scroll which pages need rendering, keep info on whether all pages so far were rendered... I believe in your creativity ;) And if I can be of any help regarding API, please let me know!

MarcoNicolodi · 2017-11-20T11:41:55Z

MarcoNicolodi
Nov 20, 2017

@jesusgp22

Hey, could you share this example?

zhoumy96 · 2022-11-09T03:23:25Z

zhoumy96
Nov 9, 2022

this is a demo for large pdfs: https://github.com/zhoumy96/react-pdf-large-files
Thanks ngoclinhng #94 (comment)

wojtekmaj · 2022-11-09T08:45:48Z

wojtekmaj
Nov 9, 2022
Maintainer

> I am a bit confused why the PDF.js viewer from Mozilla (https://mozilla.github.io/pdf.js/web/viewer.html) can load large PDF instantly, can zoom instantly, and you can scroll through the pages with minimal buffering. While using this library as is without performance optimization, large PDF's take at least 30 seconds to load, and I can't zoom at all because it makes the webpage freeze.

"without performance optimization" is the key here. React-PDF is NOT a PDF viewer - it is only a tool to build one. If you want to browse 100 page PDFs, you need to take similar precautions as if you were trying to open 100 images at once, or 100 videos, or whatever. You wouldn't open them all at once, would you?

> It would be VERY helpful if you can start viewing a PDF without downloading the entire file first.

You can, as long as the Range header is supported by the server you're serving the content from.
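
One way to check this for a given URL is to request a single byte and look at the status code. The sketch below is only a probe and not something react-pdf or pdf.js provides; `supportsRangeRequests` is a made-up name, and `fetchImpl` exists so the function can be tested without a network. For cross-origin URLs the server's CORS configuration matters as well.

```javascript
// Hypothetical probe: ask for the first byte only. A server that honours the
// Range header answers with 206 Partial Content instead of 200 OK.
async function supportsRangeRequests(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { headers: { Range: 'bytes=0-0' } });
  const partial = response.status === 206;
  // Only the status matters here, so stop downloading the body if we can.
  if (response.body && typeof response.body.cancel === 'function') {
    await response.body.cancel();
  }
  return partial;
}
```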

wojtekmaj · 2022-11-09T08:46:19Z

wojtekmaj
Nov 9, 2022
Maintainer

@zhoumy96 Good example. Rendering pages only when they are actually needed is key to a performant PDF viewer.

wojtekmaj · 2022-11-09T09:28:35Z

wojtekmaj
Nov 9, 2022
Maintainer

Here's my take on hooking React-PDF to React-Window.

https://codesandbox.io/s/react-pdf-react-window-x3xzzg
https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0

Moebits · 2022-11-09T14:26:59Z

Moebits
Nov 9, 2022

> > It would be VERY helpful if you can start viewing a PDF without downloading the entire file first.
>
> You can, as long as Range header is supported by the server you're serving the content from.

Would you mind giving an example of how to do this? I tried setting options={{disableAutoFetch: true, disableStream: true}} on the Document, but it seems like it has no effect. It is still downloading the whole file before it displays anything.

My PDF files are hosted in an AWS S3 bucket which does support range requests to my knowledge.

wojtekmaj · 2022-11-09T18:36:57Z

wojtekmaj
Nov 9, 2022
Maintainer

Hmm, not sure about that. I'm pretty sure PDF.js will request only as much data as needed, if it's possible, e.g. when you only want to display Page 1. If it's not happening, it's on PDF.js side. There may be something else that I don't know about that might prevent partial download from happening, e.g. PDF built in a specific way or something.

EricLiu0614 · 2022-11-10T15:01:56Z

EricLiu0614
Nov 10, 2022

> Here's my take on hooking React-PDF to React-Window.
>
> https://codesandbox.io/s/react-pdf-react-window-x3xzzg https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0

@wojtekmaj Thank you for providing the great demo.

When I implemented it in my application and loaded a large PDF file, I noticed that memory keeps increasing as I load further pages or switch between pages, and it is not released until I close the browser tab. Any idea how we can optimize it? Thank you!

Moebits · 2022-11-14T00:18:50Z

Moebits
Nov 14, 2022

> Hmm, not sure about that. I'm pretty sure PDF.js will request only as much data as needed, if it's possible, e.g. when you only want to display Page 1. If it's not happening, it's on PDF.js side. There may be something else that I don't know about that might prevent partial download from happening, e.g. PDF built in a specific way or something.

Ok, I figured out what the problem here was: The PDF files have to be "linearized", which means that they are saved in a way so that the file can be requested in chunks.

On a Mac, I just opened the PDF in Preview, reordered a page and put it back (otherwise it doesn't save if no changes are made), and hit File -> Save. It should save linearized by default. On Windows you will probably have to find a third party app to do it (like Acrobat).

I hope that helps anyone that was having the same issue.
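
To check whether a file was saved this way, one heuristic is to look for the `/Linearized` marker near the start of the file. The sketch below assumes, as far as I understand the PDF specification, that a linearized file declares this dictionary within its first 1024 bytes; `looksLinearized` is a made-up helper and does not validate the file.

```javascript
// Heuristic only: a linearized PDF should declare a /Linearized dictionary
// near the beginning of the file. `bytes` is a Uint8Array of the file start.
function looksLinearized(bytes) {
  const head = bytes.subarray(0, 1024);
  // Map each byte to one character so offsets stay byte-accurate.
  let text = '';
  for (const byte of head) text += String.fromCharCode(byte);
  return text.includes('/Linearized');
}
```

In Node.js the bytes could come from reading the file into a Buffer (which is a Uint8Array); in the browser, from the first kilobyte of an ArrayBuffer.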

HadesShadows · 2023-01-12T11:09:08Z

HadesShadows
Jan 12, 2023

> Here's my take on hooking React-PDF to React-Window.
>
> https://codesandbox.io/s/react-pdf-react-window-x3xzzg https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0

This is working perfectly. Thank you so much!

EDIT: Any help making the navigation buttons work in this code?

EDIT #2: The solution for going to a specific page is given in the react-window documentation. Previous page and next page can be done similarly.
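
A sketch of what that could look like, assuming react-window v1, where list components expose a `scrollToItem(index, align)` method through a ref. `scrollToPage` is a made-up helper, and `listRef` is assumed to be the ref passed to the list in the sandbox code.

```javascript
// Hypothetical helper around a react-window (v1) list ref.
// Returns the page that was scrolled to, or null if nothing happened.
function scrollToPage(listRef, pageNumber, numPages) {
  if (!listRef.current || !numPages) return null;
  // Clamp so "previous" on page 1 and "next" on the last page stay in range.
  const target = Math.min(Math.max(pageNumber, 1), numPages);
  // List items are zero-indexed; 'start' aligns the page with the top edge.
  listRef.current.scrollToItem(target - 1, 'start');
  return target;
}
```

Previous and next buttons would then call `scrollToPage(listRef, currentPage - 1, numPages)` and `scrollToPage(listRef, currentPage + 1, numPages)`, with `currentPage` tracked in state.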

ibweb3dev · 2023-02-02T19:49:30Z

ibweb3dev
Feb 2, 2023

> > Here's my take on hooking React-PDF to React-Window.
> > https://codesandbox.io/s/react-pdf-react-window-x3xzzg https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0
>
> This is working perfectly. Thankyou so much
>
> EDIT: Any help to make the navigation button work in these codes?
>
> EDIT #2: So the solution to going to specific page is given by react-window documents React-window . Previous Page and next page can also be done similarly

I think it still loads the full PDF.

You can see this if you pass onLoadProgress to the Document (<Document ... onLoadProgress={onDocumentLoadProgress}>):

  function onDocumentLoadProgress({ loaded, total }) {
    // Percentage of the file downloaded so far.
    const tot = Math.round((loaded / total) * 100);

    console.log(tot);
  }

The logs will display the total load.

srinadh239 · 2023-07-21T11:54:35Z

srinadh239
Jul 21, 2023

> > Here's my take on hooking React-PDF to React-Window.
> > https://codesandbox.io/s/react-pdf-react-window-x3xzzg https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0
>
> @wojtekmaj Thank you for providing the great demo.
>
> When I try to implement it in my application and load a large pdf file I notice the memory is keep increasing when I keep loading following pages or switch between pages. And the memory are not released until I close the browser tab.. Any idea how can we optimize it? Thank you!

@wojtekmaj @EricLiu0614 Any update on this? I had a similar problem: when I play around with scrolling on huge PDFs, there seems to be a memory leak, which causes the page to crash.
Here is the sample PDF I used, where memory went up to 5 GB:
https://codesandbox.io/s/react-pdf-react-window-forked-jp5w5x?file=/src/App.js

admondtamang · 2023-08-25T04:40:20Z

admondtamang
Aug 25, 2023

> > > It would be VERY helpful if you can start viewing a PDF without downloading the entire file first.
> >
> > You can, as long as Range header is supported by the server you're serving the content from.
>
> Would you mind giving an example of how to do this? I tried setting options={{disableAutoFetch: true, disableStream: true}} on the Document, but it seems like it has no effect. It is still downloading the whole file before it displays anything.
>
> My PDF files are hosted in an AWS S3 bucket which does support range requests to my knowledge.

By default, the browser does not expose the Accept-Ranges and Content-Range response headers to cross-origin scripts. When these two headers are hidden, pdf.js mistakenly concludes that the server does not support range requests and requests the entire file directly.

My CORS configuration for the S3 bucket:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "PUT",
            "GET",
            "POST",
            "HEAD"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [
            "Accept-Ranges",
            "Content-Length",
            "Content-Range"
        ]
    }
]

Expose these headers so that react-pdf knows about the headers it needs in order to stream.
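
To see why hidden headers matter, the sketch below approximates the kind of check pdf.js makes before it switches to range requests, as far as I understand it. It is not pdf.js's actual code: the function name, the header-getter parameter, and the 64 KiB chunk size behind the default `minLength` are assumptions for illustration.

```javascript
// Rough approximation (an assumption, not pdf.js source) of deciding whether
// range requests can be used, based only on headers the script can read.
// getHeader(name) should return the header value, or null when it is hidden.
function canUseRangeRequests(getHeader, minLength = 2 * 65536) {
  // Without a readable "Accept-Ranges: bytes" the server looks unsupported.
  if (getHeader('Accept-Ranges') !== 'bytes') return false;
  // Compressed responses make byte offsets meaningless.
  const encoding = getHeader('Content-Encoding') || 'identity';
  if (encoding !== 'identity') return false;
  // Small files are not worth splitting into chunks.
  const length = parseInt(getHeader('Content-Length'), 10);
  return Number.isInteger(length) && length > minLength;
}
```

In the browser this could be called as `canUseRangeRequests((name) => response.headers.get(name))` on a fetch response; for a cross-origin response, a header that is not exposed comes back as null, which makes the check fail even though the server supports ranges.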

ccasper89 · 2023-09-06T14:11:23Z

ccasper89
Sep 6, 2023

> > > Here's my take on hooking React-PDF to React-Window.
> > > https://codesandbox.io/s/react-pdf-react-window-x3xzzg https://codesandbox.io/s/react-pdf-react-window-fullscreen-ky4yy0
> >
> > @wojtekmaj Thank you for providing the great demo.
> > When I try to implement it in my application and load a large pdf file I notice the memory is keep increasing when I keep loading following pages or switch between pages. And the memory are not released until I close the browser tab.. Any idea how can we optimize it? Thank you!
>
> @wojtekmaj @EricLiu0614 Any update on this. Even I had a similar problem, when i play around with scroll on huge pdfs, there seems to be a memory leak, which is causing page crash. Sample PDF I used, where the memory went up to 5 gb https://codesandbox.io/s/react-pdf-react-window-forked-jp5w5x?file=/src/App.js

The memory leak seems to be caused by CodeSandbox and not by the react-pdf + react-window example. Did you try running the example locally?

jkgenser · 2024-06-29T14:38:37Z

jkgenser
Jun 29, 2024

I have published the following library for rendering large PDFs. It's not yet in 1.0 but it's useful for rendering very large PDFs and lazily loading each pdf.

https://github.com/jkgenser/react-pdf-headless

This is also published to NPM: https://www.npmjs.com/package/react-pdf-headless

The DEMO app is a full example of this library being used: https://github.com/jkgenser/react-pdf-headless/blob/main/demo/App.tsx

It only depends on react-pdf and @tanstack/virtual
