
A bunch of enhancements for dealing with large tar archives #37

Merged
merged 6 commits into splitbrain:master
Dec 9, 2024

Conversation

zozlak
Contributor

@zozlak commented Dec 7, 2024

This pull request contains the following changes to the Tar class:

  • Adds support for files bigger than 8 GiB (see the note on handling values outside the basic format's range at https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions). The helper functions are implemented as public static methods so they can be tested without the need to create >8 GiB files.
  • Optimizes tar data writing by using a bigger write chunk size (writing 512 bytes at a time kills I/O performance on modern storage) and by avoiding unnecessary pack() calls (only the very last 512-byte block needs padding).
  • Adds the Tar::readCurrentEntry() method, which allows reading a tar entry's content while iterating through the archive with the generator returned by Tar::yieldContents(). This allows efficient inspection of a large tar archive's content without the need to extract it. In my particular case it allows me to check backup consistency with code like
    $chunkSize = 1 << 20; // 1 MiB, whatever
    $tar = new Tar();
    $tar->open('some_archive.tgz');
    foreach ($tar->yieldContents() as $i) {
      $ctx = hash_init('sha256'); // or any other algorithm supported by ext/hash
      while ($chunk = $tar->readCurrentEntry($chunkSize)) {
        hash_update($ctx, $chunk);
      }
      $hash = hash_final($ctx);
      if ($hash !== $refHashTakenFromSomewhereElse) {
        throw new Exception('inconsistency!');
      }
    }
    This code uses constant memory (proportional to $chunkSize only) and reads the archive only once, regardless of the archive's size and format (compressed or not).

In 2001, GNU tar introduced support for large and negative numbers
(https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions).

This is required to handle files bigger than 8 GiB.
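
The gist of that extension: when the high bit of a numeric header field's first byte is set, the field holds a big-endian base-256 value instead of NUL/space-terminated octal ASCII. A minimal sketch of a decoder for the positive case (the helper name decodeTarNumber is hypothetical, not the PR's actual API):

    // Hypothetical helper, not the PR's actual code: decode a tar numeric
    // header field, honouring the GNU base-256 ("binary") extension.
    function decodeTarNumber(string $field): int
    {
        if ((ord($field[0]) & 0x80) !== 0) {
            // Extension: high bit set => big-endian base-256 number in the
            // remaining bits (this sketch covers positive values only).
            $value = ord($field[0]) & 0x7F;
            for ($i = 1, $n = strlen($field); $i < $n; $i++) {
                $value = ($value << 8) | ord($field[$i]);
            }
            return $value;
        }
        // Basic format: space/NUL-terminated octal ASCII.
        return (int) octdec(trim($field, " \0"));
    }

    // E.g. a 12-byte size field for a 16 GiB (2**34 bytes) file:
    // decodeTarNumber("\x80\0\0\0\0\0\0\x04\0\0\0\0") === 2 ** 34

Such a helper can be unit-tested with hand-built byte strings like the one above, which is the point of making the helpers public static methods.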
So far there was no way to read the data of a file in an archive
without extracting it, and extracting a single file required re-reading
the whole archive. This commit changes yieldContents() so that it does
not skip to the next header entry before yielding the current entry.
Instead, the position of the next header entry is remembered and is only
skipped to at the next next() call on the generator. This allows reading
the current entry's content until the next() call. For that, the
Tar::readCurrentEntry() method was added.
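
A minimal sketch of how readCurrentEntry() can work under that scheme, assuming the class tracks the unread byte count of the current entry body in a field such as $entryRemaining (name hypothetical) and reads raw bytes with a low-level helper like the class's readbytes():

    // Illustrative sketch, not the PR's verbatim code.
    public function readCurrentEntry(int $length = 1 << 20): string
    {
        $length = min($length, $this->entryRemaining);
        if ($length < 1) {
            return ''; // entry exhausted; next() skips the remaining padding
        }
        $data = $this->readbytes($length);
        $this->entryRemaining -= strlen($data);
        return $data;
    }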
Tar::addData(): pad only the last block of data and write everything
else with just a single writebytes() call and without pack().
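
A sketch of that write path (variable names illustrative, assuming the 512-byte tar block size):

    // Write the whole payload in one call; pad only the final partial block.
    $this->writebytes($data);
    $remainder = strlen($data) % 512;
    if ($remainder > 0) {
        $this->writebytes(str_repeat("\0", 512 - $remainder));
    }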

Tar::addFile(): move the read chunk size to a class constant.
@splitbrain
Owner

That's pretty cool. Thanks!

@splitbrain merged commit d9d4eaa into splitbrain:master on Dec 9, 2024
8 checks passed