
A bunch of enhancements for dealing with large tar archives #37

Merged
merged 6 commits into splitbrain:master
Dec 9, 2024

Conversation

zozlak
Contributor

@zozlak commented Dec 7, 2024

This pull request contains the following changes to the Tar class:

  • Adds support for files bigger than 8 GiB (see the note on handling values outside the basic format's range at https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions). The helper functions are implemented as public static methods so they can be tested without the need to create >8 GiB files.
  • Optimizes tar data writing by using a bigger write chunk size (writing 512 bytes at a time kills I/O performance on modern storage) and by avoiding unnecessary pack() calls (only the very last 512-byte block needs padding).
  • Adds the Tar::readCurrentEntry() method, which allows reading a tar entry's content while iterating through the archive with the generator returned by Tar::yieldContents(). This allows efficient inspection of a large tar archive's content without the need to extract it. In my particular case it allows me to check backup consistency with code like
    $chunkSize = 1 << 20; // 1 MiB, whatever
    $tar = new Tar();
    $tar->open('some_archive.tgz');
    foreach ($tar->yieldContents() as $i) {
      $ctx = hash_init('sha256'); // or any other algorithm supported by ext/hash
      while ($chunk = $tar->readCurrentEntry($chunkSize)) {
        hash_update($ctx, $chunk);
      }
      $hash = hash_final($ctx);
      if ($hash !== $refHashTakenFromSomewhereElse) {
        throw new Exception('inconsistency!');
      }
    }
    This code uses constant memory (proportional to $chunkSize only) and reads the archive only once, regardless of the archive's size and format (compressed or not).

In 2001, GNU tar introduced support for large and negative numbers
(https://www.gnu.org/software/tar/manual/html_node/Extensions.html#Extensions).

This is required to handle files bigger than 8 GiB.
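
The gist of that extension: when the high bit of a numeric header field's first byte is set, the field holds a big-endian base-256 value instead of NUL/space-terminated octal ASCII. A minimal sketch of a decoder for the positive case (the helper name decodeTarNumber is hypothetical, not the PR's actual API):

    // Hypothetical helper, not the PR's actual code: decode a tar numeric
    // header field, honouring the GNU base-256 ("binary") extension.
    function decodeTarNumber(string $field): int
    {
        if ((ord($field[0]) & 0x80) !== 0) {
            // Extension: high bit set => big-endian base-256 number in the
            // remaining bits (this sketch covers positive values only).
            $value = ord($field[0]) & 0x7F;
            for ($i = 1, $n = strlen($field); $i < $n; $i++) {
                $value = ($value << 8) | ord($field[$i]);
            }
            return $value;
        }
        // Basic format: space/NUL-terminated octal ASCII.
        return (int) octdec(trim($field, " \0"));
    }

    // E.g. a 12-byte size field for a 16 GiB (2**34 bytes) file:
    // decodeTarNumber("\x80\0\0\0\0\0\0\x04\0\0\0\0") === 2 ** 34

Such a helper can be unit-tested with hand-built byte strings like the one above, which is the point of making the helpers public static methods.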
So far there was no way to read the data of a file in an archive
without extracting it, and extracting a single file required re-reading
the whole archive. This commit changes yieldContents() so that it does
not skip to the next header entry before yielding the current entry.
Instead, the position of the next header entry is remembered and is only
skipped to at the next next() call on the generator. This allows reading
the current entry's content until the next() call. For that, the
Tar::readCurrentEntry() method was added.
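
A minimal sketch of how readCurrentEntry() can work under that scheme, assuming the class tracks the unread byte count of the current entry body in a field such as $entryRemaining (name hypothetical) and reads raw bytes with a low-level helper like the class's readbytes():

    // Illustrative sketch, not the PR's verbatim code.
    public function readCurrentEntry(int $length = 1 << 20): string
    {
        $length = min($length, $this->entryRemaining);
        if ($length < 1) {
            return ''; // entry exhausted; next() skips the remaining padding
        }
        $data = $this->readbytes($length);
        $this->entryRemaining -= strlen($data);
        return $data;
    }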
Tar::addData(): pad only the last block of data and write everything
else with just a single writebytes() call and without pack().
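
A sketch of that write path (variable names illustrative, assuming the 512-byte tar block size):

    // Write the whole payload in one call; pad only the final partial block.
    $this->writebytes($data);
    $remainder = strlen($data) % 512;
    if ($remainder > 0) {
        $this->writebytes(str_repeat("\0", 512 - $remainder));
    }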

Tar::addFile(): move the read chunk size to a class constant.
@splitbrain
Owner

That's pretty cool. Thanks!

@splitbrain merged commit d9d4eaa into splitbrain:master on Dec 9, 2024
8 checks passed