refactor(archive): A clean implementation of Tar #4538

Closed
wants to merge 6 commits

Conversation

@BlackAsLight (Contributor) commented Mar 31, 2024

Nobody asked for this, but since @std/archive was still listed as unstable, I thought I'd have a go at rewriting it based on this spec. The tests for this new implementation will need to be rewritten to accommodate the changes, but I won't have a crack at that unless you actually want this.

Notes

Some things to note about this implementation.

  1. At the moment it only supports untarring files. Anything else will be skipped over.
  2. Unlike the current implementation on JSR, the ustar spec doesn't actually require names and prefixes to contain only ASCII letters, so this implementation also supports non-ASCII characters.
  3. While the spec's default file size limit is 8 GiB, many modern implementations support unarchiving files of up to 64 GiB. This implementation not only supports extracting such files, but also supports creating them, provided the user supplies an extra argument.
  4. This rewrite has no dependencies and is several lines shorter than the existing one. The code is also completely original and not a fork of somebody else's. I noticed in the license that the current one is a fork.
  5. This implementation doesn't let the user simply provide a path to a file on the system. Instead it requires the user to give it an Iterable or AsyncIterable. This design choice allows the implementation to work in environments besides Deno, such as the browser (see the sketch after this list).
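
A minimal sketch of point 5, reusing the append() signature from the usage examples below: because Tar accepts any Iterable or AsyncIterable, the bytes don't have to come from the file system at all.

const data = new TextEncoder().encode("hello world");
const tar = new Tar();
tar.append({
  pathname: "hello.txt",
  size: data.length,
  iterable: [data], // a plain in-memory iterable works in any runtime
});
tar.close();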

Usage Examples

Creating a Tarball

The piping of the readable stream can happen at any stage here. Calling tar.close() must happen once the user is finished appending files to the instance.

// Import path is an assumption; the PR doesn't state a final module path.
import { Tar } from "@std/archive";

const tar = new Tar();

// Describe where the file's bytes come from and how many there are.
tar.append({
  pathname: "deno.json",
  size: (await Deno.stat("deno.json")).size,
  iterable: (await Deno.open("deno.json")).readable,
});
tar.close();

// Compress the archive and write it to disk.
await tar.readable
  .pipeThrough(new CompressionStream("gzip"))
  .pipeTo((await Deno.create("archive.tar.gz")).writable);

Extracting a Tarball

It is important to note that when unarchiving, one must either consume the entirety of each entry's ReadableStream or cancel it (entry.readable.cancel()) before the next entry will resolve.

// Import path is an assumption, as above.
import { UnTar } from "@std/archive";

await Deno.mkdir("archive/");
for await (
  const entry of new UnTar(
    (await Deno.open("archive.tar.gz")).readable
      .pipeThrough(new DecompressionStream("gzip")),
  )
) {
  // Await so each entry is fully consumed before the next one resolves.
  await entry.readable.pipeTo((await Deno.create("archive/" + entry.pathname)).writable);
}
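
A hedged sketch of the cancel path just described: entries that aren't wanted are cancelled so iteration can continue (the .json filter is purely illustrative).

for await (
  const entry of new UnTar(
    (await Deno.open("archive.tar.gz")).readable
      .pipeThrough(new DecompressionStream("gzip")),
  )
) {
  if (entry.pathname.endsWith(".json")) {
    await entry.readable.pipeTo((await Deno.create("archive/" + entry.pathname)).writable);
  } else {
    await entry.readable.cancel(); // discard this entry so the next can resolve
  }
}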

It should also be noted that while compression is optional, it is very easy for the user to add. This implementation lets the user stream the creation and extraction of tarballs without anything ever hitting the file system or creating memory spikes, granted that the size of each iterable is known up front.
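
For instance, a minimal sketch of that fully streaming path (nothing below is buffered to disk; the compressed archive flows straight to stdout):

const bytes = new TextEncoder().encode("a,b\n1,2\n");
const tar = new Tar();
tar.append({ pathname: "data.csv", size: bytes.length, iterable: [bytes] });
tar.close();
await tar.readable
  .pipeThrough(new CompressionStream("gzip"))
  .pipeTo(Deno.stdout.writable);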

Anyway, I'd like to know what you think. What you do and don't like.

Edit

TarStream

// Import path is an assumption; the PR exports TarStream.
import { TarStream } from "@std/archive";

await ReadableStream.from([
  {
    pathname: "./data.csv",
    size: (await Deno.stat("./data.csv")).size,
    iterable: (await Deno.open("./data.csv")).readable,
  },
  {
    pathname: "./deno.json",
    size: (await Deno.stat("./deno.json")).size,
    iterable: (await Deno.open("./deno.json")).readable,
  },
  {
    pathname: "./deno.lock",
    size: (await Deno.stat("./deno.lock")).size,
    iterable: (await Deno.open("./deno.lock")).readable,
  },
])
  .pipeThrough(new TarStream())
  .pipeThrough(new CompressionStream("gzip"))
  .pipeTo((await Deno.create("archive.tar.gz")).writable);

UnTarStream

// Import path is an assumption, as above.
import { UnTarStream } from "@std/archive";

await Deno.mkdir("archive/");
for await (
  const entry of (await Deno.open("archive.tar.gz")).readable
    .pipeThrough(new DecompressionStream("gzip"))
    .pipeThrough(new UnTarStream())
) {
  await entry.readable.pipeTo((await Deno.create("archive/" + entry.pathname)).writable);
}

@BlackAsLight requested a review from kt3k as a code owner, March 31, 2024 07:28
@CLAassistant commented Mar 31, 2024

CLA assistant check
All committers have signed the CLA.

- Incoming chunks of size less than 512 were causing the header to not be aligned properly.
- ReadableStreams seem to call `pull` upon initiation, so if the cancel method was called, an extra record was being discarded.
- I honestly don't understand how that push property wasn't causing issues before. I guess the chunks I got from Deno.open were always divisible by 512.
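
A hedged sketch (not the PR's actual code) of the alignment issue the first point describes: tar is parsed in 512-byte records, so chunks of arbitrary size have to be re-buffered before headers can be read at the right offsets.

function recordAligner(): TransformStream<Uint8Array, Uint8Array> {
  let buffer = new Uint8Array(0);
  return new TransformStream({
    transform(chunk, controller) {
      // Join leftover bytes with the new chunk.
      const joined = new Uint8Array(buffer.length + chunk.length);
      joined.set(buffer);
      joined.set(chunk, buffer.length);
      // Emit complete 512-byte records; keep the remainder for next time.
      let offset = 0;
      while (joined.length - offset >= 512) {
        controller.enqueue(joined.slice(offset, offset + 512));
        offset += 512;
      }
      buffer = joined.slice(offset);
    },
  });
}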
@iuioiua (Contributor) left a comment

Can you please run deno fmt and ensure the @example code snippets correctly import symbols? See other code snippets in the codebase for examples.

@iuioiua (Contributor) commented Apr 2, 2024

PTAL @crowlKats

@BlackAsLight (Contributor, Author) commented Apr 2, 2024

Can you please run deno fmt and ensure the @example code snippets correctly import symbols? See other code snippets in the codebase for examples.

Sure, but if this is going to be accepted then there are some other things I'd like to implement, as I realised earlier today that it probably should have them. For example:

  • it should probably actually make use of the checksum when detecting headers (see the sketch after this list).
    • it should also probably have some handling for the end of the stream. I have no idea why, but for some reason this code doesn't complain when it hits those 1024 null bytes (the two zero-filled records that terminate a tarball).
  • it should probably throw an error if it hits the end of a stream earlier than expected.
  • it should probably not allow the same file to be appended twice.
  • it should support decoding directories and other types of typeflags, even if it doesn't support creating them.
  • it should let the user optionally provide mtime, etc. for the file headers.
    • if you know what some good defaults for these are, then I'd very much be open to hearing them.
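
A minimal sketch of that checksum validation, assuming the standard ustar rule: the checksum is the unsigned sum of all 512 header bytes, with the 8-byte checksum field (offsets 148–155) counted as ASCII spaces.

function isChecksumValid(header: Uint8Array): boolean {
  let sum = 0;
  for (let i = 0; i < 512; ++i) {
    // The checksum field itself is counted as if filled with spaces (0x20).
    sum += i >= 148 && i < 156 ? 32 : header[i];
  }
  // The stored checksum is an octal string inside the field.
  const stored = parseInt(new TextDecoder().decode(header.subarray(148, 156)), 8);
  return sum === stored;
}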

Also, if this thing about blobs were implemented in Deno, then I'd like to use it in the Tar class to make the size property optional. I think that would be a big advantage in situations where a file is being generated and you don't actually know its size until the end; the data would just be temporarily written to disk behind the scenes, in a way that's compatible with other runtimes.

@kt3k (Member) commented Apr 2, 2024

This seems like a breaking change. We need to deprecate the existing API first.

Also, this seems to conflict with #1985. I think we first need to discuss the direction to take in general.

@crowlKats self-requested a review, April 2, 2024 09:46
@iuioiua (Contributor) commented Apr 3, 2024

@BlackAsLight, is it possible to make this a non-breaking change (i.e. pass with existing tests)?

@BlackAsLight (Contributor, Author)

@BlackAsLight, is it possible to make this a non-breaking change (i.e. pass with existing tests)?

@iuioiua One of the things that's incompatible between this implementation and the current one is that this one doesn't allow the user to provide a path to a file on the file system and have the Tar instance obtain an iterable from it. My preference would be not to support that, so the implementation stays compatible with other runtimes.

Besides that, I can't see anything else that can't be changed, so it could offer the same methods and arguments with maybe slightly different but compatible types.

@crowlKats (Member)

As I have stated, this should be a TransformStream instead of a class utility, and it also needs to be a separate API instead of replacing the current one. We can't just replace the existing one with different behaviour.

@BlackAsLight deleted the archive branch, April 11, 2024 11:46
@BlackAsLight restored the archive branch, May 24, 2024 05:02
@BlackAsLight deleted the archive branch, May 24, 2024 05:03