Content Size (self.size) remaining at 0 for certain zarr-related assets #2067

Open
aaronkanzer opened this issue Nov 4, 2024 · 5 comments

Comments

@aaronkanzer
Member

aaronkanzer commented Nov 4, 2024

Issue:

Some zarr assets are registered with a size of 0 even though their actual stored size is well above zero. This was noticed when running dandi download on Dandiset 000719; I'm still investigating whether other zarr-containing dandisets have similar issues.

It seems that there are 119 ZarrArchive objects registered with a size of 0:

>>> ZarrArchive.objects.filter(size=0).count()
119

@waxlamp @satra @kabilar -- any idea if this could be intentional behavior? It seems that the dandisets containing these size-0 ZarrArchives were posted to DANDI Archive on widely varying dates.

e.g. https://api.dandiarchive.org/api/dandisets/000108/versions/draft/assets/db2fe61f-2874-444d-b952-02234d00f2ba/ is from a few years ago and also reports a size of 0 for sub-SChmi53/ses-20220114h19m16s46/micr/sub-SChmi53_ses-20220114h19m16s46_sample-13_stain-LEC_run-1_chunk-3_SPIM.ome.zarr

Cc @yarikoptic @jwodder -- is this something that I should also cross-post in dandi-cli repo for reference?

Steps to replicate the bug as initially noticed

  1. Visit https://dandiarchive.org/dandiset/000719/draft
  2. Use the dandi download command: dandi download DANDI:000719/draft
  3. Observe the output. Below is sample output from the middle of a download; note especially lines such as:
    0 Bytes 31.0 MB 0% downloading 1256 done
(nov4) (base) aaronkanzer@Aarons-MacBook-Pro 000719 % dandi download DANDI:000719/draft          
PATH                                                    SIZE      DONE            DONE% CHECKSUM STATUS                 MESSAGE         
000719/dandiset.yaml                                                                             done                   updated         
...T210000_behavior+ophys_NestedDirectoryStore_nwb.zarr                                                                                 
...ses-1214621812_icephys_NestedDirectoryStore_nwb.zarr 0 Bytes   1.2 MB             0%          error                  AssertionError  
...ephys/sub-1214579789_ses-1214621812_icephys.nwb.zarr 0 Bytes   2.5 MB             0%          error                  AssertionError  
...ephys/sub-1214579789_ses-1214621812_icephys.nwb.zarr 25.8 MB   20.3 MB           78%          downloading            10469 done      
...ses-1214621812_icephys_NestedDirectoryStore_nwb.zarr 25.9 MB   20.4 MB           78%          downloading            10500 done      
...phys/sub-npI3_ses-20190421_behavior+ecephys.nwb.zarr 0 Bytes   184.7 MB           0%          downloading            741 done        
...90421_behavior+ecephys_NestedDirectoryStore_nwb.zarr 0 Bytes   179.8 MB           0%          downloading            720 done        
...s/sub-R6_ses-20200206T210000_behavior+ophys.nwb.zarr 0 Bytes   31.0 MB            0%          downloading            1256 done       
...79789_ses-1214621812_icephys_DirectoryStore.nwb.zarr                                                                                 
...ses-1214621812_icephys_NestedDirectoryStore_nwb.zarr                                                                                 
...200206T210000_behavior+ophys_DirectoryStore.nwb.zarr                                                                                 
...200206T210000_behavior+ophys_DirectoryStore.nwb.zarr       
  4. Verify the values stored in the Heroku Postgres database for the given zarr assets

Open a Django shell (python manage.py shell), then run:

from dandiapi.api.models import *

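# .get() returns a single Dandiset; .filter() returns a QuerySet, which has no zarr_archives attribute
for archive in Dandiset.objects.get(pk="000719").zarr_archives.all():
    print(f"Name: {archive.name}, Size: {archive.size}")

Notice that some assets have a content size of 0.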

  5. Verify that a given zarr asset does have a size larger than 0 in AWS, for example:
aws s3 ls s3://dandiarchive/zarr/dbbf7b82-c649-409b-a1ae-3b28d1991628/ --recursive --human-readable --summarize

The links below point to that asset's DANDI Archive API and UI pages:

https://api.dandiarchive.org/api/dandisets/000719/versions/draft/assets/23182373-62a6-4747-b6cd-ac7e37f0bb15/
https://dandiarchive.org/dandiset/000719/draft/files?location=ophys_DirectoryStore_9_29_24&page=1
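
The comparison in steps 4 and 5 (recorded size versus what is actually in storage) can be sketched in plain Python. This is only an illustration under stated assumptions: `find_size_mismatches` is a hypothetical helper, the zarr IDs are placeholders, and the byte counts are made-up values, not measurements from the archive. In practice the recorded sizes would come from the database and the per-object sizes from an S3 listing.

```python
def find_size_mismatches(recorded, stored_objects):
    """Return {zarr_id: (recorded_size, actual_size)} where the two disagree.

    recorded:        zarr_id -> size stored in the database (hypothetical input)
    stored_objects:  zarr_id -> list of object sizes found in storage (hypothetical input)
    """
    mismatches = {}
    for zarr_id, recorded_size in recorded.items():
        actual_size = sum(stored_objects.get(zarr_id, []))
        if actual_size != recorded_size:
            mismatches[zarr_id] = (recorded_size, actual_size)
    return mismatches


# Illustrative placeholder values only.
recorded = {"zarr-a": 0, "zarr-b": 300}
stored_objects = {"zarr-a": [100, 250], "zarr-b": [100, 200]}

print(find_size_mismatches(recorded, stored_objects))
```

Here "zarr-a" would be reported because its recorded size is 0 while its objects sum to a non-zero total.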

@aaronkanzer
Member Author

@bendichter -- just tagging for visibility -- as this specific dandiset contained nwb extended zarr

aaronkanzer changed the title from "Content Size (self.size) remaining at 0 for zarr-related assets" to "Content Size (self.size) remaining at 0 for certain zarr-related assets" on Nov 4, 2024
@jjnesbitt
Member

A zarr size of zero is only significant if the zarr has been finalized. If the zarr is not finalized, it simply means that the upload was never "finished" and the checksum was never computed. The zarr checksum computation is what populates the size and file_count fields. If I run that same query but restrict it to zarrs with a COMPLETE status, I get nothing:

>>> ZarrArchive.objects.filter(size=0, status=ZarrArchiveStatus.COMPLETE).count()
0
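
To make the distinction concrete, here is a minimal, self-contained sketch that does not touch the real Django models. `Zarr`, `Status`, and `suspicious_zero_size` are hypothetical stand-ins, and `PENDING` is an assumed name for "not yet finalized" (only `COMPLETE` is named in this thread).

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Only COMPLETE appears in this thread; PENDING is an assumed placeholder.
    PENDING = "Pending"
    COMPLETE = "Complete"


@dataclass
class Zarr:
    name: str
    size: int
    status: Status


def suspicious_zero_size(zarrs):
    """Return zarrs whose size is 0 even though they have been finalized."""
    return [z for z in zarrs if z.size == 0 and z.status is Status.COMPLETE]


zarrs = [
    Zarr("a.zarr", 0, Status.PENDING),           # never finalized: size 0 is expected
    Zarr("b.zarr", 1_200_000, Status.COMPLETE),  # finalized with a computed size
    Zarr("c.zarr", 0, Status.COMPLETE),          # finalized but size 0: worth a look
]

print([z.name for z in suspicious_zero_size(zarrs)])
```

Only the finalized zero-size entry is flagged, which mirrors why the filtered query above returns nothing when no such zarrs exist.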

@aaronkanzer
Member Author

aaronkanzer commented Nov 4, 2024

@jjnesbitt good to know, thanks for looking into this -- it seems this isn't so much a bug, then. However, zarrs that don't have status=ZarrArchiveStatus.COMPLETE are still displayed in the UI and included in dandi download operations. Is this intentional, or should they be filtered out? Alternatively, should users be notified that the zarr assets they are referencing might not be complete?

@jjnesbitt
Member

> @jjnesbitt good to know, thanks for looking into this -- it seems this isn't so much a bug, then. However, zarrs that don't have status=ZarrArchiveStatus.COMPLETE are still displayed in the UI and included in dandi download operations. Is this intentional, or should they be filtered out? Alternatively, should users be notified that the zarr assets they are referencing might not be complete?

It may be a good idea to not show any zarrs that haven't been finalized "at least once". Technically you can upload further data to a zarr that's already been finalized, so in that case we wouldn't want to "disappear" the zarr during that secondary upload, even though the size gets reset to zero. We've had this idea in the past, but it hasn't been fleshed out or implemented in any way yet. We've also considered email reminders for unfinished (blob or zarr) uploads.
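
A rough sketch of the "finalized at least once" rule, purely as an illustration: `ZarrRecord`, its `ever_finalized` flag, and `visible` are hypothetical names, not part of the existing models or API.

```python
from dataclasses import dataclass


@dataclass
class ZarrRecord:
    name: str
    size: int = 0
    finalized: bool = False       # the current upload has been finalized
    ever_finalized: bool = False  # hypothetical flag: finalized at least once

    def start_upload(self) -> None:
        """Begin a (possibly secondary) upload; size resets until re-finalized."""
        self.size = 0
        self.finalized = False

    def finalize(self, computed_size: int) -> None:
        """Record the size obtained from the checksum computation."""
        self.size = computed_size
        self.finalized = True
        self.ever_finalized = True


def visible(zarrs):
    """Hide only zarrs that have never been finalized."""
    return [z.name for z in zarrs if z.ever_finalized]


fresh = ZarrRecord("fresh.zarr")
fresh.start_upload()             # first upload, never finalized

updated = ZarrRecord("updated.zarr")
updated.finalize(31_000_000)     # illustrative size only
updated.start_upload()           # secondary upload: size is back to 0

print(visible([fresh, updated]))
```

Under this rule the zarr in a secondary upload stays listed despite its size being 0, while the never-finalized one is hidden.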

@yarikoptic
Member

While thinking about the zarr "redesign" (#1892), but maybe even before that -- I agree with @jjnesbitt that we should disallow minting an asset until it has been finalized. In the new design (whatever it turns out to be) that would mean it gets a manifest, and thus that version would already be accessible, etc. A complementary behavior we might want to add (or not) is
