Invalid characters in paths causing missing files and extraction issues. #1
Comments
Capitalization mismatches:
|
Okay, there may be a much more catastrophic problem here, probably worth another issue. This came up while doing a full parse of the entries to files.

    SELECT count(*) FROM images
    LEFT JOIN paths ON paths.iid = images.id
    LEFT JOIN ratings ON ratings.iid = images.id
    WHERE ratings.rating IS NULL

    count(*) = 92520

Or around 40% of the dataset has no reconcilable ratings in the data source. |
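For anyone who wants to reproduce that count from Python, here is a minimal sketch using the standard sqlite3 module. The table and column names are taken from the query above; everything else (the function name, how you open the database) is an assumption for illustration. Note that because of the joins this counts joined rows, which may differ from the number of distinct images if an image has more than one path.

```python
import sqlite3

# Same query as in the comment above.
UNRATED_QUERY = """
SELECT count(*) FROM images
LEFT JOIN paths ON paths.iid = images.id
LEFT JOIN ratings ON ratings.iid = images.id
WHERE ratings.rating IS NULL
"""


def count_unrated(conn):
    """Return (unrated_rows, total_images) for an open sqlite3 connection."""
    unrated = conn.execute(UNRATED_QUERY).fetchone()[0]
    total = conn.execute("SELECT count(*) FROM images").fetchone()[0]
    return unrated, total
```

Open the dataset's SQLite file with sqlite3.connect and pass the connection in; dividing the first number by the second gives the unrated share.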
Just to follow up on this: I wrote a Python script that processes the tar files manually and stream-decompresses them into a clean directory structure, keeping the IDs in place and stripping everything else. If anyone else lands here and still wants to access the files, here's a solution. You only need to go to the bottom of the script and set the input path to the folder containing the .tar archives and the output path to where you want the renamed PNGs extracted; directories are created automatically:

    import pathlib
    import shutil
    import tarfile


    def reprocess_archives(path_input_base, path_output_base):
        archive_paths = sorted(path_input_base.glob('*.tar'))
        for input_tar_path in archive_paths:
            # One output directory per archive, e.g. sac-000000/
            path_output_dir = path_output_base / input_tar_path.stem
            path_output_dir.mkdir(parents=True, exist_ok=True)
            with tarfile.open(name=input_tar_path, mode='r', bufsize=10240) as tf:
                print(f'Extracting {input_tar_path}...')
                file_ct = 0
                # Iterating the TarFile reads entries one at a time
                for entry in tf:
                    if not entry.isfile():
                        continue
                    # Replace prefix and leave only leading 'gid' and 'index' | sac-000000/123_1.png
                    new_name = entry.name.replace('home/jdp/simulacra-aesthetic-captions/', '')
                    name_parts = new_name.split('_')
                    new_name = f'{name_parts[0]}_{name_parts[-1]}'
                    path_output_file = path_output_dir / new_name
                    path_output_file.parent.mkdir(parents=True, exist_ok=True)
                    # Override output filepath/name by streaming the member's bytes ourselves
                    with tf.extractfile(entry) as src, open(path_output_file, 'wb') as dst:
                        shutil.copyfileobj(src, dst)
                    file_ct += 1
                print(f'  Extracted {file_ct} files.')


    if __name__ == '__main__':
        path_input_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions')
        path_output_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions-output')
        reprocess_archives(path_input_base, path_output_base)

This creates a structure like sac-000000/123_1.png, where the output files are named for their gid and index. Those can still be looked up in the images table in the SQLite db to get the image ID (iid), and from there the rating, if there is one.
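A rough sketch of that lookup, in case it helps: parse the gid and index back out of a renamed file and query the database. The column names gid and idx on the images table are assumptions for illustration (only images.id, ratings.iid and ratings.rating actually appear in this thread), as is the assumption that both values are integers, so check the real schema before using it.

```python
import pathlib
import sqlite3


def parse_renamed(path):
    """Split a renamed file such as '123_1.png' into (gid, index).

    ASSUMPTION: both parts are integers, as in the 123_1.png example.
    """
    gid, index = pathlib.Path(path).stem.split("_")
    return int(gid), int(index)


def lookup_ratings(conn, path):
    """Return (iid, [ratings]) for a renamed file, or None if it is not in images.

    ASSUMPTION: images has columns named 'gid' and 'idx'; these names are
    hypothetical and must be adjusted to the real schema.
    """
    gid, index = parse_renamed(path)
    row = conn.execute(
        "SELECT id FROM images WHERE gid = ? AND idx = ?", (gid, index)
    ).fetchone()
    if row is None:
        return None
    iid = row[0]
    ratings = [r[0] for r in conn.execute(
        "SELECT rating FROM ratings WHERE iid = ?", (iid,)
    )]
    return iid, ratings
```

An empty ratings list means the image exists but was never rated.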
|
This is intentional: the images are public domain and have utility outside of being rated. You can check the exact export process in https://github.com/JD-P/simulacrabot/blob/imagen/export_dataset.py. In short, every gen that was not flagged is in the dataset, and the bot doesn't force you to rate, so most but not all images are rated. |
Okay, it's good to hear that it was just my assumption that was wrong: I had assumed all the images included in the dataset had ratings. |
Hey, thanks for all the work on this dataset; the effort is greatly appreciated.
Problem
I've been seeing an issue with the tar archives in v1 of the dataset that causes problems in most applications I've looked at, and even in plain old tar itself. The result is that files go missing when the archive is extracted, or even when it is mounted, depending on the tar library doing the parsing. I wanted to point this out in case anyone else winds up in the same boat I'm in.
Cause?
I'm guessing that the tar archives were created programmatically by software that didn't have file-path worries on its mind, so forbidden or raw whitespace characters in the prompt titles were copied into the filenames, which makes applications angry. I discovered this when trying to re-combine the archives into other formats, e.g. zstd or Brotli archives, and running into a bunch of collisions and odd paths.
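To see how many entries are affected without extracting anything, something like the following sketch can scan an archive's member names. The set of "forbidden" characters here is my own conservative guess (the characters Windows rejects in filenames, plus control characters such as raw newlines and tabs); it is not taken from the dataset or from any particular extraction tool.

```python
import tarfile

# Characters rejected in Windows filenames. POSIX itself only forbids '/' and
# NUL, so this set is deliberately conservative (an assumption, not a spec).
FORBIDDEN = set('<>:"\\|?*')


def suspect_reason(basename):
    """Return a short reason if the name is likely to cause trouble, else None."""
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in basename):
        return "control character"
    if any(ch in FORBIDDEN for ch in basename):
        return "forbidden character"
    if basename != basename.strip():
        return "leading/trailing whitespace"
    return None


def find_suspect_names(tar_path):
    """List (entry name, reason) pairs for regular files, without extracting."""
    suspects = []
    with tarfile.open(tar_path, mode="r") as tf:
        for entry in tf:
            if not entry.isfile():
                continue
            # Only inspect the last path component of the entry name.
            reason = suspect_reason(entry.name.rsplit("/", 1)[-1])
            if reason:
                suspects.append((entry.name, reason))
    return suspects
```

Running find_suspect_names on one of the archives gives a list you can compare against the paths stored in the database.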
Examples from sac-000000.tar
The end result is that many applications, even 7z or tar itself, will try to fail gracefully and wind up misnaming files, so they no longer match the paths in the SQLite database, or will extract them to random places, etc.
In the newest version of 7-Zip, 'invalid' characters are replaced with underscores, but this again means the file's extracted path will not be the same as in the SQLite db, so the file is effectively 'missing' if you are trying to programmatically match a path to a record to retrieve the image's score.
Digging: