Invalid characters in paths causing missing files and extraction issues. #1

Open
kjerk opened this issue Jul 20, 2022 · 6 comments

kjerk commented Jul 20, 2022

Hey, thanks for all the work on this dataset; the effort is greatly appreciated.

Problem

I've been seeing a problem with the tar archives in v1 of the dataset that trips up most applications I've tried, including plain old tar itself. Depending on the tar library doing the parsing, files go missing when the archive is extracted, or even when it is mounted. I wanted to point this out in case anyone else winds up in the same boat I'm in.

Cause?

I'm guessing the tar archives were created programmatically by software that didn't sanitize file paths, so forbidden characters and raw whitespace in the prompt text were copied into the filenames, which makes applications angry. I discovered this when trying to repack the archives into other formats (e.g. zstd or Brotli archives) and running into a bunch of collisions and odd paths.

Examples from sac-000000.tar

Escaped Backslash:
home/jdp/simulacra-aesthetic-captions/32003_melancholic_cuboid_chained_Luminism_Angel_by_Peter_MohrBacher_WLOP_Alphonse_Mucha_J_C_Leyendecker_Ruan_Jia_and_Beksinsk_Featured_on_ArtStation\\_2.png

Newline:
home/jdp/simulacra-aesthetic-captions/6052_A_lamp_lighting_up_a_room_#lighting_#electronics\n_#artstation_#purple_#pastel_#color_#watercolor_2.png

Newline:
home/jdp/simulacra-aesthetic-captions/15711_Their_bodies_are_almost_always_hidden_by_several_layers_of_cloaks_and_by_metallic_helmets_with_visors_made_of_glass,_\n_detailed_digital_art_by_Greg_Rutkowski_and_Erik_Bulatov,_cgsociety_trending_on_artstation_8.png

Misc Quotations:
home/jdp/simulacra-aesthetic-captions/30132_The_man_said,_\n"Dude,_I_don't_know_that_I_like_those_eyes.""they're_still_looking_at_us._they're_good_eyes.""what_makes_you_say_that?,"_the_man_replied."oh,_it's_just_a_feeling."_3.png

The end result is that many applications, even 7z or tar itself, try to fail gracefully and wind up misnaming files, so the extracted names no longer match the paths in the SQLite database, or the files extract to random places, etc.

In the newest version of 7-Zip, 'invalid' characters are replaced with underscores, but the extracted path still differs from the one in the SQLite db, so the file is effectively 'missing' if you are trying to programmatically match a path to a record to retrieve the image's score.

Digging:

> tar --list -f sac-000000.tar | grep -E '\\n|\\\\|""'
home/jdp/simulacra-aesthetic-captions/6052_A_lamp_lighting_up_a_room_#lighting_#electronics\n_#artstation_#purple_#pastel_#color_#watercolor_2.png
home/jdp/simulacra-aesthetic-captions/9292_In_film_gray\nTry_to_change\nThe_frame_rates_intermixing\nIn_your_chronic_carbon_system_7.png
home/jdp/simulacra-aesthetic-captions/5761_At_the_school_playground_at_your_friends_#friends_#embarrassingmoment\n_#artsy_#anime_#cartoon_2.png
home/jdp/simulacra-aesthetic-captions/32003_melancholic_cuboid_chained_Luminism_Angel_by_Peter_MohrBacher_WLOP_Alphonse_Mucha_J_C_Leyendecker_Ruan_Jia_and_Beksinsk_Featured_on_ArtStation\\_2.png
home/jdp/simulacra-aesthetic-captions/15711_Their_bodies_are_almost_always_hidden_by_several_layers_of_cloaks_and_by_metallic_helmets_with_visors_made_of_glass,_\n_detailed_digital_art_by_Greg_Rutkowski_and_Erik_Bulatov,_cgsociety_trending_on_artstation_8.png
home/jdp/simulacra-aesthetic-captions/30132_The_man_said,_\n"Dude,_I_don't_know_that_I_like_those_eyes.""they're_still_looking_at_us._they're_good_eyes.""what_makes_you_say_that?,"_the_man_replied."oh,_it's_just_a_feeling."_3.png
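
The same check can be done without a shell. The following is a minimal sketch, assuming Python's standard `tarfile` module; the helper names `find_suspect_names` and `build_demo_archive` and the `demo/...` member names are hypothetical and exist only for illustration. It flags regular-file members whose names contain a backslash, a double quote, or a raw whitespace control character.

```python
import io
import tarfile

# Characters treated as suspicious in this sketch (an assumption, not an exhaustive list).
SUSPECT_CHARS = set('\\"\n\r\t')

def find_suspect_names(tar_file_obj):
    """Return names of regular-file members that contain a suspect character."""
    suspects = []
    with tarfile.open(fileobj=tar_file_obj, mode='r') as tf:
        for entry in tf:
            if entry.isfile() and any(ch in SUSPECT_CHARS for ch in entry.name):
                suspects.append(entry.name)
    return suspects

def build_demo_archive(names):
    """Build a small in-memory tar of empty files, purely for demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tf:
        for name in names:
            tf.addfile(tarfile.TarInfo(name=name), io.BytesIO(b''))
    buf.seek(0)
    return buf

demo = build_demo_archive([
    'demo/1_clean_prompt_1.png',
    'demo/2_line\nbreak_2.png',
    'demo/3_back\\slash_3.png',
])
suspect_names = find_suspect_names(demo)
print(suspect_names)
```

To scan a real archive, pass `open('sac-000000.tar', 'rb')` instead of the in-memory demo buffer.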

kjerk commented Jul 20, 2022

Capitalization mismatches:

sac-000000.tar ->
home/jdp/simulacra-aesthetic-captions/14135_"There's_something_wrong_in_this_village."_-_color_manga_illustration_by_junji_ito._high_quality_7.png

--------
SELECT t.* FROM paths t WHERE path like '14135%' LIMIT 500
108393,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_1.png"
108394,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_2.png"
108395,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_3.png"
108396,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_4.png"
108397,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_5.png"
108398,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_6.png"
108399,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_7.png"
108400,"14135_""There's_something_wrong_in_this_village.""_-_color_manga_illustration_by_Junji_Ito._high_quality_8.png"
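
One way to tolerate the capitalization mismatch is to compare names case-insensitively. This is a minimal sketch, not part of the dataset tooling; `build_casefold_index` and `match_archive_name` are hypothetical helper names, and it assumes the database paths have already been loaded into a Python list.

```python
ARCHIVE_PREFIX = 'home/jdp/simulacra-aesthetic-captions/'

def build_casefold_index(db_paths):
    """Map each case-folded database path to the original path(s) that produce it."""
    index = {}
    for path in db_paths:
        index.setdefault(path.casefold(), []).append(path)
    return index

def match_archive_name(archive_name, index):
    """Return the database paths that equal the archive member name, ignoring case."""
    if archive_name.startswith(ARCHIVE_PREFIX):
        archive_name = archive_name[len(ARCHIVE_PREFIX):]
    return index.get(archive_name.casefold(), [])

db_path = ('14135_"There\'s_something_wrong_in_this_village."'
           '_-_color_manga_illustration_by_Junji_Ito._high_quality_7.png')
tar_name = (ARCHIVE_PREFIX +
            '14135_"There\'s_something_wrong_in_this_village."'
            '_-_color_manga_illustration_by_junji_ito._high_quality_7.png')

index = build_casefold_index([db_path])
matches = match_archive_name(tar_name, index)
print(matches)
```

The index keeps a list per key because two database paths could differ only by case, in which case the caller has to disambiguate.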


kjerk commented Jul 20, 2022

Okay there may be a much more catastrophic problem here, probably worth another issue. This came up doing a full parse of the entries to files.

SELECT count(*) FROM images
LEFT JOIN paths ON paths.iid = images.id
LEFT JOIN ratings ON ratings.iid = images.id
WHERE ratings.rating IS NULL

count(*) = 92520

Or around 40% of the dataset has no reconcilable ratings in the data source.
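
For anyone who wants to reproduce this kind of count on their own copy, here is a minimal sketch using Python's `sqlite3` module against a tiny in-memory database. Only the column names `images.id`, `ratings.iid`, and `ratings.rating` come from the query above; the rows are made up for illustration. It counts distinct images so that an image is counted once no matter how many joined rows it produces.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE images (id INTEGER PRIMARY KEY);
    CREATE TABLE ratings (iid INTEGER, rating INTEGER);
    -- Made-up rows: images 1 and 3 are rated, images 2 and 4 are not.
    INSERT INTO images (id) VALUES (1), (2), (3), (4);
    INSERT INTO ratings (iid, rating) VALUES (1, 7), (1, 5), (3, 9);
""")

# Images with no matching row in ratings come back with rating = NULL.
unrated = conn.execute("""
    SELECT COUNT(DISTINCT images.id) FROM images
    LEFT JOIN ratings ON ratings.iid = images.id
    WHERE ratings.rating IS NULL
""").fetchone()[0]

total = conn.execute("SELECT COUNT(*) FROM images").fetchone()[0]
unrated_fraction = unrated / total
print(unrated, total, unrated_fraction)
```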

@henry501

This also causes issues when extracting to an exFAT-formatted drive, because several of these characters are invalid in exFAT filenames (reference: Wikipedia).
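
A quick way to see which names would be rejected is to test each filename against the exFAT restrictions. This is a minimal sketch, assuming the commonly documented exFAT rule that control characters U+0000 through U+001F and the characters `" * / : < > ? \ |` are not allowed in a file name; `exfat_invalid_chars` is a hypothetical helper name and the sample names are shortened illustrations. It expects a single file name, not a full path.

```python
# Printable characters that are not allowed in an exFAT file name (assumed rule set).
EXFAT_FORBIDDEN = set('"*/:<>?\\|')

def exfat_invalid_chars(filename):
    """Return the sorted, de-duplicated characters in filename that exFAT would reject."""
    return sorted({ch for ch in filename if ch in EXFAT_FORBIDDEN or ord(ch) < 0x20})

print(exfat_invalid_chars('123_plain_prompt_1.png'))
print(exfat_invalid_chars('123_say_"what?"_1.png'))
print(exfat_invalid_chars('123_line\nbreak_1.png'))
```

An empty list means the name is safe with respect to this character rule; other limits, such as name length, are not checked here.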


kjerk commented Jul 21, 2022

Just to follow up on this: I wrote a Python script that processes the tar files manually and stream-extracts them into a clean directory structure, keeping the IDs in place and stripping everything else. If anyone else lands here and wants to access the files in the meantime, here's a solution. You only need to go to the bottom of the script and set the input path to the folder that holds the .tar archives and the output path to the directory where you want the renamed PNGs extracted; directories are created automatically:

import pathlib
import shutil
import tarfile

def reprocess_archives(path_input_base, path_output_base):
    archive_paths = sorted(path_input_base.glob('*.tar'))

    for input_tar_path in archive_paths:
        path_output_dir = path_output_base / input_tar_path.stem

        # Create the per-archive output directory, including any missing parents
        path_output_dir.mkdir(parents=True, exist_ok=True)

        with tarfile.open(name=input_tar_path, mode='r', bufsize=10240) as tf:
            print(f'Extracting {input_tar_path}...')

            file_ct = 0

            # Iterating the TarFile reads members one at a time, so skipping a
            # non-file entry can no longer stall the loop on the same entry.
            for entry in tf:
                if not entry.isfile():
                    continue

                # Replace prefix and leave only leading 'gid' and 'index' | sac-000000/123_1.png
                new_name = entry.name.replace('home/jdp/simulacra-aesthetic-captions/', '')
                name_parts = new_name.split('_')
                new_name = f'{name_parts[0]}_{name_parts[-1]}'

                path_output_file = path_output_dir / new_name

                # Copy the member's bytes to the new filename instead of its archived path
                with tf.extractfile(entry) as src, open(path_output_file, 'wb') as dst:
                    shutil.copyfileobj(src, dst)

                file_ct += 1
            print(f'  Extracted {file_ct} files.')

if __name__ == '__main__':
    path_input_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions')
    path_output_base = pathlib.Path(r'/path/to/simulacra-aesthetic-captions-output')
    
    reprocess_archives(path_input_base, path_output_base)

This creates a structure like the one below, with each file named for its gid and index. Those can still be looked up in the images table in the SQLite db to get the image ID (iid), and from there the rating, if there is one.

├───sac-000000
│       32_7.png
│       4_5.png
│       8_7.png
│
└───sac-000001
        1681_5.png
        3215_6.png
        941_3.png
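
Going from one of these filenames back to a rating could look roughly like the following. This is a minimal sketch under stated assumptions: the column names `images.gid` and `images.idx` are guesses that are not confirmed anywhere in this thread, the rows are made up, and `parse_output_name` and `lookup_ratings` are hypothetical helper names. Check the real schema (for example with `.schema` in the sqlite3 shell) before relying on it.

```python
import sqlite3

def parse_output_name(filename):
    """Split a name like '32_7.png' into its integer (gid, index) parts."""
    stem = filename.rsplit('.', 1)[0]
    gid, index = stem.split('_')
    return int(gid), int(index)

def lookup_ratings(conn, filename):
    """Return all ratings for the image behind filename (empty list if unrated)."""
    gid, idx = parse_output_name(filename)
    rows = conn.execute(
        "SELECT ratings.rating FROM images "
        "JOIN ratings ON ratings.iid = images.id "
        "WHERE images.gid = ? AND images.idx = ?",
        (gid, idx),
    ).fetchall()
    return [row[0] for row in rows]

# Made-up in-memory data; 'gid' and 'idx' are ASSUMED column names.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE images (id INTEGER PRIMARY KEY, gid INTEGER, idx INTEGER);
    CREATE TABLE ratings (iid INTEGER, rating INTEGER);
    INSERT INTO images (id, gid, idx) VALUES (100, 32, 7), (101, 4, 5);
    INSERT INTO ratings (iid, rating) VALUES (100, 8);
""")

print(parse_output_name('32_7.png'))
print(lookup_ratings(conn, '32_7.png'))
print(lookup_ratings(conn, '4_5.png'))
```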

JD-P (Owner) commented Jul 31, 2022

> Okay there may be a much more catastrophic problem here, probably worth another issue. This came up doing a full parse of the entries to files.
>
> SELECT count(*) FROM images
> LEFT JOIN paths ON paths.iid = images.id
> LEFT JOIN ratings ON ratings.iid = images.id
> WHERE ratings.rating IS NULL
>
> count(*) = 92520
>
> Or around 40% of the dataset has no reconcilable ratings in the data source.

This is intentional: the images are public domain and have utility outside of being rated. You can check the exact export process in https://github.com/JD-P/simulacrabot/blob/imagen/export_dataset.py. In short, every gen that was not flagged is in the dataset, and the bot doesn't force you to rate, so most but not all images are rated.


kjerk commented Aug 1, 2022

Okay, it's good to hear that the only thing wrong was my assumption that every image included in the dataset had a rating.
