Data Inconsistencies #21

Open
nick-thumiger opened this issue Dec 3, 2024 · 2 comments

Comments

@nick-thumiger

Hi,

First off, great work! I’m excited to test some new models on this dataset. However, I’ve noticed several discrepancies in the data that I wanted to bring to your attention:

  • In the train-test-validation split file, some filenames are prefixed with "DrivAer_", while others are not. However, in the actual dataset, none of the files include this prefix. I assume the prefix should be removed from the index files for consistency.

  • Some files listed in the train-test-validation split files are missing from the dataset uploaded to Dataverse. For example, "E_S_WWC_WM_640" is included in the test index file but does not exist in the Pressure or Shear zip files uploaded to Dataverse. This issue appears to affect many files, not just this one.

  • There is significant inconsistency in how surface shear values are stored in the dataset. Some VTK files store shear as a cell_data field, others as a point_data field, and some contain both. It would be best to choose one format and standardize it, as inaccuracies are introduced on my end when converting between cell_data and point_data. Proper export from the CFD simulation would ensure accuracy.

  • Lastly, many of the Shear VTK files contain several NaN values, which seem to indicate issues with either the export process or the CFD simulation itself. For example, in "N_S_WW_WM_229," I found more than 2,500 cells with NaN values. Additionally, this file does not contain any point_data, as mentioned in the previous point.
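The first two points can be checked mechanically. Below is a minimal sketch, assuming the split files are plain lists with one design ID per line and that the dataset IDs are the file stems from the extracted Pressure or Shear archives. The helper names `normalize_id` and `audit_split` are hypothetical and not part of the dataset's tooling, and the IDs in the example are only illustrative.

```python
PREFIX = "DrivAer_"


def normalize_id(name: str) -> str:
    """Strip surrounding whitespace and the optional 'DrivAer_' prefix."""
    name = name.strip()
    return name[len(PREFIX):] if name.startswith(PREFIX) else name


def audit_split(split_ids, dataset_ids):
    """Return (prefixed, missing) for one split list.

    prefixed: entries that still carry the 'DrivAer_' prefix
    missing:  normalized entries with no matching file in the dataset
    """
    available = {normalize_id(d) for d in dataset_ids}
    prefixed = [s.strip() for s in split_ids if s.strip().startswith(PREFIX)]
    listed = {normalize_id(s) for s in split_ids if s.strip()}
    missing = sorted(listed - available)
    return prefixed, missing


# Illustrative IDs only; the real lists come from the split files and archives.
split_ids = ["DrivAer_N_S_WW_WM_229", "E_S_WWC_WM_640"]
dataset_ids = ["N_S_WW_WM_229"]

prefixed, missing = audit_split(split_ids, dataset_ids)
print(prefixed)  # ['DrivAer_N_S_WW_WM_229']
print(missing)   # ['E_S_WWC_WM_640']
```

Normalizing before comparing keeps the two problems separate: an entry that differs only by the prefix is reported as prefixed, not as missing.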

Please let me know if you need further clarification. Unfortunately, until these issues are resolved, I am unable to use the shear data in my models.
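For reference, the shear-related points can be summarized per file with a small audit. This is only a sketch: it assumes each VTK file has already been read, with a reader of your choice, into two dictionaries that map field names to lists of 3-component vectors. The field name `"wallShearStress"` and the helper names are placeholders, not names confirmed by the dataset.

```python
import math


def storage_location(cell_data, point_data, field):
    """Report where `field` is stored: 'cell', 'point', 'both', or 'none'."""
    in_cell = field in cell_data
    in_point = field in point_data
    if in_cell and in_point:
        return "both"
    if in_cell:
        return "cell"
    if in_point:
        return "point"
    return "none"


def count_nan_entries(vectors):
    """Count entries (cells or points) with at least one NaN component."""
    return sum(1 for v in vectors if any(math.isnan(c) for c in v))


# Toy stand-in for one file's arrays; "wallShearStress" is a placeholder name.
cell_data = {"wallShearStress": [(0.1, 0.0, 0.2), (float("nan"), 0.0, 0.1)]}
point_data = {}

field = "wallShearStress"
print(storage_location(cell_data, point_data, field))  # cell
print(count_nan_entries(cell_data[field]))             # 1
```

Running this over every file gives a table of storage location and NaN count per design, which makes it easy to see which files would need re-exporting.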

@nick-thumiger
Author

Just as a follow-up, the total number of indices in the train-test-validation split files does not sum to the total number of data points, so something is definitely wrong there. There are also exactly 40 indices that do not correspond to any data point in the dataset. Are those files missing?
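Both observations come down to set arithmetic. Here is a sketch, assuming each split is a list of IDs and the dataset is the list of IDs actually present; `split_report` is a hypothetical helper and the IDs are made up.

```python
def split_report(splits, dataset_ids):
    """Compare split membership against the IDs actually present in the dataset.

    splits: dict mapping split name -> list of IDs
    """
    dataset = set(dataset_ids)
    listed = [i for ids in splits.values() for i in ids]
    unique = set(listed)
    return {
        "listed_total": len(listed),              # sum of the split sizes
        "dataset_total": len(dataset),            # data points actually present
        "duplicates": len(listed) - len(unique),  # IDs listed more than once
        "orphans": sorted(unique - dataset),      # listed, but no data
        "unassigned": sorted(dataset - unique),   # data, but in no split
    }


# Toy example with made-up IDs.
splits = {"train": ["a", "b"], "val": ["c"], "test": ["x"]}
report = split_report(splits, ["a", "b", "c", "d", "e"])
print(report["listed_total"], report["dataset_total"])  # 4 5
print(report["orphans"])                                # ['x']
print(report["unassigned"])                             # ['d', 'e']
```

Here `orphans` corresponds to indices with no matching data point, and a mismatch between `listed_total` and `dataset_total` corresponds to the totals not adding up.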

@Mohamedelrefaie
Owner

Hi Nick,

Thank you so much for your detailed feedback and for bringing these issues to our attention! This will definitely help us refine the dataset for better usability.

  1. Filename Prefix: You’re absolutely right. We recently renamed the files to remove the "DrivAer_" prefix for consistency across the dataset.
  2. Missing Files: This is also true, and I’ve updated the parametric CSV file to ensure consistency across all data modalities. This update should address the missing files you mentioned, but please let us know if you notice any further discrepancies.
  3. Shear Data Format (cell_data vs. point_data): Could you elaborate more on the differences you’re encountering here? We’ve trained several models for surface field predictions without running into major issues.
  4. Indices That Don’t Match Data Points: Thank you for pointing this out. We’ve identified these discrepancies and are currently working on fixing them.
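On item 3, one generic reason the two storage formats are not interchangeable is that converting cell values to point values by averaging is not invertible. The toy sketch below uses a 1D chain of cells, not the dataset's surface mesh, and the averaging rule is an assumption for illustration, not necessarily the conversion anyone in this thread used.

```python
def cell_to_point(cell_values):
    """Average the values of the cells adjacent to each point (1D chain).

    A chain of n cells has n + 1 points; point i touches cells i - 1 and i.
    """
    n = len(cell_values)
    points = []
    for i in range(n + 1):
        neighbors = [cell_values[j] for j in (i - 1, i) if 0 <= j < n]
        points.append(sum(neighbors) / len(neighbors))
    return points


def point_to_cell(point_values):
    """Average the two end points of each cell (1D chain)."""
    return [(a + b) / 2 for a, b in zip(point_values, point_values[1:])]


cells = [0.0, 4.0, 0.0]
points = cell_to_point(cells)
round_trip = point_to_cell(points)
print(points)      # [0.0, 2.0, 2.0, 0.0]
print(round_trip)  # [1.0, 2.0, 1.0]
```

The peak of 4.0 comes back as 2.0 after one round trip, so a model trained on converted values sees a smoothed field. Under the same averaging, a single NaN cell would also turn every adjacent point into NaN.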

Again, we truly appreciate your contribution and the time you’ve taken to test the dataset. If you have any further feedback or suggestions, please don’t hesitate to share!
