Implement concatenate_dataset_dicts
#153
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   92.19%   92.30%   +0.11%
==========================================
  Files          10       10
  Lines         897      910      +13
==========================================
+ Hits          827      840      +13
  Misses         70       70
concatenate_dataset_dicts
Added tests. One way around this might be to prefix
This should work since the HF method allows fields to be empty. Another option is not to use the HF implementation and write our own. Or we can leave it as is and preprocess datasets before merging.
For now I'm just removing metadata from documents when merging. It seems that HF Datasets
Implements `concatenate_dataset_dicts()`, which allows merging multiple datasets into a single one.

`concatenate_datasets()` and `concatenate_dataset_dicts()` now have an additional option, `clear_metadata`, to clean metadata fields when merging. If True, only the `dataset_name` field will stay in metadata. `dataset_name` will NOT be overwritten by subsequent concatenations, so it always contains the original dataset name.

Other changes:
- `Dataset.map()` and `IterableDataset.map()` behavior when parameter `function` is None.
- `dataset_to_document_type()` converter now removes all features not declared by the target document type.
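
The `clear_metadata` behavior described above can be illustrated with a minimal, library-free sketch. This is not the code from this PR: documents are modeled as plain dicts with an optional `metadata` dict, and the signatures, the `Document` alias, the default `clear_metadata=False`, and the per-split merging rule are all assumptions made for illustration.

```python
from copy import deepcopy
from typing import Any, Dict, List

# Hypothetical stand-in for a document: a dict with an optional "metadata" dict.
Document = Dict[str, Any]


def concatenate_datasets(
    datasets: Dict[str, List[Document]], clear_metadata: bool = False
) -> List[Document]:
    """Merge named datasets into one list, tagging each document with its origin."""
    merged: List[Document] = []
    for name, docs in datasets.items():
        for doc in docs:
            doc = deepcopy(doc)  # do not mutate the caller's documents
            old_meta = doc.get("metadata", {})
            # Keep an existing dataset_name so repeated merges preserve the original.
            dataset_name = old_meta.get("dataset_name", name)
            # With clear_metadata, start from an empty dict; otherwise keep all fields.
            new_meta = {} if clear_metadata else dict(old_meta)
            new_meta["dataset_name"] = dataset_name
            doc["metadata"] = new_meta
            merged.append(doc)
    return merged


def concatenate_dataset_dicts(
    dataset_dicts: Dict[str, Dict[str, List[Document]]], clear_metadata: bool = False
) -> Dict[str, List[Document]]:
    """Merge split by split (assumed rule: a split is built from every input that has it)."""
    splits = sorted({split for dd in dataset_dicts.values() for split in dd})
    return {
        split: concatenate_datasets(
            {name: dd[split] for name, dd in dataset_dicts.items() if split in dd},
            clear_metadata=clear_metadata,
        )
        for split in splits
    }


a = {"train": [{"text": "foo", "metadata": {"source": "web"}}]}
b = {"train": [{"text": "bar", "metadata": {"annotator": "x"}}], "test": [{"text": "baz"}]}

once = concatenate_dataset_dicts({"a": a, "b": b}, clear_metadata=True)
# Merging the already merged result again keeps the original dataset names.
twice = concatenate_dataset_dicts({"merged": once}, clear_metadata=True)
```

In this sketch, `once["train"][0]["metadata"]` contains only `dataset_name`, and the second merge in `twice` leaves that value untouched rather than replacing it with `"merged"`.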