Implement concatenate_dataset_dicts #153

Merged 7 commits on Oct 1, 2024

RainbowRivey commented Sep 9, 2024

Implements concatenate_dataset_dicts(), which merges multiple datasets into a single one.

  • concatenate_datasets() and concatenate_dataset_dicts() now accept an additional option, clear_metadata, that clears metadata fields when merging. If True, only the dataset_name field remains in metadata. dataset_name is NOT overwritten by subsequent concatenations, so it always contains the original dataset name.
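The clear_metadata semantics described above can be sketched with plain dictionaries. This is only an illustration, not the code from this PR: `Doc`, `tag_and_clear`, and `concatenate_sketch` are hypothetical names, documents are modeled as dicts with a `metadata` key, and the default value of `clear_metadata` is an assumption.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical stand-in for a document: a dict with an optional "metadata" dict.
Doc = Dict[str, Any]


def tag_and_clear(doc: Doc, dataset_name: str, clear_metadata: bool) -> Doc:
    """Return a copy of doc whose metadata records the originating dataset."""
    old_meta = doc.get("metadata", {})
    # Keep an existing dataset_name so that repeated concatenation
    # never overwrites the original name.
    name = old_meta.get("dataset_name", dataset_name)
    # With clear_metadata=True, every other metadata field is dropped.
    new_meta = {} if clear_metadata else dict(old_meta)
    new_meta["dataset_name"] = name
    return {**doc, "metadata": new_meta}


def concatenate_sketch(
    datasets: Dict[str, Iterable[Doc]], clear_metadata: bool = False
) -> List[Doc]:
    """Merge named datasets into one list of documents (illustration only)."""
    merged: List[Doc] = []
    for name, docs in datasets.items():
        merged.extend(tag_and_clear(d, name, clear_metadata) for d in docs)
    return merged
```

Concatenating the result again under a new name leaves each document's dataset_name untouched, which matches the "always contains the original dataset name" behavior stated above.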

Other changes:

  • Fixes the behavior of Dataset.map() and IterableDataset.map() when the function parameter is None.
  • The dataset_to_document_type() converter now removes all features not declared by the target document type.
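As a rough illustration of the last point, dropping undeclared features amounts to filtering a record by the fields the target type declares. The sketch below uses a plain dataclass; `TargetDocument` and `keep_declared_features` are hypothetical names, and the real converter works on dataset features rather than on single dicts.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type


@dataclass
class TargetDocument:
    """Hypothetical target document type declaring two features."""

    text: str
    entities: list


def keep_declared_features(row: Dict[str, Any], target_type: Type) -> Dict[str, Any]:
    """Drop every key of row that target_type does not declare as a field."""
    declared = {f.name for f in fields(target_type)}
    return {key: value for key, value in row.items() if key in declared}
```

For example, a row with the keys text, entities, and relations would come back with only text and entities.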

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.30%. Comparing base (e1db8f3) to head (4719a07).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   92.19%   92.30%   +0.11%     
==========================================
  Files          10       10              
  Lines         897      910      +13     
==========================================
+ Hits          827      840      +13     
  Misses         70       70              
Flag Coverage Δ
92.30% <100.00%> (?)

@ArneBinder ArneBinder changed the title Implement concatenate_dataset_dicts Implement concatenate_dataset_dicts Sep 9, 2024
@ArneBinder ArneBinder added the enhancement New feature or request label Sep 9, 2024

RainbowRivey commented Sep 23, 2024

Added tests.
However, the current code uses the HF Datasets concatenate_datasets method, which requires all documents to have an identical feature structure. That might not be the case: our datasets have completely different metadata fields. I'm not sure at which layer this should be solved.

One way around this might be to prefix the metadata fields with the dataset name and create a new, empty metadata field (needed by concatenate_documents), but then all documents will have two metadata fields:

  • metadata: with dataset_name (created by the current concatenate method)
  • %dataset%_metadata: with dataset-specific metadata

This should work, since the HF method allows fields to be empty.
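The prefixing idea above could look roughly like this on a single document. It is a sketch with plain dicts, not code from this PR; `prefix_metadata` is a hypothetical helper, and whether HF Datasets accepts the resulting schemas when concatenating is not verified here.

```python
from typing import Any, Dict

# Hypothetical stand-in for a document: a dict with an optional "metadata" dict.
Doc = Dict[str, Any]


def prefix_metadata(doc: Doc, dataset_name: str) -> Doc:
    """Move dataset-specific metadata under a prefixed key (illustration only)."""
    out = {key: value for key, value in doc.items() if key != "metadata"}
    # The original metadata moves to "<dataset_name>_metadata".
    out[f"{dataset_name}_metadata"] = doc.get("metadata", {})
    # The shared metadata field keeps only the dataset name.
    out["metadata"] = {"dataset_name": dataset_name}
    return out
```

Each document ends up with the two metadata fields listed above: a uniform metadata field and one dataset-specific field.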

Another way is not to use the HF implementation and to write our own.

Or we can leave it as is and preprocess the datasets before merging.

@RainbowRivey

For now I'm just removing metadata from documents when merging. It seems that the HF Datasets concatenate_datasets method compares features too deeply, and I see no better way around it.

@ArneBinder ArneBinder merged commit 34fff5d into main Oct 1, 2024
4 checks passed
@ArneBinder ArneBinder deleted the concatenate_dataset_dicts branch October 1, 2024 09:36