Implement `concatenate_dataset_dicts` #153

RainbowRivey · 2024-09-09T11:11:55Z

Implements concatenate_dataset_dicts() which allows merging multiple datasets into a single one.

concatenate_datasets() and concatenate_dataset_dicts() now take an additional option, clear_metadata, that clears the metadata fields when merging. If True, only the dataset_name field stays in metadata. dataset_name is NOT overwritten by subsequent concatenations, so it always contains the original dataset name.
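To illustrate the intended clear_metadata semantics, here is a minimal sketch on plain dicts. It is not the code from this PR: the helper name concat_with_metadata_option, the (name, documents) input shape, and the behavior for clear_metadata=False (keep the existing fields and only add dataset_name) are assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Tuple

Doc = Dict[str, Any]


def concat_with_metadata_option(
    named_datasets: Iterable[Tuple[str, List[Doc]]],
    clear_metadata: bool,
) -> List[Doc]:
    """Hypothetical stand-in for the merge step, operating on plain dicts."""
    merged: List[Doc] = []
    for dataset_name, docs in named_datasets:
        for doc in docs:
            old_meta = dict(doc.get("metadata") or {})
            # Start from an empty dict when clearing, else keep existing fields
            # (the non-clearing behavior is an assumption of this sketch).
            new_meta = {} if clear_metadata else dict(old_meta)
            # Never overwrite a dataset_name set by an earlier concatenation.
            new_meta["dataset_name"] = old_meta.get("dataset_name", dataset_name)
            merged.append({**doc, "metadata": new_meta})
    return merged


# First merge: two toy datasets with different metadata fields.
first = concat_with_metadata_option(
    [
        ("a", [{"text": "x", "metadata": {"lang": "en"}}]),
        ("b", [{"text": "y", "metadata": {"source": "web"}}]),
    ],
    clear_metadata=True,
)
# Second merge: the original dataset names survive.
second = concat_with_metadata_option([("merged", first)], clear_metadata=True)
# Without clearing, the existing fields are kept next to dataset_name.
kept = concat_with_metadata_option(
    [("a", [{"text": "x", "metadata": {"lang": "en"}}])],
    clear_metadata=False,
)
```

After the second merge, every document still carries the name of the dataset it originally came from, which is the property described above.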

Other changes:

Fixes the behavior of Dataset.map() and IterableDataset.map() when the function parameter is None.
The dataset_to_document_type() converter now removes all features that are not declared by the target document type.
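Both changes can be pictured with a small sketch on plain dicts and a dataclass. This is not the implementation from this PR: map_rows, keep_declared_features and TextOnlyDocument are hypothetical names, and treating function=None as the identity is an assumption based on the documented default of HF Datasets' map().

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Dict, List, Optional, Type

Row = Dict[str, Any]


def map_rows(
    rows: List[Row], function: Optional[Callable[[Row], Row]] = None
) -> List[Row]:
    """Toy map(): with function=None every row passes through unchanged."""
    if function is None:
        return [dict(row) for row in rows]
    return [function(row) for row in rows]


@dataclass
class TextOnlyDocument:  # hypothetical target document type
    text: str
    id: Optional[str] = None


def keep_declared_features(row: Row, target_type: Type[Any]) -> Row:
    """Drop every key that the target dataclass does not declare."""
    declared = {f.name for f in fields(target_type)}
    return {key: value for key, value in row.items() if key in declared}


rows = [{"text": "hi", "id": "1", "label": "x"}]
unchanged = map_rows(rows)  # no function given
converted = [keep_declared_features(row, TextOnlyDocument) for row in unchanged]
```

Here the undeclared label column is dropped during conversion, while text and id are kept.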

codecov · 2024-09-09T11:14:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.30%. Comparing base (e1db8f3) to head (4719a07).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   92.19%   92.30%   +0.11%     
==========================================
  Files          10       10              
  Lines         897      910      +13     
==========================================
+ Hits          827      840      +13     
  Misses         70       70

Flag	Coverage Δ
	`92.30% <100.00%> (?)`


RainbowRivey · 2024-09-23T10:04:16Z

Added tests.
However, the current code uses the HF Datasets concatenate_datasets method, which requires all documents to have an identical features structure. That might not be the case: our datasets have completely different metadata fields. I'm not sure at which layer this should be solved.

One way around this might be to prefix the metadata fields with the dataset name and create a new, empty metadata field (needed by concatenate_documents), but then all documents will have two metadata fields:

metadata: with dataset_name (created by current concatenate method)
%dataset%_metadata with dataset specific metadata

This should work, since the HF method allows fields to be empty.

Another way is to not use the HF implementation and write our own.

Or we can leave it as is and preprocess the datasets before merging.
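The first option (prefixing) could look roughly like the sketch below. It only models the idea on plain dicts and does not call HF Datasets; prefix_metadata and align_columns are hypothetical helpers, and filling missing columns with None stands in for the "empty fields" mentioned above.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]


def prefix_metadata(docs: List[Doc], dataset_name: str) -> List[Doc]:
    """Move dataset-specific metadata to '<dataset_name>_metadata'."""
    prefixed = []
    for doc in docs:
        rest = {k: v for k, v in doc.items() if k != "metadata"}
        # Shared field with an identical structure in every dataset.
        rest["metadata"] = {"dataset_name": dataset_name}
        # Dataset-specific field, kept under a prefixed column name.
        rest[f"{dataset_name}_metadata"] = doc.get("metadata")
        prefixed.append(rest)
    return prefixed


def align_columns(datasets: List[List[Doc]]) -> List[Doc]:
    """Give every document the union of all columns, filling gaps with None."""
    columns: List[str] = []
    for docs in datasets:
        for doc in docs:
            for key in doc:
                if key not in columns:
                    columns.append(key)
    return [{c: doc.get(c) for c in columns} for docs in datasets for doc in docs]


a = [{"text": "x", "metadata": {"lang": "en"}}]
b = [{"text": "y", "metadata": {"source": "web"}}]
merged = align_columns([prefix_metadata(a, "a"), prefix_metadata(b, "b")])
```

Every merged document then has the same set of columns, at the cost of one extra, mostly empty metadata column per source dataset.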

RainbowRivey · 2024-09-30T00:33:35Z

For now I'm just removing the metadata from documents when merging. It seems that the HF Datasets concatenate_datasets method compares features too deeply, and I see no better way around it.

implement concatenate_dataset_dicts

38fb841

ArneBinder changed the title ~~Implement concatenate_dataset_dicts~~ Implement concatenate_dataset_dicts Sep 9, 2024

ArneBinder added the enhancement New feature or request label Sep 9, 2024

ArneBinder assigned RainbowRivey Sep 9, 2024

add tests

0c346ba

wipe metadata from docs in concatenate_datasets + add metadata to t…

04a0418

…est datasets

RainbowRivey added 4 commits September 30, 2024 11:28

add feature check in test_to_document_type_function

c5ca512

Fix map() when no function used at all.

0f6ed10

remove features not declared in the target document type

f09940c

add parameter clean_metadata to concatenate_datasets and `concate…

4719a07

…nate_dataset_dicts`

ArneBinder merged commit 34fff5d into main Oct 1, 2024
4 checks passed

ArneBinder deleted the concatenate_dataset_dicts branch October 1, 2024 09:36
