Releases: mlcommons/croissant
v1.0.12
What's Changed
- Notebooks: Load right split names for fashionmnist by @ccl-core in #773
- Don't handle untested and bugged case for `excludes`. by @marcenacp in #771
- Handle non-capturing groups in regex transforms (`partial-train/*.parquet`). by @marcenacp in #774
- Drop all useless operations when we filter on a field - so we know its value in advance. by @marcenacp in #775
- Properly handle python variable. by @marcenacp in #777
- Fix errors with nested subfields by @ccl-core in #776
- Early return for num_shards==0 in the Beam pipeline. by @marcenacp in #778
- Clean code by checking attribute. by @marcenacp in #779
- Simplify `ReadFromCroissant` by removing the pipeline argument and making it a PCollection. by @marcenacp in #780
- Create new version mlcroissant==1.0.12 with the new ReadFromCroissant. by @marcenacp in #781
Full Changelog: v1.0.11...v1.0.12
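To illustrate the non-capturing-group case from #774, here is a minimal sketch using only Python's standard `re` module. The pattern, the `extract_split` helper, and the file paths are hypothetical and are not taken from the mlcroissant code base. The point is that an optional prefix written as `(?:...)` does not shift the index of the capturing group that holds the split name.

```python
import re
from typing import Optional

# Hypothetical pattern: "(?:partial-)?" is a non-capturing group, so the
# split name is always available as group 1, with or without the prefix.
SPLIT_PATTERN = re.compile(r"(?:partial-)?(train|test|validation)/.+\.parquet$")


def extract_split(path: str) -> Optional[str]:
    """Return the split name found in `path`, or None if nothing matches."""
    match = SPLIT_PATTERN.search(path)
    return match.group(1) if match else None
```

With this pattern, `partial-train/0000.parquet` and `train/0000.parquet` both yield `train`.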
v1.0.11
What's Changed
- Make field more robust with None/nan repeated input by @ccl-core in #757
- Figure update dataverse by @luisoala in #760
- add Dataverse to list of integrations by @pdurbin in #758
- Fix bug with repeated fields. by @ccl-core in #763
- Fix bug with jsonpath_rw and numpy arrays by @ccl-core in #764
- Include a prefix to the beam pipeline's stages by @ccl-core in #767
- Update pyproject.toml by @ccl-core in #769
Full Changelog: v1.0.10...v1.0.11
v1.0.10
What's Changed
- Update README.md by @ccl-core in #749
- Example of a dataset with nested fields. by @ccl-core in #745
- Add the web-of-science dataset (from parquet) by @ccl-core in #752
- Remove editor tests by @ccl-core in #753
- BoundingBox feature defaults to crs 1.0 by @ccl-core in #755
- Release 1.0.10 by @ccl-core in #756
Full Changelog: v1.0.9...v1.0.10
v1.0.9
What's Changed
- Isolate a `.call()` method in operations. by @marcenacp in #736
- Keys in a RecordSet should be a list of ids references. by @ccl-core in #740
- Cache the result of each operation. by @marcenacp in #741
- Allow datasets with joins when generating with Apache Beam. by @marcenacp in #743
- Fix discrepancies with the specs by @ccl-core in #742
- Use ids to reference a field or a node. by @ccl-core in #744
- Check that the mapping is valid after setting it. by @marcenacp in #747
- New release mlcroissant==1.0.9 by @ccl-core in #748
Full Changelog: v1.0.8...v1.0.9
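The combination of an isolated `.call()` method (#736) and per-operation caching (#741) can be pictured with the sketch below. It is only an illustration under stated assumptions: `CachedOperation` and its attributes are hypothetical names, not the mlcroissant classes, and the cache key assumes hashable arguments.

```python
class CachedOperation:
    """Hypothetical operation whose results are memoized per argument tuple."""

    def __init__(self, name, func):
        self.name = name
        self._func = func
        self._cache = {}
        self.calls = 0  # Counts how often the real work was executed.

    def call(self, *args):
        # The actual computation lives in one isolated method.
        self.calls += 1
        return self._func(*args)

    def __call__(self, *args):
        # Reuse the stored result when the same arguments are seen again.
        if args not in self._cache:
            self._cache[args] = self.call(*args)
        return self._cache[args]
```

Keeping the computation in `call()` means the caching wrapper stays the same for every operation.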
v1.0.8
What's Changed
- Adding the levanti dataset. by @ccl-core in #727
- Make nodes and operations picklable. by @marcenacp in #729
- Add splits to the huggingface-mnist dataset by @ccl-core in #726
- Allow to parallelize operations in mlcroissant with Apache Beam. by @marcenacp in #730
- More features around Beam. by @marcenacp in #731
- Remove `pipeline` argument from ReadFromCroissant and use `beam.ptransform_fn`. by @marcenacp in #734
- New release mlcroissant==1.0.8. by @marcenacp in #735
Full Changelog: v1.0.7...v1.0.8
v1.0.7
What's Changed
- Add URLs to pyproject.toml by @PGijsbers in #705
- Implement filtering in the case of filename regular expression and add a test for this feature. by @marcenacp in #716
- Fix broken Unit tests. by @ccl-core in #717
- Add more info links on how to do releases. by @ccl-core in #718
- Apply filters to a Hugging Face dataset to avoid repeating all variants. by @marcenacp in #719
- Move filters from Dataset init to `self.records` by @ccl-core in #720
- Release 1.0.7 by @ccl-core in #721
Full Changelog: v1.0.6...v1.0.7
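As a rough picture of the filename-based filtering mentioned in #716, the sketch below keeps only the files whose regex-extracted value equals the requested one, so the other files never need to be read. The `filter_files` helper, the pattern, and the paths are hypothetical and do not reflect the actual mlcroissant implementation.

```python
import re
from typing import List


def filter_files(paths: List[str], pattern: str, wanted: str) -> List[str]:
    """Keep paths whose first regex group equals `wanted`."""
    regex = re.compile(pattern)
    kept = []
    for path in paths:
        match = regex.search(path)
        # The value is known from the filename alone, so non-matching files
        # can be skipped before any content is loaded.
        if match and match.group(1) == wanted:
            kept.append(path)
    return kept
```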
v1.0.6
What's Changed
- git lfs download fileObject and read gzipped files by @ccl-core in #636
- update readme code example to new hf and croissant api by @luisoala in #642
- Add a dataset with a repeated field by @ccl-core in #644
- Updates to the Croissant turtle definition to align with the spec, and… by @benjelloun in #634
- Remove flores notebook from the automatically checked notebook. by @marcenacp in #652
- Use regex-based version casting that accept by @ccl-core in #658
- update readme with paper proceedings info by @luisoala in #665
- Editor RAI tab by @JoanGi in #578
- Fix typo in schema:Enumeration name by @benjelloun in #669
- Small fixes to the Croissant specification by @benjelloun in #666
- Add four record sets to anthropic hh hlhf by @ccl-core in #670
- Fix end-to-end tests by @marcenacp in #672
- Rerun Croissant Health reports for Hugging Face and OpenML by @marcenacp in #660
- Fix small bugs for splits. by @ccl-core in #680
- camera-ready pdf link by @luisoala in #701
- Introduce mlc.DataType.SPLIT for consistency. by @ccl-core in #709
- Add example output of a dataset with splits. by @ccl-core in #710
- Change DataType.SPLIT to use croissant 1.0 specs by @ccl-core in #712
- When creating an `mlc.Metadata` object, share the graph with all nodes. by @marcenacp in #713
- DataTypes should be URIRef by @ccl-core in #714
- Publish mlcroissant==1.0.6. by @ccl-core in #715
Full Changelog: v1.0.5...v1.0.6