Releases: sdv-dev/RDT

v1.4.2 - 2023-05-02

02 May 18:03

This release fixes a bug that caused datetime and numerical transformers to crash if a column was all NaNs. Additionally, it adds support for Pandas 2.0!
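As a rough illustration of the all-NaN failure mode, the sketch below computes a fill value for a numerical column and falls back to a default when no non-null value exists. The function name `safe_fill_value` and the fallback of 0.0 are assumptions made for this example; this is not RDT's actual code.

```python
import math


def safe_fill_value(values, default=0.0):
    """Return the mean of the non-NaN values, or `default` if there are none."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        # An all-null column has no statistics to compute, so return the
        # fallback instead of dividing by zero.
        return default
    return sum(present) / len(present)


all_null = [float("nan")] * 3
print(safe_fill_value(all_null))                    # 0.0
print(safe_fill_value([1.0, float("nan"), 3.0]))    # 2.0
```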

Bugs

  • Numerical & datetime transformers crash if the entire column is null - Issue #637 by @frances-h

Maintenance

v1.4.1 - 2023-04-25

26 Apr 21:07

This release patches an issue that prevented the RegexGenerator from working with regexes that had a very large number of possible combinations.
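One general way to cope with a very large number of combinations is to generate values lazily instead of building them all up front. The sketch below shows that idea with the standard library only; `lazy_values` is a hypothetical helper and does not reflect how the RegexGenerator is implemented.

```python
import string
from itertools import islice, product


def lazy_values(alphabet, length):
    """Yield every string of `length` characters from `alphabet`, one at a time."""
    for chars in product(alphabet, repeat=length):
        yield "".join(chars)


# 26**10 (roughly 1.4e14) combinations exist, but only five are ever built.
first_five = list(islice(lazy_values(string.ascii_lowercase, 10), 5))
print(first_five)
```

Because the generator is consumed on demand, memory use stays flat no matter how many combinations the pattern allows.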

Bugs

  • RegexGenerator continues to have problems if there are too many possibilities - Issue #635 by @pvk-developer

v1.4.0 - 2023-04-13

13 Apr 17:44

This release adds a couple of new features, including the OrderedLabelEncoder, which replaces the now-deprecated CustomLabelEncoder. It also makes every generator-type transformer in the HyperTransformer use a different random seed.
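To illustrate what "a different random seed per transformer" can look like, the sketch below derives a stable seed from a name and gives each generator its own random stream. The hashing scheme and the `seed_for` helper are assumptions for illustration; the release notes do not describe how RDT picks its seeds.

```python
import hashlib
import random


def seed_for(name):
    """Derive a stable 32-bit seed from a name (hypothetical scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 32)


# Each generator gets its own seed, so their random streams are independent
# of each other but still reproducible from run to run.
generators = {
    name: random.Random(seed_for(name))
    for name in ("AnonymizedFaker", "RegexGenerator")
}
```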

Additionally, bugs were patched in the RegexGenerator that caused it to crash or take too long in certain cases. Finally, this release improved the detection of Faker functions in the AnonymizedFaker.

Bugs

  • Find nested Faker provider submodules - PR #630 by @frances-h
  • RegexGenerator fails to generate values if there are too many possibilities - Issue #623 by @R-Palazzo
  • RegexGenerator takes too much time and runs out of memory if there are too many possibilities - Issue #624 by @R-Palazzo

New Features

  • Choose a different seed for each transformer - Issue #619 by @fealho
  • Rename CustomLabelEncoder to OrderedLabelEncoder - Issue #621 by @R-Palazzo
  • Add functionality to find version add-on - Issue #620 by @frances-h

v1.3.0 - 2023-1-18

18 Jan 20:55

This release makes changes to the way that individual transformers are stored in the HyperTransformer. When accessing the config via HyperTransformer.get_config(), the transformers listed in the config are now the actual transformer instances used during fitting and transforming. These instances can now be accessed and used to examine their properties post fitting. For example, you can now view the mapping for a PseudoAnonymizedFaker instance using PseudoAnonymizedFaker.get_mapping() on the instance retrieved from the config.

Additionally, the output of reverse_transform no longer appends the .value suffix to every unnamed output column. Only output columns that are created from context extracted from the input columns will have suffixes (e.g. .normalized in the ClusterBasedNormalizer).

The AnonymizedFaker and RegexGenerator now have an enforce_uniqueness parameter, which controls whether the data returned by reverse_transform should be unique. The HyperTransformer now has a method called create_anonymized_columns that can be used to generate columns that are matched with anonymizing transformers like AnonymizedFaker and RegexGenerator. The method can be used as follows:
HyperTransformer.create_anonymized_columns(num_rows=5, column_names=['email_optin', 'credit_card'])

Another major change in this release is the ability to control randomization. Every time a HyperTransformer is initialized, its randomness will be reset to the same seed, and it will yield the same results for reverse_transform if given the same input. Every subsequent call to reverse_transform yields a different result. If a user desires to reset the seed, they can call HyperTransformer.reset_randomization.
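The seeding behavior described above can be mimicked with a toy class built on the standard library. `SeededGenerator` and its seed value of 42 are hypothetical stand-ins; only the method name `reset_randomization` comes from the release notes.

```python
import random


class SeededGenerator:
    """Toy stand-in for an object that generates random values."""

    SEED = 42  # hypothetical fixed seed

    def __init__(self):
        # Every new instance starts from the same seed.
        self._rng = random.Random(self.SEED)

    def generate(self, n):
        return [self._rng.randint(0, 999) for _ in range(n)]

    def reset_randomization(self):
        # Rewind the random stream to its initial state.
        self._rng.seed(self.SEED)


a, b = SeededGenerator(), SeededGenerator()
first = a.generate(3)
same_as_fresh_instance = first == b.generate(3)   # fresh instances agree
second = a.generate(3)                            # continues the random stream
a.reset_randomization()
replayed = a.generate(3)                          # replays the first draw
```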

Finally, this release adds support for Python 3.10 and drops support for 3.6.

Bugs

Maintenance

New Features

v1.2.1 - 2022-9-12

12 Sep 17:42

This release fixes a bug that caused the UnixTimestampEncoder to return data with the incorrect datetime format. It also fixes a bug where the null column was not reverse transformed by the UnixTimestampEncoder when missing_value_replacement was not set.

Bugs

v1.2.0 - 2022-8-17

18 Aug 00:04

This release adds a new transformer called the PseudoAnonymizedFaker. This transformer enables the pseudo-anonymization of your data by mapping all of a column's original values to fake values that get returned during the reverse transformation process. Each original value is always mapped to the same fake value.
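The core idea of pseudo-anonymization, a stable one-to-one mapping from original values to fake ones, can be sketched in a few lines. The `PseudoAnonymizer` class below is a hypothetical illustration that uses numbered placeholders; the real transformer generates its fake values differently.

```python
import itertools


class PseudoAnonymizer:
    """Map each original value to one fake value, always the same one."""

    def __init__(self):
        self._mapping = {}
        self._counter = itertools.count(1)

    def anonymize(self, value):
        if value not in self._mapping:
            # First time this value is seen: assign the next placeholder.
            self._mapping[value] = f"user_{next(self._counter):04d}"
        return self._mapping[value]

    def get_mapping(self):
        return dict(self._mapping)


anonymizer = PseudoAnonymizer()
fake = [anonymizer.anonymize(name) for name in ["alice", "bob", "alice"]]
print(fake)  # ['user_0001', 'user_0002', 'user_0001']
```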

Additionally, this release enables the HyperTransformer to use categorical transformers on boolean columns. It also introduces a new parameter called computer_representation to the FloatFormatter that will allow for values to be clipped to certain bounds based on the computer type used for a numerical column.
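Clipping to the bounds of a computer representation amounts to clamping each value to the range that type can hold. The sketch below shows this for a few integer types; the representation names in `INTEGER_BOUNDS` and the helper `clip_to_representation` are assumptions for illustration and may not match the options the FloatFormatter accepts.

```python
# Inclusive (min, max) bounds for a few fixed-width integer types.
INTEGER_BOUNDS = {
    "Int8": (-2 ** 7, 2 ** 7 - 1),
    "UInt8": (0, 2 ** 8 - 1),
    "Int16": (-2 ** 15, 2 ** 15 - 1),
}


def clip_to_representation(values, representation):
    """Clamp every value into the range of the given representation."""
    low, high = INTEGER_BOUNDS[representation]
    return [min(max(v, low), high) for v in values]


print(clip_to_representation([-300, 5, 300], "Int8"))   # [-128, 5, 127]
print(clip_to_representation([-300, 5, 300], "UInt8"))  # [0, 5, 255]
```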

Finally, this release patches a bug that caused unpredictable results from the reverse_transform method of the FrequencyEncoder when add_noise is enabled.

New Features

  • Add PseudoAnonymizedFaker transformer - Issue #517 by @pvk-developer
  • Boolean columns should be able to use any of the categorical transformers - Issue #527 by @pvk-developer
  • Update FloatFormatter with parameters for the computer representation - Issue #521 by @fealho

Bugs

  • Unpredictable results for FrequencyEncoder(add_noise=True) - Issue #528 by @fealho

Internal

v1.1.0 - 2022-6-9

09 Jun 20:39

This release adds multiple new transformers: the CustomLabelEncoder and the RegexGenerator. The CustomLabelEncoder works similarly to the LabelEncoder, except it allows users to provide the order of the categories. The RegexGenerator allows users to specify a regex pattern and will generate values that match that pattern.
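Label encoding with a user-supplied category order boils down to numbering the categories in the order given. The sketch below is a minimal illustration with a hypothetical helper, `fit_ordered_labels`; it is not the CustomLabelEncoder's code.

```python
def fit_ordered_labels(order):
    """Assign integer labels following the order the user provides."""
    return {category: index for index, category in enumerate(order)}


mapping = fit_ordered_labels(["low", "medium", "high"])
encoded = [mapping[value] for value in ["high", "low", "medium"]]
print(encoded)  # [2, 0, 1]
```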

This release also improves current transformers. The LabelEncoder now has a parameter called order_by that allows users to specify the ordering scheme for their data (e.g. order numerically or alphabetically). The LabelEncoder also now has a parameter called add_noise that allows users to specify whether or not uniform noise should be added to the transformed data. Performance enhancements were made for the GaussianNormalizer by removing an unnecessary distribution search, and the FloatFormatter will no longer round values to any place higher than the ones place by default.
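One common way to add uniform noise to an integer label while keeping it recoverable is to draw from the interval starting at the label and spanning one unit, then take the floor when reversing. The sketch below assumes that scheme; the helper names are hypothetical, the release notes do not spell out the exact interval, and floating-point rounding at the very top of the interval is ignored here.

```python
import math
import random


def encode_with_noise(label, rng):
    """Return a float in [label, label + 1) so the label stays recoverable."""
    return label + rng.random()


def decode(value):
    """Recover the integer label by dropping the noise."""
    return math.floor(value)


rng = random.Random(0)
noisy = [encode_with_noise(label, rng) for label in [0, 1, 2, 1]]
recovered = [decode(value) for value in noisy]
print(recovered)  # [0, 1, 2, 1]
```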

New Features

  • Add noise parameter to LabelEncoder - Issue #500 by @fealho
  • Remove parameters related to distribution search and change default for GaussianNormalizer - Issue #499 by @amontanez24
  • Add order_by parameter to LabelEncoder - Issue #510 by @amontanez24
  • Only round to decimal places in FloatFormatter - Issue #508 by @fealho
  • Add CustomLabelEncoder transformer - Issue #507 by @amontanez24
  • Add RegexGenerator Transformer - Issue #505 by @pvk-developer

v1.0.0 - 2022-5-5

05 May 21:52

The main update of this release is the introduction of a config, which describes the sdtypes and transformers that will be used by the HyperTransformer for each column of the data, where sdtype stands for the semantic or statistical meaning of a datatype. The user can interact with this config through the newly created methods update_sdtypes, get_config, set_config, update_transformers, update_transformers_by_sdtype and remove_transformer_by_sdtype.
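A config of this kind can be pictured as two mappings keyed by column name, one for sdtypes and one for transformers. The sketch below is a simplified stand-in: the column names are invented, transformers are written as plain strings rather than transformer objects, and `validate_config` is a hypothetical helper, not part of the HyperTransformer.

```python
config = {
    "sdtypes": {
        "age": "numerical",
        "is_active": "boolean",
        "email": "pii",
    },
    "transformers": {
        "age": "FloatFormatter",
        "is_active": "BinaryEncoder",
        "email": "AnonymizedFaker",
    },
}


def validate_config(config):
    """Check that both halves of the config describe the same columns."""
    sdtype_columns = set(config["sdtypes"])
    transformer_columns = set(config["transformers"])
    if sdtype_columns != transformer_columns:
        mismatched = sorted(sdtype_columns ^ transformer_columns)
        raise ValueError(f"Columns missing from one side of the config: {mismatched}")
    return True


print(validate_config(config))  # True
```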

This release also included various new features and updates, including:

  • Users can now transform subsets of the data using dedicated methods, transform_subset and reverse_transform_subset.
  • User validation was added for the following methods: transform, reverse_transform, update_sdtypes, update_transformers, set_config.
  • Unnecessary warnings were removed from GaussianNormalizer.fit and FrequencyEncoder.transform.
  • The user can now set a transformer to None.
  • Transformers that cannot work with missing values will automatically fill them in.
  • Added support for additional datetime formats.
  • Setting model_missing_values = False in a transformer was updated to keep track of the percentage of missing values, instead of producing data containing NaNs.
  • All parameters were removed from the HyperTransformer.
  • The demo dataset get_demo was improved to be more intuitive.

Finally, a number of transformers were redesigned to be more user friendly. Among them, the following transformers have also been renamed:

  • BayesGMMTransformer -> ClusterBasedNormalizer
  • GaussianCopulaTransformer -> GaussianNormalizer
  • DateTimeRoundedTransformer -> OptimizedTimestampEncoder
  • DateTimeTransformer -> UnixTimestampEncoder
  • NumericalTransformer -> FloatFormatter
  • LabelEncodingTransformer -> LabelEncoder
  • OneHotEncodingTransformer -> OneHotEncoder
  • CategoricalTransformer -> FrequencyEncoder
  • BooleanTransformer -> BinaryEncoder
  • PIIAnonymizer -> AnonymizedFaker

New Features

  • Fix using None as transformer when update_transformers_by_sdtype - Issue #496 by @pvk-developer
  • Rename PIIAnonymizer --> AnonymizedFaker - Issue #483 by @pvk-developer
  • User validation for reverse_transform - Issue #480 by @amontanez24
  • User validation for transform - Issue #479 by @fealho
  • User validation for set_config - Issue #478 by @fealho
  • User validation for update_transformers_by_sdtype - Issue #477 by @amontanez24
  • User validation for update_transformers - Issue #475 by @fealho
  • User validation for update_sdtypes - Issue #474 by @fealho
  • Allow columns to not have a transformer - Issue #473 by @pvk-developer
  • Create methods to transform a subset of the data (& reverse transform it) - Issue #472 by @amontanez24
  • Throw a warning if you use set_config on a HyperTransformer that's already fit - Issue #466 by @amontanez24
  • Update README for RDT 1.0 - Issue #454 by @amontanez24
  • Issue with printing PIIAnonymizer in HyperTransformer - Issue #452 by @pvk-developer
  • Pretty print get_config - Issue #450 by @pvk-developer
  • Silence warning for GaussianNormalizer.fit - Issue #443 by @pvk-developer
  • Transformers that cannot work with missing values should automatically fill them in - Issue #442 by @amontanez24
  • More descriptive error message in PIIAnonymizer when provider_name and function_name don't align - Issue #440 by @pvk-developer
  • Can we support additional datetime formats? - Issue #439 by @pvk-developer
  • Update FrequencyEncoder.transform so that pandas won't throw a warning - Issue #436 by @pvk-developer
  • Update functionality when model_missing_values=False - Issue #435 by @amontanez24
  • Create methods for getting and setting a config - Issue #418 by @amontanez24
  • Input validation & error handling in HyperTransformer - Issue #408 by @fealho and @amontanez24
  • Remove unneeded params from HyperTransformer - Issue #407 by @pvk-developer
  • Rename property: _valid_output_sdtypes - Issue #406 by @amontanez24
  • Add pii as a new sdtype in HyperTransformer - Issue #404 by @pvk-developer
  • Update transformers by data type (in HyperTransformer) - Issue #403 by @pvk-developer
  • Update transformers by column name in HyperTransformer - Issue #402 by @pvk-developer
  • Improve updating field_data_types in HyperTransformer - Issue #400 by @amontanez24
  • Create method to auto detect HyperTransformer config from data - Issue #399 by @fealho
  • Update HyperTransformer default transformers - Issue #398 by @fealho
  • Add PIIAnonymizer - Issue #397 by @pvk-developer
  • Improve the way we print an individual transformer - Issue #395 by @amontanez24
  • Rename columns parameter in fit for each individual transformer - Issue #376 by @fealho and @pvk-developer
  • Create a more descriptive demo dataset - Issue #374 by @fealho
  • Delete unnecessary transformers - Issue #373 by @fealho
  • Update NullTransformer to make it user friendly - Issue #372 by @pvk-developer
  • Update BayesGMMTransformer to make it user friendly - Issue #371 by @amontanez24
  • Update GaussianCopulaTransformer to make it user friendly - Issue #370 by @amontanez24
  • Update DateTimeRoundedTransformer to make it user friendly - Issue #369 by @amontanez24
  • Update DateTimeTransformer to make it user friendly - Issue #368 by @amontanez24
  • Update NumericalTransformer to make it user friendly - Issue #367 by @amontanez24
  • Update LabelEncodingTransformer to make it user friendly - Issue #366 by @fealho
  • Update OneHotEncodingTransformer to make it user friendly - Issue #365 by @fealho
  • Update CategoricalTransformer to make it user friendly - Issue #364 by @fealho
  • Update BooleanTransformer to make it user friendly - Issue #363 by @fealho
  • Update names & functionality for handling missing values - Issue #362 by @pvk-developer

Bugs

  • Checking keys of config as set - Issue #497 by @amontanez24
  • Only update transformer used when necessary for update_sdtypes - Issue #469 by @amontanez24
  • Fix how get_config prints transformers - Issue #468 by @pvk-developer
  • NullTransformer reverse_transform alters input data due to not copying - Issue #455 by @amontanez24
  • Attempting to transform a subset of the data should lead to an Error - Issue #451 by @amontanez24
  • Detect_initial_config isn't detecting sdtype "numerical" - Issue #449 by @pvk-developer
  • PIIAnonymizer not generating multiple locales - Issue #447 by @pvk-developer
  • Error when printing ClusterBasedNormalizer and GaussianNormalizer - Issue #441 by @pvk-developer
  • Datetime reverse transform crashes if datetime_format is specified - Issue #438 by @amontanez24
  • Correct datetime format is not recovered on reverse_transform - Issue #437 by @pvk-developer
  • Use numpy NaN values in BinaryEncoder - Issue #434 by @pvk-developer
  • Duplicate _output_columns during fitting - Issue #423 by @fealho

Internal Improvements

  • Making methods that aren't part of API private - Issue #489 by @amontanez24
  • Fix columns missing in config and update transformers to None - Issue #495 by @pvk-developer

v0.6.4 - 2022-3-7

07 Mar 20:46

This release fixes multiple bugs concerning the HyperTransformer. One is that the get_transformer_tree_yaml method no longer crashes on every call. Another is that calling the update_field_data_types and update_default_data_type_transformers after fitting no longer breaks the transform method.

The HyperTransformer now sorts its outputs for both transform and reverse_transform based on the order of the input's columns. It is also now possible to create transformers that simply drop columns during transform and don't return any new columns.
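Sorting outputs by the order of the input columns can be sketched as a stable sort keyed on the position of each output's source column. The helper `sort_outputs` and the explicit output-to-input mapping are assumptions for illustration; the HyperTransformer tracks this relationship internally in its own way.

```python
def sort_outputs(input_columns, output_to_input):
    """Order output columns by the position of the input column they came from."""
    position = {name: index for index, name in enumerate(input_columns)}
    # sorted() is stable, so outputs from the same input keep their relative order.
    return sorted(output_to_input, key=lambda out: position[output_to_input[out]])


outputs = {"a.value": "a", "b.value": "b", "b.is_null": "b"}
print(sort_outputs(["b", "a"], outputs))  # ['b.value', 'b.is_null', 'a.value']
```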

New Features

  • Support dropping a column through a transformer - Issue #393 by @pvk-developer
  • HyperTransformer should sort columns after transform and reverse_transform - Issue #405 by @fealho

Bugs

v0.6.3 - 2022-2-4

04 Feb 18:37

This release adds a new module to the RDT library called performance. This module can be used to evaluate the speed and peak memory usage of any transformer in RDT. This release also increases the maximum acceptable version of scikit-learn to make it more compatible with other libraries in the SDV ecosystem. On top of that, it fixes a bug related to a new version of pandas.
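Measuring speed and peak memory for a single call can be done with the standard library alone. The sketch below uses `time.perf_counter` and `tracemalloc`; the `profile` helper is hypothetical and says nothing about how RDT's performance module is built or what its API looks like.

```python
import time
import tracemalloc


def profile(func, *args):
    """Run `func(*args)` and return (result, elapsed seconds, peak bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


result, elapsed, peak = profile(sorted, [3, 1, 2])
print(result, f"{elapsed:.6f}s", f"{peak} bytes")
```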

New Features

Housekeeping

  • Increase scikit-learn dependency range - Issue #351 by @amontanez24
  • pandas 1.4.0 release causes a small error - Issue #358 by @fealho

Bugs

  • Performance tests get stuck on Unix if multiprocessing is involved - Issue #337 by @amontanez24