Releases: sdv-dev/RDT
v1.4.2 - 2023-05-02
This release fixes a bug that caused datetime and numerical transformers to crash if a column was all NaNs. Additionally, it adds support for Pandas 2.0!
Bugs
- Numerical & datetime transformers crash if the entire column is null - Issue #637 by @frances-h
Maintenance
- Remove upper bound for pandas - Issue #633 by @pvk-developer
v1.4.1 - 2023-04-25
This release patches an issue that prevented the RegexGenerator from working with regexes that had a very large number of possible combinations.
Bugs
- RegexGenerator continues to have problems if there are too many possibilities - Issue #635 by @pvk-developer
v1.4.0 - 2023-04-13
This release adds the OrderedLabelEncoder and deprecates the CustomLabelEncoder. It also makes each generator-type transformer in the HyperTransformer use a different random seed.
Additionally, bugs were patched in the RegexGenerator that caused it to crash or take too long in certain cases. Finally, this release improves the detection of Faker functions in the AnonymizedFaker.
Bugs
- Find nested Faker provider submodules - PR #630 by @frances-h
- RegexGenerator fails to generate values if there are too many possibilities - Issue #623 by @R-Palazzo
- RegexGenerator takes too much time and runs out of memory if there are too many possibilities - Issue #624 by @R-Palazzo
New Features
- Choose a different seed for each transformer - Issue #619 by @fealho
- Rename CustomLabelEncoder to OrderedLabelEncoder - Issue #621 by @R-Palazzo
- Add functionality to find version add-on - Issue #620 by @frances-h
v1.3.0 - 2023-1-18
This release makes changes to the way that individual transformers are stored in the HyperTransformer. When accessing the config via HyperTransformer.get_config(), the transformers listed in the config are now the actual transformer instances used during fitting and transforming. These instances can now be accessed and used to examine their properties post fitting. For example, you can now view the mapping for a PseudoAnonymizedFaker instance using PseudoAnonymizedFaker.get_mapping() on the instance retrieved from the config.
Additionally, the output of reverse_transform no longer appends the .value suffix to every unnamed output column. Only output columns that are created from context extracted from the input columns will have suffixes (e.g. .normalized in the ClusterBasedNormalizer).
The AnonymizedFaker and RegexGenerator now have an enforce_uniqueness parameter, which controls whether the data returned by reverse_transform should be unique. The HyperTransformer now has a method called create_anonymized_columns that can be used to generate columns that are matched with anonymizing transformers like AnonymizedFaker and RegexGenerator. The method can be used as follows:
HyperTransformer.create_anonymized_columns(num_rows=5, column_names=['email_optin', 'credit_card'])
Another major change in this release is the ability to control randomization. Every time a HyperTransformer is initialized, its randomness will be reset to the same seed, and it will yield the same results for reverse_transform if given the same input. Every subsequent call to reverse_transform yields a different result. If a user desires to reset the seed, they can call HyperTransformer.reset_randomization.
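The seeding behavior described above can be illustrated with a small standard-library sketch. This is not RDT code: the SeededGenerator class, its DEFAULT_SEED value and the sample method are hypothetical stand-ins used only to show the pattern (same seed on initialization, a different result on each subsequent call, and a reset back to the initial seed).

```python
import random


class SeededGenerator:
    """Toy stand-in (not RDT code) for an object with resettable randomness."""

    DEFAULT_SEED = 42  # hypothetical fixed seed chosen for this sketch

    def __init__(self):
        # Every new instance starts from the same seed.
        self._rng = random.Random(self.DEFAULT_SEED)

    def sample(self, n):
        """Draw n pseudo-random integers; each call advances the internal state."""
        return [self._rng.randint(0, 10**9) for _ in range(n)]

    def reset_randomization(self):
        """Return the generator to its initial seed."""
        self._rng = random.Random(self.DEFAULT_SEED)


first = SeededGenerator()
second = SeededGenerator()

run_1 = first.sample(5)
run_2 = first.sample(5)   # subsequent call on the same instance: state has advanced
fresh = second.sample(5)  # new instance: starts again from the same seed

first.reset_randomization()
after_reset = first.sample(5)  # matches run_1 because the seed was reset
```

In this sketch, run_1 and fresh are identical, run_2 differs from run_1, and after_reset reproduces run_1.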
Finally, this release adds support for Python 3.10 and drops support for 3.6.
Bugs
- The reset_randomization should also apply to fit and transform - Issue #608 by @amontanez24
- Cannot print CustomLabelEncoder: ValueError - Issue #607 by @amontanez24
- Float formatter learn_rounding_scheme doesn't work on all digits - Issue #556 by @fealho
- Warnings not showing on update_transformers_by_sdtype - Issue #582 by @amontanez24
- OneHotEncoder doesn't work with boolean sdtype - Issue #583 by @pvk-developer
- Setting config on HyperTransformer does not read supported_sdtypes - Issue #560 by @pvk-developer
- #545 - Issue #545 by @pvk-developer
- Add error to NullTransformer when data only contains nans - PR #567 by @fealho
- Update update_transformers validation - PR #563 by @fealho
Maintenance
- Support Python 3.10 - Issue #593 by @pvk-developer
- RDT 1.3 Package Maintenance Updates - Issue #594 by @pvk-developer
New Features
- Update errors - Issue #599 by @amontanez24
- Add ability to control randomness - Issue #584 by @amontanez24
- Printing and error improvements - Issue #581 by @amontanez24
- Make RegexGenerator not to reset itself - Issue #558 by @pvk-developer
- Add a reset_anonymization method - Issue #559 by @pvk-developer
- Don't copy instances of transformer - Issue #541 by @fealho
- Remove '.value' suffix - Issue #533 by @fealho
- Change the NEXT_TRANSFORMERS logic - Issue #557 by @fealho
- Add utility functions to AnonymizedFaker - Issue #561 by @pvk-developer
- Update API for update_transformers_by_sdtype to be more explicit about instances vs. copies - Issue #540 by @fealho
- Add create_anonymized_columns method to anonymize data from scratch - Issue #546 by @pvk-developer
- Add parameter to AnonymizedFaker() and RegexGenerator() to generate only unique values - Issue #542 by @pvk-developer
v1.2.1 - 2022-9-12
This release fixes a bug that caused the UnixTimestampEncoder to return data with the incorrect datetime format. It also fixes a bug that caused the null column not to be reverse transformed when using the UnixTimestampEncoder when the missing_value_replacement was not set.
Bugs
- Inconsistency in date format after reverse transform - Issue #515 by @pvk-developer
- Fix calling null_transformer with model_missing_values. - PR #550 by @pvk-developer
v1.2.0 - 2022-8-17
This release adds a new transformer called the PseudoAnonymizedFaker. This transformer enables the pseudo-anonymization of your data by mapping all of a column's original values to fake values that get returned during the reverse transformation process. Each original value is always mapped to the same fake value.
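The idea of a stable one-to-one mapping can be sketched with the standard library alone. This is a conceptual illustration, not the PseudoAnonymizedFaker implementation: the ConsistentPseudonymizer class, its method names and the "user_0001" style of fake value are all assumptions made for the example.

```python
import itertools


class ConsistentPseudonymizer:
    """Toy illustration (not RDT's implementation) of a stable value-to-fake mapping."""

    def __init__(self, prefix="user"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._mapping = {}

    def fit(self, values):
        """Assign one fake value to each distinct original value, in order of appearance."""
        for value in values:
            if value not in self._mapping:
                self._mapping[value] = f"{self._prefix}_{next(self._counter):04d}"
        return self

    def get_mapping(self):
        """Return a copy of the learned original-to-fake mapping."""
        return dict(self._mapping)

    def pseudonymize(self, values):
        """Replace each original value with its fake counterpart."""
        return [self._mapping[value] for value in values]


names = ["alice", "bob", "alice", "carol", "bob"]
pseudonymizer = ConsistentPseudonymizer().fit(names)
fake_names = pseudonymizer.pseudonymize(names)
```

Both occurrences of "alice" receive the same fake value, which is the property that distinguishes pseudo-anonymization from drawing a fresh random value for every row.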
Additionally, this release enables the HyperTransformer to use categorical transformers on boolean columns. It also introduces a new parameter called computer_representation to the FloatFormatter that will allow for values to be clipped to certain bounds based on the computer type used for a numerical column.
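As a rough illustration of clipping to the bounds of a computer type, the sketch below clamps values to the range of a fixed-width integer. It is not the FloatFormatter code: the INTEGER_BOUNDS table, its labels and the clip_to_representation helper are hypothetical names chosen for this example.

```python
# Value ranges of a few fixed-width integer types.
# The labels are illustrative assumptions for this sketch.
INTEGER_BOUNDS = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "UInt8": (0, 2**8 - 1),
    "UInt16": (0, 2**16 - 1),
}


def clip_to_representation(values, representation):
    """Clamp each value to the range that the given integer type can hold."""
    low, high = INTEGER_BOUNDS[representation]
    return [min(max(value, low), high) for value in values]


clipped = clip_to_representation([-300, -5, 120, 400], "Int8")
# clipped is [-128, -5, 120, 127]
```

Values already inside the range pass through unchanged; only out-of-range values are moved to the nearest bound.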
Finally, this release patches a bug that caused unpredictable results from the reverse_transform method of the FrequencyEncoder when add_noise is enabled.
New Features
- Add PseudoAnonymizedFaker transformer - Issue #517 by @pvk-developer
- Boolean columns should be able to use any of the categorical transformers - Issue #527 by @pvk-developer
- Update FloatFormatter with parameters for the computer representation - Issue #521 by @fealho
Bugs
Internal
- Performance Tests update - Issue #524 by @pvk-developer
v1.1.0 - 2022-6-9
This release adds two new transformers: the CustomLabelEncoder and the RegexGenerator. The CustomLabelEncoder works similarly to the LabelEncoder, except it allows users to provide the order of the categories. The RegexGenerator allows users to specify a regex pattern and will generate values that match that pattern.
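Label encoding with a user-provided category order can be sketched as follows. This is a minimal standard-library illustration rather than the CustomLabelEncoder itself; the OrderedLabelSketch class and its order argument are hypothetical names.

```python
class OrderedLabelSketch:
    """Toy label encoder (not RDT code) whose integer codes follow a user-given order."""

    def __init__(self, order):
        # The position of a category in `order` becomes its integer code.
        self._to_code = {category: code for code, category in enumerate(order)}
        self._to_category = dict(enumerate(order))

    def transform(self, values):
        """Map each category to its integer code."""
        return [self._to_code[value] for value in values]

    def reverse_transform(self, codes):
        """Map each integer code back to its category."""
        return [self._to_category[code] for code in codes]


encoder = OrderedLabelSketch(order=["low", "medium", "high"])
codes = encoder.transform(["high", "low", "medium", "low"])
restored = encoder.reverse_transform(codes)
# codes is [2, 0, 1, 0]
```

Because the codes follow the supplied order, numeric comparisons on the encoded column agree with the intended ranking of the categories.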
This release also improves current transformers. The LabelEncoder now has a parameter called order_by that allows users to specify the ordering scheme for their data (e.g. order numerically or alphabetically). The LabelEncoder also now has a parameter called add_noise that allows users to specify whether or not uniform noise should be added to the transformed data. Performance enhancements were made for the GaussianNormalizer by removing an unnecessary distribution search and the FloatFormatter will no longer round values to any place higher than the ones place by default.
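One common way to add uniform noise to integer labels while keeping them recoverable is to spread label i over the interval [i, i + 1) and take the floor on the way back. The sketch below shows that round trip under the assumption that this is the kind of noise meant here; add_uniform_noise and remove_noise are hypothetical helpers, not LabelEncoder methods.

```python
import math
import random


def add_uniform_noise(codes, rng):
    """Spread each integer code i over the interval [i, i + 1)."""
    return [code + rng.random() for code in codes]


def remove_noise(noisy_codes):
    """Recover the integer code by taking the floor of each noisy value."""
    return [math.floor(value) for value in noisy_codes]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
codes = [0, 2, 1, 2]
noisy = add_uniform_noise(codes, rng)
recovered = remove_noise(noisy)
```

The noisy column is continuous, which some downstream models handle better than a handful of repeated integers, while the floor step restores the original labels.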
New Features
- Add noise parameter to LabelEncoder - Issue #500 by @fealho
- Remove parameters related to distribution search and change default for GaussianNormalizer - Issue #499 by @amontanez24
- Add order_by parameter to LabelEncoder - Issue #510 by @amontanez24
- Only round to decimal places in FloatFormatter - Issue #508 by @fealho
- Add CustomLabelEncoder transformer - Issue #507 by @amontanez24
- Add RegexGenerator Transformer - Issue #505 by @pvk-developer
v1.0.0 - 2022-5-5
The main update of this release is the introduction of a config, which describes the sdtypes and transformers that will be used by the HyperTransformer for each column of the data, where sdtype stands for the semantic or statistical meaning of a datatype. The user can interact with this config through the newly created methods update_sdtypes, get_config, set_config, update_transformers, update_transformers_by_sdtype and remove_transformer_by_sdtype.
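To make the idea of a per-column config concrete, the sketch below models one as a plain dictionary with an sdtype and a transformer for each column. The column names are invented, the transformers are written as name strings for simplicity, and columns_missing_transformer is a hypothetical helper; the structure returned by get_config may differ in detail.

```python
# Illustrative config: one sdtype and one transformer per column.
# Column names are made up; transformer names are plain strings in this sketch.
config = {
    "sdtypes": {
        "age": "numerical",
        "signup_date": "datetime",
        "is_member": "boolean",
    },
    "transformers": {
        "age": "FloatFormatter",
        "signup_date": "UnixTimestampEncoder",
        "is_member": "BinaryEncoder",
    },
}


def columns_missing_transformer(config):
    """Return the columns that have an sdtype but no transformer entry."""
    return sorted(set(config["sdtypes"]) - set(config["transformers"]))


missing = columns_missing_transformer(config)
```

A consistency check like this mirrors the kind of validation a config needs: every column that is assigned an sdtype should also have an entry saying how it is transformed.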
This release also included various new features and updates, including:
- Users can now transform subsets of the data using dedicated methods, transform_subset and reverse_transform_subset.
- User validation was added for the following methods: transform, reverse_transform, update_sdtypes, update_transformers, set_config.
- Unnecessary warnings were removed from GaussianNormalizer.fit and FrequencyEncoder.transform.
- The user can now set a transformer to None.
- Transformers that cannot work with missing values will automatically fill them in.
- Added support for additional datetime formats.
- Setting model_missing_values = False in a transformer was updated to keep track of the percentage of missing values, instead of producing data containing NaNs.
- All parameters were removed from the HyperTransformer.
- The demo dataset get_demo was improved to be more intuitive.
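The item above about keeping track of the percentage of missing values can be sketched as follows. This is a conceptual standard-library illustration, not the NullTransformer code: the NullFractionTracker class, the use of None for missing values and the mean as fill value are assumptions made for the example.

```python
import random


class NullFractionTracker:
    """Toy sketch (not RDT code): remember how often values were missing."""

    def fit(self, values):
        """Learn the fraction of missing values and a fill value (the mean)."""
        present = [value for value in values if value is not None]
        self.null_fraction = (len(values) - len(present)) / len(values) if values else 0.0
        self.fill_value = sum(present) / len(present) if present else 0.0
        return self

    def transform(self, values):
        """Replace missing values so downstream steps see complete data."""
        return [self.fill_value if value is None else value for value in values]

    def reverse_transform(self, values, rng):
        """Re-insert missing values at random with the learned frequency."""
        return [None if rng.random() < self.null_fraction else value for value in values]


data = [10.0, None, 30.0, 20.0]
tracker = NullFractionTracker().fit(data)
filled = tracker.transform(data)
restored = tracker.reverse_transform(filled, random.Random(0))
```

Here the tracker learns a missing fraction of 0.25 and fills the gap with the mean 20.0. On the way back, missing values reappear at roughly the learned rate rather than in their original positions.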
Finally, a number of transformers were redesigned to be more user friendly. Among them, the following transformers have also been renamed:
- BayesGMMTransformer -> ClusterBasedNormalizer
- GaussianCopulaTransformer -> GaussianNormalizer
- DateTimeRoundedTransformer -> OptimizedTimestampEncoder
- DateTimeTransformer -> UnixTimestampEncoder
- NumericalTransformer -> FloatFormatter
- LabelEncodingTransformer -> LabelEncoder
- OneHotEncodingTransformer -> OneHotEncoder
- CategoricalTransformer -> FrequencyEncoder
- BooleanTransformer -> BinaryEncoder
- PIIAnonymizer -> AnonymizedFaker
New Features
- Fix using None as transformer when update_transformers_by_sdtype - Issue #496 by @pvk-developer
- Rename PIIAnonymizer --> AnonymizedFaker - Issue #483 by @pvk-developer
- User validation for reverse_transform - Issue #480 by @amontanez24
- User validation for transform - Issue #479 by @fealho
- User validation for set_config - Issue #478 by @fealho
- User validation for update_transformers_by_sdtype - Issue #477 by @amontanez24
- User validation for update_transformers - Issue #475 by @fealho
- User validation for update_sdtypes - Issue #474 by @fealho
- Allow columns to not have a transformer - Issue #473 by @pvk-developer
- Create methods to transform a subset of the data (& reverse transform it) - Issue #472 by @amontanez24
- Throw a warning if you use set_config on a HyperTransformer that's already fit - Issue #466 by @amontanez24
- Update README for RDT 1.0 - Issue #454 by @amontanez24
- Issue with printing PIIAnonymizer in HyperTransformer - Issue #452 by @pvk-developer
- Pretty print get_config - Issue #450 by @pvk-developer
- Silence warning for GaussianNormalizer.fit - Issue #443 by @pvk-developer
- Transformers that cannot work with missing values should automatically fill them in - Issue #442 by @amontanez24
- More descriptive error message in PIIAnonymizer when provider_name and function_name don't align - Issue #440 by @pvk-developer
- Can we support additional datetime formats? - Issue #439 by @pvk-developer
- Update FrequencyEncoder.transform so that pandas won't throw a warning - Issue #436 by @pvk-developer
- Update functionality when model_missing_values=False - Issue #435 by @amontanez24
- Create methods for getting and setting a config - Issue #418 by @amontanez24
- Input validation & error handling in HyperTransformer - Issue #408 by @fealho and @amontanez24
- Remove unneeded params from HyperTransformer - Issue #407 by @pvk-developer
- Rename property: _valid_output_sdtypes - Issue #406 by @amontanez24
- Add pii as a new sdtype in HyperTransformer - Issue #404 by @pvk-developer
- Update transformers by data type (in HyperTransformer) - Issue #403 by @pvk-developer
- Update transformers by column name in HyperTransformer - Issue #402 by @pvk-developer
- Improve updating field_data_types in HyperTransformer - Issue #400 by @amontanez24
- Create method to auto detect HyperTransformer config from data - Issue #399 by @fealho
- Update HyperTransformer default transformers - Issue #398 by @fealho
- Add PIIAnonymizer - Issue #397 by @pvk-developer
- Improve the way we print an individual transformer - Issue #395 by @amontanez24
- Rename columns parameter in fit for each individual transformer - Issue #376 by @fealho and @pvk-developer
- Create a more descriptive demo dataset - Issue #374 by @fealho
- Delete unnecessary transformers - Issue #373 by @fealho
- Update NullTransformer to make it user friendly - Issue #372 by @pvk-developer
- Update BayesGMMTransformer to make it user friendly - Issue #371 by @amontanez24
- Update GaussianCopulaTransformer to make it user friendly - Issue #370 by @amontanez24
- Update DateTimeRoundedTransformer to make it user friendly - Issue #369 by @amontanez24
- Update DateTimeTransformer to make it user friendly - Issue #368 by @amontanez24
- Update NumericalTransformer to make it user friendly - Issue #367 by @amontanez24
- Update LabelEncodingTransformer to make it user friendly - Issue #366 by @fealho
- Update OneHotEncodingTransformer to make it user friendly - Issue #365 by @fealho
- Update CategoricalTransformer to make it user friendly - Issue #364 by @fealho
- Update BooleanTransformer to make it user friendly - Issue #363 by @fealho
- Update names & functionality for handling missing values - Issue #362 by @pvk-developer
Bugs
- Checking keys of config as set - Issue #497 by @amontanez24
- Only update transformer used when necessary for update_sdtypes - Issue #469 by @amontanez24
- Fix how get_config prints transformers - Issue #468 by @pvk-developer
- NullTransformer reverse_transform alters input data due to not copying - Issue #455 by @amontanez24
- Attempting to transform a subset of the data should lead to an Error - Issue #451 by @amontanez24
- Detect_initial_config isn't detecting sdtype "numerical" - Issue #449 by @pvk-developer
- PIIAnonymizer not generating multiple locales - Issue #447 by @pvk-developer
- Error when printing ClusterBasedNormalizer and GaussianNormalizer - Issue #441 by @pvk-developer
- Datetime reverse transform crashes if datetime_format is specified - Issue #438 by @amontanez24
- Correct datetime format is not recovered on reverse_transform - Issue #437 by @pvk-developer
- Use numpy NaN values in BinaryEncoder - Issue #434 by @pvk-developer
- Duplicate _output_columns during fitting - Issue #423 by @fealho
Internal Improvements
- Making methods that aren't part of API private - Issue #489 by @amontanez24
- Fix columns missing in config and update transformers to None - Issue #495 by @pvk-developer
v0.6.4 - 2022-3-7
This release fixes multiple bugs concerning the HyperTransformer. One is that the get_transformer_tree_yaml method no longer crashes on every call. Another is that calling the update_field_data_types and update_default_data_type_transformers after fitting no longer breaks the transform method.
The HyperTransformer now sorts its outputs for both transform and reverse_transform based on the order of the input's columns. It is also now possible to create transformers that simply drop columns during transform and don't return any new columns.
New Features
- Support dropping a column through a transformer - Issue #393 by @pvk-developer
- HyperTransformer should sort columns after transform and reverse_transform - Issue #405 by @fealho
Bugs
- get_transformer_tree_yaml fails - Issue #389 by @amontanez24
- HyperTransformer _unfit method not working correctly - Issue #390 by @amontanez24
- Blank dataframe after updating the data types - Issue #401 by @amontanez24
v0.6.3 - 2022-2-4
This release adds a new module to the RDT library called performance. This module can be used to evaluate the speed and peak memory usage of any transformer in RDT. This release also increases the maximum acceptable version of scikit-learn to make it more compatible with other libraries in the SDV ecosystem. On top of that, it fixes a bug related to a new version of pandas.
New Features
- Move profiling functions into RDT library - Issue #353 by @amontanez24
Housekeeping
- Increase scikit-learn dependency range - Issue #351 by @amontanez24
- pandas 1.4.0 release causes a small error - Issue #358 by @fealho
Bugs
- Performance tests get stuck on Unix if multiprocessing is involved - Issue #337 by @amontanez24