Releases: sdv-dev/RDT

v0.6.2 - 2021-12-28

28 Dec 23:49

This release adds a new BayesGMMTransformer. This transformer converts a numerical column into two
columns: a discrete column indicating the selected component of the Gaussian mixture model (GMM) for each row, and a continuous column containing
the value of each row normalized by the mean and std of the selected component. It is useful when the values in the column
come from multiple distributions.
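
As a rough illustration of the two-column encoding described above, here is a minimal Python sketch. It is not RDT's implementation. The components are hard-coded in a hypothetical `COMPONENTS` list, whereas the transformer fits a Bayesian GMM to the data. Picking the most likely component and using a plain z-score are both assumptions made for this sketch.

```python
import math

# Hypothetical, hand-picked mixture components as (mean, std) pairs.
# A real transformer would learn these by fitting a Bayesian GMM.
COMPONENTS = [(0.0, 1.0), (100.0, 10.0)]


def log_density(value, mean, std):
    """Log of the normal probability density at value."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))


def transform(values):
    """Map each value to (component index, value normalized by that component)."""
    rows = []
    for value in values:
        # Assumption: select the component under which the value is most likely.
        index = max(
            range(len(COMPONENTS)),
            key=lambda i: log_density(value, *COMPONENTS[i]),
        )
        mean, std = COMPONENTS[index]
        rows.append((index, (value - mean) / std))
    return rows


def reverse_transform(rows):
    """Recover the original values from (component, normalized) pairs."""
    return [COMPONENTS[i][0] + z * COMPONENTS[i][1] for i, z in rows]


column = [0.5, -1.2, 95.0, 110.0]
encoded = transform(column)
decoded = reverse_transform(encoded)
```

Values near 0 are assigned to the first component and values near 100 to the second, so each row is described relative to the distribution it most plausibly came from.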

This release also adds several new methods to the HyperTransformer API. These let users access the specific
transformers used on each input field, as well as view the entire tree of transformers that are used when running transform.
The new methods are:

  • BaseTransformer.get_input_columns() - Return list of input columns for a transformer.
  • BaseTransformer.get_output_columns() - Return list of output columns for a transformer.
  • HyperTransformer.get_transformer(field) - Return the transformer instance used for a field.
  • HyperTransformer.get_output_transformers(field) - Return dictionary mapping output columns of a field to the transformers used on them.
  • HyperTransformer.get_final_output_columns(field) - Return list of all final output columns related to a field.
  • HyperTransformer.get_transformer_tree_yaml() - Return YAML representation of transformers tree.
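
To make the idea of a transformer tree concrete, the sketch below models one as a plain dictionary and walks it. It does not use RDT. The `TREE` layout, the column names and the chaining of the two transformers are hypothetical, and the two helper functions only mimic what get_output_transformers and get_final_output_columns are described as returning.

```python
# Hypothetical tree: each column maps to (transformer name, output columns).
# Columns that do not appear as keys are final outputs.
TREE = {
    "signup": ("DatetimeTransformer", ["signup.value", "signup.is_null"]),
    "signup.value": (
        "BayesGMMTransformer",
        ["signup.value.normalized", "signup.value.component"],
    ),
}


def output_transformers(field, tree):
    """Map each output column of field to the transformer applied to it, if any."""
    _, outputs = tree[field]
    return {out: tree[out][0] if out in tree else None for out in outputs}


def final_output_columns(column, tree):
    """Depth-first walk collecting the leaf columns derived from column."""
    if column not in tree:
        return [column]
    _, outputs = tree[column]
    leaves = []
    for output in outputs:
        leaves.extend(final_output_columns(output, tree))
    return leaves


next_steps = output_transformers("signup", TREE)
leaves = final_output_columns("signup", TREE)
```

A column with no entry in the tree is its own final output, which is why the recursion stops there.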

Additionally, this release fixes a bug where the HyperTransformer was incorrectly raising a NotFittedError. It also improves the
DatetimeTransformer by automatically detecting whether a column needs to be converted from dtype object to dtype datetime.

New Features

  • Cast column to datetime if specified in field transformers - Issue #321 by @amontanez24
  • Add a BayesianGMM Transformer - Issue #183 by @fealho
  • Add transformer tree structure and traversal methods - Issue #330 by @amontanez24

Bugs fixed

  • HyperTransformer raises NotFittedError after fitting - Issue #332 by @amontanez24

v0.6.1 - 2021-11-10

10 Nov 18:21

This release adds support for Python 3.9! It also removes unused document files.

Internal Improvements

v0.6.0 - 2021-10-29

29 Oct 21:28

This release makes major changes to the underlying code for RDT as well as the API for both the HyperTransformer and BaseTransformer.
The changes enable the following functionality:

  • The HyperTransformer can now apply a sequence of transformers to a column.
  • Transformers can now take multiple columns as an input.
  • RDT now allows arbitrary data types to be added instead of being restricted to pandas dtypes.
  • Users can define acceptable output types for running HyperTransformer.transform.
  • The HyperTransformer will continuously apply transformations to the input fields until only acceptable data types are in the output.
  • Transformers can return data of any data type.
  • Transformers now have named outputs and output types.
  • Transformers can suggest which transformer to use on any of their outputs.
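
The repeated application of transformers until only acceptable data types remain can be pictured with the toy loop below. This is a sketch under stated assumptions, not RDT code: the three transformer functions, the `DEFAULT_TRANSFORMERS` registry, the data type names and the `.value` output naming are all hypothetical.

```python
from datetime import date


def datetime_to_ordinal(name, values):
    """Toy datetime transformer: ISO date strings to day ordinals."""
    ordinals = [date.fromisoformat(v).toordinal() for v in values]
    return {f"{name}.value": ("integer", ordinals)}


def integer_to_float(name, values):
    """Toy integer transformer: integers to floats."""
    return {f"{name}.value": ("float", [float(v) for v in values])}


def category_to_integer(name, values):
    """Toy categorical transformer: labels to integer codes."""
    codes = {label: code for code, label in enumerate(sorted(set(values)))}
    return {f"{name}.value": ("integer", [codes[v] for v in values])}


# Hypothetical defaults, one transformer per data type.
DEFAULT_TRANSFORMERS = {
    "datetime": datetime_to_ordinal,
    "integer": integer_to_float,
    "categorical": category_to_integer,
}


def transform_until_acceptable(columns, acceptable_types):
    """Apply default transformers until every column has an acceptable type.

    columns maps a column name to a (data type, values) pair.
    """
    pending = dict(columns)
    done = {}
    while pending:
        name, (data_type, values) = pending.popitem()
        if data_type in acceptable_types:
            done[name] = (data_type, values)
        else:
            # Outputs go back into the queue and may be transformed again.
            pending.update(DEFAULT_TRANSFORMERS[data_type](name, values))
    return done


data = {
    "joined": ("datetime", ["2021-10-29", "2021-11-10"]),
    "plan": ("categorical", ["free", "pro"]),
}
result = transform_until_acceptable(data, acceptable_types={"float"})
```

In this sketch each input column passes through two transformers before it reaches an acceptable type. A data type with no registered default raises a KeyError here.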

To take advantage of this functionality, the following API changes were made:

  • The HyperTransformer has new initialization parameters that allow users to specify data types for any field in their data as well as
    specify which transformer to use for a field or data type. The parameters are:
    • field_transformers - A dictionary allowing users to specify which transformer to use for a field or derived field. Derived fields
      are fields created by running transform on the input data.
    • field_data_types - A dictionary allowing users to specify the data type of a field.
    • default_data_type_transformers - A dictionary allowing users to specify the default transformer to use for a data type.
    • transform_output_types - A dictionary allowing users to specify which data types are acceptable for the output of transform.
      This is a result of the fact that transformers can now be applied in a sequence, and not every transformer will return numeric data.
  • Methods were also added to the HyperTransformer to allow these parameters to be modified. These include get_field_data_types,
    update_field_data_types, get_default_data_type_transformers, update_default_data_type_transformers and set_first_transformers_for_fields.
  • The BaseTransformer now requires the column names it will transform to be provided to fit, transform and reverse_transform.
  • The BaseTransformer added the following method to allow for users to see its output fields and output types: get_output_types.
  • The BaseTransformer added the following method to allow for users to see the next suggested transformer for each output field:
    get_next_transformers.

On top of the changes to the API and the capabilities of RDT, many automated checks and tests were also added to ensure that contributions
to the library abide by the current code style, stay performant and result in data of a high quality. These tests run on every push to the
repository. They can also be run locally via the following functions:

  • validate_transformer_code_style - Checks that new code follows the code style.
  • validate_transformer_quality - Tests that new transformers yield data that maintains relationships between columns.
  • validate_transformer_performance - Tests that new transformers don't take too much time or memory.
  • validate_transformer_unit_tests - Checks that the unit tests cover all new code, follow naming conventions and pass.
  • validate_transformer_integration - Checks that the integration tests follow naming conventions and pass.

New Features

Bugs fixed

  • If the input data has a different index, the reverse transformed data may be out of order - Issue #277 by @amontanez24

Documentation changes

Internal improvements

Other issues closed

  • DeprecationWarning: np.float is a deprecated alias for the builtin float - Issue #304 by @csala
  • Add pip check to CI workflows - Issue #290 by @csala
  • Should Transformers subclasses exist for specific configurations? - Issue #243 by @fealho

v0.5.3 - 2021-10-07

08 Oct 15:43

This release fixes a bug with learning rounding digits in the NumericalTransformer, and includes a few housekeeping improvements.

Issues closed

  • Update learn rounding digits to handle all nan data - Issue #244 by @katxiao
  • Adapt to latest PyLint housekeeping - Issue #216 by @fealho

v0.5.2 - 2021-08-16

17 Aug 04:49

This release fixes a couple of bugs introduced by the previous release regarding the
OneHotEncoder and the BooleanTransformer.

Issues closed

  • BooleanTransformer.reverse_transform sometimes crashes with TypeError - Issue #210 by @katxiao
  • OneHotEncoder causing shape misalignment in CopulaGAN, CTGAN, and TVAE - Issue #208 by @sarahmish
  • BooleanTransformer.reverse_transform modifies the input data - Issue #211 by @katxiao

v0.5.1 - 2021-08-11

11 Aug 21:14

This release improves the overall performance of the library, both in terms of memory and time consumption.
More specifically, it makes the following modules more efficient: NullTransformer, DatetimeTransformer,
LabelEncodingTransformer, NumericalTransformer, CategoricalTransformer, BooleanTransformer and OneHotEncodingTransformer.

It also adds performance-based testing and a script for profiling the performance.

Issues closed

v0.5.0 - 2021-07-12

12 Jul 14:55

This release updates the NumericalTransformer by adding a new rounding argument.
Users can now obtain numerical values rounded to a given precision, which is either pre-specified or automatically computed from the given data.
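
One way to compute a precision from the data is to take the largest number of decimal places observed. The sketch below illustrates that idea only. The function names, the `max_digits` cap and the exact rule are assumptions, and the rule RDT uses may differ.

```python
import math
from decimal import Decimal


def learn_rounding_digits(values, max_digits=15):
    """Return the largest number of decimal places seen, or None if unknown."""
    observed = [v for v in values if math.isfinite(v)]
    if not observed:
        # Data with no finite values carries no precision information.
        return None
    digits = 0
    for value in observed:
        # repr gives the shortest string that round-trips the float;
        # normalize drops trailing zeros such as the one in 3.0.
        exponent = Decimal(repr(value)).normalize().as_tuple().exponent
        digits = max(digits, -exponent)
    return min(digits, max_digits)


def apply_rounding(values, digits):
    """Round values to digits decimal places, or leave them alone if None."""
    if digits is None:
        return list(values)
    return [round(v, digits) for v in values]


learned = learn_rounding_digits([1.25, 3.0, float("nan"), 0.5])
rounded = apply_rounding([2.34567, 0.1234], learned)
unknown = learn_rounding_digits([float("nan"), float("nan")])
```

Returning None for data with no finite values lets the caller skip rounding instead of failing.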

Issues closed

v0.4.2 - 2021-06-08

08 Jun 11:47

This release adds a new method to the CategoricalTransformer to fix a bug where the transformer became unusable after being pickled and unpickled if the data it was fit on contained NaN values.
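
These notes do not describe the root cause. As an assumption, the sketch below shows one general way NaN-keyed state can break across a pickle round trip in Python, together with a lookup helper that matches NaN by value. The `fitted` dictionary and the `lookup` helper are hypothetical and are not RDT's code.

```python
import math
import pickle

nan = float("nan")
# Hypothetical fitted state: category -> representative value.
fitted = {"a": 0.25, nan: 0.75}

# Before pickling, the lookup succeeds only because the key is the same object.
assert fitted[nan] == 0.75

restored = pickle.loads(pickle.dumps(fitted))
# After the round trip the key is a different NaN object, and NaN != NaN.
lookup_survives = nan in restored


def lookup(mapping, category):
    """Look up a category, matching NaN by value rather than by identity."""
    if isinstance(category, float) and math.isnan(category):
        for key, value in mapping.items():
            if isinstance(key, float) and math.isnan(key):
                return value
        raise KeyError(category)
    return mapping[category]


recovered = lookup(restored, nan)
```

Dictionary lookups compare keys by identity first and by equality second, so a NaN key is found only when the very same object is used.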

It also fixes some grammar mistakes in the documentation.

Issues closed

Documentation changes

v0.4.1 - 2021-03-29

30 Mar 00:51

This release reduces the memory usage of the HyperTransformer when working with a large number of columns, or with a large number of categorical values when using one hot encoding.

Issues closed

v0.4.0 - 2021-02-24

24 Feb 19:54

This release changes the HyperTransformer so that it can transform and
reverse transform a subset of the columns seen during training.
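
The behaviour can be pictured with a toy model that keeps one fitted transformer per column and only touches the columns it is given. Both classes below are hypothetical stand-ins written for this sketch and are unrelated to RDT's actual classes.

```python
class ScaleTransformer:
    """Toy per-column transformer that rescales by the fitted maximum."""

    def fit(self, values):
        # Fall back to 1 so that an all-zero column does not divide by zero.
        self.maximum = max(values) or 1

    def transform(self, values):
        return [v / self.maximum for v in values]

    def reverse_transform(self, values):
        return [v * self.maximum for v in values]


class ToyHyperTransformer:
    """Keeps one fitted transformer per column seen during fit."""

    def fit(self, table):
        self._transformers = {}
        for name, values in table.items():
            transformer = ScaleTransformer()
            transformer.fit(values)
            self._transformers[name] = transformer

    def transform(self, table):
        # Only the columns present in table are looked up and transformed.
        return {
            name: self._transformers[name].transform(values)
            for name, values in table.items()
        }

    def reverse_transform(self, table):
        return {
            name: self._transformers[name].reverse_transform(values)
            for name, values in table.items()
        }


ht = ToyHyperTransformer()
ht.fit({"age": [10, 20, 40], "income": [100, 50, 200]})
subset = ht.transform({"age": [20]})
restored = ht.reverse_transform(subset)
```

Because the lookup is driven by the columns of the incoming table, omitting income at transform time is not an error in this model.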

The anonymization functionality, which was deprecated and unused, has also
been removed along with the Faker dependency.

Issues closed

  • Allow the HyperTransformer to be used on a subset of the columns - Issue #152 by @csala
  • Remove faker - Issue #150 by @csala