Releases: sdv-dev/RDT
v0.6.2 - 2021-12-28
This release adds a new BayesGMMTransformer. This transformer can be used to convert a numerical column into two columns: a discrete column indicating the selected component of the GMM for each row, and a continuous column containing the normalized value of each row based on the mean and std of the selected component. It is useful when the column being transformed came from multiple distributions.
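The idea can be illustrated with a minimal, standard-library sketch. This is not the BayesGMMTransformer implementation: the two components are hand-picked rather than learned, the component is chosen as the most likely one, and the function names are hypothetical.

```python
import math

# Hypothetical, hand-picked mixture components as (mean, std) pairs.
# A real transformer would learn these from the data.
COMPONENTS = [(0.0, 1.0), (10.0, 2.0)]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def to_component_and_normalized(x, components=COMPONENTS):
    """Return (component index, value normalized by that component)."""
    densities = [gaussian_pdf(x, mean, std) for mean, std in components]
    index = densities.index(max(densities))  # most likely component
    mean, std = components[index]
    return index, (x - mean) / std


def from_component_and_normalized(index, normalized, components=COMPONENTS):
    """Invert the transformation for a single value."""
    mean, std = components[index]
    return normalized * std + mean


column = [-0.5, 0.3, 9.0, 12.0]
transformed = [to_component_and_normalized(x) for x in column]
recovered = [from_component_and_normalized(i, v) for i, v in transformed]
```

Each input value becomes a pair: a discrete component index and a continuous value expressed relative to that component, which is what makes the reverse step possible.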
This release also adds multiple new methods to the HyperTransformer API. These allow users to access the specific transformers used on each input field, as well as view the entire tree of transformers that are used when running transform.
The exact methods are:
- BaseTransformer.get_input_columns() - Return list of input columns for a transformer.
- BaseTransformer.get_output_columns() - Return list of output columns for a transformer.
- HyperTransformer.get_transformer(field) - Return the transformer instance used for a field.
- HyperTransformer.get_output_transformers(field) - Return dictionary mapping output columns of a field to the transformers used on them.
- HyperTransformer.get_final_output_columns(field) - Return list of all final output columns related to a field.
- HyperTransformer.get_transformer_tree_yaml() - Return YAML representation of transformers tree.
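To make the notion of "final output columns" concrete, here is a small sketch of walking a transformer tree. The tree layout, the column names and the helper function are assumptions made for illustration; they do not reproduce RDT's internal data structures.

```python
# Hypothetical tree: each column maps to the outputs produced by the next
# transformer applied to it. An empty dict marks a final output column.
TREE = {
    "signup_date": {
        "signup_date.value": {},
        "signup_date.is_null": {
            "signup_date.is_null.value": {},
        },
    },
    "age": {
        "age.value": {},
    },
}


def final_output_columns(tree, field):
    """Collect the leaf columns reachable from a field, depth first."""
    finals = []
    stack = [(field, tree[field])]
    while stack:
        name, children = stack.pop()
        if not children:
            finals.append(name)
            continue
        # Push in reverse so children are visited in their original order.
        for child, grandchildren in reversed(list(children.items())):
            stack.append((child, grandchildren))
    return finals
```

In this toy tree, the final columns of a field are the leaves of its subtree, so a field whose intermediate output is transformed again contributes the deeper column rather than the intermediate one.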
Additionally, this release fixes a bug where the HyperTransformer was incorrectly raising a NotFittedError. It also improves the DatetimeTransformer by automatically detecting whether a column needs to be converted from dtype object to dtype datetime.
New Features
- Cast column to datetime if specified in field transformers - Issue #321 by @amontanez24
- Add a BayesianGMM Transformer - Issue #183 by @fealho
- Add transformer tree structure and traversal methods - Issue #330 by @amontanez24
Bugs fixed
- HyperTransformer raises NotFittedError after fitting - Issue #332 by @amontanez24
v0.6.1 - 2021-11-10
This release adds support for Python 3.9! It also removes unused document files.
Internal Improvements
- Add support for Python 3.9 - Issue #323 by @amontanez24
- Remove docs - PR #322 by @pvk-developer
v0.6.0 - 2021-10-29
This release makes major changes to the underlying code for RDT as well as the API for both the HyperTransformer and BaseTransformer.
The changes enable the following functionality:
- The HyperTransformer can now apply a sequence of transformers to a column.
- Transformers can now take multiple columns as an input.
- RDT has been expanded to allow any number of data types to be added instead of being restricted to pandas.dtypes.
- Users can define acceptable output types for running HyperTransformer.transform.
- The HyperTransformer will continuously apply transformations to the input fields until only acceptable data types are in the output.
- Transformers can return data of any data type.
- Transformers now have named outputs and output types.
- Transformers can suggest which transformer to use on any of their outputs.
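The "apply transformations until only acceptable data types remain" behaviour can be sketched as a simple loop. Everything below is a toy model under stated assumptions: the data type names, the two transformer functions, the defaults table and the dotted output naming are hypothetical and are not RDT's actual classes or conventions.

```python
def encode_categories(values):
    """Toy categorical transformer: map each category to an integer code."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return {"value": ("integer", [codes[v] for v in values])}


def integers_to_floats(values):
    """Toy integer transformer: cast integers to floats."""
    return {"value": ("float", [float(v) for v in values])}


# Hypothetical default transformer for each data type.
DEFAULT_TRANSFORMERS = {
    "categorical": encode_categories,
    "integer": integers_to_floats,
}


def transform_until_acceptable(columns, acceptable_types, defaults=DEFAULT_TRANSFORMERS):
    """Apply transformers repeatedly until every column has an acceptable type.

    columns maps a column name to a (data_type, values) pair.
    """
    pending = dict(columns)
    done = {}
    while pending:
        name, (data_type, values) = pending.popitem()
        if data_type in acceptable_types:
            done[name] = (data_type, values)
            continue
        if data_type not in defaults:
            raise ValueError(f"No transformer available for data type {data_type!r}")
        # Named outputs go back into the queue and may be transformed again.
        for suffix, output in defaults[data_type](values).items():
            pending[f"{name}.{suffix}"] = output
    return done


data = {
    "color": ("categorical", ["red", "blue", "red"]),
    "height": ("float", [1.5, 1.7, 1.6]),
}
result = transform_until_acceptable(data, acceptable_types={"float"})
```

Here the categorical column passes through two transformers in sequence before its output is accepted, while the float column is accepted immediately.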
To take advantage of this functionality, the following API changes were made:
- The HyperTransformer has new initialization parameters that allow users to specify data types for any field in their data as well as specify which transformer to use for a field or data type. The parameters are:
  - field_transformers - A dictionary allowing users to specify which transformer to use for a field or derived field. Derived fields are fields created by running transform on the input data.
  - field_data_types - A dictionary allowing users to specify the data type of a field.
  - default_data_type_transformers - A dictionary allowing users to specify the default transformer to use for a data type.
  - transform_output_types - A dictionary allowing users to specify which data types are acceptable for the output of transform. This is a result of the fact that transformers can now be applied in a sequence, and not every transformer will return numeric data.
- Methods were also added to the HyperTransformer to allow these parameters to be modified. These include get_field_data_types, update_field_data_types, get_default_data_type_transformers, update_default_data_type_transformers and set_first_transformers_for_fields.
- The BaseTransformer now requires the column names it will transform to be provided to fit, transform and reverse_transform.
- The BaseTransformer added the following method to allow users to see its output fields and output types: get_output_types.
- The BaseTransformer added the following method to allow users to see the next suggested transformer for each output field: get_next_transformers.
On top of the changes to the API and the capabilities of RDT, many automated checks and tests were also added to ensure that contributions
to the library abide by the current code style, stay performant and result in data of a high quality. These tests run on every push to the
repository. They can also be run locally via the following functions:
- validate_transformer_code_style - Checks that new code follows the code style.
- validate_transformer_quality - Tests that new transformers yield data that maintains relationships between columns.
- validate_transformer_performance - Tests that new transformers don't take too much time or memory.
- validate_transformer_unit_tests - Checks that the unit tests cover all new code, follow naming conventions and pass.
- validate_transformer_integration - Checks that the integration tests follow naming conventions and pass.
New Features
- Update HyperTransformer API - Issue #298 by @amontanez24
- Create validate_pull_request function - Issue #254 by @pvk-developer
- Create validate_transformer_unit_tests function - Issue #249 by @pvk-developer
- Create validate_transformer_performance function - Issue #251 by @katxiao
- Create validate_transformer_quality function - Issue #253 by @amontanez24
- Create validate_transformer_code_style function - Issue #248 by @pvk-developer
- Create validate_transformer_integration function - Issue #250 by @katxiao
- Enable users to specify transformers to use in HyperTransformer - Issue #233 by @amontanez24 and @csala
- Addons implementation - Issue #225 by @pvk-developer
- Create ways for HyperTransformer to know which transformers to apply to each data type - Issue #232 by @amontanez24 and @csala
- Update categorical transformers - PR #231 by @fealho
- Update numerical transformer - PR #227 by @fealho
- Update datetime transformer - PR #230 by @fealho
- Update boolean transformer - PR #228 by @fealho
- Update null transformer - PR #229 by @fealho
- Update the baseclass - PR #224 by @fealho
Bugs fixed
- If the input data has a different index, the reverse transformed data may be out of order - Issue #277 by @amontanez24
Documentation changes
- RDT contributing guide - Issue #301 by @katxiao and @amontanez24
Internal improvements
- Add PR template for new transformers - Issue #307 by @katxiao
- Implement Quality Tests for Transformers - Issue #252 by @amontanez24
- Update performance test structure - Issue #257 by @katxiao
- Automated integration test for transformers - Issue #223 by @katxiao
- Move datasets to its own module - Issue #235 by @katxiao
- Fix missing coverage in rdt unit tests - Issue #219 by @fealho
- Add repo-wide automation - Issue #309 by @katxiao
Other issues closed
v0.5.3 - 2021-10-07
v0.5.2 - 2021-08-16
This release fixes a couple of bugs introduced by the previous release regarding the OneHotEncoder and the BooleanTransformer.
Issues closed
v0.5.1 - 2021-08-11
This release improves the overall performance of the library, both in terms of memory and time consumption.
More specifically, it makes the following modules more efficient: NullTransformer, DatetimeTransformer, LabelEncodingTransformer, NumericalTransformer, CategoricalTransformer, BooleanTransformer and OneHotEncodingTransformer.
It also adds performance-based testing and a script for profiling the performance.
Issues closed
- Add performance-based testing - Issue #194 by @amontanez24
- Audit the NullTransformer - Issue #192 by @amontanez24
- Audit DatetimeTransformer - Issue #189 by @sarahmish
- Audit the LabelEncodingTransformer - Issue #184 by @amontanez24
- Audit the NumericalTransformer - Issue #181 by @fealho
- Audit CategoricalTransformer - Issue #180 by @katxiao
- Audit BooleanTransformer - Issue #179 by @katxiao
- Auditing OneHotEncodingTransformer - Issue #178 by @sarahmish
- Create script for profiling - Issue #176 by @amontanez24
- Create folder structure for performance testing - Issue #174 by @amontanez24
v0.5.0 - 2021-07-12
This release updates the NumericalTransformer by adding a new rounding argument.
Users can now obtain numerical values rounded to a precision that is either pre-specified or automatically computed from the given data.
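One way to compute a precision from the data is to take the largest number of decimal places observed and round later values to it. The sketch below illustrates that approach only; the helper names are hypothetical, and both the learning rule and the choice to skip non-finite values are assumptions rather than a description of what NumericalTransformer does.

```python
import math
from decimal import Decimal


def learn_rounding_digits(values):
    """Return the largest number of decimal places seen in the data."""
    digits = 0
    for value in values:
        if not math.isfinite(value):
            continue  # skip NaN and infinity, which have no decimal places
        # normalize() drops trailing zeros, so 3.0 counts as zero decimals.
        exponent = Decimal(str(value)).normalize().as_tuple().exponent
        digits = max(digits, -exponent)
    return digits


def apply_rounding(values, digits):
    """Round finite values to the given number of decimal places."""
    return [round(v, digits) if math.isfinite(v) else v for v in values]


training = [1.25, 3.0, 0.5, float("inf")]
digits = learn_rounding_digits(training)

generated = [2.718281828, 0.333333, 10.0]
rounded = apply_rounding(generated, digits)
```

With this training data the learned precision is two decimal places, so the generated values are rounded to match the granularity of the original column.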
Issues closed
- Add rounding argument to NumericalTransformer - Issue #166 by @amontanez24 and @csala
- NumericalTransformer rounding error with infinity - Issue #169 by @amontanez24
- Add min and max arguments to NumericalTransformer - Issue #106 by @amontanez24
v0.4.2 - 2021-06-08
This release adds a new method to the CategoricalTransformer to solve a bug where the transformer becomes unusable after being pickled and unpickled if the data it was fit on contained NaN values.
It also fixes some grammar mistakes in the documentation.
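For context on the pickling bug above, the following sketch shows a general Python behaviour that can cause this kind of failure: NaN never compares equal to itself, so a NaN dictionary key is only found through object identity, and that identity is lost after a pickle round trip. This is offered as a plausible illustration, not as a confirmed description of the CategoricalTransformer internals, and the lookup helper is hypothetical.

```python
import math
import pickle

nan = float("nan")
mapping = {"a": 0, nan: 1}

# Found before pickling: dict lookups check identity before equality.
found_before = nan in mapping

# After a round trip the key is a different NaN object, and NaN != NaN.
restored = pickle.loads(pickle.dumps(mapping))
found_after = nan in restored


def lookup(mapping, key):
    """Look up a key, matching any NaN key by value rather than identity."""
    if isinstance(key, float) and math.isnan(key):
        for candidate, value in mapping.items():
            if isinstance(candidate, float) and math.isnan(candidate):
                return value
        raise KeyError(key)
    return mapping[key]
```

A NaN-aware lookup like the one sketched here, or normalizing NaN keys when the object is restored, avoids depending on object identity.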
Issues closed
- CategoricalTransformer with NaN values cannot be pickled bug - Issue #164 by @pvk-developer and @csala
Documentation changes
v0.4.1 - 2021-03-29
This release improves the HyperTransformer memory usage when working with a high number of columns or a high number of categorical values when using one hot encoding.
Issues closed
- Boolean, Datetime and LabelEncoding transformers fail with 2D ndarray - Issue #160 by @pvk-developer
- HyperTransformer: Memory usage increase when reverse_transform is called - Issue #156 by @pvk-developer and @AnupamaGangadhar
v0.4.0 - 2021-02-24
In this release a change in the HyperTransformer allows using it to transform and
reverse transform a subset of the columns seen during training.
The anonymization functionality which was deprecated and not being used has also
been removed along with the Faker dependency.
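The subset behaviour described for the HyperTransformer in this release can be pictured with a toy table transformer that learns one set of parameters per column and then transforms only the columns it is given. The class below is a hypothetical stand-in that uses min-max scaling on plain dictionaries; it is not RDT's HyperTransformer.

```python
class ToyMinMaxTable:
    """Toy per-column scaler; a stand-in, not RDT's HyperTransformer."""

    def fit(self, table):
        # Learn one (min, max) pair per column seen during training.
        self.ranges = {name: (min(vals), max(vals)) for name, vals in table.items()}

    def _span(self, name):
        low, high = self.ranges[name]
        return low, (high - low) or 1.0  # avoid dividing by zero

    def transform(self, table):
        # Only the columns present in the input are transformed.
        out = {}
        for name, values in table.items():
            low, span = self._span(name)
            out[name] = [(v - low) / span for v in values]
        return out

    def reverse_transform(self, table):
        out = {}
        for name, values in table.items():
            low, span = self._span(name)
            out[name] = [v * span + low for v in values]
        return out


scaler = ToyMinMaxTable()
scaler.fit({"age": [20, 30, 40], "income": [1000.0, 3000.0, 2000.0]})

subset = scaler.transform({"age": [30]})
restored = scaler.reverse_transform(subset)
```

Because parameters are stored per column, transforming or reverse transforming a subset needs no information about the columns that were left out.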