An open source project from Data to AI Lab at MIT.
- License: MIT
- Documentation: https://DAI-Lab.github.io/SDGym/
- Homepage: https://github.com/DAI-Lab/SDGym
Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for non-temporal tabular data. SDGym is based on the paper Modeling Tabular data using Conditional GAN, and the project is part of the Data to AI Laboratory at MIT.
Benchmarking a synthesizer is a process in which your synthesizer generates synthetic versions of several datasets. Each pair of real and synthetic datasets is then evaluated with multiple scores.
In order to use SDGym you will need a synthesizer function. This is a function that takes as input a numpy matrix with real data and outputs another numpy matrix with the same shape filled with synthesized data.
Also, alongside the real data, some additional variables informing about the column contents will be passed, which means that the exact signature of the function will be like this:
def my_synthesizer_function(
    real_data: numpy.ndarray,
    categorical_columns: list,
    ordinal_columns: list
) -> synthesized_data: numpy.ndarray
If your synthesizer implements a different interface, you can wrap it in a function like this:
def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    # ...do all necessary steps here...
    return synthesized_data
This function should contain all the parameters and arguments needed to use your synthesizer, and should call it to generate the new synthesized data based on the real data that is passed in.
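As a minimal sketch of such a wrapper, assume a hypothetical class GaussianNoiseModel whose interface is train(table) and generate(count). Neither the class nor its methods belong to SDGym; they only stand in for a synthesizer that exposes a different interface.

```python
import numpy as np


class GaussianNoiseModel:
    """Hypothetical third-party synthesizer with its own interface (not part of SDGym)."""

    def __init__(self, seed=0):
        self._rng = np.random.default_rng(seed)
        self._mean = None
        self._std = None

    def train(self, table):
        # Learn the per-column mean and standard deviation.
        self._mean = table.mean(axis=0)
        self._std = table.std(axis=0)

    def generate(self, count):
        # Draw every column independently from a normal distribution.
        return self._rng.normal(self._mean, self._std, size=(count, len(self._mean)))


def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    # All the configuration lives inside the wrapper, so the caller only passes the data.
    model = GaussianNoiseModel(seed=42)
    model.train(real_data)
    synthesized_data = model.generate(len(real_data))
    # Return a matrix with the same dtype as the input.
    return synthesized_data.astype(real_data.dtype)
```

This toy model ignores the categorical and ordinal column lists; a real wrapper would normally forward them to the underlying synthesizer in whatever form it expects.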
As mentioned in the previous section, the main input of SDGym is the synthesizer to be benchmarked, which is expected to be a function that takes a table of data as input and returns a table of data as output.
The inputs for your synthesizer function should be:
- real_data: a 2D numpy.ndarray with the real data that your synthesizer will attempt to imitate.
- categorical_columns: a list with the indexes of any columns that should be considered categorical regardless of their type.
- ordinal_columns: a list with the indexes of any integer columns that should be treated as ordinal values.
And the output should be a single 2D numpy.ndarray with the exact same shape as the real_data matrix.
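The contract above can be checked before running a full benchmark. The following sketch assumes a hypothetical helper, check_synthesizer_output, and a toy synthesizer, shuffle_columns; neither is provided by SDGym.

```python
import numpy as np


def check_synthesizer_output(synthesizer, real_data, categorical_columns, ordinal_columns):
    """Hypothetical helper: call the synthesizer and verify the output contract."""
    synthesized = synthesizer(real_data, categorical_columns, ordinal_columns)
    if not isinstance(synthesized, np.ndarray):
        raise TypeError('the synthesizer must return a numpy.ndarray')
    if synthesized.ndim != 2:
        raise ValueError('the synthesizer output must be 2D')
    if synthesized.shape != real_data.shape:
        raise ValueError(
            'expected shape {}, got {}'.format(real_data.shape, synthesized.shape)
        )
    return synthesized


def shuffle_columns(real_data, categorical_columns, ordinal_columns):
    # Toy synthesizer: permute each column independently, which keeps the
    # marginal distributions but breaks the correlations between columns.
    rng = np.random.default_rng(0)
    return np.column_stack([rng.permutation(column) for column in real_data.T])


real = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]], dtype=np.float32)
synthetic = check_synthesizer_output(shuffle_columns, real, [1], [])
```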
All the datasets used for the benchmarking can be found inside the sdgym S3 bucket in the form of an .npz numpy matrix archive and a .json metadata file that contains information about the dataset structure and its columns.
In order to load these datasets in the same format as they will be passed to your synthesizer function, you can use the sdgym.load_dataset function, passing the name of the dataset to load.
In this example, we will load the adult dataset:
from sdgym import load_dataset
data, categorical_columns, ordinal_columns = load_dataset('adult')
This will return a numpy matrix with the data that will be passed to your synthesizer function, as well as the list of indexes for the categorical and ordinal columns:
>>> data
array([[2.70000e+01, 0.00000e+00, 1.77119e+05, ..., 4.40000e+01,
0.00000e+00, 0.00000e+00],
[2.70000e+01, 0.00000e+00, 2.16481e+05, ..., 4.00000e+01,
0.00000e+00, 0.00000e+00],
[2.50000e+01, 0.00000e+00, 2.56263e+05, ..., 4.00000e+01,
0.00000e+00, 0.00000e+00],
...,
[4.50000e+01, 0.00000e+00, 2.07540e+05, ..., 4.00000e+01,
0.00000e+00, 1.00000e+00],
[5.10000e+01, 0.00000e+00, 1.80807e+05, ..., 4.00000e+01,
0.00000e+00, 0.00000e+00],
[6.10000e+01, 4.00000e+00, 1.86451e+05, ..., 4.00000e+01,
0.00000e+00, 1.00000e+00]], dtype=float32)
>>> categorical_columns
[1, 5, 6, 7, 8, 9, 13, 14]
>>> ordinal_columns
[3]
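To illustrate how the two index lists relate to the matrix, here is a small sketch with two hypothetical helpers, describe_columns and count_categories (not part of SDGym). Labeling the remaining columns as continuous is an assumption made for this example, and the tiny matrix only stands in for a real dataset.

```python
import numpy as np


def describe_columns(data, categorical_columns, ordinal_columns):
    """Hypothetical helper: label each column index with its role."""
    roles = {}
    for index in range(data.shape[1]):
        if index in categorical_columns:
            roles[index] = 'categorical'
        elif index in ordinal_columns:
            roles[index] = 'ordinal'
        else:
            # Assumption: anything not listed is treated as continuous.
            roles[index] = 'continuous'
    return roles


def count_categories(data, categorical_columns):
    # Number of distinct values found in each categorical column.
    return {index: len(np.unique(data[:, index])) for index in categorical_columns}


# Small stand-in matrix; with SDGym installed, the values returned by
# load_dataset('adult') could be passed instead.
data = np.array(
    [[27.0, 0.0, 3.0], [45.0, 4.0, 1.0], [61.0, 4.0, 3.0]], dtype=np.float32
)
roles = describe_columns(data, categorical_columns=[1], ordinal_columns=[2])
counts = count_categories(data, [1])
```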
In order to get started using the benchmarking tool, some demo synthesizers have been included in the library.
These synthesizers are classes that can be imported from the sdgym.synthesizers module and have the following methods:
- fit: Fits the synthesizer on the data. Expects the following arguments:
    - data (numpy.ndarray): 2 dimensional Numpy matrix with the real data to learn from.
    - categorical_columns (list or tuple): List of indexes of the columns that are categorical within the dataset.
    - ordinal_columns (list or tuple): List of indexes of the columns that are ordinal within the dataset.
- sample: Generates new data resembling the original dataset. Expects the following arguments:
    - n_samples (int): Number of samples to generate.
- fit_sample: Fits the synthesizer on the dataset and then samples as many rows as there were in the original dataset. It expects the same arguments as the fit method, and is ready to be directly passed to the benchmark function in order to evaluate the synthesizer performance.
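A class following the same three-method interface could look like the sketch below. RowResamplingSynthesizer is a hypothetical example written for this document, not one of the classes shipped in sdgym.synthesizers, and it simply resamples the training rows with replacement.

```python
import numpy as np


class RowResamplingSynthesizer:
    """Hypothetical synthesizer exposing fit, sample and fit_sample (not an SDGym class)."""

    def __init__(self, seed=0):
        self._rng = np.random.default_rng(seed)
        self._data = None

    def fit(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
        # This toy model only memorizes the training rows; the column lists
        # are accepted for interface compatibility but are not used.
        self._data = np.asarray(data)

    def sample(self, n_samples):
        # Draw rows with replacement from the memorized data.
        indexes = self._rng.integers(0, len(self._data), size=n_samples)
        return self._data[indexes]

    def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
        # Fit, then sample as many rows as the original dataset has.
        self.fit(data, categorical_columns, ordinal_columns)
        return self.sample(len(data))
```

Because fit_sample takes the real data plus the two column lists and returns a matrix of the same shape, the bound method has the same signature as the synthesizer function described earlier.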
A complete example of how to use them can be found below in the Usage section.
SDGym has been developed and tested on Python 3.5 and 3.6.
Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where SDGym is run.
These are the minimum commands needed to create a virtualenv using python3.6 for SDGym:
pip install virtualenv
virtualenv -p $(which python3.6) sdgym-venv
Afterwards, you have to execute this command to have the virtualenv activated:
source sdgym-venv/bin/activate
Remember to execute it every time you start a new console to work on SDGym!
After creating the virtualenv and activating it, we recommend using pip in order to install SDGym:
pip install sdgym
This will pull and install the latest stable release from PyPI.
Alternatively, with your virtualenv activated, you can clone the repository and install it from source by running make install on the stable branch:
git clone git@github.com:DAI-Lab/SDGym.git
cd SDGym
git checkout stable
make install
If you want to contribute to the project, a few more steps are required to make the project ready for development.
First, please head to the GitHub page of the project and fork it under your own username by clicking the fork button in the upper right corner of the page.
Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:
git clone git@github.com:{your username}/SDGym.git
cd SDGym
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature
Finally, install the project with the following command, which will also install some additional dependencies for code linting and testing:
make install-develop
Make sure to use them regularly while developing by running the commands make lint and make test.
In order to use all the features of SDGym, some dependencies written in C++ need to be compiled.
To do this:
- Make sure that all the dependencies needed to compile C++ are installed. On Ubuntu-based Linux distributions, this can be done with the following command:
sudo apt-get install build-essential
- Execute:
make compile
All you need to do in order to use the SDGym Benchmark is to import and call the sdgym.benchmark function, passing it your synthesizer function:
from sdgym import benchmark
scores = benchmark(my_synthesizer_function)
The output of the benchmark function will be a pd.DataFrame containing all the scores computed by the different evaluators:
   accuracy        f1  name                                               distance  dataset  iter
0    0.7985  0.658301  DecisionTreeClassifier(class_weight='balanced'...       0.0    adult     0
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0    adult     0
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0    adult     0
3    0.8489  0.678716  MLPClassifier(activation='relu', alpha=0.0001,...       0.0    adult     0
0    0.7968  0.655943  DecisionTreeClassifier(class_weight='balanced'...       0.0    adult     1
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0    adult     1
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0    adult     1
3    0.8472  0.683775  MLPClassifier(activation='relu', alpha=0.0001,...       0.0    adult     1
0    0.7963  0.655272  DecisionTreeClassifier(class_weight='balanced'...       0.0    adult     2
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0    adult     2
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0    adult     2
3    0.8511  0.684467  MLPClassifier(activation='relu', alpha=0.0001,...       0.0    adult     2
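Since each evaluator is scored once per iteration, a common next step is to average the scores across iterations. The sketch below rebuilds a few rows of the table above by hand, with the evaluator names shortened for readability, and aggregates them with pandas; the variable names are only illustrative.

```python
import pandas as pd

# A hand-copied subset of the scores shown above (two evaluators, three iterations).
scores = pd.DataFrame({
    'accuracy': [0.7985, 0.8607, 0.7968, 0.8607, 0.7963, 0.8607],
    'f1': [0.658301, 0.680285, 0.655943, 0.680285, 0.655272, 0.680285],
    'name': ['DecisionTreeClassifier', 'AdaBoostClassifier'] * 3,
    'dataset': ['adult'] * 6,
    'iter': [0, 0, 1, 1, 2, 2],
})

# Mean accuracy and f1 for every dataset and evaluator across all iterations.
summary = scores.groupby(['dataset', 'name'])[['accuracy', 'f1']].mean()
print(summary)
```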
In order to use the synthesizer classes included in SDGym, you need to follow these steps:
- Import the synthesizer class from sdgym.synthesizers:
from sdgym.synthesizers import IndependentSynthesizer
- Create an instance of the synthesizer, passing any needed arguments. In this case we will use the IndependentSynthesizer, which can be instantiated with no initialization arguments:
synthesizer = IndependentSynthesizer()
- Load some data to fit your synthesizer with. In this case, we will be loading the adult dataset:
from sdgym import load_dataset
data, categorical_columns, ordinal_columns = load_dataset('adult')
- Call its fit method, passing the data as well as the lists of categorical and ordinal columns:
synthesizer.fit(data, categorical_columns, ordinal_columns)
- Call its sample method, passing the number of rows that we want to sample:
sampled = synthesizer.sample(3)
This will return a numpy matrix of sampled data with the same columns as the original data and as many rows as we have requested:
array([[5.1774925e+01, 0.0000000e+00, 5.3538445e+04, 6.0000000e+00,
8.9999313e+00, 2.0000000e+00, 1.0000000e+00, 3.0000000e+00,
2.0000000e+00, 1.0000000e+00, 3.7152294e-04, 1.9912617e-04,
1.0767025e+01, 0.0000000e+00, 0.0000000e+00],
[6.4843109e+01, 0.0000000e+00, 2.6462553e+05, 1.2000000e+01,
8.9993210e+00, 1.0000000e+00, 0.0000000e+00, 1.0000000e+00,
0.0000000e+00, 0.0000000e+00, 5.3685449e-06, 1.9797031e-03,
2.2253288e+01, 0.0000000e+00, 0.0000000e+00],
[6.5659584e+01, 5.0000000e+00, 3.6158912e+05, 8.0000000e+00,
9.0010223e+00, 0.0000000e+00, 1.2000000e+01, 3.0000000e+00,
0.0000000e+00, 0.0000000e+00, 1.0562389e-03, 0.0000000e+00,
3.9998917e+01, 0.0000000e+00, 0.0000000e+00]], dtype=float32)
Evaluating the performance of any of the demo synthesizers is as simple as:
- Creating an instance of the synthesizer:
synthesizer = IndependentSynthesizer()
- Passing the fit_sample method of the instance to the benchmark function as your synthesizer function:
benchmark(synthesizer.fit_sample)
For more details about SDGym and all its possibilities and features, please check the documentation site.
There you can learn more about how to contribute to SDGym in order to help us develop new features or cool ideas.
SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.
TGAN is a GAN-based model for synthesizing tabular data. It is also developed by MIT's Data to AI Lab and is under active development.