forked from sdv-dev/SDGym

Benchmarking Synthetic data generation methods and Introducing Conditional Tabular GAN.


An open source project from Data to AI Lab at MIT.

SDGym - Synthetic Data Gym

Overview

Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators for non-temporal tabular data. SDGym is based on the paper Modeling Tabular data using Conditional GAN, and the project is part of the Data to AI Laboratory at MIT.

Benchmarking a synthesizer is a process in which your synthesizer generates synthetic versions of several datasets. Then, each pair of real and synthetic datasets is evaluated with multiple scores.

What is a synthesizer function?

In order to use SDGym you will need a synthesizer function. This is a function that takes as input a numpy matrix with real data and outputs another numpy matrix with the same shape filled with synthesized data.

Also, alongside the real data, some additional variables informing about the column contents will be passed, which means that the exact signature of the function will be like this:

def my_synthesizer_function(
    real_data: numpy.ndarray,
    categorical_columns: list,
    ordinal_columns: list
) -> numpy.ndarray:
    # returns the synthesized data
    ...

If your synthesizer implements a different interface, you can wrap it in a function like this:

def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    # ...do all necessary steps here...
    return synthesized_data

This function should contain all the parameters and arguments needed to configure your synthesizer, and should call it to generate the new synthesized data based on the real data that is passed in.
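As a rough illustration of such a wrapper, the sketch below adapts a toy model that has its own interface. `BootstrapModel`, its `train` and `generate` methods, and the seed value are hypothetical names invented for this example; they are not part of SDGym.

```python
import numpy as np


class BootstrapModel:
    """Hypothetical synthesizer with its own interface (not part of SDGym)."""

    def __init__(self, seed=0):
        self._rng = np.random.default_rng(seed)
        self._rows = None

    def train(self, table):
        # "Training" here just memorizes the real rows.
        self._rows = np.asarray(table)

    def generate(self, count):
        # Draw `count` row indexes with replacement and return those rows.
        indexes = self._rng.integers(0, len(self._rows), size=count)
        return self._rows[indexes]


def my_synthesizer_function(real_data, categorical_columns, ordinal_columns):
    # This toy model ignores the column type lists; a real one would use them.
    model = BootstrapModel(seed=42)
    model.train(real_data)
    return model.generate(len(real_data))


real = np.arange(12, dtype=np.float32).reshape(4, 3)
synthetic = my_synthesizer_function(real, [1], [])
```

All configuration (here, only the seed) lives inside the wrapper, so the function exposed to the benchmark keeps the three-argument signature described above.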

What data should your synthesizer work with?

As mentioned in the previous section, the main input of SDGym is the synthesizer to be benchmarked, which is expected to be a function that takes a single table of data as input and returns a single table of data as output.

The inputs for your synthesizer function should be:

  • real_data: a 2D numpy.ndarray with the real data that your synthesizer will attempt to imitate.
  • categorical_columns: a list with the indexes of any columns that should be considered categorical regardless of their type.
  • ordinal_columns: a list with the indexes of any integer columns that should be treated as ordinal values.

And the output should be a single 2D numpy.ndarray with the exact same shape as the real_data matrix.
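The output requirement can be checked before running a full benchmark. The helper below is a minimal sketch of such a check; `check_synthesizer_output` is a hypothetical name and is not provided by SDGym.

```python
import numpy as np


def check_synthesizer_output(real_data, synthesized_data):
    """Raise ValueError if the output breaks the contract described above."""
    if not isinstance(synthesized_data, np.ndarray):
        raise ValueError('output must be a numpy.ndarray')
    if synthesized_data.ndim != 2:
        raise ValueError('output must be 2-dimensional')
    if synthesized_data.shape != real_data.shape:
        raise ValueError(
            'expected shape {}, got {}'.format(real_data.shape, synthesized_data.shape)
        )


real = np.zeros((5, 3), dtype=np.float32)
# A matching 5x3 array passes silently.
check_synthesizer_output(real, np.ones((5, 3), dtype=np.float32))
```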

Benchmark datasets

All the datasets used for the benchmarking can be found inside the sdgym S3 bucket in the form of an .npz numpy matrix archive and a .json metadata file that contains information about the dataset structure and its columns.

In order to load these datasets in the same format as they will be passed to your synthesizer function you can use the sdgym.load_dataset function passing the name of the dataset to load.

In this example, we will load the adult dataset:

from sdgym import load_dataset

data, categorical_columns, ordinal_columns = load_dataset('adult')

This will return a numpy matrix with the data that will be passed to your synthesizer function, as well as the list of indexes for the categorical and ordinal columns:

>>> data
array([[2.70000e+01, 0.00000e+00, 1.77119e+05, ..., 4.40000e+01,
        0.00000e+00, 0.00000e+00],
       [2.70000e+01, 0.00000e+00, 2.16481e+05, ..., 4.00000e+01,
        0.00000e+00, 0.00000e+00],
       [2.50000e+01, 0.00000e+00, 2.56263e+05, ..., 4.00000e+01,
        0.00000e+00, 0.00000e+00],
       ...,
       [4.50000e+01, 0.00000e+00, 2.07540e+05, ..., 4.00000e+01,
        0.00000e+00, 1.00000e+00],
       [5.10000e+01, 0.00000e+00, 1.80807e+05, ..., 4.00000e+01,
        0.00000e+00, 0.00000e+00],
       [6.10000e+01, 4.00000e+00, 1.86451e+05, ..., 4.00000e+01,
        0.00000e+00, 1.00000e+00]], dtype=float32)
>>> categorical_columns
[1, 5, 6, 7, 8, 9, 13, 14]
>>> ordinal_columns
[3]
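The two index lists can be used to split the matrix by column type. The sketch below uses a small hand-made array as a stand-in for the values returned by load_dataset, and it assumes that every column not listed as categorical or ordinal is continuous; that assumption is not stated explicitly in this document.

```python
import numpy as np

# Hand-made stand-in for the values returned by load_dataset.
data = np.array([
    [27.0, 0.0, 3.0],
    [45.0, 2.0, 1.0],
    [61.0, 0.0, 2.0],
], dtype=np.float32)
categorical_columns = [1]
ordinal_columns = [2]

# Assumption: columns in neither list hold continuous values.
discrete = set(categorical_columns) | set(ordinal_columns)
continuous_columns = [i for i in range(data.shape[1]) if i not in discrete]

# Distinct category codes found in each categorical column.
categories = {i: np.unique(data[:, i]).tolist() for i in categorical_columns}
```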

Demo Synthesizers

In order to get started using the benchmarking tool, some demo Synthesizers have been included in the library.

These synthesizers are classes that can be imported from the sdgym.synthesizers module and have the following methods:

  • fit: Fits the synthesizer on the data. Expects the following arguments:
    • data (numpy.ndarray): 2 dimensional Numpy matrix with the real data to learn from.
    • categorical_columns (list or tuple): List of indexes of the columns that are categorical within the dataset.
    • ordinal_columns (list or tuple): List of indexes of the columns that are ordinal within the dataset.
  • sample: Generates new data resembling the original dataset. Expects the following arguments:
    • n_samples (int): Number of samples to generate.
  • fit_sample: Fits the synthesizer on the dataset and then samples as many rows as there were in the original dataset. It expects the same arguments as the fit method, and is ready to be directly passed to the benchmark function in order to evaluate the synthesizer performance.

A complete example of how to use them can be found below in the Usage section.
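To make the interface concrete, here is a minimal sketch of a class exposing the same three methods. `UniformToySynthesizer` is a hypothetical toy written for this example, not one of the synthesizers shipped in sdgym.synthesizers, and its uniform sampling strategy and fixed seed are arbitrary choices.

```python
import numpy as np


class UniformToySynthesizer:
    """Hypothetical class mirroring the fit / sample / fit_sample interface."""

    def fit(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
        # Remember per-column ranges and which columns hold discrete values.
        self._minimums = data.min(axis=0)
        self._maximums = data.max(axis=0)
        self._discrete = list(categorical_columns) + list(ordinal_columns)
        self._n_rows = len(data)
        self._dtype = data.dtype

    def sample(self, n_samples):
        # Draw each column uniformly between its observed minimum and maximum.
        rng = np.random.default_rng(0)
        sampled = rng.uniform(
            self._minimums, self._maximums, size=(n_samples, len(self._minimums))
        )
        # Round discrete columns to whole-number codes.
        if self._discrete:
            sampled[:, self._discrete] = np.round(sampled[:, self._discrete])
        return sampled.astype(self._dtype)

    def fit_sample(self, data, categorical_columns=tuple(), ordinal_columns=tuple()):
        # Fit, then return as many rows as the original data had.
        self.fit(data, categorical_columns, ordinal_columns)
        return self.sample(self._n_rows)


toy_data = np.array(
    [[1, 0, 10], [5, 2, 20], [3, 1, 30], [2, 2, 15]], dtype=np.float32
)
toy_output = UniformToySynthesizer().fit_sample(toy_data, [1], [])
```

Because fit_sample takes the same three arguments as a synthesizer function and returns a matrix with the original number of rows, a bound method such as the one above has the shape the benchmark expects.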

Install

Requirements

SDGym has been developed and tested on Python 3.5 and 3.6.

Also, although it is not strictly required, the usage of a virtualenv is highly recommended in order to avoid interfering with other software installed in the system where SDGym is run.

These are the minimum commands needed to create a virtualenv using python3.6 for SDGym:

pip install virtualenv
virtualenv -p $(which python3.6) sdgym-venv

Afterwards, you have to execute this command to have the virtualenv activated:

source sdgym-venv/bin/activate

Remember to execute it every time you start a new console to work on SDGym!

Install with pip

After creating the virtualenv and activating it, we recommend using pip in order to install SDGym:

pip install sdgym

This will pull and install the latest stable release from PyPi.

Install from source

Alternatively, with your virtualenv activated, you can clone the repository and install it from source by running make install on the stable branch:

git clone git@github.com:DAI-Lab/SDGym.git
cd SDGym
git checkout stable
make install

Install for Development

If you want to contribute to the project, a few more steps are required to make the project ready for development.

First, please head to the GitHub page of the project and fork it under your own username by clicking the fork button in the upper right corner of the page.

Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:

git clone git@github.com:{your username}/SDGym.git
cd SDGym
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature

Finally, install the project with the following command, which will install some additional dependencies for code linting and testing.

make install-develop

Make sure to use them regularly while developing by running the commands make lint and make test.

Compile C++ dependencies

In order to be able to use all the features from SDGym, some dependencies written in C++ need to be compiled.

In order to do this:

  1. Make sure you have installed all the dependencies needed to compile C++. In Linux distributions based on Ubuntu, this can be done with the following command:
sudo apt-get install build-essential
  2. Execute:
make compile

Usage

Benchmark

All you need to do in order to use the SDGym Benchmark is import the sdgym.benchmark function and call it, passing it your synthesizer function:

from sdgym import benchmark

scores = benchmark(my_synthesizer_function)

The output of the benchmark function will be a pd.DataFrame containing all the scores computed by the different evaluators:

   accuracy        f1                                               name  distance dataset  iter
0    0.7985  0.658301  DecisionTreeClassifier(class_weight='balanced'...       0.0   adult     0
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0   adult     0
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0   adult     0
3    0.8489  0.678716  MLPClassifier(activation='relu', alpha=0.0001,...       0.0   adult     0
0    0.7968  0.655943  DecisionTreeClassifier(class_weight='balanced'...       0.0   adult     1
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0   adult     1
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0   adult     1
3    0.8472  0.683775  MLPClassifier(activation='relu', alpha=0.0001,...       0.0   adult     1
0    0.7963  0.655272  DecisionTreeClassifier(class_weight='balanced'...       0.0   adult     2
1    0.8607  0.680285  AdaBoostClassifier(algorithm='SAMME.R', base_e...       0.0   adult     2
2    0.7948  0.660040  LogisticRegression(C=1.0, class_weight='balanc...       0.0   adult     2
3    0.8511  0.684467  MLPClassifier(activation='relu', alpha=0.0001,...       0.0   adult     2
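Since each evaluator is scored once per iteration, it is often convenient to average the scores. The sketch below rebuilds a few rows of the table above by hand, with shortened evaluator names, and aggregates them with pandas; it only assumes the column names shown in the sample output.

```python
import pandas as pd

# A few rows copied from the sample output above, with shortened names.
scores = pd.DataFrame({
    'accuracy': [0.7985, 0.8607, 0.7968, 0.8607],
    'f1': [0.658301, 0.680285, 0.655943, 0.680285],
    'name': ['DecisionTreeClassifier', 'AdaBoostClassifier'] * 2,
    'dataset': ['adult'] * 4,
    'iter': [0, 0, 1, 1],
})

# Average each metric over iterations, per dataset and evaluator.
summary = scores.groupby(['dataset', 'name'])[['accuracy', 'f1']].mean()
print(summary)
```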

Using the Demo Synthesizers

In order to use the synthesizer classes included in SDGym, you need to follow these steps:

  1. Import the synthesizer class from sdgym.synthesizers:
from sdgym.synthesizers import IndependentSynthesizer
  2. Create an instance of the synthesizer, passing any needed arguments. In this case we will use the IndependentSynthesizer, which can be instantiated with no initialization arguments:
synthesizer = IndependentSynthesizer()
  3. Load some data to fit your synthesizer with. In this case, we will be loading the adult dataset:
from sdgym import load_dataset

data, categorical_columns, ordinal_columns = load_dataset('adult')
  4. Call its fit method, passing the data as well as the lists of categorical and ordinal columns:
synthesizer.fit(data, categorical_columns, ordinal_columns)
  5. Call its sample method, passing the number of rows that we want to sample:
sampled = synthesizer.sample(3)

This will return a numpy matrix of sampled data with the same columns as the original data and as many rows as we have requested:

array([[5.1774925e+01, 0.0000000e+00, 5.3538445e+04, 6.0000000e+00,
        8.9999313e+00, 2.0000000e+00, 1.0000000e+00, 3.0000000e+00,
        2.0000000e+00, 1.0000000e+00, 3.7152294e-04, 1.9912617e-04,
        1.0767025e+01, 0.0000000e+00, 0.0000000e+00],
       [6.4843109e+01, 0.0000000e+00, 2.6462553e+05, 1.2000000e+01,
        8.9993210e+00, 1.0000000e+00, 0.0000000e+00, 1.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 5.3685449e-06, 1.9797031e-03,
        2.2253288e+01, 0.0000000e+00, 0.0000000e+00],
       [6.5659584e+01, 5.0000000e+00, 3.6158912e+05, 8.0000000e+00,
        9.0010223e+00, 0.0000000e+00, 1.2000000e+01, 3.0000000e+00,
        0.0000000e+00, 0.0000000e+00, 1.0562389e-03, 0.0000000e+00,
        3.9998917e+01, 0.0000000e+00, 0.0000000e+00]], dtype=float32)

Benchmarking the Demo Synthesizers

Evaluating the performance of any of the Demo synthesizers is as simple as:

  1. Creating an instance of the synthesizer:
synthesizer = IndependentSynthesizer()
  2. Passing the fit_sample method of the instance to the benchmark function as your synthesizer function:
benchmark(synthesizer.fit_sample)

What's next?

For more details about SDGym and all its possibilities and features, please check the documentation site.

There you can learn more about how to contribute to SDGym and help us develop new features or cool ideas.

Related Projects

SDV

SDV, for Synthetic Data Vault, is the end-user library for synthesizing data in development under the HDI Project. SDV allows you to easily model and sample relational datasets using Copulas through a simple API. Other features include anonymization of Personally Identifiable Information (PII) and preserving relational integrity on sampled records.

TGAN

TGAN is a GAN-based model for synthesizing tabular data. It is also developed by MIT's Data to AI Lab and is under active development.
