MoDE: Multi-objective-Data-Embedding

This repository contains the code and results for the paper "An Interpretable Data Embedding under Uncertain Distance Information", published at the International Conference on Data Mining (ICDM) in 2020.

The visualization below shows successive iterations of MoDE converging on the well-known S-curve dataset.

To get a glimpse of the advantages of using MoDE in data visualization, you may watch the conference presentation:

Usage

MoDE is implemented in a Python package called MoDE_embeddings. To install it, run the following command in a Python environment that has pip installed:

pip install MoDE-embeddings

Example: Running MoDE on the swiss roll data and computing the distance, correlation and order metrics

After installing MoDE embeddings, you can follow the script below to create MoDE embeddings for the swiss roll data:

from MoDE_embeddings.MoDE import MoDE
from MoDE_embeddings.metrics import distance_metric, correlation_metric, order_preservation
from sklearn.datasets import make_swiss_roll

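# Generate the swiss roll dataset; the second return value (the position of
# each sample along the roll) is used as the score
data, score = make_swiss_roll(n_samples=1000, random_state=1)

# Create the MoDE embedder and compute the 2D embedding
mode = MoDE(n_neighbor=20, n_components=2)
x_2d_mode = mode.fit_transform(data, score)

# Compute the distance, correlation, and order preservation metrics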
R_d = distance_metric(data, x_2d_mode, n_neighbor=20)
R_c = correlation_metric(data, x_2d_mode, n_neighbor=20)
R_o = order_preservation(data, mode.P, n_neighbor=20, score=score)

print(f"R_d = {R_d}, R_c = {R_c}, R_o = {R_o}")

For a full example of comparing MoDE with t-SNE, Isomap, and UMAP in terms of distance and correlation metrics, run the demo.py script in the main folder.

Details

Multi-objective Data Embedding (MoDE) is a 2D data embedding that captures, with high fidelity, multiple facets of the data relationships:

  • correlations,
  • distances, and
  • orders or importance rankings.

A unique characteristic of MoDE is that, unlike most visualization techniques, it does not require exact distances between the objects. Instead, we can give ranges of lower and upper bound distances between objects, which means that MoDE can effectively visualize compressed or uncertain data.
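
As a rough illustration of what "lower and upper bound distances" can look like, the sketch below derives such bounds from the triangle inequality. It assumes a hypothetical uncertainty model in which each object is known only to lie within some radius of a stored center (for example, after lossy compression). The function name `distance_bounds` and the sample values are made up for this example and are not part of the MoDE_embeddings package.

```python
import math


def distance_bounds(center_a, center_b, radius_a, radius_b):
    """Return (lower, upper) bounds on the distance between two uncertain points.

    Assumption (hypothetical model): each true point lies within `radius`
    of its stored `center`. The bounds follow from the triangle inequality.
    """
    d = math.dist(center_a, center_b)
    lower = max(0.0, d - radius_a - radius_b)
    upper = d + radius_a + radius_b
    return lower, upper


# Illustrative data: three stored centers and their uncertainty radii
centers = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
radii = [0.5, 1.0, 0.25]

# Collect the bounds for every pair of objects
bounds = {}
for i in range(len(centers)):
    for j in range(i + 1, len(centers)):
        bounds[(i, j)] = distance_bounds(centers[i], centers[j], radii[i], radii[j])

for pair, (lower, upper) in bounds.items():
    print(pair, lower, upper)
```

An embedding method that accepts such ranges only needs the pair of bounds for each pair of objects, not the exact distance between them.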

Moreover, this embedding method enhances interpretability because:

  1. It incorporates the ranks or scores of the data samples (if the dataset has them) into the embedding: points with higher scores are placed at higher angles in 2D, which makes the result interpretable.
  2. The embedding typically results in a "half-moon" visualization of the data. Because the user typically sees a similar shape across datasets, understanding and interpretation are easier. With many other techniques, not only does each dataset produce a different visualization, but different runs of the same method may also give different results.
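
To make the "higher score, higher angle" idea concrete, here is a minimal sketch that assigns angles by score rank and converts polar coordinates to 2D points. It is not MoDE's optimization procedure: the rank-based mapping, the angular range of [0, π/2], the radii, and the helper names `scores_to_angles` and `polar_to_xy` are all assumptions chosen for illustration.

```python
import math


def scores_to_angles(scores, max_angle=math.pi / 2):
    """Assign each sample an angle that grows with its score rank.

    Assumption: angles are spread evenly over [0, max_angle]; the real
    method may use a different range and mapping.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    angles = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        angles[idx] = max_angle * rank / max(1, len(scores) - 1)
    return angles


def polar_to_xy(radius, angle):
    """Convert polar coordinates to a 2D point."""
    return radius * math.cos(angle), radius * math.sin(angle)


# Illustrative scores and radii (hypothetical values)
scores = [10.0, 250.0, 42.0, 7.5]
radii = [1.0, 1.2, 0.9, 1.1]

angles = scores_to_angles(scores)
points = [polar_to_xy(r, a) for r, a in zip(radii, angles)]

for s, a, p in zip(scores, angles, points):
    print(f"score={s}, angle={a:.3f}, point=({p[0]:.3f}, {p[1]:.3f})")
```

Reading such a plot counterclockwise then corresponds to moving from low-scoring to high-scoring samples.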

In recent work, we have also extended MoDE beyond 2D so that it can project onto any number of dimensions.

Useful Links

This repository contains both the Python and MATLAB implementations of MoDE. Note that you can replicate the experimental results of the ICDM paper using the MATLAB implementation of MoDE.

Below you can see the visualization of MoDE embeddings for a dataset of stocks with 2252 samples and 1024 features. The market capitalization of each stock was used as its "rank": a higher rank places the object at a higher angular position. The values of the distance, correlation, and order preservation metrics are shown at the top of the plot.

  • $R_d$ shows how well pairwise distances are captured (1 is best).
  • $R_o$ shows how well "orders", or ranks, are captured (1 is best).
  • $R_c$ shows how well correlations are captured (1 is best).
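
For intuition about what a score of 1 means for order preservation, the sketch below computes a simplified proxy: the fraction of sample pairs whose score order matches the order of their polar angles in a 2D embedding. This is only an assumption-laden illustration. The function `pairwise_order_agreement` is hypothetical, and the `order_preservation` metric in the package may be defined differently.

```python
import math
from itertools import combinations


def pairwise_order_agreement(scores, embedding_2d):
    """Fraction of pairs whose score order matches their angular order.

    A simplified proxy for illustration, not the package's R_o definition.
    Pairs with equal scores are skipped.
    """
    angles = [math.atan2(y, x) for x, y in embedding_2d]
    agree = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue
        total += 1
        if (scores[i] < scores[j]) == (angles[i] < angles[j]):
            agree += 1
    return agree / total if total else 1.0


scores = [1.0, 2.0, 3.0, 4.0]

# Embedding whose angles increase with the scores
ordered = [(math.cos(a), math.sin(a)) for a in (0.1, 0.5, 0.9, 1.3)]
# Same points with the two middle samples swapped
swapped = [ordered[0], ordered[2], ordered[1], ordered[3]]

perfect_value = pairwise_order_agreement(scores, ordered)
swapped_value = pairwise_order_agreement(scores, swapped)
print(perfect_value, swapped_value)
```

In this toy case the ordered embedding scores 1, while swapping two neighbors breaks one of the six pairs and lowers the value.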

[Figure: MoDE embedding of the stocks dataset, with the metric values shown at the top of the plot]

If you find this code useful or use it in a publication or research, please cite [1,2]. We would also love to hear how you have used this code.

References

[1] N. Freris, M. Vlachos, A. Ajalloeian: "An Interpretable Data Embedding under Uncertain Distance Information", Proc. of IEEE ICDM 2020

[2] N. Freris, A. Ajalloeian, M. Vlachos: "Interpretable Embedding and Visualization of Compressed Data", ACM Transactions on Knowledge Discovery from Data (TKDD), 2023

Description of the folders

  • "Experiment_Classification_Accuracy": This folder contains the code for the experiments comparing MoDE with baseline embeddings in terms of classification accuracy (when the model is trained on the embeddings).

  • "MATLAB_implementation": This folder contains the MATLAB implementation of MoDE.

  • "Python_implemetation": This folder contains the Python implementation of MoDE.

  • "benchmark": This folder contains the code for benchmarking the MATLAB and Python implementations of MoDE in terms of results and execution time.