initial commit
BeGeiger committed Oct 8, 2023
1 parent 9e925bc commit 9c95f87
Showing 115 changed files with 7,783 additions and 2 deletions.
5 changes: 5 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Auto detect text files and perform LF normalization
* text=auto

# Remove the tutorial jupyter notebook from the language calculation of github
tutorial.ipynb linguist-vendored
38 changes: 38 additions & 0 deletions .github/workflows/test.yaml
@@ -0,0 +1,38 @@
name: test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    name: test ${{ matrix.py }} on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        py:
          - "3.9"
          - "3.10"
          - "3.11"
        os:
          - ubuntu-latest
          - windows-latest
          - macos-latest
    steps:
      - name: Checkout sources
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.py }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install tox tox-gh-actions
      - name: Run test suite
        run: tox
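
The `strategy.matrix` above combines every Python version with every operating system, so the workflow runs nine jobs per push or pull request. The short sketch below only enumerates those combinations to make the expansion explicit; the variable names are chosen for illustration and are not part of the workflow. Note that the versions are quoted in the YAML because an unquoted `3.10` would be parsed as the number `3.1`.

```python
from itertools import product

# Values copied from the matrix in .github/workflows/test.yaml
python_versions = ["3.9", "3.10", "3.11"]
operating_systems = ["ubuntu-latest", "windows-latest", "macos-latest"]

# One job per (python version, operating system) pair, named like the
# workflow's `name: test ${{ matrix.py }} on ${{ matrix.os }}` template
jobs = [
    f"test {py} on {os_name}"
    for py, os_name in product(python_versions, operating_systems)
]

for job in jobs:
    print(job)
```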
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
.ipynb_checkpoints
**/__pycache__/

dist/

.tox/
23 changes: 23 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,23 @@
# See https://pre-commit.com for more information
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: check-toml
      - id: check-yaml
      - id: end-of-file-fixer
      - id: mixed-line-ending
  - repo: https://github.com/python-poetry/poetry
    rev: 1.6.1
    hooks:
      - id: poetry-check
      - id: poetry-lock
  - repo: https://github.com/psf/black
    rev: 23.9.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: ["--profile", "black"]
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,11 @@
#### Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

---
---
72 changes: 70 additions & 2 deletions README.md
@@ -1,2 +1,70 @@
# salamander
Salamander is a non-negative matrix factorization framework for signature analysis
# Salamander

[![Python versions supported][python-image]][python-url]
[![License][license-image]][license-url]
[![Code style][style-image]][style-url]

[python-image]: https://img.shields.io/badge/python-3.9%20|%203.10%20|%203.11-blue.svg
[python-url]: https://github.com/BeGeiger/CorrNMF
[license-image]: https://img.shields.io/badge/License-MIT-yellow.svg
[license-url]: https://github.com/BeGeiger/CorrNMF/blob/main/LICENSE
[style-image]: https://img.shields.io/badge/code%20style-black-000000.svg
[style-url]: https://github.com/psf/black

Salamander is a non-negative matrix factorization (NMF) framework for signature analysis.
It implements multiple NMF algorithms and common visualizations, and it can be easily customized and extended.

---

## Installation

PyPI:
```bash
pip install salamander-learn
```

## Usage

The following example illustrates the basic syntax:

```python
import pandas as pd
import salamander as sal

# samples and features have to be named appropriately
data_path = "..."
data = pd.read_csv(data_path, index_col=0)

# NMF with a Poisson noise model
model = sal.KLNMF(n_signatures=5)
model.fit(data)

# barplot
model.plot_signatures()

# stacked barplot
model.plot_exposures()

# signature correlation
model.plot_correlation()

# sample correlation
model.plot_correlation(data="samples")

# dimensionality reduction of the exposures
# method: umap, pca or tsne
model.plot_embeddings(method="umap")
```

For examples of how to customize any NMF algorithm and the plots, check out [the tutorial](). The following algorithms are currently available:
* [NMF with KL-divergence loss](https://proceedings.neurips.cc/paper_files/paper/2000/file/f9d1152547c0bde01830b7e8bd60024c-Paper.pdf)
* [minimum-volume NMF](https://browse.arxiv.org/pdf/1907.02404.pdf)
* [a variant of correlated NMF](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=87224164eef14589b137547a3fa81f06eef9bbf4)
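
As a rough illustration of the first algorithm in the list, the multiplicative updates of Lee and Seung for the generalized KL divergence can be sketched in NumPy. This is not Salamander's implementation: the function names `kl_divergence` and `klnmf_sketch`, the features-by-samples orientation of `X`, the random initialization, and the small constant `eps` are all assumptions made for the example.

```python
import numpy as np


def kl_divergence(X, WH, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(X || WH)."""
    X = np.asarray(X, dtype=float)
    return float(np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH))


def klnmf_sketch(X, n_signatures, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative matrix X (features x samples) as W @ H.

    W holds the signatures (features x n_signatures) and
    H holds the exposures (n_signatures x samples).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    W = rng.random((n_features, n_signatures)) + eps
    H = rng.random((n_signatures, n_samples)) + eps
    losses = []

    for _ in range(n_iter):
        # Multiplicative update of the exposures H
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        # Multiplicative update of the signatures W
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
        losses.append(kl_divergence(X, W @ H))

    return W, H, losses


# Toy count data: 12 features, 8 samples (hypothetical, for illustration only)
X_toy = np.random.default_rng(1).poisson(5.0, size=(12, 8))
W, H, losses = klnmf_sketch(X_toy, n_signatures=3, n_iter=50)
print(f"KL divergence after the first iteration: {losses[0]:.3f}")
print(f"KL divergence after the last iteration:  {losses[-1]:.3f}")
```

Because the updates only multiply by non-negative factors, `W` and `H` stay non-negative throughout, which is why this family of updates is a common starting point for count data with a Poisson noise model.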

## License

MIT

## Changelog

Consult the [CHANGELOG](https://github.com/BeGeiger/CorrNMF/blob/main/CHANGELOG.md) file for enhancements and fixes of each version.
84 changes: 84 additions & 0 deletions data/pcawg_breast_indel.csv

Large diffs are not rendered by default.

97 changes: 97 additions & 0 deletions data/pcawg_breast_sbs.csv

Large diffs are not rendered by default.

1,261 changes: 1,261 additions & 0 deletions poetry.lock

Large diffs are not rendered by default.

40 changes: 40 additions & 0 deletions pyproject.toml
@@ -0,0 +1,40 @@
[tool.poetry]
name = "salamander-learn"
version = "0.1.1"
description = "Salamander is a non-negative matrix factorization framework for signature analysis"
license = "MIT"
authors = ["Benedikt Geiger"]
maintainers = [
    "Benedikt Geiger <[email protected]>",
]
packages = [{ include = "salamander", from = "src" }]


readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9,<3.12"
fastcluster = "^1.2.6"
matplotlib = "^3.7.1"
numba = "^0.57"
numpy = "^1.24.3"
pandas = "^1.5.3"
scikit-learn = "^1.3.0"
scipy = "^1.10.1"
seaborn = "^0.13.0"
umap-learn = "^0.5.4"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.2"
pre-commit = "^3.4.0"
tox = "^4.11.3"

[tool.pytest.ini_options]
# /site-packages/umap/__init__.py:36: DeprecationWarning: pkg_resources is deprecated as an API.
filterwarnings = [
    "ignore::DeprecationWarning:umap.*:",
]

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
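
The dependency versions above use Poetry's caret constraints, which allow updates that do not change the left-most non-zero version component. The helper below is a minimal sketch of that rule for plain numeric versions; `caret_range` is a hypothetical function written for this illustration and is not part of Poetry's API.

```python
def caret_range(spec):
    """Expand a caret constraint such as "^1.24.3" into an explicit range."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]

    # Index of the left-most non-zero component,
    # or the last given component if all of them are zero
    idx = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)

    # The upper bound bumps that component and drops everything after it
    upper = parts[:idx] + [parts[idx] + 1]

    def pad(version_parts):
        # Pad to three components, e.g. [0, 58] -> "0.58.0"
        return ".".join(str(x) for x in (version_parts + [0, 0, 0])[:3])

    return f">={pad(parts)},<{pad(upper)}"


for spec in ["^1.24.3", "^0.57", "^0.13.0", "^0.5.4"]:
    print(spec, "->", caret_range(spec))
```

Under this rule, `numba = "^0.57"` and `seaborn = "^0.13.0"` are restricted to a single minor series, whereas `numpy = "^1.24.3"` accepts any later 1.x release.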
11 changes: 11 additions & 0 deletions src/salamander/__init__.py
@@ -0,0 +1,11 @@
"""
Salamander: a non-negative matrix factorization framework for signature analysis
================================================================================
"""
from .nmf_framework.corrnmf_det import CorrNMFDet
from .nmf_framework.klnmf import KLNMF
from .nmf_framework.multimodal_corrnmf import MultimodalCorrNMF
from .nmf_framework.mvnmf import MvNMF

__version__ = "0.1.0"
__all__ = ["CorrNMFDet", "KLNMF", "MvNMF", "MultimodalCorrNMF"]
88 changes: 88 additions & 0 deletions src/salamander/consts.py
@@ -0,0 +1,88 @@
NUCLEOTIDES = ["A", "C", "G", "T"]

SBS_TYPES_6 = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
SBS_TYPES_96 = [
    f"{n1}[{sbs_6}]{n2}"
    for sbs_6 in SBS_TYPES_6
    for n1 in NUCLEOTIDES
    for n2 in NUCLEOTIDES
]

# fmt: off
INDEL_TYPES_83 = [
    "DEL.C.1.1", "DEL.C.1.2", "DEL.C.1.3", "DEL.C.1.4", "DEL.C.1.5", "DEL.C.1.6+",
    "DEL.T.1.1", "DEL.T.1.2", "DEL.T.1.3", "DEL.T.1.4", "DEL.T.1.5", "DEL.T.1.6+",
    "INS.C.1.0", "INS.C.1.1", "INS.C.1.2", "INS.C.1.3", "INS.C.1.4", "INS.C.1.5+",
    "INS.T.1.0", "INS.T.1.1", "INS.T.1.2", "INS.T.1.3", "INS.T.1.4", "INS.T.1.5+",
    "DEL.repeats.2.1", "DEL.repeats.2.2", "DEL.repeats.2.3",
    "DEL.repeats.2.4", "DEL.repeats.2.5", "DEL.repeats.2.6+",
    "DEL.repeats.3.1", "DEL.repeats.3.2", "DEL.repeats.3.3",
    "DEL.repeats.3.4", "DEL.repeats.3.5", "DEL.repeats.3.6+",
    "DEL.repeats.4.1", "DEL.repeats.4.2", "DEL.repeats.4.3",
    "DEL.repeats.4.4", "DEL.repeats.4.5", "DEL.repeats.4.6+",
    "DEL.repeats.5+.1", "DEL.repeats.5+.2", "DEL.repeats.5+.3",
    "DEL.repeats.5+.4", "DEL.repeats.5+.5", "DEL.repeats.5+.6+",
    "INS.repeats.2.0", "INS.repeats.2.1", "INS.repeats.2.2",
    "INS.repeats.2.3", "INS.repeats.2.4", "INS.repeats.2.5+",
    "INS.repeats.3.0", "INS.repeats.3.1", "INS.repeats.3.2",
    "INS.repeats.3.3", "INS.repeats.3.4", "INS.repeats.3.5+",
    "INS.repeats.4.0", "INS.repeats.4.1", "INS.repeats.4.2",
    "INS.repeats.4.3", "INS.repeats.4.4", "INS.repeats.4.5+",
    "INS.repeats.5+.0", "INS.repeats.5+.1", "INS.repeats.5+.2",
    "INS.repeats.5+.3", "INS.repeats.5+.4", "INS.repeats.5+.5+",
    "DEL.MH.2.1",
    "DEL.MH.3.1", "DEL.MH.3.2",
    "DEL.MH.4.1", "DEL.MH.4.2", "DEL.MH.4.3",
    "DEL.MH.5+.1", "DEL.MH.5+.2", "DEL.MH.5+.3", "DEL.MH.5+.4", "DEL.MH.5+.5+"
]
# fmt: on

# 10 colors
COLORS_MATHEMATICA = [
    (0.368417, 0.506779, 0.709798),
    (0.880722, 0.611041, 0.142051),
    (0.560181, 0.691569, 0.194885),
    (0.922526, 0.385626, 0.209179),
    (0.528288, 0.470624, 0.701351),
    (0.772079, 0.431554, 0.102387),
    (0.363898, 0.618501, 0.782349),
    (1.0, 0.75, 0.0),
    (0.280264, 0.715, 0.429209),
    (0.0, 0.0, 0.0),
]

# Trinucleotide colors for the 96 dimensional mutation spectrum
COLORS_TRINUCLEOTIDES = [
    (0.33, 0.75, 0.98),
    (0.0, 0.0, 0.0),
    (0.85, 0.25, 0.22),
    (0.78, 0.78, 0.78),
    (0.51, 0.79, 0.24),
    (0.89, 0.67, 0.72),
]

COLORS_SBS96 = [COLORS_TRINUCLEOTIDES[i // 16] for i in range(96)]

COLORS_INDEL = [
    "#FCBD6F",  # 1bp Del C
    "#FD8001",  # 1bp Del T
    "#B0DC8B",  # 1bp Ins C
    "#35A02E",  # 1bp Ins T
    "#FCC9B4",  # 2bp Del Repeats
    "#FC896B",  # 3bp Del Repeats
    "#F04432",  # 4bp Del Repeats
    "#BC1A1A",  # 5+ bp Del Repeats
    "#CFE0F0",  # 2bp Ins Repeats
    "#94C3DF",  # 3bp Ins Repeats
    "#4A98C8",  # 4bp Ins Repeats
    "#1665AA",  # 5+ bp Ins Repeats
    "#E1E0ED",  # 2bp Del MH
    "#B5B5D8",  # 3bp Del MH
    "#8683BC",  # 4bp Del MH
    "#624099",  # 5+bp Del MH
]

# 12 * 6 + 11 = 83 colors
n_times = 12 * [6] + [1, 2, 3, 5]
COLORS_INDEL83 = [n * [col] for n, col in zip(n_times, COLORS_INDEL)]
COLORS_INDEL83 = [col for color_list in COLORS_INDEL83 for col in color_list]
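
The constants in `consts.py` rely on two counting arguments: 6 substitution types times 4 × 4 flanking nucleotides give 96 SBS types, and 12 indel groups of 6 categories plus microhomology groups of sizes 1, 2, 3 and 5 give 83 indel colors. The self-contained sketch below rebuilds both constructions and checks the counts. The `group_colors` placeholders stand in for the 16 hex codes and are an assumption of this example.

```python
NUCLEOTIDES = ["A", "C", "G", "T"]
SBS_TYPES_6 = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# Same comprehension as in consts.py: the substitution type varies slowest,
# so each block of 16 consecutive entries shares one substitution type
SBS_TYPES_96 = [
    f"{n1}[{sbs_6}]{n2}"
    for sbs_6 in SBS_TYPES_6
    for n1 in NUCLEOTIDES
    for n2 in NUCLEOTIDES
]

# Hypothetical placeholders for the 16 indel group colors
group_colors = [f"color_{i}" for i in range(16)]

# 12 groups with 6 categories each, then 4 microhomology groups
n_times = 12 * [6] + [1, 2, 3, 5]

# Repeat each group color once per category in that group
colors_83 = [col for n, col in zip(n_times, group_colors) for _ in range(n)]

print(len(SBS_TYPES_96), SBS_TYPES_96[0], SBS_TYPES_96[16])
print(len(colors_83))
```

The block structure of `SBS_TYPES_96` is also what makes `COLORS_TRINUCLEOTIDES[i // 16]` assign one color per substitution type.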
1 change: 1 addition & 0 deletions src/salamander/nmf_framework/__init__.py
@@ -0,0 +1 @@
""