Merge branch 'main' into mudata-0.4-tests
gtca authored Aug 8, 2024
2 parents 6e7e6ae + 810ab69 commit 438e089
Showing 13 changed files with 427 additions and 277 deletions.
3 changes: 2 additions & 1 deletion README.md
Expand Up @@ -2,6 +2,7 @@

[![Documentation Status](https://readthedocs.org/projects/mudata/badge/?version=latest)](http://mudata.readthedocs.io/)
[![PyPi version](https://img.shields.io/pypi/v/mudata)](https://pypi.org/project/mudata)
[![](https://img.shields.io/badge/scverse-core-black.svg?labelColor=white&logo=data:image/svg%2bxml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+PCFET0NUWVBFIHN2ZyBQVUJMSUMgIi0vL1czQy8vRFREIFNWRyAxLjEvL0VOIiAiaHR0cDovL3d3dy53My5vcmcvR3JhcGhpY3MvU1ZHLzEuMS9EVEQvc3ZnMTEuZHRkIj4KPHN2ZyB3aWR0aD0iMTAwJSIgaGVpZ2h0PSIxMDAlIiB2aWV3Qm94PSIwIDAgOTEgOTEiIHZlcnNpb249IjEuMSIKICAgIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIKICAgIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB4bWw6c3BhY2U9InByZXNlcnZlIgogICAgeG1sbnM6c2VyaWY9Imh0dHA6Ly93d3cuc2VyaWYuY29tLyIgc3R5bGU9ImZpbGwtcnVsZTpldmVub2RkO2NsaXAtcnVsZTpldmVub2RkO3N0cm9rZS1saW5lam9pbjpyb3VuZDtzdHJva2UtbWl0ZXJsaW1pdDoyOyI+CiAgICA8ZyBpZD0iRWJlbmVfMyI+CiAgICAgICAgPGc+CiAgICAgICAgICAgIDxwYXRoIGQ9Ik0zNSw4OS42Yy0yMi4zLC0zLjQgLTMwLjYsLTE5LjggLTMwLjYsLTE5LjhjMTAuOCwxNi45IDQzLDkuMSA1Mi45LDIuNWMxMi40LC04LjMgOCwtMTUuMyA2LjgsLTE4LjFjNS40LDcuMiA1LjMsMjMuNSAtMS4xLDI5LjRjLTUuNiw1LjEgLTE1LjMsNy45IC0yOCw2WiIgc3R5bGU9ImZpbGw6I2ZmZjtmaWxsLXJ1bGU6bm9uemVybztzdHJva2U6IzAwMDtzdHJva2Utd2lkdGg6MXB4OyIvPgogICAgICAgICAgICA8cGF0aCBkPSJNODMuOSw0My41YzIuOSwtNy4xIDAuOCwtMTIuNSAwLjUsLTEzLjNjLTAuNywtMS4zIC0xLjUsLTIuMyAtMi40LC0zLjFjLTE2LjEsLTEyLjYgLTU1LjksMSAtNzAuOSwxNi44Yy0xMC45LDExLjUgLTEwLjEsMjAgLTYuNywyNS44YzMuMSw0LjggNy45LDcuNiAxMy40LDljLTExLjUsLTEyLjQgOS44LC0zMS4xIDI5LC0zOGMyMSwtNy41IDMyLjUsLTMgMzcuMSwyLjhaIiBzdHlsZT0iZmlsbDojMzQzNDM0O2ZpbGwtcnVsZTpub256ZXJvO3N0cm9rZTojMDAwO3N0cm9rZS13aWR0aDoxcHg7Ii8+CiAgICAgICAgICAgIDxwYXRoIGQ9Ik03OS42LDUwLjRjOSwtMTAuNSA1LC0xOS43IDQuOCwtMjAuNGMtMCwwIDQuNCw3LjEgMi4yLDIyLjZjLTEuMiw4LjUgLTUuNCwxNiAtMTAuMSwxMS44Yy0yLjEsLTEuOCAtMywtNi45IDMuMSwtMTRaIiBzdHlsZT0iZmlsbDojZmZmO2ZpbGwtcnVsZTpub256ZXJvO3N0cm9rZTojMDAwO3N0cm9rZS13aWR0aDoxcHg7Ii8+CiAgICAgICAgICAgIDxwYXRoIGQ9Ik02NCw1NC4yYy0zLjMsLTQuOCAtOC4xLC03LjQgLTEyLjMsLTEwLjhjLTIuMiwtMS43IC0xNi40LC0xMS4yIC0xOS4yLC0xNS4xYy02LjQsLTYuNCAtOS41LC0xNi45IC0zLjQsLTIzLjFjLTQuNCwtMC44IC04LjIsMC4yIC0xMC42LDEuNWMtMS4xLDAuNiAtMi4xLDEuMiAtMi44LDJjLTYuN
yw2LjIgLTUuOCwxNyAtMS42LDI0LjNjNC41LDcuOCAxMy4yLDE1LjQgMjQuMywyMi44YzUuMSwzLjQgMTUuNiw4LjQgMTkuMywxNmMxMS43LC04LjEgNy42LC0xNC45IDYuMywtMTcuNloiIHN0eWxlPSJmaWxsOiNiNGI0YjQ7ZmlsbC1ydWxlOm5vbnplcm87c3Ryb2tlOiMwMDA7c3Ryb2tlLXdpZHRoOjFweDsiLz4KICAgICAgICAgICAgPHBhdGggZD0iTTM4LjcsOS44YzcuOSw2LjMgMTIuNCw5LjggMjAsOC41YzUuNywtMSA0LjksLTcuOSAtNCwtMTMuNmMtNC40LC0yLjggLTkuNCwtNC4yIC0xNS43LC00LjJjLTcuNSwtMCAtMTYuMywzLjkgLTIwLjYsNi40YzQsLTIuMyAxMS45LC0zLjggMjAuMywyLjlaIiBzdHlsZT0iZmlsbDojZmZmO2ZpbGwtcnVsZTpub256ZXJvO3N0cm9rZTojMDAwO3N0cm9rZS13aWR0aDoxcHg7Ii8+CiAgICAgICAgPC9nPgogICAgPC9nPgo8L3N2Zz4=)](https://scverse.org)
[![Powered by NumFOCUS](https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://numfocus.org)

# MuData – multimodal data
Expand All @@ -28,7 +29,7 @@ MuData
.mod
AnnData
.X -- data matrix (cells x features)
.obs -- cells metadata (assay-specific)
.obs -- cell metadata (assay-specific)
.var -- annotation of features (genes, peaks, genomic sites)
.obsm
.varm
8 changes: 7 additions & 1 deletion docs/source/changelog.rst
Expand Up @@ -17,12 +17,18 @@ It implements pull/push interface for annotations with functions :func:`mudata.M

:func:`mudata.MuData.update` performance and behaviour have been generally improved.
For compatibility reasons, this release keeps the old behaviour of pulling annotations on read/update as default.
This will be changed in the next release. In order to adopt the new behaviour, use :func:`mudata.set_options` with `pull_on_update=False`.

.. note::
If you want to adopt the new update behaviour, set ``mudata.set_options(pull_on_update=False)``. This will be the default behaviour in the next release.
With it, the annotations will not be copied from the modalities on :func:`mudata.MuData.update` implicitly.

To copy the annotations explicitly, you will need to use :func:`mudata.MuData.pull_obs` and/or :func:`mudata.MuData.pull_var`.

This release also comes with new functionalities such as :func:`mudata.to_anndata`, :func:`mudata.to_mudata`, and :func:`mudata.concat`.

:class:`mudata.MuData` objects now have a new ``.mod_names`` attribute. ``MuData.mod`` can be pretty-printed. Readers support ``fsspec``, and :func:`mudata.read_zarr` now supports ``mod-order``. The ``uns`` attribute is now properly handled by the views.


v0.2.4
------

2 changes: 1 addition & 1 deletion docs/source/conf.py
Expand Up @@ -22,7 +22,7 @@
# -- Project information -----------------------------------------------------

project = "mudata"
copyright = "2020 - 2022, Danila Bredikhin"
copyright = "2020 - 2024, Danila Bredikhin"
author = "Danila Bredikhin"


2 changes: 1 addition & 1 deletion docs/source/index.rst
Expand Up @@ -35,7 +35,7 @@ A flagship framework for multimodal omics analysis — ``muon`` — has been bui
notebooks/quickstart_mudata.ipynb
notebooks/nuances.ipynb
notebooks/axes.ipynb
notebooks/annotations_managements.ipynb
notebooks/annotations_management.ipynb

.. toctree::
:hidden:
25 changes: 24 additions & 1 deletion docs/source/io/input.rst
Expand Up @@ -61,8 +61,31 @@ MuData objects can be read and cached from remote locations including via HTTP(S
A caching layer can be added in the following way:
::
fname_cached = "filecache::" + fname
with fsspec.open(fname_cached, filecache={'cache_storage': '/tmp/'}):
with fsspec.open(fname_cached, filecache={'cache_storage': '/tmp/'}) as f:
    mdata = mudata.read_h5mu(f)


For more ``fsspec`` usage examples see `its documentation <https://filesystem-spec.readthedocs.io/>`_.

S3
^^

MuData objects in the ``.h5mu`` format stored in an S3 bucket can be read with ``fsspec`` as well:
::
storage_options = {
    'endpoint_url': 'localhost:9000',
    'key': 'AWS_ACCESS_KEY_ID',
    'secret': 'AWS_SECRET_ACCESS_KEY',
}

with fsspec.open('s3://bucket/dataset.h5mu', **storage_options) as f:
    mdata = mudata.read_h5mu(f)


MuData objects stored in the ``.zarr`` format in an S3 bucket can be read from a *mapping*:
::
import s3fs

s3 = s3fs.S3FileSystem(**storage_options)
store = s3.get_mapper('s3://bucket/dataset.zarr')
mdata = mudata.read_zarr(store)
45 changes: 39 additions & 6 deletions docs/source/io/mudata.rst
Expand Up @@ -41,14 +41,29 @@ Individual modalities can be accessed with their names via the ``.mod`` attribut
.obs & .var
^^^^^^^^^^^

Samples (cells) annotation is accessible via the ``.obs`` attribute and by default includes copies of columns from ``.obs`` data frames of individual modalities. Same goes for ``.var``, which contains annotation of variables (features).
.. warning::
Version 0.3 introduces a pull/push interface for annotations. For compatibility reasons, the old behaviour of pulling annotations on read/update is kept as the default.

This will be changed in the next release, and the annotations will not be copied implicitly.
To adopt the new behaviour, use :func:`mudata.set_options` with ``pull_on_update=False``.
The new approach to ``.update()`` and annotations is described below.

Observations columns copied from individual modalities contain modality name as their prefix, e.g. ``rna:n_genes``. Same is true for variables columns however if there are columns with identical names in ``.var`` of multiple modalities — e.g. ``n_cells``, — these columns are merged across mdalities and no prefix is added.
Sample (cell) annotations are stored in the data frame accessible via the ``.obs`` attribute. The same goes for ``.var``, which contains annotations of variables (features).

When those slots are changed in ``AnnData`` objects of modalities, e.g. new columns are added or samples (cells) are filtered out, the changes have to be fetched with the ``.update()`` method:
Copies of columns from ``.obs`` or ``.var`` data frames of individual modalities can be added with the ``.pull_obs()`` or ``.pull_var()`` methods:
::
mdata.pull_obs()
mdata.pull_var()

When the annotations are changed in ``AnnData`` objects of modalities, e.g. new columns are added, they can be propagated to the ``.obs`` or ``.var`` data frames with the same ``.pull_obs()`` or ``.pull_var()`` methods.

Observation columns copied from individual modalities contain the modality name as their prefix, e.g. ``rna:n_genes``. The same is true for variable columns; however, if there are columns with identical names in ``.var`` of multiple modalities, e.g. ``n_cells``, these columns are merged across modalities and no prefix is added.
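
The naming rule above can be illustrated with a small standalone sketch. It is not the ``mudata`` implementation: the helper names ``pulled_obs_columns`` and ``pulled_var_columns`` are hypothetical, only column *names* are modelled (not how values are merged), and "multiple modalities" is read here as "more than one modality", which is an assumption.

```python
from collections import Counter
from typing import Dict, List


def pulled_obs_columns(mod_columns: Dict[str, List[str]]) -> List[str]:
    """Prefix every observation column with its modality name."""
    return [f"{mod}:{col}" for mod, cols in mod_columns.items() for col in cols]


def pulled_var_columns(mod_columns: Dict[str, List[str]]) -> List[str]:
    """Prefix variable columns unless the same name occurs in several modalities."""
    # Count in how many modalities each column name appears.
    counts = Counter(col for cols in mod_columns.values() for col in set(cols))
    result: List[str] = []
    for mod, cols in mod_columns.items():
        for col in cols:
            # Assumption: a name shared by more than one modality stays unprefixed.
            name = col if counts[col] > 1 else f"{mod}:{col}"
            if name not in result:
                result.append(name)
    return result


obs_cols = pulled_obs_columns({"rna": ["n_genes"], "prot": ["n_counts"]})
var_cols = pulled_var_columns({"rna": ["n_cells", "highly_variable"], "prot": ["n_cells"]})
```

Here ``n_cells`` appears once and without a prefix in ``var_cols``, while ``highly_variable`` becomes ``rna:highly_variable``.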

When there are changes directly related to observations or variables, e.g. samples (cells) are filtered out or features (genes) are renamed, the changes have to be fetched with the ``.update()`` method:
::
mdata.update()


.obsm
^^^^^

Expand All @@ -58,13 +73,13 @@ Multidimensional annotations of samples (cells) are accessible in the ``.obsm``
mdata.obsm
# => MuAxisArrays with keys: X_umap, X_mofa, prot, rna

As another multidimensional embedding, this slot contains boolean vectors, one per modality, indicating if samples (cells) are available in the respective modality. For instance, if all samples (cells) are the same across modalities, all values in those vectors are ``True``.
As another multidimensional embedding, this slot may contain boolean vectors, one per modality, indicating if samples (cells) are available in the respective modality. For instance, if all samples (cells) are the same across modalities, all values in those vectors are ``True``.
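
As a rough illustration of what such boolean vectors encode, the following sketch computes one vector per modality from plain lists of observation names. The function name ``membership_vectors`` is hypothetical, and the code only mirrors the description above rather than how ``mudata`` builds these vectors.

```python
from typing import Dict, List, Sequence


def membership_vectors(
    global_obs: Sequence[str], mods: Dict[str, Sequence[str]]
) -> Dict[str, List[bool]]:
    """For each modality, flag which global observations it contains."""
    vectors: Dict[str, List[bool]] = {}
    for mod, names in mods.items():
        present = set(names)  # set lookup keeps the check cheap
        vectors[mod] = [name in present for name in global_obs]
    return vectors


vectors = membership_vectors(
    ["cell1", "cell2", "cell3"],
    {"rna": ["cell1", "cell2", "cell3"], "prot": ["cell2", "cell3"]},
)
```

In this toy input every value for ``rna`` is ``True``, while ``prot`` has ``False`` for ``cell1``.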


Container's shape
-----------------

The ``MuData`` object's shape is represented by two numbers calculated as a sum of the shapes of individual modalities — one for the number of observations and one for the number of variables.
The ``MuData`` object's shape is represented by two numbers calculated from the shapes of individual modalities — one for the number of observations and one for the number of variables.
::
mdata.shape
# => (9573, 132465)
Expand All @@ -80,10 +95,28 @@ By default, variables are always counted as belonging uniquely to a single modal

If the shape of a modality is changed, :func:`mudata.MuData.update` has to be run to bring the respective updates to the ``MuData`` object.
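
A minimal sketch of how such a shape could be derived, assuming observations are shared between modalities, observation names are unique within each modality, and every variable belongs to exactly one modality. The helper ``container_shape`` is hypothetical and works on plain name lists, not on ``MuData`` objects.

```python
from typing import Dict, Sequence, Tuple


def container_shape(
    mods: Dict[str, Tuple[Sequence[str], Sequence[str]]]
) -> Tuple[int, int]:
    """Return (n_obs, n_vars) for modalities given as (obs_names, var_names)."""
    all_obs = set()
    n_vars = 0
    for obs_names, var_names in mods.values():
        all_obs.update(obs_names)  # shared observations are counted once
        n_vars += len(var_names)   # variables are counted per modality
    return len(all_obs), n_vars


shape = container_shape(
    {
        "rna": (["c1", "c2", "c3"], ["g1", "g2"]),
        "prot": (["c2", "c3", "c4"], ["p1"]),
    }
)
```

With these inputs the result is ``(4, 3)``: four distinct observations and three variables in total.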


Keeping containers up to date
-----------------------------

Modalities inside the ``MuData`` container are full-fledged ``AnnData`` objects, which can be operated independently with any tool that works on ``AnnData`` objects. The shape of the ``MuData`` object as well as metadata fetched from individual modalities and boolean vectors of observations (in ``.obsm``) & variables (in ``.varm``) for each modality will then reflect the previous state of the data. To keep the container up to date, there is an ``.update()`` method that syncs the data.
.. warning::
Version 0.3 introduces a pull/push interface for annotations. For compatibility reasons, the old behaviour of pulling annotations on read/update is kept as the default.

This will be changed in the next release, and the annotations will not be copied implicitly.
To adopt the new behaviour, use :func:`mudata.set_options` with ``pull_on_update=False``.
The new approach to ``.update()`` and annotations is described below.

Modalities inside the ``MuData`` container are full-fledged ``AnnData`` objects, which can be operated on independently with any tool that works on ``AnnData`` objects.
When modalities are changed externally, the shape of the ``MuData`` object as well as metadata fetched from individual modalities will still reflect the previous state of the data.
To keep the container up to date, there is an ``.update()`` method that syncs the ``.obs_names`` and ``.var_names`` of the ``MuData`` object with the ones of the modalities.


Managing annotations
--------------------

To fetch the corresponding annotations from individual modalities, there are :func:`mudata.MuData.pull_obs` and :func:`mudata.MuData.pull_var` methods.

To update the annotations of individual modalities with the global annotations, :func:`mudata.MuData.push_obs` and :func:`mudata.MuData.push_var` methods can be used.


Backed containers
28 changes: 26 additions & 2 deletions docs/source/io/spec.rst
Expand Up @@ -22,12 +22,36 @@ Building on top of the `AnnData spec <https://anndata.readthedocs.io/en/latest/f

Modalities are stored in a ``.mod`` group of the ``.h5mu`` file in alphabetical order. To preserve the order of the modalities, there is an attribute ``"mod-order"`` that lists the modalities in their respective order. If some modalities are missing from that attribute, the attribute is to be ignored.
::
>>> dict(f["mod-order"])
{'mod-order': array(['rna', 'protein'], dtype=object)}
>>> dict(f["mod"].attrs)
{'mod-order': array(['prot', 'rna'], dtype=object)}
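
The fallback rule for ``"mod-order"`` can be sketched as follows. The function ``resolve_mod_order`` is hypothetical and operates on plain lists of names; skipping names in the attribute that have no matching group is an assumption, since the text above only covers the case of missing modalities.

```python
from typing import Iterable, List, Optional


def resolve_mod_order(
    stored: Iterable[str], mod_order: Optional[Iterable[str]]
) -> List[str]:
    """Return modality names in the order a reader might expose them."""
    stored_sorted = sorted(stored)  # groups are listed alphabetically
    if mod_order is None:
        return stored_sorted
    # Drop duplicates while keeping the first occurrence of each name.
    order = list(dict.fromkeys(str(m) for m in mod_order))
    # Rule from the spec: ignore the attribute if it misses any modality.
    if not set(stored_sorted) <= set(order):
        return stored_sorted
    present = set(stored_sorted)
    # Assumption: names without a matching group are skipped.
    return [m for m in order if m in present]


ordered = resolve_mod_order(["prot", "rna"], ["rna", "prot"])
fallback = resolve_mod_order(["atac", "prot", "rna"], ["rna", "prot"])
```

``ordered`` follows the attribute, whereas ``fallback`` reverts to alphabetical order because ``atac`` is not listed.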


.obsmap and .varmap
-------------------

While in practice ``MuData`` relies on ``.obs_names`` and ``.var_names`` to collate global observations and variables, it also allows disambiguating between items with the same name using integer maps. For example, global observations will have non-zero integer values in ``.obsmap["rna"]`` if they are present in the ``"rna"`` modality. If an observation or a variable is missing from a modality, it will correspond to a ``0`` value.
::
>>> list(f["obsmap"].keys())
['prot', 'rna']
>>> import numpy as np
>>> np.array(f["obsmap"]["rna"])
array([ 1, 2, 3, ..., 3889, 3890, 3891], dtype=uint32)
>>> np.array(f["obsmap"]["prot"])
array([ 1, 2, 3, ..., 3889, 3890, 3891], dtype=uint32)

>>> list(f["varmap"].keys())
['prot', 'rna']
>>> np.array(f["varmap"]["rna"])
array([ 0, 0, 0, ..., 17804, 17805, 17806], dtype=uint32)
>>> np.array(f["varmap"]["prot"])
array([1, 2, 3, ..., 0, 0, 0], dtype=uint32)
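
A simplified sketch of how such an integer map relates global names to a modality. It assumes that names are unique within the modality and that non-zero values are 1-based positions, which matches the output above but is not stated as a requirement by the text. The helper ``build_index_map`` is hypothetical.

```python
from typing import List, Sequence


def build_index_map(global_names: Sequence[str], mod_names: Sequence[str]) -> List[int]:
    """Map each global name to its 1-based position in the modality, or 0."""
    # Assumption: names are unique within the modality.
    position = {name: i + 1 for i, name in enumerate(mod_names)}
    return [position.get(name, 0) for name in global_names]


global_vars = ["CD3", "CD4", "GeneA", "GeneB"]
varmap = {
    "prot": build_index_map(global_vars, ["CD3", "CD4"]),
    "rna": build_index_map(global_vars, ["GeneA", "GeneB"]),
}
```

Each variable here belongs to a single modality, so every position is non-zero in exactly one of the two maps.
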

.axis
-----

Axis describes which dimensions are shared: observations (``axis=0``), variables (``axis=1``), or both (``axis=-1``). It is recorded in the ``axis`` attribute of the file:
::
>>> f.attrs["axis"]
0

Multimodal datasets, which have observations shared between modalities, will have ``axis=0``. If no axis attribute is available, such as in files written with older versions of this specification, it is assumed to be ``0``.
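
The default described above could be handled by a reader along these lines. This is a sketch over any mapping-like attribute container; ``read_axis`` and ``AXIS_LABELS`` are hypothetical names, and rejecting other values is an assumption.

```python
from typing import Any, Mapping

# Meaning of each axis value, as listed in the text above.
AXIS_LABELS = {0: "observations", 1: "variables", -1: "both"}


def read_axis(attrs: Mapping[str, Any]) -> int:
    """Return the shared axis, falling back to 0 when the attribute is absent."""
    axis = int(attrs.get("axis", 0))
    if axis not in AXIS_LABELS:
        # Assumption: values other than 0, 1 and -1 are treated as errors.
        raise ValueError(f"unexpected axis value: {axis}")
    return axis


legacy_axis = read_axis({})            # file without the attribute
shared_vars = read_axis({"axis": 1})   # variables are shared
```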