Enable parsing of loaded Dataframes. Improve function naming. #82

Open
wants to merge 6 commits into
base: main
2 changes: 1 addition & 1 deletion .readthedocs.yaml
@@ -8,7 +8,7 @@ version: 2
build:
os: ubuntu-20.04
tools:
python: "3.12"
python: "3.13"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
10 changes: 5 additions & 5 deletions README.md
@@ -753,18 +753,18 @@ edge_types = types.all.edges()
OntoWeaver provides a way to parallelize the extraction of nodes and edges from the provided database, with the aim of
reducing the runtime of the extraction process. By default, the parallel processing is disabled, and the data frame
is processed in a sequential manner. To enable parallel processing, the user can pass the maximum number of workers to
the `extract_all` function.
the `extract_table` function.

For example, to enable parallel processing with 16 workers, the user can call the function as follows:

```python
adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = 16)
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = 16)
```

To enable parallel processing with a good default working on any machine, you can use the [approach suggested by the concurrent module](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor).
```python
import os
adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))
```
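Note that `os.process_cpu_count()` was added in Python 3.13, so the snippet above fails on older interpreters. The following is a minimal sketch of a portable variant; `default_worker_count` is a hypothetical helper name, not part of OntoWeaver, and the fallback to `os.cpu_count()` is an assumption about what is acceptable on older Python versions.

```python
import os

def default_worker_count(cap: int = 32, extra: int = 4) -> int:
    """Return min(cap, usable CPUs + extra), mirroring the formula above."""
    # os.process_cpu_count() exists only on Python 3.13+;
    # fall back to os.cpu_count() on older interpreters.
    count_cpus = getattr(os, "process_cpu_count", os.cpu_count)
    # Both functions may return None when the count cannot be determined.
    return min(cap, (count_cpus() or 1) + extra)

workers = default_worker_count()
# The result would then be passed as the `parallel_mapping` argument:
# adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping=workers)
```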

## Information Fusion
@@ -795,8 +795,8 @@ The generic workflow is to first produce nodes and edges —as usual—
then call the `fusion.reconciliate` function on the produced nodes and edges:
```python
# Call the mappings:
adapter_A = ontoweaver.tabular.extract_all(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_all(input_table_B, mapping_B)
adapter_A = ontoweaver.tabular.extract_table(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_table(input_table_B, mapping_B)

# Aggregate the nodes and edges:
nodes = adapter_A.nodes + adapter_B.nodes
171 changes: 76 additions & 95 deletions docs/README.rst
@@ -11,7 +11,7 @@ OntoWeaver allows writing a simple declarative mapping to express how
columns from a `Pandas <https://pandas.pydata.org/>`__ table are to be
converted as typed nodes or edges in an SKG.

.. image:: docs/OntoWeaver_logo__big.svg
|image1|

It provides a simple layer of abstraction on top of
`Biocypher <https://biocypher.org>`__, which remains responsible for
@@ -128,74 +128,53 @@ populate the configured database. By default, the output script file is
saved in a subdirectory of ``./biocypher-out/``, whose name is a
timestamp from when the adapter has been executed.

To actually insert data in an SKG database, you will have to use
Biocypher export API:
To configure your data mapping, you first have to define the
mapping that you want to apply to your data. Then, you will need a
BioCypher configuration file (which mainly indicates your ontology and
backend), and a schema configuration file (indicating which node and
edge types you want).

.. code:: python
To produce any output, you need to run the OntoWeaver mapping on your
data. We provide a command line interface to do so, called
``ontoweave``.

import yaml
import pandas as pd
import biocypher
import ontoweaver

# Load ontology
bc = biocypher.BioCypher(
biocypher_config_path = "tests/simplest/biocypher_config.yaml",
schema_config_path = "tests/simplest/schema_config.yaml"
)

# Load data
table = pd.read_csv("tests/simplest/data.csv")

# Load mapping
with open("tests/simplest/mapping.yaml") as fd:
mapping = yaml.full_load(fd)

# Run the adapter
adapter = ontoweaver.tabular.extract_all(table, mapping)

# Write nodes
bc.write_nodes( adapter.nodes )

# Write edges
bc.write_edges( adapter.edges )

# Write import script
bc.write_import_call()

# Now you have a script that you can run to actually insert data.

Additionally, you will have to define a strategy for the naming of
mapped items when creating nodes, by defining an ``affix`` and
``separator`` to be used during node creation. The ``affix`` used will
represent the ontology type of the item in question. Unless otherwise
defined, the ``affix`` defaults to ``suffix`` and ``separator`` defaults
to ``:``. This can be modified by changing the variables in the
``extract_all()`` function. ``Affix`` can be either a ``prefix``,
``suffix`` or ``none`` - in case you decide not to include the ontology
type in the node naming strategy. Special care should be exercised in
case there are several types of the same name in the database. There is
a possibility that nodes of the same name will be merged together during
mapping, so an ``affix`` should be present. Below are some examples of
node naming strategies. ``NAME`` refers to the name of the item in
question in your database, and ``TYPE`` refers to the type of the item
in the ontology.
If you use some default config file (usually ``biocypher_config.yaml``)
and schema (usually ``schema_config.yaml``), the simplest call would be:

.. code:: python
.. code:: sh

ontoweave my_data.csv:my_mapping.yaml

If you want to indicate your own configuration files, pass their names as
options:

.. code:: sh

ontoweave --config biocypher_config.yaml --schema schema_config.yaml data-1.1.csv:map-1.yaml data-1.2.csv:map-1.yaml data-A.csv:map-A.yaml

[...]
Note that you can use the same mapping on several data files, and/or
several mappings.

# Affix defaults to "suffix", and separator defaults to ":"
# Node represented as [NAME]:[TYPE]
adapter = ontoweaver.tabular.extract_all(table, mapping)
To actually insert data in an SKG database, you need to run the import
files that are prepared by the previous command. Either you ask
*ontoweave* to run it for you:

# Node represented as [TYPE]-[NAME]
adapter = ontoweaver.tabular.extract_all(table, mapping, affix = "prefix", separator = "-")
.. code:: sh

# Node represented as [NAME]
adapter = ontoweaver.tabular.extract_all(table, mapping, affix = "none")
ontoweave my_data.csv:my_mapping.yaml --import-script-run

[...]
or you can capture the import script path and run it yourself:

.. code:: sh

script=$(ontoweave my_data.csv:my_mapping.yaml) # Capture.
$script # Run.

You will find more options by running the help command:

.. code:: sh

ontoweave --help
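
For scripted pipelines, the same command line can be assembled from
Python. This is only a sketch: ``build_ontoweave_command`` is a
hypothetical helper, not part of OntoWeaver. It uses only the options
shown above (``--config``, ``--schema``, ``--import-script-run``) and
assumes that the ``ontoweave`` executable is on your ``PATH``.

```python
def build_ontoweave_command(pairs, config=None, schema=None, run_import=False):
    """Build the argument list for an `ontoweave` call.

    `pairs` is a list of (data_file, mapping_file) tuples; each one becomes
    a `data:mapping` argument, as in the shell examples above.
    """
    cmd = ["ontoweave"]
    if config:
        cmd += ["--config", config]
    if schema:
        cmd += ["--schema", schema]
    cmd += [f"{data}:{mapping}" for data, mapping in pairs]
    if run_import:
        cmd.append("--import-script-run")
    return cmd

cmd = build_ontoweave_command(
    [("data-1.1.csv", "map-1.yaml"), ("data-A.csv", "map-A.yaml")],
    config="biocypher_config.yaml",
    schema="schema_config.yaml",
)
print(" ".join(cmd))
# To actually run it (requires ontoweave to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when file
names contain spaces.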

Mapping API
-----------
@@ -205,9 +184,9 @@ of a mapping from a table to ontology types. As such, its core input is
a dictionary, that takes the form of a YAML file. This configuration
file indicates:

- to which (node) type to map each line of the table,
- to which (node) type to map columns of the table,
- with which (edge) types to map relationships between nodes.
- to which (node) type to map each line of the table,
- to which (node) type to map columns of the table,
- with which (edge) types to map relationships between nodes.

The following explanations assume that you are familiar with
`Biocypher’s
@@ -522,7 +501,7 @@ characters will be removed and substituted with an underscore, in case
they are located in between allowed characters.

By default, the transformer will allow alphanumeric characters (A-Z,
a-z, 0-9), underscore (_), backtick (`), dot (.), and parentheses (),
a-z, 0-9), underscore (\_), backtick (\`), dot (.), and parentheses (),
and the substitute will be an empty string. If you wish to use the
default settings, you can write:

@@ -637,13 +616,13 @@ the mapping configurations.

Here is the list of available synonyms:

- ``subject`` = ``row`` = ``entry`` = ``line`` = ``source``
- ``column`` = ``columns`` = ``fields``
- ``to_object`` = ``to_target`` = ``to_node``
- ``from_subject`` = ``from_source``
- ``via_relation`` = ``via_edge`` = ``via_predicate``
- ``to_property`` = ``to_properties``
- ``for_object`` = ``for_objects``
- ``subject`` = ``row`` = ``entry`` = ``line`` = ``source``
- ``column`` = ``columns`` = ``fields``
- ``to_object`` = ``to_target`` = ``to_node``
- ``from_subject`` = ``from_source``
- ``via_relation`` = ``via_edge`` = ``via_predicate``
- ``to_property`` = ``to_properties``
- ``for_object`` = ``for_objects``
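
To illustrate what such synonyms mean in practice, here is a minimal
sketch of a lookup that reduces every keyword to a single form. It is
not OntoWeaver's implementation: ``canonical_keyword`` is a hypothetical
helper, and treating the first name of each line as the canonical one is
an assumption made for the example.

```python
# Synonym table copied from the list above.
# Assumption: the first keyword of each line is taken as the canonical form.
SYNONYMS = {
    "subject": ["row", "entry", "line", "source"],
    "column": ["columns", "fields"],
    "to_object": ["to_target", "to_node"],
    "from_subject": ["from_source"],
    "via_relation": ["via_edge", "via_predicate"],
    "to_property": ["to_properties"],
    "for_object": ["for_objects"],
}

# Reverse index: alias -> canonical keyword.
CANONICAL = {alias: key for key, aliases in SYNONYMS.items() for alias in aliases}

def canonical_keyword(keyword: str) -> str:
    """Return the canonical form of a mapping keyword (unknown keywords are unchanged)."""
    return CANONICAL.get(keyword, keyword)

print(canonical_keyword("row"))       # prints "subject"
print(canonical_keyword("via_edge"))  # prints "via_relation"
```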

How To
------
@@ -845,14 +824,14 @@ edges from the provided database, with the aim of reducing the runtime
of the extraction process. By default, the parallel processing is
disabled, and the data frame is processed in a sequential manner. To
enable parallel processing, the user can pass the maximum number of
workers to the ``extract_all`` function.
workers to the ``extract_table`` function.

For example, to enable parallel processing with 16 workers, the user can
call the function as follows:

.. code:: python

adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = 16)
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = 16)

To enable parallel processing with a good default working on any
machine, you can use the `approach suggested by the concurrent
@@ -861,7 +840,7 @@ module <https://docs.python.org/3/library/concurrent.futures.html#concurrent.fut
.. code:: python

import os
adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))

Information Fusion
------------------
@@ -896,8 +875,8 @@ edges:
.. code:: python

# Call the mappings:
adapter_A = ontoweaver.tabular.extract_all(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_all(input_table_B, mapping_B)
adapter_A = ontoweaver.tabular.extract_table(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_table(input_table_B, mapping_B)

# Aggregate the nodes and edges:
nodes = adapter_A.nodes + adapter_B.nodes
@@ -955,9 +934,9 @@ A node being composed of an identifier, a type label, and a properties
dictionary, the ``serialize`` module provides function objects
reflecting the useful combinations of those components:

- ``ID`` (only the identifier)
- ``IDLabel`` (the identifier and the type label)
- ``All`` (the identifier, the type label, and the properties)
- ``ID`` (only the identifier)
- ``IDLabel`` (the identifier and the type label)
- ``All`` (the identifier, the type label, and the properties)

The user can instantiate those function objects, and pass them to the
``congregate`` module, to find which nodes are duplicates of each other.
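
The idea behind these function objects can be sketched without
OntoWeaver itself. In the example below, nodes are plain
``(identifier, label, properties)`` tuples, and ``IdKey``,
``IdLabelKey``, ``AllKey`` and ``group_duplicates`` are hypothetical
stand-ins written for this illustration; they are not the actual
``serialize`` or ``congregate`` API.

```python
from collections import defaultdict

class IdKey:
    """Serialize a node using only its identifier."""
    def __call__(self, node):
        node_id, _label, _props = node
        return str(node_id)

class IdLabelKey:
    """Serialize a node using its identifier and type label."""
    def __call__(self, node):
        node_id, label, _props = node
        return f"{node_id}|{label}"

class AllKey:
    """Serialize a node using identifier, type label and properties."""
    def __call__(self, node):
        node_id, label, props = node
        return f"{node_id}|{label}|{sorted(props.items())}"

def group_duplicates(nodes, serializer):
    """Group nodes that share the same serialization key."""
    groups = defaultdict(list)
    for node in nodes:
        groups[serializer(node)].append(node)
    return dict(groups)

nodes = [
    ("1", "patient", {"age": "42"}),
    ("1", "patient", {"sex": "F"}),
    ("1", "sample", {}),
]
print(len(group_duplicates(nodes, IdKey())))       # prints 1
print(len(group_duplicates(nodes, IdLabelKey())))  # prints 2
print(len(group_duplicates(nodes, AllKey())))      # prints 3
```

The more components the key includes, the stricter the notion of
"duplicate" becomes.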
@@ -974,27 +953,27 @@ For steps 2 to 4, OntoWeaver provides the ``merge`` module, which
provides ways to merge two nodes’ components into a single one. It is
separated into two submodules, depending on the type of the component:

- ``string`` for components that are strings (i.e. identifier and type
label),
- ``dictry`` for components that are dictionaries (i.e. properties).
- ``string`` for components that are strings (i.e. identifier and type
label),
- ``dictry`` for components that are dictionaries (i.e. properties).

The ``string`` submodule provides:

- ``UseKey``: replace the identifier with the serialization used at the
congregation step,
- ``UseFirst``/``UseLast``: replace the type label with the first/last
one seen,
- ``EnsureIdentical``: if two nodes’ components are not equal, raise an
error,
- ``OrderedSet``: aggregate all the components of all the seen nodes
into a single, lexicographically ordered list (joined by a
user-defined separator).
- ``UseKey``: replace the identifier with the serialization used at the
congregation step,
- ``UseFirst``/``UseLast``: replace the type label with the first/last
one seen,
- ``EnsureIdentical``: if two nodes’ components are not equal, raise an
error,
- ``OrderedSet``: aggregate all the components of all the seen nodes
into a single, lexicographically ordered list (joined by a
user-defined separator).

The ``dictry`` submodule provides:

- ``Append``: merge all seen dictionaries in a single one, and
aggregate all the values of all the duplicated fields into a single
lexicographically ordered list (joined by a user-defined separator).
- ``Append``: merge all seen dictionaries in a single one, and aggregate
all the values of all the duplicated fields into a single
lexicographically ordered list (joined by a user-defined separator).
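
As an illustration of the ``Append`` behaviour described above, here is
a self-contained sketch. ``append_merge`` is a hypothetical function,
not the ``merge.dictry.Append`` class itself; converting values to
strings and dropping repeated values are assumptions made for the
example.

```python
def append_merge(dicts, separator=";"):
    """Merge dictionaries into one; values of duplicated fields are
    joined into a lexicographically ordered, separator-joined string."""
    collected = {}
    for d in dicts:
        for key, value in d.items():
            # Assumption: values are compared as strings and de-duplicated.
            collected.setdefault(key, set()).add(str(value))
    return {key: separator.join(sorted(values)) for key, values in collected.items()}

merged = append_merge(
    [{"source": "table_B", "age": "42"}, {"source": "table_A"}],
    separator=",",
)
print(merged)  # prints {'source': 'table_A,table_B', 'age': '42'}
```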

For example, to fuse “congregated” nodes, one can do:

@@ -1112,3 +1091,5 @@ deciding their type based on their properties), implement a
If you need to decide how to fuse whole *sets* of duplicated nodes (for
instance if you need to know all duplicated nodes before deciding which
type to set), implement a ``fusion.Fusioner`` directly.

.. |image1| image:: docs/OntoWeaver_logo__big.svg
4 changes: 2 additions & 2 deletions docs/readme_sections/information_fusion.rst
@@ -31,8 +31,8 @@ edges:
.. code:: python

# Call the mappings:
adapter_A = ontoweaver.tabular.extract_all(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_all(input_table_B, mapping_B)
adapter_A = ontoweaver.tabular.extract_table(input_table_A, mapping_A)
adapter_B = ontoweaver.tabular.extract_table(input_table_B, mapping_B)

# Aggregate the nodes and edges:
nodes = adapter_A.nodes + adapter_B.nodes
6 changes: 6 additions & 0 deletions docs/readme_sections/overview.rst
@@ -1,3 +1,5 @@
Overview
--------

OntoWeaver is a tool for importing table data in Semantic Knowledge
Graphs (SKG) databases.
@@ -6,6 +8,8 @@ OntoWeaver allows writing a simple declarative mapping to express how
columns from a `Pandas <https://pandas.pydata.org/>`__ table are to be
converted as typed nodes or edges in an SKG.

|image1|

It provides a simple layer of abstraction on top of
`Biocypher <https://biocypher.org>`__, which remains responsible for
doing the ontology alignment, supporting several graph database
@@ -35,3 +39,5 @@ looking like:
metadata: # Optional properties added to every node and edge.
- source: "My OntoWeaver adapter"
- version: "v1.2.3"

.. |image1| image:: docs/OntoWeaver_logo__big.svg
6 changes: 3 additions & 3 deletions docs/readme_sections/parallel_processing.rst
Original file line number Diff line number Diff line change
@@ -6,14 +6,14 @@ edges from the provided database, with the aim of reducing the runtime
of the extraction process. By default, the parallel processing is
disabled, and the data frame is processed in a sequential manner. To
enable parallel processing, the user can pass the maximum number of
workers to the ``extract_all`` function.
workers to the ``extract_table`` function.

For example, to enable parallel processing with 16 workers, the user can
call the function as follows:

.. code:: python

adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = 16)
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = 16)

To enable parallel processing with a good default working on any
machine, you can use the `approach suggested by the concurrent
@@ -22,4 +22,4 @@ module <https://docs.python.org/3/library/concurrent.futures.html#concurrent.fut
.. code:: python

import os
adapter = ontoweaver.tabular.extract_all(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))
adapter = ontoweaver.tabular.extract_table(table, mapping, parallel_mapping = min(32, (os.process_cpu_count() or 1) + 4))