Adding AWS HealthOmics as a Module in "Play" tools (#954)
### Feature or Bugfix
- Feature

### Background
Currently, data.all integrates with AWS services such as SageMaker,
Athena, and QuickSight through the “Play” modules/tools Notebooks,
Worksheets, and Dashboards, respectively. This is valuable and
convenient for end users who want to process and share their data but
may not want to, or know how to, use these services in the AWS Console.

Researchers, scientists, and bioinformaticians often fall into this end
user category. One such AWS service that is popular amongst the research
community is AWS HealthOmics. HealthOmics helps users process and
generate insights from genomics and other biological data stored in the
cloud. Raw genomic data can be sent to a HealthOmics managed workflow
(aka Ready2Run workflow) that can perform various tasks such as quality
control, read alignment, transcript assembly, and gene expression
quantification. The output can then be stored in a manageable format for
querying/visualizing/sharing.

### AWS HealthOmics Integration

This feature contains both modularized backend and frontend changes for
adding HealthOmics as a “Play” module/tool to data.all. It specifically
adds the capability to view and instantiate HealthOmics Ready2Run
workflows as runs that can output and save omic data as data.all
Datasets.

### Consumption Patterns

* <ins>data.all Worksheets</ins>: Users can use Worksheets to make data
easier to query and combine with other forms of health data.
* <ins>data.all Notebooks/Studio</ins>: Users can use Notebooks and Studio
to build, train, and deploy novel machine learning algorithms on the
multiomic and multimodal data.
* <ins>data.all Dashboards</ins>: Users can use the transformed data in
Dashboards for advanced analytics and visualizations.

### Considerations

* <ins>Linked Environment and Dataset Region</ins>: The HealthOmics run
must be performed in a data.all Linked Environment that is located in an
AWS Region that supports AWS HealthOmics. Similarly, the data.all source
and destination Datasets must live in the same AWS Region where the user
will perform the HealthOmics run.
* <ins>Ready2Run Workflow Support</ins>: Currently, only Ready2Run
workflows are supported. Ready2Run workflows are pre-built workflows
designed by industry-leading third-party software companies such as
Sentieon, Inc. and NVIDIA, and also include common open-source pipelines
such as AlphaFold for protein structure prediction. Ready2Run workflows do
not require the management of software tools or workflow scripts.
Bring-your-own (also known as Private) workflows, where you bring a custom
workflow script, are not yet supported. Please note that some Ready2Run
workflows require a subscription/license from the software provider to run.

### User Journey 

This example user journey depicts an end-to-end process from viewing
available HealthOmics Ready2Run workflows to instantiating a run and
viewing its output in a data.all Worksheet.

* <ins>Initiation</ins>:
    * User navigates to the "Omics" section within data.all and browses
Ready2Run workflows
    
<img width="1894" alt="Screenshot 2024-01-17 at 11 31 23 PM"
src="https://github.com/data-dot-all/dataall/assets/28816838/122d1c96-921f-401a-8119-8b2f72779d7a">

    * User can also search for a specific workflow directly

<img width="1679" alt="Screenshot 2024-01-17 at 11 34 43 PM"
src="https://github.com/data-dot-all/dataall/assets/28816838/dc03593e-6116-44b2-a0cd-913da60054cd">

    * After clicking on a workflow, users see a detailed view of it with a
full description of what it does

<img width="1894" alt="Screenshot 2024-01-17 at 11 35 15 PM"
src="https://github.com/data-dot-all/dataall/assets/28816838/6299ad7e-c19f-4026-9810-869d082488ae">


* <ins>Creation</ins>: After clicking on a workflow, users see a
detailed view and hit “Create Run”. Users fill in the run creation form
with the following parameters:

<img width="1908" alt="Screenshot 2024-01-17 at 11 37 58 PM"
src="https://github.com/data-dot-all/dataall/assets/28816838/440e3f23-c65d-4e00-b932-0a448944853e">

    * <ins>Workflow ID</ins>: Immutable ID of the Ready2Run workflow
    * <ins>Run Name</ins>: Customizable name of the run the user will submit
    * <ins>Environment</ins>: data.all Environment AWS Account where the
HealthOmics run will be executed (NOTE: the Environment must be in an AWS
Region supported by HealthOmics, e.g. N. Virginia or London)
    * <ins>Region</ins>: Immutable Region where the run will be executed,
pre-populated from the Environment
    * <ins>Owners</ins>: data.all group who owns the run
    * <ins>Select S3 Output Destination</ins>: data.all Dataset where the
output omics data will reside (NOTE: please create this Dataset prior to
kicking off a run)
    * <ins>Run Parameters</ins>: JSON parameter input in the format expected
by the Ready2Run workflow. It is pre-populated with the correct fields, and
users paste their data into the appropriate fields, for example the raw
input data in S3 that will be processed in the run. (NOTE: the input data
does not have to be in a data.all Dataset, as long as it is accessible; for
example, raw genomic data may be hosted publicly on the AWS Registry of
Open Data and the S3 URI provided in a field here.) An illustrative
parameter template is sketched at the end of this user journey.

* <ins>History</ins>:

    * Users navigate to the Run tab at the top to view a history of the
data.all-initiated Ready2Run workflows they’ve kicked off. (NOTE: run
history deletion is still in progress)

![Screenshot 2024-01-17 at 11 40
01 PM](https://github.com/data-dot-all/dataall/assets/28816838/95af6147-332a-4837-a7a6-18ade74f9794)



* <ins>Data Consumption</ins>:
    
    * <ins>In Worksheets</ins>:
        * Users can select an existing Worksheet or create a new one.
       
![Screenshot 2024-01-17 at 11 41
20 PM](https://github.com/data-dot-all/dataall/assets/28816838/4e017983-17e1-422f-a6ee-b723967b8720)


        * Users can then query the data using SQL
        
<img width="1902" alt="Screenshot 2024-01-17 at 11 44 30 PM"
src="https://github.com/data-dot-all/dataall/assets/28816838/5a020647-ee90-4d36-b132-05893e80327b">





### Relates
- GitHub Issue - #563

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)? Yes
  - Is the input sanitized? Yes 
- What precautions are you taking before deserializing the data you
consume? N/A
  - Is injection prevented by parametrizing queries? N/A
  - Have you ensured no `eval` or similar functions are used? N/A
- Does this PR introduce any functionality or component that requires
authorization? Yes
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
Yes
  - Are you logging failed auth attempts? N/A
- Are you using or adding any cryptographic features? No
  - Do you use a standard, proven implementation?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
- Have you used the least-privilege principle? How? Yes, through scoped
policies added to the role


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: Patrick Guha <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: Kiran <[email protected]>
3 people authored Jun 14, 2024
1 parent 080fc98 commit 7a264ee
Showing 61 changed files with 2,938 additions and 34 deletions.
6 changes: 3 additions & 3 deletions backend/dataall/base/cdkproxy/requirements.txt
@@ -1,7 +1,7 @@
 aws-cdk-lib==2.99.0
-boto3==1.24.85
-boto3-stubs==1.24.85
-botocore==1.27.85
+boto3==1.28.23
+boto3-stubs==1.28.23
+botocore==1.31.23
 cdk-nag==2.7.2
 constructs==10.0.73
 starlette==0.36.3
@@ -89,6 +89,7 @@ def generate_policies(self) -> [aws_iam.ManagedPolicy]:
 'StringEquals': {
 'iam:PassedToService': [
 'glue.amazonaws.com',
+'omics.amazonaws.com',
 'lambda.amazonaws.com',
 'sagemaker.amazonaws.com',
 'states.amazonaws.com',
1 change: 1 addition & 0 deletions backend/dataall/core/environment/cdk/environment_stack.py
@@ -487,6 +487,7 @@ def create_group_environment_role(self, group: EnvironmentGroup, id: str):
 iam.ServicePrincipal('databrew.amazonaws.com'),
 iam.ServicePrincipal('codebuild.amazonaws.com'),
 iam.ServicePrincipal('codepipeline.amazonaws.com'),
+iam.ServicePrincipal('omics.amazonaws.com'),
 self.pivot_role,
 ),
 )
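
In addition to allowing the `omics.amazonaws.com` service principal above, the
environment roles need scoped HealthOmics permissions (the Security section
notes scoped policies added to the role). The CDK snippet below is a minimal
sketch of the kind of statement that could grant those permissions using the
standard `aws_iam` constructs; the exact actions and resource scoping used in
this PR may differ.

```python
from aws_cdk import aws_iam as iam

# Illustrative only: a scoped statement for listing Ready2Run workflows and
# managing runs. The actual module may use different actions and narrower
# resource ARNs.
omics_statement = iam.PolicyStatement(
    sid='OmicsWorkflowRuns',
    effect=iam.Effect.ALLOW,
    actions=[
        'omics:ListWorkflows',
        'omics:GetWorkflow',
        'omics:StartRun',
        'omics:GetRun',
        'omics:ListRuns',
        'omics:DeleteRun',
    ],
    resources=['*'],  # in practice, scope to workflow and run ARNs where possible
)
```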
41 changes: 41 additions & 0 deletions backend/dataall/modules/omics/__init__.py
@@ -0,0 +1,41 @@
"""Contains the code related to HealthOmics"""

import logging
from typing import Set, List, Type

from dataall.base.loader import ImportMode, ModuleInterface
from dataall.modules.omics.db.omics_repository import OmicsRepository

log = logging.getLogger(__name__)


class OmicsApiModuleInterface(ModuleInterface):
"""Implements ModuleInterface for omics GraphQl lambda"""

@staticmethod
def is_supported(modes: Set[ImportMode]) -> bool:
return ImportMode.API in modes

@staticmethod
def depends_on() -> List[Type['ModuleInterface']]:
from dataall.modules.s3_datasets import DatasetApiModuleInterface

return [DatasetApiModuleInterface]

def __init__(self):
import dataall.modules.omics.api

log.info('API of omics has been imported')


class OmicsCdkModuleInterface(ModuleInterface):
"""Implements ModuleInterface for omics ecs tasks"""

@staticmethod
def is_supported(modes: Set[ImportMode]) -> bool:
return ImportMode.CDK in modes

def __init__(self):
import dataall.modules.omics.cdk

log.info('CDK module of omics has been imported')
5 changes: 5 additions & 0 deletions backend/dataall/modules/omics/api/__init__.py
@@ -0,0 +1,5 @@
"""The package defines the schema for Omics Pipelines"""

from dataall.modules.omics.api import input_types, mutations, queries, types, resolvers

__all__ = ['types', 'input_types', 'queries', 'mutations', 'resolvers']
6 changes: 6 additions & 0 deletions backend/dataall/modules/omics/api/enums.py
@@ -0,0 +1,6 @@
from dataall.base.api.constants import GraphQLEnumMapper


class OmicsWorkflowType(GraphQLEnumMapper):
PRIVATE = 'PRIVATE'
READY2RUN = 'READY2RUN'
32 changes: 32 additions & 0 deletions backend/dataall/modules/omics/api/input_types.py
@@ -0,0 +1,32 @@
"""The module defines GraphQL input types for Omics Runs"""

from dataall.base.api import gql

NewOmicsRunInput = gql.InputType(
name='NewOmicsRunInput',
arguments=[
gql.Field('environmentUri', type=gql.NonNullableType(gql.String)),
gql.Field('workflowUri', type=gql.NonNullableType(gql.String)),
gql.Field('label', type=gql.NonNullableType(gql.String)),
gql.Field('destination', type=gql.String),
gql.Field('parameterTemplate', type=gql.String),
gql.Field('SamlAdminGroupName', type=gql.NonNullableType(gql.String)),
],
)

OmicsFilter = gql.InputType(
name='OmicsFilter',
arguments=[
gql.Argument(name='term', type=gql.String),
gql.Argument(name='page', type=gql.Integer),
gql.Argument(name='pageSize', type=gql.Integer),
],
)

OmicsDeleteInput = gql.InputType(
name='OmicsDeleteInput',
arguments=[
gql.Argument(name='runUris', type=gql.NonNullableType(gql.ArrayType(gql.String))),
gql.Argument(name='deleteFromAWS', type=gql.Boolean),
],
)
20 changes: 20 additions & 0 deletions backend/dataall/modules/omics/api/mutations.py
@@ -0,0 +1,20 @@
"""The module defines GraphQL mutations for Omics Pipelines"""

from dataall.base.api import gql
from .resolvers import create_omics_run, delete_omics_run
from .types import OmicsRun
from .input_types import NewOmicsRunInput, OmicsDeleteInput

createOmicsRun = gql.MutationField(
name='createOmicsRun',
type=OmicsRun,
args=[gql.Argument(name='input', type=gql.NonNullableType(NewOmicsRunInput))],
resolver=create_omics_run,
)

deleteOmicsRun = gql.MutationField(
name='deleteOmicsRun',
type=gql.Boolean,
args=[gql.Argument(name='input', type=gql.NonNullableType(OmicsDeleteInput))],
resolver=delete_omics_run,
)
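
As a usage sketch, a client could invoke `createOmicsRun` with an input
matching `NewOmicsRunInput`. The GraphQL document below only uses field names
defined in this PR; the URIs, group name, and parameter values are
placeholders.

```python
# Sketch of a client-side call to the createOmicsRun mutation defined above.
# URIs, group name, and parameter values are placeholders.
CREATE_OMICS_RUN = """
mutation createOmicsRun($input: NewOmicsRunInput!) {
  createOmicsRun(input: $input) {
    runUri
    label
    workflowUri
  }
}
"""

variables = {
    'input': {
        'environmentUri': 'environment-uri-placeholder',
        'workflowUri': 'workflow-uri-placeholder',
        'label': 'my-first-omics-run',
        'destination': 'output-dataset-uri-placeholder',
        'parameterTemplate': '{"fastq_1": "s3://example-bucket/sample_R1.fastq.gz"}',
        'SamlAdminGroupName': 'MyDataAllGroup',
    }
}
```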
27 changes: 27 additions & 0 deletions backend/dataall/modules/omics/api/queries.py
@@ -0,0 +1,27 @@
"""The module defines GraphQL queries for Omics runs"""

from dataall.base.api import gql
from .resolvers import list_omics_runs, get_omics_workflow, list_omics_workflows
from .types import OmicsRunSearchResults, OmicsWorkflow, OmicsWorkflows
from .input_types import OmicsFilter

listOmicsRuns = gql.QueryField(
name='listOmicsRuns',
args=[gql.Argument(name='filter', type=OmicsFilter)],
resolver=list_omics_runs,
type=OmicsRunSearchResults,
)

getOmicsWorkflow = gql.QueryField(
name='getOmicsWorkflow',
args=[gql.Argument(name='workflowUri', type=gql.NonNullableType(gql.String))],
type=OmicsWorkflow,
resolver=get_omics_workflow,
)

listOmicsWorkflows = gql.QueryField(
name='listOmicsWorkflows',
args=[gql.Argument(name='filter', type=OmicsFilter)],
type=OmicsWorkflows,
resolver=list_omics_workflows,
)
76 changes: 76 additions & 0 deletions backend/dataall/modules/omics/api/resolvers.py
@@ -0,0 +1,76 @@
import logging
from dataall.base.api.context import Context
from dataall.base.db import exceptions
from dataall.modules.omics.services.omics_service import OmicsService
from dataall.modules.omics.db.omics_models import OmicsRun

log = logging.getLogger(__name__)


class RequestValidator:
"""Aggregates all validation logic for operating with omics"""

@staticmethod
def required_uri(uri):
if not uri:
raise exceptions.RequiredParameter('URI')

@staticmethod
def validate_creation_request(data):
required = RequestValidator._required
if not data:
raise exceptions.RequiredParameter('data')
if not data.get('label'):
raise exceptions.RequiredParameter('name')

required(data, 'environmentUri')
required(data, 'SamlAdminGroupName')
required(data, 'workflowUri')
required(data, 'parameterTemplate')
required(data, 'destination')

@staticmethod
def _required(data: dict, name: str):
if not data.get(name):
raise exceptions.RequiredParameter(name)


def create_omics_run(context: Context, source, input=None):
RequestValidator.validate_creation_request(input)
return OmicsService.create_omics_run(
uri=input['environmentUri'], admin_group=input['SamlAdminGroupName'], data=input
)


def list_omics_runs(context: Context, source, filter: dict = None):
if not filter:
filter = {}
return OmicsService.list_user_omics_runs(filter)


def list_omics_workflows(context: Context, source, filter: dict = None):
if not filter:
filter = {}
return OmicsService.list_omics_workflows(filter)


def get_omics_workflow(context: Context, source, workflowUri: str = None):
RequestValidator.required_uri(workflowUri)
return OmicsService.get_omics_workflow(workflowUri)


def delete_omics_run(context: Context, source, input):
RequestValidator.required_uri(input.get('runUris'))
return OmicsService.delete_omics_runs(uris=input.get('runUris'), delete_from_aws=input.get('deleteFromAWS', True))


def resolve_omics_workflow(context, source: OmicsRun, **kwargs):
if not source:
return None
return OmicsService.get_omics_workflow(source.workflowUri)


def resolve_omics_run_details(context, source: OmicsRun, **kwargs):
if not source:
return None
return OmicsService.get_omics_run_details_from_aws(source.runUri)
91 changes: 91 additions & 0 deletions backend/dataall/modules/omics/api/types.py
@@ -0,0 +1,91 @@
from dataall.base.api import gql
from .resolvers import resolve_omics_workflow, resolve_omics_run_details
from dataall.core.organizations.api.resolvers import resolve_organization_by_env
from dataall.core.environment.api.resolvers import resolve_environment

OmicsWorkflow = gql.ObjectType(
name='OmicsWorkflow',
fields=[
gql.Field(name='workflowUri', type=gql.String),
gql.Field(name='id', type=gql.String),
gql.Field(name='arn', type=gql.String),
gql.Field(name='name', type=gql.String),
gql.Field(name='label', type=gql.String),
gql.Field(name='type', type=gql.String),
gql.Field(name='description', type=gql.String),
gql.Field(name='parameterTemplate', type=gql.String),
gql.Field(name='environmentUri', type=gql.String),
],
)

OmicsWorkflows = gql.ObjectType(
name='OmicsWorkflows',
fields=[
gql.Field(name='count', type=gql.Integer),
gql.Field(name='page', type=gql.Integer),
gql.Field(name='pages', type=gql.Integer),
gql.Field(name='hasNext', type=gql.Boolean),
gql.Field(name='hasPrevious', type=gql.Boolean),
gql.Field(name='nodes', type=gql.ArrayType(OmicsWorkflow)),
],
)

OmicsRunStatus = gql.ObjectType(
name='OmicsRunStatus',
fields=[gql.Field(name='status', type=gql.String), gql.Field(name='statusMessage', type=gql.String)],
)


OmicsRun = gql.ObjectType(
name='OmicsRun',
fields=[
gql.Field('runUri', type=gql.ID),
gql.Field('environmentUri', type=gql.String),
gql.Field('organizationUri', type=gql.String),
gql.Field('name', type=gql.String),
gql.Field('label', type=gql.String),
gql.Field('description', type=gql.String),
gql.Field('tags', type=gql.ArrayType(gql.String)),
gql.Field('created', type=gql.String),
gql.Field('updated', type=gql.String),
gql.Field('owner', type=gql.String),
gql.Field('workflowUri', type=gql.String),
gql.Field('SamlAdminGroupName', type=gql.String),
gql.Field('parameterTemplate', type=gql.String),
gql.Field('outputDatasetUri', type=gql.String),
gql.Field('outputUri', type=gql.String),
gql.Field(
name='environment',
type=gql.Ref('Environment'),
resolver=resolve_environment,
),
gql.Field(
name='organization',
type=gql.Ref('Organization'),
resolver=resolve_organization_by_env,
),
gql.Field(
name='workflow',
type=OmicsWorkflow,
resolver=resolve_omics_workflow,
),
gql.Field(
name='status',
type=OmicsRunStatus,
resolver=resolve_omics_run_details,
),
],
)


OmicsRunSearchResults = gql.ObjectType(
name='OmicsRunSearchResults',
fields=[
gql.Field(name='count', type=gql.Integer),
gql.Field(name='page', type=gql.Integer),
gql.Field(name='pages', type=gql.Integer),
gql.Field(name='hasNext', type=gql.Boolean),
gql.Field(name='hasPrevious', type=gql.Boolean),
gql.Field(name='nodes', type=gql.ArrayType(OmicsRun)),
],
)
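
To illustrate how these types fit together, the query sketch below lists runs
through `listOmicsRuns` and pulls the nested `workflow` and `status` fields,
which are resolved lazily via `resolve_omics_workflow` and
`resolve_omics_run_details`. Field names come from the types above; the filter
values are placeholders.

```python
# Sketch of a paginated listOmicsRuns query against the types defined above.
LIST_OMICS_RUNS = """
query listOmicsRuns($filter: OmicsFilter) {
  listOmicsRuns(filter: $filter) {
    count
    page
    pages
    hasNext
    nodes {
      runUri
      label
      outputUri
      workflow { name type }
      status { status statusMessage }
    }
  }
}
"""

variables = {'filter': {'term': '', 'page': 1, 'pageSize': 10}}
```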
1 change: 1 addition & 0 deletions backend/dataall/modules/omics/aws/__init__.py
@@ -0,0 +1 @@
"""Contains code that send requests to AWS using SDK (boto3)"""