Adding AWS HealthOmics as a Module in "Play" tools (#954)

### Feature or Bugfix
- Feature

### Background
Currently, data.all has integrations to AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to, or know how to, use these services in the AWS Console. Researchers, scientists, and bioinformaticians often fall into this end user category. One AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.

### AWS HealthOmics Integration
This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.

### Consumption Patterns
* <ins>data.all Worksheets</ins>: Users can use Worksheets to make data easier to query and combine with other forms of health data.
* <ins>data.all Notebooks/Studio</ins>: Users can use Notebooks and Studio to build, train, and deploy novel machine learning algorithms on the multiomic and multimodal data.
* <ins>data.all Dashboards</ins>: Users can use the transformed data in Dashboards for advanced analytics and visualizations.

### Considerations
* <ins>Linked Environment and Dataset Region</ins>: The HealthOmics run must be performed in a data.all Linked Environment that is located in an AWS Region that supports AWS HealthOmics. Similarly, the data.all source and destination Datasets must live in the same AWS Region as the one where the user will perform the HealthOmics run.
* <ins>Ready2Run Workflow Support</ins>: Currently, only Ready2Run workflows are supported. Ready2Run workflows are pre-built workflows designed by industry-leading third-party software companies like Sentieon, Inc. and NVIDIA, as well as common open-source pipelines such as AlphaFold for protein structure prediction. Ready2Run workflows do not require the management of software tools or workflow scripts. Bring-your-own (also known as Private) workflows, where you provide a custom workflow script, are not yet supported. Please note that some Ready2Run workflows require a subscription/license from the software provider to run.

### User Journey
This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.

* <ins>Initiation</ins>:
  * User navigates to the "Omics" section within data.all and browses Ready2Run workflows
    <img width="1894" alt="Screenshot 2024-01-17 at 11 31 23 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/122d1c96-921f-401a-8119-8b2f72779d7a">
  * User can also search for a specific workflow directly
    <img width="1679" alt="Screenshot 2024-01-17 at 11 34 43 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/dc03593e-6116-44b2-a0cd-913da60054cd">
  * After clicking on a workflow, users see a detailed view of it with a full description of what it does
    <img width="1894" alt="Screenshot 2024-01-17 at 11 35 15 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/6299ad7e-c19f-4026-9810-869d082488ae">
* <ins>Creation</ins>: After clicking on a workflow, users see a detailed view and hit “Create Run”. Users fill in the run creation form with the following parameters:
  <img width="1908" alt="Screenshot 2024-01-17 at 11 37 58 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/440e3f23-c65d-4e00-b932-0a448944853e">
  * <ins>Workflow ID</ins>: Immutable ID of the Ready2Run workflow
  * <ins>Run Name</ins>: Customizable name of the run the user will submit
  * <ins>Environment</ins>: data.all Environment AWS Account where the HealthOmics run will be performed (NOTE: the Environment must be in an AWS Region supported by HealthOmics, e.g. N. Virginia or London)
  * <ins>Region</ins>: Immutable Region where the run will be performed, pre-populated from the Environment
  * <ins>Owners</ins>: data.all group that owns the run
  * <ins>Select S3 Output Destination</ins>: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run)
  * <ins>Run Parameters</ins>: JSON parameter input in the format expected by the Ready2Run workflow. It will be pre-populated with the correct fields, and users will paste their data into the appropriate fields, for example the S3 location of the raw input data that will be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible. For example, raw genomic data may be hosted publicly on the AWS Registry of Open Data, and the S3 URI can be provided in a field here)
* <ins>History</ins>:
  * Users navigate to the Run tab at the top to view a history of the data.all-initiated Ready2Run workflows they’ve kicked off. (NOTE: run history deletion is still in progress)
    ![Screenshot 2024-01-17 at 11 40 01 PM](https://github.com/data-dot-all/dataall/assets/28816838/95af6147-332a-4837-a7a6-18ade74f9794)
* <ins>Data Consumption</ins>:
  * <ins>In Worksheets</ins>:
    * Users can select (or create) a new Worksheet.
      ![Screenshot 2024-01-17 at 11 41 20 PM](https://github.com/data-dot-all/dataall/assets/28816838/4e017983-17e1-422f-a6ee-b723967b8720)
    * Users can then query the data using SQL
      <img width="1902" alt="Screenshot 2024-01-17 at 11 44 30 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/5a020647-ee90-4d36-b132-05893e80327b">

### Relates
- Github Issue - #563

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
  - Is the input sanitized? Yes
  - What precautions are you taking before deserializing the data you consume? N/A
  - Is injection prevented by parametrizing queries? N/A
  - Have you ensured no `eval` or similar functions are used? N/A
- Does this PR introduce any functionality or component that requires authorization? Yes
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms? Yes
  - Are you logging failed auth attempts? N/A
- Are you using or adding any cryptographic features? No
  - Do you use standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
  - Have you used the least-privilege principle? How? Yes, through scoped policies added to the role

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: Patrick Guha <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: “Kiran <[email protected]>
data-dot-all · Jun 14, 2024 · 7a264ee
1 parent 080fc98
commit 7a264ee

Showing 61 changed files with 2,938 additions and 34 deletions.
diff --git a/backend/dataall/base/cdkproxy/requirements.txt b/backend/dataall/base/cdkproxy/requirements.txt
@@ -1,7 +1,7 @@
 aws-cdk-lib==2.99.0
-boto3==1.24.85
-boto3-stubs==1.24.85
-botocore==1.27.85
+boto3==1.28.23
+boto3-stubs==1.28.23
+botocore==1.31.23
 cdk-nag==2.7.2
 constructs==10.0.73
 starlette==0.36.3

diff --git a/backend/dataall/core/environment/cdk/env_role_core_policies/service_policy.py b/backend/dataall/core/environment/cdk/env_role_core_policies/service_policy.py
@@ -89,6 +89,7 @@ def generate_policies(self) -> [aws_iam.ManagedPolicy]:
                             'StringEquals': {
                                 'iam:PassedToService': [
                                     'glue.amazonaws.com',
+                                    'omics.amazonaws.com',
                                     'lambda.amazonaws.com',
                                     'sagemaker.amazonaws.com',
                                     'states.amazonaws.com',

diff --git a/backend/dataall/core/environment/cdk/environment_stack.py b/backend/dataall/core/environment/cdk/environment_stack.py
@@ -487,6 +487,7 @@ def create_group_environment_role(self, group: EnvironmentGroup, id: str):
                 iam.ServicePrincipal('databrew.amazonaws.com'),
                 iam.ServicePrincipal('codebuild.amazonaws.com'),
                 iam.ServicePrincipal('codepipeline.amazonaws.com'),
+                iam.ServicePrincipal('omics.amazonaws.com'),
                 self.pivot_role,
             ),
         )
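Taken together, the two IAM changes above let the environment team role be handed to HealthOmics: the role must trust the `omics.amazonaws.com` service principal, and callers must be allowed to pass the role to that service. The following is a minimal sketch of what the resulting policy documents might look like once rendered as JSON. The account ID, role name, and the exact statement layout are assumptions for illustration; the real documents are generated by the CDK code in this diff and contain more principals and services.

```python
# Hypothetical account ID and role name, used only for illustration.
ROLE_ARN = 'arn:aws:iam::111122223333:role/example-environment-team-role'

# Trust policy: allows the HealthOmics service to assume the role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Principal': {'Service': 'omics.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }
    ],
}

# Permission policy: allows passing the role, but only to HealthOmics.
pass_role_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {
            'Effect': 'Allow',
            'Action': 'iam:PassRole',
            'Resource': ROLE_ARN,
            'Condition': {
                'StringEquals': {'iam:PassedToService': ['omics.amazonaws.com']}
            },
        }
    ],
}
```

The `iam:PassedToService` condition key is what keeps the `iam:PassRole` permission scoped to specific services rather than to any service that accepts a role.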

diff --git a/backend/dataall/modules/omics/__init__.py b/backend/dataall/modules/omics/__init__.py
@@ -0,0 +1,41 @@
+"""Contains the code related to HealthOmics"""
+
+import logging
+from typing import Set, List, Type
+
+from dataall.base.loader import ImportMode, ModuleInterface
+from dataall.modules.omics.db.omics_repository import OmicsRepository
+
+log = logging.getLogger(__name__)
+
+
+class OmicsApiModuleInterface(ModuleInterface):
+    """Implements ModuleInterface for omics GraphQl lambda"""
+
+    @staticmethod
+    def is_supported(modes: Set[ImportMode]) -> bool:
+        return ImportMode.API in modes
+
+    @staticmethod
+    def depends_on() -> List[Type['ModuleInterface']]:
+        from dataall.modules.s3_datasets import DatasetApiModuleInterface
+
+        return [DatasetApiModuleInterface]
+
+    def __init__(self):
+        import dataall.modules.omics.api
+
+        log.info('API of omics has been imported')
+
+
+class OmicsCdkModuleInterface(ModuleInterface):
+    """Implements ModuleInterface for omics ecs tasks"""
+
+    @staticmethod
+    def is_supported(modes: Set[ImportMode]) -> bool:
+        return ImportMode.CDK in modes
+
+    def __init__(self):
+        import dataall.modules.omics.cdk
+
+        log.info('CDK of omics has been imported')
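The two interface classes above differ only in the import mode they accept. As a rough illustration of how a loader could use `is_supported` to pick interfaces for a given process, here is a self-contained sketch. `ImportMode` and `ModuleInterface` are simplified stand-ins for the classes in `dataall.base.loader`, and `select_interfaces` is a hypothetical helper; the real loader is not shown in this diff and likely does more, such as resolving `depends_on`.

```python
from enum import Enum, auto
from typing import List, Set, Type


class ImportMode(Enum):
    """Simplified stand-in for dataall.base.loader.ImportMode."""

    API = auto()
    CDK = auto()


class ModuleInterface:
    """Simplified stand-in for dataall.base.loader.ModuleInterface."""

    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        raise NotImplementedError


class ExampleApiInterface(ModuleInterface):
    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.API in modes


class ExampleCdkInterface(ModuleInterface):
    @staticmethod
    def is_supported(modes: Set[ImportMode]) -> bool:
        return ImportMode.CDK in modes


def select_interfaces(
    candidates: List[Type[ModuleInterface]], modes: Set[ImportMode]
) -> List[Type[ModuleInterface]]:
    """Return the interfaces that declare support for the requested modes."""
    return [candidate for candidate in candidates if candidate.is_supported(modes)]


# A process started in API mode would only pick up the API interface.
selected = select_interfaces([ExampleApiInterface, ExampleCdkInterface], {ImportMode.API})
```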
diff --git a/backend/dataall/modules/omics/api/__init__.py b/backend/dataall/modules/omics/api/__init__.py
@@ -0,0 +1,5 @@
+"""The package defines the schema for Omics Pipelines"""
+
+from dataall.modules.omics.api import input_types, mutations, queries, types, resolvers
+
+__all__ = ['types', 'input_types', 'queries', 'mutations', 'resolvers']
diff --git a/backend/dataall/modules/omics/api/enums.py b/backend/dataall/modules/omics/api/enums.py
@@ -0,0 +1,6 @@
+from dataall.base.api.constants import GraphQLEnumMapper
+
+
+class OmicsWorkflowType(GraphQLEnumMapper):
+    PRIVATE = 'PRIVATE'
+    READY2RUN = 'READY2RUN'
diff --git a/backend/dataall/modules/omics/api/input_types.py b/backend/dataall/modules/omics/api/input_types.py
@@ -0,0 +1,32 @@
+"""The module defines GraphQL input types for Omics Runs"""
+
+from dataall.base.api import gql
+
+NewOmicsRunInput = gql.InputType(
+    name='NewOmicsRunInput',
+    arguments=[
+        gql.Field('environmentUri', type=gql.NonNullableType(gql.String)),
+        gql.Field('workflowUri', type=gql.NonNullableType(gql.String)),
+        gql.Field('label', type=gql.NonNullableType(gql.String)),
+        gql.Field('destination', type=gql.String),
+        gql.Field('parameterTemplate', type=gql.String),
+        gql.Field('SamlAdminGroupName', type=gql.NonNullableType(gql.String)),
+    ],
+)
+
+OmicsFilter = gql.InputType(
+    name='OmicsFilter',
+    arguments=[
+        gql.Argument(name='term', type=gql.String),
+        gql.Argument(name='page', type=gql.Integer),
+        gql.Argument(name='pageSize', type=gql.Integer),
+    ],
+)
+
+OmicsDeleteInput = gql.InputType(
+    name='OmicsDeleteInput',
+    arguments=[
+        gql.Argument(name='runUris', type=gql.NonNullableType(gql.ArrayType(gql.String))),
+        gql.Argument(name='deleteFromAWS', type=gql.Boolean),
+    ],
+)
diff --git a/backend/dataall/modules/omics/api/mutations.py b/backend/dataall/modules/omics/api/mutations.py
@@ -0,0 +1,20 @@
+"""The module defines GraphQL mutations for Omics Pipelines"""
+
+from dataall.base.api import gql
+from .resolvers import create_omics_run, delete_omics_run
+from .types import OmicsRun
+from .input_types import NewOmicsRunInput, OmicsDeleteInput
+
+createOmicsRun = gql.MutationField(
+    name='createOmicsRun',
+    type=OmicsRun,
+    args=[gql.Argument(name='input', type=gql.NonNullableType(NewOmicsRunInput))],
+    resolver=create_omics_run,
+)
+
+deleteOmicsRun = gql.MutationField(
+    name='deleteOmicsRun',
+    type=gql.Boolean,
+    args=[gql.Argument(name='input', type=gql.NonNullableType(OmicsDeleteInput))],
+    resolver=delete_omics_run,
+)
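To show how a client might call the `createOmicsRun` mutation defined above, here is a minimal sketch that builds a GraphQL request body. The input field names come from `NewOmicsRunInput` and the selected fields from the `OmicsRun` type; the helper name `build_create_omics_run_request`, the choice of selection set, and every example value (URIs, group name, parameter keys) are hypothetical.

```python
import json

# GraphQL document for the mutation. The selection set is an arbitrary
# choice of fields from the OmicsRun type.
CREATE_OMICS_RUN = """
mutation createOmicsRun($input: NewOmicsRunInput!) {
  createOmicsRun(input: $input) {
    runUri
    label
    workflowUri
  }
}
"""


def build_create_omics_run_request(environment_uri, workflow_uri, label, group, destination, parameters):
    """Return a JSON-serializable GraphQL request body (hypothetical helper)."""
    return {
        'query': CREATE_OMICS_RUN,
        'variables': {
            'input': {
                'environmentUri': environment_uri,
                'workflowUri': workflow_uri,
                'label': label,
                'SamlAdminGroupName': group,
                'destination': destination,
                # parameterTemplate is declared as a String, so the run
                # parameters are sent as serialized JSON.
                'parameterTemplate': json.dumps(parameters),
            }
        },
    }


# All values below are placeholders, not real identifiers.
request_body = build_create_omics_run_request(
    environment_uri='example-env-uri',
    workflow_uri='example-workflow-uri',
    label='example-run',
    group='example-team',
    destination='example-dataset-uri',
    parameters={'exampleInputKey': 's3://example-bucket/sample.fastq.gz'},
)
```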
diff --git a/backend/dataall/modules/omics/api/queries.py b/backend/dataall/modules/omics/api/queries.py
@@ -0,0 +1,27 @@
+"""The module defines GraphQL queries for Omics runs"""
+
+from dataall.base.api import gql
+from .resolvers import list_omics_runs, get_omics_workflow, list_omics_workflows
+from .types import OmicsRunSearchResults, OmicsWorkflow, OmicsWorkflows
+from .input_types import OmicsFilter
+
+listOmicsRuns = gql.QueryField(
+    name='listOmicsRuns',
+    args=[gql.Argument(name='filter', type=OmicsFilter)],
+    resolver=list_omics_runs,
+    type=OmicsRunSearchResults,
+)
+
+getOmicsWorkflow = gql.QueryField(
+    name='getOmicsWorkflow',
+    args=[gql.Argument(name='workflowUri', type=gql.NonNullableType(gql.String))],
+    type=OmicsWorkflow,
+    resolver=get_omics_workflow,
+)
+
+listOmicsWorkflows = gql.QueryField(
+    name='listOmicsWorkflows',
+    args=[gql.Argument(name='filter', type=OmicsFilter)],
+    type=OmicsWorkflows,
+    resolver=list_omics_workflows,
+)
diff --git a/backend/dataall/modules/omics/api/resolvers.py b/backend/dataall/modules/omics/api/resolvers.py
@@ -0,0 +1,76 @@
+import logging
+from dataall.base.api.context import Context
+from dataall.base.db import exceptions
+from dataall.modules.omics.services.omics_service import OmicsService
+from dataall.modules.omics.db.omics_models import OmicsRun
+
+log = logging.getLogger(__name__)
+
+
+class RequestValidator:
+    """Aggregates all validation logic for operating with omics"""
+
+    @staticmethod
+    def required_uri(uri):
+        if not uri:
+            raise exceptions.RequiredParameter('URI')
+
+    @staticmethod
+    def validate_creation_request(data):
+        required = RequestValidator._required
+        if not data:
+            raise exceptions.RequiredParameter('data')
+        if not data.get('label'):
+            raise exceptions.RequiredParameter('name')
+
+        required(data, 'environmentUri')
+        required(data, 'SamlAdminGroupName')
+        required(data, 'workflowUri')
+        required(data, 'parameterTemplate')
+        required(data, 'destination')
+
+    @staticmethod
+    def _required(data: dict, name: str):
+        if not data.get(name):
+            raise exceptions.RequiredParameter(name)
+
+
+def create_omics_run(context: Context, source, input=None):
+    RequestValidator.validate_creation_request(input)
+    return OmicsService.create_omics_run(
+        uri=input['environmentUri'], admin_group=input['SamlAdminGroupName'], data=input
+    )
+
+
+def list_omics_runs(context: Context, source, filter: dict = None):
+    if not filter:
+        filter = {}
+    return OmicsService.list_user_omics_runs(filter)
+
+
+def list_omics_workflows(context: Context, source, filter: dict = None):
+    if not filter:
+        filter = {}
+    return OmicsService.list_omics_workflows(filter)
+
+
+def get_omics_workflow(context: Context, source, workflowUri: str = None):
+    RequestValidator.required_uri(workflowUri)
+    return OmicsService.get_omics_workflow(workflowUri)
+
+
+def delete_omics_run(context: Context, source, input):
+    RequestValidator.required_uri(input.get('runUris'))
+    return OmicsService.delete_omics_runs(uris=input.get('runUris'), delete_from_aws=input.get('deleteFromAWS', True))
+
+
+def resolve_omics_workflow(context, source: OmicsRun, **kwargs):
+    if not source:
+        return None
+    return OmicsService.get_omics_workflow(source.workflowUri)
+
+
+def resolve_omics_run_details(context, source: OmicsRun, **kwargs):
+    if not source:
+        return None
+    return OmicsService.get_omics_run_details_from_aws(source.runUri)
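The creation checks in `RequestValidator` can be exercised in isolation. The sketch below restates them with a stand-in exception class so it runs without data.all installed; `RequiredParameter` here only mimics `dataall.base.db.exceptions.RequiredParameter`, whose real constructor and message are not shown in this diff. Note that the validator requires `destination` and `parameterTemplate`, although `NewOmicsRunInput` declares both as nullable strings.

```python
class RequiredParameter(Exception):
    """Stand-in for dataall.base.db.exceptions.RequiredParameter."""

    def __init__(self, name):
        super().__init__(f'Required parameter missing: {name}')
        self.name = name


# Fields checked after the label, in the same order as the resolver.
REQUIRED_FIELDS = (
    'environmentUri',
    'SamlAdminGroupName',
    'workflowUri',
    'parameterTemplate',
    'destination',
)


def validate_creation_request(data):
    """Raise RequiredParameter for the first missing or empty field."""
    if not data:
        raise RequiredParameter('data')
    # As in the resolver, a missing label is reported as 'name'.
    if not data.get('label'):
        raise RequiredParameter('name')
    for field in REQUIRED_FIELDS:
        if not data.get(field):
            raise RequiredParameter(field)


# Example: every field is present except the output destination.
incomplete_input = {
    'label': 'example-run',
    'environmentUri': 'example-env-uri',
    'SamlAdminGroupName': 'example-team',
    'workflowUri': 'example-workflow-uri',
    'parameterTemplate': '{}',
}

try:
    validate_creation_request(incomplete_input)
    missing = None
except RequiredParameter as err:
    missing = err.name
```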
diff --git a/backend/dataall/modules/omics/api/types.py b/backend/dataall/modules/omics/api/types.py
@@ -0,0 +1,91 @@
+from dataall.base.api import gql
+from .resolvers import resolve_omics_workflow, resolve_omics_run_details
+from dataall.core.organizations.api.resolvers import resolve_organization_by_env
+from dataall.core.environment.api.resolvers import resolve_environment
+
+OmicsWorkflow = gql.ObjectType(
+    name='OmicsWorkflow',
+    fields=[
+        gql.Field(name='workflowUri', type=gql.String),
+        gql.Field(name='id', type=gql.String),
+        gql.Field(name='arn', type=gql.String),
+        gql.Field(name='name', type=gql.String),
+        gql.Field(name='label', type=gql.String),
+        gql.Field(name='type', type=gql.String),
+        gql.Field(name='description', type=gql.String),
+        gql.Field(name='parameterTemplate', type=gql.String),
+        gql.Field(name='environmentUri', type=gql.String),
+    ],
+)
+
+OmicsWorkflows = gql.ObjectType(
+    name='OmicsWorkflows',
+    fields=[
+        gql.Field(name='count', type=gql.Integer),
+        gql.Field(name='page', type=gql.Integer),
+        gql.Field(name='pages', type=gql.Integer),
+        gql.Field(name='hasNext', type=gql.Boolean),
+        gql.Field(name='hasPrevious', type=gql.Boolean),
+        gql.Field(name='nodes', type=gql.ArrayType(OmicsWorkflow)),
+    ],
+)
+
+OmicsRunStatus = gql.ObjectType(
+    name='OmicsRunStatus',
+    fields=[gql.Field(name='status', type=gql.String), gql.Field(name='statusMessage', type=gql.String)],
+)
+
+
+OmicsRun = gql.ObjectType(
+    name='OmicsRun',
+    fields=[
+        gql.Field('runUri', type=gql.ID),
+        gql.Field('environmentUri', type=gql.String),
+        gql.Field('organizationUri', type=gql.String),
+        gql.Field('name', type=gql.String),
+        gql.Field('label', type=gql.String),
+        gql.Field('description', type=gql.String),
+        gql.Field('tags', type=gql.ArrayType(gql.String)),
+        gql.Field('created', type=gql.String),
+        gql.Field('updated', type=gql.String),
+        gql.Field('owner', type=gql.String),
+        gql.Field('workflowUri', type=gql.String),
+        gql.Field('SamlAdminGroupName', type=gql.String),
+        gql.Field('parameterTemplate', type=gql.String),
+        gql.Field('outputDatasetUri', type=gql.String),
+        gql.Field('outputUri', type=gql.String),
+        gql.Field(
+            name='environment',
+            type=gql.Ref('Environment'),
+            resolver=resolve_environment,
+        ),
+        gql.Field(
+            name='organization',
+            type=gql.Ref('Organization'),
+            resolver=resolve_organization_by_env,
+        ),
+        gql.Field(
+            name='workflow',
+            type=OmicsWorkflow,
+            resolver=resolve_omics_workflow,
+        ),
+        gql.Field(
+            name='status',
+            type=OmicsRunStatus,
+            resolver=resolve_omics_run_details,
+        ),
+    ],
+)
+
+
+OmicsRunSearchResults = gql.ObjectType(
+    name='OmicsRunSearchResults',
+    fields=[
+        gql.Field(name='count', type=gql.Integer),
+        gql.Field(name='page', type=gql.Integer),
+        gql.Field(name='pages', type=gql.Integer),
+        gql.Field(name='hasNext', type=gql.Boolean),
+        gql.Field(name='hasPrevious', type=gql.Boolean),
+        gql.Field(name='nodes', type=gql.ArrayType(OmicsRun)),
+    ],
+)
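Both `OmicsWorkflows` and `OmicsRunSearchResults` share the same page envelope (`count`, `page`, `pages`, `hasNext`, `hasPrevious`, `nodes`). As an illustration of how those fields relate to one another, the sketch below paginates an in-memory list. It assumes one-based page numbers and is not the pagination helper data.all actually uses, which operates on database queries and is not part of this diff; `paginate_list` is a hypothetical name.

```python
import math


def paginate_list(items, page=1, page_size=10):
    """Return a page envelope with the same field names as the GraphQL types."""
    count = len(items)
    pages = math.ceil(count / page_size)
    start = (page - 1) * page_size
    return {
        'count': count,
        'page': page,
        'pages': pages,
        'hasNext': page < pages,
        'hasPrevious': page > 1,
        'nodes': items[start:start + page_size],
    }


# 25 items with 10 per page: page 2 holds items 10..19 and has neighbours
# on both sides.
result = paginate_list(list(range(25)), page=2, page_size=10)
```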
diff --git a/backend/dataall/modules/omics/aws/__init__.py b/backend/dataall/modules/omics/aws/__init__.py
@@ -0,0 +1 @@
+"""Contains code that send requests to AWS using SDK (boto3)"""