
Adding AWS HealthOmics as a Module in "Play" tools #954

Merged: 183 commits into data-dot-all:main on Jun 14, 2024

Conversation

@ironspur5 (Contributor) commented Jan 9, 2024

Feature or Bugfix

  • Feature

Background

Currently, data.all has integrations with AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to or know how to use these services in the AWS Console.

Researchers, scientists, and bioinformaticians often fall into this end user category. One such AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.

AWS HealthOmics Integration

This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.

Consumption Patterns

  • data.all Worksheets: Users can use Worksheets to make data easier to query and combine with other forms of health data.
  • data.all Notebooks/Studio: Users can use Notebooks and Studio to build, train, and deploy novel machine learning algorithms on the multiomic and multimodal data.
  • data.all Dashboards: Users can use the transformed data in Dashboards for advanced analytics and visualizations.

Considerations

  • Linked Environment and Dataset Region: The HealthOmics run must be performed in a data.all Linked Environment located in an AWS Region that supports AWS HealthOmics. Similarly, the data.all source and destination Datasets must live in the same AWS Region where the user will perform the HealthOmics run.
  • Ready2Run Workflow Support: Currently, only Ready2Run workflows are supported. Ready2Run workflows are pre-built workflows designed by industry-leading third-party software companies like Sentieon, Inc. and NVIDIA, as well as common open-source pipelines such as AlphaFold for protein structure prediction. Ready2Run workflows do not require managing software tools or workflow scripts. Bring-your-own (also known as Private) workflows, where you supply a custom workflow script, are not yet supported. Please note that some Ready2Run workflows require a subscription/license from the software provider to run.

User Journey

This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.

  • Initiation:
    • User navigates to the "Omics" section within data.all and browses Ready2Run workflows
  • User can also search for a specific workflow directly
  • After clicking on a workflow, users see a detailed view of it with a full description of what it does
  • Creation: From the detailed view, users hit “Create Run” and fill in the run creation form with the following parameters:
    * Workflow ID: Immutable ID of the Ready2Run workflow
    * Run Name: Customizable name of the run the user will submit
    * Environment: data.all Environment AWS account where the HealthOmics run will execute (NOTE: the Environment must be in an AWS Region supported by HealthOmics, e.g. N. Virginia or London)
    * Region: Immutable Region where the run will execute, pre-populated from the Environment
    * Owners: data.all group that owns the run
    * Select S3 Output Destination: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run)
    * Run Parameters: JSON parameter input in the format expected by the Ready2Run workflow. It is pre-populated with the correct fields, and users paste their data into the appropriate fields, for example the S3 location of the raw input data to be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible; for example, raw genomic data hosted publicly on the AWS Registry of Open Data can be referenced by its S3 URI here. See the sketch after this list.)
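For illustration, here is a minimal Python sketch of what this run submission looks like at the AWS SDK level; the workflow ID, role ARN, S3 URIs, and parameter names are hypothetical placeholders, and this is not the module's actual service code:

```python
import boto3

# All IDs, ARNs, and URIs below are hypothetical placeholders --
# in data.all they come from the run creation form.
WORKFLOW_ID = "1234567"  # immutable Ready2Run workflow ID
RUN_ROLE_ARN = "arn:aws:iam::111122223333:role/omics-run-role"
OUTPUT_URI = "s3://my-dataall-dataset-bucket/omics-output/"

# Run parameters in the JSON shape the chosen workflow expects; the input
# only needs to be accessible, e.g. a public AWS Registry of Open Data URI.
run_parameters = {
    "fasta_path": "s3://aws-open-data-example/inputs/protein.fasta",
}

omics = boto3.client("omics")
response = omics.start_run(
    workflowId=WORKFLOW_ID,
    workflowType="READY2RUN",
    name="my-esmfold-run",
    roleArn=RUN_ROLE_ARN,
    outputUri=OUTPUT_URI,
    parameters=run_parameters,
)
print(response["id"], response["status"])
```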

  • History:

    • Users navigate to the Run tab at the top to view a history of the data.all-initiated Ready2Run workflow runs they’ve kicked off. (NOTE: run history deletion is still in progress)


  • Data Consumption:

    • In Worksheets:
      • Users can select (or create) a new Worksheet.


  • Users can then query the data using SQL
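As a hedged illustration of this step (Worksheets run on Athena), the snippet below executes a query against a crawled output table; the database, table, and bucket names are placeholders:

```python
import boto3

# Database, table, and bucket names are illustrative -- the real ones come
# from the data.all Dataset the run output was crawled into.
athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString="SELECT * FROM omics_output_db.esmfold_metrics LIMIT 10",
    QueryExecutionContext={"Database": "omics_output_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(execution["QueryExecutionId"])
```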

Relates

Security

Please answer the questions below briefly where applicable, or write N/A. Based on the OWASP Top 10.

  • Does this PR introduce or modify any input fields or queries - this includes
    fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
    • Is the input sanitized? Yes
    • What precautions are you taking before deserializing the data you consume? N/A
    • Is injection prevented by parametrizing queries? N/A
    • Have you ensured no eval or similar functions are used? N/A
  • Does this PR introduce any functionality or component that requires authorization? Yes
    • How have you ensured it respects the existing AuthN/AuthZ mechanisms? Yes
    • Are you logging failed auth attempts? N/A
  • Are you using or adding any cryptographic features? No
    • Do you use a standard proven implementations?
    • Are the used keys controlled by the customer? Where are they stored?
  • Are you introducing any new policies/roles/users? Yes
    • Have you used the least-privilege principle? How? Yes, through scoped policies added to the role

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

AmrSaber and others added 30 commits March 30, 2023 13:08
Syncs modularization main branch with main (as there is a bug fix that
is useful to have)

---
By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: wolanlu <[email protected]>
This is a draft PR for showing purposes. There are still some minor issues that need to be addressed.

### Feature or Bugfix
- Refactoring

### Detail
This PR contains the following changes:
1. Modularization + refactoring of notebooks
There are new modules that will play a major role in the future refactoring:

   * Core = contains the code needed for the application to operate correctly
   * Common = common code for all modules
   * Modules = the plugins/features that can be inserted into the system (at the moment, only notebooks)

The other part related to modularization is the creation of environment parameters. Environment parameters will replace all hardcoded parameters of the environment configuration.
There is a new file, config.json, that allows you to configure the application. All existing parameters will be migrated via a db migration in AWS.
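As a hedged sketch, this is roughly how such a config file could be read; the exact keys of data.all's config.json are an assumption here:

```python
import json

# Assumed shape of config.json -- per-module activation flags;
# the exact keys in data.all may differ.
raw = """
{
  "modules": {
    "notebooks": { "active": true }
  }
}
"""
config = json.loads(raw)

def is_module_active(name: str) -> bool:
    """Return True if the module is switched on in the application config."""
    return config.get("modules", {}).get(name, {}).get("active", False)

print(is_module_active("notebooks"))  # True
```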

2. Extracting permissions and request context (Optional for the
modularization)

Engine, user, and user groups used to be passed as parameters of the request context. This forced passing a lot of parameters to methods that didn't even need them. This information should belong to the scope of the request session.
There is a new way to retrieve the information using `RequestContext`. There is also a new way to perform permission checks that requires fewer parameters and makes the code cleaner. The old way was marked as deprecated.
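A minimal sketch of the idea, with assumed names (data.all's actual `RequestContext` lives in the backend code and may differ):

```python
from dataclasses import dataclass
from typing import Any, List, Optional

@dataclass
class RequestContext:
    # Request-scoped information that previously had to be threaded
    # through many method signatures.
    db_engine: Any
    username: str
    groups: List[str]

_context: Optional[RequestContext] = None

def set_context(context: RequestContext) -> None:
    # Called once when the request session starts.
    global _context
    _context = context

def get_context() -> Optional[RequestContext]:
    # Any service can read the request info without extra parameters.
    return _context
```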

3. Restructure of the code (Optional for the modularization)

Since the modularization will touch all the places in the API code, it is a good opportunity to set a new structure for the code. There is a small re-organization in the notebooks module to address:

   * Allocating resources before validating parameters
   * Unclear responsibilities of the classes
   * Mixed layers
  
   The new structure is:
   - resolvers = validate and pass control to the service layer
   - service layer = business logic
   - repositories = database logic (all queries should be placed here)
   - aws = contains a wrapper client on top of boto3
   - cdk = all logic related to creating stacks, plus code for ECS
   - tasks = code that will be executed in AWS Lambda (short-lived tasks)

   All names can be changed.
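A schematic of that layering with made-up notebook-flavoured names, not the actual module code:

```python
class NotebookRepository:
    """repositories = database logic; all queries live here."""
    def __init__(self, session):
        self.session = session

    def find_notebook(self, uri: str):
        # Placeholder for a real SQLAlchemy query.
        return {"notebookUri": uri}

class NotebookService:
    """service layer = business logic, permission checks, etc."""
    def __init__(self, repository: NotebookRepository):
        self.repository = repository

    def get_notebook(self, uri: str):
        return self.repository.find_notebook(uri)

def get_notebook_resolver(context, source, notebookUri: str = None):
    """resolvers = validate input and pass control to the service layer."""
    if not notebookUri:
        raise ValueError("notebookUri is required")
    service = NotebookService(NotebookRepository(context.session))
    return service.get_notebook(notebookUri)
```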

### Relates
[data-dot-all#295](data-dot-all#295)



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <[email protected]>
### Feature or Bug-fix
- Refactoring

### Detail
For the development dockerfile in the frontend, instead of starting from the Amazon Linux image and installing everything manually, it's a lot simpler to start from the node image (also pulled from ECR for security), which has the node-related environment pre-installed, and focus the dockerfile on project-related configuration.

This PR does not introduce any update in the logic, it just updates the
development dockerfile and adds a `.dockerignore` file for frontend to
ignore `node_modules` and `build` folders.

---
By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

### Feature or Bug-fix
- Refactoring

### Details
This PR does 2 things:
- Makes the styling consistent across the project
- Removes dead code (unused variables, functions, and imports)

---
By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bug-fix
- Refactoring

### Detail
My recent PR data-dot-all#394 has a lot of changes, but all of them are to styles only and not the code itself. This way, the blame history for all the changed files is lost (it will always blame me).

This PR adds the file `.git-blame-ignore-revs`, which contains the id of the styling commit and the command that you need to run once to fix the blame history for you.

GitHub also understands that file and will show the history correctly.

This PR is based on [this article](https://www.stefanjudis.com/today-i-learned/how-to-exclude-commits-from-git-blame/) and was suggested by Raul.

This file works for the whole repository; in the future, one only needs to add the ids of styling commits and the history will be preserved correctly.

---
By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues data-dot-all#295 and data-dot-all#412 

### Short Summary
First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) 

### Long Summary 
Datasets are huge. It's one of the central modules that's spread everywhere across the application. Migrating the entire Dataset piece would be a very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break down this work into "small" steps to make it more convenient to review.
Dataset's API consists of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

In this PR, there is only the creation of the `Dataset` module and the migration of `DatasetTableColumn` (and some pieces related to it). Why? Because the plan was to migrate it, see what issues would come up along with it, and address them here. The refactoring of `DatasetTableColumn` will be in another PR.
The issues: 
1) Glossaries
2) Feed
3) Long tasks for datasets
4) Redshift

Glossaries rely on a GraphQL UNION of different types (including datasets). Created an abstraction for glossary registration. There was an idea to change the frontend, but it'd require a lot of work to do this.
Feed: same as glossaries. Solved in a similar way. For Feed, changing the frontend API is more feasible, but I wanted it to be consistent with glossaries.

Long tasks for datasets. They were migrated into the tasks folder and don't require dedicated loading of their code (at least for now). But there are two concerns:
1) The deployment uses direct module folder references to run them (e.g. `dataall.modules.datasets....`), so basically when a module is deactivated, we shouldn't deploy these tasks either. I left a TODO to address this in the future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies.
2) There is a reference to `redshift` from long-running tasks; this should be addressed in the `redshift` module.

Redshift: it has some references to `datasets`. So there will be either dependencies among modules or small code duplication (if `redshift` doesn't rely heavily on `datasets`); this will be addressed in the `redshift` module.

Other changes:
- Fixed and improved some tests
- Extracted glue handler code related to `DatasetTableColumn`
- Renamed import mode from tasks to handlers for async lambda
- A few hacks that will go away with the next refactoring :)

Next steps:
- [Part 2](nikpodsh#1) in preview :)
- Extract the rest of the datasets functionality (perhaps in a few steps)
- Refactor extractor modules the same way as notebooks
- Extract tests to follow the same structure



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of datasets: `DatasetStorageLocation`.
Introduced the Indexers: this code was migrated from `indexers.py` and put into modules.
Removed unused alarms (which didn't call actual alarm code).
Introduced `DatasetShareService`, but it seems it will be migrated to the share module.
All `DatasetXXXServices` will be split into Services (business logic) and Repositories (DAO layer) in future parts.

### Relates
- data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable:
Got rid of the ElasticSearch connection for every request. Created a lazy way to establish the connection.
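A minimal sketch of the lazy-connection idea, assuming the `elasticsearch` Python client and an endpoint taken from an environment variable (illustrative, not the repo's exact code):

```python
import os
from elasticsearch import Elasticsearch

class LazySearchConnection:
    """Defer connecting until the client is actually used,
    instead of connecting on every request."""

    def __init__(self):
        self._client = None

    @property
    def client(self) -> Elasticsearch:
        if self._client is None:  # connect on first access only
            self._client = Elasticsearch(hosts=[os.environ["ES_ENDPOINT"]])
        return self._client

connection = LazySearchConnection()  # importing this is free; no I/O yet
```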

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <[email protected]>
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and the code related to it.
Refactoring for Votes.
Introduced DataPolicy (the same way ServicePolicy was used).
Extracted dataset-related permissions.
Used the new `has_tenant_permission` instead of `has_tenant_perm`, which allows not passing unused parameters.

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Creation of the mechanism to import dependencies for modules. It should make it easier to extract common code between modules. Now, one module can refer to another module in order to import it.

Deleted the `common` folder since it has been replaced.
Added some checks for whether a module was unintentionally imported, plus logging.
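An illustrative sketch of such a dependency mechanism; the interface and loader names below are assumptions, not the real data.all loader:

```python
import logging
from abc import ABC, abstractmethod
from typing import List, Type

class ModuleInterface(ABC):
    @staticmethod
    @abstractmethod
    def is_supported(modes: set) -> bool:
        """Whether the module should load for the given import modes."""

    @staticmethod
    def depends_on() -> List[Type["ModuleInterface"]]:
        return []  # modules declare the other modules they rely on

def load_module(module: Type[ModuleInterface], modes: set, loaded: set) -> None:
    if module in loaded:
        return
    for dependency in module.depends_on():  # dependencies load first
        load_module(dependency, modes, loaded)
    if module.is_supported(modes):
        loaded.add(module)
    else:
        # Flag modules that were pulled in but are not active for these modes.
        logging.warning("%s imported but not active for modes %s",
                        module.__name__, modes)
```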


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Modularization of Worksheets

- Moved Worksheet related code to its own new module
- Merged AthenaQueryResult object into Worksheets
- Worksheet related permissions moved to new module
- Removed worksheet sharing related (unused) code from the entire repo


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
…ata-dot-all#463)

Merge of main (v1.5.2) -> modularization main

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: wolanlu <[email protected]>
Co-authored-by: Amr Saber <[email protected]>
Co-authored-by: Noah Paige <[email protected]>
Co-authored-by: kukushking <[email protected]>
Co-authored-by: Dariusz Osiennik <[email protected]>
Co-authored-by: Dennis Goldner <[email protected]>
Co-authored-by: Abdulrahman Kaitoua <[email protected]>
Co-authored-by: akaitoua-sa <[email protected]>
Co-authored-by: Gezim Musliaj <[email protected]>
Co-authored-by: Rick Bernotas <[email protected]>
Co-authored-by: David Mutune Kimengu <[email protected]>
### Feature or Bugfix
- Refactoring

### Detail
First part of the modularization of ML Studio 

- Created `SageMakerStudioService` and moved `resolvers` business logic into the service
- Redefined `resolvers` and added request validators
- Redefined `db` operations and added them into `SageMakerRepository`
- Created `sagemaker_studio_client` with Sagemaker Studio SDK calls
- Created `ec2_client` with EC2 SDK calls
- Added CloudFormation stack in the `cdk` package of the module
- Added CloudFormation stack as an environment extension in the `cdk`
package of the module
- Modified environment creation and edit api calls and views to read
`mlStudiosEnabled` from environment parameters table
- Fixed tests and added migration scripts

Additional:
- Standardized naming of functions and objects to "sagemaker_studio_user" and avoided legacy "Notebook" references
- Removed unused api definition for `getSagemakerStudioUserApps`
- Cleaned up unused frontend views and API methods
- Split migration scripts for environments, worksheets and mlstudio
- Linted the Worksheets module
- Ran prettier on the frontend

### Relates
data-dot-all#448 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: wolanlu <[email protected]>
Co-authored-by: Amr Saber <[email protected]>
Co-authored-by: Noah Paige <[email protected]>
Co-authored-by: kukushking <[email protected]>
Co-authored-by: Dariusz Osiennik <[email protected]>
Co-authored-by: Dennis Goldner <[email protected]>
Co-authored-by: Nikita Podshivalov <[email protected]>
Co-authored-by: Abdulrahman Kaitoua <[email protected]>
Co-authored-by: akaitoua-sa <[email protected]>
Co-authored-by: Gezim Musliaj <[email protected]>
Co-authored-by: Rick Bernotas <[email protected]>
Co-authored-by: David Mutune Kimengu <[email protected]>
Co-authored-by: nikpodsh <[email protected]>
Co-authored-by: dbalintx <[email protected]>
### Feature or Bugfix
- Feature

### Detail
- For ML Studio and Environments, the original procedure of synthesizing
the template and testing the resulting json has been replaced by the
`cdk assertions` library recommended in the
[documentation](https://docs.aws.amazon.com/cdk/v2/guide/testing.html#testing_getting_started)
of CDK.
- In the Environment cdk testing, the `extent` method of the
`EnvironmentStackExtension` subclasses has been mocked. Now it only
tests the `EnvironmentSetup` stack as if no extensions were registered.
- In the MLStudio cdk testing, this PR adds a test for the
`SageMakerDomainExtension` mocking the environment stack. It tests the
`SageMakerDomainExtension` as a standalone.
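For reference, a minimal test in the `cdk assertions` style; the stack under test here is a stand-in defined inline, whereas the real tests target the data.all Environment and MLStudio stacks:

```python
import aws_cdk as cdk
from aws_cdk import aws_sagemaker as sagemaker
from aws_cdk.assertions import Template

class TinyStudioStack(cdk.Stack):
    """Stand-in stack for illustration only."""
    def __init__(self, scope, construct_id: str):
        super().__init__(scope, construct_id)
        sagemaker.CfnDomain(
            self, "Domain",
            auth_mode="IAM",
            default_user_settings=sagemaker.CfnDomain.UserSettingsProperty(),
            domain_name="demo-domain",
            subnet_ids=["subnet-0123456789abcdef0"],
            vpc_id="vpc-0123456789abcdef0",
        )

def test_sagemaker_domain_created():
    app = cdk.App()
    template = Template.from_stack(TinyStudioStack(app, "TestStack"))
    # Assert on resource counts and properties instead of diffing raw JSON.
    template.resource_count_is("AWS::SageMaker::Domain", 1)
    template.has_resource_properties("AWS::SageMaker::Domain", {"AuthMode": "IAM"})
```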

Open question: 
1) The rest of the cdk stacks (notebooks, pipelines, datasets...) are tested using the old method of printing the json template. Should I go ahead and migrate them to the `cdk assertions` library? If so, I thought of doing it for notebooks only and syncing on the testing for datasets and pipelines with @dbalintx and @nikpodsh.

2) With the `cdk assertions` library we can test resource properties in more depth. I added some tests on the environment stack, but we could add more asserts in all stacks. I will add more tests on the MLStudio stack, and I made a note in the GitHub project to review this once modularization is complete.

### Relates
- data-dot-all#295 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
…that are inactive (data-dot-all#522)

Enabled the feature to turn off tests whenever a module is inactive.
Modularization of the sharing.

### Detail
Migrated the sharing part (including tasks).
Removed unused methods for datasets.

There are a few issues that need to be addressed before merging this PR. But it can be reviewed, since we don't expect major changes here.

### Related
 data-dot-all#295 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: kukushking <[email protected]>
Co-authored-by: Dariusz Osiennik <[email protected]>
Co-authored-by: Noah Paige <[email protected]>
Co-authored-by: Dennis Goldner <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Abdulrahman Kaitoua <[email protected]>
Co-authored-by: akaitoua-sa <[email protected]>
Co-authored-by: Gezim Musliaj <[email protected]>
Co-authored-by: Rick Bernotas <[email protected]>
Co-authored-by: David Mutune Kimengu <[email protected]>
Co-authored-by: dbalintx <[email protected]>
Modularization of data pipelines

Changes:
- Relevant files moved from dataall/Objects/api and from dataall/db to the newly created module
- Relevant permissions extracted to the newly created module and used with the new decorators
- Functions interacting with the DB were outsourced to repositories
- Extracted and moved Datapipelines-related code from core CDK files to the new module:
  - dataall/cdkproxy/cdk_cli_wrapper.py
  - dataall/cdkproxy/stacks/pipeline.py
  - dataall/cdkproxy/cdkpipeline/cdk_pipeline.py
  - service policies
- Extracted and moved Datapipelines-related code from core AWS handlers to the new module:
  - dataall/aws/handlers/stepfunction.py
  - dataall/aws/handlers/codecommit.py
  - dataall/aws/handlers/codepipeline.py
  - dataall/aws/handlers/glue.py
- Added a module interface for the module
- Updated unit tests



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
### Feature or Bugfix
Refactoring

### Detail
Refactoring of the dashboards

### Relates
data-dot-all#509

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.


- [x] TODO: There was a huge merge conflict. I am testing the changes
after the merge

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Abdulrahman Kaitoua <[email protected]>
Co-authored-by: akaitoua-sa <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: Gezim Musliaj <[email protected]>
Co-authored-by: kukushking <[email protected]>
Co-authored-by: Rick Bernotas <[email protected]>
Co-authored-by: David Mutune Kimengu <[email protected]>
Co-authored-by: dbalintx <[email protected]>
Removing all Redshift related code from backend, frontend and docs.
Cleaned up dataall/core folder structure



By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
Added missing permission checks.

This PR depends on data-dot-all#537. Should be reviewed after the dashboard modularization.

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Abdulrahman Kaitoua <[email protected]>
Co-authored-by: akaitoua-sa <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: Gezim Musliaj <[email protected]>
Co-authored-by: kukushking <[email protected]>
Co-authored-by: Rick Bernotas <[email protected]>
Co-authored-by: David Mutune Kimengu <[email protected]>
Co-authored-by: dbalintx <[email protected]>
@noah-paige (Contributor) commented May 31, 2024

First and foremost - this is a great contribution from @ironspur5 and team! All of your work and collaboration is super appreciated, and I think this is an amazing starting point for health science applications and data consumption patterns in data.all :)

I left some comments throughout the PR - most are small changes or touch-ups, but let me know if there are any questions or disagreements. Additionally, some high-level notes I took:

Notes:

  • For existing DA Deployments (a heads up on what we need to call out during release @dlpzx)
    • Existing Groups need to be given the MANAGE_OMICS permission from the tenant
    • Need to update envs after enabling the omics module
    • Need to run the omics workflow fetcher at least 1 time before runs can start
  • Any need for an Omics VPC Endpoint? Similar to how we have VPC interface endpoints for other supported services

Enhancement Ideas (can be picked up after this PR)

  • Permissions (see the sketch after this list)
    • Is there any need for a more in-depth permission model? Currently we have:
      • MANAGE_OMICS_RUNS (Tenant Level)
      • CREATE_OMICS_RUNS (on Env Record)
      • DELETE_OMICS_RUN (on OmicsRun Record)
    • Especially if we move to supporting Private Workflows, we may need more verbose permissions for update, get, and list, for one if not both of omics workflows and omics runs
  • Private Workflow Types, RunGroups, etc.
  • Search + Filters on Omics Run History
  • Better UX JSON editing for parameter template
  • Provide Output Locations specific to Table Location or Folder Location rather than entire Dataset
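As referenced in the Permissions item above, here is an illustrative sketch of how the three existing permission levels could be wired; the decorator and helper below are assumptions, not the module's exact code:

```python
import functools

MANAGE_OMICS_RUNS = "MANAGE_OMICS_RUNS"   # tenant level
CREATE_OMICS_RUNS = "CREATE_OMICS_RUNS"   # checked on the Env record
DELETE_OMICS_RUN = "DELETE_OMICS_RUN"     # checked on the OmicsRun record

def check_user_permission(uri: str, permission: str) -> bool:
    # Stub standing in for the real resource-policy lookup.
    return True

def has_resource_permission(permission: str):
    """Assert the caller's team holds `permission` on the target resource."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(uri: str, *args, **kwargs):
            if not check_user_permission(uri, permission):
                raise PermissionError(f"{permission} required on {uri}")
            return func(uri, *args, **kwargs)
        return wrapper
    return decorator

@has_resource_permission(DELETE_OMICS_RUN)
def delete_omics_run(uri: str):
    ...  # delete the run record (and optionally the AWS resources)
```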

Tests I performed:

  • CICD Pipeline Completes
  • Env Stack Update - adds Omics permissions to pivot role and env group roles (if they have omics env group permission)
  • Omics Permissions Show (both in Admin Tenant Permissions and Env Group Permissions + can edit appropriately for each)
  • ECS Workflow Fetch Task Successful (IMPORTANT: Omics Module needs to be enabled - left comment on file)
  • Workflows Paginated and showing on Omics Workflow Tab
  • Workflow Card with Additional Description and Parameter Template Shows Properly
  • Create Run from Workflow (Tested ESM Fold Workflow - mimicked Video)
  • Run History Shows OmicsRuns for the User's Teams (other users in a separate team cannot view Omics Runs)
  • Run Status Updates when Job Completes
  • Output Data to correct S3 Output (tested both same dataset and a different owned dataset)
  • Omics Run Creator Team can Delete Job Run as well
  • Test with Shared Dataset as Input
  • Crawl + Query Output metrics in Worksheets

@noah-paige (Contributor) commented:

A final suggestion left on omics_workflow_fetcher task

And 1 last thought - should we restrict omics runs to follow a pattern using NamingConventionService, similar to how we do with other resources?

  • If so, I think we can also restrict the pivot role permissions to arn:aws:omics:{self.region}:{self.account}:run/{self.env_resource_prefix}* (right now it is a wildcard on name)

If we do not implement a naming convention like the above, we may want to remove the resource prefix restriction on the env group role: arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}* (currently implemented as such in backend/dataall/modules/omics/cdk/env_role_omics_policy.py)

@ironspur5 (Contributor, Author) replied:

> A final suggestion left on omics_workflow_fetcher task
>
> And 1 last thought - should we restrict omics runs to follow a pattern using NamingConventionService similar to how we do with other resources? […]

I am not 100% certain, but I also don't think the self.env_resource_prefix is actually doing anything. Run ARNs don't include any naming/prefix in their format, just the Run ID. I think I put this in there to match what other modules looked like. But the tag key condition to prevent unauthorized run access is really what I was intending for.
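For illustration, a hedged CDK sketch of that tag-condition approach; the tag key, values, and variable names are assumptions:

```python
from aws_cdk import aws_iam as iam

# Illustrative values; in the module these would come from the environment stack.
region, account, team_tag_value = "us-east-1", "111122223333", "research-team"

statement = iam.PolicyStatement(
    actions=["omics:GetRun", "omics:CancelRun", "omics:DeleteRun"],
    # Run ARNs carry only the run ID, so a name prefix adds nothing here...
    resources=[f"arn:aws:omics:{region}:{account}:run/*"],
    # ...the tag condition is what actually fences off other teams' runs.
    conditions={"StringEquals": {"aws:ResourceTag/team": team_tag_value}},
)
```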

@noah-paige (Contributor) left a review comment:
thanks again for all your hard work here @ironspur5 @dlpzx + team

Final changes look good - approving!

@anmolsgandhi (Contributor) commented:

Great work on this initiative team - the collaboration and coordination across the teams, all focused on delivering the best possible experience for our customers, has been awesome. Let's continue this collaboration for other initiatives as well.

@dlpzx merged commit 7a264ee into data-dot-all:main on Jun 14, 2024
10 checks passed
@ironspur5 (Contributor, Author) commented:

Thank you for all your support in getting this added, team!!

Development

Successfully merging this pull request may close these issues:

  • HealthOmics module in "Play" tools
10 participants