Adding AWS HealthOmics as a Module in "Play" tools (#954)
### Feature or Bugfix
- Feature

### Background
Currently, data.all has integrations to AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to or know how to use these services in the AWS Console. Researchers, scientists, and bioinformaticians often fall into this end user category.

One such AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.

### AWS HealthOmics Integration
This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.

### Consumption Patterns
* <ins>data.all Worksheets</ins>: Users can use Worksheets to make data easier to query and combine with other forms of health data.
* <ins>data.all Notebooks/Studio</ins>: Users can use Notebooks and Studio to build, train, and deploy novel machine learning algorithms on the multiomic and multimodal data.
* <ins>data.all Dashboards</ins>: Users can use the transformed data in Dashboards for advanced analytics and visualizations.

### Considerations
* <ins>Linked Environment and Dataset Region</ins>: The HealthOmics run must be performed in a data.all Linked Environment that is located in an AWS Region that supports AWS HealthOmics. Similarly, the data.all source and destination Datasets must live in the same AWS Region as the one where the user will perform the HealthOmics run.
* <ins>Ready2Run Workflow Support</ins>: Currently, only Ready2Run workflows are supported. Ready2Run workflows are pre-built workflows designed by industry-leading third-party software companies like Sentieon, Inc. and NVIDIA, as well as common open-source pipelines such as AlphaFold for protein structure prediction. Ready2Run workflows do not require the management of software tools or workflow scripts. Bring-your-own (also known as Private) workflows, where you supply a custom workflow script, are not yet supported. Please note that some Ready2Run workflows require a subscription/license from the software provider to run.

### User Journey
This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.

* <ins>Initiation</ins>:
  * User navigates to the "Omics" section within data.all and browses Ready2Run workflows
    <img width="1894" alt="Screenshot 2024-01-17 at 11 31 23 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/122d1c96-921f-401a-8119-8b2f72779d7a">
  * User can also search for a specific workflow directly
    <img width="1679" alt="Screenshot 2024-01-17 at 11 34 43 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/dc03593e-6116-44b2-a0cd-913da60054cd">
  * After clicking on a workflow, users see a detailed view of it with a full description of what it does
    <img width="1894" alt="Screenshot 2024-01-17 at 11 35 15 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/6299ad7e-c19f-4026-9810-869d082488ae">
* <ins>Creation</ins>: From the detailed view of a workflow, users hit “Create Run” and fill in the run creation form with the following parameters:
  <img width="1908" alt="Screenshot 2024-01-17 at 11 37 58 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/440e3f23-c65d-4e00-b932-0a448944853e">
  * <ins>Workflow ID</ins>: Immutable ID of the Ready2Run workflow
  * <ins>Run Name</ins>: Customizable name of the run the user will submit
  * <ins>Environment</ins>: data.all Environment AWS Account where the HealthOmics run will take place (NOTE: the Environment must be in an AWS Region supported by HealthOmics, e.g. N. Virginia or London)
  * <ins>Region</ins>: Immutable Region where the run will take place, pre-populated from the Environment
  * <ins>Owners</ins>: data.all group that owns the run
  * <ins>Select S3 Output Destination</ins>: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run)
  * <ins>Run Parameters</ins>: JSON parameter input in the format expected by the Ready2Run workflow. It is pre-populated with the correct fields, and users paste their values into the appropriate fields, for example the S3 location of the raw input data that will be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible. For example, raw genomic data may be hosted publicly on the AWS Registry of Open Data, and the S3 URI can be provided in a field here)
* <ins>History</ins>:
  * Users navigate to the Run tab at the top to view a history of the data.all-initiated Ready2Run workflows they’ve kicked off. (NOTE: run history deletion is still in progress)
    ![Screenshot 2024-01-17 at 11 40 01 PM](https://github.com/data-dot-all/dataall/assets/28816838/95af6147-332a-4837-a7a6-18ade74f9794)
* <ins>Data Consumption</ins>:
  * <ins>In Worksheets</ins>:
    * Users can select an existing Worksheet or create a new one.
      ![Screenshot 2024-01-17 at 11 41 20 PM](https://github.com/data-dot-all/dataall/assets/28816838/4e017983-17e1-422f-a6ee-b723967b8720)
    * Users can then query the data using SQL
      <img width="1902" alt="Screenshot 2024-01-17 at 11 44 30 PM" src="https://github.com/data-dot-all/dataall/assets/28816838/5a020647-ee90-4d36-b132-05893e80327b">

### Relates
- GitHub Issue - #563

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
  - Is the input sanitized? Yes
  - What precautions are you taking before deserializing the data you consume? N/A
  - Is injection prevented by parametrizing queries? N/A
  - Have you ensured no `eval` or similar functions are used? N/A
- Does this PR introduce any functionality or component that requires authorization? Yes
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms? Yes
  - Are you logging failed auth attempts? N/A
- Are you using or adding any cryptographic features? No
  - Do you use standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
  - Have you used the least-privilege principle? How? Yes, through scoped policies added to the role

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: Patrick Guha <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: “Kiran <[email protected]>
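
As a rough illustration of how the run creation form fields and the Region considerations described above could map onto a HealthOmics `StartRun` request, here is a minimal Python sketch. It is not the code added by this PR: the function name `build_start_run_request`, its arguments, and the `EXAMPLE_OMICS_REGIONS` set are hypothetical, and the Region set only contains the two Regions the description names as examples (N. Virginia and London), not a complete list. The returned keys follow the boto3 `omics` client's `start_run` parameters.

```python
import json

# Assumption: illustrative subset only (N. Virginia and London, as named in
# the description). This is NOT a complete list of HealthOmics Regions.
EXAMPLE_OMICS_REGIONS = {"us-east-1", "eu-west-2"}


def build_start_run_request(workflow_id, run_name, env_region, dataset_region,
                            output_uri, role_arn, run_parameters_json):
    """Validate form-like inputs and return kwargs for a StartRun call.

    Hypothetical helper: names and validation rules are assumptions made
    for illustration, not data.all's actual implementation.
    """
    # The Environment must be in a Region that supports HealthOmics.
    if env_region not in EXAMPLE_OMICS_REGIONS:
        raise ValueError(f"Region {env_region} is not in the example Region set")
    # The output Dataset must live in the same Region as the run.
    if dataset_region != env_region:
        raise ValueError("Dataset Region must match the Environment Region")
    # Run Parameters arrive as JSON text and must decode to an object.
    parameters = json.loads(run_parameters_json)
    if not isinstance(parameters, dict):
        raise ValueError("Run Parameters must be a JSON object")
    return {
        "workflowId": workflow_id,
        "workflowType": "READY2RUN",  # only Ready2Run workflows are supported
        "name": run_name,
        "roleArn": role_arn,
        "parameters": parameters,
        "outputUri": output_uri,
    }
```

With boto3 installed and credentials for the Environment account, the resulting dictionary could be passed along the lines of `boto3.client("omics", region_name=env_region).start_run(**request)`. The call is left out of the block so the sketch stays runnable without AWS access.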