Adding AWS HealthOmics as a Module in "Play" tools #954
Conversation
Syncs modularization main branch with main (as there is a bug fix that is useful to have) --- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <[email protected]> Co-authored-by: wolanlu <[email protected]>
This is a draft PR for showing purposes. There are still some minor issues that need to be addressed.

### Feature or Bugfix
- Refactoring

### Detail
This PR contains the following changes:

1. Modularization + refactoring of notebooks. There are new modules that will play a major role in the future refactoring:
* Core = the code needed for the application to operate correctly
* Common = common code for all modules
* Modules = the plugins/features that can be inserted into the system (at the moment only notebooks)

The other part related to modularization is the creation of environment parameters. Environment parameters will replace all hardcoded parameters of the environment configuration. There is a new file, config.json, that holds the application configuration. All existing parameters will be migrated via a DB migration in AWS.

2. Extracting permissions and request context (optional for the modularization). Engine, user, and user groups had been passed as parameters of the request context. This forced passing a lot of parameters to methods that didn't even need them. This information should be scoped to the request session. There is a new way to retrieve it using `RequestContext` (illustrated in the sketch after this description), and a new way to run permission checks that requires fewer parameters and makes the code cleaner. The old way was marked as deprecated.

3. Restructuring of the code (optional for the modularization). Since the modularization will touch every place in the API code, it is a good opportunity to set a new code structure. There is a small reorganization in the notebooks module to address:
* allocating resources before validating parameters
* unclear class responsibilities
* mixed layers

The new structure is:
- resolvers = validate and pass requests to the service layer
- service layer = business logic
- repositories = database logic (all queries should be placed here)
- aws = wrapper clients on top of boto3
- cdk = all logic related to creating stacks, and code for ECS
- tasks = code that will be executed in AWS Lambda (short-lived tasks)

All names can be changed.

### Relates
[data-dot-all#295](data-dot-all#295)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <[email protected]>
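As an illustration of the request-context pattern described in point 2, here is a minimal sketch; the class and function names are hypothetical and do not match the actual data.all API.

```python
# Hypothetical sketch of the request-context pattern; names are
# illustrative, not the actual data.all API.
import threading
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class RequestContext:
    db_engine: Any          # e.g. a SQLAlchemy engine
    username: str
    groups: List[str] = field(default_factory=list)

_request_storage = threading.local()

def set_context(context: RequestContext) -> None:
    # Called once per request, e.g. in the GraphQL entry point.
    _request_storage.context = context

def get_context() -> RequestContext:
    return _request_storage.context

# Service code can now read session-scoped data instead of receiving
# engine/username/groups as explicit parameters on every call.
def list_notebooks() -> None:
    context = get_context()
    print(f"Listing notebooks for {context.username} in groups {context.groups}")

set_context(RequestContext(db_engine=None, username="alice", groups=["scientists"]))
list_notebooks()
```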
### Feature or Bugfix
- Refactoring

### Detail
For the development dockerfile of the frontend, instead of starting from the Amazon Linux image and installing everything manually, it is a lot simpler to start from the node image (also pulled from ECR for security), which has the whole node-related environment pre-installed, and focus the dockerfile on project-related configuration. This PR does not introduce any change in logic; it only updates the development dockerfile and adds a `.dockerignore` file for the frontend to ignore the `node_modules` and `build` folders.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
This PR does two things:
- Makes the styling consistent across the project
- Removes dead code (unused variables, functions, and imports)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
My recent PR data-dot-all#394 has a lot of changes, but all of them touch styles only and not the code itself, so the blame history for all the changed files is lost (it will always blame me). This PR adds the file `.git-blame-ignore-revs`, which contains the id of the styling commit, and the command that you need to run once to fix the blame history for you. GitHub also understands this file and will show the history correctly. This PR is based on [this article](https://www.stefanjudis.com/today-i-learned/how-to-exclude-commits-from-git-blame/) and was suggested by Raul. The file works for the whole repository; in the future one only needs to add the ids of styling commits and the history will be preserved correctly.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring (Modularization)

### Relates
- Related issues data-dot-all#295 and data-dot-all#412

### Short Summary
First part of the migration of `Dataset` (`DatasetTableColumn`). TL;DR :)

### Long Summary
Datasets are huge. It's one of the central modules, spread everywhere across the application. Migrating the entire Dataset piece would be a very difficult task and, more importantly, even more difficult to review. Therefore, I decided to break this work down into "small" steps to make it more convenient to review.

Dataset's API consists of the following items:
* `Dataset`
* `DatasetTable`
* `DatasetTableColumn`
* `DatasetLocation`
* `DatasetProfiling`

This PR only creates the `Dataset` module and migrates `DatasetTableColumn` (and some pieces related to it). Why? Because the plan was to migrate it, see what issues came up along the way, and address them here. The refactoring of `DatasetTableColumn` itself will be in another PR.

The issues:
1) Glossaries
2) Feed
3) Long-running tasks for datasets
4) Redshift

Glossaries rely on a GraphQL UNION of different types (including datasets). I created an abstraction for glossary registration (a sketch follows below). There was an idea to change the frontend instead, but that would require a lot of work.

Feed: same as glossaries, solved in a similar way. For the feed, changing the frontend API is more feasible, but I wanted it to be consistent with glossaries.

Long-running tasks for datasets: they were migrated into the tasks folder and don't require dedicated loading of their code (at least for now). But there are two concerns: 1) the deployment uses direct module folder references to run them (e.g. `dataall.modules.datasets....`), so when a module is deactivated, these tasks should not be deployed either. I left a TODO to address this in the future (when we migrate all modules), but we should bear in mind that it might lead to inconsistencies. 2) There is a reference to `redshift` from the long-running tasks; this should be addressed in the `redshift` module.

Redshift: it has some references to `datasets`, so there will be either dependencies among modules or a small amount of code duplication (if `redshift` doesn't rely heavily on `datasets`). This will be addressed in the `redshift` module.

Other changes:
- Fixed and improved some tests
- Extracted the glue handler code related to `DatasetTableColumn`
- Renamed the import mode from tasks to handlers for the async lambda
- A few hacks that will go away with the next refactoring :)

Next steps:
- [Part 2](nikpodsh#1) in preview :)
- Extract the rest of the dataset functionality (perhaps in a few steps)
- Refactor the extractor modules the same way as notebooks
- Extract tests to follow the same structure

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
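The glossary registration abstraction could look roughly like this minimal sketch; all names here are hypothetical, and the real data.all interfaces differ.

```python
# Minimal sketch of a glossary registration abstraction; names are
# hypothetical and do not match the real data.all code.
from dataclasses import dataclass
from typing import Dict, Optional, Type

@dataclass
class GlossaryDefinition:
    """Describes how one object type (e.g. Dataset) plugs into glossaries."""
    target_type: str   # name used in the GraphQL UNION
    model: Type        # ORM model that glossary terms can link to

class GlossaryRegistry:
    """Modules register their types; core code resolves the UNION from here."""
    _definitions: Dict[str, GlossaryDefinition] = {}

    @classmethod
    def register(cls, definition: GlossaryDefinition) -> None:
        cls._definitions[definition.target_type] = definition

    @classmethod
    def find(cls, target_type: str) -> Optional[GlossaryDefinition]:
        return cls._definitions.get(target_type)

# A module registers itself at load time, so the core never imports it:
class Dataset:  # stand-in for the real ORM model
    pass

GlossaryRegistry.register(GlossaryDefinition(target_type="Dataset", model=Dataset))
```

The same registry pattern then covers the Feed case, which keeps the two consistent as the PR describes.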
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetProfilingRun

### Relates
- data-dot-all#295 and data-dot-all#412

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the third part of datasets: `DatasetStorageLocation`.
Introduced the Indexers: this code was migrated from `indexers.py` and put into modules.
Removed unused alarms (which didn't call the actual alarm code).
Introduced `DatasetShareService`, but it seems it will be migrated to the share module. All `DatasetXXXServices` will be split into Services (business logic) and Repositories (DAO layer) in future parts.

### Relates
- data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of DatasetTable: got rid of establishing an ElasticSearch connection for every request and created a lazy way to establish the connection (sketched below).

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Co-authored-by: dbalintx <[email protected]>
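A lazy connection along those lines might look like this sketch; the opensearch-py client library and the `ES_ENDPOINT` environment variable are assumptions for illustration.

```python
# Sketch of a lazily-established search connection: the client is created
# on first use, not per request. The opensearch-py dependency and the
# ES_ENDPOINT variable are assumptions, not data.all's actual wiring.
import os

class LazyEsConnection:
    def __init__(self) -> None:
        self._client = None

    @property
    def client(self):
        if self._client is None:
            # Import and connect only when search is actually needed.
            from opensearchpy import OpenSearch
            self._client = OpenSearch(hosts=[os.environ["ES_ENDPOINT"]])
        return self._client

connection = LazyEsConnection()

def search(index: str, query: dict):
    # Requests that never touch search never pay the connection cost.
    return connection.client.search(index=index, body=query)
```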
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the Dataset entity and the code related to it.
Refactoring for Votes.
Introduced DataPolicy (used the same way ServicePolicy was used).
Extracted dataset-related permissions.
Used the new `has_tenant_permission` instead of `has_tenant_perm`, which avoids passing unused parameters.

### Relates
data-dot-all#412 and data-dot-all#295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Creation of the mechanism to import dependencies between modules. It should make it easier to extract code shared between modules: one module can now refer to another module in order to import it. Deleted the `common` folder, since it has been replaced by this mechanism. Added checks and logging for modules that are unintentionally imported. A hypothetical sketch of such a loader follows.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
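Conceptually, the dependency resolution might work like the sketch below. The dependency table and loader are hypothetical; only the `dataall.modules.<name>` path convention comes from the PRs above.

```python
# Hypothetical sketch of resolving dependencies between modules before
# importing them; not the actual data.all implementation.
import importlib
import logging

log = logging.getLogger(__name__)

# A module can declare that it needs another module's code (illustrative).
MODULE_DEPENDENCIES = {
    "datasets": ["catalog"],
    "catalog": [],
    "worksheets": [],
}

def load_module(name: str, loaded: set) -> None:
    if name in loaded:
        return
    for dependency in MODULE_DEPENDENCIES.get(name, []):
        log.info("Module %s depends on %s, loading it first", name, dependency)
        load_module(dependency, loaded)
    # Log every import so unintentional module loads are visible.
    log.info("Importing module %s", name)
    importlib.import_module(f"dataall.modules.{name}")
    loaded.add(name)
```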
Modularization of Worksheets
- Moved Worksheet-related code to its own new module
- Merged the AthenaQueryResult object into Worksheets
- Moved Worksheet-related permissions to the new module
- Removed worksheet-sharing related (unused) code from the entire repo

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
…ata-dot-all#463) Merge of main (v1.5.2) -> modularization main By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dlpzx <[email protected]> Co-authored-by: wolanlu <[email protected]> Co-authored-by: Amr Saber <[email protected]> Co-authored-by: Noah Paige <[email protected]> Co-authored-by: kukushking <[email protected]> Co-authored-by: Dariusz Osiennik <[email protected]> Co-authored-by: Dennis Goldner <[email protected]> Co-authored-by: Abdulrahman Kaitoua <[email protected]> Co-authored-by: akaitoua-sa <[email protected]> Co-authored-by: Gezim Musliaj <[email protected]> Co-authored-by: Rick Bernotas <[email protected]> Co-authored-by: David Mutune Kimengu <[email protected]>
### Feature or Bugfix
- Refactoring

### Detail
First part of the modularization of ML Studio:
- Created `SageMakerStudioService` and moved the `resolvers` business logic into the service
- Redefined `resolvers` and added request validators
- Redefined `db` operations and added them to `SageMakerRepository`
- Created `sagemaker_studio_client` with SageMaker Studio SDK calls
- Created `ec2_client` with EC2 SDK calls
- Added the CloudFormation stack in the `cdk` package of the module
- Added the CloudFormation stack as an environment extension in the `cdk` package of the module
- Modified the environment creation and edit API calls and views to read `mlStudiosEnabled` from the environment parameters table
- Fixed tests and added migration scripts

Additional:
- Standardized the naming of functions and objects to "sagemaker_studio_user" to avoid legacy "Notebook" references
- Removed the unused API definition for `getSagemakerStudioUserApps`
- Cleaned up unused frontend views and API methods
- Split the migration scripts for environments, worksheets and mlstudio
- Linting of the Worksheets module
- prettier on the frontend

### Relates
data-dot-all#448

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: wolanlu <[email protected]> Co-authored-by: Amr Saber <[email protected]> Co-authored-by: Noah Paige <[email protected]> Co-authored-by: kukushking <[email protected]> Co-authored-by: Dariusz Osiennik <[email protected]> Co-authored-by: Dennis Goldner <[email protected]> Co-authored-by: Nikita Podshivalov <[email protected]> Co-authored-by: Abdulrahman Kaitoua <[email protected]> Co-authored-by: akaitoua-sa <[email protected]> Co-authored-by: Gezim Musliaj <[email protected]> Co-authored-by: Rick Bernotas <[email protected]> Co-authored-by: David Mutune Kimengu <[email protected]> Co-authored-by: nikpodsh <[email protected]> Co-authored-by: dbalintx <[email protected]>
### Feature or Bugfix
- Feature

### Detail
- For ML Studio and Environments, the original procedure of synthesizing the template and testing the resulting JSON has been replaced by the `cdk assertions` library recommended in the CDK [documentation](https://docs.aws.amazon.com/cdk/v2/guide/testing.html#testing_getting_started).
- In the Environment cdk testing, the `extent` method of the `EnvironmentStackExtension` subclasses has been mocked. Now it only tests the `EnvironmentSetup` stack as if no extensions were registered.
- In the MLStudio cdk testing, this PR adds a test for the `SageMakerDomainExtension`, mocking the environment stack. It tests the `SageMakerDomainExtension` as a standalone.

Open questions:
1) The rest of the cdk stacks (notebooks, pipelines, datasets...) are tested using the old method of printing the JSON template. Should I go ahead and migrate them to the `cdk assertions` library? If so, I thought of doing it for notebooks only and syncing on the testing for datasets and pipelines with @dbalintx and @nikpodsh.
2) With the `cdk assertions` library we can test resource properties in more depth. I added some tests on the environment stack, but we could add more asserts in all stacks. I will add more tests on the MLStudio stack, and I made a note in the GitHub project to review this once modularization is complete. An example of this testing style follows.

### Relates
- data-dot-all#295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
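For reference, a test in the `cdk assertions` style looks like this. The assertions API is the real aws-cdk-lib v2 one, but `build_mlstudio_stack` is a hypothetical placeholder standing in for the module's actual stack construction.

```python
# Example of the `cdk assertions` testing style from aws-cdk-lib v2.
import aws_cdk as cdk
from aws_cdk import assertions

def test_sagemaker_domain_created():
    app = cdk.App()
    stack = build_mlstudio_stack(app)  # hypothetical stack factory
    template = assertions.Template.from_stack(stack)

    # Assert on resource counts and properties instead of diffing raw JSON.
    template.resource_count_is("AWS::SageMaker::Domain", 1)
    template.has_resource_properties(
        "AWS::SageMaker::Domain",
        {"AuthMode": "IAM"},
    )
```

Compared with diffing a printed template, these assertions survive unrelated changes to the stack and document exactly which properties the test cares about.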
…that are inactive (data-dot-all#522) Enabled the feature to turn off tests whenever a module is inactive.
Modularization of the sharing.

### Detail
Migrated the sharing part (including tasks).
Removed unused methods for datasets.
There are a few issues that need to be addressed before merging this PR, but it can be reviewed, since we don't expect major changes here.

### Related
data-dot-all#295

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dlpzx <[email protected]> Co-authored-by: kukushking <[email protected]> Co-authored-by: Dariusz Osiennik <[email protected]> Co-authored-by: Noah Paige <[email protected]> Co-authored-by: Dennis Goldner <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Abdulrahman Kaitoua <[email protected]> Co-authored-by: akaitoua-sa <[email protected]> Co-authored-by: Gezim Musliaj <[email protected]> Co-authored-by: Rick Bernotas <[email protected]> Co-authored-by: David Mutune Kimengu <[email protected]> Co-authored-by: dbalintx <[email protected]>
Modularization of data pipelines

Changes:
- Relevant files moved from dataall/Objects/api and from dataall/db to the newly created module
- Relevant permissions extracted to the newly created module and used with the new decorators
- Functions interacting with the DB moved to a repository
- Extracted and moved Datapipelines-related code from the core CDK files to the new module: dataall/cdkproxy/cdk_cli_wrapper.py, dataall/cdkproxy/stacks/pipeline.py, dataall/cdkproxy/cdkpipeline/cdk_pipeline.py, service policies
- Extracted and moved Datapipelines-related code from the core AWS handlers to the new module: dataall/aws/handlers/stepfunction.py, dataall/aws/handlers/codecommit.py, dataall/aws/handlers/codepipeline.py, dataall/aws/handlers/glue.py
- Added the module interface for the module
- Updated unit tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
### Feature or Bugfix
- Refactoring

### Detail
Refactoring of the dashboards

### Relates
data-dot-all#509

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- [x] TODO: There was a huge merge conflict. I am testing the changes after the merge.

---------

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Abdulrahman Kaitoua <[email protected]> Co-authored-by: akaitoua-sa <[email protected]> Co-authored-by: dlpzx <[email protected]> Co-authored-by: Gezim Musliaj <[email protected]> Co-authored-by: kukushking <[email protected]> Co-authored-by: Rick Bernotas <[email protected]> Co-authored-by: David Mutune Kimengu <[email protected]> Co-authored-by: dbalintx <[email protected]>
Removing all Redshift-related code from the backend, frontend and docs. Cleaned up the dataall/core folder structure.

### Feature or Bugfix
- Refactoring

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Added missed permission checks. This PR depends on data-dot-all#537 and should be reviewed after the dashboard modularization.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Abdulrahman Kaitoua <[email protected]> Co-authored-by: akaitoua-sa <[email protected]> Co-authored-by: dlpzx <[email protected]> Co-authored-by: Gezim Musliaj <[email protected]> Co-authored-by: kukushking <[email protected]> Co-authored-by: Rick Bernotas <[email protected]> Co-authored-by: David Mutune Kimengu <[email protected]> Co-authored-by: dbalintx <[email protected]>
…o modularization/omics-module
First and foremost - this is a great contribution from @ironspur5 and team! All of your work and collaboration is super appreciated, and I think this is an amazing starting point for health science applications and data consumption patterns in data.all :) I left some comments throughout the PR - most are small changes or touch-ups, but let me know if there are any questions or disagreements. Additionally, some high-level notes I took:
Enhancement Ideas (can be picked up after this PR)
Tests I performed:
…zation/omics-module # Conflicts: # config.json # frontend/yarn.lock
…cs and worksheets
A final suggestion was left on the omics_workflow_fetcher task. And one last thought - should we restrict omics runs to follow a naming pattern using the environment resource prefix?
If we do not implement a naming convention like the above, we may want to remove the resource prefix restriction on the env group role.
I am not 100% certain, but I also don't think the self.env_resource_prefix is actually doing anything. Run ARNs don't include any naming/prefix in their format, just the Run ID. I think I put it in there to match what the other modules looked like. But the tag key condition to prevent unauthorized run access is really what I intended.
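For context, a tag-key condition of that kind could be sketched as follows. The `team` tag key and the action list are assumptions for illustration, not the exact policy the module generates.

```python
# Sketch of an IAM statement restricting omics run access by resource tag.
# The "team" tag key and the actions shown are assumptions; the actual
# policy generated by data.all may differ.
from aws_cdk import aws_iam as iam

def omics_run_access_statement(team_tag_value: str) -> iam.PolicyStatement:
    return iam.PolicyStatement(
        actions=["omics:GetRun", "omics:CancelRun", "omics:DeleteRun"],
        # Run ARNs contain only the Run ID, so a name-prefix resource
        # pattern cannot narrow access; the tag condition does the work.
        resources=["*"],
        conditions={"StringEquals": {"aws:ResourceTag/team": team_tag_value}},
    )
```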
Thanks again for all your hard work here @ironspur5 @dlpzx + team.
Final changes look good - approving!
Great work on this initiative team - the collaboration and coordination across the teams, all focused on delivering the best possible experience for our customers, has been awesome. Let's continue this collaboration for other initiatives as well.
Thank you for all your support in getting this added, team!!
Feature or Bugfix
- Feature
Background
Currently, data.all has integrations to AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to or know how to use these services in the AWS Console.
Researchers, scientists, and bioinformaticians often fall into this end user category. One such AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.
AWS HealthOmics Integration
This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.
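At the AWS API level, viewing the available Ready2Run workflows boils down to a call like the one below. This is a plain boto3 sketch, not the data.all service-layer code, and the region is only an example.

```python
# Plain boto3 sketch of listing Ready2Run workflows; data.all wraps
# calls like this in its own client layer.
import boto3

omics = boto3.client("omics", region_name="us-east-1")  # example region

# type="READY2RUN" filters to the AWS-managed workflows.
paginator = omics.get_paginator("list_workflows")
for page in paginator.paginate(type="READY2RUN"):
    for workflow in page["items"]:
        print(workflow["id"], workflow["name"], workflow["status"])
```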
Consumption Patterns
Considerations
User Journey
This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.
* Workflow ID: Immutable ID of the Ready2Run workflow
* Run Name: Customizable name of the run the user will submit
* Environment: data.all Environment AWS Account where the HealthOmics run will be executed (NOTE: the Environment must be in an AWS Region supported by HealthOmics, e.g. N. Virginia or London)
* Region: Pre-populated from the Environment; the immutable Region where the run will execute
* Owners: data.all group who owns the run
* Select S3 Output Destination: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run)
* Run Parameters: JSON parameter input in the format expected by the Ready2Run workflow. It will be pre-populated with the correct fields, and users paste their data into the appropriate fields - for example, the raw input data in S3 that will be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible. For example, raw genomic data may be hosted publicly on the AWS Registry of Open Data, and the S3 URI can be provided in a field here.) An API-level sketch of submitting such a run follows this list.
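Submitting the form above roughly corresponds to a boto3 `start_run` call like this. Every value below is a placeholder; data.all's backend starts the run with the module's own role, naming, and tagging.

```python
# Rough boto3 equivalent of submitting the run-creation form; all
# values are placeholders, not data.all's actual configuration.
import boto3

omics = boto3.client("omics", region_name="us-east-1")

response = omics.start_run(
    workflowId="1234567",                # Workflow ID (immutable, placeholder)
    workflowType="READY2RUN",
    name="my-first-omics-run",           # Run Name
    roleArn="arn:aws:iam::111122223333:role/omics-run-role",   # placeholder role
    outputUri="s3://my-dataall-dataset-bucket/omics-output/",  # S3 Output Destination
    parameters={                         # Run Parameters expected by the workflow
        "input_fastq": "s3://example-open-data-bucket/sample.fastq.gz"
    },
)
print(response["id"], response["status"])
```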
History:
Data Consumption:
Relates
Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on OWASP 10.
- …fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
- Is `eval` or similar functions used? N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.