Consistent get_<DATA_ASSET> permissions - S3_Datasets (#1727)
- Feature/Bugfix

This PR is the first part of a series and only handles S3_Datasets and
its child components, Tables and Folders.

Most API calls on a particular resource are restricted by GET_RESOURCE
permissions. But for resources that are indexed in the Catalog, some
metadata is considered public, as it is useful for data consumers to
discover and understand the data assets. Users click on these
resources from the Catalog view, which triggers one of the following
API calls:
- getDataset
- getDatasetStorageLocation
- getDatasetTable
- getDashboard
- getRedshiftDataset
- getRedshiftTable

Of the above, all were initially decorated with resource_permission
checks except getDataset and getDatasetTable.

- Public information should be available for data consumers to explore,
which means we first need to remove the resource_permission checks from
the APIs listed above.
- Not all information is public: to get AWS information and other
restricted metadata, we still need to verify GET_X permissions on the
resource.
- For restricted metadata, we should return default messages that do
not break the frontend.
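
The resulting split can be sketched with a minimal, hypothetical example. The decorator below only mimics data.all's `ResourcePolicyService.has_resource_permission`; the permission store and function signatures are illustrative stand-ins, not the actual service code:

```python
# Toy sketch of the public vs. restricted metadata split.
# PERMISSIONS and the signatures below are hypothetical simplifications.

class UnauthorizedError(Exception):
    pass

# Toy permission store: resource uri -> usernames holding GET_DATASET
PERMISSIONS = {"dataset-1": {"alice"}}

def has_resource_permission(permission):
    """Decorator that rejects callers without the named permission."""
    def wrap(func):
        def inner(username, uri, dataset):
            if username not in PERMISSIONS.get(uri, set()):
                raise UnauthorizedError(f"{username} lacks {permission} on {uri}")
            return func(username, uri, dataset)
        return inner
    return wrap

def get_dataset(username, uri, dataset):
    # Public metadata: no permission check, discoverable from the Catalog
    return {"name": dataset["name"], "description": dataset["description"]}

@has_resource_permission("GET_DATASET")
def get_dataset_restricted_information(username, uri, dataset):
    # Restricted metadata: only returned to callers holding GET_DATASET
    return {"AwsAccountId": dataset["AwsAccountId"],
            "S3BucketName": dataset["S3BucketName"]}

dataset = {"name": "sales", "description": "Sales data",
           "AwsAccountId": "111122223333", "S3BucketName": "sales-bucket"}

print(get_dataset("bob", "dataset-1", dataset))  # public fields, any user
try:
    get_dataset_restricted_information("bob", "dataset-1", dataset)
except UnauthorizedError as e:
    print(e)  # bob lacks GET_DATASET on dataset-1
```

In the real PR the restricted fields live in a separate GraphQL type (`DatasetRestrictedInformation`) whose resolver carries the permission check, so the public fields of a `Dataset` resolve for everyone while `restricted` resolves only for authorized users.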

In addition, this PR removes some unused fields and simplifies
`syncTables` to return an integer with the count of synced tables.
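
The simplified `syncTables` behavior amounts to the following standalone sketch (hypothetical logic; the real mutation upserts rows in the database and returns `DatasetRepository.count_dataset_tables(...)`):

```python
def sync_tables(glue_tables, catalog_tables):
    """Sync the catalog with Glue and return the number of synced tables.

    Hypothetical sketch: upsert every table found in Glue, drop catalog
    entries that no longer exist in Glue, and return a plain integer
    instead of the old paginated DatasetTableSearchResult.
    """
    catalog = set(catalog_tables)
    catalog.update(glue_tables)   # upsert tables discovered in Glue
    catalog &= set(glue_tables)   # remove tables deleted from Glue
    return len(catalog)           # count of synced tables

print(sync_tables(["orders", "customers", "items"], ["orders", "legacy"]))  # → 3
```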

For each of the following, I tested with a user that has GET permissions
and with one that does not. The frontend views render and show either
the information or an "unauthorized to view info" placeholder:
- [x] Dataset View, Dataset Edit Form and list Datasets
- [x] Dataset Data tab with list of tables and folders
- [x] Table view, Table Edit
- [x] Folder view and Folder Edit

Other checks
- [x] Complete share request
- [x] As the requester, check table and folder and view the details of the
account...
- [x] Worksheets query with owned table
- [x] Worksheets query with shared table
- [x] Crawl dataset - correct S3 path
- [x] Sync tables - correct params

Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP Top 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries? This
includes fetching data from storage outside the application (e.g. a
database, an S3 bucket).
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx committed Dec 10, 2024
1 parent 6ee449b commit 5d8e33a
Showing 38 changed files with 273 additions and 231 deletions.
6 changes: 6 additions & 0 deletions backend/dataall/modules/s3_datasets/api/dataset/resolvers.py
@@ -109,6 +109,12 @@ def get_dataset_statistics(context: Context, source: S3Dataset, **kwargs):
return DatasetService.get_dataset_statistics(source)


def get_dataset_restricted_information(context: Context, source: S3Dataset, **kwargs):
if not source:
return None
return DatasetService.get_dataset_restricted_information(uri=source.datasetUri, dataset=source)


@is_feature_enabled('modules.s3_datasets.features.aws_actions')
def get_dataset_assume_role_url(context: Context, source, datasetUri: str = None):
return DatasetService.get_dataset_assume_role_url(uri=datasetUri)
45 changes: 22 additions & 23 deletions backend/dataall/modules/s3_datasets/api/dataset/types.py
@@ -11,6 +11,7 @@
get_dataset_statistics,
get_dataset_glossary_terms,
resolve_dataset_stack,
get_dataset_restricted_information,
)
from dataall.core.environment.api.enums import EnvironmentPermission

@@ -23,6 +24,22 @@
],
)

DatasetRestrictedInformation = gql.ObjectType(
name='DatasetRestrictedInformation',
fields=[
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='region', type=gql.String),
gql.Field(name='S3BucketName', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
gql.Field(name='IAMDatasetAdminRoleArn', type=gql.String),
gql.Field(name='KmsAlias', type=gql.String),
gql.Field(name='importedS3Bucket', type=gql.Boolean),
gql.Field(name='importedGlueDatabase', type=gql.Boolean),
gql.Field(name='importedKmsKey', type=gql.Boolean),
gql.Field(name='importedAdminRole', type=gql.Boolean),
],
)

Dataset = gql.ObjectType(
name='Dataset',
fields=[
@@ -35,29 +52,13 @@
gql.Field(name='created', type=gql.String),
gql.Field(name='updated', type=gql.String),
gql.Field(name='admins', type=gql.ArrayType(gql.String)),
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='region', type=gql.String),
gql.Field(name='S3BucketName', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
gql.Field(name='GlueCrawlerName', type=gql.String),
gql.Field(name='GlueCrawlerSchedule', type=gql.String),
gql.Field(name='GlueProfilingJobName', type=gql.String),
gql.Field(name='GlueProfilingTriggerSchedule', type=gql.String),
gql.Field(name='IAMDatasetAdminRoleArn', type=gql.String),
gql.Field(name='KmsAlias', type=gql.String),
gql.Field(name='bucketCreated', type=gql.Boolean),
gql.Field(name='glueDatabaseCreated', type=gql.Boolean),
gql.Field(name='iamAdminRoleCreated', type=gql.Boolean),
gql.Field(name='lakeformationLocationCreated', type=gql.Boolean),
gql.Field(name='bucketPolicyCreated', type=gql.Boolean),
gql.Field(name='SamlAdminGroupName', type=gql.String),
gql.Field(name='businessOwnerEmail', type=gql.String),
gql.Field(name='businessOwnerDelegationEmails', type=gql.ArrayType(gql.String)),
gql.Field(name='importedS3Bucket', type=gql.Boolean),
gql.Field(name='importedGlueDatabase', type=gql.Boolean),
gql.Field(name='importedKmsKey', type=gql.Boolean),
gql.Field(name='importedAdminRole', type=gql.Boolean),
gql.Field(name='imported', type=gql.Boolean),
gql.Field(
name='restricted',
type=DatasetRestrictedInformation,
resolver=get_dataset_restricted_information,
),
gql.Field(
name='environment',
type=gql.Ref('EnvironmentSimplified'),
@@ -126,8 +127,6 @@
name='GlueCrawler',
fields=[
gql.Field(name='Name', type=gql.ID),
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='region', type=gql.String),
gql.Field(name='status', type=gql.String),
],
)
@@ -50,6 +50,12 @@ def resolve_dataset(context, source: DatasetStorageLocation, **kwargs):
return DatasetService.find_dataset(uri=source.datasetUri)


def get_folder_restricted_information(context: Context, source: DatasetStorageLocation, **kwargs):
if not source:
return None
return DatasetLocationService.get_folder_restricted_information(uri=source.locationUri, folder=source)


def resolve_glossary_terms(context: Context, source: DatasetStorageLocation, **kwargs):
if not source:
return None
48 changes: 10 additions & 38 deletions backend/dataall/modules/s3_datasets/api/storage_location/types.py
@@ -1,5 +1,9 @@
from dataall.base.api import gql
from dataall.modules.s3_datasets.api.storage_location.resolvers import resolve_glossary_terms, resolve_dataset
from dataall.modules.s3_datasets.api.storage_location.resolvers import (
resolve_glossary_terms,
resolve_dataset,
get_folder_restricted_information,
)

DatasetStorageLocation = gql.ObjectType(
name='DatasetStorageLocation',
@@ -11,13 +15,15 @@
gql.Field(name='owner', type=gql.String),
gql.Field(name='created', type=gql.String),
gql.Field(name='updated', type=gql.String),
gql.Field(name='region', type=gql.String),
gql.Field(name='tags', type=gql.ArrayType(gql.String)),
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='S3BucketName', type=gql.String),
gql.Field(name='S3Prefix', type=gql.String),
gql.Field(name='locationCreated', type=gql.Boolean),
gql.Field(name='dataset', type=gql.Ref('Dataset'), resolver=resolve_dataset),
gql.Field(
name='restricted',
type=gql.Ref('DatasetRestrictedInformation'),
resolver=get_folder_restricted_information,
),
gql.Field(name='userRoleForStorageLocation', type=gql.Ref('DatasetRole')),
gql.Field(name='environmentEndPoint', type=gql.String),
gql.Field(
@@ -40,37 +46,3 @@
gql.Field(name='hasPrevious', type=gql.Boolean),
],
)


DatasetAccessPoint = gql.ObjectType(
name='DatasetAccessPoint',
fields=[
gql.Field(name='accessPointUri', type=gql.ID),
gql.Field(name='location', type=DatasetStorageLocation),
gql.Field(name='dataset', type=gql.Ref('Dataset')),
gql.Field(name='name', type=gql.String),
gql.Field(name='description', type=gql.String),
gql.Field(name='owner', type=gql.String),
gql.Field(name='created', type=gql.String),
gql.Field(name='updated', type=gql.String),
gql.Field(name='region', type=gql.String),
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='S3BucketName', type=gql.String),
gql.Field(name='S3Prefix', type=gql.String),
gql.Field(name='S3AccessPointName', type=gql.String),
],
)


DatasetAccessPointSearchResult = gql.ObjectType(
name='DatasetAccessPointSearchResult',
fields=[
gql.Field(name='count', type=gql.Integer),
gql.Field(name='page', type=gql.Integer),
gql.Field(name='pageSize', type=gql.Integer),
gql.Field(name='pages', type=gql.Integer),
gql.Field(name='hasNext', type=gql.Integer),
gql.Field(name='hasPrevious', type=gql.Integer),
gql.Field(name='nodes', type=gql.ArrayType(DatasetAccessPoint)),
],
)
2 changes: 1 addition & 1 deletion backend/dataall/modules/s3_datasets/api/table/mutations.py
@@ -26,6 +26,6 @@
syncTables = gql.MutationField(
name='syncTables',
args=[gql.Argument(name='datasetUri', type=gql.NonNullableType(gql.String))],
type=gql.Ref('DatasetTableSearchResult'),
type=gql.Integer,
resolver=sync_tables,
)
6 changes: 6 additions & 0 deletions backend/dataall/modules/s3_datasets/api/table/resolvers.py
@@ -56,3 +56,9 @@ def resolve_glossary_terms(context: Context, source: DatasetTable, **kwargs):
return None
with context.engine.scoped_session() as session:
return GlossaryRepository.get_glossary_terms_links(session, source.tableUri, 'DatasetTable')


def get_dataset_table_restricted_information(context: Context, source: DatasetTable, **kwargs):
if not source:
return None
return DatasetTableService.get_table_restricted_information(uri=source.tableUri, table=source)
17 changes: 13 additions & 4 deletions backend/dataall/modules/s3_datasets/api/table/types.py
@@ -4,6 +4,7 @@
resolve_dataset,
get_glue_table_properties,
resolve_glossary_terms,
get_dataset_table_restricted_information,
)

TablePermission = gql.ObjectType(
@@ -21,6 +22,15 @@
gql.Field(name='nodes', type=gql.ArrayType(TablePermission)),
],
)
DatasetTableRestrictedInformation = gql.ObjectType(
name='DatasetTableRestrictedInformation',
fields=[
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
gql.Field(name='GlueTableName', type=gql.String),
gql.Field(name='S3Prefix', type=gql.String),
],
)

DatasetTable = gql.ObjectType(
name='DatasetTable',
@@ -35,12 +45,11 @@
gql.Field(name='created', type=gql.String),
gql.Field(name='updated', type=gql.String),
gql.Field(name='admins', type=gql.ArrayType(gql.String)),
gql.Field(name='AwsAccountId', type=gql.String),
gql.Field(name='GlueDatabaseName', type=gql.String),
gql.Field(name='GlueTableName', type=gql.String),
gql.Field(name='LastGlueTableStatus', type=gql.String),
gql.Field(name='S3Prefix', type=gql.String),
gql.Field(name='GlueTableConfig', type=gql.String),
gql.Field(
name='restricted', type=DatasetTableRestrictedInformation, resolver=get_dataset_table_restricted_information
),
gql.Field(
name='GlueTableProperties',
type=gql.String,
@@ -3,7 +3,7 @@
from dataall.core.permissions.services.resource_policy_service import ResourcePolicyService
from dataall.core.permissions.services.tenant_policy_service import TenantPolicyService
from dataall.modules.catalog.db.glossary_repositories import GlossaryRepository
from dataall.base.db.exceptions import ResourceShared, ResourceAlreadyExists
from dataall.base.db.exceptions import ResourceAlreadyExists
from dataall.modules.s3_datasets.services.dataset_service import DatasetService
from dataall.modules.s3_datasets.aws.s3_location_client import S3LocationClient
from dataall.modules.s3_datasets.db.dataset_location_repositories import DatasetLocationRepository
@@ -59,7 +59,6 @@ def list_dataset_locations(uri: str, filter: dict = None):
return DatasetLocationRepository.list_dataset_locations(session=session, uri=uri, data=filter)

@staticmethod
@ResourcePolicyService.has_resource_permission(GET_DATASET_FOLDER)
def get_storage_location(uri):
with get_context().db_engine.scoped_session() as session:
return DatasetLocationRepository.get_location_by_uri(session, uri)
@@ -135,3 +134,10 @@ def _delete_dataset_folder_read_permission(session, dataset: S3Dataset, location
}
for group in permission_group:
ResourcePolicyService.delete_resource_policy(session=session, group=group, resource_uri=location_uri)

@staticmethod
@ResourcePolicyService.has_resource_permission(GET_DATASET_FOLDER)
def get_folder_restricted_information(uri: str, folder: DatasetStorageLocation):
context = get_context()
with context.db_engine.scoped_session() as session:
return DatasetRepository.get_dataset_by_uri(session, folder.datasetUri)
@@ -311,6 +311,11 @@ def get_dataset_statistics(dataset: S3Dataset):
'upvotes': count_upvotes or 0,
}

@staticmethod
@ResourcePolicyService.has_resource_permission(GET_DATASET)
def get_dataset_restricted_information(uri: str, dataset: S3Dataset):
return dataset

@staticmethod
@TenantPolicyService.has_tenant_permission(MANAGE_DATASETS)
@ResourcePolicyService.has_resource_permission(CREDENTIALS_DATASET)
@@ -364,8 +369,6 @@ def start_crawler(uri: str, data: dict = None):

return {
'Name': dataset.GlueCrawlerName,
'AwsAccountId': dataset.AwsAccountId,
'region': dataset.region,
'status': crawler.get('LastCrawl', {}).get('Status', 'N/A'),
}

@@ -1,5 +1,4 @@
import logging

from dataall.base.context import get_context
from dataall.core.permissions.services.resource_policy_service import ResourcePolicyService
from dataall.core.permissions.services.tenant_policy_service import TenantPolicyService
@@ -41,6 +40,11 @@ def get_table(uri: str):
with get_context().db_engine.scoped_session() as session:
return DatasetTableRepository.get_dataset_table_by_uri(session, uri)

@staticmethod
@ResourcePolicyService.has_resource_permission(GET_DATASET_TABLE)
def get_table_restricted_information(uri: str, table: DatasetTable):
return table

@staticmethod
@TenantPolicyService.has_tenant_permission(MANAGE_DATASETS)
@ResourcePolicyService.has_resource_permission(UPDATE_DATASET_TABLE, parent_resource=_get_dataset_uri)
@@ -118,11 +122,7 @@ def sync_tables_for_dataset(cls, uri):
DatasetTableIndexer.upsert_all(session=session, dataset_uri=dataset.datasetUri)
DatasetTableIndexer.remove_all_deleted(session=session, dataset_uri=dataset.datasetUri)
DatasetIndexer.upsert(session=session, dataset_uri=dataset.datasetUri)
return DatasetRepository.paginated_dataset_tables(
session=session,
uri=uri,
data={'page': 1, 'pageSize': 10},
)
return DatasetRepository.count_dataset_tables(session, dataset.datasetUri)

@staticmethod
def sync_existing_tables(session, uri, glue_tables=None):
6 changes: 4 additions & 2 deletions frontend/src/modules/Folders/components/FolderOverview.js
@@ -23,12 +23,14 @@ export const FolderOverview = (props) => {
}
/>
</Box>
<FolderS3Properties folder={folder} isAdmin={isAdmin} />
{folder.restricted && (
<FolderS3Properties folder={folder} isAdmin={isAdmin} />
)}
</Grid>
<Grid item lg={4} xl={3} xs={12}>
<ObjectMetadata
environment={folder.dataset?.environment}
region={folder.dataset?.region}
region={folder.restricted?.region}
organization={folder.dataset?.environment.organization}
owner={folder.owner}
admins={folder.dataset?.SamlAdminGroupName || '-'}
8 changes: 4 additions & 4 deletions frontend/src/modules/Folders/components/FolderS3Properties.js
@@ -19,31 +19,31 @@ export const FolderS3Properties = (props) => {
S3 URI
</Typography>
<Typography color="textPrimary" variant="body2">
{`s3://${folder.dataset.S3BucketName}/${folder.S3Prefix}/`}
{`s3://${folder.restricted.S3BucketName}/${folder.S3Prefix}/`}
</Typography>
</CardContent>
<CardContent>
<Typography color="textSecondary" variant="subtitle2">
S3 ARN
</Typography>
<Typography color="textPrimary" variant="body2">
{`arn:aws:s3:::${folder.dataset.S3BucketName}/${folder.S3Prefix}/`}
{`arn:aws:s3:::${folder.restricted.S3BucketName}/${folder.S3Prefix}/`}
</Typography>
</CardContent>
<CardContent>
<Typography color="textSecondary" variant="subtitle2">
Region
</Typography>
<Typography color="textPrimary" variant="body2">
{folder.dataset.region}
{folder.restricted.region}
</Typography>
</CardContent>
<CardContent>
<Typography color="textSecondary" variant="subtitle2">
Account
</Typography>
<Typography color="textPrimary" variant="body2">
{folder.dataset.AwsAccountId}
{folder.restricted.AwsAccountId}
</Typography>
</CardContent>
</Card>
@@ -7,14 +7,17 @@ export const getDatasetStorageLocation = (locationUri) => ({
query: gql`
query getDatasetStorageLocation($locationUri: String!) {
getDatasetStorageLocation(locationUri: $locationUri) {
restricted {
AwsAccountId
region
S3BucketName
}
dataset {
datasetUri
name
label
userRoleForDataset
region
SamlAdminGroupName
S3BucketName
AwsAccountId
owner
environment {
environmentUri
Expand All @@ -31,7 +34,6 @@ export const getDatasetStorageLocation = (locationUri) => ({
created
tags
locationUri
AwsAccountId
label
name
S3Prefix