Skip to content

Commit

Permalink
Bugfix 932: Investigate why some shares did not go to failed state, b…
Browse files Browse the repository at this point in the history
…ut remained stuck or in-progress (#933)

### Feature or Bugfix
- Bugfix

### Detail
- Investigate why the shares stayed in progress even if they actually
must have failed because of missing permissions or any other reason.
- Please look at the ISSUE linked to read more about the cause. 
 - Also includes some linting fixes. 

### Resolution:

I have changed the Subject to be constructed like this:
subject = f'ALARM: DATAALL S3 Bucket {bucket.S3BucketName} Sharing
Failure'[:100]

This will ensure any unnecessary words from Subject are removed and that
we always guarantee a 100 character limit. The message body is further
created properly and with a lot of details in case the Subject line gets
truncated.
I will repeat this pattern for revoke shares too and for folder and
table shares

### Relates
- #932

### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
- How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Noah Paige <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: jaidisido <[email protected]>
Co-authored-by: dlpzx <[email protected]>
Co-authored-by: mourya-33 <[email protected]>
Co-authored-by: nikpodsh <[email protected]>
Co-authored-by: MK <[email protected]>
Co-authored-by: Manjula <[email protected]>
Co-authored-by: Zilvinas Saltys <[email protected]>
Co-authored-by: Zilvinas Saltys <[email protected]>
Co-authored-by: Daniel Lorch <[email protected]>
Co-authored-by: Anushka Singh <[email protected]>
Co-authored-by: Tejas Rajopadhye <[email protected]>
Co-authored-by: trajopadhye <[email protected]>
  • Loading branch information
16 people authored Jan 4, 2024
1 parent ef966d8 commit 458b714
Show file tree
Hide file tree
Showing 8 changed files with 64 additions and 48 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,9 @@ def trigger_table_sharing_failure_alarm(
target_environment: Environment,
):
log.info('Triggering share failure alarm...')
subject = (
f'ALARM: DATAALL Table {table.GlueTableName} Sharing Failure Notification'
)
subject = f'Data.all Share Failure for Table {table.GlueTableName}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the table {table.GlueTableName} with Lake Formation.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the table {table.GlueTableName} with Lake Formation.
Alarm Details:
- State Change: OK -> ALARM
Expand Down Expand Up @@ -51,9 +49,9 @@ def trigger_revoke_table_sharing_failure_alarm(
target_environment: Environment,
):
log.info('Triggering share failure alarm...')
subject = f'ALARM: DATAALL Table {table.GlueTableName} Revoking LF permissions Failure Notification'
subject = f'Data.all Revoke LF Permissions Failure for Table {table.GlueTableName}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to revoke Lake Formation permissions for table {table.GlueTableName} with Lake Formation.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to revoke Lake Formation permissions for table {table.GlueTableName} with Lake Formation.
Alarm Details:
- State Change: OK -> ALARM
Expand All @@ -76,11 +74,9 @@ def trigger_revoke_table_sharing_failure_alarm(

def trigger_dataset_sync_failure_alarm(self, dataset: Dataset, error: str):
log.info(f'Triggering dataset {dataset.name} tables sync failure alarm...')
subject = (
f'ALARM: DATAALL Dataset {dataset.name} Tables Sync Failure Notification'
)
subject = f'Data.all Dataset Tables Sync Failure for {dataset.name}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to synchronize Dataset {dataset.name} tables from AWS Glue to the Search Catalog.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to synchronize Dataset {dataset.name} tables from AWS Glue to the Search Catalog.
Alarm Details:
- State Change: OK -> ALARM
Expand All @@ -101,11 +97,9 @@ def trigger_folder_sharing_failure_alarm(
target_environment: Environment,
):
log.info('Triggering share failure alarm...')
subject = (
f'ALARM: DATAALL Folder {folder.S3Prefix} Sharing Failure Notification'
)
subject = f'Data.all Folder Share Failure for {folder.S3Prefix}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the folder {folder.S3Prefix} with S3 Access Point.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the folder {folder.S3Prefix} with S3 Access Point.
Alarm Details:
- State Change: OK -> ALARM
- Reason for State Change: S3 Folder sharing failure
Expand All @@ -129,11 +123,9 @@ def trigger_revoke_folder_sharing_failure_alarm(
target_environment: Environment,
):
log.info('Triggering share failure alarm...')
subject = (
f'ALARM: DATAALL Folder {folder.S3Prefix} Sharing Revoke Failure Notification'
)
subject = f'Data.all Folder Share Revoke Failure for {folder.S3Prefix}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the folder {folder.S3Prefix} with S3 Access Point.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to share the folder {folder.S3Prefix} with S3 Access Point.
Alarm Details:
- State Change: OK -> ALARM
- Reason for State Change: S3 Folder sharing Revoke failure
Expand Down Expand Up @@ -173,11 +165,9 @@ def handle_bucket_sharing_failure(self, bucket: DatasetBucket,
target_environment: Environment,
alarm_type: str):
log.info(f'Triggering {alarm_type} failure alarm...')
subject = (
f'ALARM: DATAALL S3 Bucket {bucket.S3BucketName} {alarm_type} Failure Notification'
)
subject = f'Data.all S3 Bucket Failure for {bucket.S3BucketName} {alarm_type}'[:100]
message = f"""
You are receiving this email because your DATAALL {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to {alarm_type} the S3 Bucket {bucket.S3BucketName}.
You are receiving this email because your Data.all {self.envname} environment in the {self.region} region has entered the ALARM state, because it failed to {alarm_type} the S3 Bucket {bucket.S3BucketName}.
Alarm Details:
- State Change: OK -> ALARM
- Reason for State Change: S3 Bucket {alarm_type} failure
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -510,7 +510,7 @@ def handle_share_failure(self, error: Exception) -> None:
self.target_folder, self.share, self.target_environment
)

def handle_revoke_failure(self, error: Exception) -> None:
def handle_revoke_failure(self, error: Exception) -> bool:
"""
Handles share failure by raising an alarm to alarmsTopic
Returns
Expand All @@ -526,6 +526,7 @@ def handle_revoke_failure(self, error: Exception) -> None:
DatasetAlarmService().trigger_revoke_folder_sharing_failure_alarm(
self.target_folder, self.share, self.target_environment
)
return True

@staticmethod
def generate_default_kms_decrypt_policy_statement(target_requester_arn):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -444,7 +444,7 @@ def handle_revoke_failure(self, error: Exception) -> bool:
f'with target account {self.target_environment.AwsAccountId}/{self.target_environment.region} '
f'due to: {error}'
)
DatasetAlarmService().trigger_revoke_folder_sharing_failure_alarm(
DatasetAlarmService().trigger_revoke_s3_bucket_sharing_failure_alarm(
self.target_bucket, self.share, self.target_environment
)
return True
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -111,11 +111,14 @@ def process_approved_shares(self) -> bool:
shared_item_SM.update_state_single_item(self.session, share_item, new_state)

except Exception as e:
self.handle_share_failure(table=table, share_item=share_item, error=e)
# must run first to ensure state transitions to failed
new_state = shared_item_SM.run_transition(ShareItemActions.Failure.value)
shared_item_SM.update_state_single_item(self.session, share_item, new_state)
success = False

# statements which can throw exceptions but are not critical
self.handle_share_failure(table=table, share_item=share_item, error=e)

return success

def process_revoked_shares(self) -> bool:
Expand Down Expand Up @@ -178,9 +181,12 @@ def process_revoked_shares(self) -> bool:
revoked_item_SM.update_state_single_item(self.session, share_item, new_state)

except Exception as e:
self.handle_revoke_failure(share_item=share_item, table=table, error=e)
# must run first to ensure state transitions to failed
new_state = revoked_item_SM.run_transition(ShareItemActions.Failure.value)
revoked_item_SM.update_state_single_item(self.session, share_item, new_state)
success = False

# statements which can throw exceptions but are not critical
self.handle_revoke_failure(share_item=share_item, table=table, error=e)

return success
Original file line number Diff line number Diff line change
Expand Up @@ -95,11 +95,14 @@ def process_approved_shares(self) -> bool:
shared_item_SM.update_state_single_item(self.session, share_item, new_state)

except Exception as e:
self.handle_share_failure(table, share_item, e)
# must run first to ensure state transitions to failed
new_state = shared_item_SM.run_transition(ShareItemActions.Failure.value)
shared_item_SM.update_state_single_item(self.session, share_item, new_state)
success = False

# statements which can throw exceptions but are not critical
self.handle_share_failure(table, share_item, e)

return success

def process_revoked_shares(self) -> bool:
Expand Down Expand Up @@ -151,9 +154,12 @@ def process_revoked_shares(self) -> bool:
revoked_item_SM.update_state_single_item(self.session, share_item, new_state)

except Exception as e:
self.handle_revoke_failure(share_item, table, e)
# must run first to ensure state transitions to failed
new_state = revoked_item_SM.run_transition(ShareItemActions.Failure.value)
revoked_item_SM.update_state_single_item(self.session, share_item, new_state)
success = False

# statements which can throw exceptions but are not critical
self.handle_revoke_failure(share_item, table, e)

return success
Original file line number Diff line number Diff line change
Expand Up @@ -97,11 +97,14 @@ def process_approved_shares(
shared_item_SM.update_state_single_item(session, sharing_item, new_state)

except Exception as e:
sharing_folder.handle_share_failure(e)
# must run first to ensure state transitions to failed
new_state = shared_item_SM.run_transition(ShareItemActions.Failure.value)
shared_item_SM.update_state_single_item(session, sharing_item, new_state)
success = False

# statements which can throw exceptions but are not critical
sharing_folder.handle_share_failure(e)

return success

@classmethod
Expand Down Expand Up @@ -160,11 +163,14 @@ def process_revoked_shares(
revoked_item_SM.update_state_single_item(session, removing_item, new_state)

except Exception as e:
removing_folder.handle_revoke_failure(e)
# must run first to ensure state transitions to failed
new_state = revoked_item_SM.run_transition(ShareItemActions.Failure.value)
revoked_item_SM.update_state_single_item(session, removing_item, new_state)
success = False

# statements which can throw exceptions but are not critical
removing_folder.handle_revoke_failure(e)

return success

@classmethod
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -91,10 +91,14 @@ def process_approved_shares(
shared_item_SM.update_state_single_item(session, sharing_item, new_state)

except Exception as e:
sharing_bucket.handle_share_failure(e)
# must run first to ensure state transitions to failed
new_state = shared_item_SM.run_transition(ShareItemActions.Failure.value)
shared_item_SM.update_state_single_item(session, sharing_item, new_state)
success = False

# statements which can throw exceptions but are not critical
sharing_bucket.handle_share_failure(e)

return success

@classmethod
Expand Down Expand Up @@ -161,9 +165,12 @@ def process_revoked_shares(
revoked_item_SM.update_state_single_item(session, removing_item, new_state)

except Exception as e:
removing_bucket.handle_revoke_failure(e)
# must run first to ensure state transitions to failed
new_state = revoked_item_SM.run_transition(ShareItemActions.Failure.value)
revoked_item_SM.update_state_single_item(session, removing_item, new_state)
success = False

# statements which can throw exceptions but are not critical
removing_bucket.handle_revoke_failure(e)

return success
Loading

0 comments on commit 458b714

Please sign in to comment.