Next alpha release for MIT folks #81

Open · yarikoptic opened this issue Dec 13, 2024 · 16 comments
yarikoptic (Member) commented Dec 13, 2024

Could we plan what else needs to be done to cut the next release for folks to try out? E.g., should the following still-open issues be addressed first?

jwodder (Member) commented Dec 13, 2024

yarikoptic (Member Author) commented

OK, I closed #53. @jwodder, please do the needed run to check for #54 on drogon (not enough space on typhon ATM, I think), with the desired level of verbosity etc., to track down/resolve that issue.

yarikoptic (Member Author) commented

FWIW, I have now freed up enough space on typhon to run it there too.

jwodder (Member) commented Dec 17, 2024

@yarikoptic I believe #83 fixes #54. I ran the code on drogon with --path-filter '^zarr/aa', and it successfully ran for over ten hours before failing due to an error on S3's end.

yarikoptic (Member Author) commented

Great, thanks!

Re the error on S3's end -- I think all kinds of 500s can happen. Do we have retries in place for whenever interactions with S3 fail with some 5xx error?

jwodder (Member) commented Dec 17, 2024

@yarikoptic No. I'm not even sure how to discern 5xx from other errors with this SDK.
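
For illustration, one plausible way to classify such failures -- a rough sketch assuming a recent aws-sdk-s3, not code from this repository -- is to inspect the modeled error code via the ProvideErrorMetadata trait; the helper name and the list of codes are illustrative:

    // Rough sketch, assuming a recent aws-sdk-s3: classify a failed GetObject
    // by its modeled error code. The helper name and code list are
    // illustrative, not taken from this repository.
    use aws_sdk_s3::error::{ProvideErrorMetadata, SdkError};
    use aws_sdk_s3::operation::get_object::GetObjectError;

    fn looks_like_server_error(err: &SdkError<GetObjectError>) -> bool {
        // S3 reports its 5xx conditions with a small set of error codes.
        matches!(
            err.code(),
            Some("InternalError" | "ServiceUnavailable" | "SlowDown")
        )
    }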

yarikoptic (Member Author) commented

Which SDK in particular? (Sounds "too bad to be true".)

jwodder (Member) commented Dec 17, 2024

The official AWS Rust S3 SDK.

yarikoptic (Member Author) commented

I see them testing retries on 500 for some components, e.g.
https://github.com/awslabs/aws-sdk-rust/blob/7791b008320b70ea59880079ba901908f3712f5c/sdk/aws-config/src/imds/client.rs#L809
and issues like

hinting that in general it should be retried. What specific error did you receive? An issue might need to be reported against the SDK.

jwodder (Member) commented Dec 17, 2024

@yarikoptic The error was:

Error: failed to get object at s3://dandiarchive/zarr/aae64bf8-594e-4c7b-a1a2-9bcbc8638186/0/0/0/4/8/156?versionId=o_N.RxYr0iL73ZNug020YtzrSaWoJce2

Caused by:
    0: service error
    1: unhandled error (InternalError)
    2: Error { code: "InternalError", message: "We encountered an internal error. Please try again.", aws_request_id: "XB0VWDXJH3351P6T", s3_extended_request_id: "QdlprriTEB4HT9v6CzktdTBJUZf8XFecTws0XGaTVEJ6rjRukIxT2Pu+/NdMBOq8fNRzsAHGTk9cK3GmDAFjvXxREvGj0+rcb6P/AEpNYkY=" }

This occurred on a "get object" call.

yarikoptic (Member Author) commented

Ah, so it looks very similar to that, and IMHO it should be retried for this GET object. Could you please report an issue? Meanwhile -- wouldn't it be possible to configure retries on any InternalErrors?
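
For illustration, the application-level variant of that idea might look like the following hypothetical sketch -- not the SDK's built-in retry machinery; the helper name, backoff values, and use of tokio are assumptions:

    // Hypothetical application-level fallback, not the SDK's built-in retry:
    // re-issue a GetObject up to `max_attempts` times with exponential backoff
    // whenever the modeled error code is "InternalError". Assumes a tokio
    // runtime; names and limits are illustrative.
    use std::time::Duration;

    use aws_sdk_s3::error::ProvideErrorMetadata;
    use aws_sdk_s3::operation::get_object::GetObjectOutput;
    use aws_sdk_s3::Client;

    async fn get_object_with_retry(
        client: &Client,
        bucket: &str,
        key: &str,
        max_attempts: u32,
    ) -> Result<GetObjectOutput, Box<dyn std::error::Error + Send + Sync>> {
        let mut delay = Duration::from_millis(500);
        for attempt in 1..=max_attempts {
            match client.get_object().bucket(bucket).key(key).send().await {
                Ok(out) => return Ok(out),
                Err(e) if e.code() == Some("InternalError") && attempt < max_attempts => {
                    // Transient server-side error: wait, double the delay, retry.
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
                Err(e) => return Err(e.into()),
            }
        }
        unreachable!("the final attempt always returns above")
    }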

jwodder (Member) commented Dec 17, 2024

@yarikoptic The SDK uses a default retry config, so it's possible it did retry but that just didn't show up in the logs due to the low log level. I seem to recall from working on dandidav that the AWS SDK logs a lot of noise at DEBUG and TRACE level, so increasing the log verbosity across the board might not be such a good idea.
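
One way around that, assuming the logging here is set up with tracing-subscriber (an assumption, not confirmed in this thread), is a per-crate filter that keeps the application at INFO while raising only the smithy runtime crates to DEBUG:

    // Sketch, assuming tracing-subscriber with its `env-filter` feature: keep
    // everything at INFO except the smithy runtime crates, which carry the
    // retry/orchestrator traces. Crate names and levels are illustrative.
    use tracing_subscriber::EnvFilter;

    fn init_logging() {
        tracing_subscriber::fmt()
            .with_env_filter(EnvFilter::new(
                "info,aws_smithy_runtime=debug,aws_smithy_runtime_api=debug",
            ))
            .init();
    }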

yarikoptic (Member Author) commented

Do you know if it logs at INFO or WARNING level when the initial failure happens, before a retry? I failed to track that down in the code there.

jwodder (Member) commented Dec 17, 2024

@yarikoptic The logs from the most recent run don't have any entries emitted by the SDK (which was limited to ERROR, WARNING, and INFO by the top-level logging config).

EDIT: Correction: I just double-checked, and there are 6 INFO messages from the aws_smithy_runtime_api crate that all just say "Connection encountered an issue and should not be re-used. Marking it for closure" and link to https://smithy-lang.github.io/smithy-rs/design/client/detailed_error_explanations.html. I'm not sure what role that plays in retrying.

jwodder (Member) commented Dec 20, 2024

@yarikoptic So what, if anything, should we do about retrying?

yarikoptic (Member Author) commented

  • If you think that it didn't retry at all -- we'd better submit an issue against the SDK.
  • If you think that it did retry but not enough -- and I think that is what we could do anyway to robustify -- let's change the default retry policy to retry up to e.g. 10 times with exponential backoff, as described in https://docs.aws.amazon.com/sdk-for-rust/latest/dg/retries.html (see the sketch below).
    • If we encounter a similar problem after introducing this change to the retry policy, it would almost definitely mean it didn't retry, and we must report it.
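
A sketch of that retry-policy change, following the pattern in the linked AWS docs; how the client is actually constructed in this project is an assumption here:

    // Sketch of raising the SDK's standard retry mode to 10 attempts with its
    // built-in exponential backoff, per the AWS docs linked above. The client
    // construction shown is an assumption about this project's setup.
    use aws_config::retry::RetryConfig;
    use aws_config::BehaviorVersion;

    async fn make_client() -> aws_sdk_s3::Client {
        let config = aws_config::defaults(BehaviorVersion::latest())
            .retry_config(RetryConfig::standard().with_max_attempts(10))
            .load()
            .await;
        aws_sdk_s3::Client::new(&config)
    }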

jwodder added a commit that referenced this issue Dec 20, 2024: "Increase S3 retry count to 10"