
Fetching Open workflow intermediate build assets over S3 gives a 403 error in a really simple setup #909

Open
sacundim opened this issue Apr 10, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@sacundim

As @tsibley mentions in Pull Request #903, the Open workflow documentation recommends that users preferentially fetch the intermediate build assets over S3 rather than HTTPS. The documentation notes that this requires the S3 client to be authenticated with AWS:

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter.
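For context, the configuration in question is roughly this kind of inputs section in an ncov builds config (a sketch only; the s3:// paths below are the documented open-data files, but check the current docs before copying them):

```yaml
inputs:
  - name: open
    metadata: s3://nextstrain-data/files/ncov/open/metadata.tsv.gz
    sequences: s3://nextstrain-data/files/ncov/open/sequences.fasta.xz
```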

What I observe with my own Open-based build in AWS Batch, however, is that my job is authenticated and is able to access my own private S3 buckets:

+ echo 'Sun Apr 10 00:18:16 UTC 2022: Checking access to destination and jobs buckets'
+ aws s3 ls s3://covid-19-puerto-rico/auspice/
Sun Apr 10 00:18:16 UTC 2022: Checking access to destination and jobs buckets
2022-03-19 23:31:04     897015 ncov_global.json
2022-03-19 23:31:04      39894 ncov_global_root-sequence.json
2022-03-19 23:31:04      47575 ncov_global_tip-frequencies.json
2022-04-03 13:44:16   81954630 ncov_puerto-rico.json
2022-04-03 13:44:16      39894 ncov_puerto-rico_root-sequence.json
2022-04-03 13:44:16    3438099 ncov_puerto-rico_tip-frequencies.json
+ aws s3 ls s3://covid-19-puerto-rico-nextstrain-jobs/

...but nevertheless receives an HTTP 403 error when the build tries to fetch those assets from S3:

+ echo 'Sun Apr 10 00:18:17 UTC 2022: Running the Nexstrain build'
+ snakemake --printshellcmds --profile puerto-rico_profiles/puerto-rico_open/
Sun Apr 10 00:18:17 UTC 2022: Running the Nexstrain build
Building DAG of jobs...
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/snakemake/__init__.py", line 633, in snakemake
    keepincomplete=keep_incomplete,
  File "/usr/local/lib/python3.7/site-packages/snakemake/workflow.py", line 565, in execute
    dag.init()

[...]

  File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 262, in exists
    return self.exists_remote
  File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 135, in wrapper
    v = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/snakemake/io.py", line 314, in exists_remote
    return self.remote_object.exists()
  File "/usr/local/lib/python3.7/site-packages/snakemake/remote/S3.py", line 79, in exists
    return self._s3c.exists_in_bucket(self.s3_bucket, self.s3_key)
  File "/usr/local/lib/python3.7/site-packages/snakemake/remote/S3.py", line 327, in exists_in_bucket
    self.s3.Object(bucket_name, key).load()
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 564, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 401, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 731, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

Environment info:

The documentation states that:

Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.

...and I don't see what I could possibly have done to break the Docker execution environment, so at the very least I think this merits a documentation fix.

sacundim added the bug label on Apr 10, 2022
sacundim pushed a commit to sacundim/covid-19-puerto-rico-nextstrain that referenced this issue Apr 10, 2022
…stead of over HTTPS."

This reverts commit 8e101b8.

I filed an issue upstream over this:

* nextstrain/ncov#909
@sacundim (Author)

OK, I figured out the problem. I had misunderstood this documentation to mean that authenticating to AWS was sufficient to read from those buckets:

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. [...] Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.

But that really means (and does technically say) that such authentication is necessary (not sufficient!) for the execution environment to be able to access the s3://nextstrain-data/ bucket. In my case, IAM was denying my Batch job access to the bucket for the simple reason that I hadn't given my job containers permission to access it. Fix:
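Roughly, the fix is to grant the Batch job's IAM role read access to the bucket. A minimal sketch with boto3 (the role and policy names below are hypothetical placeholders; use whatever role your Batch job definition actually runs under):

```python
import json

import boto3

# Inline policy granting read access to the public nextstrain-data bucket.
# Note that HeadObject (which Snakemake uses to check existence) is
# authorized by the s3:GetObject permission.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::nextstrain-data"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::nextstrain-data/*"],
        },
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="my-nextstrain-batch-job-role",   # placeholder
    PolicyName="read-nextstrain-data",         # placeholder
    PolicyDocument=json.dumps(policy),
)
```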

Since this is a cost savings for your project, I think you'll likely want to document this a bit more explicitly so that people don't have to be AWS gurus.

@tsibley (Member) commented Apr 26, 2022

@sacundim Thanks for digging into this and relaying your findings here! I agree the documentation here could be clarified.

What you ran into was a nuance of cross-account access in AWS. As briefly described in AWS docs about "public" access (emphasis mine):

For IAM users and role principals within your account, no other permissions are required. For principals in other accounts, they must also have identity-based permissions in their account that allow them to access your resource. This is called cross-account access.

The linked page describes in more detail why you needed to grant access to s3://nextstrain-data in your own account's IAM configuration.

Something like the above should be mentioned in our docs.

@tsibley (Member) commented Apr 26, 2022

Relatedly, I wish it were easier in Snakemake's S3 remote support to disable request signing for these specific S3 requests, since anonymous access works fine and avoids the need to set up IAM for cross-account access.
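For reference, here is a minimal sketch of what anonymous access looks like in plain boto3, using botocore's UNSIGNED config (the object key is illustrative only):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) requests skip SigV4 signing entirely, so no IAM
# identity is involved and the cross-account rules never come into play.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Illustrative object key; any public object in the bucket works.
s3.download_file(
    "nextstrain-data",
    "files/ncov/open/metadata.tsv.gz",
    "metadata.tsv.gz",
)
```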

victorlin moved this from New to Backlog in Nextstrain planning (archived) on Apr 27, 2022