Mount public wpsoutputs data #360
E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1873/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-69.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1229/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1921/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-46.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1246/
NOTEBOOK TEST RESULTS
…mount-public-wpsoutputs-data
E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1928/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-46.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1251/
NOTEBOOK TEST RESULTS
CHANGES.md
- Add public wps outputs directory to Cowbird and add corresponding volume mount to JupyterHub.
- Update `cowbird` service from [1.2.0](https://github.com/Ouranosinc/cowbird/tree/1.2.0)
  to [2.0.0](https://github.com/Ouranosinc/cowbird/tree/2.0.0)
- Require `MongoDB==5.0` Docker image for Cowbird's database.
Add that Cowbird now uses a dedicated MongoDB instance with a mention that if anyone is migrating data, they need to transfer it over as needed (technically breaking, though I don't expect anyone is using Cowbird yet, so not that critical).
This should also be indicated as "breaking change" in the PR description and included in the merge message once completed.
I updated the changelog, reusing most of the description from the recent similar change applied to Weaver.
birdhouse/components/cowbird/config/cowbird/config.yml.template
WPS_OUTPUTS_DIR: ${WPS_OUTPUTS_DIR}
WPS_OUTPUTS_PUBLIC_SUBDIR: ${WPS_OUTPUTS_PUBLIC_SUBDIR}
WORKSPACE_DIR: ${DATA_PERSIST_ROOT}/${USER_WORKSPACES}
Are the variables themselves actually used in Cowbird (through `os.getenv` or similar)?
It seems they should already be included in the cowbird/config.yml.template, but maybe I'm mistaken?
Not a big deal though if they are provided again here.
The variables are indeed also included in the `config.yml.template` file, but when that template file gets converted to the `config.yml` file, the env variables are not all resolved. I noticed, for example, that `WORKSPACE_DIR` and `WPS_OUTPUTS_DIR` keep their unresolved format `${VAR_NAME}`. By including the environment variables here in the `docker-compose`, the variables can get resolved later in Cowbird if necessary.
When loading the config, Cowbird attempts to resolve any remaining unresolved env vars found in the `config` using `os.path.expandvars()`.
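As a small illustration of the behaviour described above, the sketch below shows how `os.path.expandvars()` treats config values: a `${VAR}` reference is substituted only when the variable is defined in the process environment, and is left untouched otherwise. The dictionary keys and the `/data/wps_outputs` value are made up for the example; this is not Cowbird's actual config-loading code.

```python
import os

# Hypothetical value, standing in for what docker-compose would inject.
os.environ["WPS_OUTPUTS_DIR"] = "/data/wps_outputs"
# Simulate a variable that was NOT passed to the container.
os.environ.pop("WORKSPACE_DIR", None)

# Values as they might remain after an incomplete template substitution.
config = {
    "wps_outputs_dir": "${WPS_OUTPUTS_DIR}",
    "workspace_dir": "${WORKSPACE_DIR}",
}

# expandvars() substitutes defined variables and leaves unknown ones as-is.
resolved = {key: os.path.expandvars(value) for key, value in config.items()}

print(resolved["wps_outputs_dir"])  # /data/wps_outputs
print(resolved["workspace_dir"])    # ${WORKSPACE_DIR}
```

This is why passing the variables through the container environment matters: without them, the late expansion has nothing to substitute.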
E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/1968/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-20.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1266/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2031/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-118.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1291/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2066/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1311/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2067/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-133.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1312/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2068/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-46.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1313/
NOTEBOOK TEST RESULTS
.readthedocs.yml
image: stable
os: "ubuntu-22.04"
tools:
  python: "3.6"
Unless something breaks during the build, use at least 3.8, since 3.6 has already been EOL for a while.
(could also be a separate PR)
Version updated to 3.10, readthedocs seems to build successfully.
Very weird that even the latest commit built on RTD without problems, using the options that you found were causing errors: ce5c14d
But your latest one is also good: 90611a7 (#360)
Funky stuff happening recently on RTD it seems...
The latest build on master could succeed because RTD is deprecating the option progressively, with temporary enforcement windows until the option becomes fully deprecated in a few weeks (see blog here). I just happened to build on Sept. 18th, when I got the deprecation error.
…mount-public-wpsoutputs-data
E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2086/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-20.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1327/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2087/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-90.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL : http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/1328/
NOTEBOOK TEST RESULTS
…mount-public-wpsoutputs-data
E2E Test Results
DACCS-iac Pipeline Results
Build URL : http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/2093/
Result : failure
BIRDHOUSE_DEPLOY_BRANCH : mount-public-wpsoutputs-data
DACCS_CONFIGS_BRANCH : master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH : master
PAVICS_SDI_BRANCH : master
DESTROY_INFRA_ON_EXIT : true
PAVICS_HOST : https://host-140-20.rdext.crim.ca
Infrastructure deployment failed. Instance has not been destroyed. @matprov
Sorry for the late reply. This is the kind of PR I need to deploy to my test VM for testing, but I just don't have the bandwidth to work on multiple PRs anymore. The requirements for a VM have increased quite a bit and I am not able to run multiple VMs on my workstation simultaneously anymore.
Anyway, without a real deploy I spotted a few changes that I think are worth considering fixing.
volumes:
  - ./components/cowbird/config/cowbird/config.yml:/opt/local/src/cowbird/config/cowbird.yml
  - ./components/cowbird/config/cowbird/cowbird.ini:/opt/local/src/cowbird/config/cowbird.ini
  # even if not running tasks here, they must be registered to send them off to the right place!
  - ./components/cowbird/config/cowbird/celeryconfig.py:/opt/local/src/cowbird/config/celeryconfig.py
  - "${DATA_PERSIST_ROOT}/${USER_WORKSPACES}:/${USER_WORKSPACES}"
  - "${DATA_PERSIST_ROOT}:${DATA_PERSIST_ROOT}"
Woo, this has write-access to the entire `$DATA_PERSIST_ROOT`!!! Isn't this too much access?
I am guessing it only needs access to `JUPYTERHUB_USER_DATA_DIR="$DATA_PERSIST_ROOT/jupyterhub_user_data"` and the Thredds wps_output dir?
@ChaamC correct me if I'm wrong... but the reason for this is that both of those directories need to be mounted in the same volume for hardlinks to work:
@tlvu In #356 we introduce the `DATA_PERSIST_SHARED_ROOT` variable here instead. This gives us the ability to mount only the required subdirectories as a single volume.
Yes I am a bit behind in most things cowbird. Understood that for hard links to work, the src and dest of the link should be on the same filesystem.
So if we need to expose Thredds wps_outputs and possibly the Geoserver workspace in the Jupyter env workspace, then we only need to volume-mount those 3 in the cowbird container, not the entire `/data`.
If `DATA_PERSIST_SHARED_ROOT` is the same as `DATA_PERSIST_ROOT` by default, then it comes back to the same as exposing the entire `/data` to cowbird as writable.
Basically, would something like the following work:
services:
  cowbird:
    volumes:
      - ./components/cowbird/config/cowbird/config.yml:/opt/local/src/cowbird/config/cowbird.yml
      - ./components/cowbird/config/cowbird/cowbird.ini:/opt/local/src/cowbird/config/cowbird.ini
      # even if not running tasks here, they must be registered to send them off to the right place!
      - ./components/cowbird/config/cowbird/celeryconfig.py:/opt/local/src/cowbird/config/celeryconfig.py
      - "${DATA_PERSIST_SHARED_ROOT}/jupyter_user_data:${DATA_PERSIST_SHARED_ROOT}/jupyter_user_data"
      - "${DATA_PERSIST_SHARED_ROOT}/datasets/wps_outputs:${DATA_PERSIST_SHARED_ROOT}/datasets/wps_outputs"
      - "${DATA_PERSIST_SHARED_ROOT}/geoserver/workspaces:${DATA_PERSIST_SHARED_ROOT}/geoserver/workspaces"
So the paths inside and outside of the cowbird container are exactly the same, giving it the impression that it has access to the entire `/data`, but that is not true.
By the way, I am not sure `${DATA_PERSIST_SHARED_ROOT}/datasets/wps_outputs` actually works, since I think the actual wps_outputs is in a data-volume: https://github.com/bird-house/birdhouse-deploy/blob/master/birdhouse/config/wps_outputs-volume/docker-compose-extra.yml
Usage of that wps_outputs data-volume:
birdhouse/config/raven/config/wps_outputs-volume/docker-compose-extra.yml
6: - wps_outputs:/data/wpsoutputs
birdhouse/config/thredds/docker-compose-extra.yml
24: - wps_outputs:/pavics-data/wps_outputs
Anyways, many inter-connected pieces so not simple to wrap my head around.
> If `DATA_PERSIST_SHARED_ROOT` is the same as `DATA_PERSIST_ROOT` by default, then it comes back to the same as exposing the entire `/data` to cowbird writable.

Yes it would. But... I would like to see `DATA_PERSIST_SHARED_ROOT` not be the same as `DATA_PERSIST_ROOT`. The only reason it is like that by default is to maintain backwards compatibility.

> Basically would something like the following works:

No, unfortunately that wouldn't work. They actually need to be mounted at the same mount-point. The relative location matters for symlinks, but a shared mount-point matters for hard-links.
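To make the mount-point constraint concrete, here is a minimal sketch of a probe that checks whether files in one directory can be hard-linked into another. On Linux, `link(2)` fails with `EXDEV` across different mount points even when the same filesystem is mounted at both, which is why the sketch attempts a real link instead of comparing device IDs. The function name `can_hardlink` and the probe file prefix are hypothetical, and this is not something Cowbird is known to do.

```python
import errno
import os
import tempfile


def can_hardlink(src_dir: str, dst_dir: str) -> bool:
    """Return True if a file in src_dir can be hard-linked into dst_dir."""
    # Create a throwaway file in the source directory.
    fd, probe_src = tempfile.mkstemp(dir=src_dir, prefix=".linkprobe-")
    os.close(fd)
    probe_dst = os.path.join(dst_dir, os.path.basename(probe_src))
    try:
        os.link(probe_src, probe_dst)
    except OSError as exc:
        if exc.errno == errno.EXDEV:  # cross-device (or cross-mount) link
            return False
        raise
    else:
        os.remove(probe_dst)
        return True
    finally:
        os.remove(probe_src)


# Demo with two sibling directories that live under the same mount.
with tempfile.TemporaryDirectory() as root:
    a, b = os.path.join(root, "a"), os.path.join(root, "b")
    os.mkdir(a)
    os.mkdir(b)
    linkable = can_hardlink(a, b)

print(linkable)
```

Run inside the container against the two mounted paths, a probe like this would report whether a given volume layout allows hard links before any real data is touched.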
> By the way, I am not sure `${DATA_PERSIST_SHARED_ROOT}/datasets/wps_outputs` actually works since I think the actual wps_outputs is in a data-volume

I don't have enough knowledge to comment on this. Sorry
> @ChaamC correct me if I'm wrong... but the reason for this is that both of those directories need to be mounted in the same volume for hardlinks to work:

Yes, exactly. If we want to share the wps_outputs data with the user, we need to use hardlinks, and cowbird requires the source and destination of the hardlinks to be in the same volume/file partition, or it will trigger a `Cross-device link` error.
I needed to mount the full data directory to be able to make hardlinks between files from the `/data/wps_outputs` and `/data/user_workspaces` directories.
We can update the volume mount to use the upcoming new variable `DATA_PERSIST_SHARED_ROOT` as long as all files used by cowbird for this PR's feature are contained in this directory.
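For readers less familiar with hard links, the following sketch shows why they avoid duplicated data: both names refer to the same inode, so a write through one path is visible through the other. The directory layout under the temporary root and the file name `result.nc` are stand-ins invented for the example, not the actual Cowbird layout.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout standing in for /data/wps_outputs and /data/user_workspaces.
    outputs = os.path.join(root, "wps_outputs", "public")
    workspace = os.path.join(root, "user_workspaces", "public")
    os.makedirs(outputs)
    os.makedirs(workspace)

    src = os.path.join(outputs, "result.nc")
    dst = os.path.join(workspace, "result.nc")
    with open(src, "w") as f:
        f.write("v1")

    # Raises OSError with errno.EXDEV if src and dst are on different mounts.
    os.link(src, dst)

    same_inode = os.path.samefile(src, dst)
    link_count = os.stat(src).st_nlink

    # Appending through one name is visible through the other: there is no second copy.
    with open(src, "a") as f:
        f.write("+v2")
    with open(dst) as f:
        seen_from_workspace = f.read()

print(same_inode, link_count, seen_from_workspace)  # True 2 v1+v2
```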
Just to double check: for existing deployments, if we want `DATA_PERSIST_SHARED_ROOT=/data_shared_mount` to differ from `DATA_PERSIST_ROOT=/data`, we have to move
- /data/wps_outputs => /data_shared_mount/wps_outputs
- /data/user_workspaces => /data_shared_mount/user_workspaces
- anything else needed for the migration?
I think this migration should be documented, because anyone who starts with the current default values and does not want Cowbird to have access to the entire `/data` will have to perform this manual migration.
Or we set a different value immediately in `default.env`, and in `env.local.example` we document the backward-compatible value `DATA_PERSIST_SHARED_ROOT=/data` and warn that it gives Cowbird too much write access. I prefer a different default value now to avoid a migration later. Cowbird is not in the default enable list right now, so this should not break too many existing deployments.
Also, I still do not understand why we are referring to `/data/wps_outputs` when wps_outputs is currently a data-volume, as mentioned in my comment #360 (comment) above.
This means the `wps_outputs` dir and the `user_workspaces` dir are on different volume mounts, so hardlinks probably do not work.
Does Cowbird fall back to a regular copy when a hardlink does not work? Does it log somewhere when a hardlink does not work?
Is there a notebook or test script that tests Cowbird in the PAVICS stack that I can try?
Also a question for the future, as I heard that Cowbird may manage the Geoserver workspace too: does this mean we will need to migrate the Geoserver data dir under `DATA_PERSIST_SHARED_ROOT` later?
> anything else needed for the migration?

That's it, at least concerning this PR's changes.
There is one other thing to watch out for, though. I am not sure whether this variable's path would be affected, but changing the path of the `JUPYTERHUB_USER_DATA_DIR` variable could break the symlink that is used in the cowbird workspace to link to the `JUPYTERHUB_USER_DATA_DIR` directory. The user workspaces could eventually be resynced if we decide to change the jupyterhub path, but the resync is not yet implemented in cowbird for this symlink, so for now the symlinks would need to be recreated manually in each user workspace if that variable is changed.
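As a rough idea of what such a manual recreation could look like, the sketch below walks a workspace tree and re-points every symlink whose absolute target lies under an old root to the same relative location under a new root. The function name `repoint_symlinks`, the `notebooks` link name, and the directory layout are all assumptions for illustration; Cowbird's real workspace structure may differ, and relative symlink targets are not handled here.

```python
import os
import tempfile
from typing import List


def repoint_symlinks(workspaces_root: str, old_target_root: str, new_target_root: str) -> List[str]:
    """Re-create symlinks under workspaces_root that still point into old_target_root."""
    updated = []
    for dirpath, dirnames, filenames in os.walk(workspaces_root):
        # Symlinks may be listed as directories or files depending on their target.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if target == old_target_root or target.startswith(old_target_root + os.sep):
                new_target = new_target_root + target[len(old_target_root):]
                os.remove(path)  # removes the link itself, not what it points to
                os.symlink(new_target, path)
                updated.append(path)
    return updated


# Demo: a user workspace whose link still points at the old location.
with tempfile.TemporaryDirectory() as root:
    old_root = os.path.join(root, "jupyterhub_user_data")
    new_root = os.path.join(root, "shared", "jupyterhub_user_data")
    os.makedirs(os.path.join(new_root, "alice"))
    workspace = os.path.join(root, "user_workspaces", "alice")
    os.makedirs(workspace)
    link = os.path.join(workspace, "notebooks")
    os.symlink(os.path.join(old_root, "alice"), link)  # dangling: old_root does not exist

    updated = repoint_symlinks(os.path.join(root, "user_workspaces"), old_root, new_root)
    new_target = os.readlink(link)
    resolves = os.path.isdir(link)

print(len(updated), resolves)  # 1 True
```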
If we know we will eventually do the migration, I don't see any reason not to do it now. I think it would be okay to change it now if you prefer.
> Also, I still do not understand why we are referring to /data/wps_outputs when currently wps_outputs is a data-volume as mentioned in my comment #360 (comment) above.
> This means different volume-mount between wps_outputs dir and user_workspaces dir and hardlink probably do not work.

I am not sure I understand the problem here. Do you mean the fact that the birds that produce wps_outputs data only mount the wps_outputs volume (and not the user workspace volume)? Any bird that produces wps_outputs only needs access to the wps_outputs volume. Only cowbird requires full access to the data directory, to allow the creation of hardlinks between the wps_outputs and user workspace directories. Once a hardlink is created, a volume mount does not need to include the full data directory for the hardlink to be accessible. It is really only at the moment of creation, in cowbird, that we require the full mount of the data directory.

> Does Cowbird fallback to a regular copy when hardlink do not work? Does it log somewhere when hardlink do not work?

If the hardlink fails, the failure is only logged in Cowbird's logs. We want to avoid having 2 independent copies that could diverge. We did implement the resync function for the wps_outputs data, so if needed, we can call that API endpoint on cowbird, and cowbird will regenerate any missing hardlinks that could have resulted from a previous failure.
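The log-only, no-copy behaviour and the resync idea described above could be sketched roughly as follows. This is an illustrative approximation, not Cowbird's implementation: the function name `resync_hardlinks`, the logger name, and the directory layout are assumptions made for the example.

```python
import logging
import os
import tempfile
from typing import List

logger = logging.getLogger("resync_sketch")


def resync_hardlinks(src_root: str, dst_root: str) -> List[str]:
    """Create any hard link missing under dst_root for files found under src_root."""
    created = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            if os.path.lexists(dst):
                continue  # something already exists at the destination; leave it alone
            try:
                os.link(src, dst)
                created.append(dst)
            except OSError as exc:
                # Log only; deliberately no fallback to a copy that could diverge.
                logger.error("could not link %s -> %s: %s", src, dst, exc)
    return created


# Demo: the first pass creates the links, the second pass finds nothing missing.
with tempfile.TemporaryDirectory() as root:
    src_root = os.path.join(root, "wps_outputs", "public")
    dst_root = os.path.join(root, "user_workspaces", "public", "wps_outputs")
    os.makedirs(os.path.join(src_root, "job-1"))
    for name in ("a.nc", "b.nc"):
        with open(os.path.join(src_root, "job-1", name), "w") as f:
            f.write(name)

    first = resync_hardlinks(src_root, dst_root)
    second = resync_hardlinks(src_root, dst_root)
    first_names = sorted(os.path.relpath(p, dst_root) for p in first)

print(first_names, second)
```

Skipping destinations that already exist keeps the pass idempotent, so it can be re-run after a failure without touching links that are already in place.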
> Is there a notebook or test script that tests Cowbird in the PAVICS stack I can try?

Not at the moment. I am looking into creating a notebook to verify the different parts of the user workspace (notebook symlink, wps_outputs hardlinks, geoserver data, etc.), but I am not sure yet if it is achievable.

> Also question for the future as I heard maybe Cowbird will manage Geoserver workspace too. Does this means we will need to migrate Geoserver data dir under DATA_PERSIST_SHARED_ROOT later?

Cowbird already manages geoserver's file permissions on the user workspace. Since the data is found directly in the user workspace, it doesn't require any symlinks or hardlinks: the data is already accessible to the user, so it shouldn't be a problem.
Unless there is external geoserver data to be made accessible to the user that I am not aware of yet?
Overview
This PR addresses the need for users to have access to public wps outputs data. The latest version of Cowbird monitors the wps outputs data directory (which contains public data and user restricted data) and isolates its public data via hardlinks to another directory. This other directory is then mounted on JupyterLab instances to give data access to the users.
cowbird#40 must first be merged before merging this PR.
Changes
Non-breaking changes
- Add `WPS_OUTPUTS_DIR` env variable to manage the location of the wps outputs data.

Breaking changes
- Update `cowbird` service from 1.2.0 to latest version
- Require `MongoDB==5.0` Docker image for Cowbird's database.

Because the new `MongoDB==5.0` database requirement for Cowbird is (potentially) a distinct version from the one used by other birds, a separate Docker image is employed only for Cowbird. If some processes, jobs, or other Cowbird-related data were already defined on one of your server instances, they must be manually transferred from the generic `${DATA_PERSIST_ROOT}/mongodb_persist` directory to the new `${DATA_PERSIST_ROOT}/mongodb_cowbird_persist` directory. The data in the new directory should then be migrated to the new version following the same procedure as described for Weaver in Database Migration.
Related Issue / Discussion
Related to Jira task DAC-570 and PR cowbird#40
Update Cowbird's version after Cowbird's PR is merged