THREDDS: add more options to configure catalog.xml #472

Merged on Oct 31, 2024 (8 commits). The diff below shows changes from 5 commits.

32 changes: 31 additions & 1 deletion CHANGES.md
@@ -15,7 +15,37 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- THREDDS: add more options to configure `catalog.xml`
- The default THREDDS configuration creates two default datasets, the *Service Data* dataset and the
*Main* dataset. The *Service Data* dataset is used internally and hosts WPS outputs. The *Main* dataset is the
place where users can access data served by THREDDS. Both of these are configured to serve files with the following
extensions: .nc .ncml .txt .md .rst .csv

- In order to allow the THREDDS server to serve files with additional extensions, this introduces two new
variables:
- `THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS`: this allows users to add [filter
elements](https://docs.unidata.ucar.edu/tds/current/userguide/tds_dataset_scan_ref.html#including-only-desired-files) to the *Service Data* dataset. This is especially useful if a WPS
outputs files with an extension other than the defaults (e.g. .h5) to the `wps_outputs/` directory.
Collaborator:

I'm curious about the use case - not against it, just making sure the intended use is appropriate.

Is there any advantage of exposing those HDF5 files via THREDDS rather than accessing them directly by the WPS-outputs dir? If anything, I would expect Nginx to provide much better/faster responses, as well as potentially additional support of Content-Range requests if enabled.

Collaborator Author:

Good question. I have no intuition about what is better/faster in this case

- `THREDDS_DATASET_DATASETSCAN_BODY`: this allows users to specify the whole body of the *Main* dataset's
[`<datasetScan>`](https://docs.unidata.ucar.edu/tds/current/userguide/tds_dataset_scan_ref.html) element.
This allows users to fully customize how this dataset serves files.

- We limit the configuration options for the *Service Data* dataset more than the *Main* dataset because the *Service
Data* dataset requires a basic configuration in order to properly serve WPS outputs. Making significant changes
to this configuration could have unexpected negative impacts on WPS usage.

- In order to allow customization of the Magpie THREDDS configuration in case new file extensions are added, we introduce
two additional variables:
- `THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES`: additional file prefixes (i.e. regular expression match patterns) that Magpie
should treat as metadata (accessible with "browse" permissions).
- `THREDDS_MAGPIE_EXTRA_DATA_PREFIXES`: additional file prefixes (i.e. regular expression match patterns) that Magpie
should treat as data (accessible with "read" permissions).

- The defaults for these new variables are fully backwards compatible. Without changing these variables, the THREDDS
server should behave exactly the same as before except that .md files and .rst files are now considered metadata
files according to the Magpie configuration, meaning that they can now be viewed with "browse" permissions.
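
The way the default and extra filter variables combine in the *Service Data* `<filter>` element can be sketched as follows. This is a minimal illustration that assumes plain `${VAR}` substitution; Python's `string.Template` is only a stand-in for the project's actual templating step, and `render_filter` and `included_wildcards` are hypothetical helper names that do not appear in this PR.

```python
from string import Template
import xml.etree.ElementTree as ET

# Default filters, copied from default.env in this PR.
THREDDS_DEFAULT_FILE_FILTERS = """
<include wildcard="*.nc" />
<include wildcard="*.ncml" />
<include wildcard="*.txt" />
<include wildcard="*.md" />
<include wildcard="*.rst" />
<include wildcard="*.csv" />
"""

# The <filter> fragment of the Service Data datasetScan in catalog.xml.template.
FILTER_TEMPLATE = Template(
    "<filter>\n"
    "${THREDDS_DEFAULT_FILE_FILTERS}\n"
    "${THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS}\n"
    "</filter>"
)


def render_filter(extra_filters: str = "") -> str:
    """Render the <filter> element (hypothetical helper, for illustration only)."""
    return FILTER_TEMPLATE.substitute(
        THREDDS_DEFAULT_FILE_FILTERS=THREDDS_DEFAULT_FILE_FILTERS,
        THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS=extra_filters,
    )


def included_wildcards(filter_xml: str) -> list[str]:
    """Parse the rendered XML and list the wildcard of each <include> element."""
    root = ET.fromstring(filter_xml)
    return [element.get("wildcard") for element in root.findall("include")]


if __name__ == "__main__":
    # With the default (empty) extra filters, only the six default wildcards appear.
    print(included_wildcards(render_filter()))
    # Adding an extra filter appends it after the defaults.
    print(included_wildcards(render_filter('<include wildcard="*.h5" />')))
```

Parsing the rendered fragment with an XML parser is also a cheap way to check that a custom value is well-formed before deploying it.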

[2.5.3](https://github.com/bird-house/birdhouse-deploy/tree/2.5.3) (2024-09-11)
------------------------------------------------------------------------------------------------------------------
37 changes: 11 additions & 26 deletions birdhouse/components/thredds/catalog.xml.template
@@ -4,14 +4,14 @@
 xmlns:xlink="http://www.w3.org/1999/xlink" >

 <service name="all" serviceType="Compound" base="" >
-<service name="http" serviceType="HTTPServer" base="/twitcher/ows/proxy/thredds/fileServer/" />
-<service name="odap" serviceType="OpenDAP" base="/twitcher/ows/proxy/thredds/dodsC/" />
-<service name="ncml" serviceType="NCML" base="/twitcher/ows/proxy/thredds/ncml/"/>
-<service name="uddc" serviceType="UDDC" base="/twitcher/ows/proxy/thredds/uddc/"/>
-<service name="iso" serviceType="ISO" base="/twitcher/ows/proxy/thredds/iso/"/>
-<service name="wcs" serviceType="WCS" base="/twitcher/ows/proxy/thredds/wcs/" />
-<service name="wms" serviceType="WMS" base="/twitcher/ows/proxy/thredds/wms/" />
-<service name="subsetServer" serviceType="NetcdfSubset" base="/twitcher/ows/proxy/thredds/ncss/" />
+<service name="http" serviceType="HTTPServer" base="${TWITCHER_PROTECTED_PATH}/thredds/fileServer/" />
+<service name="odap" serviceType="OpenDAP" base="${TWITCHER_PROTECTED_PATH}/thredds/dodsC/" />
+<service name="ncml" serviceType="NCML" base="${TWITCHER_PROTECTED_PATH}/thredds/ncml/"/>
+<service name="uddc" serviceType="UDDC" base="${TWITCHER_PROTECTED_PATH}/thredds/uddc/"/>
+<service name="iso" serviceType="ISO" base="${TWITCHER_PROTECTED_PATH}/thredds/iso/"/>
+<service name="wcs" serviceType="WCS" base="${TWITCHER_PROTECTED_PATH}/thredds/wcs/" />
+<service name="wms" serviceType="WMS" base="${TWITCHER_PROTECTED_PATH}/thredds/wms/" />
+<service name="subsetServer" serviceType="NetcdfSubset" base="${TWITCHER_PROTECTED_PATH}/thredds/ncss/" />
 </service>

 <datasetScan name="${THREDDS_SERVICE_DATA_LOCATION_NAME}" ID="${THREDDS_SERVICE_DATA_URL_PATH}" path="${THREDDS_SERVICE_DATA_URL_PATH}" location="${THREDDS_SERVICE_DATA_LOCATION_ON_CONTAINER}">
@@ -21,30 +21,15 @@
 </metadata>

 <filter>
-<include wildcard="*.nc" />
-<include wildcard="*.ncml" />
-<include wildcard="*.txt" />
-<include wildcard="*.md" />
-<include wildcard="*.rst" />
-<include wildcard="*.csv" />
+${THREDDS_DEFAULT_FILE_FILTERS}
+${THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS}
 </filter>

 </datasetScan>

 <datasetScan name="${THREDDS_DATASET_LOCATION_NAME}" ID="${THREDDS_DATASET_URL_PATH}" path="${THREDDS_DATASET_URL_PATH}" location="${THREDDS_DATASET_LOCATION_ON_CONTAINER}">

-<metadata inherited="true">
-<serviceName>all</serviceName>
-</metadata>
-
-<filter>
-<include wildcard="*.nc" />
-<include wildcard="*.ncml" />
-<include wildcard="*.txt" />
-<include wildcard="*.md" />
-<include wildcard="*.rst" />
-<include wildcard="*.csv" />
-</filter>
+${THREDDS_DATASET_DATASETSCAN_BODY}

 </datasetScan>

37 changes: 20 additions & 17 deletions birdhouse/components/thredds/config/magpie/providers.cfg.template
@@ -15,21 +15,24 @@ providers:
 - ".+\\.ncml" # match longest extension first to avoid truncating it by match of shorter '.nc'
 - ".+\\.nc"
 metadata_type:
-prefixes:
-- null # note: special YAML value evaluated as `no-prefix`, use quotes if literal value is needed
-- "\\w+\\.gif" # threddsIcon, folder icon, etc.
-- "\\w+\\.ico" # favicon
-- "\\w+\\.txt" # licence
-- "\\w+\\.css" # tds.css
-- "catalog\\.\\w+" # note: special case for `THREDDS` top-level directory (root) accessed for `BROWSE`
-- catalog
-- ncml
-- uddc
-- iso
+prefixes: [
+null, # note: special YAML value evaluated as `no-prefix`, use quotes if literal value is needed
+"\\w+\\.gif", # threddsIcon, folder icon, etc.
+"\\w+\\.ico", # favicon
+"\\w+\\.css", # tds.css
+"catalog\\.\\w+", # note: special case for `THREDDS` top-level directory (root) accessed for `BROWSE`
+catalog,
+ncml,
+uddc,
+iso,
+${THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES}
+]
 data_type:
-prefixes:
-- fileServer
-- dodsC
-- wcs
-- wms
-- ncss
+prefixes: [
+fileServer,
+dodsC,
+wcs,
+wms,
+ncss,
+${THREDDS_MAGPIE_EXTRA_DATA_PREFIXES}
+]
27 changes: 27 additions & 0 deletions birdhouse/components/thredds/default.env
@@ -17,7 +17,29 @@ export THREDDS_SERVICE_DATA_LOCATION_NAME='Birdhouse'
export THREDDS_DATASET_URL_PATH='datasets'
export THREDDS_SERVICE_DATA_URL_PATH='birdhouse'

export THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES='".+\\.txt", ".+\\.md", ".+\\.rst"'
export THREDDS_MAGPIE_EXTRA_DATA_PREFIXES=''

export THREDDS_DEFAULT_FILE_FILTERS='
<include wildcard="*.nc" />
<include wildcard="*.ncml" />
<include wildcard="*.txt" />
<include wildcard="*.md" />
<include wildcard="*.rst" />
<include wildcard="*.csv" />
'

export THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS=''

export THREDDS_DATASET_DATASETSCAN_BODY="
<metadata inherited='true'>
<serviceName>all</serviceName>
</metadata>

<filter>
${THREDDS_DEFAULT_FILE_FILTERS}
</filter>
"

# add any new variables not already in 'VARS' or 'OPTIONAL_VARS' that must be replaced in templates here
VARS="
@@ -28,6 +50,8 @@ VARS="
\$THREDDS_DATASET_LOCATION_NAME
\$THREDDS_DATASET_URL_PATH
\$THREDDS_DATASET_LOCATION_ON_CONTAINER
\$THREDDS_DATASET_DATASETSCAN_BODY
\$THREDDS_DEFAULT_FILE_FILTERS
"

OPTIONAL_VARS="
@@ -39,6 +63,9 @@ OPTIONAL_VARS="
\$THREDDS_IMAGE
\$THREDDS_IMAGE_URI
\$THREDDS_ADDITIONAL_CATALOG
\$THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS
\$THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES
\$THREDDS_MAGPIE_EXTRA_DATA_PREFIXES
"

export DELAYED_EVAL="
79 changes: 69 additions & 10 deletions birdhouse/env.local.example
@@ -456,26 +456,85 @@ export GEOSERVER_ADMIN_PASSWORD="${__DEFAULT__GEOSERVER_ADMIN_PASSWORD}"

# Additional catalogs for THREDDS. Add as many datasetScan XML blocks as needed to THREDDS_ADDITIONAL_CATALOG.
# Each block defines a new top-level catalog. See birdhouse/components/thredds/catalog.xml.template for more information.
export THREDDS_ADDITIONAL_CATALOG=""
#export THREDDS_ADDITIONAL_CATALOG="
# <datasetScan name='dataset_location_name' ID='dataset_url_path' path='dataset_url_path' location='dataset_location_on_container'>
export THREDDS_ADDITIONAL_CATALOG=''
#export THREDDS_ADDITIONAL_CATALOG='
# <datasetScan name="dataset_location_name" ID="dataset_url_path" path="dataset_url_path" location="dataset_location_on_container">
#
# <metadata inherited='true'>
# <metadata inherited="true">
# <serviceName>all</serviceName>
# </metadata>
#
# <filter>
# <include wildcard='*.nc' />
# <include wildcard='*.ncml' />
# <include wildcard='*.txt' />
# <include wildcard='*.md' />
# <include wildcard='*.rst' />
# <include wildcard='*.csv' />
# <include wildcard="*.nc" />
# <include wildcard="*.ncml" />
# <include wildcard="*.txt" />
# <include wildcard="*.md" />
# <include wildcard="*.rst" />
# <include wildcard="*.csv" />
# </filter>
#
# </datasetScan>
#'
# It is possible to define additional compound services in the THREDDS_ADDITIONAL_CATALOG variable as well.
# This may be useful if you are creating a catalog that only provides a subset of the services defined in the
# compound service named "all" (see birdhouse/components/thredds/catalog.xml.template).
# DO NOT define any non-compound service in THREDDS_ADDITIONAL_CATALOG that is not an exact copy of one of the
# services defined in "all"! In particular, do not change the "base" attribute of any existing service.
# Doing so may break the way that access permissions are enforced when accessing data through this service.

# Additional file filters to add for the Service Data THREDDS dataset. By default, the Service Data dataset will only
# serve files with the following extensions: .nc .ncml .txt .md .rst .csv
# If you need this dataset to serve other files, set the THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS variable to add
# additional file filters.
Comment on lines +487 to +488
Collaborator:

Mention the corresponding THREDDS_MAGPIE_... variables as well?

# This may be useful to set if a WPS outputs files to the wps_outputs/ directory (hosted under the Service Data dataset)
# in a file format other than one of the defaults.
# See the example below which would also enable serving .png and .h5 files.
#export THREDDS_SERVICE_DATA_EXTRA_FILE_FILTERS='
# <include wildcard="*.png" />
# <include wildcard="*.h5" />
#'

# Set this variable to customize the body of the <datasetScan> XML element for the main THREDDS dataset. This is typically
# the dataset where you would store most of the data served by THREDDS (additional datasets can be configured by setting the
# THREDDS_ADDITIONAL_CATALOG variable).
# By default, the main dataset will only serve files with the following extensions: .nc .ncml .txt .md .rst .csv and will use
# the THREDDS service named "all" (see components/thredds/catalog.xml.template). However this can be customized if desired.
# See the example below which would change the configuration to also serve .h5 and .json files and exclude .md files.
# See the THREDDS documentation for the <datasetScan> element for all configuration options.
#export THREDDS_DATASET_DATASETSCAN_BODY="
# <metadata inherited='true'>
# <serviceName>all</serviceName>
# </metadata>
#
# <filter>
# ${THREDDS_DEFAULT_FILE_FILTERS}
# <include wildcard='*.h5' />
# <include wildcard='*.json' />
# <exclude wildcard='*.md' />
# </filter>
#"

# Files served by THREDDS are considered to either contain data or metadata (or both). The THREDDS Magpie service allows
# us to handle access permissions differently for metadata vs. data. Magpie lets users with "browse" permissions access
# metadata but only users with "read" permissions can access data.
# By accessing files through different THREDDS services (see THREDDS documentation), we can either read the metadata with
# "browse" permissions or the data itself with "read" permissions. For example, by default a NetCDF file can be accessed
# using the NCML service to get its metadata or through the NCSS service to access the data itself.
#
# If you have a file that you would like to be treated as metadata (Magpie will allow users with "browse" permissions to
# access it) no matter which THREDDS service is used to access it, add the file pattern to the `THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES`
# variable. Similarly, if you have a file that you would like to be treated as data no matter which THREDDS service is used
# to access it, add the file pattern to the `THREDDS_MAGPIE_EXTRA_DATA_PREFIXES` variable.
#
# For example, if you want all files with a .h5 extension to be treated as data files in all cases, add '".+\\.h5"' to the
# `THREDDS_MAGPIE_EXTRA_DATA_PREFIXES` variable. Note that values are Python regular expressions in which backslashes are
# doubled (escaped). Each expression should be surrounded by double quotes, and multiple expressions should be separated
# by commas.
#
# Current defaults are:
#export THREDDS_MAGPIE_EXTRA_METADATA_PREFIXES='".+\\.txt", ".+\\.md", ".+\\.rst"'
#export THREDDS_MAGPIE_EXTRA_DATA_PREFIXES=''
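
The quoting and escaping rules described above can be checked with a short sketch. It makes two assumptions that this PR does not confirm: `json.loads` stands in for the YAML parsing of the `prefixes: [...]` flow sequence, and `re.fullmatch` against a bare file name stands in for however Magpie actually applies each pattern. `parse_prefixes` and `matches_any` are hypothetical helper names.

```python
import json
import re

# Values exactly as they would be written between the single quotes in the env file.
DEFAULT_METADATA_PREFIXES = r'".+\\.txt", ".+\\.md", ".+\\.rst"'
DEFAULT_DATA_PREFIXES = ""


def parse_prefixes(value: str) -> list[str]:
    """Turn the comma-delimited, double-quoted value into a list of regex strings.

    Decoding the double-quoted strings collapses each doubled backslash into one,
    so '".+\\\\.h5"' in the env file becomes the regular expression .+\\.h5
    """
    return json.loads("[" + value + "]")


def matches_any(file_name: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches the whole file name (assumed semantics)."""
    return any(re.fullmatch(pattern, file_name) for pattern in patterns)


if __name__ == "__main__":
    metadata_patterns = parse_prefixes(DEFAULT_METADATA_PREFIXES)
    print(metadata_patterns)
    print(matches_any("README.md", metadata_patterns))
    print(matches_any("data.h5", metadata_patterns))

    # Treat .h5 files as data, as in the example above.
    data_patterns = parse_prefixes(r'".+\\.h5"')
    print(matches_any("data.h5", data_patterns))
```

An unquoted or singly escaped value fails to parse here, which mirrors the kind of error to expect from a malformed entry in the generated configuration.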

# Allow using Github as external AuthN/AuthZ provider with Magpie
# To set up Github as login, go to <https://github.com/settings/developers> under section [OAuth Apps]
# and create a new Magpie application with configurations: