Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve / standardize the way that we pass through default parameters and build-specific parameters #1131

Open
trvrb opened this issue Jul 26, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@trvrb
Copy link
Member

trvrb commented Jul 26, 2024

Context

Currently, the workflow is a bit of mess when it comes to how one can specify parameters. There seems to be 3 different approaches:

1 – parameters.yaml only

For the frequencies rule, the parameter pivot_interval is retrieved via config["frequencies"]["pivot_interval"]. This makes it grab exactly what's in parameters.yaml under frequencies:pivot_interval. If someone tries to specify a 4-week build-specific pivot_interval via

frequencies:
  global_all-time:
    pivot_interval: 4

The workflow will secretly just keep using the pivot_interval specified in paramaters.yaml.

Most parameters in the workflow work like this.

2 – builds.yaml override with necessity of default

For the traits rule, the parameter sampling_bias_correction is retrieved via _get_sampling_bias_correction_for_wildcards which is defined as

def _get_sampling_bias_correction_for_wildcards(wildcards):
    if wildcards.build_name in config["traits"] and 'sampling_bias_correction' in config["traits"][wildcards.build_name]:
        return config["traits"][wildcards.build_name]["sampling_bias_correction"]
    else:
        return config["traits"]["default"]["sampling_bias_correction"]

Ie it first looks for a build-specific list of traits:{build_name}:sampling_bias_correction and if doesn't find it, it expects traits:default:sampling_bias_correction in builds.yaml or parameters.yaml. This is described in the docs as

   traits:
     default:
       sampling_bias_correction: 2.5
       columns: ["country"]
     washington:
       # Override default sampling bias correction for
       # "washington" build and continue to use default
       # trait columns.
       sampling_bias_correction: 5.0

This works, but requires parameters.yaml to look like:

traits:
  default:
    sampling_bias_correction: 2.5
    columns: ["country"]

with an extra default key compared to other entries in parameters.yaml.

This strategy is only used for the traits rule.

3 – builds.yaml override without default

For the frequencies rule, the parameter min_date is retrieved via _get_min_date_for_frequencies which is defined as

def _get_min_date_for_frequencies(wildcards):
    if wildcards.build_name in config["frequencies"] and "min_date" in config["frequencies"][wildcards.build_name]:
        return config["frequencies"][wildcards.build_name]["min_date"]
    elif "frequencies" in config and "min_date" in config["frequencies"]:
        return config["frequencies"]["min_date"]
    else:
        # If not explicitly specified, default to 1 year back from the present
        min_date_cutoff = datetime.date.today() - datetime.timedelta(weeks=52)
        return numeric_date(
            min_date_cutoff
        )

Ie it starts with trying to grab build-specific frequencies:{build_name}:min_date from builds.yaml. If this doesn't exist, it looks for frequencies:min_date in builds.yaml or parameters.yaml and if this doesn't exist it directly returns a sensible default.

This strategy is only used for the frequencies rule.

Description

I believe that we should replace the strategy 2 above used only in traits rule with a setup like 3 above. This is what I did when I realized we weren't collecting narrow_bandwidth properly. In PR #1130 I followed strategy 3 to specify narrow_bandwidth as:

def _get_narrow_bandwidth_for_wildcards(wildcards):
    # check if builds.yaml contains frequencies:{build_name}:narrow_bandwidth
    if wildcards.build_name in config["frequencies"] and 'narrow_bandwidth' in config["frequencies"][wildcards.build_name]:
        return config["frequencies"][wildcards.build_name]["narrow_bandwidth"]
    # check if parameters.yaml contains frequencies:narrow_bandwidth
    elif "frequencies" in config and "narrow_bandwidth" in config["frequencies"]:
        return config["frequencies"]["narrow_bandwidth"]
    # else return augur frequencies default value
    else:
        return 0.0833

We have the issue that if we swap _get_sampling_bias_correction_for_wildcards to use strategy 3, we'll need to update parameters.yaml to read

traits:
  sampling_bias_correction: 2.5
  columns: ["country"]

This will break custom profiles that external users are running. We could provide backwards compatibility however by looking first for traits:sampling_bias_correction and then for traits:defaults:sampling_bias_correction.

Does this seem reasonable?

@trvrb trvrb added the enhancement New feature or request label Jul 26, 2024
trvrb added a commit that referenced this issue Jul 26, 2024
Per rationale in issue 1131 (#1131), switch to following a three step look up for setting build-specific parameters of colors:clade_recency.
@victorlin
Copy link
Member

Voting for strategy 2. Reasoning:

In theory, allowing arbitrary build names under keys such as traits/frequencies/etc. means certain build names are "reserved" and not allowed. In strategy 2, this means a single name default is not allowed. In strategy 3, this means many names min_date/max_date/etc. are not allowed.

Practically speaking, there may be little chance that a user would want to name a build using one of the reserved names. But it'd be easier to keep the exception list small by using default as a self-documenting namespace for these settings.

@joverlee521
Copy link
Contributor

I would also vote to keep strategy 2. I think having an explicit default key is a nice indicator that the config param can be specified by per build.

Using the frequencies configs as an example because it's a mix of per workflow and per build params.

Without the default key, it's not clear which params can be specified per build (as highlighted in strategy 1):

frequencies:
    min_date: 2020-01-01
    max_date: 2020-05-10
    pivot_interval: 4
    pivot_interval: 1
    pivot_interval_units: "weeks"
    narrow_bandwidth: 0.041
    proportion_wide: 0.0

With the default key, I think it's easier to explain that the min_date and max_date are the only params that can be specified per build.

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
    pivot_interval: 4
    pivot_interval: 1
    pivot_interval_units: "weeks"
    narrow_bandwidth: 0.041
    proportion_wide: 0.0

@victorlin
Copy link
Member

victorlin commented Jul 29, 2024

Using the frequencies configs as an example because it's a mix of per workflow and per build params.

Good example! I would go one step further and make sure that all parameters can be made build-specific:

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
        pivot_interval: 4
        pivot_interval: 1
        pivot_interval_units: "weeks"
        narrow_bandwidth: 0.041
        proportion_wide: 0.0
    # every other entry is a build

Or if there is good reason to prevent them from being set on the build level, clearly mark them as applying to all builds via a separate namespace:

frequencies:
    default:
        min_date: 2020-01-01
        max_date: 2020-05-10
    all-builds:
        pivot_interval: 4
        pivot_interval: 1
        pivot_interval_units: "weeks"
        narrow_bandwidth: 0.041
        proportion_wide: 0.0
    # every other entry is a build

@jameshadfield
Copy link
Member

2 – builds.yaml override with necessity of default

This is essentially the solution I came to for avian-flu. It felt a little cumbersome sometimes, but overall I think works well. I used the FALLBACK key rather than default, but the idea is the same, e.g.:

  clock_rates:
    FALLBACK:            # anything not specified by a subtype/time combination
        pb2: ''          # falls back to FALLBACK, and the empty string means no
        pb1: ''          # supplied clock rate, i.e. infer the clock
        ...
    h5nx:
      2y:
        pb2: [0.00287, *clock_std_dev]
        pb1: [0.00267, *clock_std_dev]
        ...

Solution 3 is the same, it just defines default values and build names in the same namespace/dictionary. I'd be happy with consistent use of either (2) or (3).

The workflow will secretly just keep using the pivot_interval specified in paramaters.yaml.

We should really prevent this from happening. Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.

trvrb added a commit that referenced this issue Sep 26, 2024
Per rationale in issue 1131 (#1131), switch to following a three step look up for setting build-specific parameters of colors:clade_recency.
@trvrb trvrb mentioned this issue Sep 26, 2024
1 task
@trvrb
Copy link
Member Author

trvrb commented Sep 26, 2024

Thanks for the feedback here @joverlee521, @victorlin and @jameshadfield. Based on your preferences I updated clade_recency in #1132 to follow strategy 2 and use config["colors"]["default"]["clade_recency"]. We should plan to update existing uses of strategy 3 to follow strategy 2 instead.

trvrb added a commit that referenced this issue Sep 27, 2024
This follows recommendation in issue #1131 to use the pattern

frequencies:
  default:
    min_date: "6M"
    narrow_bandwidth: 0.038

This use of "default" extends just to min_date, max_date and narrow_bandwidth. This behavior is now documented in parameters.yaml as well as workflow-config-file.rst.

The specification of frequencies parameters in builds.yaml now follows this pattern.
trvrb added a commit that referenced this issue Sep 27, 2024
This follows recommendation in issue #1131 to use the pattern

frequencies:
  default:
    min_date: "6M"
    narrow_bandwidth: 0.038

This use of "default" extends just to min_date, max_date and narrow_bandwidth. This behavior is now documented in parameters.yaml as well as workflow-config-file.rst.

The specification of frequencies parameters in builds.yaml now follows this pattern.
trvrb added a commit that referenced this issue Sep 27, 2024
This follows recommendation in issue #1131 to use the pattern

frequencies:
  default:
    min_date: "6M"
    narrow_bandwidth: 0.038

This use of "default" extends just to min_date, max_date and narrow_bandwidth. This behavior is now documented in parameters.yaml as well as workflow-config-file.rst.

The specification of frequencies parameters in builds.yaml now follows this pattern.
victorlin pushed a commit that referenced this issue Oct 3, 2024
Per rationale in issue 1131 (#1131), switch to following a three step look up for setting build-specific parameters of colors:clade_recency.
@tsibley
Copy link
Member

tsibley commented Oct 22, 2024

Schemas are a (the?) way to identify this as an invalid config file and error out early. I don't think we've written any schema files for Snakemake, but we make extensive use of them for augur/auspice JSONs.

We should absolutely be making greater use of schema-based validation for our workflow configs. The only example of us doing so that I'm aware of is ncov, which provides a very basic, non-comprehensive schema (better than nothing, but not great).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants