This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Add all collections and items for public cmip6 data #205

Closed
wants to merge 8 commits into from

Conversation

moradology
Contributor

@moradology moradology commented Oct 19, 2022

I still need to add descriptions and run the ingest against the dev stack, but comments and thoughts are appreciated, as I'm not entirely happy with how many collections there are for this one data release.

Closes #204

@moradology moradology assigned moradology and unassigned moradology Oct 19, 2022
@moradology
Contributor Author

Perhaps of interest to @ividito @anayeaye

@moradology moradology marked this pull request as ready for review October 25, 2022 18:02
"license": "MIT",
"description": "Predicted year at which the average daily temperature has risen by 2-degrees",
"stac_version": "1.0.0",
"dashboard:is_periodic": false,
Contributor Author

@moradology moradology Oct 25, 2022

Should this simply not be here? When do we want to add "dashboard:is_periodic": false? @anayeaye

Contributor

I think we do want dashboard:is_periodic, but it should be true, and we also need to add dashboard:time_density set to day. This will tell the summarizer to return only the first and last dates, and the dashboard UI will create a daily timestep picker. I'm not sure how nice 30 years of daily data will feel in the UI, though, if that's what you mean.

Contributor Author

I'm not sure about this. The values in these rasters are years; the rasters are not themselves indexed by year. In a sense, these two rasters are temporal indices, and they are definitely not periodic in the normal sense.

Contributor

I was thinking of COGs set one year apart. Now that I am looking at the crossing year data I see that there is only one COG for each scenario. Not a time series. Sorry about that.

Given that, I think the best thing to do is keep dashboard:is_periodic false and add dashboard:time_density null.

This will let the dashboard apply the default mosaic xyz logic, but we are definitely pushing the edges of the default COG dashboard display logic here.
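
As a rough illustration of where this thread landed, here is a minimal sketch of the collection fields written as a Python dict and serialized to JSON. The id value is a hypothetical placeholder; the other keys are the ones quoted in the diff and comments above, and the real collection will carry more fields than this.

```python
import json

# Hypothetical collection id; the remaining keys mirror the fields discussed in this thread.
collection = {
    "id": "example-crossing-year-collection",
    "license": "MIT",
    "stac_version": "1.0.0",
    "dashboard:is_periodic": False,  # one COG per scenario, not a time series
    "dashboard:time_density": None,  # serialized as JSON null
}

collection_json = json.dumps(collection, indent=2)
print(collection_json)
```

Python's None and False serialize to JSON null and false, which is the form the collection file itself would use.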

@moradology moradology temporarily deployed to dev October 27, 2022 16:37 Inactive
@moradology
Contributor Author

This has proven a bit more than the current ingest strategies can handle. Lambda payload limits mean that the daily ingests, in particular, are far too large to be processed all at once. I've had some luck running ingests for a single year (by modifying the regex to capture that detail), but I don't think we want tens of thousands of lines of JSON to stare at. I'll be adding a script to this PR that generates JSON sufficient to carry out these ingests, but more thought is likely required if we expect this to happen with other datasets (which seems likely).
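
To make the per-year idea concrete, here is a minimal sketch of a generator for such inputs. It is not the script that ends up in this PR: the key names (collection, bucket, prefix, filename_regex), the filename template, and the year range are all assumptions made for illustration.

```python
import json


def yearly_ingest_inputs(collection, bucket, prefix, years, filename_template):
    """Build one small ingest input per year instead of one oversized payload."""
    inputs = []
    for year in years:
        inputs.append(
            {
                "collection": collection,
                "bucket": bucket,
                "prefix": prefix,
                # Regex narrowed to a single year so discovery only matches that year's files.
                "filename_regex": filename_template.format(year=year),
            }
        )
    return inputs


# All argument values below are illustrative assumptions.
inputs = yearly_ingest_inputs(
    collection="nex-gddp-cmip6-daily-ssp245",
    bucket="nex-gddp-cmip6-cog",
    prefix="daily/",
    years=range(2015, 2018),
    filename_template=r"^.*_{year}_.*\.tif$",
)
print(json.dumps(inputs, indent=2))
```

Generating the inputs keeps the checked-in JSON small, at the cost of one extra step before each ingest.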

@moradology
Contributor Author

I think I am doing what is described in this documentation to work around payload sizes: https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html

@moradology moradology temporarily deployed to dev November 2, 2022 19:57 Inactive
@moradology moradology temporarily deployed to dev November 3, 2022 22:19 Inactive
@moradology moradology temporarily deployed to dev November 3, 2022 22:44 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 16:37 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 16:48 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 16:54 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 17:08 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 17:26 Inactive
@moradology moradology temporarily deployed to dev November 7, 2022 17:30 Inactive
@vlulla
Contributor

vlulla commented Nov 7, 2022

Shouldn't the ssp property at https://github.com/NASA-IMPACT/veda-data-pipelines/blob/feature/gddp-ingest/data/step_function_inputs/nex-gddp-cmip6-daily-ssp245.json#L9 be ssp="245"?

@moradology moradology temporarily deployed to dev November 7, 2022 17:46 Inactive
@moradology
Contributor Author

@alukach Our pairing session appears to have landed on a successful strategy:
image

And records are showing up as expected: https://dev-stac.delta-backend.com/collections/nex-gddp-cmip6-monthly-ensemble/items

@moradology moradology temporarily deployed to dev November 8, 2022 17:47 Inactive
@moradology moradology temporarily deployed to dev November 8, 2022 18:37 Inactive
@moradology moradology temporarily deployed to dev November 8, 2022 18:58 Inactive
@moradology moradology temporarily deployed to dev November 9, 2022 21:31 Inactive
@moradology moradology temporarily deployed to dev November 9, 2022 21:42 Inactive
@moradology moradology temporarily deployed to dev November 9, 2022 21:56 Inactive
@moradology moradology force-pushed the feature/gddp-ingest branch 3 times, most recently from 2d079ee to cc5a1c5 Compare November 10, 2022 16:14
Add support for daily record datetime inference.
Add CMIP6 ingests.
@moradology moradology temporarily deployed to dev November 10, 2022 17:49 Inactive
@moradology moradology temporarily deployed to dev November 10, 2022 18:06 Inactive
@moradology
Contributor Author

Now running with (acceptable levels of) parallelism:
image

@moradology moradology temporarily deployed to dev November 10, 2022 20:30 Inactive
@moradology moradology temporarily deployed to dev November 14, 2022 17:48 Inactive
@moradology moradology temporarily deployed to dev November 14, 2022 18:41 Inactive
@moradology
Contributor Author

Difficulties this PR overcomes:

  1. Payload sizes were blowing up almost immediately on the previously deployed infrastructure, so this PR uses a paginated, iterative approach that stops building up a list of ingests and passes on a continuation token instead. Earlier strategies that were investigated and suggested for resolving this issue revolved around temporarily storing parameters on S3 to avoid payload limits, but this proved untenable because the payload is actually used in conditions for branching of the state machine.
  2. There is a maximum history size on state machines in AWS, so instead of processing an ingest in a single monolithic run, the end of the state machine fires off a lambda to trigger the next iteration (which has a fresh history and thus avoids the error).
  3. Concurrency limitations with AWS Lambda functions have been a consistent pain when processing multiple ingests of this size (and especially once the end of a state machine spawns another!). The fix here is not incredibly elegant, but it appears to be sufficient for our needs now:
    a. do not start more than 10 or so ingests at once; that appears to be where things get hairy
    b. max_concurrency has been dropped to 1 so that lambdas aren't being spawned in very high numbers during steps that tend to happen quickly anyway
    c. retries are used liberally on asset discovery lambdas, as the lambda limit errors that end up causing trouble for us (while minding a and b above) tend to be transient
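
For readers skimming, items 1 and 2 can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: discover_page and run_all_iterations are hypothetical names, and the loop stands in for the chain of state machine executions. The only real API assumed is the list_objects_v2 call shape (Bucket, Prefix, MaxKeys, ContinuationToken, and the NextContinuationToken response field).

```python
def discover_page(s3_client, bucket, prefix, page_size=1000, continuation_token=None):
    """List one page of keys and return a small payload with a continuation token."""
    kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": page_size}
    if continuation_token:
        kwargs["ContinuationToken"] = continuation_token
    response = s3_client.list_objects_v2(**kwargs)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return {
        "keys": keys,
        # None signals that there is nothing left to process.
        "continuation_token": response.get("NextContinuationToken"),
    }


def run_all_iterations(s3_client, bucket, prefix, page_size, handle_keys):
    """Stand-in for the chain of runs: each loop pass represents one fresh execution."""
    token = None
    iterations = 0
    while True:
        page = discover_page(s3_client, bucket, prefix, page_size, token)
        handle_keys(page["keys"])
        iterations += 1
        token = page["continuation_token"]
        if token is None:
            return iterations
```

With boto3, s3_client would be boto3.client("s3"). Because each iteration carries only one page of keys plus a token, the payload stays bounded no matter how many objects sit under the prefix.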

Features:
This PR also adds support for daily time interval datasets (which CMIP6 is, partially).
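
As an illustration of what daily datetime inference can look like, here is a minimal sketch. The filename convention (a YYYY_MM_DD, YYYY-MM-DD, or YYYYMMDD date somewhere in the name) and the function name infer_daily_datetime are assumptions, not the convention or code used in this PR.

```python
import re
from datetime import datetime, timezone

# Assumed convention: the first 8-digit date (optionally separated by - or _) is the record date.
DAILY_PATTERN = re.compile(r"(?P<year>\d{4})[-_]?(?P<month>\d{2})[-_]?(?P<day>\d{2})")


def infer_daily_datetime(filename):
    """Return the UTC datetime encoded in a daily record's filename."""
    match = DAILY_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No daily date found in filename: {filename}")
    return datetime(
        int(match.group("year")),
        int(match.group("month")),
        int(match.group("day")),
        tzinfo=timezone.utc,
    )


# Hypothetical filename, for illustration only.
print(infer_daily_datetime("tasmax_ssp245_2050_01_15.tif").isoformat())
```

A pattern this loose can match unrelated digit runs, so a real implementation would likely anchor it to the dataset's actual naming scheme.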

@moradology
Contributor Author

It looks like the directory structure in the nex-gddp-cmip6-cog bucket supports GISS E2.1 model output for ssp585, but there appear to be no records where we'd expect them (e.g. s3://nex-gddp-cmip6-cog/daily/GISS-E2-1-G/ssp585/r1i1p1f1/tasmax/). I've added ingests for these locations, along with an exception that informs users of empty bucket/prefix combinations (which present as failed state machine runs).

So the question from here is: should those ingests be removed or should the data they depend on be provided?
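
For context, the empty bucket/prefix check mentioned above amounts to something like the sketch below. EmptyPrefixError and require_keys are hypothetical names chosen for illustration; the exception class and message in this PR may differ.

```python
class EmptyPrefixError(Exception):
    """Raised when discovery finds no objects under a bucket/prefix combination."""


def require_keys(keys, bucket, prefix):
    """Fail loudly, with the offending location, when discovery returns nothing."""
    if not keys:
        raise EmptyPrefixError(
            f"No objects found under s3://{bucket}/{prefix}; "
            "check that the data has been uploaded or remove this ingest."
        )
    return keys


try:
    require_keys([], "nex-gddp-cmip6-cog", "daily/GISS-E2-1-G/ssp585/r1i1p1f1/tasmax/")
except EmptyPrefixError as err:
    message = str(err)
    print(message)
```

Raising early turns an otherwise opaque failed run into an error that names the empty location.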

@moradology
Contributor Author

Converted this to draft, as the ingests here are useful but the various bugfixes to make these ingests work will be unnecessary once the airflow branch has been accepted (#257)

@gadomski gadomski mentioned this pull request Sep 20, 2023
7 tasks
@gadomski
Contributor

This repo is being sunsetted (#360), so we are closing all open PRs. Please re-open on https://github.com/NASA-IMPACT/veda-data if needed.

@gadomski gadomski closed this Sep 22, 2023
Development

Successfully merging this pull request may close these issues.

Add a test set of CMIP6 datasets
4 participants