Add all collections and items for public cmip6 data #205

moradology · 2022-10-19T19:13:35Z

I still need to add descriptions and run the ingest against the dev stack, but comments and thoughts are appreciated, as I'm not entirely happy with how many collections there are for this one data release.

Closes #204

moradology · 2022-10-19T19:14:13Z

Perhaps of interest to @ividito @anayeaye

moradology · 2022-10-25T20:58:14Z

data/collections/nex-gddp-cmip6/nex-gddp-cmip6-crossover.json

+    "license": "MIT",
+    "description": "Predicted year at which the average daily temperature has risen by 2-degrees",
+    "stac_version": "1.0.0",
+    "dashboard:is_periodic": false,


Should this simply not be here? When do we want to add "dashboard:is_periodic": false? @anayeaye

anayeaye · 2022-10-25

I think we do want dashboard:is_periodic but it should be true + we need to add dashboard:time_density day. This will tell the summarizer to just return the first and last date and the dashboard UI will create a daily timestep picker. Not sure how nice 30 years of daily data will feel in the UI experience, though, if that's what you mean

moradology · 2022-10-27

I'm not sure about this - the rasters in question have values which are years rather than, themselves, being indexed by years. In a sense, these two rasters are temporal indices. The rasters themselves are definitely not periodic in the normal sense

anayeaye · 2022-10-27

I was thinking of COGs set one year apart. Now that I am looking at the crossing year data I see that there is only one COG for each scenario. Not a time series. Sorry about that.

Given that, I think the best thing to do is keep dashboard:is_periodic false and add dashboard:time_density null.

This will let the dashboard apply the default mosaic XYZ logic, but we are definitely pushing the edges of the default COG dashboard display logic here.
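For reference, the two configurations discussed in this thread can be sketched in Python as follows. This is only an illustration: the helper `dashboard_fields` and the set of allowed densities are assumptions, and only the `dashboard:is_periodic` and `dashboard:time_density` keys come from the discussion above.

```python
from typing import Optional

# Assumed set of accepted densities; only "day" and null appear in this thread.
ALLOWED_DENSITIES = {None, "day", "month", "year"}


def dashboard_fields(is_periodic: bool, time_density: Optional[str]) -> dict:
    """Build the dashboard fields for a collection (hypothetical helper)."""
    if time_density not in ALLOWED_DENSITIES:
        raise ValueError(f"unexpected time_density: {time_density!r}")
    return {
        "dashboard:is_periodic": is_periodic,
        "dashboard:time_density": time_density,
    }


# Crossover collection: one COG per scenario, so not a time series.
crossover = dashboard_fields(is_periodic=False, time_density=None)

# A daily collection: periodic, with a daily timestep picker in the UI.
daily = dashboard_fields(is_periodic=True, time_density="day")
```

When serialized with `json.dumps`, the `None` value becomes `null`, matching the suggestion above.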

moradology · 2022-10-27T18:32:31Z

This has proven a bit more than the current ingest strategies can handle. Lambda payload limits mean that the daily ingests, in particular, are vastly too large to be processed all at once. I've had some luck running ingests for a single year (by modifying the regex to capture that detail), but I don't think we want tens of thousands of lines of JSON to stare at. I'll be adding a script to this PR which generates JSON sufficient to carry out these ingests, but more thought is likely required if we expect this to happen with other datasets (which seems likely).
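A minimal sketch of what such a generator could look like is below. It is not the script added in this PR: the field names (`collection`, `bucket`, `prefix`, `filename_regex`), the collection id, and the `_YYYY_` filename convention are all assumptions made for illustration.

```python
import json
import re


def yearly_inputs(collection: str, bucket: str, prefix: str,
                  start_year: int, end_year: int) -> list:
    """Return one ingest input per year (all field names are assumptions)."""
    inputs = []
    for year in range(start_year, end_year + 1):
        inputs.append({
            "collection": collection,
            "bucket": bucket,
            "prefix": prefix,
            # Restrict discovery to files whose name contains this year.
            "filename_regex": rf"^.*_{year}_.*\.tif$",
        })
    return inputs


events = yearly_inputs(
    "example-daily-collection",  # hypothetical collection id
    "nex-gddp-cmip6-cog",
    "daily/GISS-E2-1-G/ssp585/r1i1p1f1/tasmax/",
    2015,
    2017,
)

# Only the 2016 input should match a (hypothetical) 2016 filename.
matches = [e for e in events if re.match(e["filename_regex"], "tasmax_2016_03.tif")]
print(json.dumps(matches, indent=2))
```

Generating the inputs this way keeps each payload small without committing thousands of lines of JSON to the repository.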

moradology · 2022-11-02T19:55:16Z

I think I am doing what is described in this documentation to work around payload sizes: https://docs.aws.amazon.com/step-functions/latest/dg/avoid-exec-failures.html

vlulla · 2022-11-07T17:31:09Z

Shouldn't the ssp property at https://github.com/NASA-IMPACT/veda-data-pipelines/blob/feature/gddp-ingest/data/step_function_inputs/nex-gddp-cmip6-daily-ssp245.json#L9 be ssp="245"?

moradology · 2022-11-07T17:57:24Z

@alukach Our pairing session appears to have landed on a successful strategy:

And records are showing up as expected: https://dev-stac.delta-backend.com/collections/nex-gddp-cmip6-monthly-ensemble/items


moradology · 2022-11-10T18:19:13Z

Now running with (acceptable levels of) parallelism:


moradology · 2022-11-14T19:07:59Z

Difficulties this PR overcomes:

1. Payload sizes were blowing up almost immediately on the previously deployed infrastructure, so this PR uses a paginated, iterative approach which stops building up a list of ingests and instead passes on a continuation token. Early strategies investigated and suggested for resolving this issue revolved around temporarily storing parameters on S3 to avoid payload limits, but this proved untenable because the payload is actually used in conditions for branching of the state machine.
2. There is a maximum history size on state machines in AWS, so instead of processing an ingest in a single monolithic run, the end of the state machine fires off a lambda to trigger the next iteration (which has a fresh history and thus avoids the error).
3. Concurrency limitations with AWS Lambda functions have been a consistent pain when processing multiple ingests of this size (and especially once the end of a state machine spawns another!). The fix here is not incredibly elegant, but it appears to be sufficient for our needs now:
   a. do not start more than 10 or so ingests at once; that appears to be where things get hairy
   b. max_concurrency has been dropped to 1 so that lambdas aren't being spawned in super high numbers during steps that tend to happen quickly anyway
   c. retries are used liberally on asset discovery lambdas, as the lambda limit errors which end up causing trouble for us (while minding a and b above) tend to be transient
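The pagination, chaining, and retry ideas above can be sketched in plain Python. This is a simplified stand-in rather than the PR's implementation: an in-memory list replaces the S3 listing, a `while` loop replaces the lambda that triggers the next state machine execution, and every name here (`discover_page`, `with_retries`, `run_iteration`) is hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

# Stand-in for an object listing; a real run would page through S3 instead.
ALL_KEYS = [f"daily/file_{i:03d}.tif" for i in range(25)]
ingested: List[str] = []


def discover_page(token: Optional[int], page_size: int = 10) -> Tuple[List[str], Optional[int]]:
    """Return one page of keys plus a continuation token (None when done)."""
    start = token or 0
    page = ALL_KEYS[start:start + page_size]
    more = start + page_size < len(ALL_KEYS)
    return page, (start + page_size if more else None)


def with_retries(fn: Callable, attempts: int = 3, delay: float = 0.0):
    """Call fn, retrying on errors that are assumed to be transient."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


def run_iteration(token: Optional[int]) -> Optional[int]:
    """One 'state machine run': ingest a single page, hand back the token."""
    page, next_token = with_retries(lambda: discover_page(token))
    for key in page:  # processed one at a time, mirroring a concurrency of 1
        ingested.append(key)
    return next_token


# Each loop pass stands in for a fresh execution with a fresh history.
token: Optional[int] = None
iterations = 0
while True:
    token = run_iteration(token)
    iterations += 1
    if token is None:
        break
```

Because only one page and a token travel between iterations, the payload stays bounded regardless of how many files the prefix contains.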

Features:
This PR also adds support for daily time interval datasets (which CMIP6 is, partially).
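As an illustration of daily datetime inference, the sketch below pulls a date out of a filename and returns the start and end of that day. The filename convention, the pattern, and the function name `daily_range` are assumptions; the logic in this PR may differ.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# Assumed convention: a YYYY_MM_DD or YYYY-MM-DD date somewhere in the filename.
DAILY_PATTERN = re.compile(r"(\d{4})[_-](\d{2})[_-](\d{2})")


def daily_range(filename: str) -> Tuple[datetime, datetime]:
    """Infer the start and end datetimes of a daily record from its filename."""
    match = DAILY_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no daily date found in {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    start = datetime(year, month, day)  # raises ValueError for impossible dates
    end = start + timedelta(days=1) - timedelta(seconds=1)
    return start, end


start, end = daily_range("tasmax_2015_01_31.tif")  # hypothetical filename
```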

moradology · 2022-11-14T19:18:18Z

It looks like the directory structure in the nex-gddp-cmip6-cog bucket supports GISS E2.1 model output for ssp585, but there appear to be no records where we'd expect (e.g. s3://nex-gddp-cmip6-cog/daily/GISS-E2-1-G/ssp585/r1i1p1f1/tasmax/). I've added ingests for these locations and an exception which informs users of empty bucket/prefix combinations (which present as failed state machine runs).

So the question from here is: should those ingests be removed, or should the data they depend on be provided?
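The empty-prefix check mentioned above might look roughly like this. The exception name `EmptyPrefixError` and the helper `require_keys` are hypothetical; the point is only that an empty listing fails early with a message naming the bucket and prefix.

```python
from typing import Iterable, List


class EmptyPrefixError(Exception):
    """Raised when discovery finds no objects under a prefix (hypothetical name)."""


def require_keys(bucket: str, prefix: str, keys: Iterable[str]) -> List[str]:
    """Return the discovered keys, or fail with a message naming the location."""
    found = list(keys)
    if not found:
        raise EmptyPrefixError(
            f"No files found at s3://{bucket}/{prefix}; "
            "check that the data has been uploaded."
        )
    return found


try:
    require_keys("nex-gddp-cmip6-cog", "daily/GISS-E2-1-G/ssp585/r1i1p1f1/tasmax/", [])
except EmptyPrefixError as err:
    message = str(err)
```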

moradology · 2023-01-10T16:53:02Z

Converted this to draft, as the ingests here are useful but the various bugfixes to make these ingests work will be unnecessary once the airflow branch has been accepted (#257)

gadomski · 2023-09-22T11:27:36Z

This repo is being sunsetted (#360), so we are closing all open PRs. Please re-open on https://github.com/NASA-IMPACT/veda-data if needed.

moradology assigned moradology and unassigned moradology Oct 19, 2022

moradology marked this pull request as ready for review October 25, 2022 18:02

moradology force-pushed the feature/gddp-ingest branch from 6a7eb15 to 15e97b9 Compare October 25, 2022 18:11

moradology temporarily deployed to dev October 27, 2022 16:37 Inactive

moradology force-pushed the feature/gddp-ingest branch from 933703e to 4ee139e Compare November 2, 2022 19:57

moradology temporarily deployed to dev November 2, 2022 19:57 Inactive

moradology force-pushed the feature/gddp-ingest branch from 4ee139e to 45066ac Compare November 3, 2022 22:02

moradology had a problem deploying to dev November 3, 2022 22:03 Failure

moradology had a problem deploying to dev November 3, 2022 22:10 Failure

moradology temporarily deployed to dev November 3, 2022 22:19 Inactive

moradology had a problem deploying to dev November 3, 2022 22:28 Failure

moradology temporarily deployed to dev November 3, 2022 22:44 Inactive

moradology temporarily deployed to dev November 7, 2022 16:37 Inactive

moradology temporarily deployed to dev November 7, 2022 16:48 Inactive

moradology temporarily deployed to dev November 7, 2022 16:54 Inactive

moradology temporarily deployed to dev November 7, 2022 17:08 Inactive

moradology temporarily deployed to dev November 7, 2022 17:26 Inactive

moradology temporarily deployed to dev November 7, 2022 17:30 Inactive

moradology temporarily deployed to dev November 7, 2022 17:46 Inactive

moradology force-pushed the feature/gddp-ingest branch from 62e0aa6 to 1fc23cb Compare November 7, 2022 17:57

moradology temporarily deployed to dev November 8, 2022 17:47 Inactive

moradology temporarily deployed to dev November 8, 2022 18:37 Inactive

moradology temporarily deployed to dev November 8, 2022 18:58 Inactive

moradology temporarily deployed to dev November 9, 2022 21:31 Inactive

moradology temporarily deployed to dev November 9, 2022 21:42 Inactive

moradology temporarily deployed to dev November 9, 2022 21:56 Inactive

moradology force-pushed the feature/gddp-ingest branch 3 times, most recently from 2d079ee to cc5a1c5 Compare November 10, 2022 16:14

Rewrite ingest and discovery process for large datasets.

b923736

Add support for daily record datetime inference. Add CMIP6 ingests.

moradology force-pushed the feature/gddp-ingest branch from cc5a1c5 to b923736 Compare November 10, 2022 16:15

Reduce concurrency in step functions to avoid lambda limits

1a80ed0

moradology temporarily deployed to dev November 10, 2022 17:49 Inactive

Organize cmip6 ingests, throw with useful error when no files found

8854ce9

moradology temporarily deployed to dev November 10, 2022 18:06 Inactive

Print insertion response after every new job

ab00adb

Retry all lambdas to avoid transient TooManyRequest errors

55cfa7f

moradology temporarily deployed to dev November 10, 2022 20:30 Inactive

moradology temporarily deployed to dev November 14, 2022 17:48 Inactive

moradology temporarily deployed to dev November 14, 2022 18:41 Inactive

Propagate through event parameters to file objects so date arguments are used

f02e625

moradology force-pushed the feature/gddp-ingest branch from e0f9374 to f02e625 Compare November 14, 2022 18:47

Lint

b0ec0ac

Add 'day' to event validation

ace5f5a

j08lue mentioned this pull request Dec 1, 2022

Configure CMIP6 datasets & discovery NASA-IMPACT/veda-config#153

Closed

moradology marked this pull request as draft January 10, 2023 16:51

gadomski mentioned this pull request Sep 20, 2023

Close all open PRs on this repo? #361

Closed

7 tasks

gadomski closed this Sep 22, 2023
