
DAS-2232 - Functionality added to support SMAP L3 products #15

Closed
wants to merge 41 commits

Conversation


@sudha-murthy sudha-murthy commented Sep 10, 2024

Description

SMAP L3 products do not have the 1D dimension scales and grid mapping variable that are needed by HOSS to do spatial subsetting.
Methods have been added to supply the missing grid mapping via overrides in hoss_config.json, along with supporting methods.
Methods have also been added to use the 2D coordinate datasets to generate 1D dimension scales, which are used to calculate
the index ranges that produce the spatially subsetted outputs.
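
A rough sketch of the dimension-scale idea (illustrative only, not the HOSS implementation, and assuming the first row and column of the coordinate arrays hold valid values): project one row and one column of the 2D latitude/longitude datasets with the overridden grid mapping to recover 1D projected_x and projected_y scales.

import numpy as np
from pyproj import CRS, Transformer

def make_dimension_scales(latitudes, longitudes, crs_wkt):
    # crs_wkt stands in for the grid mapping override supplied via hoss_config.json.
    transformer = Transformer.from_crs(CRS('EPSG:4326'), CRS(crs_wkt), always_xy=True)
    # For a regular grid, one row gives the projected x values for every column,
    # and one column gives the projected y values for every row.
    x_values, _ = transformer.transform(longitudes[0, :], latitudes[0, :])
    _, y_values = transformer.transform(longitudes[:, 0], latitudes[:, 0])
    return np.asarray(x_values), np.asarray(y_values)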

Jira Issue ID

DAS-2232

Local Test Steps

  • Spatial subsetting can be tested without mask fill (until DAS-2215 is complete).
  • Run Harmony in a Box with HOSS.
  • Comment out mask_fill in the services.yaml file.
  • Run a spatial subset request locally and ensure you get the correct subsetted output, e.g.:
  1. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(70%3A80)&subset=lon(-160%3A-150)&variable=Soil_Moisture_Retrieval_Data_AM%2Fstatic_water_body_fraction&skipPreview=true
  2. http://localhost:3000/C1268452378-EEDTEST/ogc-api-coverages/1.0.0/collections/parameter_vars/coverage/rangeset?forceAsync=true&granuleId=G1268452388-EEDTEST&subset=lat(54%3A72)&subset=lon(2%3A42)&format=application%2Fx-netcdf4&variable=Soil_Moisture_Retrieval_Data_AM%2Falbedo%2CSoil_Moisture_Retrieval_Data_AM%2Fsurface_flag&skipPreview=true
  • 3D variables do not work - pending DAS-2238.
  • New unit tests have not been added; the current unit tests are passing.
  • DAS-2236 has been written to handle fill values in the corners.
  • Jupyter test notebooks exist for SMAP L3 and need to be updated.

PR Acceptance Checklist

  • Jira ticket acceptance criteria met.
  • [x] CHANGELOG.md updated to include high level summary of PR changes.
  • docker/service_version.txt updated if publishing a release.
  • Tests added/updated and passing.
  • Documentation updated (if needed).

@sudha-murthy sudha-murthy marked this pull request as draft September 12, 2024 17:40
@sudha-murthy sudha-murthy marked this pull request as ready for review September 13, 2024 05:09
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
@flamingbear flamingbear (Member) left a comment

Sudha, there's a lot here to review and unfortunately I haven't worked with HOSS directly before. I tried to look through what I could and make decent suggestions.

But for sure, you need to test your changes. When you run the tests a coverage report is generated. Before your changes the results were:

Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      156      2    99%
hoss/spatial.py                   61      0   100%
--------------------------------------------------
TOTAL                            668      8    99%

And after your changes they dropped considerably:


Test Coverage Estimates
Name                           Stmts   Miss  Cover
--------------------------------------------------
hoss/dimension_utilities.py      242     65    73%
hoss/spatial.py                   70      8    89%
--------------------------------------------------
TOTAL                            771     81    89%

It is very difficult to understand the changes when I can't look at a test and see what a function was supposed to do. Likewise, the function comments were not updated to describe the new functionality. Hopefully I'm not confusing anything with my lack of familiarity. I will defer to Owen if there are differences of opinion on what should be done.

The final test instructions assume a strong understanding of the problem you were solving, but I was eventually able to run the two test URLs and get output. Should the output files be geolocated somehow? They open in Panoply, but aren't geo-2D, just 2D.

CHANGELOG.md Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
hoss/projection_utilities.py Outdated Show resolved Hide resolved
hoss/spatial.py Outdated Show resolved Hide resolved
hoss/spatial.py Outdated Show resolved Hide resolved
hoss/spatial.py Outdated Show resolved Hide resolved
hoss/spatial.py Outdated Show resolved Hide resolved
@owenlittlejohns owenlittlejohns (Member) left a comment

Thanks @sudha-murthy for putting a lot of effort into this PR!

This review is not 100% thorough. I think there are a few bigger things here that need to be addressed, and that will make it easier to digest in full:

  • I think the new functionality to deal with overriding dimensions should not just go in the get_projected_x_y_ranges. By doing so, it's adding complexity to that function, which makes things harder to understand overall. I also think it is forcing you to do some things that would otherwise be unnecessary - for example, I don't think you need to write the variables to the NetCDF-4 file, you just need to ask a function for 1-D dimension variables and spit out 1-D numpy arrays.
  • I'm really worried about the use of set objects to contain overriding dimensions. The ordering of dimensions is important, and must match the underlying array axes. I suspect this is the reason you are having issues with 3-D variables (it's certainly part of the issue).
  • A lot of the additional code is written in ways that add high levels of cyclomatic complexity. I've suggested a few ways to cut that down, but it's worth taking a step back and working out what is common to code branches and what is truly different. Try to minimise what has to go in an if/elif/else, and then make sure each branch gives you the same output. The add_index_ranges function is perhaps the area of most concern.
  • There are a few places where the code branches and only assigns variables in one branch which are then used after the branching is finished. This will cause errors when something skips that block and later tries to access a non-existent variable.
  • This PR has no new or updated unit tests, either for new functions or for new branches of code. I know I'm religious about these things, but they are absolutely needed, particularly on a PR of this scope. There are a bunch of changes, and they are complicated and hard to understand. Not only do tests make sure things work, reading them makes it easier to understand how the code flows.
  • There are a lot of places where code has changed significantly, but the documentation strings describing the functions have not. They should definitely be updated to keep in sync with what's going on.

Sorry for all the comments, but hopefully they are helpful!

CHANGELOG.md Outdated Show resolved Hide resolved
docker/service_version.txt Show resolved Hide resolved
@@ -55,6 +59,42 @@ def is_index_subset(message: Message) -> bool:
)


def get_override_projected_dimensions(
Member

This PR does not include any changes to unit tests. Each new function you are adding (or new branch of code within an existing conditional block) needs unit testing that hits every branch of the code within a function. I've not reviewed any of the code changes in detail yet, but this lack of tests is an absolute blocker for me to approve the PR.

Collaborator Author

Yes, Owen. Will add the unit tests.

hoss/subset.py Outdated Show resolved Hide resolved
hoss/subset.py Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
if variable.dimensions == []:
    override_dimensions = get_override_dimensions(varinfo, [variable_name])
    if len(override_dimensions) > 0:
        for override in reversed(list(override_dimensions)):
@owenlittlejohns owenlittlejohns (Member) Sep 20, 2024

Why are you reversing things here? The ordering should be preserved. If the order of override dimensions does not reflect the ordering of array axes, then that needs to be fixed.

Collaborator Author

I will revisit that. I had a problem with fixing the override dimensions list.

Collaborator Author

Changes are in commit 756f7c0

@@ -422,22 +559,48 @@ def add_index_range(

"""
Member

For this function: READ THIS COMMENT FIRST:

I think this is really overcomplicated. Fundamentally, all this function is doing is getting a list of dimensions for a variable. Then it is looking in a cache to see if there is an index range for any of the dimension names, before using that cache value to tack something on the end of a string.

Maybe I'm being naive, but it feels like really all you need is something to determine the correct list of variable dimensions, then all the rest of the logic (looking in the cache and appending strings to the end of the variable name) is the same. That latter stuff 100% does not need to be duplicated in multiple condition branches. It's making this function unnecessarily hard to read.

The other, really important comment: I am super suspicious of the bit where you are needing to reverse the order of the dimensions list. However that is derived, it should be reliably in the ordering of the variable axes. I'm wary that what this means is that you have found that, for your particular use-case, the "random" ordering in a set is predictable for the pseudo-dimensions you have for SMAP L3, and you can coincidentally impose the order you need by reversing the set. I really think dimensions should not be passed around in sets, because you need that ordering. I strongly suspect this is the root cause of your issues with 3-D variables.

Collaborator Author

The ordering and the shape are things I need to get from varinfo. I don't have that information. Updated DAS-2241 for this.

Member

Maybe rephrasing a different comment: I'm very uncomfortable with the thought of merging code with known problems into main. Going back to one of the things mentioned in the TRT breakout last week:

Teams ought to have a high-quality, maintainable baseline without known defects (particularly in new features) that is deployable to the production environment at all times.

Instead of making a massive PR that does 80% of a lot of stuff, this probably should be submitted piecemeal, with each piece being made 100% solid in a series of smaller PRs.

If you and David feel we're at a point that we can't break the changes down in that way, then a compromise might be to update this PR to merge into a feature branch. Then once all of the multiple PRs are merged into the feature branch, the feature branch can be merged into main.

Contributor

I think terminology has confused some of this discussion. The question of reversing is not one of reversing dimensions, but of reversing the lat/lon variables coming from the Coordinates attribute. The recommendation here is that the get_coordinates method itself should return them in a standard order (reversed in this case), based upon the variable name, and not use reverse as shown here.

I don't think this is a case of "Known Issue".

We are planning to release a version for NSIDC to start testing with, which may not handle the 3D cases or other issues, but this release should not break any existing use cases; the unhandled cases should not truly break, but simply not be handled as desired. Incremental feature release is a tenet of agile development.

Collaborator Author

Changes are in commit 756f7c0

I have simplified the method. The order for SMAP is 'projected_y', 'projected_x'. The override section of the code is only used by SMAP at the moment. It can be generalized if we can get that order of dimensions from varinfo. I am not sure if the order of dimensions is used for other products currently handled by HOSS.

Collaborator Author

The use of sets for required_dimensions is based on how it is returned by varinfo and how it was originally used in HOSS before my changes. The bounds update requires the dimensions to be a set; it fails for a list.

Member

There's a difference between a set of all dimensions for all required variables (as used in the prefetch function), which aggregates all Variable.dimensions attributes (which individually are lists), and the dimensions on an individual variable. Variable.dimensions is not a set for ordering purposes, it is a list.

With regards to bounds - we know that the pseudo-dimensions won't have a bounds attribute, so you might be better to not try to find bounds variables for them. Then you'll avoid any compatibility issues there.
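
A toy illustration of that distinction (not HOSS code): a set works for aggregating dimension names across variables, but iterating a set to build per-axis index ranges loses the ordering.

# Per-variable dimensions: order must match the array axes, so keep the list.
science_dims = ['/projected_y', '/projected_x']

# Aggregation across variables (prefetch-style) only needs membership:
all_required_dims = set(science_dims) | {'/time'}

# Unsafe: set iteration order is arbitrary, so ranges may attach to the wrong axis.
ranges_in_arbitrary_order = [dim for dim in set(science_dims)]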

Collaborator Author

The add_index_range is a lot simpler now

hoss/dimension_utilities.py Show resolved Hide resolved
hoss/dimension_utilities.py Outdated Show resolved Hide resolved
@owenlittlejohns (Member)

A couple of quick thoughts:

autydp (yesterday): The review is a bit disjointed now with all the comments.

I agree that this PR is busy with a lot of comments. I'd recommend that @sudha-murthy picks a few of the comments in the PR to address at a time, pushes a commit up with them, and indicates in the comments that she has addressed them (with a link to the commit). Then the person making the comment can decide whether to mark it as resolved. Resolving the comments will make the PR look a lot less cluttered, and allow us to all focus on the remaining pieces.

flamingbear (4 hours ago): But since it looks like you've thumbed up a bunch of things, you can add a comment with the githash as a way to alert the commenter that you have addressed the issue, when you get there.

Yup - agreed.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Comment on lines 78 to 82
) -> bool:
    """
    returns the list of required variables without any
    dimensions
    """
Member

Two quick things here:

  1. The return type in the type hint is wrong: it should be set[str] not bool.
  2. It would probably be clearer in the documentation string to say "set of required variables" not "list of required variables". (I'm guessing you meant list in a generic sense, rather than the Python type, but it's a little ambiguous)



def get_variables_with_anonymous_dims(
    varinfo: VarInfoFromDmr, required_variables: set[str]
Member

Nitpick: it probably isn't necessary in this function to refer to the variables as required_variables. That is a piece of information outside the scope of this function; maybe a better choice of name would be variable_names.

hoss/spatial.py Outdated Show resolved Hide resolved
@owenlittlejohns owenlittlejohns (Member) left a comment

Thanks for adding some unit tests - I'll keep an eye out for more in future commits.

I resolved older comments I had left, which now look to be taken care of (thanks for that). There are still outstanding older items, and I've added some more comments on things that I looked at in more detail this time around. (To be honest, given the huge scale of this PR, it's hard to review it all in a single go, and so there are still bits I'm spotting only now, despite trying to go through the PR a couple of times)

Comment on lines 193 to 200
if lat_arr.ndim > 1:
    col_size = lat_arr.shape[0]
    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
    col_size = lat_arr.size
    row_size = lon_arr.size
Member

This is a bit wonky:

The middle check that the array sizes are equal explicitly checks the size of the 0th and 1st axes of the arrays. But after that you have a condition for whether the arrays only have one dimension. This means one of two things:

  1. Either the last bit, where lat_arr.ndim and lon_arr.ndim are both 1, will never be reached (because the arrays are always 2-D or higher), or
  2. The coordinate arrays could be 1-D, and then the check for lon_arr.shape[1] != lat_arr.shape[1] will raise an exception, because the shape tuple doesn't have enough elements.

Collaborator Author

It is corrected.

    row_size = lat_arr.shape[1]
if (lon_arr.shape[0] != lat_arr.shape[0]) or (lon_arr.shape[1] != lat_arr.shape[1]):
    raise IrregularCoordinateDatasets(lon_arr.shape, lat_arr.shape)
if lat_arr.ndim and lon_arr.ndim == 1:
Member

This condition isn't doing what I believe you think it is.

My guess is you are trying to check if both lat_arr and lon_arr have 1 dimension. What's really happening is that you are checking whether lat_arr.ndim has a "truthy" value (i.e., isn't 0, since it's an integer), and then whether lon_arr.ndim == 1. If it helps, the checks are really:

if (lat_arr.ndim) and (lon_arr.ndim == 1):

I think what you are trying to do is:

if lat_arr.ndim == 1 and lon_arr.ndim == 1:

Collaborator Author

Sorry about that. I think it was a typo. I corrected it


"""
override_variable = varinfo.get_variable(variable_name)
projected_dimension_name = ''
Member

It would fit much better with the rest of the style of the code if you used else for these sorts of default values, instead of declaring them and then overriding them if a condition is met.

Yes, it's a personal preference, but it's consistent with the rest of the code (which is the more important thing here).

Collaborator Author

I think I had to declare it above as a default value to keep the variable in scope for the return call.

Member

I don't think you've understood my comment - you have two choices here: declare it first, or declare it in an else. I am asking you to be consistent with the rest of the style of the repository and put the default value in an else statement.

To avoid two else statements, you could restructure the logic below as:

if override_variable is not None and override_variable.is_latitude():
    projected_dimension_name = 'projected_y'
elif override_variable is not None and override_variable.is_longitude():
    projected_dimension_name = 'projected_x'
else:
    projected_dimension_name = ''

I still have two other concerns here:

  1. Returning an empty string seems weird.
  2. The projected dimension name should be distinct in the case of two grids. Using just projected_x and projected_y will cause clashes for those SMAP L3 products with both global and polar grids. (This must be resolved before merging this PR)

Collaborator Author

  • Corrected the names to add the group name (in next commit).
  • Simplified the structure to not individually check for lat/lon (in next commit).
  • About returning an empty string: technically it should never be called from our code unless we have the coordinates. I could throw an exception - it just seemed like overkill.

Comment on lines +191 to +192
row_size = 0
col_size = 0
Member

This is one example but there are quite a few places in this code that I think default values are being used instead of raising a genuine exception. If none of the conditions below are met, then it would be much better to catch the issue now and raise a user-friendly exception before later code tries to use these values and finds it can't do what it needs to (and raises some unintelligible exception to the end user).

I think the question to ask with a bunch of these functions is: if they fall back on the default values, can the rest of the code work using those return values. I think in this case (and a few others) the honest answer is no.
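
A minimal sketch of that fail-fast pattern (illustrative only; a real implementation would presumably raise one of HOSS's own custom exceptions rather than ValueError):

def get_row_col_sizes(lat_arr, lon_arr):
    # Mirror the branches above, but raise instead of returning unusable zeros.
    if lat_arr.ndim == 2 and lat_arr.shape == lon_arr.shape:
        col_size, row_size = lat_arr.shape
    elif lat_arr.ndim == 1 and lon_arr.ndim == 1:
        col_size, row_size = lat_arr.size, lon_arr.size
    else:
        raise ValueError(
            f'Incompatible coordinate shapes: {lat_arr.shape}, {lon_arr.shape}'
        )
    return row_size, col_size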

Comment on lines 215 to 216
lat_col = lat_arr[:, 0]
lon_row = lon_arr[0, :]
Member

This is making an assumption that latitude and longitude always represent the same axes in an array for all collections. We know that isn't true: for example, the GPM_3IMERGHH collection from GES DISC has things the wrong way around (time, lon, lat).

It would be better to take both arrays and check one row and one column of each to find which array is varying along each dimension.

Collaborator Author

I need to check both to get the same index values

Member

Your answer does not explain what will happen for collections when the ordering is reversed. I'm about to write a big old essay on the way to do this that will avoid any assumptions about latitude and longitude ordering (which I genuinely think will lead to problems).

Collaborator Author

The new method gets row and column on both lat/lon datasets. Will add a test for lat/lon with both types of ordering

Collaborator Author

updated - de350b6

Comment on lines +189 to +193
def test_get_x_y_index_ranges_from_coordinates(
    self,
    mock_get_x_y_extents,
    mock_get_dimension_index_range,
):
Member

This is a good test!

You definitely need either one more that requests SMAP L3 data for multiple grids, or this test could be updated to do so using a collection that has multiple grids.

Comment on lines 135 to 137
# lat_arr = prefetch_dataset[self.latitude][:]
# lon_arr = prefetch_dataset[self.longitude][:]

Member

These commented out lines should be removed.

Collaborator Author

removed - de350b6

lat_fill,
lon_fill,
)
for actual, expected in zip(actual_geo_corners, expected_geo_corners):
Member

A couple of things here:

  1. Why only self.assertAlmostEqual? You are defining the expected values, so you could define them to be exactly what you need.
  2. (Possibly a personal preference) zip makes things a little more complicated. Instead you could do something like:
for index, expected_corner in enumerate(expected_geo_corners):
    self.assertTupleEqual(actual_geo_corners[index], expected_corner)

If you stick with zip, could you maybe tweak the variables in the loop so they are actual_corner and expected_corner. Just to make it easier when reading this test back. Thanks!


"""

expected_result1 = np.array([0, 1, 2, 3, 4])
Member

Something of a theme here - there are a bunch of variable names in these tests that are really vague (expected_result1, expected, expected_result, similar things with actual). It would be clearer to the reader if the names were more specific. Something like expected_valid_indices.

Unit tests are a secondary form of documentation for developers, and can be really informative. It's really helpful (to me at least 😅) if they are as easy to read as possible.

Collaborator Author

done - de350b6

Member

Thanks for starting to add in the tests! A recommendation for next time would be to couple the writing of a function with the accompanying tests, instead of leaving all the tests to the end. (I tend to write tests for functions just after writing the code for the function itself, because that way it's still fresh in my head, but also when I move on from that function, I feel confident that it works when I call it elsewhere)

if variable.dimensions:
    range_strings = get_range_strings(variable.dimensions, index_ranges)
else:
    override_dimensions = get_override_projected_dimensions(varinfo, variable_name)
Member

@sudha-murthy - this is one of the things we discussed this morning, and I wanted to capture a comment on the PR so we don't lose it.

get_override_projected_dimensions only returns a list of fake spatial dimensions for the variable. However, for variables that have more dimensions (e.g., 3-D variables), this list is returning a 2-element list for a numpy array with 3 axes. This is causing the failure you reported with requests to OPeNDAP. When making requests to OPeNDAP each variable must follow one of two formats in the DAP4 constraint expression:

  • Not include any index information for the variable. E.g.: /variable_name.
  • Including index information for all dimensions. If you want the full range, you can just indicate that with []. So it's something like: /variable_name[][1:2][3:4]

Because the returned list of override dimensions only includes spatial information, this function is only constructing part of the list of index information (assuming the example from the second bullet point above is what you want, this function is actually just getting /variable_name[1:2][3:4]).

The answer here is to use a function that retrieves something for all axes on the variable, and in the correct order. (The naming of the non spatial pseudo-dimensions doesn't matter, because all we're doing is seeing if there is information in the cache of index ranges. If not, the [] will be applied for that dimension)

Member

This is the logic that is done for a variable with real dimensions:

for dimension in variable_dimensions:
    dimension_range = index_ranges.get(dimension)
    if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
        range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
    else:
        range_strings.append('[]')

You can see that if the dimension name is not in the cache of index_ranges, then [] is used.
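
For example (illustrative names and values only), a 3-D variable whose non-spatial dimension has no entry in the cache ends up with [] for that axis and the cached ranges for the two spatial pseudo-dimensions:

index_ranges = {'/projected_y': (100, 200), '/projected_x': (50, 80)}
dimensions = ['/extra_dim', '/projected_y', '/projected_x']  # hypothetical ordering

range_strings = []
for dimension in dimensions:
    dimension_range = index_ranges.get(dimension)
    if dimension_range is not None and dimension_range[0] <= dimension_range[1]:
        range_strings.append(f'[{dimension_range[0]}:{dimension_range[1]}]')
    else:
        range_strings.append('[]')

constraint = '/variable_name' + ''.join(range_strings)
# constraint == '/variable_name[][100:200][50:80]'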

Collaborator Author

The override code uses that same method, but there is no 'dimension name' for the third dimension.

Member

Yes - it sounds like we're pretty much agreed on the problem here:

get_override_projected_dimensions only returns a list of fake spatial dimensions for the variable. However, for variables that have more dimensions (e.g., 3-D variables), this list is returning a 2-element list for a numpy array with 3 axes.

So the solution is probably to retrieve the list of variable dimensions in a different way (specifically how is still being hotly debated in a number of places).

hoss/spatial.py Outdated
crs = get_variable_crs(non_spatial_variable, varinfo)
projected_x = 'projected_x'
projected_y = 'projected_y'
override_dimension_datasets = update_dimension_variables(
Member

Please can you be consistent with the rest of the code and use "variable" to refer to variables, and reserve "dataset" to refer to the netCDF4.Dataset object. Using "datasets" to mean variables will cause confusion in the future. (I acknowledge that "dataset" is also often used to mean "variable" by other software, but it is not so far used that way in the HOSS code, due to the terminology of the netCDF4 Python package)

Collaborator Author

done - 0666248

Member

Thanks for the update. I took a quick skim and there's one place remaining that uses dataset instead of variable. It's the get_valid_indices_in_dataset function. Otherwise, I think the language is now consistent.

Collaborator Author

update_dimension_variables is now called create_dimension_scales_from_coordinates, based on David's feedback (this will be in the new commits in the new branch).

get_valid_indices_in_dataset does get the valid indices from the numpy array passed in. Not sure if we can call it get_valid_indices_in_variable...

@owenlittlejohns (Member)

Okay, this is me trying to capture my thoughts after our meeting yesterday (and mulling it over a bit more since):

I think, in general, I am fine with the way in which the projected x and projected y index ranges are calculated. While I don't think the calculation of the corners is strictly necessary, as you just need any two elements (preferably more, but two is enough) in a single row/column to infer the resolution and extents of a grid dimension, once there are a full battery of unit tests up, it is a methodology that should successfully get the index ranges. For the sake of moving this monster PR along, I am fine to proceed with the corners methodology.
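
As a sketch of that two-point idea (illustrative only, not the code in this PR): given any two valid values in a single row or column of a regular grid, the resolution and origin follow directly, and the full 1-D dimension scale can be reconstructed.

import numpy as np

def infer_dimension_scale(value_a, value_b, index_a, index_b, dimension_size):
    # Resolution per index step, then extrapolate back to index 0.
    resolution = (value_b - value_a) / (index_b - index_a)
    origin = value_a - (index_a * resolution)
    return origin + resolution * np.arange(dimension_size)

# e.g. values 30.0 and 150.0 at indices 10 and 50 of a 100-element dimension:
scale = infer_dimension_scale(30.0, 150.0, 10, 50, 100)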

I think there are three bigger pieces here that I'm holding out for (along with outstanding smaller items about general coding best practices):

  1. Completing the unit test coverage of the new code. The percentage can sometimes be misleading, if you call a function that itself calls lower level functions. Those lower level functions will also need their own unit tests.

  2. Handling of multiple grids. Hard-coding the projected dimension names in the index_ranges cache to projected_x and projected_y will cause issues for multiple grids (which some SMAP L3 products have). The dimension index ranges for both will be written to the same keys (so the last one to be calculated will be the only one stored in the cache). The pseudo-dimension names need some uniqueness, likely based on the combination of coordinates from which they are derived.

  3. Relating the projected x and y pseudo-dimensions to the correct axes of the numpy arrays.

Why do we need this now? The current assumption that there is a specific ordering of these variables (['projected_y', 'projected_x']) is a problem, as we know that we have some collections that are (..., lat, lon), some that are (lat, lon, ...), and some that are (..., lon, lat). Simplifying this to the 2-D case (which is the minimum @autydp wants to solve in this PR), we need to account for two options:

  • A science variable with array shape (a, b), where the projected x dimension (also shape (a, b)) varies across a row and the projected y dimension (also shape (a, b)) varies along a column.
  • A science variable with array shape (a, b) where the projected x dimension (also shape (a, b)) varies along a column and the projected y dimension (also shape (a, b)) varies across a row.

(Note - I believe the solution below would account for both N-D science variables and also degenerate dimension sizes (a, a))

Likely this only matters in the retrieval of the index ranges in the add_index_range function, as this is where the string for a variable is constructed to send a request to OPeNDAP and it must order the index ranges correctly.

How can we do this:

  • First - it's important to remember that 2-D Python array indices are of order (row, column).
  • Next up, the one assumption I am going to make here is that the ordering of spatial dimensions in the coordinate variables (and therefore the projected x and y variables) matches the ordering of the same dimensions in the science variables that refer to them (there may be other dimensions padding them on either side, but the order of those two dimensions is assumed to be the same). This doesn't generally matter for grids where there are a different number of rows and columns, but will matter for things like the polar grids of SMAP L3, where there are the same number of rows and columns (something like (500, 500) if I remember rightly).
  • To determine the dimension that corresponds to the row indices, you just look at the values in one row of the projected x and projected y variables. One of them should vary in a regular manner, the other should be the same for every element. If projected x varies along the row, then projected x corresponds to your row indices.
  • To determine the dimensions that correspond to the column indices, you do the same thing, but checking in a single column (in the previous bullet point just do s/row/column/).
  • If your first row or column doesn't have at least 2 non-fill values, then move on to the next row/column until you hit one that has multiple non-fill values. (If none of the rows or none of the columns have enough non-fill values, an exception should have been raised much sooner in the code because index ranges should not have been able to be calculated).
  • From the bullet points above you should know which projected grid dimension corresponds to your row and which to your column. This means you should know the order to pull items out of the index_ranges cache.
  • Bonus: For 3-D variables, all the same logic applies, except for non-spatial dimensions you want to make sure that you just retrieve the [] string. In a general case of a science variable with (..., a, b, ...), I would suggest that you can take the size of the 2-D projected coordinate arrays to determine (a, b) and then just work out where in the shape of the science variable that sequence occurs.

Just as a general point, the problem of dimension ordering (once you have the correct index ranges) in no way depends on what is a projected x coordinate and what is a projected y coordinate. It is entirely determined by which of those 2-D arrays varies along the a row and which varies up a column. Making an assumption that couples dimension ordering to which is x and which is y will inevitably break things for some future collection. (See GPM_3IMERGHH for an example where latitude and longitude are ordered opposite to how we would expect)
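
A short sketch of the row/column attribution described above (again illustrative rather than the HOSS implementation, with a hypothetical fill value):

import numpy as np

def varies_along_row(coordinate_2d, fill_value):
    # True if values change across a row (i.e. the coordinate corresponds to the
    # column index); rows without at least two valid points are skipped.
    for row in coordinate_2d:
        valid = row[row != fill_value]
        if valid.size >= 2:
            return not np.allclose(valid, valid[0])
    raise ValueError('No row contains two or more unfilled values')

projected_x = np.array([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]])
projected_y = np.array([[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]])

# projected_x varies across a row, so it maps to the column index;
# projected_y does not, so it maps to the row index.
x_maps_to_columns = varies_along_row(projected_x, fill_value=-9999.0)
y_maps_to_columns = varies_along_row(projected_y, fill_value=-9999.0)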

@sudha-murthy sudha-murthy (Collaborator Author) commented Oct 16, 2024

I think, in general, I am fine with the way in which the projected x and projected y index ranges are calculated. While I don't think the calculation of the corners is strictly necessary, as you just need any two elements (preferably more, but two is enough) in a single row/column to infer the resolution and extents of a grid dimension, once there are a full battery of unit tests up, it is a methodology that should successfully get the index ranges. For the sake of moving this monster PR along, I am fine to proceed with the corners methodology.

I have changed the corners method to getting just any two valid points. Version 8 had fill values in the lower right corner.
I don't have all the unit tests in the last commit. Will add them in the next commit.

@sudha-murthy (Collaborator Author)

  • A science variable with array shape (a, b), where the projected x dimension (also shape (a, b)) varies across a row and the projected y dimension (also shape (a, b)) varies along a column.
  • A science variable with array shape (a, b) where the projected x dimension (also shape (a, b)) varies along a column and the projected y dimension (also shape (a, b)) varies across a row.

You mean latitude varies over row/column versus longitude varies over row/column, right?
The projected_x and projected_y are 1D dimension scales.

@sudha-murthy (Collaborator Author)

  • If your first row or column doesn't have at least 2 non-fill values, then move on to the next row/column until you hit one that has multiple non-fill values. (If none of the rows or none of the columns have enough non-fill values, an exception should have been raised much sooner in the code because index ranges should not have been able to be calculated).

I think my latest commit has some more checks to move to the next row or next column. It is not yet perfect, since the same valid points have to be present in both the latitude and longitude grids.

@sudha-murthy (Collaborator Author)

  1. Handling of multiple grids. Hard-coding the projected dimension names in the index_ranges cache to projected_x and projected_y will cause issues for multiple grids (which some SMAP L3 products have). The dimension index ranges for both will be written to the same keys (so the last one to be calculated will be the only one stored in the cache). The pseudo-dimension names need some uniqueness, likely based on the combination of coordinates from which they are derived.

Yes. I will make sure I add that test case, and verify and correct the uniqueness issues.

@owenlittlejohns (Member)

you mean latitude varies over row/col versus longitude varies over row/column right?
the projected_x and projected_y are 1D dimension scales

I suppose I was still in the mindset of projecting the whole latitude and longitude arrays, which still feels simpler. Given the corners approach, that's fair - the projected_x and projected_y are 1-D.

That said, this indicates that the corners approach is losing the information we need to determine which projected spatial dimension corresponds to which dimension axis. I think this comment has actually made me realise the corners approach is insufficient to do what we need. We need the 2-D projected coordinate arrays (projected_x and projected_y) to know which is varying along a row and which is varying along a column.

@owenlittlejohns owenlittlejohns changed the base branch from main to DAS-2208-SMAP-L3 October 16, 2024 22:03
@sudha-murthy sudha-murthy requested a review from autydp October 17, 2024 15:25
hoss/subset.py Outdated Show resolved Hide resolved