Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DIG-1582: Implement overview thresholds & DIG-1533: Change cancer_type to primary_site #205

Merged
merged 17 commits into from
May 23, 2024

Conversation

SonQBChau
Copy link

@SonQBChau SonQBChau commented May 10, 2024

What's new

  • Introduce a threshold mechanism to display masked values for small counts, e.g., "<5".
  • Rewrite queries for masking and improved performance.
  • Update return types from int to str to handle masked values.
  • Update schema return structure to accommodate new queries.
  • Rename discover_cancer_type_count to discover_primary_site_count
  • Use lazyFunction for enhanced random seeds in testing
  • Add a test suite to ensure proper functionality of overview masking for small counts.

Types of Change(s)

  • ✨ New feature
  • 💥 Breaking change

Which other services does it affect?

  • data-portal
  • query

Has it been tested for:

  • Locally tested
  • Docker tested

How to test

  • For local, run tox
  • For docker, pull this branch, add to docker-compose AGGREGATE_COUNT_THRESHOLD=${AGGREGATE_COUNT_THRESHOLD} and run test-integration
  • Optional: Add some data and call overview endpoints to see if value <5 gets properly masked.

SonQBChau and others added 9 commits May 9, 2024 22:39
- Introduced threshold for small counts to display masked values, e.g., <5.
- Rewrite queries for masking and performance.
- Updated return types from int to str to cope with "<5"
- Update schema return structure for new queries
Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 97.93103% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 96.69%. Comparing base (c05e921) to head (b72daf2).

Files Patch % Lines
...hord_metadata_service/mohpackets/apis/discovery.py 91.42% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #205      +/-   ##
===========================================
+ Coverage    95.11%   96.69%   +1.57%     
===========================================
  Files           34       35       +1     
  Lines         3503     3570      +67     
===========================================
+ Hits          3332     3452     +120     
+ Misses         171      118      -53     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@SonQBChau SonQBChau marked this pull request as ready for review May 10, 2024 18:11
Copy link
Member

@CourtneyGosselin CourtneyGosselin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved! I was able to spin up the environment and test some of the calls. The results return as expected, showing the new AGGREGATE_COUNT_THRESHOLD <5

@OrdiNeu
Copy link

OrdiNeu commented May 21, 2024

Getting an error on the frontpage with this PR, from Katsu's logs:

Cannot use None as a query value                                                                                                                                                                   [22/1852]
Traceback (most recent call last):                                                                                                                                                                          
  File "/usr/local/lib/python3.12/site-packages/ninja/operation.py", line 107, in run
    result = self.view_func(request, **values)   
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                              
  File "/app/chord_metadata_service/chord_metadata_service/mohpackets/apis/discovery.py", line 302, in discover_diagnosis_age_count
    .annotate(                   
     ^^^^^^^^^                                                                                        
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query.py", line 1630, in annotate
    return self._annotate(args, kwargs, select=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                        
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query.py", line 1680, in _annotate
    clone.query.add_annotation(                                                                       
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1198, in add_annotation
    annotation = annotation.resolve_expression(self, allow_joins=True, reuse=None)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/expressions.py", line 1513, in resolve_expression
    c.cases[pos] = case.resolve_expression(                                                                                                                                                                 
                   ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                   File "/usr/local/lib/python3.12/site-packages/django/db/models/expressions.py", line 1432, in resolve_expression
    c.condition = c.condition.resolve_expression(                                                                                                                                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query_utils.py", line 90, in resolve_expression                                                                               
    clause, joins = query._add_q(                                                                     
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1645, in _add_q
    child_clause, needed_inner = self.build_filter( 
                                 ^^^^^^^^^^^^^^^^^^ 
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1513, in build_filter
    condition = self.build_lookup(lookups, reffed_expression, value)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1394, in build_lookup
    raise ValueError("Cannot use None as a query value")
ValueError: Cannot use None as a query value
[21/May/2024 16:06:46] [django] ERROR: Cannot use None as a query value
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ninja/operation.py", line 107, in run
    result = self.view_func(request, **values)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/chord_metadata_service/chord_metadata_service/mohpackets/apis/discovery.py", line 302, in discover_diagnosis_age_count
    .annotate(
     ^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query.py", line 1630, in annotate
    return self._annotate(args, kwargs, select=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query.py", line 1680, in _annotate
    clone.query.add_annotation(
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1198, in add_annotation
    annotation = annotation.resolve_expression(self, allow_joins=True, reuse=None)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/expressions.py", line 1513, in resolve_expression
    c.cases[pos] = case.resolve_expression(
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/expressions.py", line 1432, in resolve_expression
    c.condition = c.condition.resolve_expression(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/query_utils.py", line 90, in resolve_expression
    clause, joins = query._add_q(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1645, in _add_q
    child_clause, needed_inner = self.build_filter(
                                 ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1513, in build_filter
    condition = self.build_lookup(lookups, reffed_expression, value)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/django/db/models/sql/query.py", line 1394, in build_lookup
    raise ValueError("Cannot use None as a query value")
ValueError: Cannot use None as a query value
Internal Server Error: /v2/discovery/overview/diagnosis_age_count/
[21/May/2024 16:06:46] [django.request] ERROR: Internal Server Error: /v2/discovery/overview/diagnosis_age_count/
[pid: 16|app: 0|req: 43/43] 172.20.0.1 () {32 vars in 2040 bytes} [Tue May 21 16:06:46 2024] GET /v2/discovery/overview/diagnosis_age_count/ => generated 2247 bytes in 10 msecs (HTTP/1.1 500) 9 headers in 664 bytes (1 switches on core 0)

@mshadbolt mshadbolt changed the title Sonchau/overview threshold DIG-1582: overview threshold & DIG-1533: change cancer_type to primary_site May 21, 2024
@mshadbolt mshadbolt changed the title DIG-1582: overview threshold & DIG-1533: change cancer_type to primary_site DIG-1582: Implement overview thresholds & DIG-1533: Change cancer_type to primary_site May 21, 2024
@SonQBChau
Copy link
Author

I can confirm this bug, looking into it

@SonQBChau
Copy link
Author

fixed. Can you check it again @OrdiNeu ?

@OrdiNeu
Copy link

OrdiNeu commented May 21, 2024

I've rebuilt a few times with the latest changes but still get the same error listed above. For reference, I'm just visiting the frontpage and checking the logs -- it's a request straight to Katsu's /v2/discovery/overview/diagnosis_age_count/ endpoint.

@SonQBChau
Copy link
Author

SonQBChau commented May 22, 2024

@OrdiNeu Have you added AGGREGATE_COUNT_THRESHOLD=${AGGREGATE_COUNT_THRESHOLD} to docker compose? (I put a note in the description, but even I forgot it myself).

To avoid this in the future, I have added the code to check if any .env is missing.

Let me know if it works for you.

@OrdiNeu
Copy link

OrdiNeu commented May 22, 2024

Oh, that sort of works now. Getting a new error now though, on individual_count only:

'>=' not supported between instances of 'int' and 'str'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ninja/operation.py", line 107, in run
    result = self.view_func(request, **values)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/chord_metadata_service/chord_metadata_service/mohpackets/apis/discovery.py", line 188, in discover_individual_count
    if donor_count >= SMALL_NUMBER_THRESHOLD
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'int' and 'str'
[22/May/2024 14:56:51] [django] ERROR: '>=' not supported between instances of 'int' and 'str'
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ninja/operation.py", line 107, in run
    result = self.view_func(request, **values)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/chord_metadata_service/chord_metadata_service/mohpackets/apis/discovery.py", line 188, in discover_individual_count
    if donor_count >= SMALL_NUMBER_THRESHOLD
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'int' and 'str'
Internal Server Error: /v2/discovery/overview/individual_count/
[22/May/2024 14:56:51] [django.request] ERROR: Internal Server Error: /v2/discovery/overview/individual_count/

- fix the bug where AGGREGATE_COUNT_THRESHOLD read as a str in docker env
- should show 0 instead of <5 for value 0
@SonQBChau
Copy link
Author

I fixed the bug where AGGREGATE_COUNT_THRESHOLD is not an int. Accidentally, I found a bug with cache but I will address it in another ticket

Copy link

@OrdiNeu OrdiNeu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, well-written ✔️

The latest fix made it work for me

"REDIS_PASSWORD_FILE",
]

missing_vars = [var for var in required_env_vars if not os.getenv(var)]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file switches between using os.getenv() and os.environ[] in several places -- I'd recommend sticking to one or the other (as far as I understand it, one's a wrapper of the other)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a slight difference, and it's on purpose. os.getenv returns None if the variable is not found, allowing me to catch it and throw an exception.

Why not use os.getenv below? If it returns None for a missing variable, it can cause the bug we encountered before. If I (or someone else) tries to get a new variable but forgets to add it to required_vars, it will crash instead of causing a subtle bug, making it easier to catch.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- perhaps it would be best if lines 34-36 used os.getenv then, rather than os.environ.get()? For consistency

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh thank you for the eagle eye! There was a merge conflict and I used the upstream code without realized it

"""


class OverviewTestCase(BaseTestCase):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are well-written 👍

}
date_of_birth = factory.LazyFunction(
lambda: {
"month_interval": random.randint(0, 1000),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this make the month_interval and day_interval not line up? E.g. you could have a day_interval of 0 and a month_interval of 1000.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, but I'm not testing the date logic here. I agree that a more realistic generator would be better, but these mock data serve my purpose for now.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me -- maybe document this discrepancy in the Note: at the top of this file. Otherwise it looks good ✔️

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@OrdiNeu
Copy link

OrdiNeu commented May 23, 2024

NB: we should probably hold on updating the submodule version on CanDIGv2's develop until we have both Query PR and the Data Portal PR merged at the same time. In addition, we need a second Data Portal PR to handle censoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants