Add order statistic warning #230
Conversation
… compared is potentially due to random noise
@yannmclatchie Great, thank you for the PR! I put a few suggestions and questions in the review comments.
```r
if (max(elpd_diff) <= order_stat) {
  # flag warning if we suspect no model is theoretically better than the baseline
  warning("Difference in performance potentially due to chance.",
```
Should we point users to the paper in the warning message (e.g., "See McLatchie and Vehtari (2023) for details")? @avehtari what do you think?
yes
I have now added this.
R/loo_compare.R
```r
#' If more than \eqn{11} models are compared, then the worst model by elpd is
#' taken as the baseline model, and the risk of the difference in predictive
```
I'm slightly confused by this, because the printed `loo_compare` results will still show the model with the best ELPD as the baseline (i.e. all `elpd_diff` values are relative to that one), right? If that's still true, then I think we should rephrase this to indicate that it only happens in an internal check and doesn't affect which model is used as the baseline for computing `elpd_diff` in the output. Does that make sense? Or am I just confused (totally possible)?
I have now updated the docstring; let me know if it is clearer now 🙏
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #230      +/-   ##
==========================================
+ Coverage   92.96%   93.00%   +0.04%
==========================================
  Files          30       30
  Lines        2770     2788      +18
==========================================
+ Hits         2575     2593      +18
  Misses        195      195
```
Thanks for the feedback @jgabry ! I've gone through your suggestions and made the changes, along with a small change in the logic: we now use the median model rather than the worst model to take the internal differences and compute the order statistic. I have also updated the documentation to make it clear that this second layer of elpd differences is strictly internal and differs from what the user eventually sees, as you noted. Let me know if any further changes would help clarify.
@yannmclatchie Sorry for the delay on this! I've been super busy recently. I added one more comment suggesting minor edits to the documentation, but otherwise this looks good. After you take a look at my suggestion and approve it I can merge this PR!
@avehtari Does this look good to you too?
Co-authored-by: Jonah Gabry <[email protected]>
Thanks for updating the doc @yannmclatchie (and thanks for adding this to the package!). @avehtari if you're happy with this I think we can go ahead and merge it once all the automated tests have run again.
I can check this on Tuesday at the latest (too late today)
Sounds good, thanks
OK for me
Merging now. Thanks again!
This PR, developed with @avehtari, adds a warning when we estimate that the elpd differences among a collection of models are potentially due to chance alone.
If more than 11 models are compared, the median model by elpd is taken as the baseline model, and the risk that the difference in predictive performance is due to random noise is estimated as described by McLatchie and Vehtari (2023, Section 3.2). A warning is flagged if there is deemed to be a risk of over-fitting due to the selection process; in that case users are advised to avoid model selection based on LOO-CV and instead to favour model averaging/stacking or projection predictive inference.
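To illustrate why such a check is useful, here is a small simulation sketch (hypothetical, not part of the PR): when many models with identical true predictive performance are compared, the apparent elpd advantage of the best-looking one grows with the number of candidates even though it is pure noise. The model count and standard deviation below are assumed values for illustration.

```r
set.seed(1)
K <- 12       # number of candidate models being compared
sd_diff <- 2  # assumed standard deviation of the pairwise elpd differences

# Draw K - 1 elpd differences against a baseline model under the assumption
# that all models have the same true predictive performance, and record the
# apparent advantage of the "best" model in each simulated comparison.
n_sims <- 5000
best_gain <- replicate(n_sims, max(rnorm(K - 1, mean = 0, sd = sd_diff)))

mean(best_gain)          # around 3.2: the typical spurious advantage
quantile(best_gain, 0.9) # chance alone often produces even larger gaps
```

Selecting the best model by elpd in this setting would consistently overstate its performance, which is exactly the over-fitting risk the new warning points to.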