-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] GLMRegressor
using statsmodels GLM
with gaussian link
#222
Conversation
Gaussian Regressor
using statsmodels GLM with gaussian linkGaussian Regressor
using statsmodels GLM with gaussian link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is great!
Looks like we're almost done.
What is important is that you follow the "scikit-learn-like" interface precisely, and that means:
- no data should be passed in
__init__
. The data isX
andy
. If this needs to be renamed or converged (e.g., toendog
etc), do that in_fit
or_predict
_fit
should have the signaturedef _fit(self, X, y):
, no additional arguments. If there are additional arguments, they need to be in__init__
and read fromself.argname
Also see the regression.py
extension template.
Minor things, but blocking (since it's code formatting):
- ensure you follow code formatting guidelines, here's the guide on setting it up locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html
- you accidentally added a newline in
_sklearn.py
, kindly revert that
Let us know if you need any help or have any questions about the above (e.g., on discord).
Yes, I think that's the right way to do it. Because |
Hi @fkiraly, I've made the changes and added a couple more extra features (add_constant and get_test_params). Let me know if there are any more changes you would like me to implement. |
One more thing, should I include the ability to add any family to be present i.e to include all the distribution families? I know in #7 it only said to implement the gaussian link |
That would be great, though:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
I left some comments above, most important is to get predict_proba
right.
Also, you should test your model using skpro.utils.check_estimator
, that utility allows you to quickly run relevant tests locally.
Hey @fkiraly, I made some code updates, and I ran |
In These instances are then used in the tests from So, if To get a full traceback for debugging, use For more details, see https://www.sktime.net/en/latest/developer_guide/testing_framework.html |
Thanks Franz! i believed i figured out what the source of the array mismatch was. In statsmodels theres an option to add a column of constants inside the GLM model if a user specifies it (the second parameter set specifies |
That sounds like a bug in Either way, if we know what the issue is, can we compensate for it by adding the column ourselves? (i assume it is an all-ones column) We should also report the bug in |
Yes, if the user specifies |
i did some testing with a toy example as follows:
A value error of ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0) is then returned |
I mean, we do this only internally in Does this make sense? At the same time, I would (i.e., I'd suggest you do that) report the bug in |
Understood, but would we not want to leverage the same idea for |
Yes, of course - I would move the logic into a separate method, e.g., |
skpro/regression/linear/_glm.py
Outdated
from statsmodels.genmod.generalized_linear_model import GLM | ||
from statsmodels.tools import add_constant | ||
|
||
if self.add_constant: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to do this in _fit
, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i'll change the code so its consistent with the other methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought statsmodels
does it in fit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my knowledge, the GLM
model on statsmodels
does not automatically include a constant column at any point during any fitting or prediction process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so you added this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought it was worth including as it was consistently used throughout all the tutorials and documentation I went over prior to writing the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, no blocking comments.
Some comments which might still be worth addressing:
family
can be picked by the user, but only Gaussian works at the moment, other choices will cause the estimator to break.- we should try to cover as many parameters as we can in
get_test_params
. - the docstring should say, for every parameter, what possible values are. E.g., what are possible values for
cov_type
,method
, what are expected sizes forstart_params
, etc.
Sounds good! I'll keep my eye on this page and improve the code so that it is more robust when including other families. Thanks again for all the guidance! |
let me know if you want to do it in this PR, or separately. |
I'll make a seperate pr and include this pr so they are linked together (so this one doesn't get too congested with comments) once my GSoC proposal is submitted. |
ok - I've removed the parameters that are broken or not useable. You can handle them later if you want. |
GaussianRegressor
using statsmodels GLM
with gaussian linkGLMRegressor
using statsmodels GLM
with gaussian link
@julian-fong forgot to add their badges to `all-contributorsrc` for #222, this PR adds them. Also fixes an unrelated typo.
Reference Issues/PRs
Implements a GLM model for the gaussian link in #7 , looking for feedback on the code changes
What does this implement/fix? Explain your changes.
The Gaussian Regressor is a direct interface to the statsmodel package's GLM model, and _fit, _predict, and _predict_proba have all been defined as noted in the regression extension template (have not started on the class method get_test_params)
Does your contribution introduce a new dependency? If yes, which one?
Yes, this contribution introduces a new dependency to the statsmodel package. A link to the package page can be found here: https://www.statsmodels.org/stable/index.html
What should a reviewer concentrate their feedback on?
This is just a rough draft (and there are probably loads of mistakes!) of the implementation. I am looking for feedback on the areas where I could improve/fix on code wise (on the constructor, fit, and predict methods), or even design wise.
Something I noticed in the GLM statsmodels package is that the fitted GLM model is returned as a separate object, as it is returned as part of a
GLMResultsWrapper
object.Datasets X, y are to be defined in the constructor of the model, and not inside the fit function, so I opted to include the params of the GLM model's fit function inside the _fit function in
Gaussian Regressor
. (link to the fit function: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.fit.html#statsmodels.genmod.generalized_linear_model.GLM.fit)I was considering whether or not to define the statsmodel GLM object inside the fit function instead of the constructor (similar to the
MapieRegressor
to follow the template to the letter, but later decided against. I am wondering if that is the recommended option to implement theGaussianRegressor
as well.Appreciate the feedback!
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
skpro
root directory (not theCONTRIBUTORS.md
). Common badges:code
- fixing a bug, or adding code logic.doc
- writing or improving documentation or docstrings.bug
- reporting or diagnosing a bug (get this pluscode
if you also fixed the bug in the PR).maintenance
- CI, test framework, release.See here for full badge reference