
[ENH] GLMRegressor using statsmodels GLM with gaussian link #222

Merged (29 commits) on Mar 31, 2024

Conversation

julian-fong (Contributor)

Reference Issues/PRs

Implements a GLM model with the gaussian link, as requested in #7. Looking for feedback on the code changes.

What does this implement/fix? Explain your changes.

The Gaussian Regressor is a direct interface to the statsmodels package's GLM model. _fit, _predict, and _predict_proba have all been defined as described in the regression extension template (the class method get_test_params has not been started yet).

Does your contribution introduce a new dependency? If yes, which one?

Yes, this contribution introduces a new dependency on the statsmodels package. A link to the package page can be found here: https://www.statsmodels.org/stable/index.html

What should a reviewer concentrate their feedback on?

This is just a rough draft of the implementation (and there are probably loads of mistakes!). I am looking for feedback on areas where I could improve or fix the code (the constructor, fit, and predict methods), or even the design.

Something I noticed in the statsmodels GLM package is that the fitted GLM model is returned as a separate object, namely a GLMResultsWrapper, rather than stored on the GLM instance itself.

In statsmodels, the datasets X and y are passed to the model's constructor, not to the fit function, so I opted to expose the parameters of the GLM model's fit function inside the _fit function of the Gaussian Regressor. (Link to the fit function: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.fit.html#statsmodels.genmod.generalized_linear_model.GLM.fit)
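
For reference, the statsmodels calling pattern being described, as a minimal self-contained sketch with toy data (not the PR code):

import pandas as pd
import statsmodels.api as sm

X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([2.0, 4.1, 5.9, 8.0])

# data goes into the constructor ...
model = sm.GLM(y, X, family=sm.families.Gaussian())
# ... while fit() only takes estimation options and returns a separate
# GLMResultsWrapper object, leaving `model` itself unchanged
result = model.fit()
print(type(result).__name__)  # GLMResultsWrapper
print(result.params)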

I was considering whether or not to define the statsmodels GLM object inside the fit function instead of the constructor (similar to the MapieRegressor, to follow the template to the letter), but later decided against it. I am wondering if that is the recommended way to implement the GaussianRegressor as well.

Appreciate the feedback!

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

@julian-fong julian-fong marked this pull request as draft March 25, 2024 00:13
@julian-fong julian-fong changed the title from [ENH] Added Gaussian Regressor using statsmodels GLM with gaussian link to [ENH] - adding or improving code: Added Gaussian Regressor using statsmodels GLM with gaussian link Mar 25, 2024
@julian-fong julian-fong marked this pull request as ready for review March 25, 2024 00:13
@fkiraly (Collaborator) left a comment

Thanks, this is great!
Looks like we're almost done.

What is important is that you follow the "scikit-learn-like" interface precisely, and that means:

  • no data should be passed in __init__. The data is X and y. If it needs to be renamed or converted (e.g., to endog etc.), do that in _fit or _predict
  • _fit should have the signature def _fit(self, X, y):, no additional arguments. If there are additional arguments, they need to be in __init__ and read from self.argname

Also see the regression.py extension template.
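
To make this concrete, here is a minimal sketch of the pattern (illustrative only, not the code in this PR; the base class import path is assumed to match the regression extension template):

from skpro.regression.base import BaseProbaRegressor  # assumed import path

class GaussianRegressorSketch(BaseProbaRegressor):
    """Illustrative skeleton: hyper-parameters in __init__, data only in _fit."""

    def __init__(self, add_constant=False, method="IRLS"):
        # no data here; every argument is stored under its own name
        self.add_constant = add_constant
        self.method = method
        super().__init__()

    def _fit(self, X, y):
        # the statsmodels GLM object is created here, where data first becomes available
        from statsmodels.genmod.families import Gaussian
        from statsmodels.genmod.generalized_linear_model import GLM

        # extra fit arguments are read from self, not passed to _fit
        self.glm_fit_ = GLM(y, X, family=Gaussian()).fit(method=self.method)
        return self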

Minor things, but blocking (since it's code formatting):

Let us know if you need any help or have any questions about the above (e.g., on discord).

@fkiraly fkiraly added enhancement module:regression probabilistic regression module labels Mar 25, 2024
@fkiraly (Collaborator) commented Mar 25, 2024

I was considering whether or not to define the statsmodels GLM object inside the fit function instead of the constructor (similar to the MapieRegressor, to follow the template to the letter), but later decided against it. I am wondering if that is the recommended way to implement the GaussianRegressor as well.

Yes, I think that's the right way to do it. Because GLM requires access to the data, and the earliest you see it is in _fit. So your initial intuition was the right one.

@fkiraly fkiraly added the interfacing algorithms Interfacing existing algorithms/estimators from third party packages label Mar 25, 2024
@julian-fong julian-fong requested a review from fkiraly March 25, 2024 14:25
@julian-fong (Contributor Author)

Hi @fkiraly, I've made the changes and added a couple of extra features (add_constant and get_test_params). Let me know if there are any more changes you would like me to implement.

@julian-fong (Contributor Author)

One more thing: should I include the ability to specify any family, i.e., support all the distribution families? I know #7 only asked for the gaussian link.

@fkiraly (Collaborator) commented Mar 25, 2024

should I include the ability to specify any family, i.e., support all the distribution families?

That would be great, though:

  • I'm not sure if all families have a corresponding distribution object already implemented
  • I'd suggest we first make sure this passes all check_estimator tests. Adding more families before that might complicate work on this, since it risks introducing back-and-forth between parts of the logic.

@fkiraly (Collaborator) left a comment

Great!

I left some comments above; the most important one is to get predict_proba right.

Also, you should test your model using skpro.utils.check_estimator; that utility allows you to quickly run the relevant tests locally.
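
For example (the import path of the new estimator is assumed here for illustration; adjust to wherever the class actually lives):

from skpro.utils import check_estimator
from skpro.regression.linear import GLMRegressor  # assumed location of the new estimator

check_estimator(GLMRegressor)  # collects and runs the relevant tests locally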

@julian-fong (Contributor Author)

Hey @fkiraly,

I made some code updates, and I ran check_estimator from skpro.utils.estimator_checks locally and found one recurring error: NotImplementedError('no conversion defined from type pd_DataFrame_Table to pd_DataFrame'). I am wondering how I can fix this issue.

@fkiraly (Collaborator) commented Mar 30, 2024

Can you give me any insight as to what GLMRegressor-1 is and how it differs from GLMRegressor-0?

In get_test_params, you return a list of dicts. This list, in the same order, is used to construct test instances, the first of which becomes GLMRegressor-0, the second GLMRegressor-1 (in Python we start counting at 0).

These instances are then used in the tests from TestAllRegressors, substituted for the estimator_instance argument.

So, if test_input_output_contract[GLMRegressor-1] fails, it means that estimator_instance = GLMRegressor(**params), where params is the second element of the list returned by get_test_params, fails the test test_input_output_contract.
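
For illustration, a sketch of how the returned parameter sets map to test instances (signature as in the extension template; the parameter values shown are placeholders, although per the discussion below the second set does enable add_constant):

@classmethod
def get_test_params(cls, parameter_set="default"):
    # params[0] constructs GLMRegressor-0, params[1] constructs GLMRegressor-1
    params = [
        {},                      # default construction
        {"add_constant": True},  # second test instance
    ]
    return params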

To get a full traceback for debugging, use raise_exceptions=True in check_estimator; you can also select individual tests via tests_to_run and fixtures_to_run.

For more details, see

https://www.sktime.net/en/latest/developer_guide/testing_framework.html
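
Continuing the check_estimator example from above, a sketch of how those options could be combined (import path of the estimator again assumed):

from skpro.utils import check_estimator
from skpro.regression.linear import GLMRegressor  # assumed import path, as above

check_estimator(
    GLMRegressor,
    raise_exceptions=True,  # stop at the first failure and show the full traceback
    tests_to_run="test_input_output_contract",  # run only this test
    # fixtures_to_run="test_input_output_contract[GLMRegressor-1]" selects a single fixture
)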

@julian-fong (Contributor Author) commented Mar 30, 2024

Thanks Franz! I believe I figured out the source of the array mismatch. In statsmodels there's an option to add a column of constants to the GLM model if the user specifies it (the second parameter set specifies add_constant = True). In that case, an extra column called 'const' is created inside the pandas DataFrame during training. When the tests call _predict_proba, the extra column from _fit creates an array mismatch, because X_test does not account for it.

@fkiraly (Collaborator) commented Mar 30, 2024

When the tests call _predict_proba, the extra column from _fit creates an array mismatch, because X_test does not account for it.

That sounds like a bug in statsmodels, in get_prediction specifically, no?

Either way, if we know what the issue is, can we compensate for it by adding the column ourselves? (I assume it is an all-ones column.)

We should also report the bug in statsmodels, if you can verify that it's indeed coming from get_prediction, and we should open a tracking issue that links to the statsmodels issue.

@julian-fong (Contributor Author) commented Mar 30, 2024

Yes, if the user specifies add_constant = True at initialization, we can compensate by adding the column ourselves if it does not already exist (this affects both the _predict and _predict_proba functions). One thing to note: the docstrings for both _predict and _predict_proba assume that X_test has the same columns as the X passed to fit.

@julian-fong (Contributor Author) commented Mar 30, 2024

I did some testing with a toy example as follows:

import pandas as pd
import statsmodels.api as sm
from statsmodels.genmod.families import Gaussian

# Create a sample dataset with two predictor variables (X1 and X2) and a response variable (y)
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 4, 5, 6],
}

data2 = {
    'y': [3, 4, 5, 6, 7]
}
X = pd.DataFrame(data)
print(X)

# Add constant for intercept term
X = sm.add_constant(X)
print(X)
# Fit a GLM model

y = pd.DataFrame(data2)
model = sm.GLM(y, X, family=Gaussian())
result = model.fit()

# Print summary of the fitted model
# print(result.summary())

# Create a new observation for prediction
X_test = pd.DataFrame({'const': [1,1,1], 'X1': [6,2,4], 'X2': [7,2,5]})  # New observations with constant
X_test_ = pd.DataFrame({'X1': [6,2,4], 'X2': [7,2,5]}) #new observations without constant automatically set
# Predict with the model

print("-------predictions-------")
#will work
prediction = result.predict(X_test)
pred = result.get_prediction(X_test).summary_frame()
print(prediction)
print(pred)

#won't work
#prediction = result.predict(X_test_)
pred = result.get_prediction(X_test_).summary_frame()

# print("Prediction for new observation:", prediction)

y_test = pd.DataFrame(prediction, columns = y.columns)

A ValueError is then raised: ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0)

@fkiraly (Collaborator) commented Mar 30, 2024

One thing to note: the docstrings for both _predict and _predict_proba assume that X_test has the same columns as the X passed to fit.

I mean, we do this only internally in _predict, to compensate for the bug in statsmodels until it is fixed.
That is, if add_constant=True, then make a copy of the df X where a column is added, and pass that (with the additional column) to statsmodels. The skpro user doesn't have to do that.

Does this make sense?

At the same time, I would suggest you report the bug in statsmodels, with a reproducible example. If you want a review before that, you can post the bug report here first.

@julian-fong (Contributor Author)

One thing to note: the docstrings for both _predict and _predict_proba assume that X_test has the same columns as the X passed to fit.

I mean, we do this only internally in _predict, to compensate for the bug in statsmodels until it is fixed. That is, if add_constant=True, then make a copy of the df X where a column is added, and pass that (with the additional column) to statsmodels. The skpro user doesn't have to do that.

Understood, but would we not want to leverage the same idea for _predict_proba since that method also requires adding a column 'const' if it is missing?

@fkiraly
Copy link
Collaborator

fkiraly commented Mar 30, 2024

Understood, but would we not want to leverage the same idea for _predict_proba since that method also requires adding a column 'const' if it is missing?

Yes, of course - I would move the logic into a separate method, e.g., _prep_X, to avoid copy-pasting too much code.
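
A minimal sketch of such a helper, assuming add_constant is the flag from __init__ (not necessarily the code that ended up in this PR):

from statsmodels.tools import add_constant

def _prep_X(self, X):
    """Return X with an all-ones 'const' column prepended when add_constant is set."""
    if self.add_constant and "const" not in X.columns:
        X = add_constant(X)
    return X

# _fit, _predict and _predict_proba would then all call self._prep_X(X)
# before handing the data to statsmodels.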

from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.tools import add_constant

if self.add_constant:
Collaborator

do we need to do this in _fit, too?

Contributor Author

Yes, I'll change the code so it's consistent with the other methods.

Collaborator

I thought statsmodels does it in fit?

Contributor Author

To my knowledge, the statsmodels GLM model does not automatically include a constant column at any point during fitting or prediction.

Collaborator

I see, so you added this feature?

Contributor Author

Yes, I thought it was worth including since it was consistently used throughout the tutorials and documentation I went over before writing the code.

@fkiraly (Collaborator) commented Mar 31, 2024

FYI, if you want to indicate that you're done with making changes and addressing the review comments, you can click here:
[image: the circling arrows]

Because, to me it looks like this is ready?

@fkiraly (Collaborator) left a comment

Nice, no blocking comments.

Some comments which might still be worth addressing:

  • family can be picked by the user, but only Gaussian works at the moment; other choices will cause the estimator to break.
  • we should try to cover as many parameters as we can in get_test_params.
  • the docstring should say, for every parameter, what the possible values are, e.g., the possible values for cov_type and method, the expected size of start_params, etc.

@julian-fong (Contributor Author)

Sounds good! I'll keep my eye on this page and improve the code so that it is more robust when including other families. Thanks again for all the guidance!

@fkiraly (Collaborator) commented Mar 31, 2024

Let me know if you want to do it in this PR, or separately.

@julian-fong (Contributor Author)

I'll make a separate PR and reference this one so they are linked together (and so this one doesn't get too congested with comments), once my GSoC proposal is submitted.

@fkiraly (Collaborator) commented Mar 31, 2024

OK - I've removed the parameters that are broken or not usable. You can handle them later if you want.

@fkiraly fkiraly merged commit 91ca77a into sktime:main Mar 31, 2024
36 checks passed
@fkiraly fkiraly changed the title from [ENH] GaussianRegressor using statsmodels GLM with gaussian link to [ENH] GLMRegressor using statsmodels GLM with gaussian link Mar 31, 2024
@julian-fong julian-fong deleted the add_gaussian_regressor branch April 2, 2024 19:49
fkiraly added a commit that referenced this pull request Apr 6, 2024
@julian-fong forgot to add their badges to `all-contributorsrc` for
#222, this PR adds them.

Also fixes an unrelated typo.