[sklearn] OneHotEncoder doesn't work correctly #684

faterazer · 2023-02-09T13:22:06Z

Hello, I found this project last week. Thanks for all of this work.

I installed hummingbird-ml==0.4.7 with pip, and I want to know which version of sklearn I should use.

I want to use sklearn's one-hot encoder to preprocess my categorical features, but the output dimension from sklearn differs from that of the converted PyTorch model. With sklearn, 15 features -> 69 dims, but with the converted PyTorch model, 15 features -> 76 dims.

After checking, I'm sure the problem is the handle_unknown argument of sklearn's OneHotEncoder. The scikit-learn documentation says:

Changed in version 1.1: 'infrequent_if_exist' was added to automatically handle unknown categories and infrequent categories.

Is there any way to solve this problem? Thanks for any solution!

ksaur · 2023-02-09T20:47:39Z

Hi @faterazer, thanks for reaching out! We use whatever the most current version of SKL is, so right now 1.2.1.

Was your model trained on the same version of scikit-learn that you're trying to use Hummingbird with? Just trying to make sure it's not a simple fix. (Lots of times, users have issues if the model is trained with an older version of SKL and then they call Hummingbird on a saved model.)
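One way to rule out a version mismatch is to store the scikit-learn version alongside the pickled model and compare it at load time. This is only a sketch: the dictionary layout and the `sklearn_version` key are a hypothetical convention, not something Hummingbird or scikit-learn defines.

```python
import pickle

import sklearn
from sklearn.preprocessing import OneHotEncoder

# Training side: keep the scikit-learn version next to the fitted encoder.
enc = OneHotEncoder(handle_unknown="ignore").fit([["a"], ["b"]])
payload = pickle.dumps({"sklearn_version": sklearn.__version__, "model": enc})

# Conversion side: compare versions before passing the model to a converter.
bundle = pickle.loads(payload)
same_version = bundle["sklearn_version"] == sklearn.__version__
if not same_version:
    print(
        f"warning: model trained with scikit-learn {bundle['sklearn_version']}, "
        f"but {sklearn.__version__} is installed here"
    )
```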

Can you post a little bit of your code so we can take a look? Maybe we need to add the new field.

faterazer · 2023-02-13T04:32:44Z

Hi, I appreciate your suggestions. I read your reply and checked through my steps. Unfortunately, the problem still exists. More details should make it easier for you to locate the problem, so I am posting my code and test data; they are all in test.zip. Here is my processing flow:

1. In test.zip, I constructed some test data. All fifteen columns are categorical features. I saved the data as test/test.csv.
2. For various reasons, I need to work across conda environments. First, I use a conda environment that includes Python 3.10 and scikit-learn 1.2.1 but not hummingbird-ml. I construct a sklearn OneHotEncoder and fit it on the test data. Finally, I save the encoder/pipeline as a binary file with pickle. The code is in test/A.py.
3. Then, I use another conda environment, which includes Python 3.8, scikit-learn 1.2.1, and hummingbird-ml 0.4.7. I load my sklearn preprocessor from the binary file with pickle and use hummingbird-ml to convert it. Finally, I compare the outputs from sklearn and hummingbird-ml; the shapes are different. The code is in test/B.py.
4. I found that if I change line 16 of test/A.py from OneHotEncoder(sparse_output=False, handle_unknown="infrequent_if_exist", min_frequency=0.005) to OneHotEncoder(sparse_output=False, handle_unknown="ignore"), then everything is OK. According to the sklearn changelog, version 1.1 added this new choice for handle_unknown, which I would like to use but which causes the problem.

Could you look into my steps and code? Did I make a mistake in any step? Or is there a way to fix the problem? I appreciate your reading and efforts. Thanks again for all your work on hummingbird-ml. It's an awesome project, and I hope I can keep using it.

Yours sincerely, faterazer
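The shape comparison in step 3 can be sketched as a small helper. This is not the code from test/B.py (which is not shown in this thread); `output_widths` is a hypothetical name, and the stand-in for the converted model below is simply the encoder's own `transform`, so the example runs without hummingbird-ml installed.

```python
from sklearn.preprocessing import OneHotEncoder


def output_widths(encoder, X, converted_transform):
    """Return (sklearn_width, converted_width) for the same input X."""
    skl_width = encoder.transform(X).shape[1]
    converted_width = converted_transform(X).shape[1]
    return skl_width, converted_width


X = [["x"], ["y"], ["x"]]
enc = OneHotEncoder(handle_unknown="ignore").fit(X)

# Stand-in for a converted model: the encoder's own transform method.
skl_w, conv_w = output_widths(enc, X, enc.transform)
print("sklearn width:", skl_w, "converted width:", conv_w)
```

In the setup described above, one would presumably pass the transform method of the object returned by `hummingbird.ml.convert(enc, "torch")` as `converted_transform`, assuming the converted object exposes `transform` for this encoder.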


ksaur · 2023-02-13T18:50:10Z

Hello! I think that the attachment (test.zip) got dropped. If it's easier, you could check them into a fork in github and put a link!

faterazer · 2023-02-14T03:51:47Z

test.zip
How about this time? I replied directly through GitHub.

ksaur · 2023-02-15T00:54:32Z

Thank you for your in-depth example with details! I was able to reproduce everything you said.

Yes, it looks like we need to add this feature to the list of supported options (and we should at least be raising an error for the ones we don't support). We'll add that to the queue!
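A rough sketch of what such an early error could look like. This is not Hummingbird's actual code: the function name `check_one_hot_encoder_supported` and the set of supported values are assumptions made for illustration.

```python
from types import SimpleNamespace

# Assumption for illustration: the options a converter is known to handle.
SUPPORTED_HANDLE_UNKNOWN = {"error", "ignore"}


def check_one_hot_encoder_supported(encoder):
    """Raise early instead of silently producing the wrong number of columns."""
    handle_unknown = getattr(encoder, "handle_unknown", "error")
    if handle_unknown not in SUPPORTED_HANDLE_UNKNOWN:
        raise NotImplementedError(
            f"OneHotEncoder with handle_unknown={handle_unknown!r} is not supported"
        )
    # Infrequent-category grouping changes the output width, so reject it too.
    for name in ("min_frequency", "max_categories"):
        if getattr(encoder, name, None) is not None:
            raise NotImplementedError(f"OneHotEncoder with {name} set is not supported")


# Stand-in object carrying the settings from this report.
fake_encoder = SimpleNamespace(
    handle_unknown="infrequent_if_exist", min_frequency=0.005, max_categories=None
)
try:
    check_one_hot_encoder_supported(fake_encoder)
    raised, message = False, ""
except NotImplementedError as exc:
    raised, message = True, str(exc)
print(raised, message)
```

Failing loudly at conversion time would have turned the silent 69 vs. 76 mismatch into an immediate, explainable error.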

faterazer · 2023-02-15T04:27:01Z

So glad my example helped. I really hope the problem can be solved in the near future. Thanks for your efforts. 🙂


ksaur added the bug, enhancement, and help wanted labels Feb 15, 2023

ksaur linked a pull request Apr 4, 2023 that will close this issue

Updates for OneHotEncoder in newer SKL versions #696

Draft

ksaur self-assigned this Apr 4, 2023
