
Is there a way to map pandas categorical variable to integer #69
Open
zyxue opened this issue Mar 21, 2023 · 5 comments

Comments

zyxue commented Mar 21, 2023

I have a toy model that takes three variables, x1, x2, and x3, where x3 is a categorical variable with two categories, a and b. The model is trained with the Python API:

import lightgbm
import pandas as pd

model = lightgbm.LGBMRegressor(
    alpha=0.5,
    objective="quantile",
)

df = pd.DataFrame(
    [
        [1, 1.2, "a", 111],
        [2, 2.3, "b", 222],
    ],
    columns=["x1", "x2", "x3", "y"],
).assign(x3=lambda df: df.x3.astype("category"))

model.fit(df[["x1", "x2", "x3"]], df["y"])

with open("/tmp/simple-regression-model.lgb", "wt") as opened:
    opened.write(model.booster_.model_to_string())

The serialized model file looks like this:

tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=2
objective=quantile
feature_names=x1 x2 x3
feature_infos=none none -1:0:1
tree_sizes=223

...

[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]
[num_gpu: 1]

end of parameters

pandas_categorical:[["a", "b"]]

As seen, the end of the file includes the pandas_categorical mapping. Now I'd like to serve the model in Java, and I have some demo code working:

    try {
      String modelStr = Files.readString(Path.of("/tmp/simple-regression-model.lgb"));
      LGBMBooster booster = LGBMBooster.loadModelFromString(modelStr);
      String[] featureNames = booster.getFeatureNames();

      System.out.println(Arrays.toString(featureNames));

      // 1f for x3 is arbitrarily coded -- this is exactly the mapping I'm unsure about.
      float[] input = new float[] {1.0f, 1.0f, 1f};
      double pred = booster.predictForMatSingleRow(input, PredictionType.C_API_PREDICT_NORMAL);

      System.out.println(pred);

    } catch (Exception e) {
      e.printStackTrace();
    }

But I'm not sure what the best way is to map x3 (e.g. a or b) to its corresponding integer value in Java. In Python this is handled automatically when the input is a pd.DataFrame.
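One idea I've been toying with (just a sketch, not something lightgbm4j provides, and not verified against what the Python wrapper does internally): the pandas_categorical line at the end of the dump appears to list each categorical column's categories in the order of their integer codes, so the index of a raw value in that list would be the number to feed the booster. Using Gson (an assumed extra dependency here) to parse that JSON list:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public class PandasCategoricalDemo {
  public static void main(String[] args) throws Exception {
    String modelStr = Files.readString(Path.of("/tmp/simple-regression-model.lgb"));

    // The dump ends with e.g. pandas_categorical:[["a", "b"]] -- one list of
    // categories per categorical column, apparently in the order of their codes.
    String prefix = "pandas_categorical:";
    String json = modelStr.substring(modelStr.lastIndexOf(prefix) + prefix.length()).trim();

    List<List<String>> pandasCategorical =
        new Gson().fromJson(json, new TypeToken<List<List<String>>>() {}.getType());

    // x3 is the only categorical column, so its categories are the first list;
    // the code would then be the position of the raw value in that list: "a" -> 0, "b" -> 1.
    List<String> x3Categories = pandasCategorical.get(0);
    float x3Code = x3Categories.indexOf("b");

    float[] input = new float[] {1.0f, 1.2f, x3Code};
    System.out.println(java.util.Arrays.toString(input));  // [1.0, 1.2, 1.0]
    // `input` could then be passed to booster.predictForMatSingleRow(...) as above.
  }
}

The open question is whether this index-in-list convention really matches what the Python wrapper does when it converts a pd.DataFrame.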

shuttie (Contributor) commented Nov 13, 2023

AFAIK you need to pass the categorical_feature parameter prior to creating the booster. See this example: https://github.com/metarank/ltrlib/blob/02cec0419ccc83a85837d1235d89ebdf385b274c/src/main/scala/io/github/metarank/ltrlib/booster/LightGBMBooster.scala#L57

@kensinxie

@zyxue Hi, I encountered the same issue. How did you manage to overcome it?

zyxue (Author) commented Mar 30, 2024

I switched to catboost, https://catboost.ai/en/docs/concepts/java-package

@kensinxie

AFAIK you need to pass the categorical_feature parameter prior to creating the booster. See this example: https://github.com/metarank/ltrlib/blob/02cec0419ccc83a85837d1235d89ebdf385b274c/src/main/scala/io/github/metarank/ltrlib/booster/LightGBMBooster.scala#L57

@shuttie Hi, I do set the categorical_feature parameter when I train the model, so you can see it in the model dump:

[boosting: gbdt]
[objective: binary]
...
[categorical_feature: 1,16,17,18,19,20,21,23]
...
pandas_categorical:[[0, 1, 6, 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 30, 34, 38, 39, 40, 41, 43, 44, 45, 46, 74, 77, 78, 79, 83, 107, 111, 112, 113, 126, 138, 141, 144, 145, 154, 160, 161], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5, 6, 7], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48], [0, 1, 2, 3, 4], [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

When using Python to train or predict a LightGBM model, I can use the following code:

data[cat_cols] = data[cat_cols].astype('category')
model.predict(data)

However, in Java I can only pass a list of doubles to the model for prediction, which makes inference differ between Python and Java for the same model.
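The workaround I'm considering (only a sketch, not verified) is to replicate what the Python wrapper does with a pd.DataFrame: replace each categorical column's raw value with its index in the corresponding pandas_categorical list before building the numeric row, for example:

import java.util.Arrays;
import java.util.List;

public class EncodeCategoricalRow {

  // A raw categorical value becomes its index in the pandas_categorical list for its
  // column. indexOf returns -1 for values never seen in training; the LightGBM docs say
  // negative values in categorical features are treated as missing, so -1 seems a
  // reasonable fallback -- worth double-checking against the training-time behaviour.
  static float encode(Object rawValue, List<?> categories) {
    return categories.indexOf(rawValue);
  }

  public static void main(String[] args) {
    // Stand-in for the lists parsed from pandas_categorical (see the sketch earlier in
    // this thread); in my real model these are lists of integers rather than strings.
    List<List<String>> pandasCategorical = List.of(List.of("a", "b"));

    Object[] rawRow = {1.0f, 1.2f, "b"};   // x1, x2 numeric; x3 categorical
    int[] categoricalSlot = {-1, -1, 0};   // pandas_categorical list per column, -1 = numeric

    float[] encoded = new float[rawRow.length];
    for (int i = 0; i < rawRow.length; i++) {
      encoded[i] = categoricalSlot[i] >= 0
          ? encode(rawRow[i], pandasCategorical.get(categoricalSlot[i]))
          : (Float) rawRow[i];
    }
    System.out.println(Arrays.toString(encoded));  // [1.0, 1.2, 1.0]
    // `encoded` is what I would then pass to booster.predictForMatSingleRow(...).
  }
}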

bfeif commented Jul 25, 2024

I switched to catboost, https://catboost.ai/en/docs/concepts/java-package

@zyxue Do you have some example code of this you could share?
