fix: update new feature engineering code format #272

Merged: 2 commits, Sep 19, 2024
@@ -2,12 +2,14 @@ import os

 import numpy as np
 import pandas as pd
-from factor import feat_eng
+from factor import feature_engineering_cls

 if os.path.exists("valid.pkl"):
     valid_df = pd.read_pickle("valid.pkl")
 else:
     raise FileNotFoundError("No valid data found.")

-new_feat = feat_eng(valid_df)
+cls = feature_engineering_cls()
+cls.fit(valid_df)
+new_feat = cls.transform(valid_df)
 new_feat.to_hdf("result.h5", key="data", mode="w")
24 changes: 17 additions & 7 deletions rdagent/scenarios/kaggle/experiment/meta_tpl/feature/feature.py
@@ -1,13 +1,23 @@
 import pandas as pd

 """
-Here is the feature engineering code for each task, with the function name specified as feat_eng.
-The file name should start with feat_, followed by the specific task name.
+Here is the feature engineering code for each task, with a class that has a fit and transform method.
+Remember
 """


-def feat_eng(X: pd.DataFrame):
-    """
-    return the selected features
-    """
-    return X
+class IdentityFeature:
+    def fit(self, train_df: pd.DataFrame):
+        """
+        Fit the feature engineering model to the training data.
+        """
+        pass
+
+    def transform(self, X: pd.DataFrame):
+        """
+        Transform the input data.
+        """
+        return X
+
+
+feature_engineering_cls = IdentityFeature
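
The template above is a no-op, so it does not show why `fit` is separate from `transform`. The following is a minimal sketch of a stateful implementation of the same interface, not code from this PR: the class name `ZScoreFeature` and the `_z` column suffix are hypothetical. It learns column means and standard deviations from the training frame and reuses them on any later frame.

```python
import pandas as pd


class ZScoreFeature:
    """Hypothetical example: standardize numeric columns with training statistics."""

    def fit(self, train_df: pd.DataFrame):
        numeric = train_df.select_dtypes(include="number")
        self.mean_ = numeric.mean()
        # Replace a zero standard deviation with 1.0 so constant columns
        # do not cause a division by zero in transform.
        self.std_ = numeric.std().replace(0.0, 1.0)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        cols = list(self.mean_.index)
        scaled = (X[cols] - self.mean_) / self.std_
        # Return only the engineered columns, under new names.
        return scaled.add_suffix("_z")


feature_engineering_cls = ZScoreFeature

train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
valid = pd.DataFrame({"a": [2.0, 4.0], "b": [5.0, 6.0]})

fe = feature_engineering_cls()
fe.fit(train)
valid_feat = fe.transform(valid)
```

Because the statistics are stored on the instance, the validation frame is scaled with the training mean and standard deviation rather than its own.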
9 changes: 5 additions & 4 deletions rdagent/scenarios/kaggle/experiment/meta_tpl/train.py
@@ -44,10 +44,11 @@ def import_module_from_path(module_name, module_path):
 X_test_l = []

 for f in DIRNAME.glob("feature/feat*.py"):
-    m = import_module_from_path(f.stem, f)
-    X_train_f = m.feat_eng(X_train)
-    X_valid_f = m.feat_eng(X_valid)
-    X_test_f = m.feat_eng(X_test)
+    cls = import_module_from_path(f.stem, f).feature_engineering_cls()
+    cls.fit(X_train)
+    X_train_f = cls.transform(X_train)
+    X_valid_f = cls.transform(X_valid)
+    X_test_f = cls.transform(X_test)

     X_train_l.append(X_train_f)
     X_valid_l.append(X_valid_f)
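
The loop in this hunk relies on `import_module_from_path`, whose body lies outside the diff. Below is a rough sketch of how such a loader could work with `importlib`, assuming each `feat*.py` file exposes `feature_engineering_cls`. The helper body, the temporary directory, and the sample file `feat_demo.py` are illustrative assumptions rather than the repository's code.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd


def import_module_from_path(module_name, module_path):
    # Assumed implementation: load a module object from an explicit file path.
    spec = importlib.util.spec_from_file_location(module_name, str(module_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Hypothetical feature file following the fit/transform interface.
FEATURE_SOURCE = '''
import pandas as pd

class RowMean:
    def fit(self, train_df: pd.DataFrame):
        return self

    def transform(self, X: pd.DataFrame):
        return X.mean(axis=1).to_frame("mean_feature")

feature_engineering_cls = RowMean
'''

X_train = pd.DataFrame({"a": [1.0, 3.0], "b": [3.0, 5.0]})
X_valid = pd.DataFrame({"a": [10.0], "b": [20.0]})

X_train_l, X_valid_l = [], []
with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp) / "feature"
    feature_dir.mkdir()
    (feature_dir / "feat_demo.py").write_text(FEATURE_SOURCE)

    for f in sorted(feature_dir.glob("feat*.py")):
        cls = import_module_from_path(f.stem, f).feature_engineering_cls()
        cls.fit(X_train)  # fitted on the training split only
        X_train_l.append(cls.transform(X_train))
        X_valid_l.append(cls.transform(X_valid))
```

Calling `fit` once on `X_train` and then `transform` on every split is the point of the change: whatever the class learns comes from the training data alone.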
36 changes: 25 additions & 11 deletions rdagent/scenarios/kaggle/experiment/prompts.yaml
@@ -66,11 +66,13 @@ kg_background: |-
 kg_feature_interface: |-
   Your code should contain several parts:
   1. The import part: import the necessary libraries.
-  2. A feat_eng() function that handles feature engineering for each task.
-    The function should take the following arguments:
-      - X: The features as a pandas DataFrame.
-    The function should return the new features as a pandas DataFrame.
-  The input to `feat_eng` will be a pandas DataFrame, which should be processed to return a new DataFrame containing only the engineered features.
+  2. A class that contains the feature engineering logic.
+    The class should have the following methods:
+      - fit: This method should fit the feature engineering model to the training data.
+      - transform: This method should transform the input data and return it.
+    For some tasks like generating new features, the fit method may not be necessary. Please pass this function as a no-op.
+  3. A variable called feature_engineering_cls that contains the class name.
+  The input to 'fit' is the training data in pandas dataframe, and the input to 'transform' is the data to be transformed in pandas dataframe.
   The original columns should be excluded from the returned DataFrame.

   Exception handling will be managed externally, so avoid using try-except blocks in your code. The user will handle any exceptions that arise and provide feedback as needed.
@@ -83,12 +85,24 @@ kg_feature_interface: |-
   ```python
   import pandas as pd

-  def feat_eng(X: pd.DataFrame):
-      """
-      return the selected features
-      """
-      return X.mean(axis=1).to_frame("mean_feature") # Example feature engineering
-      return X.fillna(0) # Example feature processing
+  class FeatureEngineeringName:
+      def fit(self, train_df: pd.DataFrame):
+          """
+          Fit the feature engineering model to the training data.
+          For example, for one hot encoding, this would involve fitting the encoder to the training data.
+          For feature scaling, this would involve fitting the scaler to the training data.
+          """
+          return self
+
+      def transform(self, X: pd.DataFrame):
+          """
+          Transform the input data.
+          """
+          return X
+          return X.mean(axis=1).to_frame("mean_feature") # Example feature engineering
+          return X.fillna(0) # Example feature processing
+
+  feature_engineering_cls = FeatureEngineeringName
   ```

   To Note:
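
The docstring in the prompt names one-hot encoding as a case where `fit` does real work. Here is a hedged sketch of that case using only pandas. `OneHotFeature` and its `categories_` attribute are hypothetical names, and encoding unseen categories as all-zero rows is a choice made for this sketch. It returns only the engineered columns, as the interface text requires.

```python
import pandas as pd


class OneHotFeature:
    """Hypothetical example: one-hot encode the non-numeric columns seen during fit."""

    def fit(self, train_df: pd.DataFrame):
        # Remember the category set of each non-numeric column in the training data.
        self.categories_ = {
            col: sorted(train_df[col].dropna().unique())
            for col in train_df.select_dtypes(exclude="number").columns
        }
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        encoded = []
        for col, cats in self.categories_.items():
            # Fixing the categories keeps output columns identical across splits;
            # values unseen during fit become NaN and encode as all zeros.
            fixed = pd.Series(pd.Categorical(X[col], categories=cats), index=X.index)
            encoded.append(pd.get_dummies(fixed, prefix=col, dtype=int))
        if not encoded:
            return pd.DataFrame(index=X.index)
        return pd.concat(encoded, axis=1)


feature_engineering_cls = OneHotFeature

train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["blue", "green"], "size": [4, 5]})

fe = feature_engineering_cls().fit(train)
test_feat = fe.transform(test)
```

With the old single-function interface, encoding `test` on its own would have produced a `color_green` column and no `color_red`, so the train and test feature matrices would not line up.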