'Context-aware' question #663

deklanw · 2021-01-08T19:18:43Z

I'm trying to implement a non-neural model that uses item features. I see that the ContextRecommender class uses nn.Embedding to learn embeddings jointly with the other parameters of a neural net.

All I need is a combination of sklearn.preprocessing.OneHotEncoder and sklearn.preprocessing.MultiLabelBinarizer to get a simple sparse feature representation for every item, covering all the token and token_seq fields. The result should be a matrix of shape [d, num_items], where d is large enough to encode all token fields (one-hot encoded) and token_seq fields (multilabel-binarized).

But I'm not sure how to do this in RecBole. Should I subclass ContextRecommender or GeneralRecommender? How do I get the feature values for all items at once?

In particular, for ml-100k I need to one-hot encode release_year and multilabel binarize class and stack them, for every item, all-at-once, efficiently.
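
To make the target representation concrete, here is a minimal sketch on hypothetical toy data (three items, made-up years and genres, not the actual ml-100k values). It only illustrates the [d, num_items] layout described above and does not touch RecBole at all.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

# Hypothetical toy data: three items, each with one release year and a list of genres.
release_year = np.array([["1995"], ["1995"], ["1997"]])
genres = [["Action", "Comedy"], ["Drama"], ["Action"]]

# One-hot encode the single-valued field (sparse output is the default).
year_encoded = OneHotEncoder().fit_transform(release_year)

# Multilabel-binarize the multi-valued field into a sparse indicator matrix.
genre_encoded = MultiLabelBinarizer(sparse_output=True).fit_transform(genres)

# Stack the per-item blocks side by side, then transpose to [d, num_items].
item_feature_matrix = sp.hstack([year_encoded, genre_encoded]).T.tocsr().astype(np.float32)
```

Here d is 5 (two distinct years plus three distinct genres), and each column holds one item's features.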

deklanw · 2021-01-19T22:53:07Z

Figured out that dataset.get_item_feature is the key here. For token_seq fields it returns 0-padded, integer-encoded sequences, so after multilabel binarization you need to drop the first column, which corresponds to the padding value 0. Something like:

import numpy as np
import scipy.sparse as sp
from recbole.utils import FeatureType
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder


def encode_categorical_item_features(dataset, included_features):
    item_features = dataset.get_item_feature()

    mlb = MultiLabelBinarizer(sparse_output=True)
    # sparse output is the default; the `sparse` keyword was renamed in newer scikit-learn
    ohe = OneHotEncoder()

    encoded_feats = []

    for feat in included_features:
        t = dataset.field2type[feat]
        feat_frame = item_features[feat].numpy()

        if t == FeatureType.TOKEN:
            # OneHotEncoder expects a 2-D array with one column per feature
            encoded = ohe.fit_transform(feat_frame.reshape(-1, 1))
            encoded_feats.append(encoded)
        elif t == FeatureType.TOKEN_SEQ:
            # CSC makes the column slice below cheap
            encoded = mlb.fit_transform(feat_frame).tocsc()

            # drop the first column when it corresponds to the padding 0; real categories start at 1
            if mlb.classes_[0] == 0:
                encoded = encoded[:, 1:]
            encoded_feats.append(encoded)
        else:
            raise ValueError(
                f'ADD-EASE only supports token or token_seq types. [{feat}] is of type [{t}].')

    if not encoded_feats:
        raise ValueError('No valid token or token_seq features to include.')

    # stack features column-wise, then transpose to shape [d, num_items]
    return sp.hstack(encoded_feats).T.astype(np.float32)
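
The padding step can be checked in isolation. The sketch below uses a hypothetical hand-written array in place of what get_item_feature would return for a token_seq field, assuming the 0-padded layout described above. It shows that the sorted classes put the padding ID in column 0.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical 0-padded token_seq field for three items; real IDs start at 1.
padded = np.array([
    [1, 3, 0],
    [2, 0, 0],
    [1, 2, 3],
])

mlb = MultiLabelBinarizer(sparse_output=True)
encoded = mlb.fit_transform(padded)

# classes_ is sorted, so the padding ID 0 (if present in any row) is column 0.
has_padding_column = mlb.classes_[0] == 0

# Convert to CSC before slicing columns, then drop the padding column.
without_padding = encoded.tocsc()[:, 1:] if has_padding_column else encoded.tocsc()
```

If no row happened to contain padding, column 0 would be a real category, which is why the check on mlb.classes_ is worth keeping.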
