Evaluation of Implicit sequential model throws ValueError #161
There is nothing obviously wrong with the code you posted, thanks for doing the analysis.

Before you start the evaluation routine on your real data, can you compare the number of items in your train and test data? They should be the same.

As for your three questions below:
1. I don't think it'll help much.
2. Sure.
3. No.
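A quick way to run that comparison (a minimal sketch; train and test are the Interactions objects produced by the split, which expose num_users/num_items):

# both splits should report the same item (and user) universe
print(train.num_items, test.num_items)
assert train.num_items == test.num_items, "train/test item counts differ"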
On Fri, May 10, 2019, 07:30, Karl F wrote:
Hi!
I'm trying to train an implicit sequential model on click-stream data, but as soon as I try to evaluate it (e.g. using MRR, or precision and recall) after having trained the model, it throws an error:
mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
ValueError  Traceback (most recent call last)
<ipython-input-78-349343a26e9b> in <module>
----> 1 mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
~/.local/lib/python3.7/site-packages/spotlight/evaluation.py in mrr_score(model, test, train)
45 continue
46
---> 47 predictions = -model.predict(user_id)
48
49 if train is not None:
~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in predict(self, sequences, item_ids)
316
317 self._check_input(item_ids)
--> 318 self._check_input(sequences)
319
320 sequences = torch.from_numpy(sequences.astype(np.int64).reshape(1, -1))
~/.local/lib/python3.7/site-packages/spotlight/sequence/implicit.py in _check_input(self, item_ids)
188
189 if item_id_max >= self._num_items:
--> 190 raise ValueError('Maximum item id greater '
191 'than number of items in model.')
192
ValueError: Maximum item id greater than number of items in model.
Perhaps the error is obvious, but I can't pinpoint what I'm doing wrong, so below I'll describe, as concisely as possible, what I'm doing.
Comparison of experimental with synthetic data
I tried generating synthetic data and using that instead of my experimental data, and then it works. This led me to compare the data structure of the synthetic data with that of my experimental data:
Table 1: Synthetic data with N=100 unique users, M=1k unique items, and Q=10k interactions

user_id    item_id    timestamp
0          958        1
0          657        2
0          172        3
1          129        4
1          .          5
1          .          6
.          .          .
N          .          Q-2
N          .          Q-1
N          459        Q

Table 2: Experimental data with N=2.5M unique users, M=20k unique items, and Q=14.8M interactions

user_id    item_id    timestamp
725397     3992       0
2108444    10093      1
2108444    10093      2
1840496    15616      3
1792861    16551      4
1960701    16537      5
1140742    6791       6
2074022    4263       .
2368959    19258      .
2368959    17218      .
.          .          Q-1
.          .          Q
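For reference, synthetic data of this shape can be produced with spotlight's built-in generator; a minimal sketch (the exact parameters behind Table 1 aren't shown in the thread, so the values below are assumed from its N/M/Q):

from spotlight.datasets.synthetic import generate_sequential

# N=100 users, M=1000 items, Q=10000 interactions, as in Table 1
dataset = generate_sequential(num_users=100,
                              num_items=1000,
                              num_interactions=10000)
print(dataset.user_ids[:3], dataset.item_ids[:3], dataset.timestamps[:3])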
1. Both data sets have users indexed from [0..N-1], but unlike the synthetic data, my experimental data is not sorted on user_ids.

2. Both data sets have item_ids indexed from [1..M], yet it only throws the "ValueError: Maximum item id greater than number of items in model." for my experimental data.

3. I've re-shaped my timestamps to be just the data frame index after sorting on time, so this is also as in the synthetic data set. (Previously my timestamps were the event times in seconds since 1970, and some events were simultaneous, i.e. their order was arbitrary/degenerate.)
Code for processing the experimental data:
# pandas dataframe with unique string identifier for users ('session_id'), # and 'Article number' for item_id, and 'timestamp' for event
df = df.sort_values(by=['timestamp']).reset_index(drop=True)
# encode string identifiers for users and items to integer values:from sklearn import preprocessing
le_usr = preprocessing.LabelEncoder() # user encoder
le_itm = preprocessing.LabelEncoder() # item encoder
# shift item_ids with +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')
user_ids = (le_usr.fit_transform(df['session_id']) + 0).astype('int32')
from spotlight.interactions import Interactions
implicit_interactions = Interactions(user_ids, item_ids, timestamps=df.index.values)
from spotlight.cross_validation import user_based_train_test_split, random_train_test_split
train, test = random_train_test_split(implicit_interactions, 0.2)
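One way to make the item universe explicit at this step (a sketch building on the variables above; spotlight's Interactions accepts optional num_users/num_items arguments) is to pass the counts instead of letting them be inferred from the maximum ids:

# with item ids shifted to [1..M], the id universe the model must
# handle is M + 1 (ids 0..M, where 0 never occurs but is still
# counted; sequence models use 0 for padding)
num_items = int(item_ids.max()) + 1
num_users = int(user_ids.max()) + 1
implicit_interactions = Interactions(user_ids, item_ids,
                                     timestamps=df.index.values,
                                     num_users=num_users,
                                     num_items=num_items)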
Code for training the model:
from spotlight.sequence.implicit import ImplicitSequenceModel
sequential_interaction = train.to_sequence()
implicit_sequence_model = ImplicitSequenceModel(use_cuda=True, n_iter=10, loss='pointwise', representation='pooling')
implicit_sequence_model.fit(sequential_interaction, verbose=True)
import spotlight.evaluation
mrr = spotlight.evaluation.mrr_score(implicit_sequence_model, test, train)
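As an aside, spotlight also ships a sequence-aware evaluator (it comes up later in this thread); a sketch of swapping it in for the mrr_score call above:

from spotlight.evaluation import sequence_mrr_score

# scores item sequences rather than per-user (user, item) pairs;
# sequence_mrr_score expects a SequenceInteractions object
test_seq = test.to_sequence()
mrr_per_sequence = sequence_mrr_score(implicit_sequence_model, test_seq)
print(mrr_per_sequence.mean())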
Questions on input format:

Here are some questions I thought might pinpoint the error, i.e. places where my data might differ from the synthetic data set:

1. Is there any purpose, or even harm, in including users with only a single interaction?

2. Does the model allow a user to have multiple events with the same timestamp value?

3. As long as the (user_id, item_id, timestamp) triplets pair up, does row-ordering matter?
Thanks for the fast reply!

They're the same as far as I can tell; this is the output after I've run the code above:

In [6]: test
Out[6]: <Interactions dataset (2517443 users x 20861 items x 2968924 interactions)>

In [7]: train
Out[7]: <Interactions dataset (2517443 users x 20861 items x 11875692 interactions)>

I've also tried using either user_based_train_test_split or random_train_test_split.

But indeed the actual number of items is one less (20860, see below) than the interactions dataset thinks (20861, see above), for some reason:

In [8]: print(len(np.unique(item_ids)), min(item_ids), max(item_ids))
20860 1 20860

In [15]: len(item_ids) - (2968924 + 11875692)
Out[15]: 0

Is this somehow related to me doing the following?

# shift item_ids with +1 (but not user_ids):
item_ids = (le_itm.fit_transform(df['Article number']) + 1).astype('int32')

If I don't do this, I will have a zero-indexed item vector, and that will trigger an assert/error check, if I remember correctly.
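The off-by-one arithmetic here (a toy sketch, not the thread's actual data) comes from spotlight inferring the item count as max(item_ids) + 1 when num_items isn't given, so ids shifted to [1..M] imply an (M+1)-sized universe that includes the never-used id 0:

import numpy as np

item_ids = np.array([1, 2, 3], dtype=np.int32)  # ids shifted to [1..M], M=3
print(len(np.unique(item_ids)))  # 3 distinct items actually occur
print(item_ids.max() + 1)        # 4 -> inferred universe, ids 0..3 with 0 unused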
One explanation for why this would happen is if I didn't propagate the total number of items correctly across train/test splits and the sequential interaction conversion (the total number of items in the model must be the higher of the maximum item ids in train and test). However, I don't see anything wrong with the code. The invariant that needs to be upheld is that the maximum item id is strictly smaller than the number of items in the model.

I think unless you can provide a snippet that I can run that has the same problem, I won't be able to help further.

(By the way, a random train/test split doesn't make any sense for sequential models: use the user-based split.)
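That invariant can be expressed as a quick check (a sketch against the names used earlier in this thread; _num_items is the private attribute the traceback shows _check_input comparing against):

# every id seen in either split must fit inside the model's item universe
max_item_id = max(int(train.item_ids.max()), int(test.item_ids.max()))
assert max_item_id < implicit_sequence_model._num_items, \
    "maximum item id greater than number of items in model"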
Hi @maciejkula

After 6 months, I've now revisited this, and I believe I know exactly how to trigger this bug. (Quick recap of the above: evaluating my ImplicitSequenceModel worked with synthetic data, but not with my "real" data, where I got the "Maximum item id greater than number of items in model" error.) I provide code that transforms the synthetic data to my use case, which triggers the bug.

The following code will trigger the bug:

from spotlight.cross_validation import user_based_train_test_split
from spotlight.datasets.synthetic import generate_sequential
from spotlight.evaluation import sequence_mrr_score
from spotlight.evaluation import mrr_score
from spotlight.sequence.implicit import ImplicitSequenceModel

trigger_crash = True
if trigger_crash:
    n_items = 100
else:
    n_items = 1000

dataset = generate_sequential(num_users=1000,
                              num_items=n_items,
                              num_interactions=10000,
                              concentration_parameter=0.01,
                              order=3)

train, test = user_based_train_test_split(dataset)
train_seq = train.to_sequence()

model = ImplicitSequenceModel(n_iter=3,
                              representation='cnn',
                              loss='bpr')
model.fit(train_seq, verbose=True)

# this always works
test_seq = test.to_sequence()
mrr_seq = sequence_mrr_score(model, test_seq)
print(mrr_seq)

# using mrr_score (or precision_recall) with num_items < num_users
# triggers crash:
mrr = mrr_score(model, test)
print(mrr)

I.e. if num_items < num_users, mrr_score (or precision_recall) crashes, while sequence_mrr_score always works. Question is:
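Reading the two tracebacks in this thread together suggests why that inequality matters: mrr_score was written for factorization models and calls model.predict(user_id), but ImplicitSequenceModel.predict(sequences, item_ids=None) treats its first argument as a sequence of item ids, so any user id >= num_items fails _check_input. A sketch, runnable after the repro code above with trigger_crash = True:

# mrr_score internally does `predictions = -model.predict(user_id)`
# (per the traceback); a sequence model interprets that user id as
# an item id, so ids in [n_items, num_users) blow up
user_id = 150  # a valid user id here, but >= n_items == 100
try:
    model.predict(user_id)
except ValueError as e:
    print(e)  # Maximum item id greater than number of items in model.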