Getting inconsistent results with Bert #3

Open
scovail opened this issue Oct 18, 2020 · 4 comments

@scovail

scovail commented Oct 18, 2020

I've gotten inconsistent results trying to generate sentence vectors using Bert, which makes the cosine distance calculation incorrect. When I make two calls to vectorizer.bert, passing a list each time, and then calculate the cosine distance for the matching pairs, sentences that are identical are not identified as such (first line of each pair in the output). However, when I pass the two identical strings together as one list and then compare the vectors that were generated, the results are correct (second line of each pair). In the example below, the strings in pairs 0, 1, 3, 4, 5 and 6 are identical and should have a cosine distance of 0.

In [9]: import pandas as pd
...: from scipy.spatial.distance import cosine
...: from sent2vec.vectorizer import Vectorizer
...:
...: vectorizer = Vectorizer()
...:
...: df=pd.read_csv('temp2.csv', names=['String A', 'String B'])
...: newa = []
...: newb = []
...:
...: for index, row in df.iterrows():
...:     newa.append(row['String A'])
...:     newb.append(row['String B'])
...:
...: x=newa[50:60]
...: y=newb[50:60]
...:
...: vectorizer.bert(x)
...: xvex=vectorizer.vectors
...: vectorizer.bert(y)
...: yvex=vectorizer.vectors
...:
...: for i in range(0,10):
...:     flag = x[i] == y[i]
...:     print('\n({:d}) {:45s} {:45s} {:f}'.format(i, x[i], y[i], cosine(xvex[i], yvex[i])))
...:     vectorizer.bert([x[i], y[i]])
...:     print('({:d}) {:45s} {:45s} {:4f}'.format(i, x[i], y[i], cosine(vectorizer.vectors[0], vectorizer.vectors[1])))

(0) String A: 401k retirement String B: 401k retirement
(1) String A: 401k retirement accounts String B: 401k retirement accounts
(2) String A: 401k retirement funds String B: 401k retirement plans
(3) String A: 401k retirement investing String B: 401k retirement investing
(4) String A: 401k retirement plan String B: 401k retirement plan
(5) String A: 401k retirement plans String B: 401k retirement plans
(6) String A: 401k retirement savings String B: 401k retirement savings
(7) String A: 401k retirement savings plan String B: 401k plan retirement
(8) String A: 401k retirement savings plan String B: 401k retirement plans
(9) String A: 401k retirement services String B: 401k plan retirement

(0) 401k retirement 401k retirement 0.002897
(0) 401k retirement 401k retirement 0.000000

(1) 401k retirement accounts 401k retirement accounts 0.006706
(1) 401k retirement accounts 401k retirement accounts 0.000000

(2) 401k retirement funds 401k retirement plans 0.012481
(2) 401k retirement funds 401k retirement plans 0.013344

(3) 401k retirement investing 401k retirement investing 0.004481
(3) 401k retirement investing 401k retirement investing 0.000000

(4) 401k retirement plan 401k retirement plan 0.006325
(4) 401k retirement plan 401k retirement plan 0.000000

(5) 401k retirement plans 401k retirement plans 0.005616
(5) 401k retirement plans 401k retirement plans 0.000000

(6) 401k retirement savings 401k retirement savings 0.006093
(6) 401k retirement savings 401k retirement savings 0.000000

(7) 401k retirement savings plan 401k plan retirement 0.013586
(7) 401k retirement savings plan 401k plan retirement 0.023076

(8) 401k retirement savings plan 401k retirement plans 0.008529
(8) 401k retirement savings plan 401k retirement plans 0.017313

(9) 401k retirement services 401k plan retirement 0.017170
(9) 401k retirement services 401k plan retirement 0.014167

@pdrm83
Owner

pdrm83 commented Oct 19, 2020

Thanks for your message. It is hard to tell what is going on in your implementation, especially since you use a file as input, so I can't rerun your code. I have one guess, though. The input must be a list of strings. So, if you want to submit a single string, it has to be passed as a list containing one string: sentences = ["401k retirement accounts"]

There is no randomization in vectorizer.bert, so it must return identical results on each run. I added two more tests to this repo to examine this hypothesis. Hope this helps. I will set up automated tests for this project in the near future. After that, everyone can easily fork the repo and submit a PR if they want to make progress or find a bug. Thanks!
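One way to check the "no randomization" hypothesis is to encode the same list twice and compare the results. The sketch below is only an illustration: `is_repeatable`, `same_vectors` and the `encode` callable are hypothetical names, not part of sent2vec. With sent2vec, `encode` could presumably be a small wrapper that calls vectorizer.bert(sentences) and returns vectorizer.vectors, as the code in this thread does.

```python
import math


def same_vectors(a, b, tol=1e-6):
    """Return True if two lists of vectors match element by element within tol."""
    return len(a) == len(b) and all(
        len(u) == len(v)
        and all(math.isclose(p, q, abs_tol=tol) for p, q in zip(u, v))
        for u, v in zip(a, b)
    )


def is_repeatable(encode, sentences):
    """Encode the same list twice and report whether the vectors agree."""
    return same_vectors(encode(sentences), encode(sentences))


# Stand-in encoder so the sketch runs without downloading a model (assumption:
# a real encoder would return one vector per sentence, in input order).
def fake_encode(sentences):
    return [[float(len(s)), 1.0] for s in sentences]


print(is_repeatable(fake_encode, ["401k retirement", "401k retirement plan"]))
```

Note that this only covers repeated calls with the same list. It does not cover the case reported above, where the same sentence is encoded as part of two different lists.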

@scovail
Author

scovail commented Oct 20, 2020

Thanks for your response. Rather than use an input file I set up lists within the code and passed those directly to the vectorizer (see below) with the same results as the first test (identical terms not matching). It appears that the second call to vectorizer.bert with the list y is returning different vectors. Hope this makes sense. Thanks.

In [33]: from scipy.spatial.distance import cosine
...: from sent2vec.vectorizer import Vectorizer
...:
...: vectorizer = Vectorizer()
...:
...: x = ['401k retirement', '401k retirement accounts', '401k retirement funds', '401k retirement investing',
...: '401k retirement plan', '401k retirement plans', '401k retirement savings', '401k retirement savings plan',
...: '401k retirement savings plan', '401k retirement services']
...:
...: y = ['401k retirement', '401k retirement accounts', '401k retirement plans', '401k retirement investing',
...: '401k retirement plan', '401k retirement plans', '401k retirement savings', '401k plan retirement',
...: '401k retirement plans', '401k plan retirement']
...:
...: vectorizer.bert(x)
...: xvex=vectorizer.vectors
...: vectorizer.bert(y)
...: yvex=vectorizer.vectors
...:
...: for i in range(0,10):
...:     print('\n({:d}) {:45s} {:45s} {:4f}'.format(i, x[i], y[i], cosine(xvex[i], yvex[i])))
...:     vectorizer.bert([x[i], y[i]])
...:     print('({:d}) {:45s} {:45s} {:4f}'.format(i, x[i], y[i], cosine(vectorizer.vectors[0], vectorizer.vectors[1])))
...:

(0) 401k retirement 401k retirement 0.002897
(0) 401k retirement 401k retirement 0.000000

(1) 401k retirement accounts 401k retirement accounts 0.006706
(1) 401k retirement accounts 401k retirement accounts 0.000000

(2) 401k retirement funds 401k retirement plans 0.012481
(2) 401k retirement funds 401k retirement plans 0.013344

(3) 401k retirement investing 401k retirement investing 0.004481
(3) 401k retirement investing 401k retirement investing 0.000000

(4) 401k retirement plan 401k retirement plan 0.006325
(4) 401k retirement plan 401k retirement plan 0.000000

(5) 401k retirement plans 401k retirement plans 0.005616
(5) 401k retirement plans 401k retirement plans 0.000000

(6) 401k retirement savings 401k retirement savings 0.006093
(6) 401k retirement savings 401k retirement savings 0.000000

(7) 401k retirement savings plan 401k plan retirement 0.013586
(7) 401k retirement savings plan 401k plan retirement 0.023076

(8) 401k retirement savings plan 401k retirement plans 0.008529
(8) 401k retirement savings plan 401k retirement plans 0.017313

(9) 401k retirement services 401k plan retirement 0.017170
(9) 401k retirement services 401k plan retirement 0.014167

@pdrm83
Owner

pdrm83 commented Dec 1, 2020

Sorry for getting back to you so late. Please feel free to contribute and revise. I would appreciate it if you could help improve this open-source project.

@almarengo
Contributor

sent2vec pads its input sentences: given a list of sentences, the code builds a matrix of dimension (number_of_sentences x max_number_of_tokens), which is then passed as input to DistilBERT.

In your example, x and y have different values of max_number_of_tokens. In x, the longest sentence is "401k retirement savings plan" at x[7] and x[8] (4 tokens), whereas y has max_number_of_tokens = 3. Because x and y are padded to different lengths, the model receives inputs of different dimensions, which is why you are seeing those inconsistent results.
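To see why padding length alone can move a vector, here is a toy sketch. It is not sent2vec or DistilBERT code: the 2-d embedding table, the single attention step and all function names are made up for illustration, and it assumes the padded positions are not hidden from the model by an attention mask. Under that assumption, the same sentence padded to 3 versus 4 tokens yields slightly different vectors, while masking the padding makes them identical again.

```python
import math

# Toy embedding table (token id -> 2-d vector); id 0 plays the role of padding.
EMBED = {0: [0.5, -0.5], 1: [1.0, 0.0], 2: [0.0, 1.0]}


def pad(ids, length):
    """Right-pad a list of token ids with the padding id 0."""
    return ids + [0] * (length - len(ids))


def first_token_vector(ids, mask=None):
    """One toy self-attention step, returning the output for position 0."""
    vecs = [EMBED[t] for t in ids]
    query = vecs[0]
    scores = [sum(q * k for q, k in zip(query, v)) for v in vecs]
    if mask is not None:
        # Masked positions get a score of -inf, i.e. zero attention weight.
        scores = [s if m else -math.inf for s, m in zip(scores, mask)]
    weights = [math.exp(s) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, vecs)) / total for d in range(2)]


def cosine_distance(a, b):
    """1 - cosine similarity of two plain-Python vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm


sentence = [1, 2]                   # the same two-token sentence in both batches
in_short_batch = pad(sentence, 3)   # batch whose longest sentence has 3 tokens
in_long_batch = pad(sentence, 4)    # batch whose longest sentence has 4 tokens

# Without a mask, the extra padding token changes the attention weights.
unmasked_distance = cosine_distance(
    first_token_vector(in_short_batch), first_token_vector(in_long_batch)
)

# With a mask, padding contributes nothing and both vectors coincide.
masked_short = first_token_vector(in_short_batch, mask=[1, 1, 0])
masked_long = first_token_vector(in_long_batch, mask=[1, 1, 0, 0])

print(unmasked_distance)
print(masked_short == masked_long)
```

The small but non-zero unmasked distance is the same kind of effect as the values around 0.003 to 0.007 reported for identical strings earlier in this thread.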

The second line of each pair of results reflects the right approach for your problem: run the vectorizer on a pair of sentences and measure the cosine distance between the two vectors. That is when you get a distance of 0 for identical sentences.
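If calling the vectorizer once per pair is too slow, one possible alternative that follows from the padding explanation is to encode both lists in a single call, so that every sentence is padded to the same length, and then split the result. This is a sketch under assumptions: `encode_together` and the `encode` callable are hypothetical names, and the stand-in encoder below only imitates a padding-dependent model.

```python
def encode_together(encode, xs, ys):
    """Encode xs and ys in ONE batch so both share the same padded length."""
    xs, ys = list(xs), list(ys)
    vectors = encode(xs + ys)
    # The first len(xs) vectors belong to xs, the rest to ys.
    return vectors[:len(xs)], vectors[len(xs):]


# Stand-in encoder whose output depends on the longest sentence in the batch,
# imitating the padding effect (assumption, not real model behaviour).
def fake_encode(sentences):
    longest = max(len(s.split()) for s in sentences)
    return [[float(len(s.split())), float(longest)] for s in sentences]


x = ["401k retirement", "401k retirement savings plan"]
y = ["401k retirement", "401k retirement plans"]

xvex, yvex = encode_together(fake_encode, x, y)
print(xvex[0] == yvex[0])                      # same batch, same padding
print(fake_encode(x)[0] == fake_encode(y)[0])  # separate batches differ

# With sent2vec, encode might be wired up like this (untested sketch based on
# the calls used in this thread):
#
#     def encode(sentences):
#         vectorizer = Vectorizer()
#         vectorizer.bert(sentences)
#         return vectorizer.vectors
```

The trade-off is memory: one large batch is padded to the longest sentence across both lists.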

If you ran your code on x and y with the same max_number_of_tokens, you would get the correct results. Below is a lazy example in which x and y are exactly the same:

from scipy.spatial.distance import cosine
from sent2vec.vectorizer import Vectorizer

x = ['401k retirement', '401k retirement accounts', '401k retirement funds', '401k retirement investing', '401k retirement plan', '401k retirement plans', '401k retirement savings', '401k retirement savings plan', '401k retirement savings plan', '401k retirement services']

y = ['401k retirement', '401k retirement accounts', '401k retirement funds', '401k retirement investing', '401k retirement plan', '401k retirement plans', '401k retirement savings', '401k retirement savings plan', '401k retirement savings plan', '401k retirement services']

vectorizer = Vectorizer()
vectorizer.bert(x)
xvex=vectorizer.vectors
vectorizer.bert(y)
yvex=vectorizer.vectors

for i in range(0,10):
    print('\n({:d}) {:45s} {:45s} {:4f}'.format(i, x[i], y[i], cosine(xvex[i], yvex[i])))
    vectorizer.bert([x[i], y[i]])
    print('({:d}) {:45s} {:45s} {:4f}'.format(i, x[i], y[i], cosine(vectorizer.vectors[0], vectorizer.vectors[1])))

The results are below:

(0) 401k retirement                               401k retirement                               0.000000
(0) 401k retirement                               401k retirement                               0.000000

(1) 401k retirement accounts                      401k retirement accounts                      0.000000
(1) 401k retirement accounts                      401k retirement accounts                      0.000000

(2) 401k retirement funds                         401k retirement funds                         0.000000
(2) 401k retirement funds                         401k retirement funds                         0.000000

(3) 401k retirement investing                     401k retirement investing                     0.000000
(3) 401k retirement investing                     401k retirement investing                     0.000000

(4) 401k retirement plan                          401k retirement plan                          0.000000
(4) 401k retirement plan                          401k retirement plan                          0.000000

(5) 401k retirement plans                         401k retirement plans                         0.000000
(5) 401k retirement plans                         401k retirement plans                         0.000000

(6) 401k retirement savings                       401k retirement savings                       0.000000
(6) 401k retirement savings                       401k retirement savings                       0.000000

(7) 401k retirement savings plan                  401k retirement savings plan                  0.000000
(7) 401k retirement savings plan                  401k retirement savings plan                  0.000000

(8) 401k retirement savings plan                  401k retirement savings plan                  0.000000
(8) 401k retirement savings plan                  401k retirement savings plan                  0.000000

(9) 401k retirement services                      401k retirement services                      0.000000
(9) 401k retirement services                      401k retirement services                      0.000000

I hope this helps.
