Adding new tokens to the tokenizer without disturbing the base model's embedding weights for existing tokens #3116

Open
riyajatar37003 opened this issue Dec 4, 2024 · 3 comments

@riyajatar37003

Hi,
Let's assume I have a few completely new tokens that the tokenizer and model have never seen. I want to add these tokens to the tokenizer and update only the embedding weights of the newly added tokens, without touching the weights of any other tokens. How can I achieve this with the sentence-transformers .train method? Any guidance or sources would be appreciated.

thanks
@tomaarsen

@tomaarsen
Collaborator

Hello!

I believe @kacperlukawski wrote on this exact topic in his blogpost about Word Injection: https://www.kacperlukawski.com/posts/word-injection/
I think it should cover exactly what you're looking for.

  • Tom Aarsen
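
For readers who want a rough picture of the question itself, here is a minimal sketch in plain PyTorch. It is not taken from the linked blog post and is not a sentence-transformers API. It assumes the embedding matrix has already been enlarged (with Hugging Face models this is usually done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`), and it uses a gradient hook to zero the gradient of the original rows. The helper names `extend_embedding` and `freeze_old_rows` are hypothetical, and the tiny 10-token vocabulary is made up for illustration.

```python
import torch
from torch import nn


def extend_embedding(old: nn.Embedding, num_new: int) -> nn.Embedding:
    """Hypothetical helper: return a larger embedding whose first rows copy `old`.

    The new rows are initialised to the mean of the old rows (one common
    heuristic, chosen here as an assumption).
    """
    old_vocab, dim = old.weight.shape
    new = nn.Embedding(old_vocab + num_new, dim)
    with torch.no_grad():
        new.weight[:old_vocab] = old.weight
        new.weight[old_vocab:] = old.weight.mean(dim=0, keepdim=True)
    return new


def freeze_old_rows(emb: nn.Embedding, old_vocab: int):
    """Hypothetical helper: zero the gradient of the first `old_vocab` rows."""
    mask = torch.zeros(emb.num_embeddings, 1)
    mask[old_vocab:] = 1.0  # only rows of newly added tokens keep their gradient
    return emb.weight.register_hook(lambda grad: grad * mask)


torch.manual_seed(0)
base = nn.Embedding(10, 4)               # stand-in for the pretrained embeddings
emb = extend_embedding(base, num_new=2)  # two new tokens get ids 10 and 11
freeze_old_rows(emb, old_vocab=10)

before = emb.weight.detach().clone()
# Plain SGD without weight decay: weight decay would still shrink the old rows
# even though their gradient is zero.
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

token_ids = torch.tensor([1, 3, 10, 11])  # a mix of old and new token ids
loss = emb(token_ids).pow(2).sum()        # dummy loss, for illustration only
loss.backward()
optimizer.step()
after = emb.weight.detach()

print("old rows unchanged:", torch.equal(after[:10], before[:10]))
print("new rows changed:", not torch.equal(after[10:], before[10:]))
```

The same hook could in principle be registered on the input embedding weight of a real model before training, with every other parameter frozen via `requires_grad_(False)`. Note that an optimizer with non-zero weight decay would modify the old rows regardless of the hook, so either disable weight decay for the embedding or copy the original rows back after each step.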

@riyajatar37003
Author

The BGE reranker is not directly supported in Sentence Transformers.
I tried loading it with CrossEncoder, but the scores are different compared to those from the Hugging Face Auto classes.
@tomaarsen
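
One possible explanation for such a mismatch, offered here as an assumption and not confirmed anywhere in this thread, is that the two code paths return scores on different scales: the Hugging Face `AutoModelForSequenceClassification` forward pass returns raw logits, while `CrossEncoder` applies a sigmoid by default for single-label models. The sketch below uses made-up logit values to show that the two scales are related by a monotonic transform, so the ranking stays the same even though the numbers differ.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Inverse of the sigmoid: recover the raw logit from a (0, 1) score."""
    return math.log(p / (1.0 - p))


# Made-up raw logits, standing in for the output of a sequence classification head.
raw_logits = [4.2, -1.3, 0.0]

# Scores after a sigmoid, the scale a wrapper with a sigmoid activation would return.
probs = [sigmoid(x) for x in raw_logits]

# Converting back shows both outputs carry the same information.
recovered = [logit(p) for p in probs]

# The ranking of the documents is identical on both scales.
rank_by_logit = sorted(range(len(raw_logits)), key=lambda i: raw_logits[i], reverse=True)
rank_by_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

print(probs)
print(rank_by_logit == rank_by_prob)
```

If the scores still disagree after accounting for the activation, differences in tokenization settings such as maximum length or truncation would be the next thing to compare.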

@tomaarsen
Collaborator

See #3117 (comment)
