Adding new tokens to the tokenizer without disturbing the base model's embedding weight matrix for existing tokens #3116

riyajatar37003 · 2024-12-04T13:02:16Z

Hi,
Let's assume I have a few completely new tokens that the tokenizer/model has never seen. I want to add these tokens to the tokenizer and update only the weights of the newly added tokens, without touching the embedding weights of the other tokens. How can I achieve this? Is there any guidance or source for doing this with the sentence-transformers .train method?

thanks
@tomaarsen

tomaarsen · 2024-12-04T13:06:42Z

Hello!

I believe @kacperlukawski wrote on this exact topic in his blog post about Word Injection: https://www.kacperlukawski.com/posts/word-injection/
I think it should cover exactly what you're looking for.
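
For readers who want a feel for the general idea, here is a minimal sketch in plain PyTorch. It is not necessarily the approach taken in the linked post. It assumes the word embeddings are an ordinary `nn.Embedding`, that new token ids are appended after the existing ones, and that masking gradients is an acceptable way to keep the old rows fixed. The helper names `extend_embedding` and `freeze_rows` are hypothetical, and a tiny random embedding stands in for a pretrained model. With a Hugging Face model, the resizing step is usually done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import torch
from torch import nn


def extend_embedding(old: nn.Embedding, num_new: int) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old` (hypothetical helper)."""
    old_rows, dim = old.weight.shape
    new = nn.Embedding(old_rows + num_new, dim)
    with torch.no_grad():
        new.weight[:old_rows] = old.weight
        # Initialise the new rows at the mean of the existing rows (one common heuristic).
        new.weight[old_rows:] = old.weight.mean(dim=0, keepdim=True)
    return new


def freeze_rows(embedding: nn.Embedding, num_frozen: int):
    """Zero the gradient of the first `num_frozen` rows so only later rows get updated."""
    mask = torch.zeros(embedding.num_embeddings, 1)
    mask[num_frozen:] = 1.0
    return embedding.weight.register_hook(lambda grad: grad * mask)


torch.manual_seed(0)
base = nn.Embedding(10, 4)  # stands in for the pretrained word embeddings
original = base.weight.detach().clone()

extended = extend_embedding(base, num_new=2)
freeze_rows(extended, num_frozen=10)

# weight_decay=0.0 matters: decay would change the frozen rows even with a zero gradient.
optimizer = torch.optim.SGD(extended.parameters(), lr=0.1, weight_decay=0.0)

token_ids = torch.tensor([[1, 10, 3, 11]])  # mixes old ids (1, 3) with new ids (10, 11)
loss = extended(token_ids).pow(2).sum()     # dummy loss, only for illustration
loss.backward()
optimizer.step()

old_rows_unchanged = torch.equal(extended.weight[:10].detach(), original)
new_rows_changed = not torch.equal(
    extended.weight[10:].detach(), original.mean(dim=0).expand(2, -1)
)
```

The gradient mask only protects the embedding matrix. If the rest of the model must also stay fixed, its parameters would need `requires_grad=False` as well, and any optimizer-side weight decay on the embedding would have to be disabled.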

Tom Aarsen

riyajatar37003 · 2024-12-04T14:23:37Z

The bge reranker is not directly supported in sentence-transformers.
I tried loading it with CrossEncoder, but the scores are different compared to the Hugging Face Auto class.
@tomaarsen

tomaarsen · 2024-12-05T10:53:49Z

See #3117 (comment)

riyajatar37003 closed this as completed Dec 4, 2024

riyajatar37003 reopened this Dec 4, 2024
