
No proper encodings for covid-related terms #21

Open
OleksiiRomanko opened this issue Dec 29, 2021 · 1 comment
OleksiiRomanko commented Dec 29, 2021

I have just checked the encodings that the AutoTokenizer produces. It seems that for the words "wuhan", "ncov", "coronavirus", "covid", and "sars-cov-2" it produces more than one token, while it produces a single token for 'conventional' words like "apple".
E.g.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2", do_lower_case=True)
tokenizer(['wuhan', "covid", "coronavirus", "sars-cov-2", "apple", "city"], truncation=True, padding=True, max_length=512)

Result:

{'input_ids': [[101, 8814, 4819, 102, 0, 0, 0, 0, 0], [101, 2522, 17258, 102, 0, 0, 0, 0, 0], [101, 21887, 23350, 102, 0, 0, 0, 0, 0], [101, 18906, 2015, 1011, 2522, 2615, 1011, 1016, 102], [101, 6207, 102, 0, 0, 0, 0, 0, 0], [101, 2103, 102, 0, 0, 0, 0, 0, 0]]}

As you can see, there are two encoded values each for 'wuhan', "covid" and "coronavirus" ([8814, 4819], [2522, 17258], [21887, 23350] respectively), while there is a single id for "apple" and "city" (as it should be: [6207] and [2103]).
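
For reference, a minimal way to inspect the underlying WordPiece splits, using the same tokenizer loaded above (tokenize and convert_tokens_to_ids are standard Hugging Face tokenizer methods):

# Show how each word is split into WordPiece sub-tokens and their ids
for word in ['wuhan', 'covid', 'coronavirus', 'sars-cov-2', 'apple', 'city']:
    pieces = tokenizer.tokenize(word)
    print(word, pieces, tokenizer.convert_tokens_to_ids(pieces))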

I have also checked the tokenizer dictionary (vocab.txt) from https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2/tree/main,
and there are no such terms as "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2", even though these terms are mentioned in the readme - https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2.
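
A quick way to double-check this programmatically, assuming the tokenizer loaded above (get_vocab is the standard method returning the token-to-id mapping):

# Check which of these terms exist as whole tokens in the vocabulary
vocab = tokenizer.get_vocab()
for word in ['wuhan', 'ncov', 'coronavirus', 'covid', 'sars-cov-2', 'apple', 'city']:
    print(word, word in vocab)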

I wonder why the model does not recognize covid-related terms, and how can I make the model 'understand' them? It seems that the poor performance of the model in my specific case (web texts that mention covid only once) may be related to this issue.

peregilk (Collaborator) commented Dec 29, 2021

The model is a continued pre-training of the original BERT model, so it uses the vocabulary of that model (which was created before covid).

It is, however, pretrained on huge amounts of covid-related text, and the BERT architecture is perfectly capable of learning these composite (sub-word) representations. It should have no problem understanding these terms. In my experience, the main downside is that the tokenized text gets a bit longer. The value of building on the pre-trained BERT weights is usually more important.

I would be more worried about words that appeared after the pretraining was done. The model would, for instance, have no knowledge of "Delta" and "Omicron"; these need to be learned during finetuning.
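
For completeness, if you did want to add such words as whole tokens yourself, here is a minimal sketch of the usual approach with the Hugging Face API (add_tokens and resize_token_embeddings are standard methods; the token list and the masked-LM model class are only illustrative assumptions):

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2")
model = AutoModelForMaskedLM.from_pretrained("digitalepidemiologylab/covid-twitter-bert-v2")

# Illustrative: register new whole-word tokens, then resize the embedding
# matrix so the new vectors can be learned during finetuning.
tokenizer.add_tokens(["omicron", "delta"])
model.resize_token_embeddings(len(tokenizer))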
