Chinese characters #19

jiangyklala · 2023-04-24T11:44:37Z

Hello, this project seems to count tokens for Chinese characters slightly differently from the Tokenizer on the official website.
The model I use is gpt-3.5-turbo, and the following two pictures show the comparison:

tox-p · 2023-04-24T20:40:55Z

Do you mean this tokenizer: https://platform.openai.com/tokenizer ?

The tokenizer linked above uses the r50k_base encoding, while gpt-3.5-turbo uses the cl100k_base encoding, so the two can report different token counts for the same text.

For your comparison, try this one, which is mentioned in the tiktoken FAQ: https://tiktokenizer.vercel.app/ (make sure to use the textbox input and not the message input when comparing the encoding of a raw string, as in your screenshot)
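
The difference can also be checked locally. Below is a minimal sketch, assuming the Python `tiktoken` package is installed and can download its encoding files on first use. The helper names `count_tokens` and `compare` and the labels in `ENCODINGS` are made up for illustration; only the two encoding names come from this thread.

```python
from typing import Dict

# Encoding names as stated in this thread; the dictionary labels are illustrative.
ENCODINGS: Dict[str, str] = {
    "platform.openai.com/tokenizer": "r50k_base",
    "gpt-3.5-turbo": "cl100k_base",
}


def count_tokens(text: str, encoding_name: str) -> int:
    """Return how many tokens `text` occupies under the named encoding."""
    # Imported lazily: requires `pip install tiktoken` (an assumption of this sketch).
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


def compare(text: str) -> Dict[str, int]:
    """Count the tokens of `text` once per encoding listed in ENCODINGS."""
    return {label: count_tokens(text, name) for label, name in ENCODINGS.items()}
```

Calling `compare("你好，世界")` returns one count per encoding. If the two numbers differ, the mismatch comes from the encoding choice and not from a counting bug.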

jiangyklala · 2023-04-25T02:18:49Z

Solved!
Thanks for answering and for contributing such a good library!

jiangyklala closed this as completed Apr 25, 2023

tox-p mentioned this issue May 18, 2023

Is the Korean token calculation method different from GPT tokenizer? #27

Closed

tox-p mentioned this issue Jun 9, 2023

Count is different #30

Closed
