
Is the Korean token calculation method different from the GPT tokenizer? #27

Closed
DanielDonghaKim opened this issue May 18, 2023 · 1 comment


@DanielDonghaKim

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.*;
import java.util.List;

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);
// Sample text: "Korean token count test. The Korean token count in JTokkit is different."
List<Integer> encoded = enc.encode("한국어 토큰 수 테스트. JTokkit에서의 한국어 토큰 개수가 달라요.");

GPT Tokenizer
I calculated the number of tokens on the GPT Tokenizer page above, but it differs from the number calculated by JTokkit. I want to know why.

[Screenshots comparing the token counts reported by the GPT Tokenizer page and by JTokkit]

@tox-p (Contributor) commented May 18, 2023

Those are different encodings. With JTokkit you used cl100k_base as the encoding (the encoding for gpt-3.5-turbo), but the website you used for reference uses r50k_base.

See issue #19, which covers a similar misunderstanding, for more details 🙂
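For anyone hitting the same confusion, here is a minimal sketch (class and variable names are illustrative, not part of the original thread) that encodes the same Korean text with both encodings via JTokkit and prints the two counts side by side. The counts differ because the two encodings have different vocabularies:

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class EncodingComparison {
    public static void main(String[] args) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // cl100k_base is what gpt-3.5-turbo uses (and what JTokkit selects for ModelType.GPT_3_5_TURBO)
        Encoding cl100k = registry.getEncoding(EncodingType.CL100K_BASE);
        // r50k_base is the older encoding used by the reference website
        Encoding r50k = registry.getEncoding(EncodingType.R50K_BASE);

        String text = "한국어 토큰 수 테스트. JTokkit에서의 한국어 토큰 개수가 달라요.";

        // The same text tokenizes to a different number of tokens under each encoding
        System.out.println("cl100k_base: " + cl100k.countTokens(text));
        System.out.println("r50k_base:   " + r50k.countTokens(text));
    }
}

Running this makes the mismatch visible directly: to reproduce the website's numbers, request r50k_base; to match gpt-3.5-turbo's actual billing, keep cl100k_base.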

tox-p closed this as completed May 18, 2023