
Is the Korean token calculation method different from the GPT tokenizer? #27

Closed
DanielDonghaKim opened this issue May 18, 2023 · 1 comment


@DanielDonghaKim

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.*;
import java.util.List;

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);
// Sample text: "Korean token count test. The Korean token count in JTokkit is different."
List<Integer> encoded = enc.encode("한국어 토큰 수 테스트. JTokkit에서의 한국어 토큰 개수가 달라요.");

GPT Tokenizer
I calculated the number of tokens on the GPT Tokenizer page above, but it differs from the number calculated by JTokkit. I want to know why.

[Screenshots comparing the token counts reported by the GPT Tokenizer page and by JTokkit]

@tox-p (Contributor) commented May 18, 2023

Those are different encodings. With JTokkit you used cl100k_base as the encoding (the encoding for gpt-3.5-turbo), but the website you used for reference uses r50k_base.

See issue #19, which covers a similar misunderstanding, for more details 🙂
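For anyone hitting the same confusion, here is a minimal sketch (class and variable names are illustrative, not part of the original thread) that encodes the same Korean text with both encodings via JTokkit and prints the two counts side by side. The counts differ because the two encodings have different vocabularies:

import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

public class EncodingComparison {
    public static void main(String[] args) {
        EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();

        // cl100k_base is what gpt-3.5-turbo uses (and what JTokkit selects for ModelType.GPT_3_5_TURBO)
        Encoding cl100k = registry.getEncoding(EncodingType.CL100K_BASE);
        // r50k_base is the older encoding used by the reference website
        Encoding r50k = registry.getEncoding(EncodingType.R50K_BASE);

        String text = "한국어 토큰 수 테스트. JTokkit에서의 한국어 토큰 개수가 달라요.";

        // The same text tokenizes to a different number of tokens under each encoding
        System.out.println("cl100k_base: " + cl100k.countTokens(text));
        System.out.println("r50k_base:   " + r50k.countTokens(text));
    }
}

Running this makes the mismatch visible directly: to reproduce the website's numbers, request r50k_base; to match gpt-3.5-turbo's actual billing, keep cl100k_base.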

tox-p closed this as completed May 18, 2023