EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
Encoding enc = registry.getEncodingForModel(ModelType.GPT_3_5_TURBO);
List<Integer> encoded = enc.encode("한국어 토큰 수 테스트. JTokkit에서의 한국어 토큰 개수가 달라요.");
GPT Tokenizer
I counted the tokens for this text on the page above, but the result differs from the count JTokkit produces. I want to know why.
Those are different encodings. With JTokkit you used cl100k_base (the encoding for gpt-3.5-turbo), but the website you used for reference uses r50k_base.
See issue #19, which had a similar misunderstanding, for more details 🙂
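To illustrate why the same text can yield different token counts under different encodings, here is a minimal, self-contained sketch. It is a toy, not JTokkit's implementation and not real BPE: it uses greedy longest-match over two hypothetical vocabularies (`smallVocab` and `largeVocab` are made-up names and contents, not r50k_base or cl100k_base). The point is only that a vocabulary with longer entries covers the same string with fewer tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ToyTokenizer {
    // Greedy longest-match: at each position take the longest vocabulary
    // entry that matches; fall back to a single character if none does.
    static List<String> tokenize(String text, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // single-character fallback
            for (int j = text.length(); j > i + 1; j--) {
                if (vocab.contains(text.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical vocabularies, for illustration only.
        Set<String> smallVocab = Set.of("to", "ken");
        Set<String> largeVocab = Set.of("to", "ken", "token", "izer");
        String text = "tokenizer";

        // Same input, different vocabulary, different token count.
        System.out.println(tokenize(text, smallVocab).size()); // 6: to, ken, i, z, e, r
        System.out.println(tokenize(text, largeVocab).size()); // 2: token, izer
    }
}
```

Real encodings differ in the same way: each has its own vocabulary and merge rules, so counts are only comparable when both tools use the same encoding.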