Add regex loading from tokenizer.json and code refinement #863

wenbingl · 2024-12-19T18:42:36Z

Use the regex pattern from tokenizer.json if it exists.
Add a tokenizer test case (e.g. AMD-OLMo) through pp_api.

sayanshaw24 · 2024-12-20T18:40:41Z

operators/tokenizer/bpe_tokenizer_model.hpp

+    }
+
+    if (iter_type->get<std::string>() != "Sequence") {
+      return {kOrtxErrorNotImplemented, "Unsupported pretokenizer type!"};


Something in the test seems off, or maybe my understanding is wrong. In test_pp_api.py you use the "amd/AMD-OLMo-1B-SFT-DPO" model, and in its tokenizer.json the pre_tokenizer type is "ByteLevel", not "Sequence". (Note: you support "ByteLevel" for "pretokenizers" below, which I guess you expect to find nested inside "pre_tokenizer", but not for "pre_tokenizer" itself.) So it should fail here, right? Yet it is passing in CI.

So maybe we should add "ByteLevel" to the supported types for "pre_tokenizer" here as well, but first identify why the test is not currently failing: perhaps the type is not being extracted correctly, or the code is conflating "pretokenizers" and "pre_tokenizer".
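
To make the suggestion concrete, here is a rough sketch of a type check that accepts both "Sequence" and "ByteLevel" at the top level. The names `PreTokenizerKind` and `ClassifyPreTokenizer` are hypothetical and are not taken from bpe_tokenizer_model.hpp; the sketch only illustrates the branching, not how the real loader should parse the JSON.

```cpp
#include <string_view>

// Hypothetical names for illustration; not the identifiers used in the PR.
enum class PreTokenizerKind { kSequence, kByteLevel, kUnsupported };

// Maps the "type" string of the top-level "pre_tokenizer" object to a kind.
// The comparison is exact (case-sensitive), so "bytelevel" is not accepted.
inline PreTokenizerKind ClassifyPreTokenizer(std::string_view type) {
  if (type == "Sequence") return PreTokenizerKind::kSequence;
  if (type == "ByteLevel") return PreTokenizerKind::kByteLevel;
  return PreTokenizerKind::kUnsupported;
}
```

With something like this, only `kUnsupported` would map to the `kOrtxErrorNotImplemented` return, and a top-level "ByteLevel" entry could be routed to the same handling as a "ByteLevel" item inside "pretokenizers".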

sayanshaw24 · 2024-12-20T18:41:18Z

operators/tokenizer/bpe_tokenizer_model.hpp

+          continue;
+        }
+
+        auto regex_str = iter_pattern->find("Regex");


I am seeing some examples of lowercase "regex" in tokenizer.json as well. Perhaps we should make the key lookup case-insensitive here?
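
A minimal sketch of what a case-insensitive lookup could look like, using only the standard library. A `std::map<std::string, std::string>` stands in for the parsed "pattern" object, and the helpers `EqualsIgnoreCase` and `FindIgnoreCase` are hypothetical names, not part of the existing code or of the JSON library.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <string_view>

// Compares two ASCII strings without regard to case.
inline bool EqualsIgnoreCase(std::string_view a, std::string_view b) {
  return a.size() == b.size() &&
         std::equal(a.begin(), a.end(), b.begin(),
                    [](unsigned char x, unsigned char y) {
                      return std::tolower(x) == std::tolower(y);
                    });
}

// Returns a pointer to the value whose key matches `key` ignoring case,
// or nullptr if there is no such key. The map is an assumption standing in
// for the JSON "pattern" object; a linear scan is fine for a few keys.
inline const std::string* FindIgnoreCase(
    const std::map<std::string, std::string>& obj, std::string_view key) {
  for (const auto& [k, v] : obj) {
    if (EqualsIgnoreCase(k, key)) return &v;
  }
  return nullptr;
}
```

The same loop could be applied to the items of the JSON object instead of a map, so that both "Regex" and "regex" resolve to the pattern string.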

sayanshaw24 · 2024-12-20T18:53:56Z

Can we add an example that tests the regex string option as well? HF "amd/AMD-OLMo-1B-SFT-DPO" has a "pre_tokenizer" value with type "ByteLevel" and "use_regex": true, but maybe we can find a model with an explicit "regex" string that we can use to demonstrate the custom regex functionality.

wenbingl and others added 2 commits December 19, 2024 18:40

Add regex loading from tokenizer.json and code refinement

5eda73c

Merge branch 'main' into toki

8dbe9c8

sayanshaw24 reviewed Dec 20, 2024
