
Additional support for T5 Tokenizer - SentencepieceTokenizer #828

Open
r4ghu opened this issue Oct 18, 2024 · 7 comments

Comments

@r4ghu

r4ghu commented Oct 18, 2024

Hi team,
I would like to request support for some additional features in T5Tokenizer / SentencepieceTokenizer. I was able to convert the HuggingFace T5 tokenizer to ONNX format using the following code -

import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort

# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "Translate this English sentence to French: Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors="np")

# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, ort_decoder = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True}, post_kwargs={})

# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())

with open("tokenizer_decoder.onnx", "wb") as f:
    f.write(ort_decoder.SerializeToString())

# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)
decoder_session = ort.InferenceSession("tokenizer_decoder.onnx", sess_options=session_options)

# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
np.testing.assert_array_equal(input_ids[0], actual_ids)
print("Actual IDs:", actual_ids)

# Decode the tokenized input
output = decoder_session.run(None, {'ids': actual_ids})[0]
print("Decoded sentence:", output)
assert output[0] == text

So far, the tokenizer works great when I pass normal sentences. But when I add sentinel tokens to my input sentence, the tokenizer's behavior differs from the HuggingFace tokenizer's. Can you please add support for sentinel tokens in SentencepieceTokenizer? If it's possible to get this working with a workaround on top of the existing logic, I would like to know, as it would simplify the preprocessing I currently do to handle sentinel tokens.

@wenbingl
Member

Can you provide an example of the inconsistent result versus the HF tokenizer here for investigation? And when you say 'sentinel tokens', are they something like '<extra_id_0>...'? Some of those tokens, I think, are handled by HF Python code, while the others are processed by the SentencePiece library. May I ask how these tokens are used in your app?

@r4ghu
Author

r4ghu commented Oct 21, 2024

Hi @wenbingl ,
Thanks for the quick response on this!

And when you say 'sentinel tokens', are they something like '<extra_id_0>...'? Some of those tokens, I think, are handled by HF Python code, while the others are processed by the SentencePiece library. May I ask how these tokens are used in your app?

Yes, by sentinel tokens I mean the tokens <extra_id_0>...<extra_id_99>. I use them to mask some tokens in the input string during inference.

Can you provide an example of inconsistence result from HF tokenizer here for investigation?

Here is the Python script I used to convert and test the tokenizer -

import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort

# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "<extra_id_0> am looking foward to hearing from you."
input_ids = tokenizer.encode(text, return_tensors="np")

# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, _ = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True})

# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())

# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)

# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
print("HuggingFace Tokenizer IDs:", input_ids[0])
print("OnnxRuntime Tokenizer IDs:", actual_ids)

Running the above script, I got the following result -

HuggingFace Tokenizer IDs: [32099   183   479  5575  2239    12  3507    45    25     5     1]
OnnxRuntime Tokenizer IDs: [    3     2 25666   834    23    26   834   632  3155   183   479  5575
  2239    12  3507    45    25     5     1]
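The mismatch above is consistent with the ONNX tokenizer splitting "<extra_id_0>" into ordinary SentencePiece pieces instead of treating it as a single added token. Until that is supported natively, one possible pre-processing workaround (a sketch, not an onnxruntime-extensions API; `split_sentinels` and `VOCAB_SIZE` are names I made up, and the id formula `32099 - N` assumes the t5-small vocabulary where <extra_id_0> is the last id) is to cut the text on sentinel tokens, tokenize only the plain-text segments with the ONNX model, and splice the known sentinel ids back in:

```python
import re

# Hypothetical helper, assuming t5-small's layout: vocab size 32100,
# with <extra_id_0> at id 32099 and <extra_id_N> at 32099 - N.
SENTINEL = re.compile(r"<extra_id_(\d+)>")
VOCAB_SIZE = 32100

def split_sentinels(text):
    """Split text into ("text", segment) and ("id", token_id) parts.

    The "text" segments can be fed to the ONNX tokenizer one by one;
    the "id" parts are spliced into the output directly.
    """
    parts, pos = [], 0
    for m in SENTINEL.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()]))
        parts.append(("id", VOCAB_SIZE - 1 - int(m.group(1))))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts
```

For the failing example, `split_sentinels("<extra_id_0> am looking foward to hearing from you.")` yields the sentinel id 32099 followed by one plain-text segment, which the ONNX tokenizer already handles correctly.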

@wenbingl
Member

wenbingl commented Oct 22, 2024

Thanks, we will take a look at the issue.

@r4ghu
Author

r4ghu commented Nov 19, 2024

Hi @wenbingl ,
Checking in to see if there is any update on this issue. Our Windows release is partially blocked on this feature, and any update would help us plan our roadmap accordingly.

@wenbingl
Member

Hi @wenbingl , Checking in to see if there is any update on this issue. Our Windows release is partially blocked with this feature and any updates will help with planning our roadmap accordingly.

Is it possible for you to call ort-extensions via the C API, like the following?

const char* input[] = {"I <extra_id_0> like walking my cute dog\n and\x17 then, 生活的真谛是 \t\t\t\t \n\n61"};

@r4ghu
Author

r4ghu commented Nov 21, 2024

Hi @wenbingl ,
Most of our inference code for Windows is C#-based, so it would be really helpful if this functionality could be exposed through the C# NuGet package.

@KarelZe

KarelZe commented Dec 11, 2024

@r4ghu You might want to add the sentinel tokens manually to the protobuf file. See my comment here #852 (comment).
