
Additional support for T5 Tokenizer - SentencepieceTokenizer #828

Open
r4ghu opened this issue Oct 18, 2024 · 7 comments

Comments

@r4ghu

r4ghu commented Oct 18, 2024

Hi team,
I would like to request support for some additional features in T5Tokenizer / SentencepieceTokenizer. I was able to convert the HuggingFace T5 tokenizer to ONNX format using the following code -

import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort

# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "Translate this English sentence to French: Hello, how are you?"
input_ids = tokenizer.encode(text, return_tensors="np")

# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, ort_decoder = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True}, post_kwargs={})

# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())

with open("tokenizer_decoder.onnx", "wb") as f:
    f.write(ort_decoder.SerializeToString())

# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)
decoder_session = ort.InferenceSession("tokenizer_decoder.onnx", sess_options=session_options)

# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
np.testing.assert_array_equal(input_ids[0], actual_ids)
print("Actual IDs:", actual_ids)

# Decode the tokenized input
output = decoder_session.run(None, {'ids': actual_ids})[0]
print("Decoded sentence:", output)
assert output[0] == text

So far, the tokenizer works great when I pass normal sentences. But when I add sentinel tokens to my input sentence, the tokenizer's behavior differs from the HuggingFace tokenizer's. Can you please add support for sentinel tokens in SentencepieceTokenizer? If it's possible to get this working with a workaround on top of the existing logic, I would like to know, as it would simplify the preprocessing I currently do to handle sentinel tokens.

@wenbingl
Member

Can you provide an example of the inconsistent result versus the HF tokenizer here for investigation? And when you say 'sentinel tokens', are they something like '<extra_id_0>...'? Some of those tokens, I think, are handled by HF Python code, while the others are processed by the SentencePiece library. May I ask how these tokens are used in your app?

@r4ghu
Author

r4ghu commented Oct 21, 2024

Hi @wenbingl ,
Thanks for the quick response on this!

And when you say 'sentinel tokens', are they something like '<extra_id_0>...'? Some of those tokens, I think, are handled by HF Python code, while the others are processed by the SentencePiece library. May I ask how these tokens are used in your app?

Yes, by sentinel tokens I mean the tokens <extra_id_0>...<extra_id_99>. I use them to mask some tokens in the input string during inference.

Can you provide an example of inconsistence result from HF tokenizer here for investigation?

Here is the Python script I used to convert and test the tokenizer -

import numpy as np
from transformers import T5TokenizerFast
from onnxruntime_extensions import gen_processing_models, get_library_path
import onnxruntime as ort

# Initialize the tokenizer
tokenizer = T5TokenizerFast.from_pretrained("t5-small")
text = "<extra_id_0> am looking foward to hearing from you."
input_ids = tokenizer.encode(text, return_tensors="np")

# Create the ONNX graphs for the tokenizer
# ort_tokenizer - Model to perform tokenization from string input to tokenized output
# ort_decoder - Model to perform decoding from tokenized input to string output
ort_tokenizer, _ = gen_processing_models(tokenizer, pre_kwargs={'CAST_TOKEN_ID': True})

# Save the ONNX graphs
with open("tokenizer.onnx", "wb") as f:
    f.write(ort_tokenizer.SerializeToString())

# Run inference with the ONNX models
session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())
tokenizer_session = ort.InferenceSession("tokenizer.onnx", sess_options=session_options)

# Tokenize the input
actual_ids = tokenizer_session.run(None, {'inputs':[text]})[0]
print("HuggingFace Tokenizer IDs:", input_ids[0])
print("OnnxRuntime Tokenizer IDs:", actual_ids)

Running the above script, I got the following result -

HuggingFace Tokenizer IDs: [32099   183   479  5575  2239    12  3507    45    25     5     1]
OnnxRuntime Tokenizer IDs: [    3     2 25666   834    23    26   834   632  3155   183   479  5575
  2239    12  3507    45    25     5     1]
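The mismatch above is consistent with the ONNX tokenizer splitting "<extra_id_0>" into ordinary SentencePiece pieces instead of treating it as a single added token. Until that is supported natively, one possible pre-processing workaround (a sketch, not an onnxruntime-extensions API; `split_sentinels` and `VOCAB_SIZE` are names I made up, and the id formula `32099 - N` assumes the t5-small vocabulary where <extra_id_0> is the last id) is to cut the text on sentinel tokens, tokenize only the plain-text segments with the ONNX model, and splice the known sentinel ids back in:

```python
import re

# Hypothetical helper, assuming t5-small's layout: vocab size 32100,
# with <extra_id_0> at id 32099 and <extra_id_N> at 32099 - N.
SENTINEL = re.compile(r"<extra_id_(\d+)>")
VOCAB_SIZE = 32100

def split_sentinels(text):
    """Split text into ("text", segment) and ("id", token_id) parts.

    The "text" segments can be fed to the ONNX tokenizer one by one;
    the "id" parts are spliced into the output directly.
    """
    parts, pos = [], 0
    for m in SENTINEL.finditer(text):
        if m.start() > pos:
            parts.append(("text", text[pos:m.start()]))
        parts.append(("id", VOCAB_SIZE - 1 - int(m.group(1))))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts
```

For the failing example, `split_sentinels("<extra_id_0> am looking foward to hearing from you.")` yields the sentinel id 32099 followed by one plain-text segment, which the ONNX tokenizer already handles correctly.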

@wenbingl
Member

wenbingl commented Oct 22, 2024

Thanks, we will take a look at the issue.

@r4ghu
Author

r4ghu commented Nov 19, 2024

Hi @wenbingl ,
Checking in to see if there is any update on this issue. Our Windows release is partially blocked on this feature, and any update would help us plan our roadmap accordingly.

@wenbingl
Member

Hi @wenbingl , Checking in to see if there is any update on this issue. Our Windows release is partially blocked with this feature and any updates will help with planning our roadmap accordingly.

Is it possible for you to call ort-extensions via the C API, like the following?

const char* input[] = {"I <extra_id_0> like walking my cute dog\n and\x17 then, 生活的真谛是 \t\t\t\t \n\n61"};

@r4ghu
Author

r4ghu commented Nov 21, 2024

Hi @wenbingl ,
Most of our inference code for Windows is C#-based, so it would be really helpful if this functionality could be exposed through the C# NuGet package.

@KarelZe

KarelZe commented Dec 11, 2024

@r4ghu You might want to add the sentinel tokens manually to the protobuf file. See my comment here #852 (comment).
