Catch token count issue while streaming with customized models
If llama, llava, phi, or other custom models are used for streaming (with stream=True), the current design crashes after the response is fetched, because count_token raises for model names it does not recognize.

A warning is enough in this case, just as in the non-streaming code path.
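
For illustration, here is a minimal sketch of the failure this guards against and the fallback it introduces. It is not the code in autogen/oai/client.py; the model name is a placeholder and the count_token import path is assumed to be autogen's token_count_utils module.

import logging

from autogen.token_count_utils import count_token

logger = logging.getLogger(__name__)

messages = [{"role": "user", "content": "Hello"}]

try:
    # count_token may raise for model names it has no accounting rules for,
    # e.g. locally served llama/llava/phi builds.
    prompt_tokens = count_token(messages, "llama-3-8b-local")
except Exception as e:
    # Downgrade the failure to a warning and fall back to 0 tokens,
    # mirroring the non-streaming behavior.
    logger.warning(str(e))
    prompt_tokens = 0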
BeibinLi authored Jul 28, 2024
1 parent 61b9e8b commit a40aa56
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion autogen/oai/client.py
@@ -272,7 +272,12 @@ def create(self, params: Dict[str, Any]) -> ChatCompletion:
 
     # Prepare the final ChatCompletion object based on the accumulated data
     model = chunk.model.replace("gpt-35", "gpt-3.5")  # hack for Azure API
-    prompt_tokens = count_token(params["messages"], model)
+    try:
+        prompt_tokens = count_token(params["messages"], model)
+    except Exception as e:
+        # Catch token calculation error if streaming with customized models.
+        logger.warning(str(e))
+        prompt_tokens = 0
     response = ChatCompletion(
         id=chunk.id,
         model=chunk.model,
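
For context, a hedged sketch of the kind of call that previously crashed: streaming from a custom model served behind an OpenAI-compatible endpoint via autogen's OpenAIWrapper. The endpoint URL, API key, and model name below are placeholders, not values taken from this commit.

from autogen import OpenAIWrapper

# Placeholder config for a locally served model whose name count_token
# does not recognize.
client = OpenAIWrapper(
    config_list=[
        {
            "model": "llama-3-8b-local",
            "base_url": "http://localhost:8000/v1",
            "api_key": "NotRequired",
        }
    ]
)

# Before this commit, the streaming path raised after the response was fully
# received, when prompt tokens were counted for the final ChatCompletion
# object; with the patch, a warning is logged and prompt_tokens falls back to 0.
response = client.create(
    messages=[{"role": "user", "content": "Say hi."}],
    stream=True,
)
print(client.extract_text_or_completion_object(response))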
