Hi, thanks for your work!
I have a question: the fixed policy templates are quite long, which can significantly slow down model inference. Have you considered any optimisations?
For example, is it possible to store the KV cache? For LlamaGuard, prefix KV caching can be used because the policy is a fixed prefix. (This may not be possible with the LLaVA architecture, where the prefix is an image rather than a fixed template, and the image tokens are not fixed. I was just wondering what you were thinking.)
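To make the idea concrete, here is a minimal sketch of prefix KV caching for a text-only guard model, following the cache-reuse pattern from the Hugging Face `transformers` docs. This is not LlavaGuard's actual inference path; the model id, policy text, and user message are placeholders, and it assumes a recent `transformers` version whose `generate()` accepts `past_key_values`.

```python
# Sketch: cache the KV states of a long fixed policy prefix once,
# then reuse them for every request so only the new tokens are
# processed. All names below are placeholders, not from this repo.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 1) Run the long, fixed policy template once and keep its KV cache.
policy_prefix = "<long fixed policy template>"  # placeholder
prefix_inputs = tokenizer(policy_prefix, return_tensors="pt").to(model.device)
with torch.no_grad():
    prefix_cache = model(**prefix_inputs, use_cache=True).past_key_values

# 2) Per request: pass the full prompt plus a copy of the cached
#    prefix; generate() mutates the cache, hence the deepcopy.
#    Caveat: tokenizing prefix + message must reproduce the prefix
#    tokens exactly at the boundary for the cache to be valid.
user_message = "<content to classify>"  # placeholder
full_inputs = tokenizer(policy_prefix + user_message, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **full_inputs,
        past_key_values=copy.deepcopy(prefix_cache),
        max_new_tokens=32,
    )
print(tokenizer.decode(output[0, full_inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```

For LLaVA-style models this only helps if the static policy text actually precedes the image tokens in the prompt: everything up to the first non-fixed token (e.g. the image) is cacheable, so the prompt would need to be reordered to put the policy first, which is the point discussed below.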
Thank you for the hint. Initially, we also thought about stating the policy within our system prompt. Unfortunately, the conversation templates are implemented relatively statically in LLaVA's training code. So far, we haven't had the chance to implement this, but the idea is very sensible, and we will probably include it in the next iteration of LlavaGuard.