[P1] TGI and vLLM support #63
Thanks for reaching out!
|
Ok cool, thanks for the answers.
My sense is that the potential value of LoReFT is if you can make it really easy to attach different adapters to different sequences in the same batch at inference time. Right now it's possible to add LoRA adapters on the fly, but this slows inference (since the adapter weights have to be merged elementwise into the linear layer matrices). |
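As a minimal sketch of this idea (the `batched_lora` helper and the stacked adapter tensors `A` and `B` are hypothetical; this is not pyreft or vLLM code), per-sequence adapter routing lets each sequence in the batch index its own LoRA pair without merging anything into the base weights:

```python
import torch

# Hypothetical sketch: serve several LoRA adapters in one batch by letting
# each sequence index its own adapter, instead of merging weights into the
# base linear layer. Shapes: h (batch, seq, d_in), base_weight (d_out, d_in),
# A (num_adapters, r, d_in), B (num_adapters, d_out, r).
def batched_lora(h, base_weight, A, B, adapter_ids):
    base = h @ base_weight.T                          # shared base projection
    A_sel, B_sel = A[adapter_ids], B[adapter_ids]     # per-sequence adapters
    low = torch.einsum("bsd,brd->bsr", h, A_sel)      # down-project per sequence
    delta = torch.einsum("bsr,bor->bso", low, B_sel)  # up-project per sequence
    return base + delta
```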
@RonanKMcGovern Thanks for the input! On the point:
> My sense is that the potential value of LoReFT is if you can make it really easy to add different adapters to different sequences in the same batch at inference time.

Yes! I agree. We already have the concept of subspaces: each subspace gets trained separately for a different task (e.g., one for sentence completion, one for instruction following), and at inference time we can fire up one subspace while keeping the others silent. The subspace concept is embedded in interventions, and composing across interventions might be even more effective. For the subspace concept, we have a tutorial here: compreft (https://github.com/stanfordnlp/pyreft/tree/main/examples/composition).
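To make the subspace idea concrete, here is an illustrative sketch (not the pyreft API): a LoReFT-style edit phi(h) = h + R^T (W h + b - R h), where selecting a subspace means applying the edit only along a chosen subset of R's rank dimensions.

```python
import torch

# Illustrative sketch of a LoReFT-style intervention with subspaces.
# R has orthonormal rows; a "subspace" is a subset of those rows, so
# different tasks can train (and later activate) disjoint row subsets.
class LoreftSketch(torch.nn.Module):
    def __init__(self, d, r):
        super().__init__()
        # (r, d) projection with orthonormal rows via QR
        self.R = torch.nn.Parameter(torch.linalg.qr(torch.randn(d, r))[0].T)
        self.W = torch.nn.Linear(d, r)  # learned source: W h + b

    def forward(self, h, subspace=None):
        # subspace: list of rank indices to activate (e.g., rows trained
        # for one task); the remaining rows stay "silent".
        rows = slice(None) if subspace is None else subspace
        R = self.R[rows]                 # (r', d) active rows only
        src = self.W(h)[..., rows]       # (..., r') projected source
        return h + (src - h @ R.T) @ R   # edit only within the subspace
```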
On the other point: yes, memory savings and small disk checkpoints are two good (but not essential) advantages of ReFT. Again, I want to emphasize that LoReFT is just one intervention-based instance of ReFT; we now have different types of interventions defined in pyreft/interventions.py. Another benefit of ReFT is that we found it to be effective while intervening only on the prompt tokens (all of our experiments intervene only on the prompt tokens). This makes it much more efficient than vanilla adapters, which add overhead at every decoding step. For ReFT with its current prompt-only setup, there should be no overhead during the decoding steps: after we intervene on the prompt, the token representations are cached in the KV cache. But yeah, there is definitely overhead compared with LoRA, since the interventions take extra compute. |
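A hedged sketch of why the prompt-only setup avoids decode-time cost (assuming a HuggingFace-style decoder with batch-first hidden states and KV caching; `intervention` is any callable edit, such as the sketch above): the hook fires only when more than one token is passed, i.e., during prefill, so decode steps reuse the already-edited prompt representations from the cache.

```python
# Sketch, not pyreft internals: edit hidden states only during the
# prefill pass. With KV caching, decode steps pass a single new token,
# so the condition below skips them and they pay no intervention cost.
def make_prompt_only_hook(intervention):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] > 1:          # full prompt => prefill pass
            hidden = intervention(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# e.g. (hypothetical layer path):
# model.model.layers[8].register_forward_hook(make_prompt_only_hook(edit))
```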
Really nice.
So you can intervene at the prompt level. In principle this should then allow for batching of prompts with different interventions.
Yeah, this will be really useful, but it needs to be in a library with continuous batching to be useful (without continuous batching it's not possible to get the benefits).
Very cool overall; this will be a new way to fine-tune.
It would probably also make sense to add ORPO-style odds ratios into the objective function, which would allow preference tuning simultaneously; a sketch of that term follows.
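As a rough sketch of that idea (assuming mean per-token log-probabilities are already computed for the chosen and rejected responses, and `beta` is a hypothetical weighting), the ORPO term penalizes low odds of the chosen response relative to the rejected one:

```python
import torch
import torch.nn.functional as F

# Sketch of an ORPO-style objective: standard SFT loss on the chosen
# response plus a log-odds-ratio penalty. logp_* are mean per-token
# log-probs; odds(y|x) = p / (1 - p), computed in log space for stability.
def orpo_loss(nll_chosen, logp_chosen, logp_rejected, beta=0.1):
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return nll_chosen + beta * ratio_term.mean()
```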
|
Hi @frankaging, is anyone actively working on the vLLM part? I can maybe help offer a bounty for it. I'd like to make a video on pyreft, and I feel that vLLM support is needed for it to be useful. |
Regarding the above comment, I was wondering why not just intervene with something like A * (B h + b); I guess that's what you have in mind when you drop the orthogonalization. Also, I assume you tried just intervening with something like A * b (i.e., a constant)? |
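For reference, the two variants being asked about might look like the following sketch (class names hypothetical, not the pyreft classes in interventions.py):

```python
import torch

class LowRankEdit(torch.nn.Module):
    # h + A (B h + b): low-rank edit with no orthogonality constraint
    def __init__(self, d, r):
        super().__init__()
        self.B = torch.nn.Linear(d, r)            # down-projection + bias b
        self.A = torch.nn.Linear(r, d, bias=False)
    def forward(self, h):
        return h + self.A(self.B(h))

class ConstantEdit(torch.nn.Module):
    # h + A b: the added direction is a learned constant, independent of h
    def __init__(self, d, r):
        super().__init__()
        self.b = torch.nn.Parameter(torch.zeros(r))
        self.A = torch.nn.Linear(r, d, bias=False)
    def forward(self, h):
        return h + self.A(self.b)
```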
@RonanKMcGovern Thanks for the notes! Yes, we will provide full ablation studies in our next paper revision. We tried all the variants of interventions listed here: https://github.com/stanfordnlp/pyreft/blob/main/pyreft/interventions.py. Current findings are: |
Are there plans for inference support? This is needed if it's to be used by devs in production.
Is fine-tuning much faster than LoRA?