
[P1] TGI and vLLM support #63

Open
RonanKMcGovern opened this issue Apr 22, 2024 · 7 comments
Labels
question Further information is requested

Comments

@RonanKMcGovern

  1. Are there plans for inference support? This is needed if it's to be used by devs in production.

  2. Is fine-tuning much faster than with LoRA?

     • Optimization and the backward pass are MUCH faster, but surely the forward pass is similar (technically, slightly slower).

  3. Why so many epochs?

     • I was surprised to see 10-12 epochs in the paper.
     • In practice with LoRA I find less is more (often just one epoch with a constant LR), because it stops overfitting.
@aryamanarora
Collaborator

Thanks for reaching out!

  1. Yes. We're not particularly experienced in that kind of work, but we're planning on attempting this in the summer (e.g. vLLM support). There is already some open-source community interest, which is promising.
  2. Our guess is that overall runtime is a little faster. We're putting out a new version of the paper later this week where we explore ablations on LoReFT that simplify the equation and/or remove orthogonalisation. The main training speed bottleneck is orthogonalisation on the matrix $\mathbf{R}$, but experiments show that we can get rid of that and maintain performance (see the sketch after this list). So expect to see more speedups soon!
  3. In the new version of the paper we also report 3-epoch results, which outperform DoRA. We've also done some experiments on few-shot adaptation where the train loss goes to 0 (overfitting for dozens of epochs), yet we still get generalisation rather than overfitting. So in this respect, we think adding more epochs helps ReFTs more than it helps LoRA and its variants.
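
For readers wondering what the orthogonalisation bottleneck refers to, here is a minimal PyTorch sketch of a LoReFT-style intervention, $h + \mathbf{R}^\top(\mathbf{W}h + b - \mathbf{R}h)$, assuming the formulation from the paper. This is an illustration, not pyreft's actual code; the `orthogonal` parametrization below stands in for whatever mechanism pyreft uses to keep the rows of $\mathbf{R}$ orthonormal, which is the costly part discussed above.

```python
import torch
import torch.nn as nn

class LoreftSketch(nn.Module):
    """Illustrative LoReFT-style intervention: h + R^T (W h + b - R h).

    R is kept row-orthonormal via a parametrization that is re-applied every
    step; that constraint is the training-speed bottleneck mentioned above.
    """
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, rank)                 # W h + b
        self.rotate = nn.Linear(embed_dim, rank, bias=False)   # R
        # Constrain R's rows to stay orthonormal (the expensive part).
        torch.nn.utils.parametrizations.orthogonal(self.rotate)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit only a rank-dimensional subspace of the representation.
        delta = self.proj(h) - self.rotate(h)        # (..., rank)
        return h + delta @ self.rotate.weight        # project back up to embed_dim

# e.g. intervene on hidden states from one layer
h = torch.randn(2, 16, 768)
print(LoreftSketch(768, rank=4)(h).shape)  # torch.Size([2, 16, 768])
```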

@frankaging frankaging changed the title TGI and vLLM support [P1] TGI and vLLM support Apr 22, 2024
@frankaging frankaging added the question Further information is requested label Apr 22, 2024
@RonanKMcGovern
Author

Ok cool, thanks for the answers.

  1. I suggest posting an issue on the vLLM GitHub repo and asking for help; it shouldn't be that hard. Same with TGI (Text Generation Inference on HF) - they will want to consider it. Tell them you're the author. Summer is too far away and things move too fast, so I'd say move sooner.

  2. Regarding fine-tuning:
    With LoRA the VRAM is already dominated by the weights and the activations. The trainable-parameter gradients and optimizer states are fairly small (rough numbers in the sketch below), so I think the benefit there of LoReFT (i.e. reducing VRAM) is good, but not essential. Same with training time (backpropagation is faster with LoReFT, but the forward pass is still there and dominant). [Maybe I'm missing/underestimating something here about LoReFT, though.]
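
As a rough back-of-envelope for that VRAM point (illustrative numbers, not measurements; a 7B model in bf16 and a typical rank-16 LoRA are assumptions here):

```python
# Rough back-of-envelope for LoRA fine-tuning of a 7B model (illustrative only).
params_base = 7e9
bytes_weights = params_base * 2                # bf16 base weights        ~14 GB
params_lora = 40e6                             # assumed: rank-16 adapters on attn/MLP
bytes_lora = params_lora * (2 + 2 + 4 + 4)     # bf16 weights + grads, fp32 Adam m, v
print(f"base weights ≈ {bytes_weights/1e9:.0f} GB, LoRA training state ≈ {bytes_lora/1e9:.1f} GB")
# → the trainable-parameter overhead is well under 1 GB; activations and weights dominate.
```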

My sense is that the potential value of LoReFT is if you can make it really easy to apply different adapters to different sequences in the same batch at inference time. Right now, it's possible to add LoRA adapters on the fly, but this slows inference (as you have to add the update to each element of the linear-layer weight matrices).

@frankaging
Collaborator

frankaging commented Apr 23, 2024

@RonanKMcGovern Thanks for the inputs!

On the point:

My sense is that the potential value of LoReFT is if you can make it really easy to apply different adapters to different sequences in the same batch at inference time.

Yes! I agree. We already have the concept of subspaces: each subspace is trained separately for a different task (e.g., one for sentence completion, one for instruction following); at inference time, we can fire up one subspace while keeping the others silent. The subspace concept lives within a single intervention; across interventions, it might be even more effective.

For the subspace concept, we have a tutorial here: compreft.
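
To make the subspace idea concrete, here is an illustrative sketch, not pyreft's API: one low-rank intervention whose rank dimensions are partitioned into per-task subspaces, so different sequences in the same batch can each activate a different subspace at inference time. The class name, shapes, and masking scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SubspaceInterventionSketch(nn.Module):
    """Illustrative only: the rank dimensions of one intervention are split into
    per-task subspaces that can be switched on or off per sequence at inference."""
    def __init__(self, embed_dim: int, rank_per_task: int, num_tasks: int):
        super().__init__()
        rank = rank_per_task * num_tasks
        self.down = nn.Linear(embed_dim, rank)
        self.up = nn.Linear(rank, embed_dim, bias=False)
        self.rank_per_task = rank_per_task
        self.num_tasks = num_tasks

    def forward(self, h: torch.Tensor, active_task: torch.Tensor) -> torch.Tensor:
        # active_task: (batch,) index of the subspace to enable for each sequence.
        z = self.down(h)                                          # (batch, seq, rank)
        mask = torch.zeros(h.size(0), self.num_tasks, device=h.device)
        mask.scatter_(1, active_task.unsqueeze(1), 1.0)           # one-hot over tasks
        mask = mask.repeat_interleave(self.rank_per_task, dim=1)  # (batch, rank)
        return h + self.up(z * mask[:, None, :])                  # silence inactive subspaces

# two sequences in one batch, each firing up a different task's subspace
h = torch.randn(2, 8, 768)
out = SubspaceInterventionSketch(768, rank_per_task=4, num_tasks=2)(h, torch.tensor([0, 1]))
```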

On the point:

so I think the benefits there of LoReFT (i.e. reducing VRAM) are good, but not essential. Same with training time

Yes - memory savings and small disk checkpoints are two good (but not essential) advantages of ReFT (again, I want to emphasize that LoReFT is just one intervention-based instance of ReFT; we now have different types of interventions defined in interventions.py).

Another benefit of ReFT is that we found it to be effective while intervening only on the prompt tokens (all of our experiments intervene only on the prompt tokens). This makes it much more efficient than vanilla adapters, which add overhead at every decoding step. For ReFT with its current prompt-only setup, there should be no overhead during decoding, since after we intervene on the prompt, the token representations are cached in the KV cache. But yeah, there is definitely overhead compared with LoRA, since the interventions take extra compute.
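
A rough sketch of why the prompt-only setup adds no per-step decoding cost, assuming a HuggingFace-style decoder where the prefill pass processes the whole prompt and each decode step processes a single new token. This hook is illustrative, not how pyreft actually attaches interventions:

```python
import torch

def make_prompt_only_hook(intervention):
    """Forward hook that edits a layer's hidden states only during prefill.
    Decode steps see a single new token (seq_len == 1), so the hook is a no-op
    there; the edited prompt representations already flowed into the KV caches
    of the layers above, so decoding pays no extra cost."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.size(1) > 1:                    # prefill: the full prompt is present
            edited = intervention(hidden)
            return (edited, *output[1:]) if isinstance(output, tuple) else edited
        return output                             # decode steps: left untouched
    return hook

# Usage (layer path varies by model; shown for a LLaMA-style decoder):
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(make_prompt_only_hook(my_intervention))
```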

@RonanKMcGovern
Author

RonanKMcGovern commented Apr 24, 2024 via email

@RonanKMcGovern
Author

Hi @frankaging, is anyone actively working on the vLLM part?

I can maybe help offer a bounty for it.

I'd like to make a video on pyreft, and I feel that vLLM support is needed for it to be useful.

@RonanKMcGovern
Author

2. The main training speed bottleneck is orthogonalisation on the matrix R,

Regarding the above comment, I was wondering: why not just intervene with something like A * (B h + b)? I guess that's what you have in mind when you drop the orthogonalization. Also, I assume you tried just intervening with something like A * b (i.e. a constant)?

@frankaging
Collaborator

@RonanKMcGovern Thanks for the notes! Yes, we will provide full ablation studies in our next paper revision. We tried all the variants of interventions listed here: https://github.com/stanfordnlp/pyreft/blob/main/pyreft/interventions.py.

Current findings are: A * (B h + b) works pretty well, matching LoReFT performance with lower compute; h + A * b or h + b or A * h + b work poorly.
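
For readers following along, rough sketches of two of the variants discussed (illustrative only; the canonical definitions live in pyreft's interventions.py linked above, and whether the edit replaces h or is added to it is written here in the additive ReFT style):

```python
import torch
import torch.nn as nn

class NoOrthoSketch(nn.Module):
    """A * (B h + b): the variant reported above to match LoReFT with lower
    compute, since there is no orthogonality constraint to maintain."""
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.B = nn.Linear(embed_dim, rank)              # B h + b
        self.A = nn.Linear(rank, embed_dim, bias=False)  # A

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.A(self.B(h))

class ConstantEditSketch(nn.Module):
    """h + A * b: a learned constant edit, independent of h (reported to work poorly)."""
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(rank))
        self.A = nn.Linear(rank, embed_dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.A(self.b)
```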
