
[P1] TGI and vLLM support #63

Open
RonanKMcGovern opened this issue Apr 22, 2024 · 7 comments
Labels
question Further information is requested

Comments

@RonanKMcGovern

  1. Are there plans for inference support? This is needed if it's to be used by devs in production.

  2. Is fine-tuning much faster than with LoRA?

     • Optimization and the backward pass are MUCH faster, but surely the forward pass is similar (technically, slightly slower).

  3. Why so many epochs?

     • I was surprised to see 10-12 epochs in the paper.
     • In practice with LoRA I find less is more (often just one epoch with a constant LR), because it stops overfitting.
@aryamanarora
Collaborator

Thanks for reaching out!

  1. Yes. We're not particularly experienced in that kind of work, but we're planning on attempting this in the summer (e.g. vLLM support). There is already some open-source community interest, which is promising.
  2. Our guess is that overall runtime is a little faster. We're putting out a new version of the paper later this week where we explore ablations on LoReFT that simplify the equation and/or remove orthogonalisation. The main training speed bottleneck is orthogonalisation on the matrix $\mathbf{R}$, but experiments show that we can get rid of that and maintain performance (see the sketch after this list). So expect to see more speedups soon!
  3. In the new version of the paper we also report 3-epoch results, which outperform DoRA. We've also done some experiments on few-shot adaptation where the train loss goes to 0 (overfitting for dozens of epochs), yet we still get generalisation rather than overfitting. So in this respect, we think adding more epochs helps ReFTs more than it helps LoRA and its variants.
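
For readers wondering what the orthogonalisation bottleneck refers to, here is a minimal PyTorch sketch of a LoReFT-style intervention, $h + \mathbf{R}^\top(\mathbf{W}h + b - \mathbf{R}h)$, assuming the formulation from the paper. This is an illustration, not pyreft's actual code; the `orthogonal` parametrization below stands in for whatever mechanism pyreft uses to keep the rows of $\mathbf{R}$ orthonormal, which is the costly part discussed above.

```python
import torch
import torch.nn as nn

class LoreftSketch(nn.Module):
    """Illustrative LoReFT-style intervention: h + R^T (W h + b - R h).

    R is kept row-orthonormal via a parametrization that is re-applied every
    step; that constraint is the training-speed bottleneck mentioned above.
    """
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, rank)                 # W h + b
        self.rotate = nn.Linear(embed_dim, rank, bias=False)   # R
        # Constrain R's rows to stay orthonormal (the expensive part).
        torch.nn.utils.parametrizations.orthogonal(self.rotate)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit only a rank-dimensional subspace of the representation.
        delta = self.proj(h) - self.rotate(h)        # (..., rank)
        return h + delta @ self.rotate.weight        # project back up to embed_dim

# e.g. intervene on hidden states from one layer
h = torch.randn(2, 16, 768)
print(LoreftSketch(768, rank=4)(h).shape)  # torch.Size([2, 16, 768])
```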

@frankaging frankaging changed the title TGI and vLLM support [P1] TGI and vLLM support Apr 22, 2024
@frankaging frankaging added the question Further information is requested label Apr 22, 2024
@RonanKMcGovern
Author

Ok cool, thanks for the answers.

  1. I suggest posting an issue on the vLLM GitHub repo and asking for help; it shouldn't be that hard. Same with TGI (Text Generation Inference on HF) - they will want to consider it. Tell them you're the author. Summer is too far away and things move too fast, so I'd say move sooner.

  2. Regarding fine-tuning:
    With LoRA the VRAM is already dominated by the weights and the activations. The trainable-parameter gradients and optimizer states are fairly small (rough numbers in the sketch below), so I think the benefit there of LoReFT (i.e. reducing VRAM) is good, but not essential. Same with training time (backpropagation is faster with LoReFT, but the forward pass is still there and dominant). [Maybe I'm missing/underestimating something here about LoReFT, though.]
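
As a rough back-of-envelope for that VRAM point (illustrative numbers, not measurements; a 7B model in bf16 and a typical rank-16 LoRA are assumptions here):

```python
# Rough back-of-envelope for LoRA fine-tuning of a 7B model (illustrative only).
params_base = 7e9
bytes_weights = params_base * 2                # bf16 base weights        ~14 GB
params_lora = 40e6                             # assumed: rank-16 adapters on attn/MLP
bytes_lora = params_lora * (2 + 2 + 4 + 4)     # bf16 weights + grads, fp32 Adam m, v
print(f"base weights ≈ {bytes_weights/1e9:.0f} GB, LoRA training state ≈ {bytes_lora/1e9:.1f} GB")
# → the trainable-parameter overhead is well under 1 GB; activations and weights dominate.
```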

My sense is that the potential value of LoReFT is if you can make it really easy to apply different adapters to different sequences in the same batch at inference time. Right now, it's possible to add LoRA adapters on the fly, but this slows inference (as you have to add the update to each element of the linear-layer weight matrices).

@frankaging
Collaborator

frankaging commented Apr 23, 2024

@RonanKMcGovern Thanks for the inputs!

On the point:

My sense is that the potential value of LoReFT is if you can make it really easy to apply different adapters to different sequences in the same batch at inference time.

Yes! I agree. We already have the concept of subspaces: each subspace is trained separately for a different task (e.g., one for sentence completion, one for instruction following); at inference time, we can fire up one subspace while keeping the others silent. The subspace concept lives within a single intervention; across interventions, it might be even more effective.

For the subspace concept, we have a tutorial here: compreft.
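
To make the subspace idea concrete, here is an illustrative sketch, not pyreft's API: one low-rank intervention whose rank dimensions are partitioned into per-task subspaces, so different sequences in the same batch can each activate a different subspace at inference time. The class name, shapes, and masking scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SubspaceInterventionSketch(nn.Module):
    """Illustrative only: the rank dimensions of one intervention are split into
    per-task subspaces that can be switched on or off per sequence at inference."""
    def __init__(self, embed_dim: int, rank_per_task: int, num_tasks: int):
        super().__init__()
        rank = rank_per_task * num_tasks
        self.down = nn.Linear(embed_dim, rank)
        self.up = nn.Linear(rank, embed_dim, bias=False)
        self.rank_per_task = rank_per_task
        self.num_tasks = num_tasks

    def forward(self, h: torch.Tensor, active_task: torch.Tensor) -> torch.Tensor:
        # active_task: (batch,) index of the subspace to enable for each sequence.
        z = self.down(h)                                          # (batch, seq, rank)
        mask = torch.zeros(h.size(0), self.num_tasks, device=h.device)
        mask.scatter_(1, active_task.unsqueeze(1), 1.0)           # one-hot over tasks
        mask = mask.repeat_interleave(self.rank_per_task, dim=1)  # (batch, rank)
        return h + self.up(z * mask[:, None, :])                  # silence inactive subspaces

# two sequences in one batch, each firing up a different task's subspace
h = torch.randn(2, 8, 768)
out = SubspaceInterventionSketch(768, rank_per_task=4, num_tasks=2)(h, torch.tensor([0, 1]))
```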

On the point:

so I think the benefits there of LoReFT (i.e. reducing VRAM) are good, but not essential. Same with training time

Yes - memory savings and small disk checkpoints are two good (but not essential) advantages of ReFT (again, I want to emphasize that LoReFT is just one intervention-based instance of ReFT; we now have different types of interventions defined in interventions.py).

Another benefit of ReFT is that we found it to be effective while intervening only on the prompt tokens (all of our experiments intervene only on the prompt tokens). This makes it much more efficient than vanilla adapters, which add overhead at every decoding step. For ReFT with its current prompt-only setup, there should be no overhead during decoding, since after we intervene on the prompt, the token representations are cached in the KV cache. But yeah, there is definitely overhead compared with LoRA, since the interventions take extra compute.
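
A rough sketch of why the prompt-only setup adds no per-step decoding cost, assuming a HuggingFace-style decoder where the prefill pass processes the whole prompt and each decode step processes a single new token. This hook is illustrative, not how pyreft actually attaches interventions:

```python
import torch

def make_prompt_only_hook(intervention):
    """Forward hook that edits a layer's hidden states only during prefill.
    Decode steps see a single new token (seq_len == 1), so the hook is a no-op
    there; the edited prompt representations already flowed into the KV caches
    of the layers above, so decoding pays no extra cost."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.size(1) > 1:                    # prefill: the full prompt is present
            edited = intervention(hidden)
            return (edited, *output[1:]) if isinstance(output, tuple) else edited
        return output                             # decode steps: left untouched
    return hook

# Usage (layer path varies by model; shown for a LLaMA-style decoder):
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(make_prompt_only_hook(my_intervention))
```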

@RonanKMcGovern
Author

RonanKMcGovern commented Apr 24, 2024 via email

@RonanKMcGovern
Author

Hi @frankaging, is anyone actively working on the vLLM part?

I can maybe help offer a bounty for it.

I'd like to make a video on pyreft, and I feel that vLLM support is needed for it to be useful.

@RonanKMcGovern
Author

2. The main training speed bottleneck is orthogonalisation on the matrix R,

Regarding the above comment, I was wondering: why not just intervene with something like A * (B h + b)? I guess that's what you have in mind when you drop the orthogonalization. Also, I assume you tried just intervening with something like A * b (i.e. a constant)?

@frankaging
Collaborator

@RonanKMcGovern Thanks for the notes! Yes, we will provide full ablation studies in our next paper revision. We tried all the variants of interventions listed here: https://github.com/stanfordnlp/pyreft/blob/main/pyreft/interventions.py.

Current findings are: A * (B h + b) works pretty well, matching LoReFT performance with lower compute; h + A * b or h + b or A * h + b work poorly.
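
For readers following along, rough sketches of two of the variants discussed (illustrative only; the canonical definitions live in pyreft's interventions.py linked above, and whether the edit replaces h or is added to it is written here in the additive ReFT style):

```python
import torch
import torch.nn as nn

class NoOrthoSketch(nn.Module):
    """A * (B h + b): the variant reported above to match LoReFT with lower
    compute, since there is no orthogonality constraint to maintain."""
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.B = nn.Linear(embed_dim, rank)              # B h + b
        self.A = nn.Linear(rank, embed_dim, bias=False)  # A

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.A(self.B(h))

class ConstantEditSketch(nn.Module):
    """h + A * b: a learned constant edit, independent of h (reported to work poorly)."""
    def __init__(self, embed_dim: int, rank: int):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(rank))
        self.A = nn.Linear(rank, embed_dim, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.A(self.b)
```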
