
Do you use LLaVA AnyRes? #27

Open
kimwongyuda opened this issue Dec 27, 2024 · 4 comments

Comments

@kimwongyuda

Thank you for your nice work.

When I run the code, only a per-device batch size of 1 fits, even though I use LLaVA with Mistral and LoRA (GradCache not used). I suspect the large number of image tokens produced by LLaVA's AnyRes is consuming too much GPU memory.

I didn't modify any of your code. How can I increase the batch size? And did you use AnyRes as described above?

Thank you.

@XMHZZ2018
Contributor

@kimwongyuda

Thank you for your interest in our work! You are correct that image tokens can consume significant GPU memory, limiting the per-device batch size to around 2 to 4 on devices like the H100. If your GPU has less memory, it's expected that only a per-device batch size of 1 may be feasible.
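For intuition on why AnyRes is so memory-hungry, here is a rough back-of-envelope, assuming LLaVA-NeXT's defaults (a 336 px CLIP-ViT-L/14 vision tower, so each 336x336 tile yields (336/14)^2 = 576 visual tokens, with AnyRes adding up to four high-resolution tiles on top of the base image); the exact count depends on the grid configuration chosen for each image:

```python
# Back-of-envelope token count for one AnyRes image (LLaVA-NeXT defaults
# assumed: 336px CLIP-ViT-L/14 backbone, up to a 2x2 high-res grid).
tokens_per_tile = (336 // 14) ** 2            # 576 tokens per 336x336 tile
num_tiles = 1 + 4                              # base image + 4 AnyRes tiles
visual_tokens = tokens_per_tile * num_tiles    # ~2880 visual tokens per image
print(visual_tokens)
```

Compared to the 576 tokens of a single low-res image, that is roughly a 5x increase in sequence length per image, which explains the tight per-device batch size.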

However, the effective batch size is not the same as the per-device batch size. We use the GradCache technique to scale the effective batch size to 2K or even larger; passing `--grad_cache True` enables it. (The README contains the full commands.) Please let me know whether this works.
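For readers unfamiliar with the trick, here is a minimal PyTorch sketch of the idea behind GradCache (Gao et al., 2021), not this repository's implementation; `encoder`, `queries`, `targets`, and `chunk_size` are placeholder names, and the InfoNCE temperature is omitted for brevity:

```python
# Minimal sketch of the GradCache idea: split a large contrastive batch into
# memory-sized chunks, cache all embeddings with a no-grad pass, compute the
# full-batch loss and the gradient w.r.t. each embedding, then replay each
# chunk with grad enabled and backprop the cached embedding-gradients.
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, queries, targets, chunk_size, optimizer):
    q_chunks = queries.split(chunk_size)
    t_chunks = targets.split(chunk_size)

    # 1) No-grad forward over every chunk; only embeddings are kept,
    #    so activation memory never exceeds one chunk's worth.
    with torch.no_grad():
        q_reps = torch.cat([encoder(c) for c in q_chunks])
        t_reps = torch.cat([encoder(c) for c in t_chunks])

    # 2) Full-batch InfoNCE loss on the cached embeddings only.
    q_reps = q_reps.detach().requires_grad_()
    t_reps = t_reps.detach().requires_grad_()
    logits = q_reps @ t_reps.T
    labels = torch.arange(len(q_reps), device=logits.device)
    loss = F.cross_entropy(logits, labels)
    loss.backward()  # fills q_reps.grad / t_reps.grad, not encoder grads

    # 3) Replay each chunk with grad enabled, injecting the cached gradient;
    #    encoder gradients accumulate chunk by chunk.
    for c, g in zip(q_chunks, q_reps.grad.split(chunk_size)):
        encoder(c).backward(gradient=g)
    for c, g in zip(t_chunks, t_reps.grad.split(chunk_size)):
        encoder(c).backward(gradient=g)

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The resulting encoder gradients match those of a single full-batch step, so the contrastive loss sees the full 2K+ batch even though only one chunk's activations are ever live in memory.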

@kimwongyuda
Author

kimwongyuda commented Dec 27, 2024

Without GradCache, the batch size limit is 2 to 4, as you said above.

Then how did you reach the batch size of 256 in Table 3 of the paper on 8x H100? The Table 3 setup doesn't appear to use GradCache (basic setting).

Is it due to the model size difference between Phi-3.5-V and LLaVA-NeXT, or did you use gradient accumulation?

Thank you.

@XMHZZ2018
Contributor

Hi @kimwongyuda, all the experiments use GradCache to scale up the batch size. (It's the default setting.)

@kimwongyuda
Author

@XMHZZ2018
Sorry. I misunderstood. Thank you so much for your response.
