As I understand it, one of the core contributions claimed in the paper is that training does not require derivatives of the LLM, which saves a lot of resources.
But how is this enforced in the code?
In LMAdaptorModel, the LLM's parameters are set with requires_grad = False.
In PromptedClassificationReward, there is a no_grad decorator.
But my experiments show that neither method actually prevents the computation of gradients.
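For reference, here is roughly what I understand those two mechanisms to look like (an illustrative PyTorch sketch with placeholder names, not the repository's actual code):

```python
import torch
import torch.nn as nn

# Placeholder stand-ins -- not the classes from this repository.

# Mechanism 1: freeze the large model's parameters (requires_grad = False).
frozen_lm = nn.Linear(16, 16)          # stand-in for the frozen LLM
for p in frozen_lm.parameters():
    p.requires_grad_(False)

# Mechanism 2: run the reward computation under a no_grad decorator,
# so no autograd graph is recorded inside it at all.
@torch.no_grad()
def compute_reward(inputs: torch.Tensor) -> torch.Tensor:
    return frozen_lm(inputs).mean()
```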
Denote some network blocks by a function $g$, where $g$ is restricted by no_grad or requires_grad = False, and let some other blocks $f$ be attached before $g$, so the whole network looks like $g(f(x))$.

However, $f$ does require gradients, because $f$ needs to be updated. My experiments show that in this case the backward pass still has to run through $g$ (computing gradients with respect to $g$'s input), because there is no other way to obtain the gradients of $f$'s parameters. So no_grad / requires_grad = False has no effect here: the gradients are still computed.
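Here is a minimal reproduction of what I mean (an illustrative sketch, not the actual model code): freezing $g$'s parameters only skips the gradients of $g$'s own weights, but backpropagation still has to pass through $g$ to reach $f$.

```python
import torch
import torch.nn as nn

f = nn.Linear(8, 8)                    # trainable block (needs gradients)
g = nn.Linear(8, 1)                    # "frozen" block, stand-in for the LLM
for p in g.parameters():
    p.requires_grad_(False)

x = torch.randn(4, 8)
loss = g(f(x)).sum()
loss.backward()

print(g.weight.grad)  # None   -> g's own parameter gradients are skipped
print(f.weight.grad)  # tensor -> but the backward pass still ran through g
                      #           to produce these gradients for f
```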
I wonder how exactly the authors arrange for gradient computation through the LLM to never happen in this case, since the training runs so fast that a full backward pass through the LLM does not seem possible.