seems that MoA does not work on MATH and QA with both weak and strong LLMs #41
Comments
Here I paste one example for your information.
llama-3.1-8B's answer on its own is (which is correct)
But when applying MoA, llama-3.1-8B takes not only the original question but also three answers from the intermediate layer. Here,
mistral-v0.2's answer is (which is wrong)
mistral-v0.3's answer is (which is correct)
After ingesting the answers from the Mistral models, llama-3.1 changes its answer to (which is wrong)
In the other test cases, mistral-v0.3 is more likely to be correct, but not always. It outperforms v0.1 and v0.2 slightly, but not by a significant margin. I expected that with MoA, or more precisely with the intermediate opinions, the final aggregator would give a better response than the version without them.
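One way to check how often the intermediate answers mislead the aggregator is to tally, per question, whether the aggregator was right without MoA and right with MoA. The following is only a minimal sketch: `tally_flips` and the bucket names are hypothetical helpers, not part of the MoA repository, and the correctness flags are assumed to come from your own evaluation script.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_flips(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count outcome buckets for (correct_without_moa, correct_with_moa) pairs.

    Hypothetical helper: one pair per benchmark question.
    """
    counts: Counter = Counter()
    for alone_ok, moa_ok in results:
        if alone_ok and not moa_ok:
            counts["hurt"] += 1        # MoA flipped a correct answer to a wrong one
        elif not alone_ok and moa_ok:
            counts["helped"] += 1      # MoA fixed a wrong answer
        elif alone_ok:
            counts["both_right"] += 1  # correct either way
        else:
            counts["both_wrong"] += 1  # wrong either way
    return counts
```

If "hurt" is much larger than "helped", the aggregator is being pulled away from answers it would have gotten right by itself, as in the example above.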
update:
aggregator: The experimental result is that Here I also paste one example from GSM8K oracle answer:
answer from aggregator directly without MoA:
answer from aggregator with MoA, where the prompt includes the responses from the LLMs in the intermediate layer:
answer from
answer from
answer from
answer from
answer from
This does not sound surprising after reading the MoA paper https://arxiv.org/pdf/2406.04692 . Table 4 there shows the effect of using different models as either proposers or aggregators: weaker models drop a lot as aggregators while still being useful as proposers.
From experience, when you overwhelm a weak model with a lot of information, it struggles to pick out the useful bits. Perhaps something like this is happening in your experiments.
I have thoroughly tested MoA (with one layer) on some objective benchmarks (less subjective than MT-bench), such as GSM8K and HotpotQA.
It seems that when the LLMs are at the 7B level, it no longer works.
In my setting, the three LLMs in layer one are mistralai/Mistral-7B-Instruct-v0.1/2/3, while the aggregator is meta-llama/Meta-Llama-3.1-8B-Instruct. (Before the experiment, I tested each model's ability to solve the problems; the most capable one is llama-3.1-8B.)
Then, when applying MoA, I find that performance decreases. For example, on GSM8K the accuracy drops from 75.1 to 61.3: llama-3.1 alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.
This finding also applies to HotpotQA.
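For anyone trying to reproduce this, the two settings compared above (rounds=0 versus rounds=1) can be sketched roughly as follows. This is a simplified illustration, not the MoA repository's code: `generate` is a hypothetical callable you would back with your own inference client, the aggregator prompt wording is my own assumption, and only zero or one proposer layer is handled.

```python
from typing import Callable, List

# Hypothetical signature: (model_name, prompt) -> model reply text.
Generate = Callable[[str, str], str]

PROPOSERS = [
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
AGGREGATOR = "meta-llama/Meta-Llama-3.1-8B-Instruct"


def build_aggregator_prompt(question: str, proposals: List[str]) -> str:
    """Combine the question with numbered proposer answers (wording is illustrative)."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(proposals, start=1))
    return (
        "You are given several candidate answers to a question. "
        "They may be wrong; check them critically before answering.\n\n"
        f"Candidate answers:\n{numbered}\n\n"
        f"Question: {question}"
    )


def moa_answer(question: str, generate: Generate, rounds: int = 1) -> str:
    """rounds=0: aggregator answers alone; rounds=1: one proposer layer, then aggregate."""
    if rounds not in (0, 1):
        raise ValueError("this sketch only supports rounds=0 or rounds=1")
    if rounds == 0:
        return generate(AGGREGATOR, question)
    # Each proposer sees only the original question.
    proposals = [generate(model, question) for model in PROPOSERS]
    # The aggregator sees the question plus all proposer answers.
    return generate(AGGREGATOR, build_aggregator_prompt(question, proposals))
```

Running the same benchmark through `moa_answer` with rounds=0 and rounds=1 and the same `generate` backend gives the two accuracy numbers being compared.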
Has anyone observed something similar? Any suggestions on how to use 7B-level LLMs?