chore(deps): update dependency lm-eval to v0.4.7 #274

Open · wants to merge 1 commit into main

Conversation

red-hat-konflux[bot]

This PR contains the following updates:

| Package | Update | Change |
| --- | --- | --- |
| lm-eval | patch | ==0.4.4 -> ==0.4.7 |

Release Notes

EleutherAI/lm-evaluation-harness (lm-eval)

v0.4.7

Compare Source

lm-eval v0.4.7 Release Notes

This release includes several bug fixes, minor improvements to model handling, and task additions.

⚠️ Python 3.8 End of Support Notice

Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.

Backwards Incompatibilities

Chat Template Delimiter Handling (in v0.4.6)

An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.

📝 For detailed documentation, please refer to docs/chat-template-readme.md
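As a rough illustration of where this change applies, a multiple-choice evaluation run with the chat template enabled might look like the sketch below; the model id and task are illustrative placeholders chosen for this example, not taken from the release note.

```python
# Illustrative sketch only: a chat-template run where the v0.4.6 delimiter
# handling applies. The model id and task below are assumed examples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed example chat model
    tasks=["arc_easy"],        # an example multiple-choice task
    num_fewshot=0,
    apply_chat_template=True,  # delimiters now follow the model's chat template
)
print(results["results"]["arc_easy"])
```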

New Benchmarks & Tasks

The release also includes several small fixes and changes to existing tasks (as noted via the incrementing of task versions).

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

New Contributors

Full Changelog: EleutherAI/lm-evaluation-harness@v0.4.6...v0.4.7

v0.4.5

Compare Source

lm-eval v0.4.5 Release Notes

New Additions

Prototype Support for Vision Language Models (VLMs)

We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, via the model types hf-multimodal and vllm-vlm. This allows evaluation of models that take text and image inputs and produce text outputs. So far we have added support for the MMMU (mmmu_val) task, and we welcome contributions and feedback from the community!

New VLM-Specific Arguments

VLM models can be configured with several new arguments within --model_args to support their specific requirements:

  • max_images (int): Set the maximum number of images for each prompt.
  • interleave (bool): Determines the positioning of image inputs. When True (default), images are interleaved with the text; when False, all images are placed at the front of the text. This is model dependent.

hf-multimodal specific args:

  • image_token_id (int) or image_string (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an "<image>" string to indicate the location of images in the input, while Qwen2-VL models expect an "<|image_pad|>" sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
  • convert_img_format (bool): Whether to convert the images to RGB format.
Example usage:
  • lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string="<image>" --tasks mmmu_val --apply_chat_template

  • lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template
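For users driving the harness from Python rather than the shell, a roughly equivalent call might look like the sketch below, which simply mirrors the CLI examples above; it is a hedged illustration rather than an official recipe, and the exact keyword arguments accepted by simple_evaluate may vary by version.

```python
# Hedged sketch mirroring the hf-multimodal CLI example above; the model_args
# string follows the documented VLM arguments.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args=(
        "pretrained=llava-hf/llava-1.5-7b-hf,"
        "max_images=1,interleave=True,image_string=<image>"
    ),
    tasks=["mmmu_val"],
    apply_chat_template=True,  # most VLMs need their chat template applied
)
```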

Important considerations
  1. Chat Template: Most VLMs require the --apply_chat_template flag to ensure proper input formatting according to the model's expected chat template.
  2. Some VLM models are limited to processing a single image per prompt. For these models, always set max_images=1. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring interleave=False.
  3. Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.

Tested VLM Models

So far, we have most notably tested the implementation with the following models:

  • llava-hf/llava-1.5-7b-hf
  • llava-hf/llava-v1.6-mistral-7b-hf
  • Qwen/Qwen2-VL-2B-Instruct
  • HuggingFaceM4/idefics2 (requires the latest transformers from source)

New Tasks

Several new tasks have been contributed to the library for this version!

New tasks as of v0.4.5 include:

The release also includes several small fixes and changes to existing tasks (as noted via the incrementing of task versions).

Backwards Incompatibilities

Finalizing group versus tag split

We've now fully deprecated the use of group keys directly within a task's configuration file. In most cases, the appropriate key to use instead is now tag. See the v0.4.4 patch notes for more information on migration if you maintain a set of task YAMLs outside the Eval Harness repository.

Handling of Causal vs. Seq2seq backend in HFLM

In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, and the vast majority of current LMs) models previously hinged on a check for self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM. Some users may want to use causal model behavior, but set self.AUTO_MODEL_CLASS to a different factory class, such as transformers.AutoModelForVision2Seq.

As a result, users who subclass HFLM but do not call HFLM.__init__() may now also need to set the self.backend attribute themselves, to either "causal" or "seq2seq", during initialization.

While this should not affect the vast majority of users, those who subclass HFLM in potentially advanced ways should see https://github.com/EleutherAI/lm-evaluation-harness/pull/2353 for the full set of changes.
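For illustration only, a hypothetical subclass that bypasses HFLM.__init__() and therefore has to declare the backend itself might look like the sketch below; the class name and structure are assumptions for this example, not code from the harness.

```python
# Hypothetical sketch of an advanced HFLM subclass that skips HFLM.__init__()
# and therefore must set self.backend itself ("causal" or "seq2seq").
from lm_eval.api.model import LM
from lm_eval.models.huggingface import HFLM

class CustomVision2SeqLM(HFLM):
    def __init__(self, pretrained: str, **kwargs):
        LM.__init__(self)  # deliberately skip HFLM.__init__ (advanced use case)
        # ... build the model, tokenizer/processor, etc. as this subclass needs ...
        # Required after this change: declare how inputs are constructed,
        # instead of relying on the old AUTO_MODEL_CLASS check.
        self.backend = "causal"
```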

Future Plans

We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!

Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)

What's Changed

New Contributors

Full Changelog: EleutherAI/lm-evaluation-harness@v0.4.4...v0.4.5


Configuration

📅 Schedule: Branch creation - "after 5am on saturday" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever the PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

To execute skipped test pipelines, write the comment /ok-to-test.

This PR has been generated by MintMaker (powered by Renovate Bot).

Signed-off-by: red-hat-konflux <126015336+red-hat-konflux[bot]@users.noreply.github.com>
openshift-ci bot requested review from heyselbi and tarukumar on December 21, 2024, 20:12

openshift-ci bot commented Dec 21, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: red-hat-konflux[bot]
Once this PR has been reviewed and has the lgtm label, please assign danielezonca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Dec 21, 2024

Hi @red-hat-konflux[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
