MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

arXiv | Project Page | Dataset | Leaderboard

🔔 News

🔥[2024-10-18]: Data and evaluation code are now available. We will continue to add evaluation pipelines for more models.

📄[2024-10-14]: Paper released on arXiv.

Table of Contents

  1. Introduction
  2. Setup
  3. Evaluation
  4. Contact and Citation

Introduction

MEGA-Bench is an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, addressing the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a diverse and rich set of multimodal tasks while enabling cost-effective and accurate model evaluation.

MEGA-Bench Taxonomy Tree

Key features of MEGA-Bench:

  • 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators
  • Wide range of output formats including numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc.
  • Over 40 metrics developed to evaluate these diverse tasks
  • Fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill)
  • Interactive visualization of model capabilities

Unlike existing benchmarks that unify problems into standard multi-choice questions, MEGA-Bench embraces the diversity of real-world tasks and their output formats. This allows for a more comprehensive evaluation of vision-language models across various dimensions.

Setup

Clone the repository:

# Clone the repository with the task examples
git clone https://github.com/TIGER-AI-Lab/MEGA-Bench.git
# Or skip the task example files hosted via Git LFS
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/TIGER-AI-Lab/MEGA-Bench.git

cd MEGA-Bench
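
If you cloned with GIT_LFS_SKIP_SMUDGE=1 and later need the task example files, they can be fetched afterwards (assuming git-lfs is installed):

# Fetch the Git LFS objects that were skipped at clone time
git lfs pull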

Dataset download

The MEGA-Bench dataset is now available on Hugging Face:

MEGA-Bench Dataset

Since the Hugging Face Datasets viewer cannot visualize large rows containing many images, the HF dataset only stores the file paths of images/videos. Please download the data archive, unzip it, and set up the data path with the following commands:

wget https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench/resolve/main/data.zip?download=true -O data.zip
unzip data.zip -d megabench
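
Alternatively, the same archive can be fetched with the Hugging Face CLI (a sketch, assuming a recent huggingface_hub is installed):

# Download data.zip from the dataset repo, then unzip it into megabench/
huggingface-cli download TIGER-Lab/MEGA-Bench data.zip --repo-type dataset --local-dir .
unzip data.zip -d megabench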

Environment Setup

First, set up the environment with the following commands. The packages are mainly for the evaluation metrics used in MEGA-Bench.

conda create -n megabench python=3.12
conda activate megabench
pip install -r requirements.txt
python -c "import nltk; nltk.download('wordnet'); nltk.download('punkt')"

Evaluation

📌 Note: Due to slight reorganization of the prompt and cleanup for uploading to Hugging Face Datasets, the evaluation results from this repository may differ slightly from those reported in our paper and leaderboard. However, the overall performance trend and capability report should remain consistent.

For several popular models, we have converted the evaluation results used in our paper into the format of this repository; please find them in megabench/results/paper_results_converted.

The table below lists information about supported models in this repository. See megabench/models/model_type.py for the full list. We will add code for more models in the future.

| Model Name | Type name for command line | Dependency |
|---|---|---|
| GPT-4o | GPT_4O_0513, GPT_4O_0806 | OpenAI API |
| GPT-4o-mini | GPT_4O_MINI | OpenAI API |
| Claude 3 Haiku | CLAUDE_3_HAIKU | Anthropic API |
| Claude 3.5 Sonnet | CLAUDE_3_5_SONNET | Anthropic API |
| Gemini 1.5 Pro | GEMINI_PRO, GEMINI_PRO_002 | Google API |
| Gemini 1.5 Flash | GEMINI_FLASH, GEMINI_FLASH_002 | Google API |
| Qwen2-VL-72B | QWEN2_VL_72B | vLLM |
| Qwen2-VL-7B | QWEN2_VL_7B | vLLM |
| InternVL2-Llama3-76B | INTERNVL2_LLAMA3_76B | vLLM |
| InternVL2-8B | INTERNVL2_8B | vLLM |
| Llava-OneVision-72B | LLAVA_ONEVISION_72B | HF Transformers |
| Llava-OneVision-7B | LLAVA_ONEVISION_7B | HF Transformers |
| Pixtral-12B | PIXTRAL_12B | vLLM |
| Phi-3.5-vision | PHI_3_5_VISION | HF Transformers |

Proprietary Models (GPT, Claude, Gemini)

To run with GPT, Claude, or Gemini models, set up the corresponding API key:

export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export GOOGLE_API_KEY=<your_gemini_api_key>

Example commands for running evaluation with GPT-4o (0513), Claude-3.5-Sonnet, or Gemini-1.5-Pro-002 on the Core subset, using multiprocessing with 2 subprocesses:

cd megabench

# GPT-4o (0513)
python main.py --model_type GPT_4O_0513 --output_file results/GPT-4o-0513/all_query_responses.json --print_response --dataset_subset_name core --multiprocess --processes 2

# Claude-3.5-Sonnet
python main.py --model_type CLAUDE_3_5_SONNET --output_file results/Claude-3.5-Sonnet/all_query_responses.json --print_response --dataset_subset_name core --multiprocess --processes 2

# Gemini-1.5-Pro-002
python main.py --model_type GEMINI_PRO_002 --output_file results/Gemini-1.5-Pro-002/all_query_responses.json --print_response --dataset_subset_name core --multiprocess --processes 2 

To run with the Open-ended subset, set --dataset_subset_name open. The evaluation processor evaluates all tasks found in the response output file, so if you want to score the Core subset on its own, use different output file paths for the Core and Open-ended subsets.

To evaluate the Open-ended subset, you need to set up the OpenAI API key first.
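
For example, an Open-ended run can write to its own output file so that the Core responses are scored separately (the paths below are illustrative; only flags already shown in this README are used):

# Open-ended subset, written to a separate output file from the Core run
python main.py --model_type GPT_4O_0513 --output_file results/GPT-4o-0513-open/all_query_responses.json --print_response --dataset_subset_name open --multiprocess --processes 2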

Open-source Models (Qwen2VL, InternVL2, Llava-OneVision)

To run with Qwen2-VL or InternVL2, first install the latest vLLM:

pip install vllm -U

Example commands for running evaluation with Qwen2-VL or InternVL2 on the Core subset:

cd megabench

# InternVL2-8B
python main.py --model_type INTERNVL2_8B \
   --output_file results/InternVL2_8B/all_query_responses.json \
   --print_response --ngpus 4 --gpu_utils 0.9 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core

# Qwen2VL-7B
python main.py --model_type QWEN2_VL_7B \
   --output_file results/Qwen2_VL_7B/all_query_responses.json \
   --print_response --ngpus 4 --gpu_utils 0.9 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core

To run with Llava-OneVision, first install the LLaVA-NeXT repository (https://github.com/LLaVA-VL/LLaVA-NeXT); we found that the version supported by vLLM performs much worse than the official implementation. Then run the following command:

python main.py --model_type LLAVA_ONEVISION_7B \
   --output_file results/Llava_OneVision_7B/all_query_responses.json \
   --print_response --dataset_subset_name core

To evaluate the Open-ended subset, change the flag to --dataset_subset_name open.

Get multi-dimensional breakdown analysis

We provide a script for multi-dimensional breakdown analysis (producing results like those on our leaderboard hosted on Hugging Face).

For example, the results in results/Qwen2_VL_7B/ were generated by running the following commands sequentially. The files in results/Qwen2_VL_7B/analysis/ contain the task-level evaluation results and the multi-dimensional breakdown results for all keywords across the five dimensions (application, skills, input format, output format, and number of visual inputs).

# Run the query for the Core subset, don't evaluate at this time
python main.py --model_type QWEN2_VL_7B \
   --output_file results/Qwen2_VL_7B/all_query_responses.json \
   --print_response --ngpus 4 --gpu_utils 0.9 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core  --query_only

# Run the query for the Open subset, run evaluation after finishing all queries
python main.py --model_type QWEN2_VL_7B \
   --output_file results/Qwen2_VL_7B/all_query_responses.json \
   --print_response --ngpus 4 --gpu_utils 0.9 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name open

# Run the multi-dimensional breakdown analysis
python tools/derive_breakdown_results.py --input_dir results/Qwen2_VL_7B/

Ground truth sanity check

The ground-truth sanity check verifies the validity of our rule-based evaluation metrics (for the Core tasks). It creates an "oracle model" that composes its response from the ground-truth answer, and we evaluate those responses to see whether the oracle gets full scores; anything below a full score indicates that an evaluation metric is not working properly.

Run the ground-truth sanity check with the following command:

python main.py \
   --model_type GROUND_TRUTH_ORACLE_SANITY_CHECK \
   --output_file results/Ground_truth_oracle_sanity_check/all_query_responses.json \
   --force_regenerate \
   --multiprocess --processes 48 \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core

This should produce full scores (i.e., 1.0) for all Core tasks, which helps verify the validity of the metric implementations.
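
If any task scores below 1.0, re-running only that task can help localize the metric issue (a sketch; the task name and output path below are placeholders):

# Re-run the oracle on a single task to debug its metric (replace <task_name>)
python main.py \
   --model_type GROUND_TRUTH_ORACLE_SANITY_CHECK \
   --output_file results/Ground_truth_oracle_sanity_check/single_task_responses.json \
   --task_name "<task_name>" --force_regenerate \
   --dataset_name TIGER-Lab/MEGA-Bench \
   --dataset_subset_name core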

Detailed explanation of command line arguments

The launch script main.py has the following arguments:

| Argument | Description | Default |
|---|---|---|
| --model_type | Type of model to use | "GPT_4O_MINI" |
| --model_path | Custom model path for local open-source models | None |
| --ngpus | Number of GPUs to use (for vLLM models) | None |
| --gpu_utils | GPU memory utilization (0.0 to 1.0, for vLLM models) | None |
| --output_file | Path for the query responses output | None |
| --output_score_filename | Filename for evaluation scores | "data_with_scores.json" |
| --task_name | Name of a specific task to process; if specified, the pipeline runs only that task | None |
| --force_regenerate | Force regeneration of answers | False |
| --query_only | Perform only the model query, without evaluation | False |
| --evaluation_only | Perform only the evaluation step | False |
| --multiprocess | Enable multiprocessing (for API-based proprietary models) | False |
| --processes | Number of processes for multiprocessing | 2 |
| --print_response | Print the model's response (helpful for debugging) | False |
| --dataset_name | Name of the dataset | "TIGER-Lab/MEGA-Bench" |
| --dataset_subset_name | Subset of the dataset to use | "core" |
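
As a sketch of how these flags combine (paths are illustrative), an existing response file can be re-scored without querying the model again:

# Re-run only the scoring step on responses that were already collected
python main.py --model_type QWEN2_VL_7B \
   --output_file results/Qwen2_VL_7B/all_query_responses.json \
   --evaluation_only --output_score_filename data_with_scores.json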

Evaluate Open-ended tasks with a VLM judge

By default, we use GPT-4o-2024-08-06 as the VLM judge for evaluating the Open-ended tasks. To use a different VLM as the judge for the Open-ended subset, set the following environment variables:

export MEGABENCH_OPEN_API_KEY=<your_api_key>
export MEGABENCH_OPEN_API_MODEL=<your_vlm_model_name>
export MEGABENCH_OPEN_API_URL=<your_vlm_api_url>
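
For example (the values below are purely illustrative; the expected model name and URL format depend on your provider and on how this repository's judge client issues requests):

export MEGABENCH_OPEN_API_KEY=<your_api_key>
export MEGABENCH_OPEN_API_MODEL=gpt-4o-2024-08-06
export MEGABENCH_OPEN_API_URL=https://api.openai.com/v1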

Contact and Citation

For any questions or concerns, please contact:

If you find this work useful for your research, please consider citing our paper:

@article{chen2024mega-bench,
  title={MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks},
  author={Chen, Jiacheng and Liang, Tianhao and Siu, Sherman and Wang, Zhengqing and Wang, Kai and Wang, Yubo and Ni, Yuansheng and Zhu, Wang and Jiang, Ziyan and Lyu, Bohan and Jiang, Dongfu and He, Xuan and Liu, Yuan and Hu, Hexiang and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.10563},
  year={2024},
}