Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

[📖 arXiv Paper] [📊 Dataset][🏆 Models]

🔥 Update

[12/26]🔥SliME is supported by VLMEvalKit and LMMs-Eval. Feel free to use it without hesitation!
[10/26]🔥SliME-8B achieves better high-resolution understanding performance on MME-RealWorld compared to Mini-Gemini and LLaVA-Next.
[07/16]🔥The SliME strategy extends to video analysis (see Slime_video.md). Although the model was never trained on video data, it can process up to 8 frames, and on the Video-MME benchmark it surpasses numerous 7B/8B baselines that were trained on video datasets.
[06/11]🔥SliME is coming! We release the paper, code, models, and data for SliME!
[06/11]🔥SliME-70B will be released soon.

🔮 Install

Please follow the instructions below to install the required packages.

Clone this repository

git clone https://github.com/yfzhang114/SliME.git

Install Package

conda create -n slime python=3.10 -y
conda activate slime
cd SliME
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install ninja
pip install datasets
pip install flash-attn --no-build-isolation
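
After the commands above finish, a quick import check can confirm that the training extras are visible to the active environment. The snippet below is a minimal sketch and not part of the SliME repository: the helper name `find_missing` is hypothetical, and the module names are assumed to be the import names of the `ninja`, `datasets`, and `flash-attn` packages installed above.

```python
import importlib.util

# Assumed import names for the packages installed above
# (flash-attn is imported as flash_attn).
REQUIRED_MODULES = ["ninja", "datasets", "flash_attn"]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All training packages were found.")
```

Run it inside the `slime` conda environment; any name it reports can be reinstalled with the matching `pip install` line.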

🔍 Model

We provide all our fully finetuned models on Stage 1/2 and 3 data for SliME:

Model	Base LLM	Vision Encoder	Finetuning Data	Finetuning schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	SharedGPT+SMR	full_ft	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	SharedGPT+SMR	LoRA	ckpt

Here are the pretrained weights on Stage 1/2 data only:

Model	Base LLM	Vision Encoder	Pretrain Data	Finetuning schedule	Download
SliME-7B	Vicuna-7B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-8B	Llama-3-8B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-13B	Vicuna-13B-v1.5	CLIP-L	LLaVA-Pretrain	1e	ckpt
SliME-70B	Llama-3-70B-Instruct	CLIP-L	LLaVA-Pretrain	1e	ckpt

🔮 Preparation

Dataset

Please follow LLaVA and ShareGPT4V to prepare the corresponding images and data.

SMR data structure

data
├── arxivqa
│   └── images
├── DVQA
│   └── images
├── Geometry3K
│   └── 0-2400 dirs
├── ChartQA
│   └── train_images
├── GeoQA3
│   ├── image
│   └── json
├── mathvision
├── scienceqa
├── tabmwp
├── GeoQA3
│   ├── train
│   ├── test
│   └── val
├── ai2d
│   ├── abc_images
│   └── images
└── geoqa+
    └── images
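
Before training, it can help to check that a local `data` folder matches the layout above. The following is a minimal sketch under stated assumptions: the helper `missing_subdirs` is hypothetical (it is not part of the SliME code), the directory list is copied from the tree, and only the top-level `GeoQA3` folder is checked because its sub-folders are listed twice with different contents.

```python
from pathlib import Path

# Sub-directories copied from the tree above; adjust if your local layout differs.
EXPECTED_SUBDIRS = [
    "arxivqa/images",
    "DVQA/images",
    "Geometry3K",
    "ChartQA/train_images",
    "GeoQA3",
    "mathvision",
    "scienceqa",
    "tabmwp",
    "ai2d/abc_images",
    "ai2d/images",
    "geoqa+/images",
]


def missing_subdirs(root, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    for rel in missing_subdirs("data"):
        print("missing:", rel)
```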

You can find the pre-processing code at this URL. If you have any questions about file names or image paths, please refer to the pre-processing code.

Arxiv QA

Download images using this download url.

python playground/data/process_arxivqa.py

DVQA

Download images using this url.

ChartQA

Clone this repo, then extract all the training images in ChartQA_Dataset/train/png into ChartQA.
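
The extraction step can be scripted. This is a minimal sketch, not the authors' tooling: `copy_chartqa_train_images` is a hypothetical helper, and the destination is left as a parameter because the text says `ChartQA` while the data tree lists `ChartQA/train_images`.

```python
import shutil
from pathlib import Path


def copy_chartqa_train_images(src_dir, dst_dir):
    """Copy every .png file from src_dir into dst_dir and return how many were copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for png in sorted(src.glob("*.png")):
        shutil.copy2(png, dst / png.name)
        copied += 1
    return copied
```

For example, `copy_chartqa_train_images("ChartQA_Dataset/train/png", "data/ChartQA/train_images")` would follow the data tree, assuming the cloned repo sits in the current directory.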

Geometry3K

Download images using this url.

The image path in our json file will be os.path.join(f'Geometry3K/{i}', 'img_diagram.png'), where i is the index of the problem directory.
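
To verify the download, the path pattern can be expanded over the directory indices and checked on disk. This is a minimal sketch under stated assumptions: `i` in the expression above is taken to be the numeric directory name, both helper names are hypothetical, and the index range is left to the caller because the tree only says "0-2400 dirs".

```python
import os


def geometry3k_image_path(i):
    """Relative image path for problem directory i, following the pattern above."""
    return os.path.join(f"Geometry3K/{i}", "img_diagram.png")


def missing_geometry3k_images(data_root, indices):
    """Return the indices whose img_diagram.png is not present under data_root."""
    return [
        i
        for i in indices
        if not os.path.isfile(os.path.join(data_root, geometry3k_image_path(i)))
    ]
```

Calling `missing_geometry3k_images("data", range(0, 2401))` would list the directories without a diagram, if the indices really do run from 0 to 2400 inclusive.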

GeoQA3