We provide installation instructions for:
- Setting up environments for inference with Video-LMMs
- Downloading and setting up model weights (if required) for Video-LMMs
Note: instructions are borrowed from the TimeChat GitHub repository.
- Run the following commands to install the environment for TimeChat:

```shell
cd Video-LMMs-Inference/TimeChat

# First, install ffmpeg
apt update
apt install ffmpeg

# Then, create and activate a conda environment
conda env create -f environment.yml
conda activate timechat

# Install PyTorch with CUDA 11.3 support
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
```
- Follow the instructions below to set up the model weights for TimeChat.

Download the EVA-ViT and InstructBLIP checkpoints:

```shell
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth
```

Use git-lfs to download the weights of Video-LLaMA (7B):

```shell
git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned
```

Download the instruction-tuned TimeChat-7B weights:

```shell
git lfs install
git clone https://huggingface.co/ShuhuaiRen/TimeChat-7b
```
The resulting file structure looks like:

```
TimeChat/ckpt/
|-- Video-LLaMA-2-7B-Finetuned/
|   |-- llama-2-7b-chat-hf/
|   |-- VL_LLaMA_2_7B_Finetuned.pth
|-- instruct-blip/
|   |-- instruct_blip_vicuna7b_trimmed.pth
|-- eva-vit-g/
|   |-- eva_vit_g.pth
|-- timechat/
|   |-- timechat_7b.pth
```
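Before launching inference, it can help to confirm the checkpoints landed where this layout expects them. A minimal sketch (the helper name and script are ours, not part of the TimeChat repo):

```python
import os

# Checkpoint files expected under TimeChat/ckpt/, per the layout above
EXPECTED = [
    "Video-LLaMA-2-7B-Finetuned/VL_LLaMA_2_7B_Finetuned.pth",
    "instruct-blip/instruct_blip_vicuna7b_trimmed.pth",
    "eva-vit-g/eva_vit_g.pth",
    "timechat/timechat_7b.pth",
]

def missing_checkpoints(ckpt_root):
    """Return the expected checkpoint files that are absent under ckpt_root."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(ckpt_root, p))]

if __name__ == "__main__":
    missing = missing_checkpoints("TimeChat/ckpt")
    if missing:
        print("Missing checkpoints:", *missing, sep="\n  ")
    else:
        print("All checkpoints in place.")
```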
Note: instructions are borrowed from the Video-LLaVA GitHub repository.
- Run the following commands to install the environment for Video-LLaVA.

The following requirements must be met for a successful installation:
- Python >= 3.10
- PyTorch == 2.0.1
- CUDA Version >= 11.7

Install the required packages:

```shell
cd Video-LMMs-Inference/Video-LLaVA

# Create and activate a conda environment
conda create -n videollava python=3.10 -y
conda activate videollava

# Install the package and its training extras
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
```
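The version floors above can be checked programmatically before installing. A small sketch (a plain tuple comparison, nothing Video-LLaVA-specific):

```python
import sys

def meets_minimum(version, minimum):
    """True if a (major, minor) version tuple satisfies the given floor."""
    return tuple(version) >= tuple(minimum)

if __name__ == "__main__":
    # Video-LLaVA asks for Python >= 3.10
    if meets_minimum(sys.version_info[:2], (3, 10)):
        print("Python version OK")
    else:
        print("Python too old; need >= 3.10")
```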
Model weights: Video-LLaVA downloads its weights automatically on the first run, so there is no need to fetch them manually.
Note: We use the Google Cloud platform to perform inference with the Gemini model. Specifically, you need to set up the following:
- Configure a project on Google Cloud, or use an existing one (see the Google Cloud documentation for details).
- Create a Google Cloud Storage bucket and upload the CVRR-ES dataset to it.
- Run the following commands to install the packages:

```shell
conda create -n gemini python=3.10 -y
conda activate gemini
pip install --upgrade google-cloud-aiplatform
gcloud auth application-default login
```
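Once the project and bucket are in place, a Gemini query on a video stored in the bucket can be sketched as below. This is our illustration, not the repo's inference script: the project ID, bucket name, file path, and model name are placeholders you must replace, and the call only runs with valid application-default credentials.

```python
import os

def gcs_uri(bucket, blob_path):
    """Build the gs:// URI that Vertex AI expects for a file in a bucket."""
    return f"gs://{bucket}/{blob_path.lstrip('/')}"

def main():
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    # Placeholders: substitute your own project, region, bucket, and video path
    vertexai.init(project="your-project-id", location="us-central1")
    model = GenerativeModel("gemini-1.0-pro-vision")

    video = Part.from_uri(
        gcs_uri("your-cvrr-es-bucket", "videos/example.mp4"),
        mime_type="video/mp4",
    )
    response = model.generate_content(
        [video, "Describe the unusual activity in this video."]
    )
    print(response.text)

# Guarded so importing this file never triggers a billable API call
if __name__ == "__main__" and os.environ.get("RUN_GEMINI_DEMO"):
    main()
```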
- Run the following commands to install the packages for GPT-4(V):

```shell
conda create -n gpt4v python=3.10 -y
conda activate gpt4v

# Install the OpenAI Python client
pip install openai==1.13.3
```
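With the client installed, sending sampled video frames to a GPT-4 vision model looks roughly like the sketch below. The helper, frame paths, and model name are our placeholders (not the repo's code); the request format is the standard base64 data-URI style of the OpenAI chat API, and the call itself requires `OPENAI_API_KEY` to be set.

```python
import base64
import os

def frames_to_messages(prompt, jpeg_frames):
    """Pack a text prompt plus raw JPEG frame bytes into the chat-message
    format the OpenAI vision endpoints accept (base64 data URIs)."""
    content = [{"type": "text", "text": prompt}]
    for frame in jpeg_frames:
        b64 = base64.b64encode(frame).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]

def main():
    from openai import OpenAI  # openai==1.13.3
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder frame paths: use frames sampled from a CVRR-ES video
    frames = [open(p, "rb").read() for p in ["frame_0.jpg", "frame_1.jpg"]]
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # use a vision model your account can access
        messages=frames_to_messages("Describe what happens across these frames.", frames),
        max_tokens=300,
    )
    print(response.choices[0].message.content)

# Guarded so importing this file never triggers a billable API call
if __name__ == "__main__" and os.environ.get("RUN_GPT4V_DEMO"):
    main()
```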