Improving CLIP Training with Language Rewrites
Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
The CLIP Model is Secretly an Image-to-Prompt Converter
Optimizing Prompts for Text-to-Image Generation
Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing
Visual Instruction Inversion: Image Editing via Image Prompting
Tuning Multi-mode Token-level Prompt Alignment across Modalities
SwapPrompt: Test-Time Prompt Adaptation for Vision-Language Models
LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
Subject-driven Text-to-Image Generation via Apprenticeship Learning
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
Norm-guided latent space exploration for text-to-image generation
Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation
DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
StyleDrop: Text-to-Image Synthesis of Any Style
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
Conditional Score Guidance for Text-Driven Image-to-Image Translation
TextDiffuser: Diffusion Models as Text Painters
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Cross-modal Active Complementary Learning with Self-refining Correspondence
Test-Time Distribution Normalization for Contrastively Learned Visual-language Models
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
An Inverse Scaling Law for CLIP Training
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
ChatGPT-Powered Hierarchical Comparisons for Image Classification
Learning Mask-aware CLIP Representations for Zero-Shot Segmentation
What Makes Good Examples for Visual In-Context Learning?
Towards Consistent Video Editing with Text-to-Image Diffusion Models
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks