<h1 id="awesome-vlm-architectures-awesome">👁️🗨️Awesome VLM Architectures
<a href="https://awesome.re"><img src="https://awesome.re/badge.svg"
alt="Awesome" /></a></h1>
<figure>
<img
src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/5c9ee091-1f37-4d92-8398-a7d4e006c014"
alt="VLM" />
<figcaption aria-hidden="true">VLM</figcaption>
</figure>
<p><strong>Vision-Language Models (VLMs)</strong> feature a multimodal
architecture that processes image and text data simultaneously. They can
perform tasks such as <strong>Visual Question Answering (VQA)</strong>,
<strong>image captioning</strong>, and <strong>text-to-image
search</strong>. VLMs use techniques like multimodal fusion with
cross-attention, masked-language modeling, and image-text matching to
relate visual semantics to textual representations. This repository
contains information on famous Vision-Language Models (VLMs), including
details about their architectures, training procedures, and the datasets
used for training. <strong>Click to expand further details for each
architecture</strong> - 📙
<a href="https://github.com/gokayfem/ComfyUI_VLM_nodes">Visit my other
repo to try Vision Language Models on ComfyUI</a></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#architectures">Architectures</a></li>
<li><a href="#important-references">Important References</a></li>
</ul>
<h2 id="models">Models</h2>
|
||
<p><a
|
||
href="#llava-large-language-and-vision-assistant---visual-instruction-tuning">LLaVA</a>
|
||
| <a
|
||
href="#llava-15-improved-baselines-with-visual-instruction-tuning">LLaVA
|
||
1.5</a> | <a
|
||
href="#llava-16-llava-next-improved-reasoning-ocr-and-world-knowledge">LLaVA
|
||
1.6</a> | <a
|
||
href="#paligemma-a-versatile-and-transferable-3b-vision-language-model">PaliGemma</a>
|
||
| <a
|
||
href="#paligemma-2-a-family-of-versatile-vlms-for-transfer">PaliGemma
|
||
2</a> | <a
|
||
href="#aimv2-multimodal-autoregressive-pre-training-of-large-vision-encoders">AIMv2</a>
|
||
| <a
|
||
href="#apollo-an-exploration-of-video-understanding-in-large-multimodal-models">Apollo</a>
|
||
| <a
|
||
href="#aria-an-open-multimodal-native-mixture-of-experts-model">ARIA</a>
|
||
| <a href="#eve-unveiling-encoder-free-vision-language-models">EVE</a> |
|
||
<a
|
||
href="#evev2-improved-baselines-for-encoder-free-vision-language-models">EVEv2</a>
|
||
| <a
|
||
href="#janus-pro-unified-multimodal-understanding-and-generation-with-data-and-model-scaling">Janus-Pro</a>
|
||
| <a
|
||
href="#llava-cot-let-vision-language-models-reason-step-by-step">LLaVA-CoT</a>
|
||
| <a
|
||
href="#llm2clip-powerful-language-model-unlocks-richer-visual-representation">LLM2CLIP</a>
|
||
| <a
|
||
href="#maya-an-instruction-finetuned-multilingual-multimodal-model">Maya</a>
|
||
| <a
|
||
href="#minimax-01-scaling-foundation-models-with-lightning-attention">MiniMax-01</a>
|
||
| <a href="#nvlm-open-frontier-class-multimodal-llms">NVLM</a> | <a
|
||
href="#omnivlm-a-token-compressed-sub-billion-parameter-vision-language-model-for-efficient-on-device-inference">OmniVLM</a>
|
||
| <a
|
||
href="#pixtral-12b-a-cutting-edge-open-multimodal-language-model">Pixtral
|
||
12B</a> | <a
|
||
href="#sa2va-marrying-sam2-with-llava-for-dense-grounded-understanding-of-images-and-videos">Sa2VA</a>
|
||
| <a
|
||
href="#tarsier2-advancing-large-vision-language-models-from-detailed-video-description-to-comprehensive-video-understanding">Tarsier2</a>
|
||
| <a
|
||
href="#ui-tars-pioneering-automated-gui-interaction-with-native-agents">UI-TARS</a>
|
||
| <a
|
||
href="#videochat-flash-hierarchical-compression-for-long-context-video-modeling">VideoChat-Flash</a>
|
||
| <a
|
||
href="#videollama-3-frontier-multimodal-foundation-models-for-image-and-video-understanding">VideoLLaMA
|
||
3</a> | <a
|
||
href="#llama-32-vision-enhanced-multimodal-capabilities-built-on-llama-3">Llama
|
||
3.2-Vision</a> | <a
|
||
href="#smolvlm-a-small-efficient-and-open-source-vision-language-model">SmolVLM</a>
|
||
| <a href="#idefics">IDEFICS</a> | <a href="#idefics2">IDEFICS2</a> | <a
|
||
href="#idefics3-8b-building-and-better-understanding-vision-language-models">IDEFICS3-8B</a>
|
||
| <a
|
||
href="#internlm-xcomposer2-mastering-free-form-text-image-composition-and-comprehension-in-vision-language-large-model">InternLM-XComposer2</a>
|
||
| <a
|
||
href="#internlm-xcomposer2-4khd-a-pioneering-large-vision-language-model-handling-resolutions-from-336-pixels-to-4k-hd">InternLM-XComposer2-4KHD</a>
|
||
| <a
|
||
href="#internlm-xcomposer-25-a-versatile-large-vision-language-model-supporting-long-contextual-input-and-output">InternLM-XComposer-2.5</a>
|
||
| <a
|
||
href="#internvl-25-expanding-performance-boundaries-of-open-source-multimodal-models-with-model-data-and-test-time-scaling">InternVL
|
||
2.5</a> | <a
|
||
href="#deepseek-vl-towards-real-world-vision-language-understanding">DeepSeek-VL</a>
|
||
| <a
|
||
href="#deepseek-vl2-mixture-of-experts-vision-language-models-for-advanced-multimodal-understanding">DeepSeek-VL2</a>
|
||
| <a
|
||
href="#mantis-mastering-multi-image-understanding-through-interleaved-instruction-tuning">MANTIS</a>
|
||
| <a
|
||
href="#qwen-vl-a-versatile-vision-language-model-for-understanding-localization-text-reading-and-beyond">Qwen-VL</a>
|
||
| <a
|
||
href="#qwen2-vl-a-powerful-open-source-vision-language-model-for-image-and-video-understanding">Qwen2-VL</a>
|
||
| <a
|
||
href="#qwen25-vl-enhanced-vision-language-capabilities-in-the-qwen-series">Qwen2.5-VL</a>
|
||
| <a href="#moondream1-and-moondream2">moondream1</a> | <a
|
||
href="#moondream1-and-moondream2">moondream2</a> | <a
|
||
href="#moondream-next-compact-vision-language-model-with-enhanced-capabilities">Moondream-next</a>
|
||
| <a
|
||
href="#sphinx-x-scaling-data-and-parameters-for-a-family-of-multi-modal-large-language-models">SPHINX-X</a>
|
||
| <a href="#blip-bootstrapping-language-image-pre-training">BLIP</a> |
|
||
<a
|
||
href="#blip-2-bootstrapping-language-image-pre-training-with-frozen-image-encoders-and-large-language-models">BLIP-2</a>
|
||
| <a
|
||
href="#xgen-mm-blip-3-an-open-source-framework-for-building-powerful-and-responsible-large-multimodal-models">xGen-MM
|
||
(BLIP-3)</a> | <a
|
||
href="#instructblip-towards-general-purpose-vision-language-models-with-instruction-tuning">InstructBLIP</a>
|
||
| <a
|
||
href="#kosmos-1-language-is-not-all-you-need-aligning-perception-with-language-models">KOSMOS-1</a>
|
||
| <a
|
||
href="#kosmos-2-grounding-multimodal-large-language-models-to-the-world">KOSMOS-2</a>
|
||
| <a
|
||
href="#convllava-hierarchical-backbones-as-visual-encoder-for-large-multimodal-models">ConvLLaVA</a>
|
||
| <a href="#parrot-multilingual-visual-instruction-tuning">Parrot</a> |
|
||
<a
|
||
href="#omg-llava-bridging-image-level-object-level-pixel-level-reasoning-and-understanding">OMG-LLaVA</a>
|
||
| <a
|
||
href="#evlm-an-efficient-vision-language-model-for-visual-understanding">EVLM</a>
|
||
| <a
|
||
href="#slowfast-llava-a-strong-training-free-baseline-for-video-large-language-models">SlowFast-LLaVA</a>
|
||
| <a href="#nous-hermes-2-vision---mistral-7b">Nous-Hermes-2-Vision -
|
||
Mistral 7B</a> | <a
|
||
href="#tinygpt-v-efficient-multimodal-large-language-model-via-small-backbones">TinyGPT-V</a>
|
||
| <a
|
||
href="#covlm-composing-visual-entities-and-relationships-in-large-language-models-via-communicative-decoding">CoVLM</a>
|
||
| <a href="#glamm-pixel-grounding-large-multimodal-model">GLaMM</a> | <a
|
||
href="#cosmo-contrastive-streamlined-multimodal-model-with-interleaved-pre-training">COSMO</a>
|
||
| <a href="#firellava">FireLLaVA</a> | <a
|
||
href="#u-llava-unifying-multi-modal-tasks-via-large-language-model">u-LLaVA</a>
|
||
| <a
|
||
href="#moe-llava-mixture-of-experts-for-large-vision-language-models">MoE-LLaVA</a>
|
||
| <a
|
||
href="#bliva-a-simple-multimodal-llm-for-better-handling-of-text-rich-visual-questions">BLIVA</a>
|
||
| <a
|
||
href="#mobilevlm-a-fast-strong-and-open-vision-language-assistant-for-mobile-devices">MobileVLM</a>
|
||
| <a
|
||
href="#frozen-multimodal-few-shot-learning-with-frozen-language-models">FROZEN</a>
|
||
| <a
|
||
href="#flamingo-a-visual-language-model-for-few-shot-learning">Flamingo</a>
|
||
| <a
|
||
href="#openflamingo-an-open-source-framework-for-training-large-autoregressive-vision-language-models">OpenFlamingo</a>
|
||
| <a
|
||
href="#pali-a-jointly-scaled-multilingual-language-image-model">PaLI</a>
|
||
| <a
|
||
href="#pali-3-vision-language-models-smaller-faster-stronger">PaLI-3</a>
|
||
| <a href="#palm-e-an-embodied-multimodal-language-model">PaLM-E</a> |
|
||
<a
|
||
href="#minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models">MiniGPT-4</a>
|
||
| <a
|
||
href="#minigpt-v2-large-language-model-as-a-unified-interface-for-vision-language-multi-task-learning">MiniGPT-v2</a>
|
||
| <a
|
||
href="#llava-plus-learning-to-use-tools-for-creating-multimodal-agents">LLaVA-Plus</a>
|
||
| <a href="#bakllava">BakLLaVA</a> | <a
|
||
href="#cogvlm-visual-expert-for-pretrained-language-models">CogVLM</a> |
|
||
<a
|
||
href="#cogvlm2-enhanced-vision-language-models-for-image-and-video-understanding">CogVLM2</a>
|
||
| <a
|
||
href="#ferret-refer-and-ground-anything-anywhere-at-any-granularity">Ferret</a>
|
||
| <a href="#fuyu-8b-a-multimodal-architecture-for-ai-agents">Fuyu-8B</a>
|
||
| <a href="#otterhd-a-high-resolution-multi-modality-model">OtterHD</a>
|
||
| <a
|
||
href="#sphinx-the-joint-mixing-of-weights-tasks-and-visual-embeddings-for-multi-modal-large-language-models">SPHINX</a>
|
||
| <a
|
||
href="#eagle-2-building-post-training-data-strategies-from-scratch-for-frontier-vision-language-models">Eagle
|
||
2</a> | <a
|
||
href="#eagle-exploring-the-design-space-for-multimodal-llms-with-mixture-of-encoders">EAGLE</a>
|
||
| <a
|
||
href="#vita-towards-open-source-interactive-omni-multimodal-llm">VITA</a>
|
||
| <a
|
||
href="#llava-onevision-easy-visual-task-transfer">LLaVA-OneVision</a> |
|
||
<a
|
||
href="#minicpm-o-26-a-gpt-4o-level-mllm-for-vision-speech-and-multimodal-live-streaming">MiniCPM-o-2.6</a>
|
||
| <a href="#minicpm-v-a-gpt-4v-level-mllm-on-your-phone">MiniCPM-V</a> |
|
||
<a
|
||
href="#inf-llava-high-resolution-image-perception-for-multimodal-large-language-models">INF-LLaVA</a>
|
||
| <a
|
||
href="#florence-2-a-deep-dive-into-its-unified-architecture-and-multi-task-capabilities">Florence-2</a>
|
||
| <a
|
||
href="#multiinstruct-improving-multi-modal-zero-shot-learning-via-instruction-tuning">MULTIINSTRUCT</a>
|
||
| <a href="#mousi-poly-visual-expert-vision-language-models">MouSi</a> |
|
||
<a
|
||
href="#lavin-cheap-and-quick-efficient-vision-language-instruction-tuning-for-large-language-models">LaVIN</a>
|
||
| <a href="#clip-contrastive-language-image-pre-training">CLIP</a> | <a
|
||
href="#metaclip-demystifying-clip-data">MetaCLIP</a> | <a
|
||
href="#alpha-clip-a-clip-model-focusing-on-wherever-you-want">Alpha-CLIP</a>
|
||
| <a href="#glip-grounded-language-image-pre-training">GLIP</a> | <a
|
||
href="#imagebind-one-embedding-space-to-bind-them-all">ImageBind</a> |
|
||
<a
|
||
href="#siglip-sigmoid-loss-for-language-image-pre-training">SigLIP</a> |
|
||
<a
|
||
href="#vit-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale">ViT</a></p>
|
||
<h2 id="architectures">Architectures</h2>
|
||
<h2
|
||
id="llava-large-language-and-vision-assistant---visual-instruction-tuning"><strong>LLaVA:
|
||
Large Language and Vision Assistant - Visual Instruction
|
||
Tuning</strong></h2>
|
||
<p>LLaVA seamlessly integrates a pre-trained language model (Vicuna)
|
||
with a visual encoder (CLIP) using a simple linear layer, creating a
|
||
robust architecture capable of effectively processing and understanding
|
||
language-image instructions.</p>
|
||
<a href="https://arxiv.org/abs/2304.08485"><img
|
||
src="https://img.shields.io/badge/arXiv-2304.08485-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/haotian-liu/LLaVA"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://llava.hliu.cc/"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/722f0fbb-ea52-4a8a-ab1e-bec45ca7d04f" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>LLaVA</strong>: At the heart of LLaVA’s architecture is the
|
||
fusion of a pre-trained language model with a visual model, specifically
|
||
designed to process and understand language-image instruction data
|
||
effectively. This integration enables LLaVA to leverage the distinct
|
||
strengths of both models, employing the CLIP visual encoder for robust
|
||
image feature extraction and the Vicuna language model for intricate
|
||
language instruction processing. A noteworthy feature of this
|
||
architecture is the use of <strong>a simple linear layer</strong> that
|
||
bridges image features to the word embedding space, facilitating a
|
||
seamless alignment between visual and linguistic representations. The
|
||
training methodology of LLaVA is meticulously structured into a
|
||
two-stage instruction-tuning procedure. Initially, the model undergoes
|
||
pre-training focused on feature alignment, utilizing a carefully
|
||
filtered dataset to synchronize image features with LLM word embeddings.
|
||
Subsequently, the model is fine-tuned end-to-end on tailored tasks such
|
||
as multimodal chatbot functionalities and Science QA, with the aim of
|
||
refining its instruction-following prowess. This sophisticated training
|
||
regimen is underpinned by the use of multimodal instruction-following
|
||
data generated via GPT-4, converting image-text pairs into formats
|
||
conducive to instruction-following tasks. The alignment of text and
|
||
image data is innovatively achieved through <strong>a trainable
|
||
projection matrix</strong>, converting visual features into language
|
||
embedding tokens within a unified dimensional space, thereby enhancing
|
||
the model’s ability to encode vision and text cohesively. The datasets
|
||
deployed for LLaVA’s training and evaluation are strategically selected
|
||
to bolster its multimodal capabilities. The Filtered CC3M dataset serves
|
||
as the foundation for pre-training, aligning visual and language
|
||
features, while the LLaVA-Instruct-158K dataset generated using GPT-4 is
|
||
pivotal for fine-tuning the model on diverse multimodal tasks.
|
||
Additionally, the ScienceQA dataset plays a critical role in assessing
|
||
LLaVA’s proficiency in multimodal reasoning tasks, demonstrating the
|
||
model’s comprehensive training and its potential to significantly
|
||
advance the field of multimodal interaction and understanding.
|
||
</details>
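<p>To make the projection concrete, here is a minimal PyTorch sketch of the kind of trainable linear layer described above, mapping frozen CLIP patch features into the LLM word-embedding space; the dimensions and helper names are illustrative assumptions, not the reference implementation.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Assumed dimensions: CLIP ViT-L/14 patch-feature width and a Vicuna-7B
# hidden size; the real values depend on the exact checkpoints used.
VISION_DIM = 1024
LLM_DIM = 4096

# LLaVA bridges modalities with a single trainable linear projection:
# frozen CLIP patch features are mapped into the LLM word-embedding space
# and prepended to the text token embeddings.
projector = nn.Linear(VISION_DIM, LLM_DIM)

def build_multimodal_inputs(image_features, text_embeddings):
    """image_features: (batch, num_patches, VISION_DIM) from the frozen CLIP encoder.
    text_embeddings: (batch, num_text_tokens, LLM_DIM) from the LLM embedding table."""
    visual_tokens = projector(image_features)                # (batch, num_patches, LLM_DIM)
    return torch.cat([visual_tokens, text_embeddings], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
img = torch.randn(1, 256, VISION_DIM)
txt = torch.randn(1, 32, LLM_DIM)
print(build_multimodal_inputs(img, txt).shape)               # torch.Size([1, 288, 4096])
</code></pre>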
|
||
<h2
|
||
id="llava-1.5-improved-baselines-with-visual-instruction-tuning"><strong>LLaVA
|
||
1.5: Improved Baselines with Visual Instruction Tuning</strong></h2>
|
||
<p>LLaVA 1.5 enhances its multimodal understanding by replacing its
|
||
initial linear projection with a more powerful multi-layer perceptron
|
||
(MLP), enabling a deeper integration of visual features from
|
||
CLIP-ViT-L-336px and linguistic data.</p>
|
||
<a href="https://arxiv.org/abs/2310.03744"><img
|
||
src="https://img.shields.io/badge/arXiv-2310.03744-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a><br />
|
||
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/c7112b75-3b86-48a2-9c0f-f1dc1dc6ee06" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>LLaVA 1.5</strong>: This iteration introduces a refined
|
||
architecture that incorporates a CLIP-ViT-L-336px vision encoder
|
||
alongside <strong>a multi-layer perceptron (MLP) projection
|
||
layer</strong>. This combination not only boosts the model’s data
|
||
efficiency but also its performance across various benchmarks,
|
||
showcasing a leap in multimodal understanding. The architecture’s core
|
||
components, the CLIP-ViT-L for visual encoding and the MLP for
|
||
vision-language cross-modal connection, work synergistically to enhance
|
||
the model’s capacity to integrate and interpret visual and linguistic
|
||
inputs. Training methods have been optimized in LLaVA 1.5 to achieve
|
||
unprecedented performance on 11 benchmarks, utilizing a two-stage
|
||
approach that emphasizes efficient feature alignment and fine-tuning
|
||
with VQA data specifically tailored for academic tasks. The paper
|
||
highlights a shift towards more sophisticated multimodal alignment
|
||
techniques, <strong>replacing the original linear projection</strong>
|
||
with a more powerful <strong>MLP vision-language connector</strong>.
|
||
This strategic improvement facilitates a deeper and more nuanced
|
||
integration of visual and linguistic data. Moreover, the adoption of an
|
||
MLP-based vision-language connector for alignment fusion methods further
|
||
strengthens the model’s ability to merge visual and textual
|
||
representations effectively, ensuring closer alignment in the embedding
|
||
space. The utilization of datasets such as VQA-v2, GQA, and other
|
||
academic-task-oriented VQA datasets, enriched with OCR and region-level
|
||
perception data, underscores the model’s enhanced visual understanding
|
||
and reasoning capabilities. These datasets play a crucial role in
|
||
elevating LLaVA 1.5’s performance, enabling it to set new standards with
|
||
academic-task-oriented data. Through these advancements, LLaVA 1.5 not
|
||
only pushes the boundaries of multimodal learning but also sets a new
|
||
benchmark for future research in the field.
|
||
</details>
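<p>The MLP vision-language connector mentioned above can be pictured as a small stack of linear layers in place of LLaVA's single projection; the layer count and activation below are an illustrative assumption rather than the exact recipe.</p>
<pre><code class="language-python">import torch.nn as nn

# A two-layer MLP connector in place of the single linear projection;
# the depth and GELU activation are assumed for illustration.
def make_mlp_projector(vision_dim=1024, llm_dim=4096):
    return nn.Sequential(
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )
</code></pre>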
|
||
<h2
|
||
id="llava-1.6-llava-next-improved-reasoning-ocr-and-world-knowledge"><strong>LLaVA
|
||
1.6: LLaVA-NeXT Improved reasoning, OCR, and world
|
||
knowledge</strong></h2>
|
||
<p>LLaVA-NeXT advances on LLaVA-1.5 by incorporating high-resolution
|
||
image processing, enhancing visual reasoning and OCR capabilities, while
|
||
maintaining a data-efficient design through knowledge transfer from its
|
||
predecessor and a refined training process.</p>
|
||
<a href="https://llava-vl.github.io/blog/2024-01-30-llava-next/"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen,
|
||
Yong Jae Lee
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/032ef144-ec10-41da-80a1-2cecd66c86ee" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>LLaVA-NeXT</strong>: Represents a significant step forward in
|
||
the evolution of large language models with visual capabilities,
|
||
building upon the foundations laid by LLaVA-1.5. This model introduces
|
||
several enhancements aimed at improving image resolution, visual
|
||
reasoning, optical character recognition (OCR), and the integration of
|
||
world knowledge, all while retaining the minimalist and data-efficient
|
||
design of its predecessor. The architecture of LLaVA-NeXT is optimized
|
||
for high performance, supporting input image resolutions up to 672x672,
|
||
336x1344, and 1344x336 pixels. This improvement facilitates a more
|
||
detailed visual perception, which, coupled with an enhanced visual
|
||
instruction tuning data mixture, significantly bolsters the model’s
|
||
reasoning and OCR capabilities. Furthermore, LLaVA-NeXT achieves
|
||
efficient deployment through the use of SGLang, a feature that
|
||
underscores its design’s focus on performance and data
|
||
efficiency. Training LLaVA-NeXT requires less than 1 million visual
|
||
instruction tuning samples, leveraging the <strong>pre-trained
|
||
connector</strong> from LLaVA-1.5 for efficient knowledge transfer. The
|
||
training process, remarkably swift, utilizes 32 A100 GPUs and completes
|
||
in approximately one day, a testament to the model’s efficient design
|
||
and deployment strategy. The alignment techniques in LLaVA-NeXT are
|
||
particularly noteworthy, utilizing high-resolution images and a
|
||
high-quality data mixture to enhance the model’s capabilities in visual
|
||
conversation and instruction following. The model’s use of dynamic
|
||
high-resolution techniques, known as ‘AnyRes’, allows for effective
|
||
handling of images with varying resolutions, improving the model’s
|
||
overall visual understanding. The datasets employed in training
|
||
LLaVA-NeXT, including LAION-GPT-V, ShareGPT-4V, DocVQA, SynDog-EN,
|
||
ChartQA, DVQA, and AI2D, are meticulously chosen to augment the model’s
|
||
visual reasoning, OCR capabilities, and comprehension of charts and
|
||
diagrams. This strategic selection aims to elevate the model’s
|
||
performance across a wide range of multimodal tasks, emphasizing its
|
||
enhanced ability to process and understand complex visual information.
|
||
Through these improvements, LLaVA-NeXT sets a new benchmark for models
|
||
at the intersection of language and vision, offering unprecedented
|
||
capabilities in visual reasoning, OCR, and the application of world
|
||
knowledge in multimodal contexts.
|
||
</details>
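<p>As a rough illustration of the 'AnyRes' idea described above, the sketch below splits an arbitrary-resolution image into base-resolution tiles plus a downscaled overview, so shapes like 672x672 or 336x1344 become multiples of a 336px base; the tile size and the rounding-based grid choice are simplifying assumptions.</p>
<pre><code class="language-python">from PIL import Image

# Split an image into base-resolution tiles plus a global overview tile,
# mirroring the spirit of dynamic high-resolution handling.
def split_into_tiles(image, tile=336):
    cols = max(1, round(image.width / tile))
    rows = max(1, round(image.height / tile))
    resized = image.resize((cols * tile, rows * tile))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
            tiles.append(resized.crop(box))
    overview = image.resize((tile, tile))   # low-resolution global view
    return [overview] + tiles
</code></pre>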
|
||
<h2
|
||
id="paligemma-a-versatile-and-transferable-3b-vision-language-model"><strong>PaliGemma:
|
||
A Versatile and Transferable 3B Vision-Language Model</strong></h2>
|
||
<p>PaliGemma is a compact, open-source vision-language model designed to
|
||
be easily transferable to a diverse range of tasks. It combines a
|
||
powerful SigLIP image encoder with the Gemma-2B language model,
|
||
achieving strong performance on over 40 diverse tasks, including
|
||
standard VLM benchmarks, remote-sensing, and segmentation. PaliGemma is
|
||
pretrained using a multi-stage approach, focusing on maximizing the
|
||
density of learning signal and providing different checkpoints with
|
||
varying image resolutions. This versatile foundation model is easily
|
||
fine-tuned for specific tasks and serves as a valuable tool for
|
||
researchers and practitioners exploring the capabilities of VLMs.</p>
|
||
<p><a href="https://arxiv.org/pdf/2407.07726"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.07726-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/big-vision/paligemma"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov,
|
||
Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael
|
||
Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers,
|
||
Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil
|
||
Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra,
|
||
Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul
|
||
Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi
|
||
Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen,
|
||
Xiaohua Zhai</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/186371d0-6861-4b68-b32e-fee77cc75ef2" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
PaliGemma stands out as a highly versatile and transferable 3-billion
|
||
parameter Vision-Language Model (VLM) meticulously designed for broad
|
||
applicability across a wide spectrum of visual-language tasks. Its
|
||
foundation lies in the integration of two powerful components: a
|
||
SigLIP-So400m vision encoder, known for its exceptional performance
|
||
despite its compact size, and the Gemma-2B language model, a pretrained
|
||
autoregressive decoder-only model from the Gemma family. This
|
||
combination enables PaliGemma to effectively process and understand both
|
||
visual and textual information, making it adept at handling tasks
|
||
ranging from image captioning and visual question answering to more
|
||
specialized tasks like remote-sensing and segmentation. PaliGemma’s
|
||
architecture is streamlined and efficient. It uses a simple linear
|
||
projection to align the visual features extracted by the SigLIP encoder
|
||
with the vocabulary tokens of the Gemma language model, enabling
|
||
seamless fusion of the two modalities. A key aspect of PaliGemma’s
|
||
training is the emphasis on “density of learning signal,” prioritizing a
|
||
broad range of skills and knowledge over achieving high zero-shot
|
||
performance. This approach involves a multi-stage pretraining process
|
||
that starts with unimodal pretraining of individual components using
|
||
publicly available checkpoints, followed by extensive multimodal
|
||
pretraining on a diverse mixture of large-scale vision-language tasks.
|
||
Notably, PaliGemma deviates from the common practice of freezing the
|
||
image encoder during pretraining, allowing it to learn spatial and
|
||
relational understanding from complex tasks like captioning. To further
|
||
enhance its capabilities, PaliGemma undergoes a resolution increase
|
||
stage, where it is trained on higher-resolution images, enabling it to
|
||
handle tasks that benefit from finer visual details. This multi-stage
|
||
pretraining process results in a family of three PaliGemma checkpoints
|
||
at varying image resolutions (224px, 448px, and 896px), each pretrained
|
||
with broad visual knowledge. These checkpoints serve as strong base
|
||
models that can be easily transferred to specific downstream tasks.
|
||
PaliGemma’s transferability is demonstrated through its impressive
|
||
performance on over 30 academic benchmarks, including those involving
|
||
multiple images, such as NLVR2 and short-video understanding tasks. The
|
||
model’s ability to adapt quickly to new tasks with minimal fine-tuning
|
||
highlights its versatility and makes it a valuable tool for exploring
|
||
and advancing the capabilities of VLMs. Furthermore, the model’s
|
||
open-source nature, along with its straightforward architecture and
|
||
training recipe, encourages further research and experimentation within
|
||
the VLM community, driving progress towards more powerful and
|
||
general-purpose multimodal AI systems.
|
||
</details>
|
||
<h2
|
||
id="paligemma-2-a-family-of-versatile-vlms-for-transfer"><strong>PaliGemma
|
||
2: A Family of Versatile VLMs for Transfer</strong></h2>
|
||
<p>PaliGemma 2 is an upgraded family of open Vision-Language Models
|
||
(VLMs) based on Gemma 2 language models, combined with the SigLIP-So400m
|
||
vision encoder. It offers models in three sizes (3B, 10B, 28B) and three
|
||
resolutions (224px², 448px², 896px²), trained in multiple stages for
|
||
broad knowledge transfer. PaliGemma 2 achieves state-of-the-art results
|
||
on various tasks, including OCR-related challenges like
|
||
table/molecular/music score recognition, and long-form captioning.</p>
|
||
<p><a href="https://arxiv.org/abs/2412.03555"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.03555-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/collections/google/paligemma-2-release-67500e1e1dbfdd4dee27ba48"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers,
|
||
Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony
|
||
Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele
|
||
Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin,
|
||
Lucas Beyer and Xiaohua Zhai</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/4ce402be-d644-4143-a57c-9e7f4d811d95" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
PaliGemma 2 closely follows the architecture of its predecessor,
|
||
PaliGemma. It uses a pre-trained SigLIP-So400m vision encoder. The
|
||
embeddings from this encoder are mapped to the input space of the Gemma
|
||
2 language model using a <em>linear projection</em>. The combined visual
|
||
and text embeddings are then fed into the Gemma 2 model, which
|
||
autoregressively generates the output. The model comes in three size
|
||
variants (2B, 9B, and 27B parameters in the Gemma 2 component,
|
||
corresponding to 3B, 10B, and 28B total parameters) and is trained at
|
||
three resolutions (224x224, 448x448, and 896x896 pixels). This allows
|
||
for analysis of the interplay between model size, resolution, and
|
||
transfer performance. The input image gets concatenated with the input
|
||
text tokens and Gemma 2 autoregressively completes this prefix with an
|
||
answer. PaliGemma 2’s training follows a three-stage approach, similar
|
||
to the original PaliGemma: <strong>Stage 1:</strong> The pre-trained
|
||
SigLIP-So400m and Gemma 2 checkpoints are combined and trained jointly
|
||
on a multimodal task mixture of 1 billion examples. The image resolution
|
||
is 224px². <strong>Stage 2:</strong> Training continues for 50 million
|
||
examples at 448px² resolution, then for 10 million examples at 896px².
|
||
Tasks benefiting from higher resolution are upweighted. <strong>Stage
|
||
3:</strong> Fine-tuning the checkpoints from stage 1 or 2 on the target
|
||
tasks. The training data mixture includes captioning, grounded
|
||
captioning, OCR, visual question answering (VQA), detection, and
|
||
instance segmentation. Notably, the training data relies heavily on
|
||
<em>machine-generated labels</em> from publicly available specialist
|
||
models, <em>avoiding the use of large commercial VLMs</em> for label
|
||
generation. <strong>Gemma 2 Language Models:</strong> The core upgrade
|
||
is the use of the more recent and capable Gemma 2 family of language
|
||
models, replacing the original Gemma model in PaliGemma.
|
||
<strong>Resolution and Model Size Scaling:</strong> PaliGemma 2
|
||
systematically explores the impact of both image resolution and language
|
||
model size on transfer performance. This is a key contribution, as most
|
||
prior work did not jointly study these factors with consistent training
|
||
recipes.
|
||
</details>
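<p>The three-stage recipe above can be summarized as a plain configuration table; the sketch below restates only the example counts and resolutions quoted in the summary, with field names invented for illustration.</p>
<pre><code class="language-python"># Restating the staged recipe from the summary above as a plain config.
# Field names are invented for illustration; the numbers (1B examples at
# 224px, 50M at 448px, 10M at 896px) come from the description.
PALIGEMMA2_STAGES = [
    {"stage": 1, "resolution": 224, "examples": 1_000_000_000,
     "note": "SigLIP-So400m + Gemma 2 trained jointly on the multimodal mixture"},
    {"stage": 2, "resolution": 448, "examples": 50_000_000,
     "note": "continued training, higher-resolution tasks upweighted"},
    {"stage": 2, "resolution": 896, "examples": 10_000_000,
     "note": "continued training at the highest resolution"},
    {"stage": 3, "resolution": "task-dependent", "examples": "task-dependent",
     "note": "fine-tuning on the target task"},
]
</code></pre>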
|
||
<h2
|
||
id="aimv2-multimodal-autoregressive-pre-training-of-large-vision-encoders"><strong>AIMv2:
|
||
Multimodal Autoregressive Pre-training of Large Vision
|
||
Encoders</strong></h2>
|
||
<p>AIMv2 is a family of generalist vision encoders that autoregressively
|
||
generates both image patches and text tokens, achieving state-of-the-art
|
||
performance in multimodal image understanding and strong results in
|
||
vision benchmarks like localization, grounding, and classification,
|
||
demonstrating scalability and efficiency.</p>
|
||
<p><a href="https://arxiv.org/abs/2411.14402"><img
|
||
src="https://img.shields.io/badge/arXiv-2411.14402-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/apple/ml-aim"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/apple/aimv2-large-patch14-224"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Enrico Fini, Mustafa Shukor, David Haldimann, Sai Aitharaju, Alexander
|
||
Toshev, Marcin Eichner, Moin Nabi, Xiujun Li, Philipp Dufter, Michal
|
||
Klein, Victor G. Turrisi da Costa, Louis Béthune, Zhe Gan, Alaaeldin
|
||
El-Nouby</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/c89a0be9-8743-4800-8d3c-ec51a4c99f4d" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
AIMv2 (Autoregressive Image Models v2) introduces a novel pre-training
|
||
method for large-scale vision encoders that extends autoregressive
|
||
pre-training to a multimodal setting, encompassing both images and text.
|
||
The core architecture pairs a Vision Transformer (ViT) encoder with a
|
||
causal multimodal decoder. The vision encoder processes raw image
|
||
patches (using prefix attention), while the multimodal decoder
|
||
autoregressively generates both image patches (using pixel MSE loss) and
|
||
text tokens (using cross-entropy loss). Crucially, image patches and
|
||
text tokens are treated as a single, unified sequence. This allows the
|
||
model to learn a joint representation of visual and textual information.
|
||
The image is always prepended to the beginning of the text sequence. The
|
||
training process is streamlined and efficient. It resembles that of AIM
|
||
and LLMs, relying solely on the autoregressive objective. No
specialized inter-batch communication methods or excessively large batch
sizes are required. This contrasts with contrastive methods (e.g., CLIP,
|
||
SigLIP), which are often more challenging to train and scale. The
|
||
training data consists of a mixture of publicly available (DFN-2B, COYO)
|
||
and proprietary datasets (HQITP), comprising both alt-text and synthetic
|
||
captions. AIMv2 demonstrates strong scaling properties, consistently
|
||
improving performance with increased data or model parameters. The model
|
||
family includes variants ranging from 300 million to 3 billion
|
||
parameters. A key optimization is the use of prefix attention within the
|
||
vision encoder, enabling bidirectional attention during inference
|
||
without fine-tuning. Other architectural choices include the
|
||
incorporation of SwiGLU and RMSNorm, inspired by recent successes in
|
||
language modeling. AIMv2 excels in a variety of tasks. It performs
|
||
favorably on multimodal understanding benchmarks compared to
|
||
state-of-the-art vision-language pre-trained methods. It also exhibits
|
||
strong performance on open-vocabulary object detection and referring
|
||
expression comprehension, surpassing DINOv2. Additionally, it achieves
|
||
impressive recognition performance with a frozen trunk. The model
|
||
supports native image resolution and adaptation to zero-shot
|
||
recognition, demonstrating its flexibility. Post-training strategies,
|
||
including high-resolution adaptation, further enhance the model’s
|
||
capabilities. Ablation studies demonstrate the importance of joint image
|
||
and text modeling, validate design choices, and explore scaling
|
||
characteristics.
|
||
</details>
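<p>The unified autoregressive objective described above combines a pixel-space regression loss on image patches with a next-token cross-entropy loss on text; the sketch below shows one hedged way to write that combination, with tensor shapes and equal loss weighting as assumptions.</p>
<pre><code class="language-python">import torch.nn.functional as F

# Combined autoregressive objective: pixel-space MSE on predicted image
# patches plus cross-entropy on next text tokens, over one unified sequence.
def aimv2_loss(pred_patches, target_patches, text_logits, text_targets,
               image_weight=1.0, text_weight=1.0):
    image_loss = F.mse_loss(pred_patches, target_patches)
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    return image_weight * image_loss + text_weight * text_loss
</code></pre>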
|
||
<h2
|
||
id="apollo-an-exploration-of-video-understanding-in-large-multimodal-models"><strong>Apollo:
|
||
An Exploration of Video Understanding in Large Multimodal
|
||
Models</strong></h2>
|
||
<p>Apollo is a state-of-the-art family of Large Multimodal Models (LMMs)
|
||
designed for video understanding, achieving superior performance across
|
||
different model sizes by leveraging “Scaling Consistency” and exploring
|
||
video-specific aspects like sampling, architectures, data composition,
|
||
and training schedules. The 7B model is state-of-the-art, and Apollo-3B
|
||
outperforms most existing 7B models.</p>
|
||
<p><a href="https://arxiv.org/abs/2412.10360"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.10360-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://apollo-lmms.github.io/"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe
|
||
Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang,
|
||
Serena Yeung-Levy, Xide Xia</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/9222064a-d7a3-4e6b-a14d-bc9a5c679450" width="600" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<p>Apollo leverages the Qwen2.5 series of Large Language Models (LLMs)
|
||
with 1.5B, 3B, and 7B parameters. The key architectural innovation is
|
||
the combination of a SigLIP-SO400M image encoder and an InternVideo2
|
||
video encoder. Features from both encoders are interpolated and
|
||
concatenated channel-wise before being fed into a Perceiver Resampler,
|
||
which outputs 32 tokens per frame. This combination was empirically
|
||
found to be superior to other encoder choices. The model uses a 3-stage
|
||
training approach. Critically, the paper introduces the concept of
|
||
“Scaling Consistency,” demonstrating that design decisions made on
|
||
smaller models and datasets (up to a critical size) effectively transfer
|
||
to larger models. This allows for more efficient experimentation. The
|
||
paper also advocates for frames-per-second (fps) sampling during
|
||
training, as opposed to uniform frame sampling, and demonstrates its
|
||
superiority. The optimal number of tokens is 8-32 per frame. It also
|
||
includes a curated benchmark, ApolloBench, that reduces evaluation time
|
||
by 41x compared to existing benchmarks while maintaining high
|
||
correlation and focusing on temporal reasoning and perception. The
|
||
exploration also includes token resampling, showing that Perceiver
resampling achieves good performance. Token integration is also discussed:
adding tokens (text, learned, etc.) between the video tokens derived
from different frames or clips is sufficient for efficient token
integration. Training stages are also discussed, concluding that
progressively unfreezing the different components in different stages
leads to superior model training dynamics. Finally, training the video
encoder is discussed: the paper concludes that fine-tuning video encoders
on only video data further improves overall performance, especially on
reasoning and domain-specific tasks. Data composition is also studied,
concluding that data mixture matters and that including a moderate amount
of text data and maintaining a slight video-heavy mix leads to optimal
performance.</p>
|
||
</details>
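<p>The encoder-fusion step described above (interpolating and concatenating SigLIP and InternVideo2 features channel-wise, then compressing to 32 tokens per frame) might look roughly like the sketch below; the tiny cross-attention resampler is a stand-in for the actual Perceiver Resampler and all shapes are assumptions.</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

# Per-frame fusion: image-encoder and video-encoder tokens are brought to a
# common length, concatenated channel-wise, and compressed to 32 tokens per
# frame. The single cross-attention layer below is only a stand-in for the
# Perceiver Resampler.
class TinyResampler(nn.Module):
    def __init__(self, dim, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, tokens):                        # tokens: (B, N, dim)
        q = self.queries.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)         # (B, 32, dim)
        return out

def fuse_frame_features(img_feats, vid_feats, resampler):
    # img_feats: (B, N_img, C1); vid_feats: (B, N_vid, C2)
    # resampler must be built with dim equal to the concatenated width (C1 + C2)
    vid_feats = F.interpolate(vid_feats.transpose(1, 2), size=img_feats.size(1),
                              mode="linear").transpose(1, 2)
    fused = torch.cat([img_feats, vid_feats], dim=-1)  # channel-wise concat
    return resampler(fused)                            # (B, 32, C1 + C2)
</code></pre>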
|
||
<h2
|
||
id="aria-an-open-multimodal-native-mixture-of-experts-model"><strong>ARIA:
|
||
An Open Multimodal Native Mixture-of-Experts Model</strong></h2>
|
||
<p>ARIA is an open-source, multimodal native Mixture-of-Experts (MoE)
|
||
model designed to seamlessly integrate and understand diverse modalities
|
||
like text, code, images, and video, achieving state-of-the-art
|
||
performance in its class. It features a fine-grained MoE decoder for
|
||
efficient parameter utilization, a lightweight visual encoder, and a
|
||
4-stage training pipeline that builds capabilities in language
|
||
understanding, multimodal comprehension, long context handling, and
|
||
instruction following.</p>
|
||
<p><a href="https://arxiv.org/abs/2410.05993"><img
|
||
src="https://img.shields.io/badge/arXiv-2410.05993-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/rhymes-ai/Aria"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/blog/RhymesAI/aria"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu,
|
||
Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi
|
||
Ren, Chao Li, Yifan Ye, Peng Liu, Lihuan Zhang, Hanshu Yan, Guoyin Wang,
|
||
Bei Chen, Junnan Li</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/efe4a7ba-756a-4da8-b261-5a0093f28b03" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
ARIA’s architecture is centered around a fine-grained Mixture-of-Experts
|
||
(MoE) decoder, which is more efficient than traditional dense decoders.
|
||
This MoE approach activates 3.5B parameters per text token and 3.9B per
|
||
visual token, out of a total of 24.9B parameters. The model uses 66
|
||
experts in each MoE layer, with 2 shared across all inputs for common
|
||
knowledge, and 6 activated per token by a router. The visual encoder is
|
||
a lightweight (438M parameter) Vision Transformer (ViT) combined with a
|
||
projection module. The ViT processes images at various resolutions
|
||
(medium, high, and ultra-high), preserving aspect ratios. The projection
|
||
module uses cross-attention and an FFN layer to convert image embeddings
|
||
into visual tokens, which are then integrated with text tokens by the
|
||
MoE. ARIA’s training uses a 4-stage pipeline: (1) Language pre-training
|
||
(6.4T text tokens, 8K context window); (2) Multimodal pre-training (400B
|
||
multimodal tokens, including interleaved image-text, synthetic image
|
||
captions, document transcriptions and QA, video captions and QA); (3)
|
||
Multimodal long-context pre-training (extending context to 64K tokens);
|
||
and (4) Multimodal post-training (instruction following with 20B
|
||
tokens). The data curation process is rigorous, incorporating techniques
|
||
like de-duplication, quality filtering, and data clustering. The
|
||
training infrastructure avoids pipeline parallelism, using a combination
|
||
of expert parallelism and ZeRO-1 data parallelism, which contributes to
|
||
efficient training without the need for tensor parallelism. A
|
||
load-balancing loss and z-loss are used to stabilize training. The paper
|
||
demonstrates that, despite having modality-generic experts, ARIA
|
||
naturally develops expert specialization during pre-training. Analysis
|
||
of expert activation shows distinct visual specialization in several
|
||
layers, particularly for image, video, and PDF content. ARIA also shows
|
||
excellent performance in handling long-context multimodal data,
|
||
surpassing other open models and competing favorably with proprietary
|
||
models in tasks like long video and document understanding.
|
||
</details>
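<p>The fine-grained MoE layout described above (66 experts per layer, 2 shared across all inputs, 6 routed per token) can be sketched as follows; the linear experts and the naive per-token dispatch loop are simplifications for readability, not the ARIA implementation.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Every token passes through a couple of always-on shared experts plus a
# router-selected top-k subset of the remaining experts.
class SharedPlusRoutedMoE(nn.Module):
    def __init__(self, dim, n_experts=66, n_shared=2, top_k=6):
        super().__init__()
        n_routed = n_experts - n_shared
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))
        self.routed = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_routed))
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, dim)
        shared_out = sum(expert(x) for expert in self.shared)
        scores = self.router(x).softmax(dim=-1)         # (num_tokens, n_routed)
        weights, idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):                      # naive dispatch loop
            for w, i in zip(weights[t], idx[t]):
                routed_out[t] += w * self.routed[int(i)](x[t])
        return shared_out + routed_out
</code></pre>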
|
||
<h2 id="eve-unveiling-encoder-free-vision-language-models"><strong>EVE:
|
||
Unveiling Encoder-Free Vision-Language Models</strong></h2>
|
||
<p>EVE is an encoder-free vision-language model (VLM) that directly
|
||
processes images and text within a unified decoder-only architecture,
|
||
eliminating the need for a separate vision encoder. It achieves
|
||
competitive performance with encoder-based VLMs of similar size on
|
||
multiple vision-language benchmarks using only 35M publicly accessible
|
||
data, with the model efficiently handling high-resolution images with
|
||
arbitrary aspect ratios.</p>
|
||
<a href="https://arxiv.org/abs/2406.11832"><img
|
||
src="https://img.shields.io/badge/arXiv-2406.11832-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/baaivision/EVE"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/BAAI/EVE-7B-HD-v1.0"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong
|
||
Wang
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/c10e987d-9e11-41d7-968c-617b60d3b0bd" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>EVE (Encoder-free Vision-language modEl)</strong>: This model
|
||
distinguishes itself by completely removing the vision encoder component
|
||
typically found in VLMs. Instead, it directly integrates visual
|
||
information into a decoder-only architecture (based on Vicuna-7B). This
|
||
is achieved through a novel <strong>Patch Embedding Layer (PEL)</strong>
|
||
that processes image patches directly, combined with a <strong>Patch
|
||
Aligning Layer (PAL)</strong> that facilitates learning from a
|
||
pre-trained vision encoder (CLIP-ViT-L/14) without updating the encoder
|
||
itself. Crucially, EVE does <em>not</em> use a traditional image encoder
|
||
during inference. The <strong>PEL</strong> uses a convolution layer and
|
||
average pooling to create 2D feature maps from the input image. It then
|
||
employs cross-attention (CA1) within a limited receptive field to
|
||
enhance these features. A special <code>&lt;CLS&gt;</code> token
|
||
provides a holistic view of each patch feature, and a learnable newline
|
||
token <code>&lt;SPL&gt;</code> is inserted after each row of patch
|
||
features to represent the 2D structure. The <strong>PAL</strong> aligns
|
||
EVE’s patch features with those from a frozen, pre-trained vision
|
||
encoder (CLIP-ViT-L/14). This is done hierarchically, aggregating
|
||
features across multiple layers of the decoder and using a layer-wise
|
||
cross-attention (CA3) mechanism. A Mean Squared Error (MSE) loss between
|
||
EVE’s features and the vision encoder’s features encourages alignment.
|
||
This “implicit” supervision from the vision encoder improves visual
|
||
understanding. Importantly, PAL is <em>only</em> used during training,
|
||
not inference. The training process occurs in three stages:
|
||
<strong>LLM-guided Pre-training:</strong> Only the PEL and PAL are
|
||
trained, aligning the visual features with the frozen LLM (Vicuna-7B).
|
||
This stage uses a subset (16M) of the total training data.
|
||
<strong>Generative Pre-training:</strong> The entire model (including
|
||
the LLM) is trained, using the full 33M dataset. Both text prediction
|
||
(cross-entropy loss) and visual alignment (MSE loss) are used.
|
||
<strong>Supervised Fine-tuning:</strong> The entire model is fine-tuned
|
||
on instruction-following datasets (LLaVA-mix-665K and others). The key
|
||
innovations that allow EVE to work well without a vision encoder are:
|
||
<strong>LLM-Centric Pre-alignment:</strong> Stage 1 is critical for
|
||
preventing model collapse and accelerating convergence. Aligning visual
|
||
features <em>before</em> fully training the LLM is essential.
|
||
<strong>Vision Recognition Capability via Extra Supervision:</strong>
|
||
The PAL provides supervision from a pre-trained vision encoder during
|
||
training, which enhances visual understanding without requiring the
|
||
encoder during inference. <strong>Flexible Input Handling:</strong> The
|
||
architecture naturally handles images of arbitrary aspect ratios and
|
||
resolutions, without needing resizing, padding, or partitioning. No
|
||
reliance on a vision encoder: images are directly input into the LLM
|
||
model. EVE uses a curated dataset of 33M publicly available image-text
|
||
pairs for pre-training, with captions generated by Emu2 and LLaVA-1.5.
|
||
Supervised fine-tuning utilizes datasets like LLaVA-mix-665K, AI2D,
|
||
DocVQA, and others.
|
||
</details>
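<p>The Patch Embedding Layer described above can be approximated by a convolutional patch grid with a learnable newline token appended to each row; the sketch below omits the cross-attention refinement and per-patch <code>&lt;CLS&gt;</code> summarization, and all sizes are assumptions.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Convolution + pooling produce a 2D grid of patch features, and a learnable
# "newline" token is appended after every row so the flattened sequence still
# encodes the 2D layout, for images of arbitrary aspect ratio.
class SimplePatchEmbed(nn.Module):
    def __init__(self, dim=1024, patch=14, pool=2):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pool = nn.AvgPool2d(pool)
        self.newline = nn.Parameter(torch.randn(1, 1, dim))   # row-separator token

    def forward(self, image):                  # image: (B, 3, H, W), any aspect ratio
        feats = self.pool(self.proj(image))    # (B, dim, h, w)
        b, d, h, w = feats.shape
        rows = feats.permute(0, 2, 3, 1)       # (B, h, w, dim)
        nl = self.newline.expand(b, h, 1, d)   # one newline token per row
        rows = torch.cat([rows, nl], dim=2)    # (B, h, w + 1, dim)
        return rows.reshape(b, h * (w + 1), d)
</code></pre>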
|
||
<h2
|
||
id="evev2-improved-baselines-for-encoder-free-vision-language-models"><strong>EVEv2:
|
||
Improved Baselines for Encoder-Free Vision-Language Models</strong></h2>
|
||
<p>EVEv2 represents a significant advancement in encoder-free
|
||
vision-language models (VLMs), addressing limitations of previous
|
||
approaches by introducing a “Divide-and-Conquer” architecture that
|
||
maximizes scaling efficiency, reduces inter-modality interference, and
|
||
achieves strong performance with superior data efficiency.</p>
|
||
<p><a
|
||
href="https://github.com/baaivision/EVE/blob/main/EVEv2/images/EVEv2.0.pdf"><img
|
||
src="https://img.shields.io/badge/arXiv-2406.11832-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/baaivision/EVE/blob/main/EVEv2/README.md"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/BAAI/EVE-7B-HD-v2.0"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan,
|
||
Wenxuan Wang, Huchuan Lu, Xinlong Wang</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/23a33fe6-d4c5-4a9d-b45f-f5612f7848a5" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
EVEv2 departs from the traditional encoder-based VLM approach. Instead
|
||
of relying on a pre-trained vision encoder (like CLIP), it builds visual
|
||
perception <em>directly within</em> a decoder-only Large Language Model
|
||
(LLM). Key architectural features include:
|
||
<strong>Divide-and-Conquer:</strong> This is the core innovation.
|
||
Instead of mixing visual and textual information throughout the entire
|
||
LLM, EVEv2 introduces <em>modality-specific</em> components. This means
|
||
separate attention matrices (query, key, value), Layer Normalization
|
||
layers, and Feed-Forward Networks for visual and textual tokens. This
|
||
reduces interference and allows for more efficient learning. It’s a
|
||
fully sparse, decoder-only architecture. <strong>Patch Embedding
|
||
Layer:</strong> A minimalist patch embedding layer is learned <em>from
|
||
scratch</em>. This avoids the inductive biases of pre-trained vision
|
||
encoders. It uses two convolutional layers (Conv1 and Conv2) to process
|
||
image patches. <strong>Lossless Encoding:</strong> Unlike some
|
||
encoder-free models that use discrete tokenization (which can lose
|
||
information), EVEv2 aims for lossless encoding of visual information.
|
||
<strong>LLM Adaptation:</strong> The architecture is designed for
|
||
seamless adaptation to existing LLMs. The paper experiments with
|
||
Vicuna-7B and Qwen2-7B. <strong>Multi-Stage Training:</strong> A
|
||
four-stage training process is used: <strong>LLM-guided
|
||
Pre-aligning:</strong> Only the patch embedding layer is trained, using
|
||
re-captioned web data (EVE-recap-10M). The LLM is frozen. This
|
||
establishes a basic alignment between visual and textual
|
||
representations. <strong>Vision Perception Learning:</strong> Vision
|
||
layers within the LLM are trained, using progressively larger datasets
|
||
and image resolutions. The LLM weights are still frozen.
|
||
<strong>Vision-Text Fully Aligning:</strong> The entire network is
updated. <strong>Supervised Fine-tuning (SFT):</strong> The entire model
|
||
is fine-tuned on question-answering and instruction-following datasets.
|
||
<strong>DenseFusion++:</strong> A new, efficient captioning engine is
|
||
introduced to generate high-quality image-text pairs for training. This
|
||
is crucial for building strong visual perception from scratch. It
|
||
leverages multiple vision experts. <strong>Data Efficiency:</strong> A
|
||
key focus of the research is demonstrating that EVEv2 can achieve strong
|
||
performance with <em>less</em> data than comparable encoder-based
|
||
models, thanks to its efficient architecture.
|
||
</details>
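<p>A simplified view of the divide-and-conquer block: each token is routed through modality-specific layer norms and feed-forward networks while self-attention still runs over the mixed sequence. For brevity the attention projections are shared in the sketch below, whereas EVEv2 also splits them; everything here is an illustrative assumption.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# One decoder block with parallel vision/text norms and FFNs; tokens are
# routed by a boolean modality mask, and attention mixes the full sequence.
class ModalitySplitBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        mods = ("vision", "text")
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in mods})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(dim) for m in mods})
        self.ffn = nn.ModuleDict({m: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                                   nn.Linear(4 * dim, dim)) for m in mods})
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _route(self, x, is_vision, layers):
        out = torch.empty_like(x)
        out[:, is_vision] = layers["vision"](x[:, is_vision])
        out[:, ~is_vision] = layers["text"](x[:, ~is_vision])
        return out

    def forward(self, x, is_vision):                 # x: (B, N, dim); is_vision: (N,) bool tensor
        h = self._route(x, is_vision, self.norm1)    # modality-specific pre-norm
        attn_out, _ = self.attn(h, h, h)             # attention over the mixed sequence
        x = x + attn_out
        h = self._route(x, is_vision, self.norm2)
        return x + self._route(h, is_vision, self.ffn)
</code></pre>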
|
||
<h2
|
||
id="janus-pro-unified-multimodal-understanding-and-generation-with-data-and-model-scaling"><strong>Janus-Pro:
|
||
Unified Multimodal Understanding and Generation with Data and Model
|
||
Scaling</strong></h2>
|
||
<p>Janus-Pro significantly improves upon the original Janus model by
|
||
optimizing the training strategy, expanding the training data, and
|
||
scaling up the model size, resulting in enhanced multimodal
|
||
understanding, text-to-image instruction-following, and generation
|
||
stability.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.17811"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.17811-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/deepseek-ai/Janus"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/deepseek-ai/Janus-Pro-7B"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie,
|
||
Xingkai Yu, Chong Ruan</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/657b0f2a-7a0e-4aed-a214-a33485990790" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Janus-Pro maintains the core architecture of Janus, which decouples
|
||
visual encoding for multimodal understanding and generation. It uses a
|
||
unified autoregressive transformer but employs separate encoders for
|
||
understanding (SigLIP) and generation (VQ tokenizer). The understanding
|
||
encoder extracts semantic features, flattened and mapped to the LLM’s
|
||
input space via an “understanding adaptor.” The generation encoder
|
||
converts images to discrete IDs, flattened and mapped via a “generation
|
||
adaptor.” These feature sequences are concatenated and fed to the LLM.
|
||
The model includes a built-in prediction head (from the LLM) and a
|
||
randomly initialized prediction head for image generation. The key
|
||
improvements in Janus-Pro lie in three areas: <strong>Optimized Training
|
||
Strategy:</strong> Janus-Pro uses a three-stage training process.
|
||
<strong>Stage I:</strong> Focuses on training the adaptors and image
|
||
head with longer training on ImageNet, improving parameter
|
||
initialization. <strong>Stage II:</strong> Unified pretraining, updating
|
||
all components <em>except</em> the understanding and generation
|
||
encoders. Crucially, it <em>removes</em> ImageNet data from this stage
|
||
and uses only “normal” text-to-image data, improving efficiency.
|
||
<strong>Stage III:</strong> Supervised fine-tuning, further updating the
|
||
understanding encoder. The data ratio (multimodal:text:text-to-image) is
|
||
adjusted from 7:3:10 to 5:1:4, improving multimodal understanding
|
||
without sacrificing generation. <strong>Data Scaling:</strong> Janus-Pro
|
||
significantly expands the training data. <strong>Multimodal
|
||
Understanding:</strong> Adds ~90 million samples from sources like
|
||
DeepSeek-VL2, including image captions (YFCC), table/chart/document
|
||
understanding (Docmatix), MEME understanding, and Chinese conversational
|
||
data. <strong>Visual Generation:</strong> Adds ~72 million
|
||
<em>synthetic</em> aesthetic data samples, balancing real and synthetic
|
||
data 1:1 during unified pretraining. This improves generation stability
|
||
and aesthetic quality. <strong>Model Scaling:</strong> Janus-Pro scales
|
||
up from 1.5B to 7B LLM parameters (DeepSeek-LLM). This significantly
|
||
improves convergence speed for both understanding and generation. The
|
||
training uses a sequence length of 4096, SigLIP-Large-Patch16-384 for
|
||
understanding, and a VQ tokenizer with a codebook of 16,384 for
|
||
generation. Adaptors are two-layer MLPs. Training is performed with
|
||
HAI-LLM, a distributed training framework. Evaluation is conducted on
|
||
benchmarks like GQA, MME, SEED, MMB, MM-Vet, MMMU (for understanding)
|
||
and GenEval, DPG-Bench (for generation). Janus-Pro achieves
|
||
state-of-the-art results among unified multimodal models, demonstrating
|
||
significant improvements in both multimodal understanding and
|
||
text-to-image generation.
|
||
</details>
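<p>The decoupled-encoder design above can be pictured as two adaptor paths feeding one shared autoregressive LLM: semantic SigLIP features through an understanding adaptor for comprehension tasks, and discrete VQ codes through a generation adaptor for image synthesis. The sketch below is schematic, with all dimensions and module choices assumed.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

# Two adaptor paths into one shared LLM input space. A two-layer MLP maps
# SigLIP semantic features for understanding; an embedding table maps
# discrete VQ codebook ids for generation.
llm_dim = 4096
understanding_adaptor = nn.Sequential(
    nn.Linear(1024, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
generation_adaptor = nn.Embedding(16384, llm_dim)   # 16,384-entry VQ codebook

def understanding_inputs(siglip_feats, text_embeds):
    # multimodal understanding: semantic image tokens followed by text tokens
    return torch.cat([understanding_adaptor(siglip_feats), text_embeds], dim=1)

def generation_inputs(text_embeds, vq_ids):
    # text-to-image generation: text prompt followed by discrete image codes
    return torch.cat([text_embeds, generation_adaptor(vq_ids)], dim=1)
</code></pre>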
|
||
<h2
|
||
id="llava-cot-let-vision-language-models-reason-step-by-step"><strong>LLaVA-CoT:
|
||
Let Vision Language Models Reason Step-by-Step</strong></h2>
|
||
<p>LLaVA-CoT is a novel Vision-Language Model (VLM) designed to perform
|
||
autonomous, multi-stage reasoning, enabling it to tackle complex visual
|
||
question-answering tasks by independently engaging in sequential stages
|
||
of summarization, visual interpretation, logical reasoning, and
|
||
conclusion generation.</p>
|
||
<p><a href="https://arxiv.org/abs/2411.10440"><img
|
||
src="https://img.shields.io/badge/arXiv-2411.10440-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/PKU-YuanGroup/LLaVA-CoT"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/Xkev/Llama-3.2V-11B-cot"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, Li Yuan</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/1a5e32f0-4ffc-4514-8401-25777c2fac10" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
LLaVA-CoT builds upon the Llama-3.2-Vision model and introduces a
|
||
structured, four-stage reasoning process: Summary (briefly outlines the
|
||
task), Caption (describes relevant image parts), Reasoning (detailed
|
||
analysis), and Conclusion (provides the final answer). Each stage is
|
||
marked with specific tags (
|
||
<SUMMARY>
|
||
,
|
||
<CAPTION>
|
||
, <REASONING>, <CONCLUSION>) to maintain clarity. Unlike traditional
|
||
Chain-of-Thought (CoT) prompting, LLaVA-CoT promotes structured thinking
|
||
by first organizing the problem and known information, then performing
|
||
detailed reasoning, and finally deriving a conclusion. The model is
|
||
trained on the newly compiled LLaVA-CoT-100k dataset. This dataset
|
||
integrates samples from various visual question answering sources and
|
||
providing structured reasoning instructions. The dataset contains 99k
|
||
image and Question answer pairs using GPT-4o to provide details. Data is
|
||
gathered from general VQA datasets (ShareGPT4V, ChartQA, A-OKVQA,
|
||
DocVQA, PISC, CLEVR) and Science targeted VQA (AI2D, GeoQA+, ScienceQA,
|
||
CLEVR-Math). The paper also proposes a novel inference-time stage-level
|
||
beam search method. This method generates multiple candidate results at
|
||
<em>each</em> stage of the reasoning process, selecting the best to
|
||
continue, improving performance and scalability. This contrasts with
|
||
traditional best-of-N or sentence-level beam search. The entire model is
|
||
trained using the Supervised-Fine Tuning.
|
||
</details>
|
||
<h2
|
||
id="llm2clip-powerful-language-model-unlocks-richer-visual-representation"><strong>LLM2CLIP:
|
||
Powerful Language Model Unlocks Richer Visual
|
||
Representation</strong></h2>
|
||
<p>LLM2CLIP is a fine-tuning approach which integrates Large Language
|
||
Models (LLMs) with pre-trained CLIP visual encoders. It improves the
|
||
model by using the LLM’s ability to proccess and understant long
|
||
captions, open-world knowledge.</p>
|
||
<p><a href="https://arxiv.org/abs/2411.04997"><img
|
||
src="https://img.shields.io/badge/arXiv-2411.04997-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/microsoft/LLM2CLIP"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/microsoft/LLM2CLIP-EVA02-B-16"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu,
|
||
Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/44d6952e-98ea-4875-bd9c-0a09a683bcbb" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
LLM2CLIP is fine-tuning approach. It integrates LLM (Large Language
|
||
Models) to already pretrained CLIP visual encoders. The main problem
|
||
which is tried to be solved is that; LLM’s text understanding capability
|
||
is not reflected on CLIP models. The authors highlight that directly
|
||
incorporating LLMs into CLIP often fails due to the poor separability of
|
||
LLM output features. To tackle this, they introduce a two-stage
|
||
approach. <strong>Stage 1: Caption Contrastive (CC)
|
||
Fine-tuning:</strong> The LLM (specifically Llama-3 8B) is fine-tuned
|
||
using a contrastive learning framework on a dataset of image captions
|
||
(CC3M). This stage <em>doesn’t train for autoregressive
|
||
capabilities</em>, instead, it is transforming the causal attention to
|
||
bidirectional, to function it as an encoder. This stage aims to improve
|
||
the discriminative power of the LLM’s output space, making it easier to
|
||
distinguish between different captions, using supervised SimCSE loss.
|
||
<strong>Stage 2: CLIP Vision Encoder Fine-tuning:</strong> The
|
||
pre-trained CLIP visual encoder is fine-tuned using the CC-fine-tuned
|
||
LLM, now acting as a “super” text encoder. The LLM’s gradients are
|
||
<em>frozen</em> during this stage to preserve its acquired knowledge and
|
||
reduce computational cost. Learnable adapters (linear layers) are added
|
||
after the LLM to facilitate alignment with the CLIP visual encoder.
|
||
Instead of the typical image-text contrastive loss, a caption-to-caption
|
||
contrastive framework is used during LLM fine-tuning. This forces the
|
||
LLM to produce distinct representations for different captions
|
||
describing the same image. It uses Supervised SimCSE. Makes the model
|
||
encoder. Freezing the LLM during CLIP fine-tuning is crucial for
|
||
efficiency and preserving the LLM’s knowledge. These adapters bridge the
|
||
gap between the frozen LLM and the CLIP visual encoder. The method is
|
||
surprisingly efficient, requiring only a small amount of open-source
|
||
data (15M or even 3M image-text pairs) and a single epoch of training in
|
||
some cases. It leverages LoRA (Low-Rank Adaptation) for efficient
|
||
fine-tuning. LLM2CLIP can effectively leverage dense captions (detailed
|
||
image descriptions), a known limitation of standard CLIP. Uses
|
||
ShareCaptioner-modified CC-3M (for CC fine-tuning), Wikitext-103, and a
|
||
combination of CC-3M, CC-12M, YFCC-15M, and Recaption-1B for CLIP
|
||
fine-tuning. The paper demonstrates that, after fine-tuning of the
|
||
output space of the LLM, using LLM has a significant impact and it
|
||
substantially improves the performance on downstream tasks.
|
||
</details>
|
||
<h2
|
||
id="maya-an-instruction-finetuned-multilingual-multimodal-model"><strong>Maya:
|
||
An Instruction Finetuned Multilingual Multimodal Model</strong></h2>
|
||
<p>Maya is an open-source Multimodal Multilingual Vision Language Model
|
||
(mVLM) designed to address the limitations of current VLMs in handling
|
||
low-resource languages and diverse cultural contexts, achieved by
|
||
creating a new multilingual image-text pretraining dataset, performing
|
||
toxicity analysis and mitigation, and fine-tuning for enhanced cultural
|
||
and linguistic comprehension.</p>
|
||
<p><a href="https://arxiv.org/abs/2412.07112"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.07112-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/nahidalam/maya"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/maya-multimodal/maya"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Nahid Alam, Karthik Reddy Kanjula, Bala Krishna S Vegesna, S M Iftekhar
|
||
Uddin, Drishti Sharma, Abhipsha Das, Shayekh Bin Islam, Surya
|
||
Guthikonda, Timothy Chung, Anthony Susevski, Ryan Sze-Yin Chan, Roshan
|
||
Santhosh, Snegha A, Chen Liu, Isha Chaturvedi, Ashvanth.S, Snehanshu
|
||
Mukherjee, Alham Fikri Aji</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/f413afd9-3eee-4a5e-940a-b148fdf3189b" width="600"/>
|
||
<!-- Dummy Image -->
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Architecture:</strong> Maya builds upon the LLaVA 1.5 framework.
|
||
It uses the Aya-23 8B model as its Large Language Model (LLM) due to
|
||
Aya’s strong multilingual capabilities (trained on 23 languages).
|
||
Critically, it <em>replaces</em> the CLIP vision encoder used in LLaVA
|
||
with SigLIP. This is motivated by SigLIP’s superior performance,
|
||
multilingual support, and ability to handle variable-length image
|
||
patches (allowing for more flexible input sizes). The visual features
|
||
from SigLIP (<code>Zv = g(Xv)</code>) are passed through a trainable
|
||
projection matrix (<code>W</code>, a 2-layer MLP with GELU activation)
|
||
to align them with the LLM’s embedding space, producing visual features
|
||
<code>Hv</code>. The architecture is fairly standard for this type of
|
||
model, concatenating visual and textual features for input to the LLM.
|
||
The training process involves two main phases: pretraining and
|
||
finetuning. <strong>Pretraining:</strong> The model is pretrained on a
|
||
newly created multilingual image-text dataset. This dataset is derived
|
||
from the English-only LLaVA pretraining dataset (558k image-text pairs)
|
||
and translated into seven additional languages (Chinese, French,
|
||
Spanish, Russian, Hindi, Japanese, and Arabic) using a sophisticated
|
||
translation pipeline. This pipeline uses the Aya 35B model, optimized
|
||
prompt engineering (determined empirically using BLEU and N-gram
|
||
scores), and a batch processing approach with quality checks. Crucially,
|
||
this dataset undergoes <em>toxicity filtering</em>. LLaVAGuard and
|
||
Toxic-BERT are used to identify and remove toxic image-caption pairs,
|
||
creating a “toxicity-free” version of the dataset (removing 7,531 toxic
|
||
images). The pretraining uses a learning rate of 1e-3 and a cosine
|
||
scheduler. Only the projection matrix is trained during pretraining.
|
||
<strong>Finetuning:</strong> The pretrained model is then
|
||
instruction-tuned using the PALO 150K instruction-tuning dataset (which
|
||
covers 10 languages). Full finetuning is performed (as opposed to LoRA),
|
||
with frozen vision encoder and LLM. The core alignment technique is the
|
||
trainable projection matrix (the 2-layer MLP) that maps the SigLIP
|
||
visual features into the embedding space of the Aya-23 LLM. This is a
|
||
simple but effective method, common in many VLMs. The paper
|
||
<em>explicitly</em> states they did <em>not</em> use more complex
|
||
alignment techniques like gated soft-attention (Flamingo) or Q-Former
|
||
(BLIP-2) in this phase, reserving those for future work.
|
||
<strong>Pretraining Dataset:</strong> A new multilingual dataset created
|
||
by translating and filtering the LLaVA pretraining dataset. This dataset
|
||
is a key contribution of the paper. The translation process and toxicity
|
||
filtering are described in detail. <strong>Instruction Tuning
|
||
Dataset:</strong> PALO 150K instruction-tuning dataset.
|
||
<strong>Evaluation Datasets</strong>: PALO multilingual evalution,
|
||
VizWiz, GQA, ScienceQA, TextVQA, POPE, MMBench, MM-Vet, MME.
|
||
<strong>Multilingual Image-Text Pretraining Dataset:</strong> A new
|
||
dataset of 558,000 images in eight languages. <strong>Toxicity Analysis
|
||
and Mitigation:</strong> A thorough analysis of toxicity in the original
|
||
LLaVA dataset and the creation of a toxicity-free version. This is a
|
||
significant and novel aspect. <strong>Multilingual Model:</strong> A
|
||
model (Maya) that shows improved performance in understanding cultural
|
||
and linguistic nuances, especially in comparison to models trained
|
||
primarily on English data. The results show that Maya performs
|
||
comparably to or better than models of similar size (LLaVA-7B) and often
|
||
approaches the performance of larger models (PALO-13B) on multilingual
|
||
benchmarks. The toxicity filtering has a minimal impact on overall
|
||
performance, suggesting that valuable information isn’t lost by removing
|
||
toxic content. The paper includes both quantitative benchmark results
|
||
and qualitative examples demonstrating the model’s capabilities.
|
||
</details>
|
||
<h2
|
||
id="minimax-01-scaling-foundation-models-with-lightning-attention"><strong>MiniMax-01:
|
||
Scaling Foundation Models with Lightning Attention</strong></h2>
|
||
<p>MiniMax-01 is a series of large foundation models, including
|
||
MiniMax-Text-01 and MiniMax-VL-01, that achieve performance comparable
|
||
to top-tier models (like GPT-4o and Claude-3.5-Sonnet) while offering
|
||
significantly longer context windows (up to 4 million tokens). It
|
||
achieves this through a novel architecture incorporating lightning
|
||
attention (a highly efficient linear attention variant), Mixture of
|
||
Experts (MoE), and optimized training/inference frameworks.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.08313"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.08313-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/MiniMax-AI/MiniMax-01"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/MiniMaxAI/MiniMax-VL-01"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng
|
||
Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin
|
||
Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang,
|
||
Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu,
|
||
Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang,
|
||
Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju,
|
||
Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li,
|
||
Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li,
|
||
Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang,
|
||
Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun
|
||
Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou,
|
||
Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu,
|
||
Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li,
|
||
Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao,
|
||
Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo
|
||
Jiang, Zijia Wu</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<p><strong>Hybrid Attention:</strong> The core innovation is the hybrid
|
||
attention mechanism. It primarily uses “lightning attention” (an
|
||
I/O-aware implementation of TransNormer linear attention) for
|
||
efficiency. However, to maintain strong retrieval capabilities, it
|
||
strategically inserts a standard transformer block with softmax
|
||
attention after every seven transnormer blocks (with lightning
|
||
attention). This is a key differentiator from purely linear attention
|
||
models, which often struggle with retrieval tasks. <strong>Mixture of
|
||
Experts (MoE):</strong> To scale the model efficiently, MiniMax-01
|
||
employs a Mixture of Experts (MoE) architecture in the feed-forward
|
||
layers. It has a massive 456 billion total parameters, but only 45.9
|
||
billion are activated for each token, using 32 experts with a top-2
|
||
routing strategy. This allows for a large model capacity without a
|
||
corresponding increase in computational cost per token.
|
||
<strong>Vision-Language Model (MiniMax-VL-01):</strong> The
|
||
vision-language model (MiniMax-VL-01) builds upon MiniMax-Text-01 by
|
||
integrating a lightweight Vision Transformer (ViT) module. It uses a
|
||
dynamic resolution strategy, resizing input images to various sizes
|
||
(from 336x336 to 2016x2016) and concatenating features from both resized
|
||
patches and a standard thumbnail. It <em>does not</em> use pooling or
|
||
downsampling on the visual features, relying instead on the long-context
|
||
capabilities of the architecture. Demonstrates the viability of linear
|
||
attention at a massive scale, achieving performance comparable to
|
||
top-tier models while significantly extending the context window.
|
||
<strong>Long-Context Capability:</strong> Supports context inputs of up
|
||
to 4 million tokens, with strong performance in long-context
|
||
evaluations. <strong>Efficient Training and Inference
|
||
Framework:</strong> Introduces several novel algorithmic and engineering
|
||
optimizations to handle the hybrid architecture, MoE, and long contexts
|
||
efficiently. <strong>Pre-training:</strong> A meticulously curated
|
||
corpus incorporating academic literature, books, web content, and
|
||
programming code. <strong>Vision-Language Pre-training (VL-01):</strong>
|
||
A substantial image-caption dataset (694 million unique pairs) and a
|
||
dataset of 100 million images with fine-grained descriptions.
|
||
<strong>Vision-Language Instruction Data (VL-01):</strong> A
|
||
comprehensive and diverse instruction-based dataset synthesized from a
|
||
wide array of image-related tasks. <strong>Alignment Datasets</strong>
|
||
are also mentioned but are not detailed in the ocr. <strong>Hybrid
|
||
Attention:</strong> The core fusion method is the hybrid attention
|
||
mechanism, which combines the efficiency of lightning attention (linear)
|
||
with the retrieval capabilities of softmax attention. <strong>MoE
|
||
Routing:</strong> The MoE architecture with its top-2 routing strategy
|
||
allows for selective activation of experts, enhancing model capacity
|
||
without increasing computational cost per token. A global router is used
|
||
for load balancing. <strong>Vision-Language Fusion (VL-01):</strong>
|
||
Visual features from the ViT are projected into the embedding space of
|
||
the LLM using a two-layer MLP. The raw, high-dimensional visual features
|
||
are directly used without pooling or downsampling, leveraging the
|
||
long-context capabilities of the architecture. <strong>Varlen Ring
|
||
Attention and LASP+:</strong> These algorithms enable efficient handling
|
||
of long, variable-length sequences and data packing during both training
|
||
and inference. Post-Training and Alignment: Various techniques are used
|
||
for alignment.</p>
|
||
</details>
|
||
<h2 id="nvlm-open-frontier-class-multimodal-llms"><strong>NVLM: Open
|
||
Frontier-Class Multimodal LLMs</strong></h2>
|
||
<p>NVLM 1.0 is a family of multimodal large language models (LLMs)
|
||
achieving state-of-the-art results on vision-language tasks, rivaling
|
||
proprietary and open-access models. It demonstrates improved text-only
|
||
performance after multimodal training and offers a comprehensive
|
||
comparison of decoder-only and cross-attention-based architectures,
|
||
introducing a novel hybrid architecture and a 1-D tile-tagging design
|
||
for high-resolution images.</p>
|
||
<a href="https://arxiv.org/abs/2409.11402"><img
|
||
src="https://img.shields.io/badge/arXiv-2409.11402-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/NVIDIA/Megatron-LM/tree/NVLM-1.0"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/nvidia/NVLM-D-72B"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon
|
||
Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/da882643-ac1d-4566-8287-cd8da3897a88" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>NVLM (NVIDIA Vision Language Model)</strong> introduces a family
|
||
of models with three primary architectures: NVLM-D (Decoder-only),
|
||
NVLM-X (Cross-attention-based), and NVLM-H (Hybrid). All models share a
|
||
common vision pathway, employing a frozen InternViT-6B-448px-V1-5 vision
|
||
encoder with dynamic high-resolution (DHR) processing. DHR involves
|
||
dividing input images into tiles (up to 6, with varying aspect ratios)
|
||
and a downscaled global “thumbnail” tile. These tiles are processed by
|
||
the vision encoder, and the resulting 1024 tokens per tile are
|
||
downsampled to 256 via pixel shuffling. <strong>NVLM-D
|
||
(Decoder-only):</strong> Connects the vision encoder to the LLM
|
||
(Qwen2-72B-Instruct or Nous-Hermes-2-Yi-34B) via a 2-layer MLP
|
||
projector. It introduces a novel <em>1-D tile-tagging</em> design for
|
||
handling high-resolution images. Text-based tile tags (e.g.,
|
||
<code><tile_1></code>) are inserted before the flattened image
|
||
tokens of each tile to provide positional information to the LLM.
|
||
Training involves pretraining (frozen LLM and vision encoder, training
|
||
only the MLP) and supervised fine-tuning (SFT) (unfrozen LLM and MLP).
|
||
Crucially, a high-quality text-only SFT dataset is included to
|
||
maintain/improve text-only performance. <strong>NVLM-X
|
||
(Cross-attention-based):</strong> Uses gated cross-attention layers to
|
||
process image tokens, similar to Flamingo, but <em>without</em> a
|
||
Perceiver resampler. Image features are projected to the LLM’s hidden
|
||
dimension with a one-layer MLP. Gated X-attention layers are interleaved
|
||
with LLM self-attention layers. Training also has pretraining and SFT
|
||
stages. The LLM backbone is unfrozen during SFT, and a high-quality
|
||
text-only dataset is used. 1-D tile tags are also used, but within the
|
||
X-attention layers. <strong>NVLM-H (Hybrid):</strong> Combines aspects
|
||
of NVLM-D and NVLM-X. The thumbnail image tokens are processed by the
|
||
LLM’s self-attention layers (like NVLM-D), while the regular tile tokens
|
||
are processed by gated cross-attention (like NVLM-X). This aims to
|
||
balance multimodal reasoning with computational efficiency. It also uses
|
||
1-D tile tags in cross-attention. The 1-D tile-tagging design
|
||
significantly improves performance, especially on OCR-related tasks,
|
||
compared to simply concatenating image tokens or using 2D grid/bounding
|
||
box tags. The authors emphasize that dataset quality and task diversity
|
||
are more important than sheer scale, even during pretraining. NVLM
|
||
models achieve strong performance on <em>both</em> vision-language and
|
||
text-only tasks. This is achieved by including a high-quality text-only
|
||
dataset during SFT and incorporating multimodal math and reasoning data.
|
||
Decoder VS X-Attention: Cross attention based models are more efficient
|
||
in high-resolution images. However, Decoder models provides unified
|
||
multimodel reasoning and higher accuracy in OCR-related tasks. Curated
|
||
from open-source datasets, including captioning (COCO, CC3M, SBU,
|
||
LAION-115M), VQA (VQAv2, Visual Genome, DVQA), document understanding
|
||
(Docmatix), OCR/Scene-Text (various datasets), and Math (CLEVR-Math).
|
||
Emphasis on quality over quantity. A diverse collection of task-oriented
|
||
datasets, including captioning, VQA, chart/diagram understanding,
|
||
document understanding, OCR, math, and science datasets. High-quality
|
||
text-only data from various sources (ShareGPT, SlimOrca, EvolInstruct,
|
||
etc.) and categories (general, math, coding) is crucial for
|
||
maintaining/improving text-only performance. Refined using GPT-40 and
|
||
GPT-40-mini. NVLM models are evaluated on a wide range of
|
||
vision-language benchmarks (MMMU, MathVista, OCRBench, AI2D, ChartQA,
|
||
DocVQA, TextVQA, RealWorldQA, VQAv2) and text-only benchmarks (MMLU,
|
||
GSM8K, MATH, HumanEval).
|
||
</details>
|
||
<h2
|
||
id="omnivlm-a-token-compressed-sub-billion-parameter-vision-language-model-for-efficient-on-device-inference"><strong>OmniVLM:
|
||
A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for
|
||
Efficient On-Device Inference</strong></h2>
|
||
<p>OmniVLM is a sub-billion-parameter vision-language model designed for
|
||
efficient on-device inference, featuring a token compression mechanism
|
||
that reduces visual token sequence length from 729 to 81, drastically
|
||
cutting computational overhead while maintaining visual-semantic
|
||
fidelity. It uses Qwen2.5-0.5B-Instruct model, Google’s SigLIP-400M.</p>
|
||
<a href="https://arxiv.org/abs/2412.11475"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.11475-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/NexaAI/nexa-sdk"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/NexaAIDev/OmniVLM-968M"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Wei Chen, Zhiyuan Li, Shuo Xin
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/da2a140a-5efe-4499-addc-8ccbb3e9792a" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
OmniVLM addresses the challenges of deploying vision-language models
|
||
(VLMs) on resource-constrained edge devices. It achieves this through a
|
||
novel token compression mechanism and a multi-stage training pipeline.
|
||
The core innovation is the <strong>image token compression</strong>,
|
||
which transforms the embedding dimensions from [batch_size, 729,
|
||
hidden_size] to [batch_size, 81, hidden_size] within the projection
|
||
layer. This 9x reduction in token count is achieved through reshaping,
|
||
chosen after empirical comparison against convolution-based methods. The
|
||
model architecture (Figure 1) builds upon the LLaVA framework, employing
|
||
Google’s SigLIP-400M as the vision encoder, Qwen2.5-0.5B-Instruct as the
|
||
base language model, and a Multi-Layer Perceptron (MLP) as the
|
||
projection layer. The training pipeline consists of three stages: (1)
|
||
<strong>Pretraining</strong> on large-scale image-caption pairs
|
||
(primarily from the LLaVA pretraining dataset) to learn
|
||
visual-linguistic alignments, training only the projection layer; (2)
|
||
<strong>Supervised Fine-Tuning (SFT)</strong> on a mix of datasets
|
||
(LLaVA, UnimmChat, and internal data) to improve contextual
|
||
understanding and conversational coherence, training the projector and
|
||
LLM while freezing the vision encoder; and (3) <strong>Minimal-Edit
|
||
Direct Preference Optimization (DPO)</strong>, using a teacher model to
|
||
create minimally edited corrections to the base model’s outputs, forming
|
||
chosen-rejected pairs for preference learning, again freezing the vision
|
||
encoder and training the projector and LLM. The DPO process leverages
|
||
GPT-4V to generate synthetic training pairs. Extensive experiments show
|
||
that the 81-token configuration provides the optimal balance between
|
||
computational efficiency and model performance. OmniVLM outperforms
|
||
nanoLLAVA on benchmarks like ScienceQA, POPE, and MMMU, demonstrating
|
||
improved reasoning, multimodal comprehension, and generalization.
|
||
Crucially, it achieves significantly faster inference speeds (9.1x
|
||
faster time-to-first-token and 1.5x higher decoding speed compared to
|
||
nanoLLAVA on a laptop, and 8x faster TTFT on a mobile device), making it
|
||
suitable for deployment on edge devices like smartphones and laptops.
|
||
</details>
|
||
<h2
|
||
id="pixtral-12b-a-cutting-edge-open-multimodal-language-model"><strong>Pixtral
|
||
12B: A Cutting-Edge Open Multimodal Language Model</strong></h2>
|
||
<p>Pixtral 12B is a 12-billion-parameter multimodal language model
|
||
developed by Mistral AI, designed to excel in both understanding images
|
||
and text, achieving leading performance on various multimodal
|
||
benchmarks. The core of the VLM is built upon the transformer
|
||
architecture. A strong aspect of the VLM is, Pixtral 12B is trained with
|
||
a new vision encoder from scratch to natively support variable image
|
||
sizes and aspect ratios.</p>
|
||
<a href="https://arxiv.org/abs/2410.07073"><img
|
||
src="https://img.shields.io/badge/arXiv-2410.07073-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/huggingface/transformers/blob/main/docs/source/en/model_doc/pixtral.md"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/mistralai/Pixtral-12B-2409"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout,
|
||
Devendra Chaplot, Jessica Chudnovsky, et al. (Mistral AI Science Team)
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/5187d3c0-e284-40eb-bb94-53105c8cbe11" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Pixtral 12B</strong> has two main components, <em>vision encoder
|
||
(Pixtral-ViT)</em>, which tokenizes images and a <em>multimodal
|
||
decoder</em>, which predicts the next token given a sequence of text and
|
||
images. Pixtral can take an arbitrary number of images as an input,
|
||
provided they fit within its 128K context window. <strong>The vision
|
||
encoder (Pixtral-ViT)</strong> is trained from scratch with a novel
|
||
ROPE-2D implementation, allowing it to process images at their native
|
||
resolution and aspect ratio. The model can flexibly process images at
|
||
low resolution in latency-constrained settings, while processing images
|
||
at high resolution when fine-grained reasoning is required. For
|
||
distinguishing between images with same number of patches but different
|
||
aspect ratios, <strong>[IMAGE BREAK]</strong> tokens are inserted
|
||
between image rows. Additionally, an <strong>[IMAGE END]</strong> token
|
||
at the end of image sequence. The model employs a <strong>gated
|
||
FFN</strong> architecture, implementing gating in the hidden layer in
|
||
place of standard feedforward layer in the attention block. For
|
||
processing images within a single batch, the model flattens images along
|
||
the sequence dimension and concatenates them. A block diagonal mask is
|
||
constructed to prevent attention leakage between patches of different
|
||
images. Traditional learned and absolute position embeddings are
|
||
replaced by <strong>ROPE-2D</strong>, which allows handling variable
|
||
image sizes. The <strong>multimodal decoder</strong> of Pixtral is built
|
||
on top of Mistral Nemo 12B [15], a 12-billion parameter decoder-only
|
||
language model. The decoder uses a causal self-attention. The vision
|
||
encoder is connected to the multimodal decoder by a two-layer fully
|
||
connected network. The paper describes Pixtral as an instruction-tuned
|
||
model, pre-trained on large-scale interleaved image and text documents.
|
||
The Paper contributes an open-source benchmark called
|
||
<strong>MM-MT-Bench</strong>, for evaluating vision-language models.
|
||
Pixtral excels at multimodal instruction following, surpassing
|
||
comparable open-source models on the MM-MT-Bench benchmark.
|
||
</details>
|
||
<h2
|
||
id="sa2va-marrying-sam2-with-llava-for-dense-grounded-understanding-of-images-and-videos"><strong>Sa2VA:
|
||
Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and
|
||
Videos</strong></h2>
|
||
<p>Sa2VA is a unified model for dense grounded understanding of both
|
||
images and videos, integrating the SAM-2 video segmentation model with
|
||
the LLaVA vision-language model. It supports a wide array of image and
|
||
video tasks, like referring segmentation and conversation, by treating
|
||
all inputs (text, images, videos) as tokens in a shared LLM space,
|
||
generating instruction tokens that guide SAM-2 for precise mask
|
||
production.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.04001"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.04001-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/magic-research/Sa2VA"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/papers/2501.04001"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping
|
||
Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/7527a503-4987-4401-961b-f52532788b1f" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Sa2VA leverages a pre-trained LLaVA-like model (containing a visual
|
||
encoder, visual projection layer, and LLM) and appends SAM-2 alongside
|
||
it. Crucially, it uses a <em>decoupled design</em>, where SAM-2’s
|
||
decoder and memory module are frozen. This preserves SAM-2’s perception
|
||
and tracking capabilities and allows Sa2VA to be a plug-and-play module,
|
||
updatable with newer MLLMs. The connection between the LLM and SAM-2 is
|
||
a special “[SEG]” token. The LLM generates this token, and its hidden
|
||
states act as a spatial-temporal prompt for SAM-2’s decoder, which
|
||
produces segmentation masks. The model is trained end-to-end,
|
||
demonstrating scalability. The training uses a unified
|
||
instruction-tuning format for various tasks: referring segmentation,
|
||
visual question answering (VQA), and grounded conversation generation
|
||
(GCG) for both images and videos. It treats all images, videos and
|
||
prompts as visual tokens. A key aspect is the co-training with multiple
|
||
datasets, including image and video data. The authors introduce
|
||
<em>Ref-SAV</em>, an auto-labeled dataset with over 72,000 object
|
||
expressions in complex video scenes, and manually validate 2,000 video
|
||
objects in Ref-SAV for benchmarking referring video object segmentation.
|
||
A simple mask tracking method re-utilizes SAM-2’s knowledge. The model
|
||
formulates all tasks as a single instruction-tuning process. Datasets
|
||
used for co-training are: LLAVA 1.5 (665K), RefCOCO (17K), RefCOCO+
|
||
(17K), RefCOCOg (22K), Grand-f (214K), ChatUniVi (100K). Ref-YTVOS
|
||
(3.5K), MeVIS (0.6K), ReVOS (1.7K) and Ref-SAV (37K).
|
||
</details>
|
||
<h2
|
||
id="tarsier2-advancing-large-vision-language-models-from-detailed-video-description-to-comprehensive-video-understanding"><strong>Tarsier2:
|
||
Advancing Large Vision-Language Models from Detailed Video Description
|
||
to Comprehensive Video Understanding</strong></h2>
|
||
<p>Tarsier2 is a state-of-the-art large vision-language model (LVLM)
|
||
that excels in generating detailed and accurate video descriptions and
|
||
demonstrates superior general video understanding capabilities. It
|
||
scales pre-training data, performs fine-grained temporal alignment
|
||
during supervised fine-tuning, and uses model-based sampling with Direct
|
||
Preference Optimization (DPO) to improve performance, outperforming
|
||
models like GPT-4o and Gemini 1.5 Pro.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.07888"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.07888-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/bytedance/tarsier"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/omni-research/Tarsier-7b"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/e6626842-69ac-4547-8c4b-cb260dd349ca" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Tarsier2 utilizes a straightforward architecture consisting of a vision
|
||
encoder, a vision adaptor, and a large language model (LLM),
|
||
specifically building upon Qwen2-VL. The model undergoes a three-stage
|
||
training process: pre-training, supervised fine-tuning (SFT), and
|
||
reinforcement learning (RL) using Direct Preference Optimization (DPO).
|
||
A key improvement over its predecessor, Tarsier, is the significant
|
||
expansion of the pre-training dataset from 11 million to 40 million
|
||
video-text pairs. This expansion includes the meticulous collection and
|
||
filtering of 11 million commentary videos (explanations and analyses of
|
||
movies and TV shows), providing rich contextual information. During the
|
||
SFT stage, Tarsier2 is trained on a dataset containing 150K instances,
|
||
each with a detailed video description and specific frame annotations
|
||
corresponding to each described event. This <em>fine-grained temporal
|
||
alignment</em> provides supervision that improves accuracy and reduces
|
||
hallucinations compared to traditional video-caption alignment. The SFT
|
||
phase is conducted in two steps. The initial step is frame to event
|
||
allignment. Then, the model’s output to make a more human-like style.
|
||
The final training stage employs DPO with automatically generated
|
||
preference data. Negative samples are created by corrupting videos
|
||
(clip-switching, clip-reversing, clip-cropping, and down-sampling), and
|
||
a preference data filtering method (using AutoDQ) ensures high-quality
|
||
pairs. Tarsier2 achieves state-of-the-art results on 15 public
|
||
benchmarks, demonstrating its versatility across tasks such as video
|
||
question-answering, video grounding, hallucination tests, and embodied
|
||
question-answering. A recaptioning dataset, Tarsier2-Recap-585K, is also
|
||
released.
|
||
</details>
|
||
<h2
|
||
id="ui-tars-pioneering-automated-gui-interaction-with-native-agents"><strong>UI-TARS:
|
||
Pioneering Automated GUI Interaction with Native Agents</strong></h2>
|
||
<p>UI-TARS is a native GUI agent model that operates solely on
|
||
screenshots, performing human-like interactions (keyboard and mouse
|
||
operations). Unlike frameworks relying on wrapped commercial models
|
||
(e.g., GPT-4o), UI-TARS is an end-to-end model achieving
|
||
state-of-the-art (SOTA) performance on 10+ GUI agent benchmarks in
|
||
perception, grounding, and task execution, significantly outperforming
|
||
sophisticated frameworks.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.12326"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.12326-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/bytedance/UI-TARS"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/bytedance-research/UI-TARS-7B-SFT"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo
|
||
Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong,
|
||
Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang,
|
||
Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng,
|
||
Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li,
|
||
Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/9dccbdf3-a0ab-4ae4-930b-09a974f14a03" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
UI-TARS leverages several key innovations: (1) <strong>Enhanced
|
||
Perception</strong>, utilizing a large-scale GUI screenshot dataset for
|
||
context-aware understanding and precise captioning of UI elements; (2)
|
||
<strong>Unified Action Modeling</strong>, standardizing actions into a
|
||
unified space across platforms and achieving precise grounding through
|
||
large-scale action traces; (3) <strong>System-2 Reasoning</strong>,
|
||
incorporating deliberate reasoning for multi-step decision-making,
|
||
including task decomposition, reflection, and milestone recognition; and
|
||
(4) <strong>Iterative Training with Reflective Online Traces</strong>,
|
||
addressing the data bottleneck by automatically collecting, filtering,
|
||
and refining interaction traces on hundreds of virtual machines. The
|
||
model is trained iteratively and tuned via reflection, continuously
|
||
learning from mistakes and adapting to new situations with minimal human
|
||
intervention. The architecture takes screenshots as input and uses a
|
||
Vision-Language Model (VLM), specifically Qwen-2-VL 7B and 72B, to
|
||
process visual information and generate actions. The action space is
|
||
unified across platforms (mobile, desktop, web) and includes actions
|
||
like click, type, scroll, and drag. Reasoning is infused by generating
|
||
explicit “thoughts” before each action, inspired by the ReAct framework.
|
||
These thoughts are generated through a combination of curated GUI
|
||
tutorials and augmented action traces, incorporating patterns like task
|
||
decomposition, long-term consistency, milestone recognition, trial and
|
||
error, and reflection. The training process involves multiple stages,
|
||
starting with perception enhancement using a curated dataset of GUI
|
||
screenshots and associated metadata. This dataset supports tasks like
|
||
element description, dense captioning, state transition captioning,
|
||
question answering, and set-of-mark prompting. Action modeling is
|
||
improved by creating a large-scale dataset of action traces and using
|
||
grounding data to pair element descriptions with spatial coordinates.
|
||
The model is trained using a combination of supervised fine-tuning (SFT)
|
||
and Direct Preference Optimization (DPO) with reflection tuning to learn
|
||
from errors.
|
||
</details>
|
||
<h2
|
||
id="videochat-flash-hierarchical-compression-for-long-context-video-modeling"><strong>VideoChat-Flash:
|
||
Hierarchical Compression for Long-Context Video Modeling</strong></h2>
|
||
<p>VideoChat-Flash is a system designed for handling long-form video
|
||
content in multimodal large language models (MLLMs). It introduces a
|
||
Hierarchical visual token Compression (HiCo) method to reduce
|
||
computational load while preserving essential details, and uses a
|
||
multi-stage learning approach with a new long-video dataset (LongVid) to
|
||
achieve state-of-the-art performance on both long and short video
|
||
benchmarks.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.00574"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.00574-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/OpenGVLab/VideoChat-Flash"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang,
|
||
Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang,
|
||
Limin Wang</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/49048795-6a76-41e7-b441-1313d0813630" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<p><strong>Hierarchical visual token Compression (HiCo):</strong> This
|
||
is the core innovation. It compresses video information at two levels:
|
||
<strong>Clip-level Compression:</strong> The video is divided into
|
||
clips. A visual encoder (UMT-L) processes each clip, and a compressor
|
||
(token merging with MLP) reduces the number of visual tokens. This
|
||
exploits inter-frame redundancy. <strong>Video-level
|
||
Compression:</strong> During the LLM (Qwen2-7B) interaction, visual
|
||
tokens are further reduced using a progressive visual dropout strategy.
|
||
This leverages the idea that the LLM focuses on the entire context at
|
||
shallow layers and specific details at deeper layers. It combines
|
||
uniform dropout (shallow layers) and text-guided selection (deep
|
||
layers). <strong>Visual Encoder:</strong> UMT-L@224 [30] (a video
|
||
encoder, shown to be more efficient than image encoders like SigLIP).
|
||
<strong>Visual-Language Connector:</strong> A compressor (token merging)
|
||
followed by an MLP projection. <strong>Large Language Model
|
||
(LLM):</strong> Qwen2-7B. <strong>Multi-stage Short-to-Long
|
||
Learning:</strong> This is a crucial training strategy: <strong>Stage 1:
|
||
Video-Language Alignment:</strong> Train the compressor and MLP with
|
||
image-text and short video-text pairs (0.5M each). <strong>Stage 2:
|
||
Short Video Pre-training:</strong> Enhance visual understanding with
|
||
more images (3.5M) and short videos (2.5M). <strong>Stage 3: Joint Short
|
||
& Long Video Instruction Tuning:</strong> Fine-tune on a mix of
|
||
images (1.1M), short videos (1.7M), and long videos (0.7M) with
|
||
instruction-following data. <strong>Stage 4: Efficient High-Resolution
|
||
Post-finetuning:</strong> Adapt to higher resolutions (224 to 448) by
|
||
fine-tuning the video encoder on a subset (25%) of Stage 3
|
||
data.<strong>Dynamic Video Sampling:</strong> Uses a dual sampling
|
||
strategy: dense sampling (15 fps) for short videos (capturing motion)
|
||
and sparse sampling (1 fps) for long videos (capturing events).
|
||
<strong>Timestamp-aware Prompt:</strong> Uses a simple text prompt to
|
||
provide timestamp information to the model: “The video lasts for N
|
||
seconds, and T frames are uniformly sampled from
|
||
it.<strong>LongVid:</strong> A new large-scale long video
|
||
instruction-tuning dataset introduced in the paper. It contains 114,228
|
||
long videos and 3,444,849 question-answer pairs across five task types.
|
||
It leverages existing datasets (Ego4D, HowTo100M, HD-Vila, MiraData) and
|
||
generates dense event labels. <strong>Mixed Training Data:</strong> Uses
|
||
a combination of short and long videos during training. <strong>NIAH
|
||
(Needle In A video Haystack)</strong>. A newly created dataset for
|
||
testing models capabilities for understanding long contexts.</p>
|
||
</details>
|
||
<h2
|
||
id="videollama-3-frontier-multimodal-foundation-models-for-image-and-video-understanding"><strong>VideoLLaMA
|
||
3: Frontier Multimodal Foundation Models for Image and Video
|
||
Understanding</strong></h2>
|
||
<p>VideoLLaMA3 is a vision-centric multimodal foundation model designed
|
||
for both image and video understanding, emphasizing a training paradigm
|
||
and framework that prioritize high-quality image-text data, alongside an
|
||
adaptable vision encoder and video token compression, to achieve
|
||
state-of-the-art performance.</p>
|
||
<p><a href="https://arxiv.org/abs/2501.13106v1"><img
|
||
src="https://img.shields.io/badge/arXiv-2501.13106-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/DAMO-NLP-SG/VideoLLaMA3"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/papers/2501.13106"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan,
|
||
Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin,
|
||
Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/350a1228-c14e-45ed-b59f-e99608ee9a7d" width=600/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>VideoLLaMA3</strong> introduces a vision-centric approach in
|
||
both its training paradigm and framework design, focusing on enhancing
|
||
image and video understanding capabilities. The core architecture
|
||
incorporates a pre-trained vision encoder (SigLIP), a video compressor
|
||
(DiffFP), a projector, and a large language model (LLM - Qwen2.5). The
|
||
model employs a four-stage training process: 1) <strong>Vision Encoder
|
||
Adaptation</strong>, where the vision encoder is adapted to accept
|
||
images of variable resolutions using scene images, document data, and
|
||
scene text images; 2) <strong>Vision-Language Alignment</strong>, which
|
||
jointly tunes the vision encoder, projector, and LLM using large-scale
|
||
image-text data (including detailed captions, documents, and charts) and
|
||
a small amount of text-only data; 3) <strong>Multi-task
|
||
Fine-tuning</strong>, incorporating image-text data for downstream tasks
|
||
and general video caption data; and 4) <strong>Video-centric
|
||
Fine-tuning</strong>, using general videos, streaming videos, temporally
|
||
grounded videos, image-only, and text-only data. A key innovation is
|
||
<strong>Any-resolution Vision Tokenization (AVT)</strong>, which allows
|
||
the vision encoder to process images and videos of any resolution by
|
||
replacing fixed positional embeddings with Rotary Position Embedding
|
||
(RoPE). This enables handling images with variable shapes and minimal
|
||
information loss. For video inputs, <strong>Differential Frame Pruner
|
||
(DiffFP)</strong> acts as a video compressor, reducing the number of
|
||
vision tokens by comparing the 1-norm distance between temporally
|
||
consecutive patches in pixel space and pruning redundant patches. This
|
||
makes video representations more compact and precise. The training data
|
||
mixture is carefully curated for each stage, emphasizing high-quality
|
||
image-text data. The Vision Encoder Adaptation stage uses datasets like
|
||
VL3-Syn7M-short, LLaVA-Pretrain-558k, and document datasets. The
|
||
Vision-Language Alignment stage expands on this with detailed captions,
|
||
OCR data, and fine-grained data with bounding boxes. The Multi-task
|
||
Fine-tuning stage adds question-answering data and general video caption
|
||
data. Finally, the Video-centric Fine-tuning stage includes general
|
||
videos, streaming videos, and temporal grounding data. This
|
||
“vision-centric” approach, prioritizing image understanding as a
|
||
foundation for video understanding, along with AVT and DiffFP, allows
|
||
VideoLLaMA3 to achieve strong performance on both image and video
|
||
benchmarks.
|
||
</details>
|
||
<h2
|
||
id="llama-3.2-vision-enhanced-multimodal-capabilities-built-on-llama-3"><strong>Llama
|
||
3.2-Vision: Enhanced Multimodal Capabilities Built on Llama
|
||
3</strong></h2>
|
||
<p>Llama 3.2-Vision extends the Llama 3 text-only model with multimodal
|
||
capabilities, allowing it to process both text and images. This model,
|
||
available in 11B and 90B parameter sizes, leverages a vision adapter
|
||
with cross-attention layers to integrate image representations from a
|
||
separate vision encoder into the core Llama 3 LLM, achieving strong
|
||
performance on visual recognition, image reasoning, captioning, and
|
||
visual question answering.</p>
|
||
<a href="https://github.com/meta-llama/llama-models"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/meta-llama/Llama-3.2-11B-Vision"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Meta
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/f6428237-8607-4de1-975d-dfc4c683b7a3" width=600/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Llama 3.2-Vision</strong> builds upon the Llama 3 architecture,
|
||
an auto-regressive language model using an optimized transformer. It
|
||
adds a <em>vision adapter</em>, comprised of cross-attention layers, to
|
||
incorporate visual information. This adapter receives input from a
|
||
<em>separate vision encoder</em> (not part of the core Llama 3 model),
|
||
allowing the model to process images without directly converting them
|
||
into text tokens. The <code><|image|></code> tag within the prompt
|
||
signifies the presence of an image and dictates where the visual
|
||
information is integrated via cross-attention. This integration occurs
|
||
<em>after</em> the image tag and influences <em>subsequent</em> text
|
||
tokens. The model supports a context length of 128k tokens and utilizes
|
||
Grouped-Query Attention (GQA). The model family was trained on 6B
|
||
image-text pairs. Pretraining data cutoff is December 2023, supports
|
||
English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
|
||
However image-text tasks are only in English. The model’s training
|
||
involves supervised fine-tuning (SFT) and reinforcement learning with
|
||
human feedback (RLHF) for instruction-tuned versions. The base models
|
||
are suitable for text completion, with or without an image, using
|
||
specific prompt formats. Instruction-tuned models excel at tasks like
|
||
Visual Question Answering (VQA), Document VQA (DocVQA), image
|
||
captioning, and image-text retrieval. The training process includes
|
||
stages of pretraining and annealing, leveraging a vast amount of data
|
||
and significant computational resources (H100 GPUs). Key capabilities
|
||
include handling both text and image inputs, answering questions about
|
||
images, generating captions, and performing visual reasoning. The model
|
||
<em>does not</em> support built-in tool calling (like
|
||
<code>brave_search</code> or <code>wolfram_alpha</code>) when an image
|
||
is present in the prompt; tool calling is only available for text-only
|
||
inputs. The intended use cases cover a wide range of applications, but
|
||
usage is restricted by the Llama 3.2 Community License and Acceptable
|
||
Use Policy, particularly regarding languages and potential misuse. Meta
|
||
emphasizes a responsible deployment approach, including providing tools
|
||
like Llama Guard for safety and encouraging developers to implement
|
||
appropriate safeguards. The model underwent extensive evaluations,
|
||
including red teaming and assessments for critical risks such as CBRNE,
|
||
child safety, and cyber attacks.
|
||
</details>
|
||
<h2
|
||
id="smolvlm-a-small-efficient-and-open-source-vision-language-model"><strong>SmolVLM:
|
||
A Small, Efficient, and Open-Source Vision-Language Model</strong></h2>
|
||
<p>SmolVLM is a 2B parameter vision-language model (VLM) that achieves
|
||
state-of-the-art performance for its memory footprint, offering a small,
|
||
fast, and memory-efficient solution for multimodal tasks. It is fully
|
||
open-source, with all model checkpoints, datasets, training recipes, and
|
||
tools released under the Apache 2.0 license, enabling local deployment,
|
||
reduced inference costs, and user customization.</p>
|
||
<p><a href="https://huggingface.co/blog/smolvlm"><img
|
||
src="https://img.shields.io/badge/Blog-SmolVLM%20Blog-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/huggingface/smollm"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Andres Marafioti, Merve Noyan, Miquel Farré, Elie Bakouch, Pedro
|
||
Cuenca</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/901ed802-5c1c-4733-ab2a-6b61514b9c71" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
SmolVLM builds upon the architecture of Idefics3, leveraging a similar
|
||
implementation in transformers but with key differences to enhance
|
||
efficiency. It replaces the Llama 3.1 8B language backbone with the
|
||
smaller SmolLM2 1.7B model. A more aggressive image compression strategy
|
||
is employed, using a pixel shuffle strategy that reduces visual
|
||
information by a factor of 9 (compared to 4x in Idefics3). This allows
|
||
for 384x384 patches, and a shape-optimized SigLIP is used as the vision
|
||
backbone with 14x14 inner patches. The model demonstrates superior
|
||
memory usage compared to other VLMs in transformers, enabling efficient
|
||
on-device inference. For instance, encoding a single image and prompt
|
||
requires only 1.2k tokens, significantly less than models like Qwen2-VL.
|
||
This efficiency translates to faster prefill and generation throughputs.
|
||
SmolVLM achieves strong performance on benchmarks such as MMMU,
|
||
MathVista, MMStar, DocVQA, and TextVQA. It also shows promising results
|
||
in basic video analysis, leveraging its long context capabilities.
|
||
Training involved extending the context window of SmolLM2 to 16k tokens
|
||
using techniques like RoPE base value adjustment and fine-tuning on a
|
||
mixture of long- and short-context datasets. A curated training dataset,
|
||
largely based on The Cauldron and Docmatix, was used for the VLM
|
||
training. Checkpoint selection was based on a weighted metric across
|
||
multiple vision-language benchmarks. The model is integrated with
|
||
VLMEvalKit for easy evaluation, and it can be readily used and
|
||
fine-tuned with the transformers library. TRL integration allows for
|
||
applying Direct Preference Optimization (DPO). A notebook is provided
|
||
for fine-tuning on VQAv2, with options for LoRA, QLoRA, or full
|
||
fine-tuning, even within the constraints of consumer GPUs.
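<p><em>Illustrative code.</em> A minimal PyTorch sketch of the pixel-shuffle (space-to-depth) compression described above, which trades each 3×3 block of visual tokens for one wider token and so cuts the token count by 9×. The function name and tensor shapes are illustrative, not taken from the SmolVLM release.</p>
<pre><code>import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 3) -> torch.Tensor:
    """Space-to-depth over a grid of visual tokens.
    x: (batch, h, w, d) patch embeddings -> (batch, h//r, w//r, d*r*r)."""
    b, h, w, d = x.shape
    assert h % r == 0 and w % r == 0
    x = x.view(b, h // r, r, w // r, r, d)        # split both axes into blocks of r
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each r x r block together
    return x.view(b, h // r, w // r, d * r * r)   # fold the block into the channel dim

tokens = torch.randn(1, 27, 27, 1152)             # e.g. a SigLIP-style 27x27 token grid
compressed = pixel_shuffle_tokens(tokens, r=3)    # (1, 9, 9, 10368): 9x fewer tokens
</code></pre>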
|
||
</details>
|
||
<h2 id="idefics2"><strong>Idefics2</strong></h2>
|
||
<p>IDEFICS2, an 8B parameter open-source vision-language model,
|
||
efficiently processes interleaved image and text sequences by combining
|
||
a SigLIP vision encoder, a Mistral-7B LLM, and a Perceiver pooling layer
|
||
with MLP projection for robust text encoding, excelling in tasks like
|
||
OCR and document understanding.</p>
|
||
<a href="https://arxiv.org/abs/2405.02246"><img
|
||
src="https://img.shields.io/badge/arXiv-2405.02246-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/spaces/HuggingFaceM4/idefics-8b"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/c197c8c5-8da2-4d96-8999-8e05e81f1506" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
IDEFICS2 is an 8B parameter open-source vision-language model adept at
|
||
handling interleaved image and text sequences. IDEFICS2 utilizes a
|
||
vision-language architecture designed for efficient processing of image
|
||
and text sequences. It employs the SigLIP model as the vision encoder,
|
||
extracting features from images in their native resolutions and aspect
|
||
ratios. The Mistral-7B model serves as the LLM backbone, providing
|
||
language understanding and generation capabilities. For text encoding,
|
||
IDEFICS2 leverages a <strong>Perceiver pooling layer</strong> followed
|
||
by an <strong>MLP projection</strong> to integrate visual features with
|
||
the LLM’s embedding space. This combination of vision encoder, LLM, and
|
||
text encoder enables IDEFICS2 to handle various multimodal tasks, with a
|
||
particular focus on OCR and document understanding. The model is trained
|
||
on a diverse dataset encompassing OBELICS, LAION Coco, and PMD, with
|
||
additional data for OCR tasks. Fine-tuning is performed on instruction
|
||
datasets like The Cauldron and OpenHermes-2.5.
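<p><em>Illustrative code.</em> A minimal PyTorch sketch of the Perceiver pooling plus MLP projection used as the connector: a small set of learned latent queries cross-attends to the image features, and a two-layer MLP maps the pooled tokens into the LLM embedding space. Dimensions and module names are assumptions for illustration, not the IDEFICS2 implementation.</p>
<pre><code>import torch
import torch.nn as nn

class PerceiverPooler(nn.Module):
    """Compress a variable number of visual features into n_latents tokens."""
    def __init__(self, vis_dim=1152, llm_dim=4096, n_latents=64, n_heads=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.proj = nn.Sequential(                 # MLP projection into the LLM space
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vis_feats):                  # vis_feats: (batch, n_patches, vis_dim)
        q = self.latents.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, vis_feats, vis_feats)  # latents attend to patches
        return self.proj(pooled)                   # (batch, n_latents, llm_dim)

pooled = PerceiverPooler()(torch.randn(2, 729, 1152))         # -> (2, 64, 4096)
</code></pre>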
|
||
</details>
|
||
<h2
|
||
id="idefics3-8b-building-and-better-understanding-vision-language-models"><strong>Idefics3-8B:
|
||
Building and Better Understanding Vision-Language Models</strong></h2>
|
||
<p>Idefics3-8B is a powerful open-source vision-language model (VLM)
|
||
that significantly outperforms its predecessor, Idefics2-8B, while being
|
||
trained efficiently and exclusively on open datasets. It leverages a
|
||
straightforward pipeline and introduces Docmatix, a massive dataset for
|
||
document understanding, to achieve state-of-the-art performance within
|
||
its size category across various multimodal benchmarks.</p>
|
||
<a href="https://arxiv.org/abs/2408.12637"><img
|
||
src="https://img.shields.io/badge/arXiv-2408.12637-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/spaces/HuggingFaceM4/idefics3"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/5e61fec2-b41b-4ad8-a167-1966f169b866" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Idefics3-8B builds upon the foundation of pre-trained unimodal models,
|
||
specifically Llama 3.1 instruct as the language model and SigLIP-SO400M
|
||
as the vision encoder. It adopts a self-attention architecture, where
|
||
visual features are treated as tokens and concatenated with text tokens
|
||
before being fed into the LLM. To enhance OCR capabilities and address
|
||
the bottleneck of limited visual tokens per image, Idefics3-8B replaces
|
||
the perceiver resampler used in Idefics2 with a simple pixel shuffle
|
||
strategy, similar to InternVL-1.5. This strategy reduces the number of
|
||
image hidden states by a factor of 4, allowing for the encoding of
|
||
larger images (up to 364x364 pixels) into 169 visual tokens. The model
|
||
utilizes an image-splitting strategy during both training and inference,
|
||
dividing the original image into a matrix of 364x364 pixel tiles. To
|
||
preserve the 2D structure and positional information of these tiles, a
|
||
text token ‘’ is inserted after each row of tiles, and the downscaled
|
||
original image is appended to the sequence. Additionally, each tile is
|
||
prepended with textual tokens indicating its position in the matrix. The
|
||
training process consists of three stages of pre-training followed by
|
||
supervised fine-tuning. In the first pre-training stage, the backbones
|
||
(LLM and vision encoder) are frozen, and only the newly initialized
|
||
parameters are trained. The maximum image resolution is gradually
|
||
increased from 364² to 1820². From the second stage onward, the
|
||
backbones are efficiently trained using DoRA (a variant of LoRA), and
|
||
larger images are introduced into the training data. The final
|
||
pre-training stage focuses on training with large synthetic datasets,
|
||
including Docmatix, Websight, LNQA, PixelProse, and ChartGemma. During
|
||
supervised fine-tuning, NEFTune noise is applied to the inputs, and the
|
||
loss is calculated only on the answer tokens. The learning rate is kept
|
||
constant for the first two pre-training stages and linearly decayed to
|
||
zero during the final pre-training stage and supervised fine-tuning.
|
||
Idefics3-8B demonstrates significant improvements over Idefics2,
|
||
particularly in document understanding tasks, achieving a 13.7-point
|
||
improvement on DocVQA. This highlights the effectiveness of the Docmatix
|
||
dataset and the architectural choices made in Idefics3-8B. The model
|
||
also achieves state-of-the-art performance within its size category
|
||
across various multimodal benchmarks, including MMMU, MathVista, MMStar,
|
||
and TextVQA, showcasing its strong capabilities in visual understanding
|
||
and reasoning.
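<p><em>Illustrative code.</em> The supervised fine-tuning stage applies NEFTune noise to the inputs; here is a minimal sketch of the standard NEFTune recipe, which perturbs the input embeddings with uniform noise scaled by alpha / sqrt(L·d). The alpha value and shapes are illustrative.</p>
<pre><code>import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """embeddings: (batch, seq_len, dim) input embeddings during SFT.
    Adds uniform noise with magnitude alpha / sqrt(seq_len * dim)."""
    _, seq_len, dim = embeddings.shape
    scale = alpha / (seq_len * dim) ** 0.5
    return embeddings + torch.zeros_like(embeddings).uniform_(-1.0, 1.0) * scale
</code></pre>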
|
||
</details>
|
||
<h2
|
||
id="internlm-xcomposer2-mastering-free-form-text-image-composition-and-comprehension-in-vision-language-large-model"><strong>InternLM-XComposer2:
|
||
Mastering Free-form Text-Image Composition and Comprehension in
|
||
Vision-Language Large Model</strong></h2>
|
||
<p>InternLM-XComposer2 excels in free-form text-image composition and
|
||
comprehension by connecting a CLIP pre-trained vision encoder with the
|
||
powerful InternLM-2 LLM using a novel Partial LoRA module, enabling
|
||
efficient alignment of visual and language tokens for enhanced
|
||
multimodal understanding.</p>
|
||
<a href="https://arxiv.org/abs/2401.16420"><img
|
||
src="https://img.shields.io/badge/arXiv-2401.16420-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/InternLM/InternLM-XComposer"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/Willow123/InternLM-XComposer"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang,
|
||
Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang,
|
||
Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai
|
||
Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/732d3b7b-02de-42d3-ae76-800bf035b391" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>InternLM-XComposer2</strong>: This model introduces a
|
||
sophisticated architecture that leverages a vision encoder and a Large
|
||
Language Model (LLM), interconnected through a Partial Low-Rank
|
||
Adaptation (LoRA) module. This innovative setup allows
|
||
InternLM-XComposer2 to effectively process both images and text,
|
||
employing visual tokens generated by the vision encoder alongside
|
||
language tokens derived from the tokenized text. The vision encoder,
|
||
pre-trained using CLIP for image-language contrastive learning, and
|
||
InternLM-2, which serves as the LLM with multi-lingual capabilities, are
|
||
key components of this architecture. <strong>The Partial LoRA</strong>
|
||
module distinguishes itself by aligning visual and language tokens
|
||
through low-rank adaptation applied specifically to visual tokens,
|
||
enhancing the model’s multimodal understanding and processing
|
||
efficiency. The training methodology of InternLM-XComposer2 is
|
||
multifaceted, focusing on fine-tuning the vision encoder and Partial
|
||
LoRA to align visual tokens with the LLM across various datasets. This
|
||
process involves general semantic alignment, world knowledge alignment,
|
||
and vision capability enhancement to refine the model’s ability to
|
||
interpret image information and compose text-image content. Supervised
|
||
fine-tuning further includes multi-task training and free-form
|
||
text-image composition, aiming to optimize the model’s performance in
|
||
leveraging image information for comprehensive text-image generation and
|
||
understanding. Alignment techniques and fusion methods in
|
||
InternLM-XComposer2 utilize the Partial LoRA module for the effective
|
||
integration of different modalities, thereby enriching the LLM with
|
||
modality-specific knowledge while preserving its inherent capabilities.
|
||
This selective enhancement of visual tokens through Partial LoRA enables
|
||
the model to exhibit robust performance across visual and textual
|
||
domains, facilitating detailed perception, logical reasoning, and
|
||
extensive knowledge integration in multimodal understanding. The model
|
||
employs a diverse array of datasets, including ShareGPT4V-PT, COCO,
|
||
Nocaps, TextCaps, and many others, for pre-training and supervised
|
||
fine-tuning. These datasets serve to equip InternLM-XComposer2 with a
|
||
broad range of capabilities, including general semantic alignment, world
|
||
knowledge alignment, vision capability enhancement, and the facilitation
|
||
of free-form text-image composition, marking a significant advancement
|
||
in the field of vision-language large models.
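<p><em>Illustrative code.</em> A minimal PyTorch sketch of the Partial LoRA idea: a frozen base projection plus a low-rank update that is applied only at positions flagged as visual tokens, leaving behaviour on pure text untouched. Class and argument names are illustrative, not the InternLM-XComposer2 implementation.</p>
<pre><code>import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update applied only to visual tokens."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # start as a zero update

    def forward(self, x, visual_mask):             # visual_mask: (batch, seq_len) bool
        delta = self.lora_b(self.lora_a(x))
        return self.base(x) + delta * visual_mask.unsqueeze(-1)  # visual positions only
</code></pre>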
|
||
</details>
|
||
<h2
|
||
id="internlm-xcomposer2-4khd-a-pioneering-large-vision-language-model-handling-resolutions-from-336-pixels-to-4k-hd"><strong>InternLM-XComposer2-4KHD:
|
||
A Pioneering Large Vision-Language Model Handling Resolutions from 336
|
||
Pixels to 4K HD</strong></h2>
|
||
<p>InternLM-XComposer2-4KHD, building on its predecessor, pioneers
|
||
high-resolution image handling in LVLMs by employing dynamic resolution
|
||
with automatic patch configuration, adapting to resolutions from 336
|
||
pixels up to 4K HD for enhanced visual understanding without
|
||
distortion.</p>
|
||
<a href="https://arxiv.org/abs/2404.06512v1"><img
|
||
src="https://img.shields.io/badge/arXiv-2404.06512v1-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a><br />
|
||
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang,
|
||
Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang
|
||
Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen,
|
||
Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi
|
||
Wang<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/c09b67fb-32eb-43de-82fa-96c3af22caf4" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>InternLM-XComposer2-4KHD</strong>: Cutting-edge Large
|
||
Vision-Language Model (LVLM) designed to handle ultra-high resolutions,
|
||
up to 4K HD and beyond, while also supporting diverse resolutions from
|
||
336 pixels. The model builds upon the InternLM-XComposer2 architecture,
|
||
incorporating a novel <strong>dynamic resolution with automatic patch
|
||
configuration</strong> technique. This allows the model to dynamically
|
||
adjust patch layouts and counts based on the input image’s aspect ratio,
|
||
enabling efficient processing of high-resolution images while preserving
|
||
their original proportions. To address potential ambiguity arising from
|
||
variable patch configurations, a newline token is introduced to
|
||
delineate rows of patch tokens, significantly improving performance.
|
||
InternLM-XComposer2-4KHD is pre-trained on a diverse dataset, including
|
||
image-caption pairs, concept knowledge, and OCR datasets, focusing on
|
||
enhancing high-resolution and structural image understanding. Supervised
|
||
fine-tuning further incorporates a mixed-resolution strategy, employing
|
||
higher resolution for tasks requiring fine-grained detail, like HD-OCR
|
||
tasks, and dynamically adjusted resolution for other tasks. This
|
||
approach enables the model to excel in both high-resolution scenarios
|
||
and general vision-language understanding tasks.
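<p><em>Illustrative code.</em> A minimal sketch of the newline-token trick mentioned above: after the dynamic patch layout is flattened, a learned "newline" embedding is appended at the end of every patch row so the LLM can recover the 2D layout regardless of how many patches a given image produced. Shapes are illustrative.</p>
<pre><code>import torch

def add_row_newlines(patch_tokens: torch.Tensor, newline: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (rows, cols, dim) embeddings for one image's patch grid.
    newline: (dim,) learned embedding marking the end of a patch row.
    Returns a flat (rows * (cols + 1), dim) sequence with a newline after each row."""
    rows, cols, dim = patch_tokens.shape
    nl = newline.view(1, 1, dim).expand(rows, 1, dim)
    return torch.cat([patch_tokens, nl], dim=1).reshape(rows * (cols + 1), dim)
</code></pre>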
|
||
</details>
|
||
<h2
|
||
id="internlm-xcomposer-2.5-a-versatile-large-vision-language-model-supporting-long-contextual-input-and-output"><strong>InternLM-XComposer-2.5:
|
||
A Versatile Large Vision Language Model Supporting Long-Contextual Input
|
||
and Output</strong></h2>
|
||
<p>InternLM-XComposer-2.5 (IXC-2.5) is a versatile Large Vision Language
|
||
Model (LVLM) designed to handle long-contextual input and output,
|
||
excelling in various text-image comprehension and composition tasks. It
|
||
achieves performance comparable to GPT-4V with a significantly smaller
|
||
7B LLM backend, demonstrating its efficiency and scalability.</p>
|
||
<p><a href="https://arxiv.org/pdf/2407.03320"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.03320-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/InternLM/InternLM-XComposer"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/Willow123/InternLM-XComposer"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen,
|
||
Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei
|
||
Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li,
|
||
Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng
|
||
Dai, Yu Qiao, Dahua Lin, Jiaqi Wang</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/1330a013-930b-4b23-90dc-94616b59ca0b" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
InternLM-XComposer-2.5 builds upon its previous iterations (IXC-2 and
|
||
IXC-2-4KHD) and features a three-component architecture: a lightweight
|
||
<strong>OpenAI ViT-L/14 vision encoder</strong>, a powerful InternLM2-7B
|
||
LLM, and <strong>Partial LoRA</strong> for efficient alignment between
|
||
the visual and language modalities. IXC-2.5 supports diverse input
|
||
modalities, including text, single/multiple images, and videos. It
|
||
utilizes a Unified Dynamic Image Partition strategy to handle
|
||
high-resolution images and videos, resizing and padding them into
|
||
smaller patches while preserving aspect ratios. For videos, frames are
|
||
sampled and concatenated along the short side, creating a
|
||
high-resolution composite image. The model is pre-trained in three
|
||
stages: general semantic alignment, world knowledge alignment, and
|
||
vision capability enhancement, using a diverse range of datasets. During
|
||
pre-training, the LLM is frozen, and the vision encoder and Partial LoRA
|
||
are fine-tuned to align visual tokens with the LLM. Supervised
|
||
fine-tuning is then performed on a collection of datasets covering
|
||
various tasks, including captioning, visual question answering,
|
||
multi-turn QA, science QA, chart QA, math QA, OCR QA, video
|
||
understanding, and conversation. This fine-tuning process involves
|
||
jointly training all components with a weighted data sampling strategy
|
||
and specific learning rate schedules for each component. IXC-2.5 also
|
||
introduces two novel applications: crafting webpages and composing
|
||
high-quality text-image articles. For webpage generation, the model is
|
||
trained on a combination of synthetic and real-world web data, enabling
|
||
it to generate HTML, CSS, and JavaScript code based on screenshots,
|
||
instructions, or resume documents. For article composing, IXC-2.5
|
||
leverages Chain-of-Thought (CoT) and Direct Preference Optimization
|
||
(DPO) techniques to enhance the quality of written content. This
|
||
involves rewriting original prompts using CoT, generating diverse
|
||
responses using different random seeds, and training a reward model to
|
||
select preferred responses, ultimately leading to more creative and
|
||
high-quality article generation.
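<p><em>Illustrative code.</em> One plausible reading of the video handling described above: sample frames evenly and concatenate them along their short side into a single high-resolution composite image that the image pipeline can then process. This is a sketch under that assumption, not the IXC-2.5 preprocessing code.</p>
<pre><code>from PIL import Image

def frames_to_composite(frames, max_frames=16):
    """frames: list of PIL images from one video; returns one composite image."""
    step = max(1, len(frames) // max_frames)
    picked = frames[::step][:max_frames]
    w, h = picked[0].size
    if h >= w:                                   # short side is the width: tile horizontally
        canvas = Image.new("RGB", (w * len(picked), h))
        for i, f in enumerate(picked):
            canvas.paste(f.resize((w, h)), (i * w, 0))
    else:                                        # short side is the height: tile vertically
        canvas = Image.new("RGB", (w, h * len(picked)))
        for i, f in enumerate(picked):
            canvas.paste(f.resize((w, h)), (0, i * h))
    return canvas
</code></pre>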
|
||
</details>
|
||
<h2
|
||
id="internvl-2.5-expanding-performance-boundaries-of-open-source-multimodal-models-with-model-data-and-test-time-scaling"><strong>InternVL
|
||
2.5: Expanding Performance Boundaries of Open-Source Multimodal Models
|
||
with Model, Data, and Test-Time Scaling</strong></h2>
|
||
<p>InternVL 2.5 is an advanced Multimodal Large Language Model (MLLM)
|
||
series that builds upon InternVL 2.0, maintaining its core architecture
|
||
while enhancing training and testing strategies, and data quality, to
|
||
rival leading commercial models like GPT-4o and Claude-3.5-Sonnet.</p>
|
||
<a href="https://arxiv.org/abs/2412.05271"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.05271-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/OpenGVLab/InternVL"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/OpenGVLab/InternVL2_5-78B"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui,
|
||
Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, Lixin Gu, Xuehui Wang,
|
||
Qingyun Li, Yimin Ren, Zixuan Chen, Jiapeng Luo, Jiahao Wang, Tan Jiang,
|
||
Bo Wang, Conghui He, Botian Shi, Xingcheng Zhang, Han Lv, Yi Wang, Wenqi
|
||
Shao, Pei Chu, Zhongying Tu, Tong He, Zhiyong Wu, Huipeng Deng, Jiaye
|
||
Ge, Kai Chen, Kaipeng Zhang, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu,
|
||
Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/d1651bde-a587-4b60-83e4-40468d6442ee" width="600"/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>InternVL 2.5</strong> retains the “ViT-MLP-LLM” architecture of
|
||
its predecessors, combining a pre-trained InternViT (either InternViT-6B
|
||
or InternViT-300M) with LLMs of varying sizes (InternLM 2.5, Qwen 2.5)
|
||
via a 2-layer MLP projector. A key feature is the pixel unshuffle
|
||
operation, reducing visual tokens from 1024 to 256 per 448x448 image
|
||
tile, improving scalability for high-resolution processing. The model
|
||
architecture supports dynamic resolution, adapting to image aspect
|
||
ratios by dividing images into 448x448 tiles. Crucially, InternVL 2.0
|
||
and 2.5 incorporate multi-image and video data, in addition to
|
||
single-image and text-only data. The training strategy involves a
|
||
three-stage pipeline: (1) MLP warmup, where only the MLP projector is
|
||
trained, (2) optional ViT incremental learning, where the vision encoder
|
||
and MLP are trained to enhance visual feature extraction, particularly
|
||
for domains rare in web-scale data, and (3) full model instruction
|
||
tuning, where the entire model is trained on high-quality multimodal
|
||
instruction datasets. A progressive scaling strategy is employed,
|
||
starting with smaller LLMs and scaling up, allowing for efficient
|
||
alignment of the vision encoder with larger LLMs. Training enhancements
|
||
include random JPEG compression (for robustness to real-world image
|
||
quality) and loss reweighting (to balance contributions from responses
|
||
of different lengths). Data organization is optimized using parameters
|
||
like <code>nmax</code> (maximum tile number) and a repeat factor
|
||
(<code>r</code>) to control data sampling frequency. A data-packing
|
||
strategy concatenates multiple samples into longer sequences to improve
|
||
GPU utilization. A significant contribution is a data filtering pipeline
|
||
to remove low-quality samples, particularly those with repetitive
|
||
patterns, mitigating the risk of repetitive generation, a common issue
|
||
in MLLMs. The data mixture includes a wide range of tasks (captioning,
|
||
general QA, mathematics, charts, OCR, etc.) and modalities
|
||
(single-image, multi-image, video, text). The model was evaluated
|
||
comprehensively on diverse benchmarks including multi-discipline
|
||
reasoning (MMMU, MMMU-Pro), document understanding (DocVQA),
|
||
multi-image/video understanding, real-world comprehension, multimodal
|
||
hallucination detection, visual grounding, multilingual capabilities,
|
||
and pure language processing.
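<p><em>Illustrative code.</em> A minimal sketch of the data-packing idea mentioned above: tokenized samples are greedily concatenated into sequences up to a maximum length so that less of each batch is padding. The greedy strategy and the length limit are illustrative simplifications.</p>
<pre><code>def pack_samples(samples, max_len=8192):
    """samples: list of token-id lists; returns sequences no longer than max_len,
    each formed by concatenating whole samples (assumes every sample fits alone)."""
    packed, current = [], []
    for ids in samples:
        if current and len(current) + len(ids) > max_len:
            packed.append(current)               # start a new packed sequence
            current = []
        current.extend(ids)
    if current:
        packed.append(current)
    return packed
</code></pre>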
|
||
</details>
|
||
<h2
|
||
id="deepseek-vl-towards-real-world-vision-language-understanding"><strong>DeepSeek-VL:
|
||
Towards Real-World Vision-Language Understanding</strong></h2>
|
||
<p>DeepSeek-VL, utilizing a hybrid vision encoder combining SigLIP-L and
|
||
SAM-B, excels in real-world vision-language understanding by efficiently
|
||
processing high-resolution images and integrating extracted features
|
||
with a DeepSeek LLM backbone through a two-layer hybrid MLP adapter.</p>
|
||
<a href="https://arxiv.org/abs/2403.05525"><img
|
||
src="https://img.shields.io/badge/arXiv-2401.16420-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/deepseek-ai/DeepSeek-VL"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/deepseek-ai/DeepSeek-VL-7B"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang
|
||
Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng,
|
||
Hanwei Xu, Zhenda Xie, Chong Ruan<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/7b7283d2-b2d5-4ab6-891a-18a9760ef7ca" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>DeepSeek-VL</strong>: Employs a hybrid vision encoder
|
||
architecture, fusing a <strong>SigLIP-L encoder</strong> for semantic
|
||
understanding with a <strong>SAM-B encoder</strong> for high-resolution
|
||
detail extraction. This allows for efficient processing of 1024x1024
|
||
images while capturing both global and fine-grained visual features.
|
||
<strong>A two-layer hybrid MLP adapter</strong> then integrates these
|
||
features with the DeepSeek LLM backbone. The model is pre-trained on a
|
||
diverse dataset encompassing web screenshots, PDFs, OCR, charts, and
|
||
knowledge-based content from sources like Common Crawl, Web Code,
|
||
E-books, and arXiv articles. This pretraining is further refined using a
|
||
curated instruction-tuning dataset based on real user scenarios and
|
||
categorized into a comprehensive taxonomy covering recognition,
|
||
conversion, analysis, reasoning, evaluation, and safety tasks. By
|
||
combining this diverse data with its unique architecture and fusion
|
||
strategies, DeepSeek-VL aims to deliver robust performance across a wide
|
||
range of real-world vision-language applications.
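<p><em>Illustrative code.</em> A minimal PyTorch sketch of the hybrid-encoder fusion: features from the semantic (SigLIP-L) branch and the high-resolution (SAM-B) branch are concatenated and passed through a two-layer MLP adapter into the LLM embedding space. Dimensions are assumptions, and the sketch assumes both streams were already resampled to the same number of tokens.</p>
<pre><code>import torch
import torch.nn as nn

class HybridMLPAdapter(nn.Module):
    """Fuse low-res semantic features with high-res detail features, then project."""
    def __init__(self, siglip_dim=1024, sam_dim=256, llm_dim=4096):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(siglip_dim + sam_dim, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim))

    def forward(self, siglip_feats, sam_feats):   # both: (batch, n_tokens, dim)
        fused = torch.cat([siglip_feats, sam_feats], dim=-1)
        return self.adapter(fused)                # (batch, n_tokens, llm_dim)
</code></pre>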
|
||
</details>
|
||
<h2
|
||
id="deepseek-vl2-mixture-of-experts-vision-language-models-for-advanced-multimodal-understanding"><strong>DeepSeek-VL2:
|
||
Mixture-of-Experts Vision-Language Models for Advanced Multimodal
|
||
Understanding</strong></h2>
|
||
<p>DeepSeek-VL2 is an advanced series of large Mixture-of-Experts (MoE)
|
||
Vision-Language Models that significantly improves upon its predecessor,
|
||
DeepSeek-VL, by incorporating a dynamic tiling vision encoding strategy
|
||
for high-resolution images and leveraging DeepSeekMoE models with
|
||
Multi-head Latent Attention for efficient inference. Trained on a large
vision-language dataset, it achieves top performance across a broad range of
multimodal tasks.</p>
|
||
<p><a href="https://arxiv.org/abs/2412.10302"><img
|
||
src="https://img.shields.io/badge/arXiv-2412.10302-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/deepseek-ai/DeepSeek-VL2"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/deepseek-ai/deepseek-vl2-small"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai,
|
||
et al.</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/6bf7a0ce-5fa1-46ae-9f24-cb75df607a19" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<p>DeepSeek-VL2 builds upon a LLaVA-style architecture. it consists of
|
||
three core modules: (1) a vision encoder, (2) a vision-language adaptor,
|
||
and (3) a Mixture-of-Experts language model. It introduces two major
|
||
enhancements: a dynamic tiling strategy and it uses DeepSeekMOE language
|
||
model that has Multi-head Latent Attention. The dynamic tiling strategy
|
||
addresses the limitations of fixed-resolution encoders by splitting
|
||
high-resolution images into tiles. It uses a single SigLIP-SO400M-384
|
||
vision encoder. A set of candidate resolutions CR = {(m· 384, η · 384) |
|
||
m∈ N, n ∈ N, 1 ≤ m, n,mn ≤ 9} is defined, representing different aspect
|
||
ratios. For an input image, the optimal resolution from CR that
|
||
minimizes padding is selected. The resized image is then divided into m₁
|
||
× n₁ local tiles of 384 × 384 pixels, plus one global thumbnail tile.
|
||
The SigLIP-SO400M-384 processes all (1 + m¡ × n₁) tiles, yielding 729
|
||
visual embeddings (27x27) of 1152 dimensions per tile. Dynamic tiling is
|
||
disabled for multiple (>2) images for efficiency. A 2x2 pixel shuffle
|
||
compresses each tile’s visual tokens to 14x14 (196 tokens). Special
|
||
tokens are added: 14 <tile_newline> tokens at the end of each row in the
|
||
global thumbnail (total 210 tokens); m₁ · 14 <tile_newline> tokens at
|
||
the end of the final column of the local tiles; and a <view_separator>
|
||
token between the global thumbnail and local tiles. The total visual
|
||
sequence length is 210 + 1 + m₁ · 14 × (nį · 14 + 1). This sequence is
|
||
projected into the LLM’s embedding space by a two-layer MLP. The
|
||
language model utilizes DeepSeekMoE, featuring Multi-head Latent
|
||
Attention (MLA) to compress the Key-Value (KV) cache, improving
|
||
inference speed and throughput. The MoE architecture further enhances
|
||
efficiency. A global bias term is used during MoE training for load
|
||
balancing. DeepSeek-VL2 comes in three variants (Tiny, Small, and Base)
|
||
with 1.0B, 2.8B, and 4.5B activated parameters, respectively. The
|
||
training data is constructed in three stages: (1) VL alignment, (2) VL
|
||
pretraining, and (3) supervised fine-tuning (SFT). The alignment stage
|
||
uses ShareGPT4V (1.2M samples). Pretraining data combines VL and
|
||
text-only data (70/30 ratio), including interleaved image-text data
|
||
(WIT, WikiHow, OBELICS, Wanjuan, and in-house data), image captioning
|
||
data (various open-source datasets with quality enhancements and
|
||
filtering), OCR data (LaTeX OCR, 12M RenderedText, and in-house data),
|
||
general VQA data, table/chart/document understanding data (PubTabNet,
|
||
FinTabNet, Docmatix), web-to-code and plot-to-Python data (Websight, and
|
||
Python plots), QA with visual prompts, visual grounding data and
|
||
grounded conversation data. SFT data includes enhanced general visual
|
||
question-answering data, cleaned OCR and document understanding data,
|
||
enhanced table and chart understanding data, improved
|
||
reasoning/logic/math data, textbook/academic questions, and expanded
|
||
web-to-code and plot-to-Python data, visual grounding data, grounded
|
||
conversation data. Text only datasets were used during SFT stage. The
|
||
training methodology involves a three-stage pipeline. Stage 1 trains the
|
||
vision encoder and vision-language adaptor MLP, keeping the language
|
||
model fixed, using image-text paired data. Stage 2 performs
|
||
vision-language pre-training with all parameters unlocked, using ~800B
|
||
image-text tokens. Stage 3 conducts supervised fine-tuning. Visual
|
||
understanding is emphasized, and the loss is computed only on text
|
||
tokens. Unlike previous work, the fixed-resolution vision encoder is
|
||
adapted during training.</p>
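<p><em>Illustrative code.</em> A sketch of the candidate-resolution selection described above: pick the grid (m, n) with at most nine tiles whose m·384 × n·384 canvas wastes the least relative padding once the image is resized to fit while keeping its aspect ratio. The padding criterion here is an illustrative heuristic rather than the exact rule from the paper.</p>
<pre><code>def pick_tile_grid(img_w, img_h, base=384, max_tiles=9):
    """Return the (m, n) local-tile grid used to resize and split the image."""
    best, best_waste = (1, 1), 1.0
    for m in range(1, max_tiles + 1):
        for n in range(1, max_tiles + 1):
            if m * n > max_tiles:
                continue
            cw, ch = m * base, n * base
            scale = min(cw / img_w, ch / img_h)          # fit the image inside the canvas
            waste = 1.0 - (img_w * scale) * (img_h * scale) / (cw * ch)
            if waste < best_waste:                       # keep the least-padded candidate
                best, best_waste = (m, n), waste
    return best
</code></pre>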
|
||
</details>
|
||
<h2
|
||
id="mantis-mastering-multi-image-understanding-through-interleaved-instruction-tuning"><strong>MANTIS:
|
||
Mastering Multi-Image Understanding Through Interleaved Instruction
|
||
Tuning</strong></h2>
|
||
<p>MANTIS is a family of open-source large multimodal models that
|
||
demonstrate state-of-the-art performance on multi-image visual language
|
||
tasks. By focusing on instruction tuning with a carefully curated
|
||
multi-image dataset, MANTIS achieves superior results using
|
||
significantly less data than models trained with massive web datasets.
|
||
This efficient approach opens new avenues for developing powerful
|
||
multi-image LMMs with limited resources.</p>
|
||
<a href="https://arxiv.org/abs/2405.01483"><img
|
||
src="https://img.shields.io/badge/arXiv-2405.01483-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/TIGER-AI-Lab/Mantis"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/TIGER-Lab/Mantis"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu
|
||
Chen<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/dd4bbdf4-5ab9-4e12-89bd-94c5beb2d114" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Mantis</strong>: a powerful and efficient multi-image Large
|
||
Multimodal Models (LMMs), demonstrating that massive pre-training on
|
||
noisy web data is not the only path towards achieving state-of-the-art
|
||
performance in complex visual-language tasks. Instead, MANTIS focuses on
|
||
instruction tuning using high-quality, academic-level data, achieving
|
||
remarkable results on various multi-image benchmarks while using
|
||
significantly less data than its counterparts. Central to MANTIS’s
|
||
success is the meticulously curated MANTIS-INSTRUCT dataset, a
|
||
collection of 721K multi-image instruction data carefully designed to
|
||
instill four crucial skills: co-reference, comparison, reasoning, and
|
||
temporal understanding. These skills equip MANTIS with a comprehensive
|
||
toolkit for tackling the challenges of multi-image understanding.
|
||
Co-reference enables the model to understand references like “second
|
||
image” in natural language and correctly identify the corresponding
|
||
image within the input. Comparison fosters the ability to analyze and
|
||
identify subtle differences and commonalities between multiple images, a
|
||
skill crucial for tasks like visual similarity assessment and difference
|
||
description. Reasoning empowers the model to go beyond simple
|
||
comparisons and make complex inferences by combining its world knowledge
|
||
with the information extracted from multiple images, allowing it to
|
||
solve intricate logical reasoning puzzles and answer challenging
|
||
multi-image questions. Finally, temporal understanding equips MANTIS
|
||
with the capability to process and understand image sequences, capturing
|
||
the dynamic aspects of videos, comics, and other temporal visual data.
|
||
MANTIS leverages a simple yet effective architecture based on existing
|
||
pre-trained LLMs like LLaMA-3 and vision transformer encoders from CLIP
|
||
or SigLIP. A multimodal projector, similar to the one used in LLaVA,
|
||
aligns the visual embeddings with the text embeddings, facilitating
|
||
their seamless integration within the LLM. This streamlined approach
|
||
avoids the complexity of previous architectures like Q-Former while
|
||
retaining high performance. Extensive evaluations on five multi-image
|
||
benchmarks, including NLVR2, QBench, BLINK, MVBench, and a newly curated
|
||
Mantis-Eval dataset, demonstrate MANTIS’s superior performance,
|
||
exceeding existing open-source LMMs and even matching the results of the
|
||
powerful GPT-4V. Notably, MANTIS surpasses Idefics2-8B, a model
|
||
pre-trained on 200x larger interleaved multi-image data, showcasing the
|
||
effectiveness of instruction tuning with high-quality academic-level
|
||
data. Furthermore, MANTIS retains strong single-image performance on par
|
||
with existing state-of-the-art models, demonstrating its versatility and
|
||
adaptability. MANTIS’s impressive results, combined with its efficient
|
||
training and open-source nature, offer a compelling alternative to
|
||
traditional pre-training-heavy approaches, opening new possibilities for
|
||
researchers and practitioners seeking to develop powerful and versatile
|
||
multi-image LMMs with minimal computational resources.
|
||
</details>
|
||
<h2
|
||
id="qwen-vl-a-versatile-vision-language-model-for-understanding-localization-text-reading-and-beyond"><strong>Qwen-VL:
|
||
A Versatile Vision-Language Model for Understanding, Localization, Text
|
||
Reading, and Beyond</strong></h2>
|
||
<p>Qwen-VL distinguishes itself by integrating a Vision Transformer with
|
||
a large language model through a novel vision-language adapter,
|
||
employing cross-attention mechanisms for precise alignment of visual and
|
||
linguistic data, achieving high performance in various vision-language
|
||
tasks.</p>
|
||
<a href="https://arxiv.org/abs/2308.12966"><img
|
||
src="https://img.shields.io/badge/arXiv-2308.12966-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/qwenlm/qwen-vl"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang,
|
||
Junyang Lin, Chang Zhou, Jingren Zhou
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/c9358aad-63e2-44d3-b3af-38e9d4f6aeaa" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Qwen-VL</strong>: Represents an advanced architecture in the
|
||
vision-language domain, constructed on a foundational large language
|
||
model with the integration of a Vision Transformer (ViT) for visual
|
||
encoding. This model stands out for its innovative approach to
|
||
processing and aligning visual and linguistic data, featuring a
|
||
<strong>vision-language adapter equipped with cross-attention
|
||
mechanisms</strong>. These mechanisms enable the efficient compression
|
||
and integration of image features into the language model, a critical
|
||
component for achieving precise alignment between visual inputs and
|
||
text. The architecture’s design focuses on optimizing the handling of
|
||
image features, employing a position-aware strategy to maintain spatial
|
||
relevance of visual data when merged with textual information. The
|
||
training methodology of Qwen-VL is meticulously structured into
|
||
<strong>three distinct phases</strong>, starting with an <strong>initial
|
||
pre-training</strong> on a diverse collection of weakly labeled
|
||
image-text pairs. This is followed by <strong>multi-task
|
||
pre-training</strong>, utilizing high-quality annotated datasets and
|
||
larger input resolutions to refine the model’s capabilities in various
|
||
tasks such as instruction following and dialogue. The final phase
|
||
involves <strong>supervised fine-tuning</strong>, further honing the
|
||
model’s performance across a spectrum of vision-language tasks. Special
|
||
tokens and bounding box inputs are utilized for differentiating between
|
||
image and text inputs and achieving fine-grained visual understanding,
|
||
respectively. Qwen-VL’s alignment techniques are innovative, employing a
|
||
cross-attention mechanism within its vision-language adapter to fuse
|
||
visual and textual features effectively. This approach ensures the
|
||
preservation of spatial information post feature compression through the
|
||
use of positional encodings. The model leverages an extensive suite of
|
||
datasets for training, including LAION-en, LAION-zh, and various others
|
||
for pre-training, alongside specialized datasets like GQA, VGQA, and
|
||
VQAv2 for multi-task pre-training. These datasets are instrumental in
|
||
supporting a broad array of vision-language tasks, emphasizing
|
||
multilingual capabilities, fine-grained visual understanding, and the
|
||
model’s proficiency in captioning, visual question answering, grounding,
|
||
and OCR tasks.
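<p><em>Illustrative code.</em> The fine-grained grounding mentioned above relies on special tokens and bounding-box coordinates; publicly released Qwen-VL checkpoints express boxes on a 0-999 integer grid wrapped in box/ref special tokens, which is the assumption behind this small helper. Treat the exact format as an assumption rather than something stated in this summary.</p>
<pre><code>def normalize_box(x1, y1, x2, y2, img_w, img_h, bins=1000):
    """Map pixel coordinates to the 0..bins-1 integer grid used for grounding tokens."""
    sx, sy = (bins - 1) / img_w, (bins - 1) / img_h
    return (round(x1 * sx), round(y1 * sy)), (round(x2 * sx), round(y2 * sy))

# normalize_box(100, 50, 400, 300, img_w=800, img_h=600) -> ((125, 83), (500, 500))
</code></pre>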
|
||
</details>
|
||
<h2
|
||
id="qwen2-vl-a-powerful-open-source-vision-language-model-for-image-and-video-understanding"><strong>Qwen2-VL:
|
||
A Powerful Open-Source Vision-Language Model for Image and Video
|
||
Understanding</strong></h2>
|
||
<p>Qwen2-VL is the latest iteration of the Qwen vision-language model
|
||
family, building upon the Qwen-VL architecture and introducing
|
||
significant enhancements for improved understanding of images and
|
||
videos. It excels in various tasks, including visual question answering,
|
||
dialogue, content creation, and even agent-based control of devices like
|
||
mobile phones and robots.</p>
|
||
<p><a href="https://github.com/QwenLM/Qwen2-VL"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan,
|
||
Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou,
|
||
Jingren</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/37c2fb7a-66e1-475f-86e4-f00b4ac1c879" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Qwen2-VL continues to leverage the core architecture of Qwen-VL,
|
||
combining a Vision Transformer (ViT) with approximately 600M parameters
|
||
and Qwen2 language models. This ViT is designed to handle both image and
|
||
video inputs seamlessly. The key architectural improvements in Qwen2-VL
|
||
include Naive Dynamic Resolution support and Multimodal Rotary Position
|
||
Embedding (M-ROPE). Naive Dynamic Resolution allows the model to handle
|
||
arbitrary image resolutions by mapping them into a dynamic number of
|
||
visual tokens. This ensures that the model input accurately reflects the
|
||
information content of the image, regardless of its original resolution.
|
||
This approach is more aligned with human visual perception, which adapts
|
||
to different image sizes and resolutions. M-ROPE enhances the model’s
|
||
ability to capture positional information in multimodal inputs. It
|
||
deconstructs the original rotary embedding into three parts,
|
||
representing temporal, height, and width information. This allows the
|
||
LLM to simultaneously process and integrate 1D textual, 2D visual
|
||
(image), and 3D video positional information, leading to a more
|
||
comprehensive understanding of the input sequence. These architectural
|
||
enhancements, combined with a robust training process, enable Qwen2-VL
|
||
to achieve state-of-the-art performance on various visual understanding
|
||
benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA. It can
|
||
also understand videos over 20 minutes long, enabling high-quality
|
||
video-based question answering, dialogue, and content creation.
|
||
Furthermore, Qwen2-VL’s capabilities in complex reasoning and
|
||
decision-making allow it to be integrated with devices like mobile
|
||
phones and robots for automatic operation based on visual input and text
|
||
instructions. The model also supports multilingual understanding of text
|
||
within images, including most European languages, Japanese, Korean,
|
||
Arabic, and Vietnamese, broadening its applicability to a global user
|
||
base.
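<p><em>Illustrative code.</em> A simplified sketch of the M-ROPE position ids described above: every token gets a (temporal, height, width) triple; text tokens advance all three axes together, while the patches of a visual block vary along the temporal, height, and width axes. The exact offsetting used by Qwen2-VL may differ; this only shows the decomposition.</p>
<pre><code>def mrope_positions(n_text, grid_t, grid_h, grid_w):
    """Position triples for a text prefix followed by a grid_t x grid_h x grid_w visual block."""
    pos = [(i, i, i) for i in range(n_text)]     # text: one shared 1D position per token
    base = n_text
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                pos.append((base + t, base + h, base + w))
    return pos

# mrope_positions(2, 1, 2, 2) ->
# [(0,0,0), (1,1,1), (2,2,2), (2,2,3), (2,3,2), (2,3,3)]
</code></pre>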
|
||
</details>
|
||
<h2
|
||
id="qwen2.5-vl-enhanced-vision-language-capabilities-in-the-qwen-series"><strong>Qwen2.5-VL:
|
||
Enhanced Vision-Language Capabilities in the Qwen Series</strong></h2>
|
||
<p>Qwen2.5-VL represents a significant advancement in the Qwen series of
|
||
vision-language models, offering improved image recognition, precise
|
||
object grounding, enhanced text recognition, document parsing, and video
|
||
comprehension, while also functioning as a visual agent capable of
|
||
computer and phone use.</p>
|
||
<p><a href="https://qwenlm.github.io/blog/qwen2.5-vl/"><img
|
||
src="https://img.shields.io/badge/Blog-Qwen%20Team%20Blog-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/QwenLM/Qwen2.5-VL"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Qwen Team</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/59f0878d-42c1-4013-af78-406b2f4763fe" width=600/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Qwen2.5-VL builds upon its predecessor, Qwen2-VL, with substantial
|
||
improvements in perception of temporal and spatial scales, as well as a
|
||
simplified network structure for increased efficiency.
|
||
<strong>World-wide Image Recognition:</strong> Expanded recognition
|
||
capabilities covering a vast array of categories, including landmarks,
|
||
objects, and even film/TV IPs. <strong>Precise Object
|
||
Grounding:</strong> Uses bounding boxes and point-based representations
|
||
for object localization, with standardized JSON output for coordinates
|
||
and attributes, enabling hierarchical positioning. <strong>Enhanced Text
|
||
Recognition (OCR):</strong> Improved multi-scenario, multi-language, and
|
||
multi-orientation text recognition and localization, with enhanced
|
||
information extraction for applications like document processing.
|
||
<strong>Powerful Document Parsing:</strong> Introduces “QwenVL HTML”
|
||
format, leveraging HTML for layout information extraction from
|
||
documents, magazines, research papers, web pages, and mobile
|
||
screenshots. <strong>Enhanced Video Comprehension:</strong> Supports
|
||
understanding of ultra-long videos (hourly scale) with dynamic frame
|
||
rate (FPS) training and absolute time encoding. Enables second-level
|
||
event localization and key point summarization. <strong>Visual Agent
|
||
Capabilities:</strong> Can function as a visual agent for computer and
|
||
phone use, capable of reasoning and dynamically directing tools. Capable
|
||
of tasks like booking flights. <strong>Time and Image Size
|
||
Perception:</strong> In the spatial dimension, the model adapts images of
varying sizes into a corresponding number of tokens and represents
coordinates directly with detection boxes. In the temporal dimension,
dynamic FPS training and absolute time encoding allow the model to
perceive the pace of time. <strong>Visual
Encoder:</strong> A native dynamic resolution ViT is trained from
|
||
scratch. Window Attention is used to minimize computational load. The
|
||
model comes in three sizes (3B, 7B, and 72B parameters), with both base
|
||
and instruct-tuned versions available. The 72B-Instruct model achieves
|
||
competitive performance on various benchmarks, excelling in document and
|
||
diagram understanding. Smaller models also demonstrate strong
|
||
performance, with the 7B-Instruct model outperforming GPT-4o-mini in
|
||
several tasks and the 3B model exceeding the performance of the previous
|
||
Qwen2-VL 7B model. The models are trained on 18 trillion tokens. Future
|
||
developments aim to further enhance problem-solving, reasoning, and
|
||
multi-modality integration.
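<p><em>Illustrative code.</em> A small sketch of the video-side preprocessing implied by dynamic frame-rate training with absolute time encoding: frames are sampled at a target FPS and each kept frame carries its absolute timestamp, which a temporal encoding can then consume for second-level event localization. Sampling parameters are illustrative.</p>
<pre><code>def sample_frames_with_timestamps(duration_s, video_fps, target_fps=2.0, max_frames=768):
    """Return (frame_index, timestamp_seconds) pairs sampled at roughly target_fps."""
    step = max(1, round(video_fps / target_fps))
    indices = list(range(0, int(duration_s * video_fps), step))[:max_frames]
    return [(idx, idx / video_fps) for idx in indices]

# sample_frames_with_timestamps(10.0, 30.0) samples every 15th frame: (0, 0.0), (15, 0.5), ...
</code></pre>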
|
||
</details>
|
||
<h2 id="moondream1-and-moondream2"><strong>moondream1 and
|
||
moondream2</strong></h2>
|
||
<p>moondream1 and moondream2 are vision-language models with moondream2
|
||
building upon moondream1’s SigLIP vision encoder and Phi-1.5 language
|
||
backbone by incorporating an MLP projector for enhanced visual and
|
||
textual representation alignment.</p>
|
||
<a href="https://github.com/vikhyat/moondream"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/vikhyatk/moondream2"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
<span class="citation" data-cites="vikhyatk">@vikhyatk</span>
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/awesome-vlm-architectures/assets/88277926/e979d327-3423-4a91-92f2-02a3dc3189a8" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>moondream1 and moondream2</strong>: A series of vision-language
|
||
models. moondream1 is a 1.6B parameter model that leverages
|
||
<strong>SigLIP</strong> as the vision encoder and
|
||
<strong>Phi-1.5</strong> as the language backbone, trained on the LLaVA
|
||
dataset. moondream2 expands upon this foundation, utilizing a 1.86B
|
||
parameter model initialized with weights from SigLIP and Phi-1.5. It
|
||
incorporates <strong>an MLP projector</strong> to bridge the visual and
|
||
textual representations, potentially leading to enhanced vision-language
|
||
alignment and improved performance across various tasks.
|
||
</details>
|
||
<h2
|
||
id="moondream-next-compact-vision-language-model-with-enhanced-capabilities"><strong>Moondream-next:
|
||
Compact Vision-Language Model with Enhanced Capabilities</strong></h2>
|
||
<p>Moondream is a compact (1.9B parameters) vision-language model (VLM)
|
||
that prioritizes practical usability and accessibility, offering
|
||
features like structured output (JSON, XML, Markdown, CSV), improved
|
||
OCR, and a novel experimental Gaze Detection capability, while
|
||
maintaining fast performance and ease of deployment.</p>
|
||
<p><a href="https://moondream.ai/"><img
|
||
src="https://img.shields.io/badge/Blog-Moondream%20Blog-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/vikhyat/moondream"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/vikhyatk/moondream-next"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a></p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Moondream distinguishes itself by being exceptionally small (1.9B
|
||
parameters) while supporting a wide range of functionalities typically
|
||
found in larger, more specialized models. The architecture is not
|
||
explicitly detailed in the provided text, but it mentions improvements
|
||
to the “vision layer” for better OCR performance. This suggests a
|
||
structure where visual input is processed by a vision encoder, and then
|
||
integrated with a language model. The key feature is its ability to
|
||
perform multiple Vision AI tasks (“capabilities”) within a single,
|
||
unified model, including: object detection, captioning, visual querying,
|
||
pointing (x,y coordinate retrieval), and the newly added gaze detection.
|
||
The model also newly supports structured output formats, generating
|
||
outputs directly as JSON, XML, Markdown, or CSV, making integration with
|
||
applications much easier. The “Gaze Detection” capability is
|
||
specifically highlighted as a novel and experimental feature, indicating
|
||
a focus on real-world applications beyond standard benchmarks. The
|
||
training data and process are not thoroughly described, although the
|
||
text notes increased training on “document querying and understanding”
|
||
for OCR enhancement. The model’s creators express a cautious approach to
|
||
benchmarks, acknowledging their limitations and potential for
|
||
manipulation, yet also highlight improved benchmark scores in this
|
||
release, suggesting a balance between practical utility and measurable
|
||
performance. It does not rely on external APIs.
|
||
</details>
|
||
<h2
|
||
id="sphinx-x-scaling-data-and-parameters-for-a-family-of-multi-modal-large-language-models"><strong>SPHINX-X:
|
||
Scaling Data and Parameters for a Family of Multi-modal Large Language
|
||
Models</strong></h2>
|
||
<p>SPHINX-X refines multi-modal large language models by streamlining
|
||
its architecture to use two visual encoders, CLIP-ConvNeXt and DINOv2,
|
||
and implementing an efficient single-stage training process for enhanced
|
||
performance across diverse multi-modal tasks.</p>
|
||
<a href="https://arxiv.org/abs/2402.05935"><img
|
||
src="https://img.shields.io/badge/arXiv-2402.05935-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/alpha-vllm/llama2-accessory"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/Alpha-VLLM/SPHINX"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Model" /></a><br />
|
||
Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng
|
||
Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi
|
||
Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu
|
||
Qiao
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/1c4e9a86-9a21-4911-bcb6-d2a79c181510" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>SPHINX-X</strong>: Represents an advanced iteration in the
|
||
development of Multi-modal Large Language Models (MLLM), building upon
|
||
its predecessor, SPHINX, by optimizing both architecture and training
|
||
efficiency. The core modifications introduced in SPHINX-X include the
|
||
elimination of redundant visual encoders, the incorporation of
|
||
<strong>learnable skip tokens</strong> to bypass <strong>fully-padded
|
||
sub-images</strong>, and the simplification of the multi-stage training
|
||
process into a singular, <strong>all-in-one training</strong> paradigm.
|
||
This approach is designed to enhance the model’s efficiency and
|
||
effectiveness across a broad spectrum of multi-modal tasks. The
|
||
architecture of SPHINX-X retains two key visual encoders,
|
||
<strong>CLIP-ConvNeXt and DINOv2</strong>, ensuring robust text-image
|
||
alignment capabilities, especially for high-resolution images and varied
|
||
aspect ratios. This streamlined model architecture enables a unified
|
||
encoding approach for both vision and text, emphasizing scalable and
|
||
efficient training methodologies. The training strategy is
|
||
comprehensive, directly engaging all model parameters across a
|
||
wide-ranging multi-modal dataset, which encompasses public resources
|
||
covering language, vision, and vision-language tasks. Additionally,
|
||
SPHINX-X enriches this dataset with specially curated OCR-intensive and
|
||
Set-of-Mark datasets to further extend the model’s versatility and
|
||
generalization capabilities. The datasets utilized in SPHINX-X aim to
|
||
foster a deep, comprehensive understanding across multiple domains,
|
||
enhancing the model’s performance in OCR, document layout detection, and
|
||
fine-grained multi-modal understanding. By training over various base
|
||
Large Language Models (LLMs) with different parameter sizes and
|
||
multilingual capabilities, SPHINX-X achieves a spectrum of MLLMs that
|
||
showcase a strong correlation between multi-modal performance and the
|
||
scales of data and parameters involved. This strategy allows SPHINX-X to
|
||
set a new benchmark in multi-modal large language model performance,
|
||
significantly advancing the field’s capabilities in handling complex,
|
||
multi-domain tasks.
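<p><em>Illustrative code.</em> A sketch of the learnable skip-token idea: sub-images that are entirely padding are not run through the visual encoders; a single learned skip embedding stands in for each of them, shortening the visual sequence. The zero-padding test and module shapes are assumptions made for illustration.</p>
<pre><code>import torch
import torch.nn as nn

class SkipPaddedSubImages(nn.Module):
    """Replace fully-padded sub-images with one learnable skip token each."""
    def __init__(self, dim=1024):
        super().__init__()
        self.skip_token = nn.Parameter(torch.zeros(dim))

    def forward(self, sub_images, encoder):
        # sub_images: list of (C, H, W) tensors; padded regions assumed to be all zeros.
        out = []
        for img in sub_images:
            if torch.count_nonzero(img) == 0:             # fully padded: skip the encoder
                out.append(self.skip_token.unsqueeze(0))   # (1, dim)
            else:
                feats = encoder(img.unsqueeze(0))
                out.append(feats.reshape(-1, feats.shape[-1]))  # (n_patches, dim)
        return torch.cat(out, dim=0)
</code></pre>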
|
||
</details>
|
||
<h2 id="blip-bootstrapping-language-image-pre-training"><strong>BLIP:
|
||
Bootstrapping Language-Image Pre-training</strong></h2>
|
||
<p>BLIP introduces a versatile Multimodal Mixture of Encoder-Decoder
|
||
(MED) architecture, integrating a visual transformer and a BERT-based
|
||
text encoder with cross-attention layers, enabling unified
|
||
vision-language understanding and generation across a wide range of
|
||
tasks.</p>
|
||
<a href="https://arxiv.org/abs/2201.12086"><img
|
||
src="https://img.shields.io/badge/arXiv-2201.12086-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/salesforce/BLIP"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/27db1037-2b48-4097-9891-019ba77fc536" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>BLIP</strong>: Introduces an innovative approach to unified
|
||
vision-language understanding and generation through its Multimodal
|
||
Mixture of Encoder-Decoder (MED) architecture. This architecture is
|
||
designed to be highly versatile, capable of serving as a unimodal
|
||
encoder, an image-grounded text encoder, or an image-grounded text
|
||
decoder. This flexibility allows BLIP to adeptly handle a wide array of
|
||
vision-language tasks, showcasing its adaptability across various
|
||
applications. The MED architecture incorporates a Visual Transformer to
|
||
encode images, a BERT-based text encoder for processing textual
|
||
information, additional <strong>cross-attention layers</strong> to
|
||
facilitate image-text interaction, and <strong>causal self-attention
|
||
layers</strong> for generating text based on image inputs. These
|
||
components enable BLIP to support three key functionalities: encoding of
|
||
either modality on its own, encoding of text grounded in images, and
|
||
decoding of text from images, thus covering a comprehensive range of
|
||
tasks from understanding to generation. BLIP’s training methodology is
|
||
grounded in the joint optimization of three pre-training objectives:
|
||
Image-Text Contrastive Learning (ITC), Image-Text Matching (ITM), and
|
||
Image-Conditioned Language Modeling (LM). These objectives are designed
|
||
to align visual and textual features, learn fine-grained image-text
|
||
alignment, and enable text generation from images, respectively. The
|
||
model utilizes a mix of human-annotated and web-collected noisy
|
||
image-text pairs for training, balancing the precision of manually
|
||
annotated data with the scale and diversity of data collected from the
|
||
web. This approach ensures robustness and scalability in BLIP’s
|
||
performance across vision-language tasks. For alignment and fusion of
|
||
multimodal information, BLIP employs ITC and ITM losses to achieve
|
||
precise text-image alignment, utilizing a multimodal representation that
|
||
accurately captures the nuanced relationship between visual and textual
|
||
data. The architecture’s cross-attention layers play a crucial role in
|
||
incorporating visual information into the text encoder for
|
||
image-grounded text encoding. Simultaneously, modifications to the
|
||
self-attention layers in the decoder facilitate text generation,
|
||
effectively merging vision and text for unified processing. BLIP’s
|
||
pre-training leverages a diverse set of datasets, including COCO, Visual
|
||
Genome, Conceptual Captions, Conceptual 12M, SBU Captions, and LAION.
|
||
These datasets are instrumental in learning a broad spectrum of
|
||
vision-language tasks, with high-quality human-annotated pairs and
|
||
extensive web datasets providing the necessary depth and breadth for
|
||
comprehensive pre-training.
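<p>As a rough illustration of how the three objectives above fit together, the sketch below combines ITC, ITM and LM terms into one pre-training loss. The tensor names, the temperature value and the equal weighting are illustrative assumptions, not the released BLIP implementation.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def blip_pretraining_loss(image_cls, text_cls, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    """Hypothetical combination of BLIP's three objectives: ITC + ITM + LM."""
    # ITC: symmetric InfoNCE over L2-normalised [CLS] features of each modality.
    image_cls = F.normalize(image_cls, dim=-1)
    text_cls = F.normalize(text_cls, dim=-1)
    sim = image_cls @ text_cls.t() / temperature                 # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = 0.5 * (F.cross_entropy(sim, targets) +
                      F.cross_entropy(sim.t(), targets))
    # ITM: binary match / no-match head on the image-grounded text embedding.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)
    # LM: causal next-token prediction conditioned on the image.
    loss_lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(),
                              ignore_index=-100)
    return loss_itc + loss_itm + loss_lm
</code></pre>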
|
||
</details>
|
||
<h2
|
||
id="blip-2-bootstrapping-language-image-pre-training-with-frozen-image-encoders-and-large-language-models"><strong>BLIP-2:
|
||
Bootstrapping Language-Image Pre-training with Frozen Image Encoders and
|
||
Large Language Models</strong></h2>
|
||
<p>BLIP-2 leverages the power of frozen pre-trained image encoders and
|
||
large language models, connecting them through a lightweight Querying
|
||
Transformer (Q-Former) to efficiently extract and integrate visual
|
||
features for enhanced vision-language understanding and generation.</p>
|
||
<a href="https://arxiv.org/abs/2301.12597"><img
|
||
src="https://img.shields.io/badge/arXiv-2301.12597-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/salesforce/LAVIS/tree/main/projects/blip2"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/Salesforce/BLIP2"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/604460f9-478c-4cc1-ba35-287447c04b26" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>BLIP-2</strong>: The model architecture integrates frozen
|
||
pre-trained image encoders and large language models (LLMs), employing a
|
||
lightweight <strong>Querying Transformer (Q-Former)</strong> to
|
||
facilitate the interaction between these modalities. The Q-Former plays
|
||
a crucial role in extracting and integrating visual features relevant to
|
||
textual queries, allowing for a more nuanced understanding and
|
||
generation of language based on visual inputs. The training methodology
|
||
of BLIP-2 is structured around a two-stage pre-training strategy.
|
||
Initially, it focuses on learning vision-language representations
|
||
utilizing the frozen image encoders. Subsequently, it advances to
|
||
vision-to-language generative learning, leveraging the capabilities of
|
||
frozen LLMs. This strategy, coupled with the use of <strong>learnable
|
||
query vectors within the Q-Former</strong>, enables effective
|
||
vision-language alignment. The alignment process is further enhanced
|
||
through fusion methods that extract language-informative visual
|
||
representations, which are then synthesized with the outputs of LLMs to
|
||
generate pertinent textual descriptions. A diverse array of datasets
|
||
including COCO, Visual Genome, CC3M, CC12M, SBU, and LAION400M underpins
|
||
the comprehensive pre-training regime of BLIP-2. These datasets provide
|
||
a rich variety of image-text pairs, essential for training the model
|
||
across a broad spectrum of visual representations and language
|
||
generation tasks. The model’s architecture and training approaches are
|
||
designed to address the prohibitive costs associated with
|
||
vision-and-language pre-training, offering a more efficient pathway to
|
||
developing multimodal understanding and generation capabilities.
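<p>The snippet below sketches only the bridging idea: a small set of learnable queries cross-attends to frozen image features, and the result is projected into the LLM’s embedding space as soft prompts. The single attention layer, the layer sizes and the head count are simplifications, not the actual Q-Former configuration.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class QFormerBridgeSketch(nn.Module):
    """Illustrative bridge: learnable queries cross-attend to frozen image
    features, then get projected into the frozen LLM's embedding space."""
    def __init__(self, vis_dim=1408, q_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, q_dim) * 0.02)
        self.kv_proj = nn.Linear(vis_dim, q_dim)
        self.cross_attn = nn.MultiheadAttention(q_dim, num_heads=12,
                                                batch_first=True)
        self.norm = nn.LayerNorm(q_dim)
        self.to_llm = nn.Linear(q_dim, llm_dim)   # soft prompts for the LLM

    def forward(self, frozen_image_feats):        # (B, num_patches, vis_dim)
        kv = self.kv_proj(frozen_image_feats)
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)       # queries gather visual info
        return self.to_llm(self.norm(out))        # (B, num_queries, llm_dim)
</code></pre>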
|
||
</details>
|
||
<h2
|
||
id="xgen-mm-blip-3-an-open-source-framework-for-building-powerful-and-responsible-large-multimodal-models"><strong>xGen-MM
|
||
(BLIP-3): An Open-Source Framework for Building Powerful and Responsible
|
||
Large Multimodal Models</strong></h2>
|
||
<p>xGen-MM (BLIP-3) is a comprehensive framework developed by Salesforce
|
||
for training a series of open-source large multimodal models (LMMs)
|
||
designed to excel in a variety of visual language tasks. It provides
|
||
meticulously curated datasets, a streamlined training recipe, model
|
||
architectures, and a suite of open LMMs capable of performing various
|
||
visual language tasks. xGen-MM focuses on scalability, using a
|
||
simplified architecture and a unified training objective to enable
|
||
training on larger, more diverse datasets. The framework also includes a
|
||
safety-tuned model to mitigate harmful behaviors and promote responsible
|
||
AI development.</p>
|
||
<p><a href="https://arxiv.org/abs/2408.08872"><img
|
||
src="https://img.shields.io/badge/arXiv-2408.08872-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/collections/Salesforce/xgen-mm-1-models-and-datasets-662971d6cecbf3a7f80ecc2e"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil
|
||
Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo,
|
||
Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning
|
||
Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang,
|
||
Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos
|
||
Niebles, Caiming Xiong, Ran Xu</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/e6e166c8-871e-420c-bbf1-b64c3c22e06a" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
xGen-MM (BLIP-3), short for xGen-MultiModal, addresses limitations of
|
||
previous open-source efforts by providing a complete ecosystem for LMM
|
||
development. Central to its approach is the utilization of diverse,
|
||
large-scale, and high-quality multimodal data, which enables xGen-MM to
|
||
achieve competitive performance against both open-source and proprietary
|
||
LMMs. Instead of relying on the intricate Q-Former architecture and
|
||
multiple training objectives used in its predecessor, BLIP-2, xGen-MM
|
||
streamlines the process by employing a more scalable vision token
|
||
sampler (perceiver resampler) and unifying the training objective to a
|
||
single auto-regressive loss on text tokens. This simplification enables
|
||
larger-scale training and focuses the model on effectively learning from
|
||
the rich multimodal context. Furthermore, xGen-MM incorporates safety
|
||
measures, introducing a safety-tuned model with DPO to mitigate
|
||
potential harmful behaviors like hallucinations and promote responsible
|
||
AI development. By open-sourcing its models, datasets, and fine-tuning
|
||
code, xGen-MM aims to empower the research community and foster
|
||
advancements in the field of LMMs, making these powerful tools more
|
||
accessible and encouraging further exploration of their capabilities.
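<p>A minimal sketch of the unified objective described above: a single next-token cross-entropy computed only on text positions, with the resampled visual-token positions masked out. The function and tensor names are illustrative, not the released training code.</p>
<pre><code class="language-python">import torch.nn.functional as F

def text_only_autoregressive_loss(logits, labels, is_text_token):
    """Single next-token objective: cross-entropy on text positions only.
    logits: (B, T, V); labels: (B, T) token ids; is_text_token: (B, T) bool."""
    shifted_logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()
    keep = is_text_token[:, 1:]
    targets[~keep] = -100                      # ignore visual-token positions
    return F.cross_entropy(shifted_logits.flatten(0, 1), targets.flatten(),
                           ignore_index=-100)
</code></pre>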
|
||
</details>
|
||
<h2
|
||
id="instructblip-towards-general-purpose-vision-language-models-with-instruction-tuning"><strong>InstructBLIP:
|
||
Towards General-purpose Vision-Language Models with Instruction
|
||
Tuning</strong></h2>
|
||
<p>InstructBLIP enhances the BLIP-2 framework by introducing instruction
|
||
tuning to its Query Transformer (Q-Former), enabling the model to
|
||
extract instruction-aware visual features and achieve state-of-the-art
|
||
zero-shot performance across diverse vision-language tasks.</p>
|
||
<a href="https://arxiv.org/abs/2305.06500v2"><img
|
||
src="https://img.shields.io/badge/arXiv-2305.06500v2-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/salesforce/LAVIS/tree/main/projects/instructblip"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/hysts/InstructBLIP"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao,
|
||
Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/5839e3a6-6fb8-469c-b84e-d60a851c1642" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>InstructBLIP</strong>: represents an advanced step in the
|
||
development of vision-language models through instruction tuning,
|
||
building on the capabilities of the pre-trained BLIP-2 models. It
|
||
integrates an image encoder, a large language model (LLM), and <strong>a
|
||
Query Transformer (Q-Former)</strong>, which is specifically fine-tuned
|
||
to bridge the visual and linguistic components while keeping the image
|
||
encoder and LLM static. This architecture enables the extraction of
|
||
instruction-aware visual features, enhancing the model’s responsiveness
|
||
to varied instructional contexts. Training InstructBLIP involves a
|
||
careful selection of 26 datasets across 11 task categories, transformed
|
||
into an instruction tuning format to foster the model’s adaptability
|
||
across a broad spectrum of vision-language tasks. The model employs a
|
||
balanced sampling strategy and standard language modeling loss,
|
||
augmented with OCR tokens for datasets involving scene texts, to
|
||
fine-tune its instruction following capabilities. The unique approach of
|
||
instruction-aware visual feature extraction through the Q-Former allows
|
||
the model to tailor feature extraction to the specific requirements of
|
||
the instruction, significantly improving performance across both seen
|
||
and unseen tasks. Implementation details reveal the flexibility of
|
||
InstructBLIP’s architecture, which is easily adaptable to incorporate
|
||
various LLMs, thanks to the modular design of the BLIP-2 framework. The
|
||
model showcases state-of-the-art zero-shot performance across a wide
|
||
range of vision-language tasks, outperforming previous models like
|
||
BLIP-2 and Flamingo in zero-shot evaluations and achieving notable
|
||
results when fine-tuned on specific downstream tasks. InstructBLIP’s
|
||
open-source availability and its performance across different benchmarks
|
||
highlight its potential as a general-purpose vision-language model.
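<p>The sketch below isolates the instruction-aware part: instruction token embeddings join the learnable queries in self-attention before the queries cross-attend to the frozen image features. It assumes image features and instruction embeddings are already projected to a common width; depths, widths and head counts are placeholders.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class InstructionAwareQueries(nn.Module):
    """Queries and instruction tokens mix in self-attention, then only the
    queries cross-attend to the (frozen) image features. Image features and
    instruction embeddings are assumed to be pre-projected to `dim`."""
    def __init__(self, dim=768, num_queries=32, heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_embeds):
        q = self.queries.expand(image_feats.size(0), -1, -1)
        joint = torch.cat([q, instruction_embeds], dim=1)   # instruction-aware
        joint, _ = self.self_attn(joint, joint, joint)
        q = joint[:, :self.num_queries]
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out        # visual features tailored to the given instruction
</code></pre>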
|
||
</details>
|
||
<h2
|
||
id="kosmos-1-language-is-not-all-you-need-aligning-perception-with-language-models"><strong>KOSMOS-1:
|
||
Language Is Not All You Need: Aligning Perception with Language
|
||
Models</strong></h2>
|
||
<p>KOSMOS-1, a multimodal large language model, leverages a
|
||
Transformer-based architecture enhanced with MAGNETO and XPOS to
|
||
seamlessly process text and various modalities, aligning perception with
|
||
language models through training on diverse web-scale multimodal corpora
|
||
for enhanced zero-shot and few-shot learning capabilities.</p>
|
||
<a href="https://arxiv.org/abs/2302.14045"><img
|
||
src="https://img.shields.io/badge/arXiv-2302.14045-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/microsoft/unilm"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming
|
||
Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu,
|
||
Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit
|
||
Som, Xia Song, Furu Wei
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/33fd99a9-e89a-4905-8917-f03452fd5e6a" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>KOSMOS-1</strong>: A transformative multimodal large language
|
||
model, meticulously designed to harmonize the perception of general
|
||
modalities with linguistic models, facilitating zero-shot learning,
|
||
few-shot learning, and auto-regressive output generation. At its core,
|
||
KOSMOS-1 employs a Transformer-based causal language model architecture,
|
||
adept at processing both textual and various other modalities. This
|
||
innovative approach is bolstered by key architectural components,
|
||
including a Transformer-based decoder for input sequence handling,
|
||
embedding modules for vector encoding of text and modalities, and the
|
||
integration of <strong>MAGNETO and XPOS</strong> for architectural
|
||
enhancements. These elements collectively enable the model to adeptly
|
||
navigate and process multimodal information. The training regimen of
|
||
KOSMOS-1 is distinguished by its comprehensive utilization of web-scale
|
||
multimodal corpora, which encompasses monomodal data, cross-modal paired
|
||
data, and interleaved multimodal data, emphasizing the next-token
|
||
prediction tasks to optimize the log-likelihood of tokens. This
|
||
methodology ensures a robust foundation for the model, enhancing its
|
||
ability to understand and generate content across various modalities.
|
||
Furthermore, the alignment techniques employed are particularly
|
||
noteworthy; by leveraging interleaved image-text data, KOSMOS-1 aligns
|
||
the perceptual capabilities of general modalities with language models
|
||
in an unprecedented manner, thereby enriching the model’s understanding
|
||
and interpretative capacities. KOSMOS-1’s training datasets, including
|
||
The Pile, Common Crawl, English LAION-2B, LAION-400M, COYO-700M, and
|
||
Conceptual Captions, are meticulously selected to serve dual purposes:
|
||
fostering representation learning and language tasks through text
|
||
corpora, and aligning perception with language models via image-caption
|
||
pairs and interleaved data. This strategic selection of datasets not
|
||
only bolsters the model’s linguistic competencies but also significantly
|
||
enhances its few-shot abilities, marking a significant milestone in the
|
||
integration of perception and language models.
|
||
</details>
|
||
<h3
|
||
id="kosmos-2-grounding-multimodal-large-language-models-to-the-world"><strong>KOSMOS-2:
|
||
Grounding Multimodal Large Language Models to the World</strong></h3>
|
||
<p>KOSMOS-2, extending the KOSMOS-1 architecture, incorporates grounded
|
||
image-text pairs using discrete location tokens linked to text spans,
|
||
effectively anchoring text to specific image regions, thereby enhancing
|
||
multimodal understanding and reference accuracy.</p>
|
||
<a href="https://arxiv.org/abs/2306.14824"><img
|
||
src="https://img.shields.io/badge/arXiv-2306.14824-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/microsoft/unilm/tree/master/kosmos-2"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/spaces/ydshieh/Kosmos-2"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Gradio" /></a><br />
|
||
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming
|
||
Ma, Furu Wei
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/17420c9c-759d-4690-bfc8-e8d7792111e7" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>KOSMOS-2</strong>: Built upon the foundational architecture of
|
||
KOSMOS-1, it retains the Transformer-based causal language model
|
||
architecture and training objectives, while introducing a significant
|
||
innovation by incorporating grounded image-text pairs into its training
|
||
regimen. This addition seeks to bridge the gap between visual and
|
||
textual information, enabling a more cohesive understanding of
|
||
multimodal content. The model differentiates itself by training on a
|
||
web-scale dataset of grounded image-text pairs, known as GRIT, which
|
||
includes continuous coordinates of bounding boxes translated into
|
||
discrete location tokens. These tokens are intricately linked with text
|
||
spans, creating a unified input representation that seamlessly
|
||
integrates visual and textual elements. The training of KOSMOS-2 is
|
||
extensive and multifaceted, utilizing grounded image-text pairs,
|
||
monomodal text corpora, image-caption pairs, and interleaved image-text
|
||
data to foster a robust learning environment. The model’s training
|
||
leverages a large batch size and employs the AdamW optimizer, running on
|
||
256 V100 GPUs. This process is augmented by instruction tuning with both
|
||
vision-language and language-only instruction datasets, aiming to refine
|
||
the model’s understanding and processing capabilities across different
|
||
modalities. The grounding technique is a pivotal aspect of KOSMOS-2,
|
||
where <strong>continuous coordinates of bounding boxes</strong> are
|
||
converted into <strong>discrete location tokens</strong>. These tokens
|
||
are then linked with corresponding text spans, anchoring the textual
|
||
output to specific visual inputs, enhancing the model’s ability to refer
|
||
to and describe particular image regions or objects with precision.
|
||
KOSMOS-2’s alignment techniques and fusion methods play a critical role
|
||
in its ability to understand and refer to specific parts of an image
|
||
directly, employing a unified input representation that combines image
|
||
embeddings with grounded text and location tokens. This approach not
|
||
only improves the model’s referential accuracy but also its overall
|
||
multimodal comprehension. The model is trained using a variety of
|
||
datasets, including the specially created GRIT dataset for grounding
|
||
capabilities, along with monomodal text corpora, image-caption pairs,
|
||
and interleaved image-text data to bolster its language understanding,
|
||
multimodal perception, and in-context learning abilities. Through these
|
||
innovations, KOSMOS-2 represents a significant advancement in grounding
|
||
multimodal large language models, offering enhanced capabilities in
|
||
linking textual and visual information cohesively.
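<p>A small, self-contained sketch of the quantisation step described above: a continuous bounding box becomes two discrete location tokens (top-left and bottom-right bins) on a fixed grid, which can then be attached to a text span. The grid size and the plain-string token names are illustrative; the released model defines its own special-token vocabulary.</p>
<pre><code class="language-python">def box_to_location_tokens(box, image_w, image_h, grid_size=32):
    """Quantise a continuous box into two discrete location tokens
    (top-left and bottom-right bins) on a grid_size x grid_size grid."""
    x0, y0, x1, y1 = box

    def bin_index(x, y):
        col = min(int(x / image_w * grid_size), grid_size - 1)
        row = min(int(y / image_h * grid_size), grid_size - 1)
        return row * grid_size + col

    return "loc_%d" % bin_index(x0, y0), "loc_%d" % bin_index(x1, y1)

# Example: link the text span "a snowman" in a 640x480 image to its region.
top_left, bottom_right = box_to_location_tokens((64, 80, 320, 400), 640, 480)
print("a snowman", top_left, bottom_right)    # a snowman loc_163 loc_848
</code></pre>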
|
||
</details>
|
||
<h2
|
||
id="convllava-hierarchical-backbones-as-visual-encoder-for-large-multimodal-models"><strong>ConvLLaVA:
|
||
Hierarchical Backbones as Visual Encoder for Large Multimodal
|
||
Models</strong></h2>
|
||
<p>ConvLLaVA addresses the limitations of Vision Transformers (ViTs) in
|
||
high-resolution Large Multimodal Models (LMMs) by replacing them with a
|
||
hierarchical backbone, ConvNeXt, as the visual encoder. This
|
||
architectural shift aims to reduce the computational burden caused by
|
||
excessive visual tokens and quadratic complexity often associated with
|
||
ViTs, especially when dealing with high-resolution images.</p>
|
||
<p><a href="https://arxiv.org/abs/2405.15738"><img
|
||
src="https://img.shields.io/badge/arXiv-2405.15738-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/alibaba/conv-llava"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/papers/2405.15738"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song,
|
||
Shiji Song, Gao Huang, Bo Zheng</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/ad7e129a-f958-4b30-8327-7df509994bea" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
ConvLLaVA leverages the inherent information compression capabilities of
|
||
ConvNeXt, a hierarchical convolutional neural network. ConvLLaVA, unlike
|
||
traditional LMMs that rely on ViTs, employs a <strong>five-stage
|
||
ConvNeXt architecture</strong> as its visual encoder. This encoder
|
||
progressively compresses visual information across its stages,
|
||
significantly reducing the number of visual tokens generated compared to
|
||
ViT. The architecture mirrors other popular general LMMs like LLaVA,
|
||
Qwen-VL, and VILA, consisting of a vision encoder (ConvNeXt), a large
|
||
language model (LLM - Vicuna in this case), and a vision-language
|
||
projector (a two-layer MLP). The ConvNeXt encoder processes the input
|
||
image and generates latent visual embeddings. These embeddings are then
|
||
projected into the embedding space of the LLM by the vision-language
|
||
projector. Finally, the projected visual embeddings are concatenated
|
||
with the text embeddings generated by the LLM’s tokenizer, and this
|
||
combined input is fed into the LLM. The entire model is trained using a
|
||
language modeling loss. To further enhance ConvLLaVA’s performance, the
|
||
authors introduce two key optimizations: firstly, they update the
|
||
pretrained ConvNeXt weights instead of freezing them, allowing the model
|
||
to adapt to high-resolution inputs and improve the quality of visual
|
||
representations. Secondly, they introduce an additional ConvNeXt stage,
|
||
effectively creating a five-stage architecture (ConvNeXt†) that further
|
||
compresses visual information, enabling the model to handle even higher
|
||
resolutions (up to 1536x1536) while generating a manageable number of
|
||
visual tokens (576). This hierarchical compression approach, combined
|
||
with the linear spatial complexity of ConvNeXt, significantly reduces
|
||
the computational burden on the LLM compared to ViT-based models, making
|
||
ConvLLaVA a more efficient and scalable solution for high-resolution
|
||
multimodal tasks.
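<p>A back-of-the-envelope check of the token counts implied above, assuming square inputs and an overall 64x per-side reduction for the five-stage encoder: a 1536x1536 input yields 24x24 = 576 visual tokens, whereas a patch-14 ViT-style encoder at the same resolution would emit roughly twenty times as many.</p>
<pre><code class="language-python">def visual_token_count(resolution, per_side_downsample):
    """Tokens emitted by an encoder that shrinks each spatial side by
    `per_side_downsample` (square inputs assumed)."""
    side = resolution // per_side_downsample
    return side * side

# Five-stage ConvNeXt as described above: 64x reduction per side.
print(visual_token_count(1536, 64))   # 576 tokens, matching the text
# A patch-14 ViT-style encoder at the same resolution emits far more.
print(visual_token_count(1536, 14))   # 11881 tokens (109 x 109), ~20x more
</code></pre>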
|
||
</details>
|
||
<h2 id="parrot-multilingual-visual-instruction-tuning"><strong>Parrot:
|
||
Multilingual Visual Instruction Tuning</strong></h2>
|
||
<p>Parrot tackles the issue of “multilingual erosion” in Multimodal
|
||
Large Language Models (MLLMs), where models trained primarily on
|
||
English-centric data struggle to understand and respond in other
|
||
languages. It achieves this by using textual guidance to align visual
|
||
tokens with language-specific embeddings, effectively enhancing the
|
||
model’s multilingual capabilities.</p>
|
||
<p><a href="https://arxiv.org/abs/2406.02539"><img
|
||
src="https://img.shields.io/badge/arXiv-2406.02539-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/AIDC-AI/Parrot"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen,
|
||
Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/467964a0-4ccc-4cec-802a-c93b310d3118" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
Parrot builds upon the LLaVA framework, utilizing a pre-trained CLIP
|
||
ViT-L/14 as the vision encoder and Qwen1.5-Chat as the LLM. The
|
||
architecture consists of three main components: a vision encoder, a
|
||
large language model (LLM), and a multilingual
|
||
<strong>Mixture-of-Experts (MoE)</strong> module. The vision encoder
|
||
processes the input image and generates visual features, which are then
|
||
projected into the embedding space of the LLM using a learned projector.
|
||
To address the multilingual challenge, Parrot introduces a novel textual
|
||
guidance mechanism. It first calculates cross-attention between the
|
||
class token of the visual features and the text embeddings derived from
|
||
the input prompt. This cross-attention output is then fed into the MoE
|
||
module’s router, which predicts the probability of activating each
|
||
language expert. Each expert is a specialized MLP trained to transform
|
||
the English-biased visual embeddings into language-specific
|
||
representations. The router selects the most relevant experts based on
|
||
the input language, and their outputs are combined to generate the final
|
||
language-specific visual embeddings. These embeddings are then combined
|
||
with the original visual embeddings using a weighted sum, ensuring that
|
||
the model retains its ability to process visual information effectively
|
||
across different languages. This entire process allows Parrot to align
|
||
visual tokens with textual embeddings at the language level, effectively
|
||
mitigating multilingual erosion and enhancing the model’s ability to
|
||
understand and respond in multiple languages.
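<p>The sketch below mirrors the routing path described above: the visual class token attends to the prompt’s text embeddings, a router turns that signal into expert weights, and the expert mixture is blended back with the original visual embeddings. The expert count, widths and learnable blend weight are illustrative choices, not Parrot’s exact configuration.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class TextGuidedMoESketch(nn.Module):
    """Text-guided Mixture-of-Experts over visual tokens (illustrative)."""
    def __init__(self, dim=1024, num_experts=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.blend = nn.Parameter(torch.tensor(0.5))

    def forward(self, visual_tokens, text_embeds):
        cls = visual_tokens[:, :1]                            # class token query
        guide, _ = self.cross_attn(cls, text_embeds, text_embeds)
        weights = torch.softmax(self.router(guide), dim=-1)   # (B, 1, E)
        expert_out = torch.stack([e(visual_tokens) for e in self.experts], dim=-1)
        moe_out = (expert_out * weights.unsqueeze(2)).sum(dim=-1)
        # Weighted sum of language-specific and original visual embeddings.
        return self.blend * moe_out + (1 - self.blend) * visual_tokens
</code></pre>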
|
||
</details>
|
||
<h2
|
||
id="omg-llava-bridging-image-level-object-level-pixel-level-reasoning-and-understanding"><strong>OMG-LLaVA:
|
||
Bridging Image-level, Object-level, Pixel-level Reasoning and
|
||
Understanding</strong></h2>
|
||
<p>OMG-LLaVA presents a novel framework that unifies image-level,
|
||
object-level, and pixel-level reasoning and understanding within a
|
||
single Multimodal Large Language Model (MLLM). It leverages the power of
|
||
a frozen universal segmentation model (OMG-Seg) for visual encoding and
|
||
a Large Language Model (LLM) for text understanding and response
|
||
generation, enabling a wide range of multimodal tasks within a single,
|
||
elegant architecture.</p>
|
||
<p><a href="https://arxiv.org/abs/2406.19389"><img
|
||
src="https://img.shields.io/badge/arXiv-2406.19389-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/lxtGH/OMG-Seg"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/papers/2406.19389"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji,
|
||
Chen Change Loy, Shuicheng Yan</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/c2830cc5-ab00-4c48-898e-a077cdc7b947" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
OMG-LLaVA consists of two main components: a frozen universal perception
|
||
module (based on OMG-Seg) and a Large Language Model (LLM). The
|
||
universal perception module is responsible for encoding the input image
|
||
and visual prompts into three types of visual tokens: pixel-centric,
|
||
object-centric, and object-centric derived from visual prompts. The
|
||
pixel-centric tokens are generated by a <strong>ConvNeXt-L based CLIP
|
||
image encoder</strong>, capturing dense image features. The
|
||
object-centric tokens are generated by the OMG decoder, which takes
|
||
learnable object queries and visual prompt queries as input and attends
|
||
to the image features to extract object-level information. This decoder
|
||
can handle point, box, and mask prompts by applying constraints on the
|
||
attention masks. To bridge the gap between the frozen perception module
|
||
and the LLM, a novel “perception prior embedding” strategy is
|
||
introduced. This strategy fuses the image features with the object
|
||
queries from the OMG decoder using a mask score derived from the
|
||
segmentation masks and confidence scores. The resulting weighted object
|
||
queries are then added to the image features to generate the
|
||
pixel-centric visual tokens, providing the LLM with rich object-level
|
||
information. The object-centric visual tokens are directly taken from
|
||
the foreground object queries of the OMG decoder. Both types of visual
|
||
tokens, along with the text instruction tokens, are fed into the LLM,
|
||
which is responsible for understanding the user’s intent and generating
|
||
the appropriate response. The LLM outputs text responses and
|
||
object-centric visual tokens, which are then decoded by the frozen OMG
|
||
decoder to produce segmentation masks. This unified architecture allows
|
||
OMG-LLaVA to perform a wide range of tasks, including image captioning,
|
||
visual question answering, referring segmentation, reasoning
|
||
segmentation, grounded conversation generation, and region captioning,
|
||
all within a single model.
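<p>One plausible reading of the “perception prior embedding” fusion, written out as code: each dense image token receives a confidence-and-mask-weighted sum of the object queries covering it, added on top of the CLIP features to form the pixel-centric tokens. The normalisation and weighting details here are assumptions for illustration.</p>
<pre><code class="language-python">import torch

def perception_prior_embedding(image_feats, object_queries, masks, confidences):
    """image_feats: (B, HW, C) dense CLIP features; object_queries: (B, Q, C);
    masks: (B, Q, HW) soft masks in [0, 1]; confidences: (B, Q) scores.
    Each pixel token gets a weighted sum of the queries covering it."""
    score = masks * confidences.unsqueeze(-1)                   # (B, Q, HW)
    score = score / score.sum(dim=1, keepdim=True).clamp(min=1e-6)
    prior = torch.bmm(score.transpose(1, 2), object_queries)   # (B, HW, C)
    return image_feats + prior          # pixel-centric tokens fed to the LLM
</code></pre>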
|
||
</details>
|
||
<h2
|
||
id="evlm-an-efficient-vision-language-model-for-visual-understanding"><strong>EVLM:
|
||
An Efficient Vision-Language Model for Visual
|
||
Understanding</strong></h2>
|
||
<p>EVLM is an efficient multimodal language model designed to minimize
|
||
computational costs while maximizing the model’s ability to perceive
|
||
visual signals comprehensively. It addresses the challenges of handling
|
||
long sequences of visual signals, particularly in video data, by
|
||
employing a cross-attention mechanism and hierarchical ViT features,
|
||
achieving competitive performance in tasks like image and video
|
||
captioning.</p>
|
||
<p><a href="https://arxiv.org/abs/2407.14177"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.14177-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/papers/2407.14177"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu,
|
||
Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan,
|
||
Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/87563a37-e65e-44d4-a0e1-aea452ae313c" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
EVLM is built upon the Flamingo architecture, incorporating a visual
|
||
encoder, a large language model, and a Gated Cross-Attention Layer. To
|
||
enhance visual perception, EVLM utilizes the 4.4B EVA2-CLIP-E-Plus model
|
||
as the visual encoder, extracting hierarchical visual features by
|
||
uniformly sampling 8 feature sequences from the last 40 layers of the
|
||
transformer. These features are then sequentially fed into different
|
||
Gated Cross-Attention layers of the Flamingo model. Unlike Flamingo,
|
||
which uses a single media token per image, EVLM replaces it with a set of 16
|
||
learnable tokens, aiming to capture visual features similar to Q-former.
|
||
The attention mechanism is designed to allow each set of learnable
|
||
tokens to interact only with the corresponding image, while text
|
||
sequences interact only with the previous image in the multimodal
|
||
sequence. This approach ensures efficient interaction between visual and
|
||
textual information. For the language model, EVLM employs the
|
||
Qwen-14B-Chat 1.0, chosen for its strong performance in content
|
||
understanding and logical reasoning. A gated cross-attention layer is
|
||
inserted before every transformer layer of the language model to
|
||
condition it on visual inputs. To further enhance model effectiveness
|
||
and scale trainable parameters, a Mixture of Experts (MoE) mechanism is
|
||
applied to the Cross Attention layer. This involves replicating and
|
||
segmenting the FFN of the base model into multiple fine-grained experts,
|
||
with a routing layer selecting the appropriate set of experts for each
|
||
token. The model undergoes a three-stage training process: multi-modal
|
||
pre-training, multi-task continual pre-training, and multi-modal
|
||
instruction fine-tuning. Pre-training focuses on cross-modal alignment
|
||
and modeling intrinsic relationships within multimodal data, using a
|
||
large-scale dataset of bilingual image-text captions and web-type
|
||
multimodal data. Continual pre-training further enhances the model’s
|
||
visual question-answering ability, while instruction fine-tuning
|
||
activates its instruction-following capabilities using a diverse range
|
||
of high-quality instruction tuning data.
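<p>A minimal sketch of the hierarchical-feature step described above, assuming the encoder exposes one feature sequence per transformer layer: eight sequences are sampled uniformly from the last forty layers, each destined for a different gated cross-attention layer. The sampling rule shown is a simple stand-in for the paper’s exact schedule.</p>
<pre><code class="language-python">import torch

def sample_hierarchical_features(layer_outputs, num_groups=8, last_n=40):
    """layer_outputs: list of (B, N, C) tensors, one per ViT layer.
    Uniformly pick `num_groups` sequences from the last `last_n` layers;
    each group feeds a different gated cross-attention layer downstream."""
    last = layer_outputs[-last_n:]
    step = len(last) // num_groups
    picks = [last[i * step] for i in range(num_groups)]
    return torch.stack(picks, dim=0)     # (num_groups, B, N, C)
</code></pre>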
|
||
</details>
|
||
<h2
|
||
id="slowfast-llava-a-strong-training-free-baseline-for-video-large-language-models"><strong>SlowFast-LLaVA:
|
||
A Strong Training-Free Baseline for Video Large Language
|
||
Models</strong></h2>
|
||
<p>SlowFast-LLaVA (SF-LLaVA) is a training-free video large language
|
||
model that effectively captures both detailed spatial semantics and
|
||
long-range temporal context in videos without requiring any additional
|
||
fine-tuning on video data. It achieves this by leveraging a two-stream
|
||
SlowFast design inspired by action recognition models, allowing it to
|
||
process a larger number of frames and outperform existing training-free
|
||
methods on various video benchmarks.</p>
|
||
<p><a href="https://arxiv.org/abs/2407.15841"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.15841-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/papers/2407.15841"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming
|
||
Gang, Kai Kang, Afshin Dehghan</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/6e1e2f43-86a7-42e3-998a-24bbd8f1c741" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
SF-LLaVA builds upon the LLaVA-NeXT framework and utilizes a two-stream
|
||
approach, similar to SlowFast networks in action recognition, to process
|
||
video inputs. The model first uniformly samples N frames from the input
|
||
video. These frames are then processed independently by a visual
|
||
encoder, such as CLIP-L, followed by a visual-language adapter for
|
||
feature alignment. The resulting frame features are then fed into two
|
||
separate pathways: Slow and Fast. <strong>The Slow pathway</strong>
|
||
focuses on capturing detailed spatial semantics by processing a smaller
|
||
number of frames (Nslow) at a higher spatial resolution (e.g., 8 frames
|
||
with 24x24 tokens). It applies spatial pooling with a small stride
|
||
(e.g., 1x2) to aggregate features and reduce the number of tokens.
|
||
<strong>The Fast pathway</strong> focuses on capturing temporal context
|
||
and motion cues by processing all N frames (Nfast = N) at a lower
|
||
spatial resolution (e.g., 64 frames with 4x4 tokens). It applies
|
||
aggressive spatial pooling to each frame to prioritize temporal
|
||
information. The features from both pathways are then flattened and
|
||
concatenated, forming a comprehensive video representation that balances
|
||
spatial details and temporal context. This aggregated feature vector,
|
||
along with the text prompt and question, is then fed into the LLM
|
||
(LLaVA-NeXT) to generate the final answer. This training-free approach
|
||
eliminates the need for expensive fine-tuning on video datasets, making
|
||
SF-LLaVA highly efficient and adaptable to various video scenarios. The
|
||
authors demonstrate the effectiveness of SF-LLaVA on three different
|
||
video question-answering tasks (Open-Ended VideoQA, Multiple Choice
|
||
VideoQA, and Text Generation) across eight benchmarks, showcasing its
|
||
superior performance compared to existing training-free methods and even
|
||
surpassing some state-of-the-art supervised fine-tuned video LLMs.
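<p>The aggregation itself is simple enough to write down; the sketch below follows the numbers quoted above (8 slow frames with light 1x2 pooling, all frames pooled to 4x4 for the fast path) but treats them as configurable assumptions rather than the official code.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def slowfast_video_tokens(frame_feats, n_slow=8, slow_stride=(1, 2), fast_hw=4):
    """frame_feats: (N, H, W, C) per-frame features from the image encoder,
    e.g. 64 frames of 24x24 tokens. Returns one flattened token sequence."""
    n, h, w, c = frame_feats.shape
    x = frame_feats.permute(0, 3, 1, 2)                  # (N, C, H, W)
    # Slow path: few frames, light pooling, keeps spatial detail.
    idx = torch.linspace(0, n - 1, n_slow).long()
    slow = F.avg_pool2d(x[idx], kernel_size=slow_stride, stride=slow_stride)
    # Fast path: every frame, aggressive pooling, keeps temporal coverage.
    fast = F.adaptive_avg_pool2d(x, output_size=fast_hw)

    def flatten(t):
        return t.permute(0, 2, 3, 1).reshape(-1, c)

    return torch.cat([flatten(slow), flatten(fast)], dim=0)   # (tokens, C)
</code></pre>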
|
||
</details>
|
||
<h2
|
||
id="inf-llava-high-resolution-image-perception-for-multimodal-large-language-models"><strong>INF-LLaVA:
|
||
High-Resolution Image Perception for Multimodal Large Language
|
||
Models</strong></h2>
|
||
<p>INF-LLaVA is a novel Multimodal Large Language Model (MLLM) designed
|
||
to effectively process high-resolution images. It addresses the
|
||
limitations of existing cropping-based and dual-encoder methods by
|
||
introducing two innovative modules: Dual-perspective Cropping Module
|
||
(DCM) and Dual-perspective Enhancement Module (DEM). DCM segments
|
||
high-resolution images into sub-images from both local and global
|
||
perspectives, preserving detailed and contextual information. DEM
|
||
facilitates efficient interaction between local and global features,
|
||
enhancing the model’s understanding of complex visual relationships.
|
||
Extensive evaluations demonstrate INF-LLaVA’s superior performance on
|
||
various benchmarks, establishing a new state-of-the-art in
|
||
vision-language tasks.</p>
|
||
<p><a href="https://arxiv.org/abs/2407.16198"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.16198-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/WeihuangLin/INF-LLaVA"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/papers/2407.16198"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Yiwei Ma, Zhibin Wang, Xiaoshuai Sun, Weihuang Lin, Qiang Zhou, Jiayi
|
||
Ji, Rongrong Ji</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/641027c4-a5eb-42e8-8486-b58f3508c553" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
INF-LLaVA pushes the boundaries of Multimodal Large Language Models
|
||
(MLLMs) by tackling the critical challenge of high-resolution image
|
||
perception. It aims to leverage the richness of detail present in
|
||
high-resolution images without succumbing to the computational
|
||
limitations imposed by traditional MLLM architectures. INF-LLaVA
|
||
achieves this through a sophisticated approach that combines innovative
|
||
cropping and feature enhancement techniques, resulting in a model
|
||
capable of simultaneously capturing fine-grained local details and
|
||
comprehensive global context. At the core of INF-LLaVA lies the
|
||
Dual-perspective Cropping Module (DCM), a strategic cropping strategy
|
||
that surpasses conventional approaches by integrating both local and
|
||
global perspectives. This dual-perspective approach ensures that each
|
||
extracted sub-image retains not only the intricate details essential for
|
||
accurate analysis but also the broader contextual information crucial
|
||
for understanding the relationships between objects. While
|
||
local-perspective cropping preserves continuous visual information at
|
||
high resolution, capturing the essence of individual objects and
|
||
regions, global-perspective cropping leverages a unique interleaving
|
||
technique to preserve the overall spatial relationships between objects
|
||
within the high-resolution image. This balanced combination ensures that
|
||
the model can perceive both the “trees” and the “forest,” enabling a
|
||
holistic understanding of the visual scene. To further enhance the
|
||
model’s understanding, INF-LLaVA introduces the Dual-perspective
|
||
Enhancement Module (DEM). This module facilitates efficient and
|
||
effective interaction between the local and global features extracted by
|
||
the vision encoder, enriching the representation with multi-scale
|
||
information. Instead of relying on computationally expensive
|
||
cross-attention directly on high-resolution features, DEM employs a more
|
||
resource-efficient strategy. It leverages 2D positional priors to
|
||
concatenate global-perspective sub-image features back into the original
|
||
image’s shape, effectively recreating a high-resolution representation
|
||
of the global context. These recombined features are then re-cropped
|
||
from a local perspective, and cross-attention is performed between
|
||
corresponding local and global sub-images to enhance global features
|
||
with fine-grained local details. A symmetrical process enhances local
|
||
features with global context. This meticulously designed interaction
|
||
between local and global features ensures that the resulting
|
||
representation is not only rich in detail but also cognizant of the
|
||
broader context. The dual-enhanced features are then projected into a
|
||
format compatible with the LLM through a linear connector. The LLM then
|
||
processes the combined visual and textual information to generate a
|
||
coherent and contextually relevant response. Through extensive
|
||
evaluations on a diverse set of benchmarks, including ScienceQA-img,
|
||
OKVQA, SEEDBench, MMBench, AI2D, LLaVA-Bench-in-the-wild, and MMMU,
|
||
INF-LLaVA demonstrates its superior performance over existing MLLMs. Its
|
||
ability to effectively handle high-resolution images while maintaining
|
||
computational efficiency establishes a new state-of-the-art in the
|
||
field. The open-source release of INF-LLaVA, along with its pretrained
|
||
model and code, paves the way for further research and exploration of
|
||
high-resolution image perception in multimodal large language models,
|
||
pushing the boundaries of multimodal understanding and enabling the
|
||
development of more powerful and versatile AI systems.
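<p>The two cropping perspectives can be sketched directly with tensor slicing: local crops are contiguous tiles that keep fine detail, while global crops subsample with a stride equal to the tile count, so each sub-image keeps the whole field of view at reduced resolution. A 2x2 grid is assumed here purely for illustration.</p>
<pre><code class="language-python">import torch

def dual_perspective_crops(image, tiles=2):
    """image: (C, H, W) with H and W divisible by `tiles`. Local crops are
    contiguous tiles; global crops subsample with stride `tiles`, so each
    one covers the full field of view at reduced resolution."""
    c, h, w = image.shape
    th, tw = h // tiles, w // tiles
    local = [image[:, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
             for i in range(tiles) for j in range(tiles)]
    global_ = [image[:, i::tiles, j::tiles]
               for i in range(tiles) for j in range(tiles)]
    return local, global_    # each sub-image is (C, H/tiles, W/tiles)

local, global_ = dual_perspective_crops(torch.randn(3, 448, 448))
</code></pre>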
|
||
</details>
|
||
<h2 id="vila²-vila-augmented-vila"><strong>VILA²: VILA Augmented
|
||
VILA</strong></h2>
|
||
<p>VILA² (VILA-augmented-VILA) introduces a novel approach to address
|
||
the limitations of data quantity and quality in training Visual Language
|
||
Models (VLMs). Instead of relying on costly human annotation or
|
||
distillation from proprietary models, VILA² leverages the VLM itself to
|
||
iteratively refine and augment its pretraining data, leading to
|
||
significant performance improvements and achieving state-of-the-art
|
||
results on the MMMU leaderboard among open-sourced models.</p>
|
||
<p><a href="https://arxiv.org/abs/2407.17453"><img
|
||
src="https://img.shields.io/badge/arXiv-2407.17453-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://huggingface.co/papers/2407.17453"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Yunhao Fang, Ligeng Zhu, Yao Lu, Yan Wang, Pavlo Molchanov, Jang Hyun
|
||
Cho, Marco Pavone, Song Han, Hongxu Yin</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/b7602734-1163-49aa-bf78-27ae42a520bd" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
VILA² employs a two-step iterative process: self-augmenting and
|
||
specialist-augmenting. The self-augmenting loop focuses on enhancing the
|
||
general knowledge of the VLM by using the model itself to re-caption its
|
||
pretraining data. This process starts with an initial VLM (VILA0)
|
||
trained on a dataset with typically short and brief captions, like COYO.
|
||
VILA0 is then used to generate longer and more detailed captions for the
|
||
same images, creating a synthetic dataset. This augmented dataset,
|
||
combined with the original data, is used to train the next iteration of
|
||
the VLM (VILA1). This loop can be repeated multiple times, with each
|
||
iteration improving the caption quality and subsequently the VLM’s
|
||
performance. However, this self-augmentation process eventually reaches
|
||
saturation. To overcome this limitation, VILA² introduces the
|
||
<strong>specialist-augmenting loop</strong>. This involves fine-tuning
|
||
the self-augmented VLM on specific downstream tasks, creating specialist
|
||
VLMs with expertise in areas like spatial awareness, OCR, and grounding.
|
||
These specialists are then used to re-caption the pretraining data,
|
||
focusing on their specific domain knowledge. The self-augmented VLM is
|
||
then retrained on this specialist-recaptioned data, further boosting its
|
||
performance. This approach leverages the synergy between the vast amount
|
||
of data in pretraining and the specialized knowledge acquired during
|
||
fine-tuning. The architecture of VILA² follows the standard
|
||
auto-regressive VLM design, consisting of a large language model (LLM),
|
||
a visual encoder, and an image-text projector. The authors experiment
|
||
with different LLMs (Llama2-7B, Llama3-8B-Instruct, and Yi-34B) and
|
||
visual encoders (SigLIP and InternViT-6B). They also introduce a 4x
|
||
downsampling of visual tokens to reduce computational cost. The training
|
||
process follows the typical three-stage paradigm: projector
|
||
initialization, vision-language pre-training, and visual
|
||
instruction-tuning. VILA² demonstrates significant performance
|
||
improvements over previous state-of-the-art methods on various
|
||
benchmarks, including general VQA, text-oriented VQA, general multimodal
|
||
benchmarks, and image captioning. This highlights the effectiveness of
|
||
the proposed self- and specialist-augmentation techniques in enhancing
|
||
VLM training and achieving state-of-the-art results.
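<p>The self-augmenting loop reduces to a few lines of pseudocode-style Python; <code>train_vlm</code> and <code>recaption</code> stand in for the full training and captioning pipelines, which are not implemented here.</p>
<pre><code class="language-python">def self_augmenting_rounds(train_vlm, recaption, base_data, rounds=3):
    """base_data: iterable of (image, brief_caption) pairs. `train_vlm` and
    `recaption` stand in for the full training / captioning pipelines."""
    vlm = train_vlm(list(base_data))                  # VILA_0 on brief captions
    for _ in range(rounds):
        synthetic = [(img, recaption(vlm, img)) for img, _ in base_data]
        vlm = train_vlm(list(base_data) + synthetic)  # VILA_t on original + new
    return vlm
</code></pre>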
|
||
</details>
|
||
<h2 id="minicpm-v-a-gpt-4v-level-mllm-on-your-phone"><strong>MiniCPM-V:
|
||
A GPT-4V Level MLLM on Your Phone</strong></h2>
|
||
<p>MiniCPM-V is a series of efficient Multimodal Large Language Models
|
||
(MLLMs) designed for deployment on end-side devices like mobile phones
|
||
and personal computers. The latest iteration, MiniCPM-Llama3-V 2.5,
|
||
achieves performance comparable to GPT-4V, Gemini Pro, and Claude 3
|
||
while being significantly smaller and more efficient, demonstrating the
|
||
feasibility of deploying powerful MLLMs on resource-constrained
|
||
devices.</p>
|
||
<p><a href="https://arxiv.org/pdf/2408.01800"><img
|
||
src="https://img.shields.io/badge/arXiv-2408.01800-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/OpenBMB/MiniCPM-V"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/openbmb/MiniCPM-V-2_6"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu,
|
||
Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong
|
||
Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie
|
||
Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, Maosong Sun</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/d943871a-ca05-46d6-9572-7fe02dda1495" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
MiniCPM-V focuses on achieving a balance between performance and
|
||
efficiency, crucial for real-world applications on end-side devices. The
|
||
model architecture consists of three key modules: a visual encoder, a
|
||
compression layer, and an LLM. For the visual encoder, MiniCPM-V
|
||
utilizes SigLIP SoViT-400m/14, chosen for its efficiency and
|
||
effectiveness. To handle high-resolution images with varying aspect
|
||
ratios, the model employs an adaptive visual encoding approach. This
|
||
involves dividing the input image into slices that better match the
|
||
ViT’s pre-training settings in terms of resolution and aspect ratio. A
|
||
score function is used to select the optimal partition of slices,
|
||
ensuring a good match with the ViT’s pre-training. Each slice is then
|
||
resized proportionally and interpolated to fit the ViT’s input size.
|
||
After visual encoding, each slice is represented by 1024 tokens,
|
||
resulting in a large number of tokens for multiple slices. To address
|
||
this, a token compression module is employed, using one-layer
|
||
cross-attention with a moderate number of queries to compress the visual
|
||
tokens of each slice into 64 or 96 tokens. This significantly reduces
|
||
the computational cost and memory footprint, making the model suitable
|
||
for end-side deployment. A spatial schema is also introduced to indicate
|
||
the position of each slice relative to the whole image, further
|
||
enhancing the model’s understanding of spatial relationships. The
|
||
compressed visual tokens, along with the text input, are then fed into
|
||
the LLM, which is based on MiniCPM 2B for earlier versions and
|
||
Llama3-Instruct 8B for MiniCPM-Llama3-V 2.5. The training process
|
||
consists of three phases: pre-training, supervised fine-tuning, and
|
||
RLAIF-V (Reinforcement Learning from AI Feedback for Vision).
|
||
Pre-training aims to align the visual modules with the LLM’s input space
|
||
and learn foundational multimodal knowledge. It involves three stages:
|
||
warming up the compression layer, extending the input resolution of the
|
||
visual encoder, and training the visual modules with the adaptive visual
|
||
encoding strategy. Supervised fine-tuning further enhances the model’s
|
||
knowledge and interaction capabilities using high-quality visual
|
||
question answering datasets. The SFT data is categorized into two parts:
|
||
one focusing on basic recognition capabilities and the other on
|
||
generating detailed responses and following instructions. Finally,
|
||
RLAIF-V is employed to mitigate the hallucination problem common in
|
||
MLLMs. This involves generating multiple responses for an instruction,
|
||
evaluating their correctness using a divide-and-conquer strategy, and
|
||
then optimizing the model using Direct Preference Optimization (DPO) on
|
||
a preference dataset. MiniCPM-V demonstrates impressive performance on
|
||
various benchmarks, including general multimodal benchmarks, OCR
|
||
benchmarks, and multilingual multimodal interaction, while being
|
||
efficient enough for deployment on mobile phones. This highlights the
|
||
potential of pushing the boundaries of end-side MLLMs and bringing
|
||
powerful AI capabilities to user devices.
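<p>The slice-selection step can be illustrated with a simplified score function: candidate grids are ranked by how closely a slice’s area and aspect ratio match the ViT’s pre-training resolution. The scoring formula and the cap on slice count below are stand-ins for the paper’s exact criterion.</p>
<pre><code class="language-python">import math

def best_partition(img_w, img_h, vit_res=448, max_slices=9):
    """Pick the (rows, cols) grid whose slices best match the ViT's
    pre-training resolution in area and aspect ratio (simplified score)."""
    best, best_score = (1, 1), -math.inf
    for rows in range(1, max_slices + 1):
        for cols in range(1, max_slices + 1):
            if rows * cols > max_slices:
                continue
            slice_w, slice_h = img_w / cols, img_h / rows
            area_fit = -abs(math.log(slice_w * slice_h / vit_res ** 2))
            ratio_fit = -abs(math.log(slice_w / slice_h))
            score = area_fit + ratio_fit
            if score > best_score:
                best, best_score = (rows, cols), score
    return best

print(best_partition(1344, 896))   # (2, 3): six 448x448 slices for a 3:2 image
</code></pre>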
|
||
</details>
|
||
<h2
|
||
id="minicpm-o-2.6-a-gpt-4o-level-mllm-for-vision-speech-and-multimodal-live-streaming"><strong>MiniCPM-o-2.6:
|
||
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live
|
||
Streaming</strong></h2>
|
||
<p>MiniCPM-o-2.6 is a powerful 8B parameter multimodal large language
|
||
model (MLLM) that excels in vision, speech, and multimodal live
|
||
streaming, achieving performance comparable to GPT-4o in several
|
||
benchmarks, while maintaining high efficiency for deployment on edge
|
||
devices.</p>
|
||
<p><a
|
||
href="https://openbmb.notion.site/MiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9"><img
|
||
src="https://img.shields.io/badge/Blog-MiniCPM%20Team%20Blog-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/OpenBMB/MiniCPM-o"><img
|
||
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
|
||
href="https://huggingface.co/openbmb/MiniCPM-o-2_6"><img
|
||
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
|
||
alt="HuggingFace" /></a><br />
|
||
OpenBMB</p>
|
||
<p align="center">
|
||
<img src="https://github.com/user-attachments/assets/cb066a40-8da7-4775-b002-7c054697f1ec" width=600/>
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
MiniCPM-o-2.6 employs an end-to-end omni-modal architecture that connects
several pre-trained components and trains them jointly: a <strong>vision
encoder</strong> (SigLip-400M), an <strong>audio encoder</strong>
(Whisper-medium-300M), a <strong>text-to-speech (TTS) module</strong>
(ChatTTS-200M), and a <strong>large language model (LLM)</strong>
(Qwen2.5-7B). A key innovation is the “Omni-modal Live Streaming
Mechanism”, which has three parts. <strong>Online modality
encoders/decoders:</strong> the offline encoders and decoders are converted
into online versions that handle streaming inputs and outputs.
<strong>Time-Division Multiplexing (TDM):</strong> a TDM mechanism within
the LLM backbone processes omni-modal streams by dividing the parallel
video and audio streams into sequential information within short time
slices. <strong>Configurable speech modeling:</strong> a multimodal system
prompt (including text and audio prompts) allows flexible voice
configuration at inference time, enabling voice cloning
and description-based voice creation.
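<p>A toy sketch of the time-division multiplexing idea, assuming timestamped tokens and a fixed slice length: parallel video and audio streams are cut into short slices and interleaved into one sequence for the LLM backbone. Both the data format and the slice length are invented for illustration.</p>
<pre><code class="language-python">def time_division_multiplex(video_tokens, audio_tokens, slice_ms=500):
    """Inputs are lists of (timestamp_ms, token) pairs; output is one
    sequence ordered slice by slice, video before audio within a slice."""
    def slice_of(ts):
        return int(ts // slice_ms)

    merged = []
    last = max([ts for ts, _ in video_tokens + audio_tokens], default=0)
    for s in range(slice_of(last) + 1):
        merged += [tok for ts, tok in video_tokens if slice_of(ts) == s]
        merged += [tok for ts, tok in audio_tokens if slice_of(ts) == s]
    return merged

# Example: tokens arriving over ~1.2 s are re-ordered slice by slice.
video = [(0, "v0"), (400, "v1"), (900, "v2")]
audio = [(100, "a0"), (600, "a1"), (1100, "a2")]
print(time_division_multiplex(video, audio))  # ['v0', 'v1', 'a0', 'v2', 'a1', 'a2']
</code></pre>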
|
||
</details>
|
||
<h2
|
||
id="llava-onevision-easy-visual-task-transfer"><strong>LLaVA-OneVision:
|
||
Easy Visual Task Transfer</strong></h2>
|
||
<p>LLaVA-OneVision is a family of open large multimodal models (LMMs)
|
||
designed to excel in various computer vision scenarios, including
|
||
single-image, multi-image, and video understanding. It pushes the
|
||
performance boundaries of open LMMs by consolidating insights from the LLaVA-NeXT blog series, focusing on data, models, and visual representations. Notably, LLaVA-OneVision demonstrates strong transfer learning capabilities, enabling it to excel in video understanding tasks by leveraging knowledge learned from image data.</p>
<p><a href="https://arxiv.org/abs/2408.03326"><img src="https://img.shields.io/badge/arXiv-2408.03326-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://llava-vl.github.io/blog/2024-08-05-llava-onevision/"><img src="https://img.shields.io/badge/🌐-Website-blue" alt="Website" /></a> <a href="https://huggingface.co/papers/2408.03326"><img src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg" alt="HuggingFace" /></a><br />
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/abe36db3-571d-4068-b532-7512d4a5fcc5" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
LLaVA-OneVision inherits the minimalist design of the LLaVA series, aiming to effectively leverage pre-trained capabilities of both the LLM and the visual model while facilitating strong scaling. The architecture consists of three key components: a large language model (LLM), a vision encoder, and a projector. The authors choose Qwen-2 as the LLM due to its strong language capabilities and various model sizes available. For the vision encoder, they opt for SigLIP, which has been shown to yield higher LMM performance among open vision encoders. A 2-layer MLP is used as the projector to map image features into the word embedding space, creating a sequence of visual tokens. The model utilizes a flexible visual representation strategy called Higher AnyRes, which builds upon the original AnyRes strategy introduced in LLaVA-NeXT. This strategy involves dividing the input image into crops, each with a resolution suitable for the vision encoder, and then applying bilinear interpolation to reduce the number of tokens per crop if needed. This allows the model to handle high-resolution images and videos efficiently while preserving important visual details. The specific configuration of <strong>Higher AnyRes</strong> is adapted for different scenarios: single-image, multi-image, and video. For single-image data, a large maximum spatial configuration is used to maintain the original image resolution and a large number of visual tokens are allocated to effectively represent the visual signal. For multi-image data, only the base image resolution is considered, eliminating the need for multi-crop and saving computational resources. For video data, each frame is resized to the base image resolution and bilinear interpolation is used to reduce the number of tokens per frame, allowing for the processing of a larger number of frames. The training process follows a three-stage curriculum learning approach: language-image alignment, high-quality knowledge learning, and visual instruction tuning. The first stage focuses on aligning visual features with the LLM’s embedding space using the LLaVA align dataset. The second stage refines and enhances the model’s knowledge base using high-quality data from three major categories: re-captioned detailed description data, document/OCR data, and Chinese and language data. The final stage involves visual instruction tuning, where the model is trained on a diverse set of visual tasks with preferred responses. This stage is further divided into two phases: single-image training and OneVision training. Single-image training focuses on single-image scenarios, while OneVision training expands the model’s capabilities to multi-image and video scenarios, enabling task transfer and emerging capabilities. LLaVA-OneVision demonstrates state-of-the-art performance on various benchmarks, including single-image, multi-image, and video tasks, showcasing its effectiveness and versatility in handling diverse visual scenarios.
</details>
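<p>The crop-and-interpolate token budgeting described above can be illustrated with a short sketch. This is not code from the LLaVA-OneVision release; the 27x27 grid, the 1152-dimensional features, and the token budget below are illustrative stand-ins for a SigLIP-like encoder.</p>
<pre><code class="language-python">import math
import torch
import torch.nn.functional as F

def reduce_crop_tokens(crop_feats, max_total_tokens):
    # crop_feats: (num_crops, grid_h, grid_w, dim) patch features, one grid per image crop or video frame.
    num_crops, gh, gw, dim = crop_feats.shape
    # Shrink each crop's grid just enough to respect the global visual-token budget.
    scale = min(1.0, math.sqrt(max_total_tokens / (num_crops * gh * gw)))
    new_h, new_w = max(1, int(gh * scale)), max(1, int(gw * scale))
    x = crop_feats.permute(0, 3, 1, 2)  # to (N, C, H, W) for interpolation
    x = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    # Flatten every crop back into one token sequence for the projector / LLM.
    return x.permute(0, 2, 3, 1).reshape(num_crops * new_h * new_w, dim)

# Example: 12 video frames encoded as 27x27 grids, budgeted to roughly 2000 visual tokens.
tokens = reduce_crop_tokens(torch.randn(12, 27, 27, 1152), max_total_tokens=2000)
print(tokens.shape)
</code></pre>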
<h2 id="vita-towards-open-source-interactive-omni-multimodal-llm"><strong>VITA: Towards Open-Source Interactive Omni Multimodal LLM</strong></h2>
<p>VITA is the first open-source Multimodal Large Language Model (MLLM) capable of simultaneously processing and analyzing video, image, text, and audio modalities while offering an advanced multimodal interactive experience. It addresses the limitations of existing open-source models, which often excel in either understanding or interaction but rarely both, by integrating architectural innovations with advanced training and development strategies.</p>
<p><a href="https://arxiv.org/pdf/2408.05211"><img src="https://img.shields.io/badge/arXiv-2408.05211-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/VITA-MLLM/VITA"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a href="https://huggingface.co/VITA-MLLM"><img src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg" alt="HuggingFace" /></a><br />
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/94e2b781-0c86-47df-ac18-76ebc71bb349" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
VITA starts with the Mixtral 8x7B model as its language foundation, chosen for its strong performance and sparse mixture of experts (SMoE) architecture. To enhance its Chinese language capabilities, the vocabulary is expanded with Chinese terms, and the model undergoes bilingual instruction tuning using a high-quality bilingual text corpus. This ensures proficiency in both Chinese and English. For visual modality, VITA employs InternViT-300M-448px as the visual encoder, processing images at 448x448 resolution and generating 256 tokens after passing through a two-layer MLP visual connector. High-resolution images are handled using a dynamic patching strategy, while videos are treated as special cases of images, with frame sampling based on video length. For audio modality, a Mel Filter Bank block is used to process the input audio, followed by 4x CNN downsampling layers and a 24-layer transformer, resulting in 25 tokens for every 2 seconds of audio. A two-layer MLP serves as the audio-text modality connector. The training pipeline consists of three stages: LLM instruction tuning, multimodal alignment, and multimodal instruction tuning. LLM instruction tuning focuses on enhancing the base LLM’s bilingual capabilities. Multimodal alignment aims to bridge the representation gap between text and other modalities by training individual encoders and connectors for each modality. This involves collecting and curating a large-scale, high-quality multimodal dataset, including image descriptions, general image QA, OCR and diagram data, general video descriptions, general video QA, and pure text data. Multimodal instruction tuning further refines the model’s ability to follow instructions and understand different modalities. A specially designed state token is introduced to distinguish the type of input query (effective audio, noisy audio, or text), enabling non-awakening interaction during inference. To achieve natural multimodal human-computer interaction, VITA introduces two key innovations: non-awakening interaction and audio interrupt interaction. These are implemented using a duplex pipeline during deployment. Two VITA models run concurrently: one for generating responses to user queries (Generation model) and the other for monitoring environmental audio (Monitoring model). The Monitoring model uses SileroVAD for voice activity detection and filters out noisy audio based on the state token. If an effective audio query is detected, the Monitoring model interrupts the Generation model, consolidates the historical context, and responds to the latest query. The two models then swap identities, ensuring continuous monitoring and seamless interaction. VITA demonstrates strong performance on various unimodal and multimodal benchmarks, showcasing its robust foundational capabilities in multilingual, vision, and audio understanding. While still lagging behind closed-source counterparts in certain areas, VITA represents a significant step towards open-source interactive omni-modal LLMs, paving the way for future research and development in this field.
</details>
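<p>The duplex deployment described above can be pictured with a toy simulation. Nothing below comes from the VITA codebase; <code>classify_state</code> is a stub standing in for SileroVAD plus the model’s state-token prediction, and the role-swapping logic is deliberately simplified.</p>
<pre><code class="language-python"># Toy duplex loop: a "monitoring" instance watches incoming audio chunks and, when the
# state token marks an effective query, it takes over generation while the other
# instance switches to monitoring.
def classify_state(chunk):
    # Stub: real VITA combines voice-activity detection with a learned state token.
    return "effective" if "question" in chunk else "noise"

def duplex(chunks):
    roles = {"A": "generate", "B": "monitor"}
    for chunk in chunks:
        if classify_state(chunk) == "effective":
            monitor = next(k for k, v in roles.items() if v == "monitor")
            # Monitoring model interrupts, answers with consolidated history, then roles swap.
            print(f"{monitor} answers: {chunk!r}")
            roles = {k: ("monitor" if v == "generate" else "generate") for k, v in roles.items()}

duplex(["hum of a fan", "question: what is on the table?", "door slam"])
</code></pre>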
<h2 id="eagle-exploring-the-design-space-for-multimodal-llms-with-mixture-of-encoders"><strong>EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders</strong></h2>
<p>EAGLE is a family of open-source Multimodal Large Language Models (MLLMs) that leverage a mixture of vision encoders to achieve state-of-the-art performance on various benchmarks, particularly in tasks involving OCR and document understanding. The study focuses on systematically exploring the design space of MLLMs with multiple vision encoders, aiming to identify optimal design choices and improve MLLM perception.</p>
<p><a href="https://arxiv.org/pdf/2408.15998"><img src="https://img.shields.io/badge/arXiv-2408.15998-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/NVlabs/EAGLE"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a href="https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat"><img src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg" alt="HuggingFace" /></a><br />
Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/4e057a78-3fad-4a04-9a05-0f5361a8255b" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
EAGLE builds upon the LLaVA architecture, consisting of a large language model, a vision encoder, and a projection layer. The core innovation lies in integrating multiple vision experts, each pre-trained on different tasks and resolutions, to enhance the model’s ability to perceive and comprehend diverse visual information. The study explores various aspects of this design space, including high-resolution adaptation, fusion paradigms, and optimal encoder combinations. It introduces a Pre-Alignment training stage to address representational inconsistencies between vision-focused encoders and language tokens. The training process consists of three progressive stages: vision-language pre-alignment, joint-projector training, and supervised fine-tuning. EAGLE achieves state-of-the-art performance on various benchmarks, demonstrating significant advantages in OCR and document understanding tasks. The study highlights the importance of systematic design space exploration and the effectiveness of combining multiple vision experts with a streamlined fusion strategy and a pre-alignment training stage for building high-performing MLLMs.
</details>
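<p>A minimal sketch of the channel-concatenation style of fusion that this design-space study converges on: each vision expert yields a token grid, the grids are aligned to one token count, concatenated along the feature dimension, and projected into the LLM width. The expert dimensions and LLM width below are placeholders, not EAGLE’s exact sizes.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ChannelConcatFusion(nn.Module):
    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One linear projector maps the concatenated expert channels into the LLM embedding width.
        self.proj = nn.Linear(sum(expert_dims), llm_dim)

    def forward(self, expert_tokens):
        # expert_tokens: list of (batch, num_tokens, dim_i), already aligned in num_tokens.
        fused = torch.cat(expert_tokens, dim=-1)
        return self.proj(fused)

fusion = ChannelConcatFusion(expert_dims=[1024, 768], llm_dim=4096)
out = fusion([torch.randn(2, 576, 1024), torch.randn(2, 576, 768)])
print(out.shape)  # (2, 576, 4096)
</code></pre>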
<h2 id="eagle-2-building-post-training-data-strategies-from-scratch-for-frontier-vision-language-models"><strong>Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models</strong></h2>
<p>Eagle 2 is a family of vision-language models (VLMs) developed with a data-centric approach, focusing on post-training data strategies to achieve state-of-the-art performance. The models build upon open-source components and prioritize data diversity and quality, using a three-stage training recipe and a tiled mixture of vision encoders (MoVE) architecture, achieving results that match or surpass those of larger, proprietary models.</p>
<p><a href="https://arxiv.org/abs/2501.14818"><img src="https://img.shields.io/badge/arXiv-2501.14818-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/NVlabs/EAGLE"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a href="https://huggingface.co/nvidia/Eagle2-9B"><img src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg" alt="HuggingFace" /></a><br />
Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Matthieu Le, De-An Huang, Ilia Karmanov, Lukas Voegtle, Jose M. Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Karan Sapra, Amala Deshmukh, Tuomas Rintamaki, Philipp Fischer, Timo Roman, Tong Lu, Guilin Liu, Zhiding Yu</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/e4280077-c80f-4cca-bd8f-3122a322520e" width="600"/>
<!-- Placeholder, no single architecture image -->
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>Eagle 2</strong> adopts a “diversity first, then quality” data strategy, beginning with a large, diverse pool of over 180 data sources, followed by rigorous filtering and selection. The architecture uses a tiled mixture of vision encoders (MoVE), specifically SigLIP and ConvNeXt-XXLarge, with image tiling to handle high resolutions. Each image tile is encoded by the channel-concatenated MoVE. The vision encoder outputs are concatenated and aligned with the LLM (Qwen2.5) via a simple MLP connector. A three-stage training recipe is used: Stage 1 trains the connector to align modalities; Stage 1.5 trains the full model on a large, diverse dataset; and Stage 2 fine-tunes on a high-quality instruction-tuning dataset. Crucially, <em>all</em> available visual instruction data is used in Stage 1.5, not just captioning/knowledge data. Balanced data packing addresses limitations in existing open-source frameworks. The core contribution is the detailed data strategy. This involves: (1) <strong>Data Collection</strong>: Building a highly diverse data pool (180+ sources) through both passive gathering (monitoring arXiv, HuggingFace) and proactive searching (addressing the “bucket effect” via error analysis). (2) <strong>Data Filtering</strong>: Removing low-quality samples based on criteria like mismatched question-answer pairs, irrelevant image-question pairs, repeated text, and numeric formatting issues. (3) <strong>Data Selection</strong>: Choosing optimal subsets based on data source diversity, distribution, and K-means clustering on SSCD image embeddings to ensure balance across types (especially useful for chart data, etc.). (4) <strong>Data Augmentation</strong>: Mining information from input images through techniques like Chain-of-Thought (CoT) explanation generation, rule-based QA generation, and expanding short answers into longer ones. (5) <strong>Data Formatting</strong>: Removing unnecessary decorations. Training uses a three-stage approach: <strong>Stage 1:</strong> Aligns language and image modalities by training the MLP connector. <strong>Stage 1.5:</strong> Trains the <em>full</em> model using a large-scale, diverse dataset (21.6M samples). <em>All</em> available visual instruction data is used here, unlike common two-stage approaches, leading to substantial improvements. <strong>Stage 2:</strong> Fine-tunes the full model on a carefully curated, high-quality visual instruction tuning dataset (4.6M samples). The model is trained with AdamW. Eagle 2 demonstrates strong performance across a wide range of multimodal benchmarks, matching or outperforming frontier open-source and some closed-source VLMs.
</details>
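<p>The cluster-then-sample selection step described above can be approximated with a few lines of scikit-learn. This is only a sketch of the idea: the random vectors stand in for SSCD image descriptors, and the cluster count and per-cluster quota are arbitrary.</p>
<pre><code class="language-python">import numpy as np
from sklearn.cluster import KMeans

def balanced_subset(embeddings, k, per_cluster):
    # Cluster image embeddings (stand-ins for SSCD descriptors) and keep an equal
    # number of samples per cluster, approximating the diversity-balancing selection.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    picked = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        picked.extend(idx[:per_cluster].tolist())
    return picked  # indices into the original pool

pool = np.random.rand(1000, 512).astype(np.float32)  # hypothetical embedding pool
print(len(balanced_subset(pool, k=20, per_cluster=10)))
</code></pre>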
<h2 id="florence-2-a-deep-dive-into-its-unified-architecture-and-multi-task-capabilities"><strong>Florence-2: A Deep Dive into its Unified Architecture and Multi-Task Capabilities</strong></h2>
<p>Florence-2 presents a significant advancement in vision foundation models, aiming to achieve a single, versatile representation capable of handling a wide spectrum of vision and vision-language tasks through a unified, prompt-based approach. Unlike previous models that often specialize in specific tasks, Florence-2 is designed to be a generalist, adept at performing tasks with simple text instructions, similar to how Large Language Models (LLMs) operate.</p>
<p><a href="https://arxiv.org/pdf/2311.06242"><img src="https://img.shields.io/badge/arXiv-2311.06242-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://huggingface.co/spaces/gokaygokay/Florence-2"><img src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg" alt="HuggingFace" /></a><br />
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/f9c1f95b-ba6a-4a55-bf52-fa043b339d27" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
At the core of Florence-2 lies a sophisticated architecture composed of two key components: an image encoder and a multi-modality encoder-decoder. The image encoder, powered by the DaViT architecture, transforms the input image into a sequence of visual token embeddings, effectively capturing the visual information. These visual embeddings are then combined with text embeddings derived from task-specific prompts. This fusion of visual and linguistic information is processed by a standard transformer-based multi-modality encoder-decoder. This component acts as the brain of the model, meticulously analyzing the combined input and generating the desired output in textual form. This unified architecture, with a single set of parameters governing various tasks, eliminates the need for task-specific modifications, leading to a streamlined and efficient model. This design philosophy mirrors the trend in the NLP community, where models with consistent underlying structures are preferred for their versatility and ease of development. Florence-2’s capabilities span a multitude of tasks, showcasing its remarkable adaptability. It excels at generating detailed image captions, capturing the essence of an image through rich textual descriptions. Its prowess extends to visual grounding, accurately pinpointing specific objects or regions within an image based on textual phrases. Florence-2 also demonstrates impressive performance in open-vocabulary object detection, identifying objects by their names, even if those objects were not part of its training data. This capability highlights the model’s ability to generalize its knowledge and understand novel visual concepts. Furthermore, Florence-2 tackles dense region captioning, providing detailed descriptions for multiple regions within an image, and even performs optical character recognition (OCR), extracting text from images. This broad range of capabilities makes Florence-2 a powerful tool for numerous applications, pushing the boundaries of multimodal understanding in AI.
</details>
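<p>A toy stand-in for the prompt-based, single-encoder-decoder pattern described above: visual token embeddings (here random tensors standing in for DaViT outputs) are concatenated with the embedded task prompt and passed through one shared sequence-to-sequence transformer that emits text tokens. Dimensions and vocabulary size are placeholders, not Florence-2’s configuration.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class PromptedSeq2Seq(nn.Module):
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.core = nn.Transformer(d_model=dim, batch_first=True)  # shared encoder-decoder
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens, prompt_ids, target_ids):
        # Concatenate visual tokens with the embedded task prompt as the encoder input.
        encoder_in = torch.cat([visual_tokens, self.embed(prompt_ids)], dim=1)
        decoded = self.core(encoder_in, self.embed(target_ids))
        return self.lm_head(decoded)

model = PromptedSeq2Seq()
logits = model(torch.randn(1, 49, 512), torch.randint(0, 32000, (1, 8)), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, 16, 32000)
</code></pre>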
<h2 id="multiinstruct-improving-multi-modal-zero-shot-learning-via-instruction-tuning"><strong>MULTIINSTRUCT: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning</strong></h2>
<p>MULTIINSTRUCT leverages the OFA model as its foundation, employing a Transformer-based sequence-to-sequence architecture and instruction tuning techniques on a diverse dataset, effectively aligning text and image tokens within a unified space for enhanced multi-modal zero-shot learning.</p>
<a href="https://arxiv.org/abs/2212.10773"><img src="https://img.shields.io/badge/arXiv-2212.10773-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/vt-nlp/multiinstruct"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Zhiyang Xu, Ying Shen, Lifu Huang
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/bedfc8b1-7aff-44af-b605-4470ad030bdf" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MULTIINSTRUCT</strong>: introduces a novel approach to enhance multi-modal zero-shot learning by leveraging instruction tuning, built upon the pre-trained <strong>OFA</strong> unified sequence-to-sequence model as its core multi-modal backbone. This model adopts a Transformer-based sequence-to-sequence architecture that efficiently encodes a mix of instructions, text, images, and bounding boxes within a unified token space. Such a design enables MULTIINSTRUCT to process and interpret a wide range of input types, including optional images, through a comprehensive encoder-decoder framework. The encoder component is dedicated to processing the diverse inputs and instructions, while the decoder is tasked with generating the corresponding outputs. At the heart of MULTIINSTRUCT’s training methodology is the innovative use of the model-specific MULTIINSTRUCT dataset, alongside instruction tuning techniques that incorporate instances from multiple tasks. This approach involves a combination of random shuffling and sampling of instruction templates for batch training, significantly enriching the learning process. Furthermore, the model explores advanced transfer learning strategies through Mixed Instruction Tuning and Sequential Instruction Tuning, utilizing the NATURAL INSTRUCTIONS dataset. This strategy not only enhances the model’s adaptability across a wide spectrum of multi-modal tasks but also boosts its performance in zero-shot learning scenarios. The alignment techniques employed by MULTIINSTRUCT, such as byte-pair encoding and VQ-GAN, play a crucial role in aligning text and image tokens within a unified vocabulary. This seamless integration allows the model to effectively process and interpret various types of inputs and outputs. The use of a unified sequence-to-sequence architecture facilitates a deeper integration and alignment of vision and language modalities, underscoring the model’s innovative approach to bridging the gap between different types of data. The datasets used for training and fine-tuning, namely MULTIINSTRUCT and NATURAL INSTRUCTIONS, are specifically chosen to bolster the model’s capabilities in handling multi-modal tasks and instructions, showcasing its versatility and effectiveness in enhancing multi-modal zero-shot learning.
</details>
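<p>One way to picture the “unified token space” idea above is how continuous bounding boxes can be quantized into a fixed set of location bins so they live in the same vocabulary as text (BPE) and image (VQ-GAN) tokens. The bin count and the token naming below are illustrative, not the exact OFA/MULTIINSTRUCT configuration.</p>
<pre><code class="language-python">def box_to_location_tokens(box, image_w, image_h, num_bins=1000):
    # Quantize (x0, y0, x1, y1) pixel coordinates into discrete location tokens.
    x0, y0, x1, y1 = box
    def bin_of(value, size):
        return min(num_bins - 1, int(value / size * num_bins))
    return [f"loc_{bin_of(x0, image_w)}", f"loc_{bin_of(y0, image_h)}",
            f"loc_{bin_of(x1, image_w)}", f"loc_{bin_of(y1, image_h)}"]

# Example: a box in a 512x384 image becomes four vocabulary entries.
print(box_to_location_tokens((48, 30, 256, 200), image_w=512, image_h=384))
</code></pre>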
<h2 id="mousi-poly-visual-expert-vision-language-models"><strong>MouSi:
|
||
Poly-Visual-Expert Vision-Language Models</strong></h2>
|
||
<p>MouSi pushes the boundaries of VLMs by incorporating multiple visual
|
||
experts like CLIP and SAM, utilizing a poly-expert fusion network to
|
||
combine their outputs and interface with powerful LLMs like Vicuna,
|
||
thereby enabling a more comprehensive understanding and processing of
|
||
visual information.</p>
|
||
<a href="https://arxiv.org/abs/2401.17221"><img
|
||
src="https://img.shields.io/badge/arXiv-2401.17221-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a href="https://github.com/fudannlplab/mousi"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song,
|
||
Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang
|
||
Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang
|
||
Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang
|
||
Jiang
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/7e09c9d8-4c18-4970-9a24-b5e538285a72" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>MouSi</strong>: Represents an innovative approach to
|
||
Vision-Language Models (VLMs) by integrating multiple visual experts
|
||
into a unified architecture, aiming to surpass the limitations inherent
|
||
to models reliant on a singular visual component. This architecture
|
||
leverages a poly-expert fusion network, which incorporates outputs from
|
||
varied visual experts, such as CLIP for image-text matching and SAM for
|
||
image segmentation. This network facilitates an efficient interface with
|
||
pre-trained Large Language Models (LLMs), notably utilizing a model like
|
||
Vicuna v1.5. MouSi distinguishes itself by employing a multi-expert
|
||
visual encoder that selects relevant experts from a pool, and it
|
||
features two types of <strong>poly-expert fusion networks: a projection
|
||
fusion method and a Q-Former fusion method.</strong> The training
|
||
methodology of MouSi is characterized by a two-phase approach.
|
||
Initially, during the pre-training phase, both the text-only LLM and the
|
||
multi-expert encoder are kept static, with the training focus squarely
|
||
on the poly-visual fusion network. Subsequently, in the fine-tuning
|
||
phase, the LLM is activated for training in conjunction with the
|
||
poly-visual fusion network, using high-quality supervised datasets. This
|
||
methodology ensures that MouSi benefits from robust pre-existing
|
||
language models while simultaneously enhancing its capability to process
|
||
and integrate complex visual information. For alignment and fusion of
|
||
the multimodal inputs, MouSi employs its poly-expert fusion network to
|
||
amalgamate the outputs from the various visual experts, aligning them
|
||
with the vision input tokens. This alignment is critical for encoding
|
||
vision and text cohesively, a process facilitated by either the
|
||
projection fusion method or the more complex Q-Former fusion method.
|
||
These methods allow for the effective compression of multi-channel
|
||
visual information into a format that can be efficiently processed
|
||
alongside textual data. The datasets used in MouSi’s training regimen
|
||
include LCS-558K and the LAION-CC-SBU collection for pre-training, aimed
|
||
at aligning text and image representation spaces, and diverse,
|
||
high-quality SFT datasets for fine-tuning, enhancing the model’s
|
||
performance across a broad spectrum of multimodal tasks.
|
||
</details>
|
||
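<p>A minimal sketch of the projection-fusion variant mentioned above: each visual expert gets its own small projector into the LLM embedding width, and the projected token sequences are concatenated before being handed to the language model. All dimensions are illustrative, and the Q-Former variant is not shown.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One two-layer MLP projector per visual expert.
        self.projectors = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
             for d in expert_dims]
        )

    def forward(self, expert_tokens):
        # expert_tokens: list of (batch, tokens_i, dim_i), e.g. CLIP patches and SAM features.
        projected = [proj(x) for proj, x in zip(self.projectors, expert_tokens)]
        return torch.cat(projected, dim=1)  # concatenate along the token axis

fusion = ProjectionFusion([1024, 256], llm_dim=4096)
print(fusion([torch.randn(1, 576, 1024), torch.randn(1, 64, 256)]).shape)  # (1, 640, 4096)
</code></pre>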
<h2 id="lavin-cheap-and-quick-efficient-vision-language-instruction-tuning-for-large-language-models"><strong>LaVIN: Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models</strong></h2>
<p>LaVIN offers an efficient and cost-effective approach to vision-language instruction tuning by employing a Mixture-of-Modality Adapter (MM-Adapter), significantly reducing trainable parameters and enabling a streamlined optimization process for LLMs without extensive pre-training.</p>
<a href="https://arxiv.org/abs/2305.15023v3"><img src="https://img.shields.io/badge/arXiv-2305.15023v3-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/luogen1996/lavin"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/8afc8259-fa72-4e52-8080-a4ea12208e32" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>LaVIN</strong>: This model introduces the Mixture-of-Modality Adaptation (MMA) learning regime, a pioneering method that leverages <strong>lightweight adapters</strong> to fine-tune LLMs for vision-language (VL) instruction tasks. The core of LaVIN’s architecture is the <strong>Mixture-of-Modality Adapter (MM-Adapter)</strong>, which connects the image encoder to the LLM using minimal adaptation modules, allowing for a streamlined optimization of the multimodal LLM through a relatively small number of parameters. The training methodology of LaVIN is notably efficient, employing the MMA strategy to fine-tune only the inserted adapters, thus significantly reducing the optimized parameter count to between three and five million. This method substantially lowers both training time and storage requirements, circumventing the need for additional VL pre-training. The MM-Adapter is instrumental in facilitating the seamless transition between single- and multi-modal instructions, thereby enhancing the model’s adaptability to various VL tasks. Additionally, it employs a dynamic routing function that adjusts adaptations for input features, enabling an effective integration of vision and text embeddings. LaVIN’s performance and versatility are further demonstrated through its application on diverse datasets, including ScienceQA, Alpaca-52k, and LLaVA-158k. ScienceQA is utilized to assess the model’s multimodal question-answering capabilities, while the Alpaca-52k (text-only) and LLaVA-158k (text-image pairs) datasets are leveraged to refine and expand LaVIN’s functionality as a multimodal chatbot. This strategic use of datasets underscores LaVIN’s advanced vision-language understanding, illustrating its potential to significantly contribute to the field of multimodal learning and interaction.
</details>
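<p>A rough sketch of the adapter-with-routing idea described above: two lightweight bottleneck adapters (one tuned for text-only inputs, one for multimodal inputs) are mixed by a learned routing weight computed from the incoming features. The bottleneck size and the routing rule here are simplifications of LaVIN’s dynamic routing function, not its exact implementation.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    def __init__(self, dim=4096, bottleneck=8):
        super().__init__()
        def adapter():
            # Down-project, non-linearity, up-project: a few thousand parameters per layer.
            return nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.text_adapter, self.mm_adapter = adapter(), adapter()
        self.router = nn.Linear(dim, 2)

    def forward(self, hidden):
        # Routing weights from pooled features decide how the two adapters are mixed.
        w = torch.softmax(self.router(hidden.mean(dim=1)), dim=-1)  # (batch, 2)
        mixed = (w[:, 0, None, None] * self.text_adapter(hidden)
                 + w[:, 1, None, None] * self.mm_adapter(hidden))
        return hidden + mixed  # residual adapter update, frozen LLM weights untouched

out = MMAdapter()(torch.randn(2, 32, 4096))
print(out.shape)
</code></pre>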
<h2 id="nous-hermes-2-vision---mistral-7b"><strong>Nous-Hermes-2-Vision
|
||
- Mistral 7B</strong></h2>
|
||
<p>Nous-Hermes-2-Vision builds upon OpenHermes-2.5 by integrating the
|
||
efficient SigLIP-400M vision encoder and incorporating a custom dataset
|
||
with function calling capabilities, enabling it to not only understand
|
||
visual and textual information but also extract specific text from
|
||
images, advancing its functionality as a Vision-Language Action
|
||
Model.</p>
|
||
<a
|
||
href="https://huggingface.co/NousResearch/Nous-Hermes-2-Vision-Alpha"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Model" /></a><br />
|
||
This project is led by qnguyen3 and teknium.
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Nous-Hermes-2-Vision</strong>: Represents a notable advancement
|
||
in the realm of Vision-Language Models, marking its distinction through
|
||
the integration of two key enhancements that elevate its capabilities
|
||
beyond traditional models. This model is an evolution from its
|
||
predecessor, <strong>OpenHermes-2.5-Mistral-7B</strong>, and
|
||
distinguishes itself by incorporating the <strong>SigLIP-400M</strong>
|
||
for significantly improved performance and efficiency, moving away from
|
||
the standard reliance on larger 3B vision encoders. Additionally, it
|
||
introduces a custom dataset that includes function calling capabilities,
|
||
transforming it into a more dynamic Vision-Language Action Model. The
|
||
training of Nous-Hermes-2-Vision utilized a diverse dataset comprising
|
||
220K images from LVIS-INSTRUCT4V, 60K from ShareGPT4V, 150K private
|
||
function calling data, and 50K conversations from teknium’s
|
||
OpenHermes-2.5. Such a varied dataset ensures the model’s proficiency
|
||
across a broad spectrum of vision-language tasks, including object
|
||
recognition, instruction following, and conversational understanding.
|
||
The model’s innovative approach to integrating vision and language,
|
||
particularly through the use of custom datasets for function calling,
|
||
allows for encoding vision and text together in a way that supports
|
||
action-oriented tasks and automation. A key feature of
|
||
Nous-Hermes-2-Vision is its ability to interact with images to extract
|
||
valuable text information from visual content, thus enabling detailed
|
||
analyses and responses in natural language. This capability is
|
||
underscored by the model’s utilization of the SigLIP-400M, opting for a
|
||
more lightweight and efficient architecture while enhancing performance
|
||
in vision-language tasks. The model is further enriched with a custom
|
||
dataset that includes <strong>function calling</strong>, allowing for
|
||
the extraction of written information from images through specific tags,
|
||
thus broadening its application scope for developers and researchers
|
||
alike. Despite its innovative features, early usage of
|
||
Nous-Hermes-2-Vision has revealed some challenges, such as
|
||
hallucinations and spamming of EOS tokens. Recognizing these issues, the
|
||
research team, led by Quan Nguyen and Teknium, has committed to
|
||
releasing an updated version to address these problems, demonstrating
|
||
their dedication to refining the model’s capabilities.
|
||
</details>
|
||
<h2 id="tinygpt-v-efficient-multimodal-large-language-model-via-small-backbones"><strong>TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones</strong></h2>
<p>TinyGPT-V prioritizes efficiency in multimodal large language models by combining a compact EVA-ViT visual encoder with linear projection layers and the powerful Phi-2 language model, achieving robust performance in vision-language tasks despite its smaller size.</p>
<a href="https://arxiv.org/abs/2312.16862v1"><img src="https://img.shields.io/badge/arXiv-2312.16862v1-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/DLYuanGod/TinyGPT-V"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a href="https://huggingface.co/spaces/llizhx/TinyGPT-V"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue" alt="Gradio" /></a><br />
Zhengqing Yuan, Zhaoxu Li, Lichao Sun
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/3e7c93bc-7963-4c2e-b207-226a03d152ca" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>TinyGPT-V</strong>: introduces a compact yet powerful architecture tailored for efficient multimodal large language model applications, leveraging small backbones for streamlined processing. This model integrates a visual encoder, specifically the EVA Vision Transformer (ViT), with <strong>linear projection layers</strong> and the Phi-2 language model, constituting its core components. The visual encoder remains frozen during training, with the image resolution adjusted across the various stages to enhance image understanding. The <strong>linear projection layers</strong>, particularly with the incorporation of the <strong>Q-Former layer</strong> from BLIP-2, aim to efficiently embed visual features into the language model, reducing the number of parameters needing training. The Phi-2 large language model backbone, a 2.7 billion-parameter model, excels in reasoning and language comprehension, effectively handling vision-language operations including spatial location tasks through textual bounding box depictions. The training of TinyGPT-V unfolds across four stages: warm-up, pre-training, instruction fine-tuning, and multi-task learning. Each stage is meticulously designed to progressively enhance the model’s capabilities in understanding and generating language based on visual inputs, with a special emphasis on human-like learning and conversation abilities in later stages. The use of datasets such as LAION, CC3M, SBU, and more, across these stages, supports the model’s development in vision-language understanding, generation, and task execution like visual question answering and image captioning. A noteworthy aspect of TinyGPT-V’s architecture is the implementation of normalization techniques and LoRA (Low-Rank Adaptation) to stabilize training and optimize the model’s performance across different modalities. Addressing challenges like NaN or INF values in multimodal data computation, these mechanisms enhance training stability and efficiency. Furthermore, the model employs a multi-task instruction template to manage task ambiguity, utilizing MiniGPT-v2 tokens for task-specific instructions, facilitating precise and accurate task execution.
</details>
<h2 id="covlm-composing-visual-entities-and-relationships-in-large-language-models-via-communicative-decoding"><strong>CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding</strong></h2>
<p>CoVLM distinguishes itself by using novel communication tokens to enable dynamic interaction between its CLIP ViT-L image encoder, YOLOX detection network, and Pythia language model, facilitating sophisticated communication for superior compositional reasoning in vision-language tasks.</p>
<a href="https://arxiv.org/abs/2311.03354v1"><img src="https://img.shields.io/badge/arXiv-2311.03354v1-b31b1b.svg?style=flat-square" alt="arXiv" /></a><br />
Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/80e807cb-c2cf-491a-a3b4-1223afde1981" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>CoVLM</strong>: This model is distinct in its approach, employing a novel set of <strong>communication tokens</strong> that facilitate dynamic interaction between a vision encoder, detection network, and a language model (LLM). The architecture of CoVLM integrates a CLIP ViT-L image encoder and a YOLOX detection network, alongside a pre-trained Pythia model for language processing. These components work in tandem to guide the LLM in composing visual entities and relationships within the textual context, enhancing the model’s ability to dynamically communicate with the vision encoder and detection network. CoVLM is pre-trained on a diverse and extensive image-text dataset comprising 97 million image-text pairs, drawn from a variety of sources. This extensive dataset supports the model’s grounding pipeline, which is crucial for associating text spans with their corresponding visual entities in images. The model utilizes special communication tokens for facilitating iterative communication between its vision and language components, enabling a sophisticated form of top-down and bottom-up communication. This communication is key to achieving high performance in vision-language tasks, as it allows the model to seamlessly integrate and interact between language tokens and visual embeddings. The datasets employed for pre-training, such as COCO, CC3M, CC12M, Visual Genome, SBU, and LAION400M, are meticulously selected to enhance the model’s ability to ground image-text pairs effectively. This strategic choice is aimed at facilitating the association of textual descriptions with their corresponding visual entities, thereby improving the model’s overall performance across a range of multimodal tasks. CoVLM’s innovative approach to integrating visual detection networks with LLMs enables a new level of compositional reasoning, setting it apart from previous vision-language models.
</details>
<h2 id="glamm-pixel-grounding-large-multimodal-model"><strong>GLaMM:
|
||
Pixel Grounding Large Multimodal Model</strong></h2>
|
||
<p>GLaMM excels in pixel-level grounding by utilizing a five-component
|
||
architecture encompassing global and regional image encoders, an LLM, a
|
||
grounding image encoder, and a pixel decoder, allowing for comprehensive
|
||
visual understanding and precise object localization within images.</p>
|
||
<a href="https://arxiv.org/abs/2311.03356"><img
|
||
src="https://img.shields.io/badge/arXiv-2311.03356-b31b1b.svg?style=flat-square"
|
||
alt="arXiv" /></a> <a
|
||
href="https://github.com/mbzuai-oryx/groundingLMM"><img
|
||
src="https://badges.aleen42.com/src/github.svg"
|
||
alt="GitHub" /></a><br />
|
||
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman
|
||
Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Erix Xing,
|
||
Ming-Hsuan Yang, Fahad S. Khan<br />
|
||
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/ccb22206-6a48-4b77-8cc1-094fe86d72fd" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>GLaMM</strong>: At its core, GLaMM comprises five essential
|
||
components: the <strong>Global Image Encoder, Region Encoder, Language
|
||
Model (LLM), Grounding Image Encoder, and Pixel Decoder</strong>. This
|
||
architecture is designed to facilitate a wide range of interactions with
|
||
visual content, from scene-level understanding through the Global Image
|
||
Encoder, to detailed region-level interpretations via the Region
|
||
Encoder, and down to precise pixel-level object grounding with the
|
||
Grounding Image Encoder. The Pixel Decoder component further enriches
|
||
the model’s capabilities by generating <strong>segmentation
|
||
masks</strong>, enabling GLaMM to respond to both textual and visual
|
||
prompts with high fidelity. The training methodology of GLaMM involves a
|
||
dual-pathway approach, encompassing both automated and manual data
|
||
annotation pipelines to create the Grounding-anything Dataset (GranD).
|
||
GranD is pivotal for the model’s training, especially for its Grounded
|
||
Conversation Generation (GCG) task, offering a rich set of 7.5 million
|
||
unique concepts grounded in 810 million regions, complete with
|
||
segmentation masks. This dataset not only supports the pretraining and
|
||
fine-tuning phases of GLaMM but also underlines its unique ability to
|
||
generate grounded conversations that are contextually relevant to the
|
||
visual stimuli. Alignment techniques within GLaMM utilize a
|
||
vision-to-language (V-L) projection layer, facilitating the mapping of
|
||
image features into the language space, thereby ensuring effective
|
||
text-image alignment. Furthermore, the model employs a
|
||
language-to-prompt (L-P) projection layer, transforming text embeddings
|
||
related to segmentation into the decoder space. This dual-projection
|
||
system allows for an integrated encoding of vision and text, bolstering
|
||
GLaMM’s capacity for pixel-level grounding and positioning it as a
|
||
significant advancement in the field of multimodal interactions.
|
||
</details>
|
||
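<p>A minimal sketch of the dual-projection idea described above: a vision-to-language (V-L) projector maps image features into the LLM token space, while a language-to-prompt (L-P) projector maps the LLM’s segmentation-related hidden states into the pixel decoder’s prompt space. All widths below are placeholders, not GLaMM’s actual configuration.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class DualProjection(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, prompt_dim=256):
        super().__init__()
        self.v_to_l = nn.Linear(vision_dim, llm_dim)   # V-L: image features into the language space
        self.l_to_p = nn.Linear(llm_dim, prompt_dim)   # L-P: LLM hidden states into decoder prompts

    def forward(self, image_feats, seg_hidden_states):
        visual_tokens = self.v_to_l(image_feats)       # fed to the LLM alongside the text
        mask_prompts = self.l_to_p(seg_hidden_states)  # fed to the pixel decoder to produce masks
        return visual_tokens, mask_prompts

proj = DualProjection()
v, m = proj(torch.randn(1, 576, 1024), torch.randn(1, 3, 4096))
print(v.shape, m.shape)
</code></pre>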
<h2 id="cosmo-contrastive-streamlined-multimodal-model-with-interleaved-pre-training"><strong>COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training</strong></h2>
<p>COSMO presents a streamlined multimodal framework by combining a Vision Transformer with a partitioned Large Language Model, optimizing the processing of interleaved data sequences through a combination of language modeling and contrastive loss functions.</p>
<a href="https://arxiv.org/abs/2401.00849v1"><img src="https://img.shields.io/badge/arXiv-2401.00849v1-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="http://fingerrec.github.io/cosmo"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/0c256daa-1573-4110-a665-5927ee2e293f" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>COSMO</strong>: This framework is distinctive for its architecture that merges a visual encoder, leveraging the Vision Transformer (ViT) from Open-CLIP, with a partitioned Large Language Model (LLM). The LLM is systematically divided into segments dedicated to unimodal text processing and multimodal data handling, aiming to streamline the overall processing of interleaved data sequences. The introduction of an additional contrastive loss component stands out as a strategy to improve performance across both classification and generation tasks. Training of COSMO is carried out through a unique combination of language modeling loss and contrastive loss, focusing on the efficient management of interleaved text and visual sequences. This process is optimized with the use of the AdamW optimizer, a cosine learning rate schedule, and the implementation of DeepSpeed fp16 precision, distributed across 128 NVIDIA V100 GPUs. The partitioning strategy of the LLM into dedicated components is a testament to the framework’s commitment to computational efficiency and efficacy in handling extensive data sequences. The model’s alignment techniques are notably advanced, featuring a learnable query that facilitates global attention across all tokens, alongside an additional query for <strong>Text Fusion Layers</strong>, optimizing the model’s understanding of token sets and enhancing image-text alignment through contrastive loss. <strong>The gated cross-attention layers</strong> for multimodal fusion introduce a significant reduction in learnable parameters by introducing bottlenecks in input and output feature channels. This method of lightweight fusion is pivotal in integrating visual information for precise next-token prediction. COSMO’s training leverages a diverse array of datasets including CC3M, SBU, LAION400M, DataComp1B, MMC4, WebVid, and Howto-Interlink7M. The introduction of Howto-Interlink7M, in particular, underscores the model’s innovative approach to improving video-language understanding through high-quality annotated captions, demonstrating its effectiveness across 14 diverse downstream tasks.
</details>
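<p>A sketch of a gated, bottlenecked cross-attention fusion layer in the spirit described above: text hidden states attend to visual tokens inside a reduced channel width, and a zero-initialized tanh gate scales the update so training can start from the unimodal model. The widths and head count are placeholders, not COSMO’s exact ones.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class GatedBottleneckCrossAttention(nn.Module):
    def __init__(self, dim=2048, bottleneck=256, heads=4):
        super().__init__()
        self.down_q = nn.Linear(dim, bottleneck)     # bottleneck the query channels
        self.down_kv = nn.Linear(dim, bottleneck)    # bottleneck the key/value channels
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        self.gate = nn.Parameter(torch.zeros(1))     # zero gate: starts as an identity layer

    def forward(self, text_hidden, visual_tokens):
        q = self.down_q(text_hidden)
        kv = self.down_kv(visual_tokens)
        fused, _ = self.attn(q, kv, kv)
        return text_hidden + torch.tanh(self.gate) * self.up(fused)

layer = GatedBottleneckCrossAttention()
print(layer(torch.randn(1, 20, 2048), torch.randn(1, 64, 2048)).shape)
</code></pre>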
<h2 id="firellava"><strong>FireLLaVA</strong></h2>
|
||
<p>FireLLaVA breaks new ground by combining the CodeLlama 34B Instruct
|
||
model for advanced language understanding with a CLIP-ViT-based visual
|
||
interpretation component, training on a unique dataset incorporating
|
||
bounding box labels and captions to excel in visual language
|
||
conversations.</p>
|
||
<p><a href="https://huggingface.co/fireworks-ai/FireLLaVA-13b"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Model" /></a></p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>FireLLaVA</strong>: As the first of its kind within the LLaVA
|
||
lineage, FireLLaVA integrates a dual-component architecture that
|
||
leverages the CodeLlama 34B Instruct model for nuanced language
|
||
understanding and a visual interpretation component akin to OpenAI’s
|
||
CLIP-ViT. This model is distinctive for its use of bounding box labels
|
||
and captions to generate visual language conversations, a method that
|
||
underscores its innovative approach to multi-modal training. The
|
||
training regimen for FireLLaVA is meticulously crafted, utilizing 588K
|
||
lines of visual question answering and conversation data. This dataset
|
||
amalgamates permissive original LLaVA data with newly generated data
|
||
from Fireworks.ai, demonstrating a unique approach to instruction
|
||
fine-tuning that enhances the model’s ability to comprehend and
|
||
articulate responses that bridge textual and visual inputs. The
|
||
integration of bounding box labels and captions not only serves as a
|
||
mechanism for generating training data but also facilitates the
|
||
alignment of text and image data, a crucial step in achieving coherent
|
||
multi-modal understanding. Although the specific methods employed for
|
||
alignment fusion within FireLLaVA’s architecture remain under-described,
|
||
it is inferred that embedding fusion plays a critical role in
|
||
synthesizing vision and text inputs. By drawing on original LLaVA
|
||
training materials and Fireworks.ai’s proprietary data, FireLLaVA sets a
|
||
precedent for the development of VLMs capable of navigating the
|
||
complexities of commercial applications. This model embodies a
|
||
significant advancement in the field of visual language modeling,
|
||
offering insights into the potential of OSS models to contribute to the
|
||
evolving landscape of multi-modal AI research and deployment.
|
||
</details>
|
||
<h2 id="u-llava-unifying-multi-modal-tasks-via-large-language-model"><strong>u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model</strong></h2>
<p>u-LLaVA introduces a novel projector-based architecture that unifies multi-modal tasks by connecting specialized expert models with a central Large Language Model (LLM), enabling seamless modality alignment and efficient multi-task learning through a two-stage training approach.</p>
<a href="https://arxiv.org/abs/2311.05348"><img src="https://img.shields.io/badge/arXiv-2311.05348-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/OPPOMKLab/u-LLaVA"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, Yi-Jie Huang, Yaqian Li
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/dcb6b046-fa56-4a02-9123-2ef2185c635a" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>u-LLaVA</strong>: Represents a pioneering approach in the integration of Large Language Models (LLMs) with specialized expert models to address a wide array of multi-modal tasks. This architecture is designed to leverage the strengths of LLMs as a central hub, facilitating seamless modality alignment and multi-task learning. Through a novel <strong>projector-based structure</strong> that incorporates CLIP’s Vision Transformer (ViT-L/14) and LLaMA2, u-LLaVA introduces a flexible framework capable of handling diverse modalities and tasks. The system integrates special tokens for modality and task expressions, alongside dedicated modules for segmentation, grounding, and in-painting, to enrich its multi-modal capabilities. The training methodology of u-LLaVA is executed in two distinct stages, beginning with a coarse-grained alignment to ensure the alignment of representation spaces across different modalities. This foundational step is crucial for establishing a common ground for further, more nuanced task-specific adaptations. Following this, a fine-grained alignment phase focuses on the refinement of task-specific instruction data, optimizing the model’s performance for targeted applications. This dual-stage training approach ensures that u-LLaVA can efficiently adapt to a variety of tasks with minimal additional training requirements. Central to u-LLaVA’s effectiveness is its innovative use of projector-based alignment techniques and fusion methods, which enable the integration of visual and textual representations within the LLM’s framework. By mapping hidden states and text embeddings through projectors, u-LLaVA facilitates modality fusion, leveraging the extensive knowledge embedded within LLMs for complex task solving. The datasets utilized for training, including LLaVA CC3M, Conversation-58K, Detail-23K, and others, are meticulously curated to support the model’s versatile capabilities across tasks such as image captioning, video captioning, visual question answering (VQA), referential expression comprehension (RES), semantic segmentation, and salient object detection/segmentation. This strategic selection and organization of datasets underscore u-LLaVA’s commitment to advancing multi-modal task unification through Large Language Models.
</details>
<h2 id="moe-llava-mixture-of-experts-for-large-vision-language-models"><strong>MoE-LLaVA: Mixture of Experts for Large Vision-Language Models</strong></h2>
<p>MoE-LLaVA introduces a novel approach by incorporating Mixture of Experts (MoE) within a large vision-language model, using learnable routers to selectively activate expert modules for processing specific tokens, thereby enhancing efficiency and enabling nuanced understanding of multimodal inputs.</p>
<a href="https://arxiv.org/abs/2401.15947"><img src="https://img.shields.io/badge/arXiv-2401.15947-b31b1b.svg?style=flat-square" alt="arXiv" /></a> <a href="https://github.com/PKU-YuanGroup/MoE-LLaVA"><img src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a href="https://huggingface.co/spaces/LanguageBind/MoE-LLaVA"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue" alt="Gradio" /></a><br />
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, Li Yuan
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/0e5e214b-be64-4aac-aba4-04c97970b9de" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MoE-LLaVA</strong>: Represents an innovative leap in the development of large vision-language models through the integration of <strong>Mixture of Experts (MoE)</strong> within a sophisticated architectural framework. This model is characterized by its sparse design, wherein individual tokens are directed towards a selection of experts based on <strong>learnable routers</strong>, ensuring that only the top-k experts are activated for any given token’s processing. Such an approach not only enhances the model’s efficiency but also its capability to handle diverse and complex data inputs by leveraging specialized processing paths for different types of information. At the heart of MoE-LLaVA’s architecture are several critical components, including a vision encoder, <strong>a visual projection MLP layer</strong>, <strong>word embedding layers</strong>, <strong>multi-head self-attention blocks</strong>, <strong>feed-forward neural networks</strong>, and notably, <strong>the MoE blocks</strong> themselves. These elements are seamlessly integrated through the use of layer normalization and residual connections, establishing a robust and adaptable framework capable of deep multimodal understanding. The training methodology for MoE-LLaVA is meticulously structured in three stages, each designed to gradually enhance the model’s proficiency in integrating and processing visual and textual data. This includes initial adaptation of image tokens, training of all LLM parameters excluding the vision encoder, and specialized training of the MoE layers, with the latter utilizing initialization weights from previous stages for optimal performance. Alignment techniques and fusion methods employed by MoE-LLaVA are pivotal in achieving a harmonious integration of text and image modalities. By utilizing learnable routers to dynamically allocate tokens to the most apt experts and subsequently processing these through a combination of LLM and MoE blocks, the model achieves a nuanced understanding of multimodal inputs. The datasets employed throughout the training phases—ranging from LLaVA-PT for pretraining to Hybrid-FT for multimodal instruction tuning, and LLaVA-FT for fine-tuning the MoE layers—further underscore the model’s ability to refine its understanding across a broad spectrum of multimodal tasks. This strategic deployment of diverse datasets not only facilitates a comprehensive tuning of the model’s capabilities but also underscores its potential in advancing the field of vision-language processing.
</details>
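<p>A minimal top-k MoE block of the kind described above: a learnable linear router scores each token, the top-k experts are activated, and their outputs are combined with the renormalized routing weights. A dense loop over experts is used here for clarity; real implementations dispatch tokens sparsely, and the sizes below are illustrative.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, dim=1024, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.k = k

    def forward(self, tokens):  # tokens: (batch, seq, dim)
        scores = torch.softmax(self.router(tokens), dim=-1)
        topv, topi = scores.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize the kept routing weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Weight is zero for tokens that did not route to this expert.
            weight = torch.where(topi == e, topv, torch.zeros_like(topv)).sum(-1, keepdim=True)
            out = out + weight * expert(tokens)
        return out

print(TopKMoE()(torch.randn(2, 8, 1024)).shape)
</code></pre>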
<h2
id="bliva-a-simple-multimodal-llm-for-better-handling-of-text-rich-visual-questions"><strong>BLIVA:
A Simple Multimodal LLM for Better Handling of Text-rich Visual
Questions</strong></h2>
<p>BLIVA augments the InstructBLIP model with a Visual Assistant,
incorporating encoded patch embeddings alongside learned query embeddings
to enhance the LLM’s understanding of text-rich visual contexts, thereby
excelling in handling complex visual questions.</p>
<a href="https://arxiv.org/abs/2308.09936v3"><img
src="https://img.shields.io/badge/arXiv-2308.09936v3-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/mlpc-ucsd/bliva"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Wenbo Hu, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, Zhuowen Tu
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/44c53b8a-ad35-4eca-a68b-63af32e6ccf1" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>BLIVA</strong>: Builds on InstructBLIP by adding a Visual Assistant
branch that feeds directly projected patch embeddings to the LLM alongside
the learned query embeddings, recovering visual detail that the query
decoding step alone can miss. Its core components are a vision tower that
encodes images into patch embeddings, a <strong>Q-Former</strong> that
refines query embeddings, and a <strong>projection layer</strong> that maps
both into the LLM’s embedding space. Training follows a two-stage scheme:
the patch-embedding projection layer is first pre-trained on image-text
pairs from captioning datasets, then the Q-Former and projection layer are
instruction-tuned on Visual Question Answering (VQA) data while the image
encoder and LLM stay frozen to prevent catastrophic forgetting. Fusion is
deliberately simple: the query embeddings and projected patch embeddings
are concatenated and fed straight into the LLM (see the sketch below),
which markedly improves text-image perception. Evaluation on YTTB-VQA
(YouTube Thumbnail Visual Question-Answer pairs) highlights its strength on
text-rich images and real-world applications.
</details>
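<p>A minimal sketch of the concatenation-based fusion described above.
Hidden sizes, token counts, and module names are illustrative assumptions,
not the official BLIVA code.</p>
<pre><code class="language-python">
# Hedged sketch of BLIVA-style fusion: learned query embeddings plus projected
# patch embeddings are concatenated and prepended to the text embeddings.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1408, 4096
patch_proj = nn.Linear(vision_dim, llm_dim)     # visual-assistant branch (assumed dims)
query_proj = nn.Linear(768, llm_dim)            # projects Q-Former outputs (assumed dims)

def build_llm_inputs(patch_feats, query_feats, text_embeds):
    """patch_feats: (B, N_patches, vision_dim); query_feats: (B, 32, 768);
    text_embeds: (B, T, llm_dim) from the frozen LLM's embedding table."""
    visual_assistant = patch_proj(patch_feats)   # (B, N_patches, llm_dim)
    queries = query_proj(query_feats)            # (B, 32, llm_dim)
    return torch.cat([queries, visual_assistant, text_embeds], dim=1)

inputs = build_llm_inputs(torch.randn(1, 257, vision_dim),
                          torch.randn(1, 32, 768),
                          torch.randn(1, 16, llm_dim))
print(inputs.shape)  # torch.Size([1, 305, 4096])
</code></pre>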
<h2
id="mobilevlm-a-fast-strong-and-open-vision-language-assistant-for-mobile-devices"><strong>MobileVLM:
A Fast, Strong and Open Vision Language Assistant for Mobile
Devices</strong></h2>
<p>MobileVLM offers a mobile-optimized vision-language model that combines
a CLIP ViT-L/14 visual encoder with the efficient MobileLLaMA language
model and a Lightweight Downsample Projector (LDP), enabling effective
multimodal processing and alignment within the constraints of mobile
devices.</p>
<a href="https://arxiv.org/abs/2312.16886"><img
src="https://img.shields.io/badge/arXiv-2312.16886-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/meituan-automl/mobilevlm"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu,
Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, Chunhua Shen
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/59a06109-ba49-4299-951c-d7c0c562bca3" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MobileVLM</strong>: A compact architecture for vision-language
tasks on mobile devices. It pairs a CLIP ViT-L/14 visual encoder at 336x336
resolution with MobileLLaMA, a language model optimized for edge hardware,
and connects them through a <strong>Lightweight Downsample Projector
(LDP)</strong> that aligns visual and textual features with minimal
computational overhead. Training proceeds in three phases: the language
model is pre-trained on the text-only RedPajama v1 corpus, then supervised
fine-tuned on multi-turn dialogues between humans and ChatGPT, and finally
the integrated vision-language model is trained on diverse multimodal
datasets. The LDP uses mobile-friendly operations such as depth-wise
convolution to downsample visual tokens to the language model’s input
dimension while preserving spatial information (a rough sketch follows
below), which keeps the model accurate and responsive within mobile
resource budgets.
</details>
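<p>A rough sketch of an LDP-style projector as described above: a pointwise
projection followed by a strided depth-wise convolution that downsamples the
visual token grid. Channel sizes and the exact layer order are assumptions
for illustration, not Meituan’s released LDP.</p>
<pre><code class="language-python">
# Assumed-shape sketch: 576 ViT patch tokens are projected and reduced to 144
# tokens in the LLM's embedding dimension via a stride-2 depth-wise conv.
import torch
import torch.nn as nn

class LightweightDownsampleProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=2048, grid=24):
        super().__init__()
        self.grid = grid
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))
        # depth-wise conv with stride 2 halves each spatial side (4x fewer tokens)
        self.dwconv = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2,
                                padding=1, groups=llm_dim)

    def forward(self, patch_tokens):                 # (B, grid*grid, vision_dim)
        x = self.proj(patch_tokens)                  # (B, grid*grid, llm_dim)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        x = self.dwconv(x)                           # (B, llm_dim, grid/2, grid/2)
        return x.flatten(2).transpose(1, 2)          # (B, (grid/2)**2, llm_dim)

ldp = LightweightDownsampleProjector()
print(ldp(torch.randn(1, 576, 1024)).shape)          # torch.Size([1, 144, 2048])
</code></pre>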
<h2
id="frozen-multimodal-few-shot-learning-with-frozen-language-models"><strong>FROZEN:
Multimodal Few-Shot Learning with Frozen Language Models</strong></h2>
<p>FROZEN enables multimodal few-shot learning by pairing a pre-trained,
frozen language model with a trainable vision encoder (NF-ResNet-50) that
converts images into a dynamic visual prefix, allowing the model to process
and generate language in context with visual data without altering its core
language capabilities.</p>
<a href="https://arxiv.org/abs/2106.13884"><img
src="https://img.shields.io/badge/arXiv-2106.13884-b31b1b.svg?style=flat-square"
alt="arXiv" /></a><br />
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol
Vinyals, Felix Hill
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/4156475d-e501-495e-98bb-66efdd5b03f7" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>FROZEN</strong>: Extends the few-shot abilities of an existing
language model to the multimodal setting without changing the language
model’s parameters. A trainable <strong>NF-ResNet-50</strong> vision
encoder converts each image into a short, continuous sequence of embeddings
that is prepended to the text embeddings as a dynamic visual prefix for a
frozen autoregressive Transformer LM; <strong>only the vision encoder is
updated</strong> during training. Training aligns image-caption pairs from
the Conceptual Captions dataset so the encoder learns to produce prefixes
the language model can condition on. Unlike the static text prompts of
conventional prefix tuning, the visual prefix is computed per image by
linearly mapping and reshaping the encoder output into prefix tokens (a
minimal construction is sketched below), which lets the model adapt to new
multimodal tasks through in-context learning while keeping the language
model itself untouched.
</details>
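<p>A minimal sketch of the visual-prefix construction described above. The
encoder stand-in, prefix length, and dimensions are assumptions chosen for
illustration.</p>
<pre><code class="language-python">
# FROZEN-style prefixing: the vision encoder's output is linearly mapped and
# reshaped into a few prefix embeddings that are prepended to text embeddings
# of a frozen language model.
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    def __init__(self, feat_dim=2048, lm_dim=4096, n_prefix=2):
        super().__init__()
        self.n_prefix = n_prefix
        self.to_prefix = nn.Linear(feat_dim, n_prefix * lm_dim)  # trainable bridge

    def forward(self, image_features):              # (B, feat_dim) pooled image features
        b = image_features.shape[0]
        return self.to_prefix(image_features).view(b, self.n_prefix, -1)

prefix = VisualPrefix()(torch.randn(2, 2048))        # (2, 2, 4096)
text_embeds = torch.randn(2, 10, 4096)               # from the frozen LM's embedding table
lm_inputs = torch.cat([prefix, text_embeds], dim=1)  # frozen LM attends to the visual prefix
print(lm_inputs.shape)                                # torch.Size([2, 12, 4096])
</code></pre>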
<h2
id="flamingo-a-visual-language-model-for-few-shot-learning"><strong>Flamingo:
a Visual Language Model for Few-Shot Learning</strong></h2>
<p>Flamingo pioneers a Perceiver-based VLM architecture that utilizes a
Perceiver Resampler and gated cross-attention dense layers, enabling it to
process interleaved text and visual sequences for impressive few-shot
learning performance across a variety of multimodal tasks.</p>
<a href="https://arxiv.org/abs/2204.14198v2"><img
src="https://img.shields.io/badge/arXiv-2204.14198v2-b31b1b.svg?style=flat-square"
alt="arXiv" /></a><br />
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr,
Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds,
Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina
Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew
Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo
Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/b46ebf3e-67fc-401e-a6ea-6f4797da372d" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>Flamingo</strong>: A visual language model built for few-shot
learning that consumes text tokens interleaved with images or videos and
generates text. A <strong>Perceiver Resampler</strong> compresses large,
high-resolution visual feature maps into a small, fixed number of visual
tokens, and <strong>gated cross-attention dense (GATED XATTN-DENSE)
layers</strong> inserted into the language model condition text generation
on those tokens; a tanh-gating mechanism merges each layer’s output with
the residual stream, so the model starts from the unmodified language-model
behaviour (a minimal gated block is sketched below). An image-causal
masking scheme lets each text token attend only to the visual tokens of the
image that immediately precedes it in the sequence. Training minimizes a
weighted sum of per-dataset expected negative log-likelihoods of text given
visual inputs, with gradient accumulation across datasets: MultiModal
MassiveWeb (M3W) supplies interleaved image-text documents, ALIGN supplies
image-text pairs, Long Text &amp; Image Pairs (LTIP) supplies high-quality
image-text pairs, and Video &amp; Text Pairs (VTP) supplies video-text
pairs. Together these give Flamingo its strong few-shot performance across
visual language tasks.
</details>
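<p>A minimal sketch of a tanh-gated cross-attention block in the spirit of
the GATED XATTN-DENSE layers described above; head counts and dimensions
are assumptions, and the real layers sit interleaved inside a frozen
language model.</p>
<pre><code class="language-python">
# Gated cross-attention: gates are initialized to zero, so tanh(gate) = 0 and
# the block is a no-op at the start of training, preserving the frozen LM.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text, visual_tokens):
        # text: (B, T, dim) language states; visual_tokens: (B, V, dim) from the resampler
        attn_out, _ = self.attn(text, visual_tokens, visual_tokens)
        text = text + torch.tanh(self.attn_gate) * attn_out   # gated residual
        text = text + torch.tanh(self.ff_gate) * self.ff(text)
        return text

block = GatedCrossAttentionBlock()
out = block(torch.randn(1, 32, 1024), torch.randn(1, 64, 1024))
print(out.shape)  # torch.Size([1, 32, 1024])
</code></pre>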
<h2
id="openflamingo-an-open-source-framework-for-training-large-autoregressive-vision-language-models"><strong>OpenFlamingo:
An Open-Source Framework for Training Large Autoregressive Vision-Language
Models</strong></h2>
<p>OpenFlamingo, an open-source adaptation of DeepMind’s Flamingo, combines
a CLIP ViT-L/14 visual encoder with a 7B parameter language model, training
only the connecting cross-attention modules for multimodal fusion during
decoding, and achieves strong performance on a range of vision-language
tasks.</p>
<a href="https://arxiv.org/abs/2308.01390"><img
src="https://img.shields.io/badge/arXiv-2308.01390-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/mlfoundations/open_flamingo"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong
Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia
Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman,
Ludwig Schmidt
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>OpenFlamingo</strong>: An open-source reimplementation of
DeepMind’s Flamingo. It pairs a CLIP ViT-L/14 visual encoder with the
7-billion-parameter MPT-7B language model and inserts
<strong>cross-attention modules</strong> into every fourth decoder block;
the vision encoder and language model stay frozen while only these
connecting modules are trained, letting the decoder attend to visual
information as it generates text. Training uses interleaved image-text
sequences drawn from LAION-2B and the multimodal version of C4, run with
DistributedDataParallel across 64 A100 80GB GPUs in automatic BF16 mixed
precision. The resulting model performs well on captioning, visual question
answering, and image classification, tasks that require conditioning
nuanced text generation on visual cues, and sets a strong open-source
baseline for Flamingo-style models.
</details>
<h2 id="idefics"><strong>IDEFICS</strong></h2>
<p>IDEFICS, an 80B parameter vision-language model inspired by Flamingo,
processes interleaved image and text sequences, utilizing a Flamingo-style
architecture built on pretrained vision and language backbones to achieve
robust multimodal understanding, trained on a diverse range of web-based
datasets, including the specialized OBELICS dataset.</p>
<a href="https://huggingface.co/HuggingFaceM4/idefics-80b"><img
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
alt="Model" /></a>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>IDEFICS</strong>: An 80-billion-parameter open vision-language
model that reproduces Flamingo’s recipe: it accepts interleaved sequences
of images and text and generates text, building on pretrained vision and
language backbones connected through <strong>cross-attention
mechanisms</strong> in the style of Flamingo. Training stability issues
such as loss spikes were handled with rollbacks, precise learning-rate
adjustments, and an auxiliary z-loss that keeps the output logits
normalized (a small sketch of such a z-loss is given below). The model is
trained on OBELICS, a curated collection of interleaved image-text web
documents that provides long text contexts and a diverse range of document
types, alongside other web-scraped image-text datasets, giving it broad
proficiency across multimodal tasks.
</details>
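<p>A small sketch of an auxiliary z-loss of the kind mentioned above: it
penalizes large log-partition values so the logits stay normalized. The
coefficient and vocabulary size are assumptions.</p>
<pre><code class="language-python">
# z-loss = coeff * mean(logsumexp(logits)^2), added to the usual LM loss.
import torch

def z_loss(logits, coeff=1e-4):
    """logits: (batch, seq, vocab)."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()

logits = torch.randn(2, 8, 32000)
targets = torch.randint(0, 32000, (16,))
total_loss = torch.nn.functional.cross_entropy(logits.view(-1, 32000), targets) + z_loss(logits)
print(total_loss.item())
</code></pre>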
<h2
id="pali-a-jointly-scaled-multilingual-language-image-model"><strong>PaLI:
A Jointly-Scaled Multilingual Language-Image Model</strong></h2>
<p>PaLI distinguishes itself as a jointly-scaled multilingual
language-image model that utilizes a unified interface to process both
unimodal and multimodal tasks, integrating a powerful ViT-e visual encoder
with an mT5-based text encoder-decoder Transformer for comprehensive
language and vision understanding.</p>
<a href="https://arxiv.org/abs/2209.06794"><img
src="https://img.shields.io/badge/arXiv-2209.06794-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/google-research/big_vision"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul
Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin,
Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran
Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/2565afb0-901c-4438-9488-c73a86261aa5" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>PaLI</strong>: Handles unimodal (language or vision) and multimodal
tasks through a single interface that takes images and text as input and
produces text as output. The architecture couples an mT5-based text
encoder-decoder Transformer with visual tokens produced by
<strong>ViT-e</strong>, a Vision Transformer scaled to roughly 4 billion
parameters, and both sides are initialized from pretrained unimodal
checkpoints to keep training efficient. Visual tokens are fused into the
text encoder-decoder through <strong>a cross-attention mechanism</strong>,
and the task-agnostic interface lets the same model switch between
captioning, OCR, and visual question answering without architectural
changes. Training mixes a broad set of pre-training tasks over WebLI, an
image-language corpus of 10 billion images and texts in over 100 languages,
in a two-phase regime that includes high-resolution training for the
largest variant, PaLI-17B. Additional data such as Conceptual Captions, OCR
annotations from WebLI, VQ2A-CC3M, and Open Images round out its
multilingual captioning, OCR, and VQA capabilities.
</details>
<h2
id="pali-3-vision-language-models-smaller-faster-stronger"><strong>PaLI-3
Vision Language Models: Smaller, Faster, Stronger</strong></h2>
<p>PaLI-3 presents a powerful yet efficient vision-language model that
integrates a contrastively pretrained 2B SigLIP vision model with a 3B UL2
Transformer, achieving impressive performance in tasks like captioning and
visual question answering through a multi-stage training process that
emphasizes scalability and robustness.</p>
<a href="https://arxiv.org/abs/2310.09199"><img
src="https://img.shields.io/badge/arXiv-2310.09199-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/kyegomez/PALI3"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul
Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin,
Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran
Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/92d34b30-b13b-44ed-90b5-3c8568a9b634" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>PaLI-3</strong>: A smaller, faster vision-language model that pairs
a contrastively pretrained 2B <strong>SigLIP vision model</strong> with a
3B UL2 encoder-decoder Transformer. Training proceeds in stages:
<strong>contrastive pretraining of the image encoder</strong> on web-scale
image-text data (a toy version of a SigLIP-style pairwise loss is sketched
below), multimodal training of the combined model on a mixture of tasks
emphasizing visually-situated text and object detection, a resolution
increase stage that fine-tunes on high-resolution inputs, and finally task
specialization on individual benchmarks. At inference, visual tokens from
the contrastively pretrained ViT are fed alongside embedded text tokens
into the UL2 encoder-decoder to generate outputs for captioning and visual
question answering (VQA). Datasets include web-scale image-text data,
RefCOCO, WebLI, CC3M-35L, and several VQA corpora, which together give the
model strong understanding of visually-situated text and object
localization.
</details>
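<p>A toy sketch of a SigLIP-style pairwise sigmoid loss for image-text
pretraining, as referenced above. The temperature and bias initialization
values are assumptions, and this averages over all pairs rather than
matching the paper’s exact normalization.</p>
<pre><code class="language-python">
# Every image-text pair in the batch gets a binary label: +1 on the diagonal
# (matching pairs), -1 elsewhere; the loss is a log-sigmoid over scaled logits.
import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t=torch.tensor(10.0), b=torch.tensor(-10.0)):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of matching pairs."""
    logits = t * img_emb @ txt_emb.T + b                # (B, B) pairwise similarities
    labels = 2 * torch.eye(logits.shape[0]) - 1         # +1 positives, -1 negatives
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
print(siglip_style_loss(img, txt).item())
</code></pre>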
<h2 id="palm-e-an-embodied-multimodal-language-model"><strong>PaLM-E: An
Embodied Multimodal Language Model</strong></h2>
<p>PaLM-E innovates by embedding continuous sensory data, including images
and sensor readings, into the language representation space of a
pre-trained PaLM model, enabling it to process and generate text that
reflects embodied reasoning and understanding of the physical world.</p>
<a href="https://arxiv.org/abs/2303.03378"><img
src="https://img.shields.io/badge/arXiv-2303.03378-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://palm-e.github.io"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha
Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe
Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth,
Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus
Greff, Andy Zeng, Igor Mordatch, Pete Florence
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/67e5bbc7-1800-46e8-8ef1-b3b72a901a12" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>PaLM-E</strong>: An embodied multimodal language model that injects
continuous observations, from images and state estimates to other sensor
modalities, into the word-embedding space of a pre-trained PaLM, a
decoder-only LLM that then generates text autoregressively conditioned on
these inputs. Modality-specific encoders, including pre-trained Vision
Transformers (ViTs) for images, map each observation to a <strong>sequence
of vectors</strong> with the same dimensionality as the language
embeddings, so observations and text can be interleaved into multimodal
sentences that pass through the ordinary <strong>self-attention
layers</strong> just like text tokens (a minimal interleaving sketch
follows below). Training is end-to-end with a cross-entropy loss on
non-prefix tokens, and pretrained components can be frozen or co-trained
depending on the setup. The training mixture combines large-scale
vision-and-language data with specialized robotics task datasets, enabling
cross-domain transfer to embodied reasoning and robot planning as well as
general vision-language tasks.
</details>
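<p>A minimal sketch of the multimodal-sentence construction described
above: encoder outputs are spliced into the text-embedding sequence. The
segment convention, encoder stand-ins, and dimensions are illustrative
assumptions.</p>
<pre><code class="language-python">
# PaLM-E-style interleaving: text embeddings and encoded sensor observations
# form one sequence that the decoder-only LM processes with self-attention.
import torch
import torch.nn as nn

lm_dim = 4096
image_encoder = nn.Linear(1024, lm_dim)      # stand-in for a ViT + projector
state_encoder = nn.Linear(7, lm_dim)         # stand-in for a robot-state MLP

def multimodal_sentence(segments):
    """segments: list of ('text', (T, lm_dim)) or ('image', (N, 1024)) or ('state', (1, 7))."""
    parts = []
    for kind, value in segments:
        if kind == 'text':
            parts.append(value)                          # already LM embeddings
        elif kind == 'image':
            parts.append(image_encoder(value))           # (N, lm_dim) visual vectors
        elif kind == 'state':
            parts.append(state_encoder(value))           # (1, lm_dim) state vector
    return torch.cat(parts, dim=0).unsqueeze(0)          # (1, total_len, lm_dim)

sentence = multimodal_sentence([
    ('text', torch.randn(4, lm_dim)),    # e.g. "Q: What is in "
    ('image', torch.randn(256, 1024)),   # image observation tokens
    ('text', torch.randn(3, lm_dim)),    # e.g. "? A:"
])
print(sentence.shape)                     # torch.Size([1, 263, 4096])
</code></pre>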
<h2
id="minigpt-4-enhancing-vision-language-understanding-with-advanced-large-language-models"><strong>MiniGPT-4:
Enhancing Vision-Language Understanding with Advanced Large Language
Models</strong></h2>
<p>MiniGPT-4 seamlessly blends visual and language processing by connecting
a pretrained Vision Transformer and Q-Former to a frozen Vicuna LLM using a
single linear projection layer, achieving impressive vision-language
understanding through a two-stage training approach focused on efficient
alignment and enhanced generation quality.</p>
<a href="https://arxiv.org/abs/2304.10592v2"><img
src="https://img.shields.io/badge/arXiv-2304.10592v2-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/vision-cair/minigpt-4"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/0e5ff945-1271-4189-8dd9-b0abd88eacc1" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MiniGPT-4</strong>: Connects a frozen visual encoder, a pretrained
Vision Transformer (ViT) plus <strong>Q-Former</strong>, to the frozen
Vicuna LLM through <strong>a single linear projection layer</strong>, so
only that layer needs to be trained (a minimal version is sketched below).
Training has two stages: roughly 20,000 steps at batch size 256 on 4 A100
GPUs over Conceptual Captions, SBU, and LAION establish the basic
vision-language alignment; a second fine-tuning stage on a curated set of
3,500 detailed image descriptions then makes the generated text noticeably
more detailed, reliable, and natural. This staged approach balances
efficient alignment of visual features with the LLM against high-quality,
detailed text generation, which is what gives MiniGPT-4 its strong
vision-language understanding despite the tiny trainable footprint.
</details>
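<p>A minimal sketch of the single-linear-layer bridge described above.
Hidden sizes follow common BLIP-2/Vicuna configurations and are assumptions
here, not values from the released code.</p>
<pre><code class="language-python">
# Only the linear projection is trainable; the ViT + Q-Former and the Vicuna
# LLM stay frozen, and the projected visual tokens are spliced into the prompt.
import torch
import torch.nn as nn

qformer_dim, vicuna_dim = 768, 5120
proj = nn.Linear(qformer_dim, vicuna_dim)      # the only trainable module

with torch.no_grad():                           # frozen ViT + Q-Former output (assumed shape)
    query_output = torch.randn(1, 32, qformer_dim)   # 32 learned query tokens

visual_tokens = proj(query_output)              # (1, 32, 5120), consumed by frozen Vicuna
prompt_embeds = torch.randn(1, 20, vicuna_dim)  # embedded prompt text around the image slot
llm_inputs = torch.cat([prompt_embeds[:, :10], visual_tokens, prompt_embeds[:, 10:]], dim=1)
print(llm_inputs.shape)                         # torch.Size([1, 52, 5120])
</code></pre>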
<h2
id="minigpt-v2-large-language-model-as-a-unified-interface-for-vision-language-multi-task-learning"><strong>MiniGPT-v2:
large language model as a unified interface for vision-language multi-task
learning</strong></h2>
<p>MiniGPT-v2 acts as a unified interface for vision-language multi-task
learning by connecting a static Visual Transformer to a 7B parameter
LLaMA-2-chat language model through a linear projection layer, efficiently
processing high-resolution images and excelling in various tasks through a
three-stage training approach.</p>
<a href="https://arxiv.org/abs/2310.09478v3"><img
src="https://img.shields.io/badge/arXiv-2310.09478v3-b31b1b.svg?style=flat-square"
alt="arXiv" /></a><br />
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang,
Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/2354442a-0e96-4010-8b4f-8bc3d666427e" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MiniGPT-v2</strong>: A unified interface for vision-language
multi-task learning that connects a Visual Transformer (ViT) backbone, kept
frozen during training, to the 7-billion-parameter LLaMA-2-chat language
model through <strong>a linear projection layer</strong>. To keep sequences
short at the 448x448 input resolution, every four neighboring visual tokens
are merged into one before being projected into the LLM’s feature space (a
small sketch follows below). Training runs in three stages: broad
vision-language learning on a mix of weakly-labeled and fine-grained data,
a shift toward fine-grained data for task-specific improvement, and finally
multi-modal instruction and language datasets to sharpen instruction
following. Task-specific identifier tokens used during training reduce
ambiguity between tasks. The data mix spans LAION, CC3M, SBU, GRIT-20M,
COCO caption, and several others matched to each stage, giving the model
broad coverage of vision-language contexts within a single interface.
</details>
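<p>A small sketch of the token-merging step described above: concatenate
every four neighboring visual tokens, then project into the LLM space.
Dimensions are assumptions.</p>
<pre><code class="language-python">
# Merging 4 neighbors cuts the visual sequence length by 4x before projection.
import torch
import torch.nn as nn

vit_dim, llm_dim = 1408, 4096
merge_proj = nn.Linear(4 * vit_dim, llm_dim)

def merge_visual_tokens(vit_tokens):
    """vit_tokens: (B, N, vit_dim) with N divisible by 4 -> (B, N/4, llm_dim)."""
    b, n, c = vit_tokens.shape
    grouped = vit_tokens.reshape(b, n // 4, 4 * c)   # concat each group of 4 neighbors
    return merge_proj(grouped)

tokens = torch.randn(1, 1024, vit_dim)               # e.g. a 32x32 patch grid at 448x448
print(merge_visual_tokens(tokens).shape)              # torch.Size([1, 256, 4096])
</code></pre>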
<h2
id="llava-plus-learning-to-use-tools-for-creating-multimodal-agents"><strong>LLaVA-Plus:
Learning to Use Tools for Creating Multimodal Agents</strong></h2>
<p>LLaVA-Plus pioneers the creation of multimodal agents by integrating
diverse vision and vision-language models into a skill repository, enabling
the agent to learn and use tools effectively through end-to-end training on
comprehensive multimodal instruction-following data.</p>
<a href="https://arxiv.org/abs/2311.05437"><img
src="https://img.shields.io/badge/arXiv-2311.05437-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/LLaVA-VL/LLaVA-Plus-Codebase"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan
Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/1ede1c4f-bdeb-48e0-ae8e-ccfbee1dea51" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>LLaVA-Plus</strong>: Builds a multimodal agent by maintaining a
skill repository of pre-trained vision and vision-language tools and
training the model end-to-end to activate and compose the relevant tools in
response to users’ multimodal inputs. A unified scheme for representing
<strong>multimodal instruction-following data</strong> folds user inputs,
tool-activation prompts, and tool execution results into a single dialogue
format, so new tools can be added simply through instruction tuning. The
curated training data covers visual understanding (detection, segmentation,
captioning, OCR), generation, external knowledge retrieval, and their
compositions, drawing on COCO, HierText, InfoSeek, JourneyDB, and Instruct
P2P. Keeping raw visual signals available throughout the human-AI
interaction session further improves tool-use planning and reasoning,
making the agent markedly more capable on complex multimodal tasks.
</details>
<h2 id="bakllava"><strong>BakLLaVA</strong></h2>
<p>BakLLaVA elevates the LLaVA framework by employing a Mistral 7B base
enhanced with the LLaVA 1.5 architecture, undergoing a meticulous two-stage
training process on a diverse dataset to achieve superior performance in
multimodal benchmarks, outperforming competitors like Llama 2 13B.</p>
<a href="https://github.com/skunkworksai/bakllava"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
href="https://huggingface.co/SkunkworksAI/BakLLaVA-1"><img
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
alt="Model" /></a>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>BakLLaVA</strong>: Replaces LLaVA’s base LLM with <strong>Mistral
7B</strong> while keeping the <strong>LLaVA 1.5 architecture</strong>, and
outperforms Llama 2 13B-based counterparts on several multimodal
benchmarks. Training follows the familiar two-stage recipe: a feature
alignment stage on 600K filtered CC3M images establishes the
vision-language connection, and a visual instruction tuning stage on 150K
GPT-generated multimodal instructions adapts the model to follow
instructions grounded in images. The broader data mix includes 558K
filtered image-text pairs from LAION/CC/SBU captioned by BLIP, 158K
GPT-generated multimodal instruction-following samples, 450K
academic-task-oriented VQA examples, and 40K ShareGPT conversations, which
gives wide concept coverage with efficient training.
</details>
<h2
id="cogvlm-visual-expert-for-pretrained-language-models"><strong>CogVLM:
Visual Expert for Pretrained Language Models</strong></h2>
<p>CogVLM enhances pretrained language models with a dedicated visual
expert module, incorporating a QKV matrix and MLP within each layer to
achieve deep visual-language feature alignment, enabling superior
performance in multimodal tasks such as image captioning and visual
question answering.</p>
<a href="https://arxiv.org/abs/2311.03079v2"><img
src="https://img.shields.io/badge/arXiv-2311.03079v2-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/thudm/cogvlm"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui
Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li,
Yuxiao Dong, Ming Ding, Jie Tang
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/93d951e1-ad49-47fd-9135-c11bc69d49bc" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>CogVLM</strong>: Adds a trainable <strong>visual expert
module</strong> to each layer of a pretrained GPT-style language model so
that vision-language features are fused deeply rather than only at the
input. The architecture comprises a Vision Transformer (ViT) encoder,
<strong>an MLP adapter</strong>, the pretrained LLM, and the visual expert,
which gives image tokens their own <strong>QKV (Query, Key, Value)
matrix</strong> and <strong>MLP (Multilayer Perceptron)</strong> in every
layer while text tokens keep the original weights (a schematic sketch
follows below). Pretraining optimizes image-captioning loss and Referring
Expression Comprehension (REC) over more than 1.5 billion image-text pairs
plus a visual grounding dataset of 40 million images; fine-tuning then
applies unified instruction supervision across visual question-answering
datasets. The data includes LAION-2B, COYO-700M, and VQA sets such as
VQAv2, OKVQA, TextVQA, OCRVQA, and ScienceQA, supporting strong performance
on captioning and referring-expression tasks.
</details>
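<p>A schematic sketch of the visual-expert idea described above: image
tokens go through their own QKV weights while text tokens use the original
ones. Single-head attention and the dimensions shown are simplifications
and assumptions, not the released CogVLM code.</p>
<pre><code class="language-python">
# Modality-dependent weights inside one attention layer: a boolean mask selects
# the visual-expert QKV for image tokens and the original QKV for text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.qkv_text = nn.Linear(dim, 3 * dim)    # original (frozen) LM weights
        self.qkv_image = nn.Linear(dim, 3 * dim)   # trainable visual-expert weights
        self.out = nn.Linear(dim, dim)

    def forward(self, hidden, image_mask):
        """hidden: (B, L, dim); image_mask: (B, L) bool, True where the token is visual."""
        qkv = torch.where(image_mask.unsqueeze(-1),
                          self.qkv_image(hidden), self.qkv_text(hidden))
        q, k, v = qkv.chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

layer = VisualExpertAttention()
h = torch.randn(1, 10, 64)
mask = torch.tensor([[True] * 4 + [False] * 6])      # 4 image tokens, 6 text tokens
print(layer(h, mask).shape)                           # torch.Size([1, 10, 64])
</code></pre>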
<h2
id="cogvlm2-enhanced-vision-language-models-for-image-and-video-understanding"><strong>CogVLM2:
Enhanced Vision-Language Models for Image and Video
Understanding</strong></h2>
<p>CogVLM2 is a family of open-source visual language models designed to
push the boundaries of image and video understanding. This new generation
builds upon the success of previous CogVLM models, focusing on enhanced
vision-language fusion, efficient high-resolution architecture, and broader
modalities and applications.</p>
<p><a href="https://arxiv.org/abs/2408.16500"><img
src="https://img.shields.io/badge/arXiv-2408.16500-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/THUDM/CogVLM2"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
href="https://huggingface.co/collections/THUDM/cogvlm2-6645f36a29948b67dc4eef75"><img
src="https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg"
alt="HuggingFace" /></a><br />
Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean
Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu,
Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng
Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang</p>
<p align="center">
<img src="https://github.com/user-attachments/assets/f60247aa-66b3-486c-891c-c29cefe8aed4" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
CogVLM2 is a new-generation family of open-source visual language models
for image and video understanding. A ViT encoder extracts visual features
from high-resolution images or video sequences; a convolutional layer
downsamples them and a SwiGLU adapter aligns them with the linguistic
representations while preserving critical image information. A visual
expert architecture then integrates the visual features into both the
attention and FFN modules of the language decoder, achieving deep
vision-language fusion without compromising the model’s inherent language
capabilities. CogVLM2-Video extends the design to video by pairing
multi-frame inputs with timestamps, enabling temporal localization and
question answering. The family reports state-of-the-art results on
benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCG-Bench across
a wide range of image and video understanding tasks.
</details>
<h2
id="ferret-refer-and-ground-anything-anywhere-at-any-granularity"><strong>Ferret:
Refer and Ground Anything Anywhere at Any Granularity</strong></h2>
<p>FERRET, a multimodal large language model, excels in spatial referencing
and grounding by using a hybrid region representation that combines
discrete coordinates with continuous features, allowing it to precisely
pinpoint objects and regions within images, regardless of their
complexity.</p>
<a href="https://arxiv.org/abs/2310.07704v1"><img
src="https://img.shields.io/badge/arXiv-2310.07704v1-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/apple/ml-ferret"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a><br />
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang,
Liangliang Cao, Shih-Fu Chang, Yinfei Yang
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/a5ff801f-d523-4383-8b89-e2499976b2bb" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>FERRET</strong>: A multimodal large language model (MLLM) that can
refer to any region of an image, whatever its shape or granularity, and
ground open-vocabulary descriptions back to precise locations. Its hybrid
region representation combines discrete coordinates with continuous
regional features, so it can handle everything from exact points and boxes
to free-form, shapeless areas. The architecture consists of an image
encoder that produces image embeddings, <strong>a spatial-aware visual
sampler</strong> that extracts continuous features from arbitrarily shaped
regions of varying density, and a language model that jointly consumes
image, text, and region features. FERRET is trained on GRIT, a dataset of
over 1.1 million samples with hierarchical spatial knowledge, and can emit
both text and coordinates for objects in its outputs. Visual Genome,
RefCOCOs, and Flickr30k are used for object detection, phrase grounding,
and for evaluating its referring and grounding abilities, making it a
versatile framework for spatial reasoning in visual contexts.
</details>
<h2
|
||
id="fuyu-8b-a-multimodal-architecture-for-ai-agents"><strong>Fuyu-8B: A
|
||
Multimodal Architecture for AI Agents</strong></h2>
|
||
<p>Fuyu-8B introduces a streamlined architecture for AI agents by
|
||
directly projecting image patches into a decoder-only transformer,
|
||
simplifying multimodal processing by treating image and text tokens
|
||
uniformly, and achieving efficient performance in vision-language tasks
|
||
despite its straightforward design.</p>
|
||
<p><a href="https://www.adept.ai/blog/fuyu-8b"><img
|
||
src="https://img.shields.io/badge/https%3A%2F%2Fwww.adept.ai%2Fblog%2Ffuyu-8b?style=flat&label=Fuyu%208B"
|
||
alt="Link" /></a> <a href="https://huggingface.co/adept/fuyu-8b"><img
|
||
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
|
||
alt="Model" /></a><br />
|
||
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus
|
||
Odena, Arushi Somani, Sağnak Taşırlar</p>
|
||
<p align="center">
|
||
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/61a75fb4-ced7-419c-bff7-7cb2e3ddc02d" />
|
||
</p>
|
||
<details>
|
||
<summary>
|
||
ℹ️ <i>More Information</i>
|
||
</summary>
|
||
<strong>Fuyu-8B</strong>: A streamlined multimodal model tailored for
|
||
digital agents, distinguished by its unique approach to handling visual
|
||
data and its integration with textual information. At the core of
|
||
Fuyu-8B’s architecture is a decoder-only transformer, a departure from
|
||
traditional models that rely on separate image encoders. This design
|
||
facilitates the direct projection of image patches into the
|
||
transformer’s initial layer with <strong>a linear projection</strong>,
|
||
allowing Fuyu-8B to process images of any resolution without the need
|
||
for complex training stages or the integration of resolution-specific
|
||
mechanisms. The simplicity of this architecture does not only lie in its
|
||
unified processing of image and text data but also in its elimination of
|
||
the need for cross-attention mechanisms or adapters, streamlining the
|
||
model’s training and inference processes. In terms of alignment
|
||
techniques, Fuyu-8B employs a novel approach by treating image tokens on
|
||
par with text tokens from the inception of the model’s processing
|
||
pipeline. This method does away with separate position embeddings for
|
||
images, thereby simplifying the alignment process between textual and
|
||
visual data. The model’s ability to support arbitrary image resolutions
|
||
and perform fine-grained localization is particularly advantageous for
|
||
applications requiring detailed visual understanding alongside textual
|
||
interaction. The datasets utilized in Fuyu-8B’s development, including
|
||
VQAv2, OKVQA, COCO Captions, and AI2D, are instrumental in benchmarking
|
||
the model against standard image understanding tasks such as visual
|
||
question answering and caption generation. Despite Fuyu-8B’s primary
|
||
focus on applications within digital agents, the selection of these
|
||
datasets ensures a comprehensive evaluation of its capabilities in
|
||
broader contexts of image understanding and multimodal interaction.
|
||
Through its innovative architecture and methodological simplicity,
|
||
Fuyu-8B sets a new direction for the development of AI agents capable of
|
||
sophisticated multimodal reasoning.
|
||
</details>
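<p>Because Fuyu-8B’s entire visual pathway is a linear projection of raw
image patches into the decoder’s token stream, the idea fits in a few
lines. Below is a minimal PyTorch-style sketch of that input
construction, not the released model code; the patch size, hidden width,
and vocabulary size are placeholder assumptions.</p>
<pre><code># Image patches are linearly projected and placed in the same token
# stream as text embeddings, with no separate vision encoder.
import torch
import torch.nn as nn

class FuyuStyleInputs(nn.Module):
    def __init__(self, patch=30, channels=3, d_model=4096, vocab=262144):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(channels * patch * patch, d_model)  # patch to token
        self.tok_embed = nn.Embedding(vocab, d_model)

    def forward(self, image, text_ids):
        # image: (B, C, H, W), with H and W divisible by the patch size
        p = self.patch
        patches = image.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2)
        img_tokens = self.proj(patches)        # (B, num_patches, d_model)
        txt_tokens = self.tok_embed(text_ids)  # (B, T, d_model)
        # One uniform sequence for the decoder-only transformer.
        return torch.cat([img_tokens, txt_tokens], dim=1)
</code></pre>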
<h2 id="otterhd-a-high-resolution-multi-modality-model"><strong>OtterHD:
A High-Resolution Multi-modality Model</strong></h2>
<p>OtterHD-8B, inspired by Fuyu-8B, directly integrates pixel-level
information from high-resolution images (up to 1024x1024 pixels) into
its language model using position embeddings, eliminating the need for a
separate vision encoder and enabling precise interpretation of detailed
visual inputs alongside textual instructions.</p>
<a href="https://arxiv.org/abs/2311.04219v1"><img
src="https://img.shields.io/badge/arXiv-2311.04219v1-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/luodian/otter"><img
src="https://badges.aleen42.com/src/github.svg" alt="GitHub" /></a> <a
href="https://huggingface.co/spaces/Otter-AI/OtterHD-Demo"><img
src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"
alt="Gradio" /></a><br />
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>OtterHD-8B</strong>: Represents an evolutionary step in
multi-modality model design, building on the foundation of the
<strong>Fuyu-8B architecture</strong> to interpret high-resolution
visual inputs with exceptional precision. Unlike traditional models
limited by fixed-size vision encoders, OtterHD-8B is equipped to handle
flexible input dimensions, allowing for enhanced versatility across a
variety of inference requirements. This model integrates pixel-level
visual information directly into the language model without the need
for a separate vision encoder, employing position embeddings to
comprehend varying image sizes and enabling the processing of
high-resolution images up to 1024x1024 pixels. Instruction tuning in
OtterHD-8B is tailored towards accommodating various image resolutions,
with the model being trained on a diverse dataset mixture including
LLaVA-Instruct, VQAv2, GQA, OKVQA, OCRVQA, A-OKVQA, COCO-GOI,
COCO-Caption, TextQA, RefCOCO, COCO-ITM, ImageNet, and LLaVA-RLHF. This
training employs FlashAttention-2 and other fused operators for
optimization, leveraging PyTorch and HuggingFace transformers. The
direct integration of pixel-level information into the language model,
facilitated by position embeddings, enables OtterHD-8B to understand
and generate responses to high-resolution images alongside textual
instructions without conventional vision and text embedding fusion
methods. The datasets chosen for training OtterHD-8B underscore its
focus on a broad array of vision and language tasks, including question
answering, object recognition, and text-image alignment, aiming to
enhance the model’s capabilities in these areas. By directly processing
image patches alongside textual instructions, OtterHD-8B eschews
traditional fusion methods, leveraging its architecture to interpret
and respond to complex multimodal inputs. This approach marks a
significant advancement not only in handling high-resolution images but
also in the model’s overall ability to comprehend and interact with
visual and textual data, positioning OtterHD-8B as a notable
development in the field of multi-modality models.
</details>
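<p>The description above attributes OtterHD-8B’s handling of varying
image sizes to position embeddings over the patch grid. The sketch below
illustrates one way such grid positions could be attached to patch
tokens from an arbitrarily sized image; it is an illustrative assumption
rather than the model’s actual code, and all names and sizes are
placeholders.</p>
<pre><code># Row/column position embeddings for a variable-size patch grid,
# flattened row-major before entering the language model.
import torch
import torch.nn as nn

class GridPositions(nn.Module):
    def __init__(self, d_model=4096, max_side=64):
        super().__init__()
        self.row = nn.Embedding(max_side, d_model)
        self.col = nn.Embedding(max_side, d_model)

    def forward(self, patch_tokens, rows, cols):
        # patch_tokens: (B, rows * cols, d_model), flattened row-major
        device = patch_tokens.device
        r = torch.arange(rows, device=device).repeat_interleave(cols)  # 0,0,...,1,1,...
        c = torch.arange(cols, device=device).repeat(rows)             # 0,1,...,0,1,...
        return patch_tokens + self.row(r) + self.col(c)

# e.g. a 1024x1024 image cut into 32-pixel patches yields a 32x32 grid
# (1024 patch tokens), while a smaller image simply yields fewer tokens.
</code></pre>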
<h2
id="sphinx-the-joint-mixing-of-weights-tasks-and-visual-embeddings-for-multi-modal-large-language-models"><strong>SPHINX:
The Joint Mixing of Weights, Tasks, and Visual Embeddings for
Multi-modal Large Language Models</strong></h2>
<p>SPHINX pushes the boundaries of multi-modal LLMs by jointly mixing
model weights, tasks, and visual embeddings during training, utilizing a
two-stage approach that unfreezes the LLM (LLaMA-2) during pre-training
for enhanced cross-modal learning and achieving impressive performance
on a variety of vision-language tasks.</p>
<a href="https://arxiv.org/abs/2311.07575v1"><img
src="https://img.shields.io/badge/arXiv-2311.07575v1-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/alpha-vllm/"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han
Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi
Zhang, Xuming He, Hongsheng Li, Yu Qiao
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/3a1bf3fa-d0c5-4692-b9a8-97bea41ce226" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>SPHINX</strong>: Stands out as a multi-modal large language
model (MLLM) designed to enhance the integration of language and vision
through an innovative approach that includes the <strong>joint mixing
of model weights</strong>, tuning tasks, and visual embeddings. This
model is particularly distinguished by its methodology of unfreezing
the large language model during pre-training to foster more effective
cross-modal learning. The architecture of SPHINX combines vision
encoders and <strong>two linear projection layers</strong> with LLaMA-2
as the language model backbone. It adopts a two-stage training paradigm
that emphasizes pre-training for vision-language alignment followed by
fine-tuning aimed at visual instruction-following tasks. In the realm
of training methodologies, SPHINX employs a strategy that emphasizes
<strong>the joint mixing of model weights</strong>, tuning tasks, and
visual embeddings, setting a precedent for robust cross-modal knowledge
acquisition. This approach is complemented by a pre-training regimen
that utilizes both real-world and synthetic data, thereby ensuring a
comprehensive understanding across various visual instruction tasks.
The model introduces an efficient strategy for processing
high-resolution images, utilizing mixed scales and sub-images to
accommodate diverse visual inputs. Moreover, SPHINX achieves
vision-language alignment by integrating comprehensive visual
embeddings, unfreezing the LLM during pre-training, and employing a
weight-mixing strategy that bridges domain-specific knowledge across
different network architectures and training paradigms. The datasets
utilized in training SPHINX, including LAION-400M, LAION-COCO,
RefinedWeb, VQAV2, GQA, OKVQA, A-OKVQA, OCRVQA, TextCaps, COCO, LVIS,
RefCOCO, VG, and Flickr30k, serve a multifaceted purpose. They are
instrumental in achieving multi-modal alignment, language-only tuning,
and addressing a wide spectrum of visual question answering and general
vision tasks. These tasks range from object detection and human pose
estimation to referring object localization and understanding
descriptions within the context of image regions. SPHINX, through its
meticulous design and strategic training approach, sets a new benchmark
in the field of multi-modal large language models, advancing the
capabilities in vision-language integration.
</details>
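<p>To make the “mixed scales and sub-images” strategy above concrete,
here is a hedged PyTorch-style sketch of how a high-resolution image
could be turned into one downsampled global view plus full-resolution
sub-image crops, each fed through the same vision encoder. The grid
layout, base resolution, and function names are illustrative
assumptions, not SPHINX’s actual pipeline.</p>
<pre><code># One low-resolution global view plus grid*grid sub-image crops, each
# encoded separately and concatenated into one visual token sequence.
import torch
import torch.nn.functional as F

def mixed_scale_views(image, base=224, grid=2):
    # image: (B, C, H, W) high-resolution input
    global_view = F.interpolate(image, size=(base, base), mode="bilinear",
                                align_corners=False)
    full = F.interpolate(image, size=(base * grid, base * grid), mode="bilinear",
                         align_corners=False)
    subs = []
    for i in range(grid):
        for j in range(grid):
            subs.append(full[:, :, i * base:(i + 1) * base, j * base:(j + 1) * base])
    return [global_view] + subs               # 1 global view + grid*grid crops

def encode_mixed(views, vision_encoder, projector):
    # vision_encoder / projector stand in for SPHINX's encoders and the
    # linear projection layers mentioned above.
    tokens = [projector(vision_encoder(v)) for v in views]
    return torch.cat(tokens, dim=1)           # (B, total_visual_tokens, d_model)
</code></pre>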
<h2 id="clip-contrastive-language-image-pre-training"><strong>CLIP:
Contrastive Language-Image Pre-training</strong></h2>
<p>CLIP leverages a contrastive learning approach, training separate
image and text encoders on a massive dataset of 400 million image-text
pairs to predict the most relevant captions for images, enabling
impressive zero-shot transfer capabilities to various downstream tasks
without requiring task-specific training data.</p>
<a href="https://arxiv.org/abs/2103.00020"><img
src="https://img.shields.io/badge/arXiv-2103.00020-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/openai/CLIP"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack
Clark, Gretchen Krueger, Ilya Sutskever
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/c335c342-9a2c-4d4e-83d6-d3077cc32643" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>CLIP</strong>: Represents a groundbreaking approach in the
field of machine learning, aiming to bridge the gap between visual and
textual information through natural language supervision. Its
architecture is designed to understand and predict <strong>the most
fitting captions for given images</strong>, a methodology that stems
from its training on a vast dataset of 400 million image-text pairs.
This extensive training enables CLIP to learn state-of-the-art (SOTA)
image representations and apply this knowledge to a wide range of
downstream tasks without the need for task-specific training data,
facilitating zero-shot transfer capabilities. At the core of CLIP are
two primary components: <strong>an image encoder</strong> and <strong>a
text encoder</strong>. These encoders are trained using a contrastive
learning approach, optimizing for a contrastive objective that seeks to
maximize the cosine similarity between correct image-text pairs while
minimizing it for incorrect ones. This process is achieved through
<strong>a symmetric cross-entropy loss over the similarity scores
between the embeddings of images and texts</strong>, enabling the model
to effectively link visual concepts with their linguistic descriptions.
The model’s ability to generalize across various tasks is further
enhanced by its training methodology and the specific datasets it
utilizes. By covering a broad spectrum of visual concepts and leveraging
natural language for supervision, CLIP is adept at learning
representations that are highly transferable to new tasks and domains.
The custom dataset of 400 million image-text pairs, curated from the
internet, plays a pivotal role in this process, providing the diverse
and extensive visual and textual information necessary for the model to
learn effectively. Through these innovations, CLIP sets a new standard
for learning transferable visual models, showcasing the power of natural
language in facilitating robust and versatile visual understanding.
</details>
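<p>The symmetric cross-entropy objective described above can be written
in a few lines. The sketch below shows only the loss, assuming the image
and text encoders already exist; the fixed temperature value is a
simplification (in CLIP the temperature is a learned parameter).</p>
<pre><code># Symmetric cross-entropy over the cosine-similarity matrix of a batch
# of matched image and text embeddings.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (N, D) embeddings for N matching image-text pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity scores
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)         # match each image to its text
    loss_t = F.cross_entropy(logits.t(), targets)     # match each text to its image
    return (loss_i + loss_t) / 2
</code></pre>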
<h2 id="metaclip-demystifying-clip-data"><strong>MetaCLIP: Demystifying
CLIP Data</strong></h2>
<p>MetaCLIP refines the data curation process for training
vision-language models by employing algorithms that leverage
CLIP-derived metadata to create a balanced and high-quality dataset from
vast sources like CommonCrawl, resulting in improved performance and
diversity compared to models trained on CLIP’s original dataset.</p>
<a href="https://arxiv.org/abs/2309.16671"><img
src="https://img.shields.io/badge/arXiv-2309.16671-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a
href="https://github.com/facebookresearch/MetaCLIP"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes,
Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph
Feichtenhofer
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/a6c79d0e-a4c7-48c9-86b6-3a8cc9853e11" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>MetaCLIP</strong>: Represents an innovative approach in the
realm of data curation for machine learning, specifically targeting the
<strong>enhancement of training datasets</strong> through metadata
utilization derived from CLIP’s concepts. This model is designed to sift
through extensive raw data pools, such as the CommonCrawl dataset, to
curate a high-quality, balanced subset that significantly improves the
diversity and quality of the data used to train machine learning
models. The essence of MetaCLIP lies in its unique architecture
that incorporates data curation algorithms, which are adept at
leveraging metadata for the purpose of balancing and enriching the
training dataset both in terms of quality and diversity. The
architecture of MetaCLIP is structured around these <strong>data
curation algorithms</strong>, which play a pivotal role in the framework
by identifying and assembling a balanced, high-quality dataset of 400
million image-text pairs from raw web data sourced from CommonCrawl.
This process is instrumental in MetaCLIP’s ability to
demonstrate superior performance on various benchmarks, including
zero-shot ImageNet classification, when compared to datasets curated
using CLIP’s original methodologies. The training methods employed by
MetaCLIP, therefore, are not just about processing and learning from
data but also about intelligently selecting the data that is most
beneficial for the training process, ensuring that the model is trained
on a dataset that is representative, diverse, and of high quality. The
purpose behind employing datasets like CommonCrawl within the MetaCLIP
framework is to address and overcome the limitations observed in CLIP’s
original dataset. By curating a balanced and high-quality dataset of 400
million image-text pairs, MetaCLIP sets a new precedent in the field of
machine learning data curation. This strategic selection and enhancement
of the training dataset enable MetaCLIP to significantly improve
performance on standard benchmarks compared to its predecessor,
highlighting the importance of dataset quality and diversity in
achieving high performance in machine learning tasks. Through its
innovative approach to data curation, MetaCLIP offers a promising avenue
for enhancing the capabilities of machine learning models, particularly
in applications requiring robust image-text understanding and
classification.
</details>
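<p>The curation algorithm above is described only at a high level, so
the following is a deliberately hedged sketch of metadata-based
balancing in the spirit of MetaCLIP: texts are matched to metadata
entries by substring, and over-represented (head) entries are capped so
the final pool is balanced. The matching rule, the cap value, and the
data structures are illustrative assumptions, not the paper’s exact
procedure.</p>
<pre><code># Balance a raw image-text pool by capping how many pairs any single
# metadata entry may contribute (a naive O(pairs * metadata) sketch).
import random
from collections import defaultdict

def curate(pairs, metadata, cap=20000, seed=0):
    # pairs: list of (image_url, alt_text); metadata: list of concept strings
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for url, text in pairs:
        for entry in metadata:
            if entry in text:                  # substring match against metadata
                buckets[entry].append((url, text))
    curated = []
    for entry, matched in buckets.items():
        if len(matched) > cap:                 # down-sample over-represented entries
            matched = rng.sample(matched, cap)
        curated.extend(matched)
    return curated
</code></pre>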
<h2
id="alpha-clip-a-clip-model-focusing-on-wherever-you-want"><strong>Alpha-CLIP:
A CLIP Model Focusing on Wherever You Want</strong></h2>
<p>Alpha-CLIP builds upon the CLIP model by incorporating region
awareness through the addition of an alpha channel to the image encoder,
trained on millions of RGBA region-text pairs, enabling precise control
over image emphasis and enhancing performance across various tasks
requiring detailed spatial understanding.</p>
<a href="https://arxiv.org/abs/2312.03818"><img
src="https://img.shields.io/badge/arXiv-2312.03818-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/SunzeY/AlphaCLIP"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun
Xiong, Dahua Lin, Jiaqi Wang
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/07bd6161-1682-4954-97f3-3770258bfa8c" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>Alpha-CLIP</strong>: Introduces a significant enhancement to the
original CLIP model, adding region awareness to its repertoire of
capabilities. This model is fine-tuned on millions of RGBA region-text
pairs, enabling it to maintain CLIP’s visual recognition prowess while
offering precise control over the emphasis of image content. By
integrating an additional <strong>alpha channel into the CLIP image
encoder</strong>, Alpha-CLIP allows for detailed segmentation and
region-specific processing without modifying the foundational CLIP
weights, thus facilitating a nuanced approach to image understanding
that respects the spatial dynamics of visual data. The training of
Alpha-CLIP leverages a novel data generation pipeline designed to
produce a vast array of RGBA-region text pairs. This process involves
the creation of natural images equipped with foreground alpha channels
and their corresponding referring expressions for specific regions. Such
a methodology not only enables the fine-tuning of the model with an
additional alpha channel input but also underpins its ability to perform
with heightened specificity across various tasks. These tasks range from
image recognition to multimodal large language models, and extend into
both 2D and 3D generation domains, showcasing Alpha-CLIP’s versatility
and broad applicability. Datasets like LAION-400M, LAION-5B, and GRIT
play a crucial role in training Alpha-CLIP, providing a wide spectrum of
images for initial training and fine-grained mask-level labels for
enhancing local perception capabilities. This strategic choice of
datasets ensures that Alpha-CLIP is not only well-equipped for general
visual recognition tasks but also capable of nuanced, region-specific
processing and understanding, setting a new standard for models at the
intersection of language and vision.
</details>
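<p>Since the key architectural change above is an extra alpha channel on
the CLIP image encoder, here is a hedged sketch of one common way to
graft such a channel onto a pretrained patch-embedding convolution. The
layer names and the zero-initialisation choice are assumptions for
illustration, not the repository’s exact code.</p>
<pre><code># A parallel patch-embedding convolution for the alpha map whose output
# is added to the RGB patch embedding of a pretrained CLIP-style ViT.
import torch
import torch.nn as nn

class AlphaPatchEmbed(nn.Module):
    def __init__(self, rgb_patch_conv):
        super().__init__()
        self.rgb_conv = rgb_patch_conv                      # pretrained CLIP conv
        k = rgb_patch_conv.kernel_size
        s = rgb_patch_conv.stride
        d = rgb_patch_conv.out_channels
        self.alpha_conv = nn.Conv2d(1, d, kernel_size=k, stride=s, bias=False)
        nn.init.zeros_(self.alpha_conv.weight)              # start as plain CLIP

    def forward(self, rgb, alpha):
        # rgb: (B, 3, H, W); alpha: (B, 1, H, W) soft region mask
        return self.rgb_conv(rgb) + self.alpha_conv(alpha)  # (B, d, H/p, W/p)
</code></pre>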
<h2 id="glip-grounded-language-image-pre-training"><strong>GLIP:
Grounded Language-Image Pre-training</strong></h2>
<p>GLIP revolutionizes language-image pre-training by unifying object
detection and phrase grounding, allowing it to understand and execute
tasks requiring object-level precision and language awareness through a
deep integration of visual and textual information during training.</p>
<a href="https://arxiv.org/abs/2112.03857"><img
src="https://img.shields.io/badge/arXiv-2112.03857-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a href="https://github.com/microsoft/GLIP"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang,
Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng
Hwang, Kai-Wei Chang, Jianfeng Gao
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/06e6f8dc-fbd8-49da-8651-a22ee2edcf3d" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>GLIP</strong>: A novel approach that innovatively unifies the
tasks of object detection and phrase grounding by redefining object
detection as a phrase grounding challenge. This strategic reformulation
allows the model to exploit extensive image-text paired datasets for
pre-training, equipping it with the capability to comprehend and execute
tasks that require object-level precision, language awareness, and
semantically rich visual representations. At its core, GLIP’s
architecture is designed to deeply integrate visual and textual
information, enhancing its understanding of complex visual scenes in
conjunction with textual prompts. The architecture of GLIP is composed
of several critical components, including a visual encoder that can
either be a Convolutional Neural Network (CNN) or a Transformer, tasked
with extracting features from regions or bounding boxes within images.
It also includes a language encoder dedicated to processing text prompts
and prediction heads (box classifier and box regressor) that are trained
using <strong>classification</strong> and <strong>localization
loss</strong>. A distinctive feature of GLIP is its method of deep
fusion between image and text, specifically in the latter stages of
encoding, which merges visual and textual information more
comprehensively than traditional methods. GLIP’s training methodology is
as innovative as its architecture, employing a unified formulation that
amalgamates detection and grounding tasks into a singular workflow. This
model is trained end-to-end, optimizing losses defined for <strong>both
detection</strong> (focusing on localization and classification) and
<strong>grounding</strong> (centering on alignment scores between image
regions and corresponding words in the prompt). Such deep integration of
visual and language features during training is pivotal, facilitating
the model’s ability to learn effectively from paired image-text data.
The datasets utilized for training GLIP, including COCO, OpenImages,
Objects365, Visual Genome, Flickr30k-entities, LVIS, and PhraseCut, are
meticulously selected to cover a wide array of object classes and
scenarios, each serving a unique purpose from object detection and
phrase grounding to instance segmentation and referring expression
segmentation. Through this comprehensive training, GLIP sets a new
precedent in the realm of language-image pre-training, demonstrating
advanced capabilities in interpreting and interacting with both visual
and textual data.
</details>
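<p>The grounding loss above is built on alignment scores between image
regions and words in the prompt. The sketch below shows that scoring
step in isolation: region and word features are projected into a joint
space and compared by dot product, so the usual fixed class logits
become word-region alignment scores. Projection names and dimensions
are illustrative assumptions, not GLIP’s exact module layout.</p>
<pre><code># Word-region alignment scores: detection as phrase grounding.
import torch
import torch.nn as nn

class WordRegionAlignment(nn.Module):
    def __init__(self, region_dim=256, text_dim=768, joint_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.word_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, region_feats, word_feats):
        # region_feats: (B, num_regions, region_dim) from the visual encoder
        # word_feats:   (B, num_tokens, text_dim) from the language encoder
        O = self.region_proj(region_feats)
        P = self.word_proj(word_feats)
        # (B, num_regions, num_tokens): the "classification" logits are now
        # alignment scores between regions and prompt words.
        return torch.einsum("brd,btd->brt", O, P)
</code></pre>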
<h2
id="imagebind-one-embedding-space-to-bind-them-all"><strong>ImageBind:
One Embedding Space To Bind Them All</strong></h2>
<p>ImageBind revolutionizes multimodal learning by creating a single,
joint embedding space that integrates six modalities – images, text,
audio, depth, thermal, and IMU data – through image-paired data as a
central binding agent, allowing for zero-shot classification and
retrieval across diverse data types.</p>
<a href="https://arxiv.org/abs/2305.05665"><img
src="https://img.shields.io/badge/arXiv-2305.05665-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a
href="https://github.com/facebookresearch/imagebind"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan
Vasudev Alwala, Armand Joulin, Ishan Misra
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/fbf8bcdd-b1bb-4fd8-8723-3c82e84ef759" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>ImageBind</strong>: Introduces an innovative approach to
multimodal learning by creating <strong>a joint embedding space</strong>
that encompasses six different modalities: <strong>images, text, audio,
depth, thermal, and IMU (Inertial Measurement Unit)</strong> data. This
model uniquely employs image-paired data as a central binding agent,
enabling it to leverage the capabilities of large-scale vision-language
models to extend zero-shot capabilities to new, previously unlinked
modalities. By doing so, ImageBind not only facilitates a deeper
integration of diverse data types but also opens up new avenues for
zero-shot classification and retrieval across a wide range of
applications. At the heart of ImageBind’s architecture lies a
transformer-based design, adapted for each specific modality to ensure
optimal processing and representation. For instance, it utilizes a
Vision Transformer for image data, with each modality encoder being
augmented by <strong>modality-specific linear projection heads</strong>.
These adaptations are crucial for maintaining a uniform embedding size
across the disparate data types, ensuring that the model can effectively
learn from and link together the various modalities. This uniformity is
key to ImageBind’s ability to create a cohesive and comprehensive
embedding space that captures the nuances of each data type. The
training methodology behind ImageBind is particularly noteworthy. It
employs contrastive learning, utilizing both web-scale image-text data
and naturally occurring paired data from various modalities, such as
video-audio and image-depth pairs. This strategy allows the model to
learn a single joint embedding space without requiring all modalities to
co-occur, a significant advantage that enhances its flexibility and
applicability. The use of datasets like Audioset, SUN RGB-D, LLVIP, and
Ego4D, which provide naturally paired data across the model’s target
modalities, is critical to this process. These datasets enable ImageBind
to achieve emergent zero-shot classification and retrieval performance
on tasks tailored to each modality, showcasing the model’s ability to
seamlessly navigate and leverage the complex interplay between different
forms of data.
</details>
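<p>As described above, each modality is aligned to images with a
contrastive objective on naturally paired data. The sketch below shows
only that binding step, using an InfoNCE-style loss between image
embeddings and another modality’s embeddings; the encoders, the shared
embedding dimension, and the temperature are placeholder
assumptions.</p>
<pre><code># Align any non-image modality to the image embedding space with a
# contrastive (InfoNCE-style) loss over paired samples.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.05):
    # anchor: (N, D) image embeddings; positive: (N, D) paired-modality embeddings
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def bind_step(image_emb, other_emb):
    # Pairs such as (image, audio) or (image, depth) are enough: aligning each
    # modality to images transitively places all six in one embedding space.
    return info_nce(image_emb, other_emb)
</code></pre>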
<h2
id="siglip-sigmoid-loss-for-language-image-pre-training"><strong>SigLIP:
Sigmoid Loss for Language Image Pre-Training</strong></h2>
<p>SigLIP introduces a simple pairwise sigmoid loss for language-image
pre-training, allowing for scalable training with large batch sizes
without compromising performance, enabling efficient alignment between
image and text representations.</p>
<a href="https://arxiv.org/abs/2303.15343"><img
src="https://img.shields.io/badge/arXiv-2303.15343-b31b1b.svg?style=flat-square"
alt="arXiv" /></a><br />
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer<br />
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/60018313-37dd-4dbd-8eb4-a3075fd26663" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>SigLIP</strong>: A novel approach to language-image pre-training
that proposes <strong>a simple pairwise sigmoid loss</strong>. This
method contrasts with standard contrastive learning that utilizes
softmax normalization, as it operates directly on image-text pairs
without necessitating a global view of pairwise similarities for
normalization. The primary advantage of this approach is its
scalability, allowing for the use of larger batch sizes without
compromising performance. The architecture leverages a vision
transformer for image processing and a conventional transformer for
text, with the sigmoid loss facilitating independent processing of
image-text pairs. This design enables more efficient training dynamics,
particularly with large batch sizes; the authors also examine how the
ratio of negative to positive examples and the selection of example
pairs affect performance. Training methodology focuses on exploiting
large batch sizes, delving into how batch-size variations influence
model performance, and the introduction of the sigmoid loss is pivotal
in enabling the model to train effectively with these very large
batches. The use of the LiT image-text dataset and the WebLI dataset is
integral to the model’s training, aiming to achieve aligned
representational spaces between images and texts. These datasets are
chosen for their utility in assessing zero-shot transfer capabilities,
as well as in exploring the scalability and efficiency of the model’s
sigmoid loss-based training. In essence, SigLIP marks a significant
stride in language-image pre-training through its innovative use of
sigmoid loss, enhancing scalability and training efficiency. This
approach not only simplifies the training process by eliminating the
need for global normalization but also showcases the model’s
adaptability to large-scale data handling. The strategic selection of
datasets further underscores the model’s capability to forge aligned
representational spaces, paving the way for advanced zero-shot learning
and efficient multimodal integration.
</details>
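<p>The pairwise sigmoid objective above scores every image-text pair in
the batch independently as a match or non-match, so no batch-wide
softmax normalisation is required. Below is a short sketch of that loss;
in SigLIP the temperature and bias are learned parameters, so passing
them in as plain tensors here is only a simplification.</p>
<pre><code># Pairwise sigmoid loss: +1 labels on the diagonal (matching pairs),
# -1 everywhere else, with no softmax over the batch.
import torch
import torch.nn.functional as F

def siglip_loss(image_emb, text_emb, t, b):
    # image_emb, text_emb: (N, D); t, b: scalar tensors (learnable in practice)
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b                          # (N, N)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 diag, -1 off
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
</code></pre>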
<h2
id="vit-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale"><strong>ViT:
An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale</strong></h2>
<p>The Vision Transformer (ViT) revolutionizes image recognition by
applying the Transformer architecture to images, processing them as a
sequence of fixed-size patches, thereby demonstrating that image
recognition can benefit from the power of transformers, surpassing
traditional convolutional neural network (CNN) approaches with the aid
of large-scale training datasets.</p>
<a href="https://arxiv.org/abs/2010.11929v2"><img
src="https://img.shields.io/badge/arXiv-2010.11929v2-b31b1b.svg?style=flat-square"
alt="arXiv" /></a> <a
href="https://github.com/google-research/vision_transformer"><img
src="https://badges.aleen42.com/src/github.svg"
alt="GitHub" /></a><br />
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn,
Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer,
Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
<p align="center">
<img src="https://github.com/gokayfem/Awesome-VLM-Architectures/assets/88277926/b2f77966-c2e8-4204-ba90-be51196a7dee" />
</p>
<details>
<summary>
ℹ️ <i>More Information</i>
</summary>
<strong>The Vision Transformer (ViT)</strong>: Marks a paradigm shift in
image recognition by applying the transformer architecture,
predominantly used in natural language processing, directly to images.
It innovatively processes images as <strong>a sequence of fixed-size
patches</strong>, akin to how tokens are treated in <strong>text
applications</strong>. This approach is facilitated through minimal
modifications to the standard transformer components, emphasizing the
model’s adaptability to visual tasks without relying on the
convolutional neural networks’ (CNNs) inductive biases. ViT’s
architecture is distinguished by its use of linear embedding for
<strong>image patches</strong> and <strong>position embeddings</strong>,
which are crucial for maintaining the spatial hierarchy of image data.
The core of ViT is a standard Transformer encoder that includes
multiheaded self-attention (MSA) and multilayer perceptron (MLP) blocks,
complemented by layer normalization and residual connections,
underscoring its efficiency and robustness in handling visual data.
Training of ViT is characterized by its scalability and the significant
impact of dataset size on its performance. When trained on mid-sized
datasets without strong regularization, ViT attains only modest
accuracies, but its performance improves markedly with the scale of
pre-training, allowing it to match or surpass traditional CNN
approaches after extensive pre-training on large datasets. This
highlights the critical role of dataset selection in ViT’s training
regimen: it is fine-tuned on smaller datasets following a comprehensive
pre-training phase that leverages large datasets like ImageNet-21k and
JFT-300M to enhance model generalization and performance across a wide
range of tasks. The datasets employed, including ImageNet, CIFAR-100,
VTAB, ImageNet-21k, and JFT-300M, serve dual purposes: benchmarking the
model’s image classification capabilities and evaluating its
transferability to diverse tasks with limited data, thereby
establishing ViT’s versatility and effectiveness in advancing image
recognition tasks.
</details>
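<p>The patch-and-position pipeline described above is easy to show
concretely. The sketch below builds ViT-style input tokens: fixed-size
patches are linearly embedded (here via the equivalent strided
convolution), a learnable [class] token is prepended, and position
embeddings are added before the standard Transformer encoder. The
dimensions follow common ViT-Base settings but are assumptions here,
and the encoder itself is omitted.</p>
<pre><code># ViT input construction: patch embedding + [class] token + positions.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch=16, channels=3, d_model=768):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        self.proj = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images):
        # images: (B, C, H, W) -> patch tokens: (B, num_patches, d_model)
        x = self.proj(images).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the [class] token
        return x + self.pos_embed                # ready for the Transformer encoder
</code></pre>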
<h2 id="important-references">Important References</h2>
<ul>
<li><a
href="https://encord.com/blog/vision-language-models-guide/">Guide to
Vision-Language Models (VLMs) by Görkem Polat</a></li>
<li><a href="https://aman.ai/primers/ai/VLM/#google_vignette">VLM Primer
by Aman Chadha</a></li>
<li><a
href="https://lilianweng.github.io/posts/2022-06-09-vlm/">Generalized
Visual Language Models by Lilian Weng</a></li>
</ul>
<p><a
href="https://github.com/gokayfem/awesome-vlm-architectures">vlmarchitectures.md
Github</a></p>