Repository of Otter & Otter-HD
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Finding duplicate images made easy!
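A minimal usage sketch, assuming the PHash interface shown in imagededup's README; the `images/` directory is a placeholder:

```python
# Sketch of duplicate detection with imagededup's perceptual hashing.
# Assumes the PHash interface from the project's README; 'images/' is a
# placeholder directory of images to scan.
from imagededup.methods import PHash

phasher = PHash()

# Encode every image in the directory to a perceptual hash.
encodings = phasher.encode_images(image_dir='images/')

# Map each filename to its list of near-duplicate filenames.
duplicates = phasher.find_duplicates(encoding_map=encodings)
print(duplicates)
```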
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
PyTorch implementation of "Denoising Diffusion Probabilistic Models"
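The forward (noising) process the paper defines has a closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; a self-contained PyTorch sketch, with a linear beta schedule picked for illustration:

```python
# Closed-form forward diffusion q(x_t | x_0) from the DDPM paper:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.
# The linear beta schedule below is one common choice, used for illustration.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # noise schedule beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) for a batch of timesteps t."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)                 # a batch of images
t = torch.randint(0, T, (8,))                  # a random timestep per sample
xt = q_sample(x0, t)
```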
Modeling, training, eval, and inference code for OLMo
A lightweight library for mixture-of-experts (MoE) training. The core of the system is efficient dropless-MoE (dMoE) and standard MoE layers
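For orientation, a minimal top-2 token-routing sketch of a standard MoE layer in plain PyTorch; this illustrates the gating idea only and is not MegaBlocks' dMoE API:

```python
# Minimal top-2 MoE routing sketch in plain PyTorch. Illustrative only:
# it shows standard token-choice gating, not MegaBlocks' dropless kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 64))                     # 10 tokens, 64-dim each
```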
A Python Perceptual Image Hashing Module
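Typical usage, based on ImageHash's documented phash function (the filenames are placeholders); subtracting two hashes yields their Hamming distance:

```python
# Perceptual hashing with the ImageHash library: visually similar images
# produce hashes with a small Hamming distance. Filenames are placeholders.
from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open('photo_a.jpg'))
h2 = imagehash.phash(Image.open('photo_b.jpg'))

# Subtracting two hashes gives the Hamming distance between them;
# a small threshold (e.g. <= 5) flags likely duplicates.
if h1 - h2 <= 5:
    print('likely duplicates')
```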
OLMoE: Open Mixture-of-Experts Language Models
Codebase for Aria - An Open Multimodal Native MoE
A fork of the official LLaVA-NeXT repository
Accelerating the development of large multimodal models (LMMs) with lmms-eval, a one-click evaluation module.
VILA - a multi-image visual language model with training, inference, and evaluation recipes, deployable from cloud to edge (Jetson Orin and laptops)
Implementation of Vary-toy (Small Language Model Meets with Reinforced Vision Vocabulary)
Mixture-of-Experts for Large Vision-Language Models
PyTorch code and models for V-JEPA self-supervised learning from video.
I-JEPA, first outlined in the CVPR paper "Self-supervised learning from images with a joint-embedding predictive architecture"
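A rough conceptual sketch of the joint-embedding predictive idea, with simplified stand-in modules (a linear "encoder" and a frozen copy playing the EMA target encoder; none of this is the papers' actual architecture):

```python
# Conceptual JEPA sketch: predict target-patch embeddings (produced by a
# frozen EMA "target" encoder) from context-patch embeddings. Shapes and
# modules are simplified assumptions, not the I-JEPA/V-JEPA architectures.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 128
context_encoder = nn.Linear(d, d)                  # stand-in for a ViT
predictor = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
target_encoder = copy.deepcopy(context_encoder)    # EMA copy, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

patches = torch.randn(4, 16, d)                    # (batch, patches, dim)
ctx, tgt = patches[:, :12], patches[:, 12:]        # masked split: context vs targets

pred = predictor(context_encoder(ctx)).mean(dim=1) # predict from visible context
with torch.no_grad():
    target = target_encoder(tgt).mean(dim=1)       # regression target

loss = F.smooth_l1_loss(pred, target)              # loss in representation space
loss.backward()
# After each optimizer step, the target encoder is updated as an EMA of the
# context encoder's weights.
```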
A family of highly capable yet efficient large multimodal models
A family of lightweight multimodal models