stereoplegic's Collections
Multimodal
Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper • 2310.16045 • Published • 14
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Paper • 2310.14566 • Published • 25
SILC: Improving Vision Language Pretraining with Self-Distillation
Paper • 2310.13355 • Published • 6
Conditional Diffusion Distillation
Paper • 2310.01407 • Published • 20
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing
Paper • 2310.12404 • Published • 15
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models
Paper • 2310.11954 • Published • 24
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Paper • 2309.16058 • Published • 55
Jointly Training Large Autoregressive Multimodal Models
Paper • 2309.15564 • Published • 8
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions
Paper • 2308.04152 • Published • 2
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Paper • 2309.10020 • Published • 40
Language as the Medium: Multimodal Video Classification through text only
Paper • 2309.10783 • Published • 1
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Paper • 2310.00653 • Published • 3
Kosmos-2.5: A Multimodal Literate Model
Paper • 2309.11419 • Published • 50
You Only Look at Screens: Multimodal Chain-of-Action Agents
Paper • 2309.11436 • Published • 1
UniAudio: An Audio Foundation Model Toward Universal Audio Generation
Paper • 2310.00704 • Published • 19
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
Paper • 2310.03734 • Published • 14
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
Paper • 2310.03739 • Published • 21
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 37
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
Paper • 2310.08541 • Published • 17
Visual Storytelling with Question-Answer Plans
Paper • 2310.05295 • Published • 1
Aligning Large Multimodal Models with Factually Augmented RLHF
Paper • 2309.14525 • Published • 29
Toward Joint Language Modeling for Speech Units and Text
Paper • 2310.08715 • Published • 7
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Paper • 2306.05425 • Published • 11
NExT-GPT: Any-to-Any Multimodal LLM
Paper • 2309.05519 • Published • 78
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper • 2310.05737 • Published • 4
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages
Paper • 2305.04160 • Published • 2
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Paper • 2310.09478 • Published • 19
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models
Paper • 2308.13437 • Published • 3
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 6
Ziya-VL: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning
Paper • 2310.08166 • Published • 1
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Paper • 2310.08825 • Published • 1
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Paper • 2310.00582 • Published • 1
Large-Scale Automatic Audiobook Creation
Paper • 2309.03926 • Published • 53
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Paper • 2310.02992 • Published • 4
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models
Paper • 2309.04041 • Published • 1
Multimodal Graph Learning for Generative Tasks
Paper • 2310.07478 • Published • 1
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Paper • 2309.09958 • Published • 18
TextBind: Multi-turn Interleaved Multimodal Instruction-following
Paper • 2309.08637 • Published • 7
ImageBind-LLM: Multi-modality Instruction Tuning
Paper • 2309.03905 • Published • 16
Never-ending Learning of User Interfaces
Paper • 2308.08726 • Published • 1
LMDX: Language Model-based Document Information Extraction and Localization
Paper • 2309.10952 • Published • 65
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
Paper • 2310.15308 • Published • 22
TiC-CLIP: Continual Training of CLIP Models
Paper • 2310.16226 • Published • 8
ConvNets Match Vision Transformers at Scale
Paper • 2310.16764 • Published • 20
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
Paper • 2310.16825 • Published • 32
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
Paper • 2310.16656 • Published • 40
From Sparse to Soft Mixtures of Experts
Paper • 2308.00951 • Published • 20
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Paper • 2308.06093 • Published • 2
Self-slimmed Vision Transformer
Paper • 2111.12624 • Published • 1
Robustifying Token Attention for Vision Transformers
Paper • 2303.11126 • Published • 1
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
Paper • 2306.04073 • Published • 2
Retrieval-Augmented Multimodal Language Modeling
Paper • 2211.12561 • Published • 1
Long-range Language Modeling with Self-retrieval
Paper • 2306.13421 • Published • 16
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning
Paper • 2303.08566 • Published • 1
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation
Paper • 2205.08180 • Published • 1
PVP: Pre-trained Visual Parameter-Efficient Tuning
Paper • 2304.13639 • Published • 1
Do We Really Need a Large Number of Visual Prompts?
Paper • 2305.17223 • Published • 1
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Paper • 2305.06324 • Published • 1
Zorro: the masked multimodal transformer
Paper • 2301.09595 • Published • 2
Attention Bottlenecks for Multimodal Fusion
Paper • 2107.00135 • Published • 1
Contrastive Audio-Visual Masked Autoencoder
Paper • 2210.07839 • Published • 1
On Robustness in Multimodal Learning
Paper • 2304.04385 • Published • 1
Meta-Transformer: A Unified Framework for Multimodal Learning
Paper • 2307.10802 • Published • 43
Using Multiple Instance Learning to Build Multimodal Representations
Paper • 2212.05561 • Published • 1
LMEye: An Interactive Perception Network for Large Language Models
Paper • 2305.03701 • Published • 1
Concept-Oriented Deep Learning with Large Language Models
Paper • 2306.17089 • Published • 1
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models
Paper • 2309.06495 • Published • 1
Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models
Paper • 2309.08922 • Published • 1
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering
Paper • 2305.03453 • Published • 1
ViperGPT: Visual Inference via Python Execution for Reasoning
Paper • 2303.08128 • Published • 2
Visual Programming: Compositional visual reasoning without training
Paper • 2211.11559 • Published • 1
Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
Paper • 2210.15037 • Published • 1
Diversifying Joint Vision-Language Tokenization Learning
Paper • 2306.03421 • Published • 1
Joint Adaptive Representations for Image-Language Learning
Paper • 2305.19924 • Published • 1
Modular Visual Question Answering via Code Generation
Paper • 2306.05392 • Published • 2
TouchStone: Evaluating Vision-Language Models by Language Models
Paper • 2308.16890 • Published • 1
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Paper • 2309.07915 • Published • 4
VIGC: Visual Instruction Generation and Correction
Paper • 2308.12714 • Published • 1
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Paper • 2310.04378 • Published • 19
MetaFormer Is Actually What You Need for Vision
Paper • 2111.11418 • Published • 1
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Paper • 2310.11441 • Published • 26
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning
Paper • 2310.12274 • Published • 11
Matryoshka Diffusion Models
Paper • 2310.15111 • Published • 40
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Paper • 2310.11440 • Published • 15
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Paper • 2303.13755 • Published • 1
DSG: An End-to-End Document Structure Generator
Paper • 2310.09118 • Published • 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Paper • 2310.19909 • Published • 20
Beyond U: Making Diffusion Models Faster & Lighter
Paper • 2310.20092 • Published • 11
i-Code Studio: A Configurable and Composable Framework for Integrative AI
Paper • 2305.13738 • Published • 1
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Paper • 2306.08640 • Published • 26
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
Paper • 2309.10707 • Published • 1
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Paper • 2306.04387 • Published • 8
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Paper • 2308.00675 • Published • 35
Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task
Paper • 2307.03972 • Published • 1
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Paper • 2306.17107 • Published • 11
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
Paper • 2305.18752 • Published • 3
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts
Paper • 2305.14839 • Published • 1
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Paper • 2311.00430 • Published • 56
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Paper • 2309.13876 • Published • 1
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Paper • 2309.15701 • Published • 2
Massive End-to-end Models for Short Search Queries
Paper • 2309.12963 • Published • 1
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
Paper • 2310.06434 • Published • 4
MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription
Paper • 2108.02625 • Published • 1
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 40
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Paper • 2309.01947 • Published • 1
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Paper • 2211.02077 • Published • 1
MUTEX: Learning Unified Policies from Multimodal Task Specifications
Paper • 2309.14320 • Published • 1
Linking Representations with Multimodal Contrastive Learning
Paper • 2304.03464 • Published • 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Paper • 2211.11315 • Published • 1
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
Paper • 2311.02772 • Published • 3
FLAP: Fast Language-Audio Pre-training
Paper • 2311.01615 • Published • 16
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
Paper • 2304.06461 • Published • 1
UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation
Paper • 2303.05668 • Published • 1
One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification
Paper • 2305.17394 • Published • 1
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations
Paper • 2203.16965 • Published • 1
Task-Agnostic Structured Pruning of Speech Representation Models
Paper • 2306.01385 • Published • 1
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation
Paper • 2305.11685 • Published • 2
Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition
Paper • 2303.13072 • Published • 1
MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval
Paper • 2309.01516 • Published • 1
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
Paper • 2212.03220 • Published • 1
Residual Mixture of Experts
Paper • 2204.09636 • Published • 1
End-to-end Knowledge Retrieval with Multi-modal Queries
Paper • 2306.00424 • Published • 1
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering
Paper • 2304.13649 • Published • 1
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Paper • 2311.04589 • Published • 18
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning
Paper • 2211.13218 • Published • 1
When Prompt-based Incremental Learning Does Not Meet Strong Pretraining
Paper • 2308.10445 • Published • 1
PILOT: A Pre-Trained Model-Based Continual Learning Toolbox
Paper • 2309.07117 • Published • 2
A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning
Paper • 2210.04428 • Published • 1
A soft nearest-neighbor framework for continual semi-supervised learning
Paper • 2212.05102 • Published • 1
Avalanche: an End-to-End Library for Continual Learning
Paper • 2104.00405 • Published • 1
SequeL: A Continual Learning Library in PyTorch and JAX
Paper • 2304.10857 • Published • 1
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
Paper • 2306.06446 • Published • 1
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training
Paper • 2306.17165 • Published • 1
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts
Paper • 2105.03036 • Published • 2
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition
Paper • 2307.05956 • Published • 1
Cross-token Modeling with Conditional Computation
Paper • 2109.02008 • Published • 1
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
Paper • 2311.05997 • Published • 36
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Paper • 2311.06243 • Published • 17
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Paper • 2311.05908 • Published • 12
Continual Learning for Monolingual End-to-End Automatic Speech Recognition
Paper • 2112.09427 • Published • 1
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
Paper • 2208.08340 • Published • 1
MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification
Paper • 2309.09276 • Published • 1
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Paper • 2306.15706 • Published • 1
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Paper • 2311.07463 • Published • 13
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
Paper • 2311.05556 • Published • 80
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach
Paper • 2310.12004 • Published • 2
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Paper • 2304.08953 • Published • 1
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition
Paper • 2209.15176 • Published • 1
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
Paper • 2309.08876 • Published • 1
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper • 2312.04410 • Published • 14
Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system
Paper • 2211.01571 • Published • 1
E-Branchformer: Branchformer with Enhanced merging for speech recognition
Paper • 2210.00077 • Published • 1
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
Paper • 2309.10713 • Published • 1
EfficientFormer: Vision Transformers at MobileNet Speed
Paper • 2206.01191 • Published • 1
COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models
Paper • 2305.17235 • Published • 2
Semi-Autoregressive Streaming ASR With Label Context
Paper • 2309.10926 • Published • 1
eP-ALM: Efficient Perceptual Augmentation of Language Models
Paper • 2303.11403 • Published • 3
OneLLM: One Framework to Align All Modalities with Language
Paper • 2312.03700 • Published • 20
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper • 2312.17172 • Published • 26
Augmenting text for spoken language understanding with Large Language Models
Paper • 2309.09390 • Published • 2
Audiobox: Unified Audio Generation with Natural Language Prompts
Paper • 2312.15821 • Published • 12
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
Paper • 2312.14385 • Published • 5
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 14
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Paper • 2311.15075 • Published • 1
MLLMs-Augmented Visual-Language Representation Learning
Paper • 2311.18765 • Published • 1
InfMLLM: A Unified Framework for Visual-Language Tasks
Paper • 2311.06791 • Published • 3
Generative Multimodal Models are In-Context Learners
Paper • 2312.13286 • Published • 34
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Paper • 2311.18799 • Published • 1
Training Transformers Together
Paper • 2207.03481 • Published • 5
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Paper • 2309.07623 • Published • 1
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 181
Diffusion Model Alignment Using Direct Preference Optimization
Paper • 2311.12908 • Published • 47
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
Paper • 2312.03491 • Published • 34
Efficient Monotonic Multihead Attention
Paper • 2312.04515 • Published • 6
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Paper • 2311.07919 • Published • 9
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Paper • 2311.07575 • Published • 13
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
Paper • 2311.06753 • Published • 6
LayoutPrompter: Awaken the Design Ability of Large Language Models
Paper • 2311.06495 • Published • 10
Honeybee: Locality-enhanced Projector for Multimodal LLM
Paper • 2312.06742 • Published • 9
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Paper • 2309.08513 • Published • 1
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
Paper • 2305.05189 • Published • 2
TextDiffuser: Diffusion Models as Text Painters
Paper • 2305.10855 • Published • 3
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing
Paper • 2304.02051 • Published • 4
DITTO: Diffusion Inference-Time T-Optimization for Music Generation
Paper • 2401.12179 • Published • 19
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion
Paper • 2401.11053 • Published • 9
FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder
Paper • 2401.10032 • Published • 12
BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural network for speech super-resolution
Paper • 2312.13722 • Published • 1
Incremental FastPitch: Chunk-based High Quality Text to Speech
Paper • 2401.01755 • Published • 8
CoMoSVC: Consistency Model-based Singing Voice Conversion
Paper • 2401.01792 • Published • 8
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction
Paper • 2401.06387 • Published • 1
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Paper • 2311.14957 • Published • 2
VMamba: Visual State Space Model
Paper • 2401.10166 • Published • 37
Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost
Paper • 2312.08614 • Published • 1
MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper • 2401.13601 • Published • 44
ModaVerse: Efficiently Transforming Modalities with LLMs
Paper • 2401.06395 • Published • 3
Video Understanding with Large Language Models: A Survey
Paper • 2312.17432 • Published • 2
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Paper • 2401.00246 • Published • 10
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Paper • 2311.15759 • Published • 1
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
Paper • 2401.13311 • Published • 10
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models
Paper • 2309.01479 • Published • 1
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition
Paper • 2209.08326 • Published • 1
Mixture-of-experts VAEs can disregard variation in surjective multimodal data
Paper • 2204.05229 • Published • 1
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code
Paper • 2205.06126 • Published • 1
simple diffusion: End-to-end diffusion for high resolution images
Paper • 2301.11093 • Published • 2
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Paper • 2401.09417 • Published • 58
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation
Paper • 2401.13560 • Published • 1
Vivim: a Video Vision Mamba for Medical Video Object Segmentation
Paper • 2401.14168 • Published • 2
2-D SSM: A General Spatial Layer for Visual Transformers
Paper • 2306.06635 • Published • 1
IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers
Paper • 2304.14400 • Published • 4
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Paper • 2312.09911 • Published • 53
StarVector: Generating Scalable Vector Graphics Code from Images
Paper • 2312.11556 • Published • 27
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
Paper • 2401.16658 • Published • 13
OtterHD: A High-Resolution Multi-modality Model
Paper • 2311.04219 • Published • 31
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs
Paper • 2311.04901 • Published • 7
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Paper • 2311.04257 • Published • 20
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Paper • 2311.05348 • Published • 11
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 45
Link-Context Learning for Multimodal LLMs
Paper • 2308.07891 • Published • 15
Empowering LLM to use Smartphone for Intelligent Task Automation
Paper • 2308.15272 • Published • 1
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Paper • 2206.02770 • Published • 3
Deep Lifelong Cross-modal Hashing
Paper • 2304.13357 • Published • 1
Multi-Dimensional Hyena for Spatial Inductive Bias
Paper • 2309.13600 • Published • 1
FLatten Transformer: Vision Transformer using Focused Linear Attention
Paper • 2308.00442 • Published • 1
PALO: A Polyglot Large Multimodal Model for 5B People
Paper • 2402.14818 • Published • 23
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Paper • 2401.03945 • Published
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper • 2402.14289 • Published • 19
MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks
Paper • 2309.14118 • Published
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Paper • 2402.19481 • Published • 20
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper • 2402.13232 • Published • 13
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Paper • 2402.13577 • Published • 7
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters
Paper • 2403.02677 • Published • 16
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Paper • 2403.00231 • Published • 1
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75
Diffusion Models Without Attention
Paper • 2311.18257 • Published • 2
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
Paper • 2403.13447 • Published • 18
MambaIR: A Simple Baseline for Image Restoration with State-Space Model
Paper • 2402.15648 • Published
FiT: Flexible Vision Transformer for Diffusion Model
Paper • 2402.12376 • Published • 48
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces
Paper • 2403.07711 • Published
Scalable Diffusion Models with State Space Backbone
Paper • 2402.05608 • Published
LocalMamba: Visual State Space Model with Windowed Selective Scan
Paper • 2403.09338 • Published • 7
VideoMamba: State Space Model for Efficient Video Understanding
Paper • 2403.06977 • Published • 27
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity
Paper • 2402.08846 • Published • 1
Transparent Image Layer Diffusion using Latent Transparency
Paper • 2402.17113 • Published • 5
On Speculative Decoding for Multimodal Large Language Models
Paper • 2404.08856 • Published • 13
Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models
Paper • 2405.14828 • Published
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Paper • 2311.08046 • Published • 1
UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models
Paper • 2405.10311 • Published
Matryoshka Multimodal Models
Paper • 2405.17430 • Published • 30
Speak While You Think: Streaming Speech Synthesis During Text Generation
Paper • 2309.11210 • Published
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Paper • 2406.11271 • Published • 18
GrootVL: Tree Topology is All You Need in State Space Model
Paper • 2406.02395 • Published