DiarizationLM: Speaker Diarization Post-Processing with Large Language Models Paper • 2401.03506 • Published Jan 7, 2024 • 14
The VoxCeleb Speaker Recognition Challenge: A Retrospective Paper • 2408.14886 • Published Aug 27, 2024 • 11
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Paper • 2402.00340 • Published Feb 1, 2024 • 2
SuperBPE Collection SuperBPE tokenizers and models trained with them • 8 items • Updated Apr 10 • 14
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders Paper • 2410.22366 • Published Oct 28, 2024 • 83
Pix2Gif: Motion-Guided Diffusion for GIF Generation Paper • 2403.04634 • Published Mar 7, 2024 • 18
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27, 2024 • 618
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model Paper • 2402.03766 • Published Feb 6, 2024 • 15
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases Paper • 2312.15011 • Published Dec 22, 2023 • 18
Boundary Attention: Learning to Find Faint Boundaries at Any Resolution Paper • 2401.00935 • Published Jan 1, 2024 • 18
Context Tuning for Retrieval Augmented Generation Paper • 2312.05708 • Published Dec 9, 2023 • 16
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis Paper • 2311.12454 • Published Nov 21, 2023 • 31
PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction Paper • 2311.12024 • Published Nov 20, 2023 • 20
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning Paper • 2311.12631 • Published Nov 21, 2023 • 15
Make Pixels Dance: High-Dynamic Video Generation Paper • 2311.10982 • Published Nov 18, 2023 • 69
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression Paper • 2311.10794 • Published Nov 17, 2023 • 28
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning Paper • 2311.10709 • Published Nov 17, 2023 • 26
UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework Paper • 2311.10125 • Published Nov 16, 2023 • 6