Xiaohan Wang

nicholswang

https://wxh1996.github.io/

XiaohanWang96

AI & ML interests

Video Understanding, Vision-Language Models

Recent Activity

published an article about 1 month ago

TimeScope: How Long Can Your Video Large Multimodal Model Go?

published a dataset 2 months ago

nicholswang/TimeLens

upvoted a paper 5 months ago

SmolVLM: Redefining small and efficient multimodal models

upvoted 3 papers 5 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 197

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Paper • 2503.23145 • Published Mar 29 • 36

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

Paper • 2503.13399 • Published Mar 17 • 22

upvoted a paper 6 months ago

Video Action Differencing

Paper • 2503.07860 • Published Mar 10 • 34

upvoted 2 papers 7 months ago

Temporal Preference Optimization for Long-Form Video Understanding

Paper • 2501.13919 • Published Jan 23 • 23

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Paper • 2501.07171 • Published Jan 13 • 56

upvoted 3 papers 8 months ago

Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Paper • 2501.03225 • Published Jan 6 • 7

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Paper • 2412.13180 • Published Dec 17, 2024 • 13

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 147

upvoted 10 papers 11 months ago

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Paper • 2409.19603 • Published Sep 29, 2024 • 19

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Paper • 2410.00890 • Published Oct 1, 2024 • 20

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

Paper • 2410.00337 • Published Oct 1, 2024 • 11

Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation

Paper • 2409.18313 • Published Sep 26, 2024 • 3

What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study

Paper • 2410.00545 • Published Oct 1, 2024 • 5

Illustrious: an Open Advanced Illustration Model

Paper • 2409.19946 • Published Sep 30, 2024 • 16

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

Paper • 2410.00086 • Published Sep 30, 2024 • 12

Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

Paper • 2410.00418 • Published Oct 1, 2024 • 10

Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models

Paper • 2410.00231 • Published Sep 30, 2024 • 8

DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Paper • 2409.20563 • Published Sep 30, 2024 • 9

upvoted a paper about 1 year ago

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Paper • 2407.06189 • Published Jul 8, 2024 • 27