SmolVLM-256M-Detection

experimental and for learning purposes; wouldn't recommend using unless

check out github/shreydan/VLM-OD for results and details.

Usage

load the model same as HuggingFaceTB/SmolVLM-256M-Instruct
inputs: detect car / detect person;car etc. Apply chat template with add_generation_prompt=True
parse the output tokens <loc000> to <loc255> (code in eval.ipynb of my github repo)
to reiterate I have not added any <locXXX> special tokens (that needs wayyy more training than this method), the model itself is generating them.

Model tree for shreydan/SmolVLM-256M-Detection