---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
---

# SmolVLM-256M-Detection

> Experimental and for learning purposes; I wouldn't recommend using it in practice. Check out [github/shreydan/VLM-OD](https://github.com/shreydan/VLM-OD/) for results and details.

## Usage

- Load the model the same way as `HuggingFaceTB/SmolVLM-256M-Instruct`.
- Inputs: `detect car` / `detect person;car`, etc. Apply the chat template with `add_generation_prompt=True`.
- Parse the output tokens `` to `` (code in `eval.ipynb` of my GitHub repo).
- To reiterate: I have not added any `` special tokens (that would need far more training than this method); the model itself is generating them as plain text.

![outputs](https://raw.githubusercontent.com/shreydan/VLM-OD/refs/heads/master/outputs.png)
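The loading and prompting steps above can be sketched as follows. This is a minimal, hedged sketch: the model id string and image path are placeholders (substitute the actual hub path of this card), and the decoding/parsing of the coordinate output is left to `eval.ipynb` in the repo.

```python
# Minimal usage sketch, assuming the standard SmolVLM-256M-Instruct loading path.
# "SmolVLM-256M-Detection" and "street.jpg" below are placeholders, not verified paths.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor


def build_messages(query: str) -> list:
    """Wrap a detection query (e.g. 'detect car' or 'detect person;car')
    in the chat format expected by apply_chat_template."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": query},
            ],
        }
    ]


def detect(model_id: str, image: Image.Image, query: str) -> str:
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)
    # Apply the chat template with add_generation_prompt=True, per the card.
    prompt = processor.apply_chat_template(
        build_messages(query), add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=128)
    # No special location tokens were added to the vocabulary, so the
    # coordinates come out as plain text; parse them downstream.
    return processor.batch_decode(generated, skip_special_tokens=False)[0]


if __name__ == "__main__":
    img = Image.open("street.jpg")  # hypothetical image path
    print(detect("SmolVLM-256M-Detection", img, "detect car"))
```
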