---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
---

# SmolVLM-256M-Detection

> Experimental and for learning purposes; I wouldn't recommend using it in practice. Check out [github/shreydan/VLM-OD](https://github.com/shreydan/VLM-OD/) for results and details.

## Usage

- Load the model the same way as `HuggingFaceTB/SmolVLM-256M-Instruct`.
- Inputs: `detect car` / `detect person;car`, etc. Apply the chat template with `add_generation_prompt=True`.
- Parse the output tokens `` to `` (code in `eval.ipynb` of my GitHub repo).
- To reiterate: I have not added any `` special tokens (that would need far more training than this method); the model itself is generating them as plain text.

![outputs](https://raw.githubusercontent.com/shreydan/VLM-OD/refs/heads/master/outputs.png)
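The loading and prompting steps above can be sketched as follows. This is a minimal, hedged sketch: the model id string and image path are placeholders (substitute the actual hub path of this card), and the decoding/parsing of the coordinate output is left to `eval.ipynb` in the repo.

```python
# Minimal usage sketch, assuming the standard SmolVLM-256M-Instruct loading path.
# "SmolVLM-256M-Detection" and "street.jpg" below are placeholders, not verified paths.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor


def build_messages(query: str) -> list:
    """Wrap a detection query (e.g. 'detect car' or 'detect person;car')
    in the chat format expected by apply_chat_template."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": query},
            ],
        }
    ]


def detect(model_id: str, image: Image.Image, query: str) -> str:
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForVision2Seq.from_pretrained(model_id)
    # Apply the chat template with add_generation_prompt=True, per the card.
    prompt = processor.apply_chat_template(
        build_messages(query), add_generation_prompt=True
    )
    inputs = processor(text=prompt, images=[image], return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=128)
    # No special location tokens were added to the vocabulary, so the
    # coordinates come out as plain text; parse them downstream.
    return processor.batch_decode(generated, skip_special_tokens=False)[0]


if __name__ == "__main__":
    img = Image.open("street.jpg")  # hypothetical image path
    print(detect("SmolVLM-256M-Detection", img, "detect car"))
```
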