---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- gemma-3-4b-it
---

Quantizations of https://huggingface.co/google/gemma-3-4b-it

**Note**: you will need llama.cpp [b4875](https://github.com/ggml-org/llama.cpp/releases/tag/b4875) or later to run the model.
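
As a minimal sketch of how these quantizations can be used directly with llama.cpp (the repository id and the `Q4_K_M` filename below are placeholders; substitute whichever quantization file you actually download):

```sh
# Download one of the GGUF files from this repository (placeholder names).
huggingface-cli download <this-repo-id> gemma-3-4b-it.Q4_K_M.gguf --local-dir .

# Run a one-off generation with llama.cpp (build b4875 or later).
# -ngl 99 offloads all layers to the GPU; omit it for CPU-only inference.
llama-cli -m gemma-3-4b-it.Q4_K_M.gguf -ngl 99 -p "Why is the sky blue?" -n 256
```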
### Open source inference clients/UIs

* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [jan](https://github.com/janhq/jan)
* [GPT4All](https://github.com/nomic-ai/gpt4all)

### Closed source inference clients/UIs

* [LM Studio](https://lmstudio.ai/)
* [Msty](https://msty.app/)
* [Backyard AI](https://backyard.ai/)
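
Several of the clients above can also connect to a locally running llama.cpp server through its OpenAI-compatible API instead of loading the GGUF file themselves. A minimal sketch, again with a placeholder quantization filename:

```sh
# Serve the quantized model locally (llama.cpp b4875 or later).
llama-server -m gemma-3-4b-it.Q4_K_M.gguf -ngl 99 --port 8080

# Query the OpenAI-compatible chat completions endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write one sentence about llamas."}]}'
```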
---

# From original readme

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

### Inputs and outputs

- **Input:**
    - Text string, such as a question, a prompt, or a document to be summarized
    - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
    - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size

- **Output:**
    - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
    - Total output context of 8192 tokens

### Usage
Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library with the version made for Gemma 3:

```sh
$ pip install git+https://github.com/huggingface/[email protected]
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running with the `pipeline` API

You can initialize the model and processor for inference with `pipeline` as follows.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)
```
With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass them to the pipeline.
```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0][0]["generated_text"][-1]["content"])
# Okay, let's take a look!
# Based on the image, the animal on the candy is a **turtle**.
# You can see the shell shape and the head and legs.
```

#### Running the model on a single/multi GPU
```python
# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-4b-it"

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto",
    torch_dtype=torch.bfloat16  # match the bfloat16 cast applied to the inputs below
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.
```