JoeJoe1313 committed · Commit 0361bfb · Parent(s): 5e0609b

posts

Files changed:

- src/posts/2025-02-12-fine-tuning-lora-mlx/images/lora.jpg +3 -0
- src/posts/2025-02-12-fine-tuning-lora-mlx/index.qmd +234 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/input.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_1.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_2.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_3.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_4.png +3 -0
- src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/index.qmd +341 -0
src/posts/2025-02-12-fine-tuning-lora-mlx/images/lora.jpg ADDED (Git LFS)

src/posts/2025-02-12-fine-tuning-lora-mlx/index.qmd ADDED

@@ -0,0 +1,234 @@
---
title: "Fine-Tuning LLMs with LoRA and MLX-LM"
author: "Joana Levtcheva"
date: "2025-02-12"
categories: [Machine Learning, mlx, llm]
draft: false
---

This blog post is a tutorial on how to fine-tune an LLM with LoRA and the `mlx-lm` package. The Medium post can be found [here](https://medium.com/@levchevajoana/fine-tuning-llms-with-lora-and-mlx-lm-c0b143642deb) and the Substack post [here](https://substack.com/home/post/p-157008884).

## Introduction

[MLX](https://opensource.apple.com/projects/mlx/) is an array framework tailored for efficient machine learning research on Apple silicon. Its biggest strength is that it leverages the unified memory architecture of Apple devices and offers a familiar, NumPy-like API. Apple has also developed a package for LLM text generation, fine-tuning, and more, called [MLX LM](https://github.com/ml-explore/mlx-examples/blob/main/llms/README.md).
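
To give a quick feel for that NumPy-like API, here is a minimal, self-contained sketch (purely illustrative and separate from the fine-tuning workflow below; MLX evaluates arrays lazily, so `mx.eval` is what actually triggers the computation):

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.ones(3)
c = a * 2 + b   # operations build a lazy graph, much like writing NumPy
mx.eval(c)      # the computation actually runs here, in unified memory
print(c)        # [3, 5, 7]
```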

Overall, `mlx-lm` supports many Hugging Face-format LLMs. With `mlx-lm` it is also very easy to load models directly from the Hugging Face [MLX Community](https://huggingface.co/mlx-community), a hub of pre-converted model weights that run on Apple silicon and are ready to use with the framework. The framework also supports parameter-efficient fine-tuning ([PEFT](https://huggingface.co/blog/peft)) with [LoRA and QLoRA](https://github.com/ml-explore/mlx-examples/tree/main/lora). You can find more information about LoRA in the original [paper](https://arxiv.org/abs/2106.09685).
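
As a quick refresher on the idea behind LoRA: instead of updating a full weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the method keeps $W_0$ frozen and learns a low-rank update, so a linear layer computes

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k),
$$

and only $A$ and $B$ are trained. The `rank` and `scale` values we configure later roughly play the roles of $r$ and the scaling applied to the low-rank term.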

In this tutorial, with the help of the `mlx-lm` package, we are going to load the [Mistral-7B-Instruct-v0.3-4bit](https://medium.com/r/?url=https%3A%2F%2Fhuggingface.co%2Fmlx-community%2FMistral-7B-Instruct-v0.3-4bit) model from the MLX Community space, and attempt to fine-tune it with LoRA and the dataset [win-wang/Machine_Learning_QA_Collection](https://medium.com/r/?url=https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fwin-wang%2FMachine_Learning_QA_Collection). Let's begin.

## Packages and Model Loading

First, we have to load the needed packages.

```python
import json
import os
from typing import Dict, List, Optional, Tuple, Union

import matplotlib.pyplot as plt
import mlx.optimizers as optim
from mlx.utils import tree_flatten
from mlx_lm import generate, load
from mlx_lm.tuner import TrainingArgs, datasets, linear_to_lora_layers, train
from transformers import PreTrainedTokenizer
```

Then, we load the model and tokenizer.

```python
model_path = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(model_path)
```

Let's see what our model outputs when given a simple prompt such as *"What is fine-tuning in machine learning?"*.

```python
prompt = "What is fine-tuning in machine learning?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, verbose=True)
```

The generated output of the model is:

```
Fine-tuning in machine learning refers to the process of taking a pre-trained model, which has already been trained on a large dataset for a specific task, and adapting it to a new, related task or a different aspect of the same task.

For example, imagine you have a pre-trained model that can recognize different types of animals. You can fine-tune this model to recognize specific breeds of dogs, or even to recognize different types of flowers. The idea is that the pre-trained model has already learned some general features that are useful for the new task, and fine-tuning helps the model to learn the specific details that are important for the new task.

Fine-tuning is often used when you have a small dataset for the new task, as it allows you to leverage the knowledge the model has already gained from the large pre-training dataset. It's a common technique in deep learning, particularly for tasks like image classification, natural language processing, and speech recognition.
```

## Preparation for Fine-Tuning

Let's create an `adapters` directory, along with the paths to the adapter configuration (in our case the LoRA configuration) and the adapter weights file.

```python
adapter_path = "adapters"
os.makedirs(adapter_path, exist_ok=True)
adapter_config_path = os.path.join(adapter_path, "adapter_config.json")
adapter_file_path = os.path.join(adapter_path, "adapters.safetensors")
```

We have to set our LoRA parameter configuration. This can be done in a separate `.yml` file, as shown [here](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/lora_config.yaml), but for code simplicity and for the sake of showing the process of fine-tuning with LoRA and `mlx-lm`, we are going to stick to this simple in-code configuration,

```python
lora_config = {
    "num_layers": 8,
    "lora_parameters": {
        "rank": 8,
        "scale": 20.0,
        "dropout": 0.0,
    },
}
```

which we save into the adapters directory we already created.

```python
with open(adapter_config_path, "w") as f:
    json.dump(lora_config, f, indent=4)
```

We can also set our training arguments, specifying the adapter file, the number of iterations to perform, and how many training steps to run between evaluations.

```python
training_args = TrainingArgs(
    adapter_file=adapter_file_path,
    iters=200,
    steps_per_eval=50,
)
```

In the LoRA framework, most of the model's original parameters remain unchanged during fine-tuning. The `model.freeze()` command is used to set these parameters to a non-trainable state so that their weights aren't updated during backpropagation. This way, only the newly introduced low-rank adaptation matrices (LoRA parameters) are optimized, reducing computational overhead and memory usage while preserving the original model's knowledge.

The `linear_to_lora_layers` function converts or wraps some of the model's linear layers into LoRA layers. Essentially, it replaces (or augments) selected linear layers with their LoRA counterparts, which include the additional low-rank matrices that will be trained. The configuration parameters (like the number of layers and the specific LoRA parameters) determine which layers are modified and how the LoRA adapters are set up.

We should also verify that only a small subset of parameters is set for training, and activate training mode while preserving the frozen state of the main model parameters.

```python
model.freeze()
linear_to_lora_layers(model, lora_config["num_layers"], lora_config["lora_parameters"])
num_train_params = sum(v.size for _, v in tree_flatten(model.trainable_parameters()))
print(f"Number of trainable parameters: {num_train_params}")
model.train()
```

We can also create a class to track the training and validation losses during the training process

```python
class Metrics:
    def __init__(self) -> None:
        self.train_losses: List[Tuple[int, float]] = []
        self.val_losses: List[Tuple[int, float]] = []

    def on_train_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.train_losses.append((info["iteration"], info["train_loss"]))

    def on_val_loss_report(self, info: Dict[str, Union[float, int]]) -> None:
        self.val_losses.append((info["iteration"], info["val_loss"]))
```

and create an instance of this class.

```python
metrics = Metrics()
```

## Data Loading

Here, we are creating a simplified variant of the following [function](https://github.com/ml-explore/mlx-examples/blob/ec30dc35382d87614f51fe7590f015f93a491bfd/llms/mlx_lm/tuner/datasets.py#L163-L187) for loading a Hugging Face dataset.

```python
def custom_load_hf_dataset(
    data_id: str,
    tokenizer: PreTrainedTokenizer,
    names: Tuple[str, str, str] = ("train", "valid", "test"),
):
    from datasets import exceptions, load_dataset

    try:
        dataset = load_dataset(data_id)

        train, valid, test = [
            (
                datasets.create_dataset(dataset[n], tokenizer)
                if n in dataset.keys()
                else []
            )
            for n in names
        ]

    except exceptions.DatasetNotFoundError:
        raise ValueError(f"Not found Hugging Face dataset: {data_id} .")

    return train, valid, test
```

Then, let's load the `win-wang/Machine_Learning_QA_Collection` dataset from Hugging Face.

```python
train_set, val_set, test_set = custom_load_hf_dataset(
    data_id="win-wang/Machine_Learning_QA_Collection",
    tokenizer=tokenizer,
    names=("train", "validation", "test"),
)
```

## Fine-Tuning

Finally, we can begin the LoRA fine-tuning process by calling the `train()` function.

```python
train(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    optimizer=optim.Adam(learning_rate=1e-5),
    train_dataset=train_set,
    val_dataset=val_set,
    training_callback=metrics,
)
```

After the training is completed, we can also plot the training and validation losses.

```python
train_its, train_losses = zip(*metrics.train_losses)
validation_its, validation_losses = zip(*metrics.val_losses)
plt.plot(train_its, train_losses, "-o", label="Train")
plt.plot(validation_its, validation_losses, "-o", label="Validation")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()
```

For example, one of the training runs produced the following loss curves.



## Test the Fine-Tuned Model

Now, we can load the fine-tuned model by specifying the `adapter_path`,

```python
model_lora, _ = load(model_path, adapter_path=adapter_path)
```

and we can generate an output for the same prompt as earlier.

```python
response = generate(model_lora, tokenizer, prompt=prompt, verbose=True)
```

The generated response is:

```
Fine-tuning in machine learning refers to the process of adjusting the parameters of a pre-trained model to adapt it to a specific task or dataset. This approach is often used when the available data is limited, as it allows the model to leverage the knowledge it has already gained from previous training. Fine-tuning can improve the performance of a model on a new task, making it a valuable technique in many machine learning applications.
```

## Conclusion

In this tutorial, we explored how to leverage MLX LM and LoRA for fine-tuning large language models on Apple silicon. We started by setting up the necessary environment, loading a pre-trained model from the MLX Community, and preparing our dataset from Hugging Face. By converting selected linear layers into LoRA adapters and freezing the majority of the model's weights, we efficiently fine-tuned the model using a modest computational footprint. This approach not only optimizes resource usage but also opens the door to experimenting with different fine-tuning strategies and datasets. Further modifications can be explored, such as experimenting with other adapter configurations like QLoRA (which extends LoRA by integrating quantization techniques), fusing adapters into the base model, integrating additional evaluation metrics to better understand a model's performance, and more. Happy fine-tuning!

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/input.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_1.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_2.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_3.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/images/output_4.png ADDED (Git LFS)

src/posts/2025-02-13-qwen2_5-vl-mlx-vlm/index.qmd ADDED

@@ -0,0 +1,341 @@
---
title: "Qwen2.5-VL with MLX-VLM"
date: "2025-02-13"
categories: [Machine Learning, mlx, vlm]
draft: false
---

In this post, we present a tutorial on using the [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) model with [MLX-VLM](https://github.com/Blaizzy/mlx-vlm) for visual understanding tasks. We are going to cover:

- Loading the model and image
- Generating a natural language description of an image
- Performing object detection in different scenarios and outputting bounding boxes in JSON format
- Visualizing the results

The Medium post can be found [here](https://medium.com/@levchevajoana/qwen2-5-vl-with-mlx-vlm-c4329b40ab87) and the Substack post [here](https://substack.com/home/post/p-157062287).

# Introduction

[Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) is the latest flagship vision-language model from the Qwen series, representing a significant advancement over its predecessor, [Qwen2-VL](https://arxiv.org/abs/2409.12191). This model is designed to enhance visual understanding and interaction capabilities across various domains. Key features of Qwen2.5-VL include:

- **Enhanced Visual Recognition:** The model excels at identifying a wide range of objects, including plants, animals, landmarks, and products. It also proficiently analyzes texts, charts, icons, graphics, and layouts within images.
- **Agentic Abilities:** Qwen2.5-VL functions as a visual agent capable of reasoning and dynamically directing tools, enabling operations on devices like computers and mobile phones.
- **Advanced Video Comprehension:** The model can understand lengthy videos exceeding one hour and can pinpoint specific events by identifying relevant video segments.
- **Accurate Visual Localization:** It can precisely locate objects within images by generating bounding boxes or points and provides structured JSON outputs detailing absolute coordinates and attributes.
- **Structured Data Output:** Qwen2.5-VL supports the generation of structured outputs from data such as scanned invoices, forms, and tables, benefiting applications in finance and commerce.

Performance evaluations indicate that the flagship model, Qwen2.5-VL-72B-Instruct, delivers competitive results across various benchmarks, including college-level problem-solving, mathematics, document comprehension, general question answering, and video understanding. Notably, it demonstrates significant strengths in interpreting documents and diagrams and operates effectively as a visual agent without the need for task-specific fine-tuning.

For developers and users interested in exploring Qwen2.5-VL, both base and instruct models are available in 3B, 7B, and 72B parameter sizes on platforms like Hugging Face. Additionally, the model can be used through [Qwen Chat](https://chat.qwenlm.ai).

# Tutorial

## Loading Packages

We begin by importing the necessary libraries. We are going to use the `mlx_vlm` package to load and operate with our Qwen2.5-VL model. We also use libraries such as matplotlib for plotting and PIL for image processing.

```python
import json

import matplotlib.patches as patches
import matplotlib.pyplot as plt
import numpy as np
from mlx_vlm import apply_chat_template, generate, load
from mlx_vlm.utils import load_image
from PIL import Image
```

## Loading the Qwen2.5-VL Model and Processor

Next, we load the pre-trained [Qwen2.5-VL-3B-Instruct-bf16](https://huggingface.co/mlx-community/Qwen2.5-VL-3B-Instruct-bf16) model from the Hugging Face [MLX Community](https://huggingface.co/mlx-community) along with its processor using the provided model path. The processor formats and preprocesses both text and image inputs to ensure they are compatible with the model’s architecture.

```python
model_path = "mlx-community/Qwen2.5-VL-3B-Instruct-bf16"
model, processor = load(model_path)
config = model.config
```

You’ll notice the loading process involves fetching several files if the model hasn’t been downloaded previously. Once completed, the model is ready to process our inputs.

## Loading and Displaying the Image

For this tutorial, we use an image file (`person_dog.jpg`) which contains a person with a dog. We load the image using a helper function and then display its size.

```python
image_path = "person_dog.jpg"
image = load_image(image_path)
print(image)
print(image.size)  # Example output: (467, 700)
```

The input image is shown below.

{ style="display: block; margin: 0 auto"}

## Generating an Image Description

We now prepare a prompt to describe the image. The prompt is wrapped using the `apply_chat_template` function, which converts our query into the chat-based format expected by the model.

```python
prompt = "Describe the image."
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=1
)
```

Next, we generate the output by feeding both the formatted prompt and image into the model:

```python
output = generate(model, processor, formatted_prompt, image, verbose=True)
```

**Sample Output:**

```
The image shows a person standing outdoors, holding a small, fluffy, light-colored dog. The person is wearing a dark gray hoodie with the word "ROX" on it and blue jeans. The background features a garden with various plants and a fence, and there are some fallen leaves on the ground. The setting appears to be a residential area with a garden.
```

This demonstrates how the model can effectively generate descriptive captions for images.

## Object Detection with Bounding Boxes

In addition to descriptions, the Qwen2.5-VL model can help us obtain spatial details such as bounding box coordinates for detected objects. We prepare a prompt asking the model to outline each object’s position in JSON format. We include the system prompt *“You are a helpful assistant”*, the user prompt describing the task *“Outline the position of each object and output all the bbox coordinates in JSON format.”*, and the path to the input image.

```python
system_prompt = "You are a helpful assistant"
prompt = "Outline the position of each object and output all the bbox coordinates in JSON format."
messages = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": prompt
            },
            {
                "type": "image",
                "image": image_path,
            }
        ]
    }
]
prompt = apply_chat_template(processor, config, messages, tokenize=False)
```

We then generate the spatial output:

```python
output = generate(
    model,
    processor,
    prompt,
    image,
    verbose=True
)
```

**Sample JSON Output:**

`````markdown
```json
[
    {
        "bbox_2d": [170, 105, 429, 699],
        "label": "person holding dog"
    },
    {
        "bbox_2d": [180, 158, 318, 504],
        "label": "dog"
    }
]
```
`````

This output provides the absolute coordinates of the bounding boxes around the detected objects, along with the corresponding labels. We should note that these absolute coordinates are with respect to:

- The origin of the coordinate system, which is the top-left corner of the image.
- The size of the (possibly resized) image after it’s processed via the `processor`. We can determine the new size by checking the `image_grid_thw` value in

```python
processor.image_processor(image)
```

and the `patch_size` value from the `processor`. We simply multiply the height and width values of `image_grid_thw` by the `patch_size`, which by default is $14$. The adjusted bounding box coordinates can then be determined by scaling each coordinate with the original image size divided by the processed image size. The code can be seen in the next section in the function `normalize_bbox(processor, image, x_min, y_min, x_max, y_max)`.

**Observations:**

1. Most of the time the model produces JSON with an identical structure and the same key-value pairs. If the user prompt didn’t include the word bbox before the word coordinates, the model sometimes produced slightly different key names and/or structure (a defensive-parsing sketch follows this list).

2. I achieved accurate and identical JSON outputs when using [mlx-community/Qwen2.5-VL-3B-Instruct-8bit](https://huggingface.co/mlx-community/Qwen2.5-VL-3B-Instruct-8bit) and [mlx-community/Qwen2.5-VL-3B-Instruct-bf16](https://huggingface.co/mlx-community/Qwen2.5-VL-3B-Instruct-bf16). In contrast, when I experimented with [mlx-community/Qwen2.5-VL-7B-Instruct-6bit](https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-6bit) and [mlx-community/Qwen2.5-VL-7B-Instruct-8bit](https://huggingface.co/mlx-community/Qwen2.5-VL-7B-Instruct-8bit), the generated bounding box coordinates seemed to be shifted along the $y$-axis, but otherwise matched the dimensions of the bounding boxes generated with the 3B models.
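
Because of that occasional variability, it can help to parse the raw output defensively before plotting. Below is a minimal sketch (the helper name `try_parse_bboxes` is just for illustration; it only uses the standard library and is independent of the helper functions defined in the next section):

```python
import json


def try_parse_bboxes(raw: str):
    """Best-effort parsing of the model's bounding-box output.

    Strips Markdown code fences and returns a list of dicts,
    or None if the text is not valid JSON.
    """
    cleaned = raw.replace("```json", "").replace("```", "").strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Accept either a single object or a list of objects.
    return parsed if isinstance(parsed, list) else [parsed]
```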

## Visualizing the Bounding Boxes

To better understand the spatial outputs, we can visualize these bounding boxes on the image. Below are helper functions that:

- Parse the JSON output

```python
def parse_bbox(bbox_str):
    return json.loads(bbox_str.replace("```json", "").replace("```", ""))
```

- Normalize bounding box coordinates to match the image dimensions

```python
def normalize_bbox(processor, image, x_min, y_min, x_max, y_max):
    width, height = image.size
    _, input_height, input_width = (
        processor.image_processor(image)["image_grid_thw"][0] * 14
    )

    x_min_norm = int(x_min / input_width * width)
    y_min_norm = int(y_min / input_height * height)
    x_max_norm = int(x_max / input_width * width)
    y_max_norm = int(y_max / input_height * height)

    return x_min_norm, y_min_norm, x_max_norm, y_max_norm
```

- Plot the image with rectangles and labels

```python
def plot_image_with_bboxes(processor, image, bboxes):
    image = Image.open(image) if isinstance(image, str) else image
    _, ax = plt.subplots(1)
    ax.imshow(image)

    if isinstance(bboxes, list) and all(isinstance(bbox, dict) for bbox in bboxes):
        colors = plt.cm.rainbow(np.linspace(0, 1, len(bboxes)))

        for i, (bbox, color) in enumerate(zip(bboxes, colors)):
            label = bbox.get("label", None)
            x_min, y_min, x_max, y_max = bbox.get("bbox_2d", None)

            x_min_norm, y_min_norm, x_max_norm, y_max_norm = normalize_bbox(
                processor, image, x_min, y_min, x_max, y_max
            )
            width = x_max_norm - x_min_norm
            height = y_max_norm - y_min_norm

            rect = patches.Rectangle(
                (x_min_norm, y_min_norm),
                width,
                height,
                linewidth=2,
                edgecolor=color,
                facecolor="none",
            )
            ax.add_patch(rect)
            ax.text(
                x_min_norm,
                y_min_norm,
                label,
                color=color,
                fontweight="bold",
                bbox=dict(facecolor="white", edgecolor=color, alpha=0.8),
            )

    plt.axis("off")
    plt.tight_layout()
```

Running the functions below

```python
objects_data = parse_bbox(output)
plot_image_with_bboxes(processor, image, bboxes=objects_data)
```

displays the original image with bounding boxes drawn around the person and the dog, along with their respective labels.

{ style="display: block; margin: 0 auto"}

This example shows that even the 3B model can accurately detect objects based on a general prompt to detect all objects in the image.

## More Spatial Understanding Examples

We can demonstrate a few other model outputs, corresponding to different spatial understanding tasks.

### Detect a specific object using descriptions

**Prompt:** *“Outline the position of the dog and output all the bbox coordinates in JSON format.”*

**Output:**

{ style="display: block; margin: 0 auto"}

**Observation:** The dog was accurately detected.

The next examples are taken from the original Qwen2.5-VL [cookbook](https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/spatial_understanding.ipynb), in which they use the model `Qwen2.5-VL-7B-Instruct`.

### Reasoning capability

**Prompt:** *“Locate the shadow of the paper fox, report the bbox coordinates in JSON format.”*

**Note:** The original image from the cookbook example was downscaled so that the 3B model could process it more easily.

**Output:**

{ style="display: block; margin: 0 auto"}

**Observation:** The shadow of the paper fox was accurately detected.

### Understand relationships across different instances

**Prompt:** *“Locate the person who acts bravely, report the bbox coordinates in JSON format.”*

**Output:**

{ style="display: block; margin: 0 auto"}

**Observation:** The person who acts bravely was accurately detected.

### Find a special instance with a unique characteristic

**Prompt:** *“If the sun is very glaring, which item in this image should I use? Please locate it in the image with its bbox coordinates and its name and output in JSON format.”*

**Output:**

{ style="display: block; margin: 0 auto"}

**Observation:** The cookbook’s input image has a transparent background. When I tested the model on a version with a background present, the results were not very logical; the result above is for the original image without a background. Moreover, the cookbook’s output is `glasses`, in contrast to our 3B output `umbrella`, but our output is still logical.

---

At the end of their cookbook, they mention that the examples above are based on the default system prompt. The system prompt can be changed so that we can obtain other output formats, such as plain text. The supported Qwen2.5-VL formats are:

- bbox-format: JSON

```python
{"bbox_2d": [x1, y1, x2, y2], "label": "object name/description"}
```

- bbox-format: plain text

```
x1,y1,x2,y2 object_name/description
```

- point-format: XML

```xml
<points x y>object_name/description</points>
```

- point-format: JSON

```python
{"point_2d": [x, y], "label": "object name/description"}
```

They also give an example of how to change the system prompt so it outputs plain text:

*“As an AI assistant, you specialize in accurate image object detection, delivering coordinates in plain text format ‘x1,y1,x2,y2 object’.”*
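
To experiment with this in `mlx-vlm`, one option is to pass such a system prompt through the same `messages` structure we used earlier for bounding boxes. The sketch below simply mirrors that pattern; the user prompt wording is only illustrative, and whether the model then sticks to the plain-text format is something to verify:

```python
plain_text_system_prompt = (
    "As an AI assistant, you specialize in accurate image object detection, "
    "delivering coordinates in plain text format 'x1,y1,x2,y2 object'."
)
messages = [
    {"role": "system", "content": plain_text_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Detect all objects in the image."},
            {"type": "image", "image": image_path},
        ],
    },
]
prompt = apply_chat_template(processor, config, messages, tokenize=False)
output = generate(model, processor, prompt, image, verbose=True)
# Expected shape of each output line (if the model follows the plain-text format):
# x1,y1,x2,y2 object_name
```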

## Conclusion

In this tutorial, we explored the capabilities of Qwen2.5-VL by using MLX-VLM for various visual understanding tasks. We demonstrated how to load the model and images, generate natural language descriptions, and perform object detection with bounding boxes in different spatial understanding scenarios. Our experiments show that even the 3B model provides accurate object localization and structured JSON outputs, and they suggest that it is indeed a very powerful vision-language model.
|