jw2yang committed on
Commit f5081b4 · 1 Parent(s): e60c356
Files changed (1)
  1. README.md +168 -112
README.md CHANGED
@@ -34,6 +34,64 @@ pipeline_tag: text-generation
34
 
35
  </div>
36
 
37
  ## Model Details
38
 
39
  <div align="center">
@@ -109,99 +167,6 @@ response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
109
  print(response)
110
  ```
111
 
112
- ## Intended Uses
113
-
114
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
115
-
116
- This model is intended for broad research use in English. It is designed for research purposes only and is aimed at knowledge sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
117
-
118
- ### Direct Use
119
-
120
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
121
-
122
- The model takes images and text as inputs and produces textual outputs for the following uses:
123
-
124
- * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image or video.
125
-
126
- * **Visual Planning Capabilities:** The model can also produce a visual trace as a plan for accomplishing a future task (e.g., moving an object from one place to another).
127
-
128
- * **Agentic Capabilities:** The model can also generate UI grounding actions (e.g., clicking a "search" button) and robot manipulation actions (e.g., a 7-DoF action for the robot gripper).
129
-
130
-
131
-
132
- ### Downstream Use
133
-
134
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
135
-
136
- <!-- {{ downstream_use | default("[More Information Needed]", true)}} -->
137
-
138
- <!-- ### Out-of-Scope Use -->
139
-
140
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
141
-
142
- <!-- {{ out_of_scope_use | default("[More Information Needed]", true)}} -->
143
-
144
- The model can be further finetuned for different downstream tasks, such as:
145
-
146
- * **Image Captioning and QA:** The model can be further finetuned for image captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better spatial understanding and reasoning.
147
-
148
- * **Video Captioning and QA:** The model can be further finetuned for video captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better temporal understanding and reasoning.
149
-
150
- * **UI Navigation:** The model can be finetuned for specific UI navigation tasks, such as web or mobile navigation, where it achieves superior performance.
151
-
152
- * **Robotics Manipulation:** The model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, it significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks.
153
-
154
-
155
- ## Bias, Risks, and Limitations
156
-
157
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
158
-
159
- <!-- {{ bias_risks_limitations | default("[More Information Needed]", true)}} -->
160
-
161
- Please note that this model is not specifically designed or evaluated for all downstream purposes.
162
-
163
- The model is not intended to be deployed in production settings. It should not be used in high-risk scenarios, such as military and defense, financial services, and critical infrastructure systems.
164
-
165
- Developers should consider common limitations of multimodal models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case.
166
-
167
- Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Like other multimodal models, Magma can potentially behave in ways that are unfair, unreliable, or offensive.
168
-
169
- The model's outputs do not reflect the opinions of Microsoft.
170
-
171
- Some of the limiting behaviors to be aware of include:
172
-
173
- * **Quality of Service:** The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Magma is not intended to support multilingual use.
174
-
175
- * **Representation of Harms & Perpetuation of Stereotypes:** These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
176
-
177
- * **Inappropriate or Offensive Content:** These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
178
-
179
- * **Information Reliability:** Multimodal models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
180
-
181
- Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like [Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety) that have advanced guardrails is highly recommended.
182
-
183
-
184
- ### Recommendations
185
-
186
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
187
-
188
- <!-- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}} -->
189
-
190
- Magma was developed for research purposes only. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
191
-
192
- The recommended usage for the finetuned models is within the research settings they were trained on, namely:
193
- - an Android simulator running on a computer for UI manipulation, and
194
- - an enclosure equipped with a robotic arm and everyday objects for robotic manipulation.
195
-
196
- For the UI navigation task, researchers should make sure a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model.
197
-
198
- For the robotic manipulation task, some mitigation strategies to use for human safety when operating robotic arms include:
199
-
200
- * **Safety Zones and Barriers:** Establish physical barriers or safety zones around robotic workspaces to prevent unauthorized access.
201
- * **Emergency Stop Systems:** Equip robotic arms with easily accessible emergency stop buttons. Implement a fail-safe mechanism that triggers an immediate stop of operations in case of an emergency.
202
- * **Safety Standards and Compliance:** Adhere to established safety standards (e.g., ISO 10218, ISO/TS 15066) for industrial robots and collaborative robots.
203
- * **User Training and Awareness:** Provide comprehensive training for all personnel working around robotic arms to understand their functions, safety features, and emergency procedures. Promote awareness of the potential risks associated with robotic manipulation.
204
-
205
  ## Training Details
206
 
207
  ### Training Data
@@ -310,29 +275,27 @@ We follow the individual dataset's evaluation metrics for the evaluation. Please
310
 
311
  Zero-shot evaluation on agentic intelligence. We report results for the pretrained Magma model without any domain-specific finetuning. Magma is the only model that can perform the full spectrum of tasks.
312
 
313
- | Model | Size | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
314
- |-----------------------|------|------|--------|------|----------|-----------|------|----------|----------|---------------|-----------|
315
- | GPT-4V | n/a | 77.2 | 78.0 | n/a | 22.6/24.5 | 20.2/11.8 | 9.2/8.8 | 67.5 | 75.7 | - | - |
316
- | GPT-4V-OmniParser | n/a | n/a | n/a | n/a | 92.7/49.4 | 64.9/26.3 | 77.3/39.7 | - | - | - | - |
317
- | LLaVA-1.5 | 7.4B | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
318
- | LLaVA-Next | 7.4B | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
319
- | Qwen-VL | 9.6B | 78.8 | 63.8 | n/a | 7.5/4.8 | 7.5/5.0 | 3.5/2.4 | 14.0 | 0.7 | - | - |
320
- | Qwen-VL-Chat | 9.6B | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
321
- | Fuyu | 8B | 74.2 | n/a | n/a | 41.0/1.3 | 38.0/3.6 | 33.9/4.4 | 19.4 | 15.5 | - | - |
322
- | SeeClick | 9.6B | - | - | - | 78.0/52.0 | 72.2/30.0 | 55.7/32.5 | 9.9 | 1.9 | - | - |
323
- | Octo | 93M | - | - | - | - | - | - | - | - | - | - |
324
- | RT-1-X | 35M | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
325
- | OpenVLA | 8B | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
326
- | Magma-8B (Ours) | 8.6B | 80.0 | 66.5 | 87.4 | 60.4/58.5 | 75.3/52.9 | 69.1/52.0 | 96.3 | 71.8 | 52.3 | 35.4 |
327
 
328
 
329
  <!-- {{ results | default("[More Information Needed]", true)}} -->
330
 
331
- #### Summary
332
-
333
- TBD
334
  <!-- {{ results_summary | default("", true) }} -->
335
 
 
336
  ## Technical Specifications
337
 
338
 
@@ -373,12 +336,105 @@ Our model is built based on:
373
  * [DeepSpeed](https://www.deepspeed.ai/)
374
  * [FlashAttention](https://github.com/HazyResearch/flash-attention)
375
 
376
  ## Citation
377
 
378
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
379
 
380
- **BibTeX:**
381
-
382
  ```bibtex
383
  @misc{yang2025magmafoundationmodelmultimodal,
384
  title={Magma: A Foundation Model for Multimodal AI Agents},
 
34
 
35
  </div>
36
 
37
+ ## Agents
38
+
39
+
40
+ ### UI Navigation
41
+
42
+ <div align="center">
43
+ <div align="center" style="display: inline-block; width: 48%;">
44
+ <video autoplay muted loop controls playsinline>
45
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_weather_and_flight_mode.mp4" type="video/mp4">
46
+ </video>
47
+ <p class="is-5 has-text-centered">What's the weather in Seattle? & turn on flight mode</p>
48
+ </div>
49
+ <div align="center" style="display: inline-block; width: 48%;">
50
+ <video autoplay muted loop controls playsinline>
51
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_wordle.mp4" type="video/mp4">
52
+ </video>
53
+ <p class="is-5 has-text-centered">Share and message this to Bob Steve, then click the send button to complete</p>
54
+ </div>
55
+ </div>
56
+
57
+ ### Robot Manipulation
58
+
59
+ <div align="center">
60
+ <div align="center">
61
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
62
+ <div style="width: 32%;">
63
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
64
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_hotdog.mp4" type="video/mp4">
65
+ </video>
66
+ </div>
67
+ <div style="width: 32%;">
68
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
69
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_mushroom.mp4" type="video/mp4">
70
+ </video>
71
+ </div>
72
+ <div style="width: 32%;">
73
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
74
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_left.mp4" type="video/mp4">
75
+ </video>
76
+ </div>
77
+ </div>
78
+ </div>
79
+
80
+ <div align="center">
81
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
82
+ <div style="width: 32%;">
83
+ <p style="text-align: center;font-size: 18px;">Pick Place Hotdog Sausage</p>
84
+ </div>
85
+ <div style="width: 32%;">
86
+ <p style="text-align: center;font-size: 18px;">Put Mushroom Place Pot</p>
87
+ </div>
88
+ <div style="width: 32%;">
89
+ <p style="text-align: center;font-size: 18px;">Push Cloth Left to Right (Out-of-Dist.)</p>
90
+ </div>
91
+ </div>
92
+ </div>
93
+ </div>
94
+
95
  ## Model Details
96
 
97
  <div align="center">
 
167
  print(response)
168
  ```
169
 
170
  ## Training Details
171
 
172
  ### Training Data
 
275
 
276
  Zero-shot evaluation on agentic intelligence. We report results for the pretrained Magma model without any domain-specific finetuning. Magma is the only model that can perform the full spectrum of tasks.
277
 
278
+ | Model | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
279
+ |-----------------------|------|--------|------|----------|-----------|------|----------|----------|---------------|-----------|
280
+ | GPT-4V | 77.2 | 78.0 | n/a | 23.6 | 16.0 | 9.0 | 67.5 | 75.7 | - | - |
281
+ | GPT-4V-OmniParser | n/a | n/a | n/a | 71.1 | 45.6 | 58.5 | - | - | - | - |
282
+ | LLaVA-1.5 | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
283
+ | LLaVA-Next | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
284
+ | Qwen-VL | 78.8 | 63.8 | n/a | 6.2 | 6.3 | 3.0 | 14.0 | 0.7 | - | - |
285
+ | Qwen-VL-Chat | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
286
+ | Fuyu | 74.2 | n/a | n/a | 21.2 | 20.8 | 19.2 | 19.4 | 15.5 | - | - |
287
+ | SeeClick | - | - | - | 65.0 | 51.1 | 44.1 | 9.9 | 1.9 | - | - |
288
+ | Octo | - | - | - | - | - | - | - | - | - | - |
289
+ | RT-1-X | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
290
+ | OpenVLA | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
291
+ | Magma-8B (Ours) | 80.0 | 66.5 | 87.4 | 59.5 | 64.1 | 60.6 | 96.3 | 71.8 | 52.3 | 35.4 |
292
 
293
 
294
  <!-- {{ results | default("[More Information Needed]", true)}} -->
295
 
296
  <!-- {{ results_summary | default("", true) }} -->
297
 
298
+
299
  ## Technical Specifications
300
 
301
 
 
336
  * [DeepSpeed](https://www.deepspeed.ai/)
337
  * [FlashAttention](https://github.com/HazyResearch/flash-attention) (a hedged loading sketch follows this list)
338
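If the released modeling code supports the standard `transformers` attention switch, FlashAttention can be requested at load time as sketched below; whether the remote code honors this flag is an assumption. DeepSpeed is typically added separately via its launcher and a JSON config.

```python
# Hedged sketch: requesting FlashAttention-2 at load time (support depends on the remote modeling code).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # FlashAttention kernels require fp16/bf16
    attn_implementation="flash_attention_2",  # assumed to be supported; remove if the remote code rejects it
)
```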
 
339
+
340
+ ## Intended Uses
341
+
342
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
343
+
344
+ This model is intended for broad research use in English. It is designed for research purposes only and is aimed at knowledge sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
345
+
346
+ ### Direct Use
347
+
348
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
349
+
350
+ The model takes images and text as inputs and produces textual outputs for the following uses (a minimal usage sketch follows this list):
351
+
352
+ * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image or video.
353
+
354
+ * **Visual Planning Capabilities:** The model can also produce a visual trace as a plan for accomplishing a future task (e.g., moving an object from one place to another).
355
+
356
+ * **Agentic Capabilities:** The model can also generate UI grounding actions (e.g., clicking a "search" button) and robot manipulation actions (e.g., a 7-DoF action for the robot gripper).
357
+
358
+
359
+
360
+ ### Downstream Use
361
+
362
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
363
+
364
+ <!-- {{ downstream_use | default("[More Information Needed]", true)}} -->
365
+
366
+ <!-- ### Out-of-Scope Use -->
367
+
368
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
369
+
370
+ <!-- {{ out_of_scope_use | default("[More Information Needed]", true)}} -->
371
+
372
+ The model can be further finetuned for different downstream tasks (a finetuning sketch follows this list), such as:
373
+
374
+ * **Image Captioning and QA:** The model can be further finetuned for image captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better spatial understanding and reasoning.
375
+
376
+ * **Video Captioning and QA:** The model can be further finetuned for video captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better temporal understanding and reasoning.
377
+
378
+ * **UI Navigation:** The model can be finetuned for specific UI navigation tasks, such as web or mobile navigation, where it achieves superior performance.
379
+
380
+ * **Robotics Manipulation:** The model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, it significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks.
381
+
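As one sketch of downstream use, the snippet below attaches LoRA adapters with the `peft` library before task-specific finetuning. The target module names and hyperparameters are illustrative assumptions, not the recipe used for the reported results.

```python
# Hedged sketch: parameter-efficient finetuning with LoRA adapters (assumed settings).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                                      # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed projection names in the LM backbone
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, plug the wrapped model into a standard training loop or Trainer with a
# downstream dataset (captioning, QA, UI navigation, or robot trajectories).
```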
382
+
383
+ ## Bias, Risks, and Limitations
384
+
385
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
386
+
387
+ <!-- {{ bias_risks_limitations | default("[More Information Needed]", true)}} -->
388
+
389
+ Please note that this model is not specifically designed or evaluated for all downstream purposes.
390
+
391
+ The model is not intended to be deployed in production settings. It should not be used in high-risk scenarios, such as military and defense, financial services, and critical infrastructure systems.
392
+
393
+ Developers should consider common limitations of multimodal models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case.
394
+
395
+ Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Like other multimodal models, Magma can potentially behave in ways that are unfair, unreliable, or offensive.
396
+
397
+ The model's outputs do not reflect the opinions of Microsoft.
398
+
399
+ Some of the limiting behaviors to be aware of include:
400
+
401
+ * **Quality of Service:** The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Magma is not intended to support multilingual use.
402
+
403
+ * **Representation of Harms & Perpetuation of Stereotypes:** These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
404
+
405
+ * **Inappropriate or Offensive Content:** These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
406
+
407
+ * **Information Reliability:** Multimodal models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
408
+
409
+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like [Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety) that have advanced guardrails is highly recommended.
410
+
411
+
412
+ ### Recommendations
413
+
414
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
415
+
416
+ <!-- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}} -->
417
+
418
+ Magma was developed for research purposes only. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
419
+
420
+ The recommended usage for the finetuned models is within the research settings they were trained on, namely:
421
+ - an Android simulator running on a computer for UI manipulation, and
422
+ - an enclosure equipped with a robotic arm and everyday objects for robotic manipulation.
423
+
424
+ For the UI navigation task, researchers should make sure a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model.
425
+
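Below is a minimal sketch of this human-in-the-loop control, assuming a hypothetical `execute_ui_action` sub-module controlled by the researcher; the model only proposes actions, and nothing is executed without explicit reviewer confirmation.

```python
# Hedged sketch: require explicit human confirmation before a proposed UI action is executed.
def confirm_and_execute(proposed_action: dict, execute_ui_action) -> bool:
    """proposed_action is the model's parsed proposal, e.g. {"type": "click", "target": "search button"}."""
    print(f"Model proposes: {proposed_action}")
    answer = input("Execute this action? [y/N] ").strip().lower()
    if answer != "y":
        print("Action skipped by the human reviewer.")
        return False
    execute_ui_action(proposed_action)  # hypothetical executor supplied by the researcher
    return True
```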
426
+ For the robotic manipulation task, some mitigation strategies to use for human safety when operating robotic arms include:
427
+
428
+ * **Safety Zones and Barriers:** Establish physical barriers or safety zones around robotic workspaces to prevent unauthorized access.
429
+ * **Emergency Stop Systems:** Equip robotic arms with easily accessible emergency stop buttons. Implement a fail-safe mechanism that triggers an immediate stop of operations in case of an emergency. A software-side interlock sketch follows this list.
430
+ * **Safety Standards and Compliance:** Adhere to established safety standards (e.g., ISO 10218, ISO/TS 15066) for industrial robots and collaborative robots.
431
+ * **User Training and Awareness:** Provide comprehensive training for all personnel working around robotic arms to understand their functions, safety features, and emergency procedures. Promote awareness of the potential risks associated with robotic manipulation.
432
+
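For the robotic setup, a software-side interlock can complement the physical mitigations listed above. The sketch below clamps per-step end-effector deltas and honors an emergency-stop flag before forwarding a 7-DoF action to a hypothetical `arm` interface; the limits and action layout are assumptions.

```python
# Hedged sketch: clamp action magnitudes and honor an e-stop flag before commanding the arm.
import numpy as np

MAX_TRANSLATION_M = 0.02   # assumed per-step translation limit (metres)
MAX_ROTATION_RAD = 0.05    # assumed per-step rotation limit (radians)

def safe_step(action_7dof: np.ndarray, arm, estop_pressed: bool) -> None:
    """action_7dof = [dx, dy, dz, droll, dpitch, dyaw, gripper]; the layout is an assumption."""
    if estop_pressed:
        arm.stop()  # hypothetical immediate-stop call
        return
    clipped = action_7dof.astype(float).copy()
    clipped[:3] = np.clip(clipped[:3], -MAX_TRANSLATION_M, MAX_TRANSLATION_M)
    clipped[3:6] = np.clip(clipped[3:6], -MAX_ROTATION_RAD, MAX_ROTATION_RAD)
    clipped[6] = float(np.clip(clipped[6], 0.0, 1.0))  # gripper open/close fraction
    arm.send_action(clipped)  # hypothetical robot interface
```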
433
+
434
  ## Citation
435
 
436
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
437
 
438
  ```bibtex
439
  @misc{yang2025magmafoundationmodelmultimodal,
440
  title={Magma: A Foundation Model for Multimodal AI Agents},