jw2yang committed on
Commit f5081b4 · 1 Parent(s): e60c356
Files changed (1)
  1. README.md +168 -112
README.md CHANGED
@@ -34,6 +34,64 @@ pipeline_tag: text-generation
34
 
35
  </div>
36
 
37
  ## Model Details
38
 
39
  <div align="center">
@@ -109,99 +167,6 @@ response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
109
  print(response)
110
  ```
111
 
112
- ## Intended Uses
113
-
114
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
115
-
116
- This model is intended for broad research use in English. It is designed for research purposes only and is aimed at knowledge sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
117
-
118
- ### Direct Use
119
-
120
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
121
-
122
- The model takes images and text as inputs and produces textual outputs for the following uses:
123
-
124
- * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image or video.
125
-
126
- * **Visual Planning Capabilities:** The model can also produce a visual trace as a plan for accomplishing a future task (e.g., moving an object from one place to another).
127
-
128
- * **Agentic Capabilities:** The model can also generate UI grounding actions (e.g., clicking a "search" button) and robot manipulation actions (e.g., a 7-DoF action for the robot gripper).
129
-
130
-
131
-
132
- ### Downstream Use
133
-
134
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
135
-
136
- <!-- {{ downstream_use | default("[More Information Needed]", true)}} -->
137
-
138
- <!-- ### Out-of-Scope Use -->
139
-
140
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
141
-
142
- <!-- {{ out_of_scope_use | default("[More Information Needed]", true)}} -->
143
-
144
- The model can be further finetuned for different downstream tasks, such as:
145
-
146
- * **Image Captioning and QA:** The model can be further finetuned for image captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better spatial understanding and reasoning.
147
-
148
- * **Video Captioning and QA:** The model can be further finetuned for video captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better temporal understanding and reasoning.
149
-
150
- * **UI Navigation:** The model can be finetuned for specific UI navigation tasks, such as web or mobile navigation, where it achieves superior performance.
151
-
152
- * **Robotics Manipulation:** The model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, it significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks.
153
-
154
-
155
- ## Bias, Risks, and Limitations
156
-
157
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
158
-
159
- <!-- {{ bias_risks_limitations | default("[More Information Needed]", true)}} -->
160
-
161
- Please note that this model is not specifically designed or evaluated for all downstream purposes.
162
-
163
- The model is not intended to be deployed in production settings. It should not be used in high-risk scenarios, such as military and defense, financial services, and critical infrastructure systems.
164
-
165
- Developers should consider common limitations of multimodal models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case.
166
-
167
- Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Like other multimodal models, Magma can potentially behave in ways that are unfair, unreliable, or offensive.
168
-
169
- The model's outputs do not reflect the opinions of Microsoft.
170
-
171
- Some of the limiting behaviors to be aware of include:
172
-
173
- * **Quality of Service:** The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Magma is not intended to support multilingual use.
174
-
175
- * **Representation of Harms & Perpetuation of Stereotypes:** These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
176
-
177
- * **Inappropriate or Offensive Content:** These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
178
-
179
- * **Information Reliability:** Multimodal models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
180
-
181
- Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like [Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety) that have advanced guardrails is highly recommended.
182
-
183
-
184
- ### Recommendations
185
-
186
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
187
-
188
- <!-- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}} -->
189
-
190
- Magma was developed for research purposes only. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
191
-
192
- The recommended usage for the finetuned models is within the research settings they were trained on, namely:
193
- - an Android simulator running on a computer for UI manipulation, and
194
- - an enclosure equipped with a robotic arm and everyday objects for robotic manipulation.
195
-
196
- For the UI navigation task, researchers should make sure a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model.
197
-
198
- For the robotic manipulation task, some mitigation strategies to use for human safety when operating robotic arms include:
199
-
200
- * **Safety Zones and Barriers:** Establish physical barriers or safety zones around robotic workspaces to prevent unauthorized access.
201
- * **Emergency Stop Systems:** Equip robotic arms with easily accessible emergency stop buttons. Implement a fail-safe mechanism that triggers an immediate stop of operations in case of an emergency.
202
- * **Safety Standards and Compliance:** Adhere to established safety standards (e.g., ISO 10218, ISO/TS 15066) for industrial robots and collaborative robots.
203
- * **User Training and Awareness:** Provide comprehensive training for all personnel working around robotic arms to understand their functions, safety features, and emergency procedures. Promote awareness of the potential risks associated with robotic manipulation.
204
-
205
  ## Training Details
206
 
207
  ### Training Data
@@ -310,29 +275,27 @@ We follow the individual dataset's evaluation metrics for the evaluation. Please
310
 
311
  Zero-shot evaluation on agentic intelligence. We report results for the pretrained Magma model without any domain-specific finetuning. Magma is the only model that can perform the full spectrum of tasks.
312
 
313
- | Model | Size | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
314
- |-----------------------|------|------|--------|------|----------|-----------|------|----------|----------|---------------|-----------|
315
- | GPT-4V | n/a | 77.2 | 78.0 | n/a | 22.6/24.5 | 20.2/11.8 | 9.2/8.8 | 67.5 | 75.7 | - | - |
316
- | GPT-4V-OmniParser | n/a | n/a | n/a | n/a | 92.7/49.4 | 64.9/26.3 | 77.3/39.7 | - | - | - | - |
317
- | LLaVA-1.5 | 7.4B | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
318
- | LLaVA-Next | 7.4B | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
319
- | Qwen-VL | 9.6B | 78.8 | 63.8 | n/a | 7.5/4.8 | 7.5/5.0 | 3.5/2.4 | 14.0 | 0.7 | - | - |
320
- | Qwen-VL-Chat | 9.6B | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
321
- | Fuyu | 8B | 74.2 | n/a | n/a | 41.0/1.3 | 38.0/3.6 | 33.9/4.4 | 19.4 | 15.5 | - | - |
322
- | SeeClick | 9.6B | - | - | - | 78.0/52.0 | 72.2/30.0 | 55.7/32.5 | 9.9 | 1.9 | - | - |
323
- | Octo | 93M | - | - | - | - | - | - | - | - | - | - |
324
- | RT-1-X | 35M | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
325
- | OpenVLA | 8B | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
326
- | Magma-8B (Ours) | 8.6B | 80.0 | 66.5 | 87.4 | 60.4/58.5 | 75.3/52.9 | 69.1/52.0 | 96.3 | 71.8 | 52.3 | 35.4 |
327
 
328
 
329
  <!-- {{ results | default("[More Information Needed]", true)}} -->
330
 
331
- #### Summary
332
-
333
- TBD
334
  <!-- {{ results_summary | default("", true) }} -->
335
 
 
336
  ## Technical Specifications
337
 
338
 
@@ -373,12 +336,105 @@ Our model is built based on:
373
  * [DeepSpeed](https://www.deepspeed.ai/)
374
  * [FlashAttention](https://github.com/HazyResearch/flash-attention)
375
 
376
  ## Citation
377
 
378
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
379
 
380
- **BibTeX:**
381
-
382
  ```bibtex
383
  @misc{yang2025magmafoundationmodelmultimodal,
384
  title={Magma: A Foundation Model for Multimodal AI Agents},
 
34
 
35
  </div>
36
 
37
+ ## Agents
38
+
39
+
40
+ ### UI Navigation
41
+
42
+ <div align="center">
43
+ <div align="center" style="display: inline-block; width: 48%;">
44
+ <video autoplay muted loop controls playsinline>
45
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_weather_and_flight_mode.mp4" type="video/mp4">
46
+ </video>
47
+ <p class="is-5 has-text-centered">What's the weather in Seattle? & turn on flight mode</p>
48
+ </div>
49
+ <div align="center" style="display: inline-block; width: 48%;">
50
+ <video autoplay muted loop controls playsinline>
51
+ <source src="https://microsoft.github.io/Magma/static/videos/ui_wordle.mp4" type="video/mp4">
52
+ </video>
53
+ <p class="is-5 has-text-centered">Share and message this to Bob Steve, then click the send button to complete</p>
54
+ </div>
55
+ </div>
56
+
57
+ ### Robot Manipulation
58
+
59
+ <div align="center">
60
+ <div align="center">
61
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
62
+ <div style="width: 32%;">
63
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
64
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_hotdog.mp4" type="video/mp4">
65
+ </video>
66
+ </div>
67
+ <div style="width: 32%;">
68
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
69
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_mushroom.mp4" type="video/mp4">
70
+ </video>
71
+ </div>
72
+ <div style="width: 32%;">
73
+ <video autoplay muted loop controls playsinline height="98%" style="max-width: 450px; width: 100%; border-radius: 10px; overflow: hidden;">
74
+ <source src="https://microsoft.github.io/Magma/static/videos/magma_left.mp4" type="video/mp4">
75
+ </video>
76
+ </div>
77
+ </div>
78
+ </div>
79
+
80
+ <div align="center">
81
+ <div style="display: flex; justify-content: space-between; gap: 1%;">
82
+ <div style="width: 32%;">
83
+ <p style="text-align: center;font-size: 18px;">Pick Place Hotdog Sausage</p>
84
+ </div>
85
+ <div style="width: 32%;">
86
+ <p style="text-align: center;font-size: 18px;">Put Mushroom Place Pot</p>
87
+ </div>
88
+ <div style="width: 32%;">
89
+ <p style="text-align: center;font-size: 18px;">Push Cloth Left to Right (Out-of-Dist.)</p>
90
+ </div>
91
+ </div>
92
+ </div>
93
+ </div>
94
+
95
  ## Model Details
96
 
97
  <div align="center">
 
167
  print(response)
168
  ```
169
 
170
  ## Training Details
171
 
172
  ### Training Data
 
275
 
276
  Zero-shot evaluation on agentic intelligence. We report results for the pretrained Magma model without any domain-specific finetuning. Magma is the only model that can perform the full spectrum of tasks.
277
 
278
+ | Model | VQAv2 | TextVQA | POPE | SS-Mobile | SS-Desktop | SS-Web | VWB-Ele-G | VWB-Act-G | SE-Google Robot | SE-Bridge |
279
+ |-----------------------|------|--------|------|----------|-----------|------|----------|----------|---------------|-----------|
280
+ | GPT-4V | 77.2 | 78.0 | n/a | 23.6 | 16.0 | 9.0 | 67.5 | 75.7 | - | - |
281
+ | GPT-4V-OmniParser | n/a | n/a | n/a | 71.1 | 45.6 | 58.5 | - | - | - | - |
282
+ | LLaVA-1.5 | 78.5 | 58.2 | 85.9 | - | - | - | 12.1 | 13.6 | - | - |
283
+ | LLaVA-Next | 81.3 | 64.9 | 86.5 | - | - | - | 15.0 | 8.7 | - | - |
284
+ | Qwen-VL | 78.8 | 63.8 | n/a | 6.2 | 6.3 | 3.0 | 14.0 | 0.7 | - | - |
285
+ | Qwen-VL-Chat | 78.2 | 61.5 | n/a | - | - | - | - | - | - | - |
286
+ | Fuyu | 74.2 | n/a | n/a | 21.2 | 20.8 | 19.2 | 19.4 | 15.5 | - | - |
287
+ | SeeClick | - | - | - | 65.0 | 51.1 | 44.1 | 9.9 | 1.9 | - | - |
288
+ | Octo | - | - | - | - | - | - | - | - | - | - |
289
+ | RT-1-X | - | - | - | - | - | - | - | - | 6.0 | 15.9 |
290
+ | OpenVLA | - | - | - | - | - | - | - | - | 34.2 | 1.1 |
291
+ | Magma-8B (Ours) | 80.0 | 66.5 | 87.4 | 59.5 | 64.1 | 60.6 | 96.3 | 71.8 | 52.3 | 35.4 |
292
 
293
 
294
  <!-- {{ results | default("[More Information Needed]", true)}} -->
295
 
296
  <!-- {{ results_summary | default("", true) }} -->
297
 
298
+
299
  ## Technical Specifications
300
 
301
 
 
336
  * [DeepSpeed](https://www.deepspeed.ai/)
337
  * [FlashAttention](https://github.com/HazyResearch/flash-attention) (a hedged loading sketch follows this list)
338
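If the released modeling code supports the standard `transformers` attention switch, FlashAttention can be requested at load time as sketched below; whether the remote code honors this flag is an assumption. DeepSpeed is typically added separately via its launcher and a JSON config.

```python
# Hedged sketch: requesting FlashAttention-2 at load time (support depends on the remote modeling code).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,               # FlashAttention kernels require fp16/bf16
    attn_implementation="flash_attention_2",  # assumed to be supported; remove if the remote code rejects it
)
```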
 
339
+
340
+ ## Intended Uses
341
+
342
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
343
+
344
+ This model is intended for broad research use in English. It is designed for research purposes only and is aimed at knowledge sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
345
+
346
+ ### Direct Use
347
+
348
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
349
+
350
+ The model takes images and text as inputs and produces textual outputs for the following uses (a minimal usage sketch follows this list):
351
+
352
+ * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image or video.
353
+
354
+ * **Visual Planning Capabilities:** The model can also produce a visual trace as a plan for accomplishing a future task (e.g., moving an object from one place to another).
355
+
356
+ * **Agentic Capabilities:** The model can also generate UI grounding actions (e.g., clicking a "search" button) and robot manipulation actions (e.g., a 7-DoF action for the robot gripper).
357
+
358
+
359
+
360
+ ### Downstream Use
361
+
362
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
363
+
364
+ <!-- {{ downstream_use | default("[More Information Needed]", true)}} -->
365
+
366
+ <!-- ### Out-of-Scope Use -->
367
+
368
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
369
+
370
+ <!-- {{ out_of_scope_use | default("[More Information Needed]", true)}} -->
371
+
372
+ The model can be further finetuned for different downstream tasks (a finetuning sketch follows this list), such as:
373
+
374
+ * **Image Captioning and QA:** The model can be further finetuned for image captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better spatial understanding and reasoning.
375
+
376
+ * **Video Captioning and QA:** The model can be further finetuned for video captioning and QA tasks within a multimodal LLM pipeline. In our experiments, it achieves competitive performance on these tasks while showing better temporal understanding and reasoning.
377
+
378
+ * **UI Navigation:** The model can be finetuned for specific UI navigation tasks, such as web or mobile navigation, where it achieves superior performance.
379
+
380
+ * **Robotics Manipulation:** The model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, it significantly outperforms state-of-the-art models such as OpenVLA on robotics manipulation tasks.
381
+
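As one sketch of downstream use, the snippet below attaches LoRA adapters with the `peft` library before task-specific finetuning. The target module names and hyperparameters are illustrative assumptions, not the recipe used for the reported results.

```python
# Hedged sketch: parameter-efficient finetuning with LoRA adapters (assumed settings).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Magma-8B", trust_remote_code=True, torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                                      # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed projection names in the LM backbone
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, plug the wrapped model into a standard training loop or Trainer with a
# downstream dataset (captioning, QA, UI navigation, or robot trajectories).
```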
382
+
383
+ ## Bias, Risks, and Limitations
384
+
385
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
386
+
387
+ <!-- {{ bias_risks_limitations | default("[More Information Needed]", true)}} -->
388
+
389
+ Please note that this model is not specifically designed or evaluated for all downstream purposes.
390
+
391
+ The model is not intended to be deployed in production settings. It should not be used in high-risk scenarios, such as military and defense, financial services, and critical infrastructure systems.
392
+
393
+ Developers should consider common limitations of multimodal models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case.
394
+
395
+ Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Like other multimodal models, Magma can potentially behave in ways that are unfair, unreliable, or offensive.
396
+
397
+ The model's outputs do not reflect the opinions of Microsoft.
398
+
399
+ Some of the limiting behaviors to be aware of include:
400
+
401
+ * **Quality of Service:** The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Magma is not intended to support multilingual use.
402
+
403
+ * **Representation of Harms & Perpetuation of Stereotypes:** These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
404
+
405
+ * **Inappropriate or Offensive Content:** These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case.
406
+
407
+ * **Information Reliability:** Multimodal models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
408
+
409
+ Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like [Azure AI Content Safety](https://azure.microsoft.com/en-us/products/ai-services/ai-content-safety) that have advanced guardrails is highly recommended.
410
+
411
+
412
+ ### Recommendations
413
+
414
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
415
+
416
+ <!-- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}} -->
417
+
418
+ Magma was developed for research purposes only. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
419
+
420
+ The recommended usage for the finetuned models is within the research settings they were trained on, namely:
421
+ - an Android simulator running on a computer for UI manipulation, and
422
+ - an enclosure equipped with a robotic arm and everyday objects for robotic manipulation.
423
+
424
+ For the UI navigation task, researchers should make sure a human is in the loop and in control of every action the agentic system generates. Since the model cannot act by itself, the sub-module a researcher uses to actually perform the UI navigation action should ensure that no unintended consequences can occur as a result of performing the UI action proposed by the model.
425
+
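Below is a minimal sketch of this human-in-the-loop control, assuming a hypothetical `execute_ui_action` sub-module controlled by the researcher; the model only proposes actions, and nothing is executed without explicit reviewer confirmation.

```python
# Hedged sketch: require explicit human confirmation before a proposed UI action is executed.
def confirm_and_execute(proposed_action: dict, execute_ui_action) -> bool:
    """proposed_action is the model's parsed proposal, e.g. {"type": "click", "target": "search button"}."""
    print(f"Model proposes: {proposed_action}")
    answer = input("Execute this action? [y/N] ").strip().lower()
    if answer != "y":
        print("Action skipped by the human reviewer.")
        return False
    execute_ui_action(proposed_action)  # hypothetical executor supplied by the researcher
    return True
```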
426
+ For the robotic manipulation task, some mitigation strategies to use for human safety when operating robotic arms include:
427
+
428
+ * **Safety Zones and Barriers:** Establish physical barriers or safety zones around robotic workspaces to prevent unauthorized access.
429
+ * **Emergency Stop Systems:** Equip robotic arms with easily accessible emergency stop buttons. Implement a fail-safe mechanism that triggers an immediate stop of operations in case of an emergency. A software-side interlock sketch follows this list.
430
+ * **Safety Standards and Compliance:** Adhere to established safety standards (e.g., ISO 10218, ISO/TS 15066) for industrial robots and collaborative robots.
431
+ * **User Training and Awareness:** Provide comprehensive training for all personnel working around robotic arms to understand their functions, safety features, and emergency procedures. Promote awareness of the potential risks associated with robotic manipulation.
432
+
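For the robotic setup, a software-side interlock can complement the physical mitigations listed above. The sketch below clamps per-step end-effector deltas and honors an emergency-stop flag before forwarding a 7-DoF action to a hypothetical `arm` interface; the limits and action layout are assumptions.

```python
# Hedged sketch: clamp action magnitudes and honor an e-stop flag before commanding the arm.
import numpy as np

MAX_TRANSLATION_M = 0.02   # assumed per-step translation limit (metres)
MAX_ROTATION_RAD = 0.05    # assumed per-step rotation limit (radians)

def safe_step(action_7dof: np.ndarray, arm, estop_pressed: bool) -> None:
    """action_7dof = [dx, dy, dz, droll, dpitch, dyaw, gripper]; the layout is an assumption."""
    if estop_pressed:
        arm.stop()  # hypothetical immediate-stop call
        return
    clipped = action_7dof.astype(float).copy()
    clipped[:3] = np.clip(clipped[:3], -MAX_TRANSLATION_M, MAX_TRANSLATION_M)
    clipped[3:6] = np.clip(clipped[3:6], -MAX_ROTATION_RAD, MAX_ROTATION_RAD)
    clipped[6] = float(np.clip(clipped[6], 0.0, 1.0))  # gripper open/close fraction
    arm.send_action(clipped)  # hypothetical robot interface
```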
433
+
434
  ## Citation
435
 
436
  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
437
 
438
  ```bibtex
439
  @misc{yang2025magmafoundationmodelmultimodal,
440
  title={Magma: A Foundation Model for Multimodal AI Agents},