AI4H Group committed on
Commit df490c1 · 1 Parent(s): 6bd59f7

Initial commit

Files changed (8)
  1. OpenVLM_subset.json +656 -0
  2. README.md +24 -8
  3. ShoppingMMLU.json +64 -0
  4. ShoppingMMLU_overall.json +653 -0
  5. app.py +170 -0
  6. gen_table.py +204 -0
  7. meta_data.py +169 -0
  8. requirements.txt +3 -0
OpenVLM_subset.json ADDED
@@ -0,0 +1,656 @@
+ {
+ "time": "241031154353",
+ "results": {
+ "GPT-4o (0513, detail-high)": {
+ "META": {
+ "Method": [
+ "GPT-4o (0513, detail-high)",
+ "https://openai.com/index/hello-gpt-4o/"
+ ],
+ "Parameters": "",
+ "Language Model": "",
+ "Vision Model": "",
+ "Org": "OpenAI",
+ "Time": "2024/05/31",
+ "Verified": "Yes",
+ "OpenSource": "No",
+ "key": 270,
+ "dir_name": "GPT4o_HIGH"
+ },
+ "SEEDBench_IMG": {
+ "Overall": 77.1,
+ "Instance Attributes": 79.3,
+ "Instance Identity": 81.0,
+ "Instance Interaction": 80.4,
+ "Instance Location": 72.9,
+ "Instances Counting": 69.5,
+ "Scene Understanding": 80.1,
+ "Spatial Relation": 67.9,
+ "Text Understanding": 72.6,
+ "Visual Reasoning": 83.1,
+ "Overall (official)": "N/A"
+ },
+ "CCBench": {
+ "Overall": 71.2,
+ "Sketch Reasoning": 91.1,
+ "Historical Figure": 37.1,
+ "Calligraphy Painting": 70.2,
+ "Scenery Building": 89.5,
+ "Food Clothes": 62.6,
+ "Cultural Relic": 67.0,
+ "Traditional Show": 71.2
+ },
+ "MMBench_TEST_EN": {
+ "Overall": 83.4,
+ "CP": 87.4,
+ "FP-S": 78.9,
+ "FP-C": 83.8,
+ "AR": 86.5,
+ "LR": 80.3,
+ "RR": 80.6
+ },
+ "MMBench_TEST_CN": {
+ "Overall": 82.1,
+ "CP": 87.6,
+ "FP-S": 76.6,
+ "FP-C": 83.4,
+ "AR": 83.7,
+ "LR": 78.0,
+ "RR": 80.1
+ },
+ "MMBench_TEST_EN_V11": {
+ "Overall": 83.0,
+ "AR": 90.2,
+ "CP": 81.3,
+ "FP-C": 86.1,
+ "FP-S": 81.4,
+ "LR": 78.8,
+ "RR": 82.2,
+ "Action Recognition": 93.2,
+ "Attribute Comparison": 82.7,
+ "Attribute Recognition": 91.0,
+ "Celebrity Recognition": 62.6,
+ "Function Reasoning": 93.3,
+ "Future Prediction": 82.7,
+ "Identity Reasoning": 98.7,
+ "Image Emotion": 81.1,
+ "Image Quality": 59.7,
+ "Image Scene": 88.2,
+ "Image Style": 83.7,
+ "Image Topic": 97.8,
+ "Nature Relation": 92.4,
+ "Object Localization": 84.8,
+ "Ocr": 98.9,
+ "Physical Property Reasoning": 78.5,
+ "Physical Relation": 61.3,
+ "Social Relation": 89.0,
+ "Spatial Relationship": 78.7,
+ "Structuralized Imagetext Understanding": 76.1
+ },
+ "MMBench_TEST_CN_V11": {
+ "Overall": 81.5,
+ "AR": 86.5,
+ "CP": 81.5,
+ "FP-C": 85.0,
+ "FP-S": 79.1,
+ "LR": 77.2,
+ "RR": 79.8,
+ "Action Recognition": 94.0,
+ "Attribute Comparison": 81.3,
+ "Attribute Recognition": 91.0,
+ "Celebrity Recognition": 57.8,
+ "Function Reasoning": 94.4,
+ "Future Prediction": 82.7,
+ "Identity Reasoning": 97.4,
+ "Image Emotion": 85.6,
+ "Image Quality": 58.9,
+ "Image Scene": 88.2,
+ "Image Style": 80.4,
+ "Image Topic": 98.9,
+ "Nature Relation": 94.6,
+ "Object Localization": 82.9,
+ "Ocr": 97.8,
+ "Physical Property Reasoning": 67.1,
+ "Physical Relation": 53.3,
+ "Social Relation": 86.8,
+ "Spatial Relationship": 74.7,
+ "Structuralized Imagetext Understanding": 73.4
+ },
+ "MME": {
+ "Overall": 2310.3,
+ "Perception": 1614.2,
+ "Cognition": 696.1,
+ "OCR": 192.5,
+ "Artwork": 145.2,
+ "Celebrity": 67.9,
+ "Code Reasoning": 177.5,
+ "Color": 185.0,
+ "Commonsense Reasoning": 178.6,
+ "Count": 185.0,
+ "Existence": 185.0,
+ "Landmark": 182.0,
+ "Numerical Calculation": 147.5,
+ "Position": 133.3,
+ "Posters": 191.2,
+ "Scene": 147.0,
+ "Text Translation": 192.5
+ },
+ "MMVet": {
+ "Rec": 67.8,
+ "Ocr": 76.8,
+ "Know": 58.3,
+ "Gen": 56.9,
+ "Spat": 74.3,
+ "Math": 76.2,
+ "Overall": 69.1,
+ "Overall (official)": "N/A"
+ },
+ "MMMU_VAL": {
+ "Overall": 69.2,
+ "Art & Design": 72.5,
+ "Business": 73.3,
+ "Science": 64.7,
+ "Health & Medicine": 74.0,
+ "Humanities & Social Science": 80.8,
+ "Tech & Engineering": 57.6
+ },
+ "MathVista": {
+ "Overall": 61.3,
+ "SCI": 64.8,
+ "TQA": 70.3,
+ "NUM": 44.4,
+ "ARI": 58.4,
+ "VQA": 47.5,
+ "GEO": 61.5,
+ "ALG": 62.3,
+ "GPS": 60.1,
+ "MWP": 69.9,
+ "LOG": 43.2,
+ "FQA": 60.2,
+ "STA": 68.4
+ },
+ "HallusionBench": {
+ "aAcc": 70.2,
+ "fAcc": 49.1,
+ "qAcc": 45.5,
+ "Overall": 55.0
+ },
+ "LLaVABench": {
+ "Overall": 102.0,
+ "Conv": 93.6,
+ "Complex": 111.2,
+ "Detail": 93.6,
+ "Overall (official)": "N/A"
+ },
+ "AI2D": {
+ "Overall": 84.6,
+ "atomStructure": 75.0,
+ "eclipses": 90.3,
+ "faultsEarthquakes": 78.6,
+ "foodChainsWebs": 92.2,
+ "lifeCycles": 83.5,
+ "moonPhaseEquinox": 68.2,
+ "partsOfA": 80.9,
+ "partsOfTheEarth": 82.7,
+ "photosynthesisRespiration": 83.5,
+ "rockCycle": 73.1,
+ "rockStrata": 87.8,
+ "solarSystem": 97.2,
+ "typesOf": 81.0,
+ "volcano": 100.0,
+ "waterCNPCycle": 68.2
+ },
+ "ScienceQA_VAL": {
+ "Overall": 89.7,
+ "Adaptations": 97.9,
+ "Adaptations and natural selection": 100.0,
+ "Age of Exploration": 100.0,
+ "Ancient Egypt and Kush": 100.0,
+ "Ancient Mesopotamia": 100.0,
+ "Animals": 100.0,
+ "Astronomy": 100.0,
+ "Atoms and molecules": 100.0,
+ "Basic economic principles": 32.8,
+ "Chemical reactions": 100.0,
+ "Cities": 87.5,
+ "Classification": 98.8,
+ "Classification and scientific names": 100.0,
+ "Climate change": 100.0,
+ "Colonial America": 90.5,
+ "Context clues": 100.0,
+ "Descriptive details": 100.0,
+ "Designing experiments": 100.0,
+ "Domain-specific vocabulary": 60.0,
+ "Early 19th century American history": 100.0,
+ "Early Americas": 50.0,
+ "Earth events": 100.0,
+ "Ecological interactions": 76.0,
+ "Ecosystems": 95.5,
+ "Engineering practices": 100.0,
+ "English colonies in North America": 74.4,
+ "Force and motion": 84.0,
+ "Fossils": 82.4,
+ "Genes to traits": 83.0,
+ "Geography": 98.6,
+ "Government": 100.0,
+ "Independent reading comprehension": 100.0,
+ "Informational texts: level 1": 100.0,
+ "Magnets": 72.2,
+ "Maps": 96.8,
+ "Materials": 96.6,
+ "Medieval Asia": 100.0,
+ "Natural resources and human impacts": 100.0,
+ "Oceania: geography": 59.6,
+ "Oceans and continents": 100.0,
+ "Oceans and continents\t": 100.0,
+ "Particle motion and energy": 92.6,
+ "Persuasive strategies": 100.0,
+ "Physical Geography": 83.7,
+ "Plant reproduction": 90.0,
+ "Plants": 100.0,
+ "Plate tectonics": 100.0,
+ "Read-alone texts": 100.0,
+ "Rocks and minerals": 100.0,
+ "Rome and the Byzantine Empire": 100.0,
+ "Scientific names": 100.0,
+ "Solutions": 65.7,
+ "State capitals": 100.0,
+ "States": 100.0,
+ "States of matter": 97.4,
+ "The American Revolution": 100.0,
+ "The Americas: geography": 83.3,
+ "The Antebellum period": 100.0,
+ "The Civil War and Reconstruction": 100.0,
+ "The Silk Road": 100.0,
+ "Thermal energy": 100.0,
+ "Velocity, acceleration, and forces": 68.6,
+ "Visual elements": 100.0,
+ "Water cycle": 100.0,
+ "Weather and climate": 90.6,
+ "World religions": 100.0
+ },
+ "ScienceQA_TEST": {
+ "Overall": 90.7,
+ "Adaptations": 100.0,
+ "Ancient Egypt and Kush": 100.0,
+ "Ancient Mesopotamia": 100.0,
+ "Animals": 100.0,
+ "Astronomy": 100.0,
+ "Atoms and molecules": 100.0,
+ "Basic economic principles": 38.0,
+ "Cells": 100.0,
+ "Chemical reactions": 100.0,
+ "Cities": 91.7,
+ "Classification": 100.0,
+ "Classification and scientific names": 100.0,
+ "Climate change": 100.0,
+ "Colonial America": 81.6,
+ "Context clues": 100.0,
+ "Descriptive details": 100.0,
+ "Designing experiments": 100.0,
+ "Domain-specific vocabulary": 100.0,
+ "Early 19th century American history": 100.0,
+ "Earth events": 100.0,
+ "Ecological interactions": 66.7,
+ "Ecosystems": 90.4,
+ "Engineering practices": 98.2,
+ "English colonies in North America": 92.3,
+ "Force and motion": 100.0,
+ "Fossils": 100.0,
+ "Genes to traits": 76.3,
+ "Geography": 95.2,
+ "Government": 100.0,
+ "Greece": 100.0,
+ "Independent reading comprehension": 100.0,
+ "Informational texts: level 1": 100.0,
+ "Kinetic and potential energy": 100.0,
+ "Magnets": 77.3,
+ "Maps": 97.8,
+ "Materials": 96.5,
+ "Medieval Asia": 100.0,
+ "Oceania: geography": 76.5,
+ "Oceans and continents": 100.0,
+ "Oceans and continents\t": 100.0,
+ "Particle motion and energy": 97.6,
+ "Persuasive strategies": 100.0,
+ "Photosynthesis": 100.0,
+ "Physical Geography": 92.2,
+ "Plant reproduction": 100.0,
+ "Plants": 66.7,
+ "Plate tectonics": 100.0,
+ "Read-alone texts": 100.0,
+ "Rocks and minerals": 100.0,
+ "Scientific names": 100.0,
+ "Solutions": 72.2,
+ "State capitals": 100.0,
+ "States": 94.4,
+ "States of matter": 100.0,
+ "The American Revolution": 100.0,
+ "The Americas: geography": 71.1,
+ "The Antebellum period": 100.0,
+ "The Civil War and Reconstruction": 100.0,
+ "Thermal energy": 95.5,
+ "Topographic maps": 100.0,
+ "Velocity, acceleration, and forces": 67.7,
+ "Visual elements": 100.0,
+ "Water cycle": 100.0,
+ "Weather and climate": 91.4,
+ "World religions": 100.0
+ },
+ "OCRBench": {
+ "Text Recognition": 199,
+ "Scene Text-centric VQA": 181,
+ "Doc-oriented VQA": 168,
+ "Key Information Extraction": 170,
+ "Handwritten Mathematical Expression Recognition": 18,
+ "Final Score": 736
+ },
+ "MMStar": {
+ "Overall": 63.9,
+ "coarse perception": 73.6,
+ "fine-grained perception": 54.8,
+ "instance reasoning": 66.4,
+ "logical reasoning": 72.0,
+ "math": 66.4,
+ "science & technology": 50.0
+ },
+ "RealWorldQA": {
+ "Overall": 75.4
+ },
+ "POPE": {
+ "Overall": 85.6,
+ "acc": 86.7,
+ "precision": 93.0,
+ "recall": 79.3
+ },
+ "SEEDBench2_Plus": {
+ "Overall": 72.0,
+ "chart": 71.4,
+ "map": 62.0,
+ "web": 85.2
+ },
+ "MMT-Bench_VAL": {
+ "Overall": 67.3,
+ "VR": 85.3,
+ "Loc": 68.1,
+ "OCR": 82.5,
+ "Count": 57.2,
+ "HLN": 75.0,
+ "IR": 85.0,
+ "3D": 57.5,
+ "VC": 87.9,
+ "VG": 46.2,
+ "DU": 72.9,
+ "AR": 51.0,
+ "PLP": 43.5,
+ "I2IT": 50.0,
+ "RR": 76.2,
+ "IQT": 15.0,
+ "Emo": 58.3,
+ "VI": 33.9,
+ "MemU": 87.5,
+ "VPU": 84.9,
+ "AND": 57.0,
+ "KD": 57.1,
+ "VCR": 80.0,
+ "IEJ": 40.0,
+ "MIA": 42.5,
+ "CIM": 61.7,
+ "TU": 49.5,
+ "VP": 66.7,
+ "MedU": 74.0,
+ "AUD": 58.0,
+ "DKR": 64.6,
+ "EA": 90.0,
+ "GN": 46.2,
+ "abstract_visual_recognition": 85.0,
+ "action_quality_assessment": 15.0,
+ "age_gender_race_recognition": 60.0,
+ "anatomy_identification": 75.0,
+ "animal_keypoint_detection": 35.0,
+ "animals_recognition": 100.0,
+ "animated_character_recognition": 90.0,
+ "art_design": 81.8,
+ "artwork_emotion_recognition": 55.0,
+ "astronomical_recognition": 100.0,
+ "attribute_hallucination": 80.0,
+ "behavior_anomaly_detection": 30.0,
+ "body_emotion_recognition": 40.0,
+ "building_recognition": 90.0,
+ "business": 66.7,
+ "camouflage_object_detection": 55.0,
+ "celebrity_recognition": 0.0,
+ "chart_to_table": 95.0,
+ "chart_to_text": 90.0,
+ "chart_vqa": 70.0,
+ "chemical_apparatusn_recognition": 80.0,
+ "clock_reading": 30.0,
+ "clothes_keypoint_detection": 70.0,
+ "color_assimilation": 35.0,
+ "color_constancy": 14.3,
+ "color_contrast": 40.0,
+ "color_recognition": 95.0,
+ "counting_by_category": 33.8,
+ "counting_by_reasoning": 95.0,
+ "counting_by_visual_prompting": 50.0,
+ "crowd_counting": 50.0,
+ "deepfake_detection": 60.0,
+ "depth_estimation": 40.0,
+ "disaster_recognition": 85.0,
+ "disease_diagnose": 60.0,
+ "doc_vqa": 80.0,
+ "electronic_object_recognition": 100.0,
+ "eqn2latex": 90.0,
+ "exist_hallucination": 90.0,
+ "facail_expression_change_recognition": 95.0,
+ "face_detection": 90.0,
+ "face_mask_anomaly_dectection": 70.0,
+ "face_retrieval": 100.0,
+ "facial_expression_recognition": 75.0,
+ "fashion_recognition": 75.0,
+ "film_and_television_recognition": 95.0,
+ "font_recognition": 50.0,
+ "food_recognition": 100.0,
+ "furniture_keypoint_detection": 55.0,
+ "gaze_estimation": 10.0,
+ "general_action_recognition": 95.0,
+ "geometrical_perspective": 50.0,
+ "geometrical_relativity": 30.0,
+ "gesture_recognition": 65.0,
+ "google_apps": 50.0,
+ "gui_general": 45.0,
+ "gui_install": 50.0,
+ "handwritten_mathematical_expression_recognition": 90.0,
+ "handwritten_retrieval": 90.0,
+ "handwritten_text_recognition": 100.0,
+ "health_medicine": 92.9,
+ "helmet_anomaly_detection": 90.0,
+ "human_interaction_understanding": 95.0,
+ "human_keypoint_detection": 70.0,
+ "human_object_interaction_recognition": 75.0,
+ "humanitites_social_science": 54.5,
+ "image2image_retrieval": 75.0,
+ "image_based_action_recognition": 95.0,
+ "image_captioning": 100.0,
+ "image_captioning_paragraph": 95.0,
+ "image_colorization": 60.0,
+ "image_dense_captioning": 68.4,
+ "image_matting": 15.0,
+ "image_quality_assessment": 35.0,
+ "image_season_recognition": 80.0,
+ "industrial_produce_anomaly_detection": 40.0,
+ "instance_captioning": 95.0,
+ "interactive_segmentation": 85.7,
+ "jigsaw_puzzle_solving": 40.0,
+ "landmark_recognition": 100.0,
+ "lesion_grading": 90.0,
+ "logo_and_brand_recognition": 95.0,
+ "lvlm_response_judgement": 45.0,
+ "medical_modality_recognition": 100.0,
+ "meme_image_understanding": 95.0,
+ "meme_vedio_understanding": 80.0,
+ "mevis": 30.0,
+ "micro_expression_recognition": 20.0,
+ "multiple_image_captioning": 95.0,
+ "multiple_instance_captioning": 95.0,
+ "multiple_view_image_understanding": 10.0,
+ "muscial_instrument_recognition": 95.0,
+ "national_flag_recognition": 100.0,
+ "navigation": 90.0,
+ "next_img_prediction": 65.0,
+ "object_detection": 90.0,
+ "one_shot_detection": 85.0,
+ "order_hallucination": 50.0,
+ "other_biological_attributes": 45.0,
+ "painting_recognition": 90.0,
+ "person_reid": 95.0,
+ "pixel_localization": 25.0,
+ "pixel_recognition": 55.0,
+ "plant_recognition": 90.0,
+ "point_tracking": 35.0,
+ "polygon_localization": 40.0,
+ "profession_recognition": 90.0,
+ "ravens_progressive_matrices": 15.0,
+ "reason_seg": 47.4,
+ "referring_detection": 45.0,
+ "relation_hallucination": 80.0,
+ "religious_recognition": 75.0,
+ "remote_sensing_object_detection": 60.0,
+ "rock_recognition": 80.0,
+ "rotated_object_detection": 77.8,
+ "salient_object_detection_rgb": 55.0,
+ "salient_object_detection_rgbd": 50.0,
+ "scene_emotion_recognition": 65.0,
+ "scene_graph_recognition": 85.0,
+ "scene_recognition": 65.0,
+ "scene_text_recognition": 90.0,
+ "science": 58.3,
+ "screenshot2code": 60.0,
+ "sculpture_recognition": 80.0,
+ "shape_recognition": 95.0,
+ "sign_language_recognition": 40.0,
+ "single_object_tracking": 65.0,
+ "sketch2code": 50.0,
+ "sketch2image_retrieval": 95.0,
+ "small_object_detection": 60.0,
+ "social_relation_recognition": 50.0,
+ "som_recognition": 94.7,
+ "sports_recognition": 95.0,
+ "spot_the_diff": 10.0,
+ "spot_the_similarity": 75.0,
+ "table_structure_recognition": 50.0,
+ "tech_engineering": 33.3,
+ "temporal_anticipation": 75.0,
+ "temporal_localization": 52.6,
+ "temporal_ordering": 25.0,
+ "temporal_sequence_understanding": 25.0,
+ "text2image_retrieval": 55.0,
+ "texture_material_recognition": 75.0,
+ "threed_cad_recognition": 70.0,
+ "threed_indoor_recognition": 45.0,
+ "traffic_anomaly_detection": 55.0,
+ "traffic_light_understanding": 100.0,
+ "traffic_participants_understanding": 60.0,
+ "traffic_sign_understanding": 95.0,
+ "transparent_object_detection": 75.0,
+ "vehicle_keypoint_detection": 55.6,
+ "vehicle_recognition": 100.0,
+ "vehicle_retrieval": 85.0,
+ "video_captioning": 95.0,
+ "visual_document_information_extraction": 95.0,
+ "visual_prompt_understanding": 75.0,
+ "waste_recognition": 100.0,
+ "weapon_recognition": 100.0,
+ "weather_recognition": 100.0,
+ "web_shopping": 40.0,
+ "whoops": 80.0,
+ "writing_poetry_from_image": 60.0
+ },
+ "BLINK": {
+ "Overall": 68.0,
+ "Art_Style": 82.9,
+ "Counting": 66.7,
+ "Forensic_Detection": 90.9,
+ "Functional_Correspondence": 43.1,
+ "IQ_Test": 32.0,
+ "Jigsaw": 76.7,
+ "Multi-view_Reasoning": 58.6,
+ "Object_Localization": 69.7,
+ "Relative_Depth": 75.8,
+ "Relative_Reflectance": 32.8,
+ "Semantic_Correspondence": 61.2,
+ "Spatial_Relation": 83.2,
+ "Visual_Correspondence": 92.4,
+ "Visual_Similarity": 83.0
+ },
+ "QBench": {
+ "Overall": 78.9,
+ "type_0_concern_0": 82.4,
+ "type_0_concern_1": 82.3,
+ "type_0_concern_2": 81.2,
+ "type_0_concern_3": 87.1,
+ "type_1_concern_0": 76.7,
+ "type_1_concern_1": 84.8,
+ "type_1_concern_2": 87.0,
+ "type_1_concern_3": 88.9,
+ "type_2_concern_0": 66.5,
+ "type_2_concern_1": 72.4,
+ "type_2_concern_2": 66.7,
+ "type_2_concern_3": 80.0
+ },
+ "ABench": {
+ "Overall": 79.2,
+ "part1 -> bag_of_words -> attribute": 92.7,
+ "part1 -> bag_of_words -> composition -> arrangement": 86.7,
+ "part1 -> bag_of_words -> composition -> occlusion": 60.0,
+ "part1 -> bag_of_words -> composition -> orientation": 76.9,
+ "part1 -> bag_of_words -> composition -> size": 71.4,
+ "part1 -> bag_of_words -> counting": 79.6,
+ "part1 -> bag_of_words -> noun_as_adjective": 81.4,
+ "part1 -> basic_recognition -> major": 92.9,
+ "part1 -> basic_recognition -> minor": 93.2,
+ "part1 -> outside_knowledge -> contradiction overcome": 70.8,
+ "part1 -> outside_knowledge -> specific-terms -> company": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> creature": 83.3,
+ "part1 -> outside_knowledge -> specific-terms -> daily": 94.1,
+ "part1 -> outside_knowledge -> specific-terms -> food": 95.5,
+ "part1 -> outside_knowledge -> specific-terms -> geography": 81.0,
+ "part1 -> outside_knowledge -> specific-terms -> material": 95.2,
+ "part1 -> outside_knowledge -> specific-terms -> science": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> sports": 68.2,
+ "part1 -> outside_knowledge -> specific-terms -> style -> abstract": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> art": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> art_deco": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> cubism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> dadaism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> deco": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> expressionism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> fauvism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> futurism": 66.7,
+ "part1 -> outside_knowledge -> specific-terms -> style -> minimalism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> pop": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> psychedelic": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> steampunk": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> surrealism": 100.0,
+ "part1 -> outside_knowledge -> specific-terms -> style -> victorian": 0.0,
+ "part1 -> outside_knowledge -> specific-terms -> vehicle": 94.7,
+ "part1 -> outside_knowledge -> specific-terms -> weather": 92.3,
+ "part2 -> aesthetic": 62.6,
+ "part2 -> generative": 72.4,
+ "part2 -> technical": 74.9
+ },
+ "MTVQA": {
+ "Overall": 31.2,
+ "AR": 21.3,
+ "DE": 35.1,
+ "FR": 42.2,
+ "IT": 37.2,
+ "JA": 19.9,
+ "KR": 35.1,
+ "RU": 15.9,
+ "TH": 26.0,
+ "VI": 39.6
+ }
+ }
+ }
README.md CHANGED
@@ -1,14 +1,30 @@
  ---
- title: Medical Llm Leaderboard
- emoji: 🌖
- colorFrom: purple
- colorTo: gray
+ title: Shopping MMLU Leaderboard
+ emoji: 🌎
+ colorFrom: blue
+ colorTo: green
  sdk: gradio
- sdk_version: 5.5.0
+ sdk_version: 4.44.1
  app_file: app.py
- pinned: false
+ pinned: true
  license: apache-2.0
- short_description: A Benchmark of Large Language Models in the Clinic
+ tags:
+ - leaderboard
+ short_description: 'Massive Multi-Task LLM Benchmark for Online Shopping'
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ In this leaderboard, we display evaluation results obtained with Shopping MMLU. The space provides an overall leaderboard covering 4 main online shopping skills:
+ - Shopping Concept Understanding
+ - Shopping Knowledge Reasoning
+ - User Behavior Alignment
+ - Multi-lingual Abilities
+
+ GitHub: https://github.com/KL4805/ShoppingMMLU
+ Report: https://arxiv.org/abs/2410.20745
+
+ Please consider citing the report if this resource is useful to your research:
+
+ ```BibTex
+ @article{jin2024shopping,
+   title={Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models},
+   author={Jin, Yilun and Li, Zheng and Zhang, Chenwei and Cao, Tianyu and Gao, Yifan and Jayarao, Pratik and Li, Mao and Liu, Xin and Sarkhel, Ritesh and Tang, Xianfeng and others},
+   journal={arXiv preprint arXiv:2410.20745},
+   year={2024}
+ }
+ ```
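Editor's note: for a rough sense of how the result files in this commit are consumed, here is a minimal sketch (not part of the commit) that ranks models by their mean score over the four skills. It assumes the `ShoppingMMLU_overall.json` schema added below, where each model entry holds a `META` block plus one block per skill with an `Overall` score.

```python
# Minimal sketch: rank models by mean score across the four skills,
# assuming the ShoppingMMLU_overall.json schema added in this commit.
import json

SKILLS = ['Shopping Concept Understanding', 'Shopping Knowledge Reasoning',
          'User Behavior Alignment', 'Multi-lingual Abilities']

with open('ShoppingMMLU_overall.json') as f:
    results = json.load(f)['results']

avg = {model: sum(entry[s]['Overall'] for s in SKILLS) / len(SKILLS)
       for model, entry in results.items()}
for model, score in sorted(avg.items(), key=lambda kv: -kv[1]):
    print(f'{model:25s} {score:6.2f}')
```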
ShoppingMMLU.json ADDED
@@ -0,0 +1,64 @@
+ {
+ "time": "241031154353",
+ "results": {
+ "GPT-4o (0513, detail-high)": {
+ "META": {
+ "Method": [
+ "GPT-4o (0513, detail-high)",
+ "https://openai.com/index/hello-gpt-4o/"
+ ],
+ "Parameters": "",
+ "Language Model": "",
+ "Vision Model": "",
+ "Org": "OpenAI",
+ "Time": "2024/05/31",
+ "Verified": "Yes",
+ "OpenSource": "No",
+ "key": 270,
+ "dir_name": "GPT4o_HIGH"
+ },
+ "Shopping Concept Understanding": {
+ "Rec": 67.8,
+ "Ocr": 76.8,
+ "Know": 58.3,
+ "Gen": 56.9,
+ "Spat": 74.3,
+ "Math": 76.2,
+ "Overall": 69.1,
+ "Overall (official)": "N/A"
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 61.3,
+ "SCI": 64.8,
+ "TQA": 70.3,
+ "NUM": 44.4,
+ "ARI": 58.4,
+ "VQA": 47.5,
+ "GEO": 61.5,
+ "ALG": 62.3,
+ "GPS": 60.1,
+ "MWP": 69.9,
+ "LOG": 43.2,
+ "FQA": 60.2,
+ "STA": 68.4
+ },
+ "User Behavior Alignment": {
+ "Text Recognition": 199,
+ "Scene Text-centric VQA": 181,
+ "Doc-oriented VQA": 168,
+ "Key Information Extraction": 170,
+ "Handwritten Mathematical Expression Recognition": 18,
+ "Overall": 736
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 63.9,
+ "coarse perception": 73.6,
+ "fine-grained perception": 54.8,
+ "instance reasoning": 66.4,
+ "logical reasoning": 72.0,
+ "math": 66.4,
+ "science & technology": 50.0
+ }
+ }
+ }
+ }
@@ -0,0 +1,653 @@
+ {
+ "time": "241031154353",
+ "results": {
+ "Claude3-Sonnet": {
+ "META": {
+ "Method": [
+ "Claude3-Sonnet",
+ "https://aws.amazon.com/bedrock/claude/"
+ ],
+ "Parameters": "",
+ "Org": "Anthropic",
+ "OpenSource": "No",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 80.75
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 71.63
+ },
+ "User Behavior Alignment": {
+ "Overall": 70.17
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 67.76
+ }
+ },
+ "Claude2": {
+ "META": {
+ "Method": [
+ "Claude2",
+ "https://aws.amazon.com/bedrock/claude/"
+ ],
+ "Parameters": "",
+ "Org": "Anthropic",
+ "OpenSource": "No",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 75.46
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 65.5
+ },
+ "User Behavior Alignment": {
+ "Overall": 63.53
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 65.24
+ }
+ },
+ "ChatGPT": {
+ "META": {
+ "Method": [
+ "ChatGPT",
+ "https://platform.openai.com/docs/models#gpt-3-5-turbo"
+ ],
+ "Parameters": "",
+ "Org": "OpenAI",
+ "OpenSource": "No",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 75.63
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 64.97
+ },
+ "User Behavior Alignment": {
+ "Overall": 59.79
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 60.81
+ }
+ },
+ "LLaMA3-70B-Instruct": {
+ "META": {
+ "Method": [
+ "LLaMA3-70B-Instruct",
+ "https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct"
+ ],
+ "Parameters": "70B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 75.24
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 69.29
+ },
+ "User Behavior Alignment": {
+ "Overall": 67.67
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 62.0
+ }
+ },
+ "QWen1.5-72B": {
+ "META": {
+ "Method": [
+ "QWen1.5-72B",
+ "https://huggingface.co/Qwen/Qwen1.5-72B"
+ ],
+ "Parameters": "72B",
+ "Org": "Alibaba",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 71.67
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 68.92
+ },
+ "User Behavior Alignment": {
+ "Overall": 64.12
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 64.84
+ }
+ },
+ "LLaMA3-70B": {
+ "META": {
+ "Method": [
+ "LLaMA3-70B",
+ "https://huggingface.co/meta-llama/Meta-Llama-3-70B"
+ ],
+ "Parameters": "70B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 69.59
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 63.56
+ },
+ "User Behavior Alignment": {
+ "Overall": 55.77
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 58.95
+ }
+ },
+ "LLaMA2-70B-Chat": {
+ "META": {
+ "Method": [
+ "LLaMA2-70B-Chat",
+ "https://huggingface.co/meta-llama/Llama-2-70b-chat-hf"
+ ],
+ "Parameters": "70B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 61.84
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 40.73
+ },
+ "User Behavior Alignment": {
+ "Overall": 44.2
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 47.04
+ }
+ },
+ "LLaMA2-70B": {
+ "META": {
+ "Method": [
+ "LLaMA2-70B",
+ "https://huggingface.co/meta-llama/Llama-2-70b-hf"
+ ],
+ "Parameters": "70B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 61.05
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 55.87
+ },
+ "User Behavior Alignment": {
+ "Overall": 43.24
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 47.85
+ }
+ },
+ "Mixtral-8x7B": {
+ "META": {
+ "Method": [
+ "Mixtral-8x7B",
+ "https://huggingface.co/mistralai/Mixtral-8x7B-v0.1"
+ ],
+ "Parameters": "46.7B",
+ "Org": "MistralAI",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 59.43
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 54.32
+ },
+ "User Behavior Alignment": {
+ "Overall": 55.31
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 44.69
+ }
+ },
+ "QWen1.5-14B": {
+ "META": {
+ "Method": [
+ "QWen1.5-14B",
+ "https://huggingface.co/Qwen/Qwen1.5-14B"
+ ],
+ "Parameters": "14B",
+ "Org": "Alibaba",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 67.22
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 60.92
+ },
+ "User Behavior Alignment": {
+ "Overall": 54.92
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 55.21
+ }
+ },
+ "eCeLLM-L": {
+ "META": {
+ "Method": [
+ "eCeLLM-L",
+ "https://huggingface.co/NingLab/eCeLLM-L"
+ ],
+ "Parameters": "13B",
+ "Org": "OSU NingLab",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 61.54
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 54.84
+ },
+ "User Behavior Alignment": {
+ "Overall": 54.55
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 59.64
+ }
+ },
+ "Vicuna-13B-v1.5": {
+ "META": {
+ "Method": [
+ "Vicuna-13B-v1.5",
+ "https://huggingface.co/lmsys/vicuna-13b-v1.5"
+ ],
+ "Parameters": "13B",
+ "Org": "LMSys",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 59.64
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 52.63
+ },
+ "User Behavior Alignment": {
+ "Overall": 49.81
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 49.64
+ }
+ },
+ "LLaMA2-13B-Chat": {
+ "META": {
+ "Method": [
+ "LLaMA2-13B-Chat",
+ "https://huggingface.co/meta-llama/Llama-2-13b-chat-hf"
+ ],
+ "Parameters": "13B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 51.79
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 45.01
+ },
+ "User Behavior Alignment": {
+ "Overall": 39.95
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 42.99
+ }
+ },
+ "LLaMA2-13B": {
+ "META": {
+ "Method": [
+ "LLaMA2-13B",
+ "https://huggingface.co/meta-llama/Llama-2-13b-hf"
+ ],
+ "Parameters": "13B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 45.86
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 39.47
+ },
+ "User Behavior Alignment": {
+ "Overall": 39.43
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 44.23
+ }
+ },
+ "LLaMA3-8B-Instruct": {
+ "META": {
+ "Method": [
+ "LLaMA3-8B-Instruct",
+ "https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct"
+ ],
+ "Parameters": "8B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 65.26
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 56.84
+ },
+ "User Behavior Alignment": {
+ "Overall": 54.88
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 55.37
+ }
+ },
+ "LLaMA3-8B": {
+ "META": {
+ "Method": [
+ "LLaMA3-8B",
+ "https://huggingface.co/meta-llama/Meta-Llama-3-8B"
+ ],
+ "Parameters": "8B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 58.02
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 49.74
+ },
+ "User Behavior Alignment": {
+ "Overall": 44.16
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 51.03
+ }
+ },
+ "QWen1.5-7B": {
+ "META": {
+ "Method": [
+ "QWen1.5-7B",
+ "https://huggingface.co/Qwen/Qwen1.5-7B"
+ ],
+ "Parameters": "7B",
+ "Org": "Alibaba",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 58.89
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 52.34
+ },
+ "User Behavior Alignment": {
+ "Overall": 49.81
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 50.14
+ }
+ },
+ "eCeLLM-M": {
+ "META": {
+ "Method": [
+ "eCeLLM-M",
+ "https://huggingface.co/NingLab/eCeLLM-M"
+ ],
+ "Parameters": "7B",
+ "Org": "OSU NingLab",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 63.29
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 48.94
+ },
+ "User Behavior Alignment": {
+ "Overall": 53.78
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 56.08
+ }
+ },
+ "Zephyr-Beta": {
+ "META": {
+ "Method": [
+ "Zephyr-Beta",
+ "https://huggingface.co/HuggingFaceH4/zephyr-7b-beta"
+ ],
+ "Parameters": "7B",
+ "Org": "HuggingFace H4",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 61.65
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 52.57
+ },
+ "User Behavior Alignment": {
+ "Overall": 44.73
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 45.35
+ }
+ },
+ "Mistral-7B-Instruct": {
+ "META": {
+ "Method": [
+ "Mistral-7B-Instruct",
+ "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2"
+ ],
+ "Parameters": "7B",
+ "Org": "MistralAI",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 62.03
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 46.36
+ },
+ "User Behavior Alignment": {
+ "Overall": 42.21
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 43.32
+ }
+ },
+ "Mistral-7B": {
+ "META": {
+ "Method": [
+ "Mistral-7B",
+ "https://huggingface.co/mistralai/Mistral-7B-v0.1"
+ ],
+ "Parameters": "7B",
+ "Org": "MistralAI",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 55.82
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 46.69
+ },
+ "User Behavior Alignment": {
+ "Overall": 46.27
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 41.47
+ }
+ },
+ "Vicuna-7B-v1.5": {
+ "META": {
+ "Method": [
+ "Vicuna-7B-v1.5",
+ "https://huggingface.co/lmsys/vicuna-7b-v1.5"
+ ],
+ "Parameters": "7B",
+ "Org": "LMSys",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 53.46
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 45.06
+ },
+ "User Behavior Alignment": {
+ "Overall": 41.11
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 43.82
+ }
+ },
+ "LLaMA2-7B-Chat": {
+ "META": {
+ "Method": [
+ "LLaMA2-7B-Chat",
+ "https://huggingface.co/meta-llama/Llama-2-7b-chat-hf"
+ ],
+ "Parameters": "7B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 51.67
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 43.48
+ },
+ "User Behavior Alignment": {
+ "Overall": 41.42
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 40.43
+ }
+ },
+ "LLaMA2-7B": {
+ "META": {
+ "Method": [
+ "LLaMA2-7B",
+ "https://huggingface.co/meta-llama/Llama-2-7b-hf"
+ ],
+ "Parameters": "7B",
+ "Org": "Meta",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 38.22
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 32.81
+ },
+ "User Behavior Alignment": {
+ "Overall": 32.56
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 27.71
+ }
+ },
+ "QWen1.5-4B": {
+ "META": {
+ "Method": [
+ "QWen1.5-4B",
+ "https://huggingface.co/Qwen/Qwen1.5-4B"
+ ],
+ "Parameters": "4B",
+ "Org": "Alibaba",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 57.21
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 52.56
+ },
+ "User Behavior Alignment": {
+ "Overall": 42.74
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 49.78
+ }
+ },
+ "Phi-2": {
+ "META": {
+ "Method": [
+ "Phi-2",
+ "https://huggingface.co/microsoft/phi-2"
+ ],
+ "Parameters": "2.8B",
+ "Org": "Microsoft",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 49.34
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 42.83
+ },
+ "User Behavior Alignment": {
+ "Overall": 36.38
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 32.91
+ }
+ },
+ "eCeLLM-S": {
+ "META": {
+ "Method": [
+ "eCeLLM-S",
+ "https://huggingface.co/NingLab/eCeLLM-S"
+ ],
+ "Parameters": "2.8B",
+ "Org": "OSU NingLab",
+ "OpenSource": "Yes",
+ "Verified": "Yes"
+ },
+ "Shopping Concept Understanding": {
+ "Overall": 49.4
+ },
+ "Shopping Knowledge Reasoning": {
+ "Overall": 39.06
+ },
+ "User Behavior Alignment": {
+ "Overall": 36.33
+ },
+ "Multi-lingual Abilities": {
+ "Overall": 32.79
+ }
+ }
+ }
+ }
app.py ADDED
@@ -0,0 +1,170 @@
+ from types import SimpleNamespace
+
+ import gradio as gr
+
+ from gen_table import *
+ from meta_data import *
+
+ with gr.Blocks() as demo:
+     struct = load_results_local()
+     timestamp = struct['time']
+     EVAL_TIME = format_timestamp(timestamp)
+     results = struct['results']
+     N_MODEL = len(results)
+     N_DATA = len(results['Claude3-Sonnet']) - 1  # each entry holds META plus one block per skill
+     DATASETS = list(results['Claude3-Sonnet'])
+     DATASETS.remove('META')
+     print(DATASETS)
+
+     gr.Markdown(LEADERBOARD_INTRODUCTION.format(N_MODEL, N_DATA, EVAL_TIME))
+     structs = [SimpleNamespace() for _ in range(N_DATA)]  # one attribute bag of UI state per skill tab
+
+     with gr.Tabs(elem_classes='tab-buttons') as tabs:
+         with gr.TabItem('🏅 Shopping MMLU Leaderboard', elem_id='main', id=0):
+             gr.Markdown(LEADERBOARD_MD['MAIN'])
+             _, check_box = BUILD_L1_DF(results, MAIN_FIELDS)
+             table = generate_table(results, DEFAULT_BENCH)
+             table['Rank'] = list(range(1, len(table) + 1))
+
+             type_map = check_box['type_map']
+             type_map['Rank'] = 'number'
+
+             checkbox_group = gr.CheckboxGroup(
+                 choices=check_box['all'],
+                 value=check_box['required'],
+                 label='Evaluation Dimension',
+                 interactive=True,
+             )
+
+             headers = ['Rank'] + check_box['essential'] + checkbox_group.value
+             with gr.Row():
+                 model_size = gr.CheckboxGroup(
+                     choices=MODEL_SIZE,
+                     value=MODEL_SIZE,
+                     label='Model Size',
+                     interactive=True
+                 )
+                 model_type = gr.CheckboxGroup(
+                     choices=MODEL_TYPE,
+                     value=MODEL_TYPE,
+                     label='Model Type',
+                     interactive=True
+                 )
+             print(headers)
+             print(check_box['essential'])
+             data_component = gr.components.DataFrame(
+                 value=table[headers],
+                 type='pandas',
+                 datatype=[type_map[x] for x in headers],
+                 interactive=False,
+                 visible=True)
+
+             def filter_df(fields, model_size, model_type):
+                 # rebuild the table with the selected score columns, then filter by size and type
+                 filter_list = ['Avg Score', 'Avg Rank', 'OpenSource', 'Verified']
+                 headers = ['Rank'] + check_box['essential'] + fields
+
+                 new_fields = [field for field in fields if field not in filter_list]
+                 df = generate_table(results, new_fields)
+
+                 df['flag'] = [model_size_flag(x, model_size) for x in df['Param (B)']]
+                 df = df[df['flag']]
+                 df.pop('flag')
+                 if len(df):
+                     df['flag'] = [model_type_flag(df.iloc[i], model_type) for i in range(len(df))]
+                     df = df[df['flag']]
+                     df.pop('flag')
+                 df['Rank'] = list(range(1, len(df) + 1))
+
+                 comp = gr.components.DataFrame(
+                     value=df[headers],
+                     type='pandas',
+                     datatype=[type_map[x] for x in headers],
+                     interactive=False,
+                     visible=True)
+                 return comp
+
+             for cbox in [checkbox_group, model_size, model_type]:
+                 cbox.change(fn=filter_df, inputs=[checkbox_group, model_size, model_type], outputs=data_component)
+
+         with gr.TabItem('🔍 About', elem_id='about', id=1):
+             gr.Markdown(urlopen(SHOPPINGMMLU_README).read().decode())
+
+         for i, dataset in enumerate(DATASETS):
+             with gr.TabItem(f'📊 {dataset} Leaderboard', elem_id=dataset, id=i + 2):
+                 if dataset in LEADERBOARD_MD:
+                     gr.Markdown(LEADERBOARD_MD[dataset])
+
+                 s = structs[i]
+                 s.table, s.check_box = BUILD_L2_DF(results, dataset)
+                 s.type_map = s.check_box['type_map']
+                 s.type_map['Rank'] = 'number'
+
+                 s.checkbox_group = gr.CheckboxGroup(
+                     choices=s.check_box['all'],
+                     value=s.check_box['required'],
+                     label=f'{dataset} CheckBoxes',
+                     interactive=True,
+                 )
+                 s.headers = ['Rank'] + s.check_box['essential'] + s.checkbox_group.value
+                 print(s.check_box['essential'])
+                 print(s.checkbox_group.value)
+                 s.table['Rank'] = list(range(1, len(s.table) + 1))
+                 print(s.headers)
+                 with gr.Row():
+                     s.model_size = gr.CheckboxGroup(
+                         choices=MODEL_SIZE,
+                         value=MODEL_SIZE,
+                         label='Model Size',
+                         interactive=True
+                     )
+                     s.model_type = gr.CheckboxGroup(
+                         choices=MODEL_TYPE,
+                         value=MODEL_TYPE,
+                         label='Model Type',
+                         interactive=True
+                     )
+
+                 s.data_component = gr.components.DataFrame(
+                     value=s.table[s.headers],
+                     type='pandas',
+                     datatype=[s.type_map[x] for x in s.headers],
+                     interactive=False,
+                     visible=True)
+                 s.dataset = gr.Textbox(value=dataset, label=dataset, visible=False)
+
+                 def filter_df_l2(dataset_name, fields, model_size, model_type):
+                     # the hidden Textbox tells the shared callback which skill tab fired it
+                     s = structs[DATASETS.index(dataset_name)]
+                     headers = ['Rank'] + s.check_box['essential'] + fields
+                     df = cp.deepcopy(s.table)
+                     df['flag'] = [model_size_flag(x, model_size) for x in df['Param (B)']]
+                     df = df[df['flag']]
+                     df.pop('flag')
+                     if len(df):
+                         df['flag'] = [model_type_flag(df.iloc[i], model_type) for i in range(len(df))]
+                         df = df[df['flag']]
+                         df.pop('flag')
+                     df['Rank'] = list(range(1, len(df) + 1))
+
+                     comp = gr.components.DataFrame(
+                         value=df[headers],
+                         type='pandas',
+                         datatype=[s.type_map[x] for x in headers],
+                         interactive=False,
+                         visible=True)
+                     return comp
+
+                 for cbox in [s.checkbox_group, s.model_size, s.model_type]:
+                     cbox.change(
+                         fn=filter_df_l2,
+                         inputs=[s.dataset, s.checkbox_group, s.model_size, s.model_type],
+                         outputs=s.data_component)
+
+     with gr.Row():
+         with gr.Accordion('Citation', open=False):
+             citation_button = gr.Textbox(
+                 value=CITATION_BUTTON_TEXT,
+                 label=CITATION_BUTTON_LABEL,
+                 elem_id='citation-button')
+
+ if __name__ == '__main__':
+     demo.launch(server_name='0.0.0.0')
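Editor's note: the size/type checkboxes above funnel into `model_size_flag` and `model_type_flag` (defined in `gen_table.py` below). A standalone sketch of that filtering step on a toy frame, for reference; the toy rows and column values are illustrative, but the column names match those built by `gen_table.py`:

```python
# Standalone sketch of the checkbox filtering used by the callbacks above
# (assumes the 'Param (B)', 'OpenSource' and 'Verified' columns from gen_table.py).
import pandas as pd
from gen_table import model_size_flag, model_type_flag

df = pd.DataFrame({
    'Method': ['LLaMA3-70B-Instruct', 'ChatGPT'],
    'Param (B)': [70.0, None],      # None falls into the 'Unknown' size bucket
    'OpenSource': ['Yes', 'No'],
    'Verified': ['Yes', 'Yes'],     # OpenSource=No + Verified=Yes counts as 'API'
})
keep_size = df['Param (B)'].map(lambda sz: model_size_flag(sz, ['>40B', 'Unknown']))
keep_type = df.apply(lambda row: model_type_flag(row, ['OpenSource', 'API']), axis=1)
print(df[keep_size & keep_type])   # both rows survive these filter settings
```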
gen_table.py ADDED
@@ -0,0 +1,204 @@
+ import copy as cp
+ import json
+ from collections import defaultdict
+ from urllib.request import urlopen
+
+ import gradio as gr
+ import numpy as np
+ import pandas as pd
+
+ from meta_data import DEFAULT_BENCH, META_FIELDS, URL, RESULTS
+
+
+ def listinstr(lst, s):
+     # True if any item of lst occurs as a substring of s
+     assert isinstance(lst, list)
+     for item in lst:
+         if item in s:
+             return True
+     return False
+
+
+ def load_results():
+     data = json.loads(urlopen(URL).read())
+     return data
+
+
+ def load_results_local():
+     with open(RESULTS, 'r') as infile:
+         data = json.load(infile)
+     return data
+
+
+ def nth_large(val, vals):
+     # 1-based rank of val among vals (higher values rank first)
+     return sum([1 for v in vals if v > val]) + 1
+
+
+ def format_timestamp(timestamp):
+     # 'YYMMDDhhmmss' -> 'YY.MM.DD hh:mm:ss'
+     date = timestamp[:2] + '.' + timestamp[2:4] + '.' + timestamp[4:6]
+     time = timestamp[6:8] + ':' + timestamp[8:10] + ':' + timestamp[10:12]
+     return date + ' ' + time
+
+
+ def model_size_flag(sz, FIELDS):
+     if pd.isna(sz) and 'Unknown' in FIELDS:
+         return True
+     if pd.isna(sz):
+         return False
+     if '<4B' in FIELDS and sz < 4:
+         return True
+     if '4B-10B' in FIELDS and sz >= 4 and sz < 10:
+         return True
+     if '10B-20B' in FIELDS and sz >= 10 and sz < 20:
+         return True
+     if '20B-40B' in FIELDS and sz >= 20 and sz < 40:
+         return True
+     if '>40B' in FIELDS and sz >= 40:
+         return True
+     return False
+
+
+ def model_type_flag(line, FIELDS):
+     if 'OpenSource' in FIELDS and line['OpenSource'] == 'Yes':
+         return True
+     if 'API' in FIELDS and line['OpenSource'] == 'No' and line['Verified'] == 'Yes':
+         return True
+     if 'Proprietary' in FIELDS and line['OpenSource'] == 'No' and line['Verified'] == 'No':
+         return True
+     return False
+
+
+ def BUILD_L1_DF(results, fields):
+     check_box = {}
+     check_box['essential'] = ['Method', 'Param (B)']
+     # revise here to set the default datasets
+     check_box['required'] = ['Avg Score', 'Avg Rank'] + DEFAULT_BENCH
+     check_box['avg'] = ['Avg Score', 'Avg Rank']
+     check_box['all'] = check_box['avg'] + fields
+     type_map = defaultdict(lambda: 'number')
+     type_map['Method'] = 'html'
+     type_map['Language Model'] = type_map['Vision Model'] = type_map['OpenSource'] = type_map['Verified'] = 'str'
+     check_box['type_map'] = type_map
+
+     df = generate_table(results, fields)
+     return df, check_box
+
+
+ def BUILD_L2_DF(results, dataset):
+     res = defaultdict(list)
+     sub = [v for v in results.values() if dataset in v]
+     assert len(sub)
+     fields = list(sub[0][dataset].keys())
+
+     non_overall_fields = [x for x in fields if 'Overall' not in x]
+     overall_fields = [x for x in fields if 'Overall' in x]
+     # special cases carried over from the OpenVLM leaderboard; harmless for Shopping MMLU data
+     if dataset == 'MME':
+         non_overall_fields = [x for x in non_overall_fields if not listinstr(['Perception', 'Cognition'], x)]
+         overall_fields = overall_fields + ['Perception', 'Cognition']
+     if dataset == 'OCRBench':
+         non_overall_fields = [x for x in non_overall_fields if not listinstr(['Final Score'], x)]
+         overall_fields = ['Final Score']
+     print(overall_fields)
+     print(non_overall_fields)
+
+     for m in results:
+         item = results[m]
+         if dataset not in item:
+             continue
+         meta = item['META']
+         for k in META_FIELDS:
+             if k == 'Param (B)':
+                 param = meta['Parameters']
+                 res[k].append(float(param.replace('B', '')) if param != '' else None)
+             elif k == 'Method':
+                 name, url = meta['Method']
+                 res[k].append(f'<a href="{url}">{name}</a>')
+             else:
+                 res[k].append(meta[k])
+
+         for d in non_overall_fields:
+             res[d].append(item[dataset][d])
+         for d in overall_fields:
+             res[d].append(item[dataset][d])
+
+     df = pd.DataFrame(res)
+     print(df)
+     all_fields = overall_fields + non_overall_fields
+     # Use the first 5 non-overall fields as required fields
+     # required_fields = overall_fields if len(overall_fields) else non_overall_fields[:5]
+     required_fields = all_fields
+
+     if dataset == 'OCRBench':
+         df = df.sort_values('Final Score')
+     elif dataset == 'COCO_VAL':
+         df = df.sort_values('CIDEr')
+     else:
+         df = df.sort_values('Overall')
+     df = df.iloc[::-1]
+
+     check_box = {}
+     check_box['essential'] = ['Method', 'Param (B)']
+     check_box['required'] = required_fields
+     check_box['all'] = all_fields
+     type_map = defaultdict(lambda: 'number')
+     type_map['Method'] = 'html'
+     type_map['Language Model'] = type_map['Vision Model'] = type_map['OpenSource'] = type_map['Verified'] = 'str'
+     check_box['type_map'] = type_map
+     return df, check_box
+
+
+ def generate_table(results, fields):
+
+     def get_mmbench_v11(item):
+         assert 'MMBench_TEST_CN_V11' in item and 'MMBench_TEST_EN_V11' in item
+         val = (item['MMBench_TEST_CN_V11']['Overall'] + item['MMBench_TEST_EN_V11']['Overall']) / 2
+         val = float(f'{val:.1f}')
+         return val
+
+     res = defaultdict(list)
+     for i, m in enumerate(results):
+         item = results[m]
+         meta = item['META']
+         for k in META_FIELDS:
+             if k == 'Param (B)':
+                 param = meta['Parameters']
+                 res[k].append(float(param.replace('B', '')) if param != '' else None)
+             elif k == 'Method':
+                 name, url = meta['Method']
+                 res[k].append(f'<a href="{url}">{name}</a>')
+                 res['name'].append(name)
+             else:
+                 res[k].append(meta[k])
+         scores, ranks = [], []
+         for d in fields:
+             key_name = 'Overall' if d != 'OCRBench' else 'Final Score'
+             # MMBench_V11 / MME / OCRBench branches are OpenVLM leftovers, unused for Shopping MMLU fields
+             if d == 'MMBench_V11':
+                 val = get_mmbench_v11(item)
+                 res[d].append(val)
+                 scores.append(val)
+                 ranks.append(nth_large(val, [get_mmbench_v11(x) for x in results.values()]))
+             elif d in item:
+                 res[d].append(item[d][key_name])
+                 if d == 'MME':
+                     scores.append(item[d][key_name] / 28)
+                 elif d == 'OCRBench':
+                     scores.append(item[d][key_name] / 10)
+                 else:
+                     scores.append(item[d][key_name])
+                 ranks.append(nth_large(item[d][key_name], [x[d][key_name] for x in results.values() if d in x]))
+             else:
+                 res[d].append(None)
+                 scores.append(None)
+                 ranks.append(None)
+
+         res['Avg Score'].append(round(np.mean(scores), 1) if None not in scores else None)
+         res['Avg Rank'].append(round(np.mean(ranks), 2) if None not in ranks else None)
+
+     df = pd.DataFrame(res)
+     valid, missing = df[~pd.isna(df['Avg Score'])], df[pd.isna(df['Avg Score'])]
+     valid = valid.sort_values('Avg Score')
+     valid = valid.iloc[::-1]
+     if len(fields):
+         missing = missing.sort_values('MMBench_V11' if 'MMBench_V11' in fields else fields[0])
+         missing = missing.iloc[::-1]
+     df = pd.concat([valid, missing])
+     return df
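Editor's note: for reference, a minimal offline usage sketch of the table builders above (not part of the commit; run from the repo root, next to `ShoppingMMLU_overall.json`):

```python
# Offline usage sketch for gen_table.py: build the main leaderboard table
# and print the headline columns for the top-ranked models.
from gen_table import load_results_local, generate_table
from meta_data import DEFAULT_BENCH

results = load_results_local()['results']
table = generate_table(results, DEFAULT_BENCH)  # sorted by Avg Score, descending
print(table[['Method', 'Avg Score', 'Avg Rank'] + DEFAULT_BENCH].head(10))
```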
meta_data.py ADDED
@@ -0,0 +1,169 @@
+ # CONSTANTS-URL
+ URL = "http://opencompass.openxlab.space/assets/OpenVLM.json"
+ RESULTS = 'ShoppingMMLU_overall.json'
+ SHOPPINGMMLU_README = 'https://raw.githubusercontent.com/KL4805/ShoppingMMLU/refs/heads/main/README.md'
+ # CONSTANTS-CITATION
+ CITATION_BUTTON_TEXT = r"""@article{jin2024shopping,
+   title={Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models},
+   author={Jin, Yilun and Li, Zheng and Zhang, Chenwei and Cao, Tianyu and Gao, Yifan and Jayarao, Pratik and Li, Mao and Liu, Xin and Sarkhel, Ritesh and Tang, Xianfeng and others},
+   journal={arXiv preprint arXiv:2410.20745},
+   year={2024}
+ }"""
+ CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
+ # CONSTANTS-TEXT
+ LEADERBOARD_INTRODUCTION = """# Shopping MMLU Leaderboard
+ ### Welcome to the Shopping MMLU Leaderboard! Here we share the evaluation results of LLMs obtained with the open-source framework:
+ ### [Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models](https://github.com/KL4805/ShoppingMMLU) 🏆
+ ### Currently, the Shopping MMLU Leaderboard covers {} different LLMs and {} main online shopping skills.
+
+ This leaderboard was last updated: {}.
+
+ The Shopping MMLU Leaderboard only includes open-source LLMs or API models that are publicly available. To add your own model to the leaderboard, please create a PR in [Shopping MMLU](https://github.com/KL4805/ShoppingMMLU) to support your LLM; we will then help with the evaluation and update the leaderboard. For any questions or concerns, please feel free to contact us at [email protected] and [email protected].
+ """
+ # CONSTANTS-FIELDS
+ META_FIELDS = ['Method', 'Param (B)', 'OpenSource', 'Verified']
+ # Fields carried over from the OpenVLM leaderboard, kept for reference:
+ # MAIN_FIELDS = [
+ #     'MMBench_V11', 'MMStar', 'MME',
+ #     'MMMU_VAL', 'MathVista', 'OCRBench', 'AI2D',
+ #     'HallusionBench', 'SEEDBench_IMG', 'MMVet',
+ #     'LLaVABench', 'CCBench', 'RealWorldQA', 'POPE', 'ScienceQA_TEST',
+ #     'SEEDBench2_Plus', 'MMT-Bench_VAL', 'BLINK'
+ # ]
+ MAIN_FIELDS = [
+     'Shopping Concept Understanding', 'Shopping Knowledge Reasoning', 'User Behavior Alignment', 'Multi-lingual Abilities'
+ ]
+ # DEFAULT_BENCH = [
+ #     'MMBench_V11', 'MMStar', 'MMMU_VAL', 'MathVista', 'OCRBench', 'AI2D',
+ #     'HallusionBench', 'MMVet'
+ # ]
+ DEFAULT_BENCH = ['Shopping Concept Understanding', 'Shopping Knowledge Reasoning', 'User Behavior Alignment', 'Multi-lingual Abilities']
+ MODEL_SIZE = ['<4B', '4B-10B', '10B-20B', '20B-40B', '>40B', 'Unknown']
+ MODEL_TYPE = ['API', 'OpenSource', 'Proprietary']
+
+ # The Markdown blurb shown on each leaderboard tab
+ LEADERBOARD_MD = {}
+
+ LEADERBOARD_MD['MAIN'] = f"""
+ ## Included Shopping Skills:
+
+ - Shopping Concept Understanding: Understanding domain-specific short texts in online shopping (e.g. brands, product models).
+ - Shopping Knowledge Reasoning: Reasoning over commonsense, numeric, and implicit product-product multi-hop knowledge.
+ - User Behavior Alignment: Modeling heterogeneous and implicit user behaviors (e.g. click, query, purchase).
+ - Multi-lingual Abilities: Online shopping across marketplaces around the globe.
+
+ ## Main Evaluation Results
+
+ - Metrics:
+     - Avg Score: The average score on all 4 online shopping skills (normalized to 0-100; higher is better).
+ - Detailed metrics and evaluation results for each skill are provided in the subsequent tabs.
+ """
+
+
+ LEADERBOARD_MD['Shopping Concept Understanding'] = """
+ ## Shopping Concept Understanding Evaluation Results
+
+ Online shopping concepts such as brands and product models are domain-specific and rarely seen in pre-training. Moreover, they often appear in short texts (e.g. queries, attribute-value pairs), so little context is available to help interpret them. Failing to understand these concepts therefore compromises the performance of LLMs on downstream tasks.
+
+ The sub-skills and tasks include:
+ - **Concept Normalization**:
+     - Product Category Synonym
+     - Attribute Value Synonym
+ - **Elaboration**:
+     - Attribute Explanation
+     - Product Category Explanation
+ - **Relational Inference**:
+     - Applicable Attribute to Product Category
+     - Applicable Product Category to Attribute
+     - Inapplicable Attributes
+     - Valid Attribute Value Given Attribute and Product Category
+     - Valid Attribute Given Attribute Value and Product Category
+     - Product Category Classification
+     - Product Category Generation
+ - **Sentiment Analysis**:
+     - Aspect-based Sentiment Classification
+     - Aspect-based Review Retrieval
+     - Aspect-based Review Selection
+     - Aspect-based Reviews Overall Sentiment Classification
+ - **Information Extraction**:
+     - Attribute Value Extraction
+     - Query Named Entity Recognition
+     - Aspect-based Review Keyphrase Selection
+     - Aspect-based Review Keyphrase Extraction
+ - **Summarization**:
+     - Attribute Naming from Description
+     - Product Category Naming from Description
+     - Review Aspect Retrieval
+     - Single Conversation Topic Selection
+     - Multi-Conversation Topic Retrieval
+     - Product Keyphrase Selection
+     - Product Keyphrase Retrieval
+     - Product Title Generation
+ """
+
+
+ LEADERBOARD_MD['Shopping Knowledge Reasoning'] = """
+ ## Shopping Knowledge Reasoning Evaluation Results
+
+ This skill focuses on understanding and applying various kinds of implicit knowledge to reason over products and their attributes. For example, calculations such as the total volume of a product pack require numeric reasoning, and finding compatible products requires multi-hop reasoning among various products over a product knowledge graph.
+
+ The sub-skills and tasks include:
+ - **Numeric Reasoning**:
+     - Unit Conversion
+     - Product Numeric Reasoning
+ - **Commonsense Reasoning**
+ - **Implicit Multi-Hop Reasoning**:
+     - Product Compatibility
+     - Complementary Product Categories
+     - Implicit Attribute Reasoning
+     - Related Brands Selection
+     - Related Brands Retrieval
+ """
+
+ LEADERBOARD_MD['User Behavior Alignment'] = """
+ ## User Behavior Alignment Evaluation Results
+
+ Accurately modeling user behaviors is a crucial skill in online shopping. A large variety of user behaviors exist in online shopping, including queries, clicks, add-to-carts, purchases, etc. Moreover, these behaviors are generally implicit and not expressed in text.
+
+ Consequently, LLMs trained on general text struggle to align with these heterogeneous and implicit user behaviors, since they rarely observe such inputs during pre-training.
+
+ The sub-skills and tasks include:
+ - **Query-Query Relations**:
+     - Query Re-Writing
+     - Query-Query Intention Selection
+     - Intention-Based Related Query Retrieval
+ - **Query-Product Relations**:
+     - Product Category Selection for Query
+     - Query-Product Relation Selection
+     - Query-Product Ranking
+ - **Sessions**:
+     - Session-based Query Recommendation
+     - Session-based Next Query Selection
+     - Session-based Next Product Selection
+ - **Purchases**:
+     - Product Co-Purchase Selection
+     - Product Co-Purchase Retrieval
+ - **Reviews and QA**:
+     - Review Rating Prediction
+     - Aspect-Sentiment-Based Review Generation
+     - Review Helpfulness Selection
+     - Product-Based Question Answering
+ """
+
+ LEADERBOARD_MD['Multi-lingual Abilities'] = """
+ ## Multi-lingual Abilities Evaluation Results
+
+ Multi-lingual models are desirable in online shopping, as they can be deployed in multiple marketplaces without re-training.
+
+ The sub-skills and tasks include:
+ - **Multi-lingual Shopping Concept Understanding**:
+     - Multi-lingual Product Title Generation
+     - Multi-lingual Product Keyphrase Selection
+     - Cross-lingual Product Title Translation
+     - Cross-lingual Product Entity Alignment
+ - **Multi-lingual User Behavior Alignment**:
+     - Multi-lingual Query-product Relation Selection
+     - Multi-lingual Query-product Ranking
+     - Multi-lingual Session-based Product Recommendation
+ """
+
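Editor's note: a small sanity check (illustrative only, not part of the commit) that the local results file contains every skill listed in `MAIN_FIELDS` for every model, and that each `LEADERBOARD_MD` key maps to a known tab:

```python
# Illustrative sanity check tying meta_data.py constants to the results file.
import json
from meta_data import RESULTS, MAIN_FIELDS, LEADERBOARD_MD

with open(RESULTS) as f:
    results = json.load(f)['results']
for model, entry in results.items():
    missing = [field for field in MAIN_FIELDS if field not in entry]
    assert not missing, f'{model} is missing skills: {missing}'
for key in LEADERBOARD_MD:
    assert key == 'MAIN' or key in MAIN_FIELDS, f'unknown tab: {key}'
print(f'OK: {len(results)} models x {len(MAIN_FIELDS)} skills')
```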
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ gradio==4.15.0
+ numpy>=1.23.4
+ pandas>=1.5.3