add AIBOM
Dear model owner(s),
We are a group of researchers investigating the usefulness of sharing AIBOMs (Artificial Intelligence Bills of Materials) to document AI models. An AIBOM is a machine-readable, structured inventory of the components (e.g., datasets and models) that go into an AI model, shared to enhance transparency in AI-model supply chains.
To pursue this objective, we identified popular models on Hugging Face and, based on your model card (and some configuration information available on Hugging Face), generated your AIBOM according to the CycloneDX v1.6 standard (see https://cyclonedx.org/docs/1.6/json/). AIBOMs are generated as JSON files by ALOHA, an open-source supporting tool: https://github.com/MSR4SBOM/ALOHA (technical details are available in the accompanying research paper: https://github.com/MSR4SBOM/ALOHA/blob/main/ALOHA.pdf).
The JSON file in this pull request is your AIBOM (see https://github.com/MSR4SBOM/ALOHA/blob/main/documentation.json for details on its structure).
The submitted AIBOM matches the current model information; when the model evolves, it can easily be regenerated with the aforementioned tool.
We are opening this pull request containing an AIBOM for your AI model and hope you will consider it. We would also like to hear your opinion on the usefulness (or not) of AIBOMs through a 3-minute anonymous survey: https://forms.gle/WGffSQD5dLoWttEe7.
Thanks in advance, and regards,
Riccardo D’Avino, Fatima Ahmed, Sabato Nocera, Simone Romano, Giuseppe Scanniello (University of Salerno, Italy),
Massimiliano Di Penta (University of Sannio, Italy),
The MSR4SBOM team
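
Since the AIBOM is plain CycloneDX JSON, it can be consumed with ordinary tooling. Below is a minimal sketch (our illustration, not part of ALOHA) that loads the file from this pull request and lists the model's task and declared datasets; it assumes the file is saved locally as pankajmathur_orca_mini_3b.json and uses only the Python standard library.

```python
import json

# Load the AIBOM added in this pull request (local path is an assumption).
with open("pankajmathur_orca_mini_3b.json") as f:
    bom = json.load(f)

# The documented model lives under metadata.component.
model = bom["metadata"]["component"]
print(model["name"], "->", model["modelCard"]["modelParameters"]["task"])

# Training datasets are listed as top-level "data" components.
for component in bom.get("components", []):
    if component["type"] == "data":
        print("dataset:", component["name"], "-",
              component["data"][0]["contents"]["url"])
```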
- pankajmathur_orca_mini_3b.json +243 -0
@@ -0,0 +1,243 @@
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.6",
  "serialNumber": "urn:uuid:1888515b-dc4f-45ab-8c2d-b442c0d24934",
  "version": 1,
  "metadata": {
    "timestamp": "2025-06-05T09:37:53.860418+00:00",
    "component": {
      "type": "machine-learning-model",
      "bom-ref": "pankajmathur/orca_mini_3b-eafe2e45-5a59-5b43-9c97-86c388bed7b9",
      "name": "pankajmathur/orca_mini_3b",
      "externalReferences": [
        {
          "url": "https://huggingface.co/pankajmathur/orca_mini_3b",
          "type": "documentation"
        }
      ],
      "modelCard": {
        "modelParameters": {
          "task": "text-generation",
          "architectureFamily": "llama",
          "modelArchitecture": "LlamaForCausalLM",
          "datasets": [
            {
              "ref": "psmathur/alpaca_orca-0d13688f-ffdd-5fd5-9522-083dd42cdac9"
            },
            {
              "ref": "psmathur/dolly-v2_orca-ec6d4ce8-7474-520d-ac1e-080f58c05b6c"
            },
            {
              "ref": "psmathur/WizardLM_Orca-f084d080-d716-5a1d-bca0-b551ab1587aa"
            }
          ]
        },
        "properties": [
          {
            "name": "library_name",
            "value": "transformers"
          }
        ],
        "quantitativeAnalysis": {
          "performanceMetrics": [
            {
              "slice": "dataset: ai2_arc, split: test, config: ARC-Challenge",
              "type": "acc_norm",
              "value": 41.55
            },
            {
              "slice": "dataset: hellaswag, split: validation",
              "type": "acc_norm",
              "value": 61.52
            },
            {
              "slice": "dataset: cais/mmlu, split: test, config: all",
              "type": "acc",
              "value": 26.79
            },
            {
              "slice": "dataset: truthful_qa, split: validation, config: multiple_choice",
              "type": "mc2",
              "value": 42.42
            },
            {
              "slice": "dataset: winogrande, split: validation, config: winogrande_xl",
              "type": "acc",
              "value": 61.8
            },
            {
              "slice": "dataset: gsm8k, split: test, config: main",
              "type": "acc",
              "value": 0.08
            }
          ]
        }
      },
      "authors": [
        {
          "name": "pankajmathur"
        }
      ],
      "licenses": [
        {
          "license": {
            "id": "CC-BY-NC-SA-4.0",
            "url": "https://spdx.org/licenses/CC-BY-NC-SA-4.0.html"
          }
        }
      ],
      "tags": [
        "transformers",
        "pytorch",
        "safetensors",
        "llama",
        "text-generation",
        "en",
        "dataset:psmathur/alpaca_orca",
        "dataset:psmathur/dolly-v2_orca",
        "dataset:psmathur/WizardLM_Orca",
        "arxiv:2306.02707",
        "license:cc-by-nc-sa-4.0",
        "model-index",
        "autotrain_compatible",
        "text-generation-inference",
        "endpoints_compatible",
        "region:us"
      ]
    }
  },
  "components": [
    {
      "type": "data",
      "bom-ref": "psmathur/alpaca_orca-0d13688f-ffdd-5fd5-9522-083dd42cdac9",
      "name": "psmathur/alpaca_orca",
      "data": [
        {
          "type": "dataset",
          "bom-ref": "psmathur/alpaca_orca-0d13688f-ffdd-5fd5-9522-083dd42cdac9",
          "name": "psmathur/alpaca_orca",
          "contents": {
            "url": "https://huggingface.co/datasets/psmathur/alpaca_orca",
            "properties": [
              {
                "name": "task_categories",
                "value": "text-generation"
              },
              {
                "name": "language",
                "value": "en"
              },
              {
                "name": "size_categories",
                "value": "10K<n<100K"
              },
              {
                "name": "license",
                "value": "cc-by-nc-sa-4.0"
              }
            ]
          },
          "governance": {
            "owners": [
              {
                "organization": {
                  "name": "pankajmathur",
                  "url": "https://huggingface.co/pankajmathur"
                }
              }
            ]
          },
          "description": "Explain tuned Alpaca dataset ~52K created using approaches from Orca Research Paper. \nWe leverage all of the 15 system instructions provided in Orca Research Paper. to generate custom datasets, in contrast to vanilla instruction tuning approaches used by original datasets.\nThis helps student models like orca_mini_13b to learn thought process from teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version).\nPlease see how the System prompt is added before each instruction.\n"
        }
      ]
    },
    {
      "type": "data",
      "bom-ref": "psmathur/dolly-v2_orca-ec6d4ce8-7474-520d-ac1e-080f58c05b6c",
      "name": "psmathur/dolly-v2_orca",
      "data": [
        {
          "type": "dataset",
          "bom-ref": "psmathur/dolly-v2_orca-ec6d4ce8-7474-520d-ac1e-080f58c05b6c",
          "name": "psmathur/dolly-v2_orca",
          "contents": {
            "url": "https://huggingface.co/datasets/psmathur/dolly-v2_orca",
            "properties": [
              {
                "name": "task_categories",
                "value": "text-generation"
              },
              {
                "name": "language",
                "value": "en"
              },
              {
                "name": "size_categories",
                "value": "10K<n<100K"
              },
              {
                "name": "license",
                "value": "cc-by-nc-sa-4.0"
              }
            ]
          },
          "governance": {
            "owners": [
              {
                "organization": {
                  "name": "pankajmathur",
                  "url": "https://huggingface.co/pankajmathur"
                }
              }
            ]
          },
          "description": "Explain tuned Dolly-V2 dataset ~15K created using approaches from Orca Research Paper.\nWe leverage all of the 15 system instructions provided in Orca Research Paper to generate explain tuned datasets, in contrast to vanilla instruction tuning approaches used by original datasets.\nThis helps student models like orca_mini_13b, orca_mini_7b or orca_mini_3b to learn thought process from teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version).\nPlease see how the System prompt is added before\u2026 See the full description on the dataset page: https://huggingface.co/datasets/pankajmathur/dolly-v2_orca."
        }
      ]
    },
    {
      "type": "data",
      "bom-ref": "psmathur/WizardLM_Orca-f084d080-d716-5a1d-bca0-b551ab1587aa",
      "name": "psmathur/WizardLM_Orca",
      "data": [
        {
          "type": "dataset",
          "bom-ref": "psmathur/WizardLM_Orca-f084d080-d716-5a1d-bca0-b551ab1587aa",
          "name": "psmathur/WizardLM_Orca",
          "contents": {
            "url": "https://huggingface.co/datasets/psmathur/WizardLM_Orca",
            "properties": [
              {
                "name": "task_categories",
                "value": "text-generation"
              },
              {
                "name": "language",
                "value": "en"
              },
              {
                "name": "size_categories",
                "value": "10K<n<100K"
              },
              {
                "name": "license",
                "value": "cc-by-nc-sa-4.0"
              }
            ]
          },
          "governance": {
            "owners": [
              {
                "organization": {
                  "name": "pankajmathur",
                  "url": "https://huggingface.co/pankajmathur"
                }
              }
            ]
          },
          "description": "Explain tuned WizardLM dataset ~55K created using approaches from Orca Research Paper.\nWe leverage all of the 15 system instructions provided in Orca Research Paper. to generate custom datasets, in contrast to vanilla instruction tuning approaches used by original datasets.\nThis helps student models like orca_mini_13b to learn thought process from teacher model, which is ChatGPT (gpt-3.5-turbo-0301 version).\nPlease see how the System prompt is added before each instruction.\n"
        }
      ]
    }
  ]
}
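
As a quick illustration of how the structure above links together, the sketch below (again ours, not part of the ALOHA tool) checks that every dataset ref in the model card resolves to the bom-ref of a declared component:

```python
import json

# Local path is an assumption; save the JSON above under this name.
with open("pankajmathur_orca_mini_3b.json") as f:
    bom = json.load(f)

# bom-refs declared by top-level components.
declared = {c["bom-ref"] for c in bom.get("components", [])}

# Dataset refs recorded in the model card.
refs = [d["ref"] for d in
        bom["metadata"]["component"]["modelCard"]["modelParameters"]["datasets"]]

missing = [r for r in refs if r not in declared]
print("all dataset refs resolve" if not missing else f"dangling refs: {missing}")
```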