nielsr (HF Staff) committed
Commit b3ad018 · verified · 1 parent: fa6be4b

Improve model card: Add pipeline tag, library name, paper & code links


This PR improves the model card for the F2LLM model by:

* Adding the `pipeline_tag: feature-extraction` metadata, which correctly categorizes the model as an embedding model and improves its discoverability on the Hugging Face Hub (see the first sketch below).
* Specifying `library_name: transformers`, since the model's usage snippet demonstrates compatibility with the Hugging Face Transformers library; this enables the automated "How to use" code snippet on the model page (see the second sketch below).
* Moving the metadata to the top as YAML front matter.
* Adding the paper title as the main heading.
* Providing prominent links to the paper ([F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294)) and the GitHub repository (`https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM`) at the beginning of the model card for improved visibility.
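
As a quick illustration of the discoverability gain, the snippet below lists feature-extraction models on the Hub. This is a minimal sketch assuming a recent `huggingface_hub` release in which `list_models` accepts a `pipeline_tag` argument; the search string is only an example.

```python
# Minimal sketch: browsing Hub models by pipeline tag.
# Assumes a recent huggingface_hub release where `list_models`
# accepts `pipeline_tag`; the search term is illustrative only.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="feature-extraction", search="F2LLM", limit=5):
    print(model.id)
```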

The existing content, including usage examples, evaluation details, training information, and citation, remains largely unchanged to preserve the original author's documentation.
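
For reference, here is a hedged sketch of the kind of Transformers usage that the `library_name` metadata unlocks. The repository id below is a placeholder, and last-token pooling is an assumption (common for decoder-based embedding models, but not taken from this model card); the card's own Usage section remains the authoritative snippet.

```python
# Hedged sketch of feature extraction via Transformers.
# The repo id is a placeholder; the pooling strategy is assumed.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "codefuse-ai/F2LLM"  # placeholder repo id, adjust to the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

texts = ["how to pool sentence embeddings", "F2LLM maps text to dense vectors"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Assumed last-token pooling: embedding = hidden state of the final
# non-padding token in each sequence (right padding assumed).
last = batch["attention_mask"].sum(dim=1) - 1
embeddings = hidden[torch.arange(hidden.size(0)), last]
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
print(embeddings.shape)
```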

Files changed (1): README.md (+11 -4)
README.md CHANGED
@@ -1,13 +1,20 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen3-1.7B
 datasets:
 - codefuse-ai/F2LLM
 language:
 - en
-base_model:
-- Qwen/Qwen3-1.7B
+license: apache-2.0
+pipeline_tag: feature-extraction
+library_name: transformers
 ---
 
+# F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
+
+This model is presented in the paper [F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data](https://huggingface.co/papers/2510.02294).
+The code for this model is available on [GitHub](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/F2LLM).
+
 F2LLMs (Foundation to Feature Large Language Models) are foundation models directly finetuned on 6 million high-quality query-document pairs (available in [codefuse-ai/F2LLM](https://huggingface.co/datasets/codefuse-ai/F2LLM)) covering a diverse range of retrieval, classification, and clustering data, curated solely from open-source datasets without any synthetic data. These models are trained with homogeneous macro batches in a single stage, without sophisticated multi-stage pipelines.
 
 ## Usage
@@ -84,4 +91,4 @@ If you use the F2LLM models, data, or code, please cite the following technical
   eprinttype = {arXiv},
   eprint = {2510.02294}
 }
-```
+```