BAAI / Shitao committed on
Commit 117489c
1 Parent(s): cfa9e40

Update README.md

Files changed (1)
  1. README.md +30 -35
README.md CHANGED
@@ -3,7 +3,10 @@ license: mit
3
  language:
4
  - zh
5
  pipeline_tag: sentence-similarity
 
 
6
  ---
 
7
  <h1 align="center">FlagEmbedding</h1>
8
 
9
 
@@ -13,22 +16,20 @@ pipeline_tag: sentence-similarity
13
  <a href=#usage>Usage</a> |
14
  <a href="#evaluation">Evaluation</a> |
15
  <a href="#train">Train</a> |
16
- <a href="#contact">Contact</a> |
17
  <a href="#license">License</a>
18
  <p>
19
  </h4>
 
20
 
21
- More details please refer to our github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
22
-
23
- [English](README.md) | [中文](README_zh.md)
24
 
25
  FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
26
- And it also can be used in vector database for LLMs.
27
 
28
  ************* 🌟**Updates**🌟 *************
29
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
30
- - 08/02/2023: Release `bge-large-*`(short for BAAI General Embedding) Models, **rank 1st on MTEB and C-MTEB benchmark!**
31
- - 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (**C-MTEB**), consisting of 31 test dataset.
32
 
33
 
34
  ## Model List
@@ -37,12 +38,12 @@ And it also can be used in vector database for LLMs.
37
 
38
  | Model | Language | Description | query instruction for retrieval |
39
  |:-------------------------------|:--------:| :--------:| :--------:|
40
- | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | **rank 1st** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
41
- | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | **rank 2nd** in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
42
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
43
- | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | **rank 1st** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/benchmark) benchmark | `为这个句子生成表示以用于检索相关文章:` |
44
- | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instruction, and **rank 2nd** in [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/benchmark) benchmark | |
45
- | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
46
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
47
 
48
 
@@ -51,15 +52,16 @@ And it also can be used in vector database for LLMs.
51
 
52
  * **Using FlagEmbedding**
53
  ```
54
- pip install flag_embedding
55
  ```
 
 
56
  ```python
57
- from flag_embedding import FlagModel
58
  sentences = ["样例数据-1", "样例数据-2"]
59
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
60
  embeddings = model.encode(sentences)
61
  print(embeddings)
62
-
63
  # for retrieval task, please use encode_queries() which will automatically add the instruction to each query
64
  # corpus in retrieval task can still use encode() or encode_corpus()
65
  queries = ['query_1', 'query_2']
@@ -88,13 +90,12 @@ embeddings = model.encode(sentences, normalize_embeddings=True)
88
  print(embeddings)
89
  ```
90
  For retrieval task,
91
- each query should start with a instruction (instructions see [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
92
  ```python
93
  from sentence_transformers import SentenceTransformer
94
  queries = ["手机开不了机怎么办?"]
95
  passages = ["样例段落-1", "样例段落-2"]
96
  instruction = "为这个句子生成表示以用于检索相关文章:"
97
-
98
  model = SentenceTransformer('BAAI/bge-large-zh')
99
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
100
  p_embeddings = model.encode(passages, normalize_embeddings=True)
@@ -110,16 +111,13 @@ from transformers import AutoTokenizer, AutoModel
110
  import torch
111
  # Sentences we want sentence embeddings for
112
  sentences = ["样例数据-1", "样例数据-2"]
113
-
114
  # Load model from HuggingFace Hub
115
  tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
116
  model = AutoModel.from_pretrained('BAAI/bge-large-zh')
117
-
118
  # Tokenize sentences
119
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
120
- # for retrieval task, add a instruction to query
121
  # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
122
-
123
  # Compute token embeddings
124
  with torch.no_grad():
125
  model_output = model(**encoded_input)
@@ -133,7 +131,7 @@ print("Sentence embeddings:", sentence_embeddings)
133
 
134
  ## Evaluation
135
  `baai-general-embedding` models achieve **state-of-the-art performance on both MTEB and C-MTEB leaderboard!**
136
- More details and evaluation scripts see [benchemark](benchmark/README.md).
137
 
138
  - **MTEB**:
139
 
@@ -162,7 +160,7 @@ More details and evaluation scripts see [benchemark](benchmark/README.md).
162
 
163
  - **C-MTEB**:
164
  We create a benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks.
165
- Please refer to [benchemark](benchmark/README.md) for a detailed introduction.
166
 
167
  | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
168
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
@@ -179,18 +177,17 @@ Please refer to [benchemark](benchmark/README.md) for a detailed introduction.
179
 
180
 
181
 
182
-
183
  ## Train
184
  This section will introduce the way we used to train the general embedding.
185
- The training scripts are in [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/),
186
- and we provide some examples to do [pre-train](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain/) and [fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).
187
 
188
 
189
  **1. RetroMAE Pre-train**
190
  We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
191
  which shows promising improvement in retrieval task ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
192
  The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
193
- In retromae, the mask ratio of the encoder and decoder are 0.3, and 0.5 respectively.
194
  We used the AdamW optimizer and the learning rate is 2e-5.
195
 
196
  **Pre-training data**:
@@ -214,26 +211,24 @@ We trained our model on 48 A100(40G) GPUs with a large batch size of 32,768 (so
214
  We used the AdamW optimizer and the learning rate is 1e-5.
215
  The temperature for contrastive loss is 0.01.
216
 
217
- For the version with `*-instrcution`, we add instruction to the query for the retrieval task in the training.
218
- For English, the instruction is `Represent this sentence for searching relevant passages: `;
219
- For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
220
  In the evaluation, the instruction should be added for sentence to passages retrieval task, not be added for other tasks.
221
 
222
 
223
- The finetune script is accessible in this repository: [flag_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/flag_embedding/baai_general_embedding/README.md).
224
  You can easily finetune your model with it.
225
 
226
  **Training data**:
227
 
228
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
229
 
230
- - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE) and so on.
231
 
232
  **The data collection is to be released in the future.**
233
 
234
- We will continually update the embedding models and training codes,
235
- hoping to promote the development of the embedding model community.
236
 
237
 
238
  ## License
239
- FlagEmbedding is licensed under [MIT License](). The released models can be used for commercial purposes free of charge.
 
3
  language:
4
  - zh
5
  pipeline_tag: sentence-similarity
6
+ tags:
7
+ - sentence-transformers
8
  ---
9
+
10
  <h1 align="center">FlagEmbedding</h1>
11
 
12
 
 
16
  <a href=#usage>Usage</a> |
17
  <a href="#evaluation">Evaluation</a> |
18
  <a href="#train">Train</a> |
 
19
  <a href="#license">License</a>
20
  <p>
21
  </h4>
22
+ For more details please refer to our GitHub repo: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
23
 
24
+ [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
 
 
25
 
26
  FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
27
+ It can also be used in vector databases for LLMs.
28
 
29
  ************* 🌟**Updates**🌟 *************
30
  - 08/05/2023: Release base-scale and small-scale models, **best performance among the models of the same size 🤗**
31
+ - 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, **ranked 1st on the MTEB and C-MTEB benchmarks!** :tada: :tada:
32
+ - 08/01/2023: We release the [Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB) (**C-MTEB**), consisting of 31 test datasets.
33
 
34
 
35
  ## Model List
 
38
 
39
  | Model | Language | Description | query instruction for retrieval |
40
  |:-------------------------------|:--------:| :--------:| :--------:|
41
+ | [BAAI/bge-large-en](https://huggingface.co/BAAI/bge-large-en) | English | :trophy: ranks **1st** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
42
+ | [BAAI/bge-base-en](https://huggingface.co/BAAI/bge-base-en) | English | ranks **2nd** on the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaderboard | `Represent this sentence for searching relevant passages: ` |
43
  | [BAAI/bge-small-en](https://huggingface.co/BAAI/bge-small-en) | English | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
44
+ | [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | Chinese | :trophy: ranks **1st** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | `为这个句子生成表示以用于检索相关文章:` |
45
+ | [BAAI/bge-large-zh-noinstruct](https://huggingface.co/BAAI/bge-large-zh-noinstruct) | Chinese | This model is trained without instructions and ranks **2nd** on the [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB) benchmark | |
46
+ | [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | Chinese | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
47
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
48
 
49
 
 
52
 
53
  * **Using FlagEmbedding**
54
  ```
55
+ pip install FlagEmbedding
56
  ```
57
+ See [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md) for other ways to install FlagEmbedding.
58
+
59
  ```python
60
+ from FlagEmbedding import FlagModel
61
  sentences = ["样例数据-1", "样例数据-2"]
62
  model = FlagModel('BAAI/bge-large-zh', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:")
63
  embeddings = model.encode(sentences)
64
  print(embeddings)
 
65
  # for retrieval task, please use encode_queries() which will automatically add the instruction to each query
66
  # corpus in retrieval task can still use encode() or encode_corpus()
67
  queries = ['query_1', 'query_2']
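# A sketch of how this truncated example likely continues (assumption: FlagModel
# returns L2-normalized embeddings by default, so an inner product is a similarity score)
passages = ["样例段落-1", "样例段落-2"]
q_embeddings = model.encode_queries(queries)   # the retrieval instruction is added automatically
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)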
 
90
  print(embeddings)
91
  ```
92
For retrieval tasks,
93
+ each query should start with an instruction (see the instructions in the [Model List](https://github.com/FlagOpen/FlagEmbedding/tree/master#model-list)).
94
  ```python
95
  from sentence_transformers import SentenceTransformer
96
  queries = ["手机开不了机怎么办?"]
97
  passages = ["样例段落-1", "样例段落-2"]
98
  instruction = "为这个句子生成表示以用于检索相关文章:"
 
99
  model = SentenceTransformer('BAAI/bge-large-zh')
100
  q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
101
  p_embeddings = model.encode(passages, normalize_embeddings=True)
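# Since normalize_embeddings=True produces unit-length vectors, an inner product
# gives cosine-similarity scores (a sketch; this scoring step is not shown in the diff)
scores = q_embeddings @ p_embeddings.T
print(scores)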
 
111
  import torch
112
  # Sentences we want sentence embeddings for
113
  sentences = ["样例数据-1", "样例数据-2"]
 
114
  # Load model from HuggingFace Hub
115
  tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
116
  model = AutoModel.from_pretrained('BAAI/bge-large-zh')
 
117
  # Tokenize sentences
118
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
119
+ # for retrieval tasks, add an instruction to each query
120
  # encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
 
121
  # Compute token embeddings
122
  with torch.no_grad():
123
  model_output = model(**encoded_input)
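# Sketch of the pooling step that this diff truncates (assumption: cls pooling,
# i.e. taking the [CLS] token embedding, which is standard for bge models)
sentence_embeddings = model_output[0][:, 0]
# Normalize so that inner products correspond to cosine similarity
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)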
 
131
 
132
  ## Evaluation
133
`baai-general-embedding` models achieve **state-of-the-art performance on both the MTEB and C-MTEB leaderboards!**
134
+ For more details and evaluation tools, see our [scripts](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md).
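As a rough illustration of how such scores can be reproduced, the snippet below evaluates a single task with the open-source `mteb` toolkit; the package, task name, and output folder are illustrative assumptions and are not taken from this README.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate one MTEB task as a sanity check (full leaderboard runs cover all tasks)
model = SentenceTransformer("BAAI/bge-base-en")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/bge-base-en")
```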
135
 
136
  - **MTEB**:
137
 
 
160
 
161
  - **C-MTEB**:
162
We created C-MTEB, a benchmark for Chinese text embeddings that consists of 31 datasets across 6 tasks.
163
+ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/README.md) for a detailed introduction.
164
 
165
  | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
166
  |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
 
177
 
178
 
179
 
 
180
  ## Train
181
This section describes how we trained the general embedding models.
182
+ The training scripts are in [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md),
183
+ and we provide examples for [pre-training](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/pretrain/README.md) and [fine-tuning](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/README.md).
184
 
185
 
186
  **1. RetroMAE Pre-train**
187
  We pre-train the model following the method [retromae](https://github.com/staoxiao/RetroMAE),
188
which shows promising improvement in retrieval tasks ([paper](https://aclanthology.org/2022.emnlp-main.35.pdf)).
189
  The pre-training was conducted on 24 A100(40G) GPUs with a batch size of 720.
190
+ In RetroMAE, the mask ratios of the encoder and decoder are 0.3 and 0.5, respectively.
191
We used the AdamW optimizer with a learning rate of 2e-5.
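As a toy illustration of the asymmetric masking ratios mentioned above (0.3 for the encoder input, 0.5 for the decoder input), the sketch below masks random token positions; it only shows the masking step and is not the actual RetroMAE training code.

```python
import torch

def random_mask(input_ids: torch.Tensor, mask_ratio: float, mask_token_id: int = 103) -> torch.Tensor:
    """Replace a random fraction of token positions with the [MASK] id (103 in BERT-style vocabularies)."""
    masked = input_ids.clone()
    mask = torch.bernoulli(torch.full(input_ids.shape, mask_ratio)).bool()
    masked[mask] = mask_token_id
    return masked

input_ids = torch.randint(1000, 30000, (2, 128))        # toy batch of token ids
encoder_input = random_mask(input_ids, mask_ratio=0.3)  # encoder sees a lightly masked view
decoder_input = random_mask(input_ids, mask_ratio=0.5)  # decoder reconstructs from a heavier mask
```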
192
 
193
  **Pre-training data**:
 
211
We used the AdamW optimizer with a learning rate of 1e-5.
212
  The temperature for contrastive loss is 0.01.
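The objective described here is the standard InfoNCE loss with in-batch negatives; a minimal PyTorch sketch under that assumption (not the repository's training code) looks like this, with the temperature set to 0.01 as stated above.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.01) -> torch.Tensor:
    """InfoNCE with in-batch negatives: the i-th passage is the positive for the i-th query."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    scores = q_emb @ p_emb.T / temperature                       # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# toy usage: 8 query/passage pairs with 768-dimensional embeddings
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```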
213
 
214
+ For the version with `*-instruction`, we add an instruction to the query for the retrieval task during training.
215
+ For English, the instruction is `Represent this sentence for searching relevant passages: `;
216
+ For Chinese, the instruction is `为这个句子生成表示以用于检索相关文章:`.
217
In the evaluation, the instruction should be added for the sentence-to-passage retrieval task, but not for other tasks.
218
 
219
 
220
+ The finetune script is available in this repository: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
221
  You can easily finetune your model with it.
222
 
223
  **Training data**:
224
 
225
  - For English, we collect 230M text pairs from [wikipedia](https://huggingface.co/datasets/wikipedia), [cc-net](https://github.com/facebookresearch/cc_net), and so on.
226
 
227
+ - For Chinese, we collect 120M text pairs from [wudao](https://github.com/BAAI-WuDao/Data), [simclue](https://github.com/CLUEbenchmark/SimCLUE), and so on.
228
 
229
**The collected data will be released in the future.**
230
 
 
 
231
 
232
 
233
  ## License
234
+ FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.