pipeline_tag: text-generation
---

## Introduction

[RAG-Instruct](https://arxiv.org/abs/2501.00353) is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets from any source corpus, leveraging the following approaches:

- **Five RAG paradigms**, which represent diverse query-document relationships to enhance model generalization across tasks.
- **Instruction simulation**, which enriches instruction diversity and quality by utilizing the strengths of existing instruction datasets.

Using this approach, we constructed the [RAG-Instruct dataset](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct), a 40K-example instruction dataset built from Wikipedia that covers a wide range of RAG scenarios and tasks.
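
The dataset is hosted on the Hugging Face Hub, so it can be browsed directly with the `datasets` library. A minimal sketch (the split name and record layout are assumptions, not documented above):

```python
from datasets import load_dataset

# Download the RAG-Instruct data from the Hugging Face Hub
# (split name "train" is an assumption)
ds = load_dataset("FreedomIntelligence/RAG-Instruct", split="train")

print(ds)     # number of rows and column names
print(ds[0])  # one synthesized RAG instruction example
```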

Our RAG-Instruct-Llama3-3B is trained on this [RAG-Instruct data](https://huggingface.co/datasets/FreedomIntelligence/RAG-Instruct), which significantly enhances its RAG ability, with clear improvements over the base model across a range of RAG benchmarks:

| Model | WQA (acc) | PQA (acc) | TQA (acc) | OBQA (EM) | Pub (EM) | ARC (EM) | 2WIKI (acc) | HotP (acc) | MSQ (acc) | CFQA (EM) | PubMed (EM) |
|--------------------------------|-----------|-----------|-----------|-----------|----------|----------|-------------|------------|-----------|-----------|-------------|
| Llama3.2-3B | 58.7 | 61.8 | 69.7 | 77.0 | 55.0 | 66.8 | 55.6 | 40.2 | 13.2 | 46.8 | 70.3 |
| Llama3.2-3B + **RAG-Instruct** | 65.3 | 64.0 | 77.0 | 81.2 | 66.4 | 73.0 | 72.9 | 52.7 | 25.0 | 50.3 | 72.6 |

## Usage

RAG-Instruct-Llama3-3B can be used just like `Llama-3.2-3B-Instruct`. You can deploy it with tools like [vLLM](https://github.com/vllm-project/vllm) or [SGLang](https://github.com/sgl-project/sglang) (see the serving sketch after the example below), or perform direct inference with `transformers`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FreedomIntelligence/RAG-Instruct-Llama3-3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3-3B")

# Example input: retrieved paragraphs followed by the instruction
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...

### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""

# Wrap the input in the chat template and tokenize
messages = [{"role": "user", "content": input_text}]
inputs = tokenizer(
    tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    return_tensors="pt",
).to(model.device)

# Generate and decode the answer
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
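
For serving, the same prompt can be run through vLLM's offline inference API. A minimal sketch, assuming a recent vLLM release where `LLM.chat` is available; the sampling settings are illustrative, not recommendations from the model authors:

```python
from vllm import LLM, SamplingParams

# Build the engine once; vLLM handles batching and KV-cache management
llm = LLM(model="FreedomIntelligence/RAG-Instruct-Llama3-3B")
params = SamplingParams(temperature=0.7, max_tokens=2048)

# Same "### Paragraph: ... ### Instruction: ..." format as the example above
input_text = "### Paragraph:\n[1] ...\n\n### Instruction:\nYour question here.\n"
messages = [{"role": "user", "content": input_text}]

# llm.chat applies the model's chat template before generation
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```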

## Citation

```
@misc{liu2024raginstructboostingllmsdiverse,
      title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},