namespace-Pt committed on
Commit 3b2bff7 · verified · 1 Parent(s): 442ebf0

Upload folder using huggingface_hub

Files changed (3):
  1. README.md +70 -0
  2. infbench.json +0 -0
  3. needle.png +0 -0
README.md ADDED
@@ -0,0 +1,70 @@
# Intro

[Activation Beacon](https://arxiv.org/abs/2401.03462) compresses the original KV cache into fewer yet more compact states (a.k.a. beacons), thereby enabling the LLM to perceive a longer context within its fixed context window. It is known for the following features:
- **Effective**
  - there is little information loss at compression ratios of 2, 4, and 8;
- **Efficient**
  - it drastically reduces the GPU memory consumption of the KV cache (see the back-of-the-envelope sketch after this list);
- **Compatible**
  - it can work together with position extrapolation (e.g. YaRN) to further extend the context length; it can also work with grouped-query attention to further reduce the KV cache size;
- **Low-Cost**
  - it is lightweight and can be efficiently trained with roughly 1B tokens.

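To make the "Efficient" point concrete, here is a rough sketch of the KV cache footprint with and without beacon compression. The architecture numbers (layers, KV heads, head dimension) are assumptions loosely modelled on a Qwen-2-7B-style model with grouped-query attention, not values read from this checkpoint; check its `config.json` for the real figures.

```python
# Back-of-the-envelope KV cache size, before and after beacon compression.
# Assumed (not verified) architecture: 28 layers, 4 KV heads (GQA), head_dim 128, bf16.

def kv_cache_bytes(seq_len, num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2):
    # factor of 2 accounts for storing both keys and values
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

seq_len = 128_000
full = kv_cache_bytes(seq_len)
for ratio in (2, 4, 8):
    print(f"ratio {ratio}: {full / 2**30:.2f} GiB -> {full / ratio / 2**30:.2f} GiB")
```
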
# Usage
```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/beacon-qwen-2-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)

model = model.cuda().eval()

with torch.no_grad():
    # short context
    messages = [{"role": "user", "content": "Tell me about yourself."}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Output: {repr(tokenizer.decode(outputs[0], skip_special_tokens=True))}")

    # reset memory before new generation task
    model.memory.reset()

    # long context
    with open("infbench.json", encoding="utf-8") as f:
        example = json.load(f)
    messages = [{"role": "user", "content": example["context"]}]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
    outputs = model.generate(**inputs, do_sample=False, top_p=1, temperature=1, max_new_tokens=20)[:, inputs["input_ids"].shape[1]:]
    print("*" * 20)
    print(f"Input Length: {inputs['input_ids'].shape[1]}")
    print(f"Answers: {example['answer']}")
    print(f"Prediction: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```
**NOTE**: It's okay to see warnings like `This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (32768). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.` This warning can be safely ignored.
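
Because the beacon memory is stateful, each independent prompt should start from `model.memory.reset()`, as the snippet above does. A small convenience wrapper such as the hypothetical `chat_once` below keeps that from being forgotten; it reuses only the objects and calls from the usage example, and the helper name and its interface are illustrative, not part of the model's API:

```python
def chat_once(model, tokenizer, content, **gen_kwargs):
    """Hypothetical helper: run one independent chat turn on a fresh beacon memory."""
    # clear the compressed KV states so earlier contexts cannot leak into this turn
    model.memory.reset()
    messages = [{"role": "user", "content": content}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_tensors="pt", return_dict=True
    ).to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    # drop the prompt tokens before decoding the reply
    return tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# e.g. print(chat_once(model, tokenizer, "Tell me about yourself.", max_new_tokens=50))
```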

# Results

## LongBench

| Model | Single QA | Multi QA | Summarization | Few-Shot | Code | AVG |
|---------------------------|-----------|----------|---------------|----------|-------|-------|
| qwen-2-7b-instruct | 39.60 | 36.92 | 27.97 | 71.12 | 62.34 | 47.59 |
| beacon-qwen-2-7b-instruct | 40.76 | 43.73 | 27.23 | 68.87 | 68.47 | 49.81 |

## NIAH

![needle-in-a-haystack results](needle.png)
infbench.json ADDED
needle.png ADDED