Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


Llama-2-13b-hf-4bit-64rank - GGUF
- Model creator: https://huggingface.co/LoftQ/
- Original model: https://huggingface.co/LoftQ/Llama-2-13b-hf-4bit-64rank/


| Name | Quant method | Size |
| ---- | ---- | ---- |
| [Llama-2-13b-hf-4bit-64rank.Q2_K.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q2_K.gguf) | Q2_K | 4.52GB |
| [Llama-2-13b-hf-4bit-64rank.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.IQ3_XS.gguf) | IQ3_XS | 4.99GB |
| [Llama-2-13b-hf-4bit-64rank.IQ3_S.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.IQ3_S.gguf) | IQ3_S | 5.27GB |
| [Llama-2-13b-hf-4bit-64rank.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q3_K_S.gguf) | Q3_K_S | 5.27GB |
| [Llama-2-13b-hf-4bit-64rank.IQ3_M.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.IQ3_M.gguf) | IQ3_M | 5.57GB |
| [Llama-2-13b-hf-4bit-64rank.Q3_K.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q3_K.gguf) | Q3_K | 5.9GB |
| [Llama-2-13b-hf-4bit-64rank.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q3_K_M.gguf) | Q3_K_M | 5.9GB |
| [Llama-2-13b-hf-4bit-64rank.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q3_K_L.gguf) | Q3_K_L | 6.45GB |
| [Llama-2-13b-hf-4bit-64rank.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.IQ4_XS.gguf) | IQ4_XS | 6.54GB |
| [Llama-2-13b-hf-4bit-64rank.Q4_0.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q4_0.gguf) | Q4_0 | 6.86GB |
| [Llama-2-13b-hf-4bit-64rank.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.IQ4_NL.gguf) | IQ4_NL | 6.9GB |
| [Llama-2-13b-hf-4bit-64rank.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q4_K_S.gguf) | Q4_K_S | 6.91GB |
| [Llama-2-13b-hf-4bit-64rank.Q4_K.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q4_K.gguf) | Q4_K | 7.33GB |
| [Llama-2-13b-hf-4bit-64rank.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q4_K_M.gguf) | Q4_K_M | 7.33GB |
| [Llama-2-13b-hf-4bit-64rank.Q4_1.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q4_1.gguf) | Q4_1 | 7.61GB |
| [Llama-2-13b-hf-4bit-64rank.Q5_0.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q5_0.gguf) | Q5_0 | 8.36GB |
| [Llama-2-13b-hf-4bit-64rank.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q5_K_S.gguf) | Q5_K_S | 8.36GB |
| [Llama-2-13b-hf-4bit-64rank.Q5_K.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q5_K.gguf) | Q5_K | 8.6GB |
| [Llama-2-13b-hf-4bit-64rank.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q5_K_M.gguf) | Q5_K_M | 8.6GB |
| [Llama-2-13b-hf-4bit-64rank.Q5_1.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q5_1.gguf) | Q5_1 | 9.1GB |
| [Llama-2-13b-hf-4bit-64rank.Q6_K.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q6_K.gguf) | Q6_K | 9.95GB |
| [Llama-2-13b-hf-4bit-64rank.Q8_0.gguf](https://huggingface.co/RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf/blob/main/Llama-2-13b-hf-4bit-64rank.Q8_0.gguf) | Q8_0 | 12.88GB |
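
The GGUF files above target llama.cpp-compatible runtimes. As an illustrative addition (not part of the original card), the sketch below downloads one quant with `huggingface_hub` and runs it through `llama-cpp-python`; the choice of `Q4_K_M`, the context size, and the prompt are arbitrary examples.

```python
# Illustrative only: fetch one quant from this repo and run it with llama-cpp-python.
# Requires: pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Any filename from the table above works; Q4_K_M is just an example pick.
gguf_path = hf_hub_download(
    repo_id="RichardErkhov/LoftQ_-_Llama-2-13b-hf-4bit-64rank-gguf",
    filename="Llama-2-13b-hf-4bit-64rank.Q4_K_M.gguf",
)

llm = Llama(model_path=gguf_path, n_ctx=2048)  # example context length
out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])
```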




Original model description:
---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- quantization
- lora
---
# LoftQ Initialization

| [Paper](https://arxiv.org/abs/2310.08659) | [Code](https://github.com/yxli2123/LoftQ) | [PEFT Example](https://github.com/huggingface/peft/tree/main/examples/loftq_finetuning) |

LoftQ (LoRA-fine-tuning-aware Quantization) provides a quantized backbone Q and LoRA adapters A and B, given a full-precision pre-trained weight W.
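
Concretely, the LoftQ paper casts this as minimizing the Frobenius-norm error between W and Q + ABᵀ, alternating between quantizing the low-rank residual and refitting A, B from a rank-r SVD. The snippet below is a toy NumPy sketch of that alternating scheme (an illustration only, not the official implementation; the real method uses bitsandbytes NF4 rather than the uniform quantizer shown here).

```python
# Toy sketch of LoftQ-style alternating initialization: find a quantized Q and
# low-rank factors A, B such that W is approximately Q + A @ B.T.
import numpy as np

def toy_quantize(x, n_bits=4):
    # Simplified uniform quantizer for illustration only.
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale

def loftq_init(W, rank=64, steps=5):
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((W.shape[1], rank))
    for _ in range(steps):
        Q = toy_quantize(W - A @ B.T)                    # quantize the residual
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                       # refit rank-r factors
        B = Vt[:rank].T
    return Q, A, B

Q, A, B = loftq_init(np.random.randn(128, 128), rank=8)
```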

This model, `Llama-2-13b-hf-4bit-64rank`, is obtained from [LLAMA-2-13b](https://huggingface.co/meta-llama/Llama-2-13b-hf).
The quantized backbone is stored at the root of `LoftQ/Llama-2-13b-hf-4bit-64rank`, and the LoRA adapters are in the `loftq_init` subfolder.

## Model Info
### Backbone
- Stored format: `torch.bfloat16`
- Size: ~26 GiB
- Loaded format: bitsandbytes nf4
- Size loaded on GPU: ~6.5 GiB

### LoRA adapters
- rank: 64
- lora_alpha: 64
- target_modules: ["down_proj", "up_proj", "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]
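
For reference, these adapter settings correspond roughly to the `LoraConfig` sketched below. This is only a reconstruction for readability; the actual config ships with the checkpoint and is loaded automatically by `PeftModel.from_pretrained`, and the dropout and bias values here are assumptions not stated on the card.

```python
# Reference sketch only; the real adapter config is loaded from the `loftq_init` subfolder.
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=["down_proj", "up_proj", "q_proj", "k_proj",
                    "v_proj", "o_proj", "gate_proj"],
    lora_dropout=0.0,   # assumed; not stated on the model card
    bias="none",        # assumed default
    task_type="CAUSAL_LM",
)
```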

## Usage

**Training** Here's an example of loading this model and preparing it for LoRA fine-tuning.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-13b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may need a different dtype for other models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)

# Do training with peft_model ...
```
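
One possible continuation of the snippet above (an illustrative addition, not from the original card): fine-tuning `peft_model` with the Hugging Face `Trainer`. The dataset, tokenization, and hyperparameters below are placeholder assumptions.

```python
# Illustrative continuation: assumes `peft_model` and `MODEL_ID` from the block above.
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# Small text corpus used purely as a placeholder training set.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="llama2-13b-loftq-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```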

## Experiment Results
We conducted supervised fine-tuning experiments on [GSM8K](https://huggingface.co/datasets/gsm8k)
and [WikiText-2](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-raw-v1).

| Model          | Bits | Rank | LoRA Initial         | GSM8K | WikiText-2 |
| -------------- | ---- | ---- | -------------------- | ----- | ---------- |
| LLAMA-2-13b     | 16   | 64   | Gaussian + 0         | 45.3  | 5.12       |
| LLAMA-2-13b     | 4    | 64   | Gaussian + 0 (QLoRA) | 39.9  | 5.22       |
| **LLAMA-2-13b** | 4    | 64   | LoftQ                | 45.0  | 5.16       |



**Inference** Here is example code for inference after the model has been fine-tuned on [GSM8K](https://huggingface.co/datasets/gsm8k).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Llama-2-13b-hf-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    torch_dtype=torch.bfloat16,  # you may need a different dtype for other models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="gsm8k",
    is_trainable=False,  # inference only; keep the adapters frozen
)

# Do inference with peft_model ...
```
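
Continuing the snippet above (again an illustrative addition, not from the original card), generation could look like this; the prompt and decoding settings are arbitrary examples.

```python
# Illustrative continuation: assumes `torch`, `base_model`, `peft_model`, and
# `MODEL_ID` from the block above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

prompt = ("Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether?")
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

with torch.no_grad():
    output_ids = peft_model.generate(**inputs, max_new_tokens=256, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```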
See the full code at our [GitHub repo](https://github.com/yxli2123/LoftQ).


## Citation

```bibtex
@article{li2023loftq,
  title={Loftq: Lora-fine-tuning-aware quantization for large language models},
  author={Li, Yixiao and Yu, Yifan and Liang, Chen and He, Pengcheng and Karampatziakis, Nikos and Chen, Weizhu and Zhao, Tuo},
  journal={arXiv preprint arXiv:2310.08659},
  year={2023}
}
```