Lewdiculous/Model-Requests · Request: YeungNLP/firefly-gemma-7b

Mar 17, 2024

Model name: firefly-gemma-7b
Model link: https://huggingface.co/YeungNLP/firefly-gemma-7b
Brief description:
The chat template of our chat models is similar to that of the official gemma-7b-it:

user
hello, who are you?
model
I am a AI program developed by Firefly

An image/direct image link to represent the model (square shaped):

[Optional] Additional quants (if you want any):
IQ2 series, Q3_K_M (is its performance better than IQ3_M? idk.)

Cran-May

Mar 17, 2024

•

edited Mar 17, 2024

Performance

We evaluate our models on the Open LLM Leaderboard, where they achieve good performance.

Model	Average	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K
firefly-gemma-7b	62.93	62.12	79.77	61.57	49.41	75.45	49.28
zephyr-7b-gemma-v0.1	62.41	58.45	83.48	60.68	52.07	74.19	45.56
firefly-qwen1.5-en-7b-dpo-v0.1	62.36	54.35	76.04	61.21	56.4	72.06	54.13
zephyr-7b-beta	61.95	62.03	84.36	61.07	57.45	77.74	29.04
firefly-qwen1.5-en-7b	61.44	53.41	75.51	61.67	51.96	70.72	55.34
vicuna-13b-v1.5	55.41	57.08	81.24	56.67	51.51	74.66	11.3
Xwin-LM-13B-V0.1	55.29	62.54	82.8	56.53	45.96	74.27	9.63
Qwen1.5-7B-Chat	55.15	55.89	78.56	61.65	53.54	67.72	13.57
gemma-7b-it	53.56	51.45	71.96	53.52	47.29	67.96	29.19

Usage

The chat template of our chat models is similar to that of the official gemma-7b-it:

<bos><start_of_turn>user
hello, who are you?<end_of_turn>
<start_of_turn>model
I am a AI program developed by Firefly<eos>
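
For illustration, the single-turn template above can be assembled with a small helper. This is only a minimal sketch: `build_prompt` is a hypothetical name (it is not part of Firefly or transformers), and it assumes the literal token strings shown in the template.

```python
def build_prompt(user_message: str) -> str:
    """Format one user turn and open the model turn, following the template above."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# The returned string ends where the model's reply is expected to begin.
print(build_prompt("hello, who are you?"))
```

The helper keeps the newline after `<start_of_turn>model`, as the template shows. The source does not say whether the model is sensitive to that trailing newline.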

You can run inference with the script in Firefly.

You can also use the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-gemma-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. "
text = f"""
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
""".strip()
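# The text already starts with <bos>; Gemma tokenizers normally prepend one as well,
# so special tokens are disabled here to avoid a duplicate.
model_inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids together with the attention mask
    max_new_tokens=1500,
    do_sample=True,  # top_p and temperature only take effect when sampling
    top_p=0.9,
    temperature=0.35,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<eos>', add_special_tokens=False),
)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]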

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Lewdiculous

Owner Mar 17, 2024

•

edited Mar 17, 2024

IQ2 series, Q3_K_M (is its performance better than IQ3_M? idk.)

You can refer to these for a general overview of the current sizes.

@Cran-May Can you add what would be intended use cases/applications for this particular model?

(I am thinking general simple/summary assistant work?)

I ask because it doesn't seem like what I usually focus on: unaligned and unsafe models with few restrictions.

Lewdiculous

Owner Mar 17, 2024

•

edited Mar 17, 2024

Cropped and upscaled card image: (square, 1:1)

    quantization_options = [
        "IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M", "Q3_K_M",
        "Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M", "Q5_K_S",
        "Q6_K", "Q8_0", "IQ3_M", "IQ3_S", "IQ3_XXS"
    ]

Lewdiculous

Owner Mar 17, 2024

The IQ2 quants will take a good while longer.

Lewdiculous

Owner Mar 17, 2024

•

edited Mar 17, 2024

@Cran-May - Heya! Will wait for your inputs.

I will need assistance to continue in this case as I am unable to load the initial FP16 GGUF created from your repo:

llama_model_load: error loading model: create_tensor: tensor 'blk.0.attn_q.weight' has wrong shape; expected  3072,  3072, got  3072,  4096,     1,     1
llama_load_model_from_file: failed to load model
main: error: unable to load model

Can't say I've ever run a Gemma model.
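
One possible reading of the error above, offered as an assumption rather than a confirmed diagnosis: the loader expected a square query projection (3072 x 3072), which only holds when heads times head size equals the embedding width. The sketch below does that arithmetic with values assumed from the published Gemma-7B configuration (hidden size 3072, 16 attention heads, head dimension 256); none of these numbers are stated in this thread, and `q_proj_shape` is a hypothetical helper.

```python
# Values assumed from the published Gemma-7B config (not stated in this thread).
hidden_size = 3072
num_attention_heads = 16
head_dim = 256


def q_proj_shape(hidden_size: int, n_heads: int, head_dim: int) -> tuple:
    """Return (input_dim, output_dim) of the attention query projection."""
    return (hidden_size, n_heads * head_dim)


# Shape implied by the assumed Gemma-7B values.
actual = q_proj_shape(hidden_size, num_attention_heads, head_dim)

# A Llama-style layout uses head_dim == hidden_size // n_heads, giving a square matrix.
llama_style = q_proj_shape(
    hidden_size, num_attention_heads, hidden_size // num_attention_heads
)

print("Gemma-7B q_proj:", actual)
print("Llama-style assumption:", llama_style)
```

Under these assumptions the two shapes differ in the same way as the "expected" and "got" values in the log, which would be consistent with the file having been written as a Llama-style model.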

Lewdiculous changed discussion title from Request: YeungNLP/firefly-gemma-7b to Request: YeungNLP/firefly-gemma-7b (help-needed) Mar 17, 2024

Lewdiculous

Owner Mar 18, 2024

•

edited Mar 18, 2024

@Virt-io Just in case you haven't seen this: do you know if I can sort this out on my own?

I can't seem to load the FP16 GGUF converted from the linked repo using llama.cpp/convert.py. Maybe you know something about Gemma and llama.cpp that I am not aware of...

Virt-io

Mar 19, 2024

•

edited Mar 19, 2024

@Lewdiculous

Use convert-hf-to-gguf.py instead of convert.py

I am unsure why convert.py doesn't work, as I have not been paying attention to Gemma models.

Warning: convert-hf-to-gguf.py gave me an FP32 GGUF, so you will need to requantize it to FP16 before running imatrix. In addition, convert-hf-to-gguf.py needs more RAM, or a page file.
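
The sequence described above (convert, requantize to FP16, then run imatrix) could be laid out roughly as follows. This is a sketch under assumptions: the directory and file names are hypothetical, the tool names and flags (`--outfile`, `quantize ... F16`, `imatrix -m/-f/-o`) are those of llama.cpp around the time of this thread and may differ in other versions, and the commands are only printed, not executed.

```python
import shlex


def conversion_commands(model_dir: str, work_dir: str, calibration_file: str) -> list:
    """Build the three command lines for convert -> FP16 requant -> imatrix."""
    f32_path = f"{work_dir}/model-f32.gguf"  # hypothetical output names
    f16_path = f"{work_dir}/model-f16.gguf"
    return [
        # 1. Convert the Hugging Face checkpoint to GGUF.
        ["python", "convert-hf-to-gguf.py", model_dir, "--outfile", f32_path],
        # 2. Requantize the converted file to FP16.
        ["./quantize", f32_path, f16_path, "F16"],
        # 3. Compute the importance matrix from a calibration text file.
        ["./imatrix", "-m", f16_path, "-f", calibration_file,
         "-o", f"{work_dir}/imatrix.dat"],
    ]


# Print the commands for review instead of running them.
for cmd in conversion_commands("firefly-gemma-7b", "out", "calibration.txt"):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; each list can be passed to `subprocess.run` once the paths have been checked against the local llama.cpp build.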