How to chat with Q-RWKV-6?
I'm really interested in your model! But this is my first time hearing about an RWKV-architecture model, so could you please provide the environment requirements and the code to chat with it? Thanks!
hi @ljy77777777 - this model is available for inference on Featherless.ai
Check out here: https://featherless.ai/models/recursal/QRWKV6-32B-Instruct-Preview-v0.1
We provide inference via OpenAI compatible API endpoints, so you can chat with it with just about any chat client (e.g. TypingMind, SillyTavern to name a few).
We also provide a basic in-browser client, Phoenix, for fast experimentation.
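As a quick illustration, a chat request through any OpenAI-compatible client looks roughly like the sketch below (the base URL is my assumption of the Featherless endpoint - check their docs for the exact value):

```python
# Hypothetical sketch: chatting with the model through an OpenAI-compatible endpoint.
# The base_url below is an assumption -- confirm it in the Featherless.ai documentation.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.featherless.ai/v1",  # assumed Featherless endpoint
    api_key="YOUR_FEATHERLESS_API_KEY",
)

response = client.chat.completions.create(
    model="recursal/QRWKV6-32B-Instruct-Preview-v0.1",
    messages=[{"role": "user", "content": "Give me a one-line summary of RWKV."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```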
Thanks for your answer! If I want to deploy Q-RWKV-6 on my own device, is it supported by vLLM now?
I also tried to use pipeline parallelism (via the transformers & accelerate libraries) to deploy the model across multiple GPUs, and I found it does not work.
Could you please release official code for deploying the model? Thanks!
We don't have vllm support yet, but we used this HF model a lot internally to do the evals using lm-eval-harness and accelerate, so it definitely should work for you. You'll need to install the latest version of the flash-linear-attention repo at https://github.com/sustcsonglin/flash-linear-attention and a recent version of Triton.
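A minimal loading sketch along those lines (not an official recipe - the dtype, the `trust_remote_code` flag, and single-GPU placement are assumptions you may need to adjust):

```python
# Sketch only -- assumes flash-linear-attention and a recent Triton are installed, e.g.:
#   pip install triton
#   pip install git+https://github.com/sustcsonglin/flash-linear-attention
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "recursal/QRWKV6-32B-Instruct-Preview-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # bf16 keeps the 32B weights around 64 GB
    trust_remote_code=True,       # assumed needed for the custom RWKV6/Qwen modeling code
).to("cuda")

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```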
Thanks for your answer! I deployed the model on my device successfully. However, I find that on many instructions or tasks, the model answers the question completely but then does not stop and keeps generating unrelated text. I think there might be something wrong with my prompt, because I found that RWKV-4-World expects the prompt to be written as follows:
"""Instruction: {instruction}
Input: {input}
Response:"""
Therefore, could you please provide the prompt template you use when evaluating the model on benchmarks? Thank you!
Hello, Q-RWKV-6 is an excellent linear-attention LLM. However, I find the model's performance in Chinese chatting is not very good. Is it because the continued training used only English data?
The chat template is built into the huggingface repo in https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/tokenizer_config.json
Unlike the world-tokenizer based RWKV models, it follows standard ChatML (e.g. "<|im_start|>", "<|im_end|>"), like the base Qwen model it is adapted from.
Not too sure about Chinese chatting - we did some minor checks and it seemed fine there, but it's definitely possible that the training data reduced its abilities in Chinese, as we used DCLM.
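Concretely, you can let the tokenizer apply that built-in ChatML template instead of hand-writing an RWKV-4-World style prompt; a rough sketch (generation parameters are illustrative, and it assumes the model/tokenizer loading shown earlier):

```python
# Sketch: format a conversation with the built-in ChatML template and generate.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain linear attention in two sentences."},
]

# apply_chat_template reads the chat_template stored in tokenizer_config.json,
# so the prompt is wrapped in <|im_start|>/<|im_end|> ChatML markers.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # stop at end of turn
)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Prompting without these ChatML markers (or without stopping on the end-of-turn token) could also explain the run-on generations described above, though that is only a guess.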
When running inference with the following code,
```python
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
```
we get the error: RuntimeError: probability tensor contains either inf, nan or element < 0
hi @ljy77777777 , how did you load the model with multiple GPUs? I set the device_map="auto" and it goes with an AttributeError: 'RWKV6State' object has no attribute '_modules'. :(
hi @York-Z, I encountered the same issue; I guess RWKV6State cannot be used with PP (multiple GPUs). However, because RWKV only needs constant space to cache its state, a single GPU can run the 32B model at 32K context length.
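As a back-of-the-envelope illustration of why the state stays small (all configuration numbers below are made-up placeholders, not the real QRWKV6 or Qwen-32B configs):

```python
# Rough arithmetic: a recurrent state is constant in sequence length,
# while a transformer KV cache grows linearly with it.
bytes_bf16 = 2
layers, kv_heads, head_dim = 64, 8, 128   # illustrative placeholders
seq_len = 32_000

# Transformer-style KV cache: K and V tensors per layer, one entry per token.
kv_cache = 2 * layers * kv_heads * head_dim * seq_len * bytes_bf16
print(f"KV cache at 32K tokens: {kv_cache / 1e9:.1f} GB")   # grows with seq_len

# RWKV-style recurrent state: one (head_dim x head_dim) matrix per head per layer,
# independent of how many tokens have been processed.
heads = 64
state = layers * heads * head_dim * head_dim * bytes_bf16
print(f"Recurrent state: {state / 1e6:.1f} MB (constant in seq_len)")
```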
reply @ljy77777777: Oh OK. I tested the inference time of QRWKV and Qwen-32B with a 4K-token input prompt on a single GPU and found that QRWKV is slightly slower than Qwen-32B. Did you get a similar result?
reply @York-Z: Yes, I got a similar result. I think the reason may be that the Flash Linear Attention kernel is still an RNN-style construct; peak GPU utilization is only 43%, so I suspect the prefill in the fla kernel is not computed in parallel.
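For reference, one simple way to time the prefill in isolation (assuming `model` and `tokenizer` are already loaded; the 4K dummy prompt is only an approximation):

```python
# Sketch: time the prefill (prompt processing) separately from decoding.
import time
import torch

prompt = "word " * 4000                       # roughly a 4K-token dummy prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(**inputs)                           # a single forward pass == prefill
torch.cuda.synchronize()
print(f"prefill: {time.perf_counter() - start:.2f}s for {inputs.input_ids.shape[-1]} tokens")
```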
So you can do TP by splitting headwise; flash-linear-attention also has some other kernels that are faster along the sequence dimension instead of the batch dimension.