Reasoning content is output in 'content' when serving with vLLM and calling the OpenAI-compatible API, using --reasoning-parser glm45

by stephenmcconnachie - opened 24 days ago

Discussion

stephenmcconnachie

24 days ago

vLLM version
The guide to using unsloth/GLM-4.7-Flash-FP8-Dynamic with vLLM recommends installing nightly vLLM from https://wheels.vllm.ai/nightly/cu130
However, there is no cu130 path. The paths change every night, and when I installed, the path was just https://wheels.vllm.ai/nightly
Current vLLM version: 0.16.0rc1.dev153+g2267cb1cf

Notes:

vLLM logs a warning on startup: WARNING 02-04 05:41:00 [init.py:144] xccl is not enabled in this torch build, communication is not available.
I'm serving vLLM with --kv-cache-dtype auto to resolve the looping/repetitive output issue with fp8, discussed at https://huggingface.co/unsloth/GLM-4.7-Flash-FP8-Dynamic/discussions/2
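
For anyone trying to reproduce this, here is a rough sketch of the launch command implied by the notes above. Only the model name and the two flags come from this thread; the exact command line I ran may have included other options, and the helper name build_serve_command is made up for illustration.

```python
import shlex

# Model name taken from this discussion.
MODEL = "unsloth/GLM-4.7-Flash-FP8-Dynamic"


def build_serve_command(model: str = MODEL) -> list[str]:
    """Return an argv list for `vllm serve` with the two flags mentioned in this thread.

    Hypothetical helper: any other options (port, tensor parallelism, etc.)
    are deliberately left out because they are not stated in the post.
    """
    return [
        "vllm", "serve", model,
        "--reasoning-parser", "glm45",   # parser named in the issue title
        "--kv-cache-dtype", "auto",      # avoids the fp8 KV-cache looping issue
    ]


if __name__ == "__main__":
    # Print a shell-safe version of the command instead of launching it.
    print(shlex.join(build_serve_command()))
```

The list can be passed to subprocess.run unchanged if you want to launch the server from Python rather than from a shell.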

Issue
When serving with vLLM using --reasoning-parser glm45, responses to API calls (using curl or requests in Python) place the reasoning content in the 'content' element instead of a 'reasoning_content' element, and 'reasoning' is null. Here is an example output:

{"id":"chatcmpl-8a34f3058c47091c","object":"chat.completion","created":1770186431,"model":"unsloth/GLM-4.7-Flash-FP8-Dynamic","choices":[{"index":0,"message":{"role":"assistant","content":"1. Analyze the Request: The user wants a one-sentence description of "VLLM".\n\n2. Identify the Core Subject: VLLM (which stands for "Very Large Language Model" in some contexts, but here specifically refers to the open-source library/framework "vLLM").\n\n3. **Identify Key Features/Functionality:**\n * It's for Large Language Models (LLMs).\n * It focuses on efficient inference (running the model).\n * Key technique: PagedAttention (memory management).\n * Key outcome: High throughput (speed).\n\n4. **Drafting Initial Ideas:**\n * Draft 1: VLLM is an open-source library for efficient inference of LLMs using PagedAttention. (Good, accurate)\n * Draft 2: It helps run large language models faster by managing memory smartly so you don't run out of space. (A bit too informal)\n * Draft 3: VLLM is a high-throughput inference engine designed for large language models. (Standard industry description)\n * Draft 4: Combining PagedAttention with caching, vLLM is an open-source project that enables efficient and fast inference of LLMs. (Comprehensive)\n\n5. **Refining for Brevity and Impact (One Sentence Constraint):**\n * "VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to maximize GPU memory efficiency and accelerate inference." -> A bit long.\n\n * Attempt 2: VLLM is an open-source high-performance inference engine for large language models that optimizes GPU memory usage through PagedAttention to speed up processing. -> Better.\n\n * Attempt 3 (Focus on the "what"): VLLM is an open-source library designed to optimize the inference speed and resource efficiency of large language models.\n\n6. 
**Selecting the Best Options:**\n * Option A (Technical): VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to maximize GPU memory efficiency.\n * Option B (Simpler): VLLM is an open-source library that speeds up the inference of large language models by managing GPU memory more efficiently.\n\n7. **Final Polish (Balancing technical accuracy and readability):**\n "VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to optimize GPU memory usage and accelerate inference."\n\n Let's check against common usage. Most descriptions center on inference, PagedAttention, and throughput.\n\n Revised: VLLM is an open-source high-performance inference engine designed for large language models that optimizes memory management and accelerates processing speeds.\n\n Let's try to make it punchy.\n "VLLM is an open-source library for efficiently running large language models by optimizing GPU memory usage and boosting inference throughput."\n\n8. Final Selection: "VLLM is an open-source high-throughput inference engine designed for large language models that optimizes GPU memory usage through PagedAttention to accelerate processing." 
(This covers the "how" and the "what").\n\n Simpler alternative: "VLLM is an open-source library that speeds up the deployment of large language models by managing GPU memory efficiently."\n\n *The most standard "one-sentence" definition usually cites PagedAttention.*\n\n Final Choice: **VLLM is an open-source high-throughput inference engine for large language models that optimizes GPU memory usage through PagedAttention to maximize processing speed.**\n\n *Actually, keeping it simpler is often better for a "describe in one sentence" prompt unless technical depth is requested.*\n\n Simple Version: VLLM is an open-source library that speeds up the deployment of large language models by optimizing GPU memory management.\n\n Technical Version: VLLM is an open-source high-performance inference engine that uses PagedAttention to maximize GPU memory efficiency and throughput for large language models.\n\n I will provide the technical but clear version as it explains why it's good.\n\n Self-Correction: The user might just want the definition. I will go with the most direct, accurate sentence.\n\n "VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to optimize GPU memory efficiency."\n\n Let's try to compress it even more.\n "VLLM is an open-source library that accelerates the inference of large language models through advanced GPU memory management techniques like PagedAttention."\n\n This is good. 
Let's stick to the "High-throughput inference engine" terminology as that's the official label.\n\n Final Answer Formulation:\n VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to optimize GPU memory usage and accelerate processing speeds.VLLM is an open-source high-throughput inference engine for large language models that utilizes PagedAttention to optimize GPU memory usage and accelerate processing speeds.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":154827,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":12,"total_tokens":1077,"completion_tokens":1065,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
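
To check a response without reading the whole blob, a small helper like the one below can report which field (if any) carries the reasoning. This is only a sketch: the function name locate_reasoning is made up, the sample dict is a heavily trimmed stand-in for the output above, and it checks both 'reasoning' and 'reasoning_content' because I am not certain which key a given server version uses.

```python
from typing import Any

# Field names that might hold separated reasoning (both are checked as an assumption).
REASONING_KEYS = ("reasoning", "reasoning_content")


def locate_reasoning(response: dict[str, Any]) -> dict[str, Any]:
    """Summarise where the reasoning ended up in a chat completion response."""
    message = response["choices"][0]["message"]
    # Keep only reasoning fields that are present and non-empty.
    populated = [key for key in REASONING_KEYS if message.get(key)]
    return {
        "reasoning_field": populated[0] if populated else None,
        "content_chars": len(message.get("content") or ""),
    }


# Trimmed stand-in for the response shown above: reasoning is null,
# and the chain of thought sits in 'content'.
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": "1. Analyze the Request: ...",
                "reasoning": None,
            }
        }
    ]
}

if __name__ == "__main__":
    print(locate_reasoning(sample))
```

If the parser were working, I would expect reasoning_field to be non-null and content to hold only the final one-sentence answer.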

stephenmcconnachie

22 days ago

I reinstalled vLLM from the nightly that matches my installed CUDA (12.9), using https://wheels.vllm.ai/nightly/cu129/vllm

No change - 'reasoning' is still null and the reasoning tokens are being output in 'content'.

I switched to the Q8 GGUF using llama.cpp's llama-server, and that works as expected: the reasoning tokens are correctly output to 'reasoning' and the answer to 'content'. But I would prefer this Dynamic FP8 via vLLM if this issue could be resolved.
