how to run this model
Hi,
Have you changed the packing or something else, or does it depend on a specific version of some library? I could not run it with Transformers:
python3.10/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 162, in _forward
out = torch.matmul(x, weights).reshape(out_shape)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
even after uninstalling gptqmodel:
Loading checkpoint shards: 100%|████████████████████| 9/9 [00:09<00:00,  1.02s/it]
Segmentation fault (core dumped)
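For reference, this is roughly how I'm loading it with Transformers (the repo id, prompt, and generation settings below are placeholders for my setup, not my exact script):

```python
# Rough sketch of the Transformers loading path that triggers the error above.
# The repo id and prompt are assumptions, not the exact values I used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # GPTQ checkpoints are typically fp16
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```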
Last time I ran it, last month, it was working fine with Transformers and vLLM.
I always use the most recent version of all the frameworks, so if you downgrade Transformers, auto-gptq, and optimum (and maybe PyTorch) to the versions that were current one month ago, it should work.
I'm not sure what causes your CUDA error or why it goes through gptqmodel.
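If it helps, here is a quick way to dump the relevant package versions so we can compare environments (standard library only; the package list is my guess at what matters here):

```python
# Print the installed versions of the packages that usually matter for GPTQ
# checkpoints, to compare a broken environment against a working one.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("torch", "transformers", "optimum", "auto-gptq", "gptqmodel", "vllm"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```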
Thanks! After uninstalling and reinstalling some libraries, I finally got it working. With a limited test of 10 samples, the accuracy looks good. I'll test all the samples and give an update.
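For the limited test I just ran a subset through the lm-eval Python API, along these lines, with `limit` being the relevant argument (a sketch, not my exact invocation):

```python
# Quick accuracy spot-check on a small subset of leaderboard_ifeval via the
# lm-eval Python API; limit=10 restricts the run to 10 samples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=./",   # local checkpoint directory
    tasks=["leaderboard_ifeval"],
    batch_size=16,
    limit=10,
)
print(results["results"]["leaderboard_ifeval"])
```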
The results look correct. Versions used:
transformers 4.50.2
vllm 0.6.6.post1
(autoround) wenhuach@mlp-dgx-01:/data5/wenhuach/kaitchup-Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit$ CUDA_VISIBLE_DEVICES=4 lm-eval --model vllm --model_args pretrained=./ --tasks leaderboard_ifeval --batch_size 16
2025-03-27:07:48:15,402 INFO [main.py:279] Verbosity set to INFO
2025-03-27:07:48:24,060 INFO [main.py:376] Selected Tasks: ['leaderboard_ifeval']
2025-03-27:07:48:24,065 INFO [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-27:07:48:24,065 INFO [evaluator.py:201] Initializing vllm model, with arguments: {'pretrained': './'}
INFO 03-27 07:48:31 config.py:510] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
INFO 03-27 07:48:32 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 03-27 07:48:32 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='./', speculative_config=None, tokenizer='./', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=./, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-27 07:48:33 selector.py:120] Using Flash Attention backend.
INFO 03-27 07:48:33 model_runner.py:1094] Starting to load model ./...
INFO 03-27 07:48:33 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards: 0% Completed | 0/9 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 11% Completed | 1/9 [00:00<00:05, 1.39it/s]
Loading safetensors checkpoint shards: 22% Completed | 2/9 [00:01<00:05, 1.26it/s]
Loading safetensors checkpoint shards: 33% Completed | 3/9 [00:02<00:04, 1.23it/s]
Loading safetensors checkpoint shards: 44% Completed | 4/9 [00:03<00:04, 1.21it/s]
Loading safetensors checkpoint shards: 56% Completed | 5/9 [00:04<00:03, 1.18it/s]
Loading safetensors checkpoint shards: 67% Completed | 6/9 [00:04<00:02, 1.18it/s]
Loading safetensors checkpoint shards: 78% Completed | 7/9 [00:05<00:01, 1.17it/s]
Loading safetensors checkpoint shards: 89% Completed | 8/9 [00:06<00:00, 1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:06<00:00, 1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:06<00:00, 1.29it/s]
INFO 03-27 07:48:41 model_runner.py:1099] Loading model weights took 38.5241 GB
INFO 03-27 07:49:07 worker.py:241] Memory profiling takes 25.79 seconds
INFO 03-27 07:49:07 worker.py:241] the current vLLM instance can use total_gpu_memory (79.26GiB) x gpu_memory_utilization (0.90) = 71.33GiB
INFO 03-27 07:49:07 worker.py:241] model weights take 38.52GiB; non_torch_memory takes 0.40GiB; PyTorch activation peak memory takes 9.25GiB; the rest of the memory reserved for KV Cache is 23.16GiB.
INFO 03-27 07:49:07 gpu_executor.py:76] # GPU blocks: 4743, # CPU blocks: 819
INFO 03-27 07:49:07 gpu_executor.py:80] Maximum concurrency for 32768 tokens per request: 2.32x
INFO 03-27 07:49:10 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████| 35/35 [00:25<00:00,  1.39it/s]
INFO 03-27 07:49:35 model_runner.py:1535] Graph capturing finished in 25 secs, took 0.51 GiB
INFO 03-27 07:49:35 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 54.02 seconds
2025-03-27:07:49:45,079 WARNING [task.py:325] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-03-27:07:49:45,080 WARNING [task.py:325] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-03-27:07:49:45,150 INFO [task.py:415] Building contexts for leaderboard_ifeval on rank 0...
100%|████████████████████| 541/541 [00:00<00:00, 80135.56it/s]
2025-03-27:07:49:45,223 INFO [evaluator.py:496] Running generate_until requests
Running generate_until requests: 100%|████████████████████| 541/541 [23:47<00:00,  2.64s/it]
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2025-03-27:08:13:47,094 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=./), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 16
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.8453 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.7854 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.7782 | ± | 0.0179 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.7006 | ± | 0.0197 |
This is close to our other runs on HF; see https://github.com/intel/auto-round/issues/476
If you can reproduce this result, could you kindly update the comments in your blog? If the accuracy still has issues, please let me know.
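And for anyone landing here from the title: a minimal sketch of running the checkpoint directly with vLLM (the prompt and sampling settings are just illustrative):

```python
# Minimal vLLM run of the GPTQ checkpoint; vLLM auto-detects the quantization
# and uses the gptq_marlin kernel, as in the log above.
from vllm import LLM, SamplingParams

llm = LLM(model="./", dtype="float16")  # "./" = the downloaded checkpoint dir
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```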