how to run this model
Hi,
Have you changed the packing or something else, or does it depend on a specific version of some library? I could not run it with Transformers:
python3.10/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 162, in _forward
out = torch.matmul(x, weights).reshape(out_shape)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
even after uninstalling gptqmodel:
Loading checkpoint shards: 100%|████████████████████| 9/9 [00:09<00:00,  1.02s/it]
Segmentation fault (core dumped)
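For reference, this is roughly how I'm loading it with Transformers (the repo id, prompt, and generation settings below are placeholders for my setup, not my exact script):

```python
# Rough sketch of the Transformers loading path that triggers the error above.
# The repo id and prompt are assumptions, not the exact values I used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # GPTQ checkpoints are typically fp16
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```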
Last time I ran it, last month, it was working fine with Transformers and vLLM.
I always use the most recent version of all the frameworks, so if you downgrade Transformers, auto-gptq, and optimum (and maybe PyTorch) to the versions that were current one month ago, it should work.
I'm not sure what causes your CUDA error or why it goes through gptqmodel.
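If it helps, here is a quick way to dump the relevant package versions so we can compare environments (standard library only; the package list is my guess at what matters here):

```python
# Print the installed versions of the packages that usually matter for GPTQ
# checkpoints, to compare a broken environment against a working one.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("torch", "transformers", "optimum", "auto-gptq", "gptqmodel", "vllm"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```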
Thanks! After uninstalling and reinstalling some libraries, I finally got it working. With a limited test of 10 samples, the accuracy looks good. I'll test all the samples and give an update.
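For the limited test I just ran a subset through the lm-eval Python API, along these lines, with `limit` being the relevant argument (a sketch, not my exact invocation):

```python
# Quick accuracy spot-check on a small subset of leaderboard_ifeval via the
# lm-eval Python API; limit=10 restricts the run to 10 samples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=./",   # local checkpoint directory
    tasks=["leaderboard_ifeval"],
    batch_size=16,
    limit=10,
)
print(results["results"]["leaderboard_ifeval"])
```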
The results look correct. Versions used:
transformers 4.50.2
vllm 0.6.6.post1
(autoround) wenhuach@mlp-dgx-01:/data5/wenhuach/kaitchup-Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit$ CUDA_VISIBLE_DEVICES=4 lm-eval --model vllm --model_args pretrained=./ --tasks leaderboard_ifeval --batch_size 16
2025-03-27:07:48:15,402 INFO [main.py:279] Verbosity set to INFO
2025-03-27:07:48:24,060 INFO [main.py:376] Selected Tasks: ['leaderboard_ifeval']
2025-03-27:07:48:24,065 INFO [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-27:07:48:24,065 INFO [evaluator.py:201] Initializing vllm model, with arguments: {'pretrained': './'}
INFO 03-27 07:48:31 config.py:510] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
INFO 03-27 07:48:32 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 03-27 07:48:32 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='./', speculative_config=None, tokenizer='./', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=./, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 03-27 07:48:33 selector.py:120] Using Flash Attention backend.
INFO 03-27 07:48:33 model_runner.py:1094] Starting to load model ./...
INFO 03-27 07:48:33 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards: 0% Completed | 0/9 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 11% Completed | 1/9 [00:00<00:05, 1.39it/s]
Loading safetensors checkpoint shards: 22% Completed | 2/9 [00:01<00:05, 1.26it/s]
Loading safetensors checkpoint shards: 33% Completed | 3/9 [00:02<00:04, 1.23it/s]
Loading safetensors checkpoint shards: 44% Completed | 4/9 [00:03<00:04, 1.21it/s]
Loading safetensors checkpoint shards: 56% Completed | 5/9 [00:04<00:03, 1.18it/s]
Loading safetensors checkpoint shards: 67% Completed | 6/9 [00:04<00:02, 1.18it/s]
Loading safetensors checkpoint shards: 78% Completed | 7/9 [00:05<00:01, 1.17it/s]
Loading safetensors checkpoint shards: 89% Completed | 8/9 [00:06<00:00, 1.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:06<00:00, 1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:06<00:00, 1.29it/s]
INFO 03-27 07:48:41 model_runner.py:1099] Loading model weights took 38.5241 GB
INFO 03-27 07:49:07 worker.py:241] Memory profiling takes 25.79 seconds
INFO 03-27 07:49:07 worker.py:241] the current vLLM instance can use total_gpu_memory (79.26GiB) x gpu_memory_utilization (0.90) = 71.33GiB
INFO 03-27 07:49:07 worker.py:241] model weights take 38.52GiB; non_torch_memory takes 0.40GiB; PyTorch activation peak memory takes 9.25GiB; the rest of the memory reserved for KV Cache is 23.16GiB.
INFO 03-27 07:49:07 gpu_executor.py:76] # GPU blocks: 4743, # CPU blocks: 819
INFO 03-27 07:49:07 gpu_executor.py:80] Maximum concurrency for 32768 tokens per request: 2.32x
INFO 03-27 07:49:10 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|████████████████████| 35/35 [00:25<00:00,  1.39it/s]
INFO 03-27 07:49:35 model_runner.py:1535] Graph capturing finished in 25 secs, took 0.51 GiB
INFO 03-27 07:49:35 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 54.02 seconds
2025-03-27:07:49:45,079 WARNING [task.py:325] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-03-27:07:49:45,080 WARNING [task.py:325] [Task: leaderboard_ifeval] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2025-03-27:07:49:45,150 INFO [task.py:415] Building contexts for leaderboard_ifeval on rank 0...
100%|████████████████████| 541/541 [00:00<00:00, 80135.56it/s]
2025-03-27:07:49:45,223 INFO [evaluator.py:496] Running generate_until requests
Running generate_until requests: 100%|████████████████████| 541/541 [23:47<00:00,  2.64s/it]
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2025-03-27:08:13:47,094 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=./), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 16
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---:|---|---:|---|---|---:|---|---:|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | ↑ | 0.8453 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.7854 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.7782 | ± | 0.0179 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.7006 | ± | 0.0197 |
This is close to our other runs on HF; see https://github.com/intel/auto-round/issues/476
If you can reproduce this result, could you kindly update the comments in your blog? If the accuracy still has issues, please let me know.
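And for anyone landing here from the title: a minimal sketch of running the checkpoint directly with vLLM (the prompt and sampling settings are just illustrative):

```python
# Minimal vLLM run of the GPTQ checkpoint; vLLM auto-detects the quantization
# and uses the gptq_marlin kernel, as in the log above.
from vllm import LLM, SamplingParams

llm = LLM(model="./", dtype="float16")  # "./" = the downloaded checkpoint dir
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is the capital of France?"], params)
print(outputs[0].outputs[0].text)
```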