zai-org/GLM-4.5V is not working in SGLang, please help. I have 8x H100 (80 GB).
(sgl) ubuntu@dahwin-inst-30cm8i60ygbxvjasij6exenmthf:~/code$ python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
    --tp-size 4 \
    --attention-backend fa3 \
    --mm-attention-backend fa3 \
    --enable-torch-compile \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --served-model-name glm-4.5v \
    --port 8000 \
    --host 0.0.0.0
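For context, once the server comes up I plan to hit its OpenAI-compatible chat endpoint, roughly like this (untested sketch; the model name and port just mirror the flags above, and the image URL is a placeholder):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.5v",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'

It never gets that far. Here is the full server output: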
W0812 19:52:35.974000 809680 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:35.974000 809680 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-12 19:52:36] server_args=ServerArgs(model_path='zai-org/GLM-4.5V', tokenizer_path='zai-org/GLM-4.5V', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=8000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.856, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=40201957, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='glm-4.5v', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser='glm45', tool_call_parser='glm45', dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend=None, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, 
disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=True, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, tool_server=None, enable_ep_moe=False, enable_deepep_moe=False)
[2025-08-12 19:52:37] Using default HuggingFace chat template with detected content format: string
W0812 19:52:43.185000 809933 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:43.185000 809933 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0812 19:52:43.320000 809929 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:43.320000 809929 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0812 19:52:43.397000 809931 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:43.397000 809931 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0812 19:52:43.432000 809932 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:43.432000 809932 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0812 19:52:43.475000 809930 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0812 19:52:43.475000 809930 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
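(Aside: the repeated warning above asks for TORCH_CUDA_ARCH_LIST to be pinned. H100s are compute capability 9.0, so I assume the intended fix is the line below, though as far as I can tell it only silences the compile warning and is unrelated to the crash further down.)

export TORCH_CUDA_ARCH_LIST="9.0"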
[2025-08-12 19:52:44 TP0] Init torch distributed begin.
[W812 19:52:46.995902177 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W812 19:52:46.995933723 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W812 19:52:46.995937919 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W812 19:52:46.995952010 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-08-12 19:52:46 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-08-12 19:52:47 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-08-12 19:52:47 TP0] Init torch distributed ends. mem usage=2.01 GB
[2025-08-12 19:52:47 TP3] Glm4vMoeForConditionalGeneration has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-12 19:52:47 TP1] Glm4vMoeForConditionalGeneration has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-12 19:52:47 TP2] Glm4vMoeForConditionalGeneration has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-12 19:52:47 TP0] Glm4vMoeForConditionalGeneration has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
[2025-08-12 19:52:47 TP0] Load weight begin. avail mem=76.66 GB
[2025-08-12 19:52:47 TP3] Using Transformers backend.
[2025-08-12 19:52:47 TP2] Using Transformers backend.
[2025-08-12 19:52:47 TP1] Using Transformers backend.
[2025-08-12 19:52:47 TP0] Using Transformers backend.
[2025-08-12 19:52:50 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/scheduler.py", line 2548, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/managers/scheduler.py", line 314, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in __init__
    self.worker = TpModelWorker(
                  ^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/managers/tp_worker.py", line 84, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 242, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 285, in initialize
    self.load_model()
  File "/home/ubuntu/sglang/python/sglang/srt/model_executor/model_runner.py", line 643, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/model_loader/loader.py", line 432, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/model_loader/loader.py", line 174, in _initialize_model
    return model_class(
           ^^^^^^^^^^^^
  File "/home/ubuntu/sglang/python/sglang/srt/models/transformers.py", line 158, in __init__
    self.model: PreTrainedModel = AutoModel.from_config(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 456, in from_config
    return model_class._from_config(config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/modeling_utils.py", line 317, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2382, in _from_config
    model = cls(config, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 1018, in __init__
    self.language_model = Glm4vMoeTextModel._from_config(config.text_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/modeling_utils.py", line 317, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2382, in _from_config
    model = cls(config, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 930, in __init__
    [Glm4vMoeTextDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 930, in <listcomp>
    [Glm4vMoeTextDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 393, in __init__
    self.mlp = Glm4vMoeTextMoE(config)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 301, in __init__
    [
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 302, in <listcomp>
    Glm4vMoeTextMLP(config, intermediate_size=config.moe_intermediate_size)
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 355, in __init__
    self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 106, in __init__
    torch.empty((out_features, in_features), **factory_kwargs)
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 1 has a total capacity of 79.19 GiB of which 3.00 MiB is free. Including non-PyTorch memory, this process has 79.18 GiB memory in use. Of the allocated memory 76.43 GiB is allocated by PyTorch, and 150.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-08-12 19:52:50] Received sigquit from a child process. It usually means the child failed.
[2025-08-12 19:52:50 TP3] Scheduler hit an exception: identical traceback to TP1 above; only the final frame and the numbers differ:
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 356, in __init__
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 3 has a total capacity of 79.19 GiB of which 3.00 MiB is free. Including non-PyTorch memory, this process has 79.18 GiB memory in use. Of the allocated memory 76.90 GiB is allocated by PyTorch, and 150.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-08-12 19:52:50] Received sigquit from a child process. It usually means the child failed.
[2025-08-12 19:52:50 TP0] Scheduler hit an exception: again the same traceback as TP1; only the final frame and the numbers differ:
  File "/home/ubuntu/sgl/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py", line 354, in __init__
    self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 3.00 MiB is free. Including non-PyTorch memory, this process has 79.18 GiB memory in use. Of the allocated memory 76.53 GiB is allocated by PyTorch, and 150.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2025-08-12 19:52:50] Received sigquit from a child process. It usually means the child failed.
Killed
(sgl) ubuntu@dahwin-inst-30cm8i60ygbxvjasij6exenmthf:~/code$
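The OOM message itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but with only 3.00 MiB free and ~76 GiB already allocated per GPU while the model is still being constructed, this looks less like fragmentation and more like each TP rank materializing too much of the model under the Transformers fallback. The only retry I can think of (purely a guess, not verified; the flags are the same ones from my command and the server_args dump above, with --tp-size raised to use all 8 GPUs and --mem-fraction-static lowered from its 0.856 default) is:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 -m sglang.launch_server --model-path zai-org/GLM-4.5V \
    --tp-size 8 \
    --mem-fraction-static 0.8 \
    --attention-backend fa3 \
    --mm-attention-backend fa3 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --served-model-name glm-4.5v \
    --port 8000 \
    --host 0.0.0.0

Is that the right direction, or is GLM-4.5V simply not supported by SGLang yet, given the "Glm4vMoeForConditionalGeneration has no SGLang implementation" fallback warning?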