Issue running v-alpha-tross after cache update
I am trying to run AMI ami-04dd1be93bedbc674 (us-west-2) on an inf2.48xlarge. No additional installations, using the system Python.
When running
from optimum.neuron import NeuronModelForCausalLM

compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "gradientai/v-alpha-tross",
    export=True,
    **compiler_args,
    **input_shapes)
model.save_pretrained("./compiled/alphatross")
model.push_to_hub(
    "gradientai/v-alpha-tross-neuron"
)
Here is the output of the above script:
config.json: 100%|██████████| 671/671 [00:00<00:00, 246kB/s]
model.safetensors.index.json: 100%|██████████| 59.6k/59.6k [00:00<00:00, 10.5MB/s]
model-00001-of-00029.safetensors: 100%|██████████| 4.72G/4.72G [00:20<00:00, 226MB/s]
model-00002-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 229MB/s]
model-00003-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [01:33<00:00, 53.5MB/s]
model-00004-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:20<00:00, 246MB/s]
model-00005-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:38<00:00, 47.5MB/s]
model-00006-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:44<00:00, 44.7MB/s]
model-00007-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 245MB/s]
model-00008-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 240MB/s]
model-00009-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [02:28<00:00, 33.5MB/s]
model-00010-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:17<00:00, 262MB/s]
model-00011-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:38<00:00, 47.3MB/s]
model-00012-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:32<00:00, 50.6MB/s]
model-00013-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:21<00:00, 231MB/s]
model-00014-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:18<00:00, 266MB/s]
model-00015-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 242MB/s]
model-00016-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 223MB/s]
model-00017-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 237MB/s]
model-00018-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 239MB/s]
model-00019-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [02:29<00:00, 33.3MB/s]
model-00020-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 235MB/s]
model-00021-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 230MB/s]
model-00022-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 235MB/s]
model-00023-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:17<00:00, 278MB/s]
model-00024-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:20<00:00, 241MB/s]
model-00025-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 226MB/s]
model-00026-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 233MB/s]
model-00027-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 230MB/s]
model-00028-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 240MB/s]
model-00029-of-00029.safetensors: 100%|██████████| 3.78G/3.78G [00:49<00:00, 76.0MB/s]
Downloading shards: 100%|██████████| 29/29 [21:03<00:00, 43.55s/it]
Loading checkpoint shards: 100%|██████████| 29/29 [00:19<00:00, 1.47it/s]
generation_config.json: 100%|██████████| 183/183 [00:00<00:00, 70.9kB/s]
Traceback (most recent call last):
File "test.py", line 5, in <module>
model = NeuronModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/optimum/modeling_base.py", line 372, in from_pretrained
return from_pretrained_method(
File "/usr/local/lib/python3.8/dist-packages/optimum/neuron/modeling_decoder.py", line 155, in _from_transformers
return cls._from_pretrained(checkpoint_dir, config)
File "/usr/local/lib/python3.8/dist-packages/optimum/neuron/modeling_decoder.py", line 226, in _from_pretrained
neuronx_model.to_neuron()
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/llama/model.py", line 124, in to_neuron
new_layer.to_neuron()
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py", line 643, in to_neuron
self.attn_k_weight = maybe_shard_along(self.attn_k_weight, dim=1)
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py", line 937, in shard_along
return self.manipulator.shard_along(tensor, dim)
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/parallel.py", line 115, in shard_along
return ops.parallel_to_nc(self.shard_along_on_cpu(tensor, dim))
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/parallel.py", line 107, in shard_along_on_cpu
raise ValueError(
ValueError: Weight with shape torch.Size([8192, 1024]) cannot be sharded along dimension 1. This results in 25 weight partitions which cannot be distributed to 24 NeuronCores evenly. To fix this issue either the model parameters or the `tp_degree` must be changed to allow the weight to be evenly split
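If I read the error right, the k-projection is 1024 columns wide (8 KV heads x 128 head dim, matching the later warning about 8 KV heads), and that width has to split evenly across the tensor-parallel degree. A rough check of which num_cores values would shard it evenly (the 1024 width is just taken from the error message, and the candidate degrees are the core counts I'd expect to be usable on an inf2.48xlarge):

# Rough sketch: which tp_degree values divide the 1024-wide k projection evenly?
# kv_width comes from the error above; the candidate degrees are my assumption
# about what makes sense on a 24-core inf2.48xlarge.
kv_width = 1024
for tp_degree in (1, 2, 4, 8, 12, 24):
    print(tp_degree, "ok" if kv_width % tp_degree == 0 else "uneven")
# -> 1, 2, 4 and 8 divide evenly; 12 and 24 do not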
ubuntu@ip-172-31-27-18:~$ neuron-ls
instance-type: inf2.48xlarge
instance-id: i-05a4c0dfe889f6123
+--------+--------+--------+-----------+---------+
| NEURON | NEURON | NEURON | CONNECTED | PCI |
| DEVICE | CORES | MEMORY | DEVICES | BDF |
+--------+--------+--------+-----------+---------+
| 0 | 2 | 32 GB | 11, 1 | 80:1e.0 |
| 1 | 2 | 32 GB | 0, 2 | 90:1e.0 |
| 2 | 2 | 32 GB | 1, 3 | 80:1d.0 |
| 3 | 2 | 32 GB | 2, 4 | 90:1f.0 |
| 4 | 2 | 32 GB | 3, 5 | 80:1f.0 |
| 5 | 2 | 32 GB | 4, 6 | 90:1d.0 |
| 6 | 2 | 32 GB | 5, 7 | 20:1e.0 |
| 7 | 2 | 32 GB | 6, 8 | 20:1f.0 |
| 8 | 2 | 32 GB | 7, 9 | 10:1e.0 |
| 9 | 2 | 32 GB | 8, 10 | 10:1f.0 |
| 10 | 2 | 32 GB | 9, 11 | 10:1d.0 |
| 11 | 2 | 32 GB | 10, 0 | 20:1d.0 |
+--------+--------+--------+-----------+---------+
@michaelfeil I ran it with the latest Hugging Face DLAMI (012324, ami-029fcc46b49fda6c3, us-west-2); it compiled successfully and I was able to save it to disk. I had a problem pushing to the hub, but that could be a permissions/user error.
ubuntu@ip-172-31-15-168:~$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla 0.5.669
neuronx-cc 2.12.68.0+4480452af
neuronx-distributed 0.6.0
neuronx-hwm 2.12.0.0+422c9037c
optimum-neuron 0.0.17
tensorboard-plugin-neuronx 2.6.1.0
torch-neuronx 1.13.1.1.13.0
torch-xla 1.13.1+torchneurond
transformers-neuronx 0.9.474
ubuntu@ip-172-31-15-168:~$ python
Python 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from optimum.neuron import NeuronModelForCausalLM
>>>
>>> compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
>>> input_shapes = {"batch_size": 1, "sequence_length": 2048}
>>> model = NeuronModelForCausalLM.from_pretrained(
... "gradientai/v-alpha-tross",
... export=True,
... **compiler_args,
... **input_shapes)
Loading checkpoint shards: 100%|██████████| 29/29 [04:25<00:00, 9.16s/it]
/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py:150: UserWarning: KV head replication will be enabled since the number of KV heads (8) is not evenly divisible by the tensor parallel degree (24)
warnings.warn(
2024-02-03 21:05:05.000415: 4706 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000576: 4707 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000748: 4708 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000929: 4709 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000196: 4710 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000360: 4706 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_94afe10837f7a276ac9c+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000420: 4707 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_599e615b3867bff8ec4e+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000485: 4711 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000617: 4708 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_5b62a45b703e71e69832+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000660: 4712 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000831: 4713 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000995: 4714 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:07.000100: 4709 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_3421a2138da04a26b98d+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:07.000193: 4716 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:07.000391: 4711 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_e55471efb869a648d6d8+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:07.000539: 4710 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_d4ca6ba52e0580845391+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000013: 4713 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_70fdeda6c70fa31bb426+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000033: 4714 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_76d0c94cef61c7d3e233+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000059: 4716 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_9799940b47f7ad7e5e46+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000492: 4712 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_1b1581d51179610aca47+2c2d707e/model.neff. Exiting with a successfully compiled graph.
>>>
>>> model.save_pretrained("alphatross")
>>> model.push_to_hub("jburtoft/v-alpha-tross-neuron")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: push_to_hub() missing 1 required positional argument: 'repository_id'
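Looking at the TypeError, push_to_hub here seems to want the local save directory as the first argument and the target repo as repository_id, so presumably something like the following would work (I haven't re-verified it):

# Based on the TypeError above: the first positional argument is the local
# directory that was just saved, and the destination repo goes in repository_id.
model.push_to_hub("alphatross", repository_id="jburtoft/v-alpha-tross-neuron")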
Thanks for the help. I think an AMI mismatch caused this issue. I have now updated the AMI, thanks for the hint @jburtoft.
$ neuronx-cc --version
NeuronX Compiler version 2.12.68.0+4480452af
Python version 3.8.10
HWM version 2.12.0.0-422c9037c
NumPy version 1.24.4
Running on AMI ami-029fcc46b49fda6c3
Running in region usw2-az1
Also, I reserved a 500 GB disk; the downloaded weights are around 140 GB, and the exported Neuron weights take another 286 GB.
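So the export needs roughly 140 + 286 ≈ 430 GB of free space in total. A quick way to sanity-check the disk before exporting (just a generic sketch; the size figures are from my run above):

import shutil

# ~140 GB of downloaded safetensors plus ~286 GB of exported Neuron artifacts (from my run)
required_gb = 140 + 286
free_gb = shutil.disk_usage("/").free / 1e9
print(f"free: {free_gb:.0f} GB, needed: ~{required_gb} GB")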
After giving it another try, the model compiled on the correct AMI. Thanks to you both for the help!