Issue running v-alpha-tross after cache update
I am trying to run AMI ami-04dd1be93bedbc674 (us-west-2) on an inf2.48xlarge. No additional installations, using the system Python.
When running
from optimum.neuron import NeuronModelForCausalLM

compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "gradientai/v-alpha-tross",
    export=True,
    **compiler_args,
    **input_shapes)
model.save_pretrained("./compiled/alphatross")
model.push_to_hub(
    "gradientai/v-alpha-tross-neuron"
)
Here is the output of the above script:
config.json: 100%|██████████| 671/671 [00:00<00:00, 246kB/s]
model.safetensors.index.json: 100%|██████████| 59.6k/59.6k [00:00<00:00, 10.5MB/s]
model-00001-of-00029.safetensors: 100%|██████████| 4.72G/4.72G [00:20<00:00, 226MB/s]
model-00002-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 229MB/s]
model-00003-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [01:33<00:00, 53.5MB/s]
model-00004-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:20<00:00, 246MB/s]
model-00005-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:38<00:00, 47.5MB/s]
model-00006-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:44<00:00, 44.7MB/s]
model-00007-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 245MB/s]
model-00008-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 240MB/s]
model-00009-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [02:28<00:00, 33.5MB/s]
model-00010-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:17<00:00, 262MB/s]
model-00011-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:38<00:00, 47.3MB/s]
model-00012-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [01:32<00:00, 50.6MB/s]
model-00013-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:21<00:00, 231MB/s]
model-00014-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:18<00:00, 266MB/s]
model-00015-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 242MB/s]
model-00016-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 223MB/s]
model-00017-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 237MB/s]
model-00018-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 239MB/s]
model-00019-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [02:29<00:00, 33.3MB/s]
model-00020-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 235MB/s]
model-00021-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 230MB/s]
model-00022-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:19<00:00, 235MB/s]
model-00023-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:17<00:00, 278MB/s]
model-00024-of-00029.safetensors: 100%|██████████| 4.97G/4.97G [00:20<00:00, 241MB/s]
model-00025-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 226MB/s]
model-00026-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 233MB/s]
model-00027-of-00029.safetensors: 100%|██████████| 4.66G/4.66G [00:20<00:00, 230MB/s]
model-00028-of-00029.safetensors: 100%|██████████| 5.00G/5.00G [00:20<00:00, 240MB/s]
model-00029-of-00029.safetensors: 100%|██████████| 3.78G/3.78G [00:49<00:00, 76.0MB/s]
Downloading shards: 100%|██████████| 29/29 [21:03<00:00, 43.55s/it]
Loading checkpoint shards: 100%|██████████| 29/29 [00:19<00:00, 1.47it/s]
generation_config.json: 100%|██████████| 183/183 [00:00<00:00, 70.9kB/s]
Traceback (most recent call last):
File "test.py", line 5, in <module>
model = NeuronModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/optimum/modeling_base.py", line 372, in from_pretrained
return from_pretrained_method(
File "/usr/local/lib/python3.8/dist-packages/optimum/neuron/modeling_decoder.py", line 155, in _from_transformers
return cls._from_pretrained(checkpoint_dir, config)
File "/usr/local/lib/python3.8/dist-packages/optimum/neuron/modeling_decoder.py", line 226, in _from_pretrained
neuronx_model.to_neuron()
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/llama/model.py", line 124, in to_neuron
new_layer.to_neuron()
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py", line 643, in to_neuron
self.attn_k_weight = maybe_shard_along(self.attn_k_weight, dim=1)
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py", line 937, in shard_along
return self.manipulator.shard_along(tensor, dim)
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/parallel.py", line 115, in shard_along
return ops.parallel_to_nc(self.shard_along_on_cpu(tensor, dim))
File "/usr/local/lib/python3.8/dist-packages/transformers_neuronx/parallel.py", line 107, in shard_along_on_cpu
raise ValueError(
ValueError: Weight with shape torch.Size([8192, 1024]) cannot be sharded along dimension 1. This results in 25 weight partitions which cannot be distributed to 24 NeuronCores evenly. To fix this issue either the model parameters or the `tp_degree` must be changed to allow the weight to be evenly split
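If I read the error right, the k-projection is 1024 columns wide (8 KV heads x 128 head dim, matching the later warning about 8 KV heads), and that width has to split evenly across the tensor-parallel degree. A rough check of which num_cores values would shard it evenly (the 1024 width is just taken from the error message, and the candidate degrees are the core counts I'd expect to be usable on an inf2.48xlarge):

# Rough sketch: which tp_degree values divide the 1024-wide k projection evenly?
# kv_width comes from the error above; the candidate degrees are my assumption
# about what makes sense on a 24-core inf2.48xlarge.
kv_width = 1024
for tp_degree in (1, 2, 4, 8, 12, 24):
    print(tp_degree, "ok" if kv_width % tp_degree == 0 else "uneven")
# -> 1, 2, 4 and 8 divide evenly; 12 and 24 do not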
ubuntu@ip-172-31-27-18:~$ neuron-ls
instance-type: inf2.48xlarge
instance-id: i-05a4c0dfe889f6123
+--------+--------+--------+-----------+---------+
| NEURON | NEURON | NEURON | CONNECTED | PCI |
| DEVICE | CORES | MEMORY | DEVICES | BDF |
+--------+--------+--------+-----------+---------+
| 0 | 2 | 32 GB | 11, 1 | 80:1e.0 |
| 1 | 2 | 32 GB | 0, 2 | 90:1e.0 |
| 2 | 2 | 32 GB | 1, 3 | 80:1d.0 |
| 3 | 2 | 32 GB | 2, 4 | 90:1f.0 |
| 4 | 2 | 32 GB | 3, 5 | 80:1f.0 |
| 5 | 2 | 32 GB | 4, 6 | 90:1d.0 |
| 6 | 2 | 32 GB | 5, 7 | 20:1e.0 |
| 7 | 2 | 32 GB | 6, 8 | 20:1f.0 |
| 8 | 2 | 32 GB | 7, 9 | 10:1e.0 |
| 9 | 2 | 32 GB | 8, 10 | 10:1f.0 |
| 10 | 2 | 32 GB | 9, 11 | 10:1d.0 |
| 11 | 2 | 32 GB | 10, 0 | 20:1d.0 |
+--------+--------+--------+-----------+---------+
@michaelfeil I ran it with the latest Hugging Face DLAMI (012324, ami-029fcc46b49fda6c3, us-west-2); it compiled successfully and I was able to save it to disk. I had a problem pushing to the hub, but that could be a permissions/user error.
ubuntu@ip-172-31-15-168:~$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla 0.5.669
neuronx-cc 2.12.68.0+4480452af
neuronx-distributed 0.6.0
neuronx-hwm 2.12.0.0+422c9037c
optimum-neuron 0.0.17
tensorboard-plugin-neuronx 2.6.1.0
torch-neuronx 1.13.1.1.13.0
torch-xla 1.13.1+torchneurond
transformers-neuronx 0.9.474
ubuntu@ip-172-31-15-168:~$ python
Python 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from optimum.neuron import NeuronModelForCausalLM
>>>
>>> compiler_args = {"num_cores": 24, "auto_cast_type": 'bf16'}
>>> input_shapes = {"batch_size": 1, "sequence_length": 2048}
>>> model = NeuronModelForCausalLM.from_pretrained(
... "gradientai/v-alpha-tross",
... export=True,
... **compiler_args,
... **input_shapes)
Loading checkpoint shards: 100%|██████████| 29/29 [04:25<00:00, 9.16s/it]
/usr/local/lib/python3.8/dist-packages/transformers_neuronx/decoder.py:150: UserWarning: KV head replication will be enabled since the number of KV heads (8) is not evenly divisible by the tensor parallel degree (24)
warnings.warn(
2024-02-03 21:05:05.000415: 4706 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000576: 4707 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000748: 4708 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:05.000929: 4709 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000196: 4710 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000360: 4706 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_94afe10837f7a276ac9c+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000420: 4707 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_599e615b3867bff8ec4e+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000485: 4711 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000617: 4708 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_5b62a45b703e71e69832+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:06.000660: 4712 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000831: 4713 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:06.000995: 4714 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:07.000100: 4709 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_3421a2138da04a26b98d+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:07.000193: 4716 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-02-03 21:05:07.000391: 4711 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_e55471efb869a648d6d8+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:07.000539: 4710 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_d4ca6ba52e0580845391+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000013: 4713 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_70fdeda6c70fa31bb426+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000033: 4714 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_76d0c94cef61c7d3e233+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000059: 4716 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_9799940b47f7ad7e5e46+2c2d707e/model.neff. Exiting with a successfully compiled graph.
2024-02-03 21:05:08.000492: 4712 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.68.0+4480452af/MODULE_1b1581d51179610aca47+2c2d707e/model.neff. Exiting with a successfully compiled graph.
>>>
>>> model.save_pretrained("alphatross")
>>> model.push_to_hub("jburtoft/v-alpha-tross-neuron")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: push_to_hub() missing 1 required positional argument: 'repository_id'
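Looking at the TypeError, push_to_hub here seems to want the local save directory as the first argument and the target repo as repository_id, so presumably something like the following would work (I haven't re-verified it):

# Based on the TypeError above: the first positional argument is the local
# directory that was just saved, and the destination repo goes in repository_id.
model.push_to_hub("alphatross", repository_id="jburtoft/v-alpha-tross-neuron")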
Thanks for the help. I think an AMI mismatch caused this issue. I have now updated the AMI, thanks for the hint @jburtoft.
$ neuronx-cc --version
NeuronX Compiler version 2.12.68.0+4480452af
Python version 3.8.10
HWM version 2.12.0.0-422c9037c
NumPy version 1.24.4
Running on AMI ami-029fcc46b49fda6c3
Running in region usw2-az1
Also, I reserved a 500 GB disk; the downloaded weights are around 140 GB, and the exported Neuron weights take another 286 GB.
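So the export needs roughly 140 + 286 ≈ 430 GB of free space in total. A quick way to sanity-check the disk before exporting (just a generic sketch; the size figures are from my run above):

import shutil

# ~140 GB of downloaded safetensors plus ~286 GB of exported Neuron artifacts (from my run)
required_gb = 140 + 286
free_gb = shutil.disk_usage("/").free / 1e9
print(f"free: {free_gb:.0f} GB, needed: ~{required_gb} GB")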
After giving it another try, the model compiled on the correct AMI. Thanks to you both for the help!