Gr00t Model - phospho Training Pipeline

Error Traceback

We encountered an error while training your model: the GPU ran out of memory (CUDA out of memory).

Training process failed with exit code 1:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.10/lib/python3.11/site-packages/transformers/activations.py", line 46, in forward
return nn.functional.gelu(input, approximate="tanh")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 102.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 98.12 MiB is free. Process 3696851 has 61.80 GiB memory in use. Process 4109944 has 17.24 GiB memory in use. Of the allocated memory 16.40 GiB is allocated by PyTorch, and 350.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

0%|          | 0/4402 [00:08<?, ?it/s]
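To see where the GPU memory went, the figures in the out-of-memory message can be tabulated. The following is a minimal sketch that parses the message text copied from the traceback above; the helper names (`to_gib`, `summarize_oom`) are hypothetical and not part of phospho or PyTorch. The numbers suggest, though the log does not confirm it, that most of the card was held by a second process (3696851, 61.80 GiB) rather than by this training run, whose PyTorch allocations total about 16.40 GiB.

```python
import re

# Numeric part of the OOM message, copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 102.00 MiB. "
    "GPU 0 has a total capacity of 79.15 GiB of which 98.12 MiB is free. "
    "Process 3696851 has 61.80 GiB memory in use. "
    "Process 4109944 has 17.24 GiB memory in use. "
    "Of the allocated memory 16.40 GiB is allocated by PyTorch, "
    "and 350.48 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value: str, unit: str) -> float:
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def summarize_oom(message: str) -> dict:
    """Extract capacity, free memory, request size, and per-process usage."""
    size = r"([\d.]+) (MiB|GiB)"
    total = re.search(rf"total capacity of {size}", message)
    free = re.search(rf"of which {size} is free", message)
    requested = re.search(rf"Tried to allocate {size}", message)
    processes = re.findall(rf"Process (\d+) has {size} memory in use", message)
    return {
        "total_gib": to_gib(*total.groups()),
        "free_gib": to_gib(*free.groups()),
        "requested_gib": to_gib(*requested.groups()),
        "per_process_gib": {pid: to_gib(v, u) for pid, v, u in processes},
    }


summary = summarize_oom(OOM_MESSAGE)
print(f"GPU capacity: {summary['total_gib']:.2f} GiB")
for pid, used in summary["per_process_gib"].items():
    share = 100 * used / summary["total_gib"]
    print(f"  process {pid}: {used:.2f} GiB ({share:.0f}% of the card)")
print(f"Free: {summary['free_gib'] * 1024:.2f} MiB, "
      f"requested: {summary['requested_gib'] * 1024:.2f} MiB")
```

If the larger process is unrelated to this job, freeing it or training on an idle GPU is likely a more direct remedy than tuning the allocator setting mentioned in the message.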

Training parameters:

Dataset: advpatel/foldshirt
Wandb run URL: None
Epochs: 1
Batch size: 16
Training steps: 4402
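The parameters above can be related to one another with a little arithmetic. This is a rough sketch under a loud assumption: that the step count equals ceil(samples / batch size) × epochs with no gradient accumulation, which this report does not confirm. The function names are hypothetical.

```python
import math

# Values taken from the training parameters listed above.
EPOCHS = 1
BATCH_SIZE = 16
TRAINING_STEPS = 4402


def sample_range(steps: int, batch_size: int, epochs: int) -> tuple[int, int]:
    """Range of dataset sizes consistent with the step count.

    ASSUMPTION: steps = ceil(samples / batch_size) * epochs.
    """
    steps_per_epoch = steps // epochs
    low = (steps_per_epoch - 1) * batch_size + 1   # last batch holds one sample
    high = steps_per_epoch * batch_size            # last batch is full
    return low, high


def steps_for(samples: int, batch_size: int, epochs: int) -> int:
    """Step count for a given dataset size under the same assumption."""
    return math.ceil(samples / batch_size) * epochs


low, high = sample_range(TRAINING_STEPS, BATCH_SIZE, EPOCHS)
print(f"Implied dataset size: between {low} and {high} samples")
print(f"Steps at batch size 8 (upper bound): {steps_for(high, 8, EPOCHS)}")
```

Halving the batch size would roughly double the number of steps under this assumption. Whether a smaller batch would avoid the failure is uncertain, since the error message reports a separate process holding 61.80 GiB of the GPU.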

📖 Get Started: docs.phospho.ai
🤖 Get your robot: robots.phospho.ai
🔗 Explore on Replicate: Replicate