CUDA error: device-side assert triggered
#2, opened by raptorkwok
Hi Ayaka,
I encountered the following error when training with this model:
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [573,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
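The assertion `srcIndex < srcSelectDimSize` comes from an index-select kernel, which is what an embedding lookup uses, so one common cause (not confirmed for this case) is a token ID or position that falls outside the embedding table. Below is a minimal Python sketch for narrowing that down. The helper names `find_out_of_range_ids` and `find_overlong_sequences` and the toy data are hypothetical; with the real objects, the limits to compare against would presumably be `model.get_input_embeddings().num_embeddings` and `model.config.max_position_embeddings`.

```python
import os

# As the error message suggests: make kernel launches synchronous so the
# stack trace points at the failing call. This has to be set before the
# first CUDA call in the process.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def find_out_of_range_ids(sequences, num_embeddings, ignore_index=None):
    """Return (row, position, token_id) for every id outside [0, num_embeddings).

    Pass ignore_index=-100 when checking label sequences, because -100 is
    the usual padding value for labels and is not looked up directly.
    """
    bad = []
    for row, seq in enumerate(sequences):
        for pos, token_id in enumerate(seq):
            if ignore_index is not None and token_id == ignore_index:
                continue
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((row, pos, token_id))
    return bad


def find_overlong_sequences(sequences, max_positions):
    """Return (row, length) for every sequence longer than max_positions."""
    return [(row, len(seq)) for row, seq in enumerate(sequences)
            if len(seq) > max_positions]


# Toy data (hypothetical): an embedding table with 10 rows and 4 positions.
toy_input_ids = [[1, 5, 7, 2], [1, 12, 9, 2]]
toy_labels = [[3, 4, -100, -100], [3, 4, 5, 6, 7]]

print(find_out_of_range_ids(toy_input_ids, num_embeddings=10))
print(find_out_of_range_ids(toy_labels, num_embeddings=10, ignore_index=-100))
print(find_overlong_sequences(toy_labels, max_positions=4))
```

If either helper reports anything for the tokenized dataset, the tokenizer vocabulary and the checkpoint's embedding size (or the truncation length used during tokenization) are worth comparing. Running a single batch on the CPU is another way to get a readable `IndexError` instead of the asynchronous CUDA assert.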
The code is as follows:
from transformers import (
    BartForConditionalGeneration,
    BertTokenizer,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
)

# CustomSeq2SeqTrainingArguments, output_model, tokenized_yuezh_master,
# data_collator and compute_metrics are defined elsewhere in the script.

checkpoint = 'Ayaka/bart-base-cantonese'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint, output_attentions = True, output_hidden_states = True)
batch_size = 4

training_args = CustomSeq2SeqTrainingArguments(
    output_dir = output_model,
    evaluation_strategy = "epoch",
    optim = "adamw_torch",
    eval_steps = 5000, # Previously: 1000
    #logging_steps = 1000,
    #save_steps = 5000,
    save_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    weight_decay = 0.01,
    save_total_limit = 1,
    num_train_epochs = 30,
    predict_with_generate = True,
    remove_unused_columns = True,
    fp16 = True,
    push_to_hub = True,
    metric_for_best_model = "bleu",
    load_best_model_at_end = True,
    report_to = "wandb"
)

trainer = Seq2SeqTrainer(
    model = model,
    #model_init = model_init,
    args = training_args,
    train_dataset = tokenized_yuezh_master['train'],
    eval_dataset = tokenized_yuezh_master['val'],
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
)
Using:
- NVIDIA-SMI 535.154.05
- Driver Version: 535.154.05
- CUDA Version: 12.2
- NVIDIA GeForce RTX 3080 Ti (12 GB VRAM)
- Ubuntu 22.04.2 LTS