Some instructions regarding fine-tuning & classification with the ESM++ model
Dear contributors and developers, @lhallee
Thank you for the important and helpful work you are doing by making protein LLMs more accessible to the community!
I am trying to follow the code in modeling_esm_plusplus.py in order to perform fine-tuning and downstream protein-level classification for my specific use case.
I am using the ESMplusplus_600M() function for embeddings (I also tried AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large')) and ESMplusplusForSequenceClassification.from_pretrained_esm("600") for the classification head, with LoRA applied in between.
I have already tried several ways to correctly embed my dataset (e.g. model.embed_dataset()), or to just pass the sequences + labels (input_ids, ...) via a Dataset object directly to the Trainer together with a tokenizer object.
But nothing seems to work: training either does not start at all, or crashes after one epoch due to inconsistencies in shapes/dimensions and batches between the model inputs and outputs.
I would be very grateful for any help/advice/guidance; of course I can provide the specific code I used or the errors.
Many thanks,
Dani
No problem. Probably GitHub is better.
Sure, thank you!
Here is the link to the notebook:
https://github.com/VadimDu/Protein_LLM_modeling/blob/main/clean_ver_Modeling_ESM_plusplus.ipynb
I basically copied all the code from modeling_esm_plusplus.py into it, and added on top of it my data and the steps toward fine-tuning the classification model.
The part I added starts from the cell named "My protein input data".
In the current trial I commented out data_collator and tokenizer from the Trainer, and used the defaults together with the tokenizer implemented in the ESMplusplus_600M() function (class ESMplusplusForMaskedLM(), self.tokenizer = EsmSequenceTokenizer()).
Any help will be much appreciated!
Dani
Hi @lhallee ,
Thanks again for your reply. Could you please clarify regarding AutoModelForSequenceClassification? I could not find such a class/method in the code of yours that I am using.
Regarding the errors:
- If I run the preprocessing and fine-tuning steps exactly as in the notebook (model_embedding = ESMplusplus_600M(num_labels=3), .embed_dataset(), model_classification = ESMplusplusForSequenceClassification.from_pretrained_esm("600"), and no explicit custom data_collator or tokenizer given to Trainer()), this is the error:
ValueError Traceback (most recent call last)
<ipython-input-31-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
13 frames
<ipython-input-2-bf6a22e3339a> in forward(self, x, attention_mask, output_hidden_states, output_attentions)
441 TransformerOutput containing last hidden state and optionally all hidden states and attention weights
442 """
--> 443 batch_size, seq_len, _ = x.shape
444 hidden_states = () if output_hidden_states else None
445 attentions = () if output_attentions else None
ValueError: not enough values to unpack (expected 3, got 2)
- If I add a class CustomDataCollator to define a data_collator that converts my input_embeds from a 2-dimensional tensor to shape torch.Size([num_of_sequences, 1, 1152]), then one training epoch finishes OK and it then crashes at the start of epoch 2:
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[ 4/20 00:00 < 00:03, 4.01 it/s, Epoch 1/10]
Epoch Training Loss Validation Loss
[2/2 00:00]
Downloading builder script: 100% 4.20k/4.20k [00:00<00:00, 506kB/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 3 dimensions. The detected shape was (2, 20, 1) + inhomogeneous part.
I used 20 sequences just as an example for training.
- If I use the commented-out cell (#@title ESM++ for protein embeddings using a pre-trained model from Synthyra) for the sequence Dataset creation and tokenizer with AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large'), and supply Trainer() with a tokenizer, I get this error:
ValueError Traceback (most recent call last)
<ipython-input-28-5075ee0329cb> in <cell line: 0>()
11
12 # Train the model
---> 13 trainer.train()
14 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3477 if size_average is not None or reduce is not None:
3478 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3479 return torch._C._nn.cross_entropy_loss(
3480 input,
3481 target,
ValueError: Expected input batch_size (2200) to match target batch_size (4).
I hope this information is helpful; many thanks again for your efforts.
Dani
Gotcha. So, a few things. If you want to finetune a model for sequence classification you do not need to pre-embed the sequences; you just need to feed the input_ids and attention_mask via the data collator. You can load the model without copying the implementation anywhere by doing this:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)
From here you can apply LoRA if you'd like.
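For example, a minimal sketch with the peft library (the rank, alpha, and target module names below are illustrative placeholders, not values specific to ESM++; inspect model.named_modules() to pick the projections you want to adapt):

from peft import LoraConfig, TaskType, get_peft_model

# Sketch only: r, lora_alpha, and target_modules are placeholder values
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type=TaskType.SEQ_CLS,     # sequence classification task
    target_modules=["out_proj"],    # placeholder; match your model's module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # only the adapters + classification head should be trainable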
If you want to just train a model on the vector embeddings of the model, you can embed them like you had and train a small neural network.
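As a rough sketch of that second route (assuming you already have a tensor of per-protein embeddings, e.g. from embed_dataset, and integer class labels; the shapes and hyperparameters below are made up for illustration):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data -- in practice these would be your precomputed embeddings and labels
embeddings = torch.randn(100, 1152)   # (num_proteins, hidden_size)
labels = torch.randint(0, 3, (100,))  # e.g. 3 classes

# Small classification head trained on top of the frozen embeddings
head = nn.Sequential(
    nn.Linear(embeddings.shape[1], 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 3),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(TensorDataset(embeddings, labels), batch_size=32, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(head(x), y)
        loss.backward()
        optimizer.step()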
Does that make sense?
Here's an example of a collator we use for input_ids and labels. Trainer automatically unpacks a dictionary sent to the model, so everything in "batch" here will go to the right place
import torch

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          pad_to_multiple_of=8,
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
tokenizer = model.tokenizer
data_collator = string_labels_collator_builder(tokenizer)
This expects a PyTorch dataset class that will output a tuple of sequences and the labels you are interested in. A class that might link up with your current workflow looks something like this
from torch.utils.data import Dataset as TorchDataset

class StringLabelDatasetFromHF(TorchDataset):
    def __init__(self, hf_dataset, col_name='seqs', label_col='labels', **kwargs):
        self.seqs = hf_dataset[col_name]
        self.labels = hf_dataset[label_col]
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
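From there you can hand the dataset objects and the collator to the Trainer, roughly like this (train_ds / valid_ds and the TrainingArguments values are placeholders, not a recommendation):

from transformers import Trainer, TrainingArguments

# Sketch only: dataset variables and arguments are placeholders
train_dataset = StringLabelDatasetFromHF(train_ds)
eval_dataset = StringLabelDatasetFromHF(valid_ds)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./results', per_device_train_batch_size=4),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,  # the string/label collator built above
)
trainer.train()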
Does this help? If you try something new and get a new error please send along.
Hi @lhallee ,
Many thanks again for your help!
I have implemented the collator and PyTorch dataset as you suggested and used AutoModelForSequenceClassification, but unfortunately, after the 1st training epoch finished, it crashed with a similar error to the one I had before.
Below I will paste all the relevant code from the start up to the training; maybe you can spot some inconsistency there:
from torch.utils.data import Dataset as TorchDataset
from transformers import AutoModelForSequenceClassification, AutoConfig
config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer
# Move models to GPU and keep them in float32
model_classification = model_classification.to(device) # Remove .half()
def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
class StringLabelDatasetFromHF(TorchDataset):
    '''Uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy()    # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy() # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
torchdataset_my_train = StringLabelDatasetFromHF(my_train)
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid)
torchdataset_my_test = StringLabelDatasetFromHF(my_test)
data_collator = string_labels_collator_builder(tokenizer)
# LoRA fine-tuning
# Define the regex pattern to match desired layers (excluding LayerNorm - ffn.0)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"
target_modules = [
    name
    for name, module in model_classification.named_modules()  # iterate through all modules and their names
    if re.fullmatch(pattern, name)
]
print(f'Target modules for LORA: {target_modules}')

lora_config = LoraConfig(
    r=4,                           # Rank of the LoRA update matrices
    lora_alpha=32,                 # Scaling factor for the LoRA update matrices
    lora_dropout=0.05,             # Dropout probability for the LoRA update matrices
    bias="none",                   # Whether to apply bias to the LoRA update matrices
    task_type=TaskType.SEQ_CLS,    # Task type for sequence classification
    target_modules=target_modules, # Modules which the LoRA method should target and modify
)

model = get_peft_model(model_classification, lora_config)
# Print the number of trainable parameters in the LoRA-adapted model
model.print_trainable_parameters()
# Define Huggingface Trainer arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    #deepspeed=ds_config if deepspeed else None,
    fp16=False,
    gradient_checkpointing=False,
)
# Metric definition for validation data
def compute_metrics(eval_pred, num_labels=3):
    if num_labels > 1:  # for classification
        metric = load("accuracy")
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
    else:  # for regression
        metric = load("spearmanr")
        predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
This is the error I got:
ValueError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 501) + inhomogeneous part.
I am sorry that we still couldn't resolve the issue! Maybe I am missing something basic or critical; I'm still new to LLMs and the Hugging Face API in general.
I can send you a small sample of my data so that you can try it yourself, if that's OK with you.
Thank you
Dani
Is it happening after exactly 1 epoch? This could be an error from the evaluation, likely happening in compute_metrics. I would write a separate one for regression or classification based on your needs and pass the correct one when needed. The only argument for compute_metrics should be an EvalPrediction. You can type hint it like this:
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds, labels = p.predictions, p.label_ids
    # if preds or labels is a tuple you usually need to take the 0th index, I usually add an if statement for this
    # etc.

# For example
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def compute_metrics_regression(p: EvalPrediction):
    """
    Compute various regression metrics for model evaluation.

    Args:
        p (EvalPrediction): An object containing predictions and label ids.

    Returns:
        dict: A dictionary containing the following metrics:
            - r_squared: Coefficient of determination
            - spearman_rho: Spearman's rank correlation coefficient
            - spear_pval: p-value for Spearman's correlation
            - pearson_rho: Pearson correlation coefficient
            - pear_pval: p-value for Pearson's correlation
            - mse: Mean Squared Error
            - mae: Mean Absolute Error
            - rmse: Root Mean Squared Error
    """
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[1] if isinstance(p.label_ids, tuple) else p.label_ids
    logits = np.array(preds).flatten()
    labels = np.array(labels).flatten()
    r2 = r2_score(labels, logits)
    spearman_rho, spear_pval = spearmanr(logits, labels)
    pearson_rho, pear_pval = pearsonr(logits, labels)
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    rmse = np.sqrt(mse)
    return {
        'r_squared': round(r2, 5),
        'spearman_rho': round(spearman_rho, 5),
        'spear_pval': round(spear_pval, 5),
        'pearson_rho': round(pearson_rho, 5),
        'pear_pval': round(pear_pval, 5),
        'mse': round(mse, 5),
        'mae': round(mae, 5),
        'rmse': round(rmse, 5),
    }
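A classification counterpart in the same spirit could look like this (a sketch using sklearn metrics, with the same tuple guard; the metric choices are just an example):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from transformers import EvalPrediction

def compute_metrics_classification(p: EvalPrediction):
    # Sketch: argmax over the logits, then standard sklearn classification metrics
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    preds = np.argmax(preds, axis=-1)
    return {
        'accuracy': round(accuracy_score(labels, preds), 5),
        'f1_macro': round(f1_score(labels, preds, average='macro'), 5),
    }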
I also don't think you need the .to_numpy()
in your dataset class. That shouldn't be able to run for a list of strings.
I would be happy to look at a small sample of your data, one or a couple example lines is fine if it is sensitive (you can change the column names too). I can just copy what you send several times if I need more samples. Also, if you could send the full traceback I may be able to debug a bit better. Sometimes an IDE will not show you the whole thing, I don't think it did here. Not sure how to fix that though.
It's great that you are new to LLMs and Huggingface! Welcome to the ecosystem. There is definitely a learning curve but once it clicks it is a fantastic resource for research. Don't get discouraged!
Best,
Logan
Dear Logan,
Many thanks for your inputs and encouragement!
I managed to solve the problem; it was indeed, as you pointed out, a problem in compute_metrics, which is called only after the 1st epoch. The returned object was a tuple and I needed to index it correctly to retrieve the logits.
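Concretely, the change inside compute_metrics was along these lines (a rough sketch, not my exact code):

predictions, labels = eval_pred
if isinstance(predictions, tuple):
    predictions = predictions[0]  # the logits are the first element of the tuple
predictions = np.argmax(predictions, axis=1)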
Regarding .to_numpy(): I added it because my input in this case was a DataFrame, whose columns are converted to a Series in StringLabelDatasetFromHF() and thus could not be indexed with [ ].
Now the training finished successfully :-)
Previously I fine-tuned the ProtT5-XL transformer model from ProtTrans. It actually worked great on a relatively simple protein function classification task.
There I used only the encoder part (in half precision), which was definitely enough while minimizing the resources and time spent.
Maybe you know whether it is possible to run the current ESM-C 600M-parameter model in half precision as well?
In any case I wanted to try this new model, as it is supposed to be a SOTA design and its training data is supposed to be much more extensive and varied (UniRef, MGnify and JGI, while ProtT5-XL was trained only on UniRef50).
Again many thanks for your time!
Best regards
Dani
No problem, glad it seems to be working!
You can absolutely run the current ESMC models in half precision, but training in half precision can be much less stable. We find that float16 inference offers almost no cost in performance, but the full half precision training can be tricky. You can try mixed precision training with the huggingface trainer, which should offer you a good speed up and memory reduction with a tiny cost to performance.
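For example, something along these lines (just a sketch; model and batch are the objects from the earlier snippets, and the exact flags depend on your GPU):

import torch
from transformers import TrainingArguments

# Half-precision inference: cast the model once and run under no_grad
model = model.half().eval()
with torch.no_grad():
    outputs = model(**batch)  # batch comes from the tokenizer / data collator

# Mixed-precision training with the Trainer: master weights stay in fp32 while
# the forward/backward pass is autocast, which is usually much more stable
training_args = TrainingArguments(
    output_dir='./results',
    fp16=True,  # or bf16=True on GPUs that support bfloat16
    per_device_train_batch_size=4,
)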
If you are interested in ProtT5-like models, the ANKH series has the same architecture and is better in about every way. Synthyra offers versions of the encoder-only weights in this collection - just look for the ANKH models.
Yeah, it's hard to tell where the metagenomic data will help or hinder. Models trained on old UniRef versions, like ANKH and ESM2, are still just as good or better in many scenarios.
If you have any other questions feel free to ask here. If not, kindly close the issue. Thanks for using our Huggingface model versions, and keep an eye out for our own product releases soon!!! We will have a variety of protein annotation systems hitting the market this year.
Dear @lhallee Logan,
Sorry, I had to re-open the issue. I am re-running my notebook with the previous code for the sequence classification task, which worked perfectly, and now this line: model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', num_labels=3, trust_remote_code=True)
generates AttributeError: 'ESMplusplusForSequenceClassification' object has no attribute 'sequence_head'.
Full traceback:
5 frames
/usr/local/lib/python3.11/dist-packages/transformers/models/auto/auto_factory.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
557 cls.register(config.__class__, model_class, exist_ok=True)
558 model_class = add_generation_mixin_to_remote_model(model_class)
--> 559 return model_class.from_pretrained(
560 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
561 )
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, weights_only, *model_args, **kwargs)
4243 offload_index,
4244 error_msgs,
-> 4245 ) = cls._load_pretrained_model(
4246 model,
4247 state_dict,
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py in _load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, hf_quantizer, keep_in_fp32_modules, gguf_path, weights_only)
4468
4469 # tie the model weights before retrieving the state_dict
-> 4470 model.tie_weights()
4471
4472 # Retrieve missing & unexpected_keys
/usr/local/lib/python3.11/dist-packages/transformers/modeling_utils.py in tie_weights(self)
1862 """
1863 if getattr(self.config, "tie_word_embeddings", True):
-> 1864 output_embeddings = self.get_output_embeddings()
1865 if output_embeddings is not None:
1866 self._tie_or_clone_weights(output_embeddings, self.get_input_embeddings())
~/.cache/huggingface/modules/transformers_modules/Synthyra/ESMplusplus_large/3e8b16b911bc31985bdd36c727c75d5d29cb1bfd/modeling_esm_plusplus.py in get_output_embeddings(self)
867
868 def get_output_embeddings(self):
--> 869 return self.sequence_head[-1]
870
871 def set_output_embeddings(self, new_embeddings):
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py in __getattr__(self, name)
1926 if name in modules:
1927 return modules[name]
-> 1928 raise AttributeError(
1929 f"'{type(self).__name__}' object has no attribute '{name}'"
1930 )
AttributeError: 'ESMplusplusForSequenceClassification' object has no attribute 'sequence_head'
I noticed that there was a new commit for this model 4 days ago...
Again many thanks for helping out!
Best regards
Dani
Hi @lhallee ,
Thanks for the fast response!
The initial issue was fixed, thanks.
But now another issue appeared when I continued with my same old code.
When I run the trainer function (trainer.train()) I get this error today: AttributeError: 'ESMplusplusForSequenceClassification' object has no attribute 'mean_pooling'
And yesterday, with the same code, I actually got a different error after your initial fix, which is weird: TypeError: ESMplusplusForSequenceClassification.forward() got an unexpected keyword argument 'num_items_in_batch'
Thanks again for helping out!
Dani
Hey @DaniDubi ,
Could you please link to a GitHub file with the code for this? We haven't experienced this on our end.
Thanks again for helping out!
No, thank you for helping us fix the bugs! We are doing a beta launch of Synthyra products next week. There will be various protein language modeling tools coming out in the next couple of months. Would you like early access? I just need your email and I can add you to our thread.
Best,
Logan
Hi @lhallee ,
I will paste the relevant pieces of code here, if that's OK:
config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer
def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
class StringLabelDatasetFromHF(TorchDataset):
    '''Uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy()    # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy() # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
torchdataset_my_train = StringLabelDatasetFromHF(my_train[0:100])
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid[0:25])
data_collator = string_labels_collator_builder(tokenizer)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"
target_modules = [
name
for name, module in model_classification.named_modules() # iterate through all modules and their names.
if re.fullmatch(pattern, name)
]
print(f'Target modules for LORA: {target_modules}')
lora_config = LoraConfig(
r=4, # Rank of the LoRA update matrices
lora_alpha=32, # Scaling factor for the LoRA update matrices
lora_dropout=0.05, # Dropout probability for the LoRA update matrices
bias="none", # Whether to apply bias to the LoRA update matrices
task_type=TaskType.SEQ_CLS, # Task type for sequence classification
target_modules=target_modules, # Modules which LORA method should target and modify their weights
)
# Apply LoRA to the classification model
model = get_peft_model(model_classification, lora_config)
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    fp16=False,
    gradient_checkpointing=False,
)
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

trainer.train()
The error I'm getting is: AttributeError: 'ESMplusplusForSequenceClassification' object has no attribute 'mean_pooling'. I can paste the complete traceback here if needed.
Sure, I will be happy to try out some of your new models and tools! If you add support for MLX models to run natively on Apple silicon, that would be great as well.
Here is my email: [email protected]
Many thanks again
Dani
Thanks for sharing.
Also, I believe we have found the issue. If you rerun, it should be all set. If there are any other issues, don't hesitate to reach out.
Best,
Logan