---
tags:
- LoRA
- protein language model
base_model: Rostlab/prot_t5_xl_uniref50
datasets:
- CQSB/SoftDis
---
# LoRA-DR-suite
## Model details
LoRA-DR-suite is a family of models for the identification of disordered regions (DRs) in proteins, built upon state-of-the-art Protein Language Models (PLMs) trained on protein sequences only. The models use Low-Rank Adaptation (LoRA) fine-tuning for binary classification of intrinsic and soft disorder.
Intrinsically disordered residues are detected experimentally through circular dichroism and X-ray crystallography, while soft disorder is characterized by high B-factors or by residues that are intermittently missing across different X-ray crystal structures of the same sequence.
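As a rough illustration of the adapter setup (not the published training configuration), the sketch below shows how LoRA adapters can be attached to a pre-trained PLM for token-level binary classification with the `peft` library; the rank, scaling, dropout and target modules are placeholder assumptions.
```python
from transformers import AutoModelForTokenClassification
from peft import LoraConfig, TaskType, get_peft_model

# Minimal sketch: wrap a pre-trained PLM with LoRA adapters for
# token-level binary classification (disordered vs. ordered).
# The hyperparameters below are illustrative placeholders, not the
# values used to train the LoRA-DR-suite checkpoints.
base_model = AutoModelForTokenClassification.from_pretrained(
    "facebook/esm2_t12_35M_UR50D", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    r=8,                                       # assumed LoRA rank
    lora_alpha=16,                             # assumed scaling factor
    lora_dropout=0.1,
    target_modules=["query", "key", "value"],  # ESM-2 attention projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only adapters and classifier head are trained
```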
Models for intrinsic disorder were evaluated on data from the Critical Assessment of Intrinsic Disorder (CAID) challenges. In particular:
- For the CAID1 and CAID2 evaluations, models were trained exclusively on data from the DisProt 7.0 database. These models are denoted with the suffix “DisProt7” (see table below).
- For the CAID3 evaluation, the training set was expanded to also include data from both CAID1 and CAID2. These models are labelled with the suffix “ID” and are the recommended models for intrinsic disorder prediction.
Models for soft disorder classification are instead trained on the [SoftDis](https://huggingface.co/datasets/CQSB/SoftDis) dataset, derived from an extensive analysis of clusters of alternative structures for the same protein
sequence in the Protein Data Bank (PDB). For each position in the representative sequence of each cluster, the dataset provides the frequency of closely related homologs for which the corresponding residue is highly flexible or missing. Any position with a frequency higher than 0 is labeled as soft disordered.
The data split used for model training corresponds to the [id05](https://huggingface.co/datasets/CQSB/SoftDis/tree/main/splits/id05) configuration, which further clusters the representative sequences at 0.5 sequence identity. See the [SoftDis dataset card](https://huggingface.co/datasets/CQSB/SoftDis) for more details.
Models trained with this dataset are denoted with the suffix “SD”.
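As a sketch of how the binary soft-disorder labels can be derived from the reported frequencies (the split and column names below are assumptions; see the dataset card for the actual schema):
```python
from datasets import load_dataset

# Sketch only: load SoftDis and binarize per-residue frequencies into
# soft-disorder labels. The split name and the column names
# ("sequence", "frequency") are assumptions and may differ from the
# actual dataset schema.
softdis = load_dataset("CQSB/SoftDis", split="train")
example = softdis[0]
labels = [1 if freq > 0 else 0 for freq in example["frequency"]]
print(example["sequence"][:20], labels[:20])
```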
## Model checkpoints
We provide several model checkpoints, differing in training data and pre-trained PLM.
| Checkpoint name | Training dataset | Pre-trained checkpoint |
|-----------------|------------------|------------------------|
| [esm2_650M-LoRA-DisProt7](https://huggingface.co/CQSB/esm2_650M-LoRA-DisProt7) | DisProt 7.0 | [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) |
| [esm2_35M-LoRA-DisProt7](https://huggingface.co/CQSB/esm2_35M-LoRA-DisProt7) | DisProt 7.0 | [esm2_t12_35M_UR50D](https://huggingface.co/facebook/esm2_t12_35M_UR50D) |
| [Ankh-LoRA-DisProt7](https://huggingface.co/CQSB/Ankh-LoRA-DisProt7) | DisProt 7.0 | [ankh-large](https://huggingface.co/ElnaggarLab/ankh-large) |
| [ProtT5-LoRA-DisProt7](https://huggingface.co/CQSB/ProtT5-LoRA-DisProt7) | DisProt 7.0 | [prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) |
| [esm2_650M-LoRA-ID](https://huggingface.co/CQSB/esm2_650M-LoRA-ID) | Intrinsic dis.* | [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) |
| [esm2_35M-LoRA-ID](https://huggingface.co/CQSB/esm2_35M-LoRA-ID) | Intrinsic dis.* | [esm2_t12_35M_UR50D](https://huggingface.co/facebook/esm2_t12_35M_UR50D) |
| [Ankh-LoRA-ID](https://huggingface.co/CQSB/Ankh-LoRA-ID) | Intrinsic dis.* | [ankh-large](https://huggingface.co/ElnaggarLab/ankh-large) |
| [ProtT5-LoRA-ID](https://huggingface.co/CQSB/ProtT5-LoRA-ID) | Intrinsic dis.* | [prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) |
| [esm2_650M-LoRA-SD](https://huggingface.co/CQSB/esm2_650M-LoRA-SD) | SoftDis | [esm2_t33_650M_UR50D](https://huggingface.co/facebook/esm2_t33_650M_UR50D) |
| [esm2_35M-LoRA-SD](https://huggingface.co/CQSB/esm2_35M-LoRA-SD) | SoftDis | [esm2_t12_35M_UR50D](https://huggingface.co/facebook/esm2_t12_35M_UR50D) |
| [Ankh-LoRA-SD](https://huggingface.co/CQSB/Ankh-LoRA-SD) | SoftDis | [ankh-large](https://huggingface.co/ElnaggarLab/ankh-large) |
| [ProtT5-LoRA-SD](https://huggingface.co/CQSB/ProtT5-LoRA-SD) | SoftDis | [prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) |
\* Union of DisProt7, CAID1 and CAID2 datasets
## Intended uses & limitations
The models are intended for the classification of different disorder types.
Models for intrinsic disorder trained on DisProt 7.0 were evaluated on the CAID1 and CAID2 challenges, but we suggest using the "ID" models for classification of new sequences, as they show better generalization, in particular esm2_650M-LoRA-ID.
In addition to its relation to flexibility and assembly pathways, soft disorder can be used to infer a confidence score for structure prediction tools: we found a high negative Spearman correlation between soft disorder probabilities and the pLDDT of AlphaFold2 predictions.
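As an illustration of this use case (not the evaluation protocol of the paper), per-residue soft disorder probabilities can be compared against AlphaFold2 pLDDT values with a Spearman correlation; the numbers below are hypothetical placeholders.
```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-residue values for one protein: soft disorder
# probabilities from a *-LoRA-SD model and pLDDT from an AlphaFold2
# prediction of the same sequence (placeholder numbers).
soft_disorder_proba = np.array([0.05, 0.10, 0.80, 0.90, 0.75, 0.15])
plddt = np.array([92.0, 88.5, 45.2, 38.7, 50.1, 85.3])

rho, pval = spearmanr(soft_disorder_proba, plddt)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")  # expected to be strongly negative
```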
### Model usage
All models can be loaded as PyTorch Modules, together with their associated tokenizer, with the following code:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
model_id = "CQSB/ProtT5-LoRA-ID-DisProt7" # model_id for selected model
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
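Optionally, the classification head of the checkpoint can be inspected with standard `transformers` config attributes:
```python
# quick sanity checks on the loaded checkpoint
print(model.config.num_labels)  # 2: binary disorder classification
print(model.config.id2label)    # mapping from logit index to label name
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```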
Once the model is loaded, the disorder profile for all residues in a sequence can be obtained as follows:
```python
import torch
import torch.nn.functional as F
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# example sequence
sequence = "TAIWEQHTVTLHRAPGFGFGIAISGGRDNPHFQSGETSIVISDVLKG"
# each pre-trained model adds its own special tokens to the tokenized sequence;
# special_tokens_mask makes it possible to handle them (padding included, for
# batched inputs) without changing the code
inputs = tokenizer(
    [sequence], return_tensors="pt", return_special_tokens_mask=True
)
input_ids = inputs['input_ids'].to(device)
attention_mask = inputs['attention_mask'].to(device)
special_tokens_mask = inputs['special_tokens_mask'].bool()
# extract predicted disorder probability
with torch.inference_mode():
    output = model(input_ids, attention_mask=attention_mask).logits.cpu()
output = output[~special_tokens_mask]
disorder_proba = F.softmax(output, dim=-1)[:, 1]
```
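For several sequences at once, the same `special_tokens_mask` logic also removes padding positions. A minimal batched sketch, reusing the model, tokenizer and device defined above (the second sequence is an arbitrary example):
```python
# Batched inference sketch: pad sequences to a common length and use
# special_tokens_mask to drop special and padding positions afterwards.
sequences = [
    "TAIWEQHTVTLHRAPGFGFGIAISGGRDNPHFQSGETSIVISDVLKG",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
]
inputs = tokenizer(
    sequences, return_tensors="pt", padding=True, return_special_tokens_mask=True
)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)
special_tokens_mask = inputs["special_tokens_mask"].bool()

with torch.inference_mode():
    logits = model(input_ids, attention_mask=attention_mask).logits.cpu()

# per-sequence disorder profiles, one tensor per input sequence
profiles = [
    F.softmax(seq_logits[~seq_mask], dim=-1)[:, 1]
    for seq_logits, seq_mask in zip(logits, special_tokens_mask)
]
```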
## How to cite
```bibtex
@article{lombardi2025lora,
  title={LoRA-DR-suite: adapted embeddings predict intrinsic and soft disorder from protein sequences},
  author={Lombardi, Gianluca and Seoane, Beatriz and Carbone, Alessandra},
  journal={bioRxiv},
  year={2025}
}
```