Protein2Text: Providing Rich Descriptions from Protein Sequences

Abstract:

Understanding the functionality of proteins has been a focal point of biological research due to their critical roles in various biological processes. Unraveling protein functions is essential for advancements in medicine, agriculture, and biotechnology, enabling the development of targeted therapies, engineered crops, and novel biomaterials. However, this endeavor is challenging due to the complex nature of proteins, requiring sophisticated experimental designs and extended timelines to uncover their specific functions. Public large language models (LLMs), though proficient in natural language processing, struggle with biological sequences due to the unique and intricate nature of biochemical data. These models often fail to accurately interpret and predict the functional and structural properties of proteins, limiting their utility in bioinformatics. To address this gap, we introduce BetaDescribe, a collection of models designed to generate detailed and rich textual descriptions of proteins, encompassing properties such as function, catalytic activity, involvement in specific metabolic pathways, subcellular localization, and the presence of particular domains. The trained BetaDescribe model receives a protein sequence as input and outputs a textual description of these properties. BetaDescribe's starting point was the Llama 2 model, which was trained on trillions of tokens. Next, we trained our model on datasets containing both biological and English text, allowing biological knowledge to be incorporated. We demonstrate the utility of BetaDescribe by providing descriptions for proteins that share little to no sequence similarity with proteins that have functional descriptions in public datasets. We also show that BetaDescribe can be used for in silico mutagenesis to identify regions important for protein functionality, without requiring homologous sequences for inference.
Altogether, BetaDescribe offers a powerful tool to explore protein functionality, augmenting existing approaches such as annotation transfer based on sequence or structure similarity.
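
The in silico mutagenesis idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure (see the preprint for that); it only assumes some callable that maps a sequence to a description. The names `scan_regions` and `toy_describe`, the alanine replacement, the window sizes, and the text-similarity score are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def scan_regions(
    sequence: str,
    describe: Callable[[str], str],  # hypothetical: sequence -> description text
    window: int = 20,
    step: int = 10,
    replacement: str = "A",
) -> List[Tuple[int, int, float]]:
    """Replace one window at a time and measure how much the description changes.

    Returns (start, end, change) tuples, where change is 1 minus the text
    similarity between the baseline and mutant descriptions.
    """
    baseline = describe(sequence)
    results = []
    for start in range(0, len(sequence), step):
        end = min(start + window, len(sequence))
        # Substitute the whole window with the replacement residue.
        mutant = sequence[:start] + replacement * (end - start) + sequence[end:]
        similarity = SequenceMatcher(None, baseline, describe(mutant)).ratio()
        results.append((start, end, 1.0 - similarity))
        if end == len(sequence):
            break
    return results


def toy_describe(seq: str) -> str:
    # Toy stand-in for a real model: the description depends on one motif.
    return "FUNCTION$ kinase" if "GKS" in seq else "FUNCTION$ unknown"


toy_sequence = "MMMMM" + "GKS" + "M" * 12
scan = scan_regions(toy_sequence, toy_describe, window=5, step=5)
```

Windows whose substitution changes the description the most are candidates for functionally important regions. With a real model, a more robust comparison than raw text similarity would likely be needed.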

Figure: BetaDescribe workflow. The generator processes the protein sequences and creates multiple candidate descriptions. Independently, the validators provide simple textual properties of the protein. The judge receives the candidate descriptions (from the generator) and the predicted properties (from the validators) and rejects or accepts each description. Finally, BetaDescribe provides up to three alternative descriptions for each protein.
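
The workflow described above can be sketched as plain control flow. This is a minimal illustration, not the project's actual code: the function `describe_protein`, the type aliases, and the toy generator, validator, and judge are hypothetical stand-ins for the trained models. Only the overall shape (generate candidates, predict properties independently, accept or reject, return up to three) comes from the description above.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces for the three components.
Generator = Callable[[str], List[str]]         # sequence -> candidate descriptions
Validator = Callable[[str], str]               # sequence -> one predicted property
Judge = Callable[[str, Dict[str, str]], bool]  # (candidate, properties) -> accept?


def describe_protein(
    sequence: str,
    generator: Generator,
    validators: Dict[str, Validator],
    judge: Judge,
    max_descriptions: int = 3,
) -> List[str]:
    """Generate candidates, validate independently, and keep accepted ones."""
    candidates = generator(sequence)
    # Validators run on the sequence alone, independently of the generator.
    properties = {name: validate(sequence) for name, validate in validators.items()}
    accepted = [c for c in candidates if judge(c, properties)]
    return accepted[:max_descriptions]


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_generator(sequence: str) -> List[str]:
    return [
        "FUNCTION$ Viral fusion protein",
        "FUNCTION$ Viral envelope glycoprotein",
        "FUNCTION$ Bacterial transporter",
        "FUNCTION$ Viral polymerase",
        "FUNCTION$ Viral capsid protein",
    ]


toy_validators = {"origin": lambda sequence: "Viral"}


def toy_judge(candidate: str, properties: Dict[str, str]) -> bool:
    # Accept a candidate only if it mentions every predicted property.
    return all(value.lower() in candidate.lower() for value in properties.values())


descriptions = describe_protein("MKLVLLFSLSVLLG", toy_generator, toy_validators, toy_judge)
```

Keeping the validators independent of the generator means the judge compares two separate lines of evidence, which is the point of the design described above.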

Preprint: https://www.biorxiv.org/content/10.1101/2024.12.04.626777v1.full.pdf+html

Examples of descriptions of unknown proteins:

SnRV-Env:

Sequence: MKLVLLFSLSVLLGTSVGRILEIPETNQTRTVQVRKGQLVQLTCPQLPPPQGTGVLIWGRNKRTGGGALDFNGVLTVPVGDNENTYQCMWCQNTTSKNAPRQKRSLRNQPTEWHLHMCGPPGDYICIWTNKKPVCTTYHEGQDTYSLGTHRKVLPKVTEACAVGQPPQIPGTYVASSKGWTMFNKFEVHSYPANVTQIKTNRTLHDVTLWWCHDNSIWRCTQMGFIHPHQGRRIQLGDGTRFRDGLYVIVSNHGDHHTVQHYMLGSGYTVPVSTATRVQMQKIGPGEWKIATSMVGLCLDEWEIECTGFCSGPPPCSLSITQQQDTVGGSYDSWNGCFVKSIHTPVMALNLWWRRSCKGLPEATGMVKIYYPDQFEIAPWMRPQPRQPKLILPFTVAPKYRRQRRGLNPSTTPDYYTNEDYSGSGGWEINDEWEYIPPTVKPTTPSVEFIQKVTTPRQDKLTTVLSRNKRGVNIASSGNSWKAEIDEIRKQKWQKCYFSGKLRIKGTDYEEIDTCPKPLIGPLSGFIPTGVTKTLKTGVTWTTAVVKIDLQQWVDILNSTCKDTLIGKHWIKVIQRLLREYQKTGVTFNLPQVQSLPNWETKNKDNPGHHIPKSRRKRIRRGLGEALGLGNFADNRWKDLQIAGLGVEQQKLMGLTREATFEAWNALKGISNELIKWEEDMVATLRQLLLQIKGTNTTLCSAMGPLMATNIQQIMFALQHGNLPEMSYSNPVLKEIAKQYNGQMLGVPVETTGNNLGIMLSLPTGGENIGRAVAVYDMGVRHNRTLYLDPNARWIHNHTEKSNPKGWVTIVDLSKCVETTGTIYCNEHGFRDRKFTKGPSELVQHLAGNTWCLNSGTWSSLKNETLYVSGRNCSFSLTSRRRPVCFHLNSTAQWRGHVLPFVSNSQEAPNTEIWEGLIEEAIREHNKVQDILTKLEQQHQNWKQNTDNALQNMKDAIDSMDNNMLTFRYEYTQYGLFIVCLLAFLFAVIFGWLCGVTVRLREVFTILSVKIHALKSQAHQLAMLRGLRDPETGEQDRQAPAYREPPTYQEWARRRGGRPPIVTFLIDRETGERHDGQIFQPIRNRSNQVHRPQPPRPTAPNPDNQRPIREPRPEEPEHGDFLQGASWMWQ

Description: FUNCTION$ The leader peptide is a component of released, infectious virions and is required for particle budding, & The transmembrane protein (TM) acts as a class I viral fusion protein. Under the current model, the protein has at least 3 conformational states: pre-fusion native state, pre-hairpin intermediate state, and post-fusion hairpin state. During viral and target cell membrane fusion, the coiled coil regions (heptad repeats) assume a trimer-of-hairpins structure, positioning the fusion peptide in close proximity to the C-terminal region of the ectodomain. The formation of this structure appears to drive apposition and subsequent fusion of viral and target cell membranes. Membranes fusion leads to delivery of the nucleocapsid into the cytoplasm, SUBCELLULAR LOCATION$ Endoplasmic reticulum membrane.

TGV-S:

Sequence: MISGHTLCMLVLFYLYSYSNAQHELQLNPTTYHWLNCATSDCKSWQACPSTQATTCVSFSYTGLAWHKQDNTIIGYSNFTSQSLYDTISYTFAPSYVLSHAMTNLEPQKLCSLKSTIQSFHGFTPADCCLNPSASPACSYFSTGDTSFITGTPYQCTASYYGYGSPYGTDCEPYFASVSPYGTSVTPSGDVFTNFGEKSVHTYDCFYENWARYRPAPYTNNPSDPRWNLCHSIYYYVWTLSDTNHQFTTVESEPGDKVIMKQLSSHTPVYLTLGGWTSNNTVLYQAISSRRLDTIAMLRDLHDNYGVTGVCIDFEFIGGSNQYSNIFLLDWVPDLLSFLSSVRLEFGPSYYITFVGLAVGSHFLPTIYQQIDPLIDAWLISGYDLHGDWEVKATQQAALVDDPKSDFPTYSLFTSVDNMLAITTPDKIILGLPQYTRGVYTSLTGSTTGPYPPTTPMCPTPPACGTDIVISTSHGEIPSTHDTTKGDIIIEDPSQPKFYISKGSRNGRTFNHFFMNSTTASHIRSTLQPKGITRWYSYASSMNLQTNTNFKTALLSQSRKARQLSTYYKYPAPAGSGVTSCPGIVVFTDTFVVTTTAYAGSHALPLLDGNFYSPRSTFTCSPGFSTLMPTTTTRCSGIDPSNLLPSDSSSVSIVCPDMTFFGAKIAICASSTTTSKPTHLQLEVSTSIEGQFQFNSLPIYSQHKVSTTSFSVPYKCINFTPIPSCISSVCGSSHSCVTKLQESPASYACQSAAAIAIVYNNTLDLVKRSQTTTELLFNQVVLESSKFGVVTHTRQTRGLFGILSITSLIMSGVALATSSSALYVSIKNQAELSSLRNDVNSKFTTIDQNFDQITSKFNHLSTTTSDAFIAQSNINTQLQSSINQLQENLEVLSNFVTTQLSSVSSSITQLSEAIDALSDQVNYLAYLTSGISSYTSRLTSVTVQATNTAVKFSTLQSHLSNCLTSLQQQSFTGCIHKSGNIIPLKVVYTPFGNTRYLSFIYAEAELLGYQQYKSALSYCDQNFLYSSSPGCFFLLNGSSIDHRSSLSAACPTPATVVSMSCQNVTLDLSSQSIVRPYVFPLLNLTLPTPVKTNISFTPGKAPVFQNITQIDQTLLLDLAQQLQAIQLQLNPVGPISTSSFSPVVIALTVISAVVFLAVTSIVIYMLCKTAPFKPSRKTA

Descriptions:

  1. FUNCTION$ Envelope glycoprotein that forms spikes at the surface of virion envelope. Essential for the initial attachment to heparan sulfate moities of the host cell surface proteoglycans. Involved in fusion of viral and cellular membranes leading to virus entry into the host cell. Following initial binding to its host receptors, membrane fusion is mediated by the fusion machinery composed at least of gB and the heterodimer gH/gL. May be involved in the fusion between the virion envelope and the outer nuclear membrane during virion egress, SUBCELLULAR LOCATION$ Virion membrane, SUBUNIT$ Homotrimer; disulfide-linked. Binds to heparan sulfate proteoglycans. Interacts with gH/gL heterodimer, SIMILARITY$ Belongs to the herpesviridae glycoprotein B family.

  2. FUNCTION$ The surface protein (SU) attaches the virus to the host cell by binding to its receptor. This interaction triggers the refolding of the transmembrane protein (TM) and is thought to activate its fusogenic potential by unmasking its fusion peptide. Fusion occurs at the host cell plasma membrane, & The transmembrane protein (TM) acts as a class I viral fusion protein. Under the current model, the protein has at least 3 conformational states: pre-fusion native state, pre-hairpin intermediate state, and post-fusion hairpin state. During viral and target cell membrane fusion, the coiled coil regions (heptad repeats) assume a trimer-of-hairpins structure, positioning the fusion peptide in close proximity to the C-terminal region of the ectodomain. The formation of this structure appears to drive apposition and subsequent fusion of viral and target cell membranes. Membranes fusion leads to delivery of the nucleocapsid into the cytoplasm, SUBCELLULAR LOCATION$ Cell membrane. SUBUNIT$ The mature envelope protein (Env) consists of a trimer of SU-TM heterodimers attached by noncovalent interactions or by a labile interchain disulfide bond

Protein 1 (TiLV virus):

Sequence: MWAFQEGVCKGNLLSGPTSMKAPDSAARESLDRASEIMTGKSYNAVHTGDLSKLPNQGESPLRIVDSDLYSERSCCWVIEKEGRVVCKSTTLTRGMTGLLNTTRCSSPSELICKVLTVESLSEKIGDTSVEELLSHGRYFKCALRDQERGKPKSRAIFLSHPFFRLLSSVVETHARSVLSKVSAVYTATASAEQRAMMAAQVVESRKHVLNGDCTKYNEAIDADTLLKVWDAIGMGSIGVMLAYMVRRKCVLIKDTLVECPGGMLMGMFNATATLALQGTTDRFLSFSDDFITSFNSPAELREIEDLLFASCHNLSLKKSYISVASLEINSCTLTRDGDLATGLGCTAGVPFRGPLVTLKQTAAMLSGAVDSGVMPFHSAERLFQIKQQECAYRYNNPTYTTRNEDFLPTCLGGKTVISFQSLLTWDCHPFWYQVHPDGPDTIDQKVLSVLASKTRRRRTRLEALSDLDPLVPHRLLVSESDVSKIRAARQAHLKSLGLEQPTNFNYAIYKAVQPTAGC

Description: FUNCTION$ Probably involved in the RNA silencing pathway and required for the generation of small interfering RNAs (siRNAs), CATALYTIC ACTIVITY$ a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate + RNA(n+1), SIMILARITY$ Belongs to the RdRP family.

Protein 2 (TiLV virus):

Sequence: MSQFGKSFKGRTEVTITEYRSHTVKDVHRSLLTADKSLRKSFCFRNALNQFLDKDLPLLPIRPKLESRVAVKKSKLRSQLSFRPGLTQEEAIDLYNKGYDGDSVSGALQDRVVNEPVAYSSADNDKFHRGLAALGYTLADRAFDTCESGFVRAIPTTPCGFICCGPGSFKDSLGFVIKIGEFWHMYDGFQHFVAVEDAKFLASKSPSFWLAKRLAKRLNLVPKEDPSIAAAECPCRKVWEASFARAPTALDPFGGRAFCDQGWVYHRDVGYATANHISQETLFQQALSVRNLGPQGSANVSGSIHTALDRLRAAYSRGTPASRSILQGLANLITPVGENFECDLDKRKLNIKALRSPERYITIEGLVVNLDDVVRGFYLDKAKVTVLSRSKWMGYEDLPQKPPNGTFYCRKRKAMLLISCSPGTYAKKRKVAVQEDRFKDMRVENFREVAENMDLNQ

Description: FUNCTION$ DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates, CATALYTIC ACTIVITY$ a ribonucleoside 5'-triphosphate + RNA(n) = diphosphate + RNA(n+1), SIMILARITY$ Belongs to the RNA polymerase beta' chain family.

Protein 3 (TiLV virus):

Sequence: MDSRFAQLTGVFCDDFTYSEGSRRFLSSYSTVERRPGVPVEGDCYDCLKNKWIAFELEGQPRKFPKATVRCILNNDATYVCSEQEYQQICKVQFKDYLEIDGVVKVGHKASYDAELRERLLELPHPKSGPKPRIEWVAPPRLADISKETAELKRQYGFFECSKFLACGEECGLDQEARELILNEYARDREFEFRNGGWIQRYTVASHKPATQKILPLPASAPLARELLMLIARSTTQAGKVLHSDNTSILAVPVMRDSGKHSKRRPTASTHHLVVGLSKPGCEHDFEFDGYRAAVHVMHLDPKQSANIGEQDFVSTREIYKLDMLELPPISRKGDLDRASGLETRWDVILLLECLDSTRVSQAVAQHFNRHRLALSVCKDEFRKGYQLASEIRGTIPLSSLYYSLCAVRLRMTVHPFAR

Descriptions:

  1. FUNCTION$ DNA-dependent RNA polymerase catalyzes the transcription of DNA into RNA using the four ribonucleoside triphosphates as substrates. Specific core component of RNA polymerase III which synthesizes small RNAs, such as 5S rRNA and tRNAs, SUBCELLULAR LOCATION$ Nucleus, SUBUNIT$ Component of the RNA polymerase III (Pol III) complex consisting of 17 subunits, SIMILARITY$ Belongs to the eukaryotic RPC3/POLR3C RNA polymerase subunit family.

  2. FUNCTION$ Decapping enzyme for NAD-capped RNAs: specifically hydrolyzes the nicotinamide adenine dinucleotide (NAD) cap from a subset of RNAs by removing the entire NAD moiety from the 5'-end of an NAD-capped RNA, SUBCELLULAR LOCATION$ Nucleus, COFACTOR$ a divalent metal cation, SIMILARITY$ Belongs to the DXO/Dom3Z family.
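
The descriptions above follow a `FIELD$ text` layout, with fields separated by commas or periods and multiple statements within a field joined by `, & `. A small parser can split such a string into fields. This is a sketch based only on the examples shown here; the function name `parse_description` and the exact separator rules are assumptions, and real outputs may contain cases it does not handle.

```python
import re
from typing import Dict, List

# A field tag is an upper-case label followed by "$", found either at the
# start of the text or right after ", " or ". " (assumption from the examples).
TAG = re.compile(r"(?:^|(?<=[,.] ))([A-Z][A-Z ]*[A-Z])\$")


def parse_description(text: str) -> Dict[str, List[str]]:
    """Split a description into {field name: [statements]}.

    Trailing commas and periods are stripped from each field value.
    """
    text = text.strip()
    matches = list(TAG.finditer(text))
    fields: Dict[str, List[str]] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        raw = text[match.end():end].strip().rstrip(",.").strip()
        # ", & " joins several statements inside one field.
        fields[match.group(1)] = [part.strip() for part in raw.split(", & ")]
    return fields


example = (
    "FUNCTION$ Probably involved in the RNA silencing pathway and required for "
    "the generation of small interfering RNAs (siRNAs), CATALYTIC ACTIVITY$ a "
    "ribonucleoside 5'-triphosphate + RNA(n) = diphosphate + RNA(n+1), "
    "SIMILARITY$ Belongs to the RdRP family."
)
parsed = parse_description(example)
```

Here `parsed` has the keys `FUNCTION`, `CATALYTIC ACTIVITY`, and `SIMILARITY`, each mapped to a list of statements.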

Code: https://github.com/technion-cs-nlp/BetaDescribe-code/

Model size: 149M params (Safetensors, tensor type F32)