GitHub: https://github.com/baaihealth/opi

Paper:

The paper "OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks" has been accepted at the NeurIPS 2024 workshop "Foundation Models for Science: Progress, Opportunities, and Challenges".

Model Card of OPI-Llama-3.1-8B-Instruct

OPI-Llama-3.1-8B-Instruct was fine-tuned from the Meta-Llama-3.1-8B-Instruct model using the complete OPI training set (i.e., OPI_full_1.61M_train.json). For more details on training and testing, see https://github.com/baaihealth/opi.

Evaluation of OPI-Llama-3.1-8B-Instruct on 9 tasks

Every result below comes from the same model: Meta-Llama-3.1-8B-Instruct fine-tuned on OPI_full_1.61M_train.json and then evaluated on the testing set of each task. A "-" marks a metric that is not reported for that task.

Task Type	Task Name	Testing File	Accuracy	Precision	Recall	F1	ROUGE-L
Sequence Understanding	EC Number Prediction (split100)	CLEAN_EC_number_new_test	-	0.3724	0.3374	0.3468	-
	EC Number Prediction (split100)	CLEAN_EC_number_price_test	-	0.0738	0.0738	0.0738	-
	Fold Type Prediction	fold_type_test_Fold_Holdout	0.1045	-	-	-	-
		fold_type_test_Superfamily_Holdout	0.1507	-	-	-	-
		fold_type_test_Family_Holdout	0.6145	-	-	-	-
	Subcellular Localization Prediction	subcell_loc_test	0.4214	-	-	-	-
Annotation Prediction	Function Keywords Prediction	CASPSimilarSeq_keywords_test	-	0.4202	0.5057	0.4385	-
		IDFilterSeq_keywords_test	-	0.6762	0.6905	0.6650	-
		UniProtSeq_keywords_test	-	0.7606	0.7489	0.7374	-
	Gene Ontology (GO) Terms Prediction	CASPSimilarSeq_go_terms_test	-	0.1113	0.0936	0.099	-
		IDFilterSeq_go_terms_test	-	0.6686	0.6287	0.6304	-
		UniProtSeq_go_terms_test	-	0.7150	0.6897	0.6849	-
	Function Description Prediction	CASPSimilarSeq_function_test	-	-	-	-	0.7524
		IDFilterSeq_function_test	-	-	-	-	0.4786
		UniProtSeq_function_test	-	-	-	-	0.5144
Knowledge Mining	Tissue Location Prediction from Gene Symbol	gene_symbol_to_tissue_test	-	0.4002	0.9356	0.5466	-
	Cancer Prediction from Gene Symbol	gene_symbol_to_cancer_test	-	0.2890	0.2701	0.2664	-
	Cancer Prediction from Gene Name	gene_name_to_cancer_test	-	0.2786	0.2707	0.2659	-
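
The metric columns above can be illustrated with a small sketch. This is not the evaluation code from the OPI repository, which remains the authoritative reference. It assumes that precision, recall, and F1 are computed per sample on sets of predicted versus reference labels and then averaged, and that ROUGE-L is an LCS-based F-measure over whitespace-separated tokens. The function names (`sample_prf`, `mean_prf`, `rouge_l_f1`) and the example labels are hypothetical. In several rows (for example gene_symbol_to_tissue_test, where the harmonic mean of the reported precision and recall is about 0.56 rather than 0.5466), the reported F1 differs from the harmonic mean of the reported precision and recall. That would be consistent with per-sample averaging, but it is an inference and not something the card states.

```python
from typing import Iterable, Sequence, Set, Tuple


def sample_prf(pred: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one sample's predicted vs. reference label sets."""
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def mean_prf(
    preds: Iterable[Iterable[str]], golds: Iterable[Iterable[str]]
) -> Tuple[float, float, float]:
    """Average the per-sample scores (assumed averaging scheme, see lead-in)."""
    scores = [sample_prf(set(p), set(g)) for p, g in zip(preds, golds)]
    if not scores:
        return 0.0, 0.0, 0.0
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure over whitespace tokens (tokenization is an assumption)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical keyword predictions for two proteins.
    preds = [{"Hydrolase", "Membrane"}, {"Kinase"}]
    golds = [{"Hydrolase"}, {"Kinase", "Transferase"}]
    print(mean_prf(preds, golds))
    print(rouge_l_f1("binds ATP and magnesium", "binds ATP"))
```

In the hypothetical example, mean precision and mean recall are both 0.75 while the mean per-sample F1 is about 0.67, which shows how an averaged F1 can fall below the harmonic mean of the averaged precision and recall.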