---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---

![medBERT-logo](medBERT.png)

# **medBERT-base**

This repository contains a BERT-based model, **medBERT-base**, fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding and generation of medical language in natural contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model with the Hugging Face `transformers` library:

```py
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to("cuda")

# Sentence containing a single [MASK] token to fill.
input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the 5 highest-scoring tokens for it.
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")
```

_Top 5 predictions:_

_1: from_
_2: of_
_3: after_
_4: by_
_5: through_

### **Fine-tuning the Model**

To fine-tune **medBERT-base** on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical or gastroenterology-related texts) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

The full training code is available here: https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb

A minimal fine-tuning sketch using the hyperparameters below is also included at the end of this card.

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face Hub and provides a rich collection of medical and gastroenterology-related texts for training.
- This model is built with the Hugging Face `transformers` library, a state-of-the-art library for NLP models.
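
## **Fine-tuning Sketch**

The linked notebook contains the full training loop. For convenience, below is a minimal sketch of a comparable fine-tuning run using the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`, with the hyperparameters listed under Training Details. The dataset column name (`text`) and the output directory are assumptions, so adjust them to match your data; this is an illustrative sketch, not the exact script used to train the released checkpoint.

```py
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the dataset used for the original fine-tuning run (or your own corpus).
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling")

tokenizer = BertTokenizerFast.from_pretrained("suayptalha/medBERT-base")
model = BertForMaskedLM.from_pretrained("suayptalha/medBERT-base")

def tokenize(batch):
    # "text" is assumed to be the column holding the raw passages;
    # change it to your dataset's actual column name.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

# Dynamic masking: 15% of tokens are masked when each batch is built.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="medBERT-finetuned",   # assumed output path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()
```

Because the collator applies masking on the fly, the tokenized dataset does not need pre-masked examples; each epoch sees a different random mask pattern.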