---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---

![medBERT-logo](medBERT.png)

# **medBERT-base**

This repository contains a BERT-based model, **medBERT-base**, fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its understanding and generation of medical language in natural contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **medBERT-base** model with the Hugging Face `transformers` library:

```py
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to("cuda")

# Sentence containing a single [MASK] token to fill.
input_text = "Response to neoadjuvant chemotherapy best predicts survival [MASK] curative resection of gastric cancer."
inputs = tokenizer(input_text, return_tensors='pt').to("cuda")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the 5 highest-scoring tokens for it.
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

top_k = 5
logits = outputs.logits[0, masked_index]
top_k_ids = torch.topk(logits, k=top_k).indices.tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)

print("Top 5 predictions:")
for i, token in enumerate(top_k_tokens):
    print(f"{i + 1}: {token}")
```

_Top 5 predictions:_

_1: from_
_2: of_
_3: after_
_4: by_
_5: through_

### **Fine-tuning the Model**

To fine-tune **medBERT-base** on your own medical dataset, follow these steps:

1. Prepare your dataset (e.g., medical or gastroenterology-related texts) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

The full training code is available here: https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb

A minimal fine-tuning sketch using the hyperparameters below is also included at the end of this card.

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts

## **Acknowledgements**

- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face Hub and provides a rich collection of medical and gastroenterology-related texts for training.
- This model is built with the Hugging Face `transformers` library, a state-of-the-art library for NLP models.
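
## **Fine-tuning Sketch**

The linked notebook contains the full training loop. For convenience, below is a minimal sketch of a comparable fine-tuning run using the Hugging Face `Trainer` and `DataCollatorForLanguageModeling`, with the hyperparameters listed under Training Details. The dataset column name (`text`) and the output directory are assumptions, so adjust them to match your data; this is an illustrative sketch, not the exact script used to train the released checkpoint.

```py
from datasets import load_dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the dataset used for the original fine-tuning run (or your own corpus).
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling")

tokenizer = BertTokenizerFast.from_pretrained("suayptalha/medBERT-base")
model = BertForMaskedLM.from_pretrained("suayptalha/medBERT-base")

def tokenize(batch):
    # "text" is assumed to be the column holding the raw passages;
    # change it to your dataset's actual column name.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

# Dynamic masking: 15% of tokens are masked when each batch is built.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="medBERT-finetuned",   # assumed output path
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()
```

Because the collator applies masking on the fly, the tokenized dataset does not need pre-masked examples; each epoch sees a different random mask pattern.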