Molecular BERT Pretrained Using ChEMBL Database

This model was pretrained following the methodology outlined in the paper Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. While the original model was trained with custom code, it has been adapted in this project to run within the Hugging Face Transformers framework.

Model Details

The model architecture is based on BERT. Key configuration details:

BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,  # maximum SMILES sequence length used during pretraining
    type_vocab_size=1,
    pad_token_id=tokenizer_pretrained.vocab["[PAD]"],  # ID of the [PAD] token in the pretraining tokenizer
    position_embedding_type="absolute"
)
  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Learning rate scheduler: none (constant learning rate)
  • Epochs: 50
  • AMP: enabled (automatic mixed precision)
  • GPU: Single Nvidia RTX 3090
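The configuration above can be instantiated directly with the Transformers library. A minimal sketch follows; the architecture values mirror those listed above, while `max_seq_len = 256` and a pad token ID of 0 are illustrative assumptions (in practice these come from the pretraining tokenizer):

```python
from transformers import BertConfig, BertForMaskedLM

max_seq_len = 256  # assumption; use the sequence length from your own tokenizer setup

config = BertConfig(
    vocab_size=70,
    hidden_size=256,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=1024,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=max_seq_len,
    type_vocab_size=1,
    pad_token_id=0,  # assumption; should match the [PAD] ID of the actual tokenizer
    position_embedding_type="absolute",
)

# Randomly initialized model with the same architecture as the pretrained checkpoint;
# loading the published weights would use from_pretrained() with the model repo ID instead.
model = BertForMaskedLM(config)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```

Instantiating from a config (rather than `from_pretrained`) is useful for reproducing the pretraining setup from scratch or for sanity-checking the architecture offline.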

Pretraining Database

The model was pretrained on data from the ChEMBL database, release 33. The database can be downloaded from the ChEMBL website, and the preprocessed dataset is also available on the Hugging Face Datasets Hub as ChEMBL_v33_pretraining.

Performance

The pretrained model achieves an accuracy of 0.9672, evaluated on a held-out test set comprising 10% of the ChEMBL dataset.
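For a masked-language-model pretraining objective, accuracy is typically the fraction of masked positions whose token is predicted correctly. A hedged sketch of that metric (the tensors below are toy data, not model output; `-100` is the conventional ignore index for unmasked positions):

```python
import torch

def mlm_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of masked positions (labels != -100) predicted correctly."""
    preds = logits.argmax(dim=-1)       # most likely token at each position
    mask = labels != -100               # only masked positions contribute
    correct = (preds[mask] == labels[mask]).sum().item()
    total = mask.sum().item()
    return correct / max(total, 1)

# Toy example: batch of 1, sequence of 4, vocabulary of 5
logits = torch.tensor([[[0.1, 2.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 3.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 4.0, 0.0]]])
labels = torch.tensor([[1, 2, -100, 0]])  # third position is ignored
print(mlm_accuracy(logits, labels))  # 2 of 3 masked positions correct -> ~0.667
```

The same function can be applied batch-wise to a model's `logits` output and the `labels` produced by a masked-LM data collator.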
