---
license: apache-2.0
base_model: Derify/ModChemBERT-MLM-DAPT
datasets:
- Derify/augmented_canonical_druglike_QED_Pfizer_15M
metrics:
- roc_auc
- rmse
library_name: transformers
tags:
- modernbert
- ModChemBERT
- cheminformatics
- chemical-language-model
- molecular-property-prediction
- mergekit
- merge
pipeline_tag: fill-mask
model-index:
- name: Derify/ModChemBERT-MLM
  results:
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BACE
      type: BACE
    metrics:
    - type: roc_auc
      value: 0.8346
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: BBBP
      type: BBBP
    metrics:
    - type: roc_auc
      value: 0.7573
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: CLINTOX
      type: CLINTOX
    metrics:
    - type: roc_auc
      value: 0.9938
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: HIV
      type: HIV
    metrics:
    - type: roc_auc
      value: 0.7737
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: SIDER
      type: SIDER
    metrics:
    - type: roc_auc
      value: 0.6600
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: TOX21
      type: TOX21
    metrics:
    - type: roc_auc
      value: 0.7518
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: BACE
      type: BACE
    metrics:
    - type: rmse
      value: 0.9665
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: CLEARANCE
      type: CLEARANCE
    metrics:
    - type: rmse
      value: 44.0137
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ESOL
      type: ESOL
    metrics:
    - type: rmse
      value: 0.8158
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: FREESOLV
      type: FREESOLV
    metrics:
    - type: rmse
      value: 0.4979
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: LIPO
      type: LIPO
    metrics:
    - type: rmse
      value: 0.6505
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: Antimalarial
      type: Antimalarial
    metrics:
    - type: roc_auc
      value: 0.8966
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: Cocrystal
      type: Cocrystal
    metrics:
    - type: roc_auc
      value: 0.8654
  - task:
      type: text-classification
      name: Classification (ROC AUC)
    dataset:
      name: COVID19
      type: COVID19
    metrics:
    - type: roc_auc
      value: 0.8132
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME microsom stab human
      type: ADME
    metrics:
    - type: rmse
      value: 0.4248
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME microsom stab rat
      type: ADME
    metrics:
    - type: rmse
      value: 0.4403
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME permeability
      type: ADME
    metrics:
    - type: rmse
      value: 0.5025
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME ppb human
      type: ADME
    metrics:
    - type: rmse
      value: 0.8901
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME ppb rat
      type: ADME
    metrics:
    - type: rmse
      value: 0.7268
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: ADME solubility
      type: ADME
    metrics:
    - type: rmse
      value: 0.4627
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: AstraZeneca CL
      type: AstraZeneca
    metrics:
    - type: rmse
      value: 0.4932
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: AstraZeneca LogD74
      type: AstraZeneca
    metrics:
    - type: rmse
      value: 0.7596
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: AstraZeneca PPB
      type: AstraZeneca
    metrics:
    - type: rmse
      value: 0.1150
  - task:
      type: regression
      name: Regression (RMSE)
    dataset:
      name: AstraZeneca Solubility
      type: AstraZeneca
    metrics:
    - type: rmse
      value: 0.8735
---

# ModChemBERT: ModernBERT as a Chemical Language Model
ModChemBERT is a ModernBERT-based chemical language model (CLM), trained on SMILES strings for masked language modeling (MLM) and downstream molecular property prediction (classification & regression).

## Usage
Install the `transformers` library (v4.56.1 or later). Note the quotes, which prevent the shell from interpreting `>=` as a redirect:

```bash
pip install -U "transformers>=4.56.1"
```

### Load Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    dtype="float16",
    device_map="auto",
)
```

### Fill-Mask Pipeline
```python
from transformers import pipeline

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("c1ccccc1[MASK]"))
```

## Architecture
- Backbone: ModernBERT
- Hidden size: 768
- Intermediate size: 1152
- Encoder Layers: 22
- Attention heads: 12
- Max sequence length: 256 tokens (MLM primarily trained with 128-token sequences)
- Tokenizer: BPE tokenizer using [MolFormer's vocab](https://github.com/emapco/ModChemBERT/blob/main/modchembert/tokenizers/molformer/vocab.json) (2362 tokens)
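
A quick way to see how the tokenizer splits a SMILES string (the molecule below is aspirin, chosen purely for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ModChemBERT")
smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, example input only
encoding = tokenizer(smiles)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```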

## Pooling (Classifier / Regressor Head)
Kallergis et al. [1] demonstrated that the CLM embedding method prior to the prediction head was the strongest contributor to downstream performance among evaluated hyperparameters.

Behrendt et al. [2] noted that the last few layers contain task-specific information and that pooling methods leveraging information from multiple layers can enhance model performance. Their results further demonstrated that the `max_seq_mha` pooling method was particularly effective in low-data regimes, which is often the case for molecular property prediction tasks.

Multiple pooling strategies are supported by ModChemBERT to explore their impact on downstream performance:
- `cls`: Last layer [CLS]
- `mean`: Mean over last hidden layer
- `max_cls`: Max over last k layers of [CLS]
- `cls_mha`: MHA with [CLS] as query
- `max_seq_mha`: MHA with max pooled sequence as KV and max pooled [CLS] as query
- `sum_mean`: Sum over all layers then mean tokens
- `sum_sum`: Sum over all layers then sum tokens
- `mean_mean`: Mean over all layers then mean tokens
- `mean_sum`: Mean over all layers then sum tokens
- `max_seq_mean`: Max over last k layers then mean tokens

Note: ModChemBERT’s `max_seq_mha` differs from MaxPoolBERT [2]. MaxPoolBERT uses PyTorch `nn.MultiheadAttention`, whereas ModChemBERT's `ModChemBertPoolingAttention` adapts ModernBERT’s `ModernBertAttention`. 
On ChemBERTa-3 benchmarks this variant produced stronger validation metrics and avoided the training instabilities (sporadic zero / NaN losses and gradient norms) seen with `nn.MultiheadAttention`. Training instability with ModernBERT has been reported in the past ([discussion 1](https://huggingface.co/answerdotai/ModernBERT-base/discussions/59) and [discussion 2](https://huggingface.co/answerdotai/ModernBERT-base/discussions/63)).
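
As an illustrative sketch, a pooling strategy can be selected by overriding config attributes before fine-tuning. The attribute names `classifier_pooling` and `classifier_pooling_last_k` below are assumptions inferred from the "Classifier Pooling" and "Last k Layers" hyperparameter tables later in this card; check the checkpoint's `config.json` for the exact keys.

```python
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

model_id = "Derify/ModChemBERT"

# NOTE: attribute names are assumptions based on the hyperparameter tables in
# this card; verify them against the checkpoint's config.json before use.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.num_labels = 2
config.classifier_pooling = "max_seq_mha"
config.classifier_pooling_last_k = 3

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, config=config, trust_remote_code=True
)
```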

## Training Pipeline
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/656892962693fa22e18b5331/bxNbpgMkU8m60ypyEJoWQ.png" alt="ModChemBERT Training Pipeline" width="650"/>
</div>

### Rationale for MTR Stage
Following Sultan et al. [3], a multi-task regression (MTR) stage on physicochemical properties biases the latent space toward ADME-related representations prior to the narrower TAFT specialization. Sultan et al. observed that MLM followed by DAPT (via MTR) outperforms MLM-only, MTR-only, and MTR followed by DAPT (via MTR).

### Checkpoint Averaging Motivation
Checkpoint averaging is inspired by ModernBERT [4], JaColBERTv2.5 [5], and Llama 3.1 [6], whose results show that model merging can enhance generalization or performance while mitigating overfitting to any single fine-tune or annealing checkpoint.
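
For intuition, here is a minimal sketch of uniform checkpoint averaging. The checkpoint directories are hypothetical placeholders, and the released ModChemBERT checkpoints were merged with mergekit (per the model tags) rather than with this script.

```python
import torch
from transformers import AutoModelForMaskedLM

# Hypothetical fine-tuned checkpoints to merge (placeholders, not released paths).
checkpoint_dirs = ["taft_ckpt_a", "taft_ckpt_b", "taft_ckpt_c"]
models = [
    AutoModelForMaskedLM.from_pretrained(d, trust_remote_code=True)
    for d in checkpoint_dirs
]

# Average every parameter tensor across checkpoints with equal weights.
averaged = {
    name: torch.stack([m.state_dict()[name].float() for m in models]).mean(dim=0)
    for name in models[0].state_dict()
}

merged = models[0]
merged.load_state_dict(averaged)
merged.save_pretrained("modchembert-merged")
```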

## Datasets
- Pretraining: [Derify/augmented_canonical_druglike_QED_Pfizer_15M](https://huggingface.co/datasets/Derify/augmented_canonical_druglike_QED_Pfizer_15M) (canonical_smiles column)
- Domain Adaptive Pretraining (DAPT) & Task Adaptive Fine-tuning (TAFT): ADME (6 tasks) + AstraZeneca (4 tasks) datasets that are split using DA4MT's [3] Bemis-Murcko scaffold splitter (see [domain-adaptation-molecular-transformers](https://github.com/emapco/ModChemBERT/blob/main/domain-adaptation-molecular-transformers/da4mt/splitting.py))
- Benchmarking: 
  - ChemBERTa-3 [7]  
    - classification: BACE, BBBP, TOX21, HIV, SIDER, CLINTOX
    - regression: ESOL, FREESOLV, LIPO, BACE, CLEARANCE
  - Mswahili et al. [8] proposed additional datasets for benchmarking chemical language models:
    - classification: Antimalarial [9], Cocrystal [10], COVID19 [11]
  - DAPT/TAFT stage regression datasets:
    - ADME [12]: adme_microsom_stab_h, adme_microsom_stab_r, adme_permeability, adme_ppb_h, adme_ppb_r, adme_solubility
    - AstraZeneca: astrazeneca_CL, astrazeneca_LogD74, astrazeneca_PPB, astrazeneca_Solubility

## Benchmarking
Benchmarks were conducted using the ChemBERTa-3 framework. DeepChem scaffold splits were utilized for all datasets, with the exception of the Antimalarial dataset, which employed a random split. Each task was trained for 100 epochs, with results averaged across 3 random seeds.

The complete hyperparameter configurations for these benchmarks are available here: [ChemBERTa3 configs](https://github.com/emapco/ModChemBERT/tree/main/conf/chemberta3)

### Evaluation Methodology
- Classification Metric: ROC AUC
- Regression Metric: RMSE
- Aggregation: Mean ± standard deviation of the triplicate results.
- Input Constraints: SMILES truncated / filtered to ≤200 tokens, following ChemBERTa-3's recommendation.
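
The length filter can be reproduced with a short helper like the following (a minimal sketch over a plain list of SMILES strings; this is not the ChemBERTa-3 preprocessing code itself):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Derify/ModChemBERT")

def keep_short_smiles(smiles_list, max_tokens=200):
    """Drop SMILES strings whose tokenized length exceeds max_tokens."""
    return [
        smi for smi in smiles_list
        if len(tokenizer(smi)["input_ids"]) <= max_tokens
    ]

print(keep_short_smiles(["CCO", "c1ccccc1O"]))  # toy examples
```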

### Results
<details><summary>Click to expand</summary>

#### ChemBERTa-3 Classification Datasets (ROC AUC - Higher is better)

| Model                                                                        | BACE↑             | BBBP↑             | CLINTOX↑              | HIV↑                  | SIDER↑                | TOX21↑            | AVG†   |
| ---------------------------------------------------------------------------- | ----------------- | ----------------- | --------------------- | --------------------- | --------------------- | ----------------- | ------ |
| **Tasks**                                                                    | 1                 | 1                 | 2                     | 1                     | 27                    | 12                |        |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)*    | 0.781 ± 0.019     | 0.700 ± 0.027     | 0.979 ± 0.022         | 0.740 ± 0.013         | 0.611 ± 0.002         | 0.718 ± 0.011     | 0.7548 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)*      | 0.819 ± 0.019     | 0.735 ± 0.019     | 0.839 ± 0.013         | 0.762 ± 0.005         | 0.618 ± 0.005         | 0.723 ± 0.012     | 0.7493 |
| MoLFormer-LHPC*                                                              | **0.887 ± 0.004** | **0.908 ± 0.013** | 0.993 ± 0.004         | 0.750 ± 0.003         | 0.622 ± 0.007         | **0.791 ± 0.014** | 0.8252 |
|                                                                              |                   |                   |                       |                       |                       |                   |        |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 0.8065 ± 0.0103   | 0.7222 ± 0.0150   | 0.9709 ± 0.0227       | ***0.7800 ± 0.0133*** | 0.6419 ± 0.0113       | 0.7400 ± 0.0044   | 0.7769 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | 0.8224 ± 0.0156   | 0.7402 ± 0.0095   | 0.9820 ± 0.0138       | 0.7702 ± 0.0020       | 0.6303 ± 0.0039       | 0.7360 ± 0.0036   | 0.7802 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 0.7924 ± 0.0155   | 0.7282 ± 0.0058   | 0.9725 ± 0.0213       | 0.7770 ± 0.0047       | 0.6542 ± 0.0128       | *0.7646 ± 0.0039* | 0.7815 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8213 ± 0.0051   | 0.7356 ± 0.0094   | 0.9664 ± 0.0202       | 0.7750 ± 0.0048       | 0.6415 ± 0.0094       | 0.7263 ± 0.0036   | 0.7777 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | *0.8346 ± 0.0045* | *0.7573 ± 0.0120* | ***0.9938 ± 0.0017*** | 0.7737 ± 0.0034       | ***0.6600 ± 0.0061*** | 0.7518 ± 0.0047   | 0.7952 |

#### ChemBERTa-3 Regression Datasets (RMSE - Lower is better)

| Model                                                                        | BACE↓                 | CLEARANCE↓             | ESOL↓                 | FREESOLV↓             | LIPO↓                 | AVG‡             |
| ---------------------------------------------------------------------------- | --------------------- | ---------------------- | --------------------- | --------------------- | --------------------- | ---------------- |
| **Tasks**                                                                    | 1                     | 1                      | 1                     | 1                     | 1                     |                  |
| [ChemBERTa-100M-MLM](https://huggingface.co/DeepChem/ChemBERTa-100M-MLM)*    | 1.011 ± 0.038         | 51.582 ± 3.079         | 0.920 ± 0.011         | 0.536 ± 0.016         | 0.758 ± 0.013         | 0.8063 / 10.9614 |
| [c3-MoLFormer-1.1B](https://huggingface.co/DeepChem/MoLFormer-c3-1.1B)*      | 1.094 ± 0.126         | 52.058 ± 2.767         | 0.829 ± 0.019         | 0.572 ± 0.023         | 0.728 ± 0.016         | 0.8058 / 11.0562 |
| MoLFormer-LHPC*                                                              | 1.201 ± 0.100         | 45.74 ± 2.637          | 0.848 ± 0.031         | 0.683 ± 0.040         | 0.895 ± 0.080         | 0.9068 / 9.8734  |
|                                                                              |                       |                        |                       |                       |                       |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 1.0893 ± 0.1319       | 49.0005 ± 1.2787       | 0.8456 ± 0.0406       | 0.5491 ± 0.0134       | 0.7147 ± 0.0062       | 0.7997 / 10.4398 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | 0.9931 ± 0.0258       | 45.4951 ± 0.7112       | 0.9319 ± 0.0153       | 0.6049 ± 0.0666       | 0.6874 ± 0.0040       | 0.8043 / 9.7425  |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 1.0304 ± 0.1146       | 47.8418 ± 0.4070       | ***0.7669 ± 0.0024*** | 0.5293 ± 0.0267       | 0.6708 ± 0.0074       | 0.7493 / 10.1678 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.9713 ± 0.0224       | ***42.8010 ± 3.3475*** | 0.8169 ± 0.0268       | 0.5445 ± 0.0257       | 0.6820 ± 0.0028       | 0.7537 / 9.1631  |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | ***0.9665 ± 0.0250*** | 44.0137 ± 1.1110       | 0.8158 ± 0.0115       | ***0.4979 ± 0.0158*** | ***0.6505 ± 0.0126*** | 0.7327 / 9.3889  |

#### Mswahili et al. [8] Proposed Classification Datasets (ROC AUC - Higher is better)

| Model                                                                        | Antimalarial↑         | Cocrystal↑            | COVID19↑              | AVG†   |
| ---------------------------------------------------------------------------- | --------------------- | --------------------- | --------------------- | ------ |
| **Tasks**                                                                    | 1                     | 1                     | 1                     |        |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 0.8707 ± 0.0032       | 0.7967 ± 0.0124       | 0.8106 ± 0.0170       | 0.8260 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | 0.8756 ± 0.0056       | 0.8288 ± 0.0143       | 0.8029 ± 0.0159       | 0.8358 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 0.8832 ± 0.0051       | 0.7866 ± 0.0204       | ***0.8308 ± 0.0026*** | 0.8335 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.8819 ± 0.0052       | 0.8550 ± 0.0106       | 0.8013 ± 0.0118       | 0.8461 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | ***0.8966 ± 0.0045*** | ***0.8654 ± 0.0080*** | 0.8132 ± 0.0195       | 0.8584 |

#### ADME/AstraZeneca Regression Datasets (RMSE - Lower is better)

Hyperparameter optimization for the TAFT stage appears to induce overfitting, as the `MLM + DAPT + TAFT OPT` model shows slightly degraded performance on the ADME/AstraZeneca datasets compared to the `MLM + DAPT + TAFT` model.
The `MLM + DAPT + TAFT` model, a merge of unoptimized TAFT checkpoints trained with `max_seq_mean` pooling, achieved the best overall performance across the ADME/AstraZeneca datasets.

| Model                                                                        | ADME microsom_stab_h↓ | ADME microsom_stab_r↓ | ADME permeability↓ | ADME ppb_h↓ | ADME ppb_r↓ | ADME solubility↓ | AstraZeneca CL↓ | AstraZeneca LogD74↓ | AstraZeneca PPB↓ | AstraZeneca Solubility↓ | AVG†   |
| ---------------------------------------------------------------------------- | --------------------- | --------------------- | ------------------ | ----------- | ----------- | ---------------- | --------------- | ------------------- | ---------------- | ----------------------- | ------ |
| **Tasks**                                                                    | 1                     | 1                     | 1                  | 1           | 1           | 1                | 1               | 1                   | 1                | 1                       |        |
| [MLM](https://huggingface.co/Derify/ModChemBERT-MLM)                         | 0.4489 ± 0.0114     | 0.4685 ± 0.0225     | 0.5423 ± 0.0076     | 0.8041 ± 0.0378     | 0.7849 ± 0.0394     | 0.5191 ± 0.0147     | **0.4812 ± 0.0073** | 0.8204 ± 0.0070     | 0.1365 ± 0.0066     | 0.9614 ± 0.0189     | 0.5967 |
| [MLM + DAPT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT)             | **0.4199 ± 0.0064** | 0.4568 ± 0.0091     | 0.5042 ± 0.0135     | 0.8376 ± 0.0629     | 0.8446 ± 0.0756     | 0.4800 ± 0.0118     | 0.5351 ± 0.0036     | 0.8191 ± 0.0066     | 0.1237 ± 0.0022     | 0.9280 ± 0.0088     | 0.5949 |
| [MLM + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-TAFT)             | 0.4375 ± 0.0027     | 0.4542 ± 0.0024     | 0.5202 ± 0.0141     | **0.7618 ± 0.0138** | 0.7027 ± 0.0023     | 0.5023 ± 0.0107     | 0.5104 ± 0.0110     | 0.7599 ± 0.0050     | 0.1233 ± 0.0088     | 0.8730 ± 0.0112     | 0.5645 |
| [MLM + DAPT + TAFT](https://huggingface.co/Derify/ModChemBERT-MLM-DAPT-TAFT) | 0.4206 ± 0.0071     | **0.4400 ± 0.0039** | **0.4899 ± 0.0068** | 0.8927 ± 0.0163     | **0.6942 ± 0.0397** | 0.4641 ± 0.0082     | 0.5022 ± 0.0136     | **0.7467 ± 0.0041** | 0.1195 ± 0.0026     | **0.8564 ± 0.0265** | 0.5626 |
| [MLM + DAPT + TAFT OPT](https://huggingface.co/Derify/ModChemBERT)           | 0.4248 ± 0.0041     | 0.4403 ± 0.0046     | 0.5025 ± 0.0029     | 0.8901 ± 0.0123     | 0.7268 ± 0.0090     | **0.4627 ± 0.0083** | 0.4932 ± 0.0079     | 0.7596 ± 0.0044     | **0.1150 ± 0.0002** | 0.8735 ± 0.0053     | 0.5689 |


**Bold** indicates the best result in the column; *italic* indicates the best result among ModChemBERT checkpoints.<br/>
\* Published results from the ChemBERTa-3 [7] paper for optimized chemical language models using DeepChem scaffold splits.<br/>
† AVG column shows the mean score across the tasks in the table.<br/>
‡ AVG column shows the mean scores across regression tasks without and with the clearance score.

</details>

## Optimized ModChemBERT Hyperparameters

<details><summary>Click to expand</summary>

### TAFT Datasets
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset                | Learning Rate | Batch Size | Warmup Ratio | Classifier Pooling | Last k Layers |
| ---------------------- | ------------- | ---------- | ------------ | ------------------ | ------------- |
| adme_microsom_stab_h   | 3e-5          | 8          | 0.0          | max_seq_mean       | 5             |
| adme_microsom_stab_r   | 3e-5          | 16         | 0.2          | max_cls            | 3             |
| adme_permeability      | 3e-5          | 8          | 0.0          | max_cls            | 3             |
| adme_ppb_h             | 1e-5          | 32         | 0.1          | max_seq_mean       | 5             |
| adme_ppb_r             | 1e-5          | 32         | 0.0          | sum_mean           | N/A           |
| adme_solubility        | 3e-5          | 32         | 0.0          | sum_mean           | N/A           |
| astrazeneca_CL         | 3e-5          | 8          | 0.1          | max_seq_mha        | 3             |
| astrazeneca_LogD74     | 1e-5          | 8          | 0.0          | max_seq_mean       | 5             |
| astrazeneca_PPB        | 1e-5          | 32         | 0.0          | max_cls            | 3             |
| astrazeneca_Solubility | 1e-5          | 32         | 0.0          | max_seq_mean       | 5             |

### Benchmarking Datasets
Optimal parameters (per dataset) for the `MLM + DAPT + TAFT OPT` merged model:

| Dataset             | Batch Size | Classifier Pooling | Last k Layers | Pooling Attention Dropout | Classifier Dropout | Embedding Dropout |
| ------------------- | ---------- | ------------------ | ------------- | ------------------------- | ------------------ | ----------------- |
| bace_classification | 32         | max_seq_mha        | 3             | 0.0                       | 0.0                | 0.0               |
| bbbp                | 64         | max_cls            | 3             | 0.1                       | 0.0                | 0.0               |
| clintox             | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| hiv                 | 32         | max_seq_mha        | 3             | 0.0                       | 0.0                | 0.0               |
| sider               | 32         | mean               | N/A           | 0.1                       | 0.0                | 0.1               |
| tox21               | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| bace_regression     | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| clearance           | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| esol                | 64         | sum_mean           | N/A           | 0.1                       | 0.0                | 0.1               |
| freesolv            | 32         | max_seq_mha        | 5             | 0.1                       | 0.0                | 0.0               |
| lipo                | 32         | max_seq_mha        | 3             | 0.1                       | 0.1                | 0.1               |
| antimalarial        | 16         | max_seq_mha        | 3             | 0.1                       | 0.1                | 0.1               |
| cocrystal           | 16         | max_cls            | 3             | 0.1                       | 0.0                | 0.1               |
| covid19             | 16         | sum_mean           | N/A           | 0.1                       | 0.0                | 0.1               |

</details>

## Intended Use
* Primary: Research and development for molecular property prediction, experimentation with pooling strategies, and as a foundational model for downstream applications.
* Appropriate for: Binary / multi-class classification (e.g., toxicity, activity) and single-task or multi-task regression (e.g., solubility, clearance) after fine-tuning.
* Not intended for generating novel molecules.
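
A minimal fine-tuning sketch for a binary classification task follows; the toy dataset, column names, and hyperparameters are illustrative assumptions rather than a specific benchmark configuration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "Derify/ModChemBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

# Toy data; replace with a real SMILES/label dataset.
data = Dataset.from_dict(
    {"smiles": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"], "label": [0, 1, 0]}
)

def tokenize(batch):
    return tokenizer(batch["smiles"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modchembert-finetune",
        per_device_train_batch_size=32,
        learning_rate=3e-5,
        num_train_epochs=10,
        report_to="none",
    ),
    train_dataset=data,
    processing_class=tokenizer,
)
trainer.train()
```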

## Limitations
- Out-of-domain performance may degrade for very long (>128 token) SMILES, inorganic / organometallic compounds, polymers, and charged or enumerated tautomers, which are not well represented in the training data.
- No guarantee of synthesizability, safety, or biological efficacy.

## Ethical Considerations & Responsible Use
- Potential biases arise from training corpora skewed to drug-like space.
- Do not deploy in clinical or regulatory settings without rigorous, domain-specific validation.

## Hardware
Training and experiments were performed on 2 NVIDIA RTX 3090 GPUs.

## Citation
If you use ModChemBERT in your research, please cite the checkpoint and the following:
```
@software{cortes-2025-modchembert,
  author = {Emmanuel Cortes},
  title = {ModChemBERT: ModernBERT as a Chemical Language Model},
  year = {2025},
  publisher = {GitHub},
  howpublished = {GitHub repository},
  url = {https://github.com/emapco/ModChemBERT}
}
```

## References
1. Kallergis, G., Asgari, E., Empting, M. et al. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 8, 114 (2025). https://doi.org/10.1038/s42004-025-01484-4
2. Behrendt, Maike, Stefan Sylvius Wagner, and Stefan Harmeling. "MaxPoolBERT: Enhancing BERT Classification via Layer-and Token-Wise Aggregation." arXiv preprint arXiv:2505.15696 (2025).
3. Sultan, Afnan, et al. "Transformers for molecular property prediction: Domain adaptation efficiently improves performance." arXiv preprint arXiv:2503.03360 (2025).
4. Warner, Benjamin, et al. "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference." arXiv preprint arXiv:2412.13663 (2024).
5. Clavié, Benjamin. "JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources." arXiv preprint arXiv:2407.20750 (2024).
6. Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
7. Singh R, Barsainyan AA, Irfan R, Amorin CJ, He S, Davis T, et al. ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models. ChemRxiv (preprint). 2025. doi:10.26434/chemrxiv-2025-4glrl-v2
8. Mswahili, M.E., Hwang, J., Rajapakse, J.C. et al. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform 17, 17 (2025). https://doi.org/10.1186/s13321-025-00959-9
9. Mswahili, M.E.; Ndomba, G.E.; Jo, K.; Jeong, Y.-S. Graph Neural Network and BERT Model for Antimalarial Drug Predictions Using Plasmodium Potential Targets. Applied Sciences, 2024, 14(4), 1472. https://doi.org/10.3390/app14041472
10. Mswahili, M.E.; Lee, M.-J.; Martin, G.L.; Kim, J.; Kim, P.; Choi, G.J.; Jeong, Y.-S. Cocrystal Prediction Using Machine Learning Models and Descriptors. Applied Sciences, 2021, 11, 1323. https://doi.org/10.3390/app11031323
11. Harigua-Souiai, E.; Heinhane, M.M.; Abdelkrim, Y.Z.; Souiai, O.; Abdeljaoued-Tej, I.; Guizani, I. Deep Learning Algorithms Achieved Satisfactory Predictions When Trained on a Novel Collection of Anticoronavirus Molecules. Frontiers in Genetics, 2021, 12:744170. https://doi.org/10.3389/fgene.2021.744170
12. Cheng Fang, Ye Wang, Richard Grater, Sudarshan Kapadnis, Cheryl Black, Patrick Trapa, and Simone Sciabola. "Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective" Journal of Chemical Information and Modeling 2023 63 (11), 3263-3274 https://doi.org/10.1021/acs.jcim.3c00160