BioMike committed on
Commit
e0c076d
·
verified ·
1 Parent(s): 406ff27

Upload README.md

  1. README.md +98 -0
README.md CHANGED
@@ -1,3 +1,101 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ metrics:
4
+ - accuracy
5
+ - bleu
6
+ pipeline_tag: text2text-generation
7
+ tags:
8
+ - chemistry
9
+ - biology
10
+ - medical
11
+ - smiles
12
+ - iupac
13
+ - text-generation-inference
14
+ widget:
15
+ - text: CCO
16
+ example_title: ethanol
17
  ---
18
+ # IUPAC2SMILES-canonical-small
19
+
20
+ IUPAC2SMILES-canonical-small was designed to accurately translate IUPAC chemical names to SMILES.
21
+
22
+ ## Model Details
23
+
24
+ ### Model Description
25
+
26
+ IUPAC2SMILES-canonical-small is based on the mT5 model, adapted to use separate tokenizers for the encoder and decoder.
27
+ - **Developed by:** Knowledgator Engineering
28
+ - **Model type:** Encoder-Decoder with attention mechanism
29
+ - **Language(s) (NLP):** SMILES, IUPAC (English)
30
+ - **License:** Apache License 2.0
31
+
32
+ ### Model Sources
33
+ - **Paper:** coming soon
34
+ - **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)
35
+
36
+ ## Quickstart
37
+ First, install the library:
38
+ ```commandline
39
+ pip install chemical-converters
40
+ ```
41
+ ### IUPAC to SMILES
42
+ #### To perform a simple translation, follow this example:
43
+ ```python
44
+ from chemicalconverters import NamesConverter
45
+
46
+ converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
47
+ print(converter.iupac_to_smiles('ethanol'))
48
+ print(converter.iupac_to_smiles(['ethanol', 'ethanol', 'ethanol']))
49
+ ```
50
+ ```text
51
+ ['CCO']
52
+ ['CCO', 'CCO', 'CCO']
53
+ ```
54
+ #### Processing in batches:
55
+ ```python
56
+ from chemicalconverters import NamesConverter
57
+
58
+ converter = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
59
+ print(converter.iupac_to_smiles(["buta-1,3-diene" for _ in range(10)], num_beams=1,
60
+ process_in_batch=True, batch_size=1000))
61
+ ```
62
+ ```text
63
+ ['<SYST>C=CC=C', '<SYST>C=CC=C'...]
64
+ ```
65
+ Our models also predict the IUPAC naming style, using the style tokens from the table:
66
+
67
+ | Style Token | Description |
68
+ |-------------|----------------------------------------------------------------------------------------------------|
69
+ | `<BASE>` | The best-known name of the substance; sometimes a mixture of traditional and systematic styles |
70
+ | `<SYST>` | The fully systematic style, without trivial names |
71
+ | `<TRAD>` | A style based on trivial names for parts of the substance |
72
+
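The batch output above shows predictions such as `<SYST>C=CC=C`, where a style token is prepended to the SMILES string. A minimal sketch of separating the two follows. The helper name `split_style_token` is hypothetical and is not part of the `chemical-converters` library; it assumes the token, when present, is always a prefix of the prediction.

```python
# Style tokens listed in the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def split_style_token(prediction):
    """Split a prediction into (style_token, remainder).

    Hypothetical helper: assumes the style token, if any, is a prefix.
    Returns (None, prediction) when no known token is found.
    """
    for token in STYLE_TOKENS:
        if prediction.startswith(token):
            return token, prediction[len(token):]
    return None, prediction


style, smiles = split_style_token("<SYST>C=CC=C")
print(style, smiles)
```

Predictions without a recognized prefix are returned unchanged, so the helper can be applied to any output list.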
73
+ ## Bias, Risks, and Limitations
74
+
75
+ This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
76
+
77
+ ### Training Procedure
78
+
79
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
80
+
81
+ The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.0003, batch_size=1024 for 2 epochs.
82
+
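For a rough sense of scale, the training figures above imply a number of optimizer steps. The arithmetic below is a sketch under assumptions not stated in the card: that `batch_size=1024` is the effective global batch size (no gradient accumulation on top of it) and that the final partial batch of each epoch is kept.

```python
import math

# Figures taken from the training description above.
num_examples = 100_000_000
batch_size = 1024
epochs = 2

# Assumption: one optimizer step per batch, last partial batch included.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the run comes to roughly 195 thousand steps; a different accumulation or batching setup would change that figure.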
83
+ ## Evaluation
84
+
85
+ | Model | Accuracy | BLEU-4 score | Size (MB) |
86
+ |-------------------------------------|---------|------------------|----------|
87
+ | IUPAC2SMILES-canonical-small |88.9% |0.966 |23 |
88
+ | IUPAC2SMILES-canonical-base |93.7% |0.974 |180 |
89
+ | STOUT V2.0\* |68.47% |0.92 |128 |
90
+ \*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
91
+
92
+ ## Citation
93
+ Coming soon.
94
+
95
+ ## Model Card Authors
96
+
97
+ [Mykhailo Shtopko](https://huggingface.co/BioMike)
98
+
99
+ ## Model Card Contact
100
+
101