ipd committed on
Commit 1928da3 · verified · 1 Parent(s): 6ae0ea4

Update README.md

Files changed (1): README.md (+52 −51)
---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
tags:
- chemistry
---

# selfies-ted

selfies-ted is a project for encoding SMILES (Simplified Molecular Input Line Entry System) into SELFIES (SELF-referencing Embedded Strings) and generating embeddings for molecular representations.

![selfies-ted](selfies-ted.png)

## Usage

### Import

```python
from transformers import AutoTokenizer, AutoModel
import selfies as sf
import torch
```

### Load the model and tokenizer

```python
tokenizer = AutoTokenizer.from_pretrained("ibm/materials.selfies-ted")
model = AutoModel.from_pretrained("ibm/materials.selfies-ted")
```

### Encode SMILES strings to SELFIES

```python
smiles = "c1ccccc1"
selfies = sf.encoder(smiles)
selfies = selfies.replace("][", "] [")
```
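The `replace` call above inserts a space between adjacent bracketed SELFIES tokens so the tokenizer can treat each symbol separately. A minimal illustration of that step, with a hardcoded SELFIES string (for ethanol, `CCO`) so the snippet runs without the `selfies` package:

```python
# Hardcoded SELFIES string for ethanol ("CCO"), used here for illustration.
selfies = "[C][C][O]"

# "][" -> "] [" puts a space between every pair of adjacent bracketed tokens.
spaced = selfies.replace("][", "] [")
print(spaced)  # [C] [C] [O]
```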

### Get embedding

```python
token = tokenizer(selfies, return_tensors='pt', max_length=128, truncation=True, padding='max_length')
input_ids = token['input_ids']
attention_mask = token['attention_mask']
outputs = model.encoder(input_ids=input_ids, attention_mask=attention_mask)
model_output = outputs.last_hidden_state

# Mean-pool the token embeddings, ignoring padding positions.
input_mask_expanded = attention_mask.unsqueeze(-1).expand(model_output.size()).float()
sum_embeddings = torch.sum(model_output * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
model_output = sum_embeddings / sum_mask
```
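The last four lines above implement masked mean pooling: token embeddings are summed wherever the attention mask is 1, then divided by the number of real (non-padding) tokens. The arithmetic can be sketched in plain Python with a toy 3-token, 2-dimensional example (no model or torch required; the numbers are made up for illustration):

```python
# Toy token embeddings (3 tokens, 2 dims); the last token is padding.
model_output = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]  # 1 = real token, 0 = padding

dim = len(model_output[0])
# Sum embeddings over tokens, zeroing out padded positions via the mask.
sum_embeddings = [
    sum(tok[d] * m for tok, m in zip(model_output, attention_mask))
    for d in range(dim)
]
# Count of real tokens, clamped away from zero like torch.clamp(..., min=1e-9).
sum_mask = max(sum(attention_mask), 1e-9)
pooled = [s / sum_mask for s in sum_embeddings]
print(pooled)  # [2.0, 3.0]
```

Note how the padding row (`[100.0, 100.0]`) has no effect on the result because its mask value is 0.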