alakob committed
Commit f52930c · verified · 1 Parent(s): cb7eae8

Update README.md

Files changed (1): README.md (+9 −10)
README.md CHANGED
@@ -38,7 +38,7 @@ This model is a fine-tuned version of InstaDeepAI's Nucleotide Transformer (2.5B
  
  This model can be used directly for predicting whether a given nucleotide sequence is associated with Antimicrobial Resistance (AMR) without additional fine-tuning.
  
- ### Downstream Use [optional]
+ ### Downstream Use
  
  The model can be further fine-tuned for specific AMR-related tasks or integrated into larger bioinformatics pipelines for genomic analysis.
  
@@ -71,13 +71,12 @@ sequence = "ATGC..." # Replace with your nucleotide sequence
  inputs = tokenizer(sequence, truncation=True, max_length=1000, return_tensors="pt")
  outputs = model(**inputs)
  prediction = outputs.logits.argmax(-1).item()  # 0 = non-AMR, 1 = AMR
+ ```
  
  ## Training Details
  
  ### Training Data
  
- The model was trained on the DraGNOME-2.5b-v1 dataset, consisting of 1200 overlapping sequences:
- 
  - **Negative sequences (non-AMR):**
    `DSM_20231.fasta`, `ecoli-k12.fasta`, `FDA.fasta`
  
@@ -89,7 +88,7 @@ The model was trained on the DraGNOME-2.5b-v1 dataset, consisting of 1200 overla
  
  ### Training Procedure
  
- #### Preprocessing [optional]
+ #### Preprocessing
  
  Sequences were tokenized using the Nucleotide Transformer tokenizer with a maximum length of 1000 tokens and truncation applied where necessary.
  
@@ -103,10 +102,10 @@ Sequences were tokenized using the Nucleotide Transformer tokenizer with a maxim
  - **Scheduler:** Linear with 10% warmup
  - **LoRA parameters:** `r=32`, `alpha=64`, `dropout=0.1`, `target_modules=["query", "value"]`
  
- #### Speeds, Sizes, Times [optional]
+ #### Speeds, Sizes, Times
  
  Training was performed on Google Colab with checkpointing every 500 steps, retaining the last 3 checkpoints.
- Exact throughput and times depend on Colab's hardware allocation (typically T4 GPU).
+ Exact throughput and times depend on Colab's hardware allocation (NVIDIA A100 GPU).
  
  ---
  
@@ -150,7 +149,7 @@ Evaluation was performed across AMR and non-AMR classes.
  
  Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
  
- - **Hardware Type:** Google Colab GPU (typically NVIDIA T4)
+ - **Hardware Type:** Google Colab NVIDIA A100 GPU
  - **Hours used:** [More Information Needed]
  - **Cloud Provider:** Google Colab
  - **Compute Region:** [More Information Needed]
@@ -170,7 +169,7 @@ Training was performed on Google Colab with persistent storage via Google Drive.
  
  #### Hardware
  
- - NVIDIA T4 GPU (typical Colab allocation)
+ - NVIDIA A100 GPU
  
  #### Software
  
@@ -191,7 +190,7 @@ Training was performed on Google Colab with persistent storage via Google Drive.
  
  ---
  
- ## Glossary [optional]
+ ## Glossary
  
  - **AMR:** Antimicrobial Resistance
  - **LoRA:** Low-Rank Adaptation
@@ -205,7 +204,7 @@ Training was performed on Google Colab with persistent storage via Google Drive.
  
  ---
  
- ## Model Card Authors [optional]
+ ## Model Card Authors
  
  Blaise Alako
  
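
For readers following the usage hunk above: the diff shows only the tail of the README's inference snippet (the main fix there is adding the missing closing code fence). Below is a minimal end-to-end sketch of the same pattern; the Hub repository id and the use of `AutoModelForSequenceClassification` are assumptions for illustration, not details confirmed by this diff.

```python
# Minimal sketch of the full inference flow implied by the README's usage hunk.
# The repo id below is hypothetical — replace it with the model's actual Hub id.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "alakob/DraGNOME-2.5b-v1"  # hypothetical id, not confirmed by the diff

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

sequence = "ATGC..."  # replace with your nucleotide sequence

# Same call pattern as in the diff: truncate to a maximum of 1000 tokens.
inputs = tokenizer(sequence, truncation=True, max_length=1000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = non-AMR, 1 = AMR
print("AMR" if prediction == 1 else "non-AMR")
```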
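The training-procedure hunks list the scheduler, checkpointing, and LoRA hyperparameters in prose. The sketch below shows one plausible mapping of those stated values onto `peft` and `transformers.TrainingArguments`; it is an illustration, not the author's actual training script, and the base-model id, task type, and output path are assumptions.

```python
# Sketch mapping the hyperparameters listed in the diff onto peft / TrainingArguments.
# Not the author's script; base_model_id and output_dir are hypothetical placeholders.
from transformers import AutoModelForSequenceClassification, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

base_model_id = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"  # assumed base checkpoint

model = AutoModelForSequenceClassification.from_pretrained(base_model_id, num_labels=2)

# LoRA parameters from the README: r=32, alpha=64, dropout=0.1, target_modules=["query", "value"]
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # assumption: binary sequence-classification head
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(model, lora_config)

# Scheduler and checkpointing settings from the README: linear schedule with 10% warmup,
# a checkpoint every 500 steps, keeping only the last 3 checkpoints.
training_args = TrainingArguments(
    output_dir="dragnome-amr-lora",  # hypothetical path
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    save_steps=500,
    save_total_limit=3,
)
```

A 2.5B-parameter base model generally needs a high-memory accelerator even with LoRA, which is consistent with the A100 noted in the updated hardware section of this commit.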