---
library_name: transformers
license: apache-2.0
base_model: axolotl-ai-co/gpt-oss-20b-dequantized
tags:
- generated_from_trainer
datasets:
- HuggingFaceH4/Multilingual-Thinking
model-index:
- name: outputs/gpt-oss-20b/
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.12.0`
```yaml
# the original mxfp4 quantized model is not supported with FSDP cpu_ram_efficient_loading
# FSDP cpu_ram_efficient_loading is used to reduce the initial CPU memory usage when loading the model
base_model: axolotl-ai-co/gpt-oss-20b-dequantized

use_kernels: false

dp_shard_size: 8 # requires 2x8xH100 nodes

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin

experimental_skip_move_to_device: true # prevent OOM by NOT putting model to GPU before sharding

datasets:
  - path: HuggingFaceH4/Multilingual-Thinking
    type: chat_template
    field_thinking: thinking
    template_thinking_key: thinking

dataset_prepared_path: last_run_prepared
val_set_size: 0
output_dir: ./outputs/gpt-oss-20b/
#save_only_model: true

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

wandb_project: gpt-oss-20b
wandb_name: fft-20b

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 1

optimizer: adamw_torch_fused # 8bit optimizers do not work with FSDP2 offload
lr_scheduler: constant_with_warmup
learning_rate: 2e-5
load_best_model_at_end: false

bf16: true
tf32: true

flash_attention: true
attn_implementation: kernels-community/vllm-flash-attn3

gradient_checkpointing: true
activation_offloading: true

logging_steps: 1
saves_per_epoch: 1

warmup_ratio: 0.03

special_tokens:
eot_tokens:
  - "<|end|>"

#deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json
fsdp_version: 2
fsdp_config:
  offload_params: true
  state_dict_type: SHARDED_STATE_DICT
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: GptOssDecoderLayer
  reshard_after_forward: true
  cpu_ram_efficient_loading: true
```

</details><br>

# outputs/gpt-oss-20b/

This model is a fine-tuned version of [axolotl-ai-co/gpt-oss-20b-dequantized](https://huggingface.co/axolotl-ai-co/gpt-oss-20b-dequantized) on the HuggingFaceH4/Multilingual-Thinking dataset.
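
A minimal inference sketch is shown below. It assumes the fine-tuned weights are published to the Hugging Face Hub and loaded with `transformers`; the repository id `your-org/gpt-oss-20b-multilingual-thinking` is a placeholder, and the prompt and generation settings are purely illustrative.

```python
# Minimal inference sketch. The repo id below is a placeholder for wherever this
# fine-tune is published; swap in the actual model id before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/gpt-oss-20b-multilingual-thinking"  # placeholder, not a real repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native dtype
    device_map="auto",    # requires `accelerate`
)

# The model was trained on chat-template-formatted data, so build prompts with
# apply_chat_template instead of raw strings.
messages = [
    {"role": "user", "content": "Réponds en français : pourquoi le ciel est-il bleu ?"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```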

## Model description

This is a full-parameter fine-tune of the dequantized gpt-oss-20b checkpoint, trained with Axolotl 0.12.0 using FSDP2 with parameter offload, activation offloading, sample packing, and a 4096-token sequence length (see the config above for the full setup).

## Intended uses & limitations

More information needed

## Training and evaluation data

The model was fine-tuned on the [HuggingFaceH4/Multilingual-Thinking](https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking) dataset, rendered through the model's chat template including the dataset's `thinking` field. No validation split was held out (`val_set_size: 0`).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 32 (see the sanity check below)
- total_eval_batch_size: 32
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant_with_warmup
- training_steps: 8
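
The batch-size figures above follow directly from the Axolotl config; the snippet below is only a quick sanity check of that arithmetic, with the values copied from the list rather than re-measured.

```python
# Quick sanity check of the effective batch size reported above
# (values copied from the hyperparameter list, not re-measured).
micro_batch_size = 4              # per-device train batch size
gradient_accumulation_steps = 1
num_devices = 8                   # GPUs seen by the trainer

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
assert total_train_batch_size == 32
print(f"total_train_batch_size = {total_train_batch_size}")
```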

### Training results



### Framework versions

- Transformers 4.55.0
- Pytorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.21.4