---
datasets:
  - homebrewltd/instruction-speech-whispervq-v2
language:
  - en
license: apache-2.0
tags:
  - sound language model
pipeline_tag: audio-text-to-text
---

Model Details

We have developed and released the llama3-s family of models, which natively understands both audio and text input.

We continually pretrain homebrewltd/llama3.2-3B-s-whispervq-init, whose vocabulary has been expanded with WhisperVQ sound tokens (as sketched below), on 900M tokens from the homebrewltd/raw-speech-whispervq-v1 dataset.
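
In practice, this kind of vocabulary expansion amounts to registering the discrete sound tokens with the tokenizer and resizing the embedding matrix before continual pretraining begins. Below is a minimal sketch using the transformers API; the base repo id, the token naming scheme, and the codebook size of 512 are illustrative assumptions, not the exact recipe behind homebrewltd/llama3.2-3B-s-whispervq-init.

```python
# Illustrative sketch of expanded-vocabulary initialization.
# ASSUMPTIONS: the base repo id, token names, and codebook size (512) are
# placeholders; the actual released init checkpoint is
# homebrewltd/llama3.2-3B-s-whispervq-init.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.2-3B"  # hypothetical base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One discrete token per WhisperVQ codebook entry.
sound_tokens = [f"<|sound_{i:04d}|>" for i in range(512)]
tokenizer.add_tokens(sound_tokens)

# Grow the input/output embeddings to cover the new tokens.
model.resize_token_embeddings(len(tokenizer))
```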

Model developers: Homebrew Research.

Input: Text and sound.

Output: Text.

Model Architecture: Llama-3.

Language(s): English.
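
Because the model exposes a standard causal-LM interface over interleaved text and sound tokens, loading and sampling from it should look like any other Llama checkpoint. A minimal sketch, assuming the checkpoint loads with the stock transformers API; the repo id is taken from the evaluation table below and may differ from the exact Hub path.

```python
# Sketch: text-only generation with a checkpoint from this family.
# ASSUMPTION: the repo id below (taken from the MMLU table) matches the
# actual Hub path; sound input would be supplied as discrete WhisperVQ
# tokens interleaved with the text prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "homebrewltd/mini-ichigo-llama3.2-3B-s-base"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.bfloat16)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```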

Intended Use

Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

Training process

Training Metrics: Below is a snapshot of the training loss curve.

[Training loss curve image]

MMLU:

| Model | MMLU Score |
|---|---|
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| mini-ichigo-llama3.2-3B-s-instruct | 59.61 |
| mini-ichigo-llama3.2-3B-s-base | 58.68 |
| llama3.1-s-instruct-v0.2 | 50.27 |
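
Scores like these are commonly reproduced with EleutherAI's lm-evaluation-harness. The card does not state how these numbers were produced, so the invocation below is a hypothetical reproduction sketch, not the project's evaluation script.

```python
# Hypothetical MMLU reproduction with lm-evaluation-harness (lm-eval >= 0.4).
# ASSUMPTION: the harness choice and default few-shot setting are not stated
# in the card; treat this as a starting point, not the official setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=homebrewltd/mini-ichigo-llama3.2-3B-s-instruct",
    tasks=["mmlu"],
)
print(results["results"]["mmlu"])
```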

Hardware

GPU Configuration: A cluster of 10x NVIDIA A6000 (48 GB) GPUs.

GPU Usage:

  • Continual Training: 30 hours.

Training Arguments

We utilize the torchtune library for its up-to-date FSDP2 training implementation.

| Parameter | Continual Training |
|---|---|
| Epoch | 1 |
| Global batch size | 480 |
| Learning Rate | 2e-4 |
| Learning Scheduler | LambdaLR with warmup |
| Optimizer | AdamW fused |
| Warmup Steps | 80 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 |
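
For reference, the optimizer and scheduler rows above pair fused AdamW with a warmup-based LambdaLR. A minimal PyTorch sketch of that pairing follows; the post-warmup behavior (holding the peak learning rate) is an assumption, since the table only says "LambdaLR with warmup".

```python
# Sketch of the optimizer/scheduler pairing from the table above.
# ASSUMPTION: linear warmup then constant LR; the card only states
# "LambdaLR with warmup". Hyperparameters are taken from the table.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512, device=device)  # stand-in for the LLM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.01,
    fused=(device == "cuda"),  # the fused kernel requires CUDA params
)

warmup_steps = 80

def lr_lambda(step: int) -> float:
    # Ramp linearly to the peak LR over the first 80 steps, then hold.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```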

Citation Information

BibTeX:

@article{homebrew2024llama3s,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}

Acknowledgement