Deepseek Tiny V0.1

A 6-layer DeepSeek-V3 model with Multi-head Latent Attention (MLA), trained for research on shared subspaces in Transformer attention mechanisms.

Model Description

  • Model Type: Transformer Decoder (DeepSeek-V3 based)
  • Architecture: 6-layer decoder with Mixture of Experts
  • Parameters: 16.26M
  • Hidden Size: 256
  • Attention Heads: 8
  • Head Dimension: 32
  • Sequence Length: 1,024 tokens
  • Query Latent Dimension: 96
  • Key-Value Latent Dimension: 64
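
For reference, the sketch below shows how these hyperparameters could map onto a Hugging Face DeepseekV3Config. The field names are the library's standard MLA parameters, but any value not listed above (MoE expert counts, the RoPE/no-RoPE head-dimension split, vocabulary size, and so on) is left at the defaults here and may differ from the checkpoint's actual configuration.

from transformers import DeepseekV3Config

# Illustrative mapping of the card's hyperparameters onto DeepseekV3Config.
# Values the card does not list (e.g., expert counts, qk_rope/qk_nope split)
# stay at library defaults and may not match the real checkpoint.
config = DeepseekV3Config(
    num_hidden_layers=6,           # 6-layer decoder
    hidden_size=256,               # Hidden Size
    num_attention_heads=8,         # Attention Heads
    v_head_dim=32,                 # Head Dimension
    q_lora_rank=96,                # Query Latent Dimension
    kv_lora_rank=64,               # Key-Value Latent Dimension
    max_position_embeddings=1024,  # Sequence Length
)
print(config)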

Performance

  • SST-2 Accuracy: 87.96%
  • WikiText-103 Perplexity: 28.89
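
For context, a minimal perplexity check on the WikiText-103 test split could look like the sketch below. This is not the exact evaluation script behind the numbers above; choices such as window size and stride will shift the result.

import math
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DeepseekV3ForCausalLM

model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
model.eval()

# Concatenate the test split and score it in non-overlapping 1,024-token windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

window, losses = 1024, []
for i in range(0, input_ids.size(1) - window, window):
    chunk = input_ids[:, i : i + window]
    with torch.no_grad():
        losses.append(model(chunk, labels=chunk).loss)

print("Perplexity:", math.exp(torch.stack(losses).mean().item()))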

Research Context

This model is part of the shared-subspaces research project investigating the impact of shared output latent spaces in Transformer attention mechanisms.

Usage

import torch
from transformers import DeepseekV3ForCausalLM, AutoTokenizer

# Load model and tokenizer
model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")

# Generate text (enable sampling so the temperature setting takes effect)
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

  • Pre-training Dataset: WikiText-103
  • Fine-tuning Dataset: SST-2 (GLUE)
  • Optimizer: AdamW
  • Learning Rate: 5e-4 (pre-training), 5e-5 (fine-tuning)
  • Weight Decay: 0.01 (pre-training), 0.05 (fine-tuning)
  • Precision: bfloat16
  • Compilation: torch.compile with inductor backend
  • Training Steps: 12,500 (pre-training), 1,500 (fine-tuning)
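
The snippet below sketches the optimizer and compilation setup implied by these settings, using the pre-training values; the learning-rate schedule, batching, and data pipeline are omitted, so treat it as a reconstruction rather than the actual training script.

import torch
from transformers import DeepseekV3ForCausalLM

model = DeepseekV3ForCausalLM.from_pretrained("ChrisMcCormick/deepseek-tiny-v0.1")
model = model.to(torch.bfloat16)                  # Precision: bfloat16
model = torch.compile(model, backend="inductor")  # torch.compile with inductor backend

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,            # pre-training LR (5e-5 for fine-tuning)
    weight_decay=0.01,  # pre-training weight decay (0.05 for fine-tuning)
)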

Limitations

  • Small-scale model (16M parameters) intended for research purposes
  • Trained on limited data compared to production models
  • May require custom loading code for output subspace variants

Citation

@misc{mccormick2025sharedsubspaces,
  title={Shared Subspaces in Transformer Attention: Investigating Output Latent Spaces},
  author={McCormick, Chris},
  year={2025},
  howpublished={\url{https://github.com/chrisjmccormick/shared-subspaces}}
}

License

Apache 2.0
