# DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML

This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model, optimized for Apple Silicon devices. The conversion features stateful key-value caching for efficient text generation.

## Model Description

[DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) is an 8-billion-parameter language model from the DeepSeek-AI team. It is built on a Llama-3.1-8B base and was distilled from the much larger DeepSeek-R1, retaining strong reasoning performance at a fraction of the parameter count.

This CoreML conversion provides:

- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV caching for efficient text generation
- Optimized performance for on-device deployment

## Technical Specifications

- **Base Model**: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- **Parameters**: 8 billion
- **Context Length**: Configurable (default: 64, expandable based on memory constraints)
- **Precision**: FP16
- **File Format**: .mlpackage
- **Deployment Target**: macOS 15+
- **Architecture**: Stateful LLM with key-value caching
- **Input Features**: Flexible input size with dynamic shape handling

## Key Features

- **Stateful Inference**: The model implements a custom SliceUpdateKeyValueCache to maintain conversation state between inference calls, significantly improving generation speed (a sketch of this pattern appears in the appendix below).
- **Dynamic Input Shapes**: Supports variable input lengths through a RangeDim specification.
- **Optimized Memory Usage**: Efficiently manages the key-value cache to minimize the memory footprint.

## Implementation Details

This conversion uses:

- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Proper buffer registration to ensure state persistence
- Dynamic tensor shapes to accommodate various input and context lengths

## Usage

The model can be loaded and used with CoreML in your Swift or Python projects. In Python (coremltools 8 or later is required for stateful prediction):

```python
import coremltools as ct

# Load the model
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create a fresh KV-cache state for this generation session
kv_cache = model.make_state()

# Prepare inputs for inference
# ...

# Run inference; passing the state object persists the KV cache
# between successive predict() calls
output = model.predict(
    {"inputIds": input_ids, "causalMask": causal_mask},
    kv_cache,
)
```

A fuller generation-loop sketch appears in the appendix below.

## Conversion Process

The model was converted using CoreML Tools with the following steps (a hedged code sketch appears in the appendix below):

1. Loading the original model from Hugging Face
2. Wrapping it with custom state management
3. Tracing with PyTorch's JIT
4. Converting to CoreML format with state specifications
5. Saving in the .mlpackage format

## Requirements

To use this model:

- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- Minimum 16GB RAM recommended

## Limitations

- The model requires significant memory for inference: the FP16 weights alone occupy roughly 16 GB (8 billion parameters × 2 bytes), and longer contexts add to that
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens, but this can be adjusted

## License

This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.

## Acknowledgments

- [DeepSeek-AI](https://github.com/deepseek-ai) for creating and releasing the original model
- [Hugging Face](https://huggingface.co/) for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework

## Citation

If you use this model in your research, please cite both the original DeepSeek model and this conversion.
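
## Appendix: SliceUpdateKeyValueCache Sketch

The README names a custom SliceUpdateKeyValueCache but does not include its source, so the following is a minimal, hypothetical sketch of the pattern: a Transformers `Cache` subclass backed by pre-allocated tensors that are updated in place with slice writes, which CoreML can then bind to persistent state. Only the class name comes from this README; the shapes, attribute names, and method bodies are assumptions.

```python
import torch
from transformers.cache_utils import Cache


class SliceUpdateKeyValueCache(Cache):
    """KV cache backed by pre-allocated tensors updated via in-place
    slice writes, so CoreML can map them onto persistent state."""

    def __init__(self, *, shape, dtype=torch.float16):
        super().__init__()
        self.past_seen_tokens = 0
        # Assumed layout:
        # (num_layers, batch, num_kv_heads, context_length, head_dim)
        self.key_cache = torch.zeros(shape, dtype=dtype)
        self.value_cache = torch.zeros(shape, dtype=dtype)

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        # cache_position holds the token indices being written this call
        position = cache_kwargs["cache_position"]
        self.key_cache[layer_idx, :, :, position, :] = key_states
        self.value_cache[layer_idx, :, :, position, :] = value_states
        # Return everything cached so far for this layer
        end = self.past_seen_tokens + key_states.shape[-2]
        return (
            self.key_cache[layer_idx, :, :, :end, :],
            self.value_cache[layer_idx, :, :, :end, :],
        )

    def get_seq_length(self, layer_idx=0):
        return self.past_seen_tokens
```

The "proper buffer registration" mentioned under Implementation Details would then amount to something like `self.register_buffer("keyCache", kv_cache.key_cache, persistent=False)` inside the KvCacheStateLlamaForCausalLM wrapper, so that the converter can discover the buffers and expose them as named states.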
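
## Appendix: Conversion Sketch

The Conversion Process steps above, expressed as code. This is a hedged reconstruction, not the exact script used for this conversion: the tensor names (`inputIds`, `causalMask`, `logits`, `keyCache`, `valueCache`) and the cache shape are assumptions, and KvCacheStateLlamaForCausalLM is the wrapper named under Implementation Details, whose definition is not part of this README. The `ct.RangeDim`/`ct.StateType` usage follows the public coremltools 8 API.

```python
import numpy as np
import torch
import coremltools as ct

context_length = 64  # default from the specs above

# Assumed Llama-3.1-8B geometry: 32 layers, 8 KV heads, head_dim 128
kv_cache_shape = (32, 1, 8, context_length, 128)

# Steps 1-2: load the Hugging Face model and wrap it with state management
# (wrapper defined elsewhere; see Implementation Details)
wrapped_model = KvCacheStateLlamaForCausalLM(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", context_length=context_length
).eval()

# Step 3: trace with PyTorch's JIT using small example tensors
example_ids = torch.zeros((1, 2), dtype=torch.int32)
example_mask = torch.zeros((1, 1, 2, 5), dtype=torch.float32)
traced = torch.jit.trace(wrapped_model, [example_ids, example_mask])

# Step 4: convert with dynamic input shapes and state specifications
query_len = ct.RangeDim(lower_bound=1, upper_bound=context_length, default=1)
end_step = ct.RangeDim(lower_bound=1, upper_bound=context_length, default=1)

mlmodel = ct.convert(
    traced,
    inputs=[
        ct.TensorType(shape=(1, query_len), dtype=np.int32, name="inputIds"),
        ct.TensorType(
            shape=(1, 1, query_len, end_step), dtype=np.float16, name="causalMask"
        ),
    ],
    outputs=[ct.TensorType(dtype=np.float16, name="logits")],
    states=[
        ct.StateType(
            wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16),
            name="keyCache",
        ),
        ct.StateType(
            wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16),
            name="valueCache",
        ),
    ],
    minimum_deployment_target=ct.target.macOS15,
)

# Step 5: save in the .mlpackage format
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")
```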
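
## Appendix: Generation Loop Sketch

A minimal greedy-decoding loop in Python, showing how the stateful KV cache is carried across `predict()` calls. It assumes the tensor names used above (`inputIds`, `causalMask`, `logits`), an additive causal mask (0 where attention is allowed, -inf where masked), and the default 64-token context; all of these are assumptions about how the model was traced, not guarantees.

```python
import numpy as np
import coremltools as ct
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

CONTEXT_LENGTH = 64  # default context length from the specs above


def causal_mask(query_len: int, end_step: int) -> np.ndarray:
    """Additive mask: 0 where attention is allowed, -inf where masked."""
    mask = np.full((1, 1, query_len, end_step), -np.inf, dtype=np.float16)
    past = end_step - query_len
    for i in range(query_len):
        mask[0, 0, i, : past + i + 1] = 0.0
    return mask


prompt = "Why is the sky blue?"
generated = tokenizer(prompt, return_tensors="np")["input_ids"].astype(np.int32)

kv_cache = model.make_state()  # fresh KV cache for this session
inputs = generated             # prefill: feed the whole prompt once

while generated.shape[1] < CONTEXT_LENGTH:
    end_step = generated.shape[1]
    outputs = model.predict(
        {"inputIds": inputs, "causalMask": causal_mask(inputs.shape[1], end_step)},
        kv_cache,
    )
    # Greedy pick of the next token from the last position's logits
    next_id = int(np.argmax(outputs["logits"][0, -1]))
    generated = np.concatenate(
        [generated, np.array([[next_id]], dtype=np.int32)], axis=1
    )
    if next_id == tokenizer.eos_token_id:
        break
    inputs = np.array([[next_id]], dtype=np.int32)  # then decode one token at a time

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

After the prefill call, each subsequent call feeds only the newly generated token; the cached keys and values for earlier positions live in `kv_cache`, which is what makes single-token decoding cheap.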