anthonymikinka/DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML

+# DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML
+This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model optimized for Apple Silicon devices. This conversion features stateful key-value caching for efficient text generation.
+## Model Description
+[DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) is an 8-billion-parameter language model from the DeepSeek-AI team. It is built on the Llama architecture (Llama-3.1-8B) and was distilled from the much larger DeepSeek-R1 model, with the goal of retaining its reasoning performance at a far smaller parameter count.
+This CoreML conversion provides:
+- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
+- Stateful inference with KV-caching for efficient text generation
+- Optimized performance for on-device deployment
+## Technical Specifications
+- **Base Model**: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
+- **Parameters**: 8 billion
+- **Context Length**: Configurable (default: 64, expandable based on memory constraints)
+- **Quantization**: None (FP16 weights)
+- **File Format**: .mlpackage
+- **Deployment Target**: macOS 15+
+- **Architecture**: Stateful LLM with key-value caching
+- **Input Features**: Flexible input size with dynamic shape handling
+## Key Features
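+- **Stateful Inference**: The model implements a custom SliceUpdateKeyValueCache that keeps the attention keys and values of already-processed tokens between inference calls. Each call therefore only computes the newly supplied tokens, which significantly improves generation speed.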
+- **Dynamic Input Shapes**: Supports variable input lengths through RangeDim specification.
+- **Optimized Memory Usage**: Efficiently manages the key-value cache to minimize memory footprint.
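The slice-update idea behind the cache can be illustrated with a small NumPy stand-in. This is a minimal sketch, not the SliceUpdateKeyValueCache used in this conversion: the class name `SliceUpdateKVCacheSketch`, the `advance` method, and the buffer layout `(layers, batch, kv_heads, context, head_dim)` are assumptions made for illustration.

```python
import numpy as np


class SliceUpdateKVCacheSketch:
    """Hypothetical stand-in: fixed-size key/value buffers updated slice by slice."""

    def __init__(self, num_layers, num_kv_heads, context_length, head_dim):
        # Assumed layout: (layers, batch=1, kv_heads, context, head_dim).
        shape = (num_layers, 1, num_kv_heads, context_length, head_dim)
        self.k_cache = np.zeros(shape, dtype=np.float16)
        self.v_cache = np.zeros(shape, dtype=np.float16)
        self.past_len = 0  # number of token positions already cached

    def update(self, layer, k_new, v_new):
        """Write new keys/values for one layer and return everything cached so far.

        k_new and v_new have shape (1, kv_heads, query_len, head_dim).
        """
        begin = self.past_len
        end = begin + k_new.shape[2]
        if end > self.k_cache.shape[3]:
            raise ValueError("context length exceeded")
        # In-place slice assignment: only the new positions are written.
        self.k_cache[layer, :, :, begin:end, :] = k_new
        self.v_cache[layer, :, :, begin:end, :] = v_new
        return (
            self.k_cache[layer, :, :, :end, :],
            self.v_cache[layer, :, :, :end, :],
        )

    def advance(self, query_len):
        """Call once per forward pass, after every layer has been updated."""
        self.past_len += query_len


# Prefill with 3 tokens, then decode 1 token, for a single layer.
cache = SliceUpdateKVCacheSketch(num_layers=1, num_kv_heads=2, context_length=8, head_dim=4)
k, v = cache.update(0, np.ones((1, 2, 3, 4)), np.ones((1, 2, 3, 4)))
cache.advance(3)
k, v = cache.update(0, np.full((1, 2, 1, 4), 2.0), np.full((1, 2, 1, 4), 2.0))
cache.advance(1)
print(k.shape)  # (1, 2, 4, 4): three prefill positions plus one decoded position
```

Because the buffers are allocated once at the full context length, the memory footprint is fixed up front and each call touches only the slice for the new tokens.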
+## Implementation Details
+This conversion utilizes:
+- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
+- CoreML's state management capabilities for maintaining KV caches between inference calls
+- Proper buffer registration to ensure state persistence
+- Dynamic tensor shapes to accommodate various input and context lengths
+## Usage
+The model can be loaded and used with CoreML in your Swift or Python projects:
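+```python
+import coremltools as ct
+
+# Load the model (running predictions requires macOS 15 or later)
+model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")
+
+# Create the state object that holds the KV cache between calls
+kv_cache_state = model.make_state()
+
+# Prepare inputs for inference
+# ...
+
+# Run inference; passing the same state on every call reuses the cache
+output = model.predict(
+    {
+        "inputIds": input_ids,
+        "causalMask": causal_mask,
+    },
+    state=kv_cache_state,
+)
+```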
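The input preparation and a generation loop might look like the following. This is a hedged sketch under several assumptions that this README does not confirm: the mask is additive with shape `(1, 1, query_len, end_pos)`, the output dictionary has a `"logits"` entry of shape `(1, query_len, vocab_size)`, and decoding is greedy. The helpers `make_causal_mask` and `greedy_decode` are hypothetical names.

```python
import numpy as np


def make_causal_mask(query_len, end_pos):
    """Additive mask of shape (1, 1, query_len, end_pos) (assumed layout).

    The query tokens sit at absolute positions end_pos - query_len .. end_pos - 1.
    An entry is 0 where the key position may be attended to and -inf otherwise.
    """
    query_pos = np.arange(end_pos - query_len, end_pos)[:, None]
    key_pos = np.arange(end_pos)[None, :]
    mask = np.where(key_pos <= query_pos, 0.0, -np.inf).astype(np.float16)
    return mask[None, None, :, :]


def greedy_decode(predict, prompt_ids, max_new_tokens, context_length=64):
    """Generate tokens with a stateful model.

    `predict` is any callable that maps the input dict to an output dict; the
    "logits" output name is an assumption about the converted model.
    """
    tokens = list(prompt_ids)
    pending = list(prompt_ids)  # first call: the whole prompt (prefill)
    for _ in range(max_new_tokens):
        if len(tokens) > context_length:
            break  # the cache cannot hold more positions
        inputs = {
            "inputIds": np.array([pending], dtype=np.int32),
            "causalMask": make_causal_mask(len(pending), len(tokens)),
        }
        logits = predict(inputs)["logits"]
        next_token = int(np.argmax(logits[0, -1]))
        tokens.append(next_token)
        pending = [next_token]  # later calls: only the newest token
    return tokens
```

With the loaded model, `predict` could be `lambda inputs: model.predict(inputs, state=kv_cache_state)`, so that every step reuses the same cache. After the prefill call, each step feeds a single token, which is where the stateful cache saves work.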
+## Conversion Process
+The model was converted using CoreML Tools with the following steps:
+1. Loading the original model from Hugging Face
+2. Wrapping it with custom state management
+3. Tracing with PyTorch's JIT
+4. Converting to CoreML format with state specifications
+5. Saving in the .mlpackage format
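Steps 2 to 5 can be illustrated on a toy module instead of the 8B model. The sketch below is modeled on the accumulator example from the coremltools stateful-model guide and is not the script used for this conversion; `convert_toy_stateful_model`, `Accumulator`, and the output path are hypothetical names. It assumes coremltools 8 or later and PyTorch are installed.

```python
def convert_toy_stateful_model(output_path="toy_stateful.mlpackage"):
    """Convert a tiny stateful PyTorch module to a stateful CoreML package."""
    # Heavy dependencies are imported lazily, so defining this function
    # does not require them to be installed.
    import numpy as np
    import torch
    import coremltools as ct

    # Step 2 (toy stand-in for the KV-cache wrapper): state lives in a
    # registered buffer that the forward pass updates in place.
    class Accumulator(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.register_buffer(
                "accumulator", torch.tensor(np.array([0], dtype=np.float16))
            )

        def forward(self, x):
            self.accumulator += x
            return self.accumulator * self.accumulator

    # Step 3: trace the module with PyTorch's JIT.
    traced = torch.jit.trace(Accumulator().eval(), torch.tensor([1]))

    # Step 4: convert, declaring the buffer as CoreML state by name.
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=(1,))],
        outputs=[ct.TensorType(name="y")],
        states=[
            ct.StateType(
                wrapped_type=ct.TensorType(shape=(1,)),
                name="accumulator",
            )
        ],
        minimum_deployment_target=ct.target.macOS15,
    )

    # Step 5: save in the .mlpackage format.
    mlmodel.save(output_path)
    return mlmodel
```

For the actual model, the state declarations would presumably cover the key and value cache buffers rather than a single accumulator, and the input shapes would use `ct.RangeDim` to allow variable lengths, as described under Key Features.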
+## Requirements
+To use this model:
+- Apple Silicon Mac (M1/M2/M3 series)
+- macOS 15 or later
+- Minimum 16GB RAM recommended
+## Limitations
+- The model requires significant memory for inference, especially with longer contexts
+- Performance is highly dependent on the device's Neural Engine capabilities
+- The default configuration supports a context length of 64 tokens, but this can be adjusted
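A rough estimate shows how memory scales with context length. The figures below are a back-of-the-envelope sketch under stated assumptions: a nominal 8 billion parameters, FP16 storage for weights and cache, and the published Llama-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128). Runtime overhead and activations are not counted.

```python
BYTES_PER_FP16 = 2


def weight_bytes(num_params):
    """Size of the weights when every parameter is stored as FP16."""
    return num_params * BYTES_PER_FP16


def kv_cache_bytes(context_length, num_layers=32, num_kv_heads=8, head_dim=128):
    """Size of the FP16 KV cache; defaults assume the Llama-3.1-8B configuration."""
    # Factor 2: one buffer for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * context_length * BYTES_PER_FP16


print(f"weights: {weight_bytes(8_000_000_000) / 2**30:.1f} GiB")  # weights: 14.9 GiB
for ctx in (64, 2048, 8192):
    # Prints 8 MiB, 256 MiB, and 1024 MiB respectively.
    print(f"KV cache at {ctx} tokens: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```

Under these assumptions the weights dominate: the cache at the default 64 tokens is small, and it grows linearly at 128 KiB per additional token of context.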
+## License
+This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.
+## Acknowledgments
+- [DeepSeek-AI](https://github.com/deepseek-ai) for creating and releasing the original model
+- [Hugging Face](https://huggingface.co/) for hosting the model and providing the Transformers library
+- Apple for developing the CoreML framework
+## Citation
+If you use this model in your research, please cite both the original DeepSeek model and this conversion.