 ---
+license: apache-2.0
+---
+# Seed-Coder-8B-Base
+## Introduction
+**Seed-Coder-8B-Base** is an 8-billion-parameter foundation model tailored for code understanding and generation. It is designed to provide developers with a powerful, general-purpose code model capable of handling a wide range of coding tasks.
+It features:
+- Pre-training on a **large, curated corpus** filtered with **LLM-based techniques**, covering **high-quality real-world code**, **text-code alignment data**, and **synthetic datasets**, which yields cleaner and more effective learning signals.
+- Strong **code completion**, with support for **Fill-in-the-Middle (FIM)** tasks, where the model predicts a missing code span from the code before and after it.
+- Robust performance across **various programming languages** and **code reasoning scenarios**, making it well suited to downstream fine-tuning or direct use in code generation systems.
+- **Long-context support** up to 32K tokens, enabling it to handle large codebases, multi-file projects, and extended editing tasks.
+Seed-Coder-8B-Base serves as the foundation for Seed-Coder-8B-Instruct and Seed-Coder-8B-Reasoning.
+## Requirements
+You will need to install the latest versions of `transformers` and `accelerate`:
+```bash
+pip install -U transformers accelerate
+```
+## Quickstart
+Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face `pipeline` API:
+```python
+import transformers
+import torch
+model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=model_id,
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device_map="auto",
+)
+output = pipeline("def say_hello_world():", max_new_tokens=100)
+print(output[0]["generated_text"])
+```
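By default the text-generation pipeline returns the prompt followed by the completion. The sketch below shows one way to get only the continuation, using the pipeline's `return_full_text` and `do_sample` call arguments. The `complete` wrapper and the `fake_generator` stand-in are hypothetical names introduced for illustration; the stand-in only exists so the snippet runs without loading the 8B model.

```python
def complete(generator, prompt, max_new_tokens=100):
    """Return only the continuation, without the echoed prompt.

    `generator` is assumed to behave like the `pipeline` object created above.
    """
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,         # greedy decoding instead of sampling
        return_full_text=False,  # omit the prompt from "generated_text"
    )
    return outputs[0]["generated_text"]


def fake_generator(prompt, **kwargs):
    """Stand-in (not a real model) that mimics the pipeline's return shape."""
    return [{"generated_text": '\n    print("Hello, World!")'}]


print(complete(fake_generator, "def say_hello_world():"))
```

With the real model, pass the `pipeline` object from the example above in place of `fake_generator`.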
+### Fill-in-the-Middle (FIM) Example
+Seed-Coder-8B-Base natively supports **Fill-in-the-Middle (FIM)** tasks, where the model is given a prefix and a suffix and asked to predict the missing middle content.
+This allows for code infilling scenarios such as completing a function body or inserting missing logic between two pieces of code.
+A typical usage flow:
+```python
+import transformers
+import torch
+model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=model_id,
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device_map="auto",
+)
+# The code before and after the gap that the model should fill
+prefix = "def add_numbers(a, b):\n    "
+suffix = "\n    return result"
+# Build the FIM prompt: suffix marker and suffix first, then prefix marker and prefix, then the middle marker
+fim_input = "<|fim-suffix|>" + suffix + "<|fim-prefix|>" + prefix + "<|fim-middle|>"
+output = pipeline(fim_input, max_new_tokens=100)
+print(output[0]["generated_text"])
+```
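The prompt layout used above can be wrapped in a few small helpers. This is a minimal sketch, not part of `transformers` or the Seed-Coder release: `build_fim_prompt`, `extract_middle`, and `splice` are hypothetical names, and the marker strings are copied from the example above, so they should be checked against the model's tokenizer. A hard-coded string stands in for the pipeline output so the snippet runs without the model.

```python
# Marker strings copied from the FIM example above (assumed, not verified
# against the tokenizer vocabulary).
FIM_SUFFIX = "<|fim-suffix|>"
FIM_PREFIX = "<|fim-prefix|>"
FIM_MIDDLE = "<|fim-middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the prompt as suffix, then prefix, then the middle marker."""
    return FIM_SUFFIX + suffix + FIM_PREFIX + prefix + FIM_MIDDLE


def extract_middle(prompt: str, generated_text: str) -> str:
    """Drop the echoed prompt so only the predicted middle span remains."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


def splice(prefix: str, middle: str, suffix: str) -> str:
    """Reassemble the completed source in reading order."""
    return prefix + middle + suffix


prefix = "def add_numbers(a, b):\n    "
suffix = "\n    return result"
prompt = build_fim_prompt(prefix, suffix)

# Stand-in for output[0]["generated_text"]; a real run would call the pipeline.
fake_generated = prompt + "result = a + b"
middle = extract_middle(prompt, fake_generated)
print(splice(prefix, middle, suffix))
```

In practice the model may keep generating past the intended span, so a real integration would likely also truncate the middle at a stop condition of its choosing.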
+## Evaluation
+Seed-Coder-8B-Base has been internally evaluated across a variety of code understanding and generation benchmarks.
+It demonstrates strong capabilities in:
+- Fluent and contextually appropriate code completion.
+- Reasoning about code structure and inferring missing logic.
+- Generalizing across different programming languages, coding styles, and codebases.
+For detailed benchmark results, please refer to our [📑 paper](https://arxiv.org/pdf/xxx.xxxxx).
+## Citation
+If you find Seed-Coder helpful, please consider citing our work:
+```bibtex
+@article{zhang2025seedcoder,
+    title={Seed-Coder: Let the Code Model Curate Data for Itself},
+    author={Xxx},
+    year={2025},
+    eprint={2504.xxxxx},
+    archivePrefix={arXiv},
+    primaryClass={cs.CL},
+    url={https://arxiv.org/abs/xxxx.xxxxx},
+}
+```