yuyuzhang committed on
Commit 8f0e7db · verified · 1 Parent(s): 7d31c26

Update README.md

Files changed (1)
  1. README.md +99 -5
README.md CHANGED
@@ -1,6 +1,100 @@
  ---
- license: mit
- language:
- - en
- - zh
- ---
  ---
+ license: apache-2.0
+ ---
+
+ # Seed-Coder-8B-Base
+
+ ## Introduction
+ **Seed-Coder-8B-Base** is an 8-billion-parameter foundation model tailored for code understanding and generation. It is designed to give developers a powerful, general-purpose code model capable of handling a wide range of coding tasks.
+ It features:
+ - Pre-training on a **massive curated corpus**, filtered with **LLM-based techniques** to ensure **high-quality real-world code**, **text-code alignment data**, and **synthetic datasets**, resulting in cleaner and more effective learning signals.
+ - Strong **code completion**, with native support for **Fill-in-the-Middle (FIM)** tasks, enabling the model to predict missing code spans given partial contexts.
+ - Robust performance across **various programming languages** and **code reasoning scenarios**, making it well suited for downstream fine-tuning or direct use in code generation systems.
+ - **Long-context support** up to 32K tokens, enabling it to handle large codebases, multi-file projects, and extended editing tasks.
+
+ Seed-Coder-8B-Base serves as the foundation for Seed-Coder-8B-Instruct and Seed-Coder-8B-Reasoning.
+
+ ## Requirements
+ You will need to install the latest versions of `transformers` and `accelerate`:
+
+ ```bash
+ pip install -U transformers accelerate
+ ```
+
+ ## Quickstart
+
+ Here is a simple example demonstrating how to load the model and perform code generation using the Hugging Face `pipeline` API:
+
+ ```python
+ import transformers
+ import torch
+
+ model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
+
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model_id,
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device_map="auto",
+ )
+
+ output = pipeline("def say_hello_world():", max_new_tokens=100)
+ print(output[0]["generated_text"])
+ ```
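+
+ If you prefer to manage tokenization and generation yourself, the following is a minimal sketch of the same completion using the generic `AutoTokenizer` / `AutoModelForCausalLM` classes, assuming the checkpoint loads through the standard `Auto` classes:
+
+ ```python
+ # Minimal sketch without the pipeline wrapper; assumes the checkpoint
+ # is compatible with the standard transformers Auto classes.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+
+ inputs = tokenizer("def say_hello_world():", return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```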
+
+ ### Fill-in-the-Middle (FIM) Example
+
+ Seed-Coder-8B-Base natively supports **Fill-in-the-Middle (FIM)** tasks, where the model is given a prefix and a suffix and asked to predict the missing middle content.
+ This enables code infilling scenarios such as completing a function body or inserting missing logic between two pieces of code.
+
+ A typical usage flow:
+
+ ```python
+ import transformers
+ import torch
+
+ model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
+
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model_id,
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device_map="auto",
+ )
+
+ # Concatenate the suffix, the prefix, and the special FIM sentinel tokens
+ prefix = "def add_numbers(a, b):\n    "
+ suffix = "\n    return result"
+
+ # Combine prefix and suffix following the FIM format
+ fim_input = "<|fim-suffix|>" + suffix + "<|fim-prefix|>" + prefix + "<|fim-middle|>"
+
+ output = pipeline(fim_input, max_new_tokens=100)
+ print(output[0]["generated_text"])
+ ```
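+
+ Note that the `text-generation` pipeline returns the prompt together with the completion by default, so the infilled middle can be recovered by slicing off the FIM input. A minimal sketch, continuing from the example above:
+
+ ```python
+ # The pipeline echoes the prompt by default (return_full_text=True),
+ # so everything after the FIM input is the predicted middle span.
+ middle = output[0]["generated_text"][len(fim_input):]
+ print(prefix + middle + suffix)
+ ```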
+
+ ## Evaluation
+
+ Seed-Coder-8B-Base has been internally evaluated across a variety of code understanding and generation benchmarks.
+ It demonstrates strong capabilities in:
+ - Fluent and contextually appropriate code completion.
+ - Reasoning about code structure and inferring missing logic.
+ - Generalizing across different programming languages, coding styles, and codebases.
+
+ For detailed benchmark results, please refer to our [📑 paper](https://arxiv.org/pdf/xxx.xxxxx).
+
+ ## Citation
+
+ If you find Seed-Coder helpful, please consider citing our work:
+
+ ```
+ @article{zhang2025seedcoder,
+   title={Seed-Coder: Let the Code Model Curate Data for Itself},
+   author={Xxx},
+   year={2025},
+   eprint={2504.xxxxx},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/xxxx.xxxxx},
+ }
+ ```