---
license: cc-by-nc-4.0
---

# CPRetriever-Code

**CPRetriever-Code** is a code embedding model trained via contrastive learning for **code-related retrieval tasks** in competitive programming. It achieves strong performance on tasks such as:

* **Text-to-Code** retrieval (problem description → relevant code)
* **Code-to-Code** retrieval (find alternate solutions to the same problem)

This model is part of the [CPRet](https://github.com/coldchair/CPRet) suite for competitive programming retrieval research.

## 🔧 Usage

You can load this model using the `sentence-transformers` library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("coldchair16/CPRetriever-Code")

# Embed a code snippet (here, a function that computes the MEX of an array).
embeddings = model.encode([
    "def mex_query(arr):\n    n = len(arr)\n    seen = set()\n    for i in range(n):\n        seen.add(arr[i])\n    i = 0\n    while True:\n        if i not in seen:\n            return i\n        i += 1"
])
```
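
To use the embeddings for retrieval, encode a query and a pool of candidates, then rank the candidates by cosine similarity. Below is a minimal sketch of Text-to-Code retrieval; the query and candidate snippets are illustrative examples, not drawn from the CPRet data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("coldchair16/CPRetriever-Code")

# Illustrative problem description and candidate solutions (hypothetical examples).
query = "Find the smallest non-negative integer that does not appear in the array (MEX)."
candidates = [
    "def mex(arr):\n    s = set(arr)\n    i = 0\n    while i in s:\n        i += 1\n    return i",
    "def total(arr):\n    return sum(arr)",
]

query_emb = model.encode(query)
cand_embs = model.encode(candidates)

# Rank candidates by cosine similarity to the query (Text-to-Code retrieval).
scores = util.cos_sim(query_emb, cand_embs)[0]
best = int(scores.argmax())
print(f"Best match (score={scores[best]:.3f}):\n{candidates[best]}")
```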

## 💡 Applications

This model is optimized for **code-level semantic retrieval** in competitive programming settings:

* **Text-to-Code**: Retrieve relevant code snippets given a natural language problem description.
* **Code-to-Code**: Retrieve alternative implementations of the same problem.

It is particularly effective for analyzing programming contest submissions, searching for solution variants, and building educational tools for code understanding.
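
In the Code-to-Code setting, the same embeddings can flag two submissions that likely solve the same problem. A minimal sketch, with made-up snippets and an illustrative threshold:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("coldchair16/CPRetriever-Code")

# Two hypothetical submissions: an iterative loop and a closed-form formula for 1 + 2 + ... + n.
solution_a = "def solve(n):\n    total = 0\n    for i in range(1, n + 1):\n        total += i\n    return total"
solution_b = "def solve(n):\n    return n * (n + 1) // 2"

embs = model.encode([solution_a, solution_b])
score = float(util.cos_sim(embs[0], embs[1]))

# The 0.8 threshold is illustrative only; calibrate it on your own data.
print("likely same problem" if score > 0.8 else "likely different problems", round(score, 3))
```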

## 📊 Training and Evaluation

CPRetriever-Code is trained via **contrastive learning** using positive and hard-negative code pairs derived from [CPRet-data](https://huggingface.co/datasets/coldchair16/CPRet-data).

For the training pipeline, see the full project:
👉 [CPRet on GitHub](https://github.com/coldchair/CPRet?tab=readme-ov-file)
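
The snippet below is a minimal, hypothetical sketch of how such a contrastive objective can be set up with `sentence-transformers` (`MultipleNegativesRankingLoss` treats the other in-batch positives as negatives); the actual CPRet pipeline, including hard-negative mining and hyperparameters, lives in the GitHub repository and may differ:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("coldchair16/CPRetriever-Code")

# Hypothetical (anchor, positive) pairs: a problem statement paired with a matching solution.
train_examples = [
    InputExample(texts=[
        "Find the MEX of an array.",
        "def mex(arr):\n    s = set(arr)\n    i = 0\n    while i in s:\n        i += 1\n    return i",
    ]),
    InputExample(texts=[
        "Sum the integers from 1 to n.",
        "def solve(n):\n    return n * (n + 1) // 2",
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other positive in the batch serves as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```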

## 📦 Model Card

* Architecture: `Salesforce/SFR-Embedding-Code-2B_R` (encoder backbone)
* Training: Contrastive objective on code/code and text/code pairs
* Format: Compatible with `sentence-transformers`