Update README.md
README.md
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-14B
---
## **Model Card — HyperCLOVAX-SEED-Think-14B (fork) + DeepConf**

### **Summary**

This model enhances a user-forked version of **HyperCLOVAX-SEED-Think-14B** by integrating ideas from Meta AI × UCSD's **DeepConf**. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of this method is **Lowest Group Confidence (LGC)**, a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (**Top-p% Filtering**, **Confidence-Weighted Voting**) and online optimization (**Early Abort**), ultimately achieving higher accuracy at a lower computational cost.

-----
### **1. Background and Motivation**

While **Self-Consistency**—generating multiple paths and taking a majority vote—can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and the noise introduced by low-quality generation paths.

The **DeepConf** framework addresses this by reading the model's internal **token generation probability distribution (confidence)** to estimate the quality of a path in real time. Because a simple average confidence can be misleading (the "pitfall of averages"), we instead use the sliding-window LGC metric to quantify the path's weakest link.

-----
### **2. Methods**

#### **2.1 Confidence Metric: Lowest Group Confidence (LGC)**

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, computing the average confidence within each window, and taking the minimum value as the quality score for the whole trajectory.

* **Intuition**: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:

$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to using the softmax probability of the top-1 token.
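
To make the metric concrete, here is a minimal NumPy sketch of LGC as defined above. This is our own illustration (function and argument names included), not released code; it assumes the per-token confidences were recorded during decoding as top-1 softmax probabilities.

```python
# Minimal sketch of Lowest Group Confidence (LGC); illustrative only.
import numpy as np

def lowest_group_confidence(confidences: list[float], window: int = 2048) -> float:
    """Return the minimum windowed mean confidence, i.e. the score
    of the weakest segment of the trajectory."""
    conf = np.asarray(confidences, dtype=np.float64)
    if len(conf) <= window:
        # Trajectory shorter than one window: fall back to the plain mean.
        return float(conf.mean())
    # Windowed means via a cumulative sum: O(n) instead of O(n * W).
    csum = np.concatenate(([0.0], np.cumsum(conf)))
    window_means = (csum[window:] - csum[:-window]) / window
    return float(window_means.min())
```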

#### **2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting**

* **Top-p% Filtering**: Among $N$ generated paths, only the **top p%** with the highest confidence scores are included in the final vote.
* **Confidence-Weighted Voting**: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it); see the sketch after this list.
* **Literature Example**: For a GPT-family model on AIME 2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (This is a literature result; this model's own results are detailed below.)
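
As a concrete illustration of both offline steps, the sketch below filters to the top-p% paths by LGC and then takes a confidence-weighted vote. The `Path` container and its field names are hypothetical conveniences, and weighting directly by LGC is just one admissible weighting function.

```python
# Sketch of Top-p% Filtering + Confidence-Weighted Voting; names illustrative.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Path:
    answer: str   # answer string extracted from the generation
    lgc: float    # Lowest Group Confidence of the trajectory

def vote(paths: list[Path], top_percent: float = 10.0) -> str:
    # Top-p% Filtering: keep only the most confident paths.
    ranked = sorted(paths, key=lambda p: p.lgc, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * top_percent / 100))]
    # Confidence-Weighted Voting: each vote counts with weight = LGC.
    scores: dict[str, float] = defaultdict(float)
    for p in kept:
        scores[p.answer] += p.lgc
    return max(scores, key=scores.get)
```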

#### **2.3 Online Method: Adaptive Sampling (Early Abort)**

1. **Warm-up**: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.
2. **Monitoring**: For each new path, if its running LGC drops below $\tau$ at any point, generation is immediately aborted and the path is discarded, preventing wasted computation on low-quality paths.

* **Reported Gains**: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy; a schematic of the loop follows this list.
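
A schematic of that online loop, under stated assumptions: `generate_stream` is a stand-in for a decoding API that yields `(token, confidence)` pairs for one sample, the threshold is set at the 10th percentile of warm-up scores (one reasonable choice, not a fixed part of the method), and `lowest_group_confidence` is the sketch from §2.1.

```python
# Schematic adaptive sampling with early abort; generate_stream is a stand-in.
import numpy as np

def adaptive_sample(generate_stream, m_warmup=16, n_max=64, window=2048, pct=10):
    finished = []   # (text, per-token confidences) of completed paths
    # 1) Warm-up: fully generate M paths to calibrate the dynamic threshold.
    warmup_scores = []
    for _ in range(m_warmup):
        tokens, confs = zip(*generate_stream())
        finished.append(("".join(tokens), list(confs)))
        warmup_scores.append(lowest_group_confidence(list(confs), window))
    tau = float(np.percentile(warmup_scores, pct))
    # 2) Monitoring: abort any path whose running LGC falls below tau.
    #    (A real implementation would keep a running windowed mean rather
    #    than recomputing LGC from scratch at every step.)
    for _ in range(n_max - m_warmup):
        tokens, confs = [], []
        for tok, c in generate_stream():
            tokens.append(tok)
            confs.append(c)
            if len(confs) >= window and lowest_group_confidence(confs, window) < tau:
                break                                  # early abort: discard
        else:
            finished.append(("".join(tokens), confs))  # completed: keep
    return finished, tau
```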

#### **2.4 HyperCLOVAX (Think/Answer) Specialization**

We leverage the model's ChatML structure, which separates the `thinking` (exploration) and `answer` (formal response) stages, by applying a dual-threshold scheme: $\tau_{\text{think}} < \tau_{\text{answer}}$. A minimal sketch follows this list.

* **Thinking Stage**: A looser threshold encourages broader exploration of ideas.
* **Answer Stage**: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
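
A minimal way to express the dual thresholds in code; the numeric values below are placeholders only, since the card recommends deriving both thresholds dynamically from warm-up statistics.

```python
# Illustrative stage-aware threshold check for the ChatML thinking/answer split.
STAGE_TAU = {
    "thinking": 0.90,  # looser: allow exploratory, lower-confidence reasoning
    "answer": 0.97,    # stricter: the formal answer must stay high-confidence
}

def should_abort(running_lgc: float, stage: str) -> bool:
    """Abort generation when the running LGC drops below the stage's threshold."""
    return running_lgc < STAGE_TAU[stage]
```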
-----

### **3. Hyperparameters (Recommended Defaults)**

| Name | Description | Default Value (Example) |
| --- | --- | --- |
| `W` | Sliding window length (tokens) | 2048 |
| `p` | Percentage for Top-p% Filtering | 10 |
| `M` | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early-abort threshold for the `thinking` stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early-abort threshold for the `answer` stage | Dynamic (based on warm-up, stricter) |
| `N_max` | Maximum number of paths to sample (online) | Optional limit (e.g., 64) |
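
These defaults map naturally onto a small configuration object. The dataclass below is our own convenience sketch; the field names follow the table, and the two τ fields default to `None` because they are calibrated at runtime.

```python
# Hypothetical configuration bundle mirroring the hyperparameter table.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    window: int = 2048                  # W: sliding window length (tokens)
    top_percent: float = 10.0           # p: percentage kept by Top-p% Filtering
    warmup_paths: int = 16              # M: full paths used to calibrate tau
    tau_think: Optional[float] = None   # set dynamically from warm-up stats
    tau_answer: Optional[float] = None  # set dynamically; stricter than tau_think
    n_max: int = 64                     # optional cap on sampled paths
```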
-----

### **4. Evaluation**

#### **4.1 AIME 2025 (30-question slice) — `deepconf` vs. `original`**

*Scoring: Correct = 1; Incorrect / No Format = 0. "No Format" is treated as not attempted.*

| Metric | `original` | `deepconf` | Notes |
| --- | --- | --- | --- |
| **Total Correct** | 8 | **10** | +2 questions correct |
| **Accuracy (out of 30)** | 26.7% | **33.3%** | +6.7 pp improvement |
| Attempts (Format OK) | 8 | 11 | `deepconf` attempted 3 more questions |
| Format Failures | 22 | 19 | `deepconf` shows better format stability |
| **Head-to-Head** | — | — | **2 wins / 0 losses / 28 ties for `deepconf`** |

**Breakdown by Part:**

* **Part I**: Both models solved 6/15 questions (tie).
* **Part II**: `original` solved 2/15, while `deepconf` solved 4/15. **The performance gain was concentrated in the more difficult second half.**

*Note: The high number of format failures in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.*

#### **4.2 Efficiency & Speed (10-question sample test)**

| Metric | Improvement with `deepconf` |
| --- | --- |
| **Majority-Vote Accuracy** | +20.0 pp |
| **Avg. Generated Tokens** | –29.6% |
| **Avg. Generation Time** | –41.6% |

***Caution: These results are based on a very small sample size (N ≈ 10).*** Even so, they signal a meaningful improvement across accuracy, speed, and cost.

-----

### **5. Use Cases and Recommended Pipeline**

This model is ideal for **mathematical and logical reasoning tasks**, where it offers significant sample savings and improved reliability compared to standard self-consistency.

**Recommended Pipeline** (an end-to-end sketch follows this list):

1. **Online**: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.
2. **Offline**: Apply Top-p% Filtering (with `p = 10` as a starting point) to the remaining high-quality paths.
3. **Finalization**: Use Confidence-Weighted Voting on the filtered set and apply a final format-validation step to extract the answer.
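
The end-to-end sketch below strings these three steps together. It reuses the illustrative `adaptive_sample`, `lowest_group_confidence`, `Path`, and `vote` sketches from earlier sections, and `extract_answer` is a hypothetical format validator that returns the parsed answer or `None` on failure.

```python
# End-to-end sketch of the recommended pipeline; helper names are illustrative.
def run_pipeline(generate_stream, extract_answer, window=2048, top_percent=10.0):
    # 1) Online: adaptive sampling with warm-up and early abort.
    finished, _tau = adaptive_sample(generate_stream, window=window)
    # Score the surviving paths, dropping any that fail format validation.
    paths = [
        Path(answer=ans, lgc=lowest_group_confidence(confs, window))
        for text, confs in finished
        if (ans := extract_answer(text)) is not None
    ]
    if not paths:
        return None  # nothing survived early abort + format validation
    # 2) + 3) Offline: Top-p% filtering, then confidence-weighted voting.
    return vote(paths, top_percent=top_percent)
```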
-----

### **6. Limitations & What to Watch Out For**

* **Confidence Miscalibration**: If the model's probability estimates are poorly calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or by relying on warm-up statistics.
* **Domain Shift**: The optimal hyperparameters ($\tau$, $W$, $p$) may need recalibration for new domains or problem styles.
* **Unintended Early Aborts**: A path may be discarded prematurely if rare tokens or unusual formatting cause a temporary dip in confidence. Consider a minimum generation length or a cooldown period.
* **Reliance on Format Validation**: If the final answer-extraction logic is not robust, "correct but badly formatted" answers may still be missed.

-----

### **7. Responsible Use**

* **Expose Reasoning**: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.
* **Resource Allocation**: While early abort reduces overall cost, the warm-up phase introduces overhead; manage it with batching and queueing in a production environment.
* **Bias and Fairness**: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

-----

### **Citation**

* **Original Idea**: Fu, Wang, Tian, Zhao, et al., *Deep Think with Confidence* (DeepConf), Meta AI × UCSD.
* **This Work**: A report on integrating DeepConf with a forked HyperCLOVAX-SEED-Think-14B, including the application of ChatML-aware dual thresholds.