base_model:
  - naver-hyperclovax/HyperCLOVAX-SEED-Think-14B
---

## **Model Card: HyperCLOVAX-SEED-Think-14B (fork) + DeepConf**

### **Summary**

This model enhances a user-forked version of **HyperCLOVAX-SEED-Think-14B** by integrating ideas from Meta AI × UCSD's **DeepConf**. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of the method is **Lowest Group Confidence (LGC)**, a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This enables intelligent offline filtering (**Top-p% Filtering**, **Confidence-Weighted Voting**) and online optimization (**Early Abort**), ultimately achieving higher accuracy at lower computational cost.

-----

### **1. Background and Motivation**

While **Self-Consistency** (generating multiple reasoning paths and taking a majority vote) can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and by the noise introduced by low-quality generation paths.

The **DeepConf** framework addresses this by reading the model's internal **token generation probability distribution (confidence)** to estimate the quality of a path in real time. A simple average confidence can be misleading (the "pitfall of averages": one weak stretch vanishes into an otherwise high mean), so we instead use the sliding-window LGC metric to quantify the path's weakest link.

-----

### **2. Methods**

#### **2.1 Confidence Metric: Lowest Group Confidence (LGC)**

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, computing the average confidence within each window, and taking the minimum value as the quality score for the whole trajectory.

* **Intuition**: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:

$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to using the softmax probability of the top-1 token.
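
To make the metric concrete, here is a minimal sketch of the LGC computation, assuming per-token confidences (top-1 softmax probabilities) were collected during decoding; the function name and the running-sum formulation are illustrative rather than the exact code shipped with this model:

```python
from typing import Sequence


def lowest_group_confidence(confs: Sequence[float], window: int = 2048) -> float:
    """Minimum windowed mean confidence over one trajectory (the LGC score).

    `confs` holds one confidence value per generated token; here we assume
    the top-1 softmax probability, matching the default described above.
    """
    n = len(confs)
    if n == 0:
        return 0.0
    w = min(window, n)  # short trajectories: a single window covers everything
    running = sum(confs[:w])
    lowest = running / w
    for i in range(w, n):  # O(n) sliding window via a running sum
        running += confs[i] - confs[i - w]
        lowest = min(lowest, running / w)
    return lowest
```

For example, a 10,000-token path containing one shaky 2,048-token stretch can score no higher than that stretch's mean confidence.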

#### **2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting**

* **Top-p% Filtering**: Among $N$ generated paths, only the **top p%** with the highest confidence scores are included in the final vote.
* **Confidence-Weighted Voting**: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it). A minimal sketch of both steps follows this list.
* **Literature Example**: For a GPT-family model on AIME-2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: this is a literature example; this model's own results are detailed below.)
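
The sketch below assumes each completed path has already been reduced to an `(answer, lgc)` pair; weighting each vote by the raw LGC score is one simple choice of monotonic weight:

```python
from collections import defaultdict
from typing import Optional, Sequence, Tuple


def filter_and_vote(paths: Sequence[Tuple[str, float]], p: float = 10.0) -> Optional[str]:
    """Top-p% filtering followed by confidence-weighted voting.

    `paths` holds (extracted_answer, lgc_score) pairs for N finished paths.
    """
    if not paths:
        return None
    keep = max(1, round(len(paths) * p / 100))  # size of the top-p% survivor set
    survivors = sorted(paths, key=lambda t: t[1], reverse=True)[:keep]
    ballots = defaultdict(float)
    for answer, score in survivors:
        ballots[answer] += score  # each vote weighted by its LGC score
    return max(ballots, key=ballots.get)  # answer with the highest weighted tally
```

With `p=100` this degrades gracefully to plain confidence-weighted voting over all paths.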

#### **2.3 Online Method: Adaptive Sampling (Early Abort)**

1. **Warm-up**: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.
2. **Monitoring**: For each new path, if its real-time LGC drops below $\tau$ at any point, the generation is immediately aborted and discarded, preventing wasted computation on low-quality paths.

* **Reported Gains**: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy. A sketch of the calibration and abort check follows.
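
In the sketch below, the percentile-style calibration (admitting roughly the top 90% of warm-up-quality paths) and the minimum-length guard are illustrative choices, not fixed by the method:

```python
def calibrate_threshold(warmup_lgcs: list[float], keep_ratio: float = 0.9) -> float:
    """Pick tau as the (1 - keep_ratio) quantile of warm-up LGC scores, so that
    roughly the top keep_ratio of warm-up-quality paths would survive."""
    ranked = sorted(warmup_lgcs)
    if not ranked:
        return 0.0
    cut = int(len(ranked) * (1.0 - keep_ratio))
    return ranked[min(cut, len(ranked) - 1)]


def should_abort(confs: list[float], tau: float,
                 window: int = 2048, min_tokens: int = 256) -> bool:
    """Per-step check during decoding: abort once a full trailing window is
    available and its mean confidence falls below tau. The min_tokens guard
    matters when a small window is used; it delays the check to avoid aborting
    on an early, transient dip (see Section 6 on unintended early aborts).
    A production version would keep a running sum instead of re-summing."""
    if len(confs) < max(window, min_tokens):
        return False
    return sum(confs[-window:]) / window < tau
```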

#### **2.4 HyperCLOVAX (Think/Answer) Specialization**

We leverage the model's ChatML structure, which separates the `thinking` (exploration) and `answer` (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$.

* **Thinking Stage**: A looser threshold encourages broader exploration of ideas.
* **Answer Stage**: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output. A stage-aware threshold helper is sketched below.
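
A stage-aware wrapper might look like the following; the `</think>` delimiter is an illustrative stand-in for however the fork's ChatML template marks the end of the thinking stage:

```python
THINK_END = "</think>"  # illustrative; use the fork's actual ChatML stage marker


def stage_threshold(generated_text: str, tau_think: float, tau_answer: float) -> float:
    """Looser threshold while still thinking, stricter once the answer begins
    (tau_think < tau_answer)."""
    still_thinking = THINK_END not in generated_text
    return tau_think if still_thinking else tau_answer
```

The returned value is what gets passed as `tau` into the `should_abort` check above.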

-----

### **3. Hyperparameters (Recommended Defaults)**

| Name | Description | Default Value (Example) |
| ---- | ----------- | ----------------------- |
| `W` | Sliding window length (tokens) | 2048 |
| `p` | Percentage for Top-p% Filtering | 10 |
| `M` | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early abort threshold for the `thinking` stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early abort threshold for the `answer` stage | Dynamic (based on warm-up, stricter) |
| `N_max` | Max number of paths to sample (online) | Optional limit (e.g., 64) |
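
For reference, these defaults could be bundled into a single config object; the field names, and the `keep_ratio` and `answer_margin` knobs standing in for the dynamic thresholds, are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeepConfConfig:
    window: int = 2048             # W: sliding-window length in tokens
    top_p_percent: float = 10.0    # p: keep the top p% of paths offline
    warmup_paths: int = 16         # M: fully generated calibration paths
    keep_ratio: float = 0.90       # warm-up quantile used to set tau_think
    answer_margin: float = 0.05    # tau_answer = tau_think + margin (stricter)
    max_paths: Optional[int] = 64  # N_max: optional online sampling budget
```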

-----

### **4. Evaluation**

#### **4.1 AIME 2025 (30-question slice): `deepconf` vs. `original`**

*Scoring: Correct = 1; Incorrect or No Format = 0. "No Format" responses are treated as not attempted.*

| Metric | `original` | `deepconf` | Notes |
| ------ | ---------- | ---------- | ----- |
| **Total Correct** | 8 | **10** | +2 questions correct |
| **Accuracy (out of 30)** | 26.7% | **33.3%** | +6.7 pp improvement |
| Attempts (Format OK) | 8 | 11 | `deepconf` attempted 3 more questions |
| Format Failures | 22 | 19 | `deepconf` shows better format stability |
| **Head-to-Head** | — | — | **2 wins / 0 losses / 28 ties for `deepconf`** |

**Breakdown by Part:**

* **Part I**: Both models solved 6/15 questions (tie).
* **Part II**: `original` solved 2/15, while `deepconf` solved 4/15. **The performance gain was concentrated in the more difficult second half.**

*Note: The high number of format failures in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.*

#### **4.2 Efficiency & Speed (10-question sample test)**

| Metric | Improvement with `deepconf` |
| ------ | --------------------------- |
| **Majority-Vote Accuracy** | +20.0 pp |
| **Avg. Generated Tokens** | −29.6% |
| **Avg. Generation Time** | −41.6% |

***Caution: These results are based on a very small sample size (N ≈ 10).*** They nevertheless suggest a meaningful improvement across accuracy, speed, and cost.

-----

### **5. Use Cases and Recommended Pipeline**

This model is best suited to **mathematical and logical reasoning tasks**, where it offers significant sample savings and improved reliability compared to standard self-consistency.

**Recommended Pipeline** (an end-to-end sketch follows the list):

1. **Online**: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.
2. **Offline**: Apply Top-p% Filtering (with `p=10` as a starting point) to the remaining high-quality paths.
3. **Finalization**: Use Confidence-Weighted Voting on the filtered set and apply a final format validation step to extract the answer.
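
Putting the pieces together, a minimal orchestration might look like this, reusing the helpers sketched in Section 2; `generate` is an assumed callback that runs one decode with early abort at threshold `tau` and returns an `(answer, lgc)` pair, or `None` if the path was aborted or failed format validation:

```python
from typing import Callable, Optional, Tuple

PathResult = Optional[Tuple[str, float]]  # (extracted_answer, lgc) or None


def deepconf_answer(question: str,
                    generate: Callable[[str, float], PathResult],
                    cfg: DeepConfConfig) -> Optional[str]:
    """Warm-up -> calibrate tau -> budgeted sampling with early abort ->
    top-p% filtering -> confidence-weighted vote."""
    # 1. Warm-up: full generations (tau = 0.0 disables the abort check).
    warmup = [generate(question, 0.0) for _ in range(cfg.warmup_paths)]
    paths = [r for r in warmup if r is not None]
    if not paths:
        return None
    tau = calibrate_threshold([lgc for _, lgc in paths], cfg.keep_ratio)

    # 2. Online: spend the remaining budget, discarding aborted paths.
    budget = cfg.max_paths or cfg.warmup_paths  # None => warm-up paths only
    for _ in range(max(0, budget - cfg.warmup_paths)):
        result = generate(question, tau)
        if result is not None:
            paths.append(result)

    # 3. Offline: filter to the top-p% and take the weighted vote.
    return filter_and_vote(paths, p=cfg.top_p_percent)
```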

-----

### **6. Limitations & What to Watch Out For**

* **Confidence Miscalibration**: If the model's probability estimates are not well calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or by leaning on the warm-up statistics.
* **Domain Shift**: The optimal hyperparameters ($\tau$, $W$, $p$) may need recalibration when applied to new domains or problem styles.
* **Unintended Early Aborts**: A path might be discarded prematurely if rare tokens or unusual formatting cause a temporary dip in confidence. Consider a minimum generation length or a cooldown period before the abort check activates.
* **Reliance on Format Validation**: If the final answer-extraction logic is not robust, "correct but badly formatted" answers may still be missed.

-----

### **7. Responsible Use**

* **Expose Reasoning**: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.
* **Resource Allocation**: While early abort reduces overall cost, the warm-up phase introduces overhead; manage it with batching and queueing in a production environment.
* **Bias and Fairness**: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

-----

### **Citation**

* **Original Idea**: Fu, Wang, Tian, Zhao, et al., *Deep Think with Confidence* (DeepConf), Meta AI × UCSD.
* **This Work**: A report on the integration of DeepConf with a forked HyperCLOVAX-SEED-Think-14B, including the application of ChatML-aware dual thresholds.

Configuration profiles for different needs (e.g., **accuracy-focused**, **cost-sensitive**, **balanced**) can be built from the Section 3 defaults by tuning **p, W, M, and τ**.