Model Card for sigridjineth/HyperCLOVAX-SEED-Think-DeepConf-14B
Summary
This model enhances a user-forked version of HyperCLOVAX-SEED-Think-14B by integrating ideas from Meta AI and UCSD's DeepConf. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.
The core of this method is the Lowest Group Confidence (LGC), a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (Top-p% Filtering, Confidence-Weighted Voting) and online optimization (Early Abort), ultimately achieving higher accuracy at a lower computational cost.
1. Background and Motivation
While Self-Consistency (generating multiple paths and taking a majority vote) can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and the noise introduced by low-quality generation paths.
The DeepConf framework addresses this by reading the model's internal token-generation probability distribution (confidence) to estimate the quality of a path in real time. Simple average confidence can be misleading due to the "pitfall of averages": a path that stays confident for thousands of tokens but collapses over one short stretch can still report a high overall average. We instead use the sliding-window LGC metric to quantify the path's weakest link.
2. Methods
2.1 Confidence Metric: Lowest Group Confidence (LGC)
LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, calculating the average confidence within each window, and taking the minimum value as the quality score for the entire trajectory.
- Intuition: The quality of a path is limited by its most uncertain or speculative segment.
The formula is:

$$\text{LGC}(y) = \min_{1 \le t \le T - W + 1} \; \frac{1}{W} \sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $T$ is the length of the generated path and $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to using the softmax probability of the top-1 token.
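Below is a minimal Python sketch of this metric, assuming the per-token top-1 softmax probabilities are already available as a list (e.g., extracted from the logprobs returned by the serving engine). The function name and signature are illustrative, not part of the released code.

```python
import numpy as np

def lowest_group_confidence(token_confidences, window: int = 2048) -> float:
    """Lowest Group Confidence (LGC): slide a window of `window` tokens over the
    per-token confidences, average within each window, and return the minimum.

    `token_confidences` is assumed to be a sequence of top-1 softmax
    probabilities, one per generated token.
    """
    conf = np.asarray(token_confidences, dtype=np.float64)
    if len(conf) == 0:
        return 0.0
    # For paths shorter than one window, fall back to the plain mean.
    if len(conf) <= window:
        return float(conf.mean())
    # Mean over every window of length `window`, computed via a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(conf)))
    window_means = (csum[window:] - csum[:-window]) / window
    return float(window_means.min())
```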
2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting
- Top-p% Filtering: Among $N$ generated paths, only the top p% with the highest confidence scores are included in the final vote.
- Confidence-Weighted Voting: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it).
- Literature Example: For a GPT-family model on AIME-2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: This is a literature example; this model's specific results are detailed below.)
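A compact sketch of both offline steps follows, assuming each path has already been reduced to an (extracted answer, LGC score) pair. Using the raw LGC score as the vote weight is one possible choice, not the only prescribed weighting.

```python
from collections import defaultdict

def filter_and_vote(paths, top_percent: float = 10.0):
    """Offline aggregation: keep the top-p% most confident paths, then run
    confidence-weighted majority voting among the survivors.

    `paths` is assumed to be a list of (answer, lgc_score) tuples.
    """
    if not paths:
        return None
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    keep = max(1, int(len(ranked) * top_percent / 100))
    kept = ranked[:keep]

    # Each surviving path votes with a weight equal to its LGC score.
    votes = defaultdict(float)
    for answer, lgc in kept:
        votes[answer] += lgc
    return max(votes.items(), key=lambda kv: kv[1])[0]
```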
2.3 Online Method: Adaptive Sampling (Early Abort)
- Warm-up: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.
- Monitoring: For each new path, if its real-time LGC drops below $\tau$ at any point, the generation is immediately aborted and discarded, preventing wasted computation on low-quality paths.
- Reported Gains: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy.
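The sketch below illustrates one way to implement warm-up calibration and streaming early abort. `generate_stream` is a hypothetical generator yielding (token, confidence) pairs, and the 10th-percentile calibration rule is an assumption rather than a fixed part of the method.

```python
import numpy as np

def calibrate_threshold(warmup_lgc_scores, percentile: float = 10.0) -> float:
    """Set tau from warm-up paths, e.g. a low percentile of their LGC scores."""
    return float(np.percentile(warmup_lgc_scores, percentile))

def generate_with_early_abort(generate_stream, tau: float, window: int = 2048):
    """Stream one sampling path and abort as soon as the running windowed
    confidence drops below tau. Returns the full text, or None if aborted."""
    tokens, confs = [], []
    running_sum = 0.0
    for token, conf in generate_stream:
        tokens.append(token)
        confs.append(conf)
        running_sum += conf
        if len(confs) > window:
            running_sum -= confs[-window - 1]  # slide the window forward
        if len(confs) >= window and running_sum / window < tau:
            return None  # abort: the weakest window fell below tau
    return "".join(tokens)
```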
2.4 HyperCLOVAX (Think/Answer) Specialization
We leverage the model's ChatML structure, which separates the thinking (exploration) and answer (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$.
- Thinking Stage: A looser threshold encourages broader exploration of ideas.
- Answer Stage: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
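One possible way to derive the two thresholds from warm-up statistics is sketched below; the per-stage LGC fields and percentile values are illustrative assumptions, not tuned recommendations.

```python
import numpy as np

def calibrate_dual_thresholds(warmup_paths, think_pct: float = 5.0, answer_pct: float = 20.0):
    """Derive stage-specific early-abort thresholds from warm-up paths.

    `warmup_paths` is assumed to be a list of dicts with per-stage LGC scores,
    e.g. {"think_lgc": 0.71, "answer_lgc": 0.93}. A lower percentile for the
    thinking stage yields a looser threshold (more exploration survives);
    a higher percentile for the answer stage enforces stricter confidence.
    """
    tau_think = float(np.percentile([p["think_lgc"] for p in warmup_paths], think_pct))
    tau_answer = float(np.percentile([p["answer_lgc"] for p in warmup_paths], answer_pct))
    return tau_think, tau_answer
```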
3. Hyperparameters (Recommended Defaults)
| Name | Description | Default Value (Example) |
|---|---|---|
| `W` | Sliding window length (tokens) | 2048 |
| `p` | Percentage for Top-p% Filtering | 10 |
| `M` | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early abort threshold for the thinking stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early abort threshold for the answer stage | Dynamic (based on warm-up, stricter) |
| `N_max` | Max number of paths to sample (online) | Optional limit (e.g., 64) |
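For reference, these defaults could be collected into a small configuration object; the field names below are illustrative and not part of the released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    """Recommended defaults from the table above (illustrative names)."""
    window: int = 2048          # W: sliding window length in tokens
    top_percent: float = 10.0   # p: keep the top-p% most confident paths
    warmup_paths: int = 16      # M: fully generated paths used to set tau
    n_max: Optional[int] = 64   # optional cap on total sampled paths
    # tau_think / tau_answer are set dynamically from warm-up statistics.
    tau_think: Optional[float] = None
    tau_answer: Optional[float] = None
```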
4. Evaluation
4.1 AIME 2025 (30-question slice): `deepconf` vs. `original`
Scoring: Correct = 1, Incorrect / No Format = 0. "No Format" is treated as not attempted.
| Metric | `original` | `deepconf` | Notes |
|---|---|---|---|
| Total Correct | 8 | 10 | +2 questions correct |
| Accuracy (out of 30) | 26.7% | 33.3% | +6.7%p improvement |
| Attempts (Format OK) | 8 | 11 | `deepconf` attempted 3 more questions |
| Format Failures | 22 | 19 | `deepconf` shows better format stability |
| Head-to-Head | – | – | 2 Wins / 0 Losses / 28 Ties for `deepconf` |
Breakdown by Part:
- Part I: Both models solved 6/15 questions (Tie).
- Part II: `original` solved 2/15, while `deepconf` solved 4/15. The performance gain was concentrated in the more difficult second half.
Note: The high number of "Format Failures" in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.
4.2 Efficiency & Speed (10-question sample test)
| Metric | Improvement with `deepconf` |
|---|---|
| Majority-Vote Accuracy | +20.0%p |
| Avg. Generated Tokens | −29.6% |
| Avg. Generation Time | −41.6% |
Caution: These results are based on a very small sample size (N = 10). However, they signal a meaningful improvement across accuracy, speed, and cost.
5. Use Cases and Recommended Pipeline
This model is ideal for mathematical and logical reasoning tasks where it offers significant sample savings and improved reliability compared to standard self-consistency.
Recommended Pipeline:
- Online: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.
- Offline: Apply Top-p% Filtering (with `p=10` as a starting point) to the remaining high-quality paths.
- Finalization: Use Confidence-Weighted Voting on the filtered set and apply a final format validation step to extract the answer.
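The following sketch ties the three stages together, reusing the helper functions sketched in Section 2. `sample_path` and `extract_answer` are hypothetical callables standing in for the actual generation and answer-parsing logic.

```python
def deepconf_pipeline(sample_path, extract_answer,
                      window: int = 2048, top_percent: float = 10.0,
                      warmup_m: int = 16, n_max: int = 64):
    """End-to-end sketch of the recommended pipeline.

    `sample_path()` is assumed to return (text, per_token_confidences) for one
    sampling path; `extract_answer(text)` parses and format-validates the final
    answer, returning None on a format failure.
    """
    # 1) Warm-up: fully generate M paths and calibrate tau from their LGC scores.
    warmup = [sample_path() for _ in range(warmup_m)]
    warmup_lgc = [lowest_group_confidence(confs, window) for _, confs in warmup]
    tau = calibrate_threshold(warmup_lgc, percentile=10.0)

    # 2) Online: keep sampling with early abort until n_max paths have been attempted.
    paths = [(text, lgc) for (text, _), lgc in zip(warmup, warmup_lgc)]
    for _ in range(max(0, n_max - warmup_m)):
        text, confs = sample_path()  # in practice, abort mid-generation once LGC < tau
        lgc = lowest_group_confidence(confs, window)
        if lgc >= tau:
            paths.append((text, lgc))

    # 3) Offline: format validation, then top-p% filtering + confidence-weighted voting.
    answers = [(extract_answer(text), lgc) for text, lgc in paths]
    answers = [(ans, lgc) for ans, lgc in answers if ans is not None]
    return filter_and_vote(answers, top_percent=top_percent)
```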
6. Limitations & What to Watch Out For
- Confidence Miscalibration: If the model's probability estimates are not well-calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or relying on warm-up statistics.
- Domain Shift: The optimal hyperparameters ($\tau, W, p$) may need recalibration when applied to new domains or problem styles.
- Unintended Early Aborts: A path might be discarded prematurely if it contains rare tokens or formatting that cause a temporary dip in confidence. Consider implementing a minimum generation length or a cooldown period.
- Reliance on Format Validation: If the final answer extraction logic is not robust, "correct but badly formatted" answers may still be missed.
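As an illustration of the minimum-generation-length guard mentioned above, the abort check can be gated on a token count so that a brief confidence dip early in the path cannot kill it; `min_tokens=4096` is a placeholder value, not a tuned recommendation.

```python
def should_abort(confs, tau: float, window: int = 2048, min_tokens: int = 4096) -> bool:
    """Only consider aborting once at least `min_tokens` tokens have been
    generated, so rare tokens or formatting quirks near the start of a path
    cannot trigger a premature abort."""
    if len(confs) < max(window, min_tokens):
        return False
    return sum(confs[-window:]) / window < tau
```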
7. Responsible Use
- Expose Reasoning: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.
- Resource Allocation: While early abort reduces overall cost, the warm-up phase introduces overhead. Manage this effectively with batching and queueing in a production environment.
- Bias and Fairness: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.
Citation
- Original Idea: Fu, Wang, Tian, Zhao et al., Deep Think With Confidence (Meta AI, UCSD)
Base model: naver-hyperclovax/HyperCLOVAX-SEED-Think-14B