Model Card for sigridjineth/HyperCLOVAX-SEED-Think-DeepConf-14B

Summary

This model enhances a user-forked version of HyperCLOVAX-SEED-Think-14B by integrating ideas from Meta AI × UCSD's DeepConf. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of this method is the Lowest Group Confidence (LGC), a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (Top-p% Filtering, Confidence-Weighted Voting) and online optimization (Early Abort), ultimately achieving higher accuracy at a lower computational cost.


1. Background and Motivation

While Self-Consistency (generating multiple reasoning paths and taking a majority vote) can improve performance on reasoning tasks, its practical application is limited by prohibitive computational cost and the noise introduced by low-quality generation paths.

The DeepConf framework addresses this by reading the model's internal token-level probability distribution (confidence) to estimate the quality of a path in real time. A simple average confidence can be misleading, since long high-confidence stretches can mask a short, critical low-confidence segment (the "pitfall of averages"). We instead use the sliding-window LGC metric to quantify the path's weakest link.


2. Methods

2.1 Confidence Metric: Lowest Group Confidence (LGC)

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, calculating the average confidence within each window, and taking the minimum value as the quality score for the entire trajectory.

  • Intuition: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:

$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to using the softmax probability of the top-1 token.
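As a concrete reference, here is a minimal sketch of the LGC computation. It assumes per-token confidences have already been extracted from the model's output probabilities; the function name and signature are illustrative and not part of the released code.

```python
import numpy as np

def lowest_group_confidence(token_confidences, window=2048):
    """Sliding-window LGC: the minimum mean confidence over any length-`window` span.

    `token_confidences` holds one confidence score per generated token,
    e.g. the softmax probability of the sampled (top-1) token.
    """
    conf = np.asarray(token_confidences, dtype=np.float64)
    if conf.size == 0:
        return 0.0
    if conf.size <= window:
        # Shorter than one full window: fall back to the plain mean.
        return float(conf.mean())
    # Rolling window means via a cumulative sum, then take the minimum.
    csum = np.concatenate(([0.0], np.cumsum(conf)))
    window_means = (csum[window:] - csum[:-window]) / window
    return float(window_means.min())
```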

2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting

  • Top-p% Filtering: Among $N$ generated paths, only the top p% with the highest confidence scores are included in the final vote.
  • Confidence-Weighted Voting: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it); both steps are sketched after this list.
  • Literature Example: For a GPT-family model on AIME-2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: This is a literature example; this model's specific results are detailed below.)
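A minimal sketch of the two offline steps, assuming each path has already been scored with its LGC value; using the raw LGC score as the vote weight is one simple choice, not the only possible weighting.

```python
from collections import defaultdict

def top_p_weighted_vote(paths, p=10):
    """Keep the top-p% most confident paths, then take a confidence-weighted vote.

    `paths` is a list of (answer, lgc_score) pairs; the function name and the
    choice of weighting are illustrative.
    """
    ranked = sorted(paths, key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, round(len(ranked) * p / 100))
    kept = ranked[:n_keep]

    votes = defaultdict(float)
    for answer, lgc in kept:
        votes[answer] += lgc              # weight each vote by its confidence
    return max(votes, key=votes.get)      # answer with the largest weighted mass
```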

2.3 Online Method: Adaptive Sampling (Early Abort)

  1. Warm-up: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.
  2. Monitoring: For each new path, if its real-time LGC drops below $\tau$ at any point, the generation is immediately aborted and discarded, preventing wasted computation on low-quality paths (see the sketch after this list).
  • Reported Gains: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy.
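A skeleton of the warm-up-then-abort loop. The two generation callbacks and the percentile-based calibration of $\tau$ are assumptions for illustration; the released inference code may calibrate the threshold differently.

```python
import numpy as np

def adaptive_sample(generate_path, stream_path, n_warmup=16, n_max=64, pct=10):
    """Warm-up plus early-abort sampling loop (illustrative skeleton).

    generate_path() -> (answer, lgc): fully generates one path.
    stream_path(tau) -> (answer, lgc) or None: generates a path while monitoring
        the running LGC and returns None as soon as it drops below `tau`.
    """
    # 1. Warm-up: fully generate M paths and calibrate the abort threshold tau.
    warmup = [generate_path() for _ in range(n_warmup)]
    tau = float(np.percentile([lgc for _, lgc in warmup], pct))

    # 2. Monitoring: abort any later path whose running LGC falls below tau.
    kept = list(warmup)
    for _ in range(n_max - n_warmup):
        result = stream_path(tau)
        if result is not None:            # None signals an early abort
            kept.append(result)
    return kept, tau
```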

2.4 HyperCLOVAX (Think/Answer) Specialization

We leverage the model's ChatML structure, which separates the thinking (exploration) and answer (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$ (a stage-aware check is sketched after the list below).

  • Thinking Stage: A looser threshold encourages broader exploration of ideas.
  • Answer Stage: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
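A stage-aware abort check could look roughly like this. The tag that closes the thinking block depends on the model's chat template; "</think>" is an assumption to verify against the tokenizer, not a confirmed token string.

```python
def should_abort(running_lgc, generated_text, tau_think, tau_answer,
                 think_end_tag="</think>"):
    """Stage-aware early-abort check (illustrative sketch)."""
    in_think = think_end_tag not in generated_text
    # Looser threshold while thinking, stricter threshold for the final answer.
    tau = tau_think if in_think else tau_answer
    return running_lgc < tau
```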

3. Hyperparameters (Recommended Defaults)

| Name | Description | Default Value (Example) |
|---|---|---|
| $W$ | Sliding window length (tokens) | 2048 |
| $p$ | Percentage for Top-p% Filtering | 10 |
| $M$ | Number of warm-up paths for calibration | 16 |
| $\tau_{\text{think}}$ | Early-abort threshold for the thinking stage | Dynamic (based on warm-up) |
| $\tau_{\text{answer}}$ | Early-abort threshold for the answer stage | Dynamic (based on warm-up, stricter) |
| $N_{\max}$ | Max number of paths to sample (online) | Optional limit (e.g., 64) |
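For reference, these defaults can be grouped into a small configuration object; the class and field names below are illustrative, not part of the released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    """Recommended defaults from the table above (field names are illustrative)."""
    window: int = 2048                  # W: sliding-window length in tokens
    top_p_percent: float = 10.0         # p: keep the top p% of paths by confidence
    n_warmup: int = 16                  # M: fully generated warm-up paths
    tau_think: Optional[float] = None   # set dynamically from warm-up statistics
    tau_answer: Optional[float] = None  # set dynamically, stricter than tau_think
    n_max: int = 64                     # optional cap on the number of sampled paths
```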

4. Evaluation

4.1 AIME 2025 (30-question slice): deepconf vs. original

Scoring: Correct = 1, Incorrect / No Format = 0. "No Format" is treated as not attempted.

| Metric | original | deepconf | Notes |
|---|---|---|---|
| Total Correct | 8 | 10 | +2 questions correct |
| Accuracy (out of 30) | 26.7% | 33.3% | +6.7 pp improvement |
| Attempts (Format OK) | 8 | 11 | deepconf attempted 3 more questions |
| Format Failures | 22 | 19 | deepconf shows better format stability |
| Head-to-Head | - | - | 2 Wins / 0 Losses / 28 Ties for deepconf |

Breakdown by Part:

  • Part I: Both models solved 6/15 questions (Tie).
  • Part II: original solved 2/15, while deepconf solved 4/15. The performance gain was concentrated in the more difficult second half.

Note: The high number of "Format Failures" in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.

4.2 Efficiency & Speed (10-question sample test)

| Metric | Improvement with deepconf |
|---|---|
| Majority-Vote Accuracy | +20.0 pp |
| Avg. Generated Tokens | -29.6% |
| Avg. Generation Time | -41.6% |

Caution: These results are based on a very small sample size (N ≈ 10). However, they signal a meaningful improvement across accuracy, speed, and cost.


5. Use Cases and Recommended Pipeline

This model is ideal for mathematical and logical reasoning tasks where it offers significant sample savings and improved reliability compared to standard self-consistency.

Recommended Pipeline:

  1. Online: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.
  2. Offline: Apply Top-p% Filtering (with p=10 as a starting point) to the remaining high-quality paths.
  3. Finalization: Use Confidence-Weighted Voting on the filtered set and apply a final format validation step to extract the answer (a wiring sketch follows this list).
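Under the same assumptions as the earlier sketches (the generation and extraction callbacks are placeholders, not a released API), the three steps could be wired together roughly as follows, reusing the illustrative helpers from Sections 2–3:

```python
def run_pipeline(generate_path, stream_path, extract_answer, cfg=None):
    """End-to-end sketch combining the illustrative helpers defined above."""
    cfg = cfg or DeepConfConfig()
    # 1. Online: warm-up calibration followed by early-abort sampling.
    paths, _tau = adaptive_sample(generate_path, stream_path,
                                  n_warmup=cfg.n_warmup, n_max=cfg.n_max)
    # 2.-3. Offline: Top-p% filtering plus confidence-weighted voting.
    winner = top_p_weighted_vote(paths, p=cfg.top_p_percent)
    # Final format validation / answer extraction; `extract_answer` is an assumed hook.
    return extract_answer(winner)
```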

6. Limitations & What to Watch Out For

  • Confidence Miscalibration: If the model's probability estimates are not well-calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or relying on warm-up statistics.
  • Domain Shift: The optimal hyperparameters ($\tau, W, p$) may need recalibration when applied to new domains or problem styles.
  • Unintended Early Aborts: A path might be discarded prematurely if it contains rare tokens or formatting that cause a temporary dip in confidence. Consider implementing a minimum generation length or a cooldown period.
  • Reliance on Format Validation: If the final answer extraction logic is not robust, "correct but badly formatted" answers may still be missed.

7. Responsible Use

  • Expose Reasoning: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.
  • Resource Allocation: While early abort reduces overall cost, the warm-up phase introduces overhead. Manage this effectively with batching and queueing in a production environment.
  • Bias and Fairness: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

Citation

  • Original Idea: Fu, Wang, Tian, Zhao et al., Deep Think With Confidence (Meta AI, UCSD)