---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- naver-hyperclovax/HyperCLOVAX-SEED-Think-14B
---

## Model Card for sigridjineth/HyperCLOVAX-SEED-Think-DeepConf-14B

### **Summary**

This model enhances a user-forked version of **HyperCLOVAX-SEED-Think-14B** by integrating ideas from Meta AI × UCSD's **DeepConf**. It performs confidence-based quality estimation and adaptive sampling to improve both accuracy and efficiency.

The core of this method is the **Lowest Group Confidence (LGC)**, a metric that uses a sliding window to identify the "most uncertain segment" of a generation path. This allows for intelligent offline filtering (**Top-p% Filtering**, **Confidence-Weighted Voting**) and online optimization (**Early Abort**), ultimately achieving higher accuracy at a lower computational cost.

-----

### **1. Background and Motivation**

While **Self-Consistency**—generating multiple paths and taking a majority vote—can improve performance on reasoning tasks, its practical application is limited by prohibitive computational costs and the noise introduced by low-quality generation paths.

The **DeepConf** framework addresses this by reading the model's internal **token generation probability distribution (confidence)** to estimate the quality of a path in real time. Simple average confidence can be misleading due to the "pitfall of averages." We instead use the sliding-window LGC metric to quantify the path's weakest link.

-----

### **2. Methods**

#### **2.1 Confidence Metric: Lowest Group Confidence (LGC)**

LGC is calculated by moving a window of size $W$ (e.g., 2048 tokens) across the entire generation path, calculating the average confidence within each window, and taking the minimum value as the quality score for the entire trajectory.

  * **Intuition**: The quality of a path is limited by its most uncertain or speculative segment.

The formula is:
$$\text{LGC}(\text{trajectory}) = \min_{t} \frac{1}{W}\sum_{i=t}^{t+W-1} \text{conf}(y_i)$$

Here, $\text{conf}(y_i)$ is the generation probability of token $y_i$. Our implementation defaults to using the softmax probability of the top-1 token.
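
The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual implementation; it assumes the per-token top-1 softmax probabilities of one trajectory are already collected in a list:

```python
def lowest_group_confidence(token_probs, window=2048):
    """Lowest Group Confidence (LGC): the minimum windowed mean of
    per-token confidence across the trajectory.

    token_probs: top-1 softmax probabilities, one per generated token.
    Trajectories shorter than `window` are scored with a single window.
    """
    n = len(token_probs)
    if n == 0:
        return 0.0
    w = min(window, n)
    # Maintain a running sum over a sliding window of length w.
    cur = sum(token_probs[:w])
    best = cur
    for t in range(1, n - w + 1):
        cur += token_probs[t + w - 1] - token_probs[t - 1]
        best = min(best, cur)
    return best / w
```

A uniformly confident path scores its mean confidence, while a single low-confidence stretch anywhere in the trajectory pulls the score down, which is exactly the "weakest link" intuition.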

#### **2.2 Offline Methods: Top-p% Filtering & Confidence-Weighted Voting**

  * **Top-p% Filtering**: Among $N$ generated paths, only the **top p%** with the highest confidence scores are included in the final vote.
  * **Confidence-Weighted Voting**: Each path's vote is weighted by a function of its confidence score (e.g., its LGC score or a monotonic transformation of it).
  * **Literature Example**: For a GPT-family model on AIME-2025, using only the top 10% of 512 samples reportedly improved accuracy from 97.0% to 99.9%. (Note: This is a literature example; this model's specific results are detailed below.)
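
As a rough sketch of how the two offline steps compose (the function name and the weighting choice of using the raw LGC score directly are illustrative assumptions, not the model's exact implementation):

```python
from collections import defaultdict

def filtered_weighted_vote(paths, top_p_pct=10.0):
    """paths: list of (answer, lgc_score) tuples from N full generations.
    Keeps the top p% of paths by LGC, then returns the answer with the
    largest confidence-weighted vote mass."""
    ranked = sorted(paths, key=lambda x: x[1], reverse=True)
    keep = max(1, int(round(len(ranked) * top_p_pct / 100.0)))
    votes = defaultdict(float)
    for answer, score in ranked[:keep]:
        votes[answer] += score  # weight each vote by its LGC score
    return max(votes, key=votes.get)
```

Note that filtering changes outcomes: a minority answer backed by a few very confident paths can win after filtering even when it would lose a plain majority vote.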

#### **2.3 Online Method: Adaptive Sampling (Early Abort)**

1.  **Warm-up**: Fully generate $M$ initial paths (e.g., 16) to establish a dynamic confidence threshold, $\tau$.
2.  **Monitoring**: For each new path, if its real-time LGC drops below $\tau$ at any point, generation is immediately aborted and the path discarded, preventing wasted computation on low-quality paths.


  * **Reported Gains**: This technique can reduce the number of sampled tokens by ~85% while maintaining or even improving accuracy.
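
The two online steps can be sketched as follows. This is a simplified illustration: `calibrate_threshold` and `generate_with_abort` are hypothetical helper names, and the per-token confidence stream is assumed to come from the decoding loop (e.g., via output scores in `transformers`):

```python
from collections import deque

def calibrate_threshold(warmup_scores, keep_pct=10.0):
    """Pick tau from the M warm-up LGC scores so that roughly the
    top keep_pct% of future paths would pass."""
    ranked = sorted(warmup_scores, reverse=True)
    idx = max(0, min(len(ranked) - 1, int(len(ranked) * keep_pct / 100.0)))
    return ranked[idx]

def generate_with_abort(stream_probs, tau, window=2048, min_tokens=64):
    """Consume a stream of per-token confidences; abort as soon as the
    windowed mean drops below tau (after a minimum generation length,
    to avoid aborting on an early transient dip)."""
    buf = deque(maxlen=window)  # sliding confidence window
    out = []
    for p in stream_probs:
        out.append(p)
        buf.append(p)
        # sum() per step is O(window); fine for a sketch, a running
        # total would be used in practice.
        if len(out) >= min_tokens and sum(buf) / len(buf) < tau:
            return None  # real-time LGC fell below tau: discard this path
    return out
```

In a real deployment the aborted path's tokens are simply dropped, so the saved compute is the remainder of the trajectory that was never generated.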

#### **2.4 HyperCLOVAX (Think/Answer) Specialization**

We leverage the model's ChatML structure, which separates the `thinking` (exploration) and `answer` (formal response) stages, by applying a dual-threshold system: $\tau_{\text{think}} < \tau_{\text{answer}}$.

  * **Thinking Stage**: A looser threshold encourages broader exploration of ideas.
  * **Answer Stage**: A stricter threshold enforces high confidence, ensuring formal correctness and accuracy in the final output.
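
A minimal sketch of the dual-threshold monitor, assuming the decoding loop can tag each token with its stage (the function and the `(stage, prob)` stream shape are illustrative assumptions):

```python
from collections import deque

def dual_stage_abort(stream, tau_think, tau_answer, window=2048):
    """stream yields (stage, prob) pairs, stage in {'think', 'answer'}.
    Applies the looser tau_think during exploration and the stricter
    tau_answer once the model enters its formal answer.
    Returns True if the path survives, False if it should be aborted."""
    buf = deque(maxlen=window)
    for stage, p in stream:
        buf.append(p)
        tau = tau_think if stage == "think" else tau_answer
        if len(buf) == window and sum(buf) / window < tau:
            return False  # windowed confidence below the stage threshold
    return True
```

The same sliding window is reused across the stage boundary; only the threshold being compared against changes when generation transitions from `thinking` to `answer`.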

-----

### **3. Hyperparameters (Recommended Defaults)**

| Name                | Description                                        | Default Value (Example)                  |
| ------------------- | -------------------------------------------------- | ---------------------------------------- |
| `W`                 | Sliding window length (tokens)                     | 2048                                     |
| `p`                 | Percentage for Top-p% Filtering                  | 10                                       |
| `M`                 | Number of warm-up paths for calibration          | 16                                       |
| $\tau_{\text{think}}$ | Early abort threshold for the `thinking` stage     | Dynamic (based on warm-up)               |
| $\tau_{\text{answer}}$ | Early abort threshold for the `answer` stage       | Dynamic (based on warm-up, stricter)     |
| `N_max`             | Max number of paths to sample (online)             | Optional limit (e.g., 64)                |
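
The table above maps naturally onto a small configuration object; the class name and field names below are illustrative, not part of the released code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeepConfConfig:
    window: int = 2048            # W: sliding-window length in tokens
    top_p_pct: float = 10.0       # p: keep top p% of paths by LGC
    warmup_paths: int = 16        # M: fully generated paths for calibration
    n_max: int = 64               # optional cap on sampled paths (online)
    # The two thresholds are set dynamically from warm-up statistics,
    # with tau_think < tau_answer.
    tau_think: Optional[float] = None
    tau_answer: Optional[float] = None
```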

-----

### **4. Evaluation**

#### **4.1 AIME 2025 (30-question slice) — `deepconf` vs. `original`**

*Scoring: Correct = 1, Incorrect / No Format = 0. "No Format" is treated as not attempted.*

| Metric                  | `original` | `deepconf` | Notes                                         |
| ----------------------- | ---------- | ---------- | --------------------------------------------- |
| **Total Correct** | 8          | **10** | +2 questions correct                          |
| **Accuracy (out of 30)** | 26.7%      | **33.3%** | +6.7%p improvement                            |
| Attempts (Format OK)    | 8          | 11         | `deepconf` attempted 3 more questions         |
| Format Failures         | 22         | 19         | `deepconf` shows better format stability      |
| **Head-to-Head** | —          | —          | **2 Wins / 0 Losses / 28 Ties for `deepconf`** |

**Breakdown by Part:**

  * **Part I**: Both models solved 6/15 questions (Tie).
  * **Part II**: `original` solved 2/15, while `deepconf` solved 4/15. **The performance gain was concentrated in the more difficult second half.**

*Note: The high number of "Format Failures" in this slice indicates that the ability to adhere to strict output formatting was a significant factor in the final score.*

#### **4.2 Efficiency & Speed (10-question sample test)**

| Metric                    | Improvement with `deepconf` |
| ------------------------- | ---------------------------- |
| **Majority-Vote Accuracy** | +20.0%p                      |
| **Avg. Generated Tokens** | –29.6%                       |
| **Avg. Generation Time** | –41.6%                       |

***Caution: These results are based on a very small sample size (N≈10).*** However, they signal a meaningful improvement across accuracy, speed, and cost.

-----

### **5. Use Cases and Recommended Pipeline**

This model is ideal for **mathematical and logical reasoning tasks** where it offers significant sample savings and improved reliability compared to standard self-consistency.

**Recommended Pipeline:**

1.  **Online**: Use adaptive sampling with a warm-up phase and early abort to filter out low-quality paths efficiently.
2.  **Offline**: Apply Top-p% Filtering (with `p=10` as a starting point) to the remaining high-quality paths.
3.  **Finalization**: Use Confidence-Weighted Voting on the filtered set and apply a final format validation step to extract the answer.
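
The three steps above can be glued together as follows. This is a schematic sketch: `sample_path` is a hypothetical sampler that is assumed to run warm-up calibration and online early abort internally, returning `(answer, lgc_score)` for a surviving path and `None` for an aborted one:

```python
from collections import defaultdict

def deepconf_pipeline(sample_path, n_max=64, top_p_pct=10.0):
    """End-to-end sketch: collect online survivors, apply Top-p%
    Filtering, then Confidence-Weighted Voting on the filtered set."""
    survivors = [r for r in (sample_path() for _ in range(n_max))
                 if r is not None]
    if not survivors:
        return None
    # Offline step 1: keep only the top p% of survivors by LGC.
    survivors.sort(key=lambda r: r[1], reverse=True)
    keep = max(1, int(round(len(survivors) * top_p_pct / 100.0)))
    # Offline step 2: confidence-weighted vote over the filtered set.
    votes = defaultdict(float)
    for answer, score in survivors[:keep]:
        votes[answer] += score
    winner = max(votes, key=votes.get)
    # A final format-validation pass would run here before returning.
    return winner
```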

-----

### **6. Limitations & What to Watch Out For**

  * **Confidence Miscalibration**: If the model's probability estimates are not well-calibrated, the threshold $\tau$ may be unreliable. This can be mitigated by tuning temperature/top-k or by relying on warm-up statistics.
  * **Domain Shift**: The optimal hyperparameters ($\tau$, $W$, $p$) may need recalibration when applied to new domains or problem styles.
  * **Unintended Early Aborts**: A path might be discarded prematurely if it contains rare tokens or formatting that cause a temporary dip in confidence. Consider implementing a minimum generation length or a cooldown period.
  * **Reliance on Format Validation**: If the final answer extraction logic is not robust, "correct but badly formatted" answers may still be missed.

-----

### **7. Responsible Use**

  * **Expose Reasoning**: For math and coding tasks, always pair the final answer with the generation's reasoning or verification steps to mitigate hallucinations and minor errors.
  * **Resource Allocation**: While early abort reduces overall cost, the warm-up phase introduces overhead. Manage this effectively with batching and queueing in a production environment.
  * **Bias and Fairness**: Confidence-based filtering may systematically favor certain response styles. We recommend periodic auditing and sampling to ensure fairness and diversity in outputs.

-----

### **Citation**

  * **Original Idea**: Fu, Wang, Tian, Zhao et al., *Deep Think With Confidence* (Meta AI, UCSD)