Huzaifa Pardawala committed on
Commit bd767d8 · 1 Parent(s): 555b191

fix: editing icons and sections

Files changed (1): index.html +13 -13
index.html CHANGED
@@ -3861,8 +3861,8 @@
 
   <!-- </section>
   </div> -->
-  <!-- <section class="section">
-  <div class="container"> -->
+  <section class="section">
+  <div class="container">
   <!-- Model Performance Highlights -->
   <div class="card mb-5">
     <div class="card-header">
@@ -3916,12 +3916,12 @@
   <p class="has-text-weight-bold mb-3"><span class="icon has-text-primary"><i class="fa-solid fa-magnifying-glass"></i></span> Key Insights from Model Analysis</p>
 
   <div class="notification is-info is-light py-3 px-4">
-    <p><strong><span class="icon"><i class="fa-solid fa-trophy"></i></span> No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-balance-scale"></i></span> Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-tools"></i></span> Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-coins"></i></span> Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-chart-line"></i></span> Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-list-ol"></i></span> Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
+    <p><strong>🏆 No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
+    <p><strong>⚖️ Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
+    <p><strong>🛠️ Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
+    <p><strong>💰 Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
+    <p><strong>📈 Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
+    <p><strong>🔢 Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
   </div>
   </div>
   </div>
@@ -4202,27 +4202,27 @@
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-brain"></i></span> Few-Shot & Chain-of-Thought</p>
+    <p class="has-text-weight-bold mb-1">🧠 Few-Shot & Chain-of-Thought</p>
     <p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-line"></i></span> Domain-Adaptive Training</p>
+    <p class="has-text-weight-bold mb-1">⚙️ Domain-Adaptive Training</p>
     <p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-database"></i></span> Expanded Dataset Coverage</p>
+    <p class="has-text-weight-bold mb-1">📊 Expanded Dataset Coverage</p>
     <p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-balance-scale"></i></span> Efficiency & Cost Benchmarking</p>
+    <p class="has-text-weight-bold mb-1">⚖️ Efficiency & Cost Benchmarking</p>
     <p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-bar"></i></span> Advanced Evaluation Metrics</p>
+    <p class="has-text-weight-bold mb-1">📈 Advanced Evaluation Metrics</p>
     <p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
   </div>
 
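
The cost figures quoted in the page's "Key Insights" text ($260, $105, and $4 per million tokens) imply a simple per-run estimate. A minimal sketch of that arithmetic, where the price table mirrors the quoted figures and the token count is an illustrative assumption rather than a measured value:

```python
# Sketch: estimate evaluation cost from per-million-token prices.
# Prices mirror the figures quoted on the page ($/1M tokens);
# the 2M-token run size below is an assumed example, not a benchmark stat.
PRICE_PER_MILLION = {
    "DeepSeek R1": 260.0,
    "Claude 3.5 Sonnet": 105.0,
    "o1-mini": 105.0,
    "Llama 3.1 8B": 4.0,
}

def run_cost(model: str, tokens: int) -> float:
    """Cost in USD for processing `tokens` tokens with `model`."""
    return PRICE_PER_MILLION[model] * tokens / 1_000_000

# Example: a hypothetical 2M-token evaluation run per model.
for model, price in PRICE_PER_MILLION.items():
    print(f"{model}: ${run_cost(model, 2_000_000):,.2f}")
```

At these prices the spread is roughly 65x between the most and least expensive models, which is the disparity the "Cost-performance" insight is pointing at.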