Huzaifa Pardawala committed on
Commit bd767d8 · 1 Parent(s): 555b191

fix: editing icons and sections

Files changed (1): index.html +13 -13
index.html CHANGED
@@ -3861,8 +3861,8 @@
 
   <!-- </section>
   </div> -->
-  <!-- <section class="section">
-  <div class="container"> -->
+  <section class="section">
+  <div class="container">
   <!-- Model Performance Highlights -->
   <div class="card mb-5">
     <div class="card-header">
@@ -3916,12 +3916,12 @@
   <p class="has-text-weight-bold mb-3"><span class="icon has-text-primary"><i class="fa-solid fa-magnifying-glass"></i></span> Key Insights from Model Analysis</p>
 
   <div class="notification is-info is-light py-3 px-4">
-    <p><strong><span class="icon"><i class="fa-solid fa-trophy"></i></span> No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-balance-scale"></i></span> Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-tools"></i></span> Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-coins"></i></span> Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-chart-line"></i></span> Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
-    <p><strong><span class="icon"><i class="fa-solid fa-list-ol"></i></span> Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
+    <p><strong>🏆 No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
+    <p><strong>⚖️ Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
+    <p><strong>🛠️ Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
+    <p><strong>💰 Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
+    <p><strong>📈 Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
+    <p><strong>🔢 Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
   </div>
   </div>
   </div>
@@ -4202,27 +4202,27 @@
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-brain"></i></span> Few-Shot & Chain-of-Thought</p>
+    <p class="has-text-weight-bold mb-1">🧠 Few-Shot & Chain-of-Thought</p>
     <p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-line"></i></span> Domain-Adaptive Training</p>
+    <p class="has-text-weight-bold mb-1">⚙️ Domain-Adaptive Training</p>
     <p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-database"></i></span> Expanded Dataset Coverage</p>
+    <p class="has-text-weight-bold mb-1">📊 Expanded Dataset Coverage</p>
     <p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-balance-scale"></i></span> Efficiency & Cost Benchmarking</p>
+    <p class="has-text-weight-bold mb-1">⚖️ Efficiency & Cost Benchmarking</p>
     <p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
   </div>
 
   <div class="notification is-info is-light py-2 px-3 mb-3">
-    <p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-bar"></i></span> Advanced Evaluation Metrics</p>
+    <p class="has-text-weight-bold mb-1">📈 Advanced Evaluation Metrics</p>
     <p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
   </div>
 
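
The cost figures quoted in the page's "Key Insights" text ($260, $105, and $4 per million tokens) imply a simple per-run estimate. A minimal sketch of that arithmetic, where the price table mirrors the quoted figures and the token count is an illustrative assumption rather than a measured value:

```python
# Sketch: estimate evaluation cost from per-million-token prices.
# Prices mirror the figures quoted on the page ($/1M tokens);
# the 2M-token run size below is an assumed example, not a benchmark stat.
PRICE_PER_MILLION = {
    "DeepSeek R1": 260.0,
    "Claude 3.5 Sonnet": 105.0,
    "o1-mini": 105.0,
    "Llama 3.1 8B": 4.0,
}

def run_cost(model: str, tokens: int) -> float:
    """Cost in USD for processing `tokens` tokens with `model`."""
    return PRICE_PER_MILLION[model] * tokens / 1_000_000

# Example: a hypothetical 2M-token evaluation run per model.
for model, price in PRICE_PER_MILLION.items():
    print(f"{model}: ${run_cost(model, 2_000_000):,.2f}")
```

At these prices the spread is roughly 65x between the most and least expensive models, which is the disparity the "Cost-performance" insight is pointing at.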