Pratik Bhavsar committed · ae900da · Parent(s): e2809a3
added key insights
Files changed: data_loader.py (+97 -10)
data_loader.py
CHANGED
@@ -721,15 +721,102 @@ METHODOLOGY = """
 cases that challenge real-world applicability.
 </p>
 
-
-
-
-
-
-
-
-
-
+<style>
+.key-insights thead tr {
+background: linear-gradient(90deg, #60A5FA, #818CF8);
+}
+
+.key-insights td:first-child {
+color: var(--accent-blue);
+background: var(--bg-primary);
+}
+
+.key-insights td:last-child {
+background: var(--bg-primary);
+}
+
+.key-insights td {
+padding: 1rem;
+border-bottom: 1px solid rgba(31, 41, 55, 0.5);
+}
+</style>
+
+<div class="methodology-section">
+<h2 class="methodology-subtitle">Key Insights</h2>
+<div class="table-container">
+<table class="dataset-table key-insights">
+<thead>
+<tr>
+<th>Category</th>
+<th>Finding</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Performance Champion</td>
+<td>Gemini-2.0-flash dominates with a 0.935 score at just $0.075 per million tokens, excelling in both complex tasks (0.95) and safety features (0.98)</td>
+</tr>
+<tr>
+<td>Price-Performance Paradox</td>
+<td>The top 3 models span a 20x price difference yet only a 3% performance gap, challenging common pricing assumptions</td>
+</tr>
+<tr>
+<td>Open vs. Closed Source</td>
+<td>The new Mistral-small leads open-source models and performs on par with GPT-4o-mini at 0.83, signaling OSS maturity in tool calling</td>
+</tr>
+<tr>
+<td>Reasoning Models</td>
+<td>Despite their reasoning strengths, o1 and o3-mini are far from perfect, scoring 0.87 and 0.84 respectively. DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+</tr>
+<tr>
+<td>Tool Miss Detection</td>
+<td>Dataset averages of 0.59 and 0.78 reveal fundamental challenges in handling edge cases and maintaining context, even as models excel at basic tasks</td>
+</tr>
+<tr>
+<td>Architecture Trade-offs</td>
+<td>Long-context vs. parallel execution exposes architectural limits: o1 leads in context retention (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern</td>
+</tr>
+</tbody>
+</table>
+</div>
+
+<h2 class="methodology-subtitle">Development Implications</h2>
+<div class="table-container">
+<table class="dataset-table key-insights">
+<thead>
+<tr>
+<th>Area</th>
+<th>Recommendation</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Task Complexity</td>
+<td>Simple tasks work with most models. Complex workflows requiring multiple tools need models scoring 0.85+ in composite tests</td>
+</tr>
+<tr>
+<td>Error Handling</td>
+<td>Models with low tool-selection scores need guardrails. Add validation layers and structured error recovery, especially for parameter collection</td>
+</tr>
+<tr>
+<td>Context Management</td>
+<td>Long conversations require either models strong in context retention or external context storage systems</td>
+</tr>
+<tr>
+<td>Reasoning Models</td>
+<td>o1 and o3-mini handle function calling well, though not flawlessly; DeepSeek V3 and R1 were excluded from rankings due to limited function support</td>
+</tr>
+<tr>
+<td>Safety Controls</td>
+<td>Add strict tool access controls for models weak in irrelevance detection. Include validation layers for inconsistent performers</td>
+</tr>
+<tr>
+<td>Open vs. Closed Source</td>
+<td>Private models lead in complex tasks, but open-source options work well for basic operations. Choose based on your scaling needs</td>
+</tr>
+</tbody>
+</table>
+</div>
 
 <h2 class="methodology-subtitle">Dataset Structure</h2>
 <div class="table-container">
@@ -847,5 +934,5 @@ METHODOLOGY = """
 <li>Monthly model additions</li>
 </ul>
 </div>
-
+
 """
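
The Error Handling and Safety Controls recommendations in the added table both reduce to the same pattern: validate a proposed tool call before executing it, and return a structured error the model can recover from. Below is a minimal sketch of such a guardrail layer. The tool registry, schema format, and every name in it are hypothetical illustrations, not part of this repository or the benchmark harness.

# Illustrative guardrail that checks a model's proposed tool call against
# an allow-list and a simple parameter schema before execution.
# All tool names and the schema layout here are assumptions for the sketch.

ALLOWED_TOOLS = {
    "get_weather": {
        "required": {"city": str},
        "optional": {"units": str},
    },
    "book_flight": {
        "required": {"origin": str, "destination": str, "date": str},
        "optional": {},
    },
}

def validate_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Return (ok, message); the message can be fed back to the model."""
    spec = ALLOWED_TOOLS.get(name)
    if spec is None:
        # Strict access control: unknown tools are rejected outright.
        return False, f"Tool '{name}' is not in the allowed registry."
    missing = [p for p in spec["required"] if p not in args]
    if missing:
        # Structured recovery: tell the model exactly what to collect.
        return False, f"Missing required parameters: {', '.join(missing)}."
    for param, value in args.items():
        expected = spec["required"].get(param) or spec["optional"].get(param)
        if expected is None:
            return False, f"Unexpected parameter '{param}'."
        if not isinstance(value, expected):
            return False, f"Parameter '{param}' should be {expected.__name__}."
    return True, "ok"

# A malformed call is caught before it reaches the tool.
ok, msg = validate_tool_call("book_flight", {"origin": "SFO"})
print(ok, msg)  # False Missing required parameters: destination, date.

Running the rejected call's message back through the model is what turns a low tool-selection score into a recoverable loop instead of a silent failure.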
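The Context Management recommendation (external context storage for models weak in long-context retention) can be as simple as persisting salient facts outside the prompt window and re-injecting a compact summary each turn. A minimal sketch, with hypothetical names throughout:

# Illustrative external context store: salient facts survive prompt
# truncation because they live outside the conversation window and are
# prepended as a short preamble on every turn.

class ContextStore:
    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value  # later mentions overwrite earlier ones

    def preamble(self) -> str:
        """Compact summary to prepend to the prompt each turn."""
        if not self._facts:
            return ""
        lines = [f"- {k}: {v}" for k, v in self._facts.items()]
        return "Known facts from earlier in the conversation:\n" + "\n".join(lines)

store = ContextStore()
store.remember("departure_city", "SFO")
store.remember("budget", "under $500")
print(store.preamble())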