Commit 7abc6a7 · 1 Parent(s): 96d111a

Update benchmark count and fix typo (`inetuning->finetuning`) (#395)
- Update benchmark count and fix typo (`inetuning->finetuning`) (cdeea55b7621c0b1fa7515a40bf2fb50df62d5d7)
Co-authored-by: Alvaro Bartolome <[email protected]>
- src/display/about.py  +2 -2

src/display/about.py  CHANGED

@@ -28,7 +28,7 @@ If there is no icon, we have not uploaded the information on the model yet, feel
 
 ## How it works
 
-📈 We evaluate models on 
+📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank">  Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank">  AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank">  HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -67,7 +67,7 @@ The tasks and few shots parameters are:
 Side note on the baseline scores: 
 - for log-likelihood evaluation, we select the random baseline
 - for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
-- for GSM8K, we select the score obtained in the paper after inetuning a 6B model on the full GSM8K training set for 50 epochs
+- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
 ## Quantization
 To get more information about quantization, see:
 
			

 
		