Commit 5601a63 · Parent(s): ce824ba

Adjust description for TruthfulQA

Files changed: content.py (+6 -2)
    	
content.py CHANGED

@@ -1,4 +1,7 @@
 CHANGELOG_TEXT = f"""
+## [2023-06-13]
+- Adjust description for TruthfulQA
+
 ## [2023-06-12]
 - Add Human & GPT-4 Evaluations
 
@@ -34,7 +37,8 @@ CHANGELOG_TEXT = f"""
 - Display different queues for jobs that are RUNNING, PENDING, FINISHED status
 
 ## [2023-05-15]
-- Fix a typo: from "TruthQA" to "
+- Fix a typo: from "TruthQA" to "
+QA"
 
 ## [2023-05-10]
 - Fix a bug that prevented auto-refresh
@@ -58,7 +62,7 @@ Evaluation is performed against 4 popular benchmarks:
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank">  AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank">  HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 - <a href="https://arxiv.org/abs/2009.03300" target="_blank">  MMLU </a>  (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
-- <a href="https://arxiv.org/abs/2109.07958" target="_blank">  TruthfulQA </a> (0-shot) - a
+- <a href="https://arxiv.org/abs/2109.07958" target="_blank">  TruthfulQA </a> (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
 
 We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
 """
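For context on how the edited text is consumed: content.py only defines constants such as CHANGELOG_TEXT and the benchmark description block; the Space's UI code imports and renders them. The snippet below is a minimal sketch of that pattern, assuming a Gradio app that imports the constant. The page title, accordion label, and overall layout are illustrative assumptions, not part of this commit.

# Minimal sketch (assumption): how a Gradio Space might render the constant
# edited in this commit. The actual app code is not shown in this diff.
import gradio as gr

from content import CHANGELOG_TEXT  # the f-string modified above

with gr.Blocks() as demo:
    gr.Markdown("# LLM Leaderboard")             # hypothetical page title
    with gr.Accordion("Changelog", open=False):  # collapsible changelog panel
        gr.Markdown(CHANGELOG_TEXT)              # renders entries such as "## [2023-06-13]"

if __name__ == "__main__":
    demo.launch()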
 
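Because the web diff viewer clips a few long lines above (the removed TruthfulQA description and the "TruthQA" typo entry are cut off), it can be easier to pull content.py at this revision and at its parent and diff the files locally. The sketch below uses huggingface_hub; the repo_id is a placeholder, since the Space's name is not shown on this page, and the full 40-character commit hashes may be needed in place of the short ones displayed here.

# Minimal sketch (assumption): download content.py at commit 5601a63 and at its
# parent ce824ba, then compare the full, unclipped lines locally.
from huggingface_hub import hf_hub_download

REPO_ID = "some-org/some-leaderboard-space"  # placeholder for the Space shown on this page

new_path = hf_hub_download(repo_id=REPO_ID, repo_type="space",
                           filename="content.py", revision="5601a63")
old_path = hf_hub_download(repo_id=REPO_ID, repo_type="space",
                           filename="content.py", revision="ce824ba")

# Both calls return local file paths; compare them with difflib or a local diff tool.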
			
