Update evaluation/demo_humaneval.md
evaluation/demo_humaneval.md +7 -22
evaluation/demo_humaneval.md
CHANGED
|
@@ -19,27 +19,12 @@ print(pass_at_k)
|
|
| 19 |
{'pass@1': 0.5, 'pass@2': 1.0}
|
| 20 |
```
|
| 21 |
|
| 22 |
-
To better understand how pass@k metric works, we will illustrate it with
|
| 23 |
|
| 24 |
-
**Problem
|
| 25 |
|
| 26 |
```python
|
| 27 |
|
| 28 |
-
from typing import List
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
def separate_paren_groups(paren_string: str) -> List[str]:
|
| 32 |
-
""" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
|
| 33 |
-
separate those group into separate strings and return the list of those.
|
| 34 |
-
Separate groups are balanced (each open brace is properly closed) and not nested within each other
|
| 35 |
-
Ignore any spaces in the input string.
|
| 36 |
-
>>> separate_paren_groups('( ) (( )) (( )( ))')
|
| 37 |
-
['()', '(())', '(()())']
|
| 38 |
-
"""
|
| 39 |
-
````
|
| 40 |
-
**Problem 2:**
|
| 41 |
-
```python
|
| 42 |
-
|
| 43 |
def truncate_number(number: float) -> float:
|
| 44 |
""" Given a positive floating point number, it can be decomposed into
|
| 45 |
and integer part (largest integer smaller than given number) and decimals
|
|
@@ -51,23 +36,23 @@ def truncate_number(number: float) -> float:
|
|
| 51 |
"""
|
| 52 |
````
|
| 53 |
|
| 54 |
-
|
| 55 |
|
| 56 |
**Remark**:
|
| 57 |
|
| 58 |
Regarding the temperature parameter, in the [CodeGen](https://github.com/salesforce/CodeGen) paper, the authors observed that the best-performing temperature increases as the number of permitted samples k increases. When a model is only allowed a few samples to pass the unit tests, it is beneficial to use the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, using a higher sampling temperature to tilt the learned model distribution lets it explore diverse samples and thus have a greater chance of synthesizing a correct program.
|
| 59 |
|
| 60 |
|
| 61 |
-
For our experiment, we compute pass@1, pass@10 and pass@20, each
|
| 62 |
|
| 63 |
```
|
| 64 |
|
| 65 |
-
Results: {'pass@1': 0.
|
| 66 |
|
| 67 |
````
|
| 68 |
|
| 69 |
-
If we take a closer look at the unit test results for each candidate solution
|
| 70 |
-
for pass@20, it is `1
|
| 71 |
|
| 72 |
```python
|
| 73 |
|
|
|
|
| 19 |
{'pass@1': 0.5, 'pass@2': 1.0}
|
| 20 |
```
|
| 21 |
|
| 22 |
+
To better understand how the pass@k metric works, we will illustrate it with a concrete example from the HumanEval dataset. We select the problem below and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests:
|
| 23 |
|
| 24 |
+
**Problem:**
|
| 25 |
|
| 26 |
```python
|
| 27 |
|
| 28 |
def truncate_number(number: float) -> float:
|
| 29 |
""" Given a positive floating point number, it can be decomposed into
|
| 30 |
and integer part (largest integer smaller than given number) and decimals
|
|
|
|
| 36 |
"""
|
| 37 |
````
|
| 38 |
|
| 39 |
+
Instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus (top-p) sampling with `p=0.95` and `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: `\nclass`, `\ndef`, `\n#`, `\nif`, or `\nprint`. For more details about decoding strategies for language generation, we recommend this [blog post](https://huggingface.co/blog/how-to-generate).
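To make the stop-sequence rule concrete, here is a minimal sketch of how a raw completion could be cut at the first stop sequence. The helper name `truncate_at_stop_sequence` and the sample string are hypothetical and only serve as an illustration; the actual generation pipeline may apply the stopping criterion during decoding instead of as a post-processing step.

```python
from typing import Sequence

# Stop sequences quoted in the text above.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop_sequence(
    completion: str, stop_sequences: Sequence[str] = STOP_SEQUENCES
) -> str:
    """Return the completion cut at the earliest stop sequence, if any occurs."""
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]


# Hypothetical raw model output: a function body followed by an unrelated definition.
generated = "    return number % 1.0\n\ndef unrelated():\n    pass\n"
print(repr(truncate_at_stop_sequence(generated)))
```

Only the function body before the first `\ndef` is kept, so the unrelated definition that follows does not affect the unit tests.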
|
| 40 |
|
| 41 |
**Remark**:
|
| 42 |
|
| 43 |
Regarding the temperature parameter, in the [CodeGen](https://github.com/salesforce/CodeGen) paper, the authors observed that the best-performing temperature increases as the number of permitted samples k increases. When a model is only allowed a few samples to pass the unit tests, it is beneficial to use the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, using a higher sampling temperature to tilt the learned model distribution lets it explore diverse samples and thus have a greater chance of synthesizing a correct program.
|
| 44 |
|
| 45 |
|
| 46 |
+
For our experiment, we compute pass@1, pass@10 and pass@20, corresponding to the unit test pass rate when selecting 1, 10 and 20 samples, respectively, from the candidate solutions.
|
| 47 |
|
| 48 |
```
|
| 49 |
|
| 50 |
+
Results: {'pass@1': 0.1, 'pass@10': 0.7631, 'pass@20': 1.0}
|
| 51 |
|
| 52 |
````
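These scores can be reproduced by hand. The sketch below assumes the metric uses the standard unbiased pass@k estimator, `1 - C(n - c, k) / C(n, k)`, where `n` is the number of generated samples and `c` the number that pass the unit tests. The function name `pass_at_k` is our own, and the values `n = 20` and `c = 2` are the counts reported for this run.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, of which c are correct."""
    if n - c < k:
        # Every subset of size k must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


n_samples, n_correct = 20, 2  # counts reported for this experiment
for k in (1, 10, 20):
    print(f"pass@{k}: {pass_at_k(n_samples, n_correct, k):.4f}")
```

With these inputs the estimator gives 0.1 for pass@1, roughly 0.763 for pass@10 and 1.0 for pass@20, in line with the results reported above.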
|
| 53 |
|
| 54 |
+
If we take a closer look at the unit test results for each candidate solution, we find that 2 of them passed the unit tests. This means that we have 2 correct solutions among 20, which corresponds to our pass@1 value `2/20 = 0.1`. The scores pass@10 and pass@20 are higher, because the more samples we select from the candidate completions, the more likely we are to include a correct implementation. As
|
| 55 |
+
for pass@20, it is `1`, since selecting all 20 candidates guarantees that the problem gets solved, which gives a 100% success rate. If you are curious about the candidate solutions that passed the tests, they both implemented this function:
|
| 56 |
|
| 57 |
```python
|
| 58 |
|