yu-val-weiss committed
Commit 17ddf40 · Parent(s): 2338f58
update documentation
README.md CHANGED

@@ -45,53 +45,72 @@ results = blimp.compute(model_id='pico-lm/pico-decoder')
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
 - **device** (str): device to run on, defaults to `cuda` when available
 
 ### Output Values
 
-This metric outputs a dictionary
-If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
 
-
-{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
-```
-
-The range of this metric is [0, inf). A lower score is better.
-
-### Examples
 
-
 
 ```python
-
-
-
-
-
-
-
-
-
-print(round(results["perplexities"][0], 2))
->>>32.25
 ```
 
-
 
 ```python
-
-
-
-
-
-results =
-
-
-
-
-
-
-
 ```
 
 ## Citation
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
+- **predictions** (list[str]): names of metrics to run. Pass an empty list or `["*"]` to run all of them.
 - **device** (str): device to run on, defaults to `cuda` when available
+- **samples_per_set** (int): the number of samples per metric, defaults to 1_000. Maximum 1_000 (enforced with a `min` call).
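Taken together, a call that exercises every argument documented above might look like the following sketch. It assumes `load` comes from the `evaluate` library (the README snippets omit the import); the UIDs passed to `predictions` and the `samples_per_set` value are purely illustrative.

```python
from evaluate import load  # assumed source of `load`; the README snippets omit the import

blimp = load("pico-lm/blimp")

# Illustrative call exercising every documented argument.
results = blimp.compute(
    model_id="pico-lm/pico-decoder",
    batch_size=16,
    predictions=["adjunct_island", "anaphor_gender_agreement"],  # or [] / ["*"] for all 67
    device="cuda",
    samples_per_set=500,  # values above 1_000 are clamped to 1_000
)
print(results["accuracy"])
```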
 
 ### Output Values
 
+This metric outputs a dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
 
+An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
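To make that scoring rule concrete, here is a minimal sketch that scores one minimal pair by comparing total sentence log-probabilities under a causal LM. It is not the metric's own implementation; the model name is reused from the example further down, and the sentence pair is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sentence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `loss` is the mean negative log-likelihood over the predicted tokens
    # (sequence length minus one for a causal LM), so undo the mean.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted


model_id = "distilgpt2"  # reused from the README example below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# One illustrative minimal pair (acceptable sentence first).
pairs = [("The cats annoy Tim.", "The cats annoys Tim.")]

correct = sum(
    sentence_logprob(model, tokenizer, good) > sentence_logprob(model, tokenizer, bad)
    for good, bad in pairs
)
print("accuracy:", correct / len(pairs))
```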
 
+Each score is in `[0,1]`. A **higher** score is better.
 
 ```python
+{
+    "accuracy": 0.621288127211,
+    "by_uid": {
+        "adjunct_island": 0.12761212512,  # rest of sub-datasets...
+    },
+    "by_phenomenon": {
+        "anaphor_agreement": 0.71287512125,  # rest of phenomena...
+    },
+}
 ```
 
+### Examples
+
+Calculating BLiMP with the predictions defined here:
 
 ```python
+def check_blimp():
+    # Load the metric
+    blimp = load("pico-lm/blimp")
+
+    # example with a small language model
+    results = blimp.compute(
+        model_id="distilgpt2",
+        batch_size=16,
+        predictions=["*"],
+    )
+
+    # Print results
+    print("Overall accuracy:", results["accuracy"])
+    # >>> Overall accuracy: 0.5035074626865672
+    print("Top 5 best performing uids:")
+    sorted_results = sorted(results["by_uid"].items(), key=lambda x: x[1], reverse=True)
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    # >>> Top 5 best performing uids:
+    # >>> anaphor_number_agreement: 0.919
+    # >>> anaphor_gender_agreement: 0.868
+    # >>> matrix_question_npi_licensor_present: 0.840
+    # >>> wh_vs_that_no_gap: 0.787
+    # >>> sentential_negation_npi_licensor_present: 0.729
+
+    print("Top 5 best performing phenomena:")
+    sorted_results = sorted(
+        results["by_phenomenon"].items(), key=lambda x: x[1], reverse=True
+    )
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    # >>> Top 5 best performing phenomena:
+    # >>> anaphor_agreement: 0.893
+    # >>> argument_structure: 0.597
+    # >>> npi_licensing: 0.579
+    # >>> filler_gap_dependency: 0.561
+    # >>> control_raising: 0.533
 ```
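The phenomenon-level scores printed above are consistent, up to rounding, with taking an unweighted mean of the UID-level scores within each phenomenon. The sketch below illustrates that aggregation under this assumption; the two-entry UID-to-phenomenon mapping is hand-written for the example and is not taken from the metric's code.

```python
# Sketch only: assumes `by_phenomenon` is an unweighted mean over each
# phenomenon's UIDs, and hard-codes a two-entry UID -> phenomenon mapping.
by_uid = {
    "anaphor_number_agreement": 0.919,
    "anaphor_gender_agreement": 0.868,
}
uid_to_phenomenon = {
    "anaphor_number_agreement": "anaphor_agreement",
    "anaphor_gender_agreement": "anaphor_agreement",
}

grouped: dict[str, list[float]] = {}
for uid, score in by_uid.items():
    grouped.setdefault(uid_to_phenomenon[uid], []).append(score)

by_phenomenon = {ph: sum(scores) / len(scores) for ph, scores in grouped.items()}
print(by_phenomenon)  # ~0.8935, matching the reported anaphor_agreement of 0.893 up to rounding
```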
 
 ## Citation
blimp.py CHANGED

@@ -128,8 +128,6 @@ Args:
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
     An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
-Examples:
-    TODO: examples.
 """
 
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
     An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
 """
 