yu-val-weiss committed
Commit 17ddf40 · Parent(s): 2338f58
update documentation
README.md CHANGED

@@ -45,53 +45,72 @@ results = blimp.compute(model_id='pico-lm/pico-decoder')
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
 - **device** (str): device to run on, defaults to `cuda` when available
 
 ### Output Values
 
-This metric outputs a dictionary
-If one of the input texts is longer than the max input length of the model, then it is truncated to the max length for the perplexity computation.
 
-
-{'perplexities': [8.182524681091309, 33.42122268676758, 27.012239456176758], 'mean_perplexity': 22.871995608011883}
-```
-
-The range of this metric is [0, inf). A lower score is better.
-
-### Examples
 
-
 
 ```python
-
-
-
-
-
-
-
-
-
-print(round(results["perplexities"][0], 2))
->>>32.25
 ```
 
-
 
 ```python
-
-
-
-
-
-results =
-
-
-
-
-
-
-
 ```
 
 ## Citation
 
 - **model_id** (str): model used for calculating BLiMP.
 - **batch_size** (int): the batch size to run texts through the model. Defaults to 16.
+- **predictions** (list[str]): names of metrics to run. Pass an empty list or `["*"]` to run all of them.
 - **device** (str): device to run on, defaults to `cuda` when available
+- **samples_per_set** (int): the number of samples per metric, defaults to 1_000. Maximum 1_000 (enforced with a `min` call).
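Taken together, a call that exercises every argument documented above might look like the following sketch. It assumes `load` comes from the `evaluate` library (the README snippets omit the import); the UIDs passed to `predictions` and the `samples_per_set` value are purely illustrative.

```python
from evaluate import load  # assumed source of `load`; the README snippets omit the import

blimp = load("pico-lm/blimp")

# Illustrative call exercising every documented argument.
results = blimp.compute(
    model_id="pico-lm/pico-decoder",
    batch_size=16,
    predictions=["adjunct_island", "anaphor_gender_agreement"],  # or [] / ["*"] for all 67
    device="cuda",
    samples_per_set=500,  # values above 1_000 are clamped to 1_000
)
print(results["accuracy"])
```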
 
 ### Output Values
 
+This metric outputs a dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
 
+An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
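To make that scoring rule concrete, here is a minimal sketch that scores one minimal pair by comparing total sentence log-probabilities under a causal LM. It is not the metric's own implementation; the model name is reused from the example further down, and the sentence pair is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sentence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `loss` is the mean negative log-likelihood over the predicted tokens
    # (sequence length minus one for a causal LM), so undo the mean.
    n_predicted = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_predicted


model_id = "distilgpt2"  # reused from the README example below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# One illustrative minimal pair (acceptable sentence first).
pairs = [("The cats annoy Tim.", "The cats annoys Tim.")]

correct = sum(
    sentence_logprob(model, tokenizer, good) > sentence_logprob(model, tokenizer, bad)
    for good, bad in pairs
)
print("accuracy:", correct / len(pairs))
```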
 
+Each score is in `[0,1]`. A **higher** score is better.
 
 ```python
+{
+    "accuracy": 0.621288127211,
+    "by_uid": {
+        "adjunct_island": 0.12761212512,  # rest of sub-datasets...
+    },
+    "by_phenomenon": {
+        "anaphor_agreement": 0.71287512125,  # rest of phenomena...
+    },
+}
 ```
 
+### Examples
+
+Calculating BLiMP with the predictions defined here:
 
 ```python
+def check_blimp():
+    # Load the metric
+    blimp = load("pico-lm/blimp")
+
+    # example with a small language model
+    results = blimp.compute(
+        model_id="distilgpt2",
+        batch_size=16,
+        predictions=["*"],
+    )
+
+    # Print results
+    print("Overall accuracy:", results["accuracy"])
+    # >>> Overall accuracy: 0.5035074626865672
+    print("Top 5 best performing uids:")
+    sorted_results = sorted(results["by_uid"].items(), key=lambda x: x[1], reverse=True)
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    # >>> Top 5 best performing uids:
+    # >>> anaphor_number_agreement: 0.919
+    # >>> anaphor_gender_agreement: 0.868
+    # >>> matrix_question_npi_licensor_present: 0.840
+    # >>> wh_vs_that_no_gap: 0.787
+    # >>> sentential_negation_npi_licensor_present: 0.729
+
+    print("Top 5 best performing phenomena:")
+    sorted_results = sorted(
+        results["by_phenomenon"].items(), key=lambda x: x[1], reverse=True
+    )
+    for phenomenon, accuracy in sorted_results[:5]:
+        print(f"{phenomenon}: {accuracy:.3f}")
+    # >>> Top 5 best performing phenomena:
+    # >>> anaphor_agreement: 0.893
+    # >>> argument_structure: 0.597
+    # >>> npi_licensing: 0.579
+    # >>> filler_gap_dependency: 0.561
+    # >>> control_raising: 0.533
 ```
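The phenomenon-level scores printed above are consistent, up to rounding, with taking an unweighted mean of the UID-level scores within each phenomenon. The sketch below illustrates that aggregation under this assumption; the two-entry UID-to-phenomenon mapping is hand-written for the example and is not taken from the metric's code.

```python
# Sketch only: assumes `by_phenomenon` is an unweighted mean over each
# phenomenon's UIDs, and hard-codes a two-entry UID -> phenomenon mapping.
by_uid = {
    "anaphor_number_agreement": 0.919,
    "anaphor_gender_agreement": 0.868,
}
uid_to_phenomenon = {
    "anaphor_number_agreement": "anaphor_agreement",
    "anaphor_gender_agreement": "anaphor_agreement",
}

grouped: dict[str, list[float]] = {}
for uid, score in by_uid.items():
    grouped.setdefault(uid_to_phenomenon[uid], []).append(score)

by_phenomenon = {ph: sum(scores) / len(scores) for ph, scores in grouped.items()}
print(by_phenomenon)  # ~0.8935, matching the reported anaphor_agreement of 0.893 up to rounding
```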
 
 ## Citation
blimp.py CHANGED

@@ -128,8 +128,6 @@ Args:
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
     An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
-Examples:
-    TODO: examples.
 """
 
 Returns:
     blimp: dictionary containing the blimp scores for each of the 67 sub-datasets, as well as the overall accuracy.
     An LM’s overall accuracy on BLiMP is simply the proportion of the 67,000 minimal pairs in which the model assigns a higher probability to the acceptable sentence.
 """
 