TildeSIA committed · verified
Commit e3895ef · 1 Parent(s): 77f83c5

Update README.md

Files changed (1): README.md (+7 −4)

README.md CHANGED
@@ -106,17 +106,20 @@ outputs = model.generate(
 # Evaluation
 ## Per-Character Perplexity
 **What is Perplexity?** Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while a high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned language patterns more effectively. It's less "surprised" by what it encounters because it better understands how the language works.
-**Why Character-Level?** Different language models use different internal vocabularies - some break text into whole words, others into word fragments, and some into individual characters. This makes direct comparison difficult.
-Character-level perplexity creates a standardised comparison by calculating how well each model would theoretically perform if we measured their predictions character-by-character. We're not changing how the models work - instead, we use mathematical conversion to approximate their character-level performance based on their predictions.
-
 Perplexity fairly evaluates how well each model handles:
 - Spelling accuracy across a diverse vocabulary
 - Grammar rules that span multiple words
 - Sentence structure and flow
-- Language-specific patterns (like how different languages form plurals or compound words)
+- Language-specific patterns (how different languages form plural forms or compound words)
+
+**Why Character-Level?** Different language models use different internal vocabularies - some break text into whole words, others into word fragments, and some into individual characters. This makes direct comparison difficult.
+Character-level perplexity creates a standardised comparison by calculating how well each model would theoretically perform if we measured their predictions character-by-character. We're not changing how the models work - instead, we use mathematical conversion to approximate their character-level performance based on their predictions.
+
 **Why does this Matter?** Models with lower perplexity generally perform better on real-world tasks like text generation, translation, and understanding context. It's a reliable indicator of overall language competency across different applications.
+
 **What data did we use?**
 We use WMT24++ as it is a multilingual, language-parallel evaluation set that none of the models have seen during training. WMT24++ is a composite of texts from news, literature, speech, and social media; thus, it is suitable for foundational model benchmarking.
+
 | Language | TildeOpen-30B | Gemma-2-27B | EuroLLM-9B | ALIA-40B |
 |----------|---------------|-------------|------------|-----------------|
 | Bulgarian | **2.1716** | 2.3541 | 2.3502 | 2.2411 |
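The "What is Perplexity?" definition above corresponds to a standard formula: the exponential of the average negative log-likelihood the model assigns to the text. A minimal sketch of that computation, assuming you already have per-token log-probabilities (the function name and input format are illustrative, not from the model card):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token).

    `token_logprobs` holds the natural-log probabilities the model
    assigned to each token of the evaluation text (illustrative
    input format, not tied to any specific library).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that gives every token probability 0.5 is effectively
# "choosing between two options" at each step, so its perplexity is 2:
print(perplexity([math.log(0.5)] * 4))  # -> 2.0
```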
 
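The "mathematical conversion" the README mentions for the character-level numbers is not spelled out in the diff. A common way to make perplexities tokenizer-independent (assumed here, not confirmed by the model card) is to renormalise the total negative log-likelihood by character count instead of token count; equivalently, token-level perplexity raised to the tokens-per-character ratio:

```python
import math

def char_perplexity(total_nll_nats, num_chars):
    """Renormalise a text's total negative log-likelihood (in nats,
    summed over tokens) by its character count, so models with
    different vocabularies can be compared on the same scale.
    (Assumed conversion, not taken from the model card.)"""
    return math.exp(total_nll_nats / num_chars)

# Example: token-level perplexity 8.0 over 100 tokens spanning 400 characters.
num_tokens, num_chars, token_ppl = 100, 400, 8.0
total_nll = num_tokens * math.log(token_ppl)

print(char_perplexity(total_nll, num_chars))  # ~1.68
print(token_ppl ** (num_tokens / num_chars))  # same value, alternate form
```

Under this formulation, a model with a large vocabulary (few tokens per character) and one with a small vocabulary (many tokens per character) both end up scored per character of text, which is what makes cross-model tables like the one above comparable.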