vimalk78 committed on
Commit
bfd6ff4
·
1 Parent(s): b05514b

docs: add soft minimum visualization ideas and vocabulary alternatives analysis


- Add comprehensive visualization concepts for soft minimum method
- Add detailed analysis of vocabulary alternatives beyond WordFreq
- Add Python script to analyze word lists from Peter Norvig's home page
- Update .gitignore for fine-tuned models and T5 model cache
- Include SUBTLEX dataset and Norvig vocabulary analysis files

Signed-off-by: Vimal Kumar <[email protected]>

crossword-app/backend-py/docs/softmin_visualization_ideas.md ADDED
@@ -0,0 +1,227 @@
1
+ # Soft Minimum Visualization Ideas
2
+
3
+ This document outlines visualization concepts to showcase how the soft minimum method works for multi-topic word intersection in the crossword generator.
4
+
5
+ ## Overview
6
+
7
+ The soft minimum method uses the formula `-log(sum(exp(-beta * similarities))) / beta` to find words that are genuinely relevant to ALL topics simultaneously. Unlike simple averaging, which can promote words that are highly relevant to just one topic, soft minimum penalizes words that score poorly on any individual topic.
8
+
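+ As a quick illustration of the difference (a minimal NumPy sketch with made-up similarity values, not the production scoring code):
+ 
+ ```python
+ import numpy as np
+ 
+ def soft_minimum(similarities, beta=10.0):
+     """Smooth approximation of min(similarities); approaches the true minimum as beta grows."""
+     sims = np.asarray(similarities, dtype=float)
+     return -np.log(np.sum(np.exp(-beta * sims))) / beta
+ 
+ balanced = [0.5, 0.5, 0.5]    # e.g. a word moderately relevant to all three topics
+ skewed   = [0.9, 0.1, 0.05]   # e.g. a word strongly tied to only one topic
+ 
+ print(np.mean(balanced), np.mean(skewed))            # ~0.50 vs ~0.35: averaging ranks them close together
+ print(soft_minimum(balanced), soft_minimum(skewed))  # ~0.39 vs ~0.003: soft minimum strongly prefers the balanced word
+ ```
+ 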
9
+ Visualizations would help users understand:
10
+ - How soft minimum differs from averaging
11
+ - Why it produces better semantic intersections
12
+ - How the beta parameter affects results
13
+ - How the adaptive beta mechanism works
14
+
15
+ ## Visualization Concepts
16
+
17
+ ### 1. Heat Map Comparison (🌟 Most Impactful)
18
+
19
+ **Concept**: Side-by-side heat maps showing individual topic similarities vs soft minimum scores.
20
+
21
+ **Layout**:
22
+ - **Left Heat Map**: Individual Similarities
23
+ - Rows: Top 50-100 words
24
+ - Columns: Individual topics (e.g., "universe", "movies", "languages")
25
+ - Color intensity: Similarity score (0.0 = white, 1.0 = dark blue)
26
+
27
+ - **Right Heat Map**: Soft Minimum Results
28
+ - Same rows (words)
29
+ - Single column: Soft minimum score
30
+ - Color intensity: Final soft minimum score
31
+
32
+ **Key Insights**:
33
+ - Words like "anime" would show moderate blue across all topics → high soft minimum score
34
+ - Words like "astronomy" would show dark blue for "universe", white for others → low soft minimum score
35
+ - Visually demonstrates how soft minimum penalizes topic-specific words
36
+
37
+ **Implementation**:
38
+ - Frontend: Use libraries like D3.js or Plotly for interactive heat maps
39
+ - Backend: Return individual topic similarities alongside soft minimum scores
40
+
41
+ ### 2. 3D Scatter Plot (For 3-Topic Cases)
42
+
43
+ **Concept**: 3D space where each axis represents similarity to one topic.
44
+
45
+ **Layout**:
46
+ - X-axis: Similarity to topic 1
47
+ - Y-axis: Similarity to topic 2
48
+ - Z-axis: Similarity to topic 3
49
+ - Point size/color: Soft minimum score
50
+ - Point labels: Word names (on hover)
51
+
52
+ **Key Insights**:
53
+ - Words near the center (similar to all topics) = large, bright points
54
+ - Words near axes (similar to only one topic) = small, dim points
55
+ - Shows the "volume" of intersection vs union
56
+
57
+ **Implementation**:
58
+ - Use Three.js or Plotly 3D
59
+ - Interactive rotation and zoom
60
+ - Filter points by soft minimum threshold
61
+
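+ A minimal Plotly sketch of such a panel (the data shapes, point sizing, and color choices are illustrative assumptions):
+ 
+ ```python
+ import plotly.graph_objects as go
+ 
+ def scatter_3d(similarities, scores):
+     """similarities: {word: [sim_topic1, sim_topic2, sim_topic3]}, scores: {word: soft_min_score}."""
+     words = list(similarities)
+     xs, ys, zs = zip(*(similarities[w] for w in words))
+     fig = go.Figure(go.Scatter3d(
+         x=xs, y=ys, z=zs, mode="markers", text=words, hoverinfo="text",
+         marker=dict(size=[5 + 25 * scores[w] for w in words],        # bigger point = higher soft minimum
+                     color=[scores[w] for w in words], colorscale="Viridis"),
+     ))
+     fig.update_layout(scene=dict(xaxis_title="topic 1", yaxis_title="topic 2", zaxis_title="topic 3"))
+     return fig
+ ```
+ 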
62
+ ### 3. Interactive Beta Slider
63
+
64
+ **Concept**: Real-time visualization of how beta parameter affects word selection.
65
+
66
+ **Layout**:
67
+ - Horizontal slider: Beta value (1.0 to 20.0)
68
+ - Bar chart: Word scores (sorted descending)
69
+ - Threshold line: Current similarity threshold
70
+ - Counter: Number of words above threshold
71
+
72
+ **Key Insights**:
73
+ - High beta (strict): Only a few words pass, distribution is peaked
74
+ - Low beta (permissive): More words pass, distribution flattens
75
+ - Shows the adaptive beta mechanism in action
76
+
77
+ **Implementation**:
78
+ - React component with range slider
79
+ - Real-time recalculation of soft minimum scores
80
+ - Animated transitions as beta changes
81
+
82
+ ### 4. Venn Diagram with Words
83
+
84
+ **Concept**: Position words in Venn diagram based on topic similarities.
85
+
86
+ **Layout** (for 2-3 topics):
87
+ - Circles represent individual topics
88
+ - Words positioned based on similarity combinations
89
+ - Words in intersections = high soft minimum scores
90
+ - Words in single circles = low soft minimum scores
91
+ - Word opacity/size based on final soft minimum score
92
+
93
+ **Key Insights**:
94
+ - Visual representation of "true intersections"
95
+ - Words in overlap regions are what soft minimum promotes
96
+ - Empty intersection regions explain why some topic combinations yield few words
97
+
98
+ **Implementation**:
99
+ - SVG-based Venn diagrams
100
+ - Dynamic positioning algorithm
101
+ - Interactive word tooltips
102
+
103
+ ### 5. Before/After Word Clouds
104
+
105
+ **Concept**: Compare averaging vs soft minimum results using word clouds.
106
+
107
+ **Layout**:
108
+ - **Left Cloud**: "Averaging Method"
109
+ - Word size based on average similarity
110
+ - May prominently feature problematic words like "ethology" for Art+Books
111
+
112
+ - **Right Cloud**: "Soft Minimum Method"
113
+ - Word size based on soft minimum score
114
+ - Should prominently feature true intersections like "literature"
115
+
116
+ **Key Insights**:
117
+ - Dramatic visual difference in word prominence
118
+ - Shows quality improvement at a glance
119
+ - Easy to understand for non-technical users
120
+
121
+ **Implementation**:
122
+ - Use word cloud libraries (wordcloud2.js, D3-cloud)
123
+ - Color coding by topic affinity
124
+ - Interactive word selection
125
+
126
+ ### 6. Mathematical Formula Animation
127
+
128
+ **Concept**: Step-by-step visualization of soft minimum calculation.
129
+
130
+ **Layout**:
131
+ - Example word with similarities: [0.8, 0.2, 0.1] (universe, movies, languages)
132
+ - Animated steps:
133
+ 1. Show individual similarities as bars
134
+ 2. Apply exponential transformation: exp(-beta * sim)
135
+ 3. Sum the exponentials
136
+ 4. Apply logarithm and normalization
137
+ 5. Compare the result to the simple average (0.37); a worked sketch follows below
138
+
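+ For reference, a small Python walkthrough of these steps with beta = 10 (values are illustrative):
+ 
+ ```python
+ import numpy as np
+ 
+ sims = np.array([0.8, 0.2, 0.1])    # universe, movies, languages
+ beta = 10.0
+ 
+ exponentials = np.exp(-beta * sims)   # step 2: [~3.4e-4, ~0.135, ~0.368]
+ total = exponentials.sum()            # step 3: ~0.504
+ soft_min = -np.log(total) / beta      # step 4: ~0.069
+ 
+ print(soft_min)       # ~0.07 -- dominated by the weakest similarity (0.1)
+ print(sims.mean())    # ~0.37 -- the simple average from step 5
+ ```
+ 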
139
+ **Key Insights**:
140
+ - How the minimum similarity dominates the calculation
141
+ - Why soft minimum ≈ minimum similarity for high beta
142
+ - Mathematical intuition behind the formula
143
+
144
+ **Implementation**:
145
+ - Animated SVG or Canvas
146
+ - Step-by-step button progression
147
+ - Mathematical notation display
148
+
149
+ ### 7. Adaptive Beta Journey
150
+
151
+ **Concept**: Show the adaptive beta retry process as a timeline.
152
+
153
+ **Layout**:
154
+ - Horizontal timeline showing beta decay: 10.0 → 7.0 → 4.9 → 3.4...
155
+ - For each beta value:
156
+ - Histogram of soft minimum scores
157
+ - Threshold line (adjusted)
158
+ - Count of valid words
159
+ - Decision: "Continue" or "Stop"
160
+
161
+ **Key Insights**:
162
+ - How threshold adjustment makes lower beta more permissive
163
+ - Why word count increases with each retry
164
+ - When the algorithm decides to stop
165
+
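+ A rough sketch of the retry loop behind this timeline is shown below; the scorer callable, the threshold rule, and the 0.7 decay factor (inferred from the 10.0 → 7.0 → 4.9 sequence above) are assumptions for illustration:
+ 
+ ```python
+ def adaptive_softmin_search(candidates, topic_vectors, score_words, *,
+                             beta=10.0, min_words=20, max_retries=5,
+                             decay=0.7, base_threshold=0.25):
+     """Retry loop: relax beta (and the threshold with it) until enough words survive.
+     `score_words(candidates, topic_vectors, beta)` is any callable returning {word: soft_min_score}."""
+     kept, threshold = [], base_threshold
+     for _ in range(max_retries):
+         scores = score_words(candidates, topic_vectors, beta)
+         threshold = base_threshold * (beta / 10.0)   # assumed rule: lower beta -> lower, more permissive threshold
+         kept = [w for w, s in scores.items() if s >= threshold]
+         if len(kept) >= min_words:
+             break                                    # decision: "Stop"
+         beta *= decay                                # 10.0 -> 7.0 -> 4.9 -> 3.4 ...
+     return kept, beta, threshold
+ ```
+ 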
166
+ **Implementation**:
167
+ - Timeline component with expandable sections
168
+ - Small multiples showing score distributions
169
+ - Real-time data from debug logs
170
+
171
+ ## Implementation Priorities
172
+
173
+ ### Phase 1: Essential (MVP)
174
+ 1. **Heat Map Comparison** - Most educational value
175
+ 2. **Interactive Beta Slider** - Shows parameter effects clearly
176
+
177
+ ### Phase 2: Enhanced Understanding
178
+ 3. **Before/After Word Clouds** - Easy to understand impact
179
+ 4. **Mathematical Formula Animation** - Educational for technical users
180
+
181
+ ### Phase 3: Advanced Analysis
182
+ 5. **3D Scatter Plot** - For deep analysis of 3-topic cases
183
+ 6. **Venn Diagram** - Complex positioning algorithms
184
+ 7. **Adaptive Beta Journey** - Comprehensive debugging tool
185
+
186
+ ## Technical Implementation Notes
187
+
188
+ ### Backend Changes Needed
189
+ - Return individual topic similarities alongside soft minimum scores
190
+ - Add debug endpoint for visualization data
191
+ - Include beta parameter and threshold information in responses
192
+
193
+ ### Frontend Integration
194
+ - Add to existing debug tab
195
+ - Use React components for interactivity
196
+ - Responsive design for different screen sizes
197
+ - Export/save visualization capabilities
198
+
199
+ ### Data Format
200
+ ```json
201
+ {
202
+ "visualization_data": {
203
+ "individual_similarities": {
204
+ "word1": [0.8, 0.2, 0.1],
205
+ "word2": [0.3, 0.9, 0.4]
206
+ },
207
+ "soft_minimum_scores": {
208
+ "word1": 0.15,
209
+ "word2": 0.32
210
+ },
211
+ "beta_used": 7.0,
212
+ "threshold_adjusted": 0.175,
213
+ "topics": ["universe", "movies", "languages"]
214
+ }
215
+ }
216
+ ```
217
+
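+ One way to expose this payload is a dedicated debug route. The sketch below assumes a FastAPI-style backend and a placeholder `build_visualization_data` helper; neither is confirmed by the current codebase:
+ 
+ ```python
+ from fastapi import FastAPI
+ 
+ app = FastAPI()
+ 
+ def build_visualization_data(topics):
+     # Placeholder: the real helper would compute per-topic similarities and
+     # soft minimum scores for candidate words, as in the JSON example above.
+     return {"individual_similarities": {}, "soft_minimum_scores": {},
+             "beta_used": 10.0, "threshold_adjusted": 0.25, "topics": topics}
+ 
+ @app.get("/debug/softmin")
+ def softmin_debug(topics: str):
+     """Return visualization data for a comma-separated list of topics."""
+     topic_list = [t.strip() for t in topics.split(",") if t.strip()]
+     return {"visualization_data": build_visualization_data(topic_list)}
+ ```
+ 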
218
+ ## Expected Impact
219
+
220
+ These visualizations would:
221
+ 1. **Educate users** about the soft minimum method
222
+ 2. **Build confidence** in the algorithm's choices
223
+ 3. **Enable debugging** of problematic topic combinations
224
+ 4. **Facilitate research** into parameter optimization
225
+ 5. **Demonstrate value** of the multi-topic intersection approach
226
+
227
+ The heat map comparison alone would be worth implementing, as it clearly shows why soft minimum produces higher-quality word intersections than simple averaging.
crossword-app/backend-py/docs/vocabulary_alternatives_analysis.md ADDED
@@ -0,0 +1,405 @@
1
+ # Vocabulary Alternatives Analysis: Beyond WordFreq
2
+
3
+ ## Executive Summary
4
+
5
+ WordFreq, while useful for general frequency analysis, introduces vocabulary quality problems for crossword generation because of its web-scraped, uncurated nature. Hands-on evaluation of the alternatives showed that most "curated" crossword lists have significant quality issues of their own and would require substantial cleanup effort.
6
+
7
+ ### **Updated Recommendations (Post-Evaluation):**
8
+ 1. **Primary**: COCA free sample (6K high-quality words with rich metadata) + Peter Norvig's clean 100K list
9
+ 2. **Quality Leader**: COCA full version (if budget allows) - 14 billion words, sophisticated metadata
10
+ 3. **Fallback**: SUBTLEX (reasonable quality, needs programming to parse properly)
11
+ 4. **Avoid**: Most crossword-specific lists contain junk data requiring extensive cleanup
12
+ 5. **Semantic Processing**: Keep all-mpnet-base-v2 (working well)
13
+
14
+ ## Current Issues with WordFreq Vocabulary
15
+
16
+ ### Problems Identified:
17
+ 1. **Web-based contamination**: Includes Reddit, Twitter, and web crawl data with typos, slang, and internet-specific language
18
+ 2. **No quality filtering**: Purely frequency-based without considering appropriateness for crosswords
19
+ 3. **Mixed registers**: Combines formal and informal language indiscriminately
20
+ 4. **Problematic intersections**: Generates words like "ethology", "guns", "porn" for topics like "Art+Books"
21
+ 5. **Limited metadata**: No information about word suitability, part-of-speech, or crossword usage
22
+ 6. **AI contamination risk**: WordFreq author stopped updates in 2024 due to generative AI polluting data sources
23
+
24
+ ### Impact on Crossword Generation:
25
+ - Lower quality semantic intersections
26
+ - Inappropriate words for family-friendly puzzles
27
+ - Poor difficulty calibration
28
+ - Reduced solver experience quality
29
+
30
+ ## Superior Alternatives
31
+
32
+ ### 1. Crossword-Specific Word Lists (⚠️ QUALITY ISSUES FOUND)
33
+
34
+ #### A. Collaborative Word List (❌ NOT RECOMMENDED)
35
+ - **Source**: https://github.com/Crossword-Nexus/collaborative-word-list
36
+ - **Size**: 114,000+ words
37
+ - **Direct download**: `https://raw.githubusercontent.com/Crossword-Nexus/collaborative-word-list/main/xwordlist.dict`
38
+ - **QUALITY PROBLEMS IDENTIFIED**:
39
+ - Contains nonsensical entries: `10THGENCONSOLE`, `1STGENERATIONCONSOLES`, `4XGAMES`
40
+ - Single letters: `A`, `AA`, `AAA`, `AAAA`
41
+ - Meaningless sequences: `AAAAH`, `AAAAUTOCLUB`
42
+ - **Verdict**: Requires extensive cleanup before use
43
+
44
+ #### B. Spread the Word(list) (❌ NOT RECOMMENDED)
45
+ - **Source**: https://www.spreadthewordlist.com
46
+ - **Size**: 114,000+ answers with scores
47
+ - **QUALITY PROBLEMS IDENTIFIED**:
48
+ - Garbage entries: `zzzzzzzzzzzzzzz`, `zzzquil`
49
+ - Malformed words: `aaaaddress`, `aabb`, `aabba`
50
+ - Random sequences: `aaiiiiiiiiiiiii`
51
+ - **Verdict**: Same quality issues as Collaborative List
52
+
53
+ #### C. Christopher Jones' Crossword Wordlist (⚠️ NEEDS CLEANUP)
54
+ - **Source**: https://github.com/christophsjones/crossword-wordlist
55
+ - **QUALITY PROBLEMS IDENTIFIED**:
56
+ - Long phrases: `"a week from now"`, `"a recipe for disaster"`
57
+ - Absurdly long compounds: `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`, `ABLEBODIEDSEAMAN`
58
+ - Arbitrary scoring: Many words with score 50 don't match claimed "common words you wouldn't hesitate to use"
59
+ - **Verdict**: Contains good data but needs significant filtering and rescoring
60
+
61
+ ### 2. SUBTLEX Psycholinguistic Databases (βœ… REASONABLE QUALITY)
62
+
63
+ #### SUBTLEX-US (American English)
64
+ - **Source**: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
65
+ - **Size**: 74,000+ words
66
+ - **Quality**: Based on film/TV subtitles (natural language exposure)
67
+ - **Scoring**: Zipf scale 1-7, contextual diversity metrics
68
+ - **License**: Free for research
69
+
70
+ #### EVALUATION RESULTS:
71
+ - **✅ Better quality**: Words are generally reasonable and appropriate
72
+ - **⚠️ Contains multi-word entries**: Some entries are short phrases rather than single words
73
+ - **⚠️ Requires programming**: The numerical data needs to be parsed and filtered properly (see the parsing sketch at the end of this section)
74
+ - **✅ Rich metadata**: Includes frequency, Zipf scores, part-of-speech, contextual diversity
75
+ - **✅ Research backing**: Proven to predict word processing difficulty better than traditional corpora
76
+
77
+ #### Advantages:
78
+ - **Psycholinguistic validity**: Better predictor of word processing difficulty
79
+ - **Clean vocabulary**: Professional media content (edited, appropriate)
80
+ - **Good difficulty calibration**: Zipf 1-3 = rare/hard, 4-7 = common/easy
81
+ - **Multiple languages**: Available for US, UK, Chinese, Welsh, Spanish
82
+
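+ The parsing work is modest. A sketch for the tab-separated SUBTLEX-US text file follows; the column names (`Word`, `SUBTLWF`) should be verified against the actual export, and the Zipf conversion assumes Zipf = log10(frequency per million) + 3:
+ 
+ ```python
+ import csv
+ import math
+ 
+ def load_subtlex_zipf(path="hack/SUBTLEX/SUBTLEXus74286wordstextversion.txt"):
+     """Read SUBTLEX-US and derive a Zipf score for each single alphabetic word."""
+     zipf = {}
+     with open(path, newline="", encoding="utf-8") as f:
+         for row in csv.DictReader(f, delimiter="\t"):
+             word = row["Word"].strip().lower()
+             if not word.isalpha():          # skip phrases and punctuation entries
+                 continue
+             freq_per_million = float(row["SUBTLWF"])
+             zipf[word] = math.log10(freq_per_million) + 3.0
+     return zipf
+ ```
+ 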
83
+ ### 3. COCA (Corpus of Contemporary American English) (🌟 EXCELLENT QUALITY)
84
+
85
+ #### Available Data:
86
+ - **Free tier**: ~6,000 words with rich metadata and collocates
87
+ - **Full version**: 14 billion words with sophisticated metadata (paid)
88
+ - **Source**: https://www.wordfrequency.info/ and https://github.com/brucewlee/COCA-WordFrequency
89
+ - **Composition**: Balanced across news, fiction, academic, spoken
90
+
91
+ #### EVALUATION RESULTS:
92
+ - **🌟 Excellent quality**: "Phew, this is good" - professional curation shows
93
+ - **✅ Rich metadata**: Frequency, part-of-speech, genre distribution, collocates
94
+ - **✅ Clean vocabulary**: Academic standard filtering
95
+ - **✅ Balanced representation**: Multiple text types ensure comprehensive coverage
96
+ - **💰 Premium option**: Full version provides 14 billion words with sophisticated metadata
97
+ - **✅ Free sample sufficient**: 6K words could serve as high-quality core vocabulary
98
+
99
+ #### Advantages:
100
+ - **Academic gold standard**: Most accurate and reliable word frequency data
101
+ - **Professional curation**: High editorial and scholarly standards
102
+ - **Balanced corpus**: News, fiction, academic, spoken genres represented
103
+ - **Collocate data**: Helps understand word usage patterns and context
104
+ - **Research proven**: Widely used and validated in linguistics research
105
+
106
+ ### 4. Peter Norvig's Clean Word Lists (🌟 EXCELLENT DISCOVERY)
107
+
108
+ #### Norvig's Word Count Lists
109
+ - **Source**: https://norvig.com/ngrams/
110
+ - **Key Resource**: `count_1w100k.txt` - 100,000 most popular words, all uppercase
111
+ - **Quality**: Really clean vocabulary without junk entries
112
+ - **Problem**: Only raw corpus counts; no curated difficulty or part-of-speech metadata
113
+
114
+ #### EVALUATION RESULTS:
115
+ - **✅ Very clean**: Properly curated, with none of the garbage found in other sources
116
+ - **✅ Good coverage**: 100K words should provide sufficient vocabulary
117
+ - **✅ Reliable source**: Peter Norvig (Google's Director of Research) ensures quality
118
+ - **❌ No curated frequency metadata**: Would need to cross-reference with other sources for difficulty grading
119
+ - **💡 Hybrid opportunity**: Could combine Norvig's clean words with frequency data from SUBTLEX or COCA
120
+
121
+ #### Potential Implementation:
122
+ ```python
123
+ # Use Norvig's clean word list as vocabulary base
124
+ norvig_words = load_norvig_100k()
125
+ # Cross-reference with SUBTLEX for frequency data
126
+ subtlex_freq = load_subtlex_frequencies()
127
+ # Result: Clean vocabulary + reliable frequency information
128
+ ```
129
+
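+ For the first step, a concrete loader is straightforward. The sketch below assumes the tab-separated `WORD<TAB>count` format that `hack/analyze_norvig_vocabulary.py` in this commit also parses:
+ 
+ ```python
+ from pathlib import Path
+ 
+ def load_norvig_100k(path="hack/norvig/count_1w100k.txt"):
+     """Return the set of uppercase words from Norvig's 100K list (counts are ignored here)."""
+     words = set()
+     for line in Path(path).read_text(encoding="utf-8").splitlines():
+         token = line.split("\t")[0].strip()
+         if token.isalpha():
+             words.add(token.upper())
+     return words
+ ```
+ 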
130
+ ### 5. Premium Options (For Comparison - Not Evaluated)
131
+
132
+ #### XWordInfo (NYT-focused)
133
+ - **Cost**: $50 Angel membership
134
+ - **Quality**: Every NYT crossword ever published
135
+ - **Size**: 200,000+ words
136
+ - **Note**: Not evaluated in this analysis
137
+
138
+ #### Cruciverb
139
+ - **Cost**: $35 Gold membership
140
+ - **Quality**: Multiple publication sources
141
+ - **Note**: Not evaluated in this analysis
142
+
143
+ ## Detailed Comparison Analysis (Updated with Evaluation Results)
144
+
145
+ | Source | Size | Quality Score | Frequency Data | Evaluated Quality | Cost | Recommendation |
146
+ |--------|------|---------------|----------------|------------------|------|----------------|
147
+ | **WordFreq** | 100K+ | ❌ Web-scraped | ✅ Frequency | ❌ Original issues | Free | ⚠️ Current baseline |
148
+ | **Collaborative List** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `10THGENCONSOLE`, `AAAA` | Free | ❌ **AVOID** |
149
+ | **Spread Wordlist** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `zzzzzzzzzzzzzzz`, `aabb` | Free | ❌ **AVOID** |
150
+ | **C. Jones Wordlist** | ~50K | ⚠️ Needs filtering | ⚠️ Arbitrary scoring | ⚠️ Long phrases, compounds | Free | ⚠️ **CLEANUP REQUIRED** |
151
+ | **SUBTLEX-US** | 74K | ✅ Reasonable quality | ✅ Zipf 1-7 | ✅ Clean, some phrases | Free | ✅ **VIABLE** |
152
+ | **COCA (free)** | 6K | 🌟 Excellent | ✅ Rich metadata | 🌟 "Phew, this is good" | Free | 🌟 **RECOMMENDED** |
153
+ | **COCA (full)** | 1M+ | 🌟 Excellent | ✅ Rich metadata | 🌟 Sophisticated metadata | $$$ | 🌟 **PREMIUM CHOICE** |
154
+ | **Norvig 100K** | 100K | 🌟 Very clean | ❌ None included | 🌟 Clean, no garbage | Free | 🌟 **HYBRID BASE** |
155
+
156
+ ## Updated Implementation Recommendations (Post-Evaluation)
157
+
158
+ ### Recommended Approach: Hybrid COCA + Norvig System
159
+
160
+ Based on hands-on evaluation, the cleanest approach combines the best of multiple sources:
161
+
162
+ #### Option A: COCA Free + Extended Coverage (Recommended)
163
+ ```python
164
+ # 1. Load COCA 6K words as high-quality core
165
+ def load_coca_core():
166
+ """Load 6K high-quality words from COCA free sample"""
167
+ # Excellent quality, rich metadata, reliable frequencies
168
+ return parse_coca_free_sample()
169
+
170
+ # 2. Extend with filtered SUBTLEX for broader coverage
171
+ def extend_with_subtlex():
172
+ """Add clean words from SUBTLEX for broader coverage"""
173
+ # Filter out phrases, keep single words only
174
+ # Use Zipf scores for difficulty grading
175
+ return filtered_subtlex_words()
176
+
177
+ # 3. Cross-reference with Norvig's clean list for validation
178
+ def validate_with_norvig():
179
+ """Use Norvig's 100K list to validate word cleanliness"""
180
+ norvig_clean = load_norvig_100k()
181
+ # Only include words that appear in Norvig's curated list
182
+ return validated_vocabulary
183
+ ```
184
+
185
+ #### Option B: Norvig Base + Frequency Cross-Reference (Alternative)
186
+ ```python
187
+ # 1. Start with Norvig's clean 100K vocabulary
188
+ norvig_words = load_norvig_100k()
189
+
190
+ # 2. Cross-reference with COCA for frequency data
191
+ coca_freq = load_coca_frequencies() # Free 6K sample
192
+ subtlex_freq = load_subtlex_frequencies() # Broader coverage
193
+
194
+ # 3. Assign frequencies with fallback chain
195
+ def get_word_difficulty(word):
196
+ if word in coca_freq:
197
+ return coca_freq[word] # Highest quality
198
+ elif word in subtlex_freq:
199
+ return subtlex_freq[word] # Good quality
200
+ else:
201
+ return default_difficulty # Fallback
202
+ ```
203
+
204
+ ### Why This Hybrid Approach Works
205
+
206
+ #### Problems with "Crossword-Specific" Lists:
207
+ - **Collaborative Word List**: Contains `10THGENCONSOLE`, `AAAA`, `AAAAUTOCLUB`
208
+ - **Spread the Wordlist**: Contains `zzzzzzzzzzzzzzz`, `aaaaddress`, `aabba`
209
+ - **Christopher Jones**: Contains `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`
210
+ - **Verdict**: All require extensive cleanup, defeating their supposed advantage
211
+
212
+ #### Advantages of COCA + Norvig Hybrid:
213
+ - **COCA Free**: 6K professionally curated, academically validated words
214
+ - **Norvig 100K**: Clean vocabulary from Google's Director of Research
215
+ - **SUBTLEX**: Reasonable quality with psycholinguistic validity
216
+ - **No garbage**: Avoid the cleanup nightmare of "crossword-specific" lists
217
+ - **Research backing**: Academic and industry validation
218
+
219
+ ### Updated Difficulty Grading System
220
+
221
+ ```python
222
+ def classify_word_difficulty(word):
223
+ """Updated difficulty classification using clean sources"""
224
+
225
+ # Priority 1: COCA data (highest quality)
226
+ if word in coca_frequencies:
227
+ freq_rank = coca_frequencies[word]['rank']
228
+ if freq_rank <= 1000:
229
+ return "easy"
230
+ elif freq_rank <= 3000:
231
+ return "medium"
232
+ else:
233
+ return "hard"
234
+
235
+ # Priority 2: SUBTLEX Zipf score
236
+ elif word in subtlex_zipf:
237
+ zipf = subtlex_zipf[word]
238
+ if zipf >= 4.5:
239
+ return "easy" # Very common
240
+ elif zipf >= 2.5:
241
+ return "medium" # Moderately common
242
+ else:
243
+ return "hard" # Rare
244
+
245
+ # Fallback: Conservative classification
246
+ else:
247
+ return "medium" # Unknown words default to medium
248
+ ```
249
+
250
+ ## Updated Technical Integration Steps
251
+
252
+ ### 1. Data Download and Preprocessing (Revised)
253
+
254
+ ```bash
255
+ # Download COCA free sample (6K high-quality words)
256
+ wget https://raw.githubusercontent.com/brucewlee/COCA-WordFrequency/master/coca_5000.txt
257
+
258
+ # Download Peter Norvig's clean 100K word list
259
+ wget https://norvig.com/ngrams/count_1w100k.txt
260
+
261
+ # Download SUBTLEX-US (requires academic access)
262
+ # Available at: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
263
+
264
+ # AVOID these due to quality issues:
265
+ # ❌ Collaborative Word List (contains garbage)
266
+ # ❌ Spread the Wordlist (contains garbage)
267
+ # ❌ Christopher Jones (needs extensive cleanup)
268
+ ```
269
+
270
+ ### 2. Data Structure Migration
271
+
272
+ ```python
273
+ class EnhancedVocabulary:
274
+ def __init__(self):
275
+ self.collaborative_scores = {} # word -> quality score (10-100)
276
+ self.subtlex_zipf = {} # word -> zipf score (1-7)
277
+ self.subtlex_pos = {} # word -> part of speech
278
+ self.word_embeddings = {} # word -> embedding vector
279
+
280
+ def load_all_sources(self):
281
+ """Load and integrate all vocabulary sources"""
282
+ self.load_collaborative_wordlist()
283
+ self.load_subtlex_data()
284
+ self.compute_embeddings() # Keep existing all-mpnet-base-v2
285
+
286
+ def is_crossword_suitable(self, word):
287
+ """Filter based on crossword appropriateness"""
288
+ return word.upper() in self.collaborative_scores
289
+ ```
290
+
291
+ ### 3. Configuration Updates
292
+
293
+ ```python
294
+ # Environment variables to add
295
+ VOCAB_SOURCE = "collaborative" # "collaborative", "subtlex", "hybrid"
296
+ COLLABORATIVE_WORDLIST_URL = "https://raw.githubusercontent.com/..."
297
+ SUBTLEX_DATA_PATH = "/path/to/subtlex_us.txt"
298
+ MIN_CROSSWORD_QUALITY = 30 # Minimum collaborative score
299
+ MIN_ZIPF_SCORE = 2.0 # Minimum SUBTLEX frequency
300
+ ```
301
+
302
+ ## Quality Scoring Systems Comparison
303
+
304
+ ### WordFreq (Current)
305
+ - **Scale**: Frequency values (logarithmic)
306
+ - **Basis**: Web text frequency
307
+ - **Issues**: No quality filtering, includes inappropriate content
308
+
309
+ ### Collaborative Word List
310
+ - **Scale**: 10-100 quality score
311
+ - **Basis**: Crossword constructor consensus
312
+ - **Interpretation**:
313
+ - 70-100: Excellent crossword words (common, clean)
314
+ - 40-69: Good crossword words (moderate difficulty)
315
+ - 10-39: Challenging words (obscure, specialized)
316
+
317
+ ### SUBTLEX Zipf Scale
318
+ - **Scale**: 1-7 (logarithmic)
319
+ - **Basis**: Psycholinguistic word processing research
320
+ - **Interpretation**:
321
+ - 6-7: Ultra common (THE, AND, OF)
322
+ - 4-5: Common (HOUSE, WATER, FRIEND)
323
+ - 2-3: Uncommon (BIZARRE, ELOQUENT)
324
+ - 1: Rare (OBSEQUIOUS, PERSPICACIOUS)
325
+
326
+ ## Expected Benefits
327
+
328
+ ### Immediate Quality Improvements:
329
+ 1. **Cleaner intersections**: No more "ethology/guns/porn" issues
330
+ 2. **Family-friendly vocabulary**: Community-curated appropriateness
331
+ 3. **Better difficulty calibration**: Psycholinguistically validated scales
332
+ 4. **Crossword-optimized**: Words chosen for puzzle suitability
333
+
334
+ ### Long-term Advantages:
335
+ 1. **Community support**: Active maintenance by crossword constructors
336
+ 2. **Research backing**: SUBTLEX has extensive academic validation
337
+ 3. **Hybrid flexibility**: Can combine multiple quality signals
338
+ 4. **Scalability**: Easy to add new vocabulary sources
339
+
340
+ ## Migration Strategy
341
+
342
+ ### Week 1: Data Integration
343
+ - Download and preprocess Collaborative Word List
344
+ - Create vocabulary loading pipeline
345
+ - Implement basic quality filtering
346
+
347
+ ### Week 2: Scoring System
348
+ - Implement hybrid quality scoring
349
+ - Map quality scores to difficulty levels
350
+ - Test with existing multi-topic intersection methods
351
+
352
+ ### Week 3: Performance Validation
353
+ - A/B test against WordFreq baseline
354
+ - Measure semantic intersection quality
355
+ - Validate difficulty calibration
356
+
357
+ ### Week 4: Production Deployment
358
+ - Update environment configuration
359
+ - Monitor vocabulary coverage
360
+ - Collect user feedback on word quality
361
+
362
+ ## Alternative Implementation: Gradual Migration
363
+
364
+ For lower risk, implement gradual migration:
365
+
366
+ ```python
367
+ def get_word_quality(word):
368
+ """Gradual migration approach"""
369
+ if word in collaborative_scores:
370
+ # Use collaborative score if available
371
+ return collaborative_scores[word] / 100.0
372
+ elif word in subtlex_zipf:
373
+ # Fallback to SUBTLEX
374
+ return subtlex_zipf[word] / 7.0
375
+ else:
376
+ # Final fallback to WordFreq
377
+ return word_frequency(word, 'en')
378
+ ```
379
+
380
+ This allows testing new vocabulary sources while maintaining compatibility with existing words not found in curated lists.
381
+
382
+ ## Conclusion (Updated After Hands-On Evaluation)
383
+
384
+ **Key Finding**: Most "crossword-specific" vocabulary lists contain significant amounts of junk data that require extensive cleanup, defeating their supposed advantage over general-purpose sources.
385
+
386
+ **Recommended Solution**: Combine high-quality general sources instead:
387
+ 1. **COCA free sample** (6K words) for core high-quality vocabulary
388
+ 2. **Peter Norvig's 100K list** for clean, broad coverage
389
+ 3. **SUBTLEX** for psycholinguistically validated difficulty grading
390
+ 4. **Avoid crossword-specific lists** until they improve their curation
391
+
392
+ This hybrid approach provides:
393
+ - **Clean vocabulary**: No `10THGENCONSOLE`, `zzzzzzzzzzzzzzz`, or `AAAAUTOCLUB` garbage
394
+ - **Academic validation**: COCA and SUBTLEX are research-proven
395
+ - **Industry credibility**: Norvig's list comes from Google's Director of Research
396
+ - **Reasonable coverage**: 6K-100K words should handle most crossword needs
397
+ - **Better difficulty calibration**: Psycholinguistic frequency data beats arbitrary scores
398
+
399
+ **Next Steps**:
400
+ 1. Start with COCA free sample as proof of concept
401
+ 2. Extend with filtered SUBTLEX for broader coverage
402
+ 3. Validate against Norvig's clean list
403
+ 4. Consider COCA full version if budget allows
404
+
405
+ The investment in clean, research-backed vocabulary data will dramatically improve puzzle quality without the cleanup nightmare of supposedly "crossword-specific" sources.
hack/SUBTLEX/SUBTLEXus74286wordstextversion.txt ADDED
The diff for this file is too large to render. See raw diff
 
hack/analyze_norvig_vocabulary.py ADDED
@@ -0,0 +1,400 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Statistical Analysis of Norvig Word Count Files
4
+
5
+ Analyzes a single Norvig word count file (count_1w.txt or count_1w100k.txt)
6
+ from norvig.com/ngrams/ to understand vocabulary characteristics for crossword generation.
7
+
8
+ Usage:
9
+ python analyze_norvig_vocabulary.py <filename>
10
+ python analyze_norvig_vocabulary.py --help
11
+
12
+ Examples:
13
+ python analyze_norvig_vocabulary.py norvig/count_1w100k.txt
14
+ python analyze_norvig_vocabulary.py norvig/count_1w.txt
15
+ """
16
+
17
+ import os
18
+ import sys
19
+ import argparse
20
+ import numpy as np
21
+ import matplotlib.pyplot as plt
22
+ import pandas as pd
23
+ from collections import Counter, defaultdict
24
+ import seaborn as sns
25
+ from pathlib import Path
26
+
27
+ # Set style for better plots
28
+ plt.style.use('seaborn-v0_8')
29
+ sns.set_palette("husl")
30
+
31
+ def parse_arguments():
32
+ """Parse command line arguments"""
33
+ parser = argparse.ArgumentParser(
34
+ description='Analyze Norvig word count files for crossword generation',
35
+ formatter_class=argparse.RawDescriptionHelpFormatter,
36
+ epilog="""
37
+ Examples:
38
+ python analyze_norvig_vocabulary.py norvig/count_1w100k.txt
39
+ python analyze_norvig_vocabulary.py norvig/count_1w.txt
40
+ python analyze_norvig_vocabulary.py --help
41
+
42
+ File formats supported:
43
+ - count_1w100k.txt: Top 100,000 most frequent words
44
+ - count_1w.txt: Full word count dataset (1M+ words)
45
+
46
+ Output:
47
+ - Comprehensive statistical analysis
48
+ - 6-panel visualization saved as norvig_comprehensive_analysis.png
49
+ - Summary statistics printed to console
50
+ """
51
+ )
52
+
53
+ parser.add_argument(
54
+ 'filename',
55
+ help='Path to Norvig word count file (e.g., norvig/count_1w100k.txt)'
56
+ )
57
+
58
+ return parser.parse_args()
59
+
60
+ def load_word_counts(filepath):
61
+ """Load word count file and return dict of {word: count}"""
62
+ word_counts = {}
63
+ total_lines = 0
64
+
65
+ print(f"Loading {filepath}...")
66
+
67
+ try:
68
+ with open(filepath, 'r', encoding='utf-8') as f:
69
+ for line in f:
70
+ total_lines += 1
71
+ parts = line.strip().split('\t')
72
+ if len(parts) == 2:
73
+ word, count = parts
74
+ word_counts[word.upper()] = int(count)
75
+ elif len(parts) == 1 and line.strip():
76
+ # Handle case where count might be missing
77
+ word = parts[0]
78
+ word_counts[word.upper()] = 1
79
+
80
+ print(f"βœ… Loaded {len(word_counts):,} words from {filepath}")
81
+ return word_counts
82
+
83
+ except FileNotFoundError:
84
+ print(f"❌ File not found: {filepath}")
85
+ return {}
86
+ except Exception as e:
87
+ print(f"❌ Error loading {filepath}: {e}")
88
+ return {}
89
+
90
+ def analyze_word_lengths(words):
91
+ """Analyze distribution of word lengths"""
92
+ lengths = [len(word) for word in words]
93
+ length_dist = Counter(lengths)
94
+
95
+ return lengths, length_dist
96
+
97
+ def classify_difficulty(rank, total_words):
98
+ """Classify word difficulty based on frequency rank"""
99
+ if rank <= total_words * 0.05: # Top 5%
100
+ return "Very Easy"
101
+ elif rank <= total_words * 0.20: # Top 20%
102
+ return "Easy"
103
+ elif rank <= total_words * 0.60: # Top 60%
104
+ return "Medium"
105
+ elif rank <= total_words * 0.85: # Top 85%
106
+ return "Hard"
107
+ else:
108
+ return "Very Hard"
109
+
110
+ def create_comprehensive_analysis(word_counts, filename, base_dir):
111
+ """Create comprehensive statistical analysis with readable plots"""
112
+
113
+ # Create figure with subplots - 2x3 layout with good spacing
114
+ fig = plt.figure(figsize=(18, 12))
115
+ fig.suptitle(f'Norvig Word Count Analysis - {filename}',
116
+ fontsize=16, fontweight='bold', y=0.95)
117
+
118
+ # Convert to sorted lists for analysis
119
+ words = list(word_counts.keys())
120
+ counts = list(word_counts.values())
121
+ ranks = list(range(1, len(counts) + 1))
122
+
123
+ # 1. Zipf's Law Analysis (log-log plot)
124
+ ax1 = plt.subplot(2, 3, 1)
125
+ plt.loglog(ranks, counts, 'b-', alpha=0.7, linewidth=2)
126
+ plt.xlabel('Rank (log scale)')
127
+ plt.ylabel('Frequency (log scale)')
128
+ plt.title('Zipf\'s Law Validation', fontweight='bold')
129
+ plt.grid(True, alpha=0.3)
130
+
131
+ # Add theoretical Zipf line for comparison
132
+ theoretical_zipf = [counts[0] / r for r in ranks]
133
+ plt.loglog(ranks, theoretical_zipf, 'r--', alpha=0.5, label='Theoretical')
134
+ plt.legend()
135
+
136
+ # 2. Word Length Distribution
137
+ ax2 = plt.subplot(2, 3, 2)
138
+ lengths, length_dist = analyze_word_lengths(words)
139
+ lengths_list = sorted(length_dist.keys())
140
+ counts_list = [length_dist[l] for l in lengths_list]
141
+
142
+ bars = plt.bar(lengths_list, counts_list, alpha=0.7, color='skyblue', edgecolor='navy')
143
+ plt.xlabel('Word Length (characters)')
144
+ plt.ylabel('Number of Words')
145
+ plt.title('Word Length Distribution', fontweight='bold')
146
+
147
+ # Highlight crossword-suitable range (3-12 letters)
148
+ for i, bar in enumerate(bars):
149
+ if 3 <= lengths_list[i] <= 12:
150
+ bar.set_color('lightgreen')
151
+ elif lengths_list[i] < 3 or lengths_list[i] > 15:
152
+ bar.set_color('lightcoral')
153
+
154
+ plt.axvspan(3, 12, alpha=0.2, color='green', label='Crossword Range')
155
+ plt.legend()
156
+
157
+ # 3. Difficulty Distribution
158
+ ax3 = plt.subplot(2, 3, 3)
159
+ difficulty_dist = defaultdict(int)
160
+ for rank in ranks:
161
+ difficulty = classify_difficulty(rank, len(ranks))
162
+ difficulty_dist[difficulty] += 1
163
+
164
+ diff_labels = list(difficulty_dist.keys())
165
+ diff_counts = list(difficulty_dist.values())
166
+ colors = ['darkgreen', 'green', 'orange', 'red', 'darkred']
167
+
168
+ wedges, texts, autotexts = plt.pie(diff_counts, labels=diff_labels, autopct='%1.1f%%',
169
+ colors=colors[:len(diff_labels)], startangle=90)
170
+ plt.title('Difficulty Distribution', fontweight='bold')
171
+
172
+ # 4. Cumulative Frequency Coverage
173
+ ax4 = plt.subplot(2, 3, 4)
174
+ cumulative_freq = np.cumsum(counts)
175
+ total_freq = cumulative_freq[-1]
176
+ coverage_pct = (cumulative_freq / total_freq) * 100
177
+
178
+ plt.plot(ranks, coverage_pct, 'g-', linewidth=2)
179
+ plt.xlabel('Vocabulary Size')
180
+ plt.ylabel('Coverage (%)')
181
+ plt.title('Cumulative Coverage', fontweight='bold')
182
+ plt.grid(True, alpha=0.3)
183
+
184
+ # Add key milestone markers
185
+ milestones = [1000, 5000, 10000, 25000, 50000]
186
+ for milestone in milestones:
187
+ if milestone < len(coverage_pct):
188
+ plt.axvline(x=milestone, color='red', linestyle='--', alpha=0.5)
189
+
190
+ # 5. Crossword Suitability
191
+ ax5 = plt.subplot(2, 3, 5)
192
+ crossword_suitable = {word: count for word, count in word_counts.items()
193
+ if 3 <= len(word) <= 12 and word.isalpha()}
194
+
195
+ total_words = len(word_counts)
196
+ suitable_words = len(crossword_suitable)
197
+ unsuitable_words = total_words - suitable_words
198
+
199
+ labels = [f'Suitable\n{suitable_words:,}', f'Not Suitable\n{unsuitable_words:,}']
200
+ sizes = [suitable_words, unsuitable_words]
201
+ colors = ['lightgreen', 'lightcoral']
202
+
203
+ wedges, texts, autotexts = plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
204
+ plt.title('Crossword Suitability', fontweight='bold')
205
+
206
+ # 6. Difficulty Categories for Crosswords
207
+ ax6 = plt.subplot(2, 3, 6)
208
+
209
+ # Define crossword difficulty thresholds
210
+ easy_threshold = 5000
211
+ medium_threshold = 25000
212
+
213
+ easy_words = sum(1 for word in words[:easy_threshold] if 3 <= len(word) <= 12)
214
+ medium_words = sum(1 for word in words[easy_threshold:medium_threshold] if 3 <= len(word) <= 12)
215
+ hard_words = sum(1 for word in words[medium_threshold:] if 3 <= len(word) <= 12)
216
+
217
+ categories = ['Easy', 'Medium', 'Hard']
218
+ word_counts_cat = [easy_words, medium_words, hard_words]
219
+ colors_cat = ['lightgreen', 'gold', 'lightcoral']
220
+
221
+ bars = plt.bar(categories, word_counts_cat, color=colors_cat, alpha=0.8)
222
+ plt.ylabel('Crossword Words')
223
+ plt.title('Difficulty Categories\n(Based on Frequency Rank)', fontweight='bold')
224
+
225
+ # Add value labels on bars
226
+ for bar, count in zip(bars, word_counts_cat):
227
+ height = bar.get_height()
228
+ if height > 0:
229
+ plt.text(bar.get_x() + bar.get_width()/2, height + max(word_counts_cat)*0.02,
230
+ f'{count:,}', ha='center', va='bottom', fontweight='bold')
231
+
232
+ # Add explanation text box with examples
233
+ # Get some example words for each category
234
+ easy_examples = [w for i, w in enumerate(words[:100]) if 3 <= len(w) <= 12][:3]
235
+ medium_examples = [w for i, w in enumerate(words[7000:12000]) if 3 <= len(w) <= 12][:3]
236
+ hard_examples = [w for i, w in enumerate(words[30000:35000]) if 3 <= len(w) <= 12][:3]
237
+
238
+ explanation = (f'Easy: Ranks 1-5,000 (most frequent)\n'
239
+ f' e.g., {", ".join(easy_examples[:3])}\n'
240
+ f'Medium: Ranks 5,001-25,000\n'
241
+ f' e.g., {", ".join(medium_examples[:3])}\n'
242
+ f'Hard: Ranks 25,001+ (least frequent)\n'
243
+ f' e.g., {", ".join(hard_examples[:3])}\n\n'
244
+ 'Lower rank = higher frequency = easier')
245
+
246
+ plt.text(0.98, 0.98, explanation, transform=ax6.transAxes,
247
+ fontsize=8, verticalalignment='top', horizontalalignment='right',
248
+ bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.9))
249
+
250
+ # Adjust layout with proper spacing
251
+ plt.subplots_adjust(left=0.08, bottom=0.08, right=0.95, top=0.88, wspace=0.35, hspace=0.45)
252
+
253
+ # Save the comprehensive analysis with filename in the output name
254
+ # Extract base name and create clean output filename
255
+ if 'count_1w100k' in filename:
256
+ output_name = 'norvig_analysis_100k.png'
257
+ elif 'count_1w.txt' in filename:
258
+ output_name = 'norvig_analysis_full.png'
259
+ else:
260
+ # Fallback for any other filename - make it filesystem safe
261
+ safe_name = filename.replace('.txt', '').replace('/', '_').replace('count_', '')
262
+ output_name = f'norvig_analysis_{safe_name}.png'
263
+
264
+ output_path = base_dir / output_name
265
+ plt.savefig(output_path, dpi=300, bbox_inches='tight')
266
+ print(f"πŸ“Š Comprehensive analysis saved to: {output_path}")
267
+
268
+ return fig, crossword_suitable
269
+
270
+ def print_summary_statistics(word_counts, filename, crossword_suitable):
271
+ """Print comprehensive summary statistics"""
272
+
273
+ print("\n" + "="*80)
274
+ print("πŸ“Š NORVIG VOCABULARY STATISTICAL ANALYSIS")
275
+ print(f"πŸ“ File: {filename}")
276
+ print("="*80)
277
+
278
+ # Basic statistics
279
+ total_words = len(word_counts)
280
+ total_frequency = sum(word_counts.values())
281
+
282
+ print(f"\nπŸ“š BASIC STATISTICS:")
283
+ print(f" β€’ Total words: {total_words:,}")
284
+ print(f" β€’ Total frequency: {total_frequency:,}")
285
+ print(f" β€’ Average frequency: {total_frequency/total_words:.2f}")
286
+
287
+ # Word length analysis
288
+ lengths, length_dist = analyze_word_lengths(word_counts.keys())
289
+ avg_length = np.mean(lengths)
290
+ crossword_length_words = sum(count for length, count in length_dist.items() if 3 <= length <= 12)
291
+ crossword_length_pct = (crossword_length_words / total_words) * 100
292
+
293
+ print(f"\nπŸ“ WORD LENGTH ANALYSIS:")
294
+ print(f" β€’ Average word length: {avg_length:.1f} characters")
295
+ print(f" β€’ Words 3-12 characters: {crossword_length_words:,} ({crossword_length_pct:.1f}%)")
296
+ print(f" β€’ Most common lengths: {sorted(length_dist.items(), key=lambda x: x[1], reverse=True)[:5]}")
297
+
298
+ # Crossword suitability
299
+ suitable_count = len(crossword_suitable)
300
+ suitable_pct = (suitable_count / total_words) * 100
301
+ suitable_freq = sum(crossword_suitable.values())
302
+ suitable_freq_pct = (suitable_freq / total_frequency) * 100
303
+
304
+ print(f"\n🧩 CROSSWORD SUITABILITY:")
305
+ print(f" β€’ Suitable words (3-12 letters, alphabetic): {suitable_count:,} ({suitable_pct:.1f}%)")
306
+ print(f" β€’ Suitable word frequency coverage: {suitable_freq_pct:.1f}%")
307
+
308
+ # Difficulty distribution for crosswords
309
+ easy_words = sum(1 for w in list(word_counts)[:5000] if w in crossword_suitable)      # suitable words ranked 1-5K
310
+ medium_words = sum(1 for w in list(word_counts)[5000:25000] if w in crossword_suitable)  # ranked 5K-25K
311
+ hard_words = sum(1 for w in list(word_counts)[25000:] if w in crossword_suitable)     # ranked 25K+
312
+
313
+ print(f"\n🎯 CROSSWORD DIFFICULTY DISTRIBUTION:")
314
+ print(f" β€’ Easy (rank 1-5K): {easy_words:,} words")
315
+ print(f" β€’ Medium (rank 5K-25K): {medium_words:,} words")
316
+ print(f" β€’ Hard (rank 25K+): {hard_words:,} words")
317
+
318
+ # Top and bottom words examples
319
+ words_list = list(word_counts.keys())
320
+ print(f"\nπŸ” TOP 10 MOST FREQUENT WORDS:")
321
+ for i, word in enumerate(words_list[:10], 1):
322
+ print(f" {i:2d}. {word:<12} ({word_counts[word]:,})")
323
+
324
+ print(f"\nπŸ”š BOTTOM 10 LEAST FREQUENT WORDS:")
325
+ for i, word in enumerate(words_list[-10:], 1):
326
+ print(f" {i:2d}. {word:<12} ({word_counts[word]:,})")
327
+
328
+ # Zipf's law validation
329
+ words_list = list(word_counts.keys())
330
+ counts_list = list(word_counts.values())
331
+
332
+ # Calculate correlation coefficient for log-log relationship
333
+ log_ranks = np.log(range(1, len(counts_list) + 1))
334
+ log_freqs = np.log(counts_list)
335
+ correlation = np.corrcoef(log_ranks, log_freqs)[0, 1]
336
+
337
+ print(f"\nπŸ“ˆ ZIPF'S LAW VALIDATION:")
338
+ print(f" β€’ Log-log correlation: {correlation:.4f}")
339
+ print(f" β€’ Zipf compliance: {'βœ… Excellent' if abs(correlation) > 0.95 else '⚠️ Moderate' if abs(correlation) > 0.8 else '❌ Poor'}")
340
+
341
+ # Recommendations
342
+ print(f"\nπŸ’‘ RECOMMENDATIONS FOR CROSSWORD GENERATION:")
343
+ print(f" β€’ Dataset size: {total_words:,} words with excellent coverage")
344
+ print(f" β€’ Filter to 3-12 letters: Reduces to {suitable_count:,} words ({suitable_pct:.1f}%)")
345
+ print(f" β€’ Difficulty thresholds (for crossword-suitable words):")
346
+ print(f" - Easy: ranks 1-5,000 ({easy_words:,} suitable words)")
347
+ print(f" - Medium: ranks 5,001-25,000 ({medium_words:,} suitable words)")
348
+ print(f" - Hard: ranks 25,001+ ({hard_words:,} suitable words)")
349
+ print(f" β€’ Quality: βœ… No garbage entries (unlike crossword-specific lists)")
350
+ print(f" β€’ Source credibility: βœ… Peter Norvig (Google) + Google Books corpus")
351
+
352
+ print("="*80)
353
+
354
+ def main():
355
+ """Main analysis function"""
356
+
357
+ # Parse command line arguments
358
+ args = parse_arguments()
359
+
360
+ # File paths
361
+ base_dir = Path(__file__).parent
362
+ input_file = Path(args.filename)
363
+
364
+ # Make path relative to script directory if not absolute
365
+ if not input_file.is_absolute():
366
+ input_file = base_dir / input_file
367
+
368
+ print("πŸ” Norvig Vocabulary Statistical Analysis")
369
+ print("=" * 50)
370
+ print(f"πŸ“ Analyzing: {input_file}")
371
+
372
+ # Load data
373
+ word_counts = load_word_counts(input_file)
374
+
375
+ if not word_counts:
376
+ print(f"❌ Could not load word list from {input_file}. Please check file path.")
377
+ return
378
+
379
+ # Create comprehensive analysis
380
+ fig, crossword_suitable = create_comprehensive_analysis(word_counts, input_file.name, base_dir)
381
+
382
+ # Print summary statistics
383
+ print_summary_statistics(word_counts, input_file.name, crossword_suitable)
384
+
385
+ # Don't show plot interactively in CLI, just save it
386
+ # plt.show() # Comment out for CLI usage
387
+
388
+ # Generate the same output filename logic for final message
389
+ if 'count_1w100k' in input_file.name:
390
+ output_name = 'norvig_analysis_100k.png'
391
+ elif 'count_1w.txt' in input_file.name:
392
+ output_name = 'norvig_analysis_full.png'
393
+ else:
394
+ safe_name = input_file.name.replace('.txt', '').replace('/', '_').replace('count_', '')
395
+ output_name = f'norvig_analysis_{safe_name}.png'
396
+
397
+ print(f"\nβœ… Analysis complete! Check {base_dir}/{output_name} for detailed plots.")
398
+
399
+ if __name__ == "__main__":
400
+ main()
hack/norvig/count_1w.txt ADDED
The diff for this file is too large to render. See raw diff
 
hack/norvig/count_1w100k.txt ADDED
The diff for this file is too large to render. See raw diff