docs: add soft minimum visualization ideas and vocabulary alternatives analysis
- Add comprehensive visualization concepts for the soft minimum method
- Add detailed analysis of vocabulary alternatives beyond WordFreq
- Add a Python script to analyze the word lists from Peter Norvig's home page
- Update .gitignore for fine-tuned models and T5 model cache
- Include SUBTLEX dataset and Norvig vocabulary analysis files
Signed-off-by: Vimal Kumar <[email protected]>
- crossword-app/backend-py/docs/softmin_visualization_ideas.md +227 -0
- crossword-app/backend-py/docs/vocabulary_alternatives_analysis.md +405 -0
- hack/SUBTLEX/SUBTLEXus74286wordstextversion.txt +0 -0
- hack/analyze_norvig_vocabulary.py +400 -0
- hack/norvig/count_1w.txt +0 -0
- hack/norvig/count_1w100k.txt +0 -0
crossword-app/backend-py/docs/softmin_visualization_ideas.md
ADDED
@@ -0,0 +1,227 @@
# Soft Minimum Visualization Ideas

This document outlines visualization concepts to showcase how the soft minimum method works for multi-topic word intersection in the crossword generator.

## Overview

The soft minimum method uses the formula `-log(sum(exp(-beta * similarities))) / beta` to find words that are genuinely relevant to ALL topics simultaneously. Unlike simple averaging, which can promote words that are highly relevant to just one topic, soft minimum penalizes words that score poorly on any individual topic.
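For concreteness, the core computation can be sketched in a few lines of NumPy (a minimal illustration of the formula above, not the backend's actual implementation; the similarity values are made up):

```python
import numpy as np

def soft_minimum(similarities, beta=10.0):
    """Soft minimum of per-topic similarities: -log(sum(exp(-beta * s))) / beta."""
    s = np.asarray(similarities, dtype=float)
    return -np.log(np.sum(np.exp(-beta * s))) / beta

# A word moderately relevant to every topic vs. one relevant to a single topic
print(soft_minimum([0.45, 0.50, 0.40]))  # ~0.33, close to the true minimum of 0.40
print(soft_minimum([0.80, 0.20, 0.10]))  # ~0.07, dragged down by the weak topics
```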

Visualizations would help users understand:
- How soft minimum differs from averaging
- Why it produces better semantic intersections
- How the beta parameter affects results
- How the adaptive beta mechanism works

## Visualization Concepts

### 1. Heat Map Comparison (Most Impactful)

**Concept**: Side-by-side heat maps showing individual topic similarities vs soft minimum scores.

**Layout**:
- **Left Heat Map**: Individual Similarities
  - Rows: Top 50-100 words
  - Columns: Individual topics (e.g., "universe", "movies", "languages")
  - Color intensity: Similarity score (0.0 = white, 1.0 = dark blue)

- **Right Heat Map**: Soft Minimum Results
  - Same rows (words)
  - Single column: Soft minimum score
  - Color intensity: Final soft minimum score

**Key Insights**:
- Words like "anime" would show moderate blue across all topics → high soft minimum score
- Words like "astronomy" would show dark blue for "universe", white for others → low soft minimum score
- Visually demonstrates how soft minimum penalizes topic-specific words

**Implementation**:
- Frontend: Use libraries like D3.js or Plotly for interactive heat maps
- Backend: Return individual topic similarities alongside soft minimum scores

### 2. 3D Scatter Plot (For 3-Topic Cases)

**Concept**: 3D space where each axis represents similarity to one topic.

**Layout**:
- X-axis: Similarity to topic 1
- Y-axis: Similarity to topic 2
- Z-axis: Similarity to topic 3
- Point size/color: Soft minimum score
- Point labels: Word names (on hover)

**Key Insights**:
- Words near the center (similar to all topics) = large, bright points
- Words near the axes (similar to only one topic) = small, dim points
- Shows the "volume" of intersection vs union

**Implementation**:
- Use Three.js or Plotly 3D
- Interactive rotation and zoom
- Filter points by soft minimum threshold

### 3. Interactive Beta Slider

**Concept**: Real-time visualization of how the beta parameter affects word selection.

**Layout**:
- Horizontal slider: Beta value (1.0 to 20.0)
- Bar chart: Word scores (sorted descending)
- Threshold line: Current similarity threshold
- Counter: Number of words above threshold

**Key Insights**:
- High beta (strict): Only a few words pass, distribution is peaked
- Low beta (permissive): More words pass, distribution flattens
- Shows the adaptive beta mechanism in action

**Implementation**:
- React component with range slider
- Real-time recalculation of soft minimum scores
- Animated transitions as beta changes

### 4. Venn Diagram with Words

**Concept**: Position words in a Venn diagram based on topic similarities.

**Layout** (for 2-3 topics):
- Circles represent individual topics
- Words positioned based on similarity combinations
- Words in intersections = high soft minimum scores
- Words in single circles = low soft minimum scores
- Word opacity/size based on final soft minimum score

**Key Insights**:
- Visual representation of "true intersections"
- Words in overlap regions are what soft minimum promotes
- Empty intersection regions explain why some topic combinations yield few words

**Implementation**:
- SVG-based Venn diagrams
- Dynamic positioning algorithm
- Interactive word tooltips

### 5. Before/After Word Clouds

**Concept**: Compare averaging vs soft minimum results using word clouds.

**Layout**:
- **Left Cloud**: "Averaging Method"
  - Word size based on average similarity
  - May prominently feature problematic words like "ethology" for Art+Books

- **Right Cloud**: "Soft Minimum Method"
  - Word size based on soft minimum score
  - Should prominently feature true intersections like "literature"

**Key Insights**:
- Dramatic visual difference in word prominence
- Shows quality improvement at a glance
- Easy to understand for non-technical users

**Implementation**:
- Use word cloud libraries (wordcloud2.js, D3-cloud)
- Color coding by topic affinity
- Interactive word selection

### 6. Mathematical Formula Animation

**Concept**: Step-by-step visualization of the soft minimum calculation.

**Layout**:
- Example word with similarities: [0.8, 0.2, 0.1] (universe, movies, languages)
- Animated steps:
  1. Show individual similarities as bars
  2. Apply exponential transformation: exp(-beta * sim)
  3. Sum the exponentials
  4. Apply logarithm and normalization
  5. Compare result to simple average (0.37)

**Key Insights**:
- How the minimum similarity dominates the calculation
- Why soft minimum ≈ minimum similarity for high beta
- Mathematical intuition behind the formula
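As a concrete reference for the animation, here is the calculation carried out numerically for the [0.8, 0.2, 0.1] example, reusing the `soft_minimum` sketch from the Overview (beta = 10 is assumed for illustration):

```python
import numpy as np

sims = np.array([0.8, 0.2, 0.1])   # universe, movies, languages
beta = 10.0

exponentials = np.exp(-beta * sims)           # step 2: [0.000335, 0.135335, 0.367879]
total = exponentials.sum()                    # step 3: 0.503550
soft_min = -np.log(total) / beta              # step 4: 0.0686

print(f"soft minimum   = {soft_min:.3f}")     # ~0.069, close to min(sims) = 0.1
print(f"simple average = {sims.mean():.2f}")  # 0.37 -> would over-rank this word
```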

**Implementation**:
- Animated SVG or Canvas
- Step-by-step button progression
- Mathematical notation display

### 7. Adaptive Beta Journey

**Concept**: Show the adaptive beta retry process as a timeline.

**Layout**:
- Horizontal timeline showing beta decay: 10.0 → 7.0 → 4.9 → 3.4...
- For each beta value:
  - Histogram of soft minimum scores
  - Threshold line (adjusted)
  - Count of valid words
  - Decision: "Continue" or "Stop"

**Key Insights**:
- How threshold adjustment makes lower beta more permissive
- Why the word count increases with each retry
- When the algorithm decides to stop
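For orientation, the retry process being visualized can be sketched roughly as follows. This is an illustrative sketch only: the 0.7 decay factor is inferred from the 10.0 → 7.0 → 4.9 → 3.4 sequence above, and `scores_by_beta`, `base_threshold`, `min_words`, and the threshold scaling are assumed placeholders, not the backend's actual parameters:

```python
def select_words_adaptive(scores_by_beta, base_threshold=0.25, min_words=30,
                          beta=10.0, beta_decay=0.7, max_retries=5):
    """Retry soft-minimum selection with a progressively more permissive beta."""
    valid = []
    for attempt in range(max_retries):
        # Lowering beta also lowers the effective threshold, making the pass more permissive.
        threshold = base_threshold * (beta / 10.0)
        scores = scores_by_beta(beta)                      # {word: soft-min score at this beta}
        valid = [w for w, s in scores.items() if s >= threshold]
        print(f"beta={beta:.1f} threshold={threshold:.3f} -> {len(valid)} words")
        if len(valid) >= min_words:
            return valid, beta                             # decision: "Stop"
        beta *= beta_decay                                 # decision: "Continue" with a softer minimum
    return valid, beta
```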

**Implementation**:
- Timeline component with expandable sections
- Small multiples showing score distributions
- Real-time data from debug logs

## Implementation Priorities

### Phase 1: Essential (MVP)
1. **Heat Map Comparison** - Most educational value
2. **Interactive Beta Slider** - Shows parameter effects clearly

### Phase 2: Enhanced Understanding
3. **Before/After Word Clouds** - Easy to understand impact
4. **Mathematical Formula Animation** - Educational for technical users

### Phase 3: Advanced Analysis
5. **3D Scatter Plot** - For deep analysis of 3-topic cases
6. **Venn Diagram** - Complex positioning algorithms
7. **Adaptive Beta Journey** - Comprehensive debugging tool

## Technical Implementation Notes

### Backend Changes Needed
- Return individual topic similarities alongside soft minimum scores
- Add debug endpoint for visualization data
- Include beta parameter and threshold information in responses
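A debug endpoint returning the data format described below could look roughly like this. It is a sketch only: it assumes a FastAPI backend and a hypothetical `compute_intersection` helper, neither of which is confirmed here.

```python
from fastapi import APIRouter, Query

router = APIRouter()

@router.get("/debug/softmin")
def softmin_debug(topics: list[str] = Query(...)):
    """Return per-topic similarities and soft minimum scores for visualization."""
    result = compute_intersection(topics)  # hypothetical existing backend helper
    return {
        "visualization_data": {
            "individual_similarities": result.similarities,  # {word: [sim per topic]}
            "soft_minimum_scores": result.soft_min_scores,   # {word: score}
            "beta_used": result.beta,
            "threshold_adjusted": result.threshold,
            "topics": topics,
        }
    }
```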

### Frontend Integration
- Add to existing debug tab
- Use React components for interactivity
- Responsive design for different screen sizes
- Export/save visualization capabilities

### Data Format
```json
{
  "visualization_data": {
    "individual_similarities": {
      "word1": [0.8, 0.2, 0.1],
      "word2": [0.3, 0.9, 0.4]
    },
    "soft_minimum_scores": {
      "word1": 0.15,
      "word2": 0.32
    },
    "beta_used": 7.0,
    "threshold_adjusted": 0.175,
    "topics": ["universe", "movies", "languages"]
  }
}
```

## Expected Impact

These visualizations would:
1. **Educate users** about the soft minimum method
2. **Build confidence** in the algorithm's choices
3. **Enable debugging** of problematic topic combinations
4. **Facilitate research** into parameter optimization
5. **Demonstrate value** of the multi-topic intersection approach

The heat map comparison alone would be worth implementing, as it clearly shows why soft minimum produces higher-quality word intersections than simple averaging.
crossword-app/backend-py/docs/vocabulary_alternatives_analysis.md
ADDED
@@ -0,0 +1,405 @@
# Vocabulary Alternatives Analysis: Beyond WordFreq

## Executive Summary

WordFreq, while useful for general frequency analysis, produces vocabulary quality issues for crossword generation due to its web-scraped, uncurated nature. After hands-on evaluation of alternatives, most "curated" crossword lists have significant quality issues requiring substantial cleanup effort.

### **Updated Recommendations (Post-Evaluation):**
1. **Primary**: COCA free sample (6K high-quality words with rich metadata) + Peter Norvig's clean 100K list
2. **Quality Leader**: COCA full version (if budget allows) - 14 billion words, sophisticated metadata
3. **Fallback**: SUBTLEX (reasonable quality, needs programming to parse properly)
4. **Avoid**: Most crossword-specific lists contain junk data requiring extensive cleanup
5. **Semantic Processing**: Keep all-mpnet-base-v2 (working well)

## Current Issues with WordFreq Vocabulary

### Problems Identified:
1. **Web-based contamination**: Includes Reddit, Twitter, and web crawl data with typos, slang, and internet-specific language
2. **No quality filtering**: Purely frequency-based without considering appropriateness for crosswords
3. **Mixed registers**: Combines formal and informal language indiscriminately
4. **Problematic intersections**: Generates words like "ethology", "guns", "porn" for topics like "Art+Books"
5. **Limited metadata**: No information about word suitability, part-of-speech, or crossword usage
6. **AI contamination risk**: The WordFreq author stopped updates in 2024 due to generative AI polluting data sources

### Impact on Crossword Generation:
- Lower quality semantic intersections
- Inappropriate words for family-friendly puzzles
- Poor difficulty calibration
- Reduced solver experience quality

## Superior Alternatives

### 1. Crossword-Specific Word Lists (⚠️ QUALITY ISSUES FOUND)

#### A. Collaborative Word List (❌ NOT RECOMMENDED)
- **Source**: https://github.com/Crossword-Nexus/collaborative-word-list
- **Size**: 114,000+ words
- **Direct download**: `https://raw.githubusercontent.com/Crossword-Nexus/collaborative-word-list/main/xwordlist.dict`
- **QUALITY PROBLEMS IDENTIFIED**:
  - Contains nonsensical entries: `10THGENCONSOLE`, `1STGENERATIONCONSOLES`, `4XGAMES`
  - Single letters and letter runs: `A`, `AA`, `AAA`, `AAAA`
  - Meaningless sequences: `AAAAH`, `AAAAUTOCLUB`
- **Verdict**: Requires extensive cleanup before use

#### B. Spread the Word(list) (❌ NOT RECOMMENDED)
- **Source**: https://www.spreadthewordlist.com
- **Size**: 114,000+ answers with scores
- **QUALITY PROBLEMS IDENTIFIED**:
  - Garbage entries: `zzzzzzzzzzzzzzz`, `zzzquil`
  - Malformed words: `aaaaddress`, `aabb`, `aabba`
  - Random sequences: `aaiiiiiiiiiiiii`
- **Verdict**: Same quality issues as the Collaborative Word List

#### C. Christopher Jones' Crossword Wordlist (⚠️ NEEDS CLEANUP)
- **Source**: https://github.com/christophsjones/crossword-wordlist
- **QUALITY PROBLEMS IDENTIFIED**:
  - Long phrases: `"a week from now"`, `"a recipe for disaster"`
  - Absurdly long compounds: `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`, `ABLEBODIEDSEAMAN`
  - Arbitrary scoring: Many words with score 50 don't match the claimed "common words you wouldn't hesitate to use"
- **Verdict**: Contains good data but needs significant filtering and rescoring

### 2. SUBTLEX Psycholinguistic Databases (✅ REASONABLE QUALITY)

#### SUBTLEX-US (American English)
- **Source**: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
- **Size**: 74,000+ words
- **Quality**: Based on film/TV subtitles (natural language exposure)
- **Scoring**: Zipf scale 1-7, contextual diversity metrics
- **License**: Free for research

#### EVALUATION RESULTS:
- **✅ Better quality**: Words are generally reasonable and appropriate
- **⚠️ Contains short phrases**: Some entries are short phrases rather than single words
- **⚠️ Requires programming**: Need to parse and filter the numerical data properly
- **✅ Rich metadata**: Includes frequency, Zipf scores, part-of-speech, contextual diversity
- **✅ Research backing**: Proven to predict word processing difficulty better than traditional corpora

#### Advantages:
- **Psycholinguistic validity**: Better predictor of word processing difficulty
- **Clean vocabulary**: Professional media content (edited, appropriate)
- **Good difficulty calibration**: Zipf 1-3 = rare/hard, 4-7 = common/easy
- **Multiple languages**: Available for US, UK, Chinese, Welsh, Spanish
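Since SUBTLEX ships as a tab-separated spreadsheet export, the "requires programming" point above amounts to a small parsing step. A sketch follows; the column names vary between SUBTLEX releases, so `Word` and `SUBTLWF` here are assumptions to adapt to the actual header:

```python
import csv
import math

def load_subtlex_zipf(path, word_col="Word", fpm_col="SUBTLWF"):
    """Load SUBTLEX-US and derive Zipf scores as log10(frequency per million) + 3."""
    zipf = {}
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            word = row[word_col].strip().upper()
            fpm = float(row[fpm_col])          # frequency per million words
            if word.isalpha() and fpm > 0:     # drop phrases, numbers, empty counts
                zipf[word] = math.log10(fpm) + 3.0
    return zipf
```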

### 3. COCA (Corpus of Contemporary American English) (EXCELLENT QUALITY)

#### Available Data:
- **Free tier**: ~6,000 words with rich metadata and collocates
- **Full version**: 14 billion words with sophisticated metadata (paid)
- **Source**: https://www.wordfrequency.info/ and https://github.com/brucewlee/COCA-WordFrequency
- **Composition**: Balanced across news, fiction, academic, spoken

#### EVALUATION RESULTS:
- **Excellent quality**: "Phew, this is good" - the professional curation shows
- **✅ Rich metadata**: Frequency, part-of-speech, genre distribution, collocates
- **✅ Clean vocabulary**: Academic-standard filtering
- **✅ Balanced representation**: Multiple text types ensure comprehensive coverage
- **Premium option**: Full version provides 14 billion words with sophisticated metadata
- **✅ Free sample sufficient**: 6K words could serve as a high-quality core vocabulary

#### Advantages:
- **Academic gold standard**: Most accurate and reliable word frequency data
- **Professional curation**: High editorial and scholarly standards
- **Balanced corpus**: News, fiction, academic, spoken genres represented
- **Collocate data**: Helps understand word usage patterns and context
- **Research proven**: Widely used and validated in linguistics research

### 4. Peter Norvig's Clean Word Lists (EXCELLENT DISCOVERY)

#### Norvig's Word Count Lists
- **Source**: https://norvig.com/ngrams/
- **Key Resource**: `count_1w100k.txt` - 100,000 most popular words, all uppercase
- **Quality**: Really clean vocabulary without junk entries
- **Problem**: No frequency information included

#### EVALUATION RESULTS:
- **✅ Very clean**: Properly curated, no garbage like the other sources
- **✅ Good coverage**: 100K words should provide sufficient vocabulary
- **✅ Reliable source**: Peter Norvig (Google's Director of Research) ensures quality
- **❌ Missing frequencies**: Would need to cross-reference with other sources for difficulty grading
- **Hybrid opportunity**: Could combine Norvig's clean words with frequency data from SUBTLEX or COCA

#### Potential Implementation:
```python
# Use Norvig's clean word list as vocabulary base
norvig_words = load_norvig_100k()
# Cross-reference with SUBTLEX for frequency data
subtlex_freq = load_subtlex_frequencies()
# Result: Clean vocabulary + reliable frequency information
```
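A more concrete version of that idea, reusing the SUBTLEX loader sketched earlier, might look like this. It assumes `count_1w100k.txt` is one `word<TAB>count` entry per line, and the default Zipf value for words missing from SUBTLEX is an arbitrary placeholder:

```python
def load_norvig_100k(path="norvig/count_1w100k.txt"):
    """Load Norvig's clean 100K word list (uppercase words, tab-separated counts)."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.split("\t")[0].strip().upper()
            if word.isalpha():
                words.add(word)
    return words

def build_vocabulary(norvig_path, subtlex_path, default_zipf=2.0):
    """Clean Norvig vocabulary plus SUBTLEX Zipf scores where available."""
    clean_words = load_norvig_100k(norvig_path)
    zipf = load_subtlex_zipf(subtlex_path)   # from the SUBTLEX parsing sketch above
    return {w: zipf.get(w, default_zipf) for w in clean_words}
```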

### 5. Premium Options (For Comparison - Not Evaluated)

#### XWordInfo (NYT-focused)
- **Cost**: $50 Angel membership
- **Quality**: Every NYT crossword ever published
- **Size**: 200,000+ words
- **Note**: Not evaluated in this analysis

#### Cruciverb
- **Cost**: $35 Gold membership
- **Quality**: Multiple publication sources
- **Note**: Not evaluated in this analysis

## Detailed Comparison Analysis (Updated with Evaluation Results)

| Source | Size | Quality | Frequency Data | Evaluated Quality | Cost | Recommendation |
|--------|------|---------|----------------|-------------------|------|----------------|
| **WordFreq** | 100K+ | ❌ Web-scraped | ✅ Frequency | ❌ Original issues | Free | ⚠️ Current baseline |
| **Collaborative List** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `10THGENCONSOLE`, `AAAA` | Free | ❌ **AVOID** |
| **Spread Wordlist** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `zzzzzzzzzzzzzzz`, `aabb` | Free | ❌ **AVOID** |
| **C. Jones Wordlist** | ~50K | ⚠️ Needs filtering | ⚠️ Arbitrary scoring | ⚠️ Long phrases, compounds | Free | ⚠️ **CLEANUP REQUIRED** |
| **SUBTLEX-US** | 74K | ✅ Reasonable quality | ✅ Zipf 1-7 | ✅ Clean, some phrases | Free | ✅ **VIABLE** |
| **COCA (free)** | 6K | Excellent | ✅ Rich metadata | "Phew, this is good" | Free | **RECOMMENDED** |
| **COCA (full)** | 1M+ | Excellent | ✅ Rich metadata | Sophisticated metadata | $$$ | **PREMIUM CHOICE** |
| **Norvig 100K** | 100K | Very clean | ❌ None included | Clean, no garbage | Free | **HYBRID BASE** |

## Updated Implementation Recommendations (Post-Evaluation)

### Recommended Approach: Hybrid COCA + Norvig System

Based on hands-on evaluation, the cleanest approach combines the best of multiple sources:

#### Option A: COCA Free + Extended Coverage (Recommended)
```python
# 1. Load COCA 6K words as high-quality core
def load_coca_core():
    """Load 6K high-quality words from COCA free sample"""
    # Excellent quality, rich metadata, reliable frequencies
    return parse_coca_free_sample()

# 2. Extend with filtered SUBTLEX for broader coverage
def extend_with_subtlex():
    """Add clean words from SUBTLEX for broader coverage"""
    # Filter out phrases, keep single words only
    # Use Zipf scores for difficulty grading
    return filtered_subtlex_words()

# 3. Cross-reference with Norvig's clean list for validation
def validate_with_norvig():
    """Use Norvig's 100K list to validate word cleanliness"""
    norvig_clean = load_norvig_100k()
    # Only include words that appear in Norvig's curated list
    return validated_vocabulary
```

#### Option B: Norvig Base + Frequency Cross-Reference (Alternative)
```python
# 1. Start with Norvig's clean 100K vocabulary
norvig_words = load_norvig_100k()

# 2. Cross-reference with COCA for frequency data
coca_freq = load_coca_frequencies()      # Free 6K sample
subtlex_freq = load_subtlex_frequencies()  # Broader coverage

# 3. Assign frequencies with fallback chain
def get_word_difficulty(word):
    if word in coca_freq:
        return coca_freq[word]       # Highest quality
    elif word in subtlex_freq:
        return subtlex_freq[word]    # Good quality
    else:
        return default_difficulty    # Fallback
```

### Why This Hybrid Approach Works

#### Problems with "Crossword-Specific" Lists:
- **Collaborative Word List**: Contains `10THGENCONSOLE`, `AAAA`, `AAAAUTOCLUB`
- **Spread the Wordlist**: Contains `zzzzzzzzzzzzzzz`, `aaaaddress`, `aabba`
- **Christopher Jones**: Contains `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`
- **Verdict**: All require extensive cleanup, defeating their supposed advantage

#### Advantages of COCA + Norvig Hybrid:
- **COCA Free**: 6K professionally curated, academically validated words
- **Norvig 100K**: Clean vocabulary from Google's Director of Research
- **SUBTLEX**: Reasonable quality with psycholinguistic validity
- **No garbage**: Avoids the cleanup nightmare of "crossword-specific" lists
- **Research backing**: Academic and industry validation

### Updated Difficulty Grading System

```python
def classify_word_difficulty(word):
    """Updated difficulty classification using clean sources"""

    # Priority 1: COCA data (highest quality)
    if word in coca_frequencies:
        freq_rank = coca_frequencies[word]['rank']
        if freq_rank <= 1000:
            return "easy"
        elif freq_rank <= 3000:
            return "medium"
        else:
            return "hard"

    # Priority 2: SUBTLEX Zipf score
    elif word in subtlex_zipf:
        zipf = subtlex_zipf[word]
        if zipf >= 4.5:
            return "easy"    # Very common
        elif zipf >= 2.5:
            return "medium"  # Moderately common
        else:
            return "hard"    # Rare

    # Fallback: Conservative classification
    else:
        return "medium"  # Unknown words default to medium
```

## Updated Technical Integration Steps

### 1. Data Download and Preprocessing (Revised)

```bash
# Download COCA free sample (6K high-quality words)
wget https://raw.githubusercontent.com/brucewlee/COCA-WordFrequency/master/coca_5000.txt

# Download Peter Norvig's clean 100K word list
wget https://norvig.com/ngrams/count_1w100k.txt

# Download SUBTLEX-US (requires academic access)
# Available at: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus

# AVOID these due to quality issues:
# ❌ Collaborative Word List (contains garbage)
# ❌ Spread the Wordlist (contains garbage)
# ❌ Christopher Jones (needs extensive cleanup)
```

### 2. Data Structure Migration

```python
class EnhancedVocabulary:
    def __init__(self):
        self.collaborative_scores = {}  # word -> quality score (10-100)
        self.subtlex_zipf = {}          # word -> zipf score (1-7)
        self.subtlex_pos = {}           # word -> part of speech
        self.word_embeddings = {}       # word -> embedding vector

    def load_all_sources(self):
        """Load and integrate all vocabulary sources"""
        self.load_collaborative_wordlist()
        self.load_subtlex_data()
        self.compute_embeddings()  # Keep existing all-mpnet-base-v2

    def is_crossword_suitable(self, word):
        """Filter based on crossword appropriateness"""
        return word.upper() in self.collaborative_scores
```

### 3. Configuration Updates

```python
# Environment variables to add
VOCAB_SOURCE = "collaborative"  # "collaborative", "subtlex", "hybrid"
COLLABORATIVE_WORDLIST_URL = "https://raw.githubusercontent.com/..."
SUBTLEX_DATA_PATH = "/path/to/subtlex_us.txt"
MIN_CROSSWORD_QUALITY = 30  # Minimum collaborative score
MIN_ZIPF_SCORE = 2.0        # Minimum SUBTLEX frequency
```

## Quality Scoring Systems Comparison

### WordFreq (Current)
- **Scale**: Frequency values (logarithmic)
- **Basis**: Web text frequency
- **Issues**: No quality filtering, includes inappropriate content

### Collaborative Word List
- **Scale**: 10-100 quality score
- **Basis**: Crossword constructor consensus
- **Interpretation**:
  - 70-100: Excellent crossword words (common, clean)
  - 40-69: Good crossword words (moderate difficulty)
  - 10-39: Challenging words (obscure, specialized)

### SUBTLEX Zipf Scale
- **Scale**: 1-7 (logarithmic)
- **Basis**: Psycholinguistic word processing research
- **Interpretation**:
  - 6-7: Ultra common (THE, AND, OF)
  - 4-5: Common (HOUSE, WATER, FRIEND)
  - 2-3: Uncommon (BIZARRE, ELOQUENT)
  - 1: Rare (OBSEQUIOUS, PERSPICACIOUS)
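If a source provides only raw counts, the standard Zipf transform can put it on the same 1-7 scale. A quick sketch (the ~51 million token corpus size is SUBTLEX-US's and is used here only as an example default):

```python
import math

def zipf_score(count, corpus_tokens=51_000_000):
    """Zipf = log10(frequency per million tokens) + 3, clamped to the usual 1-7 range."""
    per_million = (count / corpus_tokens) * 1_000_000
    return max(1.0, min(7.0, math.log10(per_million) + 3.0))

# A word seen 1,500,000 times in a 51M-token corpus lands at the top of the scale (ultra common)
print(round(zipf_score(1_500_000), 2))  # 7.0
```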

## Expected Benefits

### Immediate Quality Improvements:
1. **Cleaner intersections**: No more "ethology/guns/porn" issues
2. **Family-friendly vocabulary**: Community-curated appropriateness
3. **Better difficulty calibration**: Psycholinguistically validated scales
4. **Crossword-optimized**: Words chosen for puzzle suitability

### Long-term Advantages:
1. **Community support**: Active maintenance by crossword constructors
2. **Research backing**: SUBTLEX has extensive academic validation
3. **Hybrid flexibility**: Can combine multiple quality signals
4. **Scalability**: Easy to add new vocabulary sources

## Migration Strategy

### Week 1: Data Integration
- Download and preprocess vocabulary sources
- Create vocabulary loading pipeline
- Implement basic quality filtering

### Week 2: Scoring System
- Implement hybrid quality scoring
- Map quality scores to difficulty levels
- Test with existing multi-topic intersection methods

### Week 3: Performance Validation
- A/B test against the WordFreq baseline
- Measure semantic intersection quality
- Validate difficulty calibration

### Week 4: Production Deployment
- Update environment configuration
- Monitor vocabulary coverage
- Collect user feedback on word quality

## Alternative Implementation: Gradual Migration

For lower risk, implement a gradual migration:

```python
def get_word_quality(word):
    """Gradual migration approach"""
    if word in collaborative_scores:
        # Use collaborative score if available
        return collaborative_scores[word] / 100.0
    elif word in subtlex_zipf:
        # Fallback to SUBTLEX
        return subtlex_zipf[word] / 7.0
    else:
        # Final fallback to WordFreq
        return word_frequency(word, 'en')
```

This allows testing new vocabulary sources while maintaining compatibility with existing words not found in curated lists.

## Conclusion (Updated After Hands-On Evaluation)

**Key Finding**: Most "crossword-specific" vocabulary lists contain significant amounts of junk data that require extensive cleanup, defeating their supposed advantage over general-purpose sources.

**Recommended Solution**: Combine high-quality general sources instead:
1. **COCA free sample** (6K words) for core high-quality vocabulary
2. **Peter Norvig's 100K list** for clean, broad coverage
3. **SUBTLEX** for psycholinguistically validated difficulty grading
4. **Avoid crossword-specific lists** until they improve their curation

This hybrid approach provides:
- **Clean vocabulary**: No `10THGENCONSOLE`, `zzzzzzzzzzzzzzz`, or `AAAAUTOCLUB` garbage
- **Academic validation**: COCA and SUBTLEX are research-proven
- **Industry credibility**: Norvig's list comes from Google's Director of Research
- **Reasonable coverage**: 6K-100K words should handle most crossword needs
- **Better difficulty calibration**: Psycholinguistic frequency data beats arbitrary scores

**Next Steps**:
1. Start with the COCA free sample as a proof of concept
2. Extend with filtered SUBTLEX for broader coverage
3. Validate against Norvig's clean list
4. Consider the COCA full version if budget allows

The investment in clean, research-backed vocabulary data will dramatically improve puzzle quality without the cleanup nightmare of supposedly "crossword-specific" sources.
hack/SUBTLEX/SUBTLEXus74286wordstextversion.txt
ADDED
The diff for this file is too large to render.
hack/analyze_norvig_vocabulary.py
ADDED
@@ -0,0 +1,400 @@
```python
#!/usr/bin/env python3
"""
Statistical Analysis of Norvig Word Count Files

Analyzes a single Norvig word count file (count_1w.txt or count_1w100k.txt)
from norvig.com/ngrams/ to understand vocabulary characteristics for crossword generation.

Usage:
    python analyze_norvig_vocabulary.py <filename>
    python analyze_norvig_vocabulary.py --help

Examples:
    python analyze_norvig_vocabulary.py norvig/count_1w100k.txt
    python analyze_norvig_vocabulary.py norvig/count_1w.txt
"""

import os
import sys
import argparse
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter, defaultdict
import seaborn as sns
from pathlib import Path

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

def parse_arguments():
    """Parse command line arguments"""
    parser = argparse.ArgumentParser(
        description='Analyze Norvig word count files for crossword generation',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    python analyze_norvig_vocabulary.py norvig/count_1w100k.txt
    python analyze_norvig_vocabulary.py norvig/count_1w.txt
    python analyze_norvig_vocabulary.py --help

File formats supported:
    - count_1w100k.txt: Top 100,000 most frequent words
    - count_1w.txt: Full word count dataset (1M+ words)

Output:
    - Comprehensive statistical analysis
    - 6-panel visualization saved as norvig_analysis_<dataset>.png
    - Summary statistics printed to console
"""
    )

    parser.add_argument(
        'filename',
        help='Path to Norvig word count file (e.g., norvig/count_1w100k.txt)'
    )

    return parser.parse_args()

def load_word_counts(filepath):
    """Load word count file and return dict of {word: count}"""
    word_counts = {}
    total_lines = 0

    print(f"Loading {filepath}...")

    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                total_lines += 1
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    word, count = parts
                    word_counts[word.upper()] = int(count)
                elif len(parts) == 1 and line.strip():
                    # Handle case where count might be missing
                    word = parts[0]
                    word_counts[word.upper()] = 1

        print(f"✅ Loaded {len(word_counts):,} words from {filepath}")
        return word_counts

    except FileNotFoundError:
        print(f"❌ File not found: {filepath}")
        return {}
    except Exception as e:
        print(f"❌ Error loading {filepath}: {e}")
        return {}

def analyze_word_lengths(words):
    """Analyze distribution of word lengths"""
    lengths = [len(word) for word in words]
    length_dist = Counter(lengths)

    return lengths, length_dist

def classify_difficulty(rank, total_words):
    """Classify word difficulty based on frequency rank"""
    if rank <= total_words * 0.05:    # Top 5%
        return "Very Easy"
    elif rank <= total_words * 0.20:  # Top 20%
        return "Easy"
    elif rank <= total_words * 0.60:  # Top 60%
        return "Medium"
    elif rank <= total_words * 0.85:  # Top 85%
        return "Hard"
    else:
        return "Very Hard"

def create_comprehensive_analysis(word_counts, filename, base_dir):
    """Create comprehensive statistical analysis with readable plots"""

    # Create figure with subplots - 2x3 layout with good spacing
    fig = plt.figure(figsize=(18, 12))
    fig.suptitle(f'Norvig Word Count Analysis - {filename}',
                 fontsize=16, fontweight='bold', y=0.95)

    # Convert to sorted lists for analysis
    words = list(word_counts.keys())
    counts = list(word_counts.values())
    ranks = list(range(1, len(counts) + 1))

    # 1. Zipf's Law Analysis (log-log plot)
    ax1 = plt.subplot(2, 3, 1)
    plt.loglog(ranks, counts, 'b-', alpha=0.7, linewidth=2)
    plt.xlabel('Rank (log scale)')
    plt.ylabel('Frequency (log scale)')
    plt.title('Zipf\'s Law Validation', fontweight='bold')
    plt.grid(True, alpha=0.3)

    # Add theoretical Zipf line for comparison
    theoretical_zipf = [counts[0] / r for r in ranks]
    plt.loglog(ranks, theoretical_zipf, 'r--', alpha=0.5, label='Theoretical')
    plt.legend()

    # 2. Word Length Distribution
    ax2 = plt.subplot(2, 3, 2)
    lengths, length_dist = analyze_word_lengths(words)
    lengths_list = sorted(length_dist.keys())
    counts_list = [length_dist[l] for l in lengths_list]

    bars = plt.bar(lengths_list, counts_list, alpha=0.7, color='skyblue', edgecolor='navy')
    plt.xlabel('Word Length (characters)')
    plt.ylabel('Number of Words')
    plt.title('Word Length Distribution', fontweight='bold')

    # Highlight crossword-suitable range (3-12 letters)
    for i, bar in enumerate(bars):
        if 3 <= lengths_list[i] <= 12:
            bar.set_color('lightgreen')
        elif lengths_list[i] < 3 or lengths_list[i] > 15:
            bar.set_color('lightcoral')

    plt.axvspan(3, 12, alpha=0.2, color='green', label='Crossword Range')
    plt.legend()

    # 3. Difficulty Distribution
    ax3 = plt.subplot(2, 3, 3)
    difficulty_dist = defaultdict(int)
    for rank in ranks:
        difficulty = classify_difficulty(rank, len(ranks))
        difficulty_dist[difficulty] += 1

    diff_labels = list(difficulty_dist.keys())
    diff_counts = list(difficulty_dist.values())
    colors = ['darkgreen', 'green', 'orange', 'red', 'darkred']

    wedges, texts, autotexts = plt.pie(diff_counts, labels=diff_labels, autopct='%1.1f%%',
                                       colors=colors[:len(diff_labels)], startangle=90)
    plt.title('Difficulty Distribution', fontweight='bold')

    # 4. Cumulative Frequency Coverage
    ax4 = plt.subplot(2, 3, 4)
    cumulative_freq = np.cumsum(counts)
    total_freq = cumulative_freq[-1]
    coverage_pct = (cumulative_freq / total_freq) * 100

    plt.plot(ranks, coverage_pct, 'g-', linewidth=2)
    plt.xlabel('Vocabulary Size')
    plt.ylabel('Coverage (%)')
    plt.title('Cumulative Coverage', fontweight='bold')
    plt.grid(True, alpha=0.3)

    # Add key milestone markers
    milestones = [1000, 5000, 10000, 25000, 50000]
    for milestone in milestones:
        if milestone < len(coverage_pct):
            plt.axvline(x=milestone, color='red', linestyle='--', alpha=0.5)

    # 5. Crossword Suitability
    ax5 = plt.subplot(2, 3, 5)
    crossword_suitable = {word: count for word, count in word_counts.items()
                          if 3 <= len(word) <= 12 and word.isalpha()}

    total_words = len(word_counts)
    suitable_words = len(crossword_suitable)
    unsuitable_words = total_words - suitable_words

    labels = [f'Suitable\n{suitable_words:,}', f'Not Suitable\n{unsuitable_words:,}']
    sizes = [suitable_words, unsuitable_words]
    colors = ['lightgreen', 'lightcoral']

    wedges, texts, autotexts = plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors, startangle=90)
    plt.title('Crossword Suitability', fontweight='bold')

    # 6. Difficulty Categories for Crosswords
    ax6 = plt.subplot(2, 3, 6)

    # Define crossword difficulty thresholds
    easy_threshold = 5000
    medium_threshold = 25000

    easy_words = sum(1 for i, word in enumerate(words[:easy_threshold]) if 3 <= len(word) <= 12 and i < len(words))
    medium_words = sum(1 for i, word in enumerate(words[easy_threshold:medium_threshold]) if 3 <= len(word) <= 12 and (i + easy_threshold) < len(words))
    hard_words = sum(1 for i, word in enumerate(words[medium_threshold:]) if 3 <= len(word) <= 12 and (i + medium_threshold) < len(words))

    categories = ['Easy', 'Medium', 'Hard']
    word_counts_cat = [easy_words, medium_words, hard_words]
    colors_cat = ['lightgreen', 'gold', 'lightcoral']

    bars = plt.bar(categories, word_counts_cat, color=colors_cat, alpha=0.8)
    plt.ylabel('Crossword Words')
    plt.title('Difficulty Categories\n(Based on Frequency Rank)', fontweight='bold')

    # Add value labels on bars
    for bar, count in zip(bars, word_counts_cat):
        height = bar.get_height()
        if height > 0:
            plt.text(bar.get_x() + bar.get_width()/2, height + max(word_counts_cat)*0.02,
                     f'{count:,}', ha='center', va='bottom', fontweight='bold')

    # Add explanation text box with examples
    # Get some example words for each category
    easy_examples = [w for i, w in enumerate(words[:100]) if 3 <= len(w) <= 12][:3]
    medium_examples = [w for i, w in enumerate(words[7000:12000]) if 3 <= len(w) <= 12][:3]
    hard_examples = [w for i, w in enumerate(words[30000:35000]) if 3 <= len(w) <= 12][:3]

    explanation = (f'Easy: Ranks 1-5,000 (most frequent)\n'
                   f'  e.g., {", ".join(easy_examples[:3])}\n'
                   f'Medium: Ranks 5,001-25,000\n'
                   f'  e.g., {", ".join(medium_examples[:3])}\n'
                   f'Hard: Ranks 25,001+ (least frequent)\n'
                   f'  e.g., {", ".join(hard_examples[:3])}\n\n'
                   'Lower rank = higher frequency = easier')

    plt.text(0.98, 0.98, explanation, transform=ax6.transAxes,
             fontsize=8, verticalalignment='top', horizontalalignment='right',
             bbox=dict(boxstyle='round,pad=0.5', facecolor='lightblue', alpha=0.9))

    # Adjust layout with proper spacing
    plt.subplots_adjust(left=0.08, bottom=0.08, right=0.95, top=0.88, wspace=0.35, hspace=0.45)

    # Save the comprehensive analysis with the dataset name in the output filename
    if 'count_1w100k' in filename:
        output_name = 'norvig_analysis_100k.png'
    elif 'count_1w.txt' in filename:
        output_name = 'norvig_analysis_full.png'
    else:
        # Fallback for any other filename - make it filesystem safe
        safe_name = filename.replace('.txt', '').replace('/', '_').replace('count_', '')
        output_name = f'norvig_analysis_{safe_name}.png'

    output_path = base_dir / output_name
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    print(f"Comprehensive analysis saved to: {output_path}")

    return fig, crossword_suitable

def print_summary_statistics(word_counts, filename, crossword_suitable):
    """Print comprehensive summary statistics"""

    print("\n" + "="*80)
    print("NORVIG VOCABULARY STATISTICAL ANALYSIS")
    print(f"File: {filename}")
    print("="*80)

    # Basic statistics
    total_words = len(word_counts)
    total_frequency = sum(word_counts.values())

    print(f"\nBASIC STATISTICS:")
    print(f"  • Total words: {total_words:,}")
    print(f"  • Total frequency: {total_frequency:,}")
    print(f"  • Average frequency: {total_frequency/total_words:.2f}")

    # Word length analysis
    lengths, length_dist = analyze_word_lengths(word_counts.keys())
    avg_length = np.mean(lengths)
    crossword_length_words = sum(count for length, count in length_dist.items() if 3 <= length <= 12)
    crossword_length_pct = (crossword_length_words / total_words) * 100

    print(f"\nWORD LENGTH ANALYSIS:")
    print(f"  • Average word length: {avg_length:.1f} characters")
    print(f"  • Words 3-12 characters: {crossword_length_words:,} ({crossword_length_pct:.1f}%)")
    print(f"  • Most common lengths: {sorted(length_dist.items(), key=lambda x: x[1], reverse=True)[:5]}")

    # Crossword suitability
    suitable_count = len(crossword_suitable)
    suitable_pct = (suitable_count / total_words) * 100
    suitable_freq = sum(crossword_suitable.values())
    suitable_freq_pct = (suitable_freq / total_frequency) * 100

    print(f"\nCROSSWORD SUITABILITY:")
    print(f"  • Suitable words (3-12 letters, alphabetic): {suitable_count:,} ({suitable_pct:.1f}%)")
    print(f"  • Suitable word frequency coverage: {suitable_freq_pct:.1f}%")

    # Difficulty distribution for crosswords
    easy_words = len([w for w, c in list(crossword_suitable.items())[:5000]])
    medium_words = len([w for w, c in list(crossword_suitable.items())[5000:25000]])
    hard_words = len([w for w, c in list(crossword_suitable.items())[25000:]])

    print(f"\nCROSSWORD DIFFICULTY DISTRIBUTION:")
    print(f"  • Easy (rank 1-5K): {easy_words:,} words")
    print(f"  • Medium (rank 5K-25K): {medium_words:,} words")
    print(f"  • Hard (rank 25K+): {hard_words:,} words")

    # Top and bottom words examples
    words_list = list(word_counts.keys())
    print(f"\nTOP 10 MOST FREQUENT WORDS:")
    for i, word in enumerate(words_list[:10], 1):
        print(f"  {i:2d}. {word:<12} ({word_counts[word]:,})")

    print(f"\nBOTTOM 10 LEAST FREQUENT WORDS:")
    for i, word in enumerate(words_list[-10:], 1):
        print(f"  {i:2d}. {word:<12} ({word_counts[word]:,})")

    # Zipf's law validation
    words_list = list(word_counts.keys())
    counts_list = list(word_counts.values())

    # Calculate correlation coefficient for log-log relationship
    log_ranks = np.log(range(1, len(counts_list) + 1))
    log_freqs = np.log(counts_list)
    correlation = np.corrcoef(log_ranks, log_freqs)[0, 1]

    print(f"\nZIPF'S LAW VALIDATION:")
    print(f"  • Log-log correlation: {correlation:.4f}")
    print(f"  • Zipf compliance: {'✅ Excellent' if abs(correlation) > 0.95 else '⚠️ Moderate' if abs(correlation) > 0.8 else '❌ Poor'}")

    # Recommendations
    print(f"\nRECOMMENDATIONS FOR CROSSWORD GENERATION:")
    print(f"  • Dataset size: {total_words:,} words with excellent coverage")
    print(f"  • Filter to 3-12 letters: Reduces to {suitable_count:,} words ({suitable_pct:.1f}%)")
    print(f"  • Difficulty thresholds (for crossword-suitable words):")
    print(f"    - Easy: ranks 1-5,000 ({easy_words:,} suitable words)")
    print(f"    - Medium: ranks 5,001-25,000 ({medium_words:,} suitable words)")
    print(f"    - Hard: ranks 25,001+ ({hard_words:,} suitable words)")
    print(f"  • Quality: ✅ No garbage entries (unlike crossword-specific lists)")
    print(f"  • Source credibility: ✅ Peter Norvig (Google) + Google Books corpus")

    print("="*80)

def main():
    """Main analysis function"""

    # Parse command line arguments
    args = parse_arguments()

    # File paths
    base_dir = Path(__file__).parent
    input_file = Path(args.filename)

    # Make path relative to script directory if not absolute
    if not input_file.is_absolute():
        input_file = base_dir / input_file

    print("Norvig Vocabulary Statistical Analysis")
    print("=" * 50)
    print(f"Analyzing: {input_file}")

    # Load data
    word_counts = load_word_counts(input_file)

    if not word_counts:
        print(f"❌ Could not load word list from {input_file}. Please check the file path.")
        return

    # Create comprehensive analysis
    fig, crossword_suitable = create_comprehensive_analysis(word_counts, input_file.name, base_dir)

    # Print summary statistics
    print_summary_statistics(word_counts, input_file.name, crossword_suitable)

    # Don't show the plot interactively in CLI usage, just save it
    # plt.show()

    # Generate the same output filename logic for the final message
    if 'count_1w100k' in input_file.name:
        output_name = 'norvig_analysis_100k.png'
    elif 'count_1w.txt' in input_file.name:
        output_name = 'norvig_analysis_full.png'
    else:
        safe_name = input_file.name.replace('.txt', '').replace('/', '_').replace('count_', '')
        output_name = f'norvig_analysis_{safe_name}.png'

    print(f"\n✅ Analysis complete! Check {base_dir}/{output_name} for detailed plots.")

if __name__ == "__main__":
    main()
```
hack/norvig/count_1w.txt
ADDED
The diff for this file is too large to render.
hack/norvig/count_1w100k.txt
ADDED
The diff for this file is too large to render.