# Vocabulary Optimization & Unification
## Problem Solved
Previously, the crossword system had **vocabulary redundancy** with 3 separate sources:
- **SentenceTransformer Model Vocabulary**: ~30K tokens → ~8-12K actual words after filtering
- **NLTK Words Corpus**: 41,998 words for embeddings in thematic generator
- **WordFreq Database**: 319,938 words for frequency data
This created inconsistencies, memory waste, and limited vocabulary coverage.
## Solution: Unified Architecture
### New Design
- **Single Vocabulary Source**: WordFreq database (319,938 words)
- **Single Embedding Model**: all-mpnet-base-v2 (generates embeddings for any text)
- **Unified Filtering**: Consistent crossword-suitable word filtering
- **Shared Caching**: Single vocabulary + embeddings + frequency cache
### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)
- Loads and filters WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words)
- Generates frequency data with 10-tier classification
- Handles caching for performance
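The filtering rules above (3-12 characters, alphabetic, excluding "boring" words) can be sketched roughly as follows. `BORING_WORDS` and `is_crossword_suitable` are illustrative stand-ins, not the actual VocabularyManager internals:

```python
# Sketch of the crossword-suitable filter: keep 3-12 character, purely
# alphabetic words, and drop an exclusion set of "boring" words.
# The placeholder set below is illustrative only.
BORING_WORDS = {"the", "and", "was", "very"}

def is_crossword_suitable(word: str) -> bool:
    word = word.lower()
    if not (3 <= len(word) <= 12):
        return False
    if not (word.isalpha() and word.isascii()):
        return False
    return word not in BORING_WORDS

candidates = ["cat", "quantum", "the", "e", "anti-matter", "hippopotamus"]
filtered = [w for w in candidates if is_crossword_suitable(w)]
# filtered == ["cat", "quantum", "hippopotamus"]
```

The same predicate can be applied to the full WordFreq word list to produce the cached vocabulary.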
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)
- Uses WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for WordFreq words
- Maintains 10-tier frequency classification system
- Provides both hack tool API and backend-compatible API
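The 10-tier classification can be sketched as a simple banding of Zipf frequency scores (wordfreq reports scores on roughly a 0-8 scale). The cutoffs and the plain `tier_N` names below are illustrative; the generator's actual tiers carry descriptors such as `tier_5_common`:

```python
# Illustrative 10-tier classifier: split the ~0-8 Zipf range into ten
# 0.8-wide bands, tier 1 = rarest, tier 10 = most common.
def classify_tier(zipf_score: float) -> str:
    tier = min(10, max(1, int(zipf_score * 1.25) + 1))
    return f"tier_{tier}"

print(classify_tier(0.0))  # tier_1 (very rare)
print(classify_tier(8.0))  # tier_10 (very common)
```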
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)
- Bridge adapter for backend integration
- Compatible with existing VectorSearchService interface
- Uses comprehensive WordFreq vocabulary instead of limited model vocabulary
## Usage
### For Hack Tools
```python
from thematic_word_generator import UnifiedThematicWordGenerator
# Initialize with desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()
# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration
#### Option 1: Replace VectorSearchService
```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service
# Initialize
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```
#### Option 2: Direct Usage
```python
from .unified_word_service import UnifiedWordService
service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()
# Compatible with existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements
### Memory Usage
- **Before**: 3 separate vocabularies + embeddings (~500MB+)
- **After**: Single vocabulary + embeddings (~200MB)
- **Reduction**: ~60% memory usage reduction
### Vocabulary Coverage
- **Before**: Limited to ~8-12K words from model tokenizer
- **After**: Up to 100K+ filtered words from WordFreq database
- **Improvement**: 10x+ vocabulary coverage
### Consistency
- **Before**: Different words available in hack tools vs backend
- **After**: Same comprehensive vocabulary across all components
- **Benefit**: Consistent word quality and availability
## Configuration
### Environment Variables
- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: 100000)
- `EMBEDDING_MODEL`: Model name (default: all-mpnet-base-v2)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity (default: 0.3)
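A minimal sketch of how these variables might be read, assuming plain `os.environ` lookups with the documented defaults (the actual services may wire configuration differently):

```python
import os

# Hypothetical config loader mirroring the environment variables above.
def load_config() -> dict:
    return {
        "max_vocabulary_size": int(os.environ.get("MAX_VOCABULARY_SIZE", "100000")),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "all-mpnet-base-v2"),
        "word_similarity_threshold": float(os.environ.get("WORD_SIMILARITY_THRESHOLD", "0.3")),
    }

cfg = load_config()
```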
### Vocabulary Size Options
- **Small (10K)**: Fast initialization, basic vocabulary
- **Medium (50K)**: Balanced performance and coverage
- **Large (100K)**: Comprehensive coverage, slower initialization
- **Full (319K)**: Complete WordFreq database, longest initialization
## Migration Guide
### For Existing Hack Tools
1. Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
2. Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator`
3. The API remains compatible, but now draws on the comprehensive WordFreq vocabulary
### For Backend Services
1. Import: `from .unified_word_service import UnifiedWordService`
2. Replace `VectorSearchService` initialization with `UnifiedWordService`
3. All existing methods remain compatible
4. Benefits: Better vocabulary coverage, consistent frequency data
### Backwards Compatibility
- All existing APIs maintained
- Same method signatures and return formats
- Gradual migration possible - can run both systems in parallel
## Benefits Summary
- ✅ **Eliminates Redundancy**: Single vocabulary source instead of 3 separate ones
- ✅ **Improves Coverage**: 100K+ words vs. the previous ~8-12K
- ✅ **Reduces Memory**: ~60% reduction in memory usage
- ✅ **Ensures Consistency**: Same vocabulary across hack tools and backend
- ✅ **Maintains Performance**: Smart caching and batch processing
- ✅ **Preserves Features**: 10-tier frequency classification, difficulty filtering
- ✅ **Enables Growth**: Easy to add new features on the unified architecture
## Cache Management
### Cache Locations
- **Hack tools**: `hack/model_cache/`
- **Backend**: `crossword-app/backend-py/cache/unified_generator/`
### Cache Files
- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings
### Cache Invalidation
Caches are automatically rebuilt if:
- Vocabulary size limit changes
- Embedding model changes
- WordFreq database updates (rare)
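The first two invalidation triggers fall out of the naming scheme: because the cache filenames embed the model name and vocabulary size, changing either parameter produces a different path, so a stale cache is simply never loaded. A sketch with an illustrative helper (the real path construction lives in the generator):

```python
from pathlib import Path

# Hypothetical helper mirroring the embedding cache filename pattern
# shown above (unified_embeddings_<model>_<size>.npy).
def embedding_cache_path(cache_dir: str, model: str, size: int) -> Path:
    return Path(cache_dir) / f"unified_embeddings_{model}_{size}.npy"

old = embedding_cache_path("hack/model_cache", "all-mpnet-base-v2", 50000)
new = embedding_cache_path("hack/model_cache", "all-mpnet-base-v2", 100000)
# old != new, so raising the size limit to 100K rebuilds rather than
# reusing the 50K cache.
```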
## Future Enhancements
1. **Semantic Clustering**: Group words by semantic similarity
2. **Dynamic Difficulty**: Real-time difficulty adjustment based on user performance
3. **Topic Expansion**: Automatic topic discovery and expansion
4. **Multilingual Support**: Extend to other languages using WordFreq
5. **Custom Vocabularies**: Allow domain-specific vocabulary additions