# Vocabulary Optimization & Unification

## Problem Solved

Previously, the crossword system had vocabulary redundancy across three separate sources:

- SentenceTransformer Model Vocabulary: ~30K tokens → ~8-12K actual words after filtering
- NLTK Words Corpus: 41,998 words for embeddings in the thematic generator
- WordFreq Database: 319,938 words for frequency data

This created inconsistencies, wasted memory, and limited vocabulary coverage.

## Solution: Unified Architecture

### New Design
- Single Vocabulary Source: WordFreq database (319,938 words)
- Single Embedding Model: all-mpnet-base-v2 (generates embeddings for any text)
- Unified Filtering: Consistent crossword-suitable word filtering
- Shared Caching: Single vocabulary + embeddings + frequency cache
### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)

- Loads and filters the WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words; sketched below)
- Generates frequency data with a 10-tier classification
- Handles caching for performance
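
A minimal sketch of that filter, assuming a simple predicate (the function name and the exclusion list below are illustrative, not the actual implementation):

```python
# Illustrative stand-in for VocabularyManager's filter; the real logic
# lives in hack/thematic_word_generator.py.
BORING_WORDS = {"the", "and", "that", "with", "from"}  # hypothetical exclusion list

def is_crossword_suitable(word: str) -> bool:
    """Keep 3-12 character, purely alphabetic words that aren't filler."""
    return (
        3 <= len(word) <= 12
        and word.isalpha()
        and word.lower() not in BORING_WORDS
    )
```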
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)

- Uses the WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for the WordFreq words (see the sketch below)
- Maintains the 10-tier frequency classification system
- Provides both the hack tool API and a backend-compatible API
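
A sketch of the embedding step using the standard sentence-transformers API; the batch size and normalization flag are illustrative choices, not the generator's actual settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_vocabulary(words: list[str]) -> np.ndarray:
    """Encode WordFreq words with all-mpnet-base-v2, one 768-dim vector per word."""
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(
        words,
        batch_size=256,               # illustrative batch size
        convert_to_numpy=True,
        normalize_embeddings=True,    # unit vectors make cosine similarity a dot product
        show_progress_bar=True,
    )
```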
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)

- Bridge adapter for backend integration (see the sketch below)
- Compatible with the existing VectorSearchService interface
- Uses the comprehensive WordFreq vocabulary instead of the limited model vocabulary
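
A rough sketch of the adapter idea; apart from `find_similar_words` and `tier_5_common`, the method bodies and tier names here are assumptions for illustration, not the service's actual code:

```python
from thematic_word_generator import UnifiedThematicWordGenerator  # backend import path may differ

class UnifiedWordService:
    """Bridge adapter: exposes the VectorSearchService-style interface on top
    of UnifiedThematicWordGenerator. Internals are illustrative."""

    def __init__(self, vocab_size_limit: int = 100000):
        self._generator = UnifiedThematicWordGenerator(vocab_size_limit=vocab_size_limit)

    async def initialize(self) -> None:
        self._generator.initialize()

    async def find_similar_words(self, topic: str, difficulty: str, max_words: int = 15) -> list[str]:
        # Hypothetical difficulty -> tier mapping; the real mapping lives in the service.
        tier = {
            "easy": "tier_3_very_common",   # assumed tier names, except tier_5_common
            "medium": "tier_5_common",
            "hard": "tier_7_uncommon",
        }.get(difficulty)
        results = self._generator.generate_thematic_words(
            topic=topic, num_words=max_words, difficulty_tier=tier
        )
        return [word for word, _similarity, _tier in results]
```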
## Usage

### For Hack Tools
```python
from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with the desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration

#### Option 1: Replace VectorSearchService
```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service

# Initialize the drop-in replacement for VectorSearchService
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```
#### Option 2: Direct Usage
```python
from .unified_word_service import UnifiedWordService

service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()

# Compatible with the existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements

### Memory Usage

- Before: 3 separate vocabularies + embeddings (~500MB+)
- After: single vocabulary + embeddings (~200MB)
- Reduction: ~60% less memory

### Vocabulary Coverage

- Before: limited to ~8-12K words from the model tokenizer
- After: up to 100K+ filtered words from the WordFreq database
- Improvement: 10x+ vocabulary coverage

### Consistency

- Before: different words available in hack tools vs. the backend
- After: the same comprehensive vocabulary across all components
- Benefit: consistent word quality and availability
## Configuration

### Environment Variables

- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: `100000`)
- `EMBEDDING_MODEL`: Embedding model name (default: `all-mpnet-base-v2`)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity score (default: `0.3`)
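
A sketch of how these might be read at startup, falling back to the documented defaults (the actual services may parse them elsewhere):

```python
import os

# Unified-vocabulary settings with the documented defaults.
MAX_VOCABULARY_SIZE = int(os.getenv("MAX_VOCABULARY_SIZE", "100000"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-mpnet-base-v2")
WORD_SIMILARITY_THRESHOLD = float(os.getenv("WORD_SIMILARITY_THRESHOLD", "0.3"))
```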
### Vocabulary Size Options
- Small (10K): Fast initialization, basic vocabulary
- Medium (50K): Balanced performance and coverage
- Large (100K): Comprehensive coverage, slower initialization
- Full (319K): Complete WordFreq database, longest initialization
## Migration Guide

### For Existing Hack Tools

- Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
- Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator` (see the before/after sketch below)
- The API remains compatible, but now uses the comprehensive WordFreq vocabulary
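
In most scripts the swap is a small diff (a sketch, assuming the old class took no constructor arguments):

```python
# Before:
# from thematic_word_generator import ThematicWordGenerator
# generator = ThematicWordGenerator()

# After:
from thematic_word_generator import UnifiedThematicWordGenerator

generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()
```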
### For Backend Services

- Import: `from .unified_word_service import UnifiedWordService`
- Replace `VectorSearchService` initialization with `UnifiedWordService`; all existing methods remain compatible
- Benefits: better vocabulary coverage, consistent frequency data
## Backwards Compatibility

- All existing APIs maintained
- Same method signatures and return formats
- Gradual migration possible: both systems can run in parallel (see the sketch below)
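
A sketch of a parallel-run check during migration, assuming both services are initialized and `find_similar_words` returns a list of word strings:

```python
async def compare_services(legacy_service, unified_service, topic: str = "animal") -> None:
    """Run the same query against both services and report vocabulary overlap."""
    legacy_words = await legacy_service.find_similar_words(topic, "medium", max_words=15)
    unified_words = await unified_service.find_similar_words(topic, "medium", max_words=15)
    overlap = set(legacy_words) & set(unified_words)
    print(f"{topic}: {len(overlap)} shared of {len(unified_words)} unified results")
```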
## Benefits Summary

- ✅ Eliminates Redundancy: single vocabulary source instead of 3 separate ones
- ✅ Improves Coverage: 100K+ words vs. the previous 8-12K
- ✅ Reduces Memory: ~60% reduction in memory usage
- ✅ Ensures Consistency: same vocabulary across hack tools and backend
- ✅ Maintains Performance: smart caching and batch processing
- ✅ Preserves Features: 10-tier frequency classification, difficulty filtering
- ✅ Enables Growth: easy to add new features on the unified architecture
## Cache Management

### Cache Locations

- Hack tools: `hack/model_cache/`
- Backend: `crossword-app/backend-py/cache/unified_generator/`
### Cache Files

- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings
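
The parameters are baked into the file names, which is what makes invalidation automatic: a new size or model simply misses the old cache file. A sketch of the naming scheme above (`CACHE_DIR` and the helper name are illustrative):

```python
from pathlib import Path

CACHE_DIR = Path("hack/model_cache")  # backend uses crossword-app/backend-py/cache/unified_generator/

def embedding_cache_path(model: str, vocab_size: int) -> Path:
    """Build the embeddings cache path; changing the model or the size limit
    yields a different file name, forcing a rebuild."""
    return CACHE_DIR / f"unified_embeddings_{model}_{vocab_size}.npy"
```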
### Cache Invalidation
Caches are automatically rebuilt if:
- Vocabulary size limit changes
- Embedding model changes
- WordFreq database updates (rare)
## Future Enhancements
- Semantic Clustering: Group words by semantic similarity
- Dynamic Difficulty: Real-time difficulty adjustment based on user performance
- Topic Expansion: Automatic topic discovery and expansion
- Multilingual Support: Extend to other languages using WordFreq
- Custom Vocabularies: Allow domain-specific vocabulary additions