Vocabulary Optimization & Unification

Problem Solved

Previously, the crossword system spread its vocabulary across three separate sources, creating redundancy:

  • SentenceTransformer Model Vocabulary: ~30K tokens → ~8-12K actual words after filtering
  • NLTK Words Corpus: 41,998 words for embeddings in thematic generator
  • WordFreq Database: 319,938 words for frequency data

This created inconsistencies, memory waste, and limited vocabulary coverage.

Solution: Unified Architecture

New Design

  • Single Vocabulary Source: WordFreq database (319,938 words)
  • Single Embedding Model: all-mpnet-base-v2 (generates embeddings for any text)
  • Unified Filtering: Consistent crossword-suitable word filtering
  • Shared Caching: Single vocabulary + embeddings + frequency cache

Key Components

1. VocabularyManager (hack/thematic_word_generator.py)

  • Loads and filters WordFreq vocabulary
  • Applies crossword-suitable filtering (3-12 characters, alphabetic, excludes boring words); see the sketch after this list
  • Generates frequency data with 10-tier classification
  • Handles caching for performance
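
A minimal sketch of that filtering step, to make the rules concrete. The function names, the boring-word list, and the use of the wordfreq package's top_n_list() helper are illustrative assumptions, not the actual VocabularyManager implementation:

from wordfreq import top_n_list

# Placeholder exclusion list; the real "boring words" set lives in VocabularyManager.
BORING_WORDS = {"the", "and", "was", "has", "its"}

def is_crossword_suitable(word: str) -> bool:
    """Keep 3-12 character, purely alphabetic words that are not on the boring list."""
    return 3 <= len(word) <= 12 and word.isalpha() and word.lower() not in BORING_WORDS

def load_filtered_vocabulary(vocab_size_limit: int = 100000) -> list:
    # Over-fetch from the WordFreq ranking, then keep only crossword-suitable words.
    candidates = top_n_list("en", vocab_size_limit * 2)
    filtered = [w for w in candidates if is_crossword_suitable(w)]
    return filtered[:vocab_size_limit]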

2. UnifiedThematicWordGenerator (hack/thematic_word_generator.py)

  • Uses WordFreq vocabulary instead of NLTK words
  • Generates all-mpnet-base-v2 embeddings for WordFreq words
  • Maintains the 10-tier frequency classification system (sketched after this list)
  • Provides both hack tool API and backend-compatible API
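
A rough sketch of how a 10-tier classification can be derived from WordFreq data. Only the tier_5_common label is taken from this document; the other tier names and the Zipf-frequency boundaries below are assumptions for illustration:

from wordfreq import zipf_frequency

# Hypothetical tier labels; only "tier_5_common" appears elsewhere in this document.
TIER_LABELS = [
    "tier_1_ultra_common", "tier_2_very_common", "tier_3_quite_common",
    "tier_4_fairly_common", "tier_5_common", "tier_6_moderate",
    "tier_7_uncommon", "tier_8_rare", "tier_9_very_rare", "tier_10_obscure",
]

def classify_tier(word: str) -> str:
    # Zipf frequency runs roughly from 0 (unknown) to 8 (extremely common);
    # scale it onto the ten tiers, most common first.
    zipf = zipf_frequency(word, "en")
    index = min(9, max(0, int((8.0 - zipf) / 0.8)))
    return TIER_LABELS[index]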

3. UnifiedWordService (crossword-app/backend-py/src/services/unified_word_service.py)

  • Bridge adapter for backend integration
  • Compatible with existing VectorSearchService interface
  • Uses the comprehensive WordFreq vocabulary instead of the limited model vocabulary (see the adapter sketch below)
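
A conceptual sketch of the bridge-adapter idea, not the real class: it keeps the find_similar_words() call shape used in the backend examples below while delegating to UnifiedThematicWordGenerator. The difficulty-to-tier mapping and the synchronous generator initialization are assumptions:

from thematic_word_generator import UnifiedThematicWordGenerator

class UnifiedWordServiceSketch:
    # Assumed mapping from backend difficulty strings to frequency tiers.
    DIFFICULTY_TO_TIER = {
        "easy": "tier_3_quite_common",
        "medium": "tier_5_common",
        "hard": "tier_7_uncommon",
    }

    def __init__(self, vocab_size_limit: int = 100000):
        self._generator = UnifiedThematicWordGenerator(vocab_size_limit=vocab_size_limit)

    async def initialize(self) -> None:
        self._generator.initialize()

    async def find_similar_words(self, theme: str, difficulty: str, max_words: int = 15) -> list:
        tier = self.DIFFICULTY_TO_TIER.get(difficulty)
        results = self._generator.generate_thematic_words(
            topic=theme, num_words=max_words, difficulty_tier=tier
        )
        # Drop similarity/tier metadata to match the word-list return shape.
        return [word for word, _similarity, _tier in results]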

Usage

For Hack Tools

from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science", 
    num_words=10,
    difficulty_tier="tier_5_common"  # Optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")

For Backend Integration

Option 1: Replace VectorSearchService

# In crossword_generator.py
from .unified_word_service import create_unified_word_service

# Initialize
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)

Option 2: Direct Usage

from .unified_word_service import UnifiedWordService

service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()

# Compatible with existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)

Performance Improvements

Memory Usage

  • Before: 3 separate vocabularies + embeddings (~500MB+)
  • After: Single vocabulary + embeddings (~200MB)
  • Reduction: ~60% lower memory usage

Vocabulary Coverage

  • Before: Limited to ~8-12K words from model tokenizer
  • After: Up to 100K+ filtered words from WordFreq database
  • Improvement: 10x+ vocabulary coverage

Consistency

  • Before: Different words available in hack tools vs backend
  • After: Same comprehensive vocabulary across all components
  • Benefit: Consistent word quality and availability

Configuration

Environment Variables

  • MAX_VOCABULARY_SIZE: Maximum vocabulary size (default: 100000)
  • EMBEDDING_MODEL: Model name (default: all-mpnet-base-v2)
  • WORD_SIMILARITY_THRESHOLD: Minimum similarity (default: 0.3)
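
A minimal sketch of reading these settings at startup; the variable names and defaults come from the list above, while the reading code itself is illustrative:

import os

MAX_VOCABULARY_SIZE = int(os.environ.get("MAX_VOCABULARY_SIZE", "100000"))
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "all-mpnet-base-v2")
WORD_SIMILARITY_THRESHOLD = float(os.environ.get("WORD_SIMILARITY_THRESHOLD", "0.3"))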

Vocabulary Size Options

  • Small (10K): Fast initialization, basic vocabulary
  • Medium (50K): Balanced performance and coverage
  • Large (100K): Comprehensive coverage, slower initialization
  • Full (319K): Complete WordFreq database, longest initialization

Migration Guide

For Existing Hack Tools

  1. Update imports: from thematic_word_generator import UnifiedThematicWordGenerator
  2. Replace ThematicWordGenerator with UnifiedThematicWordGenerator
  3. API remains compatible, but now uses the comprehensive WordFreq vocabulary (see the before/after snippet below)
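
A before/after snippet for that swap; the legacy constructor call is assumed to take no arguments, which may differ from the actual ThematicWordGenerator:

# Before (legacy NLTK-backed generator)
# from thematic_word_generator import ThematicWordGenerator
# generator = ThematicWordGenerator()

# After (unified WordFreq-backed generator)
from thematic_word_generator import UnifiedThematicWordGenerator

generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()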

For Backend Services

  1. Import: from .unified_word_service import UnifiedWordService
  2. Replace VectorSearchService initialization with UnifiedWordService (see the sketch after this list)
  3. All existing methods remain compatible
  4. Benefits: Better vocabulary coverage, consistent frequency data
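
A hypothetical before/after inside crossword_generator.py; the original VectorSearchService import path and initialization shown here are assumptions, not copied from the repository:

# Before
# from .vector_search_service import VectorSearchService
# vector_service = VectorSearchService()
# await vector_service.initialize()

# After
from .unified_word_service import UnifiedWordService

vector_service = UnifiedWordService(vocab_size_limit=100000)
await vector_service.initialize()

# Existing call sites keep working, for example:
words = await vector_service.find_similar_words("animal", "medium", max_words=15)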

Backwards Compatibility

  • All existing APIs maintained
  • Same method signatures and return formats
  • Gradual migration is possible; both systems can run in parallel

Benefits Summary

✅ Eliminates Redundancy: Single vocabulary source instead of 3 separate ones
✅ Improves Coverage: 100K+ words vs previous 8-12K words
✅ Reduces Memory: ~60% reduction in memory usage
✅ Ensures Consistency: Same vocabulary across hack tools and backend
✅ Maintains Performance: Smart caching and batch processing
✅ Preserves Features: 10-tier frequency classification, difficulty filtering
✅ Enables Growth: Easy to add new features with unified architecture

Cache Management

Cache Locations

  • Hack tools: hack/model_cache/
  • Backend: crossword-app/backend-py/cache/unified_generator/

Cache Files

  • unified_vocabulary_<size>.pkl: Filtered vocabulary
  • unified_frequencies_<size>.pkl: Frequency data
  • unified_embeddings_<model>_<size>.npy: Pre-computed embeddings

Cache Invalidation

Caches are automatically rebuilt if:

  • Vocabulary size limit changes
  • Embedding model changes
  • WordFreq database updates (rare)
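
Because the cache file names above encode the vocabulary size and embedding model, a changed size or model simply misses the old files and triggers a rebuild. An illustrative helper showing that idea (function names hypothetical):

from pathlib import Path

def cache_paths(cache_dir: str, model_name: str, vocab_size: int) -> dict:
    base = Path(cache_dir)
    return {
        "vocabulary": base / f"unified_vocabulary_{vocab_size}.pkl",
        "frequencies": base / f"unified_frequencies_{vocab_size}.pkl",
        "embeddings": base / f"unified_embeddings_{model_name}_{vocab_size}.npy",
    }

def cache_is_valid(cache_dir: str, model_name: str, vocab_size: int) -> bool:
    # If any expected file is missing (e.g. after a size or model change), rebuild.
    return all(path.exists() for path in cache_paths(cache_dir, model_name, vocab_size).values())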

Future Enhancements

  1. Semantic Clustering: Group words by semantic similarity
  2. Dynamic Difficulty: Real-time difficulty adjustment based on user performance
  3. Topic Expansion: Automatic topic discovery and expansion
  4. Multilingual Support: Extend to other languages using WordFreq
  5. Custom Vocabularies: Allow domain-specific vocabulary additions