# Vocabulary Optimization & Unification

## Problem Solved

Previously, the crossword system had vocabulary redundancy across three separate sources:

- SentenceTransformer Model Vocabulary: ~30K tokens → ~8-12K actual words after filtering
- NLTK Words Corpus: 41,998 words for embeddings in the thematic generator
- WordFreq Database: 319,938 words for frequency data

This created inconsistencies, wasted memory, and limited vocabulary coverage.

## Solution: Unified Architecture

### New Design
- Single Vocabulary Source: WordFreq database (319,938 words)
- Single Embedding Model: all-mpnet-base-v2 (generates embeddings for any text)
- Unified Filtering: Consistent crossword-suitable word filtering
- Shared Caching: Single vocabulary + embeddings + frequency cache
### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)

- Loads and filters the WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words; sketched below)
- Generates frequency data with a 10-tier classification
- Handles caching for performance
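
A minimal sketch of that filter, assuming a simple predicate (the function name and the exclusion list below are illustrative, not the actual implementation):

```python
# Illustrative stand-in for VocabularyManager's filter; the real logic
# lives in hack/thematic_word_generator.py.
BORING_WORDS = {"the", "and", "that", "with", "from"}  # hypothetical exclusion list

def is_crossword_suitable(word: str) -> bool:
    """Keep 3-12 character, purely alphabetic words that aren't filler."""
    return (
        3 <= len(word) <= 12
        and word.isalpha()
        and word.lower() not in BORING_WORDS
    )
```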
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)

- Uses the WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for the WordFreq words (see the sketch below)
- Maintains the 10-tier frequency classification system
- Provides both the hack tool API and a backend-compatible API
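
A sketch of the embedding step using the standard sentence-transformers API; the batch size and normalization flag are illustrative choices, not the generator's actual settings:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_vocabulary(words: list[str]) -> np.ndarray:
    """Encode WordFreq words with all-mpnet-base-v2, one 768-dim vector per word."""
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(
        words,
        batch_size=256,               # illustrative batch size
        convert_to_numpy=True,
        normalize_embeddings=True,    # unit vectors make cosine similarity a dot product
        show_progress_bar=True,
    )
```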
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)

- Bridge adapter for backend integration (see the sketch below)
- Compatible with the existing VectorSearchService interface
- Uses the comprehensive WordFreq vocabulary instead of the limited model vocabulary
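
A rough sketch of the adapter idea; apart from `find_similar_words` and `tier_5_common`, the method bodies and tier names here are assumptions for illustration, not the service's actual code:

```python
from thematic_word_generator import UnifiedThematicWordGenerator  # backend import path may differ

class UnifiedWordService:
    """Bridge adapter: exposes the VectorSearchService-style interface on top
    of UnifiedThematicWordGenerator. Internals are illustrative."""

    def __init__(self, vocab_size_limit: int = 100000):
        self._generator = UnifiedThematicWordGenerator(vocab_size_limit=vocab_size_limit)

    async def initialize(self) -> None:
        self._generator.initialize()

    async def find_similar_words(self, topic: str, difficulty: str, max_words: int = 15) -> list[str]:
        # Hypothetical difficulty -> tier mapping; the real mapping lives in the service.
        tier = {
            "easy": "tier_3_very_common",   # assumed tier names, except tier_5_common
            "medium": "tier_5_common",
            "hard": "tier_7_uncommon",
        }.get(difficulty)
        results = self._generator.generate_thematic_words(
            topic=topic, num_words=max_words, difficulty_tier=tier
        )
        return [word for word, _similarity, _tier in results]
```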
## Usage

### For Hack Tools
```python
from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with the desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration

#### Option 1: Replace VectorSearchService
```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service

# Initialize the drop-in replacement for VectorSearchService
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```
#### Option 2: Direct Usage
```python
from .unified_word_service import UnifiedWordService

service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()

# Compatible with the existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements

### Memory Usage

- Before: 3 separate vocabularies + embeddings (~500MB+)
- After: single vocabulary + embeddings (~200MB)
- Reduction: ~60% less memory

### Vocabulary Coverage

- Before: limited to ~8-12K words from the model tokenizer
- After: up to 100K+ filtered words from the WordFreq database
- Improvement: 10x+ vocabulary coverage

### Consistency

- Before: different words available in hack tools vs. the backend
- After: the same comprehensive vocabulary across all components
- Benefit: consistent word quality and availability
## Configuration

### Environment Variables

- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: `100000`)
- `EMBEDDING_MODEL`: Embedding model name (default: `all-mpnet-base-v2`)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity score (default: `0.3`)
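
A sketch of how these might be read at startup, falling back to the documented defaults (the actual services may parse them elsewhere):

```python
import os

# Unified-vocabulary settings with the documented defaults.
MAX_VOCABULARY_SIZE = int(os.getenv("MAX_VOCABULARY_SIZE", "100000"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-mpnet-base-v2")
WORD_SIMILARITY_THRESHOLD = float(os.getenv("WORD_SIMILARITY_THRESHOLD", "0.3"))
```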
### Vocabulary Size Options
- Small (10K): Fast initialization, basic vocabulary
- Medium (50K): Balanced performance and coverage
- Large (100K): Comprehensive coverage, slower initialization
- Full (319K): Complete WordFreq database, longest initialization
## Migration Guide

### For Existing Hack Tools

- Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
- Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator` (see the before/after sketch below)
- The API remains compatible, but now uses the comprehensive WordFreq vocabulary
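
In most scripts the swap is a small diff (a sketch, assuming the old class took no constructor arguments):

```python
# Before:
# from thematic_word_generator import ThematicWordGenerator
# generator = ThematicWordGenerator()

# After:
from thematic_word_generator import UnifiedThematicWordGenerator

generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()
```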
### For Backend Services

- Import: `from .unified_word_service import UnifiedWordService`
- Replace `VectorSearchService` initialization with `UnifiedWordService`; all existing methods remain compatible
- Benefits: better vocabulary coverage, consistent frequency data
## Backwards Compatibility

- All existing APIs maintained
- Same method signatures and return formats
- Gradual migration possible: both systems can run in parallel (see the sketch below)
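
A sketch of a parallel-run check during migration, assuming both services are initialized and `find_similar_words` returns a list of word strings:

```python
async def compare_services(legacy_service, unified_service, topic: str = "animal") -> None:
    """Run the same query against both services and report vocabulary overlap."""
    legacy_words = await legacy_service.find_similar_words(topic, "medium", max_words=15)
    unified_words = await unified_service.find_similar_words(topic, "medium", max_words=15)
    overlap = set(legacy_words) & set(unified_words)
    print(f"{topic}: {len(overlap)} shared of {len(unified_words)} unified results")
```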
## Benefits Summary

- ✅ Eliminates Redundancy: single vocabulary source instead of 3 separate ones
- ✅ Improves Coverage: 100K+ words vs. the previous 8-12K
- ✅ Reduces Memory: ~60% reduction in memory usage
- ✅ Ensures Consistency: same vocabulary across hack tools and backend
- ✅ Maintains Performance: smart caching and batch processing
- ✅ Preserves Features: 10-tier frequency classification, difficulty filtering
- ✅ Enables Growth: easy to add new features on the unified architecture
## Cache Management

### Cache Locations

- Hack tools: `hack/model_cache/`
- Backend: `crossword-app/backend-py/cache/unified_generator/`
### Cache Files

- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings
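
The parameters are baked into the file names, which is what makes invalidation automatic: a new size or model simply misses the old cache file. A sketch of the naming scheme above (`CACHE_DIR` and the helper name are illustrative):

```python
from pathlib import Path

CACHE_DIR = Path("hack/model_cache")  # backend uses crossword-app/backend-py/cache/unified_generator/

def embedding_cache_path(model: str, vocab_size: int) -> Path:
    """Build the embeddings cache path; changing the model or the size limit
    yields a different file name, forcing a rebuild."""
    return CACHE_DIR / f"unified_embeddings_{model}_{vocab_size}.npy"
```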
### Cache Invalidation
Caches are automatically rebuilt if:
- Vocabulary size limit changes
- Embedding model changes
- WordFreq database updates (rare)
## Future Enhancements
- Semantic Clustering: Group words by semantic similarity
- Dynamic Difficulty: Real-time difficulty adjustment based on user performance
- Topic Expansion: Automatic topic discovery and expansion
- Multilingual Support: Extend to other languages using WordFreq
- Custom Vocabularies: Allow domain-specific vocabulary additions