
Theme Handling in Thematic Word Generator

Overview

The Unified Thematic Word Generator supports two distinct modes of semantic processing:

  • Single Theme: Treats all inputs as contributing to one unified concept
  • Multi-Theme: Detects and processes multiple separate concepts using machine learning clustering

This document explains the technical differences, algorithms, and practical implications of each approach.

Triggering Logic

Automatic Detection

# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 1:  # clustering needs at least 2 inputs
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
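
For quick testing, the decision logic above can be written as a small standalone helper. This is a sketch rather than repo code, and resolve_mode is a hypothetical name:

def resolve_mode(inputs, multi_theme=False):
    """Return 'multi' or 'single' per the triggering logic above (sketch)."""
    auto_multi_theme = len(inputs) > 2  # auto-enable for 3+ inputs
    final_multi_theme = multi_theme or auto_multi_theme
    # Clustering needs at least two inputs, so a lone input stays single-theme
    return "multi" if final_multi_theme and len(inputs) > 1 else "single"

assert resolve_mode(["cats"]) == "single"
assert resolve_mode(["cats", "dogs"]) == "single"
assert resolve_mode(["cats", "dogs", "birds"]) == "multi"
assert resolve_mode(["cats", "dogs"], multi_theme=True) == "multi"  # manual override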

Trigger Conditions

  • Single Theme: 1-2 inputs, or manual override with multi_theme=False
  • Multi-Theme: 3+ inputs (automatic), or manual override with multi_theme=True on 2+ inputs

Examples

# Single theme (automatic)
generate_thematic_words("cats")                    # 1 input
generate_thematic_words(["cats", "dogs"])          # 2 inputs

# Multi-theme (automatic) 
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs β†’ auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme

Single Theme Processing

Algorithm: _compute_theme_vector(inputs)

Steps:

  1. Encode all inputs → Get sentence-transformer embeddings for each input
  2. Average embeddings → np.mean(input_embeddings, axis=0)
  3. Return single vector → One unified theme representation

Conceptual Approach

  • Treats all inputs as contributing to one unified concept
  • Creates a semantic centroid that represents the combined meaning
  • Finds words similar to the average meaning of all inputs
  • Results are coherent and focused around the unified theme

Example Process

inputs = ["cats", "dogs"]  # 2 inputs β†’ Single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]  
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] β†’ "pets/domestic animals" concept
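
The averaging step is plain NumPy. Below is a runnable version of the toy walk-through above; the 4-dimensional vectors are illustrative stand-ins, since real sentence-transformer embeddings have hundreds of dimensions:

import numpy as np

cats = np.array([0.2, 0.8, 0.1, 0.4])  # toy stand-in for the "cats" embedding
dogs = np.array([0.3, 0.7, 0.2, 0.5])  # toy stand-in for the "dogs" embedding

theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45] -> the unified "pets" centroid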

Use Cases

  • Related concepts: "science, research, study" → Academic/research words
  • Variations of the same thing: "cats, kittens, felines" → Cat-related words
  • Sentences: "I love furry animals" → Animal-loving context words
  • Semantic expansion: "ocean, water" → Marine/aquatic words

Multi-Theme Processing

Algorithm: _detect_multiple_themes(inputs, max_themes=3)

Steps:

  1. Encode all inputs → Get embeddings for each input
  2. Determine clusters → n_clusters = min(max_themes, len(inputs), 3)
  3. K-means clustering → Group semantically similar inputs together
  4. Extract cluster centers → Each cluster center becomes one theme vector
  5. Return multiple vectors → Multiple separate theme representations

Conceptual Approach

  • Treats inputs as potentially representing multiple different concepts
  • Uses machine learning clustering to automatically group related inputs
  • Finds words similar to each separate theme cluster
  • Results are diverse, covering multiple semantic areas

Example Process

inputs = ["science", "art", "cooking"]  # 3 inputs β†’ Multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]  
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" β†’ [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     β†’ [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" β†’ [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
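
The clustering step is standard scikit-learn. Here is a runnable sketch of the walk-through above, again with toy vectors standing in for real embeddings:

import numpy as np
from sklearn.cluster import KMeans

inputs = ["science", "art", "cooking"]
embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],  # toy "science" embedding
    [0.2, 0.9, 0.1, 0.4],  # toy "art" embedding
    [0.3, 0.2, 0.8, 0.5],  # toy "cooking" embedding
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(embeddings)
for word, label in zip(inputs, kmeans.labels_):
    print(f"{word} -> cluster {label}")  # each input lands in its own cluster
print(kmeans.cluster_centers_.shape)     # (3, 4): one theme vector per cluster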

Clustering Details

Cluster Count Logic:

n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3

Examples:

  • 2 inputs → 2 clusters (possible only via manual multi_theme=True)
  • 3 inputs → 3 clusters (each input potentially gets its own theme)
  • 4 inputs → 3 clusters (max_themes limit applies)
  • 5+ inputs → 3 clusters (max_themes limit applies)
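
The formula can be sanity-checked directly:

max_themes = 3
for n_inputs in (2, 3, 4, 5, 10):
    print(f"{n_inputs} inputs -> {min(max_themes, n_inputs, 3)} clusters")
# 2 -> 2, 3 -> 3, 4 -> 3, 5 -> 3, 10 -> 3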

K-means Parameters:

  • random_state=42: Ensures reproducible clustering results
  • n_init=10: Runs clustering 10 times with different initializations, picks best result

Use Cases

  • Diverse topics: "science, art, cooking" → Words from all three domains
  • Mixed contexts: "I love you, moonpie, chocolate" → Romance + food words
  • Broad exploration: "technology, nature, music" → Wide semantic coverage
  • Unrelated concepts: "politics, sports, weather" → Balanced representation

Word Generation Differences

Single Theme Word Generation

# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme

Characteristics:

  • Coherent results: All words relate to the unified concept
  • Focused semantic area: Words cluster around the average meaning
  • High thematic consistency: Strong semantic relationships between results

Multi-Theme Word Generation

# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes

Characteristics:

  • Diverse results: Words come from multiple separate concepts
  • Broader semantic coverage: Covers different conceptual areas
  • Balanced representation: Each theme contributes equally to final results
  • Higher variety: Less repetitive, more exploratory results

Practical Examples

Single Theme Examples

Example 1: Related Animals

inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept

Example 2: Academic Focus

inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept

Example 3: Sentence Input

inputs = ["I love furry animals"]  
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept

Multi-Theme Examples

Example 1: Diverse Domains

inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced

Example 2: Mixed Context

inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat  
# Themes: Romantic + Food concepts balanced

Example 3: Technology Exploration

inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced

Performance Characteristics

Single Theme Performance

  • Speed: Faster (one embedding average, one similarity calculation)
  • Memory: Lower (stores one theme vector)
  • Consistency: Higher (coherent semantic direction)
  • Best for: Focused exploration, related concepts, sentence inputs

Multi-Theme Performance

  • Speed: Slower (clustering computation, multiple similarity calculations)
  • Memory: Higher (stores multiple theme vectors)
  • Diversity: Higher (multiple semantic directions)
  • Best for: Broad exploration, unrelated concepts, diverse word discovery

Technical Implementation Details

Single Theme Code Path

# Module-level requirements: import numpy as np; from typing import List
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)

    return theme_vector.reshape(1, -1)
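
For experimenting outside the class, the same computation as a standalone sketch; the model name is an assumption, since this document does not specify which sentence-transformer the generator loads:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence-transformer works

def compute_theme_vector(inputs):
    embeddings = model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    return np.mean(embeddings, axis=0).reshape(1, -1)

print(compute_theme_vector(["cats", "dogs"]).shape)  # (1, 384) for this model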

Multi-Theme Code Path

# Module-level requirements: import numpy as np; from typing import List;
# from sklearn.cluster import KMeans
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)

    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]

Similarity Aggregation

# Requires: import numpy as np; from sklearn.metrics.pairwise import cosine_similarity

# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
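
This document does not show how the averaged scores become words; a plausible sketch is to rank the vocabulary by score and keep the top k (the generator's actual selection and filtering may differ):

top_k = 10
top_indices = np.argsort(all_similarities)[::-1][:top_k]  # highest scores first
top_words = [self.vocabulary[i] for i in top_indices]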

Usage Guidelines

When to Use Single Theme

  • 1-2 related inputs: Natural single theme territory
  • Sentence inputs: Coherent meaning in natural language
  • Focused exploration: Want words around one specific concept
  • Related concepts: Inputs that should blend together semantically
  • Performance priority: Need faster results

When to Use Multi-Theme (or Allow Auto-Detection)

  • 3+ diverse inputs: Let automatic detection handle it
  • Unrelated concepts: Want representation from all areas
  • Broad exploration: Seeking diverse word discovery
  • Balanced results: Need equal weight from different themes
  • Creative applications: Want unexpected combinations

Manual Override Cases

# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)

# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)  
# Result: Attempts to find different aspects of cats vs kittens

Interactive Mode Examples

Single Theme Interactive Commands

I love animals                    # Sentence → single theme
cats dogs                         # 2 words → single theme
science research                  # Related concepts → single theme

Multi-Theme Interactive Commands

cats, dogs, birds               # 3+ topics → auto multi-theme
science, art, cooking           # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate" # Mixed content → auto multi-theme
technology, nature, music 15    # With parameters → auto multi-theme

Manual Control

cats dogs multi                 # Force multi-theme on 2 inputs
"science, research, study"      # 3 inputs but could be single theme contextually

Summary

The single theme approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The multi-theme approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.

The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.