
Theme Handling in Thematic Word Generator

Overview

The Unified Thematic Word Generator supports two distinct modes of semantic processing:

  • Single Theme: Treats all inputs as contributing to one unified concept
  • Multi-Theme: Detects and processes multiple separate concepts using machine learning clustering

This document explains the technical differences, algorithms, and practical implications of each approach.

Triggering Logic

Automatic Detection

# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 1:  # clustering needs at least 2 inputs
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
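
For quick testing, the decision logic above can be written as a small standalone helper. This is a sketch rather than repo code, and resolve_mode is a hypothetical name:

def resolve_mode(inputs, multi_theme=False):
    """Return 'multi' or 'single' per the triggering logic above (sketch)."""
    auto_multi_theme = len(inputs) > 2  # auto-enable for 3+ inputs
    final_multi_theme = multi_theme or auto_multi_theme
    # Clustering needs at least two inputs, so a lone input stays single-theme
    return "multi" if final_multi_theme and len(inputs) > 1 else "single"

assert resolve_mode(["cats"]) == "single"
assert resolve_mode(["cats", "dogs"]) == "single"
assert resolve_mode(["cats", "dogs", "birds"]) == "multi"
assert resolve_mode(["cats", "dogs"], multi_theme=True) == "multi"  # manual override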

Trigger Conditions

  • Single Theme: 1-2 inputs, or manual override with multi_theme=False
  • Multi-Theme: 3+ inputs (automatic), or manual override with multi_theme=True on 2+ inputs

Examples

# Single theme (automatic)
generate_thematic_words("cats")                    # 1 input
generate_thematic_words(["cats", "dogs"])          # 2 inputs

# Multi-theme (automatic) 
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs β†’ auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme

Single Theme Processing

Algorithm: _compute_theme_vector(inputs)

Steps:

  1. Encode all inputs → Get sentence-transformer embeddings for each input
  2. Average embeddings → np.mean(input_embeddings, axis=0)
  3. Return single vector → One unified theme representation

Conceptual Approach

  • Treats all inputs as contributing to one unified concept
  • Creates a semantic centroid that represents the combined meaning
  • Finds words similar to the average meaning of all inputs
  • Results are coherent and focused around the unified theme

Example Process

inputs = ["cats", "dogs"]  # 2 inputs β†’ Single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]  
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] β†’ "pets/domestic animals" concept
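
The averaging step is plain NumPy. Below is a runnable version of the toy walk-through above; the 4-dimensional vectors are illustrative stand-ins, since real sentence-transformer embeddings have hundreds of dimensions:

import numpy as np

cats = np.array([0.2, 0.8, 0.1, 0.4])  # toy stand-in for the "cats" embedding
dogs = np.array([0.3, 0.7, 0.2, 0.5])  # toy stand-in for the "dogs" embedding

theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45] -> the unified "pets" centroid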

Use Cases

  • Related concepts: "science, research, study" → Academic/research words
  • Variations of the same thing: "cats, kittens, felines" → Cat-related words
  • Sentences: "I love furry animals" → Animal-loving context words
  • Semantic expansion: "ocean, water" → Marine/aquatic words

Multi-Theme Processing

Algorithm: _detect_multiple_themes(inputs, max_themes=3)

Steps:

  1. Encode all inputs → Get embeddings for each input
  2. Determine clusters → n_clusters = min(max_themes, len(inputs), 3)
  3. K-means clustering → Group semantically similar inputs together
  4. Extract cluster centers → Each cluster center becomes one theme vector
  5. Return multiple vectors → Multiple separate theme representations

Conceptual Approach

  • Treats inputs as potentially representing multiple different concepts
  • Uses machine learning clustering to automatically group related inputs
  • Finds words similar to each separate theme cluster
  • Results are diverse, covering multiple semantic areas

Example Process

inputs = ["science", "art", "cooking"]  # 3 inputs β†’ Multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]  
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" β†’ [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     β†’ [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" β†’ [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
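
The clustering step is standard scikit-learn. Here is a runnable sketch of the walk-through above, again with toy vectors standing in for real embeddings:

import numpy as np
from sklearn.cluster import KMeans

inputs = ["science", "art", "cooking"]
embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],  # toy "science" embedding
    [0.2, 0.9, 0.1, 0.4],  # toy "art" embedding
    [0.3, 0.2, 0.8, 0.5],  # toy "cooking" embedding
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(embeddings)
for word, label in zip(inputs, kmeans.labels_):
    print(f"{word} -> cluster {label}")  # each input lands in its own cluster
print(kmeans.cluster_centers_.shape)     # (3, 4): one theme vector per cluster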

Clustering Details

Cluster Count Logic:

n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3

Examples:

  • 2 inputs → 2 clusters (possible only via manual multi_theme=True)
  • 3 inputs → 3 clusters (each input potentially gets its own theme)
  • 4 inputs → 3 clusters (max_themes limit applies)
  • 5+ inputs → 3 clusters (max_themes limit applies)
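
The formula can be sanity-checked directly:

max_themes = 3
for n_inputs in (2, 3, 4, 5, 10):
    print(f"{n_inputs} inputs -> {min(max_themes, n_inputs, 3)} clusters")
# 2 -> 2, 3 -> 3, 4 -> 3, 5 -> 3, 10 -> 3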

K-means Parameters:

  • random_state=42: Ensures reproducible clustering results
  • n_init=10: Runs clustering 10 times with different initializations, picks best result

Use Cases

  • Diverse topics: "science, art, cooking" → Words from all three domains
  • Mixed contexts: "I love you, moonpie, chocolate" → Romance + food words
  • Broad exploration: "technology, nature, music" → Wide semantic coverage
  • Unrelated concepts: "politics, sports, weather" → Balanced representation

Word Generation Differences

Single Theme Word Generation

# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme

Characteristics:

  • Coherent results: All words relate to the unified concept
  • Focused semantic area: Words cluster around the average meaning
  • High thematic consistency: Strong semantic relationships between results

Multi-Theme Word Generation

# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes

Characteristics:

  • Diverse results: Words come from multiple separate concepts
  • Broader semantic coverage: Covers different conceptual areas
  • Balanced representation: Each theme contributes equally to final results
  • Higher variety: Less repetitive, more exploratory results

Practical Examples

Single Theme Examples

Example 1: Related Animals

inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept

Example 2: Academic Focus

inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept

Example 3: Sentence Input

inputs = ["I love furry animals"]  
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept

Multi-Theme Examples

Example 1: Diverse Domains

inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced

Example 2: Mixed Context

inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat  
# Themes: Romantic + Food concepts balanced

Example 3: Technology Exploration

inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced

Performance Characteristics

Single Theme Performance

  • Speed: Faster (one embedding average, one similarity calculation)
  • Memory: Lower (stores one theme vector)
  • Consistency: Higher (coherent semantic direction)
  • Best for: Focused exploration, related concepts, sentence inputs

Multi-Theme Performance

  • Speed: Slower (clustering computation, multiple similarity calculations)
  • Memory: Higher (stores multiple theme vectors)
  • Diversity: Higher (multiple semantic directions)
  • Best for: Broad exploration, unrelated concepts, diverse word discovery

Technical Implementation Details

Single Theme Code Path

# Module-level requirements: import numpy as np; from typing import List
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)

    return theme_vector.reshape(1, -1)
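
For experimenting outside the class, the same computation as a standalone sketch; the model name is an assumption, since this document does not specify which sentence-transformer the generator loads:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence-transformer works

def compute_theme_vector(inputs):
    embeddings = model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    return np.mean(embeddings, axis=0).reshape(1, -1)

print(compute_theme_vector(["cats", "dogs"]).shape)  # (1, 384) for this model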

Multi-Theme Code Path

# Module-level requirements: import numpy as np; from typing import List;
# from sklearn.cluster import KMeans
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)

    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]

Similarity Aggregation

# Requires: import numpy as np; from sklearn.metrics.pairwise import cosine_similarity

# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
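
This document does not show how the averaged scores become words; a plausible sketch is to rank the vocabulary by score and keep the top k (the generator's actual selection and filtering may differ):

top_k = 10
top_indices = np.argsort(all_similarities)[::-1][:top_k]  # highest scores first
top_words = [self.vocabulary[i] for i in top_indices]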

Usage Guidelines

When to Use Single Theme

  • 1-2 related inputs: Natural single theme territory
  • Sentence inputs: Coherent meaning in natural language
  • Focused exploration: Want words around one specific concept
  • Related concepts: Inputs that should blend together semantically
  • Performance priority: Need faster results

When to Use Multi-Theme (or Allow Auto-Detection)

  • 3+ diverse inputs: Let automatic detection handle it
  • Unrelated concepts: Want representation from all areas
  • Broad exploration: Seeking diverse word discovery
  • Balanced results: Need equal weight from different themes
  • Creative applications: Want unexpected combinations

Manual Override Cases

# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)

# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)  
# Result: Attempts to find different aspects of cats vs kittens

Interactive Mode Examples

Single Theme Interactive Commands

I love animals                    # Sentence → single theme
cats dogs                         # 2 words → single theme
science research                  # Related concepts → single theme

Multi-Theme Interactive Commands

cats, dogs, birds               # 3+ topics → auto multi-theme
science, art, cooking           # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate" # Mixed content → auto multi-theme
technology, nature, music 15    # With parameters → auto multi-theme

Manual Control

cats dogs multi                 # Force multi-theme on 2 inputs
"science, research, study"      # 3 inputs but could be single theme contextually

Summary

The single theme approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The multi-theme approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.

The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.