Theme Handling in Thematic Word Generator
Overview
The Unified Thematic Word Generator supports two distinct modes of semantic processing:
- Single Theme: Treats all inputs as contributing to one unified concept
- Multi-Theme: Detects and processes multiple separate concepts using machine learning clustering
This document explains the technical differences, algorithms, and practical implications of each approach.
Triggering Logic
Automatic Detection
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 2:
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single-theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
Trigger Conditions
- Single Theme: 1-2 inputs OR manual override with multi_theme=False
- Multi-Theme: 3+ inputs (automatic) OR manual override with multi_theme=True
Examples
# Single theme (automatic)
generate_thematic_words("cats")                     # 1 input
generate_thematic_words(["cats", "dogs"])           # 2 inputs

# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"])  # 3 inputs → auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False)  # Force single theme
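The automatic part of this rule can be exercised in isolation. The sketch below mirrors the snippet above; `resolve_mode` is an illustrative name, not part of the generator's API, and the manual-override paths depend on the full implementation, so only the automatic cases are shown.

```python
# Hypothetical standalone mirror of the automatic trigger rule;
# resolve_mode is an illustrative name, not the generator's API.
def resolve_mode(clean_inputs, multi_theme=False):
    """Return 'multi' or 'single' following the snippet's logic."""
    auto_multi_theme = len(clean_inputs) > 2  # auto-enable for 3+ inputs
    final_multi_theme = multi_theme or auto_multi_theme
    if final_multi_theme and len(clean_inputs) > 2:
        return "multi"
    return "single"

print(resolve_mode(["cats"]))                   # single
print(resolve_mode(["cats", "dogs"]))           # single
print(resolve_mode(["cats", "dogs", "birds"]))  # multi
```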
Single Theme Processing
Algorithm: _compute_theme_vector(inputs)
Steps:
- Encode all inputs → Get sentence-transformer embeddings for each input
- Average embeddings → np.mean(input_embeddings, axis=0)
- Return single vector → One unified theme representation
Conceptual Approach
- Treats all inputs as contributing to one unified concept
- Creates a semantic centroid that represents the combined meaning
- Finds words similar to the average meaning of all inputs
- Results are coherent and focused around the unified theme
Example Process
inputs = ["cats", "dogs"]  # 2 inputs → single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
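The arithmetic in those comments can be checked directly with NumPy. The 4-dimensional vectors below are the toy values from the example, not real embeddings (real sentence-transformer vectors have hundreds of dimensions).

```python
import numpy as np

# Toy 4-dim vectors from the example above, not real embeddings.
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

# Element-wise average, as in np.mean(input_embeddings, axis=0)
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45]
```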
Use Cases
- Related concepts: "science, research, study" → Academic/research words
- Variations of same thing: "cats, kittens, felines" → Cat-related words
- Sentences: "I love furry animals" → Animal-loving context words
- Semantic expansion: "ocean, water" → Marine/aquatic words
Multi-Theme Processing
Algorithm: _detect_multiple_themes(inputs, max_themes=3)
Steps:
- Encode all inputs → Get embeddings for each input
- Determine clusters → n_clusters = min(max_themes, len(inputs), 3)
- K-means clustering → Group semantically similar inputs together
- Extract cluster centers → Each cluster center becomes one theme vector
- Return multiple vectors → Multiple separate theme representations
Conceptual Approach
- Treats inputs as potentially representing multiple different concepts
- Uses machine learning clustering to automatically group related inputs
- Finds words similar to each separate theme cluster
- Results are diverse, covering multiple semantic areas
Example Process
inputs = ["science", "art", "cooking"]  # 3 inputs → multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
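This clustering step can be reproduced on the toy vectors with scikit-learn. Note that with k equal to the number of points, every input ends up in its own cluster, so the centers coincide with the inputs themselves; the 4-dim vectors are illustrative values, not real embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 4-dim embeddings from the example above (not real embeddings).
embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],  # "science"
    [0.2, 0.9, 0.1, 0.4],  # "art"
    [0.3, 0.2, 0.8, 0.5],  # "cooking"
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(embeddings)

# Each cluster center becomes one theme vector.
theme_vectors = [c.reshape(1, -1) for c in kmeans.cluster_centers_]

# With three points and k=3, each point is its own cluster, so each
# center equals one of the input vectors (order may vary).
print(len(theme_vectors))  # 3
```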
Clustering Details
Cluster Count Logic:
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
Examples:
- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
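The cap can be verified with a one-line helper (`cluster_count` is an illustrative name, not part of the generator):

```python
def cluster_count(n_inputs, max_themes=3):
    # Illustrative mirror of n_clusters = min(max_themes, len(inputs), 3)
    return min(max_themes, n_inputs, 3)

print([cluster_count(n) for n in (1, 2, 3, 4, 5, 6)])  # [1, 2, 3, 3, 3, 3]
```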
K-means Parameters:
- random_state=42: Ensures reproducible clustering results
- n_init=10: Runs clustering 10 times with different initializations, picks the best result
Use Cases
- Diverse topics: "science, art, cooking" → Words from all three domains
- Mixed contexts: "I love you, moonpie, chocolate" → Romance + food words
- Broad exploration: "technology, nature, music" → Wide semantic coverage
- Unrelated concepts: "politics, sports, weather" → Balanced representation
Word Generation Differences
Single Theme Word Generation
# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme
Characteristics:
- Coherent results: All words relate to the unified concept
- Focused semantic area: Words cluster around the average meaning
- High thematic consistency: Strong semantic relationships between results
Multi-Theme Word Generation
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes
Characteristics:
- Diverse results: Words come from multiple separate concepts
- Broader semantic coverage: Covers different conceptual areas
- Balanced representation: Each theme contributes equally to final results
- Higher variety: Less repetitive, more exploratory results
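The averaging of per-theme similarities can be seen on a toy vocabulary. The 2-dim vectors below are fabricated for illustration only: two "pure" theme words and one word that sits between both themes.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Fabricated 2-dim vocabulary embeddings (illustration only).
vocab_embeddings = np.array([
    [1.0, 0.0],  # word aligned with theme 1
    [0.0, 1.0],  # word aligned with theme 2
    [0.7, 0.7],  # word between both themes
])
theme_vectors = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]

all_similarities = np.zeros(len(vocab_embeddings))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # average across themes

# The in-between word scores highest: it is moderately similar to both
# themes, while each pure word matches only one.
print(all_similarities.argmax())  # 2
```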
Practical Examples
Single Theme Examples
Example 1: Related Animals
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
Example 2: Academic Focus
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
Example 3: Sentence Input
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
Multi-Theme Examples
Example 1: Diverse Domains
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
Example 2: Mixed Context
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
Example 3: Technology Exploration
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
Performance Characteristics
Single Theme Performance
- Speed: Faster (one embedding average, one similarity calculation)
- Memory: Lower (stores one theme vector)
- Consistency: Higher (coherent semantic direction)
- Best for: Focused exploration, related concepts, sentence inputs
Multi-Theme Performance
- Speed: Slower (clustering computation, multiple similarity calculations)
- Memory: Higher (stores multiple theme vectors)
- Diversity: Higher (multiple semantic directions)
- Best for: Broad exploration, unrelated concepts, diverse word discovery
Technical Implementation Details
Single Theme Code Path
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False,
                                         show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)
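The method can be run standalone by stubbing out the encoder. `StubModel` below is a made-up stand-in for the sentence-transformer model, and `compute_theme_vector` is a module-level copy of the method for illustration.

```python
import numpy as np

class StubModel:
    """Made-up stand-in for the sentence-transformer model."""
    def encode(self, inputs, convert_to_tensor=False, show_progress_bar=False):
        # Deterministic fake embeddings: one 4-dim vector per input.
        rng = np.random.default_rng(0)
        return rng.random((len(inputs), 4))

def compute_theme_vector(model, inputs):
    # Module-level copy of _compute_theme_vector, for illustration.
    input_embeddings = model.encode(inputs, convert_to_tensor=False,
                                    show_progress_bar=False)
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)

vec = compute_theme_vector(StubModel(), ["cats", "dogs"])
print(vec.shape)  # (1, 4)
```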
Multi-Theme Code Path
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False,
                                         show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
Similarity Aggregation
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
Usage Guidelines
When to Use Single Theme
- 1-2 related inputs: Natural single theme territory
- Sentence inputs: Coherent meaning in natural language
- Focused exploration: Want words around one specific concept
- Related concepts: Inputs that should blend together semantically
- Performance priority: Need faster results
When to Use Multi-Theme (or Allow Auto-Detection)
- 3+ diverse inputs: Let automatic detection handle it
- Unrelated concepts: Want representation from all areas
- Broad exploration: Seeking diverse word discovery
- Balanced results: Need equal weight from different themes
- Creative applications: Want unexpected combinations
Manual Override Cases
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)
# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
Interactive Mode Examples
Single Theme Interactive Commands
I love animals      # Sentence → single theme
cats dogs           # 2 words → single theme
science research    # Related concepts → single theme
Multi-Theme Interactive Commands
cats, dogs, birds                 # 3+ topics → auto multi-theme
science, art, cooking             # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"  # Mixed content → auto multi-theme
technology, nature, music 15      # With parameters → auto multi-theme
Manual Control
cats dogs multi # Force multi-theme on 2 inputs
"science, research, study" # 3 inputs but could be single theme contextually
Summary
The single theme approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The multi-theme approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.
The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.