# Theme Handling in Thematic Word Generator

## Overview

The Unified Thematic Word Generator supports two distinct modes of semantic processing:

- **Single Theme**: Treats all inputs as contributing to one unified concept
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering

This document explains the technical differences, algorithms, and practical implications of each approach.
## Triggering Logic

### Automatic Detection

```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 2:
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```
### Trigger Conditions

- **Single Theme**: 1-2 inputs OR manual override with `multi_theme=False`
- **Multi-Theme**: 3+ inputs (automatic) OR manual override with `multi_theme=True`
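The trigger conditions can be condensed into a small helper (hypothetical name `resolve_multi_theme`; it mirrors the `final_multi_theme` logic from the snippet above, where the flag forces multi-theme on):

```python
def resolve_multi_theme(clean_inputs, multi_theme=False):
    """Return True when the multi-theme path should run.

    Hypothetical helper mirroring the trigger logic above:
    3+ inputs auto-enable multi-theme, and the flag can force it on.
    """
    auto_multi_theme = len(clean_inputs) > 2
    return multi_theme or auto_multi_theme
```

For example, `resolve_multi_theme(["cats", "dogs"])` is `False`, while `resolve_multi_theme(["cats", "dogs", "birds"])` is `True`.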
### Examples

```python
# Single theme (automatic)
generate_thematic_words("cats")                    # 1 input
generate_thematic_words(["cats", "dogs"])          # 2 inputs

# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs → auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False)  # Force single theme
```
## Single Theme Processing

### Algorithm: `_compute_theme_vector(inputs)`

**Steps:**

1. **Encode all inputs** → Get sentence-transformer embeddings for each input
2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`
3. **Return single vector** → One unified theme representation

### Conceptual Approach

- Treats all inputs as contributing to **one unified concept**
- Creates a **semantic centroid** that represents the combined meaning
- Finds words similar to the **average meaning** of all inputs
- Results are coherent and focused around the unified theme
### Example Process

```python
inputs = ["cats", "dogs"]  # 2 inputs → Single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
```
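The averaging step can be reproduced with plain NumPy. The toy 4-dimensional vectors below echo the comments above; real sentence-transformer embeddings have hundreds of dimensions:

```python
import numpy as np

# Illustrative 4-dimensional "embeddings" (values from the walkthrough above)
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

# Element-wise mean gives the semantic centroid
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45]
```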
### Use Cases

- **Related concepts**: "science, research, study" → Academic/research words
- **Variations of same thing**: "cats, kittens, felines" → Cat-related words
- **Sentences**: "I love furry animals" → Animal-loving context words
- **Semantic expansion**: "ocean, water" → Marine/aquatic words
## Multi-Theme Processing

### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)`

**Steps:**

1. **Encode all inputs** → Get embeddings for each input
2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`
3. **K-means clustering** → Group semantically similar inputs together
4. **Extract cluster centers** → Each cluster center becomes one theme vector
5. **Return multiple vectors** → Multiple separate theme representations

### Conceptual Approach

- Treats inputs as potentially representing **multiple different concepts**
- Uses **machine learning clustering** to automatically group related inputs
- Finds words similar to **each separate theme cluster**
- Results are diverse, covering multiple semantic areas
### Example Process

```python
inputs = ["science", "art", "cooking"]  # 3 inputs → Multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
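A minimal runnable sketch of the clustering step, using toy 2-D vectors (real embeddings are much higher-dimensional). Two inputs sit close together and one sits apart, so `k=2` merges the related pair into a single theme:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "embeddings" (illustrative values, not real model output)
embeddings = np.array([
    [1.0, 0.0],   # "science"
    [0.9, 0.1],   # "research" -- close to "science"
    [0.0, 1.0],   # "art"      -- far from both
])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(embeddings)

# Each cluster center becomes one theme vector
theme_vectors = [c.reshape(1, -1) for c in kmeans.cluster_centers_]
# One center is "art" itself; the other is the mean of
# "science"/"research": [0.95, 0.05]
```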
### Clustering Details

**Cluster Count Logic:**

```python
# max_themes defaults to 3
n_clusters = min(max_themes, len(inputs), 3)
```

**Examples:**

- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
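These examples follow directly from the `min(...)` rule; a one-line helper (hypothetical name, not part of the generator's API) makes the behavior easy to check:

```python
def cluster_count(n_inputs, max_themes=3):
    """Hypothetical helper reproducing the cluster-count rule above."""
    return min(max_themes, n_inputs, 3)

# 2 inputs keep 2 clusters; 3 or more inputs cap at 3
counts = [cluster_count(n) for n in (2, 3, 4, 5, 6)]
print(counts)  # [2, 3, 3, 3, 3]
```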
**K-means Parameters:**

- `random_state=42`: Ensures reproducible clustering results
- `n_init=10`: Runs clustering 10 times with different initializations, picks the best result
### Use Cases

- **Diverse topics**: "science, art, cooking" → Words from all three domains
- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words
- **Broad exploration**: "technology, nature, music" → Wide semantic coverage
- **Unrelated concepts**: "politics, sports, weather" → Balanced representation
## Word Generation Differences

### Single Theme Word Generation

```python
# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme
```

**Characteristics:**

- **Coherent results**: All words relate to the unified concept
- **Focused semantic area**: Words cluster around the average meaning
- **High thematic consistency**: Strong semantic relationships between results
### Multi-Theme Word Generation

```python
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes
```

**Characteristics:**

- **Diverse results**: Words come from multiple separate concepts
- **Broader semantic coverage**: Covers different conceptual areas
- **Balanced representation**: Each theme contributes equally to final results
- **Higher variety**: Less repetitive, more exploratory results
## Practical Examples

### Single Theme Examples

#### Example 1: Related Animals

```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```

#### Example 2: Academic Focus

```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```

#### Example 3: Sentence Input

```python
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```
### Multi-Theme Examples

#### Example 1: Diverse Domains

```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```

#### Example 2: Mixed Context

```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
```

#### Example 3: Technology Exploration

```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```
## Performance Characteristics

### Single Theme Performance

- **Speed**: Faster (one embedding average, one similarity calculation)
- **Memory**: Lower (stores one theme vector)
- **Consistency**: Higher (coherent semantic direction)
- **Best for**: Focused exploration, related concepts, sentence inputs

### Multi-Theme Performance

- **Speed**: Slower (clustering computation, multiple similarity calculations)
- **Memory**: Higher (stores multiple theme vectors)
- **Diversity**: Higher (multiple semantic directions)
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery
## Technical Implementation Details

### Single Theme Code Path

```python
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)
```
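This path can be exercised without loading a real model by swapping in a stub encoder. The stub below is entirely hypothetical (the real code uses a sentence-transformers model); it only demonstrates the centroid shape the method returns:

```python
from typing import List
import numpy as np

class _StubModel:
    """Hypothetical stand-in for a sentence-transformers model."""
    def encode(self, inputs, convert_to_tensor=False, show_progress_bar=False):
        # Deterministic fake embeddings: one 4-dim vector per input
        rng = np.random.default_rng(0)
        return rng.random((len(inputs), 4))

class _Demo:
    def __init__(self):
        self.model = _StubModel()

    def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
        input_embeddings = self.model.encode(
            inputs, convert_to_tensor=False, show_progress_bar=False)
        theme_vector = np.mean(input_embeddings, axis=0)
        return theme_vector.reshape(1, -1)

vec = _Demo()._compute_theme_vector(["cats", "dogs"])
print(vec.shape)  # (1, 4) -- a single row vector, ready for cosine_similarity
```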
### Multi-Theme Code Path

```python
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```
### Similarity Aggregation

```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
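A self-contained sketch of this aggregation with toy 2-D vectors (illustrative values): a word lying between two themes ends up with the highest averaged similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy vocabulary embeddings (one row per word) and two theme vectors
vocab_embeddings = np.array([
    [1.0, 0.0],   # word A: aligned with theme 1 only
    [0.0, 1.0],   # word B: aligned with theme 2 only
    [0.7, 0.7],   # word C: between both themes
])
theme_vectors = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]

all_similarities = np.zeros(len(vocab_embeddings))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)

# Word C scores highest on average (~0.707 vs 0.5 for A and B)
print(np.round(all_similarities, 3))
```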
## Usage Guidelines

### When to Use Single Theme

- **1-2 related inputs**: Natural single-theme territory
- **Sentence inputs**: Coherent meaning in natural language
- **Focused exploration**: Want words around one specific concept
- **Related concepts**: Inputs that should blend together semantically
- **Performance priority**: Need faster results

### When to Use Multi-Theme (or Allow Auto-Detection)

- **3+ diverse inputs**: Let automatic detection handle it
- **Unrelated concepts**: Want representation from all areas
- **Broad exploration**: Seeking diverse word discovery
- **Balanced results**: Need equal weight from different themes
- **Creative applications**: Want unexpected combinations
### Manual Override Cases

```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)

# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
```
## Interactive Mode Examples

### Single Theme Interactive Commands

```bash
I love animals     # Sentence → single theme
cats dogs          # 2 words → single theme
science research   # Related concepts → single theme
```

### Multi-Theme Interactive Commands

```bash
cats, dogs, birds                  # 3+ topics → auto multi-theme
science, art, cooking              # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"   # Mixed content → auto multi-theme
technology, nature, music 15       # With parameters → auto multi-theme
```
### Manual Control

```bash
cats dogs multi              # Force multi-theme on 2 inputs
"science, research, study"   # 3 inputs but could be single theme contextually
```
## Summary

The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, making it well suited to focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, making it ideal for broad exploration and diverse word discovery.

The automatic detection rule (3+ inputs → multi-theme) provides a sensible default while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.