# Theme Handling in Thematic Word Generator
## Overview
The Unified Thematic Word Generator supports two distinct modes of semantic processing:
- **Single Theme**: Treats all inputs as contributing to one unified concept
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering
This document explains the technical differences, algorithms, and practical implications of each approach.
## Triggering Logic
### Automatic Detection
```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme
if final_multi_theme and len(clean_inputs) > 2:
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```
### Trigger Conditions
- **Single Theme**: 1-2 inputs OR manual override with `multi_theme=False`
- **Multi-Theme**: 3+ inputs (automatic) OR manual override with `multi_theme=True`
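The trigger conditions above can be sketched as a small standalone function (the name `use_multi_theme` is illustrative, not from the source):

```python
# Hypothetical sketch of the trigger logic described above.
def use_multi_theme(inputs, multi_theme=False):
    """Return True when multi-theme processing should run."""
    auto_multi_theme = len(inputs) > 2  # 3+ inputs trigger automatically
    return multi_theme or auto_multi_theme
```

A manual `multi_theme=True` always wins; otherwise the input count decides.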
### Examples
```python
# Single theme (automatic)
generate_thematic_words("cats") # 1 input
generate_thematic_words(["cats", "dogs"]) # 2 inputs
# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs → auto multi-theme
# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True) # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme
```
## Single Theme Processing
### Algorithm: `_compute_theme_vector(inputs)`
**Steps:**
1. **Encode all inputs** → Get sentence-transformer embeddings for each input
2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`
3. **Return single vector** → One unified theme representation
### Conceptual Approach
- Treats all inputs as contributing to **one unified concept**
- Creates a **semantic centroid** that represents the combined meaning
- Finds words similar to the **average meaning** of all inputs
- Results are coherent and focused around the unified theme
### Example Process
```python
inputs = ["cats", "dogs"]  # 2 inputs → Single theme
# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
```
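The averaging step in the comments above can be reproduced directly with numpy, using the same made-up 4-dimensional embedding values (real sentence-transformer embeddings have hundreds of dimensions):

```python
import numpy as np

# Toy embeddings from the worked example above (illustrative values only).
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

# Element-wise average produces the semantic centroid.
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45]
```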
### Use Cases
- **Related concepts**: "science, research, study" → Academic/research words
- **Variations of same thing**: "cats, kittens, felines" → Cat-related words
- **Sentences**: "I love furry animals" → Animal-loving context words
- **Semantic expansion**: "ocean, water" → Marine/aquatic words
## Multi-Theme Processing
### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)`
**Steps:**
1. **Encode all inputs** → Get embeddings for each input
2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`
3. **K-means clustering** → Group semantically similar inputs together
4. **Extract cluster centers** → Each cluster center becomes one theme vector
5. **Return multiple vectors** → Multiple separate theme representations
### Conceptual Approach
- Treats inputs as potentially representing **multiple different concepts**
- Uses **machine learning clustering** to automatically group related inputs
- Finds words similar to **each separate theme cluster**
- Results are diverse, covering multiple semantic areas
### Example Process
```python
inputs = ["science", "art", "cooking"]  # 3 inputs → Multi-theme
# Processing:
# 1. Get embeddings for all three words:
# "science": [0.8, 0.1, 0.2, 0.3, ...]
# "art": [0.2, 0.9, 0.1, 0.4, ...]
# "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
### Clustering Details
**Cluster Count Logic:**
```python
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
```
**Examples:**
- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
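The cluster-count rule can be expressed as a tiny helper (name `cluster_count` is illustrative; it mirrors the `min(...)` expression above):

```python
# Illustrative sketch of the cluster-count logic, assuming max_themes defaults to 3.
def cluster_count(n_inputs, max_themes=3):
    """Number of K-means clusters for a given input count."""
    return min(max_themes, n_inputs, 3)
```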
**K-means Parameters:**
- `random_state=42`: Ensures reproducible clustering results
- `n_init=10`: Runs clustering 10 times with different initializations, picks best result
### Use Cases
- **Diverse topics**: "science, art, cooking" → Words from all three domains
- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words
- **Broad exploration**: "technology, nature, music" → Wide semantic coverage
- **Unrelated concepts**: "politics, sports, weather" → Balanced representation
## Word Generation Differences
### Single Theme Word Generation
```python
# Only one theme vector
theme_vectors = [single_theme_vector] # Length = 1
# Similarity calculation (runs once)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)
# Result: All words are similar to the unified theme
```
**Characteristics:**
- **Coherent results**: All words relate to the unified concept
- **Focused semantic area**: Words cluster around the average meaning
- **High thematic consistency**: Strong semantic relationships between results
### Multi-Theme Word Generation
```python
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3] # Length = 3
# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)
# Result: Words similar to any of the themes, averaged across all themes
```
**Characteristics:**
- **Diverse results**: Words come from multiple separate concepts
- **Broader semantic coverage**: Covers different conceptual areas
- **Balanced representation**: Each theme contributes equally to final results
- **Higher variety**: Less repetitive, more exploratory results
## Practical Examples
### Single Theme Examples
#### Example 1: Related Animals
```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```
#### Example 2: Academic Focus
```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```
#### Example 3: Sentence Input
```python
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```
### Multi-Theme Examples
#### Example 1: Diverse Domains
```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```
#### Example 2: Mixed Context
```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
```
#### Example 3: Technology Exploration
```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```
## Performance Characteristics
### Single Theme Performance
- **Speed**: Faster (one embedding average, one similarity calculation)
- **Memory**: Lower (stores one theme vector)
- **Consistency**: Higher (coherent semantic direction)
- **Best for**: Focused exploration, related concepts, sentence inputs
### Multi-Theme Performance
- **Speed**: Slower (clustering computation, multiple similarity calculations)
- **Memory**: Higher (stores multiple theme vectors)
- **Diversity**: Higher (multiple semantic directions)
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery
## Technical Implementation Details
### Single Theme Code Path
```python
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)
```
### Multi-Theme Code Path
```python
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]
    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)
    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```
### Similarity Aggregation
```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))
for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
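A self-contained toy run of this aggregation, using plain numpy in place of sklearn's `cosine_similarity` and made-up 3-dimensional embeddings (all names and values below are illustrative):

```python
import numpy as np

def cosine(u, v):
    """Plain-numpy cosine similarity (stand-in for sklearn's cosine_similarity)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two made-up theme vectors and a two-word vocabulary.
theme_vectors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
vocab = {"alpha": np.array([1.0, 0.0, 0.0]), "beta": np.array([0.0, 0.0, 1.0])}

scores = {}
for word, emb in vocab.items():
    # Average the word's similarity across all themes, as in the loop above.
    scores[word] = sum(cosine(t, emb) for t in theme_vectors) / len(theme_vectors)

# "alpha" matches theme 1 exactly → average 0.5; "beta" matches neither → 0.0
```

Because each theme's contribution is divided by the theme count, a word close to just one theme still scores well, which is what produces the balanced, diverse results described above.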
## Usage Guidelines
### When to Use Single Theme
- **1-2 related inputs**: Natural single theme territory
- **Sentence inputs**: Coherent meaning in natural language
- **Focused exploration**: Want words around one specific concept
- **Related concepts**: Inputs that should blend together semantically
- **Performance priority**: Need faster results
### When to Use Multi-Theme (or Allow Auto-Detection)
- **3+ diverse inputs**: Let automatic detection handle it
- **Unrelated concepts**: Want representation from all areas
- **Broad exploration**: Seeking diverse word discovery
- **Balanced results**: Need equal weight from different themes
- **Creative applications**: Want unexpected combinations
### Manual Override Cases
```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)
# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
```
## Interactive Mode Examples
### Single Theme Interactive Commands
```bash
I love animals          # Sentence → single theme
cats dogs               # 2 words → single theme
science research        # Related concepts → single theme
```
### Multi-Theme Interactive Commands
```bash
cats, dogs, birds                 # 3+ topics → auto multi-theme
science, art, cooking             # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"  # Mixed content → auto multi-theme
technology, nature, music 15      # With parameters → auto multi-theme
```
### Manual Control
```bash
cats dogs multi # Force multi-theme on 2 inputs
"science, research, study" # 3 inputs but could be single theme contextually
```
## Summary
The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.
The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.