# Theme Handling in Thematic Word Generator

## Overview

The Unified Thematic Word Generator supports two distinct modes of semantic processing:

- **Single Theme**: Treats all inputs as contributing to one unified concept
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering

This document explains the technical differences, algorithms, and practical implications of each approach.

## Triggering Logic

### Automatic Detection
```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 1:  # clustering needs at least 2 inputs
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path  
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```

### Trigger Conditions
- **Single Theme**: 1-2 inputs OR manual override with `multi_theme=False`
- **Multi-Theme**: 3+ inputs (automatic) OR manual override with `multi_theme=True`
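
These conditions can be captured in a small standalone sketch (the function name `choose_mode` is illustrative, not part of the generator's API):

```python
def choose_mode(inputs, multi_theme=False):
    """Mirror the trigger conditions: 3+ inputs, or an explicit override."""
    auto_multi_theme = len(inputs) > 2
    return "multi" if (multi_theme or auto_multi_theme) else "single"

print(choose_mode(["cats"]))                            # single
print(choose_mode(["cats", "dogs", "birds"]))           # multi
print(choose_mode(["cats", "dogs"], multi_theme=True))  # multi
```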

### Examples
```python
# Single theme (automatic)
generate_thematic_words("cats")                    # 1 input
generate_thematic_words(["cats", "dogs"])          # 2 inputs

# Multi-theme (automatic) 
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs → auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme
```

## Single Theme Processing

### Algorithm: `_compute_theme_vector(inputs)`

**Steps:**
1. **Encode all inputs** → Get sentence-transformer embeddings for each input
2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`
3. **Return single vector** → One unified theme representation

### Conceptual Approach
- Treats all inputs as contributing to **one unified concept**
- Creates a **semantic centroid** that represents the combined meaning
- Finds words similar to the **average meaning** of all inputs
- Results are coherent and focused around the unified theme

### Example Process
```python
inputs = ["cats", "dogs"]  # 2 inputs → Single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]  
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
```
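
The averaging in the comments above is plain NumPy, and can be run as-is (the four-dimensional vectors are illustrative; real sentence-transformer embeddings have several hundred dimensions):

```python
import numpy as np

cats = np.array([0.2, 0.8, 0.1, 0.4])  # toy embedding for "cats"
dogs = np.array([0.3, 0.7, 0.2, 0.5])  # toy embedding for "dogs"

# Semantic centroid: elementwise mean of the input embeddings
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45]
```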

### Use Cases
- **Related concepts**: "science, research, study" → Academic/research words
- **Variations of the same thing**: "cats, kittens, felines" → Cat-related words
- **Sentences**: "I love furry animals" → Animal-loving context words
- **Semantic expansion**: "ocean, water" → Marine/aquatic words

## Multi-Theme Processing

### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)`

**Steps:**
1. **Encode all inputs** → Get embeddings for each input
2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`
3. **K-means clustering** → Group semantically similar inputs together
4. **Extract cluster centers** → Each cluster center becomes one theme vector
5. **Return multiple vectors** → Multiple separate theme representations

### Conceptual Approach
- Treats inputs as potentially representing **multiple different concepts**
- Uses **machine learning clustering** to automatically group related inputs
- Finds words similar to **each separate theme cluster**
- Results are diverse, covering multiple semantic areas

### Example Process
```python
inputs = ["science", "art", "cooking"]  # 3 inputs → Multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]  
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
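
The clustering step can be reproduced with scikit-learn on the toy vectors above. With three well-separated points and k=3, each cluster center coincides exactly with one input embedding:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 4-dimensional embeddings for "science", "art", "cooking"
input_embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],
    [0.2, 0.9, 0.1, 0.4],
    [0.3, 0.2, 0.8, 0.5],
])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(input_embeddings)

# Each cluster center becomes one theme vector
theme_vectors = [center.reshape(1, -1) for center in kmeans.cluster_centers_]
print(len(theme_vectors))  # 3
```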

### Clustering Details

**Cluster Count Logic:**
```python
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
```

**Examples:**
- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
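
The cap is easy to verify directly:

```python
def n_clusters(n_inputs, max_themes=3):
    # Never more clusters than inputs, and never more than 3
    return min(max_themes, n_inputs, 3)

print([n_clusters(n) for n in (1, 2, 3, 4, 5, 6)])  # [1, 2, 3, 3, 3, 3]
```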

**K-means Parameters:**
- `random_state=42`: Ensures reproducible clustering results
- `n_init=10`: Runs clustering 10 times with different initializations, picks best result

### Use Cases
- **Diverse topics**: "science, art, cooking" → Words from all three domains
- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words
- **Broad exploration**: "technology, nature, music" → Wide semantic coverage
- **Unrelated concepts**: "politics, sports, weather" → Balanced representation

## Word Generation Differences

### Single Theme Word Generation
```python
# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme
```

**Characteristics:**
- **Coherent results**: All words relate to the unified concept
- **Focused semantic area**: Words cluster around the average meaning
- **High thematic consistency**: Strong semantic relationships between results

### Multi-Theme Word Generation  
```python  
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:  
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes
```

**Characteristics:**
- **Diverse results**: Words come from multiple separate concepts
- **Broader semantic coverage**: Covers different conceptual areas
- **Balanced representation**: Each theme contributes equally to final results
- **Higher variety**: Less repetitive, more exploratory results

## Practical Examples

### Single Theme Examples

#### Example 1: Related Animals
```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```

#### Example 2: Academic Focus  
```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```

#### Example 3: Sentence Input
```python
inputs = ["I love furry animals"]  
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```

### Multi-Theme Examples

#### Example 1: Diverse Domains
```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```

#### Example 2: Mixed Context
```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat  
# Themes: Romantic + Food concepts balanced
```

#### Example 3: Technology Exploration
```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```

## Performance Characteristics

### Single Theme Performance
- **Speed**: Faster (one embedding average, one similarity calculation)
- **Memory**: Lower (stores one theme vector)
- **Consistency**: Higher (coherent semantic direction)
- **Best for**: Focused exploration, related concepts, sentence inputs

### Multi-Theme Performance  
- **Speed**: Slower (clustering computation, multiple similarity calculations)
- **Memory**: Higher (stores multiple theme vectors)
- **Diversity**: Higher (multiple semantic directions)
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery

## Technical Implementation Details

### Single Theme Code Path
```python
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    
    # Simple approach: average all input embeddings  
    theme_vector = np.mean(input_embeddings, axis=0)
    
    return theme_vector.reshape(1, -1)
```

### Multi-Theme Code Path
```python
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    
    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]
    
    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)
    
    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```

### Similarity Aggregation
```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
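
A self-contained run of this loop on made-up data shows why averaging rewards words that sit close to every theme (the two-dimensional vocabulary and theme vectors below are invented for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy vocabulary embeddings: one word aligned with each theme,
# plus a "hybrid" word that sits between both themes
vocab_embeddings = np.array([
    [1.0, 0.0],   # aligned with theme 1 only
    [0.0, 1.0],   # aligned with theme 2 only
    [0.7, 0.7],   # close to both themes
])
theme_vectors = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]

all_similarities = np.zeros(len(vocab_embeddings))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)

print(all_similarities)  # the hybrid word (index 2) scores highest
```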

## Usage Guidelines

### When to Use Single Theme
- **1-2 related inputs**: Natural single theme territory
- **Sentence inputs**: Coherent meaning in natural language
- **Focused exploration**: Want words around one specific concept
- **Related concepts**: Inputs that should blend together semantically
- **Performance priority**: Need faster results

### When to Use Multi-Theme (or Allow Auto-Detection)
- **3+ diverse inputs**: Let automatic detection handle it
- **Unrelated concepts**: Want representation from all areas
- **Broad exploration**: Seeking diverse word discovery
- **Balanced results**: Need equal weight from different themes
- **Creative applications**: Want unexpected combinations

### Manual Override Cases
```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)

# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)  
# Result: Attempts to find different aspects of cats vs kittens
```

## Interactive Mode Examples

### Single Theme Interactive Commands
```bash
I love animals                    # Sentence → single theme
cats dogs                         # 2 words → single theme
science research                  # Related concepts → single theme
```

### Multi-Theme Interactive Commands  
```bash
cats, dogs, birds                # 3+ topics → auto multi-theme
science, art, cooking            # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate" # Mixed content → auto multi-theme
technology, nature, music 15     # With parameters → auto multi-theme
```

### Manual Control
```bash  
cats dogs multi                 # Force multi-theme on 2 inputs
"science, research, study"      # 3 inputs but could be single theme contextually
```

## Summary

The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.

The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.