|
# Theme Handling in Thematic Word Generator |
|
|
|
## Overview |
|
|
|
The Unified Thematic Word Generator supports two distinct modes of semantic processing: |
|
|
|
- **Single Theme**: Treats all inputs as contributing to one unified concept |
|
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering |
|
|
|
This document explains the technical differences, algorithms, and practical implications of each approach. |
|
|
|
## Triggering Logic |
|
|
|
### Automatic Detection |
|
```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 1:
    # Multi-theme path (automatic for 3+ inputs, or forced via multi_theme=True)
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```
|
|
|
### Trigger Conditions |
|
- **Single Theme**: 1-2 inputs, or manual override with `multi_theme=False`

- **Multi-Theme**: 3+ inputs (automatic), or 2+ inputs with manual override `multi_theme=True`
|
|
|
### Examples |
|
```python
# Single theme (automatic)
generate_thematic_words("cats")                     # 1 input
generate_thematic_words(["cats", "dogs"])           # 2 inputs

# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"])  # 3 inputs → auto multi-theme

# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True)  # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False)  # Force single theme
```
|
|
|
## Single Theme Processing |
|
|
|
### Algorithm: `_compute_theme_vector(inputs)` |
|
|
|
**Steps:** |
|
1. **Encode all inputs** → Get sentence-transformer embeddings for each input

2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`

3. **Return single vector** → One unified theme representation
|
|
|
### Conceptual Approach |
|
- Treats all inputs as contributing to **one unified concept** |
|
- Creates a **semantic centroid** that represents the combined meaning |
|
- Finds words similar to the **average meaning** of all inputs |
|
- Results are coherent and focused around the unified theme |
|
|
|
### Example Process |
|
```python
inputs = ["cats", "dogs"]  # 2 inputs → single theme

# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
```
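
The averaging step is easy to verify in isolation. Here is a minimal runnable check, using the illustrative four-dimensional vectors from the walkthrough above (real sentence-transformer embeddings have hundreds of dimensions):

```python
import numpy as np

# Illustrative 4-dimensional stand-ins for the embeddings above
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45] → the unified "pets" centroid
```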
|
|
|
### Use Cases |
|
- **Related concepts**: "science, research, study" → Academic/research words

- **Variations of same thing**: "cats, kittens, felines" → Cat-related words

- **Sentences**: "I love furry animals" → Animal-loving context words

- **Semantic expansion**: "ocean, water" → Marine/aquatic words
|
|
|
## Multi-Theme Processing |
|
|
|
### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)` |
|
|
|
**Steps:** |
|
1. **Encode all inputs** → Get embeddings for each input

2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`

3. **K-means clustering** → Group semantically similar inputs together

4. **Extract cluster centers** → Each cluster center becomes one theme vector

5. **Return multiple vectors** → Multiple separate theme representations
|
|
|
### Conceptual Approach |
|
- Treats inputs as potentially representing **multiple different concepts** |
|
- Uses **machine learning clustering** to automatically group related inputs |
|
- Finds words similar to **each separate theme cluster** |
|
- Results are diverse, covering multiple semantic areas |
|
|
|
### Example Process |
|
```python
inputs = ["science", "art", "cooking"]  # 3 inputs → multi-theme

# Processing:
# 1. Get embeddings for all three words:
#    "science": [0.8, 0.1, 0.2, 0.3, ...]
#    "art":     [0.2, 0.9, 0.1, 0.4, ...]
#    "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
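
For concreteness, here is a minimal runnable sketch of the clustering step, using the illustrative vectors from the walkthrough above (real embeddings are much higher-dimensional):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 4-dimensional "embeddings" from the walkthrough above
embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],  # "science"
    [0.2, 0.9, 0.1, 0.4],  # "art"
    [0.3, 0.2, 0.8, 0.5],  # "cooking"
])

# With k equal to the number of inputs, each input becomes its own cluster
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(embeddings)
print(kmeans.cluster_centers_)  # three theme vectors, one per input
```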
|
|
|
### Clustering Details |
|
|
|
**Cluster Count Logic:** |
|
```python
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
```
|
|
|
**Examples:** |
|
- 3 inputs → 3 clusters (each input potentially gets its own theme)

- 4 inputs → 3 clusters (max_themes limit applies)

- 5 inputs → 3 clusters (max_themes limit applies)

- 6+ inputs → 3 clusters (max_themes limit applies)
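
The rule is easy to check in isolation. A hypothetical standalone helper (`n_clusters_for` is not part of the generator) reproduces the table above:

```python
def n_clusters_for(n_inputs: int, max_themes: int = 3) -> int:
    # Same expression used in _detect_multiple_themes
    return min(max_themes, n_inputs, 3)

for n in range(3, 7):
    print(f"{n} inputs -> {n_clusters_for(n)} clusters")  # 3, 3, 3, 3
```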
|
|
|
**K-means Parameters:** |
|
- `random_state=42`: Ensures reproducible clustering results |
|
- `n_init=10`: Runs clustering 10 times with different initializations, picks best result |
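
The reproducibility claim is easy to verify directly. A small sketch with synthetic points (the data here is arbitrary, not generator output):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 4))  # arbitrary stand-in data

# Fixed random_state → identical cluster centers on every run
a = KMeans(n_clusters=3, random_state=42, n_init=10).fit(points).cluster_centers_
b = KMeans(n_clusters=3, random_state=42, n_init=10).fit(points).cluster_centers_
assert np.allclose(a, b)
```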
|
|
|
### Use Cases |
|
- **Diverse topics**: "science, art, cooking" → Words from all three domains

- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words

- **Broad exploration**: "technology, nature, music" → Wide semantic coverage

- **Unrelated concepts**: "politics, sports, weather" → Balanced representation
|
|
|
## Word Generation Differences |
|
|
|
### Single Theme Word Generation |
|
```python
# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
all_similarities = np.zeros(len(vocabulary))  # one score per vocabulary word
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme
```
|
|
|
**Characteristics:** |
|
- **Coherent results**: All words relate to the unified concept |
|
- **Focused semantic area**: Words cluster around the average meaning |
|
- **High thematic consistency**: Strong semantic relationships between results |
|
|
|
### Multi-Theme Word Generation |
|
```python
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
all_similarities = np.zeros(len(vocabulary))  # one score per vocabulary word
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes
```
|
|
|
**Characteristics:** |
|
- **Diverse results**: Words come from multiple separate concepts |
|
- **Broader semantic coverage**: Covers different conceptual areas |
|
- **Balanced representation**: Each theme contributes equally to final results |
|
- **Higher variety**: Less repetitive, more exploratory results |
|
|
|
## Practical Examples |
|
|
|
### Single Theme Examples |
|
|
|
#### Example 1: Related Animals |
|
```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```
|
|
|
#### Example 2: Academic Focus |
|
```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```
|
|
|
#### Example 3: Sentence Input |
|
```python
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```
|
|
|
### Multi-Theme Examples |
|
|
|
#### Example 1: Diverse Domains |
|
```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```
|
|
|
#### Example 2: Mixed Context |
|
```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
```
|
|
|
#### Example 3: Technology Exploration |
|
```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```
|
|
|
## Performance Characteristics |
|
|
|
### Single Theme Performance |
|
- **Speed**: Faster (one embedding average, one similarity calculation) |
|
- **Memory**: Lower (stores one theme vector) |
|
- **Consistency**: Higher (coherent semantic direction) |
|
- **Best for**: Focused exploration, related concepts, sentence inputs |
|
|
|
### Multi-Theme Performance |
|
- **Speed**: Slower (clustering computation, multiple similarity calculations) |
|
- **Memory**: Higher (stores multiple theme vectors) |
|
- **Diversity**: Higher (multiple semantic directions) |
|
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery |
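
To gauge the speed difference on your own setup, a rough micro-benchmark sketch (hypothetical: assumes `generate_thematic_words` is imported from your project and the model and vocabulary are already loaded):

```python
import time

def timed(label: str, *args, **kwargs) -> None:
    # Hypothetical helper: time a single call to the generator
    start = time.perf_counter()
    generate_thematic_words(*args, **kwargs)  # import from your project first
    print(f"{label}: {time.perf_counter() - start:.3f}s")

timed("single theme", ["ocean", "water"])            # averaging only
timed("multi-theme", ["science", "art", "cooking"])  # clustering + 3 similarity passes
```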
|
|
|
## Technical Implementation Details |
|
|
|
### Single Theme Code Path |
|
```python
# Module-level imports used by this method
import numpy as np
from typing import List

def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)

    return theme_vector.reshape(1, -1)
```
|
|
|
### Multi-Theme Code Path |
|
```python
# Module-level imports used by this method
import numpy as np
from typing import List
from sklearn.cluster import KMeans

def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)

    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```
|
|
|
### Similarity Aggregation |
|
```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
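
To see the averaging in action outside the generator, here is a self-contained toy example (2-D vectors stand in for real embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy setup: two orthogonal theme vectors and four "vocabulary" embeddings
theme_vectors = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
vocab_embeddings = np.array([
    [1.0, 0.0],   # matches theme 1 only
    [0.0, 1.0],   # matches theme 2 only
    [0.7, 0.7],   # partially matches both themes
    [-1.0, 0.0],  # opposes theme 1
])

all_similarities = np.zeros(len(vocab_embeddings))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)

print(all_similarities)  # [0.5, 0.5, ~0.71, -0.5] → the "both themes" word scores highest
```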
|
|
|
## Usage Guidelines |
|
|
|
### When to Use Single Theme |
|
- **1-2 related inputs**: Natural single theme territory |
|
- **Sentence inputs**: Coherent meaning in natural language |
|
- **Focused exploration**: Want words around one specific concept |
|
- **Related concepts**: Inputs that should blend together semantically |
|
- **Performance priority**: Need faster results |
|
|
|
### When to Use Multi-Theme (or Allow Auto-Detection) |
|
- **3+ diverse inputs**: Let automatic detection handle it |
|
- **Unrelated concepts**: Want representation from all areas |
|
- **Broad exploration**: Seeking diverse word discovery |
|
- **Balanced results**: Need equal weight from different themes |
|
- **Creative applications**: Want unexpected combinations |
|
|
|
### Manual Override Cases |
|
```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)

# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
```
|
|
|
## Interactive Mode Examples |
|
|
|
### Single Theme Interactive Commands |
|
```bash
I love animals      # Sentence → single theme
cats dogs           # 2 words → single theme
science research    # Related concepts → single theme
```
|
|
|
### Multi-Theme Interactive Commands |
|
```bash
cats, dogs, birds                  # 3+ topics → auto multi-theme
science, art, cooking              # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"   # Mixed content → auto multi-theme
technology, nature, music 15       # With parameters → auto multi-theme
```
|
|
|
### Manual Control |
|
```bash
cats dogs multi               # Force multi-theme on 2 inputs
"science, research, study"    # 3 inputs but could be single theme contextually
```
|
|
|
## Summary |
|
|
|
The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery. |
|
|
|
The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case. |