# Theme Handling in Thematic Word Generator
## Overview
The Unified Thematic Word Generator supports two distinct modes of semantic processing:
- **Single Theme**: Treats all inputs as contributing to one unified concept
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering
This document explains the technical differences, algorithms, and practical implications of each approach.
## Triggering Logic
### Automatic Detection
```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme
if final_multi_theme and len(clean_inputs) > 2:
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```
### Trigger Conditions
- **Single Theme**: 1-2 inputs OR manual override with `multi_theme=False`
- **Multi-Theme**: 3+ inputs (automatic) OR manual override with `multi_theme=True`
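The trigger conditions above can be sketched as a small standalone function (the name `use_multi_theme` is illustrative, not from the source):

```python
# Hypothetical sketch of the trigger logic described above.
def use_multi_theme(inputs, multi_theme=False):
    """Return True when multi-theme processing should run."""
    auto_multi_theme = len(inputs) > 2  # 3+ inputs trigger automatically
    return multi_theme or auto_multi_theme
```

A manual `multi_theme=True` always wins; otherwise the input count decides.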
### Examples
```python
# Single theme (automatic)
generate_thematic_words("cats") # 1 input
generate_thematic_words(["cats", "dogs"]) # 2 inputs
# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs → auto multi-theme
# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True) # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme
```
## Single Theme Processing
### Algorithm: `_compute_theme_vector(inputs)`
**Steps:**
1. **Encode all inputs** → Get sentence-transformer embeddings for each input
2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`
3. **Return single vector** → One unified theme representation
### Conceptual Approach
- Treats all inputs as contributing to **one unified concept**
- Creates a **semantic centroid** that represents the combined meaning
- Finds words similar to the **average meaning** of all inputs
- Results are coherent and focused around the unified theme
### Example Process
```python
inputs = ["cats", "dogs"]  # 2 inputs → Single theme
# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] → "pets/domestic animals" concept
```
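The averaging step in the comments above can be reproduced directly with numpy, using the same made-up 4-dimensional embedding values (real sentence-transformer embeddings have hundreds of dimensions):

```python
import numpy as np

# Toy embeddings from the worked example above (illustrative values only).
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

# Element-wise average produces the semantic centroid.
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45]
```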
### Use Cases
- **Related concepts**: "science, research, study" → Academic/research words
- **Variations of same thing**: "cats, kittens, felines" → Cat-related words
- **Sentences**: "I love furry animals" → Animal-loving context words
- **Semantic expansion**: "ocean, water" → Marine/aquatic words
## Multi-Theme Processing
### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)`
**Steps:**
1. **Encode all inputs** → Get embeddings for each input
2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`
3. **K-means clustering** → Group semantically similar inputs together
4. **Extract cluster centers** → Each cluster center becomes one theme vector
5. **Return multiple vectors** → Multiple separate theme representations
### Conceptual Approach
- Treats inputs as potentially representing **multiple different concepts**
- Uses **machine learning clustering** to automatically group related inputs
- Finds words similar to **each separate theme cluster**
- Results are diverse, covering multiple semantic areas
### Example Process
```python
inputs = ["science", "art", "cooking"]  # 3 inputs → Multi-theme
# Processing:
# 1. Get embeddings for all three words:
# "science": [0.8, 0.1, 0.2, 0.3, ...]
# "art": [0.2, 0.9, 0.1, 0.4, ...]
# "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
#    Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
#    Cluster 2: "art"     → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
#    Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
### Clustering Details
**Cluster Count Logic:**
```python
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
```
**Examples:**
- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
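The cluster-count rule can be expressed as a tiny helper (name `cluster_count` is illustrative; it mirrors the `min(...)` expression above):

```python
# Illustrative sketch of the cluster-count logic, assuming max_themes defaults to 3.
def cluster_count(n_inputs, max_themes=3):
    """Number of K-means clusters for a given input count."""
    return min(max_themes, n_inputs, 3)
```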
**K-means Parameters:**
- `random_state=42`: Ensures reproducible clustering results
- `n_init=10`: Runs clustering 10 times with different initializations, picks best result
### Use Cases
- **Diverse topics**: "science, art, cooking" → Words from all three domains
- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words
- **Broad exploration**: "technology, nature, music" → Wide semantic coverage
- **Unrelated concepts**: "politics, sports, weather" → Balanced representation
## Word Generation Differences
### Single Theme Word Generation
```python
# Only one theme vector
theme_vectors = [single_theme_vector] # Length = 1
# Similarity calculation (runs once)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)
# Result: All words are similar to the unified theme
```
**Characteristics:**
- **Coherent results**: All words relate to the unified concept
- **Focused semantic area**: Words cluster around the average meaning
- **High thematic consistency**: Strong semantic relationships between results
### Multi-Theme Word Generation
```python
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3] # Length = 3
# Similarity calculation (runs 3 times)
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)
# Result: Words similar to any of the themes, averaged across all themes
```
**Characteristics:**
- **Diverse results**: Words come from multiple separate concepts
- **Broader semantic coverage**: Covers different conceptual areas
- **Balanced representation**: Each theme contributes equally to final results
- **Higher variety**: Less repetitive, more exploratory results
## Practical Examples
### Single Theme Examples
#### Example 1: Related Animals
```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```
#### Example 2: Academic Focus
```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```
#### Example 3: Sentence Input
```python
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```
### Multi-Theme Examples
#### Example 1: Diverse Domains
```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```
#### Example 2: Mixed Context
```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
```
#### Example 3: Technology Exploration
```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```
## Performance Characteristics
### Single Theme Performance
- **Speed**: Faster (one embedding average, one similarity calculation)
- **Memory**: Lower (stores one theme vector)
- **Consistency**: Higher (coherent semantic direction)
- **Best for**: Focused exploration, related concepts, sentence inputs
### Multi-Theme Performance
- **Speed**: Slower (clustering computation, multiple similarity calculations)
- **Memory**: Higher (stores multiple theme vectors)
- **Diversity**: Higher (multiple semantic directions)
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery
## Technical Implementation Details
### Single Theme Code Path
```python
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)
```
### Multi-Theme Code Path
```python
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)
    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]
    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)
    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```
### Similarity Aggregation
```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))
for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
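A self-contained toy run of this aggregation, using plain numpy in place of sklearn's `cosine_similarity` and made-up 3-dimensional embeddings (all names and values below are illustrative):

```python
import numpy as np

def cosine(u, v):
    """Plain-numpy cosine similarity (stand-in for sklearn's cosine_similarity)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two made-up theme vectors and a two-word vocabulary.
theme_vectors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
vocab = {"alpha": np.array([1.0, 0.0, 0.0]), "beta": np.array([0.0, 0.0, 1.0])}

scores = {}
for word, emb in vocab.items():
    # Average the word's similarity across all themes, as in the loop above.
    scores[word] = sum(cosine(t, emb) for t in theme_vectors) / len(theme_vectors)

# "alpha" matches theme 1 exactly → average 0.5; "beta" matches neither → 0.0
```

Because each theme's contribution is divided by the theme count, a word close to just one theme still scores well, which is what produces the balanced, diverse results described above.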
## Usage Guidelines
### When to Use Single Theme
- **1-2 related inputs**: Natural single theme territory
- **Sentence inputs**: Coherent meaning in natural language
- **Focused exploration**: Want words around one specific concept
- **Related concepts**: Inputs that should blend together semantically
- **Performance priority**: Need faster results
### When to Use Multi-Theme (or Allow Auto-Detection)
- **3+ diverse inputs**: Let automatic detection handle it
- **Unrelated concepts**: Want representation from all areas
- **Broad exploration**: Seeking diverse word discovery
- **Balanced results**: Need equal weight from different themes
- **Creative applications**: Want unexpected combinations
### Manual Override Cases
```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)
# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
```
## Interactive Mode Examples
### Single Theme Interactive Commands
```bash
I love animals          # Sentence → single theme
cats dogs               # 2 words → single theme
science research        # Related concepts → single theme
```
### Multi-Theme Interactive Commands
```bash
cats, dogs, birds                 # 3+ topics → auto multi-theme
science, art, cooking             # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"  # Mixed content → auto multi-theme
technology, nature, music 15      # With parameters → auto multi-theme
```
### Manual Control
```bash
cats dogs multi # Force multi-theme on 2 inputs
"science, research, study" # 3 inputs but could be single theme contextually
```
## Summary
The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.
The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.