# Theme Handling in Thematic Word Generator
## Overview
The Unified Thematic Word Generator supports two distinct modes of semantic processing:
- **Single Theme**: Treats all inputs as contributing to one unified concept
- **Multi-Theme**: Detects and processes multiple separate concepts using machine learning clustering
This document explains the technical differences, algorithms, and practical implications of each approach.
## Triggering Logic
### Automatic Detection
```python
# Auto-enable multi-theme for 3+ inputs (matching original behavior)
auto_multi_theme = len(clean_inputs) > 2
final_multi_theme = multi_theme or auto_multi_theme

if final_multi_theme and len(clean_inputs) > 2:
    # Multi-theme path
    theme_vectors = self._detect_multiple_themes(clean_inputs)
else:
    # Single theme path
    theme_vectors = [self._compute_theme_vector(clean_inputs)]
```
### Trigger Conditions
- **Single Theme**: 1-2 inputs OR manual override with `multi_theme=False`
- **Multi-Theme**: 3+ inputs (automatic) OR manual override with `multi_theme=True`
### Examples
```python
# Single theme (automatic)
generate_thematic_words("cats") # 1 input
generate_thematic_words(["cats", "dogs"]) # 2 inputs
# Multi-theme (automatic)
generate_thematic_words(["cats", "dogs", "birds"]) # 3 inputs β†’ auto multi-theme
# Manual override
generate_thematic_words(["cats", "dogs"], multi_theme=True) # Force multi-theme
generate_thematic_words(["a", "b", "c"], multi_theme=False) # Force single theme
```
## Single Theme Processing
### Algorithm: `_compute_theme_vector(inputs)`
**Steps:**
1. **Encode all inputs** → Get sentence-transformer embeddings for each input
2. **Average embeddings** → `np.mean(input_embeddings, axis=0)`
3. **Return single vector** → One unified theme representation
### Conceptual Approach
- Treats all inputs as contributing to **one unified concept**
- Creates a **semantic centroid** that represents the combined meaning
- Finds words similar to the **average meaning** of all inputs
- Results are coherent and focused around the unified theme
### Example Process
```python
inputs = ["cats", "dogs"] # 2 inputs β†’ Single theme
# Processing:
# 1. Get embedding for "cats": [0.2, 0.8, 0.1, 0.4, ...]
# 2. Get embedding for "dogs": [0.3, 0.7, 0.2, 0.5, ...]
# 3. Average: [(0.2+0.3)/2, (0.8+0.7)/2, (0.1+0.2)/2, (0.4+0.5)/2, ...]
# 4. Result: [0.25, 0.75, 0.15, 0.45, ...] β†’ "pets/domestic animals" concept
```
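The averaging step itself is plain NumPy. Here is a minimal runnable sketch using the toy numbers above (real sentence-transformer embeddings have hundreds of dimensions, but the arithmetic is identical):
```python
import numpy as np

# Toy 4-dimensional stand-ins for real sentence-transformer embeddings
cats = np.array([0.2, 0.8, 0.1, 0.4])
dogs = np.array([0.3, 0.7, 0.2, 0.5])

# Semantic centroid: element-wise mean of the input embeddings
theme_vector = np.mean([cats, dogs], axis=0)
print(theme_vector)  # [0.25 0.75 0.15 0.45] -> the "pets/domestic animals" concept
```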
### Use Cases
- **Related concepts**: "science, research, study" → Academic/research words
- **Variations of same thing**: "cats, kittens, felines" → Cat-related words
- **Sentences**: "I love furry animals" → Animal-loving context words
- **Semantic expansion**: "ocean, water" → Marine/aquatic words
## Multi-Theme Processing
### Algorithm: `_detect_multiple_themes(inputs, max_themes=3)`
**Steps:**
1. **Encode all inputs** → Get embeddings for each input
2. **Determine clusters** → `n_clusters = min(max_themes, len(inputs), 3)`
3. **K-means clustering** → Group semantically similar inputs together
4. **Extract cluster centers** → Each cluster center becomes one theme vector
5. **Return multiple vectors** → Multiple separate theme representations
### Conceptual Approach
- Treats inputs as potentially representing **multiple different concepts**
- Uses **machine learning clustering** to automatically group related inputs
- Finds words similar to **each separate theme cluster**
- Results are diverse, covering multiple semantic areas
### Example Process
```python
inputs = ["science", "art", "cooking"] # 3 inputs β†’ Multi-theme
# Processing:
# 1. Get embeddings for all three words:
# "science": [0.8, 0.1, 0.2, 0.3, ...]
# "art": [0.2, 0.9, 0.1, 0.4, ...]
# "cooking": [0.3, 0.2, 0.8, 0.5, ...]
# 2. Run K-means clustering (k=3)
# 3. Cluster results:
# Cluster 1: "science" → [0.8, 0.1, 0.2, 0.3, ...] (research theme)
# Cluster 2: "art" → [0.2, 0.9, 0.1, 0.4, ...] (creative theme)
# Cluster 3: "cooking" → [0.3, 0.2, 0.8, 0.5, ...] (culinary theme)
# 4. Result: Three separate theme vectors for word generation
```
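The clustering step is standard scikit-learn. A runnable sketch with the toy vectors above (assuming scikit-learn is installed; real embeddings are much higher-dimensional):
```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for the three input embeddings above
embeddings = np.array([
    [0.8, 0.1, 0.2, 0.3],  # "science"
    [0.2, 0.9, 0.1, 0.4],  # "art"
    [0.3, 0.2, 0.8, 0.5],  # "cooking"
])

# k equals the number of inputs here, so each input gets its own cluster
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(embeddings)
theme_vectors = [center.reshape(1, -1) for center in kmeans.cluster_centers_]
print(len(theme_vectors))  # 3 separate theme vectors
```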
### Clustering Details
**Cluster Count Logic:**
```python
n_clusters = min(max_themes, len(inputs), 3)  # max_themes defaults to 3
```
**Examples:**
- 3 inputs → 3 clusters (each input potentially gets its own theme)
- 4 inputs → 3 clusters (max_themes limit applies)
- 5 inputs → 3 clusters (max_themes limit applies)
- 6+ inputs → 3 clusters (max_themes limit applies)
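The rule is easy to verify directly; this quick standalone check reproduces the table above:
```python
max_themes = 3
for n_inputs in range(3, 7):
    n_clusters = min(max_themes, n_inputs, 3)
    print(f"{n_inputs} inputs -> {n_clusters} clusters")
# 3 inputs -> 3 clusters; 4, 5, and 6 inputs all cap at 3
```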
**K-means Parameters:**
- `random_state=42`: Ensures reproducible clustering results
- `n_init=10`: Runs clustering 10 times with different initializations, picks best result
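To see the effect of `random_state` concretely, two runs over the same data produce identical cluster centers (a standalone sanity check, not code from the generator):
```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(6, 4)  # arbitrary toy data

a = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X).cluster_centers_
b = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X).cluster_centers_
assert np.allclose(a, b)  # fixed random_state -> reproducible clustering
```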
### Use Cases
- **Diverse topics**: "science, art, cooking" → Words from all three domains
- **Mixed contexts**: "I love you, moonpie, chocolate" → Romance + food words
- **Broad exploration**: "technology, nature, music" → Wide semantic coverage
- **Unrelated concepts**: "politics, sports, weather" → Balanced representation
## Word Generation Differences
### Single Theme Word Generation
```python
# Only one theme vector
theme_vectors = [single_theme_vector]  # Length = 1

# Similarity calculation (runs once)
all_similarities = np.zeros(len(self.vocabulary))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 1 (no change)

# Result: All words are similar to the unified theme
```
**Characteristics:**
- **Coherent results**: All words relate to the unified concept
- **Focused semantic area**: Words cluster around the average meaning
- **High thematic consistency**: Strong semantic relationships between results
### Multi-Theme Word Generation
```python
# Multiple theme vectors
theme_vectors = [theme1, theme2, theme3]  # Length = 3

# Similarity calculation (runs 3 times)
all_similarities = np.zeros(len(self.vocabulary))
for theme_vector in theme_vectors:
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Divide by 3 (average)

# Result: Words similar to any of the themes, averaged across all themes
```
**Characteristics:**
- **Diverse results**: Words come from multiple separate concepts
- **Broader semantic coverage**: Covers different conceptual areas
- **Balanced representation**: Each theme contributes equally to final results
- **Higher variety**: Less repetitive, more exploratory results
## Practical Examples
### Single Theme Examples
#### Example 1: Related Animals
```python
inputs = ["cats", "dogs"]
# Result words: pets, animals, fur, tail, home, domestic, companion, mammal
# Theme: Unified "domestic pets" concept
```
#### Example 2: Academic Focus
```python
inputs = ["science", "research"]
# Result words: study, experiment, theory, hypothesis, laboratory, academic, discovery
# Theme: Unified "scientific research" concept
```
#### Example 3: Sentence Input
```python
inputs = ["I love furry animals"]
# Result words: pets, cats, dogs, cuddle, soft, warm, affection, companion
# Theme: Unified "affection for furry pets" concept
```
### Multi-Theme Examples
#### Example 1: Diverse Domains
```python
inputs = ["science", "art", "cooking"]
# Result words: research, painting, recipe, experiment, canvas, ingredients, theory, brush, flavor
# Themes: Scientific + Creative + Culinary concepts balanced
```
#### Example 2: Mixed Context
```python
inputs = ["I love you", "moonpie", "chocolate"]
# Result words: romance, dessert, sweet, affection, snack, love, candy, caring, treat
# Themes: Romantic + Food concepts balanced
```
#### Example 3: Technology Exploration
```python
inputs = ["computer", "internet", "smartphone", "AI"]
# Result words: technology, digital, network, mobile, artificial, intelligence, software, device
# Themes: Different tech concepts clustered and balanced
```
## Performance Characteristics
### Single Theme Performance
- **Speed**: Faster (one embedding average, one similarity calculation)
- **Memory**: Lower (stores one theme vector)
- **Consistency**: Higher (coherent semantic direction)
- **Best for**: Focused exploration, related concepts, sentence inputs
### Multi-Theme Performance
- **Speed**: Slower (clustering computation, multiple similarity calculations)
- **Memory**: Higher (stores multiple theme vectors)
- **Diversity**: Higher (multiple semantic directions)
- **Best for**: Broad exploration, unrelated concepts, diverse word discovery
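To quantify the speed difference on your own hardware, a rough timing harness along these lines works (hypothetical helper; it assumes `generate_thematic_words` is importable in scope):
```python
import time

def timed(inputs, **kwargs):
    """Return elapsed seconds for one generation call (hypothetical harness)."""
    start = time.perf_counter()
    generate_thematic_words(inputs, **kwargs)
    return time.perf_counter() - start

inputs = ["science", "art", "cooking"]
print(f"single theme: {timed(inputs, multi_theme=False):.3f}s")
print(f"multi-theme:  {timed(inputs):.3f}s")  # 3 inputs -> auto multi-theme
```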
## Technical Implementation Details
### Single Theme Code Path
```python
def _compute_theme_vector(self, inputs: List[str]) -> np.ndarray:
    """Compute semantic centroid from input words/sentences."""
    # Encode all inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Simple approach: average all input embeddings
    theme_vector = np.mean(input_embeddings, axis=0)
    return theme_vector.reshape(1, -1)
```
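A hypothetical usage sketch (assuming `gen` is an initialized generator instance; the embedding dimension depends on the loaded model):
```python
# `gen` is assumed to be an initialized generator instance
vec = gen._compute_theme_vector(["cats", "dogs"])
print(vec.shape)  # (1, embedding_dim) -- one unified theme vector
```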
### Multi-Theme Code Path
```python
def _detect_multiple_themes(self, inputs: List[str], max_themes: int = 3) -> List[np.ndarray]:
    """Detect multiple themes using clustering."""
    # Encode inputs
    input_embeddings = self.model.encode(inputs, convert_to_tensor=False, show_progress_bar=False)

    # Determine optimal number of clusters
    n_clusters = min(max_themes, len(inputs), 3)
    if n_clusters == 1:
        return [np.mean(input_embeddings, axis=0).reshape(1, -1)]

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    kmeans.fit(input_embeddings)

    # Return cluster centers as theme vectors
    return [center.reshape(1, -1) for center in kmeans.cluster_centers_]
```
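The corresponding hypothetical usage for the multi-theme path:
```python
# `gen` is assumed to be an initialized generator instance
themes = gen._detect_multiple_themes(["science", "art", "cooking"])
print(len(themes))      # 3 theme vectors (one per cluster)
print(themes[0].shape)  # (1, embedding_dim)
```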
### Similarity Aggregation
```python
# Collect similarities from all themes
all_similarities = np.zeros(len(self.vocabulary))

for theme_vector in theme_vectors:
    # Compute similarities with vocabulary
    similarities = cosine_similarity(theme_vector, self.vocab_embeddings)[0]
    all_similarities += similarities / len(theme_vectors)  # Average across themes
```
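From here, turning the aggregated scores into candidate words is a top-k selection. A sketch of how that ranking might look (the generator's actual selection and filtering logic may differ):
```python
# Rank vocabulary by averaged similarity, best first (illustrative top-k step)
top_k = 10
top_indices = np.argsort(all_similarities)[::-1][:top_k]
top_words = [self.vocabulary[i] for i in top_indices]
```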
## Usage Guidelines
### When to Use Single Theme
- **1-2 related inputs**: Natural single theme territory
- **Sentence inputs**: Coherent meaning in natural language
- **Focused exploration**: Want words around one specific concept
- **Related concepts**: Inputs that should blend together semantically
- **Performance priority**: Need faster results
### When to Use Multi-Theme (or Allow Auto-Detection)
- **3+ diverse inputs**: Let automatic detection handle it
- **Unrelated concepts**: Want representation from all areas
- **Broad exploration**: Seeking diverse word discovery
- **Balanced results**: Need equal weight from different themes
- **Creative applications**: Want unexpected combinations
### Manual Override Cases
```python
# Force single theme for diverse inputs (rare)
generate_thematic_words(["red", "blue", "green"], multi_theme=False)
# Result: Color-focused words (unified color concept)
# Force multi-theme for similar inputs (rare)
generate_thematic_words(["cat", "kitten"], multi_theme=True)
# Result: Attempts to find different aspects of cats vs kittens
```
## Interactive Mode Examples
### Single Theme Interactive Commands
```bash
I love animals     # Sentence → single theme
cats dogs          # 2 words → single theme
science research   # Related concepts → single theme
```
### Multi-Theme Interactive Commands
```bash
cats, dogs, birds                  # 3+ topics → auto multi-theme
science, art, cooking              # Diverse topics → auto multi-theme
"I love you, moonpie, chocolate"   # Mixed content → auto multi-theme
technology, nature, music 15       # With parameters → auto multi-theme
```
### Manual Control
```bash
cats dogs multi # Force multi-theme on 2 inputs
"science, research, study" # 3 inputs but could be single theme contextually
```
## Summary
The **single theme** approach creates semantic unity by averaging all inputs into one unified concept, perfect for exploring focused topics and related concepts. The **multi-theme** approach preserves semantic diversity by using machine learning clustering to detect and maintain separate themes, ideal for broad exploration and diverse word discovery.
The automatic detection (3+ inputs = multi-theme) provides intelligent defaults while allowing manual override for special cases. This gives you both the focused power of semantic averaging and the exploratory power of multi-concept clustering, depending on your use case.