Contra-Topic-bottleneck-t5-large: Linear Topic Extraction using Bottleneck T5
A lightweight approach to topic extraction leveraging the Bottleneck T5 autoencoder architecture with learned transformation matrices. This project provides three specialized transformation matrices for mapping content embeddings to topic embeddings across different domains.
TL;DR: Transform content embeddings into topic embeddings using domain-specific 1024×1024 transformation matrices, each learned on one of three datasets. Built on top of the Bottleneck T5 architecture, so topic extraction needs no model training or fine-tuning.
Motivation
Large Language Models (LLMs) have become the go-to solution for many NLP tasks, including topic extraction and classification. However, they come with significant overhead:
- High computational requirements
- Large memory footprint
- Considerable inference latency
- Complex deployment needs
- Limited to pre-specified classes
This project offers a lightweight alternative specifically for topic extraction by leveraging the semantic structure of the Bottleneck T5's latent space. Instead of training a new model or fine-tuning existing ones, we learn a simple linear transformation between content and topic embeddings, providing:
- Fast inference (milliseconds)
- Minimal memory footprint (single 1024×1024 matrix per domain)
- Simple deployment (basic matrix multiplication)
- No need for GPU at inference time
- Generative in nature (topics are decoded as free text rather than chosen from a fixed label set)
Architecture Overview
Base Model
- Uses Bottleneck T5 Large (thesephist/contra-bottleneck-t5-large-wikipedia)
- Fixed embedding dimension: 1024
- Pre-trained on Wikipedia data
- Autoencoder architecture with attention pooling
Transformation Layers
- Three domain-specific transformation matrices (1024×1024 each)
- Linear mapping from content to topic space
- Learned using simple Mean Squared Error optimization
- Total additional parameters: ~1M per domain (1024 × 1024 = 1,048,576), ~3M across all three domains
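The mapping described above can be illustrated with a minimal sketch. It assumes NumPy and uses synthetic toy data in 8 dimensions instead of real 1024-dimensional embeddings; the names `true_map`, `content`, and `topic` are hypothetical. The README only says the matrices were learned with MSE optimization, so the closed-form least-squares solver here is an assumption, not necessarily the procedure that produced the released matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 8, 200  # toy sizes; the real embeddings are 1024-dimensional

# Synthetic stand-ins for content and topic embeddings (not real data).
true_map = rng.normal(size=(dim, dim))
content = rng.normal(size=(n_samples, dim))
topic = content @ true_map + 0.01 * rng.normal(size=(n_samples, dim))

# 80/20 split, mirroring the train/test split this README reports metrics on.
split = int(0.8 * n_samples)
x_train, x_test = content[:split], content[split:]
y_train, y_test = topic[:split], topic[split:]

# Least squares minimizes the squared error of x_train @ W against y_train.
W, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

train_mse = float(np.mean((x_train @ W - y_train) ** 2))
test_mse = float(np.mean((x_test @ W - y_test) ** 2))
print(W.shape, train_mse, test_mse)
```

At inference time only `W` is needed: a new content embedding is mapped with a single `embedding @ W` product.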
Datasets and Performance Metrics
1. ArXiv Abstracts Dataset (ankitagr01/dynamic_topic_modeling_arxiv_abstracts)
Scientific paper abstracts paired with their research topics, providing a test bed for academic content classification.
Performance Metrics:
- Training MSE: 0.00225 (error on samples used to learn transformation)
- Testing MSE: 0.00268 (error on held-out validation set)
- Inter-topic MSE: 0.00620 (minimum distance between different topic embeddings)
Use Cases:
- Automated paper categorization
- Research trend analysis
- Academic content recommendation
2. TopicSUM Dataset (knkarthick/topicsum)
241,171 dialogue samples with human-annotated topic labels, ideal for conversational content analysis.
Performance Metrics:
- Training MSE: 0.00252
- Testing MSE: 0.00255
- Inter-topic MSE: 0.00737
Use Cases:
- Meeting summarization
- Customer service dialogue categorization
- Chat log analysis
3. MSD Manual Topics (nuvocare/MSD_manual_topics_user_base)
Medical content from Merck's Manual, featuring both professional and patient-oriented content.
Performance Metrics:
- Training MSE: 0.00174
- Testing MSE: 0.00197
- Inter-topic MSE: 0.00566
Use Cases:
- Medical document classification
- Healthcare content organization
- Patient information routing
Understanding the Metrics
Computational Requirements
Resource | Requirement | Notes |
---|---|---|
Storage | ~4MB per matrix | 1024×1024 float32 values (4 bytes each) |
Memory | ~12MB total | All three domain matrices |
Inference Time | ~10ms | On CPU, matrix multiplication per text sample |
Training Hardware | P100 GPU | Free tier on Kaggle |
Training Time | ~4 hours total | Mostly embedding generation |
Base Model | ~770M parameters | Loaded only during embedding creation |
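As a rough back-of-the-envelope check of the matrix footprint, the sketch below computes the parameter count and raw size of one matrix, assuming float32 storage at 4 bytes per value. Files on disk can be larger if a tensor is saved in a wider dtype such as float64, and serialization adds a small overhead.

```python
dim = 1024
bytes_per_float32 = 4

n_params = dim * dim                       # values in one 1024×1024 matrix
size_bytes = n_params * bytes_per_float32  # raw float32 payload

print(f"{n_params:,} parameters per matrix")
print(f"{size_bytes / 2**20:.1f} MiB per matrix in float32")
print(f"{3 * size_bytes / 2**20:.1f} MiB for all three matrices")
```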
Performance Metrics Explained
Training MSE (Mean Squared Error)
- Measures how well the transformation matrix maps content to topic embeddings
- Calculated on the 80% training split
- Lower values indicate better alignment between transformed content and actual topic embeddings
Testing MSE
- Same metric but on 20% held-out test set
- Indicates generalization capability
- Close train and test values suggest good generalization; a testing MSE slightly above the training MSE is expected and healthy
Inter-topic MSE
- Minimum squared distance between any pair of topic embeddings
- Higher values indicate better topic separation
- Critical for preventing topic confusion
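- Example: MSD's inter-topic MSE of 0.00566 is nearly 3× its testing MSE (0.00197), so mapped embeddings tend to sit closer to their own topic than to the nearest neighboring topic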
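The three metrics above can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, the vectors are toy 2-D values rather than real embeddings, and the inter-topic metric is taken to be the smallest pairwise MSE (squared difference averaged over dimensions), which the README does not spell out.

```python
from itertools import combinations

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mapping_mse(predicted, targets):
    """Average MSE between mapped content embeddings and their topic embeddings."""
    return sum(mse(p, t) for p, t in zip(predicted, targets)) / len(targets)

def inter_topic_mse(topic_embeddings):
    """Smallest pairwise MSE among the given topic embeddings."""
    return min(mse(a, b) for a, b in combinations(topic_embeddings, 2))

# Toy 2-D vectors standing in for 1024-D embeddings (illustrative values only).
topics = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
predicted = [[0.1, 0.0], [0.9, 0.1]]
targets = [[0.0, 0.0], [1.0, 0.0]]

print(inter_topic_mse(topics))  # 0.5
print(mapping_mse(predicted, targets))
```

Training and testing MSE are the same `mapping_mse` computation applied to the 80% and 20% splits respectively.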
Comparative Analysis
- MSD dataset shows the best training performance (0.00174 MSE)
  - Likely due to well-structured medical vocabulary
  - Clear topic boundaries in the medical domain
- TopicSUM has the highest inter-topic MSE (0.00737)
  - Reflects the diverse nature of conversational topics
  - Important for distinguishing between varied dialogue contexts
- ArXiv falls between the two on both training and inter-topic MSE
  - Scientific content has natural overlap between fields
  - Still maintains good topic separation (0.00620 inter-topic MSE)
Implementation
Try it out here: https://colab.research.google.com/drive/1_SuTiL3QS-PUYjSrugqqD5mQlMv8Hbfc?usp=sharing
1. Base Model Wrapper
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            trust_remote_code=True
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        # Encode the text and return the bottleneck latent vector
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        # Steer decoding toward `latent` by perturbing the embedding of a dummy input
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
2. Topic Mapper
Transformations Available:
- https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_topicsum.pt
- https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt
- https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_msd.pt
import requests
import torch

url = 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt'
file_path = 'transformation_matrix.pt'
with open(file_path, 'wb') as f:
    f.write(requests.get(url).content)
# weights_only=False unpickles arbitrary objects; only load files you trust
transformation_matrix = torch.load(file_path, weights_only=False).float()
print(transformation_matrix.shape, type(transformation_matrix))
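To switch domains without editing the URL by hand, a small lookup helper can build the download link from a domain name. This is only a convenience sketch: `BASE_URL`, `DOMAINS`, and `matrix_url` are hypothetical names, and the URL pattern is taken from the three links listed above.

```python
BASE_URL = "https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main"
DOMAINS = ("topicsum", "arxiv", "msd")

def matrix_url(domain: str) -> str:
    """Return the download URL of the transformation matrix for a domain."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain {domain!r}; expected one of {DOMAINS}")
    return f"{BASE_URL}/transformation_matrix_{domain}.pt"

print(matrix_url("msd"))
```

The returned string can be used as `url` in the download snippet above.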
3. Final Conversion
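model_path = 'thesephist/contra-bottleneck-t5-large-wikipedia'
device = 'cpu'
content = 'Your input text here.'  # placeholder; replace with the text to analyze
autoencoder = BottleneckT5Autoencoder(model_path=model_path, device=device)
content_embedding = autoencoder.embed(content)
# Map the content embedding into topic space, then decode it back to text
topic_embedding = content_embedding @ transformation_matrix
topic = autoencoder.generate_from_latent(topic_embedding)
print(topic)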
Limitations and Future Work
Representation Quality
- System inherits Bottleneck T5's encoding limitations
- Performance depends on input text fitting model's training distribution
Domain Specificity
- Each matrix is domain-optimized
- Cross-domain performance not guaranteed
- Future work: Investigate domain adaptation techniques
Fixed Dimensionality
- Locked to Bottleneck T5's 1024D space
- Potential future work: Dimension reduction studies
Linear Transformation Limitations
- Assumes linear relationship between content and topic spaces
- Future work: Explore non-linear transformations
Memory and Computation Requirements
- Transformation Matrix: 1024 × 1024 × 4 bytes ≈ 4MB per domain (float32)
- Inference Time: ~10ms on CPU (matrix multiplication)
- Total Model Size: ~12MB (all three domains)
- Base Model: ~770M parameters (loaded only during embedding creation)
Acknowledgments
Special thanks to:
- Linus Lee (@thesephist) for the Bottleneck T5 model
- The T5 team at Google Research
- Dataset providers:
  - @ankitagr01 for the ArXiv abstracts dataset
  - @knkarthick for the TopicSUM dataset
  - @nuvocare for the MSD Manual topics dataset
- Kaggle for providing free P100 GPU resources
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.