Supercharge Your Semantic Search with embs
In an era where data is growing exponentially, the ability to retrieve and rank relevant information efficiently has become critical. Whether you’re building a semantic search engine or enabling context-aware chatbots, having a robust document processing pipeline is essential. This is where embs comes in.
embs is a lightweight Python toolkit designed to simplify the workflow of retrieving, splitting, embedding, and ranking text. Powered by free APIs—Docsifer for document conversion and the Lightweight Embeddings API for embeddings—it lets you achieve exceptional results with minimal configuration.
Use Case: Semantic Search with Custom Models
Imagine building a semantic search engine for querying large documents like PDFs or web pages. To achieve the best results, you might need splitting, embedding with custom models, and ranking by relevance—all of which are made simple with embs.
Supported Embedding Models
embs supports the following state-of-the-art text embedding models through the Lightweight Embeddings API:
- snowflake-arctic-embed-l-v2.0: General-purpose multilingual model optimized for semantic similarity tasks.
- bge-m3: A robust model for large-scale semantic search.
- gte-multilingual-base: Designed for multilingual understanding.
- paraphrase-multilingual-MiniLM-L12-v2: Lightweight, fast, and great for paraphrase detection.
- paraphrase-multilingual-mpnet-base-v2: Excellent for multilingual text similarity tasks.
- multilingual-e5-small, multilingual-e5-base, multilingual-e5-large: High-quality embeddings for semantic similarity, available in several sizes to trade off speed against accuracy.
Example: Search Documents with Model Selection
Here’s how you can use embs to retrieve, split, embed, and rank documents while specifying a custom embedding model.
import asyncio
from functools import partial

from embs import Embs

async def main():
    # Configure the Markdown splitter
    split_config = {
        "headers_to_split_on": [("#", "h1"), ("##", "h2")],  # Split by Markdown headers
        "return_each_line": False,                            # Group content under each header
        "strip_headers": True                                 # Remove headers from chunks
    }
    md_splitter = partial(Embs.markdown_splitter, config=split_config)

    # Initialize the Embs client
    client = Embs()

    # Step 1: Retrieve documents from a file and a URL, with splitting
    docs = await client.retrieve_documents_async(
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter  # Optional splitter for better granularity
    )
    print(f"Total chunks after splitting: {len(docs)}")

    # Step 2: Rank the document chunks by a query, specifying a custom model
    query = "What is quantum computing?"
    ranked_chunks = await client.search_documents_async(
        query=query,
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter,           # Use the splitter
        model="multilingual-e5-base"    # Specify the embedding model
    )

    # Step 3: Print the top results
    for chunk in ranked_chunks[:3]:
        print(f"File: {chunk['filename']} | Score: {chunk['probability']:.4f}")
        print(f"Snippet: {chunk['markdown'][:100]}...")

asyncio.run(main())
Why Do Custom Models Matter?
Different use cases require different embedding models:
- snowflake-arctic-embed-l-v2.0 is a strong general-purpose multilingual model.
- bge-m3 excels in large-scale document search, making it ideal for enterprise search engines.
- multilingual-e5-large offers high accuracy for complex semantic tasks, but for lightweight needs, multilingual-e5-small or multilingual-e5-base might be better.
- The paraphrase-multilingual-* models are excellent for tasks like chatbot Q&A or paraphrase detection.
By explicitly selecting a model, you can fine-tune your pipeline for precision, speed, or multilingual capability.
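For example, a quick way to see the impact of model choice is to rank the same query with two different models and compare the top scores. This is only a minimal sketch; the file path is a placeholder, and it reuses the synchronous search_documents call shown in the quick-start below.

from embs import Embs

client = Embs()

# Rank the same query with two supported models and compare the top result.
# "/path/to/sample.pdf" is a placeholder; point it at a real document.
for model_name in ["multilingual-e5-small", "multilingual-e5-large"]:
    results = client.search_documents(
        query="What is quantum computing?",
        files=["/path/to/sample.pdf"],
        model=model_name
    )
    top = results[0]
    print(f"{model_name}: score {top['probability']:.4f} | {top['markdown'][:60]}...")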
Helper Functions and Features
Markdown Splitting (markdown_splitter)
The markdown_splitter function makes it easy to break large documents into manageable chunks based on Markdown headers. This is critical for improving relevance in search workflows.
Example:
from functools import partial
from embs import Embs

split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2")],
    "strip_headers": True,
    "return_each_line": False
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)
Why Split Documents?
- Improves relevance by focusing on smaller, meaningful sections.
- Enables better ranking when combined with embeddings.
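To gauge the effect of splitting, you can retrieve the same source with and without the splitter and compare the chunk counts. Below is a minimal sketch reusing the async retrieval call from the earlier example; the URL is only a placeholder.

import asyncio
from functools import partial
from embs import Embs

async def compare_chunking():
    md_splitter = partial(Embs.markdown_splitter, config={
        "headers_to_split_on": [("#", "h1"), ("##", "h2")],
        "strip_headers": True,
        "return_each_line": False
    })
    client = Embs()

    # Retrieve the same page once as whole documents and once split by headers.
    whole = await client.retrieve_documents_async(urls=["https://example.com"])
    chunks = await client.retrieve_documents_async(
        urls=["https://example.com"],
        splitter=md_splitter
    )
    print(f"Unsplit documents: {len(whole)} | Header-based chunks: {len(chunks)}")

asyncio.run(compare_chunking())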
Embedding Text with Custom Models
The embed method generates embeddings for any text or list of texts. You can specify one of the supported models to align with your specific use case.
Example:
from embs import Embs

client = Embs()
embedding = client.embed(
    text_or_texts="What is quantum computing?",
    model="paraphrase-multilingual-mpnet-base-v2"
)
print(embedding)
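Because embed also accepts a list of texts, you can compare passages directly. The snippet below is a sketch rather than a guaranteed contract: it assumes the call returns one vector per input text, so adjust the unpacking to the actual response shape.

import numpy as np
from embs import Embs

client = Embs()

# Embed two texts in a single call. The indexing below assumes a list of
# vectors is returned (an assumption); adapt it to the real response shape.
vectors = client.embed(
    text_or_texts=["What is quantum computing?", "Explain qubits simply."],
    model="paraphrase-multilingual-mpnet-base-v2"
)
a = np.asarray(vectors[0], dtype=float)
b = np.asarray(vectors[1], dtype=float)

# Cosine similarity between the two embeddings
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {similarity:.4f}")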
Caching for Scalability
Enable caching to handle large-scale retrieval workloads efficiently. Use memory caching for fast lookups or disk caching for persistent storage.
Example:
from embs import Embs

cache_config = {
    "enabled": True,
    "type": "memory",          # Memory-based LRU caching
    "max_mem_items": 100,      # Cache up to 100 items
    "max_ttl_seconds": 3600    # Expire items after 1 hour
}
client = Embs(cache_config=cache_config)
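Once caching is enabled, repeated calls with the same inputs can be answered from the cache instead of re-fetching and re-embedding everything. A rough sketch of that pattern (actual timings depend on your documents and network; the file path is a placeholder):

import time
from embs import Embs

client = Embs(cache_config={
    "enabled": True,
    "type": "memory",
    "max_mem_items": 100,
    "max_ttl_seconds": 3600
})

# The first call does the full retrieve/embed/rank work; an identical second
# call can be served from the in-memory cache.
for attempt in ("cold", "warm"):
    start = time.perf_counter()
    results = client.search_documents(
        query="What is quantum computing?",
        files=["/path/to/sample.pdf"]
    )
    print(f"{attempt} call: {len(results)} chunks in {time.perf_counter() - start:.2f}s")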
Why Use embs for Semantic Search?
- State-of-the-Art Models: Supports multilingual and task-specific embeddings for precision across diverse use cases.
- Unified Workflow: Retrieve, split, embed, and rank—all in one API.
- Free API Access: Both Docsifer and Lightweight Embeddings are free to use, reducing infrastructure costs.
- Customizable: Easily switch models or add splitters to optimize performance for your specific requirements.
Get Started Today
Install embs with:
pip install embs
Here’s a quick-start snippet to see it in action:
from embs import Embs

client = Embs()
results = client.search_documents(
    query="What is quantum computing?",
    files=["/path/to/quantum.pdf"],
    model="multilingual-e5-base"  # Specify the embedding model
)

for doc in results[:3]:
    print(f"{doc['filename']} | {doc['probability']:.4f} | {doc['markdown'][:80]}...")
🌟 Build smarter semantic search and RAG systems with embs today! 🚀