Supercharge Your Semantic Search with embs

Published January 27, 2025

In an era where data is growing exponentially, the ability to retrieve and rank relevant information efficiently has become critical. Whether you’re building a semantic search engine or enabling context-aware chatbots, having a robust document processing pipeline is essential. This is where embs comes in.

embs is a lightweight Python toolkit designed to simplify the workflow of retrieving, splitting, embedding, and ranking text. Powered by two free APIs, Docsifer for document conversion and the Lightweight Embeddings API for embeddings, it lets you achieve strong results with minimal configuration.

Use Case: Semantic Search with Custom Models

Imagine building a semantic search engine for querying large documents like PDFs or web pages. To achieve the best results, you might need splitting, embedding with custom models, and ranking by relevance—all of which are made simple with embs.

Supported Embedding Models

embs supports the following state-of-the-art text embedding models through the Lightweight Embeddings API:

  • snowflake-arctic-embed-l-v2.0: General-purpose multilingual model optimized for semantic similarity tasks.
  • bge-m3: A robust model for large-scale semantic search.
  • gte-multilingual-base: Designed for multilingual understanding.
  • paraphrase-multilingual-MiniLM-L12-v2: Lightweight, fast, and great for paraphrase detection.
  • paraphrase-multilingual-mpnet-base-v2: Excellent for multilingual text similarity tasks.
  • multilingual-e5-small, multilingual-e5-base, multilingual-e5-large: High-quality embeddings for semantic similarity, available in various sizes for trade-offs between speed and accuracy.

Example: Search Documents with Model Selection

Here’s how you can use embs to retrieve, split, embed, and rank documents while specifying a custom embedding model.

import asyncio
from functools import partial

from embs import Embs

async def main():
    # Configure the Markdown splitter
    split_config = {
        "headers_to_split_on": [("#", "h1"), ("##", "h2")],  # Split by Markdown headers
        "return_each_line": False,  # Group content under each header
        "strip_headers": True       # Remove headers from chunks
    }
    md_splitter = partial(Embs.markdown_splitter, config=split_config)

    # Initialize the Embs client
    client = Embs()

    # Step 1: Retrieve documents from a file and a URL, with splitting
    docs = await client.retrieve_documents_async(
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter  # Optional splitter for better granularity
    )
    print(f"Total chunks after splitting: {len(docs)}")

    # Step 2: Rank the document chunks by a query, specifying a custom model
    query = "What is quantum computing?"
    ranked_chunks = await client.search_documents_async(
        query=query,
        files=["/path/to/sample.pdf"],
        urls=["https://example.com"],
        splitter=md_splitter,               # Use the splitter
        model="multilingual-e5-base"        # Specify the embedding model
    )

    # Step 3: Print the top results
    for chunk in ranked_chunks[:3]:
        print(f"File: {chunk['filename']} | Score: {chunk['probability']:.4f}")
        print(f"Snippet: {chunk['markdown'][:100]}...")

asyncio.run(main())

Why Custom Models Matter

Different use cases require different embedding models:

  • snowflake-arctic-embed-l-v2.0 is a strong general-purpose multilingual model.
  • bge-m3 excels in large-scale document search, making it ideal for enterprise search engines.
  • multilingual-e5-large offers high accuracy for complex semantic tasks, but for lightweight needs, e5-small or e5-base might be better.
  • paraphrase-multilingual-* models are excellent for tasks like chatbot Q&A or paraphrase detection.

By explicitly selecting a model, you can fine-tune your pipeline for precision, speed, or multilingual capability.
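
As an illustration, here is a minimal sketch of routing different tasks to different models from the list above. The task names and the mapping itself are assumptions made for this example (embs defines no such table); the search_documents call mirrors the quick-start snippet later in this article.

from embs import Embs

# Illustrative task-to-model table. The task names and pairings are
# assumptions for demonstration purposes, not part of embs itself.
MODEL_BY_TASK = {
    "general_search": "snowflake-arctic-embed-l-v2.0",
    "large_scale_search": "bge-m3",
    "paraphrase_qa": "paraphrase-multilingual-mpnet-base-v2",
    "fast_multilingual": "multilingual-e5-small",
}

client = Embs()
results = client.search_documents(
    query="What is quantum computing?",
    files=["/path/to/sample.pdf"],
    model=MODEL_BY_TASK["fast_multilingual"],  # swap models without changing the pipeline
)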

Helper Functions and Features

Markdown Splitting (markdown_splitter)

The markdown_splitter function makes it easy to break large documents into manageable chunks based on Markdown headers. This is critical for improving relevance in search workflows.

Example:

from functools import partial

from embs import Embs

split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2")],
    "strip_headers": True,
    "return_each_line": False
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

Why Split Documents?

  • Improves relevance by focusing on smaller, meaningful sections.
  • Enables better ranking when combined with embeddings.
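
To see what the splitter produces, you can call the configured splitter directly. The sketch below assumes the splitter callable accepts a document dict with "filename" and "markdown" keys and returns a list of chunk dicts with the same fields; that shape is inferred from the fields used elsewhere in this article, not confirmed by the embs docs.

from functools import partial

from embs import Embs

split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2")],
    "strip_headers": True,
    "return_each_line": False
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

# Assumed document shape: the "filename"/"markdown" keys mirror the
# fields that ranked results expose elsewhere in this article.
doc = {
    "filename": "notes.md",
    "markdown": "# Intro\nQuantum computing basics.\n## Qubits\nA qubit is a two-state system.",
}

for chunk in md_splitter(doc):
    print(chunk)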

Embedding Text with Custom Models

The embed method generates embeddings for any text or list of texts. You can specify one of the supported models to align with your specific use case.

Example:

from embs import Embs

client = Embs()
embedding = client.embed(
    text_or_texts="What is quantum computing?",
    model="paraphrase-multilingual-mpnet-base-v2"
)
print(embedding)
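
Once you have embeddings, similarity scoring is plain vector math. Here is a minimal sketch of cosine similarity between two texts, assuming embed returns one numeric vector per input text; the article does not show the exact response shape, so treat the unpacking below as an assumption.

import math

from embs import Embs

def cosine(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

client = Embs()
# Assumption: a list input yields a list of float vectors, one per text.
vec_query, vec_doc = client.embed(
    text_or_texts=["What is quantum computing?", "Qubits store quantum information."],
    model="paraphrase-multilingual-mpnet-base-v2"
)
print(f"Cosine similarity: {cosine(vec_query, vec_doc):.4f}")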

Caching for Scalability

Enable caching to handle large-scale retrieval workloads efficiently. Use memory caching for fast lookups or disk caching for persistent storage.

Example:

from embs import Embs

cache_config = {
    "enabled": True,
    "type": "memory",        # Memory-based LRU caching
    "max_mem_items": 100,    # Cache up to 100 items
    "max_ttl_seconds": 3600  # Expire items after 1 hour
}
client = Embs(cache_config=cache_config)
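
A quick way to sanity-check the cache is to time the same call twice: with caching enabled, the second (warm) call should be served from the cache and return noticeably faster. This sketch assumes embedding results are among the items embs caches.

import time

from embs import Embs

client = Embs(cache_config={
    "enabled": True,
    "type": "memory",
    "max_mem_items": 100,
    "max_ttl_seconds": 3600
})

for attempt in ("cold", "warm"):
    start = time.perf_counter()
    client.embed(text_or_texts="What is quantum computing?", model="multilingual-e5-small")
    # The warm call should be noticeably faster if the result was cached.
    print(f"{attempt} call: {time.perf_counter() - start:.3f}s")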

Why Use embs for Semantic Search?

  1. State-of-the-Art Models: Supports multilingual and task-specific embeddings for precision across diverse use cases.
  2. Unified Workflow: Retrieve, split, embed, and rank—all in one API.
  3. Free API Access: Both Docsifer and Lightweight Embeddings are free to use, reducing infrastructure costs.
  4. Customizable: Easily switch models or add splitters to optimize performance for your specific requirements.

Get Started Today

Install embs with:

pip install embs

Here’s a quick-start snippet to see it in action:

from embs import Embs

client = Embs()
results = client.search_documents(
    query="What is quantum computing?",
    files=["/path/to/quantum.pdf"],
    model="multilingual-e5-base"  # Specify the embedding model
)

for doc in results[:3]:
    print(f"{doc['filename']} | {doc['probability']:.4f} | {doc['markdown'][:80]}...")

🌟 Build smarter semantic search and RAG systems with embs today! 🚀
