AskBit FAQ Retriever

A fast, interpretable FAQ retriever using bit vector encoding of SBERT sentence embeddings combined with a binary KNN classifier. This repository hosts a model artifact from the AskBit project.

📚 This model was created as part of an educational journey exploring efficient semantic FAQ matching with bitwise vector representations and KNN classification.

🔢 Uses SBERT (all-MiniLM-L6-v2) to embed question-answer pairs as dense semantic vectors.
🧠 Converts dense embeddings into binarized bit vectors for fast similarity search.
⚡ Uses a K-Nearest Neighbors classifier with Hamming distance over bit vectors.
💡 Fully open source, efficient, and suitable for lightweight semantic FAQ retrieval.
🗂️ Model file: model.pkl
📄 Training data file: faq.json

📁 Files in This Repository

| File | Description |
|------|-------------|
| `model.pkl` | Trained KNN classifier model over SBERT-based bit vectors. |
| `faq.json` | FAQ question-answer dataset used for training and evaluation. |
| `requirements.txt` | Python dependencies to load and use the model. |
| `README.md` | Model usage instructions, background, and examples. |

🧠 How It Works

Semantic Bit Vector Encoding (`SbertBitEncoder`)

Uses the Sentence-BERT model (all-MiniLM-L6-v2) to generate dense semantic embeddings of entire question-answer pairs.
Embeddings capture meaningful sentence-level semantics, enabling effective retrieval beyond simple word overlap.
Each dense embedding vector is binarized by thresholding (e.g., bits set to 1 if value > 0) to produce a compact, fixed-length bit vector.
Both the FAQ entries and queries are encoded this way, so semantically similar texts tend to end up a small Hamming distance apart.
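
The thresholding step can be illustrated with a minimal sketch. This is not the actual `SbertBitEncoder` code; the function names `binarize` and `pack_bits` and the 6-dimensional toy vector are assumptions made for illustration, and the real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
from typing import List, Sequence


def binarize(embedding: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Map each dimension to 1 if it is strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in embedding]


def pack_bits(bits: Sequence[int]) -> int:
    """Pack a bit list into a single Python int (first bit is most significant)."""
    packed = 0
    for bit in bits:
        packed = (packed << 1) | bit
    return packed


# Toy stand-in for a dense embedding (made-up numbers, not real SBERT output).
faq_embedding = [0.42, -0.10, 0.05, -0.77, 0.31, -0.02]

faq_bits = binarize(faq_embedding)
print(faq_bits)             # [1, 0, 1, 0, 1, 0]
print(pack_bits(faq_bits))  # 42, i.e. 0b101010
```

With one bit per dimension, a 384-dimensional embedding fits in 48 bytes once packed. In practice the dense vector would come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(text)` rather than a hand-written list.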

Binary K-Nearest Neighbors Classifier (`FAQClassifier`)

Implements a KNN classifier that uses Hamming distance as the distance metric on bit vectors.
Associates each bit-encoded query with the answers of its nearest stored FAQ entries.
Supports retrieving the best matching answer or top-k candidates with similarity scores.
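
The retrieval idea can be sketched in a few lines of plain Python. This is not the actual `FAQClassifier` implementation; `hamming`, `top_k`, the toy 8-bit vectors, the sample answers, and the similarity formula `1 - distance / n_bits` are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def top_k(
    query_bits: Sequence[int],
    faq_bits: Sequence[Sequence[int]],
    answers: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, int, float]]:
    """Return up to k (answer, distance, similarity) tuples, nearest first."""
    n_bits = len(query_bits)
    scored = []
    for bits, answer in zip(faq_bits, answers):
        distance = hamming(query_bits, bits)
        similarity = 1.0 - distance / n_bits  # assumed scoring, not from the project
        scored.append((answer, distance, similarity))
    scored.sort(key=lambda item: item[1])  # smallest Hamming distance first
    return scored[:k]


# Toy data: 8-bit vectors instead of the real 384-bit ones.
faq_bits = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1],
]
answers = ["Answer A", "Answer B", "Answer C"]
query = [1, 0, 1, 1, 0, 0, 1, 1]

results = top_k(query, faq_bits, answers, k=2)
print(results)  # [('Answer A', 1, 0.875), ('Answer C', 2, 0.75)]
```

Taking the first element of the result gives the best match; a larger `k` returns more candidates to inspect.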

🚀 Usage Example

import pickle
import numpy as np

# Load the trained model artifact
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

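# Bit vector input: a binarized SBERT embedding (e.g., a 384-bit vector).
# The all-zero vector below is only a placeholder; replace it with an encoded
# query that matches the training bit vector format.
query_vec = np.zeros(384, dtype=int)

# Predict (get best matching answer).
# If the pickled object is a plain scikit-learn estimator, it expects a 2-D
# array, so pass query_vec.reshape(1, -1) instead.
answer = model.predict(query_vec)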
print("Predicted answer:", answer)

⚠️ Important: Ensure you encode new queries with the same SBERT bit-vector encoder used at training for consistent results.
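
A cheap format check before calling `predict` can catch some mismatches early. The sketch below is an assumption for illustration, not part of the AskBit code: the helper name `validate_bit_vector` and the constant `EXPECTED_BITS` are hypothetical, and the default of 384 assumes one bit per dimension of all-MiniLM-L6-v2.

```python
from typing import List, Sequence

EXPECTED_BITS = 384  # assumed: one bit per all-MiniLM-L6-v2 embedding dimension


def validate_bit_vector(
    vec: Sequence[int], expected_bits: int = EXPECTED_BITS
) -> List[int]:
    """Return vec as a list of ints, or raise ValueError if it is malformed."""
    if len(vec) != expected_bits:
        raise ValueError(f"expected {expected_bits} bits, got {len(vec)}")
    if any(value not in (0, 1) for value in vec):
        raise ValueError("bit vector may only contain 0 and 1")
    return [int(value) for value in vec]


query_bits = validate_bit_vector([1, 0] * 192)
print(len(query_bits))  # 384
```

This only checks length and values. It cannot tell whether the bits came from the same encoder and threshold that were used at training time, so that still has to be ensured separately.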

📦 Dependencies

Install dependencies with:

pip install -r requirements.txt

Main dependencies:

sentence-transformers
scikit-learn
numpy
yake
spacy (for optional text preprocessing)

📚 Related Project

This model is part of the AskBit project on GitHub:

✅ Full source code with CLI and training scripts
✅ Debug and inspect bit vectors and retrieval results
✅ Lightweight, interpretable semantic FAQ search

📜 License

MIT License: free to use, modify, and redistribute.

🤝 Contributing

This model is intended for learning and experimentation. Feel free to fork, improve, or build upon it!

Model trained and shared by @Shanvit