Shanvit committed
Commit 92f26b5 · verified · 1 Parent(s): 0785cd9

Update README.md

Files changed (1):
  1. README.md (+61, -46)
README.md CHANGED
@@ -1,73 +1,85 @@
- ---
- license: mit
- language:
- - en
- pipeline_tag: text-classification
- tags:
- - faqs
- - bitwise
- - educational
- - mlp
- - glove
- - vector-representation
- ---

  # AskBit FAQ Classifier

- A simple, interpretable machine learning model trained to classify frequently asked questions (FAQs) using bitwise vector representations. This is the **model artifact** from the [AskBit](https://github.com/Shanvit7/askbit) project.

- > 📚 This model was created as part of an educational exploration into the potential of bitwise vector representations for machine learning tasks.

- * 🔒 Uses binarized GloVe embeddings to create 300-bit vectors
- * 🧠 Trained using a small MLP (Multilayer Perceptron)
- * 💡 Fully open source and educational; not intended for production

  ---

  ## 📁 Files in This Repository

- | File                        | Description                                                |
- | --------------------------- | ---------------------------------------------------------- |
- | `askbit_faq_classifier.pkl` | Trained classifier using `MLPClassifier` from scikit-learn |
- | `requirements.txt`          | Python dependencies to load and use the model              |
- | `README.md`                 | You are here: model usage, context, and examples           |

  ---

  ## 🧠 How It Works

- ### Bit Vector Encoding (`BitEncoder`)

- * Takes pre-trained GloVe embeddings (300-d)
- * Converts each question-answer pair into a 300-dimensional vector
- * Binarizes it using `(avg_vec > 0).astype(int)`

- ### Neural Network Classifier (`FAQClassifier`)

- * Model: `sklearn.neural_network.MLPClassifier`
- * Hidden layer: `(32,)`
- * Learns to map bit-encoded vectors to answer indices (see the sketch below)
 
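For reference, a minimal sketch of the GloVe-plus-MLP pipeline described above. This is not the project's actual `BitEncoder`/`FAQClassifier` code, and the tiny random `glove` table only stands in for real pre-trained 300-d vectors:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for a pre-trained 300-d GloVe table (normally loaded from a file).
rng = np.random.default_rng(0)
vocab = ["how", "do", "i", "reset", "my", "password", "what", "is", "askbit"]
glove = {w: rng.normal(size=300) for w in vocab}

def encode_bits(text):
    """Average the GloVe vectors of known tokens, then binarize with (avg > 0)."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    avg_vec = np.mean(vecs, axis=0) if vecs else np.zeros(300)
    return (avg_vec > 0).astype(int)  # 300-bit vector

faqs = ["how do i reset my password", "what is askbit"]
X = np.stack([encode_bits(q) for q in faqs])  # one bit vector per FAQ entry
y = np.arange(len(faqs))                      # answer index per entry

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict(encode_bits("reset my password").reshape(1, -1)))  # predicted answer index
```
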
  ---

  ## 🚀 Usage Example

- ```python
  import pickle
  import numpy as np

- # Load model
- with open("askbit_faq_classifier.pkl", "rb") as f:
      model = pickle.load(f)

- # Bit vector input (this should match training format)
- query_vec = np.array([1, 0, 1, 1, 0, ..., 0])  # 300 bits long

- # Predict
  answer = model.predict(query_vec)
  print("Predicted answer:", answer)
  ```
 
- > ⚠️ Note: You must generate the `query_vec` the same way it was done during training (see full AskBit project).

  ---

@@ -75,14 +87,17 @@ print("Predicted answer:", answer)

  Install dependencies with:

- ```bash
  pip install -r requirements.txt
  ```

- Main requirements:

- * `scikit-learn`
- * `numpy`

  ---
 
@@ -90,9 +105,9 @@ Main requirements:

  This model is part of the [AskBit project on GitHub](https://github.com/Shanvit7/askbit):

- * ✅ Full source code
- * ✅ CLI training + querying interface
- * ✅ Debugging + vector inspection tools

  ---
 
@@ -104,6 +119,6 @@ MIT License - free to use, modify, or contribute.

  ## 🤝 Contributing

- This model is designed for learning. Feel free to fork, improve, or build upon it!

  > Model trained and shared by [@Shanvit](https://huggingface.co/Shanvit)
 
+ ---
+
+ license: mit
+
+ language:
+ - en
+
+ pipeline_tag: text-classification
+
+ tags:
+ - faqs
+ - bitwise
+ - semantic-search
+ - knn
+ - sbert
+ - bit-vector
+ - binary-embedding
+
+ ---
+
  # AskBit FAQ Classifier

+ A fast, interpretable FAQ retriever that combines **bit-vector encoding of SBERT sentence embeddings** with a **binary KNN classifier**. This repository hosts a **model artifact** from the [AskBit](https://github.com/Shanvit7/askbit) project.

+ > 📚 This model was created as part of an educational journey exploring efficient semantic FAQ matching with bitwise vector representations and KNN classification.

+ * 🔒 Uses SBERT (`all-MiniLM-L6-v2`) to embed question-answer pairs as dense semantic vectors.
+ * 🧠 Converts dense embeddings into binarized bit vectors for fast similarity search.
+ * ⚑ Uses a K-Nearest Neighbors classifier with Hamming distance over bit vectors.
+ * 💡 Fully open source, efficient, and suitable for lightweight semantic FAQ retrieval.
+ * 🗂️ Model file: `model.pkl`
+ * 📄 Training data file: `faq.json`

  ---

  ## 📁 Files in This Repository

+ | File               | Description                                                    |
+ |--------------------|----------------------------------------------------------------|
+ | `model.pkl`        | Trained KNN classifier model over SBERT-based bit vectors.     |
+ | `faq.json`         | FAQ question-answer dataset used for training and evaluation.  |
+ | `requirements.txt` | Python dependencies to load and use the model.                 |
+ | `README.md`        | Model usage instructions, background, and examples.            |

  ---

  ## 🧠 How It Works

+ ### Semantic Bit Vector Encoding (`SbertBitEncoder`)

+ - Uses the **Sentence-BERT** model (`all-MiniLM-L6-v2`) to generate dense semantic embeddings of entire question-answer pairs.
+ - Embeddings capture **meaningful sentence-level semantics**, enabling retrieval that goes beyond simple word overlap.
+ - Each dense embedding vector is **binarized** by thresholding (e.g., bits set to 1 if the value is > 0) to produce a compact, fixed-length bit vector.
+ - Both the FAQ entries and queries are encoded this way, so semantic similarity maps to bitwise proximity (see the sketch below).
 
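A minimal sketch of the encoding step, assuming the `all-MiniLM-L6-v2` model and a simple greater-than-zero threshold as described above (the project's actual `SbertBitEncoder` may add its own preprocessing on top of this):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def to_bits(texts):
    """Embed texts with SBERT (384-d dense vectors), then binarize each dimension at 0."""
    dense = sbert.encode(texts)          # shape: (n, 384)
    return (dense > 0).astype(np.uint8)  # compact, fixed-length bit vectors

faq_bits = to_bits(["How do I reset my password? You can reset it from Settings."])
print(faq_bits.shape)  # (1, 384)
```
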
+ ### Binary K-Nearest Neighbors Classifier (`FAQClassifier`)

+ - Implements a KNN classifier that uses **Hamming distance** as the similarity metric on bit vectors.
+ - Matches bit-encoded queries against the stored FAQ bit vectors and returns the corresponding answers.
+ - Supports retrieving the best matching answer or the top-k candidates with similarity scores (see the sketch below).
 
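A minimal sketch of the retrieval step; here scikit-learn's `KNeighborsClassifier` with the Hamming metric stands in for the project's `FAQClassifier`, and the random bit vectors stand in for real encoded FAQ entries:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
faq_bits = rng.integers(0, 2, size=(5, 384))  # 5 FAQ entries, 384 bits each
answers = np.arange(5)                        # answer index per entry

knn = KNeighborsClassifier(n_neighbors=1, metric="hamming")
knn.fit(faq_bits, answers)

# A query whose bits mostly match entry 2 (a few bits flipped).
query_bits = faq_bits[2] ^ (rng.random(384) < 0.05)
print(knn.predict(query_bits.reshape(1, -1)))  # -> [2]

# Top-k candidates with their Hamming distances (smaller = more similar).
dist, idx = knn.kneighbors(query_bits.reshape(1, -1), n_neighbors=3)
print(idx[0], dist[0])
```
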
  ---

  ## 🚀 Usage Example

+ ```python
  import pickle
  import numpy as np

+ # Load the trained model artifact
+ with open("model.pkl", "rb") as f:
      model = pickle.load(f)

+ # Bit vector input: binarized SBERT embeddings (e.g., a 384-bit vector)
+ query_vec = np.array([1, 0, 1, 1, 0, ..., 0])  # must match the training bit-vector format

+ # Predict (get the best matching answer)
  answer = model.predict(query_vec)
  print("Predicted answer:", answer)
  ```
 
+ > ⚠️ Important: encode new queries with the same SBERT bit-vector encoder that was used at training time; otherwise results will be inconsistent. A minimal way to do this is sketched below.
 
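One way to build `query_vec` consistently, assuming the encoder is the `all-MiniLM-L6-v2` SBERT model binarized at zero as described above (the AskBit project may wrap this step in its own encoder class):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: same SBERT model and >0 threshold as at training time.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = (sbert.encode(["How do I reset my password?"])[0] > 0).astype(np.uint8)  # 384 bits
```
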
  ---

  Install dependencies with:

+ ```bash
  pip install -r requirements.txt
  ```

+ Main dependencies:

+ - `sentence-transformers`
+ - `scikit-learn`
+ - `numpy`
+ - `yake`
+ - `spacy` (for optional text preprocessing)
 
  ---
  This model is part of the [AskBit project on GitHub](https://github.com/Shanvit7/askbit):

+ - ✅ Full source code with CLI and training scripts
+ - ✅ Tools to debug and inspect bit vectors and retrieval results
+ - ✅ Lightweight, interpretable semantic FAQ search
 
  ---

  ## 🤝 Contributing
+ This model is intended for learning and experimentation. Feel free to fork, improve, or build upon it!
  > Model trained and shared by [@Shanvit](https://huggingface.co/Shanvit)