Shanvit committed
Commit 92f26b5 · verified · 1 Parent(s): 0785cd9

Update README.md

Files changed (1):
  1. README.md (+61, -46)
README.md CHANGED
@@ -1,73 +1,85 @@
- ---
- license: mit
- language:
- - en
- pipeline_tag: text-classification
- tags:
- - faqs
- - bitwise
- - educational
- - mlp
- - glove
- - vector-representation
- ---

  # AskBit FAQ Classifier

- A simple, interpretable machine learning model trained to classify frequently asked questions (FAQs) using bitwise vector representations. This is the **model artifact** from the [AskBit](https://github.com/Shanvit7/askbit) project.

- > 📚 This model was created as part of an educational exploration into the potential of bitwise vector representations for machine learning tasks.

- * 🔒 Uses binarized GloVe embeddings to create 300-bit vectors
- * 🧠 Trained using a small MLP (Multilayer Perceptron)
- * 💡 Fully open source and educational; not intended for production

  ---

  ## 📁 Files in This Repository

- | File                        | Description                                                |
- | --------------------------- | ---------------------------------------------------------- |
- | `askbit_faq_classifier.pkl` | Trained classifier using `MLPClassifier` from scikit-learn |
- | `requirements.txt`          | Python dependencies to load and use the model              |
- | `README.md`                 | You are here: model usage, context, and examples           |

  ---

  ## 🧠 How It Works

- ### Bit Vector Encoding (`BitEncoder`)

- * Takes pre-trained GloVe embeddings (300-d)
- * Converts each question-answer pair into a 300-dimensional vector
- * Binarizes it using `(avg_vec > 0).astype(int)`

- ### Neural Network Classifier (`FAQClassifier`)

- * Model: `sklearn.neural_network.MLPClassifier`
- * Hidden layer: `(32,)`
- * Learns to map bit-encoded vectors to answer indices (see the sketch below)
 
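For reference, a minimal sketch of the GloVe-plus-MLP pipeline described above. This is not the project's actual `BitEncoder`/`FAQClassifier` code, and the tiny random `glove` table only stands in for real pre-trained 300-d vectors:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in for a pre-trained 300-d GloVe table (normally loaded from a file).
rng = np.random.default_rng(0)
vocab = ["how", "do", "i", "reset", "my", "password", "what", "is", "askbit"]
glove = {w: rng.normal(size=300) for w in vocab}

def encode_bits(text):
    """Average the GloVe vectors of known tokens, then binarize with (avg > 0)."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    avg_vec = np.mean(vecs, axis=0) if vecs else np.zeros(300)
    return (avg_vec > 0).astype(int)  # 300-bit vector

faqs = ["how do i reset my password", "what is askbit"]
X = np.stack([encode_bits(q) for q in faqs])  # one bit vector per FAQ entry
y = np.arange(len(faqs))                      # answer index per entry

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict(encode_bits("reset my password").reshape(1, -1)))  # predicted answer index
```
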
  ---

  ## 🚀 Usage Example

- ```python
  import pickle
  import numpy as np

- # Load model
- with open("askbit_faq_classifier.pkl", "rb") as f:
      model = pickle.load(f)

- # Bit vector input (this should match training format)
- query_vec = np.array([1, 0, 1, 1, 0, ..., 0])  # 300 bits long

- # Predict
  answer = model.predict(query_vec)
  print("Predicted answer:", answer)
  ```
 
- > ⚠️ Note: You must generate the `query_vec` the same way it was done during training (see full AskBit project).

  ---

@@ -75,14 +87,17 @@ print("Predicted answer:", answer)

  Install dependencies with:

- ```bash
  pip install -r requirements.txt
  ```

- Main requirements:

- * `scikit-learn`
- * `numpy`

  ---
 
@@ -90,9 +105,9 @@ Main requirements:

  This model is part of the [AskBit project on GitHub](https://github.com/Shanvit7/askbit):

- * ✅ Full source code
- * ✅ CLI training + querying interface
- * ✅ Debugging + vector inspection tools

  ---
 
@@ -104,6 +119,6 @@ MIT License - free to use, modify, or contribute.

  ## 🤝 Contributing

- This model is designed for learning. Feel free to fork, improve, or build upon it!

  > Model trained and shared by [@Shanvit](https://huggingface.co/Shanvit)
 
+ ---
+
+ license: mit
+
+ language:
+ - en
+
+ pipeline_tag: text-classification
+
+ tags:
+ - faqs
+ - bitwise
+ - semantic-search
+ - knn
+ - sbert
+ - bit-vector
+ - binary-embedding
+
+ ---
+
  # AskBit FAQ Classifier

+ A fast, interpretable FAQ retriever that combines **bit-vector encoding of SBERT sentence embeddings** with a **binary KNN classifier**. This repository hosts a **model artifact** from the [AskBit](https://github.com/Shanvit7/askbit) project.

+ > 📚 This model was created as part of an educational journey exploring efficient semantic FAQ matching with bitwise vector representations and KNN classification.

+ * 🔒 Uses SBERT (`all-MiniLM-L6-v2`) to embed question-answer pairs as dense semantic vectors.
+ * 🧠 Converts dense embeddings into binarized bit vectors for fast similarity search.
+ * ⚑ Uses a K-Nearest Neighbors classifier with Hamming distance over bit vectors.
+ * 💡 Fully open source, efficient, and suitable for lightweight semantic FAQ retrieval.
+ * 🗂️ Model file: `model.pkl`
+ * 📄 Training data file: `faq.json`

  ---

  ## 📁 Files in This Repository

+ | File               | Description                                                    |
+ |--------------------|----------------------------------------------------------------|
+ | `model.pkl`        | Trained KNN classifier model over SBERT-based bit vectors.     |
+ | `faq.json`         | FAQ question-answer dataset used for training and evaluation.  |
+ | `requirements.txt` | Python dependencies to load and use the model.                 |
+ | `README.md`        | Model usage instructions, background, and examples.            |

  ---

  ## 🧠 How It Works

+ ### Semantic Bit Vector Encoding (`SbertBitEncoder`)

+ - Uses the **Sentence-BERT** model (`all-MiniLM-L6-v2`) to generate dense semantic embeddings of entire question-answer pairs.
+ - Embeddings capture **meaningful sentence-level semantics**, enabling retrieval that goes beyond simple word overlap.
+ - Each dense embedding vector is **binarized** by thresholding (e.g., bits set to 1 if the value is > 0) to produce a compact, fixed-length bit vector.
+ - Both the FAQ entries and queries are encoded this way, so semantic similarity maps to bitwise proximity (see the sketch below).
 
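A minimal sketch of the encoding step, assuming the `all-MiniLM-L6-v2` model and a simple greater-than-zero threshold as described above (the project's actual `SbertBitEncoder` may add its own preprocessing on top of this):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def to_bits(texts):
    """Embed texts with SBERT (384-d dense vectors), then binarize each dimension at 0."""
    dense = sbert.encode(texts)          # shape: (n, 384)
    return (dense > 0).astype(np.uint8)  # compact, fixed-length bit vectors

faq_bits = to_bits(["How do I reset my password? You can reset it from Settings."])
print(faq_bits.shape)  # (1, 384)
```
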
+ ### Binary K-Nearest Neighbors Classifier (`FAQClassifier`)

+ - Implements a KNN classifier that uses **Hamming distance** as the similarity metric on bit vectors.
+ - Matches bit-encoded queries against the stored FAQ bit vectors and returns the corresponding answers.
+ - Supports retrieving the best matching answer or the top-k candidates with similarity scores (see the sketch below).
 
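A minimal sketch of the retrieval step; here scikit-learn's `KNeighborsClassifier` with the Hamming metric stands in for the project's `FAQClassifier`, and the random bit vectors stand in for real encoded FAQ entries:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
faq_bits = rng.integers(0, 2, size=(5, 384))  # 5 FAQ entries, 384 bits each
answers = np.arange(5)                        # answer index per entry

knn = KNeighborsClassifier(n_neighbors=1, metric="hamming")
knn.fit(faq_bits, answers)

# A query whose bits mostly match entry 2 (a few bits flipped).
query_bits = faq_bits[2] ^ (rng.random(384) < 0.05)
print(knn.predict(query_bits.reshape(1, -1)))  # -> [2]

# Top-k candidates with their Hamming distances (smaller = more similar).
dist, idx = knn.kneighbors(query_bits.reshape(1, -1), n_neighbors=3)
print(idx[0], dist[0])
```
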
  ---

  ## 🚀 Usage Example

+ ```python
  import pickle
  import numpy as np

+ # Load the trained model artifact
+ with open("model.pkl", "rb") as f:
      model = pickle.load(f)

+ # Bit vector input: binarized SBERT embeddings (e.g., a 384-bit vector)
+ query_vec = np.array([1, 0, 1, 1, 0, ..., 0])  # must match the training bit-vector format

+ # Predict (get the best matching answer)
  answer = model.predict(query_vec)
  print("Predicted answer:", answer)
  ```
 
+ > ⚠️ Important: encode new queries with the same SBERT bit-vector encoder that was used at training time; otherwise results will be inconsistent. A minimal way to do this is sketched below.
 
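One way to build `query_vec` consistently, assuming the encoder is the `all-MiniLM-L6-v2` SBERT model binarized at zero as described above (the AskBit project may wrap this step in its own encoder class):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: same SBERT model and >0 threshold as at training time.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = (sbert.encode(["How do I reset my password?"])[0] > 0).astype(np.uint8)  # 384 bits
```
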
  ---

  Install dependencies with:

+ ```bash
  pip install -r requirements.txt
  ```

+ Main dependencies:

+ - `sentence-transformers`
+ - `scikit-learn`
+ - `numpy`
+ - `yake`
+ - `spacy` (for optional text preprocessing)
 
  ---
  This model is part of the [AskBit project on GitHub](https://github.com/Shanvit7/askbit):

+ - ✅ Full source code with CLI and training scripts
+ - ✅ Tools to debug and inspect bit vectors and retrieval results
+ - ✅ Lightweight, interpretable semantic FAQ search
 
  ---

  ## 🤝 Contributing
+ This model is intended for learning and experimentation. Feel free to fork, improve, or build upon it!
  > Model trained and shared by [@Shanvit](https://huggingface.co/Shanvit)