semantic-search-rDepression
This repository contains my personal implementation of a semantic search system using FAISS, trained on a Kaggle dataset of approximately 32,000 Reddit posts from r/depression, collected over a three-month period in 2019. While many of these posts are now deleted, the dataset was ideal for local training on my RTX 3080 (10 GB VRAM).
What does the model do?
The idea is simple: you type out what you're feeling, whether it's to vent or to make sense of what's going on internally and instead of posting it publicly, the model tries to find similar posts from the dataset.
Sometimes, just seeing that someone else has felt the same way can make a difference. The model returns the top 3 most semantically similar posts, along with their original Reddit URLs. If the post still exists, you can visit it and read the discussions and comments of others who've been through something similar.
Example Output
Query:
i cant eat
Top 3 Similar Posts:
Similar Post 1 (Similarity: 0.9772)
Text: why is it so hard to eat? i literally can't get myself to do it i'm hungry i'm not trying to lose weight or anything i don't have ED but i just can't get the motivation to eat even if it's right in fr...
URL: https://www.reddit.com/r/depression/comments/kjsglw/why_is_it_so_hard_to_eat/
Similar Post 2 (Similarity: 0.9146)
Text: No-one to talk to and how am I supposed to eat? I'm starving, I have money and food but I can't stomach it and i'm feeling weak and sick lately from not eating. What am I supposed to do? I'll snack a bi...
URL: https://www.reddit.com/r/depression/comments/2efn4z/noone_to_talk_to_and_how_am_i_suppose_to_eat/
Similar Post 3 (Similarity: 0.8786)
Text: I won't eat and I refuse I recently lost all my friends just to a fight and people just use me as a doormat i haven't ate in 1 week and drank a little bit of juice all together. I can't walk around my ...
URL: https://www.reddit.com/r/depression/comments/rgik5/i_wont_eat_and_i_refuse/
Clicking on a URL will take you to the full discussion on Reddit, provided the post hasn't been deleted.
Training Setup & Hyperparameters
- Loss Function:
PyTorch built-in TripletMarginLoss
(margin =1.0
) - Optimizer:
AdamW
(learning rate =2e-5
) - Scheduler:
Cosine
with 5% warmup - Epochs: 3
- Effective Batch Size: 32 (8 per step with 4-step accumulation)
- Monitoring:
eval_steps = 500
Results
Final Epoch (Epoch 3):
- Train Loss: 0.4466 | Train Accuracy: 95.00%
- Val Loss: 0.4358 | Val Accuracy: 94.59%
Test Set Evaluation:
Test Accuracy: 99.07%
Mean Positive Similarity: 0.7044
Mean Negative Similarity: 0.4952
Mean Margin: 0.2092
Std Dev Margin: 0.0795
Correct: 956/965
Incorrect: 9
Challenging Cases (margin < 0.1): 90/965 (9.33%)
MRR: 0.9855
Recall@5: 1.0000
Note: Accuracy is very high because it's very easy to label positive data, but in reality similarity score is what is important. Plus the test dataset is quite small only 3% (~930 posts)
Full Project (Training,Evaluation)
Feedback, corrections, and suggestions are always welcome!
- Downloads last month
- 18
Model tree for waellejmi/all-MiniLM-L6-v2-finetuned-on-rDepression
Base model
sentence-transformers/all-MiniLM-L6-v2