Fake Job Posting Detection
A machine learning project that fine-tunes DeBERTa-v3-base to detect fraudulent job postings using natural language processing.
Project Overview
This project implements a complete pipeline for detecting fake job postings:
- Data Processing: Comprehensive data cleaning, feature engineering, and class balancing
- Model Training: Fine-tuned DeBERTa-v3-base transformer model
- API Development: REST API for model inference (implemented)
- UI Development: User interface for job posting analysis (implemented)
Features
- Advanced NLP: Uses DeBERTa-v3-base transformer model
- Class Imbalance Handling: SMOTE oversampling for balanced training
- Comprehensive Data Processing: Missing value handling, feature engineering, encoding
- Proper Evaluation: Train/validation/test separation with no data leakage
- Production Ready: Model saved and ready for deployment
- REST API: FastAPI backend for real-time inference
- Modern UI: React frontend for user-friendly predictions
Model Performance
Final Results
- F1-Score: 35.92% on the test set
- Recall: 97.63% (catches almost all fraudulent postings)
- Precision: 22.01% (many false positives; flagged postings need human review)
- Accuracy: 34.28% (low as a consequence of the recall-heavy trade-off on this imbalanced dataset)
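A minimal sketch of how these test-set metrics could be recomputed with the saved model and scikit-learn; the file paths, the "text"/"fraudulent" column names, and the LABEL_0/LABEL_1 naming are assumptions, not taken from the project's scripts.

import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed paths and column names; check data_processing.py / train_deberta.py.
clf = pipeline("text-classification", model="model/deberta_best_model")

test_df = pd.read_csv("data/test_data.csv")
preds = [int(p["label"].endswith("1"))  # assumes LABEL_0 / LABEL_1 naming
         for p in clf(test_df["text"].tolist(), truncation=True, max_length=256)]

precision, recall, f1, _ = precision_recall_fscore_support(
    test_df["fraudulent"], preds, average="binary", zero_division=0)
print(f"Accuracy:  {accuracy_score(test_df['fraudulent'], preds):.4f}")
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")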
Project Structure
fake-job-posting-prediction/
├── data/
│   ├── data_processing.py                # Data preprocessing pipeline
│   ├── data_processing.log               # Processing logs
│   ├── train_data.csv                    # Training data (SMOTE applied)
│   ├── val_data.csv                      # Validation data
│   ├── test_data.csv                     # Test data
│   └── processed_fake_job_postings.csv   # Combined processed data
├── model/
│   ├── train_deberta.py                  # Model training script
│   ├── model_training.log                # Training logs
│   ├── final_results.md                  # Complete results documentation
│   └── deberta_best_model/               # Best trained model
├── api/
│   ├── main.py                           # FastAPI backend
│   └── ...                               # API code and logs
├── ui/
│   ├── src/                              # React frontend source code
│   └── ...                               # Frontend assets
├── .gitignore                            # Git ignore rules
└── README.md                             # Project documentation
Installation
Clone the repository
git clone <repository-url>
cd fake-job-posting-prediction
Set up Python environment (recommended: pyenv + virtualenv)
pyenv install 3.10.14
pyenv virtualenv 3.10.14 hf310env
pyenv activate hf310env
Install backend dependencies
pip install -r requirements.txt
# or: pip install fastapi uvicorn transformers torch sentencepiece pandas numpy scikit-learn imbalanced-learn pydantic loguru datasets kaggle
Install frontend dependencies
cd ui
npm install
Usage
Data Processing
cd data
python data_processing.py
Model Training
cd model
python train_deberta.py
Start the API Backend
cd api
uvicorn main:app --reload
Start the Frontend UI
cd ui
npm start
API Usage
- POST /predict: Send a job posting text and receive a prediction (fraudulent/legitimate) and probability.
- GET /: Health check endpoint.
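A minimal client sketch for the /predict endpoint; the request field name ("text") and the response fields shown in the comment are assumptions and should be checked against api/main.py.

import requests

# Assumed request/response schema; verify against api/main.py.
resp = requests.post(
    "http://localhost:8000/predict",
    json={"text": "this is a high paying job, no experience needed and it is remote."},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"prediction": "fraudulent", "probability": 0.97}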
Frontend Usage
- Access the React app at http://localhost:3000
- Enter a job posting and click "Predict" to see results.
Demo
- The app provides real-time predictions for job postings via a modern web UI.
- Example: Enter "this is a high paying job, no experience needed and it is remote." and see the prediction.
Troubleshooting & Tips
- CORS errors: Ensure CORS middleware is enabled in FastAPI and that only one app instance is created (see the sketch after this list).
- Python version: Use Python 3.10.x for best compatibility with ML libraries.
- Environment issues: Always activate your virtualenv before running backend commands.
- Model always predicts one class: Check your model, data balance, and retrain if needed.
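A minimal sketch of the CORS fix mentioned above, assuming the React dev server runs on its default port 3000; the actual middleware settings live in api/main.py.

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()  # create exactly one app instance

# Allow the React dev server (default port 3000) to call the API.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)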
Technical Details
Model Configuration
- Model: microsoft/deberta-v3-base
- Parameters: 184M
- Max sequence length: 256
- Batch size: 8
- Learning rate: 2e-5
- Epochs: 3
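A minimal sketch of a fine-tuning setup matching the configuration above; it shows where each hyperparameter goes and is not the actual contents of model/train_deberta.py (the "text" column name is an assumption).

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments)

model_name = "microsoft/deberta-v3-base"   # 184M parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # Max sequence length of 256, as listed above; applied to the train and
    # validation splits with datasets.Dataset.map(tokenize, batched=True).
    return tokenizer(batch["text"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="deberta_best_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    use_cpu=True,  # CPU training to sidestep the MPS issue noted below (flag name depends on transformers version)
)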
Data Processing Pipeline
- Missing value handling (drop columns/rows with >50% missing)
- Feature selection (remove identifiers, constants, highly correlated)
- Feature engineering (text length, keyword flags, count features)
- Categorical encoding (LabelEncoder with unseen category handling)
- Numerical scaling (StandardScaler)
- Train/Test split (80/20)
- Train/Validation split (80/20 of training data)
- SMOTE oversampling (applied ONLY to training data)
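A minimal sketch of the split-then-oversample order described above; the file name, the "fraudulent" label column, stratification, and random seeds are assumptions, not the script's actual settings.

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv("processed_fake_job_postings.csv")
X, y = df.drop(columns=["fraudulent"]), df["fraudulent"]

# 80/20 train/test split, then 80/20 train/validation split of the training portion.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.20, stratify=y_trainval, random_state=42)

# SMOTE is fit on the training split ONLY; validation and test data keep the
# original class distribution, so evaluation reflects real-world imbalance.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)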
Dataset Statistics
- Original: ~18K samples (94% non-fraudulent, 6% fraudulent)
- Training: 18,730 samples (perfectly balanced with SMOTE)
- Validation: 2,859 samples (original distribution)
- Test: 3,573 samples (original distribution)
Challenges Solved
- Class Imbalance: Resolved using SMOTE on training data only
- Data Leakage: Prevented with proper train/validation/test separation
- MPS Compatibility: Resolved by switching to CPU training
- Binary Label Conversion: Handled non-binary processed values
- Zero Division Warnings: Resolved with proper metric handling
- CORS/API Integration: Fixed CORS and API integration issues for seamless frontend-backend communication
Next Steps
- Data processing pipeline
- Model training and evaluation
- API development and deployment
- Frontend UI integration
- Troubleshooting and bugfixes
- Project documentation
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License.
Acknowledgments
- Microsoft for the DeBERTa model
- Hugging Face for the transformers library
- The original dataset providers