---
title: ObesityRiskPredictor
emoji: 🍽️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
license: mit
short_description: Classification with Random Forest, LightGBM and XGBoost.
---
# Multi-Model Performance Analysis for Obesity Risk Classification
## 1. Project Overview
This project provides a comprehensive framework for training, evaluating, and comparing the performance of several prominent machine learning models on the multi-class classification task of Obesity Risk Prediction.
The primary objective is to conduct a comparative analysis to determine which modeling approach yields the highest predictive accuracy and robustness for this specific dataset.
The experiment is designed to serve as a benchmark, showcasing a standardized pipeline that includes data preprocessing, hyperparameter optimization, and rigorous model evaluation.
By comparing an ensemble bagging model (Random Forest) against two powerful gradient boosting implementations (LightGBM and XGBoost), we aim to uncover insights into the most effective architecture for this type of tabular data problem.
## 2. Technical Architecture & Methodologies
### 2.1. Models Evaluated
The core of this experiment involves the evaluation of three distinct, yet powerful, tree-based ensemble models:
- Random Forest Classifier: An ensemble method based on bagging. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. It is known for its robustness and ability to handle high-dimensional data.
- LightGBM (Light Gradient Boosting Machine): A high-performance gradient boosting framework that uses tree-based learning algorithms. It is distinguished by its use of histogram-based algorithms and leaf-wise tree growth, which results in significantly faster training speeds and lower memory usage compared to other boosting methods.
- XGBoost (eXtreme Gradient Boosting): An optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting that solves many data science problems quickly and accurately.
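As a sketch, the three classifiers can be instantiated through their scikit-learn-compatible APIs. The parameter values below are illustrative defaults, not this project's tuned settings, and the LightGBM/XGBoost imports are guarded because they are separate packages:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagging ensemble: many decorrelated trees voting on the predicted class.
models = {"RandomForest": RandomForestClassifier(n_estimators=100, random_state=42)}

# LightGBM and XGBoost are optional third-party packages; guard the imports
# so the sketch still runs where only scikit-learn is installed.
try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(random_state=42)  # histogram-based, leaf-wise growth
except ImportError:
    pass

try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=42)    # optimized parallel tree boosting
except ImportError:
    pass
```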
### 2.2. Dataset
The analysis is performed on the "Estimation of Obesity Levels Based On Eating Habits and Physical Condition" dataset.
- Task Type: Multi-Class Classification
- Features: The dataset comprises a mix of numerical (e.g., `Age`, `Height`, `Weight`) and categorical (e.g., `Gender`, `family_history_with_overweight`, `MTRANS`) variables.
- Instances: 2111
- Attributes: 16 predictive features and 1 target class (`NObeyesdad`).
#### 2.2.1. Dataset Source and Composition
This dataset was created to estimate obesity levels in individuals from Mexico, Peru, and Colombia. It is composed of both real and synthetically generated data:
- 23% of the data was collected directly from users via a web platform.
- 77% of the data was generated synthetically using the SMOTE (Synthetic Minority Over-sampling Technique) filter in Weka to address class imbalance.
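The synthetic rows ship with the dataset, so this project never re-runs SMOTE itself, but the underlying idea is worth seeing: each synthetic sample lies on the line segment between a minority-class point and one of its nearest neighbors. A minimal NumPy sketch of that interpolation step (the helper name and the (Height, Weight) values are illustrative, not taken from the Weka implementation):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and a neighbor."""
    gap = rng.random()                  # uniform draw in [0, 1)
    return x + gap * (neighbor - x)     # convex combination of the two points

rng = np.random.default_rng(0)
minority = np.array([[1.60, 55.0],     # (Height, Weight) of a minority-class point
                     [1.62, 57.0]])    # ...and its nearest minority-class neighbor
synthetic = smote_interpolate(minority[0], minority[1], rng)
```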
#### 2.2.2. Citation
Proper credit is given to the creators of this dataset.
- Source: UCI Machine Learning Repository
- Creators: Palechor, F. M., & de la Hoz Manotas, A. (2019).
### 2.3. Data Preprocessing Pipeline
A standardized preprocessing pipeline is applied to ensure data quality and compatibility with the machine learning models:
- Categorical Feature Encoding: One-Hot Encoding is applied to all nominal categorical features. This transforms categorical data into a numerical format without introducing an ordinal relationship, creating binary columns for each category.
- Target Variable Encoding: The multi-class target variable (`NObeyesdad`) is converted into numerical format using Label Encoding.
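The two encoding steps above can be sketched with pandas and scikit-learn; the miniature frame below reuses this dataset's column names but only illustrative values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [21.0, 23.0, 27.0],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I"],
})

# One-hot encode nominal features: one binary column per category, no ordering.
X = pd.get_dummies(df.drop(columns="NObeyesdad"))

# Label-encode the multi-class target into integers 0..n_classes-1.
le = LabelEncoder()
y = le.fit_transform(df["NObeyesdad"])
```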
### 2.4. Hyperparameter Optimization
To ensure each model performs optimally, we employ a systematic hyperparameter tuning strategy:
- Strategy: `RandomizedSearchCV` is utilized to efficiently search a defined parameter space for each model. This approach samples a fixed number of parameter combinations from the specified distributions, offering a strong balance between computational cost and tuning effectiveness.
- Cross-Validation: `StratifiedKFold` cross-validation (with 5 splits) is used within the search process. This ensures that each fold is a representative sample of the overall class distribution, which is critical for maintaining robust evaluation on multi-class datasets that may have imbalanced classes.
- Optimization Metric: The primary scoring metric used to identify the best parameter set during the search is Accuracy.
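A compact sketch of that tuning loop on stand-in data; the search space and `n_iter` below are placeholders, not the distributions actually used for each model:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic multi-class data standing in for the obesity dataset.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=3, random_state=42)

param_distributions = {
    "n_estimators": randint(50, 200),  # sampled from, not exhaustively enumerated
    "max_depth": randint(3, 12),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,                          # fixed number of sampled combinations
    scoring="accuracy",                # the optimization metric from Section 2.4
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)
```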
### 2.5. Model Evaluation
The performance of the fine-tuned models is assessed using a standard set of classification metrics:
- Overall Accuracy: The primary measure of the model's ability to make correct predictions across all classes.
- Classification Report: A detailed report providing class-wise performance metrics, including:
- Precision: The ability of the classifier not to label as positive a sample that is negative.
- Recall (Sensitivity): The ability of the classifier to find all the positive samples.
- F1-Score: The harmonic mean of precision and recall.
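These metrics map directly onto scikit-learn's evaluation helpers; the label vectors here are made up for illustration:

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Insufficient_Weight"]
y_pred = ["Normal_Weight", "Obesity_Type_I", "Obesity_Type_I", "Insufficient_Weight"]

acc = accuracy_score(y_true, y_pred)            # fraction of correct predictions
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1 + averages
```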
### 2.6. Evaluation Results

```
--- Starting Model Evaluation ---
Attempting to load dataset from 'datasets/ObesityDataSet_raw_and_data_sinthetic.csv'...
Dataset loaded successfully.

Dataset Head:
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP ... SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 ... no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 ... yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 ... no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 ... no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 ... no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II
[5 rows x 17 columns]

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   object
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object
 5   FAVC                            2111 non-null   object
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object
 9   SMOKE                           2111 non-null   object
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   object
 15  MTRANS                          2111 non-null   object
 16  NObeyesdad                      2111 non-null   object
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

Preprocessing data...
Target classes mapped: {'Insufficient_Weight': np.int64(0), 'Normal_Weight': np.int64(1), 'Obesity_Type_I': np.int64(2), 'Obesity_Type_II': np.int64(3), 'Obesity_Type_III': np.int64(4), 'Overweight_Level_I': np.int64(5), 'Overweight_Level_II': np.int64(6)}

RandomForest Model, feature columns, and label encoder loaded for prediction.
Evaluating RandomForest performance...
RandomForest Accuracy: 0.9480
RandomForest Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       1.00      0.93      0.96        54
      Normal_Weight       0.79      0.97      0.87        58
     Obesity_Type_I       0.94      0.97      0.96        70
    Obesity_Type_II       1.00      0.98      0.99        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.96      0.84      0.90        58
Overweight_Level_II       0.98      0.95      0.96        58
           accuracy                           0.95       423
          macro avg       0.95      0.95      0.95       423
       weighted avg       0.95      0.95      0.95       423

LightGBM Model, feature columns, and label encoder loaded for prediction.
Evaluating LightGBM performance...
LightGBM Accuracy: 0.9716
LightGBM Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       1.00      0.94      0.97        54
      Normal_Weight       0.89      1.00      0.94        58
     Obesity_Type_I       0.96      0.99      0.97        70
    Obesity_Type_II       1.00      0.98      0.99        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.98      0.91      0.95        58
Overweight_Level_II       0.98      0.98      0.98        58
           accuracy                           0.97       423
          macro avg       0.97      0.97      0.97       423
       weighted avg       0.97      0.97      0.97       423

XGBoost Model, feature columns, and label encoder loaded for prediction.
Evaluating XGBoost performance...
XGBoost Accuracy: 0.9527
XGBoost Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       0.98      0.89      0.93        54
      Normal_Weight       0.82      0.97      0.89        58
     Obesity_Type_I       0.97      0.97      0.97        70
    Obesity_Type_II       0.98      0.98      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.96      0.90      0.93        58
Overweight_Level_II       0.97      0.97      0.97        58
           accuracy                           0.95       423
          macro avg       0.96      0.95      0.95       423
       weighted avg       0.96      0.95      0.95       423
--- Model Evaluation Finished ---
```