---
title: ObesityRiskPredictor
emoji: 🍽️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.41.0
app_file: app.py
pinned: false
license: mit
short_description: Classification with Random Forest, LightGBM and XGBoost.
---
# Multi-Model Performance Analysis for Obesity Risk Classification
## 1. Project Overview
This project provides a comprehensive framework for training, evaluating, and comparing the performance of several prominent machine learning models on the multi-class classification task of Obesity Risk Prediction.
The primary objective is to conduct a comparative analysis to determine which modeling approach yields the highest predictive accuracy and robustness for this specific dataset.
The experiment is designed to serve as a benchmark, showcasing a standardized pipeline that includes data preprocessing, hyperparameter optimization, and rigorous model evaluation.
By comparing an ensemble bagging model (Random Forest) against two powerful gradient boosting implementations (LightGBM and XGBoost), we aim to uncover insights into the most effective architecture for this type of tabular data problem.
## 2. Technical Architecture & Methodologies
### 2.1. Models Evaluated
The core of this experiment involves the evaluation of three distinct, yet powerful, tree-based ensemble models:
- Random Forest Classifier: An ensemble method based on bagging. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. It is known for its robustness and ability to handle high-dimensional data.
- LightGBM (Light Gradient Boosting Machine): A high-performance gradient boosting framework that uses tree-based learning algorithms. It is distinguished by its use of histogram-based algorithms and leaf-wise tree growth, which results in significantly faster training speeds and lower memory usage compared to other boosting methods.
- XGBoost (eXtreme Gradient Boosting): An optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting that solves many data science problems quickly and accurately.
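As a sketch, the three classifiers can be instantiated through their scikit-learn-compatible APIs. The parameter values below are illustrative defaults, not this project's tuned settings, and the LightGBM/XGBoost imports are guarded because they are separate packages:

```python
from sklearn.ensemble import RandomForestClassifier

# Bagging ensemble: many decorrelated trees voting on the predicted class.
models = {"RandomForest": RandomForestClassifier(n_estimators=100, random_state=42)}

# LightGBM and XGBoost are optional third-party packages; guard the imports
# so the sketch still runs where only scikit-learn is installed.
try:
    from lightgbm import LGBMClassifier
    models["LightGBM"] = LGBMClassifier(random_state=42)  # histogram-based, leaf-wise growth
except ImportError:
    pass

try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(random_state=42)    # optimized parallel tree boosting
except ImportError:
    pass
```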
### 2.2. Dataset
The analysis is performed on the "Estimation of Obesity Levels Based On Eating Habits and Physical Condition" dataset.
- Task Type: Multi-Class Classification
- Features: The dataset comprises a mix of numerical (e.g., `Age`, `Height`, `Weight`) and categorical (e.g., `Gender`, `family_history_with_overweight`, `MTRANS`) variables.
- Instances: 2111
- Attributes: 16 predictive features and 1 target class (`NObeyesdad`).
#### 2.2.1. Dataset Source and Composition
This dataset was created to estimate obesity levels in individuals from Mexico, Peru, and Colombia. It is composed of both real and synthetically generated data:
- 23% of the data was collected directly from users via a web platform.
- 77% of the data was generated synthetically using the SMOTE (Synthetic Minority Over-sampling Technique) filter in Weka to address class imbalance.
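The synthetic rows ship with the dataset, so this project never re-runs SMOTE itself, but the underlying idea is worth seeing: each synthetic sample lies on the line segment between a minority-class point and one of its nearest neighbors. A minimal NumPy sketch of that interpolation step (the helper name and the (Height, Weight) values are illustrative, not taken from the Weka implementation):

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and a neighbor."""
    gap = rng.random()                  # uniform draw in [0, 1)
    return x + gap * (neighbor - x)     # convex combination of the two points

rng = np.random.default_rng(0)
minority = np.array([[1.60, 55.0],     # (Height, Weight) of a minority-class point
                     [1.62, 57.0]])    # ...and its nearest minority-class neighbor
synthetic = smote_interpolate(minority[0], minority[1], rng)
```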
#### 2.2.2. Citation
Proper credit is given to the creators of this dataset.
- Source: UCI Machine Learning Repository
- Creators: Palechor, F. M., & de la Hoz Manotas, A. (2019).
### 2.3. Data Preprocessing Pipeline
A standardized preprocessing pipeline is applied to ensure data quality and compatibility with the machine learning models:
- Categorical Feature Encoding: One-Hot Encoding is applied to all nominal categorical features. This transforms categorical data into a numerical format without introducing an ordinal relationship, creating binary columns for each category.
- Target Variable Encoding: The multi-class target variable (`NObeyesdad`) is converted into numerical format using Label Encoding.
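The two encoding steps above can be sketched with pandas and scikit-learn; the miniature frame below reuses this dataset's column names but only illustrative values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [21.0, 23.0, 27.0],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I"],
})

# One-hot encode nominal features: one binary column per category, no ordering.
X = pd.get_dummies(df.drop(columns="NObeyesdad"))

# Label-encode the multi-class target into integers 0..n_classes-1.
le = LabelEncoder()
y = le.fit_transform(df["NObeyesdad"])
```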
### 2.4. Hyperparameter Optimization
To ensure each model performs optimally, we employ a systematic hyperparameter tuning strategy:
- Strategy: `RandomizedSearchCV` is utilized to efficiently search a defined parameter space for each model. This approach samples a fixed number of parameter combinations from the specified distributions, offering a strong balance between computational cost and tuning effectiveness.
- Cross-Validation: `StratifiedKFold` cross-validation (with 5 splits) is used within the search process. This ensures that each fold is a representative sample of the overall class distribution, which is critical for maintaining robust evaluation on multi-class datasets that may have imbalanced classes.
- Optimization Metric: The primary scoring metric used to identify the best parameter set during the search is Accuracy.
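A compact sketch of that tuning loop on stand-in data; the search space and `n_iter` below are placeholders, not the distributions actually used for each model:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic multi-class data standing in for the obesity dataset.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=3, random_state=42)

param_distributions = {
    "n_estimators": randint(50, 200),  # sampled from, not exhaustively enumerated
    "max_depth": randint(3, 12),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,                          # fixed number of sampled combinations
    scoring="accuracy",                # the optimization metric from Section 2.4
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)
```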
### 2.5. Model Evaluation
The performance of the fine-tuned models is assessed using a standard set of classification metrics:
- Overall Accuracy: The primary measure of the model's ability to make correct predictions across all classes.
- Classification Report: A detailed report providing class-wise performance metrics, including:
- Precision: The ability of the classifier not to label as positive a sample that is negative.
- Recall (Sensitivity): The ability of the classifier to find all the positive samples.
- F1-Score: The harmonic mean of precision and recall.
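These metrics map directly onto scikit-learn's evaluation helpers; the label vectors here are made up for illustration:

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Insufficient_Weight"]
y_pred = ["Normal_Weight", "Obesity_Type_I", "Obesity_Type_I", "Insufficient_Weight"]

acc = accuracy_score(y_true, y_pred)            # fraction of correct predictions
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1 + averages
```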
### 2.6. Evaluation Results

```
--- Starting Model Evaluation ---
Attempting to load dataset from 'datasets/ObesityDataSet_raw_and_data_sinthetic.csv'...
Dataset loaded successfully.

Dataset Head:
Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP ... SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 ... no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 ... yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 ... no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 ... no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 ... no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II
[5 rows x 17 columns]

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Gender                          2111 non-null   object
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object
 5   FAVC                            2111 non-null   object
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object
 9   SMOKE                           2111 non-null   object
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CALC                            2111 non-null   object
 15  MTRANS                          2111 non-null   object
 16  NObeyesdad                      2111 non-null   object
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

Preprocessing data...
Target classes mapped: {'Insufficient_Weight': np.int64(0), 'Normal_Weight': np.int64(1), 'Obesity_Type_I': np.int64(2), 'Obesity_Type_II': np.int64(3), 'Obesity_Type_III': np.int64(4), 'Overweight_Level_I': np.int64(5), 'Overweight_Level_II': np.int64(6)}

RandomForest Model, feature columns, and label encoder loaded for prediction.
Evaluating RandomForest performance...
RandomForest Accuracy: 0.9480
RandomForest Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       1.00      0.93      0.96        54
      Normal_Weight       0.79      0.97      0.87        58
     Obesity_Type_I       0.94      0.97      0.96        70
    Obesity_Type_II       1.00      0.98      0.99        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.96      0.84      0.90        58
Overweight_Level_II       0.98      0.95      0.96        58
           accuracy                           0.95       423
          macro avg       0.95      0.95      0.95       423
       weighted avg       0.95      0.95      0.95       423

LightGBM Model, feature columns, and label encoder loaded for prediction.
Evaluating LightGBM performance...
LightGBM Accuracy: 0.9716
LightGBM Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       1.00      0.94      0.97        54
      Normal_Weight       0.89      1.00      0.94        58
     Obesity_Type_I       0.96      0.99      0.97        70
    Obesity_Type_II       1.00      0.98      0.99        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.98      0.91      0.95        58
Overweight_Level_II       0.98      0.98      0.98        58
           accuracy                           0.97       423
          macro avg       0.97      0.97      0.97       423
       weighted avg       0.97      0.97      0.97       423

XGBoost Model, feature columns, and label encoder loaded for prediction.
Evaluating XGBoost performance...
XGBoost Accuracy: 0.9527
XGBoost Classification Report:
                     precision    recall  f1-score   support
Insufficient_Weight       0.98      0.89      0.93        54
      Normal_Weight       0.82      0.97      0.89        58
     Obesity_Type_I       0.97      0.97      0.97        70
    Obesity_Type_II       0.98      0.98      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.96      0.90      0.93        58
Overweight_Level_II       0.97      0.97      0.97        58
           accuracy                           0.95       423
          macro avg       0.96      0.95      0.95       423
       weighted avg       0.96      0.95      0.95       423
--- Model Evaluation Finished ---
```