Spaces: ethicalabs/ObesityRiskPredictor


mrs83 committed 3 days ago

Commit

697fe11

1 Parent(s): 60dad93

initial import


Files changed (12)

README.md +177 -12
app.py +274 -0
model/LightGBM_model_columns.joblib +3 -0
model/RandomForest_model_columns.joblib +3 -0
model/XGBoost_model_columns.joblib +3 -0
model/label_encoder.joblib +3 -0
model/obesity_LightGBM_model.joblib +3 -0
model/obesity_RandomForest_model.joblib +3 -0
model/obesity_XGBoost_model.joblib +3 -0
obesity_rp/__init__.py +0 -0
obesity_rp/config.py +27 -0
requirements.txt +68 -0

README.md CHANGED

@@ -1,12 +1,177 @@
----
-title: ObesityRiskPredictor
-emoji: 👁
-colorFrom: gray
-colorTo: purple
-sdk: gradio
-sdk_version: 5.43.1
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+Multi-Model Performance Analysis for Obesity Risk Classification
+================================================================
+1\. Project Overview
+--------------------
+[This project](https://github.com/ethicalabs-ai/ObesityRiskPredictor) trains, evaluates, and compares three tree-based machine learning models on the multi-class classification task of obesity risk prediction.
+The primary objective is to conduct a comparative analysis to determine which modeling approach yields the highest predictive accuracy and robustness for this specific dataset.
+The experiment is designed to serve as a benchmark, showcasing a standardized pipeline that includes data preprocessing, hyperparameter optimization, and rigorous model evaluation.
+By comparing an ensemble bagging model (Random Forest) against two powerful gradient boosting implementations (LightGBM and XGBoost), we aim to uncover insights into the most effective architecture for this type of tabular data problem.
+2\. Technical Architecture & Methodologies
+------------------------------------------
+### 2.1. Models Evaluated
+The core of this experiment involves the evaluation of three distinct, yet powerful, tree-based ensemble models:
+-   **Random Forest Classifier:** An ensemble method based on bagging. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees. It is known for its robustness and ability to handle high-dimensional data.
+-   **LightGBM (Light Gradient Boosting Machine):** A gradient boosting framework that uses tree-based learning algorithms. It is distinguished by histogram-based split finding and leaf-wise tree growth, which typically give faster training and lower memory usage than boosting implementations that use exact split finding and level-wise growth.
+-   **XGBoost (eXtreme Gradient Boosting):** An optimized, distributed gradient boosting library designed for efficiency, flexibility, and portability. It implements gradient-boosted decision trees and parallelizes tree construction.
+### 2.2. Dataset
+The analysis is performed on the **"Estimation of Obesity Levels Based On Eating Habits and Physical Condition"** dataset.
+-   **Task Type:** Multi-Class Classification
+-   **Features:** The dataset comprises a mix of numerical (e.g., `Age`, `Height`, `Weight`) and categorical (e.g., `Gender`, `family_history_with_overweight`, `MTRANS`) variables.
+-   **Instances:** 2111
+-   **Attributes:** 16 predictive features and 1 target column (`NObeyesdad`) with 7 classes.
+#### 2.2.1. Dataset Source and Composition
+This dataset was created to estimate obesity levels in individuals from Mexico, Peru, and Colombia. It is composed of both real and synthetically generated data:
+-   **23%** of the data was collected directly from users via a web platform.
+-   **77%** of the data was generated synthetically using the SMOTE (Synthetic Minority Over-sampling Technique) filter in Weka to address class imbalance.
+#### 2.2.2. Citation
+Proper credit is given to the creators of this dataset.
+-   **Source:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition)
+-   **Creators:** Palechor, F. M., & de la Hoz Manotas, A. (2019).
+### 2.3. Data Preprocessing Pipeline
+A standardized preprocessing pipeline is applied to ensure data quality and compatibility with the machine learning models:
+-   **Categorical Feature Encoding:** **One-Hot Encoding** is applied to all nominal categorical features. This transforms categorical data into a numerical format without introducing an ordinal relationship, creating binary columns for each category.
+-   **Target Variable Encoding:** The multi-class target variable (`NObeyesdad`) is converted into numerical format using **Label Encoding**.
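The training script is not among the files in this commit, so the following is only a minimal sketch of the two encoding steps listed above. It assumes `pandas.get_dummies` and scikit-learn's `LabelEncoder`; the toy rows and the names `X`, `y`, `encoder`, and `feature_columns` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows using column names from the dataset; the values are illustrative only.
df = pd.DataFrame(
    {
        "Age": [21.0, 23.0, 27.0],
        "Gender": ["Female", "Male", "Male"],
        "MTRANS": ["Public_Transportation", "Walking", "Walking"],
        "NObeyesdad": ["Normal_Weight", "Overweight_Level_I", "Normal_Weight"],
    }
)

# One-hot encode the nominal features: one binary column per category,
# named "<feature>_<category>".
X = pd.get_dummies(df.drop(columns=["NObeyesdad"]), columns=["Gender", "MTRANS"])

# Label-encode the target: classes are sorted, then numbered from 0.
encoder = LabelEncoder()
y = encoder.fit_transform(df["NObeyesdad"])

# Keeping the column list lets inference code rebuild the same layout.
feature_columns = list(X.columns)
```

The `<feature>_<category>` naming is the same pattern that `app.py` looks up when it sets one-hot columns at prediction time.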
+### 2.4. Hyperparameter Optimization
+To ensure each model performs optimally, we employ a systematic hyperparameter tuning strategy:
+-   **Strategy:** `RandomizedSearchCV` is utilized to efficiently search a defined parameter space for each model. This approach samples a fixed number of parameter combinations from the specified distributions, offering a strong balance between computational cost and tuning effectiveness.
+-   **Cross-Validation:** `StratifiedKFold` cross-validation (with 5 splits) is used within the search process. This ensures that each fold is a representative sample of the overall class distribution, which is critical for maintaining robust evaluation on multi-class datasets that may have imbalanced classes.
+-   **Optimization Metric:** The primary scoring metric used to identify the best parameter set during the search is **Accuracy**.
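The search described above could be wired up roughly as follows. This is a sketch under stated assumptions: the data is a synthetic stand-in, and the parameter ranges and `n_iter` are hypothetical because the README does not list the ones actually used.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the encoded obesity features (assumption, not the real data).
X, y = make_classification(
    n_samples=210, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical search space; the real ranges are not given in the README.
param_distributions = {
    "n_estimators": randint(10, 40),
    "max_depth": randint(2, 8),
}

# Five stratified folds, so each fold keeps the overall class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    scoring="accuracy",  # the optimization metric named above
    cv=cv,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

The same pattern would apply to the LightGBM and XGBoost classifiers, since both expose scikit-learn compatible estimators.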
+### 2.5. Model Evaluation
+The performance of the fine-tuned models is assessed using a standard set of classification metrics:
+-   **Overall Accuracy:** The primary measure of the model's ability to make correct predictions across all classes.
+-   **Classification Report:** A detailed report providing class-wise performance metrics, including:
+    -   **Precision:** The ability of the classifier not to label as positive a sample that is negative.
+    -   **Recall (Sensitivity):** The ability of the classifier to find all the positive samples.
+    -   **F1-Score:** The harmonic mean of precision and recall.
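As a worked example of these three metrics, the sketch below computes them from per-class counts. The counts (`tp=56`, `fp=15`, `fn=2`) are hypothetical: they were chosen to be consistent with the rounded `Normal_Weight` row of the Random Forest report in this README (precision 0.79, recall 0.97, support 58) and are not taken from the project's output.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 58 true Normal_Weight samples, 56 of them found,
# plus 15 samples of other classes wrongly labeled Normal_Weight.
precision, recall, f1 = precision_recall_f1(tp=56, fp=15, fn=2)

# The same F1 follows directly from the rounded report values.
f1_from_report = 2 * 0.79 * 0.97 / (0.79 + 0.97)
```

Both routes round to an F1 of 0.87, which is the value shown in the report for that class.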
+### 2.6. Evaluation Results
+```
+--- Starting Model Evaluation ---
+Attempting to load dataset from 'datasets/ObesityDataSet_raw_and_data_sinthetic.csv'...
+Dataset loaded successfully.
+Dataset Head:
+   Gender   Age  Height  Weight family_history_with_overweight FAVC  FCVC  NCP  ... SMOKE CH2O  SCC  FAF  TUE        CALC                 MTRANS           NObeyesdad
+0  Female  21.0    1.62    64.0                            yes   no   2.0  3.0  ...    no  2.0   no  0.0  1.0          no  Public_Transportation        Normal_Weight
+1  Female  21.0    1.52    56.0                            yes   no   3.0  3.0  ...   yes  3.0  yes  3.0  0.0   Sometimes  Public_Transportation        Normal_Weight
+2    Male  23.0    1.80    77.0                            yes   no   2.0  3.0  ...    no  2.0   no  2.0  1.0  Frequently  Public_Transportation        Normal_Weight
+3    Male  27.0    1.80    87.0                             no   no   3.0  3.0  ...    no  2.0   no  2.0  0.0  Frequently                Walking   Overweight_Level_I
+4    Male  22.0    1.78    89.8                             no   no   2.0  1.0  ...    no  2.0   no  0.0  0.0   Sometimes  Public_Transportation  Overweight_Level_II
+[5 rows x 17 columns]
+Dataset Info:
+<class 'pandas.core.frame.DataFrame'>
+RangeIndex: 2111 entries, 0 to 2110
+Data columns (total 17 columns):
+ #   Column                          Non-Null Count  Dtype
+---  ------                          --------------  -----
+ 0   Gender                          2111 non-null   object
+ 1   Age                             2111 non-null   float64
+ 2   Height                          2111 non-null   float64
+ 3   Weight                          2111 non-null   float64
+ 4   family_history_with_overweight  2111 non-null   object
+ 5   FAVC                            2111 non-null   object
+ 6   FCVC                            2111 non-null   float64
+ 7   NCP                             2111 non-null   float64
+ 8   CAEC                            2111 non-null   object
+ 9   SMOKE                           2111 non-null   object
+ 10  CH2O                            2111 non-null   float64
+ 11  SCC                             2111 non-null   object
+ 12  FAF                             2111 non-null   float64
+ 13  TUE                             2111 non-null   float64
+ 14  CALC                            2111 non-null   object
+ 15  MTRANS                          2111 non-null   object
+ 16  NObeyesdad                      2111 non-null   object
+dtypes: float64(8), object(9)
+memory usage: 280.5+ KB
+Preprocessing data...
+Target classes mapped: {'Insufficient_Weight': np.int64(0), 'Normal_Weight': np.int64(1), 'Obesity_Type_I': np.int64(2), 'Obesity_Type_II': np.int64(3), 'Obesity_Type_III': np.int64(4), 'Overweight_Level_I': np.int64(5), 'Overweight_Level_II': np.int64(6)}
+RandomForest Model, feature columns, and label encoder loaded for prediction.
+Evaluating RandomForest performance...
+RandomForest Accuracy: 0.9480
+RandomForest Classification Report:
+                      precision    recall  f1-score   support
+Insufficient_Weight       1.00      0.93      0.96        54
+      Normal_Weight       0.79      0.97      0.87        58
+     Obesity_Type_I       0.94      0.97      0.96        70
+    Obesity_Type_II       1.00      0.98      0.99        60
+   Obesity_Type_III       1.00      0.98      0.99        65
+ Overweight_Level_I       0.96      0.84      0.90        58
+Overweight_Level_II       0.98      0.95      0.96        58
+           accuracy                           0.95       423
+          macro avg       0.95      0.95      0.95       423
+       weighted avg       0.95      0.95      0.95       423
+LightGBM Model, feature columns, and label encoder loaded for prediction.
+Evaluating LightGBM performance...
+LightGBM Accuracy: 0.9716
+LightGBM Classification Report:
+                      precision    recall  f1-score   support
+Insufficient_Weight       1.00      0.94      0.97        54
+      Normal_Weight       0.89      1.00      0.94        58
+     Obesity_Type_I       0.96      0.99      0.97        70
+    Obesity_Type_II       1.00      0.98      0.99        60
+   Obesity_Type_III       1.00      0.98      0.99        65
+ Overweight_Level_I       0.98      0.91      0.95        58
+Overweight_Level_II       0.98      0.98      0.98        58
+           accuracy                           0.97       423
+          macro avg       0.97      0.97      0.97       423
+       weighted avg       0.97      0.97      0.97       423
+XGBoost Model, feature columns, and label encoder loaded for prediction.
+Evaluating XGBoost performance...
+XGBoost Accuracy: 0.9527
+XGBoost Classification Report:
+                      precision    recall  f1-score   support
+Insufficient_Weight       0.98      0.89      0.93        54
+      Normal_Weight       0.82      0.97      0.89        58
+     Obesity_Type_I       0.97      0.97      0.97        70
+    Obesity_Type_II       0.98      0.98      0.98        60
+   Obesity_Type_III       1.00      0.98      0.99        65
+ Overweight_Level_I       0.96      0.90      0.93        58
+Overweight_Level_II       0.97      0.97      0.97        58
+           accuracy                           0.95       423
+          macro avg       0.96      0.95      0.95       423
+       weighted avg       0.96      0.95      0.95       423
+--- Model Evaluation Finished ---
+```
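The numbers in the log above can be cross-checked with a little arithmetic. The sketch below uses only values printed in the log; the conclusion that the test set is roughly a 20% hold-out is inferred from those values and is not stated in the README.

```python
# Per-class supports from any of the three classification reports.
supports = [54, 58, 70, 60, 65, 58, 58]
test_size = sum(supports)

# Reported accuracies, converted back into counts of correct predictions.
accuracies = {"RandomForest": 0.9480, "LightGBM": 0.9716, "XGBoost": 0.9527}
correct = {name: round(acc * test_size) for name, acc in accuracies.items()}

# Share of the 2111 instances that ended up in the evaluation set.
test_fraction = test_size / 2111
```

Under this reading, LightGBM classifies 411 of 423 test samples correctly, against 403 for XGBoost and 401 for Random Forest.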

app.py ADDED

	@@ -0,0 +1,274 @@

+import pandas as pd
+import lightgbm as lgb
+import xgboost as xgb
+import gradio as gr
+import joblib
+import os
+from obesity_rp import config as cfg
+# Global variables to store loaded models, their columns, and the label encoder
+loaded_models = {}
+loaded_model_columns_map = {}
+label_encoder = None
+def load_model_artifacts(model_name):
+    """
+    Loads the trained model, feature columns, and the label encoder.
+    """
+    model_file = os.path.join(cfg.MODEL_DIR, f"obesity_{model_name}_model.joblib")
+    columns_file = os.path.join(cfg.MODEL_DIR, f"{model_name}_model_columns.joblib")
+    encoder_file = os.path.join(cfg.MODEL_DIR, "label_encoder.joblib")
+    if not all(os.path.exists(f) for f in [model_file, columns_file, encoder_file]):
+        raise FileNotFoundError(
+            f"Model artifacts for '{model_name}' not found. Please ensure all required files exist."
+        )
+    loaded_model = joblib.load(model_file)
+    loaded_model_columns = joblib.load(columns_file)
+    le = joblib.load(encoder_file)
+    print(
+        f"{model_name} Model, feature columns, and label encoder loaded for prediction."
+    )
+    return loaded_model, loaded_model_columns, le
+def predict_obesity_risk(
+    model_choice,
+    Gender,
+    Age,
+    Height,
+    Weight,
+    family_history_with_overweight,
+    FAVC,
+    FCVC,
+    NCP,
+    CAEC,
+    SMOKE,
+    CH2O,
+    SCC,
+    FAF,
+    TUE,
+    CALC,
+    MTRANS,
+):
+    """
+    Predicts obesity risk based on input features and chosen model.
+    """
+    global label_encoder
+    if model_choice not in loaded_models:
+        try:
+            model, columns, le = load_model_artifacts(model_choice)
+            loaded_models[model_choice] = model
+            loaded_model_columns_map[model_choice] = columns
+            if label_encoder is None:
+                label_encoder = le
+        except FileNotFoundError as e:
+            return f"Error: {e} Train the '{model_choice}' model first."
+    else:
+        model = loaded_models[model_choice]
+        columns = loaded_model_columns_map[model_choice]
+        le = label_encoder
+    # Create a dictionary to hold the input data
+    input_data_dict = {
+        "Age": Age,
+        "Height": Height,
+        "Weight": Weight,
+        "FCVC": FCVC,
+        "NCP": NCP,
+        "CH2O": CH2O,
+        "FAF": FAF,
+        "TUE": TUE,
+    }
+    input_df = pd.DataFrame(0.0, index=[0], columns=columns)  # float frame: inputs such as Height are not integers
+    for col, value in input_data_dict.items():
+        if col in input_df.columns:
+            input_df.loc[0, col] = value
+    # Handle one-hot encoded categorical features
+    categorical_inputs = {
+        "Gender": Gender,
+        "family_history_with_overweight": family_history_with_overweight,
+        "FAVC": FAVC,
+        "CAEC": CAEC,
+        "SMOKE": SMOKE,
+        "SCC": SCC,
+        "CALC": CALC,
+        "MTRANS": MTRANS,
+    }
+    for col_prefix, value in categorical_inputs.items():
+        column_name = f"{col_prefix}_{value}"
+        if column_name in input_df.columns:
+            input_df.loc[0, column_name] = 1
+    input_df = input_df[columns]
+    prediction_proba = model.predict_proba(input_df)[0]
+    prediction_encoded = model.predict(input_df)[0]
+    prediction_label = le.inverse_transform([prediction_encoded])[0]
+    results = f"Using {model_choice} Model:\nPrediction: {prediction_label}\n\n--- Prediction Probabilities ---\n"
+    for i, class_name in enumerate(le.classes_):
+        prob = prediction_proba[i] * 100
+        results += f"{class_name}: {prob:.2f}%\n"
+    return results
+def launch_gradio_app(share=False):
+    """
+    Launches the Gradio web application for obesity risk prediction.
+    """
+    print("\n--- Starting Gradio App ---")
+    # Define Gradio input components
+    model_choice_input = gr.Dropdown(
+        choices=cfg.MODEL_CHOICES, label="Select Model", value=cfg.RANDOM_FOREST
+    )
+    gender_input = gr.Dropdown(choices=["Female", "Male"], label="Gender")
+    age_input = gr.Slider(minimum=1, maximum=100, step=1, label="Age")
+    height_input = gr.Slider(minimum=1.0, maximum=2.2, step=0.01, label="Height (m)")
+    weight_input = gr.Slider(minimum=30.0, maximum=200.0, step=0.1, label="Weight (kg)")
+    family_history_input = gr.Radio(
+        choices=["yes", "no"], label="Family History with Overweight"
+    )
+    favc_input = gr.Radio(
+        choices=["yes", "no"], label="Frequent consumption of high caloric food (FAVC)"
+    )
+    fcvc_input = gr.Slider(
+        minimum=1,
+        maximum=3,
+        step=1,
+        label="Frequency of consumption of vegetables (FCVC)",
+    )
+    ncp_input = gr.Slider(
+        minimum=1, maximum=4, step=1, label="Number of main meals (NCP)"
+    )
+    caec_input = gr.Dropdown(
+        choices=["no", "Sometimes", "Frequently", "Always"],
+        label="Consumption of food between meals (CAEC)",
+    )
+    smoke_input = gr.Radio(choices=["yes", "no"], label="SMOKE")
+    ch2o_input = gr.Slider(
+        minimum=1, maximum=3, step=1, label="Consumption of water daily (CH2O)"
+    )
+    scc_input = gr.Radio(
+        choices=["yes", "no"], label="Calories consumption monitoring (SCC)"
+    )
+    faf_input = gr.Slider(
+        minimum=0, maximum=3, step=1, label="Physical activity frequency (FAF)"
+    )
+    tue_input = gr.Slider(
+        minimum=0, maximum=2, step=1, label="Time using technology devices (TUE)"
+    )
+    calc_input = gr.Dropdown(
+        choices=["no", "Sometimes", "Frequently", "Always"],
+        label="Consumption of alcohol (CALC)",
+    )
+    mtrans_input = gr.Dropdown(
+        choices=["Automobile", "Motorbike", "Bike", "Public_Transportation", "Walking"],
+        label="Transportation used (MTRANS)",
+    )
+    output_text = gr.Textbox(label="Obesity Risk Prediction Result", lines=10)
+    iface = gr.Interface(
+        fn=predict_obesity_risk,
+        inputs=[
+            model_choice_input,
+            gender_input,
+            age_input,
+            height_input,
+            weight_input,
+            family_history_input,
+            favc_input,
+            fcvc_input,
+            ncp_input,
+            caec_input,
+            smoke_input,
+            ch2o_input,
+            scc_input,
+            faf_input,
+            tue_input,
+            calc_input,
+            mtrans_input,
+        ],
+        outputs=output_text,
+        title="Obesity Risk Prediction (Multi-Model)",
+        description="Select a machine learning model and enter patient details to predict the obesity risk category.",
+        examples=[
+            [
+                cfg.RANDOM_FOREST,
+                "Male",
+                25,
+                1.8,
+                85,
+                "yes",
+                "yes",
+                2,
+                3,
+                "Sometimes",
+                "no",
+                2,
+                "no",
+                1,
+                1,
+                "Frequently",
+                "Public_Transportation",
+            ],
+            [
+                cfg.LIGHTGBM,
+                "Female",
+                30,
+                1.65,
+                70,
+                "yes",
+                "yes",
+                3,
+                3,
+                "Frequently",
+                "no",
+                3,
+                "yes",
+                2,
+                0,
+                "Sometimes",
+                "Automobile",
+            ],
+            [
+                cfg.XGBOOST,
+                "Female",
+                21,
+                1.52,
+                56,
+                "yes",
+                "no",
+                3,
+                3,
+                "Sometimes",
+                "yes",
+                3,
+                "yes",
+                3,
+                0,
+                "Sometimes",
+                "Public_Transportation",
+            ],
+        ],
+    )
+    iface.launch(share=share)
+    print("--- Gradio App Launched ---")
+if __name__ == "__main__":
+    launch_gradio_app(share=False)
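The row-building step inside `predict_obesity_risk` can be isolated and tested without a model. The sketch below is one possible variant, not the code in this commit: `build_input_row` is a hypothetical helper, the column list is made up, and it assumes every category seen in training has its own one-hot column. Unlike the loop in `app.py`, it reports categorical values that match no column instead of skipping them silently.

```python
import pandas as pd


def build_input_row(columns, numeric_values, categorical_values):
    """Build a one-row frame in the training column layout.

    Returns the frame and a list of one-hot column names that were
    requested but do not exist in `columns`.
    """
    row = pd.DataFrame(0.0, index=[0], columns=list(columns))
    for name, value in numeric_values.items():
        if name in row.columns:
            row.loc[0, name] = float(value)
    unmatched = []
    for prefix, value in categorical_values.items():
        column_name = f"{prefix}_{value}"
        if column_name in row.columns:
            row.loc[0, column_name] = 1.0
        else:
            unmatched.append(column_name)
    return row, unmatched


# Hypothetical, shortened column list for illustration.
columns = ["Age", "Height", "Gender_Female", "Gender_Male", "CALC_Sometimes", "CALC_no"]
row, unmatched = build_input_row(
    columns,
    numeric_values={"Age": 25, "Height": 1.8},
    categorical_values={"Gender": "Male", "CALC": "Always"},
)
```

Here `unmatched` contains `"CALC_Always"`, which a caller could surface as a warning, since an all-zero block for a feature is an input the model never saw during training under the stated assumption.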

model/LightGBM_model_columns.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e29f9392a10215c7ab1d416c9bfae37b8cfc5e86ee2d6f8e640991913fa2f0a2
+size 327

model/RandomForest_model_columns.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e29f9392a10215c7ab1d416c9bfae37b8cfc5e86ee2d6f8e640991913fa2f0a2
+size 327

model/XGBoost_model_columns.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e29f9392a10215c7ab1d416c9bfae37b8cfc5e86ee2d6f8e640991913fa2f0a2
+size 327

model/label_encoder.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:43bd445b421e9956b488fc242ab67cbe5c6adfde6447803285b0c6ae47d21587
+size 608

model/obesity_LightGBM_model.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:da6364966e3d5f5d59d3d3db27046be8eff8ea1c8a5637fa8e111b0420f9e457
+size 2418732

model/obesity_RandomForest_model.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bd825a6560f7873a60f94bfdd377b37648e8daebd116e9b72b00b18d1d3c2b29
+size 20145505

model/obesity_XGBoost_model.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:366d23a5061befeb45609742c6bfbe6c05d04907168ae150147df824457a4c68
+size 3021443

obesity_rp/__init__.py ADDED

File without changes

obesity_rp/config.py ADDED

	@@ -0,0 +1,27 @@

+# File and Directory Paths
+DATASET_FILE = "datasets/ObesityDataSet_raw_and_data_sinthetic.csv"
+MODEL_DIR = "model/"
+# Target Variable
+TARGET_COLUMN = "NObeyesdad"
+# Feature Columns
+CATEGORICAL_FEATURES = [
+    "Gender",
+    "family_history_with_overweight",
+    "FAVC",
+    "CAEC",
+    "SMOKE",
+    "SCC",
+    "CALC",
+    "MTRANS",
+]
+# Numerical Features
+NUMERICAL_FEATURES = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
+# Model Identifiers
+RANDOM_FOREST = "RandomForest"
+LIGHTGBM = "LightGBM"
+XGBOOST = "XGBoost"
+MODEL_CHOICES = [RANDOM_FOREST, LIGHTGBM, XGBOOST]
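These identifiers determine the artifact file names that `app.py` expects under `MODEL_DIR`. As a sketch, a start-up check could derive every expected path from the config and list the ones that are missing; `artifact_paths` and `missing` are hypothetical names, and the constants are copied here only to keep the example self-contained.

```python
import os

# Values copied from obesity_rp/config.py for a self-contained example.
MODEL_DIR = "model/"
MODEL_CHOICES = ["RandomForest", "LightGBM", "XGBoost"]


def artifact_paths(model_name, model_dir=MODEL_DIR):
    """Return the three artifact paths used for one model, following app.py's naming."""
    return {
        "model": os.path.join(model_dir, f"obesity_{model_name}_model.joblib"),
        "columns": os.path.join(model_dir, f"{model_name}_model_columns.joblib"),
        "encoder": os.path.join(model_dir, "label_encoder.joblib"),
    }


# Collect every expected file that is not present on disk.
missing = sorted(
    {
        path
        for name in MODEL_CHOICES
        for path in artifact_paths(name).values()
        if not os.path.exists(path)
    }
)
```

Running such a check once at launch would surface a missing file before the first prediction request rather than during it.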

requirements.txt ADDED

	@@ -0,0 +1,68 @@

+aiofiles==24.1.0
+annotated-types==0.7.0
+anyio==4.10.0
+brotli==1.1.0
+certifi==2025.8.3
+charset-normalizer==3.4.2
+click==8.2.1
+contourpy==1.3.2
+cycler==0.12.1
+exceptiongroup==1.3.0
+fastapi==0.116.1
+ffmpy==0.6.1
+filelock==3.18.0
+fonttools==4.59.0
+fsspec==2025.7.0
+gradio==5.41.0
+gradio-client==1.11.0
+groovy==0.1.2
+h11==0.16.0
+hf-xet==1.1.7
+httpcore==1.0.9
+httpx==0.28.1
+huggingface-hub==0.34.3
+idna==3.10
+jinja2==3.1.6
+joblib==1.5.1
+kiwisolver==1.4.8
+lightgbm==4.6.0
+markdown-it-py==3.0.0
+markupsafe==3.0.2
+matplotlib==3.10.5
+mdurl==0.1.2
+numpy==2.2.6
+orjson==3.11.1
+packaging==25.0
+pandas==2.3.1
+pillow==11.3.0
+pydantic==2.11.7
+pydantic-core==2.33.2
+pydub==0.25.1
+pygments==2.19.2
+pyparsing==3.2.3
+python-dateutil==2.9.0.post0
+python-multipart==0.0.20
+pytz==2025.2
+pyyaml==6.0.2
+requests==2.32.4
+rich==14.1.0
+ruff==0.12.7
+safehttpx==0.1.6
+scikit-learn==1.7.1
+scipy==1.15.3
+semantic-version==2.10.0
+shellingham==1.5.4
+six==1.17.0
+sniffio==1.3.1
+starlette==0.47.2
+threadpoolctl==3.6.0
+tomlkit==0.13.3
+tqdm==4.67.1
+typer==0.16.0
+typing-extensions==4.14.1
+typing-inspection==0.4.1
+tzdata==2025.2
+urllib3==2.5.0
+uvicorn==0.35.0
+websockets==15.0.1
+xgboost==3.0.3