Data Preprocessing Pipeline Documentation
This document details the data preprocessing pipeline implemented for the medical diagnosis system. The pipeline handles multiple data types, applies a hybrid class-balancing strategy, and prepares data for both training and inference.
Table of Contents
- Data Pipeline Overview
- Feature Processing
- Data Balancing
- Dataset Management
- Quality Controls
- Implementation Notes

Data Pipeline Overview
Base Architecture
The preprocessing pipeline is built around two main components:
- `GoogleDrivePathManager`: Manages data paths and directory structures
- `MedicalDataset`: Handles data loading and preprocessing
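As a rough illustration, the two components could be wired together as below; every path, method, and parameter name in this sketch is an assumption, not the repository's actual API:

```python
from pathlib import Path
import json

class GoogleDrivePathManager:
    """Resolves dataset locations under a mounted Drive root (sketch;
    the root path and method names are assumptions)."""

    def __init__(self, root: str = "/content/drive/MyDrive"):
        self.root = Path(root)

    def data_path(self, name: str) -> Path:
        return self.root / "data" / name


class MedicalDataset:
    """Loads raw JSON cases via the path manager (sketch)."""

    def __init__(self, paths: GoogleDrivePathManager):
        self.paths = paths
        self.cases: list = []

    def load(self, filename: str) -> None:
        with open(self.paths.data_path(filename)) as f:
            self.cases = json.load(f)
```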
Data Flow
- Raw data loading from JSON files
- Initial validation and cleaning
- Feature extraction and normalization
- Data balancing
- Training/validation/test split preparation
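A minimal sketch of how these five stages could chain together. Every helper name is a placeholder; two of them (`hybrid_balance`, `split_train_val_test`) are sketched later in this document:

```python
def run_pipeline(raw_path):
    """Illustrative chaining of the five stages above; every helper name
    is a placeholder, not the repository's actual function."""
    cases = load_json_cases(raw_path)      # 1. raw data loading
    cases = validate_and_clean(cases)      # 2. validation and cleaning
    X, y = extract_and_normalize(cases)    # 3. features + normalization
    X, y = hybrid_balance(X, y)            # 4. data balancing
    return split_train_val_test(X, y)      # 5. split preparation
```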
Feature Processing
Evidence Processing
- Vocabulary Creation: Creates standardized evidence vocabulary from all cases
- Evidence Types:
  - Binary symptoms (208 types)
  - Categorical symptoms (10 types)
  - Multi-choice symptoms (5 types)
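A minimal sketch of vocabulary creation and multi-hot encoding; the `"evidences"` key, and the assumption that categorical/multi-choice values are folded into the evidence code string, are illustrative rather than confirmed details of the schema:

```python
def build_vocabulary(cases):
    """Collect every distinct evidence code across all cases into a
    fixed index (categorical/multi-choice values are assumed to be
    folded into the code string, so each value gets its own slot)."""
    codes = sorted({e for case in cases for e in case["evidences"]})
    return {code: i for i, code in enumerate(codes)}

def encode_evidence(case, vocab):
    """Multi-hot vector over the standardized vocabulary."""
    vec = [0.0] * len(vocab)
    for e in case["evidences"]:
        if e in vocab:
            vec[vocab[e]] = 1.0
    return vec
```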
Demographic Features
- Age normalization: Scaled to [0,1] range
- Sex encoding: Binary (M=1, F=0)
- Feature alignment with medical standards
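A sketch of the two encodings, assuming `AGE`/`SEX` field names in the raw schema and an upper age bound of 100 for scaling (both assumptions):

```python
MAX_AGE = 100  # assumed scaling bound

def encode_demographics(case):
    """Age clipped and scaled to [0, 1]; sex encoded M=1, F=0.
    The 'AGE'/'SEX' keys are assumptions about the raw schema."""
    age = min(max(case["AGE"], 0), MAX_AGE) / MAX_AGE
    sex = 1.0 if case["SEX"] == "M" else 0.0
    return [age, sex]
```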
Statistical Characteristics
- Average evidence markers per case: 13.56 (σ = 5.06)
- Average symptoms per case: 10.07 (σ = 4.69)
- Average antecedents per case: 3.49 (σ = 2.23)
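Figures like these can be recomputed directly from the raw cases; the `"evidences"` key is again an assumption:

```python
import statistics

def evidence_stats(cases):
    """Mean and standard deviation of evidence markers per case."""
    counts = [len(c["evidences"]) for c in cases]
    return statistics.mean(counts), statistics.stdev(counts)
```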
Data Balancing
Hybrid Balancing Strategy
The pipeline selects an over-sampling technique per condition based on sample count; a combined code sketch follows the two descriptions below:
SMOTE (Synthetic Minority Over-sampling Technique)
- Applied to well-represented conditions (≥10 samples)
- Parameters:
  - k_neighbors: 5
  - random_state: 42
Random Oversampling
- Applied to rare conditions (<10 samples)
- Preserves original feature relationships
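Putting the two techniques together, here is a sketch using imbalanced-learn (an assumed dependency), with the threshold and SMOTE parameters taken from the lists above:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler

SMOTE_MIN = 10  # threshold separating the two techniques

def hybrid_balance(X, y, random_state=42):
    counts = Counter(y)
    target = max(counts.values())

    # Rare conditions (<10 samples): plain random oversampling to the
    # target count, which duplicates real cases and so preserves the
    # original feature relationships.
    rare = {c: target for c, n in counts.items() if n < SMOTE_MIN}
    if rare:
        X, y = RandomOverSampler(sampling_strategy=rare,
                                 random_state=random_state).fit_resample(X, y)

    # Well-represented conditions (>=10 samples): SMOTE with the
    # parameters documented above.
    well = {c: target for c, n in counts.items() if SMOTE_MIN <= n < target}
    if well:
        X, y = SMOTE(sampling_strategy=well, k_neighbors=5,
                     random_state=random_state).fit_resample(X, y)
    return X, y
```

Random oversampling runs first so that every class reaches the target count without SMOTE ever being asked to interpolate within a class too small for its k=5 neighborhood.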
Balance Monitoring
- Pre-balancing class distribution tracking
- Post-balancing validation
- Distribution equality checks
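A minimal monitoring helper along these lines covers all three checks:

```python
from collections import Counter

def verify_balance(y_before, y_after):
    """Log pre-/post-balancing distributions and check equality."""
    pre, post = Counter(y_before), Counter(y_after)
    print("pre :", dict(pre))
    print("post:", dict(post))
    if len(set(post.values())) != 1:
        raise ValueError("post-balancing class counts are not equal")
```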
Dataset Management
Data Splits
- Training: 103,189 cases
- Validation: 17,197 cases
- Testing: 17,199 cases
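These counts correspond to roughly a 75/12.5/12.5 stratified split. One way to produce it (the exact split mechanism used by the pipeline is an assumption):

```python
from sklearn.model_selection import train_test_split

def split_train_val_test(X, y, seed=42):
    """Stratified ~75/12.5/12.5 split consistent with the counts above."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```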
Case Structure
Each preprocessed case contains:
- Normalized demographic data
- Encoded evidence features
- Primary diagnosis label
- Differential diagnosis probabilities
- Original feature mappings for traceability
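Expressed as a container type, a preprocessed case might look like this; the field names are illustrative, mirroring the list above:

```python
from dataclasses import dataclass, field

@dataclass
class PreprocessedCase:
    demographics: list                 # normalized age, encoded sex
    evidence: list                     # multi-hot evidence vector
    diagnosis: str                     # primary diagnosis label
    differential: dict                 # condition -> probability
    feature_map: dict = field(default_factory=dict)  # original mappings
```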
State Management
- Comprehensive tracking of:
  - Covered evidence sets
  - Response histories
  - Condition assessments
  - Question queue status
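A hypothetical container mirroring that tracked state:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class SessionState:
    """Hypothetical container for the state listed above."""
    covered_evidence: set = field(default_factory=set)
    responses: list = field(default_factory=list)
    condition_scores: dict = field(default_factory=dict)
    question_queue: deque = field(default_factory=deque)
```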
Quality Controls
Validation Checkpoints
Input Data Validation
- Data type checking
- Range validation
- Missing value detection
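A sketch of these three checks; the `AGE`/`SEX`/`evidences` keys and the 0-120 age range are assumptions:

```python
def validate_case(case):
    """Minimal input checks: missing values, types, and ranges (sketch)."""
    for key in ("AGE", "SEX", "evidences"):
        if key not in case or case[key] is None:
            raise ValueError(f"missing value for {key!r}")
    if not isinstance(case["AGE"], (int, float)) or not 0 <= case["AGE"] <= 120:
        raise ValueError("AGE out of range")
    if case["SEX"] not in ("M", "F"):
        raise ValueError("unexpected SEX encoding")
```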
Feature Validation
- Normalization bounds checking
- Encoding consistency verification
- Relationship preservation validation
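For example, a bounds check along these lines:

```python
import numpy as np

def check_normalization(X, low=0.0, high=1.0):
    """Raise if any encoded feature falls outside the expected bounds."""
    X = np.asarray(X, dtype=float)
    if ((X < low) | (X > high)).any():
        raise ValueError("feature values outside normalization bounds")
```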
Output Validation
- Class distribution verification
- Feature correlation preservation
- Synthetic sample quality assessment
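Correlation preservation can be spot-checked by comparing pairwise feature correlations before and after balancing (a sketch, used here as a simple proxy):

```python
import numpy as np

def correlation_drift(X_orig, X_balanced):
    """Largest absolute change in pairwise feature correlations."""
    c0 = np.corrcoef(np.asarray(X_orig), rowvar=False)
    c1 = np.corrcoef(np.asarray(X_balanced), rowvar=False)
    return float(np.nanmax(np.abs(c0 - c1)))
```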
Error Handling
- Comprehensive error logging
- Data integrity checks
- Automatic correction for common issues
- Exception handling with detailed reporting
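A representative pattern for the logging-and-reporting behavior described above; `preprocess_case` is a placeholder for the real per-case routine:

```python
import logging

logger = logging.getLogger("preprocessing")

def safe_preprocess(case):
    """Wrap per-case processing with logging and detailed reporting (sketch)."""
    try:
        return preprocess_case(case)  # placeholder for the real routine
    except (KeyError, ValueError) as exc:
        logger.error("case %s failed: %s", case.get("id", "?"), exc)
        return None
```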
Implementation Notes
Key Considerations
- Maintenance of medical terminology consistency
- Preservation of symptom relationships
- Handling of rare conditions
- Performance optimization for large datasets
Best Practices
- Regular validation of preprocessing results
- Monitoring of class distributions
- Documentation of preprocessing decisions
- Version control of preprocessing parameters
Note: This documentation reflects the preprocessing pipeline implementation as of January 2025. For the most recent updates, please check the repository history.