
Data Preprocessing Pipeline Documentation

This document describes the data preprocessing pipeline for the medical diagnosis system. The pipeline handles binary, categorical, and multi-choice evidence alongside demographic features, balances class distributions with a hybrid oversampling strategy, and prepares the data for both training and inference.

Table of Contents

  1. Data Pipeline Overview
  2. Feature Processing
  3. Data Balancing
  4. Dataset Management
  5. Quality Controls

Data Pipeline Overview

Base Architecture

The preprocessing pipeline is built around two main components, sketched after the list:

  • GoogleDrivePathManager: Manages data paths and directory structures
  • MedicalDataset: Handles data loading and preprocessing
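The class interfaces aren't given in this summary; a minimal sketch of how the two components could fit together (the root path, file layout, and method names are illustrative assumptions):

```python
from pathlib import Path
import json

class GoogleDrivePathManager:
    """Resolves data files under a mounted Google Drive root (root path assumed)."""
    def __init__(self, root: str = "/content/drive/MyDrive/ddxplus"):
        self.root = Path(root)

    def path(self, name: str) -> Path:
        return self.root / name

class MedicalDataset:
    """Loads one split's raw JSON cases; the preprocessing steps follow below."""
    def __init__(self, paths: GoogleDrivePathManager, split: str):
        with open(paths.path(f"{split}.json")) as f:
            self.cases = json.load(f)
```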

Data Flow

  1. Raw data loading from JSON files
  2. Initial validation and cleaning
  3. Feature extraction and normalization
  4. Data balancing
  5. Training/validation/test split preparation

Feature Processing

Evidence Processing

  • Vocabulary Creation: builds a standardized evidence vocabulary from all cases (see the sketch after this list)
  • Evidence Types:
    • Binary symptoms (208 types)
    • Categorical symptoms (10 types)
    • Multi-choice symptoms (5 types)
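The encoding itself isn't spelled out in this summary. One plausible sketch, assuming DDXPlus-style cases that store evidence codes such as E_55 or E_55_@_V_12 under an EVIDENCES key (categorical and multi-choice values are folded into the code string, so each code gets its own slot):

```python
def build_evidence_vocabulary(cases):
    """Map every evidence code seen in the data to a fixed feature index."""
    vocab = {}
    for case in cases:
        for code in case["EVIDENCES"]:
            vocab.setdefault(code, len(vocab))
    return vocab

def encode_evidences(case, vocab):
    """Multi-hot vector over the vocabulary: binary symptoms occupy one slot;
    categorical and multi-choice symptoms occupy one slot per chosen value."""
    vec = [0.0] * len(vocab)
    for code in case["EVIDENCES"]:
        if code in vocab:
            vec[vocab[code]] = 1.0
    return vec
```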

Demographic Features

  • Age normalization: Scaled to [0,1] range
  • Sex encoding: Binary (M=1, F=0); both transforms are sketched below
  • Feature alignment with medical standards
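A sketch of the two transforms (the maximum age of 100 and the AGE/SEX keys are assumptions; the summary only states the [0,1] target range):

```python
MAX_AGE = 100.0  # assumed normalization constant; the summary gives only the target range

def encode_demographics(case):
    age = min(max(case["AGE"], 0), MAX_AGE) / MAX_AGE  # scale age into [0, 1]
    sex = 1.0 if case["SEX"] == "M" else 0.0           # binary encoding, M=1 / F=0
    return [age, sex]
```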

Statistical Characteristics

  • Average evidence markers per case: 13.56 (σ = 5.06)
  • Average symptoms per case: 10.07 (σ = 4.69)
  • Average antecedents per case: 3.49 (σ = 2.23)
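These figures can be reproduced from the raw cases in a few lines (reusing the loading sketch above; the EVIDENCES key is an assumption):

```python
import numpy as np

cases = MedicalDataset(GoogleDrivePathManager(), "train").cases  # from the sketch above
evidence_counts = np.array([len(c["EVIDENCES"]) for c in cases])
print(f"evidence markers per case: {evidence_counts.mean():.2f} "
      f"(sigma = {evidence_counts.std():.2f})")
```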

Data Balancing

Hybrid Balancing Strategy

The pipeline implements a hybrid balancing approach, selecting the oversampling method by class frequency (sketched after the list):

  1. SMOTE (Synthetic Minority Over-sampling Technique)

    • Applied to well-represented conditions (≥10 samples)
    • Parameters:
      • k_neighbors: 5
      • random_state: 42
  2. Random Oversampling

    • Applied to rare conditions (<10 samples)
    • Preserves original feature relationships
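The balancing code isn't reproduced in this summary, but the behaviour described above maps directly onto imbalanced-learn's SMOTE and RandomOverSampler; a sketch under that assumption (the function name and the way the 10-sample threshold is applied are mine):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler

def hybrid_balance(X, y, smote_threshold=10, random_state=42):
    """Oversample every class up to the majority-class count."""
    counts = Counter(y)
    target = max(counts.values())
    # Conditions with >= 10 samples: SMOTE interpolates synthetic cases
    # between each sample and its k nearest same-class neighbours.
    smote_classes = {c: target for c, n in counts.items()
                     if smote_threshold <= n < target}
    # Conditions with < 10 samples: plain duplication, which preserves the
    # original feature relationships exactly.
    rare_classes = {c: target for c, n in counts.items() if n < smote_threshold}
    if smote_classes:
        X, y = SMOTE(sampling_strategy=smote_classes, k_neighbors=5,
                     random_state=random_state).fit_resample(X, y)
    if rare_classes:
        X, y = RandomOverSampler(sampling_strategy=rare_classes,
                                 random_state=random_state).fit_resample(X, y)
    return X, y
```

Restricting SMOTE to classes with at least 10 samples makes sense because interpolation with k_neighbors = 5 needs at least 6 genuine samples per class to find meaningful neighbours.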

Balance Monitoring

  • Pre-balancing class distribution tracking
  • Post-balancing validation
  • Distribution equality checks
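A minimal version of these checks (illustrative):

```python
from collections import Counter

def check_balance(y_before, y_after):
    before, after = Counter(y_before), Counter(y_after)
    assert set(before) == set(after), "classes lost during balancing"
    assert len(set(after.values())) == 1, "post-balancing distribution is unequal"
    print(f"{len(before)} classes; min/max before: "
          f"{min(before.values())}/{max(before.values())}; "
          f"per-class count after: {next(iter(after.values()))}")
```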

Dataset Management

Data Splits

  • Training: 103,189 cases
  • Validation: 17,197 cases
  • Testing: 17,199 cases
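These counts amount to roughly a 75/12.5/12.5 split of 137,585 cases. If the splits were drawn rather than taken from pre-released files, a stratified two-stage split reproduces those proportions (a sketch, not the confirmed procedure):

```python
from sklearn.model_selection import train_test_split

# X, y: features and diagnosis labels from the preceding steps.
# Carve off 25% as a holdout, then halve it into validation and test,
# stratifying on the diagnosis label at both stages.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)
```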

Case Structure

Each preprocessed case contains:

  • Normalized demographic data
  • Encoded evidence features
  • Primary diagnosis label
  • Differential diagnosis probabilities
  • Original feature mappings for traceability
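Put together, a single preprocessed case might look like this (field names are illustrative; the summary lists the contents but not the schema):

```python
case = {
    "demographics": [0.45, 1.0],               # normalized age, sex (M=1)
    "evidence_vector": [1.0, 0.0, 1.0, 0.0],   # multi-hot evidences (truncated)
    "diagnosis": "Bronchitis",                 # primary diagnosis label
    "differential": {"Bronchitis": 0.62,       # differential probabilities
                     "Pneumonia": 0.21,
                     "URTI": 0.17},
    "evidence_codes": ["E_55", "E_91_@_V_12"], # original codes for traceability
}
```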

State Management

  • Comprehensive tracking of:
    • Covered evidence sets
    • Response histories
    • Condition assessments
    • Question queue status
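One way to hold this session state (a hypothetical sketch; the summary names the tracked quantities but not their structure):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SessionState:
    covered_evidence: set = field(default_factory=set)          # evidence already elicited
    response_history: list = field(default_factory=list)        # (question, answer) pairs
    condition_assessments: dict = field(default_factory=dict)   # condition -> current score
    question_queue: deque = field(default_factory=deque)        # questions still pending
```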

Quality Controls

Validation Checkpoints

  1. Input Data Validation

    • Data type checking
    • Range validation
    • Missing value detection
  2. Feature Validation

    • Normalization bounds checking
    • Encoding consistency verification
    • Relationship preservation validation
  3. Output Validation

    • Class distribution verification
    • Feature correlation preservation
    • Synthetic sample quality assessment
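A condensed sketch covering checkpoints 1 and 2 (the [0,1] bound comes from the normalization described above; everything else is an assumed layout):

```python
import numpy as np

def validate_features(X, y):
    X = np.asarray(X)
    # 1. Input validation: type checks and missing-value detection.
    assert np.issubdtype(X.dtype, np.floating), "features must be floats"
    assert not np.isnan(X).any(), "missing values detected"
    # 2. Feature validation: normalization bounds and label alignment.
    assert X.min() >= 0.0 and X.max() <= 1.0, "features outside [0, 1]"
    assert len(X) == len(y), "feature/label length mismatch"
```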

Error Handling

  • Comprehensive error logging
  • Data integrity checks
  • Automatic correction for common issues
  • Exception handling with detailed reporting
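In per-case terms this can be as simple as a guarded transform with logged skips (illustrative; reuses encode_demographics from the sketch above):

```python
import logging

logger = logging.getLogger("preprocessing")

def safe_encode(case):
    """Guard per-case processing so one malformed record cannot halt the run."""
    try:
        return encode_demographics(case)  # per-case transform sketched earlier
    except (KeyError, TypeError, ValueError) as exc:
        logger.error("skipping malformed case: %s", exc)
        return None
```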

Implementation Notes

Key Considerations

  • Maintenance of medical terminology consistency
  • Preservation of symptom relationships
  • Handling of rare conditions
  • Performance optimization for large datasets

Best Practices

  1. Regular validation of preprocessing results
  2. Monitoring of class distributions
  3. Documentation of preprocessing decisions
  4. Version control of preprocessing parameters

Note: This documentation reflects the preprocessing pipeline implementation as of January 2025. For the most recent updates, please check the repository history.