acharya-jyu committed on
Commit
e57b5f8
·
verified ·
1 Parent(s): 278b29f

Upload Data-Processing-Summary.md

Summary of data pre-processing and cleaning.

Files changed (1)
  1. Data-Processing-Summary.md +127 -0
Data-Processing-Summary.md ADDED
@@ -0,0 +1,127 @@
+ # Data Preprocessing Pipeline Documentation
+
+ This document describes the data preprocessing pipeline for the medical diagnosis system. The pipeline encodes binary, categorical, and multi-choice evidence alongside demographic features, balances classes with a hybrid of SMOTE and random oversampling, and prepares data for both training and inference.
+
+ ## Table of Contents
+ 1. [Data Pipeline Overview](#data-pipeline-overview)
+ 2. [Feature Processing](#feature-processing)
+ 3. [Data Balancing](#data-balancing)
+ 4. [Dataset Management](#dataset-management)
+ 5. [Quality Controls](#quality-controls)
+ 6. [Implementation Notes](#implementation-notes)
+
+ ## Data Pipeline Overview
+
+ ### Base Architecture
+ The preprocessing pipeline is built around two main components:
+ - `GoogleDrivePathManager`: Manages data paths and directory structures
+ - `MedicalDataset`: Handles data loading and preprocessing
+
+ ### Data Flow
+ 1. Raw data loading from JSON files
+ 2. Initial validation and cleaning
+ 3. Feature extraction and normalization
+ 4. Data balancing
+ 5. Training/validation/test split preparation
+
+ ## Feature Processing
+
+ ### Evidence Processing
+ - **Vocabulary Creation**: Creates a standardized evidence vocabulary from all cases
+ - **Evidence Types**:
+   - Binary symptoms (208 types)
+   - Categorical symptoms (10 types)
+   - Multi-choice symptoms (5 types)
+
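The vocabulary step can be illustrated with a minimal sketch. It assumes each case stores its evidence names under an `evidences` key and encodes only presence or absence; the key name, the sample evidence names, and both helper functions are hypothetical and not taken from the pipeline's code. Categorical and multi-choice evidence would also need their values encoded, which this sketch leaves out.

```python
from typing import Dict, Iterable, List


def build_vocabulary(cases: Iterable[dict]) -> Dict[str, int]:
    """Map every evidence name seen in any case to a stable column index."""
    # Sorting makes the index assignment reproducible across runs.
    names = sorted({name for case in cases for name in case["evidences"]})
    return {name: idx for idx, name in enumerate(names)}


def encode_evidence(case: dict, vocab: Dict[str, int]) -> List[float]:
    """Multi-hot encode one case; names missing from the vocabulary are skipped."""
    vector = [0.0] * len(vocab)
    for name in case["evidences"]:
        idx = vocab.get(name)
        if idx is not None:
            vector[idx] = 1.0
    return vector


# Hypothetical sample cases, for illustration only.
cases = [
    {"evidences": ["fever", "cough"]},
    {"evidences": ["cough", "chest_pain"]},
]
vocab = build_vocabulary(cases)
encoded = [encode_evidence(case, vocab) for case in cases]
```

Building the vocabulary once from all cases keeps the column order identical across the training, validation, and test sets.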
+ ### Demographic Features
+ - Age normalization: Scaled to the [0, 1] range
+ - Sex encoding: Binary (M=1, F=0)
+ - Feature alignment with medical standards
+
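A minimal sketch of the two demographic encodings follows. The scaling bounds (0 to 100 years) and the clipping behaviour are assumptions made for illustration; the document only states the target range and the M=1, F=0 mapping.

```python
def normalize_age(age: float, min_age: float = 0.0, max_age: float = 100.0) -> float:
    """Min-max scale an age to [0, 1]; the bounds are assumed, not documented."""
    scaled = (age - min_age) / (max_age - min_age)
    # Clip so that ages outside the assumed bounds still land in [0, 1].
    return min(1.0, max(0.0, scaled))


def encode_sex(sex: str) -> int:
    """Encode sex as documented: M -> 1, F -> 0."""
    if sex == "M":
        return 1
    if sex == "F":
        return 0
    raise ValueError(f"unexpected sex value: {sex!r}")
```

Raising on unexpected values, rather than silently mapping them to 0, keeps encoding errors visible during validation.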
+ ### Statistical Characteristics
+ - Average evidence markers per case: 13.56 (σ = 5.06)
+ - Average symptoms per case: 10.07 (σ = 4.69)
+ - Average antecedents per case: 3.49 (σ = 2.23)
+
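The averages are consistent with each evidence marker being either a symptom or an antecedent (10.07 + 3.49 = 13.56). As a rough sketch, per-case statistics of this kind could be computed as below. Whether the reported σ is a population or a sample standard deviation is not stated, so the choice of `pstdev` is an assumption, and the input counts are made up.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def count_summary(counts: Sequence[int]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of per-case counts."""
    return mean(counts), pstdev(counts)


# Hypothetical number of evidence markers in three cases.
evidence_counts = [10, 14, 18]
avg, sigma = count_summary(evidence_counts)
```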
+ ## Data Balancing
+
+ ### Hybrid Balancing Strategy
+ The pipeline uses a hybrid balancing approach that picks the oversampling method by class size:
+
+ 1. **SMOTE (Synthetic Minority Over-sampling Technique)**
+    - Applied to well-represented conditions (≥10 samples)
+    - Parameters:
+      - k_neighbors: 5
+      - random_state: 42
+
+ 2. **Random Oversampling**
+    - Applied to rare conditions (<10 samples)
+    - Duplicates existing samples, so original feature relationships are preserved
+
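The hybrid strategy can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the parameter names `k_neighbors` and `random_state` mirror the ones listed above, but `smote_sample`, `hybrid_balance`, and `min_smote_samples` are hypothetical names, and balancing every class up to the size of the largest class is an assumption. A production pipeline would more likely rely on a library implementation of SMOTE.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Vector = List[float]


def smote_sample(rows: Sequence[Vector], k_neighbors: int, rng: random.Random) -> Vector:
    """Interpolate between a random row and one of its k nearest same-class rows."""
    base = rng.choice(rows)
    others = [row for row in rows if row is not base]
    # Order the remaining rows by squared Euclidean distance to the base row.
    others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
    neighbor = rng.choice(others[:k_neighbors])
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]


def hybrid_balance(
    X: Sequence[Vector],
    y: Sequence[Hashable],
    min_smote_samples: int = 10,
    k_neighbors: int = 5,
    random_state: int = 42,
) -> Tuple[List[Vector], List[Hashable]]:
    """Oversample every class to the majority size (assumed target)."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            if len(rows) >= min_smote_samples:
                new_row = smote_sample(rows, k_neighbors, rng)  # synthetic sample
            else:
                new_row = list(rng.choice(rows))  # plain duplicate of a real sample
            X_out.append(new_row)
            y_out.append(label)
    return X_out, y_out
```

SMOTE needs more same-class samples than `k_neighbors` to find neighbors to interpolate with, which is one reason to fall back to duplication for very small classes.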
+ ### Balance Monitoring
+ - Pre-balancing class distribution tracking
+ - Post-balancing validation
+ - Distribution equality checks
+
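A distribution check of this kind could look like the sketch below. The function names and the `tolerance` parameter are hypothetical; the document does not say how strictly equality is enforced.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def class_distribution(labels: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Count how many samples each class has."""
    return dict(Counter(labels))


def is_balanced(labels: Sequence[Hashable], tolerance: int = 0) -> bool:
    """True when the largest and smallest class differ by at most `tolerance`."""
    counts = Counter(labels).values()
    if not counts:
        return True  # an empty label set is trivially balanced
    return max(counts) - min(counts) <= tolerance
```

Calling `class_distribution` before and after balancing gives the two snapshots to compare, and `is_balanced` on the post-balancing labels serves as the equality check.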
+ ## Dataset Management
+
+ ### Data Splits
+ The 137,585 cases are split roughly 75% / 12.5% / 12.5%:
+ - Training: 103,189 cases
+ - Validation: 17,197 cases
+ - Testing: 17,199 cases
+
+ ### Case Structure
+ Each preprocessed case contains:
+ - Normalized demographic data
+ - Encoded evidence features
+ - Primary diagnosis label
+ - Differential diagnosis probabilities
+ - Original feature mappings for traceability
+
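One way to picture a preprocessed case is as a small record type. The sketch below is hypothetical: the class name, field names, and types are chosen for illustration and are not taken from `MedicalDataset`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PreprocessedCase:
    """Illustrative container for one preprocessed case (field names assumed)."""

    age: float                      # normalized to [0, 1]
    sex: int                        # 1 = M, 0 = F
    evidence: List[float]           # encoded evidence features
    diagnosis: str                  # primary diagnosis label
    differential: Dict[str, float]  # condition -> probability
    feature_names: List[str] = field(default_factory=list)  # original names, for traceability

    def __post_init__(self) -> None:
        # Basic sanity checks matching the documented encodings.
        if not 0.0 <= self.age <= 1.0:
            raise ValueError("age must be normalized to [0, 1]")
        if self.sex not in (0, 1):
            raise ValueError("sex must be encoded as 0 or 1")
```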
+ ### State Management
+ - Tracked state:
+   - Covered evidence sets
+   - Response histories
+   - Condition assessments
+   - Question queue status
+
+ ## Quality Controls
+
+ ### Validation Checkpoints
+ 1. Input Data Validation
+    - Data type checking
+    - Range validation
+    - Missing value detection
+
+ 2. Feature Validation
+    - Normalization bounds checking
+    - Encoding consistency verification
+    - Relationship preservation validation
+
+ 3. Output Validation
+    - Class distribution verification
+    - Feature correlation preservation
+    - Synthetic sample quality assessment
+
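The input validation checkpoint (type, range, and missing-value checks) might be sketched as below. The field names in `REQUIRED_FIELDS`, the accepted sex values, and the age limit are assumptions made for the example; the actual raw JSON schema is not described in this document.

```python
from typing import Any, Dict, List

# Hypothetical raw field names; the real schema may differ.
REQUIRED_FIELDS = ("age", "sex", "evidences", "pathology")


def validate_case(case: Dict[str, Any], max_age: float = 120.0) -> List[str]:
    """Return a list of problems found in one raw case; empty means valid."""
    problems: List[str] = []

    # Missing value detection.
    for name in REQUIRED_FIELDS:
        if case.get(name) is None:
            problems.append(f"missing value: {name}")

    # Data type and range checks for age.
    age = case.get("age")
    if age is not None:
        if isinstance(age, bool) or not isinstance(age, (int, float)):
            problems.append("wrong type: age must be a number")
        elif not 0 <= age <= max_age:
            problems.append(f"out of range: age={age}")

    # Allowed-value check for sex.
    sex = case.get("sex")
    if sex is not None and sex not in ("M", "F"):
        problems.append(f"unexpected value: sex={sex!r}")

    # Data type check for the evidence list.
    evidences = case.get("evidences")
    if evidences is not None and not isinstance(evidences, list):
        problems.append("wrong type: evidences must be a list")

    return problems
```

Returning a list of problems instead of raising on the first one lets a single pass report everything wrong with a case, which suits the detailed error reporting described under Error Handling.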
+ ### Error Handling
+ - Comprehensive error logging
+ - Data integrity checks
+ - Automatic correction for common issues
+ - Exception handling with detailed reporting
+
+ ## Implementation Notes
+
+ ### Key Considerations
+ - Maintenance of medical terminology consistency
+ - Preservation of symptom relationships
+ - Handling of rare conditions
+ - Performance optimization for large datasets
+
+ ### Best Practices
+ 1. Regular validation of preprocessing results
+ 2. Monitoring of class distributions
+ 3. Documentation of preprocessing decisions
+ 4. Version control of preprocessing parameters
+
+ ---
+
+ *Note: This documentation reflects the preprocessing pipeline implementation as of January 2025. For the most recent updates, please check the repository history.*