devjas1 committed
Commit 8013c07 · 1 Parent(s): e61ce7a

(DOCS): Add comprehensive codebase inventory for ml-polymer-recycling project

Files changed (1)
  1. CODEBASE_INVENTORY.md +191 -0
CODEBASE_INVENTORY.md ADDED
@@ -0,0 +1,191 @@
+ # Codebase Inventory: ml-polymer-recycling
+
+ ## Overview
+
+ A machine learning system for predicting and classifying polymer aging from spectral data. The project implements three CNN architectures (Figure2CNN, ResNet1D, ResNet18Vision) that classify polymer degradation levels as a proxy for recyclability. It is built with Python and PyTorch and provides both CLI and Streamlit UI workflows.
+
+ ## Inventory by Category
+
+ ### 1. Core Application Modules
+
+ - **Module Name**: `models/registry.py`
+ - **Purpose**: Central registry system for model architectures providing dynamic model selection and instantiation
+ - **Key Exports/Functions**: `choices()`, `build(name, input_length)`, `_REGISTRY`
+ - **Key Dependencies**: `models.figure2_cnn`, `models.resnet_cnn`, `models.resnet18_vision`
+ - **External Dependencies**: `typing`
+
+ - **Module Name**: `models/figure2_cnn.py`
+ - **Purpose**: CNN architecture implementation based on literature (Neo et al. 2023) for 1D Raman spectral classification
+ - **Key Exports/Functions**: `Figure2CNN` class with conv blocks and classifier layers
+ - **Key Dependencies**: None (self-contained)
+ - **External Dependencies**: `torch`, `torch.nn`
+
+ - **Module Name**: `models/resnet_cnn.py`
+ - **Purpose**: ResNet1D implementation with residual blocks for deeper spectral feature learning
+ - **Key Exports/Functions**: `ResNet1D`, `ResidualBlock1D` classes
+ - **Key Dependencies**: None (self-contained)
+ - **External Dependencies**: `torch`, `torch.nn`
+
+ - **Module Name**: `models/resnet18_vision.py`
+ - **Purpose**: ResNet18 architecture adapted for 1D spectral data processing
+ - **Key Exports/Functions**: `ResNet18Vision` class
+ - **Key Dependencies**: None (self-contained)
+ - **External Dependencies**: `torch`, `torch.nn`
+
+ - **Module Name**: `utils/preprocessing.py`
+ - **Purpose**: Spectral data preprocessing utilities including resampling, baseline correction, smoothing, and normalization
+ - **Key Exports/Functions**: `preprocess_spectrum()`, `resample_spectrum()`, `remove_baseline()`, `normalize_spectrum()`, `smooth_spectrum()`
+ - **Key Dependencies**: None (self-contained)
+ - **External Dependencies**: `numpy`, `scipy.interpolate`, `scipy.signal`, `sklearn.preprocessing`
+
+ - **Module Name**: `scripts/preprocess_dataset.py`
+ - **Purpose**: Comprehensive dataset preprocessing pipeline with CLI interface for Raman spectral data
+ - **Key Exports/Functions**: `preprocess_dataset()`, `resample_spectrum()`, `label_file()`, preprocessing helper functions
+ - **Key Dependencies**: `scripts.discover_raman_files`, `scripts.plot_spectrum`
+ - **External Dependencies**: `numpy`, `scipy`, `sklearn.preprocessing`
+
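The preprocessing order listed for `utils/preprocessing.py` and `scripts/preprocess_dataset.py` (resampling, baseline correction, smoothing, normalization) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project's implementation: the `_sketch` function names, the straight-line endpoint baseline, the moving-average smoother, min-max scaling, and the default `target_len=500` are all assumptions chosen for brevity. The inventory indicates that the real modules rely on `numpy`, `scipy`, and `sklearn.preprocessing`.

```python
from bisect import bisect_right


def resample_sketch(x, y, target_len):
    """Linearly interpolate (x, y) onto target_len evenly spaced x values.

    Assumes x is strictly increasing and target_len >= 2.
    """
    lo, hi = x[0], x[-1]
    step = (hi - lo) / (target_len - 1)
    out = []
    for i in range(target_len):
        xi = lo + i * step
        # Index of the segment [x[j], x[j + 1]] used to interpolate xi.
        j = min(max(bisect_right(x, xi) - 1, 0), len(x) - 2)
        t = (xi - x[j]) / (x[j + 1] - x[j])
        out.append(y[j] + t * (y[j + 1] - y[j]))
    return out


def remove_baseline_sketch(y):
    """Subtract the straight line joining the first and last points."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [v - (y[0] + slope * i) for i, v in enumerate(y)]


def smooth_sketch(y, window=5):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(y)):
        seg = y[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out


def normalize_sketch(y):
    """Min-max scale to [0, 1]; a flat signal maps to all zeros."""
    lo, hi = min(y), max(y)
    if hi == lo:
        return [0.0 for _ in y]
    return [(v - lo) / (hi - lo) for v in y]


def preprocess_sketch(x, y, target_len=500):
    """Apply the four steps in the order the inventory lists them."""
    y = resample_sketch(x, y, target_len)  # fixed-length model input
    y = remove_baseline_sketch(y)          # assumed: linear baseline
    y = smooth_sketch(y)                   # assumed: moving average
    return normalize_sketch(y)             # assumed: min-max scaling
```

Whatever the concrete methods are, training and inference must apply the same steps in the same order; a model trained on normalized spectra will generally not behave sensibly on raw intensities.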
+ ### 2. Scripts & Automation
+
+ - **Script Name**: `validate_pipeline.sh`
+ - **Trigger**: Manual execution (`./validate_pipeline.sh`)
+ - **Apparent Function**: Canonical smoke test validating the complete Raman pipeline from preprocessing through training to inference
+ - **Dependencies**: `conda`, `scripts/preprocess_dataset.py`, `scripts/train_model.py`, `scripts/run_inference.py`, `scripts/plot_spectrum.py`
+
+ - **Script Name**: `scripts/train_model.py`
+ - **Trigger**: CLI execution (`python scripts/train_model.py`)
+ - **Apparent Function**: 10-fold stratified cross-validation training with multiple model architectures and preprocessing options
+ - **Dependencies**: `scripts/preprocess_dataset`, `models/registry`, reproducibility seeds, PyTorch training loop
+
+ - **Script Name**: `scripts/run_inference.py`
+ - **Trigger**: CLI execution (`python scripts/run_inference.py`)
+ - **Apparent Function**: Single spectrum inference with model loading, preprocessing, and prediction output to JSON
+ - **Dependencies**: `models/registry`, `scripts/preprocess_dataset`, trained model weights
+
+ - **Script Name**: `scripts/plot_spectrum.py`
+ - **Trigger**: CLI execution (`python scripts/plot_spectrum.py`)
+ - **Apparent Function**: Visualization tool for Raman spectra with matplotlib plotting and file I/O
+ - **Dependencies**: Spectrum loading utilities
+
+ - **Script Name**: `scripts/discover_raman_files.py`
+ - **Trigger**: Imported by other scripts
+ - **Apparent Function**: File discovery and labeling utilities for Raman dataset management
+ - **Dependencies**: File system operations, regex pattern matching
+
+ - **Script Name**: `scripts/list_spectra.py`
+ - **Trigger**: CLI or import
+ - **Apparent Function**: Dataset inventory and spectrum listing utilities
+ - **Dependencies**: File system scanning
+
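`scripts/train_model.py` is described as running 10-fold stratified cross-validation, meaning each fold keeps roughly the same class proportions as the full dataset. The sketch below illustrates how such folds can be formed using only the standard library. `stratified_folds` and `train_val_splits` are hypothetical helpers; the actual script may well use `sklearn.model_selection.StratifiedKFold` instead, which this inventory does not confirm.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=42):
    """Return n_splits lists of sample indices with a balanced class mix.

    Indices of each class are shuffled, then dealt round-robin across
    folds, so per-class counts differ by at most one between folds.
    """
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for k, idx in enumerate(members):
            folds[(offset + k) % n_splits].append(idx)
        offset += len(members)  # continue dealing where the last class stopped
    return folds


def train_val_splits(labels, n_splits=10, seed=42):
    """Yield (train_indices, val_indices) pairs, one per fold."""
    folds = stratified_folds(labels, n_splits, seed)
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(val)
```

Each sample lands in exactly one validation fold, so every spectrum is used for validation once and for training in the remaining folds.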
+ ### 3. Configuration & Data
+
+ - **File Name**: `deploy/hf-space/requirements.txt`
+ - **Purpose**: Python dependencies for Hugging Face Spaces deployment
+ - **Key Contents/Structure**: `streamlit`, `torch`, `torchvision`, `scikit-learn`, `scipy`, `numpy`, `pandas`, `matplotlib`, `fastapi`, `altair`, `huggingface-hub`
+
+ - **File Name**: `deploy/hf-space/Dockerfile`
+ - **Purpose**: Container configuration for Hugging Face Spaces deployment
+ - **Key Contents/Structure**: Python 3.13-slim base, build tools installation, Streamlit server configuration on port 8501
+
+ - **File Name**: `deploy/hf-space/sample_data/sta-1.txt`
+ - **Purpose**: Sample Raman spectrum for UI demonstration
+ - **Key Contents/Structure**: Two-column wavenumber/intensity data format
+
+ - **File Name**: `deploy/hf-space/sample_data/sta-2.txt`
+ - **Purpose**: Additional sample Raman spectrum for UI testing
+ - **Key Contents/Structure**: Two-column wavenumber/intensity data format
+
+ - **File Name**: `.gitignore`
+ - **Purpose**: Version control exclusions for datasets, build artifacts, and system files
+ - **Key Contents/Structure**: `datasets/`, `__pycache__/`, model weights, logs, environment files, deprecated scripts
+
+ - **File Name**: `MANIFEST.git`
+ - **Purpose**: Git object manifest listing all tracked files with hashes
+ - **Key Contents/Structure**: File paths, permissions, and SHA hashes for repository contents
+
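The sample files under `deploy/hf-space/sample_data/` are described as two-column wavenumber/intensity text. A loader for that layout might look like the sketch below. The whitespace-or-comma delimiter, the `#` comment convention, and the name `parse_two_column_spectrum` are assumptions for illustration; the inventory does not specify them, and the sample values are made up.

```python
import re


def parse_two_column_spectrum(text):
    """Parse 'wavenumber intensity' rows into two lists of floats.

    Blank lines and lines starting with '#' are skipped. Rows that do not
    hold exactly two numeric fields raise ValueError with the line number.
    """
    wavenumbers, intensities = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        try:
            x, y = float(fields[0]), float(fields[1])
        except ValueError as exc:
            raise ValueError(f"line {lineno}: non-numeric value") from exc
        wavenumbers.append(x)
        intensities.append(y)
    return wavenumbers, intensities


# Made-up rows in the assumed layout, not taken from sta-1.txt.
sample = """# wavenumber intensity
200.0 12.5
201.5 13.0
203.0, 12.8
"""
x, y = parse_two_column_spectrum(sample)
print(len(x), x[0], y[-1])
```

Failing loudly on malformed rows is a deliberate choice here: silently skipping them would shift the wavenumber axis and corrupt any later resampling.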
+ ### 4. Assets & Documentation
+
+ - **Asset Name**: `README.md`
+ - **Purpose**: Primary project documentation with objectives, architecture overview, and usage instructions
+ - **Key Contents/Structure**: Project goals, model architectures table, structure diagram, installation guides, sample commands
+
+ - **Asset Name**: `GROUND_TRUTH_PIPELINE.md`
+ - **Purpose**: Comprehensive empirical baseline inventory documenting every aspect of the current system
+ - **Key Contents/Structure**: 635-line detailed documentation of data handling, preprocessing, models, CLI workflow, UI workflow, and gap identification
+
+ - **Asset Name**: `docs/ENVIRONMENT_GUIDE.md`
+ - **Purpose**: Environment management guide for local and HPC deployment
+ - **Key Contents/Structure**: Conda vs venv setup instructions, platform-specific configurations, dependency management
+
+ - **Asset Name**: `docs/PROJECT_TIMELINE.md`
+ - **Purpose**: Development milestone tracking and project progression documentation
+ - **Key Contents/Structure**: Phase-based timeline from project kickoff through model expansion, tagged milestones
+
+ - **Asset Name**: `docs/sprint_log.md`
+ - **Purpose**: Sprint-based development log with specific technical changes and testing results
+ - **Key Contents/Structure**: Chronological entries with goals, changes, tests, and notes for each development sprint
+
+ - **Asset Name**: `docs/REPRODUCIBILITY.md`
+ - **Purpose**: Scientific reproducibility guidelines and artifact control documentation
+ - **Key Contents/Structure**: Validation procedures, artifact integrity, experimental controls
+
+ - **Asset Name**: `docs/HPC_REMOTE_SETUP.md`
+ - **Purpose**: High-performance computing environment setup for CWRU Pioneer cluster
+ - **Key Contents/Structure**: HPC-specific configurations, remote access procedures, computational resource management
+
+ - **Asset Name**: `docs/BACKEND_MIGRATION_LOG.md`
+ - **Purpose**: Technical migration documentation for backend architecture changes
+ - **Key Contents/Structure**: Migration procedures, compatibility notes, system architecture evolution
+
+ ### 5. Deployment & UI Components
+
+ - **Module Name**: `deploy/hf-space/app.py`
+ - **Purpose**: Streamlit web application for polymer classification with file upload and model inference
+ - **Key Exports/Functions**: Streamlit UI components, model loading, preprocessing pipeline, prediction display
+ - **Key Dependencies**: `models.figure2_cnn`, `models.resnet_cnn`, `utils.preprocessing` (fallback), `scripts.preprocess_dataset`
+ - **External Dependencies**: `streamlit`, `torch`, `matplotlib`, `PIL`, `numpy`
+
+ ### 6. Model Artifacts & Outputs
+
+ - **File Name**: `outputs/resnet_model.pth`
+ - **Purpose**: Trained ResNet1D model weights for Raman spectrum classification
+ - **Key Contents/Structure**: PyTorch state dictionary with model parameters
+
+ ## Workflows & Interactions
+
+ - **CLI Training Pipeline**: The main training workflow starts with `scripts/train_model.py`, which imports the model registry (`models/registry.py`) to dynamically select an architecture (Figure2CNN, ResNet1D, or ResNet18Vision). It uses `scripts/preprocess_dataset.py` to load and preprocess Raman spectra from `datasets/rdwp/`, applying resampling, baseline correction, smoothing, and normalization. The script performs 10-fold stratified cross-validation, saves trained models to `outputs/{model}_model.pth`, and writes diagnostics to `outputs/logs/`.
+
+ - **CLI Inference Pipeline**: Running `scripts/run_inference.py` loads a trained model via the registry, processes a single Raman spectrum file through the same preprocessing pipeline, and outputs predictions in JSON format to `outputs/inference/`.
+
+ - **UI Workflow**: The Streamlit application (`deploy/hf-space/app.py`) provides a web interface that loads trained models and accepts file uploads or sample data selection. It currently bypasses part of the preprocessing pipeline before running inference: baseline correction, smoothing, and normalization are missing.
+
+ - **Validation Workflow**: The `validate_pipeline.sh` script orchestrates a complete pipeline test by sequentially running preprocessing, training, inference, and plotting scripts to ensure reproducibility and catch regressions.
+
+ - **Model Registry System**: All model architectures are centrally managed through `models/registry.py`, which provides dynamic model selection for both CLI training and inference scripts, ensuring consistent model instantiation across the codebase.
+
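The registry pattern described for `models/registry.py`, with `choices()` and `build(name, input_length)` backed by a `_REGISTRY` mapping, can be sketched as below. To keep the example runnable without PyTorch, the three classes are hypothetical stand-ins that only record `input_length`; the real modules define `torch.nn` models, and their constructor signatures are an assumption here. The keys `figure2`, `resnet`, and `resnet18vision` are the registered names this inventory reports.

```python
from typing import Callable, Dict, List


class _StubModel:
    """Placeholder for a torch.nn.Module; stores the input length only."""

    def __init__(self, input_length: int) -> None:
        self.input_length = input_length


class Figure2CNN(_StubModel):
    pass


class ResNet1D(_StubModel):
    pass


class ResNet18Vision(_StubModel):
    pass


# Maps a CLI-friendly name to a factory that takes the spectrum length.
_REGISTRY: Dict[str, Callable[[int], _StubModel]] = {
    "figure2": lambda n: Figure2CNN(input_length=n),
    "resnet": lambda n: ResNet1D(input_length=n),
    "resnet18vision": lambda n: ResNet18Vision(input_length=n),
}


def choices() -> List[str]:
    """Names accepted by build(), e.g. for an argparse `choices=` argument."""
    return list(_REGISTRY)


def build(name: str, input_length: int) -> _StubModel:
    """Instantiate the architecture registered under `name`."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown model '{name}'; choose from {choices()}")
    return _REGISTRY[name](input_length)
```

With this shape, a training or inference script only needs `build(args.model, input_length)`, and adding an architecture means adding one entry to the mapping rather than touching each caller.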
+ ## External Dependencies Summary
+
+ - **PyTorch Ecosystem**: `torch`, `torchvision` for deep learning model implementation and training
+ - **Scientific Computing**: `numpy`, `scipy` for numerical operations and signal processing
+ - **Machine Learning**: `scikit-learn` for preprocessing, metrics, and cross-validation utilities
+ - **Data Handling**: `pandas` for structured data manipulation
+ - **Visualization**: `matplotlib`, `seaborn` for plotting and data visualization
+ - **Web Framework**: `streamlit` for interactive web application deployment
+ - **Image Processing**: `PIL` (Pillow) for image handling in the UI
+ - **Development Tools**: `argparse` for CLI interfaces, `json` for data serialization (both from the Python standard library)
+ - **Deployment**: `fastapi`, `uvicorn` for potential API deployment, `huggingface-hub` for model hosting
+
+ ## Key Findings & Assumptions
+
+ - **Critical Preprocessing Gap**: The UI workflow in `deploy/hf-space/app.py` bypasses essential preprocessing steps (baseline correction, smoothing, normalization) that are standard in the CLI pipeline, potentially causing prediction inconsistencies.
+
+ - **Model Architecture Assumptions**: Three CNN architectures are registered (`figure2`, `resnet`, `resnet18vision`), but the codebase suggests only two are currently trained and validated in the standard pipeline.
+
+ - **Dataset Structure**: The system assumes Raman spectra are stored as two-column text files (wavenumber, intensity) in the `datasets/rdwp/` directory, with filenames indicating weathering conditions for automated labeling.
+
+ - **Environment Fragmentation**: The project uses different dependency management systems (Conda for local development, venv for HPC, pip requirements for deployment), which could lead to environment inconsistencies.
+
+ - **Reproducibility Controls**: The project places strong emphasis on scientific reproducibility, with fixed random seeds, deterministic algorithms, and comprehensive validation scripts. This indicates research-oriented code that requires strict experimental controls.
+
+ - **Deployment Readiness**: The Hugging Face Spaces deployment setup suggests the project is intended for public demonstration or research sharing, but the preprocessing gap needs resolution for production use.
+
+ - **Legacy Code Management**: The `.gitignore` and documentation references suggest active management of deprecated FTIR-related components, indicating focused scope refinement to Raman-only analysis.
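
The fixed-seed controls noted under Reproducibility Controls are commonly implemented with a small helper along the following lines. `set_seed` is a hypothetical name, and the exact set of calls the project makes is not stated in this inventory. The NumPy and PyTorch parts are skipped when those packages are not installed, so the sketch runs with the standard library alone.

```python
import os
import random


def set_seed(seed: int = 42) -> None:
    """Seed the RNGs that typically affect a PyTorch training run."""
    # Only influences hash randomization in Python processes started later.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; stop here
    torch.manual_seed(seed)  # seeds PyTorch's default generators
    # cuDNN flags: trade some speed for repeatable convolution kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```

Calling such a helper once at the start of each cross-validation run, before data shuffling and model construction, is what makes fold assignments and weight initialization repeatable.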