# Enhanced Merge with Intelligent Null Filling

## Overview

The `merge_temp.py` module now fills nulls by first looking for a value with the **same symbol + interval_timestamp** combination in the other data sources, before falling back to other strategies.

## Key Features

### 1. Symbol-First Null Filling Strategy

When merging temp files into existing features, the system now:

1. **Identifies null values** in the target (merged) dataset
2. **Searches for matching records** in the source (temp) dataset using `(symbol, interval_timestamp)` as the key
3. **Fills null values** only when:
   - The same symbol + timestamp exists in the temp data
   - The temp data has a non-null value for that column
   - The column exists in both datasets
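The three conditions above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the code in `merge_temp.py`: the helper name `fill_nulls_by_key` is hypothetical, and the sketch assumes keys in the merged frame are unique.

```python
from __future__ import annotations

import pandas as pd

KEY = ["symbol", "interval_timestamp"]


def fill_nulls_by_key(df_merged: pd.DataFrame, df_temp: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    """Hypothetical helper: fill nulls in df_merged from df_temp on exact key match."""
    # Index both frames by the key. Duplicate keys in temp are dropped so each
    # lookup is unambiguous (assumption: the first occurrence is good enough).
    target = df_merged.set_index(KEY)
    source = df_temp.drop_duplicates(subset=KEY).set_index(KEY)

    filled = 0
    # Only columns present in both datasets are considered.
    for col in target.columns.intersection(source.columns):
        before = int(target[col].isna().sum())
        if before == 0:
            continue  # nothing to fill in this column
        # reindex aligns temp values to the merged keys. Keys missing from temp
        # become NaN, so only exact matches with non-null values fill anything.
        aligned = source[col].reindex(target.index)
        target[col] = target[col].fillna(aligned)
        filled += before - int(target[col].isna().sum())

    return target.reset_index(), filled
```

Existing non-null values are never overwritten, because `fillna` only touches positions that are null in the merged data.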

### 2. Cross-Dataset Null Filling

During train file creation and merged features generation:

1. **Combines multiple sources** (archive, features, temp files)
2. **Creates a comprehensive lookup** of all non-null values by `(symbol, timestamp)`
3. **Fills nulls intelligently** using the best available data from any source
4. **Preserves data integrity** by only filling with values from the exact same symbol and time
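The lookup in steps 2 and 3 can be sketched in plain Python, with records as dicts and `None` standing in for null. The function names and the "earlier source wins" priority are assumptions for illustration; this document does not say how competing values are ranked.

```python
from typing import Any, Iterable

Record = dict[str, Any]
Key = tuple[str, int]


def build_lookup(*sources: Iterable[Record]) -> dict[Key, Record]:
    """Map (symbol, timestamp) -> {column: first non-null value seen}."""
    lookup: dict[Key, Record] = {}
    for source in sources:  # earlier sources take priority (assumption)
        for row in source:
            key = (row["symbol"], row["interval_timestamp"])
            known = lookup.setdefault(key, {})
            for col, value in row.items():
                if value is not None and col not in known:
                    known[col] = value
    return lookup


def fill_from_lookup(rows: list[Record], lookup: dict[Key, Record]) -> int:
    """Fill None fields in place from the lookup; return how many were filled."""
    filled = 0
    for row in rows:
        known = lookup.get((row["symbol"], row["interval_timestamp"]), {})
        for col, value in row.items():
            if value is None and col in known:
                row[col] = known[col]
                filled += 1
    return filled
```

Because the lookup is keyed on the exact `(symbol, timestamp)` pair, a row with no matching key is left untouched rather than filled from a neighbor.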

### 3. Enhanced Functions

#### `fill_nulls_from_temp(df_merged, df_temp)`

- Fills null values in `df_merged` using data from `df_temp`
- Only fills when an exact `(symbol, interval_timestamp)` match exists
- Returns the count of null values filled
- Logs the filling process in detail

#### `merge_temp_to_merged(temp_name, merged_name)`

- Performs null filling before adding new records
- Reports both new records added and null values filled
- Maintains existing functionality while adding null handling
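The ordering described above (fill first, then append) can be sketched as a two-pass merge over in-memory records. This is an assumption-laden illustration: the real function takes file names, whereas the hypothetical `merge_temp_rows` below works on lists of dicts so that it stays self-contained.

```python
from typing import Any

Record = dict[str, Any]


def _key(row: Record) -> tuple:
    return (row["symbol"], row["interval_timestamp"])


def merge_temp_rows(merged: list[Record], temp: list[Record]) -> tuple[int, int]:
    """Hypothetical sketch: fill nulls first, then append unseen keys.

    Mutates `merged` in place and returns (new_records_added, null_values_filled).
    """
    existing = {_key(row): row for row in merged}
    pending: list[Record] = []
    filled = 0

    # Pass 1: fill nulls in records that already exist in the merged data.
    for row in temp:
        target = existing.get(_key(row))
        if target is None:
            pending.append(row)
            continue
        for col, value in row.items():
            if col in target and target[col] is None and value is not None:
                target[col] = value
                filled += 1

    # Pass 2: append records whose key was not present before the merge.
    added = 0
    for row in pending:
        if _key(row) in existing:
            continue  # duplicate key within temp, already appended
        new_row = dict(row)
        merged.append(new_row)
        existing[_key(row)] = new_row
        added += 1

    return added, filled
```

Returning both counts mirrors the reporting behavior listed above: one number for new records, one for filled nulls.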

#### `merge_all_to_train()`

- Performs cross-source null filling during train file creation
- Combines archive, features, and temp data
- Eliminates duplicates while preserving the best available data
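This document does not define "best available data". One plausible reading, sketched here purely as an assumption, keeps the duplicate with the most non-null fields for each `(symbol, interval_timestamp)` key. The function names are hypothetical.

```python
from typing import Any

Record = dict[str, Any]


def non_null_count(row: Record) -> int:
    """Number of fields in the record that are not None."""
    return sum(value is not None for value in row.values())


def dedupe_keep_most_complete(rows: list[Record]) -> list[Record]:
    """Keep one record per key, preferring the one with the most non-null fields.

    On a tie, the record seen first is kept.
    """
    best: dict[tuple, Record] = {}
    for row in rows:
        key = (row["symbol"], row["interval_timestamp"])
        current = best.get(key)
        if current is None or non_null_count(row) > non_null_count(current):
            best[key] = row
    return list(best.values())
```

A different ranking, such as preferring the most recent source, would only change the comparison inside the loop.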

#### `create_merged_features()`

- Creates the main `merged_features.parquet` file
- Combines crypto and stock features with cross-dataset null filling
- Provides comprehensive statistics on the merge process

## Benefits

### **Data Quality Improvements**

- **Preserves Symbol Characteristics**: Uses same-symbol data to fill nulls
- **Temporal Consistency**: Only uses data from the exact same timestamp
- **No Data Pollution**: Never mixes data from different symbols or times

### **Better Coverage**

- **Reduced Null Values**: Fewer missing values in final datasets
- **Multi-Source Integration**: Leverages all available data sources
- **Smart Deduplication**: Keeps the best version of each record

### **Robust Processing**

- **Error Handling**: Graceful handling of missing files and edge cases
- **Detailed Logging**: Clear reporting of what was filled and why
- **Validation**: Built-in checks to ensure data integrity

## Usage Examples

### Test the Null Filling

```bash
cd src/merge
python merge_temp.py --test-null-filling
```

### Run Normal Merge Process

```bash
cd src/merge
python merge_temp.py
```

### Manual Testing

```bash
cd src/merge
python test_null_filling_merge.py
```

## Integration with Main Pipeline

The enhanced merge runs automatically as part of the main pipeline:

1. **After data collection**: Temp files are created with new data
2. **During `merge_temp.py`**: Null filling happens automatically
3. **Before normalization**: Data is as complete as possible
4. **Train file creation**: Uses all available historical data

## Example Output

```
[INFO] Attempting to fill nulls in 4 columns: ['price', 'volume', 'rsi', 'macd']
[INFO] Successfully filled 7 null values from temp data
[INFO] Column 'price': 0 nulls remaining
[INFO] Column 'volume': 0 nulls remaining
[INFO] Column 'rsi': 0 nulls remaining
[INFO] Column 'macd': 0 nulls remaining
[OK] Added 15 new records from crypto_features.parquet to crypto_features.parquet, filled 7 null values
```

## Performance Considerations

- **Efficient Lookups**: Uses dictionary-based lookups for O(1) average-case access per key
- **Memory Optimized**: Processes data in chunks when possible
- **Minimal Overhead**: Only processes columns that actually have nulls

## Future Enhancements

- **Time-Window Filling**: Fill with the nearest timestamp if no exact match is found
- **Interpolation**: Smart interpolation for numerical features
- **Symbol Similarity**: Fill using similar symbols when an exact match is unavailable
- **Quality Scoring**: Rank data sources by quality for better filling decisions
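Time-window filling is listed as future work, so nothing below reflects existing behavior. As a rough sketch of the idea, assuming integer timestamps and a hypothetical `max_gap` tolerance, a nearest-match lookup for one symbol and one column could use the standard `bisect` module:

```python
from bisect import bisect_left
from typing import Optional


def nearest_value(
    timestamps: list[int], values: list[float], ts: int, max_gap: int
) -> Optional[float]:
    """Return the value at the timestamp closest to ts, if within max_gap.

    `timestamps` must be sorted ascending and aligned with `values`.
    On a tie, the earlier timestamp wins.
    """
    if not timestamps:
        return None
    pos = bisect_left(timestamps, ts)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(timestamps)]
    best = min(candidates, key=lambda i: abs(timestamps[i] - ts))
    if abs(timestamps[best] - ts) > max_gap:
        return None  # nearest record is too far away to be trusted
    return values[best]
```

Setting `max_gap` to 0 reduces this to the exact-match rule the current system uses.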

This enhanced merge system gives your machine learning models more complete data while preserving the integrity and characteristics of each financial instrument.