Enhanced Merge with Intelligent Null Filling
Overview
The merge_temp.py module has been enhanced with sophisticated null filling capabilities that prioritize finding values from the same symbol + interval_timestamp combination across different data sources before falling back to other strategies.
Key Features
1. Symbol-First Null Filling Strategy
When merging temp files into existing features, the system now:
- Identifies null values in the target (merged) dataset
- Searches for matching records in the source (temp) dataset using (symbol, interval_timestamp) as the key
- Fills null values only when:
- The same symbol + timestamp exists in the temp data
- The temp data has a non-null value for that column
- The column exists in both datasets
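The three conditions above can be sketched with pandas as follows. This is a minimal illustration, not the code in merge_temp.py: the function name fill_nulls_sketch and the KEYS constant are hypothetical, and it assumes both datasets are DataFrames that carry symbol and interval_timestamp columns.

```python
from typing import Tuple

import pandas as pd

# Assumed key columns, taken from the (symbol, interval_timestamp) key described above.
KEYS = ["symbol", "interval_timestamp"]


def fill_nulls_sketch(df_merged: pd.DataFrame, df_temp: pd.DataFrame) -> Tuple[pd.DataFrame, int]:
    """Hypothetical helper: fill nulls in df_merged from df_temp rows with the same key.

    Returns a filled copy of df_merged and the number of values filled.
    """
    # Condition 3: only columns present in both datasets are candidates.
    shared = [c for c in df_merged.columns if c in df_temp.columns and c not in KEYS]

    # Keep one temp row per key so the index lookup below is unambiguous.
    temp = df_temp.drop_duplicates(subset=KEYS, keep="last").set_index(KEYS)

    result = df_merged.copy()
    row_keys = pd.MultiIndex.from_frame(result[KEYS])
    filled = 0

    for col in shared:
        null_mask = result[col].isna().to_numpy()
        if not null_mask.any():
            continue  # nothing to fill in this column

        # Condition 1: reindex yields NaN wherever the key is absent from temp.
        candidates = temp[col].reindex(row_keys).to_numpy()

        # Condition 2: the temp value itself must be non-null.
        usable = null_mask & pd.notna(candidates)
        result.loc[usable, col] = candidates[usable]
        filled += int(usable.sum())

    return result, filled
```

Rows whose key is missing from the temp data, and columns that exist on only one side, are left untouched, so a null is only ever replaced by a value recorded for the same symbol and timestamp.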
2. Cross-Dataset Null Filling
During train file creation and merged features generation:
- Combines multiple sources (archive, features, temp files)
- Creates a comprehensive lookup of all non-null values by (symbol, timestamp)
- Fills nulls intelligently using the best available data from any source
- Preserves data integrity by only filling with values from the exact same symbol and time
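One way to picture the cross-source lookup is a plain dictionary keyed by (symbol, timestamp). The sketch below is an assumption-laden illustration rather than the module's implementation: build_lookup and fill_from_lookup are hypothetical names, records are plain dicts with None standing in for null, and "best available" is simplified to "first non-null value in source priority order".

```python
from typing import Any, Dict, Iterable, List, Tuple

Key = Tuple[str, int]
Record = Dict[str, Any]
KEY_FIELDS = ("symbol", "interval_timestamp")


def build_lookup(sources: Iterable[List[Record]]) -> Dict[Key, Record]:
    """Map (symbol, timestamp) to the first non-null value seen for each column."""
    lookup: Dict[Key, Record] = {}
    for records in sources:  # earlier sources take priority (an assumption)
        for rec in records:
            key = (rec["symbol"], rec["interval_timestamp"])
            known = lookup.setdefault(key, {})
            for col, value in rec.items():
                if col in KEY_FIELDS:
                    continue
                if value is not None and col not in known:
                    known[col] = value
    return lookup


def fill_from_lookup(records: List[Record], lookup: Dict[Key, Record]) -> int:
    """Fill None values in place from the lookup; return how many were filled."""
    filled = 0
    for rec in records:
        known = lookup.get((rec["symbol"], rec["interval_timestamp"]), {})
        for col, value in rec.items():
            if value is None and col in known:
                rec[col] = known[col]
                filled += 1
    return filled
```

Because the dictionary key is the exact (symbol, timestamp) pair, a value can never leak from one symbol or time to another, which matches the data integrity point above.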
3. Enhanced Functions
fill_nulls_from_temp(df_merged, df_temp)
- Fills null values in df_merged using data from df_temp
- Only fills when an exact (symbol, interval_timestamp) match exists
- Returns the count of null values filled
- Provides detailed logging of the filling process
merge_temp_to_merged(temp_name, merged_name)
- Enhanced to perform null filling before adding new records
- Reports both new records added and null values filled
- Maintains existing functionality while adding intelligent null handling
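The fill-then-append order described for merge_temp_to_merged can be sketched on in-memory DataFrames. The real function takes file names; this hypothetical merge_temp_sketch skips file I/O and assumes that keys are unique within each dataset and that appended rows are restricted to the merged dataset's columns.

```python
from typing import Tuple

import pandas as pd

KEYS = ["symbol", "interval_timestamp"]  # assumed key columns


def merge_temp_sketch(df_merged: pd.DataFrame, df_temp: pd.DataFrame) -> Tuple[pd.DataFrame, int, int]:
    """Hypothetical sketch: returns (combined, new_records_added, nulls_filled)."""
    merged = df_merged.set_index(KEYS)
    temp = df_temp.drop_duplicates(subset=KEYS, keep="last").set_index(KEYS)

    # Step 1: fill nulls in existing rows. fillna with a DataFrame aligns on
    # index and columns, so only matching keys and shared columns are used.
    nulls_before = int(merged.isna().sum().sum())
    filled = merged.fillna(temp)
    n_filled = nulls_before - int(filled.isna().sum().sum())

    # Step 2: append temp rows whose key is not yet in the merged data,
    # keeping the merged dataset's column layout (an assumption).
    is_new = ~temp.index.isin(merged.index)
    new_rows = temp.loc[is_new].reindex(columns=merged.columns)

    combined = pd.concat([filled, new_rows]).reset_index()
    return combined, int(is_new.sum()), n_filled
```

Filling first means the null count reflects only rows that already existed, so the two reported numbers (new records and filled values) do not overlap.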
merge_all_to_train()
- Cross-source null filling during train file creation
- Combines archive, features, and temp data optimally
- Eliminates duplicates while preserving the best available data
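The source does not define what "best available" means for duplicates. As one possible reading, the sketch below keeps, for each (symbol, interval_timestamp), the row with the most non-null values and breaks ties in favor of the earlier source. The function name dedupe_keep_most_complete and the helper columns _source_rank and _non_null are hypothetical.

```python
from typing import List

import pandas as pd

KEYS = ["symbol", "interval_timestamp"]  # assumed key columns


def dedupe_keep_most_complete(frames: List[pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical sketch: one row per key, preferring the most complete row."""
    # Tag each row with the position of its source so ties are deterministic.
    tagged = [frame.assign(_source_rank=i) for i, frame in enumerate(frames)]
    combined = pd.concat(tagged, ignore_index=True)

    # Completeness score: number of non-null values in the data columns.
    data_cols = combined.columns.drop("_source_rank")
    combined["_non_null"] = combined[data_cols].notna().sum(axis=1)

    # Most complete first; among equals, the earlier source wins.
    ordered = combined.sort_values(["_non_null", "_source_rank"], ascending=[False, True])
    best = ordered.drop_duplicates(subset=KEYS, keep="first")

    return (
        best.drop(columns=["_non_null", "_source_rank"])
        .sort_values(KEYS)
        .reset_index(drop=True)
    )
```

This picks one whole row per key and does not combine columns across rows; running a cross-source null fill beforehand would let the surviving row also carry values that only another source had.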
create_merged_features()
- Creates the main merged_features.parquet file
- Combines crypto and stock features with cross-dataset null filling
- Provides comprehensive statistics on the merge process
Benefits
Data Quality Improvements
- Preserves Symbol Characteristics: Uses same-symbol data to fill nulls
- Temporal Consistency: Only uses data from the exact same timestamp
- No Data Pollution: Never mixes data from different symbols or times
Better Coverage
- Reduced Null Values: Significantly fewer missing values in final datasets
- Multi-Source Integration: Leverages all available data sources
- Smart Deduplication: Keeps the best version of each record
Robust Processing
- Error Handling: Graceful handling of missing files and edge cases
- Detailed Logging: Clear reporting of what was filled and why
- Validation: Built-in checks to ensure data integrity
Usage Examples
Test the Null Filling
cd src/merge
python merge_temp.py --test-null-filling
Run Normal Merge Process
cd src/merge
python merge_temp.py
Manual Testing
cd src/merge
python test_null_filling_merge.py
Integration with Main Pipeline
The enhanced merge functionality is automatically integrated into the main pipeline:
- After data collection: Temp files are created with new data
- During merge_temp.py: Null filling happens automatically
- Before normalization: Data is as complete as possible
- Train file creation: Uses all available historical data
Example Output
[INFO] Attempting to fill nulls in 4 columns: ['price', 'volume', 'rsi', 'macd']
[INFO] Successfully filled 7 null values from temp data
[INFO] Column 'price': 0 nulls remaining
[INFO] Column 'volume': 0 nulls remaining
[INFO] Column 'rsi': 0 nulls remaining
[INFO] Column 'macd': 0 nulls remaining
[OK] Added 15 new records from crypto_features.parquet to crypto_features.parquet, filled 7 null values
Performance Considerations
- Efficient Lookups: Uses dictionary-based lookups keyed by (symbol, interval_timestamp) for average-case O(1) access
- Memory Optimized: Processes data in chunks when possible
- Minimal Overhead: Only processes columns that actually have nulls
Future Enhancements
- Time-Window Filling: Fill with nearest timestamp if exact match not found
- Interpolation: Smart interpolation for numerical features
- Symbol Similarity: Fill using similar symbols when exact match unavailable
- Quality Scoring: Rank data sources by quality for better filling decisions
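As a rough idea of how the time-window item might look, pandas.merge_asof can match each row to the nearest timestamp of the same symbol within a tolerance. None of this exists in merge_temp.py today: nearest_fill_sketch is a hypothetical name, and the sketch assumes interval_timestamp is an integer (for datetime timestamps the tolerance would be a pd.Timedelta).

```python
import pandas as pd


def nearest_fill_sketch(
    df_merged: pd.DataFrame, df_temp: pd.DataFrame, column: str, tolerance: int
) -> pd.DataFrame:
    """Hypothetical sketch: fill nulls in `column` from the nearest same-symbol
    temp row whose timestamp is within `tolerance` of the merged row."""
    # merge_asof requires both sides to be sorted by the `on` key.
    left = df_merged.sort_values("interval_timestamp").reset_index(drop=True)
    right = (
        df_temp[["symbol", "interval_timestamp", column]]
        .dropna(subset=[column])  # only non-null values are useful candidates
        .sort_values("interval_timestamp")
        .rename(columns={column: "_candidate"})
    )

    matched = pd.merge_asof(
        left,
        right,
        on="interval_timestamp",
        by="symbol",            # never match across symbols
        direction="nearest",    # exact matches are included by default
        tolerance=tolerance,
    )
    matched[column] = matched[column].fillna(matched["_candidate"])
    return matched.drop(columns="_candidate")
```

Unlike the current exact-match behavior, this deliberately borrows values from nearby timestamps, so the tolerance would need to be chosen conservatively per feature.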
This enhanced merge system is designed to give your machine learning models the most complete data available from the existing sources while preserving the integrity and characteristics of each financial instrument.