Enhanced Merge with Intelligent Null Filling

Overview

The `merge_temp.py` module now fills null values intelligently: it first looks for a value with the same (symbol, interval_timestamp) key in the other data sources, and only falls back to other strategies when no such value exists.

Key Features

1. Symbol-First Null Filling Strategy

When merging temp files into existing features, the system now:

- Identifies null values in the target (merged) dataset
- Searches for matching records in the source (temp) dataset using (symbol, interval_timestamp) as the key
- Fills null values only when all of the following hold:
  - The same symbol + timestamp exists in the temp data
  - The temp data has a non-null value for that column
  - The column exists in both datasets
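
The three conditions above can be sketched with pandas as follows. This is a minimal illustration, not the code in `merge_temp.py`: the name `fill_nulls_from_temp_sketch` and the choice to keep the last temp row when a key repeats are assumptions made for the example.

```python
import pandas as pd

KEY = ["symbol", "interval_timestamp"]

def fill_nulls_from_temp_sketch(df_merged: pd.DataFrame, df_temp: pd.DataFrame) -> int:
    """Fill nulls in df_merged in place and return the number of cells filled."""
    # Only columns present in both frames are candidates for filling.
    shared = [c for c in df_merged.columns if c in df_temp.columns and c not in KEY]
    # One temp row per key; keeping the last duplicate is an assumption.
    temp = df_temp.drop_duplicates(subset=KEY, keep="last").set_index(KEY)
    # Align temp rows to df_merged rows by exact (symbol, interval_timestamp) match.
    aligned = temp.reindex(pd.MultiIndex.from_frame(df_merged[KEY]))
    filled = 0
    for col in shared:
        candidate = pd.Series(aligned[col].to_numpy(), index=df_merged.index)
        # Fill only where the target is null and the temp value is not.
        mask = df_merged[col].isna() & candidate.notna()
        df_merged.loc[mask, col] = candidate[mask]
        filled += int(mask.sum())
    return filled
```

Rows whose key is missing from the temp data, and columns that exist in only one of the two frames, are left untouched.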

2. Cross-Dataset Null Filling

During train file creation and merged features generation:

- Combines multiple sources (archive, features, temp files)
- Creates a comprehensive lookup of all non-null values by (symbol, timestamp)
- Fills nulls intelligently using the best available data from any source
- Preserves data integrity by only filling with values from the exact same symbol and time
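
One way to picture the cross-source lookup is a group-by over the concatenated sources. The sketch below assumes that "best available" means "first non-null value in source priority order"; that rule and the function name `fill_across_sources` are illustrative assumptions, not taken from the module.

```python
import pandas as pd

KEY = ["symbol", "interval_timestamp"]

def fill_across_sources(sources: list) -> pd.DataFrame:
    """Collapse several frames into one row per key, filling nulls across sources.

    `sources` is a list of DataFrames ordered by priority: earlier frames win
    when more than one has a non-null value for the same key and column.
    """
    combined = pd.concat(sources, ignore_index=True)
    # GroupBy.first() returns the first non-null value per column in each group,
    # so a value is only ever taken from a row with the same symbol and timestamp.
    best = combined.groupby(KEY, sort=False).first()
    return best.reset_index()
```

Note that `groupby` drops rows whose key contains a null by default, so records without a symbol or timestamp would not survive this step.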

3. Enhanced Functions

`fill_nulls_from_temp(df_merged, df_temp)`

- Fills null values in df_merged using data from df_temp
- Only fills when an exact (symbol, interval_timestamp) match exists
- Returns the count of null values filled
- Provides detailed logging of the filling process

`merge_temp_to_merged(temp_name, merged_name)`

- Performs null filling before adding new records
- Reports both the new records added and the null values filled
- Maintains existing functionality while adding intelligent null handling
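
The "adding new records" step that follows null filling can be sketched as an anti-join on the key. The helper below is hypothetical (the name `append_new_records` and its return shape are assumptions); it only shows how rows whose (symbol, interval_timestamp) key is not yet present might be detected, appended, and counted.

```python
import numpy as np
import pandas as pd

KEY = ["symbol", "interval_timestamp"]

def append_new_records(df_merged: pd.DataFrame, df_temp: pd.DataFrame):
    """Return (combined frame, number of temp rows that were appended)."""
    # Keys already present in the merged data, as plain tuples.
    existing = set(df_merged[KEY].itertuples(index=False, name=None))
    # A temp row is new when its key is not in the merged data yet.
    is_new = np.array(
        [key not in existing for key in df_temp[KEY].itertuples(index=False, name=None)],
        dtype=bool,
    )
    combined = pd.concat([df_merged, df_temp[is_new]], ignore_index=True)
    return combined, int(is_new.sum())
```

Running the null fill first means that temp rows with an existing key contribute their values to the merged rows, while only genuinely new keys are appended here.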

`merge_all_to_train()`

- Performs cross-source null filling during train file creation
- Combines archive, features, and temp data
- Eliminates duplicates while preserving the best available data
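
The README does not define what "best" means when duplicates are eliminated. As one plausible reading, the sketch below keeps, for each key, the row with the most non-null cells. Both that scoring rule and the name `dedupe_keep_most_complete` are assumptions for illustration only.

```python
import pandas as pd

KEY = ["symbol", "interval_timestamp"]

def dedupe_keep_most_complete(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per key, preferring the row with the most non-null cells."""
    # Completeness score: number of non-null cells in each row (assumed rule).
    score = df.notna().sum(axis=1)
    ordered = df.assign(_score=score).sort_values(
        "_score", ascending=False, kind="mergesort"
    )
    # After sorting, the first row seen for each key is the most complete one.
    deduped = ordered.drop_duplicates(subset=KEY, keep="first")
    # Restore the original row order and drop the helper column.
    return deduped.drop(columns="_score").sort_index().reset_index(drop=True)
```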

`create_merged_features()`

- Creates the main merged_features.parquet file
- Combines crypto and stock features with cross-dataset null filling
- Provides comprehensive statistics on the merge process

Benefits

🎯 Data Quality Improvements

- Preserves Symbol Characteristics: Uses same-symbol data to fill nulls
- Temporal Consistency: Only uses data from the exact same timestamp
- No Data Pollution: Never mixes data from different symbols or times

📊 Better Coverage

- Reduced Null Values: Fewer missing values in the final datasets
- Multi-Source Integration: Leverages all available data sources
- Smart Deduplication: Keeps the best version of each record

🔧 Robust Processing

- Error Handling: Graceful handling of missing files and edge cases
- Detailed Logging: Clear reporting of what was filled and why
- Validation: Built-in checks to ensure data integrity

Usage Examples

Test the Null Filling

cd src/merge
python merge_temp.py --test-null-filling

Run Normal Merge Process

cd src/merge
python merge_temp.py

Manual Testing

cd src/merge
python test_null_filling_merge.py

Integration with Main Pipeline

The enhanced merge functionality is automatically integrated into the main pipeline:

- After data collection: Temp files are created with new data
- During merge_temp.py: Null filling happens automatically
- Before normalization: Data is as complete as possible
- Train file creation: Uses all available historical data

Example Output

[INFO] Attempting to fill nulls in 4 columns: ['price', 'volume', 'rsi', 'macd']
[INFO] Successfully filled 7 null values from temp data
[INFO] Column 'price': 0 nulls remaining
[INFO] Column 'volume': 0 nulls remaining
[INFO] Column 'rsi': 0 nulls remaining
[INFO] Column 'macd': 0 nulls remaining
[OK] Added 15 new records from crypto_features.parquet to crypto_features.parquet, filled 7 null values

Performance Considerations

- Efficient Lookups: Uses dictionary-based lookups keyed by (symbol, interval_timestamp) for average O(1) access
- Memory Optimized: Processes data in chunks when possible
- Minimal Overhead: Only processes columns that actually have nulls
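
To make the first and last points concrete, here is a small sketch of a dictionary lookup keyed by (symbol, interval_timestamp), together with a helper that restricts work to columns containing nulls. The function names `build_lookup` and `columns_with_nulls` are hypothetical and not taken from `merge_temp.py`.

```python
import pandas as pd

def build_lookup(df_temp: pd.DataFrame, column: str) -> dict:
    """Map (symbol, interval_timestamp) to the non-null value of `column`."""
    # Null values are excluded up front so a hit always yields usable data.
    rows = df_temp[df_temp[column].notna()]
    keys = zip(rows["symbol"], rows["interval_timestamp"])
    return dict(zip(keys, rows[column]))

def columns_with_nulls(df: pd.DataFrame) -> list:
    """Return only the columns that contain at least one null."""
    return [c for c in df.columns if df[c].isna().any()]
```

A caller would build one lookup per column returned by `columns_with_nulls(df_merged)` and then use `lookup.get((symbol, interval_timestamp))` for each null cell, which is a constant-time dictionary access on average.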

Future Enhancements

- Time-Window Filling: Fill with the nearest timestamp if an exact match is not found
- Interpolation: Smart interpolation for numerical features
- Symbol Similarity: Fill using similar symbols when an exact match is unavailable
- Quality Scoring: Rank data sources by quality for better filling decisions

This enhanced merge system aims to give your machine learning models the most complete data available from the existing sources, while never mixing values across symbols or timestamps, so the characteristics of each financial instrument are preserved.