advisorai-data-enhanced / src /merge /ENHANCED_MERGE_README.md

Maaroufabousaleh

c49b21b about 1 month ago

	# Enhanced Merge with Intelligent Null Filling

	## Overview

	The `merge_temp.py` module now fills null values by first looking for a non-null value recorded for the same `(symbol, interval_timestamp)` combination in another data source, and only then falls back to other strategies.

	## Key Features

	### 1. Symbol-First Null Filling Strategy

	When merging temp files into existing feature files, the system now:

	1. Identifies null values in the target (merged) dataset
	2. Searches for matching records in the source (temp) dataset using `(symbol, interval_timestamp)` as the key
	3. Fills null values only when:
	   - The same symbol + timestamp exists in the temp data
	   - The temp data has a non-null value for that column
	   - The column exists in both datasets
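The three conditions above can be sketched in a few lines. This is a minimal illustration only, not the module's actual code: it represents each dataset as a plain list of dicts rather than a DataFrame, and the name `fill_nulls_from_temp_sketch` and the sample rows are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Key = Tuple[str, int]


def fill_nulls_from_temp_sketch(merged: List[Row], temp: List[Row]) -> int:
    """Fill None values in `merged` in place from `temp`; return the fill count."""
    # Index temp rows by (symbol, interval_timestamp) for direct lookup.
    temp_by_key: Dict[Key, Row] = {
        (r["symbol"], r["interval_timestamp"]): r for r in temp
    }
    filled = 0
    for row in merged:
        source = temp_by_key.get((row["symbol"], row["interval_timestamp"]))
        if source is None:
            continue  # no temp row with the same symbol + timestamp
        for col, value in list(row.items()):
            # Fill only if the column exists in both rows and temp is non-null.
            if value is None and source.get(col) is not None:
                row[col] = source[col]
                filled += 1
    return filled


# Hypothetical sample data.
merged = [
    {"symbol": "BTC", "interval_timestamp": 1000, "price": None, "rsi": 55.0},
    {"symbol": "ETH", "interval_timestamp": 1000, "price": None, "rsi": None},
]
temp = [
    {"symbol": "BTC", "interval_timestamp": 1000, "price": 42000.0, "rsi": 60.0},
    {"symbol": "ETH", "interval_timestamp": 2000, "price": 3000.0, "rsi": 48.0},
]
filled_count = fill_nulls_from_temp_sketch(merged, temp)
```

Here only the BTC `price` is filled: the existing BTC `rsi` is non-null and is left alone, and the ETH row has no temp record at the same timestamp, so its nulls stay.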

	### 2. Cross-Dataset Null Filling

	During train file creation and merged features generation:

	1. Combines multiple sources (archive, features, temp files)
	2. Creates a lookup of all non-null values keyed by `(symbol, interval_timestamp)`
	3. Fills each null from that lookup, so a value available in any source can be used
	4. Preserves data integrity by only filling with values from the exact same symbol and time
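One way to picture the cross-source lookup is the sketch below. It is an assumption-laden illustration: the helper names `build_lookup` and `fill_from_lookup` are hypothetical, rows are plain dicts, and "best available" is interpreted here as "first non-null value in source order", which the README does not specify.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Key = Tuple[str, int]


def build_lookup(sources: Iterable[List[Row]]) -> Dict[Key, Row]:
    """Map each (symbol, interval_timestamp) to its known non-null column values."""
    lookup: Dict[Key, Row] = {}
    for rows in sources:  # assumption: earlier sources take priority
        for row in rows:
            key = (row["symbol"], row["interval_timestamp"])
            known = lookup.setdefault(key, {})
            for col, value in row.items():
                if value is not None and col not in known:
                    known[col] = value
    return lookup


def fill_from_lookup(rows: List[Row], lookup: Dict[Key, Row]) -> int:
    """Fill None values in `rows` in place; return how many were filled."""
    filled = 0
    for row in rows:
        known = lookup.get((row["symbol"], row["interval_timestamp"]), {})
        for col in list(row):
            if row[col] is None and col in known:
                row[col] = known[col]
                filled += 1
    return filled


# Hypothetical sample data for three sources.
archive = [{"symbol": "AAPL", "interval_timestamp": 1, "price": 190.0, "volume": None}]
features = [{"symbol": "AAPL", "interval_timestamp": 1, "price": None, "volume": None}]
temp = [{"symbol": "AAPL", "interval_timestamp": 1, "price": 191.0, "volume": 5000}]

lookup = build_lookup([archive, features, temp])
filled = fill_from_lookup(features, lookup)
```

In this example the `features` row takes `price` from the archive (the first source with a value) and `volume` from the temp data, and nothing is borrowed from a different symbol or timestamp.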

	### 3. Enhanced Functions

	#### `fill_nulls_from_temp(df_merged, df_temp)`
	- Fills null values in `df_merged` using data from `df_temp`
	- Only fills when exact `(symbol, interval_timestamp)` match exists
	- Returns count of null values filled
	- Provides detailed logging of the filling process

	#### `merge_temp_to_merged(temp_name, merged_name)`
	- Enhanced to perform null filling before adding new records
	- Reports both new records added and null values filled
	- Maintains existing functionality while adding intelligent null handling
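The ordering described for `merge_temp_to_merged` (fill nulls first, then add new records) might look like the following sketch. The real function takes file names; this hypothetical `merge_rows_sketch` works on in-memory lists of dicts purely to show the two steps and the two counts that get reported.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Key = Tuple[str, int]


def row_key(row: Row) -> Key:
    return (row["symbol"], row["interval_timestamp"])


def merge_rows_sketch(merged: List[Row], temp: List[Row]) -> Tuple[int, int]:
    """Update `merged` in place; return (records_added, nulls_filled)."""
    existing: Dict[Key, Row] = {row_key(r): r for r in merged}

    # Step 1: fill nulls in rows that already exist in the merged data.
    filled = 0
    for src in temp:
        target = existing.get(row_key(src))
        if target is None:
            continue
        for col, value in src.items():
            if value is not None and col in target and target[col] is None:
                target[col] = value
                filled += 1

    # Step 2: append temp rows whose key is not in the merged data yet.
    added = 0
    for src in temp:
        if row_key(src) not in existing:
            new_row = dict(src)
            merged.append(new_row)
            existing[row_key(src)] = new_row
            added += 1
    return added, filled


# Hypothetical sample data.
merged = [{"symbol": "BTC", "interval_timestamp": 1000, "price": 42000.0, "volume": None}]
temp = [
    {"symbol": "BTC", "interval_timestamp": 1000, "price": 42010.0, "volume": 12.5},
    {"symbol": "BTC", "interval_timestamp": 2000, "price": 42100.0, "volume": 9.0},
]
added, filled = merge_rows_sketch(merged, temp)
```

Note that the existing non-null `price` is not overwritten by the temp value; only the null `volume` is filled, and the row at timestamp 2000 is appended as a new record.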

	#### `merge_all_to_train()`
	- Cross-source null filling during train file creation
	- Combines archive, features, and temp data optimally
	- Eliminates duplicates while preserving the best available data
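As a rough sketch of deduplication that "preserves the best available data", the example below keeps, for each `(symbol, interval_timestamp)`, the row with the most non-null values. That definition of "best" is an assumption made for illustration, as is the helper name `dedupe_keep_most_complete`; ties keep the first row seen.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Key = Tuple[str, int]


def non_null_count(row: Row) -> int:
    return sum(value is not None for value in row.values())


def dedupe_keep_most_complete(rows: List[Row]) -> List[Row]:
    """Return one row per key, preferring the row with the most non-null values."""
    best: Dict[Key, Row] = {}
    for row in rows:
        key = (row["symbol"], row["interval_timestamp"])
        current = best.get(key)
        # Strictly greater, so the first row seen wins a tie.
        if current is None or non_null_count(row) > non_null_count(current):
            best[key] = row
    return list(best.values())


# Hypothetical combined rows from archive, features, and temp data.
combined = [
    {"symbol": "BTC", "interval_timestamp": 1, "price": None, "rsi": None},
    {"symbol": "BTC", "interval_timestamp": 1, "price": 1.0, "rsi": None},
    {"symbol": "ETH", "interval_timestamp": 1, "price": 2.0, "rsi": 50.0},
]
deduped = dedupe_keep_most_complete(combined)
```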

	#### `create_merged_features()`
	- Creates the main `merged_features.parquet` file
	- Combines crypto and stock features with cross-dataset null filling
	- Provides comprehensive statistics on the merge process

	## Benefits

	### 🎯 Data Quality Improvements
	- Preserves Symbol Characteristics: Uses same-symbol data to fill nulls
	- Temporal Consistency: Only uses data from the exact same timestamp
	- No Data Pollution: Never mixes data from different symbols or times

	### 📊 Better Coverage
	- Reduced Null Values: Significantly fewer missing values in final datasets
	- Multi-Source Integration: Leverages all available data sources
	- Smart Deduplication: Keeps the best version of each record

	### 🔧 Robust Processing
	- Error Handling: Graceful handling of missing files and edge cases
	- Detailed Logging: Clear reporting of what was filled and why
	- Validation: Built-in checks to ensure data integrity

	## Usage Examples

	### Test the Null Filling
	```bash
	cd src/merge
	python merge_temp.py --test-null-filling
	```

	### Run Normal Merge Process
	```bash
	cd src/merge
	python merge_temp.py
	```

	### Manual Testing
	```bash
	cd src/merge
	python test_null_filling_merge.py
	```

	## Integration with Main Pipeline

	The enhanced merge functionality is automatically integrated into the main pipeline:

	1. After data collection: Temp files are created with new data
	2. During `merge_temp.py`: Null filling happens automatically
	3. Before normalization: Data is as complete as possible
	4. Train file creation: Uses all available historical data

	## Example Output

	```
	[INFO] Attempting to fill nulls in 4 columns: ['price', 'volume', 'rsi', 'macd']
	[INFO] Successfully filled 7 null values from temp data
	[INFO] Column 'price': 0 nulls remaining
	[INFO] Column 'volume': 0 nulls remaining
	[INFO] Column 'rsi': 0 nulls remaining
	[INFO] Column 'macd': 0 nulls remaining
	[OK] Added 15 new records from crypto_features.parquet to crypto_features.parquet, filled 7 null values
	```

	## Performance Considerations

	- Efficient Lookups: Uses dictionary-based lookups for average-case O(1) access
	- Memory Optimized: Processes data in chunks when possible
	- Minimal Overhead: Only processes columns that actually have nulls
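The "only columns that actually have nulls" idea can be illustrated with a small pre-check. This is a sketch under the assumption that rows are plain dicts; the helper `columns_with_nulls` is hypothetical and not part of `merge_temp.py`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def columns_with_nulls(rows: List[Row]) -> List[str]:
    """Return the sorted names of columns that contain at least one None."""
    return sorted(
        {col for row in rows for col, value in row.items() if value is None}
    )


# Hypothetical sample data.
rows = [
    {"symbol": "BTC", "interval_timestamp": 1000, "price": None, "macd": 0.4},
    {"symbol": "ETH", "interval_timestamp": 1000, "price": 3000.0, "macd": None},
]
candidates = columns_with_nulls(rows)
```

A fill routine can then iterate over `candidates` only, and skip the merge step entirely when the list is empty.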

	## Future Enhancements

	- Time-Window Filling: Fill with nearest timestamp if exact match not found
	- Interpolation: Smart interpolation for numerical features
	- Symbol Similarity: Fill using similar symbols when exact match unavailable
	- Quality Scoring: Rank data sources by quality for better filling decisions

	By filling nulls only from records with the same symbol and timestamp, this merge process gives downstream machine learning models more complete data without mixing values across financial instruments.